CONNECTORS-618 Karl
On Mon, Jan 21, 2013 at 9:08 AM, Karl Wright <[email protected]> wrote: > Bad news, I am afraid. MySQL seems to always put null values at the > front of the index, and that cannot be changed through any means I can > find. This is different from all other databases I know of. > > The only possible fixes for the problem are as follows: > > (1) Not use a null doc priority but instead use an actual special > number that is guaranteed to always sort to the end. This would be > non-trivial because this column is a FLOAT value, and round-off errors > will prevent the ManifoldCF code from reliably using a special number > like that on all databases. > > (2) Use docpriorities that are ordered in the opposite way - which > would work ONLY for MySQL and would break all other databases. > > I'll create a ticket and think about the problem some more. > > Karl > > On Mon, Jan 21, 2013 at 8:48 AM, Karl Wright <[email protected]> wrote: >> Hi Shigeki, >> >> I reviewed the code in detail. At the time CONNECTORS-290 was fixed, >> all document priorities were set to null whenever a job was paused or >> aborted, so what I suspected might be the problem cannot in fact >> happen. >> >> The most likely possible explanation for MySQL's behavior, therefore, >> is that MySQL orders null docpriority values BEFORE all other rows in >> the index it is using for queue stuffing. I have no other way of >> explaining why it thinks it needs to go through 6.5 million rows >> before it gets to the ones that are active. >> >> If this is the case, it may be possible to tell MySQL to order null >> column values to the END instead of the beginning of the index. I'll >> do some research on this later and get back to you. >> >> Thanks, >> Karl >> >> >> >> On Mon, Jan 21, 2013 at 6:21 AM, Karl Wright <[email protected]> wrote: >>> Are there any large paused or aborted jobs present on the same >>> ManifoldCF? If so, can you tell me whether the job is paused, or >>> aborted? (I am betting paused...) >>> >>> Karl >>> >>> On Mon, Jan 21, 2013 at 5:59 AM, Shigeki Kobayashi >>> <[email protected]> wrote: >>>> Hi Karl, >>>> >>>> >>>> Here is the explain. There isn't such sort... >>>> >>>> mysql> explain SELECT >>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset >>>> FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0 IN ('P','G') AND >>>> t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT 'x' >>>> FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND >>>> t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE >>>> t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND >>>> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events >>>> t4 >>>> WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC >>>> LIMIT 4800; >>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+ >>>> | id | select_type | table | type | possible_keys >>>> | key | key_len | ref | rows | Extra | >>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+ >>>> | 1 | PRIMARY | t0 | index | NULL >>>> | I1358228295210 | 25 | NULL | 4800 | Using where | >>>> | 4 | DEPENDENT SUBQUERY | t3 | ref | I1358228295216 >>>> | I1358228295216 | 8 | manifoldcf.t0.id | 1 | | >>>> | 4 | DEPENDENT SUBQUERY | t4 | eq_ref | PRIMARY >>>> | PRIMARY | 767 | manifoldcf.t3.eventname | 1 | Using index | >>>> | 3 | DEPENDENT SUBQUERY | t2 | ref | >>>> I1358228295209,I1358228295212,I1358228295211 | I1358228295209 | 122 | >>>> manifoldcf.t0.dochash | 1 | Using where | >>>> | 2 | DEPENDENT SUBQUERY | t1 | eq_ref | PRIMARY,I1358228295219 >>>> | PRIMARY | 8 | manifoldcf.t0.jobid | 1 | Using where | >>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+ >>>> 5 rows in set (0.00 sec) >>>> >>>> >>>> Regards, >>>> >>>> >>>> Shigeki >>>> >>>> >>>> >>>> 2013/1/21 Karl Wright <[email protected]> >>>>> >>>>> takes too long >>>>> MIME-Version: 1.0 >>>>> Content-Type: multipart/alternative; boundary=14dae9399bd9676e5704d3c356e9 >>>>> >>>>> --14dae9399bd9676e5704d3c356e9 >>>>> Content-Type: text/plain; charset="utf-8" >>>>> Content-Transfer-Encoding: 7bit >>>>> >>>>> Can you get an EXPLAIN for this query? It sounds like it is >>>>> disregarding the hint for some reason. >>>>> >>>>> Karl >>>>> >>>>> Sent from my Windows Phone >>>>> From: Shigeki Kobayashi >>>>> Sent: 1/20/2013 9:37 PM >>>>> To: [email protected] >>>>> Subject: Re: Crawling new/updated files using Windows share connection >>>>> takes too long >>>>> Hi Karl. >>>>> >>>>> I configured MySQL 5.5 to run MCF this time. >>>>> The version of MCF is trunk 1.1dev downloaded on Dec, 12th. , which you >>>>> fixed >>>>> the slow query using "FORCE INDEX". Solr is 4.0 >>>>> >>>>> I thought is was fixed but the log shows that the following are slow >>>>> queries. >>>>> ------------------------------------------------------------------- >>>>> # Time: 130120 11:41:10 >>>>> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1] >>>>> # Query_time: 8.761087 Lock_time: 0.000163 Rows_sent: 17 Rows_examined: >>>>> 6365233 >>>>> SET timestamp=1358649670; >>>>> SELECT >>>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset >>>>> FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G') >>>>> AND t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT >>>>> 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND >>>>> t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE >>>>> t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND >>>>> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events >>>>> t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority >>>>> ASC LIMIT 4800; >>>>> >>>>> # Time: 130120 11:41:18 >>>>> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1] >>>>> # Query_time: 7.714277 Lock_time: 0.000123 Rows_sent: 0 Rows_examined: >>>>> 6365182 >>>>> SET timestamp=1358649678; >>>>> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE INDEX >>>>> (i1358228295210) WHERE status IN ('P','G') AND checkaction='R' AND >>>>> checktime<=1358649661663 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE >>>>> t1.status >>>>> IN ('A','a') AND t1.id=t0.jobid) ORDER BY docpriority ASC LIMIT 1; >>>>> >>>>> Regards, >>>>> >>>>> >>>>> Shigeki >>>>> >>>>> >>>>> >>>>> 2013/1/18 Karl Wright <[email protected]> >>>>> >>>>> > Hi Shigeki, >>>>> > >>>>> > What database is ManifoldCF configured to use in this case? Do you >>>>> > see any indication of slow queries in the ManifoldCF log? >>>>> > >>>>> > >>>>> > Karl >>>>> > >>>>> > On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi >>>>> > <[email protected]> wrote: >>>>> > > Hello >>>>> > > >>>>> > > >>>>> > > I would like some advice to improve crawling time of new/updated files >>>>> > using >>>>> > > Windows share connection. >>>>> > > >>>>> > > I crawl file in Windows server and index them into Solr. >>>>> > > >>>>> > > Currently, the second crawling of two hundred thousands files takes >>>>> > over 5 >>>>> > > hours, even though any files are not updated, created, deleted. >>>>> > > >>>>> > > I assume MCF does the following processes (let me know if I am wrong) >>>>> > > >>>>> > > - obtain updated time of a file >>>>> > > - compare the updated time with the one MCF obtained last time >>>>> > > crawling( >>>>> > > probably stored in DB) >>>>> > > - if they are different MCF recognizes the file is to be indexed. >>>>> > > >>>>> > > If the above processes are done for two thousands files, what part of >>>>> > > the >>>>> > > processes could take time the most? obtaining updated time? reading >>>>> > > data >>>>> > > from DB? what could be done to increase the crawling time do you >>>>> > > think? >>>>> > > >>>>> > > Please give me some advice. >>>>> > > >>>>> > > >>>>> > > Regards, >>>>> > > >>>>> > > Shigeki >>>>> > > >>>>> > > >>>>> > >>>>> >>>>> --14dae9399bd9676e5704d3c356e9 >>>>> Content-Type: text/html; charset="utf-8" >>>>> Content-Transfer-Encoding: quoted-printable >>>>> >>>>> <html><head><meta content=3D"text/html; charset=3Dutf-8" >>>>> http-equiv=3D"Cont= >>>>> ent-Type"></head><body><div><div style=3D"font-family: Calibri,sans-serif; >>>>> = >>>>> font-size: 11pt;"><br>Can you get an EXPLAIN for this query? It >>>>> sound= >>>>> s like it is disregarding the hint for some >>>>> reason.<br><br>Karl<br><br>Sent= >>>>> from my Windows Phone<br></div></div><hr><span style=3D"font-family: >>>>> Tahom= >>>>> a,sans-serif; font-size: 10pt; font-weight: bold;">From: </span><span >>>>> style= >>>>> =3D"font-family: Tahoma,sans-serif; font-size: 10pt;">Shigeki >>>>> Kobayashi</sp= >>>>> an><br><span style=3D"font-family: Tahoma,sans-serif; font-size: 10pt; >>>>> font= >>>>> -weight: bold;">Sent: </span><span style=3D"font-family: >>>>> Tahoma,sans-serif;= >>>>> font-size: 10pt;">1/20/2013 9:37 PM</span><br><span style=3D"font-family: >>>>> = >>>>> Tahoma,sans-serif; font-size: 10pt; font-weight: bold;">To: </span><span >>>>> st= >>>>> yle=3D"font-family: Tahoma,sans-serif; font-size: >>>>> 10pt;">[email protected]= >>>>> ache.org</span><br><span style=3D"font-family: Tahoma,sans-serif; >>>>> font-size= >>>>> : 10pt; font-weight: bold;">Subject: </span><span style=3D"font-family: >>>>> Tah= >>>>> oma,sans-serif; font-size: 10pt;">Re: Crawling new/updated files using >>>>> Wind= >>>>> ows share connection takes too long</span><br><br></body></html><div >>>>> dir=3D= >>>>> "ltr">Hi Karl.<div><br></div><div>I configured MySQL 5.5 to run MCF this >>>>> ti= >>>>> me.</div><div>The version of MCF is trunk 1.1dev downloaded on Dec, 12th. >>>>> ,= >>>>> which you fixed</div><div style>the slow query using "FORCE >>>>> INDEX&quo= >>>>> t;. Solr is 4.0</div> >>>>> >>>>> <div style><br></div><div style>I thought is was fixed but the log shows >>>>> th= >>>>> at =C2=A0the following are slow queries.=C2=A0</div><div><div >>>>> class=3D"gmai= >>>>> >>>>> l_extra">------------------------------------------------------------------= >>>>> -</div> >>>>> <div class=3D"gmail_extra"> >>>>> <div class=3D"gmail_extra"># Time: 130120 11:41:10</div><div >>>>> class=3D"gmail= >>>>> _extra"># User@Host: manifoldcf[manifoldcf] @ localhost >>>>> [127.0.0.1]</div><d= >>>>> iv class=3D"gmail_extra"># Query_time: 8.761087 =C2=A0Lock_time: 0.000163 >>>>> R= >>>>> ows_sent: 17 =C2=A0Rows_examined: 6365233</div> >>>>> >>>>> <div class=3D"gmail_extra">SET timestamp=3D1358649670;</div><div >>>>> class=3D"g= >>>>> mail_extra">SELECT <a >>>>> href=3D"http://t0.id">t0.id</a>,t0.jobid,t0.dochash,t= >>>>> 0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0 >>>>> = >>>>> FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G') >>>>> A= >>>>> ND t0.checkaction=3D'R' AND t0.checktime<=3D1358649661663 AND >>>>> EX= >>>>> ISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN >>>>> ('A','a= >>>>> ') AND <a href=3D"http://t1.id">t1.id</a>=3Dt0.jobid AND >>>>> t1.priority=3D= >>>>> 5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE >>>>> t2.dochash=3Dt0= >>>>> .dochash AND t2.status IN >>>>> ('A','F','a','f',= >>>>> 'D','d') AND t2.jobid!=3Dt0.jobid) AND NOT EXISTS(SELECT >>>>> &#= >>>>> 39;x' FROM prereqevents t3,events t4 WHERE <a >>>>> href=3D"http://t0.id">t0.= >>>>> id</a>=3Dt3.owner AND t3.eventname=3D<a >>>>> href=3D"http://t4.name">t4.name</a>= >>>>> ) ORDER BY t0.docpriority ASC LIMIT 4800;</div> >>>>> >>>>> <div class=3D"gmail_extra"><br></div><div class=3D"gmail_extra"># Time: >>>>> 130= >>>>> 120 11:41:18<br></div><div class=3D"gmail_extra"># User@Host: >>>>> manifoldcf[ma= >>>>> nifoldcf] @ localhost [127.0.0.1]<br></div><div class=3D"gmail_extra"># >>>>> Que= >>>>> ry_time: 7.714277 =C2=A0Lock_time: 0.000123 Rows_sent: 0 >>>>> =C2=A0Rows_examine= >>>>> d: 6365182</div> >>>>> >>>>> <div class=3D"gmail_extra">SET timestamp=3D1358649678;</div><div >>>>> class=3D"g= >>>>> mail_extra">SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE >>>>> I= >>>>> NDEX (i1358228295210) WHERE status IN ('P','G') AND >>>>> checkac= >>>>> tion=3D'R' AND checktime<=3D1358649661663 AND EXISTS(SELECT >>>>> '= >>>>> ;x' FROM jobs t1 WHERE t1.status IN ('A','a') AND <a >>>>> hr= >>>>> ef=3D"http://t1.id">t1.id</a>=3Dt0.jobid) =C2=A0ORDER BY docpriority ASC >>>>> LI= >>>>> MIT 1;</div> >>>>> >>>>> <div><br></div><div class=3D"gmail_extra" style>Regards,</div><div >>>>> class=3D= >>>>> "gmail_extra"><br></div><div class=3D"gmail_extra"><br></div><div >>>>> class=3D"= >>>>> gmail_extra" style>Shigeki</div><div class=3D"gmail_extra"><br></div><div >>>>> c= >>>>> lass=3D"gmail_extra"> >>>>> >>>>> <br></div><br><div class=3D"gmail_quote">2013/1/18 Karl Wright <span >>>>> dir=3D= >>>>> "ltr"><<a href=3D"mailto:[email protected]" >>>>> target=3D"_blank">daddywri@= >>>>> gmail.com</a>></span><br><blockquote class=3D"gmail_quote" >>>>> style=3D"marg= >>>>> in:0px 0px 0px >>>>> 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,20= >>>>> 4);border-left-style:solid;padding-left:1ex"> >>>>> >>>>> Hi Shigeki,<br> >>>>> <br> >>>>> What database is ManifoldCF configured to use in this case? =C2=A0Do >>>>> you<br= >>>>> > >>>>> see any indication of slow queries in the ManifoldCF log?<br> >>>>> <br> >>>>> <br> >>>>> Karl<br> >>>>> <br> >>>>> On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi<br> >>>>> <<a >>>>> href=3D"mailto:[email protected]">shigeki.kobayash= >>>>> [email protected]</a>> wrote:<br> >>>>> > Hello<br> >>>>> ><br> >>>>> ><br> >>>>> > I would like some advice to improve crawling time of new/updated >>>>> files= >>>>> using<br> >>>>> > Windows share connection.<br> >>>>> ><br> >>>>> > I crawl file in Windows server and index them into Solr.<br> >>>>> ><br> >>>>> > Currently, the second crawling of two hundred thousands files takes = >>>>> =C2=A0over 5<br> >>>>> > hours, even though any files are not updated, created, deleted.<br> >>>>> ><br> >>>>> > I assume MCF does the following processes (let me know if I am >>>>> wrong)<= >>>>> br> >>>>> ><br> >>>>> > - obtain updated time of a file<br> >>>>> > - compare the updated time with the one MCF obtained last time >>>>> crawlin= >>>>> g(<br> >>>>> > probably stored in DB)<br> >>>>> > - if they are different MCF recognizes the file is to be indexed.<br> >>>>> ><br> >>>>> > If the above processes are done for two thousands files, what part of >>>>> = >>>>> the<br> >>>>> > processes could take time the most? obtaining updated time? reading >>>>> da= >>>>> ta<br> >>>>> > from DB? what could be done to increase the crawling time do you >>>>> think= >>>>> ?<br> >>>>> ><br> >>>>> > Please give me some advice.<br> >>>>> ><br> >>>>> ><br> >>>>> > Regards,<br> >>>>> ><br> >>>>> > Shigeki<br> >>>>> ><br> >>>>> ><br> >>>>> </blockquote></div><br><br clear=3D"all"><div><br></div><br><div><font >>>>> face= >>>>> =3D"'ms gothic', monospace">=C2=A0</font></div><div><br></div> >>>>> </div></div></div> >>>>> >>>>> --14dae9399bd9676e5704d3c356e9-- >>>> >>>> >>>> >>>> >>>>
