I checked a fix for this into trunk. Please sync up with trunk and see if this fixes your problem. If it does, I will gladly include the fix in MCF 1.1.
Karl On Mon, Jan 21, 2013 at 9:14 AM, Karl Wright <[email protected]> wrote: > CONNECTORS-618 > > Karl > > On Mon, Jan 21, 2013 at 9:08 AM, Karl Wright <[email protected]> wrote: >> Bad news, I am afraid. MySQL seems to always put null values at the >> front of the index, and that cannot be changed through any means I can >> find. This is different from all other databases I know of. >> >> The only possible fixes for the problem are as follows: >> >> (1) Not use a null doc priority but instead use an actual special >> number that is guaranteed to always sort to the end. This would be >> non-trivial because this column is a FLOAT value, and round-off errors >> will prevent the ManifoldCF code from reliably using a special number >> like that on all databases. >> >> (2) Use docpriorities that are ordered in the opposite way - which >> would work ONLY for MySQL and would break all other databases. >> >> I'll create a ticket and think about the problem some more. >> >> Karl >> >> On Mon, Jan 21, 2013 at 8:48 AM, Karl Wright <[email protected]> wrote: >>> Hi Shigeki, >>> >>> I reviewed the code in detail. At the time CONNECTORS-290 was fixed, >>> all document priorities were set to null whenever a job was paused or >>> aborted, so what I suspected might be the problem cannot in fact >>> happen. >>> >>> The most likely possible explanation for MySQL's behavior, therefore, >>> is that MySQL orders null docpriority values BEFORE all other rows in >>> the index it is using for queue stuffing. I have no other way of >>> explaining why it thinks it needs to go through 6.5 million rows >>> before it gets to the ones that are active. >>> >>> If this is the case, it may be possible to tell MySQL to order null >>> column values to the END instead of the beginning of the index. I'll >>> do some research on this later and get back to you. >>> >>> Thanks, >>> Karl >>> >>> >>> >>> On Mon, Jan 21, 2013 at 6:21 AM, Karl Wright <[email protected]> wrote: >>>> Are there any large paused or aborted jobs present on the same >>>> ManifoldCF? If so, can you tell me whether the job is paused, or >>>> aborted? (I am betting paused...) >>>> >>>> Karl >>>> >>>> On Mon, Jan 21, 2013 at 5:59 AM, Shigeki Kobayashi >>>> <[email protected]> wrote: >>>>> Hi Karl, >>>>> >>>>> >>>>> Here is the explain. There isn't such sort... >>>>> >>>>> mysql> explain SELECT >>>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset >>>>> FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0 IN ('P','G') AND >>>>> t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT 'x' >>>>> FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND >>>>> t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE >>>>> t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND >>>>> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events >>>>> t4 >>>>> WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC >>>>> LIMIT 4800; >>>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+ >>>>> | id | select_type | table | type | possible_keys >>>>> | key | key_len | ref | rows | Extra >>>>> | >>>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+ >>>>> | 1 | PRIMARY | t0 | index | NULL >>>>> | I1358228295210 | 25 | NULL | 4800 | Using where >>>>> | >>>>> | 4 | DEPENDENT SUBQUERY | t3 | ref | I1358228295216 >>>>> | I1358228295216 | 8 | manifoldcf.t0.id | 1 | >>>>> | >>>>> | 4 | DEPENDENT SUBQUERY | t4 | eq_ref | PRIMARY >>>>> | PRIMARY | 767 | manifoldcf.t3.eventname | 1 | Using index >>>>> | >>>>> | 3 | DEPENDENT SUBQUERY | t2 | ref | >>>>> I1358228295209,I1358228295212,I1358228295211 | I1358228295209 | 122 | >>>>> manifoldcf.t0.dochash | 1 | Using where | >>>>> | 2 | DEPENDENT SUBQUERY | t1 | eq_ref | PRIMARY,I1358228295219 >>>>> | PRIMARY | 8 | manifoldcf.t0.jobid | 1 | Using where >>>>> | >>>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+ >>>>> 5 rows in set (0.00 sec) >>>>> >>>>> >>>>> Regards, >>>>> >>>>> >>>>> Shigeki >>>>> >>>>> >>>>> >>>>> 2013/1/21 Karl Wright <[email protected]> >>>>>> >>>>>> takes too long >>>>>> MIME-Version: 1.0 >>>>>> Content-Type: multipart/alternative; >>>>>> boundary=14dae9399bd9676e5704d3c356e9 >>>>>> >>>>>> --14dae9399bd9676e5704d3c356e9 >>>>>> Content-Type: text/plain; charset="utf-8" >>>>>> Content-Transfer-Encoding: 7bit >>>>>> >>>>>> Can you get an EXPLAIN for this query? It sounds like it is >>>>>> disregarding the hint for some reason. >>>>>> >>>>>> Karl >>>>>> >>>>>> Sent from my Windows Phone >>>>>> From: Shigeki Kobayashi >>>>>> Sent: 1/20/2013 9:37 PM >>>>>> To: [email protected] >>>>>> Subject: Re: Crawling new/updated files using Windows share connection >>>>>> takes too long >>>>>> Hi Karl. >>>>>> >>>>>> I configured MySQL 5.5 to run MCF this time. >>>>>> The version of MCF is trunk 1.1dev downloaded on Dec, 12th. , which you >>>>>> fixed >>>>>> the slow query using "FORCE INDEX". Solr is 4.0 >>>>>> >>>>>> I thought is was fixed but the log shows that the following are slow >>>>>> queries. >>>>>> ------------------------------------------------------------------- >>>>>> # Time: 130120 11:41:10 >>>>>> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1] >>>>>> # Query_time: 8.761087 Lock_time: 0.000163 Rows_sent: 17 Rows_examined: >>>>>> 6365233 >>>>>> SET timestamp=1358649670; >>>>>> SELECT >>>>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset >>>>>> FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN >>>>>> ('P','G') >>>>>> AND t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT >>>>>> 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND >>>>>> t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE >>>>>> t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND >>>>>> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events >>>>>> t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority >>>>>> ASC LIMIT 4800; >>>>>> >>>>>> # Time: 130120 11:41:18 >>>>>> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1] >>>>>> # Query_time: 7.714277 Lock_time: 0.000123 Rows_sent: 0 Rows_examined: >>>>>> 6365182 >>>>>> SET timestamp=1358649678; >>>>>> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE INDEX >>>>>> (i1358228295210) WHERE status IN ('P','G') AND checkaction='R' AND >>>>>> checktime<=1358649661663 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE >>>>>> t1.status >>>>>> IN ('A','a') AND t1.id=t0.jobid) ORDER BY docpriority ASC LIMIT 1; >>>>>> >>>>>> Regards, >>>>>> >>>>>> >>>>>> Shigeki >>>>>> >>>>>> >>>>>> >>>>>> 2013/1/18 Karl Wright <[email protected]> >>>>>> >>>>>> > Hi Shigeki, >>>>>> > >>>>>> > What database is ManifoldCF configured to use in this case? Do you >>>>>> > see any indication of slow queries in the ManifoldCF log? >>>>>> > >>>>>> > >>>>>> > Karl >>>>>> > >>>>>> > On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi >>>>>> > <[email protected]> wrote: >>>>>> > > Hello >>>>>> > > >>>>>> > > >>>>>> > > I would like some advice to improve crawling time of new/updated >>>>>> > > files >>>>>> > using >>>>>> > > Windows share connection. >>>>>> > > >>>>>> > > I crawl file in Windows server and index them into Solr. >>>>>> > > >>>>>> > > Currently, the second crawling of two hundred thousands files takes >>>>>> > over 5 >>>>>> > > hours, even though any files are not updated, created, deleted. >>>>>> > > >>>>>> > > I assume MCF does the following processes (let me know if I am wrong) >>>>>> > > >>>>>> > > - obtain updated time of a file >>>>>> > > - compare the updated time with the one MCF obtained last time >>>>>> > > crawling( >>>>>> > > probably stored in DB) >>>>>> > > - if they are different MCF recognizes the file is to be indexed. >>>>>> > > >>>>>> > > If the above processes are done for two thousands files, what part of >>>>>> > > the >>>>>> > > processes could take time the most? obtaining updated time? reading >>>>>> > > data >>>>>> > > from DB? what could be done to increase the crawling time do you >>>>>> > > think? >>>>>> > > >>>>>> > > Please give me some advice. >>>>>> > > >>>>>> > > >>>>>> > > Regards, >>>>>> > > >>>>>> > > Shigeki >>>>>> > > >>>>>> > > >>>>>> > >>>>>> >>>>>> --14dae9399bd9676e5704d3c356e9 >>>>>> Content-Type: text/html; charset="utf-8" >>>>>> Content-Transfer-Encoding: quoted-printable >>>>>> >>>>>> <html><head><meta content=3D"text/html; charset=3Dutf-8" >>>>>> http-equiv=3D"Cont= >>>>>> ent-Type"></head><body><div><div style=3D"font-family: >>>>>> Calibri,sans-serif; >>>>>> = >>>>>> font-size: 11pt;"><br>Can you get an EXPLAIN for this query? It >>>>>> sound= >>>>>> s like it is disregarding the hint for some >>>>>> reason.<br><br>Karl<br><br>Sent= >>>>>> from my Windows Phone<br></div></div><hr><span style=3D"font-family: >>>>>> Tahom= >>>>>> a,sans-serif; font-size: 10pt; font-weight: bold;">From: </span><span >>>>>> style= >>>>>> =3D"font-family: Tahoma,sans-serif; font-size: 10pt;">Shigeki >>>>>> Kobayashi</sp= >>>>>> an><br><span style=3D"font-family: Tahoma,sans-serif; font-size: 10pt; >>>>>> font= >>>>>> -weight: bold;">Sent: </span><span style=3D"font-family: >>>>>> Tahoma,sans-serif;= >>>>>> font-size: 10pt;">1/20/2013 9:37 PM</span><br><span >>>>>> style=3D"font-family: >>>>>> = >>>>>> Tahoma,sans-serif; font-size: 10pt; font-weight: bold;">To: </span><span >>>>>> st= >>>>>> yle=3D"font-family: Tahoma,sans-serif; font-size: >>>>>> 10pt;">[email protected]= >>>>>> ache.org</span><br><span style=3D"font-family: Tahoma,sans-serif; >>>>>> font-size= >>>>>> : 10pt; font-weight: bold;">Subject: </span><span style=3D"font-family: >>>>>> Tah= >>>>>> oma,sans-serif; font-size: 10pt;">Re: Crawling new/updated files using >>>>>> Wind= >>>>>> ows share connection takes too long</span><br><br></body></html><div >>>>>> dir=3D= >>>>>> "ltr">Hi Karl.<div><br></div><div>I configured MySQL 5.5 to run MCF this >>>>>> ti= >>>>>> me.</div><div>The version of MCF is trunk 1.1dev downloaded on Dec, 12th. >>>>>> ,= >>>>>> which you fixed</div><div style>the slow query using "FORCE >>>>>> INDEX&quo= >>>>>> t;. Solr is 4.0</div> >>>>>> >>>>>> <div style><br></div><div style>I thought is was fixed but the log shows >>>>>> th= >>>>>> at =C2=A0the following are slow queries.=C2=A0</div><div><div >>>>>> class=3D"gmai= >>>>>> >>>>>> l_extra">------------------------------------------------------------------= >>>>>> -</div> >>>>>> <div class=3D"gmail_extra"> >>>>>> <div class=3D"gmail_extra"># Time: 130120 11:41:10</div><div >>>>>> class=3D"gmail= >>>>>> _extra"># User@Host: manifoldcf[manifoldcf] @ localhost >>>>>> [127.0.0.1]</div><d= >>>>>> iv class=3D"gmail_extra"># Query_time: 8.761087 =C2=A0Lock_time: 0.000163 >>>>>> R= >>>>>> ows_sent: 17 =C2=A0Rows_examined: 6365233</div> >>>>>> >>>>>> <div class=3D"gmail_extra">SET timestamp=3D1358649670;</div><div >>>>>> class=3D"g= >>>>>> mail_extra">SELECT <a >>>>>> href=3D"http://t0.id">t0.id</a>,t0.jobid,t0.dochash,t= >>>>>> 0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue >>>>>> t0 >>>>>> = >>>>>> FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G') >>>>>> A= >>>>>> ND t0.checkaction=3D'R' AND t0.checktime<=3D1358649661663 AND >>>>>> EX= >>>>>> ISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN >>>>>> ('A','a= >>>>>> ') AND <a href=3D"http://t1.id">t1.id</a>=3Dt0.jobid AND >>>>>> t1.priority=3D= >>>>>> 5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE >>>>>> t2.dochash=3Dt0= >>>>>> .dochash AND t2.status IN >>>>>> ('A','F','a','f',= >>>>>> 'D','d') AND t2.jobid!=3Dt0.jobid) AND NOT EXISTS(SELECT >>>>>> &#= >>>>>> 39;x' FROM prereqevents t3,events t4 WHERE <a >>>>>> href=3D"http://t0.id">t0.= >>>>>> id</a>=3Dt3.owner AND t3.eventname=3D<a >>>>>> href=3D"http://t4.name">t4.name</a>= >>>>>> ) ORDER BY t0.docpriority ASC LIMIT 4800;</div> >>>>>> >>>>>> <div class=3D"gmail_extra"><br></div><div class=3D"gmail_extra"># Time: >>>>>> 130= >>>>>> 120 11:41:18<br></div><div class=3D"gmail_extra"># User@Host: >>>>>> manifoldcf[ma= >>>>>> nifoldcf] @ localhost [127.0.0.1]<br></div><div class=3D"gmail_extra"># >>>>>> Que= >>>>>> ry_time: 7.714277 =C2=A0Lock_time: 0.000123 Rows_sent: 0 >>>>>> =C2=A0Rows_examine= >>>>>> d: 6365182</div> >>>>>> >>>>>> <div class=3D"gmail_extra">SET timestamp=3D1358649678;</div><div >>>>>> class=3D"g= >>>>>> mail_extra">SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE >>>>>> I= >>>>>> NDEX (i1358228295210) WHERE status IN ('P','G') AND >>>>>> checkac= >>>>>> tion=3D'R' AND checktime<=3D1358649661663 AND EXISTS(SELECT >>>>>> '= >>>>>> ;x' FROM jobs t1 WHERE t1.status IN ('A','a') AND <a >>>>>> hr= >>>>>> ef=3D"http://t1.id">t1.id</a>=3Dt0.jobid) =C2=A0ORDER BY docpriority ASC >>>>>> LI= >>>>>> MIT 1;</div> >>>>>> >>>>>> <div><br></div><div class=3D"gmail_extra" style>Regards,</div><div >>>>>> class=3D= >>>>>> "gmail_extra"><br></div><div class=3D"gmail_extra"><br></div><div >>>>>> class=3D"= >>>>>> gmail_extra" style>Shigeki</div><div class=3D"gmail_extra"><br></div><div >>>>>> c= >>>>>> lass=3D"gmail_extra"> >>>>>> >>>>>> <br></div><br><div class=3D"gmail_quote">2013/1/18 Karl Wright <span >>>>>> dir=3D= >>>>>> "ltr"><<a href=3D"mailto:[email protected]" >>>>>> target=3D"_blank">daddywri@= >>>>>> gmail.com</a>></span><br><blockquote class=3D"gmail_quote" >>>>>> style=3D"marg= >>>>>> in:0px 0px 0px >>>>>> 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,20= >>>>>> 4);border-left-style:solid;padding-left:1ex"> >>>>>> >>>>>> Hi Shigeki,<br> >>>>>> <br> >>>>>> What database is ManifoldCF configured to use in this case? =C2=A0Do >>>>>> you<br= >>>>>> > >>>>>> see any indication of slow queries in the ManifoldCF log?<br> >>>>>> <br> >>>>>> <br> >>>>>> Karl<br> >>>>>> <br> >>>>>> On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi<br> >>>>>> <<a >>>>>> href=3D"mailto:[email protected]">shigeki.kobayash= >>>>>> [email protected]</a>> wrote:<br> >>>>>> > Hello<br> >>>>>> ><br> >>>>>> ><br> >>>>>> > I would like some advice to improve crawling time of new/updated >>>>>> files= >>>>>> using<br> >>>>>> > Windows share connection.<br> >>>>>> ><br> >>>>>> > I crawl file in Windows server and index them into Solr.<br> >>>>>> ><br> >>>>>> > Currently, the second crawling of two hundred thousands files takes >>>>>> = >>>>>> =C2=A0over 5<br> >>>>>> > hours, even though any files are not updated, created, deleted.<br> >>>>>> ><br> >>>>>> > I assume MCF does the following processes (let me know if I am >>>>>> wrong)<= >>>>>> br> >>>>>> ><br> >>>>>> > - obtain updated time of a file<br> >>>>>> > - compare the updated time with the one MCF obtained last time >>>>>> crawlin= >>>>>> g(<br> >>>>>> > probably stored in DB)<br> >>>>>> > - if they are different MCF recognizes the file is to be >>>>>> indexed.<br> >>>>>> ><br> >>>>>> > If the above processes are done for two thousands files, what part >>>>>> of >>>>>> = >>>>>> the<br> >>>>>> > processes could take time the most? obtaining updated time? reading >>>>>> da= >>>>>> ta<br> >>>>>> > from DB? what could be done to increase the crawling time do you >>>>>> think= >>>>>> ?<br> >>>>>> ><br> >>>>>> > Please give me some advice.<br> >>>>>> ><br> >>>>>> ><br> >>>>>> > Regards,<br> >>>>>> ><br> >>>>>> > Shigeki<br> >>>>>> ><br> >>>>>> ><br> >>>>>> </blockquote></div><br><br clear=3D"all"><div><br></div><br><div><font >>>>>> face= >>>>>> =3D"'ms gothic', monospace">=C2=A0</font></div><div><br></div> >>>>>> </div></div></div> >>>>>> >>>>>> --14dae9399bd9676e5704d3c356e9-- >>>>> >>>>> >>>>> >>>>> >>>>>
