takes too long MIME-Version: 1.0 Content-Type: multipart/alternative; boundary=14dae9399bd9676e5704d3c356e9
--14dae9399bd9676e5704d3c356e9 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Can you get an EXPLAIN for this query? It sounds like it is disregarding the hint for some reason. Karl Sent from my Windows Phone From: Shigeki Kobayashi Sent: 1/20/2013 9:37 PM To: [email protected] Subject: Re: Crawling new/updated files using Windows share connection takes too long Hi Karl. I configured MySQL 5.5 to run MCF this time. The version of MCF is trunk 1.1dev downloaded on Dec, 12th. , which you fixed the slow query using "FORCE INDEX". Solr is 4.0 I thought is was fixed but the log shows that the following are slow queries. ------------------------------------------------------------------- # Time: 130120 11:41:10 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1] # Query_time: 8.761087 Lock_time: 0.000163 Rows_sent: 17 Rows_examined: 6365233 SET timestamp=1358649670; SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G') AND t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC LIMIT 4800; # Time: 130120 11:41:18 # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1] # Query_time: 7.714277 Lock_time: 0.000123 Rows_sent: 0 Rows_examined: 6365182 SET timestamp=1358649678; SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE status IN ('P','G') AND checkaction='R' AND checktime<=1358649661663 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid) ORDER BY docpriority ASC LIMIT 1; Regards, Shigeki 2013/1/18 Karl Wright <[email protected]> > Hi Shigeki, > > What database is ManifoldCF configured to use in this case? Do you > see any indication of slow queries in the ManifoldCF log? > > > Karl > > On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi > <[email protected]> wrote: > > Hello > > > > > > I would like some advice to improve crawling time of new/updated files > using > > Windows share connection. > > > > I crawl file in Windows server and index them into Solr. > > > > Currently, the second crawling of two hundred thousands files takes > over 5 > > hours, even though any files are not updated, created, deleted. > > > > I assume MCF does the following processes (let me know if I am wrong) > > > > - obtain updated time of a file > > - compare the updated time with the one MCF obtained last time crawling( > > probably stored in DB) > > - if they are different MCF recognizes the file is to be indexed. > > > > If the above processes are done for two thousands files, what part of the > > processes could take time the most? obtaining updated time? reading data > > from DB? what could be done to increase the crawling time do you think? > > > > Please give me some advice. > > > > > > Regards, > > > > Shigeki > > > > > --14dae9399bd9676e5704d3c356e9 Content-Type: text/html; charset="utf-8" Content-Transfer-Encoding: quoted-printable <html><head><meta content=3D"text/html; charset=3Dutf-8" http-equiv=3D"Cont= ent-Type"></head><body><div><div style=3D"font-family: Calibri,sans-serif; = font-size: 11pt;"><br>Can you get an EXPLAIN for this query? It sound= s like it is disregarding the hint for some reason.<br><br>Karl<br><br>Sent= from my Windows Phone<br></div></div><hr><span style=3D"font-family: Tahom= a,sans-serif; font-size: 10pt; font-weight: bold;">From: </span><span style= =3D"font-family: Tahoma,sans-serif; font-size: 10pt;">Shigeki Kobayashi</sp= an><br><span style=3D"font-family: Tahoma,sans-serif; font-size: 10pt; font= -weight: bold;">Sent: </span><span style=3D"font-family: Tahoma,sans-serif;= font-size: 10pt;">1/20/2013 9:37 PM</span><br><span style=3D"font-family: = Tahoma,sans-serif; font-size: 10pt; font-weight: bold;">To: </span><span st= yle=3D"font-family: Tahoma,sans-serif; font-size: 10pt;">[email protected]= ache.org</span><br><span style=3D"font-family: Tahoma,sans-serif; font-size= : 10pt; font-weight: bold;">Subject: </span><span style=3D"font-family: Tah= oma,sans-serif; font-size: 10pt;">Re: Crawling new/updated files using Wind= ows share connection takes too long</span><br><br></body></html><div dir=3D= "ltr">Hi Karl.<div><br></div><div>I configured MySQL 5.5 to run MCF this ti= me.</div><div>The version of MCF is trunk 1.1dev downloaded on Dec, 12th. ,= which you fixed</div><div style>the slow query using "FORCE INDEX&quo= t;. Solr is 4.0</div> <div style><br></div><div style>I thought is was fixed but the log shows th= at =C2=A0the following are slow queries.=C2=A0</div><div><div class=3D"gmai= l_extra">------------------------------------------------------------------= -</div> <div class=3D"gmail_extra"> <div class=3D"gmail_extra"># Time: 130120 11:41:10</div><div class=3D"gmail= _extra"># User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]</div><d= iv class=3D"gmail_extra"># Query_time: 8.761087 =C2=A0Lock_time: 0.000163 R= ows_sent: 17 =C2=A0Rows_examined: 6365233</div> <div class=3D"gmail_extra">SET timestamp=3D1358649670;</div><div class=3D"g= mail_extra">SELECT <a href=3D"http://t0.id">t0.id</a>,t0.jobid,t0.dochash,t= 0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0 = FORCE INDEX (i1358228295210) WHERE t0.status IN ('P','G') A= ND t0.checkaction=3D'R' AND t0.checktime<=3D1358649661663 AND EX= ISTS(SELECT 'x' FROM jobs t1 WHERE t1.status IN ('A','a= ') AND <a href=3D"http://t1.id">t1.id</a>=3Dt0.jobid AND t1.priority=3D= 5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE t2.dochash=3Dt0= .dochash AND t2.status IN ('A','F','a','f',= 'D','d') AND t2.jobid!=3Dt0.jobid) AND NOT EXISTS(SELECT &#= 39;x' FROM prereqevents t3,events t4 WHERE <a href=3D"http://t0.id">t0.= id</a>=3Dt3.owner AND t3.eventname=3D<a href=3D"http://t4.name">t4.name</a>= ) ORDER BY t0.docpriority ASC LIMIT 4800;</div> <div class=3D"gmail_extra"><br></div><div class=3D"gmail_extra"># Time: 130= 120 11:41:18<br></div><div class=3D"gmail_extra"># User@Host: manifoldcf[ma= nifoldcf] @ localhost [127.0.0.1]<br></div><div class=3D"gmail_extra"># Que= ry_time: 7.714277 =C2=A0Lock_time: 0.000123 Rows_sent: 0 =C2=A0Rows_examine= d: 6365182</div> <div class=3D"gmail_extra">SET timestamp=3D1358649678;</div><div class=3D"g= mail_extra">SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE I= NDEX (i1358228295210) WHERE status IN ('P','G') AND checkac= tion=3D'R' AND checktime<=3D1358649661663 AND EXISTS(SELECT '= ;x' FROM jobs t1 WHERE t1.status IN ('A','a') AND <a hr= ef=3D"http://t1.id">t1.id</a>=3Dt0.jobid) =C2=A0ORDER BY docpriority ASC LI= MIT 1;</div> <div><br></div><div class=3D"gmail_extra" style>Regards,</div><div class=3D= "gmail_extra"><br></div><div class=3D"gmail_extra"><br></div><div class=3D"= gmail_extra" style>Shigeki</div><div class=3D"gmail_extra"><br></div><div c= lass=3D"gmail_extra"> <br></div><br><div class=3D"gmail_quote">2013/1/18 Karl Wright <span dir=3D= "ltr"><<a href=3D"mailto:[email protected]" target=3D"_blank">daddywri@= gmail.com</a>></span><br><blockquote class=3D"gmail_quote" style=3D"marg= in:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,20= 4);border-left-style:solid;padding-left:1ex"> Hi Shigeki,<br> <br> What database is ManifoldCF configured to use in this case? =C2=A0Do you<br= > see any indication of slow queries in the ManifoldCF log?<br> <br> <br> Karl<br> <br> On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi<br> <<a href=3D"mailto:[email protected]">shigeki.kobayash= [email protected]</a>> wrote:<br> > Hello<br> ><br> ><br> > I would like some advice to improve crawling time of new/updated files= using<br> > Windows share connection.<br> ><br> > I crawl file in Windows server and index them into Solr.<br> ><br> > Currently, the second crawling of two hundred thousands files takes = =C2=A0over 5<br> > hours, even though any files are not updated, created, deleted.<br> ><br> > I assume MCF does the following processes (let me know if I am wrong)<= br> ><br> > - obtain updated time of a file<br> > - compare the updated time with the one MCF obtained last time crawlin= g(<br> > probably stored in DB)<br> > - if they are different MCF recognizes the file is to be indexed.<br> ><br> > If the above processes are done for two thousands files, what part of = the<br> > processes could take time the most? obtaining updated time? reading da= ta<br> > from DB? what could be done to increase the crawling time do you think= ?<br> ><br> > Please give me some advice.<br> ><br> ><br> > Regards,<br> ><br> > Shigeki<br> ><br> ><br> </blockquote></div><br><br clear=3D"all"><div><br></div><br><div><font face= =3D"'ms gothic', monospace">=C2=A0</font></div><div><br></div> </div></div></div> --14dae9399bd9676e5704d3c356e9--
