Re: Crawling new/updated files using Windows share connection

Karl Wright Mon, 21 Jan 2013 15:21:09 -0800

I checked a fix for this into trunk.

Please sync up with trunk and see if this fixes your problem.  If it
does, I will gladly include the fix in MCF 1.1.


Karl


On Mon, Jan 21, 2013 at 9:14 AM, Karl Wright <[email protected]> wrote:
> CONNECTORS-618
>
> Karl
>
> On Mon, Jan 21, 2013 at 9:08 AM, Karl Wright <[email protected]> wrote:
>> Bad news, I am afraid.  MySQL seems to always put null values at the
>> front of the index, and that cannot be changed through any means I can
>> find.  This is different from all other databases I know of.
>>
>> The only possible fixes for the problem are as follows:
>>
>> (1) Not use a null doc priority but instead use an actual special
>> number that is guaranteed to always sort to the end.  This would be
>> non-trivial because this column is a FLOAT value, and round-off errors
>> will prevent the ManifoldCF code from reliably using a special number
>> like that on all databases.
>>
>> (2) Use docpriorities that are ordered in the opposite way - which
>> would work ONLY for MySQL and would break all other databases.
>>
>> I'll create a ticket and think about the problem some more.
>>
>> Karl
>>
>> On Mon, Jan 21, 2013 at 8:48 AM, Karl Wright <[email protected]> wrote:
>>> Hi Shigeki,
>>>
>>> I reviewed the code in detail.  At the time CONNECTORS-290 was fixed,
>>> all document priorities were set to null whenever a job was paused or
>>> aborted, so what I suspected might be the problem cannot in fact
>>> happen.
>>>
>>> The most likely possible explanation for MySQL's behavior, therefore,
>>> is that MySQL orders null docpriority values BEFORE all other rows in
>>> the index it is using for queue stuffing.  I have no other way of
>>> explaining why it thinks it needs to go through 6.5 million rows
>>> before it gets to the ones that are active.
>>>
>>> If this is the case, it may be possible to tell MySQL to order null
>>> column values to the END instead of the beginning of the index.  I'll
>>> do some research on this later and get back to you.
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>>
>>> On Mon, Jan 21, 2013 at 6:21 AM, Karl Wright <[email protected]> wrote:
>>>> Are there any large paused or aborted jobs present on the same
>>>> ManifoldCF?  If so, can you tell me whether the job is paused, or
>>>> aborted?  (I am betting paused...)
>>>>
>>>> Karl
>>>>
>>>> On Mon, Jan 21, 2013 at 5:59 AM, Shigeki Kobayashi
>>>> <[email protected]> wrote:
>>>>> Hi Karl,
>>>>>
>>>>>
>>>>> Here is the explain. There isn't such sort...
>>>>>
>>>>> mysql> explain SELECT
>>>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>>>> FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0 IN ('P','G') AND
>>>>> t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT 'x'
>>>>> FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND
>>>>> t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
>>>>> t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
>>>>> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events 
>>>>> t4
>>>>> WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority ASC
>>>>> LIMIT 4800;
>>>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+
>>>>> | id | select_type        | table | type   | possible_keys
>>>>> | key            | key_len | ref                     | rows | Extra       
>>>>> |
>>>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+
>>>>> |  1 | PRIMARY            | t0    | index  | NULL
>>>>> | I1358228295210 | 25      | NULL                    | 4800 | Using where 
>>>>> |
>>>>> |  4 | DEPENDENT SUBQUERY | t3    | ref    | I1358228295216
>>>>> | I1358228295216 | 8       | manifoldcf.t0.id        |    1 |             
>>>>> |
>>>>> |  4 | DEPENDENT SUBQUERY | t4    | eq_ref | PRIMARY
>>>>> | PRIMARY        | 767     | manifoldcf.t3.eventname |    1 | Using index 
>>>>> |
>>>>> |  3 | DEPENDENT SUBQUERY | t2    | ref    |
>>>>> I1358228295209,I1358228295212,I1358228295211 | I1358228295209 | 122     |
>>>>> manifoldcf.t0.dochash   |    1 | Using where |
>>>>> |  2 | DEPENDENT SUBQUERY | t1    | eq_ref | PRIMARY,I1358228295219
>>>>> | PRIMARY        | 8       | manifoldcf.t0.jobid     |    1 | Using where 
>>>>> |
>>>>> +----+--------------------+-------+--------+----------------------------------------------+----------------+---------+-------------------------+------+-------------+
>>>>> 5 rows in set (0.00 sec)
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>>
>>>>> Shigeki
>>>>>
>>>>>
>>>>>
>>>>> 2013/1/21 Karl Wright <[email protected]>
>>>>>>
>>>>>>  takes too long
>>>>>> MIME-Version: 1.0
>>>>>> Content-Type: multipart/alternative; 
>>>>>> boundary=14dae9399bd9676e5704d3c356e9
>>>>>>
>>>>>> --14dae9399bd9676e5704d3c356e9
>>>>>> Content-Type: text/plain; charset="utf-8"
>>>>>> Content-Transfer-Encoding: 7bit
>>>>>>
>>>>>> Can you get an EXPLAIN for this query? It sounds like it is
>>>>>> disregarding the hint for some reason.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> Sent from my Windows Phone
>>>>>> From: Shigeki Kobayashi
>>>>>> Sent: 1/20/2013 9:37 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: Crawling new/updated files using Windows share connection
>>>>>> takes too long
>>>>>> Hi Karl.
>>>>>>
>>>>>> I configured MySQL 5.5 to run MCF this time.
>>>>>> The version of MCF is trunk 1.1dev downloaded on Dec, 12th. , which you
>>>>>> fixed
>>>>>> the slow query using "FORCE INDEX". Solr is 4.0
>>>>>>
>>>>>> I thought is was fixed but the log shows that  the following are slow
>>>>>> queries.
>>>>>> -------------------------------------------------------------------
>>>>>> # Time: 130120 11:41:10
>>>>>> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>>>>>> # Query_time: 8.761087  Lock_time: 0.000163 Rows_sent: 17  Rows_examined:
>>>>>> 6365233
>>>>>> SET timestamp=1358649670;
>>>>>> SELECT
>>>>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>>>>> FROM jobqueue t0 FORCE INDEX (i1358228295210) WHERE t0.status IN 
>>>>>> ('P','G')
>>>>>> AND t0.checkaction='R' AND t0.checktime<=1358649661663 AND EXISTS(SELECT
>>>>>> 'x' FROM jobs t1 WHERE t1.status IN ('A','a') AND t1.id=t0.jobid AND
>>>>>> t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
>>>>>> t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
>>>>>> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events
>>>>>> t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority
>>>>>> ASC LIMIT 4800;
>>>>>>
>>>>>> # Time: 130120 11:41:18
>>>>>> # User@Host: manifoldcf[manifoldcf] @ localhost [127.0.0.1]
>>>>>> # Query_time: 7.714277  Lock_time: 0.000123 Rows_sent: 0  Rows_examined:
>>>>>> 6365182
>>>>>> SET timestamp=1358649678;
>>>>>> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE INDEX
>>>>>> (i1358228295210) WHERE status IN ('P','G') AND checkaction='R' AND
>>>>>> checktime<=1358649661663 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
>>>>>> t1.status
>>>>>> IN ('A','a') AND t1.id=t0.jobid)  ORDER BY docpriority ASC LIMIT 1;
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>>
>>>>>> Shigeki
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/1/18 Karl Wright <[email protected]>
>>>>>>
>>>>>> > Hi Shigeki,
>>>>>> >
>>>>>> > What database is ManifoldCF configured to use in this case?  Do you
>>>>>> > see any indication of slow queries in the ManifoldCF log?
>>>>>> >
>>>>>> >
>>>>>> > Karl
>>>>>> >
>>>>>> > On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi
>>>>>> > <[email protected]> wrote:
>>>>>> > > Hello
>>>>>> > >
>>>>>> > >
>>>>>> > > I would like some advice to improve crawling time of new/updated 
>>>>>> > > files
>>>>>> > using
>>>>>> > > Windows share connection.
>>>>>> > >
>>>>>> > > I crawl file in Windows server and index them into Solr.
>>>>>> > >
>>>>>> > > Currently, the second crawling of two hundred thousands files takes
>>>>>> >  over 5
>>>>>> > > hours, even though any files are not updated, created, deleted.
>>>>>> > >
>>>>>> > > I assume MCF does the following processes (let me know if I am wrong)
>>>>>> > >
>>>>>> > > - obtain updated time of a file
>>>>>> > > - compare the updated time with the one MCF obtained last time
>>>>>> > > crawling(
>>>>>> > > probably stored in DB)
>>>>>> > > - if they are different MCF recognizes the file is to be indexed.
>>>>>> > >
>>>>>> > > If the above processes are done for two thousands files, what part of
>>>>>> > > the
>>>>>> > > processes could take time the most? obtaining updated time? reading
>>>>>> > > data
>>>>>> > > from DB? what could be done to increase the crawling time do you
>>>>>> > > think?
>>>>>> > >
>>>>>> > > Please give me some advice.
>>>>>> > >
>>>>>> > >
>>>>>> > > Regards,
>>>>>> > >
>>>>>> > > Shigeki
>>>>>> > >
>>>>>> > >
>>>>>> >
>>>>>>
>>>>>> --14dae9399bd9676e5704d3c356e9
>>>>>> Content-Type: text/html; charset="utf-8"
>>>>>> Content-Transfer-Encoding: quoted-printable
>>>>>>
>>>>>> <html><head><meta content=3D"text/html; charset=3Dutf-8"
>>>>>> http-equiv=3D"Cont=
>>>>>> ent-Type"></head><body><div><div style=3D"font-family: 
>>>>>> Calibri,sans-serif;
>>>>>> =
>>>>>> font-size: 11pt;"><br>Can you get an EXPLAIN for this query?&nbsp; It
>>>>>> sound=
>>>>>> s like it is disregarding the hint for some
>>>>>> reason.<br><br>Karl<br><br>Sent=
>>>>>>  from my Windows Phone<br></div></div><hr><span style=3D"font-family:
>>>>>> Tahom=
>>>>>> a,sans-serif; font-size: 10pt; font-weight: bold;">From: </span><span
>>>>>> style=
>>>>>> =3D"font-family: Tahoma,sans-serif; font-size: 10pt;">Shigeki
>>>>>> Kobayashi</sp=
>>>>>> an><br><span style=3D"font-family: Tahoma,sans-serif; font-size: 10pt;
>>>>>> font=
>>>>>> -weight: bold;">Sent: </span><span style=3D"font-family:
>>>>>> Tahoma,sans-serif;=
>>>>>>  font-size: 10pt;">1/20/2013 9:37 PM</span><br><span 
>>>>>> style=3D"font-family:
>>>>>> =
>>>>>> Tahoma,sans-serif; font-size: 10pt; font-weight: bold;">To: </span><span
>>>>>> st=
>>>>>> yle=3D"font-family: Tahoma,sans-serif; font-size:
>>>>>> 10pt;">[email protected]=
>>>>>> ache.org</span><br><span style=3D"font-family: Tahoma,sans-serif;
>>>>>> font-size=
>>>>>> : 10pt; font-weight: bold;">Subject: </span><span style=3D"font-family:
>>>>>> Tah=
>>>>>> oma,sans-serif; font-size: 10pt;">Re: Crawling new/updated files using
>>>>>> Wind=
>>>>>> ows share connection takes too long</span><br><br></body></html><div
>>>>>> dir=3D=
>>>>>> "ltr">Hi Karl.<div><br></div><div>I configured MySQL 5.5 to run MCF this
>>>>>> ti=
>>>>>> me.</div><div>The version of MCF is trunk 1.1dev downloaded on Dec, 12th.
>>>>>> ,=
>>>>>>  which you fixed</div><div style>the slow query using &quot;FORCE
>>>>>> INDEX&quo=
>>>>>> t;. Solr is 4.0</div>
>>>>>>
>>>>>> <div style><br></div><div style>I thought is was fixed but the log shows
>>>>>> th=
>>>>>> at =C2=A0the following are slow queries.=C2=A0</div><div><div
>>>>>> class=3D"gmai=
>>>>>>
>>>>>> l_extra">------------------------------------------------------------------=
>>>>>> -</div>
>>>>>> <div class=3D"gmail_extra">
>>>>>> <div class=3D"gmail_extra"># Time: 130120 11:41:10</div><div
>>>>>> class=3D"gmail=
>>>>>> _extra"># User@Host: manifoldcf[manifoldcf] @ localhost
>>>>>> [127.0.0.1]</div><d=
>>>>>> iv class=3D"gmail_extra"># Query_time: 8.761087 =C2=A0Lock_time: 0.000163
>>>>>> R=
>>>>>> ows_sent: 17 =C2=A0Rows_examined: 6365233</div>
>>>>>>
>>>>>> <div class=3D"gmail_extra">SET timestamp=3D1358649670;</div><div
>>>>>> class=3D"g=
>>>>>> mail_extra">SELECT <a
>>>>>> href=3D"http://t0.id";>t0.id</a>,t0.jobid,t0.dochash,t=
>>>>>> 0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue 
>>>>>> t0
>>>>>> =
>>>>>> FORCE INDEX (i1358228295210) WHERE t0.status IN (&#39;P&#39;,&#39;G&#39;)
>>>>>> A=
>>>>>> ND t0.checkaction=3D&#39;R&#39; AND t0.checktime&lt;=3D1358649661663 AND
>>>>>> EX=
>>>>>> ISTS(SELECT &#39;x&#39; FROM jobs t1 WHERE t1.status IN
>>>>>> (&#39;A&#39;,&#39;a=
>>>>>> &#39;) AND <a href=3D"http://t1.id";>t1.id</a>=3Dt0.jobid AND
>>>>>> t1.priority=3D=
>>>>>> 5) AND NOT EXISTS(SELECT &#39;x&#39; FROM jobqueue t2 WHERE
>>>>>> t2.dochash=3Dt0=
>>>>>> .dochash AND t2.status IN
>>>>>> (&#39;A&#39;,&#39;F&#39;,&#39;a&#39;,&#39;f&#39;,=
>>>>>> &#39;D&#39;,&#39;d&#39;) AND t2.jobid!=3Dt0.jobid) AND NOT EXISTS(SELECT
>>>>>> &#=
>>>>>> 39;x&#39; FROM prereqevents t3,events t4 WHERE <a
>>>>>> href=3D"http://t0.id";>t0.=
>>>>>> id</a>=3Dt3.owner AND t3.eventname=3D<a
>>>>>> href=3D"http://t4.name";>t4.name</a>=
>>>>>> ) ORDER BY t0.docpriority ASC LIMIT 4800;</div>
>>>>>>
>>>>>> <div class=3D"gmail_extra"><br></div><div class=3D"gmail_extra"># Time:
>>>>>> 130=
>>>>>> 120 11:41:18<br></div><div class=3D"gmail_extra"># User@Host:
>>>>>> manifoldcf[ma=
>>>>>> nifoldcf] @ localhost [127.0.0.1]<br></div><div class=3D"gmail_extra">#
>>>>>> Que=
>>>>>> ry_time: 7.714277 =C2=A0Lock_time: 0.000123 Rows_sent: 0
>>>>>> =C2=A0Rows_examine=
>>>>>> d: 6365182</div>
>>>>>>
>>>>>> <div class=3D"gmail_extra">SET timestamp=3D1358649678;</div><div
>>>>>> class=3D"g=
>>>>>> mail_extra">SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 FORCE
>>>>>> I=
>>>>>> NDEX (i1358228295210) WHERE status IN (&#39;P&#39;,&#39;G&#39;) AND
>>>>>> checkac=
>>>>>> tion=3D&#39;R&#39; AND checktime&lt;=3D1358649661663 AND EXISTS(SELECT
>>>>>> &#39=
>>>>>> ;x&#39; FROM jobs t1 WHERE t1.status IN (&#39;A&#39;,&#39;a&#39;) AND <a
>>>>>> hr=
>>>>>> ef=3D"http://t1.id";>t1.id</a>=3Dt0.jobid) =C2=A0ORDER BY docpriority ASC
>>>>>> LI=
>>>>>> MIT 1;</div>
>>>>>>
>>>>>> <div><br></div><div class=3D"gmail_extra" style>Regards,</div><div
>>>>>> class=3D=
>>>>>> "gmail_extra"><br></div><div class=3D"gmail_extra"><br></div><div
>>>>>> class=3D"=
>>>>>> gmail_extra" style>Shigeki</div><div class=3D"gmail_extra"><br></div><div
>>>>>> c=
>>>>>> lass=3D"gmail_extra">
>>>>>>
>>>>>> <br></div><br><div class=3D"gmail_quote">2013/1/18 Karl Wright <span
>>>>>> dir=3D=
>>>>>> "ltr">&lt;<a href=3D"mailto:[email protected]";
>>>>>> target=3D"_blank">daddywri@=
>>>>>> gmail.com</a>&gt;</span><br><blockquote class=3D"gmail_quote"
>>>>>> style=3D"marg=
>>>>>> in:0px 0px 0px
>>>>>> 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,20=
>>>>>> 4);border-left-style:solid;padding-left:1ex">
>>>>>>
>>>>>> Hi Shigeki,<br>
>>>>>> <br>
>>>>>> What database is ManifoldCF configured to use in this case? =C2=A0Do
>>>>>> you<br=
>>>>>> >
>>>>>> see any indication of slow queries in the ManifoldCF log?<br>
>>>>>> <br>
>>>>>> <br>
>>>>>> Karl<br>
>>>>>> <br>
>>>>>> On Fri, Jan 18, 2013 at 5:27 AM, Shigeki Kobayashi<br>
>>>>>> &lt;<a
>>>>>> href=3D"mailto:[email protected]";>shigeki.kobayash=
>>>>>> [email protected]</a>&gt; wrote:<br>
>>>>>> &gt; Hello<br>
>>>>>> &gt;<br>
>>>>>> &gt;<br>
>>>>>> &gt; I would like some advice to improve crawling time of new/updated
>>>>>> files=
>>>>>>  using<br>
>>>>>> &gt; Windows share connection.<br>
>>>>>> &gt;<br>
>>>>>> &gt; I crawl file in Windows server and index them into Solr.<br>
>>>>>> &gt;<br>
>>>>>> &gt; Currently, the second crawling of two hundred thousands files takes 
>>>>>> =
>>>>>> =C2=A0over 5<br>
>>>>>> &gt; hours, even though any files are not updated, created, deleted.<br>
>>>>>> &gt;<br>
>>>>>> &gt; I assume MCF does the following processes (let me know if I am
>>>>>> wrong)<=
>>>>>> br>
>>>>>> &gt;<br>
>>>>>> &gt; - obtain updated time of a file<br>
>>>>>> &gt; - compare the updated time with the one MCF obtained last time
>>>>>> crawlin=
>>>>>> g(<br>
>>>>>> &gt; probably stored in DB)<br>
>>>>>> &gt; - if they are different MCF recognizes the file is to be 
>>>>>> indexed.<br>
>>>>>> &gt;<br>
>>>>>> &gt; If the above processes are done for two thousands files, what part 
>>>>>> of
>>>>>> =
>>>>>> the<br>
>>>>>> &gt; processes could take time the most? obtaining updated time? reading
>>>>>> da=
>>>>>> ta<br>
>>>>>> &gt; from DB? what could be done to increase the crawling time do you
>>>>>> think=
>>>>>> ?<br>
>>>>>> &gt;<br>
>>>>>> &gt; Please give me some advice.<br>
>>>>>> &gt;<br>
>>>>>> &gt;<br>
>>>>>> &gt; Regards,<br>
>>>>>> &gt;<br>
>>>>>> &gt; Shigeki<br>
>>>>>> &gt;<br>
>>>>>> &gt;<br>
>>>>>> </blockquote></div><br><br clear=3D"all"><div><br></div><br><div><font
>>>>>> face=
>>>>>> =3D"&#39;ms gothic&#39;, monospace">=C2=A0</font></div><div><br></div>
>>>>>> </div></div></div>
>>>>>>
>>>>>> --14dae9399bd9676e5704d3c356e9--
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>

Re: Crawling new/updated files using Windows share connection

Reply via email to