Hi Mark,

MCF retries those sorts of errors automatically. It's possible there's a place we missed, but let's pursue other avenues first.
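(For anyone hitting the same PostgreSQL errors: the usual handling is simply to re-run the whole transaction when the database reports SQLSTATE 40001. Below is a minimal sketch of that pattern, assuming a JDBC connection with auto-commit off; it is illustrative only and not MCF's actual code.)

    import java.sql.Connection;
    import java.sql.SQLException;

    // Illustrative sketch, not MCF's retry code: retry on PostgreSQL
    // serialization failures (SQLSTATE 40001), which the HINT in the log
    // says are safe to retry.
    public class SerializationRetry {

      public interface SqlWork {
        void run(Connection conn) throws SQLException;
      }

      public static void runWithRetry(Connection conn, SqlWork work) throws SQLException {
        final int maxAttempts = 10;            // arbitrary cap for the sketch
        for (int attempt = 1; ; attempt++) {
          try {
            work.run(conn);                    // e.g. the intrinsiclink INSERT
            conn.commit();
            return;
          } catch (SQLException e) {
            conn.rollback();
            boolean retryable = "40001".equals(e.getSQLState());
            if (!retryable || attempt >= maxAttempts) {
              throw e;                         // not a serialization failure, or gave up
            }
            // otherwise loop and re-run the whole transaction
          }
        }
      }
    }

The key point is that the entire transaction has to be re-run, not just the failing statement, because a serialization conflict invalidates everything the transaction did.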
One thing worth noting is that you have hop counting enabled, which is fine for small crawls but slows things down a lot (and can cause stalls when there are lots of records whose hopcount needs to be updated). Do you truly need hop counting?

The thread dump will tell us a lot, as will the simple history. When was the last time something happened in the simple history?

Karl


On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:

> More info... maybe we don't have postgres configured correctly. Lots of
> errors in the stdout log. For example:
>
> STATEMENT: INSERT INTO intrinsiclink
> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
> ERROR: could not serialize access due to read/write dependencies among
> transactions
> DETAIL: Reason code: Canceled on identification as a pivot, during
> conflict in checking.
> HINT: The transaction might succeed if retried.
>
> ...and on other tables as well.
>
> Mark
>
>
> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>
>> Thanks Karl, we may take you up on the offer when/if we reproduce with
>> just a single crawl. We were running many at once. Can you describe or
>> point me at instructions for the thread dump you'd like to see?
>>
>> We're using 1.4.1.
>>
>> The simple history looks clean. All 200s and OKs, with a few broken
>> pipes, but those documents all seem to have been successfully fetched
>> later. No rejects.
>>
>> Thanks again,
>>
>> Mark
>>
>>
>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Mark,
>>>
>>> The robots parse error is informational only and does not otherwise
>>> affect crawling, so you will need to look elsewhere for the issue.
>>>
>>> First question: what version of MCF are you using? For a time, trunk
>>> (and the release 1.5 branch) had exactly this problem whenever
>>> connections were used that included certificates.
>>>
>>> I suggest that you rule out blocked sites by looking at the simple
>>> history. If you see a lot of rejections then maybe you are being
>>> blocked. If, on the other hand, not much has happened at all for a
>>> while, that's not the answer.
>>>
>>> The fastest way to start diagnosing this problem is to get a thread
>>> dump. I'd be happy to look at it and let you know what I find.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>
>>>> I kicked off a bunch of web crawls on Friday to run over the weekend.
>>>> They all started fine but didn't finish. No errors in the logs that I
>>>> can find. All action seemed to stop after a couple of hours. It's
>>>> configured as a complete crawl that runs every 24 hours.
>>>>
>>>> I don't expect you to have an answer to what went wrong with such
>>>> limited information, but I did see a problem with robots.txt (at the
>>>> bottom of this email).
>>>>
>>>> Does it mean robots.txt was not used at all for the crawl, or just
>>>> that part was ignored? (I kind of expected this kind of error to kill
>>>> the crawl, but maybe I just don't understand it.)
>>>>
>>>> If the crawl were ignoring robots.txt, or a part of it, and the
>>>> crawled site banned my crawler, what would I see in the MCF logs?
>>>>
>>>> Thanks,
>>>>
>>>> Mark
>>>>
>>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <http://www.somesite.gov/sitemapindex.xml>'
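(A quick note on taking the thread dump Karl asks for: on a typical install you can use the standard JDK tools against the MCF agents JVM. The exact process name that jps reports varies by install, so treat the lookup step as an assumption:

    jps -l                          # find the PID of the ManifoldCF agents JVM
    jstack <pid> > threaddump.txt   # write a full thread dump to a file

Sending SIGQUIT instead, i.e. "kill -3 <pid>", makes the JVM print the same dump to its stdout log.)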
