Karl, looks like the hop filter setting has fixed our problem, though
there's still a bit more testing to do. Thanks so much for the help.

Mark



On Mon, Feb 10, 2014 at 1:12 PM, Mark Libucha <[email protected]> wrote:

> Apologies for the RTFM stumble after being pointed to it. Thought I did
> read it. Apparently not very carefully. I understand it now.
>
> Thanks.
>
>
> On Mon, Feb 10, 2014 at 12:55 PM, Karl Wright <[email protected]> wrote:
>
>> Read the documentation.  Unless you select "Keep unreachable documents
>> forever", MCF will keep track of hop count info.
>>
>> Karl
>>
>>
>> On Mon, Feb 10, 2014 at 3:17 PM, Mark Libucha <[email protected]> wrote:
>>
>>> So, I carefully checked all of our jobs, and *none* have hop filters
>>> turned on (the text boxes are blank for all jobs).
>>>
>>> Still seeing lots of these:
>>>
>>> STATEMENT:  INSERT INTO hopdeletedeps
>>> (parentidhash,ownerid,jobid,childidhash,linktype) VALUES ($1,$2,$3,$4,$5)
>>>  ERROR:  could not serialize access due to read/write dependencies
>>> among transactions
>>>
>>>
>>>
>>> On Mon, Feb 10, 2014 at 12:01 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> Look here:
>>>> manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html,
>>>> and read the section on hop filters for the web connector.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Mon, Feb 10, 2014 at 2:55 PM, Mark Libucha <[email protected]> wrote:
>>>>
>>>>> We restarted ManifoldCF, so we'll have to reproduce the problem before
>>>>> we can get you more details.
>>>>>
>>>>> I don't understand the hopcount thing. How do you know, and where is
>>>>> it set? We're running with pretty much default settings.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> MCF retries those sorts of errors automatically.  It's possible
>>>>>> there's a place we missed, but let's pursue other avenues first.
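>>>>>>
>>>>>> (Purely for illustration, this is not MCF's actual code: the retry
>>>>>> pattern for PostgreSQL serialization failures, which carry SQLSTATE
>>>>>> 40001, looks roughly like the sketch below. insertWithRetry, doInsert
>>>>>> and MAX_RETRIES are made-up names, and autocommit is assumed off.)
>>>>>>
>>>>>> import java.sql.Connection;
>>>>>> import java.sql.SQLException;
>>>>>>
>>>>>> static final int MAX_RETRIES = 10;
>>>>>>
>>>>>> void insertWithRetry(Connection conn) throws SQLException {
>>>>>>   for (int attempt = 1; ; attempt++) {
>>>>>>     try {
>>>>>>       doInsert(conn);        // placeholder for the actual INSERT
>>>>>>       conn.commit();
>>>>>>       return;
>>>>>>     } catch (SQLException e) {
>>>>>>       conn.rollback();
>>>>>>       // 40001 = serialization_failure; the HINT says a retry may succeed
>>>>>>       if (!"40001".equals(e.getSQLState()) || attempt >= MAX_RETRIES)
>>>>>>         throw e;
>>>>>>     }
>>>>>>   }
>>>>>> }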
>>>>>>
>>>>>> One thing worth noting is that you have hop counting enabled, which is
>>>>>> fine for small crawls but slows things down a lot (and can cause stalls
>>>>>> when there are lots of records whose hopcount needs to be updated).  Do
>>>>>> you truly need link counting?
>>>>>>
>>>>>> The thread dump will tell us a lot, as will the simple history.  When
>>>>>> was the last time something happened in the simple history?
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>>>>>>
>>>>>>> More info... maybe we don't have Postgres configured correctly. Lots
>>>>>>> of errors in the stdout log. For example:
>>>>>>>
>>>>>>> STATEMENT:  INSERT INTO intrinsiclink
>>>>>>> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>>>>>>> ERROR:  could not serialize access due to read/write dependencies
>>>>>>> among transactions
>>>>>>> DETAIL:  Reason code: Canceled on identification as a pivot, during
>>>>>>> conflict in checking.
>>>>>>> HINT:  The transaction might succeed if retried.
>>>>>>>
>>>>>>> and on other tables as well.
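>>>>>>>
>>>>>>> As far as I can tell, this particular error only shows up when
>>>>>>> PostgreSQL runs transactions at the SERIALIZABLE isolation level, and
>>>>>>> the HINT means it is expected to clear on retry rather than pointing
>>>>>>> at a misconfiguration. One way to see what the server defaults to,
>>>>>>> from psql:
>>>>>>>
>>>>>>> SHOW default_transaction_isolation;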
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha 
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Karl, we may take you up on the offer when/if we can reproduce
>>>>>>>> it with just a single crawl. We were running many at once. Can you
>>>>>>>> describe, or point me at instructions for, the thread dump you'd like
>>>>>>>> to see?
>>>>>>>>
>>>>>>>> We're using 1.4.1.
>>>>>>>>
>>>>>>>> The simple history looks clean. All 200s and OKs, with a few broken
>>>>>>>> pipes, but those documents all seem to have been successfully fetched
>>>>>>>> later. No rejects.
>>>>>>>>
>>>>>>>> Thanks again,
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright 
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Mark,
>>>>>>>>>
>>>>>>>>> The robots parse error is informational only and does not
>>>>>>>>> otherwise affect crawling.  So you will need to look elsewhere for the
>>>>>>>>> issue.
>>>>>>>>>
>>>>>>>>> First question: what version of MCF are you using?  For a time,
>>>>>>>>> trunk (and the release 1.5 branch) had exactly this problem whenever
>>>>>>>>> connections were used that included certificates.
>>>>>>>>>
>>>>>>>>> I suggest that you rule out blocked sites by looking at the simple
>>>>>>>>> history.  If you see a lot of rejections then maybe you are being
>>>>>>>>> blocked.  If, on the other hand, not much has happened at all for a
>>>>>>>>> while, that's not the answer.
>>>>>>>>>
>>>>>>>>> The fastest way to start diagnosing this problem is to get a
>>>>>>>>> thread dump.  I'd be happy to look at it and let you know what I find.
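>>>>>>>>>
>>>>>>>>> One common way to capture one, assuming a standard Oracle or OpenJDK
>>>>>>>>> JVM with the JDK's jstack tool on the path, is something like:
>>>>>>>>>
>>>>>>>>> jstack -l <pid of the crawler JVM> > threaddump.txt
>>>>>>>>>
>>>>>>>>> On Unix-like systems, "kill -3 <pid>" also works; it writes the dump
>>>>>>>>> to the process's standard output.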
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha 
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I kicked off a bunch of web crawls on Friday to run over the
>>>>>>>>>> weekend. They all started fine but didn't finish. There are no
>>>>>>>>>> errors in the logs that I can find. All action seemed to stop after
>>>>>>>>>> a couple of hours. Each is configured as a complete crawl that runs
>>>>>>>>>> every 24 hours.
>>>>>>>>>>
>>>>>>>>>> I don't expect you to have an answer to what went wrong with such
>>>>>>>>>> limited information, but I did see a problem with robots.txt (at the
>>>>>>>>>> bottom of this email).
>>>>>>>>>>
>>>>>>>>>> Does it mean robots.txt was not used at all for the crawl, or just
>>>>>>>>>> that part was ignored? (I kind of expected this kind of error to
>>>>>>>>>> kill the crawl, but maybe I just don't understand it.)
>>>>>>>>>>
>>>>>>>>>> If the crawl were ignoring the robots.txt, or a part of it, and
>>>>>>>>>> the crawled site banned my crawler, what would I see in the MCF logs?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Mark
>>>>>>>>>>
>>>>>>>>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>>>>>>>> ERRORS 01 Unknown robots.txt line: 'Sitemap:
>>>>>>>>>> <http://www.somesite.gov/sitemapindex.xml>'
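>>>>>>>>>>
>>>>>>>>>> For reference, a hypothetical robots.txt with the kind of line being
>>>>>>>>>> flagged would look something like this (the Sitemap directive is a
>>>>>>>>>> common extension to the original robots.txt format, which is
>>>>>>>>>> presumably why it is reported as an unknown line):
>>>>>>>>>>
>>>>>>>>>> User-agent: *
>>>>>>>>>> Disallow: /private/
>>>>>>>>>> Sitemap: http://www.somesite.gov/sitemapindex.xml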
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
