Apologies for the RTFM stumble after being pointed to it. Thought I did read it. Apparently not very carefully. I understand it now.
Thanks.

On Mon, Feb 10, 2014 at 12:55 PM, Karl Wright <[email protected]> wrote:

> Read the documentation. Unless you select "Keep unreachable documents
> forever", MCF will keep track of hop count info.
>
> Karl
>
>
> On Mon, Feb 10, 2014 at 3:17 PM, Mark Libucha <[email protected]> wrote:
>
>> So, I carefully checked all of our jobs, and *none* have hop filters
>> turned on (the text boxes are blank for all jobs).
>>
>> Still seeing lots of these:
>>
>> STATEMENT: INSERT INTO hopdeletedeps
>> (parentidhash,ownerid,jobid,childidhash,linktype) VALUES ($1,$2,$3,$4,$5)
>> ERROR: could not serialize access due to read/write dependencies among
>> transactions
>>
>>
>> On Mon, Feb 10, 2014 at 12:01 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Mark,
>>>
>>> Look here:
>>> manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html,
>>> and read the section on hop filters for the web connector.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Feb 10, 2014 at 2:55 PM, Mark Libucha <[email protected]> wrote:
>>>
>>>> We restarted manifold, so we'll have to reproduce before we can get you
>>>> more details.
>>>>
>>>> I don't understand the hopcount thing. How do you know, and where is it
>>>> set? We're running with default settings, pretty much.
>>>>
>>>> Thanks,
>>>>
>>>> Mark
>>>>
>>>>
>>>> On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> MCF retries those sorts of errors automatically. It's possible
>>>>> there's a place we missed, but let's pursue other avenues first.
>>>>>
>>>>> One thing worth noting is that you have hop counting enabled, which is
>>>>> fine for small crawls but slows things down a lot (and can cause stalls
>>>>> when there are lots of records whose hopcount needs to be updated). Do
>>>>> you truly need link counting?
>>>>>
>>>>> The thread dump will tell us a lot, as will the simple history. When
>>>>> was the last time something happened in the simple history?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>>>>>
>>>>>> More info... maybe we don't have postgres configured correctly. Lots
>>>>>> of errors in the stdout log. For example:
>>>>>>
>>>>>> STATEMENT: INSERT INTO intrinsiclink
>>>>>> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>>>>>> ERROR: could not serialize access due to read/write dependencies
>>>>>> among transactions
>>>>>> DETAIL: Reason code: Canceled on identification as a pivot, during
>>>>>> conflict in checking.
>>>>>> HINT: The transaction might succeed if retried.
>>>>>>
>>>>>> and on other tables as well.
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Karl, we may take you up on the offer when/if we reproduce
>>>>>>> with just a single crawl. We were running many at once. Can you
>>>>>>> describe, or point me at, instructions for the thread dump you'd
>>>>>>> like to see?
>>>>>>>
>>>>>>> We're using 1.4.1.
>>>>>>>
>>>>>>> The simple history looks clean. All 200s and OKs, with a few broken
>>>>>>> pipes, but those documents all seem to have been successfully fetched
>>>>>>> later. No rejects.
>>>>>>>
>>>>>>> Thanks again,
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Mark,
>>>>>>>>
>>>>>>>> The robots parse error is informational only and does not otherwise
>>>>>>>> affect crawling, so you will need to look elsewhere for the issue.
>>>>>>>>
>>>>>>>> First question: what version of MCF are you using? For a time,
>>>>>>>> trunk (and the release 1.5 branch) had exactly this problem whenever
>>>>>>>> connections were used that included certificates.
>>>>>>>>
>>>>>>>> I suggest that you rule out blocked sites by looking at the simple
>>>>>>>> history. If you see a lot of rejections, then maybe you are being
>>>>>>>> blocked.
>>>>>>>> If, on the other hand, not much has happened at all for a while,
>>>>>>>> that's not the answer.
>>>>>>>>
>>>>>>>> The fastest way to start diagnosing this problem is to get a thread
>>>>>>>> dump. I'd be happy to look at it and let you know what I find.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I kicked off a bunch of web crawls on Friday to run over the
>>>>>>>>> weekend. They all started fine but didn't finish. No errors in the
>>>>>>>>> logs that I can find. All action seemed to stop after a couple of
>>>>>>>>> hours. It's configured as a complete crawl that runs every 24 hours.
>>>>>>>>>
>>>>>>>>> I don't expect you to have an answer to what went wrong with such
>>>>>>>>> limited information, but I did see a problem with robots.txt (at
>>>>>>>>> the bottom of this email).
>>>>>>>>>
>>>>>>>>> Does it mean robots.txt was not used at all for the crawl, or just
>>>>>>>>> that part was ignored? (I kind of expected this kind of error to
>>>>>>>>> kill the crawl, but maybe I just don't understand it.)
>>>>>>>>>
>>>>>>>>> If the crawl were ignoring robots.txt, or part of it, and the
>>>>>>>>> crawled site banned my crawler, what would I see in the MCF logs?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>>>>>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <
>>>>>>>>> http://www.somesite.gov/sitemapindex.xml>'
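The "Unknown robots.txt line" warning at the bottom of the thread comes from a parser hitting a directive it doesn't recognize (here, a `Sitemap:` line with an angle-bracketed URL). As Karl says, such lines are informational only: a lenient parser reports the unknown line, skips it, and still applies the rest of the file. A minimal sketch of that behavior, purely illustrative and not MCF's actual parser:

```python
def parse_robots(text):
    """Lenient robots.txt parsing: collect known rules, skip-and-report
    unknown directives instead of failing the whole file."""
    rules = []    # (user-agent, directive, value) triples
    unknown = []  # lines the parser didn't recognize, e.g. "Sitemap: <...>"
    agent = "*"
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()
        value = value.strip()
        if field == "user-agent":
            agent = value
        elif field in ("allow", "disallow", "crawl-delay"):
            rules.append((agent, field, value))
        else:
            # Unknown directive: record it (as MCF logs it) but keep going.
            unknown.append(line)
    return rules, unknown

robots = """User-agent: *
Disallow: /private/
Sitemap: <http://www.somesite.gov/sitemapindex.xml>
"""
rules, unknown = parse_robots(robots)
```

With the input above, the `Disallow` rule is still honored while the `Sitemap:` line ends up in `unknown`, which matches the behavior described in the thread: the warning is logged but crawling is otherwise unaffected.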

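The "could not serialize access" errors quoted in this thread are PostgreSQL serialization failures (SQLSTATE 40001), raised under SERIALIZABLE isolation when concurrent transactions conflict. The HINT is literal: the same transaction usually succeeds if simply retried, which is why MCF retries them automatically. A stand-alone sketch of that retry pattern, with hypothetical names (real drivers surface SQLSTATE 40001 via their own exception types):

```python
class SerializationFailure(Exception):
    """Stands in for a driver error carrying SQLSTATE 40001."""
    sqlstate = "40001"

def run_serializable(txn, max_attempts=5):
    """Run `txn` (a callable performing one transaction), retrying
    when the database cancels it with a serialization failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # In real code, sleep with backoff before retrying.

# Simulated transaction that fails twice before committing, as can
# happen with concurrent INSERTs into tables like intrinsiclink or
# hopdeletedeps under SERIALIZABLE isolation.
attempts = []
def insert_link():
    attempts.append(1)
    if len(attempts) < 3:
        raise SerializationFailure("could not serialize access")
    return "committed"

print(run_serializable(insert_link))  # -> committed
```

The key design point is that the retry must re-run the whole transaction from the top, not just the failing statement, since the database has rolled the transaction back when it raises the error.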