Hi Mark,

Look here:
manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html, and
read the section on hop filters for the web connector.

Karl



On Mon, Feb 10, 2014 at 2:55 PM, Mark Libucha <[email protected]> wrote:

> We restarted ManifoldCF, so we'll have to reproduce before we can get you
> more details.
>
> I don't understand the hopcount thing. How do you know, and where is it
> set? We're running with pretty much default settings.
>
> Thanks,
>
> Mark
>
>
> On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Mark,
>>
>> MCF retries those sorts of errors automatically.  It's possible there's a
>> place we missed, but let's pursue other avenues first.
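>>
>> (For context, those errors are PostgreSQL serialization failures,
>> SQLSTATE 40001, and the standard client-side response is to roll the
>> transaction back and retry it.  Here is a minimal, self-contained JDBC
>> sketch of that pattern; the connection details are placeholders, and
>> this is not MCF's actual code, which handles retries inside its
>> database layer:
>>
>>   import java.sql.*;
>>
>>   public class SerializationRetryDemo {
>>     public static void main(String[] args) throws Exception {
>>       // Placeholder connection details; adjust for your installation.
>>       try (Connection conn = DriverManager.getConnection(
>>           "jdbc:postgresql://localhost:5432/dbname", "user", "password")) {
>>         conn.setAutoCommit(false);
>>         conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
>>         while (true) {
>>           try {
>>             // ... perform the transactional inserts/updates here ...
>>             conn.commit();
>>             break;  // committed successfully
>>           } catch (SQLException e) {
>>             conn.rollback();
>>             // 40001 is PostgreSQL's SQLSTATE for serialization failures.
>>             if (!"40001".equals(e.getSQLState()))
>>               throw e;  // some other error; don't retry
>>             // Otherwise loop and retry, optionally after a short backoff.
>>           }
>>         }
>>       }
>>     }
>>   }
>>
>> As the HINT in those log lines says, such a transaction is expected to
>> succeed when retried.)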
>>
>> One thing worth noting is that you have hop counting enabled, which is
>> fine for small crawls but slows things down a lot (and can cause stalls
>> when there are lots of records whose hopcount needs to be updated).  Do you
>> truly need hop counting?
>>
>> The thread dump will tell us a lot, as will the simple history.  When was
>> the last time something happened in the simple history?
>>
>> Karl
>>
>>
>>
>> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>>
>>> More info... maybe we don't have Postgres configured correctly. Lots of
>>> errors in the stdout log. For example:
>>>
>>> STATEMENT:  INSERT INTO intrinsiclink
>>> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>>> ERROR:  could not serialize access due to read/write dependencies among
>>> transactions
>>> DETAIL:  Reason code: Canceled on identification as a pivot, during
>>> conflict in checking.
>>> HINT:  The transaction might succeed if retried.
>>>
>>> and on other tables as well.
>>>
>>> Mark
>>>
>>>
>>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>>>
>>>> Thanks, Karl; we may take you up on the offer when/if we reproduce with
>>>> just a single crawl. We were running many at once. Can you describe or
>>>> point me at instructions for the thread dump you'd like to see?
>>>>
>>>> We're using 1.4.1.
>>>>
>>>> The simple history looks clean. All 200s and OKs, with a few broken
>>>> pipes, but those documents all seem to have been successfully fetched later.
>>>> No rejects.
>>>>
>>>> Thanks again,
>>>>
>>>> Mark
>>>>
>>>>
>>>>
>>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> The robots parse error is informational only and does not otherwise
>>>>> affect crawling.  So you will need to look elsewhere for the issue.
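>>>>>
>>>>> (Background, in case it helps: the 'Sitemap:' directive is a later
>>>>> extension to the original robots.txt format, so a strict parser
>>>>> flags it as an unknown line.  A robots.txt that uses it typically
>>>>> looks something like this -- the paths here are made up for
>>>>> illustration:
>>>>>
>>>>>   User-agent: *
>>>>>   Disallow: /private/
>>>>>   Sitemap: http://www.somesite.gov/sitemapindex.xml
>>>>>
>>>>> Only the unrecognized line is skipped; the rest of the file is
>>>>> still honored.)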
>>>>>
>>>>> First question: what version of MCF are you using?  For a time, trunk
>>>>> (and the release 1.5 branch) had exactly this problem whenever
>>>>> connections that included certificates were used.
>>>>>
>>>>> I suggest that you rule out blocked sites by looking at the simple
>>>>> history.  If you see a lot of rejections, then maybe you are being
>>>>> blocked.  If, on the other hand, not much has happened at all for a
>>>>> while, that's not the answer.
>>>>>
>>>>> The fastest way to start diagnosing this problem is to get a thread
>>>>> dump.  I'd be happy to look at it and let you know what I find.
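>>>>>
>>>>> (On any machine with a JDK installed, "jstack <pid>" will print one,
>>>>> where <pid> is the process id of the JVM you want to inspect, e.g.
>>>>> the MCF agents process; redirect the output to a file.  Sending the
>>>>> JVM a SIGQUIT with "kill -3 <pid>" also works, and writes the dump
>>>>> to the process's standard output.)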
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>>>
>>>>>> I kicked off a bunch of web crawls on Friday to run over the weekend.
>>>>>> They all started fine but didn't finish. No errors in the logs that I
>>>>>> can find. All action seemed to stop after a couple of hours. Each is
>>>>>> configured as a complete crawl that runs every 24 hours.
>>>>>>
>>>>>> I don't expect you to have an answer to what went wrong with such
>>>>>> limited information, but I did see a problem with robots.txt (at the
>>>>>> bottom of this email).
>>>>>>
>>>>>> Does it mean robots.txt was not used at all for the crawl, or just
>>>>>> that the offending line was ignored? (I kind of expected this kind of
>>>>>> error to kill the crawl, but maybe I just don't understand it.)
>>>>>>
>>>>>> If the crawl were ignoring the robots.txt, or a part of it, and the
>>>>>> crawled site banned my crawler, what would I see in the MCF logs?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <
>>>>>> http://www.somesite.gov/sitemapindex.xml>'
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
