We restarted ManifoldCF, so we'll have to reproduce the problem before we can get you more details.
I don't understand the hopcount thing. How do you know it's enabled, and where is it set? We're running with pretty much default settings.

Thanks,
Mark

On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:

> Hi Mark,
>
> MCF retries those sorts of errors automatically. It's possible there's a
> place we missed, but let's pursue other avenues first.
>
> One thing worth noting is that you have hop counting enabled, which is
> fine for small crawls but slows things down a lot (and can cause stalls
> when there are lots of records whose hopcount needs to be updated). Do you
> truly need link counting?
>
> The thread dump will tell us a lot, as will the simple history. When was
> the last time something happened in the simple history?
>
> Karl
>
> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>
>> More info... maybe we don't have Postgres configured correctly. Lots of
>> errors in the stdout log. For example:
>>
>> STATEMENT: INSERT INTO intrinsiclink
>> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>> ERROR: could not serialize access due to read/write dependencies among
>> transactions
>> DETAIL: Reason code: Canceled on identification as a pivot, during
>> conflict in checking.
>> HINT: The transaction might succeed if retried.
>>
>> ...and on other tables as well.
>>
>> Mark
>>
>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>>
>>> Thanks Karl, we may take you up on the offer when/if we reproduce with
>>> just a single crawl. We were running many at once. Can you describe or
>>> point me at instructions for the thread dump you'd like to see?
>>>
>>> We're using 1.4.1.
>>>
>>> The simple history looks clean. All 200s and OKs, with a few broken
>>> pipes, but those documents all seem to have been successfully fetched
>>> later. No rejects.
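A note on that Postgres error: the HINT line is the important part. Under SERIALIZABLE isolation, "could not serialize access" (SQLSTATE 40001) is an expected, recoverable failure, and the standard response is to retry the whole transaction, which is what "MCF retries those sorts of errors automatically" refers to. Here is a minimal sketch of that retry pattern in Python; the names (`SerializationFailure`, `run_with_retry`, `insert_intrinsiclink`) are hypothetical stand-ins, not ManifoldCF's actual Java code:

```python
import random
import time

class SerializationFailure(Exception):
    """Stand-in for PostgreSQL SQLSTATE 40001 ('could not serialize access')."""

def run_with_retry(txn, max_attempts=5, base_delay=0.05):
    """Run a transaction callable, retrying on serialization failures.

    Postgres itself suggests this via 'HINT: The transaction might succeed
    if retried.' Exponential backoff with jitter keeps colliding
    transactions from repeatedly cancelling one another.
    """
    for attempt in range(max_attempts):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Simulated transaction that fails twice before committing, mimicking the
# INSERT INTO intrinsiclink statement that was being cancelled.
attempts = {"n": 0}

def insert_intrinsiclink():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure()
    return "committed"
```

So the errors in the stdout log are noisy but not necessarily fatal on their own; the question is whether a retry path was missed somewhere, as Karl notes.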
>>>
>>> Thanks again,
>>>
>>> Mark
>>>
>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>>
>>>> Hi Mark,
>>>>
>>>> The robots parse error is informational only and does not otherwise
>>>> affect crawling. So you will need to look elsewhere for the issue.
>>>>
>>>> First question: what version of MCF are you using? For a time, trunk
>>>> (and the release 1.5 branch) had exactly this problem whenever
>>>> connections were used that included certificates.
>>>>
>>>> I suggest that you rule out blocked sites by looking at the simple
>>>> history. If you see a lot of rejections, then maybe you are being
>>>> blocked. If, on the other hand, not much has happened at all for a
>>>> while, that's not the answer.
>>>>
>>>> The fastest way to start diagnosing this problem is to get a thread
>>>> dump. I'd be happy to look at it and let you know what I find.
>>>>
>>>> Karl
>>>>
>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>>
>>>>> I kicked off a bunch of web crawls on Friday to run over the weekend.
>>>>> They all started fine but didn't finish. There are no errors in the
>>>>> logs that I can find; all activity seemed to stop after a couple of
>>>>> hours. Each job is configured as a complete crawl that runs every 24
>>>>> hours.
>>>>>
>>>>> I don't expect you to have an answer to what went wrong with such
>>>>> limited information, but I did see a problem with robots.txt (at the
>>>>> bottom of this email).
>>>>>
>>>>> Does it mean robots.txt was not used at all for the crawl, or just
>>>>> that that one line was ignored? (I kind of expected this kind of error
>>>>> to kill the crawl, but maybe I just don't understand it.)
>>>>>
>>>>> If the crawl were ignoring robots.txt, or part of it, and the
>>>>> crawled site banned my crawler, what would I see in the MCF logs?
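Mark's question about how to take the thread dump never gets answered in the thread. For a JVM process such as ManifoldCF's agents process, the standard approaches are `jstack` (ships with the JDK) or sending SIGQUIT. The process-name pattern below is an assumption; adjust it to match however your start script launches ManifoldCF:

```shell
# Find the PID of the ManifoldCF agents JVM (the grep pattern is an
# assumption; match it to your own start script).
jps -l | grep -i manifold

# Write a full thread dump to a file with jstack:
jstack <pid> > manifoldcf-threads.txt

# Alternative without a full JDK: SIGQUIT makes the JVM print a thread
# dump to its own stdout/stderr log without killing the process.
kill -3 <pid>
```

A dump taken while the crawl is stalled shows what every worker thread is blocked on, which is exactly what Karl is offering to read.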
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark
>>>>>
>>>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <
>>>>> http://www.somesite.gov/sitemapindex.xml>'
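Karl's point that the error is informational matches how a lenient robots.txt parser behaves: a directive it does not recognize (here a `Sitemap:` line, with stray angle brackets around the URL) is logged and skipped, while the `Allow`/`Disallow` rules in the rest of the file are still honored. The sketch below is a simplified illustration of that behavior, not ManifoldCF's actual parser (which is Java):

```python
def parse_robots(text):
    """Parse a robots.txt body, collecting known directives and
    skipping unknown lines the way a lenient crawler does."""
    known = ("user-agent", "allow", "disallow", "crawl-delay")
    rules, unknown = [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() in known:
            rules.append((field.lower(), value))
        else:
            # e.g. "Sitemap: <http://...>": reported as an "Unknown
            # robots.txt line" error, but the rest still applies.
            unknown.append(line)
    return rules, unknown
```

So the logged error means only that one line was ignored; the crawl was still obeying the recognized rules, and a ban by the target site would instead show up in the simple history as rejected or failed fetches.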
