Hi Mark,

MCF retries those sorts of errors automatically. It's possible there's a place we missed, but let's pursue other avenues first.
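(For anyone hitting the same PostgreSQL errors: the usual handling is simply to re-run the whole transaction when the database reports SQLSTATE 40001. Below is a minimal sketch of that pattern, assuming a JDBC connection with auto-commit off; it is illustrative only and not MCF's actual code.)

    import java.sql.Connection;
    import java.sql.SQLException;

    // Illustrative sketch, not MCF's retry code: retry on PostgreSQL
    // serialization failures (SQLSTATE 40001), which the HINT in the log
    // says are safe to retry.
    public class SerializationRetry {

      public interface SqlWork {
        void run(Connection conn) throws SQLException;
      }

      public static void runWithRetry(Connection conn, SqlWork work) throws SQLException {
        final int maxAttempts = 10;            // arbitrary cap for the sketch
        for (int attempt = 1; ; attempt++) {
          try {
            work.run(conn);                    // e.g. the intrinsiclink INSERT
            conn.commit();
            return;
          } catch (SQLException e) {
            conn.rollback();
            boolean retryable = "40001".equals(e.getSQLState());
            if (!retryable || attempt >= maxAttempts) {
              throw e;                         // not a serialization failure, or gave up
            }
            // otherwise loop and re-run the whole transaction
          }
        }
      }
    }

The key point is that the entire transaction has to be re-run, not just the failing statement, because a serialization conflict invalidates everything the transaction did.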
One thing worth noting is that you have hop counting enabled, which is fine for small crawls but slows things down a lot (and can cause stalls when there are lots of records whose hopcount needs to be updated). Do you truly need hop counting?

The thread dump will tell us a lot, as will the simple history. When was the last time something happened in the simple history?

Karl


On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:

> More info... maybe we don't have postgres configured correctly. Lots of
> errors in the stdout log. For example:
>
> STATEMENT: INSERT INTO intrinsiclink
> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
> ERROR: could not serialize access due to read/write dependencies among
> transactions
> DETAIL: Reason code: Canceled on identification as a pivot, during
> conflict in checking.
> HINT: The transaction might succeed if retried.
>
> ...and on other tables as well.
>
> Mark
>
>
> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>
>> Thanks Karl, we may take you up on the offer when/if we reproduce with
>> just a single crawl. We were running many at once. Can you describe or
>> point me at instructions for the thread dump you'd like to see?
>>
>> We're using 1.4.1.
>>
>> The simple history looks clean. All 200s and OKs, with a few broken
>> pipes, but those documents all seem to have been successfully fetched
>> later. No rejects.
>>
>> Thanks again,
>>
>> Mark
>>
>>
>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Mark,
>>>
>>> The robots parse error is informational only and does not otherwise
>>> affect crawling, so you will need to look elsewhere for the issue.
>>>
>>> First question: what version of MCF are you using? For a time, trunk
>>> (and the release 1.5 branch) had exactly this problem whenever
>>> connections were used that included certificates.
>>>
>>> I suggest that you rule out blocked sites by looking at the simple
>>> history. If you see a lot of rejections then maybe you are being
>>> blocked. If, on the other hand, not much has happened at all for a
>>> while, that's not the answer.
>>>
>>> The fastest way to start diagnosing this problem is to get a thread
>>> dump. I'd be happy to look at it and let you know what I find.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>
>>>> I kicked off a bunch of web crawls on Friday to run over the weekend.
>>>> They all started fine but didn't finish. No errors in the logs that I
>>>> can find. All action seemed to stop after a couple of hours. It's
>>>> configured as a complete crawl that runs every 24 hours.
>>>>
>>>> I don't expect you to have an answer to what went wrong with such
>>>> limited information, but I did see a problem with robots.txt (at the
>>>> bottom of this email).
>>>>
>>>> Does it mean robots.txt was not used at all for the crawl, or just
>>>> that part was ignored? (I kind of expected this kind of error to kill
>>>> the crawl, but maybe I just don't understand it.)
>>>>
>>>> If the crawl were ignoring robots.txt, or a part of it, and the
>>>> crawled site banned my crawler, what would I see in the MCF logs?
>>>>
>>>> Thanks,
>>>>
>>>> Mark
>>>>
>>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <http://www.somesite.gov/sitemapindex.xml>'
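(A quick note on taking the thread dump Karl asks for: on a typical install you can use the standard JDK tools against the MCF agents JVM. The exact process name that jps reports varies by install, so treat the lookup step as an assumption:

    jps -l                          # find the PID of the ManifoldCF agents JVM
    jstack <pid> > threaddump.txt   # write a full thread dump to a file

Sending SIGQUIT instead, i.e. "kill -3 <pid>", makes the JVM print the same dump to its stdout log.)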
