Apologies for the RTFM stumble after being pointed to it. Thought I did read it. Apparently not very carefully. I understand it now.
Thanks.

On Mon, Feb 10, 2014 at 12:55 PM, Karl Wright <[email protected]> wrote:

> Read the documentation. Unless you select "Keep unreachable documents
> forever", MCF will keep track of hop count info.
>
> Karl
>
>
> On Mon, Feb 10, 2014 at 3:17 PM, Mark Libucha <[email protected]> wrote:
>
>> So, I carefully checked all of our jobs, and *none* have hop filters
>> turned on (the text boxes are blank for all jobs).
>>
>> Still seeing lots of these:
>>
>> STATEMENT: INSERT INTO hopdeletedeps
>> (parentidhash,ownerid,jobid,childidhash,linktype) VALUES ($1,$2,$3,$4,$5)
>> ERROR: could not serialize access due to read/write dependencies among
>> transactions
>>
>>
>> On Mon, Feb 10, 2014 at 12:01 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Mark,
>>>
>>> Look here:
>>> manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html,
>>> and read the section on hop filters for the web connector.
>>>
>>> Karl
>>>
>>>
>>> On Mon, Feb 10, 2014 at 2:55 PM, Mark Libucha <[email protected]> wrote:
>>>
>>>> We restarted manifold, so we'll have to reproduce before we can get you
>>>> more details.
>>>>
>>>> I don't understand the hopcount thing. How do you know, and where is it
>>>> set? We're running with default settings, pretty much.
>>>>
>>>> Thanks,
>>>>
>>>> Mark
>>>>
>>>>
>>>> On Mon, Feb 10, 2014 at 11:39 AM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Mark,
>>>>>
>>>>> MCF retries those sorts of errors automatically. It's possible
>>>>> there's a place we missed, but let's pursue other avenues first.
>>>>>
>>>>> One thing worth noting is that you have hop counting enabled, which is
>>>>> fine for small crawls but slows things down a lot (and can cause stalls
>>>>> when there are lots of records whose hopcount needs to be updated). Do
>>>>> you truly need link counting?
>>>>>
>>>>> The thread dump will tell us a lot, as will the simple history. When
>>>>> was the last time something happened in the simple history?
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> On Mon, Feb 10, 2014 at 2:22 PM, Mark Libucha <[email protected]> wrote:
>>>>>
>>>>>> More info... maybe we don't have postgres configured correctly. Lots
>>>>>> of errors in the stdout log. For example:
>>>>>>
>>>>>> STATEMENT: INSERT INTO intrinsiclink
>>>>>> (parentidhash,isnew,jobid,linktype,childidhash) VALUES ($1,$2,$3,$4,$5)
>>>>>> ERROR: could not serialize access due to read/write dependencies
>>>>>> among transactions
>>>>>> DETAIL: Reason code: Canceled on identification as a pivot, during
>>>>>> conflict in checking.
>>>>>> HINT: The transaction might succeed if retried.
>>>>>>
>>>>>> and on other tables as well.
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 10, 2014 at 11:18 AM, Mark Libucha <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Karl, we may take you up on the offer when/if we reproduce
>>>>>>> with just a single crawl. We were running many at once. Can you
>>>>>>> describe, or point me at, instructions for the thread dump you'd
>>>>>>> like to see?
>>>>>>>
>>>>>>> We're using 1.4.1.
>>>>>>>
>>>>>>> The simple history looks clean. All 200s and OKs, with a few broken
>>>>>>> pipes, but those documents all seem to have been successfully fetched
>>>>>>> later. No rejects.
>>>>>>>
>>>>>>> Thanks again,
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 10, 2014 at 10:41 AM, Karl Wright <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Mark,
>>>>>>>>
>>>>>>>> The robots parse error is informational only and does not otherwise
>>>>>>>> affect crawling, so you will need to look elsewhere for the issue.
>>>>>>>>
>>>>>>>> First question: what version of MCF are you using? For a time,
>>>>>>>> trunk (and the release 1.5 branch) had exactly this problem whenever
>>>>>>>> connections were used that included certificates.
>>>>>>>>
>>>>>>>> I suggest that you rule out blocked sites by looking at the simple
>>>>>>>> history. If you see a lot of rejections, then maybe you are being
>>>>>>>> blocked.
>>>>>>>> If, on the other hand, not much has happened at all for a while,
>>>>>>>> that's not the answer.
>>>>>>>>
>>>>>>>> The fastest way to start diagnosing this problem is to get a thread
>>>>>>>> dump. I'd be happy to look at it and let you know what I find.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Feb 10, 2014 at 1:26 PM, Mark Libucha <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I kicked off a bunch of web crawls on Friday to run over the
>>>>>>>>> weekend. They all started fine but didn't finish. No errors in the
>>>>>>>>> logs that I can find. All action seemed to stop after a couple of
>>>>>>>>> hours. It's configured as a complete crawl that runs every 24 hours.
>>>>>>>>>
>>>>>>>>> I don't expect you to have an answer to what went wrong with such
>>>>>>>>> limited information, but I did see a problem with robots.txt (at
>>>>>>>>> the bottom of this email).
>>>>>>>>>
>>>>>>>>> Does it mean robots.txt was not used at all for the crawl, or just
>>>>>>>>> that part was ignored? (I kind of expected this kind of error to
>>>>>>>>> kill the crawl, but maybe I just don't understand it.)
>>>>>>>>>
>>>>>>>>> If the crawl were ignoring robots.txt, or part of it, and the
>>>>>>>>> crawled site banned my crawler, what would I see in the MCF logs?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Mark
>>>>>>>>>
>>>>>>>>> 02-09-2014 09:54:48.679 robots parse somesite.gov:80
>>>>>>>>> ERRORS 01 Unknown robots.txt line: 'Sitemap: <
>>>>>>>>> http://www.somesite.gov/sitemapindex.xml>'
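The "Unknown robots.txt line" warning at the bottom of the thread comes from a parser hitting a directive it doesn't recognize (here, a `Sitemap:` line with an angle-bracketed URL). As Karl says, such lines are informational only: a lenient parser reports the unknown line, skips it, and still applies the rest of the file. A minimal sketch of that behavior, purely illustrative and not MCF's actual parser:

```python
def parse_robots(text):
    """Lenient robots.txt parsing: collect known rules, skip-and-report
    unknown directives instead of failing the whole file."""
    rules = []    # (user-agent, directive, value) triples
    unknown = []  # lines the parser didn't recognize, e.g. "Sitemap: <...>"
    agent = "*"
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()
        value = value.strip()
        if field == "user-agent":
            agent = value
        elif field in ("allow", "disallow", "crawl-delay"):
            rules.append((agent, field, value))
        else:
            # Unknown directive: record it (as MCF logs it) but keep going.
            unknown.append(line)
    return rules, unknown

robots = """User-agent: *
Disallow: /private/
Sitemap: <http://www.somesite.gov/sitemapindex.xml>
"""
rules, unknown = parse_robots(robots)
```

With the input above, the `Disallow` rule is still honored while the `Sitemap:` line ends up in `unknown`, which matches the behavior described in the thread: the warning is logged but crawling is otherwise unaffected.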

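The "could not serialize access" errors quoted in this thread are PostgreSQL serialization failures (SQLSTATE 40001), raised under SERIALIZABLE isolation when concurrent transactions conflict. The HINT is literal: the same transaction usually succeeds if simply retried, which is why MCF retries them automatically. A stand-alone sketch of that retry pattern, with hypothetical names (real drivers surface SQLSTATE 40001 via their own exception types):

```python
class SerializationFailure(Exception):
    """Stands in for a driver error carrying SQLSTATE 40001."""
    sqlstate = "40001"

def run_serializable(txn, max_attempts=5):
    """Run `txn` (a callable performing one transaction), retrying
    when the database cancels it with a serialization failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # In real code, sleep with backoff before retrying.

# Simulated transaction that fails twice before committing, as can
# happen with concurrent INSERTs into tables like intrinsiclink or
# hopdeletedeps under SERIALIZABLE isolation.
attempts = []
def insert_link():
    attempts.append(1)
    if len(attempts) < 3:
        raise SerializationFailure("could not serialize access")
    return "committed"

print(run_serializable(insert_link))  # -> committed
```

The key design point is that the retry must re-run the whole transaction from the top, not just the failing statement, since the database has rolled the transaction back when it raises the error.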