> Another problem is that they have fetch_time well into the future,
> I guess because retry_interval is applied.
Correct. Fetch time is
- the time when to fetch next for a CrawlDatum in the CrawlDb
- the time when the fetch happened for those in the segment's
  crawl_fetch folder

On 03/09/2018 11:04 PM, Michael Coffey wrote:
> Thanks for the suggestion. On closer inspection, I see that
> redirection targets do show up in the crawldb.
> One problem is that the target urls all have scores equal to zero,
> because no other pages point to them. Another problem is that they
> have fetch_time well into the future, I guess because retry_interval
> is applied.
> Interestingly, the target urls do sometimes show up in a segment.
> When I dump the segment after attempted fetching, they show
> responseCode 301 (even for the redirection targets), nutchStatus 67,
> and empty content. I imagine this might be just the result of the
> fetcher noticing the redirection, and this is how it communicates
> with updatedb.
> Here are some example urls (the http and https examples are the
> same, except for the "s"):
> https://m.sfgate.com/49ers/article/2018-49ers-calendar-outdated-players-gone-12287960.php
> http://m.sfgate.com/49ers/article/2018-49ers-calendar-outdated-players-gone-12287960.php
> https://m.sfgate.com/bayarea/article/new-california-laws-going-into-effect-in-2018-12458046.php
> http://m.sfgate.com/bayarea/article/new-california-laws-going-into-effect-in-2018-12458046.php
> https://m.sfgate.com/business/article/Tesla-s-enormous-battery-in-Australia-just-weeks-12455377.php
> http://m.sfgate.com/business/article/Tesla-s-enormous-battery-in-Australia-just-weeks-12455377.php
>
> In case anybody wants to replicate this, here are the key parts of
> my regex-urlfilter.
> # reject certain sfgate urls
> -blog\.sfgate\.com
> -findnsave\.sfgate\.com
> -homeguides\.sfgate\.com
> -healthyeating\.sfgate\.com
> -cars\.sfgate\.com
> -marketing\.sfgate\.com
> -insidescoopsf\.sfgate\.com
> -reviews\.sfgate\.com
> -stats\.sfgate\.com
> -video\.sfgate\.com
>
> # accept other mobile sfgate urls
> +/m\.sfgate\.com
>
>> What is the best way to handle this, in general? I am thinking of
>> specifying http.redirect.max=1 (rather than the default 0) in
>> nutch-site.xml because I want it to fetch these pages right away,
>> rather than waiting until the next cycle.
>
> Of course, you can do this. But keep in mind: if both the http and
> the https URLs are in the CrawlDb, this may lead to duplicates.
> Fetcher redirect targets are not checked against the CrawlDb.
>
>> I think I want the redirection target to get stored in the crawldb
>
> That's done by the updatedb command, independent of the value of
> http.redirect.max. Is there any URL filter which may cause the
> redirect targets to be filtered out?
>
> On 03/09/2018 08:39 PM, Michael Coffey wrote:
>> I am having a problem crawling some sites that seem to be
>> transitioning to https. All their links contain http urls, and the
>> fetcher gets response code 301 and content that says "the document
>> has moved" because the actual content is accessible only via https.
>> This has been happening for a few days with my news crawler.
>>
>> What is the best way to handle this, in general? I am thinking of
>> specifying http.redirect.max=1 (rather than the default 0) in
>> nutch-site.xml because I want it to fetch these pages right away,
>> rather than waiting until the next cycle.
>> I think I want the redirection target to get stored in the crawldb,
>> but I don't know how to achieve that. In fact, I thought that would
>> be the default behavior, and I am surprised to see it not doing
>> that.
>>
>> Are there any other settings I should change, and is there any
>> drawback to using http.redirect.max for this purpose?
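
For reference, enabling this is a single property override in
nutch-site.xml. A minimal sketch, assuming stock Nutch configuration
(http.redirect.max is a standard property whose default of 0 records
redirect targets for a later cycle instead of following them; the
description text below is paraphrased):

  <property>
    <name>http.redirect.max</name>
    <value>1</value>
    <description>Follow up to one redirect within the same fetch
    cycle instead of recording the target for a later cycle.
    </description>
  </property>

As noted above, redirect targets followed by the fetcher are not
checked against the CrawlDb, so if both the http and https variants
end up stored, a deduplication pass (bin/nutch dedup <crawldb>) may
be needed afterwards.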

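To answer the URL-filter question empirically, the redirect targets
can be piped through the configured filter chain. A sketch, assuming
a Nutch 1.x installation (URLFilterChecker is a real Nutch class, but
its flags vary by version; newer releases expect an explicit -stdin
argument):

  echo "https://m.sfgate.com/49ers/article/2018-49ers-calendar-outdated-players-gone-12287960.php" \
    | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

A leading "-" in the output means one of the configured filters
rejected the URL; a leading "+" means it passed them all.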

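Likewise, the score and fetch time that a redirect target ended up
with can be read straight out of the CrawlDb. A sketch, assuming the
crawl data lives under crawl/crawldb (adjust the path to your
layout):

  bin/nutch readdb crawl/crawldb -url \
    "https://m.sfgate.com/bayarea/article/new-california-laws-going-into-effect-in-2018-12458046.php"

This prints the CrawlDatum for that URL, including status, score,
fetch time, and retry interval, which makes it easy to confirm
whether retry_interval has pushed the fetch time into the future.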