> Another problem is that they have fetch_time well into the future,
> I guess because retry_interval is applied.

Correct. The fetch time is
- the time when a CrawlDatum in the CrawlDb is due to be fetched next
- the time when the fetch actually happened for entries in a segment's crawl_fetch folder
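
If you want to verify both, something like the following should work (the URL
and paths are placeholders for your crawl directories):

  # scheduled fetch time (and retry interval) of a single URL in the CrawlDb
  bin/nutch readdb crawl/crawldb -url <url>

  # dump only the crawl_fetch data of one segment
  bin/nutch readseg -dump crawl/segments/<segment> dump_fetch \
      -nocontent -nogenerate -noparse -noparsedata -noparsetext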

On 03/09/2018 11:04 PM, Michael Coffey wrote:
> Thanks for the suggestion. On closer inspection, I see that redirection 
> targets do show up in the crawldb.
> One problem is that the target urls all have scores equal to zero, because no 
> other pages point to them. Another problem is that they have fetch_time well 
> into the future, I guess because retry_interval is applied.
> Interestingly, the target urls do sometimes show up in a segment. When I dump 
> the segment after attempted fetching, they show responseCode 301 (even for 
> the redirection targets), nutchStatus 67, and empty content. I imagine this 
> might just be the fetcher noticing the redirection, and that this is how it 
> communicates the redirect to updatedb.
> Here are some example urls (the http and https versions are the same, except 
> for the "s"):
> https://m.sfgate.com/49ers/article/2018-49ers-calendar-outdated-players-gone-12287960.php
> http://m.sfgate.com/49ers/article/2018-49ers-calendar-outdated-players-gone-12287960.php
> https://m.sfgate.com/bayarea/article/new-california-laws-going-into-effect-in-2018-12458046.php
> http://m.sfgate.com/bayarea/article/new-california-laws-going-into-effect-in-2018-12458046.php
> https://m.sfgate.com/business/article/Tesla-s-enormous-battery-in-Australia-just-weeks-12455377.php
> http://m.sfgate.com/business/article/Tesla-s-enormous-battery-in-Australia-just-weeks-12455377.php
> 
> In case anybody wants to replicate this, here are the key parts of my 
> regex-urlfilter.txt:
> # reject certain sfgate urls
> -blog\.sfgate\.com
> -findnsave\.sfgate\.com
> -homeguides\.sfgate\.com
> -healthyeating\.sfgate\.com
> -cars\.sfgate\.com
> -marketing\.sfgate\.com
> -insidescoopsf\.sfgate\.com
> -reviews\.sfgate\.com
> -stats\.sfgate\.com
> -video\.sfgate\.com
> 
> # accept other mobile sfgate urls
> +/m\.sfgate\.com
> 
>> What is the best way to handle this, in general? I am thinking of specifying 
>> http.redirect.max=1
>> (rather than the default 0) in nutch-site.xml because I want it to fetch 
>> these pages right away, rather than waiting until the next cycle.
> 
> Of course, you can do this. But keep in mind: if both the http and the https 
> URLs are in the CrawlDb, this may lead to duplicates. Fetcher redirect targets 
> are not checked against the CrawlDb.
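> 
> For reference, enabling this in nutch-site.xml should look like:
> 
>   <property>
>     <name>http.redirect.max</name>
>     <value>1</value>
>   </property>
> 
> If duplicates between the http and https variants do show up, the dedup job 
> (bin/nutch dedup <crawldb>) can mark them in the CrawlDb afterwards.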
> 
>> I think I want the redirection target to get stored in the crawldb
> 
> That's done by the updatedb command, independent of the value of 
> http.redirect.max.
> Is there any URL filter which may cause the redirect targets to be filtered 
> out?
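> 
> One way to check is to pipe a redirect target through the URL filter checker 
> (exact options depend on the Nutch version, recent ones may also need -stdin):
> 
>   echo "https://m.sfgate.com/49ers/article/2018-49ers-calendar-outdated-players-gone-12287960.php" \
>     | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
> 
> A '+' in front of the URL in the output means it passes all active filters, 
> a '-' means it is rejected.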
> 
> On 03/09/2018 08:39 PM, Michael Coffey wrote:
>> I am having a problem crawling some sites that seem to be transitioning to 
>> https. All their links contain http urls and the fetcher gets response code 
>> 301 and content that says "the document has moved" because the actual 
>> content is accessible only via https. This has been happening for a few days 
>> with my news crawler.
>>
>> What is the best way to handle this, in general? I am thinking of specifying 
>> http.redirect.max=1 (rather than the default 0) in nutch-site.xml because I 
>> want it to fetch these pages right away, rather than waiting until the next 
>> cycle.
>> I think I want the redirection target to get stored in the crawldb, but I 
>> don't know how to achieve that. In fact, I thought that would be the default 
>> behavior, and I am surprised to see it not doing that.
>>
>> Are there any other settings I should change, and is there any drawback to 
>> using http.redirect.max for this purpose?
