Hi,

Just to close this thread: I figured out that the issue was caused by Apache HttpClient's (HttpClientBuilder) default behavior of following redirects automatically. Disabling that behavior solved the problem:

    HttpClientBuilder builder = HttpClientBuilder.create();
    builder.disableRedirectHandling();
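For anyone hitting the same problem, here is roughly where that change sits in a custom protocol plugin's client setup. This is only a minimal sketch, not the actual plugin code: the class name, factory method, and timeout values are illustrative, and only the HttpClientBuilder.create() / disableRedirectHandling() calls come from the fix above.

    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClientBuilder;

    // Hypothetical factory for the HTTP client used by a custom Nutch
    // protocol plugin. The important part is disableRedirectHandling():
    // the client then returns 301/302 responses as-is instead of silently
    // following them, so the fetcher can record and queue the target URL.
    public class CrawlHttpClientFactory {

        public static CloseableHttpClient newClient(int timeoutMs) {
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectTimeout(timeoutMs)   // illustrative timeouts
                    .setSocketTimeout(timeoutMs)
                    .build();
            HttpClientBuilder builder = HttpClientBuilder.create();
            builder.disableRedirectHandling();      // the actual fix
            builder.setDefaultRequestConfig(requestConfig);
            return builder.build();
        }
    }

With redirect handling disabled at the client level, the redirect responses reach Nutch itself, which follows them according to http.redirect.max (5 in my setup) and records each URL in the redirect chain.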
Now I am able to see all the redirected URLs being crawled as expected. Thanks for the help.

Regards
Prateek

On Thu, May 6, 2021 at 11:42 AM prateek <[email protected]> wrote:

> Thanks. I am using a custom HTTP plugin, so I will debug with 1.16 to see
> what's causing it. Thanks for your help.
>
> Regards
> Prateek
>
> On Thu, May 6, 2021 at 11:26 AM Sebastian Nagel
> <[email protected]> wrote:
>
>> Hi Prateek,
>>
>> (sorry, I pressed the wrong reply button, so redirecting the discussion
>> back to user@nutch)
>>
>> > I am not sure what I am missing.
>>
>> Well, URL filters? Robots.txt? Don't know...
>>
>> > I am currently using Nutch 1.16
>>
>> Just to make sure this isn't the cause: there was a bug (NUTCH-2550 [1])
>> which caused Fetcher not to follow redirects, but it was fixed already
>> in Nutch 1.15.
>>
>> I've retried using Nutch 1.16:
>> - using -Dplugin.includes='protocol-okhttp|parse-html'
>>
>>   FetcherThread 43 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
>>   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>>   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>>
>> Note: there might be an issue using protocol-http
>> (-Dplugin.includes='protocol-http|parse-html') together with Nutch 1.16:
>>
>>   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>>   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>>   Couldn't get robots.txt for https://wikipedia.com/: java.net.SocketException: Socket is closed
>>   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>>   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>>   Couldn't get robots.txt for https://www.wikipedia.org/: java.net.SocketException: Socket is closed
>>   Failed to get protocol output java.net.SocketException: Socket is closed
>>     at sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1109)
>>     at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
>>     at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
>>     at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:375)
>>     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
>>   FetcherThread 43 fetch of https://www.wikipedia.org/ failed with: java.net.SocketException: Socket is closed
>>
>> But it's not reproducible using Nutch master / 1.18 - as it relates to
>> HTTPS/SSL, it's likely fixed by NUTCH-2794 [2].
>>
>> In any case, could you try to reproduce the problem using Nutch 1.18?
>>
>> Best,
>> Sebastian
>>
>> [1] https://issues.apache.org/jira/browse/NUTCH-2550
>> [2] https://issues.apache.org/jira/browse/NUTCH-2794
>>
>> On 5/6/21 11:54 AM, prateek wrote:
>> > Thanks for your reply Sebastian.
>> >
>> > I am using http.redirect.max=5 for my setup.
>> > In the seed list, I am only passing http://wikipedia.com/ and
>> > https://zyfro.com/. The CrawlDatum and ParseData shared in my earlier
>> > email are from the http://wikipedia.com/ URL.
>> > I don't see the other redirected URLs in the logs or segments.
>> > Here is my log:
>> >
>> >   2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 1 Using queue mode : byHost
>> >   2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold: -1
>> >   2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold retries: 5
>> >   2021-05-05 17:35:23,855 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching http://wikipedia.com/ (queue crawl delay=1000ms)
>> >   2021-05-05 17:35:29,095 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching https://zyfro.com/ (queue crawl delay=1000ms)
>> >   2021-05-05 17:35:29,095 INFO [main] com.**.nutchplugin.http.Http: fetching https://zyfro.com/robots.txt
>> >   2021-05-05 17:35:29,862 INFO [main] org.apache.nutch.fetcher.Fetcher: -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
>> >   2021-05-05 17:35:30,189 INFO [main] com.**.nutchplugin.http.Http: fetching https://zyfro.com/
>> >   2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 has no more work available
>> >
>> > I am not sure what I am missing.
>> >
>> > Regards
>> > Prateek
>> >
>> > On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel
>> > <[email protected]> wrote:
>> >
>> > Hi Prateek,
>> >
>> > could you share information about all pages/URLs in the redirect chain?
>> >
>> >   http://wikipedia.com/
>> >   https://wikipedia.com/
>> >   https://www.wikipedia.org/
>> >
>> > If I'm not wrong, the shown CrawlDatum and ParseData stem from
>> > https://www.wikipedia.org/ and show _http_status_code_=200.
>> > So it looks like the redirects have been followed.
>> >
>> > Note: all 3 URLs should have records in the segment and the CrawlDb.
>> >
>> > I've also verified that the above redirect chain is followed by Fetcher
>> > with the following settings (passed on the command line via -D) using
>> > Nutch master (1.18):
>> >
>> >   -Dhttp.redirect.max=3
>> >   -Ddb.ignore.external.links=true
>> >   -Ddb.ignore.external.links.mode=byDomain
>> >   -Ddb.ignore.also.redirects=false
>> >
>> > Fetcher log snippets:
>> >
>> >   FetcherThread 51 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
>> >   FetcherThread 51 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>> >   FetcherThread 51 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>> >
>> > Just in case: what's the value of the property http.redirect.max?
>> >
>> > Best,
>> > Sebastian
>> >
>> > On 5/5/21 8:09 PM, prateek wrote:
>> > > Hi,
>> > >
>> > > I am currently using Nutch 1.16 with the properties below:
>> > >
>> > >   db.ignore.external.links=true
>> > >   db.ignore.external.links.mode=byDomain
>> > >   db.ignore.also.redirects=false
>> > >
>> > > When I crawl websites that redirect (HTTP 301) with Nutch
>> > > (for example https://zyfro.com/ and http://wikipedia.com/), I see
>> > > that the new redirected URL is not captured by Nutch.
>> > > Even the outlinks point to the original URL provided, and the
>> > > status returned is 200.
>> > > So my questions are:
>> > > 1. How do I capture the new URL?
>> > > 2. Is there a way to allow Nutch to capture the 301 status and then
>> > > the new URL, and then crawl the related content?
>> > >
>> > > Here are the CrawlDatum and ParseData structures for
>> > > http://wikipedia.com/, which gets redirected to wikipedia.org:
>> > >
>> > > CrawlDatum:
>> > >   Version: 7
>> > >   Status: 33 (fetch_success)
>> > >   Fetch time: Wed May 05 17:35:29 UTC 2021
>> > >   Modified time: Thu Jan 01 00:00:00 UTC 1970
>> > >   Retries since fetch: 0
>> > >   Retry interval: 31536000 seconds (365 days)
>> > >   Score: 2.0
>> > >   Signature: null
>> > >   Metadata:
>> > >     _ngt_=1620235730883
>> > >     _depth_=1
>> > >     _http_status_code_=200
>> > >     _pst_=success(1), lastModified=1620038693000
>> > >     _rs_=410
>> > >     Content-Type=text/html
>> > >     _maxdepth_=1000
>> > >     nutch.protocol.code=200
>> > >
>> > > ParseData:
>> > >   Version: 5
>> > >   Status: success(1,0)
>> > >   Title: Wikipedia
>> > >   Outlinks: 1
>> > >     outlink: toUrl: http://wikipedia.com/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png
>> > >     anchor: Wikipedia
>> > >   Content Metadata:
>> > >     _depth_=1
>> > >     Server=ATS/8.0.8
>> > >     nutch.content.digest=bc4a6cee4d559c44fbc839c9f2b4a449
>> > >     Server-Timing=cache;desc="hit-front", host;desc="cp1081"
>> > >     Permissions-Policy=interest-cohort=()
>> > >     Last-Modified=Mon, 03 May 2021 10:44:53 GMT
>> > >     Strict-Transport-Security=max-age=106384710; includeSubDomains; preload
>> > >     X-Cache-Status=hit-front
>> > >     Report-To={ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
>> > >     Age=27826
>> > >     Content-Type=text/html
>> > >     X-Cache=cp1079 hit, cp1081 hit/578233
>> > >     Connection=keep-alive
>> > >     _maxdepth_=1000
>> > >     X-Client-IP=108.174.5.114
>> > >     Date=Wed, 05 May 2021 09:51:42 GMT
>> > >     nutch.crawl.score=2.0
>> > >     Accept-Ranges=bytes
>> > >     nutch.segment.name=20210505173059
>> > >     Cache-Control=s-maxage=86400, must-revalidate, max-age=3600
>> > >     NEL={ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
>> > >     ETag=W/"11e90-5c16aa6d9b068"
>> > >     Vary=Accept-Encoding
>> > >     X-LI-Tracking-Id=treeId:AAAAAAAAAAAAAAAAAAAAAA==|ts:1620236128580|cc:14fb413b|sc:490b9459|req:JTwZc4y|src:/10.148.138.11:17671|dst:www.wikipedia.org|principal:hadoop-test
>> > >     _fst_=33
>> > >   Parse Metadata:
>> > >     CharEncodingForConversion=utf-8
>> > >     OriginalCharEncoding=utf-8
>> > >     _depth_=1
>> > >     viewport=initial-scale=1,user-scalable=yes
>> > >     metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
>> > >     description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
>> > >     _maxdepth_=1000
>> > >
>> > > Thanks
>> > > Prateek