Hi,

Just to close this thread: I figured out that the issue was caused by Apache HttpClient's (HttpClientBuilder) default behavior of following redirects automatically. Disabling that behavior solved the problem:

    HttpClientBuilder builder = HttpClientBuilder.create();
    builder.disableRedirectHandling();
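For anyone hitting the same problem, here is roughly where that change sits in a custom protocol plugin's client setup. This is only a minimal sketch, not the actual plugin code: the class name, factory method, and timeout values are illustrative, and only the HttpClientBuilder.create() / disableRedirectHandling() calls come from the fix above.

    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClientBuilder;

    // Hypothetical factory for the HTTP client used by a custom Nutch
    // protocol plugin. The important part is disableRedirectHandling():
    // the client then returns 301/302 responses as-is instead of silently
    // following them, so the fetcher can record and queue the target URL.
    public class CrawlHttpClientFactory {

        public static CloseableHttpClient newClient(int timeoutMs) {
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectTimeout(timeoutMs)   // illustrative timeouts
                    .setSocketTimeout(timeoutMs)
                    .build();
            HttpClientBuilder builder = HttpClientBuilder.create();
            builder.disableRedirectHandling();      // the actual fix
            builder.setDefaultRequestConfig(requestConfig);
            return builder.build();
        }
    }

With redirect handling disabled at the client level, the redirect responses reach Nutch itself, which follows them according to http.redirect.max (5 in my setup) and records each URL in the redirect chain.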
Now I am able to see all the redirected URLs being crawled as expected. Thanks for the help.

Regards
Prateek

On Thu, May 6, 2021 at 11:42 AM prateek <[email protected]> wrote:

> Thanks. I am using a custom HTTP plugin, so I will debug with 1.16 to see
> what's causing it. Thanks for your help.
>
> Regards
> Prateek
>
> On Thu, May 6, 2021 at 11:26 AM Sebastian Nagel
> <[email protected]> wrote:
>
>> Hi Prateek,
>>
>> (sorry, I pressed the wrong reply button, so redirecting the discussion
>> back to user@nutch)
>>
>> > I am not sure what I am missing.
>>
>> Well, URL filters? Robots.txt? Don't know...
>>
>> > I am currently using Nutch 1.16
>>
>> Just to make sure this isn't the cause: there was a bug (NUTCH-2550 [1])
>> which caused Fetcher not to follow redirects, but it was fixed already
>> in Nutch 1.15.
>>
>> I've retried using Nutch 1.16:
>> - using -Dplugin.includes='protocol-okhttp|parse-html'
>>
>>   FetcherThread 43 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
>>   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>>   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>>
>> Note: there might be an issue using protocol-http
>> (-Dplugin.includes='protocol-http|parse-html') together with Nutch 1.16:
>>
>>   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>>   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>>   Couldn't get robots.txt for https://wikipedia.com/: java.net.SocketException: Socket is closed
>>   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>>   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>>   Couldn't get robots.txt for https://www.wikipedia.org/: java.net.SocketException: Socket is closed
>>   Failed to get protocol output java.net.SocketException: Socket is closed
>>     at sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1109)
>>     at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
>>     at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
>>     at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:375)
>>     at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
>>   FetcherThread 43 fetch of https://www.wikipedia.org/ failed with: java.net.SocketException: Socket is closed
>>
>> But it's not reproducible using Nutch master / 1.18 - as it relates to
>> HTTPS/SSL, it's likely fixed by NUTCH-2794 [2].
>>
>> In any case, could you try to reproduce the problem using Nutch 1.18?
>>
>> Best,
>> Sebastian
>>
>> [1] https://issues.apache.org/jira/browse/NUTCH-2550
>> [2] https://issues.apache.org/jira/browse/NUTCH-2794
>>
>> On 5/6/21 11:54 AM, prateek wrote:
>> > Thanks for your reply Sebastian.
>> >
>> > I am using http.redirect.max=5 for my setup.
>> > In the seed list, I am only passing http://wikipedia.com/ and
>> > https://zyfro.com/. The CrawlDatum and ParseData shared in my earlier
>> > email are from the http://wikipedia.com/ URL.
>> > I don't see the other redirected URLs in the logs or segments.
>> > Here is my log:
>> >
>> >   2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 1 Using queue mode : byHost
>> >   2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold: -1
>> >   2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold retries: 5
>> >   2021-05-05 17:35:23,855 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching http://wikipedia.com/ (queue crawl delay=1000ms)
>> >   2021-05-05 17:35:29,095 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching https://zyfro.com/ (queue crawl delay=1000ms)
>> >   2021-05-05 17:35:29,095 INFO [main] com.**.nutchplugin.http.Http: fetching https://zyfro.com/robots.txt
>> >   2021-05-05 17:35:29,862 INFO [main] org.apache.nutch.fetcher.Fetcher: -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
>> >   2021-05-05 17:35:30,189 INFO [main] com.**.nutchplugin.http.Http: fetching https://zyfro.com/
>> >   2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 has no more work available
>> >
>> > I am not sure what I am missing.
>> >
>> > Regards
>> > Prateek
>> >
>> > On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel
>> > <[email protected]> wrote:
>> >
>> > Hi Prateek,
>> >
>> > could you share information about all pages/URLs in the redirect chain?
>> >
>> >   http://wikipedia.com/
>> >   https://wikipedia.com/
>> >   https://www.wikipedia.org/
>> >
>> > If I'm not wrong, the shown CrawlDatum and ParseData stem from
>> > https://www.wikipedia.org/ and show _http_status_code_=200.
>> > So it looks like the redirects have been followed.
>> >
>> > Note: all 3 URLs should have records in the segment and the CrawlDb.
>> >
>> > I've also verified that the above redirect chain is followed by Fetcher
>> > with the following settings (passed on the command line via -D) using
>> > Nutch master (1.18):
>> >
>> >   -Dhttp.redirect.max=3
>> >   -Ddb.ignore.external.links=true
>> >   -Ddb.ignore.external.links.mode=byDomain
>> >   -Ddb.ignore.also.redirects=false
>> >
>> > Fetcher log snippets:
>> >
>> >   FetcherThread 51 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
>> >   FetcherThread 51 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
>> >   FetcherThread 51 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
>> >
>> > Just in case: what's the value of the property http.redirect.max?
>> >
>> > Best,
>> > Sebastian
>> >
>> > On 5/5/21 8:09 PM, prateek wrote:
>> > > Hi,
>> > >
>> > > I am currently using Nutch 1.16 with the properties below:
>> > >
>> > >   db.ignore.external.links=true
>> > >   db.ignore.external.links.mode=byDomain
>> > >   db.ignore.also.redirects=false
>> > >
>> > > When I crawl websites that redirect (HTTP 301) with Nutch
>> > > (for example https://zyfro.com/ and http://wikipedia.com/), I see
>> > > that the new redirected URL is not captured by Nutch.
>> > > Even the outlinks point to the original URL provided, and the
>> > > status returned is 200.
>> > > So my questions are:
>> > > 1. How do I capture the new URL?
>> > > 2. Is there a way to allow Nutch to capture the 301 status and then
>> > > the new URL, and then crawl the related content?
>> > >
>> > > Here are the CrawlDatum and ParseData structures for
>> > > http://wikipedia.com/, which gets redirected to wikipedia.org:
>> > >
>> > > CrawlDatum:
>> > >   Version: 7
>> > >   Status: 33 (fetch_success)
>> > >   Fetch time: Wed May 05 17:35:29 UTC 2021
>> > >   Modified time: Thu Jan 01 00:00:00 UTC 1970
>> > >   Retries since fetch: 0
>> > >   Retry interval: 31536000 seconds (365 days)
>> > >   Score: 2.0
>> > >   Signature: null
>> > >   Metadata:
>> > >     _ngt_=1620235730883
>> > >     _depth_=1
>> > >     _http_status_code_=200
>> > >     _pst_=success(1), lastModified=1620038693000
>> > >     _rs_=410
>> > >     Content-Type=text/html
>> > >     _maxdepth_=1000
>> > >     nutch.protocol.code=200
>> > >
>> > > ParseData:
>> > >   Version: 5
>> > >   Status: success(1,0)
>> > >   Title: Wikipedia
>> > >   Outlinks: 1
>> > >     outlink: toUrl: http://wikipedia.com/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png
>> > >     anchor: Wikipedia
>> > >   Content Metadata:
>> > >     _depth_=1
>> > >     Server=ATS/8.0.8
>> > >     nutch.content.digest=bc4a6cee4d559c44fbc839c9f2b4a449
>> > >     Server-Timing=cache;desc="hit-front", host;desc="cp1081"
>> > >     Permissions-Policy=interest-cohort=()
>> > >     Last-Modified=Mon, 03 May 2021 10:44:53 GMT
>> > >     Strict-Transport-Security=max-age=106384710; includeSubDomains; preload
>> > >     X-Cache-Status=hit-front
>> > >     Report-To={ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
>> > >     Age=27826
>> > >     Content-Type=text/html
>> > >     X-Cache=cp1079 hit, cp1081 hit/578233
>> > >     Connection=keep-alive
>> > >     _maxdepth_=1000
>> > >     X-Client-IP=108.174.5.114
>> > >     Date=Wed, 05 May 2021 09:51:42 GMT
>> > >     nutch.crawl.score=2.0
>> > >     Accept-Ranges=bytes
>> > >     nutch.segment.name=20210505173059
>> > >     Cache-Control=s-maxage=86400, must-revalidate, max-age=3600
>> > >     NEL={ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
>> > >     ETag=W/"11e90-5c16aa6d9b068"
>> > >     Vary=Accept-Encoding
>> > >     X-LI-Tracking-Id=treeId:AAAAAAAAAAAAAAAAAAAAAA==|ts:1620236128580|cc:14fb413b|sc:490b9459|req:JTwZc4y|src:/10.148.138.11:17671|dst:www.wikipedia.org|principal:hadoop-test
>> > >     _fst_=33
>> > >   Parse Metadata:
>> > >     CharEncodingForConversion=utf-8
>> > >     OriginalCharEncoding=utf-8
>> > >     _depth_=1
>> > >     viewport=initial-scale=1,user-scalable=yes
>> > >     metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
>> > >     description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
>> > >     _maxdepth_=1000
>> > >
>> > > Thanks
>> > > Prateek