Hi,

I am currently using Nutch 1.16 with the following properties:

    db.ignore.external.links=true
    db.ignore.external.links.mode=byDomain
    db.ignore.also.redirects=false

When I crawl websites that redirect (HTTP 301) with Nutch, for example https://zyfro.com/ and http://wikipedia.com/, the new redirected URL is not captured by Nutch. Even the outlinks point to the original URL I provided, and the status returned is 200.

So my questions are:

1. How do I capture the new URL?
2. Is there a way to have Nutch capture the 301 status and the new URL, and then crawl the related content?

Here are the CrawlDatum and ParseData structures for http://wikipedia.com/, which gets redirected to wikipedia.org.

CrawlDatum:

    Version: 7
    Status: 33 (fetch_success)
    Fetch time: Wed May 05 17:35:29 UTC 2021
    Modified time: Thu Jan 01 00:00:00 UTC 1970
    Retries since fetch: 0
    Retry interval: 31536000 seconds (365 days)
    Score: 2.0
    Signature: null
    Metadata:
        _ngt_=1620235730883
        _depth_=1
        _http_status_code_=200
        _pst_=success(1), lastModified=1620038693000
        _rs_=410
        Content-Type=text/html
        _maxdepth_=1000
        nutch.protocol.code=200

ParseData:

    Version: 5
    Status: success(1,0)
    Title: Wikipedia
    Outlinks: 1
      outlink: toUrl: http://wikipedia.com/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png anchor: Wikipedia
    Content Metadata:
        _depth_=1
        Server=ATS/8.0.8
        nutch.content.digest=bc4a6cee4d559c44fbc839c9f2b4a449
        Server-Timing=cache;desc="hit-front", host;desc="cp1081"
        Permissions-Policy=interest-cohort=()
        Last-Modified=Mon, 03 May 2021 10:44:53 GMT
        Strict-Transport-Security=max-age=106384710; includeSubDomains; preload
        X-Cache-Status=hit-front
        Report-To={ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
        Age=27826
        Content-Type=text/html
        X-Cache=cp1079 hit, cp1081 hit/578233
        Connection=keep-alive
        _maxdepth_=1000
        X-Client-IP=108.174.5.114
        Date=Wed, 05 May 2021 09:51:42 GMT
        nutch.crawl.score=2.0
        Accept-Ranges=bytes
        nutch.segment.name=20210505173059
        Cache-Control=s-maxage=86400, must-revalidate, max-age=3600
        NEL={ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
        ETag=W/"11e90-5c16aa6d9b068"
        Vary=Accept-Encoding
        X-LI-Tracking-Id=treeId:AAAAAAAAAAAAAAAAAAAAAA==|ts:1620236128580|cc:14fb413b|sc:490b9459|req:JTwZc4y|src:/10.148.138.11:17671|dst:www.wikipedia.org|principal:hadoop-test
        _fst_=33
    Parse Metadata:
        CharEncodingForConversion=utf-8
        OriginalCharEncoding=utf-8
        _depth_=1
        viewport=initial-scale=1,user-scalable=yes
        metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
        metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
        description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
        _maxdepth_=1000

Thanks,
Prateek

