Hi,

I am currently using Nutch 1.16 with the following properties:



db.ignore.external.links=true
db.ignore.external.links.mode=byDomain
db.ignore.also.redirects=false
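
For reference, a sketch of how these three settings would typically appear in conf/nutch-site.xml. Only the property names and values are from my setup; the surrounding layout is the standard Hadoop-style configuration format, and the comments are my understanding of what each property does.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Ignore outlinks that lead to external hosts or domains. -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
  <!-- Decide what counts as "external" by domain rather than by host. -->
  <property>
    <name>db.ignore.external.links.mode</name>
    <value>byDomain</value>
  </property>
  <!-- false: redirects are not subject to the external-link check above. -->
  <property>
    <name>db.ignore.also.redirects</name>
    <value>false</value>
  </property>
</configuration>
```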

When I crawl websites that redirect with an HTTP 301 (for example,
https://zyfro.com/ and http://wikipedia.com/), I see that the redirect
target URL is not captured by Nutch. Even the outlinks point to the
original URL provided, and the status returned is 200.
So my questions are:
1. How do I capture the new URL?
2. Is there a way to have Nutch record the 301 status and the new URL,
and then crawl the related content?

Here are the CrawlDatum and ParseData structures for http://wikipedia.com/,
which gets redirected to wikipedia.org.

CrawlDatum :
Version: 7
Status: 33 (fetch_success)
Fetch time: Wed May 05 17:35:29 UTC 2021
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 31536000 seconds (365 days)
Score: 2.0
Signature: null
Metadata:
  _ngt_=1620235730883
  _depth_=1
  _http_status_code_=200
  _pst_=success(1), lastModified=1620038693000
  _rs_=410
  Content-Type=text/html
  _maxdepth_=1000
  nutch.protocol.code=200

ParseData :
Version: 5
Status: success(1,0)
Title: Wikipedia
Outlinks: 1
  outlink: toUrl: http://wikipedia.com/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png anchor: Wikipedia
Content Metadata:
  _depth_=1
  Server=ATS/8.0.8
  nutch.content.digest=bc4a6cee4d559c44fbc839c9f2b4a449
  Server-Timing=cache;desc="hit-front", host;desc="cp1081"
  Permissions-Policy=interest-cohort=()
  Last-Modified=Mon, 03 May 2021 10:44:53 GMT
  Strict-Transport-Security=max-age=106384710; includeSubDomains; preload
  X-Cache-Status=hit-front
  Report-To={ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
  Age=27826
  Content-Type=text/html
  X-Cache=cp1079 hit, cp1081 hit/578233
  Connection=keep-alive
  _maxdepth_=1000
  X-Client-IP=108.174.5.114
  Date=Wed, 05 May 2021 09:51:42 GMT
  nutch.crawl.score=2.0
  Accept-Ranges=bytes
  nutch.segment.name=20210505173059
  Cache-Control=s-maxage=86400, must-revalidate, max-age=3600
  NEL={ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
  ETag=W/"11e90-5c16aa6d9b068"
  Vary=Accept-Encoding
  X-LI-Tracking-Id=treeId:AAAAAAAAAAAAAAAAAAAAAA==|ts:1620236128580|cc:14fb413b|sc:490b9459|req:JTwZc4y|src:/10.148.138.11:17671|dst:www.wikipedia.org|principal:hadoop-test
  _fst_=33
Parse Metadata:
  CharEncodingForConversion=utf-8
  OriginalCharEncoding=utf-8
  _depth_=1
  viewport=initial-scale=1,user-scalable=yes
  metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
  metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
  description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
  _maxdepth_=1000


Thanks
Prateek
