Hi Prateek,
(sorry, I pressed the wrong reply button, so redirecting the discussion back to
user@nutch)
> I am not sure what I am missing.
Well, URL filters? Robots.txt? Don't know...
> I am currently using Nutch 1.16
Just to make sure this isn't the cause: there was a bug (NUTCH-2550 [1]) which
caused Fetcher not to follow redirects, but it was already fixed in Nutch 1.15.
I've retried using Nutch 1.16:
- using -Dplugin.includes='protocol-okhttp|parse-html'
FetcherThread 43 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
Note: there might be an issue using protocol-http
(-Dplugin.includes='protocol-http|parse-html') together with Nutch 1.16:
FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
Couldn't get robots.txt for https://wikipedia.com/:
java.net.SocketException: Socket is closed
FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
Couldn't get robots.txt for https://www.wikipedia.org/:
java.net.SocketException: Socket is closed
Failed to get protocol output java.net.SocketException: Socket is closed
at sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1109)
at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:375)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
FetcherThread 43 fetch of https://www.wikipedia.org/ failed with: java.net.SocketException: Socket is closed
But it's not reproducible using Nutch master / 1.18; since it relates to
HTTPS/SSL, it was likely fixed by NUTCH-2794 [2].
In any case, could you try to reproduce the problem using Nutch 1.18?
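To make the reproduction easy, here is a sketch of a minimal crawl with the two seed URLs; it assumes a Nutch 1.18 binary install in $NUTCH_HOME, and the crawldb/segments paths are illustrative:

```shell
# Minimal reproduction sketch (requires a Nutch 1.18 installation)
cd "$NUTCH_HOME"
mkdir -p seeds
echo "http://wikipedia.com/" >  seeds/seed.txt
echo "https://zyfro.com/"    >> seeds/seed.txt

bin/nutch inject crawl/crawldb seeds
bin/nutch generate crawl/crawldb crawl/segments
segment=crawl/segments/$(ls -t crawl/segments | head -1)
# pass the redirect limit and plugins on the command line
bin/nutch fetch -Dhttp.redirect.max=5 \
  -Dplugin.includes='protocol-okhttp|parse-html' "$segment"
```

The fetcher log (logs/hadoop.log by default) should then show one "fetching ..." line per URL in the redirect chain.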
Best,
Sebastian
[1] https://issues.apache.org/jira/browse/NUTCH-2550
[2] https://issues.apache.org/jira/browse/NUTCH-2794
On 5/6/21 11:54 AM, prateek wrote:
Thanks for your reply Sebastian.
I am using http.redirect.max=5 for my setup.
In the seed URL list, I am only passing http://wikipedia.com/ and https://zyfro.com/. The CrawlDatum
and ParseData shared in my earlier email are from the http://wikipedia.com/ URL.
I don't see the other redirected URLs in the logs or segments. Here is my log -
2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 1 Using queue mode : byHost
2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold: -1
2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold retries: 5
2021-05-05 17:35:23,855 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching http://wikipedia.com/ (queue crawl delay=1000ms)
2021-05-05 17:35:29,095 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching https://zyfro.com/ (queue crawl delay=1000ms)
2021-05-05 17:35:29,095 INFO [main] com.linkedin.nutchplugin.http.Http: fetching https://zyfro.com/robots.txt
2021-05-05 17:35:29,862 INFO [main] org.apache.nutch.fetcher.Fetcher: -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
2021-05-05 17:35:30,189 INFO [main] com.linkedin.nutchplugin.http.Http: fetching https://zyfro.com/
2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 has no more work available
I am not sure what I am missing.
Regards
Prateek
On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel <wastl.na...@googlemail.com> wrote:
Hi Prateek,
could you share information about all pages/URLs in the redirect chain?
http://wikipedia.com/
https://wikipedia.com/
https://www.wikipedia.org/
If I'm not wrong, the shown CrawlDatum and ParseData stem from
https://www.wikipedia.org/ and show _http_status_code_=200.
So it looks like the redirects have been followed.
Note: all 3 URLs should have records in the segment and the CrawlDb.
I've also verified that the above redirect chain is followed by Fetcher
with the following settings (passed on the command-line via -D) using
Nutch master (1.18):
-Dhttp.redirect.max=3
-Ddb.ignore.external.links=true
-Ddb.ignore.external.links.mode=byDomain
-Ddb.ignore.also.redirects=false
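For completeness, the same properties can also be set permanently in conf/nutch-site.xml instead of being passed via -D on the command line (a sketch; the values mirror the flags above):

```
<!-- conf/nutch-site.xml: equivalent of the -D flags above -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>
<property>
  <name>db.ignore.also.redirects</name>
  <value>false</value>
</property>
```

Command-line -D properties override nutch-site.xml, which in turn overrides nutch-default.xml.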
Fetcher log snippets:
FetcherThread 51 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
FetcherThread 51 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
FetcherThread 51 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
Just in case: what's the value of the property http.redirect.max ?
Best,
Sebastian
On 5/5/21 8:09 PM, prateek wrote:
> Hi,
>
> I am currently using Nutch 1.16 with the properties below -
>
> db.ignore.external.links=true
> db.ignore.external.links.mode=byDomain
> db.ignore.also.redirects=false
>
> When I am crawling websites that redirect (301 HTTP code) using Nutch
> (for example, https://zyfro.com/ and http://wikipedia.com/), I see
> that the new redirected URL is not captured by Nutch. Even the outlinks
> point to the original URL provided, and the status returned is 200.
> So my questions are:
> 1. How do I capture the new URL?
> 2. Is there a way to allow Nutch to capture the 301 status and then the new
> URL, and then crawl the related content?
>
> Here is the CrawlDatum and ParseData structure for http://wikipedia.com/, which
> gets redirected to wikipedia.org.
>
> CrawlDatum:
>   Version: 7
>   Status: 33 (fetch_success)
>   Fetch time: Wed May 05 17:35:29 UTC 2021
>   Modified time: Thu Jan 01 00:00:00 UTC 1970
>   Retries since fetch: 0
>   Retry interval: 31536000 seconds (365 days)
>   Score: 2.0
>   Signature: null
>   Metadata: _ngt_=1620235730883 _depth_=1 _http_status_code_=200 _pst_=success(1), lastModified=1620038693000 _rs_=410 Content-Type=text/html _maxdepth_=1000 nutch.protocol.code=200
>
> ParseData:
>   Version: 5
>   Status: success(1,0)
>   Title: Wikipedia
>   Outlinks: 1
>     outlink: toUrl: http://wikipedia.com/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png anchor: Wikipedia
>   Content Metadata: _depth_=1 Server=ATS/8.0.8 nutch.content.digest=bc4a6cee4d559c44fbc839c9f2b4a449 Server-Timing=cache;desc="hit-front", host;desc="cp1081" Permissions-Policy=interest-cohort=() Last-Modified=Mon, 03 May 2021 10:44:53 GMT Strict-Transport-Security=max-age=106384710; includeSubDomains; preload X-Cache-Status=hit-front Report-To={ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] } Age=27826 Content-Type=text/html X-Cache=cp1079 hit, cp1081 hit/578233 Connection=keep-alive _maxdepth_=1000 X-Client-IP=108.174.5.114 Date=Wed, 05 May 2021 09:51:42 GMT nutch.crawl.score=2.0 Accept-Ranges=bytes nutch.segment.name=20210505173059 Cache-Control=s-maxage=86400, must-revalidate, max-age=3600 NEL={ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0} ETag=W/"11e90-5c16aa6d9b068" Vary=Accept-Encoding X-LI-Tracking-Id=treeId:AAAAAAAAAAAAAAAAAAAAAA==|ts:1620236128580|cc:14fb413b|sc:490b9459|req:JTwZc4y|src:/10.148.138.11:17671|dst:www.wikipedia.org|principal:hadoop-test _fst_=33
>   Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8 _depth_=1 viewport=initial-scale=1,user-scalable=yes metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation. description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation. _maxdepth_=1000
>
>
> Thanks
> Prateek
>