Hi Prateek,

(sorry, I pressed the wrong reply button, so redirecting the discussion back to 
user@nutch)


> I am not sure what I am missing.

Well, URL filters?  Robots.txt?  Don't know...


> I am currently using Nutch 1.16

Just to make sure this isn't the cause: there was a bug (NUTCH-2550 [1]) which
caused the Fetcher not to follow redirects, but it was already fixed in Nutch 1.15.

I've retried using Nutch 1.16:
- using -Dplugin.includes='protocol-okhttp|parse-html'
   FetcherThread 43 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
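
For completeness, this is roughly how I run such a test locally, using the
standard inject/generate/fetch cycle (the seed file and paths below are just
examples from my setup, not taken from yours; properties are passed via -D
as above):

   # urls/seed.txt contains a single line: http://wikipedia.com/
   bin/nutch inject crawl/crawldb urls/
   bin/nutch generate crawl/crawldb crawl/segments
   bin/nutch fetch -Dplugin.includes='protocol-okhttp|parse-html' \
       -Dhttp.redirect.max=3 crawl/segments/<segment>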

Note: there might be an issue using protocol-http
(-Dplugin.includes='protocol-http|parse-html') together with Nutch 1.16:
   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
   FetcherThread 43 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
   Couldn't get robots.txt for https://wikipedia.com/: java.net.SocketException: Socket is closed
   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
   FetcherThread 43 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)
   Couldn't get robots.txt for https://www.wikipedia.org/: java.net.SocketException: Socket is closed
   Failed to get protocol output java.net.SocketException: Socket is closed
        at sun.security.ssl.SSLSocketImpl.getOutputStream(SSLSocketImpl.java:1109)
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:162)
        at org.apache.nutch.protocol.http.Http.getResponse(Http.java:63)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:375)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
   FetcherThread 43 fetch of https://www.wikipedia.org/ failed with: java.net.SocketException: Socket is closed

However, it's not reproducible with Nutch master (1.18); since it relates to
HTTPS/SSL, it was likely fixed by NUTCH-2794 [2].

In any case, could you try to reproduce the problem using Nutch 1.18?
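
If the redirects are followed, all three URLs of the chain should show up in the
CrawlDb and the segment; you can verify this with something like the following
(paths are again just examples):

   bin/nutch readdb crawl/crawldb -url https://www.wikipedia.org/
   bin/nutch readseg -dump crawl/segments/<segment> segment-dump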

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-2550
[2] https://issues.apache.org/jira/browse/NUTCH-2794


On 5/6/21 11:54 AM, prateek wrote:
Thanks for your reply Sebastian.

I am using http.redirect.max=5 for my setup.
In the seed URL, I am only passing http://wikipedia.com/ and https://zyfro.com/.
The CrawlDatum and ParseData shared in my earlier email are from the
http://wikipedia.com/ URL.
I don't see the other redirected URLs in the logs or segments. Here is my log -

2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 1 Using queue mode : byHost
2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold: -1
2021-05-05 17:35:23,854 INFO [main] org.apache.nutch.fetcher.Fetcher: Fetcher: throughput threshold retries: 5
2021-05-05 17:35:23,855 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching http://wikipedia.com/ (queue crawl delay=1000ms)
2021-05-05 17:35:29,095 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 fetching https://zyfro.com/ (queue crawl delay=1000ms)
2021-05-05 17:35:29,095 INFO [main] com.linkedin.nutchplugin.http.Http: fetching https://zyfro.com/robots.txt
2021-05-05 17:35:29,862 INFO [main] org.apache.nutch.fetcher.Fetcher: -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
2021-05-05 17:35:30,189 INFO [main] com.linkedin.nutchplugin.http.Http: fetching https://zyfro.com/
2021-05-05 17:35:30,786 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 50 has no more work available

I am not sure what I am missing.

Regards
Prateek


On Thu, May 6, 2021 at 10:21 AM Sebastian Nagel <wastl.na...@googlemail.com> wrote:

    Hi Prateek,

    could you share information about all pages/URLs in the redirect chain?

    http://wikipedia.com/
    https://wikipedia.com/
    https://www.wikipedia.org/

    If I'm not wrong, the shown CrawlDatum and ParseData stem from
    https://www.wikipedia.org/ and have _http_status_code_=200.
    So it looks like the redirects have been followed.

    Note: all 3 URLs should have records in the segment and the CrawlDb.

    I've also verified that the above redirect chain is followed by Fetcher
    with the following settings (passed on the command-line via -D) using
    Nutch master (1.18):
       -Dhttp.redirect.max=3
       -Ddb.ignore.external.links=true
       -Ddb.ignore.external.links.mode=byDomain
       -Ddb.ignore.also.redirects=false

    Fetcher log snippets:
       FetcherThread 51 fetching http://wikipedia.com/ (queue crawl delay=3000ms)
       FetcherThread 51 fetching https://wikipedia.com/ (queue crawl delay=3000ms)
       FetcherThread 51 fetching https://www.wikipedia.org/ (queue crawl delay=3000ms)

    Just in case: what's the value of the property http.redirect.max ?

    Best,
    Sebastian


    On 5/5/21 8:09 PM, prateek wrote:
     > Hi,
     >
     > I am currently using Nutch 1.16 with the properties below -
     >
     > db.ignore.external.links=true
     > db.ignore.external.links.mode=byDomain
     > db.ignore.also.redirects=false
     >
     > When I am crawling websites that redirect (301 HTTP code) using Nutch
     > (for example https://zyfro.com/ and http://wikipedia.com/), I see that
     > the new redirected URL is not captured by Nutch. Even the outlinks point
     > to the original URL provided and the status returned is 200.
     > So my questions are:
     > 1. How do I capture the new URL?
     > 2. Is there a way to allow Nutch to capture the 301 status and then the
     > new URL, and then crawl the related content?
     >
     > Here is the CrawlDatum and ParseData structure for http://wikipedia.com/
     > which gets redirected to wikipedia.org.
     >
     > CrawlDatum:
     > Version: 7
     > Status: 33 (fetch_success)
     > Fetch time: Wed May 05 17:35:29 UTC 2021
     > Modified time: Thu Jan 01 00:00:00 UTC 1970
     > Retries since fetch: 0
     > Retry interval: 31536000 seconds (365 days)
     > Score: 2.0
     > Signature: null
     > Metadata:
     >   _ngt_=1620235730883
     >   _depth_=1
     >   _http_status_code_=200
     >   _pst_=success(1), lastModified=1620038693000
     >   _rs_=410
     >   Content-Type=text/html
     >   _maxdepth_=1000
     >   nutch.protocol.code=200
     >
     > ParseData:
     > Version: 5
     > Status: success(1,0)
     > Title: Wikipedia
     > Outlinks: 1
     >   outlink: toUrl: http://wikipedia.com/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png anchor: Wikipedia
     > Content Metadata:
     >   _depth_=1
     >   Server=ATS/8.0.8
     >   nutch.content.digest=bc4a6cee4d559c44fbc839c9f2b4a449
     >   Server-Timing=cache;desc="hit-front", host;desc="cp1081"
     >   Permissions-Policy=interest-cohort=()
     >   Last-Modified=Mon, 03 May 2021 10:44:53 GMT
     >   Strict-Transport-Security=max-age=106384710; includeSubDomains; preload
     >   X-Cache-Status=hit-front
     >   Report-To={ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
     >   Age=27826
     >   Content-Type=text/html
     >   X-Cache=cp1079 hit, cp1081 hit/578233
     >   Connection=keep-alive
     >   _maxdepth_=1000
     >   X-Client-IP=108.174.5.114
     >   Date=Wed, 05 May 2021 09:51:42 GMT
     >   nutch.crawl.score=2.0
     >   Accept-Ranges=bytes
     >   nutch.segment.name=20210505173059
     >   Cache-Control=s-maxage=86400, must-revalidate, max-age=3600
     >   NEL={ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
     >   ETag=W/"11e90-5c16aa6d9b068"
     >   Vary=Accept-Encoding
     >   X-LI-Tracking-Id=treeId:AAAAAAAAAAAAAAAAAAAAAA==|ts:1620236128580|cc:14fb413b|sc:490b9459|req:JTwZc4y|src:/10.148.138.11:17671|dst:www.wikipedia.org|principal:hadoop-test
     >   _fst_=33
     > Parse Metadata:
     >   CharEncodingForConversion=utf-8
     >   OriginalCharEncoding=utf-8
     >   _depth_=1
     >   viewport=initial-scale=1,user-scalable=yes
     >   metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
     >   metatag.description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
     >   description=Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
     >   _maxdepth_=1000
     >
     >
     > Thanks
     > Prateek
     >

