Hello Nicholas,

Your IP might be blocked, or the firewall just drops the connection due to your User-Agent name. We have no problems fetching this host.
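One way to tell these two causes apart is to request the same URL from the crawl host with two different User-Agent strings and compare the outcomes. The following is only a minimal sketch using the Python standard library. The names `AGENTS`, `build_request` and `probe`, and both agent strings, are hypothetical and are not part of Nutch; replace the crawler string with the value your crawler actually sends.

```python
import urllib.error
import urllib.request

# Hypothetical agent strings, used only for comparison.
AGENTS = {
    "crawler": "MyNutchCrawler/1.15",
    "browser": "Mozilla/5.0 (X11; Linux x86_64)",
}


def build_request(url, user_agent):
    """Return a GET request for url that carries the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe(url, agents=AGENTS, timeout=10, opener=urllib.request.urlopen):
    """Fetch url once per agent and record the HTTP status or the error type."""
    results = {}
    for label, agent in agents.items():
        try:
            with opener(build_request(url, agent), timeout=timeout) as resp:
                results[label] = "HTTP %d" % resp.status
        except urllib.error.HTTPError as exc:
            # The server answered, but with an error status such as 403.
            results[label] = "HTTP %d" % exc.code
        except Exception as exc:
            # No answer at all: dropped connection, timeout, TLS failure.
            results[label] = type(exc).__name__
    return results
```

Calling `probe("https://whatdavidread.ca/")` from the crawl host returns one entry per agent. If both entries fail the same way, an IP-level block is the more likely explanation; if only the crawler entry fails, the User-Agent is the more likely one. Neither result is conclusive on its own.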
Regards,
Markus

-----Original message-----
> From: Nicholas Roberts <niccolo.robe...@gmail.com>
> Sent: Wednesday 14th November 2018 7:58
> To: user@nutch.apache.org
> Subject: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException
>
> hi
>
> I am setting up a new crawler with Nutch 1.15 and am having problems only
> with Wordpress.com hosted sites
>
> I can crawl other https sites no problems
>
> Wordpress sites can be crawled on other hosts, but I think there is a
> problem with the SSL certs at Wordpress.com
>
> I get this error
>
> FetcherThread 43 fetch of https://whatdavidread.ca/ failed with:
> org.apache.commons.httpclient.NoHttpResponseException: The server
> whatdavidread.ca failed to respond
> FetcherThread 43 has no more work available
>
> there seems to be two layers of SSL certs
>
> first there is a Letsencrypt cert, with many domains, including the one
> above, and the tls.auttomatic.com domain
>
> then, underlying the Lets Encrypt cert, there is a *.wordpress.com cert
> from Comodo
>
> Certificate chain
>  0 s:/OU=Domain Control Validated/OU=EssentialSSL Wildcard/CN=*.wordpress.com
>    i:/C=GB/ST=Greater Manchester/L=Salford/O=COMODO CA Limited/CN=COMODO RSA Domain Validation Secure Server CA
>
> I can crawl other https sites no problems
>
> I have tried the NUTCH_OPTS=($NUTCH_OPTS -Dhadoop.log.dir="$NUTCH_LOG_DIR"
> -Djsse.enableSNIExtension=false) and no joy
>
> my nutch-site.xml
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|protocol-httpclient|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|urlfilter-domainblacklist</value>
>   <description>
>   </description>
> </property>
>
> thanks for the consideration
> --
> Nicholas Roberts
> www.niccolox.org