Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Semyon Semyonov
Hi Nicholas, I have the same problem with https://www.graydon.nl/ and it doesn't look like a WordPress website. Semyon

Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Yash Thenuan Thenuan
Most probably the problem is that these websites allow only certain crawlers in their robots.txt file.

Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Yash Thenuan Thenuan
You can try checking robots.txt for these websites.

Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Semyon Semyonov
For https://www.graydon.nl/ the robots.txt contains only "User-agent: *" and "Crawl-delay: 10". It doesn't look like they specify any crawler-specific robots rules.
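A quick way to make the same check on any host is to filter its robots.txt down to the lines that decide crawler access. This is only a sketch: the function name robots_rules is made up for illustration, and it lists the directives without evaluating them the way a crawler would.

```shell
# Read robots.txt content on stdin and print only the directive lines that
# affect crawler access: User-agent, Allow, Disallow and Crawl-delay.
# Comments, Sitemap lines and blank lines are dropped.
robots_rules() {
  tr -d '\r' |
    grep -i -E '^[[:space:]]*(user-agent|allow|disallow|crawl-delay)[[:space:]]*:'
}
```

Typical use would be `curl -s https://www.graydon.nl/robots.txt | robots_rules`. If the only group is `User-agent: *` with no Disallow lines, robots.txt is probably not what is blocking the fetch.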

Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-14 Thread Semyon Semyonov
Hi everyone, We are testing the quality of our crawl for one of our domain countries against another public crawling tool (http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs). All the webpages tested via both crawl script and the parsechecker tool for

Block certain parts of HTML code from being indexed

2018-11-14 Thread hany . nasr
Hello All, I am using Nutch 1.15 and am wondering whether there is a feature for blocking certain parts of the HTML code (header and footer) from being indexed. Kind regards, Hany Shehata

RE: Block certain parts of HTML code from being indexed

2018-11-14 Thread Yossi Tamari
Hi Hany, The Tika parser supports Boilerpipe for header and footer removal, but I don't know how well it works. You can test it online at https://boilerpipe-web.appspot.com/

RE: Block certain parts of HTML code from being indexed

2018-11-14 Thread Markus Jelsma
Hello Hany, If you use parse-tika as your HTML parser, you can enable Boilerpipe (see nutch-default). Regards, Markus
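A minimal sketch of what enabling Boilerpipe could look like from the command line. It assumes parse-tika is listed in plugin.includes and is the parser actually chosen for text/html; as far as I know the stock parse-plugins.xml maps HTML to parse-html, so that mapping may need to be changed first. The property names tika.extractor and tika.extractor.boilerpipe.algorithm come from nutch-default.xml; the helper name boilerpipe_opts is hypothetical.

```shell
# Print the -D options that switch parse-tika to Boilerpipe text extraction.
# Optional argument: the Boilerpipe algorithm (nutch-default.xml lists
# DefaultExtractor, ArticleExtractor and CanolaExtractor).
boilerpipe_opts() {
  algorithm="${1:-ArticleExtractor}"
  printf '%s\n' \
    "-Dtika.extractor=boilerpipe" \
    "-Dtika.extractor.boilerpipe.algorithm=${algorithm}"
}
```

For a trial on a single page, something like `bin/nutch parsechecker $(boilerpipe_opts) https://example.com/` should show whether header and footer text disappears from the parsed output. For a real crawl the same two properties would go into nutch-site.xml instead.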

RE: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Markus Jelsma
Hello Nicholas, Your IP might be blocked, or the firewall may simply drop the connection because of your User-Agent name. We have no problems fetching this host. Regards, Markus

Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Sebastian Nagel
Hi Nicholas, looks like it's the user-agent string sent in the HTTP header which makes the server return no/empty content.

    bin/nutch parsechecker \
      -Dhttp.agent.name="mytestbot" \
      -Dhttp.agent.version=3.0 \
      -Dhttp.agent.url=http://example.com/ \
      https://whatdavidread.ca/

Obviously, the defau
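To confirm outside Nutch that the User-Agent header is what the server reacts to, the same URL can be requested with several agent strings and the status codes compared. This is a sketch using curl, not part of Nutch; the function names and the agent strings in the usage line are made up, and the exact User-Agent that Nutch builds from the http.agent.* properties may differ from what you pass here.

```shell
# Print the HTTP status code a URL returns for a given User-Agent string.
# curl reports 000 when no HTTP response was received at all.
status_for_agent() {
  agent="$1"
  url="$2"
  curl -s -o /dev/null -w '%{http_code}' -A "$agent" "$url"
}

# Try several agent strings against one URL and print one
# "status<TAB>agent" line per attempt.
compare_agents() {
  url="$1"
  shift
  for agent in "$@"; do
    printf '%s\t%s\n' "$(status_for_agent "$agent" "$url")" "$agent"
  done
}
```

For example, `compare_agents https://whatdavidread.ca/ "mytestbot/3.0" "Mozilla/5.0"`. If one agent gets a normal status and another gets an error or 000, the server or a firewall in front of it is most likely filtering on the agent name.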

Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Nicholas Roberts
Thanks for this. I was also wondering whether Wordpress has a whitelist or some kind of registration process, or whether they even have business arrangements around search.

Re: Wordpress.com hosted sites fail org.apache.commons.httpclient.NoHttpResponseException

2018-11-14 Thread Nicholas Roberts
Thanks, adding the meta worked.