RE: Block certain parts of HTML code from being indexed

2018-11-15 Thread hany . nasr
Hello Markus, What if I want to remove a specific component or page section? Kind regards, Hany Shehata

Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-15 Thread Sebastian Nagel
Hi Semyon,
> Is there any reasons to keep the default HTML plugin there? only for
> maintenance ?
Are there really HTML pages where parse-html fails? From my experience it still does a good job and parses almost every HTML page, including HTML5. But I've never run any large-scale comparison.

Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-15 Thread Semyon Semyonov
Hi Sebastian, Thanks for the detailed response. I will try to migrate to Tika. Are there any reasons to keep the default HTML plugin there? Only for maintenance? Semyon.

Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-15 Thread Sebastian Nagel
Hi Semyon, I've tried to reproduce your problems using the recent Nutch master (upcoming 1.16). I cannot see any issues, except that JavaScript is not executed, but that is expected. Of course, you are free to use parse-tika instead of parse-html, which is legacy. See results below. Best, Sebastian

Re: Quality problems of crawling. Parsing(Missing attribute name), fetching(empty body) and javascript.

2018-11-15 Thread Semyon Semyonov
Ok, with parsing it is more or less clear (in theory): Nutch uses some kind of legacy of the ancients for parsing. The error comes from both parsers available for HTML:
private DocumentFragment parse(InputSource input) throws Exception {
  if (parserImpl.equalsIgnoreCase("tagsoup"))