Hello Markus,
What if I want to remove a specific component or page section?
Kind regards,
Hany Shehata
Solutions Architect, Marketing and Communications IT
Corporate Functions | HSBC Operations, Services and Technology (HOST)
ul. Kapelanka 42A, 30-347 Kraków, Poland
Hi Semyon,
> Are there any reasons to keep the default HTML plugin there? Only for
> maintenance?
Are there really HTML pages where parse-html fails?
From my experience it still does a good job and parses almost every HTML page,
including HTML5. But I've never run any large scale comparison.
Hi Sebastian,
Thanks for the detailed response.
I will try to migrate to Tika.
Are there any reasons to keep the default HTML plugin there? Only for
maintenance?
Semyon.
Sent: Thursday, November 15, 2018 at 2:23 PM
From: "Sebastian Nagel"
To: user@nutch.apache.org
Subject: Re: Quality
Hi Semyon,
I've tried to reproduce your problems using the recent Nutch master (upcoming
1.16).
I cannot see any issues, except that JavaScript is not executed, but that's
expected.
Of course, you are free to use parse-tika instead of parse-html which is legacy.
See results below.
Best,
Sebastian
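
For readers following the suggestion above to use parse-tika instead of parse-html: in Nutch 1.x the active plugins are selected by the plugin.includes property, usually overridden in conf/nutch-site.xml. The fragment below is only a sketch under assumptions. The plugin list is illustrative and not taken from this thread, so adapt it to the plugins your installation already uses. The point is simply that parse-html is left out and parse-tika is kept.

```xml
<!-- conf/nutch-site.xml (sketch; the plugin list is an assumed example) -->
<property>
  <name>plugin.includes</name>
  <!-- parse-html is intentionally absent, so HTML is left to parse-tika -->
  <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

Depending on the setup, the text/html mapping in conf/parse-plugins.xml may also need to point to parse-tika; check the parser log output after the change to confirm which parser actually handles HTML.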
OK, with parsing it is more or less clear (in theory): Nutch uses some kind of
legacy of the ancients for parsing.
The error comes from both parsers available for HTML:
private DocumentFragment parse(InputSource input) throws Exception {
if (parserImpl.equalsIgnoreCase("tagsoup"))