Hi Semyon,

> Is there any reasons to keep the default HTML plugin there? only for
> maintenance ?
Are there really HTML pages where parse-html fails? From my experience it still does
a good job and parses almost every HTML page, including HTML5. But I've never run any
large-scale comparison.

One argument in favour of keeping it: it's much smaller. While parse-tika including
dependencies uses around 60 MB, parse-html ships with only a few hundred kB.

Regarding http://www.vialucy.nl/ : if the noindex is removed, the page is parsed well
by both parse-tika and parse-html, and the outputs differ only in white space in the
parsed text.

Of course, for the long term parse-html should either be actively maintained or be
dropped.
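If you do migrate, it should be enough to remove parse-html from plugin.includes in
conf/nutch-site.xml so that parse-tika also handles text/html. A rough sketch (the
value below is only an example, not the real default list; keep whatever protocol,
indexing and scoring plugins your crawl already uses):

  <property>
    <name>plugin.includes</name>
    <!-- example only: parse-tika is listed, parse-html is not, so Tika
         takes over text/html; the remaining plugins are placeholders for
         whatever your crawl actually needs -->
    <value>protocol-okhttp|urlfilter-regex|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

You can check the effect up front with parsechecker and the same list passed via
-Dplugin.includes=..., as in the runs quoted below. If text/html still ends up
unparsed, the mapping in conf/parse-plugins.xml is the other place to look.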
Best,
Sebastian

On 11/15/18 2:39 PM, Semyon Semyonov wrote:
> Hi Sebastian,
>
> Thanks for the detailed response.
> I will try to migrate to Tika.
>
> Is there any reasons to keep the default HTML plugin there? only for
> maintenance ?
>
> Semyon.
>
> Sent: Thursday, November 15, 2018 at 2:23 PM
> From: "Sebastian Nagel" <wastl.na...@googlemail.com.INVALID>
> To: user@nutch.apache.org
> Subject: Re: Quality problems of crawling. Parsing (Missing attribute name),
> fetching (empty body) and javascript.
> Hi Semyon,
>
> I've tried to reproduce your problems using the recent Nutch master (upcoming 1.16).
> I cannot see any issues, except that Javascript is not executed, but that's clear.
> Of course, you are free to use parse-tika instead of parse-html, which is legacy.
> See results below.
>
> Best,
> Sebastian
>
>> http://www.vialucy.nl/
>
> Successfully fetched and parsed (no errors). Of course, there is no content kept
> because of robots=noindex. Here is the output of parsechecker:
>
> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' -dumpText http://www.vialucy.nl/
> ...
> Parse Metadata:
> dc:title=Vialucy | nieuws uit Les Vans – Ardêche – France
> Content-Encoding=UTF-8
> generator=WordPress 3.1
> robots=noindex,nofollow
> Content-Language=en-US
> Content-Type=text/html; charset=UTF-8
>
>
>> https://www.vishandelbunschoten.nl/
>
> Succeeds if you can trick the anti-bot software, otherwise the server sends
> empty content back. Recently discussed on this list.
>
>
>> 3) Javascript problems
>>
>> http://www.amphar.com/Home.html
>
> Yes, Javascript is not executed. But fetching and parsing work pretty well
> for the HTML page as such:
>
> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
>     -dumpText http://www.amphar.com/Home.html
> fetching: http://www.amphar.com/Home.html
> ...
> Status: success(1,0)
> Title: Home
> Outlinks: 19
> ...
> Parse Metadata:
> iWeb-Build=local-build-20140815
> X-UA-Compatible=IE=EmulateIE7
> viewport=width=700
> dc:title=Home
> Content-Encoding=UTF-8
> Content-Type-Hint=text/html; charset=UTF-8
> Content-Language=en
> Content-Type=application/xhtml+xml; charset=UTF-8
> Generator=iWeb 3.0.4
>
> Founded in 1975, Amphar B.V. provides solutions, services and support to the
> generic pharmaceutical industry.
> Headquartered in Amsterdam, The Netherlands, we assist our customers in
> identifying and developing new products, carefully select or initiate
> appropriate sources for Active Pharmaceutical Ingredients (APIs), develop and
> test formulations as well as compilation and submission of the required
> regulatory documentation and data.
> With our dedicated staff of experienced professionals and our logistics centre
> at Amsterdam Schiphol International Airport, we are well positioned to
> anticipate and react swiftly to the dynamic requirements of our customers.
> Amphar B.V.
>
>
> On 11/15/18 1:30 PM, Semyon Semyonov wrote:
>> Ok, with parsing it is more or less clear (in theory): Nutch uses some kind
>> of legacy of the ancients for parsing.
>>
>> The error comes from both parsers available for HTML:
>>
>>   private DocumentFragment parse(InputSource input) throws Exception {
>>     if (parserImpl.equalsIgnoreCase("tagsoup"))
>>       return parseTagSoup(input);
>>     else
>>       return parseNeko(input);
>>   }
>>
>> Neko and TagSoup have both been dead for 4+ years
>> (https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers#cite_note-1).
>> If I try to parse it online with one of the modern parsers such as
>> https://jsoup.org/ it works fine.
>>
>> Very amazing considering the fact that it is THE core part of any parser.
>>
>>
>> Sent: Wednesday, November 14, 2018 at 3:32 PM
>> From: "Semyon Semyonov" <semyon.semyo...@mail.com>
>> To: user@nutch.apache.org
>> Subject: Quality problems of crawling. Parsing (Missing attribute name),
>> fetching (empty body) and javascript.
>> Hi everyone,
>>
>> We are testing the quality of our crawl for one of our domain countries
>> against another public crawling tool
>> (http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs).
>> All the webpages were tested via both the crawl script and the parsechecker
>> tool, for both the Tika and the default HTML plugin.
>>
>> The results are not very good compared to that tool; I would appreciate it if
>> you could give me a hint.
>>
>> I classify several types of problems:
>>
>> 1) Parsing problems
>>
>> http://www.vialucy.nl/
>> During parsing I get a bunch of messages such as "[Error] :4:23: Missing
>> attribute name" and as a result I get an empty page back.
>>
>>
>> 2) Fetching problems
>>
>> https://www.vishandelbunschoten.nl/
>> The fetch returns HTTP/1.1 200 OK in the header but an empty body.
>>
>>
>> 3) Javascript problems
>>
>> http://www.amphar.com/Home.html
>>
>> Returns an effectively empty body because of JavaScript:
>>
>> <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD
>> XHTML 1.0 Transitional//EN"
>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
>> xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta
>> http-equiv="refresh" content="0;url= Home.html" /></head><body></body></html>
>>
>> Another example: https://www.sizo.com/
>>
>> How to crawl these JavaScript websites? An activation of Tika javascript
>> doesn't help.
>>
>>
>> Thanks.
>>
>> Semyon.