Hi Semyon,

> Is there any reasons to keep the default HTML plugin there? only for
> maintenance ?
Are there really HTML pages where parse-html fails? From my experience it still does
a good job and parses almost every HTML page, including HTML5. But I've never run any
large-scale comparison.

One argument in favour of keeping it: it's much smaller. While parse-tika including
dependencies uses around 60 MB, parse-html ships with only a few hundred kB.

Regarding http://www.vialucy.nl/ : if the noindex is removed, the page is parsed well
by both parse-tika and parse-html, and the outputs differ only in white space in the
parsed text.

Of course, for the long term parse-html should either be actively maintained or be
dropped.
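If you do migrate, it should be enough to remove parse-html from plugin.includes in
conf/nutch-site.xml so that parse-tika also handles text/html. A rough sketch (the
value below is only an example, not the real default list; keep whatever protocol,
indexing and scoring plugins your crawl already uses):

  <property>
    <name>plugin.includes</name>
    <!-- example only: parse-tika is listed, parse-html is not, so Tika
         takes over text/html; the remaining plugins are placeholders for
         whatever your crawl actually needs -->
    <value>protocol-okhttp|urlfilter-regex|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

You can check the effect up front with parsechecker and the same list passed via
-Dplugin.includes=..., as in the runs quoted below. If text/html still ends up
unparsed, the mapping in conf/parse-plugins.xml is the other place to look.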
Best,
Sebastian

On 11/15/18 2:39 PM, Semyon Semyonov wrote:
> Hi Sebastian,
>
> Thanks for the detailed response.
> I will try to migrate to Tika.
>
> Is there any reasons to keep the default HTML plugin there? only for
> maintenance ?
>
> Semyon.
>
> Sent: Thursday, November 15, 2018 at 2:23 PM
> From: "Sebastian Nagel" <wastl.na...@googlemail.com.INVALID>
> To: user@nutch.apache.org
> Subject: Re: Quality problems of crawling. Parsing (Missing attribute name),
> fetching (empty body) and javascript.
> Hi Semyon,
>
> I've tried to reproduce your problems using the recent Nutch master (upcoming 1.16).
> I cannot see any issues, except that Javascript is not executed, but that's clear.
> Of course, you are free to use parse-tika instead of parse-html, which is legacy.
> See results below.
>
> Best,
> Sebastian
>
>> http://www.vialucy.nl/
>
> Successfully fetched and parsed (no errors). Of course, there is no content kept
> because of robots=noindex. Here is the output of parsechecker:
>
> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' -dumpText http://www.vialucy.nl/
> ...
> Parse Metadata:
> dc:title=Vialucy | nieuws uit Les Vans – Ardêche – France
> Content-Encoding=UTF-8
> generator=WordPress 3.1
> robots=noindex,nofollow
> Content-Language=en-US
> Content-Type=text/html; charset=UTF-8
>
>
>> https://www.vishandelbunschoten.nl/
>
> Succeeds if you can trick the anti-bot software, otherwise the server sends
> empty content back. Recently discussed on this list.
>
>
>> 3) Javascript problems
>>
>> http://www.amphar.com/Home.html
>
> Yes, Javascript is not executed. But fetching and parsing work pretty well
> for the HTML page as such:
>
> % bin/nutch parsechecker -Dplugin.includes='protocol-okhttp|parse-tika' \
>     -dumpText http://www.amphar.com/Home.html
> fetching: http://www.amphar.com/Home.html
> ...
> Status: success(1,0)
> Title: Home
> Outlinks: 19
> ...
> Parse Metadata:
> iWeb-Build=local-build-20140815
> X-UA-Compatible=IE=EmulateIE7
> viewport=width=700
> dc:title=Home
> Content-Encoding=UTF-8
> Content-Type-Hint=text/html; charset=UTF-8
> Content-Language=en
> Content-Type=application/xhtml+xml; charset=UTF-8
> Generator=iWeb 3.0.4
>
> Founded in 1975, Amphar B.V. provides solutions, services and support to the
> generic pharmaceutical industry.
> Headquartered in Amsterdam, The Netherlands, we assist our customers in
> identifying and developing new products, carefully select or initiate
> appropriate sources for Active Pharmaceutical Ingredients (APIs), develop and
> test formulations as well as compilation and submission of the required
> regulatory documentation and data.
> With our dedicated staff of experienced professionals and our logistics centre
> at Amsterdam Schiphol International Airport, we are well positioned to
> anticipate and react swiftly to the dynamic requirements of our customers.
> Amphar B.V.
>
>
> On 11/15/18 1:30 PM, Semyon Semyonov wrote:
>> Ok, with parsing it is more or less clear (in theory): Nutch uses some kind
>> of legacy of the ancients for parsing.
>>
>> The error comes from both parsers available for HTML:
>>
>>   private DocumentFragment parse(InputSource input) throws Exception {
>>     if (parserImpl.equalsIgnoreCase("tagsoup"))
>>       return parseTagSoup(input);
>>     else
>>       return parseNeko(input);
>>   }
>>
>> Neko and TagSoup have both been dead for 4+ years
>> (https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers#cite_note-1).
>> If I try to parse it online with one of the modern parsers such as
>> https://jsoup.org/ it works fine.
>>
>> Very amazing considering the fact that it is THE core part of any parser.
>>
>>
>> Sent: Wednesday, November 14, 2018 at 3:32 PM
>> From: "Semyon Semyonov" <semyon.semyo...@mail.com>
>> To: user@nutch.apache.org
>> Subject: Quality problems of crawling. Parsing (Missing attribute name),
>> fetching (empty body) and javascript.
>> Hi everyone,
>>
>> We are testing the quality of our crawl for one of our domain countries
>> against another public crawling tool
>> (http://tools.seochat.com/tools/online-crawl-google-sitemap-generator/#sthash.ZgdDhqwy.dpbs).
>> All the webpages were tested via both the crawl script and the parsechecker
>> tool, for both the Tika and the default HTML plugin.
>>
>> The results are not very good compared to that tool; I would appreciate it if
>> you could give me a hint.
>>
>> I classify several types of problems:
>>
>> 1) Parsing problems
>>
>> http://www.vialucy.nl/
>> During parsing I get a bunch of messages such as "[Error] :4:23: Missing
>> attribute name" and as a result I get an empty page back.
>>
>>
>> 2) Fetching problems
>>
>> https://www.vishandelbunschoten.nl/
>> The fetch returns HTTP/1.1 200 OK in the header but an empty body.
>>
>>
>> 3) Javascript problems
>>
>> http://www.amphar.com/Home.html
>>
>> Returns an effectively empty body because of JavaScript:
>>
>> <?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD
>> XHTML 1.0 Transitional//EN"
>> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
>> xmlns="http://www.w3.org/1999/xhtml"><head><title></title><meta
>> http-equiv="refresh" content="0;url= Home.html" /></head><body></body></html>
>>
>> Another example: https://www.sizo.com/
>>
>> How to crawl these JavaScript websites? An activation of Tika javascript
>> doesn't help.
>>
>>
>> Thanks.
>>
>> Semyon.