2014-05-03 20:04 GMT+03:00 Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>:
> Hi Talat,
>
> On Sat, May 3, 2014 at 4:35 AM, <dev-digest-h...@nutch.apache.org> wrote:
>>
>> The parser plugin currently in use, nekohtml, doesn't parse correctly.
>
> What is wrong with it? Are there any issues in Jira to back this up?
>
>> When I tested it on a huge website, it left HTML tags behind.
>
> Pretty vague. Anything else? Any more details? Can this be implemented in
> existing parser plugins?
>
>> IMHO our parser is a little bit old.
>
> Which one? Is it possible to upgrade? I don't know which parser you mean.
>
>> After doing some research, I found the Jsoup[1] and Gumbo[2] parsers.
>> I did some tests on broken websites. I saw that Gumbo and Jsoup parsed
>> them very similarly to Google's parser.
>
> So what are the benefits? If we have a clear-cut argument then let's go for
> it. If not then maybe your time would be better invested elsewhere. It's up
> to you I suppose :)
>
--
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
LinkedIn: http://tr.linkedin.com/pub/talat-uyarer/10/142/304