2014-05-03 20:04 GMT+03:00 Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>:
> Hi Talat,
>
> On Sat, May 3, 2014 at 4:35 AM, <dev-digest-h...@nutch.apache.org> wrote:
>>
>>
>> The parser plugin currently in use, nekohtml, doesn't parse correctly.
>
>
> What is wrong with it? Are there any issues in Jira to back this up?
>
>>
>> When I tested it
>> on a huge website, it left HTML tags behind.
>
>
> Pretty vague. Anything else? Any more details? Can this be implemented in
> existing parser plugins?
>
>>
>> IMHO our parser is a little
>> bit old.
>
>
> Which one? Is it possible to upgrade? I don't know which parser you mean.
>
>>
>> After doing some research, I found the Jsoup[1] and Gumbo[2]
>> parsers. I ran some tests on broken websites. I saw that Gumbo and Jsoup
>> parsed them very similarly to Google's parser.
>>
> So what are the benefits? If we have a clear-cut argument then let's go for
> it. If not then maybe your time would be better invested elsewhere. It's up
> to you I suppose :)
>



-- 
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
