[Nutch-dev] RE: Nutch Improvement - HTML Parser

Fuad Efendi Sat, 25 Feb 2006 00:35:25 -0800

It's not a tool.
It IS the stupidity of Nutch: it builds a DOM just to extract plain text and
Outlink[]...
It's very easy to design a specific routine to 'parse' byte[] directly; we could
improve everything 100 times... at least!
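
To illustrate the kind of routine meant here, below is a minimal sketch of a single-pass scanner that pulls links and plain text straight out of a byte[] without building a DOM. It is not Nutch code and not the htmlparser lexer: the class name LightweightHtmlScanner, its fields, and the UTF-8 assumption are all hypothetical, and it deliberately ignores character entities, '>' inside attribute values, and relative URL resolution.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical single-pass scanner: collects links and text, no DOM. */
public class LightweightHtmlScanner {
    // Matches href="x", href='x', or href=x inside the body of a tag.
    private static final Pattern HREF = Pattern.compile(
            "\\bhref\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
            Pattern.CASE_INSENSITIVE);

    public final List<String> outlinks = new ArrayList<>();
    public final StringBuilder text = new StringBuilder();

    public static LightweightHtmlScanner scan(byte[] content) {
        // Assumption: the page is UTF-8; a real crawler would detect the charset.
        String html = new String(content, StandardCharsets.UTF_8);
        LightweightHtmlScanner result = new LightweightHtmlScanner();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c != '<') {              // ordinary character: part of the text
                result.text.append(c);
                i++;
                continue;
            }
            if (html.startsWith("<!--", i)) {   // skip comments entirely
                int end = html.indexOf("-->", i + 4);
                i = (end < 0) ? html.length() : end + 3;
                continue;
            }
            int close = html.indexOf('>', i);
            if (close < 0) break;        // unterminated tag: ignore the rest
            String tag = html.substring(i + 1, close);
            String name = tagName(tag);
            if (name.equals("a")) {      // anchor: record its href, if any
                Matcher m = HREF.matcher(tag);
                if (m.find()) {
                    String url = m.group(1) != null ? m.group(1)
                            : m.group(2) != null ? m.group(2) : m.group(3);
                    result.outlinks.add(url);
                }
            }
            i = close + 1;
            if (name.equals("script") || name.equals("style")) {
                // Jump to the matching close tag so code is not counted as text.
                int end = indexOfIgnoreCase(html, "</" + name, i);
                i = (end < 0) ? html.length() : end;
            }
            result.text.append(' ');     // a tag separates words
        }
        return result;
    }

    /** Leading letters/digits of the tag body, lowercased ("" for close tags). */
    private static String tagName(String tag) {
        int end = 0;
        while (end < tag.length() && Character.isLetterOrDigit(tag.charAt(end))) end++;
        return tag.substring(0, end).toLowerCase(Locale.ROOT);
    }

    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    /** Collected text with runs of whitespace collapsed to single spaces. */
    public String plainText() {
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Calling LightweightHtmlScanner.scan(bytes) and then reading outlinks and plainText() gives the two things the crawler needs. Whether this is really faster than the DOM route would have to be measured.
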
So, now I understand exactly what OpenSourceApache is, especially by looking
at comments in the MAIN method of Tomcat ;)))
Regards,
Fuad



-----Original Message-----
From: Elwin [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 18, 2006 12:47 AM
To: [email protected]
Subject: Re: Nutch Improvement - HTML Parser


So the tool you use can extract outlinks and plain text without accessing the DOM?

2006/2/18, Fuad Efendi <[EMAIL PROTECTED]>:
>
> I am using http://htmlparser.sourceforge.net for my Data Mining engine.
> It has a 'lexer' package that is lightweight, and I don't need to perform ANY
> html/xml error checking etc. It's a lightweight low-level 'parser'; it is
> not a full parser, it is not DOM, SAX, etc. We do not need to create a DOM to
> extract Outlink[] and plain text.
> What about licensing?
>
> We can develop our own low-level HTML (InputSource) processing engine from
> scratch; we need only Outlink[] and PlainText.
>
>


--
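"The Final Combat" (《盖世豪侠》) drew rave reviews and kept TVB's ratings high,
yet for all its delight TVB still gave him no major roles. Stephen Chow was
never one to stay in a small pond: with his comedic talent now on show, he
would not accept being sidelined, so he moved into film to shine on the big
screen. Having gained a thousand-li horse and then lost it, TVB could only
regret it too late.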



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
