It's not about the tool; it IS the stupidity of Nutch: it builds a DOM just to extract plain text and Outlink[]... It's very easy to design a specific routine that 'parses' the byte[] directly, and we could improve everything 100 times... at least!

So now I understand exactly what open-source Apache is, especially after looking at the comments in the MAIN method of Tomcat ;)))

Regards,
Fuad
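
To illustrate the claim above that plain text and outlinks can be pulled from a byte[] without building a DOM, here is a minimal sketch of a single-pass scanner. Everything in it is an assumption made for illustration: the class LightweightHtmlScan, the Result holder and the scan method are hypothetical names, plain href strings stand in for Nutch's Outlink[], and the input is assumed to be UTF-8. It is not the htmlparser lexer and not Nutch code. It does not decode entities, and it treats the first '>' as the end of a tag, so a '>' inside a quoted attribute value would confuse it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one pass over the page, no DOM, no error checking.
final class LightweightHtmlScan {

    // Holds the two things the thread asks for: plain text and link targets.
    static final class Result {
        final String text;
        final List<String> links;
        Result(String text, List<String> links) { this.text = text; this.links = links; }
    }

    static Result scan(byte[] content) {
        String html = new String(content, StandardCharsets.UTF_8); // assumption: UTF-8 input
        StringBuilder text = new StringBuilder();
        List<String> links = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c != '<') { text.append(c); i++; continue; }
            if (html.startsWith("<!--", i)) {             // skip comments entirely
                int close = html.indexOf("-->", i + 4);
                if (close < 0) break;
                i = close + 3;
                continue;
            }
            int end = html.indexOf('>', i);
            if (end < 0) break;                           // unterminated tag: stop scanning
            String tag = html.substring(i + 1, end);
            i = end + 1;
            text.append(' ');                             // a tag acts as a word boundary
            if (isTag(tag, "a")) {
                String href = attribute(tag, "href");
                if (href != null) links.add(href);
            } else if (isTag(tag, "script") || isTag(tag, "style")) {
                String name = isTag(tag, "script") ? "</script" : "</style";
                int close = indexOfIgnoreCase(html, name, i);
                int closeEnd = close < 0 ? -1 : html.indexOf('>', close);
                if (closeEnd < 0) break;
                i = closeEnd + 1;                         // drop the element body from the text
            }
        }
        return new Result(text.toString().replaceAll("\\s+", " ").trim(), links);
    }

    // True if the tag content starts with the given element name.
    private static boolean isTag(String tag, String name) {
        if (!tag.regionMatches(true, 0, name, 0, name.length())) return false;
        return tag.length() == name.length()
                || Character.isWhitespace(tag.charAt(name.length()));
    }

    // Returns the value of the first occurrence of name=..., quoted or not.
    private static String attribute(String tag, String name) {
        int at = indexOfIgnoreCase(tag, name, 0);
        if (at < 0) return null;
        int p = skipSpace(tag, at + name.length());
        if (p >= tag.length() || tag.charAt(p) != '=') return null;
        p = skipSpace(tag, p + 1);
        if (p >= tag.length()) return null;
        char q = tag.charAt(p);
        if (q == '"' || q == '\'') {
            int close = tag.indexOf(q, p + 1);
            return close < 0 ? null : tag.substring(p + 1, close);
        }
        int stop = p;
        while (stop < tag.length() && !Character.isWhitespace(tag.charAt(stop))) stop++;
        return tag.substring(p, stop);
    }

    private static int skipSpace(String s, int p) {
        while (p < s.length() && Character.isWhitespace(s.charAt(p))) p++;
        return p;
    }

    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = from; i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }
}
```

Calling LightweightHtmlScan.scan(bytes) returns the collapsed text and the raw href values in document order; resolving relative links and building real Outlink objects would still be up to the caller.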

-----Original Message-----
From: Elwin [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 18, 2006 12:47 AM
To: [email protected]
Subject: Re: Nutch Improvement - HTML Parser

So the tool you use can extract outlinks and plain text without accessing the DOM?

2006/2/18, Fuad Efendi <[EMAIL PROTECTED]>:
>
> I am using http://htmlparser.sourceforge.net for my Data Mining engine.
> It has a lightweight 'lexer' package, and I don't need to perform ANY
> HTML/XML error checking etc. It is a lightweight, low-level 'parser': it is
> not a full parser, and it is not DOM, SAX, etc. We do not need to create a
> DOM to extract Outlink[] or to extract plain text.
> What about licensing?
>
> We can develop our own low-level HTML (InputSource) processing engine from
> scratch; we need only Outlink[] and PlainText.
>
>

--
《盖世豪侠》 (The Final Combat) won rave reviews and kept TVB's ratings high, yet TVB, pleased as it was, still did not give him major roles. Stephen Chow was never one to stay in a small pond: once his comic talent had shown itself, he would not accept being sidelined, so he moved into film and made his mark on the big screen. TVB had found a thousand-li horse and then lost it, and naturally came to regret it.

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
