Jingkang Zhang wrote: >Three HTML parsers(Lucene web application >demo,CyberNeko HTML Parser,JTidy) are mentioned in >Lucene FAQ >1.3.27.Which is the best?Can it filter tags that are >auto-created by MS-word 'Save As HTML files' function? > >
maybe you can try this library... http://htmlparser.sourceforge.net/ I use the following code to get the text from HTML files, it was not intensively tested, but it works. import org.htmlparser.Node; import org.htmlparser.Parser; import org.htmlparser.util.NodeIterator; import org.htmlparser.util.Translate; Parser parser = new Parser(source.getAbsolutePath()); NodeIterator iter = parser.elements(); while (iter.hasMoreNodes()) { Node element = (Node) iter.nextNode(); //System.out.println("1:" + element.getText()); String text = Translate.decode(element.toPlainTextString()); if (Utils.notEmptyString(text)) writer.write(text); } Sergiu >_________________________________________________________ >Do You Yahoo!? >150万曲MP3疯狂搜,带您闯入音乐殿堂 >http://music.yisou.com/ >美女明星应有尽有,搜遍美图、艳图和酷图 >http://image.yisou.com >1G就是1000兆,雅虎电邮自助扩容! >http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/ > >--------------------------------------------------------------------- >To unsubscribe, e-mail: [EMAIL PROTECTED] >For additional commands, e-mail: [EMAIL PROTECTED] > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]