Yeah, Neko is not the most straightforward, but it works. Sorry, the code is somewhere... can't look for it now. But you could also look at LARM under the Lucene Sandbox; it's got a nice HTML parser, too.

Otis

--- Leo Galambos <[EMAIL PROTECTED]> wrote:
> So, I have tried this with Lucene:
> 1) original JavaCC LL(k) HTML parser
> 2) SWING's HTML parser
>
> In case of (1) I could process about 300K of HTML documents. In case of
> (2) more than 400K.
>
> But I cannot process complete collection (5M) and finish my hard stress
> tests of Lucene.
>
> Is there anyone who has HTML parser that really works with Lucene? :) If
> you think that you have one, please let me know. I wanted to try Neko, but
> it looks complicated and I do not want to affect the results by ``robust''
> parser.
>
> THX
>
> -g-
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>