So, I have tried this with Lucene using two HTML parsers:
1) the original JavaCC LL(k) HTML parser
2) Swing's HTML parser
In case (1) I could process about 300K HTML documents; in case (2), more than 400K. But I cannot process the complete collection (5M) and finish my hard stress tests of Lucene. Is there anyone who has an HTML parser that really works with Lucene? :) If you think you have one, please let me know. I wanted to try Neko, but it looks complicated, and I do not want to affect the results by using a "robust" parser.

Thanks,
-g-