Gregor J. Rothfuss wrote:
Robert Goene wrote:
I am trying to extend the current HTMLParser of lenya 1.2.1 to support keywords.
that is some of the nastiest code in lenya as you might have figured out by now. if i recall correctly, that code is auto generated by a parser generator and is almost illegible. i tried to document things a little bit at
I removed the remark from my email that it looked like generated code, just in case it would insult someone :)
http://lenya.apache.org/apidocs/1.4/org/apache/lenya/lucene/html/HTMLParser.html
michi is apparently working on replacing that custom crawler with the nutch codebase, which should hopefully be easier to deal with:
http://incubator.apache.org/nutch/apidocs/index.html
michi, why not do your experiments in the sandbox?
Is there an XML parser for Lucene somewhere? Should be fairly easy. The documents that I am indexing are XHTML, so there is no need for a parser that can handle malformed HTML files.
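For what it's worth, here is a rough sketch of what such an XML-based extractor could look like, assuming the input really is well-formed XHTML. It uses only the JDK's SAX parser and pulls out the title, the body text, and the content of a meta "keywords" element. The class name XhtmlFieldExtractor, the method extract, and the field names are made up for illustration; none of this is existing Lenya or Lucene code, and turning the resulting map into Lucene fields is left out on purpose.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Lenya or Lucene: extracts indexable
// fields from a well-formed XHTML document with a plain SAX parser.
final class XhtmlFieldExtractor {

    static Map<String, String> extract(String xhtml) throws Exception {
        final StringBuilder title = new StringBuilder();
        final StringBuilder body = new StringBuilder();
        final StringBuilder keywords = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inTitle;
            private boolean inBody;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("title".equals(local)) {
                    inTitle = true;
                } else if ("body".equals(local)) {
                    inBody = true;
                } else if ("meta".equals(local)
                        && "keywords".equalsIgnoreCase(atts.getValue("name"))) {
                    // <meta name="keywords" content="..."/>
                    String content = atts.getValue("content");
                    if (content != null) {
                        keywords.append(content);
                    }
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (inBody) {
                    // Keep text of adjacent elements from running together.
                    body.append(' ');
                }
                if ("title".equals(local)) {
                    inTitle = false;
                } else if ("body".equals(local)) {
                    inBody = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTitle) {
                    title.append(ch, start, length);
                } else if (inBody) {
                    body.append(ch, start, length);
                }
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Assumption: we do not want to fetch the XHTML DTD over the
                // network, so every external entity resolves to empty input.
                return new InputSource(new StringReader(""));
            }
        };

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so the local name is "title", "meta", ...
        SAXParser parser = factory.newSAXParser();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", normalize(title));
        fields.put("keywords", normalize(keywords));
        fields.put("contents", normalize(body));
        return fields;
    }

    // Collapse runs of whitespace and trim the ends.
    private static String normalize(CharSequence text) {
        return text.toString().replaceAll("\\s+", " ").trim();
    }
}
```

Calling XhtmlFieldExtractor.extract(xhtmlString) returns a map with "title", "keywords" and "contents" entries, which could then be copied into whatever index fields the search setup expects. A document that is not well-formed makes the parser throw, which is the trade-off of dropping the tolerant HTML parser.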
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
