Re: HTMLParser

Robert Goene Thu, 12 May 2005 15:22:40 -0700

Gregor J. Rothfuss wrote:

Robert Goene wrote:
I am trying to extend the current HTMLParser of Lenya 1.2.1 to support keywords.
That is some of the nastiest code in Lenya, as you might have figured out by now. If I recall correctly, that code is auto-generated by a parser generator and is almost illegible. I tried to document things a little bit at

I removed the remark from my email that it looked like generated code, just in case it would insult someone :)

http://lenya.apache.org/apidocs/1.4/org/apache/lenya/lucene/html/HTMLParser.html

Michi is apparently working on replacing that custom crawler with the Nutch codebase, which should hopefully be easier to deal with:
http://incubator.apache.org/nutch/apidocs/index.html
Michi, why not do your experiments in the sandbox?

Is there an XML parser for Lucene somewhere? It should be fairly easy. The documents that I am indexing are XHTML, so there is no need for a parser that can handle malformed HTML files.
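
To illustrate the idea: since the input is well-formed XHTML, a plain SAX handler from the JDK is probably enough to pull out the title, the keywords meta tag and the body text. The following is only a rough sketch under that assumption. The class name XhtmlTextExtractor and its methods are made up for this example and are not part of Lenya or Lucene, and the Lucene side (putting the strings into fields of a document) is left out on purpose.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper for this example; not part of Lenya or Lucene. */
class XhtmlTextExtractor extends DefaultHandler {
    final StringBuilder title = new StringBuilder();
    final StringBuilder body = new StringBuilder();
    String keywords = "";
    private boolean inTitle;
    private boolean inBody;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        if (name.equals("title")) {
            inTitle = true;
        } else if (name.equals("body")) {
            inBody = true;
        } else if (name.equals("meta") && "keywords".equalsIgnoreCase(atts.getValue("name"))) {
            // <meta name="keywords" content="..."/>
            String content = atts.getValue("content");
            if (content != null) {
                keywords = content.trim();
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if (name.equals("title")) {
            inTitle = false;
        } else if (name.equals("body")) {
            inBody = false;
        } else if (inBody) {
            body.append(' '); // keep text of adjacent elements apart
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            title.append(ch, start, length);
        } else if (inBody) {
            body.append(ch, start, length);
        }
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Hand back an empty DTD instead of fetching the XHTML DTD over the
        // network. Side effect: named entities such as &nbsp; stay undeclared.
        return new InputSource(new StringReader(""));
    }

    /** Parses the given XHTML string and returns the filled handler. */
    static XhtmlTextExtractor extract(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            XhtmlTextExtractor handler = new XhtmlTextExtractor();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    /** Trims the text and collapses runs of whitespace into single spaces. */
    static String normalize(CharSequence text) {
        return text.toString().trim().replaceAll("\\s+", " ");
    }
}
```

The title, keywords and normalized body strings could then be handed to whatever code builds the Lucene document. Anything that is not well-formed makes the SAX parser fail, which is acceptable here only because the documents are XHTML.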


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

