I'm new to Lucene and I'm trying to index an HTML file parsed with NekoHTML.

With text between HTML tags, it's easy enough to write an overloaded getText() method that either recursively collects all text, or accepts the name of a tag (like "title") and only finds text between <title></title> tags.
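
For reference, here is a minimal sketch of that approach. It assumes the parser hands back a standard org.w3c.dom tree, which NekoHTML's DOMParser does. The class name TextExtractor and its private helpers are illustrative only, and the demo in main() uses the JDK's built-in XML parser on well-formed markup purely so the sketch runs without NekoHTML on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; the class and method names are illustrative only.
final class TextExtractor {

    // Recursively collects the content of every text node under the given node.
    static String getText(Node node) {
        StringBuilder sb = new StringBuilder();
        collect(node, sb);
        return sb.toString();
    }

    // Collects only the text found inside elements with the given tag name.
    // The comparison ignores case because an HTML parser may normalize
    // the case of element names.
    static String getText(Node node, String tagName) {
        StringBuilder sb = new StringBuilder();
        collectInTag(node, tagName, sb);
        return sb.toString();
    }

    private static void collect(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, sb);
        }
    }

    private static void collectInTag(Node node, String tagName, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tagName)) {
            collect(node, sb);
            return; // do not descend again, so nested matches are not counted twice
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectInTag(c, tagName, sb);
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed stand-in for a parsed HTML page (assumption for the demo).
        String xhtml = "<html><head><title>Hello</title></head>"
                + "<body><p>World</p></body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        System.out.println(getText(doc));          // all text in the document
        System.out.println(getText(doc, "title")); // text inside <title> only
    }
}
```

Either string can then be added to a Lucene document as a field value.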

Unfortunately, I'm also trying to index URLs, image names, and ALT text, all of which are stored as attributes inside the tag itself (for example href, src, and alt) rather than between tags, and I can't figure out how to access that data. I realize this is more of a NekoHTML question than a Lucene question, but I know Lucene is often used for indexing web content and was hoping someone on this list could help.

Cheers.
Mike
