Indexing URLs from parsed HTML

Michael Dodson Sat, 28 Jan 2006 08:39:33 -0800

I'm new to Lucene and I'm trying to index an HTML file parsed with NekoHTML.

With text between HTML tags, it's easy enough to have an overloaded getText() method that either recursively indexes all text or accepts the name of a tag (like "title") and only finds text between <title></title> tags.
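
For concreteness, here is a minimal sketch of that kind of overloaded getText(), written against the standard org.w3c.dom interfaces. The class name TextExtractor and the parseWellFormed() helper are made up for illustration. The JDK's own DocumentBuilder stands in for NekoHTML so the sketch runs without extra jars; it assumes the NekoHTML DOM parser would hand back an org.w3c.dom.Document that can be walked the same way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TextExtractor {

    /** Recursively concatenates every text node at or below the given node. */
    public static String getText(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            return node.getNodeValue();
        }
        StringBuilder sb = new StringBuilder();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            sb.append(getText(child));
        }
        return sb.toString();
    }

    /** Returns only the text inside elements named tagName (compared case-insensitively). */
    public static String getText(Node node, String tagName) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tagName)) {
            return getText(node);
        }
        StringBuilder sb = new StringBuilder();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            sb.append(getText(child, tagName));
        }
        return sb.toString();
    }

    /** Hypothetical stand-in for the HTML parser: only handles well-formed markup. */
    public static Document parseWellFormed(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseWellFormed(
                "<html><head><title>Hello</title></head><body><p>World</p></body></html>");
        System.out.println(getText(doc));          // all text in the document
        System.out.println(getText(doc, "title")); // only the text inside <title>
    }
}
```

The tag name is compared case-insensitively because, as far as I can tell, NekoHTML upper-cases element names by default.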

Unfortunately, I'm trying to index URLs, image names, and ALT text, all of which remain inside the tag, and I can't figure out how to access that data. I realize this is more of a NekoHTML question than a Lucene question, but I know Lucene is often used for indexing web content and was hoping someone on this list could help.
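
To make the goal concrete, here is a rough sketch of the kind of attribute lookup I have in mind, using only the standard org.w3c.dom API (Element.getAttribute). The class name AttributeExtractor and the parseWellFormed() helper are invented for illustration, and the JDK's DocumentBuilder stands in for NekoHTML so the example is self-contained. Whether the DOM produced by NekoHTML exposes href, src, and alt in exactly this way is an assumption on my part.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class AttributeExtractor {

    /** Walks the tree and collects attrName from every element named tagName. */
    private static void collect(Node node, String tagName, String attrName, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase(tagName)) {
            Element el = (Element) node;
            if (el.hasAttribute(attrName)) {
                out.add(el.getAttribute(attrName));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, tagName, attrName, out);
        }
    }

    /** Returns the attribute values in document order, e.g. ("a", "href") for link URLs. */
    public static List<String> getAttributes(Document doc, String tagName, String attrName) {
        List<String> out = new ArrayList<String>();
        collect(doc, tagName, attrName, out);
        return out;
    }

    /** Hypothetical stand-in for the HTML parser: only handles well-formed markup. */
    public static Document parseWellFormed(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseWellFormed(
                "<html><body><a href=\"http://example.com/\">Home</a>"
                + "<img src=\"logo.png\" alt=\"Company logo\"/></body></html>");
        System.out.println(getAttributes(doc, "a", "href"));  // link URLs
        System.out.println(getAttributes(doc, "img", "src")); // image names
        System.out.println(getAttributes(doc, "img", "alt")); // ALT text
    }
}
```

Each returned string could then be added to the Lucene document as its own field, in the same way as the text from getText().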


Cheers.
Mike

