On Mon, 21 Oct 2002, Terry Steichen wrote: > Thanks for the comments - you might have something there. What I do > is clean up the HTML with JTidy and then parse it into a DOM. Then I > use selected parts to create a new DOM which I write out as an XML > file. I then use Lucene to index the XML files. Upon retrieval, I > once again parse the XML, format it and render it to a browser. > > The conversion from brackets to entities is necessary in order for the > browser (which will subsequently view it) to render it properly. > > But maybe, in the indexing process, I could convert it back again (to > brackets), but I'm not sure what to do with it then - in other words, > how to bring an HTML parser into the picture. If you have ideas on > this, I'd very much appreciate hearing them.
Perhaps there is some reason for the conversion to XML that I'm not understanding (and this isn't really within my area of expertise). But if your purpose is to index HTML files and then display them later in response to a search, why not just use JTidy and then index the HTML instead (skipping the DOM and XML stages entirely), and then return the (cleaned-up) HTML later when asked for? The basis of any 'semantic' tags that you might be putting in the XML (perhaps to define Lucene fields) must be there in the HTML anyway, so I'm not sure what the DOM and XML representations get you. Regards, Joshua O'Madadhain [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall It's that moment of dawning comprehension that I live for--Bill Watterson My opinions are too rational and insightful to be those of any organization. -- To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org> For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>