On Mon, Jul 11, 2011 at 03:28:23PM +1200, Grant McLean wrote:
> I'm just getting started with trying out Lucy. Installation went without
> a hitch and I've successfully worked my way through the tutorials.

Nice...

> Congratulations on getting the project to this level of quality.

Thanks! :)

> My main interest is indexing HTML documents for web sites. It seems
> that if I feed the HTML file contents to the Lucy indexer, all the
> markup (tags and attributes) ends up in the index and consequently comes
> back out in the highlighted excerpts. Is it my responsibility to strip
> the tags out before passing the text to the indexer?

You have to handle document parsing yourself and supply plain text to Lucy.

Lucy is a specialized fulltext indexing library rather than a turnkey indexing
solution, so it does not bundle file-format-specific parsing tools. Instead,
it is designed so that it may serve as the indexing component within a larger
system which aggregates additional components such as parsers.

At this point I would ordinarily suggest a variety of HTML parsing CPAN
distributions, but presuming that you are the Grant McLean who maintains
XML::Simple and XML::SAX, I imagine that you are familiar with the lay of the
land. :)

Marvin Humphrey
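
For illustration only, here is a minimal sketch of the pre-processing step described above: reducing an HTML document to plain text before it is handed to an indexer. It is not from the original message. It uses Python's standard-library html.parser rather than a CPAN module, the names TextExtractor and html_to_text are hypothetical, and the call that would pass the resulting text to Lucy is deliberately left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping <script> and <style> bodies.

    Hypothetical helper for illustration; not part of Lucy.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space, then collapse whitespace runs.
        # Assumption: treating every tag boundary as a word break is
        # acceptable for indexing purposes.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the plain text of an HTML string, with markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The string returned by html_to_text is what would be supplied as the document's content field, so tags and attributes never reach the index or the highlighted excerpts. A real site would probably want a more forgiving parser and some decision about which elements (navigation, footers) to leave out.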
