On Sat, 3 Aug 2002, petite_abeille wrote:

> I was wondering what would be a good way to incorporate text format
> information in Lucene word/document scoring. For example, when turning
> HTML into plain text for indexing purposes, a lot of potentially useful
> information is lost: e.g. tags like <bold>, <strong> and so on could be
> understood as conveying emphasis information about some words. If
> somebody took the pain to "underline" some words, why throw it away?
> Assuming there is some interesting meaning in a document format/layout,
> and a way to understand it and weight it, how could one incorporate this
> information into document scoring?
If you can boost terms as they are indexed (I can't remember whether
this is possible, but you can certainly do so on queries), then that
might be a good way of doing it. It's not so much a matter of changing
document scores (on the back end, with respect to a particular query)
as of changing the weighting of terms (on the front end).

I've just glanced through the API and I don't see a way to do term
boosting during indexing, but maybe there's something I've missed.
Anyone?

Regards,

Joshua O'Madadhain

[EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
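
To make the "weight terms on the front end" idea concrete, here is a rough sketch in plain Java. It does not use the Lucene API at all. The class name, the EMPHASIS_WEIGHT value and the set of tags treated as emphasis are all assumptions made up for illustration. It only shows how emphasized words could be counted more heavily than plain ones while HTML is being reduced to terms, before anything is handed to an indexer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmphasisWeights {
    // Hypothetical weight for emphasized text; plain text counts as 1.0.
    static final double EMPHASIS_WEIGHT = 2.0;

    // Matches <b>, <strong>, <em> and <u> elements. Nested emphasis is not
    // handled; this is an illustration, not an HTML parser.
    private static final Pattern EMPHASIS = Pattern.compile(
            "<(b|strong|em|u)>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns term -> weighted frequency for a fragment of HTML. */
    public static Map<String, Double> weightedTermFrequencies(String html) {
        Map<String, Double> weights = new HashMap<>();
        Matcher m = EMPHASIS.matcher(html);
        int last = 0;
        while (m.find()) {
            // Text before the emphasized element counts with weight 1.0.
            addTerms(weights, html.substring(last, m.start()), 1.0);
            // Text inside the element counts with the emphasis weight.
            addTerms(weights, m.group(2), EMPHASIS_WEIGHT);
            last = m.end();
        }
        addTerms(weights, html.substring(last), 1.0);
        return weights;
    }

    private static void addTerms(Map<String, Double> weights,
                                 String text, double weight) {
        // Drop any remaining tags, then split on non-alphanumeric characters.
        String plain = text.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        for (String term : plain.split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                weights.merge(term, weight, Double::sum);
            }
        }
    }

    public static void main(String[] args) {
        String html =
                "<p>Lucene scoring: <strong>boost</strong> the boost terms</p>";
        System.out.println(weightedTermFrequencies(html));
    }
}
```

In the sample input, "boost" occurs once inside <strong> and once in plain text, so it ends up with a weight of 3.0 while the other terms get 1.0. How such weights would then be fed into an index (for instance by repeating emphasized terms in the text that gets indexed) is left open here, since that depends on what the indexing API allows.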