: Here is an approach which works based on the quantity
: of matching terms in an adapted BooleanQuery:
:
: http://issues.apache.org/bugzilla/show_bug.cgi?id=35284

Doh! ... I should really start paying attention to the stuff in SVN.
I didn't even know there was a DisjunctionSumScorer. This is exactly
what I had in mind when I first started thinking about
"Alternative #2". But...

: This approach of course is based purely on the
: quantity of matching terms, not the quality-based

...this is what I'm worried about.

: measures in your example. As you suggest, quality is a
: combination of user-derived measures (boosts) and
: data-derived measures (tf, idf, docBoost). It sounds
: like a more informed approach in principle but I'm
: not currently sure how it would be implemented
: efficiently in practice. Here's one possible approach

That's the thing. I'm thinking that if there was a subclass of
DisjunctionSumScorer (say "DisjunctionBoostSumScorer") that totaled
the boosts of all the sub-queries, and compared the sum of the boosts
of the sub-queries that match each doc against a percentage of that
total, that would be a very simple, inexpensive calculation that would
at least allow us to leverage the user-derived measures of the score,
if not the data-derived measures.

Does that make sense? Does it seem like taking advantage of the boosts
instead of just the coord would be worthwhile?

: I have previously optimized large BooleanQueries
: generated by nGrams before now by taking only the top
: idf-ranked terms - purely to reduce query times.
: A similar approach could be used to automatically
: rewrite a BooleanQuery consisting of entirely optional
: terms into the equivalent of:
:
:   +(my high idf terms) (low idf terms)

Alas, I don't know if that is a practical solution for my situation:

1) There is no guarantee that all possible sub-Queries can be
   deconstructed into Terms, so you can't rank exclusively by idf.
   (Consider for example the Queries Yonik submitted in bug#35796.)

2) Even if we confine ourselves to simple queries consisting purely
   of Term queries, your suggested approach may overemphasize Terms
   that aren't particularly important to the user, or worse, terms
   that the user misspelled or misremembered.

Imagine a user is trying to search for the digital camera
"Canon EOS 5D", but when they saw the name of the camera in a
magazine, they didn't realize that the "EOS" is "ee oh es"; they
thought it was "ee zero es". So they search for "Canon E0S 5D".

"E0S" may not even be in the index, giving it a really high idf,
which based on your suggestion would make it a mandatory term, so the
results would be empty. Even if we had an explicit check to ignore
terms with a docFreq of 0, there might be one product that actually
contained "E0S" in its name, giving the user results that contain
only that product and ignoring all Canon products or products with
"5D" in their names.

-Hoss
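
For reference, a rough sketch of the boost-sum check described above,
written as plain Java rather than against Lucene's Scorer API. The
names BoostSumThreshold, Clause, and accept are hypothetical and exist
only for illustration; a real "DisjunctionBoostSumScorer" would have to
do this bookkeeping inside the scorer while advancing over the
sub-scorers, which this sketch does not attempt.

```java
import java.util.List;

/**
 * Illustration only: decides whether a document passes a
 * "matched boost >= fraction of total boost" test.
 * Not part of Lucene; all names here are hypothetical.
 */
final class BoostSumThreshold {

    /** One optional sub-query: its user-assigned boost and whether it matched the doc. */
    static final class Clause {
        final float boost;
        final boolean matched;

        Clause(float boost, boolean matched) {
            this.boost = boost;
            this.matched = matched;
        }
    }

    /**
     * Returns true if the boosts of the matching clauses add up to at
     * least minFraction (0..1) of the boosts of all clauses.
     */
    static boolean accept(List<Clause> clauses, float minFraction) {
        float total = 0f;
        float matched = 0f;
        for (Clause c : clauses) {
            total += c.boost;
            if (c.matched) {
                matched += c.boost;
            }
        }
        if (total <= 0f) {
            return false; // no usable boosts, nothing to compare against
        }
        return matched >= minFraction * total;
    }
}
```

As a usage example, assume the query "Canon^2 E0S 5D" (total boost 4)
with a 50% threshold: a doc matching only "E0S" carries 1 of 4 and is
rejected, while a doc matching only "Canon" carries 2 of 4 and is
accepted. Only the user-derived boosts are consulted here; tf, idf, and
docBoost play no part in the cutoff.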