Doug Cutting wrote:
> Is the default operator AND or OR? It appears to be OR, but it should
> probably be AND. That's become the industry standard since QueryParser
> was first written. Also, any chance we can get explanations for hits?

Explanations are available. Click the score link on a result.

> It is difficult to decipher what's doing what. I think we should
> separately evaluate query formulation and boosting from changes to
> tf/idf.

Earlier I proposed the opposite, because my mechanism is designed to
work in concert: the Similarity and the query parsing work together.
Most real collections have at least title and body fields. We decided
to look at the combined structure and compare results, then dig into
individual details as appropriate to understand the results. The
analysis can be approached bottom-up, a factor at a time, or top-down,
looking at two complete formulations and then dissecting them to
further understand their differences.

I think the differences are pretty clear as the system stands. Notice
the substantial difference in the idf's in the respective explanations.
I continue to think the current mechanism weights these too high,
primarily due to its squaring. The other big difference occurs when not
all query terms are required: the current mechanism then does not
consider term diversity (e.g., t1 in title and in content gets as good
a score as t1 in title and t2 in content), while the new approach does.

> MultiFieldQueryParser is known to be deficient. A better
> general-purpose multi-field query formulator might be like that used by
> Nutch. It would translate a query "t1 t2" given fields f1 and f2 into
> something like:
>
> +(f1:t1^b1 f2:t1^b2)
> +(f1:t2^b1 f2:t2^b2)
> f1:"t1 t2"~s1^b3
> f2:"t1 t2"~s2^b4

This does not seem scalable. How do you expand a general query with n
terms? I believe Dave has some code that generates all the pairwise
combinations, but this is quadratic in the length of the query and it
doesn't consider proximity of larger collections of query terms. I sent
a note earlier today suggesting that a new Query class is needed that
simultaneously handles multiple fields, term diversity and term
proximity.

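To make the scaling concern concrete, here is a minimal Python sketch of
one possible way to generalize the quoted formulation to n terms. It is
an illustration only, not Nutch's or Dave's actual code: the helper name
expand_query and its pairwise flag are hypothetical, and the boost and
slop names (b1, s1, ...) are kept symbolic as in the quoted message.

```python
from itertools import combinations


def expand_query(terms, fields, pairwise=False):
    """Return the clauses of a Nutch-style multi-field query.

    Hypothetical helper for illustration; boosts (b1, b2, ...) and
    slops (s1, s2, ...) are symbolic placeholders, not tuned values.
    """
    clauses = []

    # One required clause per term: the term must match in some field.
    for term in terms:
        alternatives = " ".join(
            f"{field}:{term}^b{i}" for i, field in enumerate(fields, start=1)
        )
        clauses.append(f"+({alternatives})")

    # Optional sloppy-phrase clauses that reward proximity within a field.
    if pairwise:
        # Every pair of terms: n * (n - 1) / 2 groups per field.
        groups = list(combinations(terms, 2))
    elif len(terms) > 1:
        # A single phrase covering the whole query.
        groups = [tuple(terms)]
    else:
        groups = []

    # Phrase boosts continue numbering after the per-field term boosts.
    boost_index = len(fields) + 1
    for i, field in enumerate(fields, start=1):
        for group in groups:
            phrase = " ".join(group)
            clauses.append(f'{field}:"{phrase}"~s{i}^b{boost_index}')
        boost_index += 1

    return clauses


if __name__ == "__main__":
    for clause in expand_query(["t1", "t2"], ["f1", "f2"]):
        print(clause)
    # Clause counts for a five-term query over two fields.
    print(len(expand_query(list("abcde"), ["f1", "f2"])))
    print(len(expand_query(list("abcde"), ["f1", "f2"], pairwise=True)))
```

With n terms and m fields, the whole-query variant yields n + m clauses,
while the pairwise variant yields n + m * n * (n - 1) / 2. Neither
variant rewards proximity among three or more terms that do not span
the whole query.
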
> Do folks agree that this is a good general formulation?

Not unless it is scalable and the desire is to require all query terms.
I would rather not require all query terms, which introduces a more
complex diversity requirement (ensure that as many distinct query terms
as possible are matched somewhere). I'm interested in solving this
problem and would be happy to contribute whatever I write.

Chuck