Hi Radha,

On 4/17/2009 at 6:19 AM, Radhalakshmi Sreedharan wrote:
> What I need is the following :
> If my document field is ( ab,bc,cd,ef) and Search tokens are
> (ab,bc,cd).
>
> Given the following :
> I should get a hit even if all of the search tokens aren't present
> If the tokens are found they should be found within a distance x of
> each other ( proximity search)
>
> I need the percentage match of the search tokens with the document
> field.
>
> Currently this is my query :
> 1) I form all possible permutation of the search tokens
> 2) do a spanNearQuery of each permutation
> 3) Do a DisjunctionMaxQuery on the spannearqueries.
>
> This is how I compute % match :
> % match = ( Score by running the query on the document field ) /
>           ( score by running the query on a document field created
>             out of search tokens )
>
> The numerator gives me the actual score with the search tokens run on
> the field.
> Denominator gives me the best possible or maximum possible score with
> the current search tokens
>
> For this example << If my document field is ( ab,bc,cd,ef) and Search
> tokens are (ab,bc,cd).>> I expect a % match of around 90%.
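
To make sure I'm reading your setup correctly, here is a rough sketch in
plain Java (no Lucene calls) of the two pieces you describe: enumerating
every ordering of the search tokens, each of which would become one
span-near clause, and the score ratio you call "% match". The class and
method names are made up for illustration, and the two scores passed to
percentMatch are assumed to come from wherever you run your queries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Lucene.
class PercentMatchSketch {

    // Returns every ordering of the given tokens (n! lists for n tokens).
    static List<List<String>> permutations(List<String> tokens) {
        List<List<String>> out = new ArrayList<List<String>>();
        permute(new ArrayList<String>(tokens), 0, out);
        return out;
    }

    // Recursive in-place permutation by swapping each remaining element
    // into position 'start', then restoring the original order.
    private static void permute(List<String> work, int start,
                                List<List<String>> out) {
        if (start >= work.size()) {
            out.add(new ArrayList<String>(work));
            return;
        }
        for (int i = start; i < work.size(); i++) {
            Collections.swap(work, start, i);
            permute(work, start + 1, out);
            Collections.swap(work, start, i);
        }
    }

    // "% match" as described in the quoted message: the score against the
    // real field divided by the score against a field built from the
    // search tokens themselves, expressed as a percentage.
    static double percentMatch(double actualScore, double bestScore) {
        if (bestScore <= 0.0) {
            throw new IllegalArgumentException("bestScore must be positive");
        }
        return 100.0 * actualScore / bestScore;
    }

    public static void main(String[] args) {
        // One span-near clause would be built per ordering printed here.
        for (List<String> p : permutations(Arrays.asList("ab", "bc", "cd"))) {
            System.out.println(p);
        }
    }
}
```

Note that n search tokens produce n! orderings, so the number of clauses
grows very quickly with the number of tokens.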

I'm having trouble understanding what your "% match" function
represents. What scores do you want if a document field contains
(ab,bc,cd,ef,gh)? Or (ab,bc,cd,ef,gh,ij,jk,lm,no)?

Where does your "best possible" document reside? I'm assuming that
either the query set is fixed, and so you have a pre-indexed set of
pseudo-documents corresponding to the queries, or you're creating new
pseudo-documents for the query just prior to launching it. In either
case, do your pseudo-documents reside in the same index as the one that
contains the real documents?

> I tried out your approach and the problem got solved to an extent but
> still it remains.
>
> The problem is the score reduces quite a bit even now as bc is not
> found in the combinations ( bc,cd) ( bc,ef) and ( ab,bc,cd,ef) etc.
>
> The boosting infact has a negative impact and reduces the score further
> :(
>
> The factor which is affected by boosting is the queryNorm .
>
> With a boost of 6 -
>
> 0.015559823 = (MATCH) max of:
>   0.015559823 = (MATCH) weight(spanNear([SearchField:cd, SearchField:ef], 10, false)^6.0 in 0), product of:
>     0.07606166 = queryWeight(spanNear([SearchField:cd, SearchField:ef], 10, false)^6.0), product of:
>       6.0 = boost
>       0.61370564 = idf(SearchField: cd=1 ef=1)
>       0.02065639 = queryNorm
>     0.20456855 = (MATCH) fieldWeight(SearchField:spanNear([cd, ef], 10, false)^6.0 in 0), product of:
>       0.33333334 = tf(phraseFreq=0.33333334)
>       0.61370564 = idf(SearchField: cd=1 ef=1)
>       1.0 = fieldNorm(field=SearchField, doc=0)
>
> Without a boost -
>
> 0.07779912 = (MATCH) max of:
>   0.07779912 = (MATCH) weight(spanNear([SearchField:cd, SearchField:ef], 10, false) in 0), product of:
>     0.3803083 = queryWeight(spanNear([SearchField:cd, SearchField:ef], 10, false)), product of:
>       0.61370564 = idf(SearchField: cd=1 ef=1)
>       0.6196917 = queryNorm
>     0.20456855 = (MATCH) fieldWeight(SearchField:spanNear([cd, ef], 10, false) in 0), product of:
>       0.33333334 = tf(phraseFreq=0.33333334)
>       0.61370564 = idf(SearchField: cd=1 ef=1)
>       1.0 = fieldNorm(field=SearchField, doc=0)

The queryNorm for the boosted variant is 1/30th of the queryNorm for the
non-boosted query. The boost of 6 is then multiplied through
(6 x 1/30 = 1/5), so the boosted score ends up at 1/5th of the
non-boosted score.

I would think that the queryNorm would be unaffected by the boost.
Sorry, I don't know what's happening - maybe a bug?

Steve
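
As a sanity check on the arithmetic, the sketch below recomputes both
scores from the factors in the two explain outputs quoted above. It
assumes only what those explanations themselves show: each score is
queryWeight * fieldWeight, with queryWeight = boost * idf * queryNorm
and fieldWeight = tf * idf * fieldNorm. The class and method names are
mine and purely illustrative; nothing here calls Lucene.

```java
// Hypothetical helper that replays the products shown in the explain output.
class ExplainArithmetic {

    // score = (boost * idf * queryNorm) * (tf * idf * fieldNorm)
    static double score(double boost, double idf, double queryNorm,
                        double tf, double fieldNorm) {
        double queryWeight = boost * idf * queryNorm;
        double fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        double idf = 0.61370564;       // idf(SearchField: cd=1 ef=1)
        double tf = 0.33333334;        // tf(phraseFreq=0.33333334)
        double fieldNorm = 1.0;        // fieldNorm(field=SearchField, doc=0)

        double boostedNorm = 0.02065639; // queryNorm with boost of 6
        double plainNorm = 0.6196917;    // queryNorm without a boost

        double boosted = score(6.0, idf, boostedNorm, tf, fieldNorm);
        double plain = score(1.0, idf, plainNorm, tf, fieldNorm);

        System.out.println("boosted score     = " + boosted);
        System.out.println("non-boosted score = " + plain);
        System.out.println("queryNorm ratio   = " + (plainNorm / boostedNorm));
        System.out.println("score ratio       = " + (plain / boosted));
    }
}
```

The recomputed scores agree with the reported 0.015559823 and 0.07779912
to within rounding, and the two ratios come out at about 30 and 5.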