One important thing: since the weight is now stored as the value of the field, I no longer need the norms of the indexed fields, so I am now indexing the fields with:

    Field field = new Field(field_name, Float.toString(weight), Store.YES, Index.NOT_ANALYZED_NO_NORMS);

And the memory usage is back to normal... So cool!
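
For anyone following along, here is a minimal plain-Java sketch of the scoring logic itself (no Lucene involved), using the numbers from the example in the original question quoted below. The class and method names (DotProductSketch, key, score) are made up for illustration. It only models the idea of composite "fieldId_value" keys with the weight stored as the value, and is not the actual FieldScoreQuery-based implementation.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java model of the dot-product scoring idea; hypothetical names, no Lucene calls. */
public class DotProductSketch {

    /** Builds the composite key: field ID and field value joined by '_'. */
    public static String key(String fieldId, String value) {
        return fieldId + "_" + value;
    }

    /** Sums queryWeight * docWeight over the composite keys present in both maps. */
    public static double score(Map<String, Double> query, Map<String, Double> doc) {
        double sum = 0.0;
        for (Map.Entry<String, Double> term : query.entrySet()) {
            Double docWeight = doc.get(term.getKey());
            if (docWeight != null) {
                sum += term.getValue() * docWeight;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Query from the example: (1, AA, 0.1), (7, BB, 0.2), (8, CC, 0.3)
        Map<String, Double> q = new HashMap<>();
        q.put(key("1", "AA"), 0.1);
        q.put(key("7", "BB"), 0.2);
        q.put(key("8", "CC"), 0.3);

        // Document 1: only 1_AA matches the query (7_CC differs from 7_BB)
        Map<String, Double> d1 = new HashMap<>();
        d1.put(key("1", "AA"), 0.2);
        d1.put(key("2", "DD"), 0.8);
        d1.put(key("7", "CC"), 0.999);
        d1.put(key("10", "FFF"), 0.1);

        // Document 2: both 7_BB and 8_CC match the query
        Map<String, Double> d2 = new HashMap<>();
        d2.put(key("7", "BB"), 0.3);
        d2.put(key("8", "CC"), 0.5);

        System.out.println("Score(q,d1) = " + score(q, d1)); // about 0.02
        System.out.println("Score(q,d2) = " + score(q, d2)); // about 0.21
    }
}
```

Because the value is part of the key, two fields with the same ID but different values never collide, which is what makes the per-term boosts safe to use.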

-----Original Message-----
From: Yuval Kesten [mailto:ykes...@yahoo-inc.com]
Sent: Wednesday, February 22, 2012 7:29 PM
To: java-user@lucene.apache.org
Subject: RE: Custom lucene scoring - Dot product between field boost and query boost

Hi all,

Inspired by another thread here (Question about CustomScoreQuery), I am using the following solution, which works really well (with one drawback).

I discovered that some of my problems came from a wrong assumption: I did have many fields/query terms with the same field ID. This broke my approach, because the query boosts were aggregated and my calculations came out wrong.

What I did: during indexing I appended the field value to the field ID (joined by '_') and used the desired score as the field value. At search time I use a plain FieldScoreQuery (as-is, no modifications needed) on the composite field ID. I can still use setBoost to set the query-side score, because now my fields are unique.

Logic-wise this is perfect: a dot product using Lucene.
Drawback: a very large number of distinct field names, which affects memory usage dramatically.

If anyone has better ideas, please share!

-----Original Message-----
From: Alan Woodward [mailto:alan.woodw...@romseysoftware.co.uk]
Sent: Wednesday, February 22, 2012 4:00 PM
To: java-user@lucene.apache.org
Subject: Re: Custom lucene scoring - Dot product between field boost and query boost

Hi Yuval,

You can just override Similarity, rather than DefaultSimilarity - that way you don't burn any CPU cycles on TF/IDF calculations.

Alan

On 22 Feb 2012, at 07:17, Yuval Kesten wrote:

> Hi Em,
> 1. Regarding performance - the similarity class (and my subtype as well)
> gets the IDF, TF and squared-sums calculations as inputs - it just
> factors them differently. Even though I ignore the values, they are still
> being computed.
> 2. I have written this code:
>     static {
>         Similarity.setDefault(new MySimilarity());
>     }
> which means that I am setting the default similarity before doing the
> indexing, and obviously before the searching.
> Thanks!
>
> -----Original Message-----
> From: Em [mailto:mailformailingli...@yahoo.de]
> Sent: Tuesday, February 21, 2012 6:07 PM
> To: java-user@lucene.apache.org
> Subject: Re: Custom lucene scoring - Dot product between field boost
> and query boost
>
> Hi Yuval,
>
>> 1. Performance: I am calculating all the TF/IDF stuff and NORMS for
>> nothing...
> You aren't calculating that much, since you declared all those values as
> constants. What are you worried about?
>
>> 2. The score I get from the TopScoreDocCollector is not the same as I
>> get from the Explanation.
>> Here is part of my code:
> Could you provide us the code where you are setting the Similarity, please?
>
> Kind regards,
> Em
>
> On 21.02.2012 at 16:18, Yuval Kesten wrote:
>> Hi,
>> I want to use Lucene with the following scoring logic:
>> When I index my documents I want to set a score/weight for each field.
>> When I query my index I want to set a score/weight for each query term.
>>
>> I will NEVER index or query with many instances of the same field - in each
>> query (document) there will be 0-1 instances with the same field name.
>> My fields/query terms are not analyzed - they already consist of a single
>> token.
>>
>> I want the score to be simply the dot product between the fields of the
>> query and the fields of the document, where a field contributes only if
>> it has the same value on both sides.
>>
>> For example:
>> Query:
>>   Field Name | Field Value | Field Score
>>   1          | AA          | 0.1
>>   7          | BB          | 0.2
>>   8          | CC          | 0.3
>>
>> Document 1:
>>   Field Name | Field Value | Field Score
>>   1          | AA          | 0.2
>>   2          | DD          | 0.8
>>   7          | CC          | 0.999
>>   10         | FFF         | 0.1
>>
>> Document 2:
>>   Field Name | Field Value | Field Score
>>   7          | BB          | 0.3
>>   8          | CC          | 0.5
>>
>> The scores should be:
>> Score(q,d1) = FIELD_1_SCORE_Q * FIELD_1_SCORE_D1 = 0.1 * 0.2 = 0.02
>> Score(q,d2) = FIELD_7_SCORE_Q * FIELD_7_SCORE_D2 + FIELD_8_SCORE_Q *
>> FIELD_8_SCORE_D2 = (0.2 * 0.3) + (0.3 * 0.5) = 0.21
>>
>> What would be the best way to implement it, in terms of accuracy and
>> performance? (I don't need TF and IDF calculations.)
>>
>> I currently implement it by setting boosts on the fields and query terms.
>> Then I overrode the DefaultSimilarity class:
>>
>> import org.apache.lucene.index.FieldInvertState;
>> import org.apache.lucene.search.DefaultSimilarity;
>>
>> public class MySimilarity extends DefaultSimilarity {
>>
>>     // Norm is just the index-time boost, with no length normalization
>>     @Override
>>     public float computeNorm(String field, FieldInvertState state) {
>>         return state.getBoost();
>>     }
>>
>>     // All remaining factors are neutralized to 1
>>     @Override
>>     public float queryNorm(float sumOfSquaredWeights) {
>>         return 1;
>>     }
>>
>>     @Override
>>     public float tf(float freq) {
>>         return 1;
>>     }
>>
>>     @Override
>>     public float idf(int docFreq, int numDocs) {
>>         return 1;
>>     }
>>
>>     @Override
>>     public float coord(int overlap, int maxOverlap) {
>>         return 1;
>>     }
>>
>> }
>>
>> And based on
>> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/scoring.html
>> this should work.
>> Problems:
>> 1. Performance: I am calculating all the TF/IDF stuff and NORMS for
>> nothing...
>> 2. The score I get from the TopScoreDocCollector is not the same as I get
>> from the Explanation.
>> Here is part of my code:
>>
>> indexSearcher = new IndexSearcher(IndexReader.open(directory, true));
>> TopScoreDocCollector collector = TopScoreDocCollector.create(iTopN, true);
>> indexSearcher.search(query, collector);
>> ScoreDoc[] hits = collector.topDocs().scoreDocs;
>> for (int i = 0; i < hits.length; ++i) {
>>     int docId = hits[i].doc;
>>     Document d = indexSearcher.doc(docId);
>>     double score = hits[i].score;
>>     String id = d.get(FIELD_ID);
>>     Explanation explanation = indexSearcher.explain(query, docId);
>> }
>>
>> Thanks!
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org