Term Vectors (specifically TermFreqVector) in Lucene are a storage mechanism provided as a convenience for applications to use. They are not an integral part of scoring in the way you may be thinking of them from the traditional Vector Space Model, so the shared terminology can cause some confusion. If you want to see examples of how to implement scorers, have a look at TermScorer, at BoostingTermQuery (and the scorer it uses), and at any of the other classes that extend Scorer. You might also find the file formats page (on the Lucene Java website under Documentation) helpful for understanding what Lucene stores so that it can do scoring.

There really isn't any tutorial on scoring: either not many people have expressed an interest in one, or no one has made it a high enough priority to write one. Having written a Scorer (or maybe two, I forget), I can give advice on specific things, but I am not sure I could write a tutorial that is general enough to be useful at this point.

One thought for associating a weight with a given term based on its co-occurring terms is to use the new Payload mechanism, whereby you can store a byte array at each term occurrence, which can then be used in scoring via things like the BoostingTermQuery (or your own implementation). If that is of interest, you can search the archives for payloads (I also think Michael Busch is presenting on Payloads, amongst other things, at ApacheCon in Atlanta) and have a look at the BoostingTermQuery. There certainly are other PayloadQueries that need to be implemented. See the Lucene wiki for some background and details on Payloads as well.
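
As a rough illustration of the payload idea, here is a minimal sketch in plain Java (no Lucene classes at all) of computing a per-term weight from co-occurring terms and packing it into a byte array that could be stored as a payload. The class and method names are made up for this example, and the weight (the fraction of the document's other distinct terms seen within a small window) is just a placeholder for whatever correlation measure you actually want. Wiring the bytes into the index at analysis time and reading them back in a Scorer is left out, since that depends on the Lucene version you are using.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class CooccurrencePayloadSketch {

    /** Packs a float weight into 4 bytes (big-endian), the kind of byte[] a payload could carry. */
    static byte[] encodeWeight(float weight) {
        return ByteBuffer.allocate(4).putFloat(weight).array();
    }

    /** Reads back a weight written by encodeWeight. */
    static float decodeWeight(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    /**
     * Placeholder co-occurrence measure (an assumption, not a standard formula):
     * for each distinct term, the fraction of the other distinct terms in the
     * document that appear within `window` positions of any of its occurrences.
     */
    static Map<String, Float> cooccurrenceWeights(List<String> tokens, int window) {
        Map<String, Set<String>> neighbours = new LinkedHashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            String term = tokens.get(i);
            Set<String> seen = neighbours.computeIfAbsent(term, k -> new HashSet<>());
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.size() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!tokens.get(j).equals(term)) {
                    seen.add(tokens.get(j));
                }
            }
        }
        int others = neighbours.size() - 1; // distinct terms other than the term itself
        Map<String, Float> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : neighbours.entrySet()) {
            float w = others == 0 ? 0f : (float) e.getValue().size() / others;
            weights.put(e.getKey(), w);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("lucene", "scoring", "with", "payloads");
        for (Map.Entry<String, Float> e : cooccurrenceWeights(tokens, 1).entrySet()) {
            byte[] payload = encodeWeight(e.getValue());
            System.out.println(e.getKey() + " -> " + decodeWeight(payload)
                    + " (" + payload.length + " payload bytes)");
        }
    }
}
```

The point of encoding to a fixed 4-byte array is that the scoring side only needs to know the layout, not how the weight was computed, so you can change the correlation measure without touching the code that reads the payload.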

I don't know that it is a big mistake to try this in Lucene. The community hasn't put a huge priority on making the innards of scoring easier to alter (if that is possible), but that doesn't mean we are not open to suggestions and patches. You may find https://issues.apache.org/jira/browse/LUCENE-965 informative, both for the implementation and for the discussion of what needs to happen for a change to be accepted into Lucene. This JIRA issue specifically attempts to provide Lucene with a new scoring mechanism.

You might also have a look at Lemur (http://www.lemurproject.org/) which is much more academically focused.

Cheers,
Grant

On Nov 7, 2007, at 12:49 PM, Ariel wrote:

Then if I want to use another scoring formula, must I implement my
own Query/Weight/Scorer? For example, instead of cosine distance,
leiderbage distance or some other. I'm studying the Query/Weight/Scorer
classes to find out how to do that, but there is not much documentation
about it.

I have seen that I could change the similarity factors by extending the
Similarity class, but I have not seen any example of changing the scoring
formula or changing the per-term weight in the term vector. Do you know
of any tutorial about this?

What I want to do by changing the frequency in the term vector is this:
for example, instead of taking the tf (term frequency) of the term and
storing it in the vector, I want to consider the correlation of the term
with the other terms of the document and store that measure per term in
the vector, so that later my custom similarity formula can calculate the
ranking of a document against a query considering the correlation
between terms.
Do you think it is a big mistake to try to do this with Lucene? Is there any way?

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


