Term vectors (specifically TermFreqVector) in Lucene are a storage
mechanism provided as a convenience for applications. They are not an
integral part of scoring in the way you may be thinking of them from
the traditional Vector Space Model, so the shared terminology can
cause some confusion. If you want to see examples of how to implement
scorers, have a look at classes like TermScorer, BoostingTermQuery, and
any of the other classes that extend Scorer. You might also find the
file formats page (off of the Lucene Java website under Documentation)
helpful for understanding what Lucene stores so that it can do scoring.
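
To make the "storage for applications" point concrete, here is a rough sketch of an application doing its own Vector Space Model comparison on term/frequency data. It assumes you have already pulled parallel arrays of terms and frequencies out of a term vector (the shape that TermFreqVector's getTerms() and getTermFrequencies() return). The class TermVectorCosineSketch and its methods are hypothetical, use no Lucene classes, and have nothing to do with how Lucene itself scores.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: cosine similarity computed by the
// application from term/frequency data it read out of stored term vectors.
final class TermVectorCosineSketch {

    // Build a term -> frequency map from parallel arrays.
    static Map<String, Integer> toMap(String[] terms, int[] freqs) {
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < terms.length; i++) {
            map.put(terms[i], freqs[i]);
        }
        return map;
    }

    // Cosine of the angle between two raw term-frequency vectors.
    // Returns 0.0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            double fa = e.getValue();
            normA += fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += fa * fb;
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Made-up example data standing in for two documents' term vectors.
        Map<String, Integer> doc1 = toMap(new String[] {"lucene", "scoring"}, new int[] {2, 1});
        Map<String, Integer> doc2 = toMap(new String[] {"lucene", "payload"}, new int[] {1, 3});
        System.out.println("cosine = " + cosine(doc1, doc2));
    }
}
```

This runs entirely outside the query/scoring path, which is the distinction being drawn: the term vector is data you read back, not something the Scorer consults.
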

There really isn't a tutorial on scoring: either not many people have
expressed an interest in one, or no one has made it a high enough
priority to write. Having written a Scorer (or maybe two, I forget),
I can give advice on specific things, but I am not sure I could write
a tutorial that is general enough to be useful at this point.

One thought for associating a weight with a given term based on its
co-occurring terms is to use the new Payload mechanism, whereby you can
store a byte array with each term occurrence, which can then be used in
scoring via things like the BoostingTermQuery (or your own
implementation). If that is of interest, you can search the archives
for payloads (I also think Michael Busch is presenting on Payloads,
amongst other things, at ApacheCon in Atlanta) and have a look at the
BoostingTermQuery. There certainly are other PayloadQueries that need
to be implemented. See the Lucene wiki for some background and details
on Payloads as well.
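
As a rough illustration of the payload idea, the sketch below packs a per-term weight (for example, a correlation measure you computed at index time) into a 4-byte array and reads it back, which is roughly what you would do on the scoring side. The class PayloadWeightSketch and its encode/decode methods are hypothetical names of mine, not Lucene API; wiring the bytes into a TokenStream and into a payload-aware query is left out, and the big-endian float layout is just one possible choice.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene: converts a per-term weight to and
// from the raw bytes that a payload could carry.
final class PayloadWeightSketch {

    // Index side: encode the weight as 4 bytes (ByteBuffer defaults to big-endian).
    static byte[] encode(float weight) {
        return ByteBuffer.allocate(4).putFloat(weight).array();
    }

    // Scoring side: decode a weight starting at the given offset in the payload.
    static float decode(byte[] payload, int offset) {
        return ByteBuffer.wrap(payload, offset, 4).getFloat();
    }

    public static void main(String[] args) {
        // 0.75f is a made-up weight for one term occurrence.
        byte[] payload = encode(0.75f);
        System.out.println(payload.length + " bytes, decoded weight = " + decode(payload, 0));
    }
}
```

A fixed-width float keeps decoding trivial; if index size matters, a single quantized byte per occurrence would be a reasonable alternative.
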

I don't know that it is a big mistake to try this in Lucene. The
community hasn't put a huge priority on making the innards of scoring
easier to alter (if that is even possible), but that doesn't mean we
are not open to suggestions and patches. You may find https://issues.apache.org/jira/browse/LUCENE-965
informative, both for the implementation and for the discussion of
what needs to happen for a change to be accepted into Lucene. This
JIRA issue specifically attempts to provide Lucene with a new scoring
mechanism.

You might also have a look at Lemur (http://www.lemurproject.org/),
which is much more academically focused.

Cheers,
Grant

On Nov 7, 2007, at 12:49 PM, Ariel wrote:
Then if I want to use another scoring formula, must I implement my
own Query/Weight/Scorer? For example, instead of cosine distance,
leiderbage distance or another one. I'm studying the Query/Weight/Scorer
classes to find out how to do that, but there is not much documentation
about it.
I have seen that I could change the similarity factors by extending the
Similarity class, but I have not seen any example of changing the
scoring formula and changing the weight per term in the term vector.
Do you know of any tutorial about this?
What I want to do by changing the frequency in the term vector is this:
instead of taking the tf (term frequency) of the term and storing it in
the vector, I want to consider the correlation of the term with the
other terms of the document and store that measure per term in the
vector, so that later my custom similarity formula can calculate the
ranking of a document against a query taking the correlation between
terms into account.
Do you think it is a big mistake to try to do this with Lucene? Is
there any way?
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http://www.apachecon.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ