Re: How to build your custom termfreq vector and add it to the field?

Grant Ingersoll Wed, 07 Nov 2007 10:49:33 -0800

Term Vectors (specifically TermFreqVector) in Lucene are a storage mechanism for convenience and for applications to use. They are not an integral part of the scoring in the way you may be thinking of them in terms of the traditional Vector Space Model, thus there may be some confusion from the different usages of that terminology. If you want to see examples of how to implement scorers, have a look at classes like TermScorer, BoostingTermQuery, and any of the other classes that extend Scorer. You might also find the file formats page (off of the Lucene Java website under Documentation) helpful for understanding what Lucene stores so that it can do scoring.

There really isn't any tutorial on scoring, either because not many people have expressed an interest in one or because no one has made it a high enough priority to write one. Having written a Scorer (or maybe two, I forget) I can give advice on specific things, but I am not sure I could write a tutorial that is general enough to be useful at this point.

One thought for associating a weight to a given term based on its cooccurring terms is to use the new Payload mechanism, whereby you can store a byte array at each term which can then be used in scoring via things like the BoostingTermQuery (or your own implementation). If that is of interest, you can search the archives for payloads (I also think Michael Busch is presenting on Payloads, amongst other things, at ApacheCon in Atlanta) and have a look at the BoostingTermQuery. There certainly are other PayloadQueries that need to be implemented. See the Lucene wiki for some background and details on Payloads as well.
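To make the byte-array idea concrete, here is a minimal sketch in plain Java of how a per-term weight could be packed into a payload-sized byte array and later folded into a score. It deliberately uses no Lucene classes: the class name PayloadWeightSketch, its three methods, the 4-byte float layout, and the "multiply the base score by the weight" rule are all assumptions made for illustration, not the Lucene API and not how BoostingTermQuery is actually implemented.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, NOT part of Lucene: shows one possible byte layout
// for a per-term weight that could be stored as a payload.
final class PayloadWeightSketch {

    // Pack a weight (e.g. a co-occurrence measure) into 4 big-endian bytes.
    static byte[] encodeWeight(float weight) {
        return ByteBuffer.allocate(4).putFloat(weight).array();
    }

    // Unpack the weight; a missing or too-short payload is treated as a
    // neutral weight of 1.0 (an assumption made for this sketch).
    static float decodeWeight(byte[] payload) {
        if (payload == null || payload.length < 4) {
            return 1.0f;
        }
        return ByteBuffer.wrap(payload).getFloat();
    }

    // Assumed combination rule: scale the base score by the stored weight.
    static float boostedScore(float baseScore, byte[] payload) {
        return baseScore * decodeWeight(payload);
    }
}
```

At index time you would call something like encodeWeight(correlation) for each term occurrence and attach the bytes as that term's payload. At search time a custom scorer would read the bytes back and apply something like boostedScore. How the bytes are attached and read depends on the Lucene version, so check the Payload documentation for the real calls.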

I don't know that it is a big mistake to try this in Lucene. The community hasn't put a huge priority on making altering the innards of scoring easier to deal with (if possible), but that doesn't mean we are not open to suggestions and patches. You may find https://issues.apache.org/jira/browse/LUCENE-965 to be informative for both the implementation and the discussion of things that need to happen to be accepted into Lucene. This JIRA issue specifically attempts to provide Lucene with a new scoring mechanism.

You might also have a look at Lemur (http://www.lemurproject.org/), which is much more academically focused.


Cheers,
Grant

On Nov 7, 2007, at 12:49 PM, Ariel wrote:

Then if I want to use another scoring formula, must I implement my
own Query/Weight/Scorer? For example, instead of cosine distance,
leiderbage distance or another one. I'm studying the Query/Weight/Scorer
classes to find out how to do that, but there is not much documentation
about it.

I have seen that I could change the similarity factors by extending the
Similarity class, but I have not seen any example of changing the scoring
formula and changing the weight per term in the term vector. Do you know
of any tutorial about this?

What I want to do by changing the frequency in the term vector is this:
instead of taking the tf (term frequency) of the term and storing it
in the vector, I want to consider the correlation of the term with the
other terms of the document and store that measure per term in the
vector, so that later my custom similarity formula can calculate the
ranking of a document against a query considering the correlation
between terms.

Do you think it is a big mistake to try to do this with Lucene? Is there any way?


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


