Not really sure what to tell you other than you need to dig in and look at how the other Query classes are implemented. I would start with TermQuery/TermScorer.

One thing I did to get to know the scoring was to document it as pseudocode, as best I could in the time I had (some of my notes are at http://lucene.apache.org/java/docs/scoring.html#Algorithm), by stepping through it starting from some of the unit tests. Index a sample collection of known documents with known terms and frequencies, so that you can accurately predict what the scores should be, then issue some basic TermQuery instances and debug. As your confidence grows, move on to PhraseQuery and BooleanQuery, then on to the SpanQueries. Then write it all down and send us a patch :-) Kind of kidding (we do like patches to docs), but do ask specific questions as you dig in.
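
To make the "predict the score by hand" step concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the classic DefaultSimilarity factors (tf = sqrt(freq), idf = ln(numDocs / (docFreq + 1)) + 1, lengthNorm = 1 / sqrt(numTerms)) and assumes that a single unboosted TermQuery reduces to tf * idf * lengthNorm. The class name ScoreByHand and the sample numbers are made up for illustration. Stored norms are compressed, and result lists may be rescaled, so treat the value as something to compare against in the debugger rather than an exact prediction.

```java
// Hand calculation of the score you would expect for one term in one document.
// The formulas are assumptions based on the classic DefaultSimilarity; check
// them against the Similarity implementation in your Lucene version.
final class ScoreByHand {

    /** Term-frequency factor: square root of the raw frequency. */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: ln(numDocs / (docFreq + 1)) + 1. */
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** Length normalization: shorter fields score higher. */
    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    /** Expected score of a single unboosted term, ignoring norm compression. */
    static double singleTermScore(int freq, int docFreq, int numDocs, int numTermsInField) {
        return tf(freq) * idf(docFreq, numDocs) * lengthNorm(numTermsInField);
    }

    public static void main(String[] args) {
        // Hypothetical collection: 2 documents, the term occurs in 1 of them,
        // 4 times, in a field that is 4 terms long.
        double expected = singleTermScore(4, 1, 2, 4);
        System.out.println(expected); // prints 1.0 (2.0 * 1.0 * 0.5)
    }
}
```

If the score you see while stepping through TermScorer differs from this number, the difference itself tells you which factor to look at next.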

Next thing you know, you will be submitting patches and on your way to being a committer.

Cheers,
Grant

On Nov 8, 2007, at 9:11 PM, Ariel wrote:

The link you suggested is very interesting, Mr. Grant Ingersoll.
Let me see if I understand how custom ranking could be implemented in Lucene:
1.      First I must create my own query class extending the abstract Query
class. The only method I must implement from this class is toString.
Is this right?
2.      I must implement the Weight interface inside my own query class.
But I really don't understand how this is going to let me change my
ranking.
3.      I must implement my own custom Scorer?
I don't know how to integrate all this. There are a lot of little pieces
of information, but nothing concrete.
Greetings



On Nov 7, 2007 1:48 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
Term Vectors (specifically TermFreqVector) in Lucene are a storage
mechanism for convenience and applications to use.  They are not an
integral part of the scoring in the way you may be thinking of them in
terms of the traditional Vector Space Model, thus there may be some
confusion from the different usages of that terminology.  If you want
to see examples of how to implement scorers have a look at classes
like TermScorer, BoostingTermQuery, and any of the other classes that
extend Scorer.  You might also find the file formats page (off of the
Lucene Java website under Documentation) helpful for understanding
what Lucene stores so that it can do scoring.
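
As a rough picture of how the three pieces fit together, here is a minimal sketch in plain Java. The types SketchQuery, SketchWeight, SketchScorer and TinyIndex are simplified stand-ins invented for this example, not Lucene's real classes; they only mirror the division of labor (the query describes what to find, the weight holds per-search statistics such as idf, the scorer walks the matching documents and computes each score). The log-scaled tf used in score() is just an arbitrary alternative formula.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-ins, NOT Lucene's API: they only illustrate the
// Query -> Weight -> Scorer split.
final class QueryWeightScorerSketch {

    /** Hypothetical in-memory index: term -> (docId -> term frequency). */
    static final class TinyIndex {
        final int numDocs;
        private final Map<String, Map<Integer, Integer>> postings = new LinkedHashMap<>();

        TinyIndex(int numDocs) { this.numDocs = numDocs; }

        void add(String term, int docId, int freq) {
            postings.computeIfAbsent(term, t -> new LinkedHashMap<>()).put(docId, freq);
        }

        Map<Integer, Integer> postingsFor(String term) {
            Map<Integer, Integer> p = postings.get(term);
            return p != null ? p : Collections.<Integer, Integer>emptyMap();
        }
    }

    /** Iterates over matching documents and scores the current one. */
    interface SketchScorer { boolean next(); int doc(); double score(); }

    /** Per-search state derived from index statistics. */
    interface SketchWeight { SketchScorer scorer(TinyIndex index); }

    /** Reusable description of what to search for. */
    interface SketchQuery { SketchWeight createWeight(TinyIndex index); }

    static final class TermSketchQuery implements SketchQuery {
        private final String term;

        TermSketchQuery(String term) { this.term = term; }

        @Override
        public SketchWeight createWeight(TinyIndex index) {
            // Collection-level statistics are computed once per search.
            int docFreq = index.postingsFor(term).size();
            final double idf = Math.log((double) index.numDocs / (docFreq + 1)) + 1.0;
            return idx -> new SketchScorer() {
                private final Iterator<Map.Entry<Integer, Integer>> it =
                        idx.postingsFor(term).entrySet().iterator();
                private Map.Entry<Integer, Integer> current;

                @Override public boolean next() {
                    if (!it.hasNext()) return false;
                    current = it.next();
                    return true;
                }

                @Override public int doc() { return current.getKey(); }

                // The per-document ranking formula lives here; swap in your own.
                @Override public double score() {
                    return (1.0 + Math.log(current.getValue())) * idf;
                }
            };
        }
    }

    /** Runs a query and collects docId -> score in postings order. */
    static Map<Integer, Double> search(SketchQuery query, TinyIndex index) {
        Map<Integer, Double> scores = new LinkedHashMap<>();
        SketchScorer scorer = query.createWeight(index).scorer(index);
        while (scorer.next()) {
            scores.put(scorer.doc(), scorer.score());
        }
        return scores;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex(3);
        index.add("lucene", 0, 1);
        index.add("lucene", 1, 3);
        System.out.println(search(new TermSketchQuery("lucene"), index));
    }
}
```

The point of the split is that changing the ranking formula only touches the scorer (and whatever statistics the weight prepares for it); the query class itself stays a thin description.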

There really isn't any tutorial on scoring: either not many people
have expressed an interest in one, or no one has made it a high
enough priority to write one.  Having written a Scorer (or maybe
two, I forget) I can give advice on specific things, but I am not sure
I could write a tutorial that is general enough to be useful at this
point.

One thought for associating a weight to a given term based on its
cooccurring terms is to use the new Payload mechanism whereby you can
store a byte array at each term which can then be used in scoring via
things like the BoostingTermQuery (or your own implementation.)  If
that is of interest, you can search the archives for payloads (I also
think Michael Busch is presenting on Payloads, amongst other things,
at ApacheCon in Atlanta) and have a look at the BoostingTermQuery.
There certainly are other PayloadQueries that need to be implemented.
See the Lucene wiki for some background and details on Payloads as well.
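
To illustrate the payload idea without depending on a particular Lucene version, here is a minimal sketch in plain Java. The helpers encode, decode and payloadFactor are hypothetical names for this example, not Lucene methods. It assumes each term position carries a 4-byte float weight (for instance a precomputed co-occurrence measure) and that the scorer multiplies its base score by the average of the payload weights it saw, which is roughly what BoostingTermQuery appears to do; verify that against the source before relying on it.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

// Sketch of a per-position weight stored as a byte array and folded into a score.
// The byte layout and the averaging rule are assumptions made for illustration.
final class PayloadWeightSketch {

    /** Packs a float weight into the 4-byte array that would be stored as the payload. */
    static byte[] encode(float weight) {
        return ByteBuffer.allocate(4).putFloat(weight).array();
    }

    /** Unpacks a weight written by encode(). */
    static float decode(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    /** Average payload weight over all positions seen; 1.0 if there were none. */
    static float payloadFactor(List<byte[]> payloads) {
        if (payloads.isEmpty()) {
            return 1.0f;
        }
        float sum = 0f;
        for (byte[] payload : payloads) {
            sum += decode(payload);
        }
        return sum / payloads.size();
    }

    public static void main(String[] args) {
        // Two occurrences of the term in one document, with weights 0.5 and 1.5.
        List<byte[]> payloads = Arrays.asList(encode(0.5f), encode(1.5f));
        float baseScore = 2.0f; // whatever the ordinary term score would have been
        System.out.println(baseScore * payloadFactor(payloads)); // prints 2.0
    }
}
```

The weights would be computed at indexing time, so the expensive correlation analysis happens once and the scorer only has to decode and combine bytes.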

I don't know that it is a big mistake to try this in Lucene.  The
community hasn't put a huge priority on making altering the innards of
scoring easier to deal with (if possible), but that doesn't mean we
are not open to suggestions and patches.  You may find
https://issues.apache.org/jira/browse/LUCENE-965 to be informative
for both the implementation and the discussion of
things that need to happen to be accepted into Lucene.  This JIRA
issue specifically attempts to provide Lucene with a new scoring
mechanism.

You might also have a look at Lemur (http://www.lemurproject.org/)
which is much more academically focused.

Cheers,
Grant


On Nov 7, 2007, at 12:49 PM, Ariel wrote:

So if I want to use a different scoring formula, I must implement my
own Query/Weight/Scorer? For example, instead of cosine distance,
leiderbage distance or some other. I'm studying the Query/Weight/Scorer
classes to find out how to do that, but there is not much documentation
about it.

I have seen that I can change the similarity factors by extending the
Similarity class, but I have not seen any example of changing the
scoring formula or changing the per-term weight in the term vector. Do
you know of any tutorial about this?

What I want to do by changing the frequency in the term vector is this:
instead of taking the tf (term frequency) of the term and storing it
in the vector, I want to compute the correlation of the term with the
other terms of the document and store that measure per term in the
vector, so that later my custom similarity formula can calculate the
ranking of a document against a query considering the correlation
between terms.
Do you think it is a big mistake to try to do this with Lucene? Is
there any way?

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


