Thank you, Jose.

-----Original Message-----
From: José Ramón Pérez Agüera [mailto:jose.agu...@gmail.com]
Sent: Wednesday, January 27, 2010 1:42 PM
To: java-user@lucene.apache.org
Subject: Re: Average Precision - TREC-3
Hi Ivan,

you might want to use the Lucene BM25 implementation. Results should improve by changing the ranking function. Another option is the language model implementation for Lucene:

http://nlp.uned.es/~jperezi/Lucene-BM25/
http://ilps.science.uva.nl/resources/lm-lucene

The main problem with these implementations is that they do not support every kind of Lucene query, but if you don't need that, these alternative implementations are a good choice.

best
jose

On Wed, Jan 27, 2010 at 1:36 PM, Ivan Provalov <iprov...@yahoo.com> wrote:
> Robert, Grant:
>
> Thank you for your replies.
>
> Our goal is to fine-tune our existing system to perform better on relevance.
>
> I agree with Robert's comment that these collections are not completely
> compatible. Yes, it is possible that the results will vary somewhat
> depending on the differences between the collections. The reason we picked
> the TREC-3 TIPSTER collection is that our production content overlaps with
> some TIPSTER documents.
>
> Any suggestions on how to obtain results comparable to published TREC-3
> numbers, or on selecting a better approach, would be appreciated.
>
> We are doing this project in three stages:
>
> 1. Test Lucene's "vanilla" performance to establish the baseline. We want
> to iron out issues such as topic or document formats. For example, we had
> to add a different parser and clean up the topic titles. This will give us
> confidence that we are using the data and the methodology correctly.
>
> 2. Fine-tune Lucene based on the latest research findings (TREC by E.
> Voorhees, conference proceedings, etc.).
>
> 3. Repeat these steps with our production system, which runs on Lucene.
> The reason we are doing this step last is to ensure that our overall
> system doesn't introduce relevance issues (content pre-processing steps,
> query parsing steps, etc.).
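[The ranking-function change Jose recommends can be illustrated outside Lucene. Below is a minimal, self-contained sketch of the classic Okapi BM25 scoring formula with the usual k1 and b parameters; the function name and toy corpus are ours, for illustration only, and real BM25 implementations such as the one linked above handle tokenization, indexing, and query parsing on top of this.]

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one document against a query using the classic BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        # BM25's "plus-half" IDF, kept positive for rare terms
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc.count(term)
        # Term-frequency saturation with document-length normalization
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Toy corpus: documents are pre-tokenized lists of terms
corpus = [
    ["coping", "with", "overcrowded", "prisons"],
    ["prisons", "budget", "report"],
    ["weather", "report"],
]
scores = [bm25_score(["overcrowded", "prisons"], d, corpus) for d in corpus]
best = scores.index(max(scores))  # document matching both query terms ranks first
```

The key difference from Lucene's classic TF-IDF/cosine scoring is the saturating term-frequency component and the explicit length normalization controlled by b, which is one reason swapping the ranking function can move average precision on TREC collections.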
>
> Thank you,
>
> Ivan Provalov
>
> --- On Wed, 1/27/10, Robert Muir <rcm...@gmail.com> wrote:
>
>> From: Robert Muir <rcm...@gmail.com>
>> Subject: Re: Average Precision - TREC-3
>> To: java-user@lucene.apache.org
>> Date: Wednesday, January 27, 2010, 11:16 AM
>>
>> Hello, forgive my ignorance here (I have not worked with these English
>> TREC collections), but is the TREC-3 test collection the same as the test
>> collection used in the 2007 paper you referenced?
>>
>> It looks like that is a different collection; it's not really possible to
>> compare these relevance scores across different collections.
>>
>> On Wed, Jan 27, 2010 at 11:06 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>>
>> > On Jan 26, 2010, at 8:28 AM, Ivan Provalov wrote:
>> >
>> > > We are looking into making some improvements to relevance ranking of
>> > > our search platform based on Lucene. We started by running the Ad Hoc
>> > > TREC task on the TREC-3 data using "out-of-the-box" Lucene. The
>> > > reason to run this old TREC-3 data (TIPSTER Disk 1 and Disk 2; topics
>> > > 151-200) was that the content matches the content of our production
>> > > system.
>> > >
>> > > We are currently getting an average precision of 0.14. We found some
>> > > format issues with the TREC-3 data which were causing an even lower
>> > > score. For example, the initial average precision number was 0.09. We
>> > > discovered that the topics included the word "Topic:" in the <title>
>> > > tag, for example, "<title> Topic: Coping with overcrowded prisons".
>> > > By removing this term from the queries, we bumped the average
>> > > precision to 0.14.
>> >
>> > There's usually a lot of this involved in running TREC. I've also seen
>> > a good deal of improvement from things like using phrase queries and
>> > the Dismax Query Parser in Solr (which uses DisjunctionMaxQuery in
>> > Lucene, amongst other things) and by playing around with length
>> > normalization.
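[The "Topic:" cleanup Ivan describes amounts to stripping boilerplate from the topic's <title> field before building queries. A minimal sketch, with a helper name of our own choosing; real TREC topic files also need SGML parsing, which is out of scope here:]

```python
def clean_topic_title(raw_title):
    """Strip the boilerplate 'Topic:' prefix that TREC-3 topics carry
    inside their <title> field, so it doesn't pollute the query."""
    title = raw_title.strip()
    if title.lower().startswith("topic:"):
        title = title[len("topic:"):].strip()
    return title

# Example drawn from the thread: topic title for topic "Coping with overcrowded prisons"
query = clean_topic_title(" Topic: Coping with overcrowded prisons")
```

Left in place, "Topic:" acts as a high-document-frequency noise term in every query, which is consistent with the drop from 0.14 to 0.09 average precision Ivan reports.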
>> >
>> > > Our query is based on the title tag of the topic, and the index
>> > > field is based on the <TEXT> tag of the document:
>> > >
>> > > QualityQueryParser qqParser = new SimpleQQParser("title", "TEXT");
>> > >
>> > > Is there an average precision number which "out-of-the-box" Lucene
>> > > should be close to? For example, this 2007 TREC paper from IBM
>> > > mentions 0.154:
>> > > http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
>> >
>> > Hard to say. I can't say I've run TREC-3. You might ask over on the
>> > Open Relevance list too (http://lucene.apache.org/openrelevance). I
>> > know Robert Muir's done a lot of experiments with Lucene on standard
>> > collections like TREC.
>> >
>> > I guess the bigger question back to you is: what is your goal? Is it
>> > to get better at TREC or to actually tune your system?
>> >
>> > -Grant
>> >
>> > --------------------------
>> > Grant Ingersoll
>> > http://www.lucidimagination.com/
>> >
>> > Search the Lucene ecosystem using Solr/Lucene:
>> > http://www.lucidimagination.com/search
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>> --
>> Robert Muir
>> rcm...@gmail.com

--
Jose R.
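[Since the thread compares average precision numbers (0.09, 0.14, 0.154), here is a minimal sketch of how average precision is computed for a single topic from a ranked result list and a set of relevance judgments. This is the standard definition used by trec_eval-style evaluation, not code from the thread; the document IDs are made up:]

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one topic: the mean of precision@k taken at
    each rank k where a relevant document appears, normalized by the
    total number of relevant documents for the topic."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / k  # precision@k at this relevant hit
    return precision_sum / len(relevant_ids)

# Relevant docs d3 and d2 retrieved at ranks 1 and 4:
# precision@1 = 1/1, precision@4 = 2/4, AP = (1.0 + 0.5) / 2 = 0.75
ap = average_precision(["d3", "d1", "d7", "d2"], {"d3", "d2"})
```

The figures discussed in the thread are this quantity averaged over all 50 topics (mean average precision), which is why a boilerplate term in every query title can depress the score across the board.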
Pérez-Agüera
Clinical Assistant Professor
Metadata Research Center
School of Information and Library Science
University of North Carolina at Chapel Hill
email: jagu...@email.unc.edu
Web page: http://www.unc.edu/~jaguera/
MRC website: http://ils.unc.edu/mrc/