Robert, Grant:

Thank you for your replies.  

Our goal is to fine-tune our existing system to perform better on relevance.

I agree with Robert's comment that these collections are not completely 
compatible.  Yes, it is possible that the results will vary somewhat 
depending on the differences between the collections.  The reason we picked 
the TREC-3 TIPSTER collection is that our production content overlaps with 
some TIPSTER documents.

Any suggestions on how to obtain TREC-3 results with Lucene that are 
comparable to published runs, or on a better approach, would be appreciated.

We are doing this project in three stages:

1. Test Lucene's "vanilla" performance to establish a baseline.  We want to 
iron out issues such as topic or document formats.  For example, we had to 
add a different parser and clean up the topic titles.  This will give us 
confidence that we are using the data and the methodology correctly.  (A 
minimal sketch of this stage follows the list.)

2. Fine-tune Lucene based on the latest research findings (TREC publications 
by E. Voorhees, conference proceedings, etc.).

3. Repeat these steps with our production system, which runs on Lucene.  The 
reason we are doing this step last is to ensure that our overall system 
doesn't introduce relevance issues of its own (content pre-processing steps, 
query parsing steps, etc.).
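
Here is a minimal sketch of stage 1, using the quality package from Lucene's 
contrib/benchmark.  The topics/qrels paths, the index directory, the run 
name, and the stored "docname" field are placeholders for our setup; the 
"Topic:" cleanup is the title fix I described in my earlier message (quoted 
below):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.benchmark.quality.Judge;
import org.apache.lucene.benchmark.quality.QualityBenchmark;
import org.apache.lucene.benchmark.quality.QualityQuery;
import org.apache.lucene.benchmark.quality.QualityQueryParser;
import org.apache.lucene.benchmark.quality.QualityStats;
import org.apache.lucene.benchmark.quality.trec.TrecJudge;
import org.apache.lucene.benchmark.quality.trec.TrecTopicsReader;
import org.apache.lucene.benchmark.quality.utils.SimpleQQParser;
import org.apache.lucene.benchmark.quality.utils.SubmissionReport;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class Trec3Baseline {
  public static void main(String[] args) throws Exception {
    PrintWriter logger = new PrintWriter(System.out, true);

    // Topics 151-200 and their relevance judgments (paths are placeholders).
    TrecTopicsReader topicsReader = new TrecTopicsReader();
    QualityQuery[] topics = topicsReader.readQueries(
        new BufferedReader(new FileReader("topics.151-200")));
    Judge judge = new TrecJudge(
        new BufferedReader(new FileReader("qrels.151-200")));
    judge.validateData(topics, logger);

    // TREC-3 titles carry a "Topic:" prefix; strip it so it doesn't
    // pollute the queries (this is the cleanup mentioned in stage 1).
    for (int i = 0; i < topics.length; i++) {
      Map<String, String> fixed = new HashMap<String, String>();
      for (String name : topics[i].getNames()) {
        fixed.put(name,
            topics[i].getValue(name).replaceFirst("^\\s*Topic:\\s*", ""));
      }
      topics[i] = new QualityQuery(topics[i].getQueryID(), fixed);
    }

    // Build queries from the topic <title> against the TEXT field.
    QualityQueryParser qqParser = new SimpleQQParser("title", "TEXT");

    IndexSearcher searcher = new IndexSearcher(
        FSDirectory.open(new File("trec3-index")), true);

    // "docname" is assumed to be the stored field holding each DOCNO.
    QualityBenchmark qrun =
        new QualityBenchmark(topics, qqParser, searcher, "docname");
    SubmissionReport submitLog =
        new SubmissionReport(new PrintWriter("baseline.trec.txt"), "baseline");
    QualityStats[] stats = qrun.execute(judge, submitLog, logger);

    // Average precision, recall, etc., averaged over the 50 topics.
    QualityStats avg = QualityStats.average(stats);
    avg.log("SUMMARY", 2, logger, "  ");
  }
}

For stage 2, one of the knobs Grant mentions below is length normalization.  
A hypothetical first experiment would be a one-line Similarity override 
(SweetSpotSimilarity in contrib/misc is a ready-made alternative):

import org.apache.lucene.search.DefaultSimilarity;

// Hypothetical tweak: ignore document length entirely when scoring.
public class FlatNormSimilarity extends DefaultSimilarity {
  @Override
  public float lengthNorm(String fieldName, int numTerms) {
    return 1.0f;
  }
}

// Then, before running the benchmark above:
// searcher.setSimilarity(new FlatNormSimilarity());

We would re-run the benchmark after each such change and compare the summary 
average precision.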

Thank you,

Ivan Provalov

--- On Wed, 1/27/10, Robert Muir <rcm...@gmail.com> wrote:

> From: Robert Muir <rcm...@gmail.com>
> Subject: Re: Average Precision - TREC-3
> To: java-user@lucene.apache.org
> Date: Wednesday, January 27, 2010, 11:16 AM
> Hello, forgive my ignorance here (I have not worked with these English
> TREC collections), but is the TREC-3 test collection the same as the test
> collection used in the 2007 paper you referenced?
> 
> It looks like that is a different collection; it's not really possible to
> compare these relevance scores across different collections.
> 
> On Wed, Jan 27, 2010 at 11:06 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> 
> >
> > On Jan 26, 2010, at 8:28 AM, Ivan Provalov wrote:
> >
> > > We are looking into making some improvements to relevance ranking of
> > > our search platform based on Lucene.  We started by running the Ad Hoc
> > > TREC task on the TREC-3 data using "out-of-the-box" Lucene.  The reason
> > > to run this old TREC-3 (TIPSTER Disk 1 and Disk 2; topics 151-200) data
> > > was that the content matches the content of our production system.
> > >
> > > We are currently getting average precision of 0.14.  We found some
> > > format issues with the TREC-3 data which were causing an even lower
> > > score.  For example, the initial average precision number was 0.09.  We
> > > discovered that the topics included the word "Topic:" in the <title>
> > > tag.  For example, "<title> Topic:  Coping with overcrowded prisons".
> > > By removing this term from the queries, we bumped the average precision
> > > to 0.14.
> >
> > There's usually a lot of this involved in running TREC.  I've also seen
> > a good deal of improvement from things like using phrase queries and the
> > DisMax Query Parser in Solr (which uses DisjunctionMaxQuery in Lucene,
> > amongst other things) and by playing around with length normalization.
> >
> >
> > >
> > > Our query is based on the title tag of the topic and the index field
> > > is based on the <TEXT> tag of the document.
> > >
> > > QualityQueryParser qqParser = new SimpleQQParser("title", "TEXT");
> > >
> > > Is there an average precision number which "out-of-the-box" Lucene
> > > should be close to?  For example, this 2007 IBM TREC paper mentions
> > > 0.154:
> > > http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
> >
> > Hard to say.  I can't say I've run TREC 3.  You might ask over on the
> > Open Relevance list too (http://lucene.apache.org/openrelevance).  I
> > know Robert Muir's done a lot of experiments with Lucene on standard
> > collections like TREC.
> >
> > I guess the bigger question back to you is: what is your goal?  Is it to
> > get better at TREC or to actually tune your system?
> >
> > -Grant
> >
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem using Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
> >
> 
> -- 
> Robert Muir
> rcm...@gmail.com
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org