Thank you, Jose.

-----Original Message-----
From: José Ramón Pérez Agüera [mailto:jose.agu...@gmail.com]
Sent: Wednesday, January 27, 2010 1:42 PM
To: java-user@lucene.apache.org
Subject: Re: Average Precision - TREC-3
Hi Ivan,

you might want to use the Lucene BM25 implementation. Results should improve by changing the ranking function. Another option is the language model implementation for Lucene:

http://nlp.uned.es/~jperezi/Lucene-BM25/
http://ilps.science.uva.nl/resources/lm-lucene

The main problem with these implementations is that they do not support every kind of Lucene query, but if you don't need that, these alternative implementations are a good choice.

best
jose

On Wed, Jan 27, 2010 at 1:36 PM, Ivan Provalov <iprov...@yahoo.com> wrote:
> Robert, Grant:
>
> Thank you for your replies.
>
> Our goal is to fine-tune our existing system to perform better on relevance.
>
> I agree with Robert's comment that these collections are not completely
> compatible. Yes, it is possible that the results will vary somewhat
> depending on the differences between the collections. The reason we picked
> the TREC-3 TIPSTER collection is that our production content overlaps with
> some TIPSTER documents.
>
> Any suggestions on how to obtain results comparable to published TREC-3
> numbers, or on selecting a better approach, would be appreciated.
>
> We are doing this project in three stages:
>
> 1. Test Lucene's "vanilla" performance to establish the baseline. We want
> to iron out issues such as topic or document formats. For example, we had
> to add a different parser and clean up the topic titles. This will give us
> confidence that we are using the data and the methodology correctly.
>
> 2. Fine-tune Lucene based on the latest research findings (TREC by E.
> Voorhees, conference proceedings, etc.).
>
> 3. Repeat these steps with our production system, which runs on Lucene.
> The reason we are doing this step last is to ensure that our overall
> system doesn't introduce relevance issues (content pre-processing steps,
> query parsing steps, etc.).
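[The ranking-function change Jose recommends can be illustrated outside Lucene. Below is a minimal, self-contained sketch of the classic Okapi BM25 scoring formula with the usual k1 and b parameters; the function name and toy corpus are ours, for illustration only, and real BM25 implementations such as the one linked above handle tokenization, indexing, and query parsing on top of this.]

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one document against a query using the classic BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        # BM25's "plus-half" IDF, kept positive for rare terms
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc.count(term)
        # Term-frequency saturation with document-length normalization
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

# Toy corpus: documents are pre-tokenized lists of terms
corpus = [
    ["coping", "with", "overcrowded", "prisons"],
    ["prisons", "budget", "report"],
    ["weather", "report"],
]
scores = [bm25_score(["overcrowded", "prisons"], d, corpus) for d in corpus]
best = scores.index(max(scores))  # document matching both query terms ranks first
```

The key difference from Lucene's classic TF-IDF/cosine scoring is the saturating term-frequency component and the explicit length normalization controlled by b, which is one reason swapping the ranking function can move average precision on TREC collections.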
>
> Thank you,
>
> Ivan Provalov
>
> --- On Wed, 1/27/10, Robert Muir <rcm...@gmail.com> wrote:
>
>> From: Robert Muir <rcm...@gmail.com>
>> Subject: Re: Average Precision - TREC-3
>> To: java-user@lucene.apache.org
>> Date: Wednesday, January 27, 2010, 11:16 AM
>>
>> Hello, forgive my ignorance here (I have not worked with these English
>> TREC collections), but is the TREC-3 test collection the same as the test
>> collection used in the 2007 paper you referenced?
>>
>> It looks like that is a different collection; it's not really possible to
>> compare these relevance scores across different collections.
>>
>> On Wed, Jan 27, 2010 at 11:06 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>>
>> > On Jan 26, 2010, at 8:28 AM, Ivan Provalov wrote:
>> >
>> > > We are looking into making some improvements to relevance ranking of
>> > > our search platform based on Lucene. We started by running the Ad Hoc
>> > > TREC task on the TREC-3 data using "out-of-the-box" Lucene. The
>> > > reason to run this old TREC-3 data (TIPSTER Disk 1 and Disk 2; topics
>> > > 151-200) was that the content matches the content of our production
>> > > system.
>> > >
>> > > We are currently getting an average precision of 0.14. We found some
>> > > format issues with the TREC-3 data which were causing an even lower
>> > > score. For example, the initial average precision number was 0.09. We
>> > > discovered that the topics included the word "Topic:" in the <title>
>> > > tag, for example, "<title> Topic: Coping with overcrowded prisons".
>> > > By removing this term from the queries, we bumped the average
>> > > precision to 0.14.
>> >
>> > There's usually a lot of this involved in running TREC. I've also seen
>> > a good deal of improvement from things like using phrase queries and
>> > the Dismax Query Parser in Solr (which uses DisjunctionMaxQuery in
>> > Lucene, amongst other things) and by playing around with length
>> > normalization.
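[The "Topic:" cleanup Ivan describes amounts to stripping boilerplate from the topic's <title> field before building queries. A minimal sketch, with a helper name of our own choosing; real TREC topic files also need SGML parsing, which is out of scope here:]

```python
def clean_topic_title(raw_title):
    """Strip the boilerplate 'Topic:' prefix that TREC-3 topics carry
    inside their <title> field, so it doesn't pollute the query."""
    title = raw_title.strip()
    if title.lower().startswith("topic:"):
        title = title[len("topic:"):].strip()
    return title

# Example drawn from the thread: topic title for topic "Coping with overcrowded prisons"
query = clean_topic_title(" Topic: Coping with overcrowded prisons")
```

Left in place, "Topic:" acts as a high-document-frequency noise term in every query, which is consistent with the drop from 0.14 to 0.09 average precision Ivan reports.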
>> >
>> > > Our query is based on the title tag of the topic, and the index
>> > > field is based on the <TEXT> tag of the document:
>> > >
>> > > QualityQueryParser qqParser = new SimpleQQParser("title", "TEXT");
>> > >
>> > > Is there an average precision number which "out-of-the-box" Lucene
>> > > should be close to? For example, this 2007 TREC paper from IBM
>> > > mentions 0.154:
>> > > http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
>> >
>> > Hard to say. I can't say I've run TREC-3. You might ask over on the
>> > Open Relevance list too (http://lucene.apache.org/openrelevance). I
>> > know Robert Muir's done a lot of experiments with Lucene on standard
>> > collections like TREC.
>> >
>> > I guess the bigger question back to you is: what is your goal? Is it
>> > to get better at TREC or to actually tune your system?
>> >
>> > -Grant
>> >
>> > --------------------------
>> > Grant Ingersoll
>> > http://www.lucidimagination.com/
>> >
>> > Search the Lucene ecosystem using Solr/Lucene:
>> > http://www.lucidimagination.com/search
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>> --
>> Robert Muir
>> rcm...@gmail.com

--
Jose R.
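[Since the thread compares average precision numbers (0.09, 0.14, 0.154), here is a minimal sketch of how average precision is computed for a single topic from a ranked result list and a set of relevance judgments. This is the standard definition used by trec_eval-style evaluation, not code from the thread; the document IDs are made up:]

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average precision for one topic: the mean of precision@k taken at
    each rank k where a relevant document appears, normalized by the
    total number of relevant documents for the topic."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / k  # precision@k at this relevant hit
    return precision_sum / len(relevant_ids)

# Relevant docs d3 and d2 retrieved at ranks 1 and 4:
# precision@1 = 1/1, precision@4 = 2/4, AP = (1.0 + 0.5) / 2 = 0.75
ap = average_precision(["d3", "d1", "d7", "d2"], {"d3", "d2"})
```

The figures discussed in the thread are this quantity averaged over all 50 topics (mean average precision), which is why a boilerplate term in every query title can depress the score across the board.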
Pérez-Agüera
Clinical Assistant Professor
Metadata Research Center
School of Information and Library Science
University of North Carolina at Chapel Hill
email: jagu...@email.unc.edu
Web page: http://www.unc.edu/~jaguera/
MRC website: http://ils.unc.edu/mrc/