[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

Robert Muir (JIRA) Thu, 17 Feb 2011 08:02:53 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995865#comment-12995865 ]


Robert Muir commented on LUCENE-1540:
-------------------------------------

Hi guys, this is just a feature request (we can open a new issue if anyone
is up for it).

I was wondering if we could do a simple write-up with some basic
instructions on how to use this functionality, and put it in the website
notes for 3.1, 3.2, or whenever we can get to it.

I noticed that more research-oriented search engines come with simple
instructions for how to index TREC collections, run relevance experiments,
and get trec_eval results. I feel that if we had the same, it would really
help get new folks involved with Lucene.

Some examples (some are simpler than others, but at least they all describe
how to build the index):
* Terrier: http://terrier.org/docs/current/trec_examples.html
* Zettair: http://www.seg.rmit.edu.au/zettair/quick_start_trec.html
* MG4J: http://mg4j.dsi.unimi.it/man/manual/ch01s04.html#id2769812
* Indri: http://ciir.cs.umass.edu/~strohman/indri/



> Improvements to contrib.benchmark for TREC collections
> ------------------------------------------------------
>
>                 Key: LUCENE-1540
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1540
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Tim Armstrong
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, trecdocs.zip
>
>
> The benchmarking utilities for TREC test collections (http://trec.nist.gov)
> are quite limited and do not support some of the format variations found in
> older TREC collections.
> I have been doing some benchmarking work with Lucene and have had to modify
> the package to support:
> * Older TREC document formats, which the current parser fails on due to
>   missing document headers.
> * Variations in query format: a newline after the <title> tag confuses the
>   query parser.
> * Detecting and reading uncompressed text collections.
> * Storing document numbers by default without storing the full text.
> I can submit a patch if there is interest, although I will probably want to
> write unit tests for the new functionality first.
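
For illustration only (this is not the code from the attached patches): a
minimal sketch of how a topic reader might tolerate a newline after the
<title> tag, as described in the query-format item above. The class name
TrecTitleSketch and the method extractTitle are hypothetical and are not
part of contrib/benchmark; the optional "Topic:" prefix is an assumption
about how some older topic files label the title.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrecTitleSketch {
    // Hypothetical helper, not part of contrib/benchmark.
    // After <title>, skip any whitespace (including newlines) and an optional
    // "Topic:" label, then capture everything up to the next tag or end of input.
    private static final Pattern TITLE = Pattern.compile(
            "<title>\\s*(?:Topic:)?\\s*([^<]*)", Pattern.CASE_INSENSITIVE);

    static String extractTitle(String topic) {
        Matcher m = TITLE.matcher(topic);
        if (!m.find()) {
            return ""; // no <title> tag in this topic
        }
        // Collapse line breaks and runs of whitespace into single spaces.
        return m.group(1).trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Title text starts on the line after the tag.
        String topic = "<top>\n<num> Number: 301\n<title>\n"
                + "international organized crime\n\n<desc> Description:\n...";
        System.out.println(extractTitle(topic)); // prints "international organized crime"
    }
}
```

Because the pattern treats a line break like any other whitespace, the same
call handles titles that start on the tag's own line and titles that start
on the following line.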

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
