[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

Doron Cohen (JIRA) Sun, 06 Feb 2011 09:10:59 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991176#comment-12991176
 ]


Doron Cohen commented on LUCENE-1540:
-------------------------------------

I am able to reproduce this on Linux.
The test fails with *locale tr_TR* because TrecDocParser was upper-casing the 
file names to decide which parser to apply.
The problem is that toUpperCase is locale-sensitive: under tr_TR a lower-case 
'i' upper-cases to the dotted 'İ' (U+0130) rather than 'I', so the upper-cased 
file name no longer matched the enum name.
Fixed by adding a lower-case dirName member to the enums.
Also recreated the test files zip with '-UN u' for UTF-8 handling of file names 
in the zip.
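
To illustrate the failure mode and the shape of the fix, here is a minimal 
sketch. It is not the committed patch: the class LocaleSafePathType, the enum 
PathType, its constants, and the helpers fragileLookup and safeLookup are all 
hypothetical names chosen for this example. Only the idea (an explicit 
lower-case dirName member instead of upper-casing the file name) comes from 
the comment above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class LocaleSafePathType {

    /** Hypothetical enum: names and dirName values are illustrative only. */
    enum PathType {
        GOV2("gov2"),
        LATIMES("latimes");

        /** Lower-case directory name, stored explicitly so no case mapping is needed. */
        final String dirName;

        PathType(String dirName) {
            this.dirName = dirName;
        }
    }

    /** Lookup table keyed by the explicit lower-case dirName. */
    private static final Map<String, PathType> BY_DIR_NAME = new HashMap<>();
    static {
        for (PathType type : PathType.values()) {
            BY_DIR_NAME.put(type.dirName, type);
        }
    }

    /** Fragile: upper-cases with the given locale, then matches the enum name. */
    static PathType fragileLookup(String dirName, Locale locale) {
        try {
            return PathType.valueOf(dirName.toUpperCase(locale));
        } catch (IllegalArgumentException e) {
            return null; // no enum constant with that name
        }
    }

    /** Locale-independent: compares against the stored lower-case dirName. */
    static PathType safeLookup(String dirName) {
        return BY_DIR_NAME.get(dirName);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println("en: " + fragileLookup("latimes", Locale.ENGLISH));
        System.out.println("tr: " + fragileLookup("latimes", turkish));
        System.out.println("safe: " + safeLookup("latimes"));
    }
}
```

With a Turkish locale the fragile lookup finds no constant, because the 
upper-cased string contains 'İ' instead of 'I'. The safe lookup never calls 
toUpperCase, so the default locale cannot affect it.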

Committed at r1067699 for 3x.

In trunk the test also passes on Linux with the same args, but it fails if you 
pass the locale that was randomly selected in 3x, like this: 

  ant test -Dtestcase=TrecContentSourceTest -Dtestmethod=testTrecFeedDirAllTypes -Dtests.locale=tr_TR

Will merge the fix to trunk shortly.

> Improvements to contrib.benchmark for TREC collections
> ------------------------------------------------------
>
>                 Key: LUCENE-1540
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1540
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Tim Armstrong
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
> LUCENE-1540.patch, trecdocs.zip
>
>
> The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
> are quite limited and do not support some of the variations in format of 
> older TREC collections.  
> I have been doing some benchmarking work with Lucene and have had to modify 
> the package to support:
> * Older TREC document formats, which the current parser fails on due to 
> missing document headers.
> * Variations in query format - newlines after <title> tag causing the query 
> parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to 
> write unit tests for the new functionality first.
