[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2010-09-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12914301#action_12914301
 ] 

Robert Muir commented on LUCENE-1540:
-

Tim, if you have modified benchmark to work with various formats of older TREC 
collections, that would be really nice.

> Improvements to contrib.benchmark for TREC collections
> --
>
> Key: LUCENE-1540
> URL: https://issues.apache.org/jira/browse/LUCENE-1540
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Affects Versions: 2.4
>Reporter: Tim Armstrong
>Priority: Minor
>
> The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
> are quite limited and do not support some of the variations in format of 
> older TREC collections.  
> I have been doing some benchmarking work with Lucene and have had to modify 
> the package to support:
> * Older TREC document formats, which the current parser fails on due to 
> missing document headers.
> * Variations in query format - newlines after  tag causing the query 
> parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to 
> write unit tests for the new functionality first.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-01-11 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980022#action_12980022
 ] 

Doron Cohen commented on LUCENE-1540:
-

Indeed TrecContentSource is inadequate for the Trec-Disks-4+5-minus-CR 
collection (FBIS, FR94, FT, LATimes) so I am writing something to process this 
collection, in which, interestingly, each sub-collection's format slightly 
differs. (Will use this with the robust 2004 queries.) If there are ready to 
use building blocks for this that would be helpful. 

I am thinking of writing separate content source implementations for each 
format, the current one being the gov2 format, and having openNextFile() 
identify the correct trec format according to the file path, i.e. if the file 
is under LATimes, the content source for that format will be used. The default 
will remain as today, for backcompat, and will be used if the path does not 
match any of the defined patterns. It should also be possible to specify, 
perhaps in a property, the default trec format.

> Improvements to contrib.benchmark for TREC collections
> --
>
> Key: LUCENE-1540
> URL: https://issues.apache.org/jira/browse/LUCENE-1540
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Affects Versions: 2.4
>Reporter: Tim Armstrong
>Assignee: Doron Cohen
>Priority: Minor
>



[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-01-11 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12980080#action_12980080
 ] 

Shai Erera commented on LUCENE-1540:


Perhaps instead of separate ContentSource implementations, we can have 
TrecContentSource use a TrecDocParser (new class) or something, for parsing 
different formats. We can then have Gov2Parser, LATimesParser etc. for parsing 
the different formats, and TrecContentSource would use the appropriate parser 
per the path detected, as you suggest.

In addition, we can have it use a specific format through a configuration 
parameter, in which case it will not attempt to auto-detect the right format, 
but always use the specified parser. Though Benchmark (as well as all other 
contrib / modules) does not need to maintain back-compat, I think that if we go 
with this approach, it can default to using the Gov2Parser, and thus you 
achieve backwards support.
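
A minimal sketch of the selection logic described above, not the code from any 
patch: an explicitly configured format wins, otherwise the file path is 
inspected, and GOV2 is the fallback. The class name, the ParserType enum and 
the select() method are hypothetical names used for illustration only.

```java
import java.io.File;
import java.util.Locale;

public class TrecParserSelection {

    /** Hypothetical format identifiers; the names are illustrative only. */
    enum ParserType { GOV2, FBIS, FR94, FT, LATIMES }

    /**
     * Picks a format for the given file. A non-null configured value (for
     * example, one read from a configuration property) always wins; otherwise
     * the directory names on the path are inspected, and GOV2 is the fallback.
     */
    static ParserType select(File file, ParserType configured) {
        if (configured != null) {
            return configured; // no auto-detection when a format is forced
        }
        // Walk up the path so that ".../latimes/la010189.gz" matches LATIMES.
        for (File f = file; f != null; f = f.getParentFile()) {
            // Locale.ROOT keeps the case conversion independent of the default locale.
            String name = f.getName().toUpperCase(Locale.ROOT);
            for (ParserType t : ParserType.values()) {
                if (t.name().equals(name)) {
                    return t;
                }
            }
        }
        return ParserType.GOV2; // default when no path element matches
    }

    public static void main(String[] args) {
        System.out.println(select(new File("trec/latimes/la010189.gz"), null));
        System.out.println(select(new File("data/docs.gz"), null));
        System.out.println(select(new File("trec/latimes/la010189.gz"), ParserType.FT));
    }
}
```

Passing a non-null second argument models the configuration parameter: the 
path is then ignored entirely.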




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-01-14 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982028#action_12982028
 ] 

Shai Erera commented on LUCENE-1540:


Patch looks good !

Can you make TrecContentSource.read() public and not package-private? That way 
people can use it outside benchmark's package as well, supporting 
other/newer/older TREC formats.

> Improvements to contrib.benchmark for TREC collections
> --
>
> Key: LUCENE-1540
> URL: https://issues.apache.org/jira/browse/LUCENE-1540
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Affects Versions: 2.4
>Reporter: Tim Armstrong
>Assignee: Doron Cohen
>Priority: Minor
> Attachments: LUCENE-1540.patch
>
>



[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-01-23 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985420#action_12985420
 ] 

Doron Cohen commented on LUCENE-1540:
-

Hi Shai, thanks for reviewing!
I agree about making read() public, and same for parseDate().
As we discussed offline the interface with TrecParser is not ideal - I looked 
at the option we discussed to have the TrecContentSource just read everything 
between  and  and then let the TrecDocParser do everything - in one 
call to it - but as this would mean going through the data 3 times (compared 
to 2 times today) this is not appealing either and I would rather stay with the 
two-method interface for now.




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-01-23 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985506#action_12985506
 ] 

Shai Erera commented on LUCENE-1540:


Ok though I really think the 3 vs 2 times is negligible. The extra time we add 
is very simple - it's the only one that does IO, and even then, it reads lines 
and compares them to  or  (which is a very simple comparison). From 
then on, it parses the actual TREC document in-memory.

This is something I think could have even improved the current multi-threading 
support in TrecContentSource - today the threads sync on each one reading the 
TREC document, which means parsing its structure, and the only thing that's 
done in parallel is parsing the Html content. It'd be interesting to benchmark 
the 3-passes method, where each thread would sync on reading the section from 
 to  and then proceed to actually parse the structure.

It sounds like TrecContentSource could have acted like a SAX parser, reading 
TrecDoc objects and emitting them to a BlockingQueue, while threads would read 
from it and proceed on their own.

What I do agree on is that 3-passes is unnecessarily more expensive for 
single-threaded benchmarks.
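
The queue-based idea above could look roughly like the following sketch. It is 
not TrecContentSource code: the class and method names are hypothetical, the 
document delimiters are assumed to be <DOC> and </DOC>, and the "parsing" done 
by the workers is only a stand-in. One reader does all the IO and hands raw 
documents to a BlockingQueue; worker threads take documents from the queue and 
parse them without any further shared lock.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueuedTrecReader {
    // Assumed document delimiters; hypothetical, not taken from the patch.
    static final String DOC = "<DOC>";
    static final String DOC_END = "</DOC>";
    // End-of-input marker, compared by identity so it cannot collide with real content.
    static final String POISON = new String("EOF");

    /** Single reader: the only place that does IO and delimiter comparison. */
    static void produce(BufferedReader in, BlockingQueue<String> queue, int consumers)
            throws IOException, InterruptedException {
        StringBuilder buf = null;
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            String trimmed = line.trim();
            if (trimmed.equals(DOC)) {
                buf = new StringBuilder();
            } else if (trimmed.equals(DOC_END)) {
                if (buf != null) {
                    queue.put(buf.toString());
                }
                buf = null;
            } else if (buf != null) {
                buf.append(line).append('\n');
            }
        }
        // One marker per worker so that every worker terminates.
        for (int i = 0; i < consumers; i++) {
            queue.put(POISON);
        }
    }

    /** Worker: processes raw documents in memory. */
    static void consume(BlockingQueue<String> queue, List<String> parsed)
            throws InterruptedException {
        for (String raw = queue.take(); raw != POISON; raw = queue.take()) {
            parsed.add(raw.trim()); // stand-in for real structure parsing
        }
    }

    /** Runs one reader and the given number of workers over the input text. */
    static List<String> readAll(String input, int consumers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<String>(64);
        List<String> parsed = Collections.synchronizedList(new ArrayList<String>());
        Thread[] workers = new Thread[consumers];
        for (int i = 0; i < consumers; i++) {
            workers[i] = new Thread(() -> {
                try {
                    consume(queue, parsed);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        try {
            produce(new BufferedReader(new StringReader(input)), queue, consumers);
            for (Thread t : workers) {
                t.join();
            }
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return parsed;
    }

    public static void main(String[] args) {
        String sample = "<DOC>\na\n</DOC>\n<DOC>\nb\n</DOC>\n";
        System.out.println(readAll(sample, 2).size());
    }
}
```

The order in which documents reach the result list is not deterministic, which 
is usually acceptable for indexing benchmarks but would matter for tests that 
compare documents positionally.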




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-01-31 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989029#comment-12989029
 ] 

Shai Erera commented on LUCENE-1540:


Patch looks good !

Few comments:

In TrecFTParser.parse(), I think you can extract the logic which finds the date 
and title into a common method which receives the strings to look for as 
parameters (e.g. find(String str, String start, int startlen, String end))? 
Then the code can be simplified to:
{code}
Date date = trecSrc.parseDate(find(docBuf, DATE, DATE_LENGTH, DATE_END));
String title = find(docBuf, HEADLINE, HEADLINE_LENGTH, HEADLINE_END);
{code}

I believe this method will be useful for other parsers as well, so might be 
good to pull it up to the abstract TrecDocParser (and +1 for making it abstract 
and moving logic from TCS to it).
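
One possible shape for such a helper, as a sketch only. It follows the 
signature suggested above, but the behaviour for missing markers (returning 
null) and the trimming of the result are assumptions, as are the tag strings 
used in main().

```java
public class TrecFindSketch {

    /**
     * Returns the trimmed text between the first occurrence of start and the
     * next occurrence of end, or null if either marker is missing.
     * startlen is the length of start, passed in so callers can use a
     * precomputed constant.
     */
    static String find(String str, String start, int startlen, String end) {
        int from = str.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += startlen;
        int to = str.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return str.substring(from, to).trim();
    }

    public static void main(String[] args) {
        // Hypothetical document fragment; the tag names are assumptions.
        String doc = "<DATE>940101</DATE>\n<HEADLINE>\nSample headline\n</HEADLINE>";
        System.out.println(find(doc, "<DATE>", 6, "</DATE>"));
        System.out.println(find(doc, "<HEADLINE>", 10, "</HEADLINE>"));
    }
}
```

A caller would still need to handle the null case before passing the result 
to a date parser.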

In TrecContentSource you changed rawDocSize from int to int[], however it's an 
array that's always allocated at size 1 and never resized. I think it can be an 
int?

Also, TCS.cleanTags has two versions, one taking a String and one a 
StringBuilder (took me a minute to notice the difference) -- do you think the 
performance gain (of not allocating a String in the SB variant) is worth the 
code dup? I didn't understand what cleanTags does - does it strip tags off 
of the HTML content?

I would also make all those static methods public (and move them to 
TrecDocParser) in case someone wants to impl his own parser.

Thanks for adding support for GZIP in ContentSource - I had this on my TODO 
list for a long time :). Two things:
# I think the try-catch can be extracted to wrap the 'switch' because it is now 
needed by both BZIP and GZIP.
# Is it possible to add support for ZIP as well? If it's not trivial, then 
let's resolve it in a different issue.
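
To illustrate the first point, here is a sketch of a single try-catch wrapped 
around the whole switch. It is not the ContentSource code: it uses only 
java.util.zip, so BZIP2 (which needs a third-party library) is left out, and 
the class, enum and method names are hypothetical. What the catch block should 
do in the real code is also an assumption; here it just closes the underlying 
stream before rethrowing.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedInputSketch {

    /** Hypothetical type enum; BZIP2 is omitted because the JDK has no decoder for it. */
    enum Type { GZIP, PLAIN }

    static Type typeOf(String fileName) {
        return fileName.toLowerCase(Locale.ROOT).endsWith(".gz") ? Type.GZIP : Type.PLAIN;
    }

    /** One try-catch around the whole switch, shared by every compressed case. */
    static InputStream open(File file) throws IOException {
        InputStream is = new BufferedInputStream(new FileInputStream(file));
        try {
            switch (typeOf(file.getName())) {
                case GZIP:
                    return new GZIPInputStream(is);
                default:
                    return is;
            }
        } catch (IOException e) {
            is.close(); // do not leak the file handle when the header is invalid
            throw e;
        }
    }

    /** Writes text to a temporary .gz file and reads its first line back through open(). */
    static String roundTrip(String text) {
        try {
            File f = File.createTempFile("sketch", ".gz");
            f.deleteOnExit();
            try (Writer w = new OutputStreamWriter(
                    new GZIPOutputStream(new FileOutputStream(f)), StandardCharsets.UTF_8)) {
                w.write(text);
            }
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(open(f), StandardCharsets.UTF_8))) {
                return r.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello"));
    }
}
```

Adding another compressed format would then mean adding one case to the 
switch, with no new exception handling.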

> Improvements to contrib.benchmark for TREC collections
> --
>
> Key: LUCENE-1540
> URL: https://issues.apache.org/jira/browse/LUCENE-1540
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Tim Armstrong
>Assignee: Doron Cohen
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1540.patch, LUCENE-1540.patch
>
>



[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-01-31 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989081#comment-12989081
 ] 

Doron Cohen commented on LUCENE-1540:
-

Thanks for the review Shai!
All comments accepted.
Good catch with the int[] - added that at some point and forgot to clean up.
I think ZIP can wait for another day - let's get this one in.
Note that we are using 
[CompressorStreamFactory.createCompressorInputStream()|http://commons.apache.org/compress/apidocs/org/apache/commons/compress/compressors/CompressorStreamFactory.html#createCompressorInputStream(java.lang.String,%20java.io.InputStream)]
 which at version 1.1 only supports BZIP2 and GZIP. But the docs for Compress 
specify ZIP as well - so I guess this is possible, it just needs a deeper dig, 
in another issue, which, I guess, should also upgrade benchmark/compress 
from 1.0 to 1.1 (solr already uses 1.1).

bq. do you think the performance gain (of not allocating a String in the SB 
variant) is worth the code dup?
I think so, it is a rather short method, will add a javadoc clarification.




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-01 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989508#comment-12989508
 ] 

Shai Erera commented on LUCENE-1540:


We're very close indeed !

* Maybe instead of moving the unzip method to LuceneTestCase, you can put it as 
a static method in _TestUtil? LTC is crowded enough without adding more 
functionality :). Also, _TestUtil already has a rmDir method, I think we should 
use it? I would also do the same for fullTempDir.

* The method pathType(File f) in TrecDocParser -- maybe instead of walking up 
the path elements you can obtain its full absolute path (which is a String) and 
then do indexOf() checks for the 4 types? It will simplify matters IMO.

* stripTags:
** Typo in TDP: unmodofied --> unmodified.
** Maybe we can use String.replaceAll() which takes a regex? This is not 
critical ...
** Does stripTags strip off tags of the HTML content? Or is it used for the 
TREC types other than GOV2?

* In TrecContentSource, can you replace TrecParserByPath.pathType with 
TrecDocParser.pathType?
* Also, do we still need TrecParserByPath? I don't see that it's used.
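
A sketch of the indexOf() variant of pathType suggested in the list above. 
Only the method name comes from the discussion; the enum, its values and the 
GOV2 default are assumptions. Separators and case are normalised first so the 
check does not depend on the platform or on the default locale.

```java
import java.io.File;
import java.util.Locale;

public class PathTypeSketch {

    /** Hypothetical type enum; GOV2 doubles as the default. */
    enum ParserType { FBIS, FR94, FT, LATIMES, GOV2 }

    static ParserType pathType(File f) {
        // Normalise separators and case so "C:\trec\ft\x" and "/trec/FT/x" behave alike.
        String path = f.getAbsolutePath()
                .replace(File.separatorChar, '/')
                .toUpperCase(Locale.ROOT);
        ParserType[] detectable = {
            ParserType.FBIS, ParserType.FR94, ParserType.FT, ParserType.LATIMES
        };
        for (ParserType t : detectable) {
            // Match whole directory names only, not arbitrary substrings.
            if (path.indexOf("/" + t.name() + "/") >= 0) {
                return t;
            }
        }
        return ParserType.GOV2; // assumed default when nothing matches
    }

    public static void main(String[] args) {
        System.out.println(pathType(new File("/data/trec/latimes/la010189")));
        System.out.println(pathType(new File("/data/other/docs.gz")));
    }
}
```

One trade-off of working on the absolute path is that a directory above the 
collection root (for example a working directory named "ft") can also match.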

> Improvements to contrib.benchmark for TREC collections
> --
>
> Key: LUCENE-1540
> URL: https://issues.apache.org/jira/browse/LUCENE-1540
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Tim Armstrong
>Assignee: Doron Cohen
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
> trecdocs.zip
>
>



[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-02 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989565#comment-12989565
 ] 

Doron Cohen commented on LUCENE-1540:
-

Thanks for reviewing Shai!

bq. Maybe instead of moving the unzip method to LuceneTestCase, you can put it 
as a static method in _TestUtil? Also, _TestUtil already has a rmDir method, I 
think we should use it? I would also do the same for fullTempDir.
Good point, will do.

bq. The method pathType(File f) in TrecDocParser – maybe instead of walking up 
the path elements you can obtain its full absolute path (which is a String) and 
then do indexOf() checks for the 4 types? It will simplify matters IMO.
Not sure yet whether I prefer this file-separator-sensitive approach; I'll take 
a look.

bq. Typo in TDP: unmodofied --> unmodified.
Will fix.

bq. Maybe we can use String.replaceAll() which takes a regex? This is not 
critical ...
Right, much simpler this way, will do!

bq. Does stripTags strip off tags of the HTML content? Or is it used for the 
TREC types other than GOV2?
It strips any tags, but it is used by parsers which are not using the HTML 
parser, that is, the Gov2 one does not use it.
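
For reference, a regex-based stripTags along the lines discussed here could be 
as short as the sketch below. Replacing each tag with a single space (rather 
than removing it) is an assumption, chosen so that words on either side of a 
tag do not run together; the class name is hypothetical.

```java
public class StripTagsSketch {

    /**
     * Replaces every "<...>" run with a single space. Text that contains no
     * tags is returned unchanged.
     */
    static String stripTags(String text) {
        return text.replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<P>Some text</P>").trim());
    }
}
```

String.replaceAll() compiles the pattern on every call, so a precompiled 
java.util.regex.Pattern may be preferable if this runs once per document.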

bq. In TrecContentSource, can you replace TrecParserByPath.pathType with 
TrecDocParser.pathType?
Good catch, this is part of older code, will do.

bq. Also, do we still need TrecParserByPath? I don't see that it's used.
Yes we do, this is an important addition of this patch - allowing you to index 
trec docs of several formats. It is used, but dynamically, through the 
algorithm in TrecContentSourceTest.testTrecFeedDirAllTypes(). So removing it 
will not break compilation but will fail the tests.




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-02 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989766#comment-12989766
 ] 

Shai Erera commented on LUCENE-1540:


I see that we both missed the CHANGES entry? :)

Other than that, patch looks good. +1 to commit !

> Improvements to contrib.benchmark for TREC collections
> --
>
> Key: LUCENE-1540
> URL: https://issues.apache.org/jira/browse/LUCENE-1540
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Tim Armstrong
>Assignee: Doron Cohen
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
> LUCENE-1540.patch, trecdocs.zip
>
>



[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-04 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990854#comment-12990854
 ] 

Doron Cohen commented on LUCENE-1540:
-

Committed:
r1066771 - 3x
r1067359 - trunk,

A comment about the merging and massaging of the mergeinfo properties. 
The great [wiki page about 
svn-merge|http://wiki.apache.org/lucene-java/SvnMerge] was very helpful, just 
that I merged from 3x to trunk, while there it is recommended the other way 
around. I think the two are equivalent, but had to go carefully with it. 

Eventually these are the commands I ran:

{noformat}
cd trunk/lucene/src
svn merge -c 1066771 https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/src
cd trunk/modules/benchmark
svn merge -c 1066771 https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/benchmark
cd trunk
svn merge --record-only -c 1066771 https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x
svn revert --depth empty modules/benchmark
svn revert solr
{noformat}

The record-only merge discarded, by itself, the (new) mergeinfo prop from 
trunk/lucene/src, and updated the ones in trunk and trunk/src.
Note the use of *revert --depth empty* for reverting (all) property changes.

I'll keep this issue open for a day in case any problem is revealed with this 
merge process which I am doing for the first time.




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991145#comment-12991145
 ] 

Michael McCandless commented on LUCENE-1540:


I think this commit has caused a failure on at least 3.x?
{noformat}
[junit] Testcase: testTrecFeedDirAllTypes(org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest): Caused an ERROR
[junit] expected: but was:
[junit] at org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.assertDocData(TrecContentSourceTest.java:70)
[junit] at org.apache.lucene.benchmark.byTask.feeds.TrecContentSourceTest.testTrecFeedDirAllTypes(TrecContentSourceTest.java:369)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:1045)
[junit] at org.apache.lucene.util.LuceneTestCase$LuceneTestCaseRunner.runChild(LuceneTestCase.java:977)
[junit] 
[junit] 
[junit] Tests run: 6, Failures: 0, Errors: 1, Time elapsed: 0.488 sec
[junit] 
[junit] - Standard Error -
[junit] WARNING: test method: 'testBadDate' left thread running: Thread[Thread-6,5,main]
[junit] RESOURCE LEAK: test method: 'testBadDate' left 1 thread(s) running
[junit] NOTE: reproduce with: ant test -Dtestcase=TrecContentSourceTest -Dtestmethod=testBadDate -Dtests.seed=-1485993969467368126:6510043524258948665 -Dtests.multiplier=5
[junit] NOTE: reproduce with: ant test -Dtestcase=TrecContentSourceTest -Dtestmethod=testTrecFeedDirAllTypes -Dtests.seed=-1485993969467368126:-9055415333820766139 -Dtests.multiplier=5
[junit] NOTE: test params are: locale=tr_TR, timezone=Europe/Zagreb
[junit] NOTE: all tests run in this JVM:
[junit] [TrecContentSourceTest]
[junit] NOTE: FreeBSD 8.2-RC2 amd64/Sun Microsystems Inc. 1.6.0 (64-bit)/cpus=16,threads=1,free=66439840,total=86376448
[junit] -  ---

> Improvements to contrib.benchmark for TREC collections
> --
>
> Key: LUCENE-1540
> URL: https://issues.apache.org/jira/browse/LUCENE-1540
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/benchmark
>Reporter: Tim Armstrong
>Assignee: Doron Cohen
>Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-1540.patch, LUCENE-1540.patch, LUCENE-1540.patch, 
> LUCENE-1540.patch, trecdocs.zip
>
>
> The benchmarking utilities for  TREC test collections (http://trec.nist.gov) 
> are quite limited and do not support some of the variations in format of 
> older TREC collections.  
> I have been doing some benchmarking work with Lucene and have had to modify 
> the package to support:
> * Older TREC document formats, which the current parser fails on due to 
> missing document headers.
> * Variations in query format - newlines after  tag causing the query 
> parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to 
> write unit tests for the new functionality first.




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991176#comment-12991176
 ] 

Doron Cohen commented on LUCENE-1540:
-

I am able to reproduce this on Linux.
The test fails with *locale tr_TR* because TrecDocParser was upper-casing the 
file names to decide which parser to apply.
The problem is that toUpperCase is locale-sensitive (under tr_TR, 'i' 
upper-cases to the dotted capital I, U+0130, rather than to 'I'), so the 
upper-cased file name no longer matched the enum name.
Fixed by adding a lower-case dirName member to the enums.
Also recreated the test files zip with '-UN u' for UTF-8 handling of file names 
in the zip.

Committed at r1067699 for 3x.

In trunk the test passes with the same args, also on Linux, but fails if you 
pass the locale that was randomly selected in 3x, i.e. like this: 
ant test -Dtestcase=TrecContentSourceTest -Dtestmethod=testTrecFeedDirAllTypes -Dtests.locale=tr_TR

Will merge the fix to trunk shortly.
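
For readers following along, here is a minimal sketch of the failure and of the 
dirName-style fix described above. It is not the committed code: the class 
name, the enum constants and the helper methods are hypothetical placeholders, 
and only the idea of a lower-case dirName member comes from the comment.

```java
import java.util.Locale;

final class DirNameLookupSketch {

    // Hypothetical enum; the constants are illustrative, not the real parser types.
    enum PathType {
        FBIS("fbis"), LATIMES("latimes");

        final String dirName; // lower-case name stored once, never case-converted

        PathType(String dirName) {
            this.dirName = dirName;
        }
    }

    /** Locale-sensitive upper-casing: the source of the mismatch. */
    static String upperIn(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    /** Lookup by exact comparison with dirName; no case conversion involved. */
    static PathType byDirName(String name) {
        for (PathType t : PathType.values()) {
            if (t.dirName.equals(name)) {
                return t;
            }
        }
        return null; // unknown directory name
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // Under tr_TR the 'i' becomes U+0130, so this is not equal to "FBIS".
        System.out.println(upperIn("fbis", turkish).equals("FBIS"));
        // The dirName lookup does not depend on any locale.
        System.out.println(byDirName("fbis"));
    }
}
```

Because the lookup compares against a stored lower-case string, it gives the 
same answer under any default locale, at the cost of being case-sensitive.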




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991179#comment-12991179
 ] 

Robert Muir commented on LUCENE-1540:
-

Hi Doron, about the test random seeds:

It is complicated (though maybe we could fix this!) to make the same random 
seed in trunk behave just like it does in 3.x.

But for the locales: a random locale is picked from the available system 
locales. These change from JRE to JRE, so unfortunately we cannot guarantee 
that the same seed chooses the same locale... It's the same with timezones 
too... and these even change in minor JDK updates! 

I wish we knew of a good solution, because I hate it when things aren't 
completely reproducible everywhere.
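
To illustrate the point, a rough sketch (not the actual LuceneTestCase code; 
the class and method names are made up): a seeded Random always yields the 
same index, but which locale sits at that index depends on the array the JRE 
hands back.

```java
import java.util.Locale;
import java.util.Random;

final class RandomLocaleSketch {

    /** Picks one locale from the given array using a seeded Random. */
    static Locale pick(long seed, Locale[] available) {
        Random random = new Random(seed);
        return available[random.nextInt(available.length)];
    }

    public static void main(String[] args) {
        long seed = 42L; // arbitrary seed, for illustration only

        // Two stand-ins for "the locales this JRE reports"; same size, different order.
        Locale[] jreA = { Locale.ENGLISH, Locale.FRENCH };
        Locale[] jreB = { Locale.FRENCH, Locale.ENGLISH };

        // Same seed, same index, but a different locale comes out.
        System.out.println(pick(seed, jreA));
        System.out.println(pick(seed, jreB));

        // What a test run would effectively see on this particular JRE.
        System.out.println(pick(seed, Locale.getAvailableLocales()));
    }
}
```

This is why passing the locale explicitly (as with -Dtests.locale above) is 
more dependable across JREs than relying on the seed alone.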





[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991181#comment-12991181
 ] 

Doron Cohen commented on LUCENE-1540:
-

Fix for the locale issue merged to trunk at r1076605.
Keeping open for a day or so to make sure there are no more failures.




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991210#comment-12991210
 ] 

Doron Cohen commented on LUCENE-1540:
-

bq. I wish we knew of a good solution, because I hate it when things aren't 
completely reproducible everywhere.

Thanks Robert, I am actually very pleased with this array of testing with 
various parameters like locale and others randomly selected. It is very 
powerful: since the failure printed all the parameters used and even the 
ant line to reproduce (!), it was possible to reproduce in 3x and, once the 
problem was understood, also in trunk. To me this is testing heaven...




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-06 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991224#comment-12991224
 ] 

Doron Cohen commented on LUCENE-1540:
-

Following suggestions by Robert, brought back case insensitivity of path names 
by upper casing with Locale.ENGLISH as suggested in 
[toUpperCase()|http://download.oracle.com/javase/6/docs/api/java/lang/String.html#toUpperCase%28%29].
 
Committed:
- r1067764 - 3x
- r1067772 - trunk
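
A small sketch of what case-insensitive matching with a fixed locale can look 
like. The enum constants, class name and method name are hypothetical, and 
this is not the committed code; only the use of toUpperCase(Locale.ENGLISH) 
comes from the comment above.

```java
import java.util.Locale;

final class EnglishUpperCaseSketch {

    // Hypothetical enum; constants are illustrative placeholders.
    enum PathType { GOV2, FBIS, FT }

    /**
     * Maps a path component to an enum constant, ignoring case.
     * Upper-casing with an explicit locale keeps the result independent
     * of the JVM's default locale.
     */
    static PathType fromName(String name) {
        try {
            return PathType.valueOf(name.toUpperCase(Locale.ENGLISH));
        } catch (IllegalArgumentException e) {
            return null; // no constant with that name
        }
    }

    public static void main(String[] args) {
        Locale previous = Locale.getDefault();
        try {
            // Simulate the failing environment from the test report.
            Locale.setDefault(Locale.forLanguageTag("tr-TR"));
            System.out.println(fromName("fbis")); // still resolves
            System.out.println(fromName("Ft"));
        } finally {
            Locale.setDefault(previous);
        }
    }
}
```

Locale.ROOT would serve the same purpose here; Locale.ENGLISH is used because 
that is what the linked Javadoc suggests.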




[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-17 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12995865#comment-12995865
 ] 

Robert Muir commented on LUCENE-1540:
-

Hi guys, this is just a feature request (we can open a new issue if anyone is 
up for it).

I was wondering if we could do a simple write-up and put it in the website 
notes for 3.1, 3.2,
whenever we can get to it, with some basic instructions on how to use this 
functionality.

I noticed with more research-oriented search engines, there are simple 
instructions for
how to index trec collections, run relevance experiments, and get trec_eval 
results... I feel
like if we had these it would be really beneficial towards getting new folks 
involved with Lucene.

Some examples (some are simpler than others, but at least they all describe how 
to build the index):
* Terrier: http://terrier.org/docs/current/trec_examples.html
* Zettair: http://www.seg.rmit.edu.au/zettair/quick_start_trec.html
* MG4J: http://mg4j.dsi.unimi.it/man/manual/ch01s04.html#id2769812
* Indri: http://ciir.cs.umass.edu/~strohman/indri/






[jira] Commented: (LUCENE-1540) Improvements to contrib.benchmark for TREC collections

2011-02-17 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996015#comment-12996015
 ] 

Doron Cohen commented on LUCENE-1540:
-

I agree, this would be helpful. 
Let's have a new issue on this, I'll take it.
Doron
