[ https://issues.apache.org/jira/browse/LUCENE-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Doron Cohen updated LUCENE-1540:
--------------------------------
    Attachment: LUCENE-1540.patch

Initial patch, against 3.x, not ready to commit. It refactors the parsing of TREC text out of TrecContentSource (TCS) into an interface, TrecDocParser (TDP), currently with a single implementation, TrecGov2Parser.

The interaction between TCS and TDP is less clean than I hoped, for two reasons:
# I tried to keep the synchronization pattern added a while ago to that class, in which reading data from the file is synchronized but parsing can proceed in parallel. For this reason there are two methods in that interface.
# Allowing the TDP implementations to use whatever is in TCS required exposing some of its methods, and also passing TCS as a parameter to TDP.

With this patch:
# TDP was cleaned up to use ContentSource's method getInputStream(), so it now supports .gz, .bz2, and plain text (before the patch only .gz was supported).
# It should be easy to add parsers for other formats.

I removed the retry logic for opening the stream. I don't remember why it was added in the first place, and it seems strange: if opening failed on the first try, why would the next try succeed?

Remaining to do:
- add parsers for the other formats.
- add tests for the other formats and also for bz2 and plain text.
- allow a single run to ingest files of different formats (needed for the disks 4+5 track).
- fix some documentation.
- allow specifying the TDP to use in a property.
- changes.txt.
- port to trunk, so as to commit to trunk first and then backport to 3.x.
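
For readers unfamiliar with the synchronization pattern mentioned in the comment above (reading from the file is synchronized, parsing runs in parallel), the following is a minimal sketch of the idea. It is not code from the patch: the names DocParser, SimpleParser, Source, readRaw, parse and runAll are hypothetical, and the real TrecDocParser interface may look quite different. It only illustrates why a two-method split lets one method run under a lock while the other runs outside it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class TrecParseSketch {

    /** Hypothetical two-method parser interface (not the patch's actual API). */
    interface DocParser {
        /** Called while the caller holds the source lock: reads one raw document. */
        String readRaw(BufferedReader in) throws IOException;

        /** Called outside the lock: extracts the document number from the raw text. */
        String parse(String raw);
    }

    /** Toy parser for <DOC>...</DOC> blocks containing a <DOCNO> element. */
    static class SimpleParser implements DocParser {
        public String readRaw(BufferedReader in) throws IOException {
            StringBuilder sb = new StringBuilder();
            boolean inDoc = false;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith("<DOC>")) { inDoc = true; continue; }
                if (line.startsWith("</DOC>")) { return sb.toString(); }
                if (inDoc) { sb.append(line).append('\n'); }
            }
            return null; // end of input (an unterminated trailing document is dropped)
        }

        public String parse(String raw) {
            int start = raw.indexOf("<DOCNO>");
            int end = raw.indexOf("</DOCNO>");
            return (start >= 0 && end > start)
                    ? raw.substring(start + "<DOCNO>".length(), end).trim()
                    : "";
        }
    }

    /** Shared source: only the read step is serialized across threads. */
    static class Source {
        private final BufferedReader in;
        private final DocParser parser;
        private final Object lock = new Object();

        Source(BufferedReader in, DocParser parser) {
            this.in = in;
            this.parser = parser;
        }

        /** Returns the next parsed document number, or null when input is exhausted. */
        String next() throws IOException {
            String raw;
            synchronized (lock) {          // reading is synchronized
                raw = parser.readRaw(in);
            }
            return raw == null ? null : parser.parse(raw); // parsing is not
        }
    }

    /** Drains the text with several threads and returns the sorted document numbers. */
    static List<String> runAll(String text, int threads) {
        final Source src = new Source(new BufferedReader(new StringReader(text)), new SimpleParser());
        final List<String> out = Collections.synchronizedList(new ArrayList<String>());
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    String doc;
                    while ((doc = src.next()) != null) {
                        out.add(doc);
                    }
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        List<String> sorted = new ArrayList<String>(out);
        Collections.sort(sorted); // thread completion order is not deterministic
        return sorted;
    }

    public static void main(String[] args) {
        String text = "<DOC>\n<DOCNO> A-1 </DOCNO>\nbody\n</DOC>\n"
                    + "<DOC>\n<DOCNO> B-2 </DOCNO>\n</DOC>\n";
        System.out.println(runAll(text, 4));
    }
}
```

In this sketch the parser receives only the reader, whereas the comment above notes that the patch passes TCS itself to TDP; that difference is deliberate, to keep the example small.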

> Improvements to contrib.benchmark for TREC collections
> ------------------------------------------------------
>
>                 Key: LUCENE-1540
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1540
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Tim Armstrong
>            Assignee: Doron Cohen
>            Priority: Minor
>         Attachments: LUCENE-1540.patch
>
>
> The benchmarking utilities for TREC test collections (http://trec.nist.gov)
> are quite limited and do not support some of the variations in format of
> older TREC collections.
> I have been doing some benchmarking work with Lucene and have had to modify
> the package to support:
> * Older TREC document formats, which the current parser fails on due to
> missing document headers.
> * Variations in query format - newlines after <title> tag causing the query
> parser to get confused.
> * Ability to detect and read in uncompressed text collections
> * Storage of document numbers by default without storing full text.
> I can submit a patch if there is interest, although I will probably want to
> write unit tests for the new functionality first.