[ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698069#action_12698069 ]
Shai Erera commented on LUCENE-1591:
------------------------------------

I wonder why EnwikiDocMaker extends LineDocMaker. The latter assumes the input is given in lines, while the former assumes an XML format ... so why the inheritance? This affects EnwikiDocMaker today, because LDM.openFile() instantiates a BufferedReader which EDM never uses.

Is it because of DocState? Perhaps some of the logic in LDM can be pulled up to BasicDocMaker, or to a new abstract DocStateDocMaker? If there is a good reason for the inheritance, then maybe introduce a protected member useReader and set it to false in EDM? Or override openFile() in EDM and not instantiate the reader?

Also, somewhat unrelated to this issue, I found two issues in LDM:
# In makeDocument(), if the line read is null, we first call openFile() and only then check 'forever' (and possibly throw a NoMoreDataException). Should we check 'forever' first, and call openFile() only if it is true?
# resetInputs() reads the docs.file property and throws an exception if it is not set. Shouldn't this code belong in setConfig?

I can include those two in the patch as well.

> Enable bzip compression in benchmark
> ------------------------------------
>
> Key: LUCENE-1591
> URL: https://issues.apache.org/jira/browse/LUCENE-1591
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/benchmark
> Reporter: Shai Erera
> Fix For: 2.9
>
>
> bzip compression can aid the benchmark package by not requiring extracting
> bzip files (such as enwiki) in order to index them. The plan is to add a
> config parameter bzip.compression=true/false and in the relevant tasks either
> decompress the input file or compress the output file using the bzip streams.
> It will add a dependency on ant.jar which contains two classes similar to
> GZIPOutputStream and GZIPInputStream which compress/decompress files using
> the bzip algorithm.
> bzip is known to be superior in its compression performance to the gzip
> algorithm (~20% better compression), although it does the
> compression/decompression a bit slower.
> I will post a patch which adds this parameter and implements it in
> LineDocMaker, EnwikiDocMaker and WriteLineDoc task. Maybe even add the
> capability to DocMaker or some of the super classes, so it can be inherited
> by all sub-classes.
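
For reference, the first LineDocMaker point in the comment above (check 'forever' before reopening the input) could look roughly like the sketch below. This is only an illustration, not the actual LineDocMaker code: LineSourceSketch, nextLine(), openCount and the unchecked NoMoreDataException stand-in are hypothetical names invented for this example, and an in-memory list takes the place of the docs.file input.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the benchmark's NoMoreDataException
// (unchecked here only to keep the sketch short).
class NoMoreDataException extends RuntimeException {
}

// Simplified model of a line-based doc maker; NOT the real LineDocMaker.
class LineSourceSketch {
    private final List<String> lines; // stands in for the contents of docs.file
    private final boolean forever;    // mirrors the 'forever' setting from the comment
    private Iterator<String> reader;  // stands in for the BufferedReader
    int openCount = 0;                // how many times the input was (re)opened

    LineSourceSketch(List<String> lines, boolean forever) {
        this.lines = lines;
        this.forever = forever;
        openFile();
    }

    // Stand-in for openFile(): repositions at the start of the input.
    private void openFile() {
        reader = lines.iterator();
        openCount++;
    }

    // Returns the next line. At end of input, 'forever' is checked BEFORE
    // the input is reopened, so a one-pass run never reopens needlessly.
    String nextLine() {
        if (!reader.hasNext()) {
            if (!forever) {
                throw new NoMoreDataException();
            }
            openFile();
            if (!reader.hasNext()) {
                // Empty input: reopening again would never yield a line.
                throw new NoMoreDataException();
            }
        }
        return reader.next();
    }
}

class ForeverCheckDemo {
    public static void main(String[] args) {
        LineSourceSketch source = new LineSourceSketch(Arrays.asList("doc1", "doc2"), false);
        System.out.println(source.nextLine());
        System.out.println(source.nextLine());
        try {
            source.nextLine();
        } catch (NoMoreDataException e) {
            System.out.println("no more data, opens=" + source.openCount);
        }
    }
}
```

With forever=false the input is opened exactly once and exhaustion is reported without a second open; with forever=true the input is reopened only when a caller actually asks for another line.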