[ 
https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698265#action_12698265
 ] 

Shai Erera commented on LUCENE-1591:
------------------------------------

Here are some numbers:

* Reading the enwiki bz2 file through CBZip2InputStream, wrapped in a 
BufferedReader and read one line at a time, took *28m*. Unzipping with WinRAR 
took about *30m* (this also includes writing the uncompressed data to disk). 
So in that respect, the code does not fall short of other bunzip tools (at 
least not WinRAR).
* Before the change, reading the compressed data, parsing it and writing it to 
a compressed one-line file took 7h (3.1M documents were read). After the change 
(wrapping with a BufferedOutputStream (BOS) and removing flush()) it took 2h, 
a significant improvement.
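
The write-side change in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class and method names are hypothetical, the exact position of the BufferedOutputStream in the stream chain is an assumption, and java.util.zip.GZIPOutputStream stands in for CBZip2OutputStream so the sketch runs on a plain JDK without ant.jar.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: names are illustrative and not taken from the patch.
class BufferedCompressedWriterSketch {

    // Writes one document per line to a compressed sink.
    static void writeLines(OutputStream rawSink, List<String> lines) throws IOException {
        // Assumption: the buffer sits between the compressor and the raw sink,
        // so compressed bytes reach the sink in large chunks.
        OutputStream buffered = new BufferedOutputStream(rawSink, 1 << 16);
        // Stand-in for CBZip2OutputStream (from ant.jar), which is not in the JDK.
        OutputStream compressed = new GZIPOutputStream(buffered);
        try (Writer out = new OutputStreamWriter(compressed, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                out.write(line);
                out.write('\n');
                // Deliberately no out.flush() per line: closing the writer
                // flushes the whole chain once at the end.
            }
        }
    }
}
```

Calling writeLines with a FileOutputStream as the raw sink would produce a compressed one-line file; the point of the sketch is only that buffering plus a single flush on close avoids pushing small writes through the compressor.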

Overall, I think the performance of the BZIP classes is reasonable. Most of the 
time spent in the algorithm is in compressing the data, which is usually a 
process done only once. The result is a 2.5GB enwiki file compressed to a 
2.31GB one-line file (8.5GB uncompressed content).

I compared the time it takes to read 100k lines from the compressed and 
uncompressed one-line files: compressed 2.26m, uncompressed 1.36m 
({color:red}-66%{color}, i.e. reading the compressed file takes about 66% 
longer). The difference is significant, but I'm not sure how much it matters 
in the overall process (i.e., reading the documents and indexing them). On my 
machine it would take 1.1 hours to read the data, but I'm sure it will take 
more to index it, and the indexing time is the same whether we read the data 
from a bzip archive or not.
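
The estimate above can be reproduced with a little arithmetic. The sketch below assumes the timings are decimal minutes and that the one-line file holds one line per document (3.1M lines in total); the class and method names are hypothetical.

```java
// Hypothetical helper that extrapolates the 100k-line timings to the whole file.
class ReadTimeEstimateSketch {

    // Hours needed to read totalLines, given the minutes measured per 100k lines.
    static double totalHours(double minutesPer100kLines, double totalLines) {
        return minutesPer100kLines * (totalLines / 100_000.0) / 60.0;
    }

    // How much longer the compressed read takes, as a percentage of the uncompressed read.
    static double slowdownPercent(double compressedMinutes, double uncompressedMinutes) {
        return (compressedMinutes / uncompressedMinutes - 1.0) * 100.0;
    }
}
```

Under those assumptions, totalHours(2.26, 3.1e6) comes to roughly 1.17 hours for the compressed file and totalHours(1.36, 3.1e6) to roughly 0.70 hours for the uncompressed one, while slowdownPercent(2.26, 1.36) is roughly 66.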

I'll attach the patch shortly, and I think overall this is a good addition. It 
is off by default and configurable, so anyone who doesn't care about disk 
space can always run the indexing algorithm on an uncompressed one-line 
file.

> Enable bzip compression in benchmark
> ------------------------------------
>
>                 Key: LUCENE-1591
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1591
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>         Attachments: ant-1.7.1.jar, LUCENE-1591.patch
>
>
> bzip compression can aid the benchmark package by not requiring extracting 
> bzip files (such as enwiki) in order to index them. The plan is to add a 
> config parameter bzip.compression=true/false and in the relevant tasks either 
> decompress the input file or compress the output file using the bzip streams.
> It will add a dependency on ant.jar which contains two classes similar to 
> GZIPOutputStream and GZIPInputStream which compress/decompress files using 
> the bzip algorithm.
> bzip is known to be superior in its compression performance to the gzip 
> algorithm (~20% better compression), although it does the 
> compression/decompression a bit slower.
> I will post a patch which adds this parameter and implements it in 
> LineDocMaker, EnwikiDocMaker and WriteLineDoc task. Maybe even add the 
> capability to DocMaker or some of the super classes, so it can be inherited 
> by all sub-classes.
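
A rough sketch of how such a switch could select the input stream is shown below. Only the property name bzip.compression comes from the description above; the class and method names are hypothetical, the use of java.util.Properties is an assumption about how the config is held, and java.util.zip.GZIPInputStream stands in for CBZip2InputStream so the example runs without ant.jar.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch: names are illustrative and not taken from the patch.
class CompressionConfigSketch {

    // Reads the bzip.compression switch; a missing property means "off".
    static boolean compressionEnabled(Properties config) {
        return Boolean.parseBoolean(config.getProperty("bzip.compression", "false"));
    }

    // Wraps the raw input in a decompressing stream only when the switch is on.
    static InputStream openInput(InputStream raw, Properties config) throws IOException {
        InputStream in = new BufferedInputStream(raw, 1 << 16);
        // Stand-in for CBZip2InputStream (from ant.jar), which is not in the JDK.
        return compressionEnabled(config) ? new GZIPInputStream(in) : in;
    }
}
```

With the property absent or set to false, callers get the buffered raw stream unchanged, so existing uncompressed inputs keep working; the output side could apply the same check before wrapping its stream in a compressor.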


