[ 
https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697057#action_12697057
 ] 

Shai Erera commented on LUCENE-1539:
------------------------------------

Would it also be interesting to add extensions to EnwikiDocMaker, WriteLineDoc 
and LineDocMaker which can read/write the content in bzip format?
I downloaded the latest Enwiki dump, 4.5 GB in bzip format. The extracted XML 
is 17 GB. I thought to myself that I don't have a real reason to extract it - I 
can read the content directly from the bzip file.

So I looked around and found out that in ant.jar there are two classes which 
can read/write that format. Just to compare, I gzipped the XML file and the 
result was a 5.1 GB file (~13% larger than the bzip file). The general 
measurements on the web also show bzip compresses better than gzip, although 
it probably runs a bit slower.

I then ran the WriteLineDoc task, to produce the one-line-per-document text 
file, and stopped when it reached 228 MB. Again, I zipped, gzipped and bzipped 
the file, and the bzip file was smaller by ~20%.

So I was wondering - besides the speed of reading from a compressed archive, 
which is slower than reading from a plain XML or TXT file, is there a reason 
why we don't use bzip/gzip when reading content? It would save a lot of space, 
and I'm not sure the read speed of that part of indexing is what matters most.
However, I'm aware that some people might find it better to read from plain 
files, so I suggest we just have extensions which can read/write the compressed 
format.
The question is, assuming you agree to it, should we use bzip (which requires 
an external library) or gzip, which is in the JDK and does not compress as 
well as bzip, but might have better performance? I can take some measurements 
if needed, but the main question I have is whether we want to introduce a 
dependency on another library.
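
To make the gzip option concrete, here is a minimal sketch that uses only JDK 
classes. It is not existing benchmark code: the class name CompressedLineFile, 
its methods, and the choice to detect compression from a ".gz" file suffix are 
all assumptions made for illustration. A bzip branch is left out on purpose, 
since it would need the external classes mentioned above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of contrib/benchmark.
public class CompressedLineFile {

    // Assumption: compression is chosen purely by file suffix.
    static boolean isGzip(Path file) {
        return file.getFileName().toString().endsWith(".gz");
    }

    // Opens a UTF-8 writer, wrapping the stream in gzip when the suffix asks for it.
    static BufferedWriter openWriter(Path file) throws IOException {
        OutputStream out = Files.newOutputStream(file);
        if (isGzip(file)) {
            out = new GZIPOutputStream(out);
        }
        return new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }

    // Opens a UTF-8 reader, transparently decompressing gzip files.
    static BufferedReader openReader(Path file) throws IOException {
        InputStream in = Files.newInputStream(file);
        if (isGzip(file)) {
            in = new GZIPInputStream(in);
        }
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Writes one document per line.
    static void writeLines(Path file, List<String> docs) throws IOException {
        try (BufferedWriter writer = openWriter(file)) {
            for (String doc : docs) {
                writer.write(doc);
                writer.newLine();
            }
        }
    }

    // Reads the file back, one document per line.
    static List<String> readLines(Path file) throws IOException {
        List<String> docs = new ArrayList<>();
        try (BufferedReader reader = openReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                docs.add(line);
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("linedocs", ".txt.gz");
        writeLines(file, Arrays.asList("title one\tbody one", "title two\tbody two"));
        for (String doc : readLines(file)) {
            System.out.println(doc);
        }
        Files.delete(file);
    }
}
```

With this shape, callers that pass a plain ".txt" path keep reading and writing 
uncompressed files, so the compressed format stays optional.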

If this belongs in a separate issue, let me know.

> Improve Benchmark
> -----------------
>
>                 Key: LUCENE-1539
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1539
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch, 
> sortBench2.py, sortCollate2.py
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
