[ https://issues.apache.org/jira/browse/LUCENE-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697057#action_12697057 ]
Shai Erera commented on LUCENE-1539:
------------------------------------

Would it also be interesting to add extensions to EnwikiDocMaker, WriteLineDoc and LineDocMaker that can read/write the content in bzip format?

I downloaded the latest Enwiki dump, which is 4.5 GB in bzip format. The extracted XML is 17 GB. I thought to myself that I have no real reason to extract it, since I can read the content directly from the bzip file. So I looked around and found that ant.jar has two classes which can read/write that format.

Just to compare, I gzipped the XML file and the result was a 5.1 GB file (~13% larger than the bzip one). General measurements on the web also show that bzip compresses better than gzip, although it probably runs a bit slower.

I then ran the WriteLineDoc task to produce the one-line-per-document text file, and stopped it when the file reached 228 MB. Again, I zipped, gzipped and bzipped the file, and the bzip output was smaller by ~20%.

So I was wondering: besides the speed of reading from a compressed archive, which is slower than reading from a plain XML or TXT file, is there a reason why we don't use bzip/gzip when reading content? It would save a lot of space, and I'm not sure that reading the content is the most important part of indexing performance.

However, I'm aware that some people might prefer to read from plain files, so I suggest we just add extensions which can read/write the compressed format. The question is, assuming you agree, should we use bzip (which requires an external library) or gzip, which is in the JDK and does not compress as well as bzip, but might perform better? I can take some measurements if needed, but my main question is whether we want to introduce a dependency on another library.

If this belongs in a separate issue, let me know.
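To make the gzip option concrete, here is a minimal sketch of writing and reading a one-line-per-document file through the JDK's gzip streams. The class name GzipLineDocs and its methods are hypothetical and only for illustration; they are not part of contrib/benchmark or of any patch attached to this issue. The sample lines assume a tab-separated title/date/body layout, which is also an assumption.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not an existing benchmark class.
class GzipLineDocs {

    /** Writes one document per line into a gzip-compressed file. */
    static void write(Path file, List<String> docs) throws IOException {
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(file)),
                StandardCharsets.UTF_8))) {
            for (String doc : docs) {
                out.write(doc);
                out.write('\n');
            }
        } // closing the writer also finishes the gzip stream
    }

    /** Reads the documents back, decompressing on the fly. */
    static List<String> readAll(Path file) throws IOException {
        List<String> docs = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(file)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                docs.add(line);
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("linedocs", ".txt.gz");
        // Sample lines in an assumed title<TAB>date<TAB>body layout.
        write(file, Arrays.asList("title1\tdate1\tbody1", "title2\tdate2\tbody2"));
        for (String doc : readAll(file)) {
            System.out.println(doc);
        }
        Files.delete(file);
    }
}
```

Because only the stream wrapper changes, a bzip variant could presumably swap in the ant.jar stream classes at the same two places, at the cost of the extra dependency discussed above.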
> Improve Benchmark
> -----------------
>
> Key: LUCENE-1539
> URL: https://issues.apache.org/jira/browse/LUCENE-1539
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/benchmark
> Affects Versions: 2.4
> Reporter: Jason Rutherglen
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1539.patch, LUCENE-1539.patch, LUCENE-1539.patch,
>              sortBench2.py, sortCollate2.py
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Benchmark can be improved by incorporating recent suggestions posted
> on java-dev. M. McCandless' Python scripts that execute multiple
> rounds of tests can either be incorporated into the codebase or
> converted to Java.