[ http://issues.apache.org/jira/browse/HADOOP-54?page=all ]
Doug Cutting updated HADOOP-54:
-------------------------------

    Status: Open  (was: Patch Available)

I think this is nearly ready.

A minor improvement: the typesafe enumeration instances should probably have a toString() method, to facilitate debugging.

Running the TestSequenceFile unit test caused my 515MB Ubuntu box to swap horribly and it didn't complete.  I grabbed a stack trace and saw:

    [junit]   at java.util.zip.Inflater.init(Native Method)
    [junit]   at java.util.zip.Inflater.<init>(Inflater.java:75)
    [junit]   at java.util.zip.Inflater.<init>(Inflater.java:82)
    [junit]   at org.apache.hadoop.io.SequenceFile$CompressedBytes.<init>(SequenceFile.java:231)
    [junit]   at org.apache.hadoop.io.SequenceFile$CompressedBytes.<init>(SequenceFile.java:227)
    [junit]   at org.apache.hadoop.io.SequenceFile$Reader.createValueBytes(SequenceFile.java:1195)
    [junit]   at org.apache.hadoop.io.SequenceFile$Sorter$SortPass.run(SequenceFile.java:1459)
    [junit]   at org.apache.hadoop.io.SequenceFile$Sorter.sortPass(SequenceFile.java:1413)
    [junit]   at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:1386)
    [junit]   at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:1406)
    [junit]   at org.apache.hadoop.io.TestSequenceFile.sortTest(TestSequenceFile.java:178)

Since sorting should not do any inflating, the Inflater should probably not be created in this case.  So maybe we should lazily initialize this field?

More generally, before we commit this we should ensure that performance is comparable to what it was before.  Creating a new ValueBytes wrapper per entry processed when sorting looks expensive to me, but this may in fact be insignificant.  If it is significant, then we might replace the ValueBytes API with a compressor API, where the bytes to be compressed are passed explicitly.

> SequenceFile should compress blocks, not individual entries
> -----------------------------------------------------------
>
>                 Key: HADOOP-54
>                 URL: http://issues.apache.org/jira/browse/HADOOP-54
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 0.2.0
>            Reporter: Doug Cutting
>         Assigned To: Arun C Murthy
>             Fix For: 0.6.0
>
>         Attachments: SequenceFile.updated.final.patch, SequenceFiles.final.patch, SequenceFiles.patch, SequenceFilesII.patch, VIntCompressionResults.txt
>
>
> SequenceFile will optionally compress individual values.  But both compression and performance would be much better if sequences of keys and values are compressed together.  Sync marks should only be placed between blocks.  This will require some changes to MapFile too, so that all file positions stored there are the positions of blocks, not entries within blocks.  Probably this can be accomplished by adding a getBlockStartPosition() method to SequenceFile.Writer.
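
As a rough illustration of the toString() suggestion in the comment above, a pre-Java-5 typesafe enumeration with a debugging-friendly toString() could look like the sketch below. The class and constant names are placeholders, not identifiers taken from the patch (in the real code this would presumably be a nested class of SequenceFile):

    public class CompressionType {
      private final String name;                       // kept only so debug output is readable

      private CompressionType(String name) { this.name = name; }

      public static final CompressionType NONE   = new CompressionType("NONE");
      public static final CompressionType RECORD = new CompressionType("RECORD");
      public static final CompressionType BLOCK  = new CompressionType("BLOCK");

      public String toString() { return name; }        // the debugging aid being requested
    }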
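
The "lazily initialize this field" idea could look roughly like the following sketch: the sort pass, which only shuffles raw compressed bytes, then never allocates native zlib state at all. Field and method names here are assumptions for illustration, not lines from the patch:

    import java.util.zip.Inflater;

    class CompressedBytesSketch {
      private Inflater inflater;               // null until a value actually needs inflating

      private Inflater getInflater() {
        if (inflater == null) {
          inflater = new Inflater();           // allocate native zlib state only on demand
        } else {
          inflater.reset();                    // reuse the existing state for the next value
        }
        return inflater;
      }
    }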
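
If the per-entry ValueBytes wrappers do turn out to be significant, the "compressor API" alternative mentioned above might take a shape like this sketch, where one long-lived codec is handed the bytes of each entry explicitly instead of a new wrapper object being created per entry. None of these names exist in the patch:

    import java.io.IOException;

    // Hypothetical interface: one instance per Reader/Writer, no per-entry wrapper objects.
    interface ValueCodec {
      /** Compress len bytes of src starting at off and return the compressed form. */
      byte[] compress(byte[] src, int off, int len) throws IOException;

      /** Inflate len bytes of src starting at off and return the original form. */
      byte[] decompress(byte[] src, int off, int len) throws IOException;
    }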
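
On the quoted description's point about MapFile storing block positions, a sketch of how the proposed getBlockStartPosition() accessor might be used is below. The accessor is only proposed, not committed, so this is an assumption about the eventual API rather than code that compiles against the current SequenceFile.Writer:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.WritableComparable;

    class BlockIndexSketch {
      // Index entries would point at the start of the block holding the key,
      // not at the (now block-compressed) entry itself.
      static void indexEntry(SequenceFile.Writer data, SequenceFile.Writer index,
                             WritableComparable key) throws IOException {
        long blockStart = data.getBlockStartPosition();   // proposed accessor, assumed here
        index.append(key, new LongWritable(blockStart));  // record the block start in the index
      }
    }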