[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894773#action_12894773 ]
Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

Responding to the major comments; I will soon upload a patch that addresses these along with the smaller comments.

FakeInputStream LinkedList: this list can grow very long depending on how write is called. In the worst case, write(single_byte) is called over and over, each call adding a new link, and the list can exceed 12 million elements, which is far beyond acceptable. Looking back, a linked list probably wasn't the right choice. There are two obvious ways write() could have worked: the linked list I used, or a byte array large enough to hold forceWriteLen bytes that is simply copied into; that array, however, can be as large as 12MB. The array approach in turn has two variants: allocate the full 12MB up front, or start at perhaps 64KB and grow by powers of two until reaching 12MB, at the cost of arraycopying a little under 12MB more in total than the up-front allocation. I will implement one of these for the patch.

FakeOutputStream LinkedList: this list has a more reasonable use. Its purpose is to hold extra bytes in case the input stream supplies too many. I am fairly confident that at most 272 bytes (the maximum number of fast bytes minus one) can ever be written to it, but after going through the code I could not formally prove that bound. I wanted to be safe: with the linked list, even if the SDK code misbehaves, everything still works on the OutputStream end.

Code(..., len): as I recall, Code(...) returns at least (but possibly more than) len bytes, with the one exception that when the end of the stream is reached it reads only up to the end of the stream. I will modify the decompressor to no longer assume this and to use the actual number of bytes read instead.
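The grow-by-powers-of-two variant described above can be sketched as follows. This is a minimal illustration, not the patch itself; the class name, the 64KB starting size, and the 12MB cap (standing in for forceWriteLen) are assumptions for the example.

```java
import java.util.Arrays;

// Sketch: a write buffer that starts small and doubles until it reaches
// the maximum the decoder can force (assumed 12MB here), avoiding both a
// long linked list of chunks and an up-front 12MB allocation.
public class GrowableBuffer {
    private static final int INITIAL = 64 * 1024;     // assumed 64KB start
    private static final int MAX = 12 * 1024 * 1024;  // assumed forceWriteLen bound

    private byte[] buf = new byte[INITIAL];
    private int count = 0;

    public void write(byte[] b, int off, int len) {
        ensureCapacity(count + len);
        System.arraycopy(b, off, buf, count, len);
        count += len;
    }

    private void ensureCapacity(int needed) {
        if (needed > MAX) {
            throw new IllegalStateException("write exceeds " + MAX + " byte cap");
        }
        int cap = buf.length;
        while (cap < needed) {
            cap = Math.min(cap * 2, MAX);  // double, never past the cap
        }
        if (cap != buf.length) {
            buf = Arrays.copyOf(buf, cap);  // the extra arraycopy cost noted above
        }
    }

    public int size() { return count; }
    public int capacity() { return buf.length; }
}
```

Since each doubling copies everything written so far, the total extra copying is bounded by the final capacity, which matches the "a little under 12MB in total" estimate in the comment.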
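If the 272-byte bound on the FakeOutputStream side could be trusted, the linked list could be replaced by a small fixed store; since the bound was not formally proven, a sketch like this would fail loudly rather than silently grow. The class and method names are hypothetical.

```java
// Sketch: a fixed-capacity FIFO for the extra bytes the decoder may emit.
// 272 = (maximum number of fast bytes - 1), the bound argued in the comment;
// exceeding it throws instead of growing unboundedly.
public class OverflowStore {
    private static final int MAX_EXTRA = 272;  // assumed bound from the comment

    private final byte[] extra = new byte[MAX_EXTRA];
    private int head = 0;
    private int size = 0;

    public void push(byte b) {
        if (size == MAX_EXTRA) {
            throw new IllegalStateException(
                "decoder produced more than " + MAX_EXTRA + " extra bytes");
        }
        extra[(head + size) % MAX_EXTRA] = b;  // append at the tail
        size++;
    }

    public byte pop() {
        if (size == 0) {
            throw new IllegalStateException("no buffered bytes");
        }
        byte b = extra[head];
        head = (head + 1) % MAX_EXTRA;  // advance past the consumed byte
        size--;
        return b;
    }

    public int size() { return size; }
}
```

The linked list in the actual patch trades this hard failure for safety when the SDK misbehaves; the fixed store trades the list's per-node overhead for an explicit assumption.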
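The "use the actual number of bytes read" fix rests on the standard InputStream.read contract: a call may return fewer bytes than requested even before EOF. A minimal read-until-EOF loop (helper name assumed) looks like:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch: read repeatedly until the requested length is filled or EOF is
// reached, returning the number of bytes actually read. InputStream.read
// may legally return a short count on any call.
public final class ReadLoop {
    public static int readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, off + total, len - total);
            if (n < 0) {
                break;  // EOF: return the short count instead of assuming len
            }
            total += n;
        }
        return total;
    }
}
```

With a loop like this, the caller can distinguish "stream ended" (return value less than len) from "read happened to be short," which is the assumption the original decompressor code got wrong.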
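The patch also converts tail-recursive methods into while loops, since the JVM performs no tail-call optimization and deep self-recursion risks a StackOverflowError. A toy illustration of the transformation (not code from the patch):

```java
// Sketch: a self-tail-recursive method and its equivalent loop form.
// Each recursive call that is the method's last action becomes one
// iteration that reassigns the parameters.
public class TailToLoop {
    // Before: tail-recursive; each call adds a stack frame.
    static int advanceRecursive(int pos, int limit) {
        if (pos >= limit) {
            return pos;
        }
        return advanceRecursive(pos + 1, limit);  // tail position
    }

    // After: the same computation in constant stack space.
    static int advanceLoop(int pos, int limit) {
        while (pos < limit) {
            pos++;  // the recursive call's argument update, in place
        }
        return pos;
    }
}
```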
Fixed the inStream.read() bug (the fix will be in the patch I upload): added a while loop that reads until EOF is reached, so the assumptions hold. Converted the tail-recursive methods to while loops. (Java should add tail-call optimization for methods that only call themselves recursively, which would require no bytecode changes.) Fixed the memory leaks.

> Support for LZMA compression
> ----------------------------
>
>          Key: HADOOP-6837
>          URL: https://issues.apache.org/jira/browse/HADOOP-6837
>      Project: Hadoop Common
>   Issue Type: Improvement
>   Components: io
>     Reporter: Nicholas Carlini
>     Assignee: Nicholas Carlini
>  Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch,
>               HADOOP-6837-lzma-1-20100722.patch,
>               HADOOP-6837-lzma-c-20100719.patch,
>               HADOOP-6837-lzma-java-20100623.patch
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which
> generally achieves higher compression ratios than both gzip and bzip2.