[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894773#action_12894773 ]

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

Responding to the major comments -- will upload a patch that fixes these and 
the smaller comments soon.

FakeInputStream LinkedList:
This LinkedList can get fairly long, depending on how write() is called. In the 
worst case it can have upwards of 12 million elements, which is far beyond 
acceptable; that happens when write(single_byte) is called over and over, since 
each call adds a new link. Looking back at this, a linked list probably wasn't 
the best way to go.

There are two obvious ways that write() could have worked. One is using linked 
lists, as I did. The other is to create a byte array that can hold 
forceWriteLen bytes and just copy into it; however, that array can be as large 
as 12MB. There are then two ways to make the array approach work. The first is 
allocating the full 12MB right up front. The other is to start with maybe just 
64k and grow the array by powers of two until it reaches 12MB; this would end 
up arraycopying a little under 12MB more in total than the up-front 
allocation. I will implement one of these for the patch; a sketch of the 
growing variant follows.


FakeOutputStream LinkedList:
This linked list has a more reasonable use. Its purpose is to hold extra bytes 
just in case the input stream gives too many. I am fairly confident that at 
most 272 bytes (the maximum number of fast bytes minus 1) can be written to 
it. The reason I used a linked list, however, is that I couldn't formally 
prove this bound after going through the code. I wanted to be safe so that, 
even if the SDK code doesn't behave as it should, everything will still work 
on the OutputStream end. A minimal sketch of the idea is below.
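For illustration only (the class and method names here are hypothetical, not 
from the patch), the overflow bookkeeping amounts to:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Surplus bytes from the coder are parked here and handed out first
    // on the next read; the expected bound is <= 272 bytes.
    public class OverflowBuffer {
      private final Deque<Byte> extra = new ArrayDeque<Byte>();

      /** Park bytes that were produced but not yet requested. */
      public void push(byte[] b, int off, int len) {
        for (int i = 0; i < len; i++) {
          extra.addLast(b[off + i]);
        }
      }

      /** Drain parked bytes into the caller's buffer; returns bytes copied. */
      public int drain(byte[] b, int off, int len) {
        int n = 0;
        while (n < len && !extra.isEmpty()) {
          b[off + n++] = extra.pollFirst();
        }
        return n;
      }
    }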


Code(..., len):
I think I remember figuring out that Code(...) will return at least (but 
possibly more than) len bytes, with the one exception that when the end of the 
stream is reached it will only read up to the end of the stream. I will modify 
the decompressor to no longer assume this and to use the actual number of 
bytes read instead.
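As a hedged illustration of what "use the actual number" means (ChunkDecoder 
is a hypothetical stand-in here, not the SDK interface):

    // Hypothetical stand-in for the path that wraps Code(...).
    interface ChunkDecoder {
      // Returns the number of bytes actually produced into dst: at least
      // len, except possibly fewer once the end of the stream is reached.
      int code(byte[] dst, int len);
    }

    static int decodeChunk(ChunkDecoder coder, byte[] dst, int len) {
      int produced = coder.code(dst, len);
      // Use `produced` for all downstream bookkeeping, never `len`.
      return produced;
    }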


Fixed the inStream.read() bug (the fix will be in the patch I upload). Added a 
while loop that keeps reading until either the requested bytes arrive or EOF 
is reached, so the assumptions above actually hold.
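Roughly this shape (a minimal sketch against plain java.io.InputStream, not 
the exact patch code):

    static int readFully(java.io.InputStream in, byte[] buf, int off, int len)
        throws java.io.IOException {
      int total = 0;
      while (total < len) {
        int n = in.read(buf, off + total, len - total);
        if (n == -1) {
          break;            // EOF: return what actually arrived
        }
        total += n;         // a single read() may return fewer bytes
      }
      return total;         // equals len except at end of stream
    }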


Tail-recursive methods -> while loops (see the example below). Java should add 
tail-call optimization for methods that only call themselves recursively, 
which would require no changes to the bytecode.
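For concreteness (an illustrative example, not a method from the patch), the 
mechanical rewrite is:

    // Before: a self tail call...
    static int countTrailingZeros(int value, int count) {
      if (value == 0 || (value & 1) != 0) {
        return count;
      }
      return countTrailingZeros(value >>> 1, count + 1);
    }

    // ...after: the same computation as a loop -- the tail call becomes
    // parameter reassignment, with no stack growth.
    static int countTrailingZeros(int value) {
      int count = 0;
      while (value != 0 && (value & 1) == 0) {
        value >>>= 1;
        count++;
      }
      return count;
    }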


Fixed memory leaks.

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, 
> HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, 
> HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
