[jira] Updated: (HADOOP-6837) Support for LZMA compression

2010-08-11 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6837:
-

  Status: Patch Available  (was: Open)
Hadoop Flags: [Reviewed]

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, 
> HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, 
> HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, 
> HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6837) Support for LZMA compression

2010-08-11 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6837:
-

Attachment: HADOOP-6837-lzma-4-20100811.patch

Fixed two FindBugs warnings. Made one variable static and removed another.

Also I fixed build.xml so 'ant package' no longer crashes (I think -- it works 
correctly when running 'ant package' on src/contrib, which it didn't with the 
previous patch). Had to override all of the build-contrib targets with empty 
ones; otherwise it would go looking for directories which didn't exist (because 
I don't need them to exist).

I also changed build-contrib.xml (which is included by the other build.xml 
files) to have a compile-before target in it so there won't be any errors (even 
if I did set failonerror to false). Safer that way.

FindBugs fails because of four warnings in the original SevenZip code. I could 
change those classes to static to make FindBugs happy, but that would make it 
more work to apply a patch to the Java code, so unless there's a particularly 
good reason to do so I won't:
SIC: Should SevenZip.Compression.LZMA.Decoder$LenDecoder be a _static_ inner 
class?
SIC: Should SevenZip.Compression.LZMA.Decoder$LiteralDecoder be a _static_ 
inner class?
SIC: Should SevenZip.Compression.LZMA.Encoder$LiteralEncoder be a _static_ 
inner class?
SIC: Should SevenZip.Compression.LZMA.Encoder$Optimal be a _static_ inner 
class?

JavaDoc also finds an error:
[javadoc] javadoc: error - Illegal package name: ""
However, I cannot reproduce this on my local machine. Let's see if it happens 
with Hudson.

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, 
> HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, 
> HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, 
> HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-11 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897337#action_12897337
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

Correct -- the current patch does not support LZMA2, and it probably won't be 
included in this patch. There will probably be another JIRA filed to swap out 
the current C code for the LibLzma library, which would make the C side of 
things cleaner. LibLzma also has LZMA2 support, so that would come for free, 
and adding it should be trivial since it has a very zlib-like interface. After 
that there will probably be another JIRA to port LibLzma to Java and make 
everything nice and clean.

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, 
> HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, 
> HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-c-20100719.patch, 
> HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm

2010-08-11 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6349:
-

Attachment: hadoop-6349-4.patch

Fixed the buffering issues. There is still work to do here, though. When 
compress() is called with an array that is big enough, the compressor should 
compress directly into that array instead of compressing to a temporary buffer 
and then copying into the given one. Avoiding the temporary buffer entirely 
would require changing the compressor to keep its state and resume compression 
from the middle, which seems like more work than it's worth for very little 
cost. This would, however, mean that the buffer size used by the 
CompressorStream would need to be around 64k so that bytes could be written 
directly to it the majority of the time. (That applies to decompression, too.)

Fixed a bug where it was possible for the end of stream mark to show up in the 
wrong place. 

Fixed another bug in the compressor where calling write(int) on the output 
stream n times would add 47*n bytes to the output stream. Each time write() is 
called, so is compress(), which means 26 bytes for a header block and 16 bytes 
of a header, plus the 1 byte of uncompressed data. Fixed this by adding 
another case: if the buffer holds fewer than 2^16 bytes (the default block 
size), needsInput() returns true, so the stream won't compress yet.
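
A minimal sketch of that needsInput() gating, with hypothetical field names 
(BLOCK_SIZE, bufferedLen), not the patch's actual code:

import java.util.Arrays;

public class BlockBufferingCompressor {
  private static final int BLOCK_SIZE = 1 << 16; // default block size
  private byte[] buffer = new byte[BLOCK_SIZE];
  private int bufferedLen = 0;
  private boolean finished = false;

  public void setInput(byte[] b, int off, int len) {
    // Accumulate input, growing the scratch buffer if a caller overfills it.
    if (bufferedLen + len > buffer.length) {
      buffer = Arrays.copyOf(buffer,
          Math.max(buffer.length * 2, bufferedLen + len));
    }
    System.arraycopy(b, off, buffer, bufferedLen, len);
    bufferedLen += len;
  }

  public boolean needsInput() {
    // Until a full block is buffered (or the stream is finishing), ask for
    // more input instead of compressing, so n single-byte writes don't emit
    // n sets of headers.
    return !finished && bufferedLen < BLOCK_SIZE;
  }

  public void finish() { finished = true; }
}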

Changed TestCodecPerformance to call write() several times instead of writing 
everything at once. It now calls write() with a random length until it runs 
out of input.

Got rid of the uses of BigInteger ...

Removed the moved code ... no idea why it was ever there.

Made abbreviations in comments real words.

FASTLZ_STRICT_ALIGN mode doesn't even work when set to false. Deleted.

Setting FASTLZ_SAFE to false gives no performance increase; all the safe mode 
adds is a few if statements (which don't contain any loops or anything 
expensive). Deleted.

At some point, the header blocks (with block ID 1) should be removed. They 
have no purpose and just remain from the port of 6pack.c. Maybe even make the 
header smaller by removing the ID, now that all blocks will have an ID of 17, 
and the 'options' field is only one bit indicating compressed or just 
uncompressible data.

> Implement FastLZCodec for fastlz/lzo algorithm
> --
>
> Key: HADOOP-6349
> URL: https://issues.apache.org/jira/browse/HADOOP-6349
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: io
>Reporter: William Kinney
> Attachments: hadoop-6349-1.patch, hadoop-6349-2.patch, 
> hadoop-6349-3.patch, hadoop-6349-4.patch, HADOOP-6349-TestFastLZCodec.patch, 
> HADOOP-6349.patch, TestCodecPerformance.java, TestCodecPerformance.java, 
> testCodecPerfResults.tsv
>
>
> Per  [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ 
> is a good (speed, license) alternative to LZO. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm

2010-08-09 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896753#action_12896753
 ] 

Nicholas Carlini commented on HADOOP-6349:
--

I believe I've fixed the buffering problems with very few changes to the 
compressor and decompressor. I'll need to verify that these changes didn't 
break anything else, but it seems like a simple issue to fix.

> Implement FastLZCodec for fastlz/lzo algorithm
> --
>
> Key: HADOOP-6349
> URL: https://issues.apache.org/jira/browse/HADOOP-6349
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: io
>Reporter: William Kinney
> Attachments: hadoop-6349-1.patch, hadoop-6349-2.patch, 
> hadoop-6349-3.patch, HADOOP-6349-TestFastLZCodec.patch, HADOOP-6349.patch, 
> TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv
>
>
> Per  [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ 
> is a good (speed, license) alternative to LZO. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-09 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896751#action_12896751
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

I did. The average number of overflow bytes is 24, and I never saw it go above 
120. A quick sed/dc script tells me the standard deviation is 18, so I'm 
fairly sure I'm correct that it will never go above 273. Setting the number of 
fast bytes to 273 gives an average of 37 and a standard deviation of 26.

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, 
> HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, 
> HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-c-20100719.patch, 
> HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6837) Support for LZMA compression

2010-08-09 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6837:
-

Attachment: HADOOP-6837-lzma-3-20100809.patch

I'm going to respond to each comment this time to make sure I don't miss 
anything again...

FakeOutputStream isn't the one I'm talking about in package.html; that's the 
OutputStream/FakeInputStream pair. FakeOutputStream is just the one where I 
couldn't justify the maximum acting correctly (writing a max of 273 bytes 
extra), so I added the linked list in case anything goes wrong.

Changed directory structure as suggested -- src/contrib/lzma912/src/SevenZip/* 
has the java files and src/contrib/lzma912/src/native has the native code.

The default dictionary size doesn't matter, because it gets set to something 
else on initialization. I left it there to keep the set of changes minimal.

I needed library.properties there for it to build correctly: I have no idea 
what it does, but if I delete it then it doesn't build. And when it's empty it 
doesn't complain, so that's why it's there.

Sorry about the ec2 file; I didn't realize that went into the patch! It's not 
supposed to be there.

Deleted CRC.java ... apparently it was never used.

Removed the read() while loop and wrote as suggested.

Fixed names (buffered/sendData) again; I had fixed them but then reverted to 
an older version when I introduced a bug.

Removed 1<

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, 
> HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, 
> HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-c-20100719.patch, 
> HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6837) Support for LZMA compression

2010-08-06 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6837:
-

Attachment: HADOOP-6837-lzma-2-20100806.patch

I found a bug where the Java compression would set a very, very wrong 
dictionary size. Instead of setting the dictionary size to, say, 2^24, it 
would set it to 24 (which would then be forced up to the minimum size -- but 
still, very, very wrong).

I also added a fairly long package.html (~3000 words) documenting how what I 
did works, so anyone else who wants to modify it hopefully won't need to spend 
forever exploring the code to figure out how it works.

Also, I was both right and wrong about giveMore(). I was right when I first 
wrote it, and wrong when I said I fixed it by using the return value. The 
return values were actually left over from when I was checking for the end of 
stream on the Java end, but I realized that it was possible (because of the 
semi-circular buffer) for Java to indicate an EOF that wasn't really true. So 
I had moved that check to the C code and just never removed the old code from 
the Java end.

Fixed the linked list stuff.

Also a fairly significant directory restructure. The modified SDK code now 
lives in src/contrib/SevenZip, with the Java code under src/java and the C 
code under src/native. I removed all of my re-formatting of their code, so 
should a future version of the SDK be released, it shouldn't be as hard to 
diff against it and apply the patch to the new code. The makefile there builds 
into the same build tree as the rest of the Java build.

In order to get it building correctly, I had to modify the base build.xml and 
contrib/build.xml. compile-core-classes now also depends on 
compile-contrib-before, which calls compile-before on contrib/build.xml; from 
there, contrib/build.xml calls compile-before on each contrib/*/build.xml, 
with failonerror set to false so this change will not break any other build 
scripts. (This change is required because Lzma{Input,Output}Stream requires 
those classes to be built first.)

I also cleaned up the code and fixed all the review comments.

There will be at least one more version of this patch for things I didn't catch.

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, 
> HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, 
> HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-08-02 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894773#action_12894773
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

Responding to the major comments -- will upload a patch that fixes these and 
the smaller comments soon.

FakeInputStream LinkedList:
This LinkedList can get fairly long, depending on how write() is called. In 
the worst case it can have upwards of 12 million elements, which is far beyond 
acceptable. That happens when write(single_byte) is called over and over: each 
call adds a new link. Looking back at this, a linked list probably wasn't the 
best way to go.

There are two (obvious) ways that write() could have worked. One is using 
linked lists, as I did. The other is to create a byte array that can hold 
forceWriteLen bytes and just copy into it; however, this can be as large as 
12MB. There are then two ways to make that work. The first is to allocate the 
full 12MB right up front. The other is to start with maybe just 64k and grow 
(by powers of two) until it reaches 12MB; this ends up arraycopying a little 
under 12MB more in total than the up-front allocation. I will implement one of 
these for the patch.
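
A minimal sketch of the grow-by-powers-of-two option (class and field names 
here are illustrative, not from the patch):

import java.util.Arrays;

public class GrowableByteBuffer {
  private byte[] buf = new byte[64 * 1024]; // start at 64k
  private int len = 0;

  public void write(byte[] b, int off, int n) {
    if (len + n > buf.length) {
      int newCap = buf.length;
      while (newCap < len + n) {
        newCap *= 2; // double until the data fits (up to the ~12MB worst case)
      }
      buf = Arrays.copyOf(buf, newCap); // each growth copies the live bytes
    }
    System.arraycopy(b, off, buf, len, n);
    len += n;
  }
}

Since each doubling copies at most the bytes already buffered, the total extra 
copying is bounded by roughly the final capacity, which is where the "little 
under 12MB" figure comes from.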


FakeOutputStream LinkedList:
This linked list has a more reasonable use: it holds extra bytes just in case 
the input stream gives too many. I am fairly confident that at most 272 bytes 
(the maximum number of fast bytes, minus 1) can be written to it. The reason I 
used a linked list, however, is that I couldn't formally prove this after 
going through the code. I wanted to be safe: just in case their code doesn't 
behave as it should, everything will still work on the OutputStream end.


Code(..., len):
I think I remember figuring out that Code(...) will read at least (but 
possibly more than) len bytes, with the one exception that when the end of the 
stream is reached it will only read up to the end of the stream. I will modify 
the decompressor to no longer assume this and to use the actual number of 
bytes read instead.


Fixed the inStream.read() bug (the fix will be in the patch I upload). Added a 
while loop to read until EOF is reached so the assumptions hold.
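
The usual shape of such a loop, as a sketch rather than the literal patch 
code: InputStream.read() may return fewer bytes than requested, so keep 
reading until the requested count is reached or read() returns -1.

import java.io.IOException;
import java.io.InputStream;

public final class ReadFullySketch {
  public static int readFully(InputStream in, byte[] b, int off, int len)
      throws IOException {
    int total = 0;
    while (total < len) {
      int n = in.read(b, off + total, len - total);
      if (n < 0) {
        break; // EOF: return however many bytes were actually read
      }
      total += n;
    }
    return total;
  }
}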


Converted the tail-recursive methods into while loops. (Java really should add 
tail-call optimization for methods that only call themselves recursively -- it 
would require no changes to the bytecode.)
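
The rewrite is mechanical; a hypothetical example (not code from the patch) of 
turning a self tail-call into a loop:

public final class TailCallExample {
  // Before: the only recursive call is a tail call to itself.
  static int stepsRecursive(int remaining, int chunk, int steps) {
    if (remaining <= 0) {
      return steps;
    }
    return stepsRecursive(remaining - chunk, chunk, steps + 1); // tail call
  }

  // After: the same computation as a while loop, with no stack growth.
  static int stepsLoop(int remaining, int chunk, int steps) {
    while (remaining > 0) {
      remaining -= chunk;
      steps++;
    }
    return steps;
  }
}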


Fixed memory leaks.

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, 
> HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, 
> HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm

2010-07-30 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6349:
-

Attachment: hadoop-6349-3.patch

Another patch. There is still debug code scattered about (commented out), as I 
might need to put it to use at some point. This code isn't tested as well as 
the last patch was.

Adds support for native compression/decompression. Native compression is 230% 
faster than Java; native decompression is 70% faster than Java.

Somewhat-large redesign of the compressor. Compression is now fifty times 
faster when compressing around 64MB. The compressor used to keep all of its 
previously processed input in memory and arraycopy it to a new array every 
time it needed more space, so compressing 64MB of data while calling write() 
every 64k ended up copying ~32GB through memory (this is how it was in my test 
case). Compress 128MB of data writing every 1k instead, and you copy 8.8TB 
through memory.
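
As a sanity check on those figures: if every write() re-copies all input seen 
so far, n writes of s bytes copy about n(n+1)/2 * s bytes in total. A 
throwaway calculation (hypothetical helper, not patch code):

public final class CopyCost {
  static long totalCopied(long writes, long writeSize) {
    // The k-th write copies k * writeSize bytes: sum = n(n+1)/2 * s.
    return writes * (writes + 1) / 2 * writeSize;
  }

  public static void main(String[] args) {
    System.out.println(totalCopied(1024, 64 * 1024));  // 64MB in 64k writes: ~32GB
    System.out.println(totalCopied(128 * 1024, 1024)); // 128MB in 1k writes: ~8.8TB
  }
}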

Also modified the compressor to include an end-of-stream marker, so the 
decompressor can set itself to "finished" and the stream can return -1. The 
end-of-stream mark is indicated by setting the four unused bytes after the 
input size to high in a last chunk of length 0. This way, any decompressor 
which does not support the end-of-stream marker will never read those bytes; 
it will just decompress an empty block and not notice anything is wrong.
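
A hedged sketch of writing such a marker, assuming a big-endian size field and 
that "high" means 0xFF -- the chunk layout and helper names here are guesses, 
not the patch's actual format:

import java.io.IOException;
import java.io.OutputStream;

public final class EosMarkerSketch {
  static void writeEndOfStream(OutputStream out) throws IOException {
    writeInt(out, 0); // last chunk has length 0: decodes as an empty block
    // The four otherwise-unused bytes after the size, set high as the marker.
    out.write(new byte[] { (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF });
  }

  static void writeInt(OutputStream out, int v) throws IOException {
    out.write(v >>> 24);
    out.write(v >>> 16);
    out.write(v >>> 8);
    out.write(v);
  }
}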

Adds another mode to TestCodecPerformance which has it load a (relatively 
small) input file into memory and generate 64MB of data to compress from it. 
(It does this by taking random substrings of 16 to 128 bytes at random offsets 
until there are 64MB.) It then compresses the 64MB directly from memory to 
memory and times that. These times seem more representative than timing the 
compression of "key %d value %d" data or of random data. Right now this mode 
is enabled by calling it with the -input flag.
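
Roughly, the generation step looks like this sketch (names are illustrative; 
it assumes the seed file is at least 128 bytes long):

import java.util.Random;

public final class TestDataGenerator {
  static byte[] generate(byte[] seedFile, int targetLen, long seed) {
    Random rnd = new Random(seed);
    byte[] out = new byte[targetLen];
    int written = 0;
    while (written < out.length) {
      int len = 16 + rnd.nextInt(113);                  // 16..128 bytes
      len = Math.min(len, out.length - written);        // don't overshoot
      int off = rnd.nextInt(seedFile.length - len + 1); // random offset
      System.arraycopy(seedFile, off, out, written, len);
      written += len;
    }
    return out;
  }
}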

Ported the Adler32 code to C; it is used when the native libraries are in use.

Added a constant in the compressor to allow uncompressible data to instead be 
copied over byte for byte. This decreases the speed of the compressor by ~10% 
as it results in another memcpy, but it can more than double the speed of 
decompression.



Here's what the new part of the codec performance test gives when fed a log 
file. For comparison, DefaultCodec gets the size down to 11% and BZip2Codec 
down to 8%.

Previous patch:
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total decompressed size: 
640 MB.
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total compressed size: 
177 MB (27% of original).
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total compression time: 
381868 ms (1716 KBps).
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total decompression time: 
5051 ms (126 MBps).

Current patch:
Native C:
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total decompressed size: 
640 MB.
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total compressed size: 
177 MB (27% of original).
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total compression time: 
3314 ms (193 MBps).
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total decompression time: 
2861 ms (223 MBps).

Current patch:
Pure Java:
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total decompressed size: 
640 MB.
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total compressed size: 
177 MB (27% of original).
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total compression time: 
7891 ms (81 MBps).
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total decompression time: 
5077 ms (126 MBps).

> Implement FastLZCodec for fastlz/lzo algorithm
> --
>
> Key: HADOOP-6349
> URL: https://issues.apache.org/jira/browse/HADOOP-6349
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: io
>Reporter: William Kinney
> Attachments: hadoop-6349-1.patch, hadoop-6349-2.patch, 
> hadoop-6349-3.patch, HADOOP-6349-TestFastLZCodec.patch, HADOOP-6349.patch, 
> TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv
>
>
> Per  [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ 
> is a good (speed, license) alternative to LZO. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm

2010-07-26 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6349:
-

Attachment: hadoop-6349-2.patch

> Implement FastLZCodec for fastlz/lzo algorithm
> --
>
> Key: HADOOP-6349
> URL: https://issues.apache.org/jira/browse/HADOOP-6349
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: io
>Reporter: William Kinney
> Attachments: hadoop-6349-1.patch, hadoop-6349-2.patch, 
> HADOOP-6349-TestFastLZCodec.patch, HADOOP-6349.patch, 
> TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv
>
>
> Per  [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ 
> is a good (speed, license) alternative to LZO. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm

2010-07-26 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6349:
-

Attachment: (was: hadoop-6349-2.patch)

> Implement FastLZCodec for fastlz/lzo algorithm
> --
>
> Key: HADOOP-6349
> URL: https://issues.apache.org/jira/browse/HADOOP-6349
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: io
>Reporter: William Kinney
> Attachments: hadoop-6349-1.patch, hadoop-6349-2.patch, 
> HADOOP-6349-TestFastLZCodec.patch, HADOOP-6349.patch, 
> TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv
>
>
> Per  [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ 
> is a good (speed, license) alternative to LZO. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6837) Support for LZMA compression

2010-07-26 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6837:
-

Attachment: (was: hadoop-6349-2.patch)

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-1-20100722.patch, 
> HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm

2010-07-26 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6349:
-

Attachment: hadoop-6349-2.patch

Attached an update patch.

Fixed the checksum mismatch. It was possible for the decompressor to run out 
of input after reading the header bytes but not notice it if the block ID was 
1. So if there were fewer than 26 bytes in the input (but more than 16) and 
the block ID was 1, it wouldn't notice and would just use whatever happened to 
be in the buffer at the time.

Fixed a bug in the decompressor where it would incorrectly indicate it was 
finished if, at the end of decompressing a block, there was no more input left 
to decompress and decompress() was then called again (TestCodec seeds 
1333275328, 2011623221, -1402938700, or -1990280158; generate 50,000 records). 
Actually, the decompressor never returns finished now. The only time it should 
return true is when it somehow knows the end of the stream has been reached -- 
and it doesn't know that; it just guesses that if it has read all the bytes it 
currently has then it's done, which is not necessarily the case.

Implemented getRemaining().

Removed iOff from both the compressor and decompressor. It was initialized to 
zero from the start and was only ever modified after that by setting it to 0 
again.

Modified TestCodec to accept a seed as an argument.

Removed the rest of the carriage returns.

I will be adding a native version over the next few days and will upload that 
patch when it's done.

> Implement FastLZCodec for fastlz/lzo algorithm
> --
>
> Key: HADOOP-6349
> URL: https://issues.apache.org/jira/browse/HADOOP-6349
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: io
>Reporter: William Kinney
> Attachments: hadoop-6349-1.patch, hadoop-6349-2.patch, 
> HADOOP-6349-TestFastLZCodec.patch, HADOOP-6349.patch, 
> TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv
>
>
> Per  [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ 
> is a good (speed, license) alternative to LZO. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-07-26 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892425#action_12892425
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

... that was supposed to go on HADOOP-6349, not here. Ignore that.

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: hadoop-6349-2.patch, HADOOP-6837-lzma-1-20100722.patch, 
> HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6837) Support for LZMA compression

2010-07-26 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6837:
-

Attachment: hadoop-6349-2.patch

Attached an update patch.

Fixed the checksum mismatch. It was possible for the decompressor to run out 
of input after reading the header bytes but not notice it if the block ID was 
1. So if there were fewer than 26 bytes in the input (but more than 16) and 
the block ID was 1, it wouldn't notice and would just use whatever happened to 
be in the buffer at the time.

Fixed a bug in the decompressor where it would incorrectly indicate it was 
finished if, at the end of decompressing a block, there was no more input left 
to decompress and decompress() was then called again (TestCodec seeds 
1333275328, 2011623221, -1402938700, or -1990280158; generate 50,000 records). 
Actually, the decompressor never returns finished now. The only time it should 
return true is when it somehow knows the end of the stream has been reached -- 
and it doesn't know that; it just guesses that if it has read all the bytes it 
currently has then it's done, which is not necessarily the case.

Implemented getRemaining().

Removed iOff from both the compressor and decompressor. It was initialized to 
zero from the start and was only ever modified after that by setting it to 0 
again.

Modified TestCodec to accept a seed as an argument.

Removed the rest of the carriage returns.


I will be adding a native version over the next few days and will upload that 
patch when it's done.

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: hadoop-6349-2.patch, HADOOP-6837-lzma-1-20100722.patch, 
> HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6837) Support for LZMA compression

2010-07-22 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6837:
-

Attachment: HADOOP-6837-lzma-1-20100722.patch

Attached a patch merging the Java and C code together. Fixed two bugs in the C 
code and removed some unneeded methods/arguments. Ran TestCodec on both of 
them for several hours each and didn't find any bugs. Tested (again with 
TestCodec) all possible dictionary sizes from 1<<12 to 1<<23. Also added more 
(much-needed) documentation.

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-1-20100722.patch, 
> HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-07-19 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890051#action_12890051
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

I spoke with Greg about it just now, and he said it would probably be better 
for me to work on FastLZ first and come back to this later.

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-c-20100719.patch, 
> HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6837) Support for LZMA compression

2010-07-19 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6837:
-

Attachment: HADOOP-6837-lzma-c-20100719.patch

Uploaded C code with LzmaNativeInputStream and LzmaNativeOutputStream. Testing 
is the same as that for the Java code. The documentation is limited on the C 
side, and there are still (commented out) debug statements scattered all over.

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-c-20100719.patch, 
> HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm

2010-07-19 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889893#action_12889893
 ] 

Nicholas Carlini commented on HADOOP-6349:
--

Eli -- did you make an updated patch? If you haven't, that's okay; I can 
rebase it on trunk.

> Implement FastLZCodec for fastlz/lzo algorithm
> --
>
> Key: HADOOP-6349
> URL: https://issues.apache.org/jira/browse/HADOOP-6349
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: io
>Reporter: William Kinney
> Attachments: HADOOP-6349-TestFastLZCodec.patch, HADOOP-6349.patch, 
> TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv
>
>
> Per  [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ 
> is a good (speed, license) alternative to LZO. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm

2010-07-08 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886452#action_12886452
 ] 

Nicholas Carlini commented on HADOOP-6349:
--

Is anyone currently working on this? I'm currently working on adding LZMA 
compression ([HADOOP-6837|https://issues.apache.org/jira/browse/HADOOP-6837]), 
but after I finish that, if no one else has picked this up, I'll work on it.

> Implement FastLZCodec for fastlz/lzo algorithm
> --
>
> Key: HADOOP-6349
> URL: https://issues.apache.org/jira/browse/HADOOP-6349
> Project: Hadoop Common
>  Issue Type: New Feature
>  Components: io
>Reporter: William Kinney
> Attachments: HADOOP-6349-TestFastLZCodec.patch, HADOOP-6349.patch, 
> TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv
>
>
> Per  [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ 
> is a good (speed, license) alternative to LZO. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-24 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882250#action_12882250
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

The Java code from the SDK hasn't been updated since version 4.61 (23 November 
2008), so support for LZMA2 would need to rely on the C code, or be ported to 
Java.

The compression ratios of LZMA and LZMA2 are nearly identical (+/- .01% from 
the tests I did). It does look like LZMA2 is block based and is splittable, so 
that would be a major plus for it.

On the differences between LZMA and LZMA2:

    LZMA2 is an extension on top of the original LZMA. LZMA2 uses
    LZMA internally, but adds support for flushing the encoder,
    uncompressed chunks, eases stateful decoder implementations,
    and improves support for multithreading.

http://tukaani.org/xz/xz-file-format.txt

I did have to add support for flushing the encoder to the Java code (flushing 
the encoder still produces valid lzma-compressed output).

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HADOOP-6837) Support for LZMA compression

2010-06-23 Thread Nicholas Carlini (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881935#action_12881935
 ] 

Nicholas Carlini commented on HADOOP-6837:
--

Per the FAQ:

"You can also read about the LZMA SDK, which is available under a more liberal 
license."

http://www.7-zip.org/faq.html

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (HADOOP-6837) Support for LZMA compression

2010-06-23 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini reassigned HADOOP-6837:


Assignee: Nicholas Carlini

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
>Assignee: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HADOOP-6837) Support for LZMA compression

2010-06-23 Thread Nicholas Carlini (JIRA)

 [ 
https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Carlini updated HADOOP-6837:
-

Attachment: HADOOP-6837-lzma-java-20100623.patch

Attached a patch. It includes an LzmaCodec, LzmaInputStream, and 
LzmaOutputStream. The LZMA compression/decompression uses the LZMA SDK from 
http://www.7-zip.org/sdk.html. The code has been tested minimally -- when used 
in io.SequenceFile.java, it passes the TestSetFile/TestArrayFile tests. I will 
attach another patch later on when I have fully tested code. I will also be 
working on a native version written in C, also based off of the LZMA SDK; it 
is significantly faster than the Java code.

> Support for LZMA compression
> 
>
> Key: HADOOP-6837
> URL: https://issues.apache.org/jira/browse/HADOOP-6837
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: io
>Reporter: Nicholas Carlini
> Attachments: HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
> generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HADOOP-6837) Support for LZMA compression

2010-06-23 Thread Nicholas Carlini (JIRA)
Support for LZMA compression


 Key: HADOOP-6837
 URL: https://issues.apache.org/jira/browse/HADOOP-6837
 Project: Hadoop Common
  Issue Type: Improvement
  Components: io
Reporter: Nicholas Carlini
 Attachments: HADOOP-6837-lzma-java-20100623.patch

Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which 
generally achieves higher compression ratios than both gzip and bzip2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.