[jira] Updated: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------
        Status: Patch Available  (was: Open)
  Hadoop Flags: [Reviewed]

> Support for LZMA compression
> ----------------------------
>
>                 Key: HADOOP-6837
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6837
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: io
>            Reporter: Nicholas Carlini
>            Assignee: Nicholas Carlini
>         Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-2-20100806.patch, HADOOP-6837-lzma-3-20100809.patch, HADOOP-6837-lzma-4-20100811.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
[jira] Updated: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------
    Attachment: HADOOP-6837-lzma-4-20100811.patch

Fixed two FindBugs warnings: made one variable static and removed another.

Also fixed build.xml so it no longer crashes when 'ant package' is run (I think; it works correctly when doing 'ant package' on src/contrib now, and it didn't with the previous patch). I had to override all of the build-contrib targets with empty ones, since otherwise it would go looking for directories that don't exist (because I don't need them to exist). I also changed build-contrib.xml (which is included by the other build.xml files) to have a compile-before target in it, so there won't be any errors (even though I did set failonerror to false). Safer that way.

FindBugs fails because of four warnings in the original SevenZip code. I could make those classes static to keep FindBugs happy, but that would make it more work to apply a patch to the Java code, so unless there's a particularly good reason to do so I won't:

SIC  Should SevenZip.Compression.LZMA.Decoder$LenDecoder be a _static_ inner class?
SIC  Should SevenZip.Compression.LZMA.Decoder$LiteralDecoder be a _static_ inner class?
SIC  Should SevenZip.Compression.LZMA.Encoder$LiteralEncoder be a _static_ inner class?
SIC  Should SevenZip.Compression.LZMA.Encoder$Optimal be a _static_ inner class?

And JavaDoc finds an error:

[javadoc] javadoc: error - Illegal package name: ""

However, I cannot reproduce this on my local machine. Let's see if it happens with Hudson.
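For context, the FindBugs SIC warning fires when an inner class never uses its enclosing instance. A minimal illustration (the class names here are hypothetical, not the actual SevenZip SDK classes):

```java
// Illustration of the FindBugs SIC ("Should be a static inner class")
// warning. Names are hypothetical, not the SevenZip SDK classes.
public class Outer {
    // Carries an implicit reference to its Outer instance that it never
    // uses; FindBugs suggests making it static to drop that hidden field.
    class NonStaticDecoder {
        int state;
        int decode(int symbol) { return state + symbol; }
    }

    // Equivalent static version: no hidden Outer reference, slightly
    // smaller objects, instantiable without an Outer instance.
    static class StaticDecoder {
        int state;
        int decode(int symbol) { return state + symbol; }
    }

    public static void main(String[] args) {
        Outer.StaticDecoder d = new Outer.StaticDecoder();
        d.state = 40;
        System.out.println(d.decode(2)); // prints 42
    }
}
```

This is also why the warning is cosmetic here: the fix changes only object layout, not behavior, which is why leaving the SDK code unmodified is a defensible call.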
[jira] Commented: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897337#action_12897337 ]

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

Correct, the current patch does not support LZMA2, and it probably won't be included in this patch. There will probably be another JIRA filed to swap out the current C code for the LibLzma library to make the C side of things cleaner; LibLzma has a very zlib-like interface, so adding it should be trivial, and it comes with LZMA2 support, so that would be free. And then there will probably be another JIRA to port LibLzma to Java and make everything nice and clean.
[jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm
[ https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6349:
-------------------------------------
    Attachment: hadoop-6349-4.patch

Fixed the buffering issues. There is still work to do here, though: when compress() is called with an array that is big enough, instead of compressing to a temporary buffer and then copying to the given byte array, it should compress directly to that buffer where possible. Avoiding the temporary buffer entirely would require changing the compressor to keep its state and resume compression from the middle, which seems like more work than it's worth for very little gain. This would, however, mean that the buffer size used by the CompressorStream needs to be around 64k, so that the majority of the time bytes can be written directly to it. (The same applies to decompression.)

Fixed a bug where it was possible for the end-of-stream mark to show up in the wrong place.

Fixed another compressor bug where calling write(int) on the output stream n times would add 47*n bytes to the output stream. This happened because each time write() is called, so is compress(), which emits 26 bytes for a header block and 16 bytes of a header, and then 1 byte of uncompressed data. Fixed this by adding another case: if the buffer has fewer than 2^16 bytes (the default block size), needsInput() returns true, so the stream won't compress yet.

Changed TestCodecPerformance to call write() several times instead of writing all at once; it now calls write() with a random length until it's out of input.

Got rid of the uses of BigInteger. Removed the moved code; no idea why it was ever there. Spelled out abbreviations in comments.

FASTLZ_STRICT_ALIGN mode doesn't even work when set to false; deleted it. FASTLZ_SAFE mode set to false has no performance benefit (all it does is skip a few if statements, none of which contain loops); deleted it too.
At some point, the header blocks (with block ID 1) should be removed; they have no purpose and just remain from the port of 6pack.c. Maybe even make the header smaller by removing the ID, now that all blocks have an ID of 17 and the 'options' are only one bit for compressed versus incompressible data.

> Implement FastLZCodec for fastlz/lzo algorithm
> ----------------------------------------------
>
>                 Key: HADOOP-6349
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6349
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: io
>            Reporter: William Kinney
>         Attachments: hadoop-6349-1.patch, hadoop-6349-2.patch, hadoop-6349-3.patch, hadoop-6349-4.patch, HADOOP-6349-TestFastLZCodec.patch, HADOOP-6349.patch, TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv
>
> Per [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ is a good (speed, license) alternative to LZO.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
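The needsInput() gating described in the update above (don't compress until a full block has accumulated, so repeated single-byte writes don't each emit a header) can be sketched as follows; this is a hypothetical simplified class, not the patch code:

```java
// Sketch of the buffering fix: refuse to compress until a full block
// (default 64 KB) has accumulated, so n calls to write(int) don't each
// produce header bytes plus one byte of data. Hypothetical class.
public class BlockBuffer {
    static final int BLOCK_SIZE = 1 << 16; // 64 KB default block size
    private final byte[] buf = new byte[BLOCK_SIZE];
    private int len = 0;
    private boolean finishing = false;

    public void setInput(byte[] b, int off, int n) {
        System.arraycopy(b, off, buf, len, n);
        len += n;
    }

    public void finish() { finishing = true; }

    // True while the buffer holds less than one block and the stream is
    // not being flushed: the caller should keep feeding input instead of
    // compressing a tiny fragment with full per-block overhead.
    public boolean needsInput() {
        return len < BLOCK_SIZE && !finishing;
    }

    public static void main(String[] args) {
        BlockBuffer b = new BlockBuffer();
        b.setInput(new byte[]{1}, 0, 1);
        System.out.println(b.needsInput()); // under 64 KB: true
        b.finish();
        System.out.println(b.needsInput()); // flushing: false
    }
}
```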
[jira] Commented: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm
[ https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896753#action_12896753 ]

Nicholas Carlini commented on HADOOP-6349:
------------------------------------------

I believe I've fixed the buffering problems with very few changes to the compressor and decompressor. I'll need to verify that these changes didn't break anything else, but it seems like a simple issue to fix.
[jira] Commented: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896751#action_12896751 ]

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

I did. The average number of overflow bytes is 24, and I never saw it go above 120. A quick sed/dc script tells me the standard deviation is 18, so I'm fairly sure I am correct that it will never go above 273. Setting the number of fast bytes to 273 instead gives an average of 37 and a standard deviation of 26.
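The sed/dc script itself isn't shown in the message; an equivalent mean and standard-deviation computation over overflow-byte counts looks like this (the sample numbers below are made up for illustration, not the actual measurements):

```java
// Equivalent of the sed/dc pipeline: mean and (population) standard
// deviation over a list of overflow-byte counts. Sample data is
// hypothetical, not the measurements from the message.
public class OverflowStats {
    public static double mean(int[] xs) {
        double s = 0;
        for (int x : xs) s += x;
        return s / xs.length;
    }

    public static double stddev(int[] xs) {
        double m = mean(xs), s = 0;
        for (int x : xs) s += (x - m) * (x - m);
        return Math.sqrt(s / xs.length); // population standard deviation
    }

    public static void main(String[] args) {
        int[] sample = {10, 24, 24, 38}; // hypothetical overflow counts
        System.out.println(mean(sample));   // 24.0
        System.out.println(stddev(sample)); // sqrt(98) ~ 9.9
    }
}
```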
[jira] Updated: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------
    Attachment: HADOOP-6837-lzma-3-20100809.patch

I'm going to respond to each point this time to make sure I don't miss anything again...

FakeOutputStream isn't the one I'm talking about in package.html; that passage is about the OutputStream/FakeInputStream pair. FakeOutputStream is the one where I couldn't justify the maximum acting correctly (writing at most 273 extra bytes), so I added the linked list in case anything goes wrong.

Changed the directory structure as suggested: src/contrib/lzma912/src/SevenZip/* has the Java files and src/contrib/lzma912/src/native has the native code.

The default dictionary size doesn't matter, because it gets set to something else on initialization; I left it there to keep the set of changes minimal.

I needed library.properties for it to build correctly. I have no idea what it does, but if I delete it then it doesn't build, and when it's empty nothing complains, so that's why it's there.

Sorry about the ec2 file; I didn't realize that went into the patch! It's not supposed to be there.

Deleted CRC.java; apparently it was never used.

Removed the read() while loop and rewrote it as suggested.

Fixed the names (buffered/sendData) again; I had fixed them but then reverted to an older version when I introduced a bug.
Removed 1<
[jira] Updated: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------
    Attachment: HADOOP-6837-lzma-2-20100806.patch

I found a bug where Java compression would set a very, very wrong dictionary size: instead of setting it to, say, 2^24, it would set it to 24 (which would then be forced up to the minimum size, but still, very very wrong).

I also added a fairly long package.html (~3000 words) documenting how my implementation works, so anyone else who wants to modify it hopefully won't need to spend forever exploring the code to figure it out.

Also, I was both right and wrong about giveMore(): right when I first wrote it, and wrong when I said I fixed it by using the return value. The return values were actually left over from when I was checking for the end of stream on the Java end, but I realized that (because of the semi-circular buffer) it was possible for Java to indicate an EOF that wasn't really true. So I had moved that check to the C code and just never removed the old code from the Java end.

Fixed the linked list stuff.

Also a fairly significant directory restructure: the modified SDK code is now in src/contrib/SevenZip, Java code under src/java, and C code under src/native. I removed all of my re-formatting of their code, so should a future version of the SDK be released, it shouldn't be as hard to diff and apply a patch to bring this code up to date. The makefile there builds into the same build tree as everything else for Java.

To get it building correctly, I had to modify the base build.xml and contrib/build.xml: compile-core-classes now also depends on compile-contrib-before, which calls compile-before on contrib/build.xml; from there, contrib/build.xml calls compile-before on contrib/*/build.xml with failonerror set to false, so this change will not break any other build scripts.
(This change is required because the Lzma{Input,Output}Stream classes must be built first.) I also cleaned up the code and fixed all the review comments. There will be at least one more version of this patch for things I didn't catch.
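The dictionary-size bug described above is the classic exponent-versus-value mixup: passing 24 where 2^24 bytes is expected. Schematically (hypothetical helper, with an assumed 4 KB minimum; not the patch code):

```java
// Schematic of the dictionary-size bug: passing the exponent (24) where
// the byte count (2^24) is expected. Hypothetical helper; the 4 KB
// minimum is an assumption for illustration.
public class DictSize {
    static final int MIN_DICT_SIZE = 1 << 12; // assumed 4 KB floor

    // The message says the too-small value was "forced up to the min
    // size", i.e. clamped, which masked the bug instead of failing.
    public static int effectiveSize(int requested) {
        return Math.max(requested, MIN_DICT_SIZE);
    }

    public static void main(String[] args) {
        int exponent = 24;
        System.out.println(effectiveSize(exponent));      // bug: 24 clamped to 4096
        System.out.println(effectiveSize(1 << exponent)); // intended: 16777216
    }
}
```

The silent clamp is what made this hard to spot: the codec still worked, just with a dictionary thousands of times smaller than intended.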
[jira] Commented: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894773#action_12894773 ]

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

Responding to the major comments; I will upload a patch that fixes these and the smaller comments soon.

FakeInputStream LinkedList: this list can get fairly long, depending on how write is called. Worst case it can have upwards of 12 million elements, which is far beyond acceptable; that happens if write(single_byte) is called over and over, since each call adds a new link. Looking back at this, a linked list probably wasn't the best way to go. There are two (obvious) ways write() could work: one is a linked list, as I did; the other is a byte array that can hold forceWriteLen bytes to copy into, though that can be as large as 12MB. The array could either be allocated at the full 12MB up front, or start at maybe 64k and grow by powers of two until it reaches 12MB, which would arraycopy a little under 12MB more in total than the up-front allocation. I will implement one of these for the patch.

FakeOutputStream LinkedList: this linked list has a more reasonable use. Its purpose is to hold extra bytes just in case the input stream gives too many. I am fairly confident that at most 272 bytes (maximum number of fast bytes - 1) can be written to it; the reason I used a linked list is that I couldn't formally prove this after going through the code. I wanted to be safe so that, just in case their code doesn't behave as it should, everything will still work on the OutputStream end.

Code(..., len): I think I remember figuring out that Code(...) will return at least (but possibly more than) len bytes, with the one exception that when the end of the stream is reached it will only read up to the end of the stream.
I will modify the decompressor to no longer assume this and to use the actual number of bytes read instead.

Fixed the inStream.read() bug (it will be in the patch I upload). Added a while loop to read until EOF is reached so the assumptions hold.

Converted the tail-recursive methods into while loops. (Java should add tail-call optimization for methods that only call themselves recursively, which would require no changes to the bytecode.)

Fixed the memory leaks.
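The grow-by-powers-of-two alternative discussed above for FakeInputStream can be sketched as follows. This is hypothetical code under the message's stated bounds (64k start, ~12MB worst case), not the patch itself:

```java
// Sketch of the growth strategy discussed for FakeInputStream: start at
// 64 KB and double up to a ~12 MB cap, instead of keeping a linked list
// of per-write chunks. Hypothetical code; assumes total input never
// exceeds MAX, per the 12 MB worst case quoted in the message.
public class GrowableBuffer {
    static final int INITIAL = 1 << 16;       // 64 KB starting capacity
    static final int MAX = 12 * 1024 * 1024;  // ~12 MB worst case

    private byte[] buf = new byte[INITIAL];
    private int len = 0;

    public void write(byte[] src, int off, int n) {
        if (len + n > buf.length) {
            int cap = buf.length;
            while (cap < len + n) cap <<= 1;  // double until it fits
            cap = Math.min(cap, MAX);
            byte[] bigger = new byte[cap];
            System.arraycopy(buf, 0, bigger, 0, len);
            buf = bigger;
        }
        System.arraycopy(src, off, buf, len, n);
        len += n;
    }

    public int size() { return len; }
    public int capacity() { return buf.length; }
}
```

Doubling keeps the total bytes arraycopied during growth under 2x the final size, which is the "little under 12MB more in total" the message estimates, while avoiding both the 12MB up-front allocation and millions of list nodes.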
[jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm
[ https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6349:
-------------------------------------
    Attachment: hadoop-6349-3.patch

Another patch. There is still debug code scattered about (commented out), as I might need to put it to use at some point, and this code isn't tested as well as the last patch.

Adds support for native compression/decompression. Native compression is 230% faster than Java; native decompression is 70% faster.

Somewhat large redesign of the compressor: compression is now fifty times faster when compressing around 64MB. The compressor used to keep all previously processed input in memory and arraycopy it to a new array every time it needed more space, so compressing 64MB of data, calling write() every 64k, it would end up copying ~32GB through memory (this is how it was for my test case). Compress 128MB of data writing every 1k instead, and you copy 8.8TB through memory.

Also modified the compressor to include an end-of-stream marker, so the decompressor can set itself to "finished" and the stream can return -1. The end-of-stream mark is indicated by setting the four unused bytes after the input size high in the last chunk, which has length 0. This way, any decompressor that does not support the end-of-stream marker will never read those bytes; it will just decompress an empty block and not notice anything is wrong.

Adds another method to TestCodecPerformance which has it load a (relatively small) input file into memory and generate 64MB of data to compress from it (by taking random substrings of 16 to 128 bytes at random offsets until there are 64MB). It then directly compresses the 64MB from memory to memory and times that. These times seem more representative than timing the compression of "key %d value %d" data or of random data. Right now this mode is enabled with the -input flag.
Ported the Adler32 code to C; it is used when the native libraries are in use.

Added a constant in the compressor to allow incompressible data to be copied over byte for byte. This decreases the speed of the compressor by ~10%, as it results in another memcpy, but it can more than double the speed of decompression.

Here's what the new part of the codec performance test gives when fed a log file. For comparison: DefaultCodec gets the size down to 11% and BZip2Codec down to 8%.

Previous patch:

10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total decompressed size: 640 MB.
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total compressed size: 177 MB (27% of original).
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total compression time: 381868 ms (1716 KBps).
10/07/29 11:51:39 INFO compress.TestCodecPerformance: Total decompression time: 5051 ms (126 MBps).

Current patch, native C:

10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total decompressed size: 640 MB.
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total compressed size: 177 MB (27% of original).
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total compression time: 3314 ms (193 MBps).
10/07/29 11:56:57 INFO compress.TestCodecPerformance: Total decompression time: 2861 ms (223 MBps).

Current patch, pure Java:

10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total decompressed size: 640 MB.
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total compressed size: 177 MB (27% of original).
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total compression time: 7891 ms (81 MBps).
10/07/29 12:15:50 INFO compress.TestCodecPerformance: Total decompression time: 5077 ms (126 MBps).
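The ~32GB and 8.8TB copy-volume figures quoted above follow from the old copy-everything-on-every-write behavior: n writes of size s each re-copy all prior input, for a total of s * n(n+1)/2 bytes. The arithmetic checks out:

```java
// Checking the copy-volume figures: if the compressor arraycopies all
// previously seen input on every write, total traffic for n writes of
// writeSize bytes is writeSize * n(n+1)/2.
public class CopyVolume {
    public static long totalCopied(long totalInput, long writeSize) {
        long n = totalInput / writeSize;    // number of write() calls
        return writeSize * n * (n + 1) / 2; // quadratic arraycopy volume
    }

    public static void main(String[] args) {
        long MB = 1L << 20, KB = 1L << 10;
        // 64 MB written 64 KB at a time: 34,393,292,800 bytes, i.e. the
        // ~32 GB (32 GiB) figure.
        System.out.println(totalCopied(64 * MB, 64 * KB));
        // 128 MB written 1 KB at a time: 8,796,160,131,072 bytes, i.e.
        // the ~8.8 TB figure.
        System.out.println(totalCopied(128 * MB, 1 * KB));
    }
}
```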
[jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm
[ https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6349:
-------------------------------------
    Attachment: hadoop-6349-2.patch
[jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm
[ https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6349:
-------------------------------------
    Attachment: (was: hadoop-6349-2.patch)
[jira] Updated: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------
    Attachment: (was: hadoop-6349-2.patch)
[jira] Updated: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm
[ https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6349:
-------------------------------------
    Attachment: hadoop-6349-2.patch

Attached an updated patch.

Fixed the checksum mismatch. It was possible for the decompressor to run out of input after reading the header bytes but not notice if the block ID was 1: if there were fewer than 26 bytes in the input (but more than 16) and the block ID was 1, it wouldn't notice and would just use whatever happened to be in the buffer at the time.

Fixed a bug in the decompressor where it would incorrectly indicate it was finished if, at the end of decompressing a block, there was no more input left to decompress and decompress() was then called again (TestCodec seeds 1333275328, 2011623221, -1402938700, or -1990280158; generate 50,000 records). Actually, the decompressor never returns finished now. The only time it should return true is when it somehow knows the end of the stream has been reached, and it doesn't know that; it just guesses that if it has read all the bytes it currently has then it's done, which is not necessarily the case.

Implemented getRemaining().

Removed iOff from both the compressor and decompressor; it was initialized to zero and was only ever modified after that by setting it to 0.

Modified TestCodec to accept a seed as an argument. Removed the rest of the carriage returns.

I will be adding a native version over the next few days and will upload that patch when it's done.
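The finished() semantics described above (only report finished on an explicit end-of-stream mark, never merely because the current input buffer is exhausted) can be sketched like this; a hypothetical simplified skeleton, not the Hadoop Decompressor implementation itself:

```java
// Sketch of the finished() semantics: running out of buffered input is
// NOT the end of the stream, since more input may still arrive via
// setInput(); only an explicit end-of-stream mark means finished.
// Hypothetical simplified skeleton.
public class EosDecompressor {
    private boolean sawEndOfStreamMark = false;
    private int available = 0; // bytes of undecoded input on hand

    public void setInput(byte[] b, int off, int n) { available += n; }

    // Called when the decoder recognizes the explicit end-of-stream
    // chunk in the compressed data.
    public void markEndOfStream() { sawEndOfStreamMark = true; }

    // Never guess "finished" from an empty buffer.
    public boolean finished() { return sawEndOfStreamMark; }

    public boolean needsInput() {
        return available == 0 && !sawEndOfStreamMark;
    }
}
```

Without a marker in the stream format, finished() has no reliable signal at all, which is exactly why the earlier patch added the zero-length end-of-stream chunk.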
[jira] Commented: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892425#action_12892425 ]

Nicholas Carlini commented on HADOOP-6837:
------------------------------------------

... that was supposed to go on HADOOP-6349, not here. Ignore that.
[jira] Updated: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------
    Attachment: hadoop-6349-2.patch
[jira] Updated: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Carlini updated HADOOP-6837:
-------------------------------------
    Attachment: HADOOP-6837-lzma-1-20100722.patch

Attached a patch merging the Java and C code together. Fixed two bugs in the C code and removed some unneeded methods/arguments. Ran TestCodec on both for several hours each and didn't find any bugs. Tested (again with TestCodec) all possible dictionary sizes from 1<<12 to 1<<23. Also added more (much needed) documentation.
[jira] Commented: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890051#action_12890051 ] Nicholas Carlini commented on HADOOP-6837: --

I spoke with Greg about it just now, and he said it would probably be better for me to work on FastLZ first and come back to this later.
[jira] Updated: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Carlini updated HADOOP-6837: - Attachment: HADOOP-6837-lzma-c-20100719.patch

Uploaded C code with LzmaNativeInputStream and LzmaNativeOutputStream. Testing is the same as for the Java code. The documentation is limited on the C side, and there are still (commented-out) debug statements scattered all over.
[jira] Commented: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm
[ https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889893#action_12889893 ] Nicholas Carlini commented on HADOOP-6349: --

Eli -- did you make an updated patch? If you haven't, that's okay; I can rebase it on trunk.

> Implement FastLZCodec for fastlz/lzo algorithm > -- > > Key: HADOOP-6349 > URL: https://issues.apache.org/jira/browse/HADOOP-6349 > Project: Hadoop Common > Issue Type: New Feature > Components: io > Reporter: William Kinney > Attachments: HADOOP-6349-TestFastLZCodec.patch, HADOOP-6349.patch, > TestCodecPerformance.java, TestCodecPerformance.java, testCodecPerfResults.tsv > > > Per [HADOOP-4874|http://issues.apache.org/jira/browse/HADOOP-4874], FastLZ > is a good (speed, license) alternative to LZO.
[jira] Commented: (HADOOP-6349) Implement FastLZCodec for fastlz/lzo algorithm
[ https://issues.apache.org/jira/browse/HADOOP-6349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886452#action_12886452 ] Nicholas Carlini commented on HADOOP-6349: --

Is anyone currently working on this? I'm currently adding LZMA compression ([HADOOP-6837|https://issues.apache.org/jira/browse/HADOOP-6837]); after I finish that, if no one else has picked this up, I'll work on it.
[jira] Commented: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882250#action_12882250 ] Nicholas Carlini commented on HADOOP-6837: --

The Java code from the SDK hasn't been updated since version 4.61 (released 23 November 2008), so support for LZMA2 would need to rely on the C code or be ported to Java. The compression ratios of LZMA and LZMA2 are nearly identical (+/- 0.01% in my tests). LZMA2 does look block-based and splittable, which would be a major plus for it.

On the differences between LZMA and LZMA2: LZMA2 is an extension on top of the original LZMA. It uses LZMA internally, but adds support for flushing the encoder and for uncompressed chunks, eases stateful decoder implementations, and improves support for multithreading. http://tukaani.org/xz/xz-file-format.txt

I did have to add support for flushing the encoder to the Java code (flushing the encoder still produces valid LZMA-compressed output).
[jira] Commented: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881935#action_12881935 ] Nicholas Carlini commented on HADOOP-6837: --

Per the FAQ: "You can also read about the LZMA SDK, which is available under a more liberal license." http://www.7-zip.org/faq.html
[jira] Assigned: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Carlini reassigned HADOOP-6837: Assignee: Nicholas Carlini
[jira] Updated: (HADOOP-6837) Support for LZMA compression
[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Carlini updated HADOOP-6837: - Attachment: HADOOP-6837-lzma-java-20100623.patch

Attached a patch. It includes an LzmaCodec, LzmaInputStream, and LzmaOutputStream. The LZMA compression/decompression uses the LZMA SDK from http://www.7-zip.org/sdk.html. The code has been tested only minimally -- when used in io.SequenceFile.java, it passes the TestSetFile and TestArrayFile tests. I will attach another patch later once I have fully tested the code. I will also be working on a native version written in C, also based on the LZMA SDK; it is significantly faster than the Java code.
[jira] Created: (HADOOP-6837) Support for LZMA compression
Support for LZMA compression
Key: HADOOP-6837
URL: https://issues.apache.org/jira/browse/HADOOP-6837
Project: Hadoop Common
Issue Type: Improvement
Components: io
Reporter: Nicholas Carlini
Attachments: HADOOP-6837-lzma-java-20100623.patch

Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.