[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894246#action_12894246 ]
Greg Roelofs commented on HADOOP-6837:
--------------------------------------

First, apologies for having steered Nicholas wrong on the liblzma issue. As Hong noted, it provides a much saner (that is, zlib-like) API for this sort of thing, but I mistakenly thought it shared the GPL license of (parts of) xz, so we ignored it and he worked on the LZMA SDK code instead. (The latter did include a Java port, however; liblzma does not.)

Overall, the 20100722 patch looks pretty decent (given the starting material), but it does include some less-than-beautiful workarounds to cope with the impedance mismatch between push- and pull-style I/O models. In light of the fact that liblzma is, in fact, public-domain software (every file under xz-4.999.9beta-143-g3e49/src/liblzma is either explicitly in the public domain or has been automatically generated by such a file), I'm going to ask that Nicholas redo the native-code version to use liblzma rather than the SDK. (Unfortunately, it looks like the transformation from C SDK to liblzma was a significant amount of work, so it doesn't appear that a trivial "liblzma-ification" of the Java SDK code is likely. If Nicholas concurs with that assessment, we can instead file a separate JIRA to port the liblzma C code to Java.) Note that liblzma includes an LZMA2 codec, so Scott Carey's splittable-codec suggestion is within reach, too.

OK, enough preamble. There were a number of relatively minor style issues, which I'll simply list below, but the five main concerns were:

- FakeInputStream.java, FakeOutputStream.java: the linked lists of byte arrays are tough to swallow, even given the push/pull problem, even given our previous discussions on the matter. It would be good to know what the stats are on these things in "typical" cases--how frequently does overflow occur in LzmaInputStream, for example, and how many buffers are used?
- Is the Code(..., len) call in LzmaInputStream guaranteed to produce len bytes if it returns true? The calling read() function assumes this, but it's not at all obvious to me; the parameter is outSize in Code(), and I don't see that it's decremented or even really used at all (other than being stored in oldOutSize), unless it's buried inside macros defined elsewhere.

The next two (or perhaps three) are no longer directly relevant, but they're general things to watch out for:

- The return value from inStream.read() in LzmaNativeInputStream.java is ignored even though there's no guarantee the call will return the requested number of bytes. A comment ("never have to call ... again") reiterates this error.
- There's no need for recursion in LzmaNativeOutputStream.java's write() method; iterative solutions tend to be far cleaner, I think. (Probably ditto for LzmaNativeInputStream.java's read() method.)
- LzmaCompressor.c has a pair of memleaks (state->outStream, state->inStream).

Here are the minor readability/maintainability/cosmetic/style issues:

* add LZMA SDK version (apparently 9.12) and perhaps its release date to the boilerplate
* tabs/formatting of LZMA SDK code (CRC.java, BinTree, etc.): I _think_ tabs are frowned upon in Hadoop, though I might be wrong; at any rate, they seem to be rarely used
** for easy Hadoop-style formatting, "indent -i2 -br -l80" is a start (though it's sometimes confused by Java/C++ constructs)
* reuse existing CRC implementation(s)? (JDK has one, Hadoop has another)
* prefer lowercase "lzma" for subdirs
* use uppercase "LZMA" when referring to codec/algorithm (e.g., comments)
* add README mentioning anything special/weird/etc. (e.g., weird hashtable issue); summary of changes made for Hadoop; potential Java/C diffs; binary compatibility between various output formats (other than trivial headers/footers); LZMA2 (splittable, not yet implemented); liblzma (much cleaner, more zlib-like implementation, still PD); etc.
* "ant javadoc" run yet? (apparently not recently)
* line lengths, particularly comments (LzmaInputStream.java, etc.): should be no more than 80 columns in general (Hadoop guidelines)
* avoid generic variable names for globals and class members; use existing conventions where possible (e.g., look at gzip/zlib and bzip2 code)
* LzmaCodec.java:
** uppercase "LZMA" when referring to codec/algorithm in general
** "funcionality" x 4
** "throws ... {" continuation line: don't further indent
* LZ/InWindow.java
** leftover debug code at end
* RangeCoder/Encoder.java
** spurious blank line (else just boilerplate)
* FakeOutputStream.java:
** "stuffeed"
** "ammount"
** isOverflow() -> didOverflow()
* LzmaInputStream.java:
** [uses FakeOutputStream]
** "bufferd"
** "we 've"
** index -> overflowIndex (or similar): too generic
** hasRead -> initialized
** doInit(): use LzmaCodec getDefaultExtension() (or simply change text to "LZMA") instead of hardcoded ".lzma"
*** if higher-level code extended to support .xz, would this code even know?
** read(): do doInit() after args-checking/throws
** 3rd read() has bogus @return comment
** overflow approach: data on how frequent, how many buffers?
*** no reuse: creating new FakeOutputStream on every read()
* LzmaOutputStream.java
** [uses FakeInputStream]
** hasWritten -> initialized
** 6 -> static final int defaultCompressionLevel
** ctor: level -> this.level; l -> level (broken javadoc currently)
** "may cause a lower compression ratio": measured? by how much?
* LzmaNativeInputStream.java
** line lengths
** hasRead -> initialized
** tmp -> tmpBuf
** max -> maxBlockSize? ...?
** index -> ...?
** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
** what -> something better
** 1<<16, 1<<12 -> static final whatever ints, and use N*1024 instead
** allocateDirect() should be using config'd buffer size, I think (check zlib/similar impls for use of io.file.buffer.size)
** giveMore(): return val not used, but if it were, true/false seems backward (expect: if !giveMore() => nothing available)
** read(): point of extras just to support recursion? (extras, what: what??)
** "decompresed"
* LzmaNativeOutputStream.java
** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
** 1<<16, 1<<17 -> static final whatever int (and 64*1024, etc.)
** hasWritten -> initialized
** tmp -> tmpBuf
* LZMA/Encoder.java
** "their" -> "original SDK" or similar
** "Setup the encoder" -> "Set up the encoder"
** "Do one encoding of a block" -> "Encode one block" ?
* LzmaCompressor.c
** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
* LzmaDecompressor.c
** inPos vs. posInInBuf: ???
** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
** 1<<16, 1<<12, difference -> macros or static const ints
* Makefile.am
** zlib -> LZMA
** "All these are setup" -> "All these are set up"
** "are also done" -> "is also done"
* LzFind.c
** remove commented-out stuff
** (1<<15)+(1<<14) -> 48*1024 (etc.)
* LZMA/LzmaEnc.c
** tabs

> Support for LZMA compression
> ----------------------------
>
>           Key: HADOOP-6837
>           URL: https://issues.apache.org/jira/browse/HADOOP-6837
>       Project: Hadoop Common
>    Issue Type: Improvement
>    Components: io
>      Reporter: Nicholas Carlini
>      Assignee: Nicholas Carlini
>   Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.
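The ignored-return-value point in the review above is worth making concrete: InputStream.read(byte[], int, int) may return fewer bytes than requested, so any caller that needs an exact count must loop. This is only an illustrative sketch, not code from the patch; readFully is a hypothetical helper name.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadFullyExample {
  /**
   * Read exactly len bytes into buf[off..off+len), looping because a single
   * InputStream.read() call is allowed to return fewer bytes than asked for.
   * Throws EOFException if the stream ends first.
   */
  static void readFully(InputStream in, byte[] buf, int off, int len)
      throws IOException {
    while (len > 0) {
      int n = in.read(buf, off, len);  // may return 1..len bytes, or -1 at EOF
      if (n < 0) {
        throw new EOFException("stream ended with " + len + " bytes unread");
      }
      off += n;
      len -= n;
    }
  }

  public static void main(String[] args) throws IOException {
    // Demonstrate with a stream that returns at most 3 bytes per read() call,
    // simulating the short reads the review warns about.
    byte[] src = "hello, lzma".getBytes(StandardCharsets.US_ASCII);
    InputStream dribble = new ByteArrayInputStream(src) {
      @Override
      public int read(byte[] b, int off, int len) {
        return super.read(b, off, Math.min(len, 3));
      }
    };
    byte[] dst = new byte[src.length];
    readFully(dribble, dst, 0, dst.length);
    System.out.println(new String(dst, StandardCharsets.US_ASCII));
  }
}
```

A single read() on the dribbling stream would have returned only 3 bytes; the loop recovers all 11.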
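Likewise, the recursion-in-write() concern has a simple iterative shape. This is a hedged sketch of the general pattern, not the patch's actual code: MAX_BLOCK_SIZE stands in for the patch's 1<<16 literal, and plain out.write() stands in for compressing one block.

```java
import java.io.IOException;
import java.io.OutputStream;

public class ChunkedWriteExample extends OutputStream {
  // Named constant instead of a bare 1<<16, per the style notes above.
  static final int MAX_BLOCK_SIZE = 64 * 1024;

  private final OutputStream out;

  public ChunkedWriteExample(OutputStream out) { this.out = out; }

  @Override
  public void write(int b) throws IOException {
    write(new byte[] { (byte) b }, 0, 1);
  }

  /** Iterative chunking: no recursive call needed when len exceeds one block. */
  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    while (len > 0) {
      int chunk = Math.min(len, MAX_BLOCK_SIZE);
      out.write(b, off, chunk);  // stand-in for "compress and emit one block"
      off += chunk;
      len -= chunk;
    }
  }
}
```

The loop keeps a single stack frame regardless of len, whereas the recursive version grows the stack by one frame per block.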