[ https://issues.apache.org/jira/browse/HADOOP-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894246#action_12894246 ]
Greg Roelofs commented on HADOOP-6837:
--------------------------------------

First, apologies for having steered Nicholas wrong on the liblzma issue. As Hong noted, it provides a much saner (that is, zlib-like) API for this sort of thing, but I mistakenly thought it shared the GPL license of (parts of) xz, so we ignored it and he worked on the LZMA SDK code instead. (The latter did include a Java port, however; liblzma does not.)

Overall, the 20100722 patch looks pretty decent (given the starting material), but it does include some less-than-beautiful workarounds to cope with the impedance mismatch between push- and pull-style I/O models. In light of the fact that liblzma is, in fact, public-domain software (every file under xz-4.999.9beta-143-g3e49/src/liblzma is either explicitly in the public domain or has been automatically generated by such a file), I'm going to ask that Nicholas redo the native-code version to use liblzma rather than the SDK. (Unfortunately, it looks like the transformation from C SDK to liblzma was a significant amount of work, so it doesn't appear that a trivial "liblzma-ification" of the Java SDK code is likely. If Nicholas concurs with that assessment, we can instead file a separate JIRA to port the liblzma C code to Java.) Note that liblzma includes an LZMA2 codec, so Scott Carey's splittable-codec suggestion is within reach, too.

OK, enough preamble. There were a number of relatively minor style issues, which I'll simply list below, but the five main concerns were:

- FakeInputStream.java, FakeOutputStream.java: the linked lists of byte arrays are tough to swallow, even given the push/pull problem, even given our previous discussions on the matter. It would be good to know what the stats are on these things in "typical" cases--how frequently does overflow occur in LzmaInputStream, for example, and how many buffers are used?
- Is the Code(..., len) call in LzmaInputStream guaranteed to produce len bytes if it returns true? The calling read() function assumes this, but it's not at all obvious to me; the parameter is outSize in Code(), and I don't see that it's decremented or even really used at all (other than being stored in oldOutSize), unless it's buried inside macros defined elsewhere.

The next two (or perhaps three) are no longer directly relevant, but they're general things to watch out for:

- The return value from inStream.read() in LzmaNativeInputStream.java is ignored even though there's no guarantee the call will return the requested number of bytes. A comment ("never have to call ... again") reiterates this error.
- There's no need for recursion in LzmaNativeOutputStream.java's write() method; iterative solutions tend to be far cleaner, I think. (Probably ditto for LzmaNativeInputStream.java's read() method.)
- LzmaCompressor.c has a pair of memleaks (state->outStream, state->inStream).

Here are the minor readability/maintainability/cosmetic/style issues:

* add LZMA SDK version (apparently 9.12) and perhaps its release date to the boilerplate
* tabs/formatting of LZMA SDK code (CRC.java, BinTree, etc.): I _think_ tabs are frowned upon in Hadoop, though I might be wrong; at any rate, they seem to be rarely used
** for easy Hadoop-style formatting, "indent -i2 -br -l80" is a start (though it's sometimes confused by Java/C++ constructs)
* reuse existing CRC implementation(s)? (JDK has one, Hadoop has another)
* prefer lowercase "lzma" for subdirs
* use uppercase "LZMA" when referring to codec/algorithm (e.g., comments)
* add README mentioning anything special/weird/etc. (e.g., weird hashtable issue); summary of changes made for Hadoop; potential Java/C diffs; binary compatibility between various output formats (other than trivial headers/footers); LZMA2 (splittable, not yet implemented); liblzma (much cleaner, more zlib-like implementation, still PD); etc.
* "ant javadoc" run yet? (apparently not recently)
* line lengths, particularly comments (LzmaInputStream.java, etc.): should be no more than 80 columns in general (Hadoop guidelines)
* avoid generic variable names for globals and class members; use existing conventions where possible (e.g., look at gzip/zlib and bzip2 code)
* LzmaCodec.java:
** uppercase "LZMA" when referring to codec/algorithm in general
** "funcionality" x 4
** "throws ... {" continuation line: don't further indent
* LZ/InWindow.java
** leftover debug code at end
* RangeCoder/Encoder.java
** spurious blank line (else just boilerplate)
* FakeOutputStream.java:
** "stuffeed"
** "ammount"
** isOverflow() -> didOverflow()
* LzmaInputStream.java:
** [uses FakeOutputStream]
** "bufferd"
** "we 've"
** index -> overflowIndex (or similar): too generic
** hasRead -> initialized
** doInit(): use LzmaCodec getDefaultExtension() (or simply change text to "LZMA") instead of hardcoded ".lzma"
*** if higher-level code extended to support .xz, would this code even know?
** read(): do doInit() after args-checking/throws
** 3rd read() has bogus @return comment
** overflow approach: data on how frequent, how many buffers?
*** no reuse: creating new FakeOutputStream on every read()
* LzmaOutputStream.java
** [uses FakeInputStream]
** hasWritten -> initialized
** 6 -> static final int defaultCompressionLevel
** ctor: level -> this.level; l -> level (broken javadoc currently)
** "may cause a lower compression ratio": measured? by how much?
* LzmaNativeInputStream.java
** line lengths
** hasRead -> initialized
** tmp -> tmpBuf
** max -> maxBlockSize? ...?
** index -> ...?
** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
** what -> something better
** 1<<16, 1<<12 -> static final whatever ints, and use N*1024 instead
** allocateDirect() should be using config'd buffer size, I think (check zlib/similar impls for use of io.file.buffer.size)
** giveMore(): return val not used, but if it were, true/false seems backward (expect: if !giveMore() => nothing available)
** read(): point of extras just to support recursion? (extras, what: what??)
** "decompresed"
* LzmaNativeOutputStream.java
** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
** 1<<16, 1<<17 -> static final whatever int (and 64*1024, etc.)
** hasWritten -> initialized
** tmp -> tmpBuf
* LZMA/Encoder.java
** "their" -> "original SDK" or similar
** "Setup the encoder" -> "Set up the encoder"
** "Do one encoding of a block" -> "Encode one block" ?
* LzmaCompressor.c
** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
* LzmaDecompressor.c
** inPos vs. posInInBuf: ???
** buffered, sendData -> compressedDirectBuf, uncompressedDirectBuf (~zlib)
** 1<<16, 1<<12, difference -> macros or static const ints
* Makefile.am
** zlib -> LZMA
** "All these are setup" -> "All these are set up"
** "are also done" -> "is also done"
* LzFind.c
** remove commented-out stuff
** (1<<15)+(1<<14) -> 48*1024 (etc.)
* LZMA/LzmaEnc.c
** tabs

> Support for LZMA compression
> ----------------------------
>
>           Key: HADOOP-6837
>           URL: https://issues.apache.org/jira/browse/HADOOP-6837
>       Project: Hadoop Common
>    Issue Type: Improvement
>    Components: io
>      Reporter: Nicholas Carlini
>      Assignee: Nicholas Carlini
>   Attachments: HADOOP-6837-lzma-1-20100722.non-trivial.pseudo-patch, HADOOP-6837-lzma-1-20100722.patch, HADOOP-6837-lzma-c-20100719.patch, HADOOP-6837-lzma-java-20100623.patch
>
>
> Add support for LZMA (http://www.7-zip.org/sdk.html) compression, which generally achieves higher compression ratios than both gzip and bzip2.
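The ignored-return-value point in the review above is worth making concrete: InputStream.read(byte[], int, int) may return fewer bytes than requested, so any caller that needs an exact count must loop. This is only an illustrative sketch, not code from the patch; readFully is a hypothetical helper name.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadFullyExample {
  /**
   * Read exactly len bytes into buf[off..off+len), looping because a single
   * InputStream.read() call is allowed to return fewer bytes than asked for.
   * Throws EOFException if the stream ends first.
   */
  static void readFully(InputStream in, byte[] buf, int off, int len)
      throws IOException {
    while (len > 0) {
      int n = in.read(buf, off, len);  // may return 1..len bytes, or -1 at EOF
      if (n < 0) {
        throw new EOFException("stream ended with " + len + " bytes unread");
      }
      off += n;
      len -= n;
    }
  }

  public static void main(String[] args) throws IOException {
    // Demonstrate with a stream that returns at most 3 bytes per read() call,
    // simulating the short reads the review warns about.
    byte[] src = "hello, lzma".getBytes(StandardCharsets.US_ASCII);
    InputStream dribble = new ByteArrayInputStream(src) {
      @Override
      public int read(byte[] b, int off, int len) {
        return super.read(b, off, Math.min(len, 3));
      }
    };
    byte[] dst = new byte[src.length];
    readFully(dribble, dst, 0, dst.length);
    System.out.println(new String(dst, StandardCharsets.US_ASCII));
  }
}
```

A single read() on the dribbling stream would have returned only 3 bytes; the loop recovers all 11.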
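Likewise, the recursion-in-write() concern has a simple iterative shape. This is a hedged sketch of the general pattern, not the patch's actual code: MAX_BLOCK_SIZE stands in for the patch's 1<<16 literal, and plain out.write() stands in for compressing one block.

```java
import java.io.IOException;
import java.io.OutputStream;

public class ChunkedWriteExample extends OutputStream {
  // Named constant instead of a bare 1<<16, per the style notes above.
  static final int MAX_BLOCK_SIZE = 64 * 1024;

  private final OutputStream out;

  public ChunkedWriteExample(OutputStream out) { this.out = out; }

  @Override
  public void write(int b) throws IOException {
    write(new byte[] { (byte) b }, 0, 1);
  }

  /** Iterative chunking: no recursive call needed when len exceeds one block. */
  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    while (len > 0) {
      int chunk = Math.min(len, MAX_BLOCK_SIZE);
      out.write(b, off, chunk);  // stand-in for "compress and emit one block"
      off += chunk;
      len -= chunk;
    }
  }
}
```

The loop keeps a single stack frame regardless of len, whereas the recursive version grows the stack by one frame per block.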