[ https://issues.apache.org/jira/browse/HADOOP-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664738#comment-15664738 ]

Jason Lowe commented on HADOOP-13578:
-------------------------------------

The problem with ignoring the Codec interface is that code using the 
Compressor/Decompressor interfaces will work with every other Hadoop-provided 
codec but break if it happens to get the zstd codec.  That's not really adding 
a full Codec for zstd.  I agree it's not the ideal interface for the way zstd 
wants to be called, but that doesn't change the fact that existing code that 
should work as-is on a Hadoop-provided codec will not if we don't implement 
the Compressor/Decompressor interfaces.

It looks like the zstd code already includes a wrapper that translates the zlib 
API into the zstd API (see 
https://github.com/facebook/zstd/tree/dev/zlibWrapper), which we could 
leverage, or at least use as a model, for the Compressor/Decompressor 
implementations.  To be clear, I think it's a good thing that the 
compressor/decompressor streams are using zstd more natively, but I also think 
it's important not to ignore the non-stream Codec APIs.  I'm also OK if the 
initial implementation of the zstd Compressor/Decompressor is not super 
optimal.  One tactic would be to always copy the input data to an internal 
input buffer and track the amount of data zstd wants.  Then the needsInput 
method is easy: just return true when the amount of data in the internal input 
buffer is less than what zstd last requested.  The decompress method would 
return 0 whenever needsInput would return true.  Not super efficient, since we 
make a copy of all the input data before passing it to zstd, but it should be 
relatively straightforward to implement.
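
To make the bookkeeping concrete, here is a rough sketch of that tactic 
against Hadoop's Decompressor interface.  This is illustration only, not the 
patch: zstdDecompressStream, lastConsumed, and lastToReadHint are hypothetical 
JNI hooks I'm inventing to stand in for the native ZSTD_decompressStream call; 
only the Java-side buffer tracking is the point.

{noformat}
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.compress.Decompressor;

/**
 * Sketch of the "copy everything into an internal input buffer" tactic.
 * The native methods at the bottom are hypothetical JNI hooks, not real
 * zstd binding names.
 */
public class BufferingZstdDecompressor implements Decompressor {
  private byte[] inBuf = new byte[64 * 1024];
  private int inLen = 0;        // bytes currently held in inBuf
  private long wanted = 1;      // zstd's last "toRead" hint; nonzero to start
  private boolean finished = false;

  @Override
  public synchronized void setInput(byte[] b, int off, int len) {
    // Always copy the caller's data into the internal input buffer.
    if (inLen + len > inBuf.length) {
      inBuf = Arrays.copyOf(inBuf, Math.max(2 * inBuf.length, inLen + len));
    }
    System.arraycopy(b, off, inBuf, inLen, len);
    inLen += len;
  }

  @Override
  public synchronized boolean needsInput() {
    // Easy with an internal buffer: ask for more whenever we hold less
    // than zstd requested after the last decompress call.
    return !finished && inLen < wanted;
  }

  @Override
  public synchronized int decompress(byte[] b, int off, int len)
      throws IOException {
    if (needsInput()) {
      return 0;                 // per the tactic above
    }
    // Hypothetical native call: consumes a prefix of inBuf, writes up to
    // len bytes into b at off, and returns the number of bytes produced.
    int produced = zstdDecompressStream(inBuf, inLen, b, off, len);
    int consumed = lastConsumed();   // hypothetical out-param getter
    System.arraycopy(inBuf, consumed, inBuf, 0, inLen - consumed);
    inLen -= consumed;
    wanted = lastToReadHint();       // 0 means the frame is complete
    finished = (wanted == 0);
    return produced;
  }

  @Override
  public synchronized boolean finished() {
    return finished;
  }

  @Override
  public synchronized int getRemaining() {
    return inLen;
  }

  @Override
  public synchronized void reset() {
    inLen = 0;
    wanted = 1;
    finished = false;
  }

  @Override
  public void end() {
    // release the native stream context here
  }

  @Override
  public void setDictionary(byte[] b, int off, int len) {
    throw new UnsupportedOperationException("dictionaries not supported");
  }

  @Override
  public boolean needsDictionary() {
    return false;
  }

  // Hypothetical JNI stubs (names are assumptions for this sketch).
  private native int zstdDecompressStream(byte[] src, int srcLen,
                                          byte[] dst, int dstOff, int dstLen);
  private native int lastConsumed();
  private native long lastToReadHint();
}
{noformat}

The Compressor side would follow the same pattern, just driving 
ZSTD_compressStream instead.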

I'm also not so sure about the minimum buffer assertion.  The zstd unit tests 
include a byte-by-byte streaming decompression test that decompresses data 
using a 1-byte input buffer and a 1-byte output buffer.  I also verified this 
independently by applying the following changes to the 
streaming_decompression.c example:
{noformat}
$ diff -u streaming_decompression.c.orig streaming_decompression.c
--- streaming_decompression.c.orig      2016-11-14 18:39:43.423069894 +0000
+++ streaming_decompression.c   2016-11-14 19:08:36.395286748 +0000
@@ -63,10 +63,10 @@
 static void decompressFile_orDie(const char* fname)
 {
     FILE* const fin  = fopen_orDie(fname, "rb");
-    size_t const buffInSize = ZSTD_DStreamInSize();
+    size_t const buffInSize = 1;
     void*  const buffIn  = malloc_orDie(buffInSize);
     FILE* const fout = stdout;
-    size_t const buffOutSize = ZSTD_DStreamOutSize();  /* Guarantee to successfully flush at least one complete compressed block in all circumstances. */
+    size_t const buffOutSize = 1;
     void*  const buffOut = malloc_orDie(buffOutSize);
 
     ZSTD_DStream* const dstream = ZSTD_createDStream();
@@ -80,6 +80,7 @@
         while (input.pos < input.size) {
             ZSTD_outBuffer output = { buffOut, buffOutSize, 0 };
             toRead = ZSTD_decompressStream(dstream, &output , &input);  /* toRead : size of next compressed block */
             if (ZSTD_isError(toRead)) { fprintf(stderr, "ZSTD_decompressStream() error : %s \n", ZSTD_getErrorName(toRead)); exit(12); }
+            toRead = (toRead > buffInSize) ? buffInSize : toRead;  /* clamp after the error check so error codes aren't masked */
             fwrite_orDie(buffOut, output.pos, fout);
         }
{noformat}

With those changes the streaming_decompression example was able to decompress 
zstd files even though it was using extremely small input and output buffers.  
So even though I'm not giving the zstd library a full block of input, it's 
able to compensate.

bq. The finished field is more to reinitialize the zstd stream than the actual 
java decompressor stream.

Ah, I see that now.  Thanks for the explanation.

bq.  So we could allow the user to configure and take the larger of the 2 
buffer sizes in case they configure it with a buffer size that is too small.

See the input/output buffer discussion above.  I don't think we need to apply 
safety rails to what the user specifies, because the zstd library apparently 
can accommodate whatever buffer sizes it is given.

> Add Codec for ZStandard Compression
> -----------------------------------
>
>                 Key: HADOOP-13578
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13578
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: churro morales
>            Assignee: churro morales
>         Attachments: HADOOP-13578.patch, HADOOP-13578.v1.patch, 
> HADOOP-13578.v2.patch
>
>
> ZStandard: https://github.com/facebook/zstd has been used in production at 
> Facebook for 6 months now.  v1.0 was recently released.  Create a codec for 
> this library.


