[ https://issues.apache.org/jira/browse/FLUME-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230252#comment-13230252 ]

Thomas Andrews commented on FLUME-983:
--------------------------------------

If this is true, it would be great if this implicit contract were documented in 
the Javadocs for the Flume output format interface, or for the 
AbstractOutputFormat class.
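As a sketch only (hypothetical wording, not the actual Flume Javadoc, and the 
interface relationship is assumed), such a note on AbstractOutputFormat might 
read something like:

    /**
     * (Hypothetical wording, for illustration only.)
     *
     * format() is invoked once per event.  Implementations should not flush
     * or sync the underlying writer after every record: for block-compressed
     * container formats (e.g. Avro data files with the snappy codec), a
     * per-record flush closes the current block, so each record is compressed
     * on its own and the codec gains almost nothing.
     */
    public abstract class AbstractOutputFormat /* implements the Flume output format interface */ {
        // ...
    }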
                
> snappy compression via AvroDataFileOutputFormat suboptimal
> ----------------------------------------------------------
>
>                 Key: FLUME-983
>                 URL: https://issues.apache.org/jira/browse/FLUME-983
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v0.9.4
>         Environment: Cloudera CDH3u2 flume + hadoop
>            Reporter: Steve Hoffman
>            Priority: Critical
>
> I used the AvroDataFileOutputFormat with the snappy compression option to 
> write compressed avro files to HDFS via flume.
> The original file was 106,514,936 bytes of json.  The output is written to 
> HDFS as raw (no flume wrapper).
> The file size I got using the snappy compression option was 47,520,735 
> bytes, which is only about half the original size.  Looking at the file 
> directly, it didn't look like it had been compressed much.
> So I used avro-tools tojson to convert my final flume-written output back to 
> json, which resulted in a file size of 79,773,371 bytes (so this is basically 
> the starting size of the data being compressed).  Then I used avro-tools 
> fromjson, giving it the same schema that getschema returned, and the snappy 
> compression option.  The resulting file was 11,904,857 bytes (which seemed 
> much better).
> So I asked myself why the data written via Flume record by record wasn't 
> compressed as much.  Looking at the raw file written to HDFS clearly showed 
> 'snappy' in the header, yet the data looked minimally encoded/compressed.
> I looked at the source and was struck by a call to sink.flush() after the 
> sink.append() in AvroDataFileOutputFormat.format().
> It appears that this flush() call was the root cause of the not-so-great 
> compression.
> To test this theory, I recompiled the sink with the flush() line commented 
> out.  For the same sample data, the resulting test wrote a file of 
> 11,870,573 bytes (pretty much matching the command-line avro-tools created 
> version).
> I'm filing this because I think this may be a bug that wastes a lot of space 
> for users trying to use snappy compression (or any compression, for that 
> matter).  I'm also not really sure what the impact of removing this flush() 
> call is (since the file doesn't really exist in HDFS until it is closed).
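For anyone who wants to see the effect outside of Flume, here is a minimal 
standalone sketch against Avro's Java DataFileWriter API (the SnappyFlushDemo 
class, its toy schema, record count and payload are made up for illustration, 
not taken from the report): flushing after every append() closes the current 
data block, so snappy compresses each record on its own and gains almost 
nothing, while flushing only at close lets whole blocks be compressed.

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class SnappyFlushDemo {

        // Toy schema for illustration; any record schema shows the same effect.
        private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\","
            + "\"fields\":[{\"name\":\"body\",\"type\":\"string\"}]}");

        public static void main(String[] args) throws IOException {
            // Mimics a format() implementation that flushes after each append().
            write(new File("flush-per-record.avro"), true);
            // Lets Avro fill whole blocks before handing them to the codec.
            write(new File("flush-at-close.avro"), false);
        }

        private static void write(File out, boolean flushEachRecord) throws IOException {
            DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(SCHEMA));
            writer.setCodec(CodecFactory.snappyCodec());
            writer.create(SCHEMA, out);
            for (int i = 0; i < 100000; i++) {
                GenericRecord r = new GenericData.Record(SCHEMA);
                r.put("body", "{\"id\":" + i + ",\"msg\":\"some repetitive json payload\"}");
                writer.append(r);
                if (flushEachRecord) {
                    // Forces the current block out: one tiny snappy block per record.
                    writer.flush();
                }
            }
            writer.close();
            System.out.println(out + ": " + out.length() + " bytes");
        }
    }

On repetitive JSON like this, the flush-per-record file should come out several 
times larger than the flush-at-close one, which is consistent with the roughly 
47 MB vs. 12 MB numbers in the report.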


        
