[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904683#action_12904683
 ] 

Scott Carey commented on PIG-794:
---------------------------------

bq.  InterRecordWriter performs much better than AvroRecordWriter. Internally 
they use DataFileWriter (avro) and FSDataOutputStream (inter) respectively, and 
both use BufferedOutputStream as one buffer layer. The difference is that 
DataFileWriter (avro) has another buffer layer: it first writes contents to an 
in-memory block and then writes that block to BufferedOutputStream when the 
block is full. Not sure whether this layer adds overhead.

I've tested this a bit before, and the extra block copy is only minor overhead. 
The problem is how the BufferedOutputStream is used. We have not yet fully 
optimized the write side of Avro; there are still enhancements to the 
serialization process that can be done.
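
To make the two buffer layers under discussion concrete, here is a minimal 
sketch in Java. It is not Avro's DataFileWriter code: the class name 
BlockBufferedWriter, the blocksFlushed counter, and the block size are all 
hypothetical and exist only for illustration. Records accumulate in an 
in-memory block, and a full block is copied into a BufferedOutputStream that 
wraps the real sink.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical illustration of block-then-stream buffering; not Avro code. */
class BlockBufferedWriter implements AutoCloseable {
    private final OutputStream out;             // second layer: stream buffer
    private final ByteArrayOutputStream block;  // first layer: in-memory block
    private final int blockSize;                // flush threshold in bytes (assumed)
    private int blocksFlushed = 0;

    BlockBufferedWriter(OutputStream sink, int blockSize) {
        this.out = new BufferedOutputStream(sink);
        this.block = new ByteArrayOutputStream(blockSize);
        this.blockSize = blockSize;
    }

    /** Appends one serialized record to the in-memory block. */
    void append(byte[] record) throws IOException {
        block.write(record, 0, record.length);
        if (block.size() >= blockSize) {
            flushBlock();
        }
    }

    /** Copies the whole block into the buffered stream (the extra copy). */
    private void flushBlock() throws IOException {
        if (block.size() == 0) {
            return;
        }
        block.writeTo(out);
        block.reset();
        blocksFlushed++;
    }

    int blocksFlushed() {
        return blocksFlushed;
    }

    /** Flushes any partial block, then flushes and closes the stream. */
    @Override
    public void close() throws IOException {
        flushBlock();
        out.close();
    }
}
```

In this sketch each record is copied once into the block and once more into 
the stream buffer. That second copy is the "extra block copy" referred to 
above. The sketch does not model how the BufferedOutputStream is actually used 
on the Avro write path.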

> Use Avro serialization in Pig
> -----------------------------
>
>                 Key: PIG-794
>                 URL: https://issues.apache.org/jira/browse/PIG-794
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Rakesh Setty
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
> AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
> jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs 
> instead of the current BinStorage. Attached is an implementation of 
> AvroBinStorage which performs significantly better compared to BinStorage on 
> our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
