[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904555#action_12904555
 ] 

Jeff Zhang commented on PIG-794:
--------------------------------

Besides the above experiment, I also ran an experiment comparing 
AvroRecordWriter and InterRecordWriter in a local environment; see the 
attached file AvroTest.java.
I wrote 50,000,000 records with each of the two RecordWriters: 
AvroRecordWriter took 70 seconds, while InterRecordWriter took 29 seconds. 

InterRecordWriter performs much better than AvroRecordWriter, about 2.4 
times faster in this test. Internally, AvroRecordWriter uses DataFileWriter 
(avro) and InterRecordWriter uses FSDataOutputStream (inter), and both use 
BufferedOutputStream as a buffer layer. The difference is that 
DataFileWriter (avro) has an additional buffer layer: it first writes 
contents to an in-memory block, and then writes that block to the 
BufferedOutputStream when the block is full. I am not sure whether this 
extra layer accounts for the overhead.
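
To check whether an extra in-memory block can matter on its own, the two 
write paths can be compared in isolation. The following is only a rough 
sketch, not the code in AvroTest.java and not how DataFileWriter is 
actually implemented: the class name BufferLayerSketch, the methods 
writeDirect and writeBlocked, and the record and block sizes are all 
hypothetical, and it leaves out everything else Avro does per block 
(encoding, sync markers, optional compression).

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical micro-benchmark: one buffer layer versus two.
class BufferLayerSketch {

    // One layer: each record goes straight into a BufferedOutputStream.
    static void writeDirect(OutputStream sink, byte[] record, int count)
            throws IOException {
        try (OutputStream out = new BufferedOutputStream(sink)) {
            for (int i = 0; i < count; i++) {
                out.write(record);
            }
        }
    }

    // Two layers: records are first copied into an in-memory block, and the
    // block is drained into the BufferedOutputStream once it is full.
    static void writeBlocked(OutputStream sink, byte[] record, int count,
                             int blockSize) throws IOException {
        ByteArrayOutputStream block = new ByteArrayOutputStream(blockSize);
        try (OutputStream out = new BufferedOutputStream(sink)) {
            for (int i = 0; i < count; i++) {
                block.write(record, 0, record.length);
                if (block.size() >= blockSize) {
                    block.writeTo(out);
                    block.reset();
                }
            }
            block.writeTo(out); // drain the last, partially filled block
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] record = new byte[64];   // assumed record size
        int count = 100_000;            // assumed record count
        int blockSize = 16 * 1024;      // assumed block size

        long t0 = System.nanoTime();
        writeDirect(new ByteArrayOutputStream(), record, count);
        long t1 = System.nanoTime();
        writeBlocked(new ByteArrayOutputStream(), record, count, blockSize);
        long t2 = System.nanoTime();

        System.out.println("direct  ms: " + (t1 - t0) / 1_000_000);
        System.out.println("blocked ms: " + (t2 - t1) / 1_000_000);
    }
}
```

Both methods produce the same bytes, so any timing gap comes only from the 
extra copy. A single run like this is noisy (JIT warm-up, GC), so it would 
need repeated runs before drawing any conclusion.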




> Use Avro serialization in Pig
> -----------------------------
>
>                 Key: PIG-794
>                 URL: https://issues.apache.org/jira/browse/PIG-794
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Rakesh Setty
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
> AvroStorage_2.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs 
> instead of the current BinStorage. Attached is an implementation of 
> AvroBinStorage which performs significantly better compared to BinStorage on 
> our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
