[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904551#action_12904551
 ] 

Jeff Zhang commented on PIG-794:
--------------------------------

I ran some experiments with Avro; AvroStorage_2.patch contains the detailed 
implementation.

Here I use Avro as the data storage between map-reduce jobs in place of 
InterStorage, which has itself been optimized compared to BinStorage. 
I use a simple Pig script that is translated into 2 map-reduce jobs:
{code}
a = load '/a.txt';
b = load '/b.txt';
c = join a by $0, b by $0;
d = group c by $0;
dump d;
{code}

The following table shows my experiment results (1 master + 3 slaves):
|| Storage || Time spent on job_1 || Output size of job_1 || Number of mapper tasks in job_2 || Time spent on job_2 || Total time spent on Pig script ||
| AvroStorage | 5 min 57 sec | 7.97G | 120 | 16 min 50 sec | 22 min 47 sec |
| InterStorage | 4 min 33 sec | 9.55G | 143 | 17 min 17 sec | 21 min 50 sec |

The experiment shows that AvroStorage has a more compact format than InterStorage 
(judging by the output size of job_1) but more serialization overhead (judging 
by the time spent on job_1). I think job_2 takes less time with AvroStorage than 
with InterStorage because its input (the output of job_1) is much smaller with 
AvroStorage, so it needs fewer mapper tasks.
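
As a rough sanity check of this reasoning, the figures in the table can be compared directly. The sketch below is plain Python arithmetic on those numbers only; the helper names (to_seconds, per_mapper_input_g) are made up for illustration and are not part of the patch. It assumes nothing beyond the table: it computes the size and mapper-count ratios, the input size per mapper of job_2, and the time differences per job.

```python
# Figures copied from the experiment table; helper names are illustrative only.

def to_seconds(minutes, seconds):
    """Convert a 'min sec' duration from the table into seconds."""
    return minutes * 60 + seconds

runs = {
    "AvroStorage": {
        "job1_s": to_seconds(5, 57),
        "job1_out_g": 7.97,
        "job2_mappers": 120,
        "job2_s": to_seconds(16, 50),
    },
    "InterStorage": {
        "job1_s": to_seconds(4, 33),
        "job1_out_g": 9.55,
        "job2_mappers": 143,
        "job2_s": to_seconds(17, 17),
    },
}

def per_mapper_input_g(run):
    """Average job_2 input per mapper task, in the table's size unit (G)."""
    return run["job1_out_g"] / run["job2_mappers"]

avro = runs["AvroStorage"]
inter = runs["InterStorage"]

# How much smaller the Avro output is, and how many fewer mappers job_2 uses.
size_ratio = avro["job1_out_g"] / inter["job1_out_g"]
mapper_ratio = avro["job2_mappers"] / inter["job2_mappers"]

# Positive delta means AvroStorage was slower than InterStorage.
job1_delta_s = avro["job1_s"] - inter["job1_s"]
job2_delta_s = avro["job2_s"] - inter["job2_s"]
total_delta_s = job1_delta_s + job2_delta_s

if __name__ == "__main__":
    print(f"output size ratio (Avro/Inter):  {size_ratio:.3f}")
    print(f"mapper count ratio (Avro/Inter): {mapper_ratio:.3f}")
    for name, run in runs.items():
        print(f"{name}: {per_mapper_input_g(run):.4f} G per mapper")
    print(f"job_1 delta: {job1_delta_s} s, job_2 delta: {job2_delta_s} s, "
          f"total delta: {total_delta_s} s")
```

The two ratios come out nearly equal and the input per mapper is almost the same for both formats, which is consistent with the mapper count of job_2 simply following the input size. The time saved in job_2 is smaller than the extra time spent in job_1, which matches the totals in the table.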

Overall, AvroStorage is not as good as expected.
One possible reason is that I am not using Avro's API correctly (I hope the 
Avro folks can review my code); another is that Avro's serialization 
performance is simply not that good.
BTW, I used Avro trunk.


> Use Avro serialization in Pig
> -----------------------------
>
>                 Key: PIG-794
>                 URL: https://issues.apache.org/jira/browse/PIG-794
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.2.0
>            Reporter: Rakesh Setty
>            Assignee: Dmitriy V. Ryaboy
>         Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
> AvroStorage_2.patch, jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs 
> instead of the current BinStorage. Attached is an implementation of 
> AvroBinStorage which performs significantly better compared to BinStorage on 
> our benchmarks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
