[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904551#action_12904551 ]
Jeff Zhang commented on PIG-794:
--------------------------------

I ran an experiment with Avro; AvroStorage_2.patch contains the detailed implementation. Here I use Avro as the data storage between map-reduce jobs in place of InterStorage, which is itself already optimized compared to BinStorage. I used a simple Pig script that is translated into 2 mapred jobs:

{code}
a = load '/a.txt';
b = load '/b.txt';
c = join a by $0, b by $0;
d = group c by $0;
dump d;
{code}

The following table shows my experiment results (1 master + 3 slaves):

|| Storage || Time spent on job_1 || Output size of job_1 || Number of mapper tasks in job_2 || Time spent on job_2 || Total time spent on Pig script ||
| AvroStorage | 5 min 57 sec | 7.97G | 120 | 16 min 50 sec | 22 min 47 sec |
| InterStorage | 4 min 33 sec | 9.55G | 143 | 17 min 17 sec | 21 min 50 sec |

The experiment shows that AvroStorage has a more compact format than InterStorage (according to the output size of job_1), but has more serialization overhead (according to the time spent on job_1). I think job_2 takes less time with AvroStorage than with InterStorage because its input (the output of job_1) is smaller with AvroStorage (7.97G versus 9.55G), so it needs fewer mapper tasks.

Overall, AvroStorage is not as good as expected. One possible reason is that I am not using Avro's API correctly (I hope the Avro guys can review my code); another is that Avro's serialization performance may simply not be that good. BTW, I used Avro trunk.

> Use Avro serialization in Pig
> -----------------------------
>
> Key: PIG-794
> URL: https://issues.apache.org/jira/browse/PIG-794
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.2.0
> Reporter: Rakesh Setty
> Assignee: Dmitriy V. Ryaboy
> Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, jackson-asl-0.9.4.jar, PIG-794.patch
>
>
> We would like to use Avro serialization in Pig to pass data between MR jobs
> instead of the current BinStorage. Attached is an implementation of
> AvroBinStorage which performs significantly better compared to BinStorage on
> our benchmarks.
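
For reference, the relative differences between the two rows of the results table in the comment can be worked out with a short script. This is only a sketch in Python, not part of any attached patch; the dictionary keys and the helper names `to_seconds` and `pct_change` are made up for illustration, and the numbers are copied from the table as posted.

```python
# Figures copied from the results table in the comment (1 master + 3 slaves).
# All names here are illustrative; nothing below comes from the patch itself.

def to_seconds(minutes, seconds):
    """Convert a 'N min M sec' duration into seconds."""
    return minutes * 60 + seconds

runs = {
    "AvroStorage": {
        "job1_s": to_seconds(5, 57),
        "job1_out_gb": 7.97,
        "job2_mappers": 120,
        "job2_s": to_seconds(16, 50),
    },
    "InterStorage": {
        "job1_s": to_seconds(4, 33),
        "job1_out_gb": 9.55,
        "job2_mappers": 143,
        "job2_s": to_seconds(17, 17),
    },
}

def pct_change(new, base):
    """Signed percentage difference of `new` relative to `base`."""
    return (new - base) / base * 100.0

avro, inter = runs["AvroStorage"], runs["InterStorage"]

# Each entry is AvroStorage relative to InterStorage (negative = smaller/faster).
comparison = {
    "job_1 time": pct_change(avro["job1_s"], inter["job1_s"]),
    "job_1 output size": pct_change(avro["job1_out_gb"], inter["job1_out_gb"]),
    "job_2 mapper tasks": pct_change(avro["job2_mappers"], inter["job2_mappers"]),
    "job_2 time": pct_change(avro["job2_s"], inter["job2_s"]),
    "total time": pct_change(
        avro["job1_s"] + avro["job2_s"], inter["job1_s"] + inter["job2_s"]
    ),
}

for name, pct in comparison.items():
    print(f"{name}: {pct:+.1f}% (AvroStorage relative to InterStorage)")
```

On these figures, AvroStorage writes about 17% less data in job_1 and job_2 runs about 16% fewer mapper tasks, but job_1 itself takes about 31% longer, so the script as a whole ends up roughly 4% slower. This is a single run on one cluster, so the percentages are indicative only.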