[ https://issues.apache.org/jira/browse/PIG-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870508#action_12870508 ]
Jeff Zhang commented on PIG-1426:
---------------------------------

I ran a simple experiment to compare performance. This is the Pig script I used:

{code}
a = load '/input';
b = foreach a generate $0,$1;
c = group b by $0 PARALLEL 2;
result = foreach c generate group,SUM(b.$1);
dump result;
{code}

And the following is the result:

|| ||Using Int||Using VInt||
|Mapper Output|3,288,892,896|2,688,892,896|
|Time cost for the pig script|12mins, 23sec|12mins, 1sec|

I haven't done a complete comparison with PigMix yet, but I believe this change will improve performance.

> Change the size of Tuple from Int to VInt when Serialize Tuple
> --------------------------------------------------------------
>
>                 Key: PIG-1426
>                 URL: https://issues.apache.org/jira/browse/PIG-1426
>             Project: Pig
>          Issue Type: Improvement
>          Components: data
>    Affects Versions: 0.8.0
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>             Fix For: 0.8.0
>
>         Attachments: PIG_1426.patch
>
>
> Most of the time, the size of a tuple is not very large, so one byte is enough
> to store it. I therefore suggest using VInt instead of Int for the size of the
> tuple during serialization. Because the key type of the map output is Tuple,
> this can reduce the amount of data transferred from mapper to reducer.
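To illustrate why a variable-length size field helps, here is a minimal, self-contained sketch that compares a fixed 4-byte int with a variable-length encoding. It is an assumption-laden illustration, not the code in PIG_1426.patch: the class name `VIntSketch` and its methods are hypothetical, and the variable-length layout is re-implemented here on the model of Hadoop's VInt format (values from -112 to 127 fit in a single byte) so the example does not need Hadoop on the classpath.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

// Hypothetical illustration class; not part of Pig or Hadoop.
class VIntSketch {

    // Fixed-width encoding: always 4 bytes, big-endian.
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Variable-length encoding modeled on Hadoop's VInt/VLong layout
    // (an assumption for this sketch, re-implemented to stay self-contained).
    static byte[] encodeVInt(long value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();

        // Small values are stored directly in a single byte.
        if (value >= -112 && value <= 127) {
            buf.write((int) value);
            return buf.toByteArray();
        }

        // Larger values: one marker byte encoding sign and length,
        // followed by the magnitude bytes, most significant first.
        int len = -112;
        if (value < 0) {
            value ^= -1L;   // one's complement, so the magnitude is non-negative
            len = -120;
        }
        for (long tmp = value; tmp != 0; tmp >>= 8) {
            len--;
        }
        buf.write(len);

        int dataBytes = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = dataBytes; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            buf.write((int) ((value >> shift) & 0xFF));
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // Typical tuple sizes are small; larger ones are shown for contrast.
        int[] tupleSizes = {2, 127, 128, 70000};
        for (int n : tupleSizes) {
            System.out.println("size " + n
                    + ": int=" + encodeInt(n).length + " bytes"
                    + ", vint=" + encodeVInt(n).length + " bytes");
        }
    }
}
```

With this layout a tuple of two fields, like the key in the script above, needs 1 byte for its size instead of 4, a saving of 3 bytes per serialized tuple. Sizes above 127 cost one marker byte plus the magnitude bytes, so very wide tuples save less or nothing.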