[ https://issues.apache.org/jira/browse/PIG-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173740#comment-13173740 ]
Thejas M Nair commented on PIG-2397:
------------------------------------

bq. When the split sizes are comparable for TPC-H Q1, Hive's tasks finish in about 60 seconds on average, while Pig takes about 84 seconds. I believe this is due to the fact that Hive triggers in-mem aggregation and output based on memory utilization; we have a hardcoded MAX_SIZE_CURVAL_CACHE = 1024. In this particular case, that means Hive's tasks output 4 records (a single aggregation), while we output 28 (9 aggregations). If we make MAX_SIZE_CURVAL_CACHE configurable, or based on memory, we can probably improve performance for small records.

MAX_SIZE_CURVAL_CACHE limits the number of values held in memory for a particular group-key. Once that limit is hit or a new group-key is seen, it aggregates the values for that key and stores the result back in the hash-map. That does not trigger dumping to disk.

Are you saying that you got 28 output records from a single map, though there were only 4 unique group-keys? I expect only 4 output records from a single map, because the hashmap with 4 entries should easily fit in memory. If that is the case, I need to check why that might be happening.

> Running TPC-H on Pig
> --------------------
>
>                 Key: PIG-2397
>                 URL: https://issues.apache.org/jira/browse/PIG-2397
>             Project: Pig
>          Issue Type: Task
>            Reporter: Jie Li
>         Attachments: TPC-H_on_Pig.tgz, pig_tpch.ppt
>
>
> For a class project we developed a whole set of Pig scripts for TPC-H. Our goals are:
> 1) identifying the bottlenecks of Pig's performance especially of its relational operators,
> 2) studying how to write efficient scripts by making full use of Pig Latin's features,
> 3) comparing with Hive's TPC-H results for verifying both 1) and 2).
> We will update the JIRA with our scripts, results and analysis soon.
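For readers following the MAX_SIZE_CURVAL_CACHE discussion in the comment above, the behaviour it describes can be sketched roughly as below. This is only an illustration under stated assumptions, not Pig's actual code: the class InMapAggregator, its methods, the use of String keys, and SUM as the aggregate are all hypothetical. The reading that a new group-key compacts the previously seen key's buffer is also an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-key value caching with in-memory partial aggregation. */
final class InMapAggregator {
    // Plays the role of MAX_SIZE_CURVAL_CACHE: max raw values buffered per group key.
    private final int maxCachedValues;
    // Group key -> buffered values; after a compaction the list holds one partial sum.
    private final Map<String, List<Long>> cache = new LinkedHashMap<>();
    // Most recently seen group key (assumed trigger for compacting the previous key).
    private String lastKey;

    InMapAggregator(int maxCachedValues) {
        if (maxCachedValues < 1) {
            throw new IllegalArgumentException("maxCachedValues must be >= 1");
        }
        this.maxCachedValues = maxCachedValues;
    }

    /** Buffers one value; compacts when the key changes or the per-key limit is hit. */
    void add(String key, long value) {
        if (lastKey != null && !lastKey.equals(key)) {
            compact(lastKey);
        }
        lastKey = key;
        List<Long> values = cache.computeIfAbsent(key, k -> new ArrayList<>());
        values.add(value);
        if (values.size() >= maxCachedValues) {
            compact(key);
        }
    }

    /** Folds a key's buffered values into one partial result kept in the same map. */
    private void compact(String key) {
        List<Long> values = cache.get(key);
        if (values == null || values.size() < 2) {
            return; // nothing to fold
        }
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        values.clear();
        values.add(sum); // the partial aggregate stays in memory; nothing is written out
    }

    /** Emits exactly one record per distinct group key. */
    Map<String, Long> finish() {
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<Long>> e : cache.entrySet()) {
            compact(e.getKey());
            out.put(e.getKey(), e.getValue().get(0));
        }
        return out;
    }
}
```

In this sketch the limit only controls how often values are folded together in memory. The number of records returned by finish() equals the number of distinct group keys, which matches the expectation in the comment that a map with 4 unique group-keys emits 4 records as long as the hash map fits in memory.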