[ 
https://issues.apache.org/jira/browse/PIG-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173740#comment-13173740
 ] 

Thejas M Nair commented on PIG-2397:
------------------------------------

bq. When the split sizes are comparable for TPC-H Q1, Hive's tasks finish in 
about 60 seconds on average, while Pig takes about 84 seconds. I believe this 
is due to the fact that Hive triggers in-mem aggregation and output based on 
memory utilization; we have a hardcoded MAX_SIZE_CURVAL_CACHE = 1024. In this 
particular case, that means Hive's tasks output 4 records (a single 
aggregation), while we output 28 (9 aggregations). If we make 
MAX_SIZE_CURVAL_CACHE configurable, or based on memory, we can probably improve 
performance for small records.

MAX_SIZE_CURVAL_CACHE limits the number of values held in memory for a 
particular group-key. Once that limit is hit, or a new group-key is seen, the 
values for that key are aggregated and the result is stored back in the 
hash-map. That does not trigger dumping to disk. Are you saying that you got 28 
output records from a single map, even though there were only 4 unique 
group-keys? I would expect only 4 output records from a single map, because a 
hash-map with 4 entries should easily fit in memory. If you did get 28 records, 
I need to check why that might be happening. 
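
To make the behavior described above concrete, here is a minimal Python sketch 
of a per-key value cache that is folded in place once it reaches a size limit. 
It is an illustration only, not Pig's actual implementation: the class name 
PartialAggMap, its methods, and the compactions counter are hypothetical, and 
only the size-limit trigger is modeled (the "new group-key is seen" trigger is 
left out). The point it shows is that hitting the limit shrinks the list of 
cached values for a key but never adds an output record, so the number of 
records emitted equals the number of distinct group-keys.

```python
from collections import defaultdict

# Value quoted in the comment above; everything else here is hypothetical.
MAX_SIZE_CURVAL_CACHE = 1024


class PartialAggMap:
    """Toy model of in-map partial aggregation (not Pig's code)."""

    def __init__(self, combine, max_cached=MAX_SIZE_CURVAL_CACHE):
        self.combine = combine          # folds a list of values into one value
        self.max_cached = max_cached    # per-key limit on cached values
        self.cache = defaultdict(list)  # group-key -> cached values
        self.compactions = 0            # how often the limit was hit

    def add(self, key, value):
        values = self.cache[key]
        values.append(value)
        if len(values) >= self.max_cached:
            # Limit hit: aggregate this key's values and store the single
            # partial result back in the map. Nothing is emitted or spilled.
            self.cache[key] = [self.combine(values)]
            self.compactions += 1

    def flush(self):
        # At the end of the map task, emit exactly one record per group-key.
        return {key: self.combine(values) for key, values in self.cache.items()}


if __name__ == "__main__":
    agg = PartialAggMap(sum, max_cached=4)
    for i in range(28):
        agg.add(i % 4, 1)               # 28 input values over 4 group-keys
    print(len(agg.flush()), agg.compactions)
```

In this model the limit only affects how often partial aggregation runs, so a 
map task with 4 unique group-keys would still emit 4 records regardless of the 
limit.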


                
> Running TPC-H on Pig
> --------------------
>
>                 Key: PIG-2397
>                 URL: https://issues.apache.org/jira/browse/PIG-2397
>             Project: Pig
>          Issue Type: Task
>            Reporter: Jie Li
>         Attachments: TPC-H_on_Pig.tgz, pig_tpch.ppt
>
>
> For a class project we developed a whole set of Pig scripts for TPC-H. Our 
> goals are:
> 1) identifying the bottlenecks of Pig's performance especially of its 
> relational operators,
> 2) studying how to write efficient scripts by making full use of Pig Latin's 
> features,
> 3) comparing with Hive's TPC-H results for verifying both 1) and 2).
> We will update the JIRA with our scripts, results and analysis soon.


        
