[ 
https://issues.apache.org/jira/browse/HIVE-9495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Sridharan updated HIVE-9495:
----------------------------------
    Attachment: profiler_screenshot.PNG

Profiler screenshot showing GroupByOperator.processHashAggr as the hotspot.

> Map Side aggregation affecting map performance
> ----------------------------------------------
>
>                 Key: HIVE-9495
>                 URL: https://issues.apache.org/jira/browse/HIVE-9495
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.14.0
>         Environment: RHEL 6.4
> Hortonworks Hadoop 2.2
>            Reporter: Anand Sridharan
>         Attachments: profiler_screenshot.PNG
>
>
> When running a simple aggregation query with hive.map.aggr=true, map tasks
> take far longer in Hive 0.14 than they do with hive.map.aggr=false.
> For example, consider the query:
> INSERT OVERWRITE TABLE lineitem_tgt_agg
> SELECT alias.a0 as a0, alias.a2 as a1, alias.a1 as a2, alias.a3 as a3,
>        alias.a4 as a4
> FROM (
>   SELECT alias.a0 as a0, SUM(alias.a1) as a1, SUM(alias.a2) as a2,
>          SUM(alias.a3) as a3, SUM(alias.a4) as a4
>   FROM (
>     SELECT lineitem_sf500.l_orderkey as a0,
>            CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice
>                 * (1 - lineitem_sf500.l_discount)
>                 * (1 + lineitem_sf500.l_tax) AS DOUBLE) as a1,
>            lineitem_sf500.l_quantity as a2,
>            CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice
>                 * lineitem_sf500.l_discount AS DOUBLE) as a3,
>            CAST(lineitem_sf500.l_quantity * lineitem_sf500.l_extendedprice
>                 * lineitem_sf500.l_tax AS DOUBLE) as a4
>     FROM lineitem_sf500
>   ) alias
>   GROUP BY alias.a0
> ) alias;
> The above query was run on ~376 GB of data (~3 billion records) in the source.
> It takes ~10 minutes with hive.map.aggr=false.
> With map side aggregation set to true, the map tasks don't complete even 
> after an hour.
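
For reference, the comparison described in the report can be sketched as the session below. This is a minimal sketch, assuming a Hive CLI or Beeline session. The last two properties are standard Hive settings related to map-side hash aggregation that the report itself does not mention; they are shown only as values one might inspect while investigating, not as a confirmed cause or fix.

```sql
-- Baseline from the report: map-side aggregation disabled (~10 minutes).
SET hive.map.aggr=false;
-- ... run the INSERT OVERWRITE query quoted above ...

-- Slow case from the report: map-side hash aggregation enabled
-- (map tasks reportedly did not finish within an hour).
SET hive.map.aggr=true;
-- ... run the same query again ...

-- Assumption (not part of the report): print the current values of two
-- related tuning properties. "SET <name>;" with no value only displays it.
SET hive.map.aggr.hash.min.reduction;
SET hive.groupby.mapaggr.checkinterval;
```

Since the query groups by l_orderkey, a key with many distinct values, the hash table may achieve little reduction on the map side. That would be consistent with the profiler hotspot in GroupByOperator.processHashAggr, though the report does not confirm it.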



