GitHub user jeanlyn opened a pull request:

    https://github.com/apache/spark/pull/6426

    [SPARK-7885][SQL]add config to control map aggregation in spark sql

    [SPARK-7885](https://issues.apache.org/jira/browse/SPARK-7885): this adds
    the config `spark.sql.partialAggregation.enable`, which is true by default.
    Setting it to false disables map-side aggregation, which avoids GC problems.
    For example, we ran the following SQL:
    ```sql
    insert overwrite table groupbytest
    select sale_ord_id as order_id,
          coalesce(sum(sku_offer_amount),0.0) as sku_offer_amount,
          coalesce(sum(suit_offer_amount),0.0) as suit_offer_amount,
          coalesce(sum(flash_gp_offer_amount),0.0) + coalesce(sum(gp_offer_amount),0.0) as gp_offer_amount,
          coalesce(sum(flash_gp_offer_amount),0.0) as flash_gp_offer_amount,
          coalesce(sum(full_minus_offer_amount),0.0) as full_rebate_offer_amount,
          0.0 as telecom_point_offer_amount,
          coalesce(sum(coupon_pay_amount),0.0) as dq_and_jq_pay_amount,
          coalesce(sum(jq_pay_amount),0.0) + coalesce(sum(pop_shop_jq_pay_amount),0.0) + coalesce(sum(lim_cate_jq_pay_amount),0.0) as jq_pay_amount,
          coalesce(sum(dq_pay_amount),0.0) + coalesce(sum(pop_shop_dq_pay_amount),0.0) + coalesce(sum(lim_cate_dq_pay_amount),0.0) as dq_pay_amount,
          coalesce(sum(gift_cps_pay_amount),0.0) as gift_cps_pay_amount ,
          coalesce(sum(mobile_red_packet_pay_amount),0.0) as mobile_red_packet_pay_amount,
          coalesce(sum(acct_bal_pay_amount),0.0) as acct_bal_pay_amount,
          coalesce(sum(jbean_pay_amount),0.0) as jbean_pay_amount,
          coalesce(sum(sku_rebate_amount),0.0) as sku_rebate_amount,
          coalesce(sum(yixun_point_pay_amount),0.0) as yixun_point_pay_amount,
          coalesce(sum(sku_freight_coupon_amount),0.0) as freight_coupon_amount
    from        ord_at_det_di
    where       ds = '2015-05-20'
    group  by   sale_ord_id
    ```
    Using 6 executors, each with 8 GB of memory and 2 CPU cores, we hit GC
    problems during the map-side aggregation, and eventually the executor
    crashed:
    
![5869030a-d924-4249-9e1d-c637caa9363a](https://cloud.githubusercontent.com/assets/3426093/7828153/4afdaf88-0462-11e5-8af0-3bff04edab92.png)
 
    
    When we set `spark.sql.partialAggregation.enable` to false, the SQL ran in
    2 minutes.
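
For intuition about why skipping map-side aggregation might help a query like this one, here is a minimal sketch in plain Python. It is not Spark code and does not reflect Spark's actual implementation. It assumes that most `sale_ord_id` values are distinct within a partition, which the description above does not state. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def map_side_aggregate(rows):
    """Partial aggregation: keep one hash-map entry per distinct key seen."""
    partial = defaultdict(float)
    for key, amount in rows:
        partial[key] += amount
    return partial


def reduction_ratio(rows):
    """Fraction of rows that partial aggregation removes before the shuffle."""
    rows = list(rows)
    if not rows:
        return 0.0
    return 1.0 - len(map_side_aggregate(rows)) / len(rows)


# Hypothetical partitions (illustrative data, not taken from the pull request).
few_keys = [(i % 10, 1.0) for i in range(10_000)]   # only 10 distinct keys
many_keys = [(i, 1.0) for i in range(10_000)]       # every key is distinct

print(reduction_ratio(few_keys))
print(reduction_ratio(many_keys))
```

When nearly every key is distinct, the map-side hash map ends up holding about one entry per input row while removing almost nothing before the shuffle, so under this assumption the partial step costs memory without reducing the data.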

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jeanlyn/spark partialAggregation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6426.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6426
    
----
commit b17c676bb0d33019bbdd124048221595f278b9d0
Author: jeanlyn <jeanly...@gmail.com>
Date:   2015-05-27T03:03:47Z

    add config to control map aggregation in spark sql

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
