[ https://issues.apache.org/jira/browse/SPARK-17930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15581119#comment-15581119 ]

Guoqiang Li commented on SPARK-17930:
-------------------------------------


Test setup: TPC-DS 2 TB data set (Parquet), running the following SQL (query 2) =>
{noformat}
select  i_item_id, 
        avg(ss_quantity) agg1,
        avg(ss_list_price) agg2,
        avg(ss_coupon_amt) agg3,
        avg(ss_sales_price) agg4 
 from store_sales, customer_demographics, date_dim, item, promotion
 where ss_sold_date_sk = d_date_sk and
       ss_item_sk = i_item_sk and
       ss_cdemo_sk = cd_demo_sk and
       ss_promo_sk = p_promo_sk and
       cd_gender = 'M' and 
       cd_marital_status = 'M' and
       cd_education_status = '4 yr Degree' and
       (p_channel_email = 'N' or p_channel_event = 'N') and
       d_year = 2001 
 group by i_item_id
 order by i_item_id
 limit 100;
{noformat}

spark-defaults.conf =>

{noformat}
spark.master                           yarn-client
spark.executor.instances               20
spark.driver.memory                    16g
spark.executor.memory                  30g
spark.executor.cores                   5
spark.default.parallelism              100 
spark.sql.shuffle.partitions           100000 
spark.serializer                       org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize              0
spark.rpc.netty.dispatcher.numThreads   8
spark.executor.extraJavaOptions          -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
spark.cleaner.referenceTracking.blocking true
spark.cleaner.referenceTracking.blocking.shuffle true
{noformat}


Performance test results are as follows => 

||[SPARK-17930|https://github.com/witgo/spark/tree/SPARK-17930]||[ed14633|https://github.com/witgo/spark/commit/ed1463341455830b8867b721a1b34f291139baf3]||
|54.5 s|231.7 s|


> The SerializerInstance instance used when deserializing a TaskResult is not reused
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-17930
>                 URL: https://issues.apache.org/jira/browse/SPARK-17930
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.1, 2.0.1
>            Reporter: Guoqiang Li
>
> The following code is called when the DirectTaskResult instance is deserialized:
> {noformat}
>   def value(): T = {
>     if (valueObjectDeserialized) {
>       valueObject
>     } else {
>       // Each deserialization creates a new instance of SerializerInstance,
>       // which is very time-consuming
>       val resultSer = SparkEnv.get.serializer.newInstance()
>       valueObject = resultSer.deserialize(valueBytes)
>       valueObjectDeserialized = true
>       valueObject
>     }
>   }
> {noformat}
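
One possible way to avoid the repeated `newInstance()` call is to let the caller pass in a `SerializerInstance` that it keeps per thread. The sketch below illustrates that idea only and is not necessarily what the linked branch does. So that it runs without Spark on the classpath, `SerializerInstance`, `CountingSerializer`, `DirectTaskResult` and `ReuseSketch` are simplified, hypothetical stand-ins, and the "deserialization" merely decodes UTF-8 bytes.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for Spark's SerializerInstance (assumption: one method is enough here).
trait SerializerInstance {
  def deserialize[T](bytes: Array[Byte]): T
}

// Hypothetical stand-in for the serializer; counts how many instances it hands out.
class CountingSerializer {
  val created = new AtomicInteger(0)

  def newInstance(): SerializerInstance = {
    created.incrementAndGet()
    new SerializerInstance {
      // Toy decoding: interpret the bytes as a UTF-8 string.
      override def deserialize[T](bytes: Array[Byte]): T =
        new String(bytes, "UTF-8").asInstanceOf[T]
    }
  }
}

// Simplified DirectTaskResult: the caller supplies the SerializerInstance,
// so value() no longer creates one on every deserialization.
class DirectTaskResult[T](valueBytes: Array[Byte]) {
  private var cached: Option[T] = None

  def value(resultSer: SerializerInstance): T =
    cached.getOrElse {
      val v = resultSer.deserialize[T](valueBytes)
      cached = Some(v)
      v
    }
}

object ReuseSketch {
  val serializer = new CountingSerializer

  // One SerializerInstance per thread, created lazily on first use.
  // A per-thread cache is used because an instance is meant for one thread at a time.
  private val cachedInstance = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance = serializer.newInstance()
  }

  // Deserializes every payload with the calling thread's cached instance.
  def deserializeAll(payloads: Seq[String]): Seq[String] =
    payloads.map { p =>
      new DirectTaskResult[String](p.getBytes("UTF-8")).value(cachedInstance.get())
    }
}
```

Calling `ReuseSketch.deserializeAll` from a single thread creates the instance once, however many results are deserialized; with the original `value()` the number of instances grows with the number of task results.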


