[ 
https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus updated SPARK-6464:
--------------------------------
    Description: 
Currently, the *coalesce* transformation is commonly used to increase or reduce 
the number of partitions in order to improve performance.
However, *coalesce* cannot guarantee that a child partition is executed on the 
same executor as its parent partitions, which can lead to a large amount of 
network transfer.
In some scenarios, such as the one mentioned in the title (+small and cached 
rdd+), we want to coalesce all the partitions that reside on the same executor 
into a single partition and make sure that the child partition is executed on 
that executor. This avoids network transfer, reduces task scheduling overhead, 
and frees CPU cores for other jobs. 
In this scenario, our performance improved by 20%.
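
The grouping idea described above can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions, not the proposed processCoalesce implementation: the function names (group_by_executor, remote_reads), the executor ids, and the partition layout are all hypothetical, and the "naive" grouping is a made-up locality-unaware assignment, not Spark's actual coalesce algorithm.

```python
from collections import defaultdict


def group_by_executor(partition_locations):
    """Group parent partition ids by the executor that caches them.

    partition_locations: dict mapping partition id -> executor id
    (hypothetical input for this sketch).
    Returns a dict mapping executor id -> sorted list of parent partition ids,
    i.e. one coalesced child partition per executor.
    """
    groups = defaultdict(list)
    for partition_id, executor_id in partition_locations.items():
        groups[executor_id].append(partition_id)
    return {executor: sorted(parts) for executor, parts in groups.items()}


def remote_reads(child_assignment, partition_locations):
    """Count parent partitions a child must fetch from another executor.

    child_assignment: dict mapping the executor a child runs on
    -> list of parent partition ids merged into that child.
    """
    return sum(
        1
        for executor, parents in child_assignment.items()
        for parent in parents
        if partition_locations[parent] != executor
    )


# Six cached partitions spread over two executors (made-up layout).
locations = {
    0: "exec-1", 1: "exec-2", 2: "exec-1",
    3: "exec-2", 4: "exec-1", 5: "exec-2",
}

# Locality-unaware grouping: consecutive partitions merged regardless of location.
naive = {"exec-1": [0, 1, 2], "exec-2": [3, 4, 5]}

# Locality-aware grouping: one child partition per executor.
local = group_by_executor(locations)

print(local)
print(remote_reads(naive, locations), remote_reads(local, locations))
```

With this layout, the locality-unaware assignment has to pull some parent partitions from the other executor, while the per-executor grouping reads only partitions that are already local.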


> Add a new RDD transformation named processCoalesce, intended particularly 
> to deal with the small and cached rdd
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6464
>                 URL: https://issues.apache.org/jira/browse/SPARK-6464
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.3.0
>            Reporter: SaintBacchus


