[ 
https://issues.apache.org/jira/browse/SPARK-24519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-24519:
----------------------------------
    Description: 
MapStatus uses a hardcoded value of 2000 partitions to determine whether it 
should use the highly compressed map status. We should make it configurable to 
allow users to more easily tune their jobs in this respect without requiring 
them to modify their code to change the number of partitions.  Note we can 
leave this as an internal/undocumented config for now, until we have more 
advice for users on how to set this config.
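
The selection logic described above can be sketched as follows. This is a 
minimal Python sketch of the described behavior, not Spark's actual Scala 
code; the function and constant names are illustrative:

```python
# Illustrative sketch of the choice this issue describes: MapStatus picks a
# highly compressed representation once the number of reduce partitions
# exceeds a hardcoded threshold of 2000. Names here are illustrative, not
# Spark's actual API.

HARDCODED_THRESHOLD = 2000  # the constant this issue proposes making configurable

def choose_map_status(num_partitions: int,
                      threshold: int = HARDCODED_THRESHOLD) -> str:
    """Return which MapStatus implementation would be chosen."""
    if num_partitions > threshold:
        return "HighlyCompressedMapStatus"
    return "CompressedMapStatus"
```

With a strict greater-than comparison against 2000, a job needs at least 2001 
partitions to get the highly compressed status, which matches the 
2001-partition advice mentioned later in this description.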

Some of my reasoning:

The config gives you a way to easily change something without having to change 
code, redeploy the jar, and run again. You can simply change the config and 
rerun. It also allows for easier experimentation. Changing the number of 
partitions has other side effects, and whether they are good or bad is 
situation dependent. It can be worse: you could be increasing the number of 
output files when you don't want to, and it affects the number of tasks needed 
and thus the executors that run in parallel, etc.

There have been various talks about this number at Spark Summits where people 
have told customers to increase it to 2001 partitions. If you search for 
"spark 2000 partitions" you will find various discussions of this number.  
This shows that people are modifying their code to take this into account, so 
it seems to me that making this configurable would be better.
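
If the threshold were exposed as a config, tuning it would be a flag change 
rather than a code change. A hypothetical invocation could look like the 
following; the property name is an assumption for illustration, not a 
documented Spark configuration:

```shell
# Hypothetical: the property name below is an assumption, not a documented
# Spark configuration. Setting it would move the MapStatus threshold without
# touching job code.
spark-submit \
  --conf spark.shuffle.minNumPartitionsToHighlyCompress=1000 \
  my_job.py
```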

Once we have more advice for users we could expose this config and document it.

 

  was:MapStatus uses hardcoded value of 2000 partitions to determine if it 
should use highly compressed map status. We should make it configurable.


> MapStatus has 2000 hardcoded
> ----------------------------
>
>                 Key: SPARK-24519
>                 URL: https://issues.apache.org/jira/browse/SPARK-24519
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.0
>            Reporter: Hieu Tri Huynh
>            Priority: Minor
>
> MapStatus uses a hardcoded value of 2000 partitions to determine whether it 
> should use the highly compressed map status. We should make it configurable 
> to allow users to more easily tune their jobs in this respect without 
> requiring them to modify their code to change the number of partitions.  
> Note we can leave this as an internal/undocumented config for now, until we 
> have more advice for users on how to set this config.
> Some of my reasoning:
> The config gives you a way to easily change something without having to 
> change code, redeploy the jar, and run again. You can simply change the 
> config and rerun. It also allows for easier experimentation. Changing the 
> number of partitions has other side effects, and whether they are good or 
> bad is situation dependent. It can be worse: you could be increasing the 
> number of output files when you don't want to, and it affects the number of 
> tasks needed and thus the executors that run in parallel, etc.
> There have been various talks about this number at Spark Summits where 
> people have told customers to increase it to 2001 partitions. If you search 
> for "spark 2000 partitions" you will find various discussions of this 
> number.  This shows that people are modifying their code to take this into 
> account, so it seems to me that making this configurable would be better.
> Once we have more advice for users we could expose this config and document 
> it.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
