[ https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988676#comment-16988676 ]
sam commented on SPARK-30101:
-----------------------------

[~kabhwan] [~cloud_fan] [~sowen]

> We may deal with it we strongly agree about needs for prioritizing this.

Oh great, thanks. I think part of the problem here is that Google search is misleading, because its ranking has been trained on RDD-era content. Googling how to set parallelism always turns up `spark.default.parallelism`. Even if you Google "set default parallelism dataset spark", it still doesn't take you to http://spark.apache.org/docs/latest/sql-performance-tuning.html

I think setting parallelism is indeed one of the most important things you will ever need to do in Spark, so yes, making this easier to find would be super helpful to the community (a short sketch is at the end of this message).

> spark.sql.shuffle.partitions is not in Configuration docs, but a very
> critical parameter
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-30101
>                 URL: https://issues.apache.org/jira/browse/SPARK-30101
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0, 2.4.4
>            Reporter: sam
>            Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> When I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions:
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken, since `ds.rdd.getNumPartitions` gives `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`.
> `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work correctly.
> Finally, I notice that the good old `RDD` interface has a `distinct` that accepts `numPartitions`, while `Dataset` does not.
> .......................
> According to the comments below, `distinct` uses `spark.sql.shuffle.partitions`, which needs documenting in Configuration. The entry
> > Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
> in https://spark.apache.org/docs/latest/configuration.html should say
> > Default number of partitions in RDDs, but not DS/DF (see spark.sql.shuffle.partitions), returned by transformations like join, reduceByKey, and parallelize when not set by user.
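To make the fix concrete for anyone who lands here from search, here is a minimal sketch of the `Dataset`-side configuration, reusing the toy session from the description above. The object name, app name, and partition values are illustrative only, not a definitive recommendation:

```
import org.apache.spark.sql.SparkSession

object ShufflePartitionsDemo {
  def main(args: Array[String]): Unit = {
    // spark.default.parallelism governs RDD transformations;
    // spark.sql.shuffle.partitions governs Dataset/DataFrame shuffles,
    // including the exchange introduced by distinct().
    val spark = SparkSession
      .builder().appName("foo").master("local")
      .config("spark.default.parallelism", 2)
      .config("spark.sql.shuffle.partitions", 2) // without this, distinct() uses the 200 default
      .getOrCreate()
    import spark.implicits._

    val ds = ((1 to 10) ++ (1 to 10)).toDS()
    println(ds.rdd.getNumPartitions)            // 2, from spark.default.parallelism
    println(ds.distinct().rdd.getNumPartitions) // 2, from spark.sql.shuffle.partitions

    // The setting can also be changed at runtime, taking effect for
    // queries planned after the change:
    spark.conf.set("spark.sql.shuffle.partitions", 4)
    println(ds.distinct().rdd.getNumPartitions) // 4

    spark.stop()
  }
}
```

Since `Dataset.distinct` takes no `numPartitions` argument (unlike `RDD.distinct`), this config, or an explicit `repartition` after the shuffle, is the only knob available here.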