[ https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988676#comment-16988676 ]
sam commented on SPARK-30101:
-----------------------------

[~kabhwan] [~cloud_fan] [~sowen]

> We may deal with it we strongly agree about needs for prioritizing this.

Oh great, thanks. I think part of the problem here is that Google search is misleading, because its ranking has been trained on RDD-era content. Googling how to set parallelism always turns up `spark.default.parallelism`. Even if you Google "set default parallelism dataset spark", it still doesn't take you to http://spark.apache.org/docs/latest/sql-performance-tuning.html

I think setting parallelism is indeed one of the most important things you will ever need to do in Spark, so yes, making this easier to find would be super helpful to the community (a short sketch is at the end of this message).

> spark.sql.shuffle.partitions is not in Configuration docs, but a very
> critical parameter
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-30101
>                 URL: https://issues.apache.org/jira/browse/SPARK-30101
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0, 2.4.4
>            Reporter: sam
>            Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> When I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions:
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken, since `ds.rdd.getNumPartitions` gives `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`.
> `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work correctly.
> Finally, I notice that the good old `RDD` interface has a `distinct` that accepts `numPartitions`, while `Dataset` does not.
> .......................
> According to the comments below, `distinct` uses `spark.sql.shuffle.partitions`, which needs documenting in Configuration. The entry
> > Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
> in https://spark.apache.org/docs/latest/configuration.html should say
> > Default number of partitions in RDDs, but not DS/DF (see spark.sql.shuffle.partitions), returned by transformations like join, reduceByKey, and parallelize when not set by user.
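To make the fix concrete for anyone who lands here from search, here is a minimal sketch of the `Dataset`-side configuration, reusing the toy session from the description above. The object name, app name, and partition values are illustrative only, not a definitive recommendation:

```
import org.apache.spark.sql.SparkSession

object ShufflePartitionsDemo {
  def main(args: Array[String]): Unit = {
    // spark.default.parallelism governs RDD transformations;
    // spark.sql.shuffle.partitions governs Dataset/DataFrame shuffles,
    // including the exchange introduced by distinct().
    val spark = SparkSession
      .builder().appName("foo").master("local")
      .config("spark.default.parallelism", 2)
      .config("spark.sql.shuffle.partitions", 2) // without this, distinct() uses the 200 default
      .getOrCreate()
    import spark.implicits._

    val ds = ((1 to 10) ++ (1 to 10)).toDS()
    println(ds.rdd.getNumPartitions)            // 2, from spark.default.parallelism
    println(ds.distinct().rdd.getNumPartitions) // 2, from spark.sql.shuffle.partitions

    // The setting can also be changed at runtime, taking effect for
    // queries planned after the change:
    spark.conf.set("spark.sql.shuffle.partitions", 4)
    println(ds.distinct().rdd.getNumPartitions) // 4

    spark.stop()
  }
}
```

Since `Dataset.distinct` takes no `numPartitions` argument (unlike `RDD.distinct`), this config, or an explicit `repartition` after the shuffle, is the only knob available here.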