sam created SPARK-30101:
---------------------------

             Summary: Dataset distinct does not respect 
spark.default.parallelism
                 Key: SPARK-30101
                 URL: https://issues.apache.org/jira/browse/SPARK-30101
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.4, 2.4.0
            Reporter: sam


I'm creating a `SparkSession` like this:

```
SparkSession
      .builder().appName("foo").master("local")
      .config("spark.default.parallelism", 2).getOrCreate()
```

when I run

```
((1 to 10) ++ (1 to 10)).toDS().distinct().count()
```

I get 200 partitions

```
19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
...
19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) 
in 46 ms on localhost (executor driver) (1/200)
```

It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives `2`, 
while `ds.distinct().rdd.getNumPartitions` gives `200`.  
`ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work 
correctly.

Finally I notice that the good old `RDD` interface has a `distinct` that 
accepts `numPartitions` partitions, while `Dataset` does not.






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to