Github user junegunn closed the pull request at:
https://github.com/apache/spark/pull/16347
---
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
Hive makes sure that the output file is properly sorted by the column
specified in the `SORT BY` clause by having only one reduce task (and thus one
output file) per partition.
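As a toy model of that guarantee (plain Python, not Hive's actual machinery): route every row of a partition to a single "reducer", sort there, and each partition's output comes out sorted.

```python
# Toy model of Hive's one-reducer-per-dynamic-partition guarantee.
# Shuffling all rows of a partition to a single task and sorting there
# means each partition's output file is written in sorted order.
from collections import defaultdict

rows = [(5, 1), (3, 1), (8, 0), (1, 1), (4, 0), (7, 1)]  # (value, part)

reducers = defaultdict(list)      # one reducer per partition value
for value, part in rows:
    reducers[part].append(value)  # shuffle step: DISTRIBUTE BY part

files = {part: sorted(vals) for part, vals in reducers.items()}  # SORT BY value
```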
```
STAGE PLANS
```
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
See my answer above.
---
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
@cloud-fan It's not a problem in the context of the DataFrame API. But when it
comes to Spark SQL, it makes Spark SQL incompatible with the equivalent HiveQL
in a subtle way. At least we may need to revisit
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
@cloud-fan Unfortunately, yes.
```scala
sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
  .repartition(1, 'part).sortWithinPartitions("value")
```
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
Rebased to current master. The patch is simpler thanks to the refactoring
made in [SPARK-18243](https://issues.apache.org/jira/browse/SPARK-18243).
Anyway, I can understand your rationale
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
@chpritchard-expedia The patch here fixes the problem. I don't think it's
possible to work around the issue by using the Spark API in some different
way, because we can't completely avoid memory spills.
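A self-contained sketch of why spills matter (plain Python, illustrating only the failure mode, not Spark's actual sorter): when spill files that are each sorted by value are merged back keyed on the partition column alone, the per-partition value order can be lost; merging on the full (partition, value) key preserves it.

```python
# Two hypothetical spill files; each is sorted by value within partition 0.
import heapq

spill1 = [(0, 1), (0, 5)]   # (part, value)
spill2 = [(0, 2), (0, 9)]

# Merge keyed on the partition column only: ties between spill files are
# broken by spill order, not by value, so the order inside the partition
# is not guaranteed to survive the merge.
by_part = [v for _, v in heapq.merge(spill1, spill2, key=lambda r: r[0])]

# Merge on the full (part, value) tuple: the value order is preserved.
by_full = [v for _, v in heapq.merge(spill1, spill2)]
```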
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
Thanks for the comment. I was trying to implement the following HiveQL in
Spark SQL/API:
```sql
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.mapred.mode=nonstrict;
```
GitHub user junegunn opened a pull request:
https://github.com/apache/spark/pull/16347
[SPARK-18934][SQL] Writing to dynamic partitions does not preserve sort
order if spills occur
## What changes were proposed in this pull request?
Make dynamic partition writer perform
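The description is cut off above. As a rough, hypothetical illustration of the direction such a fix can take (plain Python, not the actual patch): sorting rows once by (partition key, sort key) before writing lets the writer stream each partition's file out already sorted, with no later re-sort for spills to disturb.

```python
# Toy model: sort rows once by (partition key, sort key), then stream them
# into per-partition files; each file comes out sorted without a re-sort.
rows = [(9, 1), (2, 0), (7, 1), (4, 0), (3, 1)]  # (value, part)

ordered = sorted(rows, key=lambda r: (r[1], r[0]))  # sort by (part, value)

files = {}
for value, part in ordered:
    files.setdefault(part, []).append(value)  # sequential per-partition write
```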