Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
Hive makes sure that the output file is properly sorted by the column
specified in the `SORT BY` clause by having only one reduce task (and hence
one output file) for each partition.
```
STAGE PLANS:
```
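One way to check that per-file guarantee from the Spark side is to read a single output file back and confirm the sort column is non-decreasing; a minimal sketch, assuming a hypothetical ORC warehouse path, file name, and column (none of these come from the thread):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Read one output file written by Hive and verify it is internally sorted.
// Path, file name, and column name are hypothetical.
val vals = spark.read.orc("/warehouse/t/part=0/000000_0")
  .select("value").as[Int].collect()
assert(vals.sameElements(vals.sorted), "file is not sorted by `value`")
```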
Github user HyukjinKwon commented on the issue:
https://github.com/apache/spark/pull/16347
gentle ping @junegunn on ^.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/16347
@junegunn Can you check the query plan of Hive for `INSERT OVERWRITE TABLE
... DISTRIBUTE BY ... SORT BY ...`?
In Spark SQL, the query plan looks like
```
'InsertIntoTable
```
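For reference, a plan like this can be printed without executing the statement; a minimal sketch, assuming hypothetical table and column names:
```scala
// EXPLAIN EXTENDED prints the parsed/analyzed/optimized/physical plans
// without running the INSERT. Table and column names are hypothetical.
spark.sql(
  """EXPLAIN EXTENDED
    |INSERT OVERWRITE TABLE target PARTITION (part)
    |SELECT value, part FROM source DISTRIBUTE BY part SORT BY value
    |""".stripMargin).show(truncate = false)
```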
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
See my answer above.
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/16347
Is this still needed after https://github.com/apache/spark/pull/16898?
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
@cloud-fan It's not a problem in the context of the DataFrame API. But when it
comes to Spark SQL, it makes Spark SQL incompatible with the equivalent HiveQL
in a subtle way. At least we may need to revisit
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/16347
@junegunn I think it's not a problem: `df.write.xxx` is not guaranteed to
retain the ordering of `df` when writing data out.
Currently the `DataFrameWriter` doesn't provide an API for specifying the
output ordering.
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
@cloud-fan Unfortunately, yes.
```scala
sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
  .repartition(1, 'part).sortWithinPartitions("value")
```
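A fuller, runnable version of that reproduction might look as follows (a sketch; the write step and the table name `sorted_demo` are my assumptions, not from the thread):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._
val sc = spark.sparkContext

// Sort within the single shuffle partition, then write with dynamic
// partitioning. The per-partition files may still come out unsorted,
// which is the behavior this PR is about.
sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
  .repartition(1, 'part)
  .sortWithinPartitions("value")
  .write.mode("overwrite")
  .partitionBy("part")
  .saveAsTable("sorted_demo")   // hypothetical table name
```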
Github user cloud-fan commented on the issue:
https://github.com/apache/spark/pull/16347
is this still a problem?
Github user Downchuck commented on the issue:
https://github.com/apache/spark/pull/16347
Is there anyone on the Spark team taking this up? This bug is painful; it has
affected a hundred TB of data I've stacked up, and I'm really trying to avoid
more manual work. "INSERT OVERWRITE TABLE
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
Rebased to current master. The patch is simpler thanks to the refactoring
made in [SPARK-18243](https://issues.apache.org/jira/browse/SPARK-18243).
Anyway, I can understand your rationale
Github user chpritchard-expedia commented on the issue:
https://github.com/apache/spark/pull/16347
@rxin - Oh, yes that'd be fantastic, `partitionBy.sortBy` is just about all I
need to survive in this crazy world. In the meantime, I think there ought to be
a big warning label on
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/16347
What I was suggesting was to allow `sortBy` without bucketing.
Github user chpritchard-expedia commented on the issue:
https://github.com/apache/spark/pull/16347
@rxin - `sortBy` is somewhat tied in with bucketing, which is also a little
difficult to work with. First, bucketing often relies on a column being
present, whereas in Hive (and with
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/16347
Maybe we should make `DataFrameWriter.sortBy` work here.
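For context, `DataFrameWriter.sortBy` is currently honored only together with `bucketBy`, and only for `saveAsTable`; a minimal sketch (the bucket count and all names are hypothetical):
```scala
// sortBy today: must be combined with bucketBy, and the pair only works
// with saveAsTable. Calling sortBy alone fails analysis.
df.write
  .bucketBy(8, "part")          // 8 buckets: an arbitrary, hypothetical choice
  .sortBy("value")              // sorts rows within each bucket file
  .saveAsTable("bucketed_demo") // hypothetical table name
```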
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
@chpritchard-expedia The patch here fixes the problem. I don't think it's
possible to work around the issue by using the Spark API in some different
way, because we can't completely avoid memory
Github user chpritchard-expedia commented on the issue:
https://github.com/apache/spark/pull/16347
@junegunn I ran into the same issue using `partitionBy`; I missed it
completely during my testing. Would you share the workaround you used? I wasn't
able to understand it from your Apache
Github user junegunn commented on the issue:
https://github.com/apache/spark/pull/16347
Thanks for the comment. I was trying to implement the following HiveQL in
Spark SQL / the DataFrame API:
```sql
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.mapred.mode = nonstrict;
```
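A sketch of the general statement shape being ported, issued through Spark (table and column names are hypothetical; only the `INSERT OVERWRITE ... DISTRIBUTE BY ... SORT BY` form comes from this thread):
```scala
// Hypothetical shape of the HiveQL being ported; only the
// DISTRIBUTE BY ... SORT BY form is given in the thread.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql(
  """INSERT OVERWRITE TABLE target PARTITION (part)
    |SELECT value, part FROM source
    |DISTRIBUTE BY part
    |SORT BY value""".stripMargin)
```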
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/16347
Thanks for submitting the ticket. In general I don't think the
`sortWithinPartitions` property can carry over to writing out data, because one
partition actually corresponds to more than one file.
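A small sketch of that point (the path and names are hypothetical): with `partitionBy`, a single task fans its rows out into one file per partition value, so "the task's rows were sorted" does not describe any single output file.
```scala
import org.apache.spark.sql.functions.col

// One RDD partition -> several output files: a single task writes one
// file per distinct partition value it sees. Output path is hypothetical.
val df = spark.range(0, 100).withColumn("part", col("id") % 2)
df.coalesce(1)                 // force a single task
  .sortWithinPartitions("id")
  .write.partitionBy("part")
  .parquet("/tmp/sort_demo")
// Layout: /tmp/sort_demo/part=0/... and /tmp/sort_demo/part=1/...
// -- two files produced by that one sorted task.
```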
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/16347
Can one of the admins verify this patch?