[GitHub] spark issue #16347: [SPARK-18934][SQL] Writing to dynamic partitions does no...

2017-06-19 Thread junegunn
Github user junegunn commented on the issue: https://github.com/apache/spark/pull/16347 Hive makes sure that the output file is properly sorted by the column specified in `SORT BY` clause by having only one reduce task (output) for each partition. ``` STAGE PLANS:

2017-06-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/16347 gentle ping @junegunn on ^.

2017-05-23 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16347 @junegunn Can you check the query plan of hive for `INSERT OVERWRITE TABLE ... DISTRIBUTE BY ... SORT BY ...`? In Spark SQL, the query plan looks like ``` 'InsertIntoTable

2017-05-23 Thread junegunn
Github user junegunn commented on the issue: https://github.com/apache/spark/pull/16347 See my answer above.

2017-05-23 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16347 is this still needed after https://github.com/apache/spark/pull/16898 ?

2017-04-11 Thread junegunn
Github user junegunn commented on the issue: https://github.com/apache/spark/pull/16347 @cloud-fan It's not a problem in the context of the DataFrame API. But when it comes to Spark SQL, it makes Spark SQL incompatible with the equivalent HiveQL in a subtle way. At least we may need to revisit
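For context on the HiveQL side of that comparison: elsewhere in this thread, Hive is described as keeping each output file sorted by running only one reduce task per partition. A toy model of that behaviour in plain Scala may help picture it. This is only a sketch under stated assumptions: no Hive or Spark is involved, and `Row`, `distributeAndSort` and `isSortedByValue` are names made up for the illustration.

```scala
object DistributeSortModel {
  // Hypothetical row: a dynamic partition column plus a column to sort on.
  final case class Row(part: Int, value: Int)

  // Toy model of DISTRIBUTE BY part SORT BY value: all rows sharing a
  // `part` go to the same single "reducer", and each reducer sorts its
  // rows by `value` before producing exactly one output "file".
  def distributeAndSort(rows: Seq[Row]): Map[Int, Seq[Row]] =
    rows.groupBy(_.part).map { case (part, group) =>
      part -> group.sortBy(_.value)
    }

  // True when every row is <= its successor on `value`.
  def isSortedByValue(rows: Seq[Row]): Boolean =
    rows.zip(rows.drop(1)).forall { case (a, b) => a.value <= b.value }

  def main(args: Array[String]): Unit = {
    // Shuffled input, so any ordering in the output comes from the model.
    val shuffled = new scala.util.Random(42).shuffle((1 to 1000).toList)
    val rows = shuffled.map(v => Row(v % 2, v))

    distributeAndSort(rows).foreach { case (part, file) =>
      println(s"part=$part rows=${file.size} sorted=${isSortedByValue(file)}")
    }
  }
}
```

Because each `part` is handled by exactly one reducer in this model, the sort order of the reducer's output and the sort order of the file are the same thing, which is the property the Spark SQL behaviour is being compared against.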

2017-04-11 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16347 @junegunn I think it's not a problem, `df.write.xxx` is not guaranteed to retain the ordering of `df` when writing data output. Currently the `DataFrameWriter` doesn't provide an

2017-04-11 Thread junegunn
Github user junegunn commented on the issue: https://github.com/apache/spark/pull/16347 @cloud-fan Unfortunately, yes.

```scala
sc.parallelize(1 to 1000).toDS.withColumn("part", 'value.mod(2))
  .repartition(1, 'part).sortWithinPartitions("value")
```

2017-04-10 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16347 is this still a problem?

2017-04-03 Thread Downchuck
Github user Downchuck commented on the issue: https://github.com/apache/spark/pull/16347 Is there anyone on the Spark team taking this up? This bug is painful; it's saddened a hundred TB of data I stacked up, and I'm really trying to avoid more manual work. "INSERT OVERWRITE TABLE

2017-01-19 Thread junegunn
Github user junegunn commented on the issue: https://github.com/apache/spark/pull/16347 Rebased to current master. The patch is simpler thanks to the refactoring made in [SPARK-18243](https://issues.apache.org/jira/browse/SPARK-18243). Anyway, I can understand your rationale

2017-01-06 Thread chpritchard-expedia
Github user chpritchard-expedia commented on the issue: https://github.com/apache/spark/pull/16347 @rxin - Oh, yes that'd be fantastic, partitionBy.sortBy is just about all I need to survive in this crazy world. In the meantime, I think there ought to be a big warning label on

2017-01-05 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16347 What I was suggesting was to allow sort by without bucketing.

2017-01-05 Thread chpritchard-expedia
Github user chpritchard-expedia commented on the issue: https://github.com/apache/spark/pull/16347 @rxin - sortBy is somewhat tied in with bucketing, which is also a little difficult to work with. First, bucketing often relies on a column being present, whereas in Hive (and with

2017-01-04 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16347 Maybe we should make DataFrameWriter.sortBy work here.

2017-01-04 Thread junegunn
Github user junegunn commented on the issue: https://github.com/apache/spark/pull/16347 @chpritchard-expedia The patch here fixes the problem. I don't think it's possible to work around the issue by using the Spark API in some different way, because we can't completely avoid memory

2017-01-04 Thread chpritchard-expedia
Github user chpritchard-expedia commented on the issue: https://github.com/apache/spark/pull/16347 @junegunn I ran into the same issue, using partitionBy; missed it completely during my testing. Would you share the workaround you used? I wasn't able to understand it from your Apache

2016-12-20 Thread junegunn
Github user junegunn commented on the issue: https://github.com/apache/spark/pull/16347 Thanks for the comment. I was trying to implement the following Hive QL in Spark SQL/API:

```sql
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.mapred.mode = nonstrict;
```

2016-12-20 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16347 Thanks for submitting the ticket. In general I don't think the sortWithinPartitions property can carry over to writing out data, because one partition actually corresponds to more than one file.
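The point about one partition corresponding to more than one file can be pictured with a small plain-Scala model. It is a sketch only, not Spark's writer: `Row`, `writeDynamicPartitions` and the `part=N/part-00000` file names are hypothetical, chosen just to show a single sorted in-memory partition fanning out into several output files.

```scala
object PartitionFanOutModel {
  // Hypothetical row: `part` is the dynamic partition column.
  final case class Row(part: Int, value: Int)

  // Hypothetical stand-in for a dynamic-partition writer: each row of one
  // in-memory partition is appended, in arrival order, to the file that
  // belongs to its `part` value. File names are illustrative only.
  def writeDynamicPartitions(partition: Seq[Row]): Map[String, Vector[Row]] =
    partition.foldLeft(Map.empty[String, Vector[Row]]) { (files, row) =>
      val path = s"part=${row.part}/part-00000"
      files.updated(path, files.getOrElse(path, Vector.empty) :+ row)
    }

  def main(args: Array[String]): Unit = {
    // One partition, sorted by `value`, holding rows for two `part` values.
    val partition = (1 to 10).map(v => Row(v % 2, v))

    val files = writeDynamicPartitions(partition)
    files.toSeq.sortBy(_._1).foreach { case (path, rows) =>
      println(s"$path -> ${rows.map(_.value).mkString(",")}")
    }
  }
}
```

In this toy writer the rows are appended strictly in arrival order, so each file happens to stay sorted. Nothing in the model forces that: the ordering was a property of the partition, not of the files, so a writer that reorders rows internally before emitting them would not preserve it.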

2016-12-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16347 Can one of the admins verify this patch?