[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-25 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 if they chain like that then i think i know how to do the optimization. but do they? look for example at dataset.groupByKey(...).mapValues(...) Dataset[T].groupByKey[K] uses

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-24 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @cloud-fan that makes sense to me, but it's definitely not a quick win to create that optimization. let me think about it some more --- If your project is set up for it, you can reply to

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-21 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13526 That's a good point, let's focus on `ds.groupBy(...).mapValues(...)` then. One thought, in `mapValues`, we will project away the previous value attributes, so the workflow should be:

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-21 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @cloud-fan i can try to optimize `grouped.mapValues(...).mapValues(...)` but it's a bit of an anti-pattern (there should be no need to do mapValues twice) so i don't think there is much

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-21 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13526 To optimize `ds.groupBy(...).mapValues(...)`, yea it's not trivial as you explained above. But for `grouped.mapValues(...).mapValues(...)`, I think it should not be that hard, as it's a pattern

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13526 2 chained `AppendColumns` will have 2 functions: T => U and U => W, so we can combine them this way:

1. convert UnsafeRow to T
2. apply func to T to generate U
3. apply func to U to generate W
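The combination described above amounts to function composition. The following is a minimal plain-Scala sketch of that idea only; `CombineFunctionsSketch`, `combine`, `toLength`, and `isEven` are hypothetical names chosen for illustration, and nothing here is Spark's `AppendColumns` code or an optimizer rule.

```scala
// Hypothetical illustration in plain Scala; not Spark's AppendColumns implementation.
object CombineFunctionsSketch {
  // Fuse T => U and U => W into a single T => W, so the intermediate U
  // is passed along in memory rather than produced by a separate step.
  def combine[T, U, W](first: T => U, second: U => W): T => W =
    first.andThen(second)

  val toLength: String => Int = _.length       // plays the role of T => U
  val isEven: Int => Boolean = _ % 2 == 0      // plays the role of U => W

  // One function that goes straight from T to W.
  val fused: String => Boolean = combine(toLength, isEven)
}
```

In this sketch `fused` behaves like applying `toLength` and then `isEven`, which is the property a rule collapsing two chained steps would need to preserve.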

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 @rxin i can give it a try (the optimizer rule)

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13526 Alright merging in master. Thanks. @koertkuipers would you be able to add the optimizer rule?

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Merged build finished. Test PASSed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67268/ Test PASSed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #67268 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67268/consoleFull)** for PR 13526 at commit

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #67268 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67268/consoleFull)** for PR 13526 at commit

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Merged build finished. Test FAILed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67264/ Test FAILed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #67264 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67264/consoleFull)** for PR 13526 at commit

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #67264 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67264/consoleFull)** for PR 13526 at commit

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13526 retest this please

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-20 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13526 it lacks an optimizer rule to collapse `AppendColumns`, but seems ok to merge it first and add the rule in follow-up.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-10-18 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13526 cc @cloud-fan This looks good to me -- but I don't remember why we didn't merge it earlier already.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-07-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Merged build finished. Test PASSed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-07-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/62388/ Test PASSed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-07-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #62388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62388/consoleFull)** for PR 13526 at commit

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-07-15 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #62388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/62388/consoleFull)** for PR 13526 at commit

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 could we "rewind"/undo the append for the key and change it to a map that inserts new values and key? so remove one append and replace it with another operation?

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 the tricky part with that is that (ds: Dataset[(K, V)]).groupBy(_._1).mapValues(_._2) should return a KeyValueGroupedDataset[K, V] On Tue, Jun 7, 2016 at 8:22 PM, Wenchen Fan

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13526 A possible approach may be to just keep the function given by `mapValues` and apply it before calling the function given by `mapGroups`. By doing this, we at least won't make the performance worse,
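The approach suggested here can be modelled on plain Scala collections. This is a toy sketch under stated assumptions: `Grouped`, `groupByKey`, and `valueFunc` are hypothetical names that only mirror the shape of the discussion, and the code does not reflect how Spark's `KeyValueGroupedDataset` is implemented.

```scala
// Toy model on plain Scala collections; Grouped and groupByKey are hypothetical
// names, not Spark's KeyValueGroupedDataset.
object DeferredMapValuesSketch {
  // Holds the original records per key plus a value function that has not run yet.
  final case class Grouped[K, T, V](groups: Map[K, Seq[T]], valueFunc: T => V) {
    // mapValues only composes functions; no record is touched here.
    def mapValues[W](f: V => W): Grouped[K, T, W] =
      Grouped(groups, valueFunc.andThen(f))

    // The kept function is applied to each record right before the group function runs.
    def mapGroups[R](f: (K, Iterator[V]) => R): Seq[R] =
      groups.toSeq.map { case (k, records) => f(k, records.iterator.map(valueFunc)) }
  }

  // Start with the identity as the value function, so values are the records themselves.
  def groupByKey[K, T](data: Seq[T])(key: T => K): Grouped[K, T, T] =
    Grouped(data.groupBy(key), (t: T) => t)

  // Per-key sums computed through the deferred value function.
  val sums: Map[String, Int] =
    groupByKey(Seq(("a", 1), ("a", 2), ("b", 3)))(_._1)
      .mapValues(_._2)
      .mapGroups((k, vs) => (k, vs.sum))
      .toMap
}
```

The design point of the sketch is that `mapValues` adds no extra pass over the data: the value function rides along and is applied inside `mapGroups`.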

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526
```
scala> val x = Seq(("a", 1), ("b", 2)).toDS
x: org.apache.spark.sql.Dataset[(String, Int)] = [_1: string, _2: int]

scala> x.groupByKey(_._1).mapValues(_._2).reduceGroups(_ +
```

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-07 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 ok i will study the physical plans for both and try to understand why one would be slower

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13526 OK now I agree this is a useful API. For performance, I would expect that `ds.groupByKey(_._1).mapValues(_._2).mapGroups { case (k, vs) => (k, vs.sum) }` should be at least as fast as

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 can you explain a bit what is inefficient and would need an optimizer rule? is it mapValues being called twice? once for the key and then for the new values? thanks!

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13526 see this conversation: https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccaaswr-7kqfmxd_cpr-_wdygafh+rarecm9olm5jkxfk14fc...@mail.gmail.com%3E mapGroups is not a

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60080/ Test PASSed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Merged build finished. Test PASSed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60082/ Test PASSed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Merged build finished. Test PASSed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #60080 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60080/consoleFull)** for PR 13526 at commit

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #60082 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60082/consoleFull)** for PR 13526 at commit

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread cloud-fan
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13526 I doubt this feature is really useful. I think users can easily call `map` on the values during `mapGroups`, e.g.
```
val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c",
```
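To make the comparison concrete, here is a small plain-Scala sketch of the two styles being weighed: extracting the value inside the group function versus extracting it before the group function runs. It uses ordinary collections rather than a Spark `Dataset`, and the object name and sample data are assumptions made for illustration, not the truncated example from the comment.

```scala
// Plain Scala collections only; sample data and names are hypothetical.
object InlineValueMappingSketch {
  val data: Seq[(String, Int)] = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2))

  // Style 1: the group function receives whole records and extracts the value itself.
  val inline: Map[String, Int] =
    data.groupBy(_._1).map { case (k, records) => (k, records.map(_._2).sum) }

  // Style 2: values are extracted first, so the group function only ever sees Ints.
  val valuesFirst: Map[String, Int] =
    data.groupBy(_._1)
      .map { case (k, records) => (k, records.map(_._2)) }
      .map { case (k, vs) => (k, vs.sum) }
}
```

Both styles give the same per-key sums in this sketch; the difference is only in where the value extraction is written.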

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/13526 cc @cloud-fan too

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #60082 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60082/consoleFull)** for PR 13526 at commit

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #60080 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60080/consoleFull)** for PR 13526 at commit

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Merged build finished. Test PASSed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13526 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60054/ Test PASSed.

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #60054 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60054/consoleFull)** for PR 13526 at commit

[GitHub] spark issue #13526: [SPARK-15780][SQL] Support mapValues on KeyValueGroupedD...

2016-06-06 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13526 **[Test build #60054 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60054/consoleFull)** for PR 13526 at commit