[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan i thought about this a little more, and my suggested changes to the Aggregator api does not allow one to use a different encoder when applying a typed operation on Dataset. so i do not think it is dangerous as such. it does enable usage within the untyped grouping, which is where type conversions are already customary anyhow. its not more dangerous than say using a udaf in a DataFrame. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 for example with this branch you can do: ``` val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i", "j", "k") df3.groupBy("i").agg( ComplexResultAgg.apply("i", "k"), SumAgg.apply("j"), AverageAgg.apply("j") ) so these are multiple Aggregators applied in an untyped groupBy, each on selected columns. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 well that was sort of what i was trying to achieve. the unit tests i added were for using Aggregator for untyped grouping(```groupBy```). and i think for it to be useful within that context one should also be able to select the columns that the Aggregator operates on. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13512 a possible approach may be to enable untyped grouping(`groupBy`) support typed aggregating(`Aggregator`), i.e. something like `df.groupBy("i").keyAs[Int].agg(ComplexResultAgg.toColumn)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 If Aggregator is designed for typed Dataset only then that is a bit of a shame, because its a elegant and generic api that should be useful for DataFrame too. this causes fragmentation (Aggregator versus UserDefinedAggregationFunction). I am not sure what the better way to do this is, but i would like a single high level Aggregator-like api that i can use in Dataset and DataFrame. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13512 `Aggregator` API is designed for typed `Dataset` only, not for untyped `DataFrame`. It can work if users use `Row` as the input type of `Aggregator`, but it's not a recommended usage. On the other way, it's dangerous to use a different encoder when apply a typed operation on `Dataset`. Let's say we have a `Dataset[T]` called `ds`, and its data is encoded from instances of `T` by `enc1`. Now you apply an `Aggregator` on the `ds`, with input encoder `enc2`. We can not guarantee that `enc2` can decode the data of `ds` to `T` instances, because the data is encoded by `enc1`. We can discuss it further if you have other ideas, and close this PR if you think my explaintion makes sense, thanks for working on it anyway! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan i am running into some trouble updating my branch to the latest master. i get errors in tests due to Analyzer.validateTopLevelTupleFields the issue seems to be that in KeyValueGroupedDataset[K, T] the Aggregators are supposed to operate on T, but the logicalPlan at this point already has K appended to T because AppendColumns(func, inputPlan) is applied to the plan before its passed into KeyValueGroupedDataset. so validateTopLevelTupleFields also sees the column for the key in the inputs and believes the deserializer for T is missing a field. any suggestions on how to get around this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 @cloud-fan from the (added) unit tests: ``` val df2 = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDF("i", "j") checkAnswer(df2.groupBy("i").agg(ComplexResultAgg.toColumn), Row("a", Row(2, 4)) :: Row("b", Row(1, 3)) :: Nil) ``` this shows how the underlying type is Row (with a schema consisting of Strings and Ints), and it gets converted to the input type of the Aggregator which is (String, Long), so this involves both conversion and upcast. and: ``` val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i", "j", "k") checkAnswer(df3.groupBy("i").agg(ComplexResultAgg("i", "k")), Row("a", Row(2, 4)) :: Row("b", Row(1, 3)) :: Nil) ``` this is similar to the previous example but i also select the columns i want the Aggregator to operate on (namely columns "i" and "k") --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/13512 Can you give some examples to show how this PR make the aggregator API more friendly and easier to use? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 **[Test build #5 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)** for PR 13512 at commit [`077f782`](https://github.com/apache/spark/commit/077f782cbf1e64439b8d5bb738819faebbf5914b). * This patch **fails MiMa tests**. * This patch **does not merge cleanly**. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/5/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 Build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/13512 **[Test build #5 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)** for PR 13512 at commit [`077f782`](https://github.com/apache/spark/commit/077f782cbf1e64439b8d5bb738819faebbf5914b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org