[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-09-21 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
@cloud-fan i thought about this a little more, and my suggested changes to 
the Aggregator api does not allow one to use a different encoder when applying 
a typed operation on Dataset. so i do not think it is dangerous as such.

it does enable usage within the untyped grouping, which is where type 
conversions are already customary anyhow. its not more dangerous than say using 
a udaf in a DataFrame.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
for example with this branch you can do:
```
val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i", "j", 
"k")
df3.groupBy("i").agg(
  ComplexResultAgg.apply("i", "k"),
  SumAgg.apply("j"),
  AverageAgg.apply("j")
)

so these are multiple Aggregators applied in an untyped groupBy, each on 
selected columns.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
well that was sort of what i was trying to achieve. the unit tests i added 
were for using Aggregator for untyped grouping(```groupBy```). 
and i think for it to be useful within that context one should also be able 
to select the columns that the Aggregator operates on.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-06 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13512
  
a possible approach may be to enable untyped grouping(`groupBy`) support 
typed aggregating(`Aggregator`), i.e. something like 
`df.groupBy("i").keyAs[Int].agg(ComplexResultAgg.toColumn)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-06 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
If Aggregator is designed for typed Dataset only then that is a bit of a 
shame, because its a elegant and generic api that should be useful for 
DataFrame too. this causes fragmentation (Aggregator versus 
UserDefinedAggregationFunction). 
I am not sure what the better way to do this is, but i would like a single 
high level Aggregator-like api that i can use in Dataset and DataFrame.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-06 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13512
  
`Aggregator` API is designed for typed `Dataset` only, not for untyped 
`DataFrame`. It can work if users use `Row` as the input type of `Aggregator`, 
but it's not a recommended usage.

On the other way, it's dangerous to use a different encoder when apply a 
typed operation on `Dataset`. Let's say we have a `Dataset[T]` called `ds`, and 
its data is encoded from instances of `T` by `enc1`. Now you apply an 
`Aggregator` on the `ds`, with input encoder `enc2`. We can not guarantee that 
`enc2` can decode the data of `ds` to `T` instances, because the data is 
encoded by `enc1`.

We can discuss it further if you have other ideas, and close this PR if you 
think my explaintion makes sense, thanks for working on it anyway!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-05 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
@cloud-fan i am running into some trouble updating my branch to the latest 
master. i get errors in tests due to Analyzer.validateTopLevelTupleFields

the issue seems to be that in KeyValueGroupedDataset[K, T] the Aggregators 
are supposed to operate on T, but the logicalPlan at this point already has K 
appended to T because AppendColumns(func, inputPlan) is applied to the plan 
before its passed into KeyValueGroupedDataset. so validateTopLevelTupleFields 
also sees the column for the key in the inputs and believes the deserializer 
for T is missing a field.

any suggestions on how to get around this? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-05 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
@cloud-fan from the (added) unit tests:
```
val df2 = Seq("a" -> 1, "a" -> 3, "b" -> 3).toDF("i", "j")
checkAnswer(df2.groupBy("i").agg(ComplexResultAgg.toColumn),
  Row("a", Row(2, 4)) :: Row("b", Row(1, 3)) :: Nil)
```
this shows how the underlying type is Row (with a schema consisting of 
Strings and Ints), and it gets converted to the input type of the Aggregator 
which is (String, Long), so this involves both conversion and upcast.

and:
```
val df3 = Seq(("a", "x", 1), ("a", "y", 3), ("b", "x", 3)).toDF("i", "j", 
"k")
checkAnswer(df3.groupBy("i").agg(ComplexResultAgg("i", "k")),
  Row("a", Row(2, 4)) :: Row("b", Row(1, 3)) :: Nil)
```
this is similar to the previous example but i also select the columns i 
want the Aggregator to operate on (namely columns "i" and "k")


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-05 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/13512
  
Can you give some examples to show how this PR make the aggregator API more 
friendly and easier to use?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
**[Test build #5 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)**
 for PR 13512 at commit 
[`077f782`](https://github.com/apache/spark/commit/077f782cbf1e64439b8d5bb738819faebbf5914b).
 * This patch **fails MiMa tests**.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/5/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
Build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13512: [SPARK-15769][SQL] Add Encoder for input type to Aggrega...

2016-06-04 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/13512
  
**[Test build #5 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/5/consoleFull)**
 for PR 13512 at commit 
[`077f782`](https://github.com/apache/spark/commit/077f782cbf1e64439b8d5bb738819faebbf5914b).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org