[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2017-01-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16329


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2017-01-24 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r97590536
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedTypedAggregation.java
 ---
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql;
+
+// $example on:typed_custom_aggregation$
+import java.io.Serializable;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoder;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.TypedColumn;
+import org.apache.spark.sql.expressions.Aggregator;
+// $example off:typed_custom_aggregation$
+
+public class JavaUserDefinedTypedAggregation {
+
+  // $example on:typed_custom_aggregation$
+  public static class Employee implements Serializable {
+private String name;
+private long salary;
+
+// Constructors, getters, setters...
+// $example off:typed_custom_aggregation$
+public String getName() {
+  return name;
+}
+
+public void setName(String name) {
+  this.name = name;
+}
+
+public long getSalary() {
+  return salary;
+}
+
+public void setSalary(long salary) {
+  this.salary = salary;
+}
+// $example on:typed_custom_aggregation$
+  }
+
+  public static class Average implements Serializable  {
+private long sum;
+private long count;
+
+// Constructors, getters, setters...
+// $example off:typed_custom_aggregation$
+public Average() {
+}
+
+public Average(long sum, long count) {
+  this.sum = sum;
+  this.count = count;
+}
+
+public long getSum() {
+  return sum;
+}
+
+public void setSum(long sum) {
+  this.sum = sum;
+}
+
+public long getCount() {
+  return count;
+}
+
+public void setCount(long count) {
+  this.count = count;
+}
+// $example on:typed_custom_aggregation$
+  }
+
+  public static class MyAverage extends Aggregator {
+// A zero value for this aggregation. Should satisfy the property that 
any b + zero = b
+public Average zero() {
--- End diff --

My bad, I read this incorrectly while skimming.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2017-01-24 Thread aokolnychyi
Github user aokolnychyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r97589440
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedTypedAggregation.java
 ---
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql;
+
+// $example on:typed_custom_aggregation$
+import java.io.Serializable;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoder;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.TypedColumn;
+import org.apache.spark.sql.expressions.Aggregator;
+// $example off:typed_custom_aggregation$
+
+public class JavaUserDefinedTypedAggregation {
+
+  // $example on:typed_custom_aggregation$
+  public static class Employee implements Serializable {
+private String name;
+private long salary;
+
+// Constructors, getters, setters...
+// $example off:typed_custom_aggregation$
+public String getName() {
+  return name;
+}
+
+public void setName(String name) {
+  this.name = name;
+}
+
+public long getSalary() {
+  return salary;
+}
+
+public void setSalary(long salary) {
+  this.salary = salary;
+}
+// $example on:typed_custom_aggregation$
+  }
+
+  public static class Average implements Serializable  {
+private long sum;
+private long count;
+
+// Constructors, getters, setters...
+// $example off:typed_custom_aggregation$
+public Average() {
+}
+
+public Average(long sum, long count) {
+  this.sum = sum;
+  this.count = count;
+}
+
+public long getSum() {
+  return sum;
+}
+
+public void setSum(long sum) {
+  this.sum = sum;
+}
+
+public long getCount() {
+  return count;
+}
+
+public void setCount(long count) {
+  this.count = count;
+}
+// $example on:typed_custom_aggregation$
+  }
+
+  public static class MyAverage extends Aggregator {
+// A zero value for this aggregation. Should satisfy the property that 
any b + zero = b
+public Average zero() {
--- End diff --

@srowen `Average` is a Java bean that holds current sum and count. It is 
defined earlier. Here it represents a zero value. `MyAverage`, in turn, is the 
actual aggregator that accepts instances of the `Employee` class, stores 
intermediate results using an instance of`Average`, and produces `Double` as a 
result. 

I can rename `MyAverage` to `MyAverageAggregator` if this makes things 
clearer. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2017-01-24 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r97580048
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedTypedAggregation.java
 ---
@@ -0,0 +1,160 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql;
+
+// $example on:typed_custom_aggregation$
+import java.io.Serializable;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoder;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.TypedColumn;
+import org.apache.spark.sql.expressions.Aggregator;
+// $example off:typed_custom_aggregation$
+
+public class JavaUserDefinedTypedAggregation {
+
+  // $example on:typed_custom_aggregation$
+  public static class Employee implements Serializable {
+private String name;
+private long salary;
+
+// Constructors, getters, setters...
+// $example off:typed_custom_aggregation$
+public String getName() {
+  return name;
+}
+
+public void setName(String name) {
+  this.name = name;
+}
+
+public long getSalary() {
+  return salary;
+}
+
+public void setSalary(long salary) {
+  this.salary = salary;
+}
+// $example on:typed_custom_aggregation$
+  }
+
+  public static class Average implements Serializable  {
+private long sum;
+private long count;
+
+// Constructors, getters, setters...
+// $example off:typed_custom_aggregation$
+public Average() {
+}
+
+public Average(long sum, long count) {
+  this.sum = sum;
+  this.count = count;
+}
+
+public long getSum() {
+  return sum;
+}
+
+public void setSum(long sum) {
+  this.sum = sum;
+}
+
+public long getCount() {
+  return count;
+}
+
+public void setCount(long count) {
+  this.count = count;
+}
+// $example on:typed_custom_aggregation$
+  }
+
+  public static class MyAverage extends Aggregator {
+// A zero value for this aggregation. Should satisfy the property that 
any b + zero = b
+public Average zero() {
--- End diff --

Is this meant to be `MyAverage`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-22 Thread michalsenkyr
Github user michalsenkyr commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93682192
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedTypedAggregation.scala
 ---
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql
+
+// $example on:typed_custom_aggregation$
+import org.apache.spark.sql.expressions.Aggregator
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.Encoders
+import org.apache.spark.sql.SparkSession
+// $example off:typed_custom_aggregation$
+
+object UserDefinedTypedAggregation {
+
+  // $example on:typed_custom_aggregation$
+  case class Employee(name: String, salary: Long)
+  case class Average(var sum: Long, var count: Long)
+
+  object MyAverage extends Aggregator[Employee, Average, Double] {
+// A zero value for this aggregation. Should satisfy the property that 
any b + zero = b
+def zero: Average = Average(0L, 0L)
+// Combine two values to produce a new value. For performance, the 
function may modify `buffer`
+// and return it instead of constructing a new object
+def reduce(buffer: Average, employee: Employee): Average = {
+  buffer.sum += employee.salary
+  buffer.count += 1
+  buffer
+}
+// Merge two intermediate values
+def merge(b1: Average, b2: Average): Average = Average(b1.sum + 
b2.sum, b1.count + b2.count)
--- End diff --

Personally, I prefer consistency. When I saw this, I immediately wondered 
whether there is a specific reason you did it this way.
I'd rather see both methods use the same paradigm. In this case probably 
the immutable one as the option of mutability is already mentioned in the 
comment above.
Or you can mention it again in the comment on this method if you want to 
provide examples of both. This way it just seems a little confusing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-22 Thread aokolnychyi
Github user aokolnychyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93606051
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedTypedAggregation.scala
 ---
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql
+
+// $example on:typed_custom_aggregation$
+import org.apache.spark.sql.expressions.Aggregator
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.Encoders
+import org.apache.spark.sql.SparkSession
+// $example off:typed_custom_aggregation$
+
+object UserDefinedTypedAggregation {
+
+  // $example on:typed_custom_aggregation$
+  case class Employee(name: String, salary: Long)
+  case class Average(var sum: Long, var count: Long)
+
+  object MyAverage extends Aggregator[Employee, Average, Double] {
+// A zero value for this aggregation. Should satisfy the property that 
any b + zero = b
+def zero: Average = Average(0L, 0L)
+// Combine two values to produce a new value. For performance, the 
function may modify `buffer`
+// and return it instead of constructing a new object
+def reduce(buffer: Average, employee: Employee): Average = {
+  buffer.sum += employee.salary
+  buffer.count += 1
+  buffer
+}
+// Merge two intermediate values
+def merge(b1: Average, b2: Average): Average = Average(b1.sum + 
b2.sum, b1.count + b2.count)
--- End diff --

@michalsenkyr It is not required to create a new object in the `merge` 
method. One can modify the vars and return the existing object just like in the 
`reduce` method.  However, it is less critical here since this method will be 
called on pre-aggregated data and not for every element. On the one hand, I can 
apply here the same approach as in the `reduce` method to make the example 
consistent. On the other hand, the current code shows that it is not mandatory 
to modify vars. Probably, a comment might help. I am not sure which approach is 
better. Therefore, I am open to suggestions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-21 Thread aokolnychyi
Github user aokolnychyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93395745
  
--- Diff: docs/sql-programming-guide.md ---
@@ -382,6 +382,52 @@ For example:
 
 
 
+## Aggregations
+
+The [built-in DataFrames 
functions](api/scala/index.html#org.apache.spark.sql.functions$) mentioned 
+before provide such common aggregations as `count()`, `countDistinct()`, 
`avg()`, `max()`, `min()`, etc.
+While those functions are designed for DataFrames, Spark SQL also has 
type-safe versions for some of them in 

+[Scala](api/scala/index.html#org.apache.spark.sql.expressions.scalalang.typed$)
 and 
+[Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to 
work with strongly typed Datasets.
+Moreover, users are not limited to the predefined aggregate functions and 
can create their own.
--- End diff --

I also thought about this. In my view, it will be appropriate to have a 
separate subsection before Aggregations to show how to apply predefined SQL 
functions, including writing your own UDFs. That's will be worth another pull 
request. Alternatively, I can also try to extend this one to add an example of 
`max()` or `min()`. @marmbrus what's your opinion?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-20 Thread jnh5y
Github user jnh5y commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93262242
  
--- Diff: docs/sql-programming-guide.md ---
@@ -382,6 +382,52 @@ For example:
 
 
 
+## Aggregations
+
+The [built-in DataFrames 
functions](api/scala/index.html#org.apache.spark.sql.functions$) mentioned 
+before provide such common aggregations as `count()`, `countDistinct()`, 
`avg()`, `max()`, `min()`, etc.
+While those functions are designed for DataFrames, Spark SQL also has 
type-safe versions for some of them in 

+[Scala](api/scala/index.html#org.apache.spark.sql.expressions.scalalang.typed$)
 and 
+[Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to 
work with strongly typed Datasets.
+Moreover, users are not limited to the predefined aggregate functions and 
can create their own.
--- End diff --

I think it'd be worth showing an Spark SQL example using the 
included/pre-defined functions.  Since your example implements 'avg', maybe use 
'min' / 'max'?

Alternatively, the example could be added to the SQL statements in the main 
driver for the UserDefinedAggregateFunction implementations.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-20 Thread jnh5y
Github user jnh5y commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93261268
  
--- Diff: docs/sql-programming-guide.md ---
@@ -382,6 +382,52 @@ For example:
 
 
 
+## Aggregations
+
+The [built-in DataFrames 
functions](api/scala/index.html#org.apache.spark.sql.functions$) mentioned 
+before provide such common aggregations as `count()`, `countDistinct()`, 
`avg()`, `max()`, `min()`, etc.
--- End diff --

As a suggestion, I'd change this to read:
"The [built-in DataFrames 
functions](api/scala/index.html#org.apache.spark.sql.functions$) provide common 
aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, and 
`min()`."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-19 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93119785
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedTypedAggregation.scala
 ---
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql
+
+// $example on:typed_custom_aggregation$
+import org.apache.spark.sql.expressions.Aggregator
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.Encoders
+import org.apache.spark.sql.SparkSession
+// $example off:typed_custom_aggregation$
+
+object UserDefinedTypedAggregation {
+
+  // $example on:typed_custom_aggregation$
+  case class Salary(person: String, salary: Long)
+  case class Average(sum: Long, count: Long)
+
+  object MyAverage extends Aggregator[Salary, Average, Double] {
+// A zero value for this aggregation. Should satisfy the property that 
any b + zero = b
+def zero: Average = Average(0L, 0L)
+// Combine two values to produce a new value. For performance, the 
function may modify `b` and
--- End diff --

Same comment here with object reuse.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-19 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93121147
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedTypedAggregation.scala
 ---
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql
+
+// $example on:typed_custom_aggregation$
+import org.apache.spark.sql.expressions.Aggregator
+import org.apache.spark.sql.Encoder
+import org.apache.spark.sql.Encoders
+import org.apache.spark.sql.SparkSession
+// $example off:typed_custom_aggregation$
+
+object UserDefinedTypedAggregation {
+
+  // $example on:typed_custom_aggregation$
+  case class Salary(person: String, salary: Long)
+  case class Average(sum: Long, count: Long)
+
+  object MyAverage extends Aggregator[Salary, Average, Double] {
+// A zero value for this aggregation. Should satisfy the property that 
any b + zero = b
+def zero: Average = Average(0L, 0L)
+// Combine two values to produce a new value. For performance, the 
function may modify `b` and
+// return it instead of constructing a new object for b
+def reduce(b: Average, a: Salary): Average = Average(b.sum + a.salary, 
b.count + 1)
+// Merge two intermediate values
+def merge(b1: Average, b2: Average): Average = Average(b1.sum + 
b2.sum, b1.count + b2.count)
+// Transform the output of the reduction
+def finish(reduction: Average): Double = reduction.sum.toDouble / 
reduction.count
+// Specifies the Encoder for the intermediate value type
+def bufferEncoder: Encoder[Average] = Encoders.product
+// Specifies the Encoder for the final output value type
+def outputEncoder: Encoder[Double] = Encoders.scalaDouble
+  }
+  // $example off:typed_custom_aggregation$
+
+  def main(args: Array[String]): Unit = {
+val spark = SparkSession
+  .builder()
+  .appName("Spark SQL user-defined Datasets aggregation example")
+  .getOrCreate()
+
+import spark.implicits._
+
+// $example on:typed_custom_aggregation$
+val ds = 
spark.read.json("examples/src/main/resources/salaries.json").as[Salary]
+ds.show()
+// +---+--+
+// | person|salary|
+// +---+--+
+// |Michael|  3000|
+// |   Andy|  4500|
+// | Justin|  3500|
+// |  Berta|  4000|
+// +---+--+
+
+val averageSalary = MyAverage.toColumn.name("average_salary")
--- End diff --

Maybe comment what `name` is doing here.  I actually had to look it up.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-19 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93118975
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedTypedAggregation.java
 ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql;
+
+// $example on:typed_custom_aggregation$
+import java.io.Serializable;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoder;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.TypedColumn;
+import org.apache.spark.sql.expressions.Aggregator;
+// $example off:typed_custom_aggregation$
+
+public class JavaUserDefinedTypedAggregation {
+
+  // $example on:typed_custom_aggregation$
+  public static class Salary implements Serializable {
--- End diff --

I might be a little clearer if this was a `Person` with a `name` and 
`salary`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-19 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93118905
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedTypedAggregation.java
 ---
@@ -0,0 +1,154 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql;
+
+// $example on:typed_custom_aggregation$
+import java.io.Serializable;
+
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoder;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.TypedColumn;
+import org.apache.spark.sql.expressions.Aggregator;
+// $example off:typed_custom_aggregation$
+
+public class JavaUserDefinedTypedAggregation {
+
+  // $example on:typed_custom_aggregation$
+  public static class Salary implements Serializable {
+private String person;
+private long salary;
+
+// Constructors, getters, setters...
+// $example off:typed_custom_aggregation$
+public String getPerson() {
+  return person;
+}
+
+public void setPerson(String person) {
+  this.person = person;
+}
+
+public long getSalary() {
+  return salary;
+}
+
+public void setSalary(long salary) {
+  this.salary = salary;
+}
+// $example on:typed_custom_aggregation$
+  }
+
+  public static class Average implements Serializable  {
+private long sum;
+private long count;
+
+// Constructors, getters, setters...
+// $example off:typed_custom_aggregation$
+public Average() {
+}
+
+public Average(long sum, long count) {
+  this.sum = sum;
+  this.count = count;
+}
+
+public long getSum() {
+  return sum;
+}
+
+public void setSum(long sum) {
+  this.sum = sum;
+}
+
+public long getCount() {
+  return count;
+}
+
+public void setCount(long count) {
+  this.count = count;
+}
+// $example on:typed_custom_aggregation$
+  }
+
+  public static class MyAverage extends Aggregator {
+// A zero value for this aggregation. Should satisfy the property that 
any b + zero = b
+public Average zero() {
+  return new Average(0L, 0L);
+}
+// Combine two values to produce a new value. For performance, the 
function may modify `b` and
--- End diff --

Its a little confusing to have the comment here for this optimization, but 
then not implement it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-19 Thread assafmendelson
Github user assafmendelson commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93025608
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedUntypedAggregation.scala
 ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql
+
+// $example on:untyped_custom_aggregation$
+import org.apache.spark.sql.expressions.MutableAggregationBuffer
+import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.SparkSession
+// $example off:untyped_custom_aggregation$
+
+object UserDefinedUntypedAggregation {
+
+  // $example on:untyped_custom_aggregation$
+  object MyAverage extends UserDefinedAggregateFunction {
+// Data types of input arguments
+def inputSchema: StructType = StructType(StructField("salary", 
LongType) :: Nil)
--- End diff --

I would go with inputColumn. 
What I think should be more strongly explained is that this is basically 
the schema of the input for the aggregate function and not for the source 
dataframe.  Basically someone might think that their original dataframe might 
need to have this name for the column.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-19 Thread aokolnychyi
Github user aokolnychyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93019316
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedUntypedAggregation.scala
 ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql
+
+// $example on:untyped_custom_aggregation$
+import org.apache.spark.sql.expressions.MutableAggregationBuffer
+import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.SparkSession
+// $example off:untyped_custom_aggregation$
+
+object UserDefinedUntypedAggregation {
+
+  // $example on:untyped_custom_aggregation$
+  object MyAverage extends UserDefinedAggregateFunction {
+// Data types of input arguments
+def inputSchema: StructType = StructType(StructField("salary", 
LongType) :: Nil)
+// Data types of values in the aggregation buffer
+def bufferSchema: StructType = {
+  StructType(StructField("sum", LongType) :: StructField("count", 
LongType) :: Nil)
+}
+// The data type of the returned value
+def dataType: DataType = DoubleType
+// Whether this function always returns the same output on the 
identical input
+def deterministic: Boolean = true
+// Initializes the given aggregation buffer
+def initialize(buffer: MutableAggregationBuffer): Unit = {
--- End diff --

Agree, I will try to add a small but meaningful explanation here. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-19 Thread aokolnychyi
Github user aokolnychyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r93019035
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedUntypedAggregation.scala
 ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql
+
+// $example on:untyped_custom_aggregation$
+import org.apache.spark.sql.expressions.MutableAggregationBuffer
+import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.SparkSession
+// $example off:untyped_custom_aggregation$
+
+object UserDefinedUntypedAggregation {
+
+  // $example on:untyped_custom_aggregation$
+  object MyAverage extends UserDefinedAggregateFunction {
+// Data types of input arguments
+def inputSchema: StructType = StructType(StructField("salary", 
LongType) :: Nil)
--- End diff --

Yes, your point is definitely reasonable. Now I am thinking whether I 
should keep "salary" here. As an option, I can replace "salary" with 
"inputColumn" or something like this to make `MyAverage` more generic. No 
reason to bound it to salary. What's your opinion?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-19 Thread assafmendelson
Github user assafmendelson commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r92990221
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedUntypedAggregation.scala
 ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql
+
+// $example on:untyped_custom_aggregation$
+import org.apache.spark.sql.expressions.MutableAggregationBuffer
+import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.SparkSession
+// $example off:untyped_custom_aggregation$
+
+object UserDefinedUntypedAggregation {
+
+  // $example on:untyped_custom_aggregation$
+  object MyAverage extends UserDefinedAggregateFunction {
+// Data types of input arguments
+def inputSchema: StructType = StructType(StructField("salary", 
LongType) :: Nil)
+// Data types of values in the aggregation buffer
+def bufferSchema: StructType = {
+  StructType(StructField("sum", LongType) :: StructField("count", 
LongType) :: Nil)
+}
+// The data type of the returned value
+def dataType: DataType = DoubleType
+// Whether this function always returns the same output on the 
identical input
+def deterministic: Boolean = true
+// Initializes the given aggregation buffer
+def initialize(buffer: MutableAggregationBuffer): Unit = {
--- End diff --

I believe an explanation on what MutableAggregationBuffer is should be 
added.
Basically explain how to access it, what it means for it to be mutable 
(including probably explaining that arrays and map types are immutable even if 
the buffer itself is mutable) etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-19 Thread assafmendelson
Github user assafmendelson commented on a diff in the pull request:

https://github.com/apache/spark/pull/16329#discussion_r92989437
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedUntypedAggregation.scala
 ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.sql
+
+// $example on:untyped_custom_aggregation$
+import org.apache.spark.sql.expressions.MutableAggregationBuffer
+import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.SparkSession
+// $example off:untyped_custom_aggregation$
+
+object UserDefinedUntypedAggregation {
+
+  // $example on:untyped_custom_aggregation$
+  object MyAverage extends UserDefinedAggregateFunction {
+// Data types of input arguments
+def inputSchema: StructType = StructType(StructField("salary", 
LongType) :: Nil)
--- End diff --

Maybe add a little explanation here. For example, when I first saw this I 
tried to figure out where "salary" appears in the code as in practice it is 
being accessed by index only (input.getLong(0)). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16329: [SPARK-16046][DOCS] Aggregations in the Spark SQL...

2016-12-18 Thread aokolnychyi
GitHub user aokolnychyi opened a pull request:

https://github.com/apache/spark/pull/16329

[SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide

## What changes were proposed in this pull request?

- A separate subsection for Aggregations under “Getting Started” in the 
Spark SQL programming guide. It mentions which are predefined and how users can 
create their own.
- Examples of using the `UserDefinedAggregateFunction` abstract class for 
untyped aggregations in Java and Scala.
- Examples of using the `Aggregator` abstract class for type-safe 
aggregations in Java and Scala.
- Python is not covered. 
- The PR might not resolve the ticket since I do not know what was exactly 
planned by the author. 

In total, there are four new standalone examples that can be executed via 
`spark-submit` or `run-example`. The updated Spark SQL programming guide 
references to these examples and does not contain hard-coded snippets. 

## How was this patch tested?

The patch was tested locally by building the docs. The examples were run as 
well. 


![image](https://cloud.githubusercontent.com/assets/6235869/21292915/04d9d084-c515-11e6-811a-999d598dffba.png)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aokolnychyi/spark SPARK-16046

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16329.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16329


commit 8c18b2a980ddfe220c380d0a60e379d9fdeac488
Author: aokolnychyi 
Date:   2016-12-18T10:08:03Z

[SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org