[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r187761743
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -775,6 +775,178 @@ object functions {
*/
   def var_pop(columnName: String): Column = var_pop(Column(columnName))
 
+  /**
+   * Aggregate function: returns the number of non-null pairs.
+   *
+   * @group agg_funcs
+   * @since 2.4.0
+   */
+  def regr_count(y: Column, x: Column): Column = withAggregateFunction {
--- End diff --

@rxin, how about splitting up this file by group or something, or deprecating all the functions that can be called via `expr` in 3.0.0? To me, it looked a bit odd that some functions existed and some did not. In an actual use case, I had to check every time whether a given function existed.


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-11 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r187751801
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---
@@ -775,6 +775,178 @@ object functions {
*/
   def var_pop(columnName: String): Column = var_pop(Column(columnName))
 
+  /**
+   * Aggregate function: returns the number of non-null pairs.
+   *
+   * @group agg_funcs
+   * @since 2.4.0
+   */
+  def regr_count(y: Column, x: Column): Column = withAggregateFunction {
--- End diff --

Do we need all of these? It seems like users can just invoke `expr` to call them, and this file is getting very long.
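
For context, the `expr` route being suggested looks like this (a minimal, self-contained sketch against this PR's branch; the DataFrame and its column names are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((2.0, 3.0), (3.0, 4.0)).toDF("y", "x")

// The expr route: reach the SQL aggregate without a Scala wrapper.
df.agg(expr("regr_count(y, x)")).show()

// The dedicated wrapper this PR adds to functions.scala would instead be:
// df.agg(regr_count(col("y"), col("x"))).show()
```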



---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21054


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r186616594
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regression.scala ---
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types.{AbstractDataType, DoubleType}
+
+/**
+ * Base trait for all regression functions.
+ */
+trait RegrLike extends AggregateFunction with ImplicitCastInputTypes {
+  def y: Expression
+  def x: Expression
+
+  override def children: Seq[Expression] = Seq(y, x)
+  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType)
+
+  protected def updateIfNotNull(exprs: Seq[Expression]): Seq[Expression] = {
+    assert(aggBufferAttributes.length == exprs.length)
+    val nullableChildren = children.filter(_.nullable)
+    if (nullableChildren.isEmpty) {
+      exprs
+    } else {
+      exprs.zip(aggBufferAttributes).map { case (e, a) =>
+        If(nullableChildren.map(IsNull).reduce(Or), a, e)
+      }
+    }
+  }
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns the number of non-null pairs.",
+  since = "2.4.0")
+case class RegrCount(y: Expression, x: Expression)
+  extends CountLike with RegrLike {
+
+  override lazy val updateExpressions: Seq[Expression] = updateIfNotNull(Seq(count + 1L))
+
+  override def prettyName: String = "regr_count"
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns SUM(x*x)-SUM(x)*SUM(x)/N. Any pair with a NULL is ignored.",
--- End diff --

It is reasonable to follow Hive. Personally, I like DB2 or Oracle, because these commercial DBMSs are normally more professional. :)


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-04 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r186020071
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regression.scala ---
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types.{AbstractDataType, DoubleType}
+
+/**
+ * Base trait for all regression functions.
+ */
+trait RegrLike extends AggregateFunction with ImplicitCastInputTypes {
+  def y: Expression
+  def x: Expression
+
+  override def children: Seq[Expression] = Seq(y, x)
+  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType)
+
+  protected def updateIfNotNull(exprs: Seq[Expression]): Seq[Expression] = {
+    assert(aggBufferAttributes.length == exprs.length)
+    val nullableChildren = children.filter(_.nullable)
+    if (nullableChildren.isEmpty) {
+      exprs
+    } else {
+      exprs.zip(aggBufferAttributes).map { case (e, a) =>
+        If(nullableChildren.map(IsNull).reduce(Or), a, e)
+      }
+    }
+  }
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns the number of non-null pairs.",
+  since = "2.4.0")
+case class RegrCount(y: Expression, x: Expression)
+  extends CountLike with RegrLike {
+
+  override lazy val updateExpressions: Seq[Expression] = updateIfNotNull(Seq(count + 1L))
+
+  override def prettyName: String = "regr_count"
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns SUM(x*x)-SUM(x)*SUM(x)/N. Any pair with a NULL is ignored.",
--- End diff --

@gatorsmile What do you think?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-03 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185800917
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regression.scala ---
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types.{AbstractDataType, DoubleType}
+
+/**
+ * Base trait for all regression functions.
+ */
+trait RegrLike extends AggregateFunction with ImplicitCastInputTypes {
+  def y: Expression
+  def x: Expression
+
+  override def children: Seq[Expression] = Seq(y, x)
+  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType)
+
+  protected def updateIfNotNull(exprs: Seq[Expression]): Seq[Expression] = {
+    assert(aggBufferAttributes.length == exprs.length)
+    val nullableChildren = children.filter(_.nullable)
+    if (nullableChildren.isEmpty) {
+      exprs
+    } else {
+      exprs.zip(aggBufferAttributes).map { case (e, a) =>
+        If(nullableChildren.map(IsNull).reduce(Or), a, e)
+      }
+    }
+  }
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns the number of non-null pairs.",
+  since = "2.4.0")
+case class RegrCount(y: Expression, x: Expression)
+  extends CountLike with RegrLike {
+
+  override lazy val updateExpressions: Seq[Expression] = updateIfNotNull(Seq(count + 1L))
+
+  override def prettyName: String = "regr_count"
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns SUM(x*x)-SUM(x)*SUM(x)/N. Any pair with a NULL is ignored.",
--- End diff --

Here I am following Hive. This wording is from the Hive docs, and it reflects how the value is actually computed. I am not sure it is a good idea to change it, since we are not really computing it as `REGR_COUNT(x, y) * VAR_POP(y)`.
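
(For reference, the two descriptions agree algebraically over the N non-null pairs:

```
SUM(x*x) - SUM(x)*SUM(x)/N = N * (SUM(x*x)/N - (SUM(x)/N)^2) = N * VAR_POP(x)
```

so `REGR_SXX` can equivalently be read as `REGR_COUNT * VAR_POP` of the x column; the question is only which formula the doc string quotes, not which value is returned.)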


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-03 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185791203
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regression.scala ---
@@ -0,0 +1,190 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types.{AbstractDataType, DoubleType}
+
+/**
+ * Base trait for all regression functions.
+ */
+trait RegrLike extends AggregateFunction with ImplicitCastInputTypes {
+  def y: Expression
+  def x: Expression
+
+  override def children: Seq[Expression] = Seq(y, x)
+  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType)
+
+  protected def updateIfNotNull(exprs: Seq[Expression]): Seq[Expression] = {
+    assert(aggBufferAttributes.length == exprs.length)
+    val nullableChildren = children.filter(_.nullable)
+    if (nullableChildren.isEmpty) {
+      exprs
+    } else {
+      exprs.zip(aggBufferAttributes).map { case (e, a) =>
+        If(nullableChildren.map(IsNull).reduce(Or), a, e)
+      }
+    }
+  }
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns the number of non-null pairs.",
+  since = "2.4.0")
+case class RegrCount(y: Expression, x: Expression)
+  extends CountLike with RegrLike {
+
+  override lazy val updateExpressions: Seq[Expression] = updateIfNotNull(Seq(count + 1L))
+
+  override def prettyName: String = "regr_count"
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns SUM(x*x)-SUM(x)*SUM(x)/N. Any pair with a NULL is ignored.",
--- End diff --

I found that `REGR_SXX` is usually explained as `REGR_COUNT(x, y) * VAR_POP(y)`, e.g., [1](https://www.ibm.com/support/knowledgecenter/en/SSPT3X_2.1.2/com.ibm.swg.im.infosphere.biginsights.bigsql.doc/doc/bsql_regr_sxx.html) and [2](https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions132.htm). Is it better to follow that?

Same for `REGR_SYY`, etc.


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185783983
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -36,6 +36,8 @@ case class Fact(date: Int, hour: Int, minute: Int, room_name: String, temp: Doub
 class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
   import testImplicits._
 
+  val absTol = 1e-8
--- End diff --

Ah, ok.


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185783832
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -686,4 +687,72 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-23907: regression functions") {
+    val emptyTableData = Seq.empty[(Double, Double)].toDF("a", "b")
+    val correlatedData = Seq[(Double, Double)]((2, 3), (3, 4), (7.5, 8.2), (10.3, 12))
+      .toDF("a", "b")
+    val correlatedDataWithNull = Seq[(java.lang.Double, java.lang.Double)](
+      (2.0, 3.0), (3.0, null), (7.5, 8.2), (10.3, 12.0)).toDF("a", "b")
--- End diff --

ok, thanks!


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185564388
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regression.scala ---
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types.{AbstractDataType, DoubleType}
+
+/**
+ * Base trait for all regression functions.
+ */
+trait RegrLike extends AggregateFunction with ImplicitCastInputTypes {
+  def y: Expression
+  def x: Expression
+
+  override def children: Seq[Expression] = Seq(y, x)
+  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType)
+
+  protected def updateIfNotNull(exprs: Seq[Expression]): Seq[Expression] = {
+    assert(aggBufferAttributes.length == exprs.length)
+    val nullableChildren = children.filter(_.nullable)
+    if (nullableChildren.isEmpty) {
+      exprs
+    } else {
+      exprs.zip(aggBufferAttributes).map { case (e, a) =>
+        If(nullableChildren.map(IsNull).reduce(Or), a, e)
+      }
+    }
+  }
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns the number of non-null pairs.",
+  since = "2.4.0")
+case class RegrCount(y: Expression, x: Expression)
+  extends CountLike with RegrLike {
+
+  override lazy val updateExpressions: Seq[Expression] = updateIfNotNull(Seq(count + 1L))
+
+  override def prettyName: String = "regr_count"
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns SUM(x*x)-SUM(x)*SUM(x)/N. Any pair with a NULL is ignored.",
+  since = "2.4.0")
+case class RegrSXX(y: Expression, x: Expression)
+  extends CentralMomentAgg(x) with RegrLike {
+
+  override protected def momentOrder = 2
+
+  override lazy val updateExpressions: Seq[Expression] = updateIfNotNull(updateExpressionsDef)
+
+  override val evaluateExpression: Expression = {
+    If(n === Literal(0.0), Literal.create(null, DoubleType), m2)
+  }
+
+  override def prettyName: String = "regr_sxx"
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns SUM(y*y)-SUM(y)*SUM(y)/N. Any pair with a NULL is ignored.",
+  since = "2.4.0")
+case class RegrSYY(y: Expression, x: Expression)
+  extends CentralMomentAgg(y) with RegrLike {
+
+  override protected def momentOrder = 2
+
+  override lazy val updateExpressions: Seq[Expression] = updateIfNotNull(updateExpressionsDef)
+
+  override val evaluateExpression: Expression = {
+    If(n === Literal(0.0), Literal.create(null, DoubleType), m2)
+  }
+
+  override def prettyName: String = "regr_syy"
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns the average of x. Any pair with a NULL is ignored.",
+  since = "2.4.0")
+case class RegrAvgX(y: Expression, x: Expression)
+  extends AverageLike(x) with RegrLike {
+
+  override lazy val updateExpressions: Seq[Expression] = updateIfNotNull(updateExpressionsDef)
+
+  override def prettyName: String = "regr_avgx"
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns the average of y. Any pair with a NULL is ignored.",
+  since = "2.4.0")
+case class RegrAvgY(y: Expression, x: Expression)
+  extends AverageLike(y) with RegrLike {
+
+  override lazy val updateExpressions: Seq[Expression] = updateIfNotNull(updateExpressionsDef)
+
+  override def prettyName: String = "regr_avgy"
+}
+
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns the covariance of y and x multiplied for the number of items in the dataset. Any pair with a NULL is ignored.",
+  since = "2.4.0")

[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185468199
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -36,6 +36,8 @@ case class Fact(date: Int, hour: Int, minute: Int, room_name: String, temp: Doub
 class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
   import testImplicits._
 
+  val absTol = 1e-8
--- End diff --

We cannot, because that would cause a compile error; there are two overloaded functions:
```
in class QueryTest, multiple overloaded alternatives of checkAggregatesWithTol define default arguments
```
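
A minimal standalone reproduction of that Scala restriction (hypothetical signatures, not the real `QueryTest` ones):

```scala
class QueryTest {
  // Scala forbids default arguments on more than one overloaded
  // alternative of the same method:
  def checkAggregatesWithTol(actual: Seq[Double], expected: Seq[Double],
      absTol: Double = 1e-8): Unit = ()

  def checkAggregatesWithTol(actual: Double, expected: Double,
      absTol: Double = 1e-8): Unit = ()
}
// error: in class QueryTest, multiple overloaded alternatives of method
// checkAggregatesWithTol define default arguments
```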


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185461822
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala ---
@@ -85,4 +65,29 @@ case class Average(child: Expression) extends DeclarativeAggregate with Implicit
     case _ =>
       Cast(sum, resultType) / Cast(count, resultType)
   }
+
+  protected def updateExpressionsDef: Seq[Expression] = Seq(
+    /* sum = */
+    Add(
+      sum,
+      Coalesce(Cast(child, sumDataType) :: Cast(Literal(0), sumDataType) :: Nil)),
+    /* count = */ If(IsNull(child), count, count + 1L)
+  )
+
+  override lazy val updateExpressions = updateExpressionsDef
+}
+
+@ExpressionDescription(
+  usage = "_FUNC_(expr) - Returns the mean calculated from values of a group.")
+case class Average(child: Expression)
+extends AverageLike(child) with ImplicitCastInputTypes {
--- End diff --

nit: indent


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185462735
  
--- Diff: sql/core/src/test/resources/sql-tests/inputs/udaf-regrfunctions.sql ---
@@ -0,0 +1,61 @@
+--
+--   Licensed to the Apache Software Foundation (ASF) under one or more
+--   contributor license agreements.  See the NOTICE file distributed with
+--   this work for additional information regarding copyright ownership.
+--   The ASF licenses this file to You under the Apache License, Version 2.0
+--   (the "License"); you may not use this file except in compliance with
+--   the License.  You may obtain a copy of the License at
+--
+--  http://www.apache.org/licenses/LICENSE-2.0
+--
+--   Unless required by applicable law or agreed to in writing, software
+--   distributed under the License is distributed on an "AS IS" BASIS,
+--   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+--   See the License for the specific language governing permissions and
+--   limitations under the License.
+--
+
+CREATE OR REPLACE TEMPORARY VIEW t1 AS SELECT * FROM VALUES
+ (101, 1, 1, 1),
+ (201, 2, 1, 1),
+ (301, 3, 1, 1),
+ (401, 4, 1, 11),
+ (501, 5, 1, null),
+ (601, 6, null, 1),
+ (701, 6, null, null),
+ (102, 1, 2, 2),
+ (202, 2, 1, 2),
+ (302, 3, 2, 1),
+ (402, 4, 2, 12),
+ (502, 5, 2, null),
+ (602, 6, null, 2),
+ (702, 6, null, null),
+ (103, 1, 3, 3),
+ (203, 2, 1, 3),
+ (303, 3, 3, 1),
+ (403, 4, 3, 13),
+ (503, 5, 3, null),
+ (603, 6, null, 3),
+ (703, 6, null, null),
+ (104, 1, 4, 4),
+ (204, 2, 1, 4),
+ (304, 3, 4, 1),
+ (404, 4, 4, 14),
+ (504, 5, 4, null),
+ (604, 6, null, 4),
+ (704, 6, null, null),
+ (800, 7, 1, 1)
+as t1(id, px, y, x);
+
+explain select px, var_pop(x), var_pop(y), corr(y,x), covar_samp(y,x), covar_pop(y,x),
--- End diff --

Do we need `explain`?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185461620
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Corr.scala ---
@@ -22,18 +22,14 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.types._
 
 /**
- * Compute Pearson correlation between two expressions.
+ * Base class for computing Pearson correlation between two expressions.
  * When applied on empty data (i.e., count is zero), it returns NULL.
  *
  * Definition of Pearson correlation can be found at
  * http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
  */
-// scalastyle:off line.size.limit
-@ExpressionDescription(
-  usage = "_FUNC_(expr1, expr2) - Returns Pearson coefficient of correlation between a set of number pairs.")
-// scalastyle:on line.size.limit
-case class Corr(x: Expression, y: Expression)
-  extends DeclarativeAggregate with ImplicitCastInputTypes {
+abstract class PearsonCorrelation(x: Expression, y: Expression)
+extends DeclarativeAggregate with ImplicitCastInputTypes {
--- End diff --

nit: indent


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185460008
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala ---
@@ -62,14 +52,6 @@ case class Average(child: Expression) extends DeclarativeAggregate with Implicit
     /* count = */ Literal(0L)
   )
 
-  override lazy val updateExpressions = Seq(
--- End diff --

done


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185460040
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -686,4 +687,72 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
--- End diff --

done


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185460147
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -686,4 +687,72 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-23907: regression functions") {
+    val emptyTableData = Seq.empty[(Double, Double)].toDF("a", "b")
+    val correlatedData = Seq[(Double, Double)]((2, 3), (3, 4), (7.5, 8.2), (10.3, 12))
+      .toDF("a", "b")
+    val correlatedDataWithNull = Seq[(java.lang.Double, java.lang.Double)](
+      (2.0, 3.0), (3.0, null), (7.5, 8.2), (10.3, 12.0)).toDF("a", "b")
--- End diff --

I don't think so, since this is also covered by the SQL tests that were added.


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185448281
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regression.scala ---
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, If, ImplicitCastInputTypes, IsNull, Literal, Or}
--- End diff --

nit: Would it be possible to use 
`org.apache.spark.sql.catalyst.expressions._`?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185444248
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regression.scala ---
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, If, ImplicitCastInputTypes, IsNull, Literal, Or}
+import org.apache.spark.sql.types.{AbstractDataType, DoubleType}
+
+/**
+ * Base trait for all regression functions.
+ */
+trait RegrBase extends AggregateFunction with ImplicitCastInputTypes {
--- End diff --

it needs to be a trait to be mixed in with the other abstract classes 
(`CountLike`, `AggregateLike`, `CentralMomentAgg`, ...)
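
A self-contained sketch of the constraint, with stand-in types rather than the real Catalyst ones:

```scala
// Stand-ins, just to show the inheritance shapes:
trait Expression
trait AggregateFunction
abstract class DeclarativeAggregate

// Each concrete regr_* function already extends one abstract class...
abstract class CountLike extends DeclarativeAggregate

// ...so the shared regression logic must come in as a trait: a Scala class
// may extend only one class, but may mix in any number of traits.
trait RegrBase extends AggregateFunction {
  def y: Expression
  def x: Expression
}

case class RegrCount(y: Expression, x: Expression) extends CountLike with RegrBase
```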


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185411834
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regression.scala ---
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, If, ImplicitCastInputTypes, IsNull, Literal, Or}
+import org.apache.spark.sql.types.{AbstractDataType, DoubleType}
+
+/**
+ * Base trait for all regression functions.
+ */
+trait RegrBase extends AggregateFunction with ImplicitCastInputTypes {
--- End diff --

`abstract class`?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185409726
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Corr.scala ---
@@ -51,7 +47,27 @@ case class Corr(x: Expression, y: Expression)
 
   override val initialValues: Seq[Expression] = Array.fill(6)(Literal(0.0))
 
-  override val updateExpressions: Seq[Expression] = {
+  override lazy val updateExpressions: Seq[Expression] = updateExpressionsDef
+
+  override val mergeExpressions: Seq[Expression] = {
+
--- End diff --

nit: Remove the blank line (I know this is not related to this pr though...)


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185409402
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/CentralMomentAgg.scala ---
@@ -128,6 +100,36 @@ abstract class CentralMomentAgg(child: Expression)
 
     trimHigherOrder(Seq(newN, newAvg, newM2, newM3, newM4))
   }
+
+  def updateExpressionsDef: Seq[Expression] = {
--- End diff --

`protected def`?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185407773
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala ---
@@ -85,4 +67,28 @@ case class Average(child: Expression) extends DeclarativeAggregate with Implicit
     case _ =>
       Cast(sum, resultType) / Cast(count, resultType)
   }
+
+  def updateExpressionsDef: Seq[Expression] = Seq(
--- End diff --

`protected def`?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185407606
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala ---
@@ -85,4 +67,28 @@ case class Average(child: Expression) extends DeclarativeAggregate with Implicit
     case _ =>
       Cast(sum, resultType) / Cast(count, resultType)
   }
+
+  def updateExpressionsDef: Seq[Expression] = Seq(
--- End diff --

How about `doUpdateExpressions`?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185403679
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala ---
@@ -23,23 +23,13 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.util.TypeUtils
 import org.apache.spark.sql.types._
 
-@ExpressionDescription(
-  usage = "_FUNC_(expr) - Returns the mean calculated from values of a group.")
-case class Average(child: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes {
-
-  override def prettyName: String = "avg"
-
-  override def children: Seq[Expression] = child :: Nil
+abstract class AverageAggregate extends DeclarativeAggregate {
--- End diff --

How about `AverageLike`? (It is just a suggestion)


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185402034
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -686,4 +687,72 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
+  test("SPARK-23907: regression functions") {
+    val emptyTableData = Seq.empty[(Double, Double)].toDF("a", "b")
+    val correlatedData = Seq[(Double, Double)]((2, 3), (3, 4), (7.5, 8.2), (10.3, 12))
+      .toDF("a", "b")
+    val correlatedDataWithNull = Seq[(java.lang.Double, java.lang.Double)](
+      (2.0, 3.0), (3.0, null), (7.5, 8.2), (10.3, 12.0)).toDF("a", "b")
--- End diff --

nit: don't we also need tests with `null` in the left-side value?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185401222
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -36,6 +36,8 @@ case class Fact(date: Int, hour: Int, minute: Int, room_name: String, temp: Doub
 class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
   import testImplicits._
 
+  val absTol = 1e-8
--- End diff --

Since we always use `1e-8` for `checkAggregatesWithTol` in this test suite, how about setting this value as the default in the function argument?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185400649
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -686,4 +687,72 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
       }
     }
   }
+
--- End diff --

Don't we need to add tests in `sql/core/src/test/resources/sql-tests` for testing via the Spark SQL parser, too?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-02 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185399774
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Corr.scala ---
@@ -22,18 +22,14 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.types._
 
 /**
- * Compute Pearson correlation between two expressions.
+ * Basee class for computing Pearson correlation between two expressions.
--- End diff --

nit: `Basee` -> `Base`?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-01 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185170928
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala ---
@@ -23,23 +23,13 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.util.TypeUtils
 import org.apache.spark.sql.types._
 
-@ExpressionDescription(
-  usage = "_FUNC_(expr) - Returns the mean calculated from values of a group.")
-case class Average(child: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes {
-
-  override def prettyName: String = "avg"
-
-  override def children: Seq[Expression] = child :: Nil
+abstract class AverageAggregate extends DeclarativeAggregate {
 
   override def nullable: Boolean = true
-
   // Return data type.
   override def dataType: DataType = resultType
 
-  override def inputTypes: Seq[AbstractDataType] = Seq(NumericType)
-
-  override def checkInputDataTypes(): TypeCheckResult =
-    TypeUtils.checkForNumericExpr(child.dataType, "function average")
+  def child: Expression
--- End diff --

How about declaring this as a constructor argument of `AverageAggregate`, the same as the other abstract classes?
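
Roughly this shape, sketched with stand-in types (not the real Catalyst definitions):

```scala
trait Expression
abstract class DeclarativeAggregate

// Current shape in the PR: child is an abstract member.
abstract class AverageAggregateMember extends DeclarativeAggregate {
  def child: Expression
}

// Suggested shape: child as a constructor argument, matching the other
// abstract classes in this PR, e.g. CentralMomentAgg(child).
abstract class AverageAggregateParam(val child: Expression) extends DeclarativeAggregate
```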


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-01 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185170552
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Average.scala ---
@@ -62,14 +52,6 @@ case class Average(child: Expression) extends DeclarativeAggregate with Implicit
     /* count = */ Literal(0L)
   )
 
-  override lazy val updateExpressions = Seq(
--- End diff --

Is it okay for this to remain as `override lazy val updateExpressions = updateExpressionsDef`, the same as in the other abstract classes?


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-05-01 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/21054#discussion_r185171846
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/regression.scala ---
@@ -0,0 +1,193 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, If, ImplicitCastInputTypes, IsNull, Literal, Or}
+import org.apache.spark.sql.types.{AbstractDataType, DoubleType}
+
+/**
+ * Base trait for all regression functions.
+ */
+trait RegrBase extends AggregateFunction with ImplicitCastInputTypes {
+  def y: Expression
+  def x: Expression
+
+  override def children: Seq[Expression] = Seq(y, x)
+  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType, DoubleType)
+
+  def updateIfNotNull(exprs: Seq[Expression]): Seq[Expression] = {
+    assert(aggBufferAttributes.length == exprs.length)
+    val nullableChildren = children.filter(_.nullable)
+    if (nullableChildren.isEmpty) {
+      exprs
+    } else {
+      exprs.zip(aggBufferAttributes).map { case (e, a) =>
+        If(nullableChildren.map(IsNull).reduce(Or), a, e)
+      }
+    }
+  }
+}
+
+
+@ExpressionDescription(
+  usage = "_FUNC_(y, x) - Returns the number of non-null pairs.",
+  since = "2.4.0")
+case class RegrCount(y: Expression, x: Expression) extends CountAggregate
+with RegrBase {
--- End diff --

nit: maybe we need a line break before `extends`:

```scala
case class RegrCount( ... )
  extends CountAggregate with RegrBase {
...
```

Ditto for the following functions.


---




[GitHub] spark pull request #21054: [SPARK-23907][SQL] Add regr_* functions

2018-04-12 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/21054

[SPARK-23907][SQL] Add regr_* functions

## What changes were proposed in this pull request?

The PR introduces regr_slope, regr_intercept, regr_r2, regr_sxx, regr_syy, regr_sxy, regr_avgx, regr_avgy, and regr_count.

The implementation of these functions mirrors Hive's in HIVE-15978.
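
For a quick feel of the API, here is a hypothetical session against this PR's branch, exercising the new functions (data values borrowed from the added tests):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

Seq((2.0, 3.0), (3.0, 4.0), (7.5, 8.2), (10.3, 12.0))
  .toDF("y", "x")
  .createOrReplaceTempView("t")

spark.sql(
  """SELECT regr_count(y, x), regr_avgx(y, x), regr_avgy(y, x),
    |       regr_sxx(y, x), regr_syy(y, x), regr_sxy(y, x),
    |       regr_slope(y, x), regr_intercept(y, x), regr_r2(y, x)
    |FROM t""".stripMargin).show()
```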

## How was this patch tested?

Added UTs (values compared with Hive's results).


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-23907

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21054.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21054


commit 90c266e09e276c37126d8428a4d00f4986d240e5
Author: Marco Gaido 
Date:   2018-04-12T12:27:39Z

[SPARK-23907][SQL] Add regr_* functions




---
