[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-12-07 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r155667834
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala ---
@@ -263,6 +262,25 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext {
      ("a", 4), ("b", 3))
  }
 
+  test("typed aggregate: min, max") {
+    val ds = Seq("a" -> 1, "a" -> 3, "b" -> 4, "b" -> -4, "b" -> 0).toDS()
+    checkDataset(
+      ds.groupByKey(_._1).agg(
+        typed.min(_._2), typed.minLong(_._2), typed.max(_._2), typed.maxLong(_._2)),
+      ("a", Some(1.0), Some(1L), Some(3.0), Some(3L)),
+      ("b", Some(-4.0), Some(-4L), Some(4.0), Some(4L)))
+  }
+
+  test("typed aggregate: empty") {
+val empty = Seq.empty[(Double, Double)].toDS
+val f = (x: (Double, Double)) => x._2
+val g = (x: (Long, Long)) => x._2
+checkDataset(
+  empty.agg(typed.sum(f), typed.sumLong(g), typed.avg(f),
--- End diff --

Yes, I did not notice that. Given that it's a change in core, maybe we should 
create a separate JIRA for that and make this one depend on it?


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-12-06 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r155374041
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala ---
@@ -263,6 +262,25 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext {
      ("a", 4), ("b", 3))
  }
 
+  test("typed aggregate: min, max") {
+    val ds = Seq("a" -> 1, "a" -> 3, "b" -> 4, "b" -> -4, "b" -> 0).toDS()
+    checkDataset(
+      ds.groupByKey(_._1).agg(
+        typed.min(_._2), typed.minLong(_._2), typed.max(_._2), typed.maxLong(_._2)),
+      ("a", Some(1.0), Some(1L), Some(3.0), Some(3L)),
+      ("b", Some(-4.0), Some(-4L), Some(4.0), Some(4L)))
+  }
+
+  test("typed aggregate: empty") {
+    val empty = Seq.empty[(Double, Double)].toDS
--- End diff --

That won't change anything, unfortunately. The difference between the empty 
and the non-empty test cases is that the latter does a groupByKey. If that is 
done on an empty dataset, no row is returned at all, so we cannot verify that 
a None is returned.
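
For illustration, a minimal sketch of that difference (assuming a SparkSession with `spark.implicits._` imported and the typed min added in this PR; a sketch only, not test code):

```
val empty = Seq.empty[(String, Double)].toDS()

// Grouped aggregation: an empty Dataset produces zero groups and therefore zero
// result rows, so there is no row in which a None could be observed.
empty.groupByKey(_._1).agg(typed.min(_._2)).show()             // empty result

// Global aggregation (as in the "typed aggregate: empty" test): exactly one row
// is produced even for empty input, which is where the None / null question shows up.
empty.agg(typed.min((x: (String, Double)) => x._2)).show()     // one row
```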


---




[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...

2017-12-05 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18113
  
There is one issue, however, that I am stuck on: the tests for empty sets ("typed 
aggregate: empty") seem to be casting the Options to nulls, resulting in the 
following:

Decoded objects do not match expected objects:
expected: WrappedArray([0.0,0,NaN,None,None,None,None])
actual: WrappedArray([0.0,0,NaN,[null],[null],[null],[null]])

This doesn't happen to non-empty data sets. Does anyone have a clue?


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-12-05 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r155070641
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +96,165 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, MutableDouble, java.lang.Double] {
+  override def zero: MutableDouble = null
+  override def reduce(b: MutableDouble, a: IN): MutableDouble = {
+    if (b == null) {
+      new MutableDouble(f(a))
+    } else {
+      b.value = math.min(b.value, f(a))
+      b
+    }
+  }
+  override def merge(b1: MutableDouble, b2: MutableDouble): MutableDouble = {
+    if (b1 == null) {
+      b2
+    } else if (b2 == null) {
+      b1
+    } else {
+      b1.value = math.min(b1.value, b2.value)
+      b1
+    }
+  }
+  override def finish(reduction: MutableDouble): java.lang.Double = {
+    if (reduction == null) {
+      null
+    } else {
+      reduction.toJavaDouble
+    }
+  }
+
+  override def bufferEncoder: Encoder[MutableDouble] = Encoders.kryo[MutableDouble]
+  override def outputEncoder: Encoder[java.lang.Double] = ExpressionEncoder[java.lang.Double]()
+
+  // Java api support
+  def this(f: MapFunction[IN, java.lang.Double]) = this((x: IN) => f.call(x))
+  def toColumnScala: TypedColumn[IN, Double] = {
--- End diff --

@cloud-fan I agree that's the best option. Made some slight changes, but it's 
implemented now.
There is one issue, however, that I am stuck on: the tests for empty sets ("typed 
aggregate: empty") seem to be casting the Options to nulls, resulting in the 
following:

Decoded objects do not match expected objects:
expected: WrappedArray([0.0,0,NaN,None,None,None,None])
actual:   WrappedArray([0.0,0,NaN,[null],[null],[null],[null]])

This doesn't happen to non-empty data sets. Do you have any clue?



---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-12-05 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r155068746
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +96,165 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, MutableDouble, java.lang.Double] {
+  override def zero: MutableDouble = null
+  override def reduce(b: MutableDouble, a: IN): MutableDouble = {
+    if (b == null) {
+      new MutableDouble(f(a))
+    } else {
+      b.value = math.min(b.value, f(a))
+      b
+    }
+  }
+  override def merge(b1: MutableDouble, b2: MutableDouble): MutableDouble = {
+    if (b1 == null) {
+      b2
+    } else if (b2 == null) {
+      b1
+    } else {
+      b1.value = math.min(b1.value, b2.value)
+      b1
+    }
+  }
+  override def finish(reduction: MutableDouble): java.lang.Double = {
+    if (reduction == null) {
+      null
+    } else {
+      reduction.toJavaDouble
--- End diff --

Done


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-12-05 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r155068709
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala
 ---
@@ -17,8 +17,10 @@
 
 package org.apache.spark.sql.execution.aggregate
 
+import java.lang
--- End diff --

Resolved


---




[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...

2017-12-01 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18113
  
@cloud-fan done, could you please have a look?


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-11-30 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r154172794
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
+  override def reduce(b: Double, a: IN): Double = math.min(b, f(a))
+  override def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
+  override def finish(reduction: Double): Double = {
+    if (Double.PositiveInfinity == reduction) {
--- End diff --

Which of those 3 will we decide on then? None of them is ideal 
unfortunately.

cc @cloud-fan @gatorsmile @HyukjinKwon @srowen




---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-11-24 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r153025270
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
+  override def reduce(b: Double, a: IN): Double = math.min(b, f(a))
+  override def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
+  override def finish(reduction: Double): Double = {
+    if (Double.PositiveInfinity == reduction) {
--- End diff --

Ah sorry, yes, I was only talking about the output type. 
I don't have time right now, but I'll do it over the weekend and tag you 
when I am done.


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-11-24 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r153024799
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
+  override def reduce(b: Double, a: IN): Double = math.min(b, f(a))
+  override def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
+  override def finish(reduction: Double): Double = {
+    if (Double.PositiveInfinity == reduction) {
--- End diff --

Ok, makes sense. What about the ```finish``` return type? Leaving that as a 
Java type would cause the ```this``` and ```toColumnJava``` to be flipped, 
creating a ```toColumnScala``` instead.
What about:
```
override def finish(reduction: java.lang.Double): Double = reduction
```
As it's only in ```finish```, it shouldn't cause much performance overhead, 
since it's not executed many times. It would also reduce complexity a bit.


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-11-24 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r153022290
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
+  override def reduce(b: Double, a: IN): Double = math.min(b, f(a))
+  override def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
+  override def finish(reduction: Double): Double = {
+    if (Double.PositiveInfinity == reduction) {
--- End diff --

Doesn't that boil down to what was there previously? 
https://github.com/apache/spark/pull/18113/commits/51783b55197cea6c130722838ec97ad6df5c92be

```
  override def zero: java.lang.Double = null
  override def reduce(b: java.lang.Double, a: IN): java.lang.Double =
    if (b == null) f(a) else math.max(b, f(a))

  override def merge(b1: java.lang.Double, b2: java.lang.Double): java.lang.Double = {
    if (b1 == null) {
      b2
    } else if (b2 == null) {
      b1
    } else {
      math.max(b1, b2)
    }
  }
  override def finish(reduction: java.lang.Double): java.lang.Double = reduction
```

Here we just return null in case it's an empty set, or if we hit the edge 
case you just mentioned. You rejected it on the 8th of June because you were 
afraid of boxing performance.

---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-11-24 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r153021545
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
+  override def reduce(b: Double, a: IN): Double = math.min(b, f(a))
+  override def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
+  override def finish(reduction: Double): Double = {
+    if (Double.PositiveInfinity == reduction) {
--- End diff --

That's correct. That was part of the discussion above. We used to initialize it 
with null, so that we could then distinguish between these cases. As you can 
read above, that initial proposal was tossed because it didn't meet the ANSI standard.
Another option I just realised would be to initialize it with Double.NaN, 
and then use that as a flag to distinguish between infinity and the initial 
value. Then again, that would not be supported for Longs, as we cannot assign a 
NaN.
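
To make that NaN idea concrete, a minimal sketch for the Double case only (illustrative, not what this PR implements); note that math.min propagates NaN, so the flag has to be checked explicitly, and a genuine NaN input would be indistinguishable from the initial value:

```
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

class TypedMinDouble[IN](val f: IN => Double)
  extends Aggregator[IN, Double, java.lang.Double] {
  // NaN marks "nothing seen yet"; math.min would propagate it, so check explicitly.
  override def zero: Double = Double.NaN
  override def reduce(b: Double, a: IN): Double =
    if (b.isNaN) f(a) else math.min(b, f(a))
  override def merge(b1: Double, b2: Double): Double =
    if (b1.isNaN) b2 else if (b2.isNaN) b1 else math.min(b1, b2)
  // Empty input: the buffer is still NaN, so report null (no minimum).
  override def finish(reduction: Double): java.lang.Double =
    if (reduction.isNaN) null else reduction
  override def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  override def outputEncoder: Encoder[java.lang.Double] = ExpressionEncoder[java.lang.Double]()
}
```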


---




[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...

2017-11-24 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18113
  
@cloud-fan done. Some small whitespace changes remain, since they keep the 
formatting of the functions within the file consistent.


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-11-24 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r153020474
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -38,13 +38,11 @@ class TypedSumDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Dou
 
   // Java api support
   def this(f: MapFunction[IN, java.lang.Double]) = this((x: IN) => f.call(x).asInstanceOf[Double])
-
--- End diff --

Removed some blank lines to make the functions more consistent.


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-11-24 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r153020437
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -81,14 +77,13 @@ class TypedCount[IN](val f: IN => Any) extends Aggregator[IN, Long, Long] {
   }
 }
 
-
 class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long), Double] {
   override def zero: (Double, Long) = (0.0, 0L)
   override def reduce(b: (Double, Long), a: IN): (Double, Long) = (f(a) + b._1, 1 + b._2)
-  override def finish(reduction: (Double, Long)): Double = reduction._1 / reduction._2
   override def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) = {
     (b1._1 + b2._1, b1._2 + b2._2)
   }
+  override def finish(reduction: (Double, Long)): Double = reduction._1 / reduction._2
--- End diff --

Switched finish and merge around to keep the order of the functions consistent.


---




[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...

2017-11-24 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18113
  
@cloud-fan could you have a look please?


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-11-11 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r150392495
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -76,26 +76,126 @@ class TypedCount[IN](val f: IN => Any) extends Aggregator[IN, Long, Long] {
 
   // Java api support
   def this(f: MapFunction[IN, Object]) = this((x: IN) => f.call(x).asInstanceOf[Any])
+
   def toColumnJava: TypedColumn[IN, java.lang.Long] = {
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Long]]
   }
 }
 
+class TypedAverage[IN](val f: IN => Double)
+  extends Aggregator[IN, (Double, Long), Double] {
 
-class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long), Double] {
   override def zero: (Double, Long) = (0.0, 0L)
   override def reduce(b: (Double, Long), a: IN): (Double, Long) = (f(a) + b._1, 1 + b._2)
-  override def finish(reduction: (Double, Long)): Double = reduction._1 / reduction._2
--- End diff --

The order of the functions is now consistent among all aggregation functions: 
zero, reduce, merge, finish. Hence the swap of the merge and finish functions.


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-11-11 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r150392424
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/expressions/javalang/typed.java ---
@@ -74,4 +71,40 @@
   public static  TypedColumn sumLong(MapFunction f) {
--- End diff --

It's already there, a bit higher up in the file.


---




[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...

2017-11-11 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18113
  
Exactly my point. I'll return -/+ inf then for doubles only, and min/max 
values  for longs



---




[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...

2017-11-11 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18113
  
Ok, sounds good. What about doubles? We could return the proper mathematical 
definition, but that is not consistent with longs.



---




[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...

2017-11-11 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18113
  
The min and max of an empty set are defined as +infinity and -infinity respectively: 
https://en.wikipedia.org/wiki/Empty_set
This is supported for Java doubles, but not for longs. We could use Long.MIN 
and Long.MAX values instead?
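
As an illustration of that identity-element idea for the Long min case (a sketch only, not the code in this PR); the Double variants would use Double.PositiveInfinity / Double.NegativeInfinity instead:

```
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

class TypedMinLong[IN](val f: IN => Long) extends Aggregator[IN, Long, Long] {
  override def zero: Long = Long.MaxValue              // stand-in for +infinity: any real element replaces it
  override def reduce(b: Long, a: IN): Long = math.min(b, f(a))
  override def merge(b1: Long, b2: Long): Long = math.min(b1, b2)
  override def finish(reduction: Long): Long = reduction
  override def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  override def outputEncoder: Encoder[Long] = Encoders.scalaLong
}
```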


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-11-11 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r150381736
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala
 ---
@@ -26,43 +26,64 @@ import org.apache.spark.sql.expressions.Aggregator
 // This file defines internal implementations for aggregators.
 

 
+class TypedSumDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, java.lang.Double, java.lang.Double] {
+
+  override def zero: java.lang.Double = 0.0
+  override def reduce(b: java.lang.Double, a: IN): java.lang.Double =
--- End diff --

As discussed  previously the boxing is needed to have appropriate return 
types for min/max. This of course would not be needed if we align it to the 
current (incorrect) return values.
 
I have bounced back and forth between the return values multiple times now, 
so it might be worthwhile to have some more discussion.


---




[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...

2017-11-11 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18113
  
@cloud-fan 
Sorry, I misread the conclusion of the discussion. I reverted the initial API 
to exactly how it was before, while the new functions follow the SQL standard 
you agreed on two weeks ago.


---




[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...

2017-11-09 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18113
  
@gatorsmile @cloud-fan 
Could you have a look please?


---




[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...

2017-10-29 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18113
  
Hi, it has been a while but I can pick it back up when I have time next 
weekend or so if that's OK. 


---




[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-06-13 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r121761524
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala
 ---
@@ -26,43 +26,64 @@ import org.apache.spark.sql.expressions.Aggregator
 // This file defines internal implementations for aggregators.
 

 
+class TypedSumDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, java.lang.Double, java.lang.Double] {
+
+  override def zero: java.lang.Double = null
+  override def reduce(b: java.lang.Double, a: IN): java.lang.Double =
--- End diff --

I'll change it back to make it backwards compatible. What about the new 
ones @rxin ?


---



[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-06-03 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r119996472
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala
 ---
@@ -95,7 +93,123 @@ class TypedAverage[IN](val f: IN => Double) extends 
Aggregator[IN, (Double, Long
 
   // Java api support
   def this(f: MapFunction[IN, java.lang.Double]) = this(x => 
f.call(x).asInstanceOf[Double])
+
   def toColumnJava: TypedColumn[IN, java.lang.Double] = {
 toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, java.lang.Double, java.lang.Double] {
+
+  override def zero: java.lang.Double = null
--- End diff --

@cloud-fan do you agree with this?


---



[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-05-30 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r119175369
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala
 ---
@@ -95,7 +93,123 @@ class TypedAverage[IN](val f: IN => Double) extends 
Aggregator[IN, (Double, Long
 
   // Java api support
   def this(f: MapFunction[IN, java.lang.Double]) = this(x => 
f.call(x).asInstanceOf[Double])
+
   def toColumnJava: TypedColumn[IN, java.lang.Double] = {
 toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, java.lang.Double, java.lang.Double] {
+
+  override def zero: java.lang.Double = null
--- End diff --

`TypedSum` is actually correct because it will return 0 in case of an empty set 
['In mathematics, an empty sum, or nullary sum, is a summation where the number 
of terms is zero. By convention,[1] the value of any empty sum of numbers is the 
additive identity, zero.'](https://en.wikipedia.org/wiki/Empty_sum). One could 
therefore argue that `Sum.scala` is actually wrong because it returns null: 
`emptyTestData.agg(sum('key))`. We could either fix Sum.scala, although that 
might affect existing applications, or align both to return null, even though 
that is not technically correct.
The same goes for `TypedAvg`, which returns Double.NaN instead of null.
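
For reference, a tiny illustration of the two behaviours being contrasted (the `emptyTestData` name, `spark.implicits._`, and `import org.apache.spark.sql.functions.sum` are assumed from the test suite):

```
// Scala's own empty sum follows the convention quoted above:
Seq.empty[Int].sum                 // 0, the additive identity

// whereas the untyped SQL aggregate discussed above yields null for zero rows:
emptyTestData.agg(sum('key))       // a single row containing null rather than 0
```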



---



[GitHub] spark issue #18080: [Spark-20771][SQL] Make weekofyear more intuitive

2017-05-29 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18080
  
This variant is available in other DBs, albeit with slightly different 
function and parameter naming. For example, MySQL allows it via the `week()` 
function: 
http://www.w3resource.com/mysql/date-and-time-functions/mysql-week-function.php

In this case, you pass in an integer that specifies which permutation you 
want. Please note that if you look at the table, the 'Week 1 is the first week 
…' column is the difference between Gregorian and ISO.


---



[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-05-29 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r118939565
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
--- End diff --

Ah, I see my misunderstanding: in reduce I tried to also have an `if` for 
`f(a) == null` because of the previously mentioned implicit casting issue. This 
would force a `java.lang.Double` to be returned by the function, as `Double == 
null` doesn't make sense in Scala.

I have updated the code, please have a look :) Because `OUT` is already a 
`java.lang.Double`, we do not need the `toColumnJava`. As a result of `OUT` 
being `java.lang.Double`, however, we do need a `toColumnScala` to accommodate 
`val f = (x: (Double, Double)) => x._2; empty.agg(typed.min(f)).show()` 


---



[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-05-29 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r118914424
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
--- End diff --

Turns out I made a typo which caused me to miss a permutation of handling 
null in the parameters...

Comparing both solutions (tuple with `OUT` as `java.lang.Double` vs non-tuple 
with both `BUF` and `OUT` as `java.lang.Double`), it seems we have the 
following trade-offs:
- the tuple will require more data to be shuffled around, as we are adding an 
additional value
- the non-tuple solution requires the developer to know a bit about the 
internals, i.e.: 
`val tuple = (x: (Double, Double)) => x._2
emptyDataSet.agg(typed.min(tuple)).show()`
`val nontuple = (x: (Double, java.lang.Double)) => x._2
emptyDataSet.agg(typed.min(nontuple)).show()`

This is because the function `f` passed into typed.min outputs a `BUF`, 
forcing the caller to know about the internals.
Given that users can always implement their own (non-tuple version) if 
needed, I'd argue in favor of the tupled solution because it is a bit more 
developer friendly. What do you think?
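
For concreteness, a minimal sketch of the tupled variant argued for above (illustrative only, not the code in this PR): the buffer carries a count next to the running minimum, so the empty case can be detected without a nullable buffer:

```
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

class TypedMinDouble[IN](val f: IN => Double)
  extends Aggregator[IN, (Double, Long), java.lang.Double] {
  // Buffer = (running minimum, number of elements seen so far).
  override def zero: (Double, Long) = (Double.PositiveInfinity, 0L)
  override def reduce(b: (Double, Long), a: IN): (Double, Long) =
    (math.min(b._1, f(a)), b._2 + 1)
  override def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) =
    (math.min(b1._1, b2._1), b1._2 + b2._2)
  // No elements seen: report null instead of +infinity.
  override def finish(reduction: (Double, Long)): java.lang.Double =
    if (reduction._2 == 0L) null else reduction._1
  override def bufferEncoder: Encoder[(Double, Long)] =
    Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
  override def outputEncoder: Encoder[java.lang.Double] = ExpressionEncoder[java.lang.Double]()
}
```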


---



[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-05-28 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r118843548
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
--- End diff --

@viirya I just had a go at your suggestion, but it seems to be more 
complicated than anticipated. Spark performs some implicit casts (I think as 
part of Catalyst) between `java.lang.Double` and `scala.Double`, causing a 
NullPointerException:
`java.lang.NullPointerException at scala.Predef$.Double2double(Predef.scala:365)`
I am not sure this method is feasible. 

Sample of the `merge` function:
`override def merge(b1: java.lang.Double, b2: java.lang.Double): java.lang.Double = java.lang.Math.min(b1, b2)`
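
A minimal Spark-free illustration of that failure mode (assuming the buffer is the null `zero` value): passing a null `java.lang.Double` where a primitive double is expected goes through `Predef.Double2double`, which unboxes and throws:

```
val b1: java.lang.Double = null          // what a null zero buffer looks like
val b2: java.lang.Double = 1.0

// scala.math.min takes primitive doubles, so b1 is implicitly unboxed via
// Predef.Double2double(b1) -> b1.doubleValue(), which NPEs on null:
math.min(b1, b2)                         // throws java.lang.NullPointerException

// java.lang.Math.min has the same primitive signature, so it fails the same way:
java.lang.Math.min(b1, b2)               // throws java.lang.NullPointerException
```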


---



[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-05-28 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r118841626
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
--- End diff --

Yes that is what I mean, sorry for the confusion. There is therefore no 
nice solution unfortunately. I agree using `java.lang.Double` is probably the 
simplest and therefore the way to go. Let's see what cloud-fan says before I 
update the PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-05-28 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r118840234
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
--- End diff --

Which aggregators do you mean by 'those aggregators'?

Wouldn't it make more sense to put it in an Option? The whole point of 
Datasets is to provide proper typing. If someone prefers the other way, they 
can still do it by passing in a Column instead of a TypedColumn: 
(https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#agg(org.apache.spark.sql.Column,%20org.apache.spark.sql.Column...)


---



[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-05-27 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r118822478
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
--- End diff --

Hmm, it seems that typedAvg (already implemented) returns a NaN, while 
aggregate.Min returns null. 
Aligning it with typedAvg would not be possible for minLong, as NaN is only 
available for Double, of course. Another possibility would be to wrap 
it in an Option type, but that again is not completely in line with aggregate.Min. 
This is because aggregate.Min is expression based, which has built-in support 
for null as it extends aggregate.interfaces.DeclarativeAggregate, whereas the 
typed aggregators extend Aggregator.
Aligning this properly seems like a huge refactor. What do you think the 
best approach is?


---



[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-05-27 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r118821329
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
--- End diff --

Ah yes, you're right, let me have a look.


---



[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-05-27 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18113#discussion_r118821077
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
--- End diff --

Hi, thanks for having a look. This is actually not an issue, because on an 
empty dataset nothing is returned. For more details, you could have a look at 
the existing tests: the 'agg' function is called on a 'KeyValueGroupedDataset' 
object, which is returned by 'groupByKey'. This ensures it's only done per 
key.

I have added an additional unit test to demonstrate.

Regarding Double.PositiveInfinity, I could change it to Double.MaxValue, to be 
in line with Long.MaxValue if you'd prefer that. I personally think infinity makes 
more sense, although that is inconsistent with the Long case because a positive 
infinity is not available for Long.

---



[GitHub] spark pull request #18080: [Spark-20771][SQL] Make weekofyear more intuitive

2017-05-27 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18080#discussion_r118820467
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala ---
@@ -402,23 +402,40 @@ case class DayOfMonth(child: Expression) extends UnaryExpression with ImplicitCa
   }
 }
 
+// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(date) - Returns the week of the year of the given date.",
+  usage = "_FUNC_(date[, format]) - Returns the week of the year of the given date. Defaults to ISO 8601 standard, but can be gregorian specific",
   extended = """
     Examples:
       > SELECT _FUNC_('2008-02-20');
        8
+      > SELECT _FUNC_('2017-01-01', 'gregorian');
+       1
+      > SELECT _FUNC_('2017-01-01', 'iso');
+       52
+      > SELECT _FUNC_('2017-01-01');
+       52
   """)
-case class WeekOfYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes {
+// scalastyle:on line.size.limit
+case class WeekOfYear(child: Expression, format: Expression) extends
+  UnaryExpression with ImplicitCastInputTypes {
+
+  def this(child: Expression) = {
+    this(child, Literal("iso"))
+  }
 
   override def inputTypes: Seq[AbstractDataType] = Seq(DateType)
 
   override def dataType: DataType = IntegerType
 
+  @transient private lazy val minimalDays = {
+    if ("gregorian".equalsIgnoreCase(format.toString)) 1 else 4
--- End diff --

It will still default to the ISO standard with a Monday-Sunday week of course, 
but now users can override it in any way they would like.


---



[GitHub] spark pull request #18080: [Spark-20771][SQL] Make weekofyear more intuitive

2017-05-27 Thread setjet
Github user setjet commented on a diff in the pull request:

https://github.com/apache/spark/pull/18080#discussion_r118820456
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala ---
@@ -402,23 +402,40 @@ case class DayOfMonth(child: Expression) extends UnaryExpression with ImplicitCa
   }
 }
 
+// scalastyle:off line.size.limit
 @ExpressionDescription(
-  usage = "_FUNC_(date) - Returns the week of the year of the given date.",
+  usage = "_FUNC_(date[, format]) - Returns the week of the year of the given date. Defaults to ISO 8601 standard, but can be gregorian specific",
   extended = """
     Examples:
      > SELECT _FUNC_('2008-02-20');
       8
+      > SELECT _FUNC_('2017-01-01', 'gregorian');
+       1
+      > SELECT _FUNC_('2017-01-01', 'iso');
+       52
+      > SELECT _FUNC_('2017-01-01');
+       52
   """)
-case class WeekOfYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes {
+// scalastyle:on line.size.limit
+case class WeekOfYear(child: Expression, format: Expression) extends
+  UnaryExpression with ImplicitCastInputTypes {
+
+  def this(child: Expression) = {
+    this(child, Literal("iso"))
+  }
 
   override def inputTypes: Seq[AbstractDataType] = Seq(DateType)
 
   override def dataType: DataType = IntegerType
 
+  @transient private lazy val minimalDays = {
+    if ("gregorian".equalsIgnoreCase(format.toString)) 1 else 4
--- End diff --

I did a bit of research, and there seem to be no other formats. However, 
some systems (such as MySQL and Java) allow the first day of the week to be 
defined as well. Some countries in the Middle East have a weekend on 
Friday/Saturday, or even Thursday/Friday. 
I will update the PR to allow users to override the first day of the week, 
as well as to specify how the first week is defined (1. ISO standard: the week 
containing more than half of the days, i.e. the Thursday in a Monday-Sunday week; 
2. Gregorian: the week containing the first day of the new year), as sketched below.
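
A minimal standalone sketch of how those two knobs map onto java.util.Calendar (the `weekNumber` helper is hypothetical, purely to illustrate the minimal-days distinction discussed here):

```
import java.util.Calendar

def weekNumber(year: Int, month: Int, day: Int, minimalDaysInFirstWeek: Int): Int = {
  val cal = Calendar.getInstance()
  cal.setFirstDayOfWeek(Calendar.MONDAY)                  // Monday-Sunday week
  cal.setMinimalDaysInFirstWeek(minimalDaysInFirstWeek)   // 4 = ISO style, 1 = first-of-year week
  cal.clear()
  cal.set(year, month - 1, day)                           // Calendar months are 0-based
  cal.get(Calendar.WEEK_OF_YEAR)
}

weekNumber(2017, 1, 1, minimalDaysInFirstWeek = 4)  // 52, matching the 'iso' example above
weekNumber(2017, 1, 1, minimalDaysInFirstWeek = 1)  // 1, matching the 'gregorian' example
```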


---



[GitHub] spark pull request #18125: [SPARK-20891][SQL] Reduce duplicate code typedagg...

2017-05-26 Thread setjet
GitHub user setjet opened a pull request:

https://github.com/apache/spark/pull/18125

[SPARK-20891][SQL] Reduce duplicate code typedaggregators.scala

## What changes were proposed in this pull request?
The aggregators in typedaggregators.scala were polluted with duplicate code, 
which differed only in type. I have extracted the duplicate code into a parent 
class and allowed the child classes to specify the type. 

## How was this patch tested?
Unit tests are already in place in DatasetAggregatorSuite.scala


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/setjet/spark spark-20891

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18125


commit 2e6899fac07926584dc745a27743884db1f48a8a
Author: setjet 
Date:   2017-05-26T20:37:29Z

reduced duplicate code




---



[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...

2017-05-25 Thread setjet
GitHub user setjet opened a pull request:

https://github.com/apache/spark/pull/18113

[SPARK-20890][SQL] Added min and max typed aggregation functions

## What changes were proposed in this pull request?
Typed min and max functions are missing for aggregations done on Datasets. 
These are supported for DataFrames and therefore should also be part of the 
Dataset API.

Please note that it is OK that the min and max functions start the job 
with MAX and MIN values respectively, because only keys that were actually 
retrieved are returned.

## How was this patch tested?
Added some corresponding unit tests.
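
A usage sketch of the added functions, mirroring the unit test quoted earlier in this archive (assumes a SparkSession with `spark.implicits._` imported):

```
import org.apache.spark.sql.expressions.scalalang.typed

val ds = Seq("a" -> 1, "a" -> 3, "b" -> 4, "b" -> -4, "b" -> 0).toDS()

// Per-key typed min/max, analogous to the existing typed.sum / typed.avg:
ds.groupByKey(_._1)
  .agg(typed.min(_._2), typed.max(_._2))
  .show()
// two rows, one per key, carrying that key's minimum and maximum
```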

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/setjet/spark spark-20890

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18113.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18113


commit d7159930d10cff73fb838e51e9971e9857911a5c
Author: setjet 
Date:   2017-05-25T21:08:04Z

added min and max typed aggregation functions




---



[GitHub] spark pull request #18097: [Spark-20873][SQL] Improve the error message for ...

2017-05-24 Thread setjet
GitHub user setjet opened a pull request:

https://github.com/apache/spark/pull/18097

[Spark-20873][SQL] Improve the error message for unsupported Column Type

## What changes were proposed in this pull request?
Upon encountering an invalid column type, the column type object is printed 
rather than its type name. This change improves the error message by outputting 
the type's name.

## How was this patch tested?
Added a simple unit test to verify the contents of the raised exception

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/setjet/spark spark-20873

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18097.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18097


commit 69369fb876cd32e07e00b889ba9af46a831d48ec
Author: setjet 
Date:   2017-05-24T20:10:35Z

added typename and corresponding unit test




---



[GitHub] spark pull request #18094: [Spark-20775][SQL] Added scala support from_json

2017-05-24 Thread setjet
GitHub user setjet opened a pull request:

https://github.com/apache/spark/pull/18094

[Spark-20775][SQL] Added scala support from_json

## What changes were proposed in this pull request?

The from_json function required a java.util.HashMap to be passed in. For other 
functions, a Java wrapper is provided which converts a Java HashMap to a Scala 
Map; in this case only the Java variant was provided, forcing Scala users to 
pass in a java.util.HashMap.

Added the missing wrapper.

## How was this patch tested?
Added a unit test for passing in a Scala Map
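
A usage sketch of the Scala-friendly variant this adds (the `df` DataFrame, its `payload` column, and the schema are made-up examples; assumes `spark.implicits._` for the `$` syntax):

```
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("name", StringType)))

// Scala callers can now pass a plain Scala Map of parser options
// instead of building a java.util.HashMap:
df.select(from_json($"payload", schema, Map("allowUnquotedFieldNames" -> "true")))
```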


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/setjet/spark spark-20775

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18094.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18094


commit f09c477528d92dcdad87976de3e1e733ba2de6fb
Author: setjet 
Date:   2017-05-24T18:22:22Z

changed function to support scala instead of java, and added java wrapper




---



[GitHub] spark issue #18080: [Spark-20771][SQL] Make weekofyear more intuitive

2017-05-24 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18080
  
I agree that we shouldn't change the behavior, hence I suggested we could 
do it the other way around: make a new function for gregorian  instead and 
leave weekofyear as is.

I suppose we could define the function as follows: _FUNC_(date[, gregorian])


---



[GitHub] spark issue #18080: [Spark-20771][SQL] Make weekofyear more intuitive

2017-05-23 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/18080
  
Come to think of it, it might actually be better to switch it around: 
keep ISO 8601 as the weekofyear function, and make a separate function for 
Gregorian, because ISO is the more commonly used convention.



---



[GitHub] spark pull request #18080: [Spark-20771][SQL] Make weekofyear more intuitive

2017-05-23 Thread setjet
GitHub user setjet opened a pull request:

https://github.com/apache/spark/pull/18080

[Spark-20771][SQL] Make weekofyear more intuitive

## What changes were proposed in this pull request?
The current implementation of weekofyear implements ISO8601, which results 
in the following unintuitive behaviour: 

weekofyear("2017-01-01") returns 52 

In MySQL, this would return 1 
(https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_weekofyear),
 although it could return 52 if explicitly requested 
(https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_week).

I therefore think that instead of only changing the behavior as specified in the 
JIRA, it would be better to support both. Hence I've added an additional 
function.

## How was this patch tested?
Added some unit tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/setjet/spark SPARK-20771

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18080.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18080


commit 7235f4a731f83a3a81fd65846179efaf38354bfa
Author: setjet 
Date:   2017-05-24T00:20:30Z

added additional weekofyear function

commit 057ede5b68cc7980987ae181156f376f84c41809
Author: setjet 
Date:   2017-05-24T00:22:54Z

updated desc




---



[GitHub] spark pull request #14233: [SPARK-16490] [Examples] added a python example f...

2017-05-23 Thread setjet
Github user setjet closed the pull request at:

https://github.com/apache/spark/pull/14233


---



[GitHub] spark issue #17523: [SPARK-20064][PySpark] Bump the PySpark verison number t...

2017-04-04 Thread setjet
Github user setjet commented on the issue:

https://github.com/apache/spark/pull/17523
  
Done :)


---



[GitHub] spark pull request #17523: [SPARK-20064][PySpark]

2017-04-03 Thread setjet
GitHub user setjet opened a pull request:

https://github.com/apache/spark/pull/17523

[SPARK-20064][PySpark]

## What changes were proposed in this pull request?
PySpark version in version.py was lagging behind
Versioning is  in line with PEP 440: 
https://www.python.org/dev/peps/pep-0440/

## How was this patch tested?
Simply rebuild the project with existing tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/setjet/spark SPARK-20064

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17523.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17523


commit a2358f7afa8502b8272a4e7caa6c64ad9f0db27d
Author: Ruben Janssen 
Date:   2016-07-16T15:03:19Z

added a python example for chisq selector in mllib

commit ca7cd787e174e04fbe0fcdcff26c8169450abc7b
Author: Ruben Janssen 
Date:   2016-08-01T18:14:01Z

updated documentation to refer to the example

commit 035aeb63ef8e8f2af8f7ed838d434a069392c336
Author: Ruben Janssen 
Date:   2016-10-16T15:00:44Z

updated with changes suggested by sethah

commit f49e6aea59994c471ea0270b41d5237a1f2a6a47
Author: Ruben Janssen 
Date:   2016-10-16T15:09:46Z

oops forgot to revert back local changes

commit a45ff2fa5e5a3633d3de24c5c2f91d59824b0fc8
Author: setjet 
Date:   2017-04-03T19:18:42Z

Merge remote-tracking branch 'upstream/master'

commit 8363e28e2d400c599052120153fc08eff8253cd5
Author: setjet 
Date:   2017-04-03T19:53:02Z

increased pyspark version

commit 881470d87d499c16cfbf6ea0a265369d60ba8f80
Author: setjet 
Date:   2017-04-03T21:25:37Z

Revert "oops forgot to revert back local changes"

This reverts commit f49e6aea59994c471ea0270b41d5237a1f2a6a47.

commit 09171936d5d1e9293fee6d28c44d74441a4920ab
Author: setjet 
Date:   2017-04-03T21:26:03Z

Revert "updated with changes suggested by sethah"

This reverts commit 035aeb63ef8e8f2af8f7ed838d434a069392c336.

commit c15654aa242d486b5eeb7e22e79915a165f6bb99
Author: setjet 
Date:   2017-04-03T21:26:30Z

Revert "updated documentation to refer to the example"

This reverts commit ca7cd787e174e04fbe0fcdcff26c8169450abc7b.

commit 47e4ab2cf8794718d68b5007f4980aae175eb94e
Author: setjet 
Date:   2017-04-03T21:26:39Z

Revert "added a python example for chisq selector in mllib"

This reverts commit a2358f7afa8502b8272a4e7caa6c64ad9f0db27d.




---