[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r155667834

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala ---
@@ -263,6 +262,25 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext {
       ("a", 4), ("b", 3))
   }
+  test("typed aggregate: min, max") {
+    val ds = Seq("a" -> 1, "a" -> 3, "b" -> 4, "b" -> -4, "b" -> 0).toDS()
+    checkDataset(
+      ds.groupByKey(_._1).agg(
+        typed.min(_._2), typed.minLong(_._2), typed.max(_._2), typed.maxLong(_._2)),
+      ("a", Some(1.0), Some(1L), Some(3.0), Some(3L)),
+      ("b", Some(-4.0), Some(-4L), Some(4.0), Some(4L)))
+  }
+
+  test("typed aggregate: empty") {
+    val empty = Seq.empty[(Double, Double)].toDS
+    val f = (x: (Double, Double)) => x._2
+    val g = (x: (Long, Long)) => x._2
+    checkDataset(
+      empty.agg(typed.sum(f), typed.sumLong(g), typed.avg(f),
--- End diff --

Yes, I did not notice that. Given that it's a change in core, maybe we should create a separate JIRA for that and make this one depend on it?

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r155374041

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala ---
@@ -263,6 +262,25 @@ class DatasetAggregatorSuite extends QueryTest with SharedSQLContext {
       ("a", 4), ("b", 3))
   }
+  test("typed aggregate: min, max") {
+    val ds = Seq("a" -> 1, "a" -> 3, "b" -> 4, "b" -> -4, "b" -> 0).toDS()
+    checkDataset(
+      ds.groupByKey(_._1).agg(
+        typed.min(_._2), typed.minLong(_._2), typed.max(_._2), typed.maxLong(_._2)),
+      ("a", Some(1.0), Some(1L), Some(3.0), Some(3L)),
+      ("b", Some(-4.0), Some(-4L), Some(4.0), Some(4L)))
+  }
+
+  test("typed aggregate: empty") {
+    val empty = Seq.empty[(Double, Double)].toDS
--- End diff --

That won't change anything, unfortunately. The difference between the empty and the non-empty test cases is that the latter does a groupByKey. If that is done on an empty dataset, no Row is returned at all, so we cannot verify that a None is returned.
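[Editor's note] The semantics being tested above can be sketched without Spark. This is a minimal plain-Scala illustration (not the PR's code) of why a global aggregation over an empty input still needs a buffer state meaning "no values seen yet"; `typedMin` and `EmptyMinSketch` are names invented here for illustration, with `Option.empty` playing the role of the null/zero buffer.

```scala
object EmptyMinSketch {
  // Option.empty plays the role of the aggregator's null zero buffer:
  // an empty input folds to None, a non-empty one to Some(min).
  def typedMin(xs: Seq[Double]): Option[Double] =
    xs.foldLeft(Option.empty[Double]) { (buf, x) =>
      Some(buf.fold(x)(math.min(_, x)))
    }

  def main(args: Array[String]): Unit = {
    println(typedMin(Seq(1.0, 3.0, -4.0))) // Some(-4.0)
    println(typedMin(Seq.empty))           // None
  }
}
```

A groupByKey on an empty dataset, by contrast, produces no groups at all, so `finish` is never called and there is no row in which to observe the None.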
[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18113

There is one issue, however, that I am stuck on: the tests for empty sets ("typed aggregate: empty") seem to be casting the options to nulls, resulting in the following:

Decoded objects do not match expected objects:
expected: WrappedArray([0.0,0,NaN,None,None,None,None])
actual: WrappedArray([0.0,0,NaN,[null],[null],[null],[null]])

This doesn't happen with non-empty data sets. Does anyone have a clue?
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r155070641

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +96,165 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, MutableDouble, java.lang.Double] {
+  override def zero: MutableDouble = null
+  override def reduce(b: MutableDouble, a: IN): MutableDouble = {
+    if (b == null) {
+      new MutableDouble(f(a))
+    } else {
+      b.value = math.min(b.value, f(a))
+      b
+    }
+  }
+  override def merge(b1: MutableDouble, b2: MutableDouble): MutableDouble = {
+    if (b1 == null) {
+      b2
+    } else if (b2 == null) {
+      b1
+    } else {
+      b1.value = math.min(b1.value, b2.value)
+      b1
+    }
+  }
+  override def finish(reduction: MutableDouble): java.lang.Double = {
+    if (reduction == null) {
+      null
+    } else {
+      reduction.toJavaDouble
+    }
+  }
+
+  override def bufferEncoder: Encoder[MutableDouble] = Encoders.kryo[MutableDouble]
+  override def outputEncoder: Encoder[java.lang.Double] = ExpressionEncoder[java.lang.Double]()
+
+  // Java api support
+  def this(f: MapFunction[IN, java.lang.Double]) = this((x: IN) => f.call(x))
+  def toColumnScala: TypedColumn[IN, Double] = {
--- End diff --

@cloud-fan I agree that's the best option. Made some slight changes, but it's implemented now. There is one issue, however, that I am stuck on: the tests for empty sets ("typed aggregate: empty") seem to be casting the options to nulls, resulting in the following:

Decoded objects do not match expected objects:
expected: WrappedArray([0.0,0,NaN,None,None,None,None])
actual: WrappedArray([0.0,0,NaN,[null],[null],[null],[null]])

This doesn't happen with non-empty data sets. Do you have any clue?
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r155068746

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +96,165 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, MutableDouble, java.lang.Double] {
+  override def zero: MutableDouble = null
+  ...
+  override def finish(reduction: MutableDouble): java.lang.Double = {
+    if (reduction == null) {
+      null
+    } else {
+      reduction.toJavaDouble
--- End diff --

Done
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r155068709

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -17,8 +17,10 @@ package org.apache.spark.sql.execution.aggregate
+import java.lang
--- End diff --

Resolved
[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18113

@cloud-fan done, could you please have a look?
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r154172794

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
+  override def reduce(b: Double, a: IN): Double = math.min(b, f(a))
+  override def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
+  override def finish(reduction: Double): Double = {
+    if (Double.PositiveInfinity == reduction) {
--- End diff --

Which of those 3 will we decide on then? None of them is ideal, unfortunately. cc @cloud-fan @gatorsmile @HyukjinKwon @srowen
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r153025270

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
+  override def reduce(b: Double, a: IN): Double = math.min(b, f(a))
+  override def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
+  override def finish(reduction: Double): Double = {
+    if (Double.PositiveInfinity == reduction) {
--- End diff --

Ah sorry, yes, I was only talking about the output type. I don't have time right now, but I'll do it over the weekend and mention you when I am done.
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r153024799

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
+  override def reduce(b: Double, a: IN): Double = math.min(b, f(a))
+  override def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
+  override def finish(reduction: Double): Double = {
+    if (Double.PositiveInfinity == reduction) {
--- End diff --

Ok, makes sense. What about the ```finish``` return type? Leaving that as a Java type would cause the ```this``` and ```toColumnJava``` to be flipped, creating a ```toColumnScala``` instead. What about:

```
override def finish(reduction: java.lang.Double): Double = reduction
```

As it's in ```finish```, it shouldn't cause much performance overhead, since it's not executed many times. It would also reduce complexity a bit.
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r153022290

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
+  override def reduce(b: Double, a: IN): Double = math.min(b, f(a))
+  override def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
+  override def finish(reduction: Double): Double = {
+    if (Double.PositiveInfinity == reduction) {
--- End diff --

Doesn't that boil down to what was there previously? https://github.com/apache/spark/pull/18113/commits/51783b55197cea6c130722838ec97ad6df5c92be

```
override def zero: java.lang.Double = null
override def reduce(b: java.lang.Double, a: IN): java.lang.Double =
  if (b == null) f(a) else math.max(b, f(a))
override def merge(b1: java.lang.Double, b2: java.lang.Double): java.lang.Double = {
  if (b1 == null) {
    b2
  } else if (b2 == null) {
    b1
  } else {
    math.max(b1, b2)
  }
}
override def finish(reduction: java.lang.Double): java.lang.Double = reduction
```

Here we just return null in case it's an empty set, or if we have the edge case you just mentioned. You rejected it on the 8th of June because you were afraid of boxing performance.
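[Editor's note] The null-as-zero buffer approach quoted above can be exercised standalone. The sketch below (object name `NullBufferMax` invented here; not the PR's code) shows the `merge` logic with a boxed `java.lang.Double`, where null encodes "this partition saw no values":

```scala
object NullBufferMax {
  type JDouble = java.lang.Double

  // null means "empty partition"; only unbox when both sides are non-null.
  def merge(b1: JDouble, b2: JDouble): JDouble =
    if (b1 == null) b2
    else if (b2 == null) b1
    else (math.max(b1, b2): JDouble)

  def main(args: Array[String]): Unit = {
    println(merge(null, null)) // null: all partitions empty
    println(merge(3.0, null))  // 3.0
    println(merge(3.0, 7.0))   // 7.0
  }
}
```

The cost being debated in the thread is exactly the boxing on every `reduce` call that this representation implies.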
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r153021545

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +94,91 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
+  override def reduce(b: Double, a: IN): Double = math.min(b, f(a))
+  override def merge(b1: Double, b2: Double): Double = math.min(b1, b2)
+  override def finish(reduction: Double): Double = {
+    if (Double.PositiveInfinity == reduction) {
--- End diff --

That's correct. That was part of the discussion above. We used to init it with null, so that we could then distinguish between these cases. As you can read above, that initial proposal was tossed as it didn't meet ANSI standards. Another option I just realised would be to initialize it with Double.NaN, and then use that as a flag to distinguish between infinity and the initial value. Then again, that would not be supported for Longs, as we cannot assign a NaN.
[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18113

@cloud-fan done; some small whitespace changes remain, as they format the functions within the file consistently.
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r153020474

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -38,13 +38,11 @@ class TypedSumDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Dou
   // Java api support
   def this(f: MapFunction[IN, java.lang.Double]) = this((x: IN) => f.call(x).asInstanceOf[Double])
-
--- End diff --

Removed some blank lines to make the functions more consistent.
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r153020437

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -81,14 +77,13 @@ class TypedCount[IN](val f: IN => Any) extends Aggregator[IN, Long, Long] {
   }
 }
-
 class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long), Double] {
   override def zero: (Double, Long) = (0.0, 0L)
   override def reduce(b: (Double, Long), a: IN): (Double, Long) = (f(a) + b._1, 1 + b._2)
-  override def finish(reduction: (Double, Long)): Double = reduction._1 / reduction._2
   override def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) = {
     (b1._1 + b2._1, b1._2 + b2._2)
   }
+  override def finish(reduction: (Double, Long)): Double = reduction._1 / reduction._2
--- End diff --

Switched ```finish``` and ```merge``` around to make the functions consistent.
[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18113

@cloud-fan could you have a look please?
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r150392495

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -76,26 +76,126 @@ class TypedCount[IN](val f: IN => Any) extends Aggregator[IN, Long, Long] {
   // Java api support
   def this(f: MapFunction[IN, Object]) = this((x: IN) => f.call(x).asInstanceOf[Any])
+
   def toColumnJava: TypedColumn[IN, java.lang.Long] = {
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Long]]
   }
 }
+class TypedAverage[IN](val f: IN => Double)
+  extends Aggregator[IN, (Double, Long), Double] {
-class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long), Double] {
   override def zero: (Double, Long) = (0.0, 0L)
   override def reduce(b: (Double, Long), a: IN): (Double, Long) = (f(a) + b._1, 1 + b._2)
-  override def finish(reduction: (Double, Long)): Double = reduction._1 / reduction._2
--- End diff --

The order of functions is consistent among all aggregation functions: zero, reduce, merge, finish. Hence the swap of the merge and finish functions.
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r150392424

--- Diff: sql/core/src/main/java/org/apache/spark/sql/expressions/javalang/typed.java ---
@@ -74,4 +71,40 @@ public static TypedColumn sumLong(MapFunction f) {
--- End diff --

It's already there, a bit higher up in the file.
[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18113

Exactly my point. I'll return -/+ inf then for doubles only, and min/max values for longs.
[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18113

Ok, sounds good. What about doubles? We could return the proper mathematical definition, but that is not consistent with Longs.
[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18113

An empty set's min and max are defined as +infinity and -infinity, respectively: https://en.wikipedia.org/wiki/Empty_set This is supported for Java doubles, but not for Longs. Could we instead use Long.MIN and Long.MAX values?
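[Editor's note] The trade-off raised here can be demonstrated in plain Scala (object name `SentinelZero` invented for illustration): `Double.PositiveInfinity` is a true identity element for min, since no finite input can equal it, whereas `Long.MaxValue` is a legal data value, so a sentinel-based zero makes an empty input indistinguishable from one containing that value.

```scala
object SentinelZero {
  // Double: the sentinel can never be confused with real data.
  val doubleEmptyMin: Double = Seq.empty[Double].foldLeft(Double.PositiveInfinity)(math.min)

  // Long: an empty input and an input holding Long.MaxValue fold to the
  // same result, which is the ambiguity discussed in the thread.
  val longEmptyMin: Long = Seq.empty[Long].foldLeft(Long.MaxValue)(math.min)
  val longRealMin: Long  = Seq(Long.MaxValue).foldLeft(Long.MaxValue)(math.min)

  def main(args: Array[String]): Unit = {
    println(doubleEmptyMin)               // Infinity
    println(longEmptyMin == longRealMin)  // true: indistinguishable
  }
}
```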
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r150381736

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -26,43 +26,64 @@ import org.apache.spark.sql.expressions.Aggregator
 // This file defines internal implementations for aggregators.
+class TypedSumDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, java.lang.Double, java.lang.Double] {
+
+  override def zero: java.lang.Double = 0.0
+  override def reduce(b: java.lang.Double, a: IN): java.lang.Double =
--- End diff --

As discussed previously, the boxing is needed to have appropriate return types for min/max. This of course would not be needed if we align it to the current (incorrect) return values. I have bounced back and forth between the return values multiple times now, so it might be worthwhile to have some more discussion.
[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18113

@cloud-fan Sorry, I misread the conclusion of the discussion. I reverted the initial API to exactly how it was before, while the new functions follow the SQL standard, as you agreed two weeks ago.
[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18113

@gatorsmile @cloud-fan Could you have a look please?
[GitHub] spark issue #18113: [SPARK-20890][SQL] Added min and max typed aggregation f...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18113

Hi, it has been a while, but I can pick it back up when I have time next weekend or so, if that's OK.
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r121761524

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -26,43 +26,64 @@ import org.apache.spark.sql.expressions.Aggregator
 // This file defines internal implementations for aggregators.
+class TypedSumDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, java.lang.Double, java.lang.Double] {
+
+  override def zero: java.lang.Double = null
+  override def reduce(b: java.lang.Double, a: IN): java.lang.Double =
--- End diff --

I'll change it back to make it backwards compatible. What about the new ones, @rxin?

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r119996472

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -95,7 +93,123 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
   // Java api support
   def this(f: MapFunction[IN, java.lang.Double]) = this(x => f.call(x).asInstanceOf[Double])
+
   def toColumnJava: TypedColumn[IN, java.lang.Double] = {
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, java.lang.Double, java.lang.Double] {
+
+  override def zero: java.lang.Double = null
--- End diff --

@cloud-fan do you agree with this?
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r119175369

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -95,7 +93,123 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
   // Java api support
   def this(f: MapFunction[IN, java.lang.Double]) = this(x => f.call(x).asInstanceOf[Double])
+
   def toColumnJava: TypedColumn[IN, java.lang.Double] = {
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double)
+  extends Aggregator[IN, java.lang.Double, java.lang.Double] {
+
+  override def zero: java.lang.Double = null
--- End diff --

`TypedSum` is actually correct because it will return 0 in case of an empty set ['In mathematics, an empty sum, or nullary sum, is a summation where the number of terms is zero. By convention,[1] the value of any empty sum of numbers is the additive identity, zero.'](https://en.wikipedia.org/wiki/Empty_sum). One could therefore argue that `Sum.scala` is actually wrong because it returns null: `emptyTestData.agg(sum('key))`. We could either fix Sum.scala, although that might affect existing applications, or align both to return null, even though that is not technically correct. The same goes for `TypedAvg`, which returns Double.NaN instead of null.
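[Editor's note] The conventions cited here also hold for Scala's own collections, which makes them easy to check (object name `EmptyAggConventions` invented for illustration): an empty sum is the additive identity, and an average computed as sum/count over an empty input is 0.0/0, which is NaN.

```scala
object EmptyAggConventions {
  // Empty sum is the additive identity, per the linked convention.
  val emptySum: Double = Seq.empty[Double].sum

  // sum / count over an empty input: 0.0 / 0 = NaN, matching the
  // TypedAverage behaviour mentioned in the comment.
  val emptyAvg: Double = emptySum / Seq.empty[Double].length

  def main(args: Array[String]): Unit = {
    println(emptySum) // 0.0
    println(emptyAvg) // NaN
  }
}
```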
[GitHub] spark issue #18080: [Spark-20771][SQL] Make weekofyear more intuitive
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18080

This variant is available in other DBs, albeit with slightly different function and parameter naming. For example, MySQL allows it via the `week()` function: http://www.w3resource.com/mysql/date-and-time-functions/mysql-week-function.php In this case, you pass in an integer that specifies which permutation you want. Please note that if you look at the table, the 'Week 1 is the first week …' column is the difference between Gregorian and ISO.
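[Editor's note] The ISO-vs-Sunday-start distinction being discussed can be seen with the JDK's own `java.time.temporal.WeekFields` (the date 2017-01-01 is chosen here purely for illustration, because it fell on a Sunday): under ISO-8601 it belongs to the last week of 2016, while a Sunday-start convention counts it as week 1.

```scala
import java.time.LocalDate
import java.time.temporal.WeekFields

object WeekNumbering {
  val d = LocalDate.of(2017, 1, 1) // a Sunday

  // ISO: weeks start on Monday; week 1 contains the first Thursday.
  val iso: Int = d.get(WeekFields.ISO.weekOfWeekBasedYear)

  // Sunday-start: weeks start on Sunday; week 1 needs only one day.
  val sundayStart: Int = d.get(WeekFields.SUNDAY_START.weekOfWeekBasedYear)

  def main(args: Array[String]): Unit =
    println(s"ISO: $iso, Sunday-start: $sundayStart")
}
```

The two conventions disagree for this date, which is exactly the kind of difference the `week()` mode argument selects between.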
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r118939565

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
--- End diff --

Ah, I see my misunderstanding: in reduce I tried to also have an `if` for `f(a) == null` because of the previously mentioned implicit casting issue. This would force a `java.lang.Double` to be returned by the function, as `Double == null` doesn't make sense in Scala. I have updated the code, please have a look :) Because `OUT` is already a `java.lang.Double`, we do not need the `toColumnJava`. As a result of `OUT` being `java.lang.Double`, however, we do need a `toColumnScala` to accommodate:

```
val f = (x: (Double, Double)) => x._2
empty.agg(typed.min(f)).show()
```
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r118914424

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala ---
@@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long
     toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]]
   }
 }
+
+class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] {
+  override def zero: Double = Double.PositiveInfinity
--- End diff --

Turns out I made a typo which caused me to miss a permutation of handling null in the parameters... Comparing both solutions (tuple with `OUT` as `java.lang.Double` vs non-tuple with both `BUF` and `OUT` as `java.lang.Double`), it seems we have the following trade-offs:

- The tuple will require more data to be shuffled around, as we are adding an additional value.
- The non-tuple solution requires the developer to know a bit about the internals, i.e.:

```
val tuple = (x: (Double, Double)) => x._2
emptyDataSet.agg(typed.min(tuple)).show()

val nontuple = (x: (Double, java.lang.Double)) => x._2
emptyDataSet.agg(typed.min(nontuple)).show()
```

This is because the function `f` passed into typed.min outputs a `BUF`, forcing the caller to know about the internals. Given that users can always implement their own (non-tuple version) if needed, I'd argue in favor of the tupled solution because it is a bit more developer-friendly. What do you think?
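[Editor's note] The tupled-buffer alternative weighed here can be sketched in plain Scala (object name `TupleBufferMin` invented for illustration; not the PR's code). A `(seen, value)` pair avoids boxing and null checks at the cost of shuffling an extra flag per group:

```scala
object TupleBufferMin {
  type Buf = (Boolean, Double) // (any value seen?, running min)

  val zero: Buf = (false, Double.PositiveInfinity)

  def reduce(b: Buf, a: Double): Buf = (true, math.min(b._2, a))
  def merge(b1: Buf, b2: Buf): Buf = (b1._1 || b2._1, math.min(b1._2, b2._2))

  // The flag, not a sentinel value, decides whether the group was empty.
  def finish(b: Buf): Option[Double] = if (b._1) Some(b._2) else None

  def main(args: Array[String]): Unit = {
    println(finish(Seq(4.0, -4.0, 0.0).foldLeft(zero)(reduce))) // Some(-4.0)
    println(finish(zero))                                       // None
  }
}
```

This also sidesteps the Long problem from earlier in the thread, since the emptiness flag works the same for any element type.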
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r118843548 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala --- @@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]] } } + +class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] { + override def zero: Double = Double.PositiveInfinity --- End diff -- @viirya I just had a go at your suggestion, but it seems to be more complicated than anticipated. Spark performs some implicit casts (I think as part of Catalyst) between `java.lang.Double` and `scala.Double`, causing a null pointer exception: `java.lang.NullPointerException at scala.Predef$.Double2double(Predef.scala:365)`. I am not sure this method is feasible. Sample of the `merge` function: `override def merge(b1: java.lang.Double, b2: java.lang.Double): java.lang.Double = java.lang.Math.min(b1, b2)`
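The NPE can be reproduced without Spark: unboxing a null `java.lang.Double` goes through the implicit `Predef.Double2double`, which dereferences it. A null-guarded merge (a sketch, not the PR's code) has to compare against null before any unboxing happens:

```scala
// Unboxing a null boxed Double via the implicit conversion throws NPE:
val boxed: java.lang.Double = null
val blewUp =
  try { val d: Double = boxed; false }
  catch { case _: NullPointerException => true }

// Sketch of a null-safe merge: guard before java.lang.Math.min unboxes its
// arguments (Math.min takes primitive doubles, so null args would NPE).
def safeMin(a: java.lang.Double, b: java.lang.Double): java.lang.Double =
  if (a == null) b else if (b == null) a else java.lang.Math.min(a, b)
```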
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r118841626 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala --- @@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]] } } + +class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] { + override def zero: Double = Double.PositiveInfinity --- End diff -- Yes that is what I mean, sorry for the confusion. There is therefore no nice solution unfortunately. I agree using `java.lang.Double` is probably the simplest and therefore the way to go. Let's see what cloud-fan says before I update the PR.
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r118840234 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala --- @@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]] } } + +class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] { + override def zero: Double = Double.PositiveInfinity --- End diff -- Which aggregators do you mean by 'those aggregators'? Wouldn't it make more sense to put it in an Option? The whole point of Datasets is to provide proper typing. If someone prefers the other way, they can still do it by passing in a Column instead of a TypedColumn: (https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/sql/Dataset.html#agg(org.apache.spark.sql.Column,%20org.apache.spark.sql.Column...)
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r118822478 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala --- @@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]] } } + +class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] { + override def zero: Double = Double.PositiveInfinity --- End diff -- Hmm, it seems that typedAvg (already implemented) returns a NaN, while aggregate.Min returns null. Aligning it with typedAvg would not be possible for minLong, as NaN is only available for Double of course. Another possibility would be to wrap it in an Option type, but that again is not completely in line with aggregate.Min. This is because aggregate.Min is expression based, which has built-in support for null as it extends aggregate.interfaces.DeclarativeAggregate, whereas the typed aggregators extend Aggregator. Aligning this properly seems like a huge refactor. What do you think the best approach is?
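Plain Scala shows what the Option-wrapped contract would look like to callers (a sketch of the semantics only, not Spark code): the empty case becomes `None` instead of a `NaN` or `null` sentinel, and it works identically for Double and Long.

```scala
// Option-based min: total over empty input, no NaN/null sentinel needed.
def minOption(xs: Seq[Double]): Option[Double] =
  xs.reduceOption((a, b) => math.min(a, b))
```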
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r118821329 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala --- @@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]] } } + +class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] { + override def zero: Double = Double.PositiveInfinity --- End diff -- Ah yes, you're right, let me have a look
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18113#discussion_r118821077 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/typedaggregators.scala --- @@ -99,3 +97,67 @@ class TypedAverage[IN](val f: IN => Double) extends Aggregator[IN, (Double, Long toColumn.asInstanceOf[TypedColumn[IN, java.lang.Double]] } } + +class TypedMinDouble[IN](val f: IN => Double) extends Aggregator[IN, Double, Double] { + override def zero: Double = Double.PositiveInfinity --- End diff -- Hi, thanks for having a look. This is actually not an issue, because on an empty dataset nothing is returned. For more details, you could have a look at the existing tests: the 'agg' function is called on a 'KeyValueGroupedDataset' object, which is returned by 'groupByKey'. This ensures it's only done per key. I have added an additional unit test to demonstrate. Regarding Double.PositiveInfinity, I could change it to Double.MaxValue, to be in line with Long.MaxValue if you'd prefer that. I personally think Infinity makes more sense, although that is inconsistent with Long.MaxValue because Long has no PositiveInfinity
[GitHub] spark pull request #18080: [Spark-20771][SQL] Make weekofyear more intuitive
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18080#discussion_r118820467 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -402,23 +402,40 @@ case class DayOfMonth(child: Expression) extends UnaryExpression with ImplicitCa } } +// scalastyle:off line.size.limit @ExpressionDescription( - usage = "_FUNC_(date) - Returns the week of the year of the given date.", + usage = "_FUNC_(date[, format]) - Returns the week of the year of the given date. Defaults to ISO 8601 standard, but can be gregorian specific", extended = """ Examples: > SELECT _FUNC_('2008-02-20'); 8 + > SELECT _FUNC_('2017-01-01', 'gregorian'); + 1 + > SELECT _FUNC_('2017-01-01', 'iso'); + 52 + > SELECT _FUNC_('2017-01-01'); + 52 """) -case class WeekOfYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { +// scalastyle:on line.size.limit +case class WeekOfYear(child: Expression, format: Expression) extends + UnaryExpression with ImplicitCastInputTypes { + + def this(child: Expression) = { +this(child, Literal("iso")) + } override def inputTypes: Seq[AbstractDataType] = Seq(DateType) override def dataType: DataType = IntegerType + @transient private lazy val minimalDays = { +if ("gregorian".equalsIgnoreCase(format.toString)) 1 else 4 --- End diff -- It will still default to the ISO standard with a Monday-Sunday week of course, but now users can override it in any way they would like
[GitHub] spark pull request #18080: [Spark-20771][SQL] Make weekofyear more intuitive
Github user setjet commented on a diff in the pull request: https://github.com/apache/spark/pull/18080#discussion_r118820456 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala --- @@ -402,23 +402,40 @@ case class DayOfMonth(child: Expression) extends UnaryExpression with ImplicitCa } } +// scalastyle:off line.size.limit @ExpressionDescription( - usage = "_FUNC_(date) - Returns the week of the year of the given date.", + usage = "_FUNC_(date[, format]) - Returns the week of the year of the given date. Defaults to ISO 8601 standard, but can be gregorian specific", extended = """ Examples: > SELECT _FUNC_('2008-02-20'); 8 + > SELECT _FUNC_('2017-01-01', 'gregorian'); + 1 + > SELECT _FUNC_('2017-01-01', 'iso'); + 52 + > SELECT _FUNC_('2017-01-01'); + 52 """) -case class WeekOfYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { +// scalastyle:on line.size.limit +case class WeekOfYear(child: Expression, format: Expression) extends + UnaryExpression with ImplicitCastInputTypes { + + def this(child: Expression) = { +this(child, Literal("iso")) + } override def inputTypes: Seq[AbstractDataType] = Seq(DateType) override def dataType: DataType = IntegerType + @transient private lazy val minimalDays = { +if ("gregorian".equalsIgnoreCase(format.toString)) 1 else 4 --- End diff -- I did a bit of research, and there seem to be no other formats. However, some systems (such as MySQL and Java) allow the first day of the week to be defined as well. Some countries in the Middle East have a weekend on Friday/Saturday, or even Thursday/Friday. I will update the PR to allow users to override the first day of the week, as well as specify how the first week is defined (1, ISO standard: the first week with more than half of its days in the new year, i.e. the week containing a Thursday in a Monday-Sunday week; 2, Gregorian: the week containing the first day of the new year)
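Both knobs described here map directly onto `java.util.Calendar`, which can be used to check the expected values from the diff above (this helper is illustrative, not the PR's implementation):

```scala
import java.util.{Calendar, GregorianCalendar}

// firstDayOfWeek and minimalDaysInFirstWeek together define the convention:
// ISO 8601 is (MONDAY, 4); the "gregorian" variant discussed is (SUNDAY, 1).
def weekOfYear(y: Int, m: Int, d: Int, firstDay: Int, minimalDays: Int): Int = {
  val cal = new GregorianCalendar(y, m - 1, d) // Calendar months are 0-based
  cal.setFirstDayOfWeek(firstDay)
  cal.setMinimalDaysInFirstWeek(minimalDays)
  cal.get(Calendar.WEEK_OF_YEAR)
}
```

2017-01-01 (a Sunday) lands in week 52 under the ISO settings and week 1 under the Gregorian-style settings, matching the `SELECT _FUNC_('2017-01-01', ...)` examples in the quoted diff.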
[GitHub] spark pull request #18125: [SPARK-20891][SQL] Reduce duplicate code typedagg...
GitHub user setjet opened a pull request: https://github.com/apache/spark/pull/18125 [SPARK-20891][SQL] Reduce duplicate code typedaggregators.scala ## What changes were proposed in this pull request? The aggregators in typedaggregators.scala were polluted with duplicate code, which differed only in type. I have extracted the duplicate code into a parent class and allowed the child classes to specify the type. ## How was this patch tested? Unit tests are already in place in DatasetAggregatorSuite.scala You can merge this pull request into a Git repository by running: $ git pull https://github.com/setjet/spark spark-20891 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18125.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18125 commit 2e6899fac07926584dc745a27743884db1f48a8a Author: setjet Date: 2017-05-26T20:37:29Z reduced duplicate code
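A rough sketch of the shape of such a refactor (hypothetical names, plain Scala; the real classes extend Spark's `Aggregator`): the parent owns the shared fold logic, and each child supplies only its type-specific zero and combine step.

```scala
// Hypothetical sketch: the parent owns the fold, children own the type.
abstract class TypedMinMaxBase[IN, N](val f: IN => N) {
  def zero: N
  def combine(a: N, b: N): N
  // Shared logic, written once instead of per numeric type:
  final def reduce(buf: N, in: IN): N = combine(buf, f(in))
  final def run(xs: Seq[IN]): N = xs.foldLeft(zero)(reduce)
}

class SketchMinLong[IN](f: IN => Long) extends TypedMinMaxBase[IN, Long](f) {
  def zero: Long = Long.MaxValue
  def combine(a: Long, b: Long): Long = math.min(a, b)
}
```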
[GitHub] spark pull request #18113: [SPARK-20890][SQL] Added min and max typed aggreg...
GitHub user setjet opened a pull request: https://github.com/apache/spark/pull/18113 [SPARK-20890][SQL] Added min and max typed aggregation functions ## What changes were proposed in this pull request? Typed min and max functions are missing for aggregations done on a Dataset. These are supported for DataFrames and therefore should also be part of the Dataset API. Please note that it is OK that the min and max functions seed the aggregation with MAX and MIN values respectively, because only values for retrieved keys are returned. ## How was this patch tested? Added some corresponding unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/setjet/spark spark-20890 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18113.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18113 commit d7159930d10cff73fb838e51e9971e9857911a5c Author: setjet Date: 2017-05-25T21:08:04Z added min and max typed aggregation functions
[GitHub] spark pull request #18097: [Spark-20873][SQL] Improve the error message for ...
GitHub user setjet opened a pull request: https://github.com/apache/spark/pull/18097 [Spark-20873][SQL] Improve the error message for unsupported Column Type ## What changes were proposed in this pull request? Upon encountering an invalid column type, the column type object is printed rather than its name. This change improves the error message by outputting the type's name. ## How was this patch tested? Added a simple unit test to verify the contents of the raised exception You can merge this pull request into a Git repository by running: $ git pull https://github.com/setjet/spark spark-20873 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18097.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18097 commit 69369fb876cd32e07e00b889ba9af46a831d48ec Author: setjet Date: 2017-05-24T20:10:35Z added typename and corresponding unit test
[GitHub] spark pull request #18094: [Spark-20775][SQL] Added scala support from_json
GitHub user setjet opened a pull request: https://github.com/apache/spark/pull/18094 [Spark-20775][SQL] Added scala support from_json ## What changes were proposed in this pull request? The from_json function required callers to pass in a java.util.HashMap. For other functions, a Java wrapper is provided which converts a Java HashMap to a Scala map. Only a Java function was provided in this case, forcing Scala users to pass in a java.util.HashMap. Added the missing wrapper. ## How was this patch tested? Added a unit test for passing in a Scala map You can merge this pull request into a Git repository by running: $ git pull https://github.com/setjet/spark spark-20775 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18094.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18094 commit f09c477528d92dcdad87976de3e1e733ba2de6fb Author: setjet Date: 2017-05-24T18:22:22Z changed function to support scala instead of java, and added java wrapper
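The wrapper pattern being described is small; a sketch of its shape (the `parseWithOptions` body is a hypothetical stand-in, not Spark's from_json): the Scala-facing API takes a `scala.collection.immutable.Map`, and a thin Java overload converts and delegates.

```scala
import scala.collection.JavaConverters._

// Hypothetical stand-in for the real options-consuming function:
def parseWithOptions(options: Map[String, String]): Int = options.size

// The Java wrapper just converts java.util.Map to a Scala Map and delegates.
def parseWithOptionsJava(options: java.util.Map[String, String]): Int =
  parseWithOptions(options.asScala.toMap)
```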
[GitHub] spark issue #18080: [Spark-20771][SQL] Make weekofyear more intuitive
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18080 I agree that we shouldn't change the behavior, hence I suggested we could do it the other way around: make a new function for gregorian instead and leave weekofyear as is. I suppose we could define the function as follows: _FUNC_(date[, gregorian])
[GitHub] spark issue #18080: [Spark-20771][SQL] Make weekofyear more intuitive
Github user setjet commented on the issue: https://github.com/apache/spark/pull/18080 Come to think of it, it might actually be better to switch it around: have ISO 8601 as the weekofyear function, and make a separate function for gregorian, because ISO is the more commonly used convention.
[GitHub] spark pull request #18080: [Spark-20771][SQL] Make weekofyear more intuitive
GitHub user setjet opened a pull request: https://github.com/apache/spark/pull/18080 [Spark-20771][SQL] Make weekofyear more intuitive ## What changes were proposed in this pull request? The current implementation of weekofyear implements ISO8601, which results in the following unintuitive behaviour: weekofyear("2017-01-01") returns 52 In MySQL, this would return 1 (https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_weekofyear), although it could return 52 if specified specifically (https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_week). I therefore think instead of only changing the behavior as specified in the JIRA, it would be better to support both. Hence I've added an additional function. ## How was this patch tested? Added some unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/setjet/spark SPARK-20771 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18080.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18080 commit 7235f4a731f83a3a81fd65846179efaf38354bfa Author: setjet Date: 2017-05-24T00:20:30Z added additional weekofyear function commit 057ede5b68cc7980987ae181156f376f84c41809 Author: setjet Date: 2017-05-24T00:22:54Z updated desc
[GitHub] spark pull request #14233: [SPARK-16490] [Examples] added a python example f...
Github user setjet closed the pull request at: https://github.com/apache/spark/pull/14233
[GitHub] spark issue #17523: [SPARK-20064][PySpark] Bump the PySpark version number t...
Github user setjet commented on the issue: https://github.com/apache/spark/pull/17523 Done :)
[GitHub] spark pull request #17523: [SPARK-20064][PySpark]
GitHub user setjet opened a pull request: https://github.com/apache/spark/pull/17523 [SPARK-20064][PySpark] ## What changes were proposed in this pull request? The PySpark version in version.py was lagging behind. Versioning is in line with PEP 440: https://www.python.org/dev/peps/pep-0440/ ## How was this patch tested? Simply rebuild the project with the existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/setjet/spark SPARK-20064 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17523.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17523 commit a2358f7afa8502b8272a4e7caa6c64ad9f0db27d Author: Ruben Janssen Date: 2016-07-16T15:03:19Z added a python example for chisq selector in mllib commit ca7cd787e174e04fbe0fcdcff26c8169450abc7b Author: Ruben Janssen Date: 2016-08-01T18:14:01Z updated documentation to refer to the example commit 035aeb63ef8e8f2af8f7ed838d434a069392c336 Author: Ruben Janssen Date: 2016-10-16T15:00:44Z updated with changes suggested by sethah commit f49e6aea59994c471ea0270b41d5237a1f2a6a47 Author: Ruben Janssen Date: 2016-10-16T15:09:46Z oops forgot to revert back local changes commit a45ff2fa5e5a3633d3de24c5c2f91d59824b0fc8 Author: setjet Date: 2017-04-03T19:18:42Z Merge remote-tracking branch 'upstream/master' commit 8363e28e2d400c599052120153fc08eff8253cd5 Author: setjet Date: 2017-04-03T19:53:02Z increased pyspark version commit 881470d87d499c16cfbf6ea0a265369d60ba8f80 Author: setjet Date: 2017-04-03T21:25:37Z Revert "oops forgot to revert back local changes" This reverts commit f49e6aea59994c471ea0270b41d5237a1f2a6a47. commit 09171936d5d1e9293fee6d28c44d74441a4920ab Author: setjet Date: 2017-04-03T21:26:03Z Revert "updated with changes suggested by sethah" This reverts commit 035aeb63ef8e8f2af8f7ed838d434a069392c336. 
commit c15654aa242d486b5eeb7e22e79915a165f6bb99 Author: setjet Date: 2017-04-03T21:26:30Z Revert "updated documentation to refer to the example" This reverts commit ca7cd787e174e04fbe0fcdcff26c8169450abc7b. commit 47e4ab2cf8794718d68b5007f4980aae175eb94e Author: setjet Date: 2017-04-03T21:26:39Z Revert "added a python example for chisq selector in mllib" This reverts commit a2358f7afa8502b8272a4e7caa6c64ad9f0db27d.