[GitHub] spark pull request #19321: [SPARK-22100] [SQL] Make percentile_approx suppor...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19321

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19321#discussion_r140764796

--- Diff: docs/sql-programming-guide.md ---
@@ -1553,6 +1553,7 @@ options.
 ## Upgrading From Spark SQL 2.2 to 2.3
 - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
+ - The percentile_approx function previously accepted only double type input and output double type results. Now it supports date type, timestamp type and all numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
--- End diff --

Right, my description is not accurate; I'll correct it.
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19321#discussion_r140684015

--- Diff: docs/sql-programming-guide.md ---
@@ -1553,6 +1553,7 @@ options.
 ## Upgrading From Spark SQL 2.2 to 2.3
 - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
+ - The percentile_approx function previously accepted only double type input and output double type results. Now it supports date type, timestamp type and all numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
--- End diff --

This is not right? Before this PR, we already supported numeric types; they were automatically cast to Double, right?
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19321#discussion_r140647928

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -123,7 +124,13 @@ case class ApproximatePercentile(
     val value = child.eval(inputRow)
     // Ignore empty rows, for example: percentile_approx(null)
     if (value != null) {
-      buffer.add(value.asInstanceOf[Double])
+      // Convert the value to a double value
+      val doubleValue = child.dataType match {
+        case DateType => value.asInstanceOf[Int].toDouble
+        case TimestampType => value.asInstanceOf[Long].toDouble
+        case n: NumericType => n.numeric.toDouble(value.asInstanceOf[n.InternalType])
--- End diff --

The same here.
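The widening step under review can be sketched in isolation. This is an illustrative stand-in, not the Spark source: the `SqlType` objects below are hypothetical substitutes for Spark's `DataType` hierarchy, and the point is only that Spark stores `DateType` internally as an `Int` (days since 1970-01-01) and `TimestampType` as a `Long` (microseconds since the epoch), so both widen to `Double` before entering the quantile digest.

```scala
// Hypothetical stand-in for Spark's DataType hierarchy (illustration only).
sealed trait SqlType
case object DateType extends SqlType
case object TimestampType extends SqlType
case object IntegerType extends SqlType
case object DoubleType extends SqlType

// Widen a typed internal value to Double before adding it to the digest,
// mirroring the shape of the pattern match in the diff above.
def toDoubleForDigest(dataType: SqlType, value: Any): Double = dataType match {
  case DateType      => value.asInstanceOf[Int].toDouble   // days since epoch
  case TimestampType => value.asInstanceOf[Long].toDouble  // microseconds since epoch
  case IntegerType   => value.asInstanceOf[Int].toDouble
  case DoubleType    => value.asInstanceOf[Double]
}
```

Because a date is just a day count and a timestamp a microsecond count, ordering is preserved by the conversion, which is all an approximate percentile needs.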
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19321#discussion_r140647683

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -134,7 +141,18 @@ case class ApproximatePercentile(
   }

   override def eval(buffer: PercentileDigest): Any = {
-    val result = buffer.getPercentiles(percentages)
+    val doubleResult = buffer.getPercentiles(percentages)
+    val result = child.dataType match {
+      case DateType => doubleResult.map(_.toInt)
+      case TimestampType => doubleResult.map(_.toLong)
+      case ByteType => doubleResult.map(_.toByte)
+      case ShortType => doubleResult.map(_.toShort)
+      case IntegerType => doubleResult.map(_.toInt)
+      case LongType => doubleResult.map(_.toLong)
+      case FloatType => doubleResult.map(_.toFloat)
+      case DoubleType => doubleResult
+      case _: DecimalType => doubleResult.map(Decimal(_))
--- End diff --

Add

```scala
case other: DataType =>
  throw new UnsupportedOperationException(s"Unexpected data type $other")
```
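The narrowing direction, including the defensive default case suggested in the review, can be sketched outside Spark as follows. Again the `SqlType` objects are hypothetical stand-ins for Spark's `DataType` classes, and `StringType` is included only to exercise the unsupported-type branch:

```scala
// Hypothetical stand-in for Spark's DataType hierarchy (illustration only).
sealed trait SqlType
case object DateType extends SqlType
case object LongType extends SqlType
case object DoubleType extends SqlType
case object StringType extends SqlType  // not supported by the function

// Map the digest's Double percentiles back to the input type; an
// unrecognized type fails loudly instead of falling through silently.
def fromDouble(dataType: SqlType, results: Array[Double]): Seq[Any] =
  dataType match {
    case DateType   => results.map(_.toInt).toSeq   // back to days since epoch
    case LongType   => results.map(_.toLong).toSeq
    case DoubleType => results.toSeq
    case other =>
      throw new UnsupportedOperationException(s"Unexpected data type $other")
  }
```

The default case costs nothing on the happy path but turns a future type-checking gap into an immediate, descriptive error rather than a confusing downstream failure.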
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19321#discussion_r140627874

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -85,7 +85,8 @@ case class ApproximatePercentile(
   private lazy val accuracy: Int = accuracyExpression.eval().asInstanceOf[Int]

   override def inputTypes: Seq[AbstractDataType] = {
-    Seq(DoubleType, TypeCollection(DoubleType, ArrayType(DoubleType)), IntegerType)
+    Seq(TypeCollection(NumericType, DateType, TimestampType),
+      TypeCollection(DoubleType, ArrayType(DoubleType)), IntegerType)
--- End diff --

This will cause a result difference. We need to document it.
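The result difference raised here can be illustrated with plain Scala values standing in for the two behaviors; this is a sketch of the observable change, not the Spark implementation. Before the PR, an integer input was implicitly cast to `Double` and the percentile came back as a `Double`; after it, the digest's `Double` is narrowed back to the input type:

```scala
// What the quantile digest returns internally (always a Double).
val digestResult: Double = 2.0

// Old behavior: the result surfaced to the user stayed a Double.
val oldResult: Double = digestResult

// New behavior: the result is narrowed back to the input type, so an
// integer column now yields an integer percentile (fractional parts,
// if any, are truncated by the .toInt conversion).
val newResult: Int = digestResult.toInt
```

The value is effectively the same, but the result *type* of the same query changes across the upgrade, which is exactly why the reviewer asks for a migration-guide note.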
Github user wzhfy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19321#discussion_r140429643

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentileSuite.scala ---
@@ -270,7 +270,6 @@ class ApproximatePercentileSuite extends SparkFunSuite {
       percentageExpression = percentageExpression,
       accuracyExpression = Literal(100))

-    val result = wrongPercentage.checkInputDataTypes()
--- End diff --

This duplicates line 274.
GitHub user wzhfy opened a pull request:

    https://github.com/apache/spark/pull/19321

[SPARK-22100] [SQL] Make percentile_approx support numeric/date/timestamp types

## What changes were proposed in this pull request?

Currently `percentile_approx` only supports double type. But since all numeric types, date and timestamp types are represented as numerics internally, `percentile_approx` can support them easily.

## How was this patch tested?

Added a new test and modified some existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wzhfy/spark approx_percentile_support_types

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19321.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19321

----
commit 958715b54d36d4bd5875c53ec59d71e4cd22cade
Author: Zhenhua Wang
Date:   2017-09-22T07:28:09Z

    percentile_approx supports all atomic types
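The premise of the PR description, that dates and timestamps "are represented as numerics internally", can be made concrete with standard `java.time` arithmetic. This is an illustrative sketch of the internal encodings (days since 1970-01-01 for dates, microseconds since the epoch for timestamps), not Spark code; the commit date of this PR is used as the sample value:

```scala
import java.time.{LocalDate, ZoneOffset}

// A DateType value is internally an Int: days since 1970-01-01.
val dateAsInt: Int =
  LocalDate.of(2017, 9, 22).toEpochDay.toInt

// A TimestampType value is internally a Long: microseconds since the
// epoch. Here, midnight UTC on the same day, in microseconds.
val timestampAsLong: Long =
  LocalDate.of(2017, 9, 22)
    .atStartOfDay(ZoneOffset.UTC)
    .toInstant
    .toEpochMilli * 1000L
```

Since both encodings are monotonically increasing with time, any rank-based statistic computed on the numeric encoding (such as an approximate percentile) corresponds to the same statistic on the original dates or timestamps.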