[GitHub] spark pull request #19321: [SPARK-22100] [SQL] Make percentile_approx suppor...

2017-09-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19321


---




[GitHub] spark pull request #19321: [SPARK-22100] [SQL] Make percentile_approx suppor...

2017-09-25 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19321#discussion_r140764796
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1553,6 +1553,7 @@ options.
 ## Upgrading From Spark SQL 2.2 to 2.3
 
   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
+  - The percentile_approx function previously accepted only double type input and output double type results. Now it supports date type, timestamp type and all numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
--- End diff --

Right, my description is not accurate; I'll correct it.
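
A more accurate wording for the migration note might read (one possible phrasing, not the final text):

  - Since Spark 2.3, `percentile_approx` also accepts date and timestamp input (numeric input was previously cast implicitly to double), and the result type now matches the input type instead of always being double.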


---




[GitHub] spark pull request #19321: [SPARK-22100] [SQL] Make percentile_approx suppor...

2017-09-24 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19321#discussion_r140684015
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1553,6 +1553,7 @@ options.
 ## Upgrading From Spark SQL 2.2 to 2.3
 
   - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
+  - The percentile_approx function previously accepted only double type input and output double type results. Now it supports date type, timestamp type and all numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
--- End diff --

This is not right, is it? Before this PR, we already supported numeric types; we automatically cast them to Double, right?
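
A minimal spark-shell sketch of the difference under discussion (the table `t` and column `v` are hypothetical; run against a build with this PR applied):

```Scala
import spark.implicits._

Seq(1, 2, 3, 4, 5).toDF("v").createOrReplaceTempView("t")

// Before this PR the INT column was implicitly cast to DOUBLE, so the
// approximate median came back as 3.0 (double); after it, the result
// type follows the input type and the same query returns 3 (int).
spark.sql("SELECT percentile_approx(v, 0.5) FROM t").printSchema()
```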


---




[GitHub] spark pull request #19321: [SPARK-22100] [SQL] Make percentile_approx suppor...

2017-09-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19321#discussion_r140647928
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -123,7 +124,13 @@ case class ApproximatePercentile(
     val value = child.eval(inputRow)
     // Ignore empty rows, for example: percentile_approx(null)
     if (value != null) {
-      buffer.add(value.asInstanceOf[Double])
+      // Convert the value to a double value
+      val doubleValue = child.dataType match {
+        case DateType => value.asInstanceOf[Int].toDouble
+        case TimestampType => value.asInstanceOf[Long].toDouble
+        case n: NumericType =>
+          n.numeric.toDouble(value.asInstanceOf[n.InternalType])
--- End diff --

The same here (add the default case suggested below).
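
Applied to the diff above, the conversion might read (a sketch of the requested change; the surrounding names come from ApproximatePercentile.scala):

```Scala
// Convert the raw Catalyst value to a double, failing loudly on any
// unexpected input type instead of relying on a ClassCastException.
val doubleValue = child.dataType match {
  case DateType => value.asInstanceOf[Int].toDouble
  case TimestampType => value.asInstanceOf[Long].toDouble
  case n: NumericType =>
    n.numeric.toDouble(value.asInstanceOf[n.InternalType])
  case other: DataType =>
    throw new UnsupportedOperationException(s"Unexpected data type $other")
}
buffer.add(doubleValue)
```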


---




[GitHub] spark pull request #19321: [SPARK-22100] [SQL] Make percentile_approx suppor...

2017-09-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19321#discussion_r140647683
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -134,7 +141,18 @@ case class ApproximatePercentile(
   }

   override def eval(buffer: PercentileDigest): Any = {
-    val result = buffer.getPercentiles(percentages)
+    val doubleResult = buffer.getPercentiles(percentages)
+    val result = child.dataType match {
+      case DateType => doubleResult.map(_.toInt)
+      case TimestampType => doubleResult.map(_.toLong)
+      case ByteType => doubleResult.map(_.toByte)
+      case ShortType => doubleResult.map(_.toShort)
+      case IntegerType => doubleResult.map(_.toInt)
+      case LongType => doubleResult.map(_.toLong)
+      case FloatType => doubleResult.map(_.toFloat)
+      case DoubleType => doubleResult
+      case _: DecimalType => doubleResult.map(Decimal(_))
--- End diff --

Add
```Scala
case other: DataType =>
  throw new UnsupportedOperationException(s"Unexpected data type $other")
```
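
Slotted into the `eval` match from the diff, the tail would then read (a sketch; the elided cases are unchanged):

```Scala
val result = child.dataType match {
  case DateType => doubleResult.map(_.toInt)
  // ... remaining numeric cases as in the diff above ...
  case _: DecimalType => doubleResult.map(Decimal(_))
  case other: DataType =>
    throw new UnsupportedOperationException(s"Unexpected data type $other")
}
```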


---




[GitHub] spark pull request #19321: [SPARK-22100] [SQL] Make percentile_approx suppor...

2017-09-23 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19321#discussion_r140627874
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala ---
@@ -85,7 +85,8 @@ case class ApproximatePercentile(
   private lazy val accuracy: Int = accuracyExpression.eval().asInstanceOf[Int]

   override def inputTypes: Seq[AbstractDataType] = {
-    Seq(DoubleType, TypeCollection(DoubleType, ArrayType(DoubleType)), IntegerType)
+    Seq(TypeCollection(NumericType, DateType, TimestampType),
+      TypeCollection(DoubleType, ArrayType(DoubleType)), IntegerType)
--- End diff --

This will cause a result difference. We need to document it.
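
A hedged illustration of what this enables (spark-shell style; the view name is hypothetical): a DATE column is now a legal first argument, and the result keeps the input type.

```Scala
import spark.implicits._

Seq("2017-09-01", "2017-09-15", "2017-09-30")
  .toDF("s")
  .selectExpr("CAST(s AS DATE) AS d")
  .createOrReplaceTempView("dates")

// Previously the first argument had to be coercible to DOUBLE, so this
// query failed analysis; with the new TypeCollection it runs and the
// approximate median comes back as a DATE.
spark.sql("SELECT percentile_approx(d, 0.5) FROM dates").show()
```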


---




[GitHub] spark pull request #19321: [SPARK-22100] [SQL] Make percentile_approx suppor...

2017-09-22 Thread wzhfy
Github user wzhfy commented on a diff in the pull request:

https://github.com/apache/spark/pull/19321#discussion_r140429643
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentileSuite.scala ---
@@ -270,7 +270,6 @@ class ApproximatePercentileSuite extends SparkFunSuite {
       percentageExpression = percentageExpression,
       accuracyExpression = Literal(100))

-    val result = wrongPercentage.checkInputDataTypes()
--- End diff --

This is a duplicate of line 274.


---




[GitHub] spark pull request #19321: [SPARK-22100] [SQL] Make percentile_approx suppor...

2017-09-22 Thread wzhfy
GitHub user wzhfy opened a pull request:

https://github.com/apache/spark/pull/19321

[SPARK-22100] [SQL] Make percentile_approx support numeric/date/timestamp types

## What changes were proposed in this pull request?

Currently `percentile_approx` only supports double type.
But since all numeric types, date and timestamp types are represented as 
numerics internally, `percentile_approx` can support them easily.
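
As a quick sketch of the "represented as numerics internally" point (DateTimeUtils is a Catalyst-internal API, shown only for illustration):

```Scala
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Catalyst stores DATE as an Int (days since the epoch) and TIMESTAMP as
// a Long (microseconds since the epoch), so both reduce to doubles for
// the percentile digest just like the numeric types.
val days: Int = DateTimeUtils.fromJavaDate(Date.valueOf("2017-09-22"))
val micros: Long = DateTimeUtils.fromJavaTimestamp(Timestamp.valueOf("2017-09-22 07:28:09"))
println(s"$days days, $micros microseconds")
```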

## How was this patch tested?

Added a new test and modified some existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wzhfy/spark approx_percentile_support_types

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19321.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19321


commit 958715b54d36d4bd5875c53ec59d71e4cd22cade
Author: Zhenhua Wang 
Date:   2017-09-22T07:28:09Z

percentile_approx supports all atomic types




---
