[GitHub] spark pull request #17174: [SPARK-19145][SQL] Timestamp to String casting is...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/17174#discussion_r105333124

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -324,14 +324,22 @@ object TypeCoercion {
     // We should cast all relative timestamp/date/string comparison into string comparisons
     // This behaves as a user would expect because timestamp strings sort lexicographically.
     // i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
-    case p @ BinaryComparison(left @ StringType(), right @ DateType()) =>
-      p.makeCopy(Array(left, Cast(right, StringType)))
-    case p @ BinaryComparison(left @ DateType(), right @ StringType()) =>
-      p.makeCopy(Array(Cast(left, StringType), right))
-    case p @ BinaryComparison(left @ StringType(), right @ TimestampType()) =>
-      p.makeCopy(Array(left, Cast(right, StringType)))
-    case p @ BinaryComparison(left @ TimestampType(), right @ StringType()) =>
-      p.makeCopy(Array(Cast(left, StringType), right))
+    // If StringType is foldable then we need to cast String to Date or Timestamp type,
+    // which gives an order-of-magnitude performance gain as well as preserving the behavior
+    // achieved by the expression above, i.e.
+    // TimeStamp(2013-01-01 00:00 ...) < Cast("2014" as timestamp) = true
+    case p @ BinaryComparison(left @ StringType(), right) if dateOrTimestampType(right) =>
+      if (left.foldable) {
+        p.makeCopy(Array(Cast(left, right.dataType), right))

--- End diff --

Yes. You can explicitly cast the string to timestamp, and the speed-up will be substantial.
By default, without the cast, the query runs silently but picks a very bad plan, with no indication to the user whatsoever, and is about an order of magnitude slower. Some related comparison issues, such as `time < 'abc'`, will also run just fine; I think those should fail fast and let the user know about the casting issue. Another problem is with BI tools that generate these SQLs, where the user has no direct control over the SQL. We came across this issue when the same query in Impala was running 10 times faster than in Spark; investigating that led to this bug, and therefore this fix.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
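The trade-off under discussion can be illustrated outside Spark. In the sketch below, `castToTimestamp` is a hypothetical stand-in for a string-to-timestamp cast (it accepts a full timestamp or a bare year for illustration; it is not Spark's actual cast): on values where the cast succeeds, comparing as timestamps agrees with the lexicographic string comparison, while avoiding a per-row timestamp-to-string conversion.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val fullFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Hypothetical stand-in for a string-to-timestamp cast: accepts a full
// "yyyy-MM-dd HH:mm:ss" value or a bare year, returns None otherwise.
def castToTimestamp(s: String): Option[LocalDateTime] =
  try Some(LocalDateTime.parse(s, fullFormat))
  catch {
    case _: Exception =>
      if (s.matches("\\d{4}")) Some(LocalDateTime.of(s.toInt, 1, 1, 0, 0))
      else None
  }

// Pre-PR plan: the timestamp column is rendered as a string for every row.
def ltAsStrings(ts: LocalDateTime, literal: String): Boolean =
  ts.format(fullFormat) < literal

// PR's plan: the foldable literal is cast once; rows compare as timestamps.
def ltAsTimestamps(ts: LocalDateTime, literal: String): Option[Boolean] =
  castToTimestamp(literal).map(t => ts.isBefore(t))
```

Both plans agree that `TimeStamp(2013-01-01 00:00:00) < "2014"` holds; the difference is only where the cast work happens.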
[GitHub] spark pull request #17174: [SPARK-19145][SQL] Timestamp to String casting is...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/17174#discussion_r104481076

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala ---
@@ -324,14 +324,22 @@ object TypeCoercion {
     // We should cast all relative timestamp/date/string comparison into string comparisons
     // This behaves as a user would expect because timestamp strings sort lexicographically.
     // i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
-    case p @ BinaryComparison(left @ StringType(), right @ DateType()) =>
-      p.makeCopy(Array(left, Cast(right, StringType)))
-    case p @ BinaryComparison(left @ DateType(), right @ StringType()) =>
-      p.makeCopy(Array(Cast(left, StringType), right))
-    case p @ BinaryComparison(left @ StringType(), right @ TimestampType()) =>
-      p.makeCopy(Array(left, Cast(right, StringType)))
-    case p @ BinaryComparison(left @ TimestampType(), right @ StringType()) =>
-      p.makeCopy(Array(Cast(left, StringType), right))
+    // If StringType is foldable then we need to cast String to Date or Timestamp type,
+    // which gives an order-of-magnitude performance gain as well as preserving the behavior
+    // achieved by the expression above, i.e.
+    // TimeStamp(2013-01-01 00:00 ...) < Cast("2014" as timestamp) = true
+    case p @ BinaryComparison(left @ StringType(), right) if dateOrTimestampType(right) =>
+      if (left.foldable) {
+        p.makeCopy(Array(Cast(left, right.dataType), right))

--- End diff --

Thanks for pointing that out. After applying the PR, `'10' < current_timestamp` does return `null`, which breaks the semantics. Fixing this issue is very important: we have seen an order-of-magnitude performance difference when the string is correctly cast to timestamp for literals in SQL, and most of these SQLs are generated by some BI tool. Any other suggestion to fix this issue? Shall I make the rule more restrictive so that the null cases are better covered?
i.e. `if (expr.foldable && expr.eval(null) != null) cast to timestamp`
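The more restrictive guard suggested above can be sketched as a plan-time decision (a toy model, not Catalyst code): fold the literal, and only rewrite to a timestamp comparison when the cast yields a non-null value, falling back to the old string comparison otherwise. `parseTs` is a hypothetical stand-in for evaluating `Cast(literal, TimestampType)`.

```scala
// Stand-in for evaluating Cast(literal, TimestampType): None models SQL null.
def parseTs(s: String): Option[Long] =
  try Some(java.sql.Timestamp.valueOf(s).getTime)
  catch { case _: Exception => None }

sealed trait ComparisonPlan
case class CompareAsTimestamps(literalMillis: Long) extends ComparisonPlan
case object CompareAsStrings extends ComparisonPlan

// Rewrite only when the foldable literal casts to a non-null timestamp;
// otherwise keep the old string comparison so '10' < current_timestamp
// does not silently become null.
def choosePlan(foldableLiteral: String): ComparisonPlan =
  parseTs(foldableLiteral) match {
    case Some(ms) => CompareAsTimestamps(ms) // fast path: literal cast once
    case None     => CompareAsStrings        // e.g. '10': keep old semantics
  }
```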
[GitHub] spark pull request #17174: [SPARK-19145][SQL] Timestamp to String casting is...
GitHub user tanejagagan opened a pull request: https://github.com/apache/spark/pull/17174

[SPARK-19145][SQL] Timestamp to String casting is slowing the query significantly

If BinaryComparison has expressions with timestamp and string data types, then cast the string to timestamp when the string-type expression is foldable. This results in an order-of-magnitude performance improvement in query execution.

## What changes were proposed in this pull request?

If BinaryComparison has expressions with timestamp and string data types, then cast the string to timestamp when the string-type expression is foldable. This results in an order-of-magnitude performance improvement in query execution.

## How was this patch tested?

Added new unit tests to cover the functionality.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tanejagagan/spark branch-19145

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17174.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17174

commit 746b8926747ce85d4d5ecd78a9291a13b15e52d5
Author: gagan taneja
Date: 2017-03-06T07:26:39Z

[SPARK-19145][SQL] Timestamp to String casting is slowing the query significantly

If BinaryComparison has expressions with timestamp and string data types, then cast the string to timestamp when the string-type expression is foldable. This results in an order-of-magnitude performance improvement in query execution. Added new unit tests to cover the functionality.
[GitHub] spark issue #17035: [SPARK-19705][SQL] Preferred location supporting HDFS ca...
Github user tanejagagan commented on the issue: https://github.com/apache/spark/pull/17035

@hvanhovell Can you help me with this pull request?
[GitHub] spark pull request #17035: [SPARK-19705][SQL] Preferred location supporting ...
GitHub user tanejagagan opened a pull request: https://github.com/apache/spark/pull/17035

[SPARK-19705][SQL] Preferred location supporting HDFS cache for FileScanRDD

Added support for the HDFS cache using TaskLocation.inMemoryLocationTag. NewHadoopRDD and HadoopRDD both support the HDFS cache via TaskLocation.inMemoryLocationTag, where "hdfs_cache_" is prepended to the hostname and then interpreted by the scheduler. With this enhancement, the same tag ("hdfs_cache_") will be added to the hostname if the FilePartition contains only a single file and that file is cached on one or more hosts. The current implementation does not cover the case where a FilePartition has multiple files, as the preferredLocation calculation is more complex there.

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tanejagagan/spark branch-19705

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17035.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17035

commit a9288e56f30f7d9f051e06502171d7f2639913a7
Author: gagan taneja
Date: 2017-02-23T07:26:36Z

[SPARK-19705][SQL] Preferred location supporting HDFS cache for FileScanRDD

Added support for the HDFS cache using TaskLocation.inMemoryLocationTag. NewHadoopRDD and HadoopRDD both support the HDFS cache via TaskLocation.inMemoryLocationTag, where "hdfs_cache_" is prepended to the hostname and then interpreted by the scheduler. With this enhancement, the same tag ("hdfs_cache_") will be added to the hostname if the FilePartition contains only a single file and that file is cached on one or more hosts. The current implementation does not cover the case where a FilePartition has multiple files, as the preferredLocation calculation is more complex there.
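A simplified model of the location rule described above can be sketched as follows. The case class and function are illustrative stand-ins for Spark's FilePartition and FileScanRDD.preferredLocations; only the "hdfs_cache_" prefix itself comes from the source.

```scala
// Hosts are tagged with "hdfs_cache_" only when the partition holds a single
// file and that file is cached on at least one host; multi-file partitions
// fall back to plain hosts, the harder case this PR leaves out.
val inMemoryLocationTag = "hdfs_cache_"

case class PartFile(path: String, hosts: Seq[String], cachedHosts: Seq[String])

def preferredLocations(partitionFiles: Seq[PartFile]): Seq[String] =
  partitionFiles match {
    case Seq(single) if single.cachedHosts.nonEmpty =>
      single.cachedHosts.map(inMemoryLocationTag + _)
    case files =>
      // combining cached locations across files would need a more complex
      // preferredLocation calculation, so plain hostnames are kept
      files.flatMap(_.hosts).distinct
  }
```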
[GitHub] spark issue #16497: [SPARK-19118] [SQL] Percentile support for frequency dis...
Github user tanejagagan commented on the issue: https://github.com/apache/spark/pull/16497

Thank you for including the changes.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r99753046

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -125,10 +139,17 @@ case class Percentile(
       buffer: OpenHashMap[Number, Long],
       input: InternalRow): OpenHashMap[Number, Long] = {
     val key = child.eval(input).asInstanceOf[Number]
+    val frqValue = frequencyExpression.eval(input)

     // Null values are ignored in counts map.
-    if (key != null) {
-      buffer.changeValue(key, 1L, _ + 1L)
+    if (key != null && frqValue != null) {
+      val frqLong = frqValue.asInstanceOf[Number].longValue()
+      // add only when frequency is positive
+      if (frqLong > 0) {
+        buffer.changeValue(key, frqLong, _ + frqLong)
+      } else if (frqLong < 0) {

--- End diff --

Done
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95508337

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -81,7 +96,11 @@ case class Percentile(
     case arrayData: ArrayData => arrayData.toDoubleArray().toSeq
   }

-  override def children: Seq[Expression] = child :: percentageExpression :: Nil
+  override def children: Seq[Expression] = if (withFrqExpr) {

--- End diff --

@hvanhovell Can you provide your final thoughts on this, and I will make the changes accordingly. I did try to find an expression where the middle argument is optional but could not; it seems only the last argument(s) can be optional. Which argument sequence would you recommend: `col [, frequency], percentages` or `col, percentages [, frequency]`?
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95307385

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap
       _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value
       array of numeric column `col` at the given percentage(s). Each value of the percentage
       array must be between 0.0 and 1.0.
-  """)
+
+      _FUNC_(col, frequency, percentage) - Returns the exact percentile value of numeric
+      column `col` with frequency column `frequency` at the given percentage. The value of
+      percentage must be between 0.0 and 1.0.
+
+      _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - Returns the exact percentile
+      value array of numeric column `col` with frequency column `frequency` at the given
+      percentage(s). Each value of the percentage array must be between 0.0 and 1.0.
+  """)
 case class Percentile(
     child: Expression,
+    frequency : Expression,

--- End diff --

I think frequencyExpression would be the correct name, but I have sequenced the parameters as they would appear in the SQL: `select percentile(col, frq, percentage) from table`, where `frq` is optional.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95307232

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -51,16 +51,31 @@ import org.apache.spark.util.collection.OpenHashMap
       _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value
       array of numeric column `col` at the given percentage(s). Each value of the percentage
       array must be between 0.0 and 1.0.
-  """)
+
+      _FUNC_(col, frequency, percentage) - Returns the exact percentile value of numeric
+      column `col` with frequency column `frequency` at the given percentage. The value of
+      percentage must be between 0.0 and 1.0.
+
+      _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - Returns the exact percentile
+      value array of numeric column `col` with frequency column `frequency` at the given
+      percentage(s). Each value of the percentage array must be between 0.0 and 1.0.
+  """)
 case class Percentile(
     child: Expression,
+    frequency : Expression,
     percentageExpression: Expression,
+    withFrqExpr : Boolean,

--- End diff --

Please see my comment below on why we need to make a distinction, either using a flag or an Option. I am inclined towards using a flag, because switching to an Option would change the code in update from `val frqValue = frequency.eval(input)` to `val frqValue = frequency.getOrElse(unit).eval(input)`. But I think `Option[Expression]` would be better logically. Once we agree on whether we need the distinction, I will make the changes accordingly.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95305820

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -81,7 +96,11 @@ case class Percentile(
     case arrayData: ArrayData => arrayData.toDoubleArray().toSeq
   }

-  override def children: Seq[Expression] = child :: percentageExpression :: Nil
+  override def children: Seq[Expression] = if (withFrqExpr) {

--- End diff --

I have given a lot of thought to whether we need to make a distinction here. Let's take a data set of (age_string, age, count):

"20", 20, 1
"15", 15, 1
"10", 10, 1

For the SQL `select percentile(age, count, 0.5) from table`, the logically correct values are children = age :: count :: 0.5 :: Nil and inputTypes = IntegerType :: IntegerType :: DoubleType :: Nil. For the SQL `select percentile(age, 0.5) from table`, the logically correct values are children = age :: 0.5 :: Nil and inputTypes = IntegerType :: DoubleType :: Nil.

Here is one example where keeping this logically correct helps. For the following incorrect SQL, `select percentile(age, '10') from table`:

With children = age :: '10' :: Nil and inputTypes = IntegerType :: StringType :: Nil, since both children and inputTypes are used for data-type validation, the error message is correct: "argument 2 requires Double type, however, 10 is of String type."

However, with children = age :: Literal(1) :: '10' :: Nil and inputTypes = IntegerType :: IntegerType :: StringType :: Nil, the error message would be incorrect and confusing: "argument 3 requires Double type, however, 10 is of String type."

Since both children and dataType are public methods, I was inclined to keep them explicitly correct, and therefore decided to make the distinction. Please let me know your thoughts.
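The argument-numbering point above can be checked with a toy type checker that zips children with expected input types and reports the first mismatch by 1-based index. Plain strings stand in for Catalyst's data types, and `firstTypeError` is a hypothetical helper, not Catalyst's `checkInputDataTypes`.

```scala
// Report the 1-based index of the first argument whose actual type does not
// match the expected type, mirroring how input-type validation messages work.
def firstTypeError(actual: Seq[String], expected: Seq[String]): Option[String] =
  actual.zip(expected).zipWithIndex.collectFirst {
    case ((a, e), i) if a != e =>
      s"argument ${i + 1} requires $e type, however, it is of $a type."
  }

// percentile(age, '10') without a padded frequency child: blames argument 2.
val twoArg = firstTypeError(
  Seq("IntegerType", "StringType"),
  Seq("IntegerType", "DoubleType"))

// Same query with a synthetic Literal(1) child always inserted: the same
// mistake is reported against argument 3, which is confusing to the user.
val threeArg = firstTypeError(
  Seq("IntegerType", "IntegerType", "StringType"),
  Seq("IntegerType", "IntegerType", "DoubleType"))
```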
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95303304

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
       buffer: OpenHashMap[Number, Long],
       input: InternalRow): OpenHashMap[Number, Long] = {
     val key = child.eval(input).asInstanceOf[Number]
+    val frqValue = frequency.eval(input)

     // Null values are ignored in counts map.
-    if (key != null) {
-      buffer.changeValue(key, 1L, _ + 1L)
+    if (key != null && frqValue != null) {

--- End diff --

Yes, it would be wrong to use the default value of 1. Take a data set of (age, count):

20, 1
15, 1
10, 0

If we took the default value of 1L when the frequency is 0, then the 0.5 percentile would become 15, which is incorrect. I agree with the other suggestion to either fail or disregard those values.
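The counter-example can be verified with a small exact-percentile sketch over (value, frequency) pairs. Linear interpolation between neighboring ranks is assumed for illustration; this is not Spark's `Percentile` code, but it shows how defaulting a zero frequency to 1 silently shifts the median.

```scala
// Exact percentile by expanding frequencies (toy sizes only) and
// interpolating linearly between the two neighboring ranks.
def percentileOf(counts: Seq[(Double, Long)], p: Double): Double = {
  val expanded = counts.sortBy(_._1).flatMap { case (v, f) => Seq.fill(f.toInt)(v) }
  val pos = p * (expanded.size - 1)
  val lo = pos.toInt
  val hi = math.min(lo + 1, expanded.size - 1)
  expanded(lo) + (pos - lo) * (expanded(hi) - expanded(lo))
}

// Skipping the zero-frequency row: the data is really {15, 20}, median 17.5.
val correct = percentileOf(Seq(20.0 -> 1L, 15.0 -> 1L, 10.0 -> 0L), 0.5)
// Defaulting frequency 0 to 1 pretends the data is {10, 15, 20}: median 15.
val buggy = percentileOf(Seq(20.0 -> 1L, 15.0 -> 1L, 10.0 -> 1L), 0.5)
```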
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95303054

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -126,10 +152,15 @@ case class Percentile(
       buffer: OpenHashMap[Number, Long],
       input: InternalRow): OpenHashMap[Number, Long] = {
     val key = child.eval(input).asInstanceOf[Number]
+    val frqValue = frequency.eval(input)

     // Null values are ignored in counts map.
-    if (key != null) {
-      buffer.changeValue(key, 1L, _ + 1L)
+    if (key != null && frqValue != null) {
+      val frqLong = frqValue.asInstanceOf[Number].longValue()
+      // add only when frequency is positive
+      if (frqLong > 0) {

--- End diff --

I think the option was between failing and disregarding those values. We can certainly make this a requirement, document it, and fail when the values are negative. For the cases where values are either null or 0, I think we should not add them to the map and unnecessarily bloat it. The logic would look like:

if (frqLong < 0) {
  throw new SomeException
} else if (frqLong > 0) {
  // process: add them to the map
}

Let me know if the above looks good and I will make the changes accordingly.
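A minimal standalone sketch of that update rule over a mutable map, assuming negative frequencies fail and zero/null rows are skipped as agreed above. `updateBuffer` is a hypothetical stand-in for `Percentile.update`, with `Option` modeling SQL nulls.

```scala
import scala.collection.mutable

// Fail fast on negative frequencies; skip null keys/frequencies and zero
// counts so the map is not bloated; otherwise accumulate the count.
def updateBuffer(buffer: mutable.Map[Double, Long],
                 key: Option[Double],
                 frq: Option[Long]): Unit =
  (key, frq) match {
    case (Some(k), Some(f)) if f < 0 =>
      throw new IllegalArgumentException(s"Negative frequency $f for value $k")
    case (Some(k), Some(f)) if f > 0 =>
      buffer.update(k, buffer.getOrElse(k, 0L) + f) // accumulate positive counts
    case _ => () // null key, null frequency, or zero count: ignored
  }

val buf = mutable.Map.empty[Double, Long]
updateBuffer(buf, Some(20.0), Some(3L))
updateBuffer(buf, Some(20.0), Some(2L))
updateBuffer(buf, Some(10.0), Some(0L)) // zero count: never enters the map
```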
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95092209

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -91,10 +111,19 @@ case class Percentile(
     case _ => DoubleType
   }

-  override def inputTypes: Seq[AbstractDataType] = percentageExpression.dataType match {
-    case _: ArrayType => Seq(NumericType, ArrayType(DoubleType))
-    case _ => Seq(NumericType, DoubleType)
-  }
+  override def inputTypes: Seq[AbstractDataType] =
+    if (frequency == unit) {
+      percentageExpression.dataType match {
+        case _: ArrayType => Seq(NumericType, ArrayType(DoubleType))
+        case _ => Seq(NumericType, DoubleType)
+      }
+    } else {
+      percentageExpression.dataType match {

--- End diff --

It seems I would have to do it based on which arguments were in the SQL and which constructor was invoked. For SQL without frequency, like `percentile(a, 0.5)`, children needs to be child :: percentageExpr :: Nil; for SQL with frequency, like `percentile(a, frq, 0.5)`, children needs to be child :: frequency :: percentageExpr :: Nil. The input types should likewise reflect how many arguments were passed; otherwise both the InputDataType check and the generated-SQL tests fail. I have made changes in the logic to determine which constructor was invoked and to decide inputTypes and children based on it.
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
Github user tanejagagan commented on a diff in the pull request: https://github.com/apache/spark/pull/16497#discussion_r95091986

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala ---
@@ -51,26 +51,42 @@ import org.apache.spark.util.collection.OpenHashMap
       _FUNC_(col, array(percentage1 [, percentage2]...)) - Returns the exact percentile value
       array of numeric column `col` at the given percentage(s). Each value of the percentage
       array must be between 0.0 and 1.0.
+
+      _FUNC_(col, frequency, percentage) - Returns the exact percentile value of numeric column `col`
+      with frequency column `frequency` at the given percentage. The value of percentage must be
+      between 0.0 and 1.0.
+
+      _FUNC_(col, frequency, array(percentage1 [, percentage2]...)) - Returns the exact percentile
+      value array of numeric column `col` with frequency column `frequency` at the given percentage(s).
+      Each value of the percentage array must be between 0.0 and 1.0.
+
   """)
 case class Percentile(
     child: Expression,
+    frequency : Expression,
     percentageExpression: Expression,
     mutableAggBufferOffset: Int = 0,
     inputAggBufferOffset: Int = 0)
   extends TypedImperativeAggregate[OpenHashMap[Number, Long]] with ImplicitCastInputTypes {

   def this(child: Expression, percentageExpression: Expression) = {
-    this(child, percentageExpression, 0, 0)
+    this(child, Literal(1l), percentageExpression, 0, 0)
+  }
+
+  def this(child: Expression, frequency: Expression, percentageExpression: Expression) = {
+    this(child, frequency, percentageExpression, 0, 0)
   }

+  private val unit = Literal(1l)
+
   override def prettyName: String = "percentile"

   override def withNewMutableAggBufferOffset(newMutableAggBufferOffset: Int): Percentile =
     copy(mutableAggBufferOffset = newMutableAggBufferOffset)

   override def withNewInputAggBufferOffset(newInputAggBufferOffset: Int): Percentile =
     copy(inputAggBufferOffset = newInputAggBufferOffset)

--- End diff --

I did not realize that I had forgotten to do the mvn build, which runs the style check as well.
[GitHub] spark issue #16497: [SPARK-19118] [SQL] Percentile support for frequency dis...
Github user tanejagagan commented on the issue: https://github.com/apache/spark/pull/16497

@hvanhovell Can you please review the changes?
[GitHub] spark pull request #16497: [SPARK-19118] [SQL] Percentile support for freque...
GitHub user tanejagagan opened a pull request: https://github.com/apache/spark/pull/16497

[SPARK-19118] [SQL] Percentile support for frequency distribution table

## What changes were proposed in this pull request?

I have a frequency distribution table with the following entries:

Age, No of persons
21, 10
22, 15
23, 18
..
..
30, 14

It is common to have data in frequency-distribution format for calculating percentiles and medians, and with the current implementation it would be very difficult and complex to find the percentile. Therefore I am proposing an enhancement to the current Percentile and Approx Percentile implementations to take a frequency-distribution column into consideration.

## How was this patch tested?

1) Enhanced sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileSuite.scala to cover the additional functionality.
2) Ran a performance benchmark with 20 million rows in a local environment and did not see any performance degradation.

Please review http://spark.apache.org/contributing.html before opening a pull request.
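The behavior the enhancement needs can be pinned down outside Spark: a percentile computed from (value, frequency) rows must equal the percentile of the row-expanded data. A toy median sketch follows, with linear interpolation assumed for illustration; it is not Spark's implementation.

```scala
// Median with linear interpolation between the two neighboring ranks.
def median(xs: Seq[Double]): Double = {
  val s = xs.sorted
  val pos = 0.5 * (s.size - 1)
  val lo = pos.toInt
  val hi = math.min(lo + 1, s.size - 1)
  s(lo) + (pos - lo) * (s(hi) - s(lo))
}

// The frequency table from the PR description, row-expanded (toy sizes only):
// percentile(age, num_persons, 0.5) over the table must match median(expanded).
val freqTable = Seq(21.0 -> 10L, 22.0 -> 15L, 23.0 -> 18L, 30.0 -> 14L)
val expanded = freqTable.flatMap { case (age, n) => Seq.fill(n.toInt)(age) }
```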
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tanejagagan/spark branch-18940

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16497.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16497

commit c638fbc45047a250f7e41ad589e3f144d911e441
Author: gagan taneja
Date: 2017-01-07T18:03:27Z
[SPARK-18940][SQL] Percentile and approximate percentile support for frequency distribution table

commit 3740a0c7785e8480b889c8a99f639906d8a0d5a1
Author: gagan taneja
Date: 2017-01-07T18:03:27Z
[SPARK-18940][SQL] Percentile and approximate percentile support for frequency distribution table

commit 28cdb6652fe5c0953a749b9468548edfc9d42067
Author: gagan taneja
Date: 2017-01-07T18:23:47Z
Merge branch 'branch-18940' of https://github.com/tanejagagan/spark into branch-18940