[GitHub] spark pull request #18544: [SPARK-21318][SQL]Improve exception message throw...
Github user stanzhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18544#discussion_r219485843 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/UDFSuite.scala --- @@ -193,4 +193,29 @@ class UDFSuite } } } + + test("SPARK-21318: The correct exception message should be thrown " + +"if a UDF/UDAF has already been registered") { +val UDAFName = "empty" +val UDAFClassName = classOf[org.apache.spark.sql.hive.execution.UDAFEmpty].getCanonicalName + +withTempDatabase { dbName => --- End diff -- @cloud-fan I just copied and modified the code from another test case, the default database works well. The test case has been simplified now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18544: [SPARK-21318][SQL]Improve exception message throw...
Github user stanzhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18544#discussion_r219468948 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala --- @@ -1440,6 +1441,8 @@ abstract class SessionCatalogSuite extends AnalysisTest { } assert(cause.getMessage.contains("Undefined function: 'undefined_fn'")) +// SPARK-21318: the error message should contains the current database name --- End diff -- org.apache.spark.sql.AnalysisException: Undefined function: 'undefined_fn'. This function is neither a registered temporary function nor a permanent function registered in the database 'db1'.; --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544

@cloud-fan A user's Hive UDFs are registered in `externalCatalog`, not in `functionRegistry`. As a result, when an exception is encountered while loading a Hive UDF, a `NoSuchFunctionException` is thrown instead of the original exception. We should throw the original exception, so I fixed the issue by changing:

```
if (functionRegistry.functionExists(funcName)) {
  throw error
} else {
  ...
}
```

to:

```
if (super.functionExists(name)) {
  throw error
} else {
  ...
}
```

The implementation of `super.functionExists` is:

```
def functionExists(name: FunctionIdentifier): Boolean = {
  val db = formatDatabaseName(name.database.getOrElse(getCurrentDatabase))
  requireDbExists(db)
  functionRegistry.functionExists(name) || externalCatalog.functionExists(db, name.funcName)
}
```
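The effect of checking both stores can be sketched in isolation. This is a minimal model under stated assumptions, not Spark's code: `LookupSketch`, its two `Set`s, `load`, and `lookup` are hypothetical stand-ins for `functionRegistry`, `externalCatalog`, and the lookup path, and the exception types are plain JDK ones chosen for illustration.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical stand-ins for functionRegistry and externalCatalog.
object LookupSketch {
  val functionRegistry: Set[String] = Set("upper")
  val externalCatalog: Set[String] = Set("my_hive_udf")

  // Same shape as the existence check quoted above: true if either store knows the name.
  def functionExists(name: String): Boolean =
    functionRegistry.contains(name) || externalCatalog.contains(name)

  // Simulated loader: "my_hive_udf" exists in the external catalog but fails to initialize.
  def load(name: String): String =
    if (name == "my_hive_udf") throw new IllegalStateException("UDF init failed")
    else if (functionRegistry.contains(name)) s"expr($name)"
    else throw new NoSuchElementException(s"Undefined function: '$name'")

  def lookup(name: String): String =
    Try(load(name)) match {
      case Success(expr) => expr
      case Failure(error) =>
        // The function is known somewhere, so surface the original failure.
        if (functionExists(name)) throw error
        else throw new NoSuchElementException(s"Undefined function: '$name'")
    }
}
```

With only the registry consulted, the failing UDF would fall into the `else` branch and be reported as undefined; consulting both stores lets the initialization error reach the caller.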
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544 The issue has been addressed a long time ago @cloud-fan @maropu
[GitHub] spark pull request #22051: [SPARK-25064][WEBUI] Add killed tasks count info ...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/22051 [SPARK-25064][WEBUI] Add killed tasks count info to WebUI ## What changes were proposed in this pull request? Add missing killed tasks to WebUI. Total tasks = Active + Failed + Killed + Complete tasks. ## How was this patch tested? Manual tests + Unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark fix-webui-task-count Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22051.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22051 commit 4317d0c578e3dd1bd3182325bf7089fa380420e5 Author: Stan Zhai Date: 2018-08-03T07:21:45Z add killed task count info to webui --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544 That's not reasonable: `failFunctionLookup` throws `NoSuchFunctionException`, but the function actually exists in the currently selected database. We should throw the exception caused by the initialization failure, not `NoSuchFunctionException`.
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544 cc @gatorsmile The changes in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala` have been reverted.
[GitHub] spark pull request #18544: [SPARK-21318][SQL]Improve exception message throw...
Github user stanzhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18544#discussion_r202607295 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala --- @@ -129,14 +129,14 @@ private[sql] class HiveSessionCatalog( Try(super.lookupFunction(funcName, children)) match { case Success(expr) => expr case Failure(error) => -if (functionRegistry.functionExists(funcName)) { - // If the function actually exists in functionRegistry, it means that there is an - // error when we create the Expression using the given children. +if (super.functionExists(name)) { + // If the function actually exists in functionRegistry or externalCatalog, + // it means that there is an error when we create the Expression using the given children. // We need to throw the original exception. throw error } else { - // This function is not in functionRegistry, let's try to load it as a Hive's - // built-in function. + // This function is not in functionRegistry or externalCatalog, + // let's try to load it as a Hive's built-in function. // Hive is case insensitive. val functionName = funcName.unquotedString.toLowerCase(Locale.ROOT) if (!hiveFunctions.contains(functionName)) { --- End diff -- Yes, that's right. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544 cc @gatorsmile Addressed. Review this please. Thanks!
[GitHub] spark pull request #18544: [SPARK-21318][SQL]Improve exception message throw...
Github user stanzhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18544#discussion_r201579348 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala --- @@ -129,14 +129,14 @@ private[sql] class HiveSessionCatalog( Try(super.lookupFunction(funcName, children)) match { case Success(expr) => expr case Failure(error) => -if (functionRegistry.functionExists(funcName)) { - // If the function actually exists in functionRegistry, it means that there is an - // error when we create the Expression using the given children. +if (super.functionExists(name)) { --- End diff -- We should keep using `super.functionExists(name)`. If it is replaced by `functionExists(name)`, Hive's built-in functions can no longer be loaded, and `org.apache.spark.sql.AnalysisException: Undefined function: 'histogram_numeric'` is thrown.
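Why the `super.` qualifier matters can be illustrated with a small model. This sketch assumes the subclass overrides `functionExists` so that it also reports Hive built-ins as existing; that override is inferred from the comment above rather than quoted from it, and `BaseCatalog`, `HiveLikeCatalog`, and `onLookupFailure` are hypothetical names.

```scala
// Hypothetical model: class and member names are illustrative, not Spark's API.
class BaseCatalog {
  protected val registered: Set[String] = Set("my_udf")
  def functionExists(name: String): Boolean = registered.contains(name)
}

class HiveLikeCatalog extends BaseCatalog {
  private val hiveBuiltins: Set[String] = Set("histogram_numeric")

  // Assumed override that also treats Hive built-ins as existing.
  override def functionExists(name: String): Boolean =
    super.functionExists(name) || hiveBuiltins.contains(name)

  // Reports which branch a failed lookup would take for the chosen existence check.
  def onLookupFailure(name: String, useSuper: Boolean): String = {
    val exists = if (useSuper) super.functionExists(name) else functionExists(name)
    if (exists) "rethrow original error"
    else if (hiveBuiltins.contains(name)) "load Hive built-in"
    else "undefined function"
  }
}
```

Under this assumption, the overridden check answers `true` for `histogram_numeric`, so the failure branch rethrows instead of falling through to the built-in loader; the `super` call keeps the fallback reachable.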
[GitHub] spark issue #21663: [SPARK-24680][Deploy]Support spark.executorEnv.JAVA_HOME...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/21663 @jerryshao My Spark application is built on JDK 10, but the standalone cluster manager runs on JDK 8, which does not support JDK 10. Java 7 support was removed in Spark 2.2. I've verified that messages serialized by JDK 10 executors can be read by a JDK 8 worker. Aside from that, I think we should make the `spark.executorEnv.JAVA_HOME` configuration work and leave the question of whether it is effective to the user.
[GitHub] spark pull request #21680: [SPARK-24704][WebUI] Fix the order of stages in t...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/21680 [SPARK-24704][WebUI] Fix the order of stages in the DAG graph ## What changes were proposed in this pull request? Before: ![wx20180630-155537](https://user-images.githubusercontent.com/1438757/42123357-2c2e2d84-7c83-11e8-8abd-1c2860f38783.png) After: ![wx20180630-155604](https://user-images.githubusercontent.com/1438757/42123359-32fae990-7c83-11e8-8a7b-cdcee94f9123.png) ## How was this patch tested? Manual tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark fix-dag-graph Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21680.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21680 commit b3420d61025f7bb9e17160dfb586bc54fba1a51d Author: Stan Zhai Date: 2018-06-30T07:57:08Z fix stage order in job graph --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...
Github user stanzhai commented on a diff in the pull request: https://github.com/apache/spark/pull/21623#discussion_r199062132 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -378,6 +378,14 @@ object SQLConf { .booleanConf .createWithDefault(true) + val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED = +buildConf("spark.sql.parquet.filterPushdown.string.startsWith") --- End diff -- It would be better to add an `.enabled` suffix to the config name.
[GitHub] spark pull request #21663: [SPARK-24680][Deploy]Support spark.executorEnv.JA...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/21663

[SPARK-24680][Deploy]Support spark.executorEnv.JAVA_HOME in Standalone mode

## What changes were proposed in this pull request?

`spark.executorEnv.JAVA_HOME` does not take effect when a Worker starts an Executor process in Standalone mode. This PR fixes that.

## How was this patch tested?

Manual tests.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark fix-executor-env-java-home

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21663.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21663

commit b46c5357746880d420b208733443cb8b49164e81 Author: Stan Zhai Date: 2018-06-28T13:44:01Z fix spark.executorEnv.JAVA_HOME
[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...
Github user stanzhai closed the pull request at: https://github.com/apache/spark/pull/19301
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544 fixed @gatorsmile . retest this please
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544 Hi @gatorsmile, I've added some test cases, and they pass on my machine.
[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...
Github user stanzhai commented on a diff in the pull request: https://github.com/apache/spark/pull/19301#discussion_r140699522 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala --- @@ -72,11 +74,19 @@ object AggregateExpression { aggregateFunction: AggregateFunction, mode: AggregateMode, isDistinct: Boolean): AggregateExpression = { +val state = if (aggregateFunction.resolved) { + Seq(aggregateFunction.toString, aggregateFunction.dataType, +aggregateFunction.nullable, mode, isDistinct) +} else { + Seq(aggregateFunction.toString, mode, isDistinct) +} +val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b) + AggregateExpression( aggregateFunction, mode, isDistinct, - NamedExpression.newExprId) + ExprId(hashCode)) --- End diff --

I've tried to do the optimization in the aggregate planner instead (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L211):

```scala
// A single aggregate expression might appear multiple times in resultExpressions.
// In order to avoid evaluating an individual aggregate function multiple times, we'll
// build a set of the distinct aggregate expressions and build a function which can
// be used to re-write expressions so that they reference the single copy of the
// aggregate function which actually gets computed.
val aggregateExpressions = resultExpressions.flatMap { expr =>
  expr.collect {
    case agg: AggregateExpression =>
      val aggregateFunction = agg.aggregateFunction
      val state = if (aggregateFunction.resolved) {
        Seq(aggregateFunction.toString, aggregateFunction.dataType,
          aggregateFunction.nullable, agg.mode, agg.isDistinct)
      } else {
        Seq(aggregateFunction.toString, agg.mode, agg.isDistinct)
      }
      val hashCode = state.map(Objects.hashCode).foldLeft(0)((a, b) => 31 * a + b)
      (hashCode, agg)
  }
}.groupBy(_._1).map { case (_, values) => values.head._2 }.toSeq
```

But it's difficult to distinguish between different typed aggregators without the expr id. The current solution works well for all aggregate functions. I'm not familiar with typed aggregators, so any suggestions would be appreciated.
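The grouping idea in the snippet above can be tried without Spark. The following is a minimal sketch under stated assumptions: `Agg` is a hypothetical, simplified stand-in for an aggregate expression (it is not Spark's `AggregateExpression`), and `stateHash` and `dedup` are illustrative names.

```scala
// Hypothetical, simplified model of an aggregate expression; not Spark's class.
final case class Agg(function: String, mode: String, isDistinct: Boolean, exprId: Long)

object DedupByStateHash {
  // Fold the identifying fields into one hash, deliberately ignoring exprId.
  def stateHash(agg: Agg): Int = {
    val state: Seq[Any] = Seq(agg.function, agg.mode, agg.isDistinct)
    state.map(_.hashCode).foldLeft(0)((a, b) => 31 * a + b)
  }

  // Keep one representative per hash value.
  def dedup(aggs: Seq[Agg]): Seq[Agg] =
    aggs.groupBy(stateHash).map { case (_, group) => group.head }.toSeq
}
```

One design caveat of keying on a hash alone: two different expressions whose hashes collide would be merged, so a production version would more likely group on the state itself rather than on its hash.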
[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/19301

@viirya Benchmark code:

```scala
val N = 500L << 22
val benchmark = new Benchmark("agg", N)
val expressions = (0 until 50).map(i => s"sum(id) as r$i")

benchmark.addCase("agg with optimize", numIters = 2) { iter =>
  sparkSession.range(N).selectExpr(expressions: _*).collect()
}
benchmark.run()
```

Result:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.12.6
Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz
agg:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
agg with optimize             1306 / 1354       1605.7           0.6       1.0X
agg without optimize      121799 / 148115         17.2          58.1       1.0X
```
[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/19301 @viirya The problem is already clear: the same aggregate expression is computed multiple times. I will provide benchmark results later.
[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/19301

@cenyuhai This is an optimization of the physical plan, and your case can be optimized.

```SQL
select dt, geohash_of_latlng, sum(mt_cnt), sum(ele_cnt),
  round(sum(mt_cnt) * 1.0 * 100 / sum(mt_cnt_all), 2),
  round(sum(ele_cnt) * 1.0 * 100 / sum(ele_cnt_all), 2)
from values(1, 2, 3, 4, 5, 6) as (dt, geohash_of_latlng, mt_cnt, ele_cnt, mt_cnt_all, ele_cnt_all)
group by dt, geohash_of_latlng
order by dt, geohash_of_latlng
limit 10
```

Before:

```
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[dt#26 ASC NULLS FIRST,geohash_of_latlng#27 ASC NULLS FIRST], output=[dt#26,geohash_of_latlng#27,sum(mt_cnt)#38L,sum(ele_cnt)#39L,round((CAST((CAST((CAST(CAST(sum(CAST(mt_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(mt_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#40,round((CAST((CAST((CAST(CAST(sum(CAST(ele_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(ele_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#41])
+- *HashAggregate(keys=[dt#26, geohash_of_latlng#27], functions=[sum(cast(mt_cnt#28 as bigint)), sum(cast(ele_cnt#29 as bigint)), sum(cast(mt_cnt#28 as bigint)), sum(cast(mt_cnt_all#30 as bigint)), sum(cast(ele_cnt#29 as bigint)), sum(cast(ele_cnt_all#31 as bigint))])
   +- Exchange hashpartitioning(dt#26, geohash_of_latlng#27, 200)
      +- *HashAggregate(keys=[dt#26, geohash_of_latlng#27], functions=[partial_sum(cast(mt_cnt#28 as bigint)), partial_sum(cast(ele_cnt#29 as bigint)), partial_sum(cast(mt_cnt#28 as bigint)), partial_sum(cast(mt_cnt_all#30 as bigint)), partial_sum(cast(ele_cnt#29 as bigint)), partial_sum(cast(ele_cnt_all#31 as bigint))])
         +- LocalTableScan [dt#26, geohash_of_latlng#27, mt_cnt#28, ele_cnt#29, mt_cnt_all#30, ele_cnt_all#31]
```

After:

```
== Physical Plan ==
TakeOrderedAndProject(limit=10, orderBy=[dt#28 ASC NULLS FIRST,geohash_of_latlng#29 ASC NULLS FIRST], output=[dt#28,geohash_of_latlng#29,sum(mt_cnt)#34L,sum(ele_cnt)#35L,round((CAST((CAST((CAST(CAST(sum(CAST(mt_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(mt_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#36,round((CAST((CAST((CAST(CAST(sum(CAST(ele_cnt AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(21,1)) * CAST(1.0 AS DECIMAL(21,1))) AS DECIMAL(23,1)) * CAST(CAST(100 AS DECIMAL(23,1)) AS DECIMAL(23,1))) AS DECIMAL(38,2)) / CAST(CAST(sum(CAST(ele_cnt_all AS BIGINT)) AS DECIMAL(20,0)) AS DECIMAL(38,2))), 2)#37])
+- *HashAggregate(keys=[dt#28, geohash_of_latlng#29], functions=[sum(cast(mt_cnt#30 as bigint)), sum(cast(ele_cnt#31 as bigint)), sum(cast(mt_cnt_all#32 as bigint)), sum(cast(ele_cnt_all#33 as bigint))])
   +- Exchange hashpartitioning(dt#28, geohash_of_latlng#29, 200)
      +- *HashAggregate(keys=[dt#28, geohash_of_latlng#29], functions=[partial_sum(cast(mt_cnt#30 as bigint)), partial_sum(cast(ele_cnt#31 as bigint)), partial_sum(cast(mt_cnt_all#32 as bigint)), partial_sum(cast(ele_cnt_all#33 as bigint))])
         +- LocalTableScan [dt#28, geohash_of_latlng#29, mt_cnt#30, ele_cnt#31, mt_cnt_all#32, ele_cnt_all#33]
```
[GitHub] spark issue #19301: [SPARK-22084][SQL] Fix performance regression in aggrega...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/19301

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala#L211

```scala
val aggregateExpressions = resultExpressions.flatMap { expr =>
  expr.collect {
    case agg: AggregateExpression => agg
  }
}.distinct
```

Before the fix, every aggregate expression has a different exprId, so `distinct` fails to remove the duplicates.
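Why `distinct` keeps both copies can be seen in a few lines of plain Scala. In this sketch `AggExpr`, `withFreshIds`, and `withContentIds` are hypothetical names, and the assumption is simply that the id takes part in equality, as a field of a case class does.

```scala
// Hypothetical model: exprId participates in case-class equality.
final case class AggExpr(function: String, exprId: Long)

object DistinctSketch {
  private var nextId = 0L
  private def newExprId(): Long = { nextId += 1; nextId }

  // Each occurrence gets a fresh id, so identical functions are unequal values.
  def withFreshIds(functions: Seq[String]): Seq[AggExpr] =
    functions.map(f => AggExpr(f, newExprId()))

  // Ids derived from content make repeated functions compare equal.
  def withContentIds(functions: Seq[String]): Seq[AggExpr] =
    functions.map(f => AggExpr(f, f.hashCode.toLong))
}
```

For `Seq("sum(b)", "sum(b)")`, the fresh-id variant leaves two elements after `distinct`, while the content-derived variant leaves one.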
[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...
Github user stanzhai commented on a diff in the pull request: https://github.com/apache/spark/pull/19301#discussion_r140155475 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -38,7 +38,7 @@ import org.apache.spark.sql.internal.SQLConf * view resolution, in this way, we are able to get the correct view column ordering and * omit the extra columns that we don't require); *1.2. Else set the child output attributes to `queryOutput`. - * 2. Map the `queryQutput` to view output by index, if the corresponding attributes don't match, + * 2. Map the `queryOutput` to view output by index, if the corresponding attributes don't match, --- End diff -- Q -> O --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19301: [SPARK-22084][SQL] Fix performance regression in ...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/19301

[SPARK-22084][SQL] Fix performance regression in aggregation strategy

## What changes were proposed in this pull request?

This PR fixes a performance regression in the aggregation strategy that was introduced in Spark 2.0. For the following SQL:

```SQL
SELECT a, SUM(b) AS b0, SUM(b) AS b1 FROM VALUES(1, 1), (2, 2) AS (a, b) GROUP BY a
```

Before the fix:

```
== Physical Plan ==
*HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint)), sum(cast(b#12 as bigint))])
+- Exchange hashpartitioning(a#11, 200)
   +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint)), partial_sum(cast(b#12 as bigint))])
      +- LocalTableScan [a#11, b#12]
```

After the fix:

```
== Physical Plan ==
*HashAggregate(keys=[a#11], functions=[sum(cast(b#12 as bigint))])
+- Exchange hashpartitioning(a#11, 2)
   +- *HashAggregate(keys=[a#11], functions=[partial_sum(cast(b#12 as bigint))])
      +- LocalTableScan [a#11, b#12]
```

## How was this patch tested?

WIP

You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark improve-aggregate

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19301.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19301

commit 6f555c20c5c6d2821410aff671758ba73cd8f300 Author: Stan Zhai <m...@stanzhai.site> Date: 2017-09-19T09:27:35Z use hashCode as exprId
commit 5aaae4caa6225ecc6d174afb2eefa8d68af5471a Author: Stan Zhai <m...@stanzhai.site> Date: 2017-09-19T09:53:56Z typo
commit adce4740c3c41000215f5d7cc0285701d15bb7cf Author: Stan Zhai <m...@stanzhai.site> Date: 2017-09-20T07:12:23Z Merge branch 'master' of https://github.com/apache/spark into improve-aggregate
commit bf7d2cf103e2a0caf1538e3df5c174df173cfc56 Author: Stan Zhai <m...@stanzhai.site> Date: 2017-09-21T05:19:20Z Merge branch 'master' of https://github.com/apache/spark into improve-aggregate
[GitHub] spark issue #18986: [SPARK-21774][SQL] The rule PromoteStrings should cast a...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18986

@gatorsmile @DonnyZone When comparing a string to an int, Hive casts the string to double.

```
hive> select * from tb;
0	0
0.1	0
true	0
19157170390056971	0
hive> select * from tb where a = 0;
0	0
hive> select * from tb where a = 19157170390056973L;
WARNING: Comparing a bigint and a string may result in a loss of precision.
19157170390056973	0
hive> select 1 = 'true';
NULL
hive> select 19157170390056973L = '19157170390056971';
WARNING: Comparing a bigint and a string may result in a loss of precision.
true
```

So I think casting a string to double when comparing it with a numeric value is more reasonable. Actually, my use case is about Spark compatibility: I found the problem when I upgraded to Spark 2.2.0, after which many SQL queries returned wrong results.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
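The two effects visible in the Hive session, the precision warning and the treatment of `0.1`, can be reproduced with plain JVM arithmetic. This is only a sketch: `CoercionSketch` and its methods are hypothetical names, and `equalAsTruncatedInt` models a cast that narrows the string to an integer first, which is an assumption made for contrast rather than a statement of what any Spark version does.

```scala
object CoercionSketch {
  // Compare a bigint with a numeric string by widening both sides to double.
  // Doubles near 1.9e16 are spaced 4 apart, so nearby longs can collapse together.
  def equalAsDouble(n: Long, s: String): Boolean = n.toDouble == s.toDouble

  // Hypothetical narrowing comparison: the string is truncated to an integer first.
  def equalAsTruncatedInt(n: Int, s: String): Boolean = s.toDouble.toInt == n
}
```

Widening to double keeps `0.1` distinct from `0` but loses precision for very large integers, which is the trade-off the Hive warning refers to; narrowing the string would instead make `0.1` match `0`.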
[GitHub] spark issue #18986: [SPARK-21774][SQL] The rule PromoteStrings should cast a...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18986

@DonnyZone @gatorsmile @cloud-fan PostgreSQL throws an error when comparing a string to an int.

```
postgres=# select * from tb;
  a   | b
------+---
 0.1  | 1
 a    | 1
 true | 1
(3 rows)

postgres=# select * from tb where a>0;
ERROR: operator does not exist: character varying > integer
LINE 1: select * from tb where a>0;
                                ^
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.
```
[GitHub] spark issue #18986: [SPARK-21774][SQL] The rule PromoteStrings should cast a...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18986

In MySQL, when a string is compared with a numeric value, both are compared as floating-point (real) numbers. See https://dev.mysql.com/doc/refman/5.7/en/type-conversion.html

The following rules describe how conversion occurs for comparison operations:

- If one or both arguments are NULL, the result of the comparison is NULL, except for the NULL-safe <=> equality comparison operator. For NULL <=> NULL, the result is true. No conversion is needed.
- If both arguments in a comparison operation are strings, they are compared as strings.
- If both arguments are integers, they are compared as integers.
- Hexadecimal values are treated as binary strings if not compared to a number.
- If one of the arguments is a TIMESTAMP or DATETIME column and the other argument is a constant, the constant is converted to a timestamp before the comparison is performed. This is done to be more ODBC-friendly.
  > Note that this is not done for the arguments to IN()! To be safe, always use complete datetime, date, or time strings when doing comparisons. For example, to achieve best results when using BETWEEN with date or time values, use CAST() to explicitly convert the values to the desired data type.
- A single-row subquery from a table or tables is not considered a constant. For example, if a subquery returns an integer to be compared to a DATETIME value, the comparison is done as two integers. The integer is not converted to a temporal value. To compare the operands as DATETIME values, use CAST() to explicitly convert the subquery value to DATETIME.
- If one of the arguments is a decimal value, comparison depends on the other argument. The arguments are compared as decimal values if the other argument is a decimal or integer value, or as floating-point values if the other argument is a floating-point value.
- In all other cases, the arguments are compared as floating-point (real) numbers.
[GitHub] spark pull request #18986: [SPARK-21774][SQL] The rule PromoteStrings should...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/18986

[SPARK-21774][SQL] The rule PromoteStrings should cast a string to double type when compare with a int

## What changes were proposed in this pull request?

The rule `PromoteStrings` should cast a string to double when comparing it with an int. This PR fixes that.

## How was this patch tested?

The original test cases were updated.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark fix-type-coercion

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18986.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18986

commit 1a289a5a1b0756e86e225d43de73d9b42afb0a0e Author: Stan Zhai <m...@stanzhai.site> Date: 2017-08-18T02:17:20Z fix a bug of TypeCoercion
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544 @gatorsmile Some test cases have been added. Thanks for reviewing.
[GitHub] spark issue #18544: [SPARK-21318][SQL]Improve exception message thrown by `l...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/18544 cc @liancheng
[GitHub] spark pull request #18544: [SPARK-21318][SQL]Improve exception message throw...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/18544 [SPARK-21318][SQL]Improve exception message thrown by `lookupFunction` ## What changes were proposed in this pull request? When a function actually exists in the currently selected database but fails to initialize during `lookupFunction`, the exception message is: ``` This function is neither a registered temporary function nor a permanent function registered in the database 'default'. ``` This message is misleading and makes the real problem hard to locate. This PR fixes the problem. ## How was this patch tested? Existing tests + manual tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark fix-udf-error-message Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18544.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18544 commit 373fc5cacb77bb6e6be02eb3608497cbcaa7edef Author: Stan Zhai <m...@stanzhai.site> Date: 2017-07-05T14:41:02Z optimized udf lookup exception message
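The intended behaviour can be sketched in plain Scala as follows. This is a toy model under stated assumptions, not Spark's `SessionCatalog`: the `Catalog` class, its constructor parameters and the `NoSuchFunction` exception are hypothetical names introduced only for the illustration.

```scala
import scala.util.control.NonFatal

object LookupFunctionSketch {
  // Hypothetical exception carrying the "undefined function" message.
  final case class NoSuchFunction(db: String, name: String)
      extends Exception(
        s"Undefined function: '$name'. This function is neither a registered temporary " +
          s"function nor a permanent function registered in the database '$db'.")

  // Hypothetical catalog: `known` is the set of registered names, `load` may fail.
  final class Catalog(db: String, known: Set[String], load: String => String) {
    def functionExists(name: String): Boolean = known.contains(name)

    def lookupFunction(name: String): String =
      try load(name)
      catch {
        case NonFatal(error) =>
          // The function exists but failed to load: surface the original error.
          if (functionExists(name)) throw error
          // The function really is unknown: report it as undefined.
          else throw NoSuchFunction(db, name)
      }
  }
}
```

The point of the design is that the "undefined function" message is reserved for names that are genuinely missing, so a loading failure keeps its own stack trace and message.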
[GitHub] spark pull request #17529: [SPARK-20211][SQL]Fix a bug in FLOOR and CEIL whe...
Github user stanzhai closed the pull request at: https://github.com/apache/spark/pull/17529
[GitHub] spark pull request #18244: [SPARK-20211][SQL] Fix the Precision and Scale of...
Github user stanzhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18244#discussion_r121060627 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala --- @@ -126,7 +126,15 @@ final class Decimal extends Ordered[Decimal] with Serializable { def set(decimal: BigDecimal): Decimal = { this.decimalVal = decimal this.longVal = 0L -this._precision = decimal.precision +if (decimal.precision <= decimal.scale) { --- End diff -- Got it, thanks for the fix!
[GitHub] spark pull request #18244: [SPARK-20211][SQL] Fix the Precision and Scale of...
Github user stanzhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18244#discussion_r121058323 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala --- @@ -126,7 +126,15 @@ final class Decimal extends Ordered[Decimal] with Serializable { def set(decimal: BigDecimal): Decimal = { this.decimalVal = decimal this.longVal = 0L -this._precision = decimal.precision +if (decimal.precision <= decimal.scale) { --- End diff -- But the comment is `// For Decimal, we expect the precision is equal to or large than the scale`. The `=` case is already handled within the functions `floor` and `ceil`: <https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L387> This is the reason I think we should use `if (decimal.precision < decimal.scale)`, and it works fine for `0.90`.
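The difference between `<` and `<=` can be checked with plain `scala.math.BigDecimal`. This is a small sketch; the object and helper names are hypothetical and it does not touch Spark's `Decimal` class.

```scala
object DecimalPrecisionSketch {
  // Returns (precision, scale) as reported by scala.math.BigDecimal.
  def precisionAndScale(literal: String): (Int, Int) = {
    val d = BigDecimal(literal)
    (d.precision, d.scale)
  }

  // The strict check discussed in the review comment.
  def precisionLessThanScale(literal: String): Boolean = {
    val (p, s) = precisionAndScale(literal)
    p < s
  }

  // The non-strict check from the diff under review.
  def precisionAtMostScale(literal: String): Boolean = {
    val (p, s) = precisionAndScale(literal)
    p <= s
  }
}
```

For `0.90` the precision and the scale are both 2, so only the `<=` variant matches it; for `0.0001` the precision is 1 and the scale is 4, so both variants match.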
[GitHub] spark pull request #18244: [SPARK-20211][SQL] Fix the Precision and Scale of...
Github user stanzhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18244#discussion_r121053165 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala --- @@ -126,7 +126,15 @@ final class Decimal extends Ordered[Decimal] with Serializable { def set(decimal: BigDecimal): Decimal = { this.decimalVal = decimal this.longVal = 0L -this._precision = decimal.precision +if (decimal.compare(BigDecimal(1.0)) == -1 && decimal.compare(BigDecimal(-1.0)) == 1) { --- End diff -- just `if (decimal.precision < decimal.scale) {` https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L387
[GitHub] spark issue #10991: [SPARK-12299][CORE] Remove history serving functionality...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/10991 We've just upgraded our Spark cluster from 1.6.x to 2.x, and I found that the REST APIs of the Spark Master UI are unavailable. It's important for us to use the REST APIs to monitor our applications, and I believe some other people rely on this function too. Right now, the only way to get this information is through the Spark Master web UI, which is inconvenient. It would be great to have some REST APIs to access Master, Workers and Applications information from the Master. @BryanCutler @andrewor14 @JoshRosen
[GitHub] spark issue #17529: [SPARK-20211][SQL]Fix a bug in FLOOR and CEIL when a dec...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/17529 cc @gatorsmile
[GitHub] spark issue #17529: [SPARK-20211][SQL]floor or ceil with a decimal that its ...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/17529 cc @chenghao-intel
[GitHub] spark pull request #17529: [SPARK-20211][SQL]floor or ceil with a decimal th...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/17529 [SPARK-20211][SQL]floor or ceil with a decimal that its `precision < scale` should be supported ## What changes were proposed in this pull request? `precision` in a decimal is the number of digits in its arbitrary-precision unscaled integer. Here are a few examples of numbers with the same scale, but different precision: - 12345 / 10^5 = 0.12345 // scale = 5, precision = 5 - 12340 / 10^5 = 0.12340 // scale = 5, precision = 5 (0.1234 after stripping the trailing zero: scale = 4, precision = 4) - 1 / 10^5 = 0.00001 // scale = 5, precision = 1 This PR fixes a bug in floor and ceil in `org.apache.spark.sql.types.Decimal` that throws a `Decimal scale (0) cannot be greater than precision (-2)` exception when `precision < scale`. Before the fix, the following SQL statements throw an exception: ``` select 1 > 0.0001 from tb select floor(0.0001) from tb select ceil(0.0001) from tb ``` ## How was this patch tested? Added unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark fix_decimal_precision Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17529.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17529 commit 2b094b6e8fb1b0b8ae8bc89782305ac44d172ec3 Author: stanzhai <stanz...@outlook.com> Date: 2017-04-04T12:48:53Z fix decimal floor/ceil precision bug commit 2d60230b8344b391c3edfeec7c19ad1717e93710 Author: stanzhai <stanz...@outlook.com> Date: 2017-04-04T14:28:03Z add test case commit 61058b6e69802312bda35cdaf04a5b2af7dcd827 Author: stanzhai <stanz...@outlook.com> Date: 2017-04-04T15:02:54Z update test case
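For reference, the values that `floor` and `ceil` are expected to produce for such a decimal can be sketched with plain `scala.math.BigDecimal`. The object and helper names are hypothetical, and this only shows the expected results; it is not the code path inside `org.apache.spark.sql.types.Decimal`.

```scala
import scala.math.BigDecimal.RoundingMode

object FloorCeilSketch {
  // Round towards negative infinity, keeping no fractional digits.
  def floor(d: BigDecimal): BigDecimal = d.setScale(0, RoundingMode.FLOOR)

  // Round towards positive infinity, keeping no fractional digits.
  def ceil(d: BigDecimal): BigDecimal = d.setScale(0, RoundingMode.CEILING)
}
```

For `0.0001` (precision 1, scale 4) this gives 0 for `floor` and 1 for `ceil`, which are the results one would expect the fixed queries to return.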
[GitHub] spark pull request #17131: [SPARK-19766][SQL][BRANCH-2.0] Constant alias col...
Github user stanzhai closed the pull request at: https://github.com/apache/spark/pull/17131
[GitHub] spark pull request #17131: [SPARK-19766][SQL][BRANCH-2.0] Constant alias col...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/17131 [SPARK-19766][SQL][BRANCH-2.0] Constant alias columns in INNER JOIN should not be folded by FoldablePropagation rule This PR is the fix for branch-2.0. Refer to #17099. @gatorsmile You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark fix-inner-join-2.0 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17131.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17131 commit 4975ac7f3a6a714c80e5f875ab54dd60f4aa22a5 Author: Stan Zhai <zhaishi...@haizhi.com> Date: 2017-03-02T05:56:07Z fix innner join
[GitHub] spark pull request #17099: [SPARK-19766][SQL] Constant alias columns in INNE...
Github user stanzhai commented on a diff in the pull request: https://github.com/apache/spark/pull/17099#discussion_r103848391 --- Diff: sql/core/src/test/resources/sql-tests/results/inner-join.sql.out --- @@ -0,0 +1,68 @@ +-- Automatically generated by SQLQueryTestSuite +-- Number of queries: 13 --- End diff -- Thanks! I will pay attention to this next time.
[GitHub] spark issue #17099: [SPARK-19766][SQL] Constant alias columns in INNER JOIN ...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/17099 ok
[GitHub] spark issue #17099: [SPARK-19766][SQL] Constant alias columns in INNER JOIN ...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/17099 Thanks for @gatorsmile 's help. `ConstantFolding` will affect other test cases in `FoldablePropagationSuite`. It's fine without adding `ConstantFolding`. Before fix: ``` [info] !'Join Inner, ((a#0 = a#0) && (1 = 1))'Join Inner, (('tb.a = 'ta.a) && ('tb.tag = 'ta.tag)) [info] !:- Union :- 'SubqueryAlias ta [info] !: :- Project [a#0, 1 AS tag#0] : +- 'Union [info] !: : +- LocalRelation , [a#0, b#0] : :- 'Project ['a, 1 AS tag#0] [info] !: +- Project [a#0, 2 AS tag#0] : : +- LocalRelation , [a#0, b#0] [info] !: +- LocalRelation , [a#0, b#0] : +- 'Project ['a, 2 AS tag#0] [info] !+- Union :+- LocalRelation , [a#0, b#0] [info] ! :- Project [a#0, 1 AS tag#0] +- 'SubqueryAlias tb [info] ! : +- LocalRelation , [a#0, b#0] +- 'Union [info] ! +- Project [a#0, 2 AS tag#0]:- 'Project ['a, 1 AS tag#0] [info] ! +- LocalRelation , [a#0, b#0] : +- LocalRelation , [a#0, b#0] [info] ! +- 'Project ['a, 2 AS tag#0] [info] ! +- LocalRelation , [a#0, b#0] (PlanTest.scala:99) ``` After fix: ``` [info] !'Join Inner, ((a#0 = a#0) && (tag#0 = tag#0)) 'Join Inner, (('tb.a = 'ta.a) && ('tb.tag = 'ta.tag)) [info] !:- Union:- 'SubqueryAlias ta [info] !: :- Project [a#0, 1 AS tag#0] : +- 'Union [info] !: : +- LocalRelation , [a#0, b#0] : :- 'Project ['a, 1 AS tag#0] [info] !: +- Project [a#0, 2 AS tag#0] : : +- LocalRelation , [a#0, b#0] [info] !: +- LocalRelation , [a#0, b#0] : +- 'Project ['a, 2 AS tag#0] [info] !+- Union:+- LocalRelation , [a#0, b#0] [info] ! :- Project [a#0, 1 AS tag#0] +- 'SubqueryAlias tb [info] ! : +- LocalRelation , [a#0, b#0] +- 'Union [info] ! +- Project [a#0, 2 AS tag#0] :- 'Project ['a, 1 AS tag#0] [info] ! +- LocalRelation , [a#0, b#0]: +- LocalRelation , [a#0, b#0] [info] ! +- 'Project ['a, 2 AS tag#0] [info] ! +- LocalRelation , [a#0, b#0] (PlanTest.scala:99) ``` I just fix the test case(`"tb.tag" -> "tb.tag".attr`). 
[GitHub] spark issue #17099: [SPARK-19766][SQL] Constant alias columns in INNER JOIN ...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/17099 @hvanhovell
[GitHub] spark pull request #17099: Constant alias columns in INNER JOIN should not b...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/17099 Constant alias columns in INNER JOIN should not be folded by FoldablePropagation rule ## What changes were proposed in this pull request? This PR fixes the code in the Optimizer phase where the constant alias columns of an `INNER JOIN` query are folded by the rule `FoldablePropagation`. For the following query: ``` val sqlA = """ |create temporary view ta as |select a, 'a' as tag from t1 union all |select a, 'b' as tag from t2 """.stripMargin val sqlB = """ |create temporary view tb as |select a, 'a' as tag from t3 union all |select a, 'b' as tag from t4 """.stripMargin val sql = """ |select tb.* from ta inner join tb on |ta.a = tb.a and |ta.tag = tb.tag """.stripMargin ``` The `tag` column is a constant alias column, and it is folded by `FoldablePropagation` like this: ``` TRACE SparkOptimizer: === Applying Rule org.apache.spark.sql.catalyst.optimizer.FoldablePropagation === Project [a#4, tag#14] Project [a#4, tag#14] !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14)) +- Join Inner, ((a#0 = a#4) && (a = a)) :- Union :- Union : :- Project [a#0, a AS tag#8]: :- Project [a#0, a AS tag#8] : : +- LocalRelation [a#0] : : +- LocalRelation [a#0] : +- Project [a#2, b AS tag#9]: +- Project [a#2, b AS tag#9] : +- LocalRelation [a#2] : +- LocalRelation [a#2] +- Union +- Union :- Project [a#4, a AS tag#14] :- Project [a#4, a AS tag#14] : +- LocalRelation [a#4] : +- LocalRelation [a#4] +- Project [a#6, b AS tag#15] +- Project [a#6, b AS tag#15] +- LocalRelation [a#6] +- LocalRelation [a#6] ``` Finally, the result of the batch Operator Optimizations is: ``` Project [a#4, tag#14] Project [a#4, tag#14] !+- Join Inner, ((a#0 = a#4) && (tag#8 = tag#14)) +- Join Inner, (a#0 = a#4) ! :- SubqueryAlias ta, `ta` :- Union ! : +- Union: :- LocalRelation [a#0] ! : :- Project [a#0, a AS tag#8] : +- LocalRelation [a#2] ! : : +- SubqueryAlias t1, `t1` +- Union ! : : +- Project [a#0] :- LocalRelation [a#4, tag#14] ! 
: :+- SubqueryAlias grouping +- LocalRelation [a#6, tag#15] ! : : +- LocalRelation [a#0] ! : +- Project [a#2, b AS tag#9] ! :+- SubqueryAlias t2, `t2` ! : +- Project [a#2] ! : +- SubqueryAlias grouping ! : +- LocalRelation [a#2] ! +- SubqueryAlias tb, `tb` ! +- Union ! :- Project [a#4, a AS tag#14] ! : +- SubqueryAlias t3, `t3` ! : +- Project [a#4] ! :+- SubqueryAlias grouping ! : +- LocalRelation [a#4] ! +- Project [a#6, b AS tag#15] !+- SubqueryAlias t4, `t4` ! +- Project [a#6] ! +- SubqueryAlias grouping ! +- LocalRelation [a#6] ``` The condition `tag#8 = tag#14` of INNER JOIN has been removed. This leads to the data of inner join being wrong. After fix: ``` === Result of Batch LocalRelation === GlobalLimit 21 GlobalLimit 21 +- LocalLimit 21 +- LocalLimit 21 +- Project [a#4, tag#11] +- Project [a#4, tag#11] +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11)) +- Join Inner, ((a#0 = a#4) && (tag#8 = tag#11)) ! :- SubqueryAlias ta :- Union ! : +- Union : :- LocalRelation [a#0, tag#8] ! : :- Project [a#0, a AS tag#8] : +- LocalRelation [a#2, tag#9] ! : : +- SubqueryAlias t
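Why dropping `tag#8 = tag#14` changes the result can be seen with a toy inner join over plain Scala collections. This is only a sketch: the `Row` case class, the sample data and the `innerJoin` helper are hypothetical and stand in for the views `ta` and `tb` above.

```scala
object JoinConditionSketch {
  final case class Row(a: Int, tag: String)

  // Toy stand-ins for the views: each holds one row per tag for the same key.
  val ta: Seq[Row] = Seq(Row(1, "a"), Row(1, "b"))
  val tb: Seq[Row] = Seq(Row(1, "a"), Row(1, "b"))

  // Nested-loop inner join that returns the right-hand rows, like `select tb.*`.
  def innerJoin(left: Seq[Row], right: Seq[Row])(cond: (Row, Row) => Boolean): Seq[Row] =
    for (l <- left; r <- right if cond(l, r)) yield r

  // Join on both columns, as written in the query.
  val withTagCondition: Seq[Row] = innerJoin(ta, tb)((l, r) => l.a == r.a && l.tag == r.tag)

  // Join after the tag condition has been folded away.
  val withoutTagCondition: Seq[Row] = innerJoin(ta, tb)((l, r) => l.a == r.a)
}
```

With the tag condition the join yields two rows; without it every `ta` row matches every `tb` row with the same `a`, so four rows come back.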
[GitHub] spark pull request #16953: [SPARK-19622][WebUI]Fix a http error in a paged t...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/16953 [SPARK-19622][WebUI]Fix a http error in a paged table when using a `Go` button to search. ## What changes were proposed in this pull request? The search function of the paged table does not work because we don't skip the hash part of the request path. ![](https://issues.apache.org/jira/secure/attachment/12852996/screenshot-1.png) ## How was this patch tested? Tested manually with my browser. You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark fix-webui-paged-table Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16953.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16953 commit a4364dace3a8305f5ef7627ce68973bf7b7f7c6b Author: Stan Zhai <zhaishi...@haizhi.com> Date: 2017-02-16T06:17:54Z fixed a pagination bug of paged table.
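The idea of skipping the hash part before building the target URL can be sketched as below. This is an illustration in plain Scala, not the code of the patch: `goToPage`, the `page` parameter name and the example URL are all assumptions made for the sketch.

```scala
object PagedTableUrlSketch {
  // Builds the URL for the requested page from the current location.
  def goToPage(href: String, page: Int): String = {
    // Drop the fragment ("#...") so the parameter is not appended after it.
    val withoutHash = href.takeWhile(_ != '#')
    val separator = if (withoutHash.contains("?")) "&" else "?"
    s"$withoutHash${separator}page=$page"
  }
}
```

If the fragment were kept, the page parameter would end up inside the fragment, where the server never sees it.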
[GitHub] spark pull request #16874: [SPARK-19509][SQL]Fix a NPE problem in grouping s...
Github user stanzhai closed the pull request at: https://github.com/apache/spark/pull/16874
[GitHub] spark pull request #16874: [SPARK-19509][SQL][branch-2.1]Fix a NPE problem i...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/16874 [SPARK-19509][SQL][branch-2.1]Fix a NPE problem in grouping sets when using an empty column ## What changes were proposed in this pull request? If a column of a table contains only null values, the following SQL will throw an NPE: `select count(1) from test group by e grouping sets(e)`. The reason is that when `ResolveGroupingAnalytics` applies transformUp to a `GroupingSets`, it uses a `nullBitmask` to mark attributes as nullable, so the nullability of an attribute may be modified. This PR just sets the nullability of all attributes in the group by expressions to `true` to fix the problem. PR #15484 in the master branch has already fixed this problem. ## How was this patch tested? Tested with Hive in my environment. You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark fix-grouping-sets Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16874.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16874 commit 3690cb29a3c15903dd6290502fb736daa99157b4 Author: Stan Zhai <zhaishi...@haizhi.com> Date: 2017-02-09T13:22:02Z fix a NPE issue of grouping sets
[GitHub] spark issue #16617: [SPARK-19261][SQL]Support `ALTER TABLE table_name ADD CO...
Github user stanzhai commented on the issue: https://github.com/apache/spark/pull/16617 Good job! I will review your PR.
[GitHub] spark pull request #16617: [SPARK-19261][SQL]Support `ALTER TABLE table_name...
Github user stanzhai closed the pull request at: https://github.com/apache/spark/pull/16617
[GitHub] spark pull request #16617: [SPARK-19261][SQL]Support `ALTER TABLE table_name...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/16617 [SPARK-19261][SQL]Support `ALTER TABLE table_name ADD COLUMNS(..)` statement ## What changes were proposed in this pull request? We should support the `ALTER TABLE table_name ADD COLUMNS(..)` statement, which was already supported in versions < 2.x. This is very useful for those who want to upgrade their Spark version to 2.x. ## How was this patch tested? Added some test cases in `DDLCommandSuite`, and tested with Hive in my environment. You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16617.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16617 commit 69729ef083f152eb91a80e1ea7f1481234766c7c Author: zhaishidan <zhaishi...@haizhi.com> Date: 2015-07-14T07:47:36Z fix document error about spark.kryoserializer.buffer.max.mb commit f7f5c77194492c41eaa63efc1516e1cb73603c3f Author: zhaishidan <zhaishi...@haizhi.com> Date: 2016-03-17T10:00:02Z Merge branch 'master' of https://github.com/apache/spark commit e0b6a807a6374553a81a8a07d37fdd643e9fcbc0 Author: StanZhai <stan@stanzhaidemac-mini.local> Date: 2016-10-08T15:22:46Z Merge branch 'master' of https://github.com/apache/spark commit f50377b9e4d5c3ae1a7b232fffe96015319a32af Author: Stan Zhai <zhaishi...@haizhi.com> Date: 2017-01-16T07:45:34Z Merge branch 'master' of https://github.com/apache/spark into stan-master commit 2e1e53a2bd28decef6dbef3af16b10512b26a664 Author: Stan Zhai <zhaishi...@haizhi.com> Date: 2017-01-16T08:43:46Z support `alter table add columns` commit e55350a1876e4b46584476795cfce6184248d66d Author: Stan Zhai <zhaishi...@haizhi.com> Date: 2017-01-17T10:16:22Z Merge branch 'master' of https://github.com/apache/spark into stan-master commit ba7373256a9deeefcb22a7facf006ec85403afb9 Author: Stan Zhai <zhaishi...@haizhi.com> Date: 2017-01-17T12:09:42Z update test commit 3cafe2c0d54b0bb2a9d9ff8814e5183571deff26 Author: Stan Zhai <zhaishi...@haizhi.com> Date: 2017-01-17T13:09:58Z revert pom.xml
[GitHub] spark pull request: [SPARK-9010][Documentation]Improve the Spark C...
Github user stanzhai closed the pull request at: https://github.com/apache/spark/pull/7393
[GitHub] spark pull request: [SPARK-9010][Documentation]Improve the Spark C...
Github user stanzhai closed the pull request at: https://github.com/apache/spark/pull/7368
[GitHub] spark pull request: [SPARK-9010][Documentation]Improve the Spark C...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/7393 [SPARK-9010][Documentation]Improve the Spark Configuration document about `spark.kryoserializer.buffer` The description of `spark.kryoserializer.buffer` should be: "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to `spark.kryoserializer.buffer.max` if needed." The `spark.kryoserializer.buffer.max.mb` setting is out of date in Spark 1.4. You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7393.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7393 commit 69729ef083f152eb91a80e1ea7f1481234766c7c Author: zhaishidan <zhaishi...@haizhi.com> Date: 2015-07-14T07:47:36Z fix document error about spark.kryoserializer.buffer.max.mb
[GitHub] spark pull request: [SPARK-9010][Documentation]Improve the Spark C...
GitHub user stanzhai opened a pull request: https://github.com/apache/spark/pull/7368 [SPARK-9010][Documentation]Improve the Spark Configuration document about `spark.kryoserializer.buffer` The description of `spark.kryoserializer.buffer` should be: "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to `spark.kryoserializer.buffer.max` if needed." The `spark.kryoserializer.buffer.max.mb` setting is out of date in Spark 1.4. You can merge this pull request into a Git repository by running: $ git pull https://github.com/stanzhai/spark branch-1.4 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/7368.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #7368 commit bfb80d44f90fe4d538907309dfa55d9ec8703ff5 Author: zhaishidan <zhaishi...@haizhi.com> Date: 2015-07-13T07:53:11Z fix document error about spark.kryoserializer.buffer.max.mb