spark git commit: [SPARK-8184][SQL] Add additional function description for weekofyear
Repository: spark Updated Branches: refs/heads/master c9749068e -> 1c7db00c7 [SPARK-8184][SQL] Add additional function description for weekofyear ## What changes were proposed in this pull request? Add additional function description for weekofyear. ## How was this patch tested? manual tests ![weekofyear](https://cloud.githubusercontent.com/assets/5399861/26525752/08a1c278-4394-11e7-8988-7cbf82c3a999.gif) Author: Yuming WangCloses #18132 from wangyum/SPARK-8184. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1c7db00c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1c7db00c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1c7db00c Branch: refs/heads/master Commit: 1c7db00c74ec6a91c7eefbdba85cbf41fbe8634a Parents: c974906 Author: Yuming Wang Authored: Mon May 29 16:10:22 2017 -0700 Committer: Reynold Xin Committed: Mon May 29 16:10:22 2017 -0700 -- .../spark/sql/catalyst/expressions/datetimeExpressions.scala | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1c7db00c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala index 43ca2cf..4098300 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala @@ -402,13 +402,15 @@ case class DayOfMonth(child: Expression) extends UnaryExpression with ImplicitCa } } +// scalastyle:off line.size.limit @ExpressionDescription( - usage = "_FUNC_(date) - Returns the week of the year of the given date.", + usage = "_FUNC_(date) - Returns the week of the year of the given date. A week is considered to start on a Monday and week 1 is the first week with >3 days.", extended = """ Examples: > SELECT _FUNC_('2008-02-20'); 8 """) +// scalastyle:on line.size.limit case class WeekOfYear(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { override def inputTypes: Seq[AbstractDataType] = Seq(DateType) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark issue #18132: [SPARK-8184][SQL] Add additional function description fo...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18132 Thanks - merging in master/branch-2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18086: [SPARK-20854][SQL] Extend hint syntax to support ...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/18086#discussion_r118473083 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala --- @@ -533,13 +533,16 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging } /** - * Add a [[UnresolvedHint]] to a logical plan. + * Add a [[UnresolvedHint]]s to a logical plan. */ private def withHints( ctx: HintContext, query: LogicalPlan): LogicalPlan = withOrigin(ctx) { -val stmt = ctx.hintStatement -UnresolvedHint(stmt.hintName.getText, stmt.parameters.asScala.map(_.getText), query) +var plan = query --- End diff -- Honestly I think foldLeft is almost always a bad idea ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18042: [SPARK-20817][core] Fix to return "Unknown processor" on...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18042 Does this really matter? I'd rather not complicate the actual code for it to display properly in some niche hardware that very few people use. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18086: [SPARK-20854][SQL] Extend hint syntax to support express...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18086 cc @gatorsmile @cloud-fan @hvanhovell --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18016: [SPARK-20786][SQL]Improve ceil and floor handle the valu...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18016 hm guys please donât use the end-to-end tests to test expression behavior. use unit tests which automatically tests code gen, interpreted, and different data types. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18087: [SPARK-20867][SQL] Move hints from Statistics int...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/18087#discussion_r118353924 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -195,9 +195,9 @@ case class Intersect(left: LogicalPlan, right: LogicalPlan) extends SetOperation val leftSize = left.stats(conf).sizeInBytes val rightSize = right.stats(conf).sizeInBytes val sizeInBytes = if (leftSize < rightSize) leftSize else rightSize -val isBroadcastable = left.stats(conf).isBroadcastable || right.stats(conf).isBroadcastable - -Statistics(sizeInBytes = sizeInBytes, isBroadcastable = isBroadcastable) +Statistics( + sizeInBytes = sizeInBytes, + hints = left.stats(conf).hints.resetForJoin()) --- End diff -- It's actually no-op since Intersect is rewritten to a join always .. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18087: [SPARK-20867][SQL] Move hints from Statistics into HintI...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18087 cc @hvanhovell, @bogdanrdc --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18087: [SPARK-20867][SQL] Move hints from Statistics int...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/18087 [SPARK-20867][SQL] Move hints from Statistics into HintInfo class ## What changes were proposed in this pull request? This is a follow-up to SPARK-20857 to move the broadcast hint from Statistics into a new HintInfo class, so we can be more flexible in adding new hints in the future. ## How was this patch tested? Updated test cases to reflect the change. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-20867 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18087.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18087 commit 19232cfc8ad54229b73fce4792c8b4c6d3d72495 Author: Reynold Xin <r...@databricks.com> Date: 2017-05-24T12:57:36Z [SPARK-20867][SQL] Move individual hints from Statistics into HintInfo class --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18082: [SPARK-20665][SQL][FOLLOW-UP]Move test case to SQLQueryT...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18082 Hm I'm not sure if it is a good idea to run so many "unit test" style tests for expressions in the end to end suites. It takes a lot of time than just running unit tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-20857][SQL] Generic resolved hint node
Repository: spark Updated Branches: refs/heads/branch-2.2 dbb068f4f -> d20c64695 [SPARK-20857][SQL] Generic resolved hint node ## What changes were proposed in this pull request? This patch renames BroadcastHint to ResolvedHint (and Hint to UnresolvedHint) so the hint framework is more generic and would allow us to introduce other hint types in the future without introducing new hint nodes. ## How was this patch tested? Updated test cases. Author: Reynold Xin <r...@databricks.com> Closes #18072 from rxin/SPARK-20857. (cherry picked from commit 0d589ba00b5d539fbfef5174221de046a70548cd) Signed-off-by: Reynold Xin <r...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d20c6469 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d20c6469 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d20c6469 Branch: refs/heads/branch-2.2 Commit: d20c6469565c4f7687f9af14a6f12a775b0c6e62 Parents: dbb068f Author: Reynold Xin <r...@databricks.com> Authored: Tue May 23 18:44:49 2017 +0200 Committer: Reynold Xin <r...@databricks.com> Committed: Tue May 23 18:45:08 2017 +0200 -- .../spark/sql/catalyst/analysis/Analyzer.scala | 2 +- .../sql/catalyst/analysis/CheckAnalysis.scala | 2 +- .../sql/catalyst/analysis/ResolveHints.scala| 12 ++--- .../sql/catalyst/optimizer/Optimizer.scala | 2 +- .../sql/catalyst/optimizer/expressions.scala| 2 +- .../spark/sql/catalyst/parser/AstBuilder.scala | 4 +- .../spark/sql/catalyst/planning/patterns.scala | 4 +- .../sql/catalyst/plans/logical/Statistics.scala | 5 ++ .../plans/logical/basicLogicalOperators.scala | 22 + .../sql/catalyst/plans/logical/hints.scala | 49 .../catalyst/analysis/ResolveHintsSuite.scala | 41 .../catalyst/optimizer/ColumnPruningSuite.scala | 5 +- .../optimizer/FilterPushdownSuite.scala | 4 +- .../optimizer/JoinOptimizationSuite.scala | 4 +- .../sql/catalyst/parser/PlanParserSuite.scala | 15 +++--- .../BasicStatsEstimationSuite.scala | 2 +- .../scala/org/apache/spark/sql/Dataset.scala| 2 +- .../spark/sql/execution/SparkStrategies.scala | 2 +- .../scala/org/apache/spark/sql/functions.scala | 5 +- .../execution/joins/BroadcastJoinSuite.scala| 14 +++--- 20 files changed, 118 insertions(+), 80 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d20c6469/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index 5be67ac..9979642 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -1311,7 +1311,7 @@ class Analyzer( // Category 1: // BroadcastHint, Distinct, LeafNode, Repartition, and SubqueryAlias -case _: BroadcastHint | _: Distinct | _: LeafNode | _: Repartition | _: SubqueryAlias => +case _: ResolvedHint | _: Distinct | _: LeafNode | _: Repartition | _: SubqueryAlias => // Category 2: // These operators can be anywhere in a correlated subquery. http://git-wip-us.apache.org/repos/asf/spark/blob/d20c6469/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index ea4560a..2e3ac3e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -399,7 +399,7 @@ trait CheckAnalysis extends PredicateHelper { |in operator ${operator.simpleString} """.stripMargin) - case _: Hint => + case _: UnresolvedHint => throw new IllegalStateException( "Internal error: logical hint operator should have been removed during analysis") http://git-wip-us.apache.org/repos/asf/spark/blob/d20c6469/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala b/sq
spark git commit: [SPARK-20857][SQL] Generic resolved hint node
Repository: spark Updated Branches: refs/heads/master ad09e4ca0 -> 0d589ba00 [SPARK-20857][SQL] Generic resolved hint node ## What changes were proposed in this pull request? This patch renames BroadcastHint to ResolvedHint (and Hint to UnresolvedHint) so the hint framework is more generic and would allow us to introduce other hint types in the future without introducing new hint nodes. ## How was this patch tested? Updated test cases. Author: Reynold Xin <r...@databricks.com> Closes #18072 from rxin/SPARK-20857. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0d589ba0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0d589ba0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0d589ba0 Branch: refs/heads/master Commit: 0d589ba00b5d539fbfef5174221de046a70548cd Parents: ad09e4c Author: Reynold Xin <r...@databricks.com> Authored: Tue May 23 18:44:49 2017 +0200 Committer: Reynold Xin <r...@databricks.com> Committed: Tue May 23 18:44:49 2017 +0200 -- .../spark/sql/catalyst/analysis/Analyzer.scala | 2 +- .../sql/catalyst/analysis/CheckAnalysis.scala | 2 +- .../sql/catalyst/analysis/ResolveHints.scala| 12 ++--- .../sql/catalyst/optimizer/Optimizer.scala | 2 +- .../sql/catalyst/optimizer/expressions.scala| 2 +- .../spark/sql/catalyst/parser/AstBuilder.scala | 4 +- .../spark/sql/catalyst/planning/patterns.scala | 4 +- .../sql/catalyst/plans/logical/Statistics.scala | 5 ++ .../plans/logical/basicLogicalOperators.scala | 22 + .../sql/catalyst/plans/logical/hints.scala | 49 .../catalyst/analysis/ResolveHintsSuite.scala | 41 .../catalyst/optimizer/ColumnPruningSuite.scala | 5 +- .../optimizer/FilterPushdownSuite.scala | 4 +- .../optimizer/JoinOptimizationSuite.scala | 4 +- .../sql/catalyst/parser/PlanParserSuite.scala | 15 +++--- .../BasicStatsEstimationSuite.scala | 2 +- .../scala/org/apache/spark/sql/Dataset.scala| 2 +- .../spark/sql/execution/SparkStrategies.scala | 2 +- .../scala/org/apache/spark/sql/functions.scala | 5 +- .../execution/joins/BroadcastJoinSuite.scala| 14 +++--- 20 files changed, 118 insertions(+), 80 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0d589ba0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala index d58b8ac..d130962 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala @@ -1336,7 +1336,7 @@ class Analyzer( // Category 1: // BroadcastHint, Distinct, LeafNode, Repartition, and SubqueryAlias -case _: BroadcastHint | _: Distinct | _: LeafNode | _: Repartition | _: SubqueryAlias => +case _: ResolvedHint | _: Distinct | _: LeafNode | _: Repartition | _: SubqueryAlias => // Category 2: // These operators can be anywhere in a correlated subquery. http://git-wip-us.apache.org/repos/asf/spark/blob/0d589ba0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index ea4560a..2e3ac3e 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -399,7 +399,7 @@ trait CheckAnalysis extends PredicateHelper { |in operator ${operator.simpleString} """.stripMargin) - case _: Hint => + case _: UnresolvedHint => throw new IllegalStateException( "Internal error: logical hint operator should have been removed during analysis") http://git-wip-us.apache.org/repos/asf/spark/blob/0d589ba0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala index df688fa..9dfd84c 100644 --- a/sql/cata
[GitHub] spark issue #18072: [SPARK-20857][SQL] Generic resolved hint node
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18072 Merging in master / branch-2.2 ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18072: [SPARK-20857][SQL] Generic resolved hint node
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/18072 [SPARK-20857][SQL] Generic resolved hint node ## What changes were proposed in this pull request? This patch renames BroadcastHint to ResolvedHint so it is more generic and would allow us to introduce other hint types in the future without introducing new hint nodes. ## How was this patch tested? Updated test cases. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-20857 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18072.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18072 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18064: [SPARK-20213][SQL] Fix DataFrameWriter operations in SQL...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18064 That works too, if we can attach metrics to these commands. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18070: [SPARK-20713][Spark Core] Convert CommitDenied to TaskKi...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/18070 cc @ericl --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17999: [SPARK-20751][SQL] Add built-in SQL Function - COT
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17999 hmnmm seems like we should be following how we test tan, cos, etc in MathExpressionsSuite? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18023: [SPARK-12139] [SQL] REGEX Column Specification
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/18023#discussion_r117540055 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -2624,4 +2624,92 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { val e = intercept[AnalysisException](sql("SELECT nvl(1, 2, 3)")) assert(e.message.contains("Invalid number of arguments")) } + + test("SPARK-12139: REGEX Column Specification for Hive Queries") { --- End diff -- Yes let's use those rather than adding more files to SQLQuerySUite. I'd love to get rid of SQLQuerySuite --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18023: [SPARK-12139] [SQL] REGEX Column Specification
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/18023#discussion_r117539904 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -795,6 +795,12 @@ object SQLConf { .intConf .createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt) + val SUPPORT_QUOTED_REGEX_COLUMN_NAME = buildConf("spark.sql.parser.quotedRegexColumnNames") +.internal() --- End diff -- should be public --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16478 I don't know how important it is. It seems like it's primarily used by MLlib and very few other things ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17997: [SPARK-20763][SQL]The function of `month` and `da...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17997#discussion_r116878495 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala --- @@ -601,22 +601,32 @@ object DateTimeUtils { * The calculation uses the fact that the period 1.1.2001 until 31.12.2400 is * equals to the period 1.1.1601 until 31.12.2000. */ - private[this] def getYearAndDayInYear(daysSince1970: SQLDate): (Int, Int) = { -// add the difference (in days) between 1.1.1970 and the artificial year 0 (-17999) -val daysNormalized = daysSince1970 + toYearZero -val numOfQuarterCenturies = daysNormalized / daysIn400Years -val daysInThis400 = daysNormalized % daysIn400Years + 1 -val (years, dayInYear) = numYears(daysInThis400) -val year: Int = (2001 - 2) + 400 * numOfQuarterCenturies + years -(year, dayInYear) + private[this] def getYearAndDayInYear(daysSince1970: SQLDate): (Int, Int, Int) = { +val date = new Date(daysToMillis(daysSince1970)) +val YMD = date.toString.trim.split("-") --- End diff -- this would cause massive performance regression. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15821: [SPARK-13534][PySpark] Using Apache Arrow to increase pe...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15821 @BryanCutler even though the json is long, it is still so much clearer than reading a pile of code that generates json ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17941: [SPARK-20684][R] Expose createGlobalTempView and dropGlo...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17941 @felixcheung what's your concern with this one? seems like just for api parity sake we should add this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17711: [SPARK-19951][SQL] Add string concatenate operator || to...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17711 I feel both are pretty complicated. Can we just do something similar to CombineUnion: ``` /** * Combines all adjacent [[Union]] operators into a single [[Union]]. */ object CombineUnions extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transformDown { case u: Union => flattenUnion(u, false) case Distinct(u: Union) => Distinct(flattenUnion(u, true)) } private def flattenUnion(union: Union, flattenDistinct: Boolean): Union = { val stack = mutable.Stack[LogicalPlan](union) val flattened = mutable.ArrayBuffer.empty[LogicalPlan] while (stack.nonEmpty) { stack.pop() match { case Distinct(Union(children)) if flattenDistinct => stack.pushAll(children.reverse) case Union(children) => stack.pushAll(children.reverse) case child => flattened += child } } Union(flattened) } }``` It's going to be simpler because you don't need to handle distinct here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17942: [SPARK-20702][Core]TaskContextImpl.markTaskComple...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17942#discussion_r116143097 --- Diff: core/src/main/scala/org/apache/spark/util/taskListeners.scala --- @@ -55,14 +55,16 @@ class TaskCompletionListenerException( extends RuntimeException { override def getMessage: String = { -if (errorMessages.size == 1) { --- End diff -- It's a common pattern in scala. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17923: [SPARK-20591][WEB UI] Succeeded tasks num not equal in a...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17923 sry too long ago --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17931: [SPARK-12837][CORE][FOLLOWUP] getting name should not fa...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17931 What's the issue with SQL metrics? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: Revert "[SPARK-12297][SQL] Hive compatibility for Parquet Timestamps"
Repository: spark Updated Branches: refs/heads/master 1b85bcd92 -> ac1ab6b9d Revert "[SPARK-12297][SQL] Hive compatibility for Parquet Timestamps" This reverts commit 22691556e5f0dfbac81b8cc9ca0a67c70c1711ca. See JIRA ticket for more information. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ac1ab6b9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ac1ab6b9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ac1ab6b9 Branch: refs/heads/master Commit: ac1ab6b9db188ac54c745558d57dd0a031d0b162 Parents: 1b85bcd Author: Reynold XinAuthored: Tue May 9 11:35:59 2017 -0700 Committer: Reynold Xin Committed: Tue May 9 11:35:59 2017 -0700 -- .../spark/sql/catalyst/catalog/interface.scala | 4 +- .../spark/sql/catalyst/util/DateTimeUtils.scala | 5 - .../parquet/VectorizedColumnReader.java | 28 +- .../parquet/VectorizedParquetRecordReader.java | 6 +- .../spark/sql/execution/command/tables.scala| 8 +- .../datasources/parquet/ParquetFileFormat.scala | 2 - .../parquet/ParquetReadSupport.scala| 3 +- .../parquet/ParquetRecordMaterializer.scala | 9 +- .../parquet/ParquetRowConverter.scala | 53 +-- .../parquet/ParquetWriteSupport.scala | 25 +- .../spark/sql/hive/HiveExternalCatalog.scala| 11 +- .../spark/sql/hive/HiveMetastoreCatalog.scala | 12 +- .../hive/ParquetHiveCompatibilitySuite.scala| 379 +-- 13 files changed, 29 insertions(+), 516 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ac1ab6b9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala index c39017e..cc0cbba 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala @@ -132,10 +132,10 @@ case class CatalogTablePartition( /** * Given the partition schema, returns a row with that schema holding the partition values. */ - def toRow(partitionSchema: StructType, defaultTimeZoneId: String): InternalRow = { + def toRow(partitionSchema: StructType, defaultTimeZondId: String): InternalRow = { val caseInsensitiveProperties = CaseInsensitiveMap(storage.properties) val timeZoneId = caseInsensitiveProperties.getOrElse( - DateTimeUtils.TIMEZONE_OPTION, defaultTimeZoneId) + DateTimeUtils.TIMEZONE_OPTION, defaultTimeZondId) InternalRow.fromSeq(partitionSchema.map { field => val partValue = if (spec(field.name) == ExternalCatalogUtils.DEFAULT_PARTITION_NAME) { null http://git-wip-us.apache.org/repos/asf/spark/blob/ac1ab6b9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala index bf596fa..6c1592f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala @@ -498,11 +498,6 @@ object DateTimeUtils { false } - lazy val validTimezones = TimeZone.getAvailableIDs().toSet - def isValidTimezone(timezoneId: String): Boolean = { -validTimezones.contains(timezoneId) - } - /** * Returns the microseconds since year zero (-17999) from microseconds since epoch. */ http://git-wip-us.apache.org/repos/asf/spark/blob/ac1ab6b9/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java index dabbc2b..9d641b5 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java @@ -18,9 +18,7 @@ package org.apache.spark.sql.execution.datasources.parquet; import java.io.IOException; -import java.util.TimeZone; -import org.apache.hadoop.conf.Configuration; import org.apache.parquet.bytes.BytesUtils; import
[GitHub] spark issue #16781: [SPARK-12297][SQL] Hive compatibility for Parquet Timest...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16781 Did we conduct any performance tests on this patch? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17915: [SPARK-20674][SQL] Support registering UserDefine...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/17915 [SPARK-20674][SQL] Support registering UserDefinedFunction as named UDF ## What changes were proposed in this pull request? For some reason we don't have an API to register UserDefinedFunction as named UDF. It is a no brainer to add one, in addition to the existing register functions we have. ## How was this patch tested? Added a test case in UDFSuite for the new API. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-20674 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17915.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17915 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch
Repository: spark Updated Branches: refs/heads/master b31648c08 -> 5d75b14bf [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch ## What changes were proposed in this pull request? Due to a likely typo, the logDebug msg printing the diff of query plans shows a diff to the initial plan, not diff to the start of batch. ## How was this patch tested? Now the debug message prints the diff between start and end of batch. Author: Juliusz SompolskiCloses #17875 from juliuszsompolski/SPARK-20616. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5d75b14b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5d75b14b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5d75b14b Branch: refs/heads/master Commit: 5d75b14bf0f4c1f0813287efaabf49797908ed55 Parents: b31648c Author: Juliusz Sompolski Authored: Fri May 5 15:31:06 2017 -0700 Committer: Reynold Xin Committed: Fri May 5 15:31:06 2017 -0700 -- .../scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5d75b14b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala index 6fc828f..85b368c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala @@ -122,7 +122,7 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging { logDebug( s""" |=== Result of Batch ${batch.name} === - |${sideBySide(plan.treeString, curPlan.treeString).mkString("\n")} + |${sideBySide(batchStartPlan.treeString, curPlan.treeString).mkString("\n")} """.stripMargin) } else { logTrace(s"Batch ${batch.name} has no effect.") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch
Repository: spark Updated Branches: refs/heads/branch-2.2 f59c74a94 -> 1d9b7a74a [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch ## What changes were proposed in this pull request? Due to a likely typo, the logDebug msg printing the diff of query plans shows a diff to the initial plan, not diff to the start of batch. ## How was this patch tested? Now the debug message prints the diff between start and end of batch. Author: Juliusz SompolskiCloses #17875 from juliuszsompolski/SPARK-20616. (cherry picked from commit 5d75b14bf0f4c1f0813287efaabf49797908ed55) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1d9b7a74 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1d9b7a74 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1d9b7a74 Branch: refs/heads/branch-2.2 Commit: 1d9b7a74a839021814ab28d3eba3636c64483130 Parents: f59c74a Author: Juliusz Sompolski Authored: Fri May 5 15:31:06 2017 -0700 Committer: Reynold Xin Committed: Fri May 5 15:31:13 2017 -0700 -- .../scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1d9b7a74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala index 6fc828f..85b368c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala @@ -122,7 +122,7 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging { logDebug( s""" |=== Result of Batch ${batch.name} === - |${sideBySide(plan.treeString, curPlan.treeString).mkString("\n")} + |${sideBySide(batchStartPlan.treeString, curPlan.treeString).mkString("\n")} """.stripMargin) } else { logTrace(s"Batch ${batch.name} has no effect.") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch
Repository: spark Updated Branches: refs/heads/branch-2.1 704b249b6 -> a1112c615 [SPARK-20616] RuleExecutor logDebug of batch results should show diff to start of batch ## What changes were proposed in this pull request? Due to a likely typo, the logDebug msg printing the diff of query plans shows a diff to the initial plan, not diff to the start of batch. ## How was this patch tested? Now the debug message prints the diff between start and end of batch. Author: Juliusz SompolskiCloses #17875 from juliuszsompolski/SPARK-20616. (cherry picked from commit 5d75b14bf0f4c1f0813287efaabf49797908ed55) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a1112c61 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a1112c61 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a1112c61 Branch: refs/heads/branch-2.1 Commit: a1112c615b05d615048159c9d324aa10a4391d4e Parents: 704b249 Author: Juliusz Sompolski Authored: Fri May 5 15:31:06 2017 -0700 Committer: Reynold Xin Committed: Fri May 5 15:31:23 2017 -0700 -- .../scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a1112c61/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala index 6fc828f..85b368c 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/rules/RuleExecutor.scala @@ -122,7 +122,7 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging { logDebug( s""" |=== Result of Batch ${batch.name} === - |${sideBySide(plan.treeString, curPlan.treeString).mkString("\n")} + |${sideBySide(batchStartPlan.treeString, curPlan.treeString).mkString("\n")} """.stripMargin) } else { logTrace(s"Batch ${batch.name} has no effect.") - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark issue #17875: [SPARK-20616] RuleExecutor logDebug of batch results sho...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17875 Merging in master/branch-2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17875: [SPARK-20616] RuleExecutor logDebug of batch results sho...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17875 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17851: [SPARK-20585][SPARKR] R generic hint support
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17851 @felixcheung was this merged only in master but not branch-2.2? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17770 @srinathshankar also thinks it's weird to add a barrier node. I suggest @hvanhovell and @srinathshankar duke it out. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17723: [SPARK-20434][YARN][CORE] Move kerberos delegation token...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17723 I'm saying avoid exposing Hadoop APIs. Wrap them around something if possible. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17723: [SPARK-20434][YARN][CORE] Move kerberos delegation token...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17723 I didn't read through the super long debate here, but I have a strong preference to not expose Hadoop APIs directly. I'm seeing more and more deployments out there that do not use Hadoop (e.g. connect directly to cloud storage, connect to some on-premise object store, connect to Redis, connect to some netapp appliance, connect directly to a message queue or just run Spark on a laptop). Hadoop APIs were designed for a different world pre Spark. Serialization is painful (Configuration?) to deal with, API breaking changes are painful to deal with, size of the dependencies are painful to deal with (especially considering the single node use cases in which ideally we'd just want a super trimmed down jar). As you can see (although most of you that have chimed in here don't know much about the new components), the newer components (Spark SQL) does not expose Hadoop APIs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-20584][PYSPARK][SQL] Python generic hint support
Repository: spark Updated Branches: refs/heads/master 13eb37c86 -> 02bbe7311 [SPARK-20584][PYSPARK][SQL] Python generic hint support ## What changes were proposed in this pull request? Adds `hint` method to PySpark `DataFrame`. ## How was this patch tested? Unit tests, doctests. Author: zero323Closes #17850 from zero323/SPARK-20584. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/02bbe731 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/02bbe731 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/02bbe731 Branch: refs/heads/master Commit: 02bbe73118a39e2fb378aa2002449367a92f6d67 Parents: 13eb37c Author: zero323 Authored: Wed May 3 19:15:28 2017 -0700 Committer: Reynold Xin Committed: Wed May 3 19:15:28 2017 -0700 -- python/pyspark/sql/dataframe.py | 29 + python/pyspark/sql/tests.py | 16 2 files changed, 45 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/02bbe731/python/pyspark/sql/dataframe.py -- diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index ab6d35b..7b67985 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -380,6 +380,35 @@ class DataFrame(object): jdf = self._jdf.withWatermark(eventTime, delayThreshold) return DataFrame(jdf, self.sql_ctx) +@since(2.2) +def hint(self, name, *parameters): +"""Specifies some hint on the current DataFrame. + +:param name: A name of the hint. +:param parameters: Optional parameters. +:return: :class:`DataFrame` + +>>> df.join(df2.hint("broadcast"), "name").show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5|85| +++---+--+ +""" +if len(parameters) == 1 and isinstance(parameters[0], list): +parameters = parameters[0] + +if not isinstance(name, str): +raise TypeError("name should be provided as str, got {0}".format(type(name))) + +for p in parameters: +if not isinstance(p, str): +raise TypeError( +"all parameters should be str, got {0} of type {1}".format(p, type(p))) + +jdf = self._jdf.hint(name, self._jseq(parameters)) +return DataFrame(jdf, self.sql_ctx) + @since(1.3) def count(self): """Returns the number of rows in this :class:`DataFrame`. http://git-wip-us.apache.org/repos/asf/spark/blob/02bbe731/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index ce4abf8..f644624 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -1906,6 +1906,22 @@ class SQLTests(ReusedPySparkTestCase): # planner should not crash without a join broadcast(df1)._jdf.queryExecution().executedPlan() +def test_generic_hints(self): +from pyspark.sql import DataFrame + +df1 = self.spark.range(10e10).toDF("id") +df2 = self.spark.range(10e10).toDF("id") + +self.assertIsInstance(df1.hint("broadcast"), DataFrame) +self.assertIsInstance(df1.hint("broadcast", []), DataFrame) + +# Dummy rules +self.assertIsInstance(df1.hint("broadcast", "foo", "bar"), DataFrame) +self.assertIsInstance(df1.hint("broadcast", ["foo", "bar"]), DataFrame) + +plan = df1.join(df2.hint("broadcast"), "id")._jdf.queryExecution().executedPlan() +self.assertEqual(1, plan.toString().count("BroadcastHashJoin")) + def test_toDF_with_schema_string(self): data = [Row(key=i, value=str(i)) for i in range(100)] rdd = self.sc.parallelize(data, 5) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-20584][PYSPARK][SQL] Python generic hint support
Repository: spark Updated Branches: refs/heads/branch-2.2 a3a5fcfef -> d8bd213f1 [SPARK-20584][PYSPARK][SQL] Python generic hint support ## What changes were proposed in this pull request? Adds `hint` method to PySpark `DataFrame`. ## How was this patch tested? Unit tests, doctests. Author: zero323Closes #17850 from zero323/SPARK-20584. (cherry picked from commit 02bbe73118a39e2fb378aa2002449367a92f6d67) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d8bd213f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d8bd213f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d8bd213f Branch: refs/heads/branch-2.2 Commit: d8bd213f13279664d50ffa57c1814d0b16fc5d23 Parents: a3a5fcf Author: zero323 Authored: Wed May 3 19:15:28 2017 -0700 Committer: Reynold Xin Committed: Wed May 3 19:15:42 2017 -0700 -- python/pyspark/sql/dataframe.py | 29 + python/pyspark/sql/tests.py | 16 2 files changed, 45 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d8bd213f/python/pyspark/sql/dataframe.py -- diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py index f567cc4..d62ba96 100644 --- a/python/pyspark/sql/dataframe.py +++ b/python/pyspark/sql/dataframe.py @@ -371,6 +371,35 @@ class DataFrame(object): jdf = self._jdf.withWatermark(eventTime, delayThreshold) return DataFrame(jdf, self.sql_ctx) +@since(2.2) +def hint(self, name, *parameters): +"""Specifies some hint on the current DataFrame. + +:param name: A name of the hint. +:param parameters: Optional parameters. +:return: :class:`DataFrame` + +>>> df.join(df2.hint("broadcast"), "name").show() +++---+--+ +|name|age|height| +++---+--+ +| Bob| 5|85| +++---+--+ +""" +if len(parameters) == 1 and isinstance(parameters[0], list): +parameters = parameters[0] + +if not isinstance(name, str): +raise TypeError("name should be provided as str, got {0}".format(type(name))) + +for p in parameters: +if not isinstance(p, str): +raise TypeError( +"all parameters should be str, got {0} of type {1}".format(p, type(p))) + +jdf = self._jdf.hint(name, self._jseq(parameters)) +return DataFrame(jdf, self.sql_ctx) + @since(1.3) def count(self): """Returns the number of rows in this :class:`DataFrame`. http://git-wip-us.apache.org/repos/asf/spark/blob/d8bd213f/python/pyspark/sql/tests.py -- diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py index cd92148..2aa2d23 100644 --- a/python/pyspark/sql/tests.py +++ b/python/pyspark/sql/tests.py @@ -1906,6 +1906,22 @@ class SQLTests(ReusedPySparkTestCase): # planner should not crash without a join broadcast(df1)._jdf.queryExecution().executedPlan() +def test_generic_hints(self): +from pyspark.sql import DataFrame + +df1 = self.spark.range(10e10).toDF("id") +df2 = self.spark.range(10e10).toDF("id") + +self.assertIsInstance(df1.hint("broadcast"), DataFrame) +self.assertIsInstance(df1.hint("broadcast", []), DataFrame) + +# Dummy rules +self.assertIsInstance(df1.hint("broadcast", "foo", "bar"), DataFrame) +self.assertIsInstance(df1.hint("broadcast", ["foo", "bar"]), DataFrame) + +plan = df1.join(df2.hint("broadcast"), "id")._jdf.queryExecution().executedPlan() +self.assertEqual(1, plan.toString().count("BroadcastHashJoin")) + def test_toDF_with_schema_string(self): data = [Row(key=i, value=str(i)) for i in range(100)] rdd = self.sc.parallelize(data, 5) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark issue #17850: [SPARK-20584][PYSPARK][SQL] Python generic hint support
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17850 Merging in master/2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17850: [SPARK-20584][PYSPARK][SQL] Python generic hint support
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17850 LGTM pending Jenkins. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17850: [SPARK-20584][PYSPARK][SQL] Python generic hint s...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17850#discussion_r114677412 --- Diff: python/pyspark/sql/dataframe.py --- @@ -380,6 +380,35 @@ def withWatermark(self, eventTime, delayThreshold): jdf = self._jdf.withWatermark(eventTime, delayThreshold) return DataFrame(jdf, self.sql_ctx) +@since(2.3) --- End diff -- 2.2 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!=
Repository: spark Updated Branches: refs/heads/master 6b9e49d12 -> 13eb37c86 [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!= ## What changes were proposed in this pull request? This PR proposes three things as below: - This test looks not testing `<=>` and identical with the test above, `===`. So, it removes the test. ```diff - test("<=>") { - checkAnswer( - testData2.filter($"a" === 1), - testData2.collect().toSeq.filter(r => r.getInt(0) == 1)) - -checkAnswer( - testData2.filter($"a" === $"b"), - testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1))) - } ``` - Replace the test title from `=!=` to `<=>`. It looks the test actually testing `<=>`. ```diff + private lazy val nullData = Seq( +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b") + ... - test("=!=") { + test("<=>") { -val nullData = spark.createDataFrame(sparkContext.parallelize( - Row(1, 1) :: - Row(1, 2) :: - Row(1, null) :: - Row(null, null) :: Nil), - StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType - checkAnswer( nullData.filter($"b" <=> 1), ... ``` - Add the tests for `=!=` which looks not existing. ```diff + test("=!=") { +checkAnswer( + nullData.filter($"b" =!= 1), + Row(1, 2) :: Nil) + +checkAnswer(nullData.filter($"b" =!= null), Nil) + +checkAnswer( + nullData.filter($"a" =!= $"b"), + Row(1, 2) :: Nil) + } ``` ## How was this patch tested? Manually running the tests. Author: hyukjinkwonCloses #17842 from HyukjinKwon/minor-test-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/13eb37c8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/13eb37c8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/13eb37c8 Branch: refs/heads/master Commit: 13eb37c860c8f672d0e9d9065d0333f981db71e3 Parents: 6b9e49d Author: hyukjinkwon Authored: Wed May 3 13:08:25 2017 -0700 Committer: Reynold Xin Committed: Wed May 3 13:08:25 2017 -0700 -- .../spark/sql/ColumnExpressionSuite.scala | 31 +--- 1 file changed, 14 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/13eb37c8/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala index b0f398d..bc708ca 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala @@ -39,6 +39,9 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { StructType(Seq(StructField("a", BooleanType), StructField("b", BooleanType } + private lazy val nullData = Seq( +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b") + test("column names with space") { val df = Seq((1, "a")).toDF("name with space", "name.with.dot") @@ -284,23 +287,6 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { test("<=>") { checkAnswer( - testData2.filter($"a" === 1), - testData2.collect().toSeq.filter(r => r.getInt(0) == 1)) - -checkAnswer( - testData2.filter($"a" === $"b"), - testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1))) - } - - test("=!=") { -val nullData = spark.createDataFrame(sparkContext.parallelize( - Row(1, 1) :: - Row(1, 2) :: - Row(1, null) :: - Row(null, null) :: Nil), - StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType - -checkAnswer( nullData.filter($"b" <=> 1), Row(1, 1) :: Nil) @@ -321,7 +307,18 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { checkAnswer( nullData2.filter($"a" <=> null), Row(null) :: Nil) + } + test("=!=") { +checkAnswer( + nullData.filter($"b" =!= 1), + Row(1, 2) :: Nil) + +checkAnswer(nullData.filter($"b" =!= null), Nil) + +checkAnswer( + nullData.filter($"a" =!= $"b"), + Row(1, 2) :: Nil) } test(">") { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!=
Repository: spark Updated Branches: refs/heads/branch-2.2 36d807906 -> 2629e7c7a [MINOR][SQL] Fix the test title from =!= to <=>, remove a duplicated test and add a test for =!= ## What changes were proposed in this pull request? This PR proposes three things as below: - This test looks not testing `<=>` and identical with the test above, `===`. So, it removes the test. ```diff - test("<=>") { - checkAnswer( - testData2.filter($"a" === 1), - testData2.collect().toSeq.filter(r => r.getInt(0) == 1)) - -checkAnswer( - testData2.filter($"a" === $"b"), - testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1))) - } ``` - Replace the test title from `=!=` to `<=>`. It looks the test actually testing `<=>`. ```diff + private lazy val nullData = Seq( +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b") + ... - test("=!=") { + test("<=>") { -val nullData = spark.createDataFrame(sparkContext.parallelize( - Row(1, 1) :: - Row(1, 2) :: - Row(1, null) :: - Row(null, null) :: Nil), - StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType - checkAnswer( nullData.filter($"b" <=> 1), ... ``` - Add the tests for `=!=` which looks not existing. ```diff + test("=!=") { +checkAnswer( + nullData.filter($"b" =!= 1), + Row(1, 2) :: Nil) + +checkAnswer(nullData.filter($"b" =!= null), Nil) + +checkAnswer( + nullData.filter($"a" =!= $"b"), + Row(1, 2) :: Nil) + } ``` ## How was this patch tested? Manually running the tests. Author: hyukjinkwonCloses #17842 from HyukjinKwon/minor-test-fix. (cherry picked from commit 13eb37c860c8f672d0e9d9065d0333f981db71e3) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2629e7c7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2629e7c7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2629e7c7 Branch: refs/heads/branch-2.2 Commit: 2629e7c7a1dacfb267d866cf825fa8a078612462 Parents: 36d8079 Author: hyukjinkwon Authored: Wed May 3 13:08:25 2017 -0700 Committer: Reynold Xin Committed: Wed May 3 13:08:31 2017 -0700 -- .../spark/sql/ColumnExpressionSuite.scala | 31 +--- 1 file changed, 14 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2629e7c7/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala index b0f398d..bc708ca 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala @@ -39,6 +39,9 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { StructType(Seq(StructField("a", BooleanType), StructField("b", BooleanType } + private lazy val nullData = Seq( +(Some(1), Some(1)), (Some(1), Some(2)), (Some(1), None), (None, None)).toDF("a", "b") + test("column names with space") { val df = Seq((1, "a")).toDF("name with space", "name.with.dot") @@ -284,23 +287,6 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { test("<=>") { checkAnswer( - testData2.filter($"a" === 1), - testData2.collect().toSeq.filter(r => r.getInt(0) == 1)) - -checkAnswer( - testData2.filter($"a" === $"b"), - testData2.collect().toSeq.filter(r => r.getInt(0) == r.getInt(1))) - } - - test("=!=") { -val nullData = spark.createDataFrame(sparkContext.parallelize( - Row(1, 1) :: - Row(1, 2) :: - Row(1, null) :: - Row(null, null) :: Nil), - StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType - -checkAnswer( nullData.filter($"b" <=> 1), Row(1, 1) :: Nil) @@ -321,7 +307,18 @@ class ColumnExpressionSuite extends QueryTest with SharedSQLContext { checkAnswer( nullData2.filter($"a" <=> null), Row(null) :: Nil) + } + test("=!=") { +checkAnswer( + nullData.filter($"b" =!= 1), + Row(1, 2) :: Nil) + +checkAnswer(nullData.filter($"b" =!= null), Nil) + +checkAnswer( + nullData.filter($"a" =!= $"b"), + Row(1, 2) :: Nil) } test(">") { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For
[GitHub] spark issue #17842: [MINOR][SQL] Fix the test title from =!= to <=>, remove ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17842 Merging in master/branch-2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17839: [SPARK-20576][SQL] Support generic hint function in Data...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17839 BTW I filed follow-up tickets for Python/R at https://issues.apache.org/jira/browse/SPARK-20576 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame
Repository: spark Updated Branches: refs/heads/branch-2.2 b1a732fea -> f0e80aa2d [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame ## What changes were proposed in this pull request? We allow users to specify hints (currently only "broadcast" is supported) in SQL and DataFrame. However, while SQL has a standard hint format (/*+ ... */), DataFrame doesn't have one and sometimes users are confused that they can't find how to apply a broadcast hint. This ticket adds a generic hint function on DataFrame that allows using the same hint on DataFrames as well as SQL. As an example, after this patch, the following will apply a broadcast hint on a DataFrame using the new hint function: ``` df1.join(df2.hint("broadcast")) ``` ## How was this patch tested? Added a test case in DataFrameJoinSuite. Author: Reynold Xin <r...@databricks.com> Closes #17839 from rxin/SPARK-20576. (cherry picked from commit 527fc5d0c990daaacad4740f62cfe6736609b77b) Signed-off-by: Reynold Xin <r...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f0e80aa2 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f0e80aa2 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f0e80aa2 Branch: refs/heads/branch-2.2 Commit: f0e80aa2ddee80819ef33ee24eb6a15a73bc02d5 Parents: b1a732f Author: Reynold Xin <r...@databricks.com> Authored: Wed May 3 09:22:25 2017 -0700 Committer: Reynold Xin <r...@databricks.com> Committed: Wed May 3 09:22:41 2017 -0700 -- .../sql/catalyst/analysis/ResolveHints.scala | 8 +++- .../main/scala/org/apache/spark/sql/Dataset.scala | 16 .../org/apache/spark/sql/DataFrameJoinSuite.scala | 18 +- 3 files changed, 40 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala index c4827b8..df688fa 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala @@ -86,7 +86,13 @@ object ResolveHints { def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { case h: Hint if BROADCAST_HINT_NAMES.contains(h.name.toUpperCase(Locale.ROOT)) => -applyBroadcastHint(h.child, h.parameters.toSet) +if (h.parameters.isEmpty) { + // If there is no table alias specified, turn the entire subtree into a BroadcastHint. + BroadcastHint(h.child) +} else { + // Otherwise, find within the subtree query plans that should be broadcasted. + applyBroadcastHint(h.child, h.parameters.toSet) +} } } http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 06dd550..5f602dc 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -1074,6 +1074,22 @@ class Dataset[T] private[sql]( def apply(colName: String): Column = col(colName) /** + * Specifies some hint on the current Dataset. As an example, the following code specifies + * that one of the plan can be broadcasted: + * + * {{{ + * df1.join(df2.hint("broadcast")) + * }}} + * + * @group basic + * @since 2.2.0 + */ + @scala.annotation.varargs + def hint(name: String, parameters: String*): Dataset[T] = withTypedPlan { +Hint(name, parameters, logicalPlan) + } + + /** * Selects column based on the column name and return it as a [[Column]]. * * @note The column name can also reference to a nested column like `a.b`. http://git-wip-us.apache.org/repos/asf/spark/blob/f0e80aa2/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala index 541ffb5..4a52af6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala @@
spark git commit: [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame
Repository: spark Updated Branches: refs/heads/master 27f543b15 -> 527fc5d0c [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame ## What changes were proposed in this pull request? We allow users to specify hints (currently only "broadcast" is supported) in SQL and DataFrame. However, while SQL has a standard hint format (/*+ ... */), DataFrame doesn't have one and sometimes users are confused that they can't find how to apply a broadcast hint. This ticket adds a generic hint function on DataFrame that allows using the same hint on DataFrames as well as SQL. As an example, after this patch, the following will apply a broadcast hint on a DataFrame using the new hint function: ``` df1.join(df2.hint("broadcast")) ``` ## How was this patch tested? Added a test case in DataFrameJoinSuite. Author: Reynold Xin <r...@databricks.com> Closes #17839 from rxin/SPARK-20576. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/527fc5d0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/527fc5d0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/527fc5d0 Branch: refs/heads/master Commit: 527fc5d0c990daaacad4740f62cfe6736609b77b Parents: 27f543b Author: Reynold Xin <r...@databricks.com> Authored: Wed May 3 09:22:25 2017 -0700 Committer: Reynold Xin <r...@databricks.com> Committed: Wed May 3 09:22:25 2017 -0700 -- .../sql/catalyst/analysis/ResolveHints.scala | 8 +++- .../main/scala/org/apache/spark/sql/Dataset.scala | 16 .../org/apache/spark/sql/DataFrameJoinSuite.scala | 18 +- 3 files changed, 40 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala index c4827b8..df688fa 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala @@ -86,7 +86,13 @@ object ResolveHints { def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { case h: Hint if BROADCAST_HINT_NAMES.contains(h.name.toUpperCase(Locale.ROOT)) => -applyBroadcastHint(h.child, h.parameters.toSet) +if (h.parameters.isEmpty) { + // If there is no table alias specified, turn the entire subtree into a BroadcastHint. + BroadcastHint(h.child) +} else { + // Otherwise, find within the subtree query plans that should be broadcasted. + applyBroadcastHint(h.child, h.parameters.toSet) +} } } http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala -- diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala index 147e765..620c8bd 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala @@ -1161,6 +1161,22 @@ class Dataset[T] private[sql]( def apply(colName: String): Column = col(colName) /** + * Specifies some hint on the current Dataset. As an example, the following code specifies + * that one of the plan can be broadcasted: + * + * {{{ + * df1.join(df2.hint("broadcast")) + * }}} + * + * @group basic + * @since 2.2.0 + */ + @scala.annotation.varargs + def hint(name: String, parameters: String*): Dataset[T] = withTypedPlan { +Hint(name, parameters, logicalPlan) + } + + /** * Selects column based on the column name and return it as a [[Column]]. * * @note The column name can also reference to a nested column like `a.b`. http://git-wip-us.apache.org/repos/asf/spark/blob/527fc5d0/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala -- diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala index 541ffb5..4a52af6 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala @@ -151,7 +151,7 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext { Row(1, 1, 1, 1) :: Row(2, 1, 2, 2) :: Nil) }
[GitHub] spark issue #17839: [SPARK-20576][SQL] Support generic hint function in Data...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17839 Merging in master/branch-2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17839: [SPARK-20576][SQL] Support generic hint function in Data...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17839 @felixcheung do you worry about conflicts? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17678: [SPARK-20381][SQL] Add SQL metrics of numOutputRows for ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17678 cc @gatorsmile can you review this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17770 Let's see what other people say before going too far... cc @cloud-fan / @hvanhovell / @marmbrus / @gatorsmile see my proposal: https://github.com/apache/spark/pull/17770#issuecomment-298833348 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17770 What self join case are you talking about? The one that we manually rewrite half of the plan? That one would be a special case anyway, wouldn't it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17770 I'm actually wondering if we should just introduce a variant of transform that takes a stop condition, e.g. ``` def transform(stopCondition: BaseType => Boolean)(rule: PartialFunction[BaseType, BaseType]) ``` and then in analyzer we can use ``` def transform(_.resolved) { case ... } ``` I worry adding this random node everywhere will make the plan look ugly and break certain assumptions we have somewhere. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17839: [SPARK-20576][SQL] Support generic hint function in Data...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17839 Actually somebody should add the Python / R wrapper. cc @felixcheung --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17770: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17770 why don't we always add this to the dataset's logicalPlan? we can change that in one place. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17770: [SPARK-20392][SQL] Set barrier to prevent re-ente...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17770#discussion_r114478015 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -1134,7 +1138,7 @@ class Dataset[T] private[sql]( */ @scala.annotation.varargs def select(cols: Column*): DataFrame = withPlan { -Project(cols.map(_.named), logicalPlan) +Project(cols.map(_.named), AnalysisBarrier(logicalPlan)) --- End diff -- does this work if we turn off eager analysis? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17839: [SPARK-20576][SQL] Support generic hint function ...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/17839 [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame ## What changes were proposed in this pull request? We allow users to specify hints (currently only "broadcast" is supported) in SQL and DataFrame. However, while SQL has a standard hint format (/*+ ... */), DataFrame doesn't have one and sometimes users are confused that they can't find how to apply a broadcast hint. This ticket adds a generic hint function on DataFrame that allows using the same hint on DataFrames as well as SQL. As an example, after this patch, the following will apply a broadcast hint on a DataFrame using the new hint function: ``` df1.join(df2.hint("broadcast")) ``` ## How was this patch tested? Added a test case in DataFrameJoinSuite. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-20576 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17839.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17839 commit 921eb21d0bbb65715617a019ce13b53e2c868121 Author: Reynold Xin <r...@databricks.com> Date: 2017-05-03T06:02:51Z [SPARK-20576][SQL] Support generic hint function in Dataset/DataFrame --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17806: [SPARK-20487][SQL] Display `serde` for `HiveTableScan` n...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17806 @gatorsmile i will let you merge ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17806: [SPARK-20487][SQL] Display `serde` for `HiveTableScan` n...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17806 Maybe get rid of the Some? If it is not defined, we probably just shouldn't show anything. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17780: [SPARK-20487][SQL] `HiveTableScan` node is quite verbose...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17780 Can we at least include the serde? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-20474] Fixing OnHeapColumnVector reallocation
Repository: spark Updated Branches: refs/heads/branch-2.2 6709bcf6e -> e278876ba [SPARK-20474] Fixing OnHeapColumnVector reallocation ## What changes were proposed in this pull request? OnHeapColumnVector reallocation copies to the new storage data up to 'elementsAppended'. This variable is only updated when using the ColumnVector.appendX API, while ColumnVector.putX is more commonly used. ## How was this patch tested? Tested using existing unit tests. Author: Michal SzafranskiCloses #17773 from michal-databricks/spark-20474. (cherry picked from commit a277ae80a2836e6533b338d2b9c4e59ed8a1daae) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e278876b Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e278876b Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e278876b Branch: refs/heads/branch-2.2 Commit: e278876ba3d66d3fb249df59c3de8d78ca25c5f0 Parents: 6709bcf Author: Michal Szafranski Authored: Wed Apr 26 12:47:37 2017 -0700 Committer: Reynold Xin Committed: Wed Apr 26 12:47:50 2017 -0700 -- .../vectorized/OnHeapColumnVector.java | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e278876b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java index 9b410ba..94ed322 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java @@ -410,53 +410,53 @@ public final class OnHeapColumnVector extends ColumnVector { int[] newLengths = new int[newCapacity]; int[] newOffsets = new int[newCapacity]; if (this.arrayLengths != null) { -System.arraycopy(this.arrayLengths, 0, newLengths, 0, elementsAppended); -System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, elementsAppended); +System.arraycopy(this.arrayLengths, 0, newLengths, 0, capacity); +System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, capacity); } arrayLengths = newLengths; arrayOffsets = newOffsets; } else if (type instanceof BooleanType) { if (byteData == null || byteData.length < newCapacity) { byte[] newData = new byte[newCapacity]; -if (byteData != null) System.arraycopy(byteData, 0, newData, 0, elementsAppended); +if (byteData != null) System.arraycopy(byteData, 0, newData, 0, capacity); byteData = newData; } } else if (type instanceof ByteType) { if (byteData == null || byteData.length < newCapacity) { byte[] newData = new byte[newCapacity]; -if (byteData != null) System.arraycopy(byteData, 0, newData, 0, elementsAppended); +if (byteData != null) System.arraycopy(byteData, 0, newData, 0, capacity); byteData = newData; } } else if (type instanceof ShortType) { if (shortData == null || shortData.length < newCapacity) { short[] newData = new short[newCapacity]; -if (shortData != null) System.arraycopy(shortData, 0, newData, 0, elementsAppended); +if (shortData != null) System.arraycopy(shortData, 0, newData, 0, capacity); shortData = newData; } } else if (type instanceof IntegerType || type instanceof DateType || DecimalType.is32BitDecimalType(type)) { if (intData == null || intData.length < newCapacity) { int[] newData = new int[newCapacity]; -if (intData != null) System.arraycopy(intData, 0, newData, 0, elementsAppended); +if (intData != null) System.arraycopy(intData, 0, newData, 0, capacity); intData = newData; } } else if (type instanceof LongType || type instanceof TimestampType || DecimalType.is64BitDecimalType(type)) { if (longData == null || longData.length < newCapacity) { long[] newData = new long[newCapacity]; -if (longData != null) System.arraycopy(longData, 0, newData, 0, elementsAppended); +if (longData != null) System.arraycopy(longData, 0, newData, 0, capacity); longData = newData; } } else if (type instanceof FloatType) { if (floatData == null || floatData.length < newCapacity) { float[] newData = new float[newCapacity]; -if (floatData != null) System.arraycopy(floatData, 0,
spark git commit: [SPARK-20474] Fixing OnHeapColumnVector reallocation
Repository: spark Updated Branches: refs/heads/master 99c6cf9ef -> a277ae80a [SPARK-20474] Fixing OnHeapColumnVector reallocation ## What changes were proposed in this pull request? OnHeapColumnVector reallocation copies to the new storage data up to 'elementsAppended'. This variable is only updated when using the ColumnVector.appendX API, while ColumnVector.putX is more commonly used. ## How was this patch tested? Tested using existing unit tests. Author: Michal SzafranskiCloses #17773 from michal-databricks/spark-20474. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a277ae80 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a277ae80 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a277ae80 Branch: refs/heads/master Commit: a277ae80a2836e6533b338d2b9c4e59ed8a1daae Parents: 99c6cf9 Author: Michal Szafranski Authored: Wed Apr 26 12:47:37 2017 -0700 Committer: Reynold Xin Committed: Wed Apr 26 12:47:37 2017 -0700 -- .../vectorized/OnHeapColumnVector.java | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/a277ae80/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java index 9b410ba..94ed322 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java @@ -410,53 +410,53 @@ public final class OnHeapColumnVector extends ColumnVector { int[] newLengths = new int[newCapacity]; int[] newOffsets = new int[newCapacity]; if (this.arrayLengths != null) { -System.arraycopy(this.arrayLengths, 0, newLengths, 0, elementsAppended); -System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, elementsAppended); +System.arraycopy(this.arrayLengths, 0, newLengths, 0, capacity); +System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, capacity); } arrayLengths = newLengths; arrayOffsets = newOffsets; } else if (type instanceof BooleanType) { if (byteData == null || byteData.length < newCapacity) { byte[] newData = new byte[newCapacity]; -if (byteData != null) System.arraycopy(byteData, 0, newData, 0, elementsAppended); +if (byteData != null) System.arraycopy(byteData, 0, newData, 0, capacity); byteData = newData; } } else if (type instanceof ByteType) { if (byteData == null || byteData.length < newCapacity) { byte[] newData = new byte[newCapacity]; -if (byteData != null) System.arraycopy(byteData, 0, newData, 0, elementsAppended); +if (byteData != null) System.arraycopy(byteData, 0, newData, 0, capacity); byteData = newData; } } else if (type instanceof ShortType) { if (shortData == null || shortData.length < newCapacity) { short[] newData = new short[newCapacity]; -if (shortData != null) System.arraycopy(shortData, 0, newData, 0, elementsAppended); +if (shortData != null) System.arraycopy(shortData, 0, newData, 0, capacity); shortData = newData; } } else if (type instanceof IntegerType || type instanceof DateType || DecimalType.is32BitDecimalType(type)) { if (intData == null || intData.length < newCapacity) { int[] newData = new int[newCapacity]; -if (intData != null) System.arraycopy(intData, 0, newData, 0, elementsAppended); +if (intData != null) System.arraycopy(intData, 0, newData, 0, capacity); intData = newData; } } else if (type instanceof LongType || type instanceof TimestampType || DecimalType.is64BitDecimalType(type)) { if (longData == null || longData.length < newCapacity) { long[] newData = new long[newCapacity]; -if (longData != null) System.arraycopy(longData, 0, newData, 0, elementsAppended); +if (longData != null) System.arraycopy(longData, 0, newData, 0, capacity); longData = newData; } } else if (type instanceof FloatType) { if (floatData == null || floatData.length < newCapacity) { float[] newData = new float[newCapacity]; -if (floatData != null) System.arraycopy(floatData, 0, newData, 0, elementsAppended); +if (floatData != null) System.arraycopy(floatData, 0, newData, 0, capacity);
[GitHub] spark issue #17773: [SPARK-20474] Fixing OnHeapColumnVector reallocation
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17773 Merging in master/branch-2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-20473] Enabling missing types in ColumnVector.Array
Repository: spark Updated Branches: refs/heads/branch-2.2 b65858bb3 -> 6709bcf6e [SPARK-20473] Enabling missing types in ColumnVector.Array ## What changes were proposed in this pull request? ColumnVector implementations originally did not support some Catalyst types (float, short, and boolean). Now that they do, those types should be also added to the ColumnVector.Array. ## How was this patch tested? Tested using existing unit tests. Author: Michal SzafranskiCloses #17772 from michal-databricks/spark-20473. (cherry picked from commit 99c6cf9ef16bf8fae6edb23a62e46546a16bca80) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6709bcf6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6709bcf6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6709bcf6 Branch: refs/heads/branch-2.2 Commit: 6709bcf6e66e99e17ba2a3b1482df2dba1a15716 Parents: b65858b Author: Michal Szafranski Authored: Wed Apr 26 11:21:25 2017 -0700 Committer: Reynold Xin Committed: Wed Apr 26 11:21:57 2017 -0700 -- .../apache/spark/sql/execution/vectorized/ColumnVector.java| 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6709bcf6/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java index 354c878..b105e60 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java @@ -180,7 +180,7 @@ public abstract class ColumnVector implements AutoCloseable { @Override public boolean getBoolean(int ordinal) { - throw new UnsupportedOperationException(); + return data.getBoolean(offset + ordinal); } @Override @@ -188,7 +188,7 @@ public abstract class ColumnVector implements AutoCloseable { @Override public short getShort(int ordinal) { - throw new UnsupportedOperationException(); + return data.getShort(offset + ordinal); } @Override @@ -199,7 +199,7 @@ public abstract class ColumnVector implements AutoCloseable { @Override public float getFloat(int ordinal) { - throw new UnsupportedOperationException(); + return data.getFloat(offset + ordinal); } @Override - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-20473] Enabling missing types in ColumnVector.Array
Repository: spark Updated Branches: refs/heads/master 66dd5b83f -> 99c6cf9ef [SPARK-20473] Enabling missing types in ColumnVector.Array ## What changes were proposed in this pull request? ColumnVector implementations originally did not support some Catalyst types (float, short, and boolean). Now that they do, those types should be also added to the ColumnVector.Array. ## How was this patch tested? Tested using existing unit tests. Author: Michal SzafranskiCloses #17772 from michal-databricks/spark-20473. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/99c6cf9e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/99c6cf9e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/99c6cf9e Branch: refs/heads/master Commit: 99c6cf9ef16bf8fae6edb23a62e46546a16bca80 Parents: 66dd5b8 Author: Michal Szafranski Authored: Wed Apr 26 11:21:25 2017 -0700 Committer: Reynold Xin Committed: Wed Apr 26 11:21:25 2017 -0700 -- .../apache/spark/sql/execution/vectorized/ColumnVector.java| 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/99c6cf9e/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java -- diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java index 354c878..b105e60 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java @@ -180,7 +180,7 @@ public abstract class ColumnVector implements AutoCloseable { @Override public boolean getBoolean(int ordinal) { - throw new UnsupportedOperationException(); + return data.getBoolean(offset + ordinal); } @Override @@ -188,7 +188,7 @@ public abstract class ColumnVector implements AutoCloseable { @Override public short getShort(int ordinal) { - throw new UnsupportedOperationException(); + return data.getShort(offset + ordinal); } @Override @@ -199,7 +199,7 @@ public abstract class ColumnVector implements AutoCloseable { @Override public float getFloat(int ordinal) { - throw new UnsupportedOperationException(); + return data.getFloat(offset + ordinal); } @Override - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] spark issue #17772: [SPARK-20473] Enabling missing types in ColumnVector.Arr...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17772 Merging in master / branch-2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17770: [SPARK-20392][SQL][WIP] Set barrier to prevent re-enteri...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17770 Can we fix the description? It is really confusing since it uses the word exchange. Also can we just skip a plan if it is resolved in transform? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17727: [SQL][MINOR] Remove misleading comment (and tags do bett...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17727 Hm I don't think the comment makes sense ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT
Repository: spark Updated Branches: refs/heads/master 5280d93e6 -> f44c8a843 [SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT This patch bumps the master branch version to `2.3.0-SNAPSHOT`. Author: Josh RosenCloses #17753 from JoshRosen/SPARK-20453. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f44c8a84 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f44c8a84 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f44c8a84 Branch: refs/heads/master Commit: f44c8a843ca512b319f099477415bc13eca2e373 Parents: 5280d93 Author: Josh Rosen Authored: Mon Apr 24 21:48:04 2017 -0700 Committer: Reynold Xin Committed: Mon Apr 24 21:48:04 2017 -0700 -- assembly/pom.xml | 2 +- common/network-common/pom.xml | 2 +- common/network-shuffle/pom.xml| 2 +- common/network-yarn/pom.xml | 2 +- common/sketch/pom.xml | 2 +- common/tags/pom.xml | 2 +- common/unsafe/pom.xml | 2 +- core/pom.xml | 2 +- docs/_config.yml | 4 ++-- examples/pom.xml | 2 +- external/docker-integration-tests/pom.xml | 2 +- external/flume-assembly/pom.xml | 2 +- external/flume-sink/pom.xml | 2 +- external/flume/pom.xml| 2 +- external/kafka-0-10-assembly/pom.xml | 2 +- external/kafka-0-10-sql/pom.xml | 2 +- external/kafka-0-10/pom.xml | 2 +- external/kafka-0-8-assembly/pom.xml | 2 +- external/kafka-0-8/pom.xml| 2 +- external/kinesis-asl-assembly/pom.xml | 2 +- external/kinesis-asl/pom.xml | 2 +- external/spark-ganglia-lgpl/pom.xml | 2 +- graphx/pom.xml| 2 +- launcher/pom.xml | 2 +- mllib-local/pom.xml | 2 +- mllib/pom.xml | 2 +- pom.xml | 2 +- project/MimaExcludes.scala| 5 + repl/pom.xml | 2 +- resource-managers/mesos/pom.xml | 2 +- resource-managers/yarn/pom.xml| 2 +- sql/catalyst/pom.xml | 2 +- sql/core/pom.xml | 2 +- sql/hive-thriftserver/pom.xml | 2 +- sql/hive/pom.xml | 2 +- streaming/pom.xml | 2 +- tools/pom.xml | 2 +- 37 files changed, 42 insertions(+), 37 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f44c8a84/assembly/pom.xml -- diff --git a/assembly/pom.xml b/assembly/pom.xml index 9d8607d..742a4a1 100644 --- a/assembly/pom.xml +++ b/assembly/pom.xml @@ -21,7 +21,7 @@ org.apache.spark spark-parent_2.11 -2.2.0-SNAPSHOT +2.3.0-SNAPSHOT ../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/f44c8a84/common/network-common/pom.xml -- diff --git a/common/network-common/pom.xml b/common/network-common/pom.xml index 8657af7..066970f 100644 --- a/common/network-common/pom.xml +++ b/common/network-common/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.2.0-SNAPSHOT +2.3.0-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/f44c8a84/common/network-shuffle/pom.xml -- diff --git a/common/network-shuffle/pom.xml b/common/network-shuffle/pom.xml index 24c10fb..2de882a 100644 --- a/common/network-shuffle/pom.xml +++ b/common/network-shuffle/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.2.0-SNAPSHOT +2.3.0-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/f44c8a84/common/network-yarn/pom.xml -- diff --git a/common/network-yarn/pom.xml b/common/network-yarn/pom.xml index 5e5a80b..a8488d8 100644 --- a/common/network-yarn/pom.xml +++ b/common/network-yarn/pom.xml @@ -22,7 +22,7 @@ org.apache.spark spark-parent_2.11 -2.2.0-SNAPSHOT +2.3.0-SNAPSHOT ../../pom.xml http://git-wip-us.apache.org/repos/asf/spark/blob/f44c8a84/common/sketch/pom.xml -- diff --git a/common/sketch/pom.xml b/common/sketch/pom.xml index 1356c47..6b81fc2 100644 --- a/common/sketch/pom.xml +++ b/common/sketch/pom.xml @@ -22,7 +22,7 @@
[GitHub] spark issue #17753: [SPARK-20453] Bump master branch version to 2.3.0-SNAPSH...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17753 Merging in master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/14731 Steve I think the main point is you should also respect the time of reviewers. The way most of your pull requests manifest have been suboptimal: they often start with a very early WIP (which is not necessarily a problem), and once in a while (e.g. a month or two) you update it to almost completely change it. The time itself is a problem. It requires a lot of context switching to review your pull requests. In addition, every time you update it it looks like a complete new giant pull request. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17648: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17648 sgtm --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17736: [SPARK-20399][SQL] Can't use same regex pattern between ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17736 cc @hvanhovell for review ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17712 Why use a map? That's super unstructured and easy to break ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17712 cc @gatorsmile This is related to the deterministic thing you want to do? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17717: [SPARK-20430][SQL] Initialise RangeExec parameters in a ...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17717 LGTM pending Jenkins. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17717: [SPARK-20430][SQL] Initialise RangeExec parameter...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17717#discussion_r112803232 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -1732,4 +1732,10 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { .filter($"x1".isNotNull || !$"y".isin("a!")) .count } + + test("SPARK-20430 Initialize Range parameters in a deriver side") { --- End diff -- driver --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17717: [SPARK-20430][SQL] Initialise RangeExec parameter...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17717#discussion_r112803234 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala --- @@ -1732,4 +1732,10 @@ class DataFrameSuite extends QueryTest with SharedSQLContext { .filter($"x1".isNotNull || !$"y".isin("a!")) .count } + + test("SPARK-20430 Initialize Range parameters in a deriver side") { --- End diff -- also move this into dataframe range suite? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17712#discussion_r112803097 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala --- @@ -45,14 +45,33 @@ import org.apache.spark.sql.types.DataType case class UserDefinedFunction protected[sql] ( f: AnyRef, dataType: DataType, -inputTypes: Option[Seq[DataType]]) { +inputTypes: Option[Seq[DataType]], +name: Option[String]) { + + // Optionally used for printing an UDF name in EXPLAIN + def withName(name: String): UserDefinedFunction = { +UserDefinedFunction(f, dataType, inputTypes, Option(name)) + } /** * Returns an expression that invokes the UDF, using the given arguments. * * @since 1.3.0 */ def apply(exprs: Column*): Column = { -Column(ScalaUDF(f, dataType, exprs.map(_.expr), inputTypes.getOrElse(Nil))) +Column(ScalaUDF(f, dataType, exprs.map(_.expr), inputTypes.getOrElse(Nil), name)) + } +} + +object UserDefinedFunction { --- End diff -- ah ok - that sucks. that means this will break compatibility ... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17712#discussion_r112800640 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala --- @@ -45,14 +45,33 @@ import org.apache.spark.sql.types.DataType case class UserDefinedFunction protected[sql] ( f: AnyRef, dataType: DataType, -inputTypes: Option[Seq[DataType]]) { +inputTypes: Option[Seq[DataType]], +name: Option[String]) { + + // Optionally used for printing an UDF name in EXPLAIN + def withName(name: String): UserDefinedFunction = { +UserDefinedFunction(f, dataType, inputTypes, Option(name)) + } /** * Returns an expression that invokes the UDF, using the given arguments. * * @since 1.3.0 */ def apply(exprs: Column*): Column = { -Column(ScalaUDF(f, dataType, exprs.map(_.expr), inputTypes.getOrElse(Nil))) +Column(ScalaUDF(f, dataType, exprs.map(_.expr), inputTypes.getOrElse(Nil), name)) + } +} + +object UserDefinedFunction { --- End diff -- also need an unapply function --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17648: [SPARK-19851] Add support for EVERY and ANY (SOME) aggre...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17648 I was saying rather than implementing them, just rewrite them into an aggregate on the conditions and compare them against the value. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17712#discussion_r112754224 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala --- @@ -47,12 +47,20 @@ case class UserDefinedFunction protected[sql] ( dataType: DataType, inputTypes: Option[Seq[DataType]]) { + // Optionally used for printing UDF names in EXPLAIN + private var nameOption: Option[String] = None --- End diff -- it will be fine if we add an explicit apply method and unapply method. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
spark git commit: [SPARK-20420][SQL] Add events to the external catalog
Repository: spark Updated Branches: refs/heads/master 48d760d02 -> e2b3d2367 [SPARK-20420][SQL] Add events to the external catalog ## What changes were proposed in this pull request? It is often useful to be able to track changes to the `ExternalCatalog`. This PR makes the `ExternalCatalog` emit events when a catalog object is changed. Events are fired before and after the change. The following events are fired per object: - Database - CreateDatabasePreEvent: event fired before the database is created. - CreateDatabaseEvent: event fired after the database has been created. - DropDatabasePreEvent: event fired before the database is dropped. - DropDatabaseEvent: event fired after the database has been dropped. - Table - CreateTablePreEvent: event fired before the table is created. - CreateTableEvent: event fired after the table has been created. - RenameTablePreEvent: event fired before the table is renamed. - RenameTableEvent: event fired after the table has been renamed. - DropTablePreEvent: event fired before the table is dropped. - DropTableEvent: event fired after the table has been dropped. - Function - CreateFunctionPreEvent: event fired before the function is created. - CreateFunctionEvent: event fired after the function has been created. - RenameFunctionPreEvent: event fired before the function is renamed. - RenameFunctionEvent: event fired after the function has been renamed. - DropFunctionPreEvent: event fired before the function is dropped. - DropFunctionPreEvent: event fired after the function has been dropped. The current events currently only contain the names of the object modified. We add more events, and more details at a later point. A user can monitor changes to the external catalog by adding a listener to the Spark listener bus checking for `ExternalCatalogEvent`s using the `SparkListener.onOtherEvent` hook. A more direct approach is add listener directly to the `ExternalCatalog`. ## How was this patch tested? Added the `ExternalCatalogEventSuite`. Author: Herman van HovellCloses #17710 from hvanhovell/SPARK-20420. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e2b3d236 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e2b3d236 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e2b3d236 Branch: refs/heads/master Commit: e2b3d2367a563d4600d8d87b5317e71135c362f0 Parents: 48d760d Author: Herman van Hovell Authored: Fri Apr 21 00:05:03 2017 -0700 Committer: Reynold Xin Committed: Fri Apr 21 00:05:03 2017 -0700 -- .../sql/catalyst/catalog/ExternalCatalog.scala | 85 - .../sql/catalyst/catalog/InMemoryCatalog.scala | 22 ++- .../spark/sql/catalyst/catalog/events.scala | 158 .../catalog/ExternalCatalogEventSuite.scala | 188 +++ .../apache/spark/sql/internal/SharedState.scala | 7 + .../spark/sql/hive/HiveExternalCatalog.scala| 22 ++- 6 files changed, 457 insertions(+), 25 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e2b3d236/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala index 08a01e8..974ef90 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala @@ -20,6 +20,7 @@ package org.apache.spark.sql.catalyst.catalog import org.apache.spark.sql.catalyst.analysis.{FunctionAlreadyExistsException, NoSuchDatabaseException, NoSuchFunctionException, NoSuchTableException} import org.apache.spark.sql.catalyst.expressions.Expression import org.apache.spark.sql.types.StructType +import org.apache.spark.util.ListenerBus /** * Interface for the system catalog (of functions, partitions, tables, and databases). @@ -30,7 +31,8 @@ import org.apache.spark.sql.types.StructType * * Implementations should throw [[NoSuchDatabaseException]] when databases don't exist. */ -abstract class ExternalCatalog { +abstract class ExternalCatalog + extends ListenerBus[ExternalCatalogEventListener, ExternalCatalogEvent] { import CatalogTypes.TablePartitionSpec protected def requireDbExists(db: String): Unit = { @@ -61,9 +63,22 @@ abstract class ExternalCatalog { // Databases // -- - def createDatabase(dbDefinition: CatalogDatabase, ignoreIfExists: Boolean):
spark git commit: [SPARK-20420][SQL] Add events to the external catalog
Repository: spark Updated Branches: refs/heads/branch-2.2 6cd2f16b1 -> cddb4b7db [SPARK-20420][SQL] Add events to the external catalog ## What changes were proposed in this pull request? It is often useful to be able to track changes to the `ExternalCatalog`. This PR makes the `ExternalCatalog` emit events when a catalog object is changed. Events are fired before and after the change. The following events are fired per object: - Database - CreateDatabasePreEvent: event fired before the database is created. - CreateDatabaseEvent: event fired after the database has been created. - DropDatabasePreEvent: event fired before the database is dropped. - DropDatabaseEvent: event fired after the database has been dropped. - Table - CreateTablePreEvent: event fired before the table is created. - CreateTableEvent: event fired after the table has been created. - RenameTablePreEvent: event fired before the table is renamed. - RenameTableEvent: event fired after the table has been renamed. - DropTablePreEvent: event fired before the table is dropped. - DropTableEvent: event fired after the table has been dropped. - Function - CreateFunctionPreEvent: event fired before the function is created. - CreateFunctionEvent: event fired after the function has been created. - RenameFunctionPreEvent: event fired before the function is renamed. - RenameFunctionEvent: event fired after the function has been renamed. - DropFunctionPreEvent: event fired before the function is dropped. - DropFunctionPreEvent: event fired after the function has been dropped. The current events currently only contain the names of the object modified. We add more events, and more details at a later point. A user can monitor changes to the external catalog by adding a listener to the Spark listener bus checking for `ExternalCatalogEvent`s using the `SparkListener.onOtherEvent` hook. A more direct approach is add listener directly to the `ExternalCatalog`. ## How was this patch tested? Added the `ExternalCatalogEventSuite`. Author: Herman van HovellCloses #17710 from hvanhovell/SPARK-20420. (cherry picked from commit e2b3d2367a563d4600d8d87b5317e71135c362f0) Signed-off-by: Reynold Xin Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cddb4b7d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cddb4b7d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cddb4b7d Branch: refs/heads/branch-2.2 Commit: cddb4b7db81b01b4abf2ab683aba97e4eabb9769 Parents: 6cd2f16 Author: Herman van Hovell Authored: Fri Apr 21 00:05:03 2017 -0700 Committer: Reynold Xin Committed: Fri Apr 21 00:05:10 2017 -0700 -- .../sql/catalyst/catalog/ExternalCatalog.scala | 85 - .../sql/catalyst/catalog/InMemoryCatalog.scala | 22 ++- .../spark/sql/catalyst/catalog/events.scala | 158 .../catalog/ExternalCatalogEventSuite.scala | 188 +++ .../apache/spark/sql/internal/SharedState.scala | 7 + .../spark/sql/hive/HiveExternalCatalog.scala| 22 ++- 6 files changed, 457 insertions(+), 25 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cddb4b7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala -- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala index 08a01e8..974ef90 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala @@ -20,6 +20,7 @@ package org.apache.spark.sql.catalyst.catalog import org.apache.spark.sql.catalyst.analysis.{FunctionAlreadyExistsException, NoSuchDatabaseException, NoSuchFunctionException, NoSuchTableException} import org.apache.spark.sql.catalyst.expressions.Expression import org.apache.spark.sql.types.StructType +import org.apache.spark.util.ListenerBus /** * Interface for the system catalog (of functions, partitions, tables, and databases). @@ -30,7 +31,8 @@ import org.apache.spark.sql.types.StructType * * Implementations should throw [[NoSuchDatabaseException]] when databases don't exist. */ -abstract class ExternalCatalog { +abstract class ExternalCatalog + extends ListenerBus[ExternalCatalogEventListener, ExternalCatalogEvent] { import CatalogTypes.TablePartitionSpec protected def requireDbExists(db: String): Unit = { @@ -61,9 +63,22 @@ abstract class ExternalCatalog { // Databases //
[GitHub] spark issue #17710: [SPARK-20420][SQL] Add events to the external catalog
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17710 Merging in master/branch-2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17712: [SPARK-20416][SQL] Print UDF names in EXPLAIN
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17712#discussion_r112622098 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala --- @@ -47,12 +47,20 @@ case class UserDefinedFunction protected[sql] ( dataType: DataType, inputTypes: Option[Seq[DataType]]) { + // Optionally used for printing UDF names in EXPLAIN + private var nameOption: Option[String] = None --- End diff -- can we create a new instance instead so this is immutable? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17711: [SPARK-19951][SQL] Add string concatenate operator || to...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17711 can you add a test case in sql query file tests? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17711: [SPARK-19951][SQL] Add string concatenate operato...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17711#discussion_r112590613 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala --- @@ -1483,4 +1483,12 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder { query: LogicalPlan): LogicalPlan = { RepartitionByExpression(expressions, query, conf.numShufflePartitions) } + + /** + * Create a [[Concat]] expression for pipeline concatenation. + */ + override def visitConcat(ctx: ConcatContext): Expression = { +val exprs = ctx.primaryExpression().asScala +Concat(expression(exprs.head) +: exprs.drop(1).map(expression)) --- End diff -- isn't this just `expression(exprs)`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17705: [SPARK-20410][SQL] Make sparkConf a def in SharedSQLCont...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/17705 LGTM --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17699: [SPARK-20405][SQL] Dataset.withNewExecutionId sho...
GitHub user rxin opened a pull request: https://github.com/apache/spark/pull/17699 [SPARK-20405][SQL] Dataset.withNewExecutionId should be private ## What changes were proposed in this pull request? Dataset.withNewExecutionId is only used in Dataset itself and should be private. ## How was this patch tested? N/A - this is a simple visibility change. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rxin/spark SPARK-20405 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17699.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17699 commit 0c24ad2b38a9369399daf5a79b137c8b5495d2ca Author: Reynold Xin <r...@databricks.com> Date: 2017-04-20T07:37:07Z [SPARK-20405][SQL] Dataset.withNewExecutionId should be private --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17698: [SPARK-20403][SQL][Documentation]Modify the instr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/17698#discussion_r112383091 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala --- @@ -1036,3 +1036,8 @@ case class UpCast(child: Expression, dataType: DataType, walkedTypePath: Seq[Str extends UnaryExpression with Unevaluable { override lazy val resolved = false } + +@ExpressionDescription( + usage = "_FUNC_(expr) - Casts the value `expr` to the target data type `_FUNC_`.") +class CastAlias(child: Expression, dataType: DataType, timeZoneId: Option[String] = None) --- End diff -- will this work with our pattern matches in the query optimizer? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r112382152 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ArrowConvertersSuite.scala --- @@ -0,0 +1,568 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql + +import java.io.File +import java.nio.charset.StandardCharsets +import java.sql.{Date, Timestamp} +import java.text.SimpleDateFormat +import java.util.Locale + +import com.google.common.io.Files +import org.apache.arrow.vector.{VectorLoader, VectorSchemaRoot} +import org.apache.arrow.vector.file.json.JsonFileReader +import org.apache.arrow.vector.util.Validator +import org.json4s.jackson.JsonMethods._ +import org.json4s.JsonAST._ +import org.json4s.JsonDSL._ +import org.scalatest.BeforeAndAfterAll + +import org.apache.spark.SparkException +import org.apache.spark.sql.test.SharedSQLContext +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.Utils + + +class ArrowConvertersSuite extends SharedSQLContext with BeforeAndAfterAll { + import testImplicits._ + + private var tempDataPath: String = _ + + private def collectAsArrow(df: DataFrame, + converter: Option[ArrowConverters] = None): ArrowPayload = { +val cnvtr = converter.getOrElse(new ArrowConverters) +val payloadByteArrays = df.toArrowPayloadBytes().collect() +cnvtr.readPayloadByteArrays(payloadByteArrays) + } + + override def beforeAll(): Unit = { +super.beforeAll() +tempDataPath = Utils.createTempDir(namePrefix = "arrow").getAbsolutePath + } + + test("collect to arrow record batch") { +val indexData = (1 to 6).toDF("i") +val arrowPayload = collectAsArrow(indexData) +assert(arrowPayload.nonEmpty) +val arrowBatches = arrowPayload.toArray +assert(arrowBatches.length == indexData.rdd.getNumPartitions) +val rowCount = arrowBatches.map(batch => batch.getLength).sum +assert(rowCount === indexData.count()) +arrowBatches.foreach(batch => assert(batch.getNodes.size() > 0)) +arrowBatches.foreach(batch => batch.close()) + } + + test("numeric type conversion") { +collectAndValidate(indexData) +collectAndValidate(shortData) +collectAndValidate(intData) +collectAndValidate(longData) +collectAndValidate(floatData) +collectAndValidate(doubleData) + } + + test("mixed numeric type conversion") { +collectAndValidate(mixedNumericData) + } + + test("boolean type conversion") { +collectAndValidate(boolData) + } + + test("string type conversion") { +collectAndValidate(stringData) + } + + test("byte type conversion") { +collectAndValidate(byteData) + } + + ignore("timestamp conversion") { +collectAndValidate(timestampData) + } + + // TODO: Not currently supported in Arrow JSON reader + ignore("date conversion") { +// collectAndValidate(dateTimeData) + } + + // TODO: Not currently supported in Arrow JSON reader + ignore("binary type conversion") { +// collectAndValidate(binaryData) + } + + test("floating-point NaN") { +collectAndValidate(floatNaNData) + } + + test("partitioned DataFrame") { +val converter = new ArrowConverters +val schema = testData2.schema +val arrowPayload = collectAsArrow(testData2, Some(converter)) +val arrowBatches = arrowPayload.toArray +// NOTE: testData2 should have 2 partitions -> 2 arrow batches in payload +assert(arrowBatches.length === 2) +val pl1 = new ArrowStaticPayload(arrowBatches(0)) +val pl2 = new ArrowStaticPayload
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r112381608 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ArrowConvertersSuite.scala --- @@ -0,0 +1,568 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql + +import java.io.File +import java.nio.charset.StandardCharsets +import java.sql.{Date, Timestamp} +import java.text.SimpleDateFormat +import java.util.Locale + +import com.google.common.io.Files +import org.apache.arrow.vector.{VectorLoader, VectorSchemaRoot} +import org.apache.arrow.vector.file.json.JsonFileReader +import org.apache.arrow.vector.util.Validator +import org.json4s.jackson.JsonMethods._ +import org.json4s.JsonAST._ +import org.json4s.JsonDSL._ +import org.scalatest.BeforeAndAfterAll + +import org.apache.spark.SparkException +import org.apache.spark.sql.test.SharedSQLContext +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.Utils + + +class ArrowConvertersSuite extends SharedSQLContext with BeforeAndAfterAll { + import testImplicits._ + + private var tempDataPath: String = _ + + private def collectAsArrow(df: DataFrame, + converter: Option[ArrowConverters] = None): ArrowPayload = { +val cnvtr = converter.getOrElse(new ArrowConverters) +val payloadByteArrays = df.toArrowPayloadBytes().collect() +cnvtr.readPayloadByteArrays(payloadByteArrays) + } + + override def beforeAll(): Unit = { +super.beforeAll() +tempDataPath = Utils.createTempDir(namePrefix = "arrow").getAbsolutePath + } + + test("collect to arrow record batch") { +val indexData = (1 to 6).toDF("i") +val arrowPayload = collectAsArrow(indexData) +assert(arrowPayload.nonEmpty) +val arrowBatches = arrowPayload.toArray +assert(arrowBatches.length == indexData.rdd.getNumPartitions) +val rowCount = arrowBatches.map(batch => batch.getLength).sum +assert(rowCount === indexData.count()) +arrowBatches.foreach(batch => assert(batch.getNodes.size() > 0)) +arrowBatches.foreach(batch => batch.close()) + } + + test("numeric type conversion") { +collectAndValidate(indexData) --- End diff -- separate these into different test cases, and please inline the data directly in each test case. It's pretty annoying to have to jump around. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r112376143 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala --- @@ -0,0 +1,432 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import java.io.ByteArrayOutputStream +import java.nio.ByteBuffer +import java.nio.channels.{Channels, SeekableByteChannel} + +import scala.collection.JavaConverters._ + +import io.netty.buffer.ArrowBuf +import org.apache.arrow.memory.{BaseAllocator, RootAllocator} +import org.apache.arrow.vector._ +import org.apache.arrow.vector.BaseValueVector.BaseMutator +import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter} +import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch} +import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit} +import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types._ +import org.apache.spark.util.Utils + + +/** + * ArrowReader requires a seekable byte channel. + * TODO: This is available in arrow-vector now with ARROW-615, to be included in 0.2.1 release + */ +private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: Array[Byte]) + extends SeekableByteChannel { + var _position: Long = 0L + + override def isOpen: Boolean = { +byteArray != null + } + + override def close(): Unit = { +byteArray = null + } + + override def read(dst: ByteBuffer): Int = { +val remainingBuf = byteArray.length - _position +val length = Math.min(dst.remaining(), remainingBuf).toInt +dst.put(byteArray, _position.toInt, length) +_position += length +length + } + + override def position(): Long = _position + + override def position(newPosition: Long): SeekableByteChannel = { +_position = newPosition.toLong +this + } + + override def size: Long = { +byteArray.length.toLong + } + + override def write(src: ByteBuffer): Int = { +throw new UnsupportedOperationException("Read Only") + } + + override def truncate(size: Long): SeekableByteChannel = { +throw new UnsupportedOperationException("Read Only") + } +} + +/** + * Intermediate data structure returned from Arrow conversions + */ +private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch] + +/** + * Build a payload from existing ArrowRecordBatches + */ +private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends ArrowPayload { + private val iter = batches.iterator + override def next(): ArrowRecordBatch = iter.next() + override def hasNext: Boolean = iter.hasNext +} + +/** + * Class that wraps an Arrow RootAllocator used in conversion + */ +private[sql] class ArrowConverters { + private val _allocator = new RootAllocator(Long.MaxValue) + + private[sql] def allocator: RootAllocator = _allocator + + /** + * Iterate over the rows and convert to an ArrowPayload, using RootAllocator from this class + */ + def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: StructType): ArrowPayload = { +val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, schema, _allocator) +new ArrowStaticPayload(batch) + } + + /** + * Read an Array of Arrow Record batches as byte Arrays into an ArrowPayload, using + * RootAllocator from this class + */ + def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): ArrowPayload = { +val batches = scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch] +var i = 0 +while (i < payloadByteArrays.length) { + val payloadBytes = p
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r112376037 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala --- @@ -0,0 +1,432 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import java.io.ByteArrayOutputStream +import java.nio.ByteBuffer +import java.nio.channels.{Channels, SeekableByteChannel} + +import scala.collection.JavaConverters._ + +import io.netty.buffer.ArrowBuf +import org.apache.arrow.memory.{BaseAllocator, RootAllocator} +import org.apache.arrow.vector._ +import org.apache.arrow.vector.BaseValueVector.BaseMutator +import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter} +import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch} +import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit} +import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types._ +import org.apache.spark.util.Utils + + +/** + * ArrowReader requires a seekable byte channel. + * TODO: This is available in arrow-vector now with ARROW-615, to be included in 0.2.1 release + */ +private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: Array[Byte]) + extends SeekableByteChannel { + var _position: Long = 0L + + override def isOpen: Boolean = { +byteArray != null + } + + override def close(): Unit = { +byteArray = null + } + + override def read(dst: ByteBuffer): Int = { +val remainingBuf = byteArray.length - _position +val length = Math.min(dst.remaining(), remainingBuf).toInt +dst.put(byteArray, _position.toInt, length) +_position += length +length + } + + override def position(): Long = _position + + override def position(newPosition: Long): SeekableByteChannel = { +_position = newPosition.toLong +this + } + + override def size: Long = { +byteArray.length.toLong + } + + override def write(src: ByteBuffer): Int = { +throw new UnsupportedOperationException("Read Only") + } + + override def truncate(size: Long): SeekableByteChannel = { +throw new UnsupportedOperationException("Read Only") + } +} + +/** + * Intermediate data structure returned from Arrow conversions + */ +private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch] + +/** + * Build a payload from existing ArrowRecordBatches + */ +private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends ArrowPayload { + private val iter = batches.iterator + override def next(): ArrowRecordBatch = iter.next() + override def hasNext: Boolean = iter.hasNext +} + +/** + * Class that wraps an Arrow RootAllocator used in conversion + */ +private[sql] class ArrowConverters { + private val _allocator = new RootAllocator(Long.MaxValue) + + private[sql] def allocator: RootAllocator = _allocator + + /** + * Iterate over the rows and convert to an ArrowPayload, using RootAllocator from this class + */ + def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: StructType): ArrowPayload = { +val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, schema, _allocator) +new ArrowStaticPayload(batch) + } + + /** + * Read an Array of Arrow Record batches as byte Arrays into an ArrowPayload, using + * RootAllocator from this class + */ + def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): ArrowPayload = { +val batches = scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch] +var i = 0 +while (i < payloadByteArrays.length) { + val payloadBytes = p
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r112375921 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala --- @@ -0,0 +1,432 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import java.io.ByteArrayOutputStream +import java.nio.ByteBuffer +import java.nio.channels.{Channels, SeekableByteChannel} + +import scala.collection.JavaConverters._ + +import io.netty.buffer.ArrowBuf +import org.apache.arrow.memory.{BaseAllocator, RootAllocator} +import org.apache.arrow.vector._ +import org.apache.arrow.vector.BaseValueVector.BaseMutator +import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter} +import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch} +import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit} +import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types._ +import org.apache.spark.util.Utils + + +/** + * ArrowReader requires a seekable byte channel. + * TODO: This is available in arrow-vector now with ARROW-615, to be included in 0.2.1 release + */ +private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: Array[Byte]) + extends SeekableByteChannel { + var _position: Long = 0L + + override def isOpen: Boolean = { +byteArray != null + } + + override def close(): Unit = { +byteArray = null + } + + override def read(dst: ByteBuffer): Int = { +val remainingBuf = byteArray.length - _position +val length = Math.min(dst.remaining(), remainingBuf).toInt +dst.put(byteArray, _position.toInt, length) +_position += length +length + } + + override def position(): Long = _position + + override def position(newPosition: Long): SeekableByteChannel = { +_position = newPosition.toLong +this + } + + override def size: Long = { +byteArray.length.toLong + } + + override def write(src: ByteBuffer): Int = { +throw new UnsupportedOperationException("Read Only") + } + + override def truncate(size: Long): SeekableByteChannel = { +throw new UnsupportedOperationException("Read Only") + } +} + +/** + * Intermediate data structure returned from Arrow conversions + */ +private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch] + +/** + * Build a payload from existing ArrowRecordBatches + */ +private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends ArrowPayload { + private val iter = batches.iterator + override def next(): ArrowRecordBatch = iter.next() + override def hasNext: Boolean = iter.hasNext +} + +/** + * Class that wraps an Arrow RootAllocator used in conversion + */ +private[sql] class ArrowConverters { + private val _allocator = new RootAllocator(Long.MaxValue) + + private[sql] def allocator: RootAllocator = _allocator + + /** + * Iterate over the rows and convert to an ArrowPayload, using RootAllocator from this class + */ + def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: StructType): ArrowPayload = { +val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, schema, _allocator) +new ArrowStaticPayload(batch) + } + + /** + * Read an Array of Arrow Record batches as byte Arrays into an ArrowPayload, using + * RootAllocator from this class + */ + def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): ArrowPayload = { +val batches = scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch] +var i = 0 +while (i < payloadByteArrays.length) { + val payloadBytes = p
[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/15821#discussion_r112375496 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/ArrowConverters.scala --- @@ -0,0 +1,432 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the "License"); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +*/ + +package org.apache.spark.sql + +import java.io.ByteArrayOutputStream +import java.nio.ByteBuffer +import java.nio.channels.{Channels, SeekableByteChannel} + +import scala.collection.JavaConverters._ + +import io.netty.buffer.ArrowBuf +import org.apache.arrow.memory.{BaseAllocator, RootAllocator} +import org.apache.arrow.vector._ +import org.apache.arrow.vector.BaseValueVector.BaseMutator +import org.apache.arrow.vector.file.{ArrowReader, ArrowWriter} +import org.apache.arrow.vector.schema.{ArrowFieldNode, ArrowRecordBatch} +import org.apache.arrow.vector.types.{FloatingPointPrecision, TimeUnit} +import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema} + +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.types._ +import org.apache.spark.util.Utils + + +/** + * ArrowReader requires a seekable byte channel. + * TODO: This is available in arrow-vector now with ARROW-615, to be included in 0.2.1 release + */ +private[sql] class ByteArrayReadableSeekableByteChannel(var byteArray: Array[Byte]) + extends SeekableByteChannel { + var _position: Long = 0L + + override def isOpen: Boolean = { +byteArray != null + } + + override def close(): Unit = { +byteArray = null + } + + override def read(dst: ByteBuffer): Int = { +val remainingBuf = byteArray.length - _position +val length = Math.min(dst.remaining(), remainingBuf).toInt +dst.put(byteArray, _position.toInt, length) +_position += length +length + } + + override def position(): Long = _position + + override def position(newPosition: Long): SeekableByteChannel = { +_position = newPosition.toLong +this + } + + override def size: Long = { +byteArray.length.toLong + } + + override def write(src: ByteBuffer): Int = { +throw new UnsupportedOperationException("Read Only") + } + + override def truncate(size: Long): SeekableByteChannel = { +throw new UnsupportedOperationException("Read Only") + } +} + +/** + * Intermediate data structure returned from Arrow conversions + */ +private[sql] abstract class ArrowPayload extends Iterator[ArrowRecordBatch] + +/** + * Build a payload from existing ArrowRecordBatches + */ +private[sql] class ArrowStaticPayload(batches: ArrowRecordBatch*) extends ArrowPayload { + private val iter = batches.iterator + override def next(): ArrowRecordBatch = iter.next() + override def hasNext: Boolean = iter.hasNext +} + +/** + * Class that wraps an Arrow RootAllocator used in conversion + */ +private[sql] class ArrowConverters { + private val _allocator = new RootAllocator(Long.MaxValue) + + private[sql] def allocator: RootAllocator = _allocator + + /** + * Iterate over the rows and convert to an ArrowPayload, using RootAllocator from this class + */ + def interalRowIterToPayload(rowIter: Iterator[InternalRow], schema: StructType): ArrowPayload = { +val batch = ArrowConverters.internalRowIterToArrowBatch(rowIter, schema, _allocator) +new ArrowStaticPayload(batch) + } + + /** + * Read an Array of Arrow Record batches as byte Arrays into an ArrowPayload, using + * RootAllocator from this class + */ + def readPayloadByteArrays(payloadByteArrays: Array[Array[Byte]]): ArrowPayload = { +val batches = scala.collection.mutable.ArrayBuffer.empty[ArrowRecordBatch] +var i = 0 +while (i < payloadByteArrays.length) { + val payloadBytes = p