[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154577760

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

Instead of using `var`, you can use a pattern match.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
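A minimal sketch of what the reviewer's suggestion could look like: instead of reassigning a `var` after the backward-compatibility lookup, the resolved name falls out of a single pattern match. The function name, the `useNewOrc` flag, and the hard-coded class names below are illustrative assumptions, not the PR's actual code.

```scala
// Hypothetical sketch of resolving a provider name with a pattern match
// instead of a reassigned `var`. Names are illustrative, not Spark's code.
object ProviderResolution {
  def resolveProvider(
      backwardCompatibilityMap: Map[String, String],
      useNewOrc: Boolean,
      provider: String): String = {
    backwardCompatibilityMap.getOrElse(provider, provider) match {
      // "orc" dispatches to one of two implementations based on the conf flag.
      case "orc" if useNewOrc =>
        "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat"
      case "orc" =>
        "org.apache.spark.sql.hive.orc.OrcFileFormat"
      // Everything else passes through unchanged.
      case other => other
    }
  }
}
```

The match keeps every binding a `val`, which is what the reviewer was asking for.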
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19869

**[Test build #84419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84419/testReport)** for PR 19869 at commit [`0589e7d`](https://github.com/apache/spark/commit/0589e7d8b57aa71c72dd052f687a4706fb0c5567).
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871

**[Test build #84413 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84413/testReport)** for PR 19871 at commit [`2e498f9`](https://github.com/apache/spark/commit/2e498f9c1a748bdb648705c16305f9f34e638ff8).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871

**[Test build #84411 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84411/testReport)** for PR 19871 at commit [`e7beb02`](https://github.com/apache/spark/commit/e7beb02ef8b1f9761c77952fc69d087c7cc92d3f).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19874

**[Test build #84418 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84418/testReport)** for PR 19874 at commit [`657cf88`](https://github.com/apache/spark/commit/657cf88bbf4259d1a823f93f16eaccc2fbe78667).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait TypePropagation extends Logging`
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84411/ Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84412/ Test FAILed.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84417/ Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Merged build finished. Test FAILed.
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19874 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84418/ Test FAILed.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Merged build finished. Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Merged build finished. Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Merged build finished. Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84413/ Test FAILed.
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19874 Merged build finished. Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871

**[Test build #84412 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84412/testReport)** for PR 19871 at commit [`8b7e88a`](https://github.com/apache/spark/commit/8b7e88a833adf1d3138ce426142b6bb6abe057df).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19873

**[Test build #84417 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84417/testReport)** for PR 19873 at commit [`136fd30`](https://github.com/apache/spark/commit/136fd30cb98609e648a8b689a28853c1bae67bf7).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class AnalysisBarrier(child: LogicalPlan) extends LeafNode`
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19873 retest this please.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19873

**[Test build #84420 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84420/testReport)** for PR 19873 at commit [`136fd30`](https://github.com/apache/spark/commit/136fd30cb98609e648a8b689a28853c1bae67bf7).
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19840

@yaooqinn OK, I see the situation.

In client mode, I think we can't use `spark.yarn.appMasterEnv.XXX`, which is for cluster mode, so we should use the environment variables `PYSPARK_PYTHON` or `PYSPARK_DRIVER_PYTHON`, or the corresponding Spark confs, `spark.pyspark.python` and `spark.pyspark.driver.python`. In cluster mode, we can use `spark.yarn.appMasterEnv.XXX`, and if `spark.yarn.appMasterEnv.PYSPARK_PYTHON` or `spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON` exist, they overwrite the original environment variables.

Btw, `PYSPARK_DRIVER_PYTHON` is only for the driver, not the executors, so we should handle only `PYSPARK_PYTHON` on the executor, and on the driver `PYSPARK_DRIVER_PYTHON` takes priority over `PYSPARK_PYTHON`. Currently we handle only the environment variable but not `spark.executorEnv.PYSPARK_PYTHON` for the executor, so we should handle it in `api/python/PythonRunner` as you do now, or at [context.py#L191](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L191).
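The precedence ueshin describes can be sketched as a small resolution function. This is an illustrative model of the discussion, not Spark's actual implementation: the function name `resolvePythonExec`, the `Map`-based conf/env inputs, and the final `"python"` fallback are assumptions made for the sketch.

```scala
// Hypothetical sketch of the python-executable precedence described above.
// Driver: PYSPARK_DRIVER_PYTHON (and its conf) outranks PYSPARK_PYTHON.
// Executor: only PYSPARK_PYTHON and spark.executorEnv.PYSPARK_PYTHON apply.
object PythonExecResolution {
  def resolvePythonExec(
      sparkConf: Map[String, String],
      env: Map[String, String],
      isDriver: Boolean): String = {
    val candidates =
      if (isDriver) {
        Seq(
          sparkConf.get("spark.pyspark.driver.python"),
          env.get("PYSPARK_DRIVER_PYTHON"),
          sparkConf.get("spark.pyspark.python"),
          env.get("PYSPARK_PYTHON"))
      } else {
        Seq(
          sparkConf.get("spark.executorEnv.PYSPARK_PYTHON"),
          env.get("PYSPARK_PYTHON"))
      }
    // First defined candidate wins; fall back to plain "python".
    candidates.flatten.headOption.getOrElse("python")
  }
}
```

The exact fallback and the treatment of `spark.yarn.appMasterEnv.XXX` overwriting the environment are left out; the point is only the relative priority of the driver- and executor-side settings.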
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19869 LGTM
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154580007

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

Also add the maps for the new ORC format to `backwardCompatibilityMap`.
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19869#discussion_r154580311

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
@@ -596,7 +596,7 @@ case class HashAggregateExec(
     ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
       s"$fastHashMapTerm = new $fastHashMapClassName();")
     ctx.addMutableState(
-      classOf[java.util.Iterator[ColumnarRow]].getName,
+      s"java.util.Iterator<${classOf[ColumnarRow]}>",
--- End diff --

```scala
scala> s"java.util.Iterator<${classOf[ColumnarRow]}>"
res2: String = java.util.Iterator<class org.apache.spark.sql.execution.vectorized.ColumnarRow>
```
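The pitfall viirya's REPL snippet points at is general: interpolating a `Class` object calls its `toString`, which prepends `"class "` to the name, whereas `getName` yields just the fully qualified name, which is what generated Java source needs. A self-contained sketch using `java.lang.String` in place of Spark's `ColumnarRow`:

```scala
// Interpolating a Class object calls toString, which prefixes "class ",
// producing a string that is not valid Java source:
val interpolated = s"java.util.Iterator<${classOf[String]}>"
// "java.util.Iterator<class java.lang.String>"

// getName returns only the fully qualified name:
val withGetName = s"java.util.Iterator<${classOf[String].getName}>"
// "java.util.Iterator<java.lang.String>"
```

So when building a parameterized type name for codegen, the element class must be rendered with `getName` (or a helper that does so), never interpolated directly.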
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154580323

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2153,4 +2153,21 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
--- End diff --

Move it to `OrcQuerySuite`.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154580498

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -195,8 +195,18 @@ case class RelationConversions(
           .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
       } else {
         val options = relation.tableMeta.storage.properties
-        sessionCatalog.metastoreCatalog
-          .convertToLogicalRelation(relation, options, classOf[OrcFileFormat], "orc")
+        if (conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
+          sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+            relation,
+            options,
+            classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
+            "orc")
+        } else {
+          sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+            relation,
+            options,
+            classOf[org.apache.spark.sql.hive.orc.OrcFileFormat], "orc")
--- End diff --

indents.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840

I can see `spark.executorEnv.PYSPARK_PYTHON` in `sparkConf` at the executor side, because it is set at [context.py#L156](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L156) by [conf.py#L153](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/conf.py#L153).
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154581731

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -115,9 +120,35 @@ abstract class Expression extends TreeNode[Expression] {
     }
   }

+  /**
+   * Returns the input variables to this expression.
+   */
+  private def findInputVars(ctx: CodegenContext, eval: ExprCode): Seq[ExprInputVar] = {
+    if (ctx.currentVars != null) {
+      val boundRefs = this.collect {
+        case b @ BoundReference(ordinal, _, _) if ctx.currentVars(ordinal) != null => (ordinal, b)
--- End diff --

It is not necessarily empty. We will also record it if its `code` is not empty.
[GitHub] spark pull request #19862: [WIP][SPARK-22671][SQL] Make SortMergeJoin shuffl...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/19862#discussion_r154581844

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java ---
@@ -159,6 +154,12 @@ public boolean hasNext() {

   @Override
   public UnsafeRow next() {
     try {
+      if (!alreadyCalculated) {
--- End diff --

Yes, there are mistakes here; I will change it.
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154581864

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns the all necessary parameters to evaluate it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to call the function.
+   * The 2nd string in returned tuple is the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred bu children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars)
+
+    val subExprs = getSubExprInChildren(ctx, expr)
+    val subExprCodes = getSubExprCodes(ctx, subExprs)
+    val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, subExprCodes)
+
+    val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr)
+    val paramsFromRows = inputRows.distinct.filter(_ != null).map { row =>
+      (row, s"InternalRow $row")
+    }
+
+    val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length
+    // Maximum allowed parameter number for Java's method descriptor.
+    if (paramsLength > 255) {
+      None
+    } else {
+      val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip
+      val callParams = allParams._1.distinct
+      val declParams = allParams._2.distinct
+      Some((callParams, declParams))
+    }
+  }
+
+  /**
+   * Returns the eliminated subexpressions in the children expressions.
+   */
+  def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = {
+    expr.children.flatMap { child =>
+      child.collect {
+        case e if ctx.subExprEliminationExprs.contains(e) => e
+      }
+    }.distinct
+  }
+
+  /**
+   * A small helper function to return `ExprCode`s that represent subexpressions.
+   */
+  def getSubExprCodes(ctx: CodegenContext, subExprs: Seq[Expression]): Seq[ExprCode] = {
+    subExprs.map { subExpr =>
+      val stat = ctx.subExprEliminationExprs(subExpr)
--- End diff --

It is for state. Yes, `state` is better and less confusing.
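The `paramsLength > 255` guard in the quoted `getExpressionInputParams` exists because a JVM method descriptor allows at most 255 parameter slots, where `long` and `double` each occupy two slots. A small standalone sketch of that slot accounting (the helper names here are illustrative, not Spark's):

```scala
// Hypothetical sketch of JVM parameter-slot counting behind the 255-param
// guard: long and double take two slots each, everything else one.
object ParamSlots {
  def paramSlots(paramTypes: Seq[String]): Int =
    paramTypes.map {
      case "long" | "double" => 2
      case _                 => 1
    }.sum

  // Splitting codegen into a separate function is only viable while the
  // parameter list fits within the method-descriptor limit.
  def canSplit(paramTypes: Seq[String]): Boolean = paramSlots(paramTypes) <= 255
}
```

When the limit would be exceeded, the PR's approach is to return `None` and skip splitting rather than emit an uncompilable method.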
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19840 @yaooqinn I meant it is not used for `pythonExec`.
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154582383

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns the all necessary parameters to evaluate it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to call the function.
+   * The 2nd string in returned tuple is the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred bu children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars)
+
+    val subExprs = getSubExprInChildren(ctx, expr)
+    val subExprCodes = getSubExprCodes(ctx, subExprs)
+    val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, subExprCodes)
+
+    val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr)
+    val paramsFromRows = inputRows.distinct.filter(_ != null).map { row =>
+      (row, s"InternalRow $row")
+    }
+
+    val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length
+    // Maximum allowed parameter number for Java's method descriptor.
+    if (paramsLength > 255) {
+      None
+    } else {
+      val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip
+      val callParams = allParams._1.distinct
+      val declParams = allParams._2.distinct
+      Some((callParams, declParams))
+    }
+  }
+
+  /**
+   * Returns the eliminated subexpressions in the children expressions.
+   */
+  def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = {
+    expr.children.flatMap { child =>
+      child.collect {
+        case e if ctx.subExprEliminationExprs.contains(e) => e
+      }
+    }.distinct
+  }
+
+  /**
+   * A small helper function to return `ExprCode`s that represent subexpressions.
+   */
+  def getSubExprCodes(ctx: CodegenContext, subExprs: Seq[Expression]): Seq[ExprCode] = {
+    subExprs.map { subExpr =>
+      val stat = ctx.subExprEliminationExprs(subExpr)
+      ExprCode(code = "", value = stat.value, isNull = stat.isNull)
+    }
+  }
+
+  /**
+   * Retrieves previous input rows referred by children and deferred expressions.
+   */
+  def getInputRowsForChildren(ctx: CodegenContext, expr: Expression): Seq[String] = {
+    expr.children.flatMap(getInputRows(ctx, _)).distinct
+  }
+
+  /**
+   * Given a child expression, retrieves previous input rows referred by it or deferred expressions
+   * which are needed to evaluate it.
+   */
+  def getInputRows(ctx: CodegenContext, child: Expression): Seq[String] = {
+    child.flatMap {
+      // An expression directly evaluates on current input row.
+      case BoundReference(ordinal, _, _) if ctx.currentVars == null ||
+          ctx.currentVars(ordinal) == null =>
+        Seq(ctx.INPUT_ROW)
+
+      // An expression which is not
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154582497

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns the all necessary parameters to evaluate it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to call the function.
+   * The 2nd string in returned tuple is the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred bu children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars)
+
+    val subExprs = getSubExprInChildren(ctx, expr)
+    val subExprCodes = getSubExprCodes(ctx, subExprs)
+    val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, subExprCodes)
--- End diff --

I think so. The eliminated subexpressions will be extracted as parameters too.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154582670

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

Thanks. Sure.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154582712

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2153,4 +2153,21 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
--- End diff --

Yep.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154583136

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -195,8 +195,18 @@ case class RelationConversions(
           .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
       } else {
         val options = relation.tableMeta.storage.properties
-        sessionCatalog.metastoreCatalog
-          .convertToLogicalRelation(relation, options, classOf[OrcFileFormat], "orc")
+        if (conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
+          sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+            relation,
+            options,
+            classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
+            "orc")
+        } else {
+          sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+            relation,
+            options,
+            classOf[org.apache.spark.sql.hive.orc.OrcFileFormat], "orc")
--- End diff --

Done.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871

**[Test build #84421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84421/testReport)** for PR 19871 at commit [`5474a07`](https://github.com/apache/spark/commit/5474a070ada64be7467a64fcd849064b063e2e42).
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19869 **[Test build #84419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84419/testReport)** for PR 19869 at commit [`0589e7d`](https://github.com/apache/spark/commit/0589e7d8b57aa71c72dd052f687a4706fb0c5567). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19869 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84419/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19869 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 @ueshin I see. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user zuotingbing commented on the issue: https://github.com/apache/spark/pull/19841 Please review it again. Thanks, all. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154588045 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala --- @@ -108,7 +108,10 @@ private[sql] trait ColumnarBatchScan extends CodegenSupport { |}""".stripMargin) ctx.currentVars = null +// `rowIdx` isn't in `ctx.currentVars`. If the expressions are split later, we can't track it. +// So making it as global variable. --- End diff -- I think it works, although it feels a bit hacky. Like:

```scala
val rowidx = ctx.freshName("rowIdx")
val rowidxExpr = AttributeReference("rowIdx", IntegerType, nullable = false)()
val columnsBatchInput = (output zip colVars).map { case (attr, colVar) =>
  val exprCode = genCodeColumnVector(ctx, colVar, rowidx, attr.dataType, attr.nullable)
  exprCode.inputVars = Seq(ExprInputVar(rowidxExpr, ExprCode("", isNull = "false", value = rowidx)))
  exprCode
}
```

This just adds one global variable. I think it is not a big problem? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
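The underlying concern, that generated code split into separate methods can no longer see a local loop variable, can be illustrated outside Spark. This is a toy plain-Python sketch with made-up names, not Spark's codegen:

```python
# Minimal sketch (not Spark code): once the consumers of `row_idx` are
# split into separate methods, they cannot reference a local variable of
# the loop, so the value is promoted to an instance-level ("global") field.

class SplitCodegen:
    def __init__(self):
        self.row_idx = None  # promoted out of the loop's local scope

    def consume_a(self):
        # A split-out helper; it can only see `row_idx` via `self`.
        return self.row_idx * 2

    def consume_b(self):
        return self.row_idx + 1

    def process(self, rows):
        out = []
        for i, _ in enumerate(rows):
            self.row_idx = i  # set once per row, visible to all helpers
            out.append((self.consume_a(), self.consume_b()))
        return out

print(SplitCodegen().process(["a", "b"]))  # [(0, 1), (2, 2)]
```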
[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19860 Thanks for your work! A late LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84422 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84422/testReport)** for PR 19813 at commit [`429afba`](https://github.com/apache/spark/commit/429afbabef6f718870ca3c6caf0712a1e459681f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19841 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19841 **[Test build #84423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84423/testReport)** for PR 19841 at commit [`515b00a`](https://github.com/apache/spark/commit/515b00abf2d214d2a253ef3ccb213df670d64fd4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 @ueshin [context.py#L191](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L191): is this set for both driver and executor? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19874 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19874 **[Test build #84424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84424/testReport)** for PR 19874 at commit [`657cf88`](https://github.com/apache/spark/commit/657cf88bbf4259d1a823f93f16eaccc2fbe78667). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154598540 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) + val df = spark.read +.option("lineSep", lineSep) +.format("libsvm") +.load(path0.getAbsolutePath) + + assert(df.columns(0) == "label") + assert(df.columns(1) == "features") + val row1 = df.first() + assert(row1.getDouble(0) == 1.0) + val v = row1.getAs[SparseVector](1) + assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0 + --- End diff -- So here you only test the first line ? Why not use df.collect() to test every line ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154594381 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) --- End diff -- Why not import this ? `java.nio.file.Files` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154594735 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) + val df = spark.read +.option("lineSep", lineSep) +.format("libsvm") +.load(path0.getAbsolutePath) + + assert(df.columns(0) == "label") + assert(df.columns(1) == "features") + val row1 = df.first() + assert(row1.getDouble(0) == 1.0) + val v = row1.getAs[SparseVector](1) + assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0 --- End diff -- Use `===` instead of `==` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19840 @yaooqinn It is used for executors. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154605874 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) + val df = spark.read +.option("lineSep", lineSep) +.format("libsvm") +.load(path0.getAbsolutePath) + + assert(df.columns(0) == "label") + assert(df.columns(1) == "features") + val row1 = df.first() + assert(row1.getDouble(0) == 1.0) + val v = row1.getAs[SparseVector](1) + assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0 --- End diff -- I'd like to ask why actually. I had some discussion about this and ended up without a conclusion. The doc says `===` is preferred but the actual error messages are even clearer with `==` sometimes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154606263 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) --- End diff -- To differenciate with google's `Files` explicitly above. Not a big deal. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154606499 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) + val df = spark.read +.option("lineSep", lineSep) +.format("libsvm") +.load(path0.getAbsolutePath) + + assert(df.columns(0) == "label") + assert(df.columns(1) == "features") + val row1 = df.first() + assert(row1.getDouble(0) == 1.0) + val v = row1.getAs[SparseVector](1) + assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0 + --- End diff -- Why not just test the first line? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19873 **[Test build #84420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84420/testReport)** for PR 19873 at commit [`136fd30`](https://github.com/apache/spark/commit/136fd30cb98609e648a8b689a28853c1bae67bf7). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class AnalysisBarrier(child: LogicalPlan) extends LeafNode ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84420/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...
Github user gberger commented on the issue: https://github.com/apache/spark/pull/19792 Friendly ping -- I've fixed that @ueshin. Is there anything else I should look at to get this to be merged? /cc @HyukjinKwon --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r154615728 --- Diff: python/pyspark/sql/udf.py --- @@ -56,6 +56,10 @@ def _create_udf(f, returnType, evalType): return udf_obj._wrapped() +class UDFColumn(Column): --- End diff -- Why did we add this new sub-class? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r154616454 --- Diff: python/pyspark/sql/functions.py --- @@ -2070,6 +2070,8 @@ class PandasUDFType(object): GROUP_MAP = PythonEvalType.SQL_PANDAS_GROUP_MAP_UDF +GROUP_AGG = PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF --- End diff -- So I'm worried that it isn't clear to the user that this will result in a full-shuffle with no-partial aggregation. Is there maybe a place we can document this warning? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
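The concern about a full shuffle with no partial aggregation can be made concrete: without map-side combining, every input row crosses the shuffle boundary, whereas a combiner ships only one partial result per key per partition. This is a rough plain-Python sketch of the two strategies, not the PySpark API:

```python
from collections import defaultdict

def sum_by_key_full_shuffle(partitions):
    """No partial aggregation: every row is shuffled to the reducer."""
    shuffled = [row for part in partitions for row in part]
    agg = defaultdict(int)
    for key, value in shuffled:
        agg[key] += value
    return dict(agg), len(shuffled)  # result plus rows shuffled

def sum_by_key_partial(partitions):
    """Map-side combine: one record per key per partition is shuffled."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    agg = defaultdict(int)
    for key, value in shuffled:
        agg[key] += value
    return dict(agg), len(shuffled)

parts = [[("a", 1), ("a", 2), ("b", 3)], [("a", 4), ("b", 5), ("b", 6)]]
# Same answer either way, but 6 rows vs 4 rows cross the shuffle boundary.
print(sum_by_key_full_shuffle(parts))  # ({'a': 7, 'b': 14}, 6)
print(sum_by_key_partial(parts))       # ({'a': 7, 'b': 14}, 4)
```

The gap grows with the number of rows per key, which is why an aggregation that cannot combine partially deserves a documented warning.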
[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19792 Oh, will double check too for sure shortly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19841 **[Test build #84423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84423/testReport)** for PR 19841 at commit [`515b00a`](https://github.com/apache/spark/commit/515b00abf2d214d2a253ef3ccb213df670d64fd4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19841 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84423/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19841 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated D...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/19875 [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date functions ## What changes were proposed in this pull request? #19696 replaced the deprecated usages for `Date` and `Waiter`, but a few methods were missed. This PR fixes the remaining deprecated usages. ## How was this patch tested? Existing UTs. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-22473_FOLLOWUP Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19875.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19875 commit 424e47175387e063a60fe06287f77703cf400045 Author: Marco Gaido Date: 2017-12-04T11:09:24Z [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date functions --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19876: [WIP][ML][SPARK-11171] spark 11237 Add PMML expor...
GitHub user holdenk opened a pull request: https://github.com/apache/spark/pull/19876 [WIP][ML][SPARK-11171] spark 11237 Add PMML export to Spark ML pipelines ## What changes were proposed in this pull request? Adds PMML export support to Spark ML pipelines in the style of Spark's DataSource API to allow library authors to add their own model export formats. This is a WIP to see if this is the design we want to go with. ## How was this patch tested? Basic unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/holdenk/spark SPARK-11171-SPARK-11237-Add-PMML-export-for-ML-KMeans-r2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19876.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19876 commit 43ae30f08aed921178da07a5e982297b272c7c8f Author: Holden Karau Date: 2017-11-24T14:00:16Z Initial attempt at allowing Spark ML writers to be slightly more pluggable commit 9fec08fbd2dd1c980d5862f0b4521213e1e9349c Author: Holden Karau Date: 2017-11-25T12:55:19Z The LinearRegression suite passes commit 0075bf4776ecffa7fcb24a6f74c0e96161d6221c Author: Holden Karau Date: 2017-11-25T13:00:18Z Add missing META-INFO for MLFormatRegister commit c68880d6d982c56934f4b583263ed5cd4e8329d6 Author: Holden Karau Date: 2017-11-25T16:19:35Z Add a (untested) PMMLLinearRegressionModelWriter commit c2108df2b499bd45dff0e8add789f01d8c3c2c48 Author: Holden Karau Date: 2017-12-04T10:00:56Z Basic PMML export test commit de8619098eeb01ff86b54753f27c29729935bb94 Author: Holden Karau Date: 2017-12-04T11:27:03Z Add PMML testing utils for Spark ML that were accidently left out --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871 **[Test build #84421 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84421/testReport)** for PR 19871 at commit [`5474a07`](https://github.com/apache/spark/commit/5474a070ada64be7467a64fcd849064b063e2e42). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84421/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84422 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84422/testReport)** for PR 19813 at commit [`429afba`](https://github.com/apache/spark/commit/429afbabef6f718870ca3c6caf0712a1e459681f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84422/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19792: [SPARK-22566][PYTHON] Better error message for `_...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19792#discussion_r154630463 --- Diff: python/pyspark/sql/session.py --- @@ -337,7 +338,7 @@ def _inferSchemaFromList(self, data): if type(first) is dict: warnings.warn("inferring schema from dict is deprecated," "please use pyspark.sql.Row instead") -schema = reduce(_merge_type, map(_infer_schema, data)) +schema = reduce(_merge_type, [_infer_schema(row, names) for row in data]) --- End diff -- Not a big deal but let's use generator expression -> `(_infer_schema(row, names) for row in data)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
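The suggestion works because `functools.reduce` only needs an iterable, so a generator expression avoids materializing the full intermediate list of per-row schemas. A minimal stdlib-only sketch, where `infer` and `merge` are hypothetical stand-ins for `_infer_schema` and `_merge_type`:

```python
from functools import reduce

def infer(row):
    # Hypothetical stand-in for _infer_schema: the set of field names in a row.
    return set(row)

def merge(a, b):
    # Hypothetical stand-in for _merge_type: union the inferred fields.
    return a | b

data = [{"x": 1}, {"y": 2}, {"x": 3, "z": 4}]

# List comprehension: builds every per-row result up front.
schema_list = reduce(merge, [infer(row) for row in data])

# Generator expression: feeds reduce lazily, one row at a time.
schema_gen = reduce(merge, (infer(row) for row in data))

print(schema_list == schema_gen == {"x", "y", "z"})  # True
```

For schema inference over a small sample the saving is modest, which is consistent with the reviewer calling it "not a big deal".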
[GitHub] spark pull request #19877: [SPARK-22681] Accumulator should only be updated once...
GitHub user carsonwang opened a pull request: https://github.com/apache/spark/pull/19877 [SPARK-22681] Accumulator should only be updated once for each task in a result stage ## What changes were proposed in this pull request? As the doc says "For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once, i.e. restarted tasks will not update the value." But currently the code doesn't guarantee this. ## How was this patch tested? Newly added tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/carsonwang/spark fixAccum Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19877.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19877 commit 882126c2671e1733d572350af9749e9f8bdca1c2 Author: Carson Wang Date: 2017-12-04T12:23:14Z Do not update accumulator for resubmitted task in result stage --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
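The exactly-once guarantee described in the PR amounts to deduplicating updates by partition, so a resubmitted task for an already-finished result partition does not apply its update a second time. A plain-Python sketch of that idea, not Spark's actual DAGScheduler/accumulator code:

```python
class OnceAccumulator:
    """Apply at most one update per result-stage partition, even when a
    task is resubmitted (e.g. after a fetch failure or via speculation)."""

    def __init__(self):
        self.value = 0
        self._seen_partitions = set()

    def add(self, partition_id, delta):
        if partition_id in self._seen_partitions:
            return  # restarted task: drop the duplicate update
        self._seen_partitions.add(partition_id)
        self.value += delta

acc = OnceAccumulator()
acc.add(0, 10)
acc.add(1, 5)
acc.add(0, 10)  # partition 0 resubmitted; its update is ignored
print(acc.value)  # 15
```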
[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/19862#discussion_r154635850 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala --- @@ -699,39 +700,44 @@ private[joins] class SortMergeJoinScanner( matchJoinKey = null bufferedMatches.clear() false -} else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) { - // The new streamed row has the same join key as the previous row, so return the same matches. - true -} else if (bufferedRow == null) { - // The streamed row's join key does not match the current batch of buffered rows and there are - // no more rows to read from the buffered iterator, so there can be no more matches. - matchJoinKey = null - bufferedMatches.clear() - false } else { - // Advance both the streamed and buffered iterators to find the next pair of matching rows. - var comp = keyOrdering.compare(streamedRowKey, bufferedRowKey) - do { -if (streamedRowKey.anyNull) { - advancedStreamed() -} else { - assert(!bufferedRowKey.anyNull) - comp = keyOrdering.compare(streamedRowKey, bufferedRowKey) - if (comp > 0) advancedBufferedToRowWithNullFreeJoinKey() - else if (comp < 0) advancedStreamed() -} - } while (streamedRow != null && bufferedRow != null && comp != 0) - if (streamedRow == null || bufferedRow == null) { -// We have either hit the end of one of the iterators, so there can be no more matches. + // To make sure vars like bufferedRow is set + advancedBufferedIterRes --- End diff -- Good advice. Thx. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19874 **[Test build #84424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84424/testReport)** for PR 19874 at commit [`657cf88`](https://github.com/apache/spark/commit/657cf88bbf4259d1a823f93f16eaccc2fbe78667). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait TypePropagation extends Logging ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19876 **[Test build #84426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84426/testReport)** for PR 19876 at commit [`de86190`](https://github.com/apache/spark/commit/de8619098eeb01ff86b54753f27c29729935bb94). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19874 Merged build finished. Test PASSed.
[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19877 **[Test build #84427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84427/testReport)** for PR 19877 at commit [`882126c`](https://github.com/apache/spark/commit/882126c2671e1733d572350af9749e9f8bdca1c2).
[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19875 **[Test build #84425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84425/testReport)** for PR 19875 at commit [`424e471`](https://github.com/apache/spark/commit/424e47175387e063a60fe06287f77703cf400045).
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19874 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84424/ Test PASSed.
[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...
Github user carsonwang commented on the issue: https://github.com/apache/spark/pull/19877 cc @vanzin
[GitHub] spark pull request #19870: [SPARK-22665][SQL] Avoid repartitioning with empt...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19870#discussion_r154542257

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -838,6 +838,8 @@ case class RepartitionByExpression(
         numPartitions: Int) extends RepartitionOperation {
       require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    +  require(partitionExpressions.nonEmpty, s"${getClass.getSimpleName} requires a non empty set of " +
    +    s"partitioning expressions.")
    --- End diff --

    Let's remove this leading 's'.
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19876 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84426/ Test FAILed.
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19876 **[Test build #84426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84426/testReport)** for PR 19876 at commit [`de86190`](https://github.com/apache/spark/commit/de8619098eeb01ff86b54753f27c29729935bb94). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait PMMLReadWriteTest extends TempDirectory `
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19876 Merged build finished. Test FAILed.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19871#discussion_r154638156

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -85,7 +87,8 @@ case class DataSource(
       case class SourceInfo(name: String, schema: StructType, partitionColumns: Seq[String])
    -  lazy val providingClass: Class[_] = DataSource.lookupDataSource(className)
    +  lazy val providingClass: Class[_] =
    +    DataSource.lookupDataSource(sparkSession.sessionState.conf, className)
    --- End diff --

    I'd put this conf as the last argument actually if you wouldn't mind ..
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19871#discussion_r154638371

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -363,6 +363,14 @@ object SQLConf {
         .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
         .createWithDefault("snappy")
    +  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
    +    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
    +      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
    +      "more stable and faster.")
    --- End diff --

    tiny nit: let's take out `more stable` ..
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19871#discussion_r154640805

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -568,8 +574,12 @@ object DataSource extends Logging {
         "org.apache.spark.Logging")

       /** Given a provider name, look up the data source class definition. */
    -  def lookupDataSource(provider: String): Class[_] = {
    -    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
    +  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
    --- End diff --

    After more thinking, I don't think it's worth passing the whole SQLConf into this function; we just need to know whether `SQLConf.ORC_USE_NEW_VERSION` is enabled.
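The suggestion above is a general API-design point: when a lookup depends on a single setting, passing that one flag keeps the function's contract narrow and easy to test. A hypothetical Python sketch of the dispatch under discussion (names and alias map are illustrative, not Spark's actual API):

```python
# Illustrative backward-compatibility alias map, loosely modeled on
# DataSource.backwardCompatibilityMap; the entries here are made up.
BACKWARD_COMPATIBILITY_MAP = {"orc": "native-orc"}

def lookup_data_source(provider, use_new_orc=False):
    """Resolve a provider name to an implementation name, taking only the
    single boolean flag the lookup needs instead of a whole config object."""
    resolved = BACKWARD_COMPATIBILITY_MAP.get(provider, provider)
    if resolved == "native-orc" and not use_new_orc:
        resolved = "hive-orc"  # fall back to the legacy implementation
    return resolved
```

A caller would pass `conf.get("spark.sql.orc.useNewVersion")` (or its analog) at the call site, so the function itself never sees the configuration object.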
[GitHub] spark issue #895: [SPARK-1940] Enabling rolling of executor logs, and automa...
Github user wbowditch commented on the issue: https://github.com/apache/spark/pull/895 Can these configuration additions be added to Spark Documentation (https://spark.apache.org/docs/latest/configuration.html) ?
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644235

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql](
           df.logicalPlan))
       }
    +
    +  private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = {
    +    val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF])
    +
    +    val groupingNamedExpressions = groupingExprs.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    +    }
    +
    +    val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
    +
    +    val child = df.logicalPlan
    +
    +    val childrenExpressions = exprs.flatMap(expr =>
    +      expr.children.map {
    +        case ne: NamedExpression => ne
    +        case other => Alias(other, other.toString)()
    --- End diff --

    indentation nit
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644340

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql](
           df.logicalPlan))
       }
    +
    +  private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = {
    +    val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF])
    +
    +    val groupingNamedExpressions = groupingExprs.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    +    }
    +
    +    val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
    +
    +    val child = df.logicalPlan
    +
    +    val childrenExpressions = exprs.flatMap(expr =>
    +      expr.children.map {
    +        case ne: NamedExpression => ne
    +        case other => Alias(other, other.toString)()
    +      })
    +
    +    val project = Project(groupingNamedExpressions ++ childrenExpressions, child)
    +
    +    val udfOutputs = exprs.flatMap(expr =>
    +      Seq(AttributeReference(expr.name, expr.dataType)())
    +    )
    --- End diff --

    I think this could be inlined.
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644620

    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4016,6 +4016,89 @@ def test_unsupported_types(self):
             with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
                 df.groupby('id').apply(f).collect()
    +
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +    def assertFramesEqual(self, expected, result):
    +        msg = ("DataFrames are not equal: " +
    +               ("\n\nExpected:\n%s\n%s" % (expected, expected.dtypes)) +
    --- End diff --

    indentation nit
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154642902

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala ---
    @@ -38,3 +38,13 @@ case class FlatMapGroupsInPandas(
        */
       override val producedAttributes = AttributeSet(output)
     }
    +
    +case class AggregateInPandas(
    +    groupingAttributes: Seq[Attribute],
    +    functionExprs: Seq[Expression],
    +    output: Seq[Attribute],
    +    child: LogicalPlan
    +) extends UnaryNode {
    --- End diff --

    nit:
    ```
        child: LogicalPlan) extends UnaryNode {
    ```
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154642230

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,135 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, JoinedRow, SortOrder, UnsafeProjection, UnsafeRow}
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.StructType
    +import org.apache.spark.util.Utils
    +
    +case class AggregateInPandasExec(
    +    groupingAttributes: Seq[Attribute],
    +    func: Seq[Expression],
    +    output: Seq[Attribute],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  private val udfs = func.map(expr => expr.asInstanceOf[PythonUDF])
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingAttributes.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingAttributes) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingAttributes.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    // val argOffsets = Array((0 until (child.output.length - groupingAttributes.length)).toArray)
    +    val schema = StructType(child.schema.drop(groupingAttributes.length))
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfs.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +        allInputs += e
    --- End diff --

    indentation nit
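The `allInputs`/`argOffsets` bookkeeping at the end of the diff above builds one shared list of all UDF inputs and, for each UDF, records the offsets of its arguments into that list. A rough Python analog of that pattern (hypothetical helper, not the PySpark internals; the real code de-duplicates with `semanticEquals` rather than equality):

```python
def compute_arg_offsets(udf_inputs):
    """For each UDF's input list, map every expression to an offset into one
    shared, de-duplicated list of all inputs across the UDFs."""
    all_inputs = []
    arg_offsets = []
    for inputs in udf_inputs:
        offsets = []
        for expr in inputs:
            if expr in all_inputs:
                offsets.append(all_inputs.index(expr))  # reuse existing slot
            else:
                all_inputs.append(expr)
                offsets.append(len(all_inputs) - 1)     # new slot at the end
        arg_offsets.append(offsets)
    return all_inputs, arg_offsets
```

Sharing one flat input list means each expression is projected and serialized to the Python worker once, even when several UDFs consume it.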
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644084

    --- Diff: python/pyspark/sql/udf.py ---
    @@ -56,6 +56,10 @@ def _create_udf(f, returnType, evalType):
         return udf_obj._wrapped()
    +
    +class UDFColumn(Column):
    --- End diff --

    BTW, what do you think about adding an attribute instead?
[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19875 yes @HyukjinKwon , you are 100% right, sorry for this error.
[GitHub] spark issue #18424: [SPARK-17091] Add rule to convert IN predicate to equiva...
Github user ptkool commented on the issue: https://github.com/apache/spark/pull/18424 @a10y Yes, I'm still tracking this.