[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/21939 got it. Thank you! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/21939 @shaneknapp what was the version of pyarrow in that build? 0.8 or 0.10? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21939: [SPARK-23874][SQL][PYTHON] Upgrade Apache Arrow to 0.10....
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/21939 @BryanCutler So, for this upgrade, even the JVM side dependency is 0.10, pyspark can work with any version between pyarrow 0.8 to 0.10 without problem? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22003: [SPARK-25019][BUILD] Fix orc dependency to use the same ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/22003 @dongjoon-hyun no problem. Thank you! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22003: [SPARK-25019][BUILD] Fix orc dependency to use the same ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/22003 lgtm. Merging to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22003: [SPARK-25019][BUILD] Fix orc dependency to use th...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/22003#discussion_r207986831 --- Diff: sql/core/pom.xml --- @@ -90,39 +90,11 @@ org.apache.orc orc-core ${orc.classifier} - - - org.apache.hadoop - hadoop-hdfs - - - - org.apache.hive - hive-storage-api - - org.apache.orc orc-mapreduce ${orc.classifier} - - - org.apache.hadoop - hadoop-hdfs - - - - org.apache.hive - hive-storage-api - - --- End diff -- got it. Thank you. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22003: [SPARK-25019][BUILD] Fix orc dependency to use th...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/22003#discussion_r207962501 --- Diff: sql/core/pom.xml --- @@ -90,39 +90,11 @@ org.apache.orc orc-core ${orc.classifier} - - - org.apache.hadoop - hadoop-hdfs - - - - org.apache.hive - hive-storage-api - - org.apache.orc orc-mapreduce ${orc.classifier} - - - org.apache.hadoop - hadoop-hdfs - - - - org.apache.hive - hive-storage-api - - --- End diff -- Thank you. Just for me to understand it better. Do you know why defining exclusions in this pom file messed up the pom? Also, how should I try it out myself? What is the right command to publish locally? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22003: [SPARK-25019][BUILD] Fix orc dependency to use th...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/22003#discussion_r207888608 --- Diff: sql/core/pom.xml --- @@ -90,39 +90,11 @@ org.apache.orc orc-core ${orc.classifier} - - - org.apache.hadoop - hadoop-hdfs - - - - org.apache.hive - hive-storage-api - - org.apache.orc orc-mapreduce ${orc.classifier} - - - org.apache.hadoop - hadoop-hdfs - - - - org.apache.hive - hive-storage-api - - --- End diff -- @dongjoon-hyun when we publish snapshot artifacts or releases, will the pom for spark sql get all of exclusions defined in the parent pom? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21865: [SPARK-24895] Remove spotbugs plugin
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/21865 lgtm. I am merging this PR to master branch. Then, I will kick off https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21865: [SPARK-24895] Remove spotbugs plugin
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/21865 cc @HyukjinKwon @kiszk I will merge this PR once it passes the test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20473: [SPARK-23300][TESTS] Prints out if Pandas and PyA...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/20473#discussion_r16362 --- Diff: python/run-tests.py --- @@ -151,6 +151,38 @@ def parse_opts(): return opts +def _check_dependencies(python_exec, modules_to_test): +if "COVERAGE_PROCESS_START" in os.environ: +# Make sure if coverage is installed. +try: +subprocess_check_output( +[python_exec, "-c", "import coverage"], +stderr=open(os.devnull, 'w')) +except: +print_red("Coverage is not installed in Python executable '%s' " + "but 'COVERAGE_PROCESS_START' environment variable is set, " + "exiting." % python_exec) +sys.exit(-1) + +if pyspark_sql in modules_to_test: +# If we should test 'pyspark-sql', it checks if PyArrow and Pandas are installed and +# explicitly prints out. See SPARK-23300. +try: +subprocess_check_output( +[python_exec, "-c", "import pyarrow"], +stderr=open(os.devnull, 'w')) +except: --- End diff -- Thank you. Appreciate it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r165449847 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala --- @@ -199,7 +200,7 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper { object PhysicalAggregation { // groupingExpressions, aggregateExpressions, resultExpressions, child type ReturnType = -(Seq[NamedExpression], Seq[AggregateExpression], Seq[NamedExpression], LogicalPlan) +(Seq[NamedExpression], Seq[Expression], Seq[NamedExpression], LogicalPlan) --- End diff -- It will be good to try it out soon. But it is not urgent. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20473: [SPARK-23300][TESTS] Prints out if Pandas and PyA...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/20473#discussion_r165445947 --- Diff: python/run-tests.py --- @@ -151,6 +151,38 @@ def parse_opts(): return opts +def _check_dependencies(python_exec, modules_to_test): +if "COVERAGE_PROCESS_START" in os.environ: +# Make sure if coverage is installed. +try: +subprocess_check_output( +[python_exec, "-c", "import coverage"], +stderr=open(os.devnull, 'w')) +except: +print_red("Coverage is not installed in Python executable '%s' " + "but 'COVERAGE_PROCESS_START' environment variable is set, " + "exiting." % python_exec) +sys.exit(-1) + +if pyspark_sql in modules_to_test: +# If we should test 'pyspark-sql', it checks if PyArrow and Pandas are installed and +# explicitly prints out. See SPARK-23300. +try: +subprocess_check_output( +[python_exec, "-c", "import pyarrow"], +stderr=open(os.devnull, 'w')) +except: --- End diff -- Actually, since we are here, is it possible to do the same thing as https://github.com/apache/spark/blob/ec63e2d0743a4f75e1cce21d0fe2b54407a86a4a/python/pyspark/sql/tests.py#L51-L63 and https://github.com/apache/spark/blob/ec63e2d0743a4f75e1cce21d0fe2b54407a86a4a/python/pyspark/sql/tests.py#L78-L84? It will be nice to use the same logic. Otherwise, even we do not print the warning at here, tests may still get skipped because of the version issue. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20473: [SPARK-23300][TESTS] Prints out if Pandas and PyA...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/20473#discussion_r165445232 --- Diff: python/run-tests.py --- @@ -151,6 +151,38 @@ def parse_opts(): return opts +def _check_dependencies(python_exec, modules_to_test): +if "COVERAGE_PROCESS_START" in os.environ: +# Make sure if coverage is installed. +try: +subprocess_check_output( +[python_exec, "-c", "import coverage"], +stderr=open(os.devnull, 'w')) +except: +print_red("Coverage is not installed in Python executable '%s' " + "but 'COVERAGE_PROCESS_START' environment variable is set, " + "exiting." % python_exec) +sys.exit(-1) + +if pyspark_sql in modules_to_test: +# If we should test 'pyspark-sql', it checks if PyArrow and Pandas are installed and +# explicitly prints out. See SPARK-23300. +try: +subprocess_check_output( +[python_exec, "-c", "import pyarrow"], +stderr=open(os.devnull, 'w')) +except: --- End diff -- How about we also explicitly mention that pyarrow/pandas related tests will run if they are installed? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/20465 So, jenkins jobs run those tests with python3? If so, I feel better because those tests are not completely skipped in Jenkins. If it is hard to make them run with python 2. Letâs have a log to explicitly show if we are going to run tests using pandas/pyarrow, which will help us confirm if they get exercised with python 3 in Jenkins or not. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20465: [SPARK-23292][TEST] always run python tests
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/20465 @felixcheung jenkins is actually skipping those tests (see the failure of this pr). It makes sense to provide a way to allow developers to not run those tests. But, I'd prefer that we run those tests by default. So, we can make sure that jenkins is doing the right thing. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r165253818 --- Diff: python/pyspark/sql/tests.py --- @@ -4353,6 +4347,446 @@ def test_unsupported_types(self): df.groupby('id').apply(f).collect() +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed") +class GroupbyAggPandasUDFTests(ReusedSQLTestCase): + +@property +def data(self): +from pyspark.sql.functions import array, explode, col, lit +return self.spark.range(10).toDF('id') \ +.withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \ +.withColumn("v", explode(col('vs'))) \ +.drop('vs') \ +.withColumn('w', lit(1.0)) + +@property +def python_plus_one(self): +from pyspark.sql.functions import udf + +@udf('double') +def plus_one(v): +assert isinstance(v, (int, float)) +return v + 1 +return plus_one + +@property +def pandas_scalar_plus_two(self): +import pandas as pd +from pyspark.sql.functions import pandas_udf, PandasUDFType + +@pandas_udf('double', PandasUDFType.SCALAR) +def plus_two(v): +assert isinstance(v, pd.Series) +return v + 2 +return plus_two + +@property +def pandas_agg_mean_udf(self): +from pyspark.sql.functions import pandas_udf, PandasUDFType + +@pandas_udf('double', PandasUDFType.GROUP_AGG) +def avg(v): +return v.mean() +return avg + +@property +def pandas_agg_sum_udf(self): +from pyspark.sql.functions import pandas_udf, PandasUDFType + +@pandas_udf('double', PandasUDFType.GROUP_AGG) +def sum(v): +return v.sum() +return sum + +@property +def pandas_agg_weighted_mean_udf(self): +import numpy as np +from pyspark.sql.functions import pandas_udf, PandasUDFType + +@pandas_udf('double', PandasUDFType.GROUP_AGG) +def weighted_mean(v, w): +return np.average(v, weights=w) +return weighted_mean + +def test_manual(self): +df = self.data +sum_udf = self.pandas_agg_sum_udf +mean_udf = self.pandas_agg_mean_udf + +result1 = df.groupby('id').agg(sum_udf(df.v), mean_udf(df.v)).sort('id') +expected1 = self.spark.createDataFrame( +[[0, 245.0, 24.5], + [1, 255.0, 25.5], + [2, 265.0, 26.5], + [3, 275.0, 27.5], + [4, 285.0, 28.5], + [5, 295.0, 29.5], + [6, 305.0, 30.5], + [7, 315.0, 31.5], + [8, 325.0, 32.5], + [9, 335.0, 33.5]], +['id', 'sum(v)', 'avg(v)']) + +self.assertPandasEqual(expected1.toPandas(), result1.toPandas()) + +def test_basic(self): +from pyspark.sql.functions import col, lit, sum, mean + +df = self.data +weighted_mean_udf = self.pandas_agg_weighted_mean_udf + +# Groupby one column and aggregate one UDF with literal +result1 = df.groupby('id').agg(weighted_mean_udf(df.v, lit(1.0))).sort('id') +expected1 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort('id') +self.assertPandasEqual(expected1.toPandas(), result1.toPandas()) + +# Groupby one expression and aggregate one UDF with literal +result2 = df.groupby((col('id') + 1)).agg(weighted_mean_udf(df.v, lit(1.0)))\ +.sort(df.id + 1) +expected2 = df.groupby((col('id') + 1))\ +.agg(mean(df.v).alias('weighted_mean(v, 1.0)')).sort(df.id + 1) +self.assertPandasEqual(expected2.toPandas(), result2.toPandas()) + +# Groupby one column and aggregate one UDF without literal +result3 = df.groupby('id').agg(weighted_mean_udf(df.v, df.w)).sort('id') +expected3 = df.groupby('id').agg(mean(df.v).alias('weighted_mean(v, w)')).sort('id') +self.assertPandasEqual(expected3.toPandas(), result3.toPandas()) + +# Groupby one expression and aggregate one UDF without literal +result4 = df.groupby((col('id') + 1).alias('id'))\ +.agg(weighted_mean_udf(df.v, df.w))\ +.sort('id') +expected4 = df.groupby((col('id') + 1).alias('id'))\ +.agg(mean(df.v).alias('weighted_mean(v, w)'))\ +.sort('id') +
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r165253514 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala --- @@ -199,7 +200,7 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper { object PhysicalAggregation { // groupingExpressions, aggregateExpressions, resultExpressions, child type ReturnType = -(Seq[NamedExpression], Seq[AggregateExpression], Seq[NamedExpression], LogicalPlan) +(Seq[NamedExpression], Seq[Expression], Seq[NamedExpression], LogicalPlan) --- End diff -- I prefer that we try out using a new rule. We can create utility function to reuse code. Will you have a chance to try it out? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r165220142 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala --- @@ -199,7 +200,7 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper { object PhysicalAggregation { // groupingExpressions, aggregateExpressions, resultExpressions, child type ReturnType = -(Seq[NamedExpression], Seq[AggregateExpression], Seq[NamedExpression], LogicalPlan) +(Seq[NamedExpression], Seq[Expression], Seq[NamedExpression], LogicalPlan) --- End diff -- @icexelloss Thank you for this contribution! I just came across the change in this file. I am not sure if changing the type at here is the best option. The reason is that whenever we use this PhysicalAggregation rule, we have to check the instance type of those aggregate expressions and do casting. To me, it seems better to leave this rule untouched and create a new rule just for Python UDAF. What do you think? (maybe you and reviewers already discussed it. If so, can you point me to the discussion?) Thank you! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20037: [SPARK-22849] ivy.retrieve pattern should also co...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/20037#discussion_r163463718 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -1271,7 +1271,7 @@ private[spark] object SparkSubmitUtils { // retrieve all resolved dependencies ivy.retrieve(rr.getModuleDescriptor.getModuleRevisionId, packagesDirectory.getAbsolutePath + File.separator + -"[organization]_[artifact]-[revision].[ext]", +"[organization]_[artifact]-[revision](-[classifier]).[ext]", --- End diff -- I tried it today. Somehow, I only got the test jar downloaded. Have you guys seen this issue? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20110: [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnin...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/20110 Thank you! Let's also check the build result to make sure `pyspark.streaming.tests.FlumePollingStreamTests` is indeed triggered (I hit this issue while running this test). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19535: [SPARK-22313][PYTHON] Mark/print deprecation warn...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/19535#discussion_r159019845 --- Diff: python/pyspark/streaming/flume.py --- @@ -54,8 +54,13 @@ def createStream(ssc, hostname, port, :param bodyDecoder: A function used to decode body (default is utf8_decoder) :return: A DStream object -.. note:: Deprecated in 2.3.0 +.. note:: Deprecated in 2.3.0. Flume support is deprecated as of Spark 2.3.0. +See SPARK-22142. """ +warnings.warn( --- End diff -- thank you :) It will be good to also check why master build does not fail since python should complain about it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19535: [SPARK-22313][PYTHON] Mark/print deprecation warn...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/19535#discussion_r159013024 --- Diff: python/pyspark/streaming/flume.py --- @@ -54,8 +54,13 @@ def createStream(ssc, hostname, port, :param bodyDecoder: A function used to decode body (default is utf8_decoder) :return: A DStream object -.. note:: Deprecated in 2.3.0 +.. note:: Deprecated in 2.3.0. Flume support is deprecated as of Spark 2.3.0. +See SPARK-22142. """ +warnings.warn( --- End diff -- Seems `warnings` is not imported in this file? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #5604: [SPARK-1442][SQL] Window Function Support for Spar...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/5604#discussion_r157933488 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala --- @@ -0,0 +1,340 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.expressions + +import org.apache.spark.sql.catalyst.analysis.UnresolvedException +import org.apache.spark.sql.catalyst.errors.TreeNodeException +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.types.{NumericType, DataType} + +/** + * The trait of the Window Specification (specified in the OVER clause or WINDOW clause) for + * Window Functions. + */ +sealed trait WindowSpec + +/** + * The specification for a window function. + * @param partitionSpec It defines the way that input rows are partitioned. + * @param orderSpec It defines the ordering of rows in a partition. + * @param frameSpecification It defines the window frame in a partition. + */ +case class WindowSpecDefinition( +partitionSpec: Seq[Expression], +orderSpec: Seq[SortOrder], +frameSpecification: WindowFrame) extends Expression with WindowSpec { + + def validate: Option[String] = frameSpecification match { +case UnspecifiedFrame => + Some("Found a UnspecifiedFrame. It should be converted to a SpecifiedWindowFrame " + +"during analysis. Please file a bug report.") +case frame: SpecifiedWindowFrame => frame.validate.orElse { + def checkValueBasedBoundaryForRangeFrame(): Option[String] = { +if (orderSpec.length > 1) { + // It is not allowed to have a value-based PRECEDING and FOLLOWING + // as the boundary of a Range Window Frame. + Some("This Range Window Frame only accepts at most one ORDER BY expression.") +} else if (orderSpec.nonEmpty && !orderSpec.head.dataType.isInstanceOf[NumericType]) { + Some("The data type of the expression in the ORDER BY clause should be a numeric type.") +} else { + None +} + } + + (frame.frameType, frame.frameStart, frame.frameEnd) match { +case (RangeFrame, vp: ValuePreceding, _) => checkValueBasedBoundaryForRangeFrame() +case (RangeFrame, vf: ValueFollowing, _) => checkValueBasedBoundaryForRangeFrame() +case (RangeFrame, _, vp: ValuePreceding) => checkValueBasedBoundaryForRangeFrame() +case (RangeFrame, _, vf: ValueFollowing) => checkValueBasedBoundaryForRangeFrame() +case (_, _, _) => None + } +} + } + + type EvaluatedType = Any + + override def children: Seq[Expression] = partitionSpec ++ orderSpec + + override lazy val resolved: Boolean = +childrenResolved && frameSpecification.isInstanceOf[SpecifiedWindowFrame] + + + override def toString: String = simpleString + + override def eval(input: Row): EvaluatedType = throw new UnsupportedOperationException + override def nullable: Boolean = true + override def foldable: Boolean = false + override def dataType: DataType = throw new UnsupportedOperationException +} + +/** + * A Window specification reference that refers to the [[WindowSpecDefinition]] defined + * under the name `name`. + */ +case class WindowSpecReference(name: String) extends WindowSpec + +/** + * The trait used to represent the type of a Window Frame. + */ +sealed trait FrameType + +/** + * RowFrame treats rows in a partition individually. When a [[ValuePreceding]] + * or a [[ValueFollowing]] is used as its [[FrameBoundary]], the value is considered + * as a physical offset. + * For example, `ROW BETWEEN 1 PRECEDING AND 1 FOLLOWING` represents a 3-row frame, + *
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/19448 Thank you :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/19448 I am not really worried about this particular change. It's already merged and it seems a small and safe change. I am not planning to revert it. But, in general, let's avoid of merging changes that are not bug fixes to a maintenance branch. If there is an exception, it will be better to make it clear earlier. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19448: [SPARK-22217] [SQL] ParquetFileFormat to support arbitra...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/19448 @HyukjinKwon branch-2.2 is in a maintenance branch, I am not sure it is appropriate to merge this change to branch-2.2 since it is not really a bug fix. If the doc is not accurate, we should fix the doc. For a maintenance branch, we need to be very careful on what we merge and we should always avoid of unnecessary changes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19149: [SPARK-21652][SQL][FOLLOW-UP] Fix rule conflict between ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/19149 Can we add a test? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19080: [SPARK-21865][SQL] simplify the distribution sema...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/19080#discussion_r136214689 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala --- @@ -30,18 +30,43 @@ import org.apache.spark.sql.types.{DataType, IntegerType} * - Intra-partition ordering of data: In this case the distribution describes guarantees made *about how tuples are distributed within a single partition. */ -sealed trait Distribution +sealed trait Distribution { + /** + * The required number of partitions for this distribution. If it's None, then any number of + * partitions is allowed for this distribution. + */ + def requiredNumPartitions: Option[Int] + + /** + * Creates a default partitioning for this distribution, which can satisfy this distribution while + * matching the given number of partitions. + */ + def createPartitioning(numPartitions: Int): Partitioning +} /** * Represents a distribution where no promises are made about co-location of data. */ -case object UnspecifiedDistribution extends Distribution +case object UnspecifiedDistribution extends Distribution { + override def requiredNumPartitions: Option[Int] = None + + override def createPartitioning(numPartitions: Int): Partitioning = { +throw new IllegalStateException("UnspecifiedDistribution does not have default partitioning.") + } +} /** * Represents a distribution that only has a single partition and all tuples of the dataset * are co-located. */ -case object AllTuples extends Distribution +case object AllTuples extends Distribution { --- End diff -- I'd like to keep `AllTuples`. `SingleNodeDistribution` is a special case of `AllTuples` and seems we do not really need the extra information introduced by `SingleNode`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19080: [SPARK-21865][SQL] simplify the distribution semantic of...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/19080 Have a question after reading the new approach. Let's say that we have a join like `T1 JOIN T2 on T1.a = T2.a`. Also `T1` is hash partitioned by the value of `T1.a` and it has 10 partitions, and `T2` is range partitioned by the value of `T2.a` and it has 10 partitions. Both sides will satisfy the required distribution of the join. However, we need to add an exchange at either side in order to produce the correct result. How will we handle this case with this change? Also, regarding > For multiple children, Spark only guarantees they have the same number of partitions, and it's the operator's responsibility to leverage this guarantee to achieve more complicated requirements. Can you give a concrete example? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18944: [SPARK-21732][SQL]Lazily init hive metastore client
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/18944 lgtm --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18316: [SPARK-21111] [TEST] [2.2] Fix the test failure of descr...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/18316 Thanks! I have merged this pr to branch-2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18316: [SPARK-21111] [TEST] [2.2] Fix the test failure of descr...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/18316 thanks! merging to branch-2.2 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18316: [SPARK-21111] [TEST] [2.2] Fix the test failure of descr...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/18316 lgtm --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18064: [SPARK-20213][SQL] Fix DataFrameWriter operations in SQL...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/18064 My suggestion was about getting changes on the interfaces of ExecutedCommandExec and SaveIntoDataSourceCommand to separate prs. It will help code review (both speed and quality). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18148: [SPARK-20926][SQL] Removing exposures to guava library c...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/18148 @vanzin Seems merging to branch-2.2 was an accident? Since it is not really a bug fix, should we revert it from branch-2.2 and just keep it in the master? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18064: [SPARK-20213][SQL] Fix DataFrameWriter operations in SQL...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/18064 I just case across this pr. I have one general feedback. It will be great if we can make a pr have a single purpose. This pr contains different kinds of changes in order to fix the UI. If refactoring is needed, I'd recommend to have separate PR for refactoring purposes. Then, use a different PR to do the actual fix. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18172: [SPARK-20946][SQL] simplify the config setting logic in ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/18172 Reverting this because it breaks repl tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17617: [SPARK-20244][Core] Handle incorrect bytesRead me...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/17617#discussion_r119938185 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala --- @@ -143,14 +144,29 @@ class SparkHadoopUtil extends Logging { * Returns a function that can be called to find Hadoop FileSystem bytes read. If * getFSBytesReadOnThreadCallback is called from thread r at time t, the returned callback will * return the bytes read on r since t. - * - * @return None if the required method can't be found. --- End diff -- Why removing this line instead of the doc? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17763: [SPARK-13747][Core]Add ThreadUtils.awaitReady and disall...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17763 lgtm --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17666: [SPARK-20311][SQL] Support aliases for table value funct...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17666 I have reverted this change from both master and branch-2.2. I have reopened the jira. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17666: [SPARK-20311][SQL] Support aliases for table value funct...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17666 I am going to revert this PR from master and branch-2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17666: [SPARK-20311][SQL] Support aliases for table value funct...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17666 @maropu Sorry. I think this PR introduces a regression. ``` scala> spark.sql("select * from range(1, 10) cross join range(1, 10)").explain == Physical Plan == org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans Range (1, 10, step=1, splits=None) and Range (1, 10, step=1, splits=None) Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.; ``` I think we are taking the cross as the alias. I reverted your change locally and the query worked. I am attaching the expected analyzed plan below. ``` scala> spark.sql("select * from range(1, 10) cross join range(1, 10)").queryExecution.analyzed res1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = Project [id#8L, id#9L] +- Join Cross :- Range (1, 10, step=1, splits=None) +- Range (1, 10, step=1, splits=None) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames(...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17905 i see. I think https://github.com/apache/spark/pull/17905/commits/d4c1a9db25ee7386f7b12e4dabb54210a9892510 is good. How about we get it checked in first (after jenkins passes)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames(...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17905 lgtm --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames(...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17905 @falaki's PR did not actually trigger that test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames(...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17905 @felixcheung you are right. That is the problem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17903: [SPARK-20661][SparkR][Test] SparkR tableNames() test fai...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17903 I do not think https://github.com/apache/spark/pull/17649 caused the problem. I saw failures without that internally. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17903: [SPARK-20661][SparkR][Test] SparkR tableNames() test fai...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17903 Thanks @falaki. Merging to master and branch-2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17903: [SPARK-20661][SparkR][Test] SparkR tableNames() test fai...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17903 Seems 2.2 build is fine. But, I'd like to get this merged in branch-2.2 since this test will fail if any previous tests leak tables. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17903: [SPARK-20661][SparkR][Test] SparkR tableNames() test fai...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17903 @felixcheung fyi. I think the main problem of this test is that it will be broken if tests executed before this one leak any table. I think this change makes sense. I will merge it once it passes jenkins. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17892: [SPARK-20626][SPARKR] address date test warning with tim...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17892 @felixcheung Seems master build is broken because R tests are broken (https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/2844/console). I am not sure if this PR caused that. Can you help to take a look? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17746: [SPARK-20449][ML] Upgrade breeze version to 0.13.1
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17746 @dbtsai Thanks for the explanation and the context :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17746: [SPARK-20449][ML] Upgrade breeze version to 0.13.1
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17746 Can I ask how we decided merging this dependency change after the cut of the release branch (especially this change affects user code)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17659: [SPARK-20358] [core] Executors failing stage on interrup...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17659 lgtm. Merging to master and branch-2.2. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17531: [SPARK-20217][core] Executor should not fail stage if ki...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17531 Thanks. Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17531: [SPARK-20217][core] Executor should not fail stage if ki...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17531 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17423: [SPARK-20088] Do not create new SparkContext in SparkR c...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17423 got it. Thanks :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17423: [SPARK-20088] Do not create new SparkContext in SparkR c...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17423 @felixcheung `SparkContext.getOrCreate` is the preferred way to create a SparkContext. So, even we have check, it is still better to use `getOrCreate`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16952: [SPARK-19620][SQL]Fix incorrect exchange coordinator id ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16952 LGTM. Merging to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17156: [SPARK-19816][SQL][Tests] Fix an issue that DataFrameCal...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17156 merged to branch-2.1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17156: [SPARK-19816][SQL][Tests] Fix an issue that DataFrameCal...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/17156 Let's also merge this to branch-2.1. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16917: [SPARK-19529][BRANCH-1.6] Backport PR #16866 to branch-1...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16917 Let's use a meaningful title in future :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16935: [SPARK-19604] [TESTS] Log the start of every Python test
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16935 cool. It has been merged. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16935: [SPARK-19604] [TESTS] Log the start of every Python test
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16935 Seems I cannot merge now... Will try again later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16935: [SPARK-19604] [TESTS] Log the start of every Python test
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16935 ok. Nothing new to add. I will merge this to master and branch-2.1 (in case we want to debug any python test hanging issue in branch-2.1). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16935: [SPARK-19604] [TESTS] Log the start of every Python test
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16935 Let's not merge it right now. I may need to log more. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16935: [SPARK-19604] [TESTS] Log the start of every Pyth...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/16935 [SPARK-19604] [TESTS] Log the start of every Python test ## What changes were proposed in this pull request? Right now, we only have info level log after we finish the tests of a Python test file. We should also log the start of a test. So, if a test is hanging, we can tell which test file is running. ## How was this patch tested? This is a change for python tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark SPARK-19604 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16935.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16935 commit 1181cc3be7bcf21fbe7e88b35ac662353fb2f366 Author: Yin Huai <yh...@databricks.com> Date: 2017-02-15T04:19:28Z Right now, we only have info level log after we finish the tests of a Python test file. We should also log the start of a test. So, if a test is hanging, we can tell which test file is running. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16894: [SPARK-17897] [SQL] [BACKPORT-2.0] Fixed IsNotNull Const...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16894 thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16067: [SPARK-17897] [SQL] Fixed IsNotNull Constraint Inference...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16067 @gatorsmile can we also add it in branch-2.0? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16649: [SPARK-19295] [SQL] IsolatedClientLoader's downloadVersi...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16649 Cool I am merging this to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16645: [SPARK-19290][SQL] add a new extending interface in Anal...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16645 My main concern of this pr is that if people will think it is recommended to add new batches to force those rules running in a certain ordering. For these resolution rules, we can also use conditions to control when they will fire, right? If we will always replace a logical plan to another one in the analysis phase, seems we should use `resolved` to control if a rule will fired. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16649: [SPARK-19295] [SQL] IsolatedClientLoader's downlo...
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/16649 [SPARK-19295] [SQL] IsolatedClientLoader's downloadVersion should log the location of downloaded metastore client jars ## What changes were proposed in this pull request? This will help the users to know the location of those downloaded jars. ## How was this patch tested? jenkins You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark SPARK-19295 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16649.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16649 commit 6c67582d85473d123053a45aa051578232c32dad Author: Yin Huai <yh...@databricks.com> Date: 2017-01-19T20:30:08Z [SPARK-19295] IsolatedClientLoader's downloadVersion should log the location of downloaded metastore client jars --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16613: [SPARK-19024][SQL] Implement new approach to write a per...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16613 nvm. After second thought, the feature flag does not really buy us anything. We just store the original view definition and the column mapping in the metastore. So, I think it is fine to just do the switch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16628: Update known_translations for contributor names
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16628 I am merging this to master. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14204: [SPARK-16520] [WEBUI] Link executors to corresponding wo...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/14204 ok I agree. Originally, I thought it will be helpful to figure out the worker that an executor belongs to. But, if it does not provide very useful information. I am fine to drop it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16628: Update known_translations for contributor names
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16628 done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16613: [SPARK-19024][SQL] Implement new approach to write a per...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16613 is there a feature flag that is used to determine if we use this new approach? I feel it will be good to have an internal feature flag to determine the code path. So, if there is something wrong that is hard to fix quickly before the release, we can still switch back to the old code path. Then, in the next release, we can remove the feature flag. What do you think? Also, @jiangxb1987 can you take a look at the SQLViewSuite and see if we have enough test coverage? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16517: [SPARK-18243][SQL] Port Hive writing to use FileFormat i...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16517 Looks good to me. @gatorsmile can you explain your concerns? I am wondering what kind of cases that you think HiveFileFormat may not be able to handle. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96566857 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -276,40 +276,31 @@ case class InsertIntoHiveTable( } } -val jobConf = new JobConf(hadoopConf) -val jobConfSer = new SerializableJobConf(jobConf) - -// When speculation is on and output committer class name contains "Direct", we should warn -// users that they may loss data if they are using a direct output committer. -val speculationEnabled = sqlContext.sparkContext.conf.getBoolean("spark.speculation", false) -val outputCommitterClass = jobConf.get("mapred.output.committer.class", "") -if (speculationEnabled && outputCommitterClass.contains("Direct")) { --- End diff -- seems this change is unnecessary and users may still use direct output committer (they can still find the code on Internet). Let's keep the warning. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96566523 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -276,40 +276,31 @@ case class InsertIntoHiveTable( } } -val jobConf = new JobConf(hadoopConf) -val jobConfSer = new SerializableJobConf(jobConf) - -// When speculation is on and output committer class name contains "Direct", we should warn -// users that they may loss data if they are using a direct output committer. -val speculationEnabled = sqlContext.sparkContext.conf.getBoolean("spark.speculation", false) -val outputCommitterClass = jobConf.get("mapred.output.committer.class", "") -if (speculationEnabled && outputCommitterClass.contains("Direct")) { --- End diff -- Do we still need this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96566290 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = { +throw new UnsupportedOperationException(s"inferSchema is not supported for hive data source.") + } + + override def prepareWrite( + sparkSession: SparkSession, + job: Job, + options: Map[String, String], + dataSchema: StructType): OutputWriterFactory = { +val conf = job.getConfiguration +val tableDesc = fileSinkConf.getTableInfo +conf.set("mapred.output.format.class", tableDesc.getOutputFileFormatClassName) + +// Add table properties from storage handler to hadoopConf, so any custom storage +// handler settings can be set to hadoopConf +HiveTableUtil.configureJobPropertiesForStorageHandler(tableDesc, conf, false) +Utilities.copyTableJobPropertiesToConf(tableDesc, conf) --- End diff -- Will tableDesc be null? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96566171 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- I think this version is good for now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96549456 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None --- End diff -- ok. Let's throw an exception at here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16628: Update known_translations for contributor names
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16628 cc @lw-lin --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16628: Update known_translations for contributor names
GitHub user yhuai opened a pull request: https://github.com/apache/spark/pull/16628 Update known_translations for contributor names ## What changes were proposed in this pull request? Update known_translations per https://github.com/apache/spark/pull/16423#issuecomment-269739634 You can merge this pull request into a Git repository by running: $ git pull https://github.com/yhuai/spark known_translations Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16628.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16628 commit bf2f532a8b7ef60d1542415c0d6dacd8571e7bef Author: Yin Huai <yh...@databricks.com> Date: 2017-01-18T00:50:08Z Update known_translations for contributor names --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16423: Update known_translations for contributor names and also...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16423 Sure. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316695 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- This rule is in the same batch with PreprocessTableInsertion, right? If so, we cannot guarantee that PreprocessTableInsertion will always fire first for a command before InsertIntoTable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96317272 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None + + override def prepareWrite( + sparkSession: SparkSession, + job: Job, + options: Map[String, String], + dataSchema: StructType): OutputWriterFactory = { +val conf = job.getConfiguration +val tableDesc = fileSinkConf.getTableInfo +conf.set("mapred.output.format.class", tableDesc.getOutputFileFormatClassName) + +// Add table properties from storage handler to hadoopConf, so any custom storage +// handler settings can be set to hadoopConf +HiveTableUtil.configureJobPropertiesForStorageHandler(tableDesc, conf, false) +Utilities.copyTableJobPropertiesToConf(tableDesc, conf) + +// Avoid referencing the outer object. +val fileSinkConfSer = fileSinkConf +new OutputWriterFactory { + private val jobConf = new SerializableJobConf(new JobConf(conf)) + @transient private lazy val outputFormat = + jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]] + + override def getFileExtension(context: TaskAttemptContext): String = { +Utilities.getFileExtension(jobConf.value, fileSinkConfSer.getCompressed, outputFormat) + } + + override def newInstance( + path: String, + dataSchema: StructType, + context: TaskAttemptContext): OutputWriter = { +new HiveOutputWriter(path, fileSinkConfSer, jobConf.value, dataSchema) + } +} --- End diff -- Should we just create a class instead of using an anonymous class? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96317211 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None + + override def prepareWrite( + sparkSession: SparkSession, + job: Job, + options: Map[String, String], + dataSchema: StructType): OutputWriterFactory = { --- End diff -- Want to comment the original source of code in this function? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316863 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- Or, you mean that we use this function to determine if PreprocessTableInsertion has fired? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316302 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala --- @@ -108,35 +108,30 @@ class QueryExecution(val sparkSession: SparkSession, val logical: LogicalPlan) { /** - * Returns the result as a hive compatible sequence of strings. For native commands, the - * execution is simply passed back to Hive. + * Returns the result as a hive compatible sequence of strings. This is for testing only. */ def hiveResultString(): Seq[String] = executedPlan match { case ExecutedCommandExec(desc: DescribeTableCommand) => - SQLExecution.withNewExecutionId(sparkSession, this) { --- End diff -- Explain the reason that `SQLExecution.withNewExecutionId(sparkSession, this)` is not needed? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96316982 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala --- @@ -86,6 +86,47 @@ class DetermineHiveSerde(conf: SQLConf) extends Rule[LogicalPlan] { } } +class HiveAnalysis(session: SparkSession) extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { +case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists) +if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) => + InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists) + +case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) => + // Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde + // tables yet. + if (mode == SaveMode.Append) { +throw new AnalysisException( + "CTAS for hive serde tables does not support append semantics.") + } + + val dbName = tableDesc.identifier.database.getOrElse(session.catalog.currentDatabase) + CreateHiveTableAsSelectCommand( +tableDesc.copy(identifier = tableDesc.identifier.copy(database = Some(dbName))), +query, +mode == SaveMode.Ignore) + } + + /** + * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule + * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to + * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and --- End diff -- Should this function actually be part of the resolved method of InsertIntoTable? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16517: [SPARK-18243][SQL] Port Hive writing to use FileF...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16517#discussion_r96317150 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveFileFormat.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.hive.execution + +import scala.collection.JavaConverters._ + +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.hive.ql.exec.Utilities +import org.apache.hadoop.hive.ql.io.{HiveFileFormatUtils, HiveOutputFormat} +import org.apache.hadoop.hive.serde2.Serializer +import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspectorUtils, StructObjectInspector} +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.ObjectInspectorCopyOption +import org.apache.hadoop.io.Writable +import org.apache.hadoop.mapred.{JobConf, Reporter} +import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext} + +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.execution.datasources.{FileFormat, OutputWriter, OutputWriterFactory} +import org.apache.spark.sql.hive.{HiveInspectors, HiveTableUtil} +import org.apache.spark.sql.hive.HiveShim.{ShimFileSinkDesc => FileSinkDesc} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableJobConf + +/** + * `FileFormat` for writing Hive tables. + * + * TODO: implement the read logic. + */ +class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat { + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = None --- End diff -- Is it safe to return None? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16528: [SPARK-19148][SQL] do not expose the external table conc...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16528 looks good to me. If possible, I'd like to get https://github.com/apache/spark/pull/16528/files#r96314156 reverted. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16528: [SPARK-19148][SQL] do not expose the external tab...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16528#discussion_r96314156 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateOrdering.scala --- @@ -131,17 +131,15 @@ object GenerateOrdering extends CodeGenerator[Seq[SortOrder], Ordering[InternalR return 0; """ }, - foldFunctions = { funCalls => -funCalls.zipWithIndex.map { case (funCall, i) => - val comp = ctx.freshName("comp") - s""" -int $comp = $funCall; -if ($comp != 0) { - return $comp; -} - """ -}.mkString - }) + foldFunctions = _.map { funCall => --- End diff -- How about we revert this change? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16561: [SPARK-18801][SQL][FOLLOWUP] Alias the view with ...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/16561#discussion_r96313495 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala --- @@ -28,22 +28,60 @@ import org.apache.spark.sql.catalyst.rules.Rule */ /** - * Make sure that a view's child plan produces the view's output attributes. We wrap the child - * with a Project and add an alias for each output attribute. The attributes are resolved by - * name. This should be only done after the batch of Resolution, because the view attributes are - * not completely resolved during the batch of Resolution. + * Make sure that a view's child plan produces the view's output attributes. We try to wrap the + * child by: + * 1. Generate the `queryOutput` by: + *1.1. If the query column names are defined, map the column names to attributes in the child + * output by name(This is mostly for handling view queries like SELECT * FROM ..., the + * schema of the referenced table/view may change after the view has been created, so we + * have to save the output of the query to `viewQueryColumnNames`, and restore them during + * view resolution, in this way, we are able to get the correct view column ordering and + * omit the extra columns that we don't require); + *1.2. Else set the child output attributes to `queryOutput`. + * 2. Map the `queryQutput` to view output by index, if the corresponding attributes don't match, + *try to up cast and alias the attribute in `queryOutput` to the attribute in the view output. + * 3. Add a Project over the child, with the new output generated by the previous steps. + * If the view output doesn't have the same number of columns neither with the child output, nor + * with the query column names, throw an AnalysisException. + * + * This should be only done after the batch of Resolution, because the view attributes are not + * completely resolved during the batch of Resolution. */ case class AliasViewChild(conf: CatalystConf) extends Rule[LogicalPlan] { override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators { -case v @ View(_, output, child) if child.resolved => +case v @ View(desc, output, child) if child.resolved && output != child.output => val resolver = conf.resolver - val newOutput = output.map { attr => -val originAttr = findAttributeByName(attr.name, child.output, resolver) -// The dataType of the output attributes may be not the same with that of the view output, -// so we should cast the attribute to the dataType of the view output attribute. If the -// cast can't perform, will throw an AnalysisException. -Alias(Cast(originAttr, attr.dataType), attr.name)(exprId = attr.exprId, - qualifier = attr.qualifier, explicitMetadata = Some(attr.metadata)) + val queryColumnNames = desc.viewQueryColumnNames + val queryOutput = if (queryColumnNames.nonEmpty) { +// If the view output doesn't have the same number of columns with the query column names, +// throw an AnalysisException. +if (output.length != queryColumnNames.length) { + throw new AnalysisException( +s"The view output ${output.mkString("[", ",", "]")} doesn't have the same number of " + + s"columns with the query column names ${queryColumnNames.mkString("[", ",", "]")}") +} +desc.viewQueryColumnNames.map { colName => + findAttributeByName(colName, child.output, resolver) +} + } else { +// For view created before Spark 2.2.0, the view text is already fully qualified, the plan +// output is the same with the view output. +child.output + } + // Map the attributes in the query output to the attributes in the view output by index. + val newOutput = output.zip(queryOutput).map { --- End diff -- Seems we need to check the size of `output` and `queryOutput`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16597: [SPARK-19240][SQL] SET LOCATION is not allowed for table...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16597 I am not sure if it is worth breaking this behavior. If the table is a managed table, it is possible that existing behavior allows users to move a table from one managed place to another managed place (e.g. the location of a database is changed). It is not clear that breaking this behavior can give us any benefit. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16568: [SPARK-18971][Core]Upgrade Netty to 4.0.43.Final
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16568 test this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16233: [SPARK-18801][SQL] Support resolve a nested view
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/16233 @jiangxb1987 Once jira is back, let's create jiras to address follow-up issues (probably you have already done that before jira went down). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org