[GitHub] spark issue #19259: [BACKPORT-2.1][SPARK-19318][SPARK-22041][SQL] Docker tes...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19259 LGTM pending test.
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17819 **[Test build #81867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81867/testReport)** for PR 17819 at commit [`7c38b77`](https://github.com/apache/spark/commit/7c38b77c30747e316328afbe25cc8bff1a51cd40).
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 retest this please.
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17819 Merged build finished. Test FAILed.
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17819 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81864/
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17819 **[Test build #81864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81864/testReport)** for PR 17819 at commit [`7c38b77`](https://github.com/apache/spark/commit/7c38b77c30747e316328afbe25cc8bff1a51cd40).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19259: [BACKPORT-2.1][SPARK-19318][SPARK-22041][SQL] Docker tes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19259 **[Test build #81866 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81866/testReport)** for PR 19259 at commit [`315e389`](https://github.com/apache/spark/commit/315e3898e5350121dbd7bb09ae67eb09730f).
[GitHub] spark issue #12646: [SPARK-14878][SQL] Trim characters string function suppo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/12646 **[Test build #81865 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81865/testReport)** for PR 12646 at commit [`79846bf`](https://github.com/apache/spark/commit/79846bfd86c6265ebe7d14906853bfe2ec0467f3).
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/19256 Thanks @gatorsmile Does the following logic look good to you?
```
if (any dialect's isCascadingTruncateTable returns true)
  return Some(true)
else if (any dialect's isCascadingTruncateTable returns false)
  return Some(false)
else
  // None of the dialects implements this method; return the superclass default value
  return None
```
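A minimal Scala sketch of that proposal (illustrative only; the names follow the `AggregatedDialect` discussed in this thread, but the final change may differ):

```scala
// Sketch: combine the Option[Boolean] answers from all member dialects.
// Any Some(true) wins, then any Some(false); otherwise fall back to None.
override def isCascadingTruncateTable(): Option[Boolean] = {
  val answers = dialects.map(_.isCascadingTruncateTable())
  if (answers.contains(Some(true))) Some(true)
  else if (answers.contains(Some(false))) Some(false)
  else None
}
```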
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r139337054

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/AggregateFieldExtractionPushdown.scala ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, NamedExpression}
+import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Project}
+
+/**
+ * Pushes down aliases to [[expressions.GetStructField]] expressions in an aggregate's grouping and
+ * aggregate expressions into a projection over its children. The original
+ * [[expressions.GetStructField]] expressions are replaced with references to the pushed down
+ * aliases.
+ */
+object AggregateFieldExtractionPushdown extends FieldExtractionPushdown {
+  override def apply(plan: LogicalPlan): LogicalPlan =
+    plan transformDown {
+      case agg @ Aggregate(groupingExpressions, aggregateExpressions, child) =>
+        val expressions = groupingExpressions ++ aggregateExpressions
+        val attributes = AttributeSet(expressions.collect { case att: Attribute => att })
+        val childAttributes = AttributeSet(child.expressions)
+        val fieldExtractors0 =
+          expressions
+            .flatMap(getFieldExtractors)
+            .distinct
+        val fieldExtractors1 =
+          fieldExtractors0
+            .filter(_.collectFirst { case att: Attribute => att }
+              .filter(attributes.contains).isEmpty)
--- End diff --

Why do we need to trim the extractors which contain attributes referred from `groupingExpressions ++ aggregateExpressions`?
[GitHub] spark issue #19234: [SPARK-22010][PySpark] Change fromInternal method of Tim...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19234 LGTM.
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r139336399

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/JoinFieldExtractionPushdown.scala ---
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, NamedExpression}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
+import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan, Project}
+
+/**
+ * Pushes down aliases to [[expressions.GetStructField]] expressions in a projection over a join
+ * and its join condition. The original [[expressions.GetStructField]] expressions are replaced
+ * with references to the pushed down aliases.
+ */
+object JoinFieldExtractionPushdown extends FieldExtractionPushdown {
--- End diff --

This applies generally. But for the non-nested column pruning cases, this seems a burden and makes the query plan more complicated.
[GitHub] spark pull request #19260: [SPARK-22043][PYTHON] Improves error message for ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19260
[GitHub] spark issue #19260: [SPARK-22043][PYTHON] Improves error message for show_pr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19260 Merged to master, branch-2.2 and branch-2.1.
[GitHub] spark issue #19260: [SPARK-22043][PYTHON] Improves error message for show_pr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19260 Thanks @maver1ck, @felixcheung and @viirya.
[GitHub] spark issue #19249: [SPARK-22032][PySpark] Speed up StructType conversion
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19249 Thanks for double-checking @ueshin. Yes, I noticed that too while reviewing it. I decided to merge it as is because I am quite sure of this one: struct type is the root type, this case looks quite common, and it is the author's first contribution. Even though it has a downside, in practice the improvement looked worthwhile. I am also fine with doing this for other types (I am +0 on those).
[GitHub] spark issue #15544: [SPARK-17997] [SQL] Add an aggregation function for coun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15544 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81863/
[GitHub] spark issue #15544: [SPARK-17997] [SQL] Add an aggregation function for coun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15544 Merged build finished. Test FAILed.
[GitHub] spark issue #15544: [SPARK-17997] [SQL] Add an aggregation function for coun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15544 **[Test build #81863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81863/testReport)** for PR 15544 at commit [`0e9a8dd`](https://github.com/apache/spark/commit/0e9a8dd4004c3331271105e4e9a3df963c3bc84b).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19249: [SPARK-22032][PySpark] Speed up StructType conversion
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19249 A late LGTM. Btw, can we use the same idea for `MapType`?
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17819 **[Test build #81864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81864/testReport)** for PR 17819 at commit [`7c38b77`](https://github.com/apache/spark/commit/7c38b77c30747e316328afbe25cc8bff1a51cd40).
[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19230 retest this please
[GitHub] spark issue #15544: [SPARK-17997] [SQL] Add an aggregation function for coun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15544 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81862/
[GitHub] spark issue #15544: [SPARK-17997] [SQL] Add an aggregation function for coun...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15544 Merged build finished. Test FAILed.
[GitHub] spark issue #15544: [SPARK-17997] [SQL] Add an aggregation function for coun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15544 **[Test build #81862 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81862/testReport)** for PR 15544 at commit [`14bf25a`](https://github.com/apache/spark/commit/14bf25a8c3d18b0bdd2c2365b3471f2f3d2a453d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19230 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81861/
[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19230 **[Test build #81861 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81861/testReport)** for PR 19230 at commit [`5ea4e89`](https://github.com/apache/spark/commit/5ea4e8985002dd26922376d672af4db052063909).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19230 Merged build finished. Test FAILed.
[GitHub] spark pull request #19259: [BACKPORT-2.1][SPARK-19318][SPARK-22041][SQL] Doc...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19259#discussion_r139333133

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala ---
@@ -913,6 +913,25 @@ class JDBCSuite extends SparkFunSuite
       "dbtable" -> "t1",
       "numPartitions" -> "10")
     assert(new JDBCOptions(parameters).asConnectionProperties.isEmpty)
-    assert(new JDBCOptions(new CaseInsensitiveMap(parameters)).asConnectionProperties.isEmpty)
+    assert(new JDBCOptions(CaseInsensitiveMap(parameters)).asConnectionProperties.isEmpty)
+  }
+
+  test("SPARK-19318: jdbc data source options should be treated case-insensitive.") {
--- End diff --

Where is the test case `test("SPARK-19318: Connection properties keys should be case-sensitive.")`?
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r139333125

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala ---
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, ProjectionOverSchema, SelectedField}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a
+ * [[ParquetRelation]]. By "Parquet column", we mean a column as defined in the
+ * Parquet format. In Spark SQL, a root-level Parquet column corresponds to a
+ * SQL column, and a nested Parquet column corresponds to a [[StructField]].
+ */
+private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan =
+    plan transformDown {
+      case op @ PhysicalOperation(projects, filters,
+          l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, partitionSchema,
+            dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _)) =>
+        val projectionFields = projects.flatMap(getFields)
+        val filterFields = filters.flatMap(getFields)
+        val requestedFields = (projectionFields ++ filterFields).distinct
+
+        // If [[requestedFields]] includes a proper field, continue. Otherwise,
+        // return [[op]]
+        if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) {
+          val prunedSchema = requestedFields
+            .map { case (field, _) => StructType(Array(field)) }
+            .reduceLeft(_ merge _)
--- End diff --

We should add related tests for such cases.
[GitHub] spark pull request #19259: [BACKPORT-2.1][SPARK-19318][SPARK-22041][SQL] Doc...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19259#discussion_r139333048

--- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala ---
@@ -63,15 +63,40 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLCo
   }

   override def dataPreparation(conn: Connection): Unit = {
-    conn.prepareStatement("CREATE TABLE numerics (b DECIMAL(1), f DECIMAL(3, 2), i DECIMAL(10))").executeUpdate();
+    conn.prepareStatement("CREATE TABLE datetime (id NUMBER(10), d DATE, t TIMESTAMP)")
+      .executeUpdate()
     conn.prepareStatement(
-      "INSERT INTO numerics VALUES (4, 1.23, 99)").executeUpdate();
-    conn.commit();
+      """INSERT INTO datetime VALUES
+        |(1, {d '1991-11-09'}, {ts '1996-01-01 01:23:45'})
+      """.stripMargin.replaceAll("\n", " ")).executeUpdate()
+    conn.commit()
+
+    sql(
+      s"""
+        |CREATE TEMPORARY VIEW datetime
+        |USING org.apache.spark.sql.jdbc
+        |OPTIONS (url '$jdbcUrl', dbTable 'datetime', oracle.jdbc.mapDateToTimestamp 'false')
+      """.stripMargin.replaceAll("\n", " "))
+
+    conn.prepareStatement("CREATE TABLE datetime1 (id NUMBER(10), d DATE, t TIMESTAMP)")
+      .executeUpdate()
+    conn.commit()
+
+    sql(
+      s"""
+        |CREATE TEMPORARY VIEW datetime1
+        |USING org.apache.spark.sql.jdbc
+        |OPTIONS (url '$jdbcUrl', dbTable 'datetime1', oracle.jdbc.mapDateToTimestamp 'false')
+      """.stripMargin.replaceAll("\n", " "))
+
+    conn.prepareStatement("CREATE TABLE numerics (b DECIMAL(1), f DECIMAL(3, 2), i DECIMAL(10))").executeUpdate()
+    conn.prepareStatement(
+      "INSERT INTO numerics VALUES (4, 1.23, 99)").executeUpdate()
+    conn.commit()
   }

-
-  test("SPARK-16625 : Importing Oracle numeric types") {
-    val df = sqlContext.read.jdbc(jdbcUrl, "numerics", new Properties);
+  test("SPARK-16625 : Importing Oracle numeric types") {
+    val df = sqlContext.read.jdbc(jdbcUrl, "numerics", new Properties)
--- End diff --

revert this back
[GitHub] spark pull request #19259: [BACKPORT-2.1][SPARK-19318][SPARK-22041][SQL] Doc...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19259#discussion_r139333045

--- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala ---
@@ -82,7 +107,7 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLCo
     // A value with fractions from DECIMAL(3, 2) is correct:
     assert(row.getDecimal(1).compareTo(BigDecimal.valueOf(1.23)) == 0)
     // A value > Int.MaxValue from DECIMAL(10) is correct:
-    assert(row.getDecimal(2).compareTo(BigDecimal.valueOf(99l)) == 0)
+    assert(row.getDecimal(2).compareTo(BigDecimal.valueOf(99L)) == 0)
--- End diff --

revert this back
[GitHub] spark issue #19196: [SPARK-21977] SinglePartition optimizations break certai...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19196 Merged build finished. Test FAILed.
[GitHub] spark issue #19196: [SPARK-21977] SinglePartition optimizations break certai...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19196 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81860/
[GitHub] spark issue #19196: [SPARK-21977] SinglePartition optimizations break certai...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19196 **[Test build #81860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81860/testReport)** for PR 19196 at commit [`be39125`](https://github.com/apache/spark/commit/be3912552d200fced24f69c436b59a2c24639380).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r139331293

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala ---
@@ -63,9 +74,22 @@ private[parquet] class ParquetReadSupport extends ReadSupport[UnsafeRow] with Lo
       StructType.fromString(schemaString)
     }

-    val parquetRequestedSchema =
+    val clippedParquetSchema =
       ParquetReadSupport.clipParquetSchema(context.getFileSchema, catalystRequestedSchema)

+    val parquetRequestedSchema = if (parquetMrCompatibility) {
+      // Parquet-mr will throw an exception if we try to read a superset of the file's schema.
+      // Therefore, we intersect our clipped schema with the underlying file's schema
--- End diff --

`clipParquetSchema` can add column paths that exist only in the catalyst schema into the parquet schema. I think those columns are useful, but it looks like `intersectParquetGroups` will remove them again?
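For intuition, here is a rough Catalyst-level sketch of such a schema intersection over `StructType`s (a hypothetical helper; the actual `intersectParquetGroups` in this PR operates on Parquet `GroupType`s, not Catalyst types):

```scala
import org.apache.spark.sql.types._

// Keep only the requested fields that also exist in the file schema,
// recursing into struct fields that appear on both sides.
def intersectSchemas(requested: StructType, file: StructType): StructType =
  StructType(requested.fields.flatMap { rf =>
    file.fields.find(_.name == rf.name).map { ff =>
      (rf.dataType, ff.dataType) match {
        case (r: StructType, f: StructType) => rf.copy(dataType = intersectSchemas(r, f))
        case _ => rf
      }
    }
  })
```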
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r139331198

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala ---
@@ -63,9 +74,22 @@ private[parquet] class ParquetReadSupport extends ReadSupport[UnsafeRow] with Lo
       StructType.fromString(schemaString)
     }

-    val parquetRequestedSchema =
+    val clippedParquetSchema =
       ParquetReadSupport.clipParquetSchema(context.getFileSchema, catalystRequestedSchema)

+    val parquetRequestedSchema = if (parquetMrCompatibility) {
+      // Parquet-mr will throw an exception if we try to read a superset of the file's schema.
+      // Therefore, we intersect our clipped schema with the underlying file's schema
+      ParquetReadSupport.intersectParquetGroups(clippedParquetSchema, context.getFileSchema)
+        .map(intersectionGroup =>
+          new MessageType(intersectionGroup.getName, intersectionGroup.getFields))
+        .getOrElse(ParquetSchemaConverter.EMPTY_MESSAGE)
+    } else {
+      // Spark's built-in Parquet reader will throw an exception in some cases if the requested
+      // schema is not the same as the clipped schema
--- End diff --

I think the built-in Parquet reader means the vectorized reader. But I don't think we use `ParquetReadSupport` in the vectorized reader?
[GitHub] spark pull request #19164: [SPARK-21953] Show both memory and disk bytes spi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19164
[GitHub] spark issue #19221: [SPARK-4131] Merge HiveTmpFile.scala to SaveAsHiveFile.s...
Github user aoxiangcao commented on the issue: https://github.com/apache/spark/pull/19221 I'm sorry if this is not the right place to ask, but I have a question about using `INSERT OVERWRITE DIRECTORY '/user/appUser/test'`. First, I start the thrift server as user 'hdfs' and log into beeline as user 'appUser'. Then, when I run the `INSERT OVERWRITE DIRECTORY` statement, the job cannot move data from the staging folder to the destination folder: the result files are owned by 'hdfs' while the folder '/user/appUser/test' is owned by 'appUser', so the move fails with permission denied. Is something wrong?
[GitHub] spark issue #19164: [SPARK-21953] Show both memory and disk bytes spilled if...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19164 LGTM, merging to master/2.2/2.1!
[GitHub] spark issue #15544: [SPARK-17997] [SQL] Add an aggregation function for coun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15544 **[Test build #81863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81863/testReport)** for PR 15544 at commit [`0e9a8dd`](https://github.com/apache/spark/commit/0e9a8dd4004c3331271105e4e9a3df963c3bc84b).
[GitHub] spark issue #15544: [SPARK-17997] [SQL] Add an aggregation function for coun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15544 **[Test build #81862 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81862/testReport)** for PR 15544 at commit [`14bf25a`](https://github.com/apache/spark/commit/14bf25a8c3d18b0bdd2c2365b3471f2f3d2a453d).
[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19230 **[Test build #81861 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81861/testReport)** for PR 19230 at commit [`5ea4e89`](https://github.com/apache/spark/commit/5ea4e8985002dd26922376d672af4db052063909).
[GitHub] spark issue #19230: [SPARK-22003][SQL] support array column in vectorized re...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19230 retest this please
[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19230#discussion_r139329522

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
@@ -16,6 +16,7 @@
  */
 package org.apache.spark.sql.execution.vectorized;

+import org.apache.spark.api.java.function.Function;
--- End diff --

oops, reverted it.
[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...
Github user liufengdb commented on a diff in the pull request: https://github.com/apache/spark/pull/19230#discussion_r139329523

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnVectorSuite.scala ---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class ColumnVectorSuite extends SparkFunSuite with BeforeAndAfterEach {
+
+  var testVector: WritableColumnVector = _
+
+  private def allocate(capacity: Int, dt: DataType): WritableColumnVector = {
+    new OnHeapColumnVector(capacity, dt)
+  }
+
+  override def afterEach(): Unit = {
+    testVector.close()
+  }
+
+  test("boolean") {
+    testVector = allocate(10, BooleanType)
+    (0 until 10).foreach { i =>
+      testVector.appendBoolean(i % 2 == 0)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, BooleanType) === (i % 2 == 0))
+    }
+  }
+
+  test("byte") {
+    testVector = allocate(10, ByteType)
+    (0 until 10).foreach { i =>
+      testVector.appendByte(i.toByte)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, ByteType) === (i.toByte))
+    }
+  }
+
+  test("short") {
+    testVector = allocate(10, ShortType)
+    (0 until 10).foreach { i =>
+      testVector.appendShort(i.toShort)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, ShortType) === (i.toShort))
+    }
+  }
+
+  test("int") {
+    testVector = allocate(10, IntegerType)
+    (0 until 10).foreach { i =>
+      testVector.appendInt(i)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, IntegerType) === i)
+    }
+  }
+
+  test("long") {
+    testVector = allocate(10, LongType)
+    (0 until 10).foreach { i =>
+      testVector.appendLong(i)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, LongType) === i)
+    }
+  }
+
+  test("float") {
+    testVector = allocate(10, FloatType)
+    (0 until 10).foreach { i =>
+      testVector.appendFloat(i.toFloat)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, FloatType) === i.toFloat)
+    }
+  }
+
+  test("double") {
+    testVector = allocate(10, DoubleType)
+    (0 until 10).foreach { i =>
+      testVector.appendDouble(i.toDouble)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, DoubleType) === i.toDouble)
+    }
+  }
+
+  test("string") {
+    testVector = allocate(10, StringType)
+    (0 until 10).map { i =>
+      val utf8 = s"str$i".getBytes("utf8")
+      testVector.appendByteArray(utf8, 0, utf8.length)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, StringType) === UTF8String.fromString(s"str$i"))
+    }
+  }
+
+  test("binary") {
+    testVector = allocate(10, BinaryType)
+    (0 until 10).map { i =>
+      val utf8 = s"str$i".getBytes("utf8")
+      testVector.appendByteArray(utf8, 0, utf8.length)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
[GitHub] spark issue #19260: [SPARK-22043][PYTHON] Improves error message for show_pr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19260 LGTM
[GitHub] spark issue #19196: [SPARK-21977] SinglePartition optimizations break certai...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19196 **[Test build #81860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81860/testReport)** for PR 19196 at commit [`be39125`](https://github.com/apache/spark/commit/be3912552d200fced24f69c436b59a2c24639380).
[GitHub] spark pull request #15544: [SPARK-17997] [SQL] Add an aggregation function f...
Github user wzhfy commented on a diff in the pull request: https://github.com/apache/spark/pull/15544#discussion_r139327658

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.aggregate
+
+import java.util
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
+import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, ExpectsInputTypes, Expression, ExpressionDescription}
+import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData, HyperLogLogPlusPlusHelper}
+import org.apache.spark.sql.types._
+
+/**
+ * This function counts the approximate number of distinct values (ndv) in
+ * intervals constructed from endpoints specified in `endpointsExpression`. The endpoints will be
+ * sorted into ascending order. To count ndv's in these intervals, apply the HyperLogLogPlusPlus
+ * algorithm in each of them.
+ * @param child to estimate the ndv's of.
+ * @param endpointsExpression to construct the intervals.
+ * @param relativeSD The maximum estimation error allowed in the HyperLogLogPlusPlus algorithm.
+ */
+@ExpressionDescription(
+  usage = """
+    _FUNC_(col, array(endpoint_1, endpoint_2, ... endpoint_N)) - Returns the approximate
+      number of distinct values (ndv) for intervals [endpoint_1, endpoint_2],
+      (endpoint_2, endpoint_3], ... (endpoint_N-1, endpoint_N].
+
+    _FUNC_(col, array(endpoint_1, endpoint_2, ... endpoint_N), relativeSD=0.05) - Returns
+      the approximate number of distinct values (ndv) for intervals with relativeSD, the maximum
+      estimation error allowed in the HyperLogLogPlusPlus algorithm.
+  """,
+  extended = """
+    Examples:
+      > SELECT approx_count_distinct_for_intervals(10.0, array(5, 15, 25), 0.01);
+       [1, 0]
+  """)
+case class ApproxCountDistinctForIntervals(
+    child: Expression,
+    endpointsExpression: Expression,
+    relativeSD: Double = 0.05,
+    mutableAggBufferOffset: Int = 0,
+    inputAggBufferOffset: Int = 0)
+  extends ImperativeAggregate with ExpectsInputTypes {
+
+  def this(child: Expression, endpointsExpression: Expression) = {
+    this(
+      child = child,
+      endpointsExpression = endpointsExpression,
+      relativeSD = 0.05,
+      mutableAggBufferOffset = 0,
+      inputAggBufferOffset = 0)
+  }
+
+  def this(child: Expression, endpointsExpression: Expression, relativeSD: Expression) = {
+    this(
+      child = child,
+      endpointsExpression = endpointsExpression,
+      relativeSD = HyperLogLogPlusPlus.validateDoubleLiteral(relativeSD),
+      mutableAggBufferOffset = 0,
+      inputAggBufferOffset = 0)
+  }
+
+  override def inputTypes: Seq[AbstractDataType] = {
+    Seq(TypeCollection(NumericType, TimestampType, DateType), ArrayType)
+  }
+
+  // Mark as lazy so that endpointsExpression is not evaluated during tree transformation.
+  lazy val endpoints: Array[Double] = {
+    val doubleArray = (endpointsExpression.dataType, endpointsExpression.eval()) match {
+      case (ArrayType(baseType: NumericType, _), arrayData: ArrayData) =>
+        val numericArray = arrayData.toObjectArray(baseType)
+        numericArray.map { x =>
+          baseType.numeric.toDouble(x.asInstanceOf[baseType.InternalType])
+        }
+    }
+    util.Arrays.sort(doubleArray)
+    doubleArray
+  }
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    val defaultCheck = super.checkInputDataTypes()
+    if (defaultCheck.isFailure) {
+      defaultCheck
+    } else if
[GitHub] spark issue #19252: [SPARK-21969][SQL] CommandUtils.updateTableStats should ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19252 This is not a bug. We just follow [the behavior of Hive's dynamic partition insert](https://docs.databricks.com/spark/latest/spark-sql/language-manual/insert.html#dynamic-partition-inserts).

> The dynamic partition columns must be specified last in both part_spec and the input result set (of the row value lists or the select query). They are resolved by position, instead of by names. Thus, the orders must be exactly matched.
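To make the by-position resolution concrete, a small sketch (hypothetical session; it mirrors the example quoted elsewhere in this thread):

```scala
// With dynamic partition inserts, partition columns come last in the input
// result set and are resolved by position, not by name.
spark.sql("CREATE TABLE t2 (col1 INT, col2 INT) USING PARQUET PARTITIONED BY (col1)")

// SELECT 1, 2  =>  col2 = 1 (first non-partition column), col1 = 2 (partition value)
spark.sql("INSERT INTO TABLE t2 SELECT 1, 2")
```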
[GitHub] spark pull request #19258: add MockNetCat
Github user bluejoe2008 closed the pull request at: https://github.com/apache/spark/pull/19258
[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19230#discussion_r139324826

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnVectorSuite.scala ---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.vectorized
+
+import org.scalatest.BeforeAndAfterEach
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.util.ArrayData
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
+
+class ColumnVectorSuite extends SparkFunSuite with BeforeAndAfterEach {
+
+  var testVector: WritableColumnVector = _
+
+  private def allocate(capacity: Int, dt: DataType): WritableColumnVector = {
+    new OnHeapColumnVector(capacity, dt)
+  }
+
+  override def afterEach(): Unit = {
+    testVector.close()
+  }
+
+  test("boolean") {
+    testVector = allocate(10, BooleanType)
+    (0 until 10).foreach { i =>
+      testVector.appendBoolean(i % 2 == 0)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, BooleanType) === (i % 2 == 0))
+    }
+  }
+
+  test("byte") {
+    testVector = allocate(10, ByteType)
+    (0 until 10).foreach { i =>
+      testVector.appendByte(i.toByte)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, ByteType) === (i.toByte))
+    }
+  }
+
+  test("short") {
+    testVector = allocate(10, ShortType)
+    (0 until 10).foreach { i =>
+      testVector.appendShort(i.toShort)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, ShortType) === (i.toShort))
+    }
+  }
+
+  test("int") {
+    testVector = allocate(10, IntegerType)
+    (0 until 10).foreach { i =>
+      testVector.appendInt(i)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, IntegerType) === i)
+    }
+  }
+
+  test("long") {
+    testVector = allocate(10, LongType)
+    (0 until 10).foreach { i =>
+      testVector.appendLong(i)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, LongType) === i)
+    }
+  }
+
+  test("float") {
+    testVector = allocate(10, FloatType)
+    (0 until 10).foreach { i =>
+      testVector.appendFloat(i.toFloat)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, FloatType) === i.toFloat)
+    }
+  }
+
+  test("double") {
+    testVector = allocate(10, DoubleType)
+    (0 until 10).foreach { i =>
+      testVector.appendDouble(i.toDouble)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, DoubleType) === i.toDouble)
+    }
+  }
+
+  test("string") {
+    testVector = allocate(10, StringType)
+    (0 until 10).map { i =>
+      val utf8 = s"str$i".getBytes("utf8")
+      testVector.appendByteArray(utf8, 0, utf8.length)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0 until 10).foreach { i =>
+      assert(array.get(i, StringType) === UTF8String.fromString(s"str$i"))
+    }
+  }
+
+  test("binary") {
+    testVector = allocate(10, BinaryType)
+    (0 until 10).map { i =>
+      val utf8 = s"str$i".getBytes("utf8")
+      testVector.appendByteArray(utf8, 0, utf8.length)
+    }
+
+    val array = new ColumnVector.Array(testVector)
+
+    (0
[GitHub] spark issue #19259: [BACKPORT-2.1][SPARK-19318][SPARK-22041][SQL] Docker tes...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/19259 @gatorsmile Yes, Docker integration tests passed.
[GitHub] spark issue #19263: Optionally add block updates to log
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19263 Can one of the admins verify this patch?
[GitHub] spark pull request #19263: Optionally add block updates to log
GitHub user michaelmior opened a pull request: https://github.com/apache/spark/pull/19263

Optionally add block updates to log

I see that block updates are not logged to the event log. This makes sense as a default for performance reasons. However, I find it helpful when trying to get a better understanding of caching for a job to be able to log these updates. This PR adds a configuration setting `spark.eventLog.blockUpdates` (defaulting to false) which allows block updates to be recorded in the log.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/michaelmior/spark log-block-updates

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19263.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19263

commit 2a967c368cf72316b760c5f17503658aa12b901d
Author: Michael Mior
Date: 2017-07-12T13:32:24Z

    Optionally add block updates to log
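A sketch of how such a setting would be enabled, assuming the `spark.eventLog.blockUpdates` name from the PR description (the name finally merged may differ) and an assumed event-log directory:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: turn on event logging plus the proposed block-update logging flag.
val conf = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-logs") // assumed log location
  .set("spark.eventLog.blockUpdates", "true")      // flag proposed in this PR

val spark = SparkSession.builder.config(conf).appName("demo").getOrCreate()
```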
[GitHub] spark issue #19218: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19218 Sorry, guys. I've been away from keyboard since last Friday night. I'll be back on next Tuesday (PST).
[GitHub] spark issue #19252: [SPARK-21969][SQL] CommandUtils.updateTableStats should ...
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/19252 @gatorsmile thanks for the feedback. I also covered ``TruncateTableCommand`` with additional tests. However, I see a bit strange behavior while creating a test for ``AlterTableAddPartitionCommand``.
```
sql(s"CREATE TABLE t1 (col1 int, col2 int) USING PARQUET")
sql(s"INSERT INTO TABLE t1 SELECT 1, 2")
sql(s"INSERT INTO TABLE t1 SELECT 2, 4")
sql("SELECT * FROM t1").show()

+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   2|   4|
+----+----+

sql(s"CREATE TABLE t2 (col1 int, col2 int) USING PARQUET PARTITIONED BY (col1)")
sql(s"INSERT INTO TABLE t2 SELECT 1, 2")
sql(s"INSERT INTO TABLE t2 SELECT 2, 4")
sql("SELECT * FROM t2").show()

+----+----+
|col2|col1|
+----+----+
|   2|   4|
|   1|   2|
+----+----+
```
Why are the results different? Is it a bug?
[GitHub] spark issue #19243: [SPARK-21780][R] Simpler Dataset.sample API in R
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/19243 Thinking about this, I wonder if it is more common in R to skip parameters with default values and pass the remaining parameters by name, like `sample(df, fraction=1.0)`.
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19256 Let me correct what I said above. The logic should be:

> If any dialect's `isCascadingTruncateTable` returns `true`, we should return `true`.
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/19256 Thanks @gatorsmile I will change both the implementation and the PR title.
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19256 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81859/
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19256 **[Test build #81859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81859/testReport)** for PR 19256 at commit [`7b9d6fc`](https://github.com/apache/spark/commit/7b9d6fcec61cd72ffb257abf17eb0e9d6c462f9e).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19256 Merged build finished. Test FAILed.
[GitHub] spark issue #19252: [SPARK-21969][SQL] CommandUtils.updateTableStats should ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19252 Actually, the right fix should add `refreshTable(identifier)` to the SessionCatalog's [alterTableStats API](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala#L379).
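A rough sketch of that suggestion (illustrative only; the existing body of `alterTableStats` is elided):

```scala
// In SessionCatalog (sketch): after persisting new stats, invalidate the cached
// relation so subsequent plans pick up fresh statistics instead of stale ones.
def alterTableStats(identifier: TableIdentifier, newStats: Option[CatalogStatistics]): Unit = {
  // ... existing validation and externalCatalog.alterTableStats(...) calls elided ...
  refreshTable(identifier) // invalidate the table relation cache
}
```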
[GitHub] spark pull request #19252: [SPARK-21969][SQL] CommandUtils.updateTableStats ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19252#discussion_r139317292

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala ---
@@ -44,6 +44,7 @@ object CommandUtils extends Logging {
     } else {
       catalog.alterTableStats(table.identifier, None)
     }
+    catalog.refreshTable(table.identifier)
--- End diff --

Add a comment above this line:

> Invalidate the table relation cache
[GitHub] spark issue #19260: [SPARK-22043][PYTHON] Improves error message for show_pr...
Github user maver1ck commented on the issue: https://github.com/apache/spark/pull/19260 LGTM
[GitHub] spark issue #19252: [SPARK-21969][SQL] CommandUtils.updateTableStats should ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19252 `TruncateTableCommand` also has similar issues. Could you also fix it in this PR?
[GitHub] spark issue #19260: [SPARK-22043][PYTHON] Improves error message for show_pr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19260 cc @maver1ck and @viirya, who I believe recently ran the Python profiler. Could you take a look please when you have some time?
[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19226 Merged to master, branch-2.2 and branch-2.1.
[GitHub] spark pull request #19226: [SPARK-21985][PySpark] PairDeserializer is broken...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19226
[GitHub] spark pull request #19249: [SPARK-22032][PySpark] Speed up StructType conver...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/19249
[GitHub] spark issue #19249: [SPARK-22032][PySpark] Speed up StructType conversion
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19249 Merged to master.
[GitHub] spark pull request #19230: [SPARK-22003][SQL] support array column in vector...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19230#discussion_r139316092

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
@@ -16,6 +16,7 @@
  */
 package org.apache.spark.sql.execution.vectorized;

+import org.apache.spark.api.java.function.Function;
--- End diff --

Please revert it back.
[GitHub] spark issue #19259: [BACKPORT-2.1][SPARK-19318][SPARK-22041][SQL] Docker tes...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19259 Have you run the Docker test? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19261: [SPARK-22040] Add current_date function with timezone id
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19261 Does any other database have such an interface? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19256 BTW, could you update the PR title? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19256: [SPARK-21338][SQL]implement isCascadingTruncateTa...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19256#discussion_r139315660

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/AggregatedDialect.scala ---
@@ -41,4 +41,8 @@ private class AggregatedDialect(dialects: List[JdbcDialect]) extends JdbcDialect
   override def getJDBCType(dt: DataType): Option[JdbcType] = {
     dialects.flatMap(_.getJDBCType(dt)).headOption
   }
+
+  override def isCascadingTruncateTable(): Option[Boolean] = {
+    dialects.flatMap(_.isCascadingTruncateTable).headOption
--- End diff --

This one is different from `getCatalystType` and `getJDBCType`. Instead, we should follow [canHandle](https://github.com/huaxingao/spark/blob/7b9d6fcec61cd72ffb257abf17eb0e9d6c462f9e/sql/core/src/main/scala/org/apache/spark/sql/jdbc/AggregatedDialect.scala#L33-L34).

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
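For reference, `AggregatedDialect.canHandle` combines the answers from all dialects instead of taking the first one. A hedged, self-contained sketch of what that style could mean here, as one possible reading rather than the final merged behavior:

```scala
// Hedged sketch (not the final merged code): combine every dialect's answer,
// in the spirit of canHandle, instead of keeping only the first with headOption.
def aggregateCascading(answers: Seq[Option[Boolean]]): Option[Boolean] =
  answers.flatten match {
    case Seq() => None                       // no dialect has an opinion
    case xs    => Some(xs.exists(identity))  // cascades if any dialect says true
  }

// e.g. aggregateCascading(Seq(None, Some(false)))        == Some(false)
//      aggregateCascading(Seq(Some(false), Some(true)))  == Some(true)
```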
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19256 **[Test build #81859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81859/testReport)** for PR 19256 at commit [`7b9d6fc`](https://github.com/apache/spark/commit/7b9d6fcec61cd72ffb257abf17eb0e9d6c462f9e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19256: [SPARK-21338][SQL]implement isCascadingTruncateTable() m...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19256 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19204: [SPARK-21981][PYTHON][ML] Added Python interface for Clu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19204 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19204: [SPARK-21981][PYTHON][ML] Added Python interface for Clu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19204 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81858/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19204: [SPARK-21981][PYTHON][ML] Added Python interface for Clu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19204 **[Test build #81858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81858/testReport)** for PR 19204 at commit [`a980e6b`](https://github.com/apache/spark/commit/a980e6bfbbeb1da9ce674d69b9cb537b7204459a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19204: [SPARK-21981][PYTHON][ML] Added Python interface for Clu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19204 **[Test build #81858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81858/testReport)** for PR 19204 at commit [`a980e6b`](https://github.com/apache/spark/commit/a980e6bfbbeb1da9ce674d69b9cb537b7204459a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19226 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19226 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81856/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19226 **[Test build #81856 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81856/testReport)** for PR 19226 at commit [`5282ee5`](https://github.com/apache/spark/commit/5282ee5a8624d313ec5995db4d772ec1170ea049). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19262: [MINOR][ML] Remove unnecessary default value setting for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19262 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19262: [MINOR][ML] Remove unnecessary default value setting for...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19262 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81857/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19262: [MINOR][ML] Remove unnecessary default value setting for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19262 **[Test build #81857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81857/testReport)** for PR 19262 at commit [`84c6c9e`](https://github.com/apache/spark/commit/84c6c9e394939fa97a6d3888547ac269b234a2c0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19256: [SPARK-21338][SQL]implement isCascadingTruncateTa...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/19256#discussion_r139313120

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/jdbc/AggregatedDialect.scala ---
@@ -41,4 +41,8 @@ private class AggregatedDialect(dialects: List[JdbcDialect]) extends JdbcDialect
   override def getJDBCType(dt: DataType): Option[JdbcType] = {
     dialects.flatMap(_.getJDBCType(dt)).headOption
   }
+
+  override def isCascadingTruncateTable(): Option[Boolean] = {
+    dialects.flatMap(_.isCascadingTruncateTable).headOption
--- End diff --

Both `getCatalystType` and `getJDBCType(dt: DataType)` use the first one. Also, the class header has the following:

```
/**
 * AggregatedDialect can unify multiple dialects into one virtual Dialect.
 * Dialects are tried in order, and the first dialect that does not return a
 * neutral element will will.
 *
 * @param dialects List of dialects.
 */
```

It has a typo; I guess it means "the first dialect that does not return a neutral element will return".

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
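A quick self-contained illustration of the `headOption` semantics under discussion: `flatMap` drops the neutral `None`s and `headOption` keeps the first concrete answer, in dialect order.

```scala
// "Neutral element" here is None; the first non-neutral element wins.
val answers: List[Option[Boolean]] = List(None, Some(false), Some(true))
answers.flatten.headOption  // Some(false)
```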
[GitHub] spark pull request #18193: [SPARK-15616] [SQL] CatalogRelation should fallba...
Github user cenyuhai commented on a diff in the pull request: https://github.com/apache/spark/pull/18193#discussion_r139312866

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -139,6 +138,54 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
 }
 
+/**
+ *
+ * TODO: merge this with PruneFileSourcePartitions after we completely make hive as a data source.
+ */
+case class PruneHiveTablePartitions(
+    session: SparkSession) extends Rule[LogicalPlan] with PredicateHelper {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
+    case filter @ Filter(condition, relation: HiveTableRelation) if relation.isPartitioned =>
+      val predicates = splitConjunctivePredicates(condition)
+      val normalizedFilters = predicates.map { e =>
+        e transform {
+          case a: AttributeReference =>
+            a.withName(relation.output.find(_.semanticEquals(a)).get.name)
+        }
+      }
+      val partitionSet = AttributeSet(relation.partitionCols)
+      val pruningPredicates = normalizedFilters.filter { predicate =>
+        !predicate.references.isEmpty &&
+          predicate.references.subsetOf(partitionSet)
+      }
+      if (pruningPredicates.nonEmpty && session.sessionState.conf.fallBackToHdfsForStatsEnabled &&
+          session.sessionState.conf.metastorePartitionPruning) {
+        val prunedPartitions = session.sharedState.externalCatalog.listPartitionsByFilter(
+          relation.tableMeta.database,
+          relation.tableMeta.identifier.table,
+          pruningPredicates,
+          session.sessionState.conf.sessionLocalTimeZone)
+        val sizeInBytes = try {
+          prunedPartitions.map { part =>
+            CommandUtils.calculateLocationSize(
--- End diff --

I think we should first check whether `partition.parameters` contains `StatsSetupConst.RAW_DATA_SIZE` or `StatsSetupConst.TOTAL_SIZE`. If `partition.parameters` contains the size of the partition, use it instead of getting the `ContentSummary` from HDFS.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
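For what it's worth, a hedged sketch of the suggested ordering, written against the names in the diff above. `StatsSetupConst` is Hive's `org.apache.hadoop.hive.common.StatsSetupConst`, and the `calculateLocationSize` call shape is assumed from the diff, so both may differ from the final code:

```scala
import org.apache.hadoop.hive.common.StatsSetupConst

// Prefer the sizes Hive already recorded in the partition parameters; only
// fall back to scanning HDFS when neither statistic is present.
val sizeInBytes = prunedPartitions.map { part =>
  val params = part.parameters
  params.get(StatsSetupConst.TOTAL_SIZE)
    .orElse(params.get(StatsSetupConst.RAW_DATA_SIZE))
    .map(_.toLong)
    .getOrElse(CommandUtils.calculateLocationSize(
      session.sessionState, relation.tableMeta.identifier, part.storage.locationUri))
}.sum
```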
[GitHub] spark issue #19262: [MINOR][ML] Remove unnecessary default value setting for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19262 **[Test build #81857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81857/testReport)** for PR 19262 at commit [`84c6c9e`](https://github.com/apache/spark/commit/84c6c9e394939fa97a6d3888547ac269b234a2c0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19204#discussion_r139312695

--- Diff: python/pyspark/ml/evaluation.py ---
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
         kwargs = self._input_kwargs
         return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+                          JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Evaluator for Clustering results, which expects two input
+    columns: prediction and features.
+
+    >>> from sklearn import datasets
+    >>> from pyspark.sql.types import *
+    >>> from pyspark.ml.linalg import Vectors, VectorUDT
+    >>> from pyspark.ml.evaluation import ClusteringEvaluator
+    ...
+    >>> iris = datasets.load_iris()
+    >>> iris_rows = [(Vectors.dense(x), int(iris.target[i]))
+    ...              for i, x in enumerate(iris.data)]
+    >>> schema = StructType([
+    ...     StructField("features", VectorUDT(), True),
+    ...     StructField("cluster_id", IntegerType(), True)])
+    >>> rdd = spark.sparkContext.parallelize(iris_rows)
+    >>> dataset = spark.createDataFrame(rdd, schema)
+    ...
+    >>> evaluator = ClusteringEvaluator(predictionCol="cluster_id")
+    >>> evaluator.evaluate(dataset)
+    0.656...
+    >>> ce_path = temp_path + "/ce"
+    >>> evaluator.save(ce_path)
+    >>> evaluator2 = ClusteringEvaluator.load(ce_path)
+    >>> str(evaluator2.getPredictionCol())
+    'cluster_id'
+
+    .. versionadded:: 2.3.0
+    """
+    metricName = Param(Params._dummy(), "metricName",
+                       "metric name in evaluation (silhouette)",
+                       typeConverter=TypeConverters.toString)
+
+    @keyword_only
+    def __init__(self, predictionCol="prediction", featuresCol="features",
+                 metricName="silhouette"):
+        """
+        __init__(self, predictionCol="prediction", featuresCol="features", \
+                 metricName="silhouette")
+        """
+        super(ClusteringEvaluator, self).__init__()
+        self._java_obj = self._new_java_obj(
+            "org.apache.spark.ml.evaluation.ClusteringEvaluator", self.uid)
+        self._setDefault(predictionCol="prediction", featuresCol="features",
--- End diff --

I sent #19262 to fix the same issue for the other evaluators; please feel free to comment. Thanks.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19226 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r139312657

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/QueryPlanConstraints.scala ---
@@ -77,20 +77,21 @@ trait QueryPlanConstraints { self: LogicalPlan =>
     constraint match {
       // When the root is IsNotNull, we can push IsNotNull through the child null intolerant
       // expressions
-      case IsNotNull(expr) => scanNullIntolerantAttribute(expr).map(IsNotNull(_))
+      case IsNotNull(expr) => scanNullIntolerantField(expr).map(IsNotNull(_))
       // Constraints always return true for all the inputs. That means, null will never be returned.
       // Thus, we can infer `IsNotNull(constraint)`, and also push IsNotNull through the child
       // null intolerant expressions.
-      case _ => scanNullIntolerantAttribute(constraint).map(IsNotNull(_))
+      case _ => scanNullIntolerantField(constraint).map(IsNotNull(_))
--- End diff --

Previously, IsNotNull constraints were Attribute-specific; why do we need to expand them to `ExtractValue`?

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
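To make the question concrete, a hypothetical illustration (assuming an active SparkSession `spark`; the exact constraints inferred depend on the active rule set, so treat this as a sketch rather than verified optimizer output): with Attribute-only scanning, a null-intolerant predicate on a nested field yields `IsNotNull` only for the root column, while widening the scan to `ExtractValue` additionally yields a not-null constraint on the field itself, which a pruned nested read can still use.

```scala
// Hypothetical sketch; inspect the optimized plan to see inferred constraints.
val df = spark.range(3).selectExpr("named_struct('f1', id) AS a")

// `a.f1 > 0` is null-intolerant. Scanning only Attributes infers IsNotNull(a);
// scanning through ExtractValue would also infer IsNotNull(a.f1).
df.filter("a.f1 > 0").queryExecution.optimizedPlan
```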
[GitHub] spark issue #19226: [SPARK-21985][PySpark] PairDeserializer is broken for do...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19226 **[Test build #81856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81856/testReport)** for PR 19226 at commit [`5282ee5`](https://github.com/apache/spark/commit/5282ee5a8624d313ec5995db4d772ec1170ea049). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19262: [MINOR][ML] Remove unnecessary default value sett...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/19262

[MINOR][ML] Remove unnecessary default value setting for evaluators.

## What changes were proposed in this pull request?

Remove unnecessary default value setting for all evaluators, as we have set them in corresponding _HasXXX_ base classes.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark evaluation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19262.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19262

commit 84c6c9e394939fa97a6d3888547ac269b234a2c0
Author: Yanbo Liang
Date:   2017-09-17T14:48:35Z

    Remove unnecessary default value setting for evaluators.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
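To see why the removed calls are redundant, here is a self-contained Scala analogue of the mixin pattern (hypothetical minimal types, not Spark's actual `Params` machinery): once a shared base trait installs a default, repeating it in each concrete class is dead code.

```scala
// Hypothetical minimal analogue of the shared HasXXX param mixins.
trait HasPredictionCol {
  private var defaults = Map.empty[String, Any]
  protected def setDefault(name: String, value: Any): Unit = {
    defaults += name -> value
  }
  def getDefault(name: String): Option[Any] = defaults.get(name)
  setDefault("predictionCol", "prediction") // the default is installed here, once
}

class SomeEvaluator extends HasPredictionCol {
  // No setDefault("predictionCol", ...) needed here -- the mixin already set it.
}

object Demo extends App {
  println(new SomeEvaluator().getDefault("predictionCol")) // Some(prediction)
}
```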
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user vkhristenko commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r139312613

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala ---
@@ -0,0 +1,130 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.planning.{PhysicalOperation, ProjectionOverSchema, SelectedField}
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
+import org.apache.spark.sql.types.{StructField, StructType}
+
+/**
+ * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a
+ * [[ParquetRelation]]. By "Parquet column", we mean a column as defined in the
+ * Parquet format. In Spark SQL, a root-level Parquet column corresponds to a
+ * SQL column, and a nested Parquet column corresponds to a [[StructField]].
+ */
+private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan =
+    plan transformDown {
+      case op @ PhysicalOperation(projects, filters,
+          l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, partitionSchema,
+            dataSchema, _, parquetFormat: ParquetFileFormat, _), _, _)) =>
+        val projectionFields = projects.flatMap(getFields)
+        val filterFields = filters.flatMap(getFields)
+        val requestedFields = (projectionFields ++ filterFields).distinct
+
+        // If [[requestedFields]] includes a proper field, continue. Otherwise,
+        // return [[op]]
+        if (requestedFields.exists { case (_, optAtt) => optAtt.isEmpty }) {
+          val prunedSchema = requestedFields
+            .map { case (field, _) => StructType(Array(field)) }
+            .reduceLeft(_ merge _)
--- End diff --

@viirya @mallman If I may add here: given the schema

```
root
 | - a: StructType
 | | - f1: Int
 | | - f2: Int
```

a selection `df.select("a.f1", "a.f2")` vs. `df.select("a.f2", "a.f1")` will produce different `requiredSchema` fed into `buildReader` upon some action. In the first case it will be

```
StructType(StructField("a",
  StructType(StructField("f1") :: StructField("f2") :: Nil)) :: Nil)
```

and in the second

```
StructType(StructField("a",
  StructType(StructField("f2") :: StructField("f1") :: Nil)) :: Nil)
```

which means that the original schema is different from the one required in the second case. But as long as fields f1 and f2 are split (each can be read without reading the other), this is remappable at the data source level.
VK --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
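A hedged repro of the ordering point (assuming an active SparkSession `spark`): the same two nested fields come back in select order, so a reader that prunes nested columns must match requested fields to the file schema by name rather than by position.

```scala
// The two selects request the same fields in different orders; the output
// schema follows the select order, not the layout of the underlying struct.
val data = spark.range(1).selectExpr("named_struct('f1', 1, 'f2', 2) AS a")
data.select("a.f1", "a.f2").printSchema() // f1 listed before f2
data.select("a.f2", "a.f1").printSchema() // f2 listed before f1
```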
[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19204#discussion_r139312388

--- Diff: python/pyspark/ml/evaluation.py ---
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
         kwargs = self._input_kwargs
         return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+                          JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Evaluator for Clustering results, which expects two input
+    columns: prediction and features.
+
+    >>> from sklearn import datasets
+    >>> from pyspark.sql.types import *
+    >>> from pyspark.ml.linalg import Vectors, VectorUDT
+    >>> from pyspark.ml.evaluation import ClusteringEvaluator
+    ...
+    >>> iris = datasets.load_iris()
+    >>> iris_rows = [(Vectors.dense(x), int(iris.target[i]))
+    ...              for i, x in enumerate(iris.data)]
+    >>> schema = StructType([
+    ...     StructField("features", VectorUDT(), True),
+    ...     StructField("cluster_id", IntegerType(), True)])
--- End diff --

`cluster_id` -> `prediction`, to emphasize that this is the prediction value, not the ground truth.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19204: [SPARK-21981][PYTHON][ML] Added Python interface ...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19204#discussion_r139312199

--- Diff: python/pyspark/ml/evaluation.py ---
@@ -328,6 +329,86 @@ def setParams(self, predictionCol="prediction", labelCol="label",
         kwargs = self._input_kwargs
         return self._set(**kwargs)
 
+
+@inherit_doc
+class ClusteringEvaluator(JavaEvaluator, HasPredictionCol, HasFeaturesCol,
+                          JavaMLReadable, JavaMLWritable):
+    """
+    .. note:: Experimental
+
+    Evaluator for Clustering results, which expects two input
+    columns: prediction and features.
+
+    >>> from sklearn import datasets
+    >>> from pyspark.sql.types import *
+    >>> from pyspark.ml.linalg import Vectors, VectorUDT
+    >>> from pyspark.ml.evaluation import ClusteringEvaluator
+    ...
+    >>> iris = datasets.load_iris()
+    >>> iris_rows = [(Vectors.dense(x), int(iris.target[i]))
+    ...              for i, x in enumerate(iris.data)]
+    >>> schema = StructType([
+    ...     StructField("features", VectorUDT(), True),
+    ...     StructField("cluster_id", IntegerType(), True)])
+    >>> rdd = spark.sparkContext.parallelize(iris_rows)
+    >>> dataset = spark.createDataFrame(rdd, schema)
+    ...
+    >>> evaluator = ClusteringEvaluator(predictionCol="cluster_id")
+    >>> evaluator.evaluate(dataset)
+    0.656...
+    >>> ce_path = temp_path + "/ce"
+    >>> evaluator.save(ce_path)
+    >>> evaluator2 = ClusteringEvaluator.load(ce_path)
+    >>> str(evaluator2.getPredictionCol())
+    'cluster_id'
+
+    .. versionadded:: 2.3.0
+    """
+    metricName = Param(Params._dummy(), "metricName",
+                       "metric name in evaluation (silhouette)",
+                       typeConverter=TypeConverters.toString)
+
+    @keyword_only
+    def __init__(self, predictionCol="prediction", featuresCol="features",
+                 metricName="silhouette"):
+        """
+        __init__(self, predictionCol="prediction", featuresCol="features", \
+                 metricName="silhouette")
+        """
+        super(ClusteringEvaluator, self).__init__()
+        self._java_obj = self._new_java_obj(
+            "org.apache.spark.ml.evaluation.ClusteringEvaluator", self.uid)
+        self._setDefault(predictionCol="prediction", featuresCol="features",
--- End diff --

Remove the default value settings for `predictionCol` and `featuresCol`, as they are already set in `HasPredictionCol` and `HasFeaturesCol`.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org