[GitHub] spark pull request #13756: [SPARK-16041][SQL] Disallow Duplicate Columns in ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13756#discussion_r69541474

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala ---
@@ -248,4 +281,15 @@ private[sql] case class PreWriteCheck(conf: SQLConf, catalog: SessionCatalog)
       case _ => // OK
     }
   }
+
+  private def checkDuplicates(columnNames: Seq[String], columnType: String): Unit = {
+    val duplicateColumns = columnNames.groupBy { name =>
+      if (conf.caseSensitiveAnalysis) name else name.toLowerCase }.collect {
+      case (x, ys) if ys.length > 1 => s"`$x`"
+    }
+    if (duplicateColumns.nonEmpty) {
+      throw new AnalysisException(
+        s"Found duplicate column(s) in $columnType: ${duplicateColumns.mkString(", ")}")
--- End diff --

Nit: duplicate => duplicated
[GitHub] spark pull request #13756: [SPARK-16041][SQL] Disallow Duplicate Columns in ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13756#discussion_r69541158

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala ---
@@ -206,7 +207,39 @@ private[sql] case class PreWriteCheck(conf: SQLConf, catalog: SessionCatalog)
         // The relation in l is not an InsertableRelation.
         failAnalysis(s"$l does not allow insertion.")

+      case c: CreateTableCommand =>
+        val allColNamesInSchema = c.table.schema.map(_.name)
+        val colNames = allColNamesInSchema.diff(c.table.partitionColumnNames)
+        val partitionColumnNames = c.table.partitionColumnNames
+        // Duplicates are not allowed in partitionBy
+        // TODO: when bucketBy and sortBy are supported, we also need to ban the duplication.
+        checkDuplicates(partitionColumnNames, "Partition")
+        // Ensure that no duplicate name is used in the table definition
+        checkDuplicates(colNames, s"table definition of ${c.table.identifier}")
+        // For non-data-source tables, partition columns must not be part of the schema
+        val badPartCols = partitionColumnNames.toSet.intersect(colNames.toSet)
--- End diff --

Again, this line is using a case-sensitive comparison.
[GitHub] spark pull request #13756: [SPARK-16041][SQL] Disallow Duplicate Columns in ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13756#discussion_r69540936

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala ---
@@ -248,4 +281,15 @@ private[sql] case class PreWriteCheck(conf: SQLConf, catalog: SessionCatalog)
       case _ => // OK
     }
   }
+
+  private def checkDuplicates(columnNames: Seq[String], columnType: String): Unit = {
+    val duplicateColumns = columnNames.groupBy { name =>
+      if (conf.caseSensitiveAnalysis) name else name.toLowerCase }.collect {
+      case (x, ys) if ys.length > 1 => s"`$x`"
+    }
--- End diff --

Code style within this block looks weird.
[GitHub] spark pull request #13756: [SPARK-16041][SQL] Disallow Duplicate Columns in ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13756#discussion_r69540754

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala ---
@@ -206,7 +207,39 @@ private[sql] case class PreWriteCheck(conf: SQLConf, catalog: SessionCatalog)
         // The relation in l is not an InsertableRelation.
         failAnalysis(s"$l does not allow insertion.")

+      case c: CreateTableCommand =>
+        val allColNamesInSchema = c.table.schema.map(_.name)
+        val colNames = allColNamesInSchema.diff(c.table.partitionColumnNames)
--- End diff --

Is it safe to do a case-sensitive comparison here?
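To make the concern concrete, here is a small self-contained sketch (illustrative names and data only, not code from the PR): `Seq.diff` compares names with plain string equality, so with `spark.sql.caseSensitive=false` a schema column that differs from a partition column only in case would survive the `diff`.

```scala
// Sketch only: emulate the case-insensitive analysis mode.
val caseSensitiveAnalysis = false
def normalize(name: String): String =
  if (caseSensitiveAnalysis) name else name.toLowerCase

val allColNamesInSchema  = Seq("key", "value", "PART")
val partitionColumnNames = Seq("part")

allColNamesInSchema.diff(partitionColumnNames)
// Seq("key", "value", "PART")  <- "PART" survives, although it matches case-insensitively

val normalizedPartCols = partitionColumnNames.map(normalize).toSet
allColNamesInSchema.filterNot(n => normalizedPartCols.contains(normalize(n)))
// Seq("key", "value")          <- a normalization-aware comparison removes it
```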
[GitHub] spark pull request #13756: [SPARK-16041][SQL] Disallow Duplicate Columns in ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13756#discussion_r69528941

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala ---
@@ -867,8 +865,8 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {
   }

   /**
-   * Create a table, returning either a [[CreateTableCommand]] or a
-   * [[CreateHiveTableAsSelectLogicalPlan]].
+   * Create a table, returning either a [[CreateTableCommand]], a
--- End diff --

Nit: Remove "either".
[GitHub] spark issue #13389: [SPARK-9876][SQL][FOLLOWUP] Enable string and binary tes...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13389 LGTM, merging to master. Thanks!
[GitHub] spark pull request #13389: [SPARK-9876][SQL][FOLLOWUP] Enable string and bin...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13389#discussion_r69527639

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystWriteSupport.scala ---
@@ -150,7 +150,8 @@ private[parquet] class CatalystWriteSupport extends WriteSupport[InternalRow] wi
       case StringType =>
         (row: SpecializedGetters, ordinal: Int) =>
-          recordConsumer.addBinary(Binary.fromByteArray(row.getUTF8String(ordinal).getBytes))
+          recordConsumer.addBinary(
+            Binary.fromReusedByteArray(row.getUTF8String(ordinal).getBytes))
--- End diff --

`UTF8String` itself is immutable, but the underlying buffer it points to can be mutable. I'd vote for using `Binary.fromReusedByteArray` here.
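A self-contained illustration of why a reused buffer matters (plain JVM types only; this is not the Parquet `Binary` API): a wrapper that merely keeps a reference to a byte array observes any later mutation of that array, which is exactly the hazard when the producer recycles its buffers.

```scala
// Illustrative sketch: a reference-holding wrapper sees later mutations of the buffer.
val buffer  = "hello".getBytes("UTF-8")
val wrapped = java.nio.ByteBuffer.wrap(buffer)   // keeps a reference, no copy

buffer(0) = 'j'.toByte                           // the producer reuses/overwrites the buffer

new String(wrapped.array(), "UTF-8")             // "jello" -- the wrapper observed the mutation
```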
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14044 Merged to master. Thanks!
[GitHub] spark pull request #14012: [SPARK-16343][SQL] Improve the PushDownPredicate ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14012#discussion_r69522538

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1106,12 +1106,15 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
     // Push [[Filter]] operators through [[Window]] operators. Parts of the predicate that can be
     // pushed beneath must satisfy the following two conditions:
     // 1. All the expressions are part of window partitioning key. The expressions can be compound.
-    // 2. Deterministic
+    // 2. Deterministic.
+    // 3. Placed before any non-deterministic predicates.
     case filter @ Filter(condition, w: Window)
         if w.partitionSpec.forall(_.isInstanceOf[AttributeReference]) =>
       val partitionAttrs = AttributeSet(w.partitionSpec.flatMap(_.references))
+      var isPredicatePushdownAble = true
       val (pushDown, stayUp) = splitConjunctivePredicates(condition).partition { cond =>
-        cond.references.subsetOf(partitionAttrs) && cond.deterministic &&
+        isPredicatePushdownAble = isPredicatePushdownAble && cond.deterministic
+        isPredicatePushdownAble && cond.references.subsetOf(partitionAttrs) &&
--- End diff --

And we should move the `partitionAttrs.forall(_.isInstanceOf[Attribute])` predicate out of the closure.
[GitHub] spark pull request #14012: [SPARK-16343][SQL] Improve the PushDownPredicate ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14012#discussion_r69522366

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -1106,12 +1106,15 @@ object PushDownPredicate extends Rule[LogicalPlan] with PredicateHelper {
     // Push [[Filter]] operators through [[Window]] operators. Parts of the predicate that can be
     // pushed beneath must satisfy the following two conditions:
     // 1. All the expressions are part of window partitioning key. The expressions can be compound.
-    // 2. Deterministic
+    // 2. Deterministic.
+    // 3. Placed before any non-deterministic predicates.
     case filter @ Filter(condition, w: Window)
         if w.partitionSpec.forall(_.isInstanceOf[AttributeReference]) =>
       val partitionAttrs = AttributeSet(w.partitionSpec.flatMap(_.references))
+      var isPredicatePushdownAble = true
       val (pushDown, stayUp) = splitConjunctivePredicates(condition).partition { cond =>
-        cond.references.subsetOf(partitionAttrs) && cond.deterministic &&
+        isPredicatePushdownAble = isPredicatePushdownAble && cond.deterministic
+        isPredicatePushdownAble && cond.references.subsetOf(partitionAttrs) &&
--- End diff --

The following can be easier to read:

```scala
val (candidates, containingNonDeterministic) =
  splitConjunctivePredicates(condition).span(_.deterministic)

val (pushDown, rest) = candidates.partition { cond =>
  cond.references.subsetOf(partitionAttrs) && partitionAttrs.forall(_.isInstanceOf[Attribute])
}

val stayUp = rest ++ containingNonDeterministic
```
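A tiny self-contained illustration of why `span` (rather than `partition`) captures condition 3: only the conjuncts appearing before the first non-deterministic one are even considered for pushdown. The `Pred` case class below is purely illustrative and is not a Catalyst type.

```scala
case class Pred(sql: String, deterministic: Boolean)

val conjuncts = Seq(
  Pred("a = 1", deterministic = true),
  Pred("rand() < 0.5", deterministic = false),
  Pred("b = 2", deterministic = true))

// span stops at the first non-deterministic conjunct...
val (candidates, containingNonDeterministic) = conjuncts.span(_.deterministic)
// candidates                == Seq(Pred("a = 1", true))
// containingNonDeterministic == Seq(Pred("rand() < 0.5", false), Pred("b = 2", true))

// ...whereas partition would wrongly keep "b = 2" as a pushdown candidate even though it is
// evaluated after the non-deterministic predicate.
val (det, nonDet) = conjuncts.partition(_.deterministic)
// det == Seq(Pred("a = 1", true), Pred("b = 2", true))
```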
[GitHub] spark issue #14012: [SPARK-16343][SQL] Improve the PushDownPredicate rule to...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14012 add to whitelist
[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14038#discussion_r69520473

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -437,11 +442,26 @@ private[sql] object HadoopFsRelation extends Logging {
       accessTime: Long,
       blockLocations: Array[FakeBlockLocation])

+  private[sql] def mergePathFilter(
+      filter1: Option[PathFilter], filter2: Option[PathFilter]): Path => Boolean = {
+    (filter1, filter2) match {
+      case (Some(f1), Some(f2)) =>
+        (path: Path) => f1.accept(path) && f2.accept(path)
+      case (Some(f1), None) =>
+        (path: Path) => f1.accept(path)
+      case (None, Some(f2)) =>
+        (path: Path) => f2.accept(path)
+      case (None, None) =>
+        (path: Path) => true
+    }
--- End diff --

This can be more concise:

```scala
(filter1 ++ filter2).reduceOption { (f1, f2) =>
  (path: Path) => f1.accept(path) && f2.accept(path)
}.getOrElse { (path: Path) => true }
```
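A self-contained sketch of the same composition pattern on plain functions (stand-ins for `PathFilter`), showing how `Option ++ Option` plus `reduceOption` covers all four cases, including the accept-all default. The types and names here are illustrative, not the PR's API.

```scala
def merge(f1: Option[String => Boolean], f2: Option[String => Boolean]): String => Boolean =
  (f1 ++ f2).reduceOption { (a, b) =>
    (path: String) => a(path) && b(path)
  }.getOrElse((_: String) => true)

val endsWithAvro: Option[String => Boolean] = Some((p: String) => p.endsWith(".avro"))
val notHidden: Option[String => Boolean]    = Some((p: String) => !p.startsWith("_"))

merge(endsWithAvro, notHidden)("part-00000.avro")  // true
merge(endsWithAvro, notHidden)("_metadata")        // false
merge(None, None)("anything")                      // true (accept-all default)
```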
[GitHub] spark issue #14038: [SPARK-16317][SQL] Add a new interface to filter files i...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14038 Left some comments, the overall structure looks good. Thanks!
[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14038#discussion_r69520156

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -172,6 +171,13 @@ case class HadoopFsRelation(
   }

   /**
+ * A helper class to list up qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends PathFilter with Serializable {
--- End diff --

Oh I see, because parallel file listing may filter input files on the executor side.
[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14038#discussion_r69519931

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -172,6 +171,13 @@ case class HadoopFsRelation(
   }

   /**
+ * A helper class to list up qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends PathFilter with Serializable {
--- End diff --

I probably missed something here, but why does it have to be `Serializable`?
[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14038#discussion_r69519846

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -230,6 +236,15 @@ trait FileFormat {
   }

   /**
+   * Return a `SerializablePathFilter` class to filter qualified files for this format.
+   */
+  def getPathFilter(): SerializablePathFilter = {
--- End diff --

What are the semantics of the return value of this method? It seems that it should never return a null filter, since it defaults to an "accept all" filter. If this is true, it's unnecessary to use `Option` to wrap returned filters elsewhere in this PR.
[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14038#discussion_r69519641

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -172,6 +171,13 @@ case class HadoopFsRelation(
   }

   /**
+ * A helper class to list up qualified files in parallel.
+ */
+private[spark] abstract class SerializablePathFilter extends PathFilter with Serializable {
--- End diff --

Extending from `PathFilter` makes internal implementation a little bit easier, but I'd prefer to avoid depending on Hadoop classes/interfaces in Spark SQL public interfaces whenever possible.
[GitHub] spark pull request #14038: [SPARK-16317][SQL] Add a new interface to filter ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14038#discussion_r69518421

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala ---
@@ -230,6 +236,15 @@ trait FileFormat {
   }

   /**
+   * Return a `SerializablePathFilter` class to filter qualified files for this format.
+   */
+  def getPathFilter(): SerializablePathFilter = {
--- End diff --

Shall we add either the data source options map or the Hadoop conf as an argument of this method? For example, the Avro data source may filter out all input files whose file names don't end with ".avro" if the Hadoop conf "avro.mapred.ignore.inputs.without.extension" is set to true. This is consistent with the default behavior of `AvroInputFormat`.
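To make the suggestion concrete, a hypothetical signature taking the data source options map might look like the sketch below. The option key is the standard Avro one mentioned above; the method shape, the local `SerializablePathFilter` stand-in, and the assumed default value are illustrative only and not the actual `FileFormat` API.

```scala
import org.apache.hadoop.fs.{Path, PathFilter}

// Stand-in for the PR's SerializablePathFilter so the sketch compiles on its own.
abstract class SerializablePathFilter extends PathFilter with Serializable

// Hypothetical: let the format consult the data source options when building its filter.
def getPathFilter(options: Map[String, String]): SerializablePathFilter = {
  // Default value assumed here for illustration.
  val ignoreWithoutExtension =
    options.getOrElse("avro.mapred.ignore.inputs.without.extension", "true").toBoolean

  new SerializablePathFilter {
    override def accept(path: Path): Boolean =
      !ignoreWithoutExtension || path.getName.endsWith(".avro")
  }
}
```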
[GitHub] spark issue #12972: [SPARK-15198][SQL] Support for pushing down filters for ...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/12972 LGTM, merging to master. Sorry for leaving this PR for so long...
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14044 LGTM pending Jenkins.
[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14044 Agree with @hvanhovell. Analysis should never take this long for such a simple query. We should avoid duplicated analysis work, but fixing the performance issue(s) within the analyzer seems more fruitful.
[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13818 Shall we also have this in branch-2.0? This seems to be a pretty serious bug. cc @rxin.
[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13818 LGTM except for minor styling issues. Thanks!
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13818#discussion_r69448268

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -425,6 +425,28 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
     }
   }

+  test("SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached " +
+    "relation") {
+    withTable("partitioned") {
+      sql(
+        s"""CREATE TABLE partitioned (
+           |  key INT,
+           |  value STRING
+           |)
+           |PARTITIONED BY (part INT)
+           |STORED AS PARQUET
+         """.stripMargin)
+      sql("INSERT INTO TABLE partitioned PARTITION(part=0) SELECT 1 as key, 'one' as value")
+
+      // First lookup fills the cache
+      val r1 = collectHadoopFsRelation (table("partitioned"))
+      // Second lookup should reuse the cache
+      val r2 = collectHadoopFsRelation (table("partitioned"))
--- End diff --

Nit: Remove the space before `(`.
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13818#discussion_r69448259

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -425,6 +425,28 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
     }
   }

+  test("SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached " +
+    "relation") {
+    withTable("partitioned") {
+      sql(
+        s"""CREATE TABLE partitioned (
+           |  key INT,
+           |  value STRING
+           |)
+           |PARTITIONED BY (part INT)
+           |STORED AS PARQUET
+         """.stripMargin)
+      sql("INSERT INTO TABLE partitioned PARTITION(part=0) SELECT 1 as key, 'one' as value")
+
+      // First lookup fills the cache
+      val r1 = collectHadoopFsRelation (table("partitioned"))
--- End diff --

Nit: Remove the space before `(`.
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13818#discussion_r69448203

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -425,6 +425,28 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
     }
   }

+  test("SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached " +
+    "relation") {
+    withTable("partitioned") {
+      sql(
+        s"""CREATE TABLE partitioned (
+           |  key INT,
+           |  value STRING
+           |)
+           |PARTITIONED BY (part INT)
+           |STORED AS PARQUET
+         """.stripMargin)
--- End diff --

Nit: Indentation is off here.
[GitHub] spark issue #14025: [DOC][SQL] update out-of-date code snippets using SQLCon...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14025 @tdas Would you please help review the streaming example code changes? Thanks!
[GitHub] spark pull request #14025: [DOC][SQL] update out-of-date code snippets using...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14025#discussion_r69415295

--- Diff: docs/configuration.md ---
@@ -1564,8 +1564,8 @@ spark.sql("SET -v").show(n=200, truncate=False)

{% highlight r %}

-# sqlContext is an existing sqlContext.
-properties <- sql(sqlContext, "SET -v")
+# spark is an existing SparkSession.
+properties <- sql(spark, "SET -v")
--- End diff --

I believe in the most recent SparkR API, you no longer need to pass in the `spark` parameter to run a SQL statement.
[GitHub] spark issue #14025: [WIP][DOC] update out-of-date code snippets using SQLCon...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14025 @WeichenXu123 Is this ready for review? If yes, please remove the WIP tag in the PR description.
[GitHub] spark issue #14009: [SPARK-16311][SQL] Metadata refresh should work on tempo...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14009 LGTM except for a minor styling issue.
[GitHub] spark pull request #14009: [SPARK-16311][SQL] Metadata refresh should work o...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14009#discussion_r69316671

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/MetadataCacheSuite.scala ---
@@ -77,12 +77,12 @@ class MetadataCacheSuite extends QueryTest with SharedSQLContext {
         sql("select count(*) from view_refresh").first()
       }
       assert(e.getMessage.contains("FileNotFoundException"))
-      assert(e.getMessage.contains("refresh()"))
+      assert(e.getMessage.contains("REFRESH"))

       // Refresh and we should be able to read it again.
       spark.catalog.refreshTable("view_refresh")
       val newCount = sql("select count(*) from view_refresh").first().getLong(0)
       assert(newCount > 0 && newCount < 100)
-    }
+  }}
--- End diff --

This style is pretty weird...
[GitHub] spark issue #13906: [SPARK-16208][SQL] Add `PropagateEmptyRelation` optimize...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13906 Merged to master. @cloud-fan Sorry that I didn't notice your comment while merging. We can address it in follow-up PRs.
[GitHub] spark issue #14013: [SPARK-16344][SQL][BRANCH-1.6] Decoding Parquet array of...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14013 @rdblue Verified that parquet-avro also suffers from this issue. Filed [PARQUET-651][1] to track it.

[1]: https://issues.apache.org/jira/browse/PARQUET-651
[GitHub] spark issue #14013: [SPARK-16344][SQL][BRANCH-1.6] Decoding Parquet array of...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14013 @rdblue Would you mind helping review this one? My initial investigation suggests that parquet-avro probably suffers from the same issue. I will file a parquet-mr JIRA ticket soon.
[GitHub] spark issue #14014: [SPARK-16344][SQL] Decoding Parquet array of struct with...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14014 cc @yhuai
[GitHub] spark pull request #14014: [SPARK-16344][SQL] Decoding Parquet array of stru...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/14014

[SPARK-16344][SQL] Decoding Parquet array of struct with a single field named "element"

## What changes were proposed in this pull request?

This PR ports #14013 to master and branch-2.0.

## How was this patch tested?

See #14013.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-16344-for-master-and-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14014.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14014

commit 3bfe45fe8b81f44141b737df6b292f12cd37d06a
Author: Cheng Lian <l...@databricks.com>
Date: 2016-07-01T11:32:52Z

    Fixes SPARK-16344
[GitHub] spark issue #14013: [SPARK-16344][SQL] Decoding Parquet array of struct with...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/14013 cc @yhuai
[GitHub] spark pull request #14013: [SPARK-16344][SQL] Decoding Parquet array of stru...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14013#discussion_r69283877

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala ---
@@ -481,13 +481,106 @@ private[parquet] class CatalystRowConverter(
    */
   // scalastyle:on
   private def isElementType(
-      parquetRepeatedType: Type, catalystElementType: DataType, parentName: String): Boolean = {
+      parquetRepeatedType: Type, catalystElementType: DataType, parent: GroupType): Boolean = {
+
+    def isStandardListLayout(t: GroupType): Boolean =
+      Option(parent.getOriginalType) == Some(LIST) &&
+        t.getFieldCount == 1 &&
+        t.getName == "list" &&
+        t.getFieldName(0) == "element"
+
     (parquetRepeatedType, catalystElementType) match {
-      case (t: PrimitiveType, _) => true
-      case (t: GroupType, _) if t.getFieldCount > 1 => true
-      case (t: GroupType, _) if t.getFieldCount == 1 && t.getName == "array" => true
-      case (t: GroupType, _) if t.getFieldCount == 1 && t.getName == parentName + "_tuple" => true
-      case (t: GroupType, StructType(Array(f))) if f.name == t.getFieldName(0) => true
+      case (t: PrimitiveType, _) =>
+        // For legacy 2-level list types with primitive element type, e.g.:
+        //
+        //   // List<Integer> (nullable list, non-null elements)
+        //   optional group my_list (LIST) {
+        //     repeated int32 element;
+        //   }
+        true
+
+      case (t: GroupType, _) if t.getFieldCount > 1 =>
+        // For legacy 2-level list types whose element type is a group type with 2 or more fields,
+        // e.g.:
+        //
+        //   // List<Tuple<String, Integer>> (nullable list, non-null elements)
+        //   optional group my_list (LIST) {
+        //     repeated group element {
+        //       required binary str (UTF8);
+        //       required int32 num;
+        //     };
+        //   }
+        true
+
+      case (t: GroupType, _) if t.getFieldCount == 1 && t.getName == "array" =>
+        // For Parquet data generated by parquet-thrift, e.g.:
+        //
+        //   // List<OneTuple> (nullable list, non-null elements)
+        //   optional group my_list (LIST) {
+        //     repeated group my_list_tuple {
+        //       required binary str (UTF8);
+        //     };
+        //   }
+        true
+
+      case (t: GroupType, _) if t.getFieldCount == 1 && t.getName == parent + "_tuple" =>
+        // For Parquet data generated by parquet-thrift, e.g.:
+        //
+        //   // List<OneTuple> (nullable list, non-null elements)
+        //   optional group my_list (LIST) {
+        //     repeated group my_list_tuple {
+        //       required binary str (UTF8);
+        //     };
+        //   }
+        true
+
+      case (t: GroupType, _) if isStandardListLayout(t) =>
+        // For standard 3-level list types, e.g.:
+        //
+        //   // List<String> (list nullable, elements non-null)
+        //   optional group my_list (LIST) {
+        //     repeated group list {
+        //       required binary element (UTF8);
+        //     }
+        //   }
+        //
+        // This case branch must appear before the next one. See comments of the next case branch
+        // for details.
+        false
--- End diff --

This case branch is essential for the bug fix. Basically, it matches the standard 3-level layout first, before trying to match the legacy 2-level layout, so that the "element" syntactic group of a Parquet LIST won't be mistaken for an "element" field in the nested struct.
[GitHub] spark pull request #14013: [SPARK-16344][SQL] Decoding Parquet array of stru...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/14013

[SPARK-16344][SQL] Decoding Parquet array of struct with a single field named "element"

## What changes were proposed in this pull request?

Please refer to [SPARK-16344][1] for details about this issue.

## How was this patch tested?

New test case added in `ParquetQuerySuite`.

[1]: https://issues.apache.org/jira/browse/SPARK-16344

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-16344-parquet-schema-corner-case

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14013.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14013

commit 9620b48d463ed2f2a8ede7397420050dc1e7d832
Author: Cheng Lian <l...@databricks.com>
Date: 2016-07-01T10:52:29Z

    Fixes SPARK-16344
[GitHub] spark pull request #14006: [SPARK-13015][MLlib][DOC] Replace example code in...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/14006#discussion_r69242197

--- Diff: docs/_plugins/include_example.rb ---
@@ -85,20 +85,20 @@ def select_lines(code)
       .select { |l, i| l.include? "$example off#{@snippet_label}$" }
       .map { |l, i| i }

-    raise "Start indices amount is not equal to end indices amount, see #{@file}." \
+    raise "Start indices amount is not equal to end indices amount, see #{@file}, #{@snippet_label}." \
--- End diff --

I'd prefer

```
... see #{@file} [labeled=#{@snippet_label}].
```

Otherwise the label itself might be mistaken for a file path.
[GitHub] spark pull request #13974: [SPARK-16296][SQL] add null check for key when cr...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13974#discussion_r69231758

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapData.scala ---
@@ -19,6 +19,11 @@ package org.apache.spark.sql.catalyst.util

class ArrayBasedMapData(val keyArray: ArrayData, val valueArray: ArrayData) extends MapData {
  require(keyArray.numElements() == valueArray.numElements())

+  for (i <- 0 until keyArray.numElements()) {
--- End diff --

Let's use a `while`-loop to replace the `for`, since Scala's `for` is super slow.
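A sketch of the `while`-loop form being asked for. The wrapper object and helper name are illustrative only; the point is avoiding the Range/closure overhead of a Scala `for` comprehension in this constructor path.

```scala
import org.apache.spark.sql.catalyst.util.ArrayData

object MapKeyCheck {
  // Illustrative helper: reject null keys without using a for-comprehension.
  def checkNoNullKeys(keyArray: ArrayData): Unit = {
    var i = 0
    val numKeys = keyArray.numElements()
    while (i < numKeys) {
      if (keyArray.isNullAt(i)) {
        throw new IllegalArgumentException("Cannot use null as a map key!")
      }
      i += 1
    }
  }
}
```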
[GitHub] spark issue #13558: [SPARK-15820][PySpark][SQL]Add Catalog.refreshTable into...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13558 LGTM, merging to master and branch-2.0.
[GitHub] spark issue #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimizer
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13906 LGTM except for those comments @cloud-fan brought up. Thanks for working on this!
[GitHub] spark pull request #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13906#discussion_r69140767

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseEmptyPlanSuite.scala ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans._
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class CollapseEmptyPlanSuite extends PlanTest {
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("CollapseEmptyPlan", Once,
+        CombineUnions,
+        ReplaceDistinctWithAggregate,
+        ReplaceExceptWithAntiJoin,
+        ReplaceIntersectWithSemiJoin,
+        PushDownPredicate,
+        PruneFilters,
+        CollapseEmptyPlan) :: Nil
+  }
+
+  object OptimizeWithoutCollapseEmptyPlan extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("CollapseEmptyPlan", Once,
+        CombineUnions,
+        ReplaceDistinctWithAggregate,
+        ReplaceExceptWithAntiJoin,
+        ReplaceIntersectWithSemiJoin,
+        PushDownPredicate,
+        PruneFilters) :: Nil
+  }
+
+  val testRelation1 = LocalRelation.fromExternalRows(Seq('a.int), data = Seq(Row(1)))
+  val testRelation2 = LocalRelation.fromExternalRows(Seq('b.int), data = Seq(Row(1)))
+  val testRelation3 = LocalRelation.fromExternalRows(Seq('c.int), data = Seq(Row(1)))
+
+  test("Binary Logical Plans - Collapse empty union") {
+    val query = testRelation1
+      .where(false)
+      .union(testRelation2.where(false))
+
+    val optimized = Optimize.execute(query.analyze)
+    val correctAnswer = LocalRelation('a.int)
+
+    comparePlans(optimized, correctAnswer)
+  }
+
+  test("Binary Logical Plans - Collapse joins") {
--- End diff --

Well, I guess the verbosity is necessary since it guarantees that we don't do wrong empty relation propagation for certain types of joins (e.g. outer join with only one empty child).
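An illustrative example of the join corner case mentioned here (assuming a `SparkSession` named `spark` with `spark.implicits._` imported; this is not part of the test suite itself): a left outer join whose right side is empty is not itself empty, so only the join cases enumerated in the rule may be rewritten to an empty relation.

```scala
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "v")
val right = left.filter("id > 10")                          // empty relation

// Every left row survives with nulls on the right, so the result has 2 rows, not 0.
left.join(right.select("id"), Seq("id"), "left_outer").count()  // 2
```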
[GitHub] spark pull request #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13906#discussion_r69140206

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CollapseEmptyPlan.scala ---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate
+import org.apache.spark.sql.catalyst.plans._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.rules._
+
+/**
+ * Collapse plans consisting empty local relations generated by [[PruneFilters]].
+ * 1. Binary(or Higher)-node Logical Plans
+ *    - Union with all empty children.
+ *    - Join with one or two empty children (including Intersect/Except).
+ * 2. Unary-node Logical Plans
+ *    - Project/Filter/Sample/Join/Limit/Repartition with all empty children.
+ *    - Aggregate with all empty children and without DeclarativeAggregate expressions like COUNT.
+ *    - Generate(Explode) with all empty children. Others like Hive UDTF may return results.
+ */
+object CollapseEmptyPlan extends Rule[LogicalPlan] with PredicateHelper {
+  private def isEmptyLocalRelation(plan: LogicalPlan): Boolean = plan match {
+    case p: LocalRelation => p.data.isEmpty
+    case _ => false
+  }
+
+  private def isDeclarativeAggregate(e: Expression): Boolean = e match {
+    case _: DeclarativeAggregate => true
+    case _: LeafExpression => false
+    case other => other.children.forall(isDeclarativeAggregate)
+  }
+
+  private def empty(plan: LogicalPlan) = LocalRelation(plan.output, data = Seq.empty)
+
+  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+    case p: Union if p.children.nonEmpty && p.children.forall(isEmptyLocalRelation) =>
+      empty(p)
+
+    case p @ Join(_, _, joinType, _) if p.children.exists(isEmptyLocalRelation) => joinType match {
+      case Inner => empty(p)
+      case LeftOuter | LeftSemi | LeftAnti if isEmptyLocalRelation(p.left) => empty(p)
+      case RightOuter if isEmptyLocalRelation(p.right) => empty(p)
+      case FullOuter if p.children.forall(isEmptyLocalRelation) => empty(p)
+      case _ => p
+    }
--- End diff --

Sorry that I didn't notice the comment above. I think that should be enough.
[GitHub] spark issue #13994: [BUILD] Fix version in poms related to kafka-0-10
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13994 Merging to branch-2.0.
[GitHub] spark pull request #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13906#discussion_r69111075

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/CollapseEmptyPlanSuite.scala ---
@@ -0,0 +1,173 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.catalyst.dsl.expressions._
+import org.apache.spark.sql.catalyst.dsl.plans._
+import org.apache.spark.sql.catalyst.plans._
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan}
+import org.apache.spark.sql.catalyst.rules.RuleExecutor
+
+class CollapseEmptyPlanSuite extends PlanTest {
+  object Optimize extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("CollapseEmptyPlan", Once,
+        CombineUnions,
+        ReplaceDistinctWithAggregate,
+        ReplaceExceptWithAntiJoin,
+        ReplaceIntersectWithSemiJoin,
+        PushDownPredicate,
+        PruneFilters,
+        CollapseEmptyPlan) :: Nil
+  }
+
+  object OptimizeWithoutCollapseEmptyPlan extends RuleExecutor[LogicalPlan] {
+    val batches =
+      Batch("CollapseEmptyPlan", Once,
+        CombineUnions,
+        ReplaceDistinctWithAggregate,
+        ReplaceExceptWithAntiJoin,
+        ReplaceIntersectWithSemiJoin,
+        PushDownPredicate,
+        PruneFilters) :: Nil
+  }
+
+  val testRelation1 = LocalRelation.fromExternalRows(Seq('a.int), data = Seq(Row(1)))
+  val testRelation2 = LocalRelation.fromExternalRows(Seq('b.int), data = Seq(Row(1)))
+  val testRelation3 = LocalRelation.fromExternalRows(Seq('c.int), data = Seq(Row(1)))
+
+  test("Binary(or Higher) Logical Plans - Collapse empty union") {
--- End diff --

Remove "(or Higher)".
[GitHub] spark pull request #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13906#discussion_r69110745 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CollapseEmptyPlan.scala --- @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate +import org.apache.spark.sql.catalyst.plans._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules._ + +/** + * Collapse plans consisting empty local relations generated by [[PruneFilters]]. + * 1. Binary(or Higher)-node Logical Plans + *- Union with all empty children. + *- Join with one or two empty children (including Intersect/Except). + * 2. Unary-node Logical Plans + *- Project/Filter/Sample/Join/Limit/Repartition with all empty children. + *- Aggregate with all empty children and without DeclarativeAggregate expressions like COUNT. + *- Generate(Explode) with all empty children. Others like Hive UDTF may return results. + */ +object CollapseEmptyPlan extends Rule[LogicalPlan] with PredicateHelper { + private def isEmptyLocalRelation(plan: LogicalPlan): Boolean = plan match { +case p: LocalRelation => p.data.isEmpty +case _ => false + } + + private def isDeclarativeAggregate(e: Expression): Boolean = e match { +case _: DeclarativeAggregate => true +case _: LeafExpression => false +case other => other.children.forall(isDeclarativeAggregate) + } + + private def empty(plan: LogicalPlan) = LocalRelation(plan.output, data = Seq.empty) + + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +case p: Union if p.children.nonEmpty && p.children.forall(isEmptyLocalRelation) => + empty(p) + +case p @ Join(_, _, joinType, _) if p.children.exists(isEmptyLocalRelation) => joinType match { + case Inner => empty(p) + case LeftOuter | LeftSemi | LeftAnti if isEmptyLocalRelation(p.left) => empty(p) + case RightOuter if isEmptyLocalRelation(p.right) => empty(p) + case FullOuter if p.children.forall(isEmptyLocalRelation) => empty(p) + case _ => p +} + +case p: UnaryNode if p.children.nonEmpty && p.children.forall(isEmptyLocalRelation) => p match { + case _: Project | _: Filter | _: Sample | _: Sort | _: GlobalLimit | _: LocalLimit | + _: Repartition | _: RepartitionByExpression => empty(p) + case Aggregate(_, ae, _) if !ae.exists(isDeclarativeAggregate) => empty(p) + case Generate(_ : Explode, _, _, _, _, _) => empty(p) --- End diff -- Let's add comment here to explain why we must special case `Aggregate` and `Generate`. 
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13906#discussion_r69109823 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CollapseEmptyPlan.scala --- @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate +import org.apache.spark.sql.catalyst.plans._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules._ + +/** + * Collapse plans consisting empty local relations generated by [[PruneFilters]]. + * 1. Binary(or Higher)-node Logical Plans + *- Union with all empty children. + *- Join with one or two empty children (including Intersect/Except). + * 2. Unary-node Logical Plans + *- Project/Filter/Sample/Join/Limit/Repartition with all empty children. + *- Aggregate with all empty children and without DeclarativeAggregate expressions like COUNT. + *- Generate(Explode) with all empty children. Others like Hive UDTF may return results. + */ +object CollapseEmptyPlan extends Rule[LogicalPlan] with PredicateHelper { + private def isEmptyLocalRelation(plan: LogicalPlan): Boolean = plan match { +case p: LocalRelation => p.data.isEmpty +case _ => false + } + + private def isDeclarativeAggregate(e: Expression): Boolean = e match { +case _: DeclarativeAggregate => true +case _: LeafExpression => false +case other => other.children.forall(isDeclarativeAggregate) + } + + private def empty(plan: LogicalPlan) = LocalRelation(plan.output, data = Seq.empty) + + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +case p: Union if p.children.nonEmpty && p.children.forall(isEmptyLocalRelation) => + empty(p) + +case p @ Join(_, _, joinType, _) if p.children.exists(isEmptyLocalRelation) => joinType match { + case Inner => empty(p) + case LeftOuter | LeftSemi | LeftAnti if isEmptyLocalRelation(p.left) => empty(p) + case RightOuter if isEmptyLocalRelation(p.right) => empty(p) + case FullOuter if p.children.forall(isEmptyLocalRelation) => empty(p) + case _ => p +} --- End diff -- Could you please comment that `Intersect` is also covered here? I didn't realized that we've already translated `Intersect` using joins at first. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13906#discussion_r69109611 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CollapseEmptyPlan.scala --- @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate +import org.apache.spark.sql.catalyst.plans._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules._ + +/** + * Collapse plans consisting empty local relations generated by [[PruneFilters]]. + * 1. Binary(or Higher)-node Logical Plans + *- Union with all empty children. + *- Join with one or two empty children (including Intersect/Except). + * 2. Unary-node Logical Plans + *- Project/Filter/Sample/Join/Limit/Repartition with all empty children. + *- Aggregate with all empty children and without DeclarativeAggregate expressions like COUNT. + *- Generate(Explode) with all empty children. Others like Hive UDTF may return results. + */ +object CollapseEmptyPlan extends Rule[LogicalPlan] with PredicateHelper { + private def isEmptyLocalRelation(plan: LogicalPlan): Boolean = plan match { +case p: LocalRelation => p.data.isEmpty +case _ => false + } + + private def isDeclarativeAggregate(e: Expression): Boolean = e match { +case _: DeclarativeAggregate => true +case _: LeafExpression => false +case other => other.children.forall(isDeclarativeAggregate) + } --- End diff -- I'd replace this method with: ```scala def containsAggregateExpression(e: Expression): Boolean = { e.collectFirst { case _: AggregateFunction => () }.isDefined } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13906#discussion_r69109340 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CollapseEmptyPlan.scala --- @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate +import org.apache.spark.sql.catalyst.plans._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules._ + +/** + * Collapse plans consisting empty local relations generated by [[PruneFilters]]. + * 1. Binary(or Higher)-node Logical Plans + *- Union with all empty children. + *- Join with one or two empty children (including Intersect/Except). + * 2. Unary-node Logical Plans + *- Project/Filter/Sample/Join/Limit/Repartition with all empty children. + *- Aggregate with all empty children and without DeclarativeAggregate expressions like COUNT. + *- Generate(Explode) with all empty children. Others like Hive UDTF may return results. + */ +object CollapseEmptyPlan extends Rule[LogicalPlan] with PredicateHelper { + private def isEmptyLocalRelation(plan: LogicalPlan): Boolean = plan match { +case p: LocalRelation => p.data.isEmpty +case _ => false + } + + private def isDeclarativeAggregate(e: Expression): Boolean = e match { +case _: DeclarativeAggregate => true --- End diff -- Why `DeclarativeAggregate` rather than `AggregateFunction`? `AggregateFunction` also covers `ImperativeAggregate` like `ScalaUDAF`, which should also be covered here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13992: [SPARK-12177][TEST] Removed test to avoid compilation is...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13992 Merging to master and branch-2.0. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13989 In general, I think reconstructing a DataFrame/Dataset or using `REFRESH TABLE` may be a better approach to the problem this PR tries to solve. Did I miss some context here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
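For reference, the alternatives suggested above could look roughly like this; the table name and path are purely illustrative, and `spark` is assumed to be an existing `SparkSession`:

```scala
// Refresh cached file-system metadata for a table, either via SQL or the catalog API...
spark.sql("REFRESH TABLE my_table")
spark.catalog.refreshTable("my_table")

// ...or simply reconstruct the Dataset so that it picks up fresh metadata on its own.
val df = spark.read.parquet("/path/to/data")
```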
[GitHub] spark issue #13989: [SPARK-16311][SQL] Improve metadata refresh
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13989 One concern of mine is that the analyzed, optimized, and executed (physical) plans stored in `QueryExecution` are all lazy vals, which means they won't be re-optimized/re-planned after the metadata of the corresponding logical plan is refreshed. Say we construct a DataFrame `df` that joins a small table `A` with a large table `B`. After calling `df.write.parquet(...)`, the analyzed, optimized, and executed plans of `df` are all computed. Since `A` is small, the planner may decide to broadcast it, and this decision is baked into the physical plan. Next, we add a bunch of files to the directory where table `A` lives to make it very large, then call `df.refresh()` to refresh the logical plan. Now, if we call `df.write.parquet(...)` again, the query will probably fail, since the physical plan is not refreshed and still assumes that `A` should be broadcast. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
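A sketch of the scenario described above; the paths are made up, `spark` is assumed to be an existing `SparkSession`, and `df.refresh()` refers to the API proposed in this PR, so this is illustrative rather than code that runs against released Spark:

```scala
val a = spark.read.parquet("/data/A")    // small at planning time
val b = spark.read.parquet("/data/B")    // large
val df = a.join(b, "id")

df.write.parquet("/tmp/out1")            // the lazy plans in QueryExecution are materialized;
                                         // the physical plan broadcasts A

// ... an external process adds many files under /data/A, making it huge ...

df.refresh()                             // proposed API: refreshes logical-plan metadata only
df.write.parquet("/tmp/out2")            // the cached physical plan still broadcasts the
                                         // now-huge A, which may fail or run out of memory
```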
[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13989#discussion_r69090469 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -2307,6 +2307,19 @@ class Dataset[T] private[sql]( def distinct(): Dataset[T] = dropDuplicates() /** + * Refreshes the metadata and data cached in Spark for data associated with this Dataset. + * An example use case is to invalidate the file system metadata cached by Spark, when the + * underlying files have been updated by an external process. + * + * @group action + * @since 2.0.0 + */ + def refresh(): Unit = { +unpersist(false) --- End diff -- Actually we can and should call `unpersist`, but we should also call `persist()`/`cache()` again so that the Dataset will be cached lazily again with correct data when it gets executed next time. I guess that's also what @gatorsmile meant. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
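A minimal sketch of the suggestion, assuming we only re-cache Datasets that were already cached; the use of `cacheManager` and the call to `refresh()` on the analyzed plan are assumptions for illustration, not the actual implementation in this PR:

```scala
// Sketch only, written as if inside Dataset: re-cache lazily after refreshing metadata.
def refresh(): Unit = {
  val wasCached = sparkSession.sharedState.cacheManager.lookupCachedData(this).nonEmpty
  if (wasCached) {
    unpersist(blocking = false)      // drop the stale cached data
  }
  queryExecution.analyzed.refresh()  // refresh file-system metadata in the logical plan
  if (wasCached) {
    persist()                        // registered for caching again; re-materialized lazily
                                     // on the next action, now with correct data
  }
}
```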
[GitHub] spark pull request #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13906#discussion_r69089371 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CollapseEmptyPlan.scala --- @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules._ + +/** + * Collapse plans consisting empty local relations generated by [[PruneFilters]]. + * 1. InnerJoin with one or two empty children. + * 2. Project/Generate/Filter/Sample/Join/Limit/Union/Repartition with all empty children. + * 3. Aggregate with all empty children and grpExprs containing all aggExprs. + */ +object CollapseEmptyPlan extends Rule[LogicalPlan] with PredicateHelper { + private def isEmptyLocalRelation(plan: LogicalPlan): Boolean = +plan.isInstanceOf[LocalRelation] && plan.asInstanceOf[LocalRelation].data.isEmpty + + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +case p @ Join(_, _, Inner, _) if p.children.exists(isEmptyLocalRelation) => + LocalRelation(p.output, data = Seq.empty) + +case p: LogicalPlan if p.children.nonEmpty && p.children.forall(isEmptyLocalRelation) => + p match { +case _: Project | _: Generate | _: Filter | _: Sample | _: Join | + _: Sort | _: GlobalLimit | _: LocalLimit | _: Union | _: Repartition => + LocalRelation(p.output, data = Seq.empty) +case Aggregate(ge, ae, _) if ae.forall(ge.contains(_)) => --- End diff -- Yea. The following predicate should work: ```scala ae.forall(_.collectFirst { case _: AggregateExpression => () }.isEmpty) ``` (But probably put it into a separate method though.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13992: [SPARK-12177][TEST] Removed test to avoid compilation is...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13992 LGTM pending Jenkins. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13989#discussion_r69083486 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging { s"Reference '$name' is ambiguous, could be: $referenceNames.") } } + + /** + * Invalidates any metadata cached in the plan recursively. + */ + def refresh(): Unit = children.foreach(_.refresh()) --- End diff -- `children.foreach(_.refresh())` isn't tail recursive. Using a non-tail recursion here is probably fine since this isn't a critical path. As @petermaxlee said, we are already using recursion to handle `TreeNode`s everywhere with the assumption that such trees are relatively shallow. We do need to pay attention when using recursion in other places though. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
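If very deep plans ever became a concern, an explicit worklist would avoid stack growth entirely; the following is only a sketch (the helper name and the node-local refresh step are placeholders):

```scala
import scala.collection.mutable

import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical iterative alternative to the recursive children.foreach(_.refresh()).
def refreshIteratively(root: LogicalPlan): Unit = {
  val stack = mutable.Stack(root)
  while (stack.nonEmpty) {
    val node = stack.pop()
    // ... refresh node-local metadata here (e.g. invalidate cached file listings) ...
    stack.pushAll(node.children)
  }
}
```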
[GitHub] spark pull request #13989: [SPARK-16311][SQL] Improve metadata refresh
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13989#discussion_r69081636 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -265,6 +265,11 @@ abstract class LogicalPlan extends QueryPlan[LogicalPlan] with Logging { s"Reference '$name' is ambiguous, could be: $referenceNames.") } } + + /** + * Invalidates any metadata cached in the plan recursively. --- End diff -- "Refreshes" instead of "Invalidates"? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13972: [SPARK-16294][SQL] Labelling support for the include_exa...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13972 Other example snippets in the SQL programming guide will be updated in follow-up PRs. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13972: [SPARK-16294][SQL] Labelling support for the include_exa...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13972 Thanks for the review! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimizer
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13906 @cloud-fan Yea, that's a good point. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13906#discussion_r69065541 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CollapseEmptyPlan.scala --- @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules._ + +/** + * Collapse plans consisting empty local relations generated by [[PruneFilters]]. + * 1. InnerJoin with one or two empty children. + * 2. Project/Generate/Filter/Sample/Join/Limit/Union/Repartition with all empty children. + * 3. Aggregate with all empty children and grpExprs containing all aggExprs. + */ +object CollapseEmptyPlan extends Rule[LogicalPlan] with PredicateHelper { + private def isEmptyLocalRelation(plan: LogicalPlan): Boolean = +plan.isInstanceOf[LocalRelation] && plan.asInstanceOf[LocalRelation].data.isEmpty + + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +case p @ Join(_, _, Inner, _) if p.children.exists(isEmptyLocalRelation) => --- End diff -- Yea, we can also add `Intersect`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13906#discussion_r69065425 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CollapseEmptyPlan.scala --- @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules._ + +/** + * Collapse plans consisting empty local relations generated by [[PruneFilters]]. + * 1. InnerJoin with one or two empty children. + * 2. Project/Generate/Filter/Sample/Join/Limit/Union/Repartition with all empty children. + * 3. Aggregate with all empty children and grpExprs containing all aggExprs. + */ +object CollapseEmptyPlan extends Rule[LogicalPlan] with PredicateHelper { + private def isEmptyLocalRelation(plan: LogicalPlan): Boolean = +plan.isInstanceOf[LocalRelation] && plan.asInstanceOf[LocalRelation].data.isEmpty --- End diff -- ```scala plan match { case p: LocalRelation => p.data.isEmpty case _ => false } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimizer
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13906 My feeling is that this optimization rule is mostly useful for binary plan nodes like inner join and intersection, where we can avoid scanning the output of the non-empty side. On the other hand, for unary plan nodes, firstly it doesn't bring much performance benefit, especially when whole-stage codegen is enabled; secondly, there are non-obvious and tricky corner cases, like `Aggregate` and `Generate`. That said, although this patch is not a big one, it does introduce non-trivial complexity. For example, I didn't immediately realize why `Aggregate` must be special-cased (`COUNT(x)` may return 0 for empty input). The `Generate` case is even trickier. So my suggestion is to implement this rule only for inner join and intersection, which are much simpler to handle. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
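To make the `Aggregate` corner case concrete: a global aggregate over an empty child still produces one row, so collapsing it into an empty relation would change query results. A small sketch, assuming an existing `SparkSession` named `spark` (output shown approximately):

```scala
import org.apache.spark.sql.functions.count

// The filter prunes every row, so the child relation is effectively empty.
val empty = spark.range(10).where("id < 0")

// A global COUNT over empty input still returns one row containing 0, so an
// Aggregate node cannot be blindly replaced by an empty LocalRelation.
empty.agg(count("id")).show()
// +---------+
// |count(id)|
// +---------+
// |        0|
// +---------+
```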
[GitHub] spark pull request #13906: [SPARK-16208][SQL] Add `CollapseEmptyPlan` optimi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13906#discussion_r69064054 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CollapseEmptyPlan.scala --- @@ -0,0 +1,49 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.plans._ +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.catalyst.rules._ + +/** + * Collapse plans consisting empty local relations generated by [[PruneFilters]]. + * 1. InnerJoin with one or two empty children. + * 2. Project/Generate/Filter/Sample/Join/Limit/Union/Repartition with all empty children. + * 3. Aggregate with all empty children and grpExprs containing all aggExprs. + */ +object CollapseEmptyPlan extends Rule[LogicalPlan] with PredicateHelper { + private def isEmptyLocalRelation(plan: LogicalPlan): Boolean = +plan.isInstanceOf[LocalRelation] && plan.asInstanceOf[LocalRelation].data.isEmpty + + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +case p @ Join(_, _, Inner, _) if p.children.exists(isEmptyLocalRelation) => + LocalRelation(p.output, data = Seq.empty) + +case p: LogicalPlan if p.children.nonEmpty && p.children.forall(isEmptyLocalRelation) => + p match { +case _: Project | _: Generate | _: Filter | _: Sample | _: Join | --- End diff -- Actually `Generate` can't be included here. Our `Generate` also support Hive style UDTF, which has a weird semantics: for a UDTF `f`, after all rows being processed, `f.close()` will be called, and *more rows can be generated* within `f.close()`. This means a UDTF may generate one or more rows even if the underlying input is empty. See [here][1] and PR #5338 for more details. [1]: https://github.com/apache/spark/pull/5383/files --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
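To illustrate the `close()` semantics mentioned above, here is a minimal, hypothetical Hive UDTF that still emits a row even when it never receives any input; it is only a sketch against Hive's `GenericUDTF` API (the class name and output column are made up):

```scala
import java.util.Arrays

import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

// Hypothetical UDTF: one output row per input row, plus one extra row emitted from close().
// Even when the child plan is empty, the row forwarded in close() is still produced, which
// is why Generate over a Hive UDTF cannot be collapsed into an empty LocalRelation.
class EmitOnClose extends GenericUDTF {
  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector =
    ObjectInspectorFactory.getStandardStructObjectInspector(
      Arrays.asList("col"),
      Arrays.asList(PrimitiveObjectInspectorFactory.javaStringObjectInspector: ObjectInspector))

  override def process(args: Array[AnyRef]): Unit = forward(Array[AnyRef]("from process()"))

  override def close(): Unit = forward(Array[AnyRef]("from close()"))
}
```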
[GitHub] spark issue #13972: [SPARK-16294][SQL] Labelling support for the include_exa...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13972 @yinxusen Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13972: [SPARK-16294][SQL] Labelling support for the include_exa...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13972 @yinxusen @mengxr Actually I found overlapped labelling is far easier than I expected earlier... so I did it in the last commit. I made the following experiment to illustrate the effect: ```python # $SPARK_HOME/examples/src/main/python/test.py # $example on:foo$ import used.by.foo # $example on:bar$ import common.stuff # $example off:foo$ import used.by.bar # $example off:bar$ ``` Liquid template: ``` {% include_example foo python/test.py %} {% include_example bar python/test.py %} ``` Screenshot: https://cloud.githubusercontent.com/assets/230655/16474910/1490d872-3ead-11e6-9225-6cc35b55f365.png --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13972: [SPARK-16294][SQL] Labelling support for the include_exa...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13972 @yinxusen Thanks for the review! Also discussed with @mengxr. IIUC, overlapped labels are most useful for handling imports, since sometimes we may want to include one import line in multiple example snippet blocks. Regarding this issue, I have two concerns: 1. It's a rare case, and we can probably work around it while authoring the documentation. Thus I guess we can do it in a follow-up PR. 2. Instead of overlapped labels, I'd prefer a multi-label approach: ```scala // $example on:A$ import org.examples.A // $example off:A$ // $example on:A:B$ import org.examples.common._ // $example off:A:B$ // $example on:B$ import org.examples.B // $example off:B$ ``` This approach achieves the same goal and is much simpler to implement than handling overlapped scoping. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13846: [SPARK-16134][SQL] optimizer rules for typed filter
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13846 Reverted the commit on branch-2.0. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13846: [SPARK-16134][SQL] optimizer rules for typed filter
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13846 LGTM, merged to master. (Also merged to branch-2.0 by mistake, will revert it ASAP. Sorry for the trouble.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13972: [SPARK-16294][SQL] Labelling support for the include_exa...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13972 @yinxusen Could you please help review this one since you're the original author of this plugin? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13972: [SPARK-16294][SQL] Labelling support for the include_exa...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13972 Also cc @yhuai and @rxin. The background here is that I'm going to extract snippets from actual Scala/Java/Python/R source files rather than hard-code them in the SQL programming guide, so that they are easier to maintain. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13972: [SPARK-16294][SQL] Labelling support for the include_exa...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13972 @mengxr Could you please help review this PR? Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13972: [SPARK-16294][SQL] Labelling support for the incl...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/13972 [SPARK-16294][SQL] Labelling support for the include_example Jekyll plugin ## What changes were proposed in this pull request? This PR adds labelling support for the `include_example` Jekyll plugin, so that we may split a single source file into multiple line blocks with different labels, and include them in multiple code snippets in the generated HTML page. ## How was this patch tested? Manually tested. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark include-example-with-labels Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13972.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13972 commit 71307508928a0e0706c1665df9951f909bf7b214 Author: Cheng Lian <l...@databricks.com> Date: 2016-06-29T10:37:32Z Add labelling support for include_example Jekyll plugin commit 93f33986e2d6c31200c53f4887893a275948a65a Author: Cheng Lian <l...@databricks.com> Date: 2016-06-29T11:39:30Z Update SQL programming guide and example code to illustrate the new labelling feature --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13968: [SPARK-16291][SQL] CheckAnalysis should capture nested a...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13968 cc @yhuai @cloud-fan @clockfly --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13893: [SPARK-14172][SQL] Hive table partition predicate not pa...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13893 @jiangxb1987 Please feel free to create a new JIRA ticket and PR for this, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13968: [SPARK-16291][SQL] CheckAnalysis should capture n...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/13968 [SPARK-16291][SQL] CheckAnalysis should capture nested aggregate functions that reference no input attributes ## What changes were proposed in this pull request? `MAX(COUNT(*))` is invalid since an aggregate expression can't be nested within another aggregate expression. This case should be caught at the analysis phase, but somehow sneaks through to runtime. The reason is that when checking aggregate expressions in `CheckAnalysis`, one checking branch treats all expressions that reference no input attributes as valid. However, `MAX(COUNT(*))` is translated into `MAX(COUNT(1))` during analysis and thus also references no input attributes. This PR fixes the issue by removing the aforementioned branch. ## How was this patch tested? A new test case was added in `AnalysisErrorSuite`. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-16291-nested-agg-functions Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13968.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13968 commit 18e74855968fe404dea86a49e03c689c01b1b737 Author: Cheng Lian <l...@databricks.com> Date: 2016-06-29T08:00:20Z CheckAnalysis should capture nested aggregate functions that reference no input attributes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
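A hedged illustration of the kind of query this check rejects, assuming an existing `SparkSession` named `spark` (the exact error message may differ):

```scala
// Nested aggregate functions are invalid. With this fix, the query below should fail with
// an AnalysisException as soon as the Dataset is analyzed, rather than blowing up at runtime.
spark.range(10).selectExpr("max(count(*))")
```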
[GitHub] spark issue #13835: [SPARK-16100][SQL] fix bug when use Map as the buffer ty...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13835 LGTM, merging to master and branch-2.0. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13933: [SPARK-16236] [SQL] Add Path Option back to Load API in ...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13933 LGTM. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13720: [SPARK-16004] [SQL] Correctly display "Last Acces...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13720#discussion_r68855824 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala --- @@ -522,7 +523,7 @@ case class DescribeTableCommand(table: TableIdentifier, isExtended: Boolean, isF private def describeSchema(schema: Seq[CatalogColumn], buffer: ArrayBuffer[Row]): Unit = { schema.foreach { column => - append(buffer, column.name, column.dataType.toLowerCase, column.comment.orNull) + append(buffer, column.name, column.dataType.toLowerCase, column.comment.getOrElse("")) --- End diff -- Agree, a column without any comment (null) is different from a column with a comment that is an empty string. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13720: [SPARK-16004] [SQL] Correctly display "Last Acces...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13720#discussion_r68855663 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -180,7 +180,8 @@ case class CatalogTable( Seq(s"Table: ${identifier.quotedString}", if (owner.nonEmpty) s"Owner: $owner" else "", s"Created: ${new Date(createTime).toString}", -s"Last Access: ${new Date(lastAccessTime).toString}", +"Last Access: " + + (if (lastAccessTime == -1) "UNKNOWN" else new Date(lastAccessTime).toString), --- End diff -- Empty string looks fine to me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13846: [SPARK-16134][SQL] optimizer rules for typed filt...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13846#discussion_r68753881 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala --- @@ -23,54 +23,111 @@ import org.apache.spark.sql.catalyst.analysis.UnresolvedDeserializer import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.plans._ import org.apache.spark.sql.catalyst.encoders.{encoderFor, ExpressionEncoder} +import org.apache.spark.sql.catalyst.expressions.{BoundReference, ReferenceToExpressions} import org.apache.spark.sql.catalyst.plans.PlanTest import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan} import org.apache.spark.sql.catalyst.rules.RuleExecutor -import org.apache.spark.sql.types.BooleanType +import org.apache.spark.sql.types.{BooleanType, ObjectType} class TypedFilterOptimizationSuite extends PlanTest { object Optimize extends RuleExecutor[LogicalPlan] { val batches = Batch("EliminateSerialization", FixedPoint(50), EliminateSerialization) :: - Batch("EmbedSerializerInFilter", FixedPoint(50), -EmbedSerializerInFilter) :: Nil + Batch("CombineTypedFilters", FixedPoint(50), +CombineTypedFilters) :: Nil } implicit private def productEncoder[T <: Product : TypeTag] = ExpressionEncoder[T]() - test("back to back filter") { + test("filter after serialize") { val input = LocalRelation('_1.int, '_2.int) -val f1 = (i: (Int, Int)) => i._1 > 0 -val f2 = (i: (Int, Int)) => i._2 > 0 +val f = (i: (Int, Int)) => i._1 > 0 -val query = input.filter(f1).filter(f2).analyze +val query = input + .deserialize[(Int, Int)] + .serialize[(Int, Int)] + .filter(f).analyze val optimized = Optimize.execute(query) -val expected = input.deserialize[(Int, Int)] - .where(callFunction(f1, BooleanType, 'obj)) - .select('obj.as("obj")) - .where(callFunction(f2, BooleanType, 'obj)) +val expected = input + .deserialize[(Int, Int)] + .where(callFunction(f, BooleanType, 'obj)) .serialize[(Int, Int)].analyze comparePlans(optimized, expected) } - // TODO: Remove this after we completely fix SPARK-15632 by adding optimization rules - // for typed filters. 
- ignore("embed deserializer in typed filter condition if there is only one filter") { + test("filter after serialize with object change") { +val input = LocalRelation('_1.int, '_2.int) +val f = (i: OtherTuple) => i._1 > 0 + +val query = input + .deserialize[(Int, Int)] + .serialize[(Int, Int)] + .filter(f).analyze +val optimized = Optimize.execute(query) +comparePlans(optimized, query) + } + + test("filter before deserialize") { val input = LocalRelation('_1.int, '_2.int) val f = (i: (Int, Int)) => i._1 > 0 -val query = input.filter(f).analyze +val query = input + .filter(f) + .deserialize[(Int, Int)] + .serialize[(Int, Int)].analyze + +val optimized = Optimize.execute(query) + +val expected = input + .deserialize[(Int, Int)] + .where(callFunction(f, BooleanType, 'obj)) + .serialize[(Int, Int)].analyze + +comparePlans(optimized, expected) + } + + test("filter before deserialize with object change") { +val input = LocalRelation('_1.int, '_2.int) +val f = (i: OtherTuple) => i._1 > 0 + +val query = input + .filter(f) + .deserialize[(Int, Int)] + .serialize[(Int, Int)].analyze +val optimized = Optimize.execute(query) +comparePlans(optimized, query) + } + + test("back to back filter") { --- End diff -- Nit: "back to back filters with the same object type" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13846: [SPARK-16134][SQL] optimizer rules for typed filt...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13846#discussion_r68753763 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/TypedFilterOptimizationSuite.scala --- @@ -23,54 +23,111 @@ import org.apache.spark.sql.catalyst.analysis.UnresolvedDeserializer import org.apache.spark.sql.catalyst.dsl.expressions._ import org.apache.spark.sql.catalyst.dsl.plans._ import org.apache.spark.sql.catalyst.encoders.{encoderFor, ExpressionEncoder} +import org.apache.spark.sql.catalyst.expressions.{BoundReference, ReferenceToExpressions} import org.apache.spark.sql.catalyst.plans.PlanTest import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan} import org.apache.spark.sql.catalyst.rules.RuleExecutor -import org.apache.spark.sql.types.BooleanType +import org.apache.spark.sql.types.{BooleanType, ObjectType} class TypedFilterOptimizationSuite extends PlanTest { object Optimize extends RuleExecutor[LogicalPlan] { val batches = Batch("EliminateSerialization", FixedPoint(50), EliminateSerialization) :: - Batch("EmbedSerializerInFilter", FixedPoint(50), -EmbedSerializerInFilter) :: Nil + Batch("CombineTypedFilters", FixedPoint(50), +CombineTypedFilters) :: Nil } implicit private def productEncoder[T <: Product : TypeTag] = ExpressionEncoder[T]() - test("back to back filter") { + test("filter after serialize") { val input = LocalRelation('_1.int, '_2.int) -val f1 = (i: (Int, Int)) => i._1 > 0 -val f2 = (i: (Int, Int)) => i._2 > 0 +val f = (i: (Int, Int)) => i._1 > 0 -val query = input.filter(f1).filter(f2).analyze +val query = input + .deserialize[(Int, Int)] + .serialize[(Int, Int)] + .filter(f).analyze val optimized = Optimize.execute(query) -val expected = input.deserialize[(Int, Int)] - .where(callFunction(f1, BooleanType, 'obj)) - .select('obj.as("obj")) - .where(callFunction(f2, BooleanType, 'obj)) +val expected = input + .deserialize[(Int, Int)] + .where(callFunction(f, BooleanType, 'obj)) .serialize[(Int, Int)].analyze comparePlans(optimized, expected) } - // TODO: Remove this after we completely fix SPARK-15632 by adding optimization rules - // for typed filters. 
- ignore("embed deserializer in typed filter condition if there is only one filter") { + test("filter after serialize with object change") { +val input = LocalRelation('_1.int, '_2.int) +val f = (i: OtherTuple) => i._1 > 0 + +val query = input + .deserialize[(Int, Int)] + .serialize[(Int, Int)] + .filter(f).analyze +val optimized = Optimize.execute(query) +comparePlans(optimized, query) + } + + test("filter before deserialize") { val input = LocalRelation('_1.int, '_2.int) val f = (i: (Int, Int)) => i._1 > 0 -val query = input.filter(f).analyze +val query = input + .filter(f) + .deserialize[(Int, Int)] + .serialize[(Int, Int)].analyze + +val optimized = Optimize.execute(query) + +val expected = input + .deserialize[(Int, Int)] + .where(callFunction(f, BooleanType, 'obj)) + .serialize[(Int, Int)].analyze + +comparePlans(optimized, expected) + } + + test("filter before deserialize with object change") { +val input = LocalRelation('_1.int, '_2.int) +val f = (i: OtherTuple) => i._1 > 0 + +val query = input + .filter(f) + .deserialize[(Int, Int)] + .serialize[(Int, Int)].analyze +val optimized = Optimize.execute(query) +comparePlans(optimized, query) + } + + test("back to back filter") { +val input = LocalRelation('_1.int, '_2.int) +val f1 = (i: (Int, Int)) => i._1 > 0 +val f2 = (i: (Int, Int)) => i._2 > 0 + +val query = input.filter(f1).filter(f2).analyze val optimized = Optimize.execute(query) val deserializer = UnresolvedDeserializer(encoderFor[(Int, Int)].deserializer) -val condition = callFunction(f, BooleanType, deserializer) -val expected = input.where(condition).select('_1.as("_1"), '_2.as("_2")).analyze +val boundReference = BoundReference(0, ObjectType(classOf[(Int, Int)]), nullable = false) +val callFunc1 = callFunction(f1, BooleanType, boundReference) +
[GitHub] spark pull request #13846: [SPARK-16134][SQL] optimizer rules for typed filt...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13846#discussion_r68751774 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -167,6 +169,43 @@ case class MapElements( outputObjAttr: Attribute, child: LogicalPlan) extends ObjectConsumer with ObjectProducer +object TypedFilter { + def apply[T : Encoder](func: AnyRef, child: LogicalPlan): TypedFilter = { +TypedFilter(func, UnresolvedDeserializer(encoderFor[T].deserializer), child) + } +} + +/** + * A relation produced by applying `func` to each element of the `child` and filter them by the + * resulting boolean value. + * + * This is logically equal to a normal [[Filter]] operator whose condition expression is decoding + * the input row to object and apply the given function with decoded object. However we need the + * encapsulation of [[TypedFilter]] to make the concept more clear and make it easier to write + * optimizer rules. + */ +case class TypedFilter( +func: AnyRef, +deserializer: Expression, +child: LogicalPlan) extends UnaryNode { + + override def output: Seq[Attribute] = child.output + + def withObject(obj: LogicalPlan): Filter = { --- End diff -- How about renaming this method to `withObjectProducerChild`?
[GitHub] spark pull request #13846: [SPARK-16134][SQL] optimizer rules for typed filt...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13846#discussion_r68751775 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala --- @@ -167,6 +169,43 @@ case class MapElements( outputObjAttr: Attribute, child: LogicalPlan) extends ObjectConsumer with ObjectProducer +object TypedFilter { + def apply[T : Encoder](func: AnyRef, child: LogicalPlan): TypedFilter = { +TypedFilter(func, UnresolvedDeserializer(encoderFor[T].deserializer), child) + } +} + +/** + * A relation produced by applying `func` to each element of the `child` and filter them by the + * resulting boolean value. + * + * This is logically equal to a normal [[Filter]] operator whose condition expression is decoding + * the input row to object and apply the given function with decoded object. However we need the + * encapsulation of [[TypedFilter]] to make the concept more clear and make it easier to write + * optimizer rules. + */ +case class TypedFilter( +func: AnyRef, +deserializer: Expression, +child: LogicalPlan) extends UnaryNode { + + override def output: Seq[Attribute] = child.output + + def withObject(obj: LogicalPlan): Filter = { +assert(obj.output.length == 1) +Filter(getCondition(obj.output.head), obj) + } + + def getCondition(input: Expression): Expression = { --- End diff -- How about renaming it to `typedCondition`?
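To make the two rename suggestions above concrete, here is a sketch of how `TypedFilter` would read with both applied. It is derived only from the quoted diffs in this thread, not from the final committed code, and the body of the renamed `typedCondition` is left out because the quoted diff truncates it:

```scala
case class TypedFilter(
    func: AnyRef,
    deserializer: Expression,
    child: LogicalPlan) extends UnaryNode {

  override def output: Seq[Attribute] = child.output

  // Suggested rename of withObject: the argument must be an object-producing
  // plan with a single output attribute, and the result filters on that object.
  def withObjectProducerChild(obj: LogicalPlan): Filter = {
    assert(obj.output.length == 1)
    Filter(typedCondition(obj.output.head), obj)
  }

  // Suggested rename of getCondition: wraps `func` into a condition over any
  // object-typed expression, e.g. the BoundReference that CombineTypedFilters
  // uses as a placeholder for the deserialized row.
  def typedCondition(input: Expression): Expression = {
    ??? // body unchanged from the PR; truncated in the quoted diff
  }
}
```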
[GitHub] spark pull request #13846: [SPARK-16134][SQL] optimizer rules for typed filt...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13846#discussion_r68750076 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ReferenceToExpressions.scala --- @@ -45,6 +45,7 @@ case class ReferenceToExpressions(result: Expression, children: Seq[Expression]) var maxOrdinal = -1 result foreach { case b: BoundReference if b.ordinal > maxOrdinal => maxOrdinal = b.ordinal + case _ => --- End diff -- So this is actually a bug fix? Before this, we could only use a single `BoundReference` as `result`, right?
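The added `case _ =>` matters because a pattern-matching anonymous function passed to `foreach` must cover every node it visits; without a catch-all, any node that is not a `BoundReference` throws a `MatchError`, which is why only a bare `BoundReference` worked as `result` before. A minimal standalone illustration in plain Scala (not Catalyst code):

```scala
object MatchAllExample extends App {
  val nodes: Seq[Any] = Seq(1, "a", 2.0)

  // Without a catch-all this line would throw scala.MatchError on "a":
  // nodes.foreach { case i: Int => println(i) }

  // With the catch-all, elements that do not match are simply skipped:
  nodes.foreach {
    case i: Int => println(i)
    case _      => // ignore everything else
  }
}
```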
[GitHub] spark pull request #13846: [SPARK-16134][SQL] optimizer rules for typed filt...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13846#discussion_r68750055 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -1637,55 +1654,31 @@ case class GetCurrentDatabase(sessionCatalog: SessionCatalog) extends Rule[Logic } /** - * Typed [[Filter]] is by default surrounded by a [[DeserializeToObject]] beneath it and a - * [[SerializeFromObject]] above it. If these serializations can't be eliminated, we should embed - * the deserializer in filter condition to save the extra serialization at last. + * Combines all adjacent [[TypedFilter]]s, which operate on same type object in condition, into a + * single [[Filter]]. */ -object EmbedSerializerInFilter extends Rule[LogicalPlan] { +object CombineTypedFilters extends Rule[LogicalPlan] { def apply(plan: LogicalPlan): LogicalPlan = plan transform { -case s @ SerializeFromObject(_, Filter(condition, d: DeserializeToObject)) - // SPARK-15632: Conceptually, filter operator should never introduce schema change. This - // optimization rule also relies on this assumption. However, Dataset typed filter operator - // does introduce schema changes in some cases. Thus, we only enable this optimization when - // - // 1. either input and output schemata are exactly the same, or - // 2. both input and output schemata are single-field schema and share the same type. - // - // The 2nd case is included because encoders for primitive types always have only a single - // field with hard-coded field name "value". - // TODO Cleans this up after fixing SPARK-15632. - if s.schema == d.child.schema || samePrimitiveType(s.schema, d.child.schema) => - - val numObjects = condition.collect { -case a: Attribute if a == d.output.head => a - }.length - - if (numObjects > 1) { -// If the filter condition references the object more than one times, we should not embed -// deserializer in it as the deserialization will happen many times and slow down the -// execution. -// TODO: we can still embed it if we can make sure subexpression elimination works here. -s +case t @ TypedFilter(_, deserializer, child) => + val filters = collectTypedFiltersOnSameTypeObj(child, deserializer.dataType, ArrayBuffer(t)) + if (filters.length > 1) { +val objHolder = BoundReference(0, deserializer.dataType, nullable = false) +val condition = filters.map(_.getCondition(objHolder)).reduce(And) +Filter(ReferenceToExpressions(condition, deserializer :: Nil), filters.last.child) } else { -val newCondition = condition transform { - case a: Attribute if a == d.output.head => d.deserializer -} -val filter = Filter(newCondition, d.child) - -// Adds an extra Project here, to preserve the output expr id of `SerializeFromObject`. -// We will remove it later in RemoveAliasOnlyProject rule. -val objAttrs = filter.output.zip(s.output).map { case (fout, sout) => - Alias(fout, fout.name)(exprId = sout.exprId) -} -Project(objAttrs, filter) +t } } - def samePrimitiveType(lhs: StructType, rhs: StructType): Boolean = { -(lhs, rhs) match { - case (StructType(Array(f1)), StructType(Array(f2))) => f1.dataType == f2.dataType - case _ => false -} + @tailrec + private def collectTypedFiltersOnSameTypeObj( + plan: LogicalPlan, + objType: DataType, + filters: ArrayBuffer[TypedFilter]): Array[TypedFilter] = plan match { +case t: TypedFilter if t.deserializer.dataType == objType => + filters += t --- End diff -- Shall we prepend rather than append found filters here? 
Otherwise filter predicates will be evaluated in reverse order after being combined. It would also be nice to add a comment about this.
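A sketch of the suggested change, reusing the signature from the quoted diff (the final `case` branch and the recursive call are reconstructed here, since the quoted diff truncates after `filters += t`). Because the rule walks from the outermost `TypedFilter` downwards, appending collects conditions outer-first and `reduce(And)` then evaluates them in the reverse of source order; prepending restores the original order, with the caller reading the new child from `filters.head.child` instead of `filters.last.child`:

```scala
@tailrec
private def collectTypedFiltersOnSameTypeObj(
    plan: LogicalPlan,
    objType: DataType,
    filters: ArrayBuffer[TypedFilter]): Array[TypedFilter] = plan match {
  case t: TypedFilter if t.deserializer.dataType == objType =>
    // Prepend so the innermost (first user-specified) filter comes first and
    // the combined And-condition evaluates predicates in source order.
    filters.prepend(t)
    collectTypedFiltersOnSameTypeObj(t.child, objType, filters)
  case _ =>
    filters.toArray
}
```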
[GitHub] spark issue #13918: [SPARK-16221][SQL] Redirect Parquet JUL logger via SLF4J...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13918 Thanks, merged to master. @rxin Shall we have this in branch-2.0 at this stage?
[GitHub] spark issue #13913: [SPARK-10591][SQL][TEST] Add a testcase to ensure if `ch...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13913 Thanks! Merged to master and branch-2.0.
[GitHub] spark issue #13918: [SPARK-16221][SQL] Redirect Parquet JUL logger via SLF4J...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13918 LGTM except for one minor comment. Thanks for fixing this annoying issue!
[GitHub] spark pull request #13918: [SPARK-16221][SQL] Redirect Parquet JUL logger vi...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13918#discussion_r68557487 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala --- @@ -129,6 +129,8 @@ private[sql] class ParquetFileFormat conf.setBoolean(ParquetOutputFormat.ENABLE_JOB_SUMMARY, false) } +ParquetFileFormat.redirectParquetLogs --- End diff -- Let's add `()` since this method is defined with `()` and is invoked for side effect only.
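For context, the Scala convention referenced here is that a method declared with an empty parameter list signals a side effect, and call sites should keep the parentheses so the effect stays visible. A tiny illustration with made-up names (`redirectLogs` is hypothetical and only stands in for the method in the diff):

```scala
object ParenConvention {
  private var redirected = false

  // Declared with (): callers are expected to write redirectLogs(), not redirectLogs.
  def redirectLogs(): Unit = {
    redirected = true
  }

  def main(args: Array[String]): Unit = {
    redirectLogs()   // preferred: the parentheses advertise the side effect
    redirectLogs     // compiles on Scala 2.11/2.12, but reads like a pure value
  }
}
```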
[GitHub] spark pull request #13893: [SPARK-14172][SQL] Hive table partition predicate...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13893#discussion_r68556327 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruningSuite.scala --- @@ -141,6 +141,14 @@ class PruningSuite extends HiveComparisonTest with BeforeAndAfter { Seq("2008-04-08", "11"), Seq("2008-04-09", "11"))) + createPruningTest("Partition pruning - with nondeterministic fields", +"SELECT value, hr FROM srcpart1 WHERE ds = '2008-04-08' AND rand(7) < 1", --- End diff -- Good question. I think technically we can't push down any predicates that are placed after a non-deterministic predicate. Otherwise the number of input rows may change, leading to wrong query results.
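The reasoning can be shown with a small standalone example over plain Scala collections (not Spark code): a seeded random predicate plays the role of `rand(7)`, and because its accept/reject sequence depends on which rows reach it, evaluating the deterministic predicate first, as a pushdown would, can change the final result:

```scala
object PushdownOrderExample extends App {
  case class Row(ds: String, value: Int)

  val rows = Seq(Row("2008-04-08", 1), Row("2008-04-09", 2),
                 Row("2008-04-08", 3), Row("2008-04-09", 4))

  // A "non-deterministic" predicate: the outcome for a given row depends on
  // how many rows the predicate has already consumed, much like rand(7).
  def nonDeterministic(): Row => Boolean = {
    val rng = new scala.util.Random(7)
    _ => rng.nextDouble() < 0.5
  }

  // Predicates evaluated as written: the rand-like filter runs on every row.
  val asWritten = rows.filter(nonDeterministic()).filter(_.ds == "2008-04-08")

  // "Pushed down": the ds filter runs first, so the random predicate sees a
  // different stream of rows and may keep a different subset.
  val pushedDown = rows.filter(_.ds == "2008-04-08").filter(nonDeterministic())

  println(asWritten)
  println(pushedDown)   // can differ from asWritten
}
```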
[GitHub] spark issue #13889: [SQL][minor] Simplify data source predicate filter trans...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13889 LGTM
[GitHub] spark pull request #13865: [SPARK-13709][SQL] Initialize deserializer with b...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13865#discussion_r68352349 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala --- @@ -65,4 +68,77 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl sql("DROP TABLE IF EXISTS createAndInsertTest") } } + + test("SPARK-13709: reading partitioned Avro table with nested schema") { +withTempDir { dir => + val path = dir.getCanonicalPath + val tableName = "spark_13709" + val tempTableName = "spark_13709_temp" + + new File(path, tableName).mkdir() + new File(path, tempTableName).mkdir() + + val avroSchema = +"""{ + | "name": "test_record", + | "type": "record", + | "fields": [ { + |"name": "f0", + |"type": "int" + | }, { + |"name": "f1", + |"type": { + | "type": "record", + | "name": "inner", + | "fields": [ { + |"name": "f10", + |"type": "int" + | }, { + |"name": "f11", + |"type": "double" + | } ] + |} + | } ] + |} +""".stripMargin + + withTable(tableName, tempTableName) { +// Creates the external partitioned Avro table to be tested. +sql( + s"""CREATE EXTERNAL TABLE $tableName + |PARTITIONED BY (ds STRING) + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$path/$tableName' + |TBLPROPERTIES ('avro.schema.literal' = '$avroSchema') + """.stripMargin +) + +// Creates an temporary Avro table used to prepare testing Avro file. +sql( + s"""CREATE EXTERNAL TABLE $tempTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$path/$tempTableName' + |TBLPROPERTIES ('avro.schema.literal' = '$avroSchema') + """.stripMargin +) + +// Generates Avro data. +sql(s"INSERT OVERWRITE TABLE $tempTableName SELECT 1, STRUCT(2, 2.5)") + +// Adds generated Avro data as a new partition to the testing table. +sql(s"ALTER TABLE $tableName ADD PARTITION (ds = 'foo') LOCATION '$path/$tempTableName'") + +checkAnswer( + sql(s"SELECT * FROM $tableName"), --- End diff -- Yea, sure.
[GitHub] spark pull request #13865: [SPARK-13709][SQL] Initialize deserializer with b...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13865#discussion_r68352258 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/QueryPartitionSuite.scala --- @@ -65,4 +68,77 @@ class QueryPartitionSuite extends QueryTest with SQLTestUtils with TestHiveSingl sql("DROP TABLE IF EXISTS createAndInsertTest") } } + + test("SPARK-13709: reading partitioned Avro table with nested schema") { +withTempDir { dir => + val path = dir.getCanonicalPath + val tableName = "spark_13709" + val tempTableName = "spark_13709_temp" + + new File(path, tableName).mkdir() + new File(path, tempTableName).mkdir() + + val avroSchema = +"""{ + | "name": "test_record", + | "type": "record", + | "fields": [ { + |"name": "f0", + |"type": "int" + | }, { + |"name": "f1", + |"type": { + | "type": "record", + | "name": "inner", + | "fields": [ { + |"name": "f10", + |"type": "int" + | }, { + |"name": "f11", + |"type": "double" + | } ] + |} + | } ] + |} +""".stripMargin + + withTable(tableName, tempTableName) { +// Creates the external partitioned Avro table to be tested. +sql( + s"""CREATE EXTERNAL TABLE $tableName + |PARTITIONED BY (ds STRING) + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$path/$tableName' + |TBLPROPERTIES ('avro.schema.literal' = '$avroSchema') + """.stripMargin +) + +// Creates an temporary Avro table used to prepare testing Avro file. +sql( + s"""CREATE EXTERNAL TABLE $tempTableName + |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' + |STORED AS + | INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' + | OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' + |LOCATION '$path/$tempTableName' + |TBLPROPERTIES ('avro.schema.literal' = '$avroSchema') + """.stripMargin +) + +// Generates Avro data. +sql(s"INSERT OVERWRITE TABLE $tempTableName SELECT 1, STRUCT(2, 2.5)") + +// Adds generated Avro data as a new partition to the testing table. +sql(s"ALTER TABLE $tableName ADD PARTITION (ds = 'foo') LOCATION '$path/$tempTableName'") + +checkAnswer( + sql(s"SELECT * FROM $tableName"), --- End diff -- Yea, when reading data from a partition, the Avro deserializer needs to know the Avro schema defined in the table properties (`avro.schema.literal`). However, originally we only initialized the deserializer using the partition properties, which do not contain `avro.schema.literal`. This PR fixes it by merging the two sets of properties.
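A rough sketch of that merge, assuming plain `java.util.Properties` values (the names `tableProps` and `partProps` are illustrative, not the actual fields in `HadoopTableReader`): partition-level entries are overlaid on top of the table-level ones, so settings such as `avro.schema.literal` remain visible to the deserializer while partition-specific overrides still win:

```scala
import java.util.Properties

// Combine table-level and partition-level SerDe properties before the
// deserializer is initialized. Later puts override earlier ones.
def mergeProperties(tableProps: Properties, partProps: Properties): Properties = {
  val merged = new Properties()
  merged.putAll(tableProps)   // e.g. avro.schema.literal lives here
  merged.putAll(partProps)    // partition-specific entries win on conflicts
  merged
}

// Hypothetical call site:
// deserializer.initialize(hadoopConf, mergeProperties(tableProps, partProps))
```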
[GitHub] spark pull request #13865: [SPARK-13709][SQL] Initialize deserializer with b...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13865#discussion_r68335825 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -230,10 +234,21 @@ class HadoopTableReader( // Fill all partition keys to the given MutableRow object fillPartitionKeys(partValues, mutableRow) + val tableProperties = relation.tableDesc.getProperties --- End diff -- Local variable for serialization.
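"Local variable for serialization" refers to a common Spark pattern: referencing a field inside an RDD closure captures the enclosing object, which must then be serialized and shipped to executors, while copying the field into a local `val` first means only that value is captured. A self-contained sketch with made-up names (`Scanner` is hypothetical, not `HadoopTableReader` itself):

```scala
import org.apache.spark.sql.SparkSession

class Scanner(spark: SparkSession) {
  private val props: Map[String, String] = Map("avro.schema.literal" -> "...")

  // Referencing `props` directly pulls `this` (the whole Scanner) into the task
  // closure; Scanner is not Serializable, so Spark rejects the closure with a
  // "Task not serializable" error.
  def countCapturingThis(): Long =
    spark.sparkContext.parallelize(1 to 10).filter(_ => props.nonEmpty).count()

  // Hoisting the field into a local val means only the Map is captured.
  def countCapturingLocal(): Long = {
    val localProps = props
    spark.sparkContext.parallelize(1 to 10).filter(_ => localProps.nonEmpty).count()
  }
}
```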
[GitHub] spark issue #13865: [SPARK-13709][SQL] Initialize deserializer with both tab...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13865 This is ready for review. cc @yhuai @cloud-fan @clockfly
[GitHub] spark issue #13870: [SPARK-16165][SQL] Fix the update logic for InMemoryTabl...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13870 LGTM, merging to master and branch-2.0. Thanks!
[GitHub] spark issue #13871: [SPARK-16163] [SQL] Cache the statistics for logical pla...
Github user liancheng commented on the issue: https://github.com/apache/spark/pull/13871 LGTM except for the compilation error.