[GitHub] spark pull request: [SPARK-14346] SHOW CREATE TABLE for data sourc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12781#discussion_r61845886

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/tables.scala ---
@@ -389,3 +392,238 @@ case class ShowTablePropertiesCommand(table: TableIdentifier, propertyKey: Optio
     }
   }
 }
+
+/**
+ * A command to list the column names for a table. This command corresponds to a
+ * [[ShowColumnsCommand]] logical plan.
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database];
+ * }}}
+ */
+case class ShowColumnsCommand(table: TableIdentifier) extends RunnableCommand {
+  // The result of SHOW COLUMNS has one column called 'result'
+  override val output: Seq[Attribute] = {
+    AttributeReference("result", StringType, nullable = false)() :: Nil
+  }
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    sparkSession.sessionState.catalog.getTableMetadata(table).schema.map { c =>
+      Row(c.name)
+    }
+  }
+}
+
+/**
+ * A command to list the partition names of a table. If a partition spec is specified,
+ * only the partitions that match the spec are returned. An [[AnalysisException]] is
+ * thrown under the following conditions:
+ *
+ * 1. If the command is called for a non-partitioned table.
+ * 2. If the partition spec refers to columns that are not defined as partitioning columns.
+ *
+ * This command corresponds to a [[ShowPartitionsCommand]] logical plan.
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
+ * }}}
+ */
+case class ShowPartitionsCommand(
+    table: TableIdentifier,
+    spec: Option[TablePartitionSpec]) extends RunnableCommand {
+  // The result of SHOW PARTITIONS has one column called 'result'
+  override val output: Seq[Attribute] = {
+    AttributeReference("result", StringType, nullable = false)() :: Nil
+  }
+
+  private def getPartName(spec: TablePartitionSpec, partColNames: Seq[String]): String = {
+    partColNames.map { name =>
+      PartitioningUtils.escapePathName(name) + "=" + PartitioningUtils.escapePathName(spec(name))
+    }.mkString(File.separator)
+  }
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+
+    if (catalog.isTemporaryTable(table)) {
+      throw new AnalysisException(
+        s"SHOW PARTITIONS is not allowed on a temporary table: ${table.unquotedString}")
+    }
+
+    val tab = catalog.getTableMetadata(table)
+
+    /**
+     * Validate and throw an [[AnalysisException]] under the following conditions:
+     * 1. If the table is not partitioned.
+     * 2. If it is a datasource table.
+     * 3. If it is a view or index table.
+     */
+    if (tab.tableType == VIEW ||
+        tab.tableType == INDEX) {
+      throw new AnalysisException(
+        s"SHOW PARTITIONS is not allowed on a view or index table: ${tab.qualifiedName}")
+    }
+
+    if (!DDLUtils.isTablePartitioned(tab)) {
+      throw new AnalysisException(
+        s"SHOW PARTITIONS is not allowed on a table that is not partitioned: ${tab.qualifiedName}")
+    }
+
+    if (DDLUtils.isDatasourceTable(tab)) {
+      throw new AnalysisException(
+        s"SHOW PARTITIONS is not allowed on a datasource table: ${tab.qualifiedName}")
+    }
+
+    /**
+     * Validate the partitioning spec by making sure all the referenced columns are
+     * defined as partitioning columns in the table definition. An [[AnalysisException]]
+     * is thrown if the partitioning spec is invalid.
+     */
+    if (spec.isDefined) {
+      val badColumns = spec.get.keySet.filterNot(tab.partitionColumns.map(_.name).contains)
+      if (badColumns.nonEmpty) {
+        val badCols = badColumns.mkString("[", ", ", "]")
+        throw new AnalysisException(
+          s"Non-partitioning column(s) $badCols are specified for SHOW PARTITIONS")
+      }
+    }
+
+    val partNames = catalog.listPartitions(table, spec).map { p =>
+      getPartName(p.spec, tab.partitionColumnNames)
+    }
+
+    partNames.map(Row(_))
+  }
+}
+
+case class ShowCreateTableCommand(table: TableIdentifier) extends RunnableCommand {
+  override val output: Seq[Attribute] = Seq(
+    AttributeReference("createtab_stmt", StringType, nullable = false)()
+  )
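The `getPartName` helper above renders one partition as `col=value` pairs, in declared partition-column order, joined by the path separator. A minimal Python sketch of that formatting logic — the `escape_path_name` here is a simplified stand-in for Spark's `PartitioningUtils.escapePathName`, with an illustrative (not exhaustive) set of unsafe characters:

```python
def escape_path_name(s: str) -> str:
    """Simplified stand-in for PartitioningUtils.escapePathName:
    percent-encode characters that are unsafe in partition paths."""
    unsafe = set("\"#%'*/:=?\\{}[]^")
    return "".join(f"%{ord(c):02X}" if c in unsafe else c for c in s)

def part_name(spec: dict, part_col_names: list) -> str:
    """Render a partition spec as 'col1=v1/col2=v2', following the
    declared partition-column order, as getPartName does."""
    return "/".join(
        f"{escape_path_name(name)}={escape_path_name(spec[name])}"
        for name in part_col_names
    )

print(part_name({"month": "1", "year": "2015"}, ["year", "month"]))
# year=2015/month=1
```

Note that the `=` joining each name/value pair is emitted literally, while any `=` occurring *inside* a name or value is escaped, so the rendered path stays unambiguous.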
[GitHub] spark pull request: [SPARK-14127][SQL] Native "DESC [EXTENDED | FO...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12844

[SPARK-14127][SQL] Native "DESC [EXTENDED | FORMATTED] <table>" DDL command

## What changes were proposed in this pull request?

This PR implements the native `DESC [EXTENDED | FORMATTED] <table>` DDL command. Sample output:

```
scala> spark.sql("desc extended src").show(100, truncate = false)
+----------------------------+----------------------------------+-------+
|col_name                    |data_type                         |comment|
+----------------------------+----------------------------------+-------+
|key                         |int                               |       |
|value                       |string                            |       |
|                            |                                  |       |
|# Detailed Table Information|CatalogTable(`default`.`src`, ... |       |
+----------------------------+----------------------------------+-------+

scala> spark.sql("desc formatted src").show(100, truncate = false)
+----------------------------+----------------------------------------------------------+-------+
|col_name                    |data_type                                                 |comment|
+----------------------------+----------------------------------------------------------+-------+
|key                         |int                                                       |       |
|value                       |string                                                    |       |
|                            |                                                          |       |
|# Detailed Table Information|                                                          |       |
|Database:                   |default                                                   |       |
|Owner:                      |lian                                                      |       |
|Create Time:                |Mon Jan 04 17:06:00 CST 2016                              |       |
|Last Access Time:           |Thu Jan 01 08:00:00 CST 1970                              |       |
|Location:                   |hdfs://localhost:9000/user/hive/warehouse_hive121/src     |       |
|Table Type:                 |MANAGED                                                   |       |
|Table Parameters:           |                                                          |       |
|  transient_lastDdlTime     |1451898360                                                |       |
|                            |                                                          |       |
|# Storage Information       |                                                          |       |
|SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe        |       |
|InputFormat:                |org.apache.hadoop.mapred.TextInputFormat                  |       |
|OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat|       |
|Num Buckets:                |-1                                                        |       |
|Bucket Columns:             |[]                                                        |       |
|Sort Columns:               |[]                                                        |       |
|Storage Desc Parameters:    |                                                          |       |
|  serialization.format      |1                                                         |       |
+----------------------------+----------------------------------------------------------+-------+
```

## How was this patch tested?

A test case is added to `HiveDDLSuite` to check command output.
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14127-desc-table

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12844.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12844

commit 89e0aea04bdeffaa69be0c19f10860d12af90953
Author: Cheng Lian <l...@databricks.com>
Date: 2016-05-02T17:07:32Z

    Implements native 'DESC TABE' DDL command

commit 718da25b489ef56e2b1092484bd6fbcdf2a547ed
Author: Cheng Lian <l...@databricks.com>
Date: 2016-05-02T17:39:24Z

    Test case

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-14346] SHOW CREATE TABLE for data sourc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12781#discussion_r61619036

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
@@ -117,101 +112,3 @@ case class ExplainCommand(
       ("Error occurred during query planning: \n" + cause.getMessage).split("\n").map(Row(_))
   }
 }
-
-/**
- * A command to list the column names for a table. This function creates a
- * [[ShowColumnsCommand]] logical plan.
- *
- * The syntax of using this command in SQL is:
- * {{{
- *   SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database];
- * }}}
- */
-case class ShowColumnsCommand(table: TableIdentifier) extends RunnableCommand {
-  // The result of SHOW COLUMNS has one column called 'result'
-  override val output: Seq[Attribute] = {
-    AttributeReference("result", StringType, nullable = false)() :: Nil
-  }
-
-  override def run(sparkSession: SparkSession): Seq[Row] = {
-    sparkSession.sessionState.catalog.getTableMetadata(table).schema.map { c =>
-      Row(c.name)
-    }
-  }
-}
-
-/**
- * A command to list the partition names of a table. If the partition spec is specified,
- * partitions that match the spec are returned. [[AnalysisException]] exception is thrown under
- * the following conditions:
- *
- * 1. If the command is called for a non partitioned table.
- * 2. If the partition spec refers to the columns that are not defined as partitioning columns.
- *
- * This function creates a [[ShowPartitionsCommand]] logical plan
- *
- * The syntax of using this command in SQL is:
- * {{{
- *   SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
- * }}}
- */
-case class ShowPartitionsCommand(
-    table: TableIdentifier,
-    spec: Option[TablePartitionSpec]) extends RunnableCommand {
-  // The result of SHOW PARTITIONS has one column called 'result'
-  override val output: Seq[Attribute] = {
-    AttributeReference("result", StringType, nullable = false)() :: Nil
-  }
-
-  private def getPartName(spec: TablePartitionSpec, partColNames: Seq[String]): String = {
-    partColNames.map { name =>
-      PartitioningUtils.escapePathName(name) + "=" + PartitioningUtils.escapePathName(spec(name))
-    }.mkString(File.separator)
-  }
-
-  override def run(sparkSession: SparkSession): Seq[Row] = {
-    val catalog = sparkSession.sessionState.catalog
-    val db = table.database.getOrElse(catalog.getCurrentDatabase)
-    if (catalog.isTemporaryTable(table)) {
-      throw new AnalysisException("SHOW PARTITIONS is not allowed on a temporary table: " +
-        s"${table.unquotedString}")
-    } else {
-      val tab = catalog.getTableMetadata(table)
-      /**
-       * Validate and throws an [[AnalysisException]] exception under the following conditions:
-       * 1. If the table is not partitioned.
-       * 2. If it is a datasource table.
-       * 3. If it is a view or index table.
-       */
-      if (tab.tableType == CatalogTableType.VIEW ||
-          tab.tableType == CatalogTableType.INDEX) {
-        throw new AnalysisException("SHOW PARTITIONS is not allowed on a view or index table: " +
-          s"${tab.qualifiedName}")
-      }
-      if (!DDLUtils.isTablePartitioned(tab)) {
-        throw new AnalysisException("SHOW PARTITIONS is not allowed on a table that is not " +
-          s"partitioned: ${tab.qualifiedName}")
-      }
-      if (DDLUtils.isDatasourceTable(tab)) {
-        throw new AnalysisException("SHOW PARTITIONS is not allowed on a datasource table: " +
-          s"${tab.qualifiedName}")
-      }
-      /**
-       * Validate the partitioning spec by making sure all the referenced columns are
-       * defined as partitioning columns in table definition. An AnalysisException exception is
-       * thrown if the partitioning spec is invalid.
-       */
-      if (spec.isDefined) {
-        val badColumns = spec.get.keySet.filterNot(tab.partitionColumns.map(_.name).contains)
-        if (badColumns.nonEmpty) {
-          throw new AnalysisException(
-            s"Non-partitioning column(s) [${badColumns.mkString(", ")}] are " +
-              s"specified for SHOW PARTITIONS")
-        }
-      }
-      val partNames =
-        catalog.listPartitions(table, spec).map(p => getPartName(p.spec, tab.partitionColumnNames))
-      partNames.map { p => Row(p) }
-    }
-  }
-}
--- End diff --
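The partition-spec validation quoted above reduces to a set difference between the spec's keys and the table's declared partition columns. A hedged Python sketch of just that check — `AnalysisException` here is a local stand-in for Spark's exception type:

```python
class AnalysisException(Exception):
    """Local stand-in for Spark's AnalysisException."""
    pass

def check_partition_spec(spec: dict, partition_columns: list) -> None:
    """Reject any spec key that is not a declared partition column,
    mirroring the filterNot/contains check in ShowPartitionsCommand."""
    bad_columns = [k for k in spec if k not in partition_columns]
    if bad_columns:
        bad = "[" + ", ".join(bad_columns) + "]"
        raise AnalysisException(
            f"Non-partitioning column(s) {bad} are specified for SHOW PARTITIONS")

check_partition_spec({"year": "2016"}, ["year", "month"])  # passes silently
try:
    check_partition_spec({"qty": "1"}, ["year", "month"])
except AnalysisException as e:
    print(e)  # Non-partitioning column(s) [qty] are specified for SHOW PARTITIONS
```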
[GitHub] spark pull request: [SPARK-14346] SHOW CREATE TABLE for data sourc...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12781

[SPARK-14346] SHOW CREATE TABLE for data source tables

## What changes were proposed in this pull request?

This PR adds a native `SHOW CREATE TABLE` DDL command for data source tables. Support for Hive tables will be added in follow-up PR(s).

To show table creation DDL for data source tables created by CTAS statements, this PR also adds partitioning and bucketing support to the normal `CREATE TABLE ... USING ...` syntax.

## How was this patch tested?

A new test suite `ShowCreateTableSuite` is added in the sql/hive package to test the new feature.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14346-show-create-table

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12781.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12781

commit bc7354026f728f06f8eaebfbc3bf05a19f2ebcd9
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-28T06:38:19Z

    Moves table commands to tables.scala

commit 092d9423d982509ffc9e490618a8bf784518c854
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-29T05:42:05Z

    SHOW CREATE TABLE for data source tables
[GitHub] spark pull request: [SPARK-14981][SQL] Throws exception if DESC is...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12759#issuecomment-215628056

retest this please
[GitHub] spark pull request: [SPARK-14981][SQL] Throws exception if DESC is...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12759

[SPARK-14981][SQL] Throws exception if DESC is specified for sorting columns

## What changes were proposed in this pull request?

Currently Spark SQL doesn't support sorting columns in descending order. However, the parser accepts the syntax and silently drops the sorting direction. This PR fixes this by throwing an exception if `DESC` is specified as the sorting direction of a sorting column.

## How was this patch tested?

A test case is added that checks the exception message for the invalid sorting order.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14981

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12759.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12759

commit 5a3aa2c55a3d7a600887e82b9663b52a366a4842
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-28T17:46:16Z

    Throws exception if DESC is specified for sorting columns
[GitHub] spark pull request: [SPARK-14346][SQL] Add PARTITIONED BY and CLUS...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12734#issuecomment-215310529

@jodersky Oh sorry, pasted the JIRA ticket summary to the PR title but forgot to add the tags. Updated!
[GitHub] spark pull request: [SPARK-14124] [SQL] [FOLLOWUP] Implement Datab...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12081#discussion_r61283810

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -693,7 +697,6 @@ class SessionCatalog(
       dropDatabase(db, ignoreIfNotExists = false, cascade = true)
     }
     tempTables.clear()
-    functionRegistry.clear()
--- End diff --

Is this related to the `current_database()` issue you and @yhuai discussed previously?
[GitHub] spark pull request: [SPARK-14124] [SQL] [FOLLOWUP] Implement Datab...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12081#discussion_r61282206

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -118,7 +148,7 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
     assert(db1 == CatalogDatabase(
       dbNameWithoutBackTicks,
       "",
-      System.getProperty("java.io.tmpdir") + File.separator + s"$dbNameWithoutBackTicks.db",
+      appendTrailingSlash(System.getProperty("java.io.tmpdir")) + s"$dbNameWithoutBackTicks.db",
--- End diff --

According to the PR description, we should append `File.separator` rather than a slash, right?
[GitHub] spark pull request: [SPARK-14124] [SQL] [FOLLOWUP] Implement Datab...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12081#discussion_r61278961

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -693,7 +697,6 @@ class SessionCatalog(
       dropDatabase(db, ignoreIfNotExists = false, cascade = true)
     }
     tempTables.clear()
-    functionRegistry.clear()
--- End diff --

Why do we remove this?
[GitHub] spark pull request: [SPARK-14127][SQL][WIP] Describe table
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12460#issuecomment-215115938

@dilipbiswal One purpose of re-implementing all DDL statements as native Spark SQL commands is to minimize dependencies on Hive, so that we can move Hive into a separate data source some day. That said, we really don't want these new DDL commands to rely on classes like `HiveClient`, `HiveClientImpl`, or `HiveSessionCatalog`. When you need to access Hive table metadata, you should access it via `CatalogTable` rather than depending on any Hive data structure.
[GitHub] spark pull request: Add PARTITION BY and BUCKET BY clause for "CRE...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12734

Add PARTITION BY and BUCKET BY clause for "CREATE TABLE ... USING ..." syntax

## What changes were proposed in this pull request?

Currently, we can only create persisted partitioned and/or bucketed data source tables using the Dataset API, but not using SQL DDL. This PR implements the following syntax to add partitioning and bucketing support to the SQL DDL:

```sql
CREATE TABLE <table> USING <provider>
[OPTIONS (<key1> <value1>, <key2> <value2>, ...)]
[PARTITIONED BY (col1, col2, ...)]
[CLUSTERED BY (col1, col2, ...) [SORTED BY (col1, col2, ...)] INTO <n> BUCKETS]
```

## How was this patch tested?

Test cases are added in `MetastoreDataSourcesSuite` to check the newly added syntax.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14954

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12734.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12734

commit f51300c6ed22a54ff1dc49262cd046774166d957
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-27T13:06:53Z

    Add PARTITION BY and BUCKET BY clause for "CREATE TABLE ... USING ..." syntax

commit af973d64cf3e1079e6c8a185d826e2e43cb06532
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-27T13:33:13Z

    Moves test case to MetastoreDataSourcesSuite

    Also checks for metastore table properties
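The `CLUSTERED BY ... INTO ... BUCKETS` clause distributes rows across a fixed number of bucket files by hashing the clustering columns. The Python sketch below illustrates the idea only; CRC32 is used here to keep the example deterministic, whereas Spark uses its own internal hash function, so real bucket assignments differ:

```python
import zlib

def bucket_id(clustering_values: tuple, num_buckets: int) -> int:
    """Assign a row to a bucket from its clustering-column values.
    CRC32 is only an illustrative, deterministic choice here; Spark's
    bucketing uses its own hash, so actual assignments differ."""
    key = "\x00".join(str(v) for v in clustering_values).encode()
    return zlib.crc32(key) % num_buckets

# Rows with equal clustering values always land in the same bucket.
rows = [("alice", 3), ("bob", 7), ("alice", 9)]
buckets = [bucket_id((user,), 4) for user, _ in rows]
assert buckets[0] == buckets[2]  # same user -> same bucket
```

Because two tables bucketed the same way agree on bucket assignment, a join on the clustering columns can match bucket files pairwise instead of shuffling all rows, which is the main payoff of persisting bucketed tables.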
[GitHub] spark pull request: [SPARK-14445][SQL] Supports native execution o...
Github user liancheng closed the pull request at: https://github.com/apache/spark/pull/12703
[GitHub] spark pull request: [SPARK-14445][SQL] Supports native execution o...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12703#issuecomment-214938531

@dilipbiswal also updated #1 and passed Jenkins. So I merged it. Closing this one.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/1#issuecomment-214938170

@dilipbiswal Great! LGTM now, I'm merging this one to master. Thanks!
[GitHub] spark pull request: [SPARK-14445][SQL] Supports native execution o...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12703#discussion_r61185701

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala ---
@@ -247,4 +259,105 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
     hiveContext.sessionState.hadoopConf.set("fs.default.name", originalFsName)
   }
 }
+
+  test("show columns") {
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab3"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN default.parquet_tab3"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab3 FROM default"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab4 IN default"),
+      Row("price") :: Row("qty") :: Row("year") :: Row("month") :: Nil)
+
+    val message = intercept[NoSuchTableException] {
+      sql("SHOW COLUMNS IN badtable FROM default")
+    }.getMessage
+    assert(message.contains("badtable not found in database"))
+  }
+
+  test("show partitions - show everything") {
+    checkAnswer(
+      sql("show partitions parquet_tab4"),
+      Row("year=2015/month=1") ::
+        Row("year=2015/month=2") ::
+        Row("year=2016/month=2") ::
+        Row("year=2016/month=3") :: Nil)
+
+    checkAnswer(
+      sql("show partitions default.parquet_tab4"),
+      Row("year=2015/month=1") ::
+        Row("year=2015/month=2") ::
+        Row("year=2016/month=2") ::
+        Row("year=2016/month=3") :: Nil)
+  }
+
--- End diff --

Oh yea, I should add that. Thanks!
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/1#issuecomment-214803024

Hi @dilipbiswal, we did a bunch of major refactoring on the master branch, and it's pretty close to the 2.0 code freeze, so I took it over based on your version and opened PR #12703. Would you mind having a look at it? Thanks.
[GitHub] spark pull request: [SPARK-14445][SQL] Supports native execution o...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12703

[SPARK-14445][SQL] Supports native execution of SHOW COLUMNS and SHOW PARTITIONS

(This PR is mostly a rebased version of PR #1 by @dilipbiswal with all pending PR review comments addressed. The committer who merges this PR, please attribute it to @dilipbiswal.)

## What changes were proposed in this pull request?

This PR adds native execution of the `SHOW COLUMNS` and `SHOW PARTITIONS` commands.

Command syntax:

```sql
SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database]
SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
```

## How was this patch tested?

Added test cases in `HiveCommandSuite` to verify execution and in `DDLCommandSuite` to verify plans.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-14445-show-columns-partitions

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12703.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12703

commit b0edb3dd09bac2d8c44235f91bc360ffa831e074
Author: Dilip Biswal <dbis...@us.ibm.com>
Date: 2016-04-01T17:25:41Z

    [SPARK-14445] Support native execution of SHOW COLUMNS and SHOW PARTITIONS

commit a13db610805f9ec7817370d393a16260bfabd543
Author: Dilip Biswal <dbis...@us.ibm.com>
Date: 2016-04-14T22:14:18Z

    Andrew's comments

commit c0985a8a326e6c4af13bad9fb02e7b8f2b2171d8
Author: Dilip Biswal <dbis...@us.ibm.com>
Date: 2016-04-15T06:19:20Z

    test fix

commit 6cf6d954a799b9690a7f4a1b4a2a706207ea03e4
Author: Cheng Lian <l...@databricks.com>
Date: 2016-04-26T15:26:08Z

    Fixes issues introduced while rebasing
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r61011875

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -424,7 +424,7 @@ private[sql] object PartitioningUtils {
     path.foreach { c =>
       if (needsEscaping(c)) {
         builder.append('%')
-        builder.append(f"${c.asInstanceOf[Int]}%02x")
+        builder.append(f"${c.asInstanceOf[Int]}%02X")
--- End diff --

Makes sense. Thanks for the explanation.
[GitHub] spark pull request: [SPARK-14346][SQL] Show Create Table (Native)
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12579#issuecomment-214419087

test this please
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60937575

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala ---
@@ -36,12 +38,22 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
          |STORED AS PARQUET
          |TBLPROPERTIES('prop1Key'="prop1Val", '`prop2Key`'="prop2Val")
        """.stripMargin)
+    sql("CREATE TABLE parquet_tab3(col1 int, `col 2` int)")
+    sql("CREATE TABLE parquet_tab4 (price int, qty int) partitioned by (year int, month int)")
+    sql("INSERT INTO parquet_tab4 PARTITION(year = 2015, month=1) SELECT 1,1")
+    sql("INSERT INTO parquet_tab4 PARTITION(year = 2015, month=2) SELECT 2,2")
+    sql("INSERT INTO parquet_tab4 PARTITION(year = 2016, month=2) SELECT 3,3")
+    sql("INSERT INTO parquet_tab4 PARTITION(year = 2016, month=3) SELECT 3,3")
--- End diff --

Nit: Spaces around `=`
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60937498

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala ---
@@ -17,11 +17,13 @@
 package org.apache.spark.sql.hive.execution

-import org.apache.spark.sql.{AnalysisException, QueryTest, Row}
-import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.{AnalysisException, QueryTest, Row, SaveMode}
+import org.apache.spark.sql.catalyst.analysis.NoSuchTableException
+import org.apache.spark.sql.hive.test.{TestHive, TestHiveSingleton}
 import org.apache.spark.sql.test.SQLTestUtils

 class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
+  import testImplicits._

   protected override def beforeAll(): Unit = {
--- End diff --

Please help fix this indentation, thanks!
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60934317

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -424,7 +424,7 @@ private[sql] object PartitioningUtils {
     path.foreach { c =>
       if (needsEscaping(c)) {
         builder.append('%')
-        builder.append(f"${c.asInstanceOf[Int]}%02x")
+        builder.append(f"${c.asInstanceOf[Int]}%02X")
--- End diff --

Why is this change necessary?
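For context, the `%02x` vs `%02X` format specifier only changes the case of the hex digits emitted in escape sequences. The following is a minimal, self-contained sketch (a hypothetical `escapePathName` with an illustrative escape set, not Spark's exact implementation) that makes the difference visible:

```scala
// Minimal sketch of Hive-style path escaping. The escape set below is
// illustrative only; Spark's needsEscaping covers a different, larger set.
val charsToEscape: Set[Char] = "\"#%'*/:=?\\{}[]^".toSet

def escapePathName(path: String, upperCaseHex: Boolean): String = {
  val builder = new StringBuilder
  path.foreach { c =>
    if (charsToEscape.contains(c) || c < ' ') {
      // Percent-encode: '%' followed by the two-digit hex code of the char.
      builder.append('%')
      builder.append(if (upperCaseHex) f"${c.toInt}%02X" else f"${c.toInt}%02x")
    } else {
      builder.append(c)
    }
  }
  builder.toString
}

println(escapePathName("a=b", upperCaseHex = false))  // a%3db
println(escapePathName("a=b", upperCaseHex = true))   // a%3Db
```

Uppercase hex matches what Hive itself writes into partition directory names, which is presumably the motivation being asked about here.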
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60934120

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
@@ -423,6 +426,102 @@ case class ShowTablePropertiesCommand(
 }

 /**
+ * A command for users to list the column names for a table. This function creates a
+ * [[ShowColumnsCommand]] logical plan.
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database];
+ * }}}
+ */
+case class ShowColumnsCommand(table: TableIdentifier) extends RunnableCommand {
+  // The result of SHOW COLUMNS has one column called 'result'
+  override val output: Seq[Attribute] = {
+    AttributeReference("result", StringType, nullable = false)() :: Nil
+  }
+
+  override def run(sqlContext: SQLContext): Seq[Row] = {
+    sqlContext.sessionState.catalog.getTableMetadata(table).schema.map { c =>
+      Row(c.name)
+    }
+  }
+}
+
+/**
+ * A command for users to list the partition names of a table. If the partition spec is specified,
+ * partitions that match the spec are returned. [[AnalysisException]] exception is thrown under
+ * the following conditions:
+ *
+ * 1. If the command is called for a non partitioned table.
+ * 2. If the partition spec refers to the columns that are not defined as partitioning columns.
+ *
+ * This function creates a [[ShowPartitionsCommand]] logical plan
+ *
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW PARTITIONS [db_name.]table_name [PARTITION(partition_spec)]
+ * }}}
+ */
+case class ShowPartitionsCommand(
+    table: TableIdentifier,
+    spec: Option[TablePartitionSpec]) extends RunnableCommand {
+  // The result of SHOW PARTITIONS has one column called 'result'
+  override val output: Seq[Attribute] = {
+    AttributeReference("result", StringType, nullable = false)() :: Nil
+  }
+
+  def getPartName(spec: TablePartitionSpec): String = {
+    spec.map {s =>
+      PartitioningUtils.escapePathName(s._1) + "=" + PartitioningUtils.escapePathName(s._2)
+    }.mkString("/")
+  }
+  override def run(sqlContext: SQLContext): Seq[Row] = {
+    val catalog = sqlContext.sessionState.catalog
+    val db = table.database.getOrElse(catalog.getCurrentDatabase)
+    if (catalog.isTemporaryTable(table)) {
+      throw new AnalysisException("SHOW PARTITIONS is not allowed on a temporary table: " +
+        s"${table.unquotedString}")
+    } else {
+      val tab = catalog.getTableMetadata(table)
+      /**
+       * Validate and throws an [[AnalysisException]] exception under the following conditions:
+       * 1. If the table is not partitioned.
+       * 2. If it is a datasource table.
+       * 3. If it is a view or index table.
+       */
+      if (tab.tableType == CatalogTableType.VIRTUAL_VIEW ||
+        tab.tableType == CatalogTableType.INDEX_TABLE) {
+        throw new AnalysisException("SHOW PARTITIONS is not allowed on a view or index table: " +
+          s"${tab.qualifiedName}")
+      }
+      if (!DDLUtils.isTablePartitioned(tab)) {
+        throw new AnalysisException("SHOW PARTITIONS is not allowed on a table that is not " +
+          s"partitioned: ${tab.qualifiedName}")
+      }
+      if (DDLUtils.isDatasourceTable(tab)) {
+        throw new AnalysisException("SHOW PARTITIONS is not allowed on a datasource table: " +
+          s"${tab.qualifiedName}")
+      }
+      /**
+       * Validate the partitioning spec by making sure all the referenced columns are
+       * defined as partitioning columns in table definition. An AnalysisException exception is
+       * thrown if the partitioning spec is invalid.
+       */
+      if (spec.isDefined) {
+        val badColumns = spec.get.keySet.filterNot(tab.partitionColumns.map(_.name).contains)
+        if (badColumns.nonEmpty) {
+          throw new AnalysisException(
+            s"Non-partitioned column(s) [${badColumns.mkString(", ")}] are " +
--- End diff --

Nit: Non-partitioning
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60931135

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala ---
@@ -122,4 +134,105 @@ class HiveCommandSuite extends QueryTest with SQLTestUtils with TestHiveSingleto
     checkAnswer(sql("SHOW TBLPROPERTIES parquet_temp"), Nil)
   }
 }
+
+  test("show columns") {
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab3"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN default.parquet_tab3"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab3 FROM default"),
+      Row("col1") :: Row("col 2") :: Nil)
+
+    checkAnswer(
+      sql("SHOW COLUMNS IN parquet_tab4 IN default"),
+      Row("price") :: Row("qty") :: Row("year") :: Row("month") :: Nil)
+
+    val message = intercept[NoSuchTableException] {
+      sql("SHOW COLUMNS IN badtable FROM default")
+    }.getMessage
+    assert(message.contains("badtable not found in database"))
+  }
+
+  test("show partitions - show everything") {
+    checkAnswer(
+      sql("show partitions parquet_tab4"),
+      Row("year=2015/month=1") ::
+        Row("year=2015/month=2") ::
+        Row("year=2016/month=2") ::
+        Row("year=2016/month=3") :: Nil)
--- End diff --

As a simple experiment, order is preserved for maps with fewer than 5 elements:

```scala
scala> Map("year" -> 1) foreach println
(year,1)

scala> Map("year" -> 1, "month" -> 2) foreach println
(year,1)
(month,2)

scala> Map("year" -> 1, "month" -> 2, "day" -> 3) foreach println
(year,1)
(month,2)
(day,3)

scala> Map("year" -> 1, "month" -> 2, "day" -> 3, "hour" -> 4) foreach println
(year,1)
(month,2)
(day,3)
(hour,4)

scala> Map("year" -> 1, "month" -> 2, "day" -> 3, "hour" -> 4, "minute" -> 5) foreach println
(minute,5)
(year,1)
(hour,4)
(day,3)
(month,2)
```
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60930580

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveCommandSuite.scala ---
+  test("show partitions - show everything") {
+    checkAnswer(
+      sql("show partitions parquet_tab4"),
+      Row("year=2015/month=1") ::
+        Row("year=2015/month=2") ::
+        Row("year=2016/month=2") ::
+        Row("year=2016/month=3") :: Nil)
--- End diff --

Could you please add test cases where the test table contains >= 5 partition columns? I believe the Scala standard library provides specialized classes for maps containing <= 4 elements, which masks the out-of-order bug.
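The claim about specialized map classes can be checked in isolation. The sketch below (plain Scala, no Spark) shows why an ordering test needs at least five partition columns: up to four entries, `scala.collection.immutable.Map` uses specialized `Map1` through `Map4` classes that happen to iterate in insertion order; from five entries on it switches to a hash-based map whose iteration order is unrelated to insertion order.

```scala
// Up to 4 entries: specialized Map1..Map4, iteration follows insertion order.
val four = Map("year" -> 1, "month" -> 2, "day" -> 3, "hour" -> 4)
println(four.keys.toList)  // List(year, month, day, hour)

// 5+ entries: a hash-based Map; iteration order becomes an implementation
// detail of the hashing, not the insertion order.
val five = four + ("minute" -> 5)
println(five.keys.toList)  // some hash-determined order

// This is why an order-dropping bug in partition-name rendering is
// accidentally masked by tables with 4 or fewer partition columns.
```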
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60928725

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+  def getPartName(spec: TablePartitionSpec): String = {
+    spec.map {s =>
+      PartitioningUtils.escapePathName(s._1) + "=" + PartitioningUtils.escapePathName(s._2)
+    }.mkString("/")
--- End diff --

@rxin I'm thinking it might not be a good idea to use `Map` to represent the partition spec in `CatalogTablePartition`, since it doesn't preserve partition column order. What do you think?
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60927623

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+  def getPartName(spec: TablePartitionSpec): String = {
+    spec.map {s =>
+      PartitioningUtils.escapePathName(s._1) + "=" + PartitioningUtils.escapePathName(s._2)
+    }.mkString("/")
+  }
+  override def run(sqlContext: SQLContext): Seq[Row] = {
--- End diff --

New line here please.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60927757

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+    if (catalog.isTemporaryTable(table)) {
+      throw new AnalysisException("SHOW PARTITIONS is not allowed on a temporary table: " +
+          s"${table.unquotedString}")
--- End diff --

Wrong indentation here.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60927505

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+  def getPartName(spec: TablePartitionSpec): String = {
--- End diff --

Mark this as private.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60927399

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+  def getPartName(spec: TablePartitionSpec): String = {
+    spec.map {s =>
+      PartitioningUtils.escapePathName(s._1) + "=" + PartitioningUtils.escapePathName(s._2)
+    }.mkString("/")
--- End diff --

I don't think we can simply join all parts with `/` here, since `TablePartitionSpec` is a `Map` rather than a `Seq`: the order of partition columns is not preserved.
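One order-preserving alternative, sketched below under the assumption that the table's partition columns are available as an ordered `Seq[String]` (this mirrors the shape the command later took; path escaping is omitted to keep the sketch self-contained): iterate the ordered column names and look each one up in the spec `Map`, so the `Map`'s own iteration order never matters.

```scala
// spec:         unordered partition spec, e.g. Map("month" -> "1", "year" -> "2015")
// partColNames: partition columns in table-definition order, e.g. Seq("year", "month")
def getPartName(spec: Map[String, String], partColNames: Seq[String]): String =
  partColNames
    .map(name => s"$name=${spec(name)}")  // real code would escape both sides
    .mkString("/")

println(getPartName(Map("month" -> "1", "year" -> "2015"), Seq("year", "month")))
// year=2015/month=1
```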
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60926964

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+  def getPartName(spec: TablePartitionSpec): String = {
+    spec.map {s =>
--- End diff --

Space after `{`
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60926196

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+ * The syntax of using this command in SQL is:
+ * {{{
+ *   SHOW COLUMNS (FROM | IN) table_identifier [(FROM | IN) database];
+ * }}}
--- End diff --

We don't need to mention the SQL syntax here.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60925953

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala ---
+ * A command for users to list the column names for a table. This function creates a
+ * [[ShowColumnsCommand]] logical plan.
--- End diff --

"A command to list column names of a given table."
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60923441

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala ---
@@ -277,9 +277,18 @@ class InMemoryCatalog extends ExternalCatalog {
     catalog(db).tables(table).partitions(spec)
   }

+  /**
+   * List the metadata of all partitions that belong to the specified table, assuming it exists.
+   *
+   * A partial partition spec may optionally be provided to filter the partitions returned.
+   * For instance, if there exist partitions (a='1', b='2'), (a='1', b='3') and (a='2', b='4'),
+   * then a partial spec of (a='1') will return the first two only.
+   * TODO: Currently partialSpec is not used for memory catalog and it returns all the partitions.
--- End diff --

Can we throw an exception instead of returning all the partitions if a partition spec is given to an in-memory catalog? Silently returning a wrong answer can be dangerous.
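The fail-fast behavior suggested above can be sketched as follows (a hypothetical standalone helper with simplified types, not the actual `InMemoryCatalog` method signature): reject a partial spec outright instead of ignoring it, so callers cannot mistake the unfiltered result for a filtered one.

```scala
// allPartitions: every partition of the table, each as its spec Map.
// partialSpec:   an optional partial partition spec to filter by.
def listPartitionNames(
    allPartitions: Seq[Map[String, String]],
    partialSpec: Option[Map[String, String]]): Seq[Map[String, String]] =
  partialSpec match {
    case Some(_) =>
      // Fail fast: this catalog cannot filter by a partial spec yet, and
      // silently returning every partition would be a wrong answer.
      throw new UnsupportedOperationException(
        "listPartitions with a partial partition spec is not supported")
    case None => allPartitions
  }
```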
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60923122

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala ---
@@ -427,13 +427,18 @@ class SessionCatalog(
   }

   /**
-   * List all partitions in a table, assuming it exists.
-   * If no database is specified, assume the table is in the current database.
+   * List the metadata of all partitions that belong to the specified table, assuming it exists.
+   *
+   * A partial partition spec may optionally be provided to filter the partitions returned.
+   * For instance, if there exist partitions (a='1', b='2'), (a='1', b='3') and (a='2', b='4'),
+   * then a partial spec of (a='1') will return the first two only.
    */
-  def listPartitions(tableName: TableIdentifier): Seq[CatalogTablePartition] = {
+  def listPartitions(
--- End diff --

Nit: Missing `override`
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60923039 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala --- @@ -427,13 +427,18 @@ class SessionCatalog( } /** - * List all partitions in a table, assuming it exists. - * If no database is specified, assume the table is in the current database. + * List the metadata of all partitions that belong to the specified table, assuming it exists. + * + * A partial partition spec may optionally be provided to filter the partitions returned. + * For instance, if there exist partitions (a='1', b='2'), (a='1', b='3') and (a='2', b='4'), + * then a partial spec of (a='1') will return the first two only. --- End diff -- Same as above.
[GitHub] spark pull request: [SPARK-14445][SQL] Support native execution of...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/1#discussion_r60922965 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala --- @@ -277,9 +277,18 @@ class InMemoryCatalog extends ExternalCatalog { catalog(db).tables(table).partitions(spec) } + /** + * List the metadata of all partitions that belong to the specified table, assuming it exists. + * + * A partial partition spec may optionally be provided to filter the partitions returned. + * For instance, if there exist partitions (a='1', b='2'), (a='1', b='3') and (a='2', b='4'), + * then a partial spec of (a='1') will return the first two only. --- End diff -- Nit: No need to repeat the comment here since it's automatically inherited from the parent class/trait.
[GitHub] spark pull request: [SPARK-14875][SQL] Makes OutputWriterFactory.n...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12652#issuecomment-214290243 Merging to master.
[GitHub] spark pull request: [SPARK-14875][SQL] Makes OutputWriterFactory.n...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12652 [SPARK-14875][SQL] Makes OutputWriterFactory.newInstance public ## What changes were proposed in this pull request? This method was accidentally made `private[sql]` in Spark 2.0. This PR makes it public again. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-14875 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12652.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12652 commit 35be32f702f5d3f3a68ceb0d01dbb5fb2e1adb41 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-23T14:56:51Z Makes OutputWriterFactory.newInstance public
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-213659267 LGTM pending Jenkins.
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-213658964 add to whitelist
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-213658842 test this please
[GitHub] spark pull request: [SPARK-14843][ML] Fix encoding error in LibSVM...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12611#issuecomment-213514108 LGTM. Merging to master, thanks!
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12527#discussion_r60695719 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -131,4 +134,23 @@ class FileScanRDD( } override protected def getPartitions: Array[RDDPartition] = filePartitions.toArray + + override protected def getPreferredLocations(split: RDDPartition): Seq[String] = { +val files = split.asInstanceOf[FilePartition].files + +// Computes total number of bytes can be retrieved from each host. +val hostToNumBytes = mutable.HashMap.empty[String, Long] +files.foreach { file => + file.locations.filter(_ != "localhost").foreach { host => --- End diff -- We should filter them out. A partition that doesn't have any preferred locations can be bundled with any other tasks and scheduled to any executor. But once it's marked with "localhost", delay scheduling may be triggered because its tasks have different host names than the other tasks. Furthermore, "localhost" isn't a valid location for the `DAGScheduler` when deciding which executors should run the tasks.
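The filtering discussed above can be sketched in isolation. `FileBlock` below is a hypothetical stand-in for `PartitionedFile` with per-host byte counts; the real logic lives in `FileScanRDD.getPreferredLocations`:

```scala
import scala.collection.mutable

object PreferredLocationsSketch {
  // Hypothetical stand-in for a file slice, carrying (host, bytes served) pairs.
  case class FileBlock(hostToBytes: Seq[(String, Long)])

  def preferredLocations(files: Seq[FileBlock]): Seq[String] = {
    // Total number of bytes that can be read from each host. "localhost" is
    // dropped because it is meaningless to the DAGScheduler and would trigger
    // delay scheduling against tasks that carry real host names.
    val hostToNumBytes = mutable.HashMap.empty[String, Long]
    for (file <- files; (host, bytes) <- file.hostToBytes if host != "localhost") {
      hostToNumBytes(host) = hostToNumBytes.getOrElse(host, 0L) + bytes
    }
    // Keep the (up to) three hosts serving the most data.
    hostToNumBytes.toSeq.sortBy(-_._2).take(3).map(_._1)
  }
}
```

A partition whose only location was "localhost" ends up with no preferred locations at all, which lets the scheduler place it anywhere immediately.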
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213167317 retest this please
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-213034425 The root cause of the deadlock has been found. Essentially, we should prevent "localhost" from being returned as a `FileScanRDD` preferred location. Here's a detailed description of the whole process: 1. The test case involves a shuffle, which results in a `ShuffleRowRDD`, whose preferred locations are determined by the locations of the block manager that serves the corresponding map output blocks. In the case of a local test, the only preferred location string is the IP of the block manager. 2. In the case of local testing, `FileScanRDD.preferredLocations` always returns "localhost". 3. As a result, task set `ts1` derived from the `ShuffleRowRDD` and task set `ts2` derived from the `FileScanRDD` have different locality preferences. 4. After job submission, `DAGScheduler` first schedules `ts1`. While trying to schedule `ts2`, delay scheduling is triggered because `ts1` and `ts2` have different preferred locations. By default, `DAGScheduler` waits for 3s before trying `ts2` again. 5. 3s is long enough for all tasks in `ts1` to finish. However, `LocalBackend` doesn't revive offers periodically like other scheduler backends. It only revives offers when tasks are submitted, finish, or fail. Thus `ts2` never gets an opportunity to be scheduled again, and the submitted job never finishes. The only factor that is still unclear is how the number of buckets (which affects the number of submitted tasks) interacts with the above process. The fix for this issue is simple: just filter out all "localhost" entries in `FileScanRDD.preferredLocations()`, since "localhost" doesn't make sense as a preferred executor location. Actually, this is exactly the last step of what `NewHadoopRDD.preferredLocations()` does.
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-212696111 Yea, agree. Should either prove this is an existing bug (so that it can be fixed in another PR), or fix it if it's a bug introduced by this change.
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12527#issuecomment-212425913 @yhuai This one can probably avoid the deadlock since I worked around it [here][1]. Also check [this comment][2] for more information. Not sure whether we should merge this one since I haven't figured out whether the deadlock is caused by bug(s) in this PR or in `DAGScheduler`. [1]: https://github.com/apache/spark/pull/12527/files#diff-75164dfeac2c1507ae828d2ba5529470R304 [2]: https://github.com/apache/spark/pull/12153#issuecomment-212424785
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212424940 I'm closing this one for #12527, which is a rebased version of this one.
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng closed the pull request at: https://github.com/apache/spark/pull/12153
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-212424785 All the timeouts in the Jenkins builds were due to a deadlock in `DAGScheduler`, and can be reliably reproduced locally by running the following test case: > BucketedReadSuite.only shuffle one side when 2 bucketed tables have different bucket keys This test case creates two bucketed tables, both with 8 buckets, and then joins them. Reducing 8 to 5 eliminates the deadlock, but I haven't figured out the real reason behind the deadlock. The deadlock also disappears if I remove `FileScanRDD.preferredLocations()`. Maybe too many tasks were scheduled to the same place and exhausted some thread pool?
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12527 [SPARK-14369][SQL] Locality support for FileScanRDD (This PR is a rebased version of PR #12153.) ## What changes were proposed in this pull request? This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts: 1. Block location lookup Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog` and converts all `FileStatus`es into `LocatedFileStatus`es. Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special case these file systems. 2. Selecting preferred locations For each `FilePartition`, we pick the top 3 locations containing the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs. ## How was this patch tested? Tested by overriding the default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations. 
You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-14369-locality-rebased Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12527.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12527 commit 0358e33c1a796b7875f932c4d4aee28db505f696 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-20T13:25:51Z Rebased PR #12153
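Step 1 of the strategy above (block location lookup) can be sketched against a stubbed file-system interface. All types below are simplified stand-ins; the real code goes through Hadoop's `FileSystem` API and builds `LocatedFileStatus` instances during file listing:

```scala
object BlockLocationSketch {
  // Simplified stand-ins for the Hadoop types.
  case class BlockLocation(hosts: Seq[String], offset: Long, length: Long)
  case class FileStatus(path: String, len: Long)
  case class LocatedFileStatus(status: FileStatus, blocks: Seq[BlockLocation])

  // Stub for the piece of the file-system API we need.
  trait StubFileSystem {
    def getFileBlockLocations(status: FileStatus, start: Long, len: Long): Seq[BlockLocation]
  }

  // Upgrade plain FileStatuses to LocatedFileStatuses in one listing pass, so
  // later planning can read block locations without extra NameNode round trips.
  def locateAll(fs: StubFileSystem, statuses: Seq[FileStatus]): Seq[LocatedFileStatus] =
    statuses.map(s => LocatedFileStatus(s, fs.getFileBlockLocations(s, 0, s.len)))
}
```

A mocked `StubFileSystem` returning fixed block locations is enough to exercise the lookup path, which mirrors how the PR itself was tested (a mocked `FileSystem` for `file:///`).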
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12319#issuecomment-212323419 @rajeshbalamohan Sorry for the late review and thanks for working on this! My major concern is that the newly added `OrcRecordReader` should live in spark-hive rather than spark-sql. Otherwise it looks good except for a few styling issues.
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60368033 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala --- @@ -337,10 +340,12 @@ private[orc] case class OrcTableScan( rdd.mapPartitionsWithInputSplit { case (split: OrcSplit, iterator) => val writableIterator = iterator.map(_._2) + val maybeStructOI = OrcFileOperator.getObjectInspector( +split.getPath.toString, Some(conf)) --- End diff -- Nit: No need to wrap here.
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60367904 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala --- @@ -160,19 +160,22 @@ private[sql] class DefaultSource val job = Job.getInstance(conf) FileInputFormat.setInputPaths(job, file.filePath) - val inputFormat = new OrcNewInputFormat val fileSplit = new FileSplit( new Path(new URI(file.filePath)), file.start, file.length, Array.empty ) - - val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0) - val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId) - inputFormat.createRecordReader(fileSplit, hadoopAttemptContext) + // Create RecordReader here. This will help in getting + // ObjectInspector during recordReader creation itself and can + // avoid NN call in unwrapOrcStructs to get ObjectInspector for the + // file. Would be helpful for partitioned datasets. + new OrcRecordReader(OrcFile.createReader(new Path(new URI(file +.filePath)), OrcFile.readerOptions(conf)), conf, fileSplit.getStart(), +fileSplit.getLength()) } // Unwraps `OrcStruct`s to `UnsafeRow`s -val unsafeRowIterator = OrcRelation.unwrapOrcStructs( - file.filePath, conf, requiredSchema, new RecordReaderIterator[OrcStruct](orcRecordReader) +val unsafeRowIterator = OrcRelation.unwrapOrcStructs(conf, requiredSchema, + Some(orcRecordReader.getObjectInspector.asInstanceOf[StructObjectInspector]), + new RecordReaderIterator[OrcStruct](orcRecordReader) --- End diff -- Nit: Usually we wrap function arguments like this:
```scala
func(
  arg1, arg2, ...)
```
or
```scala
func(
  arg1,
  arg2,
  ...)
```
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60365888 --- Diff: sql/core/pom.xml --- @@ -128,6 +128,10 @@ xbean-asm5-shaded test + + org.spark-project.hive + hive-exec + --- End diff -- Why do we need to add this dependency to spark-core? The ORC data source lives in spark-hive. Can we move the newly added `OrcRecordReader.java` to spark-hive and remove this dependency?
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60365709 --- Diff: sql/core/src/main/java/org/apache/hadoop/hive/ql/io/orc/OrcRecordReader.java --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hadoop.hive.ql.io.orc; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector; +import org.apache.hadoop.io.NullWritable; +import org.apache.hadoop.mapreduce.InputSplit; +import org.apache.hadoop.mapreduce.RecordReader; +import org.apache.hadoop.mapreduce.TaskAttemptContext; + +import java.io.IOException; +import java.util.List; + +/** + * This is based on + * {@link org.apache.hadoop.hive.ql.io.orc.OrcNewInputFormat.OrcRecordReader}. --- End diff -- Would be nice to mention the specific version to which the original class belongs.
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60365527 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala --- @@ -160,19 +160,22 @@ private[sql] class DefaultSource val job = Job.getInstance(conf) FileInputFormat.setInputPaths(job, file.filePath) - val inputFormat = new OrcNewInputFormat val fileSplit = new FileSplit( new Path(new URI(file.filePath)), file.start, file.length, Array.empty ) - - val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0) - val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId) - inputFormat.createRecordReader(fileSplit, hadoopAttemptContext) + // Create RecordReader here. This will help in getting + // ObjectInspector during recordReader creation itself and can + // avoid NN call in unwrapOrcStructs to get ObjectInspector for the + // file. Would be helpful for partitioned datasets. --- End diff -- It would be good to emphasize that this `OrcRecordReader` is a customized one. People who are not familiar with ORC may think it's the default ORC record reader.
[GitHub] spark pull request: [SPARK-14551][SQL] Reduce number of NameNode c...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12319#discussion_r60365406 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala --- @@ -160,19 +160,22 @@ private[sql] class DefaultSource val job = Job.getInstance(conf) FileInputFormat.setInputPaths(job, file.filePath) - val inputFormat = new OrcNewInputFormat val fileSplit = new FileSplit( new Path(new URI(file.filePath)), file.start, file.length, Array.empty ) - - val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0) - val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId) - inputFormat.createRecordReader(fileSplit, hadoopAttemptContext) + // Create RecordReader here. This will help in getting + // ObjectInspector during recordReader creation itself and can + // avoid NN call in unwrapOrcStructs to get ObjectInspector for the + // file. Would be helpful for partitioned datasets. + new OrcRecordReader(OrcFile.createReader(new Path(new URI(file +.filePath)), OrcFile.readerOptions(conf)), conf, fileSplit.getStart(), +fileSplit.getLength()) --- End diff -- Nit: Please split this call into the following two for better readability:
```scala
val orcReader = OrcFile.createReader(new Path(new URI(file.filePath)), OrcFile.readerOptions(conf))
new OrcRecordReader(orcReader, conf, fileSplit.getStart(), fileSplit.getLength())
```
[GitHub] spark pull request: [SPARK-14637][SQL] object expressions cleanup
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12399#discussion_r60267607 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects.scala --- @@ -130,60 +126,52 @@ case class Invoke( case _ => None } - lazy val unboxer = (dataType, method.map(_.getReturnType.getName).getOrElse("")) match { -case (IntegerType, "java.lang.Object") => (s: String) => - s"((java.lang.Integer)$s).intValue()" -case (LongType, "java.lang.Object") => (s: String) => - s"((java.lang.Long)$s).longValue()" -case (FloatType, "java.lang.Object") => (s: String) => - s"((java.lang.Float)$s).floatValue()" -case (ShortType, "java.lang.Object") => (s: String) => - s"((java.lang.Short)$s).shortValue()" -case (ByteType, "java.lang.Object") => (s: String) => - s"((java.lang.Byte)$s).byteValue()" -case (DoubleType, "java.lang.Object") => (s: String) => - s"((java.lang.Double)$s).doubleValue()" -case (BooleanType, "java.lang.Object") => (s: String) => - s"((java.lang.Boolean)$s).booleanValue()" -case _ => identity[String] _ - } - override def doGenCode(ctx: CodegenContext, ev: ExprCode): String = { val javaType = ctx.javaType(dataType) val obj = targetObject.genCode(ctx) val argGen = arguments.map(_.genCode(ctx)) val argString = argGen.map(_.value).mkString(", ") -// If the function can return null, we do an extra check to make sure our null bit is still set -// correctly. -val objNullCheck = if (ctx.defaultValue(dataType) == "null") { - s"boolean ${ev.isNull} = ${ev.value} == null;" +val callFunc = if (method.isDefined && method.get.getReturnType.isPrimitive) { + s"${obj.value}.$functionName($argString)" } else { - ev.isNull = obj.isNull - "" + s"(${ctx.boxedType(javaType)}) ${obj.value}.$functionName($argString)" } -val value = unboxer(s"${obj.value}.$functionName($argString)") --- End diff -- Don't we need to do unboxing anymore here? 
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12179#discussion_r60094164 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -288,15 +288,25 @@ case class DataSource( val fileCatalog: FileCatalog = new HDFSFileCatalog(sqlContext, options, globbedPaths, partitionSchema) -val dataSchema = userSpecifiedSchema.orElse { + +val dataSchema = userSpecifiedSchema.map { schema => + val equality = +if (sqlContext.conf.caseSensitiveAnalysis) { + org.apache.spark.sql.catalyst.analysis.caseSensitiveResolution +} else { + org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution +} + + StructType(schema.filterNot(f => partitionColumns.exists(equality(_, f.name)))) +}.orElse { --- End diff -- @yhuai Answering the question we raised offline here. Without this fix, the following test case added back in this PR fails: > SimpleTextHadoopFsRelationSuite.SPARK-7616: adjust column name order accordingly when saving partitioned table The major contradiction here is that the result of `FileFormat.inferSchema()` is a schema consisting of all the columns that live in the physical data files, and it may contain a subset of the partition columns. On the other hand, the user-specified schema passed via `DataFrameReader.schema()` refers to the full schema of the table, including all the partition columns. For `FileFormat` data sources whose `inferSchema()` returns `None`, we have no idea whether the physical files contain partition columns or not. To fix the regression failure, here we chop off all partition columns from the user-specified schema. But this imposes a restriction on `FileFormat` data sources without schema inference ability: > No partition columns are allowed in physical files. This doesn't cause much trouble for Spark built-in `FileFormat` data sources, since all of them either have a fixed schema (LibSVM and text) or are able to infer their own schema (Parquet, ORC, JSON, and CSV).
I've checked that this restriction also exists in branch-1.6. But I'd say this restriction exists by accident rather than by design. An alternative design is to alter the semantics of the user-specified schema set via `DataFrameReader.schema()` and make it represent the schema of the physical files. In this way, we can solve the problem unambiguously. But apparently this may break the runtime behavior of existing user code. So it seems that living with it is the more reasonable choice for now?
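The chopping logic described above can be sketched like this (assuming Spark's `StructType` API is on the classpath; the column names are made up for illustration):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Resolver analogous to Catalyst's caseSensitiveResolution / caseInsensitiveResolution.
def resolver(caseSensitive: Boolean): (String, String) => Boolean =
  if (caseSensitive) _ == _ else _.equalsIgnoreCase(_)

val userSpecifiedSchema = StructType(Seq(
  StructField("id", IntegerType),     // data column in the physical files
  StructField("Year", StringType)))   // partition column, note the different case
val partitionColumns = Seq("year")

val equality = resolver(caseSensitive = false)
// Chop off partition columns so only physical-file columns remain.
val dataSchema = StructType(
  userSpecifiedSchema.filterNot(f => partitionColumns.exists(equality(_, f.name))))
// With case-insensitive resolution, dataSchema keeps only "id"; with
// case-sensitive resolution, "Year" would survive and be treated as a data column.
```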
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211446073 Seems that Jenkins is down...
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-211423901 retest this please
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12153#issuecomment-210092607 retest this please
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59766576 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -624,20 +624,40 @@ class HDFSFileCatalog( def getStatus(path: Path): Array[FileStatus] = leafDirToChildrenFiles(path) + private implicit class LocatedFileStatusIterator(iterator: RemoteIterator[LocatedFileStatus]) --- End diff -- It's used [here][1]. `RemoteIterator` doesn't extend from `Iterator`. This implicit conversion helps to convert contents behind a `RemoteIterator` into an array. [1]: https://github.com/apache/spark/pull/12153/files#diff-40c347747af9101e7e9fee52fc4120b8R656
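An adapter of the kind described can be sketched as follows (assuming Hadoop's FS API on the classpath; Hadoop's `RemoteIterator` exposes only `hasNext`/`next`, both of which may throw `IOException`):

```scala
import scala.reflect.ClassTag

import org.apache.hadoop.fs.RemoteIterator

// RemoteIterator is not a scala.Iterator, so wrap it to gain collection
// methods such as toArray. Exceptions from the underlying iterator propagate.
implicit class RemoteIteratorOps[T](underlying: RemoteIterator[T]) {
  def toScalaIterator: Iterator[T] = new Iterator[T] {
    override def hasNext: Boolean = underlying.hasNext
    override def next(): T = underlying.next()
  }

  def toArray(implicit ct: ClassTag[T]): Array[T] = toScalaIterator.toArray
}
```

With this in scope, a call like `fs.listLocatedStatus(path).toArray` materializes the remote listing into an `Array[LocatedFileStatus]`.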
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59707908 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { case _ => Nil } + + private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match { +case f: LocatedFileStatus => f.getBlockLocations +case f => Array.empty[BlockLocation] + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the index of the block that contains this `offset`. If no such block + // can be found, returns -1. + private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = { +blockLocations.indexWhere { b => + b.getOffset <= offset && offset < b.getOffset + b.getLength +} + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the block that contains this `offset`, and returns location hosts of + // that block. If no such block can be found, returns an empty array. + private def getBlockHosts(blockLocations: Array[BlockLocation], offset: Long): Array[String] = { --- End diff -- Makes sense, updated.
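For reference, the helpers under discussion look roughly like this (the `getBlockIndex` body is taken from the diff above; the `getBlockHosts` body is a sketch of the documented behavior, since the diff truncates it):

```scala
import org.apache.hadoop.fs.BlockLocation

// Index of the block whose half-open range [getOffset, getOffset + getLength)
// contains the given file offset; -1 if no block matches.
def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int =
  blockLocations.indexWhere { b =>
    b.getOffset <= offset && offset < b.getOffset + b.getLength
  }

// Hosts of the block containing `offset`, or an empty array if none does.
def getBlockHosts(blockLocations: Array[BlockLocation], offset: Long): Array[String] = {
  val index = getBlockIndex(blockLocations, offset)
  if (index == -1) Array.empty[String] else blockLocations(index).getHosts
}
```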
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59707891 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -86,4 +87,9 @@ class FileScanRDD( } override protected def getPartitions: Array[Partition] = filePartitions.toArray + + override protected def getPreferredLocations(split: Partition): Seq[String] = { +val files = split.asInstanceOf[FilePartition].files +if (files.isEmpty) Seq.empty else files.maxBy(_.length).locations --- End diff -- Updated. @cloud-fan also suggested the same strategy. Thanks!
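The locality heuristic in the diff above can be sketched with plain Scala (the `PartitionedFile` shape is simplified here to just what the heuristic needs):

```scala
// Simplified stand-in for Spark's PartitionedFile: byte length plus host list.
case class PartitionedFile(length: Long, locations: Seq[String])

// Prefer the hosts of the largest file in the partition: scheduling the task
// there keeps the biggest read local and saves the most network traffic.
def preferredLocations(files: Seq[PartitionedFile]): Seq[String] =
  if (files.isEmpty) Seq.empty else files.maxBy(_.length).locations
```

For example, `preferredLocations(Seq(PartitionedFile(10, Seq("hostA")), PartitionedFile(99, Seq("hostB"))))` returns `Seq("hostB")`.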
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12179#discussion_r59596010 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/CommitFailureTestRelationSuite.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.sql.hive.test.TestHiveSingleton +import org.apache.spark.sql.test.SQLTestUtils + +class CommitFailureTestRelationSuite extends SQLTestUtils with TestHiveSingleton { + // When committing a task, `CommitFailureTestSource` throws an exception for testing purpose. + val dataSourceName: String = classOf[CommitFailureTestSource].getCanonicalName + + test("SPARK-7684: commitTask() failure should fallback to abortTask()") { --- End diff -- But yea, we should probably update these test cases without testing `unhandledFilters` and add them back to test filter push-down.
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12179#discussion_r59595837 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/CommitFailureTestRelationSuite.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.sources + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.sql.hive.test.TestHiveSingleton +import org.apache.spark.sql.test.SQLTestUtils + +class CommitFailureTestRelationSuite extends SQLTestUtils with TestHiveSingleton { + // When committing a task, `CommitFailureTestSource` throws an exception for testing purpose. + val dataSourceName: String = classOf[CommitFailureTestSource].getCanonicalName + + test("SPARK-7684: commitTask() failure should fallback to abortTask()") { --- End diff -- Do you mean those filter push-down tests in the original `SimpleTextHadoopFsRelationSuite`? Those weren't added because `HadoopFsRelation` doesn't have `unhandledFilters` now.
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59595284 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -197,4 +204,26 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { case _ => Nil } + + private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match { +case f: LocatedFileStatus => f.getBlockLocations +case f => Array.empty[BlockLocation] + } + + // Given locations of all blocks of a single file, `blockLocations`, and an `offset` position of + // the same file, find out the index of the block that contains this `offset`. If no such block + // can be found, returns -1. + private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = { --- End diff -- This one is a little bit tricky, would like to keep it here for readability.
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59595123 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala --- @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t") } } + + test("Locality support for FileScanRDD") { +withHadoopConf( + "fs.file.impl" -> classOf[LocalityTestFileSystem].getName, + "fs.file.impl.disable.cache" -> "true" +) { + withTempPath { dir => +val path = "file://" + dir.getCanonicalPath +val df1 = sqlContext.range(4) + df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path) + df1.coalesce(1).write.mode("append").format(dataSourceName).save(path) + +val df2 = sqlContext.read + .format(dataSourceName) + .option("dataSchema", df1.schema.json) + .load(path) + +val Some(fileScanRDD) = { --- End diff -- Yea, this is better, thanks!
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59594996 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala --- @@ -668,6 +671,42 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils with Tes df.write.format(dataSourceName).partitionBy("c", "d", "e").saveAsTable("t") } } + + test("Locality support for FileScanRDD") { +withHadoopConf( + "fs.file.impl" -> classOf[LocalityTestFileSystem].getName, + "fs.file.impl.disable.cache" -> "true" +) { + withTempPath { dir => +val path = "file://" + dir.getCanonicalPath +val df1 = sqlContext.range(4) + df1.coalesce(1).write.mode("overwrite").format(dataSourceName).save(path) + df1.coalesce(1).write.mode("append").format(dataSourceName).save(path) + +val df2 = sqlContext.read + .format(dataSourceName) + .option("dataSchema", df1.schema.json) + .load(path) + +val Some(fileScanRDD) = { + def allRDDsInDAG(rdd: RDD[_]): Seq[RDD[_]] = { +rdd +: rdd.dependencies.map(_.rdd).flatMap(allRDDsInDAG) + } + + // We have to search for the `FileScanRDD` along the RDD DAG since + // RDD.preferredLocations doesn't propagate along the DAG to the root RDD. + allRDDsInDAG(df2.rdd).collectFirst { +case f: FileScanRDD => f + } +} + +val partitions = fileScanRDD.partitions --- End diff -- Not necessarily, and probably only 1, since both files written are pretty small and will be merged into a single partition.
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12178#issuecomment-209510234 Closing this in favor of #12361 since master went through some major changes and it would be hard to resolve conflicts of this one.
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
Github user liancheng closed the pull request at: https://github.com/apache/spark/pull/12178
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12361 [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution/datasources package #12178 ## What changes were proposed in this pull request? This PR moves `HadoopFsRelation` related data source API into `execution/datasources` package. Note that to avoid conflicts, this PR is based on #12153. Effective commits for this PR only consist of the last two. Will rebase after merging #12153. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-14407-hide-hadoop-fs-relation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12361.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12361 commit 06b2afb9f3bd5dc4833e939f3f522168252eaa4b Author: Cheng Lian <l...@databricks.com> Date: 2016-04-04T15:33:18Z Locality support for FileScanRDD commit d94f7e95fc772d26c9d515a8ab9b6b96432fc8ff Author: Cheng Lian <l...@databricks.com> Date: 2016-04-04T17:21:22Z Fixes import order commit 5f214c12bdf133d90ceac441130af18af2fb7e6a Author: Cheng Lian <l...@databricks.com> Date: 2016-04-04T18:38:46Z Fixes cases where block location list is empty commit f036bb10304eb8f127a9956c178d6948a470ed9c Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T12:29:42Z Adds withHadoopConf utility method commit 426b52817d80b36f2c585ae498580a317e642312 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T14:27:36Z Tests commit 082d4a3ee7034dae3447cde68ae0bbcae3a80632 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T16:59:58Z Moves the test case to HadoopFsRelationTest commit f72d0ffbf15d599918f275ecf1a6c76223a6de97 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-06T14:54:53Z Addresses PR comments commit 
deb0f8509f6edff890f7882a3ce36d4cd0c6e403 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-06T14:59:22Z Adjusts file partition size fraction threshold to 10% commit 25224e1d09cd37beb4a922ad238f59c9e3732efb Author: Cheng Lian <l...@databricks.com> Date: 2016-04-07T16:44:56Z Fixes test case flakiness commit e666dea2241f400dd44abd2161db269ab7fb71af Author: Cheng Lian <l...@databricks.com> Date: 2016-04-11T13:29:17Z Uses getLocatedStatus() instead of getStatus() + getBlockLocations() commit 7cd40d989cc5008287547640ee47b0b89d59bdee Author: Cheng Lian <l...@databricks.com> Date: 2016-04-12T14:33:49Z Fixes import order commit 843e050b3baea14bff042e62c519518720d33c0a Author: Cheng Lian <l...@databricks.com> Date: 2016-04-13T13:55:01Z Makes locality test case more robust commit cef40bfb940a24e43ac01e4c521be6de51f5662d Author: Cheng Lian <l...@databricks.com> Date: 2016-04-13T13:59:09Z Simplifies locality selection commit af018f8b501ea3a396605e05889ac7cbed00cc78 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-13T15:23:23Z Hides HadoopFsRelation related data source API commit c5fba2f9ab821ef5c577a244b7c157549bb819e5 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-13T15:25:04Z Fixes MiMA
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12179#issuecomment-209462358 @cloud-fan Actually not completely copy-pasted since we did major changes to the data sources API...
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12179#issuecomment-209047957 Not sure why MiMA complained about `InputMetrics` stuff...
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12179#issuecomment-209047839 retest this please
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12179#issuecomment-209047622 @yhuai This should be ready for review.
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12179#discussion_r59430563 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala --- @@ -222,7 +222,7 @@ case class CreateMetastoreDataSourceAsSelect( val data = Dataset.ofRows(hiveContext, query) val df = existingSchema match { // If we are inserting into an existing table, just use the existing schema. - case Some(s) => sqlContext.internalCreateDataFrame(data.queryExecution.toRdd, s) + case Some(s) => data.selectExpr(s.fieldNames: _*) --- End diff -- Fix for SPARK-14566 by using a projection instead of simply applying existing schema to input query plan.
[GitHub] spark pull request: [SPARK-13681][SPARK-14458][SPARK-14566][SQL] A...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12179#discussion_r59430279 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala --- @@ -288,15 +288,25 @@ case class DataSource( val fileCatalog: FileCatalog = new HDFSFileCatalog(sqlContext, options, globbedPaths, partitionSchema) -val dataSchema = userSpecifiedSchema.orElse { + +val dataSchema = userSpecifiedSchema.map { schema => + val equality = +if (sqlContext.conf.caseSensitiveAnalysis) { + org.apache.spark.sql.catalyst.analysis.caseSensitiveResolution +} else { + org.apache.spark.sql.catalyst.analysis.caseInsensitiveResolution +} + + StructType(schema.filterNot(f => partitionColumns.exists(equality(_, f.name)))) +}.orElse { --- End diff -- Fix for SPARK-14458.
[GitHub] spark pull request: [SPARK-14558][CORE] In ClosureCleaner, clean t...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12327#issuecomment-208939061 LGTM
[GitHub] spark pull request: [SPARK-14558][CORE] In ClosureCleaner, clean t...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12327#issuecomment-208939090 retest this please
[GitHub] spark pull request: [SPARK-14488][SPARK-14493][SQL] "CREATE TEMPOR...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12303#issuecomment-208930853 Thanks for the review! Merging to master.
[GitHub] spark pull request: [SPARK-14488][SPARK-14493][SQL] "CREATE TEMPOR...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12303#discussion_r59306581 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala --- @@ -1852,4 +1852,31 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { } } } + + test( +"SPARK-14488 \"CREATE TEMPORARY TABLE ... USING ... AS SELECT ...\" " + +"shouldn't create persisted table" + ) {withTempPath { dir => + withTempTable("t1", "t2") { +val path = dir.getCanonicalPath +val ds = sqlContext.range(10) +ds.registerTempTable("t1") + +sql( + s"""CREATE TEMPORARY TABLE t2 + |USING PARQUET + |OPTIONS (PATH '$path') + |AS SELECT * FROM t1 --- End diff -- By "persisted" I mean the table metadata is persisted into Hive metastore. Since we are fixing `CreateTempTableUsingAsSelect`, this syntax IS the one we are going to test.
[GitHub] spark pull request: [SPARK-14488][SQL] "CREATE TEMPORARY TABLE ......
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12303 [SPARK-14488][SQL] "CREATE TEMPORARY TABLE ... USING ... AS SELECT" shouldn't create persisted table ## What changes were proposed in this pull request? When planning the logical plan node `CreateTableUsingAsSelect`, we neglected its `temporary` field and always generated a `CreateMetastoreDataSourceAsSelect`. This PR fixes the issue by generating a `CreateTempTableUsingAsSelect` when `temporary` is true. ## How was this patch tested? Added a test case that creates a temporary table using the target syntax and checks whether it's indeed a temporary table. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-14488-fix-ctas-using Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12303.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12303 commit eefc8cdc8373c835d2983ea7343487590d9826f8 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-11T17:34:06Z Fixes SPARK-14488
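The dispatch this PR describes can be sketched as a toy, self-contained model. The names below only loosely mirror the real Spark classes, and `plan` is an illustrative stand-in for the actual planner rule, which is considerably more involved:

```scala
// Simplified stand-ins for the planner's input node and output commands.
sealed trait Command
case class CreateTempTableUsingAsSelect(table: String) extends Command
case class CreateMetastoreDataSourceAsSelect(table: String) extends Command

case class CreateTableUsingAsSelect(table: String, temporary: Boolean)

// Before the fix the `temporary` flag was ignored and the metastore-backed
// command was always produced; after the fix the flag selects the command.
def plan(node: CreateTableUsingAsSelect): Command =
  if (node.temporary) CreateTempTableUsingAsSelect(node.table)
  else CreateMetastoreDataSourceAsSelect(node.table)
```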
[GitHub] spark pull request: [SPARK-14535][SQL] Remove buildInternalScan fr...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12300#issuecomment-208411048 And the ORC data source, which lives in the `sql/hive` module, hasn't been cleaned up yet.
[GitHub] spark pull request: [SPARK-14535][SQL] Remove buildInternalScan fr...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12300#issuecomment-208410251 We can also remove [these pattern match guard conditions][1] and the `spark.sql.sources.fileScan` SQL option. [1]: https://github.com/apache/spark/blob/e82d95bf63f57cefa02dc545ceb451ecdeedce28/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L59-L66
[GitHub] spark pull request: [SPARK-14372] [SQL] : Dataset.randomSplit() ne...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12184#issuecomment-208244759 Thanks! Merged to master.
[GitHub] spark pull request: [SPARK-14372] [SQL] : Dataset.randomSplit() ne...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/12184#issuecomment-208243291 LGTM @marmbrus Returning a `Seq` doesn't seem very Java-friendly?
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r59057227 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { val splitFiles = selectedPartitions.flatMap { partition => partition.files.flatMap { file => + val blockLocations = getBlockLocations(file) (0L to file.getLen by maxSplitBytes).map { offset => val remaining = file.getLen - offset val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining +// Finds the list of hosts serving the first block that contains `offset`, the +// starting point of this file split of length `size`. +val hosts = getBlockHosts(blockLocations, offset) +PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts) --- End diff -- This is a really good pointer, thanks!
[GitHub] spark pull request: [SPARK-14369][SQL] Locality support for FileSc...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58909434 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -137,10 +140,14 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { val splitFiles = selectedPartitions.flatMap { partition => partition.files.flatMap { file => + val blockLocations = getBlockLocations(file) (0L to file.getLen by maxSplitBytes).map { offset => val remaining = file.getLen - offset val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining +// Finds the list of hosts serving the first block that contains `offset`, the +// starting point of this file split of length `size`. +val hosts = getBlockHosts(blockLocations, offset) +PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts) --- End diff -- Each `FilePartition` derives its own preferred locations in `FileScanRDD.preferredLocations()` from the location hosts of all its `PartitionedFile`s.
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58720687 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala --- @@ -86,4 +87,8 @@ class FileScanRDD( } override protected def getPartitions: Array[Partition] = filePartitions.toArray + + override protected def getPreferredLocations(split: Partition): Seq[String] = { +split.asInstanceOf[FilePartition].files.flatMap(_.locations).distinct --- End diff -- Added a constraint so that we only pick locations of blocks whose sizes are larger than 10% of the whole file partition.
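The constraint mentioned above can be sketched as follows. `FileSlice` is an illustrative stand-in for `PartitionedFile`, and `preferredLocations` is a simplified model of the filtering the comment describes, not the actual implementation in `FileScanRDD`:

```scala
// A host is kept as a preferred location only when the bytes it serves add up
// to strictly more than 10% of the partition's total bytes.
case class FileSlice(length: Long, hosts: Seq[String])

def preferredLocations(files: Seq[FileSlice]): Seq[String] = {
  val totalBytes = files.map(_.length).sum.toDouble
  // Total bytes served by each host across all file slices in the partition.
  val bytesPerHost = files
    .flatMap(f => f.hosts.map(host => host -> f.length))
    .groupBy(_._1)
    .map { case (host, pairs) => host -> pairs.map(_._2).sum }
  bytesPerHost.collect { case (host, bytes) if bytes > 0.1 * totalBytes => host }.toSeq
}
```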
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58720699 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -205,4 +210,20 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { case _ => Nil } + + private def getBlockLocations(file: FileStatus): Array[BlockLocation] = file match { +case f: LocatedFileStatus => f.getBlockLocations +case f => Array.empty[BlockLocation] + } + + private def getBlockIndex(blockLocations: Array[BlockLocation], offset: Long): Int = { --- End diff -- Added comments.
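A self-contained sketch of the block-lookup helpers discussed in this thread. `BlockInfo` is an illustrative stand-in for Hadoop's `BlockLocation` (offset, length, hosts); the real `FileSourceStrategy` methods operate on `BlockLocation` directly:

```scala
case class BlockInfo(offset: Long, length: Long, hosts: Seq[String])

// Index of the first block containing `offset`, or -1 when the offset falls
// past the last block (e.g. a split starting at the very end of the file).
def getBlockIndex(blocks: Array[BlockInfo], offset: Long): Int =
  blocks.indexWhere(b => b.offset <= offset && offset < b.offset + b.length)

// Hosts serving the block that contains the split's starting offset; an empty
// list means "no locality preference" for this split.
def getBlockHosts(blocks: Array[BlockInfo], offset: Long): Seq[String] = {
  val i = getBlockIndex(blocks, offset)
  if (i >= 0) blocks(i).hosts else Seq.empty
}
```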
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58720668 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -620,17 +621,28 @@ class HDFSFileCatalog( if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) { HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext) } else { - val statuses = paths.flatMap { path => + val statuses: Seq[FileStatus] = paths.flatMap { path => val fs = path.getFileSystem(hadoopConf) logInfo(s"Listing $path on driver") // Dummy jobconf to get to the pathFilter defined in configuration -val jobConf = new JobConf(hadoopConf, this.getClass()) +val jobConf = new JobConf(hadoopConf, this.getClass) val pathFilter = FileInputFormat.getInputPathFilter(jobConf) -if (pathFilter != null) { +val statuses = if (pathFilter != null) { Try(fs.listStatus(path, pathFilter)).getOrElse(Array.empty) } else { Try(fs.listStatus(path)).getOrElse(Array.empty) } + +fs match { + case _: S3FileSystem => --- End diff -- Thanks, made corresponding changes and added comments.
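The listing behaviour in the diff above can be modelled with a small sketch. The types below are toy stand-ins for Hadoop's `FileStatus`/`LocatedFileStatus`, and the `hasLocality` flag replaces the `fs match` on the concrete filesystem class:

```scala
sealed trait Status { def path: String }
case class PlainStatus(path: String) extends Status
case class LocatedStatus(path: String, hosts: Seq[String]) extends Status

// Filesystems without block locality (such as S3) keep their plain statuses;
// others are enriched with the hosts serving each file's blocks.
def withLocations(
    statuses: Seq[PlainStatus],
    hasLocality: Boolean,
    lookupHosts: String => Seq[String]): Seq[Status] =
  if (!hasLocality) statuses // e.g. S3: no locality information available
  else statuses.map(s => LocatedStatus(s.path, lookupHosts(s.path)))
```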
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58720664 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala --- @@ -620,17 +621,28 @@ class HDFSFileCatalog( if (paths.length >= sqlContext.conf.parallelPartitionDiscoveryThreshold) { HadoopFsRelation.listLeafFilesInParallel(paths, hadoopConf, sqlContext.sparkContext) } else { - val statuses = paths.flatMap { path => + val statuses: Seq[FileStatus] = paths.flatMap { path => val fs = path.getFileSystem(hadoopConf) logInfo(s"Listing $path on driver") // Dummy jobconf to get to the pathFilter defined in configuration -val jobConf = new JobConf(hadoopConf, this.getClass()) +val jobConf = new JobConf(hadoopConf, this.getClass) val pathFilter = FileInputFormat.getInputPathFilter(jobConf) -if (pathFilter != null) { +val statuses = if (pathFilter != null) { Try(fs.listStatus(path, pathFilter)).getOrElse(Array.empty) } else { Try(fs.listStatus(path)).getOrElse(Array.empty) } + +fs match { + case _: S3FileSystem => +// S3 doesn't provide locality information. +statuses.toSeq + + case _ => +statuses.map { f => + new LocatedFileStatus(f, fs.getFileBlockLocations(f, 0, f.getLen)) --- End diff -- Done.
[GitHub] spark pull request: [SPARK-14369][SQL][test-hadoop2.2] Locality su...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/12153#discussion_r58720692 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala --- @@ -137,10 +140,12 @@ private[sql] object FileSourceStrategy extends Strategy with Logging { val splitFiles = selectedPartitions.flatMap { partition => partition.files.flatMap { file => + val blockLocations = getBlockLocations(file) (0L to file.getLen by maxSplitBytes).map { offset => val remaining = file.getLen - offset val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining -PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size) +val hosts = getBlockHosts(blockLocations, offset) +PartitionedFile(partition.values, file.getPath.toUri.toString, offset, size, hosts) --- End diff -- Added.
[GitHub] spark pull request: [SPARK-13681][SQL][WIP] Add back once removed ...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12179 [SPARK-13681][SQL][WIP] Add back once removed CommitFailureTestRelationSuite ## What changes were proposed in this pull request? This test suite was removed while refactoring `HadoopFsRelation` related API. This PR brings it back. It's still WIP because `SimpleTextHadoopFsRelationSuite` still fails. Fixing it. ## How was this patch tested? A testing relation that always fails committing write tasks is used to test task commit failure. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-13681-commit-failure-test Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12179.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12179 commit 4cc1adc8049bdf83255e16875ebb9d970eac00d2 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T17:56:26Z WIP commit 3b784501eb4831e6355aeb8b131a31cc2c4eada8 Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T18:19:30Z Comments out failed test suite
[GitHub] spark pull request: [SPARK-14407][SQL] Hides HadoopFsRelation rela...
GitHub user liancheng opened a pull request: https://github.com/apache/spark/pull/12178 [SPARK-14407][SQL] Hides HadoopFsRelation related data source API into execution package ## What changes were proposed in this pull request? This PR moves `HadoopFsRelation` related data source API into the execution package. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/liancheng/spark spark-14407-hide-file-scan-api Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12178.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12178 commit 64b7cf487c59aee3217aae37733ff9879dd79c2c Author: Cheng Lian <l...@databricks.com> Date: 2016-04-05T16:58:52Z Hides HadoopFsRelation related data source API