[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r64673144

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---

```diff
@@ -354,7 +356,27 @@ object CreateDataSourceTableUtils extends Logging {
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may not be the one the Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (new CaseInsensitiveMap(options).get("path").isDefined) {
+            val path = new Path(new CaseInsensitiveMap(options).get("path").get)
+            val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
+            if (fs.exists(path)) {
+              // if the provided path exists, Hive metastore only takes a directory
+              // as the table data location
+              if (fs.getFileStatus(path).isDirectory) {
+                Some(path.toUri.toString)
+              } else {
+                Some(path.getParent.toUri.toString)
+              }
+            } else {
+              // If the path does not exist yet, it is assumed to be a directory
+              Some(path.toUri.toString)
+            }
+          } else {
+            None
+          },
```

--- End diff ---

(For future reference, the above comment is replied to [here][1].)

[1]: https://github.com/apache/spark/pull/13270#issuecomment-221739839

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r64126341

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -354,7 +356,27 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

The full path is already populated in the SerDe properties, so the `SELECT` will still work from Spark SQL.
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r64100525

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -354,7 +356,27 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

Hm... Then I think we should probably save the path as a SerDe property (similar to the schema of persisted data source tables). @yhuai What do you think? It breaks existing functionality if we can't read individual files.
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63903975

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -354,7 +356,27 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

@liancheng Thanks! I tried this before, but Hive complained that the path is either not a directory or it cannot create one at that path. This was the reason it failed the test cases in `MetastoreDataSourcesSuite` wherever we create a data source (non-Hive-compatible) table with an exact file name. Example:

```
[info] - CTAS a managed table *** FAILED *** (365 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/home/xwu0226/spark/sql/hive/target/scala-2.11/test-classes/sample.json is not a directory or unable to create one);
```

I also tried in the Hive shell:

```
hive> create external table t_txt1 (c1 int) location '/tmp/test1.txt';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt is not a directory or unable to create one)
```

So it seems Hive only accepts a directory as a table location. In our case, we need to give Hive a directory via `locationUri`. As for your concern about a directory containing multiple files: in this case we are on the non-Hive-compatible code path, so do we still expect consistency between Hive and Spark SQL? Querying from Spark SQL will return the expected results, while the results will differ from Hive. But the current behavior of non-Hive-compatible tables is already like this.
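The errors above pin down the constraint driving the patch: Hive accepts only directories as table locations. A minimal sketch of the resulting fallback (hypothetical helper name; uses the Hadoop `Path`/`FileSystem` APIs already shown in the quoted diff, so it needs Hadoop on the classpath to run):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: map a user-provided path to a location the Hive metastore will
// accept. An existing plain file falls back to its parent directory; a
// path that does not exist yet is assumed to be a directory.
def hiveCompatibleLocation(rawPath: String, hadoopConf: Configuration): String = {
  val path = new Path(rawPath)
  val fs: FileSystem = path.getFileSystem(hadoopConf)
  if (fs.exists(path) && !fs.getFileStatus(path).isDirectory) {
    path.getParent.toUri.toString
  } else {
    path.toUri.toString
  }
}
```

This is only the location handed to Hive; as discussed below in the thread, Spark SQL itself reads from the exact path stored in the SerDe properties.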
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63869408

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -354,7 +356,27 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

The following line should be enough for `locationUri`:

```scala
locationUri = new CaseInsensitiveMap(options).get("path")
```

Consider the following directory layout containing two Parquet files:

```
/tmp/dir/
  part-1.parquet
  part-2.parquet
```

If we pass "/tmp/dir/part-1.parquet" as the file path, the logic above will use "/tmp/dir/" as `locationUri`, so "part-2.parquet" is also included, which is not the expected behavior.
Github user xwu0226 commented on the pull request: https://github.com/apache/spark/pull/13120#issuecomment-220172124

@yhuai @liancheng I updated the code and also did some manual tests creating a table with a real HDFS path on one of my clusters. For example:

```scala
scala> spark.sql("create table json_t3 (c1 int) using json options (path 'hdfs://bdavm009.svl.ibm.com:8020/tmp/json_t3')")
```

and my HDFS environment shows:

```
hdfs dfs -ls /tmp
drwxrwxrwx   - xwu0226 hdfs          0 2016-05-18 14:52 /tmp/json_t3
```

Then, I create another table with the previously created data path:

```
scala> spark.sql("create table json_t4 (c1 int) using json options (path 'hdfs://bdavm009.svl.ibm.com:8020/tmp/json_t3/part-r-3-8382e0e2-8518-48df-82c8-b6c84ab03c45.json')")
scala> spark.sql("select * from json_t4").show
16/05/18 14:59:50 WARN DataSource: Error while looking for metadata directory.
+---+
| c1|
+---+
|  1|
+---+
```

Please take a look at the change again! Thank you very much!
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63772240

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---

```diff
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
       className = provider,
       options = options)

-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may not be the one the Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
+            }.head)
+          } else {
+            None
```

--- End diff ---

Got it, thanks!
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63719953

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

@liancheng In this code path, we are creating a non-Hive-compatible metastore table, which means we don't expect users to query the table from Hive directly and get the correct results, right? And the purpose of populating this `locationUri` is to keep the Hive metastore from implicitly creating a directory. A query against the table from Spark SQL will use the path set in `CatalogTable.storage.serdeProperties.get("path")`. So I think it is OK if the single file lives in a directory containing multiple files, right?
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63718106

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

@liancheng Thanks! That is what I am testing now. But I just hit some errors running the test against the recently merged `ShowCreateTableSuite`. Still investigating.
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63687429

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

@gatorsmile The `else` branch is for managed tables.
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63686818

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

I think for external tables, you can always get the path from `options` using the option key `path`.
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63686590

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

What if we are only reading a single data file that lives in a directory containing multiple files?
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63605497

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---

```diff
@@ -451,13 +462,15 @@ object CreateDataSourceTableUtils extends Logging {
             s"Could not persist $qualifiedTableName in a Hive compatible way. Persisting " +
             s"it into Hive metastore in Spark SQL specific format."
           logWarning(warningMessage, e)
-          val table = newSparkSQLSpecificMetastoreTable()
+          val table =
+            newSparkSQLSpecificMetastoreTable(resolvedRelation.asInstanceOf[HadoopFsRelation])
           sparkSession.sessionState.catalog.createTable(table, ignoreIfExists = false)
         }

       case (None, message) =>
         logWarning(message)
-        val table = newSparkSQLSpecificMetastoreTable()
+        val table =
+          newSparkSQLSpecificMetastoreTable(resolvedRelation.asInstanceOf[HadoopFsRelation])
```

--- End diff ---

Thanks! I will update.
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63605320

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

Thanks, @yhuai! I will update.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63604166

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

I think what we can do is check whether the options contain `path`. If so, we put that here; otherwise, we leave `locationUri` as `None`.
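A self-contained sketch of this suggestion (the hypothetical `locationUriFor` helper models `CaseInsensitiveMap` with a plain case-insensitive lookup for illustration): take the `path` option verbatim when present, and leave `locationUri` as `None` otherwise so the metastore generates the managed-table location itself.

```scala
// Model of the suggested logic: a case-insensitive lookup of the "path"
// option; None falls through so Hive generates the location on its own.
def locationUriFor(options: Map[String, String]): Option[String] =
  options.collectFirst { case (k, v) if k.equalsIgnoreCase("path") => v }

// locationUriFor(Map("Path" -> "hdfs://host:8020/tmp/json_t3"))
//   == Some("hdfs://host:8020/tmp/json_t3")
// locationUriFor(Map("serialization.format" -> "1")) == None
```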
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63603796

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -451,13 +462,15 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

I do not think we can do this cast. There are other kinds of relations.
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63599112

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

Looks like `new File` does not work here, because it is meant for the local filesystem.
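The distinction matters because `java.io.File` only ever consults the local filesystem, while the Hadoop API resolves the filesystem from the URI scheme. A sketch of the difference (hypothetical paths; the Hadoop calls assume a reachable cluster and are shown for shape only):

```scala
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// java.io.File checks the *local* filesystem only, so an HDFS path can
// never be reported as a directory this way:
val local = new File("hdfs://namenode:8020/tmp/dir").isDirectory  // always false

// The Hadoop API instead resolves the filesystem from the URI scheme,
// so the existence/directory check runs against the right store:
val path = new Path("hdfs://namenode:8020/tmp/dir")
val fs = path.getFileSystem(new Configuration())
// fs.getFileStatus(path).isDirectory consults HDFS, not the local disk
```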
Github user xwu0226 commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63300902

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
--- End diff ---

Hive metastore will generate the path for an internal table too. But on drop, it will also be deleted by Hive.
[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/13120#discussion_r63298718

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---

```
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
       className = provider,
       options = options)

-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want the Hive metastore to implicitly create a table directory,
+          // which may not be the directory the data source table refers to, and which
+          // would be left behind when an external table is dropped.
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
+            }.head)
+          } else {
+            None
```

--- End diff --

In this case, `locationUri` is still `None`. Does that mean we still let Hive generate the path?
[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/13120#issuecomment-219253785

Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...
Github user xwu0226 commented on the pull request: https://github.com/apache/spark/pull/13120#issuecomment-219253654

cc @liancheng @yhuai @gatorsmile Thanks!
[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...
GitHub user xwu0226 opened a pull request: https://github.com/apache/spark/pull/13120

[SPARK-15269][SQL] Set provided path to CatalogTable.storage.locationURI when creating external non-hive compatible table

## What changes were proposed in this pull request?

### Symptom

```
scala> spark.range(1).write.json("/home/xwu0226/spark-test/data/spark-15269")
Datasource.write -> Path: file:/home/xwu0226/spark-test/data/spark-15269

scala> spark.sql("create table spark_15269 using json options(PATH '/home/xwu0226/spark-test/data/spark-15269')")
16/05/11 14:51:00 WARN CreateDataSourceTableUtils: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source relation `spark_15269` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
-- going through newSparkSQLSpecificMetastoreTable()
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("drop table spark_15269")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table spark_15269 using json as select 1 as a")
org.apache.spark.sql.AnalysisException: path file:/user/hive/warehouse/spark_15269 already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  ...
```

The second creation of the table fails, complaining that the path already exists.

### Root cause

When the first table is created as an external table with the data source path, but as json, `createDataSourceTables` considers it a non-Hive-compatible table because `json` is not a Hive SerDe. `newSparkSQLSpecificMetastoreTable` is then invoked to build the `CatalogTable` before asking `HiveClient` to create the metastore table.
In this call, `locationURI` is not set, so when we convert the `CatalogTable` to a Hive table before passing it to the Hive metastore, the Hive table's data location is unset. The Hive metastore then implicitly creates a data location under its warehouse directory, which is `file:/user/hive/warehouse/spark_15269` in the above case. When the table is dropped, Hive does not delete this implicitly created path, because the table is external.

When we create the second table with a select and without a path, the table is created as a managed table and is given a default path in the options, as follows:

```
val optionsWithPath =
  if (!new CaseInsensitiveMap(options).contains("path")) {
    isExternal = false
    options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
  } else {
    options
  }
```

This default path happens to be Hive's warehouse directory plus the table name, which is the same path the Hive metastore implicitly created earlier for the first table. So when `InsertIntoHadoopFsRelation` tries to write the provided data to this data source table, it complains that the path already exists, since the save mode is `SaveMode.ErrorIfExists`.

### Solution

When creating an external data source table that is non-Hive compatible, set the provided path on `CatalogTable.storage.locationURI`, so that the Hive metastore does not implicitly create a data location for the table.

## How was this patch tested?

A test case is added, and the regression tests were run.
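The `new CaseInsensitiveMap(options).contains("path")` check above is what makes `OPTIONS(PATH '...')` and a lowercase `path` key resolve to the same entry. A minimal stand-in for that behavior (a sketch, not Spark's actual `CaseInsensitiveMap` implementation) looks like this:

```scala
// Hypothetical stand-in illustrating case-insensitive option lookup:
// keys are normalized to lowercase once, so "PATH", "Path", and "path"
// all hit the same entry. Spark's real class offers more than this sketch.
class CaseInsensitiveMap(base: Map[String, String]) {
  private val lower = base.map { case (k, v) => (k.toLowerCase, v) }
  def get(key: String): Option[String] = lower.get(key.toLowerCase)
  def contains(key: String): Boolean = lower.contains(key.toLowerCase)
}

// OPTIONS(PATH '...') arrives with whatever case the user typed.
val options = Map("PATH" -> "/home/xwu0226/spark-test/data/spark-15269")
val ci = new CaseInsensitiveMap(options)

assert(ci.contains("path"))
assert(ci.get("Path") == Some("/home/xwu0226/spark-test/data/spark-15269"))
```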
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xwu0226/spark SPARK-15269

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13120.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13120

commit 21d188321284a86176927445fd1703353e0add09
Author: xin Wu
Date: 2016-05-08T07:06:36Z

    spark-15206 add testcases for distinct aggregate in having clause following up PR12974

commit e43d56ab260633d7c2af54a6960cec7eadff34c4
Author: xin Wu
Date: 2016-05-08T07:09:44Z

    Revert "spark-15206 add testcases for distinct aggregate in having clause following up PR12974"

    This reverts commit 98a1f804d7343ba77731f9aa400c00f1a26c03fe.

commit f9f1f1f36f3759eecfb6070b2372462ee454b700
Author: xin Wu
Date: 2016-05-13T00:39:45Z

    SPARK-15269: set locationURI to the non-hive compatible metastore table

commit 58ad82db21f90b571d70371ff25c167ecda17720
Author: xin Wu
Date: 2016-05-14T20:16:11Z

    SPARK-15269: only for external datasource table