[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-25 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r64673144
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -354,7 +356,27 @@ object CreateDataSourceTableUtils extends Logging {
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (new CaseInsensitiveMap(options).get("path").isDefined) {
+            val path = new Path(new CaseInsensitiveMap(options).get("path").get)
+            val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
+            if (fs.exists(path)) {
+              // if the provided path exists, Hive metastore only takes directory
+              // as table data location
+              if (fs.getFileStatus(path).isDirectory) {
+                Some(path.toUri.toString)
+              } else {
+                Some(path.getParent.toUri.toString)
+              }
+            } else {
+              // If the path does not exists yet, it is assumed to be directory
+              Some(path.toUri.toString)
+            }
+          } else {
+            None
+          },
--- End diff --

(For future reference, the above comment is replied to [here][1].)

[1]: https://github.com/apache/spark/pull/13270#issuecomment-221739839





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-20 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r64126341
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -354,7 +356,27 @@ object CreateDataSourceTableUtils extends Logging {
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (new CaseInsensitiveMap(options).get("path").isDefined) {
+            val path = new Path(new CaseInsensitiveMap(options).get("path").get)
+            val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
+            if (fs.exists(path)) {
+              // if the provided path exists, Hive metastore only takes directory
+              // as table data location
+              if (fs.getFileStatus(path).isDirectory) {
+                Some(path.toUri.toString)
+              } else {
+                Some(path.getParent.toUri.toString)
+              }
+            } else {
+              // If the path does not exists yet, it is assumed to be directory
+              Some(path.toUri.toString)
+            }
+          } else {
+            None
+          },
--- End diff --

The full path is already populated in the SerDe properties, so SELECT from Spark SQL will still work.





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-20 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r64100525
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -354,7 +356,27 @@ object CreateDataSourceTableUtils extends Logging {
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (new CaseInsensitiveMap(options).get("path").isDefined) {
+            val path = new Path(new CaseInsensitiveMap(options).get("path").get)
+            val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
+            if (fs.exists(path)) {
+              // if the provided path exists, Hive metastore only takes directory
+              // as table data location
+              if (fs.getFileStatus(path).isDirectory) {
+                Some(path.toUri.toString)
+              } else {
+                Some(path.getParent.toUri.toString)
+              }
+            } else {
+              // If the path does not exists yet, it is assumed to be directory
+              Some(path.toUri.toString)
+            }
+          } else {
+            None
+          },
--- End diff --

Hm... Then I think we should probably save the path as a SerDe property (similar to the schema of persisted data source tables). @yhuai What do you think? It breaks existing functionality if we can't read individual files.
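
For concreteness, a minimal sketch of what saving the path as a SerDe property could look like (hypothetical; `table` stands for the `CatalogTable` being built, and `CaseInsensitiveMap`/`options` are as used elsewhere in `createDataSourceTables.scala`; the Spark 2.0-era field names may differ):

```scala
// Hypothetical sketch: carry the user-provided path in serdeProperties,
// mirroring how the data source schema is persisted for these tables.
val withPath = table.copy(
  storage = table.storage.copy(
    serdeProperties = table.storage.serdeProperties ++
      new CaseInsensitiveMap(options).get("path").map("path" -> _)))
```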





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-19 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63903975
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -354,7 +356,27 @@ object CreateDataSourceTableUtils extends Logging {
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (new CaseInsensitiveMap(options).get("path").isDefined) {
+            val path = new Path(new CaseInsensitiveMap(options).get("path").get)
+            val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
+            if (fs.exists(path)) {
+              // if the provided path exists, Hive metastore only takes directory
+              // as table data location
+              if (fs.getFileStatus(path).isDirectory) {
+                Some(path.toUri.toString)
+              } else {
+                Some(path.getParent.toUri.toString)
+              }
+            } else {
+              // If the path does not exists yet, it is assumed to be directory
+              Some(path.toUri.toString)
+            }
+          } else {
+            None
+          },
--- End diff --

@liancheng Thanks! I tried this before, but Hive complained that the path is either not a directory or that it cannot create one at that path. This is why it failed the test cases in `MetastoreDataSourcesSuite` whenever we create a data source (non-Hive-compatible) table with an exact file name. Example:
```
[info] - CTAS a managed table *** FAILED *** (365 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/home/xwu0226/spark/sql/hive/target/scala-2.11/test-classes/sample.json is not a directory or unable to create one);
```
I also tried in the Hive shell:
```
hive> create external table t_txt1 (c1 int) location '/tmp/test1.txt';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt is not a directory or unable to create one)
```
So it seems Hive only accepts a directory as a table location. In our case, we need to give Hive a directory via `locationUri`.

Regarding your concern about a directory containing multiple files: in this case we are in the non-Hive-compatible code path, so do we still expect consistency between Hive and Spark SQL? Querying from Spark SQL will return the expected results, while the results from Hive will differ. But this is already the current behavior of non-Hive-compatible tables.





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-19 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63869408
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -354,7 +356,27 @@ object CreateDataSourceTableUtils extends Logging {
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (new CaseInsensitiveMap(options).get("path").isDefined) {
+            val path = new Path(new CaseInsensitiveMap(options).get("path").get)
+            val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
+            if (fs.exists(path)) {
+              // if the provided path exists, Hive metastore only takes directory
+              // as table data location
+              if (fs.getFileStatus(path).isDirectory) {
+                Some(path.toUri.toString)
+              } else {
+                Some(path.getParent.toUri.toString)
+              }
+            } else {
+              // If the path does not exists yet, it is assumed to be directory
+              Some(path.toUri.toString)
+            }
+          } else {
+            None
+          },
--- End diff --

The following line should be enough for `locationUri`:

```scala
locationUri = new CaseInsensitiveMap(options).get("path")
```

Consider the following directory layout containing two Parquet files:

```
/tmp/dir/
  part-1.parquet
  part-2.parquet
```

If we pass "/tmp/dir/part-1.parquet" as the file path, the logic above will use "/tmp/dir/" as `locationUri`, so "part-2.parquet" is also included, which is not the expected behavior.
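
A quick hypothetical repro of that over-inclusion from a Spark shell:

```scala
// Reading the single file returns only its rows, while reading the resolved
// parent directory silently picks up part-2.parquet as well.
val oneFile  = spark.read.parquet("/tmp/dir/part-1.parquet")
val allFiles = spark.read.parquet("/tmp/dir")
```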





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-18 Thread xwu0226
Github user xwu0226 commented on the pull request:

https://github.com/apache/spark/pull/13120#issuecomment-220172124
  
@yhuai @liancheng I updated the code and also did some manual tests creating tables with a real HDFS path on one of my clusters. For example:
```
scala> spark.sql("create table json_t3 (c1 int) using json options (path 'hdfs://bdavm009.svl.ibm.com:8020/tmp/json_t3')")
```
and my HDFS environment shows:
```
hdfs dfs -ls /tmp
drwxrwxrwx   - xwu0226   hdfs  0 2016-05-18 14:52 /tmp/json_t3
```
Then I create another table from a data file under the previously created path:
```
scala> spark.sql("create table json_t4 (c1 int) using json options (path 'hdfs://bdavm009.svl.ibm.com:8020/tmp/json_t3/part-r-3-8382e0e2-8518-48df-82c8-b6c84ab03c45.json')")

scala> spark.sql("select * from json_t4").show
16/05/18 14:59:50 WARN DataSource: Error while looking for metadata directory.
+---+
| c1|
+---+
|  1|
+---+
```

Please take a look at the change again! Thank you very much!





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-18 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63772240
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
         className = provider,
         options = options)
 
-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
+            }.head)
+          } else {
+            None
--- End diff --

Got it, thanks!





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-18 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63719953
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
         className = provider,
         options = options)
 
-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
--- End diff --

@liancheng In this code path, we are creating a non-Hive-compatible metastore table, which means we don't expect users to query the table from Hive directly and get correct results, right? And the purpose of populating this `locationUri` is to prevent the Hive metastore from implicitly creating a directory. A query against the table from Spark SQL will use the path set in `CatalogTable.storage.serdeProperties.get("path")`. So I think it is OK if the single file lives in a directory containing multiple files, right?
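
For illustration, the lookup described above amounts to something like this (a sketch assuming the Spark 2.0-era internal catalog API, usable from within Spark's own code):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Hypothetical inspection: for these Spark-SQL-specific tables the full file
// path lives in the SerDe properties, independent of locationUri.
val table = sparkSession.sessionState.catalog.getTableMetadata(TableIdentifier("json_t4"))
val dataPath: Option[String] = table.storage.serdeProperties.get("path")
```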





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-18 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63718106
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
         className = provider,
         options = options)
 
-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
--- End diff --

@liancheng Thanks! That is what I am testing now. But I just hit an error running the tests against the recently merged `ShowCreateTableSuite`. Still investigating.





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-18 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63687429
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
         className = provider,
         options = options)
 
-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
+            }.head)
+          } else {
+            None
--- End diff --

@gatorsmile The `else` branch is for managed tables.





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-18 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63686818
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
         className = provider,
         options = options)
 
-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
--- End diff --

I think for external tables, you can always get the path from `options` using the option key `path`.





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-18 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63686590
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
         className = provider,
         options = options)
 
-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
--- End diff --

What if we are only reading a single data file that lives in a directory 
containing multiple files?





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-17 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63605497
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -451,13 +462,15 @@ object CreateDataSourceTableUtils extends Logging {
               s"Could not persist $qualifiedTableName in a Hive compatible way. Persisting " +
                 s"it into Hive metastore in Spark SQL specific format."
             logWarning(warningMessage, e)
-            val table = newSparkSQLSpecificMetastoreTable()
+            val table =
+              newSparkSQLSpecificMetastoreTable(resolvedRelation.asInstanceOf[HadoopFsRelation])
             sparkSession.sessionState.catalog.createTable(table, ignoreIfExists = false)
         }
 
       case (None, message) =>
         logWarning(message)
-        val table = newSparkSQLSpecificMetastoreTable()
+        val table =
+          newSparkSQLSpecificMetastoreTable(resolvedRelation.asInstanceOf[HadoopFsRelation])
--- End diff --

Thanks! I will update.





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-17 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63605320
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
         className = provider,
         options = options)
 
-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
--- End diff --

Thanks, @yhuai! I will update.





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-17 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63604166
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
         className = provider,
         options = options)
 
-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
--- End diff --

I think what we can do is check whether `options` contains `path`. If so, we put it here; otherwise, we leave `locationUri` as `None`.
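
A minimal sketch of that check (assuming `options: Map[String, String]` as in the surrounding code):

```scala
// Use the user-supplied "path" option as the location when present;
// otherwise leave locationUri unset.
val locationUri: Option[String] = new CaseInsensitiveMap(options).get("path")
```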





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-17 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63603796
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -451,13 +462,15 @@ object CreateDataSourceTableUtils extends Logging {
               s"Could not persist $qualifiedTableName in a Hive compatible way. Persisting " +
                 s"it into Hive metastore in Spark SQL specific format."
             logWarning(warningMessage, e)
-            val table = newSparkSQLSpecificMetastoreTable()
+            val table =
+              newSparkSQLSpecificMetastoreTable(resolvedRelation.asInstanceOf[HadoopFsRelation])
             sparkSession.sessionState.catalog.createTable(table, ignoreIfExists = false)
         }
 
       case (None, message) =>
         logWarning(message)
-        val table = newSparkSQLSpecificMetastoreTable()
+        val table =
+          newSparkSQLSpecificMetastoreTable(resolvedRelation.asInstanceOf[HadoopFsRelation])
--- End diff --

I do not think we can do this cast. There are other kinds of relations.





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-17 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63599112
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
         className = provider,
         options = options)
 
-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
--- End diff --

Looks like `new File` does not work here, because it is only for the local fs.





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-15 Thread xwu0226
Github user xwu0226 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63300902
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
         className = provider,
         options = options)
 
-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
+            }.head)
+          } else {
+            None
--- End diff --

The Hive metastore will generate the path for internal tables too, but when the table is dropped, that path is deleted by Hive as well.





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/13120#discussion_r63298718
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala ---
@@ -348,13 +349,23 @@ object CreateDataSourceTableUtils extends Logging {
         className = provider,
         options = options)
 
-    def newSparkSQLSpecificMetastoreTable(): CatalogTable = {
+    def newSparkSQLSpecificMetastoreTable(relation: HadoopFsRelation): CatalogTable = {
       CatalogTable(
         identifier = tableIdent,
         tableType = tableType,
         schema = Nil,
         storage = CatalogStorageFormat(
-          locationUri = None,
+          // We don't want Hive metastore to implicitly create a table directory,
+          // which may be not the one Data Source table is referring to,
+          // yet which will be left behind when the table is dropped for an external table
+          locationUri = if (isExternal) {
+            Some(relation.location.paths.map(_.toUri.toString).map {
+              case p if new File(p).isDirectory => p
+              case p => new File(p).getParent
+            }.head)
+          } else {
+            None
--- End diff --

In this case, `locationUri` is still `None`. Does that mean we still let 
Hive generate the path?





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/13120#issuecomment-219253785
  
Can one of the admins verify this patch?





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-14 Thread xwu0226
Github user xwu0226 commented on the pull request:

https://github.com/apache/spark/pull/13120#issuecomment-219253654
  
cc @liancheng @yhuai @gatorsmile Thanks!





[GitHub] spark pull request: [SPARK-15269][SQL] Set provided path to Catalo...

2016-05-14 Thread xwu0226
GitHub user xwu0226 opened a pull request:

https://github.com/apache/spark/pull/13120

[SPARK-15269][SQL] Set provided path to CatalogTable.storage.locationURI 
when creating external non-hive compatible table

## What changes were proposed in this pull request?
### Symptom
```
scala> spark.range(1).write.json("/home/xwu0226/spark-test/data/spark-15269")
Datasource.write -> Path: file:/home/xwu0226/spark-test/data/spark-15269

scala> spark.sql("create table spark_15269 using json options(PATH '/home/xwu0226/spark-test/data/spark-15269')")
16/05/11 14:51:00 WARN CreateDataSourceTableUtils: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source relation `spark_15269` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
going through newSparkSQLSpecificMetastoreTable()
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("drop table spark_15269")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table spark_15269 using json as select 1 as a")
org.apache.spark.sql.AnalysisException: path file:/user/hive/warehouse/spark_15269 already exists.;
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
...
```
The 2nd creation of the table fails, complaining that the path already exists.

### Root cause:
When the first table is created as an external table with the data source path, but as JSON, `createDataSourceTables` considers it a non-Hive-compatible table because `json` is not a Hive SerDe. Then `newSparkSQLSpecificMetastoreTable` is invoked to create the `CatalogTable` before asking `HiveClient` to create the metastore table. In this call, `locationUri` is not set, so when we convert the `CatalogTable` to a Hive table before passing it to the Hive metastore, the Hive table's data location is not set. The Hive metastore then implicitly creates a data location of `<warehouse directory>/<tableName>`, which is `file:/user/hive/warehouse/spark_15269` in the above case.

When dropping the table, Hive does not delete this implicitly created path because the table is external.

When we create the 2nd table with a SELECT and without a path, the table is created as a managed table, with a default path provided in the options as follows:
```scala
val optionsWithPath =
  if (!new CaseInsensitiveMap(options).contains("path")) {
    isExternal = false
    options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
  } else {
    options
  }
```
This default path happens to be Hive's warehouse directory plus the table name, which is the same as the one the Hive metastore implicitly created earlier for the 1st table. So when `InsertIntoHadoopFsRelation` tries to write the provided data to this data source table, it complains that the path already exists, since the save mode is `SaveMode.ErrorIfExists`.

### Solution:
When creating an external data source table that is non-Hive compatible, make sure we set the provided path on `CatalogTable.storage.locationUri`, so that the Hive metastore does not implicitly create a data location for the table.
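
A simplified sketch of the fix (see the diff above for the exact code; imports and surrounding context are as in `createDataSourceTables.scala`):

```scala
// Resolve the provided path to a directory, since Hive only accepts
// directories as table locations; fall back to the parent directory
// when the path points at a single existing file.
val locationUri: Option[String] =
  new CaseInsensitiveMap(options).get("path").map { p =>
    val path = new Path(p)
    val fs = path.getFileSystem(sparkSession.sessionState.newHadoopConf())
    if (fs.exists(path) && !fs.getFileStatus(path).isDirectory) {
      path.getParent.toUri.toString
    } else {
      path.toUri.toString
    }
  }
```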

## How was this patch tested?
A test case is added, and the regression tests were run.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/xwu0226/spark SPARK-15269

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13120.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13120


commit 21d188321284a86176927445fd1703353e0add09
Author: xin Wu 
Date:   2016-05-08T07:06:36Z

spark-15206 add testcases for distinct aggregate in having clause following 
up PR12974

commit e43d56ab260633d7c2af54a6960cec7eadff34c4
Author: xin Wu 
Date:   2016-05-08T07:09:44Z

Revert "spark-15206 add testcases for distinct aggregate in having clause 
following up PR12974"

This reverts commit 98a1f804d7343ba77731f9aa400c00f1a26c03fe.

commit f9f1f1f36f3759eecfb6070b2372462ee454b700
Author: xin Wu 
Date:   2016-05-13T00:39:45Z

SPARK-15269: set locationUFI to the non-hive compatible metastore table

commit 58ad82db21f90b571d70371ff25c167ecda17720
Author: xin Wu 
Date:   2016-05-14T20:16:11Z

SPARK-15269: only for external datasource table



