[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280877#comment-15280877 ]

Xin Wu edited comment on SPARK-15269 at 5/11/16 9:47 PM:
---------------------------------------------------------

The root cause may be the following:

When the first table is created as an external table with a data source path, 
but as {{json}}, {{createDataSourceTables}} considers it a non-Hive-compatible 
table because {{json}} is not a Hive SerDe. Then {{newSparkSQLSpecificMetastoreTable}} 
is invoked to create the {{CatalogTable}} before asking {{HiveClient}} to create 
the metastore table. In this call, {{locationURI}} is not set, so when the 
{{CatalogTable}} is converted to a {{HiveTable}} before being passed to the Hive 
metastore, the Hive table's data location is not set either. The Hive metastore 
then implicitly creates a default data location of the form 
{{<hive warehouse>/tableName}}, which is 
{code}/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1{code}
 in this JIRA. I also verified that creating an external table directly in the 
Hive shell without a path results in a default table directory created by Hive. 

Then, even after the table is dropped, Hive will not delete this stealth 
directory because the table is external. 
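
To make the leftover directory visible, here is a minimal sketch in the style of the repro below (assuming the {{HiveDDLSuite}} helpers {{withTempPath}} and {{sql}} are in scope, plus a Hive-enabled {{SparkSession}} named {{spark}}; the warehouse-dir lookup is an assumption and may return a {{file:}} URI):
{code}
// Minimal sketch, assuming HiveDDLSuite-style helpers (withTempPath, sql)
// and a SparkSession `spark` with Hive support.
withTempPath { dir =>
  val path = dir.getCanonicalPath
  spark.range(1).write.json(path)
  sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
  sql("DROP TABLE ddl_test1")
  // The metastore-assigned default location outlives the external table.
  // Assumption: the warehouse dir config may come back as a file: URI.
  val warehouse = spark.conf.get("spark.sql.warehouse.dir").stripPrefix("file:")
  val leftover = new java.io.File(warehouse, "ddl_test1")
  assert(leftover.exists()) // the "stealth" directory is still there
}
{code}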



> Creating external table in test code leaves empty directory under warehouse directory
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-15269
>                 URL: https://issues.apache.org/jira/browse/SPARK-15269
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>
> It seems that this issue doesn't affect production code. I couldn't reproduce 
> it using Spark shell.
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
>     withTempPath { dir =>
>       val path = dir.getCanonicalPath
>       spark.range(1).write.json(path)
>       withTable("ddl_test1") {
>         sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
>         sql("DROP TABLE ddl_test1")
>         sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>       }
>     }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when {{PATH}} option is specified.
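> One quick way to confirm this in the same session (a sketch, not part of the original test; the exact {{DESCRIBE FORMATTED}} output layout may vary between versions):
> {code}
>   // Hive-style detailed output; the "Table Type:" row should read
>   // EXTERNAL_TABLE for the PATH-based table:
>   sql("DESCRIBE FORMATTED ddl_test1").collect().foreach(println)
> {code}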
> When executing the second {{CREATE TABLE}} command, which creates a managed 
> table with the same name, it fails because there's already an unexpected 
> directory with the same name as the table name in the warehouse directory:
> {noformat}
> [info] - foo *** FAILED *** (7 seconds, 649 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: path file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1 already exists.;
> [info]   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
> [info]   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> [info]   at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231)
> [info]   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
> [info]   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
> [info]   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
> [info]   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
> [info]   at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59)
> [info]   at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59)
> [info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23$$anonfun$apply$mcV$sp$34$$anonfun$apply$6.apply$mcV$sp(HiveDDLSuite.scala:597)
> [info]   at org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:166)
> [info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.withTable(HiveDDLSuite.scala:32)
> [info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23$$anonfun$apply$mcV$sp$34.apply(HiveDDLSuite.scala:594)
> [info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23$$anonfun$apply$mcV$sp$34.apply(HiveDDLSuite.scala:590)
> [info]   at org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> [info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.withTempPath(HiveDDLSuite.scala:32)
> [info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23.apply$mcV$sp(HiveDDLSuite.scala:590)
> [info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23.apply(HiveDDLSuite.scala:590)
> [info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23.apply(HiveDDLSuite.scala:590)
> [info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
> [info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HiveDDLSuite.scala:32)
> [info]   at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
> [info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.runTest(HiveDDLSuite.scala:32)
> [info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [info]   at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> [info]   at scala.collection.immutable.List.foreach(List.scala:381)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
> [info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
> [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
> [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
> [info]   at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
> [info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
> [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
> [info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29)
> [info]   at org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
> [info]   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
> [info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:29)
> [info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
> [info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)
> {noformat}
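> Until this is fixed, a test-side workaround is to remove the leftover directory explicitly between the two commands (a sketch; the warehouse-dir lookup and the commons-io dependency are assumptions, not part of the original test):
> {code}
>   // Hypothetical cleanup between DROP TABLE and the managed CREATE TABLE:
>   val stale = new java.io.File(
>     spark.conf.get("spark.sql.warehouse.dir").stripPrefix("file:"), "ddl_test1")
>   if (stale.exists()) org.apache.commons.io.FileUtils.deleteDirectory(stale)
> {code}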


