[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280877#comment-15280877 ]

Xin Wu edited comment on SPARK-15269 at 5/11/16 10:13 PM:
----------------------------------------------------------

The root cause may be the following:

When the first table is created as an external table with a data source path, 
but with the json format, createDataSourceTables considers it a 
non-Hive-compatible table because json is not a Hive SerDe. Then 
newSparkSQLSpecificMetastoreTable is invoked to build the CatalogTable before 
asking HiveClient to create the metastore table. In this call, locationURI is 
not set, so when we convert the CatalogTable to a HiveTable before passing it 
to the Hive metastore, the Hive table's data location is not set either. The 
Hive metastore then implicitly creates a data location at 
<hive warehouse>/tableName, which in this JIRA is 
{code}/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1{code}
I also verified that creating an external table directly in the Hive shell 
without a path results in a default table directory created by Hive. 
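
For illustration, here is a minimal sketch of the implicit location derivation 
described above (a hypothetical helper; Hive's actual logic lives inside the 
metastore, not in Spark):
{code}
import java.net.URI
import org.apache.hadoop.fs.Path

// Hypothetical illustration: when no locationURI is supplied, the Hive
// metastore falls back to <hive warehouse>/<table name> for a table in
// the default database.
def implicitTableLocation(warehouseDir: String, tableName: String): URI =
  new Path(warehouseDir, tableName.toLowerCase).toUri

// implicitTableLocation("<hive warehouse>", "ddl_test1")
//   => <hive warehouse>/ddl_test1, i.e. the directory seen in this JIRA
{code}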

Then, even after the table is dropped, Hive does not delete this implicitly 
created directory, because the table is external. 
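
A minimal sketch of how a test could observe the leftover directory (assumes a 
Hive-enabled {{spark}} session as in the repro below; reading the warehouse 
location from {{spark.sql.warehouse.dir}} is an assumption):
{code}
import java.io.File
import java.net.URI

// After CREATE + DROP of the external table, the implicitly created
// warehouse directory should, per this bug, still be on disk.
val warehouse = spark.conf.get("spark.sql.warehouse.dir")
val leftover = new File(new URI(warehouse).getPath, "ddl_test1")
assert(leftover.exists())  // holds while the stale directory lingers
{code}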

When we create the second table with a SELECT and without a path, the table is 
created as a managed table and is given a default path in the options:
{code}
val optionsWithPath =
  if (!new CaseInsensitiveMap(options).contains("path")) {
    isExternal = false
    options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
  } else {
    options
  }
{code}
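{{defaultTablePath}} resolves to the warehouse directory plus the table name; 
a simplified sketch of that derivation (an illustration, not the actual 
{{SessionCatalog}} source):
{code}
import org.apache.hadoop.fs.Path

// Simplified stand-in for SessionCatalog.defaultTablePath for a table in
// the default database: warehouse dir + table name, which is exactly the
// directory the metastore created implicitly for the first table.
def defaultTablePath(warehouseDir: String, tableName: String): String =
  new Path(warehouseDir, tableName.toLowerCase).toString
{code}
So both tables resolve to {{<hive warehouse>/ddl_test1}}, even though the 
first one never asked for that location.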
This default path happens to be Hive's warehouse directory plus the table 
name, i.e. the same path the Hive metastore implicitly created earlier for 
the first table. So when we try to write the provided data to this data 
source table via
{code}
val plan =
  InsertIntoHadoopFsRelation(
    outputPath,
    partitionColumns.map(UnresolvedAttribute.quoted),
    bucketSpec,
    format,
    () => Unit, // No existing table needs to be refreshed.
    options,
    data.logicalPlan,
    mode)
{code}
InsertIntoHadoopFsRelation complains that the path already exists, because 
the SaveMode is SaveMode.ErrorIfExists.
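
The failing check is essentially the standard {{SaveMode}} dispatch on the 
output path (a simplified sketch of the behavior, not the exact 
{{InsertIntoHadoopFsRelation}} source; the real code throws 
{{AnalysisException}}):
{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// Simplified sketch of the pre-write path handling: with ErrorIfExists,
// a pre-existing output directory aborts the write, which is exactly
// what the stale warehouse directory triggers here.
def checkOutputPath(fs: FileSystem, outputPath: Path, mode: SaveMode): Unit =
  (fs.exists(outputPath), mode) match {
    case (true, SaveMode.ErrorIfExists) =>
      throw new RuntimeException(s"path $outputPath already exists.")
    case (true, SaveMode.Overwrite) =>
      fs.delete(outputPath, true) // replace the existing contents
    case _ =>
      () // Ignore skips the write in the real code; otherwise proceed
  }
{code}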



> Creating external table in test code leaves empty directory under warehouse directory
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-15269
>                 URL: https://issues.apache.org/jira/browse/SPARK-15269
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL, Tests
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>
> It seems that this issue doesn't affect production code. I couldn't reproduce 
> it using Spark shell.
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
>     withTempPath { dir =>
>       val path = dir.getCanonicalPath
>       spark.range(1).write.json(path)
>       withTable("ddl_test1") {
>         sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
>         sql("DROP TABLE ddl_test1")
>         sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>       }
>     }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when the {{PATH}} option is specified.
> When executing the second {{CREATE TABLE}} command, which creates a managed 
> table with the same name, it fails because there's already an unexpected 
> directory with the same name as the table name in the warehouse directory:
> {noformat}
> [info] - foo *** FAILED *** (7 seconds, 649 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: path 
> file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1
>  already exists.;
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> [info]   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
> [info]   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
> [info]   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
> [info]   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59)
> [info]   at 
> org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23$$anonfun$apply$mcV$sp$34$$anonfun$apply$6.apply$mcV$sp(HiveDDLSuite.scala:597)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:166)
> [info]   at 
> org.apache.spark.sql.hive.execution.HiveDDLSuite.withTable(HiveDDLSuite.scala:32)
> [info]   at 
> org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23$$anonfun$apply$mcV$sp$34.apply(HiveDDLSuite.scala:594)
> [info]   at 
> org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23$$anonfun$apply$mcV$sp$34.apply(HiveDDLSuite.scala:590)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> [info]   at 
> org.apache.spark.sql.hive.execution.HiveDDLSuite.withTempPath(HiveDDLSuite.scala:32)
> [info]   at 
> org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23.apply$mcV$sp(HiveDDLSuite.scala:590)
> [info]   at 
> org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23.apply(HiveDDLSuite.scala:590)
> [info]   at 
> org.apache.spark.sql.hive.execution.HiveDDLSuite$$anonfun$23.apply(HiveDDLSuite.scala:590)
> [info]   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:57)
> [info]   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info]   at 
> org.apache.spark.sql.hive.execution.HiveDDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HiveDDLSuite.scala:32)
> [info]   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
> [info]   at 
> org.apache.spark.sql.hive.execution.HiveDDLSuite.runTest(HiveDDLSuite.scala:32)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [info]   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> [info]   at scala.collection.immutable.List.foreach(List.scala:381)
> [info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> [info]   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
> [info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
> [info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
> [info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
> [info]   at org.scalatest.Suite$class.run(Suite.scala:1424)
> [info]   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> [info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
> [info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
> [info]   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:29)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:257)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:256)
> [info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:29)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:357)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:502)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)
> {noformat}


