[ https://issues.apache.org/jira/browse/SPARK-18544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Liang updated SPARK-18544:
-------------------------------
    Description: 
When using saveAsTable in append mode, data will be written to the wrong
location for non-managed Datasource tables. The following example illustrates
this.

It seems we somehow pass the wrong table path to InsertIntoHadoopFsRelation
from DataFrameWriter. Also, we should probably remove the repair table call at
the end of saveAsTable in DataFrameWriter; it shouldn't be needed in either
the Hive or the Datasource case.

{code}
scala> spark.sqlContext.range(100).selectExpr("id", "id as A", "id as B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test")

scala> sql("create table test (id long, A int, B int) USING parquet OPTIONS (path '/tmp/test') PARTITIONED BY (A, B)")

scala> sql("msck repair table test")

scala> sql("select * from test where A = 1").count
res6: Long = 1

scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as B").write.partitionBy("A", "B").mode("append").saveAsTable("test")

scala> sql("select * from test where A = 1").count
res8: Long = 1
{code}

  was:
When using saveAsTable in append mode, data will be written to the wrong
location for non-managed Datasource tables. The following example illustrates
this.

It seems we somehow pass the wrong table path to InsertIntoHadoopFsRelation
from DataFrameWriter. Also, we should probably remove the repair table call at
the end of saveAsTable in DataFrameWriter; it shouldn't be needed in either
the Hive or the Datasource case.

{code}
scala> spark.sqlContext.range(10000).selectExpr("id", "id as A", "id as B").write.partitionBy("A", "B").mode("overwrite").parquet("/tmp/test_10k")

scala> sql("msck repair table test_10k")

scala> sql("select * from test_10k where A = 1").count
res6: Long = 1

scala> spark.sqlContext.range(10).selectExpr("id", "id as A", "id as B").write.partitionBy("A", "B").mode("append").parquet("/tmp/test_10k")

scala> sql("select * from test_10k where A = 1").count
res8: Long = 1
{code}


> Append with df.saveAsTable writes data to wrong location
> --------------------------------------------------------
>
>                 Key: SPARK-18544
>                 URL: https://issues.apache.org/jira/browse/SPARK-18544
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Eric Liang
>            Priority: Blocker


