[ https://issues.apache.org/jira/browse/SPARK-15682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034076#comment-16034076 ]
lyc commented on SPARK-15682:
-----------------------------

Hi, I tried this for both `orc` and `parquet`, and both throw `path already exists`. The reason is that Spark checks whether the path passed to `save(path)` exists; if it does and the mode is `ErrorIfExists` (the default), it throws. You can overwrite the whole table by specifying `mode("overwrite")` (a sketch follows at the end of this message), but there seems to be no way to overwrite a specific partition.

By the way, if you try `save("test.sms_outbound_view_orc/proc_date=2016-05-30")`, that path will be treated as a table path, so if the write succeeds, the final partition path for `2016-05-30` will be `test.sms_outbound_view_orc/proc_date=2016-05-30/proc_date=2016-05-30`.

What do you mean by `have handle to the hive table`?

> Hive ORC partition write looks for root hdfs folder for existence
> -----------------------------------------------------------------
>
>                 Key: SPARK-15682
>                 URL: https://issues.apache.org/jira/browse/SPARK-15682
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Dipankar
>
> Scenario:
> I am using the program below to create a new partition based on the current date, which signifies the run date.
> However, it fails, citing that the HDFS folder already exists. It checks the root folder, not the new partition value.
> Is the partitionBy clause actually not checking the Hive metastore, or the folder down to proc_date=<some value>? Is it just a way to create folders based on the partition key, with no relation to Hive partitions?
> Alternatively, should I use
> result.write.format("orc").save("test.sms_outbound_view_orc/proc_date=2016-05-30")
> to achieve the result? But this will not update the Hive metastore with the new partition details (one possible workaround is sketched at the end of this message).
> Is Spark's ORC support not equivalent to the HCatStorer API?
> My Hive table is built with proc_date as the partition column.
> Source code:
>
> result.registerTempTable("result_tab")
> val result_partition = sqlContext.sql("FROM result_tab select *, '" + curr_date + "' as proc_date")
> result_partition.write.format("orc").partitionBy("proc_date").save("test.sms_outbound_view_orc")
>
> Exception:
>
> 16/05/31 15:57:34 INFO ParseDriver: Parsing command: FROM result_tab select *,'2016-05-31' as proc_date
> 16/05/31 15:57:34 INFO ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: path hdfs://hdpprod/user/dipankar.ghosal/test.sms_outbound_view_orc already exists.;
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
>     at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>     at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>     at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>     at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>     at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>     at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>     at SampleApp$.main(SampleApp.scala:31)
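A minimal sketch of the whole-table overwrite mentioned in the comment above, assuming Spark 1.6's DataFrameWriter API and the reporter's `result_partition` DataFrame (an illustration only, not a per-partition fix):

    // Overwrite the entire table path. mode("overwrite") replaces
    // everything under the target directory, including partitions
    // written by earlier runs; save() offers no per-partition overwrite here.
    result_partition.write
      .format("orc")
      .mode("overwrite")
      .partitionBy("proc_date")
      .save("test.sms_outbound_view_orc")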
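And a sketch of one possible workaround for the metastore question in the description: write the new partition's directory directly, then register it with Hive. This assumes `sqlContext` is a HiveContext, that the table `test.sms_outbound_view_orc` already exists in the metastore, and that the HDFS path matches the one in the stack trace; treat it as an illustration, not a confirmed recipe:

    // 1) Write only the new partition's rows to its directory. The partition
    //    column is dropped because the directory name already carries its value.
    val partitionPath =
      s"hdfs://hdpprod/user/dipankar.ghosal/test.sms_outbound_view_orc/proc_date=$curr_date"
    result_partition
      .drop("proc_date")
      .write
      .format("orc")
      .mode("overwrite")   // overwrites only this partition's directory
      .save(partitionPath)

    // 2) Tell the Hive metastore about the new partition so that Hive and
    //    Spark SQL queries can see it.
    sqlContext.sql(
      s"ALTER TABLE test.sms_outbound_view_orc ADD IF NOT EXISTS " +
      s"PARTITION (proc_date='$curr_date') LOCATION '$partitionPath'")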