[ https://issues.apache.org/jira/browse/SPARK-15682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034076#comment-16034076 ]
lyc commented on SPARK-15682:
-----------------------------

Hi, I tried this for both `orc` and `parquet`, and both throw `path already exists`. The reason is that Spark checks whether the path passed to `save(path)` exists; if it does and the mode is `ErrorIfExists` (the default), it throws. You can overwrite the whole table by specifying `mode("overwrite")` (a sketch follows at the end of this message), but there seems to be no way to overwrite a specific partition.

By the way, if you try `save("test.sms_outbound_view_orc/proc_date=2016-05-30")`, that path will be treated as a table path, so if the write succeeds, the final partition path for `2016-05-30` will be `test.sms_outbound_view_orc/proc_date=2016-05-30/proc_date=2016-05-30`.

What do you mean by `have handle to the hive table`?

> Hive ORC partition write looks for root hdfs folder for existence
> -----------------------------------------------------------------
>
>                 Key: SPARK-15682
>                 URL: https://issues.apache.org/jira/browse/SPARK-15682
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Dipankar
>
> Scenario:
> I am using the program below to create a new partition based on the current date, which signifies the run date.
> However, it fails, citing that the HDFS folder already exists. It checks the root folder, not the new partition value.
> Is the partitionBy clause actually not checking the Hive metastore, or the folder down to proc_date=<some value>? Is it just a way to create folders based on the partition key, with no relation to Hive partitions?
> Alternatively, should I use
> result.write.format("orc").save("test.sms_outbound_view_orc/proc_date=2016-05-30")
> to achieve the result? But this will not update the Hive metastore with the new partition details (one possible workaround is sketched at the end of this message).
> Is Spark's ORC support not equivalent to the HCatStorer API?
> My Hive table is built with proc_date as the partition column.
> Source code:
>
> result.registerTempTable("result_tab")
> val result_partition = sqlContext.sql("FROM result_tab select *, '" + curr_date + "' as proc_date")
> result_partition.write.format("orc").partitionBy("proc_date").save("test.sms_outbound_view_orc")
>
> Exception:
>
> 16/05/31 15:57:34 INFO ParseDriver: Parsing command: FROM result_tab select *,'2016-05-31' as proc_date
> 16/05/31 15:57:34 INFO ParseDriver: Parse Completed
> Exception in thread "main" org.apache.spark.sql.AnalysisException: path hdfs://hdpprod/user/dipankar.ghosal/test.sms_outbound_view_orc already exists.;
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76)
>     at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>     at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>     at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>     at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>     at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
>     at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
>     at SampleApp$.main(SampleApp.scala:31)
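A minimal sketch of the whole-table overwrite mentioned in the comment above, assuming Spark 1.6's DataFrameWriter API and the reporter's `result_partition` DataFrame (an illustration only, not a per-partition fix):

    // Overwrite the entire table path. mode("overwrite") replaces
    // everything under the target directory, including partitions
    // written by earlier runs; save() offers no per-partition overwrite here.
    result_partition.write
      .format("orc")
      .mode("overwrite")
      .partitionBy("proc_date")
      .save("test.sms_outbound_view_orc")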
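And a sketch of one possible workaround for the metastore question in the description: write the new partition's directory directly, then register it with Hive. This assumes `sqlContext` is a HiveContext, that the table `test.sms_outbound_view_orc` already exists in the metastore, and that the HDFS path matches the one in the stack trace; treat it as an illustration, not a confirmed recipe:

    // 1) Write only the new partition's rows to its directory. The partition
    //    column is dropped because the directory name already carries its value.
    val partitionPath =
      s"hdfs://hdpprod/user/dipankar.ghosal/test.sms_outbound_view_orc/proc_date=$curr_date"
    result_partition
      .drop("proc_date")
      .write
      .format("orc")
      .mode("overwrite")   // overwrites only this partition's directory
      .save(partitionPath)

    // 2) Tell the Hive metastore about the new partition so that Hive and
    //    Spark SQL queries can see it.
    sqlContext.sql(
      s"ALTER TABLE test.sms_outbound_view_orc ADD IF NOT EXISTS " +
      s"PARTITION (proc_date='$curr_date') LOCATION '$partitionPath'")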