[ https://issues.apache.org/jira/browse/SPARK-29299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017932#comment-17017932 ]

Steve Loughran commented on SPARK-29299:
----------------------------------------

Have you tried using the S3-optimised committer in EMR?

It still materializes files in the destination on task commit, so it 
potentially retains the issue - I'd like to know whether it does.
Thanks
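
For anyone testing that, a minimal sketch of enabling the committer 
explicitly (the property name is taken from the EMR documentation for the 
EMRFS S3-optimized committer; it is EMR-specific and only affects Parquet 
output written through EMRFS):

  from pyspark.sql import SparkSession

  # Explicitly enable the EMRFS S3-optimized committer (EMR-specific
  # property, on by default from EMR 5.20.0; applies to Parquet via EMRFS).
  spark = (SparkSession.builder
      .enableHiveSupport()
      .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled",
              "true")
      .getOrCreate())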

> Intermittently getting "Cannot create the managed table error" while creating 
> table from spark 2.4
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29299
>                 URL: https://issues.apache.org/jira/browse/SPARK-29299
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Abhijeet
>            Priority: Major
>
> We are intermittently facing the below error in Spark 2.4 when saving a 
> managed table from Spark.
> Error -
>  pyspark.sql.utils.AnalysisException: u"Can not create the managed 
> table('`hive_issue`.`table`'). The associated 
> location('s3://{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table')
>  already exists.;"
> Steps to reproduce--
>  1. Create a DataFrame from mid-size data (a 30 MB CSV file)
>  2. Save the DataFrame as a table
>  3. Terminate the session while the above operation is in progress
> Note--
>  Session termination is just a way to reproduce this issue. In practice we 
> face this issue intermittently when running the same Spark jobs multiple 
> times. We use EMRFS and HDFS from the EMR cluster, and we see the same 
> issue on both systems.
>  The only way we can fix this is by deleting the target folder where the 
> table keeps its files (sketched below), which is not an option for us: we 
> need to keep historical information in the table, hence we write to the 
> table in APPEND mode.
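> To illustrate that workaround (only as a sketch: {bucket_name} is a 
> placeholder from the error above, and going through the JVM gateway to 
> Hadoop's FileSystem API is an internal, unsupported pattern):
>  from pyspark.sql import SparkSession
>  sc = SparkSession.builder.enableHiveSupport().getOrCreate()
>  # Drop the stale table location before re-creating the table.
>  # WARNING: this deletes data, which is why it is not an option for us.
>  hadoop_conf = sc.sparkContext._jsc.hadoopConfiguration()
>  stale_path = sc.sparkContext._jvm.org.apache.hadoop.fs.Path(
>      "s3://{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table")
>  fs = stale_path.getFileSystem(hadoop_conf)
>  if fs.exists(stale_path):
>      fs.delete(stale_path, True)  # recursive delete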
> Sample code--
>  from pyspark.sql import SparkSession
>  sc = SparkSession.builder.enableHiveSupport().getOrCreate()
>  df = sc.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
>  print("STARTED WRITING TO TABLE")
>  # Terminate the session using Ctrl+C after this statement, once the
>  # df.write action has started
>  df.write.mode("append").saveAsTable("hive_issue.table")
>  print("COMPLETED WRITING TO TABLE")
> We went through the documentation for Spark 2.4 [1] and found that Spark no 
> longer allows creating managed tables on non-empty locations.
> 1. What is the reason behind this change in Spark's behavior?
>  2. To us it looks like a breaking change: even when the "overwrite" mode 
> is specified, Spark is unable to wipe out the existing data and create the 
> table.
>  3. Do we have any solution for this issue other than setting the 
> "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" flag (a 
> sketch follows below)?
> [1] https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
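> For completeness, a minimal sketch of setting that legacy flag (the flag 
> name is from the Spark 2.4 migration guide; whether it is a safe long-term 
> workaround is exactly question 3 above):
>  from pyspark.sql import SparkSession
>  # Restore the pre-2.4 behavior: allow creating a managed table whose
>  # location already exists.
>  spark = (SparkSession.builder
>      .enableHiveSupport()
>      .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation",
>              "true")
>      .getOrCreate())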
>  


