[ https://issues.apache.org/jira/browse/SPARK-29299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17017932#comment-17017932 ]
Steve Loughran commented on SPARK-29299:
----------------------------------------

Have you tried using the S3-optimized committer in EMR? It still materializes files in the destination on task commit, so it potentially retains this issue; I'd like to know whether it does. Thanks. (A configuration sketch for trying this follows below the quoted report.)

> Intermittently getting "Cannot create the managed table" error while creating a table from Spark 2.4
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29299
>                 URL: https://issues.apache.org/jira/browse/SPARK-29299
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Abhijeet
>            Priority: Major
>
> We are intermittently seeing the error below in Spark 2.4 when saving a managed table from Spark.
>
> Error:
> pyspark.sql.utils.AnalysisException: u"Can not create the managed table('`hive_issue`.`table`'). The associated location('s3://{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table') already exists.;"
>
> Steps to reproduce:
> 1. Create a dataframe from mid-sized data (a 30 MB CSV file).
> 2. Save the dataframe as a table.
> 3. Terminate the session while the operation above is in progress.
>
> Note: terminating the session is just one way to reproduce the issue. In practice we hit it intermittently when running the same Spark jobs multiple times. We use EMRFS and HDFS from an EMR cluster and see the same issue on both systems.
>
> The only way we can fix this is by deleting the target folder where the table keeps its files, which is not an option for us: we need to keep historical information in the table, hence we write to it in APPEND mode.
>
> Sample code:
> from pyspark.sql import SparkSession
> sc = SparkSession.builder.enableHiveSupport().getOrCreate()
> df = sc.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
> print "STARTED WRITING TO TABLE"
> # Terminate the session with ctrl+c after this statement, once the df.write action has started
> df.write.mode("append").saveAsTable("hive_issue.table")
> print "COMPLETED WRITING TO TABLE"
>
> We went through the documentation of Spark 2.4 [1] and found that Spark no longer allows creating managed tables on non-empty folders. Our questions:
> 1. What is the reason behind this change in Spark's behavior?
> 2. To us this looks like a breaking change: even when the "overwrite" mode is specified, Spark is unable to wipe out the existing data and create the table.
> 3. Is there any solution for this issue other than setting the "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" flag? (A sketch of setting that flag also follows below.)
>
> [1] https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html
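
On point 3 of the report: a minimal PySpark sketch of restoring the pre-2.4 behaviour with the legacy flag named in the report and in the migration guide [1]. The bucket, path, and table names are the placeholders from the sample code above, not a real environment.

    from pyspark.sql import SparkSession

    # Relax the Spark 2.4 safety check: allow saveAsTable to create a managed
    # table whose location already contains files. Flag name is the one the
    # report and the Spark 2.4 migration guide [1] mention.
    spark = (SparkSession.builder
             .enableHiveSupport()
             .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation",
                     "true")
             .getOrCreate())

    # Placeholder bucket/path taken from the report's sample code.
    df = spark.read.csv("s3://{sample-bucket}1/DATA/consumecomplians.csv")
    df.write.mode("append").saveAsTable("hive_issue.table")

Note this only disables the check; if an earlier run died mid-write, whatever files it left in the table location may be read back as table data, so stale partial output may still need manual cleanup.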
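
For trying Steve's suggestion above, a sketch assuming the EMRFS S3-optimized committer property as AWS documents it for EMR 5.19 and later; the property name is an assumption here, so verify it against your EMR release's documentation.

    from pyspark.sql import SparkSession

    # Assumed property name from the AWS EMR docs (EMR 5.19+); it enables the
    # EMRFS S3-optimized committer, which applies to Spark's Parquet writes.
    spark = (SparkSession.builder
             .enableHiveSupport()
             .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled",
                     "true")
             .getOrCreate())

As the comment above notes, this committer still materializes files in the destination on task commit, so it may well retain the failure mode; reporting the result either way would answer Steve's question.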