[ 
https://issues.apache.org/jira/browse/SPARK-29299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhijeet updated SPARK-29299:
-----------------------------
    Description: 
We are intermittently facing the error below in Spark 2.4 when saving a managed
table from Spark.

Error -
 pyspark.sql.utils.AnalysisException: u"Can not create the managed 
table('`hive_issue`.`table`'). The associated 
location('s3://\{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table')
 already exists.;"

Steps to reproduce--
 1. Create a dataframe in Spark from mid-size data (a 30 MB CSV file)
 2. Save the dataframe as a table
 3. Terminate the session while the above operation is in progress

Note--
 Session termination is just a way to reproduce this issue. In practice we face
this issue intermittently when running the same Spark jobs multiple times. We
use EMRFS and HDFS from an EMR cluster and see the same issue on both systems.
 The only way we can fix this is by deleting the target folder where the table
keeps its files, which is not an option for us: we need to keep historical
information in the table, hence we use APPEND mode when writing to the table.

Sample code--
 from pyspark.sql import SparkSession
 sc = SparkSession.builder.enableHiveSupport().getOrCreate()
 df = sc.read.csv("s3://\{sample-bucket}1/DATA/consumecomplians.csv")
 print("STARTED WRITING TO TABLE")
 # Terminate the session with Ctrl+C after the df.write action below has started
 df.write.mode("append").saveAsTable("hive_issue.table")
 print("COMPLETED WRITING TO TABLE")

We went through the documentation for Spark 2.4 [1] and found that Spark no
longer allows creating managed tables on non-empty folders.

1. What is the reason behind this change in Spark's behavior?
 2. To us it looks like a breaking change: despite specifying the "overwrite"
option, Spark is unable to wipe out the existing data and create the table.
 3. Is there any solution for this issue other than setting the
"spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" flag?

[1]
 [https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html]
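For reference, the legacy flag mentioned in question 3 can be set when building the session. This is only a sketch of that workaround, not a recommendation (the flag restores the pre-2.4 behavior, so stale files from a previously failed write would silently become part of the table's data):

```python
from pyspark.sql import SparkSession

# Sketch: set the legacy flag at session creation to allow creating a
# managed table over a non-empty location, as in Spark 2.3 and earlier.
# Caveat: leftover files from a failed earlier write will be treated as
# part of the table's data.
spark = (
    SparkSession.builder
    .enableHiveSupport()
    .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation",
            "true")
    .getOrCreate()
)
```

The flag must be set before the session is created; setting it on an existing session has no effect on this check.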

 

> Intermittently getting "Cannot create the managed table error" while creating 
> table from spark 2.4
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-29299
>                 URL: https://issues.apache.org/jira/browse/SPARK-29299
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Abhijeet
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
