Hi,

We recently upgraded Spark from 2.4.x to 3.3.1, and managed table creation
now fails when writing a dataframe with saveAsTable, with the error below:

Can not create the managed table(`<table name>`) The associated
location('hdfs:<table path>') already exists.

At a high level, our code does the following before writing the dataframe as a table:

sparkSession.sql(s"DROP TABLE IF EXISTS $hiveTableName PURGE")
mydataframe.write.mode(SaveMode.Overwrite).saveAsTable(hiveTableName)

The above code works on Spark 2 because of
spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation, which is
no longer available in Spark 3.

Since the table is dropped and purged before the dataframe is written, I
expected the write not to complain that the path already exists.

After digging further, I noticed there is a `_temporary` folder present in
the HDFS table path:

hdfs dfs -ls /apps/hive/warehouse/<table-path>/
Found 1 items
drwxr-xr-x   - hadoop hdfsadmingroup          0 2023-06-23 04:45
/apps/hive/warehouse/<table-path>/_temporary

[root@ip-10-121-107-90 bin]# hdfs dfs -ls
/apps/hive/warehouse/<table-path>/_temporary
Found 1 items
drwxr-xr-x   - hadoop hdfsadmingroup          0 2023-06-23 04:45
/apps/hive/warehouse/<table-path>/_temporary/0

[root@ip-10-121-107-90 bin]# hdfs dfs -ls
/apps/hive/warehouse/<table-path>/_temporary/0
Found 1 items
drwxr-xr-x   - hadoop hdfsadmingroup          0 2023-06-23 04:45
/apps/hive/warehouse/<table-path>/_temporary/0/_temporary

Is this because of task failures? Is there a way to work around this issue?
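For now, the workaround I am considering is to explicitly delete the leftover
table location between the DROP and the write. This is only a sketch, assuming
the Hadoop FileSystem API is reachable from the driver; cleanTableLocation is
a hypothetical helper name, and the warehouse path below is illustrative:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: recursively remove any leftover files (e.g. a stale
// _temporary directory from a failed or interrupted job) under the managed
// table's location before writing.
def cleanTableLocation(spark: org.apache.spark.sql.SparkSession,
                       tablePath: String): Unit = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val location = new Path(tablePath)
  if (fs.exists(location)) {
    fs.delete(location, true) // true = recursive
  }
}

sparkSession.sql(s"DROP TABLE IF EXISTS $hiveTableName PURGE")
cleanTableLocation(sparkSession, "/apps/hive/warehouse/<table-path>")
mydataframe.write.mode(SaveMode.Overwrite).saveAsTable(hiveTableName)
```

That said, if the `_temporary` directory is being left behind by a concurrent
or failed write to the same table, cleaning the path only hides the symptom,
so I would still like to understand the root cause.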

Thanks
