joeytman opened a new issue, #10550: URL: https://github.com/apache/iceberg/issues/10550
### Apache Iceberg version

1.5.2 (latest release)

### Query engine

Spark

### Please describe the bug 🐞

We have a table in our relational database named `files`. When we ingest the table into our data lake, we would like to keep the name `files`. However, it seems that tables cannot be named `files`, `history`, etc. We use the Hive metastore and `SparkSessionCatalog`. This issue is reproducible both via Spark SQL and via a Spark Scala job.

For the Spark Scala job, we're using [this JAR](https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-spark-runtime-3.4_2.12/1.5.2) and running on EMR. We have some code like:

```scala
val catalog = new HiveCatalog()
...
val tableId = TableIdentifier.parse(config.table)
val table = catalog.createTable(tableId, schema, newPartitionSpec(schema), config.tableLocation, config.tableProps)
...
df.withColumns(extraCols.toMap)
  .writeTo(config.table)
  .options(config.icebergProps)
  .overwritePartitions()
```

This code works as long as the name of the table is not `files` or another metadata table name: it correctly produces a v2 Iceberg table and writes the data.
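For context, Iceberg exposes metadata tables under reserved suffixes such as `files`, `history`, `snapshots`, and `manifests` (enumerated in `org.apache.iceberg.MetadataTableType`). A minimal pre-flight check is sketched below; the name list is hand-copied and may not match every Iceberg version, and `ReservedNames` is an illustrative helper, not an Iceberg API:

```scala
// Illustrative sketch only: reserved metadata table names, hand-listed from
// org.apache.iceberg.MetadataTableType (the exact set may differ by version).
object ReservedNames {
  val metadataTableNames: Set[String] = Set(
    "entries", "files", "history", "snapshots", "manifests",
    "partitions", "all_data_files", "all_manifests", "all_entries", "refs")

  // True when a plain table name would collide with a metadata table suffix.
  def collides(tableName: String): Boolean =
    metadataTableNames.contains(tableName.toLowerCase)
}
```

A check like this in our bootstrap job would at least have failed fast instead of producing a half-broken table.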
However, when the table is named `files`, the table is created in HMS successfully and a metadata file is written, but the write then fails with:

```
User class threw exception: org.apache.spark.sql.AnalysisException: Cannot write into v1 table: `spark_catalog`.`iceberg`.`files`
```

I then went into the spark-sql CLI on EMR (using the EMR-provided JAR for convenience), as follows:

```
spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.spark_catalog.warehouse=s3://redacted/redacted \
  --conf spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar
```

From there, I compared the table created by my job with my other Iceberg tables written by the same job. For other tables, I could successfully run `SHOW CREATE TABLE` and see the Iceberg v2 table:

```
spark-sql (default)> show create table iceberg.users;
CREATE TABLE spark_catalog.iceberg.users (
...
<the rest of the statement looks normal for a v2 iceberg table>
...
```

However, when I tried to query the `files` table, I saw:

```
spark-sql (default)> show create table iceberg.files;
Failed to execute SHOW CREATE TABLE against table files, which is created by Hive and uses the following unsupported serde configuration
 SERDE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 INPUTFORMAT: org.apache.hadoop.mapred.FileInputFormat
 OUTPUTFORMAT: org.apache.hadoop.mapred.FileOutputFormat
Please use `SHOW CREATE TABLE files AS SERDE` to show Hive DDL instead.
```

...as if the table were not really an Iceberg table. I then had the idea to submit the same job with the exact same args, modifying only the table name argument to call the destination table `files_temp` instead of `files`. To my surprise, the job succeeded.
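My guess at the mechanism (a sketch of my working theory, not Iceberg's actual source): `SparkSessionCatalog` first asks the Iceberg catalog for the identifier and only falls back to the built-in session catalog when the Iceberg load fails, so a reserved-name collision surfaces as a Hive v1 table rather than as a clear error. A toy model of that try-then-fallback resolution, with hypothetical names throughout:

```scala
// Hypothetical model of SparkSessionCatalog's try-Iceberg-then-fallback
// resolution; types and logic here are illustrative, not Iceberg's code.
sealed trait Resolved
case class IcebergV2Table(name: String) extends Resolved
case class HiveV1Table(name: String) extends Resolved

object ResolutionSketch {
  private val metadataNames = Set("files", "history", "snapshots", "manifests")

  // "db.files" is treated as the files metadata table of a table "db", which
  // does not exist, so the Iceberg load fails and Hive resolution wins.
  def resolve(db: String, table: String): Resolved =
    if (metadataNames.contains(table)) HiveV1Table(s"$db.$table")
    else IcebergV2Table(s"$db.$table")
}
```

Under this model, the write fails with "Cannot write into v1 table" even though the Iceberg metadata was written correctly, which matches what I'm seeing.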
From the spark-sql CLI, I noticed that I was able to interact with my `files_temp` table as expected:

```
spark-sql (default)> show create table iceberg.files_temp;
CREATE TABLE spark_catalog.iceberg.files_temp (
...
<the rest of the statement looks normal for a v2 iceberg table>
...
```

It looked exactly as expected. My hope was that the bug was in my Scala bootstrap job, and that I could simply rename `files_temp` to `files` and it would work. However, renaming the table to `files` immediately breaks it and reproduces the issue:

```
spark-sql (default)> alter table iceberg.files_temp rename to iceberg.files;
spark-sql (default)> show create table iceberg.files;
Failed to execute SHOW CREATE TABLE against table files, which is created by Hive and uses the following unsupported serde configuration
 SERDE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
 INPUTFORMAT: org.apache.hadoop.mapred.FileInputFormat
 OUTPUTFORMAT: org.apache.hadoop.mapred.FileOutputFormat
Please use `SHOW CREATE TABLE files AS SERDE` to show Hive DDL instead.
```

Renaming the table back to something other than `files` makes it interpretable again:

```
spark-sql (default)> alter table iceberg.files rename to iceberg.files_temp;
spark-sql (default)> show create table iceberg.files_temp;
CREATE TABLE spark_catalog.iceberg.files_temp (
...
<the rest of the statement looks normal for a v2 iceberg table>
...
```

After this, I even tried using `SparkCatalog` instead of `SparkSessionCatalog`; when I renamed the table to `iceberg.files`, `SparkCatalog` was unable to even find the table, as if it were a non-Iceberg table:

```
spark-sql (default)> show create table iceberg.files_temp;
CREATE TABLE spark_catalog.iceberg.files_temp (
...
<the rest of the statement looks normal for a v2 iceberg table>
...
```
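The `SparkCatalog` "not found" behavior would also be consistent with metadata-table interception: Iceberg's Spark catalogs accept metadata tables through multipart identifiers (e.g. `SELECT * FROM db.table.files`), so an identifier whose last part is a reserved name may be split into (base table, metadata table) before it is ever tried as a plain table. A toy model of that split, illustrative only and not taken from Iceberg's `SparkCatalog`:

```scala
// Toy model: an identifier ending in a reserved name is interpreted as a
// metadata-table reference on the preceding parts, instead of a plain
// table lookup. Illustrative only; not Iceberg's actual parsing code.
object IdentifierSketch {
  private val metadataNames = Set("files", "history", "snapshots", "manifests")

  // Right((baseParts, metadataTable)) when the last part is reserved,
  // Left(parts) for an ordinary table identifier.
  def interpret(parts: Seq[String]): Either[Seq[String], (Seq[String], String)] =
    parts.lastOption match {
      case Some(last) if metadataNames.contains(last) => Right((parts.init, last))
      case _ => Left(parts)
    }
}
```

Under this reading, `iceberg.files` is resolved as the `files` metadata table of a table named `iceberg`, which does not exist, hence the TABLE_OR_VIEW_NOT_FOUND below.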
```
spark-sql (default)> alter table iceberg.files_temp rename to iceberg.files;
Time taken: 0.141 seconds
spark-sql (default)> show create table iceberg.files;
[TABLE_OR_VIEW_NOT_FOUND] The table or view `iceberg`.`files` cannot be found. Verify the spelling and correctness of the schema and catalog. If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog. To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 18;
'ShowCreateTable false, [createtab_stmt#101]
+- 'UnresolvedTableOrView [iceberg_dev, files], SHOW CREATE TABLE, false
```
