joeytman opened a new issue, #10550:
URL: https://github.com/apache/iceberg/issues/10550

   ### Apache Iceberg version
   
   1.5.2 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   We have a table in our relational database named `files`. When we ingest the 
table into our data lake, we would like to keep the name `files` for the table. 
However, it appears that a table cannot share a name with Iceberg's metadata 
tables (`files`, `history`, etc.).
   
   We use the Hive metastore and `SparkSessionCatalog`.
   
   The issue is reproducible via both Spark SQL and a Spark Scala job.
   
   For the Spark Scala job, we're using [this 
JAR](https://mvnrepository.com/artifact/org.apache.iceberg/iceberg-spark-runtime-3.4_2.12/1.5.2) 
and running on EMR. We have code like:
   ```scala
   val catalog = new HiveCatalog()
   ...
   val tableId = TableIdentifier.parse(config.table)
   val table = catalog.createTable(tableId, schema, newPartitionSpec(schema),
     config.tableLocation, config.tableProps)
   ...
   df.withColumns(extraCols.toMap)
     .writeTo(config.table)
     .options(config.icebergProps)
     .overwritePartitions()
   ```
   
   This code works as long as the table name does not match `files` or another 
metadata table name: it correctly produces a v2 Iceberg table and writes the 
data. However, when the table is named `files`, the table is created in HMS 
successfully and a metadata file is written, but the write then fails with:
   ```
   User class threw exception: org.apache.spark.sql.AnalysisException: Cannot 
write into v1 table: `spark_catalog`.`iceberg`.`files`
   ```
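   The failure pattern looks like a collision with Iceberg's metadata tables, 
which are addressed as `<table>.files`, `<table>.history`, and so on. As a 
hedged illustration of the name set involved, a pre-flight check in an 
ingestion job might look like the sketch below. The set is hand-listed and 
partial; the authoritative source is Iceberg's `MetadataTableType` enum.
   ```scala
   // Hypothetical pre-flight check for an ingestion job: flag destination
   // names that collide with Iceberg metadata table names. The set below is
   // hand-listed and partial -- the authoritative list is Iceberg's
   // MetadataTableType enum.
   object ReservedNames {
     val metadataTableNames: Set[String] = Set(
       "entries", "files", "history", "snapshots", "manifests",
       "partitions", "refs", "metadata_log_entries",
       "all_data_files", "all_entries", "all_manifests")

     def collides(tableName: String): Boolean =
       metadataTableNames.contains(tableName.toLowerCase)
   }
   ```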
   
   I then opened the spark-sql CLI on EMR (using the EMR-provided JAR for 
convenience):
   ```
   spark-sql \
   --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
   --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
   --conf spark.sql.catalog.spark_catalog.type=hive \
   --conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
   --conf spark.sql.catalog.spark_catalog.warehouse=s3://redacted/redacted \
   --conf spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar
   ```
   From there, I compared the table created by this job with other Iceberg 
tables written by the same job.
   
   For other tables, I could successfully run `SHOW CREATE TABLE` and see the 
Iceberg v2 table:
   ```
   spark-sql (default)> show create table iceberg.users;
   CREATE TABLE spark_catalog.iceberg.users (
   ...
    <the rest of the statement looks normal for a v2 iceberg table>
   ...
   ```
   
   However, when I tried to query the `files` table, I saw:
   ```
   spark-sql (default)> show create table iceberg.files;
   Failed to execute SHOW CREATE TABLE against table files, which is created by 
Hive and uses the following unsupported serde configuration
    SERDE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe INPUTFORMAT: 
org.apache.hadoop.mapred.FileInputFormat OUTPUTFORMAT: 
org.apache.hadoop.mapred.FileOutputFormat
   Please use `SHOW CREATE TABLE files AS SERDE` to show Hive DDL instead.
   ```
   It is as if the table were not really an Iceberg table.
   
   
   I then had the idea to submit the same job with the exact same args, 
modifying only the table name argument to call the destination table 
`files_temp` instead of `files`. To my surprise, the job succeeded.
   
   From the spark-sql cli, I noticed that I was able to interact with my 
`files_temp` table as expected:
   ```
   spark-sql (default)> show create table iceberg.files_temp;
   CREATE TABLE spark_catalog.iceberg.files_temp (
   ...
    <the rest of the statement looks normal for a v2 iceberg table>
   ...
   ```
   
   It looked exactly as expected. My hope was that the bug was in my Spark 
Scala bootstrap job and that I could simply rename `files_temp` to `files`. 
However, renaming the table to `files` immediately breaks it and reproduces 
the issue:
   
   ```
   spark-sql (default)> alter table iceberg.files_temp rename to iceberg.files;
   
   spark-sql (default)> show create table iceberg.files;
   Failed to execute SHOW CREATE TABLE against table files, which is created by 
Hive and uses the following unsupported serde configuration
    SERDE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe INPUTFORMAT: 
org.apache.hadoop.mapred.FileInputFormat OUTPUTFORMAT: 
org.apache.hadoop.mapred.FileOutputFormat
   Please use `SHOW CREATE TABLE files AS SERDE` to show Hive DDL instead.
   ```
   
   Renaming the table back to something other than `files` makes it readable 
again:
   ```
   spark-sql (default)> alter table iceberg.files rename to iceberg.files_temp;
   
   spark-sql (default)> show create table iceberg.files_temp;
   CREATE TABLE spark_catalog.iceberg.files_temp (
   ...
    <the rest of the statement looks normal for a v2 iceberg table>
   ...
   ```
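   Until the collision is fixed, the `files_temp` observation above suggests a 
workaround of suffixing colliding destination names during ingestion. A 
minimal sketch follows; the `_tbl` suffix and the hand-listed name set are 
assumptions for illustration, not anything Iceberg prescribes.
   ```scala
   // Hypothetical workaround: map any source table name that collides with an
   // Iceberg metadata table name to a suffixed destination name. The "_tbl"
   // suffix is an arbitrary choice for illustration.
   object SafeNames {
     private val reserved = Set(
       "entries", "files", "history", "snapshots", "manifests",
       "partitions", "refs")

     def safeDestinationName(name: String): String =
       if (reserved.contains(name.toLowerCase)) name + "_tbl" else name
   }
   ```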
   
   After this, I also tried `SparkCatalog` instead of `SparkSessionCatalog`. 
When I renamed the table to `iceberg.files`, `SparkCatalog` could not even 
find the table, as if it were not an Iceberg table:
   ```
   spark-sql (default)> show create table iceberg.files_temp;
   CREATE TABLE spark_catalog.iceberg.files_temp (
   ...
    <the rest of the statement looks normal for a v2 iceberg table>
   ...
   
   spark-sql (default)> alter table iceberg.files_temp rename to iceberg.files;
   Time taken: 0.141 seconds
   
   spark-sql (default)> show create table iceberg.files;
   [TABLE_OR_VIEW_NOT_FOUND] The table or view `iceberg`.`files` cannot be 
found. Verify the spelling and correctness of the schema and catalog.
   If you did not qualify the name with a schema, verify the current_schema() 
output, or qualify the name with the correct schema and catalog.
   To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF 
EXISTS.; line 1 pos 18;
   'ShowCreateTable false, [createtab_stmt#101]
   +- 'UnresolvedTableOrView [iceberg_dev, files], SHOW CREATE TABLE, false
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

