Muskan-m opened a new issue, #10779:
URL: https://github.com/apache/iceberg/issues/10779
### Apache Iceberg version
1.2.0
### Query engine
Spark
### Please describe the bug 🐞
I created two tables that share the same location, then dropped one of them. The drop removed the data files from the location, causing data loss, and I can no longer query the other table. I am also unable to restore the dropped table to recover its data.

Question:
Once a table is dropped, where can I find its data if the drop removed it from the table location? I expected the data to be retained temporarily, the way Hive keeps dropped data in its trash.

We are using CDP 7.1.9. We have an existing Iceberg table, shown below:
```
scala> spark.sql("show create table proc_mes_qdata.mesdp_archive").show(false)
+--------------------------------------------------------------------+
|createtab_stmt                                                      |
+--------------------------------------------------------------------+
|CREATE TABLE spark_catalog.proc_mes_qdata.mesdp_archive (\n source_type STRING COMMENT 'type of source data: partdetails or componenttrace or groups, used for partitioning',\n insert_date STRING COMMENT 'Date when data is inserted in Hive table, used for partitioning',\n kafka_offset BIGINT COMMENT 'offset from source Kafka topic',\n kafka_topic STRING COMMENT 'source Kafka topic name',\n key STRING COMMENT 'key from source Kafka topic',\n value STRING COMMENT 'message from source Kafka topic',\n kafka_partition BIGINT COMMENT 'partition from source Kafka topic',\n kafka_timestamp TIMESTAMP COMMENT 'timestamp from source Kafka topic',\n kafka_timestampType BIGINT COMMENT 'timestampType from source Kafka topic')\nUSING iceberg\nPARTITIONED BY (source_type, insert_date)\nCOMMENT 'MES Data Publisher - Storing raw messages of partdetails,componenttrace and groups, partitioned by column source_type and insert_date'\nLOCATION '/proc/mes_qdata/db/mesdp_archive'\nTBLPROPERTIES (\n 'current-snapshot-id' = '8590217566145417146',\n 'format' = 'iceberg/parquet',\n 'format-version' = '1')\n|
```
Then I created another temp table, `mesdp_archive_test`, on the same path:
```
+--------------------------------------------------------------------+
|createtab_stmt                                                      |
+--------------------------------------------------------------------+
|CREATE TABLE spark_catalog.proc_mes_qdata.mesdp_archive_test (\n source_type STRING COMMENT 'type of source data: partdetails or componenttrace or groups, used for partitioning',\n insert_date STRING COMMENT 'Date when data is inserted in Hive table, used for partitioning',\n kafka_offset BIGINT COMMENT 'offset from source Kafka topic',\n kafka_topic STRING COMMENT 'source Kafka topic name',\n key STRING COMMENT 'key from source Kafka topic',\n value STRING COMMENT 'message from source Kafka topic',\n kafka_partition BIGINT COMMENT 'partition from source Kafka topic',\n kafka_timestamp TIMESTAMP COMMENT 'timestamp from source Kafka topic',\n kafka_timestampType BIGINT COMMENT 'timestampType from source Kafka topic')\nUSING iceberg\nPARTITIONED BY (source_type, insert_date)\nCOMMENT 'MES Data Publisher - Storing raw messages of partdetails,componenttrace and groups, partitioned by column source_type and insert_date'\nLOCATION '/proc/mes_qdata/db/mesdp_archive'\nTBLPROPERTIES (\n 'current-snapshot-id' = '8590217566145417146',\n 'format' = 'iceberg/parquet',\n 'format-version' = '1')\n|
```
And ran the commands below:
```
scala> spark.sql("""CALL aeanpprod.system.add_files(table => 'proc_mes_qdata.mesdp_archive_test', source_table => '`parquet`.`/proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/`')""")
```
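As I understand it, the `add_files` procedure only registers the existing Parquet files in the new table's metadata; it does not copy them. So after this call both tables reference the same physical files. A sketch of how this could be confirmed, using Iceberg's `files` metadata table (table names taken from this report, not verified on CDP 7.1.9):

```sql
-- List the physical data files registered in each table's current snapshot.
-- If both queries return the same paths, the tables share the same files,
-- and dropping either table with purge semantics deletes the other's data.
SELECT file_path FROM proc_mes_qdata.mesdp_archive.files;
SELECT file_path FROM proc_mes_qdata.mesdp_archive_test.files;
```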
```
scala> spark.sql("select max(insert_date) from proc_mes_qdata.mesdp_archive_test").show(false)
+----------------+
|max(insert_date)|
+----------------+
|2024-07-24      |
+----------------+

scala> spark.sql("select min(insert_date) from proc_mes_qdata.mesdp_archive_test").show(false)
+----------------+
|min(insert_date)|
+----------------+
|2023-10-20      |
+----------------+
```
Before executing the drop command for the temp table, I had data in the path:
```
[t_mes_qdata_proc@an0vm004 ~]$ hdfs dfs -ls /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace
Found 269 items
/proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-19
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-21 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-20
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-22 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-21
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-23 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-22
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-24 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-23
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-25 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-24
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-26 05:07 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-25
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-27 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-26
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-28 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-27
..............
```
Then I executed the drop command for the temp table:
```
scala> spark.sql("drop table proc_mes_qdata.mesdp_archive_test").show(false)
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.cluster.delegation.token.renew-interval does not exist
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.metastore.runworker.in does not exist
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.cluster.delegation.key.update-interval does not exist
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.cluster.delegation.token.max-lifetime does not exist
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.cluster.delegation.token.gc-interval does not exist
++
||
++
++
```
My data was deleted from the HDFS paths below. The partition directories were kept, but no Parquet files remain inside them:
```
[t_mes_qdata_proc@an0vm004 ~]$ hdfs dfs -ls /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace
Found 269 items
/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-14
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-15
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-16
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-17
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-18
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-19
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-22
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-23
..................
```
According to the Apache Iceberg documentation, this should not happen:
https://iceberg.apache.org/docs/latest/spark-ddl/#drop-table
We did not drop the main table; only the temp table created in this spark-shell session was dropped.
I also checked the Hive trash path (`hdfs dfs -ls /user/hive/.Trash`) and could not find any trace of my table.
There were also no logs or traces of my temp table in HiveServer. Screenshot attached.
Expectations:
I need to know why this data was deleted, and whether there is any trash location for Iceberg tables where dropped data can be retained temporarily.
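In case it helps others avoid this: a possible safeguard, sketched from my reading of the Iceberg docs and not yet verified on CDP 7.1.9, is to mark throwaway tables so a drop cannot physically delete shared files:

```sql
-- Iceberg's gc.enabled table property disables operations that physically
-- delete files, which should prevent a purging drop from removing data.
ALTER TABLE proc_mes_qdata.mesdp_archive_test
SET TBLPROPERTIES ('gc.enabled' = 'false');

-- With Spark, a plain DROP TABLE is documented to leave data files in
-- place; only DROP TABLE ... PURGE requests deletion of the files.
DROP TABLE proc_mes_qdata.mesdp_archive_test;
```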
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [X] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]