Muskan-m opened a new issue, #10779:
URL: https://github.com/apache/iceberg/issues/10779
### Apache Iceberg version
1.2.0
### Query engine
Spark
### Please describe the bug 🐞
I created two tables that share the same location, then dropped one of them. The drop removed the data files from the location, causing data loss, and I can no longer query the other table. I am also unable to restore the dropped table to recover its data.

Question:
Once a table is dropped, where can I find its data if the drop removed it from the table location? I expected the data to be retained temporarily, the way Hive keeps dropped data in its trash.

We are using CDP 7.1.9. We have an existing Iceberg table, shown below:
```
scala> spark.sql("show create table proc_mes_qdata.mesdp_archive").show(false)
+--------------------------------------------------------------------+
|createtab_stmt                                                      |
+--------------------------------------------------------------------+
|CREATE TABLE spark_catalog.proc_mes_qdata.mesdp_archive (\n source_type STRING COMMENT 'type of source data: partdetails or componenttrace or groups, used for partitioning',\n insert_date STRING COMMENT 'Date when data is inserted in Hive table, used for partitioning',\n kafka_offset BIGINT COMMENT 'offset from source Kafka topic',\n kafka_topic STRING COMMENT 'source Kafka topic name',\n key STRING COMMENT 'key from source Kafka topic',\n value STRING COMMENT 'message from source Kafka topic',\n kafka_partition BIGINT COMMENT 'partition from source Kafka topic',\n kafka_timestamp TIMESTAMP COMMENT 'timestamp from source Kafka topic',\n kafka_timestampType BIGINT COMMENT 'timestampType from source Kafka topic')\nUSING iceberg\nPARTITIONED BY (source_type, insert_date)\nCOMMENT 'MES Data Publisher - Storing raw messages of partdetails,componenttrace and groups, partitioned by column source_type and insert_date'\nLOCATION '/proc/mes_qdata/db/mesdp_archive'\nTBLPROPERTIES (\n 'current-snapshot-id' = '8590217566145417146',\n 'format' = 'iceberg/parquet',\n 'format-version' = '1')\n|
```
Then I created another temp table, `mesdp_archive_test`, on the same path:
```
+--------------------------------------------------------------------+
|createtab_stmt                                                      |
+--------------------------------------------------------------------+
|CREATE TABLE spark_catalog.proc_mes_qdata.mesdp_archive_test (\n source_type STRING COMMENT 'type of source data: partdetails or componenttrace or groups, used for partitioning',\n insert_date STRING COMMENT 'Date when data is inserted in Hive table, used for partitioning',\n kafka_offset BIGINT COMMENT 'offset from source Kafka topic',\n kafka_topic STRING COMMENT 'source Kafka topic name',\n key STRING COMMENT 'key from source Kafka topic',\n value STRING COMMENT 'message from source Kafka topic',\n kafka_partition BIGINT COMMENT 'partition from source Kafka topic',\n kafka_timestamp TIMESTAMP COMMENT 'timestamp from source Kafka topic',\n kafka_timestampType BIGINT COMMENT 'timestampType from source Kafka topic')\nUSING iceberg\nPARTITIONED BY (source_type, insert_date)\nCOMMENT 'MES Data Publisher - Storing raw messages of partdetails,componenttrace and groups, partitioned by column source_type and insert_date'\nLOCATION '/proc/mes_qdata/db/mesdp_archive'\nTBLPROPERTIES (\n 'current-snapshot-id' = '8590217566145417146',\n 'format' = 'iceberg/parquet',\n 'format-version' = '1')\n|
```
And ran the commands below:
```
scala> spark.sql("""CALL aeanpprod.system.add_files(table => 'proc_mes_qdata.mesdp_archive_test', source_table => '`parquet`.`/proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/`')""")
```
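As I understand it, the `add_files` procedure only registers the existing Parquet files in the new table's metadata; it does not copy them. So after this call both tables reference the same physical files. A sketch of how this could be confirmed, using Iceberg's `files` metadata table (table names taken from this report, not verified on CDP 7.1.9):

```sql
-- List the physical data files registered in each table's current snapshot.
-- If both queries return the same paths, the tables share the same files,
-- and dropping either table with purge semantics deletes the other's data.
SELECT file_path FROM proc_mes_qdata.mesdp_archive.files;
SELECT file_path FROM proc_mes_qdata.mesdp_archive_test.files;
```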
```
scala> spark.sql("select max(insert_date) from proc_mes_qdata.mesdp_archive_test").show(false)
+----------------+
|max(insert_date)|
+----------------+
|2024-07-24      |
+----------------+

scala> spark.sql("select min(insert_date) from proc_mes_qdata.mesdp_archive_test").show(false)
+----------------+
|min(insert_date)|
+----------------+
|2023-10-20      |
+----------------+
```
Before executing the drop command for the temp table, I had data in the path:
```
[t_mes_qdata_proc@an0vm004 ~]$ hdfs dfs -ls /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace
Found 269 items
/proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-19
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-21 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-20
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-22 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-21
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-23 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-22
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-24 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-23
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-25 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-24
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-26 05:07 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-25
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-27 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-26
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-02-28 00:50 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-02-27
..............
```
Then I executed the drop command for the temp table:
```
scala> spark.sql("drop table proc_mes_qdata.mesdp_archive_test").show(false)
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.cluster.delegation.token.renew-interval does not exist
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.metastore.runworker.in does not exist
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.cluster.delegation.key.update-interval does not exist
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.cluster.delegation.token.max-lifetime does not exist
24/07/24 10:54:16 WARN conf.HiveConf: [main]: HiveConf of name hive.cluster.delegation.token.gc-interval does not exist
++
||
++
++
```
My data was deleted from the HDFS paths below. The partition directories were kept, but no Parquet files remain inside them:
```
[t_mes_qdata_proc@an0vm004 ~]$ hdfs dfs -ls /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace
Found 269 items
/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-14
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-15
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-16
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-17
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-18
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-19
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-22
drwxrwx---+  - t_mes_qdata_proc hive  0 2024-07-24 10:54 /proc/mes_qdata/db/mesdp_archive/data/source_type=componenttrace/insert_date=2024-07-23
..................
```
According to the Apache Iceberg documentation, this should not happen:
https://iceberg.apache.org/docs/latest/spark-ddl/#drop-table
We did not drop the main table; only the temp table created in this spark-shell session was dropped.
I also checked the Hive trash path (`hdfs dfs -ls /user/hive/.Trash`) and could not find any trace of my table.
There were also no logs or traces of my temp table in HiveServer. Screenshot attached.
Expectations:
I need to know why this data was deleted, and whether there is any trash location for Iceberg tables where dropped data can be retained temporarily.
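In case it helps others avoid this: a possible safeguard, sketched from my reading of the Iceberg docs and not yet verified on CDP 7.1.9, is to mark throwaway tables so a drop cannot physically delete shared files:

```sql
-- Iceberg's gc.enabled table property disables operations that physically
-- delete files, which should prevent a purging drop from removing data.
ALTER TABLE proc_mes_qdata.mesdp_archive_test
SET TBLPROPERTIES ('gc.enabled' = 'false');

-- With Spark, a plain DROP TABLE is documented to leave data files in
-- place; only DROP TABLE ... PURGE requests deletion of the files.
DROP TABLE proc_mes_qdata.mesdp_archive_test;
```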
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [X] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]