amitmittal5 opened a new issue, #9172:
URL: https://github.com/apache/iceberg/issues/9172
### Apache Iceberg version
1.4.2 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
I have a Spark streaming job that reads data from ADLS Gen2 and writes it
to an Iceberg table, using the following steps:
**Step 1: Create the table**
```
CREATE TABLE IF NOT EXISTS default.blob_iceberg
(id string, state string, name string)
USING ICEBERG
LOCATION
'abfss://<container>@<storage-account>.dfs.core.windows.net/test/blob_iceberg'
```
**Step 2: Run a Spark streaming Scala job with the AvailableNow trigger**
(the behavior is the same with a `ProcessingTime("10 seconds")` trigger):
```
import org.apache.spark.sql.streaming.Trigger

val sourceFilePath = "abfss://<container>@<storage-account>.dfs.core.windows.net/test/source"
val checkpointPath = "abfss://<container>@<storage-account>.dfs.core.windows.net/test/blob_iceberg_checkpoint"

// Infer the schema once from a sample file; the streaming read needs an explicit schema.
val schema = spark.read.format("csv").option("header", "true").load(s"$sourceFilePath/Sample.txt").schema

// Stream CSV files from the source directory.
val sourceDF = spark.readStream
  .schema(schema)
  .format("csv")
  .option("header", "true")
  .option("sep", ",")
  .load(sourceFilePath)

// Append each micro-batch to the Iceberg table.
sourceDF.writeStream
  .format("iceberg")
  .outputMode("append")
  .trigger(Trigger.AvailableNow)
  .option("checkpointLocation", checkpointPath)
  .toTable("default.blob_iceberg")
```
On the first execution, Parquet files are created under the data directory
with names like
`00000-2-852c47ed-881c-4cac-8d9f-230da7873d05-00001.parquet`, where
"852c47ed-881c-4cac-8d9f-230da7873d05" is the Spark streaming id from the
checkpoint metadata file.
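To make the naming easier to reason about, here is a minimal sketch that splits such a file name into its parts. `DataFileName` and its field names (`partitionId`, `taskId`, `operationId`, `fileCount`) are hypothetical and only reflect my reading of the observed layout; the only part confirmed above is that the UUID matches the streaming id.

```scala
// Hypothetical helper, not part of Iceberg: splits an observed data file name
// into its dash-separated parts. Field names are assumptions about the layout.
final case class DataFileName(partitionId: Int, taskId: Long, operationId: String, fileCount: Int)

object DataFileName {
  // <5 digits>-<digits>-<36-char UUID>-<5 digits>.parquet
  private val Pattern = """(\d{5})-(\d+)-([0-9a-fA-F-]{36})-(\d{5})\.parquet""".r

  // Returns None when the name does not follow the observed layout.
  def parse(name: String): Option[DataFileName] = name match {
    case Pattern(p, t, op, c) => Some(DataFileName(p.toInt, t.toLong, op, c.toInt))
    case _                    => None
  }
}
```

If these are the only components of the name, then two runs that share the same streaming id and happen to repeat the other three values would produce the same file name, which would be consistent with the overwrite described in this report.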
When the same job is executed multiple times, with new data in the source
directory, the streaming job sometimes overwrites one or more existing Parquet
files. In this test, the file
/data/00000-2-852c47ed-881c-4cac-8d9f-230da7873d05-00001.parquet was
overwritten with new data, so the original 4 records were lost and the file
now contains only the 2 new records.
Here is a screenshot of the query
`select substring(file_path, 88), record_count from default.blobiceberg16.files`
This also leaves the Iceberg metadata and the actual data files out of sync.
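One way to confirm the mismatch is to compare the `record_count` reported by the `files` metadata table with the number of rows actually present in each data file. The sketch below covers only the comparison step; `FileCountCheck`, `CountMismatch`, and `findMismatches` are hypothetical names, and both maps (file path to row count) are assumed to have been collected beforehand.

```scala
// Hypothetical helper, not part of Iceberg or Spark: compares per-file row
// counts from table metadata against counts observed in the files themselves.
object FileCountCheck {
  // actualCount is None when the file listed in metadata was not found.
  final case class CountMismatch(filePath: String, metadataCount: Long, actualCount: Option[Long])

  def findMismatches(
      metadataCounts: Map[String, Long],
      actualCounts: Map[String, Long]): Seq[CountMismatch] =
    metadataCounts.toSeq.sortBy(_._1).flatMap { case (path, expected) =>
      actualCounts.get(path) match {
        case Some(actual) if actual == expected => None // counts agree
        case other                              => Some(CountMismatch(path, expected, other))
      }
    }
}
```

For the file in this report, the metadata side would hold 4 and the actual side 2, so the file would be returned as a mismatch. The metadata map could be filled from the `file_path` and `record_count` columns of the `files` table, and the actual map by reading each Parquet file and counting its rows.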
**Environment**:
Runtime: iceberg-spark-runtime-3.4_2.12-1.4.2