[I] Parquet file overwritten by spark streaming job in subsequent execution with same spark streaming checkpoint location [iceberg]

via GitHub Tue, 28 Nov 2023 10:55:37 -0800


amitmittal5 opened a new issue, #9172:
URL: https://github.com/apache/iceberg/issues/9172


   ### Apache Iceberg version
   
   1.4.2 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I have a Spark Structured Streaming job that reads data from ADLS Gen2 and writes it to an Iceberg table, using the following steps:
   
   **Step 1: Create the table**
   ```
   CREATE TABLE IF NOT EXISTS default.blob_iceberg 
   (id string, state string, name string) 
   USING ICEBERG 
   LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/test/blob_iceberg'
   ```
   
   **Step 2: Run the streaming job**
   
   Run a Spark streaming Scala job with the `AvailableNow` trigger (the same behavior occurs with the `ProcessingTime("10 seconds")` trigger):
   
   ```
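   import org.apache.spark.sql.streaming.Trigger
   
   // Source directory with the CSV files, and the streaming checkpoint location
   val sourceFilePath = "abfss://<container>@<storage-account>.dfs.core.windows.net/test/source"
   val checkpointPath = "abfss://<container>@<storage-account>.dfs.core.windows.net/test/blob_iceberg_checkpoint"
   
   // Take the schema (column names) from a sample file using a batch read
   val schema = spark.read.format("csv").option("header", "true").load(s"$sourceFilePath/Sample.txt").schema
   
   // Stream the CSV files from the source directory
   val sourceDF = spark.readStream
     .schema(schema)
     .format("csv")
     .option("header", "true")
     .option("sep", ",")
     .load(sourceFilePath)
   
   // Append to the Iceberg table; the same checkpoint location is reused on every execution
   sourceDF.writeStream
     .format("iceberg")
     .outputMode("append")
     .trigger(Trigger.AvailableNow())
     .option("checkpointLocation", checkpointPath)
     .toTable("default.blob_iceberg")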
   ```
   On the first execution, Parquet files are created under the table's data directory with names like `00000-2-852c47ed-881c-4cac-8d9f-230da7873d05-00001.parquet`, where `852c47ed-881c-4cac-8d9f-230da7873d05` is the Spark streaming id from the checkpoint metadata file.
   
   When the same job is executed again with new data in the source directory, it sometimes overwrites one or more existing Parquet files. In this test, the file `/data/00000-2-852c47ed-881c-4cac-8d9f-230da7873d05-00001.parquet` was overwritten with new data, so the original 4 records were lost and the file now contains only the 2 new records.
   The data files and their record counts were inspected with this query:
   `select substring(file_path, 88), record_count from default.blobiceberg16.files`
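
To illustrate why reusing the checkpoint can produce the same file name twice, here is a minimal Scala sketch that models the file name from this report. It is not Iceberg's implementation: `FileNameCollision`, `DataFileName`, and the field names `partitionId`, `taskId`, and `fileCount` are hypothetical, and only the UUID segment is confirmed by the report (it is the streaming id). The point is that if every segment of the name repeats across executions, the second execution writes to the same path as the first.

```scala
import java.util.UUID

object FileNameCollision {
  // Hypothetical model of "<5 digits>-<number>-<streaming id>-<5 digits>.parquet".
  // The meaning of the numeric segments is an assumption made for illustration.
  final case class DataFileName(partitionId: Int, taskId: Long, streamingId: UUID, fileCount: Int) {
    def render: String = f"$partitionId%05d-$taskId-$streamingId-$fileCount%05d.parquet"
  }

  private val Pattern =
    """(\d{5})-(\d+)-([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})-(\d{5})\.parquet""".r

  // Splits a file name back into its segments; returns None if the shape does not match.
  def parse(name: String): Option[DataFileName] = name match {
    case Pattern(p, t, id, c) => Some(DataFileName(p.toInt, t.toLong, UUID.fromString(id), c.toInt))
    case _                    => None
  }

  def main(args: Array[String]): Unit = {
    // The streaming id is read from the checkpoint, so it is identical in both executions.
    val streamingId = UUID.fromString("852c47ed-881c-4cac-8d9f-230da7873d05")
    val firstRun  = DataFileName(0, 2L, streamingId, 1).render
    val secondRun = DataFileName(0, 2L, streamingId, 1).render
    println(firstRun)
    println(s"names collide: ${firstRun == secondRun}")
  }
}
```

Under this model, nothing in the name changes between executions unless one of the numeric segments happens to differ, which would be consistent with the overwrite occurring only sometimes.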
   
   This also leaves the Iceberg metadata and the actual data files out of sync.
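
One way to confirm such a mismatch is to compare the record count stored in the table metadata with the number of rows actually present in each data file. The sketch below shows only the comparison step in plain Scala; `ConsistencyCheck`, `Mismatch`, and `findMismatches` are hypothetical names, and how the two inputs are obtained is left to the caller.

```scala
object ConsistencyCheck {
  // One data file whose metadata record count differs from the rows found in the file.
  final case class Mismatch(filePath: String, metadataCount: Long, actualCount: Long)

  // metadataCounts: file path -> record count according to the table metadata.
  // countRows: returns the number of rows actually stored in the file at that path.
  def findMismatches(metadataCounts: Map[String, Long], countRows: String => Long): Seq[Mismatch] =
    metadataCounts.toSeq.sortBy(_._1).flatMap { case (path, expected) =>
      val actual = countRows(path)
      if (actual != expected) Some(Mismatch(path, expected, actual)) else None
    }
}
```

In a Spark session, `metadataCounts` could be built from the `file_path` and `record_count` columns of the table's `files` metadata table, and `countRows` could be something like `path => spark.read.parquet(path).count()`. Both are assumptions about how the check would be wired up, not steps taken in this report.
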
   **Environment**: 
   Runtime: iceberg-spark-runtime-3.4_2.12-1.4.2
   



