greg-roberts-bbc commented on issue #7226:
URL: https://github.com/apache/iceberg/issues/7226#issuecomment-2012098543
We've found a workaround for our use case (Iceberg 1.4.3, Spark 3.3.0 on
Glue 4.0).
Our previous flow was:
```
# set up read stream
read_stream = (
    spark.readStream
    <setup read stream>
    .load()
)

# dataframe operations
df = read_stream.select(
    <various dataframe operations>
)

# set up write stream
write_stream = (
    df.writeStream.format("iceberg")
    .outputMode("append")
    .trigger(processingTime=job_args["TRIGGER_PROCESSING_TIME"])
    .options(**{
        "fanout-enabled": job_args["FANOUT_ENABLED"],
        "checkpointLocation": job_args["CHECKPOINT_LOCATION"],
    })
    .toTable(TABLE)
)
```
which always failed on the insert with the error described above.
Our new flow uses `foreachBatch` with a `process_batch` function:
```
def process_batch(df, batch_id):
    # dataframe operations
    df = df.select(
        <various dataframe operations>
    )
    df.writeTo(TABLE).append()

# note: the Spark-native DataStreamWriter method is foreachBatch (lowercase "f")
read_stream.writeStream.foreachBatch(process_batch).start()
```
The above is for completeness; we're actually using Glue's inbuilt
[`GlueContext.forEachBatch`](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-forEachBatch),
which [does exactly the same
thing](https://github.com/awslabs/aws-glue-libs/blob/master/awsglue/context.py#L602) under the hood (see the sketch below).
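For anyone else on Glue, this is roughly how the wrapper is invoked; the `glueContext` variable and the mapping of our `job_args` onto `windowSize`/`checkpointLocation` are illustrative, with parameter names taken from the linked docs:
```
glueContext.forEachBatch(
    frame=read_stream,
    batch_function=process_batch,
    options={
        # Glue uses windowSize as the processing-time trigger interval
        "windowSize": job_args["TRIGGER_PROCESSING_TIME"],
        "checkpointLocation": job_args["CHECKPOINT_LOCATION"],
    },
)
```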
With this flow the job no longer fails: we're able to write to the table with
partition transforms (we're using `hour()` to partition our data).
Interestingly, the data is now being written to S3 as you'd expect for the
S3FileIO implementation (i.e. writes are prefixed with a [random
string](https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout)).
It would be nice to use the inbuilt streaming writes as described [in the
docs](https://iceberg.apache.org/docs/latest/spark-structured-streaming/#streaming-writes),
but we're happy with a working solution, and the batch approach also lets us
add MERGE behaviour with SQL (sketched below).
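As an illustration (not our exact code), the micro-batch can be registered as a temp view inside `process_batch` and merged with Iceberg's `MERGE INTO`, assuming the Iceberg SQL extensions are enabled; the `id` key column below is a placeholder:
```
def process_batch(df, batch_id):
    df.createOrReplaceTempView("updates")
    # "id" is a placeholder; use the table's real key column(s)
    df.sparkSession.sql(f"""
        MERGE INTO {TABLE} t
        USING updates s
        ON t.id = s.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)
```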
Hope someone else finds this useful!