Hi, I am writing to ask for advice regarding the cleanSource option of the DataStreamReader. I am using PySpark with Spark 3.1 via Azure Synapse. To my knowledge, the cleanSource option was introduced in Spark 3.0. I have spent a significant amount of time trying to configure this option with both the "archive" and "delete" values, but the stream only seems to process data from the source data lake storage account container and write the results to the sink data lake storage account container. The archive folder is never created, and none of the processed source files are removed. Neither the forums nor Stack Overflow have been of any help so far, so I am reaching out in case you have any tips on how to get it working. Here is my code:
Reading:

    df = (spark.readStream
          .option("sourceArchiveDir",
                  f"abfss://{TRANSIENT_DATA_LAKE_CONTAINER_NAME}@{DATA_LAKE_ACCOUNT_NAME}.dfs.core.windows.net/budget-app/budgetOutput/archived-v5")
          .option("cleanSource", "archive")
          .format("json")
          .schema(schema)
          .load(TRANSIENT_DATA_LAKE_PATH))

...processing...

Writing:

    (df.writeStream
     .format("delta")
     .outputMode("append")
     .option("checkpointLocation", RAW_DATA_LAKE_CHECKPOINT_PATH)
     .trigger(once=True)
     .partitionBy("Year", "Month", "clientId")
     .start(RAW_DATA_LAKE_PATH)
     .awaitTermination())

Thank you very much for your help,
Gabriela

_____________________________________
Med venlig hilsen / Best regards

Gabriela Dvořáková
Developer | monthio
M: +421902480757
E: gabri...@monthio.com
W: www.monthio.com

Monthio Aps, Ragnagade 7, 2100 Copenhagen

Create personal wealth and healthy economy for people by changing the ways of banking