Hi,

I am writing to ask for advice regarding the cleanSource option of the
DataStreamReader. I am using PySpark with Spark 3.1 via Azure Synapse. To
my knowledge, the cleanSource option was introduced in Spark 3.0. I've
spent a significant amount of time trying to configure it with both the
"archive" and "delete" settings, but the stream only ever reads the data
from the source data lake storage account container and writes it to the
sink data lake container: the archive folder is never created, and none
of the processed source files are removed. Neither the forums nor Stack
Overflow have been of any help so far, so I am reaching out in case you
have any tips on how to get it working. Here is my code:

Reading:
df = (spark
    .readStream
    # archive directory for processed files; required when cleanSource is "archive"
    .option("sourceArchiveDir",
            f'abfss://{TRANSIENT_DATA_LAKE_CONTAINER_NAME}@{DATA_LAKE_ACCOUNT_NAME}.dfs.core.windows.net/budget-app/budgetOutput/archived-v5')
    .option("cleanSource", "archive")
    .format('json')
    .schema(schema)
    .load(TRANSIENT_DATA_LAKE_PATH))
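
For completeness, the "delete" variant I tried differs only in the
cleanSource value (sourceArchiveDir should not be needed there). The
programming guide also mentions a cleaner thread pool; the config name
below is my reading of the file source options table, so please correct
me if I have it wrong:

# "delete" variant: remove completed source files instead of archiving them
df = (spark
    .readStream
    .option("cleanSource", "delete")
    .format('json')
    .schema(schema)
    .load(TRANSIENT_DATA_LAKE_PATH))

# cleaner thread pool size (default 1), per the file source options in the guide
spark.conf.set("spark.sql.streaming.fileSource.cleaner.numThreads", "2")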

...Processing...

Writing:
(
    df.writeStream
    .format("delta")
    .outputMode('append')
    .option("checkpointLocation", RAW_DATA_LAKE_CHECKPOINT_PATH)
    .trigger(once=True)  # run a single micro-batch and stop
    .partitionBy("Year", "Month", "clientId")
    .start(RAW_DATA_LAKE_PATH)
    .awaitTermination()
)
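
One thing I wonder about: the guide suggests completed files are cleaned
up by a separate cleaner thread, so could trigger(once=True) be stopping
the query before the asynchronous cleaner gets a chance to run? As a
sanity check, listing the source folder after the query finishes should
show whether anything was moved or deleted; a minimal sketch, assuming
the mssparkutils helper that ships with Synapse notebooks:

# inspect the source folder after the run to see whether any files
# were archived or deleted (mssparkutils is Synapse's file system helper)
from notebookutils import mssparkutils

for f in mssparkutils.fs.ls(TRANSIENT_DATA_LAKE_PATH):
    print(f.path)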

Thank you very much for your help,
Gabriela

_____________________________________

Med venlig hilsen / Best regards

Gabriela Dvořáková

Developer | monthio

M: +421902480757

E: gabri...@monthio.com

W: www.monthio.com

Monthio Aps, Ragnagade 7, 2100 Copenhagen


Create personal wealth and healthy economy
for people by changing the ways of banking
