Hi,

Could you please set the config "spark.sql.streaming.fileSource.cleaner.numThreads" to 0 and see whether it works? (NOTE: this will slow down your query, since the cleaning phase will then run in the foreground. By default it runs in the background with 1 thread. You can also try more than 1 thread.)

If that doesn't help, please turn on the DEBUG log level for the package "org.apache.spark.sql.execution.streaming" and grep the log messages from SourceFileArchiver & SourceFileRemover.
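
A minimal sketch of both suggestions in Python, assuming an existing SparkSession is passed in as `spark`. The helper names (`use_foreground_cleaner`, `cleaner_log_lines`) are made up for illustration and are not part of Spark. Setting the value before the streaming query starts is an assumption made to be safe.

```python
# Config key and class names are taken from the message above.
CLEANER_THREADS_KEY = "spark.sql.streaming.fileSource.cleaner.numThreads"
CLEANER_CLASSES = ("SourceFileArchiver", "SourceFileRemover")


def use_foreground_cleaner(spark, num_threads=0):
    """Set the cleaner thread count on a session (hypothetical helper).

    With 0 the cleanup runs in the foreground, which is slower but easier
    to observe. Assumption: call this before starting the streaming query.
    """
    spark.conf.set(CLEANER_THREADS_KEY, str(num_threads))


def cleaner_log_lines(lines):
    """Keep only the log lines that mention one of the cleaner classes.

    This is a plain-Python equivalent of grepping the log for the two
    class names (hypothetical helper).
    """
    return [
        line for line in lines
        if any(name in line for name in CLEANER_CLASSES)
    ]
```

With a live session this would be called as `use_foreground_cleaner(spark)` before `readStream`, and `cleaner_log_lines` could be applied to the lines of whatever driver log file your environment produces once DEBUG logging is enabled.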
Thanks,
Jungtaek Lim (HeartSaVioR)

On Thu, Jan 27, 2022 at 9:56 PM Gabriela Dvořáková <gabri...@monthio.com.invalid> wrote:

> Hi,
>
> I am writing to ask for advice regarding the cleanSource option of the
> DataStreamReader. I am using pyspark with Spark 3.1. via Azure Synapse. To
> my knowledge, cleanSource option was introduced in Spark version 3. I'd
> spent a significant amount of time trying to configure this option with
> both "archive" and "delete" options, but the streaming seems to only
> process data in the source data lake storage account container, and store
> them in the sink storage account data lake container. The archive folder is
> never created nor any of the source processed files are removed. None of
> the forums or stackoverflow have been of any help so far, so I am reaching
> out to you if you perhaps have any tips on how to get it running? Here is
> my code:
>
> Reading:
> df = (spark
>     .readStream
>     .option("sourceArchiveDir", f'abfss://{TRANSIENT_DATA_LAKE_CONTAINER_NAME}@{DATA_LAKE_ACCOUNT_NAME}.dfs.core.windows.net/budget-app/budgetOutput/archived-v5')
>     .option("cleanSource", "archive")
>     .format('json')
>     .schema(schema)
>     .load(TRANSIENT_DATA_LAKE_PATH))
>
> ...Processing...
>
> Writing:
> (
>     df.writeStream
>     .format("delta")
>     .outputMode('append')
>     .option("checkpointLocation", RAW_DATA_LAKE_CHECKPOINT_PATH)
>     .trigger(once=True)
>     .partitionBy("Year", "Month", "clientId")
>     .start(RAW_DATA_LAKE_PATH)
>     .awaitTermination()
> )
>
> Thank you very much for help,
> Gabriela
>
> _____________________________________
>
> Med venlig hilsen / Best regards
>
> Gabriela Dvořáková
>
> Developer | monthio
>
> M: +421902480757
>
> E: gabri...@monthio.com
>
> W: www.monthio.com
>
> Monthio Aps, Ragnagade 7, 2100 Copenhagen
>
> Create personal wealth and healthy economy
> for people by changing the ways of banking"