Hi,

Could you please set the config
"spark.sql.streaming.fileSource.cleaner.numThreads"
to 0 and see whether it works? (NOTE: this will slow down your query, since
the cleaning phase will then run in the foreground. The default is to clean
in the background with 1 thread. You can also try more than 1 thread.)
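
A minimal sketch of setting this from PySpark, assuming an existing session
object named `spark` and assuming the value is picked up when the streaming
query is started (so it should be set before `readStream`). The helper name
`use_foreground_cleaner` is hypothetical, not part of Spark:

```python
CLEANER_THREADS_KEY = "spark.sql.streaming.fileSource.cleaner.numThreads"


def use_foreground_cleaner(spark, num_threads=0):
    """Set the file source cleaner thread count on an existing session.

    0 runs the cleaner in the foreground; 1 is the default background setting.
    `spark` is assumed to be a SparkSession (anything exposing conf.set/get).
    """
    # Runtime SQL configs are set as strings through spark.conf.
    spark.conf.set(CLEANER_THREADS_KEY, str(num_threads))
    # Read the value back so the caller can confirm what is in effect.
    return spark.conf.get(CLEANER_THREADS_KEY)
```

Calling `use_foreground_cleaner(spark)` before building the stream, then
re-running the query, is enough to try the suggestion above.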
If it doesn't help, please turn on the DEBUG log level for the package
"org.apache.spark.sql.execution.streaming"
and grep the log messages from SourceFileArchiver & SourceFileRemover.
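
Spark 3.1 uses log4j 1.x, where the level for a package is typically raised
with a `log4j.properties` line such as
`log4j.logger.org.apache.spark.sql.execution.streaming=DEBUG`; how that file
is supplied on Azure Synapse is not covered here. Once the driver log is
available as text, the "grep" step could look roughly like the sketch below.
The function name and the log file path are hypothetical:

```python
CLEANER_CLASSES = ("SourceFileArchiver", "SourceFileRemover")


def cleaner_log_lines(lines):
    """Return only the log lines that mention one of the cleaner classes."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(name in line for name in CLEANER_CLASSES)
    ]


if __name__ == "__main__":
    # "driver.log" is a placeholder; point this at the real driver log file.
    with open("driver.log", encoding="utf-8") as log_file:
        for match in cleaner_log_lines(log_file):
            print(match)
```

If this prints nothing even at DEBUG level, that in itself suggests the
cleaner was never invoked for the processed files.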

Thanks,
Jungtaek Lim (HeartSaVioR)

On Thu, Jan 27, 2022 at 9:56 PM Gabriela Dvořáková
<gabri...@monthio.com.invalid> wrote:

> Hi,
>
> I am writing to ask for advice regarding the cleanSource option of the
> DataStreamReader. I am using PySpark with Spark 3.1 via Azure Synapse. To
> my knowledge, the cleanSource option was introduced in Spark 3. I have
> spent a significant amount of time trying to configure this option with
> both the "archive" and "delete" values, but the stream only seems to
> process data from the source data lake storage account container and store
> it in the sink data lake storage account container. The archive folder is
> never created, and none of the processed source files are removed. None of
> the forums or Stack Overflow have been of any help so far, so I am reaching
> out to ask whether you have any tips on how to get it running. Here is
> my code:
>
> Reading:
> df = (
>     spark.readStream
>     .option(
>         "sourceArchiveDir",
>         f"abfss://{TRANSIENT_DATA_LAKE_CONTAINER_NAME}@{DATA_LAKE_ACCOUNT_NAME}"
>         ".dfs.core.windows.net/budget-app/budgetOutput/archived-v5",
>     )
>     .option("cleanSource", "archive")
>     .format("json")
>     .schema(schema)
>     .load(TRANSIENT_DATA_LAKE_PATH)
> )
>
> ...Processing...
>
> Writing:
> (
>     df.writeStream
>     .format("delta")
>     .outputMode("append")
>     .option("checkpointLocation", RAW_DATA_LAKE_CHECKPOINT_PATH)
>     .trigger(once=True)
>     .partitionBy("Year", "Month", "clientId")
>     .start(RAW_DATA_LAKE_PATH)
>     .awaitTermination()
> )
>
> Thank you very much for your help,
> Gabriela
>
> _____________________________________
>
> Med venlig hilsen / Best regards
>
> Gabriela Dvořáková
>
> Developer | monthio
>
> M: +421902480757
>
> E: gabri...@monthio.com
>
> W: www.monthio.com
>
> Monthio Aps, Ragnagade 7, 2100 Copenhagen
>
>
> "Create personal wealth and healthy economy
>
> for people by changing the ways of banking"
>
>
