Hi everyone,

I am using Spark's FileStreamSink to write files to S3. On the S3
bucket, I have a lifecycle policy that deletes data older than X days so
that the bucket does not grow indefinitely. My problem starts with Spark
jobs that receive data infrequently: when no new batches are created, no
new checkpoints are written to the output path and the _spark_metadata
log is not updated, so the lifecycle policy eventually deletes it, which
causes the job to fail.
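For context, here is a minimal sketch of the kind of job I mean (the bucket
names, paths, and the rate source are placeholders, not my actual pipeline):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    object S3SinkExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("file-stream-sink-to-s3")
          .getOrCreate()

        // Any streaming source works here; rate is just a self-contained example.
        val events = spark.readStream
          .format("rate")
          .option("rowsPerSecond", 1)
          .load()

        // For file formats such as parquet, FileStreamSink is used under the hood.
        // It writes its commit log to s3a://my-bucket/output/_spark_metadata,
        // right next to the data files, so a bucket-wide lifecycle rule will
        // eventually expire those log files too if no new batch is committed
        // for long enough.
        val query = events.writeStream
          .format("parquet")
          .option("path", "s3a://my-bucket/output/")
          .option("checkpointLocation", "s3a://my-bucket/checkpoints/rate-demo/")
          .trigger(Trigger.ProcessingTime("1 minute"))
          .start()

        query.awaitTermination()
      }
    }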

As far as I can tell from reading the code and from StackOverflow
answers, the _spark_metadata path is hardcoded to the base path of the
output directory created by the DataStreamWriter, which means I cannot
store this metadata under a separate prefix that is not covered by the
lifecycle policy rule.
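
This is roughly how I understand the path resolution from reading
org.apache.spark.sql.execution.streaming.FileStreamSink (a paraphrased
sketch, not a verbatim copy of the Spark source; the object name is mine):

    import org.apache.hadoop.fs.Path

    object FileStreamSinkSketch {
      // The metadata directory name is a fixed constant ...
      val metadataDir = "_spark_metadata"

      // ... and it is resolved directly under the sink's output path; I don't
      // see an option on DataStreamWriter to point it at a different prefix.
      def metadataPath(outputPath: String): Path =
        new Path(new Path(outputPath), metadataDir)
    }

So anything that expires objects under the output prefix will eventually
hit these log files as well.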

Has anyone run into a similar problem?



-- 
Best Regards,
Yuval Itzchakov.