Not sure I follow. If my output path is my/path/output, then the Spark metadata
will be written to my/path/output/_spark_metadata. All my data will also be
stored under my/path/output, so there's no way to split the two?
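
For reference, here's a rough, untested sketch of the per-prefix lifecycle rule
suggested in the reply below, assuming an unpartitioned job where the data
files sit directly under the output prefix as part-* objects next to
_spark_metadata/ (bucket name, prefix and retention are placeholders):

import boto3

s3 = boto3.client("s3")

# Note: this call replaces any existing lifecycle configuration on the bucket,
# so any other rules would need to be included here as well.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                # Expire only the data files (my/path/output/part-...),
                # leaving my/path/output/_spark_metadata/ untouched.
                "ID": "expire-streaming-data-files",
                "Filter": {"Prefix": "my/path/output/part-"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)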

On Thu, Apr 13, 2023 at 1:14 PM "Yuri Oleynikov (יורי אולייניקוב)" <
yur...@gmail.com> wrote:

> Yeah, but can't you use the following?
> 1. For data files: my/path/part-
> 2. For partitioned data: my/path/partition=
>
>
> Best regards
>
> On 13 Apr 2023, at 12:58, Yuval Itzchakov <yuva...@gmail.com> wrote:
>
>
> The problem is that when specifying two lifecycle policies for the same path,
> the one with the shorter retention wins :(
>
>
> https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex4
>
> "You might specify an S3 Lifecycle configuration in which you specify
> overlapping prefixes, or actions.
>
> Generally, S3 Lifecycle optimizes for cost. For example, if two expiration
> policies overlap, the shorter expiration policy is honored so that data is
> not stored for longer than expected. Likewise, if two transition policies
> overlap, S3 Lifecycle transitions your objects to the lower-cost storage
> class."
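>
> To illustrate (a hedged, untested sketch; bucket, prefixes and day counts are
> placeholders): with two overlapping rules like the ones below, objects under
> my/path/output/_spark_metadata/ match both prefixes, and per the docs above
> the shorter 7-day expiration is the one that gets applied, so giving the
> metadata prefix a longer retention doesn't help:
>
> import boto3
>
> boto3.client("s3").put_bucket_lifecycle_configuration(
>     Bucket="my-bucket",
>     LifecycleConfiguration={
>         "Rules": [
>             {
>                 # Broad rule over the whole output path: expire after 7 days.
>                 "ID": "expire-output",
>                 "Filter": {"Prefix": "my/path/output/"},
>                 "Status": "Enabled",
>                 "Expiration": {"Days": 7},
>             },
>             {
>                 # Overlapping rule for the metadata prefix: 365 days.
>                 # S3 still honors the shorter 7-day expiration here.
>                 "ID": "keep-spark-metadata-longer",
>                 "Filter": {"Prefix": "my/path/output/_spark_metadata/"},
>                 "Status": "Enabled",
>                 "Expiration": {"Days": 365},
>             },
>         ]
>     },
> )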
>
>
>
> On Thu, Apr 13, 2023, 12:29 "Yuri Oleynikov (יורי אולייניקוב)" <
> yur...@gmail.com> wrote:
>
>> My naïve assumption is that specifying a lifecycle policy for _spark_metadata
>> with a longer retention would solve the issue
>>
>> Best regards
>>
>> > On 13 Apr 2023, at 11:52, Yuval Itzchakov <yuva...@gmail.com> wrote:
>> >
>> >
>> > Hi everyone,
>> >
>> > I am using Spark's FileStreamSink in order to write files to S3. On the
>> > S3 bucket, I have a lifecycle policy that deletes data older than X days
>> > from the bucket so that it doesn't grow infinitely. My problem starts with
>> > Spark jobs that don't have frequent data. What happens in this case is
>> > that new batches are not created, which in turn means no new checkpoints
>> > are written to the output path and no overwrites of the _spark_metadata
>> > file are performed, thus eventually causing the lifecycle policy to delete
>> > the file, which causes the job to fail.
>> >
>> > As far as I can tell from reading the code and looking at StackOverflow
>> > answers, the _spark_metadata file path is hardcoded to the base path of
>> > the output directory created by the DataStreamWriter, which means I cannot
>> > store this file in a separate prefix that is not covered by the lifecycle
>> > policy rule.
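>> >
>> > For context, a minimal sketch of this kind of setup (paths are
>> > placeholders, not the exact job):
>> >
>> > from pyspark.sql import SparkSession
>> >
>> > spark = SparkSession.builder.appName("sketch").getOrCreate()
>> >
>> > # Any streaming source works for illustration; "rate" just generates rows.
>> > df = spark.readStream.format("rate").option("rowsPerSecond", 1).load()
>> >
>> > query = (
>> >     df.writeStream
>> >       .format("parquet")
>> >       .option("checkpointLocation", "s3a://my-bucket/checkpoints/job")
>> >       .option("path", "s3a://my-bucket/my/path/output")
>> >       .start()
>> > )
>> > # The file sink writes part-* data files and the _spark_metadata/ log
>> > # under .../my/path/output; as noted above, the metadata location is
>> > # fixed relative to the output path, with no option to move it.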
>> >
>> > Has anyone run into a similar problem?
>> >
>> >
>> >
>> > --
>> > Best Regards,
>> > Yuval Itzchakov.
>>
>

-- 
Best Regards,
Yuval Itzchakov.
