Not sure I follow. If my output path is my/path/output, then the Spark metadata will be written to my/path/output/_spark_metadata. All my data will also be stored under my/path/output, so there's no way to split it?
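For reference, a sketch of the two overlapping rules being discussed, in the shape accepted by S3's PutBucketLifecycleConfiguration API (e.g. via boto3). The prefixes follow the thread's example path; the day counts are illustrative:

```python
# Sketch of the overlapping lifecycle rules discussed in this thread.
# Retention periods are illustrative; the structure follows the
# S3 PutBucketLifecycleConfiguration API.
lifecycle_configuration = {
    "Rules": [
        {
            # Shorter retention on the whole output prefix...
            "ID": "expire-old-output",
            "Filter": {"Prefix": "my/path/output/"},
            "Status": "Enabled",
            "Expiration": {"Days": 7},
        },
        {
            # ...and a longer retention intended to protect _spark_metadata.
            # Per the AWS docs quoted in this thread, when expiration rules
            # overlap the SHORTER one wins, so this rule does NOT actually
            # protect the metadata -- which is the problem described here.
            "ID": "keep-spark-metadata-longer",
            "Filter": {"Prefix": "my/path/output/_spark_metadata/"},
            "Status": "Enabled",
            "Expiration": {"Days": 365},
        },
    ],
}
```

As suggested downthread, scoping the expiration rule to a data-file prefix such as my/path/output/part- instead of the whole output path would avoid the overlap entirely, since nothing under _spark_metadata/ would ever match it.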
On Thu, Apr 13, 2023 at 1:14 PM "Yuri Oleynikov (יורי אולייניקוב)" <yur...@gmail.com> wrote:

> Yeah, but can't you use the following?
> 1. For data files: my/path/part-
> 2. For partitioned data: my/path/partition=
>
> Best regards
>
> On 13 Apr 2023, at 12:58, Yuval Itzchakov <yuva...@gmail.com> wrote:
>
>> The problem is that when specifying two lifecycle policies for the same path, the one with the shorter retention wins :(
>>
>> https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex4
>>
>> "You might specify an S3 Lifecycle configuration in which you specify overlapping prefixes, or actions.
>>
>> Generally, S3 Lifecycle optimizes for cost. For example, if two expiration policies overlap, the shorter expiration policy is honored so that data is not stored for longer than expected. Likewise, if two transition policies overlap, S3 Lifecycle transitions your objects to the lower-cost storage class."
>>
>> On Thu, Apr 13, 2023, 12:29 "Yuri Oleynikov (יורי אולייניקוב)" <yur...@gmail.com> wrote:
>>
>>> My naïve assumption was that specifying a lifecycle policy for _spark_metadata with a longer retention would solve the issue.
>>>
>>> Best regards
>>>
>>> On 13 Apr 2023, at 11:52, Yuval Itzchakov <yuva...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I am using Spark's FileStreamSink in order to write files to S3. On the S3 bucket, I have a lifecycle policy that deletes data older than X days from the bucket so that it does not grow indefinitely. My problem starts with Spark jobs that don't receive data frequently. What happens in this case is that no new batches are created, which in turn means no new checkpoints are written to the output path and no overwrites of the _spark_metadata file are performed, eventually causing the lifecycle policy to delete the file, which makes the job fail.
>>>> As far as I can tell from reading the code and looking at StackOverflow answers, the _spark_metadata path is hardcoded to the base path of the output directory created by the DataStreamWriter, which means I cannot store this file under a separate prefix that is not covered by the lifecycle policy rule.
>>>>
>>>> Has anyone run into a similar problem?
>>>>
>>>> --
>>>> Best Regards,
>>>> Yuval Itzchakov.

--
Best Regards,
Yuval Itzchakov.
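To make the layout described in the thread concrete: a minimal sketch of where the sink's output lands relative to its configured path. The path is the thread's example; the data file name is a placeholder, not an actual Spark-generated name:

```python
# Sketch of the FileStreamSink output layout as described in this thread:
# both data files and the metadata log live under the one configured path,
# with no DataStreamWriter option to relocate the metadata elsewhere.
from posixpath import join

sink_path = "my/path/output"

# Data files land directly under the sink path ("part-<...>" is a placeholder):
data_file = join(sink_path, "part-00000-<uuid>.parquet")

# The commit log is always at <sink_path>/_spark_metadata:
metadata_dir = join(sink_path, "_spark_metadata")

print(metadata_dir)  # my/path/output/_spark_metadata
```

Because both live under the same prefix, any prefix-based lifecycle rule broad enough to cover the data files also covers the metadata log, unless the rule is narrowed to the data-file naming pattern as suggested above.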