Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
Not sure I follow. If my output is my/path/output then the _spark_metadata will be written to my/path/output/_spark_metadata. All my data will also be stored under my/path/output, so there's no way to split it?
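For concreteness, a quick boto3 sketch of what a listing of the sink's output looks like (bucket name and prefix are made-up placeholders), showing why a single prefix filter on the output path catches the data files and the commit log alike:

    import boto3

    s3 = boto3.client("s3")

    # Everything the file sink wrote lives under the one output prefix
    # (bucket and prefix here are hypothetical).
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="my/path/output/")
    for obj in resp.get("Contents", []):
        print(obj["Key"])

    # Typical shape of the listing: the commit log shares the prefix
    # with the data files, e.g.
    #   my/path/output/_spark_metadata/0
    #   my/path/output/_spark_metadata/1
    #   my/path/output/part-00000-<uuid>.snappy.parquet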

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Yeah, but can't you use the following?
1. For data files: my/path/part-
2. For partitioned data: my/path/partition=

Best regards
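A sketch of that idea with boto3 (bucket name and retention days are assumptions, and "partition=" stands in for the real partition column name). S3 lifecycle filters are plain string prefixes, so "part-" matches unpartitioned output files and "partition=" matches every partition directory, while _spark_metadata never matches either rule:

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-data-files",
                    "Filter": {"Prefix": "my/path/output/part-"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 30},
                },
                {
                    # Prefix matching is string-based, so "partition="
                    # covers keys like partition=2023-04-13/part-...
                    "ID": "expire-partitioned-data",
                    "Filter": {"Prefix": "my/path/output/partition="},
                    "Status": "Enabled",
                    "Expiration": {"Days": 30},
                },
            ]
        },
    )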

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
The problem is that when you specify two lifecycle policies for the same path, the one with the shorter retention wins :( https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex4 "You might specify an S3 Lifecycle configuration in which

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
My naïve assumption is that specifying a lifecycle policy for _spark_metadata with a longer retention will solve the issue.

Best regards
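Spelled out, that assumption amounts to a configuration like this boto3 sketch (bucket name and day counts are made up):

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    # Short retention for the whole output path.
                    "ID": "expire-output",
                    "Filter": {"Prefix": "my/path/output/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 30},
                },
                {
                    # Longer retention intended for the commit log.
                    "ID": "keep-spark-metadata-longer",
                    "Filter": {"Prefix": "my/path/output/_spark_metadata/"},
                    "Status": "Enabled",
                    "Expiration": {"Days": 365},
                },
            ]
        },
    )

    # As the reply above points out, keys under _spark_metadata/ match
    # both rules, and S3 applies the shorter expiration, so the commit
    # log is still deleted after 30 days.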

_spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
Hi everyone, I am using Spark's FileStreamSink in order to write files to S3. On the S3 bucket, I have a lifecycle policy that deletes data older than X days from the bucket so that it doesn't grow indefinitely. My problem starts with Spark jobs that don't receive frequent data. What will happe
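For context, a minimal PySpark sketch of the setup being described (the source, bucket, and paths are invented placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-file-sink").getOrCreate()

    # Placeholder source; the real job would read from Kafka, files, etc.
    events = spark.readStream.format("rate").load()

    # Writing with the file sink: besides the part-* data files, Spark
    # keeps a commit log under <path>/_spark_metadata recording which
    # files belong to completed batches.
    query = (
        events.writeStream
        .format("parquet")
        .option("path", "s3a://my-bucket/my/path/output")
        .option("checkpointLocation", "s3a://my-bucket/my/path/checkpoint")
        .start()
    )
    query.awaitTermination()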