[ANNOUNCE] Apache Spark 3.2.4 released

2023-04-13 Thread Dongjoon Hyun
We are happy to announce the availability of Apache Spark 3.2.4! Spark 3.2.4 is a maintenance release containing stability fixes. This release is based on the branch-3.2 maintenance branch of Spark. We strongly recommend that all 3.2 users upgrade to this stable release. To download Spark 3.2.4,

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-13 Thread Trường Trần Phan An
Hi, Can you give me more details or point me to a tutorial on "You'd have to intercept execution events and correlate them. Not an easy task yet doable"? Thanks. On Wed, Apr 12, 2023 at 21:04, Jacek Laskowski wrote: > Hi, > > tl;dr it's not possible to "reverse-engineer" tasks to
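[Editorial note: a minimal sketch of what "intercepting execution events and correlating them" could look like with a custom SparkListener. The class name, log messages, and the demo query are illustrative, not part of the original thread; the correlation is simply done through the job/stage/task IDs Spark reports in its scheduler events.]

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession

// Illustrative listener: records which stages belong to which job and
// which tasks ran in which stage, using the IDs in the scheduler events.
class StageTaskListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} runs stages ${jobStart.stageIds.mkString(", ")}")

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
    println(s"Task ${taskEnd.taskInfo.taskId} finished in stage ${taskEnd.stageId} " +
      s"(${taskEnd.taskType}, ${taskEnd.taskInfo.duration} ms)")

  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
    println(s"Stage ${stage.stageInfo.stageId}: ${stage.stageInfo.name} " +
      s"with ${stage.stageInfo.numTasks} tasks")
}

object ListenerDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("listener-demo").master("local[*]").getOrCreate()
    spark.sparkContext.addSparkListener(new StageTaskListener)
    // Any query works; this one just produces a couple of stages to observe.
    spark.range(0, 1000000).selectExpr("id % 10 as k").groupBy("k").count().collect()
    spark.stop()
  }
}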

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
Not sure I follow. If my output is my/path/output, then the Spark metadata will be written to my/path/output/_spark_metadata. All my data will also be stored under my/path/output, so there's no way to split it? On Thu, Apr 13, 2023 at 1:14 PM "Yuri Oleynikov (יורי אולייניקוב)" <
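[Editorial note: for context, a minimal file-sink sketch showing why the data files and the sink's commit log share the same prefix. The bucket, paths, and source are made up; the point is only that _spark_metadata is created under the sink's output path, next to the part- files.]

import org.apache.spark.sql.SparkSession

// Sketch: with the file sink, Spark writes both the data files and the
// _spark_metadata commit log under the same output path, e.g.
//   s3a://my-bucket/my/path/output/part-00000-....parquet
//   s3a://my-bucket/my/path/output/_spark_metadata/0, 1, 2, ...
object FileSinkLayout {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("file-sink-layout").getOrCreate()
    import spark.implicits._

    val input = spark.readStream
      .format("rate")                 // built-in test source, one row per second
      .load()
      .select($"timestamp", $"value")

    val query = input.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/my/path/output")            // data + _spark_metadata live here
      .option("checkpointLocation", "s3a://my-bucket/my/path/chk") // separate from the sink's metadata
      .start()

    query.awaitTermination()
  }
}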

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Yeah, but can't you use the following? 1. For data files: my/path/part- 2. For partitioned data: my/path/partition= Best regards. On 13 Apr 2023, at 12:58, Yuval Itzchakov wrote: The problem is that when specifying two lifecycle policies for the same path, the one with the shorter retention wins
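[Editorial note: a sketch of what prefix-scoped expiration rules along these lines could look like with the AWS SDK for Java v2, called from Scala. The bucket name, prefixes, and retention periods are made up, and this only helps when the part- files and partition= directories sit directly under my/path, so that neither prefix matches my/path/_spark_metadata.]

import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model._

// Sketch only: two expiration rules scoped by prefix so that data files
// expire while my/path/_spark_metadata/ is not matched by either rule.
object LifecycleRules {
  def main(args: Array[String]): Unit = {
    val s3 = S3Client.create()

    def expireAfter(id: String, prefix: String, days: Int): LifecycleRule =
      LifecycleRule.builder()
        .id(id)
        .filter(LifecycleRuleFilter.builder().prefix(prefix).build())
        .expiration(LifecycleExpiration.builder().days(days).build())
        .status(ExpirationStatus.ENABLED)
        .build()

    s3.putBucketLifecycleConfiguration(
      PutBucketLifecycleConfigurationRequest.builder()
        .bucket("my-bucket")
        .lifecycleConfiguration(
          BucketLifecycleConfiguration.builder()
            .rules(
              expireAfter("expire-plain-data", "my/path/part-", 30),       // non-partitioned output files
              expireAfter("expire-partitioned-data", "my/path/partition=", 30)) // partitioned output files
            .build())
        .build())
  }
}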

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
The problem is that when specifying two lifecycle policies for the same path, the one with the shorter retention wins :( https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex4 "You might specify an S3 Lifecycle configuration in

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
My naïve assumption is that specifying a lifecycle policy for _spark_metadata with a longer retention will solve the issue. Best regards > On 13 Apr 2023, at 11:52, Yuval Itzchakov wrote: > >  > Hi everyone, > > I am using Spark's FileStreamSink in order to write files to S3. On the S3 > bucket, I

_spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
Hi everyone, I am using Spark's FileStreamSink in order to write files to S3. On the S3 bucket, I have a lifecycle policy that deletes data older than X days so that the bucket does not grow indefinitely. My problem starts with Spark jobs that don't have frequent data. What will