Re: Apache Spark 4.0.1 ?

2025-08-25 Thread Yang Jie
+1, thank you Dongjoon. Thanks, Jie Yang On 2025/08/26 02:52:35 Kent Yao wrote: > +1, thank you Dongjoon > > On Tue, Aug 26, 2025 at 10:15, Cheng Pan wrote: > > > +1, thank you for driving this. > > > > Thanks, > > Cheng Pan > > > > > > > > On Aug 26, 2025, at 00:31, Dongjoon Hyun wrote: > > > > Hi, All. > >

Re: Apache Spark 4.0.1 ?

2025-08-25 Thread Kent Yao
+1, thank you Dongjoon. On Tue, Aug 26, 2025 at 10:15, Cheng Pan wrote: > +1, thank you for driving this. > > Thanks, > Cheng Pan > > > > On Aug 26, 2025, at 00:31, Dongjoon Hyun wrote: > > Hi, All. > > Since the Apache Spark 4.0.0 tag was created in May, more than three > months have passed. > > https://g

Re: Apache Spark 4.0.1 ?

2025-08-25 Thread Cheng Pan
+1, thank you for driving this. Thanks, Cheng Pan > On Aug 26, 2025, at 00:31, Dongjoon Hyun wrote: > > Hi, All. > > Since the Apache Spark 4.0.0 tag was created in May, more than three months > have passed. > > https://github.com/apache/spark/releases/tag/v4.0.0 (2025-05-19) > > So f

Re: [Structured Streaming] SST file does not exist. Race condition corrupting state store

2025-08-25 Thread Mich Talebzadeh
Hi Pedro, Glad it helped. A couple of quick hints while you implement: 1) Configurable padding + N manifests - Add two knobs (defaults shown): - stateStore.rocksdb.gc.paddingMs = 12 (HDFS: 60–120s; S3/GCS: 120–300s) - stateStore.rocksdb.gc.protectedVersions = 3 (union o

Re: Apache Spark 4.0.1 ?

2025-08-25 Thread Bjørn Jørgensen
+1 Thank you, @Dongjoon Hyun. On Mon, Aug 25, 2025 at 18:32, Dongjoon Hyun wrote: > Hi, All. > > Since the Apache Spark 4.0.0 tag was created in May, more than three > months have passed. > > https://github.com/apache/spark/releases/tag/v4.0.0 (2025-05-19) > > So far, 124 commits (mostly bug f

Re: [Structured Streaming] SST file does not exist. Race condition corrupting state store

2025-08-25 Thread Siying Dong
Thanks. I think a relatively simple fix can be to include the zip file's modification time in the filtering condition too. If the SST's modification timestamp is earlier than any version x's zip file modification time, it is kept. Thanks, Siying On Mon, Aug 25, 2025 at 11:29 AM Pedro Miguel Duar

Re: [Structured Streaming] SST file does not exist. Race condition corrupting state store

2025-08-25 Thread Pedro Miguel Duarte
Hi Siying, thanks for your reply. We currently run with "spark.speculation: false", so it is not speculative execution. This is because the partition gets assigned to two different executors on subsequent stages. In StateStore.scala, in the doMaintenance() function, provider.doMaintenance() is called

Re: [Structured Streaming] SST file does not exist. Race condition corrupting state store

2025-08-25 Thread Pedro Miguel Duarte
Thanks for your reply! Yes, this helps. I think adding a time padding will help prevent deleting files that are incorrectly labeled as orphaned in the current implementation. This only happens if two executors run maintenance at nearly the exact same time. I'll look into implementing a fix. On Mon

Re: [Structured Streaming] SST file does not exist. Race condition corrupting state store

2025-08-25 Thread Siying Dong
I suspect that this problem will be mitigated with checkpoint structure V2 ( https://issues.apache.org/jira/browse/SPARK-49374 https://github.com/apache/spark/blob/bc36a7db43f287af536bb2767d7d9f1d70bc799f/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2656 ). The motivatio

Re: [Structured Streaming] SST file does not exist. Race condition corrupting state store

2025-08-25 Thread Mich Talebzadeh
In your statement "Instead of simply older, should there be some padding to allow for maintenance being executed simultaneously on two executors? Something like at least 60s older than the oldest tracked file." What you need to do is to add a time padding before deleting orphans which is a goo
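
The padding idea quoted above can be sketched as follows. This is a minimal illustration, not Spark's actual cleanup code: `RemoteFile`, `orphans_safe_to_delete`, and `padding_ms` are hypothetical names, and the 60s default is taken only from the figure suggested in the thread. An untracked file is treated as deletable only if it is older than the oldest tracked file by at least the padding.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class RemoteFile:
    """Hypothetical stand-in for a file listed in the checkpoint directory."""
    name: str
    modified_ms: int  # last-modified time in epoch milliseconds


def orphans_safe_to_delete(
    all_files: Iterable[RemoteFile],
    tracked_names: Set[str],
    padding_ms: int = 60_000,  # assumption: 60s, as floated in the thread
) -> List[RemoteFile]:
    """Return untracked files older than the oldest tracked file minus padding."""
    files = list(all_files)
    tracked = [f for f in files if f.name in tracked_names]
    if not tracked:
        # No tracked file means no reference point, so delete nothing.
        return []
    cutoff = min(f.modified_ms for f in tracked) - padding_ms
    return [
        f for f in files
        if f.name not in tracked_names and f.modified_ms < cutoff
    ]


files = [
    RemoteFile("a.sst", 100_000),
    RemoteFile("b.sst", 200_000),  # the only tracked file
    RemoteFile("c.sst", 195_000),  # untracked, but written just before b.sst
    RemoteFile("d.sst", 250_000),
]
padded = orphans_safe_to_delete(files, {"b.sst"})
unpadded = orphans_safe_to_delete(files, {"b.sst"}, padding_ms=0)
```

With the padding, `c.sst` survives because it falls inside the 60s window before the oldest tracked file; with `padding_ms=0` it would be deleted, which is the situation the thread describes when two executors run maintenance at nearly the same time.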

Apache Spark 4.0.1 ?

2025-08-25 Thread Dongjoon Hyun
Hi, All. Since the Apache Spark 4.0.0 tag was created in May, more than three months have passed. https://github.com/apache/spark/releases/tag/v4.0.0 (2025-05-19) So far, 124 commits (mostly bug fixes) have been merged into the branch-4.0 branch. $ git log --oneline v4.0.0...HEAD | wc -

Re: [Spark SQL][Parquet]: Question about support for Parquet TIME data

2025-08-25 Thread Sarah Gilmore
Hi all, I opened a sub-task (SPARK-53368) of SPARK-51162 to track future discussions. Here's a link[1] to the new JIRA issue. I created a subtask of SPARK-51162 instead of SPARK-51342 since the latter is already a subtask. Thanks for taking the time to consider this enhancement! Best Regards,