from:"Zakelly Lan \(Jira\)"

[jira] [Assigned] (FLINK-36664) [Window]The window with window_offset will lose data.

2024-11-08 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-36664:
---

Assignee: kaitian

> [Window]The window with window_offset will lose data.
> -
>
> Key: FLINK-36664
> URL: https://issues.apache.org/jira/browse/FLINK-36664
> Project: Flink
>  Issue Type: Bug
>Reporter: kaitian
>Assignee: kaitian
>Priority: Major
>  Labels: window
> Attachments: image-2024-11-06-20-20-34-562.png, 
> image-2024-11-06-20-23-53-184.png
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When setting the offset for the window, data is lost because the triggering 
> window time is not aligned.
>  
> When SlicingWindowOperator processesWatermarker, it records the time when the 
> next window is triggered (nextTriggerWatermark):
> !image-2024-11-06-20-20-34-562.png!
> The calculation method is to first calculate the begin time of the window 
> where the watermark is located, but the offset passed in during the 
> calculation is 0:
> !image-2024-11-06-20-23-53-184.png!
> That is to say, the window triggering time does not take into account the 
> window offset.
> It is OK if the nextTriggerWatermark is too small. When watermark-offset>0, 
> if (watermark-offset)%window_size > watermark%window_size is satisfied, the 
> nextTriggerWatermark will be too large, where offset and window_size are 
> constants. If the watermark is completely random, it is easy to prove that 
> there is a 50% probability that the nextTriggerWatermark will be too large.
> When the nextTriggerWatermark is too large, a processWatermark should have 
> flushWindowBuffer but was not triggered, resulting in less data in the 
> currently triggered window (assuming it is key1). When the next 
> processWatermark triggers flushWindowBuffer, since the Watermark has moved 
> forward, key1 will be regarded as expired data and the timer will not be 
> registered. That is to say, the subsequent processWatermark will no longer 
> calculate key1, and data will be lost.
>  
> I have writen a UT to prove this bug:
> window_size 3000, offset 1000, tumping window
> window_size 3000, offset 1000. When processWatermark(3000), the normally 
> calculated nextTriggerProgress = 3000 - (3000-1000)%3000 + 3000-1 = 3999, but 
> because the code does not consider the offset, nextTriggerProgress = 3000 - 
> (3000-0)%3000 + 3000-1 = 5999, which is too large.We will lose data.
> !https://intranetproxy.alipay.com/skylark/lark/0/2024/png/101856358/1730893082766-4ea5d356-2b19-436d-be46-093f70e445cd.png!
> This wrong UT will pass. You can see that the data in the UT is lost and will 
> not be calculated in any subsequent processWatermark.
>  
> Repair suggestion:
> pass window_offset when calculating nextTriggerWatermark
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-36663) [Window]The first processWatermark() after Sliding Window restore may have extra expired data.

2024-11-08 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-36663:
---

Assignee: kaitian

> [Window]The first processWatermark() after Sliding Window restore may have 
> extra expired data.
> --
>
> Key: FLINK-36663
> URL: https://issues.apache.org/jira/browse/FLINK-36663
> Project: Flink
>  Issue Type: Bug
>Reporter: kaitian
>Assignee: kaitian
>Priority: Major
>  Labels: window
> Attachments: image-2024-11-06-19-43-58-569.png, 
> image-2024-11-06-19-47-37-225.png, image-2024-11-06-19-50-35-362.png
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The root cause is that the currentWatermark of TimerService is not restored 
> after restore.
> !image-2024-11-06-19-43-58-569.png!
> This will cause problems when using the sliding window, such as hopping 
> window. Because when flushing WindowBuffer, it is necessary to determine 
> whether the current  is expired. Here, the currentWatermark 
> in the TimerService is used, and the currentWatermark in the TimerService 
> will only be updated to the  when 
> advance.
> !image-2024-11-06-19-47-37-225.png!
> This will cause that after restore and before the first advance, if 
> processElement( ), the slice_end has been triggered, but 
> because this slice is included in other untriggered windows, the data  slice_end> will be written to the windowBuffer. (At this time,  currentWatermark in the window> is used to determine whether the data is 
> expired, and  is restored):
> !image-2024-11-06-19-50-35-362.png!
> However, because  is used when 
> flushWindowBuffer, it is determined that the slice has not expired and the 
> timer is registered.
> This will cause an expired slice_end to be registered, resulting in an 
> expired window being triggered and output extra data.
>  
>  
> I have writen a UT to prove this problem:
> !https://aone.alibaba-inc.com/v2/api/workitem/adapter/file/url?fileIdentifier=workitem%2Falibaba%2F875140%2F1730810319419%E6%88%AA%E5%B1%8F2024-11-05%2020.38.19.png!
> Triggering restore will have different results. You can remove the restore 
> part in the red box and the test will pass.
> use hopping window, window size = 3000, we have window range:[-1000,2000), 
> [0, 3000]
> processWatermark (currentWatermark = 2001)
> processElement:     the expired record <"key2", 1, fromEpochMillis(1500L)>
> if no restore after processWatermark (currentWatermark = 2001):
> the expired record will not output in window range [-1000,2000).
>  
> if restore after processWatermark (currentWatermark = 2001):
> the expired record will not output in window range [-1000,2000).
>  
>  
> Repair suggestions：
> While restoring the window currentWatermark, restore the currentWatermark in 
> the timeService
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36622) Remove the dependency of StateBenchmark on RocksDBKeyedStateBackend APIs.

2024-11-03 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36622.
-
Resolution: Fixed

> Remove the dependency of StateBenchmark on RocksDBKeyedStateBackend APIs.
> -
>
> Key: FLINK-36622
> URL: https://issues.apache.org/jira/browse/FLINK-36622
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Han Yin
>Assignee: Han Yin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> Currently, flink-benchmarks relies on non-public APIs in Flink. For example, 
> in {_}+StateBackendBenchmarkUtils.java+{_}, the function _+compactState+_ 
> takes RocksDBKeyedStateBackend as its first argument.
> This requires explicit type conversion in flink-benchmark(from 
> +_KeyedStateBackend_+ to {+}_RocksDBKeyedStateBackend_{+}). Moreover, this 
> means that once the signature of +_RocksDBKeyedStateBackend_+ changes, we 
> need to modify flink-benchmark correspondingly.
> Therefore, we should avoid exposing non-public APIs in 
> {_}+StateBackendBenchmarkUtils+{_}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36622) Remove the dependency of StateBenchmark on RocksDBKeyedStateBackend APIs.

2024-11-03 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895168#comment-17895168
 ] 

Zakelly Lan commented on FLINK-36622:
-

Merge fde0c29 into flink-benchmarks

> Remove the dependency of StateBenchmark on RocksDBKeyedStateBackend APIs.
> -
>
> Key: FLINK-36622
> URL: https://issues.apache.org/jira/browse/FLINK-36622
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Han Yin
>Assignee: Han Yin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> Currently, flink-benchmarks relies on non-public APIs in Flink. For example, 
> in {_}+StateBackendBenchmarkUtils.java+{_}, the function _+compactState+_ 
> takes RocksDBKeyedStateBackend as its first argument.
> This requires explicit type conversion in flink-benchmark(from 
> +_KeyedStateBackend_+ to {+}_RocksDBKeyedStateBackend_{+}). Moreover, this 
> means that once the signature of +_RocksDBKeyedStateBackend_+ changes, we 
> need to modify flink-benchmark correspondingly.
> Therefore, we should avoid exposing non-public APIs in 
> {_}+StateBackendBenchmarkUtils+{_}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36622) Remove the dependency of StateBenchmark on RocksDBKeyedStateBackend APIs.

2024-10-30 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17894403#comment-17894403
 ] 

Zakelly Lan commented on FLINK-36622:
-

Merge a6c4dc5 into main repo

> Remove the dependency of StateBenchmark on RocksDBKeyedStateBackend APIs.
> -
>
> Key: FLINK-36622
> URL: https://issues.apache.org/jira/browse/FLINK-36622
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Affects Versions: 2.0-preview
>Reporter: Han Yin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> Currently, flink-benchmarks relies on non-public APIs in Flink. For example, 
> in {_}+StateBackendBenchmarkUtils.java+{_}, the function _+compactState+_ 
> takes RocksDBKeyedStateBackend as its first argument.
> This requires explicit type conversion in flink-benchmark(from 
> +_KeyedStateBackend_+ to {+}_RocksDBKeyedStateBackend_{+}). Moreover, this 
> means that once the signature of +_RocksDBKeyedStateBackend_+ changes, we 
> need to modify flink-benchmark correspondingly.
> Therefore, we should avoid exposing non-public APIs in 
> {_}+StateBackendBenchmarkUtils+{_}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-36622) Remove the dependency of StateBenchmark on RocksDBKeyedStateBackend APIs.

2024-10-30 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-36622:
---

Assignee: Han Yin

> Remove the dependency of StateBenchmark on RocksDBKeyedStateBackend APIs.
> -
>
> Key: FLINK-36622
> URL: https://issues.apache.org/jira/browse/FLINK-36622
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Affects Versions: 2.0-preview
>Reporter: Han Yin
>Assignee: Han Yin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> Currently, flink-benchmarks relies on non-public APIs in Flink. For example, 
> in {_}+StateBackendBenchmarkUtils.java+{_}, the function _+compactState+_ 
> takes RocksDBKeyedStateBackend as its first argument.
> This requires explicit type conversion in flink-benchmark(from 
> +_KeyedStateBackend_+ to {+}_RocksDBKeyedStateBackend_{+}). Moreover, this 
> means that once the signature of +_RocksDBKeyedStateBackend_+ changes, we 
> need to modify flink-benchmark correspondingly.
> Therefore, we should avoid exposing non-public APIs in 
> {_}+StateBackendBenchmarkUtils+{_}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-36622) Remove the dependency of StateBenchmark on RocksDBKeyedStateBackend APIs.

2024-10-30 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan updated FLINK-36622:

Affects Version/s: (was: 2.0-preview)

> Remove the dependency of StateBenchmark on RocksDBKeyedStateBackend APIs.
> -
>
> Key: FLINK-36622
> URL: https://issues.apache.org/jira/browse/FLINK-36622
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Han Yin
>Assignee: Han Yin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> Currently, flink-benchmarks relies on non-public APIs in Flink. For example, 
> in {_}+StateBackendBenchmarkUtils.java+{_}, the function _+compactState+_ 
> takes RocksDBKeyedStateBackend as its first argument.
> This requires explicit type conversion in flink-benchmark(from 
> +_KeyedStateBackend_+ to {+}_RocksDBKeyedStateBackend_{+}). Moreover, this 
> means that once the signature of +_RocksDBKeyedStateBackend_+ changes, we 
> need to modify flink-benchmark correspondingly.
> Therefore, we should avoid exposing non-public APIs in 
> {_}+StateBackendBenchmarkUtils+{_}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (FLINK-4602) Move RocksDB backend to proper package

2024-10-29 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893675#comment-17893675
 ] 

Zakelly Lan edited comment on FLINK-4602 at 10/30/24 2:32 AM:
--

Merge 316daca and 255ca52 into master


was (Author: zakelly):
Merge 316daca into master

> Move RocksDB backend to proper package
> --
>
> Key: FLINK-4602
> URL: https://issues.apache.org/jira/browse/FLINK-4602
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / State Backends
>Reporter: Aljoscha Krettek
>Assignee: Han Yin
>Priority: Major
>  Labels: 2.0-related, auto-unassigned, pull-request-available
> Fix For: 2.0.0
>
>
> Right now the package is {{org.apache.flink.contrib.streaming.state}}, it 
> should probably be in {{org.apache.flink.runtime.state.rocksdb}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-4602) Move RocksDB backend to proper package

2024-10-28 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893675#comment-17893675
 ] 

Zakelly Lan commented on FLINK-4602:


Merge 316daca into master

> Move RocksDB backend to proper package
> --
>
> Key: FLINK-4602
> URL: https://issues.apache.org/jira/browse/FLINK-4602
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / State Backends
>Reporter: Aljoscha Krettek
>Assignee: Han Yin
>Priority: Major
>  Labels: 2.0-related, auto-unassigned, pull-request-available
> Fix For: 2.0.0
>
>
> Right now the package is {{org.apache.flink.contrib.streaming.state}}, it 
> should probably be in {{org.apache.flink.runtime.state.rocksdb}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-4602) Move RocksDB backend to proper package

2024-10-28 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-4602.

Resolution: Fixed

> Move RocksDB backend to proper package
> --
>
> Key: FLINK-4602
> URL: https://issues.apache.org/jira/browse/FLINK-4602
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / State Backends
>Reporter: Aljoscha Krettek
>Assignee: Han Yin
>Priority: Major
>  Labels: 2.0-related, auto-unassigned, pull-request-available
> Fix For: 2.0.0
>
>
> Right now the package is {{org.apache.flink.contrib.streaming.state}}, it 
> should probably be in {{org.apache.flink.runtime.state.rocksdb}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36521) Introduce TtlAwareSerializer to resolve the compatibility check between ttlSerializer and original serializer

2024-10-27 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36521.
-
Fix Version/s: 2.0.0
 Assignee: xiangyu feng
   Resolution: Fixed

> Introduce TtlAwareSerializer to resolve the compatibility check between 
> ttlSerializer and original serializer 
> --
>
> Key: FLINK-36521
> URL: https://issues.apache.org/jira/browse/FLINK-36521
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36521) Introduce TtlAwareSerializer to resolve the compatibility check between ttlSerializer and original serializer

2024-10-27 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893323#comment-17893323
 ] 

Zakelly Lan commented on FLINK-36521:
-

Merge bf6cf50f...cddb14ed  into master

> Introduce TtlAwareSerializer to resolve the compatibility check between 
> ttlSerializer and original serializer 
> --
>
> Key: FLINK-36521
> URL: https://issues.apache.org/jira/browse/FLINK-36521
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36589) Decouple the initialization of sync and async keyed statebackend

2024-10-27 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893321#comment-17893321
 ] 

Zakelly Lan commented on FLINK-36589:
-

Merge 8161fbd2 into master

> Decouple the initialization of sync and async keyed statebackend
> 
>
> Key: FLINK-36589
> URL: https://issues.apache.org/jira/browse/FLINK-36589
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the operator will create both AsyncKeyedStateBackend and 
> KeyedStateBackend and pick one for use. It's better to only create one.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36589) Decouple the initialization of sync and async keyed statebackend

2024-10-27 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36589.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

> Decouple the initialization of sync and async keyed statebackend
> 
>
> Key: FLINK-36589
> URL: https://issues.apache.org/jira/browse/FLINK-36589
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
>
> Currently, the operator will create both AsyncKeyedStateBackend and 
> KeyedStateBackend and pick one for use. It's better to only create one.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36598) [state/forst] Refactor initialization of ForSt db core

2024-10-24 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36598:
---

 Summary: [state/forst] Refactor initialization of ForSt db core
 Key: FLINK-36598
 URL: https://issues.apache.org/jira/browse/FLINK-36598
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Zakelly Lan
Assignee: Zakelly Lan


Currently, the {{ForSt}} creates {{FileSystem}} itself. It is better to create 
it from flink side add give it to {{ForSt}}. By doing so, we are able to inject 
a cache or share files with the checkpoint.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36589) Decouple the initialization of sync and async keyed statebackend

2024-10-22 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36589:
---

 Summary: Decouple the initialization of sync and async keyed 
statebackend
 Key: FLINK-36589
 URL: https://issues.apache.org/jira/browse/FLINK-36589
 Project: Flink
  Issue Type: Sub-task
Reporter: Zakelly Lan
Assignee: Zakelly Lan


Currently, the operator will create both AsyncKeyedStateBackend and 
KeyedStateBackend and pick one for use. It's better to only create one.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory

2024-10-18 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890877#comment-17890877
 ] 

Zakelly Lan commented on FLINK-36421:
-

All merged.

1.18: cabc6f4

1.19: 8c13611

1.20: 6a51962

> Missing fsync in FsCheckpointStreamFactory
> --
>
> Key: FLINK-36421
> URL: https://issues.apache.org/jira/browse/FLINK-36421
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Checkpointing
>Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>Reporter: Marc Aurel Fritz
>Assignee: Marc Aurel Fritz
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
> Attachments: fsync-fs-stream-factory.diff
>
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is 
> similar to FLINK-35217, but affects only files written by the taskmanager 
> (the ones with random names as described 
> [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]).
> After system crash the files written by the taskmanager may be corrupted 
> (file size of 0 bytes) if the changes in the file-system cache haven't been 
> written to disk. The "_metadata" file written by the jobmanager is always 
> fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in 
> "FsCheckpointStreamFactory". In this case the "OutputStream" is closed 
> without calling "fsync", thus the file is not durably persisted on disk 
> before the checkpoint is completed. (As previously established in 
> FLINK-35217, calling "fsync" is necessary as simply closing the stream does 
> not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
>  # The checkpoint chk-1217's directory is created at "mkdir"
>  # The checkpoint chk-1217's non-inline state is written by the taskmanager 
> at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that 
> there's no "fsync" before "close".
>  # The checkpoint chk-1217 is finished, its "_metadata" is written and synced 
> properly
>  # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get 
> corrupted on e.g. power loss. This means there is no working checkpoint left 
> as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947250] 08:22:58 
> mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0777) = 0
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  0x7f56f08d5610) = -1 ENOENT (No such file or directory)
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 1303248] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199
> [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 1303248] 08:22:59 close(199)   = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb378b0) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f

[jira] [Updated] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory

2024-10-18 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan updated FLINK-36421:

Fix Version/s: 1.18.2
   1.19.2
   1.20.1

> Missing fsync in FsCheckpointStreamFactory
> --
>
> Key: FLINK-36421
> URL: https://issues.apache.org/jira/browse/FLINK-36421
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Checkpointing
>Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>Reporter: Marc Aurel Fritz
>Assignee: Marc Aurel Fritz
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.18.2, 1.19.2, 1.20.1, 2.0-preview
>
> Attachments: fsync-fs-stream-factory.diff
>
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is 
> similar to FLINK-35217, but affects only files written by the taskmanager 
> (the ones with random names as described 
> [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]).
> After system crash the files written by the taskmanager may be corrupted 
> (file size of 0 bytes) if the changes in the file-system cache haven't been 
> written to disk. The "_metadata" file written by the jobmanager is always 
> fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in 
> "FsCheckpointStreamFactory". In this case the "OutputStream" is closed 
> without calling "fsync", thus the file is not durably persisted on disk 
> before the checkpoint is completed. (As previously established in 
> FLINK-35217, calling "fsync" is necessary as simply closing the stream does 
> not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
>  # The checkpoint chk-1217's directory is created at "mkdir"
>  # The checkpoint chk-1217's non-inline state is written by the taskmanager 
> at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that 
> there's no "fsync" before "close".
>  # The checkpoint chk-1217 is finished, its "_metadata" is written and synced 
> properly
>  # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get 
> corrupted on e.g. power loss. This means there is no working checkpoint left 
> as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947250] 08:22:58 
> mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0777) = 0
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  0x7f56f08d5610) = -1 ENOENT (No such file or directory)
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 1303248] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199
> [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 1303248] 08:22:59 close(199)   = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb378b0) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-

[jira] [Updated] (FLINK-4602) Move RocksDB backend to proper package

2024-10-17 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan updated FLINK-4602:
---
Parent: (was: FLINK-3957)
Issue Type: Technical Debt  (was: Sub-task)

> Move RocksDB backend to proper package
> --
>
> Key: FLINK-4602
> URL: https://issues.apache.org/jira/browse/FLINK-4602
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / State Backends
>Reporter: Aljoscha Krettek
>Priority: Major
>  Labels: 2.0-related, auto-unassigned
> Fix For: 2.0.0
>
>
> Right now the package is {{org.apache.flink.contrib.streaming.state}}, it 
> should probably be in {{org.apache.flink.runtime.state.rocksdb}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-4602) Move RocksDB backend to proper package

2024-10-17 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-4602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-4602:
--

Assignee: Han Yin

> Move RocksDB backend to proper package
> --
>
> Key: FLINK-4602
> URL: https://issues.apache.org/jira/browse/FLINK-4602
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / State Backends
>Reporter: Aljoscha Krettek
>Assignee: Han Yin
>Priority: Major
>  Labels: 2.0-related, auto-unassigned
> Fix For: 2.0.0
>
>
> Right now the package is {{org.apache.flink.contrib.streaming.state}}, it 
> should probably be in {{org.apache.flink.runtime.state.rocksdb}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36526) Optimize the overhead of writing with direct buffer in ForSt

2024-10-15 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36526.
-
Fix Version/s: 2.0.0
   Resolution: Fixed

> Optimize the overhead of writing with direct buffer in ForSt 
> -
>
> Key: FLINK-36526
> URL: https://issues.apache.org/jira/browse/FLINK-36526
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0.0
>
> Attachments: image-2024-10-14-15-52-41-457.png
>
>
> Currently, the ForSt gives a direct buffer to 
> {{{}ByteBufferWritableFSDataOutputStream{}}}, where the data will be written 
> one byte by byte. According our perf, the statistics of hadoop based fs will 
> be updated once for each byte, which takes a lot of CPU. Below is a 
> flamegraph, where the statistics part is marked as purple (taking 8.14% of 
> the overall CPU).
> !image-2024-10-14-15-52-41-457.png|width=1296,height=616!
>  
> It might be better to copy to a heap buffer before invoking write.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36526) Optimize the overhead of writing with direct buffer in ForSt

2024-10-15 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889600#comment-17889600
 ] 

Zakelly Lan commented on FLINK-36526:
-

Merge f7a532a into master

> Optimize the overhead of writing with direct buffer in ForSt 
> -
>
> Key: FLINK-36526
> URL: https://issues.apache.org/jira/browse/FLINK-36526
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-10-14-15-52-41-457.png
>
>
> Currently, the ForSt gives a direct buffer to 
> {{{}ByteBufferWritableFSDataOutputStream{}}}, where the data will be written 
> one byte by byte. According our perf, the statistics of hadoop based fs will 
> be updated once for each byte, which takes a lot of CPU. Below is a 
> flamegraph, where the statistics part is marked as purple (taking 8.14% of 
> the overall CPU).
> !image-2024-10-14-15-52-41-457.png|width=1296,height=616!
>  
> It might be better to copy to a heap buffer before invoking write.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36322) Fix compile error of flink benchmark caused by breaking changes

2024-10-15 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17889512#comment-17889512
 ] 

Zakelly Lan commented on FLINK-36322:
-

First part (26454e6...61de85e) merged into flink-benchmark master.

> Fix compile error of flink benchmark caused by breaking changes
> ---
>
> Key: FLINK-36322
> URL: https://issues.apache.org/jira/browse/FLINK-36322
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Benchmarks
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-36526) Optimize the overhead of writing with direct buffer in ForSt

2024-10-14 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan updated FLINK-36526:

Description: 
Currently, the ForSt gives a direct buffer to 
{{{}ByteBufferWritableFSDataOutputStream{}}}, where the data will be written 
one byte by byte. According our perf, the statistics of hadoop based fs will be 
updated once for each byte, which takes a lot of CPU. Below is a flamegraph, 
where the statistics part is marked as purple (taking 8.14% of the overall CPU).

!image-2024-10-14-15-52-41-457.png|width=1296,height=616!

 

It might be better to copy to a heap buffer before invoking write.

  was:
Currently, the ForSt gives a direct buffer to 
\{{ByteBufferWritableFSDataOutputStream}}, where the data will be written one 
byte by byte. According our perf, the statistics of hadoop based fs will be 
updated once for each byte, which takes a lot of CPU.

!image-2024-10-14-15-52-41-457.png!

 

It might be better to copy to a heap buffer before invoking write.


> Optimize the overhead of writing with direct buffer in ForSt 
> -
>
> Key: FLINK-36526
> URL: https://issues.apache.org/jira/browse/FLINK-36526
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
> Attachments: image-2024-10-14-15-52-41-457.png
>
>
> Currently, the ForSt gives a direct buffer to 
> {{{}ByteBufferWritableFSDataOutputStream{}}}, where the data will be written 
> one byte by byte. According our perf, the statistics of hadoop based fs will 
> be updated once for each byte, which takes a lot of CPU. Below is a 
> flamegraph, where the statistics part is marked as purple (taking 8.14% of 
> the overall CPU).
> !image-2024-10-14-15-52-41-457.png|width=1296,height=616!
>  
> It might be better to copy to a heap buffer before invoking write.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36526) Optimize the overhead of writing with direct buffer in ForSt

2024-10-14 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36526:
---

 Summary: Optimize the overhead of writing with direct buffer in 
ForSt 
 Key: FLINK-36526
 URL: https://issues.apache.org/jira/browse/FLINK-36526
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Zakelly Lan
Assignee: Zakelly Lan
 Attachments: image-2024-10-14-15-52-41-457.png

Currently, the ForSt gives a direct buffer to 
\{{ByteBufferWritableFSDataOutputStream}}, where the data will be written one 
byte by byte. According our perf, the statistics of hadoop based fs will be 
updated once for each byte, which takes a lot of CPU.

!image-2024-10-14-15-52-41-457.png!

 

It might be better to copy to a heap buffer before invoking write.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory

2024-10-09 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36421.
-
Fix Version/s: 2.0-preview
 Assignee: Marc Aurel Fritz
   Resolution: Fixed

> Missing fsync in FsCheckpointStreamFactory
> --
>
> Key: FLINK-36421
> URL: https://issues.apache.org/jira/browse/FLINK-36421
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Checkpointing
>Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>Reporter: Marc Aurel Fritz
>Assignee: Marc Aurel Fritz
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
> Attachments: fsync-fs-stream-factory.diff
>
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is 
> similar to FLINK-35217, but affects only files written by the taskmanager 
> (the ones with random names as described 
> [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]).
> After system crash the files written by the taskmanager may be corrupted 
> (file size of 0 bytes) if the changes in the file-system cache haven't been 
> written to disk. The "_metadata" file written by the jobmanager is always 
> fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in 
> "FsCheckpointStreamFactory". In this case the "OutputStream" is closed 
> without calling "fsync", thus the file is not durably persisted on disk 
> before the checkpoint is completed. (As previously established in 
> FLINK-35217, calling "fsync" is necessary as simply closing the stream does 
> not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
>  # The checkpoint chk-1217's directory is created at "mkdir"
>  # The checkpoint chk-1217's non-inline state is written by the taskmanager 
> at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that 
> there's no "fsync" before "close".
>  # The checkpoint chk-1217 is finished, its "_metadata" is written and synced 
> properly
>  # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get 
> corrupted on e.g. power loss. This means there is no working checkpoint left 
> as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947250] 08:22:58 
> mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0777) = 0
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  0x7f56f08d5610) = -1 ENOENT (No such file or directory)
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 1303248] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199
> [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 1303248] 08:22:59 close(199)   = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb378b0) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872

[jira] [Comment Edited] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory

2024-10-09 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888115#comment-17888115
 ] 

Zakelly Lan edited comment on FLINK-36421 at 10/10/24 2:26 AM:
---

Merged dfb9bfeabac8d3ac289e46a3017ed68c50ba3777 into master


was (Author: zakelly):
Merged 
[fd427ff|https://github.com/apache/flink/pull/25468/commits/fd427ffdeaa9b9fc6c31dee3dc09ca982ad7b5ba]
 into master

> Missing fsync in FsCheckpointStreamFactory
> --
>
> Key: FLINK-36421
> URL: https://issues.apache.org/jira/browse/FLINK-36421
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Checkpointing
>Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>Reporter: Marc Aurel Fritz
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsync-fs-stream-factory.diff
>
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is 
> similar to FLINK-35217, but affects only files written by the taskmanager 
> (the ones with random names as described 
> [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]).
> After system crash the files written by the taskmanager may be corrupted 
> (file size of 0 bytes) if the changes in the file-system cache haven't been 
> written to disk. The "_metadata" file written by the jobmanager is always 
> fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in 
> "FsCheckpointStreamFactory". In this case the "OutputStream" is closed 
> without calling "fsync", thus the file is not durably persisted on disk 
> before the checkpoint is completed. (As previously established in 
> FLINK-35217, calling "fsync" is necessary as simply closing the stream does 
> not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
>  # The checkpoint chk-1217's directory is created at "mkdir"
>  # The checkpoint chk-1217's non-inline state is written by the taskmanager 
> at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that 
> there's no "fsync" before "close".
>  # The checkpoint chk-1217 is finished, its "_metadata" is written and synced 
> properly
>  # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get 
> corrupted on e.g. power loss. This means there is no working checkpoint left 
> as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947250] 08:22:58 
> mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0777) = 0
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  0x7f56f08d5610) = -1 ENOENT (No such file or directory)
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 1303248] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199
> [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 1303248] 08:22:59 close(199)   = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb378b0) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08

[jira] [Commented] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory

2024-10-09 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888115#comment-17888115
 ] 

Zakelly Lan commented on FLINK-36421:
-

Merged 
[fd427ff|https://github.com/apache/flink/pull/25468/commits/fd427ffdeaa9b9fc6c31dee3dc09ca982ad7b5ba]
 into master

> Missing fsync in FsCheckpointStreamFactory
> --
>
> Key: FLINK-36421
> URL: https://issues.apache.org/jira/browse/FLINK-36421
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Checkpointing
>Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>Reporter: Marc Aurel Fritz
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsync-fs-stream-factory.diff
>
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is 
> similar to FLINK-35217, but affects only files written by the taskmanager 
> (the ones with random names as described 
> [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]).
> After system crash the files written by the taskmanager may be corrupted 
> (file size of 0 bytes) if the changes in the file-system cache haven't been 
> written to disk. The "_metadata" file written by the jobmanager is always 
> fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in 
> "FsCheckpointStreamFactory". In this case the "OutputStream" is closed 
> without calling "fsync", thus the file is not durably persisted on disk 
> before the checkpoint is completed. (As previously established in 
> FLINK-35217, calling "fsync" is necessary as simply closing the stream does 
> not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
>  # The checkpoint chk-1217's directory is created at "mkdir"
>  # The checkpoint chk-1217's non-inline state is written by the taskmanager 
> at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that 
> there's no "fsync" before "close".
>  # The checkpoint chk-1217 is finished, its "_metadata" is written and synced 
> properly
>  # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get 
> corrupted on e.g. power loss. This means there is no working checkpoint left 
> as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947250] 08:22:58 
> mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0777) = 0
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  0x7f56f08d5610) = -1 ENOENT (No such file or directory)
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 1303248] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199
> [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 1303248] 08:22:59 close(199)   = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb378b0) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad2

[jira] [Commented] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory

2024-10-09 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17888114#comment-17888114
 ] 

Zakelly Lan commented on FLINK-36421:
-

[~planet9]  It is the java part of flink that uploads all the sst files, since 
the rocksdb lib couldn't write the remote storage. And when recovery the flink 
downloads all the files to local and rebuild the rocksdb. So for rocksdb it 
always manipulates the real files at local.  We are working on a new k-v store 
based on rocksdb called forst 
(https://issues.apache.org/jira/browse/FLINK-34975), which could read and write 
the remote storage. So in future the store itself can do checkpoint and 
recovery.

 

I think it would be better if you backport this to release-1.18/19/20 .

> Missing fsync in FsCheckpointStreamFactory
> --
>
> Key: FLINK-36421
> URL: https://issues.apache.org/jira/browse/FLINK-36421
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Checkpointing
>Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>Reporter: Marc Aurel Fritz
>Priority: Critical
>  Labels: pull-request-available
> Attachments: fsync-fs-stream-factory.diff
>
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is 
> similar to FLINK-35217, but affects only files written by the taskmanager 
> (the ones with random names as described 
> [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]).
> After system crash the files written by the taskmanager may be corrupted 
> (file size of 0 bytes) if the changes in the file-system cache haven't been 
> written to disk. The "_metadata" file written by the jobmanager is always 
> fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in 
> "FsCheckpointStreamFactory". In this case the "OutputStream" is closed 
> without calling "fsync", thus the file is not durably persisted on disk 
> before the checkpoint is completed. (As previously established in 
> FLINK-35217, calling "fsync" is necessary as simply closing the stream does 
> not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
>  # The checkpoint chk-1217's directory is created at "mkdir"
>  # The checkpoint chk-1217's non-inline state is written by the taskmanager 
> at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that 
> there's no "fsync" before "close".
>  # The checkpoint chk-1217 is finished, its "_metadata" is written and synced 
> properly
>  # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get 
> corrupted on e.g. power loss. This means there is no working checkpoint left 
> as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947250] 08:22:58 
> mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0777) = 0
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  0x7f56f08d5610) = -1 ENOENT (No such file or directory)
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 1303248] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199
> [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 1303248] 08:22:59 close(199)   = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb378b0) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0",
>  0x7f683fb37730) = -1 ENOENT (No such file

[jira] [Assigned] (FLINK-36322) Fix compile error of flink benchmark caused by breaking changes

2024-10-09 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-36322:
---

Assignee: Zakelly Lan

> Fix compile error of flink benchmark caused by breaking changes
> ---
>
> Key: FLINK-36322
> URL: https://issues.apache.org/jira/browse/FLINK-36322
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Benchmarks
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-34984) FLIP-423: Disaggregated State Storage and Management (Umbrella FLIP)

2024-10-08 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-34984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-34984:
---

Assignee: Zakelly Lan

> FLIP-423: Disaggregated State Storage and Management (Umbrella FLIP)
> 
>
> Key: FLINK-34984
> URL: https://issues.apache.org/jira/browse/FLINK-34984
> Project: Flink
>  Issue Type: New Feature
>  Components: API / Core, API / DataStream, Runtime / Checkpointing, 
> Runtime / State Backends
>Reporter: Yuan Mei
>Assignee: Zakelly Lan
>Priority: Major
>
> The past decade has witnessed a dramatic shift in Flink's deployment mode, 
> workload patterns, and hardware improvements. We've moved from the map-reduce 
> era where workers are computation-storage tightly coupled nodes to a 
> cloud-native world where containerized deployments on Kubernetes become 
> standard. To enable Flink's Cloud-Native future, we introduce Disaggregated 
> State Storage and Management that uses DFS as primary storage in Flink 2.0
> This new architecture is aimed to solve the following challenges brought in 
> the cloud-native era for Flink.
> 1. Local Disk Constraints in containerization
> 2. Spiky Resource Usage caused by compaction in the current state model
> 3. Fast Rescaling for jobs with large states (hundreds of Terabytes)
> 4. Light and Fast Checkpoint in a native way
>  
> Design Details can be found in 
> [FLIP-423|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855]
> Proposed changes can be found here:
>  * [Asynchronous State APIs (FLIP-424) 
> |https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-AsynchronousStateAPIs(FLIP-424)]
>  * [Non-blocking Asynchronous Execution Model: Parallel I/O 
> (FLIP-425)|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-Non-blockingAsynchronousExecutionModel:ParallelI/O(FLIP-425)]
>  * [Batching for Network I/O: Beyond Parallel I/O 
> (FLIP-426)|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-BatchingforNetworkI/O:BeyondParallelI/O(FLIP-426)]
>  * [Disaggregated State Store: ForSt 
> (FLIP-427)|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-DisaggregatedStateStore:ForSt(FLIP-427)]
>  * [Faster Checkpoint/Restore/Rescale: Leverage Shared DFS 
> (FLIP-428)|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-FasterCheckpoint/Restore/Rescale:LeverageSharedDFS(FLIP-428)]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-36439) Documentation for Disaggregated State Storage and Management

2024-10-08 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan updated FLINK-36439:

Component/s: Documentation

> Documentation for Disaggregated State Storage and Management 
> -
>
> Key: FLINK-36439
> URL: https://issues.apache.org/jira/browse/FLINK-36439
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation, Runtime / Checkpointing, Runtime / State 
> Backends, Runtime / Task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36439) Documentation for Disaggregated State Storage and Management

2024-10-08 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36439:
---

 Summary: Documentation for Disaggregated State Storage and 
Management 
 Key: FLINK-36439
 URL: https://issues.apache.org/jira/browse/FLINK-36439
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Checkpointing, Runtime / State Backends, 
Runtime / Task
Reporter: Zakelly Lan
Assignee: Zakelly Lan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Closed] (FLINK-35928) ForSt supports compiling with RocksDB

2024-10-08 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan closed FLINK-35928.
---
Fix Version/s: 2.0-preview
   Resolution: Fixed

> ForSt supports compiling with RocksDB
> -
>
> Key: FLINK-35928
> URL: https://issues.apache.org/jira/browse/FLINK-35928
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Hangxiang Yu
>Assignee: Yanfei Lei
>Priority: Major
> Fix For: 2.0-preview
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36374) Bundle forst statebackend in flink-dist and provide shortcut to enable

2024-10-08 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887495#comment-17887495
 ] 

Zakelly Lan commented on FLINK-36374:
-

Merge dc2bf35b into master

> Bundle forst statebackend in flink-dist and provide shortcut to enable
> --
>
> Key: FLINK-36374
> URL: https://issues.apache.org/jira/browse/FLINK-36374
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> Currently, the forst statebackend are built under flink-statebackend-forst, 
> but is not included in the flink-dist jar. It is better to provide a same 
> distribution way like rocksdb.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36374) Bundle forst statebackend in flink-dist and provide shortcut to enable

2024-10-08 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36374.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> Bundle forst statebackend in flink-dist and provide shortcut to enable
> --
>
> Key: FLINK-36374
> URL: https://issues.apache.org/jira/browse/FLINK-36374
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> Currently, the forst statebackend are built under flink-statebackend-forst, 
> but is not included in the flink-dist jar. It is better to provide a same 
> distribution way like rocksdb.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-35411) Optimize buffer triggering of async state requests

2024-10-07 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-35411.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> Optimize buffer triggering of async state requests
> --
>
> Key: FLINK-35411
> URL: https://issues.apache.org/jira/browse/FLINK-35411
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends, Runtime / Task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> Currently during draining of async state requests, the task thread performs 
> {{Thread.sleep}} to avoid cpu overhead when polling mails. This can be 
> optimized by wait & notify.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36421) Missing fsync in FsCheckpointStreamFactory

2024-10-07 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17887479#comment-17887479
 ] 

Zakelly Lan commented on FLINK-36421:
-

[~gaborgsomogyi] Thanks for the heads up~

[~planet9] I think this also affects the rocksdb since it also uses the 
{{FsCheckpointStreamFactory}}. A {{sync()}} before {{close()}} would be fine. 
Would you please fix this?

> Missing fsync in FsCheckpointStreamFactory
> --
>
> Key: FLINK-36421
> URL: https://issues.apache.org/jira/browse/FLINK-36421
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems, Runtime / Checkpointing
>Affects Versions: 1.17.0, 1.18.0, 1.19.0, 1.20.0
>Reporter: Marc Aurel Fritz
>Priority: Critical
> Attachments: fsync-fs-stream-factory.diff
>
>
> With Flink 1.20 we observed another checkpoint corruption bug. This is 
> similar to FLINK-35217, but affects only files written by the taskmanager 
> (the ones with random names as described 
> [here|https://github.com/Planet-X/flink/blob/0d6e25a9738d9d4ee94de94e1437f92611b50758/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCheckpointStreamFactory.java#L47]).
> After system crash the files written by the taskmanager may be corrupted 
> (file size of 0 bytes) if the changes in the file-system cache haven't been 
> written to disk. The "_metadata" file written by the jobmanager is always 
> fine because it's properly fsynced.
> Investigation revealed that "fsync" is missing, this time in 
> "FsCheckpointStreamFactory". In this case the "OutputStream" is closed 
> without calling "fsync", thus the file is not durably persisted on disk 
> before the checkpoint is completed. (As previously established in 
> FLINK-35217, calling "fsync" is necessary as simply closing the stream does 
> not have any guarantees on persistence.)
> "strace" on the taskmanager's process confirms this behavior:
>  # The checkpoint chk-1217's directory is created at "mkdir"
>  # The checkpoint chk-1217's non-inline state is written by the taskmanager 
> at "openat", filename is "0507881e-8877-40b0-82d6-3d7dead64ccc". Note that 
> there's no "fsync" before "close".
>  # The checkpoint chk-1217 is finished, its "_metadata" is written and synced 
> properly
>  # The old checkpoint chk-1216 is deleted at "unlink"
> The new checkpoint chk-1217 now references a not-synced file that can get 
> corrupted on e.g. power loss. This means there is no working checkpoint left 
> as the old checkpoint was deleted.
> For durable persistence an "fsync" call is missing before "close" in step 2.
> Full "strace" log:
> {code:java}
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0x7f68414c5b50) = -1 ENOENT (No such file or directory)
> [pid 947250] 08:22:58 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947250] 08:22:58 
> mkdir("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> 0777) = 0
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  0x7f56f08d5610) = -1 ENOENT (No such file or directory)
> [pid 1303248] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 1303248] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/0507881e-8877-40b0-82d6-3d7dead64ccc",
>  O_WRONLY|O_CREAT|O_TRUNC, 0666) = 199
> [pid 1303248] 08:22:59 fstat(199, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
> [pid 1303248] 08:22:59 close(199)   = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb378b0) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/_metadata",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217/._metadata.inprogress.e87ac62b-5f75-472b-ad21-9579a872b0d0",
>  0x7f683fb37730) = -1 ENOENT (No such file or directory)
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 
> stat("/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/chk-1217", 
> {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> [pid 947310] 08:22:59 openat(AT_FDCWD, 
> "/opt/flink/statestore/d0346e6c26416765a6038fa482b29502/

[jira] [Resolved] (FLINK-36325) Implement basic restore from checkpoint for ForStStateBackend

2024-09-30 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36325.
-
Fix Version/s: 2.0-preview
 Assignee: Feifan Wang
   Resolution: Fixed

> Implement basic restore from checkpoint for ForStStateBackend
> -
>
> Key: FLINK-36325
> URL: https://issues.apache.org/jira/browse/FLINK-36325
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Feifan Wang
>Assignee: Feifan Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> As title, implement basic restore from checkpoint for ForStStateBackend, 
> rescale will be implemented later.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36325) Implement basic restore from checkpoint for ForStStateBackend

2024-09-30 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885850#comment-17885850
 ] 

Zakelly Lan commented on FLINK-36325:
-

Merge d42110e...a6a5ee6 into master

> Implement basic restore from checkpoint for ForStStateBackend
> -
>
> Key: FLINK-36325
> URL: https://issues.apache.org/jira/browse/FLINK-36325
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Feifan Wang
>Priority: Major
>  Labels: pull-request-available
>
> As title, implement basic restore from checkpoint for ForStStateBackend, 
> rescale will be implemented later.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36323) Remove deprecated MemoryStateBackend and RocksDBStateBackend

2024-09-30 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36323.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> Remove deprecated MemoryStateBackend and RocksDBStateBackend
> 
>
> Key: FLINK-36323
> URL: https://issues.apache.org/jira/browse/FLINK-36323
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / State Backends
>Reporter: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> After FLINK-19463, the checkpoint storage has been split from the state 
> backends. The deprecated implementation {{MemoryStateBackend}} and 
> {{RocksDBStateBackend}} are replaced by {{HashMapStateBackend}} and 
> {{EmbeddedRocksDBStateBackend}}. Those two state backends should be removed 
> in Flink 2.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36323) Remove deprecated MemoryStateBackend and RocksDBStateBackend

2024-09-30 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885843#comment-17885843
 ] 

Zakelly Lan commented on FLINK-36323:
-

Merge 5dcfdf7...3e1b309 into master

> Remove deprecated MemoryStateBackend and RocksDBStateBackend
> 
>
> Key: FLINK-36323
> URL: https://issues.apache.org/jira/browse/FLINK-36323
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / State Backends
>Reporter: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>
> After FLINK-19463, the checkpoint storage has been split from the state 
> backends. The deprecated implementation {{MemoryStateBackend}} and 
> {{RocksDBStateBackend}} are replaced by {{HashMapStateBackend}} and 
> {{EmbeddedRocksDBStateBackend}}. Those two state backends should be removed 
> in Flink 2.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36395) Graceful quit for Forst StateBackend

2024-09-29 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36395.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> Graceful quit for Forst StateBackend
> 
>
> Key: FLINK-36395
> URL: https://issues.apache.org/jira/browse/FLINK-36395
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> There are many threads in forst statebackends. We should make them quit 
> gracefully when task stop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-35666) Implement Aggregating Async State API for ForStStateBackend

2024-09-29 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-35666.
-
Fix Version/s: 2.0-preview
 Assignee: Jie Pu
   Resolution: Fixed

> Implement Aggregating Async State API for ForStStateBackend
> ---
>
> Key: FLINK-35666
> URL: https://issues.apache.org/jira/browse/FLINK-35666
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Jie Pu
>Priority: Major
> Fix For: 2.0-preview
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-35666) Implement Aggregating Async State API for ForStStateBackend

2024-09-29 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-35666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885778#comment-17885778
 ] 

Zakelly Lan commented on FLINK-35666:
-

Merge 773ea76 into master 

> Implement Aggregating Async State API for ForStStateBackend
> ---
>
> Key: FLINK-35666
> URL: https://issues.apache.org/jira/browse/FLINK-35666
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36376) More friendly error or warn message for misconfigured statebackend with async state processing

2024-09-28 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885655#comment-17885655
 ] 

Zakelly Lan commented on FLINK-36376:
-

Merge ca11cb9 into master

> More friendly error or warn message for misconfigured statebackend with async 
> state processing
> --
>
> Key: FLINK-36376
> URL: https://issues.apache.org/jira/browse/FLINK-36376
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36376) More friendly error or warn message for misconfigured statebackend with async state processing

2024-09-28 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36376.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> More friendly error or warn message for misconfigured statebackend with async 
> state processing
> --
>
> Key: FLINK-36376
> URL: https://issues.apache.org/jira/browse/FLINK-36376
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36395) Graceful quit for Forst StateBackend

2024-09-26 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36395:
---

 Summary: Graceful quit for Forst StateBackend
 Key: FLINK-36395
 URL: https://issues.apache.org/jira/browse/FLINK-36395
 Project: Flink
  Issue Type: Sub-task
Reporter: Zakelly Lan
Assignee: Zakelly Lan


There are many threads in forst statebackends. We should make them quit 
gracefully when task stop.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-30614) Improve resolving schema compatibility -- Milestone two

2024-09-26 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885199#comment-17885199
 ] 

Zakelly Lan commented on FLINK-30614:
-

Sorry [~zxcoccer], we're on a pretty tight schedule for version 2.0-preview, so 
[~AlexYinHan] finishes this these days. Thanks for volunteering

> Improve resolving schema compatibility -- Milestone two
> ---
>
> Key: FLINK-30614
> URL: https://issues.apache.org/jira/browse/FLINK-30614
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / Type Serialization System
>Reporter: Hangxiang Yu
>Priority: Major
>  Labels: pull-request-available
>
> In the milestone two, we should:
>  # Remove TypeSerializerSnapshot#resolveSchemaCompatibility(TypeSerializer 
> newSerializer) and related implementation.
>  # Make all places where use 
> TypeSerializerSnapshot#resolveSchemaCompatibility(TypeSerializer 
> newSerializer) to check the compatibility call 
> Typeserializer#resolveSchemaCompatibility(TypeSerializerSnapshot 
> oldSerializerSnapshot). 
>  # Remove the default implementation of the new method.
> It will be done after several stable version.
> See FLIP-263 for more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-30614) Improve resolving schema compatibility -- Milestone two

2024-09-26 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-30614.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> Improve resolving schema compatibility -- Milestone two
> ---
>
> Key: FLINK-30614
> URL: https://issues.apache.org/jira/browse/FLINK-30614
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / Type Serialization System
>Reporter: Hangxiang Yu
>Assignee: Han Yin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> In the milestone two, we should:
>  # Remove TypeSerializerSnapshot#resolveSchemaCompatibility(TypeSerializer 
> newSerializer) and related implementation.
>  # Make all places where use 
> TypeSerializerSnapshot#resolveSchemaCompatibility(TypeSerializer 
> newSerializer) to check the compatibility call 
> Typeserializer#resolveSchemaCompatibility(TypeSerializerSnapshot 
> oldSerializerSnapshot). 
>  # Remove the default implementation of the new method.
> It will be done after several stable version.
> See FLIP-263 for more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-30614) Improve resolving schema compatibility -- Milestone two

2024-09-26 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885200#comment-17885200
 ] 

Zakelly Lan commented on FLINK-30614:
-

Merge 6a76eee ... be90707 into master

> Improve resolving schema compatibility -- Milestone two
> ---
>
> Key: FLINK-30614
> URL: https://issues.apache.org/jira/browse/FLINK-30614
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / Type Serialization System
>Reporter: Hangxiang Yu
>Assignee: Han Yin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> In the milestone two, we should:
>  # Remove TypeSerializerSnapshot#resolveSchemaCompatibility(TypeSerializer 
> newSerializer) and related implementation.
>  # Make all places where use 
> TypeSerializerSnapshot#resolveSchemaCompatibility(TypeSerializer 
> newSerializer) to check the compatibility call 
> Typeserializer#resolveSchemaCompatibility(TypeSerializerSnapshot 
> oldSerializerSnapshot). 
>  # Remove the default implementation of the new method.
> It will be done after several stable version.
> See FLIP-263 for more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-30614) Improve resolving schema compatibility -- Milestone two

2024-09-26 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-30614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-30614:
---

Assignee: Han Yin

> Improve resolving schema compatibility -- Milestone two
> ---
>
> Key: FLINK-30614
> URL: https://issues.apache.org/jira/browse/FLINK-30614
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / Type Serialization System
>Reporter: Hangxiang Yu
>Assignee: Han Yin
>Priority: Major
>  Labels: pull-request-available
>
> In the milestone two, we should:
>  # Remove TypeSerializerSnapshot#resolveSchemaCompatibility(TypeSerializer 
> newSerializer) and related implementation.
>  # Make all places where use 
> TypeSerializerSnapshot#resolveSchemaCompatibility(TypeSerializer 
> newSerializer) to check the compatibility call 
> Typeserializer#resolveSchemaCompatibility(TypeSerializerSnapshot 
> oldSerializerSnapshot). 
>  # Remove the default implementation of the new method.
> It will be done after several stable version.
> See FLIP-263 for more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-35574) Setup base branch for FrocksDB-8.10

2024-09-26 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-35574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884991#comment-17884991
 ] 

Zakelly Lan commented on FLINK-35574:
-

[~mayuehappy] Done~

> Setup base branch for FrocksDB-8.10
> ---
>
> Key: FLINK-35574
> URL: https://issues.apache.org/jira/browse/FLINK-35574
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Affects Versions: 2.0.0
>Reporter: Yue Ma
>Assignee: Yue Ma
>Priority: Major
> Fix For: 2.0-preview
>
>
> As the first part of FLINK-35573, we need to prepare a base branch for 
> FRocksDB-8.10.0 first. Mainly, it needs to be checked out from version 8.10.0 
> of the Rocksdb community. Then check pick the commit which used by Flink from 
> FRocksDB-6.20.3 to 8.10.0
> *Details:*
> |*JIRA*|*FrocksDB-6.20.3*|*Commit ID in FrocksDB-8.10.0*|*Plan*|
> |[[FLINK-10471] Add Apache Flink specific compaction filter to evict expired 
> state which has 
> time-to-live|https://github.com/ververica/frocksdb/commit/3da8249d50c8a3a6ea229f43890d37e098372786]|3da8249d50c8a3a6ea229f43890d37e098372786|d606c9450bef7d2a22c794f406d7940d9d2f29a4|Already
>  in *FrocksDB-8.10.0*|
> |+[[FLINK-19710] Revert implementation of PerfContext back to __thread to 
> avoid performance 
> regression|https://github.com/ververica/frocksdb/commit/d6f50f33064f1d24480dfb3c586a7bd7a7dbac01]+|d6f50f33064f1d24480dfb3c586a7bd7a7dbac01|
>  |Fix in FLINK-35575|
> |[FRocksDB release guide and helping 
> scripts|https://github.com/ververica/frocksdb/commit/2673de8e5460af8d23c0c7e1fb0c3258ea283419]|2673de8e5460af8d23c0c7e1fb0c3258ea283419|b58ba05a380d9bf0c223bc707f14897ce392ce1b|Already
>  in *FrocksDB-8.10.0*|
> |+[Add content related to ARM building in the FROCKSDB-RELEASE 
> documentation|https://github.com/ververica/frocksdb/commit/ec27ca01db5ff579dd7db1f70cf3a4677b63d589]+|ec27ca01db5ff579dd7db1f70cf3a4677b63d589|6cae002662a45131a0cd90dd84f5d3d3cb958713|Already
>  in *FrocksDB-8.10.0*|
> |[[FLINK-23756] Update FrocksDB release document with more 
> info|https://github.com/ververica/frocksdb/commit/f75e983045f4b64958dc0e93e8b94a7cfd7663be]|f75e983045f4b64958dc0e93e8b94a7cfd7663be|bac6aeb6e012e19d9d5e3a5ee22b84c1e4a1559c|Already
>  in *FrocksDB-8.10.0*|
> |[Add support for Apple Silicon to RocksJava 
> (#9254)|https://github.com/ververica/frocksdb/commit/dac2c60bc31b596f445d769929abed292878cac1]|dac2c60bc31b596f445d769929abed292878cac1|#9254|Already
>  in *FrocksDB-8.10.0*|
> |[Fix RocksJava releases for macOS 
> (#9662)|https://github.com/ververica/frocksdb/commit/22637e11968a627a06a3ac8aa78126e3ae6d1368]|22637e11968a627a06a3ac8aa78126e3ae6d1368|#9662|Already
>  in *FrocksDB-8.10.0*|
> |+[Fix clang13 build error 
> (#9374)|https://github.com/ververica/frocksdb/commit/a20fb9fa96af7b18015754cf44463e22fc123222]+|a20fb9fa96af7b18015754cf44463e22fc123222|#9374|Already
>  in *FrocksDB-8.10.0*|
> |+[[hotfix] Resolve brken make 
> format|https://github.com/ververica/frocksdb/commit/cf0acdc08fb1b8397ef29f3b7dc7e0400107555e]+|7a87e0bf4d59cc48f40ce69cf7b82237c5e8170c|
>  |Already in *FrocksDB-8.10.0*|
> |+[Update circleci xcode version 
> (#9405)|https://github.com/ververica/frocksdb/commit/f24393bdc8d44b79a9be7a58044e5fd01cf50df7]+|cf0acdc08fb1b8397ef29f3b7dc7e0400107555e|#9405|Already
>  in *FrocksDB-8.10.0*|
> |+[Upgrade to Ubuntu 20.04 in our CircleCI 
> config|https://github.com/ververica/frocksdb/commit/1fecfda040745fc508a0ea0bcbb98c970f89ee3e]+|1fecfda040745fc508a0ea0bcbb98c970f89ee3e|
>  |Fix in 
> [FLINK-35577|https://github.com/facebook/rocksdb/pull/9481/files#diff-78a8a19706dbd2a4425dd72bdab0502ed7a2cef16365ab7030a5a0588927bf47]
> fixed in 
> https://github.com/facebook/rocksdb/pull/9481/files#diff-78a8a19706dbd2a4425dd72bdab0502ed7a2cef16365ab7030a5a0588927bf47|
> |[Disable useless broken tests due to ci-image 
> upgraded|https://github.com/ververica/frocksdb/commit/9fef987e988c53a33b7807b85a56305bd9dede81]|9fef987e988c53a33b7807b85a56305bd9dede81|
>  |Fix in FLINK-35577|
> |[[hotfix] Use zlib's fossils page to replace 
> web.archive|https://github.com/ververica/frocksdb/commit/cbc35db93f312f54b49804177ca11dea44b4d98e]|cbc35db93f312f54b49804177ca11dea44b4d98e|8fff7bb9947f9036021f99e3463c9657e80b71ae|Already
>  in *FrocksDB-8.10.0*|
> |+[[hotfix] Change the resource request when running 
> CI|https://github.com/ververica/frocksdb/commit/2ec1019fd0433cb8ea5365b58faa2262ea0014e9]+|2ec1019fd0433cb8ea5365b58faa2262ea0014e9|174639cf1e6080a8f8f37aec132b3a500428f913|Already
>  in *FrocksDB-8.10.0*|
> |{+}[[FLINK-30321] Upgrade ZLIB of FRocksDB to 1.2.13 
> (|https://github.com/ververica/frocksdb/commit/3eac409606fcd9ce44a4bf7686db29c06c205039]{+}[#56|https://github.com/ververica/frocksdb/pull/56]
>  
> [)|https://github.com/ververica/frocksdb/commit

[jira] [Closed] (FLINK-35574) Setup base branch for FrocksDB-8.10

2024-09-26 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan closed FLINK-35574.
---
Fix Version/s: 2.0-preview
   (was: 2.0.0)
   Resolution: Fixed

> Setup base branch for FrocksDB-8.10
> ---
>
> Key: FLINK-35574
> URL: https://issues.apache.org/jira/browse/FLINK-35574
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Affects Versions: 2.0.0
>Reporter: Yue Ma
>Assignee: Yue Ma
>Priority: Major
> Fix For: 2.0-preview
>
>
> As the first part of FLINK-35573, we need to prepare a base branch for 
> FRocksDB-8.10.0 first. Mainly, it needs to be checked out from version 8.10.0 
> of the Rocksdb community. Then check pick the commit which used by Flink from 
> FRocksDB-6.20.3 to 8.10.0
> *Details:*
> |*JIRA*|*FrocksDB-6.20.3*|*Commit ID in FrocksDB-8.10.0*|*Plan*|
> |[[FLINK-10471] Add Apache Flink specific compaction filter to evict expired 
> state which has 
> time-to-live|https://github.com/ververica/frocksdb/commit/3da8249d50c8a3a6ea229f43890d37e098372786]|3da8249d50c8a3a6ea229f43890d37e098372786|d606c9450bef7d2a22c794f406d7940d9d2f29a4|Already
>  in *FrocksDB-8.10.0*|
> |+[[FLINK-19710] Revert implementation of PerfContext back to __thread to 
> avoid performance 
> regression|https://github.com/ververica/frocksdb/commit/d6f50f33064f1d24480dfb3c586a7bd7a7dbac01]+|d6f50f33064f1d24480dfb3c586a7bd7a7dbac01|
>  |Fix in FLINK-35575|
> |[FRocksDB release guide and helping 
> scripts|https://github.com/ververica/frocksdb/commit/2673de8e5460af8d23c0c7e1fb0c3258ea283419]|2673de8e5460af8d23c0c7e1fb0c3258ea283419|b58ba05a380d9bf0c223bc707f14897ce392ce1b|Already
>  in *FrocksDB-8.10.0*|
> |+[Add content related to ARM building in the FROCKSDB-RELEASE 
> documentation|https://github.com/ververica/frocksdb/commit/ec27ca01db5ff579dd7db1f70cf3a4677b63d589]+|ec27ca01db5ff579dd7db1f70cf3a4677b63d589|6cae002662a45131a0cd90dd84f5d3d3cb958713|Already
>  in *FrocksDB-8.10.0*|
> |[[FLINK-23756] Update FrocksDB release document with more 
> info|https://github.com/ververica/frocksdb/commit/f75e983045f4b64958dc0e93e8b94a7cfd7663be]|f75e983045f4b64958dc0e93e8b94a7cfd7663be|bac6aeb6e012e19d9d5e3a5ee22b84c1e4a1559c|Already
>  in *FrocksDB-8.10.0*|
> |[Add support for Apple Silicon to RocksJava 
> (#9254)|https://github.com/ververica/frocksdb/commit/dac2c60bc31b596f445d769929abed292878cac1]|dac2c60bc31b596f445d769929abed292878cac1|#9254|Already
>  in *FrocksDB-8.10.0*|
> |[Fix RocksJava releases for macOS 
> (#9662)|https://github.com/ververica/frocksdb/commit/22637e11968a627a06a3ac8aa78126e3ae6d1368]|22637e11968a627a06a3ac8aa78126e3ae6d1368|#9662|Already
>  in *FrocksDB-8.10.0*|
> |+[Fix clang13 build error 
> (#9374)|https://github.com/ververica/frocksdb/commit/a20fb9fa96af7b18015754cf44463e22fc123222]+|a20fb9fa96af7b18015754cf44463e22fc123222|#9374|Already
>  in *FrocksDB-8.10.0*|
> |+[[hotfix] Resolve brken make 
> format|https://github.com/ververica/frocksdb/commit/cf0acdc08fb1b8397ef29f3b7dc7e0400107555e]+|7a87e0bf4d59cc48f40ce69cf7b82237c5e8170c|
>  |Already in *FrocksDB-8.10.0*|
> |+[Update circleci xcode version 
> (#9405)|https://github.com/ververica/frocksdb/commit/f24393bdc8d44b79a9be7a58044e5fd01cf50df7]+|cf0acdc08fb1b8397ef29f3b7dc7e0400107555e|#9405|Already
>  in *FrocksDB-8.10.0*|
> |+[Upgrade to Ubuntu 20.04 in our CircleCI 
> config|https://github.com/ververica/frocksdb/commit/1fecfda040745fc508a0ea0bcbb98c970f89ee3e]+|1fecfda040745fc508a0ea0bcbb98c970f89ee3e|
>  |Fix in 
> [FLINK-35577|https://github.com/facebook/rocksdb/pull/9481/files#diff-78a8a19706dbd2a4425dd72bdab0502ed7a2cef16365ab7030a5a0588927bf47]
> fixed in 
> https://github.com/facebook/rocksdb/pull/9481/files#diff-78a8a19706dbd2a4425dd72bdab0502ed7a2cef16365ab7030a5a0588927bf47|
> |[Disable useless broken tests due to ci-image 
> upgraded|https://github.com/ververica/frocksdb/commit/9fef987e988c53a33b7807b85a56305bd9dede81]|9fef987e988c53a33b7807b85a56305bd9dede81|
>  |Fix in FLINK-35577|
> |[[hotfix] Use zlib's fossils page to replace 
> web.archive|https://github.com/ververica/frocksdb/commit/cbc35db93f312f54b49804177ca11dea44b4d98e]|cbc35db93f312f54b49804177ca11dea44b4d98e|8fff7bb9947f9036021f99e3463c9657e80b71ae|Already
>  in *FrocksDB-8.10.0*|
> |+[[hotfix] Change the resource request when running 
> CI|https://github.com/ververica/frocksdb/commit/2ec1019fd0433cb8ea5365b58faa2262ea0014e9]+|2ec1019fd0433cb8ea5365b58faa2262ea0014e9|174639cf1e6080a8f8f37aec132b3a500428f913|Already
>  in *FrocksDB-8.10.0*|
> |{+}[[FLINK-30321] Upgrade ZLIB of FRocksDB to 1.2.13 
> (|https://github.com/ververica/frocksdb/commit/3eac409606fcd9ce44a4bf7686db29c06c205039]{+}[#56|https://github.com/ververica/frocksdb/pull/56]
>  
> [)|https://github.com/ververica/fro

[jira] [Created] (FLINK-36376) More friendly error or warn message for misconfigured statebackend with async state processing

2024-09-26 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36376:
---

 Summary: More friendly error or warn message for misconfigured 
statebackend with async state processing
 Key: FLINK-36376
 URL: https://issues.apache.org/jira/browse/FLINK-36376
 Project: Flink
  Issue Type: Sub-task
Reporter: Zakelly Lan
Assignee: Zakelly Lan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-36374) Bundle forst statebackend in flink-dist and provide shortcut to enable

2024-09-25 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-36374:
---

Assignee: Zakelly Lan

> Bundle forst statebackend in flink-dist and provide shortcut to enable
> --
>
> Key: FLINK-36374
> URL: https://issues.apache.org/jira/browse/FLINK-36374
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>
> Currently, the forst statebackend are built under flink-statebackend-forst, 
> but is not included in the flink-dist jar. It is better to provide a same 
> distribution way like rocksdb.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36374) Bundle forst statebackend in flink-dist and provide shortcut to enable

2024-09-25 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36374:
---

 Summary: Bundle forst statebackend in flink-dist and provide 
shortcut to enable
 Key: FLINK-36374
 URL: https://issues.apache.org/jira/browse/FLINK-36374
 Project: Flink
  Issue Type: Sub-task
Reporter: Zakelly Lan


Currently, the forst statebackend are built under flink-statebackend-forst, but 
is not included in the flink-dist jar. It is better to provide a same 
distribution way like rocksdb.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36364) Do not reuse serialized key in Forst map state and/or other namespaces

2024-09-25 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36364.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> Do not reuse serialized key in Forst map state and/or other namespaces
> --
>
> Key: FLINK-36364
> URL: https://issues.apache.org/jira/browse/FLINK-36364
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-35928) ForSt supports compiling with RocksDB

2024-09-25 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-35928:
---

Assignee: Yanfei Lei  (was: Hangxiang Yu)

> ForSt supports compiling with RocksDB
> -
>
> Key: FLINK-35928
> URL: https://issues.apache.org/jira/browse/FLINK-35928
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Hangxiang Yu
>Assignee: Yanfei Lei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36364) Do not reuse serialized key in Forst map state and/or other namespaces

2024-09-25 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884842#comment-17884842
 ] 

Zakelly Lan commented on FLINK-36364:
-

Merge 82582b3a7b75b7ffae9d48895088189b1324caa3 into master

> Do not reuse serialized key in Forst map state and/or other namespaces
> --
>
> Key: FLINK-36364
> URL: https://issues.apache.org/jira/browse/FLINK-36364
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36364) Do not reuse serialized key in Forst map state and/or other namespaces

2024-09-24 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36364:
---

 Summary: Do not reuse serialized key in Forst map state and/or 
other namespaces
 Key: FLINK-36364
 URL: https://issues.apache.org/jira/browse/FLINK-36364
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / State Backends
Reporter: Zakelly Lan
Assignee: Zakelly Lan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-36331) Support multiGet when DB all in local

2024-09-24 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-36331:
---

Assignee: Yanfei Lei

> Support multiGet when DB all in local
> -
>
> Key: FLINK-36331
> URL: https://issues.apache.org/jira/browse/FLINK-36331
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Yanfei Lei
>Assignee: Yanfei Lei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36274) Remove legacy ExternalizedCheckpointCleanup

2024-09-24 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36274.
-
Resolution: Fixed

> Remove legacy ExternalizedCheckpointCleanup
> ---
>
> Key: FLINK-36274
> URL: https://issues.apache.org/jira/browse/FLINK-36274
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>
> The ExternalizedCheckpointCleanup is deprecated and moved to 
> ExternalizedCheckpointRetention. We might remove 
> ExternalizedCheckpointCleanup in flink 2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-34511) Flink 2.0: Remove legacy State&Checkpointing&Recovery options

2024-09-24 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-34511.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> Flink 2.0: Remove legacy State&Checkpointing&Recovery options
> -
>
> Key: FLINK-34511
> URL: https://issues.apache.org/jira/browse/FLINK-34511
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36274) Remove legacy ExternalizedCheckpointCleanup

2024-09-24 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884289#comment-17884289
 ] 

Zakelly Lan commented on FLINK-36274:
-

Merge 5d463de1 into master

> Remove legacy ExternalizedCheckpointCleanup
> ---
>
> Key: FLINK-36274
> URL: https://issues.apache.org/jira/browse/FLINK-36274
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>
> The ExternalizedCheckpointCleanup is deprecated and moved to 
> ExternalizedCheckpointRetention. We might remove 
> ExternalizedCheckpointCleanup in flink 2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-35411) Optimize buffer triggering of async state requests

2024-09-23 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-35411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884087#comment-17884087
 ] 

Zakelly Lan commented on FLINK-35411:
-

Merge 475a751e...8bb2f4d7 into master

> Optimize buffer triggering of async state requests
> --
>
> Key: FLINK-35411
> URL: https://issues.apache.org/jira/browse/FLINK-35411
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends, Runtime / Task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>
> Currently during draining of async state requests, the task thread performs 
> {{Thread.sleep}} to avoid cpu overhead when polling mails. This can be 
> optimized by wait & notify.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36338) Properly handle KeyContext when using AsyncKeyedStateBackendAdaptor

2024-09-23 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36338.
-
Resolution: Fixed

> Properly handle KeyContext when using AsyncKeyedStateBackendAdaptor
> ---
>
> Key: FLINK-36338
> URL: https://issues.apache.org/jira/browse/FLINK-36338
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>
> After FLINK-36117, we port old state backends implementation to new api using 
> AsyncKeyedStateBackendAdaptor, but it cannot work because the KeyContext is 
> not properly handled, which should be fixed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36338) Properly handle KeyContext when using AsyncKeyedStateBackendAdaptor

2024-09-23 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883843#comment-17883843
 ] 

Zakelly Lan commented on FLINK-36338:
-

Merged b14ab764a0adef68334f02dd6f52e9aca61c2477 into master

> Properly handle KeyContext when using AsyncKeyedStateBackendAdaptor
> ---
>
> Key: FLINK-36338
> URL: https://issues.apache.org/jira/browse/FLINK-36338
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>
> After FLINK-36117, we port old state backends implementation to new api using 
> AsyncKeyedStateBackendAdaptor, but it cannot work because the KeyContext is 
> not properly handled, which should be fixed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-36338) Properly handle KeyContext when using AsyncKeyedStateBackendAdaptor

2024-09-20 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan updated FLINK-36338:

Summary: Properly handle KeyContext when using 
AsyncKeyedStateBackendAdaptor  (was: Correct handle KeyContext when using 
AsyncKeyedStateBackendAdaptor)

> Properly handle KeyContext when using AsyncKeyedStateBackendAdaptor
> ---
>
> Key: FLINK-36338
> URL: https://issues.apache.org/jira/browse/FLINK-36338
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>
> After FLINK-36117, we port old state backends implementation to new api using 
> AsyncKeyedStateBackendAdaptor, but it cannot work because the KeyContext is 
> not properly handled, which should be fixed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36338) Correct handle KeyContext when using AsyncKeyedStateBackendAdaptor

2024-09-20 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36338:
---

 Summary: Correct handle KeyContext when using 
AsyncKeyedStateBackendAdaptor
 Key: FLINK-36338
 URL: https://issues.apache.org/jira/browse/FLINK-36338
 Project: Flink
  Issue Type: Sub-task
Reporter: Zakelly Lan


After FLINK-36117, we port old state backends implementation to new api using 
AsyncKeyedStateBackendAdaptor, but it cannot work because the KeyContext is not 
properly handled, which should be fixed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-36338) Correct handle KeyContext when using AsyncKeyedStateBackendAdaptor

2024-09-20 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-36338:
---

Assignee: Zakelly Lan

> Correct handle KeyContext when using AsyncKeyedStateBackendAdaptor
> --
>
> Key: FLINK-36338
> URL: https://issues.apache.org/jira/browse/FLINK-36338
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>
> After FLINK-36117, we port old state backends implementation to new api using 
> AsyncKeyedStateBackendAdaptor, but it cannot work because the KeyContext is 
> not properly handled, which should be fixed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36117) Implement AsyncKeyedStateBackend for RocksDBKeyedStateBackend and HeapKeyedStateBackend.

2024-09-19 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882974#comment-17882974
 ] 

Zakelly Lan commented on FLINK-36117:
-

Merged b20eebb ... e81da99  into master

> Implement AsyncKeyedStateBackend for RocksDBKeyedStateBackend and 
> HeapKeyedStateBackend.
> 
>
> Key: FLINK-36117
> URL: https://issues.apache.org/jira/browse/FLINK-36117
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: wang qian
>Assignee: wang qian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-35412) Batch execution of async state request callback

2024-09-19 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-35412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882970#comment-17882970
 ] 

Zakelly Lan commented on FLINK-35412:
-

Merge 9a831ebfec68ad7b257c25416e790bcba5965df1 into master

> Batch execution of async state request callback
> ---
>
> Key: FLINK-35412
> URL: https://issues.apache.org/jira/browse/FLINK-35412
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends, Runtime / Task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>
> There is one mail for each callback when async state result returns. One 
> possible optimization is to encapsulate multiple callbacks into one mail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-35411) Optimize buffer triggering of async state requests

2024-09-19 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan updated FLINK-35411:

Summary: Optimize buffer triggering of async state requests  (was: Optimize 
wait logic in draining of async state requests)

> Optimize buffer triggering of async state requests
> --
>
> Key: FLINK-35411
> URL: https://issues.apache.org/jira/browse/FLINK-35411
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends, Runtime / Task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>
> Currently during draining of async state requests, the task thread performs 
> {{Thread.sleep}} to avoid cpu overhead when polling mails. This can be 
> optimized by wait & notify.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-35411) Optimize wait logic in draining of async state requests

2024-09-19 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-35411:
---

Assignee: Zakelly Lan  (was: Yanfei Lei)

> Optimize wait logic in draining of async state requests
> ---
>
> Key: FLINK-35411
> URL: https://issues.apache.org/jira/browse/FLINK-35411
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends, Runtime / Task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>
> Currently during draining of async state requests, the task thread performs 
> {{Thread.sleep}} to avoid cpu overhead when polling mails. This can be 
> optimized by wait & notify.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-35412) Batch execution of async state request callback

2024-09-19 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-35412.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> Batch execution of async state request callback
> ---
>
> Key: FLINK-35412
> URL: https://issues.apache.org/jira/browse/FLINK-35412
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends, Runtime / Task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> There is one mail for each callback when async state result returns. One 
> possible optimization is to encapsulate multiple callbacks into one mail.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-35510) Implement basic incremental checkpoint for ForStStateBackend

2024-09-19 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan updated FLINK-35510:

Fix Version/s: 2.0-preview

> Implement basic incremental checkpoint for ForStStateBackend
> 
>
> Key: FLINK-35510
> URL: https://issues.apache.org/jira/browse/FLINK-35510
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Feifan Wang
>Assignee: Feifan Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> Use low DB api implement a basic incremental checkpoint for 
> ForStStatebackend, follow steps:
>  # db.disableFileDeletions()
>  # db.getLiveFiles(true)
>  # db.entableFileDeletes(false)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-35510) Implement basic incremental checkpoint for ForStStateBackend

2024-09-18 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-35510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882912#comment-17882912
 ] 

Zakelly Lan commented on FLINK-35510:
-

Merged ffd05220ffe829337ac12a49f1bbc822fa11f8fe into master

> Implement basic incremental checkpoint for ForStStateBackend
> 
>
> Key: FLINK-35510
> URL: https://issues.apache.org/jira/browse/FLINK-35510
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Feifan Wang
>Priority: Major
>  Labels: pull-request-available
>
> Use low DB api implement a basic incremental checkpoint for 
> ForStStatebackend, follow steps:
>  # db.disableFileDeletions()
>  # db.getLiveFiles(true)
>  # db.entableFileDeletes(false)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-35510) Implement basic incremental checkpoint for ForStStateBackend

2024-09-18 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-35510.
-
  Assignee: Feifan Wang
Resolution: Fixed

> Implement basic incremental checkpoint for ForStStateBackend
> 
>
> Key: FLINK-35510
> URL: https://issues.apache.org/jira/browse/FLINK-35510
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Feifan Wang
>Assignee: Feifan Wang
>Priority: Major
>  Labels: pull-request-available
>
> Use low DB api implement a basic incremental checkpoint for 
> ForStStatebackend, follow steps:
>  # db.disableFileDeletions()
>  # db.getLiveFiles(true)
>  # db.entableFileDeletes(false)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36323) Remove deprecated MemoryStateBackend and RocksDBStateBackend

2024-09-18 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36323:
---

 Summary: Remove deprecated MemoryStateBackend and 
RocksDBStateBackend
 Key: FLINK-36323
 URL: https://issues.apache.org/jira/browse/FLINK-36323
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / State Backends
Reporter: Zakelly Lan


After FLINK-19463, the checkpoint storage has been split from the state 
backends. The deprecated implementation {{MemoryStateBackend}} and 
{{RocksDBStateBackend}} are replaced by {{HashMapStateBackend}} and 
{{EmbeddedRocksDBStateBackend}}. Those two state backends should be removed in 
Flink 2.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36322) Fix compile error of flink benchmark caused by breaking changes

2024-09-18 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36322:
---

 Summary: Fix compile error of flink benchmark caused by breaking 
changes
 Key: FLINK-36322
 URL: https://issues.apache.org/jira/browse/FLINK-36322
 Project: Flink
  Issue Type: Technical Debt
  Components: Benchmarks
Reporter: Zakelly Lan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36274) Remove legacy ExternalizedCheckpointCleanup

2024-09-17 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882567#comment-17882567
 ] 

Zakelly Lan commented on FLINK-36274:
-

I'm taking this since we have a pretty tight schedule before 2.0-preview

> Remove legacy ExternalizedCheckpointCleanup
> ---
>
> Key: FLINK-36274
> URL: https://issues.apache.org/jira/browse/FLINK-36274
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Priority: Major
>
> The ExternalizedCheckpointCleanup is deprecated and moved to 
> ExternalizedCheckpointRetention. We might remove 
> ExternalizedCheckpointCleanup in flink 2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-36274) Remove legacy ExternalizedCheckpointCleanup

2024-09-17 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-36274:
---

Assignee: Zakelly Lan

> Remove legacy ExternalizedCheckpointCleanup
> ---
>
> Key: FLINK-36274
> URL: https://issues.apache.org/jira/browse/FLINK-36274
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>
> The ExternalizedCheckpointCleanup is deprecated and moved to 
> ExternalizedCheckpointRetention. We might remove 
> ExternalizedCheckpointCleanup in flink 2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-35780) Support state migration between disabling and enabling ttl in RocksDBKeyedStateBackend

2024-09-17 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-35780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882559#comment-17882559
 ] 

Zakelly Lan commented on FLINK-35780:
-

[~xiangyu0xf] Sure, looking into this

> Support state migration between disabling and enabling ttl in 
> RocksDBKeyedStateBackend
> --
>
> Key: FLINK-35780
> URL: https://issues.apache.org/jira/browse/FLINK-35780
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-35780) Support state migration between disabling and enabling ttl in RocksDBKeyedStateBackend

2024-09-17 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-35780:
---

Assignee: xiangyu feng

> Support state migration between disabling and enabling ttl in 
> RocksDBKeyedStateBackend
> --
>
> Key: FLINK-35780
> URL: https://issues.apache.org/jira/browse/FLINK-35780
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: xiangyu feng
>Assignee: xiangyu feng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36166) testJoinDisorderChangeLog failed on AZP

2024-09-17 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17882556#comment-17882556
 ] 

Zakelly Lan commented on FLINK-36166:
-

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62182&view=logs&j=0c940707-2659-5648-cbe6-a1ad63045f0a&t=075c2716-8010-5565-fe08-3c4bb45824a4

> testJoinDisorderChangeLog failed on AZP
> ---
>
> Key: FLINK-36166
> URL: https://issues.apache.org/jira/browse/FLINK-36166
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / Planner
>Affects Versions: 2.0-preview
>Reporter: Weijie Guo
>Priority: Critical
>  Labels: test-stability
>
> {code:java}
> Aug 23 03:59:33 03:59:33.972 [ERROR] Failures: 
> Aug 23 03:59:33 03:59:33.972 [ERROR] 
> org.apache.flink.table.planner.runtime.stream.sql.TableSinkITCase.testJoinDisorderChangeLog
> Aug 23 03:59:33 03:59:33.972 [ERROR]   Run 1: 
> TableSinkITCase.testJoinDisorderChangeLog:119 
> Aug 23 03:59:33 expected: List(+I[jason, 4, 22.5, 22])
> Aug 23 03:59:33  but was: ArrayBuffer(+U[jason, 4, 22.5, 22])
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=61572&view=logs&j=0c940707-2659-5648-cbe6-a1ad63045f0a&t=075c2716-8010-5565-fe08-3c4bb45824a4&l=12536



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36274) Remove legacy ExternalizedCheckpointCleanup

2024-09-12 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36274:
---

 Summary: Remove legacy ExternalizedCheckpointCleanup
 Key: FLINK-36274
 URL: https://issues.apache.org/jira/browse/FLINK-36274
 Project: Flink
  Issue Type: Sub-task
Reporter: Zakelly Lan


The ExternalizedCheckpointCleanup is deprecated and moved to 
ExternalizedCheckpointRetention. We might remove ExternalizedCheckpointCleanup 
in flink 2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36274) Remove legacy ExternalizedCheckpointCleanup

2024-09-12 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881461#comment-17881461
 ] 

Zakelly Lan commented on FLINK-36274:
-

[~spoon-lz] Would you spare some time working on this?

> Remove legacy ExternalizedCheckpointCleanup
> ---
>
> Key: FLINK-36274
> URL: https://issues.apache.org/jira/browse/FLINK-36274
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Priority: Major
>
> The ExternalizedCheckpointCleanup is deprecated and moved to 
> ExternalizedCheckpointRetention. We might remove 
> ExternalizedCheckpointCleanup in flink 2.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-35027) Implement checkpoint drain in AsyncExecutionController

2024-09-12 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-35027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881448#comment-17881448
 ] 

Zakelly Lan commented on FLINK-35027:
-

Merged into master via e94d433dd0b242e68eb2eb2a4b2546c2872e0776

> Implement checkpoint drain in AsyncExecutionController
> --
>
> Key: FLINK-35027
> URL: https://issues.apache.org/jira/browse/FLINK-35027
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Task
>Reporter: Yanfei Lei
>Assignee: Yanfei Lei
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-34511) Flink 2.0: Remove legacy State&Checkpointing&Recovery options

2024-09-11 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan updated FLINK-34511:

Summary: Flink 2.0: Remove legacy State&Checkpointing&Recovery options  
(was: Flink 2.0: Deprecate legacy State&Checkpointing&Recovery options)

> Flink 2.0: Remove legacy State&Checkpointing&Recovery options
> -
>
> Key: FLINK-34511
> URL: https://issues.apache.org/jira/browse/FLINK-34511
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (FLINK-34255) FLIP-406: Reorganize State & Checkpointing & Recovery Configuration

2024-09-11 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-34255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan updated FLINK-34255:

Fix Version/s: 2.0-preview
   (was: 2.0.0)

> FLIP-406: Reorganize State & Checkpointing & Recovery Configuration
> ---
>
> Key: FLINK-34255
> URL: https://issues.apache.org/jira/browse/FLINK-34255
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
> Fix For: 1.20.0, 2.0-preview
>
>
> The FLIP: 
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=284789560
>  
> Currently, the configuration options pertaining to checkpointing, recovery, 
> and state management are primarily grouped under the following prefixes:
>  * *state.backend.** : configurations related to state accessing and 
> checkpointing, as well as specific options for individual state backends
>  * *execution.checkpointing.** : configurations associated with checkpoint 
> execution and recovery
>  * {*}execution.savepoint.*{*}: configurations for recovery from savepoint
> In addition, there are several individual options such as 
> _{{state.checkpoint-storage}}_ and _{{state.checkpoints.dir}}_ that fall 
> outside of these prefixes. The current arrangement of these options, which 
> span multiple modules, is somewhat haphazard and lacks a systematic 
> structure. For example, the options under the {{_CheckpointingOptions_ }}and 
> {{_ExecutionCheckpointingOptions_ }}are related and have no clear boundaries 
> from the user's perspective, but there is no unified prefix for them. With 
> the upcoming release of Flink 2.0, we have an excellent opportunity to 
> overhaul and restructure the configurations related to checkpointing, 
> recovery, and state management. This FLIP proposes to reorganize these 
> settings, making it more coherent by module, which would significantly lower 
> the barriers for understanding and reduce the development costs moving 
> forward.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-34511) Flink 2.0: Remove legacy State&Checkpointing&Recovery options

2024-09-11 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-34511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan reassigned FLINK-34511:
---

Assignee: Zakelly Lan

> Flink 2.0: Remove legacy State&Checkpointing&Recovery options
> -
>
> Key: FLINK-34511
> URL: https://issues.apache.org/jira/browse/FLINK-34511
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-36246) Move async state related operators to flink-runtime

2024-09-10 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36246.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> Move async state related operators to flink-runtime
> ---
>
> Key: FLINK-36246
> URL: https://issues.apache.org/jira/browse/FLINK-36246
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> After FLINK-36063, all operators are moved to flink-runtime. We should move 
> the async state related ones as well



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36246) Move async state related operators to flink-runtime

2024-09-10 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880858#comment-17880858
 ] 

Zakelly Lan commented on FLINK-36246:
-

merged into master via 1d5b214eb681229bfb78f49416d78df100f887d4

> Move async state related operators to flink-runtime
> ---
>
> Key: FLINK-36246
> URL: https://issues.apache.org/jira/browse/FLINK-36246
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>
> After FLINK-36063, all operators are moved to flink-runtime. We should move 
> the async state related ones as well



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-35972) Directly specify key and process runnable in async state execution model

2024-09-10 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-35972.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> Directly specify key and process runnable in async state execution model
> 
>
> Key: FLINK-35972
> URL: https://issues.apache.org/jira/browse/FLINK-35972
> Project: Flink
>  Issue Type: Sub-task
>Reporter: Zakelly Lan
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>
> In asynchronous execution model (FLIP-425) for state, the key is managed by 
> the framework and automatically assigned before {{processElement}} or 
> callback running. However in some cases, there should be an internal 
> interface that allows to switch key and start a new process with this key. It 
> is useful when implementing something like mini-batch, which triggers a batch 
> of real logic for gathered elements at one time instead of the 
> {{processElement}} for each element.
> Additionally, the processing invoked by this interface should have a higher 
> priority than the normal input, as the semantics of this processing method is 
> to continue processing the previously half-processed data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Created] (FLINK-36244) Remove Queryable State

2024-09-09 Thread Zakelly Lan (Jira)

Zakelly Lan created FLINK-36244:
---

 Summary: Remove Queryable State
 Key: FLINK-36244
 URL: https://issues.apache.org/jira/browse/FLINK-36244
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Queryable State
Reporter: Zakelly Lan
Assignee: Zakelly Lan


Queryable State is deprecated in 1.18, and better be removed in 2.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-35389) Implement List Async State API for ForStStateBackend

2024-09-03 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-35389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878789#comment-17878789
 ] 

Zakelly Lan commented on FLINK-35389:
-

Merged into master via c58a6ebbae50e95c59ad5cdcc7c2f2023c5d38f3

> Implement List Async State API for ForStStateBackend
> 
>
> Key: FLINK-35389
> URL: https://issues.apache.org/jira/browse/FLINK-35389
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Hangxiang Yu
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Resolved] (FLINK-35389) Implement List Async State API for ForStStateBackend

2024-09-03 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-35389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-35389.
-
Fix Version/s: 2.0-preview
   Resolution: Fixed

> Implement List Async State API for ForStStateBackend
> 
>
> Key: FLINK-35389
> URL: https://issues.apache.org/jira/browse/FLINK-35389
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Hangxiang Yu
>Assignee: Zakelly Lan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.0-preview
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-36125) File not found exception on restoring state handles with file merging

2024-09-03 Thread Zakelly Lan (Jira)



[ 
https://issues.apache.org/jira/browse/FLINK-36125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878785#comment-17878785
 ] 

Zakelly Lan commented on FLINK-36125:
-

The concurrency issue merged into master via: 
2c16d884abdd2f33b86cb313775fbc0cc158a4be
1.20 via: b234229ca1a03f3c0c8b5176d267415309371cb4


Feel free to reopen this if there's still a problem.

> File not found exception on restoring state handles with file merging
> -
>
> Key: FLINK-36125
> URL: https://issues.apache.org/jira/browse/FLINK-36125
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.20.0
>Reporter: Burak Ozakinci
>Assignee: Zakelly Lan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 2.0.0, 1.20.1
>
> Attachments: app_code.txt
>
>
> 1.20 app with file merging with across checkpoints option enabled.
> {*}execution.checkpointing.file-merging.enabled{*}: true
> {*}execution.checkpointing.file-merging.across-checkpoint-boundary{*}: true
> App uses Zookeeper for HA and S3 for HDFS with RocksDB state backend.
> Summary of events:
>  * App started to fail and restarted due to Zookeeper connection failure
>  * It tried to use the previous directory with the log below while restoring
>  * 
> {code:java}
> Reusing previous directory 
> s3:.../taskowned/job_6676b9fb581c449ea78d95efdd338ee1_tm_10.99.195.1-6122-34e093
>  for checkpoint file-merging. {code}
>  
>  * FileMergingSnapshotManager could not find the checkpoint file under 
> `taskowned` directory and the app started to failover.
>  
>  
> {code:java}
> java.io.FileNotFoundException: No such file or directory: 
> s3:.../taskowned/job_6676b9fb581c449ea78d95efdd338ee1_tm_10.99.196.233-6122-acb0af/5ce4c69f-f02a-4f91-a656-9abdfa9d47fd
>  at 
> org.apache.flink.runtime.checkpoint.filemerging.FileMergingSnapshotManagerBase.lambda$restoreStateHandles$11(FileMergingSnapshotManagerBase.java:880)
> at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1134) 
>   at 
> org.apache.flink.runtime.checkpoint.filemerging.FileMergingSnapshotManagerBase.lambda$restoreStateHandles$12(FileMergingSnapshotManagerBase.java:861)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>   at 
> java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
>   at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
>   at 
> java.base/java.util.stream.Streams$StreamBuilderImpl.forEachRemaining(Streams.java:411)
>   at 
> java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658)
> at 
> java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
>   at 
> java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>   at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>   at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
>  at 
> java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
>   at java.base/java.util.ArrayList$Itr.forEachRemaining(ArrayList.java:1033)  
> at 
> java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
> at 
> java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734)
>   at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>   at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>   at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
>  at 
> jav

[jira] [Resolved] (FLINK-36125) File not found exception on restoring state handles with file merging

2024-09-03 Thread Zakelly Lan (Jira)



 [ 
https://issues.apache.org/jira/browse/FLINK-36125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zakelly Lan resolved FLINK-36125.
-
Fix Version/s: 2.0-preview
   (was: 2.0.0)
   Resolution: Fixed

> File not found exception on restoring state handles with file merging
> -
>
> Key: FLINK-36125
> URL: https://issues.apache.org/jira/browse/FLINK-36125
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.20.0
>Reporter: Burak Ozakinci
>Assignee: Zakelly Lan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.20.1, 2.0-preview
>
> Attachments: app_code.txt
>
>
> 1.20 app with file merging with across checkpoints option enabled.
> {*}execution.checkpointing.file-merging.enabled{*}: true
> {*}execution.checkpointing.file-merging.across-checkpoint-boundary{*}: true
> App uses Zookeeper for HA and S3 for HDFS with RocksDB state backend.
> Summary of events:
>  * App started to fail and restarted due to Zookeeper connection failure
>  * It tried to use the previous directory with the log below while restoring
>  * 
> {code:java}
> Reusing previous directory 
> s3:.../taskowned/job_6676b9fb581c449ea78d95efdd338ee1_tm_10.99.195.1-6122-34e093
>  for checkpoint file-merging. {code}
>  
>  * FileMergingSnapshotManager could not find the checkpoint file under 
> `taskowned` directory and the app started to failover.
>  
>  
> {code:java}
> java.io.FileNotFoundException: No such file or directory: 
> s3:.../taskowned/job_6676b9fb581c449ea78d95efdd338ee1_tm_10.99.196.233-6122-acb0af/5ce4c69f-f02a-4f91-a656-9abdfa9d47fd
>  at 
> org.apache.flink.runtime.checkpoint.filemerging.FileMergingSnapshotManagerBase.lambda$restoreStateHandles$11(FileMergingSnapshotManagerBase.java:880)
> at java.base/java.util.HashMap.computeIfAbsent(HashMap.java:1134) 
>   at 
> org.apache.flink.runtime.checkpoint.filemerging.FileMergingSnapshotManagerBase.lambda$restoreStateHandles$12(FileMergingSnapshotManagerBase.java:861)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
>   at 
> java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
>   at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
> at 
> java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
>   at 
> java.base/java.util.stream.Streams$StreamBuilderImpl.forEachRemaining(Streams.java:411)
>   at 
> java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658)
> at 
> java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
>   at 
> java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
> at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>   at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>   at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
>  at 
> java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
>   at java.base/java.util.ArrayList$Itr.forEachRemaining(ArrayList.java:1033)  
> at 
> java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
> at 
> java.base/java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:734)
>   at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
>   at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
>at 
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>   at 
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
> at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:497)
>  at 
> java.base/java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:274)
>   at 
> java.base/java.util.Spliterators$ArraySpliterator.forEachRema

1 2 3 4 5 6 >

1 - 100 of 559 matches

Mail list logo