[jira] [Commented] (FLINK-36389) Fix DelegationTokenReceiverRepository to check if Delegation Token is enabled

2024-09-27 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885261#comment-17885261
 ] 

Luke Chen commented on FLINK-36389:
---

Ah, I see that we also need to check `security.delegation.tokens.enabled`. Thanks.
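
A minimal sketch of the guard being discussed, assuming `security.delegation.tokens.enabled` is consulted before {{loadReceivers()}} runs (class and member names are simplified placeholders, not the actual Flink code):
{code:java}
import java.util.Collections;
import java.util.Map;

class DelegationTokenReceiverRepositorySketch {
    private final Map<String, Object> delegationTokenReceivers;

    DelegationTokenReceiverRepositorySketch(boolean delegationTokensEnabled) {
        // Only discover receivers when security.delegation.tokens.enabled=true,
        // so the duplicate-implementation check never runs on disabled setups.
        this.delegationTokenReceivers =
                delegationTokensEnabled ? loadReceivers() : Collections.emptyMap();
    }

    private Map<String, Object> loadReceivers() {
        // The real repository discovers DelegationTokenReceiver implementations
        // via Flink's service-loading mechanism; elided here.
        return Collections.emptyMap();
    }
}
{code}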

> Fix DelegationTokenReceiverRepository to check if Delegation Token is enabled
> -
>
> Key: FLINK-36389
> URL: https://issues.apache.org/jira/browse/FLINK-36389
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Deployment / YARN
>Affects Versions: 2.0.0
>Reporter: Archit Goyal
>Assignee: Archit Goyal
>Priority: Major
>  Labels: pull-request-available, security
>
> {*}Issue{*}: During the initialization of the 
> {{DelegationTokenReceiverRepository}} in the constructor, the 
> {{loadReceivers()}} method is invoked without checking whether delegation 
> tokens are enabled. This leads to the following error in the TaskManager logs:
> {code:java}
> java.lang.IllegalStateException: Delegation token receiver with service name 
> {} has multiple implementations [hadoopfs]{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36389) Fix DelegationTokenReceiverRepository to check if Delegation Token is enabled

2024-09-26 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885232#comment-17885232
 ] 

Luke Chen commented on FLINK-36389:
---

I think we do check whether delegation tokens are enabled before checking the state:
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/token/DelegationTokenReceiverRepository.java#L70

The error seems to indicate that multiple delegation token receivers are 
registered. Maybe have a look?
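
For context, the reported IllegalStateException comes from a uniqueness check over the discovered receivers. A hedged, simplified sketch of that kind of check ({{Receiver}} is a hypothetical stand-in for Flink's DelegationTokenReceiver; the real logic lives in DelegationTokenReceiverRepository):
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

class ReceiverUniquenessSketch {
    // Hypothetical SPI, standing in for Flink's DelegationTokenReceiver.
    public interface Receiver {
        String serviceName();
    }

    static Map<String, Receiver> loadReceivers() {
        Map<String, Receiver> receivers = new HashMap<>();
        for (Receiver receiver : ServiceLoader.load(Receiver.class)) {
            Receiver previous = receivers.putIfAbsent(receiver.serviceName(), receiver);
            if (previous != null) {
                // Two implementations answered to the same service name, e.g.
                // "hadoopfs" being discovered twice on the classpath.
                throw new IllegalStateException(
                        "Delegation token receiver with service name "
                                + receiver.serviceName()
                                + " has multiple implementations");
            }
        }
        return receivers;
    }
}
{code}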

> Fix DelegationTokenReceiverRepository to check if Delegation Token is enabled
> -
>
> Key: FLINK-36389
> URL: https://issues.apache.org/jira/browse/FLINK-36389
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Deployment / YARN
>Reporter: Archit Goyal
>Priority: Major
>  Labels: security
>
> {*}Issue{*}: During the initialization of the 
> {{DelegationTokenReceiverRepository}} in the constructor, the 
> {{loadReceivers()}} method is invoked without checking whether delegation 
> tokens are enabled. This leads to the following error in the TaskManager logs:
> {code:java}
> java.lang.IllegalStateException: Delegation token receiver with service name 
> {} has multiple implementations [hadoopfs]{code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36293) RocksDBWriteBatchWrapperTest.testAsyncCancellation

2024-09-23 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883720#comment-17883720
 ] 

Luke Chen commented on FLINK-36293:
---

PR: https://github.com/apache/flink/pull/25373

> RocksDBWriteBatchWrapperTest.testAsyncCancellation 
> ---
>
> Key: FLINK-36293
> URL: https://issues.apache.org/jira/browse/FLINK-36293
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Affects Versions: 2.0-preview
>Reporter: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62156&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=11508
> {code}
> Sep 16 02:20:08 02:20:08.194 [ERROR] Tests run: 6, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 0.724 s <<< FAILURE! -- in 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapperTest
> Sep 16 02:20:08 02:20:08.194 [ERROR] 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapperTest.testAsyncCancellation
>  -- Time elapsed: 0.121 s <<< ERROR!
> Sep 16 02:20:08 java.lang.Exception: Unexpected exception, 
> expected<org.apache.flink.runtime.execution.CancelTaskException> but 
> was<java.lang.AssertionError>
> Sep 16 02:20:08 Caused by: java.lang.AssertionError: 
> Sep 16 02:20:08 Expecting actual:
> Sep 16 02:20:08   2
> Sep 16 02:20:08 to be less than:
> Sep 16 02:20:08   2 
> Sep 16 02:20:08   at 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapperTest.testAsyncCancellation(RocksDBWriteBatchWrapperTest.java:98)
> Sep 16 02:20:08   at java.lang.reflect.Method.invoke(Method.java:498)
> Sep 16 02:20:08   Suppressed: 
> org.apache.flink.runtime.execution.CancelTaskException
> Sep 16 02:20:08   at 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapper.ensureNotCancelled(RocksDBWriteBatchWrapper.java:199)
> Sep 16 02:20:08   at 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapper.close(RocksDBWriteBatchWrapper.java:188)
> Sep 16 02:20:08   at 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapperTest.testAsyncCancellation(RocksDBWriteBatchWrapperTest.java:100)
> Sep 16 02:20:08   ... 1 more
> {code}
> This test was added by FLINK-35580



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36292) SplitFetcherManagerTest.testCloseCleansUpPreviouslyClosedFetcher times out

2024-09-21 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883558#comment-17883558
 ] 

Luke Chen commented on FLINK-36292:
---

Opened a PR to remove the timeout first: 
https://github.com/apache/flink/pull/25371
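
For reference, a hedged sketch of the kind of change, assuming the test carried a JUnit 4 per-test timeout (which is what the TestTimedOutException below implies); the test name is taken from the log, but the body and the removed timeout value are illustrative:
{code:java}
import org.junit.Test;

public class SplitFetcherManagerTestSketch {
    // Before: @Test(timeout = 30_000L) -- a slow CI agent blocking in
    // SplitFetcherManager.close() -> awaitTermination() can exceed any fixed
    // budget and fail the run even when the cleanup logic is correct.
    @Test
    public void testCloseCleansUpPreviouslyClosedFetcher() throws Exception {
        // ... create the fetcher manager, close one fetcher, close the manager,
        // and rely on the build-level watchdog to catch genuine hangs ...
    }
}
{code}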

> SplitFetcherManagerTest.testCloseCleansUpPreviouslyClosedFetcher times out
> --
>
> Key: FLINK-36292
> URL: https://issues.apache.org/jira/browse/FLINK-36292
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Common
>Affects Versions: 2.0-preview
>Reporter: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62173&view=logs&j=b6f8a893-8f59-51d5-fe28-fb56a8b0932c&t=095f1730-efbe-5303-c4a3-b5e3696fc4e2&l=10914
> {code}
> Sep 17 01:15:16 01:15:16.318 [ERROR] Tests run: 5, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 32.65 s <<< FAILURE! -- in 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManagerTest
> Sep 17 01:15:16 01:15:16.318 [ERROR] 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManagerTest.testCloseCleansUpPreviouslyClosedFetcher
>  -- Time elapsed: 30.02 s <<< ERROR!
> Sep 17 01:15:16 org.junit.runners.model.TestTimedOutException: test timed out 
> after 30000 milliseconds
> Sep 17 01:15:16 at sun.misc.Unsafe.park(Native Method)
> Sep 17 01:15:16 at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
> Sep 17 01:15:16 at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
> Sep 17 01:15:16 at 
> java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
> Sep 17 01:15:16 at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManager.close(SplitFetcherManager.java:344)
> Sep 17 01:15:16 at 
> org.apache.flink.connector.base.source.reader.fetcher.SplitFetcherManagerTest.testCloseCleansUpPreviouslyClosedFetcher(SplitFetcherManagerTest.java:97)
> Sep 17 01:15:16 at java.lang.reflect.Method.invoke(Method.java:498)
> Sep 17 01:15:16 at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> Sep 17 01:15:16 at java.lang.Thread.run(Thread.java:748)
> {code}
> The test was added by FLINK-35924



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36293) RocksDBWriteBatchWrapperTest.testAsyncCancellation

2024-09-21 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883474#comment-17883474
 ] 

Luke Chen commented on FLINK-36293:
---

Investigating

> RocksDBWriteBatchWrapperTest.testAsyncCancellation 
> ---
>
> Key: FLINK-36293
> URL: https://issues.apache.org/jira/browse/FLINK-36293
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Affects Versions: 2.0-preview
>Reporter: Matthias Pohl
>Priority: Blocker
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62156&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=11508
> {code}
> Sep 16 02:20:08 02:20:08.194 [ERROR] Tests run: 6, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 0.724 s <<< FAILURE! -- in 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapperTest
> Sep 16 02:20:08 02:20:08.194 [ERROR] 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapperTest.testAsyncCancellation
>  -- Time elapsed: 0.121 s <<< ERROR!
> Sep 16 02:20:08 java.lang.Exception: Unexpected exception, 
> expected<org.apache.flink.runtime.execution.CancelTaskException> but 
> was<java.lang.AssertionError>
> Sep 16 02:20:08 Caused by: java.lang.AssertionError: 
> Sep 16 02:20:08 Expecting actual:
> Sep 16 02:20:08   2
> Sep 16 02:20:08 to be less than:
> Sep 16 02:20:08   2 
> Sep 16 02:20:08   at 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapperTest.testAsyncCancellation(RocksDBWriteBatchWrapperTest.java:98)
> Sep 16 02:20:08   at java.lang.reflect.Method.invoke(Method.java:498)
> Sep 16 02:20:08   Suppressed: 
> org.apache.flink.runtime.execution.CancelTaskException
> Sep 16 02:20:08   at 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapper.ensureNotCancelled(RocksDBWriteBatchWrapper.java:199)
> Sep 16 02:20:08   at 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapper.close(RocksDBWriteBatchWrapper.java:188)
> Sep 16 02:20:08   at 
> org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapperTest.testAsyncCancellation(RocksDBWriteBatchWrapperTest.java:100)
> Sep 16 02:20:08   ... 1 more
> {code}
> This test was added by FLINK-35580



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36290) OutOfMemoryError in connect test run

2024-09-20 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17883431#comment-17883431
 ] 

Luke Chen commented on FLINK-36290:
---

I had a look at all the OOM cases; they don't seem to be related. I thought I 
could find the tests that cause OOM most often, but they all happen in 
different tests/projects: 
 - Streaming Java
 - Connectors : Hive
 - Table : Planner
 - Formats : Csv
 - Table : SQL Gateway
 - Formats : Parquet

Is it possible to get the heap dump created on Jenkins? Do we need to ask the 
infra team?
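
If the dumps are not collected today, one option (assuming we can adjust the CI JVM arguments, e.g. the surefire argLine) is the standard HotSpot flags; the dump path below is a placeholder for wherever the agent can archive artifacts from:
{code}
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdumps
{code}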

> OutOfMemoryError in connect test run
> 
>
> Key: FLINK-36290
> URL: https://issues.apache.org/jira/browse/FLINK-36290
> Project: Flink
>  Issue Type: Bug
>  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Tests
>Affects Versions: 2.0-preview
>Reporter: Matthias Pohl
>Priority: Blocker
>
> We saw an OOM in the connect stage that caused a fatal error:
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62173&view=logs&j=1c002d28-a73d-5309-26ee-10036d8476b4&t=d1c117a6-8f13-5466-55f0-d48dbb767fcd&l=12182
> {code}
> 03:19:59,975 [   flink-scheduler-1] ERROR 
> org.apache.flink.util.FatalExitExceptionHandler  [] - FATAL: 
> Thread 'flink-scheduler-1' produced an uncaught exception. Stopping the 
> process...
> java.lang.OutOfMemoryError: Java heap space
> [...]
> 03:19:59,981 [jobmanager_62-main-scheduler-thread-1] ERROR 
> org.apache.flink.util.FatalExitExceptionHandler  [] - FATAL: 
> Thread 'jobmanager_62-main-scheduler-thread-1' produced an uncaught 
> exception. Stopping the process...
> java.lang.OutOfMemoryError: Java heap space
> [...]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36272) YarnFileStageTestS3ITCase fails on master

2024-09-14 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881828#comment-17881828
 ] 

Luke Chen commented on FLINK-36272:
---

We can close this ticket now since it is also fixed by this PR: 
[https://github.com/apache/flink/pull/25325]. The latest CI build is green: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=62134&view=results

> YarnFileStageTestS3ITCase fails on master
> -
>
> Key: FLINK-36272
> URL: https://issues.apache.org/jira/browse/FLINK-36272
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Tests
>Affects Versions: 2.0.0
>Reporter: Matthias Pohl
>Priority: Blocker
>  Labels: test-stability
>
> The issue was introduced by FLINK-34085 where the test failure wasn't 
> discovered because the test didn't run (see 
> [logs|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=61954&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=28206]).
> I would suspect that this is due to the fact that we're not enabling S3 in PR 
> CI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36272) YarnFileStageTestS3ITCase fails on master

2024-09-13 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881515#comment-17881515
 ] 

Luke Chen commented on FLINK-36272:
---

PR: https://github.com/apache/flink/pull/25325

> YarnFileStageTestS3ITCase fails on master
> -
>
> Key: FLINK-36272
> URL: https://issues.apache.org/jira/browse/FLINK-36272
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Tests
>Affects Versions: 2.0.0
>Reporter: Matthias Pohl
>Priority: Blocker
>  Labels: test-stability
>
> The issue was introduced by FLINK-34085 where the test failure wasn't 
> discovered because the test didn't run (see 
> [logs|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=61954&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=28206]).
> I would suspect that this is due to the fact that we're not enabling S3 in PR 
> CI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-36258) testRecursiveUploadForYarnS3n failed due to no AWS Credentials provided

2024-09-13 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-36258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881514#comment-17881514
 ] 

Luke Chen commented on FLINK-36258:
---

PR: https://github.com/apache/flink/pull/25325
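
For reference, the failure below boils down to missing Hadoop configuration; a hedged sketch of supplying the two properties the exception names (in CI they would come from secrets rather than literals, and the class name here is illustrative):
{code:java}
import org.apache.hadoop.conf.Configuration;

public class S3nCredentialsSketch {
    public static Configuration withS3nCredentials(String accessKey, String secretKey) {
        Configuration conf = new Configuration();
        // The exact property names from the exception message below.
        conf.set("fs.s3n.awsAccessKeyId", accessKey);
        conf.set("fs.s3n.awsSecretAccessKey", secretKey);
        return conf;
    }
}
{code}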

> testRecursiveUploadForYarnS3n failed due to no AWS Credentials provided 
> 
>
> Key: FLINK-36258
> URL: https://issues.apache.org/jira/browse/FLINK-36258
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Tests
>Affects Versions: 2.0.0
>Reporter: Weijie Guo
>Assignee: Xuannan Su
>Priority: Blocker
>  Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=61979&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=28296
> {code:java}
> Sep 11 05:44:32 Caused by: java.lang.IllegalArgumentException: AWS Access Key 
> ID and Secret Access Key must be specified by setting the 
> fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties (respectively).
> Sep 11 05:44:32   at 
> org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:74)
> Sep 11 05:44:32   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:80)
> Sep 11 05:44:32   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
> Sep 11 05:44:32   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Sep 11 05:44:32   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Sep 11 05:44:32   at java.lang.reflect.Method.invoke(Method.java:498)
> Sep 11 05:44:32   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
> Sep 11 05:44:32   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
> Sep 11 05:44:32   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
> Sep 11 05:44:32   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
> Sep 11 05:44:32   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
> Sep 11 05:44:32   at 
> org.apache.hadoop.fs.s3native.$Proxy64.initialize(Unknown Source)
> Sep 11 05:44:32   at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:334)
> Sep 11 05:44:32   at 
> org.apache.flink.runtime.fs.hdfs.HadoopFsFactory.create(HadoopFsFactory.java:168)
> Sep 11 05:44:32   ... 33 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35657) Flink UI not support show float metric value

2024-07-08 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863707#comment-17863707
 ] 

Luke Chen commented on FLINK-35657:
---

[~hong] , sure, I don't know the process very well. Should I open two separate 
PRs against the 1.19 and 1.20 branches?

> Flink UI not support show float metric value
> 
>
> Key: FLINK-35657
> URL: https://issues.apache.org/jira/browse/FLINK-35657
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Web Frontend
>Affects Versions: 1.18.1, 1.20.0, 1.19.1
>Reporter: Can Luo
>Assignee: Luke Chen
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.20.0, 1.19.2
>
> Attachments: image-2024-06-20-15-42-41-004.png
>
>
> The Flink UI shows float metric values as int/long. For example, 
> `outPoolUsage`/`inPoolUsage` always show as 0 or 1 in the UI, which is of 
> little help for task fine-tuning.
> !image-2024-06-20-15-42-41-004.png|width=402,height=190!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35657) Flink UI not support show float metric value

2024-06-20 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856439#comment-17856439
 ] 

Luke Chen commented on FLINK-35657:
---

PR to fix this issue: https://github.com/apache/flink/pull/24964

> Flink UI not support show float metric value
> 
>
> Key: FLINK-35657
> URL: https://issues.apache.org/jira/browse/FLINK-35657
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Web Frontend
>Reporter: Can Luo
>Priority: Minor
> Attachments: image-2024-06-20-15-42-41-004.png
>
>
> The Flink UI shows float metric values as int/long. For example, 
> `outPoolUsage`/`inPoolUsage` always show as 0 or 1 in the UI, which is of 
> little help for task fine-tuning.
> !image-2024-06-20-15-42-41-004.png|width=402,height=190!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35653) "fraud detection" example missed "env.execute" explanation

2024-06-19 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856375#comment-17856375
 ] 

Luke Chen commented on FLINK-35653:
---

PR: https://github.com/apache/flink/pull/24959

> "fraud detection" example missed "env.execute" explanation
> --
>
> Key: FLINK-35653
> URL: https://issues.apache.org/jira/browse/FLINK-35653
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Luke Chen
>Priority: Major
>
> In "try flink" page, there is a [fraud detection with the dataStream 
> API|https://nightlies.apache.org/flink/flink-docs-master/docs/try-flink/datastream/#breaking-down-the-code]
>  example to demo how to build a stateful streaming application with Flink’s 
> DataStream API. In this page, we explained the code line by line, but missed 
> the last line:
> {code:java}
> env.execute("Fraud Detection");{code}
>  
> I confirmed this explanation existed before 
> [v1.12|https://github.com/apache/flink/blob/release-1.12/docs/try-flink/datastream_api.md?plain=1#L365],
>  but when we migrated the Flink docs from Jekyll to Hugo in this 
> [PR|https://github.com/apache/flink/pull/14903], we lost that section. This 
> omission may cause first-time Flink users to miss the important line that 
> executes the Flink job.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35653) "fraud detection" example missed "env.execute" explanation

2024-06-19 Luke Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Chen updated FLINK-35653:
--
Description: 
In "try flink" page, there is a [fraud detection with the dataStream 
API|https://nightlies.apache.org/flink/flink-docs-master/docs/try-flink/datastream/#breaking-down-the-code]
 example to demo how to build a stateful streaming application with Flink’s 
DataStream API. In this page, we explained the code line by line, but missed 
the last line:
{code:java}
env.execute("Fraud Detection");{code}
 

I confirmed this explanation existed before 
[v1.12|https://github.com/apache/flink/blob/release-1.12/docs/try-flink/datastream_api.md?plain=1#L365],
 but when we migrated the Flink docs from Jekyll to Hugo in this 
[PR|https://github.com/apache/flink/pull/14903], we lost that section. This 
omission may cause first-time Flink users to miss the important line that 
executes the Flink job.

 

 

  was:
In "try flink" page, there is a [fraud detection with the dataStream 
API|https://nightlies.apache.org/flink/flink-docs-master/docs/try-flink/datastream/]
 example to demo how to build a stateful streaming application with Flink’s 
DataStream API. In this page, we explained the code line by line, but missed 
the last line:
{code:java}
env.execute("Fraud Detection");{code}
 

I confirmed this explanation existed before 
[v1.12|https://github.com/apache/flink/blob/release-1.12/docs/try-flink/datastream_api.md?plain=1#L365],
 but when we migrated the Flink docs from Jekyll to Hugo in this 
[PR|https://github.com/apache/flink/pull/14903], we lost that section. This 
omission may cause first-time Flink users to miss the important line that 
executes the Flink job.

 

 


> "fraud detection" example missed "env.execute" explanation
> --
>
> Key: FLINK-35653
> URL: https://issues.apache.org/jira/browse/FLINK-35653
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Luke Chen
>Priority: Major
>
> In "try flink" page, there is a [fraud detection with the dataStream 
> API|https://nightlies.apache.org/flink/flink-docs-master/docs/try-flink/datastream/#breaking-down-the-code]
>  example to demo how to build a stateful streaming application with Flink’s 
> DataStream API. In this page, we explained the code line by line, but missed 
> the last line:
> {code:java}
> env.execute("Fraud Detection");{code}
>  
> I confirmed this explanation existed before 
> [v1.12|https://github.com/apache/flink/blob/release-1.12/docs/try-flink/datastream_api.md?plain=1#L365],
>  but when we migrated the Flink docs from Jekyll to Hugo in this 
> [PR|https://github.com/apache/flink/pull/14903], we lost that section. This 
> omission may cause first-time Flink users to miss the important line that 
> executes the Flink job.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35653) "fraud detection" example missed "env.execute" explanation

2024-06-19 Luke Chen (Jira)
Luke Chen created FLINK-35653:
-

 Summary: "fraud detection" example missed "env.execute" explanation
 Key: FLINK-35653
 URL: https://issues.apache.org/jira/browse/FLINK-35653
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Reporter: Luke Chen


In "try flink" page, there is a [fraud detection with the dataStream 
API|https://nightlies.apache.org/flink/flink-docs-master/docs/try-flink/datastream/]
 example to demo how to build a stateful streaming application with Flink’s 
DataStream API. In this page, we explained the code line by line, but missed 
the last line:
{code:java}
env.execute("Fraud Detection");{code}
 

I confirmed this explanation existed before 
[v1.12|https://github.com/apache/flink/blob/release-1.12/docs/try-flink/datastream_api.md?plain=1#L365],
 but when we migrated the Flink docs from Jekyll to Hugo in this 
[PR|https://github.com/apache/flink/pull/14903], we lost that section. This 
omission may cause first-time Flink users to miss the important line that 
executes the Flink job.
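
For context on why this line matters: DataStream programs are built lazily, so everything before {{execute()}} only constructs the dataflow graph. A minimal sketch (not the tutorial's full example; the source and class name are placeholders):
{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FraudDetectionSkeleton {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Stand-in for the tutorial's transactions source and fraud detector.
        env.fromElements("tx-1", "tx-2", "tx-3").print();
        // Everything above only *defines* the pipeline; this call submits it.
        env.execute("Fraud Detection");
    }
}
{code}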

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35653) "fraud detection" example missed "env.execute" explanation

2024-06-19 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17856372#comment-17856372
 ] 

Luke Chen commented on FLINK-35653:
---

Working on the fix.

> "fraud detection" example missed "env.execute" explanation
> --
>
> Key: FLINK-35653
> URL: https://issues.apache.org/jira/browse/FLINK-35653
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Luke Chen
>Priority: Major
>
> In "try flink" page, there is a [fraud detection with the dataStream 
> API|https://nightlies.apache.org/flink/flink-docs-master/docs/try-flink/datastream/]
>  example to demo how to build a stateful streaming application with Flink’s 
> DataStream API. In this page, we explained the code line by line, but missed 
> the last line:
> {code:java}
> env.execute("Fraud Detection");{code}
>  
> I confirmed this explanation existed before 
> [v1.12|https://github.com/apache/flink/blob/release-1.12/docs/try-flink/datastream_api.md?plain=1#L365],
>  but when we migrated the Flink docs from Jekyll to Hugo in this 
> [PR|https://github.com/apache/flink/pull/14903], we lost that section. This 
> omission may cause first-time Flink users to miss the important line that 
> executes the Flink job.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35587) Job fails with "The read buffer is null in credit-based input channel" on TPC-DS 10TB benchmark

2024-06-18 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855837#comment-17855837
 ] 

Luke Chen commented on FLINK-35587:
---

[~JunRuiLi] , do you have the full logs for this issue? If possible, could you 
share them? Thanks.

> Job fails with "The read buffer is null in credit-based input channel" on 
> TPC-DS 10TB benchmark
> ---
>
> Key: FLINK-35587
> URL: https://issues.apache.org/jira/browse/FLINK-35587
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.20.0
>Reporter: Junrui Li
>Assignee: Weijie Guo
>Priority: Blocker
> Attachments: image-2024-06-13-13-48-37-162.png
>
>
> While running the TPC-DS 10TB benchmark on the latest master branch locally, I've 
> encountered a failure in certain queries, specifically query 75, resulting in 
> the error "The read buffer is null in credit-based input channel".
> Using a binary search approach, I identified the offending commit as 
> FLINK-33668. After reverting FLINK-33668 and subsequent commits, the issue 
> disappears. Re-applying FLINK-33668 to the branch re-introduces the error.
> Please see the attached image for the error stack trace.
> !image-2024-06-13-13-48-37-162.png|width=846,height=555!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35629) Performance regression in stringRead and stringWrite

2024-06-18 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17855827#comment-17855827
 ] 

Luke Chen commented on FLINK-35629:
---

The performance degradation started after this PR was merged: 
[https://github.com/apache/flink/pull/24924]

> Performance regression in stringRead and stringWrite
> 
>
> Key: FLINK-35629
> URL: https://issues.apache.org/jira/browse/FLINK-35629
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks
>Affects Versions: 1.20.0
>Reporter: Rui Fan
>Priority: Blocker
> Attachments: image-2024-06-18-14-52-55-164.png
>
>
> [http://flink-speed.xyz/timeline/#/?exe=1&ben=stringWrite.128.ascii&extr=on&quarts=on&equid=off&env=3&revs=200]
> [http://flink-speed.xyz/timeline/#/?exe=1&ben=stringWrite.128.chinese&extr=on&quarts=on&equid=off&env=3&revs=200]
> [http://flink-speed.xyz/timeline/#/?exe=1&ben=stringWrite.128.russian&extr=on&quarts=on&equid=off&env=3&revs=200]
> [http://flink-speed.xyz/timeline/#/?exe=6&ben=stringRead.128.chinese&extr=on&quarts=on&equid=off&env=3&revs=200]
> [http://flink-speed.xyz/timeline/#/?exe=6&ben=stringRead.4.ascii&extr=on&quarts=on&equid=off&env=3&revs=200]
> [http://flink-speed.xyz/timeline/#/?exe=6&ben=stringRead.4.chinese&extr=on&quarts=on&equid=off&env=3&revs=200]
> [http://flink-speed.xyz/timeline/#/?exe=6&ben=stringRead.4.russian&extr=on&quarts=on&equid=off&env=3&revs=200]
> !image-2024-06-18-14-52-55-164.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33607) Add checksum verification for Maven wrapper as well

2024-05-27 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849874#comment-17849874
 ] 

Luke Chen commented on FLINK-33607:
---

[~mapohl] , I've opened a PR to improve it. Please assign the ticket to me. 
Thanks.

> Add checksum verification for Maven wrapper as well
> ---
>
> Key: FLINK-33607
> URL: https://issues.apache.org/jira/browse/FLINK-33607
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System
>Affects Versions: 1.18.0, 1.19.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> FLINK-33503 enabled us to add checksum checks for the Maven wrapper binaries 
> along the update from 3.1.0 to 3.2.0.
> But there seems to be an issue with verifying the wrapper's checksum under 
> Windows (see [related PR discussion in 
> Guava|https://github.com/google/guava/pull/6807/files]).
> This issue covers the fix as soon as MWRAPPER-103 is resolved. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-33607) Add checksum verification for Maven wrapper as well

2024-05-27 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-33607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849873#comment-17849873
 ] 

Luke Chen commented on FLINK-33607:
---

PR: [https://github.com/apache/flink/pull/24852]

After MWRAPPER-103 is resolved in 3.3.0, we should upgrade the Maven wrapper to 
3.3.0 or later to enable verification of the wrapper's checksum.

This PR does the following:
 # Upgrades the Maven wrapper to the latest 3.3.2 version to include more bug fixes.
 # Updates the {{wrapperSha256Sum}} for {{maven-wrapper-3.3.2.jar}}
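
For reference, a sketch of where that property lives, assuming the standard Maven wrapper layout (the checksum value is a placeholder, not the real SHA-256 of maven-wrapper-3.3.2.jar):
{code}
# .mvn/wrapper/maven-wrapper.properties
wrapperUrl=https://repo.maven.apache.org/maven2/org/apache/maven/wrapper/maven-wrapper/3.3.2/maven-wrapper-3.3.2.jar
wrapperSha256Sum=<sha-256 of maven-wrapper-3.3.2.jar>
{code}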

> Add checksum verification for Maven wrapper as well
> ---
>
> Key: FLINK-33607
> URL: https://issues.apache.org/jira/browse/FLINK-33607
> Project: Flink
>  Issue Type: Improvement
>  Components: Build System
>Affects Versions: 1.18.0, 1.19.0
>Reporter: Matthias Pohl
>Priority: Major
>  Labels: pull-request-available
>
> FLINK-33503 enabled us to add checksum checks for the Maven wrapper binaries 
> along the update from 3.1.0 to 3.2.0.
> But there seems to be an issue with verifying the wrapper's checksum under 
> Windows (see [related PR discussion in 
> Guava|https://github.com/google/guava/pull/6807/files]).
> This issue covers the fix as soon as MWRAPPER-103 is resolved. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-28060) Kafka Commit on checkpointing fails repeatedly after a broker restart

2022-08-09 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577679#comment-17577679
 ] 

Luke Chen edited comment on FLINK-28060 at 8/10/22 12:17 AM:
-

[~mason6345] , yes, Kafka v3.1.1 assumes a consumer poll ran before the 
commitAsync call. But in Kafka v3.2.1, this assumption is removed. So, in 
v3.2.1, it works well even if commitAsync is called earlier than poll.

So, for your questions:

1. However, unusually, step 4 will fail if the test doesn't invoke poll 
regardless of the kafka clients version. I think it is possible for Flink to 
have a race condition where commitAsync is executed before poll (short 
checkpoint interval causing commitAsync before poll if topic partitions take 
long to assign). Is this behavior intended?

--> Yes, Kafka should not assume whether poll or commitAsync is called first. 
That bug is fixed in Kafka v3.2.1.

2. Otherwise, do you have any recommendations for reproducing the issue in a CI 
environment where we do not assume poll() is invoked?

--> I think that approach is fine. I also ran the reproducer in this ticket 
with Kafka v3.2.1 and inspected the Kafka client logs; I confirmed it works 
well even if commitAsync is called earlier than poll.

 

Thanks.


was (Author: showuon):
[~mason6345] , yes, Kafka v3.1.1 expects (assumes) a consumer poll ran before 
the commitAsync call. But in Kafka v3.2.1, this assumption is removed. So, in 
v3.2.1, it works well even if commitAsync is called earlier than poll.

So, for your questions:

1. However, unusually, step 4 will fail if the test doesn't invoke poll 
regardless of the kafka clients version. I think it is possible for Flink to 
have a race condition where commitAsync is executed before poll (short 
checkpoint interval causing commitAsync before poll if topic partitions take 
long to assign). Is this behavior intended?

--> Yes, Kafka should not assume whether poll or commitAsync is called first. 
That bug is fixed in Kafka v3.2.1.

2. Otherwise, do you have any recommendations for reproducing the issue in a CI 
environment where we do not assume poll() is invoked?

--> I think that approach is fine. I also ran the reproducer in this ticket 
with Kafka v3.2.1 and inspected the Kafka client logs; I confirmed it works 
well even if commitAsync is called earlier than poll.

 

Thanks.

> Kafka Commit on checkpointing fails repeatedly after a broker restart
> -
>
> Key: FLINK-28060
> URL: https://issues.apache.org/jira/browse/FLINK-28060
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Connectors / Kafka
>Affects Versions: 1.15.0
> Environment: Reproduced on MacOS and Linux.
> Using java 8, Flink 1.15.0, Kafka 2.8.1.
>Reporter: Christian Lorenz
>Priority: Major
>  Labels: pull-request-available
> Attachments: flink-kafka-testjob.zip
>
>
> When Kafka offset committing is enabled and done on Flink's checkpointing, an 
> error might occur if one Kafka broker is shut down which might be the leader 
> of that partition in Kafka's internal __consumer_offsets topic.
> This is expected behaviour. But once the broker is started up again, the 
> next checkpoint issued by Flink should commit the meanwhile processed offsets 
> back to Kafka. Somehow this no longer seems to happen reliably in Flink 1.15.0, 
> and the offset committing is broken. A warning like the following 
> will be logged on each checkpoint:
> {code}
> [info] 14:33:13.684 WARN  [Source Data Fetcher for Source: input-kafka-source 
> -> Sink: output-stdout-sink (1/1)#1] o.a.f.c.k.s.reader.KafkaSourceReader - 
> Failed to commit consumer offsets for checkpoint 35
> [info] org.apache.kafka.clients.consumer.RetriableCommitFailedException: 
> Offset commit failed with a retriable exception. You should retry committing 
> the latest consumed offsets.
> [info] Caused by: 
> org.apache.kafka.common.errors.CoordinatorNotAvailableException: The 
> coordinator is not available.
> {code}
> To reproduce this I've attached a small Flink job program. To execute it, 
> Java 8, Scala sbt, and docker / docker-compose are required. Also see readme.md 
> for more details.
> The job can be run with `sbt run`; the Kafka cluster is started by 
> `docker-compose up`. If the Kafka brokers are then restarted gracefully, 
> e.g. `docker-compose stop kafka1` and `docker-compose start kafka1` with 
> kafka2 and kafka3 afterwards, this warning will occur and no offsets will be 
> committed into Kafka.
> This is not reproducible in Flink 1.14.4.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28060) Kafka Commit on checkpointing fails repeatedly after a broker restart

2022-08-09 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577679#comment-17577679
 ] 

Luke Chen commented on FLINK-28060:
---

[~mason6345] , yes, Kafka v3.1.1 expects (assumes) a consumer poll ran before 
the commitAsync call. But in Kafka v3.2.1, this assumption is removed. So, in 
v3.2.1, it works well even if commitAsync is called earlier than poll.

So, for your questions:

1. However, unusually, step 4 will fail if the test doesn't invoke poll 
regardless of the kafka clients version. I think it is possible for Flink to 
have a race condition where commitAsync is executed before poll (short 
checkpoint interval causing commitAsync before poll if topic partitions take 
long to assign). Is this behavior intended?

--> Yes, Kafka should not assume whether poll or commitAsync is called first. 
That bug is fixed in Kafka v3.2.1.

2. Otherwise, do you have any recommendations for reproducing the issue in a CI 
environment where we do not assume poll() is invoked?

--> I think that approach is fine. I also ran the reproducer in this ticket 
with Kafka v3.2.1 and inspected the Kafka client logs; I confirmed it works 
well even if commitAsync is called earlier than poll.

 

Thanks.
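
A hedged repro sketch of the ordering in question, using the plain Kafka consumer API (the broker address, topic, and group id are placeholders):
{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class CommitBeforePollRepro {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "flink-28060-repro");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("input", 0);
            consumer.assign(Collections.singletonList(tp));
            // Commit *before* the first poll(), mirroring the ordering that
            // Flink's checkpoint-triggered commits can produce.
            consumer.commitAsync(
                    Collections.singletonMap(tp, new OffsetAndMetadata(0L)),
                    (offsets, exception) ->
                            System.out.println("commit completed, exception = " + exception));
            // With kafka-clients 3.2.1 the commit completes even though no
            // poll has happened yet; 3.1.1 assumed poll() ran first.
            consumer.poll(Duration.ofSeconds(5));
        }
    }
}
{code}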

> Kafka Commit on checkpointing fails repeatedly after a broker restart
> -
>
> Key: FLINK-28060
> URL: https://issues.apache.org/jira/browse/FLINK-28060
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Connectors / Kafka
>Affects Versions: 1.15.0
> Environment: Reproduced on MacOS and Linux.
> Using java 8, Flink 1.15.0, Kafka 2.8.1.
>Reporter: Christian Lorenz
>Priority: Major
>  Labels: pull-request-available
> Attachments: flink-kafka-testjob.zip
>
>
> When Kafka offset committing is enabled and done on Flink's checkpointing, an 
> error might occur if one Kafka broker is shut down which might be the leader 
> of that partition in Kafka's internal __consumer_offsets topic.
> This is expected behaviour. But once the broker is started up again, the 
> next checkpoint issued by Flink should commit the meanwhile processed offsets 
> back to Kafka. Somehow this no longer seems to happen reliably in Flink 1.15.0, 
> and the offset committing is broken. A warning like the following 
> will be logged on each checkpoint:
> {code}
> [info] 14:33:13.684 WARN  [Source Data Fetcher for Source: input-kafka-source 
> -> Sink: output-stdout-sink (1/1)#1] o.a.f.c.k.s.reader.KafkaSourceReader - 
> Failed to commit consumer offsets for checkpoint 35
> [info] org.apache.kafka.clients.consumer.RetriableCommitFailedException: 
> Offset commit failed with a retriable exception. You should retry committing 
> the latest consumed offsets.
> [info] Caused by: 
> org.apache.kafka.common.errors.CoordinatorNotAvailableException: The 
> coordinator is not available.
> {code}
> To reproduce this I've attached a small Flink job program. To execute it, 
> Java 8, Scala sbt, and docker / docker-compose are required. Also see readme.md 
> for more details.
> The job can be run with `sbt run`; the Kafka cluster is started by 
> `docker-compose up`. If the Kafka brokers are then restarted gracefully, 
> e.g. `docker-compose stop kafka1` and `docker-compose start kafka1` with 
> kafka2 and kafka3 afterwards, this warning will occur and no offsets will be 
> committed into Kafka.
> This is not reproducible in Flink 1.14.4.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28060) Kafka Commit on checkpointing fails repeatedly after a broker restart

2022-08-04 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575190#comment-17575190
 ] 

Luke Chen commented on FLINK-28060:
---

Could we try to test with Kafka v3.2.1. As mentioned in this comment: 
https://issues.apache.org/jira/browse/KAFKA-13840?focusedCommentId=17575186&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17575186
 , I believe this issue should be fixed in Kafka v3.2.1. Thanks.

> Kafka Commit on checkpointing fails repeatedly after a broker restart
> -
>
> Key: FLINK-28060
> URL: https://issues.apache.org/jira/browse/FLINK-28060
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Connectors / Kafka
>Affects Versions: 1.15.0
> Environment: Reproduced on MacOS and Linux.
> Using java 8, Flink 1.15.0, Kafka 2.8.1.
>Reporter: Christian Lorenz
>Priority: Major
>  Labels: pull-request-available
> Attachments: flink-kafka-testjob.zip
>
>
> When Kafka offset committing is enabled and done on Flink's checkpointing, an 
> error might occur if one Kafka broker is shut down which might be the leader 
> of that partition in Kafka's internal __consumer_offsets topic.
> This is expected behaviour. But once the broker is started up again, the 
> next checkpoint issued by Flink should commit the meanwhile processed offsets 
> back to Kafka. Somehow this no longer seems to happen reliably in Flink 1.15.0, 
> and the offset committing is broken. A warning like the following 
> will be logged on each checkpoint:
> {code}
> [info] 14:33:13.684 WARN  [Source Data Fetcher for Source: input-kafka-source 
> -> Sink: output-stdout-sink (1/1)#1] o.a.f.c.k.s.reader.KafkaSourceReader - 
> Failed to commit consumer offsets for checkpoint 35
> [info] org.apache.kafka.clients.consumer.RetriableCommitFailedException: 
> Offset commit failed with a retriable exception. You should retry committing 
> the latest consumed offsets.
> [info] Caused by: 
> org.apache.kafka.common.errors.CoordinatorNotAvailableException: The 
> coordinator is not available.
> {code}
> To reproduce this I've attached a small Flink job program. To execute it, 
> Java 8, Scala sbt, and docker / docker-compose are required. Also see readme.md 
> for more details.
> The job can be run with `sbt run`; the Kafka cluster is started by 
> `docker-compose up`. If the Kafka brokers are then restarted gracefully, 
> e.g. `docker-compose stop kafka1` and `docker-compose start kafka1` with 
> kafka2 and kafka3 afterwards, this warning will occur and no offsets will be 
> committed into Kafka.
> This is not reproducible in Flink 1.14.4.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-28060) Kafka Commit on checkpointing fails repeatedly after a broker restart

2022-08-04 Luke Chen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575190#comment-17575190
 ] 

Luke Chen edited comment on FLINK-28060 at 8/4/22 10:11 AM:


Could we try to test with Kafka v3.2.1? As mentioned in this comment: 
https://issues.apache.org/jira/browse/KAFKA-13840?focusedCommentId=17575186&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17575186
 , I believe this issue should be fixed in Kafka v3.2.1. Thanks.


was (Author: showuon):
Could we try to test with Kafka v3.2.1. As mentioned in this comment: 
https://issues.apache.org/jira/browse/KAFKA-13840?focusedCommentId=17575186&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17575186
 , I believe this issue should be fixed in Kafka v3.2.1. Thanks.

> Kafka Commit on checkpointing fails repeatedly after a broker restart
> -
>
> Key: FLINK-28060
> URL: https://issues.apache.org/jira/browse/FLINK-28060
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Connectors / Kafka
>Affects Versions: 1.15.0
> Environment: Reproduced on MacOS and Linux.
> Using java 8, Flink 1.15.0, Kafka 2.8.1.
>Reporter: Christian Lorenz
>Priority: Major
>  Labels: pull-request-available
> Attachments: flink-kafka-testjob.zip
>
>
> When Kafka offset committing is enabled and done on Flink's checkpointing, an 
> error might occur if one Kafka broker is shut down which might be the leader 
> of that partition in Kafka's internal __consumer_offsets topic.
> This is expected behaviour. But once the broker is started up again, the 
> next checkpoint issued by Flink should commit the meanwhile processed offsets 
> back to Kafka. Somehow this no longer seems to happen reliably in Flink 1.15.0, 
> and the offset committing is broken. A warning like the following 
> will be logged on each checkpoint:
> {code}
> [info] 14:33:13.684 WARN  [Source Data Fetcher for Source: input-kafka-source 
> -> Sink: output-stdout-sink (1/1)#1] o.a.f.c.k.s.reader.KafkaSourceReader - 
> Failed to commit consumer offsets for checkpoint 35
> [info] org.apache.kafka.clients.consumer.RetriableCommitFailedException: 
> Offset commit failed with a retriable exception. You should retry committing 
> the latest consumed offsets.
> [info] Caused by: 
> org.apache.kafka.common.errors.CoordinatorNotAvailableException: The 
> coordinator is not available.
> {code}
> To reproduce this I've attached a small Flink job program. To execute it, 
> Java 8, Scala sbt, and docker / docker-compose are required. Also see readme.md 
> for more details.
> The job can be run with `sbt run`; the Kafka cluster is started by 
> `docker-compose up`. If the Kafka brokers are then restarted gracefully, 
> e.g. `docker-compose stop kafka1` and `docker-compose start kafka1` with 
> kafka2 and kafka3 afterwards, this warning will occur and no offsets will be 
> committed into Kafka.
> This is not reproducible in Flink 1.14.4.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)