[jira] [Commented] (FLINK-19381) Fix docs about relocatable savepoints
[ https://issues.apache.org/jira/browse/FLINK-19381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201481#comment-17201481 ] Congxian Qiu(klion26) commented on FLINK-19381: --- Ah, the doc wasn't updated when FLINK-5763 was implemented. I'd like to fix the doc; could you please assign this ticket to me? [~NicoK] > Fix docs about relocatable savepoints > - > > Key: FLINK-19381 > URL: https://issues.apache.org/jira/browse/FLINK-19381 > Project: Flink > Issue Type: Bug > Components: Documentation >Affects Versions: 1.12.0, 1.11.2 >Reporter: Nico Kruber >Priority: Major > > Although savepoints have been relocatable since Flink 1.11, the docs still state > otherwise, for example in > [https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#triggering-savepoints] > The warning there, as well as the other changes from FLINK-15863, should be > removed again and potentially replaced with new constraints. > One known constraint is that if task-owned state is used > ({{GenericWriteAheadLog}} sink), savepoints are currently not relocatable > yet. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-19359) Restore from Checkpoint fails if checkpoint folders is corrupt/partial
[ https://issues.apache.org/jira/browse/FLINK-19359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200040#comment-17200040 ] Congxian Qiu(klion26) commented on FLINK-19359: --- A missing {{_metadata}} file means that the checkpoint did not complete, so you can't restore from it. > Restore from Checkpoint fails if checkpoint folders is corrupt/partial > -- > > Key: FLINK-19359 > URL: https://issues.apache.org/jira/browse/FLINK-19359 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.8.0 >Reporter: Arpith Prakash >Priority: Major > Attachments: Checkpoints.png > > > I'm using Flink 1.8.0 and have enabled externalized checkpoints to an > hdfs location. We have seen a few scenarios where checkpoint folders have > checkpoint files but are only missing the "*_metadata*" file. If we attempt to > restore the application from this path, it fails with the exception "Could > not find *_metadata* file". There is a similar discussion on the Flink user mailing > list with the subject "Zookeeper connection loss causing checkpoint corruption". > I've attached a sample snapshot of how the folder structure looks as > well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-19249) Job would wait sometime(~10 min) before failover if some connection broken
[ https://issues.apache.org/jira/browse/FLINK-19249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17200033#comment-17200033 ] Congxian Qiu(klion26) commented on FLINK-19249: --- For the client side, we can add a {{[ReadTimeoutHandler|https://netty.io/4.0/api/io/netty/handler/timeout/ReadTimeoutHandler.html]}} or {{[IdleStateHandler|https://netty.io/4.0/api/io/netty/handler/timeout/IdleStateHandler.html]}} to the {{ChannelPipeline}} to make the client wait a shorter time. For the {{Connection timed out}} exception on the server side, I don't think we can configure that in Netty; it is thrown in the TCP stack. Maybe we can add a {{ReadTimeoutHandler}}/{{IdleStateHandler}} with a configured timeout here? [~sewen] what do you think about this? > Job would wait sometime(~10 min) before failover if some connection broken > -- > > Key: FLINK-19249 > URL: https://issues.apache.org/jira/browse/FLINK-19249 > Project: Flink > Issue Type: Bug > Components: Runtime / Network >Reporter: Congxian Qiu(klion26) >Priority: Blocker > Fix For: 1.12.0, 1.11.3 > > > {quote}encountered this error on 1.7, after going through the master code, I > think the problem is still there > {quote} > When the network environment is not so good, the connection between the > server and the client may be disconnected unexpectedly. After the > disconnection, the server will receive an IOException such as the one below > {code:java} > java.io.IOException: Connection timed out > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > at sun.nio.ch.IOUtil.write(IOUtil.java:51) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468) > at > org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) > at > org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884) > at java.lang.Thread.run(Thread.java:748) > {code} > and then release the view reader. > But the job would not fail until the downstream detects the disconnection > via {{channelInactive}} later (~10 min). During that time, the job can > still process data, but the broken channel can't transfer any data or events, > so snapshots would fail during this time. This will cause the job to replay > a lot of data after failover. -- This message was sent by Atlassian Jira (v8.3.4#803005)
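For illustration, here is a minimal, self-contained Netty sketch of the client-side idea above. It is only a sketch: the 60-second read-idle timeout is a placeholder, closing the channel is one possible reaction to idleness, and it uses plain Netty class names rather than Flink's shaded ones.

{code:java}
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;

public class IdleAwareClientInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        // Fire an IdleStateEvent if nothing is read for 60 seconds (placeholder value).
        ch.pipeline().addLast(new IdleStateHandler(60, 0, 0));
        ch.pipeline().addLast(new ChannelDuplexHandler() {
            @Override
            public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
                if (evt instanceof IdleStateEvent) {
                    // Treat a read-idle channel as broken and fail fast instead of
                    // waiting ~10 minutes for the OS-level TCP write timeout.
                    ctx.close();
                } else {
                    super.userEventTriggered(ctx, evt);
                }
            }
        });
    }
}
{code}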
[jira] [Commented] (FLINK-19300) Timer loss after restoring from savepoint
[ https://issues.apache.org/jira/browse/FLINK-19300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199331#comment-17199331 ] Congxian Qiu(klion26) commented on FLINK-19300: --- [~xianggao] thanks for reporting the issue. I think your analysis is right: {{InputStream#read}} doesn't guarantee that the byte array is fully read. As this may lose timers, I'm not sure whether this needs to be a blocker or not. > Timer loss after restoring from savepoint > - > > Key: FLINK-19300 > URL: https://issues.apache.org/jira/browse/FLINK-19300 > Project: Flink > Issue Type: Bug > Components: Runtime / State Backends >Reporter: Xiang Gao >Priority: Critical > > While using heap-based timers, we are seeing occasional timer loss after > restoring a program from a savepoint, especially when using remote savepoint > storage (s3). > After some investigation, the issue seems to be related to [this line in > deserialization|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/io/PostVersionedIOReadableWritable.java#L65]. > When trying to check the VERSIONED_IDENTIFIER, the input stream may not > guarantee filling the byte array, causing timers to be dropped for the > affected key group. > It should keep reading until the expected number of bytes has actually been read > or the end of the stream has been reached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
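A minimal sketch of the read-until-full loop the report suggests; the names here are illustrative, not the actual Flink fix:

{code:java}
import java.io.IOException;
import java.io.InputStream;

public final class ReadFully {
    // Keep calling read() until the buffer is full or the stream ends, since a
    // single InputStream#read may legally return fewer bytes than requested.
    public static int readFully(InputStream in, byte[] buf) throws IOException {
        int totalRead = 0;
        while (totalRead < buf.length) {
            int read = in.read(buf, totalRead, buf.length - totalRead);
            if (read == -1) {
                break; // end of stream reached before the buffer was filled
            }
            totalRead += read;
        }
        return totalRead; // the caller can compare this against buf.length
    }
}
{code}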
[jira] [Created] (FLINK-19325) Optimize the consumed time for checkpoint completion
Congxian Qiu(klion26) created FLINK-19325: - Summary: Optimize the consumed time for checkpoint completion Key: FLINK-19325 URL: https://issues.apache.org/jira/browse/FLINK-19325 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing Reporter: Congxian Qiu(klion26) Currently, when completing a checkpoint, we write out the state handles in {{MetadataV2V3SerializerBase.java#serializeStreamStateHandle}} {code:java} static void serializeStreamStateHandle(StreamStateHandle stateHandle, DataOutputStream dos) throws IOException { if (stateHandle == null) { dos.writeByte(NULL_HANDLE); } else if (stateHandle instanceof RelativeFileStateHandle) { dos.writeByte(RELATIVE_STREAM_STATE_HANDLE); RelativeFileStateHandle relativeFileStateHandle = (RelativeFileStateHandle) stateHandle; dos.writeUTF(relativeFileStateHandle.getRelativePath()); dos.writeLong(relativeFileStateHandle.getStateSize()); } else if (stateHandle instanceof FileStateHandle) { dos.writeByte(FILE_STREAM_STATE_HANDLE); FileStateHandle fileStateHandle = (FileStateHandle) stateHandle; dos.writeLong(stateHandle.getStateSize()); dos.writeUTF(fileStateHandle.getFilePath().toString()); } else if (stateHandle instanceof ByteStreamStateHandle) { dos.writeByte(BYTE_STREAM_STATE_HANDLE); ByteStreamStateHandle byteStreamStateHandle = (ByteStreamStateHandle) stateHandle; dos.writeUTF(byteStreamStateHandle.getHandleName()); byte[] internalData = byteStreamStateHandle.getData(); dos.writeInt(internalData.length); dos.write(byteStreamStateHandle.getData()); } else { throw new IOException("Unknown implementation of StreamStateHandle: " + stateHandle.getClass()); } dos.flush(); } {code} We call {{dos.flush()}} after every state handle is written out, but this may consume too much time and is not needed, because we close the output stream after everything has been written out. I propose to remove the {{dos.flush()}} here to optimize the consumed time for checkpoint completion. -- This message was sent by Atlassian Jira (v8.3.4#803005)
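As a small, standalone illustration (not Flink code) of why the per-handle flush is redundant: closing the stream flushes any buffered bytes exactly once, so intermediate {{dos.flush()}} calls only add overhead. The path and numbers below are made up.

{code:java}
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FlushOnCloseDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // DataOutputStream.close() propagates flush() through the wrapped
        // BufferedOutputStream, so per-record flush() calls are unnecessary
        // when the stream is closed after all records are written.
        try (DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(sink))) {
            dos.writeUTF("hdfs:///checkpoints/chk-42/some-state-file"); // no flush here
            dos.writeLong(1024L);                                       // no flush here
        } // close() flushes everything exactly once
        System.out.println("bytes in sink after close: " + sink.size());
    }
}
{code}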
[jira] [Commented] (FLINK-19249) Job would wait sometime(~10 min) before failover if some connection broken
[ https://issues.apache.org/jira/browse/FLINK-19249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17199167#comment-17199167 ] Congxian Qiu(klion26) commented on FLINK-19249: --- [~sewen] thanks for the reply. IIUC, a TCP keepalive and a shorter timeout can't solve this problem. The server receives a {{Connection timed out}} because the buffered data cannot be sent out in time. After the timeout, the server may send an {{ErrorResponse}} to the client, but the client may or may not receive this error in time. PS: we can't easily change the Linux kernel configuration, as there are other services running on the same machine. > Job would wait sometime(~10 min) before failover if some connection broken > -- > > Key: FLINK-19249 > URL: https://issues.apache.org/jira/browse/FLINK-19249 > Project: Flink > Issue Type: Bug > Components: Runtime / Network >Reporter: Congxian Qiu(klion26) >Priority: Blocker > Fix For: 1.12.0, 1.11.3 > > > {quote}encountered this error on 1.7, after going through the master code, I > think the problem is still there > {quote} > When the network environment is not so good, the connection between the > server and the client may be disconnected unexpectedly. After the > disconnection, the server will receive an IOException such as the one below > {code:java} > java.io.IOException: Connection timed out > at sun.nio.ch.FileDispatcherImpl.write0(Native Method) > at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) > at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) > at sun.nio.ch.IOUtil.write(IOUtil.java:51) > at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468) > at > org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403) > at > org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) > at > org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) > at > org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884) > at java.lang.Thread.run(Thread.java:748) > {code} > and then release the view reader. > But the job would not fail until the downstream detects the disconnection > via {{channelInactive}} later (~10 min). During that time, the job can > still process data, but the broken channel can't transfer any data or events, > so snapshots would fail during this time. This will cause the job to replay > a lot of data after failover. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-19249) Job would wait sometime(~10 min) before failover if some connection broken
Congxian Qiu(klion26) created FLINK-19249: - Summary: Job would wait sometime(~10 min) before failover if some connection broken Key: FLINK-19249 URL: https://issues.apache.org/jira/browse/FLINK-19249 Project: Flink Issue Type: Bug Components: Runtime / Network Reporter: Congxian Qiu(klion26) {quote}encountered this error on 1.7, after going through the master code, I think the problem is still there {quote} When the network environment is not so good, the connection between the server and the client may be disconnected unexpectedly. After the disconnection, the server will receive an IOException such as the one below {code:java} java.io.IOException: Connection timed out at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:51) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468) at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934) at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884) at java.lang.Thread.run(Thread.java:748) {code} and then release the view reader. But the job would not fail until the downstream detects the disconnection via {{channelInactive}} later (~10 min). During that time, the job can still process data, but the broken channel can't transfer any data or events, so snapshots would fail during this time. This will cause the job to replay a lot of data after failover. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-18744) resume from modified savepoint dirctionary: No such file or directory
[ https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) closed FLINK-18744. - Resolution: Duplicate Closing this one as a duplicate; please refer to FLINK-14942 for more information, thanks. > resume from modified savepoint dirctionary: No such file or directory > - > > Key: FLINK-18744 > URL: https://issues.apache.org/jira/browse/FLINK-18744 > Project: Flink > Issue Type: Bug > Components: API / State Processor >Affects Versions: 1.11.0 >Reporter: tao wang >Priority: Major > > If I resume a job from a savepoint which was modified by the state processor API, > such as loading from /savepoint-path-old and writing to /savepoint-path-new, > the job resumes with savepoint path = /savepoint-path-new while throwing an > Exception: > _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. > I think it's an issue because Flink 1.11 uses absolute paths in savepoints > and checkpoints, but the state processor API missed this. > The job will work well with the new savepoint (whose path is /savepoint-path-new) > if I copy all directories except `_metadata` from /savepoint-path-old to > /savepoint-path-new. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-19126) Failed to run job in yarn-cluster mode due to No Executor found.
[ https://issues.apache.org/jira/browse/FLINK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189788#comment-17189788 ] Congxian Qiu(klion26) commented on FLINK-19126: --- Hi [~Tang Yan], as the exception says, could you please try again after exporting {{HADOOP_CLASSPATH}}? > Failed to run job in yarn-cluster mode due to No Executor found. > > > Key: FLINK-19126 > URL: https://issues.apache.org/jira/browse/FLINK-19126 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.11.1 >Reporter: Tang Yan >Priority: Major > > I've built the Flink package successfully, but when I run the command below, it > fails to submit the job. > [yanta@flink-1.11]$ bin/flink run -m yarn-cluster -p 2 -c > org.apache.flink.examples.java.wordcount.WordCount > examples/batch/WordCount.jar --input hdfs:///user/yanta/aa.txt --output > hdfs:///user/yanta/result.txt > Setting HADOOP_CONF_DIR=/etc/hadoop/conf because no HADOOP_CONF_DIR or > HADOOP_CLASSPATH was set. > The program > finished with the following exception: > java.lang.IllegalStateException: No Executor found. Please make sure to > export the HADOOP_CLASSPATH environment variable or have hadoop in your > classpath. For more information refer to the "Deployment & Operations" > section of the official Apache Flink documentation. at > org.apache.flink.yarn.cli.FallbackYarnSessionCli.isActive(FallbackYarnSessionCli.java:59) > at > org.apache.flink.client.cli.CliFrontend.validateAndGetActiveCommandLine(CliFrontend.java:1090) > at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:218) at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:916) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992) > at > org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992) -- This message was sent by Atlassian Jira (v8.3.4#803005)
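For reference, the usual fix from the Flink deployment docs is to export the Hadoop classpath before submitting, assuming the {{hadoop}} binary is on the PATH:

{code}
export HADOOP_CLASSPATH=`hadoop classpath`
bin/flink run -m yarn-cluster -p 2 -c org.apache.flink.examples.java.wordcount.WordCount examples/batch/WordCount.jar --input hdfs:///user/yanta/aa.txt --output hdfs:///user/yanta/result.txt
{code}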
[jira] [Commented] (FLINK-14942) State Processing API: add an option to make deep copy
[ https://issues.apache.org/jira/browse/FLINK-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173767#comment-17173767 ] Congxian Qiu(klion26) commented on FLINK-14942: --- After savepoints became relocatable, I think we need to make a deep copy the default for the State Processor API now, because we assume that all the data and metadata are in the same directory. > State Processing API: add an option to make deep copy > - > > Key: FLINK-14942 > URL: https://issues.apache.org/jira/browse/FLINK-14942 > Project: Flink > Issue Type: Improvement > Components: API / State Processor >Reporter: Jun Qin >Assignee: Jun Qin >Priority: Major > Labels: usability > Fix For: 1.12.0 > > > Currently, when a new savepoint is created based on a source savepoint, > there are references in the new savepoint to the source savepoint. The [State Processor API > doc|https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/libs/state_processor_api.html] > says: > bq. Note: When basing a new savepoint on existing state, the state processor > api makes a shallow copy of the pointers to the existing operators. This > means that both savepoints share state and one cannot be deleted without > corrupting the other! > This JIRA is to request an option to have a deep copy (instead of a shallow > copy) such that the new savepoint is self-contained. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18744) resume from modified savepoint dirctionary: No such file or directory
[ https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173766#comment-17173766 ] Congxian Qiu(klion26) commented on FLINK-18744: --- The problem here exists; it was caused by FLINK-5763, after which we assume that all the data and metadata are in the same directory. But currently, the state processor API does not make a deep copy of the previous data. The workaround for now is to copy the data from the previous directory to the new directory. I think the problem here can be fixed by FLINK-14942. > resume from modified savepoint dirctionary: No such file or directory > - > > Key: FLINK-18744 > URL: https://issues.apache.org/jira/browse/FLINK-18744 > Project: Flink > Issue Type: Bug > Components: API / State Processor >Affects Versions: 1.11.0 >Reporter: tao wang >Priority: Major > > If I resume a job from a savepoint which was modified by the state processor API, > such as loading from /savepoint-path-old and writing to /savepoint-path-new, > the job resumes with savepoint path = /savepoint-path-new while throwing an > Exception: > _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. > I think it's an issue because Flink 1.11 uses absolute paths in savepoints > and checkpoints, but the state processor API missed this. > The job will work well with the new savepoint (whose path is /savepoint-path-new) > if I copy all directories except `_metadata` from /savepoint-path-old to > /savepoint-path-new. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints
[ https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-18675: -- Fix Version/s: 1.11.2 1.12.0 > Checkpoint not maintaining minimum pause duration between checkpoints > - > > Key: FLINK-18675 > URL: https://issues.apache.org/jira/browse/FLINK-18675 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 > Environment: !image.png! >Reporter: Ravi Bhushan Ratnakar >Priority: Critical > Fix For: 1.12.0, 1.11.2 > > Attachments: image.png > > > I am running a streaming job with Flink 1.11.0 using kubernetes > infrastructure. I have configured the checkpointing configuration as below > Interval - 3 minutes > Minimum pause between checkpoints - 3 minutes > Checkpoint timeout - 10 minutes > Checkpointing Mode - Exactly Once > Number of Concurrent Checkpoint - 1 > > Other configs > Time Characteristics - Processing Time > > I am observing an unusual behaviour. *When a checkpoint completes successfully > and its end-to-end duration is almost equal to or greater than the minimum > pause duration, the next checkpoint gets triggered immediately without > maintaining the minimum pause duration*. Kindly notice this behaviour from > checkpoint id 194 onward in the attached screenshot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints
[ https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173755#comment-17173755 ] Congxian Qiu(klion26) commented on FLINK-18675: --- There seems to be a similar issue: FLINK-18856. > Checkpoint not maintaining minimum pause duration between checkpoints > - > > Key: FLINK-18675 > URL: https://issues.apache.org/jira/browse/FLINK-18675 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 > Environment: !image.png! >Reporter: Ravi Bhushan Ratnakar >Priority: Critical > Attachments: image.png > > > I am running a streaming job with Flink 1.11.0 using kubernetes > infrastructure. I have configured the checkpointing configuration as below > Interval - 3 minutes > Minimum pause between checkpoints - 3 minutes > Checkpoint timeout - 10 minutes > Checkpointing Mode - Exactly Once > Number of Concurrent Checkpoint - 1 > > Other configs > Time Characteristics - Processing Time > > I am observing an unusual behaviour. *When a checkpoint completes successfully > and its end-to-end duration is almost equal to or greater than the minimum > pause duration, the next checkpoint gets triggered immediately without > maintaining the minimum pause duration*. Kindly notice this behaviour from > checkpoint id 194 onward in the attached screenshot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18856) CheckpointCoordinator ignores checkpointing.min-pause
[ https://issues.apache.org/jira/browse/FLINK-18856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173754#comment-17173754 ] Congxian Qiu(klion26) commented on FLINK-18856: --- This issue seems similar to FLINK-18675. > CheckpointCoordinator ignores checkpointing.min-pause > - > > Key: FLINK-18856 > URL: https://issues.apache.org/jira/browse/FLINK-18856 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.1 >Reporter: Roman Khachatryan >Assignee: Roman Khachatryan >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0, 1.11.2 > > > See discussion: > http://mail-archives.apache.org/mod_mbox/flink-dev/202008.mbox/%3cca+5xao2enuzfyq+e-mmr2luuueu_zfjjoabjtqxow6tkgdr...@mail.gmail.com%3e -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints
[ https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170497#comment-17170497 ] Congxian Qiu(klion26) commented on FLINK-18675: --- [~raviratnakar] I think the problem here is that {{CheckpointRequestDecider}} has a stale value of {{lastCheckpointCompletionRelativeTime}} when checking whether the checkpoint request is too early. 1. We retrieve the value of {{lastCheckpointCompletionRelativeTime}} when calling {{CheckpointRequestDecider#chooseRequestToExecute}} in {{CheckpointCoordinator#triggerCheckpoint}}. 2. A pending checkpoint completes, and updates the variables {{pendingCheckpoints}} and {{lastCheckpointCompletionRelativeTime}}. 3. In {{CheckpointRequestDecider#chooseRequestToExecute}} we use the previous {{lastCheckpointCompletionRelativeTime}} to check whether the current checkpoint request is too early. I think we can read the value of {{lastCheckpointCompletionRelativeTime}} inside {{CheckpointRequestDecider#chooseRequestToExecute}} to solve this problem. > Checkpoint not maintaining minimum pause duration between checkpoints > - > > Key: FLINK-18675 > URL: https://issues.apache.org/jira/browse/FLINK-18675 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 > Environment: !image.png! >Reporter: Ravi Bhushan Ratnakar >Priority: Critical > Attachments: image.png > > > I am running a streaming job with Flink 1.11.0 using kubernetes > infrastructure. I have configured the checkpointing configuration as below > Interval - 3 minutes > Minimum pause between checkpoints - 3 minutes > Checkpoint timeout - 10 minutes > Checkpointing Mode - Exactly Once > Number of Concurrent Checkpoint - 1 > > Other configs > Time Characteristics - Processing Time > > I am observing an unusual behaviour. *When a checkpoint completes successfully > and its end-to-end duration is almost equal to or greater than the minimum > pause duration, the next checkpoint gets triggered immediately without > maintaining the minimum pause duration*. Kindly notice this behaviour from > checkpoint id 194 onward in the attached screenshot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
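A toy model of the proposed fix, with all names illustrative rather than Flink's actual code: by evaluating the completion timestamp through a supplier at decision time, a checkpoint that completes between enqueueing and deciding is taken into account.

{code:java}
import java.util.function.LongSupplier;

// Toy model: the decider reads the last-completion timestamp lazily,
// instead of receiving a snapshot that may already be stale.
class MinPauseDecider {
    private final long minPauseMillis;
    private final LongSupplier lastCompletionRelativeTimeMs; // read fresh on every call

    MinPauseDecider(long minPauseMillis, LongSupplier lastCompletionRelativeTimeMs) {
        this.minPauseMillis = minPauseMillis;
        this.lastCompletionRelativeTimeMs = lastCompletionRelativeTimeMs;
    }

    // Returns how long the next trigger still has to wait; 0 means it may fire now.
    long nextTriggerDelayMillis(long nowMs) {
        long elapsed = nowMs - lastCompletionRelativeTimeMs.getAsLong();
        return Math.max(0L, minPauseMillis - elapsed);
    }
}
{code}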
[jira] [Commented] (FLINK-18748) Savepoint would be queued unexpected if pendingCheckpoints less than maxConcurrentCheckpoints
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168357#comment-17168357 ] Congxian Qiu(klion26) commented on FLINK-18748: --- [~roman_khachatryan] could you please assign this ticket to [~wayland]? Thanks. [~wayland] Please make sure you have read the [contribution documentation|https://flink.apache.org/contributing/how-to-contribute.html] before implementing, and you can ping us for a review after filing a PR. Thanks. > Savepoint would be queued unexpected if pendingCheckpoints less than > maxConcurrentCheckpoints > - > > Key: FLINK-18748 > URL: https://issues.apache.org/jira/browse/FLINK-18748 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0, 1.11.1 >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by a [user-zh > email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html] > After FLINK-17342, when triggering a checkpoint/savepoint, we'll check > whether the request can be triggered in > {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: > {code:java} > Preconditions.checkState(Thread.holdsLock(lock)); > // 1. > if (isTriggering || queuedRequests.isEmpty()) { > return Optional.empty(); > } > // 2 too many ongoing checkpoints/savepoints > if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { > return Optional.of(queuedRequests.first()) > .filter(CheckpointTriggerRequest::isForce) > .map(unused -> queuedRequests.pollFirst()); > } > // 3 check the timestamp of the last completed checkpoint > long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); > if (nextTriggerDelayMillis > 0) { > return onTooEarly(nextTriggerDelayMillis); > } > return Optional.of(queuedRequests.pollFirst()); > {code} > But if currently {{pendingCheckpointsSizeSupplier.get()}} < > {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the > savepoint will still wait for some time in step 3. > I think we should trigger the savepoint immediately if > {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18748) Savepoint would be queued unexpected if pendingCheckpoints less than maxConcurrentCheckpoints
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167912#comment-17167912 ] Congxian Qiu(klion26) commented on FLINK-18748: --- [~wayland] Do you want to fix this problem? > Savepoint would be queued unexpected if pendingCheckpoints less than > maxConcurrentCheckpoints > - > > Key: FLINK-18748 > URL: https://issues.apache.org/jira/browse/FLINK-18748 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0, 1.11.1 >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by a [user-zh > email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html] > After FLINK-17342, when triggering a checkpoint/savepoint, we'll check > whether the request can be triggered in > {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: > {code:java} > Preconditions.checkState(Thread.holdsLock(lock)); > // 1. > if (isTriggering || queuedRequests.isEmpty()) { > return Optional.empty(); > } > // 2 too many ongoing checkpoints/savepoints > if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { > return Optional.of(queuedRequests.first()) > .filter(CheckpointTriggerRequest::isForce) > .map(unused -> queuedRequests.pollFirst()); > } > // 3 check the timestamp of the last completed checkpoint > long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); > if (nextTriggerDelayMillis > 0) { > return onTooEarly(nextTriggerDelayMillis); > } > return Optional.of(queuedRequests.pollFirst()); > {code} > But if currently {{pendingCheckpointsSizeSupplier.get()}} < > {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the > savepoint will still wait for some time in step 3. > I think we should trigger the savepoint immediately if > {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18748) Savepoint would be queued unexpected if pendingCheckpoints less than maxConcurrentCheckpoints
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167598#comment-17167598 ] Congxian Qiu(klion26) commented on FLINK-18748: --- Hi [~roman_khachatryan], thanks for your reply. If {{pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts}}, then we stop at step 2; but if {{pendingCheckpointsSizeSupplier.get() < maxConcurrentCheckpointAttempts}} we skip case 2 and just check the min pause time in case 3, and in this case I think the savepoint should be triggered. For example, if {{maxConcurrentCheckpointAttempts}} == 1 and there is currently no ongoing checkpoint/savepoint, and the user triggers a savepoint, then we will skip case 2 in the description; in case 3 we just check the min pause time, so the savepoint will wait for some time. > Savepoint would be queued unexpected if pendingCheckpoints less than > maxConcurrentCheckpoints > - > > Key: FLINK-18748 > URL: https://issues.apache.org/jira/browse/FLINK-18748 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0, 1.11.1 >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by a [user-zh > email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html] > After FLINK-17342, when triggering a checkpoint/savepoint, we'll check > whether the request can be triggered in > {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: > {code:java} > Preconditions.checkState(Thread.holdsLock(lock)); > // 1. > if (isTriggering || queuedRequests.isEmpty()) { > return Optional.empty(); > } > // 2 too many ongoing checkpoints/savepoints > if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { > return Optional.of(queuedRequests.first()) > .filter(CheckpointTriggerRequest::isForce) > .map(unused -> queuedRequests.pollFirst()); > } > // 3 check the timestamp of the last completed checkpoint > long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); > if (nextTriggerDelayMillis > 0) { > return onTooEarly(nextTriggerDelayMillis); > } > return Optional.of(queuedRequests.pollFirst()); > {code} > But if currently {{pendingCheckpointsSizeSupplier.get()}} < > {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the > savepoint will still wait for some time in step 3. > I think we should trigger the savepoint immediately if > {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
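To make the proposal concrete, here is a rough sketch against the decision logic quoted in the issue description; {{isSavepoint()}} is a hypothetical accessor, and this illustrates the idea rather than an actual patch:

{code:java}
// Sketch: between step 2 and step 3, let a savepoint bypass the min-pause
// check as long as there is still checkpointing capacity. Names mirror the
// snippet in the issue description; isSavepoint() is hypothetical.
if (pendingCheckpointsSizeSupplier.get() < maxConcurrentCheckpointAttempts
        && queuedRequests.first().isSavepoint()) {
    // savepoints are user-triggered, so don't hold them back for the min pause
    return Optional.of(queuedRequests.pollFirst());
}
// 3 (unchanged) periodic checkpoints still honor the min pause
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
    return onTooEarly(nextTriggerDelayMillis);
}
return Optional.of(queuedRequests.pollFirst());
{code}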
[jira] [Updated] (FLINK-18748) Savepoint would be queued unexpected if pendingCheckpoints less than maxConcurrentCheckpoints
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-18748: -- Summary: Savepoint would be queued unexpected if pendingCheckpoints less than maxConcurrentCheckpoints (was: Savepoint would be queued unexpected) > Savepoint would be queued unexpected if pendingCheckpoints less than > maxConcurrentCheckpoints > - > > Key: FLINK-18748 > URL: https://issues.apache.org/jira/browse/FLINK-18748 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0, 1.11.1 >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by a [user-zh > email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html] > After FLINK-17342, when triggering a checkpoint/savepoint, we'll check > whether the request can be triggered in > {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: > {code:java} > Preconditions.checkState(Thread.holdsLock(lock)); > // 1. > if (isTriggering || queuedRequests.isEmpty()) { > return Optional.empty(); > } > // 2 too many ongoing checkpoints/savepoints > if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { > return Optional.of(queuedRequests.first()) > .filter(CheckpointTriggerRequest::isForce) > .map(unused -> queuedRequests.pollFirst()); > } > // 3 check the timestamp of the last completed checkpoint > long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); > if (nextTriggerDelayMillis > 0) { > return onTooEarly(nextTriggerDelayMillis); > } > return Optional.of(queuedRequests.pollFirst()); > {code} > But if currently {{pendingCheckpointsSizeSupplier.get()}} < > {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the > savepoint will still wait for some time in step 3. > I think we should trigger the savepoint immediately if > {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18748) Savepoint would be queued unexpected
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-18748: -- Description: Inspired by a [user-zh email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html] After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: {code:java} Preconditions.checkState(Thread.holdsLock(lock)); // 1. if (isTriggering || queuedRequests.isEmpty()) { return Optional.empty(); } // 2 too many ongoing checkpoints/savepoints if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { return Optional.of(queuedRequests.first()) .filter(CheckpointTriggerRequest::isForce) .map(unused -> queuedRequests.pollFirst()); } // 3 check the timestamp of the last completed checkpoint long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); if (nextTriggerDelayMillis > 0) { return onTooEarly(nextTriggerDelayMillis); } return Optional.of(queuedRequests.pollFirst()); {code} But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. was: Inspired by an [user-zh email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]] After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: {code:java} Preconditions.checkState(Thread.holdsLock(lock)); // 1. if (isTriggering || queuedRequests.isEmpty()) { return Optional.empty(); } // 2 too many ongoing checkpoints/savepoints if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { return Optional.of(queuedRequests.first()) .filter(CheckpointTriggerRequest::isForce) .map(unused -> queuedRequests.pollFirst()); } // 3 check the timestamp of the last completed checkpoint long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); if (nextTriggerDelayMillis > 0) { return onTooEarly(nextTriggerDelayMillis); } return Optional.of(queuedRequests.pollFirst()); {code} But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. > Savepoint would be queued unexpected > > > Key: FLINK-18748 > URL: https://issues.apache.org/jira/browse/FLINK-18748 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0, 1.11.1 >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by a [user-zh > email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html] > After FLINK-17342, when triggering a checkpoint/savepoint, we'll check > whether the request can be triggered in > {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: > {code:java} > Preconditions.checkState(Thread.holdsLock(lock)); > // 1. > if (isTriggering || queuedRequests.isEmpty()) { > return Optional.empty(); > } > // 2 too many ongoing checkpoints/savepoints > if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { > return Optional.of(queuedRequests.first()) > .filter(CheckpointTriggerRequest::isForce) > .map(unused -> queuedRequests.pollFirst()); > } > // 3 check the timestamp of the last completed checkpoint > long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); > if (nextTriggerDelayMillis > 0) { > return onTooEarly(nextTriggerDelayMillis); > } > return Optional.of(queuedRequests.pollFirst()); > {code} > But if currently {{pendingCheckpointsSizeSupplier.get()}} < > {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the > savepoint will still wait for some time in step 3. > I think we should trigger the savepoint immediately if > {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18748) Savepoint would be queued unexpected
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-18748: -- Description: Inspired by an [user-zh email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]] After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: {code:java} Preconditions.checkState(Thread.holdsLock(lock)); // 1. if (isTriggering || queuedRequests.isEmpty()) { return Optional.empty(); } // 2 too many ongoing checkpoints/savepoints if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { return Optional.of(queuedRequests.first()) .filter(CheckpointTriggerRequest::isForce) .map(unused -> queuedRequests.pollFirst()); } // 3 check the timestamp of the last completed checkpoint long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); if (nextTriggerDelayMillis > 0) { return onTooEarly(nextTriggerDelayMillis); } return Optional.of(queuedRequests.pollFirst()); {code} But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. was: After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: {code:java} Preconditions.checkState(Thread.holdsLock(lock)); // 1. if (isTriggering || queuedRequests.isEmpty()) { return Optional.empty(); } // 2 too many ongoing checkpoints/savepoints if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { return Optional.of(queuedRequests.first()) .filter(CheckpointTriggerRequest::isForce) .map(unused -> queuedRequests.pollFirst()); } // 3 check the timestamp of the last completed checkpoint long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); if (nextTriggerDelayMillis > 0) { return onTooEarly(nextTriggerDelayMillis); } return Optional.of(queuedRequests.pollFirst()); {code} But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. > Savepoint would be queued unexpected > > > Key: FLINK-18748 > URL: https://issues.apache.org/jira/browse/FLINK-18748 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0, 1.11.1 >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by an [user-zh > email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]] > After FLINK-17342, when triggering a checkpoint/savepoint, we'll check > whether the request can be triggered in > {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: > {code:java} > Preconditions.checkState(Thread.holdsLock(lock)); > // 1. > if (isTriggering || queuedRequests.isEmpty()) { > return Optional.empty(); > } > // 2 too many ongoing checkpoints/savepoints > if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { > return Optional.of(queuedRequests.first()) > .filter(CheckpointTriggerRequest::isForce) > .map(unused -> queuedRequests.pollFirst()); > } > // 3 check the timestamp of the last completed checkpoint > long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); > if (nextTriggerDelayMillis > 0) { > return onTooEarly(nextTriggerDelayMillis); > } > return Optional.of(queuedRequests.pollFirst()); > {code} > But if currently {{pendingCheckpointsSizeSupplier.get()}} < > {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the > savepoint will still wait for some time in step 3. > I think we should trigger the savepoint immediately if > {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18748) Savepoint would be queued unexpected
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166877#comment-17166877 ] Congxian Qiu(klion26) commented on FLINK-18748: --- [~pnowojski] [~roman_khachatryan] What do you think about this problem? If it is valid, I can help to fix it. > Savepoint would be queued unexpected > > > Key: FLINK-18748 > URL: https://issues.apache.org/jira/browse/FLINK-18748 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0, 1.11.1 >Reporter: Congxian Qiu(klion26) >Priority: Major > > After FLINK-17342, when triggering a checkpoint/savepoint, we'll check > whether the request can be triggered in > {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: > {code:java} > Preconditions.checkState(Thread.holdsLock(lock)); > // 1. > if (isTriggering || queuedRequests.isEmpty()) { > return Optional.empty(); > } > // 2 too many ongoing checkpoints/savepoints > if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { > return Optional.of(queuedRequests.first()) > .filter(CheckpointTriggerRequest::isForce) > .map(unused -> queuedRequests.pollFirst()); > } > // 3 check the timestamp of the last completed checkpoint > long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); > if (nextTriggerDelayMillis > 0) { > return onTooEarly(nextTriggerDelayMillis); > } > return Optional.of(queuedRequests.pollFirst()); > {code} > But if currently {{pendingCheckpointsSizeSupplier.get()}} < > {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the > savepoint will still wait for some time in step 3. > I think we should trigger the savepoint immediately if > {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18748) Savepoint would be queued unexpected
Congxian Qiu(klion26) created FLINK-18748: - Summary: Savepoint would be queued unexpected Key: FLINK-18748 URL: https://issues.apache.org/jira/browse/FLINK-18748 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.11.1, 1.11.0 Reporter: Congxian Qiu(klion26) After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows: {code:java} Preconditions.checkState(Thread.holdsLock(lock)); // 1. if (isTriggering || queuedRequests.isEmpty()) { return Optional.empty(); } // 2 too many ongoing checkpoints/savepoints if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) { return Optional.of(queuedRequests.first()) .filter(CheckpointTriggerRequest::isForce) .map(unused -> queuedRequests.pollFirst()); } // 3 check the timestamp of the last completed checkpoint long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs); if (nextTriggerDelayMillis > 0) { return onTooEarly(nextTriggerDelayMillis); } return Optional.of(queuedRequests.pollFirst()); {code} But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18744) resume from modified savepoint dirctionary: No such file or directory
[ https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166808#comment-17166808 ] Congxian Qiu(klion26) commented on FLINK-18744: --- [~wayland] Thanks for reporting this issue, I'll take a look at it. > resume from modified savepoint dirctionary: No such file or directory > - > > Key: FLINK-18744 > URL: https://issues.apache.org/jira/browse/FLINK-18744 > Project: Flink > Issue Type: Bug > Components: API / State Processor >Affects Versions: 1.11.0 >Reporter: tao wang >Priority: Major > > If I resume a job from a savepoint which was modified by the state processor API, > such as loading from /savepoint-path-old and writing to /savepoint-path-new, > the job resumes with savepoint path = /savepoint-path-new while throwing an > Exception: > _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. > I think it's an issue because Flink 1.11 uses absolute paths in savepoints > and checkpoints, but the state processor API missed this. > The job will work well with the new savepoint (whose path is /savepoint-path-new) > if I copy all directories except `_metadata` from /savepoint-path-old to > /savepoint-path-new. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints
[ https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165387#comment-17165387 ] Congxian Qiu(klion26) commented on FLINK-18675: --- Hi [~raviratnakar], from the git history, the code at line 1512 was added in 1.9.0 (and it seems that change would not affect this problem). IIUC, we check whether the checkpoint can be triggered somewhere; I need to check it carefully as the code has changed a lot. I will reply here if I find anything. > Checkpoint not maintaining minimum pause duration between checkpoints > - > > Key: FLINK-18675 > URL: https://issues.apache.org/jira/browse/FLINK-18675 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 > Environment: !image.png! >Reporter: Ravi Bhushan Ratnakar >Priority: Critical > Attachments: image.png > > > I am running a streaming job with Flink 1.11.0 using kubernetes > infrastructure. I have configured the checkpointing configuration as below > Interval - 3 minutes > Minimum pause between checkpoints - 3 minutes > Checkpoint timeout - 10 minutes > Checkpointing Mode - Exactly Once > Number of Concurrent Checkpoint - 1 > > Other configs > Time Characteristics - Processing Time > > I am observing an unusual behaviour. *When a checkpoint completes successfully > and its end-to-end duration is almost equal to or greater than the minimum > pause duration, the next checkpoint gets triggered immediately without > maintaining the minimum pause duration*. Kindly notice this behaviour from > checkpoint id 194 onward in the attached screenshot. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18665) Filesystem connector should use TableSchema exclude computed columns
[ https://issues.apache.org/jira/browse/FLINK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163229#comment-17163229 ] Congxian Qiu(klion26) commented on FLINK-18665: --- [~Leonard Xu] thanks for the update :) > Filesystem connector should use TableSchema exclude computed columns > > > Key: FLINK-18665 > URL: https://issues.apache.org/jira/browse/FLINK-18665 > Project: Flink > Issue Type: Bug > Components: Connectors / FileSystem, Table SQL / Ecosystem >Affects Versions: 1.11.1 >Reporter: Jark Wu >Assignee: Leonard Xu >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.2 > > > This is reported in > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-to-use-a-nested-column-for-CREATE-TABLE-PARTITIONED-BY-td36796.html > {code} > create table navi ( > a STRING, > location ROW<lastUpdateTime BIGINT, transId STRING> > ) with ( > 'connector' = 'filesystem', > 'path' = 'east-out', > 'format' = 'json' > ) > CREATE TABLE output ( > `partition` AS location.transId > ) PARTITIONED BY (`partition`) > WITH ( > 'connector' = 'filesystem', > 'path' = 'east-out', > 'format' = 'json' > ) LIKE navi (EXCLUDING ALL) > tEnv.sqlQuery("SELECT type, location FROM navi").executeInsert("output") > {code} > It throws the following exception > {code} > Exception in thread "main" org.apache.flink.table.api.ValidationException: > The field count of logical schema of the table does not match with the field > count of physical schema. > The logical schema: [STRING,ROW<`lastUpdateTime` BIGINT, `transId` STRING>] > The physical schema: [STRING,ROW<`lastUpdateTime` BIGINT, `transId` > STRING>,STRING]. > {code} > The reason is that {{FileSystemTableFactory#createTableSource}} should use > the schema excluding computed columns, not the original catalog table schema. > [1]: > https://github.com/apache/flink/blob/master/flink-table/flink-table-runtime-blink/src/main/java/org/apache/flink/table/filesystem/FileSystemTableFactory.java#L78 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18665) Filesystem connector should use TableSchema exclude computed columns
[ https://issues.apache.org/jira/browse/FLINK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162493#comment-17162493 ] Congxian Qiu(klion26) commented on FLINK-18665: --- Which versions does this issue affect? > Filesystem connector should use TableSchema exclude computed columns > > > Key: FLINK-18665 > URL: https://issues.apache.org/jira/browse/FLINK-18665 > Project: Flink > Issue Type: Bug > Components: Connectors / FileSystem, Table SQL / Ecosystem >Reporter: Jark Wu >Assignee: Leonard Xu >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.2 > > > This is reported in > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-to-use-a-nested-column-for-CREATE-TABLE-PARTITIONED-BY-td36796.html > {code} > create table navi ( > a STRING, > location ROW<lastUpdateTime BIGINT, transId STRING> > ) with ( > 'connector' = 'filesystem', > 'path' = 'east-out', > 'format' = 'json' > ) > CREATE TABLE output ( > `partition` AS location.transId > ) PARTITIONED BY (`partition`) > WITH ( > 'connector' = 'filesystem', > 'path' = 'east-out', > 'format' = 'json' > ) LIKE navi (EXCLUDING ALL) > tEnv.sqlQuery("SELECT type, location FROM navi").executeInsert("output") > {code} > It throws the following exception > {code} > Exception in thread "main" org.apache.flink.table.api.ValidationException: > The field count of logical schema of the table does not match with the field > count of physical schema. > The logical schema: [STRING,ROW<`lastUpdateTime` BIGINT, `transId` STRING>] > The physical schema: [STRING,ROW<`lastUpdateTime` BIGINT, `transId` > STRING>,STRING]. > {code} > The reason is that {{FileSystemTableFactory#createTableSource}} should use > the schema excluding computed columns, not the original catalog table schema. > [1]: > https://github.com/apache/flink/blob/master/flink-table/flink-table-runtime-blink/src/main/java/org/apache/flink/table/filesystem/FileSystemTableFactory.java#L78 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161719#comment-17161719 ] Congxian Qiu(klion26) edited comment on FLINK-17571 at 7/21/20, 5:33 AM: - [~pnowojski] Could you please assign this to me if that's OK? I want to add a command-line command {{./bin/flink checkpoint list $path}}. The $path should be the parent directory of {{_metadata}} or the path of {{_metadata}}. And the result will be all the files used by the specific checkpoint/savepoint. was (Author: klion26): [~pnowojski] Could you please assign this to me if that's OK? I want to add a command-line command {{./bin/flink savepoint list $path}}. The $path should be the parent directory of {{_metadata}} or the path of {{_metadata}}. And the result will be all the files used by the specific checkpoint/savepoint. > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Command Line Client, Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] > Currently, there are [three types of > directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] > for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be > deleted safely, but users can't delete the files in the SHARED directory > safely (the files may have been created a long time ago). > I think it's better to give users a way to know which files are > currently used (so the others are not). > Maybe a command-line command such as the one below is enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > the checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] Currently, there are [three types of directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users can't delete the files in the SHARED directory safely (the files may have been created a long time ago). I think it's better to give users a way to know which files are currently used (so the others are not). Maybe a command-line command such as the one below is enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in the checkpoint}} was: Inspired by the [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] Currently, there are [three types of directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users can't delete the files in the SHARED directory safely (the files may have been created a long time ago). I think it's better to give users a way to know which files are currently used (so the others are not). Maybe a command-line command such as the one below is enough to support such a feature. {{./bin/flink savepoint list $checkpointDir # list all the files used in the checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Command Line Client, Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] > Currently, there are [three types of > directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] > for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be > deleted safely, but users can't delete the files in the SHARED directory > safely (the files may have been created a long time ago). > I think it's better to give users a way to know which files are > currently used (so the others are not). > Maybe a command-line command such as the one below is enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > the checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161719#comment-17161719 ] Congxian Qiu(klion26) edited comment on FLINK-17571 at 7/21/20, 5:33 AM: - [~pnowojski] Could you please assign this to me if that's OK? I want to add a command-line command {{./bin/flink checkpoint list $path}}. The $path should be the parent directory of {{_metadata}} or the path of {{_metadata}}. And the result will be all the files used by the specific checkpoint. was (Author: klion26): [~pnowojski] Could you please assign this to me if that's OK? I want to add a command-line command {{./bin/flink checkpoint list $path}}. The $path should be the parent directory of {{_metadata}} or the path of {{_metadata}}. And the result will be all the files used by the specific checkpoint/savepoint. > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Command Line Client, Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] > Currently, there are [three types of > directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] > for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be > deleted safely, but users can't delete the files in the SHARED directory > safely (the files may have been created a long time ago). > I think it's better to give users a way to know which files are > currently used (so the others are not). > Maybe a command-line command such as the one below is enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > the checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] Currently, there are [three types of directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users can't delete the files in the SHARED directory safely (the files may have been created a long time ago). I think it's better to give users a way to know which files are currently used (so the others are not). Maybe a command-line command such as the one below is enough to support such a feature. {{./bin/flink savepoint list $checkpointDir # list all the files used in the checkpoint}} was: Inspired by the [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] Currently, there are [three types of directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users can't delete the files in the SHARED directory safely (the files may have been created a long time ago). I think it's better to give users a way to know which files are currently used (so the others are not). Maybe a command-line command such as the one below is enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in the checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Command Line Client, Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] > Currently, there are [three types of > directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] > for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be > deleted safely, but users can't delete the files in the SHARED directory > safely (the files may have been created a long time ago). > I think it's better to give users a way to know which files are > currently used (so the others are not). > Maybe a command-line command such as the one below is enough to support such a > feature. > {{./bin/flink savepoint list $checkpointDir # list all the files used in > the checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161719#comment-17161719 ] Congxian Qiu(klion26) commented on FLINK-17571: --- [~pnowojski] Could you please assign this to me if that's OK? I want to add a command-line command {{./bin/flink savepoint list $path}}. The $path should be the parent directory of {{_metadata}} or the path of {{_metadata}}. And the result will be all the files used by the specific checkpoint/savepoint. > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Command Line Client, Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] > Currently, there are [three types of > directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] > for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be > deleted safely, but users can't delete the files in the SHARED directory > safely (the files may have been created a long time ago). > I think it's better to give users a way to know which files are > currently used (so the others are not). > Maybe a command-line command such as the one below is enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > the checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
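To make the {{./bin/flink checkpoint list $path}} proposal above more concrete, here is a rough sketch of what such a subcommand could do, assuming flink-runtime's {{Checkpoints#loadCheckpointMetadata}} (whose signature differs across versions) and a full, non-incremental checkpoint; this is an illustration of the idea, not the proposed patch:

{code:java}
import java.io.DataInputStream;

import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.checkpoint.Checkpoints;
import org.apache.flink.runtime.checkpoint.OperatorState;
import org.apache.flink.runtime.checkpoint.OperatorSubtaskState;
import org.apache.flink.runtime.checkpoint.metadata.CheckpointMetadata;
import org.apache.flink.runtime.state.KeyGroupsStateHandle;
import org.apache.flink.runtime.state.KeyedStateHandle;
import org.apache.flink.runtime.state.StreamStateHandle;
import org.apache.flink.runtime.state.filesystem.FileStateHandle;

public final class ListCheckpointFiles {

    public static void main(String[] args) throws Exception {
        // args[0] points at .../chk-42/_metadata (the real subcommand would
        // also accept the parent directory and resolve _metadata itself).
        Path metadataPath = new Path(args[0]);
        FileSystem fs = metadataPath.getFileSystem();

        try (DataInputStream in = new DataInputStream(fs.open(metadataPath))) {
            CheckpointMetadata metadata = Checkpoints.loadCheckpointMetadata(
                    in, ListCheckpointFiles.class.getClassLoader(), metadataPath.toString());

            for (OperatorState operatorState : metadata.getOperatorStates()) {
                for (OperatorSubtaskState subtask : operatorState.getStates()) {
                    for (KeyedStateHandle handle : subtask.getManagedKeyedState()) {
                        // Full checkpoints wrap one stream handle per key-group range.
                        if (handle instanceof KeyGroupsStateHandle) {
                            StreamStateHandle delegate =
                                    ((KeyGroupsStateHandle) handle).getDelegateStateHandle();
                            if (delegate instanceof FileStateHandle) {
                                System.out.println(((FileStateHandle) delegate).getFilePath());
                            }
                        }
                        // Incremental RocksDB handles reference further files in the
                        // SHARED directory and would need their own traversal, as
                        // would raw state and managed operator state.
                    }
                }
            }
        }
    }
}
{code}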
[jira] [Commented] (FLINK-18637) Key group is not in KeyGroupRange
[ https://issues.apache.org/jira/browse/FLINK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160919#comment-17160919 ] Congxian Qiu(klion26) commented on FLINK-18637: --- [~oripwk] thanks for reporting this problem. Could you please update the ticket with the Flink version you're using? > Key group is not in KeyGroupRange > - > > Key: FLINK-18637 > URL: https://issues.apache.org/jira/browse/FLINK-18637 > Project: Flink > Issue Type: Bug > Environment: OS current user: yarn > Current Hadoop/Kerberos user: hadoop > JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.141-b15 > Maximum heap size: 28960 MiBytes > JAVA_HOME: /usr/java/jdk1.8.0_141/jre > Hadoop version: 2.8.5-amzn-6 > JVM Options: >-Xmx30360049728 >-Xms30360049728 >-XX:MaxDirectMemorySize=4429185024 >-XX:MaxMetaspaceSize=1073741824 >-XX:+UseG1GC >-XX:+UnlockDiagnosticVMOptions >-XX:+G1SummarizeConcMark >-verbose:gc >-XX:+PrintGCDetails >-XX:+PrintGCDateStamps >-XX:+UnlockCommercialFeatures >-XX:+FlightRecorder >-XX:+DebugNonSafepoints > -XX:FlightRecorderOptions=defaultrecording=true,settings=/home/hadoop/heap.jfc,dumponexit=true,dumponexitpath=/var/lib/hadoop-yarn/recording.jfr,loglevel=info > -Dlog.file=/var/log/hadoop-yarn/containers/application_1593935560662_0002/container_1593935560662_0002_01_02/taskmanager.log >-Dlog4j.configuration=file:./log4j.properties > Program Arguments: >-Dtaskmanager.memory.framework.off-heap.size=134217728b >-Dtaskmanager.memory.network.max=1073741824b >-Dtaskmanager.memory.network.min=1073741824b >-Dtaskmanager.memory.framework.heap.size=134217728b >-Dtaskmanager.memory.managed.size=23192823744b >-Dtaskmanager.cpu.cores=7.0 >-Dtaskmanager.memory.task.heap.size=30225832000b >-Dtaskmanager.memory.task.off-heap.size=3221225472b >--configDir . > -Djobmanager.rpc.address=ip-10-180-30-250.us-west-2.compute.internal -Dweb.port=0 >-Dweb.tmpdir=/tmp/flink-web-64f613cf-bf04-4a09-8c14-75c31b619574 >-Djobmanager.rpc.port=33739 >-Drest.address=ip-10-180-30-250.us-west-2.compute.internal >Reporter: Ori Popowski >Priority: Major > > I'm getting this error when creating a savepoint. I've read in > https://issues.apache.org/jira/browse/FLINK-16193 that it's caused by an > unstable hashcode or equals on the key, or improper use of > {{reinterpretAsKeyedStream}}. > > My key is a string and I don't use {{reinterpretAsKeyedStream}}. > > {code:java} > senv > .addSource(source) > .flatMap(…) > .filterWith { case (metadata, _, _) => … } > .assignTimestampsAndWatermarks(new > BoundedOutOfOrdernessTimestampExtractor(…)) > .keyingBy { case (meta, _) => meta.toPath.toString } > .process(new TruncateLargeSessions(config.sessionSizeLimit)) > .keyingBy { case (meta, _) => meta.toPath.toString } > .window(EventTimeSessionWindows.withGap(Time.of(…))) > .process(new ProcessSession(sessionPlayback, config)) > .addSink(sink){code} > > {code:java} > org.apache.flink.util.FlinkException: Triggering a savepoint for the job > 962fc8e984e7ca1ed65a038aa62ce124 failed. 
> at > org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:633) > at > org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:611) > at > org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843) > at > org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:608) > at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:910) > at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844) > at > org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968) > Caused by: java.util.concurrent.CompletionException: > java.util.concurrent.CompletionException: > org.apache.flink.runtime.checkpoint.CheckpointException: The job has failed. > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$triggerSavepoint$3(SchedulerBase.java:744) > at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) > at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) > at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) >
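For readers who hit the same error: a subtask's key-group range is derived from the key's {{hashCode}} (roughly a murmur hash of {{key.hashCode()}} modulo the maximum parallelism), so any key type whose hash is not stable across JVMs or restores can produce keys outside the owned range. A hedged illustration of the difference, unrelated to the Scala job above, which already keys by a String:

{code:java}
import java.util.Objects;

public final class StableKeys {

    /** BAD: without hashCode()/equals() this falls back to identity hashing,
     *  which differs per JVM, so key groups shift between runs and restores. */
    static final class UnstableKey {
        final String id;
        UnstableKey(String id) { this.id = id; }
    }

    /** GOOD: a deterministic, value-based hashCode makes keyBy safe. */
    static final class StableKey {
        final String id;
        StableKey(String id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof StableKey && Objects.equals(id, ((StableKey) o).id);
        }

        @Override
        public int hashCode() {
            return Objects.hash(id);
        }
    }
}
{code}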
[jira] [Commented] (FLINK-18582) Add title anchor link for file event_driven.zh.md
[ https://issues.apache.org/jira/browse/FLINK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17157881#comment-17157881 ] Congxian Qiu(klion26) commented on FLINK-18582: --- [~temper] thanks for creating this ticket. Could you please share the pages which reference this page? If currently no pages reference {{event_driven.zh.md}}, maybe we could do this later. What do you think? From my side, we always encourage adding title anchor links, but I'm not sure whether we would accept a change that only adds title anchors (if currently no page references them). cc [~jark] > Add title anchor link for file event_driven.zh.md > - > > Key: FLINK-18582 > URL: https://issues.apache.org/jira/browse/FLINK-18582 > Project: Flink > Issue Type: Improvement > Components: Documentation, Quickstarts >Reporter: Herman, Li >Priority: Major > > Translated files should add anchor links for sub-titles. > Otherwise, links from other documentation may not be able to point to the > specified title, such as the file event_driven.zh.md with the title: Side Outputs. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18523) Advance watermark if there is no data in all of the partitions after some time
[ https://issues.apache.org/jira/browse/FLINK-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-18523: -- Description: In the case of window calculations and eventTime scenarios, watermark cannot update because the source does not have data for some reason, and the last windows cannot trigger their calculations. One parameter, table.exec.source.idle-timeout, can only solve the problem of idle parallel instances being ignored during watermark alignment. But when there is no watermark in any parallel instance, you still cannot update the watermark. Is it possible to add a lock-timeout parameter (which should be larger than maxOutOfOrderness, with a default of "-1 ms") so that if the watermark is not updated beyond this time (i.e., there is no data), the current time is taken and sent downstream as the watermark. thanks! was: In the case of window calculations and eventTime scenarios, watermar cannot update because the source does not have data for some reason, and the last windows cannot trigger their calculations. One parameter, table.exec.source.idle-timeout, can only solve the problem of idle parallel instances being ignored during watermark alignment. But when there is no watermark in any parallel instance, you still cannot update the watermark. Is it possible to add a lock-timeout parameter (which should be larger than maxOutOfOrderness, with a default of "-1 ms") so that if the watermark is not updated beyond this time (i.e., there is no data), the current time is taken and sent downstream as the watermark. thanks! > Advance watermark if there is no data in all of the partitions after some time > -- > > Key: FLINK-18523 > URL: https://issues.apache.org/jira/browse/FLINK-18523 > Project: Flink > Issue Type: New Feature > Components: Table SQL / API >Affects Versions: 1.11.0 >Reporter: chen yong >Priority: Major > > In the case of window calculations and eventTime scenarios, watermark cannot > update because the source does not have data for some reason, and the last > windows cannot trigger their calculations. > One parameter, table.exec.source.idle-timeout, can only solve the problem > of idle parallel instances being ignored during watermark alignment. But when > there is no watermark in any parallel instance, you still cannot update the > watermark. > Is it possible to add a lock-timeout parameter (which should be larger than > maxOutOfOrderness, with a default of "-1 ms") so that if the watermark is not > updated beyond this time (i.e., there is no data), the current time is > taken and sent downstream as the watermark. > > thanks! -- This message was sent by Atlassian Jira (v8.3.4#803005)
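On the DataStream side, Flink 1.11's {{WatermarkStrategy#withIdleness}} addresses the first half of this request: a source split with no data is marked idle after the timeout and no longer holds back watermark alignment. It does not, however, advance the watermark when all partitions are idle, which is the remaining gap described above. A minimal sketch; {{MyEvent}} and its timestamp field are placeholders:

{code:java}
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public final class IdlenessExample {

    /** Placeholder event type with an event-time field. */
    public static final class MyEvent {
        public long eventTime;
    }

    public static WatermarkStrategy<MyEvent> strategy() {
        return WatermarkStrategy
                .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, recordTs) -> event.eventTime)
                // Splits without data for 1 minute are marked idle and are
                // ignored during watermark alignment.
                .withIdleness(Duration.ofMinutes(1));
    }
}
{code}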
[jira] [Commented] (FLINK-18493) Make target home directory used to store yarn files configurable
[ https://issues.apache.org/jira/browse/FLINK-18493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151770#comment-17151770 ] Congxian Qiu(klion26) commented on FLINK-18493: --- +1 for this feature. > Make target home directory used to store yarn files configurable > > > Key: FLINK-18493 > URL: https://issues.apache.org/jira/browse/FLINK-18493 > Project: Flink > Issue Type: Improvement > Components: Deployment / YARN >Affects Versions: 1.10.0, 1.10.1 >Reporter: Kevin Zhang >Priority: Major > > When submitting applications to yarn, the file system's default home > directory, like /user/user_name on hdfs, is used as the base directory to > store flink files. But under different file system access control strategies, > the default home directory may be inaccessible, and it's preferable to let > users specify their own paths if they wish to. -- This message was sent by Atlassian Jira (v8.3.4#803005)
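A sketch of what the requested knob could look like from the user side. The key name "yarn.staging-directory" is a hypothetical placeholder for illustration only, not an existing option in 1.10.x; the option eventually introduced for this ticket may be named differently:

{code:java}
import org.apache.flink.configuration.Configuration;

public final class YarnStagingDirExample {

    public static Configuration withStagingDir() {
        Configuration conf = new Configuration();
        // Hypothetical key: overrides the file system's default home directory
        // (e.g. /user/<name> on HDFS) as the base path for staged YARN files.
        conf.setString("yarn.staging-directory", "hdfs:///flink/yarn-staging");
        return conf;
    }
}
{code}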
[jira] [Commented] (FLINK-18196) flink throws `NullPointerException` when executeCheckpointing
[ https://issues.apache.org/jira/browse/FLINK-18196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150754#comment-17150754 ] Congxian Qiu(klion26) commented on FLINK-18196: --- Seems another report about the similar problem [http://apache-flink.147419.n8.nabble.com/flink1-10-checkpoint-td4337.html] > flink throws `NullPointerException` when executeCheckpointing > - > > Key: FLINK-18196 > URL: https://issues.apache.org/jira/browse/FLINK-18196 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.0, 1.10.1 > Environment: flink version: flink-1.10.0 flink-1.10.1 > jdk version: jdk1.8.0_40 >Reporter: Kai Chen >Priority: Major > > I meet checkpoint NPE when executing wordcount example: > java.lang.Exception: Could not perform checkpoint 5505 for operator Source: > KafkaTableSource(xxx) > SourceConversion(table=[xxx, source: > [KafkaTableSource(xxx)]], fields=[xxx]) > Calc(select=[xxx) AS xxx]) > > SinkConversionToTuple2 -- > Sink: Elasticsearch6UpsertTableSink(xxx) (1/1). > at > org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:802) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:777) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$228/1024478318.call(Unknown > Source) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87) > at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78) > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261) > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:487) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1411) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:991) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:887) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$229/1010499540.run(Unknown > Source) > at > org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:860) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:793) > ... 
12 more > > checkpoint configuration > |Checkpointing Mode|Exactly Once| > |Interval|5s| > |Timeout|10m 0s| > |Minimum Pause Between Checkpoints|0ms| > |Maximum Concurrent Checkpoints|1| > With debug enabled, I found `checkpointMetaData` is null at > > [https://github.com/apache/flink/blob/release-1.10.0/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L1421] > > I fixed this with this patch: > [https://github.com/yuchuanchen/flink/commit/e5122d9787be1fee9bce141887e0d70c9b0a4f19] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18433) From the end-to-end performance test results, 1.11 has a regression
[ https://issues.apache.org/jira/browse/FLINK-18433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147142#comment-17147142 ] Congxian Qiu(klion26) commented on FLINK-18433: --- Thanks for running these tests [~AHeise]. I guess we have to wait for [~Aihua]. From my previous experience, the result between the bounded-e2e test and the unbounded-e2e test may be different, because of the impact of setup things. I checked the log, found that the checkpoint expired before complete, Job fails because `{{Exceeded checkpoint tolerable failure threshold`}}, task received abort checkpoint message. I'm not very sure if we increase the tolerate checkpoint failure threshold can improve the performance or not. > From the end-to-end performance test results, 1.11 has a regression > --- > > Key: FLINK-18433 > URL: https://issues.apache.org/jira/browse/FLINK-18433 > Project: Flink > Issue Type: Bug > Components: API / Core, API / DataStream >Affects Versions: 1.11.0 > Environment: 3 machines > [|https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java] >Reporter: Aihua Li >Priority: Major > Attachments: flink_11.log.gz > > > > I ran end-to-end performance tests between the Release-1.10 and Release-1.11. > the results were as follows: > |scenarioName|release-1.10|release-1.11| | > |OneInput_Broadcast_LazyFromSource_ExactlyOnce_10_rocksdb|46.175|43.8133|-5.11%| > |OneInput_Rescale_LazyFromSource_ExactlyOnce_100_heap|211.835|200.355|-5.42%| > |OneInput_Rebalance_LazyFromSource_ExactlyOnce_1024_rocksdb|1721.041667|1618.32|-5.97%| > |OneInput_KeyBy_LazyFromSource_ExactlyOnce_10_heap|46|43.615|-5.18%| > |OneInput_Broadcast_Eager_ExactlyOnce_100_rocksdb|212.105|199.688|-5.85%| > |OneInput_Rescale_Eager_ExactlyOnce_1024_heap|1754.64|1600.12|-8.81%| > |OneInput_Rebalance_Eager_ExactlyOnce_10_rocksdb|45.9167|43.0983|-6.14%| > |OneInput_KeyBy_Eager_ExactlyOnce_100_heap|212.0816667|200.727|-5.35%| > |OneInput_Broadcast_LazyFromSource_AtLeastOnce_1024_rocksdb|1718.245|1614.381667|-6.04%| > |OneInput_Rescale_LazyFromSource_AtLeastOnce_10_heap|46.12|43.5517|-5.57%| > |OneInput_Rebalance_LazyFromSource_AtLeastOnce_100_rocksdb|212.038|200.388|-5.49%| > |OneInput_KeyBy_LazyFromSource_AtLeastOnce_1024_heap|1762.048333|1606.408333|-8.83%| > |OneInput_Broadcast_Eager_AtLeastOnce_10_rocksdb|46.0583|43.4967|-5.56%| > |OneInput_Rescale_Eager_AtLeastOnce_100_heap|212.233|201.188|-5.20%| > |OneInput_Rebalance_Eager_AtLeastOnce_1024_rocksdb|1720.66|1616.85|-6.03%| > |OneInput_KeyBy_Eager_AtLeastOnce_10_heap|46.14|43.6233|-5.45%| > |TwoInputs_Broadcast_LazyFromSource_ExactlyOnce_100_rocksdb|156.918|152.957|-2.52%| > |TwoInputs_Rescale_LazyFromSource_ExactlyOnce_1024_heap|1415.511667|1300.1|-8.15%| > |TwoInputs_Rebalance_LazyFromSource_ExactlyOnce_10_rocksdb|34.2967|34.1667|-0.38%| > |TwoInputs_KeyBy_LazyFromSource_ExactlyOnce_100_heap|158.353|151.848|-4.11%| > |TwoInputs_Broadcast_Eager_ExactlyOnce_1024_rocksdb|1373.406667|1300.056667|-5.34%| > |TwoInputs_Rescale_Eager_ExactlyOnce_10_heap|34.5717|32.0967|-7.16%| > |TwoInputs_Rebalance_Eager_ExactlyOnce_100_rocksdb|158.655|147.44|-7.07%| > |TwoInputs_KeyBy_Eager_ExactlyOnce_1024_heap|1356.611667|1292.386667|-4.73%| > |TwoInputs_Broadcast_LazyFromSource_AtLeastOnce_10_rocksdb|34.01|33.205|-2.37%| > |TwoInputs_Rescale_LazyFromSource_AtLeastOnce_100_heap|149.588|145.997|-2.40%| > 
|TwoInputs_Rebalance_LazyFromSource_AtLeastOnce_1024_rocksdb|1359.74|1299.156667|-4.46%| > |TwoInputs_KeyBy_LazyFromSource_AtLeastOnce_10_heap|34.025|29.6833|-12.76%| > |TwoInputs_Broadcast_Eager_AtLeastOnce_100_rocksdb|157.303|151.4616667|-3.71%| > |TwoInputs_Rescale_Eager_AtLeastOnce_1024_heap|1368.74|1293.238333|-5.52%| > |TwoInputs_Rebalance_Eager_AtLeastOnce_10_rocksdb|34.325|33.285|-3.03%| > |TwoInputs_KeyBy_Eager_AtLeastOnce_100_heap|162.5116667|134.375|-17.31%| > It can be seen that the performance of 1.11 has a regression, basically > around 5%, and the maximum regression is 17%. This needs to be checked. > The test code: > flink-1.10.0: > [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java] > flink-1.11.0: > [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java] > The submit cmd is like this: > bin/flink run -d -m 192.16
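For reference on the threshold mentioned in the comment above: the number of checkpoint failures a job tolerates before failing over is configured on {{CheckpointConfig}}. A minimal sketch with illustrative values:

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public final class TolerableFailuresExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5_000); // 5s interval, as in the tested setup

        // The default is 0, so the first expired or failed checkpoint fails
        // the job with "Exceeded checkpoint tolerable failure threshold".
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);

        env.fromElements(1, 2, 3).print(); // placeholder pipeline
        env.execute("tolerable-failures-example");
    }
}
{code}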
[jira] [Commented] (FLINK-18433) From the end-to-end performance test results, 1.11 has a regression
[ https://issues.apache.org/jira/browse/FLINK-18433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144611#comment-17144611 ] Congxian Qiu(klion26) commented on FLINK-18433: --- [~Aihua] thanks for the work, maybe the affect version should be 1.11.0? > From the end-to-end performance test results, 1.11 has a regression > --- > > Key: FLINK-18433 > URL: https://issues.apache.org/jira/browse/FLINK-18433 > Project: Flink > Issue Type: Bug > Components: API / Core, API / DataStream >Affects Versions: 1.11.1 > Environment: 3 machines > [|https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java] >Reporter: Aihua Li >Priority: Major > > > I ran end-to-end performance tests between the Release-1.10 and Release-1.11. > the results were as follows: > |scenarioName|release-1.10|release-1.11| | > |OneInput_Broadcast_LazyFromSource_ExactlyOnce_10_rocksdb|46.175|43.8133|-5.11%| > |OneInput_Rescale_LazyFromSource_ExactlyOnce_100_heap|211.835|200.355|-5.42%| > |OneInput_Rebalance_LazyFromSource_ExactlyOnce_1024_rocksdb|1721.041667|1618.32|-5.97%| > |OneInput_KeyBy_LazyFromSource_ExactlyOnce_10_heap|46|43.615|-5.18%| > |OneInput_Broadcast_Eager_ExactlyOnce_100_rocksdb|212.105|199.688|-5.85%| > |OneInput_Rescale_Eager_ExactlyOnce_1024_heap|1754.64|1600.12|-8.81%| > |OneInput_Rebalance_Eager_ExactlyOnce_10_rocksdb|45.9167|43.0983|-6.14%| > |OneInput_KeyBy_Eager_ExactlyOnce_100_heap|212.0816667|200.727|-5.35%| > |OneInput_Broadcast_LazyFromSource_AtLeastOnce_1024_rocksdb|1718.245|1614.381667|-6.04%| > |OneInput_Rescale_LazyFromSource_AtLeastOnce_10_heap|46.12|43.5517|-5.57%| > |OneInput_Rebalance_LazyFromSource_AtLeastOnce_100_rocksdb|212.038|200.388|-5.49%| > |OneInput_KeyBy_LazyFromSource_AtLeastOnce_1024_heap|1762.048333|1606.408333|-8.83%| > |OneInput_Broadcast_Eager_AtLeastOnce_10_rocksdb|46.0583|43.4967|-5.56%| > |OneInput_Rescale_Eager_AtLeastOnce_100_heap|212.233|201.188|-5.20%| > |OneInput_Rebalance_Eager_AtLeastOnce_1024_rocksdb|1720.66|1616.85|-6.03%| > |OneInput_KeyBy_Eager_AtLeastOnce_10_heap|46.14|43.6233|-5.45%| > |TwoInputs_Broadcast_LazyFromSource_ExactlyOnce_100_rocksdb|156.918|152.957|-2.52%| > |TwoInputs_Rescale_LazyFromSource_ExactlyOnce_1024_heap|1415.511667|1300.1|-8.15%| > |TwoInputs_Rebalance_LazyFromSource_ExactlyOnce_10_rocksdb|34.2967|34.1667|-0.38%| > |TwoInputs_KeyBy_LazyFromSource_ExactlyOnce_100_heap|158.353|151.848|-4.11%| > |TwoInputs_Broadcast_Eager_ExactlyOnce_1024_rocksdb|1373.406667|1300.056667|-5.34%| > |TwoInputs_Rescale_Eager_ExactlyOnce_10_heap|34.5717|32.0967|-7.16%| > |TwoInputs_Rebalance_Eager_ExactlyOnce_100_rocksdb|158.655|147.44|-7.07%| > |TwoInputs_KeyBy_Eager_ExactlyOnce_1024_heap|1356.611667|1292.386667|-4.73%| > |TwoInputs_Broadcast_LazyFromSource_AtLeastOnce_10_rocksdb|34.01|33.205|-2.37%| > |TwoInputs_Rescale_LazyFromSource_AtLeastOnce_100_heap|149.588|145.997|-2.40%| > |TwoInputs_Rebalance_LazyFromSource_AtLeastOnce_1024_rocksdb|1359.74|1299.156667|-4.46%| > |TwoInputs_KeyBy_LazyFromSource_AtLeastOnce_10_heap|34.025|29.6833|-12.76%| > |TwoInputs_Broadcast_Eager_AtLeastOnce_100_rocksdb|157.303|151.4616667|-3.71%| > |TwoInputs_Rescale_Eager_AtLeastOnce_1024_heap|1368.74|1293.238333|-5.52%| > |TwoInputs_Rebalance_Eager_AtLeastOnce_10_rocksdb|34.325|33.285|-3.03%| > |TwoInputs_KeyBy_Eager_AtLeastOnce_100_heap|162.5116667|134.375|-17.31%| > It can be seen that the performance of 
1.11 has a regression, basically > around 5%, and the maximum regression is 17%. This needs to be checked. > The test code: > flink-1.10.0: > [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java] > flink-1.11.0: > [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java] > The submit cmd is like this: > bin/flink run -d -m 192.168.39.246:8081 -c > org.apache.flink.basic.operations.PerformanceTestJob > /home/admin/flink-basic-operations_2.11-1.10-SNAPSHOT.jar --topologyName > OneInput --LogicalAttributesofEdges Broadcast --ScheduleMode LazyFromSource > --CheckpointMode ExactlyOnce --recordSize 10 --stateBackend rocksdb > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-5601) Window operator does not checkpoint watermarks
[ https://issues.apache.org/jira/browse/FLINK-5601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142686#comment-17142686 ] Congxian Qiu(klion26) commented on FLINK-5601: -- There seems to be another affected user (user-zh mailing list): [http://apache-flink.147419.n8.nabble.com/savepoint-td4133.html] > Window operator does not checkpoint watermarks > -- > > Key: FLINK-5601 > URL: https://issues.apache.org/jira/browse/FLINK-5601 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing >Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0 >Reporter: Ufuk Celebi >Assignee: Jiayi Liao >Priority: Critical > Labels: pull-request-available > > During release testing [~stefanrichte...@gmail.com] and I noticed that > watermarks are not checkpointed in the window operator. > This can lead to non-determinism when restoring checkpoints. I was running an > adjusted {{SessionWindowITCase}} via Kafka for testing migration and > rescaling and ran into failures, because the data generator required > deterministic behaviour. > What happened was that on restore it could happen that late elements were not > dropped, because the watermarks needed to be re-established after restore > first. > [~aljoscha] Do you know whether there is a special reason for explicitly not > checkpointing watermarks? -- This message was sent by Atlassian Jira (v8.3.4#803005)
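Until watermarks are checkpointed by the window operator itself, one hedged workaround is to track the last seen watermark in operator state and restore it manually; all names below are illustrative, and the restored value would still have to be applied by the job's own logic (e.g. to pre-filter late elements deterministically after a restore):

{code:java}
import java.util.Collections;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class WatermarkTrackingFunction<T> extends ProcessFunction<T, T>
        implements CheckpointedFunction {

    private transient ListState<Long> watermarkState;
    private long lastWatermark = Long.MIN_VALUE;

    @Override
    public void processElement(T value, Context ctx, Collector<T> out) {
        // Remember the highest watermark seen so far.
        lastWatermark = Math.max(lastWatermark, ctx.timerService().currentWatermark());
        out.collect(value);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        watermarkState.update(Collections.singletonList(lastWatermark));
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        watermarkState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("last-watermark", Long.class));
        for (Long wm : watermarkState.get()) {
            lastWatermark = Math.max(lastWatermark, wm);
        }
    }
}
{code}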
[jira] [Commented] (FLINK-18264) Translate the "External Resource Framework" page into Chinese
[ https://issues.apache.org/jira/browse/FLINK-18264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141840#comment-17141840 ] Congxian Qiu(klion26) commented on FLINK-18264: --- [~karmagyz] assigned to you > Translate the "External Resource Framework" page into Chinese > - > > Key: FLINK-18264 > URL: https://issues.apache.org/jira/browse/FLINK-18264 > Project: Flink > Issue Type: Sub-task > Components: chinese-translation, Documentation >Affects Versions: 1.11.0 >Reporter: Yangze Guo >Assignee: Yangze Guo >Priority: Major > Fix For: 1.12.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-18264) Translate the "External Resource Framework" page into Chinese
[ https://issues.apache.org/jira/browse/FLINK-18264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) reassigned FLINK-18264: - Assignee: Yangze Guo > Translate the "External Resource Framework" page into Chinese > - > > Key: FLINK-18264 > URL: https://issues.apache.org/jira/browse/FLINK-18264 > Project: Flink > Issue Type: Sub-task > Components: chinese-translation, Documentation >Affects Versions: 1.11.0 >Reporter: Yangze Guo >Assignee: Yangze Guo >Priority: Major > Fix For: 1.12.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-18091) Test Relocatable Savepoints
[ https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) closed FLINK-18091. - Resolution: Fixed > Test Relocatable Savepoints > --- > > Key: FLINK-18091 > URL: https://issues.apache.org/jira/browse/FLINK-18091 > Project: Flink > Issue Type: Sub-task > Components: Tests >Reporter: Stephan Ewen >Assignee: Congxian Qiu(klion26) >Priority: Major > Labels: release-testing > Fix For: 1.11.0 > > > The test should do the following: > * take a savepoint. needs to make sure the job has enough state that there > is more than just the "_metadata" file > * copy it to another directory > * start the job from that savepoint by addressing the metadata file and by > addressing the savepoint directory > We should also test that an incremental checkpoint that gets moved fails with > a reasonable exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints
[ https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134092#comment-17134092 ] Congxian Qiu(klion26) commented on FLINK-18091: --- Closing this issue, as we have tested on a real cluster as described in the previous comment. Please feel free to reopen if anyone has any other concerns. Thanks. > Test Relocatable Savepoints > --- > > Key: FLINK-18091 > URL: https://issues.apache.org/jira/browse/FLINK-18091 > Project: Flink > Issue Type: Sub-task > Components: Tests >Reporter: Stephan Ewen >Assignee: Congxian Qiu(klion26) >Priority: Major > Labels: release-testing > Fix For: 1.11.0 > > > The test should do the following: > * take a savepoint. needs to make sure the job has enough state that there > is more than just the "_metadata" file > * copy it to another directory > * start the job from that savepoint by addressing the metadata file and by > addressing the savepoint directory > We should also test that an incremental checkpoint that gets moved fails with > a reasonable exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints
[ https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132837#comment-17132837 ] Congxian Qiu(klion26) commented on FLINK-18091: --- The logic of the manual savepoint test here is the same as the ITCase (trigger a savepoint, move the directory, and verify the job can restore from the savepoint successfully). IMHO, this double-checks that the feature works. It also tests that restoring an incremental checkpoint fails after relocating the directory. > Test Relocatable Savepoints > --- > > Key: FLINK-18091 > URL: https://issues.apache.org/jira/browse/FLINK-18091 > Project: Flink > Issue Type: Sub-task > Components: Tests >Reporter: Stephan Ewen >Assignee: Congxian Qiu(klion26) >Priority: Major > Labels: release-testing > Fix For: 1.11.0 > > > The test should do the following: > * take a savepoint. needs to make sure the job has enough state that there > is more than just the "_metadata" file > * copy it to another directory > * start the job from that savepoint by addressing the metadata file and by > addressing the savepoint directory > We should also test that an incremental checkpoint that gets moved fails with > a reasonable exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18165) When savingpoint is restored, select the checkpoint directory and stateBackend
[ https://issues.apache.org/jira/browse/FLINK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129148#comment-17129148 ] Congxian Qiu(klion26) commented on FLINK-18165: --- [~liyu] I agree that we could switch state backends easily through savepoints if this were implemented. I'm willing to help implement it if we want to support this in a future release. > When savingpoint is restored, select the checkpoint directory and stateBackend > -- > > Key: FLINK-18165 > URL: https://issues.apache.org/jira/browse/FLINK-18165 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Affects Versions: 1.9.0 > Environment: flink 1.9 >Reporter: Xinyuan Liu >Priority: Major > > If a checkpoint file is used as the initial state for a savepoint-style restore, > it must be ensured that the state backend used before and after is of the same > type. But in actual production, state keeps growing, the taskmanager memory > becomes insufficient, the cluster cannot be expanded, and the state backend > needs to be switched at that point, while still ensuring data consistency. > Unfortunately, Flink currently does not provide an elegant way to switch the > state backend; could the community consider this proposal? -- This message was sent by Atlassian Jira (v8.3.4#803005)
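For context, the workflow users are asking for would look roughly like the sketch below: reconfigure the backend and resubmit from a savepoint via {{bin/flink run -s <savepointPath>}}. Today the backend type generally has to stay the same across the restore, which is exactly the restriction this ticket proposes to lift. Paths and names are placeholders, and the flink-statebackend-rocksdb dependency is assumed:

{code:java}
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public final class SwitchBackendExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // The job previously ran with a heap-based backend; after the proposed
        // change, resuming the same savepoint with RocksDB would be supported.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        env.fromElements(1, 2, 3).print(); // placeholder pipeline
        env.execute("switched-backend-job");
    }
}
{code}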
[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints
[ https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129144#comment-17129144 ] Congxian Qiu(klion26) commented on FLINK-18091: --- [~aljoscha] We have an IT case for this: {{SavepointITCase#testTriggerSavepointAndResumeWithFileBasedCheckpointsAndRelocateBasePath}} > Test Relocatable Savepoints > --- > > Key: FLINK-18091 > URL: https://issues.apache.org/jira/browse/FLINK-18091 > Project: Flink > Issue Type: Sub-task > Components: Tests >Reporter: Stephan Ewen >Assignee: Congxian Qiu(klion26) >Priority: Major > Labels: release-testing > Fix For: 1.11.0 > > > The test should do the following: > * take a savepoint. needs to make sure the job has enough state that there > is more than just the "_metadata" file > * copy it to another directory > * start the job from that savepoint by addressing the metadata file and by > addressing the savepoint directory > We should also test that an incremental checkpoint that gets moved fails with > a reasonable exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17479) Occasional checkpoint failure due to null pointer exception in Flink version 1.10
[ https://issues.apache.org/jira/browse/FLINK-17479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128948#comment-17128948 ] Congxian Qiu(klion26) commented on FLINK-17479: --- [~nobleyd] thanks for tracking it down. If you confirm this is a JDK problem, could you please close this issue? > Occasional checkpoint failure due to null pointer exception in Flink version > 1.10 > - > > Key: FLINK-17479 > URL: https://issues.apache.org/jira/browse/FLINK-17479 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.0 > Environment: Flink1.10.0 > jdk1.8.0_60 >Reporter: nobleyd >Priority: Major > Attachments: image-2020-04-30-18-44-21-630.png, > image-2020-04-30-18-55-53-779.png > > > I upgraded the standalone cluster (3 machines) from Flink 1.9 to the latest > Flink 1.10.0. My job ran normally on Flink 1.9 for about half a year, but some > jobs fail due to a null pointer exception during checkpointing on Flink 1.10.0. > Below is the exception log: > !image-2020-04-30-18-55-53-779.png! > I have checked StreamTask (line 882), shown below. I think the only case > that can lead to a null pointer exception there is checkpointMetaData being null. > !image-2020-04-30-18-44-21-630.png! > I do not know why; can anyone help me? The problem only occurs in > Flink 1.10.0 for now; it works well on Flink 1.9. I also give some conf > info below (some of it different from the defaults), in case it is > caused by a configuration mistake. > Some conf of my Flink 1.10.0: > > {code:java} > taskmanager.memory.flink.size: 71680m > taskmanager.memory.framework.heap.size: 512m > taskmanager.memory.framework.off-heap.size: 512m > taskmanager.memory.task.off-heap.size: 17920m > taskmanager.memory.managed.size: 512m > taskmanager.memory.jvm-metaspace.size: 512m > taskmanager.memory.network.fraction: 0.1 > taskmanager.memory.network.min: 1024mb > taskmanager.memory.network.max: 1536mb > taskmanager.memory.segment-size: 128kb > rest.port: 8682 > historyserver.web.port: 8782 high-availability.jobmanager.port: > 13141,13142,13143,13144 > blob.server.port: 13146,13147,13148,13149 taskmanager.rpc.port: > 13151,13152,13153,13154 > taskmanager.data.port: 13156 metrics.internal.query-service.port: > 13161,13162,13163,13164,13166,13167,13168,13169 env.java.home: > /usr/java/jdk1.8.0_60/bin/java > env.pid.dir: /home/work/flink-1.10.0{code} > > Hope someone can help me solve it. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18196) flink throws `NullPointerException` when executeCheckpointing
[ https://issues.apache.org/jira/browse/FLINK-18196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128947#comment-17128947 ] Congxian Qiu(klion26) commented on FLINK-18196: --- [~yuchuanchen] thanks for creating this issue, maybe this is relevant to FLINK-17479, could you please try to change a different JDK version and see whether this problem has been fixed? > flink throws `NullPointerException` when executeCheckpointing > - > > Key: FLINK-18196 > URL: https://issues.apache.org/jira/browse/FLINK-18196 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.0, 1.10.1 >Reporter: Kai Chen >Priority: Major > > I meet checkpoint NPE when executing wordcount example: > java.lang.Exception: Could not perform checkpoint 5505 for operator Source: > KafkaTableSource(xxx) > SourceConversion(table=[xxx, source: > [KafkaTableSource(xxx)]], fields=[xxx]) > Calc(select=[xxx) AS xxx]) > > SinkConversionToTuple2 -- > Sink: Elasticsearch6UpsertTableSink(xxx) (1/1). > at > org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:802) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:777) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$228/1024478318.call(Unknown > Source) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87) > at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78) > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261) > at > org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:487) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1411) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:991) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:887) > at > org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$229/1010499540.run(Unknown > Source) > at > org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:860) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:793) > ... 
12 more > > flink version: flink-1.10.0 flink-1.10.1 > checkpoint configuration > |Checkpointing Mode|Exactly Once| > |Interval|5s| > |Timeout|10m 0s| > |Minimum Pause Between Checkpoints|0ms| > |Maximum Concurrent Checkpoints|1| > With debug enabled, I found `checkpointMetaData` is null at > > [https://github.com/apache/flink/blob/release-1.10.0/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L1421] > > I fixed this with this patch: > [https://github.com/yuchuanchen/flink/commit/e5122d9787be1fee9bce141887e0d70c9b0a4f19] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17468) Provide more detailed metrics why asynchronous part of checkpoint is taking long time
[ https://issues.apache.org/jira/browse/FLINK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128945#comment-17128945 ] Congxian Qiu(klion26) commented on FLINK-17468: --- Kindly ping [~qqibrow] > Provide more detailed metrics why asynchronous part of checkpoint is taking > long time > - > > Key: FLINK-17468 > URL: https://issues.apache.org/jira/browse/FLINK-17468 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing, Runtime / Metrics, Runtime / > State Backends >Affects Versions: 1.10.0 >Reporter: Piotr Nowojski >Priority: Major > > As [reported by > users|https://lists.apache.org/thread.html/r0833452796ca7d1c9d5e35c110089c95cfdadee9d81884a13613a4ce%40%3Cuser.flink.apache.org%3E] > it's not obvious why asynchronous part of checkpoint is taking so long time. > Maybe we could provide some more detailed metrics/UI/logs about uploading > files, materializing meta data, or other things that are happening during the > asynchronous checkpoint process? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18091) Test Relocatable Savepoints
[ https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126778#comment-17126778 ] Congxian Qiu(klion26) edited comment on FLINK-18091 at 6/6/20, 5:13 AM: [~sewen] [~liyu] Do you have any other concerns about this issue? If not, I'll close it. Thanks. was (Author: klion26): I'll close this issue in 6.10(Wednesday) if it's ok to you. [~sewen] [~liyu] > Test Relocatable Savepoints > --- > > Key: FLINK-18091 > URL: https://issues.apache.org/jira/browse/FLINK-18091 > Project: Flink > Issue Type: Sub-task > Components: Tests >Reporter: Stephan Ewen >Assignee: Congxian Qiu(klion26) >Priority: Major > Labels: release-testing > Fix For: 1.11.0 > > > The test should do the following: > * take a savepoint. needs to make sure the job has enough state that there > is more than just the "_metadata" file > * copy it to another directory > * start the job from that savepoint by addressing the metadata file and by > addressing the savepoint directory > We should also test that an incremental checkpoint that gets moved fails with > a reasonable exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18112) Single Task Failure Recovery Prototype
[ https://issues.apache.org/jira/browse/FLINK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127206#comment-17127206 ] Congxian Qiu(klion26) commented on FLINK-18112: --- [~ym] Thanks for creating the ticket, I think this feature can be very helpful for some scenarios. I'm interested in it. Is there any public documentation for this feature? Thanks > Single Task Failure Recovery Prototype > -- > > Key: FLINK-18112 > URL: https://issues.apache.org/jira/browse/FLINK-18112 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing, Runtime / Coordination, Runtime > / Network >Affects Versions: 1.12.0 >Reporter: Yuan Mei >Assignee: Yuan Mei >Priority: Major > Fix For: 1.12.0 > > > Build a prototype of single task failure recovery to address and answer the > following questions: > *Step 1*: Scheduling part, restart a single node without restarting the > upstream or downstream nodes. > *Step 2*: Checkpointing part; based on my understanding of how regional failover > works, this part might not need modification. > *Step 3*: Network part > - how the recovered node is able to link to the upstream ResultPartitions, and > continue getting data > - how the downstream node is able to link to the recovered node, and continue > getting data > - how different Netty transport modes affect the results > - what if the failed node's buffered data pool is full > *Step 4*: Failover process verification -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18091) Test Relocatable Savepoints
[ https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126778#comment-17126778 ] Congxian Qiu(klion26) edited comment on FLINK-18091 at 6/5/20, 1:52 PM: I'll close this issue in 6.10(Wednesday) if it's ok to you. [~sewen] [~liyu] was (Author: klion26): I'll close this issue in 6.10(Wednesday) if the result ok to you. [~sewen] [~liyu] > Test Relocatable Savepoints > --- > > Key: FLINK-18091 > URL: https://issues.apache.org/jira/browse/FLINK-18091 > Project: Flink > Issue Type: Sub-task > Components: Tests >Reporter: Stephan Ewen >Assignee: Congxian Qiu(klion26) >Priority: Major > Labels: release-testing > Fix For: 1.11.0 > > > The test should do the following: > * take a savepoint. needs to make sure the job has enough state that there > is more than just the "_metadata" file > * copy it to another directory > * start the job from that savepoint by addressing the metadata file and by > addressing the savepoint directory > We should also test that an incremental checkpoint that gets moved fails with > a reasonable exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints
[ https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126778#comment-17126778 ] Congxian Qiu(klion26) commented on FLINK-18091: --- I'll close this issue in 6.10(Wednesday) if the result ok to you. [~sewen] [~liyu] > Test Relocatable Savepoints > --- > > Key: FLINK-18091 > URL: https://issues.apache.org/jira/browse/FLINK-18091 > Project: Flink > Issue Type: Sub-task > Components: Tests >Reporter: Stephan Ewen >Assignee: Congxian Qiu(klion26) >Priority: Major > Labels: release-testing > Fix For: 1.11.0 > > > The test should do the following: > * take a savepoint. needs to make sure the job has enough state that there > is more than just the "_metadata" file > * copy it to another directory > * start the job from that savepoint by addressing the metadata file and by > addressing the savepoint directory > We should also test that an incremental checkpoint that gets moved fails with > a reasonable exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18091) Test Relocatable Savepoints
[ https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126769#comment-17126769 ] Congxian Qiu(klion26) edited comment on FLINK-18091 at 6/5/20, 1:52 PM: Tested on a real cluster; results as expected. * a relocated savepoint can be restored successfully * a relocated checkpoint fails with a FileNotFoundException commit : 8ca67c962a5575aba3a43b8cfd4a79ffc8069bd4 The full log is attached below: {{_username/ip/port and other sensitive information has been masked._}} # For Savepoint {code:java} [~/flink-1.11-SNAPSHOT]$ ./bin/flink savepoint 9bcc2546a841b36a39c46fbe13a2b631 hdfs:///user/xx/congxianqiu/savepoint -yid application_1591259429117_0007 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/data/work/congxianqiu/flink-1.11-SNAPSHOT/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/data/xx/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] 2020-06-05 20:27:43,039 WARN org.apache.flink.yarn.configuration.YarnLogConfigUtil [] - The configuration directory ('/data/work/congxianqiu/flink-1.11-SNAPSHOT/conf') already contains a LOG4J config file.If you want to use logback, then please delete or rename the log configuration file. 2020-06-05 20:27:43,422 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar 2020-06-05 20:27:43,513 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface 10-215-128-84:35572 of application 'application_1591259429117_0007'. Triggering savepoint for job 9bcc2546a841b36a39c46fbe13a2b631. Waiting for response... Savepoint completed. Path: hdfs://ip:port/user/xx/congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 You can resume your program from this savepoint with the run command. 
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint Found 1 items drwxr-xr-x - xx supergroup 0 2020-06-05 20:27 congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 [~/congxianqiu/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 Found 2 items -rw-r--r-- 3 xx supergroup 74170 2020-06-05 20:27 congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a -rw-r--r-- 3 xx supergroup 1205 2020-06-05 20:27 congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/_metadata [~/flink-1.11-SNAPSHOT]$ hadoop fs -mv congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 congxianqiu/savepoint/newsavepointpath [ ~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint Found 1 items drwxr-xr-x - xx supergroup 0 2020-06-05 20:27 congxianqiu/savepoint/newsavepointpath [~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint/newsavepointpath Found 2 items -rw-r--r-- 3 xx supergroup 74170 2020-06-05 20:27 congxianqiu/savepoint/newsavepointpath/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a -rw-r--r-- 3 xx supergroup 1205 2020-06-05 20:27 congxianqiu/savepoint/newsavepointpath/_metadata [~/flink-1.11-SNAPSHOT]$ ./bin/flink run -s hdfs:///user/xx/congxianqiu/newsavepointpath/_metadata -m yarn-cluster -c com.klion26.data.FlinkDemo /data/work/congxianqiu/flink-1.11-SNAPSHOT/ft_local/Flink-Demo-1.0-SNAPSHOT.jar SLF4J: Class path contains multiple SLF4J bindings. >> jobmanager.log: restored successfully, can do checkpoints successfully. 2020-06-05 21:11:10,053 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Starting job b2fbfa6527391f035e8eebd791c2f64e from savepoint hdfs:///user/xx/congxianqiu/savepoint/newsavepointpath/_metadata () 2020-06-05 21:11:10,198 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Reset the checkpoint ID of job b2fbfa6527391f035e8eebd791c2f64e to 3. 2020-06-05 21:11:10,198 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job b2fbfa6527391f035e8eebd791c2f64e from latest valid checkpoint: Checkpoint 2 @ 0 for b2fbfa6527391f035e8eebd791c2f64e. 2020-06-05 21:11:10,206 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master state to restore ... 2020-06-05 21:11:16,117 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint triggering task Source: Custom Source (1/1) of job b2fbfa6527391f035e8eebd791c2f64e is not in state RUNNING but SCHEDULED instead. Aborting checkpoint. 2020-06-05 21:11:19,456 INFO org.apache.flink.yarn.Y
[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints
[ https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126769#comment-17126769 ] Congxian Qiu(klion26) commented on FLINK-18091: --- Tested on a real cluster; results as expected. * a relocated savepoint can be restored successfully * a relocated checkpoint fails with a FileNotFoundException The full log is attached below: {{_username/ip/port and other sensitive information has been masked._}} # For Savepoint {code:java} [~/flink-1.11-SNAPSHOT]$ ./bin/flink savepoint 9bcc2546a841b36a39c46fbe13a2b631 hdfs:///user/xx/congxianqiu/savepoint -yid application_1591259429117_0007 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/data/work/congxianqiu/flink-1.11-SNAPSHOT/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/data/xx/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] 2020-06-05 20:27:43,039 WARN org.apache.flink.yarn.configuration.YarnLogConfigUtil [] - The configuration directory ('/data/work/congxianqiu/flink-1.11-SNAPSHOT/conf') already contains a LOG4J config file.If you want to use logback, then please delete or rename the log configuration file. 2020-06-05 20:27:43,422 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar 2020-06-05 20:27:43,513 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface 10-215-128-84:35572 of application 'application_1591259429117_0007'. Triggering savepoint for job 9bcc2546a841b36a39c46fbe13a2b631. Waiting for response... Savepoint completed. Path: hdfs://ip:port/user/xx/congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 You can resume your program from this savepoint with the run command. [~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint Found 1 items drwxr-xr-x - xx supergroup 0 2020-06-05 20:27 congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 [~/congxianqiu/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 Found 2 items -rw-r--r-- 3 xx supergroup 74170 2020-06-05 20:27 congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a -rw-r--r-- 3 xx supergroup 1205 2020-06-05 20:27 congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/_metadata [~/flink-1.11-SNAPSHOT]$ hadoop fs -mv congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 congxianqiu/savepoint/newsavepointpath [ ~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint Found 1 items drwxr-xr-x - xx supergroup 0 2020-06-05 20:27 congxianqiu/savepoint/newsavepointpath [~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint/newsavepointpath Found 2 items -rw-r--r-- 3 xx supergroup 74170 2020-06-05 20:27 congxianqiu/savepoint/newsavepointpath/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a -rw-r--r-- 3 xx supergroup 1205 2020-06-05 20:27 congxianqiu/savepoint/newsavepointpath/_metadata [~/flink-1.11-SNAPSHOT]$ ./bin/flink run -s hdfs:///user/xx/congxianqiu/newsavepointpath/_metadata -m yarn-cluster -c com.klion26.data.FlinkDemo /data/work/congxianqiu/flink-1.11-SNAPSHOT/ft_local/Flink-Demo-1.0-SNAPSHOT.jar SLF4J: Class path contains multiple SLF4J bindings. 
>> jobmanager.log 2020-06-05 21:11:10,053 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Starting job b2fbfa6527391f035e8eebd791c2f64e from savepoint hdfs:///user/xx/congxianqiu/savepoint/newsavepointpath/_metadata () 2020-06-05 21:11:10,198 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Reset the checkpoint ID of job b2fbfa6527391f035e8eebd791c2f64e to 3. 2020-06-05 21:11:10,198 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Restoring job b2fbfa6527391f035e8eebd791c2f64e from latest valid checkpoint: Checkpoint 2 @ 0 for b2fbfa6527391f035e8eebd791c2f64e. 2020-06-05 21:11:10,206 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - No master state to restore ... 2020-06-05 21:11:16,117 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint triggering task Source: Custom Source (1/1) of job b2fbfa6527391f035e8eebd791c2f64e is not in state RUNNING but SCHEDULED instead. Aborting checkpoint. 2020-06-05 21:11:19,456 INFO org.apache.flink.yarn.YarnResourceManager [] - Registering TaskManager with ResourceID container_e18_1591259429117_0019_01_02 (akka.tcp://flink@10-215-
[jira] [Commented] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126428#comment-17126428 ] Congxian Qiu(klion26) commented on FLINK-17571: --- [~wind_ljy] The reason we don't want to add a tool to clean up the orphan files is that we can't clean the files safely: more than one job may reference the same file. But if you can guarantee that there will never be more than one job recovering from one checkpoint, you can use such a tool to clean the orphan files. As a further step, you can add the tool to crontab so that it deletes the orphan files regularly. PS: maybe you can first move the orphan files to some other place and delete them one day (or some other duration) later, in case the wrong files were selected (see the sketch below). > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Command Line Client, Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] > Currently, there are [three types of > directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] > for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be > deleted safely, but users can't delete the files in the SHARED directory > safely (the files may have been created a long time ago). > I think it's better to give users a way to know which files are > currently used (so the others are not). > Maybe a command-line command such as the one below would be enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
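To make the "move first, delete later" idea above concrete, here is a minimal sketch of such a cleanup helper. It is hypothetical (not part of Flink; the class name, trash-directory layout, and retention policy are all assumptions) and only uses Hadoop's standard FileSystem API:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Hypothetical helper: quarantine suspected orphan files first, purge them later. */
public class OrphanFileTrash {

    /** Move a suspected orphan file into a trash directory instead of deleting it. */
    public static void moveToTrash(FileSystem fs, Path orphan, Path trashDir) throws IOException {
        fs.mkdirs(trashDir);
        fs.rename(orphan, new Path(trashDir, orphan.getName()));
    }

    /** Delete trash entries older than retentionMillis; intended to be run from crontab. */
    public static void purge(FileSystem fs, Path trashDir, long retentionMillis) throws IOException {
        long cutoff = System.currentTimeMillis() - retentionMillis;
        for (FileStatus status : fs.listStatus(trashDir)) {
            if (status.getModificationTime() < cutoff) {
                fs.delete(status.getPath(), true);
            }
        }
    }
}
{code}

Fed with the files reported as unused by the proposed {{./bin/flink checkpoint list $checkpointDir}} command, this keeps a one-day (or longer) safety window before anything is actually removed.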
[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints
[ https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125483#comment-17125483 ] Congxian Qiu(klion26) commented on FLINK-18091: --- [~liyu] Thanks for the notice; please assign this to me. I will test it on a real cluster and reply with the result here. > Test Relocatable Savepoints > --- > > Key: FLINK-18091 > URL: https://issues.apache.org/jira/browse/FLINK-18091 > Project: Flink > Issue Type: Sub-task > Components: Tests >Reporter: Stephan Ewen >Priority: Major > Labels: release-testing > Fix For: 1.11.0 > > > The test should do the following: > * take a savepoint; needs to make sure the job has enough state that there > is more than just the "_metadata" file > * copy it to another directory > * start the job from that savepoint by addressing the metadata file and by > addressing the savepoint directory > We should also test that an incremental checkpoint that gets moved fails with > a reasonable exception. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18038) StateBackendLoader logs application-defined state before it is fully configured
[ https://issues.apache.org/jira/browse/FLINK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17120765#comment-17120765 ] Congxian Qiu(klion26) commented on FLINK-18038: --- I think logging the configuration that is finally used is useful for the user (see the sketch below). > StateBackendLoader logs application-defined state before it is fully > configured > --- > > Key: FLINK-18038 > URL: https://issues.apache.org/jira/browse/FLINK-18038 > Project: Flink > Issue Type: Bug > Components: Runtime / State Backends >Affects Versions: 1.9.1 >Reporter: Steve Bairos >Priority: Trivial > > In the > [StateBackendLoader|https://github.com/apache/flink/blob/bb46756b84940a6134910e74406bfaff4f2f37e9/flink-runtime/src/main/java/org/apache/flink/runtime/state/StateBackendLoader.java#L201], > there's this log line: > {code:java} > logger.info("Using application-defined state backend: {}", fromApplication); > {code} > It seems like this is inaccurate though, because immediately after logging > this, if fromApplication is a ConfigurableStateBackend, we call the > .configure() function and it is replaced by a newly configured StateBackend. > To me, it seems like it would be better if we logged the state backend after > it was fully configured. In the current setup, we get confusing logs like > this: > {code:java} > 2020-05-29 21:39:44,387 INFO > org.apache.flink.streaming.runtime.tasks.StreamTask - Using > application-defined state backend: > RocksDBStateBackend{checkpointStreamBackend=File State Backend (checkpoints: > 's3://pinterest-montreal/checkpoints/xenon-dev-001-20191210/Xenon/BasicJavaStream', > savepoints: 'null', asynchronous: UNDEFINED, fileStateThreshold: -1), > localRocksDbDirectories=null, enableIncrementalCheckpointing=UNDEFINED, > numberOfTransferingThreads=-1}2020-05-29 21:39:44,387 INFO > org.apache.flink.streaming.runtime.tasks.StreamTask - Configuring > application-defined state backend with job/cluster config{code} > This makes it ambiguous whether settings in our flink-conf.yaml like > "state.backend.incremental: true" are being applied properly. > > I can make a diff for the change if there aren't any objections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
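For illustration, a minimal sketch of the reordering suggested above (a hypothetical helper, not the actual StateBackendLoader code; the {{configure}} signature may differ slightly between Flink versions):

{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.ConfigurableStateBackend;
import org.apache.flink.runtime.state.StateBackend;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ConfiguredBackendLogging {
    private static final Logger LOG = LoggerFactory.getLogger(ConfiguredBackendLogging.class);

    /** Apply the job/cluster config first, then log the backend that is actually used. */
    static StateBackend configureThenLog(
            StateBackend fromApplication, Configuration config, ClassLoader classLoader) throws Exception {
        StateBackend backend = fromApplication;
        if (backend instanceof ConfigurableStateBackend) {
            // flink-conf.yaml settings such as state.backend.incremental are merged in here
            backend = ((ConfigurableStateBackend) backend).configure(config, classLoader);
        }
        LOG.info("Using application-defined state backend: {}", backend);
        return backend;
    }
}
{code}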
[jira] [Commented] (FLINK-15507) Activate local recovery for RocksDB backends by default
[ https://issues.apache.org/jira/browse/FLINK-15507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119388#comment-17119388 ] Congxian Qiu(klion26) commented on FLINK-15507: --- First, enabling local recovery for incremental checkpoints has no overhead, and I agree with [~Zakelly] that we should enable incremental checkpoints by default (which gives better performance). For non-incremental checkpoints with local recovery enabled, we'll consume more disk space (only the files of the latest successful checkpoint will be kept after FLINK-8871 is merged) but will get faster recovery on failover. So from my side, enabling local recovery for the RocksDB StateBackend (incremental & full snapshot) by default is fine. > Activate local recovery for RocksDB backends by default > --- > > Key: FLINK-15507 > URL: https://issues.apache.org/jira/browse/FLINK-15507 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Reporter: Stephan Ewen >Assignee: Zakelly Lan >Priority: Blocker > Labels: pull-request-available > Fix For: 1.11.0 > > > For the RocksDB state backend, local recovery has no overhead when > incremental checkpoints are used. > It should be activated by default, because it greatly helps with recovery. -- This message was sent by Atlassian Jira (v8.3.4#803005)
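For reference, a minimal sketch of turning both options on explicitly today, using the standard {{CheckpointingOptions}} keys (the class and method names below are real Flink APIs; what this ticket proposes is making these values the defaults):

{code:java}
import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;

public class LocalRecoverySettings {
    /** Build a config with what this ticket proposes as defaults. */
    public static Configuration withLocalRecoveryAndIncremental() {
        Configuration conf = new Configuration();
        conf.setBoolean(CheckpointingOptions.LOCAL_RECOVERY, true);          // state.backend.local-recovery: true
        conf.setBoolean(CheckpointingOptions.INCREMENTAL_CHECKPOINTS, true); // state.backend.incremental: true
        return conf;
    }
}
{code}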
[jira] [Commented] (FLINK-16572) CheckPubSubEmulatorTest is flaky on Azure
[ https://issues.apache.org/jira/browse/FLINK-16572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17118239#comment-17118239 ] Congxian Qiu(klion26) commented on FLINK-16572: --- Seems to be another instance: [https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/2296/logs/133] > CheckPubSubEmulatorTest is flaky on Azure > - > > Key: FLINK-16572 > URL: https://issues.apache.org/jira/browse/FLINK-16572 > Project: Flink > Issue Type: Bug > Components: Connectors / Google Cloud PubSub, Tests >Affects Versions: 1.11.0 >Reporter: Aljoscha Krettek >Assignee: Richard Deurwaarder >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.11.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Log: > https://dev.azure.com/aljoschakrettek/Flink/_build/results?buildId=56&view=logs&j=1f3ed471-1849-5d3c-a34c-19792af4ad16&t=ce095137-3e3b-5f73-4b79-c42d3d5f8283&l=7842 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17926) Can't build flink-web docker image because of EOL of Ubuntu:18.10
[ https://issues.apache.org/jira/browse/FLINK-17926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17115992#comment-17115992 ] Congxian Qiu(klion26) commented on FLINK-17926: --- [~rmetzger] What do you think about this? Please assign it to me if you think this is valid and the proposal is OK. > Can't build flink-web docker image because of EOL of Ubuntu:18.10 > - > > Key: FLINK-17926 > URL: https://issues.apache.org/jira/browse/FLINK-17926 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Congxian Qiu(klion26) >Priority: Major > > Currently, the Dockerfile[1] in the flink-web project is broken because of the > EOL of Ubuntu 18.10[2]; you will encounter errors such as the ones below when > executing {{./run.sh}} > {code:java} > Err:3 http://security.ubuntu.com/ubuntu cosmic-security Release > 404 Not Found [IP: 91.189.88.152 80] > Ign:4 http://archive.ubuntu.com/ubuntu cosmic-updates InRelease > Ign:5 http://archive.ubuntu.com/ubuntu cosmic-backports InRelease > Err:6 http://archive.ubuntu.com/ubuntu cosmic Release > 404 Not Found [IP: 91.189.88.142 80] > Err:7 http://archive.ubuntu.com/ubuntu cosmic-updates Release > 404 Not Found [IP: 91.189.88.142 80] > Err:8 http://archive.ubuntu.com/ubuntu cosmic-backports Release > 404 Not Found [IP: 91.189.88.142 80] > Reading package lists... > {code} > The current LTS versions can be found on the release website[2]. > The Apache Flink docker image uses fedora:28[3], so it is unaffected. > As Fedora does not have LTS releases[4], I propose to keep using Ubuntu for the website > and change the version from {{18.10}} to the closest LTS version > {{18.04}}; tried locally, it works successfully. > [1] > [https://github.com/apache/flink-web/blob/bc66f0f0f463ab62a22e81df7d7efd301b76a6b4/docker/Dockerfile#L17] > [2] [https://wiki.ubuntu.com/Releases] > > [3]https://github.com/apache/flink/blob/e92b2bf19bdf03ad3bae906dc5fa3781aeddb3ee/docs/docker/Dockerfile#L17 > [4] > https://fedoraproject.org/wiki/Fedora_Release_Life_Cycle#Maintenance_Schedule -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17926) Can't build flink-web docker image because of EOL of Ubuntu:18.10
Congxian Qiu(klion26) created FLINK-17926: - Summary: Can't build flink-web docker image because of EOL of Ubuntu:18.10 Key: FLINK-17926 URL: https://issues.apache.org/jira/browse/FLINK-17926 Project: Flink Issue Type: Bug Components: Project Website Reporter: Congxian Qiu(klion26) Currently, the Dockerfile[1] in the flink-web project is broken because of the EOL of Ubuntu 18.10[2]; you will encounter errors such as the ones below when executing {{./run.sh}} {code:java} Err:3 http://security.ubuntu.com/ubuntu cosmic-security Release 404 Not Found [IP: 91.189.88.152 80] Ign:4 http://archive.ubuntu.com/ubuntu cosmic-updates InRelease Ign:5 http://archive.ubuntu.com/ubuntu cosmic-backports InRelease Err:6 http://archive.ubuntu.com/ubuntu cosmic Release 404 Not Found [IP: 91.189.88.142 80] Err:7 http://archive.ubuntu.com/ubuntu cosmic-updates Release 404 Not Found [IP: 91.189.88.142 80] Err:8 http://archive.ubuntu.com/ubuntu cosmic-backports Release 404 Not Found [IP: 91.189.88.142 80] Reading package lists... {code} The current LTS versions can be found on the release website[2]. The Apache Flink docker image uses fedora:28[3], so it is unaffected. As Fedora does not have LTS releases[4], I propose to keep using Ubuntu for the website and change the version from {{18.10}} to the closest LTS version {{18.04}}; tried locally, it works successfully. [1] [https://github.com/apache/flink-web/blob/bc66f0f0f463ab62a22e81df7d7efd301b76a6b4/docker/Dockerfile#L17] [2] [https://wiki.ubuntu.com/Releases] [3]https://github.com/apache/flink/blob/e92b2bf19bdf03ad3bae906dc5fa3781aeddb3ee/docs/docker/Dockerfile#L17 [4] https://fedoraproject.org/wiki/Fedora_Release_Life_Cycle#Maintenance_Schedule -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-13009) YARNSessionCapacitySchedulerITCase#testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots th
[ https://issues.apache.org/jira/browse/FLINK-13009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17115896#comment-17115896 ] Congxian Qiu(klion26) commented on FLINK-13009: --- [~fly_in_gis] Thanks for the analysis and update. It's fine with me to close this ticket. Just want to confirm: how will we know/verify this is fixed after a new bugfix version of Hadoop has been released? Is there any issue tracking this? > YARNSessionCapacitySchedulerITCase#testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots > throws NPE on Travis > - > > Key: FLINK-13009 > URL: https://issues.apache.org/jira/browse/FLINK-13009 > Project: Flink > Issue Type: Test > Components: Deployment / YARN, Tests >Affects Versions: 1.8.0 >Reporter: Congxian Qiu(klion26) >Priority: Critical > Labels: test-stability > Fix For: 1.11.0 > > > The test > {{YARNSessionCapacitySchedulerITCase#testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots}} > throws NPE on Travis. > The NPE is thrown from RMAppAttemptMetrics.java#128, and the following is the code > from hadoop-2.8.3[1] > {code:java} > // Only add in the running containers if this is the active attempt. > 128 RMAppAttempt currentAttempt = rmContext.getRMApps() > 129 .get(attemptId.getApplicationId()).getCurrentAppAttempt(); > {code} > > log [https://api.travis-ci.org/v3/job/550689578/log.txt] > [1] > [https://github.com/apache/hadoop/blob/b3fe56402d908019d99af1f1f4fc65cb1d1436a2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java#L128] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17468) Provide more detailed metrics why asynchronous part of checkpoint is taking long time
[ https://issues.apache.org/jira/browse/FLINK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113726#comment-17113726 ] Congxian Qiu(klion26) commented on FLINK-17468: --- [~qqibrow] How's it going? Do you need any help? > Provide more detailed metrics why asynchronous part of checkpoint is taking > long time > - > > Key: FLINK-17468 > URL: https://issues.apache.org/jira/browse/FLINK-17468 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing, Runtime / Metrics, Runtime / > State Backends >Affects Versions: 1.10.0 >Reporter: Piotr Nowojski >Priority: Major > > As [reported by > users|https://lists.apache.org/thread.html/r0833452796ca7d1c9d5e35c110089c95cfdadee9d81884a13613a4ce%40%3Cuser.flink.apache.org%3E] > it's not obvious why the asynchronous part of a checkpoint is taking such a long time. > Maybe we could provide some more detailed metrics/UI/logs about uploading > files, materializing metadata, or other things that are happening during the > asynchronous checkpoint process? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17351) CheckpointCoordinator and CheckpointFailureManager ignores checkpoint timeouts
[ https://issues.apache.org/jira/browse/FLINK-17351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113095#comment-17113095 ] Congxian Qiu(klion26) commented on FLINK-17351: --- Attaching another issue that wants to increase the count under more checkpoint failure reasons. > CheckpointCoordinator and CheckpointFailureManager ignores checkpoint timeouts > -- > > Key: FLINK-17351 > URL: https://issues.apache.org/jira/browse/FLINK-17351 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.9.2, 1.10.0 >Reporter: Piotr Nowojski >Assignee: Yuan Mei >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.0 > > > As described in point 2: > https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576 > (copy of description from above linked comment): > The logic in how {{CheckpointCoordinator}} handles checkpoint timeouts is > broken. In your [~qinjunjerry] examples, your job should have failed after the > first checkpoint failure, but checkpoints were timing out on the > CheckpointCoordinator after 5 seconds, before {{FlinkKafkaProducer}} detected > the Kafka failure after 2 minutes. Those timeouts were not checked > against the {{setTolerableCheckpointFailureNumber(...)}} limit, so the job > kept going with many timed-out checkpoints. Now a funny thing happens: > FlinkKafkaProducer detects the Kafka failure. The funny thing is that it depends on > where the failure was detected: > a) on processing a record? No problem, the job will fail over immediately once the > failure is detected (in this example after 2 minutes) > b) on a checkpoint? Heh, the failure is reported to {{CheckpointCoordinator}} > *and gets ignored, as the PendingCheckpoint has already been discarded 2 minutes > ago* :) So theoretically the checkpoints can keep failing forever and the job > will not restart automatically, unless something else fails. > Even more funny things can happen if we mix FLINK-17350, or b), with > intermittent external system failures. The sink reports an exception, the transaction > was lost/aborted, the sink is in a failed state, but if by a happy > coincidence it manages to accept further records, this exception can be > lost and all of the records in those failed checkpoints will be lost forever > as well. In all of the examples that [~qinjunjerry] posted this hasn't > happened. {{FlinkKafkaProducer}} was not able to recover after the initial > failure and kept throwing exceptions until the job finally failed (but > much later than it should have). And that's not guaranteed anywhere. -- This message was sent by Atlassian Jira (v8.3.4#803005)
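For reference, a minimal sketch of the tolerance limit discussed in the description above (the {{CheckpointConfig}} API is real and available since Flink 1.9; the interval and threshold values here are arbitrary examples):

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerableFailureExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L); // checkpoint every minute
        // Fail the job once more than 3 checkpoints have failed. Per this ticket,
        // checkpoints that merely timed out were not counted against this limit.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}
{code}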
[jira] [Updated] (FLINK-11937) Resolve small file problem in RocksDB incremental checkpoint
[ https://issues.apache.org/jira/browse/FLINK-11937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-11937: -- Fix Version/s: (was: 1.11.0) 1.12.0 > Resolve small file problem in RocksDB incremental checkpoint > > > Key: FLINK-11937 > URL: https://issues.apache.org/jira/browse/FLINK-11937 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Assignee: Congxian Qiu(klion26) >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently, when incremental checkpointing is enabled in RocksDBStateBackend, a > separate file will be generated on DFS for each sst file. This may cause a > “file flood” when running intensive workloads (many jobs with high > parallelism) in a big cluster. According to our observations in Alibaba > production, such a file flood introduces at least two drawbacks when using HDFS > as the checkpoint storage FileSystem: 1) a huge number of RPC requests issued to the > NN, which may burst its response queue; 2) a huge number of files puts big > pressure on the NN’s on-heap memory. > We previously noticed a similar small-file flood problem in Flink and tried to > resolve it by introducing ByteStreamStateHandle (FLINK-2808), but this > solution has its limitations: if we configure the threshold too low there > will still be too many small files, while if too high the JM will eventually > OOM; thus it can hardly resolve the issue when using RocksDBStateBackend > with the incremental snapshot strategy. > We propose a new OutputStream called > FileSegmentCheckpointStateOutputStream (FSCSOS) to fix the problem. FSCSOS > will reuse the same underlying distributed file until its size exceeds a > preset threshold. We plan to complete the work in 3 steps: firstly introduce FSCSOS, secondly > resolve the specific storage amplification issue on FSCSOS, and lastly add an > option to reuse FSCSOS across multiple checkpoints to further reduce the DFS > file number. > For more details please refer to the attached design doc; a simplified sketch of the core idea follows below. -- This message was sent by Atlassian Jira (v8.3.4#803005)
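To illustrate the core FSCSOS idea (a sketch only; the real interface and names are defined in the attached design doc, and {{FileSegmentOutputStream}} here is a made-up simplification): many logical segments are appended to one shared file, and each state handle records only an (offset, length) pair into that file.

{code:java}
import java.io.IOException;
import java.io.OutputStream;

/** Sketch: write many logical state segments into one shared DFS file to avoid file flood. */
public class FileSegmentOutputStream {

    private final OutputStream sharedFile; // one underlying distributed file, reused
    private long position;                 // bytes written into the shared file so far

    public FileSegmentOutputStream(OutputStream sharedFile) {
        this.sharedFile = sharedFile;
    }

    /** Append one segment and return its (offset, length), which a state handle would reference. */
    public long[] writeSegment(byte[] stateBytes) throws IOException {
        long offset = position;
        sharedFile.write(stateBytes);
        position += stateBytes.length;
        return new long[] {offset, stateBytes.length};
    }
}
{code}

Once the shared file grows past the preset threshold, a real implementation would roll over to a new file; reusing the stream across several checkpoints (step 3) further cuts the number of DFS files and NN RPCs.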
[jira] [Updated] (FLINK-16077) Translate "Custom State Serialization" page into Chinese
[ https://issues.apache.org/jira/browse/FLINK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-16077: -- Fix Version/s: (was: 1.11.0) 1.12.0 > Translate "Custom State Serialization" page into Chinese > > > Key: FLINK-16077 > URL: https://issues.apache.org/jira/browse/FLINK-16077 > Project: Flink > Issue Type: Sub-task > Components: chinese-translation, Documentation >Affects Versions: 1.11.0 >Reporter: Yu Li >Assignee: Congxian Qiu(klion26) >Priority: Major > Fix For: 1.12.0 > > > Complete the translation in `docs/dev/stream/state/custom_serialization.zh.md` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-15747) Enable setting RocksDB log level from configuration
[ https://issues.apache.org/jira/browse/FLINK-15747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-15747: -- Fix Version/s: (was: 1.11.0) 1.12.0 > Enable setting RocksDB log level from configuration > --- > > Key: FLINK-15747 > URL: https://issues.apache.org/jira/browse/FLINK-15747 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Reporter: Yu Li >Assignee: Congxian Qiu(klion26) >Priority: Major > Fix For: 1.12.0 > > > Currently to open the RocksDB local log, one has to create a customized > {{OptionsFactory}}, which is not quite convenient. This JIRA proposes to > enable setting it from configuration in flink-conf.yaml. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16074) Translate the "Overview" page for "State & Fault Tolerance" into Chinese
[ https://issues.apache.org/jira/browse/FLINK-16074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-16074: -- Fix Version/s: (was: 1.11.0) 1.12.0 > Translate the "Overview" page for "State & Fault Tolerance" into Chinese > > > Key: FLINK-16074 > URL: https://issues.apache.org/jira/browse/FLINK-16074 > Project: Flink > Issue Type: Sub-task > Components: chinese-translation, Documentation >Affects Versions: 1.11.0 >Reporter: Yu Li >Assignee: Congxian Qiu(klion26) >Priority: Major > Fix For: 1.12.0 > > > Complete the translation in `docs/dev/stream/state/index.zh.md` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-5763) Make savepoints self-contained and relocatable
[ https://issues.apache.org/jira/browse/FLINK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17109933#comment-17109933 ] Congxian Qiu(klion26) commented on FLINK-5763: -- Add a release note. > Make savepoints self-contained and relocatable > -- > > Key: FLINK-5763 > URL: https://issues.apache.org/jira/browse/FLINK-5763 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Reporter: Ufuk Celebi >Assignee: Congxian Qiu(klion26) >Priority: Critical > Labels: pull-request-available, usability > Fix For: 1.11.0 > > Time Spent: 10m > Remaining Estimate: 0h > > After a user has triggered a savepoint, a single savepoint file will be > returned as a handle to the savepoint. A savepoint to {{}} creates a > savepoint file like {{/savepoint-}}. > This file contains the metadata of the corresponding checkpoint, but not the > actual program state. While this works well for short term management > (pause-and-resume a job), it makes it hard to manage savepoints over longer > periods of time. > h4. Problems > h5. Scattered Checkpoint Files > For file system based checkpoints (FsStateBackend, RocksDBStateBackend) this > results in the savepoint referencing files from the checkpoint directory > (usually different than ). For users, it is virtually impossible to > tell which checkpoint files belong to a savepoint and which are lingering > around. This can easily lead to accidentally invalidating a savepoint by > deleting checkpoint files. > h5. Savepoints Not Relocatable > Even if a user is able to figure out which checkpoint files belong to a > savepoint, moving these files will invalidate the savepoint as well, because > the metadata file references absolute file paths. > h5. Forced to Use CLI for Disposal > Because of the scattered files, the user is in practice forced to use Flink’s > CLI to dispose a savepoint. This should be possible to handle in the scope of > the user’s environment via a file system delete operation. > h4. Proposal > In order to solve the described problems, savepoints should contain all their > state, both metadata and program state, inside a single directory. > Furthermore the metadata must only hold relative references to the checkpoint > files. This makes it obvious which files make up the state of a savepoint and > it is possible to move savepoints around by moving the savepoint directory. > h5. Desired File Layout > Triggering a savepoint to {{}} creates a directory as follows: > {code} > /savepoint-- > +-- _metadata > +-- data- [1 or more] > {code} > We include the JobID in the savepoint directory name in order to give some > hints about which job a savepoint belongs to. > h5. CLI > - Trigger: When triggering a savepoint to {{}} the savepoint > directory will be returned as the handle to the savepoint. > - Restore: Users can restore by pointing to the directory or the _metadata > file. The data files should be required to be in the same directory as the > _metadata file. > - Dispose: The disposal command should be deprecated and eventually removed. > While deprecated, disposal can happen by specifying the directory or the > _metadata file (same as restore). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-5763) Make savepoints self-contained and relocatable
[ https://issues.apache.org/jira/browse/FLINK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-5763: - Release Note: After FLINK-5763, savepoints are self-contained and relocatable, so users can move a savepoint from one place to another without any additional manual processing. This is currently not supported when entropy injection is enabled. > Make savepoints self-contained and relocatable > -- > > Key: FLINK-5763 > URL: https://issues.apache.org/jira/browse/FLINK-5763 > Project: Flink > Issue Type: Improvement > Components: Runtime / State Backends >Reporter: Ufuk Celebi >Assignee: Congxian Qiu(klion26) >Priority: Critical > Labels: pull-request-available, usability > Fix For: 1.11.0 > > Time Spent: 10m > Remaining Estimate: 0h > > After a user has triggered a savepoint, a single savepoint file will be > returned as a handle to the savepoint. A savepoint to {{}} creates a > savepoint file like {{/savepoint-}}. > This file contains the metadata of the corresponding checkpoint, but not the > actual program state. While this works well for short term management > (pause-and-resume a job), it makes it hard to manage savepoints over longer > periods of time. > h4. Problems > h5. Scattered Checkpoint Files > For file system based checkpoints (FsStateBackend, RocksDBStateBackend) this > results in the savepoint referencing files from the checkpoint directory > (usually different than ). For users, it is virtually impossible to > tell which checkpoint files belong to a savepoint and which are lingering > around. This can easily lead to accidentally invalidating a savepoint by > deleting checkpoint files. > h5. Savepoints Not Relocatable > Even if a user is able to figure out which checkpoint files belong to a > savepoint, moving these files will invalidate the savepoint as well, because > the metadata file references absolute file paths. > h5. Forced to Use CLI for Disposal > Because of the scattered files, the user is in practice forced to use Flink’s > CLI to dispose a savepoint. This should be possible to handle in the scope of > the user’s environment via a file system delete operation. > h4. Proposal > In order to solve the described problems, savepoints should contain all their > state, both metadata and program state, inside a single directory. > Furthermore the metadata must only hold relative references to the checkpoint > files. This makes it obvious which files make up the state of a savepoint and > it is possible to move savepoints around by moving the savepoint directory. > h5. Desired File Layout > Triggering a savepoint to {{}} creates a directory as follows: > {code} > /savepoint-- > +-- _metadata > +-- data- [1 or more] > {code} > We include the JobID in the savepoint directory name in order to give some > hints about which job a savepoint belongs to. > h5. CLI > - Trigger: When triggering a savepoint to {{}} the savepoint > directory will be returned as the handle to the savepoint. > - Restore: Users can restore by pointing to the directory or the _metadata > file. The data files should be required to be in the same directory as the > _metadata file. > - Dispose: The disposal command should be deprecated and eventually removed. > While deprecated, disposal can happen by specifying the directory or the > _metadata file (same as restore). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17487) Do not delete old checkpoints when stop the job.
[ https://issues.apache.org/jira/browse/FLINK-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17109386#comment-17109386 ] Congxian Qiu(klion26) commented on FLINK-17487: --- [~nobleyd] Do you mean the stop command did not succeed and the previous checkpoint was deleted? That seems weird to me. Could you please share more logs with us (JM and TM logs; it's better to enable debug logging)? > Do not delete old checkpoints when stop the job. > > > Key: FLINK-17487 > URL: https://issues.apache.org/jira/browse/FLINK-17487 > Project: Flink > Issue Type: Improvement > Components: Client / Job Submission, Runtime / Checkpointing >Reporter: nobleyd >Priority: Major > > When stopping a Flink job using 'flink stop jobId', the checkpoint data is > deleted. > When the stop action does not succeed or fails because of some unknown error, > sometimes the job resumes using the latest checkpoint, while sometimes it > just fails, and the checkpoint data is gone. > You may ask why I need these checkpoints, since I stop the job and a savepoint > will be generated. For example, my job uses a Kafka source, but Kafka > missed some data, and I want to stop the job and resume it using an old > checkpoint. Anyway, I mean that sometimes the stop action fails and the > checkpoint data is also deleted, which is not good. > This is different from the cases of 'flink cancel jobId' or 'flink > savepoint jobId', which won't delete the checkpoint data. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-16383) KafkaProducerExactlyOnceITCase. testExactlyOnceRegularSink fails with "The producer has already been closed"
[ https://issues.apache.org/jira/browse/FLINK-16383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108387#comment-17108387 ] Congxian Qiu(klion26) edited comment on FLINK-16383 at 5/15/20, 3:12 PM: - Seems to be another instance: [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1230&view=logs&j=d44f43ce-542c-597d-bf94-b0718c71e5e8&t=34f486e1-e1e4-5dd2-9c06-bfdd9b9c74a8&l=18475] {code:java} 2020-05-14T04:20:36.9953507Z 04:20:36,993 [Source: Custom Source -> Sink: Unnamed (3/3)] WARN org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher [] - Committing offsets to Kafka takes longer than the checkpoint interval. Skipping commit of previous offsets because newer complete checkpoint offsets are available. This does not compromise Flink's checkpoint integrity. 2020-05-14T04:20:37.2142509Z 04:20:37,174 [program runner thread] ERROR org.apache.flink.streaming.connectors.kafka.KafkaTestBase [] - Job Runner failed with exception 2020-05-14T04:20:37.2143266Z org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: cdf00a9c3a8b9c3afb72f57e7ef936c8) 2020-05-14T04:20:37.2144122Z at org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:115) ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 2020-05-14T04:20:37.2145249Z at org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase.lambda$runCancelingOnEmptyInputTest$2(KafkaConsumerTestBase.java:1075) ~[flink-connector-kafka-base_2.11-1.11-SNAPSHOT-tests.jar:?] 2020-05-14T04:20:37.2145858Z at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242] 2020-05-14T04:20:37.2146393Z Caused by: org.apache.flink.runtime.client.JobCancellationException: Job was cancelled. 2020-05-14T04:20:37.2151756Z at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:149) ~[flink-runtime_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 2020-05-14T04:20:37.2153457Z at org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:113) ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 2020-05-14T04:20:37.2154166Z ... 2 more {code} was (Author: klion26): Seems to be another instance: [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1230&view=logs&j=d44f43ce-542c-597d-bf94-b0718c71e5e8&t=34f486e1-e1e4-5dd2-9c06-bfdd9b9c74a8&l=18475] 2020-05-14T04:20:36.9953507Z 04:20:36,993 [Source: Custom Source -> Sink: Unnamed (3/3)] WARN org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher [] - Committing offsets to Kafka takes longer than the checkpoint interval. Skipping commit of previous offsets because newer complete checkpoint offsets are available. This does not compromise Flink's checkpoint integrity. 2020-05-14T04:20:37.2142509Z 04:20:37,174 [program runner thread] ERROR org.apache.flink.streaming.connectors.kafka.KafkaTestBase [] - Job Runner failed with exception 2020-05-14T04:20:37.2143266Z org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: cdf00a9c3a8b9c3afb72f57e7ef936c8) 2020-05-14T04:20:37.2144122Z at org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:115) ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 2020-05-14T04:20:37.2145249Z at org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase.lambda$runCancelingOnEmptyInputTest$2(KafkaConsumerTestBase.java:1075) ~[flink-connector-kafka-base_2.11-1.11-SNAPSHOT-tests.jar:?] 
2020-05-14T04:20:37.2145858Z at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242] 2020-05-14T04:20:37.2146393Z Caused by: org.apache.flink.runtime.client.JobCancellationException: Job was cancelled. 2020-05-14T04:20:37.2151756Z at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:149) ~[flink-runtime_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 2020-05-14T04:20:37.2153457Z at org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:113) ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 2020-05-14T04:20:37.2154166Z ... 2 more > KafkaProducerExactlyOnceITCase. testExactlyOnceRegularSink fails with "The > producer has already been closed" > > > Key: FLINK-16383 > URL: https://issues.apache.org/jira/browse/FLINK-16383 > Project: Flink > Issue Type: Bug > Components: Runtime / Task, Tests >Reporter: Robert Metzger >Priority: Blocker > Labels: pull-request-available, test-stability > Fix For: 1.11.0 > > > Logs: > https://dev.azure.com/rmetzger/Flink/_build/results?buildId=5779&view=logs&j=a54de925-e958-5e24-790a-3a6150eb72d8&t=24e561e9-4c8d-598d-a290-e6acce191345 > {code} > 2020-03-01T01:06:57.4738418Z 01:06:57,473 [S
[jira] [Commented] (FLINK-16383) KafkaProducerExactlyOnceITCase. testExactlyOnceRegularSink fails with "The producer has already been closed"
[ https://issues.apache.org/jira/browse/FLINK-16383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108387#comment-17108387 ] Congxian Qiu(klion26) commented on FLINK-16383: --- seems another instance [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1230&view=logs&j=d44f43ce-542c-597d-bf94-b0718c71e5e8&t=34f486e1-e1e4-5dd2-9c06-bfdd9b9c74a8&l=18475] 2020-05-14T04:20:36.9953507Z 04:20:36,993 [Source: Custom Source -> Sink: Unnamed (3/3)] WARN org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher [] - Committing offsets to Kafka takes longer than the checkpoint interval. Skipping commit of previous offsets because newer complete checkpoint offsets are available. This does not compromise Flink's checkpoint integrity. 2020-05-14T04:20:37.2142509Z 04:20:37,174 [program runner thread] ERROR org.apache.flink.streaming.connectors.kafka.KafkaTestBase [] - Job Runner failed with exception 2020-05-14T04:20:37.2143266Z org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: cdf00a9c3a8b9c3afb72f57e7ef936c8) 2020-05-14T04:20:37.2144122Z at org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:115) ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 2020-05-14T04:20:37.2145249Z at org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase.lambda$runCancelingOnEmptyInputTest$2(KafkaConsumerTestBase.java:1075) ~[flink-connector-kafka-base_2.11-1.11-SNAPSHOT-tests.jar:?] 2020-05-14T04:20:37.2145858Z at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242] 2020-05-14T04:20:37.2146393Z Caused by: org.apache.flink.runtime.client.JobCancellationException: Job was cancelled. 2020-05-14T04:20:37.2151756Z at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:149) ~[flink-runtime_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 2020-05-14T04:20:37.2153457Z at org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:113) ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 2020-05-14T04:20:37.2154166Z ... 2 more > KafkaProducerExactlyOnceITCase. testExactlyOnceRegularSink fails with "The > producer has already been closed" > > > Key: FLINK-16383 > URL: https://issues.apache.org/jira/browse/FLINK-16383 > Project: Flink > Issue Type: Bug > Components: Runtime / Task, Tests >Reporter: Robert Metzger >Priority: Blocker > Labels: pull-request-available, test-stability > Fix For: 1.11.0 > > > Logs: > https://dev.azure.com/rmetzger/Flink/_build/results?buildId=5779&view=logs&j=a54de925-e958-5e24-790a-3a6150eb72d8&t=24e561e9-4c8d-598d-a290-e6acce191345 > {code} > 2020-03-01T01:06:57.4738418Z 01:06:57,473 [Source: Custom Source -> Map -> > Sink: Unnamed (1/1)] INFO > org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaInternalProducer > [] - Flushing new partitions > 2020-03-01T01:06:57.4739960Z 01:06:57,473 [FailingIdentityMapper Status > Printer] INFO > org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper > [] - > Failing mapper 0: count=680, > totalCount=1000 > 2020-03-01T01:06:57.4909074Z > org.apache.flink.runtime.client.JobExecutionException: Job execution failed. 
> 2020-03-01T01:06:57.4910001Z at > org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:147) > 2020-03-01T01:06:57.4911000Z at > org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:648) > 2020-03-01T01:06:57.4912078Z at > org.apache.flink.streaming.util.TestStreamEnvironment.execute(TestStreamEnvironment.java:77) > 2020-03-01T01:06:57.4913039Z at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1619) > 2020-03-01T01:06:57.4914421Z at > org.apache.flink.test.util.TestUtils.tryExecute(TestUtils.java:35) > 2020-03-01T01:06:57.4915423Z at > org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testExactlyOnce(KafkaProducerTestBase.java:370) > 2020-03-01T01:06:57.4916483Z at > org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testExactlyOnceRegularSink(KafkaProducerTestBase.java:309) > 2020-03-01T01:06:57.4917305Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-03-01T01:06:57.4917982Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-03-01T01:06:57.4918769Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-03-01T01:06:57.4919477Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-03-01T01:06:57.4920156Z at > org.junit.runners.model.Framework
[jira] [Commented] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106088#comment-17106088 ] Congxian Qiu(klion26) commented on FLINK-17571: --- [~pnowojski] the directory layout of checkpoints and savepoints is not the same: the files of one savepoint always go into a single [directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/savepoints.html#triggering-savepoints], so users can safely delete the whole savepoint if they want. Currently, only checkpoint files go into different directories (shared, taskowned, exclusive). The command I proposed can be used for both checkpoints and savepoints; what I have in mind for the command is to: # read the meta-file (the logic for this already exists in our codebase) # deserialize the meta-file (this also already exists in our codebase) # output the files referenced in the meta-file (see the sketch after this message). The reason I want to keep it as a command in Flink is that if we change something related to the metafile, we can have the command updated as well. > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] > Currently, there are [three types of > directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] > for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be > deleted safely, but users can't delete the files in the SHARED directory > safely(the files may be created a long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
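[Editorial note] For illustration, a minimal sketch of the three steps above, written against Flink's internal checkpoint classes. The package of {{CheckpointMetadata}} and the exact signature of {{Checkpoints#loadCheckpointMetadata}} differ between Flink versions (this outline assumes the 1.11 line), so treat it as a sketch under those assumptions, not the final command:

{code:java}
// Sketch of the proposed "checkpoint list" command: read the _metadata
// file, deserialize it, and print the state files it references.
import java.io.DataInputStream;
import java.io.FileInputStream;

import org.apache.flink.runtime.checkpoint.Checkpoints;
import org.apache.flink.runtime.checkpoint.OperatorState;
import org.apache.flink.runtime.checkpoint.OperatorSubtaskState;
import org.apache.flink.runtime.checkpoint.metadata.CheckpointMetadata;

public class ListCheckpointFiles {
    public static void main(String[] args) throws Exception {
        // args[0] points at the meta-file, e.g. <checkpointDir>/chk-42/_metadata
        try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
            // Steps 1 + 2: read and deserialize the meta-file.
            CheckpointMetadata metadata = Checkpoints.loadCheckpointMetadata(
                    in, ListCheckpointFiles.class.getClassLoader());
            // Step 3: output the state handles referenced by the meta-file.
            // (A complete command would also walk raw keyed state and operator state.)
            for (OperatorState operatorState : metadata.getOperatorStates()) {
                for (OperatorSubtaskState subtask : operatorState.getStates()) {
                    subtask.getManagedKeyedState().forEach(handle -> System.out.println(handle));
                }
            }
        }
    }
}
{code}

Keeping this as a command inside Flink (rather than an external script) means it is compiled against whatever metadata format the current version writes, which is the point made in the comment above.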
[jira] [Commented] (FLINK-13678) Translate "Code Style - Preamble" page into Chinese
[ https://issues.apache.org/jira/browse/FLINK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104190#comment-17104190 ] Congxian Qiu(klion26) commented on FLINK-13678: --- [~WangHW] Are you still working on this? I just saw that the GitHub account for the pr has been deleted. If you're still working on this, could you please file a new pr and ping me to review it? > Translate "Code Style - Preamble" page into Chinese > --- > > Key: FLINK-13678 > URL: https://issues.apache.org/jira/browse/FLINK-13678 > Project: Flink > Issue Type: Sub-task > Components: chinese-translation, Project Website >Reporter: Jark Wu >Assignee: WangHengWei >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Translate page > https://flink.apache.org/zh/contributing/code-style-and-quality-preamble.html > into Chinese. The page is located in > https://github.com/apache/flink-web/blob/asf-site/contributing/code-style-and-quality-preamble.zh.md -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17468) Provide more detailed metrics why asynchronous part of checkpoint is taking long time
[ https://issues.apache.org/jira/browse/FLINK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104165#comment-17104165 ] Congxian Qiu(klion26) commented on FLINK-17468: --- [~qqibrow] please let me know if you need any help~ > Provide more detailed metrics why asynchronous part of checkpoint is taking > long time > - > > Key: FLINK-17468 > URL: https://issues.apache.org/jira/browse/FLINK-17468 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing, Runtime / Metrics, Runtime / > State Backends >Affects Versions: 1.10.0 >Reporter: Piotr Nowojski >Priority: Major > > As [reported by > users|https://lists.apache.org/thread.html/r0833452796ca7d1c9d5e35c110089c95cfdadee9d81884a13613a4ce%40%3Cuser.flink.apache.org%3E] > it's not obvious why asynchronous part of checkpoint is taking so long time. > Maybe we could provide some more detailed metrics/UI/logs about uploading > files, materializing meta data, or other things that are happening during the > asynchronous checkpoint process? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17104131#comment-17104131 ] Congxian Qiu(klion26) commented on FLINK-17571: --- Sorry for the late reply, it seems I missed some email notifications :( [~pnowojski] previously, I just wanted to show all the files (including EXCLUSIVE, SHARED and TASKOWNED) that are being used by one checkpoint, because: * If we don't restore from an external checkpoint, there are never orphaned checkpoint files left (the files should be deleted when the checkpoint is discarded) * If we restore from an external checkpoint, then we just need to know the files used in the "restored" checkpoint * If we retain more than 1 checkpoint and want to remove some of them (usually the older ones, because of storage pressure), in most cases we just need to keep the freshest checkpoint. For the implementation, we just need to leverage {{Checkpoints#loadCheckpointMetaData}} for this. From my side, another command that {{removes files in specified checkpoints}} may be trickier, because we may not know who references the files in a given checkpoint: if we retain more than one checkpoint and they reference the same shared file, and two separate jobs then restore from the different checkpoints, we can't simply delete the files of each checkpoint (see the sketch after this message). [~trystan] good to know that you're interested in helping with this; I'm fine if you want to implement it, and I can do the initial review of your contribution. Please let me know if you have time to contribute. No matter who finally contributes this, we first need to agree on the implementation here on the issue. > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] > Currently, there are [three types of > directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] > for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be > deleted safely, but users can't delete the files in the SHARED directory > safely(the files may be created a long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
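[Editorial note] To make the reference-counting pitfall above concrete, here is a hypothetical sketch. The per-checkpoint file sets would come from reading each checkpoint's {{_metadata}} (as in the listing sketch earlier in this thread); nothing below is an existing Flink API:

{code:java}
// Why "delete the files of checkpoint X" is unsafe: a file under shared/
// may still be referenced by other retained checkpoints, so only files
// that no remaining checkpoint references may be deleted.
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SafeSharedFileCleanup {

    /** Files of {@code toRemove} that no retained checkpoint still references. */
    static Set<String> safeToDelete(Set<String> toRemove, List<Set<String>> retained) {
        Set<String> candidates = new HashSet<>(toRemove);
        for (Set<String> stillReferenced : retained) {
            candidates.removeAll(stillReferenced); // shared elsewhere -> must keep
        }
        return candidates;
    }

    public static void main(String[] args) {
        Set<String> chk1 = new HashSet<>(Arrays.asList("shared/sst-1", "exclusive/chk-1-meta"));
        Set<String> chk2 = new HashSet<>(Arrays.asList("shared/sst-1", "shared/sst-2"));
        // Deleting all of chk-1's files would break chk-2, which still uses sst-1:
        System.out.println(safeToDelete(chk1, Collections.singletonList(chk2)));
        // prints [exclusive/chk-1-meta]
    }
}
{code}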
[jira] [Commented] (FLINK-17468) Provide more detailed metrics why asynchronous part of checkpoint is taking long time
[ https://issues.apache.org/jira/browse/FLINK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102397#comment-17102397 ] Congxian Qiu(klion26) commented on FLINK-17468: --- Hi [~qqibrow] from your previous reply, for the purpose to debug the checkpoint slowness, do you mean you need to add more and more log in {{RocksDBIncrementalSnapshotOperation}} and the other functions so that you can know the exact place which caused the slowness? If this is the scenario, do you think we add more debug/trace log in each step of the async snapshot part as my previous reply is good for your debug purpose? or what other metric or log do you want? > Provide more detailed metrics why asynchronous part of checkpoint is taking > long time > - > > Key: FLINK-17468 > URL: https://issues.apache.org/jira/browse/FLINK-17468 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing, Runtime / Metrics, Runtime / > State Backends >Affects Versions: 1.10.0 >Reporter: Piotr Nowojski >Priority: Major > > As [reported by > users|https://lists.apache.org/thread.html/r0833452796ca7d1c9d5e35c110089c95cfdadee9d81884a13613a4ce%40%3Cuser.flink.apache.org%3E] > it's not obvious why asynchronous part of checkpoint is taking so long time. > Maybe we could provide some more detailed metrics/UI/logs about uploading > files, materializing meta data, or other things that are happening during the > asynchronous checkpoint process? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] Currently, there are [three types of directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are [three types of directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] > Currently, there are [three types of > directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] > for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be > deleted safely, but users can't delete the files in the SHARED directory > safely(the files may be created a long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17102356#comment-17102356 ] Congxian Qiu(klion26) commented on FLINK-17571: --- [~pnowojski] what do you think about this feature? If it is valid, I think I can help to contribute it. > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] > Currently, there are [three types of > directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure] > for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be > deleted safely, but users can't delete the files in the SHARED directory > safely(the files may be created a long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are [three types of directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are three [types of directory|#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] > Currently, there are [three types of > directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]] > for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be > deleted safely, but users can't delete the files in the SHARED directory > safely(the files may be created a long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are three [types of directory|#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [usermail|#directory-structure] Currently, there are three [types of directory|#directory-structure] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] > Currently, there are three [types of directory|#directory-structure]] for a > checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted > safely, but users can't delete the files in the SHARED directory safely(the > files may be created a long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [usermail|#directory-structure] Currently, there are three [types of directory|#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [usermail|#directory-structure]] Currently, there are three [types of directory|#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the [usermail|#directory-structure] > Currently, there are three [types of directory|#directory-structure]] for a > checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted > safely, but users can't delete the files in the SHARED directory safely(the > files may be created a long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [usermail|#directory-structure]] Currently, there are three [types of directory|#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [usermail|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]] [user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are three [types of directory|#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the [usermail|#directory-structure]] > Currently, there are three [types of directory|#directory-structure]] for a > checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted > safely, but users can't delete the files in the SHARED directory safely(the > files may be created a long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [usermail|#directory-structure] Currently, there are three [types of directory|#directory-structure] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [usermail|#directory-structure] Currently, there are three [types of directory|#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the [usermail|#directory-structure] > Currently, there are three [types of directory|#directory-structure] for a > checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted > safely, but users can't delete the files in the SHARED directory safely(the > files may be created a long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [usermail|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]] [user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are three [types of directory|#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are three [types of directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the > [usermail|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]] > [user > mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] > > Currently, there are three [types of directory|#directory-structure]] for a > checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted > safely, but users can't delete the files in the SHARED directory safely(the > files may be created a long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are three [types of directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]] for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are three types of directory for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the [user > mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] > > Currently, there are three [types of > directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]] > for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be > deleted safely, but users can't delete the files in the SHARED directory > safely(the files may be created a long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are three types of directory for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] Currently, there are three types of directory for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the [user > mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] > > Currently, there are three types of directory for a checkpoint, the files in > TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't > delete the files in the SHARED directory safely(the files may be created a > long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] Currently, there are three types of directory for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are three types of directory for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the [user > mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html] > Currently, there are three types of directory for a checkpoint, the files in > TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't > delete the files in the SHARED directory safely(the files may be created a > long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] Currently, there are three types of directory for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [[user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]] Currently, there are three types of directory for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the [user > mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]] > Currently, there are three types of directory for a checkpoint, the files in > TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't > delete the files in the SHARED directory safely(the files may be created a > long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17571) A better way to show the files used in currently checkpoints
Congxian Qiu(klion26) created FLINK-17571: - Summary: A better way to show the files used in currently checkpoints Key: FLINK-17571 URL: https://issues.apache.org/jira/browse/FLINK-17571 Project: Flink Issue Type: New Feature Components: Runtime / Checkpointing Reporter: Congxian Qiu(klion26) Inspired by the [user mail]([http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]). Currently, there are three types of directory for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints
[ https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17571: -- Description: Inspired by the [[user mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]] Currently, there are three types of directory for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} was: Inspired by the [user mail]([http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]). Currently, there are three types of directory for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete the files in the SHARED directory safely(the files may be created a long time ago). I think it's better to give users a better way to know which files are currently used(so the others are not used) maybe a command-line command such as below is ok enough to support such a feature. {{./bin/flink checkpoint list $checkpointDir # list all the files used in checkpoint}} > A better way to show the files used in currently checkpoints > > > Key: FLINK-17571 > URL: https://issues.apache.org/jira/browse/FLINK-17571 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing >Reporter: Congxian Qiu(klion26) >Priority: Major > > Inspired by the [[user > mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]] > Currently, there are three types of directory for a checkpoint, the files in > TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't > delete the files in the SHARED directory safely(the files may be created a > long time ago). > I think it's better to give users a better way to know which files are > currently used(so the others are not used) > maybe a command-line command such as below is ok enough to support such a > feature. > {{./bin/flink checkpoint list $checkpointDir # list all the files used in > checkpoint}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17479) Occasional checkpoint failure due to null pointer exception in Flink version 1.10
[ https://issues.apache.org/jira/browse/FLINK-17479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101355#comment-17101355 ] Congxian Qiu(klion26) commented on FLINK-17479: --- [~nobleyd] thanks for reporting this problem. It seems strange from the picture you've given: if {{checkpointMetaData}} is null, then how could the message {{"Could not perform checkpoint " + checkpointMetaData.getCheckpointId() + " for operator " + getName() + '.'}} [1] have been printed? The error message itself tries to get the checkpointId from {{checkpointMetaData}} (a short sketch of this reasoning follows this message). Could you please share the whole jm & tm log? A reproducible job is even better. Thanks. [1]https://github.com/apache/flink/blob/aa4eb8f0c9ce74e6b92c3d9be5dc8e8cb536239d/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L801 > Occasional checkpoint failure due to null pointer exception in Flink version > 1.10 > - > > Key: FLINK-17479 > URL: https://issues.apache.org/jira/browse/FLINK-17479 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.10.0 > Environment: Flink1.10.0 > jdk1.8.0_60 >Reporter: nobleyd >Priority: Major > Attachments: image-2020-04-30-18-44-21-630.png, > image-2020-04-30-18-55-53-779.png > > > I upgrade the standalone cluster(3 machines) from flink1.9 to flink1.10.0 > latest. My job running normally in flink1.9 for about half a year, while I > get some job failed due to null pointer exception when checkpoing in > flink1.10.0. > Below is the exception log: > !image-2020-04-30-18-55-53-779.png! > I have checked the StreamTask(882), and is shown below. I think the only case > is that checkpointMetaData is null that can lead to a null pointer exception. > !image-2020-04-30-18-44-21-630.png! > I do not know why, is there anyone can help me? The problem only occurs in > Flink1.10.0 for now, it works well in flink1.9. I give the some conf > info(some different to the default) also in below, guessing that maybe it is > an error for configuration mistake. > some conf of my flink1.10.0: > > {code:java} > taskmanager.memory.flink.size: 71680m > taskmanager.memory.framework.heap.size: 512m > taskmanager.memory.framework.off-heap.size: 512m > taskmanager.memory.task.off-heap.size: 17920m > taskmanager.memory.managed.size: 512m > taskmanager.memory.jvm-metaspace.size: 512m > taskmanager.memory.network.fraction: 0.1 > taskmanager.memory.network.min: 1024mb > taskmanager.memory.network.max: 1536mb > taskmanager.memory.segment-size: 128kb > rest.port: 8682 > historyserver.web.port: 8782high-availability.jobmanager.port: > 13141,13142,13143,13144 > blob.server.port: 13146,13147,13148,13149taskmanager.rpc.port: > 13151,13152,13153,13154 > taskmanager.data.port: 13156metrics.internal.query-service.port: > 13161,13162,13163,13164,13166,13167,13168,13169env.java.home: > /usr/java/jdk1.8.0_60/bin/java > env.pid.dir: /home/work/flink-1.10.0{code} > > Hope someone can help me solve it. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
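[Editorial note] The reasoning in the comment above can be shown with a simplified sketch; {{performCheckpoint}} and {{getName}} are stand-ins for the {{StreamTask}} internals, not the exact Flink code:

{code:java}
// If checkpointMetaData were null, building the error message would itself
// throw an NPE, so the quoted log line could never have been printed --
// reaching that line proves checkpointMetaData was non-null.
public class CheckpointNpeSketch {
    static class CheckpointMetaData {
        private final long checkpointId;
        CheckpointMetaData(long checkpointId) { this.checkpointId = checkpointId; }
        long getCheckpointId() { return checkpointId; }
    }

    void performCheckpoint(CheckpointMetaData meta) throws Exception {
        // stand-in for the real checkpointing work
    }

    String getName() { return "MyOperator"; }

    void triggerCheckpoint(CheckpointMetaData checkpointMetaData) throws Exception {
        try {
            performCheckpoint(checkpointMetaData);
        } catch (Exception e) {
            // Dereferences checkpointMetaData while formatting the message.
            throw new Exception("Could not perform checkpoint "
                    + checkpointMetaData.getCheckpointId()
                    + " for operator " + getName() + '.', e);
        }
    }
}
{code}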
[jira] [Updated] (FLINK-17342) Schedule savepoint if max-inflight-checkpoints limit is reached instead of forcing (in UC mode)
[ https://issues.apache.org/jira/browse/FLINK-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-17342: -- Summary: Schedule savepoint if max-inflight-checkpoints limit is reached instead of forcing (in UC mode) (was: Schedule savepoint if max-inflight-checkpoints limit is reached isntead of forcing (in UC mode)) > Schedule savepoint if max-inflight-checkpoints limit is reached instead of > forcing (in UC mode) > --- > > Key: FLINK-17342 > URL: https://issues.apache.org/jira/browse/FLINK-17342 > Project: Flink > Issue Type: Improvement > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Roman Khachatryan >Assignee: Roman Khachatryan >Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17468) Provide more detailed metrics why asynchronous part of checkpoint is taking long time
[ https://issues.apache.org/jira/browse/FLINK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096430#comment-17096430 ] Congxian Qiu(klion26) commented on FLINK-17468: --- Hi [~qqibrow], there may be many situations that lead to snapshot timeouts, but in my experience all of them narrow down to two: disk performance and network performance (including network speed and the threads used to upload state). The disk/network bottleneck information can be obtained from sources other than Apache Flink, so I don't think we can add more metrics here. But for the user experience (especially for someone who wants to find the root cause of a timeout), I agree with adding some more debug (or trace) logs in the async part of a snapshot, so that the user can easily see which step the current snapshot is in (there are so many steps in the async part of RocksDBStateBackend...); a sketch follows this message. I proposed adding more debug (or trace) logs for this purpose in FLINK-13808, and created an issue to limit the max concurrent snapshots on the TM side (FLINK-15236); if we reach agreement on this, I can help to contribute them. > Provide more detailed metrics why asynchronous part of checkpoint is taking > long time > - > > Key: FLINK-17468 > URL: https://issues.apache.org/jira/browse/FLINK-17468 > Project: Flink > Issue Type: New Feature > Components: Runtime / Checkpointing, Runtime / Metrics, Runtime / > State Backends >Affects Versions: 1.10.0 >Reporter: Piotr Nowojski >Priority: Major > > As [reported by > users|https://lists.apache.org/thread.html/r0833452796ca7d1c9d5e35c110089c95cfdadee9d81884a13613a4ce%40%3Cuser.flink.apache.org%3E] > it's not obvious why asynchronous part of checkpoint is taking so long time. > Maybe we could provide some more detailed metrics/UI/logs about uploading > files, materializing meta data, or other things that are happening during the > asynchronous checkpoint process? -- This message was sent by Atlassian Jira (v8.3.4#803005)
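[Editorial note] A minimal sketch of what such debug/trace logging around the async snapshot steps could look like; the step names are illustrative and do not match the actual methods of {{RocksDBIncrementalSnapshotOperation}}:

{code:java}
// Wraps each step of the async snapshot phase in trace logging so a user
// can see which step a slow checkpoint is stuck in and how long it took.
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AsyncSnapshotTracing {
    private static final Logger LOG = LoggerFactory.getLogger(AsyncSnapshotTracing.class);

    interface Step { void run() throws Exception; }

    static void traced(String stepName, long checkpointId, Step step) throws Exception {
        long start = System.currentTimeMillis();
        LOG.trace("Checkpoint {}: starting async step '{}'.", checkpointId, stepName);
        step.run();
        LOG.trace("Checkpoint {}: finished async step '{}' in {} ms.",
                checkpointId, stepName, System.currentTimeMillis() - start);
    }

    // Example usage inside the async part of an incremental snapshot:
    void asyncSnapshot(long checkpointId) throws Exception {
        traced("upload sst files", checkpointId, () -> { /* upload new sst files */ });
        traced("write metadata", checkpointId, () -> { /* materialize meta data */ });
    }
}
{code}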