[jira] [Commented] (FLINK-19381) Fix docs about relocatable savepoints

2020-09-24 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17201481#comment-17201481
 ] 

Congxian Qiu(klion26) commented on FLINK-19381:
---

Ah, the doc wasn't updated when FLINK-5763 was implemented. I'd like to fix the 
doc; could you please assign this ticket to me? [~NicoK]

> Fix docs about relocatable savepoints
> -
>
> Key: FLINK-19381
> URL: https://issues.apache.org/jira/browse/FLINK-19381
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.12.0, 1.11.2
>Reporter: Nico Kruber
>Priority: Major
>
> Although savepoints are relocatable since Flink 1.11, the docs still state 
> otherwise, for example in 
> [https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#triggering-savepoints]
> The warning there, as well as the other changes from FLINK-15863, should be 
> removed again and potentially replaced with new constraints.
> One known constraint is that if task-owned state is used 
> (\{{GenericWriteAheadLog}} sink), savepoints are currently not relocatable 
> yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19359) Restore from Checkpoint fails if checkpoint folders is corrupt/partial

2020-09-22 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200040#comment-17200040
 ] 

Congxian Qiu(klion26) commented on FLINK-19359:
---

A missing {{_metadata}} file means that the checkpoint did not complete, so you 
can't restore from it.
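For illustration, a minimal sketch of that check ({{_metadata}} is the real Flink metadata file name; the {{CheckpointGuard}} helper class is hypothetical, not Flink code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CheckpointGuard {
    // A completed checkpoint directory contains a _metadata file; its
    // absence means the checkpoint never finished and cannot be restored.
    static boolean isRestorable(Path checkpointDir) {
        return Files.isRegularFile(checkpointDir.resolve("_metadata"));
    }

    public static void main(String[] args) throws IOException {
        Path complete = Files.createTempDirectory("chk-complete");
        Files.createFile(complete.resolve("_metadata"));
        Path partial = Files.createTempDirectory("chk-partial");
        System.out.println(isRestorable(complete)); // true
        System.out.println(isRestorable(partial));  // false: partial/corrupt folder
    }
}
```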

> Restore from Checkpoint fails if checkpoint folders is corrupt/partial
> --
>
> Key: FLINK-19359
> URL: https://issues.apache.org/jira/browse/FLINK-19359
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.8.0
>Reporter: Arpith Prakash
>Priority: Major
> Attachments: Checkpoints.png
>
>
> I'm using Flink 1.8.0 and have enabled externalized checkpoints to an 
> hdfs location. We have seen a few scenarios where checkpoint folders 
> contain the checkpoint files but are missing only the "*_metadata*" file. If 
> we attempt to restore the application from this path, it fails with the 
> exception "Could not find *_metadata* file". There is a similar discussion in 
> the Flink user mailing list with the subject "Zookeeper connection loss 
> causing checkpoint corruption". I've attached a sample snapshot of the folder 
> structure as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19249) Job would wait sometime(~10 min) before failover if some connection broken

2020-09-22 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17200033#comment-17200033
 ] 

Congxian Qiu(klion26) commented on FLINK-19249:
---

For the client side, we can add a 
{{[ReadTimeoutHandler|https://netty.io/4.0/api/io/netty/handler/timeout/ReadTimeoutHandler.html]}}
 or 
{{[IdleStateHandler|https://netty.io/4.0/api/io/netty/handler/timeout/IdleStateHandler.html]}}
 to the {{ChannelPipeline}} so that the client waits a shorter time.

For the {{Connection timed out}} exception on the server side, I don't think we 
can configure it in Netty; it is thrown by the TCP stack.

Maybe we can add a {{ReadTimeoutHandler}}/{{IdleStateHandler}} with a configured 
timeout here? [~sewen] what do you think about this?

> Job would wait sometime(~10 min) before failover if some connection broken
> --
>
> Key: FLINK-19249
> URL: https://issues.apache.org/jira/browse/FLINK-19249
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: Congxian Qiu(klion26)
>Priority: Blocker
> Fix For: 1.12.0, 1.11.3
>
>
> {quote}encountered this error on 1.7, after going through the master code, I 
> think the problem is still there
> {quote}
> When the network environment is not so good, the connection between the 
> server and the client may be disconnected unexpectedly. After the 
> disconnection, the server will receive an IOException such as the one below
> {code:java}
> java.io.IOException: Connection timed out
>  at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>  at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>  at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>  at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>  at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>  at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> and then release the view reader.
> But the job would not fail until the downstream detects the disconnection 
> via {{channelInactive}} later (~10 min). During that time, the job can still 
> process data, but the broken channel can't transfer any data or events, so 
> snapshots will fail during this time. This causes the job to replay a lot of 
> data after failover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19300) Timer loss after restoring from savepoint

2020-09-21 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199331#comment-17199331
 ] 

Congxian Qiu(klion26) commented on FLINK-19300:
---

[~xianggao] thanks for reporting the issue. I think your analysis is right: 
{{InputStream#read}} doesn't guarantee that the byte array is fully read.

As this may lose timers, I'm not sure whether this needs to be a blocker or not.
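A minimal sketch of the read-until-full loop that would fix this ({{ReadFullyDemo}} and the one-byte-at-a-time stream are illustrative, not Flink code):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyDemo {
    // Keeps calling read() until the buffer is full or EOF is hit,
    // since a single read() may return fewer bytes than requested.
    static int readFully(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break; // end of stream reached before the buffer was full
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "VERSIONED_IDENTIFIER".getBytes("UTF-8");
        // A stream that returns at most one byte per read() call,
        // mimicking a remote stream (e.g. s3) that fills buffers partially.
        InputStream trickle = new ByteArrayInputStream(data) {
            @Override
            public int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, 1));
            }
        };
        byte[] buf = new byte[data.length];
        int read = readFully(trickle, buf);
        System.out.println(read == data.length); // true: the loop compensates for short reads
    }
}
```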

> Timer loss after restoring from savepoint
> -
>
> Key: FLINK-19300
> URL: https://issues.apache.org/jira/browse/FLINK-19300
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Reporter: Xiang Gao
>Priority: Critical
>
> While using heap-based timers, we are seeing occasional timer loss after 
> restoring the program from a savepoint, especially when using remote savepoint 
> storage (s3). 
> After some investigation, the issue seems to be related to [this line in 
> deserialization|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/io/PostVersionedIOReadableWritable.java#L65].
>  When trying to check the VERSIONED_IDENTIFIER, the input stream may not 
> guarantee filling the byte array, causing timers to be dropped for the 
> affected key group.
> We should keep reading until the expected number of bytes has actually been 
> read or the end of the stream has been reached. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-19325) Optimize the consumed time for checkpoint completion

2020-09-21 Thread Congxian Qiu(klion26) (Jira)
Congxian Qiu(klion26) created FLINK-19325:
-

 Summary: Optimize the consumed time for checkpoint completion
 Key: FLINK-19325
 URL: https://issues.apache.org/jira/browse/FLINK-19325
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Reporter: Congxian Qiu(klion26)


Currently, when completing a checkpoint, we write out the state handles in 
{{MetadataV2V3SerializerBase.java#serializeStreamStateHandle}}:
{code:java}
static void serializeStreamStateHandle(StreamStateHandle stateHandle, DataOutputStream dos) throws IOException {
    if (stateHandle == null) {
        dos.writeByte(NULL_HANDLE);

    } else if (stateHandle instanceof RelativeFileStateHandle) {
        dos.writeByte(RELATIVE_STREAM_STATE_HANDLE);
        RelativeFileStateHandle relativeFileStateHandle = (RelativeFileStateHandle) stateHandle;
        dos.writeUTF(relativeFileStateHandle.getRelativePath());
        dos.writeLong(relativeFileStateHandle.getStateSize());

    } else if (stateHandle instanceof FileStateHandle) {
        dos.writeByte(FILE_STREAM_STATE_HANDLE);
        FileStateHandle fileStateHandle = (FileStateHandle) stateHandle;
        dos.writeLong(stateHandle.getStateSize());
        dos.writeUTF(fileStateHandle.getFilePath().toString());

    } else if (stateHandle instanceof ByteStreamStateHandle) {
        dos.writeByte(BYTE_STREAM_STATE_HANDLE);
        ByteStreamStateHandle byteStreamStateHandle = (ByteStreamStateHandle) stateHandle;
        dos.writeUTF(byteStreamStateHandle.getHandleName());
        byte[] internalData = byteStreamStateHandle.getData();
        dos.writeInt(internalData.length);
        dos.write(byteStreamStateHandle.getData());

    } else {
        throw new IOException("Unknown implementation of StreamStateHandle: " + stateHandle.getClass());
    }

    dos.flush();
}
{code}

We call {{dos.flush()}} after every state handle is written out. But this may 
consume too much time and is not needed, because we close the output stream 
after everything has been written out.

I propose removing the {{dos.flush()}} here to optimize the time consumed by 
checkpoint completion.
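A small self-contained sketch (not the Flink serializer; the class and method names are made up) showing that flushing per handle and flushing once at close produce identical bytes, so the per-handle flush only adds round trips:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FlushOnceDemo {
    // Writes n fake state handles, optionally flushing after each one.
    static byte[] write(int n, boolean flushEach) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bos)) {
            for (int i = 0; i < n; i++) {
                dos.writeUTF("handle-" + i);
                dos.writeLong(i);
                if (flushEach) {
                    dos.flush(); // per-handle flush: extra work, same final bytes
                }
            }
        } // close() flushes whatever is still buffered
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The serialized metadata is identical either way; only the
        // number of flush calls (and, on a real FS, round trips) differs.
        System.out.println(java.util.Arrays.equals(write(3, true), write(3, false))); // true
    }
}
```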



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19249) Job would wait sometime(~10 min) before failover if some connection broken

2020-09-20 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199167#comment-17199167
 ] 

Congxian Qiu(klion26) commented on FLINK-19249:
---

[~sewen] thanks for the reply.

IIUC, a TCP keepalive and a shorter timeout can't solve this problem. The 
server receives a {{Connection timed out}} because the buffered data cannot be 
sent out in time. After the timeout, the server may send an {{ErrorResponse}} 
to the client, but the client may or may not receive this error in time.

PS: we can't easily change the configuration of the Linux kernel, since there 
are other services running on the same machine.

> Job would wait sometime(~10 min) before failover if some connection broken
> --
>
> Key: FLINK-19249
> URL: https://issues.apache.org/jira/browse/FLINK-19249
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: Congxian Qiu(klion26)
>Priority: Blocker
> Fix For: 1.12.0, 1.11.3
>
>
> {quote}encountered this error on 1.7, after going through the master code, I 
> think the problem is still there
> {quote}
> When the network environment is not so good, the connection between the 
> server and the client may be disconnected unexpectedly. After the 
> disconnection, the server will receive an IOException such as the one below
> {code:java}
> java.io.IOException: Connection timed out
>  at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>  at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>  at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>  at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>  at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>  at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> and then release the view reader.
> But the job would not fail until the downstream detects the disconnection 
> via {{channelInactive}} later (~10 min). During that time, the job can still 
> process data, but the broken channel can't transfer any data or events, so 
> snapshots will fail during this time. This causes the job to replay a lot of 
> data after failover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-19249) Job would wait sometime(~10 min) before failover if some connection broken

2020-09-15 Thread Congxian Qiu(klion26) (Jira)
Congxian Qiu(klion26) created FLINK-19249:
-

 Summary: Job would wait sometime(~10 min) before failover if some 
connection broken
 Key: FLINK-19249
 URL: https://issues.apache.org/jira/browse/FLINK-19249
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Reporter: Congxian Qiu(klion26)


{quote}encountered this error on 1.7, after going through the master code, I 
think the problem is still there
{quote}
When the network environment is not so good, the connection between the server 
and the client may be disconnected unexpectedly. After the disconnection, the 
server will receive an IOException such as the one below
{code:java}
java.io.IOException: Connection timed out
 at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
 at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
 at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
 at sun.nio.ch.IOUtil.write(IOUtil.java:51)
 at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
 at 
org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403)
 at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
 at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367)
 at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
 at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
 at java.lang.Thread.run(Thread.java:748)
{code}
and then release the view reader.

But the job would not fail until the downstream detects the disconnection via 
{{channelInactive}} later (~10 min). During that time, the job can still 
process data, but the broken channel can't transfer any data or events, so 
snapshots will fail during this time. This causes the job to replay a lot of 
data after failover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-18744) resume from modified savepoint dirctionary: No such file or directory

2020-09-03 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) closed FLINK-18744.
-
Resolution: Duplicate

Closing this one as a duplicate; please refer to FLINK-14942 for more 
information, thanks.

> resume from modified savepoint dirctionary: No such file or directory
> -
>
> Key: FLINK-18744
> URL: https://issues.apache.org/jira/browse/FLINK-18744
> Project: Flink
>  Issue Type: Bug
>  Components: API / State Processor
>Affects Versions: 1.11.0
>Reporter: tao wang
>Priority: Major
>
> If I resume a job from a savepoint which was modified by the State Processor 
> API, such as loading from /savepoint-path-old and writing to 
> /savepoint-path-new, the job resumes with savepointpath = /savepoint-path-new 
> while throwing an Exception: 
>  _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_.
>  I think this is an issue because Flink 1.11 uses absolute paths in 
> savepoints and checkpoints, but the State Processor API missed this.
> The job works well with the new savepoint (whose path is /savepoint-path-new) 
> if I copy everything except `_metadata` from /savepoint-path-old to 
> /savepoint-path-new.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-19126) Failed to run job in yarn-cluster mode due to No Executor found.

2020-09-02 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189788#comment-17189788
 ] 

Congxian Qiu(klion26) commented on FLINK-19126:
---

Hi [~Tang Yan] As the exception says, could you please try again after 
exporting {{HADOOP_CLASSPATH}}?

> Failed to run job in yarn-cluster mode due to No Executor found.
> 
>
> Key: FLINK-19126
> URL: https://issues.apache.org/jira/browse/FLINK-19126
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.11.1
>Reporter: Tang Yan
>Priority: Major
>
> I've built the Flink package successfully, but when I run the command below, 
> it fails to submit the job.
> [yanta@flink-1.11]$ bin/flink run -m yarn-cluster -p 2 -c 
> org.apache.flink.examples.java.wordcount.WordCount 
> examples/batch/WordCount.jar  --input hdfs:///user/yanta/aa.txt --output 
> hdfs:///user/yanta/result.txt
> Setting HADOOP_CONF_DIR=/etc/hadoop/conf because no HADOOP_CONF_DIR or 
> HADOOP_CLASSPATH was set.
>  The program 
> finished with the following exception:
> java.lang.IllegalStateException: No Executor found. Please make sure to 
> export the HADOOP_CLASSPATH environment variable or have hadoop in your 
> classpath. For more information refer to the "Deployment & Operations" 
> section of the official Apache Flink documentation. at 
> org.apache.flink.yarn.cli.FallbackYarnSessionCli.isActive(FallbackYarnSessionCli.java:59)
>  at 
> org.apache.flink.client.cli.CliFrontend.validateAndGetActiveCommandLine(CliFrontend.java:1090)
>  at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:218) at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:916) 
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:992) 
> at 
> org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:992)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14942) State Processing API: add an option to make deep copy

2020-08-09 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173767#comment-17173767
 ] 

Congxian Qiu(klion26) commented on FLINK-14942:
---

Now that savepoints are relocatable, I think we need to make a deep copy the 
default for the State Processor API, because we assume that all the data and 
metadata are in the same directory.

> State Processing API: add an option to make deep copy
> -
>
> Key: FLINK-14942
> URL: https://issues.apache.org/jira/browse/FLINK-14942
> Project: Flink
>  Issue Type: Improvement
>  Components: API / State Processor
>Reporter: Jun Qin
>Assignee: Jun Qin
>Priority: Major
>  Labels: usability
> Fix For: 1.12.0
>
>
> Currently, when a new savepoint is created based on a source savepoint, 
> there are references in the new savepoint to the source savepoint. Here is 
> what the [State Processor API 
> doc|https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/libs/state_processor_api.html]
>  says: 
> bq. Note: When basing a new savepoint on existing state, the state processor 
> api makes a shallow copy of the pointers to the existing operators. This 
> means that both savepoints share state and one cannot be deleted without 
> corrupting the other!
> This JIRA is to request an option to make a deep copy (instead of a shallow 
> copy) such that the new savepoint is self-contained. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18744) resume from modified savepoint dirctionary: No such file or directory

2020-08-09 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173766#comment-17173766
 ] 

Congxian Qiu(klion26) commented on FLINK-18744:
---

The problem here exists; it was caused by FLINK-5763. After FLINK-5763 we 
assume that all the data and metadata are in the same directory, but currently 
the State Processor API does not make a deep copy of the previous data. The 
workaround, for now, is to copy the data from the previous directory to the 
new directory.

I think the problem here can be fixed by FLINK-14942.

> resume from modified savepoint dirctionary: No such file or directory
> -
>
> Key: FLINK-18744
> URL: https://issues.apache.org/jira/browse/FLINK-18744
> Project: Flink
>  Issue Type: Bug
>  Components: API / State Processor
>Affects Versions: 1.11.0
>Reporter: tao wang
>Priority: Major
>
> If I resume a job from a savepoint which was modified by the State Processor 
> API, such as loading from /savepoint-path-old and writing to 
> /savepoint-path-new, the job resumes with savepointpath = /savepoint-path-new 
> while throwing an Exception: 
>  _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_.
>  I think this is an issue because Flink 1.11 uses absolute paths in 
> savepoints and checkpoints, but the State Processor API missed this.
> The job works well with the new savepoint (whose path is /savepoint-path-new) 
> if I copy everything except `_metadata` from /savepoint-path-old to 
> /savepoint-path-new.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints

2020-08-09 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-18675:
--
Fix Version/s: 1.11.2
   1.12.0

> Checkpoint not maintaining minimum pause duration between checkpoints
> -
>
> Key: FLINK-18675
> URL: https://issues.apache.org/jira/browse/FLINK-18675
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
> Environment: !image.png!
>Reporter: Ravi Bhushan Ratnakar
>Priority: Critical
> Fix For: 1.12.0, 1.11.2
>
> Attachments: image.png
>
>
> I am running a streaming job with Flink 1.11.0 using kubernetes 
> infrastructure. I have configured checkpoint configuration like below
> Interval - 3 minutes
> Minimum pause between checkpoints - 3 minutes
> Checkpoint timeout - 10 minutes
> Checkpointing Mode - Exactly Once
> Number of Concurrent Checkpoint - 1
>  
> Other configs
> Time Characteristics - Processing Time
>  
> I am observing an unusual behaviour. *When a checkpoint completes 
> successfully and its end-to-end duration is almost equal to or greater than 
> the minimum pause duration, the next checkpoint gets triggered immediately 
> without maintaining the minimum pause duration*. Kindly notice this behaviour 
> from checkpoint id 194 onward in the attached screenshot



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints

2020-08-09 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173755#comment-17173755
 ] 

Congxian Qiu(klion26) commented on FLINK-18675:
---

There seems to be a similar issue: FLINK-18856.

> Checkpoint not maintaining minimum pause duration between checkpoints
> -
>
> Key: FLINK-18675
> URL: https://issues.apache.org/jira/browse/FLINK-18675
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
> Environment: !image.png!
>Reporter: Ravi Bhushan Ratnakar
>Priority: Critical
> Attachments: image.png
>
>
> I am running a streaming job with Flink 1.11.0 using kubernetes 
> infrastructure. I have configured checkpoint configuration like below
> Interval - 3 minutes
> Minimum pause between checkpoints - 3 minutes
> Checkpoint timeout - 10 minutes
> Checkpointing Mode - Exactly Once
> Number of Concurrent Checkpoint - 1
>  
> Other configs
> Time Characteristics - Processing Time
>  
> I am observing an unusual behaviour. *When a checkpoint completes 
> successfully and its end-to-end duration is almost equal to or greater than 
> the minimum pause duration, the next checkpoint gets triggered immediately 
> without maintaining the minimum pause duration*. Kindly notice this behaviour 
> from checkpoint id 194 onward in the attached screenshot



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18856) CheckpointCoordinator ignores checkpointing.min-pause

2020-08-09 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173754#comment-17173754
 ] 

Congxian Qiu(klion26) commented on FLINK-18856:
---

This issue seems to be similar to FLINK-18675.

> CheckpointCoordinator ignores checkpointing.min-pause
> -
>
> Key: FLINK-18856
> URL: https://issues.apache.org/jira/browse/FLINK-18856
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.1
>Reporter: Roman Khachatryan
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0, 1.11.2
>
>
> See discussion: 
> http://mail-archives.apache.org/mod_mbox/flink-dev/202008.mbox/%3cca+5xao2enuzfyq+e-mmr2luuueu_zfjjoabjtqxow6tkgdr...@mail.gmail.com%3e



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints

2020-08-03 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170497#comment-17170497
 ] 

Congxian Qiu(klion26) commented on FLINK-18675:
---

[~raviratnakar] I think the problem here is that {{CheckpointRequestDecider}} 
has a stale value of {{lastCheckpointCompletionRelativeTime}} when checking 
whether a checkpoint request is too early:

1. We retrieve the value of {{lastCheckpointCompletionRelativeTime}} when 
calling {{CheckpointRequestDecider#chooseRequestToExecute}} in 
{{CheckpointCoordinator#triggerCheckpoint}}.
2. A pending checkpoint completes and updates the variables 
{{pendingCheckpoints}} and {{lastCheckpointCompletionRelativeTime}}.
3. In {{CheckpointRequestDecider#chooseRequestToExecute}} we use the previous 
{{lastCheckpointCompletionRelativeTime}} to check whether the current 
checkpoint request is too early.

I think we can read the value of {{lastCheckpointCompletionRelativeTime}} 
inside {{CheckpointRequestDecider#chooseRequestToExecute}} to solve the problem 
here.
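A toy sketch of the race (all names below are hypothetical, not the actual Flink fields): a completion time captured before the decision point misses an update that happens in between, while reading it through a supplier at decision time enforces the minimum pause:

```java
import java.util.function.LongSupplier;

public class MinPauseDemo {
    static final long MIN_PAUSE = 100; // required pause between checkpoints, ms

    // Too-early check using a value captured before the decision point:
    // if a checkpoint completes in between, the stale value lets the
    // next request through without honoring the minimum pause.
    static boolean tooEarlyStale(long capturedCompletionTime, long now) {
        return now - capturedCompletionTime < MIN_PAUSE;
    }

    // Reading the completion time at decision time sees the update.
    static boolean tooEarlyFresh(LongSupplier completionTime, long now) {
        return now - completionTime.getAsLong() < MIN_PAUSE;
    }

    public static void main(String[] args) {
        long captured = 0;     // value read when the trigger was queued
        long[] latest = {0};
        latest[0] = 950;       // a checkpoint completes at t=950, after capture
        long now = 1000;
        System.out.println(tooEarlyStale(captured, now));        // false: stale value misses the pause
        System.out.println(tooEarlyFresh(() -> latest[0], now)); // true: fresh value enforces it
    }
}
```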

> Checkpoint not maintaining minimum pause duration between checkpoints
> -
>
> Key: FLINK-18675
> URL: https://issues.apache.org/jira/browse/FLINK-18675
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
> Environment: !image.png!
>Reporter: Ravi Bhushan Ratnakar
>Priority: Critical
> Attachments: image.png
>
>
> I am running a streaming job with Flink 1.11.0 using kubernetes 
> infrastructure. I have configured checkpoint configuration like below
> Interval - 3 minutes
> Minimum pause between checkpoints - 3 minutes
> Checkpoint timeout - 10 minutes
> Checkpointing Mode - Exactly Once
> Number of Concurrent Checkpoint - 1
>  
> Other configs
> Time Characteristics - Processing Time
>  
> I am observing an unusual behaviour. *When a checkpoint completes 
> successfully and its end-to-end duration is almost equal to or greater than 
> the minimum pause duration, the next checkpoint gets triggered immediately 
> without maintaining the minimum pause duration*. Kindly notice this behaviour 
> from checkpoint id 194 onward in the attached screenshot



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18748) Savepoint would be queued unexpected if pendingCheckpoints less than maxConcurrentCheckpoints

2020-07-30 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168357#comment-17168357
 ] 

Congxian Qiu(klion26) commented on FLINK-18748:
---

[~roman_khachatryan] could you please help assign this ticket to [~wayland]? 
Thanks.
[~wayland] Please make sure you have read the [contribution 
documentation|https://flink.apache.org/contributing/how-to-contribute.html] 
before implementing, and you can ping us for review after filing a PR. Thanks.

> Savepoint would be queued unexpected if pendingCheckpoints less than 
> maxConcurrentCheckpoints
> -
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0, 1.11.1
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by a [user-zh 
> email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check 
> whether the request can be triggered in 
> {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1. 
> if (isTriggering || queuedRequests.isEmpty()) {
>return Optional.empty();
> }
> // 2 too many ongoing checkpoint/savepoint requests
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>return Optional.of(queuedRequests.first())
>   .filter(CheckpointTriggerRequest::isForce)
>   .map(unused -> queuedRequests.pollFirst());
> }
> // 3 check the timestamp of last complete checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
> {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
> savepoint will still wait some time in step 3. 
> I think we should trigger the savepoint immediately if 
> {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18748) Savepoint would be queued unexpected if pendingCheckpoints less than maxConcurrentCheckpoints

2020-07-30 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167912#comment-17167912
 ] 

Congxian Qiu(klion26) commented on FLINK-18748:
---

[~wayland] Do you want to fix this problem?

> Savepoint would be queued unexpected if pendingCheckpoints less than 
> maxConcurrentCheckpoints
> -
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0, 1.11.1
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by a [user-zh 
> email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check 
> whether the request can be triggered in 
> {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1. 
> if (isTriggering || queuedRequests.isEmpty()) {
>return Optional.empty();
> }
> // 2 too many ongoing checkpoint/savepoint
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>return Optional.of(queuedRequests.first())
>   .filter(CheckpointTriggerRequest::isForce)
>   .map(unused -> queuedRequests.pollFirst());
> }
> // 3 check the timestamp of last complete checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
> {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
> savepoint will still wait some time in step 3. 
> I think we should trigger the savepoint immediately if 
> {{pendingCheckpointSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.





[jira] [Commented] (FLINK-18748) Savepoint would be queued unexpected if pendingCheckpoints less than maxConcurrentCheckpoints

2020-07-29 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17167598#comment-17167598
 ] 

Congxian Qiu(klion26) commented on FLINK-18748:
---

Hi [~roman_khachatryan], thanks for your reply. If 
{{pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts}}, 
then we would not get past step 2; but if {{pendingCheckpointSizeSupplier.get() < 
maxConcurrentCheckpointAttempts}}, we skip case 2 and just check the min pause 
time in case 3. In this case, I think the savepoint should be triggered.

If we have {{maxConcurrentCheckpointAttempts}} == 1 and there is currently no 
ongoing checkpoint/savepoint, then when the user triggers a savepoint we will 
skip case 2 in the description; in case 3 we just check the min pause time, so 
we'll wait some time.

 

> Savepoint would be queued unexpected if pendingCheckpoints less than 
> maxConcurrentCheckpoints
> -
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0, 1.11.1
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by a [user-zh 
> email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check 
> whether the request can be triggered in 
> {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1. 
> if (isTriggering || queuedRequests.isEmpty()) {
>return Optional.empty();
> }
> // 2 too many ongoing checkpoint/savepoint
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>return Optional.of(queuedRequests.first())
>   .filter(CheckpointTriggerRequest::isForce)
>   .map(unused -> queuedRequests.pollFirst());
> }
> // 3 check the timestamp of last complete checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
> {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
> savepoint will still wait some time in step 3. 
> I think we should trigger the savepoint immediately if 
> {{pendingCheckpointSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.





[jira] [Updated] (FLINK-18748) Savepoint would be queued unexpected if pendingCheckpoints less than maxConcurrentCheckpoints

2020-07-29 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-18748:
--
Summary: Savepoint would be queued unexpected if pendingCheckpoints less 
than maxConcurrentCheckpoints  (was: Savepoint would be queued unexpected)

> Savepoint would be queued unexpected if pendingCheckpoints less than 
> maxConcurrentCheckpoints
> -
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0, 1.11.1
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by a [user-zh 
> email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check 
> whether the request can be triggered in 
> {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1. 
> if (isTriggering || queuedRequests.isEmpty()) {
>return Optional.empty();
> }
> // 2 too many ongoing checkpoint/savepoint
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>return Optional.of(queuedRequests.first())
>   .filter(CheckpointTriggerRequest::isForce)
>   .map(unused -> queuedRequests.pollFirst());
> }
> // 3 check the timestamp of last complete checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
> {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
> savepoint will still wait some time in step 3. 
> I think we should trigger the savepoint immediately if 
> {{pendingCheckpointSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.





[jira] [Updated] (FLINK-18748) Savepoint would be queued unexpected

2020-07-28 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-18748:
--
Description: 
Inspired by a [user-zh 
email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]

After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether 
the request can be triggered in 
{{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1. 
if (isTriggering || queuedRequests.isEmpty()) {
   return Optional.empty();
}

// 2 too many ongoing checkpoint/savepoint
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
   return Optional.of(queuedRequests.first())
  .filter(CheckpointTriggerRequest::isForce)
  .map(unused -> queuedRequests.pollFirst());
}

// 3 check the timestamp of last complete checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
   return onTooEarly(nextTriggerDelayMillis);
}

return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
{{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
savepoint will still wait some time in step 3. 

I think we should trigger the savepoint immediately if 
{{pendingCheckpointSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.

  was:
Inspired by an [user-zh 
email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]]

After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether 
the request can be triggered in 
{{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1. 
if (isTriggering || queuedRequests.isEmpty()) {
   return Optional.empty();
}

// 2 too many ongoing checkpoint/savepoint
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
   return Optional.of(queuedRequests.first())
  .filter(CheckpointTriggerRequest::isForce)
  .map(unused -> queuedRequests.pollFirst());
}

// 3 check the timestamp of last complete checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
   return onTooEarly(nextTriggerDelayMillis);
}

return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
{{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
savepoint will still wait some time in step 3. 

I think we should trigger the savepoint immediately if 
{{pendingCheckpointSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.


> Savepoint would be queued unexpected
> 
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0, 1.11.1
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by a [user-zh 
> email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check 
> whether the request can be triggered in 
> {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1. 
> if (isTriggering || queuedRequests.isEmpty()) {
>return Optional.empty();
> }
> // 2 too many ongoing checkpoint/savepoint
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>return Optional.of(queuedRequests.first())
>   .filter(CheckpointTriggerRequest::isForce)
>   .map(unused -> queuedRequests.pollFirst());
> }
> // 3 check the timestamp of last complete checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
> {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
> savepoint will still wait some time in step 3. 
> I think we should trigger the savepoint immediately if 
> {{pendingCheckpointSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.





[jira] [Updated] (FLINK-18748) Savepoint would be queued unexpected

2020-07-28 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-18748:
--
Description: 
Inspired by an [user-zh 
email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]]

After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether 
the request can be triggered in 
{{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1. 
if (isTriggering || queuedRequests.isEmpty()) {
   return Optional.empty();
}

// 2 too many ongoing checkpoint/savepoint
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
   return Optional.of(queuedRequests.first())
  .filter(CheckpointTriggerRequest::isForce)
  .map(unused -> queuedRequests.pollFirst());
}

// 3 check the timestamp of last complete checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
   return onTooEarly(nextTriggerDelayMillis);
}

return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
{{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
savepoint will still wait some time in step 3. 

I think we should trigger the savepoint immediately if 
{{pendingCheckpointSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.

  was:
After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether 
the request can be triggered in 
{{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1. 
if (isTriggering || queuedRequests.isEmpty()) {
   return Optional.empty();
}

// 2 too many ongoing checkpoint/savepoint
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
   return Optional.of(queuedRequests.first())
  .filter(CheckpointTriggerRequest::isForce)
  .map(unused -> queuedRequests.pollFirst());
}

// 3 check the timestamp of last complete checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
   return onTooEarly(nextTriggerDelayMillis);
}

return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
{{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
savepoint will still wait some time in step 3. 

I think we should trigger the savepoint immediately if 
{{pendingCheckpointSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.


> Savepoint would be queued unexpected
> 
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0, 1.11.1
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by an [user-zh 
> email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]]
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check 
> whether the request can be triggered in 
> {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1. 
> if (isTriggering || queuedRequests.isEmpty()) {
>return Optional.empty();
> }
> // 2 too many ongoing checkpoint/savepoint
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>return Optional.of(queuedRequests.first())
>   .filter(CheckpointTriggerRequest::isForce)
>   .map(unused -> queuedRequests.pollFirst());
> }
> // 3 check the timestamp of last complete checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
> {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
> savepoint will still wait some time in step 3. 
> I think we should trigger the savepoint immediately if 
> {{pendingCheckpointSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.





[jira] [Commented] (FLINK-18748) Savepoint would be queued unexpected

2020-07-28 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166877#comment-17166877
 ] 

Congxian Qiu(klion26) commented on FLINK-18748:
---

[~pnowojski] [~roman_khachatryan] What do you think about this problem? If it 
is valid, I can help to fix it.

> Savepoint would be queued unexpected
> 
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0, 1.11.1
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check 
> whether the request can be triggered in 
> {{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1. 
> if (isTriggering || queuedRequests.isEmpty()) {
>return Optional.empty();
> }
> // 2 too many ongoing checkpoint/savepoint
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>return Optional.of(queuedRequests.first())
>   .filter(CheckpointTriggerRequest::isForce)
>   .map(unused -> queuedRequests.pollFirst());
> }
> // 3 check the timestamp of last complete checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
> {{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
> savepoint will still wait some time in step 3. 
> I think we should trigger the savepoint immediately if 
> {{pendingCheckpointSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.





[jira] [Created] (FLINK-18748) Savepoint would be queued unexpected

2020-07-28 Thread Congxian Qiu(klion26) (Jira)
Congxian Qiu(klion26) created FLINK-18748:
-

 Summary: Savepoint would be queued unexpected
 Key: FLINK-18748
 URL: https://issues.apache.org/jira/browse/FLINK-18748
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.11.1, 1.11.0
Reporter: Congxian Qiu(klion26)


After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether 
the request can be triggered in 
{{CheckpointRequestDecider#chooseRequestToExecute}}, the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1. 
if (isTriggering || queuedRequests.isEmpty()) {
   return Optional.empty();
}

// 2 too many ongoing checkpoint/savepoint
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
   return Optional.of(queuedRequests.first())
  .filter(CheckpointTriggerRequest::isForce)
  .map(unused -> queuedRequests.pollFirst());
}

// 3 check the timestamp of last complete checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
   return onTooEarly(nextTriggerDelayMillis);
}

return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < 
{{maxConcurrentCheckpointAttempts}}, and the request is a savepoint, the 
savepoint will still wait some time in step 3. 

I think we should trigger the savepoint immediately if 
{{pendingCheckpointSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.





[jira] [Commented] (FLINK-18744) resume from modified savepoint directory: No such file or directory

2020-07-28 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17166808#comment-17166808
 ] 

Congxian Qiu(klion26) commented on FLINK-18744:
---

[~wayland]  Thanks for reporting this issue, I'll take a look at it.

> resume from modified savepoint directory: No such file or directory
> -
>
> Key: FLINK-18744
> URL: https://issues.apache.org/jira/browse/FLINK-18744
> Project: Flink
>  Issue Type: Bug
>  Components: API / State Processor
>Affects Versions: 1.11.0
>Reporter: tao wang
>Priority: Major
>
> If I resume a job from a savepoint which was modified by the State Processor API, 
> such as loading from /savepoint-path-old and writing to /savepoint-path-new, 
> the job resumes with savepoint path = /savepoint-path-new while throwing an 
> exception: 
>  _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_.
>  I think it's an issue because Flink 1.11 uses absolute paths in savepoints 
> and checkpoints, but the State Processor API missed this.
> The job works well with the new savepoint (whose path is /savepoint-path-new) 
> if I copy every directory except `_metadata` from /savepoint-path-old to 
> /savepoint-path-new.
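
The directory-copy workaround described in the last paragraph can be sketched as follows. The directory layout (a `shared` subdirectory with one state file) is an illustrative assumption; a real savepoint directory differs:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Sketch of the reported workaround: copy every entry except _metadata from the
// old savepoint directory into the new one. Temp directories stand in for real paths.
public class CopySavepointSketch {
    static void copyAllExceptMetadata(Path oldDir, Path newDir) throws IOException {
        try (Stream<Path> entries = Files.list(oldDir)) {
            for (Path entry : (Iterable<Path>) entries::iterator) {
                if (!entry.getFileName().toString().equals("_metadata")) {
                    copyRecursively(entry, newDir.resolve(entry.getFileName().toString()));
                }
            }
        }
    }

    static void copyRecursively(Path src, Path dst) throws IOException {
        try (Stream<Path> walk = Files.walk(src)) {
            // walk yields parents before children, so directories exist before copies into them
            for (Path p : (Iterable<Path>) walk::iterator) {
                Files.copy(p, dst.resolve(src.relativize(p).toString()), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path oldDir = Files.createTempDirectory("savepoint-old");
        Path newDir = Files.createTempDirectory("savepoint-new");
        Files.createDirectory(oldDir.resolve("shared"));
        Files.createFile(oldDir.resolve("shared").resolve("part-0"));
        Files.createFile(oldDir.resolve("_metadata"));

        copyAllExceptMetadata(oldDir, newDir);
        System.out.println(Files.exists(newDir.resolve("shared").resolve("part-0"))); // true
        System.out.println(Files.exists(newDir.resolve("_metadata")));                // false
    }
}
```

The new savepoint keeps its own `_metadata`; only the state directories come over from the old location.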





[jira] [Commented] (FLINK-18675) Checkpoint not maintaining minimum pause duration between checkpoints

2020-07-26 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165387#comment-17165387
 ] 

Congxian Qiu(klion26) commented on FLINK-18675:
---

Hi [~raviratnakar], from the git history, the code at line number 1512 was added 
in 1.9.0 (and it seems that change would not affect this problem). IIUC, we 
check whether the checkpoint can be triggered somewhere; I need to check it 
carefully as the code has changed a lot. Will reply here if I find anything.

> Checkpoint not maintaining minimum pause duration between checkpoints
> -
>
> Key: FLINK-18675
> URL: https://issues.apache.org/jira/browse/FLINK-18675
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
> Environment: !image.png!
>Reporter: Ravi Bhushan Ratnakar
>Priority: Critical
> Attachments: image.png
>
>
> I am running a streaming job with Flink 1.11.0 using kubernetes 
> infrastructure. I have configured checkpoint configuration like below
> Interval - 3 minutes
> Minimum pause between checkpoints - 3 minutes
> Checkpoint timeout - 10 minutes
> Checkpointing Mode - Exactly Once
> Number of Concurrent Checkpoint - 1
>  
> Other configs
> Time Characteristics - Processing Time
>  
> I am observing an unusual behaviour. *When a checkpoint completes successfully 
> and its end-to-end duration is almost equal to or greater than the minimum 
> pause duration, the next checkpoint gets triggered immediately without 
> maintaining the minimum pause duration*. Kindly note this behaviour from 
> checkpoint id 194 onward in the attached screenshot





[jira] [Commented] (FLINK-18665) Filesystem connector should use TableSchema exclude computed columns

2020-07-22 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163229#comment-17163229
 ] 

Congxian Qiu(klion26) commented on FLINK-18665:
---

[~Leonard Xu] thanks for the update :)

> Filesystem connector should use TableSchema exclude computed columns
> 
>
> Key: FLINK-18665
> URL: https://issues.apache.org/jira/browse/FLINK-18665
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem, Table SQL / Ecosystem
>Affects Versions: 1.11.1
>Reporter: Jark Wu
>Assignee: Leonard Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.2
>
>
> This is reported in 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-to-use-a-nested-column-for-CREATE-TABLE-PARTITIONED-BY-td36796.html
> {code}
> create table navi (
>   a STRING,
>   location ROW<lastUpdateTime BIGINT, transId STRING>
> ) with (
>   'connector' = 'filesystem',
>   'path' = 'east-out',
>   'format' = 'json'
> )
> CREATE TABLE output (
>   `partition` AS location.transId
> ) PARTITIONED BY (`partition`)
> WITH (
>   'connector' = 'filesystem',
>   'path' = 'east-out',
>   'format' = 'json'
> ) LIKE navi (EXCLUDING ALL)
> tEnv.sqlQuery("SELECT type, location FROM navi").executeInsert("output")
> {code}
> It throws the following exception 
> {code}
> Exception in thread "main" org.apache.flink.table.api.ValidationException: 
> The field count of logical schema of the table does not match with the field 
> count of physical schema.
> The logical schema: [STRING,ROW<`lastUpdateTime` BIGINT, `transId` STRING>]
> The physical schema: [STRING,ROW<`lastUpdateTime` BIGINT, `transId` 
> STRING>,STRING].
> {code}
> The reason is that {{FileSystemTableFactory#createTableSource}} should use the 
> schema excluding computed columns, not the original catalog table schema.
> [1]: 
> https://github.com/apache/flink/blob/master/flink-table/flink-table-runtime-blink/src/main/java/org/apache/flink/table/filesystem/FileSystemTableFactory.java#L78
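
The fix idea described above, deriving the physical schema by dropping computed columns before comparing field counts, can be sketched with plain Java collections. This is an illustration only; Flink's actual {{TableSchema}} API differs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: keep only physically stored columns when building the
// schema used for validation. Not Flink's real TableSchema API.
public class PhysicalSchemaSketch {
    static List<String> physicalFields(List<String> catalogFields, Set<String> computedColumns) {
        List<String> physical = new ArrayList<>();
        for (String field : catalogFields) {
            if (!computedColumns.contains(field)) {
                physical.add(field); // computed columns are excluded from the physical schema
            }
        }
        return physical;
    }

    public static void main(String[] args) {
        // From the example above: `partition` is defined AS location.transId, so it is computed.
        List<String> catalog = Arrays.asList("a", "location", "partition");
        Set<String> computed = new HashSet<>(Arrays.asList("partition"));
        System.out.println(physicalFields(catalog, computed)); // [a, location]
    }
}
```

With the computed `partition` column excluded, the logical and physical field counts match again, which is what the reported exception complains about.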





[jira] [Commented] (FLINK-18665) Filesystem connector should use TableSchema exclude computed columns

2020-07-21 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162493#comment-17162493
 ] 

Congxian Qiu(klion26) commented on FLINK-18665:
---

Which versions will this issue affect?

> Filesystem connector should use TableSchema exclude computed columns
> 
>
> Key: FLINK-18665
> URL: https://issues.apache.org/jira/browse/FLINK-18665
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem, Table SQL / Ecosystem
>Reporter: Jark Wu
>Assignee: Leonard Xu
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.2
>
>
> This is reported in 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-to-use-a-nested-column-for-CREATE-TABLE-PARTITIONED-BY-td36796.html
> {code}
> create table navi (
>   a STRING,
>   location ROW<lastUpdateTime BIGINT, transId STRING>
> ) with (
>   'connector' = 'filesystem',
>   'path' = 'east-out',
>   'format' = 'json'
> )
> CREATE TABLE output (
>   `partition` AS location.transId
> ) PARTITIONED BY (`partition`)
> WITH (
>   'connector' = 'filesystem',
>   'path' = 'east-out',
>   'format' = 'json'
> ) LIKE navi (EXCLUDING ALL)
> tEnv.sqlQuery("SELECT type, location FROM navi").executeInsert("output")
> {code}
> It throws the following exception 
> {code}
> Exception in thread "main" org.apache.flink.table.api.ValidationException: 
> The field count of logical schema of the table does not match with the field 
> count of physical schema.
> The logical schema: [STRING,ROW<`lastUpdateTime` BIGINT, `transId` STRING>]
> The physical schema: [STRING,ROW<`lastUpdateTime` BIGINT, `transId` 
> STRING>,STRING].
> {code}
> The reason is that {{FileSystemTableFactory#createTableSource}} should use the 
> schema excluding computed columns, not the original catalog table schema.
> [1]: 
> https://github.com/apache/flink/blob/master/flink-table/flink-table-runtime-blink/src/main/java/org/apache/flink/table/filesystem/FileSystemTableFactory.java#L78





[jira] [Comment Edited] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-07-20 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161719#comment-17161719
 ] 

Congxian Qiu(klion26) edited comment on FLINK-17571 at 7/21/20, 5:33 AM:
-

[~pnowojski] could you please assign this to me if it's ok.

I want to add a command-line command {{./bin/flink checkpoint list $path}}.   
The $path should be the parent directory of {{_metadata}} or the path of 
{{_metadata}}

And the result will be all the files used by the specific 
checkpoint/savepoint.


was (Author: klion26):
[~pnowojski] could you please assign this to me if it's ok.

I want to add a command-line command {{./bin/flink savepoint list $path}}.   
The $path should be the parent directory of {{_metadata}} or the path of 
{{_metadata}}

And the result will be all the files used by the specific 
checkpoint/savepoint.

> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Command Line Client, Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be 
> deleted safely, but users can't delete the files in the SHARED directory 
> safely (the files may have been created a long time ago).
> I think it's better to give users a way to know which files are 
> currently used (so the others are not used).
> Maybe a command-line command such as the one below is enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}





[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-07-20 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the 
[userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]

Currently, there are [three types of 
directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
 for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be 
deleted safely, but users can't delete the files in the SHARED directory 
safely (the files may have been created a long time ago).

I think it's better to give users a way to know which files are 
currently used (so the others are not used).

Maybe a command-line command such as the one below is enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the 
[userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]

Currently, there are [three types of 
directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
 for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be 
deleted safely, but users can't delete the files in the SHARED directory 
safely (the files may have been created a long time ago).

I think it's better to give users a way to know which files are 
currently used (so the others are not used).

Maybe a command-line command such as the one below is enough to support such a 
feature.

{{./bin/flink savepoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Command Line Client, Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be 
> deleted safely, but users can't delete the files in the SHARED directory 
> safely (the files may have been created a long time ago).
> I think it's better to give users a way to know which files are 
> currently used (so the others are not used).
> Maybe a command-line command such as the one below is enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}
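As a rough illustration of what such a command could print, here is a hedged Python sketch based only on the directory layout documented on the checkpoints page linked above (chk-x/ holds EXCLUSIVE files, shared/ holds SHARED files, taskowned/ holds TASKOWNED files). A real implementation would additionally have to parse {{_metadata}} to know which SHARED files a given checkpoint actually references; every name below is illustrative, not Flink code:

```python
import os

# Maps the documented checkpoint sub-directories to their state scopes.
SCOPE_DIRS = {"shared": "SHARED", "taskowned": "TASKOWNED"}

def list_checkpoint_files(job_checkpoint_dir):
    """Return (scope, path) pairs for every file below a job's checkpoint
    directory. Illustrative only: it walks the documented layout instead of
    parsing _metadata, so it cannot tell which SHARED files are still live."""
    files = []
    for entry in sorted(os.listdir(job_checkpoint_dir)):
        path = os.path.join(job_checkpoint_dir, entry)
        if not os.path.isdir(path):
            continue
        # chk-<n> directories hold the EXCLUSIVE files of checkpoint n.
        scope = "EXCLUSIVE" if entry.startswith("chk-") else SCOPE_DIRS.get(entry)
        if scope is None:
            continue
        for name in sorted(os.listdir(path)):
            files.append((scope, os.path.join(path, name)))
    return files
```

A proper `flink checkpoint list` would resolve the SHARED entries through the checkpoint's metadata rather than listing the whole shared/ directory, which is exactly the part users cannot do safely by hand today.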





[jira] [Comment Edited] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-07-20 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161719#comment-17161719
 ] 

Congxian Qiu(klion26) edited comment on FLINK-17571 at 7/21/20, 5:33 AM:
-

[~pnowojski] could you please assign this to me if it's ok.

I want to add a command-line command {{./bin/flink checkpoint list $path}}.   
The $path should be the parent directory of {{_metadata}} or the path of 
{{_metadata}}

{{And the result will be all the files used by the specific checkpoint.}}


was (Author: klion26):
[~pnowojski] could you please assign this to me if it's ok.

I want to add a command-line command {{./bin/flink checkpoint list $path}}.   
The $path should be the parent directory of {{_metadata}} or the path of 
{{_metadata}}

{{And the result will be all the files used by the specific 
checkpoint/savepoint.}}

> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Command Line Client, Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can 
> be deleted safely, but users can't safely delete the files in the SHARED 
> directory (those files may have been created a long time ago).
> I think it's better to give users a way to know which files are currently in 
> use (so that the others are known to be unused).
> Maybe a command-line command such as the one below would be enough to support 
> such a feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}





[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-07-20 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the 
[userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]

Currently, there are [three types of 
directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
 for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be 
deleted safely, but users can't safely delete the files in the SHARED directory 
(those files may have been created a long time ago).

I think it's better to give users a way to know which files are currently in 
use (so that the others are known to be unused).

Maybe a command-line command such as the one below would be enough to support 
such a feature.

{{./bin/flink savepoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the 
[userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]

Currently, there are [three types of 
directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
 for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can be 
deleted safely, but users can't safely delete the files in the SHARED directory 
(those files may have been created a long time ago).

I think it's better to give users a way to know which files are currently in 
use (so that the others are known to be unused).

Maybe a command-line command such as the one below would be enough to support 
such a feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Command Line Client, Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can 
> be deleted safely, but users can't safely delete the files in the SHARED 
> directory (those files may have been created a long time ago).
> I think it's better to give users a way to know which files are currently in 
> use (so that the others are known to be unused).
> Maybe a command-line command such as the one below would be enough to support 
> such a feature.
> {{./bin/flink savepoint list $checkpointDir  # list all the files used in 
> checkpoint}}





[jira] [Commented] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-07-20 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161719#comment-17161719
 ] 

Congxian Qiu(klion26) commented on FLINK-17571:
---

[~pnowojski] could you please assign this to me if it's ok.

I want to add a command-line command {{./bin/flink savepoint list $path}}.   
The $path should be the parent directory of {{_metadata}} or the path of 
{{_metadata}}

{{And the result will be all the files used by the specific 
checkpoint/savepoint.}}

> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Command Line Client, Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint; the files in the TASKOWNED and EXCLUSIVE directories can 
> be deleted safely, but users can't safely delete the files in the SHARED 
> directory (those files may have been created a long time ago).
> I think it's better to give users a way to know which files are currently in 
> use (so that the others are known to be unused).
> Maybe a command-line command such as the one below would be enough to support 
> such a feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}





[jira] [Commented] (FLINK-18637) Key group is not in KeyGroupRange

2020-07-20 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160919#comment-17160919
 ] 

Congxian Qiu(klion26) commented on FLINK-18637:
---

[~oripwk] thanks for reporting this problem. Could you please update the ticket 
with the Flink version you're using?
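Background for readers hitting this error: Flink routes each key to a key group via a hash of key.hashCode(), and every parallel operator instance owns a contiguous KeyGroupRange. The Python sketch below is an illustrative reconstruction, not Flink's code (it uses a plain modulo in place of the murmur hash), but it shows the mechanism: if hashCode() is not stable across runs, a restored key can hash into a key group outside the range of the instance that owns its state, producing exactly the reported "Key group is not in KeyGroupRange" failure.

```python
def assign_to_key_group(stable_hash: int, max_parallelism: int) -> int:
    """Stand-in for KeyGroupRangeAssignment.assignToKeyGroup: Flink applies a
    murmur hash to key.hashCode() and reduces it modulo maxParallelism."""
    return stable_hash % max_parallelism

def operator_key_group_range(max_parallelism: int, parallelism: int,
                             operator_index: int):
    """Contiguous [start, end] key-group range owned by one operator instance
    (illustrative reconstruction of the range-splitting formula)."""
    start = (operator_index * max_parallelism + parallelism - 1) // parallelism
    end = ((operator_index + 1) * max_parallelism - 1) // parallelism
    return start, end

MAX_PARALLELISM, PARALLELISM = 128, 4
ranges = [operator_key_group_range(MAX_PARALLELISM, PARALLELISM, i)
          for i in range(PARALLELISM)]

# A key whose hash is stable always lands in exactly one of these ranges.
# An identity-based (unstable) hashCode() could give a different value after
# restore, so the key's state would be looked up outside the owning range.
stable_hash = 0x5F3759DF  # arbitrary stand-in for murmurHash(key.hashCode())
key_group = assign_to_key_group(stable_hash, MAX_PARALLELISM)
```

With maxParallelism 128 and parallelism 4, the four instances own the ranges [0, 31], [32, 63], [64, 95], and [96, 127], and a stable hash maps a key into exactly one of them.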

> Key group is not in KeyGroupRange
> -
>
> Key: FLINK-18637
> URL: https://issues.apache.org/jira/browse/FLINK-18637
> Project: Flink
>  Issue Type: Bug
> Environment: OS current user: yarn
> Current Hadoop/Kerberos user: hadoop
> JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.141-b15
> Maximum heap size: 28960 MiBytes
> JAVA_HOME: /usr/java/jdk1.8.0_141/jre
> Hadoop version: 2.8.5-amzn-6
> JVM Options:
>-Xmx30360049728
>-Xms30360049728
>-XX:MaxDirectMemorySize=4429185024
>-XX:MaxMetaspaceSize=1073741824
>-XX:+UseG1GC
>-XX:+UnlockDiagnosticVMOptions
>-XX:+G1SummarizeConcMark
>-verbose:gc
>-XX:+PrintGCDetails
>-XX:+PrintGCDateStamps
>-XX:+UnlockCommercialFeatures
>-XX:+FlightRecorder
>-XX:+DebugNonSafepoints
>
> -XX:FlightRecorderOptions=defaultrecording=true,settings=/home/hadoop/heap.jfc,dumponexit=true,dumponexitpath=/var/lib/hadoop-yarn/recording.jfr,loglevel=info
>
> -Dlog.file=/var/log/hadoop-yarn/containers/application_1593935560662_0002/container_1593935560662_0002_01_02/taskmanager.log
>-Dlog4j.configuration=file:./log4j.properties
> Program Arguments:
>-Dtaskmanager.memory.framework.off-heap.size=134217728b
>-Dtaskmanager.memory.network.max=1073741824b
>-Dtaskmanager.memory.network.min=1073741824b
>-Dtaskmanager.memory.framework.heap.size=134217728b
>-Dtaskmanager.memory.managed.size=23192823744b
>-Dtaskmanager.cpu.cores=7.0
>-Dtaskmanager.memory.task.heap.size=30225832000b
>-Dtaskmanager.memory.task.off-heap.size=3221225472b
>--configDir.
>
> -Djobmanager.rpc.address=ip-10-180-30-250.us-west-2.compute.internal-Dweb.port=0
>-Dweb.tmpdir=/tmp/flink-web-64f613cf-bf04-4a09-8c14-75c31b619574
>-Djobmanager.rpc.port=33739
>-Drest.address=ip-10-180-30-250.us-west-2.compute.internal
>Reporter: Ori Popowski
>Priority: Major
>
> I'm getting this error when creating a savepoint. I've read in 
> https://issues.apache.org/jira/browse/FLINK-16193 that it's caused by 
> unstable hashcode or equals on the key, or improper use of 
> {{reinterpretAsKeyedStream}}.
>   
>  My key is a string and I don't use {{reinterpretAsKeyedStream}}.
>  
> {code:java}
> senv
>   .addSource(source)
>   .flatMap(…)
>   .filterWith { case (metadata, _, _) => … }
>   .assignTimestampsAndWatermarks(new 
> BoundedOutOfOrdernessTimestampExtractor(…))
>   .keyingBy { case (meta, _) => meta.toPath.toString }
>   .process(new TruncateLargeSessions(config.sessionSizeLimit))
>   .keyingBy { case (meta, _) => meta.toPath.toString }
>   .window(EventTimeSessionWindows.withGap(Time.of(…)))
>   .process(new ProcessSession(sessionPlayback, config))
>   .addSink(sink){code}
>  
> {code:java}
> org.apache.flink.util.FlinkException: Triggering a savepoint for the job 
> 962fc8e984e7ca1ed65a038aa62ce124 failed.
>   at 
> org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:633)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:611)
>   at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>   at 
> org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:608)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:910)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
> Caused by: java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.checkpoint.CheckpointException: The job has failed.
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.lambda$triggerSavepoint$3(SchedulerBase.java:744)
>   at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
>   at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
>   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>   at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>   at 
> 

[jira] [Commented] (FLINK-18582) Add title anchor link for file event_driven.zh.md

2020-07-14 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17157881#comment-17157881
 ] 

Congxian Qiu(klion26) commented on FLINK-18582:
---

[~temper] thanks for creating this ticket. Could you please share the pages 
that reference this page? If currently no pages reference 
{{event_driven.zh.md}}, maybe we could do this later. What do you think?

From my side, we always encourage adding title anchor links, but I'm not sure 
whether we would accept a change that only adds the title anchor (if currently 
no page references it).

cc [~jark]

> Add title anchor link for file event_driven.zh.md
> -
>
> Key: FLINK-18582
> URL: https://issues.apache.org/jira/browse/FLINK-18582
> Project: Flink
>  Issue Type: Improvement
>  Components: Documentation, Quickstarts
>Reporter: Herman, Li
>Priority: Major
>
> Translated files should add anchor links for sub-titles.
> Otherwise, links from other documentation may not be able to point to the 
> specified title, such as the title "Side Outputs" in file event_driven.zh.md.
>  





[jira] [Updated] (FLINK-18523) Advance watermark if there is no data in all of the partitions after some time

2020-07-11 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-18523:
--
Description: 
In the case of window calculations and event-time scenarios, the watermark 
cannot advance because the source has no data for some reason, and the last 
windows cannot trigger their calculations.

One parameter, table.exec.source.idle-timeout, can only solve the problem of 
watermark alignment ignoring idle parallel instances. But when there is no 
watermark in any parallel instance, you still cannot advance the watermark.

Is it possible to add a lock-timeout parameter (which should be larger than 
maxOutOfOrderness, with a default of "-1 ms") so that if the watermark is not 
updated for longer than this time (i.e., there is no data), the current time is 
taken and sent downstream as the watermark?

 

thanks!

  was:
In the case of window calculations and eventTime scenarios, watermar cannot 
update because the source does not have data for some reason, and the last 
Windows cannot trigger the calculations.

One parameter, table.exec.source. Idle -timeout, can only solve the problem of 
ignoring parallelism of watermark alignment that does not occur.But when there 
is no watermark in each parallel degree, you still cannot update the watermark.

Is it possible to add a lock-timeout parameter (which should be larger than 
maxOutOfOrderness with default of "-1 ms") and if the watermark is not updated 
beyond this time (i.e., there is no data), then the current time is taken and 
sent downstream as the watermark.

 

thanks!


> Advance watermark if there is no data in all of the partitions after some time
> --
>
> Key: FLINK-18523
> URL: https://issues.apache.org/jira/browse/FLINK-18523
> Project: Flink
>  Issue Type: New Feature
>  Components: Table SQL / API
>Affects Versions: 1.11.0
>Reporter: chen yong
>Priority: Major
>
> In the case of window calculations and event-time scenarios, the watermark 
> cannot advance because the source has no data for some reason, and the last 
> windows cannot trigger their calculations.
> One parameter, table.exec.source.idle-timeout, can only solve the problem of 
> watermark alignment ignoring idle parallel instances. But when there is no 
> watermark in any parallel instance, you still cannot advance the watermark.
> Is it possible to add a lock-timeout parameter (which should be larger than 
> maxOutOfOrderness, with a default of "-1 ms") so that if the watermark is not 
> updated for longer than this time (i.e., there is no data), the current time 
> is taken and sent downstream as the watermark?
>  
> thanks!
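For context: Flink 1.11's WatermarkStrategy.withIdleness only marks an idle partition as idle so that watermark alignment skips it; it does not fabricate a watermark when every partition is idle, which is the gap this ticket describes. Below is a minimal, language-agnostic sketch of the proposed behaviour. The class and the lock_timeout_ms parameter are hypothetical, not a Flink API:

```python
class IdleAdvancingWatermarkGenerator:
    """Sketch of the proposal: emit max_event_time - max_out_of_orderness as
    usual, but once no event has arrived for lock_timeout_ms of wall-clock
    time, fall back to the current processing time as the watermark.
    A negative lock_timeout_ms (the proposed default, "-1 ms") disables it."""

    def __init__(self, max_out_of_orderness_ms: int, lock_timeout_ms: int):
        self.max_ooo = max_out_of_orderness_ms
        self.lock_timeout = lock_timeout_ms
        self.max_event_ts = None
        self.last_event_wallclock = None

    def on_event(self, event_ts_ms: int, now_ms: int) -> None:
        # Track the highest event timestamp and when we last saw any data.
        if self.max_event_ts is None or event_ts_ms > self.max_event_ts:
            self.max_event_ts = event_ts_ms
        self.last_event_wallclock = now_ms

    def current_watermark(self, now_ms: int) -> int:
        idle_too_long = (self.last_event_wallclock is not None
                         and now_ms - self.last_event_wallclock > self.lock_timeout)
        if self.lock_timeout >= 0 and idle_too_long:
            return now_ms  # no data for too long: advance with the wall clock
        if self.max_event_ts is None:
            return -2**63  # Long.MIN_VALUE, i.e. "no watermark yet"
        return self.max_event_ts - self.max_ooo
```

With max_out_of_orderness_ms=100 and lock_timeout_ms=5000, an event with timestamp 1000 seen at wall-clock 0 yields watermark 900 until the source has been silent for more than 5 s, after which the wall-clock time itself is emitted so the last windows can fire.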





[jira] [Commented] (FLINK-18493) Make target home directory used to store yarn files configurable

2020-07-06 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151770#comment-17151770
 ] 

Congxian Qiu(klion26) commented on FLINK-18493:
---

+1 for this feature.

> Make target home directory used to store yarn files configurable
> 
>
> Key: FLINK-18493
> URL: https://issues.apache.org/jira/browse/FLINK-18493
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Kevin Zhang
>Priority: Major
>
> When submitting applications to yarn, the file system's default home 
> directory, like /user/user_name on hdfs, is used as the base directory to 
> store flink files. But in different file system access control strategies, 
> the default home directory may be inaccessible, and it's preferable to let 
> users specify their own paths if they wish to.





[jira] [Commented] (FLINK-18196) flink throws `NullPointerException` when executeCheckpointing

2020-07-03 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150754#comment-17150754
 ] 

Congxian Qiu(klion26) commented on FLINK-18196:
---

This seems to be another report of a similar problem:

[http://apache-flink.147419.n8.nabble.com/flink1-10-checkpoint-td4337.html]

> flink throws `NullPointerException` when executeCheckpointing
> -
>
> Key: FLINK-18196
> URL: https://issues.apache.org/jira/browse/FLINK-18196
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0, 1.10.1
> Environment: flink version: flink-1.10.0 flink-1.10.1
> jdk version: jdk1.8.0_40
>Reporter: Kai Chen
>Priority: Major
>
> I met a checkpoint NPE when executing the wordcount example:
> java.lang.Exception: Could not perform checkpoint 5505 for operator Source: 
> KafkaTableSource(xxx) > SourceConversion(table=[xxx, source: 
> [KafkaTableSource(xxx)]], fields=[xxx]) > Calc(select=[xxx) AS xxx]) > 
> SinkConversionToTuple2  --  > Sink: Elasticsearch6UpsertTableSink(xxx) (1/1).
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:802)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:777)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$228/1024478318.call(Unknown
>  Source)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
> at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:487)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1411)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:991)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:887)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$229/1010499540.run(Unknown
>  Source)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:860)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:793)
> ... 12 more
>  
> checkpoint configuration
> |Checkpointing Mode|Exactly Once|
> |Interval|5s|
> |Timeout|10m 0s|
> |Minimum Pause Between Checkpoints|0ms|
> |Maximum Concurrent Checkpoints|1|
> With debug enabled, I found `checkpointMetaData`  is null at
>  
> [https://github.com/apache/flink/blob/release-1.10.0/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L1421]
>  
> I fixed this with this patch: 
> [https://github.com/yuchuanchen/flink/commit/e5122d9787be1fee9bce141887e0d70c9b0a4f19]





[jira] [Commented] (FLINK-18433) From the end-to-end performance test results, 1.11 has a regression

2020-06-27 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147142#comment-17147142
 ] 

Congxian Qiu(klion26) commented on FLINK-18433:
---

Thanks for running these tests [~AHeise]. I guess we have to wait for [~Aihua]. 
From my previous experience, the results of the bounded e2e test and the 
unbounded e2e test may differ because of the impact of the setup work.

I checked the log and found that the checkpoint expired before completing; the 
job fails because of {{Exceeded checkpoint tolerable failure threshold}}, and 
the task received an abort-checkpoint message. I'm not very sure whether 
increasing the tolerable checkpoint failure threshold can improve the 
performance or not.
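The threshold referred to above is CheckpointConfig#setTolerableCheckpointFailureNumber; in Flink 1.11 it is also exposed as a configuration option. Shown only to illustrate the knob under discussion: the value 3 is an arbitrary example, and the default of 0 means any checkpoint failure fails the job.

```yaml
# flink-conf.yaml (Flink 1.11+): tolerate up to 3 failed/expired checkpoints
# before the job itself is failed. The default is 0.
execution.checkpointing.tolerable-failed-checkpoints: 3
```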

> From the end-to-end performance test results, 1.11 has a regression
> ---
>
> Key: FLINK-18433
> URL: https://issues.apache.org/jira/browse/FLINK-18433
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core, API / DataStream
>Affects Versions: 1.11.0
> Environment: 3 machines
> [|https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
>Reporter: Aihua Li
>Priority: Major
> Attachments: flink_11.log.gz
>
>
>  
> I ran end-to-end performance tests between the Release-1.10 and Release-1.11. 
> the results were as follows:
> |scenarioName|release-1.10|release-1.11| |
> |OneInput_Broadcast_LazyFromSource_ExactlyOnce_10_rocksdb|46.175|43.8133|-5.11%|
> |OneInput_Rescale_LazyFromSource_ExactlyOnce_100_heap|211.835|200.355|-5.42%|
> |OneInput_Rebalance_LazyFromSource_ExactlyOnce_1024_rocksdb|1721.041667|1618.32|-5.97%|
> |OneInput_KeyBy_LazyFromSource_ExactlyOnce_10_heap|46|43.615|-5.18%|
> |OneInput_Broadcast_Eager_ExactlyOnce_100_rocksdb|212.105|199.688|-5.85%|
> |OneInput_Rescale_Eager_ExactlyOnce_1024_heap|1754.64|1600.12|-8.81%|
> |OneInput_Rebalance_Eager_ExactlyOnce_10_rocksdb|45.9167|43.0983|-6.14%|
> |OneInput_KeyBy_Eager_ExactlyOnce_100_heap|212.0816667|200.727|-5.35%|
> |OneInput_Broadcast_LazyFromSource_AtLeastOnce_1024_rocksdb|1718.245|1614.381667|-6.04%|
> |OneInput_Rescale_LazyFromSource_AtLeastOnce_10_heap|46.12|43.5517|-5.57%|
> |OneInput_Rebalance_LazyFromSource_AtLeastOnce_100_rocksdb|212.038|200.388|-5.49%|
> |OneInput_KeyBy_LazyFromSource_AtLeastOnce_1024_heap|1762.048333|1606.408333|-8.83%|
> |OneInput_Broadcast_Eager_AtLeastOnce_10_rocksdb|46.0583|43.4967|-5.56%|
> |OneInput_Rescale_Eager_AtLeastOnce_100_heap|212.233|201.188|-5.20%|
> |OneInput_Rebalance_Eager_AtLeastOnce_1024_rocksdb|1720.66|1616.85|-6.03%|
> |OneInput_KeyBy_Eager_AtLeastOnce_10_heap|46.14|43.6233|-5.45%|
> |TwoInputs_Broadcast_LazyFromSource_ExactlyOnce_100_rocksdb|156.918|152.957|-2.52%|
> |TwoInputs_Rescale_LazyFromSource_ExactlyOnce_1024_heap|1415.511667|1300.1|-8.15%|
> |TwoInputs_Rebalance_LazyFromSource_ExactlyOnce_10_rocksdb|34.2967|34.1667|-0.38%|
> |TwoInputs_KeyBy_LazyFromSource_ExactlyOnce_100_heap|158.353|151.848|-4.11%|
> |TwoInputs_Broadcast_Eager_ExactlyOnce_1024_rocksdb|1373.406667|1300.056667|-5.34%|
> |TwoInputs_Rescale_Eager_ExactlyOnce_10_heap|34.5717|32.0967|-7.16%|
> |TwoInputs_Rebalance_Eager_ExactlyOnce_100_rocksdb|158.655|147.44|-7.07%|
> |TwoInputs_KeyBy_Eager_ExactlyOnce_1024_heap|1356.611667|1292.386667|-4.73%|
> |TwoInputs_Broadcast_LazyFromSource_AtLeastOnce_10_rocksdb|34.01|33.205|-2.37%|
> |TwoInputs_Rescale_LazyFromSource_AtLeastOnce_100_heap|149.588|145.997|-2.40%|
> |TwoInputs_Rebalance_LazyFromSource_AtLeastOnce_1024_rocksdb|1359.74|1299.156667|-4.46%|
> |TwoInputs_KeyBy_LazyFromSource_AtLeastOnce_10_heap|34.025|29.6833|-12.76%|
> |TwoInputs_Broadcast_Eager_AtLeastOnce_100_rocksdb|157.303|151.4616667|-3.71%|
> |TwoInputs_Rescale_Eager_AtLeastOnce_1024_heap|1368.74|1293.238333|-5.52%|
> |TwoInputs_Rebalance_Eager_AtLeastOnce_10_rocksdb|34.325|33.285|-3.03%|
> |TwoInputs_KeyBy_Eager_AtLeastOnce_100_heap|162.5116667|134.375|-17.31%|
> It can be seen that the performance of 1.11 has a regression, basically 
> around 5%, and the maximum regression is 17%. This needs to be checked.
> the test code:
> flink-1.10.0: 
> [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
> flink-1.11.0: 
> [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
> the submit cmd is like this:
> bin/flink run -d -m 192.168.39.246:8081 -c 

[jira] [Commented] (FLINK-18433) From the end-to-end performance test results, 1.11 has a regression

2020-06-24 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17144611#comment-17144611
 ] 

Congxian Qiu(klion26) commented on FLINK-18433:
---

[~Aihua] thanks for the work. Maybe the affected version should be 1.11.0?

> From the end-to-end performance test results, 1.11 has a regression
> ---
>
> Key: FLINK-18433
> URL: https://issues.apache.org/jira/browse/FLINK-18433
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core, API / DataStream
>Affects Versions: 1.11.1
> Environment: 3 machines
> [|https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
>Reporter: Aihua Li
>Priority: Major
>
>  
> I ran end-to-end performance tests between the Release-1.10 and Release-1.11. 
> the results were as follows:
> |scenarioName|release-1.10|release-1.11| |
> |OneInput_Broadcast_LazyFromSource_ExactlyOnce_10_rocksdb|46.175|43.8133|-5.11%|
> |OneInput_Rescale_LazyFromSource_ExactlyOnce_100_heap|211.835|200.355|-5.42%|
> |OneInput_Rebalance_LazyFromSource_ExactlyOnce_1024_rocksdb|1721.041667|1618.32|-5.97%|
> |OneInput_KeyBy_LazyFromSource_ExactlyOnce_10_heap|46|43.615|-5.18%|
> |OneInput_Broadcast_Eager_ExactlyOnce_100_rocksdb|212.105|199.688|-5.85%|
> |OneInput_Rescale_Eager_ExactlyOnce_1024_heap|1754.64|1600.12|-8.81%|
> |OneInput_Rebalance_Eager_ExactlyOnce_10_rocksdb|45.9167|43.0983|-6.14%|
> |OneInput_KeyBy_Eager_ExactlyOnce_100_heap|212.0816667|200.727|-5.35%|
> |OneInput_Broadcast_LazyFromSource_AtLeastOnce_1024_rocksdb|1718.245|1614.381667|-6.04%|
> |OneInput_Rescale_LazyFromSource_AtLeastOnce_10_heap|46.12|43.5517|-5.57%|
> |OneInput_Rebalance_LazyFromSource_AtLeastOnce_100_rocksdb|212.038|200.388|-5.49%|
> |OneInput_KeyBy_LazyFromSource_AtLeastOnce_1024_heap|1762.048333|1606.408333|-8.83%|
> |OneInput_Broadcast_Eager_AtLeastOnce_10_rocksdb|46.0583|43.4967|-5.56%|
> |OneInput_Rescale_Eager_AtLeastOnce_100_heap|212.233|201.188|-5.20%|
> |OneInput_Rebalance_Eager_AtLeastOnce_1024_rocksdb|1720.66|1616.85|-6.03%|
> |OneInput_KeyBy_Eager_AtLeastOnce_10_heap|46.14|43.6233|-5.45%|
> |TwoInputs_Broadcast_LazyFromSource_ExactlyOnce_100_rocksdb|156.918|152.957|-2.52%|
> |TwoInputs_Rescale_LazyFromSource_ExactlyOnce_1024_heap|1415.511667|1300.1|-8.15%|
> |TwoInputs_Rebalance_LazyFromSource_ExactlyOnce_10_rocksdb|34.2967|34.1667|-0.38%|
> |TwoInputs_KeyBy_LazyFromSource_ExactlyOnce_100_heap|158.353|151.848|-4.11%|
> |TwoInputs_Broadcast_Eager_ExactlyOnce_1024_rocksdb|1373.406667|1300.056667|-5.34%|
> |TwoInputs_Rescale_Eager_ExactlyOnce_10_heap|34.5717|32.0967|-7.16%|
> |TwoInputs_Rebalance_Eager_ExactlyOnce_100_rocksdb|158.655|147.44|-7.07%|
> |TwoInputs_KeyBy_Eager_ExactlyOnce_1024_heap|1356.611667|1292.386667|-4.73%|
> |TwoInputs_Broadcast_LazyFromSource_AtLeastOnce_10_rocksdb|34.01|33.205|-2.37%|
> |TwoInputs_Rescale_LazyFromSource_AtLeastOnce_100_heap|149.588|145.997|-2.40%|
> |TwoInputs_Rebalance_LazyFromSource_AtLeastOnce_1024_rocksdb|1359.74|1299.156667|-4.46%|
> |TwoInputs_KeyBy_LazyFromSource_AtLeastOnce_10_heap|34.025|29.6833|-12.76%|
> |TwoInputs_Broadcast_Eager_AtLeastOnce_100_rocksdb|157.303|151.4616667|-3.71%|
> |TwoInputs_Rescale_Eager_AtLeastOnce_1024_heap|1368.74|1293.238333|-5.52%|
> |TwoInputs_Rebalance_Eager_AtLeastOnce_10_rocksdb|34.325|33.285|-3.03%|
> |TwoInputs_KeyBy_Eager_AtLeastOnce_100_heap|162.5116667|134.375|-17.31%|
> It can be seen that the performance of 1.11 has a regression, basically 
> around 5%, and the maximum regression is 17%. This needs to be checked.
> the test code:
> flink-1.10.0: 
> [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
> flink-1.11.0: 
> [https://github.com/Li-Aihua/flink/blob/test_suite_for_basic_operations_1.11/flink-end-to-end-perf-tests/flink-basic-operations/src/main/java/org/apache/flink/basic/operations/PerformanceTestJob.java]
> the submit cmd is like this:
> bin/flink run -d -m 192.168.39.246:8081 -c 
> org.apache.flink.basic.operations.PerformanceTestJob 
> /home/admin/flink-basic-operations_2.11-1.10-SNAPSHOT.jar --topologyName 
> OneInput --LogicalAttributesofEdges Broadcast --ScheduleMode LazyFromSource 
> --CheckpointMode ExactlyOnce --recordSize 10 --stateBackend rocksdb
>  





[jira] [Commented] (FLINK-5601) Window operator does not checkpoint watermarks

2020-06-23 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-5601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142686#comment-17142686
 ] 

Congxian Qiu(klion26) commented on FLINK-5601:
--

This seems to be another affected user (user-zh mailing list): 
[http://apache-flink.147419.n8.nabble.com/savepoint-td4133.html]

> Window operator does not checkpoint watermarks
> --
>
> Key: FLINK-5601
> URL: https://issues.apache.org/jira/browse/FLINK-5601
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0
>Reporter: Ufuk Celebi
>Assignee: Jiayi Liao
>Priority: Critical
>  Labels: pull-request-available
>
> During release testing [~stefanrichte...@gmail.com] and I noticed that 
> watermarks are not checkpointed in the window operator.
> This can lead to non determinism when restoring checkpoints. I was running an 
> adjusted {{SessionWindowITCase}} via Kafka for testing migration and 
> rescaling and ran into failures, because the data generator required 
> determinisitic behaviour.
> What happened was that on restore it could happen that late elements were not 
> dropped, because the watermarks needed to be re-established after restore 
> first.
> [~aljoscha] Do you know whether there is a special reason for explicitly not 
> checkpointing watermarks?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18264) Translate the "External Resource Framework" page into Chinese

2020-06-22 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141840#comment-17141840
 ] 

Congxian Qiu(klion26) commented on FLINK-18264:
---

[~karmagyz] assigned to you

> Translate the "External Resource Framework" page into Chinese
> -
>
> Key: FLINK-18264
> URL: https://issues.apache.org/jira/browse/FLINK-18264
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Affects Versions: 1.11.0
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Major
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18264) Translate the "External Resource Framework" page into Chinese

2020-06-22 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) reassigned FLINK-18264:
-

Assignee: Yangze Guo

> Translate the "External Resource Framework" page into Chinese
> -
>
> Key: FLINK-18264
> URL: https://issues.apache.org/jira/browse/FLINK-18264
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Affects Versions: 1.11.0
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Major
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-18091) Test Relocatable Savepoints

2020-06-12 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) closed FLINK-18091.
-
Resolution: Fixed

> Test Relocatable Savepoints
> ---
>
> Key: FLINK-18091
> URL: https://issues.apache.org/jira/browse/FLINK-18091
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Stephan Ewen
>Assignee: Congxian Qiu(klion26)
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The test should do the following:
>  * take a savepoint. needs to make sure the job has enough state that there 
> is more than just the "_metadata" file
>  * copy it to another directory
>  * start the job from that savepoint by addressing the metadata file and by 
> addressing the savepoint directory
> We should also test that an incremental checkpoint that gets moved fails with 
> a reasonable exception.
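The manual test described above can be sketched as the following shell session. This is a minimal sketch, not the exact commands from the ticket: the job id, HDFS paths, and jar name are placeholders, and the `flink` / `hadoop` binaries are assumed to be on the PATH of a machine with access to the cluster.

```sh
# 1. Trigger a savepoint for a running job (job id is a placeholder);
#    the resulting directory should contain _metadata plus state files.
bin/flink savepoint <job-id> hdfs:///savepoints/original

# 2. Relocate the whole savepoint directory, keeping _metadata and the
#    state files together (relocatability requires moving them as a unit).
hadoop fs -mv hdfs:///savepoints/original/savepoint-xxxx \
             hdfs:///savepoints/relocated

# 3. Resume the job, either by addressing the _metadata file ...
bin/flink run -s hdfs:///savepoints/relocated/_metadata my-job.jar

# ... or by addressing the savepoint directory itself.
bin/flink run -s hdfs:///savepoints/relocated my-job.jar
```

The corresponding negative test moves an *incremental checkpoint* directory the same way and verifies that the restore fails with a reasonable exception, since incremental checkpoints are not relocatable.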



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints

2020-06-12 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134092#comment-17134092
 ] 

Congxian Qiu(klion26) commented on FLINK-18091:
---

Closing this issue, as we have tested on a real cluster as described in the 
previous comment.

Please feel free to reopen if anyone has any other concerns. Thanks.

> Test Relocatable Savepoints
> ---
>
> Key: FLINK-18091
> URL: https://issues.apache.org/jira/browse/FLINK-18091
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Stephan Ewen
>Assignee: Congxian Qiu(klion26)
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The test should do the following:
>  * take a savepoint. needs to make sure the job has enough state that there 
> is more than just the "_metadata" file
>  * copy it to another directory
>  * start the job from that savepoint by addressing the metadata file and by 
> addressing the savepoint directory
> We should also test that an incremental checkpoint that gets moved fails with 
> a reasonable exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints

2020-06-10 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17132837#comment-17132837
 ] 

Congxian Qiu(klion26) commented on FLINK-18091:
---

The logic of the manual savepoint test here is the same as the ITCase (trigger 
a savepoint, move the directory, and verify the job can restore from the 
savepoint successfully) – IMHO, this double-checks that the feature works. It 
also verifies that restoring an incremental checkpoint fails after relocating 
the directory.

> Test Relocatable Savepoints
> ---
>
> Key: FLINK-18091
> URL: https://issues.apache.org/jira/browse/FLINK-18091
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Stephan Ewen
>Assignee: Congxian Qiu(klion26)
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The test should do the following:
>  * take a savepoint. needs to make sure the job has enough state that there 
> is more than just the "_metadata" file
>  * copy it to another directory
>  * start the job from that savepoint by addressing the metadata file and by 
> addressing the savepoint directory
> We should also test that an incremental checkpoint that gets moved fails with 
> a reasonable exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18165) When savingpoint is restored, select the checkpoint directory and stateBackend

2020-06-09 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129148#comment-17129148
 ] 

Congxian Qiu(klion26) commented on FLINK-18165:
---

[~liyu] I agree that we could switch state backends easily through savepoints 
once this is implemented. I'm willing to help implement it if we want to 
support this in a future release.

> When savingpoint is restored, select the checkpoint directory and stateBackend
> --
>
> Key: FLINK-18165
> URL: https://issues.apache.org/jira/browse/FLINK-18165
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Affects Versions: 1.9.0
> Environment: flink 1.9
>Reporter: Xinyuan Liu
>Priority: Major
>
> If a checkpoint is used as the initial state for a savepoint-style restore, 
> the state backend used before and after must be of the same type. In actual 
> production, however, state keeps growing; when taskmanager memory becomes 
> insufficient and the cluster cannot be expanded, the state backend needs to 
> be switched, and data consistency must be preserved while doing so. 
> Unfortunately, Flink currently does not provide an elegant way to switch 
> state backends; could the community consider this proposal?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints

2020-06-09 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129144#comment-17129144
 ] 

Congxian Qiu(klion26) commented on FLINK-18091:
---

[~aljoscha] We have an IT case for this 
{{SavepointITCase##testTriggerSavepointAndResumeWithFileBasedCheckpointsAndRelocateBasePath}}

> Test Relocatable Savepoints
> ---
>
> Key: FLINK-18091
> URL: https://issues.apache.org/jira/browse/FLINK-18091
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Stephan Ewen
>Assignee: Congxian Qiu(klion26)
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The test should do the following:
>  * take a savepoint. needs to make sure the job has enough state that there 
> is more than just the "_metadata" file
>  * copy it to another directory
>  * start the job from that savepoint by addressing the metadata file and by 
> addressing the savepoint directory
> We should also test that an incremental checkpoint that gets moved fails with 
> a reasonable exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17479) Occasional checkpoint failure due to null pointer exception in Flink version 1.10

2020-06-09 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128948#comment-17128948
 ] 

Congxian Qiu(klion26) commented on FLINK-17479:
---

[~nobleyd] Thanks for tracking it down. If you confirm this is a JDK problem, 
could you please close this issue?

> Occasional checkpoint failure due to null pointer exception in Flink version 
> 1.10
> -
>
> Key: FLINK-17479
> URL: https://issues.apache.org/jira/browse/FLINK-17479
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
> Environment: Flink1.10.0
> jdk1.8.0_60
>Reporter: nobleyd
>Priority: Major
> Attachments: image-2020-04-30-18-44-21-630.png, 
> image-2020-04-30-18-55-53-779.png
>
>
> I upgraded the standalone cluster (3 machines) from Flink 1.9 to the latest 
> Flink 1.10.0. My jobs had been running normally on Flink 1.9 for about half 
> a year, but some jobs now fail due to a null pointer exception when 
> checkpointing in Flink 1.10.0.
> Below is the exception log:
> !image-2020-04-30-18-55-53-779.png!
> I have checked StreamTask (line 882), shown below. I think the only case 
> that can lead to a null pointer exception is checkpointMetaData being null.
> !image-2020-04-30-18-44-21-630.png!
> I do not know why; can anyone help me? The problem only occurs in 
> Flink 1.10.0 for now; everything works well in Flink 1.9. I also list some 
> of the configuration below (some values differ from the defaults), in case 
> this is caused by a configuration mistake.
> some conf of my flink1.10.0:
>  
> {code:java}
> taskmanager.memory.flink.size: 71680m
> taskmanager.memory.framework.heap.size: 512m
> taskmanager.memory.framework.off-heap.size: 512m
> taskmanager.memory.task.off-heap.size: 17920m
> taskmanager.memory.managed.size: 512m
> taskmanager.memory.jvm-metaspace.size: 512m
> taskmanager.memory.network.fraction: 0.1
> taskmanager.memory.network.min: 1024mb
> taskmanager.memory.network.max: 1536mb
> taskmanager.memory.segment-size: 128kb
> rest.port: 8682
> historyserver.web.port: 8782high-availability.jobmanager.port: 
> 13141,13142,13143,13144
> blob.server.port: 13146,13147,13148,13149taskmanager.rpc.port: 
> 13151,13152,13153,13154
> taskmanager.data.port: 13156metrics.internal.query-service.port: 
> 13161,13162,13163,13164,13166,13167,13168,13169env.java.home: 
> /usr/java/jdk1.8.0_60/bin/java
> env.pid.dir: /home/work/flink-1.10.0{code}
>  
> Hope someone can help me solve it.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18196) flink throws `NullPointerException` when executeCheckpointing

2020-06-09 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128947#comment-17128947
 ] 

Congxian Qiu(klion26) commented on FLINK-18196:
---

[~yuchuanchen] Thanks for creating this issue. This may be related to 
FLINK-17479; could you please try a different JDK version and see whether the 
problem persists?

> flink throws `NullPointerException` when executeCheckpointing
> -
>
> Key: FLINK-18196
> URL: https://issues.apache.org/jira/browse/FLINK-18196
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Kai Chen
>Priority: Major
>
> I meet checkpoint NPE when executing wordcount example:
> java.lang.Exception: Could not perform checkpoint 5505 for operator Source: 
> KafkaTableSource(xxx) > SourceConversion(table=[xxx, source: 
> [KafkaTableSource(xxx)]], fields=[xxx]) > Calc(select=[xxx) AS xxx]) > 
> SinkConversionToTuple2  --  > Sink: Elasticsearch6UpsertTableSink(xxx) (1/1).
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:802)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:777)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$228/1024478318.call(Unknown
>  Source)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
> at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
> at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:487)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:470)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1411)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:991)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:887)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$229/1010499540.run(Unknown
>  Source)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:860)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:793)
> ... 12 more
>  
> flink version: flink-1.10.0 flink-1.10.1
> checkpoint configuration
> |Checkpointing Mode|Exactly Once|
> |Interval|5s|
> |Timeout|10m 0s|
> |Minimum Pause Between Checkpoints|0ms|
> |Maximum Concurrent Checkpoints|1|
> With debug enabled, I found `checkpointMetaData`  is null at
>  
> [https://github.com/apache/flink/blob/release-1.10.0/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L1421]
>  
> I fixed this with this patch: 
> [https://github.com/yuchuanchen/flink/commit/e5122d9787be1fee9bce141887e0d70c9b0a4f19]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17468) Provide more detailed metrics why asynchronous part of checkpoint is taking long time

2020-06-09 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128945#comment-17128945
 ] 

Congxian Qiu(klion26) commented on FLINK-17468:
---

Kindly ping [~qqibrow] 

> Provide more detailed metrics why asynchronous part of checkpoint is taking 
> long time
> -
>
> Key: FLINK-17468
> URL: https://issues.apache.org/jira/browse/FLINK-17468
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / Metrics, Runtime / 
> State Backends
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Priority: Major
>
> As [reported by 
> users|https://lists.apache.org/thread.html/r0833452796ca7d1c9d5e35c110089c95cfdadee9d81884a13613a4ce%40%3Cuser.flink.apache.org%3E]
>  it's not obvious why asynchronous part of checkpoint is taking so long time.
> Maybe we could provide some more detailed metrics/UI/logs about uploading 
> files, materializing meta data, or other things that are happening during the 
> asynchronous checkpoint process?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-18091) Test Relocatable Savepoints

2020-06-06 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126778#comment-17126778
 ] 

Congxian Qiu(klion26) edited comment on FLINK-18091 at 6/6/20, 5:13 AM:


[~sewen] [~liyu] Do you have any other concerns about this issue? If not, I'll 
close it. Thanks.


was (Author: klion26):
I'll close this issue in 6.10(Wednesday) if it's ok to you. [~sewen] [~liyu] 

> Test Relocatable Savepoints
> ---
>
> Key: FLINK-18091
> URL: https://issues.apache.org/jira/browse/FLINK-18091
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Stephan Ewen
>Assignee: Congxian Qiu(klion26)
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The test should do the following:
>  * take a savepoint. needs to make sure the job has enough state that there 
> is more than just the "_metadata" file
>  * copy it to another directory
>  * start the job from that savepoint by addressing the metadata file and by 
> addressing the savepoint directory
> We should also test that an incremental checkpoint that gets moved fails with 
> a reasonable exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18112) Single Task Failure Recovery Prototype

2020-06-06 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17127206#comment-17127206
 ] 

Congxian Qiu(klion26) commented on FLINK-18112:
---

[~ym] Thanks for creating the ticket. I think this feature can be very helpful 
in some scenarios, and I'm interested in it. Is there any public documentation 
for this feature? Thanks

> Single Task Failure Recovery Prototype
> --
>
> Key: FLINK-18112
> URL: https://issues.apache.org/jira/browse/FLINK-18112
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / Coordination, Runtime 
> / Network
>Affects Versions: 1.12.0
>Reporter: Yuan Mei
>Assignee: Yuan Mei
>Priority: Major
> Fix For: 1.12.0
>
>
> Build a prototype of single task failure recovery to address and answer the 
> following questions:
> *Step 1*: Scheduling part, restart a single node without restarting the 
> upstream or downstream nodes.
> *Step 2*: Checkpointing part, as my understanding of how regional failover 
> works, this part might not need modification.
> *Step 3*: Network part
>   - how the recovered node able to link to the upstream ResultPartitions, and 
> continue getting data
>   - how the downstream node able to link to the recovered node, and continue 
> getting node
>   - how different netty transit mode affects the results
>   - what if the failed node buffered data pool is full
> *Step 4*: Failover process verification



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-18091) Test Relocatable Savepoints

2020-06-05 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126778#comment-17126778
 ] 

Congxian Qiu(klion26) edited comment on FLINK-18091 at 6/5/20, 1:52 PM:


I'll close this issue in 6.10(Wednesday) if it's ok to you. [~sewen] [~liyu] 


was (Author: klion26):
I'll close this issue in 6.10(Wednesday) if the result ok to you. [~sewen] 
[~liyu] 

> Test Relocatable Savepoints
> ---
>
> Key: FLINK-18091
> URL: https://issues.apache.org/jira/browse/FLINK-18091
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Stephan Ewen
>Assignee: Congxian Qiu(klion26)
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The test should do the following:
>  * take a savepoint. needs to make sure the job has enough state that there 
> is more than just the "_metadata" file
>  * copy it to another directory
>  * start the job from that savepoint by addressing the metadata file and by 
> addressing the savepoint directory
> We should also test that an incremental checkpoint that gets moved fails with 
> a reasonable exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints

2020-06-05 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126778#comment-17126778
 ] 

Congxian Qiu(klion26) commented on FLINK-18091:
---

I'll close this issue in 6.10(Wednesday) if the result ok to you. [~sewen] 
[~liyu] 

> Test Relocatable Savepoints
> ---
>
> Key: FLINK-18091
> URL: https://issues.apache.org/jira/browse/FLINK-18091
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Stephan Ewen
>Assignee: Congxian Qiu(klion26)
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The test should do the following:
>  * take a savepoint. needs to make sure the job has enough state that there 
> is more than just the "_metadata" file
>  * copy it to another directory
>  * start the job from that savepoint by addressing the metadata file and by 
> addressing the savepoint directory
> We should also test that an incremental checkpoint that gets moved fails with 
> a reasonable exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-18091) Test Relocatable Savepoints

2020-06-05 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126769#comment-17126769
 ] 

Congxian Qiu(klion26) edited comment on FLINK-18091 at 6/5/20, 1:43 PM:


Tested on a real cluster; results are as expected.
 * a relocated savepoint can be restored successfully
 * restoring a relocated checkpoint fails with a FileNotFoundException

commit : 8ca67c962a5575aba3a43b8cfd4a79ffc8069bd4

The full log is attached below:

{{_username/ip/port and other sensitive information has been masked._}}
 # For Savepoint

 
{code:java}
[~/flink-1.11-SNAPSHOT]$ ./bin/flink savepoint 9bcc2546a841b36a39c46fbe13a2b631 
hdfs:///user/xx/congxianqiu/savepoint -yid application_1591259429117_0007
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/data/work/congxianqiu/flink-1.11-SNAPSHOT/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/data/xx/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2020-06-05 20:27:43,039 WARN  
org.apache.flink.yarn.configuration.YarnLogConfigUtil        [] - The 
configuration directory ('/data/work/congxianqiu/flink-1.11-SNAPSHOT/conf') 
already contains a LOG4J config file.If you want to use logback, then please 
delete or rename the log configuration file.
2020-06-05 20:27:43,422 INFO  org.apache.flink.yarn.YarnClusterDescriptor       
           [] - No path for the flink jar passed. Using the location of class 
org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2020-06-05 20:27:43,513 INFO  org.apache.flink.yarn.YarnClusterDescriptor       
           [] - Found Web Interface 10-215-128-84:35572 of application 
'application_1591259429117_0007'.
Triggering savepoint for job 9bcc2546a841b36a39c46fbe13a2b631.
Waiting for response...
Savepoint completed. Path: 
hdfs://ip:port/user/xx/congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
You can resume your program from this savepoint with the run command.

[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x   - xx supergroup          0 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
[~/congxianqiu/flink-1.11-SNAPSHOT]$ hadoop fs -ls 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
Found 2 items
-rw-r--r--   3 xx supergroup      74170 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r--   3 xx supergroup       1205 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/_metadata



[~/flink-1.11-SNAPSHOT]$ hadoop fs -mv 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 
congxianqiu/savepoint/newsavepointpath
[ ~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x   - xx supergroup          0 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint/newsavepointpath
Found 2 items
-rw-r--r--   3 xx supergroup      74170 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r--   3 xx supergroup       1205 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath/_metadata


[~/flink-1.11-SNAPSHOT]$ ./bin/flink run -s 
hdfs:///user/xx/congxianqiu/newsavepointpath/_metadata -m yarn-cluster -c 
com.klion26.data.FlinkDemo 
/data/work/congxianqiu/flink-1.11-SNAPSHOT/ft_local/Flink-Demo-1.0-SNAPSHOT.jar
SLF4J: Class path contains multiple SLF4J bindings.


>> jobmanager.log
restored successfully; checkpoints also complete successfully.
2020-06-05 21:11:10,053 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Starting job 
b2fbfa6527391f035e8eebd791c2f64e from savepoint 
hdfs:///user/xx/congxianqiu/savepoint/newsavepointpath/_metadata ()
2020-06-05 21:11:10,198 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Reset the 
checkpoint ID of job b2fbfa6527391f035e8eebd791c2f64e to 3.
2020-06-05 21:11:10,198 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 
b2fbfa6527391f035e8eebd791c2f64e from latest valid checkpoint: Checkpoint 2 @ 0 
for b2fbfa6527391f035e8eebd791c2f64e.
2020-06-05 21:11:10,206 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master 
state to restore
...
2020-06-05 21:11:16,117 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 
triggering task Source: Custom Source (1/1) of job 
b2fbfa6527391f035e8eebd791c2f64e is not in state RUNNING but SCHEDULED instead. 
Aborting checkpoint.
2020-06-05 21:11:19,456 INFO  

[jira] [Comment Edited] (FLINK-18091) Test Relocatable Savepoints

2020-06-05 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126769#comment-17126769
 ] 

Congxian Qiu(klion26) edited comment on FLINK-18091 at 6/5/20, 1:43 PM:


Tested on a real cluster; results are as expected.
 * a relocated savepoint can be restored successfully
 * restoring a relocated checkpoint fails with a FileNotFoundException

commit : 8ca67c962a5575aba3a43b8cfd4a79ffc8069bd4

The full log is attached below:

{{_username/ip/port and other sensitive information has been masked._}}
 # For Savepoint

 
{code:java}
[~/flink-1.11-SNAPSHOT]$ ./bin/flink savepoint 9bcc2546a841b36a39c46fbe13a2b631 
hdfs:///user/xx/congxianqiu/savepoint -yid application_1591259429117_0007
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/data/work/congxianqiu/flink-1.11-SNAPSHOT/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/data/xx/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2020-06-05 20:27:43,039 WARN  
org.apache.flink.yarn.configuration.YarnLogConfigUtil        [] - The 
configuration directory ('/data/work/congxianqiu/flink-1.11-SNAPSHOT/conf') 
already contains a LOG4J config file.If you want to use logback, then please 
delete or rename the log configuration file.
2020-06-05 20:27:43,422 INFO  org.apache.flink.yarn.YarnClusterDescriptor       
           [] - No path for the flink jar passed. Using the location of class 
org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2020-06-05 20:27:43,513 INFO  org.apache.flink.yarn.YarnClusterDescriptor       
           [] - Found Web Interface 10-215-128-84:35572 of application 
'application_1591259429117_0007'.
Triggering savepoint for job 9bcc2546a841b36a39c46fbe13a2b631.
Waiting for response...
Savepoint completed. Path: 
hdfs://ip:port/user/xx/congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
You can resume your program from this savepoint with the run command.

[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x   - xx supergroup          0 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
[~/congxianqiu/flink-1.11-SNAPSHOT]$ hadoop fs -ls 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
Found 2 items
-rw-r--r--   3 xx supergroup      74170 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r--   3 xx supergroup       1205 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/_metadata



[~/flink-1.11-SNAPSHOT]$ hadoop fs -mv 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 
congxianqiu/savepoint/newsavepointpath
[ ~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x   - xx supergroup          0 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint/newsavepointpath
Found 2 items
-rw-r--r--   3 xx supergroup      74170 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r--   3 xx supergroup       1205 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath/_metadata


[~/flink-1.11-SNAPSHOT]$ ./bin/flink run -s 
hdfs:///user/xx/congxianqiu/newsavepointpath/_metadata -m yarn-cluster -c 
com.klion26.data.FlinkDemo 
/data/work/congxianqiu/flink-1.11-SNAPSHOT/ft_local/Flink-Demo-1.0-SNAPSHOT.jar
SLF4J: Class path contains multiple SLF4J bindings.


>> jobmanager.log
restored successfully; checkpoints also complete successfully.
2020-06-05 21:11:10,053 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Starting job 
b2fbfa6527391f035e8eebd791c2f64e from savepoint 
hdfs:///user/xx/congxianqiu/savepoint/newsavepointpath/_metadata ()
2020-06-05 21:11:10,198 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Reset the 
checkpoint ID of job b2fbfa6527391f035e8eebd791c2f64e to 3.
2020-06-05 21:11:10,198 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 
b2fbfa6527391f035e8eebd791c2f64e from latest valid checkpoint: Checkpoint 2 @ 0 
for b2fbfa6527391f035e8eebd791c2f64e.
2020-06-05 21:11:10,206 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master 
state to restore
...
2020-06-05 21:11:16,117 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 
triggering task Source: Custom Source (1/1) of job 
b2fbfa6527391f035e8eebd791c2f64e is not in state RUNNING but SCHEDULED instead. 
Aborting checkpoint.
2020-06-05 21:11:19,456 INFO  

[jira] [Comment Edited] (FLINK-18091) Test Relocatable Savepoints

2020-06-05 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126769#comment-17126769
 ] 

Congxian Qiu(klion26) edited comment on FLINK-18091 at 6/5/20, 1:40 PM:


Tested on a real cluster, results as expected.
 * a relocated savepoint can be restored successfully
 * a relocated checkpoint fails with a FileNotFoundException

commit : 8ca67c962a5575aba3a43b8cfd4a79ffc8069bd4

The full log is attached below:

{{_username/ip/port and other sensitive information has been masked._}}
 # For Savepoint

 
{code:java}
[~/flink-1.11-SNAPSHOT]$ ./bin/flink savepoint 9bcc2546a841b36a39c46fbe13a2b631 
hdfs:///user/xx/congxianqiu/savepoint -yid application_1591259429117_0007
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/data/work/congxianqiu/flink-1.11-SNAPSHOT/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/data/xx/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2020-06-05 20:27:43,039 WARN  
org.apache.flink.yarn.configuration.YarnLogConfigUtil        [] - The 
configuration directory ('/data/work/congxianqiu/flink-1.11-SNAPSHOT/conf') 
already contains a LOG4J config file.If you want to use logback, then please 
delete or rename the log configuration file.
2020-06-05 20:27:43,422 INFO  org.apache.flink.yarn.YarnClusterDescriptor       
           [] - No path for the flink jar passed. Using the location of class 
org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2020-06-05 20:27:43,513 INFO  org.apache.flink.yarn.YarnClusterDescriptor       
           [] - Found Web Interface 10-215-128-84:35572 of application 
'application_1591259429117_0007'.
Triggering savepoint for job 9bcc2546a841b36a39c46fbe13a2b631.
Waiting for response...
Savepoint completed. Path: 
hdfs://ip:port/user/xx/congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
You can resume your program from this savepoint with the run command.

[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x   - xx supergroup          0 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
[~/congxianqiu/flink-1.11-SNAPSHOT]$ hadoop fs -ls 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
Found 2 items
-rw-r--r--   3 xx supergroup      74170 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r--   3 xx supergroup       1205 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/_metadata



[~/flink-1.11-SNAPSHOT]$ hadoop fs -mv 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 
congxianqiu/savepoint/newsavepointpath
[ ~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x   - xx supergroup          0 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint/newsavepointpath
Found 2 items
-rw-r--r--   3 xx supergroup      74170 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r--   3 xx supergroup       1205 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath/_metadata


[~/flink-1.11-SNAPSHOT]$ ./bin/flink run -s 
hdfs:///user/xx/congxianqiu/newsavepointpath/_metadata -m yarn-cluster -c 
com.klion26.data.FlinkDemo 
/data/work/congxianqiu/flink-1.11-SNAPSHOT/ft_local/Flink-Demo-1.0-SNAPSHOT.jar
SLF4J: Class path contains multiple SLF4J bindings.


>> jobmanager.log
2020-06-05 21:11:10,053 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Starting job 
b2fbfa6527391f035e8eebd791c2f64e from savepoint 
hdfs:///user/xx/congxianqiu/savepoint/newsavepointpath/_metadata ()
2020-06-05 21:11:10,198 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Reset the 
checkpoint ID of job b2fbfa6527391f035e8eebd791c2f64e to 3.
2020-06-05 21:11:10,198 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 
b2fbfa6527391f035e8eebd791c2f64e from latest valid checkpoint: Checkpoint 2 @ 0 
for b2fbfa6527391f035e8eebd791c2f64e.
2020-06-05 21:11:10,206 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master 
state to restore
...
2020-06-05 21:11:16,117 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 
triggering task Source: Custom Source (1/1) of job 
b2fbfa6527391f035e8eebd791c2f64e is not in state RUNNING but SCHEDULED instead. 
Aborting checkpoint.
2020-06-05 21:11:19,456 INFO  org.apache.flink.yarn.YarnResourceManager         
           [] - Registering TaskManager 


[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints

2020-06-05 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126769#comment-17126769
 ] 

Congxian Qiu(klion26) commented on FLINK-18091:
---

Tested on a real cluster, results as expected.
 * a relocated savepoint can be restored successfully
 * a relocated checkpoint fails with a FileNotFoundException

The full log is attached below:

{{_username/ip/port and other sensitive information has been masked._}}
 # For Savepoint

 
{code:java}
[~/flink-1.11-SNAPSHOT]$ ./bin/flink savepoint 9bcc2546a841b36a39c46fbe13a2b631 
hdfs:///user/xx/congxianqiu/savepoint -yid application_1591259429117_0007
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/data/work/congxianqiu/flink-1.11-SNAPSHOT/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/data/xx/hadoop/hadoop-2.7.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2020-06-05 20:27:43,039 WARN  
org.apache.flink.yarn.configuration.YarnLogConfigUtil        [] - The 
configuration directory ('/data/work/congxianqiu/flink-1.11-SNAPSHOT/conf') 
already contains a LOG4J config file.If you want to use logback, then please 
delete or rename the log configuration file.
2020-06-05 20:27:43,422 INFO  org.apache.flink.yarn.YarnClusterDescriptor       
           [] - No path for the flink jar passed. Using the location of class 
org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2020-06-05 20:27:43,513 INFO  org.apache.flink.yarn.YarnClusterDescriptor       
           [] - Found Web Interface 10-215-128-84:35572 of application 
'application_1591259429117_0007'.
Triggering savepoint for job 9bcc2546a841b36a39c46fbe13a2b631.
Waiting for response...
Savepoint completed. Path: 
hdfs://ip:port/user/xx/congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
You can resume your program from this savepoint with the run command.

[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x   - xx supergroup          0 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
[~/congxianqiu/flink-1.11-SNAPSHOT]$ hadoop fs -ls 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33
Found 2 items
-rw-r--r--   3 xx supergroup      74170 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r--   3 xx supergroup       1205 2020-06-05 20:27 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33/_metadata



[~/flink-1.11-SNAPSHOT]$ hadoop fs -mv 
congxianqiu/savepoint/savepoint-9bcc25-4ed827357f33 
congxianqiu/savepoint/newsavepointpath
[ ~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint
Found 1 items
drwxr-xr-x   - xx supergroup          0 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath
[~/flink-1.11-SNAPSHOT]$ hadoop fs -ls congxianqiu/savepoint/newsavepointpath
Found 2 items
-rw-r--r--   3 xx supergroup      74170 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath/6508ac9e-0d2a-4583-96ad-1d67fb5b1c8a
-rw-r--r--   3 xx supergroup       1205 2020-06-05 20:27 
congxianqiu/savepoint/newsavepointpath/_metadata


[~/flink-1.11-SNAPSHOT]$ ./bin/flink run -s 
hdfs:///user/xx/congxianqiu/newsavepointpath/_metadata -m yarn-cluster -c 
com.klion26.data.FlinkDemo 
/data/work/congxianqiu/flink-1.11-SNAPSHOT/ft_local/Flink-Demo-1.0-SNAPSHOT.jar
SLF4J: Class path contains multiple SLF4J bindings.


>> jobmanager.log
2020-06-05 21:11:10,053 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Starting job 
b2fbfa6527391f035e8eebd791c2f64e from savepoint 
hdfs:///user/xx/congxianqiu/savepoint/newsavepointpath/_metadata ()
2020-06-05 21:11:10,198 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Reset the 
checkpoint ID of job b2fbfa6527391f035e8eebd791c2f64e to 3.
2020-06-05 21:11:10,198 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 
b2fbfa6527391f035e8eebd791c2f64e from latest valid checkpoint: Checkpoint 2 @ 0 
for b2fbfa6527391f035e8eebd791c2f64e.
2020-06-05 21:11:10,206 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No master 
state to restore
...
2020-06-05 21:11:16,117 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Checkpoint 
triggering task Source: Custom Source (1/1) of job 
b2fbfa6527391f035e8eebd791c2f64e is not in state RUNNING but SCHEDULED instead. 
Aborting checkpoint.
2020-06-05 21:11:19,456 INFO  org.apache.flink.yarn.YarnResourceManager         
           [] - Registering TaskManager with ResourceID 
container_e18_1591259429117_0019_01_02 

[jira] [Commented] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-06-05 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126428#comment-17126428
 ] 

Congxian Qiu(klion26) commented on FLINK-17571:
---

[~wind_ljy] The reason we don't want to add a tool that cleans the orphan files 
is that we can't clean the files safely: more than one job may reference the 
same file.

But if you can guarantee that no more than one job will ever recover from a 
given checkpoint, you can use such a tool to clean the orphan files. As a 
further step, you can add the tool to crontab so that it deletes the orphan 
files regularly.

PS: maybe you can first move the orphan files somewhere else and delete them a 
day (or some other duration) later, in case the wrong files were selected.
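The "move first, delete later" idea above can be sketched in Java. Everything here (the class, method names, and retention policy) is illustrative only and is not part of Flink:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;
import java.util.stream.Stream;

// Sketch of a two-step orphan-file cleanup: quarantine suspected orphans
// first, then purge them only after a retention window has passed, so a
// wrongly selected file can still be restored before it is gone for good.
public class OrphanFileQuarantine {

    // True if the file's last-modified time is older than the retention window.
    static boolean isExpired(FileTime lastModified, Instant now, Duration retention) {
        return lastModified.toInstant().plus(retention).isBefore(now);
    }

    // Step 1: move a suspected orphan into the quarantine directory.
    static void quarantine(Path orphan, Path quarantineDir) throws IOException {
        Files.createDirectories(quarantineDir);
        Files.move(orphan, quarantineDir.resolve(orphan.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }

    // Step 2: delete quarantined files older than the retention period.
    static void purgeExpired(Path quarantineDir, Duration retention) throws IOException {
        Instant now = Instant.now();
        try (Stream<Path> files = Files.list(quarantineDir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                if (isExpired(Files.getLastModifiedTime(p), now, retention)) {
                    Files.delete(p);
                }
            }
        }
    }
}
```

A crontab entry could then invoke the purge step regularly, as suggested above.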

> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Command Line Client, Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be 
> deleted safely, but users can't delete the files in the SHARED directory 
> safely(the files may be created a long time ago).
> I think it's better to give users a better way to know which files are 
> currently used(so the others are not used)
> maybe a command-line command such as below is ok enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18091) Test Relocatable Savepoints

2020-06-03 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17125483#comment-17125483
 ] 

Congxian Qiu(klion26) commented on FLINK-18091:
---

[~liyu] thanks for the notice. Please assign this to me; I will test it on a 
real cluster and reply with the result here.

> Test Relocatable Savepoints
> ---
>
> Key: FLINK-18091
> URL: https://issues.apache.org/jira/browse/FLINK-18091
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Stephan Ewen
>Priority: Major
>  Labels: release-testing
> Fix For: 1.11.0
>
>
> The test should do the following:
>  * take a savepoint. needs to make sure the job has enough state that there 
> is more than just the "_metadata" file
>  * copy it to another directory
>  * start the job from that savepoint by addressing the metadata file and by 
> addressing the savepoint directory
> We should also test that an incremental checkpoint that gets moved fails with 
> a reasonable exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18038) StateBackendLoader logs application-defined state before it is fully configured

2020-05-31 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17120765#comment-17120765
 ] 

Congxian Qiu(klion26) commented on FLINK-18038:
---

I think logging the configuration that is finally used is useful for the user.

> StateBackendLoader logs application-defined state before it is fully 
> configured
> ---
>
> Key: FLINK-18038
> URL: https://issues.apache.org/jira/browse/FLINK-18038
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Affects Versions: 1.9.1
>Reporter: Steve Bairos
>Priority: Trivial
>
> In the 
> [StateBackendLoader|[https://github.com/apache/flink/blob/bb46756b84940a6134910e74406bfaff4f2f37e9/flink-runtime/src/main/java/org/apache/flink/runtime/state/StateBackendLoader.java#L201]],
>  there's this log line:
> {code:java}
> logger.info("Using application-defined state backend: {}", fromApplication); 
> {code}
> It seems like this is inaccurate though because immediately after logging 
> this, if fromApplication is a ConfigurableStateBackend, we call the 
> .configure() function and it is replaced by a newly configured StateBackend. 
> To me, it seems like it would be better if we logged the state backend after 
> it was fully configured. In the current setup, we get confusing logs like 
> this: 
> {code:java}
> 2020-05-29 21:39:44,387 INFO  
> org.apache.flink.streaming.runtime.tasks.StreamTask   - Using 
> application-defined state backend: 
> RocksDBStateBackend{checkpointStreamBackend=File State Backend (checkpoints: 
> 's3://pinterest-montreal/checkpoints/xenon-dev-001-20191210/Xenon/BasicJavaStream',
>  savepoints: 'null', asynchronous: UNDEFINED, fileStateThreshold: -1), 
> localRocksDbDirectories=null, enableIncrementalCheckpointing=UNDEFINED, 
> numberOfTransferingThreads=-1}2020-05-29 21:39:44,387 INFO  
> org.apache.flink.streaming.runtime.tasks.StreamTask   - Configuring 
> application-defined state backend with job/cluster config{code}
> Which makes it ambiguous whether settings in our flink-conf.yaml like 
> "state.backend.incremental: true" are being applied properly. 
>  
> I can make a diff for the change if there aren't any objections
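The reordering suggested above (configure first, log afterwards) can be sketched with simplified stand-in interfaces. None of this is Flink's actual StateBackendLoader code; the types below are minimal mock-ups of the pattern:

```java
import java.util.Map;

// Sketch: log the state backend only AFTER it has been configured, so the
// log message reflects the effective settings rather than the raw
// application-defined instance.
public class BackendLoadingSketch {

    public interface StateBackend {}

    // Mirrors the idea of a configurable backend: configure() returns a new,
    // fully configured instance instead of mutating the original.
    public interface ConfigurableStateBackend extends StateBackend {
        StateBackend configure(Map<String, String> clusterConfig);
    }

    public static StateBackend load(StateBackend fromApplication,
                                    Map<String, String> clusterConfig) {
        StateBackend backend = fromApplication;
        if (backend instanceof ConfigurableStateBackend) {
            // Apply the cluster config (e.g. state.backend.incremental) first...
            backend = ((ConfigurableStateBackend) backend).configure(clusterConfig);
        }
        // ...and only then log, so the line shows the backend actually used.
        System.out.println("Using application-defined state backend: " + backend);
        return backend;
    }
}
```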



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15507) Activate local recovery for RocksDB backends by default

2020-05-29 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119388#comment-17119388
 ] 

Congxian Qiu(klion26) commented on FLINK-15507:
---

First, enabling local recovery for incremental checkpoints has no overhead, and 
I agree with [~Zakelly] that we should enable incremental checkpoints by 
default (they have better performance).

For non-incremental checkpoints with local recovery enabled, we'll consume more 
disk (only the files of the latest successful checkpoint are kept after 
FLINK-8871 was merged) but gain faster recovery on failover.

So from my side, enabling local recovery for the RocksDB state backend 
(incremental & full snapshots) by default is fine.
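For reference, a minimal flink-conf.yaml sketch of the settings discussed here, assuming the configuration keys as of Flink 1.11 (verify the key names against the docs for your version):

```yaml
# Enable incremental checkpoints for the RocksDB state backend.
state.backend.incremental: true
# Keep task-local state snapshots for faster recovery on failover.
state.backend.local-recovery: true
```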

> Activate local recovery for RocksDB backends by default
> ---
>
> Key: FLINK-15507
> URL: https://issues.apache.org/jira/browse/FLINK-15507
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Stephan Ewen
>Assignee: Zakelly Lan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> For the RocksDB state backend, local recovery has no overhead when 
> incremental checkpoints are used. 
> It should be activated by default, because it greatly helps with recovery.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16572) CheckPubSubEmulatorTest is flaky on Azure

2020-05-27 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118239#comment-17118239
 ] 

Congxian Qiu(klion26) commented on FLINK-16572:
---

Seems to be another instance: 
[https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_apis/build/builds/2296/logs/133]
 

> CheckPubSubEmulatorTest is flaky on Azure
> -
>
> Key: FLINK-16572
> URL: https://issues.apache.org/jira/browse/FLINK-16572
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Google Cloud PubSub, Tests
>Affects Versions: 1.11.0
>Reporter: Aljoscha Krettek
>Assignee: Richard Deurwaarder
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Log: 
> https://dev.azure.com/aljoschakrettek/Flink/_build/results?buildId=56=logs=1f3ed471-1849-5d3c-a34c-19792af4ad16=ce095137-3e3b-5f73-4b79-c42d3d5f8283=7842



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17926) Can't build flink-web docker image because of EOL of Ubuntu:18.10

2020-05-25 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115992#comment-17115992
 ] 

Congxian Qiu(klion26) commented on FLINK-17926:
---

[~rmetzger] what do you think about this? Please assign it to me if you think 
it is valid and the proposal is OK.

> Can't build flink-web docker image because of EOL of Ubuntu:18.10
> -
>
> Key: FLINK-17926
> URL: https://issues.apache.org/jira/browse/FLINK-17926
> Project: Flink
>  Issue Type: Bug
>  Components: Project Website
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Currently, the Dockerfile[1] in the flink-web project is broken because of 
> the EOL of Ubuntu 18.10[2]; you will encounter an error such as the one below 
> when executing {{./run.sh}}
> {code:java}
> Err:3 http://security.ubuntu.com/ubuntu cosmic-security Release
>   404  Not Found [IP: 91.189.88.152 80]
> Ign:4 http://archive.ubuntu.com/ubuntu cosmic-updates InRelease
> Ign:5 http://archive.ubuntu.com/ubuntu cosmic-backports InRelease
> Err:6 http://archive.ubuntu.com/ubuntu cosmic Release
>   404  Not Found [IP: 91.189.88.142 80]
> Err:7 http://archive.ubuntu.com/ubuntu cosmic-updates Release
>   404  Not Found [IP: 91.189.88.142 80]
> Err:8 http://archive.ubuntu.com/ubuntu cosmic-backports Release
>   404  Not Found [IP: 91.189.88.142 80]
> Reading package lists...
> {code}
> The current LTS versions can be found in release website[2].
> The Apache Flink docker image uses fedora:28[3], so it is unaffected.
> As Fedora does not have LTS releases[4], I propose to keep Ubuntu for the 
> website and change the version from {{18.10}} to the closest LTS version 
> {{18.04}}. Tried locally, it works successfully.
>  [1] 
> [https://github.com/apache/flink-web/blob/bc66f0f0f463ab62a22e81df7d7efd301b76a6b4/docker/Dockerfile#L17]
> [2] [https://wiki.ubuntu.com/Releases]
>  
> [3]https://github.com/apache/flink/blob/e92b2bf19bdf03ad3bae906dc5fa3781aeddb3ee/docs/docker/Dockerfile#L17
>  [4] 
> https://fedoraproject.org/wiki/Fedora_Release_Life_Cycle#Maintenance_Schedule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17926) Can't build flink-web docker image because of EOL of Ubuntu:18.10

2020-05-25 Thread Congxian Qiu(klion26) (Jira)
Congxian Qiu(klion26) created FLINK-17926:
-

 Summary: Can't build flink-web docker image because of EOL of 
Ubuntu:18.10
 Key: FLINK-17926
 URL: https://issues.apache.org/jira/browse/FLINK-17926
 Project: Flink
  Issue Type: Bug
  Components: Project Website
Reporter: Congxian Qiu(klion26)


Currently, the Dockerfile[1] in the flink-web project is broken because of the 
EOL of Ubuntu 18.10[2]; you will encounter an error such as the one below when 
executing {{./run.sh}}
{code:java}
Err:3 http://security.ubuntu.com/ubuntu cosmic-security Release
  404  Not Found [IP: 91.189.88.152 80]
Ign:4 http://archive.ubuntu.com/ubuntu cosmic-updates InRelease
Ign:5 http://archive.ubuntu.com/ubuntu cosmic-backports InRelease
Err:6 http://archive.ubuntu.com/ubuntu cosmic Release
  404  Not Found [IP: 91.189.88.142 80]
Err:7 http://archive.ubuntu.com/ubuntu cosmic-updates Release
  404  Not Found [IP: 91.189.88.142 80]
Err:8 http://archive.ubuntu.com/ubuntu cosmic-backports Release
  404  Not Found [IP: 91.189.88.142 80]
Reading package lists...
{code}
The current LTS versions can be found in release website[2].

The Apache Flink docker image uses fedora:28[3], so it is unaffected.

As Fedora does not have LTS releases[4], I propose to keep Ubuntu for the 
website and change the version from {{18.10}} to the closest LTS version 
{{18.04}}. Tried locally, it works successfully.
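The proposed fix would be a one-line change to the flink-web Dockerfile ([1]). Sketched as a diff, assuming the base-image line currently reads {{FROM ubuntu:18.10}}:

```diff
-FROM ubuntu:18.10
+FROM ubuntu:18.04
```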

 [1] 
[https://github.com/apache/flink-web/blob/bc66f0f0f463ab62a22e81df7d7efd301b76a6b4/docker/Dockerfile#L17]

[2] [https://wiki.ubuntu.com/Releases]

 
[3]https://github.com/apache/flink/blob/e92b2bf19bdf03ad3bae906dc5fa3781aeddb3ee/docs/docker/Dockerfile#L17

 [4] 
https://fedoraproject.org/wiki/Fedora_Release_Life_Cycle#Maintenance_Schedule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13009) YARNSessionCapacitySchedulerITCase#testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots th

2020-05-25 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17115896#comment-17115896
 ] 

Congxian Qiu(klion26) commented on FLINK-13009:
---

[~fly_in_gis] Thanks for the analysis and update; it's fine with me to close 
this ticket.

Just want to confirm: how will we know/verify this is fixed after a new bug-fix 
version of Hadoop has been released? Is there any issue tracking this?

> YARNSessionCapacitySchedulerITCase#testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots
>  throws NPE on Travis
> -
>
> Key: FLINK-13009
> URL: https://issues.apache.org/jira/browse/FLINK-13009
> Project: Flink
>  Issue Type: Test
>  Components: Deployment / YARN, Tests
>Affects Versions: 1.8.0
>Reporter: Congxian Qiu(klion26)
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> The test 
> {{YARNSessionCapacitySchedulerITCase#testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots}}
>  throws NPE on Travis.
> NPE throws from RMAppAttemptMetrics.java#128, and the following is the code 
> from hadoop-2.8.3[1]
> {code:java}
> // Only add in the running containers if this is the active attempt.
> 128   RMAppAttempt currentAttempt = rmContext.getRMApps()
> 129   .get(attemptId.getApplicationId()).getCurrentAppAttempt();
> {code}
>  
> log [https://api.travis-ci.org/v3/job/550689578/log.txt]
> [1] 
> [https://github.com/apache/hadoop/blob/b3fe56402d908019d99af1f1f4fc65cb1d1436a2/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptMetrics.java#L128]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17468) Provide more detailed metrics why asynchronous part of checkpoint is taking long time

2020-05-21 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113726#comment-17113726
 ] 

Congxian Qiu(klion26) commented on FLINK-17468:
---

[~qqibrow] How's it going? Do you need any help?

> Provide more detailed metrics why asynchronous part of checkpoint is taking 
> long time
> -
>
> Key: FLINK-17468
> URL: https://issues.apache.org/jira/browse/FLINK-17468
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / Metrics, Runtime / 
> State Backends
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Priority: Major
>
> As [reported by 
> users|https://lists.apache.org/thread.html/r0833452796ca7d1c9d5e35c110089c95cfdadee9d81884a13613a4ce%40%3Cuser.flink.apache.org%3E]
>  it's not obvious why asynchronous part of checkpoint is taking so long time.
> Maybe we could provide some more detailed metrics/UI/logs about uploading 
> files, materializing meta data, or other things that are happening during the 
> asynchronous checkpoint process?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17351) CheckpointCoordinator and CheckpointFailureManager ignores checkpoint timeouts

2020-05-21 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113095#comment-17113095
 ] 

Congxian Qiu(klion26) commented on FLINK-17351:
---

Attaching another issue that proposes increasing the failure count for more 
checkpoint failure reasons.

> CheckpointCoordinator and CheckpointFailureManager ignores checkpoint timeouts
> --
>
> Key: FLINK-17351
> URL: https://issues.apache.org/jira/browse/FLINK-17351
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.2, 1.10.0
>Reporter: Piotr Nowojski
>Assignee: Yuan Mei
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>
> As described in point 2: 
> https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576
> (copy of description from above linked comment):
> The logic in how {{CheckpointCoordinator}} handles checkpoint timeouts is 
> broken. In your [~qinjunjerry] examples, your job should have failed after 
> the first checkpoint failure, but checkpoints were timing out on the 
> CheckpointCoordinator after 5 seconds, before {{FlinkKafkaProducer}} 
> detected the Kafka failure after 2 minutes. Those timeouts were not checked 
> against the {{setTolerableCheckpointFailureNumber(...)}} limit, so the job 
> kept going with many timed-out checkpoints. Now a funny thing happens: 
> FlinkKafkaProducer detects the Kafka failure. The funny thing is that it depends 
> on where the failure was detected:
> a) on processing a record? No problem, the job will fail over immediately once 
> the failure is detected (in this example after 2 minutes)
> b) on a checkpoint? Heh, the failure is reported to the {{CheckpointCoordinator}} 
> *and gets ignored, as the PendingCheckpoint was already discarded 2 minutes 
> ago* :) So theoretically the checkpoints can keep failing forever and the job 
> will not restart automatically, unless something else fails.
> Even funnier things can happen if we mix FLINK-17350 or b) with an 
> intermittent external system failure. The sink reports an exception, the transaction 
> was lost/aborted, and the sink is in a failed state, but if by happy 
> coincidence it manages to accept further records, this exception can be 
> lost and all of the records in those failed checkpoints will be lost forever 
> as well. In all of the examples that [~qinjunjerry] posted it hasn't 
> happened: {{FlinkKafkaProducer}} was not able to recover after the initial 
> failure and kept throwing exceptions until the job finally failed (but 
> much later than it should have). And that's not guaranteed anywhere.





[jira] [Updated] (FLINK-11937) Resolve small file problem in RocksDB incremental checkpoint

2020-05-18 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-11937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-11937:
--
Fix Version/s: (was: 1.11.0)
   1.12.0

> Resolve small file problem in RocksDB incremental checkpoint
> 
>
> Key: FLINK-11937
> URL: https://issues.apache.org/jira/browse/FLINK-11937
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Assignee: Congxian Qiu(klion26)
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, when incremental checkpointing is enabled in RocksDBStateBackend, a 
> separate file is generated on DFS for each sst file. This may cause a 
> “file flood” when running intensive workloads (many jobs with high 
> parallelism) in a big cluster. According to our observation in Alibaba 
> production, such a file flood introduces at least two drawbacks when using HDFS 
> as the checkpoint storage FileSystem: 1) a huge number of RPC requests issued to 
> the NN, which may burst its response queue; 2) a huge number of files putting big 
> pressure on the NN’s on-heap memory.
> In Flink we noticed a similar small-file flood problem before and tried to 
> resolve it by introducing ByteStreamStateHandle (FLINK-2808), but this 
> solution has its limitation: if we configure the threshold too low, there 
> will still be too many small files, while if too high, the JM will eventually 
> OOM; thus it can hardly resolve the issue when using RocksDBStateBackend 
> with the incremental snapshot strategy.
> We propose a new OutputStream called 
> FileSegmentCheckpointStateOutputStream (FSCSOS) to fix the problem. FSCSOS 
> will reuse the same underlying distributed file until its size exceeds a 
> preset threshold. We
> plan to complete the work in 3 steps: firstly introduce FSCSOS, secondly 
> resolve the specific storage amplification issue on FSCSOS, and lastly add an 
> option to reuse FSCSOS across multiple checkpoints to further reduce the DFS 
> file number.
> For more details, please refer to the attached design doc.
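The proposed stream is easiest to see in a toy, in-memory sketch: many logical state "files" share one underlying file, each logical segment is identified by a (fileIndex, offset, length) handle instead of a separate DFS file, and the stream rolls to a new underlying file once the preset threshold is exceeded. The class and method names here are assumptions for illustration only; the real FSCSOS would write to DFS and additionally handle concurrency, failures, and cleanup.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy sketch of the FileSegmentCheckpointStateOutputStream (FSCSOS) idea.
// All names are hypothetical; backing "files" are in-memory for illustration.
public class FileSegmentSketch {
    private final long threshold; // start a new underlying file past this size
    private final List<ByteArrayOutputStream> files = new ArrayList<>();
    private ByteArrayOutputStream current;

    public FileSegmentSketch(long threshold) {
        this.threshold = threshold;
        roll();
    }

    // open a new underlying "file"
    private void roll() {
        current = new ByteArrayOutputStream();
        files.add(current);
    }

    /** Writes one logical segment; returns its handle {fileIndex, offset, length}. */
    public long[] writeSegment(byte[] data) {
        if (current.size() >= threshold) {
            roll();
        }
        long offset = current.size();
        current.write(data, 0, data.length);
        return new long[] {files.size() - 1, offset, data.length};
    }

    public int fileCount() {
        return files.size();
    }

    /** Reads a segment back through its handle. */
    public byte[] readSegment(long[] handle) {
        byte[] file = files.get((int) handle[0]).toByteArray();
        return Arrays.copyOfRange(file, (int) handle[1], (int) (handle[1] + handle[2]));
    }

    public static void main(String[] args) {
        FileSegmentSketch out = new FileSegmentSketch(100);
        List<long[]> handles = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            byte[] seg = new byte[30];
            Arrays.fill(seg, (byte) i);
            handles.add(out.writeSegment(seg));
        }
        // ten 30-byte logical segments end up in only three underlying files
        System.out.println("underlying files: " + out.fileCount());
        byte[] back = out.readSegment(handles.get(3));
        System.out.println("roundtrip ok: " + (back.length == 30 && back[0] == (byte) 3));
    }
}
```

With a 100-byte threshold the NN would see three files instead of ten; raising the threshold trades fewer files for more storage amplification when individual segments are later invalidated, which is what the second step of the plan addresses.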





[jira] [Updated] (FLINK-16077) Translate "Custom State Serialization" page into Chinese

2020-05-18 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-16077:
--
Fix Version/s: (was: 1.11.0)
   1.12.0

> Translate "Custom State Serialization" page into Chinese
> 
>
> Key: FLINK-16077
> URL: https://issues.apache.org/jira/browse/FLINK-16077
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Affects Versions: 1.11.0
>Reporter: Yu Li
>Assignee: Congxian Qiu(klion26)
>Priority: Major
> Fix For: 1.12.0
>
>
> Complete the translation in `docs/dev/stream/state/custom_serialization.zh.md`





[jira] [Updated] (FLINK-15747) Enable setting RocksDB log level from configuration

2020-05-18 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-15747:
--
Fix Version/s: (was: 1.11.0)
   1.12.0

> Enable setting RocksDB log level from configuration
> ---
>
> Key: FLINK-15747
> URL: https://issues.apache.org/jira/browse/FLINK-15747
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Yu Li
>Assignee: Congxian Qiu(klion26)
>Priority: Major
> Fix For: 1.12.0
>
>
> Currently, to enable the RocksDB local log, one has to create a customized 
> {{OptionsFactory}}, which is not quite convenient. This JIRA proposes to 
> enable setting it from configuration in flink-conf.yaml.
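A hypothetical flink-conf.yaml entry sketching what such a configuration option could look like — the key name and value format below are assumptions for illustration, not a shipped setting:

```yaml
# Hypothetical option: route RocksDB's native LOG at the given level.
# Key name and value are illustrative assumptions, not a released setting.
state.backend.rocksdb.log.level: INFO_LEVEL
```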





[jira] [Updated] (FLINK-16074) Translate the "Overview" page for "State & Fault Tolerance" into Chinese

2020-05-18 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-16074:
--
Fix Version/s: (was: 1.11.0)
   1.12.0

> Translate the "Overview" page for "State & Fault Tolerance" into Chinese
> 
>
> Key: FLINK-16074
> URL: https://issues.apache.org/jira/browse/FLINK-16074
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Affects Versions: 1.11.0
>Reporter: Yu Li
>Assignee: Congxian Qiu(klion26)
>Priority: Major
> Fix For: 1.12.0
>
>
> Complete the translation in `docs/dev/stream/state/index.zh.md`





[jira] [Commented] (FLINK-5763) Make savepoints self-contained and relocatable

2020-05-18 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109933#comment-17109933
 ] 

Congxian Qiu(klion26) commented on FLINK-5763:
--

Add a release note.

> Make savepoints self-contained and relocatable
> --
>
> Key: FLINK-5763
> URL: https://issues.apache.org/jira/browse/FLINK-5763
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Ufuk Celebi
>Assignee: Congxian Qiu(klion26)
>Priority: Critical
>  Labels: pull-request-available, usability
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After a user has triggered a savepoint, a single savepoint file will be 
> returned as a handle to the savepoint. A savepoint to {{}} creates a 
> savepoint file like {{/savepoint-}}.
> This file contains the metadata of the corresponding checkpoint, but not the 
> actual program state. While this works well for short term management 
> (pause-and-resume a job), it makes it hard to manage savepoints over longer 
> periods of time.
> h4. Problems
> h5. Scattered Checkpoint Files
> For file system based checkpoints (FsStateBackend, RocksDBStateBackend) this 
> results in the savepoint referencing files from the checkpoint directory 
> (usually different than ). For users, it is virtually impossible to 
> tell which checkpoint files belong to a savepoint and which are lingering 
> around. This can easily lead to accidentally invalidating a savepoint by 
> deleting checkpoint files.
> h5. Savepoints Not Relocatable
> Even if a user is able to figure out which checkpoint files belong to a 
> savepoint, moving these files will invalidate the savepoint as well, because 
> the metadata file references absolute file paths.
> h5. Forced to Use CLI for Disposal
> Because of the scattered files, the user is in practice forced to use Flink’s 
> CLI to dispose a savepoint. This should be possible to handle in the scope of 
> the user’s environment via a file system delete operation.
> h4. Proposal
> In order to solve the described problems, savepoints should contain all their 
> state, both metadata and program state, inside a single directory. 
> Furthermore the metadata must only hold relative references to the checkpoint 
> files. This makes it obvious which files make up the state of a savepoint and 
> it is possible to move savepoints around by moving the savepoint directory.
> h5. Desired File Layout
> Triggering a savepoint to {{}} creates a directory as follows:
> {code}
> /savepoint--
>   +-- _metadata
>   +-- data- [1 or more]
> {code}
> We include the JobID in the savepoint directory name in order to give some 
> hints about which job a savepoint belongs to.
> h5. CLI
> - Trigger: When triggering a savepoint to {{}} the savepoint 
> directory will be returned as the handle to the savepoint.
> - Restore: Users can restore by pointing to the directory or the _metadata 
> file. The data files should be required to be in the same directory as the 
> _metadata file.
> - Dispose: The disposal command should be deprecated and eventually removed. 
> While deprecated, disposal can happen by specifying the directory or the 
> _metadata file (same as restore).





[jira] [Updated] (FLINK-5763) Make savepoints self-contained and relocatable

2020-05-18 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-5763:
-
Release Note: After FLINK-5763, savepoints are self-contained and 
relocatable, so users can migrate a savepoint from one place to another 
without any extra manual processing. This feature is currently not supported 
when entropy injection is enabled.

> Make savepoints self-contained and relocatable
> --
>
> Key: FLINK-5763
> URL: https://issues.apache.org/jira/browse/FLINK-5763
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Ufuk Celebi
>Assignee: Congxian Qiu(klion26)
>Priority: Critical
>  Labels: pull-request-available, usability
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After a user has triggered a savepoint, a single savepoint file will be 
> returned as a handle to the savepoint. A savepoint to {{}} creates a 
> savepoint file like {{/savepoint-}}.
> This file contains the metadata of the corresponding checkpoint, but not the 
> actual program state. While this works well for short term management 
> (pause-and-resume a job), it makes it hard to manage savepoints over longer 
> periods of time.
> h4. Problems
> h5. Scattered Checkpoint Files
> For file system based checkpoints (FsStateBackend, RocksDBStateBackend) this 
> results in the savepoint referencing files from the checkpoint directory 
> (usually different than ). For users, it is virtually impossible to 
> tell which checkpoint files belong to a savepoint and which are lingering 
> around. This can easily lead to accidentally invalidating a savepoint by 
> deleting checkpoint files.
> h5. Savepoints Not Relocatable
> Even if a user is able to figure out which checkpoint files belong to a 
> savepoint, moving these files will invalidate the savepoint as well, because 
> the metadata file references absolute file paths.
> h5. Forced to Use CLI for Disposal
> Because of the scattered files, the user is in practice forced to use Flink’s 
> CLI to dispose a savepoint. This should be possible to handle in the scope of 
> the user’s environment via a file system delete operation.
> h4. Proposal
> In order to solve the described problems, savepoints should contain all their 
> state, both metadata and program state, inside a single directory. 
> Furthermore the metadata must only hold relative references to the checkpoint 
> files. This makes it obvious which files make up the state of a savepoint and 
> it is possible to move savepoints around by moving the savepoint directory.
> h5. Desired File Layout
> Triggering a savepoint to {{}} creates a directory as follows:
> {code}
> /savepoint--
>   +-- _metadata
>   +-- data- [1 or more]
> {code}
> We include the JobID in the savepoint directory name in order to give some 
> hints about which job a savepoint belongs to.
> h5. CLI
> - Trigger: When triggering a savepoint to {{}} the savepoint 
> directory will be returned as the handle to the savepoint.
> - Restore: Users can restore by pointing to the directory or the _metadata 
> file. The data files should be required to be in the same directory as the 
> _metadata file.
> - Dispose: The disposal command should be deprecated and eventually removed. 
> While deprecated, disposal can happen by specifying the directory or the 
> _metadata file (same as restore).





[jira] [Commented] (FLINK-17487) Do not delete old checkpoints when stop the job.

2020-05-17 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17109386#comment-17109386
 ] 

Congxian Qiu(klion26) commented on FLINK-17487:
---

[~nobleyd] Do you mean the stop command did not succeed and the previous 
checkpoint was deleted? That seems weird to me. Could you please share more logs 
with us (JM and TM logs; it's better to enable debug logging)?

> Do not delete old checkpoints when stop the job.
> 
>
> Key: FLINK-17487
> URL: https://issues.apache.org/jira/browse/FLINK-17487
> Project: Flink
>  Issue Type: Improvement
>  Components: Client / Job Submission, Runtime / Checkpointing
>Reporter: nobleyd
>Priority: Major
>
> When stopping a Flink job using 'flink stop jobId', the checkpoint data is 
> deleted.
> When the stop action does not succeed or fails because of some unknown errors, 
> sometimes the job resumes using the latest checkpoint, while sometimes it 
> just fails, and the checkpoint data is gone.
> You may ask why I need these checkpoints, since I stop the job and a savepoint 
> will be generated. For example, my job uses a Kafka source, Kafka 
> missed some data, and I want to stop the job and resume it from an old 
> checkpoint. Anyway, I mean that sometimes the stop action fails and the 
> checkpoint data is also deleted, which is not good. 
> This behavior is different from the case of 'flink cancel jobId' or 'flink 
> savepoint jobId', which won't delete the checkpoint data.
>  





[jira] [Comment Edited] (FLINK-16383) KafkaProducerExactlyOnceITCase. testExactlyOnceRegularSink fails with "The producer has already been closed"

2020-05-15 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108387#comment-17108387
 ] 

Congxian Qiu(klion26) edited comment on FLINK-16383 at 5/15/20, 3:12 PM:
-

seems another instance 
[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1230=logs=d44f43ce-542c-597d-bf94-b0718c71e5e8=34f486e1-e1e4-5dd2-9c06-bfdd9b9c74a8=18475]
{code:java}
2020-05-14T04:20:36.9953507Z 04:20:36,993 [Source: Custom Source -> Sink: 
Unnamed (3/3)] WARN 
org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher [] - 
Committing offsets to Kafka takes longer than the checkpoint interval. Skipping 
commit of previous offsets because newer complete checkpoint offsets are 
available. This does not compromise Flink's checkpoint integrity. 
2020-05-14T04:20:37.2142509Z 04:20:37,174 [program runner thread] ERROR 
org.apache.flink.streaming.connectors.kafka.KafkaTestBase [] - Job Runner 
failed with exception 2020-05-14T04:20:37.2143266Z 
org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 
cdf00a9c3a8b9c3afb72f57e7ef936c8) 2020-05-14T04:20:37.2144122Z at 
org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:115)
 ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 
2020-05-14T04:20:37.2145249Z at 
org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase.lambda$runCancelingOnEmptyInputTest$2(KafkaConsumerTestBase.java:1075)
 ~[flink-connector-kafka-base_2.11-1.11-SNAPSHOT-tests.jar:?] 
2020-05-14T04:20:37.2145858Z at java.lang.Thread.run(Thread.java:748) 
[?:1.8.0_242] 2020-05-14T04:20:37.2146393Z Caused by: 
org.apache.flink.runtime.client.JobCancellationException: Job was cancelled. 
2020-05-14T04:20:37.2151756Z at 
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:149)
 ~[flink-runtime_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 
2020-05-14T04:20:37.2153457Z at 
org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:113)
 ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 
2020-05-14T04:20:37.2154166Z ... 2 more
{code}


was (Author: klion26):
seems another instance 
[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1230=logs=d44f43ce-542c-597d-bf94-b0718c71e5e8=34f486e1-e1e4-5dd2-9c06-bfdd9b9c74a8=18475]



2020-05-14T04:20:36.9953507Z 04:20:36,993 [Source: Custom Source -> Sink: 
Unnamed (3/3)] WARN 
org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher [] - 
Committing offsets to Kafka takes longer than the checkpoint interval. Skipping 
commit of previous offsets because newer complete checkpoint offsets are 
available. This does not compromise Flink's checkpoint integrity. 
2020-05-14T04:20:37.2142509Z 04:20:37,174 [program runner thread] ERROR 
org.apache.flink.streaming.connectors.kafka.KafkaTestBase [] - Job Runner 
failed with exception 2020-05-14T04:20:37.2143266Z 
org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 
cdf00a9c3a8b9c3afb72f57e7ef936c8) 2020-05-14T04:20:37.2144122Z at 
org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:115)
 ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 
2020-05-14T04:20:37.2145249Z at 
org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase.lambda$runCancelingOnEmptyInputTest$2(KafkaConsumerTestBase.java:1075)
 ~[flink-connector-kafka-base_2.11-1.11-SNAPSHOT-tests.jar:?] 
2020-05-14T04:20:37.2145858Z at java.lang.Thread.run(Thread.java:748) 
[?:1.8.0_242] 2020-05-14T04:20:37.2146393Z Caused by: 
org.apache.flink.runtime.client.JobCancellationException: Job was cancelled. 
2020-05-14T04:20:37.2151756Z at 
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:149)
 ~[flink-runtime_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 
2020-05-14T04:20:37.2153457Z at 
org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:113)
 ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 
2020-05-14T04:20:37.2154166Z ... 2 more

> KafkaProducerExactlyOnceITCase. testExactlyOnceRegularSink fails with "The 
> producer has already been closed"
> 
>
> Key: FLINK-16383
> URL: https://issues.apache.org/jira/browse/FLINK-16383
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task, Tests
>Reporter: Robert Metzger
>Priority: Blocker
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> Logs: 
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=5779=logs=a54de925-e958-5e24-790a-3a6150eb72d8=24e561e9-4c8d-598d-a290-e6acce191345
> {code}
> 2020-03-01T01:06:57.4738418Z 01:06:57,473 [Source: Custom Source -> Map -> 
> Sink: Unnamed 

[jira] [Commented] (FLINK-16383) KafkaProducerExactlyOnceITCase. testExactlyOnceRegularSink fails with "The producer has already been closed"

2020-05-15 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108387#comment-17108387
 ] 

Congxian Qiu(klion26) commented on FLINK-16383:
---

seems another instance 
[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=1230=logs=d44f43ce-542c-597d-bf94-b0718c71e5e8=34f486e1-e1e4-5dd2-9c06-bfdd9b9c74a8=18475]



2020-05-14T04:20:36.9953507Z 04:20:36,993 [Source: Custom Source -> Sink: 
Unnamed (3/3)] WARN 
org.apache.flink.streaming.connectors.kafka.internal.Kafka010Fetcher [] - 
Committing offsets to Kafka takes longer than the checkpoint interval. Skipping 
commit of previous offsets because newer complete checkpoint offsets are 
available. This does not compromise Flink's checkpoint integrity. 
2020-05-14T04:20:37.2142509Z 04:20:37,174 [program runner thread] ERROR 
org.apache.flink.streaming.connectors.kafka.KafkaTestBase [] - Job Runner 
failed with exception 2020-05-14T04:20:37.2143266Z 
org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 
cdf00a9c3a8b9c3afb72f57e7ef936c8) 2020-05-14T04:20:37.2144122Z at 
org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:115)
 ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 
2020-05-14T04:20:37.2145249Z at 
org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase.lambda$runCancelingOnEmptyInputTest$2(KafkaConsumerTestBase.java:1075)
 ~[flink-connector-kafka-base_2.11-1.11-SNAPSHOT-tests.jar:?] 
2020-05-14T04:20:37.2145858Z at java.lang.Thread.run(Thread.java:748) 
[?:1.8.0_242] 2020-05-14T04:20:37.2146393Z Caused by: 
org.apache.flink.runtime.client.JobCancellationException: Job was cancelled. 
2020-05-14T04:20:37.2151756Z at 
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:149)
 ~[flink-runtime_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 
2020-05-14T04:20:37.2153457Z at 
org.apache.flink.client.ClientUtils.submitJobAndWaitForResult(ClientUtils.java:113)
 ~[flink-clients_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT] 
2020-05-14T04:20:37.2154166Z ... 2 more

> KafkaProducerExactlyOnceITCase. testExactlyOnceRegularSink fails with "The 
> producer has already been closed"
> 
>
> Key: FLINK-16383
> URL: https://issues.apache.org/jira/browse/FLINK-16383
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task, Tests
>Reporter: Robert Metzger
>Priority: Blocker
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>
> Logs: 
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=5779=logs=a54de925-e958-5e24-790a-3a6150eb72d8=24e561e9-4c8d-598d-a290-e6acce191345
> {code}
> 2020-03-01T01:06:57.4738418Z 01:06:57,473 [Source: Custom Source -> Map -> 
> Sink: Unnamed (1/1)] INFO  
> org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaInternalProducer
>  [] - Flushing new partitions
> 2020-03-01T01:06:57.4739960Z 01:06:57,473 [FailingIdentityMapper Status 
> Printer] INFO  
> org.apache.flink.streaming.connectors.kafka.testutils.FailingIdentityMapper 
> [] - > Failing mapper  0: count=680, 
> totalCount=1000
> 2020-03-01T01:06:57.4909074Z 
> org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
> 2020-03-01T01:06:57.4910001Z  at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:147)
> 2020-03-01T01:06:57.4911000Z  at 
> org.apache.flink.runtime.minicluster.MiniCluster.executeJobBlocking(MiniCluster.java:648)
> 2020-03-01T01:06:57.4912078Z  at 
> org.apache.flink.streaming.util.TestStreamEnvironment.execute(TestStreamEnvironment.java:77)
> 2020-03-01T01:06:57.4913039Z  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1619)
> 2020-03-01T01:06:57.4914421Z  at 
> org.apache.flink.test.util.TestUtils.tryExecute(TestUtils.java:35)
> 2020-03-01T01:06:57.4915423Z  at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testExactlyOnce(KafkaProducerTestBase.java:370)
> 2020-03-01T01:06:57.4916483Z  at 
> org.apache.flink.streaming.connectors.kafka.KafkaProducerTestBase.testExactlyOnceRegularSink(KafkaProducerTestBase.java:309)
> 2020-03-01T01:06:57.4917305Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-03-01T01:06:57.4917982Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-03-01T01:06:57.4918769Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-03-01T01:06:57.4919477Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-03-01T01:06:57.4920156Z  at 
> 

[jira] [Commented] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-13 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17106088#comment-17106088
 ] 

Congxian Qiu(klion26) commented on FLINK-17571:
---

[~pnowojski] the directory layouts of checkpoints and savepoints are not the same: the 
files of one savepoint always go into one 
[directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/savepoints.html#triggering-savepoints],
 so users can safely delete the whole savepoint if they want. Currently, only 
checkpoint files go into different directories (shared, taskowned, 
exclusive).

The command I proposed can be used for both checkpoints and savepoints; what I 
have in mind for the command is that it would:
 # read the metadata file (this already exists in our codebase)
 # deserialize the metadata file (this already exists in our codebase)
 # output the files referenced in the metadata file

The reason I want to keep it as a command in Flink is that if we change 
something related to the metadata file, we can update the command as well.
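The three steps can be sketched as follows. For illustration, this toy example uses a newline-separated list of paths as stand-in metadata — an assumption, since Flink's real {{_metadata}} file is binary and would be deserialized via {{Checkpoints#loadCheckpointMetaData}}; the class and file names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch of the proposed "flink checkpoint list" flow:
//   1) read the metadata file, 2) deserialize it, 3) print the referenced files.
// The toy newline-separated "metadata" below is an assumption for illustration;
// the real _metadata format is binary.
public class ListCheckpointFiles {

    /** Step 3 helper: files present on disk but not referenced by the metadata. */
    static List<String> unreferenced(List<String> onDisk, Set<String> referenced) {
        List<String> out = new ArrayList<>();
        for (String f : onDisk) {
            if (!referenced.contains(f)) {
                out.add(f);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // steps 1+2 (toy): "deserialize" the metadata into the set of referenced paths
        String metadata = "shared/sst-000042\nshared/sst-000043\nexclusive/chk-17/opstate-1";
        Set<String> referenced = new LinkedHashSet<>(Arrays.asList(metadata.split("\n")));

        // files actually found under the checkpoint directory (hypothetical names)
        List<String> onDisk = Arrays.asList(
                "shared/sst-000041", // left over from an older, discarded checkpoint
                "shared/sst-000042",
                "shared/sst-000043",
                "exclusive/chk-17/opstate-1");

        System.out.println("referenced by checkpoint: " + referenced);
        System.out.println("not referenced: " + unreferenced(onDisk, referenced)); // [shared/sst-000041]
    }
}
```

Comparing the referenced set against a directory listing is what would let users tell which files in the SHARED directory are still needed and which are leftovers.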

> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint. The files in the TASKOWNED and EXCLUSIVE directories can be 
> deleted safely, but users can't delete the files in the SHARED directory 
> safely (the files may have been created a long time ago).
> I think it's better to give users a way to know which files are 
> currently used (so the others are not used).
> Maybe a command-line command such as the one below is enough to support such 
> a feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}





[jira] [Commented] (FLINK-13678) Translate "Code Style - Preamble" page into Chinese

2020-05-11 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104190#comment-17104190
 ] 

Congxian Qiu(klion26) commented on FLINK-13678:
---

[~WangHW] Are you still working on this? I just saw that the GitHub account for 
the PR has been deleted. If you're still working on it, could you please file 
a new PR and ping me to review it?

> Translate "Code Style - Preamble" page into Chinese
> ---
>
> Key: FLINK-13678
> URL: https://issues.apache.org/jira/browse/FLINK-13678
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Project Website
>Reporter: Jark Wu
>Assignee: WangHengWei
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Translate page 
> https://flink.apache.org/zh/contributing/code-style-and-quality-preamble.html 
> into Chinese. The page is located in  
> https://github.com/apache/flink-web/blob/asf-site/contributing/code-style-and-quality-preamble.zh.md





[jira] [Commented] (FLINK-17468) Provide more detailed metrics why asynchronous part of checkpoint is taking long time

2020-05-11 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104165#comment-17104165
 ] 

Congxian Qiu(klion26) commented on FLINK-17468:
---

[~qqibrow] please let me know if you need any help~

> Provide more detailed metrics why asynchronous part of checkpoint is taking 
> long time
> -
>
> Key: FLINK-17468
> URL: https://issues.apache.org/jira/browse/FLINK-17468
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / Metrics, Runtime / 
> State Backends
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Priority: Major
>
> As [reported by 
> users|https://lists.apache.org/thread.html/r0833452796ca7d1c9d5e35c110089c95cfdadee9d81884a13613a4ce%40%3Cuser.flink.apache.org%3E]
>  it's not obvious why the asynchronous part of a checkpoint is taking such a long time.
> Maybe we could provide some more detailed metrics/UI/logs about uploading 
> files, materializing metadata, or other things that are happening during the 
> asynchronous checkpoint process?





[jira] [Commented] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-11 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104131#comment-17104131
 ] 

Congxian Qiu(klion26) commented on FLINK-17571:
---

Sorry for the late reply, it seems I lost some email notifications :(

[~pnowojski] previously, I just wanted to show all the files (including EXCLUSIVE, 
SHARED and TASKOWNED) that are being used by one checkpoint, because:
 * If we don't restore from an external checkpoint, there are never orphaned 
checkpoint files left (the files should be deleted when the checkpoint has been 
discarded)
 * If we restore from an external checkpoint, then we just need to know the 
files used in the "restored" checkpoint
 * If we retain more than 1 checkpoint and want to remove some of them (usually 
the older checkpoints) because of storage pressure, in most cases we 
just need to keep the freshest checkpoint.

For the implementation, we just need to leverage 
{{Checkpoints#loadCheckpointMetaData}} for this. 

From my side, another command {{remove files in specified checkpoints}} may be 
more tricky, because we may not know who references the files in the 
given checkpoint (if we retained more than one checkpoint and they reference 
the same shared file, and two separate jobs then restore from the different 
checkpoints, we can't simply delete the files in each checkpoint).

[~trystan] good to know that you're interested in helping with this. I'm fine if you 
want to help implement it, and I can do the initial review of your 
contribution. Please let me know if you have time to contribute. No 
matter who finally contributes this, we need to agree on the 
implementation first on the issue side.

> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint. The files in the TASKOWNED and EXCLUSIVE directories can be 
> deleted safely, but users can't delete the files in the SHARED directory 
> safely (the files may have been created a long time ago).
> I think it's better to give users a way to know which files are 
> currently used (so the others are not used).
> Maybe a command-line command such as the one below is enough to support such 
> a feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17468) Provide more detailed metrics why asynchronous part of checkpoint is taking long time

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102397#comment-17102397
 ] 

Congxian Qiu(klion26) commented on FLINK-17468:
---

Hi [~qqibrow], from your previous reply, for the purpose of debugging 
checkpoint slowness, do you mean you had to add more and more logging in 
{{RocksDBIncrementalSnapshotOperation}} and the other functions so that you 
could find the exact place that caused the slowness? If that is the scenario, 
do you think adding debug/trace logging in each step of the async snapshot 
part, as in my previous reply, would serve your debugging purpose? Or what 
other metrics or logs would you want?

> Provide more detailed metrics why asynchronous part of checkpoint is taking 
> long time
> -
>
> Key: FLINK-17468
> URL: https://issues.apache.org/jira/browse/FLINK-17468
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / Metrics, Runtime / 
> State Backends
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Priority: Major
>
> As [reported by 
> users|https://lists.apache.org/thread.html/r0833452796ca7d1c9d5e35c110089c95cfdadee9d81884a13613a4ce%40%3Cuser.flink.apache.org%3E]
>  it's not obvious why asynchronous part of checkpoint is taking so long time.
> Maybe we could provide some more detailed metrics/UI/logs about uploading 
> files, materializing meta data, or other things that are happening during the 
> asynchronous checkpoint process?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the 
[userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]

Currently, there are [three types of 
directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
 for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the 
[userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]

Currently, there are [three types of 
directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]]
 for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be 
> deleted safely, but users can't delete the files in the SHARED directory 
> safely(the files may be created a long time ago).
> I think it's better to give users a better way to know which files are 
> currently used(so the others are not used)
> maybe a command-line command such as below is ok enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102356#comment-17102356
 ] 

Congxian Qiu(klion26) commented on FLINK-17571:
---

[~pnowojski] what do you think about this feature,  If this is valid, I think I 
can help to contribute it.

> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be 
> deleted safely, but users can't delete the files in the SHARED directory 
> safely(the files may be created a long time ago).
> I think it's better to give users a better way to know which files are 
> currently used(so the others are not used)
> maybe a command-line command such as below is ok enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the 
[userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]

Currently, there are [three types of 
directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]]
 for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the 
[userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]

Currently, there are three [types of directory|#directory-structure]] for a 
checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
> Currently, there are [three types of 
> directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]]
>  for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be 
> deleted safely, but users can't delete the files in the SHARED directory 
> safely(the files may be created a long time ago).
> I think it's better to give users a better way to know which files are 
> currently used(so the others are not used)
> maybe a command-line command such as below is ok enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the 
[userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]

Currently, there are three [types of directory|#directory-structure]] for a 
checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the [usermail|#directory-structure]

Currently, there are three [types of directory|#directory-structure] for a 
checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [userMail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
> Currently, there are three [types of directory|#directory-structure]] for a 
> checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
> safely, but users can't delete the files in the SHARED directory safely(the 
> files may be created a long time ago).
> I think it's better to give users a better way to know which files are 
> currently used(so the others are not used)
> maybe a command-line command such as below is ok enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the [usermail|#directory-structure]] 

Currently, there are three [types of directory|#directory-structure]] for a 
checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the 
[usermail|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]]
 [user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
 

Currently, there are three [types of directory|#directory-structure]] for a 
checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the [usermail|#directory-structure]] 
> Currently, there are three [types of directory|#directory-structure]] for a 
> checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
> safely, but users can't delete the files in the SHARED directory safely(the 
> files may be created a long time ago).
> I think it's better to give users a better way to know which files are 
> currently used(so the others are not used)
> maybe a command-line command such as below is ok enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the [usermail|#directory-structure]

Currently, there are three [types of directory|#directory-structure]] for a 
checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the [usermail|#directory-structure]] 

Currently, there are three [types of directory|#directory-structure]] for a 
checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the [usermail|#directory-structure]
> Currently, there are three [types of directory|#directory-structure]] for a 
> checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
> safely, but users can't delete the files in the SHARED directory safely(the 
> files may be created a long time ago).
> I think it's better to give users a better way to know which files are 
> currently used(so the others are not used)
> maybe a command-line command such as below is ok enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the [usermail|#directory-structure]

Currently, there are three [types of directory|#directory-structure] for a 
checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the [usermail|#directory-structure]

Currently, there are three [types of directory|#directory-structure]] for a 
checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the [usermail|#directory-structure]
> Currently, there are three [types of directory|#directory-structure] for a 
> checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
> safely, but users can't delete the files in the SHARED directory safely(the 
> files may be created a long time ago).
> I think it's better to give users a better way to know which files are 
> currently used(so the others are not used)
> maybe a command-line command such as below is ok enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the 
[usermail|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]]
 [user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
 

Currently, there are three [types of directory|#directory-structure]] for a 
checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the [user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
 

Currently, there are three [types of 
directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]]
 for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the 
> [usermail|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]]
>  [user 
> mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
>  
> Currently, there are three [types of directory|#directory-structure]] for a 
> checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
> safely, but users can't delete the files in the SHARED directory safely(the 
> files may be created a long time ago).
> I think it's better to give users a better way to know which files are 
> currently used(so the others are not used)
> maybe a command-line command such as below is ok enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the [user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
 

Currently, there are three [types of 
directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]]
 for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be deleted 
safely, but users can't delete the files in the SHARED directory safely(the 
files may be created a long time ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the [user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
 

Currently, there are three types of directory for a checkpoint, the files in 
TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete 
the files in the SHARED directory safely(the files may be created a long time 
ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the [user 
> mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
>  
> Currently, there are three [types of 
> directory|[https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]]
>  for a checkpoint, the files in TASKOWND and EXCLUSIVE directory can be 
> deleted safely, but users can't delete the files in the SHARED directory 
> safely(the files may be created a long time ago).
> I think it's better to give users a better way to know which files are 
> currently used(so the others are not used)
> maybe a command-line command such as below is ok enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the [user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
 

Currently, there are three types of directory for a checkpoint, the files in 
TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete 
the files in the SHARED directory safely(the files may be created a long time 
ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the [user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]

Currently, there are three types of directory for a checkpoint, the files in 
TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete 
the files in the SHARED directory safely(the files may be created a long time 
ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the [user 
> mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
>  
> Currently, there are three types of directory for a checkpoint, the files in 
> TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't 
> delete the files in the SHARED directory safely(the files may be created a 
> long time ago).
> I think it's better to give users a better way to know which files are 
> currently used(so the others are not used)
> maybe a command-line command such as below is ok enough to support such a 
> feature.
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the [user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]

Currently, there are three types of directory for a checkpoint, the files in 
TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete 
the files in the SHARED directory safely(the files may be created a long time 
ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the [user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]

Currently, there are three types of directory for a checkpoint, the files in 
TASKOWND and EXCLUSIVE directory can be deleted safely, but users can't delete 
the files in the SHARED directory safely(the files may be created a long time 
ago).

I think it's better to give users a better way to know which files are 
currently used(so the others are not used)

maybe a command-line command such as below is ok enough to support such a 
feature.

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the [user 
> mail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are three types of directories for a checkpoint. The files 
> in the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users 
> can't delete the files in the SHARED directory safely (the files may have 
> been created a long time ago).
> I think it's better to give users a way to know which files are currently in 
> use (so the others are not).
> Maybe a command-line command such as the one below would be enough to support 
> such a feature:
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the [user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]

Currently, there are three types of directories for a checkpoint. The files in 
the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users can't 
delete the files in the SHARED directory safely (the files may have been 
created a long time ago).

I think it's better to give users a way to know which files are currently in 
use (so the others are not).

Maybe a command-line command such as the one below would be enough to support 
such a feature:

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the [[user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]]

Currently, there are three types of directories for a checkpoint. The files in 
the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users can't 
delete the files in the SHARED directory safely (the files may have been 
created a long time ago).

I think it's better to give users a way to know which files are currently in 
use (so the others are not).

Maybe a command-line command such as the one below would be enough to support 
such a feature:

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the [user 
> mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]
> Currently, there are three types of directories for a checkpoint. The files 
> in the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users 
> can't delete the files in the SHARED directory safely (the files may have 
> been created a long time ago).
> I think it's better to give users a way to know which files are currently in 
> use (so the others are not).
> Maybe a command-line command such as the one below would be enough to support 
> such a feature:
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)
Congxian Qiu(klion26) created FLINK-17571:
-

 Summary: A better way to show the files used in currently 
checkpoints
 Key: FLINK-17571
 URL: https://issues.apache.org/jira/browse/FLINK-17571
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Checkpointing
Reporter: Congxian Qiu(klion26)


Inspired by the [user 
mail]([http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]).

Currently, there are three types of directories for a checkpoint. The files in 
the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users can't 
delete the files in the SHARED directory safely (the files may have been 
created a long time ago).

I think it's better to give users a way to know which files are currently in 
use (so the others are not).

Maybe a command-line command such as the one below would be enough to support 
such a feature:

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}
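
The three directory classes mentioned above correspond to a checkpoint root's 
on-disk layout ({{chk-*}} for exclusive state, {{shared/}}, {{taskowned/}}). 
As a rough illustration of what such a listing tool would have to report, here 
is a hedged Python sketch; the directory names follow Flink's documented 
layout, but the {{group_checkpoint_files}} helper and its name are 
hypothetical, not part of Flink:

```python
import os
from collections import defaultdict

def group_checkpoint_files(checkpoint_root):
    """Group files under a checkpoint root by Flink's three ownership classes.

    Hypothetical sketch: EXCLUSIVE (chk-*) and TASKOWNED files can be deleted
    safely once a checkpoint is subsumed; SHARED files may still be referenced
    by newer checkpoints, so they are the ones a listing tool must report.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(checkpoint_root):
        rel = os.path.relpath(dirpath, checkpoint_root)
        top = rel.split(os.sep)[0]
        if top.startswith("chk-"):
            kind = "EXCLUSIVE"   # per-checkpoint exclusive state
        elif top == "shared":
            kind = "SHARED"      # potentially referenced by later checkpoints
        elif top == "taskowned":
            kind = "TASKOWNED"   # state owned by tasks, not the job manager
        else:
            kind = "OTHER"
        for name in filenames:
            groups[kind].append(os.path.join(rel, name))
    return dict(groups)
```

A real implementation would additionally parse the {{_metadata}} file of each 
retained checkpoint to decide which SHARED files are still referenced; this 
sketch only shows the directory-level grouping.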



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-05-08 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17571:
--
Description: 
Inspired by the [[user 
mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]]

Currently, there are three types of directories for a checkpoint. The files in 
the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users can't 
delete the files in the SHARED directory safely (the files may have been 
created a long time ago).

I think it's better to give users a way to know which files are currently in 
use (so the others are not).

Maybe a command-line command such as the one below would be enough to support 
such a feature:

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}

  was:
Inspired by the [user 
mail]([http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]).

Currently, there are three types of directories for a checkpoint. The files in 
the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users can't 
delete the files in the SHARED directory safely (the files may have been 
created a long time ago).

I think it's better to give users a way to know which files are currently in 
use (so the others are not).

Maybe a command-line command such as the one below would be enough to support 
such a feature:

{{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
checkpoint}}


> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing
>Reporter: Congxian Qiu(klion26)
>Priority: Major
>
> Inspired by the [[user 
> mail|[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]]]
> Currently, there are three types of directories for a checkpoint. The files 
> in the TASKOWNED and EXCLUSIVE directories can be deleted safely, but users 
> can't delete the files in the SHARED directory safely (the files may have 
> been created a long time ago).
> I think it's better to give users a way to know which files are currently in 
> use (so the others are not).
> Maybe a command-line command such as the one below would be enough to support 
> such a feature:
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17479) Occasional checkpoint failure due to null pointer exception in Flink version 1.10

2020-05-06 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101355#comment-17101355
 ] 

Congxian Qiu(klion26) commented on FLINK-17479:
---

[~nobleyd] thanks for reporting this problem. It seems strange from the picture 
you've given: if {{checkpointMetaData}} were null, how could the message 
{{"Could not perform checkpoint " + checkpointMetaData.getCheckpointId() + " 
for operator " + getName() + '.'}} [1] have been printed? Constructing that 
error message itself gets the checkpointId from {{checkpointMetaData}}. Could 
you please share the whole JM log? A reproducible job is even better. Thanks.

[1]https://github.com/apache/flink/blob/aa4eb8f0c9ce74e6b92c3d9be5dc8e8cb536239d/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L801
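
The argument above — that the quoted error message could not have been 
produced if {{checkpointMetaData}} were actually null — can be seen in 
miniature with a hedged Python analogue (the class and method names below are 
made up for illustration and are not Flink's actual API):

```python
class CheckpointMetaData:
    """Stand-in for Flink's CheckpointMetaData, illustration only."""

    def __init__(self, checkpoint_id):
        self.checkpoint_id = checkpoint_id

    def get_checkpoint_id(self):
        return self.checkpoint_id

def build_failure_message(meta):
    # Mirrors the Java string concatenation: building the message itself
    # dereferences `meta`, so a null/None meta would raise here
    # (AttributeError in Python, NPE in Java) instead of producing the
    # formatted message.
    return ("Could not perform checkpoint %d for operator X."
            % meta.get_checkpoint_id())

print(build_failure_message(CheckpointMetaData(17)))
try:
    build_failure_message(None)
except AttributeError:
    print("message construction itself failed")
```

In other words: seeing the fully formatted message in the log is evidence that 
{{checkpointMetaData}} was non-null at that point, so the NPE must originate 
elsewhere.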

 

> Occasional checkpoint failure due to null pointer exception in Flink version 
> 1.10
> -
>
> Key: FLINK-17479
> URL: https://issues.apache.org/jira/browse/FLINK-17479
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
> Environment: Flink1.10.0
> jdk1.8.0_60
>Reporter: nobleyd
>Priority: Major
> Attachments: image-2020-04-30-18-44-21-630.png, 
> image-2020-04-30-18-55-53-779.png
>
>
> I upgraded the standalone cluster (3 machines) from Flink 1.9 to the latest 
> Flink 1.10.0. My job had been running normally on Flink 1.9 for about half a 
> year, but now some jobs fail due to a null pointer exception when 
> checkpointing in Flink 1.10.0.
> Below is the exception log:
> !image-2020-04-30-18-55-53-779.png!
> I have checked StreamTask (line 882), shown below. I think the only case 
> that can lead to a null pointer exception is that checkpointMetaData is null.
> !image-2020-04-30-18-44-21-630.png!
> I do not know why; can anyone help me? The problem only occurs in 
> Flink 1.10.0 for now; it works well in Flink 1.9. I also give some conf 
> info below (some values differ from the defaults), guessing that maybe it is 
> caused by a configuration mistake.
> some conf of my flink1.10.0:
>  
> {code:java}
> taskmanager.memory.flink.size: 71680m
> taskmanager.memory.framework.heap.size: 512m
> taskmanager.memory.framework.off-heap.size: 512m
> taskmanager.memory.task.off-heap.size: 17920m
> taskmanager.memory.managed.size: 512m
> taskmanager.memory.jvm-metaspace.size: 512m
> taskmanager.memory.network.fraction: 0.1
> taskmanager.memory.network.min: 1024mb
> taskmanager.memory.network.max: 1536mb
> taskmanager.memory.segment-size: 128kb
> rest.port: 8682
> historyserver.web.port: 8782
> high-availability.jobmanager.port: 13141,13142,13143,13144
> blob.server.port: 13146,13147,13148,13149
> taskmanager.rpc.port: 13151,13152,13153,13154
> taskmanager.data.port: 13156
> metrics.internal.query-service.port: 13161,13162,13163,13164,13166,13167,13168,13169
> env.java.home: /usr/java/jdk1.8.0_60/bin/java
> env.pid.dir: /home/work/flink-1.10.0{code}
>  
> Hope someone can help me solve it.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-17342) Schedule savepoint if max-inflight-checkpoints limit is reached instead of forcing (in UC mode)

2020-05-03 Thread Congxian Qiu(klion26) (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu(klion26) updated FLINK-17342:
--
Summary: Schedule savepoint if max-inflight-checkpoints limit is reached 
instead of forcing (in UC mode)  (was: Schedule savepoint if 
max-inflight-checkpoints limit is reached isntead of forcing (in UC mode))

> Schedule savepoint if max-inflight-checkpoints limit is reached instead of 
> forcing (in UC mode)
> ---
>
> Key: FLINK-17342
> URL: https://issues.apache.org/jira/browse/FLINK-17342
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Roman Khachatryan
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17468) Provide more detailed metrics why asynchronous part of checkpoint is taking long time

2020-04-30 Thread Congxian Qiu(klion26) (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17096430#comment-17096430
 ] 

Congxian Qiu(klion26) commented on FLINK-17468:
---

Hi [~qqibrow], there may be many situations that lead to snapshot timeouts, 
but in my experience all of them narrow down to two: disk performance and 
network performance (including network speed and the number of threads used to 
upload state). The disk/network bottleneck information can be obtained from 
somewhere other than Apache Flink, so I don't think we need to add more 
metrics here.

But for the user experience (especially for someone who wants to find the root 
cause of a timeout), I agree with adding some more debug (or trace) logging in 
the async part of a snapshot, so that users can easily see which step the 
current snapshot is in. (There are many steps in the async part of 
RocksDBStateBackend...)

I proposed adding more debug (or trace) logging for this purpose in 
FLINK-13808, and created an issue to limit the max concurrent snapshots on the 
TM side (FLINK-15236). If we reach agreement on this, I can help contribute 
them.
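
The kind of per-step debug/trace logging suggested above can be sketched as 
follows (a minimal Python illustration; the step names and the 
{{traced_step}} helper are hypothetical, and Flink's actual async snapshot 
phase is implemented in Java):

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("async-snapshot")

# Record (step name, seconds) pairs so slow steps can be identified later.
durations = []

@contextmanager
def traced_step(name):
    """Time one step of the async snapshot and log its duration at DEBUG."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        durations.append((name, elapsed))
        log.debug("async snapshot step %r took %.3f s", name, elapsed)

# Usage: wrap each step of the asynchronous part of the snapshot.
with traced_step("upload state files"):
    time.sleep(0.01)  # stand-in for the actual file upload
with traced_step("materialize metadata"):
    pass
```

With such wrappers in place, a user debugging a timeout can read the DEBUG log 
and see directly which step dominates, instead of inferring it from external 
disk/network monitoring.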

> Provide more detailed metrics why asynchronous part of checkpoint is taking 
> long time
> -
>
> Key: FLINK-17468
> URL: https://issues.apache.org/jira/browse/FLINK-17468
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / Metrics, Runtime / 
> State Backends
>Affects Versions: 1.10.0
>Reporter: Piotr Nowojski
>Priority: Major
>
> As [reported by 
> users|https://lists.apache.org/thread.html/r0833452796ca7d1c9d5e35c110089c95cfdadee9d81884a13613a4ce%40%3Cuser.flink.apache.org%3E]
>  it's not obvious why asynchronous part of checkpoint is taking so long time.
> Maybe we could provide some more detailed metrics/UI/logs about uploading 
> files, materializing meta data, or other things that are happening during the 
> asynchronous checkpoint process?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

