[jira] [Created] (FLINK-35144) Support multi source sync for FlinkCDC

2024-04-17 Thread Congxian Qiu (Jira)
Congxian Qiu created FLINK-35144:


 Summary: Support multi source sync for FlinkCDC
 Key: FLINK-35144
 URL: https://issues.apache.org/jira/browse/FLINK-35144
 Project: Flink
  Issue Type: Improvement
  Components: Flink CDC
Affects Versions: cdc-3.1.0
Reporter: Congxian Qiu


Currently, a FlinkCDC pipeline can only support a single source, so we need to 
start multiple pipelines when there are multiple sources. 

For upstreams that use sharding, we need to sync multiple sources in one 
pipeline; the current pipeline can't do this because it can only support a 
single source.

This issue proposes supporting the sync of multiple sources in one pipeline.



--
This message was sent by Atlassian Jira (v8.20.10#820010)


[jira] [Updated] (FLINK-35144) Support various sources sync for FlinkCDC in one pipeline

2024-04-17 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu updated FLINK-35144:
-
Summary: Support various sources sync for FlinkCDC in one pipeline  (was: 
Support various source sync for FlinkCDC in one pipeline)

> Support various sources sync for FlinkCDC in one pipeline
> -
>
> Key: FLINK-35144
> URL: https://issues.apache.org/jira/browse/FLINK-35144
> Project: Flink
>  Issue Type: Improvement
>  Components: Flink CDC
>Affects Versions: cdc-3.1.0
>Reporter: Congxian Qiu
>Priority: Major
>
> Currently, a FlinkCDC pipeline can only support a single source, so we need 
> to start multiple pipelines when there are multiple sources. 
> For upstreams that use sharding, we need to sync multiple sources in one 
> pipeline; the current pipeline can't do this because it can only support a 
> single source.
> This issue proposes supporting the sync of multiple sources in one pipeline.





[jira] [Updated] (FLINK-35144) Support various source sync for FlinkCDC in one pipeline

2024-04-17 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu updated FLINK-35144:
-
Summary: Support various source sync for FlinkCDC in one pipeline  (was: 
Support multi source sync for FlinkCDC)

> Support various source sync for FlinkCDC in one pipeline
> 
>
> Key: FLINK-35144
> URL: https://issues.apache.org/jira/browse/FLINK-35144
> Project: Flink
>  Issue Type: Improvement
>  Components: Flink CDC
>Affects Versions: cdc-3.1.0
>Reporter: Congxian Qiu
>Priority: Major
>
> Currently, a FlinkCDC pipeline can only support a single source, so we need 
> to start multiple pipelines when there are multiple sources. 
> For upstreams that use sharding, we need to sync multiple sources in one 
> pipeline; the current pipeline can't do this because it can only support a 
> single source.
> This issue proposes supporting the sync of multiple sources in one pipeline.





[jira] [Commented] (FLINK-29913) Shared state would be discarded by mistake when maxConcurrentCheckpoint>1

2023-05-25 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17726084#comment-17726084
 ] 

Congxian Qiu commented on FLINK-29913:
--

Thanks for the discussion above and the contribution!

Using the UUID/filename as the key solves the problem here, and it also makes 
sense because the key and the remote file are one-to-one. In addition, it can 
also solve some other potential problems. For example, if a Flink job 
management platform uses the SharedStateRegistry to maintain the checkpoint 
lifecycle, a task that has two ssts with the same name will currently cause a 
file to be deleted by mistake. This situation occurs as follows: job A 
generates a checkpoint chk1, then stops; job B resumes from chk1, completes 
chk2, then stops; job C then resumes from chk1 and completes chk3. After we 
register chk2 and chk3 in one SharedStateRegistry, we'll delete some remote 
files by mistake, because chk2 and chk3 will contain some sst files with the 
same name.
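The collision and the proposed keying scheme can be sketched with a toy 
registry (all class, path, and file names below are hypothetical 
illustrations, not Flink's actual SharedStateRegistry API):

```java
import java.util.HashMap;
import java.util.Map;

public class RegistryKeySketch {
    // f[0] = sst file name, f[1] = remote path; key by one or the other.
    static int distinctEntries(String[][] files, boolean keyByRemotePath) {
        Map<String, String> registry = new HashMap<>();
        for (String[] f : files) {
            registry.put(keyByRemotePath ? f[1] : f[0], f[1]);
        }
        return registry.size();
    }

    public static void main(String[] args) {
        // chk2 (from job B) and chk3 (from job C) both contain "000042.sst",
        // but the two files live at different remote paths.
        String[][] files = {
            {"000042.sst", "hdfs://ckpts/jobB/chk2/000042.sst"},
            {"000042.sst", "hdfs://ckpts/jobC/chk3/000042.sst"},
        };
        // Keyed by bare file name, the two distinct remote files collapse into
        // one registry entry, so one of them could later be deleted by mistake.
        System.out.println(distinctEntries(files, false));
        // Keyed by the unique remote path (or a UUID), both files are tracked
        // independently and neither is discarded while still referenced.
        System.out.println(distinctEntries(files, true));
    }
}
```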

> Shared state would be discarded by mistake when maxConcurrentCheckpoint>1
> -
>
> Key: FLINK-29913
> URL: https://issues.apache.org/jira/browse/FLINK-29913
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.15.0, 1.16.0, 1.17.0
>Reporter: Yanfei Lei
>Assignee: Feifan Wang
>Priority: Major
> Fix For: 1.16.3, 1.17.2
>
>
> When maxConcurrentCheckpoint>1, the shared state of the incremental RocksDB 
> state backend would be discarded because handles with the same name are 
> registered. See 
> [https://github.com/apache/flink/pull/21050#discussion_r1011061072]
> cc [~roman] 





[jira] [Commented] (FLINK-29913) Shared state would be discarded by mistake when maxConcurrentCheckpoint>1

2022-11-13 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633506#comment-17633506
 ] 

Congxian Qiu commented on FLINK-29913:
--

Sorry for the late reply.

[~Yanfei Lei] For the priority, IMHO, if the user sets 
{{maxConcurrentCheckpoint > 1 && MAX_RETAINED_CHECKPOINTS > 1}}, then the 
checkpoints may be broken and the job can't restore from the checkpoint 
because of the {{FileNotFoundException}}, so I think it deserves an escalated 
priority.

[~roman] Your proposal seems valid from my perspective; maybe changing the 
logic for generating the registry key (perhaps using the filename in the 
remote filesystem) is enough to solve the problem here?

Please let me know what you think about this, thanks.

> Shared state would be discarded by mistake when maxConcurrentCheckpoint>1
> -
>
> Key: FLINK-29913
> URL: https://issues.apache.org/jira/browse/FLINK-29913
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.15.0, 1.16.0
>Reporter: Yanfei Lei
>Priority: Minor
>
> When maxConcurrentCheckpoint>1, the shared state of the incremental RocksDB 
> state backend would be discarded because handles with the same name are 
> registered. See 
> [https://github.com/apache/flink/pull/21050#discussion_r1011061072]
> cc [~roman] 





[jira] [Resolved] (FLINK-29095) Improve logging in SharedStateRegistry

2022-11-07 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu resolved FLINK-29095.
--
Resolution: Fixed

> Improve logging in SharedStateRegistry 
> ---
>
> Key: FLINK-29095
> URL: https://issues.apache.org/jira/browse/FLINK-29095
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.16.0
>Reporter: Jing Ge
>Assignee: Yanfei Lei
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.17.0
>
>
> With incremental checkpoints, conceptually, state files that are no longer 
> used by any checkpoint will be deleted/GC'd. In practice, state files might 
> be deleted while they are still required for failover, which will lead to 
> Flink job failures.
> We should add logging for troubleshooting.





[jira] [Commented] (FLINK-29095) Improve logging in SharedStateRegistry

2022-11-07 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630226#comment-17630226
 ] 

Congxian Qiu commented on FLINK-29095:
--

merged into master 9a4250d248e93f3e87b211df98ce3d3c66aabca0

> Improve logging in SharedStateRegistry 
> ---
>
> Key: FLINK-29095
> URL: https://issues.apache.org/jira/browse/FLINK-29095
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Affects Versions: 1.16.0
>Reporter: Jing Ge
>Assignee: Yanfei Lei
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.17.0
>
>
> With incremental checkpoints, conceptually, state files that are no longer 
> used by any checkpoint will be deleted/GC'd. In practice, state files might 
> be deleted while they are still required for failover, which will lead to 
> Flink job failures.
> We should add logging for troubleshooting.





[jira] [Commented] (FLINK-29913) Shared state would be discarded by mistake when maxConcurrentCheckpoint>1

2022-11-07 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17629759#comment-17629759
 ] 

Congxian Qiu commented on FLINK-29913:
--

[~Yanfei Lei] Thanks for creating this ticket and the IT case. Would you like 
to contribute a fix for this problem?

For the priority: as this may lead to a {{FileNotFoundException}} if 
{{maxConcurrentCheckpoint > 1}} is set, I think it at least needs to be 
Critical. What do you think about this?

> Shared state would be discarded by mistake when maxConcurrentCheckpoint>1
> -
>
> Key: FLINK-29913
> URL: https://issues.apache.org/jira/browse/FLINK-29913
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.15.0, 1.16.0
>Reporter: Yanfei Lei
>Priority: Minor
>
> When maxConcurrentCheckpoint>1, the shared state of the incremental RocksDB 
> state backend would be discarded because handles with the same name are 
> registered. See 
> [https://github.com/apache/flink/pull/21050#discussion_r1011061072]





[jira] [Comment Edited] (FLINK-29157) Clarify the contract between CompletedCheckpointStore and SharedStateRegistry

2022-10-26 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624825#comment-17624825
 ] 

Congxian Qiu edited comment on FLINK-29157 at 10/27/22 5:22 AM:


merged into master 63767c5ed91642c67f97d9f16ff2b8955f9ae421

1.16 be2bd93838548f7858baecf5e8beb469836081d5


was (Author: klion26):
merged into master 63767c5ed91642c67f97d9f16ff2b8955f9ae421

> Clarify the contract between CompletedCheckpointStore and SharedStateRegistry
> -
>
> Key: FLINK-29157
> URL: https://issues.apache.org/jira/browse/FLINK-29157
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / Checkpointing
>Affects Versions: 1.16.0, 1.15.2
>Reporter: Roman Khachatryan
>Assignee: Yanfei Lei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.17.0, 1.15.3, 1.16.1
>
>
> After FLINK-24611, CompletedCheckpointStore is required to call 
> SharedStateRegistry.unregisterUnusedState() on checkpoint subsumption and 
> shutdown.
> Although it's not clear whether CompletedCheckpointStore is internal there 
> are in fact external implementations (which weren't updated accordingly).
>  
> After FLINK-25872, CompletedCheckpointStore also must call 
> checkpointsCleaner.cleanSubsumedCheckpoints.
>  
> Another issue with a custom implementation was using different java objects 
> for state for CheckpointStore and SharedStateRegistry (after FLINK-24086). 
>  
> So it makes sense to:
>  * clarify the contract (different in 1.15 and 1.16)
>  * require using the same checkpoint objects by SharedStateRegistryFactory 
> and CompletedCheckpointStore
>  * mark the interface(s) as PublicEvolving





[jira] [Updated] (FLINK-29157) Clarify the contract between CompletedCheckpointStore and SharedStateRegistry

2022-10-26 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu updated FLINK-29157:
-
Fix Version/s: (was: 1.15.3)

> Clarify the contract between CompletedCheckpointStore and SharedStateRegistry
> -
>
> Key: FLINK-29157
> URL: https://issues.apache.org/jira/browse/FLINK-29157
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / Checkpointing
>Affects Versions: 1.16.0, 1.15.2
>Reporter: Roman Khachatryan
>Assignee: Yanfei Lei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.17.0, 1.16.1
>
>
> After FLINK-24611, CompletedCheckpointStore is required to call 
> SharedStateRegistry.unregisterUnusedState() on checkpoint subsumption and 
> shutdown.
> Although it's not clear whether CompletedCheckpointStore is internal there 
> are in fact external implementations (which weren't updated accordingly).
>  
> After FLINK-25872, CompletedCheckpointStore also must call 
> checkpointsCleaner.cleanSubsumedCheckpoints.
>  
> Another issue with a custom implementation was using different java objects 
> for state for CheckpointStore and SharedStateRegistry (after FLINK-24086). 
>  
> So it makes sense to:
>  * clarify the contract (different in 1.15 and 1.16)
>  * require using the same checkpoint objects by SharedStateRegistryFactory 
> and CompletedCheckpointStore
>  * mark the interface(s) as PublicEvolving





[jira] [Resolved] (FLINK-29157) Clarify the contract between CompletedCheckpointStore and SharedStateRegistry

2022-10-26 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu resolved FLINK-29157.
--
Resolution: Fixed

merged into master 63767c5ed91642c67f97d9f16ff2b8955f9ae421

> Clarify the contract between CompletedCheckpointStore and SharedStateRegistry
> -
>
> Key: FLINK-29157
> URL: https://issues.apache.org/jira/browse/FLINK-29157
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / Checkpointing
>Affects Versions: 1.16.0, 1.15.2
>Reporter: Roman Khachatryan
>Assignee: Yanfei Lei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.17.0, 1.15.3, 1.16.1
>
>
> After FLINK-24611, CompletedCheckpointStore is required to call 
> SharedStateRegistry.unregisterUnusedState() on checkpoint subsumption and 
> shutdown.
> Although it's not clear whether CompletedCheckpointStore is internal there 
> are in fact external implementations (which weren't updated accordingly).
>  
> After FLINK-25872, CompletedCheckpointStore also must call 
> checkpointsCleaner.cleanSubsumedCheckpoints.
>  
> Another issue with a custom implementation was using different java objects 
> for state for CheckpointStore and SharedStateRegistry (after FLINK-24086). 
>  
> So it makes sense to:
>  * clarify the contract (different in 1.15 and 1.16)
>  * require using the same checkpoint objects by SharedStateRegistryFactory 
> and CompletedCheckpointStore
>  * mark the interface(s) as PublicEvolving





[jira] [Assigned] (FLINK-29157) Clarify the contract between CompletedCheckpointStore and SharedStateRegistry

2022-10-26 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-29157:


Assignee: Yanfei Lei  (was: Roman Khachatryan)

> Clarify the contract between CompletedCheckpointStore and SharedStateRegistry
> -
>
> Key: FLINK-29157
> URL: https://issues.apache.org/jira/browse/FLINK-29157
> Project: Flink
>  Issue Type: Technical Debt
>  Components: Runtime / Checkpointing
>Affects Versions: 1.16.0, 1.15.2
>Reporter: Roman Khachatryan
>Assignee: Yanfei Lei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.17.0, 1.15.3, 1.16.1
>
>
> After FLINK-24611, CompletedCheckpointStore is required to call 
> SharedStateRegistry.unregisterUnusedState() on checkpoint subsumption and 
> shutdown.
> Although it's not clear whether CompletedCheckpointStore is internal there 
> are in fact external implementations (which weren't updated accordingly).
>  
> After FLINK-25872, CompletedCheckpointStore also must call 
> checkpointsCleaner.cleanSubsumedCheckpoints.
>  
> Another issue with a custom implementation was using different java objects 
> for state for CheckpointStore and SharedStateRegistry (after FLINK-24086). 
>  
> So it makes sense to:
>  * clarify the contract (different in 1.15 and 1.16)
>  * require using the same checkpoint objects by SharedStateRegistryFactory 
> and CompletedCheckpointStore
>  * mark the interface(s) as PublicEvolving





[jira] [Commented] (FLINK-29146) User set job configuration can not be retrieved from JobGraph and ExecutionGraph

2022-09-05 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17600282#comment-17600282
 ] 

Congxian Qiu commented on FLINK-29146:
--

A similar problem was encountered in a lower version. I think it needs to be 
fixed in master as well; otherwise, others may encounter problems when using 
ExecutionGraph#getJobConfiguration in the future.

> User set job configuration can not be retrieved from JobGraph and 
> ExecutionGraph
> -
>
> Key: FLINK-29146
> URL: https://issues.apache.org/jira/browse/FLINK-29146
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Shuiqiang Chen
>Priority: Major
>
> Currently, building an ExecutionGraph requires setting job-specific 
> information (like the job id, job name, job configuration, etc.), most of 
> which comes from the JobGraph. But I find that the configuration in the 
> JobGraph is a new Configuration instance created when the JobGraph is built, 
> and it does not contain any user-set configuration. As a result, we are not 
> able to retrieve the user-specified job configuration from an ExecutionGraph 
> built from a JobGraph during execution at runtime.
> BTW, in StreamExecutionEnvironment, it seems that job configurations not 
> contained in built-in options are ignored when calling 
> StreamExecutionEnvironment.configure(ReadableConfig[, ClassLoader]). However, 
> they are included when constructing a StreamExecutionEnvironment, which 
> seems a bit inconsistent. Is this by design?
> {code:java}
> Configuration configuration = new Configuration();
> // These configured string will take effect.
> configuration.setString("k1", "v1");
> configuration.setString("k2", "v2");
> configuration.setString("k3", "v3");
> configuration.set(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT, 30L);
> final StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment(configuration);
> // These configured string will be ignored.
> configuration.setString("k4", "v4");
> configuration.setString("k5", "v5");
> configuration.setString("k6", "v6");
> env.configure(configuration);
> {code}





[jira] [Updated] (FLINK-29146) User set job configuration can not be retrieved from JobGraph and ExecutionGraph

2022-08-30 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu updated FLINK-29146:
-
Affects Version/s: 1.16.0

> User set job configuration can not be retrieved from JobGraph and 
> ExecutionGraph
> -
>
> Key: FLINK-29146
> URL: https://issues.apache.org/jira/browse/FLINK-29146
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Shuiqiang Chen
>Priority: Major
>
> Currently, building an ExecutionGraph requires setting job-specific 
> information (like the job id, job name, job configuration, etc.), most of 
> which comes from the JobGraph. But I find that the configuration in the 
> JobGraph is a new Configuration instance that does not contain any user-set 
> configuration. As a result, we are not able to retrieve the user-specified 
> job configuration from an ExecutionGraph built from a JobGraph during 
> runtime execution.
> BTW, in StreamExecutionEnvironment, it seems that job configurations not 
> contained in built-in options are ignored when calling 
> StreamExecutionEnvironment.configure(ReadableConfig[, ClassLoader]). However, 
> they are included when constructing a StreamExecutionEnvironment, which 
> seems a bit inconsistent.
> {code:java}
> Configuration configuration = new Configuration();
> // These configured string will take effect.
> configuration.setString("k1", "v1");
> configuration.setString("k2", "v2");
> configuration.setString("k3", "v3");
> configuration.set(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT, 30L);
> final StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment(configuration);
> // These configured string will be ignored.
> configuration.setString("k4", "v4");
> configuration.setString("k5", "v5");
> configuration.setString("k6", "v6");
> env.configure(configuration);
> {code}





[jira] [Commented] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.

2022-04-06 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17518243#comment-17518243
 ] 

Congxian Qiu commented on FLINK-26932:
--

Hi [~huwh], could you please update the affected versions?

> TaskManager hung in cleanupAllocationBaseDirs not exit.
> ---
>
> Key: FLINK-26932
> URL: https://issues.apache.org/jira/browse/FLINK-26932
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: huweihua
>Priority: Major
> Attachments: 1280X1280.png, 
> origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png
>
>
> The disk the TaskManager used had a fatal error, and the TaskManager then 
> hung in cleanupAllocationBaseDirs, occupying the main thread.
>  
> So this TaskManager would not respond to the 
> cancelTask/disconnectResourceManager requests.
>  
> At the same time, the JobMaster already considers this TaskManager lost and 
> schedules its tasks to other TaskManagers.
>  
> This may cause some unexpected tasks to keep running.
>  
> After checking the TaskManager's log, the TM had already lost its connection 
> with the ResourceManager and kept trying to re-register with it. The 
> RegistrationTimeout cannot take effect because the TaskManager's main thread 
> is hung.
>  
> I think there are two options to handle it.
> Option 1: Add a timeout to 
> TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but I am 
> afraid some other methods could block the main thread too.
> Option 2: Move the registrationTimeout to another thread; we would need to 
> deal with the concurrency problem.
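Option 1 could be sketched as follows (a hedged illustration with a 
hypothetical helper, not Flink's actual code): the cleanup runs on its own 
executor and the caller bounds how long it waits, so a hung filesystem call 
cannot occupy the main thread indefinitely.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedCleanup {
    // Run the cleanup on a separate thread and wait at most timeoutMillis.
    // Returns true if the cleanup finished in time, false otherwise.
    static boolean cleanupWithTimeout(Runnable cleanup, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(cleanup);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // give up on the hung cleanup; log and move on
            return false;
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast cleanup succeeds; a hung one (simulated by a long sleep)
        // times out without blocking the calling thread for 60 seconds.
        System.out.println(cleanupWithTimeout(() -> {}, 1000));
        System.out.println(cleanupWithTimeout(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        }, 100));
    }
}
```

As the comment in the thread notes, this only bounds one call site; other 
blocking methods on the main thread would need the same treatment.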



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-23346) RocksDBStateBackend may core dump in flink_compactionfilterjni.cc

2021-11-08 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440403#comment-17440403
 ] 

Congxian Qiu commented on FLINK-23346:
--

[~yunta] What do you think about this? I will target this for 1.5.0 if there 
are no objections, and I can help contribute the fix described in the 
description.

> RocksDBStateBackend may core dump in flink_compactionfilterjni.cc
> -
>
> Key: FLINK-23346
> URL: https://issues.apache.org/jira/browse/FLINK-23346
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Affects Versions: 1.14.0, 1.13.1, 1.12.4
>Reporter: Congxian Qiu
>Priority: Major
>
> The code in [flink_compactionfilterjni.cc 
> |https://github.com/ververica/frocksdb/blob/49bc897d5d768026f1eb816d960c1f2383396ef4/java/rocksjni/flink_compactionfilterjni.cc#L21]
> {code:cpp}
> inline void CheckAndRethrowException(JNIEnv* env) const {
>   if (env->ExceptionCheck()) {
>     env->ExceptionDescribe();
>     env->Throw(env->ExceptionOccurred());
>   }
> }
> {code}
> may core dump in some scenarios; please see more information here [1][2][3].
> We can fix it by changing this to
> {code:cpp}
> inline void CheckAndRethrowException(JNIEnv* env) const {
> if (env->ExceptionCheck()) {
>   env->Throw(env->ExceptionOccurred());
> }
>   }
> {code}
> or
> {code:cpp}
>inline void CheckAndRethrowException(JNIEnv* env) const {
> if (env->ExceptionCheck()) {
>   jobject obj = env->ExceptionOccurred();
>   env->ExceptionDescribe();
>   env->Throw(obj);
> }
>   }
> {code}
> [1] 
> [https://stackoverflow.com/questions/30971068/does-jniexceptiondescribe-implicitily-clear-the-exception-trace-of-the-jni-env]
>  [2] [https://bugs.openjdk.java.net/browse/JDK-4067541]
>  [3] [https://bugs.openjdk.java.net/browse/JDK-8051947]





[jira] [Commented] (FLINK-23346) RocksDBStateBackend may core dump in flink_compactionfilterjni.cc

2021-10-24 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433434#comment-17433434
 ] 

Congxian Qiu commented on FLINK-23346:
--

[~yunta] Sorry for the late reply. We haven't observed the user exception 
since last time. But I think we should fix the problem here in any case, 
because it could cause the core dump whatever the user exception is, and 
there will be many leaked local files on the TM side when the core dump 
happens.

> RocksDBStateBackend may core dump in flink_compactionfilterjni.cc
> -
>
> Key: FLINK-23346
> URL: https://issues.apache.org/jira/browse/FLINK-23346
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Affects Versions: 1.14.0, 1.13.1, 1.12.4
>Reporter: Congxian Qiu
>Priority: Major
>
> The code in [flink_compactionfilterjni.cc 
> |https://github.com/ververica/frocksdb/blob/49bc897d5d768026f1eb816d960c1f2383396ef4/java/rocksjni/flink_compactionfilterjni.cc#L21]
> {code:cpp}
> inline void CheckAndRethrowException(JNIEnv* env) const {
>   if (env->ExceptionCheck()) {
>     env->ExceptionDescribe();
>     env->Throw(env->ExceptionOccurred());
>   }
> }
> {code}
> may core dump in some scenarios; please see more information here [1][2][3].
> We can fix it by changing this to
> {code:cpp}
> inline void CheckAndRethrowException(JNIEnv* env) const {
> if (env->ExceptionCheck()) {
>   env->Throw(env->ExceptionOccurred());
> }
>   }
> {code}
> or
> {code:cpp}
>inline void CheckAndRethrowException(JNIEnv* env) const {
> if (env->ExceptionCheck()) {
>   jobject obj = env->ExceptionOccurred();
>   env->ExceptionDescribe();
>   env->Throw(obj);
> }
>   }
> {code}
> [1] 
> [https://stackoverflow.com/questions/30971068/does-jniexceptiondescribe-implicitily-clear-the-exception-trace-of-the-jni-env]
>  [2] [https://bugs.openjdk.java.net/browse/JDK-4067541]
>  [3] [https://bugs.openjdk.java.net/browse/JDK-8051947]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-24597) RocksdbStateBackend getKeysAndNamespaces would return duplicate data when using MapState

2021-10-24 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17433432#comment-17433432
 ] 

Congxian Qiu commented on FLINK-24597:
--

[~mayuehappy] Thanks for reporting this issue; I think it is valid. 
[~sjwiesman] Could you please help double-check this? Thanks.

> RocksdbStateBackend getKeysAndNamespaces would return duplicate data when 
> using MapState 
> -
>
> Key: FLINK-24597
> URL: https://issues.apache.org/jira/browse/FLINK-24597
> Project: Flink
>  Issue Type: Bug
>  Components: API / State Processor, Runtime / State Backends
>Affects Versions: 1.14.0, 1.12.4, 1.13.3
>Reporter: Yue Ma
>Priority: Major
>  Labels: pull-request-available
>
> For example, in RocksdbStateBackend, suppose we work in VoidNamespace and 
> use a ValueState as below.
> {code:java}
> // insert record
> for (int i = 0; i < 3; ++i) {
> keyedStateBackend.setCurrentKey(i);
> testValueState.update(String.valueOf(i));
> }
> {code}
> Then we get all the keys and namespaces via 
> RocksDBKeyedStateBackend#getKeysAndNamespaces(). The result of the traversal 
> is <1,VoidNamespace>,<2,VoidNamespace>,<3,VoidNamespace>, which is as 
> expected.
> However, if we use MapState and update it with different user keys, 
> getKeysAndNamespaces would return duplicate entries with the same key and 
> namespace.
> {code:java}
> // insert record
> for (int i = 0; i < 3; ++i) {
> keyedStateBackend.setCurrentKey(i);
> mapState.put("userKeyA_" + i, "userValue");
> mapState.put("userKeyB_" + i, "userValue");
> }
> {code}
> The result of the traversal is
>  
> <1,VoidNamespace>,<1,VoidNamespace>,<2,VoidNamespace>,<2,VoidNamespace>,<3,VoidNamespace>,<3,VoidNamespace>.
> By reading the code, I found that the main reason for this problem is in the 
> implementation of _RocksStateKeysAndNamespaceIterator_.
> In its _hasNext_ method, when a new keyAndNamespace is created, there is no 
> comparison with the previous keyAndNamespace. So referring to 
> RocksStateKeysIterator and implementing the same logic should solve this 
> problem.
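The adjacent-duplicate fix suggested in the description can be sketched as 
follows (class and method names here are illustrative, not the actual Flink 
iterator API). Because RocksDB iterates entries in key order, the duplicate 
(key, namespace) pairs produced by a MapState's multiple user keys are 
adjacent, so remembering the previous pair and skipping equal ones suffices:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class DedupIteratorSketch {
    // Collapse adjacent duplicates from an ordered stream of
    // "key|namespace" strings, mirroring the proposed hasNext() check.
    static List<String> dedupAdjacent(Iterator<String> keysAndNamespaces) {
        List<String> result = new ArrayList<>();
        String previous = null;
        while (keysAndNamespaces.hasNext()) {
            String current = keysAndNamespaces.next();
            if (!current.equals(previous)) { // a new pair: emit it
                result.add(current);
            }
            previous = current;
        }
        return result;
    }

    public static void main(String[] args) {
        // Two user keys per primary key yield adjacent duplicates,
        // as in the reported traversal.
        List<String> raw = Arrays.asList(
            "1|Void", "1|Void", "2|Void", "2|Void", "3|Void", "3|Void");
        System.out.println(dedupAdjacent(raw.iterator()));
    }
}
```

Note this relies on duplicates being adjacent; it is not a general 
deduplication, which is exactly why the ordered RocksDB iteration makes the 
cheap previous-element comparison sufficient.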





[jira] [Commented] (FLINK-23139) State ownership: track and discard private state (registry+changelog)

2021-10-02 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423520#comment-17423520
 ] 

Congxian Qiu commented on FLINK-23139:
--

Hi [~roman], it seems the doc is not public; is there any public 
documentation for this?

> State ownership: track and discard private state (registry+changelog)
> -
>
> Key: FLINK-23139
> URL: https://issues.apache.org/jira/browse/FLINK-23139
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Reporter: Roman Khachatryan
>Assignee: Roman Khachatryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.15.0
>
>
> The TM should own changelog backend state to prevent re-uploading state on 
> checkpoint abortion (or missing confirmation). A simpler solution in which 
> it owns only aborted state is less maintainable in the long run.
> For that, state on the TM should be tracked and discarded (on 
> subsumption+materialization; on shutdown). 
> See [state ownership design 
> doc|https://docs.google.com/document/d/1NJJQ30P27BmUvD7oa4FChvkYxMEgjRPTVdO1dHLl_9I/edit?usp=sharing],
>  in particular [Tracking private 
> state|https://docs.google.com/document/d/1NJJQ30P27BmUvD7oa4FChvkYxMEgjRPTVdO1dHLl_9I/edit#heading=h.9dxopqajsy7].
>  
> This ticket is about creating TaskStateRegistry and using it in 
> ChangelogStateBackend (for non-materialized part only; for materialized see 
> FLINK-23344).
>   
> Externalized checkpoints and savepoints should be supported (or please create 
> a separate ticket).
>  
> Retained checkpoints is a separate ticket: FLINK-23251





[jira] [Commented] (FLINK-24149) Make checkpoint self-contained and relocatable

2021-09-14 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414992#comment-17414992
 ] 

Congxian Qiu commented on FLINK-24149:
--

Sorry to jump in. I prefer supporting incremental checkpoints when restoring 
from a previous checkpoint, because 1) this is the behavior we have today, 
and 2) copying checkpoints to another HDFS cluster is less common than always 
running on the same one. We may need a solution that supports incremental 
checkpoints after restoring from a previous checkpoint if we want to make the 
paths relative. Or maybe we can wait for the FLIP [~pnowojski] mentioned to 
be published.

> Make checkpoint self-contained and relocatable
> --
>
> Key: FLINK-24149
> URL: https://issues.apache.org/jira/browse/FLINK-24149
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Feifan Wang
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-09-08-17-06-31-560.png, 
> image-2021-09-08-17-10-28-240.png, image-2021-09-08-17-55-46-898.png, 
> image-2021-09-08-18-01-03-176.png, image-2021-09-14-14-22-31-537.png
>
>
> h1. Background
> We have many jobs with large state sizes in our production environment. 
> Based on the operational practice of these jobs and the analysis of some 
> specific problems, we believe that RocksDBStateBackend's incremental 
> checkpoint has many advantages over savepoints:
>  # Savepoints take much longer than incremental checkpoints in jobs with 
> large state. The figure below shows a job in our production environment; it 
> takes nearly 7 minutes to complete a savepoint, while a checkpoint only takes 
> a few seconds. (A checkpoint taking longer right after a savepoint is a 
> problem described in -FLINK-23949-.)
>  !image-2021-09-08-17-55-46-898.png|width=723,height=161!
>  # Savepoints cause excessive CPU usage. The figure below shows the CPU usage 
> of the same job as in the figure above:
>  !image-2021-09-08-18-01-03-176.png|width=516,height=148!
>  # Savepoints may cause excessive native memory usage and eventually cause the 
> TaskManager process memory usage to exceed the limit. (We did not further 
> investigate the cause and did not try to reproduce the problem on other large 
> state jobs, but only increased the overhead memory, so this reason may not be 
> conclusive.)
> For the above reasons, we tend to use retained incremental checkpoints to 
> completely replace savepoints for jobs with large state sizes.
> h1. Problems
>  * *Problem 1: retained incremental checkpoints are difficult to clean up once 
> they have been used for recovery*
> This problem is caused by the fact that jobs recovered from a retained 
> incremental checkpoint may reference files in that checkpoint's shared 
> directory in subsequent checkpoints, even though they are not the same job 
> instance. In the worst case, retained checkpoints are referenced one by one, 
> forming a very long reference chain. This makes it difficult for users to 
> manage retained checkpoints. In fact, we have also suffered failures caused by 
> incorrect deletion of retained checkpoints.
> Although we could use the file handles in the checkpoint metadata to figure out 
> which files can be deleted, I think it is inappropriate to ask users to do 
> this.
>  * *Problem 2: checkpoints are not relocatable*
> Even if we can figure out all files referenced by a checkpoint, moving these 
> files will invalidate the checkpoint as well, because the metadata file 
> references absolute file paths.
> Since savepoints are already self-contained and relocatable (FLINK-5763), why 
> don't we just use savepoints to migrate jobs to another place? In addition 
> to the savepoint performance problem in the background description, a very 
> important reason is that the migration requirement may come from the failure 
> of the original cluster. In this case, there is no opportunity to trigger a 
> savepoint.
> h1. Proposal
>  * *a job's checkpoint directory (user-defined-checkpoint-dir/) contains 
> all its state files (self-contained)*
>  As far as I know, in the current state, only the subsequent checkpoints of 
> jobs restored from a retained checkpoint violate this constraint. One 
> possible solution is to re-upload all shared files at the first incremental 
> checkpoint after the job starts, but we need to discuss how to distinguish 
> between a new job instance and a restart.
>  * *use relative file path in checkpoint metadata (relocatable)*
> Change all file references in checkpoint metadata to the relative path 
> relative to the _metadata file, so we can copy 
> user-defined-checkpoint-dir/ to any other place.
>  
> BTW, this issue is very similar to FLINK-5763, which can be read as background 
> material.



--
This message was sent by Atlassian Jira
(v

[jira] [Commented] (FLINK-23346) RocksDBStateBackend may core dump in flink_compactionfilterjni.cc

2021-07-12 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379013#comment-17379013
 ] 

Congxian Qiu commented on FLINK-23346:
--

[~yunta] Yes, env->ExceptionCheck() is true; it's a serializer exception from 
RocksDbTtlCompactionFilter. I'm still debugging why the exception can happen.

> RocksDBStateBackend may core dump in flink_compactionfilterjni.cc
> -
>
> Key: FLINK-23346
> URL: https://issues.apache.org/jira/browse/FLINK-23346
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Affects Versions: 1.14.0, 1.13.1, 1.12.4
>Reporter: Congxian Qiu
>Priority: Major
>
> The code in [flink_compactionfilterjni.cc 
> |https://github.com/ververica/frocksdb/blob/49bc897d5d768026f1eb816d960c1f2383396ef4/java/rocksjni/flink_compactionfilterjni.cc#L21]
> {code:cpp}
> inline void CheckAndRethrowException(JNIEnv* env) const {
> if (env->ExceptionCheck()) {
>   env->ExceptionDescribe();
>   env->Throw(env->ExceptionOccurred());
> }
>   }
> {code}
> may core dump in some scenarios; please see more information in [1][2][3].
> We can fix it by changing this to
> {code:cpp}
> inline void CheckAndRethrowException(JNIEnv* env) const {
> if (env->ExceptionCheck()) {
>   env->Throw(env->ExceptionOccurred());
> }
>   }
> {code}
> or
> {code:cpp}
>inline void CheckAndRethrowException(JNIEnv* env) const {
> if (env->ExceptionCheck()) {
>   jobject obj = env->ExceptionOccurred();
>   env->ExceptionDescribe();
>   env->Throw(obj);
> }
>   }
> {code}
> [1] 
> [https://stackoverflow.com/questions/30971068/does-jniexceptiondescribe-implicitily-clear-the-exception-trace-of-the-jni-env]
>  [2] [https://bugs.openjdk.java.net/browse/JDK-4067541]
>  [3] [https://bugs.openjdk.java.net/browse/JDK-8051947]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-23346) RocksDBStateBackend may core dump in flink_compactionfilterjni.cc

2021-07-11 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378667#comment-17378667
 ] 

Congxian Qiu commented on FLINK-23346:
--

cc  [~yunta] [~liyu]

> RocksDBStateBackend may core dump in flink_compactionfilterjni.cc
> -
>
> Key: FLINK-23346
> URL: https://issues.apache.org/jira/browse/FLINK-23346
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Affects Versions: 1.14.0, 1.13.1, 1.12.4
>Reporter: Congxian Qiu
>Priority: Major
>
> The code in [flink_compactionfilterjni.cc 
> |https://github.com/ververica/frocksdb/blob/49bc897d5d768026f1eb816d960c1f2383396ef4/java/rocksjni/flink_compactionfilterjni.cc#L21]
> {code:cpp}
> inline void CheckAndRethrowException(JNIEnv* env) const {
> if (env->ExceptionCheck()) {
>   env->ExceptionDescribe();
>   env->Throw(env->ExceptionOccurred());
> }
>   }
> {code}
> may core dump in some scenarios; please see more information in [1][2][3].
> We can fix it by changing this to
> {code:cpp}
> inline void CheckAndRethrowException(JNIEnv* env) const {
> if (env->ExceptionCheck()) {
>   env->Throw(env->ExceptionOccurred());
> }
>   }
> {code}
> or
> {code:cpp}
>inline void CheckAndRethrowException(JNIEnv* env) const {
> if (env->ExceptionCheck()) {
>   jobject obj = env->ExceptionOccurred();
>   env->ExceptionDescribe();
>   env->Throw(obj);
> }
>   }
> {code}
> [1] 
> [https://stackoverflow.com/questions/30971068/does-jniexceptiondescribe-implicitily-clear-the-exception-trace-of-the-jni-env]
>  [2] [https://bugs.openjdk.java.net/browse/JDK-4067541]
>  [3] [https://bugs.openjdk.java.net/browse/JDK-8051947]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23346) RocksDBStateBackend may core dump in flink_compactionfilterjni.cc

2021-07-11 Thread Congxian Qiu (Jira)
Congxian Qiu created FLINK-23346:


 Summary: RocksDBStateBackend may core dump in 
flink_compactionfilterjni.cc
 Key: FLINK-23346
 URL: https://issues.apache.org/jira/browse/FLINK-23346
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.12.4, 1.13.1, 1.14.0
Reporter: Congxian Qiu


The code in [flink_compactionfilterjni.cc 
|https://github.com/ververica/frocksdb/blob/49bc897d5d768026f1eb816d960c1f2383396ef4/java/rocksjni/flink_compactionfilterjni.cc#L21]
{code:cpp}
inline void CheckAndRethrowException(JNIEnv* env) const {
if (env->ExceptionCheck()) {
  env->ExceptionDescribe();
  env->Throw(env->ExceptionOccurred());
}
  }
{code}
may core dump in some scenarios; please see more information in [1][2][3].

We can fix it by changing this to
{code:cpp}
inline void CheckAndRethrowException(JNIEnv* env) const {
if (env->ExceptionCheck()) {
  env->Throw(env->ExceptionOccurred());
}
  }
{code}
or
{code:cpp}
   inline void CheckAndRethrowException(JNIEnv* env) const {
if (env->ExceptionCheck()) {
  jobject obj = env->ExceptionOccurred();
  env->ExceptionDescribe();
  env->Throw(obj);
}
  }
{code}
[1] 
[https://stackoverflow.com/questions/30971068/does-jniexceptiondescribe-implicitily-clear-the-exception-trace-of-the-jni-env]
 [2] [https://bugs.openjdk.java.net/browse/JDK-4067541]
 [3] [https://bugs.openjdk.java.net/browse/JDK-8051947]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-13679) Translate "Code Style - Pull Requests & Changes" page into Chinese

2021-05-15 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu closed FLINK-13679.

Resolution: Fixed

merged with commit 23ad02dcc845ca2d87fa893572d77b5ec58104a8

> Translate "Code Style - Pull Requests & Changes" page into Chinese
> --
>
> Key: FLINK-13679
> URL: https://issues.apache.org/jira/browse/FLINK-13679
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Project Website
>Reporter: Jark Wu
>Assignee: LakeShen
>Priority: Major
>  Labels: pull-request-available, stale-assigned
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Translate page 
> https://flink.apache.org/zh/contributing/code-style-and-quality-pull-requests.html
>  into Chinese. The page is located in 
> https://github.com/apache/flink-web/blob/asf-site/contributing/code-style-and-quality-pull-requests.zh.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16077) Translate "Custom State Serialization" page into Chinese

2021-04-21 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326450#comment-17326450
 ] 

Congxian Qiu commented on FLINK-16077:
--

The PR is ready for review now.

> Translate "Custom State Serialization" page into Chinese
> 
>
> Key: FLINK-16077
> URL: https://issues.apache.org/jira/browse/FLINK-16077
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Affects Versions: 1.11.0
>Reporter: Yu Li
>Assignee: Congxian Qiu
>Priority: Major
>  Labels: pull-request-available, stale-assigned
> Fix For: 1.13.0
>
>
> Complete the translation in `docs/dev/stream/state/custom_serialization.zh.md`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-5763) Make savepoints self-contained and relocatable

2021-03-24 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307790#comment-17307790
 ] 

Congxian Qiu edited comment on FLINK-5763 at 3/24/21, 12:28 PM:


[~trohrmann] Sorry for the late reply; I missed the email notification.

If I remember correctly, this feature did not touch entropy injection (and is 
not supported when entropy injection is enabled either). -I'll try to find out 
with the reporter why this happens. Thanks.-

It seems the user has figured out the reason, and the issue has been resolved.


was (Author: klion26):
[~trohrmann] sorry for the late reply, missed the email notification.

If I remember correctly, this feature did not touch entropy injection(did not 
support when entropy injection enabled neither). I'll try to find out the 
reason with the reporter why this should happen. thanks

> Make savepoints self-contained and relocatable
> --
>
> Key: FLINK-5763
> URL: https://issues.apache.org/jira/browse/FLINK-5763
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Ufuk Celebi
>Assignee: Congxian Qiu
>Priority: Critical
>  Labels: pull-request-available, usability
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After a user has triggered a savepoint, a single savepoint file will be 
> returned as a handle to the savepoint. A savepoint to {{}} creates a 
> savepoint file like {{/savepoint-}}.
> This file contains the metadata of the corresponding checkpoint, but not the 
> actual program state. While this works well for short term management 
> (pause-and-resume a job), it makes it hard to manage savepoints over longer 
> periods of time.
> h4. Problems
> h5. Scattered Checkpoint Files
> For file system based checkpoints (FsStateBackend, RocksDBStateBackend) this 
> results in the savepoint referencing files from the checkpoint directory 
> (usually different than ). For users, it is virtually impossible to 
> tell which checkpoint files belong to a savepoint and which are lingering 
> around. This can easily lead to accidentally invalidating a savepoint by 
> deleting checkpoint files.
> h5. Savepoints Not Relocatable
> Even if a user is able to figure out which checkpoint files belong to a 
> savepoint, moving these files will invalidate the savepoint as well, because 
> the metadata file references absolute file paths.
> h5. Forced to Use CLI for Disposal
> Because of the scattered files, the user is in practice forced to use Flink’s 
> CLI to dispose a savepoint. This should be possible to handle in the scope of 
> the user’s environment via a file system delete operation.
> h4. Proposal
> In order to solve the described problems, savepoints should contain all their 
> state, both metadata and program state, inside a single directory. 
> Furthermore the metadata must only hold relative references to the checkpoint 
> files. This makes it obvious which files make up the state of a savepoint and 
> it is possible to move savepoints around by moving the savepoint directory.
> h5. Desired File Layout
> Triggering a savepoint to {{}} creates a directory as follows:
> {code}
> /savepoint--
>   +-- _metadata
>   +-- data- [1 or more]
> {code}
> We include the JobID in the savepoint directory name in order to give some 
> hints about which job a savepoint belongs to.
> h5. CLI
> - Trigger: When triggering a savepoint to {{}} the savepoint 
> directory will be returned as the handle to the savepoint.
> - Restore: Users can restore by pointing to the directory or the _metadata 
> file. The data files should be required to be in the same directory as the 
> _metadata file.
> - Dispose: The disposal command should be deprecated and eventually removed. 
> While deprecated, disposal can happen by specifying the directory or the 
> _metadata file (same as restore).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-5763) Make savepoints self-contained and relocatable

2021-03-24 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307790#comment-17307790
 ] 

Congxian Qiu commented on FLINK-5763:
-

[~trohrmann] Sorry for the late reply; I missed the email notification.

If I remember correctly, this feature did not touch entropy injection (and is 
not supported when entropy injection is enabled either). I'll try to find out 
with the reporter why this happens. Thanks.

> Make savepoints self-contained and relocatable
> --
>
> Key: FLINK-5763
> URL: https://issues.apache.org/jira/browse/FLINK-5763
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Ufuk Celebi
>Assignee: Congxian Qiu
>Priority: Critical
>  Labels: pull-request-available, usability
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After a user has triggered a savepoint, a single savepoint file will be 
> returned as a handle to the savepoint. A savepoint to {{}} creates a 
> savepoint file like {{/savepoint-}}.
> This file contains the metadata of the corresponding checkpoint, but not the 
> actual program state. While this works well for short term management 
> (pause-and-resume a job), it makes it hard to manage savepoints over longer 
> periods of time.
> h4. Problems
> h5. Scattered Checkpoint Files
> For file system based checkpoints (FsStateBackend, RocksDBStateBackend) this 
> results in the savepoint referencing files from the checkpoint directory 
> (usually different than ). For users, it is virtually impossible to 
> tell which checkpoint files belong to a savepoint and which are lingering 
> around. This can easily lead to accidentally invalidating a savepoint by 
> deleting checkpoint files.
> h5. Savepoints Not Relocatable
> Even if a user is able to figure out which checkpoint files belong to a 
> savepoint, moving these files will invalidate the savepoint as well, because 
> the metadata file references absolute file paths.
> h5. Forced to Use CLI for Disposal
> Because of the scattered files, the user is in practice forced to use Flink’s 
> CLI to dispose a savepoint. This should be possible to handle in the scope of 
> the user’s environment via a file system delete operation.
> h4. Proposal
> In order to solve the described problems, savepoints should contain all their 
> state, both metadata and program state, inside a single directory. 
> Furthermore the metadata must only hold relative references to the checkpoint 
> files. This makes it obvious which files make up the state of a savepoint and 
> it is possible to move savepoints around by moving the savepoint directory.
> h5. Desired File Layout
> Triggering a savepoint to {{}} creates a directory as follows:
> {code}
> /savepoint--
>   +-- _metadata
>   +-- data- [1 or more]
> {code}
> We include the JobID in the savepoint directory name in order to give some 
> hints about which job a savepoint belongs to.
> h5. CLI
> - Trigger: When triggering a savepoint to {{}} the savepoint 
> directory will be returned as the handle to the savepoint.
> - Restore: Users can restore by pointing to the directory or the _metadata 
> file. The data files should be required to be in the same directory as the 
> _metadata file.
> - Dispose: The disposal command should be deprecated and eventually removed. 
> While deprecated, disposal can happen by specifying the directory or the 
> _metadata file (same as restore).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-20976) Unify Binary format for Keyed State savepoints

2021-01-28 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274088#comment-17274088
 ] 

Congxian Qiu commented on FLINK-20976:
--

[~dwysakowicz] Thanks very much for pushing this forward. After this issue is 
resolved, users will be able to switch backends as they want. Please let me 
know if there is any work I can help with here, thanks.

> Unify Binary format for Keyed State savepoints
> --
>
> Key: FLINK-20976
> URL: https://issues.apache.org/jira/browse/FLINK-20976
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / State Backends
>Reporter: Dawid Wysakowicz
>Priority: Major
> Fix For: 1.13.0
>
>
> The main goal of this proposal is the following:
> * Unify across all state backends a savepoint format for keyed state that is 
> more future-proof and applicable for potential new state backends. Checkpoint 
> formats, by definition, are still allowed to be backend specific.
> * Make it possible to switch a state backend via a savepoint
> * Rework abstractions related to snapshots and restoring, to reduce the 
> overhead and code duplication when attempting to implement a new state 
> backend. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-19381) Fix docs about relocatable savepoints

2021-01-25 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu resolved FLINK-19381.
--
Resolution: Fixed

Merged into master with a9d2b766b2a04b5dc6532381c1f0c60bf56c4e74

> Fix docs about relocatable savepoints
> -
>
> Key: FLINK-19381
> URL: https://issues.apache.org/jira/browse/FLINK-19381
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation, Runtime / Checkpointing
>Affects Versions: 1.12.0, 1.11.2
>Reporter: Nico Kruber
>Assignee: Congxian Qiu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.13.0
>
>
> Although savepoints are relocatable since Flink 1.11, the docs still state 
> otherwise, for example in 
> [https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#triggering-savepoints]
> The warning there, as well as the other changes from FLINK-15863, should be 
> removed again and potentially replaced with new constraints.
> One known constraint is that if task-owned state is used 
> ({{GenericWriteAhreadLog}} sink), savepoints are currently not relocatable 
> yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-18263) Allow external checkpoints to be persisted even when the job is in "Finished" state.

2021-01-17 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266725#comment-17266725
 ] 

Congxian Qiu commented on FLINK-18263:
--

There seems to be a related [mailing list 
thread|http://apache-flink.147419.n8.nabble.com/Flink-checkpoint-td10186.html] 
for this issue.

> Allow external checkpoints to be persisted even when the job is in "Finished" 
> state.
> 
>
> Key: FLINK-18263
> URL: https://issues.apache.org/jira/browse/FLINK-18263
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Reporter: Mark Cho
>Priority: Major
>  Labels: pull-request-available
>
> Currently, `execution.checkpointing.externalized-checkpoint-retention` 
> configuration supports two options:
> - `DELETE_ON_CANCELLATION` which keeps the externalized checkpoints in FAILED 
> and SUSPENDED state.
> - `RETAIN_ON_CANCELLATION` which keeps the externalized checkpoints in 
> FAILED, SUSPENDED, and CANCELED state.
> This gives us control over the retention of externalized checkpoints in all 
> terminal states of a job, except for the FINISHED state.
> If the job ends up in "FINISHED" state, externalized checkpoints will be 
> automatically cleaned up and there currently is no config that will ensure 
> that these externalized checkpoints to be persisted.
> I found an old Jira ticket FLINK-4512 where this was discussed. I think it 
> would be helpful to have a config that can control the retention policy for 
> FINISHED state as well.
> - This can be useful for cases where we want to rewind a job (that reached 
> the FINISHED state) to a previous checkpoint.
> - When we use externalized checkpoints, we want to fully delegate the 
> checkpoint clean-up to an external process in all job states (without 
> cherrypicking FINISHED state to be cleaned up by Flink).
> We have a quick fix working in our fork where we've changed 
> `ExternalizedCheckpointCleanup` enum:
> {code:java}
> RETAIN_ON_FAILURE (renamed from DELETE_ON_CANCELLATION; retains on FAILED)
> RETAIN_ON_CANCELLATION (kept the same; retains on FAILED, CANCELED)
> RETAIN_ON_SUCCESS (added; retains on FAILED, CANCELED, FINISHED)
> {code}
> Since this change requires changes to multiple components (e.g. config 
> values, REST API, Web UI, etc), I wanted to get the community's thoughts 
> before I invest more time in my quick fix PR (which currently only contains 
> minimal change to get this working).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-13678) Translate "Code Style - Preamble" page into Chinese

2020-12-14 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248791#comment-17248791
 ] 

Congxian Qiu commented on FLINK-13678:
--

[~Lonie] Assigned to you; please feel free to file a PR.

> Translate "Code Style - Preamble" page into Chinese
> ---
>
> Key: FLINK-13678
> URL: https://issues.apache.org/jira/browse/FLINK-13678
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Project Website
>Reporter: Jark Wu
>Assignee: huzeming
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Translate page 
> https://flink.apache.org/zh/contributing/code-style-and-quality-preamble.html 
> into Chinese. The page is located in  
> https://github.com/apache/flink-web/blob/asf-site/contributing/code-style-and-quality-preamble.zh.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-13678) Translate "Code Style - Preamble" page into Chinese

2020-12-14 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-13678:


Assignee: huzeming  (was: WangHengWei)

> Translate "Code Style - Preamble" page into Chinese
> ---
>
> Key: FLINK-13678
> URL: https://issues.apache.org/jira/browse/FLINK-13678
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Project Website
>Reporter: Jark Wu
>Assignee: huzeming
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Translate page 
> https://flink.apache.org/zh/contributing/code-style-and-quality-preamble.html 
> into Chinese. The page is located in  
> https://github.com/apache/flink-web/blob/asf-site/contributing/code-style-and-quality-preamble.zh.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-20376) Error in restoring checkpoint/savepoint when Flink is upgraded from 1.9 to 1.11.2

2020-12-07 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245679#comment-17245679
 ] 

Congxian Qiu commented on FLINK-20376:
--

[~Partha Mishra] Thanks for the feedback. Could you please share the topology 
here (or a minimal reproducible program)? As [~AHeise] said, it seems the 
different Flink versions generated different job graphs here.

> Error in restoring checkpoint/savepoint when Flink is upgraded from 1.9 to 
> 1.11.2
> -
>
> Key: FLINK-20376
> URL: https://issues.apache.org/jira/browse/FLINK-20376
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Reporter: Partha Pradeep Mishra
>Priority: Major
>
> We tried to save checkpoints for one of the flink job (1.9 version) and then 
> import/restore the checkpoints in the newer flink version (1.11.2). The 
> import/resume operation failed with the below error. Please note that both 
> the jobs(i.e. one running in 1.9 and other in 1.11.2) are same binary with no 
> code difference or introduction of new operators. Still we got the below 
> issue.
> _Cannot map checkpoint/savepoint state for operator 
> fbb4ef531e002f8fb3a2052db255adf5 to the new program, because the operator is 
> not available in the new program._
> *Complete Stack Trace :*
> {"errors":["org.apache.flink.runtime.rest.handler.RestHandlerException: Could 
> not execute application.\n\tat 
> org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$1(JarRunHandler.java:103)\n\tat
>  
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)\n\tat
>  
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)\n\tat
>  
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)\n\tat
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1609)\n\tat
>  
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\n\tat
>  
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\n\tat
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
>  java.lang.Thread.run(Thread.java:748)\nCaused by: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.util.FlinkRuntimeException: Could not execute 
> application.\n\tat 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)\n\tat
>  
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)\n\tat
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)\n\t...
>  7 more\nCaused by: org.apache.flink.util.FlinkRuntimeException: Could not 
> execute application.\n\tat 
> org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)\n\tat
>  
> org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)\n\tat
>  
> org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:100)\n\tat
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\n\t...
>  7 more\nCaused by: 
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: Failed to execute job 
> 'ST1_100Services-preprod-Tumbling-ProcessedBased'.\n\tat 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)\n\tat
>  
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)\n\tat
>  
> org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)\n\tat
>  
> org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:78)\n\t...
>  10 more\nCaused by: org.apache.flink.util.FlinkException: Failed to execute 
> job 'ST1_100Services-preprod-Tumbling-ProcessedBased'.\n\tat 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1821)\n\tat
>  
> org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)\n\tat
>  
> org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)\n\tat
>  
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExe

[jira] [Updated] (FLINK-18968) Translate StateFun homepage to Chinese

2020-12-05 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu updated FLINK-18968:
-
Fix Version/s: statefun-2.3.0

> Translate StateFun homepage to Chinese 
> ---
>
> Key: FLINK-18968
> URL: https://issues.apache.org/jira/browse/FLINK-18968
> Project: Flink
>  Issue Type: Task
>  Components: chinese-translation, Stateful Functions
>Reporter: Zixuan Rao
>Assignee: Matt Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: statefun-2.3.0
>
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> This issue is to translate 
> https://flink.apache.org/zh/stateful-functions.html into Chinese. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-18968) Translate StateFun homepage to Chinese

2020-12-05 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu resolved FLINK-18968.
--
Resolution: Fixed

merged into master 

commits:

7fe767b
fcc5ef9
2f65c8a
5ed3562

> Translate StateFun homepage to Chinese 
> ---
>
> Key: FLINK-18968
> URL: https://issues.apache.org/jira/browse/FLINK-18968
> Project: Flink
>  Issue Type: Task
>  Components: chinese-translation, Stateful Functions
>Reporter: Zixuan Rao
>Assignee: Matt Wang
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> This issue is to translate 
> https://flink.apache.org/zh/stateful-functions.html into Chinese. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-18968) Translate StateFun homepage to Chinese

2020-12-05 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-18968:


Assignee: Matt Wang

> Translate StateFun homepage to Chinese 
> ---
>
> Key: FLINK-18968
> URL: https://issues.apache.org/jira/browse/FLINK-18968
> Project: Flink
>  Issue Type: Task
>  Components: chinese-translation, Stateful Functions
>Reporter: Zixuan Rao
>Assignee: Matt Wang
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 5h
>  Remaining Estimate: 5h
>
> This issue is to translate 
> https://flink.apache.org/zh/stateful-functions.html into Chinese. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-20209) Add missing checkpoint configuration to Flink UI

2020-12-03 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243802#comment-17243802
 ] 

Congxian Qiu edited comment on FLINK-20209 at 12/4/20, 7:53 AM:


FYI FLINK-20441 deprecated the setPreferCheckpointForRecovery and remove it in 
the future release


was (Author: klion26):
FYI FLINK-20441 wants to deprecate the setPreferCheckpointForRecovery and 
remove it in the future release

> Add missing checkpoint configuration to Flink UI
> 
>
> Key: FLINK-20209
> URL: https://issues.apache.org/jira/browse/FLINK-20209
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Affects Versions: 1.11.2
>Reporter: Peidian Li
>Assignee: Peidian Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2020-11-18-16-54-31-638.png
>
>
> Some of the [checkpointing 
> configurations|https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#checkpointing]
>  are not shown in the Flink UI. Can we consider adding these configurations to 
> the checkpoint configuration tab, to make it easier for users to view 
> checkpoint configurations?
> These configurations need to be added:
> {code:java}
> execution.checkpointing.prefer-checkpoint-for-recovery
> execution.checkpointing.tolerable-failed-checkpoints
> execution.checkpointing.unaligned
> {code}
>  
> !image-2020-11-18-16-54-31-638.png|width=915,height=311!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-20209) Add missing checkpoint configuration to Flink UI

2020-12-03 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243802#comment-17243802
 ] 

Congxian Qiu edited comment on FLINK-20209 at 12/4/20, 7:53 AM:


FYI FLINK-20441 deprecated the setPreferCheckpointForRecovery and will remove 
it in the future release


was (Author: klion26):
FYI FLINK-20441 deprecated the setPreferCheckpointForRecovery and remove it in 
the future release

> Add missing checkpoint configuration to Flink UI
> 
>
> Key: FLINK-20209
> URL: https://issues.apache.org/jira/browse/FLINK-20209
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Affects Versions: 1.11.2
>Reporter: Peidian Li
>Assignee: Peidian Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2020-11-18-16-54-31-638.png
>
>
> Some of the [checkpointing 
> configurations|https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#checkpointing]
>  are not shown in the Flink UI. Can we consider adding these configurations to 
> the checkpoint configuration tab, to make it easier for users to view 
> checkpoint configurations?
> These configurations need to be added:
> {code:java}
> execution.checkpointing.prefer-checkpoint-for-recovery
> execution.checkpointing.tolerable-failed-checkpoints
> execution.checkpointing.unaligned
> {code}
>  
> !image-2020-11-18-16-54-31-638.png|width=915,height=311!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-20209) Add missing checkpoint configuration to Flink UI

2020-12-03 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243802#comment-17243802
 ] 

Congxian Qiu commented on FLINK-20209:
--

FYI FLINK-20441 wants to deprecate the setPreferCheckpointForRecovery and 
remove it in the future release

> Add missing checkpoint configuration to Flink UI
> 
>
> Key: FLINK-20209
> URL: https://issues.apache.org/jira/browse/FLINK-20209
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Affects Versions: 1.11.2
>Reporter: Peidian Li
>Assignee: Peidian Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2020-11-18-16-54-31-638.png
>
>
> Some of the [checkpointing 
> configurations|https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#checkpointing]
>  are not shown in the Flink UI. Can we consider adding these configurations to 
> the checkpoint configuration tab, to make it easier for users to view 
> checkpoint configurations?
> These configurations need to be added:
> {code:java}
> execution.checkpointing.prefer-checkpoint-for-recovery
> execution.checkpointing.tolerable-failed-checkpoints
> execution.checkpointing.unaligned
> {code}
>  
> !image-2020-11-18-16-54-31-638.png|width=915,height=311!
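For context, the three keys quoted above are flink-conf.yaml options; a minimal fragment showing how they might be set (the values here are illustrative only, not recommended defaults):

```yaml
# Illustrative values only; check your Flink version's defaults.
execution.checkpointing.prefer-checkpoint-for-recovery: false   # deprecated by FLINK-20441
execution.checkpointing.tolerable-failed-checkpoints: 3
execution.checkpointing.unaligned: false
```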



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-20288) Correct documentation about savepoint self-contained

2020-11-27 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239908#comment-17239908
 ] 

Congxian Qiu commented on FLINK-20288:
--

[~yunta] thanks for creating this issue; might this be a duplicate of 
FLINK-19381?

> Correct documentation about savepoint self-contained
> 
>
> Key: FLINK-20288
> URL: https://issues.apache.org/jira/browse/FLINK-20288
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation, Runtime / Checkpointing
>Affects Versions: 1.11.0
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0, 1.11.4
>
>
> Self-contained savepoints are now supported, but the documentation still 
> says they are not; we should fix that description.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-20384) Broken Link in deployment/ha/kubernetes_ha.zh.md

2020-11-27 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu resolved FLINK-20384.
--
Fix Version/s: 1.12.0
   Resolution: Fixed

merged into master d5b15652fc85fe4b0929e7faf274b46c04b7e924
1.12 2e8d9b9489a1ea33cecdcfc4d84912d5c68c1bf0

> Broken Link in deployment/ha/kubernetes_ha.zh.md
> 
>
> Key: FLINK-20384
> URL: https://issues.apache.org/jira/browse/FLINK-20384
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Documentation
>Affects Versions: 1.12.0, 1.13.0
>Reporter: Huang Xingbo
>Assignee: Huang Xingbo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> When executing the script build_docs.sh, the following exception is thrown:
> {code:java}
> Liquid Exception: Could not find document 
> 'deployment/resource-providers/standalone/kubernetes.md' in tag 'link'. Make 
> sure the document exists and the path is correct. in 
> deployment/ha/kubernetes_ha.zh.md Could not find document 
> 'deployment/resource-providers/standalone/kubernetes.md' in tag 'link'.
> {code}
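The Liquid exception above is essentially a dangling-path check failing: a `{% link ... %}` tag points at a document that no longer exists at that path. A minimal, hypothetical sketch of such a check in Python (the tag syntax is Jekyll's, but the docs layout and function name here are assumptions, not the real build_docs.sh implementation):

```python
import os
import re
import tempfile

# Matches Jekyll/Liquid link tags like {% link path/to/page.md %}.
LINK_TAG = re.compile(r"{%\s*link\s+(\S+)\s*%}")

def find_broken_links(docs_root):
    """Return (source file, link target) pairs whose target does not exist."""
    broken = []
    for dirpath, _, filenames in os.walk(docs_root):
        for name in filenames:
            if not name.endswith(".md"):
                continue
            with open(os.path.join(dirpath, name), encoding="utf-8") as f:
                text = f.read()
            for target in LINK_TAG.findall(text):
                if not os.path.exists(os.path.join(docs_root, target)):
                    broken.append((name, target))
    return broken

# Demo: a doc tree where kubernetes_ha.zh.md links to a page that was moved away.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "deployment", "ha"))
with open(os.path.join(root, "deployment", "ha", "kubernetes_ha.zh.md"), "w") as f:
    f.write("{% link deployment/resource-providers/standalone/kubernetes.md %}")
broken = find_broken_links(root)
```

Running a check like this over the docs tree before building would surface the broken reference in kubernetes_ha.zh.md without waiting for Jekyll to fail.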



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-20384) Broken Link in deployment/ha/kubernetes_ha.zh.md

2020-11-27 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-20384:


Assignee: Huang Xingbo

> Broken Link in deployment/ha/kubernetes_ha.zh.md
> 
>
> Key: FLINK-20384
> URL: https://issues.apache.org/jira/browse/FLINK-20384
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Documentation
>Affects Versions: 1.12.0, 1.13.0
>Reporter: Huang Xingbo
>Assignee: Huang Xingbo
>Priority: Major
>  Labels: pull-request-available
>
> When executing the script build_docs.sh, the following exception is thrown:
> {code:java}
> Liquid Exception: Could not find document 
> 'deployment/resource-providers/standalone/kubernetes.md' in tag 'link'. Make 
> sure the document exists and the path is correct. in 
> deployment/ha/kubernetes_ha.zh.md Could not find document 
> 'deployment/resource-providers/standalone/kubernetes.md' in tag 'link'.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-20376) Error in restoring checkpoint/savepoint when Flink is upgraded from 1.9 to 1.11.2

2020-11-26 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239500#comment-17239500
 ] 

Congxian Qiu edited comment on FLINK-20376 at 11/27/20, 3:40 AM:
-

The exception says that some operator is not available in the new 
program, but the binary run on 1.9 and 1.11 is the same, and the [compatible 
table](https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/upgrading.html#compatibility-table)
 says 1.9 and 1.11 are compatible.
 Could you please 1) check whether a savepoint triggered on 1.9 can be restored 
on 1.10, and 2) share a minimal reproducible program to make it easier to 
figure out what the problem is?


was (Author: klion26):
The exception says that some operator is not available in the new 
program, but the binary run on 1.9 and 1.11 is the same, and the [compatible 
table]([https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/upgrading.html#compatibility-table)]
 says 1.9 and 1.11 are compatible.
 Could you please 1) check whether a savepoint triggered on 1.9 can be restored 
on 1.10, and 2) share a minimal reproducible program to make it easier to 
figure out what the problem is?

> Error in restoring checkpoint/savepoint when Flink is upgraded from 1.9 to 
> 1.11.2
> -
>
> Key: FLINK-20376
> URL: https://issues.apache.org/jira/browse/FLINK-20376
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Reporter: Partha Pradeep Mishra
>Priority: Major
>
> We took checkpoints for one of our Flink jobs (version 1.9) and then tried to 
> restore from them on the newer Flink version (1.11.2). The restore operation 
> failed with the error below. Please note that both jobs (i.e. the one running 
> on 1.9 and the one on 1.11.2) use the same binary, with no code differences 
> and no new operators introduced. Still, we got the issue below.
> _Cannot map checkpoint/savepoint state for operator 
> fbb4ef531e002f8fb3a2052db255adf5 to the new program, because the operator is 
> not available in the new program._
> *Complete Stack Trace :*
> {"errors":["org.apache.flink.runtime.rest.handler.RestHandlerException: Could 
> not execute application.\n\tat 
> org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$1(JarRunHandler.java:103)\n\tat
>  
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)\n\tat
>  
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)\n\tat
>  
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)\n\tat
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1609)\n\tat
>  
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\n\tat
>  
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\n\tat
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
>  java.lang.Thread.run(Thread.java:748)\nCaused by: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.util.FlinkRuntimeException: Could not execute 
> application.\n\tat 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)\n\tat
>  
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)\n\tat
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)\n\t...
>  7 more\nCaused by: org.apache.flink.util.FlinkRuntimeException: Could not 
> execute application.\n\tat 
> org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)\n\tat
>  
> org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)\n\tat
>  
> org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:100)\n\tat
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\n\t...
>  7 more\nCaused by: 
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: Failed to execute job 
> 'ST1_100Services-preprod-Tumbling-ProcessedBased'.\n\tat 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)\n\tat
>  
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)\n\tat
>  
> org.apache

[jira] [Commented] (FLINK-20376) Error in restoring checkpoint/savepoint when Flink is upgraded from 1.9 to 1.11.2

2020-11-26 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239500#comment-17239500
 ] 

Congxian Qiu commented on FLINK-20376:
--

The exception says that some operator is not available in the new 
program, but the binary run on 1.9 and 1.11 is the same, and the [compatible 
table]([https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/upgrading.html#compatibility-table)]
 says 1.9 and 1.11 are compatible.
 Could you please 1) check whether a savepoint triggered on 1.9 can be restored 
on 1.10, and 2) share a minimal reproducible program to make it easier to 
figure out what the problem is?
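For background: the "Cannot map checkpoint/savepoint state for operator ..." error comes from Flink matching saved state to the new job graph by operator ID. Auto-generated IDs are hashes of the topology, so they can change between versions even for an unchanged binary; assigning explicit uid(...) to operators avoids this. A simplified, hypothetical model of the matching rule in Python (not Flink's actual implementation; the operator IDs below are made up):

```python
def match_state_to_program(saved_operator_ids, program_operator_ids,
                           allow_non_restored_state=False):
    """Mimic the savepoint-restore check: every operator ID holding state in
    the snapshot must exist in the new program, unless the user explicitly
    allows non-restored state (Flink's --allowNonRestoredState flag)."""
    program = set(program_operator_ids)
    missing = set(saved_operator_ids) - program
    if missing and not allow_non_restored_state:
        raise ValueError(
            "Cannot map checkpoint/savepoint state for operator(s) "
            f"{sorted(missing)} to the new program")
    # State for operators present in both snapshot and program is restored.
    return {op: "restored" for op in saved_operator_ids if op in program}

# Demo: one snapshot ID has no counterpart in the new job graph.
snapshot_ids = ["fbb4ef531e002f8fb3a2052db255adf5", "op-source"]
new_program_ids = ["op-source", "op-sink"]
try:
    match_state_to_program(snapshot_ids, new_program_ids)
    failed = False
except ValueError:
    failed = True
# Opting in drops the unmatched state instead of failing the restore.
restored = match_state_to_program(snapshot_ids, new_program_ids,
                                  allow_non_restored_state=True)
```

This is why the usual remedies are either setting stable uids on all stateful operators before taking the savepoint, or restoring with --allowNonRestoredState if dropping the unmatched state is acceptable.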

> Error in restoring checkpoint/savepoint when Flink is upgraded from 1.9 to 
> 1.11.2
> -
>
> Key: FLINK-20376
> URL: https://issues.apache.org/jira/browse/FLINK-20376
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Reporter: Partha Pradeep Mishra
>Priority: Major
>
> We took checkpoints for one of our Flink jobs (version 1.9) and then tried to 
> restore from them on the newer Flink version (1.11.2). The restore operation 
> failed with the error below. Please note that both jobs (i.e. the one running 
> on 1.9 and the one on 1.11.2) use the same binary, with no code differences 
> and no new operators introduced. Still, we got the issue below.
> _Cannot map checkpoint/savepoint state for operator 
> fbb4ef531e002f8fb3a2052db255adf5 to the new program, because the operator is 
> not available in the new program._
> *Complete Stack Trace :*
> {"errors":["org.apache.flink.runtime.rest.handler.RestHandlerException: Could 
> not execute application.\n\tat 
> org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$1(JarRunHandler.java:103)\n\tat
>  
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)\n\tat
>  
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)\n\tat
>  
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)\n\tat
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1609)\n\tat
>  
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\n\tat
>  
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\n\tat
>  
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
>  
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat
>  java.lang.Thread.run(Thread.java:748)\nCaused by: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.util.FlinkRuntimeException: Could not execute 
> application.\n\tat 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)\n\tat
>  
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)\n\tat
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)\n\t...
>  7 more\nCaused by: org.apache.flink.util.FlinkRuntimeException: Could not 
> execute application.\n\tat 
> org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)\n\tat
>  
> org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)\n\tat
>  
> org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:100)\n\tat
>  
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\n\t...
>  7 more\nCaused by: 
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: Failed to execute job 
> 'ST1_100Services-preprod-Tumbling-ProcessedBased'.\n\tat 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)\n\tat
>  
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)\n\tat
>  
> org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)\n\tat
>  
> org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:78)\n\t...
>  10 more\nCaused by: org.apache.flink.util.FlinkException: Failed to execute 
> job 'ST1_100Services-preprod-Tumbling-ProcessedBased'.\n\tat 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1821)\n\tat
>  
> org.apache.flink.client.program.StreamContextEnvironment.ex

[jira] [Resolved] (FLINK-17491) Translate Training page on project website

2020-11-20 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu resolved FLINK-17491.
--
Resolution: Fixed

merged into asf-site with commit eaf99f7ff5aa46cccf2d5b7323176985cf55ea95

> Translate Training page on project website
> --
>
> Key: FLINK-17491
> URL: https://issues.apache.org/jira/browse/FLINK-17491
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Project Website
>Reporter: David Anderson
>Assignee: Li Ying
>Priority: Major
>  Labels: pull-request-available
>
> Translate the training page for the project website to Chinese. The file is 
> training.zh.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-11-18 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17234552#comment-17234552
 ] 

Congxian Qiu commented on FLINK-17571:
--

I'm currently working on some urgent things at my company, so I'm unassigning 
the ticket from myself; anyone who is interested can take it over.

If no one is working on this and I find some time, I'll come back to it :)


There is a WIP branch https://github.com/klion26/flink/tree/FLINK-17571

> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Command Line Client, Runtime / Checkpointing
>Reporter: Congxian Qiu
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint. The files in the TASKOWNED and EXCLUSIVE directories can be 
> deleted safely, but users can't safely delete the files in the SHARED 
> directory (the files may have been created a long time ago).
> It would be better to give users a way to know which files are currently used 
> (and therefore which are not).
> A command-line command such as the one below might be enough to support such 
> a feature:
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> the checkpoint}}
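A minimal sketch of what such a `checkpoint list` command could report, assuming the documented directory layout (chk-*/EXCLUSIVE, shared/, taskowned/). A real tool would have to parse each _metadata file to find which shared files are actually referenced; this illustration only groups files by directory kind:

```python
import os
import tempfile

def list_checkpoint_files(checkpoint_dir):
    """Group files under a checkpoint directory by the documented
    subdirectory kinds: chk-* (EXCLUSIVE), shared/ (SHARED), taskowned/."""
    groups = {"exclusive": [], "shared": [], "taskowned": []}
    for dirpath, _, filenames in os.walk(checkpoint_dir):
        rel = os.path.relpath(dirpath, checkpoint_dir)
        top = rel.split(os.sep)[0]
        if top.startswith("chk-"):
            kind = "exclusive"
        elif top == "shared":
            kind = "shared"
        elif top == "taskowned":
            kind = "taskowned"
        else:
            continue  # e.g. the checkpoint_dir root itself
        for name in filenames:
            groups[kind].append(os.path.join(rel, name))
    return groups

# Demo layout mirroring the docs: chk-1/_metadata plus one shared file.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "chk-1"))
os.makedirs(os.path.join(root, "shared"))
open(os.path.join(root, "chk-1", "_metadata"), "w").close()
open(os.path.join(root, "shared", "sstable-0001"), "w").close()
files = list_checkpoint_files(root)
```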



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-17571) A better way to show the files used in currently checkpoints

2020-11-18 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-17571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-17571:


Assignee: (was: Congxian Qiu)

> A better way to show the files used in currently checkpoints
> 
>
> Key: FLINK-17571
> URL: https://issues.apache.org/jira/browse/FLINK-17571
> Project: Flink
>  Issue Type: New Feature
>  Components: Command Line Client, Runtime / Checkpointing
>Reporter: Congxian Qiu
>Priority: Major
>
> Inspired by the 
> [userMail|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Shared-Checkpoint-Cleanup-and-S3-Lifecycle-Policy-tt34965.html]
> Currently, there are [three types of 
> directory|https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/state/checkpoints.html#directory-structure]
>  for a checkpoint. The files in the TASKOWNED and EXCLUSIVE directories can be 
> deleted safely, but users can't safely delete the files in the SHARED 
> directory (the files may have been created a long time ago).
> It would be better to give users a way to know which files are currently used 
> (and therefore which are not).
> A command-line command such as the one below might be enough to support such 
> a feature:
> {{./bin/flink checkpoint list $checkpointDir  # list all the files used in 
> the checkpoint}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-20192) Externalized checkpoint references a checkpoint from a different job

2020-11-17 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17234272#comment-17234272
 ] 

Congxian Qiu commented on FLINK-20192:
--

[~Antti-Kaikkonen] you can create a savepoint and restore from it. The 
savepoint does not need to reference any checkpoint files (the checkpoint files 
can be deleted if you don't need to restore from them), and since 1.11 a 
savepoint can also be relocated (FLINK-5763).
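The safe-deletion rule discussed in this thread (a shared file may only be removed once no retained checkpoint's metadata still references it) reduces to a set difference. A minimal sketch, assuming the referenced paths have already been extracted from every retained _metadata file (that parsing step is not shown and the paths are made up):

```python
def safely_deletable(shared_files, referenced_by_checkpoints):
    """Return the files in shared/ that no retained checkpoint references.

    `referenced_by_checkpoints` is an iterable of path sets, one per
    retained checkpoint (obtained by parsing each _metadata file)."""
    still_referenced = set().union(*referenced_by_checkpoints)
    return set(shared_files) - still_referenced

# Demo: two retained checkpoints between them reference a and b, so only c
# is safe to delete from the shared directory.
shared = {"shared/a", "shared/b", "shared/c"}
chk_11 = {"shared/a"}
chk_12 = {"shared/a", "shared/b"}
deletable = safely_deletable(shared, [chk_11, chk_12])
```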

> Externalized checkpoint references a checkpoint from a different job
> 
>
> Key: FLINK-20192
> URL: https://issues.apache.org/jira/browse/FLINK-20192
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Runtime / Checkpointing
>Affects Versions: 1.11.2
>Reporter: Antti Kaikkonen
>Priority: Major
> Attachments: _metadata
>
>
> When I try to restore from an externalized checkpoint located at: 
> +/home/anttkaik/flink/checkpoints/0fc94de8d94e123585b5baed6972dbe8/chk-12+ I 
> get the following error: 
>   
> {code:java}
> java.lang.Exception: Exception while creating StreamOperatorStateContext. 
> at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:204)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:247)
>  at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
>  at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721) at 
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:546) at 
> java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.flink.util.FlinkException: Could not restore keyed state backend 
> for FunctionGroupOperator_6b87a4870d0e21cecbbe271bd893cfcc_(2/4) from any of 
> the 1 provided restore options. at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
>  at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:317)
>  at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:144)
>  ... 9 more Caused by: 
> org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected 
> exception. at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:329)
>  at 
> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:535)
>  at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:301)
>  at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
>  at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
>  ... 11 more Caused by: java.io.FileNotFoundException: 
> /home/anttkaik/flink/checkpoints/01dbaf21d7c5e8f8eabd3602e086bb89/shared/0a3c0c1d-c924-4e6d-b6ad-463a75c9fce8
>  (No such file or directory) at java.io.FileInputStream.open0(Native 
> Method) at java.io.FileInputStream.open(FileInputStream.java:195) at 
> java.io.FileInputStream.(FileInputStream.java:138) at 
> org.apache.flink.core.fs.local.LocalDataInputStream.(LocalDataInputStream.java:50)
>  at 
> org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143) 
> at 
> org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
>  at 
> org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:69)
>  at 
> org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:126)
>  at 
> org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:109)
>  at 
> org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:50)
>  at 
> java.util.concurrent.CompletableFuture$AsyncRun.ru

[jira] [Comment Edited] (FLINK-20192) Externalized checkpoint references a checkpoint from a different job

2020-11-17 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233621#comment-17233621
 ] 

Congxian Qiu edited comment on FLINK-20192 at 11/17/20, 2:25 PM:
-

[~Antti-Kaikkonen] This is not a bug. Incremental checkpoints may reference 
files belonging to previous jobs.

If you want to delete the files in the {{SHARED}} directory safely, you need to 
go through all the checkpoint metadata to find out whether each file is still 
being referenced.


was (Author: klion26):
[~Antti-Kaikkonen] Incremental checkpoints may reference files belonging to 
previous jobs.

If you want to delete the files in the {{SHARED}} directory safely, you need to 
go through all the checkpoint metadata to find out whether each file is still 
being referenced.

> Externalized checkpoint references a checkpoint from a different job
> 
>
> Key: FLINK-20192
> URL: https://issues.apache.org/jira/browse/FLINK-20192
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Runtime / Checkpointing
>Affects Versions: 1.11.2
>Reporter: Antti Kaikkonen
>Priority: Major
> Attachments: _metadata
>
>
> When I try to restore from an externalized checkpoint located at: 
> +/home/anttkaik/flink/checkpoints/0fc94de8d94e123585b5baed6972dbe8/chk-12+ I 
> get the following error: 
>   
> {code:java}
> java.lang.Exception: Exception while creating StreamOperatorStateContext. 
> at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:204)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:247)
>  at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
>  at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721) at 
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:546) at 
> java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.flink.util.FlinkException: Could not restore keyed state backend 
> for FunctionGroupOperator_6b87a4870d0e21cecbbe271bd893cfcc_(2/4) from any of 
> the 1 provided restore options. at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
>  at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:317)
>  at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:144)
>  ... 9 more Caused by: 
> org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected 
> exception. at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:329)
>  at 
> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:535)
>  at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:301)
>  at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
>  at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
>  ... 11 more Caused by: java.io.FileNotFoundException: 
> /home/anttkaik/flink/checkpoints/01dbaf21d7c5e8f8eabd3602e086bb89/shared/0a3c0c1d-c924-4e6d-b6ad-463a75c9fce8
>  (No such file or directory) at java.io.FileInputStream.open0(Native 
> Method) at java.io.FileInputStream.open(FileInputStream.java:195) at 
> java.io.FileInputStream.(FileInputStream.java:138) at 
> org.apache.flink.core.fs.local.LocalDataInputStream.(LocalDataInputStream.java:50)
>  at 
> org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143) 
> at 
> org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
>  at 
> org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:69)
>  at 
> org.apache.flink.contrib.streaming.state.RocksDBStateDownloader

[jira] [Commented] (FLINK-20192) Externalized checkpoint references a checkpoint from a different job

2020-11-17 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-20192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233621#comment-17233621
 ] 

Congxian Qiu commented on FLINK-20192:
--

[~Antti-Kaikkonen] Incremental checkpoints may reference files belonging to 
previous jobs.

If you want to delete the files in the {{SHARED}} directory safely, you need to 
go through all the checkpoint metadata to find out whether each file is still 
being referenced.

> Externalized checkpoint references a checkpoint from a different job
> 
>
> Key: FLINK-20192
> URL: https://issues.apache.org/jira/browse/FLINK-20192
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Runtime / Checkpointing
>Affects Versions: 1.11.2
>Reporter: Antti Kaikkonen
>Priority: Major
> Attachments: _metadata
>
>
> When I try to restore from an externalized checkpoint located at: 
> +/home/anttkaik/flink/checkpoints/0fc94de8d94e123585b5baed6972dbe8/chk-12+ I 
> get the following error: 
>   
> {code:java}
> java.lang.Exception: Exception while creating StreamOperatorStateContext. 
> at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:204)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:247)
>  at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
>  at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721) at 
> org.apache.flink.runtime.taskmanager.Task.run(Task.java:546) at 
> java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.flink.util.FlinkException: Could not restore keyed state backend 
> for FunctionGroupOperator_6b87a4870d0e21cecbbe271bd893cfcc_(2/4) from any of 
> the 1 provided restore options. at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
>  at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:317)
>  at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:144)
>  ... 9 more Caused by: 
> org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected 
> exception. at 
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:329)
>  at 
> org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:535)
>  at 
> org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:301)
>  at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
>  at 
> org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
>  ... 11 more Caused by: java.io.FileNotFoundException: 
> /home/anttkaik/flink/checkpoints/01dbaf21d7c5e8f8eabd3602e086bb89/shared/0a3c0c1d-c924-4e6d-b6ad-463a75c9fce8
>  (No such file or directory) at java.io.FileInputStream.open0(Native 
> Method) at java.io.FileInputStream.open(FileInputStream.java:195) at 
> java.io.FileInputStream.(FileInputStream.java:138) at 
> org.apache.flink.core.fs.local.LocalDataInputStream.(LocalDataInputStream.java:50)
>  at 
> org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143) 
> at 
> org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
>  at 
> org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:69)
>  at 
> org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:126)
>  at 
> org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:109)
>  at 
> org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:50)
>  at 
> java.util.concurrent.CompletableFuture$AsyncRun.

[jira] [Resolved] (FLINK-19673) Translate "Standalone Cluster" of "Clusters & Depolyment" page into Chinese

2020-11-11 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu resolved FLINK-19673.
--
Resolution: Fixed

[~ShawnHx] Thanks for the work. Merged into master: 
1d3f8c3415a47830fba92c2806881218863dcde8

> Translate "Standalone Cluster" of "Clusters & Depolyment" page into Chinese
> ---
>
> Key: FLINK-19673
> URL: https://issues.apache.org/jira/browse/FLINK-19673
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Documentation
>Affects Versions: 1.11.0, 1.11.1, 1.11.2
>Reporter: Xiao Huang
>Assignee: Xiao Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The page url is 
> [https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/cluster_setup.html]
> The markdown file is located in 
> {{flink/docs/ops/deployment/cluster_setup.zh.md}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-19673) Translate "Standalone Cluster" of "Clusters & Depolyment" page into Chinese

2020-11-11 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu updated FLINK-19673:
-
Fix Version/s: 1.12.0

> Translate "Standalone Cluster" of "Clusters & Depolyment" page into Chinese
> ---
>
> Key: FLINK-19673
> URL: https://issues.apache.org/jira/browse/FLINK-19673
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Documentation
>Affects Versions: 1.11.0, 1.11.1, 1.11.2
>Reporter: Xiao Huang
>Assignee: Xiao Huang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The page url is 
> [https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/cluster_setup.html]
> The markdown file is located in 
> {{flink/docs/ops/deployment/cluster_setup.zh.md}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-18755) RabbitMQ QoS Chinese Documentation

2020-11-10 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu resolved FLINK-18755.
--
Resolution: Fixed

[~wuyanzu] Thanks for the work. Merged into master: 
044d9ee343345fbc9e40836a7b891634bce3cbfc

> RabbitMQ QoS Chinese Documentation
> --
>
> Key: FLINK-18755
> URL: https://issues.apache.org/jira/browse/FLINK-18755
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Connectors/ RabbitMQ
>Affects Versions: 1.12.0
>Reporter: Austin Cawley-Edwards
>Assignee: 吴彦祖
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Please add documentation for the new QoS settings in the RabbitMQ connector. 
> The added English documentation can be found in the PR here: 
> [https://github.com/apache/flink/pull/12729/files#diff-6b432359b51642a8fad3050c4b73f47cR134-R167|https://github.com/apache/flink/pull/12729/files#diff-6b432359b51642a8fad3050c4b73f47cR134-R167.]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-18755) RabbitMQ QoS Chinese Documentation

2020-11-10 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-18755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu updated FLINK-18755:
-
Fix Version/s: 1.12.0

> RabbitMQ QoS Chinese Documentation
> --
>
> Key: FLINK-18755
> URL: https://issues.apache.org/jira/browse/FLINK-18755
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Connectors/ RabbitMQ
>Affects Versions: 1.12.0
>Reporter: Austin Cawley-Edwards
>Assignee: 吴彦祖
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> Please add documentation for the new QoS settings in the RabbitMQ connector. 
> The added English documentation can be found in the PR here: 
> [https://github.com/apache/flink/pull/12729/files#diff-6b432359b51642a8fad3050c4b73f47cR134-R167|https://github.com/apache/flink/pull/12729/files#diff-6b432359b51642a8fad3050c4b73f47cR134-R167.]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-19394) Translate the 'Monitoring Checkpointing' page of 'Debugging & Monitoring' into Chinese

2020-11-10 Thread Congxian Qiu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229151#comment-17229151
 ] 

Congxian Qiu edited comment on FLINK-19394 at 11/10/20, 11:35 AM:
--

[~RocMarshal] Thank you for the work. Merged into master: 
3c4cb04658f1a86fc7d15d233b6dee8862e12f78


was (Author: klion26):
merged into master 3c4cb04658f1a86fc7d15d233b6dee8862e12f78

> Translate the 'Monitoring Checkpointing' page of 'Debugging & Monitoring' 
> into Chinese
> --
>
> Key: FLINK-19394
> URL: https://issues.apache.org/jira/browse/FLINK-19394
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Documentation
>Affects Versions: 1.11.2
>Reporter: Roc Marshal
>Assignee: Roc Marshal
>Priority: Major
>  Labels: Translation, documentation, pull-request-available, 
> translation, translation-zh
> Fix For: 1.12.0
>
>
> The file location: flink/docs/monitoring/checkpoint_monitoring.md
> The link of the page: 
> https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/monitoring/checkpoint_monitoring.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-19394) Translate the 'Monitoring Checkpointing' page of 'Debugging & Monitoring' into Chinese

2020-11-10 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu updated FLINK-19394:
-
Fix Version/s: 1.12.0

> Translate the 'Monitoring Checkpointing' page of 'Debugging & Monitoring' 
> into Chinese
> --
>
> Key: FLINK-19394
> URL: https://issues.apache.org/jira/browse/FLINK-19394
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Documentation
>Affects Versions: 1.11.2
>Reporter: Roc Marshal
>Assignee: Roc Marshal
>Priority: Major
>  Labels: Translation, documentation, pull-request-available, 
> translation, translation-zh
> Fix For: 1.12.0
>
>
> The file location: flink/docs/monitoring/checkpoint_monitoring.md
> The link of the page: 
> https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/monitoring/checkpoint_monitoring.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-19394) Translate the 'Monitoring Checkpointing' page of 'Debugging & Monitoring' into Chinese

2020-11-10 Thread Congxian Qiu (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-19394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu resolved FLINK-19394.
--
Resolution: Fixed

merged into master 3c4cb04658f1a86fc7d15d233b6dee8862e12f78

> Translate the 'Monitoring Checkpointing' page of 'Debugging & Monitoring' 
> into Chinese
> --
>
> Key: FLINK-19394
> URL: https://issues.apache.org/jira/browse/FLINK-19394
> Project: Flink
>  Issue Type: Improvement
>  Components: chinese-translation, Documentation
>Affects Versions: 1.11.2
>Reporter: Roc Marshal
>Assignee: Roc Marshal
>Priority: Major
>  Labels: Translation, documentation, pull-request-available, 
> translation, translation-zh
> Fix For: 1.12.0
>
>
> The file location: flink/docs/monitoring/checkpoint_monitoring.md
> The link of the page: 
> https://ci.apache.org/projects/flink/flink-docs-release-1.12/zh/monitoring/checkpoint_monitoring.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-11662) job restart when CheckpointCoordinator drop checkpointDirectory as a whole

2019-02-19 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772545#comment-16772545
 ] 

Congxian Qiu commented on FLINK-11662:
--

Hi [~framst], I think there are some other issues about the same problem.

There is also https://issues.apache.org/jira/browse/FLINK-10724 for refactoring 
the error handling; after that issue has been resolved, we can handle this 
problem elegantly.

> job restart when CheckpointCoordinator drop checkpointDirectory as a whole
> --
>
> Key: FLINK-11662
> URL: https://issues.apache.org/jira/browse/FLINK-11662
> Project: Flink
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.7.0
> Environment: Flink 1.7.0
>Reporter: madong
>Priority: Major
>
> CheckpointCoordinator will drop the checkpoint directory as a whole on a 
> failure, but if tasks are still performing a checkpoint at that moment, they 
> will throw an exception and the job will restart.
> {code:java}
> 2019-02-16 11:26:29.378 [Checkpoint Timer] INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering 
> checkpoint 1389046 @ 1550287589373 for job 599a6ac3c371874d12ebf024978cadbc.
> 2019-02-16 11:26:29.630 [flink-akka.actor.default-dispatcher-68] INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Decline 
> checkpoint 1389046 by task 7239e5d29203c4c720ed2db6f5db33fc of job 
> 599a6ac3c371874d12ebf024978cadbc.
> 2019-02-16 11:26:29.630 [flink-akka.actor.default-dispatcher-68] INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Discarding 
> checkpoint 1389046 of job 599a6ac3c371874d12ebf024978cadbc.
> org.apache.flink.runtime.checkpoint.decline.CheckpointDeclineTaskNotReadyException:
>  Task Source: KafkaSource -> mapOperate -> Timestamps/Watermarks (3/3) was 
> not running
> at org.apache.flink.runtime.taskmanager.Task$1.run(Task.java:1166)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2019-02-16 11:26:29.697 [flink-akka.actor.default-dispatcher-68] INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: KafkaSource 
> -> mapOperate -> Timestamps/Watermarks (1/3) 
> (a5657b784d235731cd468164e85d0b50) switched from RUNNING to FAILED.
> org.apache.flink.streaming.runtime.tasks.AsynchronousException: 
> java.lang.Exception: Could not materialize checkpoint 1389046 for operator 
> Source: KafkaSource -> mapOperate -> Timestamps/Watermarks (1/3).
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1153)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:947)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:884)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.Exception: Could not materialize checkpoint 1389046 for 
> operator Source: KafkaSource -> mapOperate -> Timestamps/Watermarks (1/3).
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:942)
> ... 6 common frames omitted
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: 
> Could not flush and close the file system output stream to 
> hdfs://.../flink/checkpoints/599a6ac3c371874d12ebf024978cadbc/chk-1389046/84631771-01e2-41bc-950d-c9e39eac26f9
>  in order to obtain the stream state handle
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
> at 
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:53)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:853)
> ... 5 common frames omitted
> Caused by: java.io.IOException: Could not flush and close the file system 
> output stream to 
> hdfs://.../flink/checkpoints/599a6ac3c371874d12ebf024978cadbc/chk-1389046/84631771-01e2-41bc-950d-c9e39eac26f9
>  in order to obtain the stream state handle
>
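
The race quoted above can be reproduced in miniature (illustrative Python only, not Flink code; the directory and file names are made up to mirror the log): one party discards the checkpoint directory as a whole while another is still materializing its state file into it, and the writer fails with the same {{FileNotFoundException}}-style error.

```python
import shutil
import tempfile
from pathlib import Path


def task_finish_checkpoint(chk_dir: Path, name: str) -> str:
    """A stand-in for a task flushing its state file into the checkpoint dir.

    Raises FileNotFoundError if the directory has already been discarded.
    """
    state_file = chk_dir / name
    with open(state_file, "wb") as f:
        f.write(b"operator-state")
    return str(state_file)


root = Path(tempfile.mkdtemp())
chk = root / "chk-1389046"
chk.mkdir()

# Coordinator declines the checkpoint and discards the directory as a whole...
shutil.rmtree(chk)

# ...while a task is still trying to materialize its part of the checkpoint.
try:
    task_finish_checkpoint(chk, "84631771-state")
except FileNotFoundError as e:
    print("task fails as in the report:", type(e).__name__)

shutil.rmtree(root, ignore_errors=True)
```

This is why the report argues the coordinator-side cleanup and the in-flight task writes need coordinated error handling rather than an unconditional whole-directory delete.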

[jira] [Assigned] (FLINK-11634) Translate "State Backends" page into Chinese

2019-02-17 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11634:


Assignee: Congxian Qiu  (was: hanfei)

> Translate "State Backends" page into Chinese
> 
>
> Key: FLINK-11634
> URL: https://issues.apache.org/jira/browse/FLINK-11634
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Reporter: Congxian Qiu
>Assignee: Congxian Qiu
>Priority: Major
>
> The doc is located in flink/docs/dev/stream/state/state_backends.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11634) Translate "State Backends" page into Chinese

2019-02-17 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770755#comment-16770755
 ] 

Congxian Qiu commented on FLINK-11634:
--

Hi [~hanfeio], please ask the assignee whether he/she is still working on a 
ticket before reassigning it to yourself.

> Translate "State Backends" page into Chinese
> 
>
> Key: FLINK-11634
> URL: https://issues.apache.org/jira/browse/FLINK-11634
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Reporter: Congxian Qiu
>Assignee: hanfei
>Priority: Major
>
> The doc is located in flink/docs/dev/stream/state/state_backends.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-11634) Translate "State Backends" page into Chinese

2019-02-17 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770755#comment-16770755
 ] 

Congxian Qiu edited comment on FLINK-11634 at 2/18/19 3:28 AM:
---

Hi [~hanfeio], please ask the assignee whether he/she is still working on a 
ticket before reassigning it to yourself.


was (Author: klion26):
Hi, [~hanfeio] , Please reassign ticket to yourself before asking the assignee 
if he/her is still working on it.

> Translate "State Backends" page into Chinese
> 
>
> Key: FLINK-11634
> URL: https://issues.apache.org/jira/browse/FLINK-11634
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Reporter: Congxian Qiu
>Assignee: hanfei
>Priority: Major
>
> The doc is located in flink/docs/dev/stream/state/state_backends.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-11634) Translate "State Backends" page into Chinese

2019-02-17 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11634:


Assignee: Congxian Qiu  (was: hanfei)

> Translate "State Backends" page into Chinese
> 
>
> Key: FLINK-11634
> URL: https://issues.apache.org/jira/browse/FLINK-11634
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Reporter: Congxian Qiu
>Assignee: Congxian Qiu
>Priority: Major
>
> The doc is located in flink/docs/dev/stream/state/state_backends.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-11634) Translate "State Backends" page into Chinese

2019-02-16 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11634:


Assignee: Congxian Qiu  (was: hanfei)

> Translate "State Backends" page into Chinese
> 
>
> Key: FLINK-11634
> URL: https://issues.apache.org/jira/browse/FLINK-11634
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Reporter: Congxian Qiu
>Assignee: Congxian Qiu
>Priority: Major
>
> The doc is located in flink/docs/dev/stream/state/state_backends.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11634) Translate "State Backends" page into Chinese

2019-02-16 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770053#comment-16770053
 ] 

Congxian Qiu commented on FLINK-11634:
--

Hi [~hanfeio], I'm working on this issue; maybe you could choose another issue 
that has not been assigned to anyone.

> Translate "State Backends" page into Chinese
> 
>
> Key: FLINK-11634
> URL: https://issues.apache.org/jira/browse/FLINK-11634
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Reporter: Congxian Qiu
>Assignee: hanfei
>Priority: Major
>
> The doc is located in flink/docs/dev/stream/state/state_backends.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11638) Translate "Savepoints" page into Chinese

2019-02-16 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770052#comment-16770052
 ] 

Congxian Qiu commented on FLINK-11638:
--

Hi [~iluvex], I'm working on this issue; maybe you could choose another issue 
that has not been assigned to anyone.

> Translate "Savepoints" page into Chinese
> 
>
> Key: FLINK-11638
> URL: https://issues.apache.org/jira/browse/FLINK-11638
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Reporter: Congxian Qiu
>Assignee: Congxian Qiu
>Priority: Major
>
> The doc is located in flink/docs/ops/state/savepoints.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-11638) Translate "Savepoints" page into Chinese

2019-02-16 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11638:


Assignee: Congxian Qiu  (was: Xin Ma)

> Translate "Savepoints" page into Chinese
> 
>
> Key: FLINK-11638
> URL: https://issues.apache.org/jira/browse/FLINK-11638
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Reporter: Congxian Qiu
>Assignee: Congxian Qiu
>Priority: Major
>
> The doc is located in flink/docs/ops/state/savepoints.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11638) Translate "Savepoints" page into Chinese

2019-02-15 Thread Congxian Qiu (JIRA)
Congxian Qiu created FLINK-11638:


 Summary: Translate "Savepoints" page into Chinese
 Key: FLINK-11638
 URL: https://issues.apache.org/jira/browse/FLINK-11638
 Project: Flink
  Issue Type: Sub-task
  Components: chinese-translation, Documentation
Reporter: Congxian Qiu
Assignee: Congxian Qiu


The doc is located in flink/docs/ops/state/savepoints.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-11567) Translate "How to Review a Pull Request" page into Chinese

2019-02-15 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11567:


Assignee: xulinjie  (was: Congxian Qiu)

> Translate "How to Review a Pull Request" page into Chinese
> --
>
> Key: FLINK-11567
> URL: https://issues.apache.org/jira/browse/FLINK-11567
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Project Website
>Reporter: Jark Wu
>Assignee: xulinjie
>Priority: Major
>
> Translate "How to Review a Pull Request" page into Chinese.
> The markdown file is located in: flink-web/reviewing-prs.zh.md
> The url link is: https://flink.apache.org/zh/reviewing-prs.html
> Please adjust the links in the page to Chinese pages when translating. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11637) Translate "Checkpoints" page into Chinese

2019-02-15 Thread Congxian Qiu (JIRA)
Congxian Qiu created FLINK-11637:


 Summary: Translate "Checkpoints" page into Chinese
 Key: FLINK-11637
 URL: https://issues.apache.org/jira/browse/FLINK-11637
 Project: Flink
  Issue Type: Sub-task
  Components: chinese-translation, Documentation
Reporter: Congxian Qiu
Assignee: Congxian Qiu


The doc is located in flink/docs/ops/state/checkpoints.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11636) Translate "State Schema Evolution" into Chinese

2019-02-15 Thread Congxian Qiu (JIRA)
Congxian Qiu created FLINK-11636:


 Summary: Translate "State Schema Evolution" into Chinese
 Key: FLINK-11636
 URL: https://issues.apache.org/jira/browse/FLINK-11636
 Project: Flink
  Issue Type: Sub-task
  Components: chinese-translation, Documentation
Reporter: Congxian Qiu
Assignee: Congxian Qiu


The doc is located in flink/docs/dev/stream/state/schema_evolution.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-11529) Support Chinese Documents for Apache Flink

2019-02-15 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11529:


Assignee: Jark Wu  (was: Congxian Qiu)

> Support Chinese Documents for Apache Flink
> --
>
> Key: FLINK-11529
> URL: https://issues.apache.org/jira/browse/FLINK-11529
> Project: Flink
>  Issue Type: New Feature
>  Components: chinese-translation, Documentation
>Reporter: Jark Wu
>Assignee: Jark Wu
>Priority: Major
>
> This issue is an umbrella issue for tracking fully support Chinese for Flink 
> documents (http://ci.apache.org/projects/flink/flink-docs-master/).
> A more detailed description can be found in the proposal doc: 
> https://docs.google.com/document/d/1R1-uDq-KawLB8afQYrczfcoQHjjIhq6tvUksxrfhBl0/edit#



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11635) Translate "Checkpointing" page into Chinese

2019-02-15 Thread Congxian Qiu (JIRA)
Congxian Qiu created FLINK-11635:


 Summary: Translate "Checkpointing" page into Chinese
 Key: FLINK-11635
 URL: https://issues.apache.org/jira/browse/FLINK-11635
 Project: Flink
  Issue Type: Sub-task
  Components: chinese-translation, Documentation
Reporter: Congxian Qiu
Assignee: Congxian Qiu


The doc is located in flink/docs/dev/stream/state/checkpointing.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-11529) Support Chinese Documents for Apache Flink

2019-02-15 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11529:


Assignee: Congxian Qiu  (was: Jark Wu)

> Support Chinese Documents for Apache Flink
> --
>
> Key: FLINK-11529
> URL: https://issues.apache.org/jira/browse/FLINK-11529
> Project: Flink
>  Issue Type: New Feature
>  Components: chinese-translation, Documentation
>Reporter: Jark Wu
>Assignee: Congxian Qiu
>Priority: Major
>
> This issue is an umbrella issue for tracking fully support Chinese for Flink 
> documents (http://ci.apache.org/projects/flink/flink-docs-master/).
> A more detailed description can be found in the proposal doc: 
> https://docs.google.com/document/d/1R1-uDq-KawLB8afQYrczfcoQHjjIhq6tvUksxrfhBl0/edit#



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11634) Translate "State Backends" page into Chinese

2019-02-15 Thread Congxian Qiu (JIRA)
Congxian Qiu created FLINK-11634:


 Summary: Translate "State Backends" page into Chinese
 Key: FLINK-11634
 URL: https://issues.apache.org/jira/browse/FLINK-11634
 Project: Flink
  Issue Type: Sub-task
  Components: chinese-translation, Documentation
Reporter: Congxian Qiu
Assignee: Congxian Qiu


The doc is located at flink/docs/dev/stream/state/state_backends.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-11633) Translate "Working with State" into Chinese

2019-02-15 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu updated FLINK-11633:
-
Issue Type: Sub-task  (was: New Feature)
Parent: FLINK-11529

> Translate "Working with State" into Chinese
> ---
>
> Key: FLINK-11633
> URL: https://issues.apache.org/jira/browse/FLINK-11633
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Documentation
>Reporter: Congxian Qiu
>Assignee: Congxian Qiu
>Priority: Major
>
> The doc is located at flink/doc/dev/state/state.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11633) Translate "Working with State" into Chinese

2019-02-15 Thread Congxian Qiu (JIRA)
Congxian Qiu created FLINK-11633:


 Summary: Translate "Working with State" into Chinese
 Key: FLINK-11633
 URL: https://issues.apache.org/jira/browse/FLINK-11633
 Project: Flink
  Issue Type: New Feature
  Components: chinese-translation, Documentation
Reporter: Congxian Qiu
Assignee: Congxian Qiu


The doc is located at flink/doc/dev/state/state.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10481) Wordcount end-to-end test in docker env unstable

2019-02-15 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769076#comment-16769076
 ] 

Congxian Qiu commented on FLINK-10481:
--

Is the following problem related to this issue?

- Travis log link:  [https://api.travis-ci.org/v3/job/493604030/log.txt]

- Error log:
{code:java}
Step 2/16 : RUN apk add --no-cache bash snappy libc6-compat
 ---> [Warning] IPv4 forwarding is disabled. Networking will not work.
 ---> Running in 8c5ab0903f84
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
WARNING: Ignoring 
http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz: 
temporary error (try again later)
fetch 
http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
  bash (missing):
required by: world[bash]
  libc6-compat (missing):
required by: world[libc6-compat]
  snappy (missing):
required by: world[snappy]
WARNING: Ignoring 
http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz: 
temporary error (try again later)
ERROR: unsatisfiable constraints:
The command '/bin/sh -c apk add --no-cache bash snappy libc6-compat' 
returned a non-zero code: 3
Command: build_image failed. Retrying...
Command: build_image failed 3 times.
Failed to build docker image. Aborting...
[FAIL] Test script contains errors.
Checking for errors...
No errors in log files.
Checking for exceptions...
No exceptions in log files.
Checking for non-empty .out files...
grep: 
/home/travis/build/apache/flink/flink-dist/target/flink-1.8-SNAPSHOT-bin/flink-1.8-SNAPSHOT/log/*.out:
 No such file or directory
No non-empty .out files.

[FAIL] 'Wordcount end-to-end test in docker env' failed after 1 minutes and 35 
seconds! Test exited with exit code 1

No taskexecutor daemon to stop on host 
travis-job-9baf0d81-84bb-4970-897d-6beb240d4b16.
No standalonesession daemon to stop on host 
travis-job-9baf0d81-84bb-4970-897d-6beb240d4b16.
travis_time:end:12dbc4b0:start=1550211894080441746,finish=1550214946487459619,duration=3052407017873
The command "./tools/travis_controller.sh" exited with 1.
{code}

> Wordcount end-to-end test in docker env unstable
> 
>
> Key: FLINK-10481
> URL: https://issues.apache.org/jira/browse/FLINK-10481
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: Till Rohrmann
>Assignee: Dawid Wysakowicz
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.6.3, 1.7.0
>
>
> The {{Wordcount end-to-end test in docker env}} fails sometimes on Travis 
> with the following problem:
> {code}
> Status: Downloaded newer image for java:8-jre-alpine
>  ---> fdc893b19a14
> Step 2/16 : RUN apk add --no-cache bash snappy
>  ---> [Warning] IPv4 forwarding is disabled. Networking will not work.
>  ---> Running in 4329ebcd8a77
> fetch http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz
> WARNING: Ignoring 
> http://dl-cdn.alpinelinux.org/alpine/v3.4/main/x86_64/APKINDEX.tar.gz: 
> temporary error (try again later)
> fetch 
> http://dl-cdn.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz
> WARNING: Ignoring 
> http://dl-cdn.alpinelinux.org/alpine/v3.4/community/x86_64/APKINDEX.tar.gz: 
> temporary error (try again later)
> ERROR: unsatisfiable constraints:
>   bash (missing):
> required by: world[bash]
>   snappy (missing):
> required by: world[snappy]
> The command '/bin/sh -c apk add --no-cache bash snappy' returned a non-zero 
> code: 2
> {code}
> https://api.travis-ci.org/v3/job/434909395/log.txt
> It seems as if it is related to 
> https://github.com/gliderlabs/docker-alpine/issues/264 and 
> https://github.com/gliderlabs/docker-alpine/issues/279.
> We might want to switch to a different base image to avoid these problems in 
> the future.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10819) The instability problem of CI, JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure test fail.

2019-02-14 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769016#comment-16769016
 ] 

Congxian Qiu commented on FLINK-10819:
--

Another instance:  https://travis-ci.org/klion26/flink/jobs/493604011

> The instability problem of CI, 
> JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure test 
> fail.
> ---
>
> Key: FLINK-10819
> URL: https://issues.apache.org/jira/browse/FLINK-10819
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.7.0
>Reporter: sunjincheng
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.8.0
>
>
> Found the following error in the process of CI:
> Results :
> Tests in error: 
>  JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure:331 » 
> IllegalArgument
> Tests run: 1463, Failures: 0, Errors: 1, Skipped: 29
> 18:40:55.828 [INFO] 
> 
> 18:40:55.829 [INFO] BUILD FAILURE
> 18:40:55.829 [INFO] 
> 
> 18:40:55.830 [INFO] Total time: 30:19 min
> 18:40:55.830 [INFO] Finished at: 2018-11-07T18:40:55+00:00
> 18:40:56.294 [INFO] Final Memory: 92M/678M
> 18:40:56.294 [INFO] 
> 
> 18:40:56.294 [WARNING] The requested profile "include-kinesis" could not be 
> activated because it does not exist.
> 18:40:56.295 [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test 
> (integration-tests) on project flink-tests_2.11: There are test failures.
> 18:40:56.295 [ERROR] 
> 18:40:56.295 [ERROR] Please refer to 
> /home/travis/build/sunjincheng121/flink/flink-tests/target/surefire-reports 
> for the individual test results.
> 18:40:56.295 [ERROR] -> [Help 1]
> 18:40:56.295 [ERROR] 
> 18:40:56.295 [ERROR] To see the full stack trace of the errors, re-run Maven 
> with the -e switch.
> 18:40:56.295 [ERROR] Re-run Maven using the -X switch to enable full debug 
> logging.
> 18:40:56.295 [ERROR] 
> 18:40:56.295 [ERROR] For more information about the errors and possible 
> solutions, please read the following articles:
> 18:40:56.295 [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
> MVN exited with EXIT CODE: 1.
> Trying to KILL watchdog (11329).
> ./tools/travis_mvn_watchdog.sh: line 269: 11329 Terminated watchdog
> PRODUCED build artifacts.
> But after the rerun, the error disappeared. 
> Currently,no specific reasons are found, and will continue to pay attention.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11334) Migrate enum serializers to use new serialization compatibility abstractions

2019-02-13 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767771#comment-16767771
 ] 

Congxian Qiu commented on FLINK-11334:
--

Hi [~kisimple], what's the status of this issue? I have an almost-finished patch; 
if you don't mind, could I take over this issue?

> Migrate enum serializers to use new serialization compatibility abstractions
> 
>
> Key: FLINK-11334
> URL: https://issues.apache.org/jira/browse/FLINK-11334
> Project: Flink
>  Issue Type: Sub-task
>  Components: State Backends, Checkpointing, Type Serialization System
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: boshu Zheng
>Priority: Major
>
> This subtask covers migration of:
> * EnumSerializerConfigSnapshot
> * ScalaEnumSerializerConfigSnapshot
> to use the new serialization compatibility APIs ({{TypeSerializerSnapshot}} 
> and {{TypeSerializerSchemaCompatibility}}).
> The enum serializer snapshots should be implemented so that on restore the 
> order of Enum constants can be reordered (a case for serializer 
> reconfiguration), as well as adding new Enum constants.
> Serializers are only considered to have completed migration according to the 
> defined list of things to check in FLINK-11327.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
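The migration described in the issue above requires that, on restore, enum constants can be reordered or new constants added without breaking old state. A minimal sketch of the underlying idea, in plain Java rather than Flink's actual {{TypeSerializerSnapshot}} API (the class and method names below are illustrative assumptions, not Flink code):

```java
import java.util.HashMap;
import java.util.Map;

public class EnumSnapshotSketch {
    enum Color { RED, GREEN, BLUE }

    // The snapshot records constant NAMES in the order they were written with.
    // On restore, resolve each old ordinal by name against the current enum,
    // so a reordered (or extended) declaration still maps old data correctly.
    static <E extends Enum<E>> Map<Integer, E> restoreMapping(
            Class<E> enumClass, String[] snapshotOrder) {
        Map<Integer, E> oldOrdinalToConstant = new HashMap<>();
        for (int i = 0; i < snapshotOrder.length; i++) {
            oldOrdinalToConstant.put(i, Enum.valueOf(enumClass, snapshotOrder[i]));
        }
        return oldOrdinalToConstant;
    }

    public static void main(String[] args) {
        // Suppose the snapshot was written when the declaration order was BLUE, RED, GREEN.
        String[] snapshotOrder = {"BLUE", "RED", "GREEN"};
        Map<Integer, Color> mapping = restoreMapping(Color.class, snapshotOrder);
        // Old ordinal 0 still resolves to BLUE even though BLUE is now ordinal 2.
        System.out.println(mapping.get(0)); // prints BLUE
    }
}
```

Because the snapshot stores names rather than bare ordinals, the restored mapping stays correct when the enum declaration order changes between savepoint and restore, which is the reconfiguration case the issue calls out.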


[jira] [Assigned] (FLINK-11588) Migrate CopyableValueSerializer to use new serialization compatibility abstractions

2019-02-12 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11588:


Assignee: Congxian Qiu

> Migrate CopyableValueSerializer to use new serialization compatibility 
> abstractions
> ---
>
> Key: FLINK-11588
> URL: https://issues.apache.org/jira/browse/FLINK-11588
> Project: Flink
>  Issue Type: Sub-task
>  Components: Type Serialization System
>Reporter: Tzu-Li (Gordon) Tai
>Assignee: Congxian Qiu
>Priority: Major
> Fix For: 1.8.0
>
>
> This subtask covers migration of the {{CopyableValueSerializer}} to use the 
> new serialization compatibility APIs {{TypeSerializerSnapshot}} and 
> {{TypeSerializerSchemaCompatibility}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-11554) Translate the "Community & Project Info" page into Chinese

2019-02-08 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11554:


Assignee: Congxian Qiu

> Translate the "Community & Project Info" page into Chinese
> --
>
> Key: FLINK-11554
> URL: https://issues.apache.org/jira/browse/FLINK-11554
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Project Website
>Reporter: Jark Wu
>Assignee: Congxian Qiu
>Priority: Major
>
> Translate "Community & Project Info" page into Chinese.
> The markdown file is located in: flink-web/community.zh.md
> The url link is: https://flink.apache.org/zh/community.html
> Please adjust the links in the page to Chinese pages when translating. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-11561) Translate "Flink Architecture" page into Chinese

2019-02-08 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11561:


Assignee: Congxian Qiu

> Translate "Flink Architecture" page into Chinese
> 
>
> Key: FLINK-11561
> URL: https://issues.apache.org/jira/browse/FLINK-11561
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Project Website
>Reporter: Jark Wu
>Assignee: Congxian Qiu
>Priority: Major
>
> Translate "Flink Architecture" page into Chinese.
> The markdown file is located in: flink-web/flink-architecture.zh.md
> The url link is: https://flink.apache.org/zh/flink-architecture.html
> Please adjust the links in the page to Chinese pages when translating. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-11565) Translate "Improving the Website" page into Chinese

2019-02-08 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11565:


Assignee: Congxian Qiu

> Translate "Improving the Website" page into Chinese
> ---
>
> Key: FLINK-11565
> URL: https://issues.apache.org/jira/browse/FLINK-11565
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Project Website
>Reporter: Jark Wu
>Assignee: Congxian Qiu
>Priority: Major
>
> Translate "Improving the Website" page into Chinese.
> The markdown file is located in: flink-web/improve-website.zh.md
> The url link is: https://flink.apache.org/zh/improve-website.html
> Please adjust the links in the page to Chinese pages when translating. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-11567) Translate "How to Review a Pull Request" page into Chinese

2019-02-08 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11567:


Assignee: Congxian Qiu

> Translate "How to Review a Pull Request" page into Chinese
> --
>
> Key: FLINK-11567
> URL: https://issues.apache.org/jira/browse/FLINK-11567
> Project: Flink
>  Issue Type: Sub-task
>  Components: chinese-translation, Project Website
>Reporter: Jark Wu
>Assignee: Congxian Qiu
>Priority: Major
>
> Translate "How to Review a Pull Request" page into Chinese.
> The markdown file is located in: flink-web/reviewing-prs.zh.md
> The url link is: https://flink.apache.org/zh/reviewing-prs.html
> Please adjust the links in the page to Chinese pages when translating. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11428) BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis

2019-02-08 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763680#comment-16763680
 ] 

Congxian Qiu commented on FLINK-11428:
--

Thank you for pointing this out, [~aljoscha]. I'll keep digging into this issue.

I saw that part in the log, and ran 
{{BufferFileWriterFileSegmentReaderTest.testWriteRead}} hundreds of times locally; 
I only encountered an {{IllegalStateException}} when stopping the tests, and no 
{{AssertionError}}.

> BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis
> 
>
> Key: FLINK-11428
> URL: https://issues.apache.org/jira/browse/FLINK-11428
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Congxian Qiu
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.8.0
>
>
> 10:31:58.273 [ERROR] Errors: 
> 10:31:58.273 [ERROR] 
> org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest.testWriteRead(org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest)
> 10:31:58.273 [ERROR] Run 1: 
> BufferFileWriterFileSegmentReaderTest.testWriteRead:141
> 10:31:58.273 [ERROR] Run 2: 
> BufferFileWriterFileSegmentReaderTest.tearDownWriterAndReader:95 » 
> IllegalState
>  
> Travis link: https://travis-ci.org/apache/flink/jobs/483788040



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-11428) BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis

2019-02-08 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763680#comment-16763680
 ] 

Congxian Qiu edited comment on FLINK-11428 at 2/8/19 3:05 PM:
--

Thank you for pointing this out, [~aljoscha]. I'll keep digging into this issue.

I saw that part in the log, and ran 
{{BufferFileWriterFileSegmentReaderTest.testWriteRead}} hundreds of times locally; 
I only encountered an {{IllegalStateException}} when stopping the tests, and no 
{{AssertionError}}. I'll run this test locally a few more times to find the reason.


was (Author: klion26):
Thank you for pointing this out, [~aljoscha]. I'll keep digging into this issue.

I saw that part in the log, and ran 
{{BufferFileWriterFileSegmentReaderTest.testWriteRead}} hundreds of times locally; 
I only encountered an {{IllegalStateException}} when stopping the tests, and no 
{{AssertionError}}.

> BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis
> 
>
> Key: FLINK-11428
> URL: https://issues.apache.org/jira/browse/FLINK-11428
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Congxian Qiu
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.8.0
>
>
> 10:31:58.273 [ERROR] Errors: 
> 10:31:58.273 [ERROR] 
> org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest.testWriteRead(org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest)
> 10:31:58.273 [ERROR] Run 1: 
> BufferFileWriterFileSegmentReaderTest.testWriteRead:141
> 10:31:58.273 [ERROR] Run 2: 
> BufferFileWriterFileSegmentReaderTest.tearDownWriterAndReader:95 » 
> IllegalState
>  
> Travis link: https://travis-ci.org/apache/flink/jobs/483788040



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-11428) BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis

2019-02-06 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761678#comment-16761678
 ] 

Congxian Qiu edited comment on FLINK-11428 at 2/6/19 11:39 AM:
---

Hi [~aljoscha],

After running {{BufferFileWriterFileSegmentReaderTest#testWriteRead}} a couple of 
times locally, I find the direct reason is that when 
[writer.deleteChannel()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L91]
 is called, 
[writer.close()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L120]
 has not been called beforehand for some reason. Does this just need to catch this 
situation (a check in {{tearDownWriterAndReader}}), or is more needed?


was (Author: klion26):
Hi [~aljoscha],

After running {{BufferFileWriterFileSegmentReaderTest#testWriteRead}} a couple of 
times locally, I find the direct reason is that when 
[writer.deleteChannel()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L91]
 is called, 
[writer.close()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L120]
 has not been called beforehand for some reason. Does this just need to catch this 
situation (a check in {{tearDownWriterAndReader}}), or is more needed?

> BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis
> 
>
> Key: FLINK-11428
> URL: https://issues.apache.org/jira/browse/FLINK-11428
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Congxian Qiu
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.8.0
>
>
> 10:31:58.273 [ERROR] Errors: 
> 10:31:58.273 [ERROR] 
> org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest.testWriteRead(org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest)
> 10:31:58.273 [ERROR] Run 1: 
> BufferFileWriterFileSegmentReaderTest.testWriteRead:141
> 10:31:58.273 [ERROR] Run 2: 
> BufferFileWriterFileSegmentReaderTest.tearDownWriterAndReader:95 » 
> IllegalState
>  
> Travis link: https://travis-ci.org/apache/flink/jobs/483788040



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-11428) BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis

2019-02-06 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761678#comment-16761678
 ] 

Congxian Qiu edited comment on FLINK-11428 at 2/6/19 11:36 AM:
---

Hi [~aljoscha],

After running {{BufferFileWriterFileSegmentReaderTest#testWriteRead}} a couple of 
times locally, I find the direct reason is that 
[writer.deleteChannel()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L91]
 is called while 
[writer.close()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L120]
 has not been called beforehand for some reason. Does this just need to catch this 
situation (a check in {{tearDownWriterAndReader}}), or is more needed?


was (Author: klion26):
After running {{BufferFileWriterFileSegmentReaderTest#testWriteRead}} a couple of 
times locally, I find the direct reason is that 
[writer.deleteChannel()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L91]
 is called while 
[writer.close()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L120]
 has not been called beforehand for some reason. Does this just need to catch this 
situation (a check in {{tearDownWriterAndReader}}), or is more needed?

> BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis
> 
>
> Key: FLINK-11428
> URL: https://issues.apache.org/jira/browse/FLINK-11428
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Congxian Qiu
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.8.0
>
>
> 10:31:58.273 [ERROR] Errors: 
> 10:31:58.273 [ERROR] 
> org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest.testWriteRead(org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest)
> 10:31:58.273 [ERROR] Run 1: 
> BufferFileWriterFileSegmentReaderTest.testWriteRead:141
> 10:31:58.273 [ERROR] Run 2: 
> BufferFileWriterFileSegmentReaderTest.tearDownWriterAndReader:95 » 
> IllegalState
>  
> Travis link: https://travis-ci.org/apache/flink/jobs/483788040



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-11428) BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis

2019-02-06 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761678#comment-16761678
 ] 

Congxian Qiu edited comment on FLINK-11428 at 2/6/19 11:38 AM:
---

Hi [~aljoscha],

After running {{BufferFileWriterFileSegmentReaderTest#testWriteRead}} a couple of 
times locally, I find the direct reason is that when 
[writer.deleteChannel()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L91]
 is called, 
[writer.close()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L120]
 has not been called beforehand for some reason. Does this just need to catch this 
situation (a check in {{tearDownWriterAndReader}}), or is more needed?


was (Author: klion26):
Hi [~aljoscha],

After running {{BufferFileWriterFileSegmentReaderTest#testWriteRead}} a couple of 
times locally, I find the direct reason is that 
[writer.deleteChannel()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L91]
 is called while 
[writer.close()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L120]
 has not been called beforehand for some reason. Does this just need to catch this 
situation (a check in {{tearDownWriterAndReader}}), or is more needed?

> BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis
> 
>
> Key: FLINK-11428
> URL: https://issues.apache.org/jira/browse/FLINK-11428
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Congxian Qiu
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.8.0
>
>
> 10:31:58.273 [ERROR] Errors: 
> 10:31:58.273 [ERROR] 
> org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest.testWriteRead(org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest)
> 10:31:58.273 [ERROR] Run 1: 
> BufferFileWriterFileSegmentReaderTest.testWriteRead:141
> 10:31:58.273 [ERROR] Run 2: 
> BufferFileWriterFileSegmentReaderTest.tearDownWriterAndReader:95 » 
> IllegalState
>  
> Travis link: https://travis-ci.org/apache/flink/jobs/483788040



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-11428) BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis

2019-02-06 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761678#comment-16761678
 ] 

Congxian Qiu edited comment on FLINK-11428 at 2/6/19 11:37 AM:
---

Hi [~aljoscha],

After running {{BufferFileWriterFileSegmentReaderTest#testWriteRead}} a couple of 
times locally, I find the direct reason is that 
[writer.deleteChannel()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L91]
 is called while 
[writer.close()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L120]
 has not been called beforehand for some reason. Does this just need to catch this 
situation (a check in {{tearDownWriterAndReader}}), or is more needed?


was (Author: klion26):
Hi [~aljoscha],

After running {{BufferFileWriterFileSegmentReaderTest#testWriteRead}} a couple of 
times locally, I find the direct reason is that 
[writer.deleteChannel()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L91]
 is called while 
[writer.close()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L120]
 has not been called beforehand for some reason. Does this just need to catch this 
situation (a check in {{tearDownWriterAndReader}}), or is more needed?

> BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis
> 
>
> Key: FLINK-11428
> URL: https://issues.apache.org/jira/browse/FLINK-11428
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Congxian Qiu
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.8.0
>
>
> 10:31:58.273 [ERROR] Errors: 
> 10:31:58.273 [ERROR] 
> org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest.testWriteRead(org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest)
> 10:31:58.273 [ERROR] Run 1: 
> BufferFileWriterFileSegmentReaderTest.testWriteRead:141
> 10:31:58.273 [ERROR] Run 2: 
> BufferFileWriterFileSegmentReaderTest.tearDownWriterAndReader:95 » 
> IllegalState
>  
> Travis link: https://travis-ci.org/apache/flink/jobs/483788040



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-11428) BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis

2019-02-06 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761678#comment-16761678
 ] 

Congxian Qiu commented on FLINK-11428:
--

After running {{BufferFileWriterFileSegmentReaderTest#testWriteRead}} a couple of 
times locally, I find the direct reason is that 
[writer.deleteChannel()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L91]
 is called while 
[writer.close()|https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-runtime/src/test/java/org/apache/flink/runtime/io/disk/iomanager/BufferFileWriterFileSegmentReaderTest.java#L120]
 has not been called beforehand for some reason. Does this just need to catch this 
situation (a check in {{tearDownWriterAndReader}}), or is more needed?

> BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis
> 
>
> Key: FLINK-11428
> URL: https://issues.apache.org/jira/browse/FLINK-11428
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Congxian Qiu
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.8.0
>
>
> 10:31:58.273 [ERROR] Errors: 
> 10:31:58.273 [ERROR] 
> org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest.testWriteRead(org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest)
> 10:31:58.273 [ERROR] Run 1: 
> BufferFileWriterFileSegmentReaderTest.testWriteRead:141
> 10:31:58.273 [ERROR] Run 2: 
> BufferFileWriterFileSegmentReaderTest.tearDownWriterAndReader:95 » 
> IllegalState
>  
> Travis link: https://travis-ci.org/apache/flink/jobs/483788040



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
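The check proposed in the comments above — guarding the teardown so the channel is never deleted while the writer is still open — could be sketched as follows. This is a hypothetical, stand-alone illustration, not Flink's actual writer API; {{FakeWriter}} and {{tearDown}} are invented names for the example:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TeardownGuard {
    // Minimal stand-in for a file writer with an explicit close step.
    static class FakeWriter {
        private final AtomicBoolean closed = new AtomicBoolean(false);
        void close() { closed.set(true); }
        boolean isClosed() { return closed.get(); }
        void deleteChannel() {
            // Deleting the channel of a still-open writer is the state the
            // Travis run tripped over (the IllegalState in the surefire report).
            if (!isClosed()) {
                throw new IllegalStateException("channel deleted before close()");
            }
        }
    }

    // Teardown helper: close first if needed, then delete the channel,
    // mirroring the check suggested for tearDownWriterAndReader.
    static void tearDown(FakeWriter writer) {
        if (!writer.isClosed()) {
            writer.close();
        }
        writer.deleteChannel();
    }

    public static void main(String[] args) {
        FakeWriter writer = new FakeWriter();
        tearDown(writer); // must not throw even though close() was never called
        System.out.println("teardown ok, closed=" + writer.isClosed());
    }
}
```

The guard makes the teardown order-independent: whether or not the test body managed to call close() before failing, deleteChannel() only ever runs against a closed writer.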


[jira] [Comment Edited] (FLINK-11531) Translate the Home Page of flink docs into Chinese

2019-02-06 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761642#comment-16761642
 ] 

Congxian Qiu edited comment on FLINK-11531 at 2/6/19 10:33 AM:
---

After [FLINK-11530|https://issues.apache.org/jira/browse/FLINK-11530] has been 
merged, I'll create a PR for this; I've already translated the home page on my 
local machine.


was (Author: klion26):
After [FLINK-11527|https://issues.apache.org/jira/browse/FLINK-11527] has been 
merged, I'll create a PR for this; I've already translated the home page on my 
local machine.

> Translate the Home Page of flink docs into Chinese
> --
>
> Key: FLINK-11531
> URL: https://issues.apache.org/jira/browse/FLINK-11531
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Jark Wu
>Assignee: Congxian Qiu
>Priority: Major
>
> The home page url is https://ci.apache.org/projects/flink/flink-docs-master/ .
> The markdown file is located in flink/docs/index.zh.md





[jira] [Commented] (FLINK-11531) Translate the Home Page of flink docs into Chinese

2019-02-06 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761642#comment-16761642
 ] 

Congxian Qiu commented on FLINK-11531:
--

After [FLINK-11527|https://issues.apache.org/jira/browse/FLINK-11527] has been 
merged, I'll create a PR for this; I've already translated the home page on my 
local machine.

> Translate the Home Page of flink docs into Chinese
> --
>
> Key: FLINK-11531
> URL: https://issues.apache.org/jira/browse/FLINK-11531
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Jark Wu
>Assignee: Congxian Qiu
>Priority: Major
>
> The home page url is https://ci.apache.org/projects/flink/flink-docs-master/ .
> The markdown file is located in flink/docs/index.zh.md





[jira] [Assigned] (FLINK-11531) Translate the Home Page of flink docs into Chinese

2019-02-05 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11531:


Assignee: Congxian Qiu

> Translate the Home Page of flink docs into Chinese
> --
>
> Key: FLINK-11531
> URL: https://issues.apache.org/jira/browse/FLINK-11531
> Project: Flink
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Jark Wu
>Assignee: Congxian Qiu
>Priority: Major
>
> The home page url is https://ci.apache.org/projects/flink/flink-docs-master/ .
> The markdown file is located in flink/docs/index.zh.md





[jira] [Created] (FLINK-11483) Improve StreamOperatorSnapshotRestoreTest with Parameterized

2019-01-30 Thread Congxian Qiu (JIRA)
Congxian Qiu created FLINK-11483:


 Summary: Improve StreamOperatorSnapshotRestoreTest with 
Parameterized
 Key: FLINK-11483
 URL: https://issues.apache.org/jira/browse/FLINK-11483
 Project: Flink
  Issue Type: Test
  Components: State Backends, Checkpointing, Tests
Reporter: Congxian Qiu
Assignee: Congxian Qiu


In the current implementation, we test {{StreamOperatorSnapshot}} with three 
state backends: {{File}}, {{RocksDB_FULL}}, and {{RocksDB_Incremental}}, each in 
a separate class; we could improve this with Parameterized.
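The idea can be shown with a plain-Java sketch; the real change would use JUnit's {{@RunWith(Parameterized.class)}} runner, and the backend names and test body below are placeholders, not the actual Flink test code:

```java
public class ParameterizedSketch {
    // the three state backends currently covered by separate test classes
    enum Backend { FILE, ROCKSDB_FULL, ROCKSDB_INCREMENTAL }

    // stand-in for the shared snapshot/restore test body
    static String runSnapshotRestoreTest(Backend backend) {
        return "snapshot-restore passed for " + backend;
    }

    public static void main(String[] args) {
        // with the Parameterized runner, this loop becomes the
        // @Parameterized.Parameters list plus a single test method
        for (Backend backend : Backend.values()) {
            System.out.println(runSnapshotRestoreTest(backend));
        }
    }
}
```

One test class with a parameter list then replaces three nearly identical classes, and adding a fourth backend becomes a one-line change.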





[jira] [Comment Edited] (FLINK-11352) Check and port JobManagerHACheckpointRecoveryITCase to new code base if necessary

2019-01-29 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754685#comment-16754685
 ] 

Congxian Qiu edited comment on FLINK-11352 at 1/29/19 9:40 AM:
---

Hi [~till.rohrmann] Update for {{testCheckpointRecoveryFailure}}.

I dug a bit more and found there exists a test class called 
{{JobManagerHAProcessFailureRecoveryITCase}}. I think we can implement 
{{JobManagerHACheckpointRecoveryITCase::testCheckpointRecoveryFailure}} there; 
the logic will be similar to 
{{JobManagerHAProcessFailureRecoveryITCase::testDispatcherProcessFailure}}, but 
we should remove the directory {{coordinateTempDir}} before starting the second 
dispatcher and verify that the expected error log is printed after the second 
dispatcher process has started.

[gist link|https://gist.github.com/klion26/eafa6174df361d3bb0447c2e7681db0f] of 
the new code.

What do you think about this? If this is OK, I'll implement it.


was (Author: klion26):
Hi [~till.rohrmann] Update for {{testCheckpointRecoveryFailure}}.

I dug a bit more and found there exists a test class called 
{{JobManagerHAProcessFailureRecoveryITCase}}. I think we can implement 
{{JobManagerHACheckpointRecoveryITCase::testCheckpointRecoveryFailure}} there; 
the logic will be similar to 
{{JobManagerHAProcessFailureRecoveryITCase::testDispatcherProcessFailure}}, but 
we should remove the directory {{coordinateTempDir}} before starting the second 
dispatcher and verify that the expected error log is printed after the second 
dispatcher process has started.

What do you think about this? If this is OK, I'll implement it.
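The proposed flow can be sketched in miniature; this is a self-contained illustration only, where {{startSecondDispatcher}} is a hypothetical stand-in for launching the second dispatcher process, not a Flink API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class RecoveryFailureSketch {
    // placeholder for the second dispatcher: it should fail visibly
    // when the checkpoint coordination directory is gone
    static String startSecondDispatcher(Path coordinateTempDir) {
        if (!Files.exists(coordinateTempDir)) {
            return "ERROR: checkpoint coordination directory missing";
        }
        return "recovered";
    }

    public static void main(String[] args) throws IOException {
        Path coordinateTempDir = Files.createTempDirectory("coordinate");
        // remove the directory before starting the second dispatcher ...
        Files.delete(coordinateTempDir);
        // ... then verify that the expected error is reported
        String log = startSecondDispatcher(coordinateTempDir);
        System.out.println(log);
    }
}
```

The real test would additionally scan the second dispatcher's log output for the expected error message rather than a return value.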

> Check and port JobManagerHACheckpointRecoveryITCase to new code base if 
> necessary
> -
>
> Key: FLINK-11352
> URL: https://issues.apache.org/jira/browse/FLINK-11352
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Till Rohrmann
>Assignee: Congxian Qiu
>Priority: Major
>
> Check and port {{JobManagerHACheckpointRecoveryITCase}} to new code base if 
> necessary





[jira] [Commented] (FLINK-11352) Check and port JobManagerHACheckpointRecoveryITCase to new code base if necessary

2019-01-28 Thread Congxian Qiu (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-11352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754685#comment-16754685
 ] 

Congxian Qiu commented on FLINK-11352:
--

Hi [~till.rohrmann] Update for {{testCheckpointRecoveryFailure}}.

I dug a bit more and found there exists a test class called 
{{JobManagerHAProcessFailureRecoveryITCase}}. I think we can implement 
{{JobManagerHACheckpointRecoveryITCase::testCheckpointRecoveryFailure}} there; 
the logic will be similar to 
{{JobManagerHAProcessFailureRecoveryITCase::testDispatcherProcessFailure}}, but 
we should remove the directory {{coordinateTempDir}} before starting the second 
dispatcher and verify that the expected error log is printed after the second 
dispatcher process has started.

What do you think about this? If this is OK, I'll implement it.

> Check and port JobManagerHACheckpointRecoveryITCase to new code base if 
> necessary
> -
>
> Key: FLINK-11352
> URL: https://issues.apache.org/jira/browse/FLINK-11352
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Till Rohrmann
>Assignee: Congxian Qiu
>Priority: Major
>
> Check and port {{JobManagerHACheckpointRecoveryITCase}} to new code base if 
> necessary





[jira] [Assigned] (FLINK-11361) Check and port RecoveryITCase to new code base if necessary

2019-01-28 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu reassigned FLINK-11361:


Assignee: (was: Congxian Qiu)

> Check and port RecoveryITCase to new code base if necessary
> ---
>
> Key: FLINK-11361
> URL: https://issues.apache.org/jira/browse/FLINK-11361
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Till Rohrmann
>Priority: Major
>
> Check and port {{RecoveryITCase}} to new code base if necessary.





[jira] [Updated] (FLINK-11428) BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis

2019-01-24 Thread Congxian Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Congxian Qiu updated FLINK-11428:
-
Component/s: Tests

> BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis
> 
>
> Key: FLINK-11428
> URL: https://issues.apache.org/jira/browse/FLINK-11428
> Project: Flink
>  Issue Type: Test
>  Components: Tests
>Reporter: Congxian Qiu
>Priority: Major
>
> 10:31:58.273 [ERROR] Errors: 
> 10:31:58.273 [ERROR] 
> org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest.testWriteRead(org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest)
> 10:31:58.273 [ERROR] Run 1: 
> BufferFileWriterFileSegmentReaderTest.testWriteRead:141
> 10:31:58.273 [ERROR] Run 2: 
> BufferFileWriterFileSegmentReaderTest.tearDownWriterAndReader:95 » 
> IllegalState
>  
> Travis link: https://travis-ci.org/apache/flink/jobs/483788040





[jira] [Created] (FLINK-11428) BufferFileWriterFileSegmentReaderTest#testWriteRead failed on Travis

2019-01-24 Thread Congxian Qiu (JIRA)
Congxian Qiu created FLINK-11428:


 Summary: BufferFileWriterFileSegmentReaderTest#testWriteRead 
failed on Travis
 Key: FLINK-11428
 URL: https://issues.apache.org/jira/browse/FLINK-11428
 Project: Flink
  Issue Type: Test
Reporter: Congxian Qiu


10:31:58.273 [ERROR] Errors: 
10:31:58.273 [ERROR] 
org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest.testWriteRead(org.apache.flink.runtime.io.disk.iomanager.BufferFileWriterFileSegmentReaderTest)
10:31:58.273 [ERROR] Run 1: 
BufferFileWriterFileSegmentReaderTest.testWriteRead:141
10:31:58.273 [ERROR] Run 2: 
BufferFileWriterFileSegmentReaderTest.tearDownWriterAndReader:95 » IllegalState
 

Travis link: https://travis-ci.org/apache/flink/jobs/483788040




