[jira] [Comment Edited] (FLINK-14816) Add thread dump feature for taskmanager

2019-12-20 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001312#comment-17001312
 ] 

lamber-ken edited comment on FLINK-14816 at 12/21/19 12:33 AM:
---

Thanks, done. :)


was (Author: lamber-ken):
Thanks, done.

> Add thread dump feature for taskmanager
> ---
>
> Key: FLINK-14816
> URL: https://issues.apache.org/jira/browse/FLINK-14816
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.1
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
> Attachments: screenshot-1.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Add thread dump feature for taskmanager, so users can get thread information 
> easily.
>  !screenshot-1.png! 
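
For background, the thread information shown above can be gathered via the JDK's ThreadMXBean. A minimal sketch (the class below is hypothetical and only illustrates the mechanism, not the actual REST handler added by the PR):

{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: produces a textual dump of all live threads,
// including locked monitors and ownable synchronizers.
public class ThreadDumpSketch {

    public static String dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
            out.append(info);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
{code}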



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14816) Add thread dump feature for taskmanager

2019-12-20 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001312#comment-17001312
 ] 

lamber-ken commented on FLINK-14816:


Thanks, done.

> Add thread dump feature for taskmanager
> ---
>
> Key: FLINK-14816
> URL: https://issues.apache.org/jira/browse/FLINK-14816
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.1
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
> Attachments: screenshot-1.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Add thread dump feature for taskmanager, so users can get thread information 
> easily.
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-12-19 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000645#comment-17000645
 ] 

lamber-ken commented on FLINK-14177:


It took me a long time to get back to this PR (opened on 2019-09-24). If 
someone wants to fix this, feel free to reopen it. But I don't have permission 
to unassign it. 

> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hadoop Compatibility, Runtime / 
> Checkpointing
>Affects Versions: 1.8.1, 1.9.0
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> According to FLINK-10052 and FLINK-13417, we need to upgrade the version of 
> CuratorFramework first.
> Curator 4.2.0 supports:
> 1) zk 3.4.* and zk 3.5.*
> 2) connectionStateErrorPolicy
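
For illustration, the connectionStateErrorPolicy mentioned above can be configured on a Curator 4.x client as sketched below (connection string and retry settings are placeholder values, not taken from Flink's code):

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.SessionConnectionStateErrorPolicy;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorPolicySketch {

    public static CuratorFramework create(String quorum) {
        return CuratorFrameworkFactory.builder()
                .connectString(quorum)
                .retryPolicy(new ExponentialBackoffRetry(1000, 3))
                // Treat only LOST (session expiration) as an error state, so a
                // temporarily SUSPENDED connection does not revoke leadership.
                .connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())
                .build();
    }
}
{code}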



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-12-19 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed FLINK-14177.
--
Resolution: Won't Fix

> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hadoop Compatibility, Runtime / 
> Checkpointing
>Affects Versions: 1.8.1, 1.9.0
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> According to FLINK-10052 and FLINK-13417, we need to upgrade the version of 
> CuratorFramework first.
> Curator 4.2.0 supports:
> 1) zk 3.4.* and zk 3.5.*
> 2) connectionStateErrorPolicy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15087) JobManager is forced to shutdown JVM due to temporary loss of zookeeper connection

2019-12-06 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989515#comment-16989515
 ] 

lamber-ken commented on FLINK-15087:


(y)

> JobManager is forced to shutdown JVM due to temporary loss of zookeeper 
> connection
> --
>
> Key: FLINK-15087
> URL: https://issues.apache.org/jira/browse/FLINK-15087
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.8.2
>Reporter: Abdul Qadeer
>Priority: Major
>
> While testing I found that the loss of connection with zookeeper triggers JVM 
> shutdown for Job Manager, when started through 
> "StandaloneSessionClusterEntrypoint". This happens due to a NPE on 
> "taskManagerHeartbeatManager."
> When JobManagerRunner suspends jobMasterService (as Job manager is no longer 
> leader), "taskManagerHeartbeatManager" is set to null in 
> "stopHeartbeatServices".
> Next, "AkkaRpcActor" stops JobMaster and throws NPE in the following method:
> {code:java}
> @Override
> public CompletableFuture<Acknowledge> disconnectTaskManager(final ResourceID 
> resourceID, final Exception cause) {
>log.debug("Disconnect TaskExecutor {} because: {}", resourceID, 
> cause.getMessage());
>taskManagerHeartbeatManager.unmonitorTarget(resourceID);
>slotPool.releaseTaskManager(resourceID, cause);
> {code}
>  
> This finally leads to a fatal error in "ClusterEntrypoint.onFatalError()" and 
> forces JVM shutdown.
> The stack trace is below:
>  
> {noformat}
> {"timeMillis":1575581120723,"thread":"flink-akka.actor.default-dispatcher-93","level":"ERROR","loggerName":"com.Sample","message":"Failed
>  to take leadership with session id 
> b4662db5-f065-41d9-aaaf-78625355b251.","thrown":{"commonElementCount":0,"localizedMessage":"Failed
>  to take leadership with session id 
> b4662db5-f065-41d9-aaaf-78625355b251.","message":"Failed to take leadership 
> with session id 
> b4662db5-f065-41d9-aaaf-78625355b251.","name":"org.apache.flink.runtime.dispatcher.DispatcherException","cause":{"commonElementCount":18,"localizedMessage":"Termination
>  of previous JobManager for job bbb8c430787d92293e9d45c349231d9c failed. 
> Cannot submit job under the same job id.","message":"Termination of previous 
> JobManager for job bbb8c430787d92293e9d45c349231d9c failed. Cannot submit job 
> under the same job 
> id.","name":"org.apache.flink.runtime.dispatcher.DispatcherException","cause":{"commonElementCount":6,"localizedMessage":"org.apache.flink.util.FlinkException:
>  Could not properly shut down the 
> JobManagerRunner","message":"org.apache.flink.util.FlinkException: Could not 
> properly shut down the 
> JobManagerRunner","name":"java.util.concurrent.CompletionException","cause":{"commonElementCount":6,"localizedMessage":"Could
>  not properly shut down the JobManagerRunner","message":"Could not properly 
> shut down the 
> JobManagerRunner","name":"org.apache.flink.util.FlinkException","cause":{"commonElementCount":13,"localizedMessage":"Failure
>  while stopping RpcEndpoint jobmanager_0.","message":"Failure while stopping 
> RpcEndpoint 
> jobmanager_0.","name":"org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException","cause":{"commonElementCount":13,"name":"java.lang.NullPointerException","extendedStackTrace":[{"class":"org.apache.flink.runtime.jobmaster.JobMaster","method":"disconnectTaskManager","file":"JobMaster.java","line":629,"exact":false,"location":"flink-runtime_2.11-1.8.2.jar","version":"1.8.2"},{"class":"org.apache.flink.runtime.jobmaster.JobMaster","method":"onStop","file":"JobMaster.java","line":346,"exact":false,"location":"flink-runtime_2.11-1.8.2.jar","version":"1.8.2"},{"class":"org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState","method":"terminate",...
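
One defensive option, shown purely as a sketch of the idea (not necessarily the fix applied upstream), is to null-guard the heartbeat manager in the fragment quoted above:

{code:java}
// Sketch only: mirrors the quoted JobMaster fragment with a null guard added.
// The surrounding fields (log, taskManagerHeartbeatManager, slotPool) are the
// ones referenced in the snippet above.
@Override
public CompletableFuture<Acknowledge> disconnectTaskManager(final ResourceID resourceID, final Exception cause) {
    log.debug("Disconnect TaskExecutor {} because: {}", resourceID, cause.getMessage());
    if (taskManagerHeartbeatManager != null) {
        taskManagerHeartbeatManager.unmonitorTarget(resourceID);
    }
    slotPool.releaseTaskManager(resourceID, cause);
    return CompletableFuture.completedFuture(Acknowledge.get());
}
{code}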

[jira] [Commented] (FLINK-15087) JobManager is forced to shutdown JVM due to temporary loss of zookeeper connection

2019-12-05 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989482#comment-16989482
 ] 

lamber-ken commented on FLINK-15087:


Hi, this may be a duplicate of FLINK-10052.

> JobManager is forced to shutdown JVM due to temporary loss of zookeeper 
> connection
> --
>
> Key: FLINK-15087
> URL: https://issues.apache.org/jira/browse/FLINK-15087
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 1.8.2
>Reporter: Abdul Qadeer
>Priority: Major
>
> While testing I found that the loss of connection with zookeeper triggers JVM 
> shutdown for Job Manager, when started through 
> "StandaloneSessionClusterEntrypoint". This happens due to a NPE on 
> "taskManagerHeartbeatManager."
> When JobManagerRunner suspends jobMasterService (as Job manager is no longer 
> leader), "taskManagerHeartbeatManager" is set to null in 
> "stopHeartbeatServices".
> Next, "AkkaRpcActor" stops JobMaster and throws NPE in the following method:
> {code:java}
> @Override
> public CompletableFuture<Acknowledge> disconnectTaskManager(final ResourceID 
> resourceID, final Exception cause) {
>log.debug("Disconnect TaskExecutor {} because: {}", resourceID, 
> cause.getMessage());
>taskManagerHeartbeatManager.unmonitorTarget(resourceID);
>slotPool.releaseTaskManager(resourceID, cause);
> {code}
>  
> This finally leads to a fatal error in "ClusterEntrypoint.onFatalError()" and 
> forces JVM shutdown.
> The stack trace is below:
>  
> {noformat}
> {"timeMillis":1575581120723,"thread":"flink-akka.actor.default-dispatcher-93","level":"ERROR","loggerName":"com.Sample","message":"Failed
>  to take leadership with session id 
> b4662db5-f065-41d9-aaaf-78625355b251.","thrown":{"commonElementCount":0,"localizedMessage":"Failed
>  to take leadership with session id 
> b4662db5-f065-41d9-aaaf-78625355b251.","message":"Failed to take leadership 
> with session id 
> b4662db5-f065-41d9-aaaf-78625355b251.","name":"org.apache.flink.runtime.dispatcher.DispatcherException","cause":{"commonElementCount":18,"localizedMessage":"Termination
>  of previous JobManager for job bbb8c430787d92293e9d45c349231d9c failed. 
> Cannot submit job under the same job id.","message":"Termination of previous 
> JobManager for job bbb8c430787d92293e9d45c349231d9c failed. Cannot submit job 
> under the same job 
> id.","name":"org.apache.flink.runtime.dispatcher.DispatcherException","cause":{"commonElementCount":6,"localizedMessage":"org.apache.flink.util.FlinkException:
>  Could not properly shut down the 
> JobManagerRunner","message":"org.apache.flink.util.FlinkException: Could not 
> properly shut down the 
> JobManagerRunner","name":"java.util.concurrent.CompletionException","cause":{"commonElementCount":6,"localizedMessage":"Could
>  not properly shut down the JobManagerRunner","message":"Could not properly 
> shut down the 
> JobManagerRunner","name":"org.apache.flink.util.FlinkException","cause":{"commonElementCount":13,"localizedMessage":"Failure
>  while stopping RpcEndpoint jobmanager_0.","message":"Failure while stopping 
> RpcEndpoint 
> jobmanager_0.","name":"org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException","cause":{"commonElementCount":13,"name":"java.lang.NullPointerException","extendedStackTrace":[{"class":"org.apache.flink.runtime.jobmaster.JobMaster","method":"disconnectTaskManager","file":"JobMaster.java","line":629,"exact":false,"location":"flink-runtime_2.11-1.8.2.jar","version":"1.8.2"},{"class":"org.apache.flink.runtime.jobmaster.JobMaster","method":"onStop","file":"JobMaster.java","line":346,"exact":false,"location":"flink-runtime_2.11-1.8.2.jar","version":"1.8.2"},{"class":"org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState",...

[jira] [Comment Edited] (FLINK-14984) Remove old WebUI

2019-11-29 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985263#comment-16985263
 ] 

lamber-ken edited comment on FLINK-14984 at 11/30/19 5:18 AM:
--

Hi [~chesnay], I think I can do it. Could you assign it to me?


was (Author: lamber-ken):
Hi [~chesnay], I think I can do it. Could anyone assign it to me?

> Remove old WebUI
> 
>
> Key: FLINK-14984
> URL: https://issues.apache.org/jira/browse/FLINK-14984
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Reporter: Chesnay Schepler
>Priority: Major
> Fix For: 1.10.0
>
>
> Following the discussion on the 
> [ML|https://lists.apache.org/thread.html/ae8528b620b51f6f8270b840a7d22c3b4231cd6f717f8280650a9be6@%3Cdev.flink.apache.org%3E],
>  remove the old WebUI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14984) Remove old WebUI

2019-11-29 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16985263#comment-16985263
 ] 

lamber-ken commented on FLINK-14984:


Hi [~chesnay], I think I can do it. Could anyone assign it to me?

> Remove old WebUI
> 
>
> Key: FLINK-14984
> URL: https://issues.apache.org/jira/browse/FLINK-14984
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Reporter: Chesnay Schepler
>Priority: Major
> Fix For: 1.10.0
>
>
> Following the discussion on the 
> [ML|https://lists.apache.org/thread.html/ae8528b620b51f6f8270b840a7d22c3b4231cd6f717f8280650a9be6@%3Cdev.flink.apache.org%3E],
>  remove the old WebUI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-3158) Shading does not remove google guava from flink-dist fat jar

2019-11-21 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979864#comment-16979864
 ] 

lamber-ken commented on FLINK-3158:
---

[~rmetzger]  (y)

> Shading does not remove google guava from flink-dist fat jar
> 
>
> Key: FLINK-3158
> URL: https://issues.apache.org/jira/browse/FLINK-3158
> Project: Flink
>  Issue Type: Bug
>Affects Versions: 0.10.1, 1.0.0
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Blocker
>
> It seems that guava somehow slipped our checks and made it into the 
> flink-dist fat jar again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14827) Add thread dump feature for jobmanager

2019-11-15 Thread lamber-ken (Jira)
lamber-ken created FLINK-14827:
--

 Summary: Add thread dump feature for jobmanager
 Key: FLINK-14827
 URL: https://issues.apache.org/jira/browse/FLINK-14827
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Web Frontend
Affects Versions: 1.9.1
Reporter: lamber-ken
 Fix For: 1.9.2


Add thread dump feature for jobmanager, so users can get thread information 
easily.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14816) Add thread dump feature for taskmanager

2019-11-15 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14816:
---
Summary: Add thread dump feature for taskmanager  (was: Add thread dump 
feature for jobmanager and taskmanager)

> Add thread dump feature for taskmanager
> ---
>
> Key: FLINK-14816
> URL: https://issues.apache.org/jira/browse/FLINK-14816
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.1
>    Reporter: lamber-ken
>Priority: Major
> Fix For: 1.9.2
>
> Attachments: screenshot-1.png
>
>
> Add thread dump feature for taskmanager, so users can get thread information 
> easily.
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14816) Add thread dump feature for jobmanager and taskmanager

2019-11-15 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14816:
---
Summary: Add thread dump feature for jobmanager and taskmanager  (was: Add 
thread dump feature for taskmanager)

> Add thread dump feature for jobmanager and taskmanager
> --
>
> Key: FLINK-14816
> URL: https://issues.apache.org/jira/browse/FLINK-14816
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.1
>    Reporter: lamber-ken
>Priority: Major
> Fix For: 1.9.2
>
> Attachments: screenshot-1.png
>
>
> Add thread dump feature for taskmanager, so users can get thread information 
> easily.
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14816) Add thread dump feature for taskmanager

2019-11-15 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14816:
---
Attachment: screenshot-1.png

> Add thread dump feature for taskmanager
> ---
>
> Key: FLINK-14816
> URL: https://issues.apache.org/jira/browse/FLINK-14816
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.1
>    Reporter: lamber-ken
>Priority: Major
> Fix For: 1.9.2
>
> Attachments: screenshot-1.png
>
>
> Add thread dump feature for taskmanager, so users can get thread information 
> easily.
>  !image-2019-11-15-17-41-12-576.png|thumbnail! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14816) Add thread dump feature for taskmanager

2019-11-15 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14816:
---
Description: 
Add thread dump feature for taskmanager, so users can get thread information 
easily.

 !screenshot-1.png! 

  was:
Add thread dump feature for taskmanager, so users can get thread information 
easily.

 !image-2019-11-15-17-41-12-576.png|thumbnail! 


> Add thread dump feature for taskmanager
> ---
>
> Key: FLINK-14816
> URL: https://issues.apache.org/jira/browse/FLINK-14816
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.1
>    Reporter: lamber-ken
>Priority: Major
> Fix For: 1.9.2
>
> Attachments: screenshot-1.png
>
>
> Add thread dump feature for taskmanager, so users can get thread information 
> easily.
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14816) Add thread dump feature for taskmanager

2019-11-15 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14816:
---
Attachment: (was: image-2019-11-15-17-41-12-576.png)

> Add thread dump feature for taskmanager
> ---
>
> Key: FLINK-14816
> URL: https://issues.apache.org/jira/browse/FLINK-14816
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Web Frontend
>Affects Versions: 1.9.1
>    Reporter: lamber-ken
>Priority: Major
> Fix For: 1.9.2
>
> Attachments: screenshot-1.png
>
>
> Add thread dump feature for taskmanager, so users can get thread information 
> easily.
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14816) Add thread dump feature for taskmanager

2019-11-15 Thread lamber-ken (Jira)
lamber-ken created FLINK-14816:
--

 Summary: Add thread dump feature for taskmanager
 Key: FLINK-14816
 URL: https://issues.apache.org/jira/browse/FLINK-14816
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Web Frontend
Affects Versions: 1.9.1
Reporter: lamber-ken
 Fix For: 1.9.2
 Attachments: image-2019-11-15-17-41-12-576.png

Add thread dump feature for taskmanager, so users can get thread information 
easily.

 !image-2019-11-15-17-41-12-576.png|thumbnail! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14614) add annotation location Java style rule

2019-11-05 Thread lamber-ken (Jira)
lamber-ken created FLINK-14614:
--

 Summary: add annotation location Java style rule
 Key: FLINK-14614
 URL: https://issues.apache.org/jira/browse/FLINK-14614
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Affects Versions: 1.9.1
Reporter: lamber-ken
 Fix For: 1.9.2


Check the location of annotations on language elements.
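
For reference, this presumably maps to Checkstyle's AnnotationLocation check (an assumption; the exact rule configuration is not shown in this issue). It enforces placement like the following:

{code:java}
public class AnnotationLocationExample {

    // Compliant: each annotation sits on its own line, directly above the
    // annotated element.
    @Deprecated
    @SuppressWarnings("unchecked")
    public void compliant() {
    }

    // Non-compliant under the rule: an annotation sharing a line with the
    // declaration, e.g. "@Deprecated public void bad() {}", would be flagged.
    public void notYetChecked() {
    }
}
{code}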



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-09-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14177:
---
Description: 
According to FLINK-10052 and FLINK-13417, we need to upgrade the version of 
CuratorFramework first.

Curator 4.2.0 supports:
1) zk 3.4.* and zk 3.5.*
2) connectionStateErrorPolicy


  was:
According to FLINK-10052 and FLINK-14117, we need to upgrade the version of 
CuratorFramework first.

Curator 4.2.0 supports:
1) zk 3.4.* and zk 3.5.*
2) connectionStateErrorPolicy



> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hadoop Compatibility, Runtime / 
> Checkpointing
>Affects Versions: 1.8.1, 1.9.0
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
> Fix For: 1.9.1
>
>
> According to FLINK-10052 and FLINK-13417, we need to upgrade the version of 
> CuratorFramework first.
> Curator 4.2.0 supports:
> 1) zk 3.4.* and zk 3.5.*
> 2) connectionStateErrorPolicy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-09-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14177:
---
Description: 
According to FLINK-10052 and FLINK-14117, we need to upgrade the version of 
CuratorFramework first.

Curator 4.2.0 supports:
1) zk 3.4.* and zk 3.5.*
2) connectionStateErrorPolicy


  was:
According to FLINK-10052 and FLINK-14177, we need to upgrade the version of 
CuratorFramework first.

Curator 4.2.0 supports:
1) zk 3.4.* and zk 3.5.*
2) connectionStateErrorPolicy



> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hadoop Compatibility, Runtime / 
> Checkpointing
>Affects Versions: 1.8.1, 1.9.0
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
> Fix For: 1.9.1
>
>
> According to FLINK-10052 and FLINK-14117, we need to upgrade the version of 
> CuratorFramework first.
> Curator 4.2.0 supports:
> 1) zk 3.4.* and zk 3.5.*
> 2) connectionStateErrorPolicy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-09-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14177:
---
Description: 
According to FLINK-10052 and FLINK-14177, we need to upgrade the version of 
CuratorFramework first.

Curator 4.2.0 supports:
1) zk 3.4.* and zk 3.5.*
2) connectionStateErrorPolicy


  was:
According to FLINK-10052 and FLINK-14177, we need to upgrade the version of 
CuratorFramework first.

Curator 4.2.0 supports: 1) zk 3.4.* and zk 3.5.*, 2) connectionStateErrorPolicy



> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hadoop Compatibility, Runtime / 
> Checkpointing
>Affects Versions: 1.8.1, 1.9.0
>Reporter: lamber-ken
>Priority: Major
> Fix For: 1.9.1
>
>
> According to FLINK-10052 and FLINK-14117, we need to upgrade the version of 
> CuratorFramework first.
> Curator 4.2.0 supports:
> 1) zk 3.4.* and zk 3.5.*
> 2) connectionStateErrorPolicy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-09-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14177:
---
Description: 
According to FLINK-10052 and FLINK-14177, we need to upgrade the version of 
CuratorFramework first.

Curator 4.2.0 supports: 1) zk 3.4.* and zk 3.5.*, 2) connectionStateErrorPolicy


  was: According to FLINK-10052 and FLINK-14177


> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hadoop Compatibility, Runtime / 
> Checkpointing
>Affects Versions: 1.8.1, 1.9.0
>Reporter: lamber-ken
>Priority: Major
> Fix For: 1.9.1
>
>
> According to FLINK-10052 and FLINK-14177, we need to upgrade the version of 
> CuratorFramework first.
> Curator 4.2.0 supports: 1) zk 3.4.* and zk 3.5.*, 2) connectionStateErrorPolicy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-09-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14177:
---
Description: According to FLINK-10052 and FLINK-14177

> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hadoop Compatibility, Runtime / 
> Checkpointing
>Affects Versions: 1.8.1, 1.9.0
>Reporter: lamber-ken
>Priority: Major
> Fix For: 1.9.1
>
>
> According to FLINK-10052 and FLINK-14177



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-09-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14177:
---
Component/s: Runtime / Checkpointing
 Connectors / Hadoop Compatibility

> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>  Components: Connectors / Hadoop Compatibility, Runtime / 
> Checkpointing
>Affects Versions: 1.8.1, 1.9.0
>Reporter: lamber-ken
>Priority: Major
> Fix For: 1.9.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-09-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14177:
---
Fix Version/s: 1.9.1

> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 1.8.1, 1.9.0
>        Reporter: lamber-ken
>Priority: Major
> Fix For: 1.9.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-14177) Bump Curator to 3.5.5

2019-09-23 Thread lamber-ken (Jira)
lamber-ken created FLINK-14177:
--

 Summary: Bump Curator to 3.5.5
 Key: FLINK-14177
 URL: https://issues.apache.org/jira/browse/FLINK-14177
 Project: Flink
  Issue Type: Improvement
Reporter: lamber-ken






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-09-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14177:
---
Affects Version/s: 1.8.1
   1.9.0

> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>Affects Versions: 1.8.1, 1.9.0
>        Reporter: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14177) Bump Curator From 2.12.0 to 4.2.0

2019-09-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-14177:
---
Summary: Bump Curator From 2.12.0 to 4.2.0  (was: Bump Curator to 3.5.5)

> Bump Curator From 2.12.0 to 4.2.0
> -
>
> Key: FLINK-14177
> URL: https://issues.apache.org/jira/browse/FLINK-14177
> Project: Flink
>  Issue Type: Improvement
>    Reporter: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-11420) Serialization of case classes containing a Map[String, Any] sometimes throws ArrayIndexOutOfBoundsException

2019-09-12 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-11420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928298#comment-16928298
 ] 

lamber-ken commented on FLINK-11420:


(y)

> Serialization of case classes containing a Map[String, Any] sometimes throws 
> ArrayIndexOutOfBoundsException
> ---
>
> Key: FLINK-11420
> URL: https://issues.apache.org/jira/browse/FLINK-11420
> Project: Flink
>  Issue Type: Bug
>  Components: API / Type Serialization System
>Affects Versions: 1.7.1
>Reporter: Jürgen Kreileder
>Assignee: Dawid Wysakowicz
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.7.3, 1.8.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We frequently run into random ArrayIndexOutOfBounds exceptions when flink 
> tries to serialize Scala case classes containing a Map[String, Any] (Any 
> being String, Long, Int, or Boolean) with the FsStateBackend. (This probably 
> happens with any case class containing a type requiring Kryo, see this thread 
> for instance: 
> [http://mail-archives.apache.org/mod_mbox/flink-user/201710.mbox/%3cCANNGFpjX4gjV=df6tlfeojsb_rhwxs_ruoylqcqv2gvwqtt...@mail.gmail.com%3e])
> Disabling asynchronous snapshots seems to work around the problem, so maybe 
> something is not thread-safe in CaseClassSerializer.
> Our objects look like this:
> {code}
> case class Event(timestamp: Long, [...], content: Map[String, Any])
> case class EnrichedEvent(event: Event, additionalInfo: Map[String, Any])
> {code}
> I've looked at a few of the exceptions in a debugger. It always happens when 
> serializing the right-hand side of a tuple from EnrichedEvent -> Event -> 
> content, e.g.: 13 from ("foo", 13) or false from ("bar", false).
> Stacktrace:
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: Index -1 out of bounds for length 0
>  at com.esotericsoftware.kryo.util.IntArray.pop(IntArray.java:157)
>  at com.esotericsoftware.kryo.Kryo.reference(Kryo.java:822)
>  at com.esotericsoftware.kryo.Kryo.copy(Kryo.java:863)
>  at 
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:217)
>  at 
> org.apache.flink.api.scala.typeutils.CaseClassSerializer.copy(CaseClassSerializer.scala:101)
>  at 
> org.apache.flink.api.scala.typeutils.CaseClassSerializer.copy(CaseClassSerializer.scala:32)
>  at 
> org.apache.flink.api.scala.typeutils.TraversableSerializer.$anonfun$copy$1(TraversableSerializer.scala:69)
>  at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:465)
>  at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:465)
>  at 
> org.apache.flink.api.scala.typeutils.TraversableSerializer.copy(TraversableSerializer.scala:69)
>  at 
> org.apache.flink.api.scala.typeutils.TraversableSerializer.copy(TraversableSerializer.scala:33)
>  at 
> org.apache.flink.api.scala.typeutils.CaseClassSerializer.copy(CaseClassSerializer.scala:101)
>  at 
> org.apache.flink.api.scala.typeutils.CaseClassSerializer.copy(CaseClassSerializer.scala:32)
>  at 
> org.apache.flink.api.scala.typeutils.CaseClassSerializer.copy(CaseClassSerializer.scala:101)
>  at 
> org.apache.flink.api.scala.typeutils.CaseClassSerializer.copy(CaseClassSerializer.scala:32)
>  at 
> org.apache.flink.api.common.typeutils.base.ListSerializer.copy(ListSerializer.java:99)
>  at 
> org.apache.flink.api.common.typeutils.base.ListSerializer.copy(ListSerializer.java:42)
>  at 
> org.apache.flink.runtime.state.heap.CopyOnWriteStateTable.get(CopyOnWriteStateTable.java:287)
>  at 
> org.apache.flink.runtime.state.heap.CopyOnWriteStateTable.get(CopyOnWriteStateTable.java:311)
>  at 
> org.apache.flink.runtime.state.heap.HeapListState.add(HeapListState.java:95)
>  at 
> org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement(WindowOperator.java:391)
>  at 
> org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
>  at 
> org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
>  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>  at java.base/java.lang.Thread.run(Thread.java:834){code}
>  
>  
>  
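
The workaround mentioned in the description (disabling asynchronous snapshots) can be expressed roughly as below; a sketch assuming the FsStateBackend constructor that takes an asynchronousSnapshots flag, with a placeholder checkpoint URI:

{code:java}
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SyncSnapshotWorkaround {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Second argument disables asynchronous snapshots, avoiding the
        // suspected concurrent use of CaseClassSerializer during checkpoints.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints", false));
        // ... build and execute the job as usual ...
    }
}
{code}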



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-11218) fix the default restart strategy delay value

2019-08-30 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-11218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919579#comment-16919579
 ] 

lamber-ken commented on FLINK-11218:


ok

> fix the default restart strategy delay value
> 
>
> Key: FLINK-11218
> URL: https://issues.apache.org/jira/browse/FLINK-11218
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.6.2, 1.7.2
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since version 1.6.x, the default restart strategy was moved from ExecutionGraph to 
> the backend.
> For now, NoOrFixedIfCheckpointingEnabledRestartStrategyFactory is used to generate 
> the default FixedDelayRestartStrategy, but the `DEFAULT_RESTART_DELAY` value is 0.
> It will loop endlessly when an operator always throws errors, exhausting the 
> CPU and MEM limits.
> So a default `DEFAULT_RESTART_DELAY` value of 1L would be better.
>  
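
Until the default changes, a non-zero delay can be set explicitly per job; a minimal sketch (attempt count and delay are placeholder values):

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartDelayExample {

    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Fixed-delay restart: 3 attempts with 10 seconds between them, so a
        // permanently failing operator cannot restart in a tight loop.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
    }
}
{code}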



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-11218) fix the default restart strategy delay value

2019-08-30 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-11218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919574#comment-16919574
 ] 

lamber-ken commented on FLINK-11218:


Hi [~till.rohrmann], no problem.

> fix the default restart strategy delay value
> 
>
> Key: FLINK-11218
> URL: https://issues.apache.org/jira/browse/FLINK-11218
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.6.2, 1.7.2
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Since version 1.6.x, the default restart strategy was moved from ExecutionGraph to 
> the backend.
> For now, NoOrFixedIfCheckpointingEnabledRestartStrategyFactory is used to generate 
> the default FixedDelayRestartStrategy, but the `DEFAULT_RESTART_DELAY` value is 0.
> It will loop endlessly when an operator always throws errors, exhausting the 
> CPU and MEM limits.
> So a default `DEFAULT_RESTART_DELAY` value of 1L would be better.
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-13787) PrometheusPushGatewayReporter does not cleanup TM metrics when run on kubernetes

2019-08-30 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16919417#comment-16919417
 ] 

lamber-ken commented on FLINK-13787:


[~fly_in_gis] [~kaibo.zhou]

Hi, you can use this fork[1].

[1] [https://github.com/dinumathai/pushgateway]

> PrometheusPushGatewayReporter does not cleanup TM metrics when run on 
> kubernetes
> 
>
> Key: FLINK-13787
> URL: https://issues.apache.org/jira/browse/FLINK-13787
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.7.2, 1.8.1, 1.9.0
>Reporter: Kaibo Zhou
>Priority: Major
>
> I have run a flink job on kubernetes with the PrometheusPushGatewayReporter, and I 
> can see the metrics from the flink jobmanager and taskmanager in the push 
> gateway's UI.
> When I cancel the job, I found the jobmanager's metrics disappear, but the 
> taskmanager's metrics still exist, even though I have set 
> _deleteOnShutdown_ to true.
> The configuration is:
> {code:java}
> metrics.reporters: "prom"
> metrics.reporter.prom.class: 
> "org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter"
> metrics.reporter.prom.jobName: "WordCount"
> metrics.reporter.prom.host: "localhost"
> metrics.reporter.prom.port: "9091"
> metrics.reporter.prom.randomJobNameSuffix: "true"
> metrics.reporter.prom.filterLabelValueCharacters: "true"
> metrics.reporter.prom.deleteOnShutdown: "true"
> {code}
>  
> Other people have also encountered this problem: 
> [https://stackoverflow.com/questions/54420498/flink-prometheus-push-gateway-reporter-delete-metrics-on-job-shutdown].
>   And another similar issue: FLINK-11457.
>  
> As Prometheus is a very important metrics system on kubernetes, solving 
> this problem would be beneficial for users monitoring their flink jobs.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-11457) PrometheusPushGatewayReporter does not cleanup its metrics

2019-08-20 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-11457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911889#comment-16911889
 ] 

lamber-ken commented on FLINK-11457:


hi, [~opwvhk]

In our production environment, we also met the same problem. If pushgateway 
implemented `TTL for pushed metrics`[1], it would be very useful. But for now, we use 
an external scheduling system to check whether the flink job is alive or not, then 
delete its metrics via pushgateway's REST API[2].

[1][https://github.com/prometheus/pushgateway/issues/19]

[2][https://github.com/prometheus/pushgateway#delete-method]
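
For reference, the delete endpoint[2] is a plain HTTP DELETE on the job's metric group. A minimal sketch (host, port, and job name are placeholders; requires Java 11+ for java.net.http):

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PushgatewayCleanup {

    public static void main(String[] args) throws Exception {
        // Deletes all metrics pushed under the given job name.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9091/metrics/job/WordCount"))
                .DELETE()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Pushgateway responded: " + response.statusCode());
    }
}
{code}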

> PrometheusPushGatewayReporter does not cleanup its metrics
> --
>
> Key: FLINK-11457
> URL: https://issues.apache.org/jira/browse/FLINK-11457
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Reporter: Oscar Westra van Holthe - Kind
>Priority: Major
>
> When cancelling a job running on a yarn based cluster and then shutting down 
> the cluster, metrics on the push gateway are not deleted.
> My yarn-conf.yaml settings:
> {code:yaml}
> metrics.reporters: promgateway
> metrics.reporter.promgateway.class: 
> org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
> metrics.reporter.promgateway.host: pushgateway.gcpstg.bolcom.net
> metrics.reporter.promgateway.port: 9091
> metrics.reporter.promgateway.jobName: PSMF
> metrics.reporter.promgateway.randomJobNameSuffix: true
> metrics.reporter.promgateway.deleteOnShutdown: true
> metrics.reporter.promgateway.interval: 30 SECONDS
> {code}
> What I expect to happen:
> * when running, the metrics are pushed to the push gateway to a separate 
> label per node (jobmanager/taskmanager)
> * when shutting down, the metrics are deleted from the push gateway
> This last bit does not happen.
> How the job is run:
> {code}flink run -m yarn-cluster -yn 5 -ys 2 -yst 
> "$INSTALL_DIRECTORY/app/psmf.jar"{code} 
> How the job is stopped:
> {code}
> YARN_APP_ID=$(yarn application -list | grep "PSMF" | awk '{print $1}')
> FLINK_JOB_ID=$(flink list -r -yid ${YARN_APP_ID} | grep "PSMF" | awk '{print 
> $4}')
> flink cancel -s "${SAVEPOINT_DIR%/}/" -yid "${YARN_APP_ID}" "${FLINK_JOB_ID}"
> echo "stop" | yarn-session.sh -id ${YARN_APP_ID}
> {code} 
> Is there anything I'm doing wrong? Anything I can help fix?



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-13787) PrometheusPushGatewayReporter does not cleanup TM metrics when run on kubernetes

2019-08-20 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911888#comment-16911888
 ] 

lamber-ken commented on FLINK-13787:


hi, [~kaibo.zhou].

In our production environment, we also met the same problem. If pushgateway 
implemented `TTL for pushed metrics`[1], it would be very useful. But for now, we use 
an external scheduling system to check whether the flink job is alive or not, then 
delete its metrics via pushgateway's REST API[2].

[1][https://github.com/prometheus/pushgateway/issues/19]

[2][https://github.com/prometheus/pushgateway#delete-method]

 

> PrometheusPushGatewayReporter does not cleanup TM metrics when run on 
> kubernetes
> 
>
> Key: FLINK-13787
> URL: https://issues.apache.org/jira/browse/FLINK-13787
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics
>Affects Versions: 1.7.2, 1.8.1, 1.9.0
>Reporter: Kaibo Zhou
>Priority: Major
>
> I have run a flink job on kubernetes with the PrometheusPushGatewayReporter, and I 
> can see the metrics from the flink jobmanager and taskmanager in the push 
> gateway's UI.
> When I cancel the job, I found the jobmanager's metrics disappear, but the 
> taskmanager's metrics still exist, even though I have set 
> _deleteOnShutdown_ to true.
> The configuration is:
> {code:java}
> metrics.reporters: "prom"
> metrics.reporter.prom.class: 
> "org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter"
> metrics.reporter.prom.jobName: "WordCount"
> metrics.reporter.prom.host: "localhost"
> metrics.reporter.prom.port: "9091"
> metrics.reporter.prom.randomJobNameSuffix: "true"
> metrics.reporter.prom.filterLabelValueCharacters: "true"
> metrics.reporter.prom.deleteOnShutdown: "true"
> {code}
>  
> Other people have also encountered this problem: 
> [https://stackoverflow.com/questions/54420498/flink-prometheus-push-gateway-reporter-delete-metrics-on-job-shutdown].
>   And another similar issue: FLINK-11457.
>  
> As Prometheus is a very important metrics system on kubernetes, solving 
> this problem would be beneficial for users monitoring their flink jobs.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-23 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891015#comment-16891015
 ] 

lamber-ken commented on FLINK-10052:


Hi [~quan]

The main point of this issue is how +LeaderLatch+ deals with leadership 
when the network disconnects.

For your question, FLINK-10333 may meet your needs.

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.
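
The intended behavior can be illustrated with a Curator ConnectionStateListener that defers revocation until the session is actually lost; a sketch of the idea only, not Flink's actual implementation:

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Revokes leadership only on LOST (session expiration); a SUSPENDED
// connection just waits for RECONNECTED or LOST.
public class TolerantConnectionListener implements ConnectionStateListener {

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                System.out.println("Connection suspended; keeping leadership for now.");
                break;
            case RECONNECTED:
                System.out.println("Reconnected before session expiry; leadership kept.");
                break;
            case LOST:
                System.out.println("Session lost; revoking leadership.");
                // ... revoke leadership here ...
                break;
            default:
                break;
        }
    }
}
{code}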



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (FLINK-13189) Fix the impact of zookeeper network disconnect temporarily on flink long running jobs

2019-07-18 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed FLINK-13189.
--
   Resolution: Duplicate
Fix Version/s: (was: 1.9.0)
 Release Note: duplicate to FLINK-10052

> Fix the impact of zookeeper network disconnect temporarily on flink long 
> running jobs
> -
>
> Key: FLINK-13189
> URL: https://issues.apache.org/jira/browse/FLINK-13189
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.1
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> *Issue detail info*
> We deploy flink streaming jobs on a hadoop cluster in per-job mode and use 
> zookeeper as the HighAvailabilityService, but we found that flink jobs will 
> restart because the network was disconnected temporarily between the 
> jobmanager and zookeeper.
> So we analyzed this problem deeply. The Flink JobManager uses curator's 
> `+LeaderLatch+` to maintain leadership. When the network disconnects, the 
> `+LeaderLatch+` revokes leadership directly. We think it's too brutal that 
> many long-running flink jobs will restart because of a network shake.
>  
> *Fix this issue*
> From the curator official website, we found that this issue was fixed in 
> curator-3.x.x, but we can't just change the flink curator version (2.12.0) 
> to 3.x.x because of zk compatibility. Curator-2.x.x supports zookeeper-3.4.x 
> and zookeeper-3.5.0, while curator-3.x.x is only compatible with ZooKeeper 3.5.x. 
> Based on the above considerations, we updated `LeaderLatch` in the 
> flink-shaded-curator module.
>  
> *Other*
> Any suggestions are welcome, thanks.
>  
> *Useful links*
> [https://curator.apache.org/zk-compatibility.html] 
>  [https://cwiki.apache.org/confluence/display/CURATOR/Releases] 
>  [http://curator.apache.org/curator-recipes/leader-latch.html]
>   



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Issue Comment Deleted] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-17 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-10052:
---
Comment: was deleted

(was: [~Tison], 
btw, which way is better: update PR#9066 or create a new PR that points to this 
issue? 
What do you think? Thanks.)

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-17 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887276#comment-16887276
 ] 

lamber-ken commented on FLINK-10052:


[~Tison], 
btw, which way is better: update PR#9066 or create a new PR that points to this 
issue? 
What do you think? Thanks.

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-17 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887268#comment-16887268
 ] 

lamber-ken edited comment on FLINK-10052 at 7/17/19 5:28 PM:
-

[~Tison] yes, I checked the shaded class file with the +javap -v+ command.

The main thing is that the maven-shade-plugin helps us relocate 
+org.apache.zookeeper.ClientCnxn$EventThread+.

I'll update PR#9066. It would be great if you could help me review the PR. 


was (Author: lamber-ken):
[~Tison] yes, I checked the shaded class file with the +javap -v+ command; I'll update 
PR#9066. It would be great if you could help me review the PR.

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-17 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887268#comment-16887268
 ] 

lamber-ken commented on FLINK-10052:


[~Tison] yes, I checked the shaded class file with the +javap -v+ command; I'll update 
PR#9066. It would be great if you could help me review the PR.

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-17 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886815#comment-16886815
 ] 

lamber-ken commented on FLINK-10052:


[~Tison], OK, I see.

I had read the doc before; your ideas and the design are great.

Currently, based on these, it may be better to fix the zookeeper 
connection problem the way PR#9066 does: hacky, but correct. 

Or we can find a good way to fix CURATOR-532, then fix this issue later.

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-17 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886699#comment-16886699
 ] 

lamber-ken edited comment on FLINK-10052 at 7/17/19 6:01 AM:
-

[~Tison] (y), I have some points to discuss with you.

First, regarding your first point: I considered creating a new Curator Jira 
yesterday, like your CURATOR-532, so that users could manually configure 
ZooKeeper 3.4.x compatibility, but I gave up on that idea because it also 
needs to reflect on +org.apache.zookeeper.ClientCnxn$EventThread+, which may 
throw ClassNotFoundException because of shading. See 
[InjectSessionExpiration|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/InjectSessionExpiration.java] for more detail.

Second, regarding your second point: I am not familiar with LeaderSelector 
yet and am still learning about it. I also think that, ideally, we could just 
use SessionConnectionStateErrorPolicy directly in curator-4.x (a minimal 
sketch follows below).

Third, I don't understand the meaning of a Flink-scoped leader latch.
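
As an aside, here is a minimal sketch, assuming curator-framework 4.x on the 
classpath, of how a client could opt into that policy. The class name 
SessionPolicyExample and the connect string are made up for illustration; 
this is not Flink code:

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.SessionConnectionStateErrorPolicy;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class SessionPolicyExample {
    public static void main(String[] args) {
        // With SessionConnectionStateErrorPolicy, only LOST (session
        // expiration) counts as an error state, so recipes that honor the
        // policy keep leadership across a temporary SUSPENDED.
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString("localhost:2181") // hypothetical quorum address
                .retryPolicy(new ExponentialBackoffRetry(1000, 3))
                .connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())
                .build();
        client.start();
    }
}
{code}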

 


was (Author: lamber-ken):
[~Tison] (y), I have some points to discuss with you.

First, regarding your first point: I considered creating a new Curator Jira 
yesterday, like your CURATOR-532, so that users could manually configure 
ZooKeeper 3.4.x compatibility, but I gave up on that idea because it also 
needs to reflect on +org.apache.zookeeper.ClientCnxn$EventThread+, which may 
throw ClassNotFoundException because of shading. See 
[InjectSessionExpiration|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/InjectSessionExpiration.java] for more detail.

Second, regarding your second point: I am not familiar with LeaderSelector 
yet and am still learning about it. I also think that, ideally, we could just 
use SessionConnectionStateErrorPolicy directly in curator-4.x.

Third, I don't understand the meaning of a Flink-scoped leader latch.

 

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-17 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886699#comment-16886699
 ] 

lamber-ken commented on FLINK-10052:


[~Tison] (y), I have some points to discuss with you.

First, regarding your first point: I considered creating a new Curator Jira 
yesterday, like your CURATOR-532, so that users could manually configure 
ZooKeeper 3.4.x compatibility, but I gave up on that idea because it also 
needs to reflect on +org.apache.zookeeper.ClientCnxn$EventThread+, which may 
throw ClassNotFoundException because of shading. See 
[InjectSessionExpiration|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/InjectSessionExpiration.java] for more detail.

Second, regarding your second point: I am not familiar with LeaderSelector 
yet and am still learning about it. I also think that, ideally, we could just 
use SessionConnectionStateErrorPolicy directly in curator-4.x.

Third, I don't understand the meaning of a Flink-scoped leader latch.

 

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-16 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1688#comment-1688
 ] 

lamber-ken commented on FLINK-10052:


Hi, [~Tison],

We can learn about the difference between LeaderLatch and LeaderSelector in 
the Apache Curator framework here: 
[leaderlatch-vs-leaderselector|https://stackoverflow.com/questions/17998616/leaderlatch-vs-leaderselector].

BTW, I'm currently trying to implement it with LeaderSelector; a sketch of 
the idea follows.
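
A minimal sketch, assuming the Curator recipes API, of a LeaderSelector 
listener that tolerates SUSPENDED and gives up leadership only on LOST. The 
class name TolerantLeaderListener is made up for illustration; this is not 
Flink's implementation:

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.CancelLeadershipException;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListener;
import org.apache.curator.framework.state.ConnectionState;

public class TolerantLeaderListener implements LeaderSelectorListener {

    @Override
    public void takeLeadership(CuratorFramework client) throws Exception {
        // Leadership is held for as long as this method does not return.
        Thread.currentThread().join();
    }

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        // Unlike LeaderLatch, we decide here: cancel leadership only on LOST
        // (session expired) and tolerate a temporary SUSPENDED state.
        if (newState == ConnectionState.LOST) {
            throw new CancelLeadershipException();
        }
    }
}
{code}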

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-16 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886663#comment-16886663
 ] 

lamber-ken edited comment on FLINK-10052 at 7/17/19 4:06 AM:
-

[~Tison], right, upgrading the Curator dependency would ideally be the better 
way to fix this, but there's a problem: curator-4.x detects the ZooKeeper 
version by testing whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is on 
the classpath, like below.
{code:java}
Class.forName("org.apache.zookeeper.admin.ZooKeeperAdmin");
{code}
But the flink-runtime module shades +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper+, so the detection 
fails.

I see two ways to fix this issue:
First, rewrite +LeaderLatch#handleStateChange+ in the flink-shaded-curator 
module, like [PR#9066|https://github.com/apache/flink/pull/9066], as sketched 
below.
Second, it could also be achieved by using Curator's LeaderSelector instead 
of the LeaderLatch, as mentioned in the issue description.
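
For reference, a minimal sketch of the first approach, assuming a patched 
(shaded) copy of LeaderLatch where setLeadership() is available. This only 
illustrates the idea and is not the exact PR#9066 change:

{code:java}
// Inside the patched (shaded) copy of LeaderLatch; RECONNECTED handling omitted.
private void handleStateChange(ConnectionState newState) {
    switch (newState) {
        case LOST:
            // The ZooKeeper session expired, so the ephemeral latch node is
            // gone and leadership is really lost.
            setLeadership(false);
            break;
        case SUSPENDED:
            // Tolerated: the connection may come back before the session
            // expires, in which case the latch node still exists.
            break;
        default:
            break;
    }
}
{code}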







was (Author: lamber-ken):
[~Tison], right, upgrading the Curator dependency would ideally be the better 
way to fix this, but there's a problem: curator-4.x detects the ZooKeeper 
version by testing whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is on 
the classpath, like below.
{code:java}
Class.forName("org.apache.zookeeper.admin.ZooKeeperAdmin");
{code}
But the flink-runtime module shades +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper+, so the detection 
fails.

I see two ways to fix this issue:
First, rewrite +LeaderLatch#handleStateChange+ in the flink-shaded-curator 
module, like [PR#9066|https://github.com/apache/flink/pull/9066].
Second, it could also be achieved by using Curator's LeaderSelector instead 
of the LeaderLatch, as mentioned in the issue description.






> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-16 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886663#comment-16886663
 ] 

lamber-ken commented on FLINK-10052:


[~Tison], right, upgrading the Curator dependency would ideally be the better 
way to fix this, but there's a problem: curator-4.x detects the ZooKeeper 
version by testing whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is on 
the classpath, like below.
{code:java}
Class.forName("org.apache.zookeeper.admin.ZooKeeperAdmin");
{code}
But the flink-runtime module shades +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper+, so the detection 
fails.

I see two ways to fix this issue:
First, rewrite +LeaderLatch#handleStateChange+ in the flink-shaded-curator 
module, like [PR#9066|https://github.com/apache/flink/pull/9066].
Second, it could also be achieved by using Curator's LeaderSelector instead 
of the LeaderLatch, as mentioned in the issue description.






> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-13 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884567#comment-16884567
 ] 

lamber-ken edited comment on FLINK-10052 at 7/14/19 5:30 AM:
-

Hi, all. 

BTW, if we upgrade the Curator dependency to 4.x, there's a problem: 
curator-4.x detects the ZooKeeper version by testing whether 
+org.apache.zookeeper.admin.ZooKeeperAdmin+ is on the classpath.

But the flink-runtime module shades +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper+, so simply upgrading 
Curator's version will not fix this issue.

Here is [Curator 
Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java]; a rough sketch of the detection pattern follows.
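
A rough sketch of the detection pattern used by Curator's Compatibility class 
linked above (the class name ZooKeeperVersionProbe is made up for 
illustration): probe for a ZooKeeper 3.5+ class by name. Under shading the 
class is relocated, so the probe fails even when ZooKeeper 3.5+ is present:

{code:java}
public final class ZooKeeperVersionProbe {
    public static boolean hasZooKeeperAdmin() {
        try {
            // Present only in unshaded ZooKeeper 3.5+.
            Class.forName("org.apache.zookeeper.admin.ZooKeeperAdmin");
            return true;
        } catch (ClassNotFoundException e) {
            // ZooKeeper 3.4.x, or the class was relocated by shading.
            return false;
        }
    }
}
{code}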


was (Author: lamber-ken):
Hi, all. 

BTW, if we upgrade the Curator dependency to 4.x, there's a problem: 
curator-4.x detects the ZooKeeper version by testing whether 
+org.apache.zookeeper.admin.ZooKeeperAdmin+ is on the classpath.

But the flink-runtime module shades +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper+, so simply upgrading 
Curator's version will not fix this issue.

Here is [Curator 
Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java].

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-13 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884567#comment-16884567
 ] 

lamber-ken edited comment on FLINK-10052 at 7/14/19 5:29 AM:
-

Hi, all. 

BTW, if we upgrade the Curator dependency to 4.x, there's a problem: 
curator-4.x detects the ZooKeeper version by testing whether 
+org.apache.zookeeper.admin.ZooKeeperAdmin+ is on the classpath.

But the flink-runtime module shades +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper+, so simply upgrading 
Curator's version will not fix this issue.

Here is [Curator 
Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java].


was (Author: lamber-ken):
Hi, all. 

BTW, if we upgrade the Curator dependency to 4.x, there's a problem: 
curator-4.x detects the ZooKeeper version by testing whether 
+org.apache.zookeeper.admin.ZooKeeperAdmin+ is on the classpath.

But the flink-runtime module shades +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper+, so simply upgrading 
Curator's version will not fix this issue.

Here is [Curator 
Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java].

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-13 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884567#comment-16884567
 ] 

lamber-ken commented on FLINK-10052:


Hi, all. 

BTW, if we upgrade the Curator dependency to 4.x, there's a problem: 
curator-4.x detects the ZooKeeper version by testing whether 
+org.apache.zookeeper.admin.ZooKeeperAdmin+ is on the classpath.

But the flink-runtime module shades +org.apache.zookeeper+ to 
+org.apache.flink.shaded.zookeeper.org.apache.zookeeper+, so simply upgrading 
Curator's version will not fix this issue.

Here is [Curator 
Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java].

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-13189) Fix the impact of zookeeper network disconnect temporarily on flink long running jobs

2019-07-11 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-13189:
---
Component/s: (was: Runtime / Network)
 Runtime / Coordination

> Fix the impact of zookeeper network disconnect temporarily on flink long 
> running jobs
> -
>
> Key: FLINK-13189
> URL: https://issues.apache.org/jira/browse/FLINK-13189
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.8.1
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Issue detail info*
> We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use 
> ZooKeeper as the HighAvailabilityService, but we found that Flink jobs will 
> restart because the network was disconnected temporarily between the 
> JobManager and ZooKeeper.
> So we analyzed this problem deeply. The Flink JobManager uses Curator's 
> `+LeaderLatch+` to maintain leadership. When the network disconnects, the 
> `+LeaderLatch+` revokes leadership directly. We think it's too brutal that 
> many long-running Flink jobs restart because of a network shake.
>  
> *Fix this issue*
> From the Curator official website, we found that this issue was fixed in 
> curator-3.x.x, but we can't just change the Flink Curator version (2.12.0) 
> to 3.x.x because of ZooKeeper compatibility: curator-2.x.x supports 
> zookeeper-3.4.x and zookeeper-3.5.0, while curator-3.x.x is only compatible 
> with ZooKeeper 3.5.x. Based on the above considerations, we update 
> `LeaderLatch` in the flink-shaded-curator module.
>  
> *Other*
> Any suggestions are welcome, thanks
>  
> *Useful links*
> [https://curator.apache.org/zk-compatibility.html] 
>  [https://cwiki.apache.org/confluence/display/CURATOR/Releases] 
>  [http://curator.apache.org/curator-recipes/leader-latch.html]
>   



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-11 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-10052:
---
Affects Version/s: 1.8.1

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-11 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882859#comment-16882859
 ] 

lamber-ken commented on FLINK-10052:


Hi all, I'm glad to fix this issue; as more and more jobs are deployed in a 
cluster, the impact becomes very large. My thoughts are as follows: 

1. If we upgrade the Curator dependency to 3.x or 4.x, users would also need 
to upgrade their ZooKeeper cluster version, because those releases are only 
compatible with ZooKeeper 3.5.x. That is not a good option.

2. So I think it may be better to rewrite +LeaderLatch#handleStateChange+ in 
the flink-shaded-curator module.

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-10 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882303#comment-16882303
 ] 

lamber-ken edited comment on FLINK-10052 at 7/10/19 5:46 PM:
-

Thanks for reminding me, [~elevy], that I created a duplicate Jira, 
[FLINK-13189|https://issues.apache.org/jira/browse/FLINK-13189]. I think 
maybe nobody had solved this issue before, judging from the latest Flink code 
on GitHub.

I solved this problem by rewriting +LeaderLatch#handleStateChange+ in the 
flink-shaded-curator module; any suggestion is welcome, thanks.


was (Author: lamber-ken):
Hi all, I'm very sorry that I created a duplicate Jira, 
[FLINK-13189|https://issues.apache.org/jira/browse/FLINK-13189]. I think 
maybe nobody had solved this issue before, judging from the latest Flink code 
on GitHub. 

I solved this problem by rewriting +LeaderLatch#handleStateChange+ in the 
flink-shaded-curator module; any suggestion is welcome, thanks.

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2019-07-10 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882303#comment-16882303
 ] 

lamber-ken commented on FLINK-10052:


Hi all, I'm very sorry that I created a duplicate Jira, 
[FLINK-13189|https://issues.apache.org/jira/browse/FLINK-13189]. I think 
maybe nobody had solved this issue before, judging from the latest Flink code 
on GitHub. 

I solved this problem by rewriting +LeaderLatch#handleStateChange+ in the 
flink-shaded-curator module; any suggestion is welcome, thanks.

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0
>Reporter: Till Rohrmann
>Assignee: Dominik Wosiński
>Priority: Major
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case that the system 
> can reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achievable by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-13189) Fix the impact of zookeeper network disconnect temporarily on flink long running jobs

2019-07-10 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882289#comment-16882289
 ] 

lamber-ken commented on FLINK-13189:


Thanks for reminding me, [~elevy]. We met this problem in a production 
environment, and the latest Flink version doesn't handle it either, so we 
created this Jira. We didn't expect it to be a duplicate when we created it.

> Fix the impact of zookeeper network disconnect temporarily on flink long 
> running jobs
> -
>
> Key: FLINK-13189
> URL: https://issues.apache.org/jira/browse/FLINK-13189
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.8.1
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Issue detail info*
> We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use 
> ZooKeeper as the HighAvailabilityService, but we found that Flink jobs will 
> restart because the network was disconnected temporarily between the 
> JobManager and ZooKeeper.
> So we analyzed this problem deeply. The Flink JobManager uses Curator's 
> `+LeaderLatch+` to maintain leadership. When the network disconnects, the 
> `+LeaderLatch+` revokes leadership directly. We think it's too brutal that 
> many long-running Flink jobs restart because of a network shake.
>  
> *Fix this issue*
> From the Curator official website, we found that this issue was fixed in 
> curator-3.x.x, but we can't just change the Flink Curator version (2.12.0) 
> to 3.x.x because of ZooKeeper compatibility: curator-2.x.x supports 
> zookeeper-3.4.x and zookeeper-3.5.0, while curator-3.x.x is only compatible 
> with ZooKeeper 3.5.x. Based on the above considerations, we update 
> `LeaderLatch` in the flink-shaded-curator module.
>  
> *Other*
> Any suggestions are welcome, thanks
>  
> *Useful links*
> [https://curator.apache.org/zk-compatibility.html] 
>  [https://cwiki.apache.org/confluence/display/CURATOR/Releases] 
>  [http://curator.apache.org/curator-recipes/leader-latch.html]
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-13189) Fix the impact of zookeeper network disconnect temporarily on flink long running jobs

2019-07-10 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-13189:
---
Description: 
*Issue detail info*

We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use 
ZooKeeper as the HighAvailabilityService, but we found that Flink jobs will 
restart because the network was disconnected temporarily between the 
JobManager and ZooKeeper.

So we analyzed this problem deeply. The Flink JobManager uses Curator's 
`+LeaderLatch+` to maintain leadership. When the network disconnects, the 
`+LeaderLatch+` revokes leadership directly. We think it's too brutal that 
many long-running Flink jobs restart because of a network shake.

 

*Fix this issue*

From the Curator official website, we found that this issue was fixed in 
curator-3.x.x, but we can't just change the Flink Curator version (2.12.0) 
to 3.x.x because of ZooKeeper compatibility: curator-2.x.x supports 
zookeeper-3.4.x and zookeeper-3.5.0, while curator-3.x.x is only compatible 
with ZooKeeper 3.5.x. Based on the above considerations, we update 
`LeaderLatch` in the flink-shaded-curator module.

 

*Other*

Any suggestions are welcome, thanks

 

*Useful links*

[https://curator.apache.org/zk-compatibility.html] 
 [https://cwiki.apache.org/confluence/display/CURATOR/Releases] 
 [http://curator.apache.org/curator-recipes/leader-latch.html]

  

  was:
*Issue detail info*

We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use 
ZooKeeper as the HighAvailabilityService, but we found that Flink jobs will 
restart because the network was disconnected temporarily between the 
JobManager and ZooKeeper.

So we analyzed this problem deeply. The Flink JobManager uses Curator's 
`+LeaderLatch+` to maintain leadership. When the network disconnects, the 
`+LeaderLatch+` revokes leadership directly. We think it's too brutal that 
many long-running Flink jobs restart because of a network shake.

 

*Fix this issue*

From the Curator official website, we found that this issue was fixed in 
curator-3.x.x, but we can't just change the Flink Curator version (2.12.0) 
to 3.x.x because of ZooKeeper compatibility: curator-2.x.x supports 
zookeeper-3.4.x and zookeeper-3.5.0, while curator-3.x.x is only compatible 
with ZooKeeper 3.5.x. Based on the above considerations, we update 
`LeaderLatch` in the flink-shaded-curator module.

 

*Useful links*

[https://curator.apache.org/zk-compatibility.html] 
[https://cwiki.apache.org/confluence/display/CURATOR/Releases] 
[http://curator.apache.org/curator-recipes/leader-latch.html]

  


> Fix the impact of zookeeper network disconnect temporarily on flink long 
> running jobs
> -
>
> Key: FLINK-13189
> URL: https://issues.apache.org/jira/browse/FLINK-13189
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.8.1
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
> Fix For: 1.9.0
>
>
> *Issue detail info*
> We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use 
> ZooKeeper as the HighAvailabilityService, but we found that Flink jobs will 
> restart because the network was disconnected temporarily between the 
> JobManager and ZooKeeper.
> So we analyzed this problem deeply. The Flink JobManager uses Curator's 
> `+LeaderLatch+` to maintain leadership. When the network disconnects, the 
> `+LeaderLatch+` revokes leadership directly. We think it's too brutal that 
> many long-running Flink jobs restart because of a network shake.
>  
> *Fix this issue*
> From the Curator official website, we found that this issue was fixed in 
> curator-3.x.x, but we can't just change the Flink Curator version (2.12.0) 
> to 3.x.x because of ZooKeeper compatibility: curator-2.x.x supports 
> zookeeper-3.4.x and zookeeper-3.5.0, while curator-3.x.x is only compatible 
> with ZooKeeper 3.5.x. Based on the above considerations, we update 
> `LeaderLatch` in the flink-shaded-curator module.
>  
> *Other*
> Any suggestions are welcome, thanks
>  
> *Useful links*
> [https://curator.apache.org/zk-compatibility.html] 
>  [https://cwiki.apache.org/confluence/display/CURATOR/Releases] 
>  [http://curator.apache.org/curator-recipes/leader-latch.html]
>   



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13189) Fix the impact of zookeeper network disconnect temporarily on flink long running jobs

2019-07-10 Thread lamber-ken (JIRA)
lamber-ken created FLINK-13189:
--

 Summary: Fix the impact of zookeeper network disconnect 
temporarily on flink long running jobs
 Key: FLINK-13189
 URL: https://issues.apache.org/jira/browse/FLINK-13189
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.8.1
Reporter: lamber-ken
Assignee: lamber-ken
 Fix For: 1.9.0


*Issue detail info*

We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use 
ZooKeeper as the HighAvailabilityService, but we found that Flink jobs will 
restart because the network was disconnected temporarily between the 
JobManager and ZooKeeper.

So we analyzed this problem deeply. The Flink JobManager uses Curator's 
`+LeaderLatch+` to maintain leadership. When the network disconnects, the 
`+LeaderLatch+` revokes leadership directly. We think it's too brutal that 
many long-running Flink jobs restart because of a network shake.

 

*Fix this issue*

From the Curator official website, we found that this issue was fixed in 
curator-3.x.x, but we can't just change the Flink Curator version (2.12.0) 
to 3.x.x because of ZooKeeper compatibility: curator-2.x.x supports 
zookeeper-3.4.x and zookeeper-3.5.0, while curator-3.x.x is only compatible 
with ZooKeeper 3.5.x. Based on the above considerations, we update 
`LeaderLatch` in the flink-shaded-curator module.

 

*Useful links*

[https://curator.apache.org/zk-compatibility.html] 
[https://cwiki.apache.org/confluence/display/CURATOR/Releases] 
[http://curator.apache.org/curator-recipes/leader-latch.html]

  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-31 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852775#comment-16852775
 ] 

lamber-ken commented on FLINK-12302:


Thanks very much!

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png, 
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job YARN cluster mode, and the 
> YARN ResourceManager reran the job.
> When the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+; it's better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-28 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849540#comment-16849540
 ] 

lamber-ken edited comment on FLINK-12302 at 5/28/19 9:47 AM:
-

[~gjy] By the way, could you help me review another PR of mine if you have 
time? [https://github.com/apache/flink/pull/7987]


was (Author: lamber-ken):
By the way, could you help me review another PR of mine if you have time? 
https://github.com/apache/flink/pull/7987

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png, 
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job YARN cluster mode, and the 
> YARN ResourceManager reran the job.
> When the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+; it's better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-28 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849540#comment-16849540
 ] 

lamber-ken commented on FLINK-12302:


By the way, could you help me review another PR of mine if you have time? 
https://github.com/apache/flink/pull/7987

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png, 
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job YARN cluster mode, and the 
> YARN ResourceManager reran the job.
> When the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+; it's better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-28 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned FLINK-12302:
--

Assignee: (was: lamber-ken)

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png, 
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job YARN cluster mode, and the 
> YARN ResourceManager reran the job.
> When the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+; it's better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-28 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849537#comment-16849537
 ] 

lamber-ken edited comment on FLINK-12302 at 5/28/19 9:43 AM:
-

Hi [~gjy],

Thanks for your detailed explanation. I previously used the +{{UNDEFINED}}+ 
status to check whether the Flink job on YARN was running; currently I only 
need to check the +EndTime+, so it doesn't affect me much now.

I don't know whether this patch would affect even successfully finished jobs 
if it were applied; maybe it's not enough, as you said.

Thanks for talking about this issue these days; you can close this issue if 
you want. When we figure out a great way to fix it, we can reopen it. :D

 


was (Author: lamber-ken):
Hi [~gjy],

Thanks for your detailed explanation. I previously used the +{{UNDEFINED}}+ 
status to check whether the Flink job on YARN was running; currently I only 
need to check the +EndTime+, so it doesn't affect me much now.

I don't know whether this patch would affect even successfully finished jobs 
if it were applied; maybe it's not enough, as you said.

Thanks for talking about this issue these days; you can close this issue if 
you want. When we figure out a great way to fix it, we can reopen it.

 

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png, 
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job YARN cluster mode, and the 
> YARN ResourceManager reran the job.
> When the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+; it's better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-28 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849537#comment-16849537
 ] 

lamber-ken edited comment on FLINK-12302 at 5/28/19 9:42 AM:
-

Hi [~gjy],

Thanks for your detailed explanation. I previously used the +{{UNDEFINED}}+ 
status to check whether the Flink job on YARN was running; currently I only 
need to check the +EndTime+, so it doesn't affect me much now.

I don't know whether this patch would affect even successfully finished jobs 
if it were applied; maybe it's not enough, as you said.

Thanks for talking about this issue these days; you can close this issue if 
you want. When we figure out a great way to fix it, we can reopen it.

 


was (Author: lamber-ken):
Hi [~gjy], 

Thanks for your detailed explanation. I previously used the +{{UNDEFINED}}+ 
status to check whether the Flink job on YARN was running; currently I only 
need to check the +EndTime+, so it doesn't affect me much now. 

I don't know whether this patch would affect even successfully finished jobs 
if it were applied; maybe it's not enough, as you said. 

Thanks for talking about this issue these days; you can close this issue if 
you want. :)

 

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png, 
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job YARN cluster mode, and the 
> YARN ResourceManager reran the job.
> When the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+; it's better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-28 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849537#comment-16849537
 ] 

lamber-ken commented on FLINK-12302:


Hi [~gjy], 

Thanks for your detailed explanation. I previously used the +{{UNDEFINED}}+ 
status to check whether the Flink job on YARN was running; currently I only 
need to check the +EndTime+, so it doesn't affect me much now. 

I don't know whether this patch would affect even successfully finished jobs 
if it were applied; maybe it's not enough, as you said. 

Thanks for talking about this issue these days; you can close this issue if 
you want. :)

 

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png, 
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job YARN cluster mode, and the 
> YARN ResourceManager reran the job.
> When the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+; it's better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849071#comment-16849071
 ] 

lamber-ken edited comment on FLINK-12302 at 5/27/19 4:53 PM:
-

[~gjy], from another angle, we can analyze this issue purely from the code 
logic.

When some scenario happens that calls the +MiniDispatcher#jobNotFinished+ 
method, it means the Flink job terminated unexpectedly, so it notifies the RM 
to kill the YARN application with the +ApplicationStatus.UNKNOWN+ state; the 
+UNKNOWN+ state is then translated to +{{UNDEFINED}}+ by 
+YarnResourceManager#getYarnStatus+.

But in the Hadoop system, +{{UNDEFINED}}+ means the application has not yet 
finished.

 

*MiniDispatcher#jobNotFinished*
{code:java}
@Override
protected void jobNotFinished(JobID jobId) {
   super.jobNotFinished(jobId);

   // shut down since we have done our job
   jobTerminationFuture.complete(ApplicationStatus.UNKNOWN);
}
{code}
*YarnResourceManager#getYarnStatus*
{code:java}
private FinalApplicationStatus getYarnStatus(ApplicationStatus status) {
   if (status == null) {
  return FinalApplicationStatus.UNDEFINED;
   }
   else {
  switch (status) {
 case SUCCEEDED:
return FinalApplicationStatus.SUCCEEDED;
 case FAILED:
return FinalApplicationStatus.FAILED;
 case CANCELED:
return FinalApplicationStatus.KILLED;
 default:
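// ApplicationStatus.UNKNOWN (set by MiniDispatcher#jobNotFinished) lands here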
return FinalApplicationStatus.UNDEFINED;
  }
   }
}
{code}
 

*Hadoop Application Status* 
[FinalApplicationStatus|https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/FinalApplicationStatus.java#L32]
{code:java}
/**
 * Enumeration of various final states of an Application.
 */
@Public
@Stable
public enum FinalApplicationStatus {

 /** Undefined state when either the application has not yet finished */
  UNDEFINED,

  /** Application which finished successfully. */
  SUCCEEDED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}
{code}
 

*Long-running Applications' FinalStatus*

*!image-2019-05-28-00-46-49-740.png!*

 

 


was (Author: lamber-ken):
[~gjy], from another angle, we can analyze this issue purely from the code.

When some scenario happens that calls the +MiniDispatcher#jobNotFinished+ 
method, it means the Flink job terminated unexpectedly, so it notifies the RM 
to kill the YARN application with the +ApplicationStatus.UNKNOWN+ state; the 
+UNKNOWN+ state is then translated to +{{UNDEFINED}}+ by 
+YarnResourceManager#getYarnStatus+.

But in the Hadoop system, +{{UNDEFINED}}+ means the application has not yet 
finished.

 

*MiniDispatcher#jobNotFinished*
{code:java}
@Override
protected void jobNotFinished(JobID jobId) {
   super.jobNotFinished(jobId);

   // shut down since we have done our job
   jobTerminationFuture.complete(ApplicationStatus.UNKNOWN);
}
{code}
*YarnResourceManager#getYarnStatus*
{code:java}
private FinalApplicationStatus getYarnStatus(ApplicationStatus status) {
   if (status == null) {
  return FinalApplicationStatus.UNDEFINED;
   }
   else {
  switch (status) {
 case SUCCEEDED:
return FinalApplicationStatus.SUCCEEDED;
 case FAILED:
return FinalApplicationStatus.FAILED;
 case CANCELED:
return FinalApplicationStatus.KILLED;
 default:
return FinalApplicationStatus.UNDEFINED;
  }
   }
}
{code}
 

*Hadoop Application Status* 
[FinalApplicationStatus|https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/FinalApplicationStatus.java#L32]
{code:java}
/**
 * Enumeration of various final states of an Application.
 */
@Public
@Stable
public enum FinalApplicationStatus {

 /** Undefined state when either the application has not yet finished */
  UNDEFINED,

  /** Application which finished successfully. */
  SUCCEEDED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}
{code}
 

*Long-running Applications' FinalStatus*

*!image-2019-05-28-00-46-49-740.png!*

 

 

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: p

[jira] [Commented] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849080#comment-16849080
 ] 

lamber-ken commented on FLINK-12302:


So, when this unexpected scenario happens, the wrong +finalStatus+ is displayed, 
because the job is no longer running. The FinalStatus should be FAILED, not UNDEFINED.

!image-2019-05-28-00-50-13-500.png!

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png, 
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: image-2019-05-28-00-50-13-500.png

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, image-2019-05-28-00-46-49-740.png, 
> image-2019-05-28-00-50-13-500.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849071#comment-16849071
 ] 

lamber-ken edited comment on FLINK-12302 at 5/27/19 4:47 PM:
-

[~gjy], from another angle, we can analyze this issue purely from the code.

When such a scenario happens and the +MiniDispatcher#jobNotFinished+ method is 
called, it means the Flink job terminated unexpectedly, so it notifies the RM 
to kill the yarn application with the +ApplicationStatus.UNKNOWN+ state; the 
+UNKNOWN+ state is then translated to +{{UNDEFINED}}+ by 
+YarnResourceManager#getYarnStatus+.

 

But in the Hadoop system, +{{UNDEFINED}}+ means the application has not yet 
finished.

 

*MiniDispatcher#jobNotFinished*
{code:java}
@Override
protected void jobNotFinished(JobID jobId) {
   super.jobNotFinished(jobId);

   // shut down since we have done our job
   jobTerminationFuture.complete(ApplicationStatus.UNKNOWN);
}
{code}
*YarnResourceManager#getYarnStatus*
{code:java}
private FinalApplicationStatus getYarnStatus(ApplicationStatus status) {
   if (status == null) {
  return FinalApplicationStatus.UNDEFINED;
   }
   else {
  switch (status) {
 case SUCCEEDED:
return FinalApplicationStatus.SUCCEEDED;
 case FAILED:
return FinalApplicationStatus.FAILED;
 case CANCELED:
return FinalApplicationStatus.KILLED;
 default:
return FinalApplicationStatus.UNDEFINED;
  }
   }
}
{code}
 

*Hadoop Application Status* 
[FinalApplicationStatus|https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/FinalApplicationStatus.java#L32]
{code:java}
/**
 * Enumeration of various final states of an Application.
 */
@Public
@Stable
public enum FinalApplicationStatus {

 /** Undefined state when either the application has not yet finished */
  UNDEFINED,

  /** Application which finished successfully. */
  SUCCEEDED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}
{code}
 

*Long-running Applications' FinalStatus*

*!image-2019-05-28-00-46-49-740.png!*

 

 


was (Author: lamber-ken):
[~gjy], from another angle, we can analyze this issue purely from the code.

When such a scenario happens and the +MiniDispatcher#jobNotFinished+ method is 
called, it means the Flink job terminated unexpectedly, so it notifies the RM 
to kill the yarn application with the +ApplicationStatus.UNKNOWN+ state; the 
+UNKNOWN+ state is then translated to +{{UNDEFINED}}+ by 
+YarnResourceManager#getYarnStatus+.

 

But in the Hadoop system, +{{UNDEFINED}}+ means the application has not yet 
finished.

 

*MiniDispatcher#jobNotFinished*
{code:java}
@Override
protected void jobNotFinished(JobID jobId) {
   super.jobNotFinished(jobId);

   // shut down since we have done our job
   jobTerminationFuture.complete(ApplicationStatus.UNKNOWN);
}
{code}
*YarnResourceManager#getYarnStatus*
{code:java}
private FinalApplicationStatus getYarnStatus(ApplicationStatus status) {
   if (status == null) {
  return FinalApplicationStatus.UNDEFINED;
   }
   else {
  switch (status) {
 case SUCCEEDED:
return FinalApplicationStatus.SUCCEEDED;
 case FAILED:
return FinalApplicationStatus.FAILED;
 case CANCELED:
return FinalApplicationStatus.KILLED;
 default:
return FinalApplicationStatus.UNDEFINED;
  }
   }
}
{code}
 

*Hadoop Application Status* 
[FinalApplicationStatus|https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/FinalApplicationStatus.java#L32]
{code:java}
/**
 * Enumeration of various final states of an Application.
 */
@Public
@Stable
public enum FinalApplicationStatus {

 /** Undefined state when either the application has not yet finished */
  UNDEFINED,

  /** Application which finished successfully. */
  SUCCEEDED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}
{code}
  

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-f

[jira] [Comment Edited] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849071#comment-16849071
 ] 

lamber-ken edited comment on FLINK-12302 at 5/27/19 4:43 PM:
-

[~gjy], from another angle, we can analyze this issue purely from the code.

When such a scenario happens and the +MiniDispatcher#jobNotFinished+ method is 
called, it means the Flink job terminated unexpectedly, so it notifies the RM 
to kill the yarn application with the +ApplicationStatus.UNKNOWN+ state; the 
+UNKNOWN+ state is then translated to +{{UNDEFINED}}+ by 
+YarnResourceManager#getYarnStatus+.

 

But in the Hadoop system, +{{UNDEFINED}}+ means the application has not yet 
finished.

 

*MiniDispatcher#jobNotFinished*
{code:java}
@Override
protected void jobNotFinished(JobID jobId) {
   super.jobNotFinished(jobId);

   // shut down since we have done our job
   jobTerminationFuture.complete(ApplicationStatus.UNKNOWN);
}
{code}
*YarnResourceManager#getYarnStatus*
{code:java}
private FinalApplicationStatus getYarnStatus(ApplicationStatus status) {
   if (status == null) {
  return FinalApplicationStatus.UNDEFINED;
   }
   else {
  switch (status) {
 case SUCCEEDED:
return FinalApplicationStatus.SUCCEEDED;
 case FAILED:
return FinalApplicationStatus.FAILED;
 case CANCELED:
return FinalApplicationStatus.KILLED;
 default:
return FinalApplicationStatus.UNDEFINED;
  }
   }
}
{code}
 

*Hadoop Application Status* 
[FinalApplicationStatus|https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/FinalApplicationStatus.java#L32]
{code:java}
/**
 * Enumeration of various final states of an Application.
 */
@Public
@Stable
public enum FinalApplicationStatus {

 /** Undefined state when either the application has not yet finished */
  UNDEFINED,

  /** Application which finished successfully. */
  SUCCEEDED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}
{code}
  


was (Author: lamber-ken):
[~gjy], from another angle, we can analyze this issue purely from the code.

When such a scenario happens and the +MiniDispatcher#jobNotFinished+ method is 
called, it means the Flink job terminated unexpectedly, so it notifies the RM 
to kill the yarn application with the +ApplicationStatus.UNKNOWN+ state; the 
+UNKNOWN+ state is then translated to +{{UNDEFINED}}+ by 
+YarnResourceManager#getYarnStatus+.

 

But in the Hadoop system, +{{UNDEFINED}}+ means the application has not yet 
finished.

 

*MiniDispatcher#jobNotFinished*
{code:java}
@Override
protected void jobNotFinished(JobID jobId) {
   super.jobNotFinished(jobId);

   // shut down since we have done our job
   jobTerminationFuture.complete(ApplicationStatus.UNKNOWN);
}
{code}
*YarnResourceManager#getYarnStatus*

 
{code:java}
private FinalApplicationStatus getYarnStatus(ApplicationStatus status) {
   if (status == null) {
  return FinalApplicationStatus.UNDEFINED;
   }
   else {
  switch (status) {
 case SUCCEEDED:
return FinalApplicationStatus.SUCCEEDED;
 case FAILED:
return FinalApplicationStatus.FAILED;
 case CANCELED:
return FinalApplicationStatus.KILLED;
 default:
return FinalApplicationStatus.UNDEFINED;
  }
   }
}
{code}
**

 

*Hadoop Application Status* 
[FinalApplicationStatus|https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/FinalApplicationStatus.java#L32]
{code:java}
/**
 * Enumeration of various final states of an Application.
 */
@Public
@Stable
public enum FinalApplicationStatus {

 /** Undefined state when either the application has not yet finished */
  UNDEFINED,

  /** Application which finished successfully. */
  SUCCEEDED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}
{code}
  

 

 

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56

[jira] [Comment Edited] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849071#comment-16849071
 ] 

lamber-ken edited comment on FLINK-12302 at 5/27/19 4:42 PM:
-

[~gjy], from another angle, we can analyze this issue purely from the code.

When such a scenario happens and the +MiniDispatcher#jobNotFinished+ method is 
called, it means the Flink job terminated unexpectedly, so it notifies the RM 
to kill the yarn application with the +ApplicationStatus.UNKNOWN+ state; the 
+UNKNOWN+ state is then translated to +{{UNDEFINED}}+ by 
+YarnResourceManager#getYarnStatus+.

 

But in the Hadoop system, +{{UNDEFINED}}+ means the application has not yet 
finished.

 

*MiniDispatcher#jobNotFinished*
{code:java}
@Override
protected void jobNotFinished(JobID jobId) {
   super.jobNotFinished(jobId);

   // shut down since we have done our job
   jobTerminationFuture.complete(ApplicationStatus.UNKNOWN);
}
{code}
*YarnResourceManager#getYarnStatus*

 
{code:java}
private FinalApplicationStatus getYarnStatus(ApplicationStatus status) {
   if (status == null) {
  return FinalApplicationStatus.UNDEFINED;
   }
   else {
  switch (status) {
 case SUCCEEDED:
return FinalApplicationStatus.SUCCEEDED;
 case FAILED:
return FinalApplicationStatus.FAILED;
 case CANCELED:
return FinalApplicationStatus.KILLED;
 default:
return FinalApplicationStatus.UNDEFINED;
  }
   }
}
{code}
**

 

*Hadoop Application Status* 
[FinalApplicationStatus|https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/FinalApplicationStatus.java#L32]
{code:java}
/**
 * Enumeration of various final states of an Application.
 */
@Public
@Stable
public enum FinalApplicationStatus {

 /** Undefined state when either the application has not yet finished */
  UNDEFINED,

  /** Application which finished successfully. */
  SUCCEEDED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}
{code}
  

 

 


was (Author: lamber-ken):
[~gjy], from another angle, we can analyze this issue purely from the code.

When such a scenario happens and the +MiniDispatcher#jobNotFinished+ method is 
called, it means the Flink job terminated unexpectedly, so it notifies the RM 
to kill the yarn application with the +ApplicationStatus.UNKNOWN+ state.

But in the Hadoop system, +{{UNDEFINED}}+ means the application has not yet 
finished.

*MiniDispatcher#jobNotFinished*
{code:java}
@Override
protected void jobNotFinished(JobID jobId) {
   super.jobNotFinished(jobId);

   // shut down since we have done our job
   jobTerminationFuture.complete(ApplicationStatus.UNKNOWN);
}
{code}
*Hadoop Application Status* 
[FinalApplicationStatus|https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/FinalApplicationStatus.java#L32]
{code:java}
/**
 * Enumeration of various final states of an Application.
 */
@Public
@Stable
public enum FinalApplicationStatus {

 /** Undefined state when either the application has not yet finished */
  UNDEFINED,

  /** Application which finished successfully. */
  SUCCEEDED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}
{code}
  

 

 

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849071#comment-16849071
 ] 

lamber-ken edited comment on FLINK-12302 at 5/27/19 4:34 PM:
-

[~gjy], from another angle, we can analyze this issue purely from the code.

When such a scenario happens and the +MiniDispatcher#jobNotFinished+ method is 
called, it means the Flink job terminated unexpectedly, so it notifies the RM 
to kill the yarn application with the +ApplicationStatus.UNKNOWN+ state.

But in the Hadoop system, +{{UNDEFINED}}+ means the application has not yet 
finished.

*MiniDispatcher#jobNotFinished*
{code:java}
@Override
protected void jobNotFinished(JobID jobId) {
   super.jobNotFinished(jobId);

   // shut down since we have done our job
   jobTerminationFuture.complete(ApplicationStatus.UNKNOWN);
}
{code}
*Hadoop Application Status* 
[FinalApplicationStatus|https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/FinalApplicationStatus.java#L32]
{code:java}
/**
 * Enumeration of various final states of an Application.
 */
@Public
@Stable
public enum FinalApplicationStatus {

 /** Undefined state when either the application has not yet finished */
  UNDEFINED,

  /** Application which finished successfully. */
  SUCCEEDED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}
{code}
  

 

 


was (Author: lamber-ken):
[~gjy], from another angle, we can analyze this issue purely from the code.

When such a scenario happens and the +MiniDispatcher#jobNotFinished+ method is 
called, it means the Flink job terminated unexpectedly, so it notifies the RM 
to kill the yarn application with the +ApplicationStatus.UNKNOWN+ state.

But in the Hadoop system, +{{UNDEFINED}}+ means the application has not yet 
finished.

*MiniDispatcher#jobNotFinished*
{code:java}
@Override
protected void jobNotFinished(JobID jobId) {
   super.jobNotFinished(jobId);

   // shut down since we have done our job
   jobTerminationFuture.complete(ApplicationStatus.UNKNOWN);
}
{code}
*Hadoop Application Status 
https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/FinalApplicationStatus.java#L32*
{code:java}
/**
 * Enumeration of various final states of an Application.
 */
@Public
@Stable
public enum FinalApplicationStatus {

 /** Undefined state when either the application has not yet finished */
  UNDEFINED,

  /** Application which finished successfully. */
  SUCCEEDED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}
{code}
  

 

 

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849071#comment-16849071
 ] 

lamber-ken commented on FLINK-12302:


[~gjy], from another angle, we can analyze this issue purely from the code.

When such a scenario happens and the +MiniDispatcher#jobNotFinished+ method is 
called, it means the Flink job terminated unexpectedly, so it notifies the RM 
to kill the yarn application with the +ApplicationStatus.UNKNOWN+ state.

But in the Hadoop system, +{{UNDEFINED}}+ means the application has not yet 
finished.

*MiniDispatcher#jobNotFinished*
{code:java}
@Override
protected void jobNotFinished(JobID jobId) {
   super.jobNotFinished(jobId);

   // shut down since we have done our job
   jobTerminationFuture.complete(ApplicationStatus.UNKNOWN);
}
{code}
*Hadoop Application Status 
https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/FinalApplicationStatus.java#L32*
{code:java}
/**
 * Enumeration of various final states of an Application.
 */
@Public
@Stable
public enum FinalApplicationStatus {

 /** Undefined state when either the application has not yet finished */
  UNDEFINED,

  /** Application which finished successfully. */
  SUCCEEDED,

  /** Application which failed. */
  FAILED,

  /** Application which was terminated by a user or admin. */
  KILLED
}
{code}
  

 

 

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Comment: was deleted

(was: [~gjy], hi, can you show me your +flink-conf.yaml+ file? thanks)

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849038#comment-16849038
 ] 

lamber-ken commented on FLINK-12302:


[~gjy], here are my env files.

1,test jars --> [^test.jar]

2,flink-1.8.0

3,flink-conf.yaml --> [^flink-conf.yaml]

4,the first jobmanager.log --> [^jobmanager-1.log]

5,the second jobmanager.log --> [^jobmanager-2.log]

You must wait until the job reaches the max attempt count, then you can kill the AM. 
From the second jobmanager.log, we will see
{code:java}
Job 0cac7407733eb34396cd5e919631d4ff was not finished by JobManager. {code}
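If you want to check the reported status from the command line while reproducing this, the standard YARN CLI works (the application id below is just a placeholder):
{code:java}
# Application id is a placeholder; take the real one from `yarn application -list`
yarn application -status application_1558000000000_0001

# In the output, the field in question is the final state, e.g.:
#   Final-State : UNDEFINED   <-- should arguably be FAILED here
{code}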

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: jobmanager-1.log

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: jobmanager-2.log

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, jobmanager-1.log, 
> jobmanager-2.log, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: flink-conf.yaml

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, screenshot-1.png, 
> screenshot-2.png, spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, 
> test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: (was: jobmanager-1.log)

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, screenshot-1.png, 
> screenshot-2.png, spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, 
> test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: test.jar

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, screenshot-1.png, 
> screenshot-2.png, spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, 
> test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: (was: flink-conf.yaml)

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, screenshot-1.png, 
> screenshot-2.png, spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, 
> test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: (was: jobmanager-2.log)

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, screenshot-1.png, 
> screenshot-2.png, spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, 
> test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: (was: test.jar)

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, jobmanager-2.log, 
> screenshot-1.png, screenshot-2.png, spslave4.bigdata.ly_23951, 
> spslave5.bigdata.ly_20271
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: test.jar
jobmanager-2.log
jobmanager-1.log
flink-conf.yaml

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, jobmanager-2.log, 
> screenshot-1.png, screenshot-2.png, spslave4.bigdata.ly_23951, 
> spslave5.bigdata.ly_20271
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: (was: test.jar)

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, flink-conf.yaml, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, jobmanager-2.log, 
> screenshot-1.png, screenshot-2.png, spslave4.bigdata.ly_23951, 
> spslave5.bigdata.ly_20271
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: test.jar

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, screenshot-1.png, 
> screenshot-2.png, spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271, 
> test.jar
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-27 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849010#comment-16849010
 ] 

lamber-ken commented on FLINK-12302:


[~gjy], hi, can you show me your +flink-conf.yaml+ file? thanks

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, 
> image-2019-04-23-19-56-49-933.png, jobmanager-05-27.log, screenshot-1.png, 
> screenshot-2.png, spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> flink job (flink-1.6.3) failed in per-job yarn cluster mode, and the 
> resourcemanager of yarn reran the job.
> when the job failed again, the application will finish, but the finalStatus 
> is +UNDEFINED+. It's better to show the state +FAILED+
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-12558) Yarn application can't stop when flink job finished in normal mode

2019-05-21 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed FLINK-12558.
--
   Resolution: Not A Problem
Fix Version/s: 1.6.3
 Release Note: 
it needs the -sae option so that the cluster shuts down when the CLI is terminated abruptly, like 
$FLINK_HOME/bin/flink run -m yarn-cluster -yn 2 -sae -c

> Yarn application can't stop when flink job finished in normal mode
> --
>
> Key: FLINK-12558
> URL: https://issues.apache.org/jira/browse/FLINK-12558
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / REST
>Affects Versions: 1.6.3
>    Reporter: lamber-ken
>Assignee: frank wang
>Priority: Major
> Fix For: 1.6.3
>
> Attachments: image-2019-05-20-18-47-12-497.png, jobmanager.txt
>
>
> I ran a flink +SocketWindowWordCount+ job in yarn cluster mode; when I killed 
> the socket, the flink job couldn't be stopped, and I can't reproduce the bug again.
>  
> *Steps 1*
> {code:java}
> nc -lk 
> {code}
> *Steps 2*
> {code:java}
> bin/flink run -m yarn-cluster -yn 2 
> examples/streaming/SocketWindowWordCount.jar --hostname 10.101.52.12 --port 
> 
> {code}
> *Steps 3*
>  cancel the above nc command
> *Steps 4*
>  everything is gone
>     !image-2019-05-20-18-47-12-497.png!
>  ** 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12558) Yarn application can't stop when flink job finished in normal mode

2019-05-21 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845457#comment-16845457
 ] 

lamber-ken commented on FLINK-12558:


hi, thanks for assigning this issue [~frank wang]. Maybe the CLI was terminated 
abruptly, e.g. by typing Ctrl + C. If the +-sae+ option is used, this will not happen again.
{code:java}
$FLINK_HOME/bin/flink run -m yarn-cluster -sae -yn 2 -c  {code}
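For reference, +-sae+ is the short form of +--shutdownOnAttachedExit+: when the job is submitted in attached mode, the CLI performs a best-effort cluster shutdown if it is terminated abruptly. A fuller invocation looks like this (the class and jar names are placeholders):
{code:java}
# Class and jar names below are placeholders for illustration
$FLINK_HOME/bin/flink run -m yarn-cluster -yn 2 -sae \
  -c com.example.SocketWindowWordCount ./my-job.jar
{code}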
 

> Yarn application can't stop when flink job finished in normal mode
> --
>
> Key: FLINK-12558
> URL: https://issues.apache.org/jira/browse/FLINK-12558
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / REST
>Affects Versions: 1.6.3
>    Reporter: lamber-ken
>Assignee: frank wang
>Priority: Major
> Attachments: image-2019-05-20-18-47-12-497.png, jobmanager.txt
>
>
> I ran a flink +SocketWindowWordCount+ job in yarn cluster mode; when I killed 
> the socket, the flink job couldn't be stopped, and I can't reproduce the bug again.
>  
> *Steps 1*
> {code:java}
> nc -lk 
> {code}
> *Steps 2*
> {code:java}
> bin/flink run -m yarn-cluster -yn 2 
> examples/streaming/SocketWindowWordCount.jar --hostname 10.101.52.12 --port 
> 
> {code}
> *Steps 3*
>  cancel the above nc command
> *Steps 4*
>  everything is gone
>     !image-2019-05-20-18-47-12-497.png!
>  ** 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12558) Yarn application can't stop when flink job finished in normal mode

2019-05-20 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12558:
---
Summary: Yarn application can't stop when flink job finished in normal mode 
 (was: Yarn application can't stop when flink job finished)

> Yarn application can't stop when flink job finished in normal mode
> --
>
> Key: FLINK-12558
> URL: https://issues.apache.org/jira/browse/FLINK-12558
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / REST
>Affects Versions: 1.6.3
>    Reporter: lamber-ken
>Assignee: frank wang
>Priority: Major
> Attachments: image-2019-05-20-18-47-12-497.png, jobmanager.txt
>
>
> I ran a flink +SocketWindowWordCount+ job in yarn cluster mode; when I killed 
> the socket, the flink job couldn't be stopped, and I can't reproduce the bug again.
>  
> *Steps 1*
> {code:java}
> nc -lk 
> {code}
> *Steps 2*
> {code:java}
> bin/flink run -m yarn-cluster -yn 2 
> examples/streaming/SocketWindowWordCount.jar --hostname 10.101.52.12 --port 
> 
> {code}
> *Steps 3*
>  cancel the above nc command
> *Steps 4*
>  everything is gone
>     !image-2019-05-20-18-47-12-497.png!
>  ** 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-12558) Yarn application can't stop when flink job finished

2019-05-20 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844434#comment-16844434
 ] 

lamber-ken edited comment on FLINK-12558 at 5/21/19 1:37 AM:
-

[~frank wang], I think the +FlinkJobNotFoundException+ is not the root cause. 
The +FlinkJobNotFoundException+ is printed when we visit the Flink WebUI, which 
sends requests every three seconds asynchronously. I think we need to find out 
what prevents the application from stopping.


was (Author: lamber-ken):
[~frank wang], I think the +FlinkJobNotFoundException+ is not the root cause. 
The +FlinkJobNotFoundException+ is printed when we visit the Flink WebUI, which 
sends requests every three seconds. I think we need to find out what prevents 
the application from stopping.

> Yarn application can't stop when flink job finished
> ---
>
> Key: FLINK-12558
> URL: https://issues.apache.org/jira/browse/FLINK-12558
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / REST
>Affects Versions: 1.6.3
>    Reporter: lamber-ken
>Assignee: frank wang
>Priority: Major
> Attachments: image-2019-05-20-18-47-12-497.png, jobmanager.txt
>
>
> I ran a flink +SocketWindowWordCount+ job in yarn cluster mode; when I killed 
> the socket, the flink job couldn't be stopped, and I can't reproduce the bug again.
>  
> *Steps 1*
> {code:java}
> nc -lk 
> {code}
> *Steps 2*
> {code:java}
> bin/flink run -m yarn-cluster -yn 2 
> examples/streaming/SocketWindowWordCount.jar --hostname 10.101.52.12 --port 
> 
> {code}
> *Steps 3*
>  cancel the above nc command
> *Steps 4*
>  everything is gone
>     !image-2019-05-20-18-47-12-497.png!
>  ** 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12558) Yarn application can't stop when flink job finished

2019-05-20 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844434#comment-16844434
 ] 

lamber-ken commented on FLINK-12558:


[~frank wang], I think the +FlinkJobNotFoundException+ is not the root cause. 
The +FlinkJobNotFoundException+ is printed when we visit the Flink WebUI, which 
sends requests every three seconds. I think we need to find out what prevents 
the application from stopping.

> Yarn application can't stop when flink job finished
> ---
>
> Key: FLINK-12558
> URL: https://issues.apache.org/jira/browse/FLINK-12558
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / REST
>Affects Versions: 1.6.3
>    Reporter: lamber-ken
>Assignee: frank wang
>Priority: Major
> Attachments: image-2019-05-20-18-47-12-497.png, jobmanager.txt
>
>
> I ran a flink +SocketWindowWordCount+ job in yarn cluster mode; when I killed 
> the socket, the flink job couldn't be stopped, and I can't reproduce the bug again.
>  
> *Steps 1*
> {code:java}
> nc -lk 
> {code}
> *Steps 2*
> {code:java}
> bin/flink run -m yarn-cluster -yn 2 
> examples/streaming/SocketWindowWordCount.jar --hostname 10.101.52.12 --port 
> 
> {code}
> *Steps 3*
>  cancel the above nc command
> *Steps 4*
>  everything is gone
>     !image-2019-05-20-18-47-12-497.png!
>  ** 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12558) Yarn application can't stop when flink job finished

2019-05-20 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12558:
---
Description: 
I ran a flink +SocketWindowWordCount+ job in yarn cluster mode; when I killed the 
socket, the flink job couldn't be stopped, and I can't reproduce the bug again.

 

*Steps 1*
{code:java}
nc -lk 
{code}
*Steps 2*
{code:java}
bin/flink run -m yarn-cluster -yn 2 
examples/streaming/SocketWindowWordCount.jar --hostname 10.101.52.12 --port 
{code}
*Steps 3*
 cancel the above nc command

*Steps 4*
 everything is gone
    !image-2019-05-20-18-47-12-497.png!

 ** 

 

 

  was:
I ran a flink +SocketWindowWordCount+ job in yarn cluster mode; when I killed the 
socket, the flink job couldn't be stopped, and I can't reproduce the bug again.

*Steps 1*
{code:java}
nc -lk 
{code}
*Steps 2*
{code:java}
bin/flink run -m yarn-cluster -yn 2 
examples/streaming/SocketWindowWordCount.jar --hostname 10.101.52.12 --port 
{code}
*Steps 3*
 cancel the above nc command

*Steps 4*
 everything is gone
   !image-2019-05-20-18-47-12-497.png!

 ** 

 

 


> Yarn application can't stop when flink job finished
> ---
>
> Key: FLINK-12558
> URL: https://issues.apache.org/jira/browse/FLINK-12558
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / REST
>Affects Versions: 1.6.3
>    Reporter: lamber-ken
>Priority: Major
> Attachments: image-2019-05-20-18-47-12-497.png, jobmanager.txt
>
>
> I ran a flink +SocketWindowWordCount+ job in yarn cluster mode; when I killed 
> the socket, the flink job couldn't be stopped, and I can't reproduce the bug again.
>  
> *Steps 1*
> {code:java}
> nc -lk 
> {code}
> *Steps 2*
> {code:java}
> bin/flink run -m yarn-cluster -yn 2 
> examples/streaming/SocketWindowWordCount.jar --hostname 10.101.52.12 --port 
> 
> {code}
> *Steps 3*
>  cancel the above nc command
> *Steps 4*
>  everything is gone
>     !image-2019-05-20-18-47-12-497.png!
>  ** 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12558) Yarn application can't stop when flink job finished

2019-05-20 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12558:
---
Attachment: jobmanager.txt

> Yarn application can't stop when flink job finished
> ---
>
> Key: FLINK-12558
> URL: https://issues.apache.org/jira/browse/FLINK-12558
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / REST
>Affects Versions: 1.6.3
>    Reporter: lamber-ken
>Priority: Major
> Attachments: image-2019-05-20-18-47-12-497.png, jobmanager.txt
>
>
> I ran a flink +SocketWindowWordCount+ job in yarn cluster mode; when I killed 
> the socket, the flink job couldn't be stopped, and I can't reproduce the bug again.
> *Steps 1*
> {code:java}
> nc -lk 
> {code}
> *Steps 2*
> {code:java}
> bin/flink run -m yarn-cluster -yn 2 
> examples/streaming/SocketWindowWordCount.jar --hostname 10.101.52.12 --port 
> 
> {code}
> *Steps 3*
>  cancel the above nc command
> *Steps 4*
>  everything is gone
>    !image-2019-05-20-18-47-12-497.png!
>  ** 
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12558) Yarn application can't stop when flink job finished

2019-05-20 Thread lamber-ken (JIRA)
lamber-ken created FLINK-12558:
--

 Summary: Yarn application can't stop when flink job finished
 Key: FLINK-12558
 URL: https://issues.apache.org/jira/browse/FLINK-12558
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN, Runtime / REST
Affects Versions: 1.6.3
Reporter: lamber-ken
 Attachments: image-2019-05-20-18-47-12-497.png

I ran a flink +SocketWindowWordCount+ job in yarn cluster mode; when I killed the 
socket, the flink job couldn't be stopped, and I can't reproduce the bug again.

*Steps 1*
{code:java}
nc -lk 
{code}
*Steps 2*
{code:java}
bin/flink run -m yarn-cluster -yn 2 
examples/streaming/SocketWindowWordCount.jar --hostname 10.101.52.12 --port 
{code}
*Steps 3*
 cancel the above nc command

*Steps 4*
 everything is gone
   !image-2019-05-20-18-47-12-497.png!

 ** 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-11010) Flink SQL timestamp is inconsistent with currentProcessingTime()

2019-05-15 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-11010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken closed FLINK-11010.
--
Resolution: Won't Fix

> Flink SQL timestamp is inconsistent with currentProcessingTime()
> 
>
> Key: FLINK-11010
> URL: https://issues.apache.org/jira/browse/FLINK-11010
> Project: Flink
>  Issue Type: Bug
>  Components: Table SQL / API
>Affects Versions: 1.6.2, 1.7.0, 1.7.1, 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Flink SQL timestamp is inconsistent with currentProcessingTime().
>  
> the ProcessingTime is just implemented by invoking System.currentTimeMillis() 
> but the long value will be automatically wrapped to a Timestamp with the 
> following statement: 
> `new java.sql.Timestamp(time - TimeZone.getDefault().getOffset(time));`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-15 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840354#comment-16840354
 ] 

lamber-ken commented on FLINK-12302:


[~gjy] this is an extreme example, but it reflects the problem.

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, 
> image-2019-04-23-19-56-49-933.png, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job yarn-cluster mode, and the YARN 
> resourcemanager reran the job.
> When the job failed again, the application finished, but the finalStatus 
> was +UNDEFINED+. It would be better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-15 Thread lamber-ken (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840352#comment-16840352
 ] 

lamber-ken commented on FLINK-12302:


Thanks for your comment [~gjy]. Follow the steps below and you'll reproduce it.

*1, download flink-1.8*
{code:java}
wget 
http://mirrors.tuna.tsinghua.edu.cn/apache/flink/flink-1.8.0/flink-1.8.0-bin-scala_2.11.tgz
wget 
http://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop2-uber/2.4.1-1.8.0/flink-shaded-hadoop2-uber-2.4.1-1.8.0.jar
{code}

*2, config flink*
{code:java}
jobmanager.archive.fs.dir: hdfs:///flink/dev/flink-completed-jobs

yarn.application-attempts: 10

high-availability: zookeeper
high-availability.zookeeper.quorum: spslave.bigdata.ly:2205
high-availability.zookeeper.path.root: /flink-dev
high-availability.zookeeper.storageDir: hdfs:///flink/dev/recovery
{code}

*3, submit flink job*
{code:java}
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class TestDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Restart up to 30 times, one second apart, before the job finally fails.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(30, Time.seconds(1)));

        // Emit an empty string every 100 ms so the sink always has input.
        DataStream<String> text = env.addSource(new SourceFunction<String>() {
            @Override
            public void run(SourceContext<String> ctx) throws Exception {
                while (true) {
                    ctx.collect("");
                    Thread.sleep(100);
                }
            }

            @Override
            public void cancel() {
            }
        });

        // Divide by zero on every record so each restart attempt fails again.
        text.addSink(new SinkFunction<String>() {
            @Override
            public void invoke(String value, Context context) throws Exception {
                int a = 1 / 0;
            }
        });

        env.execute();
    }
}
{code}
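
Then package +TestDemo+ into a jar and submit it, mirroring the command style of 
step 2 (the jar name here is hypothetical):
{code:java}
bin/flink run -m yarn-cluster -yn 2 test-demo.jar
{code}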

*4, after 40s, you'll see*
 !screenshot-1.png! 


*5, now kill the AM by hand*
 !screenshot-2.png! 

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, 
> image-2019-04-23-19-56-49-933.png, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job yarn-cluster mode, and the YARN 
> resourcemanager reran the job.
> When the job failed again, the application finished, but the finalStatus 
> was +UNDEFINED+. It would be better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-15 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: screenshot-2.png

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, 
> image-2019-04-23-19-56-49-933.png, screenshot-1.png, screenshot-2.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job yarn-cluster mode, and the YARN 
> resourcemanager reran the job.
> When the job failed again, the application finished, but the finalStatus 
> was +UNDEFINED+. It would be better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12302) Fixed the wrong finalStatus of yarn application when application finished

2019-05-15 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12302:
---
Attachment: screenshot-1.png

> Fixed the wrong finalStatus of yarn application when application finished
> -
>
> Key: FLINK-12302
> URL: https://issues.apache.org/jira/browse/FLINK-12302
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Affects Versions: 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.9.0
>
> Attachments: fix-bad-finalStatus.patch, 
> image-2019-04-23-19-56-49-933.png, screenshot-1.png, 
> spslave4.bigdata.ly_23951, spslave5.bigdata.ly_20271
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A Flink job (flink-1.6.3) failed in per-job yarn-cluster mode, and the YARN 
> resourcemanager reran the job.
> When the job failed again, the application finished, but the finalStatus 
> was +UNDEFINED+. It would be better to show the state +FAILED+.
> !image-2019-04-23-19-56-49-933.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-12400) NullpointerException using SimpleStringSchema with Kafka

2019-05-15 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated FLINK-12400:
---
Environment: 
* Flink 1.7.2 job on 1.8 cluster
* Kafka 0.10 with a topic in log-compaction

  was:
Flink 1.7.2 job on 1.8 cluster
Kafka 0.10 with a topic in log-compaction


> NullpointerException using SimpleStringSchema with Kafka
> 
>
> Key: FLINK-12400
> URL: https://issues.apache.org/jira/browse/FLINK-12400
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Type Serialization System
>Affects Versions: 1.7.2, 1.8.0
> Environment: * Flink 1.7.2 job on 1.8 cluster
> * Kafka 0.10 with a topic in log-compaction
>Reporter: Pierre Zemb
>Assignee: Pierre Zemb
>Priority: Minor
>
> Hi!
> Yesterday, we saw a strange behavior with our Flink job and Kafka. We are 
> consuming a Kafka topic set up in 
> [log-compaction|https://kafka.apache.org/documentation/#compaction] mode. As 
> such, sending a message with a null payload acts like a tombstone.
> We are consuming Kafka like this:
> {code:java}
> new FlinkKafkaConsumer010<>("topic", new SimpleStringSchema(), 
> this.kafkaProperties)
> {code}
> When we sent the message, the job failed because of a NullPointerException 
> [here|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/serialization/SimpleStringSchema.java#L75].
>  The `byte[] message` argument was null, causing the NPE. 
> We forked the class and added a basic null check, returning null in that case. 
> It fixed our issue. 
> Should we add it to the main class?
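> A minimal sketch of the null check we mean (the subclass name is hypothetical 
> and it assumes UTF-8 as the charset; the actual fork may differ):
> {code:java}
> import java.nio.charset.StandardCharsets;
> 
> import org.apache.flink.api.common.serialization.SimpleStringSchema;
> 
> // Hypothetical wrapper: treat log-compaction tombstones (null payloads) as
> // null strings instead of letting the String constructor throw an NPE.
> public class NullSafeStringSchema extends SimpleStringSchema {
> 
>     @Override
>     public String deserialize(byte[] message) {
>         if (message == null) {
>             return null; // tombstone marker
>         }
>         return new String(message, StandardCharsets.UTF_8);
>     }
> }
> {code}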



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (FLINK-12219) Yarn application can't stop when flink job failed in per-job yarn cluster mode

2019-05-06 Thread lamber-ken (JIRA)


 [ 
https://issues.apache.org/jira/browse/FLINK-12219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reopened FLINK-12219:


> Yarn application can't stop when flink job failed in per-job yarn cluster mode
> --
>
> Key: FLINK-12219
> URL: https://issues.apache.org/jira/browse/FLINK-12219
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / REST
>Affects Versions: 1.6.3, 1.8.0
>    Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Attachments: fix-bug.patch, image-2019-04-17-15-00-40-687.png, 
> image-2019-04-17-15-02-49-513.png, image-2019-04-23-17-37-00-081.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> h3. *Issue detail info*
> In our Flink (1.6.3) production environment, I often encounter a situation where the YARN 
> application can't stop after a Flink job fails in per-job yarn-cluster mode, so 
> I analyzed in depth why this happens.
> When a Flink job fails, the system writes an archive file to a FileSystem 
> through the +MiniDispatcher#archiveExecutionGraph+ method, then notifies the 
> YarnJobClusterEntrypoint to shut down. But if 
> +MiniDispatcher#archiveExecutionGraph+ throws an exception during execution, it 
> breaks the calls that follow.
> So I opened 
> [FLINK-12247|https://issues.apache.org/jira/projects/FLINK/issues/FLINK-12247]
>  to solve the NPE bug when the system writes the archive to the FileSystem. But we still need 
> to consider other exceptions, so we should catch Exception / Throwable, not 
> just IOException.
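> A minimal sketch of the broadened handling (field and logger names are 
> hypothetical; the real dispatcher code differs in detail):
> {code:java}
> // Hedged sketch: wrap the archive step so no failure can block the shutdown path.
> private void archiveExecutionGraph(ArchivedExecutionGraph archivedExecutionGraph) {
>     try {
>         historyServerArchivist.archiveExecutionGraph(archivedExecutionGraph);
>     } catch (Throwable t) {
>         // Catching only IOException would let any other failure propagate
>         // and keep YarnJobClusterEntrypoint from shutting the application down.
>         log.warn("Could not archive completed job {}.", archivedExecutionGraph.getJobID(), t);
>     }
> }
> {code}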
> h3. *Flink yarn job fail flow*
> !image-2019-04-23-17-37-00-081.png!
> h3. *Flink yarn job fail on yarn*
> !image-2019-04-17-15-00-40-687.png!
>  
> h3. *Flink yarn application can't stop*
> !image-2019-04-17-15-02-49-513.png!
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

