[jira] [Assigned] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiangjie Qin reassigned FLINK-18641: Assignee: Jiangjie Qin > "Failure to finalize checkpoint" error in MasterTriggerRestoreHook > -- > > Key: FLINK-18641 > URL: https://issues.apache.org/jira/browse/FLINK-18641 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.11.0 >Reporter: Brian Zhou >Assignee: Jiangjie Qin >Priority: Major > > https://github.com/pravega/flink-connectors is a Pravega connector for Flink. > The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook` > interface to trigger the Pravega checkpoint during Flink checkpoints to ensure > data recovery. The checkpoint recovery tests run fine in > Flink 1.10, but they hit the issues below in Flink 1.11, causing the tests to time out. > We suspect this is related to the checkpoint coordinator threading model changes in > Flink 1.11. > Error stacktrace: > {code} > 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN > o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint > acknowledgement message > org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize > the pending checkpoint 3. Failure reason: Failure to finalize checkpoint. 
> at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948) > at > org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has > not been fully acknowledged yet > at > org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) > at > org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298) > at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021) > ... 9 common frames omitted > {code} > More detail in this mailing thread: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html > Also in https://github.com/pravega/flink-connectors/issues/387 -- This message was sent by Atlassian Jira (v8.3.4#803005)
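The `Preconditions.checkState` failure in the trace above fires when finalization starts before every acknowledgement has arrived. A minimal, self-contained model of that precondition (hypothetical class and method names, not Flink's real `PendingCheckpoint`) where finalization requires both all task acks and all master-hook (e.g. Pravega reader hook) acks:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the "Pending checkpoint has not been fully acknowledged yet"
// precondition. Names are hypothetical; Flink's real class is PendingCheckpoint.
public class PendingCheckpointSketch {
    private final Set<String> pendingTasks;
    private final Set<String> pendingMasterHooks;

    public PendingCheckpointSketch(Set<String> tasks, Set<String> masterHooks) {
        this.pendingTasks = new HashSet<>(tasks);
        this.pendingMasterHooks = new HashSet<>(masterHooks);
    }

    public void acknowledgeTask(String task) {
        pendingTasks.remove(task);
    }

    public void acknowledgeMasterHook(String hook) {
        pendingMasterHooks.remove(hook);
    }

    public boolean isFullyAcknowledged() {
        return pendingTasks.isEmpty() && pendingMasterHooks.isEmpty();
    }

    public void finalizeCheckpoint() {
        // Mirrors the checkState(...) that produces the error above: if the
        // coordinator races ahead of a MasterTriggerRestoreHook ack, this throws.
        if (!isFullyAcknowledged()) {
            throw new IllegalStateException(
                    "Pending checkpoint has not been fully acknowledged yet");
        }
        // ... write checkpoint metadata, register the completed checkpoint, etc.
    }
}
```

Under this model, the 1.11 symptom corresponds to `finalizeCheckpoint()` being invoked while the hook's acknowledgement is still outstanding.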
[jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
[ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166925#comment-17166925 ] Jiangjie Qin commented on FLINK-18641: -- [~SleePy] Sorry for the late reply. I have a fix ready and am working on the tests. I will submit a patch this week. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18695) Allow NettyBufferPool to allocate heap buffers
[ https://issues.apache.org/jira/browse/FLINK-18695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166914#comment-17166914 ] Zhijiang commented on FLINK-18695: -- I agree with [~sewen]'s above analysis. Regarding the two options proposed by [~chesnay], I prefer the first one, allowing heap buffer allocation in NettyBufferPool, for the following reasons: * The current lack of heap buffer support indeed limits upgrading the Netty version. Beyond the current SSL concern, other improvements requiring heap memory may come up in the future. * The main direct memory overhead was already resolved in Flink 1.11. I guess the heap memory overhead should be equivalent to the direct memory one; if so, the impact should be tiny or even negligible. * If we are not quite sure about the heap memory overhead, we can also monitor and collect statistics on heap usage inside NettyBufferPool, as was done for direct memory before. > Allow NettyBufferPool to allocate heap buffers > -- > > Key: FLINK-18695 > URL: https://issues.apache.org/jira/browse/FLINK-18695 > Project: Flink > Issue Type: Improvement > Components: Runtime / Network >Reporter: Chesnay Schepler >Priority: Major > Fix For: 1.12.0 > > > In 4.1.43, Netty changed their SslHandler to always use heap buffers > for JDK SSLEngine implementations, to avoid an additional memory copy. > However, our {{NettyBufferPool}} forbids heap buffer allocations. > We will either have to allow heap buffer allocations, or create a custom > SslHandler implementation that does not use heap buffers (although this seems > ill-advised?). > /cc [~sewen] [~uce] [~NicoK] [~zjwang] [~pnowojski] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-15156) Warn user if System.exit() is called in user code
[ https://issues.apache.org/jira/browse/FLINK-15156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger updated FLINK-15156: --- Labels: starter (was: ) > Warn user if System.exit() is called in user code > - > > Key: FLINK-15156 > URL: https://issues.apache.org/jira/browse/FLINK-15156 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Reporter: Robert Metzger >Priority: Minor > Labels: starter > > It would make debugging Flink errors easier if we intercepted and logged > calls to System.exit() through the SecurityManager. > A user recently had an error where the JobManager was shutting down because > of a System.exit() in the user code: > https://lists.apache.org/thread.html/b28dabcf3068d489f38399c456c80d48569fcdf74b15f8bb95d532d0%40%3Cuser.flink.apache.org%3E > If I remember correctly, we have had such issues before. > I put this ticket into the "Runtime / Coordination" component, as it is > mostly about improving the usability / debuggability in that area. -- This message was sent by Atlassian Jira (v8.3.4#803005)
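The interception the ticket proposes could look roughly like this: a sketch (class name is hypothetical, not Flink code) of a SecurityManager whose `checkExit` logs a warning with the offending call site and then lets the exit proceed, matching the "warn, don't block" intent of the ticket. Note that `SecurityManager` is the JDK-8-era mechanism this 2020 ticket refers to; it is deprecated in newer JDKs.

```java
import java.security.Permission;

// Sketch of the proposed warning: checkExit logs the user-code call site of
// System.exit() before the JVM exit proceeds. Class name is hypothetical.
public class ExitLoggingSecurityManager extends SecurityManager {

    @Override
    public void checkPermission(Permission perm) {
        // Allow everything else; we only want to observe exit calls.
    }

    @Override
    public void checkExit(int status) {
        StringBuilder sb = new StringBuilder(
                "WARN: System.exit(" + status + ") was called in user code:");
        for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
            sb.append("\n\tat ").append(frame);
        }
        System.err.println(sb);
        // Not throwing a SecurityException keeps this a warning only;
        // throwing one here instead would veto the exit entirely.
    }
}
```

Installing it via `System.setSecurityManager(new ExitLoggingSecurityManager())` before invoking user code would then surface the stack trace of the `System.exit()` call in the JobManager log.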
[jira] [Commented] (FLINK-18695) Allow NettyBufferPool to allocate heap buffers
[ https://issues.apache.org/jira/browse/FLINK-18695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166902#comment-17166902 ] Ufuk Celebi commented on FLINK-18695: - I agree with Stephan that it should not have much of an impact. In particular, for the SSL handler, the comment in the [respective Netty PR|https://github.com/netty/netty/commit/39cc7a673939dec96258ff27f5b1874671838af0#diff-2fe7b22a8d650f1ea0bf56a809c061f9R303-R310] says that the direct buffer was copied back to a heap byte[] by the SSL engine anyway. Therefore, allowing heap buffers should be a net positive in the context of the SSL handler. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18742) Some configuration args do not take effect at client
[ https://issues.apache.org/jira/browse/FLINK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166901#comment-17166901 ] Yang Wang commented on FLINK-18742: --- I think this is a valid fix. We should apply the dynamic config options first and then build the packaged program. cc [~kkl0u] > Some configuration args do not take effect at client > > > Key: FLINK-18742 > URL: https://issues.apache.org/jira/browse/FLINK-18742 > Project: Flink > Issue Type: Improvement > Components: Command Line Client >Affects Versions: 1.11.1 >Reporter: Matt Wang >Priority: Major > > Some configuration args from the command line do not take effect at the client. For > example, if a job sets _{color:#505f79}classloader.resolve-order{color}_ > to _{color:#505f79}parent-first{color}_, it takes effect on the TaskManager, but > not on the client. > The *FlinkUserCodeClassLoaders* is created before > _{color:#505f79}getEffectiveConfiguration(){color}_ is called in > {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the > _{color:#505f79}Configuration{color}_ used by > _{color:#505f79}PackagedProgram{color}_ does not include the configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
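The fix described above is an ordering change. A minimal, Flink-free model (all names hypothetical, configuration reduced to a string map) of computing the effective configuration from dynamic `-D` overrides first, so that class-loading settings are read from the merged result before the user-code class loader is built:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch (hypothetical names) of the proposed ordering: merge command-line
// overrides into the base configuration FIRST, then read class-loading
// settings from the merged result when constructing the user class loader.
public class EffectiveConfigSketch {

    static Map<String, String> effectiveConfiguration(
            Map<String, String> flinkConfYaml, Map<String, String> dynamicArgs) {
        Map<String, String> merged = new HashMap<>(flinkConfYaml);
        merged.putAll(dynamicArgs); // -D options win over flink-conf.yaml
        return merged;
    }

    static String resolveOrder(Map<String, String> effectiveConf) {
        // The bug: reading this from the un-merged config ignores -D overrides.
        return effectiveConf.getOrDefault("classloader.resolve-order", "child-first");
    }
}
```

The reported symptom corresponds to calling `resolveOrder` on the raw `flinkConfYaml` map before the merge has happened.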
[jira] [Created] (FLINK-18749) Correct dependencies in Kubernetes pom and notice file
Yang Wang created FLINK-18749: - Summary: Correct dependencies in Kubernetes pom and notice file Key: FLINK-18749 URL: https://issues.apache.org/jira/browse/FLINK-18749 Project: Flink Issue Type: Bug Components: Deployment / Kubernetes Affects Versions: 1.11.1 Reporter: Yang Wang Fix For: 1.11.2 While developing this PR[1], I found some incorrect dependency versions in the NOTICE file. Also, {{com.mifmif}} should be removed from flink-kubernetes/pom.xml since we never use it. [1]. [https://github.com/apache/flink/pull/12995/commits/8519f65321ba24c5164196a67a05d98fb268f490] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18748) Savepoint would be queued unexpected
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-18748: -- Description: Inspired by a [user-zh email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html] After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1.
if (isTriggering || queuedRequests.isEmpty()) {
    return Optional.empty();
}
// 2. too many ongoing checkpoints/savepoints
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
    return Optional.of(queuedRequests.first())
        .filter(CheckpointTriggerRequest::isForce)
        .map(unused -> queuedRequests.pollFirst());
}
// 3. check the timestamp of the last completed checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
    return onTooEarly(nextTriggerDelayMillis);
}
return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. was: Inspired by an [user-zh email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]] After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1.
if (isTriggering || queuedRequests.isEmpty()) {
    return Optional.empty();
}
// 2. too many ongoing checkpoints/savepoints
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
    return Optional.of(queuedRequests.first())
        .filter(CheckpointTriggerRequest::isForce)
        .map(unused -> queuedRequests.pollFirst());
}
// 3. check the timestamp of the last completed checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
    return onTooEarly(nextTriggerDelayMillis);
}
return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.
> Savepoint would be queued unexpected
> --
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0, 1.11.1
> Reporter: Congxian Qiu(klion26)
> Priority: Major
>
> Inspired by a [user-zh email|http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1.
> if (isTriggering || queuedRequests.isEmpty()) {
>     return Optional.empty();
> }
> // 2. too many ongoing checkpoints/savepoints
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>     return Optional.of(queuedRequests.first())
>         .filter(CheckpointTriggerRequest::isForce)
>         .map(unused -> queuedRequests.pollFirst());
> }
> // 3. check the timestamp of the last completed checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>     return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3.
> I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
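The change proposed in the description can be sketched as follows: a stripped-down model (not the real `CheckpointRequestDecider`; names and structure simplified) in which a savepoint request skips the min-pause check of step 3 whenever step 2's concurrency limit is not hit, while periodic checkpoints still honour the pause:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Simplified decider model with the proposed fix: savepoints bypass the
// min-pause delay (step 3) as long as step 2's concurrency limit allows.
public class RequestDeciderSketch {

    static final class Request {
        final boolean isSavepoint;
        Request(boolean isSavepoint) { this.isSavepoint = isSavepoint; }
    }

    private final Deque<Request> queuedRequests = new ArrayDeque<>();
    private final int maxConcurrentCheckpointAttempts;
    private final long minPauseMillis;

    RequestDeciderSketch(int maxConcurrent, long minPauseMillis) {
        this.maxConcurrentCheckpointAttempts = maxConcurrent;
        this.minPauseMillis = minPauseMillis;
    }

    void enqueue(Request r) { queuedRequests.add(r); }

    Optional<Request> chooseRequestToExecute(
            int pendingCheckpoints, long nowMillis, long lastCompletionMillis) {
        if (queuedRequests.isEmpty()) {
            return Optional.empty();
        }
        // step 2: the concurrency limit still applies to every request kind
        if (pendingCheckpoints >= maxConcurrentCheckpointAttempts) {
            return Optional.empty();
        }
        // step 3 with the fix: only non-savepoint requests honour the min pause
        boolean tooEarly = nowMillis - lastCompletionMillis < minPauseMillis;
        if (tooEarly && !queuedRequests.peekFirst().isSavepoint) {
            return Optional.empty(); // re-scheduled later in the real code
        }
        return Optional.of(queuedRequests.pollFirst());
    }
}
```

With this ordering, a savepoint requested right after a completed checkpoint triggers immediately instead of being queued by the pause logic.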
[jira] [Updated] (FLINK-18748) Savepoint would be queued unexpected
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Congxian Qiu(klion26) updated FLINK-18748: -- Description: Inspired by an [user-zh email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]] After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1.
if (isTriggering || queuedRequests.isEmpty()) {
    return Optional.empty();
}
// 2. too many ongoing checkpoints/savepoints
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
    return Optional.of(queuedRequests.first())
        .filter(CheckpointTriggerRequest::isForce)
        .map(unused -> queuedRequests.pollFirst());
}
// 3. check the timestamp of the last completed checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
    return onTooEarly(nextTriggerDelayMillis);
}
return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. was: After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1.
if (isTriggering || queuedRequests.isEmpty()) {
    return Optional.empty();
}
// 2. too many ongoing checkpoints/savepoints
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
    return Optional.of(queuedRequests.first())
        .filter(CheckpointTriggerRequest::isForce)
        .map(unused -> queuedRequests.pollFirst());
}
// 3. check the timestamp of the last completed checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
    return onTooEarly(nextTriggerDelayMillis);
}
return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.
> Savepoint would be queued unexpected
> --
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0, 1.11.1
> Reporter: Congxian Qiu(klion26)
> Priority: Major
>
> Inspired by an [user-zh email|[http://apache-flink.147419.n8.nabble.com/flink-1-11-rest-api-saveppoint-td5497.html]]
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1.
> if (isTriggering || queuedRequests.isEmpty()) {
>     return Optional.empty();
> }
> // 2. too many ongoing checkpoints/savepoints
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>     return Optional.of(queuedRequests.first())
>         .filter(CheckpointTriggerRequest::isForce)
>         .map(unused -> queuedRequests.pollFirst());
> }
> // 3. check the timestamp of the last completed checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>     return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3.
> I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18681) The jar package version conflict causes the task to continue to increase and grab resources
[ https://issues.apache.org/jira/browse/FLINK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166878#comment-17166878 ] Xintong Song commented on FLINK-18681: -- [~apach...@163.com], thanks for providing the screenshot and logs. I found the following warnings in the Yarn RM log.
{code:java}
2020-07-22 17:54:57,155 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wangty IP=x.x.x.61 OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id.PERMISSIONS=Unauthorized access or invalid container APPID=application_1590424616102_556340 CONTAINERID=container_1590424616102_556340_01_02
2020-07-22 17:54:58,157 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wangty IP=x.x.x.61 OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id.PERMISSIONS=Unauthorized access or invalid container APPID=application_1590424616102_556340 CONTAINERID=container_1590424616102_556340_01_03
2020-07-22 17:54:59,160 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wangty IP=x.x.x.61 OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id.PERMISSIONS=Unauthorized access or invalid container APPID=application_1590424616102_556340 CONTAINERID=container_1590424616102_556340_01_04
{code}
It shows that Flink did release the containers, but the operations were rejected by the Yarn RM. The API Flink uses for releasing containers is {{AMRMClientAsync#releaseAssignedContainer}}, via the same client that successfully allocated containers from Yarn.
{code:java}
/**
 * Release containers assigned by the Resource Manager. If the app cannot use
 * the container or wants to give up the container then it can release them.
 * The app needs to make new requests for the released resource capability if
 * it still needs it. eg. it released non-local resources
 * @param containerId
 */
public abstract void releaseAssignedContainer(ContainerId containerId);
{code}
It seems to me that the Hadoop API did not work as expected. I would suggest getting some help from the Apache Hadoop community. Pulling in [~Tao Yang], who is an Apache Hadoop committer and an expert in Yarn.
> The jar package version conflict causes the task to continue to increase and grab resources
> ---
>
> Key: FLINK-18681
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.11.0
> Reporter: wangtaiyang
> Priority: Major
> Attachments: appId.log, dependency.log, image-2020-07-28-15-32-51-851.png, yarn-hadoop-resourcemanager-x.x.x.15.log.2020-07-22-17.log
>
> When I submit a Flink task to Yarn, the default resource configuration is 1G & 1 core, but in fact this task keeps increasing its resources and grabbing more: 2 cores, 3 cores, and so on, up to 200 cores. Then I looked at the JM log and found the following error:
> {code:java}
> // code placeholder
> java.lang.NoSuchMethodError: org.apache.commons.cli.Option.builder(Ljava/lang/String;)Lorg/apache/commons/cli/Option$Builder;
> at org.apache.flink.runtime.entrypoint.parser.CommandLineOptions.(CommandLineOptions.java:28) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> at org.apache.flink.runtime.clusterframework.BootstrapTools.lambda$getDynamicPropertiesAsString$0(BootstrapTools.java:648) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) ~[?:1.8.0_191]
> ...
> java.lang.NoClassDefFoundError: Could not initialize class org.apache.flink.runtime.entrypoint.parser.CommandLineOptions
> at org.apache.flink.runtime.clusterframework.BootstrapTools.lambda$getDynamicPropertiesAsString$0(BootstrapTools.java:648) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
> at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) ~[?:1.8.0_191]
> at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553) ~[?:1.8.0_191]
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_191]{code}
> Finally, it is confirmed that it is caused by the commons-cli version conflict, but the task reporting the error has not stopped and will continue to
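For `NoSuchMethodError` conflicts like the one above, it can help to ask the running JVM where a class was actually loaded from and whether the expected method is present on the classpath. A small diagnostic sketch (helper names are hypothetical, not part of Flink or Hadoop):

```java
import java.security.CodeSource;
import java.util.Arrays;

// Diagnostic sketch: report which jar a class came from and whether it exposes
// a given method, to pin down version conflicts such as the commons-cli one.
public class ClassOriginProbe {

    static String origin(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            CodeSource src = clazz.getProtectionDomain().getCodeSource();
            // Bootstrap/platform classes have no CodeSource.
            return src == null ? "bootstrap classpath" : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "not found";
        }
    }

    static boolean hasMethod(String className, String methodName) {
        try {
            return Arrays.stream(Class.forName(className).getMethods())
                    .anyMatch(m -> m.getName().equals(methodName));
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}
```

Running `origin("org.apache.commons.cli.Option")` and `hasMethod("org.apache.commons.cli.Option", "builder")` on the JobManager classpath would show whether an old commons-cli (pre-1.3, which lacks `Option.builder`) is shadowing the version Flink needs.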
[jira] [Commented] (FLINK-18748) Savepoint would be queued unexpected
[ https://issues.apache.org/jira/browse/FLINK-18748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166877#comment-17166877 ] Congxian Qiu(klion26) commented on FLINK-18748: --- [~pnowojski] [~roman_khachatryan] What do you think about this problem? If it is valid, I can help to fix it.
> Savepoint would be queued unexpected
> --
>
> Key: FLINK-18748
> URL: https://issues.apache.org/jira/browse/FLINK-18748
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0, 1.11.1
> Reporter: Congxian Qiu(klion26)
> Priority: Major
>
> After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
> {code:java}
> Preconditions.checkState(Thread.holdsLock(lock));
> // 1.
> if (isTriggering || queuedRequests.isEmpty()) {
>     return Optional.empty();
> }
> // 2. too many ongoing checkpoints/savepoints
> if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
>     return Optional.of(queuedRequests.first())
>         .filter(CheckpointTriggerRequest::isForce)
>         .map(unused -> queuedRequests.pollFirst());
> }
> // 3. check the timestamp of the last completed checkpoint
> long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
> if (nextTriggerDelayMillis > 0) {
>     return onTooEarly(nextTriggerDelayMillis);
> }
> return Optional.of(queuedRequests.pollFirst());
> {code}
> But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3.
> I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18748) Savepoint would be queued unexpected
Congxian Qiu(klion26) created FLINK-18748: - Summary: Savepoint would be queued unexpected Key: FLINK-18748 URL: https://issues.apache.org/jira/browse/FLINK-18748 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.11.1, 1.11.0 Reporter: Congxian Qiu(klion26) After FLINK-17342, when triggering a checkpoint/savepoint, we'll check whether the request can be triggered in {{CheckpointRequestDecider#chooseRequestToExecute}}; the logic is as follows:
{code:java}
Preconditions.checkState(Thread.holdsLock(lock));
// 1.
if (isTriggering || queuedRequests.isEmpty()) {
    return Optional.empty();
}
// 2. too many ongoing checkpoints/savepoints
if (pendingCheckpointsSizeSupplier.get() >= maxConcurrentCheckpointAttempts) {
    return Optional.of(queuedRequests.first())
        .filter(CheckpointTriggerRequest::isForce)
        .map(unused -> queuedRequests.pollFirst());
}
// 3. check the timestamp of the last completed checkpoint
long nextTriggerDelayMillis = nextTriggerDelayMillis(lastCompletionMs);
if (nextTriggerDelayMillis > 0) {
    return onTooEarly(nextTriggerDelayMillis);
}
return Optional.of(queuedRequests.pollFirst());
{code}
But if currently {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}} and the request is a savepoint, the savepoint will still wait for some time in step 3. I think we should trigger the savepoint immediately if {{pendingCheckpointsSizeSupplier.get()}} < {{maxConcurrentCheckpointAttempts}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-18713) Allow default ms unit for table.exec.mini-batch.allow-latency etc.
[ https://issues.apache.org/jira/browse/FLINK-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jark Wu reassigned FLINK-18713: --- Assignee: hailong wang > Allow default ms unit for table.exec.mini-batch.allow-latency etc. > -- > > Key: FLINK-18713 > URL: https://issues.apache.org/jira/browse/FLINK-18713 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.11.0 >Reporter: hailong wang >Assignee: hailong wang >Priority: Major > Fix For: 1.12.0 > > > We use `scala.concurrent.duration.Duration.create` to parse timeStr in > `TableConfigUtils#getMillisecondFromConfigDuration` for the following properties: > {code:java} > table.exec.async-lookup.timeout > table.exec.source.idle-timeout > table.exec.mini-batch.allow-latency > table.exec.emit.early-fire.delay > table.exec.emit.late-fire.delay{code} > And it must have the unit. > I think we can replace it with `TimeUtils.parseDuration(timeStr)`, just like > `DescriptorProperties#getOptionalDuration`, to have a default ms unit and be consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18713) Allow default ms unit for table.exec.mini-batch.allow-latency etc.
[ https://issues.apache.org/jira/browse/FLINK-18713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166855#comment-17166855 ] hailong wang commented on FLINK-18713: -- I would like to take it. Thanks for assigning it to me. > Allow default ms unit for table.exec.mini-batch.allow-latency etc. > -- > > Key: FLINK-18713 > URL: https://issues.apache.org/jira/browse/FLINK-18713 > Project: Flink > Issue Type: Improvement > Components: Table SQL / Planner >Affects Versions: 1.11.0 >Reporter: hailong wang >Priority: Major > Fix For: 1.12.0 > > > We use `scala.concurrent.duration.Duration.create` to parse timeStr in > `TableConfigUtils#getMillisecondFromConfigDuration` for the following properties: > {code:java} > table.exec.async-lookup.timeout > table.exec.source.idle-timeout > table.exec.mini-batch.allow-latency > table.exec.emit.early-fire.delay > table.exec.emit.late-fire.delay{code} > And it must have the unit. > I think we can replace it with `TimeUtils.parseDuration(timeStr)`, just like > `DescriptorProperties#getOptionalDuration`, to have a default ms unit and be consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005)
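The behaviour the ticket asks for, where a bare number defaults to milliseconds and an explicit unit is honoured, can be sketched independently of Flink's `TimeUtils` (class name hypothetical, unit set simplified to ms/s/min for illustration):

```java
import java.time.Duration;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of duration parsing with a default "ms" unit, approximating what
// TimeUtils.parseDuration would give these table.exec.* options.
public class DurationParseSketch {

    private static final Pattern FORMAT = Pattern.compile("(\\d+)\\s*([a-z]*)");

    static Duration parse(String text) {
        Matcher m = FORMAT.matcher(text.trim().toLowerCase(Locale.ROOT));
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid duration: " + text);
        }
        long value = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "":   // no unit given -> default to milliseconds
            case "ms":
                return Duration.ofMillis(value);
            case "s":
                return Duration.ofSeconds(value);
            case "min":
                return Duration.ofMinutes(value);
            default:
                throw new IllegalArgumentException("Unknown unit in: " + text);
        }
    }
}
```

With this, `table.exec.mini-batch.allow-latency = 500` would be read as 500 ms instead of failing for a missing unit, while `3 s` keeps its explicit unit.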
[jira] [Closed] (FLINK-18747) Make Debezium Json format support timestamp with timezone
[ https://issues.apache.org/jira/browse/FLINK-18747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leon Hao closed FLINK-18747. Resolution: Fixed [~fsk119] Thank you for the reply. > Make Debezium Json format support timestamp with timezone > - > > Key: FLINK-18747 > URL: https://issues.apache.org/jira/browse/FLINK-18747 > Project: Flink > Issue Type: Improvement > Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) >Reporter: Leon Hao >Priority: Major > > Debezium Connector for MySQL maps the timestamp datatype to an ISO 8601 with time > zone format like '2020-07-29T01:09:52.534173Z'. The current code doesn't support > timestamp with time zone even if we set > 'debezium-json.timestamp-format.standard' = 'ISO-8601'. I think we should > support it because: > 1. ISO-8601 itself supports timestamp with timezone. > 2. Sometimes we need to CDC many tables with many timestamp fields from > MySQL. It would be a waste of time to convert these fields from string to > timestamp manually as proposed in FLINK-17752. > 3. It is very confusing to users when errors occur although the documentation says > Flink supports the Debezium JSON format and ISO-8601. > I think we can create a new debezium-json.timestamp-format option to support > timestamp with time zone or have better documentation letting users know > Flink does not support it. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18747) Make Debezium Json format support timestamp with timezone
[ https://issues.apache.org/jira/browse/FLINK-18747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166852#comment-17166852 ] Shengkai Fang commented on FLINK-18747: --- [~lhao] Yes. Debezium uses the same JSON format. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18747) Make Debezium Json format support timestamp with timezone
[ https://issues.apache.org/jira/browse/FLINK-18747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166850#comment-17166850 ] Leon Hao commented on FLINK-18747: -- [~fsk119] Will this work with the Debezium JSON format? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18742) Some configuration args do not take effect at client
[ https://issues.apache.org/jira/browse/FLINK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Wang updated FLINK-18742: -- Summary: Some configuration args do not take effect at client (was: Configuration args do not take effect at client) > Some configuration args do not take effect at client > > > Key: FLINK-18742 > URL: https://issues.apache.org/jira/browse/FLINK-18742 > Project: Flink > Issue Type: Improvement > Components: Command Line Client >Affects Versions: 1.11.1 >Reporter: Matt Wang >Priority: Major > > The configuration args from the command line will not work at the client. For > example, if the job sets {color:#505f79}_classloader.resolve-order_{color} > to _{color:#505f79}parent-first{color}_, it works at the TaskManager, but > not at the Client. > The *FlinkUserCodeClassLoaders* will be created before > _{color:#505f79}getEffectiveConfiguration(){color}_ is called in > {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the > _{color:#505f79}Configuration{color}_ used by > _{color:#505f79}PackagedProgram{color}_ does not include the configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18742) Some configuration args do not take effect at client
[ https://issues.apache.org/jira/browse/FLINK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Wang updated FLINK-18742: -- Description: Some configuration args from the command line will not work at the client. For example, if the job sets {color:#505f79}_classloader.resolve-order_{color} to _{color:#505f79}parent-first{color}_, it works at the TaskManager, but not at the Client. The *FlinkUserCodeClassLoaders* will be created before _{color:#505f79}getEffectiveConfiguration(){color}_ is called in {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the _{color:#505f79}Configuration{color}_ used by _{color:#505f79}PackagedProgram{color}_ does not include the configuration args. was: The configuration args from the command line will not work at the client. For example, if the job sets {color:#505f79}_classloader.resolve-order_{color} to _{color:#505f79}parent-first{color}_, it works at the TaskManager, but not at the Client. The *FlinkUserCodeClassLoaders* will be created before _{color:#505f79}getEffectiveConfiguration(){color}_ is called in {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the _{color:#505f79}Configuration{color}_ used by _{color:#505f79}PackagedProgram{color}_ does not include the configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18747) Make Debezium Json format support timestamp with timezone
[ https://issues.apache.org/jira/browse/FLINK-18747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166839#comment-17166839 ] Shengkai Fang commented on FLINK-18747: --- Currently, we have implemented the TIMESTAMP_WITH_LOCAL_ZONE datatype to support timestamps with a zone in FLINK-18296. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18747) Make Debezium Json format support timestamp with timezone
Leon Hao created FLINK-18747: Summary: Make Debezium Json format support timestamp with timezone Key: FLINK-18747 URL: https://issues.apache.org/jira/browse/FLINK-18747 Project: Flink Issue Type: Improvement Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) Reporter: Leon Hao Debezium Connector for MySQL maps the timestamp datatype to an ISO 8601 with time zone format like '2020-07-29T01:09:52.534173Z'. The current code doesn't support timestamp with time zone even if we set 'debezium-json.timestamp-format.standard' = 'ISO-8601'. I think we should support it because: 1. ISO-8601 itself supports timestamp with timezone. 2. Sometimes we need to CDC many tables with many timestamp fields from MySQL. It would be a waste of time to convert these fields from string to timestamp manually as proposed in FLINK-17752. 3. It is very confusing to users when errors occur although the documentation says Flink supports the Debezium JSON format and ISO-8601. I think we can create a new debezium-json.timestamp-format option to support timestamp with time zone or have better documentation letting users know Flink does not support it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
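The value quoted in the issue illustrates the problem: `'2020-07-29T01:09:52.534173Z'` carries a zone offset, so it cannot be parsed as a plain `LocalDateTime`. A sketch of the required parsing (this is not Flink's actual Debezium deserializer, just the `java.time` calls that handle such a value):

```java
import java.time.Instant;
import java.time.OffsetDateTime;

// Sketch: an ISO-8601 value with a zone offset must go through
// OffsetDateTime (or Instant), not LocalDateTime, which rejects the 'Z'.
public class Iso8601Parse {
    public static Instant parseWithZone(String value) {
        // ISO_OFFSET_DATE_TIME accepts 'Z' and numeric offsets alike.
        return OffsetDateTime.parse(value).toInstant();
    }
}
```

A `TIMESTAMP WITH LOCAL TIME ZONE` column could then be populated from the resulting `Instant`, which is what the FLINK-18296 work referenced below enables.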
[jira] [Commented] (FLINK-18625) Maintain redundant taskmanagers to speed up failover
[ https://issues.apache.org/jira/browse/FLINK-18625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166829#comment-17166829 ] Xintong Song commented on FLINK-18625: -- [~trohrmann], regarding your questions. bq. How would this feature work if the job requests heterogeneous slots which might result in differently sized TMs? I guess we will allocate default sized TMs. But what if this will prevent us from allocating fewer larger sized TMs which are required for fulfilling the heterogeneous slot requests? I see your point. One optimization could be to release the redundant task managers if there are heterogeneous pending worker requests. The problem is that a redundant task manager may not be releasable if any of its slots are allocated (e.g., slots are evenly spread out), and even if it is releasable, it would cost more time to obtain a new task manager. I guess that's the price we need to pay if this feature is enabled. WDYT? bq. How does this feature relate to FLINK-16605 and FLINK-15959? I believe that the lower and upper bounds should also limit the number of redundant slots, right? According to [~Jiangang]'s PR, the upper bound also limits the number of redundant slots. I believe it should be the same for the lower bound. We should make sure of that when working on FLINK-15959. cc [~karmagyz] > Maintain redundant taskmanagers to speed up failover > > > Key: FLINK-18625 > URL: https://issues.apache.org/jira/browse/FLINK-18625 > Project: Flink > Issue Type: New Feature > Components: Runtime / Coordination >Reporter: Liu >Assignee: Liu >Priority: Major > Labels: pull-request-available > > When a Flink job fails because of killed TaskManagers, it requests new > containers when restarting. Requesting new containers can be very slow; > sometimes it takes dozens of seconds or even more. The reasons vary: for > example, YARN and HDFS may be slow, or machine performance may be poor. 
In some > production scenarios, the SLA is high and failover should complete in seconds. > > To speed up the recovery process, we can maintain redundant slots in advance. > When the job restarts, it can use the redundant slots at once instead of > requesting new taskmanagers. > > The implementation can be done in SlotManagerImpl. Below is a brief description: > # In the constructor, init redundantTaskmanagerNum from the config. > # In method start(), allocate redundant taskmanagers. > # In method start(), change taskManagerTimeoutCheck() to > checkValidTaskManagers(). > # In method checkValidTaskManagers(), manage redundant taskmanagers and > timed-out taskmanagers. The idle taskmanager number must not be less than > redundantTaskmanagerNum. > * If less, allocate from the resourceManager until equal. > * If more, release timed-out taskmanagers but keep at least > redundantTaskmanagerNum idle taskmanagers. -- This message was sent by Atlassian Jira (v8.3.4#803005)
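The bookkeeping described in the last step can be sketched as two pure functions (a hypothetical illustration of the invariant, not SlotManagerImpl code; the names `toAllocate`/`releasable` are assumptions):

```java
// Sketch of the checkValidTaskManagers() invariant: keep the number of idle
// TaskManagers at or above the configured redundancy.
public class RedundancyCheck {

    /** How many additional TaskManagers must be requested to restore redundancy. */
    public static int toAllocate(int idle, int redundant) {
        return Math.max(0, redundant - idle);
    }

    /**
     * How many timed-out idle TaskManagers may be released without
     * dropping the idle count below the redundancy.
     */
    public static int releasable(int idle, int timedOut, int redundant) {
        return Math.max(0, Math.min(timedOut, idle - redundant));
    }
}
```

For example, with 5 idle TaskManagers, 4 of them timed out, and a redundancy of 2, only 3 may be released, so 2 idle TaskManagers always remain available for fast failover.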
[jira] [Updated] (FLINK-18681) The jar package version conflict causes the task to continue to increase and grab resources
[ https://issues.apache.org/jira/browse/FLINK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangtaiyang updated FLINK-18681: Attachment: yarn-hadoop-resourcemanager-x.x.x.15.log.2020-07-22-17.log > The jar package version conflict causes the task to continue to increase and > grab resources > --- > > Key: FLINK-18681 > URL: https://issues.apache.org/jira/browse/FLINK-18681 > Project: Flink > Issue Type: Bug >Affects Versions: 1.11.0 >Reporter: wangtaiyang >Priority: Major > Attachments: appId.log, dependency.log, > image-2020-07-28-15-32-51-851.png, > yarn-hadoop-resourcemanager-x.x.x.15.log.2020-07-22-17.log > > > When I submit a Flink task to YARN, the default resource configuration is > 1G & 1 core, but in fact this task keeps increasing its resources: 2 cores, > 3 cores, and so on, up to 200 cores. Then I looked at the JM log and found the > following error: > {code:java} > // code placeholder > java.lang.NoSuchMethodError: > org.apache.commons.cli.Option.builder(Ljava/lang/String;)Lorg/apache/commons/cli/Option$Builder;java.lang.NoSuchMethodError: > > org.apache.commons.cli.Option.builder(Ljava/lang/String;)Lorg/apache/commons/cli/Option$Builder; > at > org.apache.flink.runtime.entrypoint.parser.CommandLineOptions.<clinit>(CommandLineOptions.java:28) > ~[flink-dist_2.11-1.11.1.jar:1.11.1] at > org.apache.flink.runtime.clusterframework.BootstrapTools.lambda$getDynamicPropertiesAsString$0(BootstrapTools.java:648) > ~[flink-dist_2.11-1.11.1.jar:1.11.1] at > java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) > ~[?:1.8.0_191] > ... 
> java.lang.NoClassDefFoundError: Could not initialize class > org.apache.flink.runtime.entrypoint.parser.CommandLineOptionsjava.lang.NoClassDefFoundError: > Could not initialize class > org.apache.flink.runtime.entrypoint.parser.CommandLineOptions at > org.apache.flink.runtime.clusterframework.BootstrapTools.lambda$getDynamicPropertiesAsString$0(BootstrapTools.java:648) > ~[flink-dist_2.11-1.11.1.jar:1.11.1] at > java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) > ~[?:1.8.0_191] at > java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553) > ~[?:1.8.0_191] at > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > ~[?:1.8.0_191]{code} > Finally, it is confirmed that it is caused by the commons-cli version > conflict, but the failing task has not stopped and continues to grab more and > more resources. Is this a bug? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18681) The jar package version conflict causes the task to continue to increase and grab resources
[ https://issues.apache.org/jira/browse/FLINK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166824#comment-17166824 ] wangtaiyang commented on FLINK-18681: - It's here !image-2020-07-28-15-32-51-851.png! RM LOG: [^yarn-hadoop-resourcemanager-x.x.x.15.log.2020-07-22-17.log] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166822#comment-17166822 ] Jark Wu commented on FLINK-18496: - I noticed you are using Jekyll 4.1.1 instead of 4.0.1 > Anchors are not generated based on ZH characters > > > Key: FLINK-18496 > URL: https://issues.apache.org/jira/browse/FLINK-18496 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Zhu Zhu >Assignee: Zhilong Hong >Priority: Major > Labels: starter > > In the ZH version pages of flink-web, the anchors are not generated based on ZH > characters. The anchor name would be like 'section-1', 'section-2' if there > are no EN characters. An example can be the links in the navigator of > https://flink.apache.org/zh/contributing/contribute-code.html > This makes it impossible to reference an anchor from the content because the anchor > name might change unexpectedly if a new section is added. > Note that it is a problem for flink-web only. The docs generated from the > flink repo can properly generate ZH anchors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166819#comment-17166819 ] Jark Wu commented on FLINK-18496: - My point is: if it conflicts with the excerpt plugin, why does it work in the Flink main repo? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166815#comment-17166815 ] Zhilong Hong edited comment on FLINK-18496 at 7/29/20, 2:36 AM: I tried to use the Gemfile in flink/docs, but it doesn't work. In my opinion, the main blocker is that "jekyll-multiple-language" conflicts with the *excerpt* feature in Jekyll 4. I'm still working on this, but it seems difficult to solve this issue. was (Author: thesharing): I tried to use the Gemfile in flink/docs, but it doesn't work. In my opinion, the main blocker is that "jekyll-multiple-language" conflicts with the *excerpt* feature in Jekyll 4. I'm still working on this, but it seems hard to solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166815#comment-17166815 ] Zhilong Hong commented on FLINK-18496: -- I tried to use the Gemfile in flink/docs, but it doesn't work. In my opinion, the main blocker is that "jekyll-multiple-language" conflicts with the *excerpt* feature in Jekyll 4. I'm still working on this, but it seems hard to solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
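The 'section-N' symptom discussed in this thread can be illustrated with a small Java sketch (assumed slugification rules, not Jekyll's or kramdown's actual implementation): a slug generator that keeps only ASCII word characters produces an empty slug for a purely Chinese heading, forcing a positional fallback, while a generator that also keeps CJK characters yields a stable anchor.

```java
// Sketch of the anchor problem: ASCII-only slugification empties a ZH
// heading; keeping Han characters (\p{IsHan}) preserves a usable anchor.
public class AnchorSlug {

    public static String asciiOnly(String heading) {
        return heading.toLowerCase()
                .replaceAll("[^a-z0-9]+", "-")
                .replaceAll("^-|-$", "");
    }

    public static String keepCjk(String heading) {
        return heading.toLowerCase()
                .replaceAll("[^a-z0-9\\p{IsHan}]+", "-")
                .replaceAll("^-|-$", "");
    }
}
```

When `asciiOnly` returns an empty string, a generator has no choice but to emit a positional name like 'section-1', which changes whenever a section is inserted; `keepCjk`-style behavior is what the flink main repo's setup evidently achieves.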
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166813#comment-17166813 ] tartarus commented on FLINK-18663: -- [~chesnay] [~trohrmann] How should we fix this problem? What do you think about passing {{maxContentLength}} as a constructor parameter to AbstractHandler and then modifying all handlers? Do you have any good suggestions? Thanks. > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE because FlinkHttpObjectAggregator is null. > When a REST handler throws an exception, it executes this code: > {code:java} > private CompletableFuture<Void> handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > But flinkHttpObjectAggregator is null in some cases, so this throws an NPE. This > method is called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > The result is that the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync() never completes. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
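Besides the constructor-parameter proposal above, the failing lookup could also be guarded defensively. The sketch below models that idea; the names and the default value are assumptions for illustration, not Flink's actual fix (`Integer` stands in for the possibly-null result of `ctx.pipeline().get(FlinkHttpObjectAggregator.class)`).

```java
// Sketch of a defensive variant of the failing lookup: fall back to an
// assumed default when FlinkHttpObjectAggregator is absent from the
// pipeline, instead of dereferencing a null handler.
public class MaxLengthLookup {
    static final int DEFAULT_MAX_CONTENT_LENGTH = 104_857_600; // assumed default
    static final int OTHER_RESP_PAYLOAD_OVERHEAD = 1024;       // assumed overhead

    /** A null argument models the aggregator missing from the channel pipeline. */
    public static int effectiveMaxLength(Integer aggregatorMaxContentLength) {
        int max = aggregatorMaxContentLength != null
                ? aggregatorMaxContentLength
                : DEFAULT_MAX_CONTENT_LENGTH;
        return max - OTHER_RESP_PAYLOAD_OVERHEAD;
    }
}
```

Either way, `handleException` would then always reach `sendErrorResponse`, so `finalizeRequestProcessing` runs, the InFlightRequestTracker is cleared, and `closeAsync()` can complete.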
[jira] [Commented] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166812#comment-17166812 ] Jark Wu commented on FLINK-18496: - Could you use the versions used in flink main repo [1]? [1]: https://github.com/apache/flink/blob/master/docs/Gemfile > Anchors are not generated based on ZH characters > > > Key: FLINK-18496 > URL: https://issues.apache.org/jira/browse/FLINK-18496 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Zhu Zhu >Assignee: Zhilong Hong >Priority: Major > Labels: starter > > In ZH version pages of flink-web, the anchors are not generated based on ZH > characters. The anchor name would be like 'section-1', 'section-2' if there > is no EN characters. An example can be the links in the navigator of > https://flink.apache.org/zh/contributing/contribute-code.html > This makes it impossible to ref an anchor from the content because the anchor > name might change unexpectedly if a new section is added. > Note that it is a problem for flink-web only. The docs generated from the > flink repo can properly generate ZH anchors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18744) resume from modified savepoint directory: No such file or directory
[ https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166808#comment-17166808 ] Congxian Qiu(klion26) commented on FLINK-18744: --- [~wayland] Thanks for reporting this issue, I'll take a look at it. > resume from modified savepoint directory: No such file or directory > - > > Key: FLINK-18744 > URL: https://issues.apache.org/jira/browse/FLINK-18744 > Project: Flink > Issue Type: Bug > Components: API / State Processor >Affects Versions: 1.11.0 >Reporter: tao wang >Priority: Major > > If I resume a job from a savepoint which was modified by the State Processor API, > such as loading from /savepoint-path-old and writing to /savepoint-path-new, > the job resumed with savepoint path = /savepoint-path-new while throwing an > Exception : > _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. > I think it's an issue because Flink 1.11 uses absolute paths in savepoints > and checkpoints, but the State Processor API missed this. > The job will work well with the new savepoint (whose path is /savepoint-path-new) > if I copy all directories except `_metadata` from /savepoint-path-old to > /savepoint-path-new. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16947) ArtifactResolutionException: Could not transfer artifact. Entry [...] has not been leased from this pool
[ https://issues.apache.org/jira/browse/FLINK-16947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166805#comment-17166805 ] Dian Fu commented on FLINK-16947: - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4975&view=logs&j=8fd975ef-f478-511d-4997-6f15fe8a1fd3&t=ac0fa443-5d45-5a6b-3597-0310ecc1d2ab > ArtifactResolutionException: Could not transfer artifact. Entry [...] has > not been leased from this pool > - > > Key: FLINK-16947 > URL: https://issues.apache.org/jira/browse/FLINK-16947 > Project: Flink > Issue Type: Bug > Components: Build System / Azure Pipelines >Reporter: Piotr Nowojski >Assignee: Robert Metzger >Priority: Blocker > Labels: test-stability > Fix For: 1.12.0 > > > https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6982&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5 > Build of flink-metrics-availability-test failed with: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-surefire-plugin:2.22.1:test (end-to-end-tests) > on project flink-metrics-availability-test: Unable to generate classpath: > org.apache.maven.artifact.resolver.ArtifactResolutionException: Could not > transfer artifact org.apache.maven.surefire:surefire-grouper:jar:2.22.1 > from/to google-maven-central > (https://maven-central-eu.storage-download.googleapis.com/maven2/): Entry > [id:13][route:{s}->https://maven-central-eu.storage-download.googleapis.com:443][state:null] > has not been leased from this pool > [ERROR] org.apache.maven.surefire:surefire-grouper:jar:2.22.1 > [ERROR] > [ERROR] from the specified remote repositories: > [ERROR] google-maven-central > (https://maven-central-eu.storage-download.googleapis.com/maven2/, > releases=true, snapshots=false), > [ERROR] apache.snapshots (https://repository.apache.org/snapshots, > releases=false, snapshots=true) > [ERROR] Path to dependency: > [ERROR] 1) dummy:dummy:jar:1.0 > [ERROR] 2) 
org.apache.maven.surefire:surefire-junit47:jar:2.22.1 > [ERROR] 3) org.apache.maven.surefire:common-junit48:jar:2.22.1 > [ERROR] 4) org.apache.maven.surefire:surefire-grouper:jar:2.22.1 > [ERROR] -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, please > read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException > [ERROR] > [ERROR] After correcting the problems, you can resume the build with the > command > [ERROR] mvn -rf :flink-metrics-availability-test > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-17274) Maven: Premature end of Content-Length delimited message body
[ https://issues.apache.org/jira/browse/FLINK-17274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166804#comment-17166804 ] Dian Fu commented on FLINK-17274: - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4975&view=logs&j=3d12d40f-c62d-5ec4-6acc-0efe94cc3e89&t=5d6e4255-0ea8-5e2a-f52c-c881b7872361 > Maven: Premature end of Content-Length delimited message body > - > > Key: FLINK-17274 > URL: https://issues.apache.org/jira/browse/FLINK-17274 > Project: Flink > Issue Type: Bug > Components: Build System / Azure Pipelines >Reporter: Robert Metzger >Assignee: Robert Metzger >Priority: Critical > Fix For: 1.12.0 > > > CI: > https://dev.azure.com/rmetzger/Flink/_build/results?buildId=7786&view=logs&j=52b61abe-a3cc-5bde-cc35-1bbe89bb7df5&t=54421a62-0c80-5aad-3319-094ff69180bb > {code} > [ERROR] Failed to execute goal on project > flink-connector-elasticsearch7_2.11: Could not resolve dependencies for > project > org.apache.flink:flink-connector-elasticsearch7_2.11:jar:1.11-SNAPSHOT: Could > not transfer artifact org.apache.lucene:lucene-sandbox:jar:8.3.0 from/to > alicloud-mvn-mirror > (http://mavenmirror.alicloud.dak8s.net:/repository/maven-central/): GET > request of: org/apache/lucene/lucene-sandbox/8.3.0/lucene-sandbox-8.3.0.jar > from alicloud-mvn-mirror failed: Premature end of Content-Length delimited > message body (expected: 289920; received: 239832 -> [Help 1] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18746) WindowStaggerTest.testWindowStagger failed
Dian Fu created FLINK-18746: --- Summary: WindowStaggerTest.testWindowStagger failed Key: FLINK-18746 URL: https://issues.apache.org/jira/browse/FLINK-18746 Project: Flink Issue Type: Bug Components: API / DataStream Affects Versions: 1.12.0 Reporter: Dian Fu https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4975&view=logs&j=77a9d8e1-d610-59b3-fc2a-4766541e0e33&t=7c61167f-30b3-5893-cc38-a9e3d057e392 {code}
2020-07-28T21:16:30.1350624Z [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.145 s <<< FAILURE! - in org.apache.flink.streaming.runtime.operators.windowing.WindowStaggerTest
2020-07-28T21:16:30.1352065Z [ERROR] testWindowStagger(org.apache.flink.streaming.runtime.operators.windowing.WindowStaggerTest) Time elapsed: 0.012 s <<< FAILURE!
2020-07-28T21:16:30.1352701Z java.lang.AssertionError
2020-07-28T21:16:30.1353104Z at org.junit.Assert.fail(Assert.java:86)
2020-07-28T21:16:30.1353810Z at org.junit.Assert.assertTrue(Assert.java:41)
2020-07-28T21:16:30.1354289Z at org.junit.Assert.assertTrue(Assert.java:52)
2020-07-28T21:16:30.1354914Z at org.apache.flink.streaming.runtime.operators.windowing.WindowStaggerTest.testWindowStagger(WindowStaggerTest.java:38)
2020-07-28T21:16:30.1355520Z at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2020-07-28T21:16:30.1356060Z at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2020-07-28T21:16:30.1356663Z at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-07-28T21:16:30.1357220Z at java.lang.reflect.Method.invoke(Method.java:498)
2020-07-28T21:16:30.1357775Z at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
2020-07-28T21:16:30.1358383Z at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
2020-07-28T21:16:30.1358986Z at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
2020-07-28T21:16:30.1359623Z at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
2020-07-28T21:16:30.1360187Z at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
2020-07-28T21:16:30.1360740Z at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
2020-07-28T21:16:30.1361364Z at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
2020-07-28T21:16:30.1361916Z at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
2020-07-28T21:16:30.1362432Z at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
2020-07-28T21:16:30.1362976Z at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
2020-07-28T21:16:30.1363516Z at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
2020-07-28T21:16:30.1364041Z at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
2020-07-28T21:16:30.1364568Z at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
2020-07-28T21:16:30.1365139Z at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
2020-07-28T21:16:30.1365764Z at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
2020-07-28T21:16:30.1366413Z at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
2020-07-28T21:16:30.1367036Z at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
2020-07-28T21:16:30.1367671Z at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
2020-07-28T21:16:30.1368337Z at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
2020-07-28T21:16:30.1368956Z at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
2020-07-28T21:16:30.1369530Z at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18745) 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to download Elasticsearch
[ https://issues.apache.org/jira/browse/FLINK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu updated FLINK-18745: Labels: test-stability (was: ) > 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to > download Elasticsearch > -- > > Key: FLINK-18745 > URL: https://issues.apache.org/jira/browse/FLINK-18745 > Project: Flink > Issue Type: Test > Components: Connectors / ElasticSearch, Table SQL / Client, Tests >Affects Versions: 1.12.0, 1.11.1 >Reporter: Dian Fu >Priority: Major > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4976&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=03dbd840-5430-533d-d1a7-05d0ebe03873 > {code} > Downloading Elasticsearch from > https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.5.1-linux-x86_64.tar.gz > ... > 2020-07-28T22:23:48.2016019Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:48.2017880Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:48.2018245Z > 2020-07-28T22:23:48.4204474Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:49.4207369Z 0 276M0 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:50.4205512Z 2 276M2 6459k0 0 5291k 0 > 0:00:53 0:00:01 0:00:52 5290k > 2020-07-28T22:23:51.4205838Z 7 276M7 20.2M0 0 9343k 0 > 0:00:30 0:00:02 0:00:28 9341k > 2020-07-28T22:23:51.5660725Z 14 276M 14 39.9M0 0 12.3M 0 > 0:00:22 0:00:03 0:00:19 12.3M > 2020-07-28T22:23:51.5661374Z 15 276M 15 43.2M0 0 12.8M 0 > 0:00:21 0:00:03 0:00:18 12.8M > 2020-07-28T22:23:51.5735405Z curl: (18) transfer closed with 244702844 bytes > remaining to read > 2020-07-28T22:23:51.9894747Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:51.9895725Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:51.9896068Z > 2020-07-28T22:23:51.9940775Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port > 
9200: Connection refused > 2020-07-28T22:23:51.9951219Z [FAIL] Test script contains errors. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18745) 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to download Elasticsearch
[ https://issues.apache.org/jira/browse/FLINK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dian Fu updated FLINK-18745: Affects Version/s: 1.12.0 > 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to > download Elasticsearch > -- > > Key: FLINK-18745 > URL: https://issues.apache.org/jira/browse/FLINK-18745 > Project: Flink > Issue Type: Test > Components: Connectors / ElasticSearch, Table SQL / Client, Tests >Affects Versions: 1.12.0, 1.11.1 >Reporter: Dian Fu >Priority: Major > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4976&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=03dbd840-5430-533d-d1a7-05d0ebe03873 > {code} > Downloading Elasticsearch from > https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.5.1-linux-x86_64.tar.gz > ... > 2020-07-28T22:23:48.2016019Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:48.2017880Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:48.2018245Z > 2020-07-28T22:23:48.4204474Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:49.4207369Z 0 276M0 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:50.4205512Z 2 276M2 6459k0 0 5291k 0 > 0:00:53 0:00:01 0:00:52 5290k > 2020-07-28T22:23:51.4205838Z 7 276M7 20.2M0 0 9343k 0 > 0:00:30 0:00:02 0:00:28 9341k > 2020-07-28T22:23:51.5660725Z 14 276M 14 39.9M0 0 12.3M 0 > 0:00:22 0:00:03 0:00:19 12.3M > 2020-07-28T22:23:51.5661374Z 15 276M 15 43.2M0 0 12.8M 0 > 0:00:21 0:00:03 0:00:18 12.8M > 2020-07-28T22:23:51.5735405Z curl: (18) transfer closed with 244702844 bytes > remaining to read > 2020-07-28T22:23:51.9894747Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:51.9895725Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:51.9896068Z > 2020-07-28T22:23:51.9940775Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port > 9200: Connection refused > 
2020-07-28T22:23:51.9951219Z [FAIL] Test script contains errors. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18745) 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to download Elasticsearch
[ https://issues.apache.org/jira/browse/FLINK-18745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166803#comment-17166803 ] Dian Fu commented on FLINK-18745: - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4942&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529 > 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to > download Elasticsearch > -- > > Key: FLINK-18745 > URL: https://issues.apache.org/jira/browse/FLINK-18745 > Project: Flink > Issue Type: Test > Components: Connectors / ElasticSearch, Table SQL / Client, Tests >Affects Versions: 1.11.1 >Reporter: Dian Fu >Priority: Major > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4976&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=03dbd840-5430-533d-d1a7-05d0ebe03873 > {code} > Downloading Elasticsearch from > https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.5.1-linux-x86_64.tar.gz > ... 
> 2020-07-28T22:23:48.2016019Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:48.2017880Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:48.2018245Z > 2020-07-28T22:23:48.4204474Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:49.4207369Z 0 276M0 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0 > 2020-07-28T22:23:50.4205512Z 2 276M2 6459k0 0 5291k 0 > 0:00:53 0:00:01 0:00:52 5290k > 2020-07-28T22:23:51.4205838Z 7 276M7 20.2M0 0 9343k 0 > 0:00:30 0:00:02 0:00:28 9341k > 2020-07-28T22:23:51.5660725Z 14 276M 14 39.9M0 0 12.3M 0 > 0:00:22 0:00:03 0:00:19 12.3M > 2020-07-28T22:23:51.5661374Z 15 276M 15 43.2M0 0 12.8M 0 > 0:00:21 0:00:03 0:00:18 12.8M > 2020-07-28T22:23:51.5735405Z curl: (18) transfer closed with 244702844 bytes > remaining to read > 2020-07-28T22:23:51.9894747Z % Total% Received % Xferd Average Speed > TimeTime Time Current > 2020-07-28T22:23:51.9895725Z Dload Upload > Total SpentLeft Speed > 2020-07-28T22:23:51.9896068Z > 2020-07-28T22:23:51.9940775Z 0 00 00 0 0 0 > --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port > 9200: Connection refused > 2020-07-28T22:23:51.9951219Z [FAIL] Test script contains errors. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18745) 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to download Elasticsearch
Dian Fu created FLINK-18745: --- Summary: 'SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1)' failed to download Elasticsearch Key: FLINK-18745 URL: https://issues.apache.org/jira/browse/FLINK-18745 Project: Flink Issue Type: Test Components: Connectors / ElasticSearch, Table SQL / Client, Tests Affects Versions: 1.11.1 Reporter: Dian Fu https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=4976&view=logs&j=91bf6583-3fb2-592f-e4d4-d79d79c3230a&t=03dbd840-5430-533d-d1a7-05d0ebe03873 {code} Downloading Elasticsearch from https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.5.1-linux-x86_64.tar.gz ... 2020-07-28T22:23:48.2016019Z % Total% Received % Xferd Average Speed TimeTime Time Current 2020-07-28T22:23:48.2017880Z Dload Upload Total SpentLeft Speed 2020-07-28T22:23:48.2018245Z 2020-07-28T22:23:48.4204474Z 0 00 00 0 0 0 --:--:-- --:--:-- --:--:-- 0 2020-07-28T22:23:49.4207369Z 0 276M0 00 0 0 0 --:--:-- --:--:-- --:--:-- 0 2020-07-28T22:23:50.4205512Z 2 276M2 6459k0 0 5291k 0 0:00:53 0:00:01 0:00:52 5290k 2020-07-28T22:23:51.4205838Z 7 276M7 20.2M0 0 9343k 0 0:00:30 0:00:02 0:00:28 9341k 2020-07-28T22:23:51.5660725Z 14 276M 14 39.9M0 0 12.3M 0 0:00:22 0:00:03 0:00:19 12.3M 2020-07-28T22:23:51.5661374Z 15 276M 15 43.2M0 0 12.8M 0 0:00:21 0:00:03 0:00:18 12.8M 2020-07-28T22:23:51.5735405Z curl: (18) transfer closed with 244702844 bytes remaining to read 2020-07-28T22:23:51.9894747Z % Total% Received % Xferd Average Speed TimeTime Time Current 2020-07-28T22:23:51.9895725Z Dload Upload Total SpentLeft Speed 2020-07-28T22:23:51.9896068Z 2020-07-28T22:23:51.9940775Z 0 00 00 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to localhost port 9200: Connection refused 2020-07-28T22:23:51.9951219Z [FAIL] Test script contains errors. {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
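[Editor's note] The failure above is a truncated download (curl exit code 18), after which the script proceeds and fails to reach localhost:9200. One mitigation is to wrap the flaky transfer in retries with backoff. The sketch below illustrates that idea in generic Java; it is not the actual e2e test fix (those scripts are bash), and all names and limits here are invented for illustration:

```java
import java.util.concurrent.Callable;

// Generic retry helper: re-run 'action' up to maxAttempts times,
// sleeping with a linearly growing backoff between failed attempts.
// The last failure is rethrown if every attempt fails.
public class RetryingDownload {
    public static <T> T withRetries(Callable<T> action, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis * attempt); // linear backoff
                }
            }
        }
        throw last;
    }
}
```

A caller would wrap the download step, e.g. `withRetries(() -> download(url), 5, 1000L)`, so a single truncated transfer no longer fails the whole test run.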
[jira] [Updated] (FLINK-18690) Implement LocalInputPreferredSlotSharingStrategy
[ https://issues.apache.org/jira/browse/FLINK-18690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhu Zhu updated FLINK-18690: Component/s: Runtime / Coordination > Implement LocalInputPreferredSlotSharingStrategy > > > Key: FLINK-18690 > URL: https://issues.apache.org/jira/browse/FLINK-18690 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Reporter: Andrey Zagrebin >Assignee: Zhu Zhu >Priority: Major > > Implement ExecutionSlotSharingGroup, SlotSharingStrategy and default > LocalInputPreferredSlotSharingStrategy. > The default strategy would be LocalInputPreferredSlotSharingStrategy. It will > try to reduce remote data exchanges. Subtasks, which are connected and belong > to the same SlotSharingGroup, tend to be put in the same > ExecutionSlotSharingGroup. > See [design > doc|https://docs.google.com/document/d/10CbCVDBJWafaFOovIXrR8nAr2BnZ_RAGFtUeTHiJplw/edit#heading=h.t4vfmm4atqoy] -- This message was sent by Atlassian Jira (v8.3.4#803005)
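[Editor's note] The strategy described above can be caricatured in a few lines: when assigning a consumer subtask to an ExecutionSlotSharingGroup, prefer the group of one of its input producers, so that edge becomes a local data exchange. The toy sketch below is not Flink's actual interfaces (it ignores SlotSharingGroup boundaries and co-location constraints); all names are made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the "local input preferred" idea: a consumer joins the
// group of one of its producers when that producer already has a group,
// so the connecting edge becomes a local exchange; otherwise it opens
// a fresh group.
public class LocalInputPreferredGrouping {

    /** inputs.get(v) lists the producers of consumer v; topoOrder is producers-first. */
    static Map<String, Integer> assignGroups(List<String> topoOrder,
                                             Map<String, List<String>> inputs) {
        Map<String, Integer> group = new HashMap<>();
        int nextGroup = 0;
        for (String v : topoOrder) {
            Integer preferred = null;
            for (String producer : inputs.getOrDefault(v, new ArrayList<>())) {
                if (group.containsKey(producer)) {
                    preferred = group.get(producer); // reuse a producer's group
                    break;
                }
            }
            group.put(v, preferred != null ? preferred : nextGroup++);
        }
        return group;
    }
}
```

On a chain source -> map -> sink, all three subtasks end up in the same group, i.e. no remote data exchange.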
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166755#comment-17166755 ] 许 容浩 commented on FLINK-18663: -- Sent from my Huawei phone. Original message -- From: "Chesnay Schepler (Jira)" Date: Wednesday, July 29, 2020, 6:06 AM To: issues@flink.apache.org Subject: [jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit [ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166717#comment-17166717 ] Chesnay Schepler commented on FLINK-18663: -- Well there we go then, I was able to reproduce the issue. It is indeed due to the client closing the connection. -- This message was sent by Atlassian Jira (v8.3.4#803005) > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE because FlinkHttpObjectAggregator is null. > When a REST handler throws an exception, the following code runs: > {code:java} > private CompletableFuture handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > However, flinkHttpObjectAggregator is null in some cases, so this throws an NPE. > This method is called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > As a result, the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync never completes. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166717#comment-17166717 ] Chesnay Schepler commented on FLINK-18663: -- Well there we go then, I was able to reproduce the issue. It is indeed due to the client closing the connection. > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE because FlinkHttpObjectAggregator is null. > When a REST handler throws an exception, the following code runs: > {code:java} > private CompletableFuture handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > However, flinkHttpObjectAggregator is null in some cases, so this throws an NPE. > This method is called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > As a result, the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync never completes. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
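[Editor's note] A defensive fix along the lines the report implies is to tolerate the aggregator being absent from the pipeline and fall back to a default limit. The sketch below only illustrates that idea: the constants and the Supplier stand-in for `ctx.pipeline().get(FlinkHttpObjectAggregator.class)` are invented here, and this is not necessarily the patch that was merged for FLINK-18663:

```java
import java.util.function.Supplier;

// Null-safe variant of the lookup that NPEs in handleException above.
// The Supplier stands in for ctx.pipeline().get(FlinkHttpObjectAggregator.class),
// which can legitimately return null; both constants are hypothetical.
public class NullSafeMaxLength {
    static final int DEFAULT_MAX_CONTENT_LENGTH = 104_857_600; // hypothetical default
    static final int OTHER_RESP_PAYLOAD_OVERHEAD = 256;        // hypothetical overhead

    static int safeMaxLength(Supplier<Integer> aggregatorMaxContentLength) {
        Integer fromPipeline = aggregatorMaxContentLength.get();
        // Fall back to a default instead of dereferencing null.
        int max = (fromPipeline != null) ? fromPipeline : DEFAULT_MAX_CONTENT_LENGTH;
        return max - OTHER_RESP_PAYLOAD_OVERHEAD;
    }
}
```

With this shape, the error-response path still produces a truncated stack trace even when the request died before the aggregator was installed in the pipeline.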
[jira] [Commented] (FLINK-18695) Allow NettyBufferPool to allocate heap buffers
[ https://issues.apache.org/jira/browse/FLINK-18695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166697#comment-17166697 ] Stephan Ewen commented on FLINK-18695: -- My feeling is that this should be okay in Flink 1.12 (and 1.11) for the following reason: - Netty memory USED TO BE a significant chunk (especially on the receiver side), which made it non-trivial to reason about in the memory configuration - In the latest versions we directly write from (and read into) Flink memory buffers, so the memory that Netty itself allocates is minimal (headers, frame length decoders, encryption). These should not be very big (except possibly the encryption case), so there is not too much impact whether they are on heap or off heap. [~zjwang] What is your opinion? [~NicoK] Related, is there any way we can set up OpenSSL by default? It looks like the binaries are anyway in the Flink binary release. > Allow NettyBufferPool to allocate heap buffers > -- > > Key: FLINK-18695 > URL: https://issues.apache.org/jira/browse/FLINK-18695 > Project: Flink > Issue Type: Improvement > Components: Runtime / Network >Reporter: Chesnay Schepler >Priority: Major > Fix For: 1.12.0 > > > In 4.1.43 Netty made a change to its SslHandler to always use heap buffers > for JDK SSLEngine implementations, to avoid an additional memory copy. > However, our {{NettyBufferPool}} forbids heap buffer allocations. > We will either have to allow heap buffer allocations, or create a custom > SslHandler implementation that does not use heap buffers (although this seems > ill-advised?). > /cc [~sewen] [~uce] [~NicoK] [~zjwang] [~pnowojski] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16510) Task manager safeguard shutdown may not be reliable
[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1719#comment-1719 ] Maximilian Michels commented on FLINK-16510: Here is the requested information (without changing the JVM arguments): {{jstack -F -m 1}}: [^stack3-mixed.txt] (same error without -F) {{jstack -F 1}}: [^stack3.txt] Command: [^command.txt] Interestingly, the originally reported behavior of hanging in the shutdown hooks is not visible in the stack trace. Still, the problem is not reproducible if {{halt}} is called immediately on fatal errors, without running shutdown hooks. > Task manager safeguard shutdown may not be reliable > --- > > Key: FLINK-16510 > URL: https://issues.apache.org/jira/browse/FLINK-16510 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Reporter: Maximilian Michels >Assignee: Maximilian Michels >Priority: Major > Attachments: command.txt, stack2-1.txt, stack3-mixed.txt, stack3.txt > > > The {{JvmShutdownSafeguard}} does not always succeed but can hang when > multiple threads attempt to shut down the JVM.
Apparently mixing > {{System.exit()}} with ShutdownHooks and forcefully terminating the JVM via > {{Runtime.halt()}} does not play together well: > {noformat} > "Jvm Terminator" #22 daemon prio=5 os_prio=0 tid=0x7fb8e82f2800 > nid=0x5a96 runnable [0x7fb35cffb000] >java.lang.Thread.State: RUNNABLE > at java.lang.Shutdown.$$YJP$$halt0(Native Method) > at java.lang.Shutdown.halt0(Shutdown.java) > at java.lang.Shutdown.halt(Shutdown.java:139) > - locked <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Runtime.halt(Runtime.java:276) > at > org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run(JvmShutdownSafeguard.java:86) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - None > "FlinkCompletableFutureDelayScheduler-thread-1" #18154 daemon prio=5 > os_prio=0 tid=0x7fb708a7d000 nid=0x5a8a waiting for monitor entry > [0x7fb289d49000] >java.lang.Thread.State: BLOCKED (on object monitor) > at java.lang.Shutdown.halt(Shutdown.java:139) > - waiting to lock <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Shutdown.exit(Shutdown.java:213) > - locked <0x00047edb7348> (a java.lang.Class for java.lang.Shutdown) > at java.lang.Runtime.exit(Runtime.java:110) > at java.lang.System.exit(System.java:973) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.terminateJVM(TaskManagerRunner.java:266) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$onFatalError$1(TaskManagerRunner.java:260) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner$$Lambda$27464/1464672548.accept(Unknown > Source) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > 
org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:943) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:361) > at > org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$27435/159015392.run(Unknown > Source) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - <0x0006d5e56bd0> (a > java.util.concurrent.ThreadPoolExecutor$Worker) > {noformat} > Note that under this condition the JVM should terminate but it still hangs. > Sometimes it quits after several minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
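[Editor's note] The observation in the comment above, that the hang disappears when {{halt}} is called immediately without running shutdown hooks, can be sketched as a fatal-error handler that bypasses {{System.exit()}} entirely: {{Runtime.halt()}} skips shutdown hooks and so never contends on the internal Shutdown lock seen in the stack trace. The injectable terminator and the exit code below are invented seams for illustration and testability, not Flink's actual fix:

```java
import java.util.function.IntConsumer;

// Sketch of the mitigation discussed above: terminate via Runtime.halt()
// directly on fatal errors instead of System.exit(), so shutdown hooks are
// skipped and cannot deadlock against a concurrent halt. The IntConsumer
// seam exists only so the behavior can be exercised without killing the JVM.
public class FatalErrorTerminator {
    static final int FATAL_EXIT_CODE = -17; // hypothetical exit code

    private final IntConsumer terminator;

    FatalErrorTerminator(IntConsumer terminator) {
        this.terminator = terminator;
    }

    /** Production wiring: halt() runs no shutdown hooks and no finalizers. */
    static FatalErrorTerminator halting() {
        return new FatalErrorTerminator(code -> Runtime.getRuntime().halt(code));
    }

    void onFatalError(Throwable t) {
        System.err.println("Fatal error, halting the JVM: " + t.getMessage());
        terminator.accept(FATAL_EXIT_CODE);
    }
}
```

The trade-off is that halting forfeits any cleanup the hooks would have done (temp files, external deregistration), which is usually acceptable on a fatal error.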
[jira] [Updated] (FLINK-16510) Task manager safeguard shutdown may not be reliable
[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-16510: --- Attachment: stack3.txt > Task manager safeguard shutdown may not be reliable > --- > > Key: FLINK-16510 > URL: https://issues.apache.org/jira/browse/FLINK-16510 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Reporter: Maximilian Michels >Assignee: Maximilian Michels >Priority: Major > Attachments: command.txt, stack2-1.txt, stack3-mixed.txt, stack3.txt > > > The {{JvmShutdownSafeguard}} does not always succeed but can hang when > multiple threads attempt to shutdown the JVM. Apparently mixing > {{System.exit()}} with ShutdownHooks and forcefully terminating the JVM via > {{Runtime.halt()}} does not play together well: > {noformat} > "Jvm Terminator" #22 daemon prio=5 os_prio=0 tid=0x7fb8e82f2800 > nid=0x5a96 runnable [0x7fb35cffb000] >java.lang.Thread.State: RUNNABLE > at java.lang.Shutdown.$$YJP$$halt0(Native Method) > at java.lang.Shutdown.halt0(Shutdown.java) > at java.lang.Shutdown.halt(Shutdown.java:139) > - locked <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Runtime.halt(Runtime.java:276) > at > org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run(JvmShutdownSafeguard.java:86) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - None > "FlinkCompletableFutureDelayScheduler-thread-1" #18154 daemon prio=5 > os_prio=0 tid=0x7fb708a7d000 nid=0x5a8a waiting for monitor entry > [0x7fb289d49000] >java.lang.Thread.State: BLOCKED (on object monitor) > at java.lang.Shutdown.halt(Shutdown.java:139) > - waiting to lock <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Shutdown.exit(Shutdown.java:213) > - locked <0x00047edb7348> (a java.lang.Class for java.lang.Shutdown) > at java.lang.Runtime.exit(Runtime.java:110) > at java.lang.System.exit(System.java:973) > at > 
org.apache.flink.runtime.taskexecutor.TaskManagerRunner.terminateJVM(TaskManagerRunner.java:266) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$onFatalError$1(TaskManagerRunner.java:260) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner$$Lambda$27464/1464672548.accept(Unknown > Source) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:943) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:361) > at > org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$27435/159015392.run(Unknown > Source) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - <0x0006d5e56bd0> (a > java.util.concurrent.ThreadPoolExecutor$Worker) > {noformat} > Note that under this condition the JVM should terminate but it still hangs. > Sometimes it quits after several minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16510) Task manager safeguard shutdown may not be reliable
[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-16510: --- Attachment: command.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16510) Task manager safeguard shutdown may not be reliable
[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-16510: --- Attachment: stack3-mixed.txt -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166543#comment-17166543 ] Leonid Ilyevsky commented on FLINK-18733: - Problem solved. I noticed that your distribution includes opt/flink-shaded-zookeeper-3.5.6.jar, so I replaced lib/flink-shaded-zookeeper-3.4.14.jar with the 3.5.6 version, and now everything works. I guess you may want to make 3.5.6 (or newer) the default in your future releases. > Jobmanager cannot start in HA mode with Zookeeper > - > > Key: FLINK-18733 > URL: https://issues.apache.org/jira/browse/FLINK-18733 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.1 >Reporter: Leonid Ilyevsky >Priority: Major > Attachments: flink-conf.yaml, > flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, > flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log > > > When configured in HA mode, the Jobmanager cannot start at all. First, it > issues warnings like this: > {quote}{{2020-07-27 08:58:23,197 WARN > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > Session 0x0 for server *nj1dvloglab01.liquidnet.biz/:2181*, > unexpected error, closing socket connection and attempting reconnect}} > {{java.lang.IllegalArgumentException: *Unable to canonicalize address* > nj1dvloglab01.liquidnet.biz/:2181 because it's not resolvable}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) > [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {quote} > After a few attempts at connecting to Zookeeper, it finally fails: > {quote}2020-07-27 08:59:35,055 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. > org.apache.flink.util.FlinkException: Unhandled error in > ZooKeeperLeaderElectionService: Ensure path threw exception > at > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) > ~[flink-dist_2.12-1.11.1.jar:1.11.1] > {quote} > > The same HA configuration works fine for me in Flink 1.10.0. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166480#comment-17166480 ] Etienne Chauchot edited comment on FLINK-17073 at 7/28/20, 4:09 PM: [~roman_khachatryan] When [~SleePy] and I discussed in [the design doc|https://docs.google.com/document/d/1q0y0aWlJMoUWNW7jjsM8uWfHsy2dM6YmmcmhpQzgLMA/edit?usp=sharing], the idea was to wait until the last checkpoint was cleaned before accepting another (that is what we called making cleaning part of checkpoint processing). Thus, checking only the existing number of pending checkpoints was enough (no need for a new queue) to foresee a flood of checkpoints to clean. But the solution you propose (managing the queue of the checkpoints to clean and monitoring its size) seems even simpler to me: it avoids having to sync normal checkpointing and checkpoint cleaning. As you said, when we choose a checkpoint trigger request to execute (*CheckpointRequestDecider.chooseRequestToExecute*), we can drop new checkpoint requests when there are too many checkpoints to clean and thus regulate the whole checkpointing system. Syncing cleaning and checkpointing might not be necessary for this regulation, you're right. If you don't mind, I'll go for this implementation proposal in the design doc. [~roman_khachatryan] thanks anyway for the suggestions, and please take a look at the design doc where we will have the implementation discussions. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Etienne Chauchot >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool, which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
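The regulation discussed in the comment above — dropping new checkpoint trigger requests while too many completed checkpoints still await cleanup — can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the class and method names are invented here and do not match Flink's actual CheckpointRequestDecider API.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of back-pressuring checkpoint triggers on pending cleanups.
// All names are illustrative, not Flink's real API.
class CleanupAwareDecider {
    private final int maxPendingCleanups;
    private final AtomicInteger pendingCleanups = new AtomicInteger();

    CleanupAwareDecider(int maxPendingCleanups) {
        this.maxPendingCleanups = maxPendingCleanups;
    }

    // Called when a completed checkpoint is handed to the ioExecutor for discarding.
    void cleanupQueued() {
        pendingCleanups.incrementAndGet();
    }

    // Called when the ioExecutor finishes discarding one checkpoint.
    void cleanupFinished() {
        pendingCleanups.decrementAndGet();
    }

    // Decide whether a new trigger request may execute; dropping requests
    // while the cleanup queue is long regulates the whole checkpointing system.
    boolean shouldTriggerCheckpoint() {
        return pendingCleanups.get() < maxPendingCleanups;
    }
}
```

With maxPendingCleanups = 2, for example, a second still-queued cleanup would cause subsequent trigger requests to be dropped until one cleanup finishes, without any synchronization between cleaning and checkpointing.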
[jira] [Comment Edited] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166507#comment-17166507 ] Leonid Ilyevsky edited comment on FLINK-18733 at 7/28/20, 3:44 PM: --- Till, Just now I found this: https://issues.apache.org/jira/browse/ZOOKEEPER-3590 . So it does look like some problem was introduced in 3.4.14. I guess if you upgrade your Zookeeper client, it will be OK, or at least it will be possible to turn off that "canonicalize" thing. FYI: I use 3.6.1 for my Zookeeper cluster; it is a stable release now. My puzzle now is: how does it work for you? Maybe your network, DNS, etc. is set up differently compared to what I have at work. Question: would you know how I can independently test whether the host is resolvable in that sense? I mean, to test whether it can be "canonicalized"? Or should I configure the hosts differently? There are some other alias names that resolve to the same hosts; maybe I should use them? 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166507#comment-17166507 ] Leonid Ilyevsky commented on FLINK-18733: - Till, Just now I found this: https://issues.apache.org/jira/browse/ZOOKEEPER-3590 . So it does look like some problem was introduced in 3.4.14. I guess if you upgrade your Zookeeper client, it will be OK, or at least it will be possible to turn off that "canonicalize" thing. My puzzle now is: how does it work for you? Maybe your network, DNS, etc. is set up differently compared to what I have at work. Question: would you know how I can independently test whether the host is resolvable in that sense? I mean, to test whether it can be "canonicalized"? Or should I configure the hosts differently? There are some other alias names that resolve to the same hosts; maybe I should use them? -- This message was sent by Atlassian Jira (v8.3.4#803005)
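The question above — how to test independently whether a host can be "canonicalized" — can be probed with plain JDK calls: ZooKeeper 3.4.14's SaslServerPrincipal essentially resolves the address and asks for its canonical host name. The sketch below only mirrors that logic with java.net.InetAddress; it is not ZooKeeper's actual code, and the method name is invented.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Standalone probe for the forward/reverse resolution that ZooKeeper's
// SaslServerPrincipal relies on; mirrors the logic, not the actual code.
class CanonicalizeProbe {
    static String canonicalName(String host) {
        try {
            InetAddress addr = InetAddress.getByName(host);
            // getCanonicalHostName() falls back to the literal IP address
            // when the reverse DNS lookup fails, so compare the result
            // against what you expect for your host.
            return addr.getCanonicalHostName();
        } catch (UnknownHostException e) {
            return null; // forward resolution failed entirely
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " -> " + canonicalName(host));
    }
}
```

Running this with each alias that resolves to the same machine would show which names survive the round trip and could be used in the ZooKeeper connect string.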
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166501#comment-17166501 ] Roman Khachatryan commented on FLINK-17073: --- Thanks for your analysis [~echauchot]. Sure, go ahead! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16866) Make job submission non-blocking
[ https://issues.apache.org/jira/browse/FLINK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166490#comment-17166490 ] Robert Metzger commented on FLINK-16866: After an offline discussion with [~trohrmann], we are proposing to address this issue as follows: Changes to the Dispatcher - As part of the Dispatcher, we'll introduce a Job abstraction that tracks the final step of the job submission: the creation of the job manager. - The job submission REST handler will return immediately after triggering the creation of the job manager. - In this phase, the job will be in an INITIALIZING state. Once the job manager is started, the job is INITIALIZED. - Errors during the creation of the job manager are stored in the new job abstraction, probably in an ArchivedExecutionGraph. Failed submissions need to get evicted eventually. - Calls to get the job status or cancel the job will have to be adapted to this change (so that they return the INITIALIZING state, properly cancel the job manager creation, or fail the request, for example when triggering a savepoint on an initializing job). As a follow-up idea, we could improve the cancellation of the initialization by executing it in a thread controlled by the "Job" abstraction, so that we can interrupt the thread (cooperation is not guaranteed). Changes to other components: - The web UI will need to handle jobs in the INITIALIZING state differently: Initially the job shall be listed in the "Running" section, but it won't be clickable OR it will show an almost-empty page, explaining that it is still pending submission. Submission errors should be accessible in the UI. - The CliFrontend will keep its current semantics: After the submission has succeeded, it will periodically query the REST endpoint until the initialization is finished (or failed). - The ExecutionEnvironment.executeAsync() call will only return a JobClient once the job manager has been initialized. 
> Make job submission non-blocking > > > Key: FLINK-16866 > URL: https://issues.apache.org/jira/browse/FLINK-16866 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.2, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Critical > Fix For: 1.12.0 > > > Currently, Flink waits to acknowledge a job submission until the > corresponding {{JobManager}} has been created. Since its creation also > involves the creation of the {{ExecutionGraph}} and potential FS operations, > it can take a bit of time. If the user has configured a too low > {{web.timeout}}, the submission can time out only reporting a > {{TimeoutException}} to the user. > I propose to change the notion of job submission slightly. Instead of waiting > until the {{JobManager}} has been created, a job submission is complete once > all job relevant files have been uploaded to the {{Dispatcher}} and the > {{Dispatcher}} has been told about it. Creating the {{JobManager}} will then > belong to the actual job execution. Consequently, if problems occur while > creating the {{JobManager}} it will result into a job failure. -- This message was sent by Atlassian Jira (v8.3.4#803005)
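The proposed CliFrontend behavior above — submit, then poll the REST endpoint until initialization finishes or fails — could be sketched roughly as below. The JobStatus values and the status fetcher are stand-ins for illustration; beyond INITIALIZING, the state names and polling API are assumptions, not the implemented interface.

```java
import java.util.function.Supplier;

// Sketch of the proposed client-side wait: poll the job status until it
// leaves INITIALIZING. Status values and the fetcher are illustrative.
class SubmissionWaiter {
    enum JobStatus { INITIALIZING, RUNNING, FAILED }

    // statusFetcher stands in for a call to the REST endpoint.
    static JobStatus awaitInitialized(Supplier<JobStatus> statusFetcher,
                                      long pollIntervalMillis) throws InterruptedException {
        JobStatus status = statusFetcher.get();
        while (status == JobStatus.INITIALIZING) {
            Thread.sleep(pollIntervalMillis); // back off between REST calls
            status = statusFetcher.get();
        }
        return status; // RUNNING on success, FAILED if JobManager creation failed
    }
}
```

This keeps the submission call itself non-blocking while preserving the CLI's current "wait until the job is really up" semantics.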
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166485#comment-17166485 ] tartarus commented on FLINK-18663: -- [~chesnay] There is a separate monitoring service that requests /jobs/overview every 5 seconds, and the timeout is 5 seconds too. Then it will close the client. > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE caused by FlinkHttpObjectAggregator being null. > When REST throws an exception, it runs this code: > {code:java} > private CompletableFuture<Void> handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > But flinkHttpObjectAggregator is null in some cases, so this will throw an NPE. This method is called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > The result is that the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync does not complete. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
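A defensive fix for the NPE described above would guard the pipeline lookup and fall back to a default when the aggregator has already been removed from the pipeline (e.g. during handler shutdown). The sketch below shows only the guard; both constant values are invented for the sketch, and this is not the actual patch.

```java
// Sketch of guarding the lookup that NPEs when FlinkHttpObjectAggregator has
// already been removed from the channel pipeline. Both constant values are
// invented here; they do not reflect Flink's real configuration.
class MaxLengthGuard {
    static final int OTHER_RESP_PAYLOAD_OVERHEAD = 256;        // value invented
    static final int DEFAULT_MAX_CONTENT_LENGTH = 104_857_600; // value invented

    // Pass null when ctx.pipeline().get(FlinkHttpObjectAggregator.class)
    // returns nothing; previously that result was dereferenced unguarded.
    static int safeMaxErrorLength(Integer aggregatorMaxContentLength) {
        int maxContentLength = aggregatorMaxContentLength != null
                ? aggregatorMaxContentLength
                : DEFAULT_MAX_CONTENT_LENGTH;
        return maxContentLength - OTHER_RESP_PAYLOAD_OVERHEAD;
    }
}
```

With such a guard, handleException can still build and send a truncated error response even while the pipeline is being torn down, so the returned future completes and the InFlightRequestTracker can be cleared.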
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166483#comment-17166483 ] Till Rohrmann commented on FLINK-18663: --- [~tartarus] what I meant is to check whether {{AbstractHandler.terminationFuture}} is not {{null}}. If this is the case, then the handler is being shut down. I agree with Chesnay that we might have to look into other explanations for the described problem. > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE caused by FlinkHttpObjectAggregator being null. > When the REST handler throws an exception, it runs this code: > {code:java} > private CompletableFuture<Void> handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > 
log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > but flinkHttpObjectAggregator is null in some cases, so this throws an NPE. This method is > called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > The result is that the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync does not complete. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
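The NPE above comes from {{ctx.pipeline().get(FlinkHttpObjectAggregator.class)}} returning null once the channel has been torn down. The following is only a sketch of the defensive direction discussed in this thread, not the actual Flink fix; the names {{maxErrorPayloadLength}} and {{DEFAULT_MAX_CONTENT_LENGTH}} are illustrative assumptions, and Netty types are deliberately left out so the snippet is self-contained:

```java
// Hypothetical sketch (not the actual Flink fix): tolerate the aggregator
// having already been removed from the Netty pipeline instead of
// dereferencing it unconditionally.
public class ErrorResponseLength {
    static final int OTHER_RESP_PAYLOAD_OVERHEAD = 256;
    // Illustrative fallback limit used when the pipeline is already torn down.
    static final int DEFAULT_MAX_CONTENT_LENGTH = 104857600;

    /**
     * aggregatorMaxContentLength is null in the situation that caused the NPE:
     * the FlinkHttpObjectAggregator is no longer present in the pipeline.
     */
    static int maxErrorPayloadLength(Integer aggregatorMaxContentLength) {
        int max = (aggregatorMaxContentLength != null)
                ? aggregatorMaxContentLength
                : DEFAULT_MAX_CONTENT_LENGTH;
        return max - OTHER_RESP_PAYLOAD_OVERHEAD;
    }

    public static void main(String[] args) {
        // Normal case: aggregator present in the pipeline.
        System.out.println(maxErrorPayloadLength(1024));
        // Pipeline already cleaned up: no NPE, fall back to the default.
        System.out.println(maxErrorPayloadLength(null));
    }
}
```

With a guard like this, a late client disconnect would still produce an error response and, crucially, still complete the future that the InFlightRequestTracker waits on.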
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166480#comment-17166480 ] Etienne Chauchot commented on FLINK-17073: -- [~roman_khachatryan] When [~SleePy] and I discussed in [the design doc|https://docs.google.com/document/d/1q0y0aWlJMoUWNW7jjsM8uWfHsy2dM6YmmcmhpQzgLMA/edit?usp=sharing], the idea was to wait until the last checkpoint was cleaned before accepting another (that is what we called making cleaning part of checkpoint processing). Thus, checking only the existing number of pending checkpoints was enough (no need for a new queue) to foresee a flood of checkpoints to clean. But the solution you propose (managing the queue of checkpoints to clean and monitoring its size) seems even simpler to me: it avoids having to sync normal checkpointing and checkpoint cleaning. As you said, when we choose a checkpoint trigger request to execute (*CheckpointRequestDecider.chooseRequestToExecute*), we can drop new checkpoint requests when there are too many checkpoints to clean and thus regulate the whole checkpointing system. Syncing cleaning and checkpointing might not be necessary for this regulation, you're right. If you don't mind, I'll go for this implementation proposal in the design doc. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Etienne Chauchot >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue occupying memory. Ultimately, the JM > process dies with an OOM. 
> Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool, which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
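The pool-sizing difference described above can be illustrated with a self-contained sketch (plain {{java.util.concurrent}}, not Flink code; the class and method names are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

public class CleanupExecutorComparison {
    /** Pre-FLINK-11851 setup: a ForkJoinPool with a max parallelism of 64,
     *  so up to 64 checkpoint-discard tasks could run concurrently. */
    static int forkJoinParallelism() {
        ForkJoinPool pool = new ForkJoinPool(64);
        int p = pool.getParallelism();
        pool.shutdown();
        return p;
    }

    /** Post-FLINK-11851 ioExecutor: a fixed pool with one thread per CPU
     *  core. On a typical JobManager host this is far fewer than 64, so
     *  discard tasks queue up in memory instead of running, which is
     *  consistent with the reported OOM. */
    static int ioExecutorThreads() {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        pool.shutdown();
        return cores;
    }

    public static void main(String[] args) {
        System.out.println("old ForkJoinPool parallelism: " + forkJoinParallelism());
        System.out.println("new ioExecutor threads: " + ioExecutorThreads());
    }
}
```

The sketch only shows the concurrency bound; whether the smaller pool is actually the cause is the suspicion the ticket says still needs validating.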
[jira] [Comment Edited] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166478#comment-17166478 ] Till Rohrmann edited comment on FLINK-18733 at 7/28/20, 2:56 PM: - I tested the scenario with running a ZooKeeper cluster on the same machine where I also started Flink (but not the same process). I did not configure SASL and it worked when using a resolvable name. When configuring an unresolvable name, I saw the exact same stack traces as you did (including the SASL part). Hence, I would assume that ZooKeeper/Curator also used this code path when running the test with a resolvable name. Have you checked whether there is a ZooKeeper issue for a similar problem? was (Author: till.rohrmann): I tested the scenario with running a ZooKeeper cluster on the same machine where I also started Flink (but not the same process). I did not configure SASL and it worked when using a resolvable name. When configuring an unresolvable name, I saw the exact same stack traces as you did (including the SASL part). Hence, I would assume that ZooKeeper/Curator also used this code path when running the test with a resolvable name. > Jobmanager cannot start in HA mode with Zookeeper > - > > Key: FLINK-18733 > URL: https://issues.apache.org/jira/browse/FLINK-18733 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.1 >Reporter: Leonid Ilyevsky >Priority: Major > Attachments: flink-conf.yaml, > flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, > flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log > > > When configured in HA mode, the Jobmanager cannot start at all. 
First, it > issues warnings like this: > {quote}{{2020-07-27 08:58:23,197 WARN > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > Session 0x0 for server *nj1dvloglab01.liquidnet.biz/:2181*, > unexpected error, closing socket connection and attempting reconnect}} > {{java.lang.IllegalArgumentException: *Unable to canonicalize address* > nj1dvloglab01.liquidnet.biz/:2181 because it's not resolvable}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) > [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {quote} > After few attempts connecting to Zookeeper, it finally fails: > {quote}2020-07-27 08:59:35,055 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. > org.apache.flink.util.FlinkException: Unhandled error in > ZooKeeperLeaderElectionService: Ensure path threw exception > at > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) > ~[flink-dist_2.12-1.11.1.jar:1.11.1] > {quote} > > The same HA configuration works fine for me in Flink 1.10.0. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166478#comment-17166478 ] Till Rohrmann commented on FLINK-18733: --- I tested the scenario with running a ZooKeeper cluster on the same machine where I also started Flink (but not the same process). I did not configure SASL and it worked when using a resolvable name. When configuring an unresolvable name, I saw the exact same stack traces as you did (including the SASL part). Hence, I would assume that ZooKeeper/Curator also used this code path when running the test with a resolvable name. > Jobmanager cannot start in HA mode with Zookeeper > - > > Key: FLINK-18733 > URL: https://issues.apache.org/jira/browse/FLINK-18733 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.1 >Reporter: Leonid Ilyevsky >Priority: Major > Attachments: flink-conf.yaml, > flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, > flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log > > > When configured in HA mode, the Jobmanager cannot start at all. 
First, it > issues warnings like this: > {quote}{{2020-07-27 08:58:23,197 WARN > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > Session 0x0 for server *nj1dvloglab01.liquidnet.biz/:2181*, > unexpected error, closing socket connection and attempting reconnect}} > {{java.lang.IllegalArgumentException: *Unable to canonicalize address* > nj1dvloglab01.liquidnet.biz/:2181 because it's not resolvable}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) > [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {quote} > After few attempts connecting to Zookeeper, it finally fails: > {quote}2020-07-27 08:59:35,055 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. > org.apache.flink.util.FlinkException: Unhandled error in > ZooKeeperLeaderElectionService: Ensure path threw exception > at > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) > ~[flink-dist_2.12-1.11.1.jar:1.11.1] > {quote} > > The same HA configuration works fine for me in Flink 1.10.0. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166467#comment-17166467 ] Chesnay Schepler commented on FLINK-18663: -- So far the only explanation that I could find for a pipeline returning null is that the channel was already closed. Our current assumption was that this happened because the RestServerEndpoint is shutting down. From the logs you gave us it appears that the shutdown is initiated 2 minutes after the NPE; this doesn't seem to match our assumption. I'm wondering what would happen if the netty connection were to be closed by the client. We know that the request processing is delayed by 10 seconds; if the client aborts the connection in between, maybe netty starts cleaning up the pipeline. > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE caused by FlinkHttpObjectAggregator being null. > When the REST handler throws an exception, it runs this code: > {code:java} > private CompletableFuture<Void> handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if 
(log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > but flinkHttpObjectAggregator is null in some cases, so this throws an NPE. This method is > called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > The result is that the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync does not complete. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-16866) Make job submission non-blocking
[ https://issues.apache.org/jira/browse/FLINK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann reassigned FLINK-16866: - Assignee: Till Rohrmann (was: Robert Metzger) > Make job submission non-blocking > > > Key: FLINK-16866 > URL: https://issues.apache.org/jira/browse/FLINK-16866 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.2, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Critical > Fix For: 1.12.0 > > > Currently, Flink waits to acknowledge a job submission until the > corresponding {{JobManager}} has been created. Since its creation also > involves the creation of the {{ExecutionGraph}} and potential FS operations, > it can take a bit of time. If the user has configured a too low > {{web.timeout}}, the submission can time out, only reporting a > {{TimeoutException}} to the user. > I propose to change the notion of job submission slightly. Instead of waiting > until the {{JobManager}} has been created, a job submission is complete once > all job-relevant files have been uploaded to the {{Dispatcher}} and the > {{Dispatcher}} has been told about it. Creating the {{JobManager}} will then > belong to the actual job execution. Consequently, if problems occur while > creating the {{JobManager}} it will result in a job failure. -- This message was sent by Atlassian Jira (v8.3.4#803005)
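The proposed submission flow above can be sketched in a few lines of plain Java (illustrative names, not Flink's API): acknowledge as soon as the job files are persisted, and run the potentially slow JobManager creation afterwards, so a failure there surfaces as a job failure rather than a submission timeout.

```java
import java.util.concurrent.CompletableFuture;

public class NonBlockingSubmission {
    enum Ack { SUBMITTED }

    /** Sketch: the acknowledgement no longer waits on JobManager creation. */
    static CompletableFuture<Ack> submit(Runnable persistJobFiles,
                                         Runnable createJobManager) {
        // Cheap, synchronous part: upload/persist the job-relevant files.
        persistJobFiles.run();
        // Slow part (ExecutionGraph creation, FS operations) runs
        // asynchronously; its outcome no longer delays the acknowledgement,
        // and a failure here would be reported as a job failure.
        CompletableFuture.runAsync(createJobManager);
        return CompletableFuture.completedFuture(Ack.SUBMITTED);
    }

    public static void main(String[] args) {
        Ack ack = submit(() -> {}, () -> {}).join();
        System.out.println("submission acknowledged: " + ack);
    }
}
```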
[jira] [Commented] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166458#comment-17166458 ] Leonid Ilyevsky commented on FLINK-18733: - Hi [~trohrmann] The addresses I am using are perfectly resolvable, if you are talking about IP address resolution on the Linux level. In fact, it is the same set of machines. I am running the Flink cluster on the same three machines where I am running my Zookeeper cluster. The critical point is that when I reverted back to version 1.10.0, the problem disappeared. I don't think this problem has anything to do with the Linux hosts being resolved or not. As you can see in the error message, it happens inside a routine related to SASL, which I am not using and don't need. You said it works when you use Flink's ZooKeeper support locally. What exactly is it? A Zookeeper running inside Flink? Then it fails when you configured it with an "unresolvable {{high-availability.zookeeper.quorum}} address". Did you actually use unresolvable hosts, so you could not even ping them? Obviously such a test would fail, no doubt. Could you please perform the test closer to what I am doing? Run a simple Zookeeper cluster on the same machines where you run Flink. I actually found the code where the exception is thrown: [https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/SaslServerPrincipal.java] . I guess this is not the exact version that you are using, so the line numbers might differ. The first thing I noticed: the comment says "Get the name of the server principal for a SASL client. This is visible for *testing purposes*". So this is supposed to be called only during tests? Not sure what that means. Then, inside the getServerPrincipal method, it retrieves the "canonicalize" flag, and apparently it got the value "true". Maybe this is the source of the issue? Maybe in Flink 1.10.0 it was "false" and there was no problem? 
I hope there is some workaround, like setting a system property to make that flag false. Thanks, Leonid > Jobmanager cannot start in HA mode with Zookeeper > - > > Key: FLINK-18733 > URL: https://issues.apache.org/jira/browse/FLINK-18733 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.1 >Reporter: Leonid Ilyevsky >Priority: Major > Attachments: flink-conf.yaml, > flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, > flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log > > > When configured in HA mode, the Jobmanager cannot start at all. First, it > issues warnings like this: > {quote}{{2020-07-27 08:58:23,197 WARN > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > Session 0x0 for server *nj1dvloglab01.liquidnet.biz/:2181*, > unexpected error, closing socket connection and attempting reconnect}} > {{java.lang.IllegalArgumentException: *Unable to canonicalize address* > nj1dvloglab01.liquidnet.biz/:2181 because it's not resolvable}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) > [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {quote} > After a few attempts connecting to Zookeeper, it finally fails: > {quote}2020-07-27 08:59:35,055 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. 
> org.apache.flink.util.FlinkException: Unhandled error in > ZooKeeperLeaderElectionService: Ensure path threw exception > at > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) > ~[flink-dist_2.12-1.11.1.jar:1.11.1] > {quote} > > The same HA configuration works fine for me in Flink 1.10.0. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18627) Get unmatched filter method records to side output
[ https://issues.apache.org/jira/browse/FLINK-18627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166445#comment-17166445 ] Aljoscha Krettek commented on FLINK-18627: -- That will not produce the desired output; {{stream.getSideOutput()}} will only get the side output from the last operation. That's what I was hinting at above: I don't know how we can provide an ergonomic API for the pattern of chaining multiple filters. > Get unmatched filter method records to side output > > > Key: FLINK-18627 > URL: https://issues.apache.org/jira/browse/FLINK-18627 > Project: Flink > Issue Type: Improvement > Components: API / DataStream >Reporter: Roey Shem Tov >Priority: Major > Fix For: 1.12.0 > > > Unmatched records from filter functions should somehow be sent to a side output. > Example: > > {code:java} > datastream > .filter(i->i%2==0) > .sideOutput(oddNumbersSideOutput); > {code} > > > That way we can filter multiple times and send the filtered-out records to our > side output instead of dropping them immediately; it can be useful in many ways. > > What do you think? -- This message was sent by Atlassian Jira (v8.3.4#803005)
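In Flink itself this pattern would be a {{ProcessFunction}} with an {{OutputTag}}. As a library-free sketch of the semantics being requested (class and method names are illustrative), a filter step can return both the matched stream and the "side output" of unmatched records, so each stage keeps its own rejects:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterWithSideOutput {
    /** Partition the input once: elements matching the predicate continue
     *  downstream (key true); unmatched ones land in the side output
     *  (key false) instead of being dropped. */
    static <T> Map<Boolean, List<T>> filterWithSide(List<T> input,
                                                    Predicate<T> keep) {
        return input.stream().collect(Collectors.partitioningBy(keep));
    }

    public static void main(String[] args) {
        Map<Boolean, List<Integer>> parts =
                filterWithSide(List.of(1, 2, 3, 4, 5), i -> i % 2 == 0);
        System.out.println("downstream: " + parts.get(true));   // even numbers
        System.out.println("side output: " + parts.get(false)); // odd numbers
    }
}
```

Aljoscha's point still stands for the Flink API: with chained operators, {{getSideOutput()}} on the final stream only exposes the last operator's tag, so each filter stage would need its own handle, which the explicit return value above makes visible.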
[jira] [Updated] (FLINK-18744) resume from modified savepoint directory: No such file or directory
[ https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao wang updated FLINK-18744: - Description: If I resume a job from a savepoint which was modified by the state processor API, such as loading from /savepoint-path-old and writing to /savepoint-path-new, the job resumes with savepointpath = /savepoint-path-new while throwing an Exception: _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. I think it's an issue because Flink 1.11 uses absolute paths in savepoints and checkpoints, but the state processor API missed this. The job will work well with the new savepoint (whose path is /savepoint-path-new) if I copy all directories except `_metadata` from /savepoint-path-old to /savepoint-path-new. was: If I resume a job from a savepoint which was modified by the state processor API, such as loading from /savepoint-path-old and writing to /savepoint-path-new, the job resumes with savepointpath = /savepoint-path-new while throwing an Exception: _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. I think it's an issue because Flink 1.11 uses absolute paths in savepoints and checkpoints, but the state processor API missed this. > resume from modified savepoint directory: No such file or directory > - > > Key: FLINK-18744 > URL: https://issues.apache.org/jira/browse/FLINK-18744 > Project: Flink > Issue Type: Bug > Components: API / State Processor >Affects Versions: 1.11.0 >Reporter: tao wang >Priority: Major > > If I resume a job from a savepoint which was modified by the state processor API, > such as loading from /savepoint-path-old and writing to /savepoint-path-new, > the job resumes with savepointpath = /savepoint-path-new while throwing an > Exception: > _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. > I think it's an issue because Flink 1.11 uses absolute paths in savepoints > and checkpoints, but the state processor API missed this. 
> The job will work well with the new savepoint (whose path is /savepoint-path-new) > if I copy all directories except `_metadata` from /savepoint-path-old to > /savepoint-path-new. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18744) resume from modified savepoint directory: No such file or directory
[ https://issues.apache.org/jira/browse/FLINK-18744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tao wang updated FLINK-18744: - Affects Version/s: (was: 1.11.1) 1.11.0 > resume from modified savepoint directory: No such file or directory > - > > Key: FLINK-18744 > URL: https://issues.apache.org/jira/browse/FLINK-18744 > Project: Flink > Issue Type: Bug > Components: API / State Processor >Affects Versions: 1.11.0 >Reporter: tao wang >Priority: Major > > If I resume a job from a savepoint which was modified by the state processor API, > such as loading from /savepoint-path-old and writing to /savepoint-path-new, > the job resumes with savepointpath = /savepoint-path-new while throwing an > Exception: > _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. > I think it's an issue because Flink 1.11 uses absolute paths in savepoints > and checkpoints, but the state processor API missed this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18744) resume from modified savepoint directory: No such file or directory
tao wang created FLINK-18744: Summary: resume from modified savepoint directory: No such file or directory Key: FLINK-18744 URL: https://issues.apache.org/jira/browse/FLINK-18744 Project: Flink Issue Type: Bug Components: API / State Processor Affects Versions: 1.11.1 Reporter: tao wang If I resume a job from a savepoint which was modified by the state processor API, such as loading from /savepoint-path-old and writing to /savepoint-path-new, the job resumes with savepointpath = /savepoint-path-new while throwing an Exception: _*/savepoint-path-new/\{some-ui-id} (No such file or directory)*_. I think it's an issue because Flink 1.11 uses absolute paths in savepoints and checkpoints, but the state processor API missed this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-18281) Add WindowStagger into all Tumbling and Sliding Windows
[ https://issues.apache.org/jira/browse/FLINK-18281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aljoscha Krettek closed FLINK-18281. Fix Version/s: 1.12.0 Assignee: Teng Hu Resolution: Fixed master: 335c47e11478358e8514e63ca807ea765ed9dd9a > Add WindowStagger into all Tumbling and Sliding Windows > --- > > Key: FLINK-18281 > URL: https://issues.apache.org/jira/browse/FLINK-18281 > Project: Flink > Issue Type: New Feature > Components: API / DataStream >Reporter: Teng Hu >Assignee: Teng Hu >Priority: Major > Labels: pull-request-available > Fix For: 1.12.0 > > > Adding the window staggering functionality into *TumblingEventTimeWindows*, > *SlidingProcessingTimeWindows* and *SlidingEventTimeWindows*. > This is a follow-up issue of > [FLINK-12855|https://issues.apache.org/jira/browse/FLINK-12855] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18663) Fix Flink On YARN AM not exit
[ https://issues.apache.org/jira/browse/FLINK-18663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166433#comment-17166433 ] tartarus commented on FLINK-18663: -- [~trohrmann] I may not have fully understood you before, but I'm sure {{AbstractHandler.terminationFuture}} has not completed, because I took a JVM dump. I think [~chesnay] is right; this job has 15000 tasks and GCs frequently, so the TimeoutException is possible. It is still not clear to me why {{FlinkHttpObjectAggregator}} was null. [~trohrmann] [~chesnay] Do you have any suggestions on this issue? > Fix Flink On YARN AM not exit > - > > Key: FLINK-18663 > URL: https://issues.apache.org/jira/browse/FLINK-18663 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.10.0, 1.10.1, 1.11.0 >Reporter: tartarus >Assignee: tartarus >Priority: Critical > Labels: pull-request-available > Attachments: 110.png, 111.png, > C49A7310-F932-451B-A203-6D17F3140C0D.png, > e18e00dd6664485c2ff55284fe969474.png, jobmanager.log.noyarn.tar.gz > > > AbstractHandler throws an NPE caused by FlinkHttpObjectAggregator being null. > When the REST handler throws an exception, it runs this code: > {code:java} > private CompletableFuture<Void> handleException(Throwable throwable, > ChannelHandlerContext ctx, HttpRequest httpRequest) { > FlinkHttpObjectAggregator flinkHttpObjectAggregator = > ctx.pipeline().get(FlinkHttpObjectAggregator.class); > int maxLength = flinkHttpObjectAggregator.maxContentLength() - > OTHER_RESP_PAYLOAD_OVERHEAD; > if (throwable instanceof RestHandlerException) { > RestHandlerException rhe = (RestHandlerException) throwable; > String stackTrace = ExceptionUtils.stringifyException(rhe); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > if (log.isDebugEnabled()) { > log.error("Exception occurred in REST handler.", rhe); > } else { > log.error("Exception occurred in REST handler: {}", > rhe.getMessage()); > } > return HandlerUtils.sendErrorResponse( > 
ctx, > httpRequest, > new ErrorResponseBody(truncatedStackTrace), > rhe.getHttpResponseStatus(), > responseHeaders); > } else { > log.error("Unhandled exception.", throwable); > String stackTrace = String.format("<Exception on server side:%n%s%nEnd of exception on server side>", > ExceptionUtils.stringifyException(throwable)); > String truncatedStackTrace = Ascii.truncate(stackTrace, > maxLength, "..."); > return HandlerUtils.sendErrorResponse( > ctx, > httpRequest, > new ErrorResponseBody(Arrays.asList("Internal server > error.", truncatedStackTrace)), > HttpResponseStatus.INTERNAL_SERVER_ERROR, > responseHeaders); > } > } > {code} > but flinkHttpObjectAggregator is null in some cases, so this throws an NPE. This method is > called by AbstractHandler#respondAsLeader: > {code:java} > requestProcessingFuture > .whenComplete((Void ignored, Throwable throwable) -> { > if (throwable != null) { > > handleException(ExceptionUtils.stripCompletionException(throwable), ctx, > httpRequest) > .whenComplete((Void ignored2, Throwable > throwable2) -> finalizeRequestProcessing(finalUploadedFiles)); > } else { > finalizeRequestProcessing(finalUploadedFiles); > } > }); > {code} > The result is that the InFlightRequestTracker cannot be cleared, > so the CompletableFuture returned by the handler's closeAsync does not complete. > !C49A7310-F932-451B-A203-6D17F3140C0D.png! > !e18e00dd6664485c2ff55284fe969474.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18733) Jobmanager cannot start in HA mode with Zookeeper
[ https://issues.apache.org/jira/browse/FLINK-18733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166429#comment-17166429 ] Till Rohrmann commented on FLINK-18733: --- Hi [~lilyevsky], I tried Flink's ZooKeeper support locally. At first it worked w/o problems. However, once I've configured the system with an unresolvable {{high-availability.zookeeper.quorum}} address, I could reproduce what you are describing. Hence, could you verify that the node on which you start the Flink processes can actually resolve any of {{nj1dvloglab01.liquidnet.biz:2181,nj1dvloglab02.liquidnet.biz:2181,nj1dvloglab03.liquidnet.biz:2181}}? If the system cannot resolve the address name, then the ZooKeeper client is not able to connect to the ZooKeeper quorum. > Jobmanager cannot start in HA mode with Zookeeper > - > > Key: FLINK-18733 > URL: https://issues.apache.org/jira/browse/FLINK-18733 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.1 >Reporter: Leonid Ilyevsky >Priority: Major > Attachments: flink-conf.yaml, > flink-liquidnt-standalonesession-0-nj1dvloglab01.liquidnet.biz.log, > flink-liquidnt-taskexecutor-0-nj1dvloglab01.liquidnet.biz.log > > > When configured in HA mode, the Jobmanager cannot start at all. 
First, it > issues warnings like this: > {quote}{{2020-07-27 08:58:23,197 WARN > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - > Session 0x0 for server *nj1dvloglab01.liquidnet.biz/:2181*, > unexpected error, closing socket connection and attempting reconnect}} > {{java.lang.IllegalArgumentException: *Unable to canonicalize address* > nj1dvloglab01.liquidnet.biz/:2181 because it's not resolvable}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:65) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.SaslServerPrincipal.getServerPrincipal(SaslServerPrincipal.java:41) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1001) > ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {{ at > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1060) > [flink-shaded-zookeeper-3.4.14.jar:3.4.14-11.0]}} > {quote} > After few attempts connecting to Zookeeper, it finally fails: > {quote}2020-07-27 08:59:35,055 ERROR > org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error > occurred in the cluster entrypoint. > org.apache.flink.util.FlinkException: Unhandled error in > ZooKeeperLeaderElectionService: Ensure path threw exception > at > org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:430) > ~[flink-dist_2.12-1.11.1.jar:1.11.1] > {quote} > > The same HA configuration works fine for me in Flink 1.10.0. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (FLINK-18743) Modify English link to Chinese link
[ https://issues.apache.org/jira/browse/FLINK-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weizheng reopened FLINK-18743: -- > Modify English link to Chinese link > --- > > Key: FLINK-18743 > URL: https://issues.apache.org/jira/browse/FLINK-18743 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: weizheng >Priority: Major > > In > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/concepts/stateful-stream-processing.html,] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/event_timestamp_extractors.html] > Modify English link to Chinese link > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (FLINK-18743) Modify English link to Chinese link
[ https://issues.apache.org/jira/browse/FLINK-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weizheng resolved FLINK-18743. -- Resolution: Fixed https://github.com/apache/flink/pull/13003 > Modify English link to Chinese link > --- > > Key: FLINK-18743 > URL: https://issues.apache.org/jira/browse/FLINK-18743 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: weizheng >Priority: Major > > In > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/concepts/stateful-stream-processing.html,] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/event_timestamp_extractors.html] > Modify English link to Chinese link > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (FLINK-18743) Modify English link to Chinese link
[ https://issues.apache.org/jira/browse/FLINK-18743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weizheng updated FLINK-18743: - Comment: was deleted (was: https://github.com/apache/flink/pull/13003) > Modify English link to Chinese link > --- > > Key: FLINK-18743 > URL: https://issues.apache.org/jira/browse/FLINK-18743 > Project: Flink > Issue Type: Improvement > Components: Documentation >Reporter: weizheng >Priority: Major > > In > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/concepts/stateful-stream-processing.html,] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html] > > [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/event_timestamp_extractors.html] > Modify English link to Chinese link > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166396#comment-17166396 ] wgcn commented on FLINK-18715: -- [~trohrmann] it's not suitable for deployment scenarios where we don't have CPU isolation > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
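The approach sketched in the issue description — sampling {{getProcessCpuTime}} against the JVM uptime over an interval — can be expressed as follows. This is a minimal illustration, not Flink's metric implementation; it relies on the {{com.sun.management.OperatingSystemMXBean}} cast (available on HotSpot JVMs, where {{getProcessCpuTime}} may return -1 if unsupported):

```java
import java.lang.management.ManagementFactory;

public class ProcessCpuUsage {

    private static final com.sun.management.OperatingSystemMXBean OS =
            (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    private long lastCpuTimeNanos = OS.getProcessCpuTime();
    private long lastUptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();

    // CPU usage of this JVM since the previous call, as a fraction of ONE core.
    // As the ticket suggests, divide by the container's core count to get a 0..1 ratio.
    public double sample() {
        long cpuNanos = OS.getProcessCpuTime();
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        long cpuDelta = cpuNanos - lastCpuTimeNanos;                    // CPU time burned, ns
        long wallDelta = (uptimeMillis - lastUptimeMillis) * 1_000_000L; // wall time, ns
        lastCpuTimeNanos = cpuNanos;
        lastUptimeMillis = uptimeMillis;
        if (cpuDelta < 0 || wallDelta <= 0) {
            return 0.0; // unsupported platform or zero elapsed time
        }
        return (double) cpuDelta / wallDelta;
    }
}
```

Calling {{sample()}} periodically (e.g. from a metric gauge) yields values such as 0.8 on a single core, or up to N on an N-core host.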
[jira] [Commented] (FLINK-18742) Configuration args do not take effect at client
[ https://issues.apache.org/jira/browse/FLINK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166391#comment-17166391 ] Matt Wang commented on FLINK-18742: --- I have fixed this in our company's internal version; could you assign this issue to me? > Configuration args do not take effect at client > --- > > Key: FLINK-18742 > URL: https://issues.apache.org/jira/browse/FLINK-18742 > Project: Flink > Issue Type: Improvement > Components: Command Line Client >Affects Versions: 1.11.1 >Reporter: Matt Wang >Priority: Major > > The configuration args from command line will not work at client, for > example, the job sets the {color:#505f79}_classloader.resolve-order_{color} > to _{color:#505f79}parent-first,{color}_ it can work at TaskManager, but > Client doesn't. The *FlinkUserCodeClassLoaders* will be created before calling the method of > _{color:#505f79}getEffectiveConfiguration(){color}_ at > {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the > _{color:#505f79}Configuration{color}_ used by > _{color:#505f79}PackagedProgram{color}_ does not include Configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18743) Modify English link to Chinese link
weizheng created FLINK-18743: Summary: Modify English link to Chinese link Key: FLINK-18743 URL: https://issues.apache.org/jira/browse/FLINK-18743 Project: Flink Issue Type: Improvement Components: Documentation Reporter: weizheng In [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/concepts/stateful-stream-processing.html,] [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/cluster_execution.html] [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html,|https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/datastream_api.html] [https://ci.apache.org/projects/flink/flink-docs-release-1.11/zh/dev/event_timestamp_extractors.html] Modify English link to Chinese link -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18742) Configuration args do not take effect at client
[ https://issues.apache.org/jira/browse/FLINK-18742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Wang updated FLINK-18742: -- Description: The configuration args from command line will not work at client, for example, the job sets the {color:#505f79}_classloader.resolve-order_{color} to _{color:#505f79}parent-first,{color}_ it can work at TaskManager, but Client doesn't. The *FlinkUserCodeClassLoaders* will be created before calling the method of _{color:#505f79}getEffectiveConfiguration(){color}_ at {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the _{color:#505f79}Configuration{color}_ used by _{color:#505f79}PackagedProgram{color}_ does not include Configuration args. was: The configuration args from command line will not work at client, for example, the job sets the `classloader.resolve-order` to `parent-first`, it can work at TaskManager, but Client doesn't. The `FlinkUserCodeClassLoaders` will be created before calling the method of `getEffectiveConfiguration()` at `org.apache.flink.client.cli.CliFrontend`, so the `Configuration` used by `PackagedProgram` does not include Configuration args. > Configuration args do not take effect at client > --- > > Key: FLINK-18742 > URL: https://issues.apache.org/jira/browse/FLINK-18742 > Project: Flink > Issue Type: Improvement > Components: Command Line Client >Affects Versions: 1.11.1 >Reporter: Matt Wang >Priority: Major > > The configuration args from command line will not work at client, for > example, the job sets the {color:#505f79}_classloader.resolve-order_{color} > to _{color:#505f79}parent-first,{color}_ it can work at TaskManager, but > Client doesn't. 
> The *FlinkUserCodeClassLoaders* will be created before calling the method of > _{color:#505f79}getEffectiveConfiguration(){color}_ at > {color:#505f79}org.apache.flink.client.cli.CliFrontend{color}, so the > _{color:#505f79}Configuration{color}_ used by > _{color:#505f79}PackagedProgram{color}_ does not include Configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18742) Configuration args do not take effect at client
Matt Wang created FLINK-18742: - Summary: Configuration args do not take effect at client Key: FLINK-18742 URL: https://issues.apache.org/jira/browse/FLINK-18742 Project: Flink Issue Type: Improvement Components: Command Line Client Affects Versions: 1.11.1 Reporter: Matt Wang The configuration args from command line will not work at client, for example, the job sets the `classloader.resolve-order` to `parent-first`, it can work at TaskManager, but Client doesn't. The `FlinkUserCodeClassLoaders` will be created before calling the method of `getEffectiveConfiguration()` at `org.apache.flink.client.cli.CliFrontend`, so the `Configuration` used by `PackagedProgram` does not include Configuration args. -- This message was sent by Atlassian Jira (v8.3.4#803005)
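The ordering problem described in FLINK-18742 (the user-code classloader is built before the command-line overrides are merged in) can be illustrated with a simplified sketch. All names below are hypothetical stand-ins, not the actual {{CliFrontend}} code; the point is only that the effective configuration must be computed before anything that consumes it is constructed:

```java
import java.util.HashMap;
import java.util.Map;

public class EffectiveConfigOrdering {

    // Hypothetical stand-in for Flink's Configuration: base flink-conf.yaml values
    // merged with command-line overrides. The fix implied by the ticket is to call
    // this BEFORE constructing FlinkUserCodeClassLoaders / PackagedProgram.
    static Map<String, String> effectiveConfiguration(Map<String, String> base,
                                                      Map<String, String> cliOverrides) {
        Map<String, String> merged = new HashMap<>(base);
        merged.putAll(cliOverrides); // command line wins over flink-conf.yaml
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("classloader.resolve-order", "child-first"); // default in flink-conf.yaml
        Map<String, String> cli = new HashMap<>();
        cli.put("classloader.resolve-order", "parent-first"); // override from the command line
        Map<String, String> effective = effectiveConfiguration(base, cli);
        // Only now should the user-code classloader be created from `effective`;
        // creating it earlier sees "child-first", which is the reported bug.
        System.out.println(effective.get("classloader.resolve-order"));
    }
}
```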
[jira] [Commented] (FLINK-18496) Anchors are not generated based on ZH characters
[ https://issues.apache.org/jira/browse/FLINK-18496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166374#comment-17166374 ] Zhilong Hong commented on FLINK-18496: -- I think in fact the plugin is conflict with the excerpt rather than the {{_post}} folder. 1. After upgrading to Jekyll 4.0.1, I get errors: [https://gist.github.com/Thesharing/a4681dcef6229903088fb9eb49eaea6f] . The build stuck in endless loop when it comes to: {code:bash} /srv/flink-web/.rubydeps/ruby/2.5.0/gems/jekyll-multiple-languages-2.0.3/lib/jekyll-multiple-languages/multilang.rb:55:in `append_data_for_liquid' /srv/flink-web/.rubydeps/ruby/2.5.0/gems/jekyll-multiple-languages-2.0.3/lib/jekyll-multiple-languages/document.rb:71:in `to_liquid' {code} The function {{append_data_for_liquid}} calls {{deep_merge_hashes!}} in jekyll/utils.rb, and the implementation of {{deep_merge_hashes!}} is different between Jekyll 3 and 4. In Jekyll 3: {code:ruby} def deep_merge_hashes!(target, overwrite) overwrite.each_key do |key| if overwrite[key].is_a? Hash and target[key].is_a? Hash target[key] = Utils.deep_merge_hashes(target[key], overwrite[key]) next end target[key] = overwrite[key] end if target.default_proc.nil? target.default_proc = overwrite.default_proc end target end {code} In Jekyll 4: {code:ruby} def deep_merge_hashes!(target, overwrite) merge_values(target, overwrite) merge_default_proc(target, overwrite) duplicate_frozen_values(target) target end {code} The loop gets stuck at {{duplicate_frozen_values}}. I try to remove {{duplicate_frozen_values}} and build the doc. The build succeeds. However, the blogs without excerpts in markdown have no automatically generated excerpts in the web page. 2. I renamed the folder {{_posts}} to other names like {{articles}}. The build succeeds. But the reference of {{_posts}} in {{index.md}} is broken. Then I try to upgrade Jekyll to 4.1.1 and setting {{page_excerpts}} to {{true}}, which enables excerpts for every page. 
The build fails like above. So in my opinion, the issue is related to the process of generating excerpts with "jekyll-multiple-language". I'm not familiar with Ruby, and I have no idea how to debug Jekyll with IDE. I think I'm stuck here right now, and I'd really appreciate it if anyone could help me with this issue. > Anchors are not generated based on ZH characters > > > Key: FLINK-18496 > URL: https://issues.apache.org/jira/browse/FLINK-18496 > Project: Flink > Issue Type: Bug > Components: Project Website >Reporter: Zhu Zhu >Assignee: Zhilong Hong >Priority: Major > Labels: starter > > In ZH version pages of flink-web, the anchors are not generated based on ZH > characters. The anchor name would be like 'section-1', 'section-2' if there > is no EN characters. An example can be the links in the navigator of > https://flink.apache.org/zh/contributing/contribute-code.html > This makes it impossible to ref an anchor from the content because the anchor > name might change unexpectedly if a new section is added. > Note that it is a problem for flink-web only. The docs generated from the > flink repo can properly generate ZH anchors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166353#comment-17166353 ] wgcn edited comment on FLINK-18715 at 7/28/20, 11:30 AM: - [~chesnay] it indeed has a lot of system resources metric , we talked about cpu occupation in single flink process was (Author: 1026688210): [~chesnay] it indeed has a lot of system resources , we talked about cpu occupation in single flink process > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166346#comment-17166346 ] Andrey Zagrebin edited comment on FLINK-18646 at 7/28/20, 11:27 AM: merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.11 by bcc97082639280ab14f465463fb07b27167c37e3 [~TsReaper] I am closing the issue as the verification should not block the RPC thread any more. Reopen it if you notice any problems with it. If there are still problems with the normal memory allocation timeout (given there is no real leak), we can discuss it in another issue. was (Author: azagrebin): merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3 [~TsReaper] I am closing the issue as the verification should not block the RPC thread any more. Reopen it if you notice any problems with it. If there are still problems with the normal memory allocation timeout (given there is no real leak), we can discuss it in another issue. > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.0, 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
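The failure mode described in FLINK-18646 — a verification step that polls until GC has reclaimed all released memory segments, blocking whichever thread runs it — can be modeled with a small sketch. This is a hypothetical simplification (an {{AtomicLong}} stands in for the memory budget), not Flink's actual {{UnsafeMemoryBudget}} code:

```java
import java.util.concurrent.atomic.AtomicLong;

public class GcWaitSketch {

    // Waits until the reserved budget drops to zero (i.e. GC has run the cleaners
    // of all released segments) or the timeout expires. If slot freeing invokes
    // this on the RPC thread, the sleep/poll loop is what blocks heartbeats.
    static boolean verifyEmpty(AtomicLong reservedBytes, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (reservedBytes.get() > 0) {
            if (System.currentTimeMillis() >= deadline) {
                return false; // segments still not reclaimed
            }
            System.gc(); // nudge the collector so cleaners release the budget
            try {
                Thread.sleep(10); // this wait is what stalls the calling thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

The fix referenced in the comment moves this verification off the RPC thread so a slow GC cannot cause the TaskManager to miss its heartbeat.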
[jira] [Closed] (FLINK-18581) Cannot find GC cleaner with java version previous jdk8u72(-b01)
[ https://issues.apache.org/jira/browse/FLINK-18581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Zagrebin closed FLINK-18581. --- Resolution: Fixed merged into master by 2f03841d5414f9d4a4b810810317c0250065264e merged into 1.11 by fe95187edfe742b64a1f7147e57856c931ef05c3 > Cannot find GC cleaner with java version previous jdk8u72(-b01) > --- > > Key: FLINK-18581 > URL: https://issues.apache.org/jira/browse/FLINK-18581 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.11.0 >Reporter: Xintong Song >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.0, 1.11.2 > > > {{JavaGcCleanerWrapper}} is looking for the package-private method > {{Reference.tryHandlePending}} using reflection. However, the method is first > introduced in the version jdk8u72(-b01). Therefore, if an older version JDK > is used, the method cannot be found and Flink will fail. > See also this [ML > thread|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Error-GC-Cleaner-Provider-Flink-1-11-0-td36565.html]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166353#comment-17166353 ] wgcn edited comment on FLINK-18715 at 7/28/20, 11:27 AM: - [~chesnay] it indeed has a lot of system resources , we talked about cpu occupation in single flink process was (Author: 1026688210): [~chesnay] it indeed has a lot of system resources , we talked able cpu occupation in single flink process > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166353#comment-17166353 ] wgcn commented on FLINK-18715: -- [~chesnay] it indeed has a lot of system resources , we talked able cpu occupation in single flink process > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > flink process add cpu usage metric, user can determine that their job is > io bound /cpu bound ,so that they can increase/decrese cpu core in the > container (k8s,yarn). If it's nessary > . you can assign it to me ,I come up with a idea calculating cpu usage > ratio using ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime over a period > of time . it can get a value in single cpu core environment. and user can > use the value to calculate cpu usage ratio by dividing num of container's > cpu core. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Zagrebin updated FLINK-18646: Fix Version/s: 1.12.0 > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.12.0, 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166346#comment-17166346 ] Andrey Zagrebin edited comment on FLINK-18646 at 7/28/20, 11:20 AM: merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3 [~TsReaper] I am closing the issue as the verification should not block the RPC thread any more. Reopen it if you notice any problems with it. If there are still problems with the normal memory allocation timeout (given there is no real leak), we can discuss it in another issue. was (Author: azagrebin): merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3 [~TsReaper] I am closing the issue as the verification should not block the RPC thread any more. Reopen it if you notice any problems with it. If there are still problems with the normal memory allocation timeout, we can discuss it in another issue. > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166346#comment-17166346 ] Andrey Zagrebin commented on FLINK-18646: - merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3 [~TsReaper] I am closing the issue as the verification should not block the RPC thread any more. Reopen it if you notice any problems with it. If there are still problems with the normal memory allocation timeout, we can discuss it in another issue. > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Zagrebin updated FLINK-18646: Release Note: (was: merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3) > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18734) Add documentation for DynamoStreams Consumer CDC
[ https://issues.apache.org/jira/browse/FLINK-18734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166345#comment-17166345 ] Jark Wu commented on FLINK-18734: - Sounds good to me. > Add documentation for DynamoStreams Consumer CDC > > > Key: FLINK-18734 > URL: https://issues.apache.org/jira/browse/FLINK-18734 > Project: Flink > Issue Type: Improvement > Components: Connectors / Kinesis, Documentation >Affects Versions: 1.11.1 >Reporter: Vinay >Priority: Minor > Labels: CDC, documentation > Fix For: 1.12.0, 1.11.2 > > > Flink already supports CDC for DynamoDb - > https://issues.apache.org/jira/browse/FLINK-4582 by reading the data from > DynamoStreams but there is no documentation for the same. Given that Flink > now supports CDC for Debezium as well , we should add the documentation for > Dynamo CDC so that more users can use this feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-18646) Managed memory released check can block RPC thread
[ https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Zagrebin closed FLINK-18646. --- Release Note: merged into master by 3d056c8fea72ca40b663d12570913679be87c0a9 merged into 1.10 by bcc97082639280ab14f465463fb07b27167c37e3 Resolution: Fixed > Managed memory released check can block RPC thread > -- > > Key: FLINK-18646 > URL: https://issues.apache.org/jira/browse/FLINK-18646 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Affects Versions: 1.11.0 >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Critical > Labels: pull-request-available > Fix For: 1.11.2 > > Attachments: log1.png, log2.png > > > UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on > GC of all allocated/released managed memory. If there are a lot of segments > to GC then it can take time to finish the check. If slot freeing happens in > RPC thread, the GC waiting can block it and TM risks to miss its heartbeat. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-18627) Get unmatch filter method records to side output
[ https://issues.apache.org/jira/browse/FLINK-18627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166290#comment-17166290 ] Roey Shem Tov edited comment on FLINK-18627 at 7/28/20, 11:16 AM: -- [~aljoscha], The semantic improvment is making it easier to get FilteredRecord into side output. in this example (I changed it a little bit for the semantic of the PR): {code:java} final OutputTag curruptedData = new OutputTag<>("side-output"){}; SingleOutputStreamOperator stream = datastream .filter(i->i%2==0,curruptedData) .filter(i->i%3==0,curruptedData) .filter(i->i%4==0,curruptedData) .filter(i->i%5==0,curruptedData); DataStream curruptedDataStream = stream.getSideOutput(curruptedData); // All data that doesn't divide at (2,3,4,5) together.{code} And in the above case i have a new stream with all the curruptedData. Offcourse the currupted data is only one example, there is more examples i can share. I agree that filter should be filtering data, but it is NiceToHave feature that all the Filtered data will go to given outputTag instead just drop it. Offcourse you can implement it by your self (extending RichFilterFunction and send all the Filtered Data into given output), but I think it is a nice wrapper that will be useful. was (Author: roeyshemtov): [~aljoscha], The semantic improvment is making it easier to get FilteredRecord into side output. in this example (I changed it a little bit for the semantic of the PR): {code:java} final OutputTag curruptedData = new OutputTag("side-output"){}; SingleOutputStreamOperator stream = datastream .filter(i->i%2==0,curruptedData) .filter(i->i%3==0,curruptedData) .filter(i->i%4==0,curruptedData) .filter(i->i%5==0,curruptedData); DataStream curruptedDataStream = stream.getSideOutput(curruptedData); // All data that doesn't divide at (2,3,4,5) together.{code} And in the above case i have a new stream with all the curruptedData. 
Offcourse the currupted data is only one example, there is more examples i can share. I agree that filter should be filtering data, but it is NiceToHave feature that all the Filtered data will go to given outputTag instead just drop it. Offcourse you can implement it by your self (extending RichFilterFunction and send all the Filtered Data into given output), but I think it is a nice wrapper that will be useful. > Get unmatch filter method records to side output > > > Key: FLINK-18627 > URL: https://issues.apache.org/jira/browse/FLINK-18627 > Project: Flink > Issue Type: Improvement > Components: API / DataStream >Reporter: Roey Shem Tov >Priority: Major > Fix For: 1.12.0 > > > Unmatch records to filter functions should send somehow to side output. > Example: > > {code:java} > datastream > .filter(i->i%2==0) > .sideOutput(oddNumbersSideOutput); > {code} > > > That's way we can filter multiple times and send the filtered records to our > side output instead of dropping it immediatly, it can be useful in many ways. > > What do you think? -- This message was sent by Atlassian Jira (v8.3.4#803005)
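As the comment notes, the proposed {{filter(predicate, outputTag)}} behavior can already be implemented by hand. A stand-alone sketch of the routing semantics, using a plain {{Predicate}} and lists in place of Flink's {{ProcessFunction}} and {{OutputTag}} (which are not modeled here):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class FilterWithSideOutput {

    // Splits the input into matches (the main stream) and rejects (the "side output").
    // In a real Flink job this would be a ProcessFunction that emits rejects via
    // ctx.output(tag, value) instead of dropping them.
    static <T> List<T> filterTo(List<T> input, Predicate<T> keep, List<T> sideOutput) {
        List<T> main = new ArrayList<>();
        for (T value : input) {
            if (keep.test(value)) {
                main.add(value);
            } else {
                sideOutput.add(value); // would otherwise be silently discarded
            }
        }
        return main;
    }

    public static void main(String[] args) {
        List<Integer> rejected = new ArrayList<>();
        List<Integer> evens = filterTo(List.of(1, 2, 3, 4), i -> i % 2 == 0, rejected);
        System.out.println(evens);    // [2, 4]
        System.out.println(rejected); // [1, 3]
    }
}
```

Chaining several such filters against the same reject sink reproduces the "collect all corrupted data" pattern from the comment's example.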
[jira] [Commented] (FLINK-18734) Add documentation for DynamoStreams Consumer CDC
[ https://issues.apache.org/jira/browse/FLINK-18734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166340#comment-17166340 ] Vinay commented on FLINK-18734: --- [~jark] yes, you are right, my suggestion is to only add it to the documentation, it could be under Kinesis Connector - [https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/kinesis.html] or even better we can have a CDC section in the document in which we can add supported tools by Flink > Add documentation for DynamoStreams Consumer CDC > > > Key: FLINK-18734 > URL: https://issues.apache.org/jira/browse/FLINK-18734 > Project: Flink > Issue Type: Improvement > Components: Connectors / Kinesis, Documentation >Affects Versions: 1.11.1 >Reporter: Vinay >Priority: Minor > Labels: CDC, documentation > Fix For: 1.12.0, 1.11.2 > > > Flink already supports CDC for DynamoDb - > https://issues.apache.org/jira/browse/FLINK-4582 by reading the data from > DynamoStreams but there is no documentation for the same. Given that Flink > now supports CDC for Debezium as well , we should add the documentation for > Dynamo CDC so that more users can use this feature. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-16866) Make job submission non-blocking
[ https://issues.apache.org/jira/browse/FLINK-16866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger reassigned FLINK-16866: -- Assignee: Robert Metzger > Make job submission non-blocking > > > Key: FLINK-16866 > URL: https://issues.apache.org/jira/browse/FLINK-16866 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.9.2, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Robert Metzger >Priority: Critical > Fix For: 1.12.0 > > > Currently, Flink waits to acknowledge a job submission until the > corresponding {{JobManager}} has been created. Since its creation also > involves the creation of the {{ExecutionGraph}} and potential FS operations, > it can take a bit of time. If the user has configured too low a > {{web.timeout}}, the submission can time out, reporting only a > {{TimeoutException}} to the user. > I propose to change the notion of job submission slightly. Instead of waiting > until the {{JobManager}} has been created, a job submission is complete once > all job relevant files have been uploaded to the {{Dispatcher}} and the > {{Dispatcher}} has been told about it. Creating the {{JobManager}} will then > belong to the actual job execution. Consequently, if problems occur while > creating the {{JobManager}} it will result in a job failure. -- This message was sent by Atlassian Jira (v8.3.4#803005)
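The proposed split can be sketched in plain Java (names like `uploadJobFiles` and `createJobManager` are hypothetical placeholders, not Flink API): the submission is acknowledged once the files reach the Dispatcher, while the expensive JobManager creation runs asynchronously, so its failures surface as job failures rather than submission timeouts.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: acknowledge submission after upload; detach the
// slow JobManager/ExecutionGraph creation from the submission path.
class NonBlockingSubmission {
    static CompletableFuture<String> submitJob(String jobGraph) {
        uploadJobFiles(jobGraph);                                      // fast: files reach the Dispatcher
        CompletableFuture.runAsync(() -> createJobManager(jobGraph));  // slow part runs asynchronously
        return CompletableFuture.completedFuture("SUBMITTED");         // ack immediately
    }

    static void uploadJobFiles(String jobGraph) {
        // hand job-relevant files to the Dispatcher (blob store upload)
    }

    static void createJobManager(String jobGraph) {
        // ExecutionGraph construction, FS operations, etc.; a failure here
        // would be reported as a job failure, not a submission timeout.
    }
}
```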
[jira] [Commented] (FLINK-17260) StreamingKafkaITCase failure on Azure
[ https://issues.apache.org/jira/browse/FLINK-17260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166330#comment-17166330 ] Jiangjie Qin commented on FLINK-17260: -- I don't have a clue either. From the phenomenon itself, a wild guess is that the aggregation value 14 is somehow doubled. * The expected final result 27 comes from the aggregation of three events: 5, 9, 13. * The error always has a 41, which is 14 greater than the expected result. * There are two possibilities for this to happen: ** There are duplicate messages in Kafka. ** The state got corrupted. * If there are duplicate messages, then 5 and 9 must both be duplicated. In that case, the output sequence should be 14, 19, 28, 41. So there should also be a 19 and a 28 before 41 is emitted. * So mathematically speaking, it seems that the aggregated value of 14 is somehow doubled. I'll try to reproduce this locally with some logging added. > StreamingKafkaITCase failure on Azure > - > > Key: FLINK-17260 > URL: https://issues.apache.org/jira/browse/FLINK-17260 > Project: Flink > Issue Type: Bug > Components: Connectors / Kafka, Tests >Affects Versions: 1.11.0, 1.12.0 >Reporter: Roman Khachatryan >Assignee: Chesnay Schepler >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.12.0 > > > [https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_apis/build/builds/7544/logs/165] > > {code:java} > 2020-04-16T00:12:32.2848429Z [INFO] Running > org.apache.flink.tests.util.kafka.StreamingKafkaITCase > 2020-04-16T00:14:47.9100927Z [ERROR] Tests run: 3, Failures: 1, Errors: 0, > Skipped: 0, Time elapsed: 135.621 s <<< FAILURE! - in > org.apache.flink.tests.util.kafka.StreamingKafkaITCase > 2020-04-16T00:14:47.9103036Z [ERROR] testKafka[0: > kafka-version:0.10.2.0](org.apache.flink.tests.util.kafka.StreamingKafkaITCase) > Time elapsed: 46.222 s <<< FAILURE! 
> 2020-04-16T00:14:47.9104033Z java.lang.AssertionError: > expected:<[elephant,27,64213]> but was:<[]> > 2020-04-16T00:14:47.9104638Zat org.junit.Assert.fail(Assert.java:88) > 2020-04-16T00:14:47.9105148Zat > org.junit.Assert.failNotEquals(Assert.java:834) > 2020-04-16T00:14:47.9105701Zat > org.junit.Assert.assertEquals(Assert.java:118) > 2020-04-16T00:14:47.9106239Zat > org.junit.Assert.assertEquals(Assert.java:144) > 2020-04-16T00:14:47.9107177Zat > org.apache.flink.tests.util.kafka.StreamingKafkaITCase.testKafka(StreamingKafkaITCase.java:162) > 2020-04-16T00:14:47.9107845Zat > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-04-16T00:14:47.9108434Zat > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-04-16T00:14:47.9109318Zat > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-04-16T00:14:47.9109914Zat > java.lang.reflect.Method.invoke(Method.java:498) > 2020-04-16T00:14:47.9110434Zat > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-04-16T00:14:47.9110985Zat > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-04-16T00:14:47.9111548Zat > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-04-16T00:14:47.9112083Zat > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-04-16T00:14:47.9112629Zat > org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-04-16T00:14:47.9113145Zat > org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-04-16T00:14:47.9113637Zat > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2020-04-16T00:14:47.9114072Zat > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-04-16T00:14:47.9114490Zat > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2020-04-16T00:14:47.9115256Zat > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2020-04-16T00:14:47.9115791Zat > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2020-04-16T00:14:47.9116292Zat > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-04-16T00:14:47.9116736Zat > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-04-16T00:14:47.9117779Zat > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-04-16T00:14:47.9118274Zat > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-04-16T00:14:47.9118766Zat > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-04-16T00:14:47.9119204Zat > org.junit.runners.ParentR
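The arithmetic behind that reasoning can be checked directly (a quick sanity sketch, not Flink code): the running totals of the clean event sequence end at 27, while duplicating 5 and 9 would produce the intermediate values 19 and 28 before 41 ever appears.

```java
// Running totals as each event is folded into the aggregate, mirroring the
// per-event output of the aggregation discussed in the comment.
class AggregationCheck {
    static int[] prefixSums(int[] events) {
        int[] sums = new int[events.length];
        int total = 0;
        for (int i = 0; i < events.length; i++) {
            total += events[i];
            sums[i] = total;
        }
        return sums;
    }
}
```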
[jira] [Commented] (FLINK-18712) Flink RocksDB statebackend memory leak issue
[ https://issues.apache.org/jira/browse/FLINK-18712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166322#comment-17166322 ] Farnight commented on FLINK-18712: -- [~yunta], below is some testing information based on a simple job. Please help check, thanks a lot! Flink configs: for the Flink cluster, we use session-cluster mode. version: 1.10 TM configs: state.backend.rocksdb.memory.managed set to `true`. Our k8s pod has 31G memory; managed memory is set to 10G, heap size to 15G; other settings keep the defaults. Job: # write a dummy source function to emit events in a for/while loop # use the default SessionWindow with a gap of 30 minutes # run the job a few times # monitor the k8s pod memory working set usage with cadvisor Case 1: when running the job on k8s (jm/tm inside a pod container), the memory working set keeps rising; although the job is stopped, the working set does not decrease. Eventually the tm process is killed by the oom-killer and restarted (pid changed), after which the memory working set is reset. Case 2: when running the job on my local machine (MacBook Pro) without a k8s env, it does not have this issue. > Flink RocksDB statebackend memory leak issue > - > > Key: FLINK-18712 > URL: https://issues.apache.org/jira/browse/FLINK-18712 > Project: Flink > Issue Type: Bug > Components: Runtime / State Backends >Affects Versions: 1.10.0 >Reporter: Farnight >Priority: Critical > > When using RocksDB as our statebackend, we found it leads to a memory leak > when restarting a job (manually or in the recovery case). > > How to reproduce: > # increase the RocksDB blockcache size (e.g. 1G); it is easier to monitor and > reproduce. > # start a job using the RocksDB statebackend. > # when the RocksDB blockcache reaches its maximum size, restart the job, and > monitor the memory usage (k8s pod working set) of the TM. > # go through steps 2-3 a few more times; the memory will keep rising. > > Any solution or suggestion for this? Thanks!
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18740) Support rate limiter in the universal kafka connector
Truong Duc Kien created FLINK-18740: --- Summary: Support rate limiter in the universal kafka connector Key: FLINK-18740 URL: https://issues.apache.org/jira/browse/FLINK-18740 Project: Flink Issue Type: Improvement Components: Connectors / Kafka Affects Versions: 1.11.1 Reporter: Truong Duc Kien Currently, a rate limiter is only available for the Kafka 0.10 connector, but not for the universal connector. -- This message was sent by Atlassian Jira (v8.3.4#803005)
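For context, the 0.10 connector's rate limiting is a Guava-style token bucket behind a rate-limiter interface. A minimal self-contained sketch of that idea in plain Java (an illustration only, not the connector's actual implementation):

```java
// Minimal token-bucket sketch: permits refill continuously at
// `permitsPerSecond`, and acquire() blocks until a permit is available.
class SimpleRateLimiter {
    private final double permitsPerSecond;
    private double availablePermits;
    private long lastRefillNanos;

    SimpleRateLimiter(double permitsPerSecond) {
        this.permitsPerSecond = permitsPerSecond;
        this.availablePermits = permitsPerSecond; // start with one second's budget
        this.lastRefillNanos = System.nanoTime();
    }

    synchronized void acquire(int permits) throws InterruptedException {
        refill();
        while (availablePermits < permits) {
            Thread.sleep(1); // wait for permits to accumulate
            refill();
        }
        availablePermits -= permits;
    }

    private void refill() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        // cap the burst at one second's worth of permits
        availablePermits = Math.min(permitsPerSecond, availablePermits + elapsedSeconds * permitsPerSecond);
        lastRefillNanos = now;
    }
}
```

In a Kafka consumer loop, `acquire(bytesRead)` before handing records downstream is the usual place such a limiter is invoked.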
[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166303#comment-17166303 ] Chesnay Schepler commented on FLINK-18715: -- [~1026688210] Flink has various opt-in metrics for system resources; have you looked at the [documentation|https://ci.apache.org/projects/flink/flink-docs-release-1.11/monitoring/metrics.html#system-resources]? > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > Add a CPU usage metric to the Flink process so that users can determine > whether their job is IO bound or CPU bound and increase/decrease the CPU > cores in the container (k8s, yarn) accordingly. If necessary, you can assign > it to me. My idea is to calculate the CPU usage ratio using > ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime() over a > period of time. This yields a value for a single-CPU-core environment, and > users can derive the usage ratio by dividing by the number of the > container's CPU cores. -- This message was sent by Atlassian Jira (v8.3.4#803005)
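A minimal sketch of the calculation proposed in the ticket (pure JDK; it relies on the HotSpot-specific com.sun.management.OperatingSystemMXBean, so it returns -1 where that bean is unavailable):

```java
import java.lang.management.ManagementFactory;

// Sketch of the proposed metric: process CPU time consumed per unit of
// wall-clock time, i.e. the average number of cores the process kept busy
// between two samples.
class ProcessCpuUsage {
    private static long lastCpuTimeNanos = -1;
    private static long lastWallTimeMillis = -1;

    // Returns average cores busy since the previous call, or -1 if
    // unsupported or if there is no previous sample yet.
    static synchronized double sample() {
        java.lang.management.OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (!(os instanceof com.sun.management.OperatingSystemMXBean)) {
            return -1;
        }
        long cpuTimeNanos = ((com.sun.management.OperatingSystemMXBean) os).getProcessCpuTime();
        long wallTimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        if (cpuTimeNanos < 0) {
            return -1; // JVM does not support process CPU time
        }
        double usage = -1;
        if (lastCpuTimeNanos >= 0 && wallTimeMillis > lastWallTimeMillis) {
            double cpuDeltaMillis = (cpuTimeNanos - lastCpuTimeNanos) / 1_000_000.0;
            usage = cpuDeltaMillis / (wallTimeMillis - lastWallTimeMillis);
        }
        lastCpuTimeNanos = cpuTimeNanos;
        lastWallTimeMillis = wallTimeMillis;
        return usage;
    }
}
```

Dividing the sampled value by the container's core count then gives the 0-1 utilization ratio described in the ticket.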
[jira] [Commented] (FLINK-17073) Slow checkpoint cleanup causing OOMs
[ https://issues.apache.org/jira/browse/FLINK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166310#comment-17166310 ] Etienne Chauchot commented on FLINK-17073: -- [~SleePy] sure, I'll update the google doc to add the impl plan. > Slow checkpoint cleanup causing OOMs > > > Key: FLINK-17073 > URL: https://issues.apache.org/jira/browse/FLINK-17073 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing >Affects Versions: 1.7.3, 1.8.0, 1.9.0, 1.10.0, 1.11.0 >Reporter: Till Rohrmann >Assignee: Etienne Chauchot >Priority: Major > Fix For: 1.12.0 > > > A user reported that he sees a decline in checkpoint cleanup speed when > upgrading from Flink 1.7.2 to 1.10.0. The result is that a lot of cleanup > tasks are waiting in the execution queue, occupying memory. Ultimately, the JM > process dies with an OOM. > Compared to Flink 1.7.2, we introduced a dedicated {{ioExecutor}} which is > used by the {{HighAvailabilityServices}} (FLINK-11851). Before, we used the > {{AkkaRpcService}} thread pool, which was a {{ForkJoinPool}} with a max > parallelism of 64. Now it is a {{FixedThreadPool}} with as many threads as > CPU cores. This change might have caused the decline in completed checkpoint > discard throughput. This suspicion needs to be validated before trying to fix > it! > [1] > https://lists.apache.org/thread.html/r390e5d775878918edca0b6c9f18de96f828c266a888e34ed30ce8494%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
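The suspected mechanism can be illustrated with a toy model in plain Java (hypothetical numbers, not Flink code): when slow cleanup tasks arrive faster than a small fixed pool can drain them, they accumulate in the executor's unbounded work queue, which is exactly the memory pressure described in the ticket.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Toy model: a burst of slow "discard" tasks submitted to a small fixed
// pool piles up in the executor's unbounded queue instead of completing.
class CleanupBacklog {
    static int queuedAfterBurst(int poolSize, int tasks) throws InterruptedException {
        ThreadPoolExecutor executor = (ThreadPoolExecutor) Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++) {
            executor.execute(() -> {
                try {
                    Thread.sleep(50); // stands in for a slow checkpoint discard (FS delete)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        int queued = executor.getQueue().size(); // the backlog occupying memory
        executor.shutdownNow();
        executor.awaitTermination(5, TimeUnit.SECONDS);
        return queued;
    }
}
```

With a ForkJoinPool-style parallelism of 64 the same burst drains far faster, which is why the switch to a cores-sized FixedThreadPool is a plausible culprit.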
[jira] [Closed] (FLINK-11127) Make metrics query service establish connection to JobManager
[ https://issues.apache.org/jira/browse/FLINK-11127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chesnay Schepler closed FLINK-11127. Resolution: Won't Fix > Make metrics query service establish connection to JobManager > - > > Key: FLINK-11127 > URL: https://issues.apache.org/jira/browse/FLINK-11127 > Project: Flink > Issue Type: Improvement > Components: Deployment / Kubernetes, Runtime / Coordination, Runtime > / Metrics >Affects Versions: 1.7.0, 1.9.2, 1.10.0 >Reporter: Ufuk Celebi >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > As part of FLINK-10247, the internal metrics query service has been separated > into its own actor system. Before this change, the JobManager (JM) queried > TaskManager (TM) metrics via the TM actor. Now, the JM needs to establish a > separate connection to the TM metrics query service actor. > In the context of Kubernetes, this is problematic as the JM will typically > *not* be able to resolve the TMs by name, resulting in warnings as follows: > {code} > 2018-12-11 08:32:33,962 WARN akka.remote.ReliableDeliverySupervisor > - Association with remote system > [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183] has > failed, address is now gated for [50] ms. Reason: [Association failed with > [akka.tcp://flink-metrics@flink-task-manager-64b868487c-x9l4b:39183]] Caused > by: [flink-task-manager-64b868487c-x9l4b: Name does not resolve] > {code} > In order to expose the TMs by name in Kubernetes, users require a service > *for each* TM instance, which is not practical. > This currently results in the web UI not being able to display some basic > metrics, such as the number of sent records. You can reproduce this by > following the READMEs in {{flink-container/kubernetes}}. > This worked before, because the JM is typically exposed via a service with a > known name and the TMs establish the connection to it, which the metrics query > service piggybacked on. 
> A potential solution to this might be to let the query service connect to the > JM similar to how the TMs register. > I tagged this ticket as an improvement, but in the context of Kubernetes I > would consider this to be a bug. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-18741) ProcessWindowFunction's process function exception
mzz created FLINK-18741: --- Summary: ProcessWindowFunction's process function exception Key: FLINK-18741 URL: https://issues.apache.org/jira/browse/FLINK-18741 Project: Flink Issue Type: Bug Components: API / DataStream Affects Versions: 1.10.0 Reporter: mzz I use ProcessWindowFunction to calculate PV, but when overriding process, the user-defined state value cannot be read back. Code:
{code:java}
tem.keyBy(x => (x._1, x._2, x._4, x._5, x._6, x._7, x._8))
  .timeWindow(Time.seconds(15 * 60)) // 15 min window
  .process(new ProcessWindowFunction[(String, String, String, String, String, String, String, String, String), CkResult, (String, String, String, String, String, String, String), TimeWindow] {
    var clickCount: ValueState[Long] = _
    var requestCount: ValueState[Long] = _
    var returnCount: ValueState[Long] = _
    var videoCount: ValueState[Long] = _
    var noVideoCount: ValueState[Long] = _

    override def open(parameters: Configuration): Unit = {
      clickCount = getRuntimeContext.getState(new ValueStateDescriptor("clickCount", classOf[Long]))
      requestCount = getRuntimeContext.getState(new ValueStateDescriptor("requestCount", classOf[Long]))
      returnCount = getRuntimeContext.getState(new ValueStateDescriptor("returnCount", classOf[Long]))
      videoCount = getRuntimeContext.getState(new ValueStateDescriptor("videoCount", classOf[Long]))
      noVideoCount = getRuntimeContext.getState(new ValueStateDescriptor("noVideoCount", classOf[Long]))
    }

    override def process(key: (String, String, String, String, String, String, String), context: Context, elements: Iterable[(String, String, String, String, String, String, String, String, String)], out: Collector[CkResult]) = {
      try {
        var clickNum: Long = clickCount.value
        val dateNow = LocalDateTime.now().format(DateTimeFormatter.ofPattern("MMdd")).toLong
        var requestNum: Long = requestCount.value
        var returnNum: Long = returnCount.value
        var videoNum: Long = videoCount.value
        var noVideoNum: Long = noVideoCount.value
        if (requestNum == null) {
          requestNum = 0
        }
        val ecpm = key._7.toDouble.formatted("%.2f").toFloat
        val created_at = getSecondTimestampTwo(new Date)
        elements.foreach(e => {
          if ("adreq".equals(e._3)) {
            requestNum += 1
            println(key._1, requestNum)
          }
        })
        requestCount.update(requestNum)
        println(requestNum, key._1)
        out.collect(CkResult(dateNow, (created_at - getZero_time) / (60 * 15), key._2, key._3, key._4, key._5, key._3 + "_" + key._4 + "_" + key._5, key._6, key._1, requestCount.value, returnCount.value, fill_rate, noVideoCount.value + videoCount.value, expose_rate, clickCount.value, click_rate, ecpm, (noVideoCount.value * ecpm + videoCount.value * ecpm / 1000.toFloat).formatted("%.2f").toFloat, created_at))
      } catch {
        case e: Exception => println(key, e)
      }
    }
  })
{code}
{code:java}
elements.foreach(e => {
  if ("adreq".equals(e._3)) {
    requestNum += 1
    println(key._1, requestNum)
    // The values printed here look like:
    // (key,1)
    // (key,2)
    // (key,3)
  }
})
// But printing outside the loop always shows:
// (key,0)
println(requestNum, key._1)
{code}
Who can help me? Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16510) Task manager safeguard shutdown may not be reliable
[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166302#comment-17166302 ] Till Rohrmann commented on FLINK-16510: --- The JavaDoc says "If the shutdown sequence has already been initiated then this method does not wait for any running shutdown hooks or finalizers to finish their work.". > Task manager safeguard shutdown may not be reliable > --- > > Key: FLINK-16510 > URL: https://issues.apache.org/jira/browse/FLINK-16510 > Project: Flink > Issue Type: Bug > Components: Runtime / Task >Reporter: Maximilian Michels >Assignee: Maximilian Michels >Priority: Major > Attachments: stack2-1.txt > > > The {{JvmShutdownSafeguard}} does not always succeed but can hang when > multiple threads attempt to shutdown the JVM. Apparently mixing > {{System.exit()}} with ShutdownHooks and forcefully terminating the JVM via > {{Runtime.halt()}} does not play together well: > {noformat} > "Jvm Terminator" #22 daemon prio=5 os_prio=0 tid=0x7fb8e82f2800 > nid=0x5a96 runnable [0x7fb35cffb000] >java.lang.Thread.State: RUNNABLE > at java.lang.Shutdown.$$YJP$$halt0(Native Method) > at java.lang.Shutdown.halt0(Shutdown.java) > at java.lang.Shutdown.halt(Shutdown.java:139) > - locked <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Runtime.halt(Runtime.java:276) > at > org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run(JvmShutdownSafeguard.java:86) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - None > "FlinkCompletableFutureDelayScheduler-thread-1" #18154 daemon prio=5 > os_prio=0 tid=0x7fb708a7d000 nid=0x5a8a waiting for monitor entry > [0x7fb289d49000] >java.lang.Thread.State: BLOCKED (on object monitor) > at java.lang.Shutdown.halt(Shutdown.java:139) > - waiting to lock <0x00047ed67638> (a java.lang.Shutdown$Lock) > at java.lang.Shutdown.exit(Shutdown.java:213) > - locked <0x00047edb7348> (a java.lang.Class for java.lang.Shutdown) > at 
java.lang.Runtime.exit(Runtime.java:110) > at java.lang.System.exit(System.java:973) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.terminateJVM(TaskManagerRunner.java:266) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$onFatalError$1(TaskManagerRunner.java:260) > at > org.apache.flink.runtime.taskexecutor.TaskManagerRunner$$Lambda$27464/1464672548.accept(Unknown > Source) > at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > at > org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:943) > at > org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211) > at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:361) > at > org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$27435/159015392.run(Unknown > Source) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >Locked ownable synchronizers: > - <0x0006d5e56bd0> (a > java.util.concurrent.ThreadPoolExecutor$Worker) > {noformat} > Note that under this condition the JVM should terminate but it still hangs. 
> Sometimes it quits after several minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-18715) add cpu usage metric of jobmanager/taskmanager
[ https://issues.apache.org/jira/browse/FLINK-18715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166299#comment-17166299 ] Till Rohrmann commented on FLINK-18715: --- How would this feature work in different deployment scenarios (e.g. in standalone mode where we don't have CPU isolation)? > add cpu usage metric of jobmanager/taskmanager > - > > Key: FLINK-18715 > URL: https://issues.apache.org/jira/browse/FLINK-18715 > Project: Flink > Issue Type: Improvement > Components: Runtime / Metrics >Affects Versions: 1.11.1 >Reporter: wgcn >Priority: Major > Fix For: 1.12.0, 1.11.2 > > > Add a CPU usage metric to the Flink process so that users can determine > whether their job is IO bound or CPU bound and increase/decrease the CPU > cores in the container (k8s, yarn) accordingly. If necessary, you can assign > it to me. My idea is to calculate the CPU usage ratio using > ManagementFactory.getRuntimeMXBean().getUptime() and > ManagementFactory.getOperatingSystemMXBean().getProcessCpuTime() over a > period of time. This yields a value for a single-CPU-core environment, and > users can derive the usage ratio by dividing by the number of the > container's CPU cores. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (FLINK-18689) Deterministic Slot Sharing
[ https://issues.apache.org/jira/browse/FLINK-18689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Rohrmann reassigned FLINK-18689: - Assignee: Andrey Zagrebin > Deterministic Slot Sharing > -- > > Key: FLINK-18689 > URL: https://issues.apache.org/jira/browse/FLINK-18689 > Project: Flink > Issue Type: New Feature > Components: Runtime / Coordination >Reporter: Andrey Zagrebin >Assignee: Andrey Zagrebin >Priority: Major > > [Design > doc|https://docs.google.com/document/d/10CbCVDBJWafaFOovIXrR8nAr2BnZ_RAGFtUeTHiJplw] -- This message was sent by Atlassian Jira (v8.3.4#803005)