[jira] [Created] (FLINK-15962) Reduce the default chunk size in netty stack

2020-02-09 Thread zhijiang (Jira)
zhijiang created FLINK-15962:


 Summary: Reduce the default chunk size in netty stack
 Key: FLINK-15962
 URL: https://issues.apache.org/jira/browse/FLINK-15962
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Reporter: zhijiang


The current default chunk size inside the netty stack is 16MB, and one arena 
can create multiple chunks on demand. However, a new chunk is not necessarily 
created only after the previous one has been fully exhausted, which can 
further increase the direct memory overhead.

In order to decrease the total memory overhead caused by netty, we can reduce 
the default chunk size, measure the effect via the existing e2e tests, and 
verify the performance impact via the benchmarks.

This improvement is orthogonal to 
[FLINK-10742|https://issues.apache.org/jira/browse/FLINK-10742] 
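As background, the arithmetic behind the 16MB figure can be sketched as 
follows (a minimal illustration assuming netty's documented pooled-allocator 
parameters, where chunk size = pageSize << maxOrder):

```java
public class NettyChunkSizeSketch {
    // Netty's pooled allocator sizes each chunk as pageSize << maxOrder.
    static int chunkSize(int pageSize, int maxOrder) {
        return pageSize << maxOrder;
    }

    public static void main(String[] args) {
        // Default: 8 KiB pages with maxOrder 11 give 16 MiB chunks.
        System.out.println(chunkSize(8192, 11)); // 16777216 bytes = 16 MiB
        // Lowering maxOrder, e.g. to 7, shrinks each chunk to 1 MiB and
        // thus reduces the per-arena direct memory overhead.
        System.out.println(chunkSize(8192, 7));  // 1048576 bytes = 1 MiB
    }
}
```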



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-15914) Miss the barrier alignment metric for the case of two inputs

2020-02-05 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15914.

Resolution: Fixed

Merged in release-1.10: 72b076cbcbfd2636d4980a9fca979e7482b2729c

> Miss the barrier alignment metric for the case of two inputs
> 
>
> Key: FLINK-15914
> URL: https://issues.apache.org/jira/browse/FLINK-15914
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Runtime / Metrics
>Affects Versions: 1.10.0
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When StreamTwoInputSelectableProcessor was introduced, the barrier alignment 
> metric was not added in its constructor. This caused no problems at the 
> time, because only StreamTwoInputProcessor was in use.
> Now that StreamTwoInputProcessor has been replaced by 
> StreamTwoInputSelectableProcessor, the bug is exposed and the barrier 
> alignment metric is missing for the case of two inputs.
> The solution is to add this metric while constructing the current 
> StreamTwoInputProcessor.





[jira] [Created] (FLINK-15914) Miss the barrier alignment metric for the case of two inputs

2020-02-05 Thread zhijiang (Jira)
zhijiang created FLINK-15914:


 Summary: Miss the barrier alignment metric for the case of two 
inputs
 Key: FLINK-15914
 URL: https://issues.apache.org/jira/browse/FLINK-15914
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing, Runtime / Metrics
Reporter: zhijiang
Assignee: zhijiang


When StreamTwoInputSelectableProcessor was introduced, the barrier alignment 
metric was not added in its constructor. This caused no problems at the time, 
because only StreamTwoInputProcessor was in use.

Now that StreamTwoInputProcessor has been replaced by 
StreamTwoInputSelectableProcessor, the bug is exposed and the barrier 
alignment metric is missing for the case of two inputs.

The solution is to add this metric while constructing the current 
StreamTwoInputProcessor.
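A minimal sketch of the intended fix (all names here are hypothetical 
simplifications, not Flink's actual classes): register the alignment gauge 
in the constructor that two-input tasks actually use, so the metric is 
always present:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

public class TwoInputMetricSketch {
    // Simplified stand-in for a metric group: name -> gauge.
    static class MetricGroup {
        final Map<String, LongSupplier> gauges = new HashMap<>();
        void gauge(String name, LongSupplier g) { gauges.put(name, g); }
    }

    // Simplified stand-in for the barrier handler tracking alignment time.
    static class BarrierHandler {
        long alignmentDurationNanos;
        long getAlignmentDurationNanos() { return alignmentDurationNanos; }
    }

    final BarrierHandler handler;

    // The fix: the gauge is registered in the constructor, so two-input
    // tasks expose the alignment metric just like one-input tasks.
    TwoInputMetricSketch(BarrierHandler handler, MetricGroup metrics) {
        this.handler = handler;
        metrics.gauge("checkpointAlignmentTime",
                handler::getAlignmentDurationNanos);
    }
}
```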





[jira] [Updated] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2020-02-03 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15010:
-
Fix Version/s: 1.9.3

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.0, 1.9.1, 1.9.2
>Reporter: Nico Kruber
>Assignee: Yun Gao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.3
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-





[jira] [Updated] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2020-02-03 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15010:
-
Affects Version/s: 1.9.0
   1.9.2

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.0, 1.9.1, 1.9.2
>Reporter: Nico Kruber
>Assignee: Yun Gao
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-





[jira] [Closed] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2020-02-03 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15010.

Resolution: Fixed

Merged in master: 5036334ca00405cd4cdd5a798dca012bb3cc7bbf

Merged in release-1.10: 57da4589e89672bcb042c8656b269e4faace1664

Merged in release-1.9: b8221b0758a2e0946f9ddd6ca537ce2a6e6a133a

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Assignee: Yun Gao
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-





[jira] [Assigned] (FLINK-15873) Matched result may not be output if existing earlier partial matches

2020-02-03 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15873:


Assignee: shuai.xu

> Matched result may not be output if existing earlier partial matches
> 
>
> Key: FLINK-15873
> URL: https://issues.apache.org/jira/browse/FLINK-15873
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.9.0
>Reporter: shuai.xu
>Assignee: shuai.xu
>Priority: Major
>
> When running some cep jobs with a skip strategy, I found that when we get a 
> matched result, the result will not be returned if there is an earlier 
> partial match.
> I think this is due to a bug in processMatchesAccordingToSkipStrategy() in 
> the NFA class. It should return the matched result without checking whether 
> there are partial matches.





[jira] [Closed] (FLINK-15444) Make the component AbstractInvokable in CheckpointBarrierHandler NonNull

2020-01-21 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15444.

Resolution: Fixed

Merged in master: 9f92ddfecad6e056cc4b58311e493cbb0884357b

> Make the component AbstractInvokable in CheckpointBarrierHandler NonNull 
> -
>
> Key: FLINK-15444
> URL: https://issues.apache.org/jira/browse/FLINK-15444
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Checkpointing
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The component {{AbstractInvokable}} in {{CheckpointBarrierHandler}} is 
> currently annotated {{@Nullable}}. In the real code path it is passed via 
> the constructor and is never null; the nullable annotation exists only for 
> unit test purposes. But this misleads readers about real usage and brings 
> extra trouble, because callers must always check for null before use in the 
> related processes.
> We can refactor the related unit tests to implement a dummy 
> {{AbstractInvokable}} for tests and remove the {{@Nullable}} annotation from 
> the related class constructors.
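The shape of this refactoring can be sketched as follows (a simplified, 
hypothetical illustration, not the actual Flink classes): the invokable 
becomes a non-null constructor argument, and tests substitute a dummy 
implementation instead of null:

```java
import static java.util.Objects.requireNonNull;

public class NonNullInvokableSketch {
    interface Invokable {
        void triggerCheckpoint(long checkpointId);
    }

    // Production handler: the invokable is guaranteed non-null, so no
    // null checks are needed at the call sites.
    static class BarrierHandler {
        private final Invokable toNotify; // no @Nullable anymore

        BarrierHandler(Invokable toNotify) {
            this.toNotify = requireNonNull(toNotify);
        }

        void notifyCheckpoint(long checkpointId) {
            toNotify.triggerCheckpoint(checkpointId); // no null guard
        }
    }

    // Unit tests pass this dummy instead of the former null argument.
    static class DummyInvokable implements Invokable {
        long lastCheckpointId = -1;

        @Override
        public void triggerCheckpoint(long checkpointId) {
            lastCheckpointId = checkpointId;
        }
    }
}
```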





[jira] [Commented] (FLINK-14163) Execution#producedPartitions is possibly not assigned when used

2020-01-12 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014085#comment-17014085
 ] 

zhijiang commented on FLINK-14163:
--

Thanks for the good suggestions above! Sorry for coming back to this issue a 
bit late, especially with the PR already ready.

My earlier concern was that properly supporting the async way might bring big 
trouble for the scheduler, or conflict with the new scheduler direction in the 
long term. Also, considering that the async shuffle path looked a bit 
over-designed and has no real users at the moment, I previously suggested 
switching to the sync way to stop the loss early, although in general it is 
not good to break compatibility of an exposed public interface. If handling 
the async way is not a problem for the scheduler in the future, I am happy to 
retain the async shuffle path.

If we decide to retain the async way and work around it in the scheduler 
temporarily, it might be better not to fail immediately upon finding the 
future incomplete. Instead, we could bear a bounded timeout before failing. 
This timeout would cover both waiting for the future to complete and waiting 
for the shuffle master call to return, to avoid blocking the main thread for 
a long time.
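The bounded-wait idea could look roughly like this (an illustrative sketch 
only; the method and names are hypothetical, not the actual scheduler code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedWaitSketch {
    // Instead of failing immediately when the registration future is not
    // yet complete, bear a bounded timeout before giving up, so the main
    // thread can neither hang forever nor fail spuriously.
    static <T> T waitForRegistration(CompletableFuture<T> future, long timeoutMs)
            throws Exception {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException(
                    "Partition registration did not complete within "
                            + timeoutMs + " ms", e);
        }
    }
}
```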

> Execution#producedPartitions is possibly not assigned when used
> ---
>
> Key: FLINK-14163
> URL: https://issues.apache.org/jira/browse/FLINK-14163
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zhu Zhu
>Assignee: Yuan Mei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently {{Execution#producedPartitions}} is assigned after the partitions 
> have completed their registration with the shuffle master in 
> {{Execution#registerProducedPartitions(...)}}.
> The partition registration is an async interface 
> ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so 
> {{Execution#producedPartitions}} is possibly[1] not yet set when used. 
> Usages include:
> 1. deploying this task, so the task may be deployed without its result 
> partitions assigned and the job would hang (DefaultScheduler issue only, 
> since the legacy scheduler handled this case)
> 2. generating input descriptors for downstream tasks: 
> 3. retrieving {{ResultPartitionID}} for partition releasing: 
> [1] If a user uses Flink's default shuffle master {{NettyShuffleMaster}}, it 
> is not problematic at the moment since it returns a completed future on 
> registration, making it effectively a synchronous process. However, if users 
> implement their own shuffle service in which 
> {{ShuffleMaster#registerPartitionWithProducer}} returns a pending future, it 
> can be a problem. This is possible since a customizable shuffle service has 
> been open to users since 1.9 (via the config 
> "shuffle-service-factory.class").
> To avoid such issues, we may either 
> 1. fix all the usages of {{Execution#producedPartitions}} regarding the 
> async assignment, or 
> 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync 
> interface
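The race can be illustrated with a small sketch (hypothetical, simplified 
names; not the actual Execution code): the field is only assigned when the 
registration future completes, so a pending future from a custom shuffle 
master leaves it unset at read time:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

public class AsyncRegistrationSketch {
    final AtomicReference<String> producedPartitions = new AtomicReference<>();

    // The descriptor is stored only once the future completes: immediately
    // for a completed future (NettyShuffleMaster-like behavior), but at
    // some arbitrary later point for a custom shuffle master's pending
    // future -- reads in between observe null.
    void registerProducedPartitions(CompletableFuture<String> registration) {
        registration.thenAccept(producedPartitions::set);
    }
}
```

With a completed future the field is set before the call returns; with a 
pending future any read in between observes null, which is exactly the 
hazard described above.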





[jira] [Comment Edited] (FLINK-15355) Nightly streaming file sink fails with unshaded hadoop

2020-01-10 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012387#comment-17012387
 ] 

zhijiang edited comment on FLINK-15355 at 1/10/20 11:29 AM:


Merged in release-1.10: aa37c0cd89053e68e72a19d51715b3a31b74163c

Merged in master: f7833aff7d50af5a3a3a671d9b6a44bd5dc17a67


was (Author: zjwang):
Merged in release-1.10: aa37c0cd89053e68e72a19d51715b3a31b74163c

> Nightly streaming file sink fails with unshaded hadoop
> --
>
> Key: FLINK-15355
> URL: https://issues.apache.org/jira/browse/FLINK-15355
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
>  at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
>  at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
>  at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664)
>  at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
>  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895)
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
> Caused by: java.lang.RuntimeException: 
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1751)
>  at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:94)
>  at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:63)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1628)
>  at StreamingFileSinkProgram.main(StreamingFileSinkProgram.java:77)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321)
>  ... 11 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1746)
>  ... 20 more
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>  at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:326)
>  at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>  at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:274)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postCo

[jira] [Closed] (FLINK-15306) Adjust the default netty transport option from nio to auto

2020-01-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15306.

Resolution: Fixed

Merged in master: 169edef4b5e4381626b7405947ebfe3f49aff2ac

> Adjust the default netty transport option from nio to auto
> --
>
> Key: FLINK-15306
> URL: https://issues.apache.org/jira/browse/FLINK-15306
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration, Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The default option of `taskmanager.network.netty.transport` in 
> NettyShuffleEnvironmentOptions is currently `nio`. The `epoll` mode offers 
> better performance, less GC pressure, and more advanced features, but is 
> only available on Linux.
> Therefore it is better to change the default to `auto`, so that the 
> framework automatically chooses the proper mode based on the platform.
> We can further verify the performance effect via the micro benchmarks if 
> possible.
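The selection logic behind `auto` can be sketched as follows (a simplified 
illustration, not the actual NettyServer code; in practice the availability 
check would be netty's `Epoll.isAvailable()`):

```java
public class TransportSelectionSketch {
    // "auto" resolves to the native epoll transport where available
    // (i.e. on Linux with the native library present), otherwise falls
    // back to nio; explicit settings are respected as-is.
    static String resolveTransport(String configured, boolean epollAvailable) {
        if ("auto".equals(configured)) {
            return epollAvailable ? "epoll" : "nio";
        }
        return configured;
    }
}
```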





[jira] [Closed] (FLINK-15355) Nightly streaming file sink fails with unshaded hadoop

2020-01-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15355.

Resolution: Fixed

Merged in release-1.10: aa37c0cd89053e68e72a19d51715b3a31b74163c

> Nightly streaming file sink fails with unshaded hadoop
> --
>
> Key: FLINK-15355
> URL: https://issues.apache.org/jira/browse/FLINK-15355
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
>  at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
>  at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
>  at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664)
>  at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
>  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895)
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
> Caused by: java.lang.RuntimeException: 
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1751)
>  at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:94)
>  at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:63)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1628)
>  at StreamingFileSinkProgram.main(StreamingFileSinkProgram.java:77)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321)
>  ... 11 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1746)
>  ... 20 more
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>  at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:326)
>  at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>  at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:274)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
>  at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
>  at 
> java.util

[jira] [Updated] (FLINK-15021) Remove the setting of netty channel watermark in NettyServer

2020-01-06 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Summary: Remove the setting of netty channel watermark in NettyServer  
(was: Remove setting of netty channel watermark and logic of writability 
changed)

> Remove the setting of netty channel watermark in NettyServer
> 
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The high and low watermark setting in NettyServer was previously used for 
> network flow control and for limiting the maximum memory overhead caused by 
> copying data inside the netty stack. In detail, when the downstream side 
> processed slowly and eventually exhausted its available buffers, it would 
> temporarily switch off auto-read in the netty stack; the upstream side 
> would then reach the channel's high watermark and become unwritable.
> With credit-based flow control and flink network buffers reused inside the 
> netty stack, the watermark setting is no longer needed, so we can safely 
> remove it to clean up the code.





[jira] [Closed] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2020-01-06 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15021.

Resolution: Fixed

Merged in master: 12d095028d54f842c7cc0f8efd3bac476fc5d9f7

> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The high and low watermark setting in NettyServer was previously used for 
> network flow control and for limiting the maximum memory overhead caused by 
> copying data inside the netty stack. In detail, when the downstream side 
> processed slowly and eventually exhausted its available buffers, it would 
> temporarily switch off auto-read in the netty stack; the upstream side 
> would then reach the channel's high watermark and become unwritable.
> With credit-based flow control and flink network buffers reused inside the 
> netty stack, the watermark setting is no longer needed, so we can safely 
> remove it to clean up the code.





[jira] [Updated] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2020-01-06 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Description: 
The high and low watermark setting in NetterServer before was mainly used for 
network flow control and limiting the maximum memory overhead caused by copying 
data inside netty stack. In detail, when the downstream side processes slowly 
and exhausted the available buffers finally, it would temporarily close the 
auto read switch in netty stack. Then the upstream side would finally reach the 
high watermark of channel to become unwritable.

But based on credit-based flow control and reusing flink network buffer inside 
netty stack, the watermark setting is not invalid now. So we can safely remove 
it to cleanup the codes.

  was:After removing the non-credit-based flow control code, the 
channel-writability-changed logic in PartitionRequestQueue and the setting of 
the channel watermark are both obsolete. Therefore we can remove them 
completely to simplify the code.
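For illustration, the watermark mechanism being removed behaves roughly like 
this simplified model (a hypothetical sketch, not the netty implementation):

```java
public class WatermarkSketch {
    final long low;
    final long high;
    long pendingBytes;
    boolean writable = true;

    WatermarkSketch(long low, long high) {
        this.low = low;
        this.high = high;
    }

    // Buffered output grows; crossing the high watermark flips the
    // channel to unwritable, which upstream observes as back pressure.
    void enqueue(long bytes) {
        pendingBytes += bytes;
        if (pendingBytes > high) {
            writable = false;
        }
    }

    // Flushing shrinks the backlog; only dropping below the low watermark
    // makes the channel writable again (hysteresis avoids flapping).
    void flushed(long bytes) {
        pendingBytes -= bytes;
        if (pendingBytes < low) {
            writable = true;
        }
    }
}
```

With credit-based flow control the receiver grants credit up front and 
flink's own network buffers are reused inside netty, so the backlog this 
mechanism guarded against no longer builds up.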


> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The high and low watermark setting in NettyServer was previously used for 
> network flow control and for limiting the maximum memory overhead caused by 
> copying data inside the netty stack. In detail, when the downstream side 
> processed slowly and eventually exhausted its available buffers, it would 
> temporarily switch off auto-read in the netty stack; the upstream side 
> would then reach the channel's high watermark and become unwritable.
> With credit-based flow control and flink network buffers reused inside the 
> netty stack, the watermark setting is no longer needed, so we can safely 
> remove it to clean up the code.





[jira] [Updated] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2020-01-06 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Fix Version/s: 1.11.0

> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The high and low watermark setting in NettyServer was previously used for 
> network flow control and for limiting the maximum memory overhead caused by 
> copying data inside the netty stack. In detail, when the downstream side 
> processed slowly and eventually exhausted its available buffers, it would 
> temporarily switch off auto-read in the netty stack; the upstream side 
> would then reach the channel's high watermark and become unwritable.
> With credit-based flow control and flink network buffers reused inside the 
> netty stack, the watermark setting is no longer needed, so we can safely 
> remove it to clean up the code.





[jira] [Commented] (FLINK-10462) Remove ConnectionIndex for further sharing tcp connection in credit-based mode

2020-01-06 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008630#comment-17008630
 ] 

zhijiang commented on FLINK-10462:
--

[~kevin.cyj] I think there are two requirements in your description.
 * One is how many connections are established between two TaskManagers. 
Currently this is determined by the ConnectionIndex generated per 
IntermediateResult. If we want the number of connections to follow the number 
of netty threads, by default or via a configurable setting, we can also drop 
the concept of ConnectionIndex from the topology completely, as this ticket 
proposes. That would be easier to understand and should not bring unexpected 
regressions.
 * The other is when to release the connection between two TaskManagers. 
Let's discuss that in your ticket FLINK-15455. There are two sides to this 
proposal. In yarn per-job mode, a previously cancelled task may be 
resubmitted to the original TaskManager, which can then reuse the previous 
connection and benefit from it. But in session mode, if one job finishes and 
another job is submitted, the pairs of connected TaskManagers may differ 
between jobs. E.g. jobA runs on TM1 and TM2, and jobB on TM2 and TM3. In that 
case, retaining the connections between TM1 and TM2 after jobA finishes would 
waste resources to some extent. 

> Remove ConnectionIndex for further sharing tcp connection in credit-based 
> mode 
> ---
>
> Key: FLINK-10462
> URL: https://issues.apache.org/jira/browse/FLINK-10462
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.5.3, 1.5.4, 1.6.0, 1.6.1
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>
> Every {{IntermediateResult}} generates a random {{ConnectionIndex}} which 
> will be included in the {{ConnectionID}}.
> The {{RemoteInputChannel}} requests to establish a tcp connection via the 
> {{ConnectionID}}. That means one tcp connection may be shared by multiple 
> {{RemoteInputChannel}}s which have the same {{ConnectionID}}. This way, we 
> can reduce the number of physical connections between two {{TaskManager}}s, 
> which brings benefits for large-scale jobs. 
> But this sharing is limited to the same {{IntermediateResult}}, and I 
> think that is mainly because we may temporarily switch off {{autoread}} for 
> the channel during back pressure in the previous network flow control. In 
> credit-based mode, the channel is always open for transporting different 
> intermediate data, so we can further share the tcp connection across 
> different {{IntermediateResults}} and remove the limit. 





[jira] [Commented] (FLINK-14163) Execution#producedPartitions is possibly not assigned when used

2020-01-05 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008582#comment-17008582
 ] 

zhijiang commented on FLINK-14163:
--

In addition, whichever way we take to solve this issue, I think we can make it 
ready in release-1.11; it is not a blocker for release-1.10.

> Execution#producedPartitions is possibly not assigned when used
> ---
>
> Key: FLINK-14163
> URL: https://issues.apache.org/jira/browse/FLINK-14163
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zhu Zhu
>Priority: Major
> Fix For: 1.10.0
>
>
> Currently {{Execution#producedPartitions}} is assigned after the partitions 
> have completed their registration with the shuffle master in 
> {{Execution#registerProducedPartitions(...)}}.
> The partition registration is an async interface 
> ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so 
> {{Execution#producedPartitions}} is possibly[1] not yet set when used. 
> Usages include:
> 1. deploying this task, so that the task may be deployed without its result 
> partitions assigned, and the job would hang (DefaultScheduler issue only, 
> since the legacy scheduler handled this case)
> 2. generating input descriptors for downstream tasks 
> 3. retrieving the {{ResultPartitionID}} for partition releasing 
> [1] If a user uses Flink's default shuffle master {{NettyShuffleMaster}}, it 
> is not problematic at the moment, since it returns a completed future on 
> registration, making it effectively a synchronous process. However, if users 
> implement their own shuffle service in which 
> {{ShuffleMaster#registerPartitionWithProducer}} returns a pending future, it 
> can be a problem. This is possible since customizable shuffle services have 
> been open to users since 1.9 (via the config "shuffle-service-factory.class").
> To avoid such issues, we may either 
> 1. fix all the usages of {{Execution#producedPartitions}} regarding the async 
> assigning, or 
> 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync 
> interface





[jira] [Commented] (FLINK-14163) Execution#producedPartitions is possibly not assigned when used

2020-01-05 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008580#comment-17008580
 ] 

zhijiang commented on FLINK-14163:
--

Thanks for reporting this potential issue [~zhuzh]!

After double-checking the related code, this issue indeed exists only for the 
DefaultScheduler. When ShuffleMaster#registerPartitionWithProducer was first 
introduced, we already considered the async behavior in the (now legacy) 
scheduler process.

The three usages mentioned above are mainly caused by the deployment process 
in the new DefaultScheduler not waiting for the partition registration future 
to complete. If the new scheduler also takes the async path into account 
during deployment, as the legacy scheduler did, I think we can resolve all the 
existing concerns. 

I also feel that the current public method Execution#getPartitionIds might 
bring risks in practice, because the returned collection may be empty if the 
registration future has not completed yet, and the caller is not aware of 
this. 

From the shuffle aspect, providing an async registerPartitionWithProducer is 
indeed meaningful in the long term, since it is flexible enough to satisfy 
different scenarios. But judging from the existing implementation and likely 
future implementations such as a yarn shuffle service, I guess a sync 
interface would also satisfy the requirements. So if the async way brings more 
trouble for the scheduler and is hard to accommodate in other components, it 
also makes sense to me to change registerPartitionWithProducer to a sync 
interface. That keeps things simple. 

Are there any thoughts or inputs [~azagrebin]?
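To make the async pitfall concrete, here is a hedged sketch (hypothetical names, not Flink's actual Execution/ShuffleMaster classes) of why a pending registration future leaves producedPartitions empty at deployment time, and how chaining the assignment on the future closes the gap:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch only: if the shuffle master's registration future is
// still pending, producedPartitions is empty when deployment reads it.
// Chaining the assignment (and any deployment step) on the future avoids
// the race. Names are assumptions for illustration.
class AsyncRegistrationSketch {
    final List<String> producedPartitions = new ArrayList<>();

    // The partition id is recorded only once the registration future completes.
    CompletableFuture<Void> registerProducedPartitions(CompletableFuture<String> registration) {
        return registration.thenAccept(producedPartitions::add);
    }
}
```

Reading `producedPartitions` before completing the future models the bug; waiting on the returned future before deploying models the fix (or, as discussed, making the interface synchronous removes the gap entirely).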

> Execution#producedPartitions is possibly not assigned when used
> ---
>
> Key: FLINK-14163
> URL: https://issues.apache.org/jira/browse/FLINK-14163
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zhu Zhu
>Priority: Major
> Fix For: 1.10.0
>
>
> Currently {{Execution#producedPartitions}} is assigned after the partitions 
> have completed their registration with the shuffle master in 
> {{Execution#registerProducedPartitions(...)}}.
> The partition registration is an async interface 
> ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so 
> {{Execution#producedPartitions}} is possibly[1] not yet set when used. 
> Usages include:
> 1. deploying this task, so that the task may be deployed without its result 
> partitions assigned, and the job would hang (DefaultScheduler issue only, 
> since the legacy scheduler handled this case)
> 2. generating input descriptors for downstream tasks 
> 3. retrieving the {{ResultPartitionID}} for partition releasing 
> [1] If a user uses Flink's default shuffle master {{NettyShuffleMaster}}, it 
> is not problematic at the moment, since it returns a completed future on 
> registration, making it effectively a synchronous process. However, if users 
> implement their own shuffle service in which 
> {{ShuffleMaster#registerPartitionWithProducer}} returns a pending future, it 
> can be a problem. This is possible since customizable shuffle services have 
> been open to users since 1.9 (via the config "shuffle-service-factory.class").
> To avoid such issues, we may either 
> 1. fix all the usages of {{Execution#producedPartitions}} regarding the async 
> assigning, or 
> 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync 
> interface





[jira] [Updated] (FLINK-15444) Make the component AbstractInvokable in CheckpointBarrierHandler NonNull

2019-12-30 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15444:
-
Fix Version/s: 1.11.0

> Make the component AbstractInvokable in CheckpointBarrierHandler NonNull 
> -
>
> Key: FLINK-15444
> URL: https://issues.apache.org/jira/browse/FLINK-15444
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Checkpointing
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
> Fix For: 1.11.0
>
>
> The current component {{AbstractInvokable}} in {{CheckpointBarrierHandler}} 
> is annotated as {{@Nullable}}. In the real code path it is passed via 
> the constructor and is never null; the nullable annotation exists only for 
> unit-test purposes. This misleads real usage in practice and brings extra 
> trouble, because you always have to check whether it is null before using it 
> in the related processes.
> We can refactor the related unit tests to implement a dummy 
> {{AbstractInvokable}} for tests and remove the {{@Nullable}} annotation from 
> the related class constructors.





[jira] [Created] (FLINK-15444) Make the component AbstractInvokable in CheckpointBarrierHandler NonNull

2019-12-30 Thread zhijiang (Jira)
zhijiang created FLINK-15444:


 Summary: Make the component AbstractInvokable in 
CheckpointBarrierHandler NonNull 
 Key: FLINK-15444
 URL: https://issues.apache.org/jira/browse/FLINK-15444
 Project: Flink
  Issue Type: Task
  Components: Runtime / Checkpointing
Reporter: zhijiang
Assignee: zhijiang


The current component {{AbstractInvokable}} in {{CheckpointBarrierHandler}} is 
annotated as {{@Nullable}}. In the real code path it is passed via the 
constructor and is never null; the nullable annotation exists only for 
unit-test purposes. This misleads real usage in practice and brings extra 
trouble, because you always have to check whether it is null before using it 
in the related processes.

We can refactor the related unit tests to implement a dummy 
{{AbstractInvokable}} for tests and remove the {{@Nullable}} annotation from 
the related class constructors.
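A hedged sketch of this refactoring (names are hypothetical stand-ins, not the real AbstractInvokable/CheckpointBarrierHandler): the invokable becomes a non-null constructor argument, and unit tests pass a dummy implementation instead of null:

```java
import java.util.Objects;

// Illustrative sketch only. The invokable is enforced non-null at
// construction, so the processing path needs no null checks.
abstract class InvokableSketch {
    abstract void triggerCheckpoint(long checkpointId);
}

class BarrierHandlerSketch {
    private final InvokableSketch invokable;   // never null by construction

    BarrierHandlerSketch(InvokableSketch invokable) {
        this.invokable = Objects.requireNonNull(invokable);
    }

    void onBarrier(long checkpointId) {
        // no null check needed on the hot path anymore
        invokable.triggerCheckpoint(checkpointId);
    }
}

// What unit tests would construct instead of passing null.
class DummyInvokable extends InvokableSketch {
    long lastCheckpointId = -1;

    @Override
    void triggerCheckpoint(long checkpointId) {
        lastCheckpointId = checkpointId;
    }
}
```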





[jira] [Assigned] (FLINK-15442) Harden the Avro Confluent Schema Registry nightly end-to-end test

2019-12-30 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15442:


Assignee: Yangze Guo

> Harden the Avro Confluent Schema Registry nightly end-to-end test
> -
>
> Key: FLINK-15442
> URL: https://issues.apache.org/jira/browse/FLINK-15442
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Critical
> Fix For: 1.10.0
>
>
> We have already hardened the Avro Confluent Schema Registry test in 
> [FLINK-13567|https://issues.apache.org/jira/browse/FLINK-13567]. However, 
> there are still some defects in the current mechanism.
> * The loop variable _i_ is not safe; it could be modified by the *command*.
> * The process of downloading kafka 0.10 is not included in the scope of 
> retry_times. We need to include it to tolerate transient network 
> issues.
> We need to fix these issues to harden the Avro Confluent Schema Registry 
> nightly end-to-end test.
> cc: [~trohrmann] [~chesnay]
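A minimal sketch of the loop-variable hazard and one way to contain it (the function names are illustrative, not the actual test-scripts code): when the retried command is a shell function or sourced script that also uses `i`, it clobbers the caller's counter; running the command in a subshell keeps the mutation from escaping.

```shell
# `clobber` stands in for a retried command that also writes `i`.
clobber() { i=100; }

unsafe_demo() {
    for (( i = 0; i < 3; i++ )); do
        clobber        # overwrites the loop counter: loop exits after one pass
    done
    echo "$i"          # 101, not 3
}

safe_demo() {
    local i
    for (( i = 0; i < 3; i++ )); do
        ( clobber )    # subshell: the command's write to `i` stays inside
    done
    echo "$i"          # 3, as expected
}
```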





[jira] [Assigned] (FLINK-15437) Start session with property of "-Dtaskmanager.memory.process.size" not work

2019-12-30 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15437:


Assignee: Xintong Song

> Start session with property of "-Dtaskmanager.memory.process.size" not work
> ---
>
> Key: FLINK-15437
> URL: https://issues.apache.org/jira/browse/FLINK-15437
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.10.0
>Reporter: xiaojin.wy
>Assignee: Xintong Song
>Priority: Critical
> Fix For: 1.10.0
>
>
> *The environment:*
> The yarn session command is shown below, and flink-conf.yaml does not 
> contain the property "taskmanager.memory.process.size":
> export HADOOP_CLASSPATH=`hadoop classpath`;export 
> HADOOP_CONF_DIR=/dump/1/jenkins/workspace/Stream-Spark-3.4/env/hadoop_conf_dirs/blinktest2;
>  export BLINK_HOME=/dump/1/jenkins/workspace/test/blink_daily; 
> $BLINK_HOME/bin/yarn-session.sh -d -qu root.default -nm 'Session Cluster of 
> daily_regression_stream_spark_1.10' -jm 1024 -n 20 -s 10 
> -Dtaskmanager.memory.process.size=1024m
> *After executing the command above, an exception like this occurs:*
> 2019-12-30 17:54:57,992 INFO  org.apache.hadoop.yarn.client.RMProxy   
>   - Connecting to ResourceManager at 
> z05c07224.sqa.zth.tbsite.net/11.163.188.36:8050
> 2019-12-30 17:54:58,182 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - Error while running the Flink session.
> org.apache.flink.configuration.IllegalConfigurationException: Either Task 
> Heap Memory size (taskmanager.memory.task.heap.size) and Managed Memory size 
> (taskmanager.memory.managed.size), or Total Flink Memory size 
> (taskmanager.memory.flink.size), or Total Process Memory size 
> (taskmanager.memory.process.size) need to be configured explicitly.
>   at 
> org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:145)
>   at 
> org.apache.flink.client.deployment.AbstractClusterClientFactory.getClusterSpecification(AbstractClusterClientFactory.java:44)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:557)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$5(FlinkYarnSessionCli.java:803)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:803)
> 
>  The program finished with the following exception:
> org.apache.flink.configuration.IllegalConfigurationException: Either Task 
> Heap Memory size (taskmanager.memory.task.heap.size) and Managed Memory size 
> (taskmanager.memory.managed.size), or Total Flink Memory size 
> (taskmanager.memory.flink.size), or Total Process Memory size 
> (taskmanager.memory.process.size) need to be configured explicitly.
>   at 
> org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:145)
>   at 
> org.apache.flink.client.deployment.AbstractClusterClientFactory.getClusterSpecification(AbstractClusterClientFactory.java:44)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:557)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$5(FlinkYarnSessionCli.java:803)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:803)
> *The flink-conf.yaml is :*
> jobmanager.rpc.address: localhost
> jobmanager.rpc.port: 6123
> jobmanager.heap.size: 1024m
> taskmanager.memory.total-process.size: 1024m
> taskmanager.numberOfTaskSlots: 1
> parallelism.default: 1
> jobmanager.execution.failover-strategy: region





[jira] [Assigned] (FLINK-15428) Avro Confluent Schema Registry nightly end-to-end test fails on travis

2019-12-29 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15428:


Assignee: Yangze Guo

> Avro Confluent Schema Registry nightly end-to-end test fails on travis
> --
>
> Key: FLINK-15428
> URL: https://issues.apache.org/jira/browse/FLINK-15428
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.10.0
>Reporter: Yu Li
>Assignee: Yangze Guo
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> Avro Confluent Schema Registry nightly end-to-end test fails with below error:
> {code}
> Could not start confluent schema registry
> /home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/kafka-common.sh:
>  line 78: ./bin/kafka-server-stop: No such file or directory
> No zookeeper server to stop
> Tried to kill 1549 but never saw it die
> [FAIL] Test script contains errors.
> {code}
> https://api.travis-ci.org/v3/job/629699437/log.txt





[jira] [Assigned] (FLINK-15387) Expose missing RocksDB properties out via RocksDBNativeMetricOptions

2019-12-25 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15387:


Assignee: Yun Tang

> Expose missing RocksDB properties out via RocksDBNativeMetricOptions
> 
>
> Key: FLINK-15387
> URL: https://issues.apache.org/jira/browse/FLINK-15387
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When we implement FLINK-15368, we need to expose block-cache-related metrics 
> of RocksDB by adding more available options to the current 
> RocksDBNativeMetricOptions.





[jira] [Assigned] (FLINK-15370) Configured write buffer manager actually not take effect in RocksDB's DBOptions

2019-12-23 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15370:


Assignee: Yu Li

> Configured write buffer manager actually not take effect in RocksDB's 
> DBOptions
> ---
>
> Key: FLINK-15370
> URL: https://issues.apache.org/jira/browse/FLINK-15370
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Yun Tang
>Assignee: Yu Li
>Priority: Blocker
> Fix For: 1.10.0, 1.11.0
>
>
> Currently, we call {{DBOptions#setWriteBufferManager}} after we extract the 
> {{DBOptions}} from {{RocksDBResourceContainer}}; however, we extract a 
> new {{DBOptions}} when creating the RocksDB instance. In other words, the 
> configured write buffer manager does not take effect in the {{DBOptions}} 
> that is finally used in the target RocksDB instance.
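The bug pattern can be sketched generically (a hypothetical container, not RocksDB's actual API): mutating one extracted options object does not affect the fresh object the container hands out later, so shared settings must be applied inside every extraction:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: `newOptions()` hands out a fresh object on every
// call, so setting the write buffer manager on a previously extracted
// instance is lost. The fix is to keep the shared setting in the container
// and apply it on every extraction.
class OptionsContainer {
    private Object writeBufferManager;

    void setWriteBufferManager(Object manager) {
        this.writeBufferManager = manager;
    }

    Map<String, Object> newOptions() {
        Map<String, Object> opts = new HashMap<>();   // fresh instance each call
        if (writeBufferManager != null) {
            // applied here, so the instance finally used to open the DB
            // also carries the configured write buffer manager
            opts.put("write_buffer_manager", writeBufferManager);
        }
        return opts;
    }
}
```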





[jira] [Assigned] (FLINK-15368) Add end-to-end test for controlling RocksDB memory usage

2019-12-23 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15368:


Assignee: Yun Tang

> Add end-to-end test for controlling RocksDB memory usage
> 
>
> Key: FLINK-15368
> URL: https://issues.apache.org/jira/browse/FLINK-15368
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Affects Versions: 1.10.0
>Reporter: Yu Li
>Assignee: Yun Tang
>Priority: Critical
> Fix For: 1.10.0
>
>
> We need to add an end-to-end test to make sure the RocksDB memory usage 
> control works well, especially under the slot sharing case.





[jira] [Closed] (FLINK-15360) Yarn e2e test is broken with building docker image

2019-12-23 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15360.

Resolution: Fixed

Merged in master: e30bcfd9c8cbe56c1072fe9895f1e6d03389c31e

Merged in release-1.10: 5ebab4fa2e51791fc04e04e3ab6fbbfc9f243fce

> Yarn e2e test is broken with building docker image
> --
>
> Key: FLINK-15360
> URL: https://issues.apache.org/jira/browse/FLINK-15360
> Project: Flink
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Yangze Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The Yarn e2e test is broken when building the docker image. This is caused 
> by this change: 
> [https://github.com/apache/flink/commit/cce1cef50d993aba5060ea5ac597174525ae895e].
>  
> The shell function \{{retry_times}} does not support passing a command as 
> multiple arguments. For example, \{{retry_times 5 0 docker build image}} does 
> not work.
>  
> cc [~karmagyz]
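A hedged sketch of a retry helper that accepts the command as separate arguments via `"$@"`, so `retry_times 5 0 docker build image` works as one command with its arguments intact (names and timings are assumptions, not the real test-scripts helper):

```shell
# Retry a command up to $1 times with $2 seconds between attempts.
retry_times() {
    local retries=$1 backoff=$2
    shift 2
    local attempt   # local so the retried command cannot clobber it
    for (( attempt = 1; attempt <= retries; attempt++ )); do
        "$@" && return 0        # run the command with its arguments intact
        sleep "$backoff"
    done
    return 1
}
```

Because the command is expanded as `"$@"`, multi-word invocations keep their word boundaries, unlike passing the command as a single quoted string.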





[jira] [Updated] (FLINK-15306) Adjust the default netty transport option from nio to auto

2019-12-23 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15306:
-
Component/s: Runtime / Configuration

> Adjust the default netty transport option from nio to auto
> --
>
> Key: FLINK-15306
> URL: https://issues.apache.org/jira/browse/FLINK-15306
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration, Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The default option of `taskmanager.network.netty.transport` in 
> NettyShuffleEnvironmentOptions is currently `nio`. As we know, the `epoll` 
> mode delivers better performance, produces less GC pressure, and offers more 
> advanced features, but it is only available on Linux.
> Therefore it is better to change the default option to `auto`, so that 
> the framework automatically chooses the proper mode based on the 
> platform.
> We would further verify the performance effect via the micro benchmarks if 
> possible.
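The `auto` behavior can be sketched as follows; the selector class is an assumption for illustration (Flink's real code delegates to netty's native-transport availability check rather than the OS name):

```java
// Illustrative sketch: with AUTO, prefer the native epoll transport on Linux
// (where it is available) and fall back to NIO elsewhere. An explicit
// nio/epoll setting is kept as-is.
class TransportSelector {
    enum Transport { NIO, EPOLL, AUTO }

    static Transport resolve(Transport configured, String osName) {
        if (configured != Transport.AUTO) {
            return configured;   // user asked for a specific transport
        }
        return osName.toLowerCase().contains("linux")
                ? Transport.EPOLL
                : Transport.NIO;
    }
}
```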





[jira] [Assigned] (FLINK-15355) Nightly streaming file sink fails with unshaded hadoop

2019-12-23 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15355:


Assignee: PengFei Li

> Nightly streaming file sink fails with unshaded hadoop
> --
>
> Key: FLINK-15355
> URL: https://issues.apache.org/jira/browse/FLINK-15355
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Arvid Heise
>Assignee: PengFei Li
>Priority: Blocker
> Fix For: 1.10.0
>
>
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
>  at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
>  at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
>  at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664)
>  at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
>  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895)
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
> Caused by: java.lang.RuntimeException: 
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1751)
>  at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:94)
>  at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:63)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1628)
>  at StreamingFileSinkProgram.main(StreamingFileSinkProgram.java:77)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321)
>  ... 11 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1746)
>  ... 20 more
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>  at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:326)
>  at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>  at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:274)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
>  at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
>  at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1

[jira] [Closed] (FLINK-15340) Remove the executor of pipelined compression benchmark

2019-12-20 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15340.

Resolution: Fixed

Merged in benchmark repo: a92f9d5d492b97fe0e601567bfe0c021be819306

> Remove the executor of pipelined compression benchmark
> --
>
> Key: FLINK-15340
> URL: https://issues.apache.org/jira/browse/FLINK-15340
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>
> In [FLINK-15308|https://issues.apache.org/jira/browse/FLINK-15308], we 
> removed compression for the pipelined case. Accordingly, we also 
> need to remove the corresponding benchmark executor.





[jira] [Created] (FLINK-15340) Remove the executor of pipelined compression benchmark

2019-12-19 Thread zhijiang (Jira)
zhijiang created FLINK-15340:


 Summary: Remove the executor of pipelined compression benchmark
 Key: FLINK-15340
 URL: https://issues.apache.org/jira/browse/FLINK-15340
 Project: Flink
  Issue Type: Task
  Components: Benchmarks
Reporter: zhijiang
Assignee: zhijiang


In [FLINK-15308|https://issues.apache.org/jira/browse/FLINK-15308], we removed 
compression for the pipelined case. Accordingly, we also need to remove the 
corresponding benchmark executor.





[jira] [Assigned] (FLINK-14843) Streaming bucketing end-to-end test can fail with Output hash mismatch

2019-12-19 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-14843:


Assignee: PengFei Li

> Streaming bucketing end-to-end test can fail with Output hash mismatch
> --
>
> Key: FLINK-14843
> URL: https://issues.apache.org/jira/browse/FLINK-14843
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem, Tests
>Affects Versions: 1.10.0
> Environment: rev: dcc1330375826b779e4902176bb2473704dabb11
>Reporter: Gary Yao
>Assignee: PengFei Li
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
> Attachments: complete_result, 
> flink-gary-standalonesession-0-gyao-desktop.log, 
> flink-gary-taskexecutor-0-gyao-desktop.log, 
> flink-gary-taskexecutor-1-gyao-desktop.log, 
> flink-gary-taskexecutor-2-gyao-desktop.log, 
> flink-gary-taskexecutor-3-gyao-desktop.log, 
> flink-gary-taskexecutor-4-gyao-desktop.log, 
> flink-gary-taskexecutor-5-gyao-desktop.log, 
> flink-gary-taskexecutor-6-gyao-desktop.log
>
>
> *Description*
> Streaming bucketing end-to-end test ({{test_streaming_bucketing.sh}}) can 
> fail with Output hash mismatch.
> {noformat}
> Number of running task managers has reached 4.
> Job (e0b7a86e4d4111f3947baa3d004e083a) is running.
> Waiting until all values have been produced
> Truncating buckets
> Number of produced values 26930/6
> Truncating buckets
> Number of produced values 30890/6
> Truncating buckets
> Number of produced values 37340/6
> Truncating buckets
> Number of produced values 41290/6
> Truncating buckets
> Number of produced values 46710/6
> Truncating buckets
> Number of produced values 52120/6
> Truncating buckets
> Number of produced values 57110/6
> Truncating buckets
> Number of produced values 62530/6
> Cancelling job e0b7a86e4d4111f3947baa3d004e083a.
> Cancelled job e0b7a86e4d4111f3947baa3d004e083a.
> Waiting for job (e0b7a86e4d4111f3947baa3d004e083a) to reach terminal state 
> CANCELED ...
> Job (e0b7a86e4d4111f3947baa3d004e083a) reached terminal state CANCELED
> Job e0b7a86e4d4111f3947baa3d004e083a was cancelled, time to verify
> FAIL Bucketing Sink: Output hash mismatch.  Got 
> 9e00429abfb30eea4f459eb812b470ad, expected 01aba5ff77a0ef5e5cf6a727c248bdc3.
> head hexdump of actual:
> 000   (   2   ,   1   0   ,   0   ,   S   o   m   e   p   a   y
> 010   l   o   a   d   .   .   .   )  \n   (   2   ,   1   0   ,   1
> 020   ,   S   o   m   e   p   a   y   l   o   a   d   .   .   .
> 030   )  \n   (   2   ,   1   0   ,   2   ,   S   o   m   e   p
> 040   a   y   l   o   a   d   .   .   .   )  \n   (   2   ,   1   0
> 050   ,   3   ,   S   o   m   e   p   a   y   l   o   a   d   .
> 060   .   .   )  \n   (   2   ,   1   0   ,   4   ,   S   o   m   e
> 070   p   a   y   l   o   a   d   .   .   .   )  \n   (   2   ,
> 080   1   0   ,   5   ,   S   o   m   e   p   a   y   l   o   a
> 090   d   .   .   .   )  \n   (   2   ,   1   0   ,   6   ,   S   o
> 0a0   m   e   p   a   y   l   o   a   d   .   .   .   )  \n   (
> 0b0   2   ,   1   0   ,   7   ,   S   o   m   e   p   a   y   l
> 0c0   o   a   d   .   .   .   )  \n   (   2   ,   1   0   ,   8   ,
> 0d0   S   o   m   e   p   a   y   l   o   a   d   .   .   .   )
> 0e0  \n   (   2   ,   1   0   ,   9   ,   S   o   m   e   p   a
> 0f0   y   l   o   a   d   .   .   .   )  \n
> 0fa
> Stopping taskexecutor daemon (pid: 55164) on host gyao-desktop.
> Stopping standalonesession daemon (pid: 51073) on host gyao-desktop.
> Stopping taskexecutor daemon (pid: 51504) on host gyao-desktop.
> Skipping taskexecutor daemon (pid: 52034), because it is not running anymore 
> on gyao-desktop.
> Skipping taskexecutor daemon (pid: 52472), because it is not running anymore 
> on gyao-desktop.
> Skipping taskexecutor daemon (pid: 52916), because it is not running anymore 
> on gyao-desktop.
> Stopping taskexecutor daemon (pid: 54121) on host gyao-desktop.
> Stopping taskexecutor daemon (pid: 54726) on host gyao-desktop.
> [FAIL] Test script contains errors.
> Checking of logs skipped.
> [FAIL] 'flink-end-to-end-tests/test-scripts/test_streaming_bucketing.sh' 
> failed after 2 minutes and 3 seconds! Test exited with exit code 1
> {noformat}
> *How to reproduce*
> Comment out the delay of 10s after the 1st TM is restarted to provoke the 
> issue:
> {code:bash}
> echo "Restarting 1 TM"
> $FLINK_DIR/bin/taskmanager.sh start
> wait_for_number_of_running_tms 4
> #sleep 10
> echo "Killing 2 TMs"
> kill_random_taskmanager
> kill_random_taskmanager
> wait_for_number_of_running_tms 2
> {code}
> Command to run the test:
> {noformat}
> FLINK_DIR

[jira] [Commented] (FLINK-15311) Lz4BlockCompressionFactory should use native compressor instead of java unsafe

2019-12-19 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999889#comment-16999889
 ] 

zhijiang commented on FLINK-15311:
--

After discussing offline, we concluded that this improvement is critical for 
our original motivation to bring in compression. We will keep it as a blocker 
for release-1.10, but change the issue type to improvement instead.
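For illustration of the requirement discussed here: whichever compressor backend is selected (native JNI vs. pure-Java unsafe), per-buffer compression must round-trip byte-for-byte; only speed differs. A stdlib-only sketch using Python's zlib as a stand-in for LZ4 (hypothetical helpers, not Flink code):

```python
import zlib

def compress_buffer(data: bytes, level: int = 1) -> bytes:
    # Level 1 favors speed over ratio, roughly the trade-off LZ4
    # targets; zlib here is only a stdlib stand-in for LZ4.
    return zlib.compress(data, level)

def decompress_buffer(data: bytes) -> bytes:
    return zlib.decompress(data)

# A shuffle buffer full of repetitive records compresses well and
# must round-trip exactly.
payload = b"(2,10,0,Some payload...)\n" * 500
packed = compress_buffer(payload)
assert decompress_buffer(packed) == payload
assert len(packed) < len(payload)
```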

> Lz4BlockCompressionFactory should use native compressor instead of java unsafe
> --
>
> Key: FLINK-15311
> URL: https://issues.apache.org/jira/browse/FLINK-15311
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: Jingsong Lee
>Assignee: Yingjie Cao
>Priority: Blocker
> Fix For: 1.10.0
>
>
> According to
> [https://lz4.github.io/lz4-java/1.7.0/lz4-compression-benchmark/],
> the Java unsafe compressor has lower performance than the native lz4 compressor.
> After FLINK-14845, we use lz4 compression for shuffle.
> In testing, I found the shuffle using the Java unsafe compressor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15311) Lz4BlockCompressionFactory should use native compressor instead of java unsafe

2019-12-19 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15311:
-
Issue Type: Improvement  (was: Bug)

> Lz4BlockCompressionFactory should use native compressor instead of java unsafe
> --
>
> Key: FLINK-15311
> URL: https://issues.apache.org/jira/browse/FLINK-15311
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Jingsong Lee
>Assignee: Yingjie Cao
>Priority: Blocker
> Fix For: 1.10.0
>
>
> According to
> [https://lz4.github.io/lz4-java/1.7.0/lz4-compression-benchmark/],
> the Java unsafe compressor has lower performance than the native lz4 compressor.
> After FLINK-14845, we use lz4 compression for shuffle.
> In testing, I found the shuffle using the Java unsafe compressor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15308) Job failed when enable pipelined-shuffle.compression and numberOfTaskSlots > 1

2019-12-18 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15308:
-
Fix Version/s: 1.10.0

> Job failed when enable pipelined-shuffle.compression and numberOfTaskSlots > 1
> --
>
> Key: FLINK-15308
> URL: https://issues.apache.org/jira/browse/FLINK-15308
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0
> Environment: $ git log
> commit 4b54da2c67692b1c9d43e1184c00899b0151b3ae
> Author: bowen.li 
> Date: Tue Dec 17 17:37:03 2019 -0800
>Reporter: Feng Jiajie
>Assignee: Yingjie Cao
>Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: image-2019-12-19-10-55-30-644.png
>
>
> Job worked well with default flink-conf.yaml with 
> pipelined-shuffle.compression:
> {code:java}
> taskmanager.numberOfTaskSlots: 1
> taskmanager.network.pipelined-shuffle.compression.enabled: true
> {code}
> But when I set taskmanager.numberOfTaskSlots to 4 or 6:
> {code:java}
> taskmanager.numberOfTaskSlots: 6
> taskmanager.network.pipelined-shuffle.compression.enabled: true
> {code}
> job failed:
> {code:java}
> $ bin/flink run -m yarn-cluster -p 16 -yjm 1024m -ytm 12288m 
> ~/flink-example-1.0-SNAPSHOT.jar
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/lib/slf4j-log4j12-1.7.15.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/data/sa_cluster/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 2019-12-18 15:04:40,514 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - The configuration directory 
> ('/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/conf')
>  already contains a LOG4J config file.If you want to use logback, then please 
> delete or rename the log configuration file.
> 2019-12-18 15:04:40,514 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - The configuration directory 
> ('/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/conf')
>  already contains a LOG4J config file.If you want to use logback, then please 
> delete or rename the log configuration file.
> 2019-12-18 15:04:40,907 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - No path for the flink jar passed. Using the location of class 
> org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2019-12-18 15:04:41,084 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Cluster specification: 
> ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=12288, 
> numberTaskManagers=1, slotsPerTaskManager=6}
> 2019-12-18 15:04:42,344 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Submitting application master application_1576573857638_0026
> 2019-12-18 15:04:42,370 INFO  
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted 
> application application_1576573857638_0026
> 2019-12-18 15:04:42,371 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Waiting for the cluster to be allocated
> 2019-12-18 15:04:42,372 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Deploying cluster, current state ACCEPTED
> 2019-12-18 15:04:45,388 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - YARN application has been deployed successfully.
> 2019-12-18 15:04:45,390 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Found Web Interface debugboxcreate431x3.sa:36162 of 
> application 'application_1576573857638_0026'.
> Job has been submitted with JobID 9140c70769f4271cc22ea8becaa26272
> 
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: org.apache.flink.client.program.ProgramInvocationException: 
> Job failed (JobID: 9140c70769f4271cc22ea8becaa26272)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
>   at 
> org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664)
>   at org.apache.flink.client.cli.CliFront

[jira] [Commented] (FLINK-15311) Lz4BlockCompressionFactory should use native compressor instead of java unsafe

2019-12-18 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999797#comment-16999797
 ] 

zhijiang commented on FLINK-15311:
--

I guess this belongs to performance improvement rather than a bug, because it 
does not affect compression functionality or stability.

Should it be a blocker for the release?

> Lz4BlockCompressionFactory should use native compressor instead of java unsafe
> --
>
> Key: FLINK-15311
> URL: https://issues.apache.org/jira/browse/FLINK-15311
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: Jingsong Lee
>Priority: Critical
> Fix For: 1.10.0
>
>
> According to
> [https://lz4.github.io/lz4-java/1.7.0/lz4-compression-benchmark/],
> the Java unsafe compressor has lower performance than the native lz4 compressor.
> After FLINK-14845, we use lz4 compression for shuffle.
> In testing, I found the shuffle using the Java unsafe compressor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15311) Lz4BlockCompressionFactory should use native compressor instead of java unsafe

2019-12-18 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15311:


Assignee: Yingjie Cao

> Lz4BlockCompressionFactory should use native compressor instead of java unsafe
> --
>
> Key: FLINK-15311
> URL: https://issues.apache.org/jira/browse/FLINK-15311
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: Jingsong Lee
>Assignee: Yingjie Cao
>Priority: Critical
> Fix For: 1.10.0
>
>
> According to
> [https://lz4.github.io/lz4-java/1.7.0/lz4-compression-benchmark/],
> the Java unsafe compressor has lower performance than the native lz4 compressor.
> After FLINK-14845, we use lz4 compression for shuffle.
> In testing, I found the shuffle using the Java unsafe compressor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15012) Checkpoint directory not cleaned up

2019-12-18 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15012:


Assignee: Yun Tang

> Checkpoint directory not cleaned up
> ---
>
> Key: FLINK-15012
> URL: https://issues.apache.org/jira/browse/FLINK-15012
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Assignee: Yun Tang
>Priority: Major
>
> I started a Flink cluster with 2 TMs using {{start-cluster.sh}} and the 
> following config (in addition to the default {{flink-conf.yaml}})
> {code:java}
> state.checkpoints.dir: file:///path/to/checkpoints/
> state.backend: rocksdb {code}
> After submitting a job with checkpoints enabled (every 5s), checkpoints show 
> up, e.g.
> {code:java}
> bb969f842bbc0ecc3b41b7fbe23b047b/
> ├── chk-2
> │   ├── 238969e1-6949-4b12-98e7-1411c186527c
> │   ├── 2702b226-9cfc-4327-979d-e5508ab2e3d5
> │   ├── 4c51cb24-6f71-4d20-9d4c-65ed6e826949
> │   ├── e706d574-c5b2-467a-8640-1885ca252e80
> │   └── _metadata
> ├── shared
> └── taskowned {code}
> If I shut down the cluster via {{stop-cluster.sh}}, these files will remain 
> on disk and not be cleaned up.
> In contrast, if I cancel the job, at least {{chk-2}} will be deleted, but 
> still leaving the (empty) directories.
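A cleanup along the lines requested here would prune the leftover empty directories (e.g. shared/ and taskowned/) bottom-up; a hedged stdlib sketch, illustrative only and not the actual Flink shutdown logic:

```python
import os

def prune_empty_dirs(root: str) -> list:
    """Remove empty subdirectories of `root`, walking bottom-up so
    that directories which become empty after their children are
    removed get pruned in the same pass. Returns removed paths."""
    removed = []
    for dirpath, _, _ in os.walk(root, topdown=False):
        # Re-check emptiness at removal time, since children may
        # have just been deleted.
        if dirpath != root and not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed
```

Applied to the layout above after job cancellation, this would remove the empty shared/ and taskowned/ directories while leaving any directory that still holds checkpoint files untouched.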



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15308) Job failed when enable pipelined-shuffle.compression and numberOfTaskSlots > 1

2019-12-18 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15308:


Assignee: Yingjie Cao

> Job failed when enable pipelined-shuffle.compression and numberOfTaskSlots > 1
> --
>
> Key: FLINK-15308
> URL: https://issues.apache.org/jira/browse/FLINK-15308
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0
> Environment: $ git log
> commit 4b54da2c67692b1c9d43e1184c00899b0151b3ae
> Author: bowen.li 
> Date: Tue Dec 17 17:37:03 2019 -0800
>Reporter: Feng Jiajie
>Assignee: Yingjie Cao
>Priority: Blocker
>
> Job worked well with default flink-conf.yaml with 
> pipelined-shuffle.compression:
> {code:java}
> taskmanager.numberOfTaskSlots: 1
> taskmanager.network.pipelined-shuffle.compression.enabled: true
> {code}
> But when I set taskmanager.numberOfTaskSlots to 4 or 6:
> {code:java}
> taskmanager.numberOfTaskSlots: 6
> taskmanager.network.pipelined-shuffle.compression.enabled: true
> {code}
> job failed:
> {code:java}
> $ bin/flink run -m yarn-cluster -p 16 -yjm 1024m -ytm 12288m 
> ~/flink-example-1.0-SNAPSHOT.jar
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/lib/slf4j-log4j12-1.7.15.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/data/sa_cluster/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 2019-12-18 15:04:40,514 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - The configuration directory 
> ('/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/conf')
>  already contains a LOG4J config file.If you want to use logback, then please 
> delete or rename the log configuration file.
> 2019-12-18 15:04:40,514 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - The configuration directory 
> ('/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/conf')
>  already contains a LOG4J config file.If you want to use logback, then please 
> delete or rename the log configuration file.
> 2019-12-18 15:04:40,907 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - No path for the flink jar passed. Using the location of class 
> org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2019-12-18 15:04:41,084 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Cluster specification: 
> ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=12288, 
> numberTaskManagers=1, slotsPerTaskManager=6}
> 2019-12-18 15:04:42,344 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Submitting application master application_1576573857638_0026
> 2019-12-18 15:04:42,370 INFO  
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted 
> application application_1576573857638_0026
> 2019-12-18 15:04:42,371 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Waiting for the cluster to be allocated
> 2019-12-18 15:04:42,372 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Deploying cluster, current state ACCEPTED
> 2019-12-18 15:04:45,388 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - YARN application has been deployed successfully.
> 2019-12-18 15:04:45,390 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Found Web Interface debugboxcreate431x3.sa:36162 of 
> application 'application_1576573857638_0026'.
> Job has been submitted with JobID 9140c70769f4271cc22ea8becaa26272
> 
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: org.apache.flink.client.program.ProgramInvocationException: 
> Job failed (JobID: 9140c70769f4271cc22ea8becaa26272)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
>   at 
> org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
>   at 
> org.apache.flink.client.cli.CliFrontend.

[jira] [Closed] (FLINK-1275) Add support to compress network I/O

2019-12-18 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-1275.
---
Resolution: Duplicate

This issue was already resolved by 
[FLINK-14845|https://issues.apache.org/jira/browse/FLINK-14845], so I am closing it.

> Add support to compress network I/O
> ---
>
> Key: FLINK-1275
> URL: https://issues.apache.org/jira/browse/FLINK-1275
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 0.8.0
>Reporter: Ufuk Celebi
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2019-12-17 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998822#comment-16998822
 ] 

zhijiang commented on FLINK-15010:
--

Thanks for the information; it is easy to reproduce this issue. I will assign 
it to Yun Gao to solve.

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Priority: Major
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2019-12-17 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15010:


Assignee: Yun Gao

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Assignee: Yun Gao
>Priority: Major
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15306) Adjust the default netty transport option from nio to auto

2019-12-17 Thread zhijiang (Jira)
zhijiang created FLINK-15306:


 Summary: Adjust the default netty transport option from nio to auto
 Key: FLINK-15306
 URL: https://issues.apache.org/jira/browse/FLINK-15306
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Reporter: zhijiang
Assignee: zhijiang
 Fix For: 1.11.0


The default value of `taskmanager.network.netty.transport` in 
NettyShuffleEnvironmentOptions is currently `nio`. As we know, the `epoll` mode 
can achieve better performance, produce less GC pressure, and offer more 
advanced features, but it is only available on Linux.

Therefore it is better to change the default to `auto`, so that the framework 
automatically chooses the proper mode based on the platform.

We would further verify the performance effect via micro benchmarks if possible.
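The `auto` behavior described above, picking the most efficient native I/O mechanism for the platform, is the same pattern Python's stdlib `selectors` module uses when resolving `DefaultSelector`; a minimal illustration of the pattern (not Flink or netty code):

```python
import selectors

def pick_event_mechanism() -> str:
    """Return the name of the most efficient I/O multiplexer on this
    platform, analogous to the `auto` netty transport option: epoll
    on Linux, kqueue on BSD/macOS, select() as a portable fallback.
    """
    sel = selectors.DefaultSelector()
    try:
        return type(sel).__name__
    finally:
        sel.close()

# On Linux this typically resolves to "EpollSelector"; the point is
# that the caller never hard-codes the mechanism.
```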



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2019-12-17 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Parent: FLINK-7282
Issue Type: Sub-task  (was: Task)

> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After removing the non-credit-based flow control code, the 
> channel-writability-changed logic in PartitionRequestQueue and the setting of 
> the channel watermark are both obsolete. Therefore we can remove them 
> completely to simplify the code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2019-12-17 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Summary: Remove setting of netty channel watermark and logic of writability 
changed  (was: Refactor to remove channelWritabilityChanged from 
PartitionRequestQueue)

> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After removing the non-credit-based flow control code, the related 
> channel-writability-changed logic in PartitionRequestQueue is obsolete and can 
> be removed completely. Therefore we can refactor the process to simplify the 
> code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2019-12-17 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Description: After removing the non credit-based flow control codes, the 
channel writability changed logic in PartitionRequestQueue along with the 
setting of channel watermark are both invalid. Therefore we can remove them 
completely to simplify the codes.  (was: After removing the non credit-based 
flow control codes, the related channel writability changed logics in 
PartitionRequestQueue are invalid and can be removed completely. Therefore we 
can refactor the process to simplify the codes.)

> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After removing the non-credit-based flow control code, the 
> channel-writability-changed logic in PartitionRequestQueue and the setting of 
> the channel watermark are both obsolete. Therefore we can remove them 
> completely to simplify the code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-13589) DelimitedInputFormat index error on multi-byte delimiters with whole file input splits

2019-12-16 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-13589.

Resolution: Fixed

Merged in master: 0bd083e5eeb5eb5adeddfbe3a9928860f3b4a6eb

Merged in release-1.9: db531e79807acba1ba28d9922bfed912fd78dd03

Merged in release-1.10: 1e716e4a43018caeb77beaa5d8f16cedfedbd887

> DelimitedInputFormat index error on multi-byte delimiters with whole file 
> input splits
> --
>
> Key: FLINK-13589
> URL: https://issues.apache.org/jira/browse/FLINK-13589
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem, Formats (JSON, Avro, Parquet, 
> ORC, SequenceFile)
>Affects Versions: 1.8.1
>Reporter: Adric Eckstein
>Assignee: Arvid Heise
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.9.2, 1.10.0
>
> Attachments: delimiter-bug.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The DelimitedInputFormat can drop bytes when using input splits that have a 
> length of -1 (for reading the whole file). It looks like this is a simple 
> bug in handling the delimiter on buffer boundaries, where the logic is 
> inconsistent for different split types.
> Attached is a possible patch with a fix and test.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2019-12-16 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997108#comment-16997108
 ] 

zhijiang edited comment on FLINK-15010 at 12/16/19 9:26 AM:


Hey [~NicoK], thanks for reporting this. I want to confirm: what is the mode of 
the Flink cluster, standalone or session? And how can I reproduce this issue?


was (Author: zjwang):
Hey [~NicoK], thanks for reporting this. I want to confirm: what is the mode of 
the Flink cluster, standalone or session? I guess you did not start any jobs in 
the cluster? And how can I reproduce this issue?

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Priority: Major
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2019-12-16 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997108#comment-16997108
 ] 

zhijiang edited comment on FLINK-15010 at 12/16/19 9:19 AM:


Hey [~NicoK], thanks for reporting this. I want to confirm: what is the mode of 
the Flink cluster, standalone or session? I guess you did not start any jobs in 
the cluster? And how can I reproduce this issue?


was (Author: zjwang):
Hey [~NicoK], thanks for reporting this. I want to confirm: what is the mode of 
the Flink cluster, standalone or session? And how can I reproduce this issue?

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Priority: Major
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2019-12-16 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997108#comment-16997108
 ] 

zhijiang commented on FLINK-15010:
--

Hey [~NicoK], thanks for reporting this. I want to confirm: what is the mode of 
the Flink cluster, standalone or session? And how can I reproduce this issue?

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Priority: Major
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

2019-12-15 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15105:


Assignee: Congxian Qiu(klion26)

> Resuming Externalized Checkpoint after terminal failure (rocks, incremental) 
> end-to-end test stalls on travis
> -
>
> Key: FLINK-15105
> URL: https://issues.apache.org/jira/browse/FLINK-15105
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Yu Li
>Assignee: Congxian Qiu(klion26)
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> The "Resuming Externalized Checkpoint after terminal failure (rocks, 
> incremental)" end-to-end test stalls on the release-1.9 nightly build with 
> "The job exceeded the maximum log length, and has been terminated".
> https://api.travis-ci.org/v3/job/621090394/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-15069) Supplement the pipelined shuffle compression case for benchmark

2019-12-15 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993600#comment-16993600
 ] 

zhijiang edited comment on FLINK-15069 at 12/16/19 5:27 AM:


Fixed in master: 04ab225056714013fa3bf4dfb590edac7b577d03

Fixed in benchmarks repo: 0a2397a907a51608c276c39592c3b19c2455366b


was (Author: zjwang):
Fixed in master: 04ab225056714013fa3bf4dfb590edac7b577d03

> Supplement the pipelined shuffle compression case for benchmark
> ---
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While reviewing the 
> [PR|https://github.com/apache/flink/pull/10375#pullrequestreview-325193504] 
> introducing data compression for persistent storage and network shuffle, we 
> thought it better to also cover this scenario in the benchmarks, for tracing 
> performance issues in the future.
> This ticket supplements the compression case for pipelined partition 
> shuffle; the compression case for blocking partitions will be added in 
> [FLINK-15070|https://issues.apache.org/jira/browse/FLINK-15070].
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-15069) Supplement the pipelined shuffle compression case for benchmark

2019-12-15 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15069.

Resolution: Fixed

> Supplement the pipelined shuffle compression case for benchmark
> ---
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While reviewing the 
> [PR|https://github.com/apache/flink/pull/10375#pullrequestreview-325193504] 
> introducing data compression for persistent storage and network shuffle, we 
> thought it better to also cover this scenario in the benchmarks, for tracing 
> performance issues in the future.
> This ticket supplements the compression case for pipelined partition 
> shuffle; the compression case for blocking partitions will be added in 
> [FLINK-15070|https://issues.apache.org/jira/browse/FLINK-15070].
>  





[jira] [Updated] (FLINK-14952) Yarn containers can exceed physical memory limits when using BoundedBlockingSubpartition.

2019-12-12 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-14952:
-
Release Note: 
This changes the option key and default value for the type of 
BoundedBlockingSubpartition in batch jobs. 

The previous key `taskmanager.network.bounded-blocking-subpartition-type` has 
been renamed to `taskmanager.network.blocking-shuffle.type`. 

The respective default value was also changed from `auto` to `file`, to avoid 
yarn killing the task manager container when the memory usage of mmap exceeds 
some threshold.

  was:
This changes the option key and default value for the type of 
BoundedBlockingSubpartition in batch jobs. 

The previous key `taskmanager.network.bounded-blocking-subpartition-type` was 
changed to `taskmanager.network.blocking-shuffle.type` now. 

And the respective default option value was also changed from `auto` to `file` 
for avoiding yarn killing task manager container when memory usage of mmap 
exceeds some threshold.
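
The rename described in the release note amounts to the following change in 
`flink-conf.yaml` (a sketch; only the key names and defaults stated above are 
taken from the release note):

```yaml
# Flink <= 1.9: the old key, with the old default
# taskmanager.network.bounded-blocking-subpartition-type: auto

# Flink 1.10+: the renamed key, with the new and safer default
taskmanager.network.blocking-shuffle.type: file
```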


> Yarn containers can exceed physical memory limits when using 
> BoundedBlockingSubpartition.
> -
>
> Key: FLINK-14952
> URL: https://issues.apache.org/jira/browse/FLINK-14952
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Piotr Nowojski
>Assignee: zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> As [reported by a user on the user mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/CoGroup-SortMerger-performance-degradation-from-1-6-4-1-9-1-td31082.html],
>  combination of using {{BoundedBlockingSubpartition}} with yarn containers 
> can cause yarn container to exceed memory limits.
> {quote}2019-11-19 12:49:23,068 INFO org.apache.flink.yarn.YarnResourceManager 
> - Closing TaskExecutor connection container_e42_1574076744505_9444_01_04 
> because: Container 
> [pid=42774,containerID=container_e42_1574076744505_9444_01_04] is running 
> beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical 
> memory used; 13.9 GB of 25.2 GB virtual memory used. Killing container.
> {quote}
> This is probably happening because the memory usage of mmap is not capped and 
> not accounted for by the configured memory limits; however, yarn tracks this 
> memory usage, and once Flink exceeds some threshold the container is killed.
> The workaround is to overrule the default value and force Flink not to use 
> mmap, by setting a secret (🤫) config option:
> {noformat}
> taskmanager.network.bounded-blocking-subpartition-type: file
> {noformat}





[jira] [Closed] (FLINK-14952) Yarn containers can exceed physical memory limits when using BoundedBlockingSubpartition.

2019-12-12 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-14952.

Release Note: 
This changes the option key and default value for the type of 
BoundedBlockingSubpartition in batch jobs. 

The previous key `taskmanager.network.bounded-blocking-subpartition-type` has 
been renamed to `taskmanager.network.blocking-shuffle.type`. 

The respective default value was also changed from `auto` to `file`, to avoid 
yarn killing the task manager container when the memory usage of mmap exceeds 
some threshold.
  Resolution: Fixed

Fixed in master: 7600e8b9d4cb8fee928c9edc9d2483787dc10a3c

Fixed in release-1.10: b52efff51f6494c442e32181a5d6896feec4e990

> Yarn containers can exceed physical memory limits when using 
> BoundedBlockingSubpartition.
> -
>
> Key: FLINK-14952
> URL: https://issues.apache.org/jira/browse/FLINK-14952
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Piotr Nowojski
>Assignee: zhijiang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> As [reported by a user on the user mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/CoGroup-SortMerger-performance-degradation-from-1-6-4-1-9-1-td31082.html],
>  combination of using {{BoundedBlockingSubpartition}} with yarn containers 
> can cause yarn container to exceed memory limits.
> {quote}2019-11-19 12:49:23,068 INFO org.apache.flink.yarn.YarnResourceManager 
> - Closing TaskExecutor connection container_e42_1574076744505_9444_01_04 
> because: Container 
> [pid=42774,containerID=container_e42_1574076744505_9444_01_04] is running 
> beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical 
> memory used; 13.9 GB of 25.2 GB virtual memory used. Killing container.
> {quote}
> This is probably happening because the memory usage of mmap is not capped and 
> not accounted for by the configured memory limits; however, yarn tracks this 
> memory usage, and once Flink exceeds some threshold the container is killed.
> The workaround is to overrule the default value and force Flink not to use 
> mmap, by setting a secret (🤫) config option:
> {noformat}
> taskmanager.network.bounded-blocking-subpartition-type: file
> {noformat}





[jira] [Reopened] (FLINK-15069) Supplement the pipelined shuffle compression case for benchmark

2019-12-11 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reopened FLINK-15069:
--

> Supplement the pipelined shuffle compression case for benchmark
> ---
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While reviewing the 
> [PR|https://github.com/apache/flink/pull/10375#pullrequestreview-325193504] 
> introducing data compression for persistent storage and network shuffle, we 
> thought it better to also cover this scenario in the benchmark for tracing 
> performance issues in the future. 
> This ticket supplements the compression case for pipelined partition 
> shuffle; the compression case for blocking partition will be added in 
> [FLINK-15070|https://issues.apache.org/jira/browse/FLINK-15070].





[jira] [Comment Edited] (FLINK-15069) Supplement the pipelined shuffle compression case for benchmark

2019-12-11 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994090#comment-16994090
 ] 

zhijiang edited comment on FLINK-15069 at 12/12/19 2:30 AM:


Exactly, after this ticket is merged I plan to submit a PR implementing 
`StreamNetworkCompressionThroughputBenchmarkExecutor` in the flink-benchmarks 
repo.

Maybe I should close this ticket after the next PR for the benchmark repo is 
merged.


was (Author: zjwang):
Exactly, after this ticket merged I plan to submit a PR for implementing the 
`StreamNetworkCompressionThroughputBenchmarkExecutor` in flink-benchmarks repo.

> Supplement the pipelined shuffle compression case for benchmark
> ---
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While reviewing the 
> [PR|https://github.com/apache/flink/pull/10375#pullrequestreview-325193504] 
> introducing data compression for persistent storage and network shuffle, we 
> thought it better to also cover this scenario in the benchmark for tracing 
> performance issues in the future. 
> This ticket supplements the compression case for pipelined partition 
> shuffle; the compression case for blocking partition will be added in 
> [FLINK-15070|https://issues.apache.org/jira/browse/FLINK-15070].





[jira] [Comment Edited] (FLINK-15069) Supplement the pipelined shuffle compression case for benchmark

2019-12-11 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994090#comment-16994090
 ] 

zhijiang edited comment on FLINK-15069 at 12/12/19 2:23 AM:


Exactly, after this ticket is merged I plan to submit a PR implementing 
`StreamNetworkCompressionThroughputBenchmarkExecutor` in the flink-benchmarks 
repo.


was (Author: zjwang):
Exactly, after this ticket merged I planed to submit a PR for implementing the 
`StreamNetworkCompressionThroughputBenchmarkExecutor` in flink-benchmarks repo.

> Supplement the pipelined shuffle compression case for benchmark
> ---
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While reviewing the 
> [PR|https://github.com/apache/flink/pull/10375#pullrequestreview-325193504] 
> introducing data compression for persistent storage and network shuffle, we 
> thought it better to also cover this scenario in the benchmark for tracing 
> performance issues in the future. 
> This ticket supplements the compression case for pipelined partition 
> shuffle; the compression case for blocking partition will be added in 
> [FLINK-15070|https://issues.apache.org/jira/browse/FLINK-15070].





[jira] [Commented] (FLINK-15069) Supplement the pipelined shuffle compression case for benchmark

2019-12-11 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16994090#comment-16994090
 ] 

zhijiang commented on FLINK-15069:
--

Exactly, after this ticket is merged I plan to submit a PR implementing 
`StreamNetworkCompressionThroughputBenchmarkExecutor` in the flink-benchmarks 
repo.

> Supplement the pipelined shuffle compression case for benchmark
> ---
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While reviewing the 
> [PR|https://github.com/apache/flink/pull/10375#pullrequestreview-325193504] 
> introducing data compression for persistent storage and network shuffle, we 
> thought it better to also cover this scenario in the benchmark for tracing 
> performance issues in the future. 
> This ticket supplements the compression case for pipelined partition 
> shuffle; the compression case for blocking partition will be added in 
> [FLINK-15070|https://issues.apache.org/jira/browse/FLINK-15070].





[jira] [Comment Edited] (FLINK-15153) Service selector needs to contain jobmanager component label

2019-12-11 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993148#comment-16993148
 ] 

zhijiang edited comment on FLINK-15153 at 12/12/19 2:16 AM:


Fixed in master: 4286fca12faf97e6089c7600896bb4a7af6f9c36

Fixed in release-1.10: 8f7ad848964c00bee490be0a4ea6d715426179d8


was (Author: zjwang):
Fixed in master: 4286fca12faf97e6089c7600896bb4a7af6f9c36

> Service selector needs to contain jobmanager component label
> 
>
> Key: FLINK-15153
> URL: https://issues.apache.org/jira/browse/FLINK-15153
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The jobmanager label needs to be added to the service selector. Otherwise, it 
> may select the wrong backend pods (taskmanagers).
> The internal service is used by the taskmanager to talk to the jobmanager. If 
> it does not have the correct backend pods, the taskmanager may fail to 
> register.
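
As a sketch of the fix (the label keys and values below are illustrative, not 
necessarily the exact ones Flink's Kubernetes integration uses), the internal 
service's selector should include the jobmanager component label so that it 
only matches jobmanager pods:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: flink-cluster-internal   # name is illustrative
spec:
  clusterIP: None
  selector:
    app: flink-cluster
    component: jobmanager   # without this label, taskmanager pods could match too
  ports:
    - name: rpc
      port: 6123
    - name: blob
      port: 6124
```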





[jira] [Closed] (FLINK-15069) Supplement the pipelined shuffle compression case for benchmark

2019-12-11 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15069.

Resolution: Fixed

Fixed in master: 04ab225056714013fa3bf4dfb590edac7b577d03

> Supplement the pipelined shuffle compression case for benchmark
> ---
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While reviewing the 
> [PR|https://github.com/apache/flink/pull/10375#pullrequestreview-325193504] 
> introducing data compression for persistent storage and network shuffle, we 
> thought it better to also cover this scenario in the benchmark for tracing 
> performance issues in the future. 
> This ticket supplements the compression case for pipelined partition 
> shuffle; the compression case for blocking partition will be added in 
> [FLINK-15070|https://issues.apache.org/jira/browse/FLINK-15070].





[jira] [Assigned] (FLINK-14952) Yarn containers can exceed physical memory limits when using BoundedBlockingSubpartition.

2019-12-10 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-14952:


Assignee: zhijiang

> Yarn containers can exceed physical memory limits when using 
> BoundedBlockingSubpartition.
> -
>
> Key: FLINK-14952
> URL: https://issues.apache.org/jira/browse/FLINK-14952
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Piotr Nowojski
>Assignee: zhijiang
>Priority: Blocker
> Fix For: 1.10.0
>
>
> As [reported by a user on the user mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/CoGroup-SortMerger-performance-degradation-from-1-6-4-1-9-1-td31082.html],
>  combination of using {{BoundedBlockingSubpartition}} with yarn containers 
> can cause yarn container to exceed memory limits.
> {quote}2019-11-19 12:49:23,068 INFO org.apache.flink.yarn.YarnResourceManager 
> - Closing TaskExecutor connection container_e42_1574076744505_9444_01_04 
> because: Container 
> [pid=42774,containerID=container_e42_1574076744505_9444_01_04] is running 
> beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical 
> memory used; 13.9 GB of 25.2 GB virtual memory used. Killing container.
> {quote}
> This is probably happening because the memory usage of mmap is not capped and 
> not accounted for by the configured memory limits; however, yarn tracks this 
> memory usage, and once Flink exceeds some threshold the container is killed.
> The workaround is to overrule the default value and force Flink not to use 
> mmap, by setting a secret (🤫) config option:
> {noformat}
> taskmanager.network.bounded-blocking-subpartition-type: file
> {noformat}





[jira] [Assigned] (FLINK-15140) Shuffle data compression does not work with BroadcastRecordWriter.

2019-12-10 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15140:


Assignee: Yingjie Cao

> Shuffle data compression does not work with BroadcastRecordWriter.
> --
>
> Key: FLINK-15140
> URL: https://issues.apache.org/jira/browse/FLINK-15140
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0
>Reporter: Yingjie Cao
>Assignee: Yingjie Cao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I tested the newest code of the master branch last weekend with more test 
> cases. Unfortunately, several problems were encountered, including a 
> compression bug.
> When BroadcastRecordWriter is used in pipelined mode, the compressor copies 
> the compressed data back into the input buffer; however, the underlying 
> buffer is shared across subpartitions when BroadcastRecordWriter is used, so 
> we cannot copy the compressed data back into the input buffer when the 
> underlying buffer is shared. For blocking mode, we wrongly recycle the buffer 
> when the buffer is not compressed, and this problem is also triggered when 
> BroadcastRecordWriter is used.
> To fix the problem: for blocking shuffle, the reference counter should be 
> maintained correctly; for pipelined shuffle, the simplest way may be to 
> disable compression when the underlying buffer is shared. I will open a PR to 
> fix the problem.
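
The pipelined-mode fix described above boils down to a simple guard. The 
sketch below uses made-up names (this is not Flink's actual buffer API): 
in-place compression is skipped whenever the underlying buffer is shared, 
i.e., its reference count is greater than one.

```java
// Hypothetical sketch of the guard; class and method names are illustrative.
public class CompressionGuard {

    /**
     * In-place compression writes the compressed bytes back into the input
     * buffer. That is only safe when this writer holds the sole reference;
     * with BroadcastRecordWriter the same underlying buffer is shared by all
     * subpartitions, so mutating it would corrupt the other readers' views.
     */
    public static boolean safeToCompressInPlace(int bufferReferenceCount) {
        return bufferReferenceCount == 1;
    }

    public static void main(String[] args) {
        // Non-broadcast writer: each subpartition has its own buffer.
        System.out.println(safeToCompressInPlace(1)); // true  -> compress in place
        // Broadcast writer: one buffer referenced by several subpartitions.
        System.out.println(safeToCompressInPlace(4)); // false -> skip compression
    }
}
```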





[jira] [Commented] (FLINK-14952) Yarn containers can exceed physical memory limits when using BoundedBlockingSubpartition.

2019-12-10 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993207#comment-16993207
 ] 

zhijiang commented on FLINK-14952:
--

As [~kevin.cyj] mentioned above, the current blocking partition with the file 
type has a potential memory-overhead concern. I created a separate ticket, 
FLINK-15187, to track this issue, and I do not tag it as a blocker for 
release-1.10 ATM.

> Yarn containers can exceed physical memory limits when using 
> BoundedBlockingSubpartition.
> -
>
> Key: FLINK-14952
> URL: https://issues.apache.org/jira/browse/FLINK-14952
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Piotr Nowojski
>Priority: Blocker
> Fix For: 1.10.0
>
>
> As [reported by a user on the user mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/CoGroup-SortMerger-performance-degradation-from-1-6-4-1-9-1-td31082.html],
>  combination of using {{BoundedBlockingSubpartition}} with yarn containers 
> can cause yarn container to exceed memory limits.
> {quote}2019-11-19 12:49:23,068 INFO org.apache.flink.yarn.YarnResourceManager 
> - Closing TaskExecutor connection container_e42_1574076744505_9444_01_04 
> because: Container 
> [pid=42774,containerID=container_e42_1574076744505_9444_01_04] is running 
> beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical 
> memory used; 13.9 GB of 25.2 GB virtual memory used. Killing container.
> {quote}
> This is probably happening because the memory usage of mmap is not capped and 
> not accounted for by the configured memory limits; however, yarn tracks this 
> memory usage, and once Flink exceeds some threshold the container is killed.
> The workaround is to overrule the default value and force Flink not to use 
> mmap, by setting a secret (🤫) config option:
> {noformat}
> taskmanager.network.bounded-blocking-subpartition-type: file
> {noformat}





[jira] [Created] (FLINK-15187) Reuse LocalBufferPool for FileBufferReader in blocking partition

2019-12-10 Thread zhijiang (Jira)
zhijiang created FLINK-15187:


 Summary: Reuse LocalBufferPool for FileBufferReader in blocking 
partition
 Key: FLINK-15187
 URL: https://issues.apache.org/jira/browse/FLINK-15187
 Project: Flink
  Issue Type: Task
  Components: Runtime / Network
Reporter: zhijiang


If we choose the file type via 
`taskmanager.network.bounded-blocking-subpartition-type` for a batch job, then 
creating the respective view for reading the subpartition's persistent data 
allocates two unpooled memory segments for every subpartition. This portion of 
temporary memory is neither managed nor accounted for by the framework, so it 
might cause OOM errors.

We could instead reuse the ResultPartition's `LocalBufferPool` to read 
subpartition data and avoid this memory overhead. But there are two additional 
problems with reusing it directly. 
 * The current core size of `LocalBufferPool` is `numberOfSubpartitions + 1`, 
but every subpartition needs two segments for pre-reading atm. We can remove 
the pre-reading to make the current core pool size match the reading 
requirements, because the pre-reading seems to have no obvious benefit in 
practice; it only affects the last piece of data.
 * When the task finishes, it destroys the `LocalBufferPool` even though the 
respective `ResultPartition` is still alive, so the following subpartition view 
cannot reuse the pool directly. We should adjust the respective logic to either 
delay destroying the pool or create a new pool for the subpartition view.
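
The sizing mismatch in the first bullet can be made concrete with a little 
arithmetic. This is only a sketch of the numbers stated above; the method 
names are made up and this is not Flink's actual API:

```java
// Illustrative arithmetic for the pool-sizing bullet above.
public class BlockingReadSizing {

    /** Current core size of the partition's LocalBufferPool. */
    static int corePoolSize(int numberOfSubpartitions) {
        return numberOfSubpartitions + 1;
    }

    /** With pre-reading, every subpartition view holds two segments. */
    static int demandWithPreReading(int numberOfSubpartitions) {
        return 2 * numberOfSubpartitions;
    }

    /** Without pre-reading, one segment per subpartition view suffices. */
    static int demandWithoutPreReading(int numberOfSubpartitions) {
        return numberOfSubpartitions;
    }

    public static void main(String[] args) {
        int n = 4;
        // The pool is too small as long as pre-reading doubles the demand...
        System.out.println(corePoolSize(n) >= demandWithPreReading(n));    // false
        // ...but is already large enough once pre-reading is removed.
        System.out.println(corePoolSize(n) >= demandWithoutPreReading(n)); // true
    }
}
```

This is why removing pre-reading makes the existing `numberOfSubpartitions + 1` 
core pool size sufficient for reading.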





[jira] [Assigned] (FLINK-15104) Performance regression on 4.12.2019 in RocksDB benchmarks

2019-12-10 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15104:


Assignee: Yun Tang

> Performance regression on 4.12.2019 in RocksDB benchmarks
> -
>
> Key: FLINK-15104
> URL: https://issues.apache.org/jira/browse/FLINK-15104
> Project: Flink
>  Issue Type: Bug
>  Components: Benchmarks, Runtime / State Backends
>Reporter: Piotr Nowojski
>Assignee: Yun Tang
>Priority: Blocker
> Fix For: 1.10.0
>
>
> A potential small performance regression that happened either on December 4th 
> or 5th:
> http://codespeed.dak8s.net:8000/timeline/?ben=stateBackends.ROCKS&env=2
> http://codespeed.dak8s.net:8000/timeline/?ben=stateBackends.ROCKS_INC&env=2
> it doesn't seem to be visible in the states benchmarks suites, which might be 
> an issue on its own.





[jira] [Comment Edited] (FLINK-14952) Yarn containers can exceed physical memory limits when using BoundedBlockingSubpartition.

2019-12-10 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993168#comment-16993168
 ] 

zhijiang edited comment on FLINK-14952 at 12/11/19 3:48 AM:


Generally speaking, we want to provide the better-performing setting as the 
default config. 

But considering that the mmap way might bring stability concerns in 
resource-managed clusters, I also prefer to adjust the default of the config 
option "*taskmanager.network.bounded-blocking-subpartition-type*" from `auto` 
to `file` mode.

Furthermore, we can also supplement the option's description with the 
limitations/concerns of the `mmap` way. Then, when users want to enable `mmap` 
for better performance, they are aware of the respective risks and scenarios.


was (Author: zjwang):
Generally we want to provide a better performance setting as a default config. 

But considering the mmap way might bring unstable concern in resource framework 
cluster, then I also prefer to adjust the default config option 
"*taskmanager.network.bounded-blocking-subpartition-type*" from `auto` to 
`file` mode.

Furthermore we can also supplement some descriptions of limitation/concerns for 
`mmap` way in options. Then when users want to enable `mmap` way for better 
performance, they are aware of the respective risks and the scenarios.

> Yarn containers can exceed physical memory limits when using 
> BoundedBlockingSubpartition.
> -
>
> Key: FLINK-14952
> URL: https://issues.apache.org/jira/browse/FLINK-14952
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Piotr Nowojski
>Priority: Blocker
> Fix For: 1.10.0
>
>
> As [reported by a user on the user mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/CoGroup-SortMerger-performance-degradation-from-1-6-4-1-9-1-td31082.html],
>  combination of using {{BoundedBlockingSubpartition}} with yarn containers 
> can cause yarn container to exceed memory limits.
> {quote}2019-11-19 12:49:23,068 INFO org.apache.flink.yarn.YarnResourceManager 
> - Closing TaskExecutor connection container_e42_1574076744505_9444_01_04 
> because: Container 
> [pid=42774,containerID=container_e42_1574076744505_9444_01_04] is running 
> beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical 
> memory used; 13.9 GB of 25.2 GB virtual memory used. Killing container.
> {quote}
> This is probably happening because the memory usage of mmap is not capped and 
> not accounted for by the configured memory limits; however, yarn tracks this 
> memory usage, and once Flink exceeds some threshold the container is killed.
> The workaround is to overrule the default value and force Flink not to use 
> mmap, by setting a secret (🤫) config option:
> {noformat}
> taskmanager.network.bounded-blocking-subpartition-type: file
> {noformat}





[jira] [Commented] (FLINK-14952) Yarn containers can exceed physical memory limits when using BoundedBlockingSubpartition.

2019-12-10 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993168#comment-16993168
 ] 

zhijiang commented on FLINK-14952:
--

Generally, we want to provide the better-performing setting as the default 
config. 

But considering that the mmap way might bring stability concerns in 
resource-managed clusters, I also prefer to adjust the default of the config 
option "*taskmanager.network.bounded-blocking-subpartition-type*" from `auto` 
to `file` mode.

Furthermore, we can also supplement the option's description with the 
limitations/concerns of the `mmap` way. Then, when users want to enable `mmap` 
for better performance, they are aware of the respective risks and scenarios.

> Yarn containers can exceed physical memory limits when using 
> BoundedBlockingSubpartition.
> -
>
> Key: FLINK-14952
> URL: https://issues.apache.org/jira/browse/FLINK-14952
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN, Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Piotr Nowojski
>Priority: Blocker
> Fix For: 1.10.0
>
>
> As [reported by a user on the user mailing 
> list|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/CoGroup-SortMerger-performance-degradation-from-1-6-4-1-9-1-td31082.html],
>  combination of using {{BoundedBlockingSubpartition}} with yarn containers 
> can cause yarn container to exceed memory limits.
> {quote}2019-11-19 12:49:23,068 INFO org.apache.flink.yarn.YarnResourceManager 
> - Closing TaskExecutor connection container_e42_1574076744505_9444_01_04 
> because: Container 
> [pid=42774,containerID=container_e42_1574076744505_9444_01_04] is running 
> beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical 
> memory used; 13.9 GB of 25.2 GB virtual memory used. Killing container.
> {quote}
> This is probably happening because the memory usage of mmap is not capped and 
> not accounted for by the configured memory limits; however, yarn tracks this 
> memory usage, and once Flink exceeds some threshold the container is killed.
> The workaround is to overrule the default value and force Flink not to use 
> mmap, by setting a secret (🤫) config option:
> {noformat}
> taskmanager.network.bounded-blocking-subpartition-type: file
> {noformat}





[jira] [Assigned] (FLINK-15153) Service selector needs to contain jobmanager component label

2019-12-10 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15153:


Assignee: Yang Wang

> Service selector needs to contain jobmanager component label
> 
>
> Key: FLINK-15153
> URL: https://issues.apache.org/jira/browse/FLINK-15153
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Reporter: Yang Wang
>Assignee: Yang Wang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The jobmanager label needs to be added to the service selector. Otherwise, it 
> may select the wrong backend pods (taskmanagers).
> The internal service is used by the taskmanager to talk to the jobmanager. If 
> it does not have the correct backend pods, the taskmanager may fail to 
> register.





[jira] [Closed] (FLINK-15153) Service selector needs to contain jobmanager component label

2019-12-10 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15153.

Resolution: Fixed

Fixed in master: 4286fca12faf97e6089c7600896bb4a7af6f9c36

> Service selector needs to contain jobmanager component label
> 
>
> Key: FLINK-15153
> URL: https://issues.apache.org/jira/browse/FLINK-15153
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Reporter: Yang Wang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The jobmanager label needs to be added to the service selector. Otherwise, it 
> may select the wrong backend pods (taskmanagers).
> The internal service is used by the taskmanager to talk to the jobmanager. If 
> it does not have the correct backend pods, the taskmanager may fail to 
> register.





[jira] [Commented] (FLINK-15166) Shuffle data compression wrongly decrease the buffer reference count.

2019-12-10 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992403#comment-16992403
 ] 

zhijiang commented on FLINK-15166:
--

Fixed in release-1.10 : 52a2f03eff658c5ec70b223a2c1551a96b4809dd

> Shuffle data compression wrongly decrease the buffer reference count.
> -
>
> Key: FLINK-15166
> URL: https://issues.apache.org/jira/browse/FLINK-15166
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0
>Reporter: Yingjie Cao
>Assignee: Yingjie Cao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> FLINK-15140 reported two related problems, both triggered by the broadcast 
> partitioner; to make things clearer, I created this Jira to address the 
> problems separately.
>  
> For blocking shuffle compression, we recycle the compressed intermediate 
> buffer each time after we write the data out; however, when the data is not 
> compressed, the returned buffer is the original buffer and should not be 
> recycled, yet we wrongly recycle it.
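
The recycling rule described above can be sketched as follows. This is a 
simplified model with made-up types, not Flink's actual buffer API: the buffer 
returned by the compressor is recycled after writing only when it is a 
genuinely new intermediate copy.

```java
// Simplified model of the recycle decision; types and names are illustrative.
public class RecycleAfterWrite {

    static class Buffer {
        boolean recycled = false;
        void recycle() { recycled = true; } // return memory to the pool
    }

    /**
     * After writing out, recycle the returned buffer only if it is the
     * compressor's intermediate copy. If compression was skipped, the
     * returned buffer IS the original, whose lifecycle is governed by its
     * own reference counting -- recycling it here as well was the bug.
     */
    static void writeAndMaybeRecycle(Buffer original, Buffer returned) {
        // ... write `returned` to the spill file here ...
        if (returned != original) {
            returned.recycle();
        }
    }

    public static void main(String[] args) {
        Buffer original = new Buffer();
        Buffer compressed = new Buffer();
        writeAndMaybeRecycle(original, compressed); // intermediate copy recycled
        writeAndMaybeRecycle(original, original);   // compression skipped: keep it
        System.out.println(compressed.recycled + " " + original.recycled); // true false
    }
}
```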





[jira] [Resolved] (FLINK-15166) Shuffle data compression wrongly decrease the buffer reference count.

2019-12-10 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang resolved FLINK-15166.
--
Resolution: Fixed

Fixed in master : 4d693c4fbc5e6f3ff34ccb3cb3a1d9f35d6bbd76

> Shuffle data compression wrongly decrease the buffer reference count.
> -
>
> Key: FLINK-15166
> URL: https://issues.apache.org/jira/browse/FLINK-15166
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0
>Reporter: Yingjie Cao
>Assignee: Yingjie Cao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> FLINK-15140 reports two related problems which are both triggered by the 
> broadcast partitioner; to make things clearer, I created this Jira to address 
> the problems separately.
>  
> For blocking shuffle compression, we recycle the compressed intermediate 
> buffer each time after writing the data out. However, when the data is not 
> compressed, the returned buffer is the original buffer and should not be 
> recycled, but we wrongly recycled it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15166) Shuffle data compression wrongly decrease the buffer reference count.

2019-12-10 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15166:


Assignee: Yingjie Cao

> Shuffle data compression wrongly decrease the buffer reference count.
> -
>
> Key: FLINK-15166
> URL: https://issues.apache.org/jira/browse/FLINK-15166
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0
>Reporter: Yingjie Cao
>Assignee: Yingjie Cao
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> FLINK-15140 reports two related problems which are both triggered by the 
> broadcast partitioner; to make things clearer, I created this Jira to address 
> the problems separately.
>  
> For blocking shuffle compression, we recycle the compressed intermediate 
> buffer each time after writing the data out. However, when the data is not 
> compressed, the returned buffer is the original buffer and should not be 
> recycled, but we wrongly recycled it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14551) Unaligned checkpoints

2019-12-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-14551:
-
Fix Version/s: (was: 1.10.0)
   1.11.0

> Unaligned checkpoints
> -
>
> Key: FLINK-14551
> URL: https://issues.apache.org/jira/browse/FLINK-14551
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Checkpointing, Runtime / Network
>Reporter: zhijiang
>Priority: Major
> Fix For: 1.11.0
>
>
> This is the umbrella issue for the feature of unaligned checkpoints. Refer to 
> the 
> [FLIP-76|https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints]
>  for more details.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-14032) Make the cache size of RocksDBPriorityQueueSetFactory configurable

2019-12-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-14032:


Assignee: Yun Tang

> Make the cache size of RocksDBPriorityQueueSetFactory configurable
> --
>
> Key: FLINK-14032
> URL: https://issues.apache.org/jira/browse/FLINK-14032
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Major
> Fix For: 1.11.0
>
>
> Currently, the cache size of {{RocksDBPriorityQueueSetFactory}} is hard-coded 
> to 128 and there is no way to configure it to another value. (We could 
> increase it to obtain better performance if necessary.) Actually, this has 
> also been a TODO for quite a long time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15070) Supplement cases of blocking partition with compression for benchmark

2019-12-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15070:
-
Summary: Supplement cases of blocking partition with compression for 
benchmark  (was: Supplement the case of blocking partition with compression for 
benchmark)

> Supplement cases of blocking partition with compression for benchmark
> -
>
> Key: FLINK-15070
> URL: https://issues.apache.org/jira/browse/FLINK-15070
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: Haibo Sun
>Priority: Minor
>
> ATM the benchmark only covers the case of the pipelined partition used in 
> streaming jobs, so it is better to also cover the case of the blocking 
> partition for batch jobs. Then we can easily trace performance concerns for 
> any future changes.
> This ticket would introduce the blocking partition cases for uncompressed 
> file, uncompressed mmap and compressed file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15070) Supplement the case of blocking partition with compression for benchmark

2019-12-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15070:
-
Description: 
ATM the benchmark only covers the case of the pipelined partition used in 
streaming jobs, so it is better to also cover the case of the blocking 
partition for batch jobs. Then we can easily trace performance concerns for any 
future changes.

This ticket would introduce the blocking partition cases for uncompressed file, 
uncompressed mmap and compressed file.

  was:ATM the benchmark only covers the case of pipelined partition used in 
streaming job, so it is better to also cover the case of blocking partition for 
batch job.  Then we can easily trace the performance concerns for any changes 
future.


> Supplement the case of blocking partition with compression for benchmark
> 
>
> Key: FLINK-15070
> URL: https://issues.apache.org/jira/browse/FLINK-15070
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: Haibo Sun
>Priority: Minor
>
> ATM the benchmark only covers the case of the pipelined partition used in 
> streaming jobs, so it is better to also cover the case of the blocking 
> partition for batch jobs. Then we can easily trace performance concerns for 
> any future changes.
> This ticket would introduce the blocking partition cases for uncompressed 
> file, uncompressed mmap and compressed file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15070) Supplement the case of blocking partition with compression for benchmark

2019-12-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15070:
-
Summary: Supplement the case of blocking partition with compression for 
benchmark  (was: Supplement the case of bounded blocking partition for 
benchmark)

> Supplement the case of blocking partition with compression for benchmark
> 
>
> Key: FLINK-15070
> URL: https://issues.apache.org/jira/browse/FLINK-15070
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: Haibo Sun
>Priority: Minor
>
> ATM the benchmark only covers the case of the pipelined partition used in 
> streaming jobs, so it is better to also cover the case of the blocking 
> partition for batch jobs. Then we can easily trace performance concerns for 
> any future changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15069) Supplement the pipelined shuffle compression case for benchmark

2019-12-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15069:
-
Description: 
While reviewing the 
[PR|https://github.com/apache/flink/pull/10375#pullrequestreview-325193504] 
introducing data compression for persistent storage and network shuffle, we 
think it is better to also cover this scenario in the benchmark for tracing 
performance issues in the future.

This ticket would supplement the compression case for pipelined partition 
shuffle, and the compression case for blocking partition would be added in 
[FLINK-15070|https://issues.apache.org/jira/browse/FLINK-15070]

 

  was:
While reviewing the PR of introducing data compression for persistent storage 
and network shuffle, we think it is better to also cover this scenario in the 
benchmark for tracing the performance issues future. 

Refer to https://github.com/apache/flink/pull/10375#pullrequestreview-325193504


> Supplement the pipelined shuffle compression case for benchmark
> ---
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>
> While reviewing the 
> [PR|https://github.com/apache/flink/pull/10375#pullrequestreview-325193504] 
> introducing data compression for persistent storage and network shuffle, we 
> think it is better to also cover this scenario in the benchmark for tracing 
> performance issues in the future.
> This ticket would supplement the compression case for pipelined partition 
> shuffle, and the compression case for blocking partition would be added in 
> [FLINK-15070|https://issues.apache.org/jira/browse/FLINK-15070]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15069) Supplement the pipelined shuffle compression case for benchmark

2019-12-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15069:
-
Summary: Supplement the pipelined shuffle compression case for benchmark  
(was: Supplement the compression case for benchmark)

> Supplement the pipelined shuffle compression case for benchmark
> ---
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>
> While reviewing the PR of introducing data compression for persistent storage 
> and network shuffle, we think it is better to also cover this scenario in the 
> benchmark for tracing performance issues in the future.
> Refer to 
> https://github.com/apache/flink/pull/10375#pullrequestreview-325193504



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15069) Supplement the compression case for benchmark

2019-12-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15069:


Assignee: zhijiang

> Supplement the compression case for benchmark
> -
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>
> While reviewing the PR of introducing data compression for persistent storage 
> and network shuffle, we think it is better to also cover this scenario in the 
> benchmark for tracing performance issues in the future.
> Refer to 
> https://github.com/apache/flink/pull/10375#pullrequestreview-325193504



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15070) Supplement the case of bounded blocking partition for benchmark

2019-12-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15070:


Assignee: Haibo Sun

> Supplement the case of bounded blocking partition for benchmark
> ---
>
> Key: FLINK-15070
> URL: https://issues.apache.org/jira/browse/FLINK-15070
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: Haibo Sun
>Priority: Minor
>
> ATM the benchmark only covers the case of the pipelined partition used in 
> streaming jobs, so it is better to also cover the case of the blocking 
> partition for batch jobs. Then we can easily trace performance concerns for 
> any future changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15109) InternalTimerServiceImpl references restored state after use, taking up resources unnecessarily

2019-12-08 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15109:


Assignee: Roman Khachatryan

> InternalTimerServiceImpl references restored state after use, taking up 
> resources unnecessarily
> ---
>
> Key: FLINK-15109
> URL: https://issues.apache.org/jira/browse/FLINK-15109
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Affects Versions: 1.9.1
>Reporter: Roman Khachatryan
>Assignee: Roman Khachatryan
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> E.g. 
> org.apache.flink.streaming.api.operators.InternalTimerServiceImpl#restoredTimersSnapshot:
>  # written in restoreTimersForKeyGroup()
>  # used in startTimerService()
>  # and then never used again.
>  
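A minimal sketch of the cleanup suggested above. The field and method names mirror the ones cited in the description, but the bodies are illustrative, not Flink's actual code:

```java
/** Sketch: drop the restored-timers snapshot once the timer service has
 *  started, so the restored state can be garbage collected. */
class InternalTimerServiceSketch {
    // needed only between restore and start
    private Object restoredTimersSnapshot;

    void restoreTimersForKeyGroup(Object snapshot) {
        restoredTimersSnapshot = snapshot; // written during restore
    }

    void startTimerService() {
        // ... re-register the restored timers here ...
        restoredTimersSnapshot = null; // release the reference after use
    }

    boolean holdsRestoredState() {
        return restoredTimersSnapshot != null;
    }
}
```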



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (FLINK-15109) InternalTimerServiceImpl references restored state after use, taking up resources unnecessarily

2019-12-08 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang resolved FLINK-15109.
--
Resolution: Fixed

Fixed in master: 59ac51814c46d790f1ae030e1a199bddf00a8b01

> InternalTimerServiceImpl references restored state after use, taking up 
> resources unnecessarily
> ---
>
> Key: FLINK-15109
> URL: https://issues.apache.org/jira/browse/FLINK-15109
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Affects Versions: 1.9.1
>Reporter: Roman Khachatryan
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> E.g. 
> org.apache.flink.streaming.api.operators.InternalTimerServiceImpl#restoredTimersSnapshot:
>  # written in restoreTimersForKeyGroup()
>  # used in startTimerService()
>  # and then never used again.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15109) InternalTimerServiceImpl references restored state after use, taking up resources unnecessarily

2019-12-08 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15109:
-
Fix Version/s: 1.10.0

> InternalTimerServiceImpl references restored state after use, taking up 
> resources unnecessarily
> ---
>
> Key: FLINK-15109
> URL: https://issues.apache.org/jira/browse/FLINK-15109
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Affects Versions: 1.9.1
>Reporter: Roman Khachatryan
>Assignee: Roman Khachatryan
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> E.g. 
> org.apache.flink.streaming.api.operators.InternalTimerServiceImpl#restoredTimersSnapshot:
>  # written in restoreTimersForKeyGroup()
>  # used in startTimerService()
>  # and then never used again.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14845) Introduce data compression to blocking shuffle.

2019-12-08 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991107#comment-16991107
 ] 

zhijiang commented on FLINK-14845:
--

[~pnowojski] Yes, exactly, we would verify the effects via the benchmark for 
the blocking partition with compression as a follow-up work. I created the 
respective tickets FLINK-15070 and FLINK-15069.

> Introduce data compression to blocking shuffle.
> ---
>
> Key: FLINK-14845
> URL: https://issues.apache.org/jira/browse/FLINK-14845
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Yingjie Cao
>Assignee: Yingjie Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, the blocking shuffle writer writes raw output data to disk without 
> compression. For IO-bound scenarios, this can be optimized by compressing the 
> output data. It is better to introduce a compression mechanism and offer a 
> config option to let the user decide whether to compress the shuffle data. 
> Actually, we have implemented compression in our internal Flink version, and 
> here are some key points:
> 1. Where to compress/decompress?
> Compress at the upstream and decompress at the downstream.
> 2. Which threads do the compression/decompression?
> Task threads do the compression/decompression.
> 3. Data compression granularity.
> Per buffer.
> 4. How to handle the case when the data size becomes even bigger after 
> compression?
> Give up compression in this case and introduce an extra flag to identify 
> whether the data was compressed; that is, the output may be a mixture of 
> compressed and uncompressed data.
>  
> We'd like to introduce blocking shuffle data compression to Flink if there is 
> interest.
>  
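The per-buffer compression with a fallback flag described in the key points can be sketched roughly as follows. Here `java.util.zip.Deflater` stands in for the actual codec, and the one-byte header flag is an illustrative encoding, not Flink's wire format:

```java
import java.util.zip.Deflater;

/** Rough sketch of per-buffer compression with a fallback flag. */
final class PerBufferCompressor {

    /**
     * Returns a new array: a 1-byte flag (1 = compressed, 0 = stored as-is)
     * followed by the payload. Compression is given up whenever it would not
     * make the buffer smaller.
     */
    static byte[] compressOrStore(byte[] buffer) {
        Deflater deflater = new Deflater();
        deflater.setInput(buffer);
        deflater.finish();
        byte[] out = new byte[buffer.length]; // never accept a larger result
        int n = deflater.deflate(out);
        // finished() is false when `out` was too small, i.e. the compressed
        // form would be at least as big as the original buffer
        boolean compressed = deflater.finished() && n < buffer.length;
        deflater.end();
        byte[] payload = compressed ? out : buffer;
        int len = compressed ? n : buffer.length;
        byte[] result = new byte[1 + len];
        result[0] = (byte) (compressed ? 1 : 0);
        System.arraycopy(payload, 0, result, 1, len);
        return result;
    }
}
```

A reader would inspect the flag byte to decide whether to decompress, so a stream of such buffers can freely mix compressed and uncompressed data.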



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15070) Supplement the case of bounded blocking partition for benchmark

2019-12-05 Thread zhijiang (Jira)
zhijiang created FLINK-15070:


 Summary: Supplement the case of bounded blocking partition for 
benchmark
 Key: FLINK-15070
 URL: https://issues.apache.org/jira/browse/FLINK-15070
 Project: Flink
  Issue Type: Task
  Components: Benchmarks
Reporter: zhijiang


ATM the benchmark only covers the case of the pipelined partition used in 
streaming jobs, so it is better to also cover the case of the blocking 
partition for batch jobs. Then we can easily trace performance concerns for any 
future changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15069) Supplement the compression case for benchmark

2019-12-05 Thread zhijiang (Jira)
zhijiang created FLINK-15069:


 Summary: Supplement the compression case for benchmark
 Key: FLINK-15069
 URL: https://issues.apache.org/jira/browse/FLINK-15069
 Project: Flink
  Issue Type: Task
  Components: Benchmarks
Reporter: zhijiang


While reviewing the PR of introducing data compression for persistent storage 
and network shuffle, we think it is better to also cover this scenario in the 
benchmark for tracing performance issues in the future.

Refer to https://github.com/apache/flink/pull/10375#pullrequestreview-325193504



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-15030) Potential deadlock for bounded blocking ResultPartition.

2019-12-02 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15030:


Assignee: Yingjie Cao

> Potential deadlock for bounded blocking ResultPartition.
> 
>
> Key: FLINK-15030
> URL: https://issues.apache.org/jira/browse/FLINK-15030
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.9.0, 1.9.1
>Reporter: Yingjie Cao
>Assignee: Yingjie Cao
>Priority: Major
>
> Currently, the BoundedBlockingSubpartition relies on the addition of the next 
> BufferConsumer to flush and recycle the previous one, which means we need at 
> least (numSubpartitions + 1) buffers to make the bounded blocking 
> ResultPartition work. However, the ResultPartitionFactory gives only 
> (numSubpartitions) required buffers to the BoundedBlockingSubpartition, which 
> may lead to deadlock.
> This problem exists only in version 1.9. In version 1.10 (master), this 
> problem has been fixed by this commit: 
> 2c8b4ef572f05bf4740b7e204af1e5e709cd945c.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15021) Refactor to remove channelWritabilityChanged from PartitionRequestQueue

2019-12-02 Thread zhijiang (Jira)
zhijiang created FLINK-15021:


 Summary: Refactor to remove channelWritabilityChanged from 
PartitionRequestQueue
 Key: FLINK-15021
 URL: https://issues.apache.org/jira/browse/FLINK-15021
 Project: Flink
  Issue Type: Task
  Components: Runtime / Network
Reporter: zhijiang
Assignee: zhijiang


After removing the non-credit-based flow control code, the related channel 
writability change logic in PartitionRequestQueue is obsolete and can be 
removed completely. Therefore we can refactor the process to simplify the code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-14516) Remove non credit based network code

2019-12-02 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-14516:
-
Parent: FLINK-7282
Issue Type: Sub-task  (was: Task)

> Remove non credit based network code
> 
>
> Key: FLINK-14516
> URL: https://issues.apache.org/jira/browse/FLINK-14516
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Affects Versions: 1.9.0
>Reporter: Piotr Nowojski
>Assignee: Piotr Nowojski
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After [a survey on the dev mailing 
> list|http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Dropping-non-Credit-based-Flow-Control-td33714.html]
> the feedback was that the old code path is not used and no longer needed. Based 
> on that we should be safe to drop it and make credit based flow control the 
> only option (currently it's the default).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-14956) MemoryMappedBoundedData Compressed Buffer Slicer

2019-11-28 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-14956.

Resolution: Duplicate

> MemoryMappedBoundedData Compressed Buffer Slicer
> 
>
> Key: FLINK-14956
> URL: https://issues.apache.org/jira/browse/FLINK-14956
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Nicholas Jiang
>Priority: Minor
> Attachments: CPU-IO.png, Compress-Read.png, Compress-Write.png
>
>
> MemoryMappedBoundedData, an implementation of BoundedData backed simply by 
> memory-based ByteBuffers, uses CompressedBufferSlicer, an implementation of 
> BoundedData.Reader that decompresses and slices the next buffer. 
> CompressedBufferSlicer reads BoundedData by decompressing the byte buffer 
> with LZ4SafeDecompressor. When FileChannelMemoryMappedBoundedData tries to 
> write a buffer, it uses LZ4Compressor to compress the buffer to improve I/O 
> performance.
> Compress read process:
> !Compress-Read.png|width=556,height=251!
> Compress write process:
> !Compress-Write.png|width=278,height=261!
> CPU/IO performance comparison chart:
> !CPU-IO.png|width=416,height=312!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-14956) MemoryMappedBoundedData Compressed Buffer Slicer

2019-11-28 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984777#comment-16984777
 ] 

zhijiang commented on FLINK-14956:
--

I will close this duplicate issue to focus on FLINK-14845 only.

> MemoryMappedBoundedData Compressed Buffer Slicer
> 
>
> Key: FLINK-14956
> URL: https://issues.apache.org/jira/browse/FLINK-14956
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Nicholas Jiang
>Priority: Minor
> Attachments: CPU-IO.png, Compress-Read.png, Compress-Write.png
>
>
> MemoryMappedBoundedData, an implementation of BoundedData backed simply by 
> memory-based ByteBuffers, uses CompressedBufferSlicer, an implementation of 
> BoundedData.Reader that decompresses and slices the next buffer. 
> CompressedBufferSlicer reads BoundedData by decompressing the byte buffer 
> with LZ4SafeDecompressor. When FileChannelMemoryMappedBoundedData tries to 
> write a buffer, it uses LZ4Compressor to compress the buffer to improve I/O 
> performance.
> Compress read process:
> !Compress-Read.png|width=556,height=251!
> Compress write process:
> !Compress-Write.png|width=278,height=261!
> CPU/IO performance comparison chart:
> !CPU-IO.png|width=416,height=312!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-14920) Set up environment to run performance e2e tests

2019-11-22 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-14920:


Assignee: Aihua Li

> Set up environment to run performance e2e tests
> ---
>
> Key: FLINK-14920
> URL: https://issues.apache.org/jira/browse/FLINK-14920
> Project: Flink
>  Issue Type: Sub-task
>  Components: Test Infrastructure
>Reporter: Yu Li
>Assignee: Aihua Li
>Priority: Major
> Fix For: 1.10.0
>
>
> As proposed in 
> [FLIP-83|https://cwiki.apache.org/confluence/display/FLINK/FLIP-83%3A+Flink+End-to-end+Performance+Testing+Framework],
>  we need to complete below tasks here:
> * Prepare a small cluster and setup environments to run the tests
> * Setup Jenkins to trigger the performance e2e tests
> * Report the result to [code-speed center|http://codespeed.dak8s.net:8000] to 
> show the comparison.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-14919) Add performance e2e test suite for basic operations

2019-11-22 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-14919:


Assignee: Aihua Li

> Add performance e2e test suite for basic operations
> ---
>
> Key: FLINK-14919
> URL: https://issues.apache.org/jira/browse/FLINK-14919
> Project: Flink
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Yu Li
>Assignee: Aihua Li
>Priority: Major
> Fix For: 1.10.0
>
>
> Add the test suite for basic Flink operations as proposed in 
> [FLIP-83|https://cwiki.apache.org/confluence/display/FLINK/FLIP-83%3A+Flink+End-to-end+Performance+Testing+Framework].
>  We target completing the work before 1.10 so we could use it to guard the 
> release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-14918) Add performance e2e test module and scripts

2019-11-22 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-14918:


Assignee: Aihua Li

> Add performance e2e test module and scripts
> ---
>
> Key: FLINK-14918
> URL: https://issues.apache.org/jira/browse/FLINK-14918
> Project: Flink
>  Issue Type: Sub-task
>  Components: Test Infrastructure
>Reporter: Yu Li
>Assignee: Aihua Li
>Priority: Major
> Fix For: 1.10.0
>
>
> As proposed in FLIP-83, create a separate directory/module in parallel with 
> flink-end-to-end-tests, with the name of flink-end-to-end-perf-tests, and add 
> necessary scripts to form the framework.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-14928) Documentation links check nightly run failed on travis

2019-11-22 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-14928:


Assignee: Congxian Qiu(klion26)  (was: Yu Li)

> Documentation links check nightly run failed on travis
> --
>
> Key: FLINK-14928
> URL: https://issues.apache.org/jira/browse/FLINK-14928
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Yu Li
>Assignee: Congxian Qiu(klion26)
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> This test stage fails stably, with below error:
> {noformat}
> [2019-11-22 03:10:45] ERROR `/dev/table/udfs.html' not found.
> [2019-11-22 03:10:46] ERROR `/dev/table/functions.html' not found.
> [2019-11-22 03:10:49] ERROR 
> `/zh/getting-started/tutorials/datastream_api.html' not found.
> [2019-11-22 03:10:49] ERROR 
> `/dev/table/functions/streaming/query_configuration.html' not found.
> [2019-11-22 03:10:49] ERROR `/dev/table/functions/sql.html' not found.
> [2019-11-22 03:10:49] ERROR `/dev/table/functions/tableApi.html' not found.
> [2019-11-22 03:10:49] ERROR `/zh/dev/table/udfs.html' not found.
> [2019-11-22 03:10:49] ERROR `/zh/dev/table/functions.html' not found.
> [2019-11-22 03:10:49] ERROR 
> `/zh/dev/table/functions/streaming/query_configuration.html' not found.
> [2019-11-22 03:10:49] ERROR `/zh/dev/table/functions/sql.html' not found.
> [2019-11-22 03:10:49] ERROR `/zh/dev/table/functions/tableApi.html' not found.
> http://localhost:4000/dev/table/udfs.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/dev/table/functions.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/zh/getting-started/tutorials/datastream_api.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/dev/table/functions/streaming/query_configuration.html:
> Remote file does not exist -- broken link!!!
> http://localhost:4000/dev/table/functions/sql.html:
> Remote file does not exist -- broken link!!!
> http://localhost:4000/dev/table/functions/tableApi.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/zh/dev/table/udfs.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/zh/dev/table/functions.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/zh/dev/table/functions/streaming/query_configuration.html:
> Remote file does not exist -- broken link!!!
> http://localhost:4000/zh/dev/table/functions/sql.html:
> Remote file does not exist -- broken link!!!
> http://localhost:4000/zh/dev/table/functions/tableApi.html:
> Remote file does not exist -- broken link!!!
> ---
> Found 11 broken links.
> {noformat}
> And here is the latest instance: 
> https://api.travis-ci.org/v3/job/615032410/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-14928) Documentation links check nightly run failed on travis

2019-11-22 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-14928:


Assignee: Yu Li

> Documentation links check nightly run failed on travis
> --
>
> Key: FLINK-14928
> URL: https://issues.apache.org/jira/browse/FLINK-14928
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.10.0
>Reporter: Yu Li
>Assignee: Yu Li
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> This test stage fails stably, with below error:
> {noformat}
> [2019-11-22 03:10:45] ERROR `/dev/table/udfs.html' not found.
> [2019-11-22 03:10:46] ERROR `/dev/table/functions.html' not found.
> [2019-11-22 03:10:49] ERROR `/zh/getting-started/tutorials/datastream_api.html' not found.
> [2019-11-22 03:10:49] ERROR `/dev/table/functions/streaming/query_configuration.html' not found.
> [2019-11-22 03:10:49] ERROR `/dev/table/functions/sql.html' not found.
> [2019-11-22 03:10:49] ERROR `/dev/table/functions/tableApi.html' not found.
> [2019-11-22 03:10:49] ERROR `/zh/dev/table/udfs.html' not found.
> [2019-11-22 03:10:49] ERROR `/zh/dev/table/functions.html' not found.
> [2019-11-22 03:10:49] ERROR `/zh/dev/table/functions/streaming/query_configuration.html' not found.
> [2019-11-22 03:10:49] ERROR `/zh/dev/table/functions/sql.html' not found.
> [2019-11-22 03:10:49] ERROR `/zh/dev/table/functions/tableApi.html' not found.
> http://localhost:4000/dev/table/udfs.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/dev/table/functions.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/zh/getting-started/tutorials/datastream_api.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/dev/table/functions/streaming/query_configuration.html:
> Remote file does not exist -- broken link!!!
> http://localhost:4000/dev/table/functions/sql.html:
> Remote file does not exist -- broken link!!!
> http://localhost:4000/dev/table/functions/tableApi.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/zh/dev/table/udfs.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/zh/dev/table/functions.html:
> Remote file does not exist -- broken link!!!
> --
> http://localhost:4000/zh/dev/table/functions/streaming/query_configuration.html:
> Remote file does not exist -- broken link!!!
> http://localhost:4000/zh/dev/table/functions/sql.html:
> Remote file does not exist -- broken link!!!
> http://localhost:4000/zh/dev/table/functions/tableApi.html:
> Remote file does not exist -- broken link!!!
> ---
> Found 11 broken links.
> {noformat}
> And here is the latest instance: 
> https://api.travis-ci.org/v3/job/615032410/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (FLINK-14926) Make sure no resource leak of RocksObject

2019-11-22 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-14926:


Assignee: Yu Li

> Make sure no resource leak of RocksObject
> -
>
> Key: FLINK-14926
> URL: https://issues.apache.org/jira/browse/FLINK-14926
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Affects Versions: 1.6.4, 1.7.2, 1.8.2, 1.9.1
>Reporter: Yu Li
>Assignee: Yu Li
>Priority: Major
>
> When investigating FLINK-14484 to allow setting {{WriteBufferManager}} via 
> options, we find it necessary to supply a {{close}} method in 
> {{OptionsFactory}} to make sure resources configured in options, such as 
> {{WriteBufferManager}}, are released.
> What's more, there is a potential risk of resource leaks in 
> {{PredefinedOptions}}: for example, the created [BloomFilter 
> instance|https://github.com/apache/flink/blob/e14bd50fed42e89674ba9d01a231d5c5e59f490c/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/PredefinedOptions.java#L155]
>  is never closed (a regression introduced by the FLINK-7220 changes). We 
> should also fix this and add a {{close}} method to {{PredefinedOptions}}.
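The pattern proposed above — a factory that remembers which native-backed objects it created and releases them all in a single {{close}} call — can be sketched as follows. This is an illustrative sketch only: the class and method names are hypothetical, and a minimal {{Handle}} class stands in for a RocksDB {{RocksObject}} such as {{WriteBufferManager}} or {{BloomFilter}}; it is not Flink's actual API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Illustrative sketch of the proposed close() pattern for an options
 * factory. Names are hypothetical; Handle stands in for a native-backed
 * RocksObject (e.g. WriteBufferManager, BloomFilter).
 */
public class ClosingOptionsFactory implements AutoCloseable {

    /** Stand-in for a native resource that must be explicitly released. */
    static class Handle implements AutoCloseable {
        boolean closed;
        @Override
        public void close() {
            closed = true;
        }
    }

    // Track every handle created so close() can release all of them.
    private final Deque<Handle> created = new ArrayDeque<>();

    Handle createWriteBufferManager() {
        Handle handle = new Handle();
        created.push(handle);
        return handle;
    }

    @Override
    public void close() {
        // Release in reverse creation order, as native wrappers often require.
        while (!created.isEmpty()) {
            created.pop().close();
        }
    }

    public static void main(String[] args) {
        Handle wbm;
        try (ClosingOptionsFactory factory = new ClosingOptionsFactory()) {
            wbm = factory.createWriteBufferManager();
        }
        System.out.println(wbm.closed); // prints "true": released by factory.close()
    }
}
```

Because the factory implements {{AutoCloseable}}, callers can rely on try-with-resources instead of remembering to release each configured object individually.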



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

