[jira] [Updated] (FLINK-10653) Introduce Pluggable Shuffle Service Architecture

2019-09-01 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-10653:
-
Summary: Introduce Pluggable Shuffle Service Architecture  (was: Introduce 
Pluggable Shuffle Manager Architecture)

> Introduce Pluggable Shuffle Service Architecture
> 
>
> Key: FLINK-10653
> URL: https://issues.apache.org/jira/browse/FLINK-10653
> Project: Flink
>  Issue Type: New Feature
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Major
>
> This is the umbrella issue for improving shuffle architecture.
> Shuffle is the process of transferring data between stages, which involves 
> writing outputs on the sender side and reading data on the receiver side. In 
> the Flink implementation it covers three separate parts, the writer, the 
> transport layer and the reader, which are unified for both streaming and 
> batch jobs.
> In detail, the current ResultPartitionWriter interface on the upstream side 
> only supports in-memory outputs for streaming jobs and local persistent file 
> outputs for batch jobs. If we implement another writer such as a DfsWriter, 
> RdmaWriter, SortMergeWriter, etc. based on the ResultPartitionWriter 
> interface, there is no unified mechanism to extend the reader side 
> accordingly.
> In order to make the shuffle architecture more flexible and support more 
> scenarios, especially for batch jobs, a high-level shuffle architecture is 
> necessary to manage and extend both the writer and reader sides together.
> Refer to the design doc for more details.
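The high-level idea above, a single component that owns both the writer and the reader side so both can be extended together, can be sketched roughly as follows. All names here are hypothetical illustrations, not the eventual Flink API; see the design doc for the real interfaces.

```java
// Hypothetical sketch: a shuffle "service" creates matching writer and reader
// ends, so adding a new writer kind (DFS, RDMA, sort-merge, ...) automatically
// brings a matching reader. Names are illustrative, not the actual Flink API.
import java.util.ArrayDeque;
import java.util.Queue;

interface ShuffleWriter<T> { void write(T record); }

interface ShuffleReader<T> { T poll(); } // returns null when nothing is buffered

/** Owns both ends of the shuffle, so they are always extended in lockstep. */
interface ShuffleService<T> {
    ShuffleWriter<T> createWriter();
    ShuffleReader<T> createReader();
}

/** Minimal in-memory implementation, standing in for a DFS- or RDMA-based one. */
class InMemoryShuffleService<T> implements ShuffleService<T> {
    private final Queue<T> channel = new ArrayDeque<>();

    public ShuffleWriter<T> createWriter() { return channel::add; }
    public ShuffleReader<T> createReader() { return channel::poll; }
}
```

With such a factory, swapping the shuffle implementation is a matter of binding a different `ShuffleService`, without touching writer and reader call sites separately.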



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (FLINK-14004) Define SourceReader interface to verify the integration with StreamOneInputProcessor

2019-09-08 Thread zhijiang (Jira)
zhijiang created FLINK-14004:


 Summary: Define SourceReader interface to verify the integration 
with StreamOneInputProcessor
 Key: FLINK-14004
 URL: https://issues.apache.org/jira/browse/FLINK-14004
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Task
Reporter: zhijiang
Assignee: zhijiang


We already refactored the task input and output sides based on the new source 
characteristics in FLIP-27. In order to further verify that the new source 
reader works well with the unified StreamOneInputProcessor in the mailbox 
model, we will design a unit test that integrates the whole process. In detail:
 * Define SourceReader and SourceOutput relevant interfaces based on FLIP-27

 * Implement an example of stateless SourceReader (bounded sequence of integers)

 * Define SourceReaderOperator to integrate the SourceReader with 
StreamOneInputProcessor

 * Define SourceReaderStreamTask to execute the source input and implement a 
unit test for it.
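The first two bullet points above can be sketched roughly as follows. The interface shapes are simplified from the FLIP-27 proposal and are not the final API.

```java
// Simplified FLIP-27-style sketch: a SourceReader emits records into a
// SourceOutput and reports its status; shown with a stateless bounded reader
// over a sequence of integers. Illustrative only, not the final Flink API.
enum InputStatus { MORE_AVAILABLE, END_OF_INPUT }

interface SourceOutput<T> { void collect(T record); }

interface SourceReader<T> {
    /** Emits at most one record and reports whether more input remains. */
    InputStatus pollNext(SourceOutput<T> output);
}

/** Stateless bounded source: emits the integers in [from, to). */
class BoundedIntegerReader implements SourceReader<Integer> {
    private int next;
    private final int end;

    BoundedIntegerReader(int from, int to) {
        this.next = from;
        this.end = to;
    }

    @Override
    public InputStatus pollNext(SourceOutput<Integer> output) {
        if (next >= end) {
            return InputStatus.END_OF_INPUT;
        }
        output.collect(next++);
        return next < end ? InputStatus.MORE_AVAILABLE : InputStatus.END_OF_INPUT;
    }
}
```

A SourceReaderOperator would then adapt such a reader to the input side of StreamOneInputProcessor, and the SourceReaderStreamTask drives `pollNext` until it returns `END_OF_INPUT`.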



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-13992) Refactor Optional parameter in InputGateWithMetrics#updateMetrics

2019-09-11 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-13992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928140#comment-16928140
 ] 

zhijiang commented on FLINK-13992:
--

Sorry for the late response; I have been traveling these days. I guess you can 
use your permissions to assign it to yourself now. :)

> Refactor Optional parameter in InputGateWithMetrics#updateMetrics
> -
>
> Key: FLINK-13992
> URL: https://issues.apache.org/jira/browse/FLINK-13992
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.10.0
>Reporter: TisonKun
>Priority: Major
> Fix For: 1.10.0
>
>
> As per the consensus from the community code style discussion, in 
> {{InputGateWithMetrics#updateMetrics}} we can refactor to reduce the usage of 
> Optional parameters.
> cc [~azagrebin]
> {code:java}
> diff --git a/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java b/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java
> index 5d2cfd95c4..e548fbf02b 100644
> --- a/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java
> +++ b/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/InputGateWithMetrics.java
> @@ -24,6 +24,8 @@ import org.apache.flink.runtime.io.network.partition.consumer.BufferOrEvent;
>  import org.apache.flink.runtime.io.network.partition.consumer.InputGate;
>  import org.apache.flink.runtime.metrics.groups.TaskIOMetricGroup;
>  
> +import javax.annotation.Nonnull;
> +
>  import java.io.IOException;
>  import java.util.Optional;
>  import java.util.concurrent.CompletableFuture;
> @@ -67,12 +69,12 @@ public class InputGateWithMetrics extends InputGate {
>  
>  	@Override
>  	public Optional<BufferOrEvent> getNext() throws IOException, InterruptedException {
> -		return updateMetrics(inputGate.getNext());
> +		return inputGate.getNext().map(this::updateMetrics);
>  	}
>  
>  	@Override
>  	public Optional<BufferOrEvent> pollNext() throws IOException, InterruptedException {
> -		return updateMetrics(inputGate.pollNext());
> +		return inputGate.pollNext().map(this::updateMetrics);
>  	}
>  
>  	@Override
> @@ -85,8 +87,8 @@ public class InputGateWithMetrics extends InputGate {
>  		inputGate.close();
>  	}
>  
> -	private Optional<BufferOrEvent> updateMetrics(Optional<BufferOrEvent> bufferOrEvent) {
> -		bufferOrEvent.ifPresent(b -> numBytesIn.inc(b.getSize()));
> +	private BufferOrEvent updateMetrics(@Nonnull BufferOrEvent bufferOrEvent) {
> +		numBytesIn.inc(bufferOrEvent.getSize());
>  		return bufferOrEvent;
>  	}
>  }
> {code}
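The refactoring in the diff reduces to a general pattern: instead of passing an Optional into a helper that unwraps it, let the helper take the value directly and apply it via Optional#map. A minimal standalone sketch with illustrative names:

```java
// Illustrative sketch of the Optional-parameter refactor from the diff above.
import java.util.Optional;

class ByteCounter {
    long bytes;

    // Before: the helper accepts an Optional parameter and unwraps it itself.
    Optional<String> updateOld(Optional<String> maybe) {
        maybe.ifPresent(s -> bytes += s.length());
        return maybe;
    }

    // After: the helper takes the value directly; callers use Optional.map,
    // so emptiness is handled in exactly one place.
    String update(String s) {
        bytes += s.length();
        return s;
    }
}
```

The `map` form keeps the helper's contract non-null (reinforced by `@Nonnull` in the diff), which is what the code style discussion recommends.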



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (FLINK-13767) Refactor StreamInputProcessor#processInput based on InputStatus

2019-09-12 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-13767:
-
Summary: Refactor StreamInputProcessor#processInput based on InputStatus  
(was: Migrate isFinished method from AvailabilityListener to AsyncDataInput)

> Refactor StreamInputProcessor#processInput based on InputStatus
> ---
>
> Key: FLINK-13767
> URL: https://issues.apache.org/jira/browse/FLINK-13767
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network, Runtime / Task
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> AvailabilityListener is used in both AsyncDataInput and StreamTaskInput. We 
> already introduced InputStatus for PushBasedAsyncDataInput#emitNext, and 
> InputStatus#END_OF_INPUT has the same semantics as 
> AvailabilityListener#isFinished.
> But for the case of AsyncDataInput, which is mainly used by the InputGate 
> layer, the isFinished() method is still needed at the moment. So we migrate 
> this method from AvailabilityListener to AsyncDataInput, and refactor the 
> StreamInputProcessor implementations to use InputStatus to judge the 
> finished state.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Updated] (FLINK-13767) Refactor StreamInputProcessor#processInput based on InputStatus

2019-09-12 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-13767:
-
Description: 
StreamInputProcessor#processInput could return an InputStatus instead of the 
current boolean value to stay consistent with PushingAsyncDataInput#emitNext.

For the implementation of StreamTwoInputProcessor#processInput, we could 
maintain and judge the two input statuses together with the next selected 
input index to determine the final precise status. This way we avoid invalid 
processInput calls except for the first call.

In addition, AvailabilityProvider#isFinished duplicates the semantics of 
InputStatus#END_OF_INPUT for PushingAsyncDataInput, and it is now only 
meaningful for PullingAsyncDataInput. So we migrate the #isFinished method 
from AvailabilityProvider to PullingAsyncDataInput.

  was:
AvailabilityListener is used in both AsyncDataInput and StreamTaskInput. We 
already introduced InputStatus for PushBasedAsyncDataInput#emitNext, and 
InputStatus#END_OF_INPUT has the same semantics as 
AvailabilityListener#isFinished.

But for the case of AsyncDataInput, which is mainly used by the InputGate 
layer, the isFinished() method is still needed at the moment. So we migrate 
this method from AvailabilityListener to AsyncDataInput, and refactor the 
StreamInputProcessor implementations to use InputStatus to judge the finished 
state.


> Refactor StreamInputProcessor#processInput based on InputStatus
> ---
>
> Key: FLINK-13767
> URL: https://issues.apache.org/jira/browse/FLINK-13767
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network, Runtime / Task
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> StreamInputProcessor#processInput could return an InputStatus instead of the 
> current boolean value to stay consistent with PushingAsyncDataInput#emitNext.
> For the implementation of StreamTwoInputProcessor#processInput, we could 
> maintain and judge the two input statuses together with the next selected 
> input index to determine the final precise status. This way we avoid invalid 
> processInput calls except for the first call.
>  In addition, AvailabilityProvider#isFinished duplicates the semantics of 
> InputStatus#END_OF_INPUT for PushingAsyncDataInput, and it is now only 
> meaningful for PullingAsyncDataInput. So we migrate the #isFinished method 
> from AvailabilityProvider to PullingAsyncDataInput.
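A minimal sketch of the status-based loop described above, using simplified stand-in types rather than the actual Flink classes: processInput returns an InputStatus, and a two-input processor derives the combined status from both inputs.

```java
// Illustrative stand-ins for StreamInputProcessor/InputStatus: processInput
// emits one record per call and reports the combined status of both inputs,
// so the caller loops on the status instead of a boolean.
enum InputStatus { MORE_AVAILABLE, END_OF_INPUT }

class TwoInputProcessor {
    private final java.util.Iterator<Integer> in1, in2;
    private final java.util.List<Integer> emitted = new java.util.ArrayList<>();

    TwoInputProcessor(java.util.List<Integer> a, java.util.List<Integer> b) {
        in1 = a.iterator();
        in2 = b.iterator();
    }

    java.util.List<Integer> emitted() { return emitted; }

    /** Processes at most one record and reports the combined input status. */
    InputStatus processInput() {
        if (in1.hasNext()) {
            emitted.add(in1.next());
        } else if (in2.hasNext()) {
            emitted.add(in2.next());
        }
        return (in1.hasNext() || in2.hasNext())
                ? InputStatus.MORE_AVAILABLE
                : InputStatus.END_OF_INPUT;
    }
}
```

Because the status already encodes "both inputs finished", the driving loop needs no separate isFinished() check, which mirrors the migration described in the issue.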



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-14087) throws java.lang.ArrayIndexOutOfBoundsException when emitting data using RebalancePartitioner.

2019-09-16 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930432#comment-16930432
 ] 

zhijiang commented on FLINK-14087:
--

Thanks for reporting this [~jiangyu].

We actually refactored the internal logic of RebalancePartitioner a while ago. 
In the past it would check whether the selected channel index was beyond the 
number of channels before returning; if it exceeded, it would return the first 
channel index instead.

But this case breaks my previous assumption, since I did not expect multiple 
RecordWriter instances to share the same ChannelSelector instance. I need to 
double-check the StreamGraph generation process to see whether this case can 
happen. It seems unreasonable to share the same internal state across 
different output edges; if shared, the distribution within one output edge is 
not strictly rebalanced for different parallelisms.

Could you share your topology structure or the code you use to submit the job, 
so I can more easily debug the StreamGraph generation process?
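The previous guard described above can be sketched as follows (an illustrative stand-in, not the actual Flink code): the selector advances round-robin but falls back to channel 0 whenever the index would run past the configured number of channels.

```java
// Illustrative sketch of the old defensive behavior in a rebalance-style
// channel selector: out-of-range indices fall back to the first channel.
class RoundRobinSelector {
    private int next = -1;
    private final int numberOfChannels;

    RoundRobinSelector(int numberOfChannels) {
        this.numberOfChannels = numberOfChannels;
    }

    /** Old behavior: advance round-robin, but return channel 0 whenever the
     *  index would exceed the configured number of channels. */
    int selectChannel() {
        next++;
        if (next >= numberOfChannels) {
            next = 0; // the guard that hid the sharing bug
        }
        return next;
    }
}
```

With this guard in place, a shared selector configured for fewer channels than a writer actually has never produced an out-of-range index, which is why the sharing problem stayed hidden until the check was removed.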

> throws java.lang.ArrayIndexOutOfBoundsException when emitting data using 
> RebalancePartitioner.
> ---
>
> Key: FLINK-14087
> URL: https://issues.apache.org/jira/browse/FLINK-14087
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: luojiangyu
>Priority: Major
>
> There is a condition where RecordWriters share the same ChannelSelector 
> instance.
> When two RecordWriter instances share the same ChannelSelector instance, it 
> may throw a java.lang.ArrayIndexOutOfBoundsException. For example, two 
> RecordWriter instances share the same RebalancePartitioner instance: the 
> RebalancePartitioner instance is set up with 2 channels when the first 
> RecordWriter initializes, then the same RebalancePartitioner instance is set 
> up with 3 channels when the second RecordWriter initializes. This throws an 
> ArrayIndexOutOfBoundsException when the first RecordWriter instance emits 
> data.
> The exception looks like:
> |java.lang.RuntimeException: 2 at 
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:112)
>  at 
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:91)
>  at 
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:47)
>  at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:673)
>  at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.collect(OperatorChain.java:617)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:726)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:699)
>  at 
> org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collect(StreamSourceContexts.java:104)
>  at 
> com.xx.flink.demo.wordcount.case3.StateTest$Source.run(StateTest.java:107) at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:94)
>  at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
>  at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)
>  at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
>  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:734) at 
> java.lang.Thread.run(Thread.java:748) Caused by: 
> java.lang.ArrayIndexOutOfBoundsException: 2 at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.getBufferBuilder(RecordWriter.java:255)
>  at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.copyFromSerializerToTargetChannel(RecordWriter.java:177)
>  at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:162)
>  at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.emit(RecordWriter.java:128)
>  at 
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:109)
>  ... 14 more|



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-14087) throws java.lang.ArrayIndexOutOfBoundsException when emitting data using RebalancePartitioner.

2019-09-17 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931171#comment-16931171
 ] 

zhijiang commented on FLINK-14087:
--

Thanks for the further offline confirmation [~jiangyu].

After checking the relevant process in StreamGraph, the StreamPartitioner is 
indeed shared across different edges via the {{virtualPartitionNodes}} 
structure. IMO this sharing mechanism is not very reasonable, for two reasons:
 * From the semantic aspect, it performs a global rebalance among different 
stream edges, which can actually cause an imbalanced distribution among the 
parallel instances of a single edge. It makes more sense to rebalance within 
the scope of one edge.
 * It limits the runtime implementation, or rather assumes that the runtime 
implementation maintains no state; otherwise it brings unexpected behavior. 
The previous version of RebalancePartitioner checked on every call whether the 
selected channel index was beyond the number of channels, which hid this 
potential issue, no matter whether the behavior was actually balanced or not. 
The latest refactoring removed this check and made the number of channels a 
property of the partitioner, which exposed this bug.

Considering the solution, I prefer to adjust the stream graph generation to 
build a separate partitioner instance for every stream edge. Are there any 
other suggestions or inputs [~pnowojski]?
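The failure mode analyzed above can be reproduced in miniature with stand-in classes (not the real RecordWriter/ChannelSelector): two writers with different channel counts share one selector, the later setup overwrites the channel count, and the smaller writer is eventually handed an out-of-range index.

```java
// Illustrative reproduction of the sharing bug: the shared selector keeps one
// channel count, so whichever writer set it up last wins, and the other
// writer can receive an index beyond its own channel array.
class SharedSelector {
    private int next = -1;
    private int numberOfChannels;

    /** Each writer calls setup; for a shared instance the last call wins. */
    void setup(int numberOfChannels) { this.numberOfChannels = numberOfChannels; }

    int selectChannel() { return next = (next + 1) % numberOfChannels; }
}

class MiniWriter {
    private final int[] channels;
    private final SharedSelector selector;

    MiniWriter(int channelCount, SharedSelector selector) {
        this.channels = new int[channelCount];
        this.selector = selector;
        selector.setup(channelCount);
    }

    /** Throws ArrayIndexOutOfBoundsException if the shared selector returns
     *  an index >= channels.length. */
    void emit() { channels[selector.selectChannel()]++; }
}
```

Giving each stream edge its own partitioner instance, as proposed above, removes the shared mutable state and with it this failure mode.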

> throws java.lang.ArrayIndexOutOfBoundsException when emitting data using 
> RebalancePartitioner.
> ---
>
> Key: FLINK-14087
> URL: https://issues.apache.org/jira/browse/FLINK-14087
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: luojiangyu
>Priority: Major
> Attachments: image-2019-09-16-19-14-39-403.png, 
> image-2019-09-16-19-15-34-639.png
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

[jira] [Commented] (FLINK-14087) throws java.lang.ArrayIndexOutOfBoundsException when emitting data using RebalancePartitioner.

2019-09-17 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931227#comment-16931227
 ] 

zhijiang commented on FLINK-14087:
--

Thanks for the reply [~pnowojski]!

Unless there are requirements to share information among different edges for a 
special partitioner, such as a user's CustomPartitioner, the existing 
StreamPartitioner implementations seem to be independent across different 
edges.

Do you think it is feasible to generate a separate partitioner instance for 
each edge? [~aljoscha]

> throws java.lang.ArrayIndexOutOfBoundsException when emitting data using 
> RebalancePartitioner.
> ---
>
> Key: FLINK-14087
> URL: https://issues.apache.org/jira/browse/FLINK-14087
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: luojiangyu
>Priority: Major
> Attachments: image-2019-09-16-19-14-39-403.png, 
> image-2019-09-16-19-15-34-639.png
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Comment Edited] (FLINK-14087) throws java.lang.ArrayIndexOutOfBoundsException when emitting data using RebalancePartitioner.

2019-09-17 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931289#comment-16931289
 ] 

zhijiang edited comment on FLINK-14087 at 9/17/19 10:48 AM:


Thanks for the quick response [~aljoscha]!

The topology from [~jiangyu] is like this :

DataStream dataStream = env.addSource().rebalance();
 dataStream.map().setParallelism(2).filter();
 dataStream.map().setParallelism(3).sink();

 

In StreamGraph#addVirtualPartitionNode, the {{RebalancePartitioner}} is cached 
inside the {{virtualPartitionNodes}} structure.

And during StreamGraph#addEdgeInternal(), the same {{RebalancePartitioner}} 
instance is fetched from {{virtualPartitionNodes}} while adding different 
edges.


was (Author: zjwang):
Thanks for the quick response [~aljoscha]!

The topology from [~jiangyu] is like this :

DataStream dataStream = env.addSource().rebalance();
dataStream.map().setParallelism(2).filter();
dataStream.map().setParallelism(3).sink();

 

In StreamGraph#addVirtualPartitionNode, the {{RebalancePartitioner}} is cached 
inside the structure of {{virtualPartitionNodes}}.

And during StreamGraph#addEdgeInternal(), the same {{RebalancePartitioner}} 
instance would be fetched from {{virtualPartitionNodes}} while adding different 
edges.

> throws java.lang.ArrayIndexOutOfBoundsException when emitting data using 
> RebalancePartitioner.
> ---
>
> Key: FLINK-14087
> URL: https://issues.apache.org/jira/browse/FLINK-14087
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: luojiangyu
>Priority: Major
> Attachments: image-2019-09-16-19-14-39-403.png, 
> image-2019-09-16-19-15-34-639.png
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-14087) throws java.lang.ArrayIndexOutOfBoundsException when emitting data using RebalancePartitioner.

2019-09-17 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931289#comment-16931289
 ] 

zhijiang commented on FLINK-14087:
--

Thanks for the quick response [~aljoscha]!

The topology from [~jiangyu] is like this :

DataStream dataStream = env.addSource().rebalance();
dataStream.map().setParallelism(2).filter();
dataStream.map().setParallelism(3).sink();

 

In StreamGraph#addVirtualPartitionNode, the {{RebalancePartitioner}} is cached 
inside the {{virtualPartitionNodes}} structure.

And during StreamGraph#addEdgeInternal(), the same {{RebalancePartitioner}} 
instance is fetched from {{virtualPartitionNodes}} while adding different 
edges.

> throws java.lang.ArrayIndexOutOfBoundsException when emitting data using 
> RebalancePartitioner.
> ---
>
> Key: FLINK-14087
> URL: https://issues.apache.org/jira/browse/FLINK-14087
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.8.0, 1.8.1, 1.9.0
>Reporter: luojiangyu
>Priority: Major
> Attachments: image-2019-09-16-19-14-39-403.png, 
> image-2019-09-16-19-15-34-639.png
>



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (FLINK-14124) potential memory leak in netty server

2019-09-19 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933568#comment-16933568
 ] 

zhijiang commented on FLINK-14124:
--

I think it would be better for you to upgrade the Flink version. 1.4.2 is quite 
old, and there have been many improvements in the network stack since 1.5. In 
particular, on the netty server side we now avoid copying data from Flink 
buffers into netty ByteBuffers, which saves some of the direct memory used by 
the netty server.

So my suggestion is to upgrade to the latest Flink version if possible and 
verify whether this issue still exists.
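For illustration, the kind of copy the newer network stack avoids can be 
sketched in plain JDK terms. This is a hypothetical analogy using java.nio, not 
Flink's actual netty ByteBuf code:

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of copy vs. wrap: the older network path copied
// serialized bytes from a Flink buffer into a fresh netty buffer, while the
// newer path wraps the existing memory so no extra allocation or copy is made.
public class ZeroCopySketch {

    // "Copy" path: allocates a second buffer and duplicates the payload.
    static ByteBuffer copyInto(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(payload.length); // extra allocation
        buf.put(payload);                                     // extra copy
        buf.flip();
        return buf;
    }

    // "Wrap" path: the buffer shares the payload's backing array, zero copy.
    static ByteBuffer wrap(byte[] payload) {
        return ByteBuffer.wrap(payload);
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3};
        // The copied buffer has its own storage; the wrapped one shares it.
        System.out.println("copied shares storage:  " + (copyInto(data).array() == data));
        System.out.println("wrapped shares storage: " + (wrap(data).array() == data));
    }
}
```

Running it prints `false` for the copy path and `true` for the wrap path, which 
is the memory-saving difference described above.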

> potential memory leak in netty server
> -
>
> Key: FLINK-14124
> URL: https://issues.apache.org/jira/browse/FLINK-14124
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.6.3
>Reporter: YufeiLiu
>Priority: Critical
> Attachments: image-2019-09-19-15-53-32-294.png, screenshot-1.png, 
> screenshot-2.png
>
>
> I have a job running on Flink 1.4.2; the end of the pipeline uses the Phoenix 
> JDBC driver to write records into Apache Phoenix.
> _mqStream
> .keyBy(0)
> .window(TumblingProcessingTimeWindows.of(Time.of(300, 
> TimeUnit.SECONDS)))
> .process(new MyProcessWindowFunction())
> .addSink(new PhoenixSinkFunction());_
> But the off-heap memory of the sink subtask's TaskManager keeps increasing, 
> most likely caused by DirectByteBuffer.
> Analyzing the heap dump, I found hundreds of DirectByteBuffer objects, each 
> of them referencing over 3MB of memory, and all of them linked to a Flink 
> Netty server thread.
>  !image-2019-09-19-15-53-32-294.png! 
> It only happens in the sink task; other nodes work fine. I thought it was a 
> Phoenix problem at first, but the heap dump shows the memory is consumed by 
> netty. I don't know much about Flink's network stack; I would appreciate it 
> if someone could tell me the likely cause or how to dig further.
>  !screenshot-1.png! 
> yarn.heap-cutoff-ratio: 0.2
> taskmanager.memory.fraction: 0.6
> taskmanager.network.numberOfBuffers: 32240
>  !screenshot-2.png! 
> I have Zookeeper, Kafka, Phoenix (HBase), and Flume dependencies in the 
> package; they all might use direct memory, but when direct memory should be 
> freed, is there something blocking the Cleaner's progress?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-10995) Copy intermediate serialization results only once for broadcast mode

2019-09-23 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-10995:
-
Component/s: (was: Runtime / Network)
 Runtime / Task

> Copy intermediate serialization results only once for broadcast mode
> 
>
> Key: FLINK-10995
> URL: https://issues.apache.org/jira/browse/FLINK-10995
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Task
>Affects Versions: 1.8.0
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The records emitted from an operator are first serialized into an 
> intermediate byte array in {{RecordSerializer}}, and the intermediate result 
> is then copied into target buffers for the different sub partitions. For 
> broadcast mode, the same intermediate result is copied as many times as the 
> number of sub partitions, which seriously affects performance in large-scale 
> jobs.
> We can instead copy into only one target buffer shared by all the sub 
> partitions to reduce this overhead. For emitting a latency marker in 
> broadcast mode, we should first flush the previously shared target buffer, 
> and then request a new buffer for the target sub partition to send the 
> latency marker.
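The copy-once idea described above can be sketched as follows. This is a 
simplified, self-contained illustration with hypothetical class and method 
names, not Flink's actual RecordWriter/RecordSerializer API:

```java
import java.nio.ByteBuffer;

// Sketch of the broadcast optimization: serialize a record once, then share
// one target buffer across all sub partitions instead of copying the
// serialized bytes once per sub partition. Names are illustrative only.
public class BroadcastCopyOnce {
    static int copies; // counts byte-array copies, for illustration

    // Naive broadcast: one copy of the serialized record per sub partition.
    static ByteBuffer[] broadcastPerChannel(byte[] serialized, int subpartitions) {
        ByteBuffer[] buffers = new ByteBuffer[subpartitions];
        for (int i = 0; i < subpartitions; i++) {
            buffers[i] = ByteBuffer.allocate(serialized.length);
            buffers[i].put(serialized); // one copy per sub partition
            copies++;
        }
        return buffers;
    }

    // Optimized broadcast: copy once into a shared buffer; each sub partition
    // gets a read-only view over the same underlying bytes.
    static ByteBuffer[] broadcastShared(byte[] serialized, int subpartitions) {
        ByteBuffer shared = ByteBuffer.allocate(serialized.length);
        shared.put(serialized); // the only copy
        copies++;
        shared.flip();
        ByteBuffer[] views = new ByteBuffer[subpartitions];
        for (int i = 0; i < subpartitions; i++) {
            views[i] = shared.asReadOnlyBuffer(); // shares content, own position
        }
        return views;
    }

    public static void main(String[] args) {
        byte[] record = "some record".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        copies = 0;
        broadcastPerChannel(record, 8);
        int naive = copies;
        copies = 0;
        broadcastShared(record, 8);
        System.out.println(naive + " copies vs " + copies); // prints "8 copies vs 1"
    }
}
```

The flush-before-latency-marker rule in the description follows from this 
sharing: once a buffer is shared by all sub partitions, a record targeted at a 
single sub partition needs its own fresh buffer.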





[jira] [Commented] (FLINK-12576) inputQueueLength metric does not work for LocalInputChannels

2019-09-23 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16936348#comment-16936348
 ] 

zhijiang commented on FLINK-12576:
--

Thanks for reporting this, [~alpinegizmo].

I want to confirm two things:

1. The input metric here is {{inputQueueLength}}?

2. Have you checked whether this problem exists before release-1.9, especially 
for the non-local case with 2 single-slot TMs?

This ticket actually made two main changes. One is to also consider the input 
metric (inputQueueLength) for local input channels. The other is that for 
remote input channels the metric value is now read outside of synchronization, 
so I wonder whether that could cause a visibility issue for the metric reporter 
thread. But it seems this issue only happens for a certain parallelism of the 
backpressure operator in your testing.

 

> inputQueueLength metric does not work for LocalInputChannels
> 
>
> Key: FLINK-12576
> URL: https://issues.apache.org/jira/browse/FLINK-12576
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Metrics, Runtime / Network
>Affects Versions: 1.6.4, 1.7.2, 1.8.0, 1.9.0
>Reporter: Piotr Nowojski
>Assignee: Aitozi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently {{inputQueueLength}} ignores LocalInputChannels 
> ({{SingleInputGate#getNumberOfQueuedBuffers}}). This can cause mistakes 
> when looking for causes of back pressure (if a task is backpressuring the 
> whole Flink job, but there is data skew and only local input channels are 
> being used).
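The gist of the bug can be sketched with a toy model. The class and method 
names below are illustrative, not Flink's actual SingleInputGate API:

```java
import java.util.List;

// Simplified model of the inputQueueLength issue: if the gate only counts
// buffers queued in remote channels, a task that reads exclusively through
// local channels reports 0 even while it is backpressured.
public class InputGateMetrics {
    static class Channel {
        final int queuedBuffers;
        final boolean remote;

        Channel(int queuedBuffers, boolean remote) {
            this.queuedBuffers = queuedBuffers;
            this.remote = remote;
        }
    }

    // Buggy variant: local channels are invisible to the metric.
    static int queueLengthRemoteOnly(List<Channel> channels) {
        return channels.stream().filter(c -> c.remote).mapToInt(c -> c.queuedBuffers).sum();
    }

    // Fixed variant: count queued buffers of every input channel.
    static int queueLengthAll(List<Channel> channels) {
        return channels.stream().mapToInt(c -> c.queuedBuffers).sum();
    }

    public static void main(String[] args) {
        List<Channel> channels = List.of(
                new Channel(5, false), // local channel holding queued data
                new Channel(3, true)); // remote channel
        System.out.println("remote-only: " + queueLengthRemoteOnly(channels)); // 3
        System.out.println("all:         " + queueLengthAll(channels));        // 8
    }
}
```

With data skew onto the local channel, the remote-only variant reports a much 
smaller queue than actually exists, which is exactly the diagnostic mistake the 
issue describes.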





[jira] [Closed] (FLINK-15069) Supplement the pipelined shuffle compression case for benchmark

2019-12-15 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15069.

Resolution: Fixed

> Supplement the pipelined shuffle compression case for benchmark
> ---
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While reviewing the 
> [PR|https://github.com/apache/flink/pull/10375#pullrequestreview-325193504] 
> introducing data compression for persistent storage and network shuffle, we 
> thought it would be better to also cover this scenario in the benchmark for 
> tracing performance issues in the future.
> This ticket supplements the compression case for pipelined partition 
> shuffle; the compression case for blocking partitions will be added in 
> [FLINK-15070|https://issues.apache.org/jira/browse/FLINK-15070]
>  





[jira] [Comment Edited] (FLINK-15069) Supplement the pipelined shuffle compression case for benchmark

2019-12-15 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993600#comment-16993600
 ] 

zhijiang edited comment on FLINK-15069 at 12/16/19 5:27 AM:


Fixed in master: 04ab225056714013fa3bf4dfb590edac7b577d03

Fixed in benchmarks repo: 0a2397a907a51608c276c39592c3b19c2455366b


was (Author: zjwang):
Fixed in master: 04ab225056714013fa3bf4dfb590edac7b577d03

> Supplement the pipelined shuffle compression case for benchmark
> ---
>
> Key: FLINK-15069
> URL: https://issues.apache.org/jira/browse/FLINK-15069
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While reviewing the 
> [PR|https://github.com/apache/flink/pull/10375#pullrequestreview-325193504] 
> introducing data compression for persistent storage and network shuffle, we 
> thought it would be better to also cover this scenario in the benchmark for 
> tracing performance issues in the future.
> This ticket supplements the compression case for pipelined partition 
> shuffle; the compression case for blocking partitions will be added in 
> [FLINK-15070|https://issues.apache.org/jira/browse/FLINK-15070]
>  





[jira] [Assigned] (FLINK-15105) Resuming Externalized Checkpoint after terminal failure (rocks, incremental) end-to-end test stalls on travis

2019-12-15 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15105:


Assignee: Congxian Qiu(klion26)

> Resuming Externalized Checkpoint after terminal failure (rocks, incremental) 
> end-to-end test stalls on travis
> -
>
> Key: FLINK-15105
> URL: https://issues.apache.org/jira/browse/FLINK-15105
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Yu Li
>Assignee: Congxian Qiu(klion26)
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> Resuming Externalized Checkpoint after terminal failure (rocks, incremental) 
> end-to-end test fails on release-1.9 nightly build stalls with "The job 
> exceeded the maximum log length, and has been terminated".
> https://api.travis-ci.org/v3/job/621090394/log.txt





[jira] [Commented] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2019-12-16 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997108#comment-16997108
 ] 

zhijiang commented on FLINK-15010:
--

Hey [~NicoK], thanks for reporting this. I want to further confirm: what is the 
mode of the Flink cluster, standalone or session? And how can I reproduce this 
issue?

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Priority: Major
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-





[jira] [Comment Edited] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2019-12-16 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997108#comment-16997108
 ] 

zhijiang edited comment on FLINK-15010 at 12/16/19 9:19 AM:


Hey [~NicoK], thanks for reporting this. I want to further confirm: what is the 
mode of the Flink cluster, standalone or session? I guess you did not start any 
jobs in the cluster? And how can I reproduce this issue?


was (Author: zjwang):
Hey [~NicoK], thanks for reporting this. I want to further confirm: what is the 
mode of the Flink cluster, standalone or session? And how can I reproduce this 
issue?

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Priority: Major
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-





[jira] [Comment Edited] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2019-12-16 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997108#comment-16997108
 ] 

zhijiang edited comment on FLINK-15010 at 12/16/19 9:26 AM:


Hey [~NicoK], thanks for reporting this. I want to further confirm: what is the 
mode of the Flink cluster, standalone or session? And how can I reproduce this 
issue?


was (Author: zjwang):
Hey [~NicoK], thanks for reporting this. I want to further confirm: what is the 
mode of the Flink cluster, standalone or session? I guess you did not start any 
jobs in the cluster? And how can I reproduce this issue?

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Priority: Major
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-





[jira] [Closed] (FLINK-13589) DelimitedInputFormat index error on multi-byte delimiters with whole file input splits

2019-12-16 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-13589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-13589.

Resolution: Fixed

Merged in master: 0bd083e5eeb5eb5adeddfbe3a9928860f3b4a6eb

Merged in release-1.9: db531e79807acba1ba28d9922bfed912fd78dd03

Merged in release-1.10: 1e716e4a43018caeb77beaa5d8f16cedfedbd887

> DelimitedInputFormat index error on multi-byte delimiters with whole file 
> input splits
> --
>
> Key: FLINK-13589
> URL: https://issues.apache.org/jira/browse/FLINK-13589
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem, Formats (JSON, Avro, Parquet, 
> ORC, SequenceFile)
>Affects Versions: 1.8.1
>Reporter: Adric Eckstein
>Assignee: Arvid Heise
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.9.2, 1.10.0
>
> Attachments: delimiter-bug.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The DelimitedInputFormat can drop bytes when using input splits that have a 
> length of -1 (for reading the whole file). It looks like this is a simple 
> bug in handling the delimiter at buffer boundaries, where the logic is 
> inconsistent for different split types.
> Attached is a possible patch with a fix and test.
>  
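Why multi-byte delimiters are fragile at buffer boundaries can be sketched with 
a minimal scanner that carries its partial-match state across buffers. This is 
hypothetical code, not the DelimitedInputFormat implementation; the restart 
logic is deliberately naive (not full KMP backtracking), which is fine for 
delimiters like "\r\n" but can miss overlapping matches for self-repeating 
delimiters:

```java
// Minimal scanner showing the boundary problem: a delimiter may start at the
// end of one buffer and finish in the next, so the matched-prefix count must
// survive across buffer refills. Losing or mishandling that state is exactly
// the kind of inconsistency that drops bytes.
public class DelimiterScanner {
    private final byte[] delimiter;
    private int matched; // delimiter bytes matched so far, carried across buffers

    DelimiterScanner(byte[] delimiter) {
        this.delimiter = delimiter;
    }

    /** Scans one buffer; returns the index just past the delimiter, or -1. */
    int feed(byte[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            if (buffer[i] == delimiter[matched]) {
                if (++matched == delimiter.length) {
                    matched = 0;
                    return i + 1; // delimiter completed inside this buffer
                }
            } else {
                // naive restart: re-check the current byte against the start
                matched = (buffer[i] == delimiter[0]) ? 1 : 0;
            }
        }
        return -1; // possibly mid-delimiter; state is kept for the next buffer
    }
}
```

Feeding `"ab\r"` and then `"\nc"` with delimiter `"\r\n"` only succeeds because 
`matched` survives the buffer boundary; a scanner that resets it per buffer 
would silently skip the record end.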





[jira] [Updated] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2019-12-17 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Summary: Remove setting of netty channel watermark and logic of writability 
changed  (was: Refactor to remove channelWritabilityChanged from 
PartitionRequestQueue)

> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After removing the non-credit-based flow control code, the related 
> channel-writability-changed logic in PartitionRequestQueue is invalid and 
> can be removed completely. Therefore we can refactor the process to simplify 
> the code.





[jira] [Updated] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2019-12-17 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Description: After removing the non-credit-based flow control code, the 
channel-writability-changed logic in PartitionRequestQueue, along with the 
setting of the channel watermark, is invalid. Therefore we can remove them 
completely to simplify the code.  (was: After removing the non-credit-based 
flow control code, the related channel-writability-changed logic in 
PartitionRequestQueue is invalid and can be removed completely. Therefore we 
can refactor the process to simplify the code.)

> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After removing the non-credit-based flow control code, the 
> channel-writability-changed logic in PartitionRequestQueue, along with the 
> setting of the channel watermark, is invalid. Therefore we can remove them 
> completely to simplify the code.





[jira] [Updated] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2019-12-17 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Parent: FLINK-7282
Issue Type: Sub-task  (was: Task)

> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After removing the non-credit-based flow control code, the 
> channel-writability-changed logic in PartitionRequestQueue, along with the 
> setting of the channel watermark, is invalid. Therefore we can remove them 
> completely to simplify the code.





[jira] [Created] (FLINK-15306) Adjust the default netty transport option from nio to auto

2019-12-17 Thread zhijiang (Jira)
zhijiang created FLINK-15306:


 Summary: Adjust the default netty transport option from nio to auto
 Key: FLINK-15306
 URL: https://issues.apache.org/jira/browse/FLINK-15306
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Reporter: zhijiang
Assignee: zhijiang
 Fix For: 1.11.0


The default value of `taskmanager.network.netty.transport` in 
NettyShuffleEnvironmentOptions is currently `nio`. As we know, the `epoll` mode 
can achieve better performance with less GC pressure and offers more advanced 
features, but it is only available on Linux.

Therefore it is better to change the default to `auto`, so that the framework 
automatically chooses the proper mode based on the platform.

We would further verify the performance effect via micro benchmark if possible.
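The intended `auto` behavior can be sketched as follows. The platform check 
here is a simplification for illustration; a real implementation would rely on 
netty's own epoll availability check (e.g. `Epoll.isAvailable()`) rather than 
parsing the OS name:

```java
import java.util.Locale;

// Sketch of resolving an "auto" transport option to a concrete netty
// transport: prefer epoll where it exists (Linux only), otherwise fall back
// to nio. Illustrative only; not Flink's actual netty setup code.
public class TransportSelector {
    enum Transport { NIO, EPOLL, AUTO }

    static Transport resolve(Transport configured, String osName) {
        if (configured != Transport.AUTO) {
            return configured; // an explicit setting always wins
        }
        // epoll is a Linux-only transport; everywhere else use nio
        return osName.toLowerCase(Locale.ROOT).contains("linux")
                ? Transport.EPOLL
                : Transport.NIO;
    }

    public static void main(String[] args) {
        System.out.println(resolve(Transport.AUTO, System.getProperty("os.name")));
    }
}
```

With this shape of logic, users on Linux silently get the faster transport 
while the configuration stays portable across platforms.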





[jira] [Commented] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2019-12-17 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998822#comment-16998822
 ] 

zhijiang commented on FLINK-15010:
--

Thanks for the information; it is easy to reproduce this issue. I will assign 
it to gaoyun to solve.

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Priority: Major
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-





[jira] [Assigned] (FLINK-15010) Temp directories flink-netty-shuffle-* are not cleaned up

2019-12-17 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15010:


Assignee: Yun Gao

> Temp directories flink-netty-shuffle-* are not cleaned up
> -
>
> Key: FLINK-15010
> URL: https://issues.apache.org/jira/browse/FLINK-15010
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Assignee: Yun Gao
>Priority: Major
>
> Starting a Flink cluster with 2 TMs and stopping it again will leave 2 
> temporary directories (and not delete them): flink-netty-shuffle-





[jira] [Closed] (FLINK-1275) Add support to compress network I/O

2019-12-18 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-1275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-1275.
---
Resolution: Duplicate

This issue was already addressed via 
[FLINK-14845|https://issues.apache.org/jira/browse/FLINK-14845], so I am closing it.

> Add support to compress network I/O
> ---
>
> Key: FLINK-1275
> URL: https://issues.apache.org/jira/browse/FLINK-1275
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 0.8.0
>Reporter: Ufuk Celebi
>Priority: Minor
>






[jira] [Assigned] (FLINK-15308) Job failed when enable pipelined-shuffle.compression and numberOfTaskSlots > 1

2019-12-18 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15308:


Assignee: Yingjie Cao

> Job failed when enable pipelined-shuffle.compression and numberOfTaskSlots > 1
> --
>
> Key: FLINK-15308
> URL: https://issues.apache.org/jira/browse/FLINK-15308
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0
> Environment: $ git log
> commit 4b54da2c67692b1c9d43e1184c00899b0151b3ae
> Author: bowen.li 
> Date: Tue Dec 17 17:37:03 2019 -0800
>Reporter: Feng Jiajie
>Assignee: Yingjie Cao
>Priority: Blocker
>
> Job worked well with default flink-conf.yaml with 
> pipelined-shuffle.compression:
> {code:java}
> taskmanager.numberOfTaskSlots: 1
> taskmanager.network.pipelined-shuffle.compression.enabled: true
> {code}
> But when I set taskmanager.numberOfTaskSlots to 4 or 6:
> {code:java}
> taskmanager.numberOfTaskSlots: 6
> taskmanager.network.pipelined-shuffle.compression.enabled: true
> {code}
> job failed:
> {code:java}
> $ bin/flink run -m yarn-cluster -p 16 -yjm 1024m -ytm 12288m 
> ~/flink-example-1.0-SNAPSHOT.jar
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/lib/slf4j-log4j12-1.7.15.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/data/sa_cluster/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 2019-12-18 15:04:40,514 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - The configuration directory 
> ('/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/conf')
>  already contains a LOG4J config file.If you want to use logback, then please 
> delete or rename the log configuration file.
> 2019-12-18 15:04:40,514 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - The configuration directory 
> ('/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/conf')
>  already contains a LOG4J config file.If you want to use logback, then please 
> delete or rename the log configuration file.
> 2019-12-18 15:04:40,907 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - No path for the flink jar passed. Using the location of class 
> org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2019-12-18 15:04:41,084 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Cluster specification: 
> ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=12288, 
> numberTaskManagers=1, slotsPerTaskManager=6}
> 2019-12-18 15:04:42,344 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Submitting application master application_1576573857638_0026
> 2019-12-18 15:04:42,370 INFO  
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted 
> application application_1576573857638_0026
> 2019-12-18 15:04:42,371 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Waiting for the cluster to be allocated
> 2019-12-18 15:04:42,372 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Deploying cluster, current state ACCEPTED
> 2019-12-18 15:04:45,388 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - YARN application has been deployed successfully.
> 2019-12-18 15:04:45,390 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Found Web Interface debugboxcreate431x3.sa:36162 of 
> application 'application_1576573857638_0026'.
> Job has been submitted with JobID 9140c70769f4271cc22ea8becaa26272
> 
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: org.apache.flink.client.program.ProgramInvocationException: 
> Job failed (JobID: 9140c70769f4271cc22ea8becaa26272)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
>   at 
> org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
>   at 
> org.apache.flink.client.cli.CliFrontend.

[jira] [Assigned] (FLINK-15012) Checkpoint directory not cleaned up

2019-12-18 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15012:


Assignee: Yun Tang

> Checkpoint directory not cleaned up
> ---
>
> Key: FLINK-15012
> URL: https://issues.apache.org/jira/browse/FLINK-15012
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.1
>Reporter: Nico Kruber
>Assignee: Yun Tang
>Priority: Major
>
> I started a Flink cluster with 2 TMs using {{start-cluster.sh}} and the 
> following config (in addition to the default {{flink-conf.yaml}})
> {code:java}
> state.checkpoints.dir: file:///path/to/checkpoints/
> state.backend: rocksdb {code}
> After submitting a jobwith checkpoints enabled (every 5s), checkpoints show 
> up, e.g.
> {code:java}
> bb969f842bbc0ecc3b41b7fbe23b047b/
> ├── chk-2
> │   ├── 238969e1-6949-4b12-98e7-1411c186527c
> │   ├── 2702b226-9cfc-4327-979d-e5508ab2e3d5
> │   ├── 4c51cb24-6f71-4d20-9d4c-65ed6e826949
> │   ├── e706d574-c5b2-467a-8640-1885ca252e80
> │   └── _metadata
> ├── shared
> └── taskowned {code}
> If I shut down the cluster via {{stop-cluster.sh}}, these files will remain 
> on disk and not be cleaned up.
> In contrast, if I cancel the job, at least {{chk-2}} will be deleted, but 
> still leaving the (empty) directories.





[jira] [Commented] (FLINK-15311) Lz4BlockCompressionFactory should use native compressor instead of java unsafe

2019-12-18 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999797#comment-16999797
 ] 

zhijiang commented on FLINK-15311:
--

I guess this belongs to performance improvement rather than being a bug, 
because it does not affect compression correctness or stability.

Should it be a blocker for the release?

> Lz4BlockCompressionFactory should use native compressor instead of java unsafe
> --
>
> Key: FLINK-15311
> URL: https://issues.apache.org/jira/browse/FLINK-15311
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: Jingsong Lee
>Priority: Critical
> Fix For: 1.10.0
>
>
> According to:
> [https://lz4.github.io/lz4-java/1.7.0/lz4-compression-benchmark/]
> The Java unsafe compressor has lower performance than the native lz4 compressor.
> After FLINK-14845, we use lz4 compression for shuffle.
> In testing, I found shuffle using the Java unsafe compressor.
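The intended fix, preferring the native compressor with a fallback chain, can 
be sketched generically. lz4-java exposes `nativeInstance()`, 
`unsafeInstance()`, and `safeInstance()` factories on `LZ4Factory`; the 
selection helper below is illustrative, not the library's code:

```java
import java.util.List;
import java.util.function.Supplier;

// Sketch of "prefer native, fall back to slower Java implementations":
// try each candidate factory in order and use the first one that loads.
// In lz4-java terms the candidates would be nativeInstance(), then
// unsafeInstance(), then safeInstance(); here they are stand-in suppliers.
public class CompressorSelection {
    static <T> T firstAvailable(List<Supplier<T>> candidates) {
        for (Supplier<T> candidate : candidates) {
            try {
                return candidate.get(); // e.g. throws if the native lib is missing
            } catch (RuntimeException | LinkageError e) {
                // try the next, slower implementation
            }
        }
        throw new IllegalStateException("no compressor implementation available");
    }

    public static void main(String[] args) {
        Supplier<String> nativeImpl = () -> {
            throw new UnsatisfiedLinkError("native lz4 not on this platform");
        };
        Supplier<String> unsafeImpl = () -> "unsafe";
        System.out.println(firstAvailable(List.of(nativeImpl, unsafeImpl)));
    }
}
```

The point of the ticket is the ordering: if the unsafe implementation comes 
first (or is chosen unconditionally), the faster native compressor is never 
used even when it is available.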





[jira] [Assigned] (FLINK-15311) Lz4BlockCompressionFactory should use native compressor instead of java unsafe

2019-12-18 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15311:


Assignee: Yingjie Cao

> Lz4BlockCompressionFactory should use native compressor instead of java unsafe
> --
>
> Key: FLINK-15311
> URL: https://issues.apache.org/jira/browse/FLINK-15311
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: Jingsong Lee
>Assignee: Yingjie Cao
>Priority: Critical
> Fix For: 1.10.0
>
>
> According to:
> [https://lz4.github.io/lz4-java/1.7.0/lz4-compression-benchmark/]
> The Java unsafe compressor has lower performance than the native lz4 compressor.
> After FLINK-14845, we use lz4 compression for shuffle.
> In testing, I found shuffle using the Java unsafe compressor.





[jira] [Updated] (FLINK-15308) Job failed when enable pipelined-shuffle.compression and numberOfTaskSlots > 1

2019-12-18 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15308:
-
Fix Version/s: 1.10.0

> Job failed when enable pipelined-shuffle.compression and numberOfTaskSlots > 1
> --
>
> Key: FLINK-15308
> URL: https://issues.apache.org/jira/browse/FLINK-15308
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Affects Versions: 1.10.0
> Environment: $ git log
> commit 4b54da2c67692b1c9d43e1184c00899b0151b3ae
> Author: bowen.li 
> Date: Tue Dec 17 17:37:03 2019 -0800
>Reporter: Feng Jiajie
>Assignee: Yingjie Cao
>Priority: Blocker
> Fix For: 1.10.0
>
> Attachments: image-2019-12-19-10-55-30-644.png
>
>
> Job worked well with default flink-conf.yaml with 
> pipelined-shuffle.compression:
> {code:java}
> taskmanager.numberOfTaskSlots: 1
> taskmanager.network.pipelined-shuffle.compression.enabled: true
> {code}
> But when I set taskmanager.numberOfTaskSlots to 4 or 6:
> {code:java}
> taskmanager.numberOfTaskSlots: 6
> taskmanager.network.pipelined-shuffle.compression.enabled: true
> {code}
> job failed:
> {code:java}
> $ bin/flink run -m yarn-cluster -p 16 -yjm 1024m -ytm 12288m 
> ~/flink-example-1.0-SNAPSHOT.jar
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/lib/slf4j-log4j12-1.7.15.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/data/sa_cluster/cloudera/parcels/CDH-5.14.4-1.cdh5.14.4.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 2019-12-18 15:04:40,514 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - The configuration directory 
> ('/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/conf')
>  already contains a LOG4J config file.If you want to use logback, then please 
> delete or rename the log configuration file.
> 2019-12-18 15:04:40,514 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - The configuration directory 
> ('/data/build/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/conf')
>  already contains a LOG4J config file.If you want to use logback, then please 
> delete or rename the log configuration file.
> 2019-12-18 15:04:40,907 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - No path for the flink jar passed. Using the location of class 
> org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2019-12-18 15:04:41,084 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Cluster specification: 
> ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=12288, 
> numberTaskManagers=1, slotsPerTaskManager=6}
> 2019-12-18 15:04:42,344 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Submitting application master application_1576573857638_0026
> 2019-12-18 15:04:42,370 INFO  
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted 
> application application_1576573857638_0026
> 2019-12-18 15:04:42,371 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Waiting for the cluster to be allocated
> 2019-12-18 15:04:42,372 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Deploying cluster, current state ACCEPTED
> 2019-12-18 15:04:45,388 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - YARN application has been deployed successfully.
> 2019-12-18 15:04:45,390 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Found Web Interface debugboxcreate431x3.sa:36162 of 
> application 'application_1576573857638_0026'.
> Job has been submitted with JobID 9140c70769f4271cc22ea8becaa26272
> 
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: org.apache.flink.client.program.ProgramInvocationException: 
> Job failed (JobID: 9140c70769f4271cc22ea8becaa26272)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
>   at 
> org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664)
>   at org.apache.flink.client.cli.CliFront

[jira] [Updated] (FLINK-15311) Lz4BlockCompressionFactory should use native compressor instead of java unsafe

2019-12-19 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15311:
-
Issue Type: Improvement  (was: Bug)

> Lz4BlockCompressionFactory should use native compressor instead of java unsafe
> --
>
> Key: FLINK-15311
> URL: https://issues.apache.org/jira/browse/FLINK-15311
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Reporter: Jingsong Lee
>Assignee: Yingjie Cao
>Priority: Blocker
> Fix For: 1.10.0
>
>
> According to:
> [https://lz4.github.io/lz4-java/1.7.0/lz4-compression-benchmark/]
> The Java unsafe compressor has lower performance than the native lz4 
> compressor.
> After FLINK-14845, we use lz4 compression for shuffle.
> In testing, I found shuffle was using the Java unsafe compressor.





[jira] [Commented] (FLINK-15311) Lz4BlockCompressionFactory should use native compressor instead of java unsafe

2019-12-19 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999889#comment-16999889
 ] 

zhijiang commented on FLINK-15311:
--

After discussing offline, this improvement is critical to our original 
motivation for introducing compression. So we still keep it as a blocker for 
release-1.10, but change the issue type to improvement instead.

> Lz4BlockCompressionFactory should use native compressor instead of java unsafe
> --
>
> Key: FLINK-15311
> URL: https://issues.apache.org/jira/browse/FLINK-15311
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network
>Reporter: Jingsong Lee
>Assignee: Yingjie Cao
>Priority: Blocker
> Fix For: 1.10.0
>
>
> According to:
> [https://lz4.github.io/lz4-java/1.7.0/lz4-compression-benchmark/]
> The Java unsafe compressor has lower performance than the native lz4 
> compressor.
> After FLINK-14845, we use lz4 compression for shuffle.
> In testing, I found shuffle was using the Java unsafe compressor.





[jira] [Assigned] (FLINK-14843) Streaming bucketing end-to-end test can fail with Output hash mismatch

2019-12-19 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-14843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-14843:


Assignee: PengFei Li

> Streaming bucketing end-to-end test can fail with Output hash mismatch
> --
>
> Key: FLINK-14843
> URL: https://issues.apache.org/jira/browse/FLINK-14843
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem, Tests
>Affects Versions: 1.10.0
> Environment: rev: dcc1330375826b779e4902176bb2473704dabb11
>Reporter: Gary Yao
>Assignee: PengFei Li
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.10.0
>
> Attachments: complete_result, 
> flink-gary-standalonesession-0-gyao-desktop.log, 
> flink-gary-taskexecutor-0-gyao-desktop.log, 
> flink-gary-taskexecutor-1-gyao-desktop.log, 
> flink-gary-taskexecutor-2-gyao-desktop.log, 
> flink-gary-taskexecutor-3-gyao-desktop.log, 
> flink-gary-taskexecutor-4-gyao-desktop.log, 
> flink-gary-taskexecutor-5-gyao-desktop.log, 
> flink-gary-taskexecutor-6-gyao-desktop.log
>
>
> *Description*
> Streaming bucketing end-to-end test ({{test_streaming_bucketing.sh}}) can 
> fail with Output hash mismatch.
> {noformat}
> Number of running task managers has reached 4.
> Job (e0b7a86e4d4111f3947baa3d004e083a) is running.
> Waiting until all values have been produced
> Truncating buckets
> Number of produced values 26930/6
> Truncating buckets
> Number of produced values 30890/6
> Truncating buckets
> Number of produced values 37340/6
> Truncating buckets
> Number of produced values 41290/6
> Truncating buckets
> Number of produced values 46710/6
> Truncating buckets
> Number of produced values 52120/6
> Truncating buckets
> Number of produced values 57110/6
> Truncating buckets
> Number of produced values 62530/6
> Cancelling job e0b7a86e4d4111f3947baa3d004e083a.
> Cancelled job e0b7a86e4d4111f3947baa3d004e083a.
> Waiting for job (e0b7a86e4d4111f3947baa3d004e083a) to reach terminal state 
> CANCELED ...
> Job (e0b7a86e4d4111f3947baa3d004e083a) reached terminal state CANCELED
> Job e0b7a86e4d4111f3947baa3d004e083a was cancelled, time to verify
> FAIL Bucketing Sink: Output hash mismatch.  Got 
> 9e00429abfb30eea4f459eb812b470ad, expected 01aba5ff77a0ef5e5cf6a727c248bdc3.
> head hexdump of actual:
> 000   (   2   ,   1   0   ,   0   ,   S   o   m   e   p   a   y
> 010   l   o   a   d   .   .   .   )  \n   (   2   ,   1   0   ,   1
> 020   ,   S   o   m   e   p   a   y   l   o   a   d   .   .   .
> 030   )  \n   (   2   ,   1   0   ,   2   ,   S   o   m   e   p
> 040   a   y   l   o   a   d   .   .   .   )  \n   (   2   ,   1   0
> 050   ,   3   ,   S   o   m   e   p   a   y   l   o   a   d   .
> 060   .   .   )  \n   (   2   ,   1   0   ,   4   ,   S   o   m   e
> 070   p   a   y   l   o   a   d   .   .   .   )  \n   (   2   ,
> 080   1   0   ,   5   ,   S   o   m   e   p   a   y   l   o   a
> 090   d   .   .   .   )  \n   (   2   ,   1   0   ,   6   ,   S   o
> 0a0   m   e   p   a   y   l   o   a   d   .   .   .   )  \n   (
> 0b0   2   ,   1   0   ,   7   ,   S   o   m   e   p   a   y   l
> 0c0   o   a   d   .   .   .   )  \n   (   2   ,   1   0   ,   8   ,
> 0d0   S   o   m   e   p   a   y   l   o   a   d   .   .   .   )
> 0e0  \n   (   2   ,   1   0   ,   9   ,   S   o   m   e   p   a
> 0f0   y   l   o   a   d   .   .   .   )  \n
> 0fa
> Stopping taskexecutor daemon (pid: 55164) on host gyao-desktop.
> Stopping standalonesession daemon (pid: 51073) on host gyao-desktop.
> Stopping taskexecutor daemon (pid: 51504) on host gyao-desktop.
> Skipping taskexecutor daemon (pid: 52034), because it is not running anymore 
> on gyao-desktop.
> Skipping taskexecutor daemon (pid: 52472), because it is not running anymore 
> on gyao-desktop.
> Skipping taskexecutor daemon (pid: 52916), because it is not running anymore 
> on gyao-desktop.
> Stopping taskexecutor daemon (pid: 54121) on host gyao-desktop.
> Stopping taskexecutor daemon (pid: 54726) on host gyao-desktop.
> [FAIL] Test script contains errors.
> Checking of logs skipped.
> [FAIL] 'flink-end-to-end-tests/test-scripts/test_streaming_bucketing.sh' 
> failed after 2 minutes and 3 seconds! Test exited with exit code 1
> {noformat}
> *How to reproduce*
> Comment out the delay of 10s after the 1st TM is restarted to provoke the 
> issue:
> {code:bash}
> echo "Restarting 1 TM"
> $FLINK_DIR/bin/taskmanager.sh start
> wait_for_number_of_running_tms 4
> #sleep 10
> echo "Killing 2 TMs"
> kill_random_taskmanager
> kill_random_taskmanager
> wait_for_number_of_running_tms 2
> {code}
> Command to run the test:
> {noformat}
> FLINK_DIR

[jira] [Created] (FLINK-15340) Remove the executor of pipelined compression benchmark

2019-12-19 Thread zhijiang (Jira)
zhijiang created FLINK-15340:


 Summary: Remove the executor of pipelined compression benchmark
 Key: FLINK-15340
 URL: https://issues.apache.org/jira/browse/FLINK-15340
 Project: Flink
  Issue Type: Task
  Components: Benchmarks
Reporter: zhijiang
Assignee: zhijiang


In [FLINK-15308|https://issues.apache.org/jira/browse/FLINK-15308], we removed 
compression support for the pipelined case. Accordingly, we also need to 
remove the corresponding benchmark executor.





[jira] [Closed] (FLINK-15340) Remove the executor of pipelined compression benchmark

2019-12-20 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15340.

Resolution: Fixed

Merged in benchmark repo: a92f9d5d492b97fe0e601567bfe0c021be819306

> Remove the executor of pipelined compression benchmark
> --
>
> Key: FLINK-15340
> URL: https://issues.apache.org/jira/browse/FLINK-15340
> Project: Flink
>  Issue Type: Task
>  Components: Benchmarks
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>
> In [FLINK-15308|https://issues.apache.org/jira/browse/FLINK-15308], we 
> removed compression support for the pipelined case. Accordingly, we also 
> need to remove the corresponding benchmark executor.





[jira] [Assigned] (FLINK-15355) Nightly streaming file sink fails with unshaded hadoop

2019-12-23 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15355:


Assignee: PengFei Li

> Nightly streaming file sink fails with unshaded hadoop
> --
>
> Key: FLINK-15355
> URL: https://issues.apache.org/jira/browse/FLINK-15355
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Arvid Heise
>Assignee: PengFei Li
>Priority: Blocker
> Fix For: 1.10.0
>
>
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
>  at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
>  at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
>  at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664)
>  at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
>  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895)
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
> Caused by: java.lang.RuntimeException: 
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1751)
>  at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:94)
>  at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:63)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1628)
>  at StreamingFileSinkProgram.main(StreamingFileSinkProgram.java:77)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321)
>  ... 11 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1746)
>  ... 20 more
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>  at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:326)
>  at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>  at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:274)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
>  at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
>  at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1

[jira] [Updated] (FLINK-15306) Adjust the default netty transport option from nio to auto

2019-12-23 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15306:
-
Component/s: Runtime / Configuration

> Adjust the default netty transport option from nio to auto
> --
>
> Key: FLINK-15306
> URL: https://issues.apache.org/jira/browse/FLINK-15306
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration, Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The default value of `taskmanager.network.netty.transport` in 
> NettyShuffleEnvironmentOptions is currently `nio`. The `epoll` mode offers 
> better performance, less GC pressure, and more advanced features, but it is 
> only available on Linux.
> Therefore it is better to change the default to `auto`, so that the 
> framework automatically chooses the proper mode based on the platform.
> We would further verify the performance effect via micro-benchmarks if 
> possible.





[jira] [Closed] (FLINK-15360) Yarn e2e test is broken with building docker image

2019-12-23 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15360.

Resolution: Fixed

Merged in master: e30bcfd9c8cbe56c1072fe9895f1e6d03389c31e

Merged in release-1.10: 5ebab4fa2e51791fc04e04e3ab6fbbfc9f243fce

> Yarn e2e test is broken with building docker image
> --
>
> Key: FLINK-15360
> URL: https://issues.apache.org/jira/browse/FLINK-15360
> Project: Flink
>  Issue Type: Bug
>Reporter: Yang Wang
>Assignee: Yangze Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The YARN e2e test is broken when building the docker image. This is caused 
> by this change: 
> [https://github.com/apache/flink/commit/cce1cef50d993aba5060ea5ac597174525ae895e].
>  
> The shell function \{{retry_times}} does not support passing a command as 
> multiple arguments; for example, \{{retry_times 5 0 docker build image}} does 
> not work.
>  
> cc [~karmagyz]





[jira] [Assigned] (FLINK-15368) Add end-to-end test for controlling RocksDB memory usage

2019-12-23 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15368:


Assignee: Yun Tang

> Add end-to-end test for controlling RocksDB memory usage
> 
>
> Key: FLINK-15368
> URL: https://issues.apache.org/jira/browse/FLINK-15368
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / State Backends
>Affects Versions: 1.10.0
>Reporter: Yu Li
>Assignee: Yun Tang
>Priority: Critical
> Fix For: 1.10.0
>
>
> We need to add an end-to-end test to make sure the RocksDB memory usage 
> control works well, especially under the slot sharing case.





[jira] [Assigned] (FLINK-15370) Configured write buffer manager actually not take effect in RocksDB's DBOptions

2019-12-23 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15370:


Assignee: Yu Li

> Configured write buffer manager actually not take effect in RocksDB's 
> DBOptions
> ---
>
> Key: FLINK-15370
> URL: https://issues.apache.org/jira/browse/FLINK-15370
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / State Backends
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Yun Tang
>Assignee: Yu Li
>Priority: Blocker
> Fix For: 1.10.0, 1.11.0
>
>
> Currently, we call {{DBOptions#setWriteBufferManager}} after extracting the 
> {{DBOptions}} from {{RocksDBResourceContainer}}; however, a new 
> {{DBOptions}} is extracted when creating the RocksDB instance. In other 
> words, the configured write buffer manager does not take effect in the 
> {{DBOptions}} that is finally used by the target RocksDB instance.





[jira] [Assigned] (FLINK-15387) Expose missing RocksDB properties out via RocksDBNativeMetricOptions

2019-12-25 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15387:


Assignee: Yun Tang

> Expose missing RocksDB properties out via RocksDBNativeMetricOptions
> 
>
> Key: FLINK-15387
> URL: https://issues.apache.org/jira/browse/FLINK-15387
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / State Backends
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Yun Tang
>Assignee: Yun Tang
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 1.10.0, 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While implementing FLINK-15368, we need to expose block-cache-related 
> metrics of RocksDB by adding more available options to the current 
> RocksDBNativeMetricOptions.





[jira] [Assigned] (FLINK-15428) Avro Confluent Schema Registry nightly end-to-end test fails on travis

2019-12-29 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15428:


Assignee: Yangze Guo

> Avro Confluent Schema Registry nightly end-to-end test fails on travis
> --
>
> Key: FLINK-15428
> URL: https://issues.apache.org/jira/browse/FLINK-15428
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.10.0
>Reporter: Yu Li
>Assignee: Yangze Guo
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.10.0
>
>
> Avro Confluent Schema Registry nightly end-to-end test fails with the error below:
> {code}
> Could not start confluent schema registry
> /home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/kafka-common.sh:
>  line 78: ./bin/kafka-server-stop: No such file or directory
> No zookeeper server to stop
> Tried to kill 1549 but never saw it die
> [FAIL] Test script contains errors.
> {code}
> https://api.travis-ci.org/v3/job/629699437/log.txt





[jira] [Assigned] (FLINK-15437) Start session with property of "-Dtaskmanager.memory.process.size" not work

2019-12-30 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15437:


Assignee: Xintong Song

> Start session with property of "-Dtaskmanager.memory.process.size" not work
> ---
>
> Key: FLINK-15437
> URL: https://issues.apache.org/jira/browse/FLINK-15437
> Project: Flink
>  Issue Type: Bug
>  Components: API / Core
>Affects Versions: 1.10.0
>Reporter: xiaojin.wy
>Assignee: Xintong Song
>Priority: Critical
> Fix For: 1.10.0
>
>
> *The environment:*
> The YARN session command is shown below, and the flink-conf.yaml does not 
> contain the property "taskmanager.memory.process.size":
> export HADOOP_CLASSPATH=`hadoop classpath`;export 
> HADOOP_CONF_DIR=/dump/1/jenkins/workspace/Stream-Spark-3.4/env/hadoop_conf_dirs/blinktest2;
>  export BLINK_HOME=/dump/1/jenkins/workspace/test/blink_daily; 
> $BLINK_HOME/bin/yarn-session.sh -d -qu root.default -nm 'Session Cluster of 
> daily_regression_stream_spark_1.10' -jm 1024 -n 20 -s 10 
> -Dtaskmanager.memory.process.size=1024m
> *After executing the command above, there is an exception like this:*
> 2019-12-30 17:54:57,992 INFO  org.apache.hadoop.yarn.client.RMProxy   
>   - Connecting to ResourceManager at 
> z05c07224.sqa.zth.tbsite.net/11.163.188.36:8050
> 2019-12-30 17:54:58,182 ERROR org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - Error while running the Flink session.
> org.apache.flink.configuration.IllegalConfigurationException: Either Task 
> Heap Memory size (taskmanager.memory.task.heap.size) and Managed Memory size 
> (taskmanager.memory.managed.size), or Total Flink Memory size 
> (taskmanager.memory.flink.size), or Total Process Memory size 
> (taskmanager.memory.process.size) need to be configured explicitly.
>   at 
> org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:145)
>   at 
> org.apache.flink.client.deployment.AbstractClusterClientFactory.getClusterSpecification(AbstractClusterClientFactory.java:44)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:557)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$5(FlinkYarnSessionCli.java:803)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:803)
> 
>  The program finished with the following exception:
> org.apache.flink.configuration.IllegalConfigurationException: Either Task 
> Heap Memory size (taskmanager.memory.task.heap.size) and Managed Memory size 
> (taskmanager.memory.managed.size), or Total Flink Memory size 
> (taskmanager.memory.flink.size), or Total Process Memory size 
> (taskmanager.memory.process.size) need to be configured explicitly.
>   at 
> org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:145)
>   at 
> org.apache.flink.client.deployment.AbstractClusterClientFactory.getClusterSpecification(AbstractClusterClientFactory.java:44)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:557)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.lambda$main$5(FlinkYarnSessionCli.java:803)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1804)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at 
> org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:803)
> *The flink-conf.yaml is:*
> jobmanager.rpc.address: localhost
> jobmanager.rpc.port: 6123
> jobmanager.heap.size: 1024m
> taskmanager.memory.total-process.size: 1024m
> taskmanager.numberOfTaskSlots: 1
> parallelism.default: 1
> jobmanager.execution.failover-strategy: region





[jira] [Assigned] (FLINK-15442) Harden the Avro Confluent Schema Registry nightly end-to-end test

2019-12-30 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang reassigned FLINK-15442:


Assignee: Yangze Guo

> Harden the Avro Confluent Schema Registry nightly end-to-end test
> -
>
> Key: FLINK-15442
> URL: https://issues.apache.org/jira/browse/FLINK-15442
> Project: Flink
>  Issue Type: Bug
>  Components: Tests
>Reporter: Yangze Guo
>Assignee: Yangze Guo
>Priority: Critical
> Fix For: 1.10.0
>
>
> We have already hardened the Avro Confluent Schema Registry test in 
> [FLINK-13567|https://issues.apache.org/jira/browse/FLINK-13567]. However, 
> there are still some defects in the current mechanism:
> * The loop variable _i_ is not safe; it could be modified by the *command*.
> * The process of downloading Kafka 0.10 is not included in the scope of 
> retry_times. We need to include it to tolerate transient network issues.
> We need to fix these issues to harden the Avro Confluent Schema Registry 
> nightly end-to-end test.
> cc: [~trohrmann] [~chesnay]
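The two defects above (a shared loop counter and a step left outside the retry scope) can be illustrated with a small retry helper. This is a hypothetical Java rendering of the idea, not the actual bash `retry_times` function: the attempt counter is local to the helper, so the retried command cannot clobber it, and everything that should be retried is passed in as the command.

```java
import java.util.function.Supplier;

public class RetryTimes {
    // Retries `command` up to `attempts` times with a fixed backoff.
    // The counter `i` is scoped inside this method, so the command
    // being retried has no way to modify it (unlike a bash variable
    // shared with the invoked command).
    static boolean retryTimes(int attempts, long backoffMillis, Supplier<Boolean> command)
            throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (command.get()) {
                return true;
            }
            Thread.sleep(backoffMillis);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // Simulates a flaky step (e.g. a download) that succeeds on the 3rd try.
        boolean ok = retryTimes(5, 0L, () -> ++calls[0] == 3);
        System.out.println(ok + " after " + calls[0] + " attempts");
    }
}
```

The same structure applies to the shell fix: wrap the whole flaky operation, download included, inside the retry function rather than retrying only part of it.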





[jira] [Updated] (FLINK-15444) Make the component AbstractInvokable in CheckpointBarrierHandler NonNull

2019-12-30 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15444:
-
Fix Version/s: 1.11.0

> Make the component AbstractInvokable in CheckpointBarrierHandler NonNull 
> -
>
> Key: FLINK-15444
> URL: https://issues.apache.org/jira/browse/FLINK-15444
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Checkpointing
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
> Fix For: 1.11.0
>
>
> The current component {{AbstractInvokable}} in {{CheckpointBarrierHandler}} 
> is annotated as {{@Nullable}}. In the real code path it is passed via the 
> constructor and is never null; the nullable annotation exists only for unit 
> tests. But this misleads real usage in practice and brings extra trouble, 
> because you always have to check whether it is null before using it in the 
> related processes.
> We can refactor the related unit tests to implement a dummy 
> {{AbstractInvokable}} for tests and remove the {{@Nullable}} annotation from 
> the related class constructors.





[jira] [Created] (FLINK-15444) Make the component AbstractInvokable in CheckpointBarrierHandler NonNull

2019-12-30 Thread zhijiang (Jira)
zhijiang created FLINK-15444:


 Summary: Make the component AbstractInvokable in 
CheckpointBarrierHandler NonNull 
 Key: FLINK-15444
 URL: https://issues.apache.org/jira/browse/FLINK-15444
 Project: Flink
  Issue Type: Task
  Components: Runtime / Checkpointing
Reporter: zhijiang
Assignee: zhijiang


The current component {{AbstractInvokable}} in {{CheckpointBarrierHandler}} is 
annotated as {{@Nullable}}. In the real code path it is passed via the 
constructor and is never null; the nullable annotation exists only for unit 
tests. But this misleads real usage in practice and brings extra trouble, 
because you always have to check whether it is null before using it in the 
related processes.

We can refactor the related unit tests to implement a dummy 
{{AbstractInvokable}} for tests and remove the {{@Nullable}} annotation from 
the related class constructors.
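The refactoring idea can be sketched roughly as follows. Class and method names here are simplified stand-ins, not Flink's real signatures: the handler requires a non-null invokable in its constructor, and unit tests pass a no-op dummy instead of null.

```java
import java.util.Objects;

public class NonNullInvokable {
    interface Invokable {
        void triggerCheckpoint(long checkpointId);
    }

    static class CheckpointBarrierHandler {
        private final Invokable invokable; // guaranteed non-null

        CheckpointBarrierHandler(Invokable invokable) {
            // Fail fast at construction instead of null-checking on every use.
            this.invokable = Objects.requireNonNull(invokable);
        }

        void onBarrier(long checkpointId) {
            invokable.triggerCheckpoint(checkpointId); // no null check needed
        }
    }

    /** No-op stand-in used by unit tests instead of passing null. */
    static class DummyInvokable implements Invokable {
        long lastCheckpointId = -1;

        @Override
        public void triggerCheckpoint(long checkpointId) {
            lastCheckpointId = checkpointId;
        }
    }

    public static void main(String[] args) {
        DummyInvokable dummy = new DummyInvokable();
        new CheckpointBarrierHandler(dummy).onBarrier(42L);
        System.out.println(dummy.lastCheckpointId);
    }
}
```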





[jira] [Commented] (FLINK-14163) Execution#producedPartitions is possibly not assigned when used

2020-01-05 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008580#comment-17008580
 ] 

zhijiang commented on FLINK-14163:
--

Thanks for reporting this potential issue [~zhuzh]!

After double checking the related codes, this issue indeed exists only for 
DefaultScheduler. When the ShuffleMaster#registerPartitionWithProducer was 
firstly introduced before, we already considered the async behavior in the 
scheduler process (legacy now) at that time.

The three usages mentioned above are mainly caused by the deployment process in 
the new DefaultScheduler not considering the completed future of the partition 
registration. If the new scheduler also takes the async way into account during 
deployment, like the legacy scheduler did, I think we can resolve all the 
existing concerns. 

I also feel that the current public method {{Execution#getPartitionIds}} 
carries a potential risk in practice, because the returned partitions might be 
an empty collection if the registration future has not completed yet, and the 
caller is not aware of this. 

From the shuffle aspect, it is indeed meaningful to provide the async way for 
registerPartitionWithProducer in the long term, since it is flexible enough to 
satisfy different scenarios. But judging from the existing implementation and 
possible future implementations like a yarn shuffle service, I guess the sync 
way can also satisfy the requirements. So if the async way brings more trouble 
for the scheduler and is not easy to adjust for other components, it also makes 
sense to me to change registerPartitionWithProducer to a sync interface 
instead. We can keep things simple. 

Are there any thoughts or inputs [~azagrebin]?
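The race described above can be sketched with plain {{CompletableFuture}}s. This is a hypothetical stdlib-only model (the names loosely mirror Flink's, but none of this is the actual scheduler code): if deployment runs before the registration future completes, it observes an empty partition collection; chaining deployment on the future avoids that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AsyncRegistrationSketch {

    // Partition descriptors are only assigned once the shuffle master's
    // future completes, mirroring the async registration interface.
    static CompletableFuture<Void> registerProducedPartitions(
            CompletableFuture<String> shuffleMasterFuture, List<String> producedPartitions) {
        return shuffleMasterFuture.thenAccept(producedPartitions::add);
    }

    public static void main(String[] args) {
        List<String> producedPartitions = new ArrayList<>();
        CompletableFuture<String> pending = new CompletableFuture<>();
        CompletableFuture<Void> registration =
                registerProducedPartitions(pending, producedPartitions);

        // Buggy path: deploying immediately sees no partitions yet.
        System.out.println("partitions before completion: " + producedPartitions);

        // Safe path: chain the deployment on the registration future.
        CompletableFuture<Void> deployment = registration.thenRun(() ->
                System.out.println("deploying with partitions: " + producedPartitions));

        pending.complete("partition-1");
        deployment.join();
    }
}
```

The fix amounts to composing the deployment stage on the registration future rather than reading the (possibly still empty) field directly.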

> Execution#producedPartitions is possibly not assigned when used
> ---
>
> Key: FLINK-14163
> URL: https://issues.apache.org/jira/browse/FLINK-14163
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zhu Zhu
>Priority: Major
> Fix For: 1.10.0
>
>
> Currently {{Execution#producedPartitions}} is assigned after the partitions 
> have completed the registration to the shuffle master in 
> {{Execution#registerProducedPartitions(...)}}.
> The partition registration is an async interface 
> ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so 
> {{Execution#producedPartitions}} is possibly[1] not set when used. 
> Usages include:
> 1. deploying this task, so that the task may be deployed without its result 
> partitions assigned, and the job would hang. (DefaultScheduler issue only, 
> since the legacy scheduler handled this case)
> 2. generating input descriptors for downstream tasks: 
> 3. retrieving {{ResultPartitionID}} for partition releasing: 
> [1] If a user uses the Flink default shuffle master {{NettyShuffleMaster}}, it is 
> not problematic at the moment since it returns a completed future on 
> registration, so that it is a synchronous process. However, if users 
> implement their own shuffle service in which 
> {{ShuffleMaster#registerPartitionWithProducer}} returns a pending future, it 
> can be a problem. This is possible since a customizable shuffle service has been 
> open to users since 1.9 (via config "shuffle-service-factory.class").
> To avoid such issues, we may either 
> 1. fix all the usages of {{Execution#producedPartitions}} regarding the async 
> assigning, or 
> 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync 
> interface





[jira] [Commented] (FLINK-14163) Execution#producedPartitions is possibly not assigned when used

2020-01-05 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008582#comment-17008582
 ] 

zhijiang commented on FLINK-14163:
--

In addition, no matter which way we take to solve this issue, I think we can 
make it ready in release-1.11; it is not a blocker for release-1.10.

> Execution#producedPartitions is possibly not assigned when used
> ---
>
> Key: FLINK-14163
> URL: https://issues.apache.org/jira/browse/FLINK-14163
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zhu Zhu
>Priority: Major
> Fix For: 1.10.0
>
>





[jira] [Commented] (FLINK-10462) Remove ConnectionIndex for further sharing tcp connection in credit-based mode

2020-01-06 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008630#comment-17008630
 ] 

zhijiang commented on FLINK-10462:
--

[~kevin.cyj] I think there are two requirements in your description. 
 * One is how many connections are established between two TaskManagers. 
Currently this is determined by the ConnectionIndex generated by the 
IntermediateResult. If we want to derive the number of connections from the 
number of netty threads, either by default or via a configurable setting, we 
can also drop the concept of ConnectionIndex from the topology completely, as 
this ticket proposed. That would be easier to understand and should not bring 
unexpected regressions, I guess.
 * The other is when to release the connection between two TaskManagers. Let's 
discuss that in your ticket FLINK-15455. I see two sides to this proposal. For 
yarn per-job mode, a previously cancelled task may be resubmitted to the 
original TaskManager, so it can benefit from reusing the previous connection. 
But for session mode, if one job finishes and another job is submitted to the 
TaskManager again, the connections between TaskManagers might differ across 
jobs. E.g. jobA is submitted to TM1 and TM2, and jobB is submitted to TM2 and 
TM3. In this case, retaining the connection between TM1 and TM2 after jobA 
finishes would waste resources to some extent. 
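The effect of dropping the ConnectionIndex can be sketched with a toy model. This is a hypothetical stdlib-only illustration (it is not Flink's actual connection factory; the request tuples and key format are made up for the example): keying connections by (address, connectionIndex) yields one connection per IntermediateResult, while keying by address alone lets all results between the same pair of TaskManagers share one connection.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ConnectionSharingSketch {

    // Each request is {remoteAddress, connectionIndex}; returns how many
    // distinct physical connections would be established.
    static int countConnections(List<String[]> requests, boolean useConnectionIndex) {
        Set<String> connections = new HashSet<>();
        for (String[] r : requests) {
            connections.add(useConnectionIndex ? r[0] + "#" + r[1] : r[0]);
        }
        return connections.size();
    }

    public static void main(String[] args) {
        List<String[]> requests = Arrays.asList(
                new String[] {"tm2:9000", "0"},   // IntermediateResult A
                new String[] {"tm2:9000", "1"},   // IntermediateResult B
                new String[] {"tm2:9000", "1"});  // IntermediateResult B again
        System.out.println("with ConnectionIndex: " + countConnections(requests, true));
        System.out.println("without ConnectionIndex: " + countConnections(requests, false));
    }
}
```

With the index in the key, two results to the same remote TaskManager need two connections; without it, they collapse to one, which is the sharing this ticket proposes.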

> Remove ConnectionIndex for further sharing tcp connection in credit-based 
> mode 
> ---
>
> Key: FLINK-10462
> URL: https://issues.apache.org/jira/browse/FLINK-10462
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.5.3, 1.5.4, 1.6.0, 1.6.1
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>
> Every {{IntermediateResult}} generates a random {{ConnectionIndex}} which 
> will be included in the {{ConnectionID}}.
> The {{RemoteInputChannel}} requests to establish a tcp connection via the 
> {{ConnectionID}}. That means one tcp connection may be shared by multiple 
> {{RemoteInputChannel}}s which have the same {{ConnectionID}}. This reduces 
> the number of physical connections between two {{TaskManager}}s, which 
> brings benefits for large scale jobs. 
> But this sharing is limited to the same {{IntermediateResult}}, and I 
> think that is mainly because we may temporarily switch off {{autoread}} for 
> the channel during back pressure in the previous network flow control. In 
> credit-based mode, the channel is always open for transporting different 
> intermediate data, so we can further share the tcp connection across 
> different {{IntermediateResults}} to remove this limitation. 





[jira] [Updated] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2020-01-06 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Description: 
The high and low watermark setting in NettyServer was mainly used for network 
flow control and for limiting the maximum memory overhead caused by copying 
data inside the netty stack. In detail, when the downstream side processes 
data slowly and eventually exhausts its available buffers, it temporarily 
switches off auto read in the netty stack. The upstream side then eventually 
reaches the high watermark of the channel and becomes unwritable.

But with credit-based flow control and the reuse of flink network buffers 
inside the netty stack, the watermark setting is no longer relevant. So we can 
safely remove it to clean up the code.

  was:After removing the non credit-based flow control codes, the channel 
writability changed logic in PartitionRequestQueue along with the setting of 
channel watermark are both invalid. Therefore we can remove them completely to 
simplify the codes.
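The watermark mechanism being removed can be modeled in a few lines. This is a hypothetical stdlib-only sketch, not Netty's actual implementation (Netty tracks this via its ChannelOutboundBuffer and write-buffer watermark config): a channel turns unwritable once pending bytes exceed the high watermark and becomes writable again only after they drop below the low watermark.

```java
public class WatermarkSketch {

    static class Channel {
        final long lowWatermark;
        final long highWatermark;
        long pendingBytes;
        boolean writable = true;

        Channel(long lowWatermark, long highWatermark) {
            this.lowWatermark = lowWatermark;
            this.highWatermark = highWatermark;
        }

        void write(long bytes) {
            pendingBytes += bytes;
            if (pendingBytes > highWatermark) {
                writable = false; // the sender should stop writing
            }
        }

        void flushed(long bytes) {
            pendingBytes -= bytes;
            if (pendingBytes < lowWatermark) {
                writable = true; // the sender may resume writing
            }
        }
    }

    public static void main(String[] args) {
        Channel ch = new Channel(32 * 1024, 64 * 1024);
        ch.write(100 * 1024);
        System.out.println("writable after burst: " + ch.writable);
        ch.flushed(90 * 1024);
        System.out.println("writable after flush: " + ch.writable);
    }
}
```

Under credit-based flow control the sender only writes data for which the receiver has announced buffers, so pending bytes never grow unboundedly and this writability toggle has nothing left to guard.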


> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>





[jira] [Updated] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2020-01-06 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Fix Version/s: 1.11.0

> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>





[jira] [Closed] (FLINK-15021) Remove setting of netty channel watermark and logic of writability changed

2020-01-06 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15021.

Resolution: Fixed

Merged in master: 12d095028d54f842c7cc0f8efd3bac476fc5d9f7

> Remove setting of netty channel watermark and logic of writability changed
> --
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>





[jira] [Updated] (FLINK-15021) Remove the setting of netty channel watermark in NettyServer

2020-01-06 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang updated FLINK-15021:
-
Summary: Remove the setting of netty channel watermark in NettyServer  
(was: Remove setting of netty channel watermark and logic of writability 
changed)

> Remove the setting of netty channel watermark in NettyServer
> 
>
> Key: FLINK-15021
> URL: https://issues.apache.org/jira/browse/FLINK-15021
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>





[jira] [Closed] (FLINK-15355) Nightly streaming file sink fails with unshaded hadoop

2020-01-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15355.

Resolution: Fixed

Merged in release-1.10: aa37c0cd89053e68e72a19d51715b3a31b74163c

> Nightly streaming file sink fails with unshaded hadoop
> --
>
> Key: FLINK-15355
> URL: https://issues.apache.org/jira/browse/FLINK-15355
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
>  at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
>  at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
>  at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:664)
>  at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
>  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:895)
>  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:968)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>  at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>  at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:968)
> Caused by: java.lang.RuntimeException: 
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1751)
>  at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:94)
>  at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:63)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1628)
>  at StreamingFileSinkProgram.main(StreamingFileSinkProgram.java:77)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321)
>  ... 11 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>  at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>  at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1746)
>  ... 20 more
> Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
> submit JobGraph.
>  at 
> org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:326)
>  at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>  at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>  at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:274)
>  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>  at 
> java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
>  at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
>  at 
> java.util

[jira] [Closed] (FLINK-15306) Adjust the default netty transport option from nio to auto

2020-01-09 Thread zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhijiang closed FLINK-15306.

Resolution: Fixed

Merged in master: 169edef4b5e4381626b7405947ebfe3f49aff2ac

> Adjust the default netty transport option from nio to auto
> --
>
> Key: FLINK-15306
> URL: https://issues.apache.org/jira/browse/FLINK-15306
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration, Runtime / Network
>Reporter: zhijiang
>Assignee: zhijiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The default option of `taskmanager.network.netty.transport` in 
> NettyShuffleEnvironmentOptions is currently `nio`. As we know, the `epoll` 
> mode can get better performance with less GC and has more advanced features, 
> but it is only available on Linux.
> Therefore it is better to change the default option to `auto`, so that the 
> framework automatically chooses the proper mode based on the platform.
> We would further verify the performance effect via micro benchmarks if 
> possible.
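The `auto` selection described above can be sketched without pulling in Netty. This is a hypothetical stdlib-only illustration (Netty's real check is a native-library probe such as `Epoll.isAvailable()`, not an `os.name` lookup, and the method name here is made up): an explicit setting wins, and `auto` picks the native epoll transport on Linux with nio as the fallback elsewhere.

```java
public class TransportSelectionSketch {

    static String chooseTransport(String configured, String osName) {
        if (!"auto".equals(configured)) {
            return configured; // an explicit "nio" or "epoll" setting wins
        }
        // epoll is only available on Linux; everything else falls back to nio.
        return osName.toLowerCase().contains("linux") ? "epoll" : "nio";
    }

    public static void main(String[] args) {
        String os = System.getProperty("os.name");
        System.out.println("selected transport: " + chooseTransport("auto", os));
    }
}
```

The point of the `auto` default is exactly this branch: users on Linux get the faster native transport without setting anything, while other platforms keep working unchanged.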





[jira] [Comment Edited] (FLINK-15355) Nightly streaming file sink fails with unshaded hadoop

2020-01-10 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012387#comment-17012387
 ] 

zhijiang edited comment on FLINK-15355 at 1/10/20 11:29 AM:


Merged in release-1.10: aa37c0cd89053e68e72a19d51715b3a31b74163c

Merged in master: f7833aff7d50af5a3a3a671d9b6a44bd5dc17a67


was (Author: zjwang):
Merged in release-1.10: aa37c0cd89053e68e72a19d51715b3a31b74163c

> Nightly streaming file sink fails with unshaded hadoop
> --
>
> Key: FLINK-15355
> URL: https://issues.apache.org/jira/browse/FLINK-15355
> Project: Flink
>  Issue Type: Bug
>  Components: FileSystems
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Arvid Heise
>Assignee: Arvid Heise
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>

[jira] [Commented] (FLINK-14163) Execution#producedPartitions is possibly not assigned when used

2020-01-12 Thread zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014085#comment-17014085
 ] 

zhijiang commented on FLINK-14163:
--

Thanks for the good suggestions above! Sorry for coming back to this issue a 
bit late, especially with the PR already ready.

My previous guess was that formally supporting the async way would bring big 
trouble for the scheduler, or might conflict with the new scheduler direction 
in the long term. Also, considering that the async shuffle way seemed a bit 
over-designed at the time and has no real users atm, I mentioned before that I 
could accept adjusting to the sync way to stop the loss early, although I also 
thought that, in general, it is not good to break compatibility of an exposed 
public interface. If handling the async way is not a problem for the scheduler 
in the future, I am happy to retain the async shuffle way.

If we decide to retain the async way and work around it in the scheduler 
temporarily, it might be better not to fail directly upon finding the future 
incomplete. I mean we could bear a timeout before failing. This timeout would 
be used not only for waiting for the future to complete, but also for waiting 
on the future returned by the shuffle master during the call, to avoid the 
main thread being stuck for a long time.
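The timeout idea above can be sketched with a bounded wait on the registration future. This is a hypothetical stdlib-only sketch (the method name and exception choice are made up for illustration, not Flink's actual code): instead of failing immediately when the future is not yet complete, wait up to a timeout and only fail afterwards.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutWaitSketch {

    static String awaitRegistration(CompletableFuture<String> registration,
                                    long timeout, TimeUnit unit) throws Exception {
        try {
            // Bounded wait: tolerates a slow shuffle master without
            // blocking the calling thread indefinitely.
            return registration.get(timeout, unit);
        } catch (TimeoutException e) {
            throw new IllegalStateException(
                    "Partition registration did not complete within the timeout", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // An async shuffle master that completes the registration shortly after.
        CompletableFuture<String> registration =
                CompletableFuture.supplyAsync(() -> "partition-descriptor");
        System.out.println(awaitRegistration(registration, 1, TimeUnit.SECONDS));
    }
}
```

A completed future (like the one NettyShuffleMaster returns) passes through with no delay, so the bounded wait only costs anything for genuinely asynchronous shuffle implementations.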

> Execution#producedPartitions is possibly not assigned when used
> ---
>
> Key: FLINK-14163
> URL: https://issues.apache.org/jira/browse/FLINK-14163
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Zhu Zhu
>Assignee: Yuan Mei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.10.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>





[jira] [Commented] (FLINK-16720) Maven gets stuck downloading artifacts on Azure

2020-03-23 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064732#comment-17064732
 ] 

Zhijiang commented on FLINK-16720:
--

Another instance : 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6511&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=27d1d645-cbce-54e2-51c4-d8b45fe24607]

> Maven gets stuck downloading artifacts on Azure
> ---
>
> Key: FLINK-16720
> URL: https://issues.apache.org/jira/browse/FLINK-16720
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Major
>
> Logs: 
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6509&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=27d1d645-cbce-54e2-51c4-d8b45fe24607
> {code}
> 2020-03-23T08:43:28.4128014Z [INFO] 
> 
> 2020-03-23T08:43:28.4128557Z [INFO] Building flink-avro-confluent-registry 
> 1.11-SNAPSHOT
> 2020-03-23T08:43:28.4129129Z [INFO] 
> 
> 2020-03-23T08:48:47.6591333Z 
> ==
> 2020-03-23T08:48:47.6594540Z Maven produced no output for 300 seconds.
> 2020-03-23T08:48:47.6595164Z 
> ==
> 2020-03-23T08:48:47.6605370Z 
> ==
> 2020-03-23T08:48:47.6605803Z The following Java processes are running (JPS)
> 2020-03-23T08:48:47.6606173Z 
> ==
> 2020-03-23T08:48:47.7710037Z 920 Jps
> 2020-03-23T08:48:47.7778561Z 238 Launcher
> 2020-03-23T08:48:47.9270289Z 
> ==
> 2020-03-23T08:48:47.9270832Z Printing stack trace of Java process 967
> 2020-03-23T08:48:47.9271199Z 
> ==
> 2020-03-23T08:48:48.0165945Z 967: No such process
> 2020-03-23T08:48:48.0218260Z 
> ==
> 2020-03-23T08:48:48.0218736Z Printing stack trace of Java process 238
> 2020-03-23T08:48:48.0219075Z 
> ==
> 2020-03-23T08:48:48.3404066Z 2020-03-23 08:48:48
> 2020-03-23T08:48:48.3404828Z Full thread dump OpenJDK 64-Bit Server VM 
> (25.242-b08 mixed mode):
> 2020-03-23T08:48:48.3405064Z 
> 2020-03-23T08:48:48.3405445Z "Attach Listener" #370 daemon prio=9 os_prio=0 
> tid=0x7fe130001000 nid=0x452 waiting on condition [0x]
> 2020-03-23T08:48:48.3405868Zjava.lang.Thread.State: RUNNABLE
> 2020-03-23T08:48:48.3411202Z 
> 2020-03-23T08:48:48.3413171Z "resolver-5" #105 daemon prio=5 os_prio=0 
> tid=0x7fe1ec2ad800 nid=0x177 waiting on condition [0x7fe1872d9000]
> 2020-03-23T08:48:48.3414175Zjava.lang.Thread.State: WAITING (parking)
> 2020-03-23T08:48:48.3414560Z  at sun.misc.Unsafe.park(Native Method)
> 2020-03-23T08:48:48.3415451Z  - parking to wait for  <0x0003d5a9f828> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 2020-03-23T08:48:48.3416180Z  at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 2020-03-23T08:48:48.3416825Z  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 2020-03-23T08:48:48.3417602Z  at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 2020-03-23T08:48:48.3418250Z  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> 2020-03-23T08:48:48.3418930Z  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> 2020-03-23T08:48:48.3419900Z  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2020-03-23T08:48:48.3420395Z  at java.lang.Thread.run(Thread.java:748)
> 2020-03-23T08:48:48.3420648Z 
> 2020-03-23T08:48:48.3421424Z "resolver-4" #104 daemon prio=5 os_prio=0 
> tid=0x7fe1ec2ad000 nid=0x176 waiting on condition [0x7fe1863dd000]
> 2020-03-23T08:48:48.3421914Zjava.lang.Thread.State: WAITING (parking)
> 2020-03-23T08:48:48.3422233Z  at sun.misc.Unsafe.park(Native Method)
> 2020-03-23T08:48:48.3422919Z  - parking to wait for  <0x0003d5a9f828> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 2020-03-23T08:48:48.3423447Z  at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 2020-03-23T08:48:48.3424141Z  at 
> java.util.concurr

[jira] [Comment Edited] (FLINK-16720) Maven gets stuck downloading artifacts on Azure

2020-03-23 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064732#comment-17064732
 ] 

Zhijiang edited comment on FLINK-16720 at 3/23/20, 11:37 AM:
-

Another two instances: 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6511&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=27d1d645-cbce-54e2-51c4-d8b45fe24607]

[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6509&view=logs&s=ae4f8708-9994-57d3-c2d7-b892156e7812&j=d44f43ce-542c-597d-bf94-b0718c71e5e8]


was (Author: zjwang):
Another instance : 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6511&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=27d1d645-cbce-54e2-51c4-d8b45fe24607]

> Maven gets stuck downloading artifacts on Azure
> ---
>
> Key: FLINK-16720
> URL: https://issues.apache.org/jira/browse/FLINK-16720
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Major
>

[jira] [Updated] (FLINK-16720) Maven gets stuck downloading artifacts on Azure

2020-03-23 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16720:
-
Priority: Critical  (was: Major)

> Maven gets stuck downloading artifacts on Azure
> ---
>
> Key: FLINK-16720
> URL: https://issues.apache.org/jira/browse/FLINK-16720
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Critical
>

[jira] [Commented] (FLINK-16720) Maven gets stuck downloading artifacts on Azure

2020-03-23 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065330#comment-17065330
 ] 

Zhijiang commented on FLINK-16720:
--

Another instance: 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6535&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=fb588352-ef18-568d-b447-699986250ccb]
 

> Maven gets stuck downloading artifacts on Azure
> ---
>
> Key: FLINK-16720
> URL: https://issues.apache.org/jira/browse/FLINK-16720
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>

[jira] [Created] (FLINK-16739) PrestoS3FileSystemITCase#testSimpleFileWriteAndRead fails with no such key

2020-03-23 Thread Zhijiang (Jira)
Zhijiang created FLINK-16739:


 Summary: PrestoS3FileSystemITCase#testSimpleFileWriteAndRead fails 
with no such key
 Key: FLINK-16739
 URL: https://issues.apache.org/jira/browse/FLINK-16739
 Project: Flink
  Issue Type: Task
  Components: Connectors / FileSystem, Tests
Reporter: Zhijiang


Build: 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6546&view=logs&j=e9af9cde-9a65-5281-a58e-2c8511d36983&t=df5b2bf5-bcff-5dc9-7626-50bed0866a82]

logs
{code:java}
2020-03-24T01:51:19.6988685Z [INFO] Running 
org.apache.flink.fs.s3presto.PrestoS3FileSystemBehaviorITCase
2020-03-24T01:51:21.6250893Z [INFO] Running 
org.apache.flink.fs.s3presto.PrestoS3FileSystemITCase
2020-03-24T01:51:25.1626385Z [WARNING] Tests run: 8, Failures: 0, Errors: 0, 
Skipped: 2, Time elapsed: 5.461 s - in 
org.apache.flink.fs.s3presto.PrestoS3FileSystemBehaviorITCase
2020-03-24T01:51:50.5503712Z [ERROR] Tests run: 7, Failures: 1, Errors: 1, 
Skipped: 0, Time elapsed: 28.922 s <<< FAILURE! - in 
org.apache.flink.fs.s3presto.PrestoS3FileSystemITCase
2020-03-24T01:51:50.5506010Z [ERROR] testSimpleFileWriteAndRead[Scheme = 
s3p](org.apache.flink.fs.s3presto.PrestoS3FileSystemITCase)  Time elapsed: 0.7 
s  <<< ERROR!
2020-03-24T01:51:50.5513057Z 
com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException:
 com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not 
exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request 
ID: A07D70A474EABC13; S3 Extended Request ID: 
R2ReW39oZ9ncoc82xb+V5h/EJV5/Mnsee+7uZ7cFMkliTQ/nKhvHPCDfr5zddbfUdR/S49VdbrA=), 
S3 Extended Request ID: 
R2ReW39oZ9ncoc82xb+V5h/EJV5/Mnsee+7uZ7cFMkliTQ/nKhvHPCDfr5zddbfUdR/S49VdbrA= 
(Path: s3://***/temp/tests-c79a578b-13d9-41ba-b73b-4f53fc965b96/test.txt)
2020-03-24T01:51:50.5517642Z Caused by: 
com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not 
exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request 
ID: A07D70A474EABC13; S3 Extended Request ID: 
R2ReW39oZ9ncoc82xb+V5h/EJV5/Mnsee+7uZ7cFMkliTQ/nKhvHPCDfr5zddbfUdR/S49VdbrA=)
2020-03-24T01:51:50.5519791Z 
2020-03-24T01:51:50.5520679Z [ERROR] 
org.apache.flink.fs.s3presto.PrestoS3FileSystemITCase  Time elapsed: 17.431 s  
<<< FAILURE!
2020-03-24T01:51:50.5521841Z java.lang.AssertionError: expected: but 
was:
2020-03-24T01:51:50.5522437Z 
2020-03-24T01:51:50.8966641Z [INFO] 
2020-03-24T01:51:50.8967386Z [INFO] Results:
2020-03-24T01:51:50.8967849Z [INFO] 
2020-03-24T01:51:50.8968357Z [ERROR] Failures: 
2020-03-24T01:51:50.8970933Z [ERROR]   
PrestoS3FileSystemITCase>AbstractHadoopFileSystemITTest.teardown:155->AbstractHadoopFileSystemITTest.checkPathExistence:61
 expected: but was:
2020-03-24T01:51:50.8972311Z [ERROR] Errors: 
2020-03-24T01:51:50.8973807Z [ERROR]   
PrestoS3FileSystemITCase>AbstractHadoopFileSystemITTest.testSimpleFileWriteAndRead:87
 » UnrecoverableS3Operation
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-16739) PrestoS3FileSystemITCase#testSimpleFileWriteAndRead fails with no such key

2020-03-23 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16739:
-
Fix Version/s: 1.11.0

> PrestoS3FileSystemITCase#testSimpleFileWriteAndRead fails with no such key
> --
>
> Key: FLINK-16739
> URL: https://issues.apache.org/jira/browse/FLINK-16739
> Project: Flink
>  Issue Type: Task
>  Components: Connectors / FileSystem, Tests
>Reporter: Zhijiang
>Priority: Major
>  Labels: test-stability
> Fix For: 1.11.0
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16629) Streaming bucketing end-to-end test output hash mismatch

2020-03-23 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065340#comment-17065340
 ] 

Zhijiang commented on FLINK-16629:
--

Another instance: 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6546&view=logs&j=68a897ab-3047-5660-245a-cce8f83859f6&t=375367d9-d72e-5c21-3be0-b45149130f6b]

 
[https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_apis/build/builds/6546/logs/679]

 

> Streaming bucketing end-to-end test output hash mismatch
> 
>
> Key: FLINK-16629
> URL: https://issues.apache.org/jira/browse/FLINK-16629
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Tests
>Affects Versions: 1.11.0
>Reporter: Piotr Nowojski
>Assignee: Robert Metzger
>Priority: Blocker
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_apis/build/builds/6298/logs/722
> Some of the output mismatch failures were reported in another ticket: 
> https://issues.apache.org/jira/browse/FLINK-16227
> {code}
> 2020-03-17T02:04:19.9176915Z Number of produced values 30618/6
> 2020-03-17T02:04:19.9202731Z Truncating buckets
> 2020-03-17T02:04:25.0504959Z Truncating buckets
> 2020-03-17T02:04:30.1731295Z Truncating buckets
> 2020-03-17T02:04:35.3190114Z Truncating buckets
> 2020-03-17T02:04:40.4723887Z Truncating buckets
> 2020-03-17T02:04:45.5984655Z Truncating buckets
> 2020-03-17T02:04:50.7185356Z Truncating buckets
> 2020-03-17T02:04:55.8627129Z Truncating buckets
> 2020-03-17T02:05:01.0715985Z Number of produced values 74008/6
> 2020-03-17T02:05:02.3976850Z Cancelling job dba2fdb79579158295db27d0214fc2ff.
> 2020-03-17T02:05:03.4633541Z Cancelled job dba2fdb79579158295db27d0214fc2ff.
> 2020-03-17T02:05:03.4738270Z Waiting for job 
> (dba2fdb79579158295db27d0214fc2ff) to reach terminal state CANCELED ...
> 2020-03-17T02:05:03.5149228Z Job (dba2fdb79579158295db27d0214fc2ff) reached 
> terminal state CANCELED
> 2020-03-17T02:05:03.5150587Z Job dba2fdb79579158295db27d0214fc2ff was 
> cancelled, time to verify
> 2020-03-17T02:05:03.5590118Z FAIL Bucketing Sink: Output hash mismatch.  Got 
> c3787e7a52d913675e620837a7531742, expected 01aba5ff77a0ef5e5cf6a727c248bdc3.
> 2020-03-17T02:05:03.5591888Z head hexdump of actual:
> 2020-03-17T02:05:03.5989908Z 000   (   7   ,   1   0   ,   0   ,   S   o  
>  m   e   p   a   y
> 2020-03-17T02:05:03.5991252Z 010   l   o   a   d   .   .   .   )  \n   (  
>  7   ,   1   0   ,   1
> 2020-03-17T02:05:03.5991923Z 020   ,   S   o   m   e   p   a   y   l  
>  o   a   d   .   .   .
> 2020-03-17T02:05:03.5993055Z 030   )  \n   (   7   ,   1   0   ,   2   ,  
>  S   o   m   e   p
> 2020-03-17T02:05:03.5993690Z 040   a   y   l   o   a   d   .   .   .   )  
> \n   (   7   ,   1   0
> 2020-03-17T02:05:03.5994332Z 050   ,   3   ,   S   o   m   e   p   a  
>  y   l   o   a   d   .
> 2020-03-17T02:05:03.5994967Z 060   .   .   )  \n   (   7   ,   1   0   ,  
>  4   ,   S   o   m   e
> 2020-03-17T02:05:03.5995744Z 070   p   a   y   l   o   a   d   .   .  
>  .   )  \n   (   7   ,
> 2020-03-17T02:05:03.5996359Z 080   1   0   ,   5   ,   S   o   m   e  
>  p   a   y   l   o   a
> 2020-03-17T02:05:03.5997133Z 090   d   .   .   .   )  \n   (   7   ,   1  
>  0   ,   6   ,   S   o
> 2020-03-17T02:05:03.5997704Z 0a0   m   e   p   a   y   l   o   a   d  
>  .   .   .   )  \n   (
> 2020-03-17T02:05:03.5998295Z 0b0   7   ,   1   0   ,   7   ,   S   o   m  
>  e   p   a   y   l
> 2020-03-17T02:05:03.5999087Z 0c0   o   a   d   .   .   .   )  \n   (   7  
>  ,   1   0   ,   8   ,
> 2020-03-17T02:05:03.6000243Z 0d0   S   o   m   e   p   a   y   l   o  
>  a   d   .   .   .   )
> 2020-03-17T02:05:03.6000880Z 0e0  \n   (   7   ,   1   0   ,   9   ,   S  
>  o   m   e   p   a
> 2020-03-17T02:05:03.6001494Z 0f0   y   l   o   a   d   .   .   .   )  \n  
>   
> 2020-03-17T02:05:03.6001999Z 0fa
> 2020-03-17T02:05:03.9875220Z Stopping taskexecutor daemon (pid: 49278) on 
> host fv-az668.
> 2020-03-17T02:05:04.2569285Z Stopping standalonesession daemon (pid: 46323) 
> on host fv-az668.
> 2020-03-17T02:05:04.7664418Z Stopping taskexecutor daemon (pid: 46615) on 
> host fv-az668.
> 2020-03-17T02:05:04.7674722Z Skipping taskexecutor daemon (pid: 47009), 
> because it is not running anymore on fv-az668.
> 2020-03-17T02:05:04.7687383Z Skipping taskexecutor daemon (pid: 47299), 
> because it is not running anymore on fv-az668.
> 2020-03-17T02:05:04.7689091Z Skipping taskexecutor daemon (pid: 47619), 
> because it is not running a

[jira] [Created] (FLINK-16750) Kerberized YARN on Docker test fails with starting Hadoop cluster

2020-03-24 Thread Zhijiang (Jira)
Zhijiang created FLINK-16750:


 Summary: Kerberized YARN on Docker test fails with starting Hadoop 
cluster
 Key: FLINK-16750
 URL: https://issues.apache.org/jira/browse/FLINK-16750
 Project: Flink
  Issue Type: Task
  Components: Deployment / Docker, Deployment / YARN, Tests
Reporter: Zhijiang


Build: 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6563&view=results]

logs
{code:java}
2020-03-24T08:48:53.3813297Z 
==
2020-03-24T08:48:53.3814016Z Running 'Running Kerberized YARN on Docker test 
(custom fs plugin)'
2020-03-24T08:48:53.3814511Z 
==
2020-03-24T08:48:53.3827028Z TEST_DATA_DIR: 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-53382133956
2020-03-24T08:48:56.1944456Z Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
2020-03-24T08:48:56.2300265Z Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
2020-03-24T08:48:56.2412349Z Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
2020-03-24T08:48:56.2861072Z Docker version 19.03.8, build afacb8b7f0
2020-03-24T08:48:56.8025297Z docker-compose version 1.25.4, build 8d51620a
2020-03-24T08:48:56.8499071Z Flink Tarball directory 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-53382133956
2020-03-24T08:48:56.8501170Z Flink tarball filename flink.tar.gz
2020-03-24T08:48:56.8502612Z Flink distribution directory name 
flink-1.11-SNAPSHOT
2020-03-24T08:48:56.8504724Z End-to-end directory 
/home/vsts/work/1/s/flink-end-to-end-tests
2020-03-24T08:48:56.8620115Z Building Hadoop Docker container
2020-03-24T08:48:56.9117609Z Sending build context to Docker daemon  56.83kB
2020-03-24T08:48:56.9117926Z 
2020-03-24T08:48:57.0076373Z Step 1/54 : FROM sequenceiq/pam:ubuntu-14.04
2020-03-24T08:48:57.0082811Z  ---> df7bea4c5f64
2020-03-24T08:48:57.0084798Z Step 2/54 : RUN set -x && addgroup hadoop 
&& useradd -d /home/hdfs -ms /bin/bash -G hadoop -p hdfs hdfs && useradd -d 
/home/yarn -ms /bin/bash -G hadoop -p yarn yarn && useradd -d /home/mapred 
-ms /bin/bash -G hadoop -p mapred mapred && useradd -d /home/hadoop-user 
-ms /bin/bash -p hadoop-user hadoop-user
2020-03-24T08:48:57.0092833Z  ---> Using cache
2020-03-24T08:48:57.0093976Z  ---> 3c12a7d3e20c
2020-03-24T08:48:57.0096889Z Step 3/54 : RUN set -x && apt-get update && 
apt-get install -y curl tar sudo openssh-server openssh-client rsync unzip 
krb5-user
2020-03-24T08:48:57.0106188Z  ---> Using cache
2020-03-24T08:48:57.0107830Z  ---> 9a59599596be
2020-03-24T08:48:57.0110793Z Step 4/54 : RUN set -x && mkdir -p 
/var/log/kerberos && touch /var/log/kerberos/kadmind.log
2020-03-24T08:48:57.0118896Z  ---> Using cache
2020-03-24T08:48:57.0121035Z  ---> c83551d4f695
2020-03-24T08:48:57.0125298Z Step 5/54 : RUN set -x && rm -f 
/etc/ssh/ssh_host_dsa_key /etc/ssh/ssh_host_rsa_key /root/.ssh/id_rsa && 
ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key && ssh-keygen -q -N 
"" -t rsa -f /etc/ssh/ssh_host_rsa_key && ssh-keygen -q -N "" -t rsa -f 
/root/.ssh/id_rsa && cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
2020-03-24T08:48:57.0133473Z  ---> Using cache
2020-03-24T08:48:57.0134240Z  ---> f69560c2bc0a
2020-03-24T08:48:57.0135683Z Step 6/54 : RUN set -x && mkdir -p 
/usr/java/default && curl -Ls 
'http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz'
 -H 'Cookie: oraclelicense=accept-securebackup-cookie' | tar 
--strip-components=1 -xz -C /usr/java/default/
2020-03-24T08:48:57.0148145Z  ---> Using cache
2020-03-24T08:48:57.0149008Z  ---> f824256d72f1
2020-03-24T08:48:57.0152616Z Step 7/54 : ENV JAVA_HOME /usr/java/default
2020-03-24T08:48:57.0155992Z  ---> Using cache
2020-03-24T08:48:57.0160104Z  ---> 770e6bfd219a
2020-03-24T08:48:57.0160410Z Step 8/54 : ENV PATH $PATH:$JAVA_HOME/bin
2020-03-24T08:48:57.0168690Z  ---> Using cache
2020-03-24T08:48:57.0169451Z  ---> 2643e1a25898
2020-03-24T08:48:57.0174785Z Step 9/54 : RUN set -x && curl -LOH 'Cookie: 
oraclelicense=accept-securebackup-cookie' 
'http://download.oracle.com/otn-pub/java/jce/8/jce_policy-8.zip' && unzip 
jce_policy-8.zip && cp /UnlimitedJCEPolicyJDK8/local_policy.jar 
/UnlimitedJCEPolicyJDK8/US_export_policy.jar $JAVA_HOME/jre/lib/security
2020-03-24T08:48:57.0187797Z  ---> Using cache
2020-03-24T08:48:57.0188202Z  ---> 51cf2085f95d
2020-03-24T08:48:57.0188467Z Step 10/54 : ARG HADOOP_VERSION=2.8.4
2020-03-24T08:48:57.0199344Z  ---> Using cache
2020-03-24T08:48:57.0199846Z  ---> d169c15c288c
2020-03-24T08:48:57.0200

[jira] [Updated] (FLINK-16750) Kerberized YARN on Docker test fails with starting Hadoop cluster

2020-03-24 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16750:
-
Fix Version/s: 1.11.0

> Kerberized YARN on Docker test fails with starting Hadoop cluster
> 
>
> Key: FLINK-16750
> URL: https://issues.apache.org/jira/browse/FLINK-16750
> Project: Flink
>  Issue Type: Task
>  Components: Deployment / Docker, Deployment / YARN, Tests
>Reporter: Zhijiang
>Priority: Major
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> 2020-03-24T08:48:57.0148145Z  ---> Using cache
> 2020-03-24T08:48:57.0149008Z  ---> f824256d72f1
> 2020-03-24T08:48:57.0152616Z Step 7/54 : ENV JAVA_HOME /usr/java/default
> 2020-03-24T08:48:57.0155992Z  ---> Using cache
> 2020-03-24T08:48:57.0160104Z  ---> 770e6bfd219a
> 2020-03-24T08:48:57.0160410Z Step 8/54 : ENV PATH $PATH:$JAVA_HOME/bin
> 2020-03-24T08:48:57.0168690Z  ---> Using cache
> 2020-03-24T08:48:57.0169451Z  ---> 2643e1a25898
> 2020-03-24T08:48:57.0174785Z Step 9/54 : RUN set -x && curl -LOH 'Cookie: 
> oraclelicense=accept-securebackup-cookie' 
> 'http://download.oracle.com/otn-pub/java/jce/8/jce_p

[jira] [Commented] (FLINK-16720) Maven gets stuck downloading artifacts on Azure

2020-03-24 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066345#comment-17066345
 ] 

Zhijiang commented on FLINK-16720:
--

[~rmetzger], thanks for the correction. It is indeed the issue of 
TaskExecutorTest.testSlotAcceptance from the main stack trace. I should have been more careful. 

> Maven gets stuck downloading artifacts on Azure
> ---
>
> Key: FLINK-16720
> URL: https://issues.apache.org/jira/browse/FLINK-16720
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> Logs: 
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6509&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=27d1d645-cbce-54e2-51c4-d8b45fe24607
> {code}
> 2020-03-23T08:43:28.4128014Z [INFO] 
> 
> 2020-03-23T08:43:28.4128557Z [INFO] Building flink-avro-confluent-registry 
> 1.11-SNAPSHOT
> 2020-03-23T08:43:28.4129129Z [INFO] 
> 
> 2020-03-23T08:48:47.6591333Z 
> ==
> 2020-03-23T08:48:47.6594540Z Maven produced no output for 300 seconds.
> 2020-03-23T08:48:47.6595164Z 
> ==
> 2020-03-23T08:48:47.6605370Z 
> ==
> 2020-03-23T08:48:47.6605803Z The following Java processes are running (JPS)
> 2020-03-23T08:48:47.6606173Z 
> ==
> 2020-03-23T08:48:47.7710037Z 920 Jps
> 2020-03-23T08:48:47.7778561Z 238 Launcher
> 2020-03-23T08:48:47.9270289Z 
> ==
> 2020-03-23T08:48:47.9270832Z Printing stack trace of Java process 967
> 2020-03-23T08:48:47.9271199Z 
> ==
> 2020-03-23T08:48:48.0165945Z 967: No such process
> 2020-03-23T08:48:48.0218260Z 
> ==
> 2020-03-23T08:48:48.0218736Z Printing stack trace of Java process 238
> 2020-03-23T08:48:48.0219075Z 
> ==
> 2020-03-23T08:48:48.3404066Z 2020-03-23 08:48:48
> 2020-03-23T08:48:48.3404828Z Full thread dump OpenJDK 64-Bit Server VM 
> (25.242-b08 mixed mode):
> 2020-03-23T08:48:48.3405064Z 
> 2020-03-23T08:48:48.3405445Z "Attach Listener" #370 daemon prio=9 os_prio=0 
> tid=0x7fe130001000 nid=0x452 waiting on condition [0x]
> 2020-03-23T08:48:48.3405868Zjava.lang.Thread.State: RUNNABLE
> 2020-03-23T08:48:48.3411202Z 
> 2020-03-23T08:48:48.3413171Z "resolver-5" #105 daemon prio=5 os_prio=0 
> tid=0x7fe1ec2ad800 nid=0x177 waiting on condition [0x7fe1872d9000]
> 2020-03-23T08:48:48.3414175Zjava.lang.Thread.State: WAITING (parking)
> 2020-03-23T08:48:48.3414560Z  at sun.misc.Unsafe.park(Native Method)
> 2020-03-23T08:48:48.3415451Z  - parking to wait for  <0x0003d5a9f828> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 2020-03-23T08:48:48.3416180Z  at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 2020-03-23T08:48:48.3416825Z  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 2020-03-23T08:48:48.3417602Z  at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 2020-03-23T08:48:48.3418250Z  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> 2020-03-23T08:48:48.3418930Z  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> 2020-03-23T08:48:48.3419900Z  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2020-03-23T08:48:48.3420395Z  at java.lang.Thread.run(Thread.java:748)
> 2020-03-23T08:48:48.3420648Z 
> 2020-03-23T08:48:48.3421424Z "resolver-4" #104 daemon prio=5 os_prio=0 
> tid=0x7fe1ec2ad000 nid=0x176 waiting on condition [0x7fe1863dd000]
> 2020-03-23T08:48:48.3421914Zjava.lang.Thread.State: WAITING (parking)
> 2020-03-23T08:48:48.3422233Z  at sun.misc.Unsafe.park(Native Method)
> 2020-03-23T08:48:48.3422919Z  - parking to wait for  <0x0003d5a9f828> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 2020-03-23T08:48:48.3423447Z  at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 2020-03-

[jira] [Closed] (FLINK-16712) Refactor StreamTask to construct final fields

2020-03-24 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang closed FLINK-16712.

Resolution: Fixed

Merged in master: bb4ec22a7d0e0d0831ec56b121eefb465bf8f939

> Refactor StreamTask to construct final fields
> -
>
> Key: FLINK-16712
> URL: https://issues.apache.org/jira/browse/FLINK-16712
> Project: Flink
>  Issue Type: Task
>  Components: Runtime / Task
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> At the moment four fields are initialized in StreamTask#beforeInvoke: 
> `stateBackend`, `checkpointStorage`, `timerService`, and 
> `asyncOperationsThreadPool`.
> In general, final fields are preferred for their well-known benefits, such as 
> guaranteed one-time assignment and safe publication. So we can refactor 
> StreamTask to initialize these fields in the constructor instead.
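The refactoring pattern described above can be sketched in plain Java. This is an illustrative sketch only: the field names mirror the ticket, but the class is hypothetical and not Flink's actual StreamTask implementation.

```java
// Illustrative sketch: moving setup-method initialization into the
// constructor so the fields can be declared final. Not Flink's actual code.
public class TaskSketch {
    // Final fields are assigned exactly once, at construction time, so they
    // are guaranteed to be initialized before any instance method runs.
    private final String stateBackend;
    private final String checkpointStorage;

    public TaskSketch(String stateBackend, String checkpointStorage) {
        this.stateBackend = stateBackend;
        this.checkpointStorage = checkpointStorage;
    }

    // No beforeInvoke()-style setup phase needed: the fields are always ready.
    public String describe() {
        return stateBackend + "/" + checkpointStorage;
    }
}
```

With this shape there is no window in which the task object exists but its collaborators are still null.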



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16629) Streaming bucketing end-to-end test output hash mismatch

2020-03-24 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066426#comment-17066426
 ] 

Zhijiang commented on FLINK-16629:
--

Another instance 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6591&view=logs&s=9fca669f-5c5f-59c7-4118-e31c641064f0&j=68a897ab-3047-5660-245a-cce8f83859f6]

> Streaming bucketing end-to-end test output hash mismatch
> 
>
> Key: FLINK-16629
> URL: https://issues.apache.org/jira/browse/FLINK-16629
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream, Tests
>Affects Versions: 1.11.0
>Reporter: Piotr Nowojski
>Assignee: Robert Metzger
>Priority: Blocker
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_apis/build/builds/6298/logs/722
> Some of the output mismatch failures were reported in another ticket: 
> https://issues.apache.org/jira/browse/FLINK-16227
> {code}
> 2020-03-17T02:04:19.9176915Z Number of produced values 30618/6
> 2020-03-17T02:04:19.9202731Z Truncating buckets
> 2020-03-17T02:04:25.0504959Z Truncating buckets
> 2020-03-17T02:04:30.1731295Z Truncating buckets
> 2020-03-17T02:04:35.3190114Z Truncating buckets
> 2020-03-17T02:04:40.4723887Z Truncating buckets
> 2020-03-17T02:04:45.5984655Z Truncating buckets
> 2020-03-17T02:04:50.7185356Z Truncating buckets
> 2020-03-17T02:04:55.8627129Z Truncating buckets
> 2020-03-17T02:05:01.0715985Z Number of produced values 74008/6
> 2020-03-17T02:05:02.3976850Z Cancelling job dba2fdb79579158295db27d0214fc2ff.
> 2020-03-17T02:05:03.4633541Z Cancelled job dba2fdb79579158295db27d0214fc2ff.
> 2020-03-17T02:05:03.4738270Z Waiting for job 
> (dba2fdb79579158295db27d0214fc2ff) to reach terminal state CANCELED ...
> 2020-03-17T02:05:03.5149228Z Job (dba2fdb79579158295db27d0214fc2ff) reached 
> terminal state CANCELED
> 2020-03-17T02:05:03.5150587Z Job dba2fdb79579158295db27d0214fc2ff was 
> cancelled, time to verify
> 2020-03-17T02:05:03.5590118Z FAIL Bucketing Sink: Output hash mismatch.  Got 
> c3787e7a52d913675e620837a7531742, expected 01aba5ff77a0ef5e5cf6a727c248bdc3.
> 2020-03-17T02:05:03.5591888Z head hexdump of actual:
> 2020-03-17T02:05:03.5989908Z 000   (   7   ,   1   0   ,   0   ,   S   o  
>  m   e   p   a   y
> 2020-03-17T02:05:03.5991252Z 010   l   o   a   d   .   .   .   )  \n   (  
>  7   ,   1   0   ,   1
> 2020-03-17T02:05:03.5991923Z 020   ,   S   o   m   e   p   a   y   l  
>  o   a   d   .   .   .
> 2020-03-17T02:05:03.5993055Z 030   )  \n   (   7   ,   1   0   ,   2   ,  
>  S   o   m   e   p
> 2020-03-17T02:05:03.5993690Z 040   a   y   l   o   a   d   .   .   .   )  
> \n   (   7   ,   1   0
> 2020-03-17T02:05:03.5994332Z 050   ,   3   ,   S   o   m   e   p   a  
>  y   l   o   a   d   .
> 2020-03-17T02:05:03.5994967Z 060   .   .   )  \n   (   7   ,   1   0   ,  
>  4   ,   S   o   m   e
> 2020-03-17T02:05:03.5995744Z 070   p   a   y   l   o   a   d   .   .  
>  .   )  \n   (   7   ,
> 2020-03-17T02:05:03.5996359Z 080   1   0   ,   5   ,   S   o   m   e  
>  p   a   y   l   o   a
> 2020-03-17T02:05:03.5997133Z 090   d   .   .   .   )  \n   (   7   ,   1  
>  0   ,   6   ,   S   o
> 2020-03-17T02:05:03.5997704Z 0a0   m   e   p   a   y   l   o   a   d  
>  .   .   .   )  \n   (
> 2020-03-17T02:05:03.5998295Z 0b0   7   ,   1   0   ,   7   ,   S   o   m  
>  e   p   a   y   l
> 2020-03-17T02:05:03.5999087Z 0c0   o   a   d   .   .   .   )  \n   (   7  
>  ,   1   0   ,   8   ,
> 2020-03-17T02:05:03.6000243Z 0d0   S   o   m   e   p   a   y   l   o  
>  a   d   .   .   .   )
> 2020-03-17T02:05:03.6000880Z 0e0  \n   (   7   ,   1   0   ,   9   ,   S  
>  o   m   e   p   a
> 2020-03-17T02:05:03.6001494Z 0f0   y   l   o   a   d   .   .   .   )  \n  
>   
> 2020-03-17T02:05:03.6001999Z 0fa
> 2020-03-17T02:05:03.9875220Z Stopping taskexecutor daemon (pid: 49278) on 
> host fv-az668.
> 2020-03-17T02:05:04.2569285Z Stopping standalonesession daemon (pid: 46323) 
> on host fv-az668.
> 2020-03-17T02:05:04.7664418Z Stopping taskexecutor daemon (pid: 46615) on 
> host fv-az668.
> 2020-03-17T02:05:04.7674722Z Skipping taskexecutor daemon (pid: 47009), 
> because it is not running anymore on fv-az668.
> 2020-03-17T02:05:04.7687383Z Skipping taskexecutor daemon (pid: 47299), 
> because it is not running anymore on fv-az668.
> 2020-03-17T02:05:04.7689091Z Skipping taskexecutor daemon (pid: 47619), 
> because it is not running anymore on fv-az668.
> 2020-03-17T02:05:04.7690289Z Stopping taskexecutor daemon (pid: 48538) on 
> host fv-az6

[jira] [Created] (FLINK-16768) HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart runs without exit

2020-03-25 Thread Zhijiang (Jira)
Zhijiang created FLINK-16768:


 Summary: 
HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart runs without 
exit
 Key: FLINK-16768
 URL: https://issues.apache.org/jira/browse/FLINK-16768
 Project: Flink
  Issue Type: Task
  Components: FileSystems, Tests
Reporter: Zhijiang
 Fix For: 1.11.0


Logs: 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6584&view=logs&j=d44f43ce-542c-597d-bf94-b0718c71e5e8&t=d26b3528-38b0-53d2-05f7-37557c2405e4]
{code:java}
2020-03-24T15:52:18.9196862Z "main" #1 prio=5 os_prio=0 tid=0x7fd36c00b800 
nid=0xc21 runnable [0x7fd3743ce000]
2020-03-24T15:52:18.9197235Zjava.lang.Thread.State: RUNNABLE
2020-03-24T15:52:18.9197536Zat 
java.net.SocketInputStream.socketRead0(Native Method)
2020-03-24T15:52:18.9197931Zat 
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
2020-03-24T15:52:18.9198340Zat 
java.net.SocketInputStream.read(SocketInputStream.java:171)
2020-03-24T15:52:18.9198749Zat 
java.net.SocketInputStream.read(SocketInputStream.java:141)
2020-03-24T15:52:18.9199171Zat 
sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
2020-03-24T15:52:18.9199840Zat 
sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593)
2020-03-24T15:52:18.9200265Zat 
sun.security.ssl.InputRecord.read(InputRecord.java:532)
2020-03-24T15:52:18.9200663Zat 
sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
2020-03-24T15:52:18.9201213Z- locked <0x927583d8> (a 
java.lang.Object)
2020-03-24T15:52:18.9201589Zat 
sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
2020-03-24T15:52:18.9202026Zat 
sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
2020-03-24T15:52:18.9202583Z- locked <0x92758c00> (a 
sun.security.ssl.AppInputStream)
2020-03-24T15:52:18.9203029Zat 
org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
2020-03-24T15:52:18.9203558Zat 
org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:198)
2020-03-24T15:52:18.9204121Zat 
org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176)
2020-03-24T15:52:18.9204626Zat 
org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135)
2020-03-24T15:52:18.9205121Zat 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
2020-03-24T15:52:18.9205679Zat 
com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
2020-03-24T15:52:18.9206164Zat 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
2020-03-24T15:52:18.9206786Zat 
com.amazonaws.services.s3.internal.S3AbortableInputStream.read(S3AbortableInputStream.java:125)
2020-03-24T15:52:18.9207361Zat 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
2020-03-24T15:52:18.9207839Zat 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
2020-03-24T15:52:18.9208327Zat 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
2020-03-24T15:52:18.9208809Zat 
com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
2020-03-24T15:52:18.9209273Zat 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
2020-03-24T15:52:18.9210003Zat 
com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:107)
2020-03-24T15:52:18.9210658Zat 
com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:82)
2020-03-24T15:52:18.9211154Zat 
org.apache.hadoop.fs.s3a.S3AInputStream.lambda$read$3(S3AInputStream.java:445)
2020-03-24T15:52:18.9211631Zat 
org.apache.hadoop.fs.s3a.S3AInputStream$$Lambda$42/1936375962.execute(Unknown 
Source)
2020-03-24T15:52:18.9212044Zat 
org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
2020-03-24T15:52:18.9212553Zat 
org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:260)
2020-03-24T15:52:18.9212972Zat 
org.apache.hadoop.fs.s3a.Invoker$$Lambda$23/1457226878.execute(Unknown Source)
2020-03-24T15:52:18.9213408Zat 
org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:317)
2020-03-24T15:52:18.9213866Zat 
org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:256)
2020-03-24T15:52:18.9214273Zat 
org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:231)
2020-03-24T15:52:18.9214701Zat 
org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:441)
2020-03-24T15:52:18.9215443Z- locked <0x926e88b0> (a 
org.apache.hadoop.fs.s3a.S3AInputStream)
2020-03-24T15:52:18.9215852Zat 
java.io.DataInputStream.read(DataInputStream.java:149)
2020-03-24T15:52:18.9216305Zat 
org.apache.flink.runtime.fs.hdfs.HadoopDataInputStream.read(HadoopDataInputStream.java:94)
2020-03-24T

[jira] [Commented] (FLINK-16753) Exception from AsyncCheckpointRunnable should be wrapped in CheckpointException

2020-03-25 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066463#comment-17066463
 ] 

Zhijiang commented on FLINK-16753:
--

Hey [~wind_ljy], could you point out in which class 
`owner.getEnvironment().declineCheckpoint` is called? That would help us better 
understand the context.

 

> Exception from AsyncCheckpointRunnable should be wrapped in 
> CheckpointException
> ---
>
> Key: FLINK-16753
> URL: https://issues.apache.org/jira/browse/FLINK-16753
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Jiayi Liao
>Priority: Major
>
> If an exception is thrown while a task is performing an async checkpoint, the 
> checkpoint will be declined and regarded as 
> {{CheckpointFailureReason.JOB_FAILURE}}, which gives users a misleading message.
> I think we can simply replace
> {code:java}
> owner.getEnvironment().declineCheckpoint(checkpointMetaData.getCheckpointId(),
>  checkpointException);
> {code}
> with
>  
> {code:java}
> owner.getEnvironment().declineCheckpoint(checkpointMetaData.getCheckpointId(),
>  new CheckpointException(CheckpointFailureReason.EXCEPTION, 
> checkpointException));
> {code}
> cc [~trohrmann]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16645) Limit the maximum backlogs in subpartitions for data skew case

2020-03-25 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066465#comment-17066465
 ] 

Zhijiang commented on FLINK-16645:
--

Thanks for your interest. If you would like to work on it, I can assign it to you. :)

> Limit the maximum backlogs in subpartitions for data skew case
> --
>
> Key: FLINK-16645
> URL: https://issues.apache.org/jira/browse/FLINK-16645
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Network
>Reporter: Zhijiang
>Priority: Major
> Fix For: 1.11.0
>
>
> In the case of data skew, most of the buffers in the partition's 
> LocalBufferPool may be requested and accumulated in a single subpartition, 
> which increases the amount of in-flight data and slows down barrier alignment.
> We can introduce a config option to control how many backlogs are allowed per 
> subpartition. Once a subpartition reaches this threshold, the buffer pool 
> becomes unavailable, which blocks further task processing. This reduces the 
> in-flight data to speed up the checkpoint process a bit without impacting 
> performance.
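The proposed mechanism can be sketched as follows. This is a hypothetical illustration with made-up names (`BacklogLimitedPool`, `onBufferAdded`, etc.), not Flink's actual LocalBufferPool API: a pool reports itself unavailable once any subpartition's backlog reaches the configured maximum, so the producing task blocks instead of piling up in-flight buffers in the skewed subpartition.

```java
// Hypothetical sketch of a per-subpartition backlog limit; names are
// illustrative and do not match Flink's real network-stack classes.
public class BacklogLimitedPool {
    private final int maxBacklogPerSubpartition;
    private final int[] backlogs; // current backlog count per subpartition

    public BacklogLimitedPool(int numSubpartitions, int maxBacklogPerSubpartition) {
        this.maxBacklogPerSubpartition = maxBacklogPerSubpartition;
        this.backlogs = new int[numSubpartitions];
    }

    // The pool is unavailable as soon as one subpartition hits the threshold,
    // which is what would block the producing task in the data-skew case.
    public boolean isAvailable() {
        for (int backlog : backlogs) {
            if (backlog >= maxBacklogPerSubpartition) {
                return false;
            }
        }
        return true;
    }

    public void onBufferAdded(int subpartition) { backlogs[subpartition]++; }
    public void onBufferConsumed(int subpartition) { backlogs[subpartition]--; }
}
```

The trade-off is the one the ticket describes: a smaller bound means less in-flight data and faster barrier alignment, at the risk of throttling throughput if set too low.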



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16638) Flink checkStateMappingCompleteness doesn't include UserDefinedOperatorIDs

2020-03-25 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066466#comment-17066466
 ] 

Zhijiang commented on FLINK-16638:
--

[~basharaj], do you want to contribute a PR for this bug? If so, I can assign 
this ticket to you. :)

> Flink checkStateMappingCompleteness doesn't include UserDefinedOperatorIDs
> --
>
> Key: FLINK-16638
> URL: https://issues.apache.org/jira/browse/FLINK-16638
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.9.1
>Reporter: Bashar Abdul Jawad
>Priority: Critical
> Fix For: 1.11.0
>
>
> [StateAssignmentOperation.checkStateMappingCompleteness|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/StateAssignmentOperation.java#L555]
>  doesn't check for UserDefinedOperatorIDs (specified using setUidHash), 
> causing the exception:
> {code}
>  java.lang.IllegalStateException: There is no operator for the state {}
> {code}
> to be thrown when a savepoint can't be mapped to an ExecutionJobVertex, even 
> when the operator hash is explicitly specified.
> I believe this logic should be extended to also include 
> UserDefinedOperatorIDs as so:
> {code:java}
> for (ExecutionJobVertex executionJobVertex : tasks) {
>   allOperatorIDs.addAll(executionJobVertex.getOperatorIDs());
>   allOperatorIDs.addAll(executionJobVertex.getUserDefinedOperatorIDs());
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-16582) NettyBufferPoolTest may have warns on NettyBuffer leak

2020-03-25 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16582:
-
Fix Version/s: 1.11.0

> NettyBufferPoolTest may have warns on NettyBuffer leak 
> ---
>
> Key: FLINK-16582
> URL: https://issues.apache.org/jira/browse/FLINK-16582
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Network, Tests
>Reporter: Yun Gao
>Assignee: Yun Gao
>Priority: Major
> Fix For: 1.11.0
>
>
> {code:java}
> 4749 [Flink Netty Client (50072) Thread 0] ERROR
> org.apache.flink.shaded.netty4.io.netty.util.ResourceLeakDetector [] - LEAK:
> ByteBuf.release() was not called before it's garbage-collected. See
> https://netty.io/wiki/reference-counted-objects.html for more information.
> Recent access records: 
> Created at:
>   
> org.apache.flink.shaded.netty4.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:349)
>   
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
>   
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
>   
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115)
>   
> org.apache.flink.runtime.io.network.netty.NettyBufferPoolTest.testNoHeapAllocations(NettyBufferPoolTest.java:38)
>   sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   java.lang.reflect.Method.invoke(Method.java:498)
>   
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   org.junit.runners.Suite.runChild(Suite.java:128)
>   org.junit.runners.Suite.runChild(Suite.java:27)
>   org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   
> com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
>   
> com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:230)
>   com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:58)
> Test ignored.
> Process finished with exit code 0
> {code}
> We should release the allocated buffers in the tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-16561) Resuming Externalized Checkpoint (rocks, incremental, no parallelism change) end-to-end test fails on Azure

2020-03-25 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16561:
-
Fix Version/s: 1.11.0

> Resuming Externalized Checkpoint (rocks, incremental, no parallelism change) 
> end-to-end test fails on Azure
> ---
>
> Key: FLINK-16561
> URL: https://issues.apache.org/jira/browse/FLINK-16561
> Project: Flink
>  Issue Type: Test
>  Components: Runtime / Checkpointing, Tests
>Affects Versions: 1.11.0
>Reporter: Biao Liu
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> {quote}Caused by: java.io.IOException: Cannot access file system for 
> checkpoint/savepoint path 'file://.'.
>   at 
> org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:233)
>   at 
> org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:110)
>   at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1332)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:314)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:247)
>   at 
> org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:223)
>   at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:118)
>   at 
> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:103)
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:281)
>   at 
> org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:269)
>   at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
>   at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
>   at 
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.(JobManagerRunnerImpl.java:146)
>   ... 10 more
> Caused by: java.io.IOException: Found local file path with authority '.' in 
> path 'file://.'. Hint: Did you forget a slash? (correct path would be 
> 'file:///.')
>   at 
> org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:441)
>   at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:389)
>   at org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
>   at 
> org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:230)
>   ... 22 more
> {quote}
> The original log is here, 
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6073&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=2b7514ee-e706-5046-657b-3430666e7bd9
> There are some similar tickets about this case, but the stack here looks 
> different. 
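The hint in the stack trace reflects standard URI parsing: everything after `//` up to the next `/` is taken as the authority component, so in `file://.` the `.` becomes the authority, not part of the path. A minimal demonstration with `java.net.URI` (shown for illustration; Flink's `Path`/`FileSystem` classes do their own parsing, but follow the same component rules):

```java
import java.net.URI;

public class UriAuthorityDemo {
    public static void main(String[] args) {
        // After "//", the next segment is parsed as the authority component,
        // so "." lands in the authority and the path is empty.
        URI twoSlashes = URI.create("file://.");
        // With three slashes the authority is empty and "/." is the path.
        URI threeSlashes = URI.create("file:///.");
        System.out.println("authority=" + twoSlashes.getAuthority()
                + " path=" + twoSlashes.getPath());
        System.out.println("authority=" + threeSlashes.getAuthority()
                + " path=" + threeSlashes.getPath());
    }
}
```

This is why the error message suggests `file:///.` as the corrected path.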



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file

2020-03-25 Thread Zhijiang (Jira)
Zhijiang created FLINK-16770:


 Summary: Resuming Externalized Checkpoint (rocks, incremental, 
scale up) end-to-end test fails with no such file
 Key: FLINK-16770
 URL: https://issues.apache.org/jira/browse/FLINK-16770
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing, Tests
Reporter: Zhijiang
 Fix For: 1.11.0


The log : 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6603&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]

 

There was a similar problem in 
https://issues.apache.org/jira/browse/FLINK-16561, but for the case of no 
parallelism change, whereas this case is for scaling up. It is not yet clear 
whether the root cause is the same.
{code:java}
2020-03-25T06:50:31.3894841Z Running 'Resuming Externalized Checkpoint (rocks, 
incremental, scale up) end-to-end test'
2020-03-25T06:50:31.3895308Z 
==
2020-03-25T06:50:31.3907274Z TEST_DATA_DIR: 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304
2020-03-25T06:50:31.5500274Z Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
2020-03-25T06:50:31.6354639Z Starting cluster.
2020-03-25T06:50:31.8871932Z Starting standalonesession daemon on host fv-az655.
2020-03-25T06:50:33.5021784Z Starting taskexecutor daemon on host fv-az655.
2020-03-25T06:50:33.5152274Z Waiting for Dispatcher REST endpoint to come up...
2020-03-25T06:50:34.5498116Z Waiting for Dispatcher REST endpoint to come up...
2020-03-25T06:50:35.6031346Z Waiting for Dispatcher REST endpoint to come up...
2020-03-25T06:50:36.9848425Z Waiting for Dispatcher REST endpoint to come up...
2020-03-25T06:50:38.0283377Z Dispatcher REST endpoint is up.
2020-03-25T06:50:38.0285490Z Running externalized checkpoints test, with 
ORIGINAL_DOP=2 NEW_DOP=4 and STATE_BACKEND_TYPE=rocks 
STATE_BACKEND_FILE_ASYNC=true STATE_BACKEND_ROCKSDB_INCREMENTAL=true 
SIMULATE_FAILURE=false ...
2020-03-25T06:50:46.1754645Z Job (b8cb04e4b1e730585bc616aa352866d0) is running.
2020-03-25T06:50:46.1758132Z Waiting for job (b8cb04e4b1e730585bc616aa352866d0) 
to have at least 1 completed checkpoints ...
2020-03-25T06:50:46.3478276Z Waiting for job to process up to 200 records, 
current progress: 173 records ...
2020-03-25T06:50:49.6332988Z Cancelling job b8cb04e4b1e730585bc616aa352866d0.
2020-03-25T06:50:50.4875673Z Cancelled job b8cb04e4b1e730585bc616aa352866d0.
2020-03-25T06:50:50.5468230Z ls: cannot access 
'/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304/externalized-chckpt-e2e-backend-dir/b8cb04e4b1e730585bc616aa352866d0/chk-[1-9]*/_metadata':
 No such file or directory
2020-03-25T06:50:50.5606260Z Restoring job with externalized checkpoint at . ...
2020-03-25T06:50:58.4728245Z 
2020-03-25T06:50:58.4732663Z 

2020-03-25T06:50:58.4735785Z  The program finished with the following exception:
2020-03-25T06:50:58.4737759Z 
2020-03-25T06:50:58.4742666Z 
org.apache.flink.client.program.ProgramInvocationException: The main method 
caused an error: java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
JobGraph.
2020-03-25T06:50:58.4746274Zat 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
2020-03-25T06:50:58.4749954Zat 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
2020-03-25T06:50:58.4752753Zat 
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:142)
2020-03-25T06:50:58.4755400Zat 
org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:659)
2020-03-25T06:50:58.4757862Zat 
org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210)
2020-03-25T06:50:58.4760282Zat 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:890)
2020-03-25T06:50:58.4763591Zat 
org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:963)
2020-03-25T06:50:58.4764274Zat 
java.security.AccessController.doPrivileged(Native Method)
2020-03-25T06:50:58.4764809Zat 
javax.security.auth.Subject.doAs(Subject.java:422)
2020-03-25T06:50:58.4765434Zat 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
2020-03-25T06:50:58.4766180Zat 
org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
2020-03-25T06:50:58.4773549Zat 
org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:963)
2020-03-25T06:50:58.4774502Z Caused by: java.lang.RuntimeException: 
java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.client.JobSubmissionExcept

[jira] [Commented] (FLINK-16753) Exception from AsyncCheckpointRunnable should be wrapped in CheckpointException

2020-03-25 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066579#comment-17066579
 ] 

Zhijiang commented on FLINK-16753:
--

Thanks for the updates, it is clear to me now. I think it is reasonable to reuse 
the existing `CheckpointException` to carry a specific internal 
`CheckpointFailureReason` instead of a general `Exception`.

I can assign this ticket to you if you are willing to contribute the PR. :)

> Exception from AsyncCheckpointRunnable should be wrapped in 
> CheckpointException
> ---
>
> Key: FLINK-16753
> URL: https://issues.apache.org/jira/browse/FLINK-16753
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Jiayi Liao
>Priority: Major
>
> If an exception is thrown from a task's async checkpoint process, the 
> checkpoint will be declined as expected, but the reason for declining the 
> checkpoint will be reported as {{CheckpointFailureReason.JOB_FAILURE}}, which 
> gives users a misleading message.
> I think we can simply replace
> {code:java}
> owner.getEnvironment().declineCheckpoint(checkpointMetaData.getCheckpointId(),
>  checkpointException);
> {code}
> with
>  
> {code:java}
> owner.getEnvironment().declineCheckpoint(checkpointMetaData.getCheckpointId(),
>  new CheckpointException(CheckpointFailureReason.EXCEPTION, 
> checkpointException));
> {code}
> in {{AsyncCheckpointRunnable.handleExecutionException}}.
> cc [~trohrmann]
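A minimal, self-contained sketch of why the wrapping matters. The classes below are hypothetical stand-ins, not Flink's real `CheckpointException` or `Environment`; they only model the behavior described above: the coordinator side derives the failure reason from the exception type, so an unwrapped `Throwable` degenerates to the generic {{JOB_FAILURE}}.

```java
// Stand-in types (NOT Flink's real classes) illustrating the reported fix:
// wrapping the async-checkpoint failure preserves the accurate reason.
public class DeclineSketch {

    enum CheckpointFailureReason { JOB_FAILURE, EXCEPTION }

    static class CheckpointException extends RuntimeException {
        final CheckpointFailureReason reason;

        CheckpointException(CheckpointFailureReason reason, Throwable cause) {
            super(cause);
            this.reason = reason;
        }
    }

    // Stand-in for Environment#declineCheckpoint: report the carried reason
    // if present, otherwise fall back to the generic JOB_FAILURE.
    static CheckpointFailureReason decline(Throwable t) {
        return (t instanceof CheckpointException)
                ? ((CheckpointException) t).reason
                : CheckpointFailureReason.JOB_FAILURE;
    }

    public static void main(String[] args) {
        Throwable raw = new RuntimeException("async snapshot failed");

        // Before the fix: the raw exception is misreported as a job failure.
        System.out.println("unwrapped -> " + decline(raw));

        // After the fix: wrapping keeps the real failure reason visible.
        Throwable wrapped =
                new CheckpointException(CheckpointFailureReason.EXCEPTION, raw);
        System.out.println("wrapped   -> " + decline(wrapped));
    }
}
```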



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15772) Shaded Hadoop S3A with credentials provider end-to-end test fails on travis

2020-03-25 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066583#comment-17066583
 ] 

Zhijiang commented on FLINK-15772:
--

Another instance 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6606&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]

> Shaded Hadoop S3A with credentials provider end-to-end test fails on travis
> ---
>
> Key: FLINK-15772
> URL: https://issues.apache.org/jira/browse/FLINK-15772
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / FileSystem, Tests
>Affects Versions: 1.10.0, 1.11.0
>Reporter: Yu Li
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> Shaded Hadoop S3A with credentials provider end-to-end test fails on travis 
> with below error:
> {code}
> Job with JobID 048b4651c0ba858b926aeb36f5315058 has finished.
> Job Runtime: 6016 ms
> sort: cannot read: 
> '/home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-17605057822/temp/test_batch_wordcount-2abf3dbf-b4ba-4d3a-a43b-c43e710eb439*':
>  No such file or directory
> FAIL WordCount (hadoop_with_provider): Output hash mismatch.  Got 
> d41d8cd98f00b204e9800998ecf8427e, expected 72a690412be8928ba239c2da967328a5.
> head hexdump of actual:
> head: cannot open 
> '/home/travis/build/apache/flink/flink-end-to-end-tests/test-scripts/temp-test-directory-17605057822/temp/test_batch_wordcount-2abf3dbf-b4ba-4d3a-a43b-c43e710eb439*'
>  for reading: No such file or directory
> ed2bf7571ec8ab184b7316809da0b2facb9b367a7c7f0f1bdaac6dd5e6f107ae
> ed2bf7571ec8ab184b7316809da0b2facb9b367a7c7f0f1bdaac6dd5e6f107ae
> [FAIL] Test script contains errors.
> Checking for errors...
> No errors in log files.
> Checking for exceptions...
> No exceptions in log files.
> Checking for non-empty .out files...
> No non-empty .out files.
> [FAIL] 'Shaded Hadoop S3A with credentials provider end-to-end test' failed 
> after 0 minutes and 20 seconds! Test exited with exit code 1
> {code}
> https://api.travis-ci.org/v3/job/641444512/log.txt
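Worth noting: the `Got d41d8cd98f00b204e9800998ecf8427e` in the quoted log is the MD5 of empty input, which suggests the output glob matched no files at all rather than the job writing wrong data. A minimal shell sketch of how such a hash check degenerates to the empty digest (paths here are illustrative, not from the real test script):

```shell
# Build a fake output directory with one result file.
OUT_DIR=$(mktemp -d)
printf 'hello world\n' > "$OUT_DIR/part-0"

# Hash whatever the glob matches; a non-matching glob stays a literal string,
# cat fails silently, and md5sum hashes empty input.
hash_of() { cat "$@" 2>/dev/null | md5sum | awk '{print $1}'; }

ACTUAL=$(hash_of "$OUT_DIR"/part-*)           # hashes the real output
MISSING=$(hash_of "$OUT_DIR"/no-such-file-*)  # glob matches nothing -> empty-input md5

echo "actual=$ACTUAL"
echo "missing=$MISSING"
```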





[jira] [Assigned] (FLINK-16753) Exception from AsyncCheckpointRunnable should be wrapped in CheckpointException

2020-03-25 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-16753:


Assignee: Jiayi Liao

> Exception from AsyncCheckpointRunnable should be wrapped in 
> CheckpointException
> ---
>
> Key: FLINK-16753
> URL: https://issues.apache.org/jira/browse/FLINK-16753
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.0
>Reporter: Jiayi Liao
>Assignee: Jiayi Liao
>Priority: Major
>
> If an exception is thrown from a task's async checkpoint process, the 
> checkpoint will be declined as expected, but the reason for declining the 
> checkpoint will be reported as {{CheckpointFailureReason.JOB_FAILURE}}, which 
> gives users a misleading message.
> I think we can simply replace
> {code:java}
> owner.getEnvironment().declineCheckpoint(checkpointMetaData.getCheckpointId(),
>  checkpointException);
> {code}
> with
>  
> {code:java}
> owner.getEnvironment().declineCheckpoint(checkpointMetaData.getCheckpointId(),
>  new CheckpointException(CheckpointFailureReason.EXCEPTION, 
> checkpointException));
> {code}
> in {{AsyncCheckpointRunnable.handleExecutionException}}.
> cc [~trohrmann]





[jira] [Commented] (FLINK-16720) Maven gets stuck downloading artifacts on Azure

2020-03-26 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067448#comment-17067448
 ] 

Zhijiang commented on FLINK-16720:
--

Another instance 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6629&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=27d1d645-cbce-54e2-51c4-d8b45fe24607]
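The `Maven produced no output for 300 seconds` banner in the quoted log presumably comes from a CI watchdog that monitors the build's output stream. A minimal bash sketch of that pattern (names and the banner text are illustrative, not the real Azure script):

```shell
#!/usr/bin/env bash
# Read a command's output line by line with a per-line timeout, and print a
# banner when the command stays silent longer than the threshold.
watchdog() {
    local timeout_s=$1; shift
    "$@" | {
        while true; do
            if IFS= read -r -t "$timeout_s" line; then
                echo "$line"
            else
                # bash read exits >128 on timeout, 1 on end-of-stream.
                rc=$?
                if [ "$rc" -gt 128 ]; then
                    echo "== no output for ${timeout_s} seconds =="
                fi
                break
            fi
        done
    }
}

# Demo: the inner command goes silent for 2s, tripping a 1s watchdog.
watchdog 1 bash -c 'echo building; sleep 2; echo done'
```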

> Maven gets stuck downloading artifacts on Azure
> ---
>
> Key: FLINK-16720
> URL: https://issues.apache.org/jira/browse/FLINK-16720
> Project: Flink
>  Issue Type: Bug
>  Components: Build System / Azure Pipelines
>Affects Versions: 1.11.0
>Reporter: Robert Metzger
>Assignee: Robert Metzger
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> Logs: 
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6509&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=27d1d645-cbce-54e2-51c4-d8b45fe24607
> {code}
> 2020-03-23T08:43:28.4128014Z [INFO] 
> 
> 2020-03-23T08:43:28.4128557Z [INFO] Building flink-avro-confluent-registry 
> 1.11-SNAPSHOT
> 2020-03-23T08:43:28.4129129Z [INFO] 
> 
> 2020-03-23T08:48:47.6591333Z 
> ==
> 2020-03-23T08:48:47.6594540Z Maven produced no output for 300 seconds.
> 2020-03-23T08:48:47.6595164Z 
> ==
> 2020-03-23T08:48:47.6605370Z 
> ==
> 2020-03-23T08:48:47.6605803Z The following Java processes are running (JPS)
> 2020-03-23T08:48:47.6606173Z 
> ==
> 2020-03-23T08:48:47.7710037Z 920 Jps
> 2020-03-23T08:48:47.7778561Z 238 Launcher
> 2020-03-23T08:48:47.9270289Z 
> ==
> 2020-03-23T08:48:47.9270832Z Printing stack trace of Java process 967
> 2020-03-23T08:48:47.9271199Z 
> ==
> 2020-03-23T08:48:48.0165945Z 967: No such process
> 2020-03-23T08:48:48.0218260Z 
> ==
> 2020-03-23T08:48:48.0218736Z Printing stack trace of Java process 238
> 2020-03-23T08:48:48.0219075Z 
> ==
> 2020-03-23T08:48:48.3404066Z 2020-03-23 08:48:48
> 2020-03-23T08:48:48.3404828Z Full thread dump OpenJDK 64-Bit Server VM 
> (25.242-b08 mixed mode):
> 2020-03-23T08:48:48.3405064Z 
> 2020-03-23T08:48:48.3405445Z "Attach Listener" #370 daemon prio=9 os_prio=0 
> tid=0x7fe130001000 nid=0x452 waiting on condition [0x]
> 2020-03-23T08:48:48.3405868Zjava.lang.Thread.State: RUNNABLE
> 2020-03-23T08:48:48.3411202Z 
> 2020-03-23T08:48:48.3413171Z "resolver-5" #105 daemon prio=5 os_prio=0 
> tid=0x7fe1ec2ad800 nid=0x177 waiting on condition [0x7fe1872d9000]
> 2020-03-23T08:48:48.3414175Zjava.lang.Thread.State: WAITING (parking)
> 2020-03-23T08:48:48.3414560Z  at sun.misc.Unsafe.park(Native Method)
> 2020-03-23T08:48:48.3415451Z  - parking to wait for  <0x0003d5a9f828> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 2020-03-23T08:48:48.3416180Z  at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 2020-03-23T08:48:48.3416825Z  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> 2020-03-23T08:48:48.3417602Z  at 
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> 2020-03-23T08:48:48.3418250Z  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> 2020-03-23T08:48:48.3418930Z  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> 2020-03-23T08:48:48.3419900Z  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2020-03-23T08:48:48.3420395Z  at java.lang.Thread.run(Thread.java:748)
> 2020-03-23T08:48:48.3420648Z 
> 2020-03-23T08:48:48.3421424Z "resolver-4" #104 daemon prio=5 os_prio=0 
> tid=0x7fe1ec2ad000 nid=0x176 waiting on condition [0x7fe1863dd000]
> 2020-03-23T08:48:48.3421914Zjava.lang.Thread.State: WAITING (parking)
> 2020-03-23T08:48:48.3422233Z  at sun.misc.Unsafe.park(Native Method)
> 2020-03-23T08:48:48.3422919Z  - parking to wait for  <0x0003d5a9f828> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 2020-03-23T08:48:48.3423447Z  at 
> java.util.co

[jira] [Created] (FLINK-16821) Run Kubernetes test failed with invalid named "minikube"

2020-03-26 Thread Zhijiang (Jira)
Zhijiang created FLINK-16821:


 Summary: Run Kubernetes test failed with invalid named "minikube"
 Key: FLINK-16821
 URL: https://issues.apache.org/jira/browse/FLINK-16821
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes, Tests
Reporter: Zhijiang


This is the test run 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6702&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]

Log output
{code:java}
2020-03-27T00:07:38.9666021Z Running 'Run Kubernetes test'
2020-03-27T00:07:38.956Z 
==
2020-03-27T00:07:38.9677101Z TEST_DATA_DIR: 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-38967103614
2020-03-27T00:07:41.7529865Z Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
2020-03-27T00:07:41.7721475Z Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
2020-03-27T00:07:41.8208394Z Docker version 19.03.8, build afacb8b7f0
2020-03-27T00:07:42.4793914Z docker-compose version 1.25.4, build 8d51620a
2020-03-27T00:07:42.5359301Z Installing minikube ...
2020-03-27T00:07:42.5494076Z   % Total% Received % Xferd  Average Speed   
TimeTime Time  Current
2020-03-27T00:07:42.5494729Z  Dload  Upload   
Total   SpentLeft  Speed
2020-03-27T00:07:42.5498136Z 
2020-03-27T00:07:42.6214887Z   0 00 00 0  0  0 
--:--:-- --:--:-- --:--:-- 0
2020-03-27T00:07:43.3467750Z   0 00 00 0  0  0 
--:--:-- --:--:-- --:--:-- 0
2020-03-27T00:07:43.3469636Z 100 52.0M  100 52.0M0 0  65.2M  0 
--:--:-- --:--:-- --:--:-- 65.2M
2020-03-27T00:07:43.4262625Z * There is no local cluster named "minikube"
2020-03-27T00:07:43.4264438Z   - To fix this, run: minikube start
2020-03-27T00:07:43.4282404Z Starting minikube ...
2020-03-27T00:07:43.7749694Z * minikube v1.9.0 on Ubuntu 16.04
2020-03-27T00:07:43.7761742Z * Using the none driver based on user configuration
2020-03-27T00:07:43.7762229Z X The none driver requires conntrack to be 
installed for kubernetes version 1.18.0
2020-03-27T00:07:43.8202161Z * There is no local cluster named "minikube"
2020-03-27T00:07:43.8203353Z   - To fix this, run: minikube start
2020-03-27T00:07:43.8568899Z * There is no local cluster named "minikube"
2020-03-27T00:07:43.8570685Z   - To fix this, run: minikube start
2020-03-27T00:07:43.8583793Z Command: start_kubernetes_if_not_running failed. 
Retrying...
2020-03-27T00:07:48.9017252Z * There is no local cluster named "minikube"
2020-03-27T00:07:48.9019347Z   - To fix this, run: minikube start
2020-03-27T00:07:48.9031515Z Starting minikube ...
2020-03-27T00:07:49.0612601Z * minikube v1.9.0 on Ubuntu 16.04
2020-03-27T00:07:49.0616688Z * Using the none driver based on user configuration
2020-03-27T00:07:49.0620173Z X The none driver requires conntrack to be 
installed for kubernetes version 1.18.0
2020-03-27T00:07:49.1040676Z * There is no local cluster named "minikube"
2020-03-27T00:07:49.1042353Z   - To fix this, run: minikube start
2020-03-27T00:07:49.1453522Z * There is no local cluster named "minikube"
2020-03-27T00:07:49.1454594Z   - To fix this, run: minikube start
2020-03-27T00:07:49.1468436Z Command: start_kubernetes_if_not_running failed. 
Retrying...
2020-03-27T00:07:54.1907713Z * There is no local cluster named "minikube"
2020-03-27T00:07:54.1909876Z   - To fix this, run: minikube start
2020-03-27T00:07:54.1921479Z Starting minikube ...
2020-03-27T00:07:54.3388738Z * minikube v1.9.0 on Ubuntu 16.04
2020-03-27T00:07:54.3395499Z * Using the none driver based on user configuration
2020-03-27T00:07:54.3396443Z X The none driver requires conntrack to be 
installed for kubernetes version 1.18.0
2020-03-27T00:07:54.3824399Z * There is no local cluster named "minikube"
2020-03-27T00:07:54.3837652Z   - To fix this, run: minikube start
2020-03-27T00:07:54.4203902Z * There is no local cluster named "minikube"
2020-03-27T00:07:54.4204895Z   - To fix this, run: minikube start
2020-03-27T00:07:54.4217866Z Command: start_kubernetes_if_not_running failed. 
Retrying...
2020-03-27T00:07:59.4235917Z Command: start_kubernetes_if_not_running failed 3 
times.
2020-03-27T00:07:59.4236459Z Could not start minikube. Aborting...
2020-03-27T00:07:59.8439850Z The connection to the server localhost:8080 was 
refused - did you specify the right host or port?
2020-03-27T00:07:59.8939088Z The connection to the server localhost:8080 was 
refused - did you specify the right host or port?
2020-03-27T00:07:59.9515679Z The connection to the server localhost:8080 was 
refused - did you specify the right host or port?
2020-03-27T00:07:59.9528463Z Stopping minikube ...
2020-03-27T00:07:59.9

[jira] [Commented] (FLINK-16821) Run Kubernetes test failed with invalid named "minikube"

2020-03-26 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068254#comment-17068254
 ] 

Zhijiang commented on FLINK-16821:
--

Another instance 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6705&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
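The `start_kubernetes_if_not_running failed. Retrying... / failed 3 times` sequence in the quoted log follows a fail-three-times-then-abort pattern. A minimal shell sketch of such a retry wrapper (function names mirror the log; the always-failing stand-in is illustrative):

```shell
# Run a command up to 3 times, echoing the same messages the CI log shows.
retry3() {
    for i in 1 2 3; do
        if "$@"; then return 0; fi
        echo "Command: $1 failed. Retrying..."
    done
    echo "Command: $1 failed 3 times."
    return 1
}

# Stand-in that always fails, like minikube start without conntrack installed.
start_kubernetes_if_not_running() { false; }

retry3 start_kubernetes_if_not_running || echo "Could not start minikube. Aborting..."
```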

> Run Kubernetes test failed with invalid named "minikube"
> 
>
> Key: FLINK-16821
> URL: https://issues.apache.org/jira/browse/FLINK-16821
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Tests
>Reporter: Zhijiang
>Priority: Major
>  Labels: test-stability
>
> This is the test run 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6702&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
> Log output
> {code:java}
> 2020-03-27T00:07:38.9666021Z Running 'Run Kubernetes test'
> 2020-03-27T00:07:38.956Z 
> ==
> 2020-03-27T00:07:38.9677101Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-38967103614
> 2020-03-27T00:07:41.7529865Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-27T00:07:41.7721475Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-27T00:07:41.8208394Z Docker version 19.03.8, build afacb8b7f0
> 2020-03-27T00:07:42.4793914Z docker-compose version 1.25.4, build 8d51620a
> 2020-03-27T00:07:42.5359301Z Installing minikube ...
> 2020-03-27T00:07:42.5494076Z   % Total% Received % Xferd  Average Speed   
> TimeTime Time  Current
> 2020-03-27T00:07:42.5494729Z  Dload  Upload   
> Total   SpentLeft  Speed
> 2020-03-27T00:07:42.5498136Z 
> 2020-03-27T00:07:42.6214887Z   0 00 00 0  0  0 
> --:--:-- --:--:-- --:--:-- 0
> 2020-03-27T00:07:43.3467750Z   0 00 00 0  0  0 
> --:--:-- --:--:-- --:--:-- 0
> 2020-03-27T00:07:43.3469636Z 100 52.0M  100 52.0M0 0  65.2M  0 
> --:--:-- --:--:-- --:--:-- 65.2M
> 2020-03-27T00:07:43.4262625Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.4264438Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.4282404Z Starting minikube ...
> 2020-03-27T00:07:43.7749694Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:43.7761742Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:43.7762229Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:43.8202161Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.8203353Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.8568899Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.8570685Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.8583793Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:48.9017252Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:48.9019347Z   - To fix this, run: minikube start
> 2020-03-27T00:07:48.9031515Z Starting minikube ...
> 2020-03-27T00:07:49.0612601Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:49.0616688Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:49.0620173Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:49.1040676Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:49.1042353Z   - To fix this, run: minikube start
> 2020-03-27T00:07:49.1453522Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:49.1454594Z   - To fix this, run: minikube start
> 2020-03-27T00:07:49.1468436Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:54.1907713Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.1909876Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.1921479Z Starting minikube ...
> 2020-03-27T00:07:54.3388738Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:54.3395499Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:54.3396443Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:54.3824399Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.3837652Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.4203902Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.4204895Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.4217866Z Command: start_kubernetes_if_not_running failed. 
> Retrying..

[jira] [Updated] (FLINK-16821) Run Kubernetes test failed with invalid named "minikube"

2020-03-26 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16821:
-
Priority: Critical  (was: Major)

> Run Kubernetes test failed with invalid named "minikube"
> 
>
> Key: FLINK-16821
> URL: https://issues.apache.org/jira/browse/FLINK-16821
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Tests
>Reporter: Zhijiang
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> This is the test run 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6702&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
> Log output
> {code:java}
> 2020-03-27T00:07:38.9666021Z Running 'Run Kubernetes test'
> 2020-03-27T00:07:38.956Z 
> ==
> 2020-03-27T00:07:38.9677101Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-38967103614
> 2020-03-27T00:07:41.7529865Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-27T00:07:41.7721475Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-27T00:07:41.8208394Z Docker version 19.03.8, build afacb8b7f0
> 2020-03-27T00:07:42.4793914Z docker-compose version 1.25.4, build 8d51620a
> 2020-03-27T00:07:42.5359301Z Installing minikube ...
> 2020-03-27T00:07:42.5494076Z   % Total% Received % Xferd  Average Speed   
> TimeTime Time  Current
> 2020-03-27T00:07:42.5494729Z  Dload  Upload   
> Total   SpentLeft  Speed
> 2020-03-27T00:07:42.5498136Z 
> 2020-03-27T00:07:42.6214887Z   0 00 00 0  0  0 
> --:--:-- --:--:-- --:--:-- 0
> 2020-03-27T00:07:43.3467750Z   0 00 00 0  0  0 
> --:--:-- --:--:-- --:--:-- 0
> 2020-03-27T00:07:43.3469636Z 100 52.0M  100 52.0M0 0  65.2M  0 
> --:--:-- --:--:-- --:--:-- 65.2M
> 2020-03-27T00:07:43.4262625Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.4264438Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.4282404Z Starting minikube ...
> 2020-03-27T00:07:43.7749694Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:43.7761742Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:43.7762229Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:43.8202161Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.8203353Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.8568899Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.8570685Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.8583793Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:48.9017252Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:48.9019347Z   - To fix this, run: minikube start
> 2020-03-27T00:07:48.9031515Z Starting minikube ...
> 2020-03-27T00:07:49.0612601Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:49.0616688Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:49.0620173Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:49.1040676Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:49.1042353Z   - To fix this, run: minikube start
> 2020-03-27T00:07:49.1453522Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:49.1454594Z   - To fix this, run: minikube start
> 2020-03-27T00:07:49.1468436Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:54.1907713Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.1909876Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.1921479Z Starting minikube ...
> 2020-03-27T00:07:54.3388738Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:54.3395499Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:54.3396443Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:54.3824399Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.3837652Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.4203902Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.4204895Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.4217866Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:59.4235917Z Command: start_kubernetes_if_not_running failed 
> 3 times.
> 2020-03-27T00:07:59.4236459Z Could not start minikube. Aborting.

[jira] [Commented] (FLINK-16821) Run Kubernetes test failed with invalid named "minikube"

2020-03-26 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068257#comment-17068257
 ] 

Zhijiang commented on FLINK-16821:
--

Another instance 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6708&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]

> Run Kubernetes test failed with invalid named "minikube"
> 
>
> Key: FLINK-16821
> URL: https://issues.apache.org/jira/browse/FLINK-16821
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Tests
>Reporter: Zhijiang
>Priority: Major
>  Labels: test-stability
>
> This is the test run 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6702&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
> Log output
> {code:java}
> 2020-03-27T00:07:38.9666021Z Running 'Run Kubernetes test'
> 2020-03-27T00:07:38.956Z 
> ==
> 2020-03-27T00:07:38.9677101Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-38967103614
> 2020-03-27T00:07:41.7529865Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-27T00:07:41.7721475Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-27T00:07:41.8208394Z Docker version 19.03.8, build afacb8b7f0
> 2020-03-27T00:07:42.4793914Z docker-compose version 1.25.4, build 8d51620a
> 2020-03-27T00:07:42.5359301Z Installing minikube ...
> 2020-03-27T00:07:42.5494076Z   % Total% Received % Xferd  Average Speed   
> TimeTime Time  Current
> 2020-03-27T00:07:42.5494729Z  Dload  Upload   
> Total   SpentLeft  Speed
> 2020-03-27T00:07:42.5498136Z 
> 2020-03-27T00:07:42.6214887Z   0 00 00 0  0  0 
> --:--:-- --:--:-- --:--:-- 0
> 2020-03-27T00:07:43.3467750Z   0 00 00 0  0  0 
> --:--:-- --:--:-- --:--:-- 0
> 2020-03-27T00:07:43.3469636Z 100 52.0M  100 52.0M0 0  65.2M  0 
> --:--:-- --:--:-- --:--:-- 65.2M
> 2020-03-27T00:07:43.4262625Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.4264438Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.4282404Z Starting minikube ...
> 2020-03-27T00:07:43.7749694Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:43.7761742Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:43.7762229Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:43.8202161Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.8203353Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.8568899Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.8570685Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.8583793Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:48.9017252Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:48.9019347Z   - To fix this, run: minikube start
> 2020-03-27T00:07:48.9031515Z Starting minikube ...
> 2020-03-27T00:07:49.0612601Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:49.0616688Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:49.0620173Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:49.1040676Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:49.1042353Z   - To fix this, run: minikube start
> 2020-03-27T00:07:49.1453522Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:49.1454594Z   - To fix this, run: minikube start
> 2020-03-27T00:07:49.1468436Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:54.1907713Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.1909876Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.1921479Z Starting minikube ...
> 2020-03-27T00:07:54.3388738Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:54.3395499Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:54.3396443Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:54.3824399Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.3837652Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.4203902Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.4204895Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.4217866Z Command: start_kubernetes_if_not_running failed. 
> Retrying..

[jira] [Updated] (FLINK-16821) Run Kubernetes test failed with invalid named "minikube"

2020-03-26 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16821:
-
Fix Version/s: 1.11.0

> Run Kubernetes test failed with invalid named "minikube"
> 
>
> Key: FLINK-16821
> URL: https://issues.apache.org/jira/browse/FLINK-16821
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Tests
>Reporter: Zhijiang
>Priority: Major
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> This is the test run 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6702&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
> Log output
> {code:java}
> 2020-03-27T00:07:38.9666021Z Running 'Run Kubernetes test'
> 2020-03-27T00:07:38.956Z 
> ==
> 2020-03-27T00:07:38.9677101Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-38967103614
> 2020-03-27T00:07:41.7529865Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-27T00:07:41.7721475Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-27T00:07:41.8208394Z Docker version 19.03.8, build afacb8b7f0
> 2020-03-27T00:07:42.4793914Z docker-compose version 1.25.4, build 8d51620a
> 2020-03-27T00:07:42.5359301Z Installing minikube ...
> 2020-03-27T00:07:42.5494076Z   % Total% Received % Xferd  Average Speed   
> TimeTime Time  Current
> 2020-03-27T00:07:42.5494729Z  Dload  Upload   
> Total   SpentLeft  Speed
> 2020-03-27T00:07:42.5498136Z 
> 2020-03-27T00:07:42.6214887Z   0 00 00 0  0  0 
> --:--:-- --:--:-- --:--:-- 0
> 2020-03-27T00:07:43.3467750Z   0 00 00 0  0  0 
> --:--:-- --:--:-- --:--:-- 0
> 2020-03-27T00:07:43.3469636Z 100 52.0M  100 52.0M0 0  65.2M  0 
> --:--:-- --:--:-- --:--:-- 65.2M
> 2020-03-27T00:07:43.4262625Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.4264438Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.4282404Z Starting minikube ...
> 2020-03-27T00:07:43.7749694Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:43.7761742Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:43.7762229Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:43.8202161Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.8203353Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.8568899Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:43.8570685Z   - To fix this, run: minikube start
> 2020-03-27T00:07:43.8583793Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:48.9017252Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:48.9019347Z   - To fix this, run: minikube start
> 2020-03-27T00:07:48.9031515Z Starting minikube ...
> 2020-03-27T00:07:49.0612601Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:49.0616688Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:49.0620173Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:49.1040676Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:49.1042353Z   - To fix this, run: minikube start
> 2020-03-27T00:07:49.1453522Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:49.1454594Z   - To fix this, run: minikube start
> 2020-03-27T00:07:49.1468436Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:54.1907713Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.1909876Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.1921479Z Starting minikube ...
> 2020-03-27T00:07:54.3388738Z * minikube v1.9.0 on Ubuntu 16.04
> 2020-03-27T00:07:54.3395499Z * Using the none driver based on user 
> configuration
> 2020-03-27T00:07:54.3396443Z X The none driver requires conntrack to be 
> installed for kubernetes version 1.18.0
> 2020-03-27T00:07:54.3824399Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.3837652Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.4203902Z * There is no local cluster named "minikube"
> 2020-03-27T00:07:54.4204895Z   - To fix this, run: minikube start
> 2020-03-27T00:07:54.4217866Z Command: start_kubernetes_if_not_running failed. 
> Retrying...
> 2020-03-27T00:07:59.4235917Z Command: start_kubernetes_if_not_running failed 
> 3 times.
> 2020-03-27T00:07:59.4236459Z Could not start minikube. Aborting...
> 2020-03-2
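The "failed. Retrying..." / "failed 3 times" pattern in the log above comes from the e2e test scripts' retry wrapper. As a rough sketch (not the actual script, names hypothetical), the control flow is:

```java
import java.util.function.BooleanSupplier;

// Sketch of the retry pattern seen in the log: run a step up to
// maxAttempts times, logging each failure, and give up afterwards.
public class Retry {
    public static boolean retry(BooleanSupplier step, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (step.getAsBoolean()) {
                return true;
            }
            System.out.println("Command failed. Retrying...");
        }
        System.out.println("Command failed " + maxAttempts + " times.");
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Step succeeds on the 3rd attempt, so retry() returns true.
        boolean ok = retry(() -> ++calls[0] >= 3, 3);
        System.out.println(ok + " after " + calls[0] + " attempts");
    }
}
```

Note that retrying cannot help here: the `none` driver's missing `conntrack` dependency is a deterministic environment problem, so all three attempts fail identically.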

[jira] [Commented] (FLINK-16821) Run Kubernetes test failed with invalid named "minikube"

2020-03-26 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068258#comment-17068258
 ] 

Zhijiang commented on FLINK-16821:
--

Another instance 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6709&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]

> Run Kubernetes test failed with invalid named "minikube"
> 
>
> Key: FLINK-16821
> URL: https://issues.apache.org/jira/browse/FLINK-16821
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Tests
>Reporter: Zhijiang
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> This is the test run 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6702&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]

[jira] [Updated] (FLINK-16821) Run Kubernetes test failed with invalid named "minikube"

2020-03-26 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16821:
-
Priority: Blocker  (was: Critical)

> Run Kubernetes test failed with invalid named "minikube"
> 
>
> Key: FLINK-16821
> URL: https://issues.apache.org/jira/browse/FLINK-16821
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Tests
>Reporter: Zhijiang
>Priority: Blocker
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> This is the test run 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6702&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]

[jira] [Commented] (FLINK-16750) Kerberized YARN on Docker test fails with staring Hadoop cluster

2020-03-27 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17068797#comment-17068797
 ] 

Zhijiang commented on FLINK-16750:
--

> What does "staring Hadoop cluster" mean?

[~gjy], the above logs show the message "Error: Could not start hadoop cluster" 
after several retries. I am not quite sure about the root cause.

> Kerberized YARN on Docker test fails with staring Hadoop cluster
> 
>
> Key: FLINK-16750
> URL: https://issues.apache.org/jira/browse/FLINK-16750
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Docker, Deployment / YARN, Tests
>Reporter: Zhijiang
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> Build: 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6563&view=results]
> logs
> {code:java}
> 2020-03-24T08:48:53.3813297Z 
> ==
> 2020-03-24T08:48:53.3814016Z Running 'Running Kerberized YARN on Docker test 
> (custom fs plugin)'
> 2020-03-24T08:48:53.3814511Z 
> ==
> 2020-03-24T08:48:53.3827028Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-53382133956
> 2020-03-24T08:48:56.1944456Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-24T08:48:56.2300265Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-24T08:48:56.2412349Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-24T08:48:56.2861072Z Docker version 19.03.8, build afacb8b7f0
> 2020-03-24T08:48:56.8025297Z docker-compose version 1.25.4, build 8d51620a
> 2020-03-24T08:48:56.8499071Z Flink Tarball directory 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-53382133956
> 2020-03-24T08:48:56.8501170Z Flink tarball filename flink.tar.gz
> 2020-03-24T08:48:56.8502612Z Flink distribution directory name 
> flink-1.11-SNAPSHOT
> 2020-03-24T08:48:56.8504724Z End-to-end directory 
> /home/vsts/work/1/s/flink-end-to-end-tests
> 2020-03-24T08:48:56.8620115Z Building Hadoop Docker container
> 2020-03-24T08:48:56.9117609Z Sending build context to Docker daemon  56.83kB
> 2020-03-24T08:48:56.9117926Z 
> 2020-03-24T08:48:57.0076373Z Step 1/54 : FROM sequenceiq/pam:ubuntu-14.04
> 2020-03-24T08:48:57.0082811Z  ---> df7bea4c5f64
> 2020-03-24T08:48:57.0084798Z Step 2/54 : RUN set -x && addgroup hadoop
>  && useradd -d /home/hdfs -ms /bin/bash -G hadoop -p hdfs hdfs && useradd 
> -d /home/yarn -ms /bin/bash -G hadoop -p yarn yarn && useradd -d 
> /home/mapred -ms /bin/bash -G hadoop -p mapred mapred && useradd -d 
> /home/hadoop-user -ms /bin/bash -p hadoop-user hadoop-user
> 2020-03-24T08:48:57.0092833Z  ---> Using cache
> 2020-03-24T08:48:57.0093976Z  ---> 3c12a7d3e20c
> 2020-03-24T08:48:57.0096889Z Step 3/54 : RUN set -x && apt-get update && 
> apt-get install -y curl tar sudo openssh-server openssh-client rsync 
> unzip krb5-user
> 2020-03-24T08:48:57.0106188Z  ---> Using cache
> 2020-03-24T08:48:57.0107830Z  ---> 9a59599596be
> 2020-03-24T08:48:57.0110793Z Step 4/54 : RUN set -x && mkdir -p 
> /var/log/kerberos && touch /var/log/kerberos/kadmind.log
> 2020-03-24T08:48:57.0118896Z  ---> Using cache
> 2020-03-24T08:48:57.0121035Z  ---> c83551d4f695
> 2020-03-24T08:48:57.0125298Z Step 5/54 : RUN set -x && rm -f 
> /etc/ssh/ssh_host_dsa_key /etc/ssh/ssh_host_rsa_key /root/.ssh/id_rsa && 
> ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key && ssh-keygen -q 
> -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key && ssh-keygen -q -N "" -t rsa 
> -f /root/.ssh/id_rsa && cp /root/.ssh/id_rsa.pub 
> /root/.ssh/authorized_keys
> 2020-03-24T08:48:57.0133473Z  ---> Using cache
> 2020-03-24T08:48:57.0134240Z  ---> f69560c2bc0a
> 2020-03-24T08:48:57.0135683Z Step 6/54 : RUN set -x && mkdir -p 
> /usr/java/default && curl -Ls 
> 'http://download.oracle.com/otn-pub/java/jdk/8u131-b11/d54c1d3a095b4ff2b6607d096fa80163/jdk-8u131-linux-x64.tar.gz'
>  -H 'Cookie: oraclelicense=accept-securebackup-cookie' | tar 
> --strip-components=1 -xz -C /usr/java/default/
> 2020-03-24T08:48:57.0148145Z  ---> Using cache
> 2020-03-24T08:48:57.0149008Z  ---> f824256d72f1
> 2020-03-24T08:48:57.0152616Z Step 7/54 : ENV JAVA_HOME /usr/java/default
> 2020-03-24T08:48:57.0155992Z  ---> Using cache
> 2020-03-24T08:48:57.0160104Z  ---> 770e6bfd219a
> 2020-03-24T08:48:57.0160410Z Step 8/54 : ENV PATH $PATH:$JAVA_HOME/bin
> 2020-03-24T08:48:57.0168690Z  ---> Using cache
> 2020

[jira] [Commented] (FLINK-16821) Run Kubernetes test failed with invalid named "minikube"

2020-03-28 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069458#comment-17069458
 ] 

Zhijiang commented on FLINK-16821:
--

Thanks for solving it [~rmetzger]!

I guess the fix is also needed for release-1.10? Another instance was found in 
release-1.10:
[https://travis-ci.org/github/apache/flink/builds/667815122?utm_medium=notification&utm_source=slack]

> Run Kubernetes test failed with invalid named "minikube"
> 
>
> Key: FLINK-16821
> URL: https://issues.apache.org/jira/browse/FLINK-16821
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes, Tests
>Reporter: Zhijiang
>Assignee: Robert Metzger
>Priority: Blocker
>  Labels: pull-request-available, test-stability
> Fix For: 1.11.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is the test run 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6702&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]

[jira] [Resolved] (FLINK-16262) Class loader problem with FlinkKafkaProducer.Semantic.EXACTLY_ONCE and usrlib directory

2020-03-28 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang resolved FLINK-16262.
--
Resolution: Fixed

Merged in release-1.10: e39cfe7660daaeed4213f04ccbce6de1e8d90fe5

Merged in master: ff0d0c979d7cf67648ecf91850e782e99d557240

> Class loader problem with FlinkKafkaProducer.Semantic.EXACTLY_ONCE and usrlib 
> directory
> ---
>
> Key: FLINK-16262
> URL: https://issues.apache.org/jira/browse/FLINK-16262
> Project: Flink
>  Issue Type: Bug
>  Components: Connectors / Kafka
>Affects Versions: 1.10.0
> Environment: openjdk:11-jre with a slightly modified Flink 1.10.0 
> build (nothing changed regarding Kafka and/or class loading).
>Reporter: Jürgen Kreileder
>Assignee: Guowei Ma
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.10.1, 1.11.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> We're using Docker images modeled after 
> [https://github.com/apache/flink/blob/master/flink-container/docker/Dockerfile]
>  (using Java 11)
> When I try to switch a Kafka producer from AT_LEAST_ONCE to EXACTLY_ONCE, the 
> taskmanager startup fails with:
> {code:java}
> 2020-02-24 18:25:16.389 INFO  o.a.f.r.t.Task                           Create 
> Case Fixer -> Sink: Findings local-krei04-kba-digitalweb-uc1 (1/1) 
> (72f7764c6f6c614e5355562ed3d27209) switched from RUNNING to FAILED.
> org.apache.kafka.common.config.ConfigException: Invalid value 
> org.apache.kafka.common.serialization.ByteArraySerializer for configuration 
> key.serializer: Class 
> org.apache.kafka.common.serialization.ByteArraySerializer could not be found.
>  at org.apache.kafka.common.config.ConfigDef.parseType(ConfigDef.java:718)
>  at org.apache.kafka.common.config.ConfigDef.parseValue(ConfigDef.java:471)
>  at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:464)
>  at 
> org.apache.kafka.common.config.AbstractConfig.(AbstractConfig.java:62)
>  at 
> org.apache.kafka.common.config.AbstractConfig.(AbstractConfig.java:75)
>  at 
> org.apache.kafka.clients.producer.ProducerConfig.(ProducerConfig.java:396)
>  at 
> org.apache.kafka.clients.producer.KafkaProducer.(KafkaProducer.java:326)
>  at 
> org.apache.kafka.clients.producer.KafkaProducer.(KafkaProducer.java:298)
>  at 
> org.apache.flink.streaming.connectors.kafka.internal.FlinkKafkaInternalProducer.(FlinkKafkaInternalProducer.java:76)
>  at 
> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.lambda$abortTransactions$2(FlinkKafkaProducer.java:1107)
>  at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(Unknown 
> Source)
>  at java.base/java.util.HashMap$KeySpliterator.forEachRemaining(Unknown 
> Source)
>  at java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source)
>  at java.base/java.util.stream.ForEachOps$ForEachTask.compute(Unknown Source)
>  at java.base/java.util.concurrent.CountedCompleter.exec(Unknown Source)
>  at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
>  at 
> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown 
> Source)
>  at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
>  at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
>  at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown 
> Source){code}
> This looks like a class loading issue: If I copy our JAR to FLINK_LIB_DIR 
> instead of FLINK_USR_LIB_DIR, everything works fine.
> (AT_LEAST_ONCE producers work fine with the JAR in FLINK_USR_LIB_DIR)
>  
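The symptom above is the classic Kafka-client pitfall of resolving serializer classes via the thread's context classloader, which does not see classes in `usrlib`. A common workaround pattern (a sketch only, not Flink's actual fix for this ticket; the helper name is hypothetical) is to temporarily install the user-code classloader as the context classloader around client construction:

```java
import java.util.function.Supplier;

// Sketch: run an action with a given classloader installed as the thread's
// context classloader, restoring the previous one afterwards.
public class ContextClassLoaderUtil {
    public static <T> T withContextClassLoader(ClassLoader cl, Supplier<T> action) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(cl);
        try {
            return action.get();
        } finally {
            current.setContextClassLoader(previous); // always restore
        }
    }

    public static void main(String[] args) {
        ClassLoader custom = new ClassLoader(ContextClassLoaderUtil.class.getClassLoader()) {};
        ClassLoader seen = withContextClassLoader(custom,
                () -> Thread.currentThread().getContextClassLoader());
        System.out.println(seen == custom); // true
    }
}
```

Wrapping the `KafkaProducer` constructor call this way would let `ConfigDef.parseType` find `ByteArraySerializer` from the user JAR regardless of which directory it was placed in.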



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-16787) Provide an assigner strategy of average splits allocation

2020-03-28 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16787:
-
Fix Version/s: (was: 1.11.0)

> Provide an assigner strategy of average splits allocation
> -
>
> Key: FLINK-16787
> URL: https://issues.apache.org/jira/browse/FLINK-16787
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Task
>Reporter: Jingsong Lee
>Priority: Major
>
> With the current InputSplitAssigner, each task grabs splits on demand rather 
> than receiving an even share, so if the later tasks are not scheduled yet, the 
> earlier tasks will grab all the splits.
> We can provide an assigner strategy that allocates splits evenly.
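The proposed strategy could be sketched as a pre-partitioning assigner (a minimal illustration only, not Flink's `InputSplitAssigner` API; the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: pre-assign split ids evenly to subtasks up front,
// instead of letting each subtask pull splits on demand.
public class EvenSplitAssigner {
    private final List<List<Integer>> buckets = new ArrayList<>();

    public EvenSplitAssigner(List<Integer> splitIds, int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            buckets.add(new ArrayList<>());
        }
        // Round-robin: subtask i receives splits i, i + p, i + 2p, ...
        for (int s = 0; s < splitIds.size(); s++) {
            buckets.get(s % parallelism).add(splitIds.get(s));
        }
    }

    public List<Integer> splitsFor(int subtaskIndex) {
        return buckets.get(subtaskIndex);
    }

    public static void main(String[] args) {
        EvenSplitAssigner a =
                new EvenSplitAssigner(Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9), 4);
        System.out.println(a.splitsFor(0)); // [0, 4, 8]
    }
}
```

With this shape, a late-scheduled subtask still receives its full share, at the cost of losing the load-balancing effect of pull-based assignment.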



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-16787) Provide an assigner strategy of average splits allocation

2020-03-28 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16787:
-
Component/s: Runtime / Coordination
 API / Core

> Provide an assigner strategy of average splits allocation
> -
>
> Key: FLINK-16787
> URL: https://issues.apache.org/jira/browse/FLINK-16787
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / Core, Runtime / Coordination, Runtime / Task
>Reporter: Jingsong Lee
>Priority: Major
>
> With the current InputSplitAssigner, each task grabs splits on demand rather 
> than receiving an even share, so if the later tasks are not scheduled yet, the 
> earlier tasks will grab all the splits.
> We can provide an assigner strategy that allocates splits evenly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16787) Provide an assigner strategy of average splits allocation

2020-03-28 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069927#comment-17069927
 ] 

Zhijiang commented on FLINK-16787:
--

I guess this feature would involve more components, from the API to the 
coordinator and then the task stack in batch mode, so it may need a FLIP. And I 
do not think it can be done in release-1.11, so I adjusted the related labels 
above.

> Provide an assigner strategy of average splits allocation
> -
>
> Key: FLINK-16787
> URL: https://issues.apache.org/jira/browse/FLINK-16787
> Project: Flink
>  Issue Type: Sub-task
>  Components: API / Core, Runtime / Coordination, Runtime / Task
>Reporter: Jingsong Lee
>Priority: Major
>
> With the current InputSplitAssigner, each task grabs splits on demand rather 
> than receiving an even share, so if the later tasks are not scheduled yet, the 
> earlier tasks will grab all the splits.
> We can provide an assigner strategy that allocates splits evenly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-15981) Control the direct memory in FileChannelBoundedData.FileBufferReader

2020-03-28 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-15981:
-
Fix Version/s: (was: 1.10.1)
   (was: 1.11.0)

> Control the direct memory in FileChannelBoundedData.FileBufferReader
> 
>
> Key: FLINK-15981
> URL: https://issues.apache.org/jira/browse/FLINK-15981
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.10.0
>Reporter: Jingsong Lee
>Priority: Critical
>
> Now the default blocking BoundedData is FileChannelBoundedData. Its reader 
> creates a new 64KB direct buffer per subpartition.
> When parallelism is greater than 100, users need to configure 
> "taskmanager.memory.task.off-heap.size" to avoid a direct-memory OOM. It is 
> hard to configure, and it costs a lot of memory: with a parallelism of 1000, a 
> task manager may need 1GB+.
> This hurts the scenario of few slots and large parallelism. Batch jobs could 
> run little by little, but the memory shortage would still cost a lot.
> If we provide N-input operators, things may get worse, since the number of 
> subpartitions that can be requested at the same time will grow, and we have no 
> idea how much memory that needs.
> Here are my rough thoughts:
>  * Obtain the memory from the network buffers.
>  * Provide a limit on "the maximum number of subpartitions that can be 
> requested at the same time".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file

2020-03-29 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070230#comment-17070230
 ] 

Zhijiang commented on FLINK-16770:
--

Another instance 
[https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6788&view=logs&j=7bafe89a-737e-5a81-708c-24b72a2345fc&t=8f0197c1-92aa-5b5f-4284-1ae542d75a1e]

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end 
> test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Tests
>Reporter: Zhijiang
>Priority: Major
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> The log : 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6603&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
>  
> There was also a similar problem in 
> https://issues.apache.org/jira/browse/FLINK-16561, but for the case of no 
> parallelism change, while this case is for scaling up. Not quite sure whether 
> the root cause is the same.
> {code:java}
> 2020-03-25T06:50:31.3894841Z Running 'Resuming Externalized Checkpoint 
> (rocks, incremental, scale up) end-to-end test'
> 2020-03-25T06:50:31.3895308Z 
> ==
> 2020-03-25T06:50:31.3907274Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304
> 2020-03-25T06:50:31.5500274Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-25T06:50:31.6354639Z Starting cluster.
> 2020-03-25T06:50:31.8871932Z Starting standalonesession daemon on host 
> fv-az655.
> 2020-03-25T06:50:33.5021784Z Starting taskexecutor daemon on host fv-az655.
> 2020-03-25T06:50:33.5152274Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:34.5498116Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:35.6031346Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:36.9848425Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:38.0283377Z Dispatcher REST endpoint is up.
> 2020-03-25T06:50:38.0285490Z Running externalized checkpoints test, with 
> ORIGINAL_DOP=2 NEW_DOP=4 and STATE_BACKEND_TYPE=rocks 
> STATE_BACKEND_FILE_ASYNC=true STATE_BACKEND_ROCKSDB_INCREMENTAL=true 
> SIMULATE_FAILURE=false ...
> 2020-03-25T06:50:46.1754645Z Job (b8cb04e4b1e730585bc616aa352866d0) is 
> running.
> 2020-03-25T06:50:46.1758132Z Waiting for job 
> (b8cb04e4b1e730585bc616aa352866d0) to have at least 1 completed checkpoints 
> ...
> 2020-03-25T06:50:46.3478276Z Waiting for job to process up to 200 records, 
> current progress: 173 records ...
> 2020-03-25T06:50:49.6332988Z Cancelling job b8cb04e4b1e730585bc616aa352866d0.
> 2020-03-25T06:50:50.4875673Z Cancelled job b8cb04e4b1e730585bc616aa352866d0.
> 2020-03-25T06:50:50.5468230Z ls: cannot access 
> '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304/externalized-chckpt-e2e-backend-dir/b8cb04e4b1e730585bc616aa352866d0/chk-[1-9]*/_metadata':
>  No such file or directory
> 2020-03-25T06:50:50.5606260Z Restoring job with externalized checkpoint at . 
> ...
> 2020-03-25T06:50:58.4728245Z 
> 2020-03-25T06:50:58.4732663Z 
> 
> 2020-03-25T06:50:58.4735785Z  The program finished with the following 
> exception:
> 2020-03-25T06:50:58.4737759Z 
> 2020-03-25T06:50:58.4742666Z 
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
> 2020-03-25T06:50:58.4746274Z  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
> 2020-03-25T06:50:58.4749954Z  at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
> 2020-03-25T06:50:58.4752753Z  at 
> org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:142)
> 2020-03-25T06:50:58.4755400Z  at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:659)
> 2020-03-25T06:50:58.4757862Z  at 
> org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210)
> 2020-03-25T06:50:58.4760282Z  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:890)
> 2020-03-25T06:50:58.4763591Z  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:963)
> 2020-03-25T06:50:58.4764274Z  at 
> java.securit

[jira] [Updated] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file

2020-03-29 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang updated FLINK-16770:
-
Priority: Critical  (was: Major)

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end 
> test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Tests
>Reporter: Zhijiang
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> The log : 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6603&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
>  
> There was a similar problem in 
> https://issues.apache.org/jira/browse/FLINK-16561, but that was the case of 
> no parallelism change, while this case scales up. It is not yet clear 
> whether the root cause is the same.
> {code:java}
> 2020-03-25T06:50:31.3894841Z Running 'Resuming Externalized Checkpoint 
> (rocks, incremental, scale up) end-to-end test'
> 2020-03-25T06:50:31.3895308Z 
> ==
> 2020-03-25T06:50:31.3907274Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304
> 2020-03-25T06:50:31.5500274Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-25T06:50:31.6354639Z Starting cluster.
> 2020-03-25T06:50:31.8871932Z Starting standalonesession daemon on host 
> fv-az655.
> 2020-03-25T06:50:33.5021784Z Starting taskexecutor daemon on host fv-az655.
> 2020-03-25T06:50:33.5152274Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:34.5498116Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:35.6031346Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:36.9848425Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:38.0283377Z Dispatcher REST endpoint is up.
> 2020-03-25T06:50:38.0285490Z Running externalized checkpoints test, with 
> ORIGINAL_DOP=2 NEW_DOP=4 and STATE_BACKEND_TYPE=rocks 
> STATE_BACKEND_FILE_ASYNC=true STATE_BACKEND_ROCKSDB_INCREMENTAL=true 
> SIMULATE_FAILURE=false ...
> 2020-03-25T06:50:46.1754645Z Job (b8cb04e4b1e730585bc616aa352866d0) is 
> running.
> 2020-03-25T06:50:46.1758132Z Waiting for job 
> (b8cb04e4b1e730585bc616aa352866d0) to have at least 1 completed checkpoints 
> ...
> 2020-03-25T06:50:46.3478276Z Waiting for job to process up to 200 records, 
> current progress: 173 records ...
> 2020-03-25T06:50:49.6332988Z Cancelling job b8cb04e4b1e730585bc616aa352866d0.
> 2020-03-25T06:50:50.4875673Z Cancelled job b8cb04e4b1e730585bc616aa352866d0.
> 2020-03-25T06:50:50.5468230Z ls: cannot access 
> '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304/externalized-chckpt-e2e-backend-dir/b8cb04e4b1e730585bc616aa352866d0/chk-[1-9]*/_metadata':
>  No such file or directory
> 2020-03-25T06:50:50.5606260Z Restoring job with externalized checkpoint at . 
> ...
> 2020-03-25T06:50:58.4728245Z 
> 2020-03-25T06:50:58.4732663Z 
> 
> 2020-03-25T06:50:58.4735785Z  The program finished with the following 
> exception:
> 2020-03-25T06:50:58.4737759Z 
> 2020-03-25T06:50:58.4742666Z 
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
> 2020-03-25T06:50:58.4746274Z  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
> 2020-03-25T06:50:58.4749954Z  at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
> 2020-03-25T06:50:58.4752753Z  at 
> org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:142)
> 2020-03-25T06:50:58.4755400Z  at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:659)
> 2020-03-25T06:50:58.4757862Z  at 
> org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210)
> 2020-03-25T06:50:58.4760282Z  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:890)
> 2020-03-25T06:50:58.4763591Z  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:963)
> 2020-03-25T06:50:58.4764274Z  at 
> java.security.AccessController.doPrivileged(Native Method)
> 2020-03-25T06:50:58.4764809Z  at 
> javax.security.auth.Subject.doAs(Subject.java:422)
> 2020-03-25T06:50:58.4765434Z  at 
> org.apache.hadoop

[jira] [Commented] (FLINK-16770) Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end test fails with no such file

2020-03-29 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17070233#comment-17070233
 ] 

Zhijiang commented on FLINK-16770:
--

Another instance 
[https://travis-ci.org/apache/flink/builds/668073755?utm_source=slack&utm_medium=notification]

> Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end 
> test fails with no such file
> ---
>
> Key: FLINK-16770
> URL: https://issues.apache.org/jira/browse/FLINK-16770
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing, Tests
>Reporter: Zhijiang
>Priority: Critical
>  Labels: test-stability
> Fix For: 1.11.0
>
>
> The log : 
> [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6603&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5]
>  
> There was a similar problem in 
> https://issues.apache.org/jira/browse/FLINK-16561, but that was the case of 
> no parallelism change, while this case scales up. It is not yet clear 
> whether the root cause is the same.
> {code:java}
> 2020-03-25T06:50:31.3894841Z Running 'Resuming Externalized Checkpoint 
> (rocks, incremental, scale up) end-to-end test'
> 2020-03-25T06:50:31.3895308Z 
> ==
> 2020-03-25T06:50:31.3907274Z TEST_DATA_DIR: 
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304
> 2020-03-25T06:50:31.5500274Z Flink dist directory: 
> /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT
> 2020-03-25T06:50:31.6354639Z Starting cluster.
> 2020-03-25T06:50:31.8871932Z Starting standalonesession daemon on host 
> fv-az655.
> 2020-03-25T06:50:33.5021784Z Starting taskexecutor daemon on host fv-az655.
> 2020-03-25T06:50:33.5152274Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:34.5498116Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:35.6031346Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:36.9848425Z Waiting for Dispatcher REST endpoint to come 
> up...
> 2020-03-25T06:50:38.0283377Z Dispatcher REST endpoint is up.
> 2020-03-25T06:50:38.0285490Z Running externalized checkpoints test, with 
> ORIGINAL_DOP=2 NEW_DOP=4 and STATE_BACKEND_TYPE=rocks 
> STATE_BACKEND_FILE_ASYNC=true STATE_BACKEND_ROCKSDB_INCREMENTAL=true 
> SIMULATE_FAILURE=false ...
> 2020-03-25T06:50:46.1754645Z Job (b8cb04e4b1e730585bc616aa352866d0) is 
> running.
> 2020-03-25T06:50:46.1758132Z Waiting for job 
> (b8cb04e4b1e730585bc616aa352866d0) to have at least 1 completed checkpoints 
> ...
> 2020-03-25T06:50:46.3478276Z Waiting for job to process up to 200 records, 
> current progress: 173 records ...
> 2020-03-25T06:50:49.6332988Z Cancelling job b8cb04e4b1e730585bc616aa352866d0.
> 2020-03-25T06:50:50.4875673Z Cancelled job b8cb04e4b1e730585bc616aa352866d0.
> 2020-03-25T06:50:50.5468230Z ls: cannot access 
> '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304/externalized-chckpt-e2e-backend-dir/b8cb04e4b1e730585bc616aa352866d0/chk-[1-9]*/_metadata':
>  No such file or directory
> 2020-03-25T06:50:50.5606260Z Restoring job with externalized checkpoint at . 
> ...
> 2020-03-25T06:50:58.4728245Z 
> 2020-03-25T06:50:58.4732663Z 
> 
> 2020-03-25T06:50:58.4735785Z  The program finished with the following 
> exception:
> 2020-03-25T06:50:58.4737759Z 
> 2020-03-25T06:50:58.4742666Z 
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
> 2020-03-25T06:50:58.4746274Z  at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
> 2020-03-25T06:50:58.4749954Z  at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
> 2020-03-25T06:50:58.4752753Z  at 
> org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:142)
> 2020-03-25T06:50:58.4755400Z  at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:659)
> 2020-03-25T06:50:58.4757862Z  at 
> org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210)
> 2020-03-25T06:50:58.4760282Z  at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:890)
> 2020-03-25T06:50:58.4763591Z  at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:963)
> 2020-03-25T06:50:58.4764274Z  at 
> java.security.AccessController.doPrivileged(Native Method)
> 2020-03-

[jira] [Assigned] (FLINK-16536) Implement InputChannel state recovery for unaligned checkpoint

2020-04-03 Thread Zhijiang (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijiang reassigned FLINK-16536:


Assignee: Zhijiang

> Implement InputChannel state recovery for unaligned checkpoint
> --
>
> Key: FLINK-16536
> URL: https://issues.apache.org/jira/browse/FLINK-16536
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing, Runtime / Network
>Reporter: Zhijiang
>Assignee: Zhijiang
>Priority: Major
> Fix For: 1.11.0
>
>
> During the recovery process for an unaligned checkpoint, the input channel 
> state should be recovered in addition to the existing operator states.
> The InputGate would request buffers from the local pool and then interact 
> with the ChannelStateReader to fill them with state data. The filled 
> buffers would be inserted into the respective InputChannel queues and 
> processed in the normal way.
> It must be guaranteed that new data from the upstream side does not 
> overtake the recovered state data, to avoid out-of-order issues.
> Refer to more details by 
> [https://docs.google.com/document/d/16_MOQymzxrKvUHXh6QFr2AAXIKt_2vPUf8vzKy4H_tU/edit]
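The ordering guarantee above can be modeled as a small queue simulation. This is a toy sketch only; all names are illustrative and do not correspond to Flink's actual classes:

```python
from collections import deque

class InputChannelModel:
    """Toy model of an input channel that must replay recovered state
    buffers before any new network data, so new upstream data cannot
    overtake the recovered state."""

    def __init__(self):
        self.queue = deque()
        self.recovery_done = False
        self.pending_network = []  # network data arriving mid-recovery is held back

    def recover_state_buffer(self, buf):
        # Buffers filled from the ChannelStateReader go straight into the queue.
        assert not self.recovery_done, "state recovery already finished"
        self.queue.append(("state", buf))

    def finish_recovery(self):
        # Only after all state buffers are queued may held-back network data follow.
        self.recovery_done = True
        for buf in self.pending_network:
            self.queue.append(("network", buf))
        self.pending_network.clear()

    def on_network_buffer(self, buf):
        if self.recovery_done:
            self.queue.append(("network", buf))
        else:
            self.pending_network.append(buf)  # must not overtake state data

    def drain(self):
        out = list(self.queue)
        self.queue.clear()
        return out

ch = InputChannelModel()
ch.on_network_buffer("n1")       # arrives during recovery, held back
ch.recover_state_buffer("s1")
ch.recover_state_buffer("s2")
ch.finish_recovery()
ch.on_network_buffer("n2")
print(ch.drain())
# [('state', 's1'), ('state', 's2'), ('network', 'n1'), ('network', 'n2')]
```

The point of the model is only the ordering invariant: everything tagged "state" is consumed before anything tagged "network", regardless of arrival order.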




