[jira] [Created] (FLINK-36073) ApplicationMode with the K8s operator does not support downloading jars via filesystem plugins

2024-08-15 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-36073:
--

 Summary: ApplicationMode with the K8s operator does not support 
downloading jars via filesystem plugins
 Key: FLINK-36073
 URL: https://issues.apache.org/jira/browse/FLINK-36073
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 2.0.0, 1.19.2, 1.20.1
Reporter: Robert Metzger


As discussed in this ticket https://issues.apache.org/jira/browse/FLINK-28915 
we can not define a FlinkDeployment spec with
{code}
spec:
  job:
jarURI: s3://myDevBucket/myjob.jar
{code}

It fails with:
{code}
org.apache.flink.container.entrypoint.StandaloneApplicationClusterEntryPoint.main(StandaloneApplicationClusterEntryPoint.java:89)
 [flink-dist-1.19-x.jar:1.19-x]
Caused by: java.net.MalformedURLException: unknown protocol: s3
at java.net.URL.(URL.java:652) ~[?:?]
at java.net.URL.(URL.java:541) ~[?:?]
at java.net.URL.(URL.java:488) ~[?:?]
at 
org.apache.flink.configuration.ConfigUtils.decodeListFromConfig(ConfigUtils.java:133)
 ~[flink-dist-1.19-x.jar:1.19-x]
at 
org.apache.flink.client.program.DefaultPackagedProgramRetriever.getClasspathsFromConfiguration(DefaultPackagedProgramRetriever.java:273)
 ~[flink-dist-1.19-x.jar:1.19-x]
at 
org.apache.flink.client.program.DefaultPackagedProgramRetriever.create(DefaultPackagedProgramRetriever.java:121)
 ~[flink-dist-1.19-x.jar:1.19-x]
... 4 more
{code}

even though I have defined:
{code}
  podTemplate:
spec:
  containers:
- name: flink-main-container
  env:
- name: ENABLE_BUILT_IN_PLUGINS
  value: "flink-s3-fs-presto-1.19.0.jar"
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35526) Remove deprecated stedolan/jq Docker image from Flink e2e tests

2024-06-05 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-35526:
--

 Summary: Remove deprecated stedolan/jq Docker image from Flink e2e 
tests
 Key: FLINK-35526
 URL: https://issues.apache.org/jira/browse/FLINK-35526
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Reporter: Robert Metzger
Assignee: Robert Metzger


Our CI logs contain this warning: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60060&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=0f3adb59-eefa-51c6-2858-3654d9e0749d&l=3828

{code}
latest: Pulling from stedolan/jq
[DEPRECATION NOTICE] Docker Image Format v1, and Docker Image manifest version 
2, schema 1 support will be removed in an upcoming release. Suggest the author 
of docker.io/stedolan/jq:latest to upgrade the image to the OCI Format, or 
Docker Image manifest v2, schema 2. More information at 
https://docs.docker.com/go/deprecated-image-specs/
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-33217) Flink SQL: UNNEST fails with on LEFT JOIN with NOT NULL type in array

2023-10-09 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-33217:
--

 Summary: Flink SQL: UNNEST fails with on LEFT JOIN with NOT NULL 
type in array
 Key: FLINK-33217
 URL: https://issues.apache.org/jira/browse/FLINK-33217
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.15.3, 1.18.0, 1.19.0
Reporter: Robert Metzger


Steps to reproduce:

Take a column of type 
{code:java}
business_data ROW<`id` STRING, `updateEvent` ARRAY 
NOT NULL>> {code}
Take this query
{code:java}
select id, ue_name from reproduce_unnest LEFT JOIN 
UNNEST(reproduce_unnest.business_data.updateEvent) AS exploded_ue(ue_name) ON 
true {code}
And get this error
{code:java}
Caused by: java.lang.AssertionError: Type mismatch:rowtype of rel before 
registration: RecordType(RecordType:peek_no_expand(VARCHAR(2147483647) 
CHARACTER SET "UTF-16LE" id, RecordType:peek_no_expand(VARCHAR(2147483647) 
CHARACTER SET "UTF-16LE" NOT NULL name) NOT NULL ARRAY updateEvent) 
business_data, VARCHAR(2147483647) CHARACTER SET "UTF-16LE" ue_name) NOT 
NULLrowtype of rel after registration: 
RecordType(RecordType:peek_no_expand(VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE" id, RecordType:peek_no_expand(VARCHAR(2147483647) CHARACTER SET 
"UTF-16LE" NOT NULL name) NOT NULL ARRAY updateEvent) business_data, 
VARCHAR(2147483647) CHARACTER SET "UTF-16LE" NOT NULL name) NOT 
NULLDifference:ue_name: VARCHAR(2147483647) CHARACTER SET "UTF-16LE" -> 
VARCHAR(2147483647) CHARACTER SET "UTF-16LE" NOT NULL
at org.apache.calcite.util.Litmus$1.fail(Litmus.java:32)at 
org.apache.calcite.plan.RelOptUtil.equal(RelOptUtil.java:2206)   at 
org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:275)  at 
org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270)
at 
org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598)
 at 
org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613)
 at 
org.apache.calcite.plan.volcano.VolcanoPlanner.changeTraits(VolcanoPlanner.java:498)
 at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:315) 
 at 
org.apache.flink.table.planner.plan.optimize.program.FlinkVolcanoProgram.optimize(FlinkVolcanoProgram.scala:62)
 {code}
I have implemented a small test case, which fails against Flink 1.15, 1.8 and 
the latest master branch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-32439) Kubernetes operator is silently overwriting the "execution.savepoint.path" config

2023-06-26 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-32439:
--

 Summary: Kubernetes operator is silently overwriting the 
"execution.savepoint.path" config
 Key: FLINK-32439
 URL: https://issues.apache.org/jira/browse/FLINK-32439
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Robert Metzger


I recently stumbled across the fact that the K8s operator is silently deleting 
/ overwriting the execution.savepoint.path config option.

I understand why this happens, but I wonder if the operator should write a log 
message if the user configured the execution.savepoint.path option.

And / or add a list to the docs about "Operator managed" config options?

https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java#L155-L159



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-31840) NullPointerException in operators.window.slicing.SliceAssigners$AbstractSliceAssigner.assignSliceEnd

2023-04-18 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-31840:
--

 Summary: NullPointerException in 
operators.window.slicing.SliceAssigners$AbstractSliceAssigner.assignSliceEnd
 Key: FLINK-31840
 URL: https://issues.apache.org/jira/browse/FLINK-31840
 Project: Flink
  Issue Type: Bug
Reporter: Robert Metzger


While running a Flink SQL Query (with a hop window), I got this error.
{code}
Caused by: 
org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: 
Could not forward element to next operator
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:99)
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:57)
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:56)
at 
org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:29)
at StreamExecCalc$11.processElement(Unknown Source)
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:82)
... 23 more
Caused by: java.lang.NullPointerException
at 
org.apache.flink.table.runtime.operators.window.slicing.SliceAssigners$AbstractSliceAssigner.assignSliceEnd(SliceAssigners.java:558)
at 
org.apache.flink.table.runtime.operators.aggregate.window.LocalSlicingWindowAggOperator.processElement(LocalSlicingWindowAggOperator.java:114)
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:82)
... 29 more
{code}

It was caused by a timestamp field containing NULL values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-31834) Azure Warning: no space left on device

2023-04-18 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-31834:
--

 Summary: Azure Warning: no space left on device
 Key: FLINK-31834
 URL: https://issues.apache.org/jira/browse/FLINK-31834
 Project: Flink
  Issue Type: Bug
  Components: Build System / Azure Pipelines
Reporter: Robert Metzger


In this CI run: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=48213&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=841082b6-1a93-5908-4d37-a071f4387a5f&l=21

There was this warning:
{code}
Loaded image: confluentinc/cp-kafka:6.2.2
Loaded image: testcontainers/ryuk:0.3.3
ApplyLayer exit status 1 stdout:  stderr: write /opt/jdk-15.0.1+9/lib/modules: 
no space left on device
##[error]Bash exited with code '1'.
Finishing: Restore docker images
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-31810) RocksDBException: Bad table magic number on checkpoint rescale

2023-04-14 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-31810:
--

 Summary: RocksDBException: Bad table magic number on checkpoint 
rescale
 Key: FLINK-31810
 URL: https://issues.apache.org/jira/browse/FLINK-31810
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.15.2
Reporter: Robert Metzger


While rescaling a job from checkpoint, I ran into this exception:

{code:java}
SinkMaterializer[7] -> rob-result[7]: Writer -> rob-result[7]: Committer 
(4/4)#3 (c1b348f7eed6e1ce0e41ef75338ae754) switched from INITIALIZING to FAILED 
with failure cause: java.lang.Exception: Exception while creating 
StreamOperatorStateContext.
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:255)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:265)
at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:703)
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:679)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:646)
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:917)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state 
backend for SinkUpsertMaterializer_7d9b7588bc2ff89baed50d7a4558caa4_(4/4) from 
any of the 1 provided restore options.
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346)
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:164)
... 11 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught 
unexpected exception.
at 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:395)
at 
org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:483)
at 
org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:97)
at 
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:329)
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
at 
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
... 13 more
Caused by: java.io.IOException: Error while opening RocksDB instance.
at 
org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:92)
at 
org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreDBInstanceFromStateHandle(RocksDBIncrementalRestoreOperation.java:465)
at 
org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithRescaling(RocksDBIncrementalRestoreOperation.java:321)
at 
org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:164)
at 
org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:315)
... 18 more
Caused by: org.rocksdb.RocksDBException: Bad table magic number: expected 
9863518390377041911, found 4096 in 
/tmp/job__op_SinkUpsertMaterializer_7d9b7588bc2ff89baed50d7a4558caa4__4_4__uuid_d5587dfc-78b3-427c-8cb6-35507b71bc4b/46475654-5515-430e-b215-389d42cddb97/000232.sst
at org.rocksdb.RocksDB.open(Native Method)
at org.rocksdb.RocksDB.open(RocksDB.java:306)
at 
org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:80)
... 22 more
{code}

I haven't found any other cases of this issue

[jira] [Created] (FLINK-30083) Bump maven-shade-plugin to 3.4.0

2022-11-18 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-30083:
--

 Summary: Bump maven-shade-plugin to 3.4.0
 Key: FLINK-30083
 URL: https://issues.apache.org/jira/browse/FLINK-30083
 Project: Flink
  Issue Type: Improvement
  Components: Build System
Affects Versions: 1.17.0
Reporter: Robert Metzger
 Fix For: 1.17.0


FLINK-24273 proposes to relocate the io.fabric8 dependencies of 
flink-kubernetes.
This is not possible because of a problem with the maven shade plugin ("mvn 
install" doesn't work, it needs to be "mvn clean install").
MSHADE-425 solves this issue, and has been released with maven-shade-plugin 
3.4.0.

Upgrading the shade plugin will solve the problem, unblocking the K8s 
relocation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29779) Allow using MiniCluster with a PluginManager to use metrics reporters

2022-10-27 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-29779:
--

 Summary: Allow using MiniCluster with a PluginManager to use 
metrics reporters
 Key: FLINK-29779
 URL: https://issues.apache.org/jira/browse/FLINK-29779
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.16.0
Reporter: Robert Metzger
Assignee: Robert Metzger
 Fix For: 1.16.1


Currently, using MiniCluster with a metric reporter loaded as a plugin is not 
supported, because the {{ReporterSetup.fromConfiguration(config, null));}} gets 
passed {{null}} for the PluginManager.

I think it generally valuable to allow passing a PluginManager to the 
MiniCluster.

I'll open a PR for this.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29492) Kafka exactly-once sink causes OutOfMemoryError

2022-09-30 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-29492:
--

 Summary: Kafka exactly-once sink causes OutOfMemoryError
 Key: FLINK-29492
 URL: https://issues.apache.org/jira/browse/FLINK-29492
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.15.2
Reporter: Robert Metzger


My Kafka exactly-once sinks are periodically failing with a `OutOfMemoryError: 
Java heap space`.

This looks very similar to FLINK-28250. But I am running 1.15.2, which contains 
a fix for FLINK-28250.

Exception:
{code:java}
java.io.IOException: Could not perform checkpoint 2281 for operator 
http_events[3]: Writer (1/1)#1.
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1210)
at 
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:147)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint(SingleCheckpointBarrierHandler.java:287)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100(SingleCheckpointBarrierHandler.java:64)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint(SingleCheckpointBarrierHandler.java:493)
at 
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint(AbstractAlignedBarrierHandlerState.java:74)
at 
org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived(AbstractAlignedBarrierHandlerState.java:66)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2(SingleCheckpointBarrierHandler.java:234)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState(SingleCheckpointBarrierHandler.java:262)
at 
org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier(SingleCheckpointBarrierHandler.java:231)
at 
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:181)
at 
org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:159)
at 
org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110)
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:519)
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:203)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753)
at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not 
complete snapshot 2281 for operator http_events[3]: Writer (1/1)#1. Failure 
reason: Checkpoint was declined.
at 
org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:269)
at 
org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:173)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:348)
at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.checkpointStreamOperator(RegularOperatorChain.java:227)
at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.buildOperatorSnapshotFutures(RegularOperatorChain.java:212)
at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.snapshotState(RegularOperatorChain.java:192)
at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:647)
at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:320)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12(StreamTask.java:1253)
at 
org.apache.flink.streaming.runtime.t

[jira] [Created] (FLINK-29212) Properly load Hadoop native libraries in Flink docker iamges

2022-09-06 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-29212:
--

 Summary: Properly load Hadoop native libraries in Flink docker 
iamges
 Key: FLINK-29212
 URL: https://issues.apache.org/jira/browse/FLINK-29212
 Project: Flink
  Issue Type: Bug
  Components: flink-docker
Affects Versions: 1.17.0
Reporter: Robert Metzger


On startup, Flink logs:

{code:java}
2022-09-04 12:36:03.559 [main] WARN  org.apache.hadoop.util.NativeCodeLoader  - 
Unable to load native-hadoop library for your platform... using builtin-java 
classes where applicable
{code}

Hadoop native libraries are used for:
- Compression Codecs (bzip2, lz4, zlib)
- Native IO utilities for HDFS Short-Circuit Local Reads and Centralized Cache 
Management in HDFS
- CRC32 checksum implementation
(Source: 
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/NativeLibraries.html)

Resolving this for the docker images we are providing should be easy, remove 
one unnecessary WARNing and provide performance benefits for some users.






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29122) Improve robustness of FileUtils.expandDirectory()

2022-08-26 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-29122:
--

 Summary: Improve robustness of FileUtils.expandDirectory() 
 Key: FLINK-29122
 URL: https://issues.apache.org/jira/browse/FLINK-29122
 Project: Flink
  Issue Type: Bug
  Components: API / Core
Affects Versions: 1.16.0, 1.17.0
Reporter: Robert Metzger


`FileUtils.expandDirectory()` can potentially write to invalid locations if the 
zip file is invalid (contains entry names with ../).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28303) Kafka SQL Connector loses data when restoring from a savepoint with a topic with empty partitions

2022-06-29 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-28303:
--

 Summary: Kafka SQL Connector loses data when restoring from a 
savepoint with a topic with empty partitions
 Key: FLINK-28303
 URL: https://issues.apache.org/jira/browse/FLINK-28303
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.14.4
Reporter: Robert Metzger


Steps to reproduce:
- Set up a Kafka topic with 10 partitions
- produce records 0-9 into the topic
- take a savepoint and stop the job
- produce records 10-19 into the topic
- restore the job from the savepoint.

The job will be missing usually 2-4 records from 10-19.

My assumption is that if a partition never had data (which is likely with 10 
partitions and 10 records), the savepoint will only contain offsets for 
partitions with data. 
While the job was offline (and we've written record 10-19 into the topic), all 
partitions got filled. Now, when Kafka comes online again, it will use the 
"latest" offset for those partitions, skipping some data.




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28265) Inconsistency in Kubernetes HA service: broken state handle

2022-06-27 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-28265:
--

 Summary: Inconsistency in Kubernetes HA service: broken state 
handle
 Key: FLINK-28265
 URL: https://issues.apache.org/jira/browse/FLINK-28265
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.14.4
Reporter: Robert Metzger


I have a JobManager, which at some point failed to acknowledge a checkpoint:

{code}
Error while processing AcknowledgeCheckpoint message
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the 
pending checkpoint 193393. Failure reason: Failure to finalize checkpoint.
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1255)
at 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1100)
at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89)
at 
org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: 
org.apache.flink.runtime.persistence.StateHandleStore$AlreadyExistException: 
checkpointID-0193393 already exists in ConfigMap 
cm--jobmanager-leader
at 
org.apache.flink.kubernetes.highavailability.KubernetesStateHandleStore.getKeyAlreadyExistException(KubernetesStateHandleStore.java:534)
at 
org.apache.flink.kubernetes.highavailability.KubernetesStateHandleStore.lambda$addAndLock$0(KubernetesStateHandleStore.java:155)
at 
org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:316)
at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
... 3 common frames omitted
{code}

the JobManager creates subsequent checkpoints successfully.
Upon failure, it tries to recover this checkpoint (0193393), but 
fails to do so because of:
{code}
Caused by: org.apache.flink.util.FlinkException: Could not retrieve checkpoint 
193393 from state handle under checkpointID-0193393. This indicates 
that the retrieved state handle is broken. Try cleaning the state handle store 
... Caused by: java.io.FileNotFoundException: No such file or directory: 
s3://xxx/flink-ha/xxx/completedCheckpoint72e30229420c
{code}

I'm running Flink 1.14.4.




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-28260) flink-runtime-web fails to execute "npm ci" on Apple Silicon (arm64) without rosetta

2022-06-27 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-28260:
--

 Summary: flink-runtime-web fails to execute "npm ci" on Apple 
Silicon (arm64) without rosetta
 Key: FLINK-28260
 URL: https://issues.apache.org/jira/browse/FLINK-28260
 Project: Flink
  Issue Type: Bug
Reporter: Robert Metzger


Flink 1.16-SNAPSHOT fails to build in the flink-runtime-web project because we 
are using an outdated frontend-maven-plugin (v 1.11.3).
This is the error:
{code}
[ERROR] Failed to execute goal 
com.github.eirslett:frontend-maven-plugin:1.11.3:npm (npm install) on project 
flink-runtime-web: Failed to run task: 'npm ci --cache-max=0 --no-save 
${npm.proxy}' failed. java.io.IOException: Cannot run program 
"/Users/rmetzger/Projects/flink/flink-runtime-web/web-dashboard/node/node" (in 
directory "/Users/rmetzger/Projects/flink/flink-runtime-web/web-dashboard"): 
error=86, Bad CPU type in executable -> [Help 1]
{code}

Using the latest frontend-maven-plugin (v. 1.12.1) properly passes the build, 
because this version downloads the proper arm64 npm version. However, 
frontend-maven-plugin 1.12.1 requires Maven 3.6.0, which is too high for Flink 
(highest mvn version for Flink is 3.2.5).

The best workaround is using rosetta on M1 Macs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-28259) flink-parquet doesn't compile on M1 mac without rosetta

2022-06-27 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-28259:
--

 Summary: flink-parquet doesn't compile on M1 mac without rosetta
 Key: FLINK-28259
 URL: https://issues.apache.org/jira/browse/FLINK-28259
 Project: Flink
  Issue Type: Bug
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Affects Versions: 1.16.0
Reporter: Robert Metzger
Assignee: Robert Metzger


Compiling Flink 1.16-SNAPSHOT fails on an M1 Mac (apple silicon) without the 
rosetta translation layer, because the automatically downloaded 
"protoc-3.17.3-osx-aarch_64.exe" file is actually just a copy of 
"protoc-3.17.3-osx-x86_64.exe". (as you can read here: 
https://github.com/os72/protoc-jar/issues/93)

This is the error:
{code}
[ERROR] Failed to execute goal 
org.xolstice.maven.plugins:protobuf-maven-plugin:0.5.1:test-compile (default) 
on project flink-parquet: An error occurred while invoking protoc. Error while 
executing process. Cannot run program 
"/Users/rmetzger/Projects/flink/flink-formats/flink-parquet/target/protoc-plugins/protoc-3.17.3-osx-aarch_64.exe":
 error=86, Bad CPU type in executable -> [Help 1]
{code}





--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-28232) Allow for custom pre-flight checks for SQL UDFs

2022-06-23 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-28232:
--

 Summary: Allow for custom pre-flight checks for SQL UDFs
 Key: FLINK-28232
 URL: https://issues.apache.org/jira/browse/FLINK-28232
 Project: Flink
  Issue Type: New Feature
  Components: Table SQL / API
Reporter: Robert Metzger


Currently, implementors of SQL UDFs [1] can not validate the UDF input before 
submitting a SQL query to the runtime. 
Take for example a UDF that computes a regex based on user input. Ideally 
there's a callback for the UDF implementor to check if the user-provided regex 
is valid and compiles, to avoid errors during the execution of the SQL query.

It would be ideal to get access to the schema information resolved by the SQL 
planner in that pre-flight validation to also allow for schema related checks 
pre-flight.


[1] 
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-25998) Flink akka runs into NoClassDefFoundError on shutdown

2022-02-07 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-25998:
--

 Summary: Flink akka runs into NoClassDefFoundError on shutdown
 Key: FLINK-25998
 URL: https://issues.apache.org/jira/browse/FLINK-25998
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Robert Metzger


When trying to start a standalone jobmanager on an unavailable port, I see the 
following unexpected exception:

{code}
2022-02-08 08:07:18,299 INFO  akka.remote.Remoting  
   [] - Starting remoting
2022-02-08 08:07:18,357 ERROR akka.remote.transport.netty.NettyTransport
   [] - failed to bind to /0.0.0.0:6123, shutting down Netty transport
2022-02-08 08:07:18,373 INFO  
org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Shutting 
StandaloneApplicationClusterEntryPoint down with application status FAILED. 
Diagnostics java.net.BindException: Could not start actor system on any port in 
port range 6123
at 
org.apache.flink.runtime.rpc.akka.AkkaBootstrapTools.startRemoteActorSystem(AkkaBootstrapTools.java:133)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:358)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:327)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:247)
at 
org.apache.flink.runtime.rpc.RpcUtils.createRemoteRpcService(RpcUtils.java:191)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:334)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:253)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:203)
at 
org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:200)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:684)
at 
org.apache.flink.container.entrypoint.StandaloneApplicationClusterEntryPoint.main(StandaloneApplicationClusterEntryPoint.java:82)
.
2022-02-08 08:07:18,377 INFO  
akka.remote.RemoteActorRefProvider$RemotingTerminator[] - Shutting down 
remote daemon.
2022-02-08 08:07:18,377 ERROR org.apache.flink.util.FatalExitExceptionHandler   
   [] - FATAL: Thread 'flink-akka.remote.default-remote-dispatcher-6' 
produced an uncaught exception. Stopping the process...
java.lang.NoClassDefFoundError: 
akka/actor/dungeon/FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1
at 
akka.actor.dungeon.FaultHandling.handleNonFatalOrInterruptedException(FaultHandling.scala:334)
 ~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT]
at 
akka.actor.dungeon.FaultHandling.handleNonFatalOrInterruptedException$(FaultHandling.scala:334)
 ~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT]
at 
akka.actor.ActorCell.handleNonFatalOrInterruptedException(ActorCell.scala:411) 
~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT]
at akka.actor.ActorCell.invoke(ActorCell.scala:551) 
~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) 
~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT]
at akka.dispatch.Mailbox.run(Mailbox.scala:231) 
~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT]
at akka.dispatch.Mailbox.exec(Mailbox.scala:243) 
[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT]
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) 
[?:1.8.0_312]
at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) 
[?:1.8.0_312]
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) 
[?:1.8.0_312]
at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) 
[?:1.8.0_312]
Caused by: java.lang.ClassNotFoundException: 
akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1
at java.net.URLClassLoader.findClass(URLClassLoader.java:387) 
~[?:1.8.0_312]
at java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_312]
at 
org.apache.flink.core.classloading.ComponentClassLoader.loadClassFromComponentOnly(ComponentClassLoader.java:149)
 ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT]
at 
org.apache.flink.core.classloading.ComponentClassLoader.loadClass(ComponentClassLoader

[jira] [Created] (FLINK-25679) Build arm64 Linux images for Apache Flink

2022-01-17 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-25679:
--

 Summary: Build arm64 Linux images for Apache Flink
 Key: FLINK-25679
 URL: https://issues.apache.org/jira/browse/FLINK-25679
 Project: Flink
  Issue Type: Improvement
  Components: flink-docker
Affects Versions: 1.15.0
Reporter: Robert Metzger


Building Flink images for arm64 Linux should be trivial to support, since 
upstream docker images support arm64, as well as frocksdb.

Building the images locally is also easily possible using Docker's buildx 
features, and the build system of the official docker images most likely 
supports ARM arch.

This improvement would allow us supporting development / testing on Apple 
M1-based systems, as well as ARM architecture at various cloud providers (AWS 
Graviton)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25505) Fix NetworkBufferPoolTest, SystemResourcesCounterTest on Apple M1

2022-01-03 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-25505:
--

 Summary: Fix NetworkBufferPoolTest, SystemResourcesCounterTest on 
Apple M1 
 Key: FLINK-25505
 URL: https://issues.apache.org/jira/browse/FLINK-25505
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics, Runtime / Network
Affects Versions: 1.15.0
Reporter: Robert Metzger


As discussed in https://issues.apache.org/jira/browse/FLINK-23230, some tests 
in flink-runtime are not passing on M1 / Apple Silicon Macs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25327) ApplicationMode "DELETE /cluster" REST call leads to exit code 2, instead of 0

2021-12-15 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-25327:
--

 Summary: ApplicationMode "DELETE /cluster" REST call leads to exit 
code 2, instead of 0
 Key: FLINK-25327
 URL: https://issues.apache.org/jira/browse/FLINK-25327
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Reporter: Robert Metzger


FLINK-24113 introduced a mode to keep the Application Mode JobManager running 
after the Job has been cancelled. Cluster shutdown needs to be initiated for 
example using the DELETE /cluster REST endpoint.

The problem is that there can be a fatal error during the shutdown, making the 
JobManager exit with return code != 0 (making resource managers believe there 
was an error with the Flink application)

Error 
{code}
2021-12-15 08:09:55,708 ERROR 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Fatal error 
occurred in the cluster entrypoint.
org.apache.flink.util.FlinkException: Application failed unexpectedly.
at 
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$finishBootstrapTasks$1(ApplicationDispatcherBootstrap.java:177)
 ~[flink-dist-1.15-master-robert.jar:1.15-master-robert]
at 
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
 ~[?:1.8.0_312]
at 
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
 ~[?:1.8.0_312]
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_312]
at 
java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2278) 
~[?:1.8.0_312]
at 
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.stop(ApplicationDispatcherBootstrap.java:125)
 ~[flink-dist-1.15-master-robert.jar:1.15-master-robert]
at 
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onStop$0(Dispatcher.java:284)
 ~[flink-dist-1.15-master-robert.jar:1.15-master-robert]
at 
org.apache.flink.util.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:696)
 ~[flink-dist-1.15-master-robert.jar:1.15-master-robert]
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
 ~[?:1.8.0_312]
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 ~[?:1.8.0_312]
at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
 ~[?:1.8.0_312]
at 
org.apache.flink.util.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
 ~[flink-dist-1.15-master-robert.jar:1.15-master-robert]
at 
java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543)
 ~[?:1.8.0_312]
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765)
 ~[?:1.8.0_312]
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 ~[?:1.8.0_312]
at 
java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795)
 ~[?:1.8.0_312]
at 
java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163)
 ~[?:1.8.0_312]
at 
org.apache.flink.util.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:693)
 ~[flink-dist-1.15-master-robert.jar:1.15-master-robert]
at 
org.apache.flink.util.concurrent.FutureUtils.runAfterwards(FutureUtils.java:660)
 ~[flink-dist-1.15-master-robert.jar:1.15-master-robert]
at 
org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:281) 
~[flink-dist-1.15-master-robert.jar:1.15-master-robert]
at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
 ~[flink-dist-1.15-master-robert.jar:1.15-master-robert]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.lambda$terminate$0(AkkaRpcActor.java:580)
 ~[flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert]
at 
org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
 ~[flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:579)
 ~[flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert]
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:191)
 ~[flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) 
[flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) 
[flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert]
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) 
[flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-

[jira] [Created] (FLINK-25316) BlobServer can get stuck during shutdown

2021-12-14 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-25316:
--

 Summary: BlobServer can get stuck during shutdown
 Key: FLINK-25316
 URL: https://issues.apache.org/jira/browse/FLINK-25316
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Robert Metzger
 Fix For: 1.15.0


The cluster shutdown can get stuck
{code}
"AkkaRpcService-Supervisor-Termination-Future-Executor-thread-1" #89 daemon 
prio=5 os_prio=0 tid=0x004017d7 nid=0x2ec in Object.wait() 
[0x00402a9b5000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0xd6c48368> (a 
org.apache.flink.runtime.blob.BlobServer)
at java.lang.Thread.join(Thread.java:1252)
- locked <0xd6c48368> (a 
org.apache.flink.runtime.blob.BlobServer)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.flink.runtime.blob.BlobServer.close(BlobServer.java:319)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.stopClusterServices(ClusterEntrypoint.java:406)
- locked <0xd5d27350> (a java.lang.Object)
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$shutDownAsync$4(ClusterEntrypoint.java:505
{code}

because the BlobServer.run() method ignores interrupts:
{code}
"BLOB Server listener at 6124" #30 daemon prio=5 os_prio=0 
tid=0x00401c929800 nid=0x2b4 runnable [0x0040263f9000]
   java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at 
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:560)
at java.net.ServerSocket.accept(ServerSocket.java:528)
at 
org.apache.flink.util.NetUtils.acceptWithoutTimeout(NetUtils.java:143)
at org.apache.flink.runtime.blob.BlobServer.run(BlobServer.java:268)
{code}

This issue was introduced in FLINK-24156.




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-24395) Checkpoint trigger time difference between log statement and web frontend

2021-09-28 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24395:
--

 Summary: Checkpoint trigger time difference between log statement 
and web frontend
 Key: FLINK-24395
 URL: https://issues.apache.org/jira/browse/FLINK-24395
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Affects Versions: 1.14.0
Reporter: Robert Metzger
 Attachments: image-2021-09-28-12-20-34-332.png

Consider this checkpoint (68)

{code}
2021-09-28 10:14:43,644 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering 
checkpoint 68 (type=CHECKPOINT) @ 1632823660151 for job 
.
2021-09-28 10:16:41,428 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed 
checkpoint 68 for job  (128940015376 bytes, 
checkpointDuration=540908 ms, finalizationTime=369 ms).
{code}

And what is shown in the UI about it:

 !image-2021-09-28-12-20-34-332.png! 

The trigger time is off by ~7 minutes. It seems that the trigger message is 
logged too late.
(note that this has happened in a system where savepoint disposal is very slow)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24392) Upgrade presto s3 fs implementation to Trinio >= 348

2021-09-28 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24392:
--

 Summary: Upgrade presto s3 fs implementation to Trinio >= 348
 Key: FLINK-24392
 URL: https://issues.apache.org/jira/browse/FLINK-24392
 Project: Flink
  Issue Type: Improvement
  Components: FileSystems
Affects Versions: 1.14.0
Reporter: Robert Metzger
 Fix For: 1.15.0


The Presto s3 filesystem implementation currently shipped with Flink doesn't 
support streaming uploads. All data needs to be materialized to a single file 
on disk, before it can be uploaded.
This can lead to situations where TaskManagers are running out of disk when 
creating a savepoint.

The Hadoop filesystem implementation supports streaming uploads (by using 
multipart uploads of smaller (say 100mb) files locally), but it does more API 
calls, leading to other issues.

Trinion 348 supports streaming uploads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24320) Show in the Job / Checkpoints / Configuration if checkpoints are incremental

2021-09-17 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24320:
--

 Summary: Show in the Job / Checkpoints / Configuration if 
checkpoints are incremental
 Key: FLINK-24320
 URL: https://issues.apache.org/jira/browse/FLINK-24320
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing, Runtime / Web Frontend
Affects Versions: 1.13.2
Reporter: Robert Metzger
 Attachments: image-2021-09-17-13-31-02-148.png, 
image-2021-09-17-13-31-32-311.png

It would be nice if the overview would also show if incremental checkpoints are 
enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24208) Allow idempotent savepoint triggering

2021-09-08 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24208:
--

 Summary: Allow idempotent savepoint triggering
 Key: FLINK-24208
 URL: https://issues.apache.org/jira/browse/FLINK-24208
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Reporter: Robert Metzger


As a user of Flink, I want to be able to trigger a savepoint from an external 
system in a way that I can detect if I have requested this savepoint already.

By passing a custom ID to the savepoint request, I can check (in case of an 
error of the original request, or the external system) if the request has been 
made already.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24114) Make CompletedOperationCache.COMPLETED_OPERATION_RESULT_CACHE_DURATION_SECONDS configurable (at least for savepoint trigger operations)

2021-09-01 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24114:
--

 Summary: Make 
CompletedOperationCache.COMPLETED_OPERATION_RESULT_CACHE_DURATION_SECONDS 
configurable (at least for savepoint trigger operations)
 Key: FLINK-24114
 URL: https://issues.apache.org/jira/browse/FLINK-24114
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Robert Metzger


Currently, it can happen that external services triggering savepoints can not 
persist the savepoint location from the savepoint handler, because the 
operation cache has a hardcoded value of 5 minutes, until entries (which have 
been accessed at least once) are evicted.
To avoid scenarios where the savepoint location has been accessed, but the 
external system failed to persist the location, I propose to make this eviction 
timeout configurable (so that I as a user can configure a value of 24 hours for 
the cache eviction).

(This is related to FLINK-24113)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24113) Introduce option in Application Mode to request cluster shutdown

2021-09-01 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24113:
--

 Summary: Introduce option in Application Mode to request cluster 
shutdown
 Key: FLINK-24113
 URL: https://issues.apache.org/jira/browse/FLINK-24113
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Robert Metzger


Currently a Flink JobManager started in Application Mode will shut down once 
the job has completed.

When doing a "stop with savepoint" operation, we want to keep the JobManager 
alive after the job has stopped to retrieve and persist the final savepoint 
location.
Currently, Flink waits up to 5 minutes and then shuts down.

I'm proposing to introduce a new configuration flag "application mode shutdown 
behavior": "keepalive" (naming things is hard ;) ) which will keep the 
JobManager in ApplicationMode running until a REST call confirms that it can 
shutdown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-24037) Allow wildcards in ENABLE_BUILT_IN_PLUGINS

2021-08-28 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-24037:
--

 Summary: Allow wildcards in ENABLE_BUILT_IN_PLUGINS
 Key: FLINK-24037
 URL: https://issues.apache.org/jira/browse/FLINK-24037
 Project: Flink
  Issue Type: Improvement
  Components: flink-docker
Reporter: Robert Metzger


As a user of Flink, I would like to be able to specify a certain default 
plugin, (such as the S3 presto FS) without having to specific the Flink version 
again.
The Flink version is already specified by the Docker container I'm using.

If one is using generic deployment scripts, I don't want to put the Flink 
version in two locations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23925) HistoryServer: Archiving job with more than one attempt fails

2021-08-23 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23925:
--

 Summary: HistoryServer: Archiving job with more than one attempt 
fails
 Key: FLINK-23925
 URL: https://issues.apache.org/jira/browse/FLINK-23925
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.2
Reporter: Robert Metzger


Error:
{code}
2021-08-23 16:26:01,953 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Disconnect job manager 
0...@akka.tcp://flink@localhost:6123/user/rpc/jobmanager_2
 for job ca9f6a073d311d60f457a1c4243e7dc3 from the resource manager.
2021-08-23 16:26:02,137 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Could not 
archive completed job 
CarTopSpeedWindowingExample(ca9f6a073d311d60f457a1c4243e7dc3) to the history 
server.
java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: 
attempt does not exist
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
 ~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
 [?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1643)
 [?:1.8.0_252]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_252]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: java.lang.IllegalArgumentException: attempt does not exist
at 
org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:109)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:31)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.rest.handler.job.SubtaskExecutionAttemptDetailsHandler.archiveJsonWithPath(SubtaskExecutionAttemptDetailsHandler.java:140)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.webmonitor.history.OnlyExecutionGraphJsonArchivist.archiveJsonWithPath(OnlyExecutionGraphJsonArchivist.java:51)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.archiveJsonWithPath(WebMonitorEndpoint.java:1031)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.runtime.dispatcher.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:61)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
 ~[?:1.8.0_252]
... 3 more
{code}

Steps to reproduce:
- start a Flink reactive mode job manager:
mkdir usrlib
cp ./examples/streaming/TopSpeedWindowing.jar usrlib/
# Submit Job in Reactive Mode
./bin/standalone-job.sh start -Dscheduler-mode=reactive 
-Dexecution.checkpointing.interval="10s" -j 
org.apache.flink.streaming.examples.windowing.TopSpeedWindowing
# Start first TaskManager
./bin/taskmanager.sh start

- Add another taskmanager to trigger a restart
- Cancel the job

See the failure in the jobmanager logs.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23913) UnalignedCheckpointITCase fails with exit code 137 (kernel oom) on Azure VMs

2021-08-23 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23913:
--

 Summary: UnalignedCheckpointITCase fails with exit code 137 
(kernel oom) on Azure VMs
 Key: FLINK-23913
 URL: https://issues.apache.org/jira/browse/FLINK-23913
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.14.0
 Environment: UnalignedCheckpointITCase
Reporter: Robert Metzger
 Fix For: 1.14.0


Cases reported in FLINK-23525:
- 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22618&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=10338
- 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22618&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=4743
- 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22605&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=4743
- ... there are a lot more cases.

The problem seems to have started occurring around August 20.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23589) Support Avro Microsecond precision

2021-08-02 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23589:
--

 Summary: Support Avro Microsecond precision
 Key: FLINK-23589
 URL: https://issues.apache.org/jira/browse/FLINK-23589
 Project: Flink
  Issue Type: Improvement
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Reporter: Robert Metzger
 Fix For: 1.14.0


This was raised by a user: 
https://lists.apache.org/thread.html/r463f748358202d207e4bf9c7fdcb77e609f35bbd670dbc5278dd7615%40%3Cuser.flink.apache.org%3E

Here's the Avro spec: 
https://avro.apache.org/docs/1.8.0/spec.html#Timestamp+%28microsecond+precision%29



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23562) Update CI docker image to latest java version (1.8.0_292)

2021-07-30 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23562:
--

 Summary: Update CI docker image to latest java version (1.8.0_292)
 Key: FLINK-23562
 URL: https://issues.apache.org/jira/browse/FLINK-23562
 Project: Flink
  Issue Type: Technical Debt
  Components: Build System / Azure Pipelines
Reporter: Robert Metzger
 Fix For: 1.14.0


The java version we are using on our CI is outdated (1.8.0_282 vs 1.8.0_292). 
The latest java version has TLSv1 disabled, which makes the 
KubernetesClusterDescriptorTest fail.

This will be fixed by FLINK-22802.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23546) stop-cluster.sh produces warning on macOS 11.4

2021-07-29 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23546:
--

 Summary: stop-cluster.sh produces warning on macOS 11.4
 Key: FLINK-23546
 URL: https://issues.apache.org/jira/browse/FLINK-23546
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Scripts
Affects Versions: 1.14.0
Reporter: Robert Metzger


Since FLINK-17470, we are stopping daemons with a timeout, to SIGKILL them if 
they are not gracefully stopping.

I noticed that this mechanism causes warnings on macOS:

{code}
❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host MacBook-Pro-2.localdomain.
Starting taskexecutor daemon on host MacBook-Pro-2.localdomain.
❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/stop-cluster.sh
Stopping taskexecutor daemon (pid: 50044) on host MacBook-Pro-2.localdomain.
tail: illegal option -- -
usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]
Stopping standalonesession daemon (pid: 49812) on host 
MacBook-Pro-2.localdomain.
tail: illegal option -- -
usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...]
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23191) Azure: Upload CI logs to S3 as well

2021-06-30 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23191:
--

 Summary: Azure: Upload CI logs to S3 as well
 Key: FLINK-23191
 URL: https://issues.apache.org/jira/browse/FLINK-23191
 Project: Flink
  Issue Type: Improvement
  Components: Build System / Azure Pipelines
Reporter: Robert Metzger


We are currently uploading the CI logs to Azure as artifacts. The maximum 
retention (we've also configured) is 60 days. Afterwards, the logs are gone.

For rarely occurring test failures, the logs might be lost at the time we start 
looking into them.
Therefore, we should store the CI logs somewhere permanently, such as S3 
(similarly to how we stored them when we were using travis)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23129) When cancelling any running job of multiple jobs in an application cluster, JobManager shuts down

2021-06-23 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23129:
--

 Summary: When cancelling any running job of multiple jobs in an 
application cluster, JobManager shuts down
 Key: FLINK-23129
 URL: https://issues.apache.org/jira/browse/FLINK-23129
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.14.0
Reporter: Robert Metzger


I have a jar with two jobs, both executeAsync() from the same main method. I 
execute the main method in an Application Mode cluster. When I cancel one of 
the two jobs, both jobs will stop executing.

I would expect that the JobManager shuts down once all jobs submitted from an 
application are finished.

If this is a known limitation, we should document it.

{code}
2021-06-23 21:29:53,123 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Job first job 
(18181be02da272387354d093519b2359) switched from state RUNNING to CANCELLING.
2021-06-23 21:29:53,124 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
Custom Source -> Sink: Unnamed (1/1) (5a69b1c19f8da23975f6961898ab50a2) 
switched from RUNNING to CANCELING.
2021-06-23 21:29:53,141 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
Custom Source -> Sink: Unnamed (1/1) (5a69b1c19f8da23975f6961898ab50a2) 
switched from CANCELING to CANCELED.
2021-06-23 21:29:53,144 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Clearing resource requirements of job 18181be02da272387354d093519b2359
2021-06-23 21:29:53,145 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Job first job 
(18181be02da272387354d093519b2359) switched from state CANCELLING to CANCELED.
2021-06-23 21:29:53,145 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Stopping 
checkpoint coordinator for job 18181be02da272387354d093519b2359.
2021-06-23 21:29:53,147 INFO  
org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - 
Shutting down
2021-06-23 21:29:53,150 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 
18181be02da272387354d093519b2359 reached terminal state CANCELED.
2021-06-23 21:29:53,152 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
   [] - Stopping the JobMaster for job first 
job(18181be02da272387354d093519b2359).
2021-06-23 21:29:53,155 INFO  
org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - 
Releasing slot [c35b64879d6b02d383c825ea735ebba0].
2021-06-23 21:29:53,159 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Clearing resource requirements of job 18181be02da272387354d093519b2359
2021-06-23 21:29:53,159 INFO  org.apache.flink.runtime.jobmaster.JobMaster  
   [] - Close ResourceManager connection 
281b3fcf7ad0a6f7763fa90b8a5b9adb: Stopping JobMaster for job first 
job(18181be02da272387354d093519b2359)..
2021-06-23 21:29:53,160 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - 
Disconnect job manager 
0...@akka.tcp://flink@localhost:6123/user/rpc/jobmanager_2
 for job 18181be02da272387354d093519b2359 from the resource manager.
2021-06-23 21:29:53,225 INFO  
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap 
[] - Application CANCELED:
java.util.concurrent.CompletionException: 
org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: 
Application Status: CANCELED
at 
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$4(ApplicationDispatcherBootstrap.java:304)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) 
~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 ~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) 
~[?:1.8.0_252]
at 
org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$null$2(JobStatusPollingUtils.java:101)
 ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT]
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
 ~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 ~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) 
~[?:1.8.0_252]
at 
org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHan

[jira] [Created] (FLINK-23000) Allow downloading all Flink logs from the UI

2021-06-15 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-23000:
--

 Summary: Allow downloading all Flink logs from the UI
 Key: FLINK-23000
 URL: https://issues.apache.org/jira/browse/FLINK-23000
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Web Frontend
Reporter: Robert Metzger
Assignee: Robert Metzger
 Fix For: 1.14.0


Users often struggle to provide all relevant logs. By having a button in the 
Web UI that collects the logs of all TaskManagers and the JobManager will make 
things easier for the community supporting users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22923) Queryable state (rocksdb) with TM restart end-to-end test unstable

2021-06-08 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22923:
--

 Summary: Queryable state (rocksdb) with TM restart end-to-end test 
unstable
 Key: FLINK-22923
 URL: https://issues.apache.org/jira/browse/FLINK-22923
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Queryable State
Affects Versions: 1.14.0
Reporter: Robert Metzger
 Fix For: 1.14.0


https://dev.azure.com/rmetzger/Flink/_build/results?buildId=9119&view=logs&j=9401bf33-03c4-5a24-83fe-e51d75db73ef&t=72901ab2-7cd0-57be-82b1-bca51de20fba

{code}
Jun 04 19:39:12 16/17 completed checkpoints
Jun 04 19:39:14 16/17 completed checkpoints
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
Jun 04 19:39:17 after: 20
Jun 04 19:39:17 An error occurred
Jun 04 19:39:17 [FAIL] Test script contains errors.
Jun 04 19:39:17 Checking of logs skipped.
Jun 04 19:39:17 
Jun 04 19:39:17 [FAIL] 'Queryable state (rocksdb) with TM restart end-to-end 
test' failed after 0 minutes and 48 seconds! Test exited with exit code 1
Jun 04 19:39:17 

{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22856) Move our Azure pipelines away from Ubuntu 16.04 by September

2021-06-02 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22856:
--

 Summary: Move our Azure pipelines away from Ubuntu 16.04 by 
September
 Key: FLINK-22856
 URL: https://issues.apache.org/jira/browse/FLINK-22856
 Project: Flink
  Issue Type: Bug
  Components: Build System / Azure Pipelines
Reporter: Robert Metzger
 Fix For: 1.14.0


Azure won't support Ubuntu 16.04 starting from October, hence we need to 
migrate to a newer ubuntu version.

We should do this at a time when the builds are relatively stable to be able to 
clearly identify issues relating to the version upgrade. Also, we shouldn't do 
this before a feature freeze ;) 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22765) ExceptionUtilsITCase.testIsMetaspaceOutOfMemoryError is unstable

2021-05-25 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22765:
--

 Summary: ExceptionUtilsITCase.testIsMetaspaceOutOfMemoryError is 
unstable
 Key: FLINK-22765
 URL: https://issues.apache.org/jira/browse/FLINK-22765
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.14.0
Reporter: Robert Metzger
 Fix For: 1.14.0


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=18292&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=a99e99c7-21cd-5a1f-7274-585e62b72f56

{code}
May 25 00:56:38 java.lang.AssertionError: 
May 25 00:56:38 
May 25 00:56:38 Expected: is ""
May 25 00:56:38  but: was "The system is out of resources.\nConsult the 
following stack trace for details."
May 25 00:56:38 at 
org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
May 25 00:56:38 at org.junit.Assert.assertThat(Assert.java:956)
May 25 00:56:38 at org.junit.Assert.assertThat(Assert.java:923)
May 25 00:56:38 at 
org.apache.flink.runtime.util.ExceptionUtilsITCase.run(ExceptionUtilsITCase.java:94)
May 25 00:56:38 at 
org.apache.flink.runtime.util.ExceptionUtilsITCase.testIsMetaspaceOutOfMemoryError(ExceptionUtilsITCase.java:70)
May 25 00:56:38 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
May 25 00:56:38 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
May 25 00:56:38 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
May 25 00:56:38 at java.lang.reflect.Method.invoke(Method.java:498)
May 25 00:56:38 at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
May 25 00:56:38 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
May 25 00:56:38 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
May 25 00:56:38 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
May 25 00:56:38 at 
org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
May 25 00:56:38 at 
org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
May 25 00:56:38 at org.junit.rules.RunRules.evaluate(RunRules.java:20)
May 25 00:56:38 at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
May 25 00:56:38 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
May 25 00:56:38 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
May 25 00:56:38 at 
org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
May 25 00:56:38 at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
May 25 00:56:38 at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
May 25 00:56:38 at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
May 25 00:56:38 at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
May 25 00:56:38 at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
May 25 00:56:38 at org.junit.rules.RunRules.evaluate(RunRules.java:20)
May 25 00:56:38 at 
org.junit.runners.ParentRunner.run(ParentRunner.java:363)
May 25 00:56:38 at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
May 25 00:56:38 at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
May 25 00:56:38 at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
May 25 00:56:38 at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
May 25 00:56:38 at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
May 25 00:56:38 at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
May 25 00:56:38 at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
May 25 00:56:38 at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
May 25 00:56:38 

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22692) CheckpointStoreITCase.testRestartOnRecoveryFailure fails with RuntimeException

2021-05-18 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22692:
--

 Summary: CheckpointStoreITCase.testRestartOnRecoveryFailure fails 
with RuntimeException
 Key: FLINK-22692
 URL: https://issues.apache.org/jira/browse/FLINK-22692
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Reporter: Robert Metzger


Not sure if it is related to the adaptive scheduler: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=18052&view=logs&j=8fd9202e-fd17-5b26-353c-ac1ff76c8f28&t=a0a633b8-47ef-5c5a-2806-3c13b9e48228

{code}
May 17 22:29:11 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time 
elapsed: 1.351 s <<< FAILURE! - in 
org.apache.flink.test.checkpointing.CheckpointStoreITCase
May 17 22:29:11 [ERROR] 
testRestartOnRecoveryFailure(org.apache.flink.test.checkpointing.CheckpointStoreITCase)
  Time elapsed: 1.138 s  <<< ERROR!
May 17 22:29:11 org.apache.flink.runtime.client.JobExecutionException: Job 
execution failed.
May 17 22:29:11 at 
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
May 17 22:29:11 at 
org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137)
May 17 22:29:11 at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
May 17 22:29:11 at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
May 17 22:29:11 at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
May 17 22:29:11 at 
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
May 17 22:29:11 at 
org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237)
May 17 22:29:11 at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
May 17 22:29:11 at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
May 17 22:29:11 at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
May 17 22:29:11 at 
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
May 17 22:29:11 at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1081)
May 17 22:29:11 at akka.dispatch.OnComplete.internal(Future.scala:264)
May 17 22:29:11 at akka.dispatch.OnComplete.internal(Future.scala:261)
May 17 22:29:11 at 
akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
May 17 22:29:11 at 
akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
May 17 22:29:11 at 
scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
May 17 22:29:11 at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
May 17 22:29:11 at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
May 17 22:29:11 at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
May 17 22:29:11 at 
akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
May 17 22:29:11 at 
akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22593) SavepointITCase.testShouldAddEntropyToSavepointPath unstable

2021-05-07 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22593:
--

 Summary: SavepointITCase.testShouldAddEntropyToSavepointPath 
unstable
 Key: FLINK-22593
 URL: https://issues.apache.org/jira/browse/FLINK-22593
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.14.0
Reporter: Robert Metzger


https://dev.azure.com/rmetzger/Flink/_build/results?buildId=9072&view=logs&j=cc649950-03e9-5fae-8326-2f1ad744b536&t=51cab6ca-669f-5dc0-221d-1e4f7dc4fc85

{code}
2021-05-07T10:56:20.9429367Z May 07 10:56:20 [ERROR] Tests run: 13, Failures: 
0, Errors: 1, Skipped: 0, Time elapsed: 33.441 s <<< FAILURE! - in 
org.apache.flink.test.checkpointing.SavepointITCase
2021-05-07T10:56:20.9445862Z May 07 10:56:20 [ERROR] 
testShouldAddEntropyToSavepointPath(org.apache.flink.test.checkpointing.SavepointITCase)
  Time elapsed: 2.083 s  <<< ERROR!
2021-05-07T10:56:20.9447106Z May 07 10:56:20 
java.util.concurrent.ExecutionException: 
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint triggering 
task Sink: Unnamed (3/4) of job 4e155a20f0a7895043661a6446caf1cb has not being 
executed at the moment. Aborting checkpoint. Failure reason: Not all required 
tasks are currently running.
2021-05-07T10:56:20.9448194Z May 07 10:56:20at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
2021-05-07T10:56:20.9448797Z May 07 10:56:20at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
2021-05-07T10:56:20.9449428Z May 07 10:56:20at 
org.apache.flink.test.checkpointing.SavepointITCase.submitJobAndTakeSavepoint(SavepointITCase.java:305)
2021-05-07T10:56:20.9450160Z May 07 10:56:20at 
org.apache.flink.test.checkpointing.SavepointITCase.testShouldAddEntropyToSavepointPath(SavepointITCase.java:273)
2021-05-07T10:56:20.9450785Z May 07 10:56:20at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2021-05-07T10:56:20.9451331Z May 07 10:56:20at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2021-05-07T10:56:20.9451940Z May 07 10:56:20at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2021-05-07T10:56:20.9452498Z May 07 10:56:20at 
java.lang.reflect.Method.invoke(Method.java:498)
2021-05-07T10:56:20.9453247Z May 07 10:56:20at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
2021-05-07T10:56:20.9454007Z May 07 10:56:20at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
2021-05-07T10:56:20.9454687Z May 07 10:56:20at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
2021-05-07T10:56:20.9455302Z May 07 10:56:20at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
2021-05-07T10:56:20.9455909Z May 07 10:56:20at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
2021-05-07T10:56:20.9456493Z May 07 10:56:20at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
2021-05-07T10:56:20.9457074Z May 07 10:56:20at 
org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
2021-05-07T10:56:20.9457636Z May 07 10:56:20at 
org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
2021-05-07T10:56:20.9458157Z May 07 10:56:20at 
org.junit.rules.RunRules.evaluate(RunRules.java:20)
2021-05-07T10:56:20.9458678Z May 07 10:56:20at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
2021-05-07T10:56:20.9459252Z May 07 10:56:20at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
2021-05-07T10:56:20.9459865Z May 07 10:56:20at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
2021-05-07T10:56:20.9460433Z May 07 10:56:20at 
org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
2021-05-07T10:56:20.9461058Z May 07 10:56:20at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
2021-05-07T10:56:20.9461607Z May 07 10:56:20at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
2021-05-07T10:56:20.9462159Z May 07 10:56:20at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
2021-05-07T10:56:20.9462705Z May 07 10:56:20at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
2021-05-07T10:56:20.9463243Z May 07 10:56:20at 
org.junit.runners.ParentRunner.run(ParentRunner.java:363)
2021-05-07T10:56:20.9463812Z May 07 10:56:20at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
2021-05-07T10:56:20.9464436Z May 07 10:56:20at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
2021-05-07T10:56:20.9465073Z May 07 10:56:20at 
org.apache.maven.surefire.junit4.JU

[jira] [Created] (FLINK-22574) Adaptive Scheduler: Can not cancel restarting job

2021-05-05 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22574:
--

 Summary: Adaptive Scheduler: Can not cancel restarting job
 Key: FLINK-22574
 URL: https://issues.apache.org/jira/browse/FLINK-22574
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.13.1


I have a job in state RESTARTING. When I now issue a cancel RPC call, I get the 
following exception:

{code}
org.apache.flink.runtime.rest.handler.RestHandlerException: Job cancellation 
failed: Cancellation failed. 
 at 
org.apache.flink.runtime.rest.handler.job.JobCancellationHandler.lambda$handleRequest$0(JobCancellationHandler.java:127)
 
 at 
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) 
 at 
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
 
 at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
 at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 
 at 
org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:234)
 
 at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
 
 at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 
 at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
 at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 
 at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1079)
 
 at akka.dispatch.OnComplete.internal(Future.scala:263) 
 at akka.dispatch.OnComplete.internal(Future.scala:261) 
 at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) 
 at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) 
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) 
 at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
 
 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) 
 at 
scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
 
 at 
scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
 
 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) 
 at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:573) 
 at 
akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:23)
 
 at 
akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
 
 at scala.concurrent.Future.$anonfun$andThen$1(Future.scala:532) 
 at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29) 
 at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29) 
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) 
 at 
akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
 
 at 
akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)
 
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) 
 at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) 
 at 
akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91) 
 at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) 
 at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
 
 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
 at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
 at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
Caused by: org.apache.flink.util.FlinkException: Cancellation failed. 
 at 
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$cancel$3(JobMasterServiceLeadershipRunner.java:197)
 
 at 
java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
 
 at 
java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
 
 at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
 at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 
 at 
org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:234)
 
 at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
 
 at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 
 at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
 at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFutur

[jira] [Created] (FLINK-22556) Extend license checker to scan for traces of (L)GPL licensed code

2021-05-03 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22556:
--

 Summary: Extend license checker to scan for traces of (L)GPL 
licensed code
 Key: FLINK-22556
 URL: https://issues.apache.org/jira/browse/FLINK-22556
 Project: Flink
  Issue Type: Improvement
  Components: Build System / CI
Reporter: Robert Metzger
 Fix For: 1.14.0


This is a follow up to FLINK-22555.
The goal is to catch this and similar cases automatically.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22509) ./bin/flink run -m yarn-cluster -d submission leads to IllegalStateException

2021-04-28 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22509:
--

 Summary: ./bin/flink run -m yarn-cluster -d submission leads to 
IllegalStateException
 Key: FLINK-22509
 URL: https://issues.apache.org/jira/browse/FLINK-22509
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN
Affects Versions: 1.13.0, 1.14.0
Reporter: Robert Metzger


Submitting a detached, per-job YARN cluster in Flink (like this: {{./bin/flink 
run -m yarn-cluster -d  ./examples/streaming/TopSpeedWindowing.jar}}), leads to 
the following exception:

{code}
2021-04-28 11:39:00,786 INFO  org.apache.flink.yarn.YarnClusterDescriptor   
   [] - Found Web Interface 
ip-172-31-27-232.eu-central-1.compute.internal:45689 of application 
'application_1619607372651_0005'.
Job has been submitted with JobID 5543e81db9c2de78b646088891f23bfc
Exception in thread "Thread-4" java.lang.IllegalStateException: Trying to 
access closed classloader. Please check if you store classloaders directly or 
indirectly in static fields. If the stacktrace suggests that the leak occurs in 
a third party library and cannot be fixed immediately, you can disable this 
check with the configuration 'classloader.check-leaked-classloader'.
at 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.ensureInner(FlinkUserCodeClassLoaders.java:164)
at 
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.getResource(FlinkUserCodeClassLoaders.java:183)
at 
org.apache.hadoop.conf.Configuration.getResource(Configuration.java:2570)
at 
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2783)
at 
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2758)
at 
org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2638)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:1100)
at 
org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1707)
at 
org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1688)
at 
org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183)
at 
org.apache.hadoop.util.ShutdownHookManager.shutdownExecutor(ShutdownHookManager.java:145)
at 
org.apache.hadoop.util.ShutdownHookManager.access$300(ShutdownHookManager.java:65)
at 
org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:102)
{code}

The job is still running as expected.
Detached submission with {{./bin/flink run-application -t yarn-application -d}} 
works as expected. This is also the documented approach.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22483) Recover checkpoints when JobMaster gains leadership

2021-04-26 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22483:
--

 Summary: Recover checkpoints when JobMaster gains leadership
 Key: FLINK-22483
 URL: https://issues.apache.org/jira/browse/FLINK-22483
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.14.0


Recovering checkpoints (from the CompletedCheckpointStore) is a potentially 
blocking operation, for example if the file system implementation is retrying 
to connect to a unavailable storage backend.

Currently, we are calling the CompletedCheckpointStore.recover() method from 
the main thread of the JobManager, making it unresponsive to any RPC call while 
the recover method is blocked.

By moving the recovery to the start of the JobManager (which happens 
asynchronously after the JobMaster has gained leadership), Flink will remain 
responsive (reporting a job in INITIALIZING state).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22413) Hide Checkpointing page in the UI for batch jobs

2021-04-22 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22413:
--

 Summary: Hide Checkpointing page in the UI for batch jobs
 Key: FLINK-22413
 URL: https://issues.apache.org/jira/browse/FLINK-22413
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Web Frontend
Reporter: Robert Metzger


This is a follow up to 
https://github.com/apache/flink/commit/8dca9fa852c72984ac873eae9a96bbd739e502f3#commitcomment-49744255

Batch jobs don't need a Checkpointing page, because there is no information for 
these jobs available there.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22331) CLI Frontend (RestClient) doesn't work on Apple M1

2021-04-17 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22331:
--

 Summary: CLI Frontend (RestClient) doesn't work on Apple M1
 Key: FLINK-22331
 URL: https://issues.apache.org/jira/browse/FLINK-22331
 Project: Flink
  Issue Type: Bug
  Components: Client / Job Submission
Affects Versions: 1.12.2, 1.13.0
Reporter: Robert Metzger
 Attachments: flink-muthmann-client-KlemensMac.local (1).log_native, 
flink-muthmann-client-KlemensMac.local.log_rosetta

This issue was first reported by a user: 
https://lists.apache.org/thread.html/r50bda40a69688de52c9d6e3489ac2641491387c20fdc1cecedceee76%40%3Cuser.flink.apache.org%3E

See attached logs.

Exception without rosetta:
{code}
org.apache.flink.client.program.ProgramInvocationException: The main method 
caused an error: Failed to execute job 'Streaming WordCount'.
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:366)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:219)
at 
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)
at 
org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:812)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:246)
at 
org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1054)
at 
org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
at 
org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
Caused by: org.apache.flink.util.FlinkException: Failed to execute job 
'Streaming WordCount'.
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1918)
at 
org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:135)
at 
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1782)
at 
org.apache.flink.streaming.examples.wordcount.WordCount.main(WordCount.java:97)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:349)
... 8 more
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to 
submit JobGraph.
at 
org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:400)
at 
java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
at 
java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:390)
at 
java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
at 
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
at 
org.apache.flink.runtime.rest.RestClient$ClientHandler.exceptionCaught(RestClient.java:613)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273)
at 
org.apache.flink.shaded.netty4.io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireExceptionCaught(CombinedChannelDuplexHandler.java:424)
at 
org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:

[jira] [Created] (FLINK-22258) Adaptive Scheduler: Show history of rescales in the Web UI

2021-04-13 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22258:
--

 Summary: Adaptive Scheduler: Show history of rescales in the Web UI
 Key: FLINK-22258
 URL: https://issues.apache.org/jira/browse/FLINK-22258
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination, Runtime / Web Frontend
Reporter: Robert Metzger


As a user, I would like to see the history of rescale events in the web UI 
(number of slots used, task parallelisms)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22252) Backquotes are not rendered correctly in config option descriptions

2021-04-12 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22252:
--

 Summary: Backquotes are not rendered correctly in config option 
descriptions
 Key: FLINK-22252
 URL: https://issues.apache.org/jira/browse/FLINK-22252
 Project: Flink
  Issue Type: Task
  Components: Documentation, Runtime / Configuration
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.13.0


This is how the config options are rendered in the 1.13 docs:
{code}
`none`, `off`, `disable`: No restart strategy.
`fixeddelay`, `fixed-delay`: Fixed delay restart strategy. More details can be 
found here.
`failurerate`, `failure-rate`: Failure rate restart strategy. More details can 
be found here.
{code}.
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#state-backend

Here's the rendering in the old docs.
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#restart-strategy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22244) Clarify Reactive Mode documentation wrt to timeouts

2021-04-12 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22244:
--

 Summary: Clarify Reactive Mode documentation wrt to timeouts
 Key: FLINK-22244
 URL: https://issues.apache.org/jira/browse/FLINK-22244
 Project: Flink
  Issue Type: Task
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger
Assignee: Robert Metzger
 Fix For: 1.13.0


In the release testing of Reactive Mode (FLINK-22134) we found that the 
documentation of the timeouts needs some clarification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22243) Reactive Mode parallelism changes are not shown in the job graph visualization in the UI

2021-04-12 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22243:
--

 Summary: Reactive Mode parallelism changes are not shown in the 
job graph visualization in the UI
 Key: FLINK-22243
 URL: https://issues.apache.org/jira/browse/FLINK-22243
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Web Frontend
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.14.0


As reported here FLINK-22134, the parallelism in the visual job graph on top of 
the detail page is not in sync with the parallelism listed in the task list 
below, when reactive mode causes a parallelism change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22105) testForceAlignedCheckpointResultingInPriorityEvents unstable

2021-04-02 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22105:
--

 Summary: testForceAlignedCheckpointResultingInPriorityEvents 
unstable
 Key: FLINK-22105
 URL: https://issues.apache.org/jira/browse/FLINK-22105
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.13.0


https://dev.azure.com/rmetzger/Flink/_build/results?buildId=9021&view=logs&j=9dc1b5dc-bcfa-5f83-eaa7-0cb181ddc267&t=ab910030-93db-52a7-74a3-34a0addb481b

{code}
2021-04-01T19:29:55.2392858Z [ERROR] Tests run: 10, Failures: 0, Errors: 1, 
Skipped: 0, Time elapsed: 1.921 s <<< FAILURE! - in 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorTest
2021-04-01T19:29:55.2396751Z [ERROR] 
testForceAlignedCheckpointResultingInPriorityEvents(org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorTest)
  Time elapsed: 0.02 s  <<< ERROR!
2021-04-01T19:29:55.2397415Z java.lang.RuntimeException: unable to send request 
to worker
2021-04-01T19:29:55.2397956Zat 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriterImpl.enqueue(ChannelStateWriterImpl.java:228)
2021-04-01T19:29:55.2398603Zat 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriterImpl.finishOutput(ChannelStateWriterImpl.java:183)
2021-04-01T19:29:55.2399310Zat 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:298)
2021-04-01T19:29:55.2400104Zat 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorTest.testForceAlignedCheckpointResultingInPriorityEvents(SubtaskCheckpointCoordinatorTest.java:215)
2021-04-01T19:29:55.2400746Zat 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2021-04-01T19:29:55.2401202Zat 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2021-04-01T19:29:55.2401746Zat 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2021-04-01T19:29:55.2402237Zat 
java.lang.reflect.Method.invoke(Method.java:498)
2021-04-01T19:29:55.2402722Zat 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
2021-04-01T19:29:55.2403270Zat 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
2021-04-01T19:29:55.2403818Zat 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
2021-04-01T19:29:55.2404354Zat 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
2021-04-01T19:29:55.2404854Zat 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
2021-04-01T19:29:55.2405359Zat 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
2021-04-01T19:29:55.2405896Zat 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
2021-04-01T19:29:55.2406393Zat 
org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
2021-04-01T19:29:55.2406855Zat 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
2021-04-01T19:29:55.2407331Zat 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
2021-04-01T19:29:55.2407806Zat 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
2021-04-01T19:29:55.2408279Zat 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
2021-04-01T19:29:55.2408907Zat 
org.junit.runners.ParentRunner.run(ParentRunner.java:363)
2021-04-01T19:29:55.2409403Zat 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
2021-04-01T19:29:55.2409954Zat 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
2021-04-01T19:29:55.2410524Zat 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
2021-04-01T19:29:55.2411318Zat 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
2021-04-01T19:29:55.2411880Zat 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
2021-04-01T19:29:55.2412467Zat 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
2021-04-01T19:29:55.2413006Zat 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
2021-04-01T19:29:55.2413519Zat 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
2021-04-01T19:29:55.2414082Z Caused by: java.lang.IllegalArgumentException: 
writer not found while processing request: writeOutput 0
2021-04-01T19:29:55.2414726Zat 
org.apache.flink.runtime.checkpoint.channel.CheckpointInProgressRequest.onWriterMissing(ChannelStateWriteRequest.java:223)
2021-04-01T19:29:55.2415463Zat 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatchInte

[jira] [Created] (FLINK-22062) Add "Data Sources" (FLIP-27 sources overview page) page from 1.12x to master

2021-03-30 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22062:
--

 Summary: Add "Data Sources" (FLIP-27 sources overview page) page 
from 1.12x to master
 Key: FLINK-22062
 URL: https://issues.apache.org/jira/browse/FLINK-22062
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream, Documentation
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.13.0


This page 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/sources.html
 is not available in the latest docs on master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22040) Maven: Entry has not been leased from this pool / fix for e2e tests

2021-03-30 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22040:
--

 Summary: Maven: Entry has not been leased from this pool / fix for 
e2e tests
 Key: FLINK-22040
 URL: https://issues.apache.org/jira/browse/FLINK-22040
 Project: Flink
  Issue Type: Sub-task
  Components: Build System / Azure Pipelines
Affects Versions: 1.12.2, 1.13.0
Reporter: Robert Metzger






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-22001) Exceptions from JobMaster initialization are not forwarded to the user

2021-03-28 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-22001:
--

 Summary: Exceptions from JobMaster initialization are not 
forwarded to the user
 Key: FLINK-22001
 URL: https://issues.apache.org/jira/browse/FLINK-22001
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger


Steps to reproduce:
Set up a streaming job with an invalid parallelism configuration, for example:
{code}
.setParallelism(15).setMaxParallelism(1);
{code}

This should report the following exception to the user:
{code}
Caused by: org.apache.flink.runtime.JobException: Vertex 
Window(GlobalWindows(), DeltaTrigger, TimeEvictor, ComparableAggregator, 
PassThroughWindowFunction)'s parallelism (15) is higher than the max 
parallelism (1). Please lower the parallelism or increase the max parallelism.
at 
org.apache.flink.runtime.executiongraph.ExecutionJobVertex.(ExecutionJobVertex.java:160)
at 
org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.attachJobGraph(DefaultExecutionGraph.java:781)
at 
org.apache.flink.runtime.executiongraph.DefaultExecutionGraphBuilder.buildGraph(DefaultExecutionGraphBuilder.java:193)
at 
org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:106)
at 
org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:252)
at 
org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:185)
at 
org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:119)
at 
org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:132)
at 
org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110)
at 
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:340)
at 
org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:317)
at 
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:94)
at 
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39)
at 
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.startJobMasterServiceSafely(JobManagerRunnerImpl.java:363)
... 13 more
{code}

However, what the user sees is 
{code}
2021-03-28 20:32:33,935 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 
419f60eac551619fc1081c670ced3649 reached globally terminal state FAILED.

...

2021-03-28 20:32:33,974 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopped 
dispatcher akka://flink/user/rpc/dispatcher_2.
2021-03-28 20:32:33,977 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService  
   [] - Stopping Akka RPC service.
Exception in thread "main" org.apache.flink.util.FlinkException: Failed to 
execute job 'CarTopSpeedWindowingExample'.
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1975)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1853)
at 
org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:69)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1839)
at 
org.apache.flink.streaming.examples.windowing.TopSpeedWindowing.main(TopSpeedWindowing.java:101)
Caused by: java.lang.RuntimeException: Error while waiting for job to be 
initialized
at 
org.apache.flink.client.ClientUtils.waitUntilJobInitializationFinished(ClientUtils.java:160)
at 
org.apache.flink.client.program.PerJobMiniClusterFactory.lambda$submitJob$2(PerJobMiniClusterFactory.java:83)
at 
org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
at 
java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorke

[jira] [Created] (FLINK-21929) flink-statebackend-rocksdb crashes with Error occurred in starting fork

2021-03-23 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21929:
--

 Summary: flink-statebackend-rocksdb crashes with Error occurred in 
starting fork
 Key: FLINK-21929
 URL: https://issues.apache.org/jira/browse/FLINK-21929
 Project: Flink
  Issue Type: Bug
  Components: Runtime / State Backends
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Attachments: image-2021-03-23-13-18-41-836.png

https://dev.azure.com/rmetzger/Flink/_build/results?buildId=9001&view=results

{code}
2021-03-23T09:11:12.1861967Z [INFO] BUILD FAILURE
2021-03-23T09:11:12.1863007Z [INFO] 

2021-03-23T09:11:12.1863492Z [INFO] Total time: 42:35 min
2021-03-23T09:11:12.1864171Z [INFO] Finished at: 2021-03-23T09:11:12+00:00
2021-03-23T09:11:12.8003245Z [INFO] Final Memory: 137M/806M
2021-03-23T09:11:12.8006310Z [INFO] 

2021-03-23T09:11:12.8082409Z [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.22.1:test (default-test) on 
project flink-statebackend-rocksdb_2.11: There are test failures.
2021-03-23T09:11:12.8086652Z [ERROR] 
2021-03-23T09:11:12.8092462Z [ERROR] Please refer to 
/__w/1/s/flink-state-backends/flink-statebackend-rocksdb/target/surefire-reports
 for the individual test results.
2021-03-23T09:11:12.8096948Z [ERROR] Please refer to dump files (if any exist) 
[date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
2021-03-23T09:11:12.8101388Z [ERROR] ExecutionException Error occurred in 
starting fork, check output in log
2021-03-23T09:11:12.8105868Z [ERROR] 
org.apache.maven.surefire.booter.SurefireBooterForkException: 
ExecutionException Error occurred in starting fork, check output in log
2021-03-23T09:11:12.8110518Z [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:510)
2021-03-23T09:11:12.8115518Z [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:382)
2021-03-23T09:11:12.8120811Z [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:297)
2021-03-23T09:11:12.8126356Z [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:246)
2021-03-23T09:11:12.8127129Z [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183)
2021-03-23T09:11:12.8131291Z [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011)
2021-03-23T09:11:12.8132369Z [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857)
2021-03-23T09:11:12.8133397Z [ERROR] at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132)
2021-03-23T09:11:12.8134116Z [ERROR] at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
2021-03-23T09:11:12.8134793Z [ERROR] at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
2021-03-23T09:11:12.8135621Z [ERROR] at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
2021-03-23T09:11:12.8136323Z [ERROR] at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
2021-03-23T09:11:12.8141570Z [ERROR] at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
2021-03-23T09:11:12.8142374Z [ERROR] at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
2021-03-23T09:11:12.8145665Z [ERROR] at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120)
2021-03-23T09:11:12.8146407Z [ERROR] at 
org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355)
2021-03-23T09:11:12.8148835Z [ERROR] at 
org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155)
2021-03-23T09:11:12.8151299Z [ERROR] at 
org.apache.maven.cli.MavenCli.execute(MavenCli.java:584)
2021-03-23T09:11:12.8152244Z [ERROR] at 
org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216)
2021-03-23T09:11:12.8152806Z [ERROR] at 
org.apache.maven.cli.MavenCli.main(MavenCli.java:160)
2021-03-23T09:11:12.8155818Z [ERROR] at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2021-03-23T09:11:12.8159757Z [ERROR] at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2021-03-23T09:11:12.8177288Z [ERROR] at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2021-03-23T09:11:12.8178021Z [ERROR] at 
java.lang.reflect.Method.invoke(Method.java:498)
2021-03-23T09:11:12.8179802Z [ERROR] at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
2021-

[jira] [Created] (FLINK-21884) Reduce TaskManager failure detection time

2021-03-19 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21884:
--

 Summary: Reduce TaskManager failure detection time
 Key: FLINK-21884
 URL: https://issues.apache.org/jira/browse/FLINK-21884
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Robert Metzger
 Fix For: 1.14.0
 Attachments: image-2021-03-19-20-10-40-324.png

In Flink 1.13 (and older versions), TaskManager failures stall the processing 
for a significant amount of time, even though the system gets indications for 
the failure almost immediately through network connection losses.

This is due to a high (default) heartbeat timeout of 50 seconds [1] to 
accommodate for GC pauses, transient network disruptions or generally slow 
environments (otherwise, we would unregister a healthy TaskManager).

Such a high timeout can lead to disruptions in the processing (no processing 
for certain periods, high latencies, buildup of consumer lag etc.). In Reactive 
Mode (FLINK-10407), the issue surfaces on scale-down events, where the loss of 
a TaskManager is immediately visible in the logs, but the job is stuck in 
"FAILING" for quite a while until the TaskManger is really deregistered. (Note 
that this issue is not that critical in a autoscaling setup, because Flink can 
control the scale-down events and trigger them proactively)

On this metrics dashboard, one can see that the job has significant throughput 
drops / consumer lags during scale down (and also CPU usage spikes on 
processing the queued events, leading to incorrect scale up events again).
 !image-2021-03-19-20-10-40-324.png! 

One idea to solve this problem is to:
- Score TaskManagers based on certain signals (# exceptions reported, exception 
types (connection losses, akka failures), failure frequencies,  ...) and 
blacklist them accordingly.

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21883) Introduce cooldown period into adaptive scheduler

2021-03-19 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21883:
--

 Summary: Introduce cooldown period into adaptive scheduler
 Key: FLINK-21883
 URL: https://issues.apache.org/jira/browse/FLINK-21883
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Reporter: Robert Metzger
 Fix For: 1.14.0


This is a follow up to reactive mode, introduced in FLINK-10407.

Introduce a cooldown timeout, during which no further scaling actions are 
performed, after a scaling action.
Without such a cooldown timeout, it can happen with unfortunate timing, that we 
are rescaling the job very frequently, because TaskManagers are not all 
connecting at the same time.
With the current implementation (1.13), this only applies to scaling up, but 
this can also apply to scaling down with autoscaling support.

With this implemented, users can define a cooldown timeout of say 5 minutes: If 
taskmanagers are now slowly connecting one after another, we will only rescale 
every 5 minutes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21819) Remove swift FS filesystem

2021-03-16 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21819:
--

 Summary: Remove swift FS filesystem
 Key: FLINK-21819
 URL: https://issues.apache.org/jira/browse/FLINK-21819
 Project: Flink
  Issue Type: Task
  Components: FileSystems
Reporter: Robert Metzger
 Fix For: 1.13.0


As discussed on the dev@ list, the community agreed to remote the swift FS 
filesystem: https://www.mail-archive.com/dev@flink.apache.org/msg44993.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21712) Extract interfaces for the Execution* (classes used by the DefaultExecutonGraph)

2021-03-10 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21712:
--

 Summary: Extract interfaces for the Execution* (classes used by 
the DefaultExecutonGraph)
 Key: FLINK-21712
 URL: https://issues.apache.org/jira/browse/FLINK-21712
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger


This is a follow up to FLINK-21347, where we discussed in the review 
(https://github.com/apache/flink/pull/14950#discussion_r580963986) that it 
would be good to extract interfaces for the remaining Execution* classes to 
make testing / mocking easier.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21711) DataStreamSink doesn't allow setting maxParallelism

2021-03-10 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21711:
--

 Summary: DataStreamSink doesn't allow setting maxParallelism
 Key: FLINK-21711
 URL: https://issues.apache.org/jira/browse/FLINK-21711
 Project: Flink
  Issue Type: Improvement
  Components: API / DataStream
Affects Versions: 1.13.0
Reporter: Robert Metzger


It seems that we can only set the max parallelism of the sink only via this 
internal API:

{code}
input.addSink(new 
ParallelismTrackingSink<>()).getTransformation().setMaxParallelism(1);

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21638) Clean up argument order and semantics of all "stopWithSavepoint" calls in the codebase

2021-03-05 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21638:
--

 Summary: Clean up argument order and semantics of all 
"stopWithSavepoint" calls in the codebase
 Key: FLINK-21638
 URL: https://issues.apache.org/jira/browse/FLINK-21638
 Project: Flink
  Issue Type: Improvement
  Components: Client / Job Submission, Runtime / Coordination
Reporter: Robert Metzger
 Fix For: 1.13.0


This came up in a review: 
https://github.com/apache/flink/pull/14948#discussion_r588304475

The stopWithSavepoint() methods in Flink have a boolean flag, indicating either 
whether we inject a MAX_WATERMARK in the pipeline, or whether we terminate or 
suspend a job. 

As part of this ticket, I propose to introduce a new enum type in flink-core 
"StopWithSavepointBehavior" with SUSPEND and TERMINATE fields and replace the 
boolean flag with this new field.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21606) TaskManager connected to invalid JobManager leading to TaskSubmissionException

2021-03-04 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21606:
--

 Summary: TaskManager connected to invalid JobManager leading to 
TaskSubmissionException
 Key: FLINK-21606
 URL: https://issues.apache.org/jira/browse/FLINK-21606
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.13.0


While testing reactive mode, I had to start my JobManager a few times to get 
the configuration right. While doing that, I had at least on TaskManager (TM6), 
which was first connected to the first JobManager (with a running job), and 
then to the second one.

On the second JobManager, I was able to execute my test job (on another 
TaskManager (TMx)), once TM6 reconnected, and reactive mode tried to utilize 
all available resources, I repeatedly ran into this issue:

{code}
2021-03-04 15:49:36,322 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - 
Window(GlobalWindows(), DeltaTrigger, TimeEvictor, ComparableAggregator, 
PassThroughWindowFunction) -> Sink: Print to Std. Out (5/7) 
(ae8f39c8dd88148aff93c8f811fab22e) switched from DEPLOYING to FAILED on 
192.168.2.173:64041-4f7521 @ macbook-pro-2.localdomain (dataPort=64044).
java.util.concurrent.CompletionException: 
org.apache.flink.runtime.taskexecutor.exceptions.TaskSubmissionException: Could 
not submit task because there is no JobManager associated for the job 
bbe8634736b5b1d813dd322cfaaa08ea.
at 
java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326) 
~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
 ~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:925) 
~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:913)
 ~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 ~[?:1.8.0_252]
at 
org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:234)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
 ~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
 ~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) 
~[?:1.8.0_252]
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 ~[?:1.8.0_252]
at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1064)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.dispatch.OnComplete.internal(Future.scala:263) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.dispatch.OnComplete.internal(Future.scala:261) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:101) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:999) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:458) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) 
[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.actor.ActorCell.invoke(ActorCell.scala:561) 
[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) 
[flink-dist_2.11-1.13-SNAPSHOT.jar:1

[jira] [Created] (FLINK-21604) Remove adaptive scheduler nightly test profile, replace with more specific testing approach

2021-03-04 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21604:
--

 Summary: Remove adaptive scheduler nightly test profile, replace 
with more specific testing approach
 Key: FLINK-21604
 URL: https://issues.apache.org/jira/browse/FLINK-21604
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger
 Fix For: 1.13.0


We are currently running all tests with the adaptive scheduler, but only once a 
night.
This is wasteful for all tests not using any scheduler. 

Ideally, we come up with a testing strategy where we run all relevant tests 
(checkpointing, failure, recovery, .. etc. related with both schedulers.

One approach would be parameterizing some tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21587) FineGrainedSlotManagerTest.testNotificationAboutNotEnoughResources is unstable

2021-03-03 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21587:
--

 Summary: 
FineGrainedSlotManagerTest.testNotificationAboutNotEnoughResources is unstable
 Key: FLINK-21587
 URL: https://issues.apache.org/jira/browse/FLINK-21587
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.13.0


Happened in my WIP branch, but most likely unrelated to my change: 
https://dev.azure.com/rmetzger/Flink/_build/results?buildId=8925&view=logs&j=9dc1b5dc-bcfa-5f83-eaa7-0cb181ddc267&t=ab910030-93db-52a7-74a3-34a0addb481b

Also note that the error is reproducable locally with DEBUG log level, but not 
with INFO:
{code}
[ERROR] Tests run: 12, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.169 
s <<< FAILURE! - in 
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest
[ERROR] 
testNotificationAboutNotEnoughResources(org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest)
  Time elapsed: 0.029 s  <<< FAILURE!
java.lang.AssertionError: 

Expected: a collection with size <1>
 but: collection size was <0>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at 
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest$9.lambda$new$5(FineGrainedSlotManagerTest.java:548)
at 
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTestBase$Context.runTest(FineGrainedSlotManagerTestBase.java:197)
at 
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest$9.(FineGrainedSlotManagerTest.java:521)
at 
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest.testNotificationAboutNotEnoughResources(FineGrainedSlotManagerTest.java:507)
at 
org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest.testNotificationAboutNotEnoughResources(FineGrainedSlotManagerTest.java:493)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21561) Introduce cache for binaries downloaded by bash tests

2021-03-02 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21561:
--

 Summary: Introduce cache for binaries downloaded by bash tests 
 Key: FLINK-21561
 URL: https://issues.apache.org/jira/browse/FLINK-21561
 Project: Flink
  Issue Type: Improvement
  Components: Build System / CI, Test Infrastructure
Reporter: Robert Metzger
 Fix For: 1.13.0


It seems that archive.apache.org is currently very slow, causing the bash e2e 
tests to timeout, because they need a lot of time downloading dependencies.

This ticket is for tracking an improvement to the bash-based e2e tests to cache 
the binaries in Azure pipelines Caches [1].

[1] 
https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21558) DeclarativeSlotManager starts logging "Could not fulfill resource requirements of job xxx"

2021-03-02 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21558:
--

 Summary: DeclarativeSlotManager starts logging "Could not fulfill 
resource requirements of job xxx"
 Key: FLINK-21558
 URL: https://issues.apache.org/jira/browse/FLINK-21558
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.13.0


While testing the reactive mode, I noticed that my job started normally, but 
after a few minutes, it started logging this:

{code}
2021-03-02 13:36:48,075 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
Custom Source -> Timestamps/Watermarks (2/3) (061b652dabc0ecfc83c942ee3e127ecd) 
switched from DEPLOYING to RUNNING.
2021-03-02 13:36:48,076 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - 
Window(GlobalWindows(), DeltaTrigger, TimeEvictor, ComparableAggregator, 
PassThroughWindowFunction) -> Sink: Print to Std. Out (2/3) 
(6a715e3c70754aafa0b91332b69a736d) switched from DEPLOYING to RUNNING.
2021-03-02 13:36:48,077 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - 
Window(GlobalWindows(), DeltaTrigger, TimeEvictor, ComparableAggregator, 
PassThroughWindowFunction) -> Sink: Print to Std. Out (1/3) 
(8655874da6905d13c01927a282ed2ce0) switched from DEPLOYING to RUNNING.
2021-03-02 13:36:48,080 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - 
Window(GlobalWindows(), DeltaTrigger, TimeEvictor, ComparableAggregator, 
PassThroughWindowFunction) -> Sink: Print to Std. Out (3/3) 
(9514718713ffa453c43a7e7efde9920a) switched from DEPLOYING to RUNNING.
2021-03-02 13:40:28,893 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:29,474 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:29,475 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:29,475 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:39,495 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:39,496 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:39,497 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:49,518 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:49,518 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:49,519 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:59,536 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:59,536 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:40:59,537 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:41:09,556 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:41:09,557 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:41:09,557 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could not fulfill resource requirements of job 
1283d12b281c35f33f3602611ef43b35.
2021-03-02 13:41:19,577 WARN  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Could n

[jira] [Created] (FLINK-21534) Test jars are uploaded twice during Flink maven deployment

2021-03-01 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21534:
--

 Summary: Test jars are uploaded twice during Flink maven deployment
 Key: FLINK-21534
 URL: https://issues.apache.org/jira/browse/FLINK-21534
 Project: Flink
  Issue Type: Bug
  Components: Build System
Affects Versions: 1.13.0
Reporter: Robert Metzger


Here's an example of a snapshot build: 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13905&view=logs&j=eca6b3a6-1600-56cc-916a-c549b3cde3ff&t=e9844b5e-5aa3-546b-6c3e-5395c7c0cac7

You can see that the same jar is uploaded twice:
{code}
2021-02-28T21:07:05.2904685Z Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-runtime_2.11/1.13-SNAPSHOT/flink-runtime_2.11-1.13-20210228.210704-95-tests.jar

2021-02-28T21:07:05.5738429Z Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-runtime_2.11/1.13-SNAPSHOT/flink-runtime_2.11-1.13-20210228.210704-95-tests.jar
 (4714 KB at 16596.2 KB/sec)


2021-02-28T21:07:05.6506506Z Uploading: 
https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-runtime_2.11/1.13-SNAPSHOT/flink-runtime_2.11-1.13-20210228.210704-95-tests.jar

2021-02-28T21:07:07.1976748Z Uploaded: 
https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-runtime_2.11/1.13-SNAPSHOT/flink-runtime_2.11-1.13-20210228.210704-95-tests.jar
 (4714 KB at 3046.7 KB/sec)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21475) StateWithExecutionGraph.suspend fails with IllegalStateException

2021-02-23 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21475:
--

 Summary: StateWithExecutionGraph.suspend fails with 
IllegalStateException
 Key: FLINK-21475
 URL: https://issues.apache.org/jira/browse/FLINK-21475
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.13.0


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13665&view=logs&j=a5ef94ef-68c2-57fd-3794-dc108ed1c495&t=9c1ddabe-d186-5a2c-5fcc-f3cafb3ec699

For discoverability, the Azure output is:
{code}
2021-02-23T23:48:50.6572167Z [INFO] BUILD FAILURE
2021-02-23T23:48:50.6573151Z [INFO] 

2021-02-23T23:48:50.6573684Z [INFO] Total time: 01:59 min
2021-02-23T23:48:50.6574520Z [INFO] Finished at: 2021-02-23T23:48:50+00:00
2021-02-23T23:48:51.4672056Z [INFO] Final Memory: 183M/3491M
2021-02-23T23:48:51.4673656Z [INFO] 

2021-02-23T23:48:51.4674310Z [WARNING] The requested profile "skip-webui-build" 
could not be activated because it does not exist.
2021-02-23T23:48:51.4675176Z [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-surefire-plugin:2.22.1:test (integration-tests) 
on project flink-connector-files: There are test failures.
2021-02-23T23:48:51.4675685Z [ERROR] 
2021-02-23T23:48:51.4676248Z [ERROR] Please refer to 
/__w/2/s/flink-connectors/flink-connector-files/target/surefire-reports for the 
individual test results.
2021-02-23T23:48:51.4677378Z [ERROR] Please refer to dump files (if any exist) 
[date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
2021-02-23T23:48:51.4677963Z [ERROR] ExecutionException The forked VM 
terminated without properly saying goodbye. VM crash or System.exit called?
2021-02-23T23:48:51.4679891Z [ERROR] Command was /bin/sh -c cd 
/__w/2/s/flink-connectors/flink-connector-files/target && 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m 
-Dmvn.forkNumber=1 -XX:+UseG1GC -jar 
/__w/2/s/flink-connectors/flink-connector-files/target/surefire/surefirebooter72596965996179907.jar
 /__w/2/s/flink-connectors/flink-connector-files/target/surefire 
2021-02-23T23-47-06_882-jvmRun1 surefire3083231786399295313tmp 
surefire_104007394662374862924tmp
2021-02-23T23:48:51.4680846Z [ERROR] Error occurred in starting fork, check 
output in log
2021-02-23T23:48:51.4681156Z [ERROR] Process Exit Code: 239
2021-02-23T23:48:51.4681413Z [ERROR] Crashed tests:
2021-02-23T23:48:51.4681737Z [ERROR] 
org.apache.flink.connector.file.sink.writer.FileSinkMigrationITCase
2021-02-23T23:48:51.4682470Z [ERROR] 
org.apache.maven.surefire.booter.SurefireBooterForkException: 
ExecutionException The forked VM terminated without properly saying goodbye. VM 
crash or System.exit called?
2021-02-23T23:48:51.4684122Z [ERROR] Command was /bin/sh -c cd 
/__w/2/s/flink-connectors/flink-connector-files/target && 
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m 
-Dmvn.forkNumber=1 -XX:+UseG1GC -jar 
/__w/2/s/flink-connectors/flink-connector-files/target/surefire/surefirebooter72596965996179907.jar
 /__w/2/s/flink-connectors/flink-connector-files/target/surefire 
2021-02-23T23-47-06_882-jvmRun1 surefire3083231786399295313tmp 
surefire_104007394662374862924tmp
2021-02-23T23:48:51.4685319Z [ERROR] Error occurred in starting fork, check 
output in log
2021-02-23T23:48:51.4685796Z [ERROR] Process Exit Code: 239
2021-02-23T23:48:51.4686229Z [ERROR] Crashed tests:
2021-02-23T23:48:51.4686926Z [ERROR] 
org.apache.flink.connector.file.sink.writer.FileSinkMigrationITCase
2021-02-23T23:48:51.4687709Z [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:510)
2021-02-23T23:48:51.4688603Z [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:457)
2021-02-23T23:48:51.4689451Z [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:298)
2021-02-23T23:48:51.4690207Z [ERROR] at 
org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:246)
2021-02-23T23:48:51.4691076Z [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183)
2021-02-23T23:48:51.4692029Z [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011)
2021-02-23T23:48:51.4693058Z [ERROR] at 
org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857)
2021-02-23T23:48:51.4693946Z [ERROR] at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132)
2021-02-23T23:48:51.4694768Z [ERROR] at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor

[jira] [Created] (FLINK-21450) Add local recovery support to adaptive scheduler

2021-02-22 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21450:
--

 Summary: Add local recovery support to adaptive scheduler
 Key: FLINK-21450
 URL: https://issues.apache.org/jira/browse/FLINK-21450
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Coordination
Reporter: Robert Metzger


local recovery means that, on a failure, we are able to re-use the state in a 
taskmanager, instead of loading it again from distributed storage (which means 
the scheduler needs to know where which state is located, and schedule tasks 
accordingly).

Adaptive Scheduler is currently not respecting the location of state, so 
failures require the re-loading of state from the distributed storage.

Adding this feature will allow us to enable the {{Local recovery and sticky 
scheduling end-to-end test}} for adaptive scheduler again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21390) Rename DeclarativeScheduler to AdaptiveScheduler

2021-02-17 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21390:
--

 Summary: Rename DeclarativeScheduler to AdaptiveScheduler
 Key: FLINK-21390
 URL: https://issues.apache.org/jira/browse/FLINK-21390
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.13.0


DeclarativeScheduler does not seem to be an appropriate name for what it does: 
In particular looking at the difference to the DefaultScheduler, calling it 
AdaptiveScheduler (as in, adapts the parallelism of the ExecutionGraph during 
the job lifetime) seems more appropriate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21360) Add resourceTimeout configuration

2021-02-10 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21360:
--

 Summary: Add resourceTimeout configuration
 Key: FLINK-21360
 URL: https://issues.apache.org/jira/browse/FLINK-21360
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger
 Fix For: 1.13.0


resourceTimeout is currently a hardcoded value. Make it configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21348) Add tests for StateWithExecutionGraph

2021-02-10 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21348:
--

 Summary: Add tests for StateWithExecutionGraph
 Key: FLINK-21348
 URL: https://issues.apache.org/jira/browse/FLINK-21348
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.13.0


This ticket is about adding dedicated tests for the StateWithExecutionGraph 
class.

This is a follow up from 
https://github.com/apache/flink/pull/14879#discussion_r573707768





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21347) Extract interface out of ExecutionGraph for better testability

2021-02-10 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21347:
--

 Summary: Extract interface out of ExecutionGraph for better 
testability
 Key: FLINK-21347
 URL: https://issues.apache.org/jira/browse/FLINK-21347
 Project: Flink
  Issue Type: Task
  Components: Runtime / Coordination
Reporter: Robert Metzger


This is a follow up to this comment: 
https://github.com/apache/flink/pull/14879#discussion_r573613450

The ExecutionGraph class is currently not very handy for tests, as it has a lot 
of dependencies. Extracting an interface for the ExecutionGraph will make 
testing and future changes easier.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21333) Introduce stopping with savepoint state

2021-02-09 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21333:
--

 Summary: Introduce stopping with savepoint state
 Key: FLINK-21333
 URL: https://issues.apache.org/jira/browse/FLINK-21333
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger


The declarative scheduler is also affected by the problem described in 
FLINK-21030. We want to solve this problem by introducing a separate state when 
are taking a savepoint for stopping Flink.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21329) "Local recovery and sticky scheduling end-to-end test" does not finish within 600 seconds

2021-02-08 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21329:
--

 Summary: "Local recovery and sticky scheduling end-to-end test" 
does not finish within 600 seconds
 Key: FLINK-21329
 URL: https://issues.apache.org/jira/browse/FLINK-21329
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.0
Reporter: Robert Metzger


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13118&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529

{code}
Feb 08 22:25:46 
==
Feb 08 22:25:46 Running 'Local recovery and sticky scheduling end-to-end test'
Feb 08 22:25:46 
==
Feb 08 22:25:46 TEST_DATA_DIR: 
/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-46881214821
Feb 08 22:25:47 Flink dist directory: 
/home/vsts/work/1/s/flink-dist/target/flink-1.13-SNAPSHOT-bin/flink-1.13-SNAPSHOT
Feb 08 22:25:47 Running local recovery test with configuration:
Feb 08 22:25:47 parallelism: 4
Feb 08 22:25:47 max attempts: 10
Feb 08 22:25:47 backend: rocks
Feb 08 22:25:47 incremental checkpoints: false
Feb 08 22:25:47 kill JVM: false
Feb 08 22:25:47 Starting zookeeper daemon on host fv-az127-394.
Feb 08 22:25:47 Starting HA cluster with 1 masters.
Feb 08 22:25:48 Starting standalonesession daemon on host fv-az127-394.
Feb 08 22:25:49 Starting taskexecutor daemon on host fv-az127-394.
Feb 08 22:25:49 Waiting for Dispatcher REST endpoint to come up...
Feb 08 22:25:50 Waiting for Dispatcher REST endpoint to come up...
Feb 08 22:25:51 Waiting for Dispatcher REST endpoint to come up...
Feb 08 22:25:53 Waiting for Dispatcher REST endpoint to come up...
Feb 08 22:25:54 Dispatcher REST endpoint is up.
Feb 08 22:25:54 Started TM watchdog with PID 28961.
Feb 08 22:25:58 Job has been submitted with JobID 
e790e85a39040539f9386c0df7ca4812
Feb 08 22:35:47 Test (pid: 27970) did not finish after 600 seconds.
Feb 08 22:35:47 Printing Flink logs and killing it:

{code}

and

{code}

at 
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.unhandledError(ZooKeeperLeaderRetrievalDriver.java:184)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:713)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:709)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at 
org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException:
 KeeperErrorCode = ConnectionLoss
at 
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
at 
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
... 10 more

{code}



--
Thi

[jira] [Created] (FLINK-21307) Revisit activation model of FlinkSecurityManager

2021-02-05 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21307:
--

 Summary: Revisit activation model of FlinkSecurityManager
 Key: FLINK-21307
 URL: https://issues.apache.org/jira/browse/FLINK-21307
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Affects Versions: 1.13.0
Reporter: Robert Metzger


In FLINK-15156, we introduced a feature that allows users to log or completely 
disable calls to System.exit(). This feature is enabled for certain threads / 
code sections intended to execute user-code.

The activation of the security manager (for monitoring user calls to 
System.exit() is currently not well-defined, and only implemented on a 
best-effort basis.
This ticket is to revisit the activation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21306) FlinkSecurityManager might avoid fatal system exits

2021-02-05 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21306:
--

 Summary: FlinkSecurityManager might avoid fatal system exits
 Key: FLINK-21306
 URL: https://issues.apache.org/jira/browse/FLINK-21306
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Affects Versions: 1.13.0
Reporter: Robert Metzger


In FLINK-15156, we introduced a feature that allows users to log or completely 
disable calls to System.exit().
This feature is enabled for certain threads / code sections intended to execute 
user-code.
However, some user code calls might still lead to fatal errors, which we want 
to handle by killing the Flink process.
It is likely that this new change (which is disabled by default) can lead to a 
situation where Flink should exit immediately, but it doesn't (thus leaving the 
system in an inconsistent state)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21260) Add DeclarativeScheduler / Finished state

2021-02-03 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21260:
--

 Summary: Add DeclarativeScheduler / Finished state
 Key: FLINK-21260
 URL: https://issues.apache.org/jira/browse/FLINK-21260
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger
 Fix For: 1.13.0


This subtask of adding the declarative scheduler is about adding the Finished 
state to Flink, including tests.

Finished: The job execution has been completed.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21259) Add DeclarativeScheduler / Failing state

2021-02-03 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21259:
--

 Summary: Add DeclarativeScheduler / Failing state
 Key: FLINK-21259
 URL: https://issues.apache.org/jira/browse/FLINK-21259
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger
 Fix For: 1.13.0


This subtask of adding the declarative scheduler is about adding the Failing 
state to Flink, including tests.

Failing: An unrecoverable fault has occurred. The scheduler stops the 
ExecutionGraph by canceling it.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21258) Add DeclarativeScheduler / Canceling state

2021-02-03 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21258:
--

 Summary: Add DeclarativeScheduler / Canceling state
 Key: FLINK-21258
 URL: https://issues.apache.org/jira/browse/FLINK-21258
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger
 Fix For: 1.13.0


This subtask of adding the declarative scheduler is about adding the Canceling 
state to Flink, including tests.

Canceling: The job has been canceled by the user. The scheduler stops the 
ExecutionGraph by canceling it.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21257) Add DeclarativeScheduler / Restarting state

2021-02-03 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21257:
--

 Summary: Add DeclarativeScheduler / Restarting state
 Key: FLINK-21257
 URL: https://issues.apache.org/jira/browse/FLINK-21257
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger
 Fix For: 1.13.0


This subtask of adding the declarative scheduler is about adding the Created 
state to Flink, including tests.

Restarting: A recoverable fault has occurred. The scheduler stops the 
ExecutionGraph by canceling it.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21256) Add DeclarativeScheduler / Executing state

2021-02-03 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21256:
--

 Summary: Add DeclarativeScheduler / Executing state
 Key: FLINK-21256
 URL: https://issues.apache.org/jira/browse/FLINK-21256
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger
Assignee: Robert Metzger
 Fix For: 1.13.0


This subtask of adding the declarative scheduler is about adding the Created 
state to Flink, including tests.

Executing: The set of resources is stable and the scheduler could decide on the 
parallelism with which to execute the job. The ExecutionGraph is created and 
the execution of the job has started.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21255) Add DeclarativeScheduler / WaitingForResources state

2021-02-03 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21255:
--

 Summary: Add DeclarativeScheduler / WaitingForResources state
 Key: FLINK-21255
 URL: https://issues.apache.org/jira/browse/FLINK-21255
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger
Assignee: Robert Metzger
 Fix For: 1.13.0


This subtask of adding the declarative scheduler is about adding the Created 
state to Flink, including tests.

Waiting for resources: The required resources are declared. The scheduler waits 
until either the requirements are fulfilled or the set of resources has 
stabilised.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21254) Add DeclarativeScheduler / Created state

2021-02-03 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21254:
--

 Summary: Add DeclarativeScheduler / Created state 
 Key: FLINK-21254
 URL: https://issues.apache.org/jira/browse/FLINK-21254
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger
Assignee: Robert Metzger
 Fix For: 1.13.0


This subtask of adding the declarative scheduler is about adding the Created 
state to Flink, including tests.
Created: Initial state of the scheduler




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21166) NullPointer in CheckpointMetricsBuilder surfacing in "Resuming Savepoint" e2e

2021-01-27 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21166:
--

 Summary: NullPointer in CheckpointMetricsBuilder surfacing in 
"Resuming Savepoint" e2e
 Key: FLINK-21166
 URL: https://issues.apache.org/jira/browse/FLINK-21166
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.13.0
Reporter: Robert Metzger


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12562&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529

{code}
Running 'Resuming Savepoint (rocks, scale down, rocks timers) end-to-end test'
Found exception in log files; printing first 500 lines; see full logs for 
details:
...
[FAIL] 'Resuming Savepoint (rocks, scale down, rocks timers) end-to-end test' 
failed after 0 minutes and 30 seconds! Test exited with exit code 0 but the 
logs contained errors, exceptions or non-empty .out files
{code}

One TaskManager log contains the following:
{code}
=== Finished metrics report ===

2021-01-27 15:22:49,635 WARN  
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - Could not 
properly clean up the async checkpoint runnable.
java.lang.IllegalStateException: null
at 
org.apache.flink.util.Preconditions.checkState(Preconditions.java:177) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.util.Preconditions.checkCompletedNormally(Preconditions.java:261)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.concurrent.FutureUtils.checkStateAndGet(FutureUtils.java:1176)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.runtime.checkpoint.CheckpointMetricsBuilder.build(CheckpointMetricsBuilder.java:133)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.reportAbortedSnapshotStats(AsyncCheckpointRunnable.java:219)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.close(AsyncCheckpointRunnable.java:292)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:275) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.cancelAsyncCheckpointRunnable(SubtaskCheckpointCoordinatorImpl.java:451)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpointAborted(SubtaskCheckpointCoordinatorImpl.java:340)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointAbortAsync$12(StreamTask.java:1069)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointOperation$13(StreamTask.java:1082)
 ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:314)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:300)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:188)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:615)
 [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:579) 
[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:763) 
[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:565) 
[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-01-27 15:22:49,637 WARN  
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - Could not 
properly clean up the async checkpoint runnable.
java.lang.IllegalStateException: null
at 
org.apache.flink.util.Preconditions.checkState(Preconditions.java:177) 
~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT]
at 
org.apac

[jira] [Created] (FLINK-21136) Reactive Mode: Adjust timeout behavior in declarative scheduler

2021-01-25 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21136:
--

 Summary: Reactive Mode: Adjust timeout behavior in declarative 
scheduler
 Key: FLINK-21136
 URL: https://issues.apache.org/jira/browse/FLINK-21136
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger
 Fix For: 1.13.0


The FLIP states the following timeout and resource registration behavior: 

On initial startup, the declarative scheduler will wait indefinitely for 
TaskManagers to show up. Once there are enough TaskManagers available to start 
the job, and the set of resources is stable (see FLIP-160 for a definition), 
the job will start running.

Once the job has started running, and a TaskManager is lost, it will wait for 
10 seconds for the TaskManager to re-appear. Otherwise, the job will be 
scheduled again with the available resources. If no TaskManagers are available 
anymore, the declarative scheduler will wait indefinitely again for new 
resources.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21135) Reactive Mode: Change Declarative Scheduler to set infinite parallelism in JobGraph

2021-01-25 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21135:
--

 Summary: Reactive Mode: Change Declarative Scheduler to set 
infinite parallelism in JobGraph
 Key: FLINK-21135
 URL: https://issues.apache.org/jira/browse/FLINK-21135
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger


For Reactive Mode, the scheduler needs to change the parallelism and 
maxParalllelism of the submitted job graph to it's max value (2^15).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21134) Introduce execution mode configuration key and check for supported ClusterEntrypoint type

2021-01-25 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21134:
--

 Summary: Introduce execution mode configuration key and check for 
supported ClusterEntrypoint type
 Key: FLINK-21134
 URL: https://issues.apache.org/jira/browse/FLINK-21134
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Robert Metzger
 Fix For: 1.13.0


According to the FLIP, introduce a "execution-mode" configuration key, and 
check in the ClusterEntrypoint if the chosen entry point type is supported by 
the selected execution mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21073) Mention that RocksDB ignores equals/hashCode because it works on binary data

2021-01-21 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21073:
--

 Summary: Mention that RocksDB ignores equals/hashCode because it 
works on binary data
 Key: FLINK-21073
 URL: https://issues.apache.org/jira/browse/FLINK-21073
 Project: Flink
  Issue Type: Bug
  Components: Documentation, Runtime / State Backends
Reporter: Robert Metzger
 Fix For: 1.13.0


See 
https://lists.apache.org/thread.html/ra43e2b5d388831290c293b9daf0eee0b0a5d9712543b62c83234a829%40%3Cuser.flink.apache.org%3E





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21065) Passing configuration from TableEnvironmentImpl to MiniCluster is not supported

2021-01-21 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21065:
--

 Summary: Passing configuration from TableEnvironmentImpl to 
MiniCluster is not supported
 Key: FLINK-21065
 URL: https://issues.apache.org/jira/browse/FLINK-21065
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.13.0
Reporter: Robert Metzger


While trying to fix a bug in Flink's scheduler, I needed to pass a 
configuration parameter from the TableEnvironmentITCase (blink planner) to the 
TaskManager.

Changing this from:
{code}
   case "TableEnvironment" =>
tEnv = TableEnvironmentImpl.create(settings)
{code}
to
{code}
   case "TableEnvironment" =>
tEnv = TableEnvironmentImpl.create(settings)
val conf = new Configuration();
conf.setInteger("taskmanager.numberOfTaskSlots", 1)
tEnv.getConfig.addConfiguration(conf)
{code}

Did not cause any effect on the launched TaskManager.
It seems that configuration is not properly forwarded through all abstractions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-21051) SQLClientSchemaRegistryITCase unstable with InternalServerErrorException: Status 500

2021-01-20 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-21051:
--

 Summary: SQLClientSchemaRegistryITCase unstable with 
InternalServerErrorException: Status 500
 Key: FLINK-21051
 URL: https://issues.apache.org/jira/browse/FLINK-21051
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Ecosystem, Tests
Affects Versions: 1.13.0
Reporter: Robert Metzger


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12253&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529

{code}
2021-01-20T00:10:21.3510385Z Jan 20 00:10:21 
2021-01-20T00:10:21.3516246Z Jan 20 00:10:21 [ERROR] 
testWriting(org.apache.flink.tests.util.kafka.SQLClientSchemaRegistryITCase)  
Time elapsed: 0.001 s  <<< ERROR!
2021-01-20T00:10:21.3517459Z Jan 20 00:10:21 java.lang.RuntimeException: Could 
not build the flink-dist image
2021-01-20T00:10:21.3518178Z Jan 20 00:10:21at 
org.apache.flink.tests.util.flink.FlinkContainer$FlinkContainerBuilder.build(FlinkContainer.java:281)
2021-01-20T00:10:21.3519176Z Jan 20 00:10:21at 
org.apache.flink.tests.util.kafka.SQLClientSchemaRegistryITCase.(SQLClientSchemaRegistryITCase.java:88)
2021-01-20T00:10:21.3519873Z Jan 20 00:10:21at 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
2021-01-20T00:10:21.3520537Z Jan 20 00:10:21at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
2021-01-20T00:10:21.3521390Z Jan 20 00:10:21at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
2021-01-20T00:10:21.3522080Z Jan 20 00:10:21at 
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
2021-01-20T00:10:21.3522730Z Jan 20 00:10:21at 
org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:217)
2021-01-20T00:10:21.3523452Z Jan 20 00:10:21at 
org.junit.runners.BlockJUnit4ClassRunner$1.runReflectiveCall(BlockJUnit4ClassRunner.java:266)
2021-01-20T00:10:21.3524237Z Jan 20 00:10:21at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
2021-01-20T00:10:21.3524879Z Jan 20 00:10:21at 
org.junit.runners.BlockJUnit4ClassRunner.methodBlock(BlockJUnit4ClassRunner.java:263)
2021-01-20T00:10:21.3525527Z Jan 20 00:10:21at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
2021-01-20T00:10:21.3526157Z Jan 20 00:10:21at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
2021-01-20T00:10:21.3526754Z Jan 20 00:10:21at 
org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
2021-01-20T00:10:21.3527316Z Jan 20 00:10:21at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
2021-01-20T00:10:21.3527884Z Jan 20 00:10:21at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
2021-01-20T00:10:21.3528462Z Jan 20 00:10:21at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
2021-01-20T00:10:21.3529491Z Jan 20 00:10:21at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
2021-01-20T00:10:21.3530220Z Jan 20 00:10:21at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
2021-01-20T00:10:21.3530970Z Jan 20 00:10:21at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
2021-01-20T00:10:21.3531649Z Jan 20 00:10:21at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
2021-01-20T00:10:21.3532201Z Jan 20 00:10:21at 
java.lang.Thread.run(Thread.java:748)
2021-01-20T00:10:21.3533545Z Jan 20 00:10:21 Caused by: 
com.github.dockerjava.api.exception.InternalServerErrorException: Status 500: 
{"message":"Get 
https://registry-1.docker.io/v2/testcontainers/ryuk/manifests/0.3.0: received 
unexpected HTTP status: 502 Bad Gateway"}
2021-01-20T00:10:21.3534353Z Jan 20 00:10:21 
2021-01-20T00:10:21.3534955Z Jan 20 00:10:21at 
org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.execute(DefaultInvocationBuilder.java:247)
2021-01-20T00:10:21.3536388Z Jan 20 00:10:21at 
org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.lambda$executeAndStream$1(DefaultInvocationBuilder.java:269)
2021-01-20T00:10:21.3537066Z Jan 20 00:10:21... 1 more
2021-01-20T00:10:21.3541323Z Jan 20 00:10:21 
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-20929) Add automated deployment of nightly docker images

2021-01-11 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-20929:
--

 Summary: Add automated deployment of nightly docker images
 Key: FLINK-20929
 URL: https://issues.apache.org/jira/browse/FLINK-20929
 Project: Flink
  Issue Type: Task
  Components: Build System / Azure Pipelines
Reporter: Robert Metzger


There've been a few developers asking for nightly builds (think 
apache/flink:1.13-SNAPSHOT) of Flink.

In INFRA-21276, Flink got access to the "apache/flink" DockerHub repository, 
which we could use for this purpose.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-20928) KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete:189->pollUntil:270 » Timeout

2021-01-11 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-20928:
--

 Summary: 
KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete:189->pollUntil:270 » 
Timeout
 Key: FLINK-20928
 URL: https://issues.apache.org/jira/browse/FLINK-20928
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.13.0
Reporter: Robert Metzger


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=11861&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5

{code}
[ERROR] Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 93.992 
s <<< FAILURE! - in 
org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest
[ERROR] 
testOffsetCommitOnCheckpointComplete(org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest)
  Time elapsed: 60.086 s  <<< ERROR!
java.util.concurrent.TimeoutException: The offset commit did not finish before 
timeout.
at 
org.apache.flink.core.testutils.CommonTestUtils.waitUtil(CommonTestUtils.java:210)
at 
org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest.pollUntil(KafkaSourceReaderTest.java:270)
at 
org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete(KafkaSourceReaderTest.java:189)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-20532) generate-stackbrew-library.sh in flink-docker doesn't properly prune the java11 tag

2020-12-08 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-20532:
--

 Summary: generate-stackbrew-library.sh in flink-docker doesn't 
properly prune the java11 tag
 Key: FLINK-20532
 URL: https://issues.apache.org/jira/browse/FLINK-20532
 Project: Flink
  Issue Type: Bug
  Components: flink-docker
Affects Versions: 1.12.0
Reporter: Robert Metzger
Assignee: Robert Metzger


The output of {generate-stackbrew-library.sh} contains two {java11} tags.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-20531) Flink docs are not building anymore due to builder change

2020-12-08 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-20531:
--

 Summary: Flink docs are not building anymore due to builder change
 Key: FLINK-20531
 URL: https://issues.apache.org/jira/browse/FLINK-20531
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Reporter: Robert Metzger
Assignee: Robert Metzger


The Flink docs are not building anymore, due to 
{code}
r1068824 | dfoulks | 2020-12-07 18:53:38 +0100 (Mon, 07 Dec 2020) | 1 line
Moved bb-slave1 jobs to bb-slave7 and bb-slave2 jobs to bb-slave8
{code}

bb-slave2 has "rvm" installed, "bb-slave8" doesn't: 
https://ci.apache.org/builders/flink-docs-release-1.11/builds/161



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-20455) Add check to LicenseChecker for top level /LICENSE files in shaded jars

2020-12-02 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-20455:
--

 Summary: Add check to LicenseChecker for top level /LICENSE files 
in shaded jars
 Key: FLINK-20455
 URL: https://issues.apache.org/jira/browse/FLINK-20455
 Project: Flink
  Issue Type: Task
  Components: Build System / CI
Reporter: Robert Metzger
 Fix For: 1.13.0


During the release verification of the 1.12.0 release, we noticed several 
modules containing LICENSE files in the jar file, which are not Apache licenses.
This could mislead users that the JARs are licensed not according to the ASL, 
but something else.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-20449) UnalignedCheckpointITCase times out

2020-12-01 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-20449:
--

 Summary: UnalignedCheckpointITCase times out
 Key: FLINK-20449
 URL: https://issues.apache.org/jira/browse/FLINK-20449
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.13.0
Reporter: Robert Metzger
 Fix For: 1.13.0


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=10416&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=a99e99c7-21cd-5a1f-7274-585e62b72f56

{code}
2020-12-02T01:24:33.7219846Z [ERROR] Tests run: 10, Failures: 0, Errors: 2, 
Skipped: 0, Time elapsed: 746.887 s <<< FAILURE! - in 
org.apache.flink.test.checkpointing.UnalignedCheckpointITCase
2020-12-02T01:24:33.7220860Z [ERROR] execute[Parallel cogroup, p = 
10](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
elapsed: 300.24 s  <<< ERROR!
2020-12-02T01:24:33.7221663Z org.junit.runners.model.TestTimedOutException: 
test timed out after 300 seconds
2020-12-02T01:24:33.7222017Zat sun.misc.Unsafe.park(Native Method)
2020-12-02T01:24:33.7222390Zat 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
2020-12-02T01:24:33.7222882Zat 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
2020-12-02T01:24:33.7223356Zat 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
2020-12-02T01:24:33.7223840Zat 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
2020-12-02T01:24:33.7224320Zat 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
2020-12-02T01:24:33.7224864Zat 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1842)
2020-12-02T01:24:33.7225500Zat 
org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:70)
2020-12-02T01:24:33.7226297Zat 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1822)
2020-12-02T01:24:33.7226929Zat 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1804)
2020-12-02T01:24:33.7227572Zat 
org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:122)
2020-12-02T01:24:33.7228187Zat 
org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:159)
2020-12-02T01:24:33.7228680Zat 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2020-12-02T01:24:33.7229099Zat 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2020-12-02T01:24:33.7229617Zat 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2020-12-02T01:24:33.7230068Zat 
java.lang.reflect.Method.invoke(Method.java:498)
2020-12-02T01:24:33.7230733Zat 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
2020-12-02T01:24:33.7231262Zat 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
2020-12-02T01:24:33.7231775Zat 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
2020-12-02T01:24:33.7232276Zat 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
2020-12-02T01:24:33.7232732Zat 
org.junit.rules.Verifier$1.evaluate(Verifier.java:35)
2020-12-02T01:24:33.7233144Zat 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
2020-12-02T01:24:33.7233663Zat 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
2020-12-02T01:24:33.7234239Zat 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
2020-12-02T01:24:33.7234735Zat 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
2020-12-02T01:24:33.7235093Zat java.lang.Thread.run(Thread.java:748)
2020-12-02T01:24:33.7235305Z 
2020-12-02T01:24:33.7235728Z [ERROR] execute[Parallel union, p = 
10](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
elapsed: 300.539 s  <<< ERROR!
2020-12-02T01:24:33.7236436Z org.junit.runners.model.TestTimedOutException: 
test timed out after 300 seconds
2020-12-02T01:24:33.7236790Zat sun.misc.Unsafe.park(Native Method)
2020-12-02T01:24:33.7237158Zat 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
2020-12-02T01:24:33.7237641Zat 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
2020-12-02T01:24:33.7238118Zat 
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
2020-12-02T01:24:33.7238599Zat 
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
2020-12-02T01:24:33.7239885Zat 
java.util.concu

[jira] [Created] (FLINK-20442) Fix license documentation mistakes in flink-python.jar

2020-12-01 Thread Robert Metzger (Jira)
Robert Metzger created FLINK-20442:
--

 Summary: Fix license documentation mistakes in flink-python.jar
 Key: FLINK-20442
 URL: https://issues.apache.org/jira/browse/FLINK-20442
 Project: Flink
  Issue Type: Bug
  Components: API / Python
Affects Versions: 1.12.0
Reporter: Robert Metzger
 Fix For: 1.12.0


Issues reported by Chesnay:

-The flink-python jar contains 2 license files in the root directory and 
another 2 in the META-INF directory. This should be reduced down to 1 under 
META-INF. I'm inclined to block the release on this because the root license is 
BSD.
- The flink-python jar appears to bundle lz4 (native libraries under win32/, 
linux/ and darwin/), but this is neither listed in the NOTICE nor do we have an 
explicit license file for it.

Other minor things that we should address in the future:
- opt/python contains some LICENSE files that should instead be placed under 
licenses/
- licenses/ contains a stray "ASM" file containing the ASM license. It's not a 
problem (because it is identical with our intended copy), but it indicates that 
something is amiss. This seems to originate from the flink-python jar, which 
bundles some beam stuff, which bundles bytebuddy, which bundles this license 
file. From what I can tell bytebuddy is not actually bundling ASM though; they 
just bundle the license for whatever reason. It is not listed as bundled in the 
flink-python NOTICE though, so I wouldn't block the release on it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


  1   2   3   4   5   6   7   8   >