[jira] [Created] (FLINK-36073) ApplicationMode with the K8s operator does not support downloading jars via filesystem plugins
Robert Metzger created FLINK-36073: -- Summary: ApplicationMode with the K8s operator does not support downloading jars via filesystem plugins Key: FLINK-36073 URL: https://issues.apache.org/jira/browse/FLINK-36073 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 2.0.0, 1.19.2, 1.20.1 Reporter: Robert Metzger As discussed in this ticket https://issues.apache.org/jira/browse/FLINK-28915 we cannot define a FlinkDeployment spec with {code} spec: job: jarURI: s3://myDevBucket/myjob.jar {code} It fails with: {code} org.apache.flink.container.entrypoint.StandaloneApplicationClusterEntryPoint.main(StandaloneApplicationClusterEntryPoint.java:89) [flink-dist-1.19-x.jar:1.19-x] Caused by: java.net.MalformedURLException: unknown protocol: s3 at java.net.URL.<init>(URL.java:652) ~[?:?] at java.net.URL.<init>(URL.java:541) ~[?:?] at java.net.URL.<init>(URL.java:488) ~[?:?] at org.apache.flink.configuration.ConfigUtils.decodeListFromConfig(ConfigUtils.java:133) ~[flink-dist-1.19-x.jar:1.19-x] at org.apache.flink.client.program.DefaultPackagedProgramRetriever.getClasspathsFromConfiguration(DefaultPackagedProgramRetriever.java:273) ~[flink-dist-1.19-x.jar:1.19-x] at org.apache.flink.client.program.DefaultPackagedProgramRetriever.create(DefaultPackagedProgramRetriever.java:121) ~[flink-dist-1.19-x.jar:1.19-x] ... 4 more {code} even though I have defined: {code} podTemplate: spec: containers: - name: flink-main-container env: - name: ENABLE_BUILT_IN_PLUGINS value: "flink-s3-fs-presto-1.19.0.jar" {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
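The stack trace suggests the root cause: the entrypoint resolves the jar/classpath entries via java.net.URL, which only knows the JDK's built-in protocol handlers, whereas pluggable filesystems such as flink-s3-fs-presto are only reachable through Flink's FileSystem abstraction. A minimal sketch contrasting the two code paths (class and bucket names are illustrative, this is not the entrypoint's actual code, and it assumes the S3 filesystem plugin is available under the plugins directory):
{code:java}
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.plugin.PluginUtils;

public class S3UriResolution {

    public static void main(String[] args) throws Exception {
        String jarUri = "s3://myDevBucket/myjob.jar";

        // java.net.URL only understands the JDK's built-in protocols
        // (http, https, file, jar, ...), so an s3:// URI fails here no
        // matter which filesystem plugins are enabled in the image.
        try {
            new URL(jarUri);
        } catch (MalformedURLException e) {
            System.out.println("java.net.URL rejects s3://: " + e.getMessage());
        }

        // Flink's own FileSystem abstraction goes through the plugin
        // mechanism, so with flink-s3-fs-presto loaded this resolves.
        Configuration configuration = new Configuration();
        FileSystem.initialize(
                configuration, PluginUtils.createPluginManagerFromRootFolder(configuration));
        FileSystem fs = FileSystem.get(new URI(jarUri));
        System.out.println("Flink FileSystem scheme: " + fs.getUri().getScheme());
    }
}
{code}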
[jira] [Created] (FLINK-35526) Remove deprecated stedolan/jq Docker image from Flink e2e tests
Robert Metzger created FLINK-35526: -- Summary: Remove deprecated stedolan/jq Docker image from Flink e2e tests Key: FLINK-35526 URL: https://issues.apache.org/jira/browse/FLINK-35526 Project: Flink Issue Type: Bug Components: Test Infrastructure Reporter: Robert Metzger Assignee: Robert Metzger Our CI logs contain this warning: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60060&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=0f3adb59-eefa-51c6-2858-3654d9e0749d&l=3828 {code} latest: Pulling from stedolan/jq [DEPRECATION NOTICE] Docker Image Format v1, and Docker Image manifest version 2, schema 1 support will be removed in an upcoming release. Suggest the author of docker.io/stedolan/jq:latest to upgrade the image to the OCI Format, or Docker Image manifest v2, schema 2. More information at https://docs.docker.com/go/deprecated-image-specs/ {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-33217) Flink SQL: UNNEST fails on LEFT JOIN with NOT NULL type in array
Robert Metzger created FLINK-33217: -- Summary: Flink SQL: UNNEST fails on LEFT JOIN with NOT NULL type in array Key: FLINK-33217 URL: https://issues.apache.org/jira/browse/FLINK-33217 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.15.3, 1.18.0, 1.19.0 Reporter: Robert Metzger Steps to reproduce: Take a column of type {code:java} business_data ROW<`id` STRING, `updateEvent` ARRAY<ROW<`name` STRING NOT NULL> NOT NULL>> {code} Take this query {code:java} select id, ue_name from reproduce_unnest LEFT JOIN UNNEST(reproduce_unnest.business_data.updateEvent) AS exploded_ue(ue_name) ON true {code} And get this error {code:java} Caused by: java.lang.AssertionError: Type mismatch: rowtype of rel before registration: RecordType(RecordType:peek_no_expand(VARCHAR(2147483647) CHARACTER SET "UTF-16LE" id, RecordType:peek_no_expand(VARCHAR(2147483647) CHARACTER SET "UTF-16LE" NOT NULL name) NOT NULL ARRAY updateEvent) business_data, VARCHAR(2147483647) CHARACTER SET "UTF-16LE" ue_name) NOT NULL rowtype of rel after registration: RecordType(RecordType:peek_no_expand(VARCHAR(2147483647) CHARACTER SET "UTF-16LE" id, RecordType:peek_no_expand(VARCHAR(2147483647) CHARACTER SET "UTF-16LE" NOT NULL name) NOT NULL ARRAY updateEvent) business_data, VARCHAR(2147483647) CHARACTER SET "UTF-16LE" NOT NULL name) NOT NULL Difference: ue_name: VARCHAR(2147483647) CHARACTER SET "UTF-16LE" -> VARCHAR(2147483647) CHARACTER SET "UTF-16LE" NOT NULL at org.apache.calcite.util.Litmus$1.fail(Litmus.java:32) at org.apache.calcite.plan.RelOptUtil.equal(RelOptUtil.java:2206) at org.apache.calcite.rel.AbstractRelNode.onRegister(AbstractRelNode.java:275) at org.apache.calcite.plan.volcano.VolcanoPlanner.registerImpl(VolcanoPlanner.java:1270) at org.apache.calcite.plan.volcano.VolcanoPlanner.register(VolcanoPlanner.java:598) at org.apache.calcite.plan.volcano.VolcanoPlanner.ensureRegistered(VolcanoPlanner.java:613) at org.apache.calcite.plan.volcano.VolcanoPlanner.changeTraits(VolcanoPlanner.java:498) at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:315) at org.apache.flink.table.planner.plan.optimize.program.FlinkVolcanoProgram.optimize(FlinkVolcanoProgram.scala:62) {code} I have implemented a small test case, which fails against Flink 1.15, 1.18, and the latest master branch. -- This message was sent by Atlassian Jira (v8.20.10#820010)
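For reference, a hedged sketch of what such a reproduction could look like with the Table API; the datagen connector and the simplified SELECT are assumptions to keep it self-contained, and planning the query (EXPLAIN) should already be enough to hit the planner assertion quoted above:
{code:java}
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class UnnestNotNullRepro {

    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // The datagen connector is only used here to obtain a self-contained
        // table with the reported column type; in the original report the
        // table comes from a real catalog/connector.
        tEnv.executeSql(
                "CREATE TABLE reproduce_unnest ("
                        + " business_data ROW<"
                        + "   `id` STRING,"
                        + "   `updateEvent` ARRAY<ROW<`name` STRING NOT NULL> NOT NULL>"
                        + " >"
                        + ") WITH ('connector' = 'datagen')");

        // Planning the query (no execution needed) should already trigger the
        // "Type mismatch" AssertionError in the Calcite planner.
        System.out.println(
                tEnv.explainSql(
                        "SELECT ue_name FROM reproduce_unnest "
                                + "LEFT JOIN UNNEST(reproduce_unnest.business_data.updateEvent) "
                                + "AS exploded_ue(ue_name) ON TRUE"));
    }
}
{code}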
[jira] [Created] (FLINK-32439) Kubernetes operator is silently overwriting the "execution.savepoint.path" config
Robert Metzger created FLINK-32439: -- Summary: Kubernetes operator is silently overwriting the "execution.savepoint.path" config Key: FLINK-32439 URL: https://issues.apache.org/jira/browse/FLINK-32439 Project: Flink Issue Type: Bug Components: Kubernetes Operator Reporter: Robert Metzger I recently stumbled across the fact that the K8s operator silently deletes / overwrites the execution.savepoint.path config option. I understand why this happens, but I wonder whether the operator should write a log message when the user has configured the execution.savepoint.path option, and/or whether the docs should list the "operator managed" config options. https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java#L155-L159 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-31840) NullPointerException in operators.window.slicing.SliceAssigners$AbstractSliceAssigner.assignSliceEnd
Robert Metzger created FLINK-31840: -- Summary: NullPointerException in operators.window.slicing.SliceAssigners$AbstractSliceAssigner.assignSliceEnd Key: FLINK-31840 URL: https://issues.apache.org/jira/browse/FLINK-31840 Project: Flink Issue Type: Bug Reporter: Robert Metzger While running a Flink SQL Query (with a hop window), I got this error. {code} Caused by: org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:99) at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:57) at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29) at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:56) at org.apache.flink.streaming.api.operators.CountingOutput.collect(CountingOutput.java:29) at StreamExecCalc$11.processElement(Unknown Source) at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:82) ... 23 more Caused by: java.lang.NullPointerException at org.apache.flink.table.runtime.operators.window.slicing.SliceAssigners$AbstractSliceAssigner.assignSliceEnd(SliceAssigners.java:558) at org.apache.flink.table.runtime.operators.aggregate.window.LocalSlicingWindowAggOperator.processElement(LocalSlicingWindowAggOperator.java:114) at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:82) ... 29 more {code} It was caused by a timestamp field containing NULL values. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-31834) Azure Warning: no space left on device
Robert Metzger created FLINK-31834: -- Summary: Azure Warning: no space left on device Key: FLINK-31834 URL: https://issues.apache.org/jira/browse/FLINK-31834 Project: Flink Issue Type: Bug Components: Build System / Azure Pipelines Reporter: Robert Metzger In this CI run: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=48213&view=logs&j=af184cdd-c6d8-5084-0b69-7e9c67b35f7a&t=841082b6-1a93-5908-4d37-a071f4387a5f&l=21 There was this warning: {code} Loaded image: confluentinc/cp-kafka:6.2.2 Loaded image: testcontainers/ryuk:0.3.3 ApplyLayer exit status 1 stdout: stderr: write /opt/jdk-15.0.1+9/lib/modules: no space left on device ##[error]Bash exited with code '1'. Finishing: Restore docker images {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-31810) RocksDBException: Bad table magic number on checkpoint rescale
Robert Metzger created FLINK-31810: -- Summary: RocksDBException: Bad table magic number on checkpoint rescale Key: FLINK-31810 URL: https://issues.apache.org/jira/browse/FLINK-31810 Project: Flink Issue Type: Bug Components: Runtime / State Backends Affects Versions: 1.15.2 Reporter: Robert Metzger While rescaling a job from checkpoint, I ran into this exception: {code:java} SinkMaterializer[7] -> rob-result[7]: Writer -> rob-result[7]: Committer (4/4)#3 (c1b348f7eed6e1ce0e41ef75338ae754) switched from INITIALIZING to FAILED with failure cause: java.lang.Exception: Exception while creating StreamOperatorStateContext. at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:255) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:265) at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106) at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:703) at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55) at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:679) at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:646) at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948) at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:917) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563) at java.base/java.lang.Thread.run(Unknown Source) Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for SinkUpsertMaterializer_7d9b7588bc2ff89baed50d7a4558caa4_(4/4) from any of the 1 provided restore options. at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160) at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:346) at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:164) ... 11 more Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception. at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:395) at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:483) at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:97) at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:329) at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168) at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135) ... 13 more Caused by: java.io.IOException: Error while opening RocksDB instance. 
at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:92) at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreDBInstanceFromStateHandle(RocksDBIncrementalRestoreOperation.java:465) at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithRescaling(RocksDBIncrementalRestoreOperation.java:321) at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:164) at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:315) ... 18 more Caused by: org.rocksdb.RocksDBException: Bad table magic number: expected 9863518390377041911, found 4096 in /tmp/job__op_SinkUpsertMaterializer_7d9b7588bc2ff89baed50d7a4558caa4__4_4__uuid_d5587dfc-78b3-427c-8cb6-35507b71bc4b/46475654-5515-430e-b215-389d42cddb97/000232.sst at org.rocksdb.RocksDB.open(Native Method) at org.rocksdb.RocksDB.open(RocksDB.java:306) at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.openDB(RocksDBOperationUtils.java:80) ... 22 more {code} I haven't found any other cases of this issue
[jira] [Created] (FLINK-30083) Bump maven-shade-plugin to 3.4.0
Robert Metzger created FLINK-30083: -- Summary: Bump maven-shade-plugin to 3.4.0 Key: FLINK-30083 URL: https://issues.apache.org/jira/browse/FLINK-30083 Project: Flink Issue Type: Improvement Components: Build System Affects Versions: 1.17.0 Reporter: Robert Metzger Fix For: 1.17.0 FLINK-24273 proposes to relocate the io.fabric8 dependencies of flink-kubernetes. This is not possible because of a problem with the maven shade plugin ("mvn install" doesn't work, it needs to be "mvn clean install"). MSHADE-425 solves this issue, and has been released with maven-shade-plugin 3.4.0. Upgrading the shade plugin will solve the problem, unblocking the K8s relocation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29779) Allow using MiniCluster with a PluginManager to use metrics reporters
Robert Metzger created FLINK-29779: -- Summary: Allow using MiniCluster with a PluginManager to use metrics reporters Key: FLINK-29779 URL: https://issues.apache.org/jira/browse/FLINK-29779 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Affects Versions: 1.16.0 Reporter: Robert Metzger Assignee: Robert Metzger Fix For: 1.16.1 Currently, using the MiniCluster with a metric reporter loaded as a plugin is not supported, because {{ReporterSetup.fromConfiguration(config, null)}} gets passed {{null}} for the PluginManager. I think it is generally valuable to allow passing a PluginManager to the MiniCluster. I'll open a PR for this. -- This message was sent by Atlassian Jira (v8.20.10#820010)
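A sketch of the intended usage: building the PluginManager from the plugins directory is already possible via PluginUtils, and the missing piece is handing it to the MiniCluster. The reporter factory name is a placeholder and the commented-out constructor argument marks the hypothetical API that does not exist yet:
{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.plugin.PluginManager;
import org.apache.flink.core.plugin.PluginUtils;
import org.apache.flink.runtime.minicluster.MiniCluster;
import org.apache.flink.runtime.minicluster.MiniClusterConfiguration;

public class MiniClusterWithPlugins {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Hypothetical reporter shipped as a plugin under plugins/.
        configuration.setString(
                "metrics.reporter.my.factory.class", "org.example.MyReporterFactory");

        // Already possible today: build a PluginManager from the plugins/ folder.
        PluginManager pluginManager =
                PluginUtils.createPluginManagerFromRootFolder(configuration);

        MiniClusterConfiguration miniClusterConfig =
                new MiniClusterConfiguration.Builder()
                        .setConfiguration(configuration)
                        .setNumTaskManagers(1)
                        .setNumSlotsPerTaskManager(1)
                        .build();

        // Hypothetical API shape: a way to forward the PluginManager so that
        // the metric reporter setup can load reporters from plugins.
        try (MiniCluster miniCluster = new MiniCluster(miniClusterConfig /*, pluginManager */)) {
            miniCluster.start();
        }
    }
}
{code}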
[jira] [Created] (FLINK-29492) Kafka exactly-once sink causes OutOfMemoryError
Robert Metzger created FLINK-29492: -- Summary: Kafka exactly-once sink causes OutOfMemoryError Key: FLINK-29492 URL: https://issues.apache.org/jira/browse/FLINK-29492 Project: Flink Issue Type: Bug Components: Connectors / Kafka Affects Versions: 1.15.2 Reporter: Robert Metzger My Kafka exactly-once sinks are periodically failing with a `OutOfMemoryError: Java heap space`. This looks very similar to FLINK-28250. But I am running 1.15.2, which contains a fix for FLINK-28250. Exception: {code:java} java.io.IOException: Could not perform checkpoint 2281 for operator http_events[3]: Writer (1/1)#1. at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1210) at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:147) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint(SingleCheckpointBarrierHandler.java:287) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100(SingleCheckpointBarrierHandler.java:64) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint(SingleCheckpointBarrierHandler.java:493) at org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint(AbstractAlignedBarrierHandlerState.java:74) at org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived(AbstractAlignedBarrierHandlerState.java:66) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2(SingleCheckpointBarrierHandler.java:234) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState(SingleCheckpointBarrierHandler.java:262) at org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier(SingleCheckpointBarrierHandler.java:231) at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:181) at org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:159) at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110) at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65) at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:519) at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:203) at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753) at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948) at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927) at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:563) at java.base/java.lang.Thread.run(Unknown Source) Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 2281 for operator http_events[3]: Writer (1/1)#1. Failure reason: Checkpoint was declined. 
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:269) at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:173) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:348) at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.checkpointStreamOperator(RegularOperatorChain.java:227) at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.buildOperatorSnapshotFutures(RegularOperatorChain.java:212) at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.snapshotState(RegularOperatorChain.java:192) at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:647) at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:320) at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12(StreamTask.java:1253) at org.apache.flink.streaming.runtime.t
[jira] [Created] (FLINK-29212) Properly load Hadoop native libraries in Flink docker images
Robert Metzger created FLINK-29212: -- Summary: Properly load Hadoop native libraries in Flink docker images Key: FLINK-29212 URL: https://issues.apache.org/jira/browse/FLINK-29212 Project: Flink Issue Type: Bug Components: flink-docker Affects Versions: 1.17.0 Reporter: Robert Metzger On startup, Flink logs: {code:java} 2022-09-04 12:36:03.559 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable {code} Hadoop native libraries are used for: - Compression Codecs (bzip2, lz4, zlib) - Native IO utilities for HDFS Short-Circuit Local Reads and Centralized Cache Management in HDFS - CRC32 checksum implementation (Source: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/NativeLibraries.html) Resolving this for the docker images we are providing should be easy, would remove one unnecessary WARNing, and would provide performance benefits for some users. -- This message was sent by Atlassian Jira (v8.20.10#820010)
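A quick way to verify inside the image whether the native library is actually picked up is Hadoop's own NativeCodeLoader, the same class that emits the WARN above (this assumes hadoop-common is on the classpath; the hadoop checknative CLI gives a more detailed report where the full Hadoop distribution is available):
{code:java}
import org.apache.hadoop.util.NativeCodeLoader;

public class NativeHadoopCheck {
    public static void main(String[] args) {
        // True only when libhadoop was found and loaded; the WARN in the
        // Flink startup logs corresponds to this returning false.
        System.out.println("native hadoop loaded: " + NativeCodeLoader.isNativeCodeLoaded());
    }
}
{code}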
[jira] [Created] (FLINK-29122) Improve robustness of FileUtils.expandDirectory()
Robert Metzger created FLINK-29122: -- Summary: Improve robustness of FileUtils.expandDirectory() Key: FLINK-29122 URL: https://issues.apache.org/jira/browse/FLINK-29122 Project: Flink Issue Type: Bug Components: API / Core Affects Versions: 1.16.0, 1.17.0 Reporter: Robert Metzger `FileUtils.expandDirectory()` can potentially write to invalid locations if the zip file is invalid (contains entry names with ../). -- This message was sent by Atlassian Jira (v8.20.10#820010)
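The usual guard against such "zip slip" entries is to normalize each resolved path and require it to stay under the target directory. A minimal, generic sketch of that guard (not the actual FileUtils code):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class SafeUnzip {

    /** Extracts a zip file, rejecting entries that would escape targetDir. */
    public static void expand(Path zip, Path targetDir) throws IOException {
        Path normalizedTarget = targetDir.toAbsolutePath().normalize();
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zipFile.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                Path resolved = normalizedTarget.resolve(entry.getName()).normalize();
                // Guard against entry names containing "../" segments.
                if (!resolved.startsWith(normalizedTarget)) {
                    throw new IOException(
                            "Zip entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(resolved);
                } else {
                    Files.createDirectories(resolved.getParent());
                    Files.copy(zipFile.getInputStream(entry), resolved);
                }
            }
        }
    }
}
{code}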
[jira] [Created] (FLINK-28303) Kafka SQL Connector loses data when restoring from a savepoint with a topic with empty partitions
Robert Metzger created FLINK-28303: -- Summary: Kafka SQL Connector loses data when restoring from a savepoint with a topic with empty partitions Key: FLINK-28303 URL: https://issues.apache.org/jira/browse/FLINK-28303 Project: Flink Issue Type: Bug Components: Connectors / Kafka Affects Versions: 1.14.4 Reporter: Robert Metzger Steps to reproduce: - Set up a Kafka topic with 10 partitions - produce records 0-9 into the topic - take a savepoint and stop the job - produce records 10-19 into the topic - restore the job from the savepoint. The job will usually be missing 2-4 of the records 10-19. My assumption is that if a partition never had data (which is likely with 10 partitions and 10 records), the savepoint only contains offsets for the partitions that had data. While the job was offline (and records 10-19 were written into the topic), all partitions got filled. When the job comes back online, the connector uses the "latest" offset for the partitions without stored offsets, skipping some data. -- This message was sent by Atlassian Jira (v8.20.10#820010)
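For comparison, the DataStream KafkaSource exposes the relevant fallback explicitly through OffsetsInitializer; a hedged sketch (bootstrap servers, topic, and group id are placeholders) of requesting an EARLIEST fallback for partitions that have no stored offset, which is the kind of knob that would avoid the data loss described above:
{code:java}
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

public class KafkaOffsetsExample {

    public static KafkaSource<String> buildSource() {
        // Use committed offsets where they exist, but fall back to EARLIEST
        // for partitions without any stored offset, instead of LATEST.
        return KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092") // placeholder
                .setTopics("my-topic")                 // placeholder
                .setGroupId("my-group")                // placeholder
                .setStartingOffsets(
                        OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
    }
}
{code}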
[jira] [Created] (FLINK-28265) Inconsistency in Kubernetes HA service: broken state handle
Robert Metzger created FLINK-28265: -- Summary: Inconsistency in Kubernetes HA service: broken state handle Key: FLINK-28265 URL: https://issues.apache.org/jira/browse/FLINK-28265 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.14.4 Reporter: Robert Metzger I have a JobManager, which at some point failed to acknowledge a checkpoint: {code} Error while processing AcknowledgeCheckpoint message org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the pending checkpoint 193393. Failure reason: Failure to finalize checkpoint. at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1255) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1100) at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) Caused by: org.apache.flink.runtime.persistence.StateHandleStore$AlreadyExistException: checkpointID-0193393 already exists in ConfigMap cm--jobmanager-leader at org.apache.flink.kubernetes.highavailability.KubernetesStateHandleStore.getKeyAlreadyExistException(KubernetesStateHandleStore.java:534) at org.apache.flink.kubernetes.highavailability.KubernetesStateHandleStore.lambda$addAndLock$0(KubernetesStateHandleStore.java:155) at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$attemptCheckAndUpdateConfigMap$11(Fabric8FlinkKubeClient.java:316) at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) ... 3 common frames omitted {code} the JobManager creates subsequent checkpoints successfully. Upon failure, it tries to recover this checkpoint (0193393), but fails to do so because of: {code} Caused by: org.apache.flink.util.FlinkException: Could not retrieve checkpoint 193393 from state handle under checkpointID-0193393. This indicates that the retrieved state handle is broken. Try cleaning the state handle store ... Caused by: java.io.FileNotFoundException: No such file or directory: s3://xxx/flink-ha/xxx/completedCheckpoint72e30229420c {code} I'm running Flink 1.14.4. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-28260) flink-runtime-web fails to execute "npm ci" on Apple Silicon (arm64) without rosetta
Robert Metzger created FLINK-28260: -- Summary: flink-runtime-web fails to execute "npm ci" on Apple Silicon (arm64) without rosetta Key: FLINK-28260 URL: https://issues.apache.org/jira/browse/FLINK-28260 Project: Flink Issue Type: Bug Reporter: Robert Metzger Flink 1.16-SNAPSHOT fails to build in the flink-runtime-web project because we are using an outdated frontend-maven-plugin (v 1.11.3). This is the error: {code} [ERROR] Failed to execute goal com.github.eirslett:frontend-maven-plugin:1.11.3:npm (npm install) on project flink-runtime-web: Failed to run task: 'npm ci --cache-max=0 --no-save ${npm.proxy}' failed. java.io.IOException: Cannot run program "/Users/rmetzger/Projects/flink/flink-runtime-web/web-dashboard/node/node" (in directory "/Users/rmetzger/Projects/flink/flink-runtime-web/web-dashboard"): error=86, Bad CPU type in executable -> [Help 1] {code} Using the latest frontend-maven-plugin (v. 1.12.1) makes the build pass, because this version downloads the proper arm64 npm version. However, frontend-maven-plugin 1.12.1 requires Maven 3.6.0, which is higher than Flink supports (Flink's build still has to work with Maven 3.2.5). The best workaround is using Rosetta on M1 Macs. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-28259) flink-parquet doesn't compile on M1 mac without rosetta
Robert Metzger created FLINK-28259: -- Summary: flink-parquet doesn't compile on M1 mac without rosetta Key: FLINK-28259 URL: https://issues.apache.org/jira/browse/FLINK-28259 Project: Flink Issue Type: Bug Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) Affects Versions: 1.16.0 Reporter: Robert Metzger Assignee: Robert Metzger Compiling Flink 1.16-SNAPSHOT fails on an M1 Mac (apple silicon) without the rosetta translation layer, because the automatically downloaded "protoc-3.17.3-osx-aarch_64.exe" file is actually just a copy of "protoc-3.17.3-osx-x86_64.exe". (as you can read here: https://github.com/os72/protoc-jar/issues/93) This is the error: {code} [ERROR] Failed to execute goal org.xolstice.maven.plugins:protobuf-maven-plugin:0.5.1:test-compile (default) on project flink-parquet: An error occurred while invoking protoc. Error while executing process. Cannot run program "/Users/rmetzger/Projects/flink/flink-formats/flink-parquet/target/protoc-plugins/protoc-3.17.3-osx-aarch_64.exe": error=86, Bad CPU type in executable -> [Help 1] {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-28232) Allow for custom pre-flight checks for SQL UDFs
Robert Metzger created FLINK-28232: -- Summary: Allow for custom pre-flight checks for SQL UDFs Key: FLINK-28232 URL: https://issues.apache.org/jira/browse/FLINK-28232 Project: Flink Issue Type: New Feature Components: Table SQL / API Reporter: Robert Metzger Currently, implementors of SQL UDFs [1] cannot validate the UDF input before the SQL query is submitted to the runtime. Take for example a UDF that compiles a regex from user input. Ideally there would be a callback for the UDF implementor to check that the user-provided regex is valid and compiles, to avoid errors during the execution of the SQL query. It would also be ideal to have access to the schema information resolved by the SQL planner in that pre-flight validation, to allow for schema-related checks as well. [1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/functions/udfs/ -- This message was sent by Atlassian Jira (v8.20.7#820007)
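A hedged sketch of the gap, using a hypothetical RXMATCH(input, regex) scalar function: today the regex is only compiled inside eval() on the TaskManagers, so an invalid pattern fails the running job instead of being rejected during planning, which is what the proposed pre-flight callback would enable:
{code:java}
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

import org.apache.flink.table.functions.ScalarFunction;

/** Hypothetical UDF: RXMATCH(input, regex), for illustration only. */
public class RegexMatchFunction extends ScalarFunction {

    private transient Pattern cached;
    private transient String cachedRegex;

    public Boolean eval(String input, String regex) {
        // The regex usually arrives as a constant in the SQL query, but it is
        // only compiled here, at execution time. With a pre-flight validation
        // callback, a constant argument like this could be checked while the
        // query is planned.
        if (cached == null || !regex.equals(cachedRegex)) {
            try {
                cached = Pattern.compile(regex);
                cachedRegex = regex;
            } catch (PatternSyntaxException e) {
                throw new IllegalArgumentException("Invalid regex: " + regex, e);
            }
        }
        return input != null && cached.matcher(input).matches();
    }
}
{code}
With such a callback, a query like SELECT RXMATCH(payload, '(unbalanced') could be rejected at planning time instead of failing after deployment.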
[jira] [Created] (FLINK-25998) Flink akka runs into NoClassDefFoundError on shutdown
Robert Metzger created FLINK-25998: -- Summary: Flink akka runs into NoClassDefFoundError on shutdown Key: FLINK-25998 URL: https://issues.apache.org/jira/browse/FLINK-25998 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.15.0 Reporter: Robert Metzger When trying to start a standalone jobmanager on an unavailable port, I see the following unexpected exception: {code} 2022-02-08 08:07:18,299 INFO akka.remote.Remoting [] - Starting remoting 2022-02-08 08:07:18,357 ERROR akka.remote.transport.netty.NettyTransport [] - failed to bind to /0.0.0.0:6123, shutting down Netty transport 2022-02-08 08:07:18,373 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Shutting StandaloneApplicationClusterEntryPoint down with application status FAILED. Diagnostics java.net.BindException: Could not start actor system on any port in port range 6123 at org.apache.flink.runtime.rpc.akka.AkkaBootstrapTools.startRemoteActorSystem(AkkaBootstrapTools.java:133) at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:358) at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:327) at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:247) at org.apache.flink.runtime.rpc.RpcUtils.createRemoteRpcService(RpcUtils.java:191) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:334) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:253) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:203) at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:200) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:684) at org.apache.flink.container.entrypoint.StandaloneApplicationClusterEntryPoint.main(StandaloneApplicationClusterEntryPoint.java:82) . 2022-02-08 08:07:18,377 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator[] - Shutting down remote daemon. 2022-02-08 08:07:18,377 ERROR org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: Thread 'flink-akka.remote.default-remote-dispatcher-6' produced an uncaught exception. Stopping the process... 
java.lang.NoClassDefFoundError: akka/actor/dungeon/FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1 at akka.actor.dungeon.FaultHandling.handleNonFatalOrInterruptedException(FaultHandling.scala:334) ~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT] at akka.actor.dungeon.FaultHandling.handleNonFatalOrInterruptedException$(FaultHandling.scala:334) ~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT] at akka.actor.ActorCell.handleNonFatalOrInterruptedException(ActorCell.scala:411) ~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT] at akka.actor.ActorCell.invoke(ActorCell.scala:551) ~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) ~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT] at akka.dispatch.Mailbox.run(Mailbox.scala:231) ~[flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT] at akka.dispatch.Mailbox.exec(Mailbox.scala:243) [flink-rpc-akka_ce724655-52fe-4b3a-8cdc-b79ab446e34d.jar:1.15-SNAPSHOT] at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_312] at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_312] at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_312] at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) [?:1.8.0_312] Caused by: java.lang.ClassNotFoundException: akka.actor.dungeon.FaultHandling$$anonfun$handleNonFatalOrInterruptedException$1 at java.net.URLClassLoader.findClass(URLClassLoader.java:387) ~[?:1.8.0_312] at java.lang.ClassLoader.loadClass(ClassLoader.java:419) ~[?:1.8.0_312] at org.apache.flink.core.classloading.ComponentClassLoader.loadClassFromComponentOnly(ComponentClassLoader.java:149) ~[flink-dist-1.15-SNAPSHOT.jar:1.15-SNAPSHOT] at org.apache.flink.core.classloading.ComponentClassLoader.loadClass(ComponentClassLoader
[jira] [Created] (FLINK-25679) Build arm64 Linux images for Apache Flink
Robert Metzger created FLINK-25679: -- Summary: Build arm64 Linux images for Apache Flink Key: FLINK-25679 URL: https://issues.apache.org/jira/browse/FLINK-25679 Project: Flink Issue Type: Improvement Components: flink-docker Affects Versions: 1.15.0 Reporter: Robert Metzger Building Flink images for arm64 Linux should be trivial to support, since the upstream docker images support arm64, as does frocksdb. Building the images locally is also easily possible using Docker's buildx features, and the build system of the official docker images most likely supports the ARM architecture. This improvement would allow us to support development / testing on Apple M1-based systems, as well as ARM architectures at various cloud providers (e.g. AWS Graviton). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25505) Fix NetworkBufferPoolTest, SystemResourcesCounterTest on Apple M1
Robert Metzger created FLINK-25505: -- Summary: Fix NetworkBufferPoolTest, SystemResourcesCounterTest on Apple M1 Key: FLINK-25505 URL: https://issues.apache.org/jira/browse/FLINK-25505 Project: Flink Issue Type: Bug Components: Runtime / Metrics, Runtime / Network Affects Versions: 1.15.0 Reporter: Robert Metzger As discussed in https://issues.apache.org/jira/browse/FLINK-23230, some tests in flink-runtime are not passing on M1 / Apple Silicon Macs. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-25327) ApplicationMode "DELETE /cluster" REST call leads to exit code 2, instead of 0
Robert Metzger created FLINK-25327: -- Summary: ApplicationMode "DELETE /cluster" REST call leads to exit code 2, instead of 0 Key: FLINK-25327 URL: https://issues.apache.org/jira/browse/FLINK-25327 Project: Flink Issue Type: Bug Components: Runtime / Coordination Reporter: Robert Metzger FLINK-24113 introduced a mode to keep the Application Mode JobManager running after the Job has been cancelled. Cluster shutdown needs to be initiated for example using the DELETE /cluster REST endpoint. The problem is that there can be a fatal error during the shutdown, making the JobManager exit with return code != 0 (making resource managers believe there was an error with the Flink application) Error {code} 2021-12-15 08:09:55,708 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Fatal error occurred in the cluster entrypoint. org.apache.flink.util.FlinkException: Application failed unexpectedly. at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$finishBootstrapTasks$1(ApplicationDispatcherBootstrap.java:177) ~[flink-dist-1.15-master-robert.jar:1.15-master-robert] at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_312] at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_312] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_312] at java.util.concurrent.CompletableFuture.cancel(CompletableFuture.java:2278) ~[?:1.8.0_312] at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.stop(ApplicationDispatcherBootstrap.java:125) ~[flink-dist-1.15-master-robert.jar:1.15-master-robert] at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onStop$0(Dispatcher.java:284) ~[flink-dist-1.15-master-robert.jar:1.15-master-robert] at org.apache.flink.util.concurrent.FutureUtils.lambda$runAfterwardsAsync$18(FutureUtils.java:696) ~[flink-dist-1.15-master-robert.jar:1.15-master-robert] at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_312] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_312] at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) ~[?:1.8.0_312] at org.apache.flink.util.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) ~[flink-dist-1.15-master-robert.jar:1.15-master-robert] at java.util.concurrent.CompletableFuture$UniCompletion.claim(CompletableFuture.java:543) ~[?:1.8.0_312] at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:765) ~[?:1.8.0_312] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_312] at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:795) ~[?:1.8.0_312] at java.util.concurrent.CompletableFuture.whenCompleteAsync(CompletableFuture.java:2163) ~[?:1.8.0_312] at org.apache.flink.util.concurrent.FutureUtils.runAfterwardsAsync(FutureUtils.java:693) ~[flink-dist-1.15-master-robert.jar:1.15-master-robert] at org.apache.flink.util.concurrent.FutureUtils.runAfterwards(FutureUtils.java:660) ~[flink-dist-1.15-master-robert.jar:1.15-master-robert] at org.apache.flink.runtime.dispatcher.Dispatcher.onStop(Dispatcher.java:281) ~[flink-dist-1.15-master-robert.jar:1.15-master-robert] at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214) ~[flink-dist-1.15-master-robert.jar:1.15-master-robert] 
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.lambda$terminate$0(AkkaRpcActor.java:580) ~[flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert] at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) ~[flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:579) ~[flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:191) ~[flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert] at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) [flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert] at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) [flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-master-robert] at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-rpc-akka_44e0316d-9cf7-4fc8-9b48-4f6084b0cc47.jar:1.15-
[jira] [Created] (FLINK-25316) BlobServer can get stuck during shutdown
Robert Metzger created FLINK-25316: -- Summary: BlobServer can get stuck during shutdown Key: FLINK-25316 URL: https://issues.apache.org/jira/browse/FLINK-25316 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.15.0 Reporter: Robert Metzger Fix For: 1.15.0 The cluster shutdown can get stuck {code} "AkkaRpcService-Supervisor-Termination-Future-Executor-thread-1" #89 daemon prio=5 os_prio=0 tid=0x004017d7 nid=0x2ec in Object.wait() [0x00402a9b5000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0xd6c48368> (a org.apache.flink.runtime.blob.BlobServer) at java.lang.Thread.join(Thread.java:1252) - locked <0xd6c48368> (a org.apache.flink.runtime.blob.BlobServer) at java.lang.Thread.join(Thread.java:1326) at org.apache.flink.runtime.blob.BlobServer.close(BlobServer.java:319) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.stopClusterServices(ClusterEntrypoint.java:406) - locked <0xd5d27350> (a java.lang.Object) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$shutDownAsync$4(ClusterEntrypoint.java:505 {code} because the BlobServer.run() method ignores interrupts: {code} "BLOB Server listener at 6124" #30 daemon prio=5 os_prio=0 tid=0x00401c929800 nid=0x2b4 runnable [0x0040263f9000] java.lang.Thread.State: RUNNABLE at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409) at java.net.ServerSocket.implAccept(ServerSocket.java:560) at java.net.ServerSocket.accept(ServerSocket.java:528) at org.apache.flink.util.NetUtils.acceptWithoutTimeout(NetUtils.java:143) at org.apache.flink.runtime.blob.BlobServer.run(BlobServer.java:268) {code} This issue was introduced in FLINK-24156. -- This message was sent by Atlassian Jira (v8.20.1#820001)
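The underlying mechanism: a thread blocked in ServerSocket.accept() does not react to Thread.interrupt(), so joining it hangs; the reliable way to unblock such an accept loop during shutdown is to close the server socket. A minimal, generic sketch of that behavior (not the BlobServer code):
{code:java}
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;

public class InterruptibleAcceptLoop {

    public static void main(String[] args) throws Exception {
        ServerSocket serverSocket = new ServerSocket(0);

        Thread acceptor = new Thread(() -> {
            try {
                while (true) {
                    // accept() does NOT react to Thread.interrupt(); the
                    // thread stays blocked here even after being interrupted.
                    Socket socket = serverSocket.accept();
                    socket.close();
                }
            } catch (SocketException e) {
                // Thrown once the server socket is closed -- the reliable way
                // to unblock the accept loop during shutdown.
            } catch (IOException e) {
                // ignored for the sketch
            }
        }, "accept-loop");
        acceptor.start();

        acceptor.interrupt();   // alone, this does not stop the loop
        serverSocket.close();   // this does
        acceptor.join(10_000);
        System.out.println("acceptor still alive: " + acceptor.isAlive());
    }
}
{code}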
[jira] [Created] (FLINK-24395) Checkpoint trigger time difference between log statement and web frontend
Robert Metzger created FLINK-24395: -- Summary: Checkpoint trigger time difference between log statement and web frontend Key: FLINK-24395 URL: https://issues.apache.org/jira/browse/FLINK-24395 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing Affects Versions: 1.14.0 Reporter: Robert Metzger Attachments: image-2021-09-28-12-20-34-332.png Consider this checkpoint (68) {code} 2021-09-28 10:14:43,644 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Triggering checkpoint 68 (type=CHECKPOINT) @ 1632823660151 for job . 2021-09-28 10:16:41,428 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Completed checkpoint 68 for job (128940015376 bytes, checkpointDuration=540908 ms, finalizationTime=369 ms). {code} And what is shown in the UI about it: !image-2021-09-28-12-20-34-332.png! The trigger time is off by ~7 minutes. It seems that the trigger message is logged too late. (note that this has happened in a system where savepoint disposal is very slow) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24392) Upgrade presto s3 fs implementation to Trino >= 348
Robert Metzger created FLINK-24392: -- Summary: Upgrade presto s3 fs implementation to Trino >= 348 Key: FLINK-24392 URL: https://issues.apache.org/jira/browse/FLINK-24392 Project: Flink Issue Type: Improvement Components: FileSystems Affects Versions: 1.14.0 Reporter: Robert Metzger Fix For: 1.15.0 The Presto s3 filesystem implementation currently shipped with Flink doesn't support streaming uploads. All data needs to be materialized to a single file on disk before it can be uploaded. This can lead to situations where TaskManagers run out of disk space when creating a savepoint. The Hadoop filesystem implementation supports streaming uploads (by buffering smaller (say 100mb) parts locally and uploading them via multipart uploads), but it makes more API calls, leading to other issues. Trino 348 supports streaming uploads. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24320) Show in the Job / Checkpoints / Configuration if checkpoints are incremental
Robert Metzger created FLINK-24320: -- Summary: Show in the Job / Checkpoints / Configuration if checkpoints are incremental Key: FLINK-24320 URL: https://issues.apache.org/jira/browse/FLINK-24320 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing, Runtime / Web Frontend Affects Versions: 1.13.2 Reporter: Robert Metzger Attachments: image-2021-09-17-13-31-02-148.png, image-2021-09-17-13-31-32-311.png It would be nice if the overview also showed whether incremental checkpoints are enabled. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24208) Allow idempotent savepoint triggering
Robert Metzger created FLINK-24208: -- Summary: Allow idempotent savepoint triggering Key: FLINK-24208 URL: https://issues.apache.org/jira/browse/FLINK-24208 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing Reporter: Robert Metzger As a user of Flink, I want to be able to trigger a savepoint from an external system in a way that lets me detect whether I have already requested this savepoint. By passing a custom ID with the savepoint request, I can check (after an error of the original request or of the external system) whether the request has already been made. -- This message was sent by Atlassian Jira (v8.3.4#803005)
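A sketch of the client-side interaction this proposal would enable, against the existing POST /jobs/:jobid/savepoints endpoint. The triggerId field in the request body is the proposed addition and is hypothetical here; job id, host, and target directory are placeholders:
{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IdempotentSavepointTrigger {

    public static void main(String[] args) throws Exception {
        // Placeholder job id and REST address.
        String jobId = "00000000000000000000000000000000";
        // "triggerId" is the proposed caller-supplied ID; retrying the same
        // request after a failure could then be detected as a duplicate.
        String body = "{\"target-directory\": \"s3://bucket/savepoints\","
                + " \"triggerId\": \"my-external-request-42\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/jobs/" + jobId + "/savepoints"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + ": " + response.body());
    }
}
{code}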
[jira] [Created] (FLINK-24114) Make CompletedOperationCache.COMPLETED_OPERATION_RESULT_CACHE_DURATION_SECONDS configurable (at least for savepoint trigger operations)
Robert Metzger created FLINK-24114: -- Summary: Make CompletedOperationCache.COMPLETED_OPERATION_RESULT_CACHE_DURATION_SECONDS configurable (at least for savepoint trigger operations) Key: FLINK-24114 URL: https://issues.apache.org/jira/browse/FLINK-24114 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Affects Versions: 1.15.0 Reporter: Robert Metzger Currently, external services that trigger savepoints may fail to persist the savepoint location returned by the savepoint handler, because the operation cache evicts entries (which have been accessed at least once) after a hardcoded 5 minutes. To avoid scenarios where the savepoint location has been accessed but the external system failed to persist it, I propose making this eviction timeout configurable (so that I as a user can, for example, configure a cache eviction of 24 hours). (This is related to FLINK-24113) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24113) Introduce option in Application Mode to request cluster shutdown
Robert Metzger created FLINK-24113: -- Summary: Introduce option in Application Mode to request cluster shutdown Key: FLINK-24113 URL: https://issues.apache.org/jira/browse/FLINK-24113 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Affects Versions: 1.15.0 Reporter: Robert Metzger Currently a Flink JobManager started in Application Mode will shut down once the job has completed. When doing a "stop with savepoint" operation, we want to keep the JobManager alive after the job has stopped to retrieve and persist the final savepoint location. Currently, Flink waits up to 5 minutes and then shuts down. I'm proposing to introduce a new configuration flag "application mode shutdown behavior": "keepalive" (naming things is hard ;) ) which keeps the Application Mode JobManager running until a REST call confirms that it can shut down. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-24037) Allow wildcards in ENABLE_BUILT_IN_PLUGINS
Robert Metzger created FLINK-24037: -- Summary: Allow wildcards in ENABLE_BUILT_IN_PLUGINS Key: FLINK-24037 URL: https://issues.apache.org/jira/browse/FLINK-24037 Project: Flink Issue Type: Improvement Components: flink-docker Reporter: Robert Metzger As a user of Flink, I would like to be able to specify a certain default plugin (such as the S3 presto FS) without having to specify the Flink version again. The Flink version is already determined by the Docker image I'm using. When using generic deployment scripts, I don't want to put the Flink version in two locations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23925) HistoryServer: Archiving job with more than one attempt fails
Robert Metzger created FLINK-23925: -- Summary: HistoryServer: Archiving job with more than one attempt fails Key: FLINK-23925 URL: https://issues.apache.org/jira/browse/FLINK-23925 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.13.2 Reporter: Robert Metzger Error: {code} 2021-08-23 16:26:01,953 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager 0...@akka.tcp://flink@localhost:6123/user/rpc/jobmanager_2 for job ca9f6a073d311d60f457a1c4243e7dc3 from the resource manager. 2021-08-23 16:26:02,137 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Could not archive completed job CarTopSpeedWindowingExample(ca9f6a073d311d60f457a1c4243e7dc3) to the history server. java.util.concurrent.CompletionException: java.lang.IllegalArgumentException: attempt does not exist at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) [?:1.8.0_252] at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1643) [?:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252] Caused by: java.lang.IllegalArgumentException: attempt does not exist at org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:109) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.executiongraph.ArchivedExecutionVertex.getPriorExecutionAttempt(ArchivedExecutionVertex.java:31) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.rest.handler.job.SubtaskExecutionAttemptDetailsHandler.archiveJsonWithPath(SubtaskExecutionAttemptDetailsHandler.java:140) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.webmonitor.history.OnlyExecutionGraphJsonArchivist.archiveJsonWithPath(OnlyExecutionGraphJsonArchivist.java:51) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.archiveJsonWithPath(WebMonitorEndpoint.java:1031) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.runtime.dispatcher.JsonResponseHistoryServerArchivist.lambda$archiveExecutionGraph$0(JsonResponseHistoryServerArchivist.java:61) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640) ~[?:1.8.0_252] ... 3 more {code} Steps to reproduce: - start a Flink reactive mode job manager: mkdir usrlib cp ./examples/streaming/TopSpeedWindowing.jar usrlib/ # Submit Job in Reactive Mode ./bin/standalone-job.sh start -Dscheduler-mode=reactive -Dexecution.checkpointing.interval="10s" -j org.apache.flink.streaming.examples.windowing.TopSpeedWindowing # Start first TaskManager ./bin/taskmanager.sh start - Add another taskmanager to trigger a restart - Cancel the job See the failure in the jobmanager logs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23913) UnalignedCheckpointITCase fails with exit code 137 (kernel oom) on Azure VMs
Robert Metzger created FLINK-23913: -- Summary: UnalignedCheckpointITCase fails with exit code 137 (kernel oom) on Azure VMs Key: FLINK-23913 URL: https://issues.apache.org/jira/browse/FLINK-23913 Project: Flink Issue Type: Bug Components: Runtime / Network Affects Versions: 1.14.0 Environment: UnalignedCheckpointITCase Reporter: Robert Metzger Fix For: 1.14.0 Cases reported in FLINK-23525: - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22618&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=10338 - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22618&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=4743 - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22605&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=4743 - ... there are a lot more cases. The problem seems to have started occurring around August 20. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23589) Support Avro Microsecond precision
Robert Metzger created FLINK-23589: -- Summary: Support Avro Microsecond precision Key: FLINK-23589 URL: https://issues.apache.org/jira/browse/FLINK-23589 Project: Flink Issue Type: Improvement Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile) Reporter: Robert Metzger Fix For: 1.14.0 This was raised by a user: https://lists.apache.org/thread.html/r463f748358202d207e4bf9c7fdcb77e609f35bbd670dbc5278dd7615%40%3Cuser.flink.apache.org%3E Here's the Avro spec: https://avro.apache.org/docs/1.8.0/spec.html#Timestamp+%28microsecond+precision%29 -- This message was sent by Atlassian Jira (v8.3.4#803005)
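For reference, a small sketch of what the Avro side looks like: microsecond timestamps are modeled as a long carrying the timestamp-micros logical type, which Flink's Avro format would need to map to a microsecond-precision TIMESTAMP. Record and field names are illustrative:
{code:java}
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class TimestampMicrosSchema {

    public static void main(String[] args) {
        // Avro models microsecond timestamps as a long with the
        // "timestamp-micros" logical type attached.
        Schema micros =
                LogicalTypes.timestampMicros().addToSchema(Schema.create(Schema.Type.LONG));

        Schema record = SchemaBuilder.record("Event")
                .fields()
                .name("ts_micros").type(micros).noDefault()
                .endRecord();

        System.out.println(record.toString(true));
    }
}
{code}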
[jira] [Created] (FLINK-23562) Update CI docker image to latest java version (1.8.0_292)
Robert Metzger created FLINK-23562: -- Summary: Update CI docker image to latest java version (1.8.0_292) Key: FLINK-23562 URL: https://issues.apache.org/jira/browse/FLINK-23562 Project: Flink Issue Type: Technical Debt Components: Build System / Azure Pipelines Reporter: Robert Metzger Fix For: 1.14.0 The java version we are using on our CI is outdated (1.8.0_282 vs 1.8.0_292). The latest java version has TLSv1 disabled, which makes the KubernetesClusterDescriptorTest fail. This will be fixed by FLINK-22802. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23546) stop-cluster.sh produces warning on macOS 11.4
Robert Metzger created FLINK-23546: -- Summary: stop-cluster.sh produces warning on macOS 11.4 Key: FLINK-23546 URL: https://issues.apache.org/jira/browse/FLINK-23546 Project: Flink Issue Type: Bug Components: Deployment / Scripts Affects Versions: 1.14.0 Reporter: Robert Metzger Since FLINK-17470, we are stopping daemons with a timeout, to SIGKILL them if they are not gracefully stopping. I noticed that this mechanism causes warnings on macOS: {code} ❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/start-cluster.sh Starting cluster. Starting standalonesession daemon on host MacBook-Pro-2.localdomain. Starting taskexecutor daemon on host MacBook-Pro-2.localdomain. ❰robert❙/tmp/flink-1.14-SNAPSHOT❱✔≻ ./bin/stop-cluster.sh Stopping taskexecutor daemon (pid: 50044) on host MacBook-Pro-2.localdomain. tail: illegal option -- - usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...] Stopping standalonesession daemon (pid: 49812) on host MacBook-Pro-2.localdomain. tail: illegal option -- - usage: tail [-F | -f | -r] [-q] [-b # | -c # | -n #] [file ...] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23191) Azure: Upload CI logs to S3 as well
Robert Metzger created FLINK-23191: -- Summary: Azure: Upload CI logs to S3 as well Key: FLINK-23191 URL: https://issues.apache.org/jira/browse/FLINK-23191 Project: Flink Issue Type: Improvement Components: Build System / Azure Pipelines Reporter: Robert Metzger We are currently uploading the CI logs to Azure as artifacts. The maximum retention (we've also configured) is 60 days. Afterwards, the logs are gone. For rarely occurring test failures, the logs might be lost at the time we start looking into them. Therefore, we should store the CI logs somewhere permanently, such as S3 (similarly to how we stored them when we were using travis) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-23129) When cancelling any running job of multiple jobs in an application cluster, JobManager shuts down
Robert Metzger created FLINK-23129: -- Summary: When cancelling any running job of multiple jobs in an application cluster, JobManager shuts down Key: FLINK-23129 URL: https://issues.apache.org/jira/browse/FLINK-23129 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.14.0 Reporter: Robert Metzger I have a jar with two jobs, both executeAsync() from the same main method. I execute the main method in an Application Mode cluster. When I cancel one of the two jobs, both jobs will stop executing. I would expect that the JobManager shuts down once all jobs submitted from an application are finished. If this is a known limitation, we should document it. {code} 2021-06-23 21:29:53,123 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job first job (18181be02da272387354d093519b2359) switched from state RUNNING to CANCELLING. 2021-06-23 21:29:53,124 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Sink: Unnamed (1/1) (5a69b1c19f8da23975f6961898ab50a2) switched from RUNNING to CANCELING. 2021-06-23 21:29:53,141 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Sink: Unnamed (1/1) (5a69b1c19f8da23975f6961898ab50a2) switched from CANCELING to CANCELED. 2021-06-23 21:29:53,144 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Clearing resource requirements of job 18181be02da272387354d093519b2359 2021-06-23 21:29:53,145 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job first job (18181be02da272387354d093519b2359) switched from state CANCELLING to CANCELED. 2021-06-23 21:29:53,145 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Stopping checkpoint coordinator for job 18181be02da272387354d093519b2359. 2021-06-23 21:29:53,147 INFO org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - Shutting down 2021-06-23 21:29:53,150 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 18181be02da272387354d093519b2359 reached terminal state CANCELED. 2021-06-23 21:29:53,152 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job first job(18181be02da272387354d093519b2359). 2021-06-23 21:29:53,155 INFO org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool [] - Releasing slot [c35b64879d6b02d383c825ea735ebba0]. 2021-06-23 21:29:53,159 INFO org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Clearing resource requirements of job 18181be02da272387354d093519b2359 2021-06-23 21:29:53,159 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection 281b3fcf7ad0a6f7763fa90b8a5b9adb: Stopping JobMaster for job first job(18181be02da272387354d093519b2359).. 2021-06-23 21:29:53,160 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager 0...@akka.tcp://flink@localhost:6123/user/rpc/jobmanager_2 for job 18181be02da272387354d093519b2359 from the resource manager. 
2021-06-23 21:29:53,225 INFO org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application CANCELED: java.util.concurrent.CompletionException: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: CANCELED at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$4(ApplicationDispatcherBootstrap.java:304) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) ~[?:1.8.0_252] at org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$null$2(JobStatusPollingUtils.java:101) ~[flink-dist_2.11-1.14-SNAPSHOT.jar:1.14-SNAPSHOT] at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) ~[?:1.8.0_252] at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHan
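For illustration, a minimal sketch of the setup described in FLINK-23129 above (not the reporter's original job; the class name is made up, and a very long number sequence stands in for the custom source seen in the log): two jobs submitted via executeAsync() from one main method running in Application Mode.
{code}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TwoJobsApplication {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // First job: a (practically) long-running source printing to stdout.
        env.fromSequence(0, Long.MAX_VALUE).print();
        env.executeAsync("first job");   // returns immediately; main() continues

        // Second job, submitted from the same main() method.
        env.fromSequence(0, Long.MAX_VALUE).print();
        env.executeAsync("second job");

        // Expectation: cancelling one job should not stop the other; the cluster
        // should only shut down once all jobs of the application have finished.
        // Observed: cancelling either job shuts down the JobManager.
    }
}
{code}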
[jira] [Created] (FLINK-23000) Allow downloading all Flink logs from the UI
Robert Metzger created FLINK-23000: -- Summary: Allow downloading all Flink logs from the UI Key: FLINK-23000 URL: https://issues.apache.org/jira/browse/FLINK-23000 Project: Flink Issue Type: New Feature Components: Runtime / Web Frontend Reporter: Robert Metzger Assignee: Robert Metzger Fix For: 1.14.0 Users often struggle to provide all relevant logs. A button in the Web UI that collects the logs of all TaskManagers and the JobManager would make things easier for the community when supporting users. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22923) Queryable state (rocksdb) with TM restart end-to-end test unstable
Robert Metzger created FLINK-22923: -- Summary: Queryable state (rocksdb) with TM restart end-to-end test unstable Key: FLINK-22923 URL: https://issues.apache.org/jira/browse/FLINK-22923 Project: Flink Issue Type: Bug Components: Runtime / Queryable State Affects Versions: 1.14.0 Reporter: Robert Metzger Fix For: 1.14.0 https://dev.azure.com/rmetzger/Flink/_build/results?buildId=9119&view=logs&j=9401bf33-03c4-5a24-83fe-e51d75db73ef&t=72901ab2-7cd0-57be-82b1-bca51de20fba {code} Jun 04 19:39:12 16/17 completed checkpoints Jun 04 19:39:14 16/17 completed checkpoints SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. Jun 04 19:39:17 after: 20 Jun 04 19:39:17 An error occurred Jun 04 19:39:17 [FAIL] Test script contains errors. Jun 04 19:39:17 Checking of logs skipped. Jun 04 19:39:17 Jun 04 19:39:17 [FAIL] 'Queryable state (rocksdb) with TM restart end-to-end test' failed after 0 minutes and 48 seconds! Test exited with exit code 1 Jun 04 19:39:17 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22856) Move our Azure pipelines away from Ubuntu 16.04 by September
Robert Metzger created FLINK-22856: -- Summary: Move our Azure pipelines away from Ubuntu 16.04 by September Key: FLINK-22856 URL: https://issues.apache.org/jira/browse/FLINK-22856 Project: Flink Issue Type: Bug Components: Build System / Azure Pipelines Reporter: Robert Metzger Fix For: 1.14.0 Azure won't support Ubuntu 16.04 starting from October, hence we need to migrate to a newer Ubuntu version. We should do this at a time when the builds are relatively stable to be able to clearly identify issues relating to the version upgrade. Also, we shouldn't do this before a feature freeze ;) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22765) ExceptionUtilsITCase.testIsMetaspaceOutOfMemoryError is unstable
Robert Metzger created FLINK-22765: -- Summary: ExceptionUtilsITCase.testIsMetaspaceOutOfMemoryError is unstable Key: FLINK-22765 URL: https://issues.apache.org/jira/browse/FLINK-22765 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.14.0 Reporter: Robert Metzger Fix For: 1.14.0 https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=18292&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=a99e99c7-21cd-5a1f-7274-585e62b72f56 {code} May 25 00:56:38 java.lang.AssertionError: May 25 00:56:38 May 25 00:56:38 Expected: is "" May 25 00:56:38 but: was "The system is out of resources.\nConsult the following stack trace for details." May 25 00:56:38 at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) May 25 00:56:38 at org.junit.Assert.assertThat(Assert.java:956) May 25 00:56:38 at org.junit.Assert.assertThat(Assert.java:923) May 25 00:56:38 at org.apache.flink.runtime.util.ExceptionUtilsITCase.run(ExceptionUtilsITCase.java:94) May 25 00:56:38 at org.apache.flink.runtime.util.ExceptionUtilsITCase.testIsMetaspaceOutOfMemoryError(ExceptionUtilsITCase.java:70) May 25 00:56:38 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) May 25 00:56:38 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) May 25 00:56:38 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) May 25 00:56:38 at java.lang.reflect.Method.invoke(Method.java:498) May 25 00:56:38 at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) May 25 00:56:38 at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) May 25 00:56:38 at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) May 25 00:56:38 at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) May 25 00:56:38 at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) May 25 00:56:38 at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) May 25 00:56:38 at org.junit.rules.RunRules.evaluate(RunRules.java:20) May 25 00:56:38 at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) May 25 00:56:38 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) May 25 00:56:38 at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) May 25 00:56:38 at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) May 25 00:56:38 at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) May 25 00:56:38 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) May 25 00:56:38 at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) May 25 00:56:38 at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) May 25 00:56:38 at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) May 25 00:56:38 at org.junit.rules.RunRules.evaluate(RunRules.java:20) May 25 00:56:38 at org.junit.runners.ParentRunner.run(ParentRunner.java:363) May 25 00:56:38 at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) May 25 00:56:38 at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) May 25 00:56:38 at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) May 25 00:56:38 at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) May 25 00:56:38 at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) May 25 00:56:38 at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) May 25 00:56:38 at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) May 25 00:56:38 at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) May 25 00:56:38 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22692) CheckpointStoreITCase.testRestartOnRecoveryFailure fails with RuntimeException
Robert Metzger created FLINK-22692: -- Summary: CheckpointStoreITCase.testRestartOnRecoveryFailure fails with RuntimeException Key: FLINK-22692 URL: https://issues.apache.org/jira/browse/FLINK-22692 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Reporter: Robert Metzger Not sure if it is related to the adaptive scheduler: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=18052&view=logs&j=8fd9202e-fd17-5b26-353c-ac1ff76c8f28&t=a0a633b8-47ef-5c5a-2806-3c13b9e48228 {code} May 17 22:29:11 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.351 s <<< FAILURE! - in org.apache.flink.test.checkpointing.CheckpointStoreITCase May 17 22:29:11 [ERROR] testRestartOnRecoveryFailure(org.apache.flink.test.checkpointing.CheckpointStoreITCase) Time elapsed: 1.138 s <<< ERROR! May 17 22:29:11 org.apache.flink.runtime.client.JobExecutionException: Job execution failed. May 17 22:29:11 at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) May 17 22:29:11 at org.apache.flink.runtime.minicluster.MiniClusterJobClient.lambda$getJobExecutionResult$3(MiniClusterJobClient.java:137) May 17 22:29:11 at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) May 17 22:29:11 at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) May 17 22:29:11 at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) May 17 22:29:11 at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) May 17 22:29:11 at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:237) May 17 22:29:11 at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) May 17 22:29:11 at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) May 17 22:29:11 at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) May 17 22:29:11 at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975) May 17 22:29:11 at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1081) May 17 22:29:11 at akka.dispatch.OnComplete.internal(Future.scala:264) May 17 22:29:11 at akka.dispatch.OnComplete.internal(Future.scala:261) May 17 22:29:11 at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) May 17 22:29:11 at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) May 17 22:29:11 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) May 17 22:29:11 at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73) May 17 22:29:11 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) May 17 22:29:11 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) May 17 22:29:11 at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572) May 17 22:29:11 at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:22) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22593) SavepointITCase.testShouldAddEntropyToSavepointPath unstable
Robert Metzger created FLINK-22593: -- Summary: SavepointITCase.testShouldAddEntropyToSavepointPath unstable Key: FLINK-22593 URL: https://issues.apache.org/jira/browse/FLINK-22593 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.14.0 Reporter: Robert Metzger https://dev.azure.com/rmetzger/Flink/_build/results?buildId=9072&view=logs&j=cc649950-03e9-5fae-8326-2f1ad744b536&t=51cab6ca-669f-5dc0-221d-1e4f7dc4fc85 {code} 2021-05-07T10:56:20.9429367Z May 07 10:56:20 [ERROR] Tests run: 13, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 33.441 s <<< FAILURE! - in org.apache.flink.test.checkpointing.SavepointITCase 2021-05-07T10:56:20.9445862Z May 07 10:56:20 [ERROR] testShouldAddEntropyToSavepointPath(org.apache.flink.test.checkpointing.SavepointITCase) Time elapsed: 2.083 s <<< ERROR! 2021-05-07T10:56:20.9447106Z May 07 10:56:20 java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint triggering task Sink: Unnamed (3/4) of job 4e155a20f0a7895043661a6446caf1cb has not being executed at the moment. Aborting checkpoint. Failure reason: Not all required tasks are currently running. 2021-05-07T10:56:20.9448194Z May 07 10:56:20at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) 2021-05-07T10:56:20.9448797Z May 07 10:56:20at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) 2021-05-07T10:56:20.9449428Z May 07 10:56:20at org.apache.flink.test.checkpointing.SavepointITCase.submitJobAndTakeSavepoint(SavepointITCase.java:305) 2021-05-07T10:56:20.9450160Z May 07 10:56:20at org.apache.flink.test.checkpointing.SavepointITCase.testShouldAddEntropyToSavepointPath(SavepointITCase.java:273) 2021-05-07T10:56:20.9450785Z May 07 10:56:20at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2021-05-07T10:56:20.9451331Z May 07 10:56:20at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 2021-05-07T10:56:20.9451940Z May 07 10:56:20at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2021-05-07T10:56:20.9452498Z May 07 10:56:20at java.lang.reflect.Method.invoke(Method.java:498) 2021-05-07T10:56:20.9453247Z May 07 10:56:20at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) 2021-05-07T10:56:20.9454007Z May 07 10:56:20at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) 2021-05-07T10:56:20.9454687Z May 07 10:56:20at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) 2021-05-07T10:56:20.9455302Z May 07 10:56:20at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) 2021-05-07T10:56:20.9455909Z May 07 10:56:20at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 2021-05-07T10:56:20.9456493Z May 07 10:56:20at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) 2021-05-07T10:56:20.9457074Z May 07 10:56:20at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45) 2021-05-07T10:56:20.9457636Z May 07 10:56:20at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) 2021-05-07T10:56:20.9458157Z May 07 10:56:20at org.junit.rules.RunRules.evaluate(RunRules.java:20) 2021-05-07T10:56:20.9458678Z May 07 10:56:20at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) 2021-05-07T10:56:20.9459252Z May 07 10:56:20at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) 2021-05-07T10:56:20.9459865Z May 07 10:56:20at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) 2021-05-07T10:56:20.9460433Z May 07 10:56:20at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) 2021-05-07T10:56:20.9461058Z May 07 10:56:20at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 2021-05-07T10:56:20.9461607Z May 07 10:56:20at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) 2021-05-07T10:56:20.9462159Z May 07 10:56:20at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) 2021-05-07T10:56:20.9462705Z May 07 10:56:20at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) 2021-05-07T10:56:20.9463243Z May 07 10:56:20at org.junit.runners.ParentRunner.run(ParentRunner.java:363) 2021-05-07T10:56:20.9463812Z May 07 10:56:20at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) 2021-05-07T10:56:20.9464436Z May 07 10:56:20at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) 2021-05-07T10:56:20.9465073Z May 07 10:56:20at org.apache.maven.surefire.junit4.JU
[jira] [Created] (FLINK-22574) Adaptive Scheduler: Can not cancel restarting job
Robert Metzger created FLINK-22574: -- Summary: Adaptive Scheduler: Can not cancel restarting job Key: FLINK-22574 URL: https://issues.apache.org/jira/browse/FLINK-22574 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.13.1 I have a job in state RESTARTING. When I now issue a cancel RPC call, I get the following exception: {code} org.apache.flink.runtime.rest.handler.RestHandlerException: Job cancellation failed: Cancellation failed. at org.apache.flink.runtime.rest.handler.job.JobCancellationHandler.lambda$handleRequest$0(JobCancellationHandler.java:127) at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:234) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1079) at akka.dispatch.OnComplete.internal(Future.scala:263) at akka.dispatch.OnComplete.internal(Future.scala:261) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73) at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:573) at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:23) at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21) at scala.concurrent.Future.$anonfun$andThen$1(Future.scala:532) at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29) at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55) at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at 
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.flink.util.FlinkException: Cancellation failed. at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$cancel$3(JobMasterServiceLeadershipRunner.java:197) at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:234) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFutur
[jira] [Created] (FLINK-22556) Extend license checker to scan for traces of (L)GPL licensed code
Robert Metzger created FLINK-22556: -- Summary: Extend license checker to scan for traces of (L)GPL licensed code Key: FLINK-22556 URL: https://issues.apache.org/jira/browse/FLINK-22556 Project: Flink Issue Type: Improvement Components: Build System / CI Reporter: Robert Metzger Fix For: 1.14.0 This is a follow up to FLINK-22555. The goal is to catch this and similar cases automatically. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22509) ./bin/flink run -m yarn-cluster -d submission leads to IllegalStateException
Robert Metzger created FLINK-22509: -- Summary: ./bin/flink run -m yarn-cluster -d submission leads to IllegalStateException Key: FLINK-22509 URL: https://issues.apache.org/jira/browse/FLINK-22509 Project: Flink Issue Type: Bug Components: Deployment / YARN Affects Versions: 1.13.0, 1.14.0 Reporter: Robert Metzger Submitting a detached, per-job YARN cluster in Flink (like this: {{./bin/flink run -m yarn-cluster -d ./examples/streaming/TopSpeedWindowing.jar}}), leads to the following exception: {code} 2021-04-28 11:39:00,786 INFO org.apache.flink.yarn.YarnClusterDescriptor [] - Found Web Interface ip-172-31-27-232.eu-central-1.compute.internal:45689 of application 'application_1619607372651_0005'. Job has been submitted with JobID 5543e81db9c2de78b646088891f23bfc Exception in thread "Thread-4" java.lang.IllegalStateException: Trying to access closed classloader. Please check if you store classloaders directly or indirectly in static fields. If the stacktrace suggests that the leak occurs in a third party library and cannot be fixed immediately, you can disable this check with the configuration 'classloader.check-leaked-classloader'. at org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.ensureInner(FlinkUserCodeClassLoaders.java:164) at org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$SafetyNetWrapperClassLoader.getResource(FlinkUserCodeClassLoaders.java:183) at org.apache.hadoop.conf.Configuration.getResource(Configuration.java:2570) at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2783) at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2758) at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2638) at org.apache.hadoop.conf.Configuration.get(Configuration.java:1100) at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1707) at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1688) at org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183) at org.apache.hadoop.util.ShutdownHookManager.shutdownExecutor(ShutdownHookManager.java:145) at org.apache.hadoop.util.ShutdownHookManager.access$300(ShutdownHookManager.java:65) at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:102) {code} The job is still running as expected. Detached submission with {{./bin/flink run-application -t yarn-application -d}} works as expected. This is also the documented approach. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22483) Recover checkpoints when JobMaster gains leadership
Robert Metzger created FLINK-22483: -- Summary: Recover checkpoints when JobMaster gains leadership Key: FLINK-22483 URL: https://issues.apache.org/jira/browse/FLINK-22483 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.14.0 Recovering checkpoints (from the CompletedCheckpointStore) is a potentially blocking operation, for example if the file system implementation is retrying to connect to an unavailable storage backend. Currently, we are calling the CompletedCheckpointStore.recover() method from the main thread of the JobManager, making it unresponsive to any RPC call while the recover method is blocked. By moving the recovery to the start of the JobManager (which happens asynchronously after the JobMaster has gained leadership), Flink will remain responsive (reporting a job in INITIALIZING state). -- This message was sent by Atlassian Jira (v8.3.4#803005)
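To illustrate the direction proposed in FLINK-22483 above, a purely hypothetical sketch (the interface and names below are invented and do not reflect the actual Flink internals): run the blocking recovery on a background executor so the main/RPC thread stays responsive while the job reports INITIALIZING.
{code}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;

public class AsyncCheckpointRecovery {

    /** Hypothetical stand-in for the blocking checkpoint store recovery. */
    interface CheckpointStore {
        void recover() throws Exception; // may block on an unavailable file system
    }

    /** Kick off recovery on an I/O executor instead of the main (RPC) thread. */
    static CompletableFuture<Void> recoverAsync(CheckpointStore store, Executor ioExecutor) {
        return CompletableFuture.runAsync(
                () -> {
                    try {
                        store.recover();
                    } catch (Exception e) {
                        throw new CompletionException(e);
                    }
                },
                ioExecutor);
    }
}
{code}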
[jira] [Created] (FLINK-22413) Hide Checkpointing page in the UI for batch jobs
Robert Metzger created FLINK-22413: -- Summary: Hide Checkpointing page in the UI for batch jobs Key: FLINK-22413 URL: https://issues.apache.org/jira/browse/FLINK-22413 Project: Flink Issue Type: Bug Components: Runtime / Web Frontend Reporter: Robert Metzger This is a follow up to https://github.com/apache/flink/commit/8dca9fa852c72984ac873eae9a96bbd739e502f3#commitcomment-49744255 Batch jobs don't need a Checkpointing page, because there is no information for these jobs available there. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22331) CLI Frontend (RestClient) doesn't work on Apple M1
Robert Metzger created FLINK-22331: -- Summary: CLI Frontend (RestClient) doesn't work on Apple M1 Key: FLINK-22331 URL: https://issues.apache.org/jira/browse/FLINK-22331 Project: Flink Issue Type: Bug Components: Client / Job Submission Affects Versions: 1.12.2, 1.13.0 Reporter: Robert Metzger Attachments: flink-muthmann-client-KlemensMac.local (1).log_native, flink-muthmann-client-KlemensMac.local.log_rosetta This issue was first reported by a user: https://lists.apache.org/thread.html/r50bda40a69688de52c9d6e3489ac2641491387c20fdc1cecedceee76%40%3Cuser.flink.apache.org%3E See attached logs. Exception without rosetta: {code} org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Failed to execute job 'Streaming WordCount'. at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:366) at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:219) at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:812) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:246) at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1054) at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132) at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132) Caused by: org.apache.flink.util.FlinkException: Failed to execute job 'Streaming WordCount'. at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1918) at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:135) at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76) at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1782) at org.apache.flink.streaming.examples.wordcount.WordCount.main(WordCount.java:97) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:349) ... 8 more Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph. 
at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:400) at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986) at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970) at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:390) at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) at org.apache.flink.runtime.rest.RestClient$ClientHandler.exceptionCaught(RestClient.java:613) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:281) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:273) at org.apache.flink.shaded.netty4.io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireExceptionCaught(CombinedChannelDuplexHandler.java:424) at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:
[jira] [Created] (FLINK-22258) Adaptive Scheduler: Show history of rescales in the Web UI
Robert Metzger created FLINK-22258: -- Summary: Adaptive Scheduler: Show history of rescales in the Web UI Key: FLINK-22258 URL: https://issues.apache.org/jira/browse/FLINK-22258 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination, Runtime / Web Frontend Reporter: Robert Metzger As a user, I would like to see the history of rescale events in the web UI (number of slots used, task parallelisms) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22252) Backquotes are not rendered correctly in config option descriptions
Robert Metzger created FLINK-22252: -- Summary: Backquotes are not rendered correctly in config option descriptions Key: FLINK-22252 URL: https://issues.apache.org/jira/browse/FLINK-22252 Project: Flink Issue Type: Task Components: Documentation, Runtime / Configuration Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.13.0 This is how the config options are rendered in the 1.13 docs: {code} `none`, `off`, `disable`: No restart strategy. `fixeddelay`, `fixed-delay`: Fixed delay restart strategy. More details can be found here. `failurerate`, `failure-rate`: Failure rate restart strategy. More details can be found here. {code} https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#state-backend Here's the rendering in the old docs: https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/config.html#restart-strategy -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22244) Clarify Reactive Mode documentation wrt timeouts
Robert Metzger created FLINK-22244: -- Summary: Clarify Reactive Mode documentation wrt timeouts Key: FLINK-22244 URL: https://issues.apache.org/jira/browse/FLINK-22244 Project: Flink Issue Type: Task Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger Assignee: Robert Metzger Fix For: 1.13.0 In the release testing of Reactive Mode (FLINK-22134), we found that the documentation of the timeouts needs some clarification. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22243) Reactive Mode parallelism changes are not shown in the job graph visualization in the UI
Robert Metzger created FLINK-22243: -- Summary: Reactive Mode parallelism changes are not shown in the job graph visualization in the UI Key: FLINK-22243 URL: https://issues.apache.org/jira/browse/FLINK-22243 Project: Flink Issue Type: Bug Components: Runtime / Web Frontend Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.14.0 As reported here FLINK-22134, the parallelism in the visual job graph on top of the detail page is not in sync with the parallelism listed in the task list below, when reactive mode causes a parallelism change. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22105) testForceAlignedCheckpointResultingInPriorityEvents unstable
Robert Metzger created FLINK-22105: -- Summary: testForceAlignedCheckpointResultingInPriorityEvents unstable Key: FLINK-22105 URL: https://issues.apache.org/jira/browse/FLINK-22105 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.13.0 https://dev.azure.com/rmetzger/Flink/_build/results?buildId=9021&view=logs&j=9dc1b5dc-bcfa-5f83-eaa7-0cb181ddc267&t=ab910030-93db-52a7-74a3-34a0addb481b {code} 2021-04-01T19:29:55.2392858Z [ERROR] Tests run: 10, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 1.921 s <<< FAILURE! - in org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorTest 2021-04-01T19:29:55.2396751Z [ERROR] testForceAlignedCheckpointResultingInPriorityEvents(org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorTest) Time elapsed: 0.02 s <<< ERROR! 2021-04-01T19:29:55.2397415Z java.lang.RuntimeException: unable to send request to worker 2021-04-01T19:29:55.2397956Zat org.apache.flink.runtime.checkpoint.channel.ChannelStateWriterImpl.enqueue(ChannelStateWriterImpl.java:228) 2021-04-01T19:29:55.2398603Zat org.apache.flink.runtime.checkpoint.channel.ChannelStateWriterImpl.finishOutput(ChannelStateWriterImpl.java:183) 2021-04-01T19:29:55.2399310Zat org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:298) 2021-04-01T19:29:55.2400104Zat org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorTest.testForceAlignedCheckpointResultingInPriorityEvents(SubtaskCheckpointCoordinatorTest.java:215) 2021-04-01T19:29:55.2400746Zat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2021-04-01T19:29:55.2401202Zat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 2021-04-01T19:29:55.2401746Zat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2021-04-01T19:29:55.2402237Zat java.lang.reflect.Method.invoke(Method.java:498) 2021-04-01T19:29:55.2402722Zat org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) 2021-04-01T19:29:55.2403270Zat org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) 2021-04-01T19:29:55.2403818Zat org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) 2021-04-01T19:29:55.2404354Zat org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) 2021-04-01T19:29:55.2404854Zat org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) 2021-04-01T19:29:55.2405359Zat org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) 2021-04-01T19:29:55.2405896Zat org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) 2021-04-01T19:29:55.2406393Zat org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) 2021-04-01T19:29:55.2406855Zat org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 2021-04-01T19:29:55.2407331Zat org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) 2021-04-01T19:29:55.2407806Zat org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) 2021-04-01T19:29:55.2408279Zat org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) 2021-04-01T19:29:55.2408907Zat org.junit.runners.ParentRunner.run(ParentRunner.java:363) 2021-04-01T19:29:55.2409403Zat org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) 2021-04-01T19:29:55.2409954Zat 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) 2021-04-01T19:29:55.2410524Zat org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) 2021-04-01T19:29:55.2411318Zat org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) 2021-04-01T19:29:55.2411880Zat org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) 2021-04-01T19:29:55.2412467Zat org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) 2021-04-01T19:29:55.2413006Zat org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) 2021-04-01T19:29:55.2413519Zat org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) 2021-04-01T19:29:55.2414082Z Caused by: java.lang.IllegalArgumentException: writer not found while processing request: writeOutput 0 2021-04-01T19:29:55.2414726Zat org.apache.flink.runtime.checkpoint.channel.CheckpointInProgressRequest.onWriterMissing(ChannelStateWriteRequest.java:223) 2021-04-01T19:29:55.2415463Zat org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatchInte
[jira] [Created] (FLINK-22062) Add "Data Sources" (FLIP-27 sources overview page) page from 1.12.x to master
Robert Metzger created FLINK-22062: -- Summary: Add "Data Sources" (FLIP-27 sources overview page) page from 1.12.x to master Key: FLINK-22062 URL: https://issues.apache.org/jira/browse/FLINK-22062 Project: Flink Issue Type: Bug Components: API / DataStream, Documentation Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.13.0 This page https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/sources.html is not available in the latest docs on master. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22040) Maven: Entry has not been leased from this pool / fix for e2e tests
Robert Metzger created FLINK-22040: -- Summary: Maven: Entry has not been leased from this pool / fix for e2e tests Key: FLINK-22040 URL: https://issues.apache.org/jira/browse/FLINK-22040 Project: Flink Issue Type: Sub-task Components: Build System / Azure Pipelines Affects Versions: 1.12.2, 1.13.0 Reporter: Robert Metzger -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-22001) Exceptions from JobMaster initialization are not forwarded to the user
Robert Metzger created FLINK-22001: -- Summary: Exceptions from JobMaster initialization are not forwarded to the user Key: FLINK-22001 URL: https://issues.apache.org/jira/browse/FLINK-22001 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger Steps to reproduce: Set up a streaming job with an invalid parallelism configuration, for example: {code} .setParallelism(15).setMaxParallelism(1); {code} This should report the following exception to the user: {code} Caused by: org.apache.flink.runtime.JobException: Vertex Window(GlobalWindows(), DeltaTrigger, TimeEvictor, ComparableAggregator, PassThroughWindowFunction)'s parallelism (15) is higher than the max parallelism (1). Please lower the parallelism or increase the max parallelism. at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.(ExecutionJobVertex.java:160) at org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.attachJobGraph(DefaultExecutionGraph.java:781) at org.apache.flink.runtime.executiongraph.DefaultExecutionGraphBuilder.buildGraph(DefaultExecutionGraphBuilder.java:193) at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:106) at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:252) at org.apache.flink.runtime.scheduler.SchedulerBase.(SchedulerBase.java:185) at org.apache.flink.runtime.scheduler.DefaultScheduler.(DefaultScheduler.java:119) at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:132) at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110) at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:340) at org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:317) at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:94) at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:39) at org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.startJobMasterServiceSafely(JobManagerRunnerImpl.java:363) ... 13 more {code} However, what the user sees is {code} 2021-03-28 20:32:33,935 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 419f60eac551619fc1081c670ced3649 reached globally terminal state FAILED. ... 2021-03-28 20:32:33,974 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopped dispatcher akka://flink/user/rpc/dispatcher_2. 2021-03-28 20:32:33,977 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Stopping Akka RPC service. Exception in thread "main" org.apache.flink.util.FlinkException: Failed to execute job 'CarTopSpeedWindowingExample'. 
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1975) at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1853) at org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:69) at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1839) at org.apache.flink.streaming.examples.windowing.TopSpeedWindowing.main(TopSpeedWindowing.java:101) Caused by: java.lang.RuntimeException: Error while waiting for job to be initialized at org.apache.flink.client.ClientUtils.waitUntilJobInitializationFinished(ClientUtils.java:160) at org.apache.flink.client.program.PerJobMiniClusterFactory.lambda$submitJob$2(PerJobMiniClusterFactory.java:83) at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:73) at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) at java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:457) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorke
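For reference, a minimal hypothetical job (not the reporter's actual program; names are illustrative) that produces the misconfiguration described in FLINK-22001 above, i.e. a vertex parallelism higher than its max parallelism:
{code}
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class InvalidParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromSequence(0, 1000)
                .map(new MapFunction<Long, Long>() {
                    @Override
                    public Long map(Long value) {
                        return value * 2;
                    }
                })
                .setParallelism(15)    // vertex parallelism ...
                .setMaxParallelism(1)  // ... exceeds the max parallelism -> JobMaster initialization fails
                .print();
        env.execute("invalid-parallelism-example");
    }
}
{code}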
[jira] [Created] (FLINK-21929) flink-statebackend-rocksdb crashes with Error occurred in starting fork
Robert Metzger created FLINK-21929: -- Summary: flink-statebackend-rocksdb crashes with Error occurred in starting fork Key: FLINK-21929 URL: https://issues.apache.org/jira/browse/FLINK-21929 Project: Flink Issue Type: Bug Components: Runtime / State Backends Affects Versions: 1.13.0 Reporter: Robert Metzger Attachments: image-2021-03-23-13-18-41-836.png https://dev.azure.com/rmetzger/Flink/_build/results?buildId=9001&view=results {code} 2021-03-23T09:11:12.1861967Z [INFO] BUILD FAILURE 2021-03-23T09:11:12.1863007Z [INFO] 2021-03-23T09:11:12.1863492Z [INFO] Total time: 42:35 min 2021-03-23T09:11:12.1864171Z [INFO] Finished at: 2021-03-23T09:11:12+00:00 2021-03-23T09:11:12.8003245Z [INFO] Final Memory: 137M/806M 2021-03-23T09:11:12.8006310Z [INFO] 2021-03-23T09:11:12.8082409Z [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.1:test (default-test) on project flink-statebackend-rocksdb_2.11: There are test failures. 2021-03-23T09:11:12.8086652Z [ERROR] 2021-03-23T09:11:12.8092462Z [ERROR] Please refer to /__w/1/s/flink-state-backends/flink-statebackend-rocksdb/target/surefire-reports for the individual test results. 2021-03-23T09:11:12.8096948Z [ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream. 2021-03-23T09:11:12.8101388Z [ERROR] ExecutionException Error occurred in starting fork, check output in log 2021-03-23T09:11:12.8105868Z [ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: ExecutionException Error occurred in starting fork, check output in log 2021-03-23T09:11:12.8110518Z [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:510) 2021-03-23T09:11:12.8115518Z [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkOnceMultiple(ForkStarter.java:382) 2021-03-23T09:11:12.8120811Z [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:297) 2021-03-23T09:11:12.8126356Z [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:246) 2021-03-23T09:11:12.8127129Z [ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183) 2021-03-23T09:11:12.8131291Z [ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011) 2021-03-23T09:11:12.8132369Z [ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857) 2021-03-23T09:11:12.8133397Z [ERROR] at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132) 2021-03-23T09:11:12.8134116Z [ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208) 2021-03-23T09:11:12.8134793Z [ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153) 2021-03-23T09:11:12.8135621Z [ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145) 2021-03-23T09:11:12.8136323Z [ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116) 2021-03-23T09:11:12.8141570Z [ERROR] at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80) 2021-03-23T09:11:12.8142374Z [ERROR] at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51) 2021-03-23T09:11:12.8145665Z [ERROR] at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:120) 2021-03-23T09:11:12.8146407Z [ERROR] at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:355) 2021-03-23T09:11:12.8148835Z [ERROR] at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:155) 2021-03-23T09:11:12.8151299Z [ERROR] at org.apache.maven.cli.MavenCli.execute(MavenCli.java:584) 2021-03-23T09:11:12.8152244Z [ERROR] at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:216) 2021-03-23T09:11:12.8152806Z [ERROR] at org.apache.maven.cli.MavenCli.main(MavenCli.java:160) 2021-03-23T09:11:12.8155818Z [ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2021-03-23T09:11:12.8159757Z [ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 2021-03-23T09:11:12.8177288Z [ERROR] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2021-03-23T09:11:12.8178021Z [ERROR] at java.lang.reflect.Method.invoke(Method.java:498) 2021-03-23T09:11:12.8179802Z [ERROR] at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289) 2021-
[jira] [Created] (FLINK-21884) Reduce TaskManager failure detection time
Robert Metzger created FLINK-21884: -- Summary: Reduce TaskManager failure detection time Key: FLINK-21884 URL: https://issues.apache.org/jira/browse/FLINK-21884 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Reporter: Robert Metzger Fix For: 1.14.0 Attachments: image-2021-03-19-20-10-40-324.png In Flink 1.13 (and older versions), TaskManager failures stall the processing for a significant amount of time, even though the system gets indications for the failure almost immediately through network connection losses. This is due to a high (default) heartbeat timeout of 50 seconds [1] to accommodate GC pauses, transient network disruptions or generally slow environments (otherwise, we would unregister a healthy TaskManager). Such a high timeout can lead to disruptions in the processing (no processing for certain periods, high latencies, buildup of consumer lag, etc.). In Reactive Mode (FLINK-10407), the issue surfaces on scale-down events, where the loss of a TaskManager is immediately visible in the logs, but the job is stuck in "FAILING" for quite a while until the TaskManager is really deregistered. (Note that this issue is not that critical in an autoscaling setup, because Flink can control the scale-down events and trigger them proactively.) On this metrics dashboard, one can see that the job has significant throughput drops / consumer lags during scale down (and also CPU usage spikes on processing the queued events, leading to incorrect scale up events again). !image-2021-03-19-20-10-40-324.png! One idea to solve this problem is to: - Score TaskManagers based on certain signals (# exceptions reported, exception types (connection losses, akka failures), failure frequencies, ...) and blacklist them accordingly. [1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#heartbeat-timeout -- This message was sent by Atlassian Jira (v8.3.4#803005)
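For context, the timeout referenced in FLINK-21884 above can already be lowered via configuration, at the cost of more false positives in slow environments. A small sketch (the "heartbeat.timeout" key is the one linked in [1]; the interval key and the concrete values are illustrative assumptions):
{code}
import org.apache.flink.configuration.Configuration;

public class HeartbeatConfigSketch {
    public static void main(String[] args) {
        Configuration config = new Configuration();
        // Default timeout is 50000 ms (50 s), see [1]; lowering it speeds up failure
        // detection but makes healthy-but-slow TaskManagers more likely to be
        // unregistered. The values below are illustrative only.
        config.setString("heartbeat.timeout", "10000");
        config.setString("heartbeat.interval", "2000");
        System.out.println(config);
    }
}
{code}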
[jira] [Created] (FLINK-21883) Introduce cooldown period into adaptive scheduler
Robert Metzger created FLINK-21883: -- Summary: Introduce cooldown period into adaptive scheduler Key: FLINK-21883 URL: https://issues.apache.org/jira/browse/FLINK-21883 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Reporter: Robert Metzger Fix For: 1.14.0 This is a follow up to reactive mode, introduced in FLINK-10407. Introduce a cooldown timeout after a scaling action, during which no further scaling actions are performed. Without such a cooldown timeout, it can happen, with unfortunate timing, that we rescale the job very frequently, because TaskManagers are not all connecting at the same time. With the current implementation (1.13), this only applies to scaling up, but it can also apply to scaling down with autoscaling support. With this implemented, users can define a cooldown timeout of, say, 5 minutes: if TaskManagers are slowly connecting one after another, we will only rescale every 5 minutes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
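The cooldown idea from FLINK-21883 above can be illustrated with a small, self-contained sketch (this is not the adaptive scheduler's actual code; class and method names are invented): rescale requests are simply ignored until the cooldown window since the last rescale has elapsed.
{code}
import java.time.Duration;
import java.time.Instant;

public class RescaleCooldown {
    private final Duration cooldown;
    private Instant lastRescale = Instant.EPOCH;

    public RescaleCooldown(Duration cooldown) {
        this.cooldown = cooldown;
    }

    /** True if the cooldown window since the last rescale has elapsed. */
    public boolean mayRescale(Instant now) {
        return Duration.between(lastRescale, now).compareTo(cooldown) >= 0;
    }

    public void recordRescale(Instant now) {
        lastRescale = now;
    }
}
{code}
With a 5-minute cooldown, TaskManagers connecting one after another would trigger at most one rescale every 5 minutes, matching the example in the ticket.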
[jira] [Created] (FLINK-21819) Remove swift FS filesystem
Robert Metzger created FLINK-21819: -- Summary: Remove swift FS filesystem Key: FLINK-21819 URL: https://issues.apache.org/jira/browse/FLINK-21819 Project: Flink Issue Type: Task Components: FileSystems Reporter: Robert Metzger Fix For: 1.13.0 As discussed on the dev@ list, the community agreed to remove the swift FS filesystem: https://www.mail-archive.com/dev@flink.apache.org/msg44993.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21712) Extract interfaces for the Execution* (classes used by the DefaultExecutionGraph)
Robert Metzger created FLINK-21712: -- Summary: Extract interfaces for the Execution* (classes used by the DefaultExecutionGraph) Key: FLINK-21712 URL: https://issues.apache.org/jira/browse/FLINK-21712 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger This is a follow up to FLINK-21347, where we discussed in the review (https://github.com/apache/flink/pull/14950#discussion_r580963986) that it would be good to extract interfaces for the remaining Execution* classes to make testing / mocking easier. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21711) DataStreamSink doesn't allow setting maxParallelism
Robert Metzger created FLINK-21711: -- Summary: DataStreamSink doesn't allow setting maxParallelism Key: FLINK-21711 URL: https://issues.apache.org/jira/browse/FLINK-21711 Project: Flink Issue Type: Improvement Components: API / DataStream Affects Versions: 1.13.0 Reporter: Robert Metzger It seems that we can set the max parallelism of the sink only via this internal API: {code} input.addSink(new ParallelismTrackingSink<>()).getTransformation().setMaxParallelism(1); {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21638) Clean up argument order and semantics of all "stopWithSavepoint" calls in the codebase
Robert Metzger created FLINK-21638: -- Summary: Clean up argument order and semantics of all "stopWithSavepoint" calls in the codebase Key: FLINK-21638 URL: https://issues.apache.org/jira/browse/FLINK-21638 Project: Flink Issue Type: Improvement Components: Client / Job Submission, Runtime / Coordination Reporter: Robert Metzger Fix For: 1.13.0 This came up in a review: https://github.com/apache/flink/pull/14948#discussion_r588304475 The stopWithSavepoint() methods in Flink have a boolean flag that indicates either whether we inject a MAX_WATERMARK into the pipeline or whether we terminate or suspend a job. As part of this ticket, I propose to introduce a new enum type in flink-core, "StopWithSavepointBehavior", with SUSPEND and TERMINATE values and to replace the boolean flag with this new type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
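A minimal sketch of the enum proposed in FLINK-21638 above (only the type name and the SUSPEND/TERMINATE values come from the ticket; package, Javadoc and final shape are assumptions):
{code}
/** Proposed replacement for the boolean "terminate vs. suspend" flag. */
public enum StopWithSavepointBehavior {
    /** Stop with a savepoint and bring the job to a globally terminal state. */
    TERMINATE,
    /** Stop with a savepoint so that the job can be resumed later. */
    SUSPEND
}
{code}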
[jira] [Created] (FLINK-21606) TaskManager connected to invalid JobManager leading to TaskSubmissionException
Robert Metzger created FLINK-21606: -- Summary: TaskManager connected to invalid JobManager leading to TaskSubmissionException Key: FLINK-21606 URL: https://issues.apache.org/jira/browse/FLINK-21606 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.13.0 While testing reactive mode, I had to start my JobManager a few times to get the configuration right. While doing that, I had at least one TaskManager (TM6), which was first connected to the first JobManager (with a running job), and then to the second one. On the second JobManager, I was able to execute my test job (on another TaskManager (TMx)). Once TM6 reconnected and reactive mode tried to utilize all available resources, I repeatedly ran into this issue: {code} 2021-03-04 15:49:36,322 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(GlobalWindows(), DeltaTrigger, TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print to Std. Out (5/7) (ae8f39c8dd88148aff93c8f811fab22e) switched from DEPLOYING to FAILED on 192.168.2.173:64041-4f7521 @ macbook-pro-2.localdomain (dataPort=64044). java.util.concurrent.CompletionException: org.apache.flink.runtime.taskexecutor.exceptions.TaskSubmissionException: Could not submit task because there is no JobManager associated for the job bbe8634736b5b1d813dd322cfaaa08ea. at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:925) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:913) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_252] at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$0(AkkaInvocationHandler.java:234) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_252] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_252] at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1064) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.dispatch.OnComplete.internal(Future.scala:263) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.dispatch.OnComplete.internal(Future.scala:261) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:101) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:999) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.actor.Actor$class.aroundReceive(Actor.scala:517) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:458) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.13-SNAPSHOT.jar:1
[jira] [Created] (FLINK-21604) Remove adaptive scheduler nightly test profile, replace with more specific testing approach
Robert Metzger created FLINK-21604: -- Summary: Remove adaptive scheduler nightly test profile, replace with more specific testing approach Key: FLINK-21604 URL: https://issues.apache.org/jira/browse/FLINK-21604 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger Fix For: 1.13.0 We are currently running all tests with the adaptive scheduler, but only once a night. This is wasteful for all tests that do not use any scheduler. Ideally, we come up with a testing strategy where we run all relevant tests (checkpointing, failure, recovery, etc.) with both schedulers. One approach would be to parameterize some of these tests. -- This message was sent by Atlassian Jira (v8.3.4#803005)
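A minimal sketch of the parameterization approach mentioned above, using JUnit 4 (class, method and parameter names are hypothetical and not the actual Flink test setup):
{code:java}
import static org.junit.Assert.assertTrue;

import java.util.Arrays;
import java.util.Collection;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

// Hypothetical sketch: run the same recovery/checkpointing scenario once per scheduler.
@RunWith(Parameterized.class)
public class SchedulerParameterizedRecoveryTest {

    @Parameters(name = "scheduler = {0}")
    public static Collection<String> schedulers() {
        // "adaptive" and "default" stand in for the two scheduler implementations.
        return Arrays.asList("adaptive", "default");
    }

    private final String schedulerType;

    public SchedulerParameterizedRecoveryTest(String schedulerType) {
        this.schedulerType = schedulerType;
    }

    @Test
    public void testRecoveryBehaviour() {
        // In a real test this would configure a MiniCluster with the chosen
        // scheduler and run the actual checkpointing/failover scenario.
        assertTrue(runScenarioWith(schedulerType));
    }

    private boolean runScenarioWith(String scheduler) {
        // Placeholder for the scenario under test.
        return scheduler != null;
    }
}
{code}
This way the relevant tests run with both schedulers in every build, instead of only in a nightly profile.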
[jira] [Created] (FLINK-21587) FineGrainedSlotManagerTest.testNotificationAboutNotEnoughResources is unstable
Robert Metzger created FLINK-21587: -- Summary: FineGrainedSlotManagerTest.testNotificationAboutNotEnoughResources is unstable Key: FLINK-21587 URL: https://issues.apache.org/jira/browse/FLINK-21587 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.13.0 Happened in my WIP branch, but most likely unrelated to my change: https://dev.azure.com/rmetzger/Flink/_build/results?buildId=8925&view=logs&j=9dc1b5dc-bcfa-5f83-eaa7-0cb181ddc267&t=ab910030-93db-52a7-74a3-34a0addb481b Also note that the error is reproducible locally with DEBUG log level, but not with INFO: {code} [ERROR] Tests run: 12, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.169 s <<< FAILURE! - in org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest [ERROR] testNotificationAboutNotEnoughResources(org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest) Time elapsed: 0.029 s <<< FAILURE! java.lang.AssertionError: Expected: a collection with size <1> but: collection size was <0> at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at org.junit.Assert.assertThat(Assert.java:956) at org.junit.Assert.assertThat(Assert.java:923) at org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest$9.lambda$new$5(FineGrainedSlotManagerTest.java:548) at org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTestBase$Context.runTest(FineGrainedSlotManagerTestBase.java:197) at org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest$9.(FineGrainedSlotManagerTest.java:521) at org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest.testNotificationAboutNotEnoughResources(FineGrainedSlotManagerTest.java:507) at org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManagerTest.testNotificationAboutNotEnoughResources(FineGrainedSlotManagerTest.java:493) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21561) Introduce cache for binaries downloaded by bash tests
Robert Metzger created FLINK-21561: -- Summary: Introduce cache for binaries downloaded by bash tests Key: FLINK-21561 URL: https://issues.apache.org/jira/browse/FLINK-21561 Project: Flink Issue Type: Improvement Components: Build System / CI, Test Infrastructure Reporter: Robert Metzger Fix For: 1.13.0 It seems that archive.apache.org is currently very slow, causing the bash e2e tests to time out because they spend a lot of time downloading dependencies. This ticket tracks an improvement to the bash-based e2e tests to cache the downloaded binaries in Azure Pipelines caches [1]. [1] https://docs.microsoft.com/en-us/azure/devops/pipelines/release/caching?view=azure-devops -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21558) DeclarativeSlotManager starts logging "Could not fulfill resource requirements of job xxx"
Robert Metzger created FLINK-21558: -- Summary: DeclarativeSlotManager starts logging "Could not fulfill resource requirements of job xxx" Key: FLINK-21558 URL: https://issues.apache.org/jira/browse/FLINK-21558 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.13.0 While testing the reactive mode, I noticed that my job started normally, but after a few minutes, it started logging this: {code} 2021-03-02 13:36:48,075 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Timestamps/Watermarks (2/3) (061b652dabc0ecfc83c942ee3e127ecd) switched from DEPLOYING to RUNNING. 2021-03-02 13:36:48,076 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(GlobalWindows(), DeltaTrigger, TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print to Std. Out (2/3) (6a715e3c70754aafa0b91332b69a736d) switched from DEPLOYING to RUNNING. 2021-03-02 13:36:48,077 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(GlobalWindows(), DeltaTrigger, TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/3) (8655874da6905d13c01927a282ed2ce0) switched from DEPLOYING to RUNNING. 2021-03-02 13:36:48,080 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(GlobalWindows(), DeltaTrigger, TimeEvictor, ComparableAggregator, PassThroughWindowFunction) -> Sink: Print to Std. Out (3/3) (9514718713ffa453c43a7e7efde9920a) switched from DEPLOYING to RUNNING. 2021-03-02 13:40:28,893 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:40:29,474 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:40:29,475 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:40:29,475 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:40:39,495 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:40:39,496 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:40:39,497 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:40:49,518 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:40:49,518 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:40:49,519 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 
2021-03-02 13:40:59,536 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:40:59,536 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:40:59,537 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:41:09,556 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:41:09,557 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:41:09,557 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could not fulfill resource requirements of job 1283d12b281c35f33f3602611ef43b35. 2021-03-02 13:41:19,577 WARN org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Could n
[jira] [Created] (FLINK-21534) Test jars are uploaded twice during Flink maven deployment
Robert Metzger created FLINK-21534: -- Summary: Test jars are uploaded twice during Flink maven deployment Key: FLINK-21534 URL: https://issues.apache.org/jira/browse/FLINK-21534 Project: Flink Issue Type: Bug Components: Build System Affects Versions: 1.13.0 Reporter: Robert Metzger Here's an example of a snapshot build: https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13905&view=logs&j=eca6b3a6-1600-56cc-916a-c549b3cde3ff&t=e9844b5e-5aa3-546b-6c3e-5395c7c0cac7 You can see that the same jar is uploaded twice: {code} 2021-02-28T21:07:05.2904685Z Uploading: https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-runtime_2.11/1.13-SNAPSHOT/flink-runtime_2.11-1.13-20210228.210704-95-tests.jar 2021-02-28T21:07:05.5738429Z Uploaded: https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-runtime_2.11/1.13-SNAPSHOT/flink-runtime_2.11-1.13-20210228.210704-95-tests.jar (4714 KB at 16596.2 KB/sec) 2021-02-28T21:07:05.6506506Z Uploading: https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-runtime_2.11/1.13-SNAPSHOT/flink-runtime_2.11-1.13-20210228.210704-95-tests.jar 2021-02-28T21:07:07.1976748Z Uploaded: https://repository.apache.org/content/repositories/snapshots/org/apache/flink/flink-runtime_2.11/1.13-SNAPSHOT/flink-runtime_2.11-1.13-20210228.210704-95-tests.jar (4714 KB at 3046.7 KB/sec) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21475) StateWithExecutionGraph.suspend fails with IllegalStateException
Robert Metzger created FLINK-21475: -- Summary: StateWithExecutionGraph.suspend fails with IllegalStateException Key: FLINK-21475 URL: https://issues.apache.org/jira/browse/FLINK-21475 Project: Flink Issue Type: New Feature Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.13.0 https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13665&view=logs&j=a5ef94ef-68c2-57fd-3794-dc108ed1c495&t=9c1ddabe-d186-5a2c-5fcc-f3cafb3ec699 For discoverability, the Azure output is: {code} 2021-02-23T23:48:50.6572167Z [INFO] BUILD FAILURE 2021-02-23T23:48:50.6573151Z [INFO] 2021-02-23T23:48:50.6573684Z [INFO] Total time: 01:59 min 2021-02-23T23:48:50.6574520Z [INFO] Finished at: 2021-02-23T23:48:50+00:00 2021-02-23T23:48:51.4672056Z [INFO] Final Memory: 183M/3491M 2021-02-23T23:48:51.4673656Z [INFO] 2021-02-23T23:48:51.4674310Z [WARNING] The requested profile "skip-webui-build" could not be activated because it does not exist. 2021-02-23T23:48:51.4675176Z [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.1:test (integration-tests) on project flink-connector-files: There are test failures. 2021-02-23T23:48:51.4675685Z [ERROR] 2021-02-23T23:48:51.4676248Z [ERROR] Please refer to /__w/2/s/flink-connectors/flink-connector-files/target/surefire-reports for the individual test results. 2021-02-23T23:48:51.4677378Z [ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream. 2021-02-23T23:48:51.4677963Z [ERROR] ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called? 2021-02-23T23:48:51.4679891Z [ERROR] Command was /bin/sh -c cd /__w/2/s/flink-connectors/flink-connector-files/target && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=1 -XX:+UseG1GC -jar /__w/2/s/flink-connectors/flink-connector-files/target/surefire/surefirebooter72596965996179907.jar /__w/2/s/flink-connectors/flink-connector-files/target/surefire 2021-02-23T23-47-06_882-jvmRun1 surefire3083231786399295313tmp surefire_104007394662374862924tmp 2021-02-23T23:48:51.4680846Z [ERROR] Error occurred in starting fork, check output in log 2021-02-23T23:48:51.4681156Z [ERROR] Process Exit Code: 239 2021-02-23T23:48:51.4681413Z [ERROR] Crashed tests: 2021-02-23T23:48:51.4681737Z [ERROR] org.apache.flink.connector.file.sink.writer.FileSinkMigrationITCase 2021-02-23T23:48:51.4682470Z [ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: ExecutionException The forked VM terminated without properly saying goodbye. VM crash or System.exit called? 
2021-02-23T23:48:51.4684122Z [ERROR] Command was /bin/sh -c cd /__w/2/s/flink-connectors/flink-connector-files/target && /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms256m -Xmx2048m -Dmvn.forkNumber=1 -XX:+UseG1GC -jar /__w/2/s/flink-connectors/flink-connector-files/target/surefire/surefirebooter72596965996179907.jar /__w/2/s/flink-connectors/flink-connector-files/target/surefire 2021-02-23T23-47-06_882-jvmRun1 surefire3083231786399295313tmp surefire_104007394662374862924tmp 2021-02-23T23:48:51.4685319Z [ERROR] Error occurred in starting fork, check output in log 2021-02-23T23:48:51.4685796Z [ERROR] Process Exit Code: 239 2021-02-23T23:48:51.4686229Z [ERROR] Crashed tests: 2021-02-23T23:48:51.4686926Z [ERROR] org.apache.flink.connector.file.sink.writer.FileSinkMigrationITCase 2021-02-23T23:48:51.4687709Z [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.awaitResultsDone(ForkStarter.java:510) 2021-02-23T23:48:51.4688603Z [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.runSuitesForkPerTestSet(ForkStarter.java:457) 2021-02-23T23:48:51.4689451Z [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:298) 2021-02-23T23:48:51.4690207Z [ERROR] at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:246) 2021-02-23T23:48:51.4691076Z [ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183) 2021-02-23T23:48:51.4692029Z [ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011) 2021-02-23T23:48:51.4693058Z [ERROR] at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857) 2021-02-23T23:48:51.4693946Z [ERROR] at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:132) 2021-02-23T23:48:51.4694768Z [ERROR] at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor
[jira] [Created] (FLINK-21450) Add local recovery support to adaptive scheduler
Robert Metzger created FLINK-21450: -- Summary: Add local recovery support to adaptive scheduler Key: FLINK-21450 URL: https://issues.apache.org/jira/browse/FLINK-21450 Project: Flink Issue Type: New Feature Components: Runtime / Coordination Reporter: Robert Metzger Local recovery means that, on a failure, we can re-use the state already present on a TaskManager instead of loading it again from distributed storage (which means the scheduler needs to know where each piece of state is located, and schedule tasks accordingly). The Adaptive Scheduler currently does not respect the location of state, so failures require re-loading the state from distributed storage. Adding this feature will allow us to enable the {{Local recovery and sticky scheduling end-to-end test}} for the adaptive scheduler again. -- This message was sent by Atlassian Jira (v8.3.4#803005)
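A very rough, hypothetical sketch of what "knowing where each piece of state is located" could mean for scheduling (all names and types are made up for illustration; this is not the actual scheduler code):
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of state-location-aware scheduling: remember on which
// TaskManager a subtask last ran and prefer that location after a failure.
public class StateLocationPreferences {

    // Maps a subtask (identified here by a simple string key) to the
    // TaskManager that holds its local state from the previous attempt.
    private final Map<String, String> lastKnownLocation = new HashMap<>();

    public void recordDeployment(String subtask, String taskManagerId) {
        lastKnownLocation.put(subtask, taskManagerId);
    }

    /** Preferred TaskManager for redeployment, if its local state is still available. */
    public Optional<String> preferredLocation(String subtask) {
        return Optional.ofNullable(lastKnownLocation.get(subtask));
    }
}
{code}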
[jira] [Created] (FLINK-21390) Rename DeclarativeScheduler to AdaptiveScheduler
Robert Metzger created FLINK-21390: -- Summary: Rename DeclarativeScheduler to AdaptiveScheduler Key: FLINK-21390 URL: https://issues.apache.org/jira/browse/FLINK-21390 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.13.0 DeclarativeScheduler does not seem to be an appropriate name for what it does: in particular when comparing it to the DefaultScheduler, calling it AdaptiveScheduler (as in: it adapts the parallelism of the ExecutionGraph during the job's lifetime) seems more appropriate. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21360) Add resourceTimeout configuration
Robert Metzger created FLINK-21360: -- Summary: Add resourceTimeout configuration Key: FLINK-21360 URL: https://issues.apache.org/jira/browse/FLINK-21360 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger Fix For: 1.13.0 resourceTimeout is currently a hardcoded value. Make it configurable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
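A minimal sketch of what such a configuration option could look like using Flink's ConfigOptions builder (the key name, default value and class name are assumptions, not the final option):
{code:java}
import java.time.Duration;

import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

// Hypothetical option definition; key and default are placeholders.
public class AdaptiveSchedulerOptions {

    public static final ConfigOption<Duration> RESOURCE_TIMEOUT =
            ConfigOptions.key("jobmanager.adaptive-scheduler.resource-timeout")
                    .durationType()
                    .defaultValue(Duration.ofMinutes(5))
                    .withDescription(
                            "Maximum time to wait for the required resources before giving up.");
}
{code}
The scheduler would then read the value from the cluster Configuration instead of using a hardcoded constant.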
[jira] [Created] (FLINK-21348) Add tests for StateWithExecutionGraph
Robert Metzger created FLINK-21348: -- Summary: Add tests for StateWithExecutionGraph Key: FLINK-21348 URL: https://issues.apache.org/jira/browse/FLINK-21348 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.13.0 This ticket is about adding dedicated tests for the StateWithExecutionGraph class. This is a follow up from https://github.com/apache/flink/pull/14879#discussion_r573707768 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21347) Extract interface out of ExecutionGraph for better testability
Robert Metzger created FLINK-21347: -- Summary: Extract interface out of ExecutionGraph for better testability Key: FLINK-21347 URL: https://issues.apache.org/jira/browse/FLINK-21347 Project: Flink Issue Type: Task Components: Runtime / Coordination Reporter: Robert Metzger This is a follow up to this comment: https://github.com/apache/flink/pull/14879#discussion_r573613450 The ExecutionGraph class is currently not very handy for tests, as it has a lot of dependencies. Extracting an interface for the ExecutionGraph will make testing and future changes easier. -- This message was sent by Atlassian Jira (v8.3.4#803005)
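A hedged illustration of the proposed refactoring (the interface name and methods are invented for the example; the real interface would be derived from the actual ExecutionGraph API):
{code:java}
// Hypothetical narrow interface extracted from ExecutionGraph so that tests can
// provide a lightweight stub instead of constructing a full ExecutionGraph.
interface ExecutionGraphView {
    String getJobName();
    int getNumberOfVertices();
}

// Production code would have ExecutionGraph implement the interface, while tests
// can use a trivial stub like this one.
class StubExecutionGraph implements ExecutionGraphView {
    @Override
    public String getJobName() {
        return "test-job";
    }

    @Override
    public int getNumberOfVertices() {
        return 1;
    }
}
{code}
Code that previously required a fully constructed ExecutionGraph can then depend on the interface and be tested against the stub.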
[jira] [Created] (FLINK-21333) Introduce stopping with savepoint state
Robert Metzger created FLINK-21333: -- Summary: Introduce stopping with savepoint state Key: FLINK-21333 URL: https://issues.apache.org/jira/browse/FLINK-21333 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger The declarative scheduler is also affected by the problem described in FLINK-21030. We want to solve this problem by introducing a separate state that is used while we are taking a savepoint for stopping the job. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21329) "Local recovery and sticky scheduling end-to-end test" does not finish within 600 seconds
Robert Metzger created FLINK-21329: -- Summary: "Local recovery and sticky scheduling end-to-end test" does not finish within 600 seconds Key: FLINK-21329 URL: https://issues.apache.org/jira/browse/FLINK-21329 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.13.0 Reporter: Robert Metzger https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=13118&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529 {code} Feb 08 22:25:46 == Feb 08 22:25:46 Running 'Local recovery and sticky scheduling end-to-end test' Feb 08 22:25:46 == Feb 08 22:25:46 TEST_DATA_DIR: /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-46881214821 Feb 08 22:25:47 Flink dist directory: /home/vsts/work/1/s/flink-dist/target/flink-1.13-SNAPSHOT-bin/flink-1.13-SNAPSHOT Feb 08 22:25:47 Running local recovery test with configuration: Feb 08 22:25:47 parallelism: 4 Feb 08 22:25:47 max attempts: 10 Feb 08 22:25:47 backend: rocks Feb 08 22:25:47 incremental checkpoints: false Feb 08 22:25:47 kill JVM: false Feb 08 22:25:47 Starting zookeeper daemon on host fv-az127-394. Feb 08 22:25:47 Starting HA cluster with 1 masters. Feb 08 22:25:48 Starting standalonesession daemon on host fv-az127-394. Feb 08 22:25:49 Starting taskexecutor daemon on host fv-az127-394. Feb 08 22:25:49 Waiting for Dispatcher REST endpoint to come up... Feb 08 22:25:50 Waiting for Dispatcher REST endpoint to come up... Feb 08 22:25:51 Waiting for Dispatcher REST endpoint to come up... Feb 08 22:25:53 Waiting for Dispatcher REST endpoint to come up... Feb 08 22:25:54 Dispatcher REST endpoint is up. Feb 08 22:25:54 Started TM watchdog with PID 28961. Feb 08 22:25:58 Job has been submitted with JobID e790e85a39040539f9386c0df7ca4812 Feb 08 22:35:47 Test (pid: 27970) did not finish after 600 seconds. 
Feb 08 22:35:47 Printing Flink logs and killing it: {code} and {code} at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.unhandledError(ZooKeeperLeaderRetrievalDriver.java:184) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:713) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:709) at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100) at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862) ... 10 more {code} -- Thi
[jira] [Created] (FLINK-21307) Revisit activation model of FlinkSecurityManager
Robert Metzger created FLINK-21307: -- Summary: Revisit activation model of FlinkSecurityManager Key: FLINK-21307 URL: https://issues.apache.org/jira/browse/FLINK-21307 Project: Flink Issue Type: Improvement Components: Runtime / Task Affects Versions: 1.13.0 Reporter: Robert Metzger In FLINK-15156, we introduced a feature that allows users to log or completely disable calls to System.exit(). This feature is enabled for certain threads / code sections intended to execute user-code. The activation of the security manager (for monitoring user calls to System.exit()) is currently not well-defined and is only implemented on a best-effort basis. This ticket is about revisiting the activation model. -- This message was sent by Atlassian Jira (v8.3.4#803005)
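A minimal, self-contained sketch of the underlying mechanism on Java 8 (plain JDK SecurityManager; this is not the actual FlinkSecurityManager code): System.exit() triggers checkExit(), where the call can be logged or turned into an exception.
{code:java}
import java.security.Permission;

// Illustrative only: intercept System.exit() via a custom SecurityManager.
public class ExitInterceptor extends SecurityManager {

    @Override
    public void checkPermission(Permission perm) {
        // Allow everything else; we only care about exit attempts.
    }

    @Override
    public void checkExit(int status) {
        // Logging the call, or throwing, records/prevents the JVM exit.
        throw new SecurityException("Intercepted System.exit(" + status + ")");
    }

    public static void main(String[] args) {
        System.setSecurityManager(new ExitInterceptor());
        try {
            System.exit(42);
        } catch (SecurityException e) {
            System.out.println(e.getMessage()); // the exit was prevented
        }
    }
}
{code}
The open question in this ticket is when such a manager should be active, i.e. for which threads and code sections the interception applies.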
[jira] [Created] (FLINK-21306) FlinkSecurityManager might avoid fatal system exits
Robert Metzger created FLINK-21306: -- Summary: FlinkSecurityManager might avoid fatal system exits Key: FLINK-21306 URL: https://issues.apache.org/jira/browse/FLINK-21306 Project: Flink Issue Type: Bug Components: Runtime / Task Affects Versions: 1.13.0 Reporter: Robert Metzger In FLINK-15156, we introduced a feature that allows users to log or completely disable calls to System.exit(). This feature is enabled for certain threads / code sections intended to execute user-code. However, some user code calls might still lead to fatal errors, which we want to handle by killing the Flink process. It is likely that this new change (which is disabled by default) can lead to a situation where Flink should exit immediately but doesn't, thus leaving the system in an inconsistent state. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21260) Add DeclarativeScheduler / Finished state
Robert Metzger created FLINK-21260: -- Summary: Add DeclarativeScheduler / Finished state Key: FLINK-21260 URL: https://issues.apache.org/jira/browse/FLINK-21260 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger Fix For: 1.13.0 This subtask of adding the declarative scheduler is about adding the Finished state to Flink, including tests. Finished: The job execution has been completed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21259) Add DeclarativeScheduler / Failing state
Robert Metzger created FLINK-21259: -- Summary: Add DeclarativeScheduler / Failing state Key: FLINK-21259 URL: https://issues.apache.org/jira/browse/FLINK-21259 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger Fix For: 1.13.0 This subtask of adding the declarative scheduler is about adding the Failing state to Flink, including tests. Failing: An unrecoverable fault has occurred. The scheduler stops the ExecutionGraph by canceling it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21258) Add DeclarativeScheduler / Canceling state
Robert Metzger created FLINK-21258: -- Summary: Add DeclarativeScheduler / Canceling state Key: FLINK-21258 URL: https://issues.apache.org/jira/browse/FLINK-21258 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger Fix For: 1.13.0 This subtask of adding the declarative scheduler is about adding the Canceling state to Flink, including tests. Canceling: The job has been canceled by the user. The scheduler stops the ExecutionGraph by canceling it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21257) Add DeclarativeScheduler / Restarting state
Robert Metzger created FLINK-21257: -- Summary: Add DeclarativeScheduler / Restarting state Key: FLINK-21257 URL: https://issues.apache.org/jira/browse/FLINK-21257 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger Fix For: 1.13.0 This subtask of adding the declarative scheduler is about adding the Restarting state to Flink, including tests. Restarting: A recoverable fault has occurred. The scheduler stops the ExecutionGraph by canceling it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21256) Add DeclarativeScheduler / Executing state
Robert Metzger created FLINK-21256: -- Summary: Add DeclarativeScheduler / Executing state Key: FLINK-21256 URL: https://issues.apache.org/jira/browse/FLINK-21256 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger Assignee: Robert Metzger Fix For: 1.13.0 This subtask of adding the declarative scheduler is about adding the Executing state to Flink, including tests. Executing: The set of resources is stable and the scheduler can decide on the parallelism with which to execute the job. The ExecutionGraph is created and the execution of the job has started. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21255) Add DeclarativeScheduler / WaitingForResources state
Robert Metzger created FLINK-21255: -- Summary: Add DeclarativeScheduler / WaitingForResources state Key: FLINK-21255 URL: https://issues.apache.org/jira/browse/FLINK-21255 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger Assignee: Robert Metzger Fix For: 1.13.0 This subtask of adding the declarative scheduler is about adding the WaitingForResources state to Flink, including tests. Waiting for resources: The required resources are declared. The scheduler waits until either the requirements are fulfilled or the set of resources has stabilised. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21254) Add DeclarativeScheduler / Created state
Robert Metzger created FLINK-21254: -- Summary: Add DeclarativeScheduler / Created state Key: FLINK-21254 URL: https://issues.apache.org/jira/browse/FLINK-21254 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger Assignee: Robert Metzger Fix For: 1.13.0 This subtask of adding the declarative scheduler is about adding the Created state to Flink, including tests. Created: Initial state of the scheduler -- This message was sent by Atlassian Jira (v8.3.4#803005)
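Taken together, FLINK-21254 through FLINK-21260 describe the scheduler's lifecycle. A hypothetical summary of the states as listed in these tickets (illustrative only; the actual implementation may model the states differently):
{code:java}
// Hypothetical summary of the declarative/adaptive scheduler states described above.
public enum DeclarativeSchedulerState {
    CREATED,                // initial state of the scheduler
    WAITING_FOR_RESOURCES,  // resources declared; waiting until fulfilled or stabilised
    EXECUTING,              // resources stable; ExecutionGraph created and job running
    RESTARTING,             // recoverable fault; ExecutionGraph is canceled before a restart
    CANCELING,              // user canceled the job; ExecutionGraph is being canceled
    FAILING,                // unrecoverable fault; ExecutionGraph is being canceled
    FINISHED                // job execution has completed
}
{code}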
[jira] [Created] (FLINK-21166) NullPointer in CheckpointMetricsBuilder surfacing in "Resuming Savepoint" e2e
Robert Metzger created FLINK-21166: -- Summary: NullPointer in CheckpointMetricsBuilder surfacing in "Resuming Savepoint" e2e Key: FLINK-21166 URL: https://issues.apache.org/jira/browse/FLINK-21166 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.13.0 Reporter: Robert Metzger https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12562&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529 {code} Running 'Resuming Savepoint (rocks, scale down, rocks timers) end-to-end test' Found exception in log files; printing first 500 lines; see full logs for details: ... [FAIL] 'Resuming Savepoint (rocks, scale down, rocks timers) end-to-end test' failed after 0 minutes and 30 seconds! Test exited with exit code 0 but the logs contained errors, exceptions or non-empty .out files {code} One TaskManager log contains the following: {code} === Finished metrics report === 2021-01-27 15:22:49,635 WARN org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - Could not properly clean up the async checkpoint runnable. java.lang.IllegalStateException: null at org.apache.flink.util.Preconditions.checkState(Preconditions.java:177) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.util.Preconditions.checkCompletedNormally(Preconditions.java:261) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.concurrent.FutureUtils.checkStateAndGet(FutureUtils.java:1176) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.checkpoint.CheckpointMetricsBuilder.build(CheckpointMetricsBuilder.java:133) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.reportAbortedSnapshotStats(AsyncCheckpointRunnable.java:219) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.close(AsyncCheckpointRunnable.java:292) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.util.IOUtils.closeQuietly(IOUtils.java:275) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.cancelAsyncCheckpointRunnable(SubtaskCheckpointCoordinatorImpl.java:451) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpointAborted(SubtaskCheckpointCoordinatorImpl.java:340) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointAbortAsync$12(StreamTask.java:1069) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointOperation$13(StreamTask.java:1082) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:314) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:300) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:188) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:615) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:579) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:763) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apache.flink.runtime.taskmanager.Task.run(Task.java:565) [flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275] 2021-01-27 15:22:49,637 WARN org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - Could not properly clean up the async checkpoint runnable. java.lang.IllegalStateException: null at org.apache.flink.util.Preconditions.checkState(Preconditions.java:177) ~[flink-dist_2.11-1.13-SNAPSHOT.jar:1.13-SNAPSHOT] at org.apac
[jira] [Created] (FLINK-21136) Reactive Mode: Adjust timeout behavior in declarative scheduler
Robert Metzger created FLINK-21136: -- Summary: Reactive Mode: Adjust timeout behavior in declarative scheduler Key: FLINK-21136 URL: https://issues.apache.org/jira/browse/FLINK-21136 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger Fix For: 1.13.0 The FLIP states the following timeout and resource registration behavior: On initial startup, the declarative scheduler will wait indefinitely for TaskManagers to show up. Once there are enough TaskManagers available to start the job, and the set of resources is stable (see FLIP-160 for a definition), the job will start running. Once the job has started running, and a TaskManager is lost, it will wait for 10 seconds for the TaskManager to re-appear. Otherwise, the job will be scheduled again with the available resources. If no TaskManagers are available anymore, the declarative scheduler will wait indefinitely again for new resources. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21135) Reactive Mode: Change Declarative Scheduler to set infinite parallelism in JobGraph
Robert Metzger created FLINK-21135: -- Summary: Reactive Mode: Change Declarative Scheduler to set infinite parallelism in JobGraph Key: FLINK-21135 URL: https://issues.apache.org/jira/browse/FLINK-21135 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger For Reactive Mode, the scheduler needs to change the parallelism and maxParallelism of the submitted job graph to their maximum value (2^15). -- This message was sent by Atlassian Jira (v8.3.4#803005)
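A hedged sketch of how the submitted job graph could be adjusted (illustrative only; JobGraph and JobVertex are existing runtime classes, while the helper class and method here are made up):
{code:java}
import org.apache.flink.runtime.jobgraph.JobGraph;
import org.apache.flink.runtime.jobgraph.JobVertex;

// Illustrative only: bump every vertex to the maximum supported parallelism so the
// scheduler can later scale the job to whatever resources are available.
public final class ReactiveModeJobGraphAdjuster {

    // 2^15 is the upper bound mentioned in the ticket.
    private static final int UPPER_BOUND_MAX_PARALLELISM = 1 << 15;

    public static void setMaxPossibleParallelism(JobGraph jobGraph) {
        for (JobVertex vertex : jobGraph.getVertices()) {
            vertex.setParallelism(UPPER_BOUND_MAX_PARALLELISM);
            vertex.setMaxParallelism(UPPER_BOUND_MAX_PARALLELISM);
        }
    }

    private ReactiveModeJobGraphAdjuster() {}
}
{code}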
[jira] [Created] (FLINK-21134) Introduce execution mode configuration key and check for supported ClusterEntrypoint type
Robert Metzger created FLINK-21134: -- Summary: Introduce execution mode configuration key and check for supported ClusterEntrypoint type Key: FLINK-21134 URL: https://issues.apache.org/jira/browse/FLINK-21134 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Robert Metzger Fix For: 1.13.0 According to the FLIP, introduce an "execution-mode" configuration key, and check in the ClusterEntrypoint if the chosen entry point type is supported by the selected execution mode. -- This message was sent by Atlassian Jira (v8.3.4#803005)
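A hedged sketch of the described option and entrypoint check (the enum values, the standalone-only restriction and all method names are assumptions for illustration, not the final design):
{code:java}
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.flink.configuration.Configuration;

// Illustrative only: an execution-mode option plus a validation step that an
// entrypoint could run at startup.
public class ExecutionModeCheck {

    enum SchedulerExecutionMode { NONE, REACTIVE }

    static final ConfigOption<SchedulerExecutionMode> EXECUTION_MODE =
            ConfigOptions.key("execution-mode")
                    .enumType(SchedulerExecutionMode.class)
                    .defaultValue(SchedulerExecutionMode.NONE)
                    .withDescription("Execution mode of the scheduler (hypothetical key).");

    static void checkSupportedEntrypoint(Configuration conf, boolean isStandaloneEntrypoint) {
        if (conf.get(EXECUTION_MODE) == SchedulerExecutionMode.REACTIVE && !isStandaloneEntrypoint) {
            throw new IllegalStateException(
                    "Reactive execution mode is only supported by the standalone entrypoint in this sketch.");
        }
    }
}
{code}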
[jira] [Created] (FLINK-21073) Mention that RocksDB ignores equals/hashCode because it works on binary data
Robert Metzger created FLINK-21073: -- Summary: Mention that RocksDB ignores equals/hashCode because it works on binary data Key: FLINK-21073 URL: https://issues.apache.org/jira/browse/FLINK-21073 Project: Flink Issue Type: Bug Components: Documentation, Runtime / State Backends Reporter: Robert Metzger Fix For: 1.13.0 See https://lists.apache.org/thread.html/ra43e2b5d388831290c293b9daf0eee0b0a5d9712543b62c83234a829%40%3Cuser.flink.apache.org%3E -- This message was sent by Atlassian Jira (v8.3.4#803005)
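A small, self-contained illustration of the point behind this documentation ticket (not Flink code): when a store compares keys on their serialized bytes, equals()/hashCode() overrides on the key class never come into play.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: key identity decided by serialized bytes, not by equals().
public class BinaryKeyComparisonDemo {

    // A toy "serializer": keys with different whitespace produce different bytes,
    // even if a custom equals() would consider them equal.
    static byte[] serialize(String key) {
        return key.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String a = "user-1";
        String b = "user-1 "; // trailing space

        // An equals() that trims would say these are "the same" key...
        boolean logicallyEqual = a.trim().equals(b.trim());

        // ...but a byte-comparing store such as RocksDB sees two distinct keys.
        boolean sameBytes = Arrays.equals(serialize(a), serialize(b));

        System.out.println("logicallyEqual = " + logicallyEqual); // true
        System.out.println("sameBytes      = " + sameBytes);      // false
    }
}
{code}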
[jira] [Created] (FLINK-21065) Passing configuration from TableEnvironmentImpl to MiniCluster is not supported
Robert Metzger created FLINK-21065: -- Summary: Passing configuration from TableEnvironmentImpl to MiniCluster is not supported Key: FLINK-21065 URL: https://issues.apache.org/jira/browse/FLINK-21065 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.13.0 Reporter: Robert Metzger While trying to fix a bug in Flink's scheduler, I needed to pass a configuration parameter from the TableEnvironmentITCase (blink planner) to the TaskManager. Changing this:
{code}
case "TableEnvironment" =>
  tEnv = TableEnvironmentImpl.create(settings)
{code}
to this:
{code}
case "TableEnvironment" =>
  tEnv = TableEnvironmentImpl.create(settings)
  val conf = new Configuration();
  conf.setInteger("taskmanager.numberOfTaskSlots", 1)
  tEnv.getConfig.addConfiguration(conf)
{code}
had no effect on the launched TaskManager. It seems that the configuration is not properly forwarded through all abstractions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-21051) SQLClientSchemaRegistryITCase unstable with InternalServerErrorException: Status 500
Robert Metzger created FLINK-21051: -- Summary: SQLClientSchemaRegistryITCase unstable with InternalServerErrorException: Status 500 Key: FLINK-21051 URL: https://issues.apache.org/jira/browse/FLINK-21051 Project: Flink Issue Type: Bug Components: Table SQL / Ecosystem, Tests Affects Versions: 1.13.0 Reporter: Robert Metzger https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=12253&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=ff888d9b-cd34-53cc-d90f-3e446d355529 {code} 2021-01-20T00:10:21.3510385Z Jan 20 00:10:21 2021-01-20T00:10:21.3516246Z Jan 20 00:10:21 [ERROR] testWriting(org.apache.flink.tests.util.kafka.SQLClientSchemaRegistryITCase) Time elapsed: 0.001 s <<< ERROR! 2021-01-20T00:10:21.3517459Z Jan 20 00:10:21 java.lang.RuntimeException: Could not build the flink-dist image 2021-01-20T00:10:21.3518178Z Jan 20 00:10:21at org.apache.flink.tests.util.flink.FlinkContainer$FlinkContainerBuilder.build(FlinkContainer.java:281) 2021-01-20T00:10:21.3519176Z Jan 20 00:10:21at org.apache.flink.tests.util.kafka.SQLClientSchemaRegistryITCase.(SQLClientSchemaRegistryITCase.java:88) 2021-01-20T00:10:21.3519873Z Jan 20 00:10:21at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) 2021-01-20T00:10:21.3520537Z Jan 20 00:10:21at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) 2021-01-20T00:10:21.3521390Z Jan 20 00:10:21at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) 2021-01-20T00:10:21.3522080Z Jan 20 00:10:21at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 2021-01-20T00:10:21.3522730Z Jan 20 00:10:21at org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:217) 2021-01-20T00:10:21.3523452Z Jan 20 00:10:21at org.junit.runners.BlockJUnit4ClassRunner$1.runReflectiveCall(BlockJUnit4ClassRunner.java:266) 2021-01-20T00:10:21.3524237Z Jan 20 00:10:21at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) 2021-01-20T00:10:21.3524879Z Jan 20 00:10:21at org.junit.runners.BlockJUnit4ClassRunner.methodBlock(BlockJUnit4ClassRunner.java:263) 2021-01-20T00:10:21.3525527Z Jan 20 00:10:21at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) 2021-01-20T00:10:21.3526157Z Jan 20 00:10:21at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) 2021-01-20T00:10:21.3526754Z Jan 20 00:10:21at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) 2021-01-20T00:10:21.3527316Z Jan 20 00:10:21at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) 2021-01-20T00:10:21.3527884Z Jan 20 00:10:21at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) 2021-01-20T00:10:21.3528462Z Jan 20 00:10:21at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) 2021-01-20T00:10:21.3529491Z Jan 20 00:10:21at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) 2021-01-20T00:10:21.3530220Z Jan 20 00:10:21at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) 2021-01-20T00:10:21.3530970Z Jan 20 00:10:21at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) 2021-01-20T00:10:21.3531649Z Jan 20 00:10:21at java.util.concurrent.FutureTask.run(FutureTask.java:266) 2021-01-20T00:10:21.3532201Z Jan 20 00:10:21at java.lang.Thread.run(Thread.java:748) 2021-01-20T00:10:21.3533545Z Jan 20 00:10:21 Caused by: 
com.github.dockerjava.api.exception.InternalServerErrorException: Status 500: {"message":"Get https://registry-1.docker.io/v2/testcontainers/ryuk/manifests/0.3.0: received unexpected HTTP status: 502 Bad Gateway"} 2021-01-20T00:10:21.3534353Z Jan 20 00:10:21 2021-01-20T00:10:21.3534955Z Jan 20 00:10:21at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.execute(DefaultInvocationBuilder.java:247) 2021-01-20T00:10:21.3536388Z Jan 20 00:10:21at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.lambda$executeAndStream$1(DefaultInvocationBuilder.java:269) 2021-01-20T00:10:21.3537066Z Jan 20 00:10:21... 1 more 2021-01-20T00:10:21.3541323Z Jan 20 00:10:21 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-20929) Add automated deployment of nightly docker images
Robert Metzger created FLINK-20929: -- Summary: Add automated deployment of nightly docker images Key: FLINK-20929 URL: https://issues.apache.org/jira/browse/FLINK-20929 Project: Flink Issue Type: Task Components: Build System / Azure Pipelines Reporter: Robert Metzger There've been a few developers asking for nightly builds (think apache/flink:1.13-SNAPSHOT) of Flink. In INFRA-21276, Flink got access to the "apache/flink" DockerHub repository, which we could use for this purpose. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-20928) KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete:189->pollUntil:270 » Timeout
Robert Metzger created FLINK-20928: -- Summary: KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete:189->pollUntil:270 » Timeout Key: FLINK-20928 URL: https://issues.apache.org/jira/browse/FLINK-20928 Project: Flink Issue Type: Bug Components: Connectors / Kafka Affects Versions: 1.13.0 Reporter: Robert Metzger https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=11861&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=1fb1a56f-e8b5-5a82-00a0-a2db7757b4f5 {code} [ERROR] Tests run: 8, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 93.992 s <<< FAILURE! - in org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest [ERROR] testOffsetCommitOnCheckpointComplete(org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest) Time elapsed: 60.086 s <<< ERROR! java.util.concurrent.TimeoutException: The offset commit did not finish before timeout. at org.apache.flink.core.testutils.CommonTestUtils.waitUtil(CommonTestUtils.java:210) at org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest.pollUntil(KafkaSourceReaderTest.java:270) at org.apache.flink.connector.kafka.source.reader.KafkaSourceReaderTest.testOffsetCommitOnCheckpointComplete(KafkaSourceReaderTest.java:189) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-20532) generate-stackbrew-library.sh in flink-docker doesn't properly prune the java11 tag
Robert Metzger created FLINK-20532: -- Summary: generate-stackbrew-library.sh in flink-docker doesn't properly prune the java11 tag Key: FLINK-20532 URL: https://issues.apache.org/jira/browse/FLINK-20532 Project: Flink Issue Type: Bug Components: flink-docker Affects Versions: 1.12.0 Reporter: Robert Metzger Assignee: Robert Metzger The output of {generate-stackbrew-library.sh} contains two {java11} tags. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-20531) Flink docs are not building anymore due to builder change
Robert Metzger created FLINK-20531: -- Summary: Flink docs are not building anymore due to builder change Key: FLINK-20531 URL: https://issues.apache.org/jira/browse/FLINK-20531 Project: Flink Issue Type: Bug Components: Documentation Reporter: Robert Metzger Assignee: Robert Metzger The Flink docs are not building anymore, due to {code} r1068824 | dfoulks | 2020-12-07 18:53:38 +0100 (Mon, 07 Dec 2020) | 1 line Moved bb-slave1 jobs to bb-slave7 and bb-slave2 jobs to bb-slave8 {code} bb-slave2 has "rvm" installed, "bb-slave8" doesn't: https://ci.apache.org/builders/flink-docs-release-1.11/builds/161 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-20455) Add check to LicenseChecker for top level /LICENSE files in shaded jars
Robert Metzger created FLINK-20455: -- Summary: Add check to LicenseChecker for top level /LICENSE files in shaded jars Key: FLINK-20455 URL: https://issues.apache.org/jira/browse/FLINK-20455 Project: Flink Issue Type: Task Components: Build System / CI Reporter: Robert Metzger Fix For: 1.13.0 During the release verification of the 1.12.0 release, we noticed that several modules ship LICENSE files in their jar files that are not the Apache License. This could mislead users into thinking that the JARs are licensed under something other than the ASL. -- This message was sent by Atlassian Jira (v8.3.4#803005)
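A hedged sketch of what such a check could look like (the entry-name matching and class/method names are assumptions for illustration; this is not the actual LicenseChecker code):
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Illustrative only: list top-level LICENSE-like entries in a (shaded) jar so a
// CI check can flag jars that ship non-Apache license files at the root.
public class TopLevelLicenseScan {

    public static List<String> topLevelLicenseEntries(String jarPath) throws IOException {
        List<String> hits = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                // "top level" means: no '/' in the entry name
                if (!name.contains("/") && name.toUpperCase().startsWith("LICENSE")) {
                    hits.add(name);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java TopLevelLicenseScan path/to/some-shaded.jar
        System.out.println(topLevelLicenseEntries(args[0]));
    }
}
{code}
The actual check would additionally inspect the content of the found files to decide whether they are Apache licenses or something that needs to be flagged.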
[jira] [Created] (FLINK-20449) UnalignedCheckpointITCase times out
Robert Metzger created FLINK-20449: -- Summary: UnalignedCheckpointITCase times out Key: FLINK-20449 URL: https://issues.apache.org/jira/browse/FLINK-20449 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.13.0 Reporter: Robert Metzger Fix For: 1.13.0 https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=10416&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=a99e99c7-21cd-5a1f-7274-585e62b72f56 {code} 2020-12-02T01:24:33.7219846Z [ERROR] Tests run: 10, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 746.887 s <<< FAILURE! - in org.apache.flink.test.checkpointing.UnalignedCheckpointITCase 2020-12-02T01:24:33.7220860Z [ERROR] execute[Parallel cogroup, p = 10](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time elapsed: 300.24 s <<< ERROR! 2020-12-02T01:24:33.7221663Z org.junit.runners.model.TestTimedOutException: test timed out after 300 seconds 2020-12-02T01:24:33.7222017Zat sun.misc.Unsafe.park(Native Method) 2020-12-02T01:24:33.7222390Zat java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 2020-12-02T01:24:33.7222882Zat java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) 2020-12-02T01:24:33.7223356Zat java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) 2020-12-02T01:24:33.7223840Zat java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) 2020-12-02T01:24:33.7224320Zat java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) 2020-12-02T01:24:33.7224864Zat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1842) 2020-12-02T01:24:33.7225500Zat org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:70) 2020-12-02T01:24:33.7226297Zat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1822) 2020-12-02T01:24:33.7226929Zat org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1804) 2020-12-02T01:24:33.7227572Zat org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:122) 2020-12-02T01:24:33.7228187Zat org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:159) 2020-12-02T01:24:33.7228680Zat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 2020-12-02T01:24:33.7229099Zat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 2020-12-02T01:24:33.7229617Zat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 2020-12-02T01:24:33.7230068Zat java.lang.reflect.Method.invoke(Method.java:498) 2020-12-02T01:24:33.7230733Zat org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) 2020-12-02T01:24:33.7231262Zat org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) 2020-12-02T01:24:33.7231775Zat org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) 2020-12-02T01:24:33.7232276Zat org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) 2020-12-02T01:24:33.7232732Zat org.junit.rules.Verifier$1.evaluate(Verifier.java:35) 2020-12-02T01:24:33.7233144Zat org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) 2020-12-02T01:24:33.7233663Zat org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) 
2020-12-02T01:24:33.7234239Zat org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) 2020-12-02T01:24:33.7234735Zat java.util.concurrent.FutureTask.run(FutureTask.java:266) 2020-12-02T01:24:33.7235093Zat java.lang.Thread.run(Thread.java:748) 2020-12-02T01:24:33.7235305Z 2020-12-02T01:24:33.7235728Z [ERROR] execute[Parallel union, p = 10](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time elapsed: 300.539 s <<< ERROR! 2020-12-02T01:24:33.7236436Z org.junit.runners.model.TestTimedOutException: test timed out after 300 seconds 2020-12-02T01:24:33.7236790Zat sun.misc.Unsafe.park(Native Method) 2020-12-02T01:24:33.7237158Zat java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) 2020-12-02T01:24:33.7237641Zat java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707) 2020-12-02T01:24:33.7238118Zat java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323) 2020-12-02T01:24:33.7238599Zat java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742) 2020-12-02T01:24:33.7239885Zat java.util.concu
[jira] [Created] (FLINK-20442) Fix license documentation mistakes in flink-python.jar
Robert Metzger created FLINK-20442: -- Summary: Fix license documentation mistakes in flink-python.jar Key: FLINK-20442 URL: https://issues.apache.org/jira/browse/FLINK-20442 Project: Flink Issue Type: Bug Components: API / Python Affects Versions: 1.12.0 Reporter: Robert Metzger Fix For: 1.12.0 Issues reported by Chesnay:
- The flink-python jar contains 2 license files in the root directory and another 2 in the META-INF directory. This should be reduced down to 1 under META-INF. I'm inclined to block the release on this because the root license is BSD.
- The flink-python jar appears to bundle lz4 (native libraries under win32/, linux/ and darwin/), but this is neither listed in the NOTICE nor do we have an explicit license file for it.
Other minor things that we should address in the future:
- opt/python contains some LICENSE files that should instead be placed under licenses/
- licenses/ contains a stray "ASM" file containing the ASM license. It's not a problem (because it is identical with our intended copy), but it indicates that something is amiss. This seems to originate from the flink-python jar, which bundles some beam stuff, which bundles bytebuddy, which bundles this license file. From what I can tell bytebuddy is not actually bundling ASM though; they just bundle the license for whatever reason. It is not listed as bundled in the flink-python NOTICE though, so I wouldn't block the release on it.
-- This message was sent by Atlassian Jira (v8.3.4#803005)