[jira] [Created] (FLINK-13616) Update BucketingSinkMigrationTest to restore from 1.9 savepoint
Till Rohrmann created FLINK-13616: - Summary: Update BucketingSinkMigrationTest to restore from 1.9 savepoint Key: FLINK-13616 URL: https://issues.apache.org/jira/browse/FLINK-13616 Project: Flink Issue Type: Sub-task Components: Tests Affects Versions: 1.10.0 Reporter: Till Rohrmann Fix For: 1.10.0 Update {{BucketingSinkMigrationTest}} to restore from 1.9 savepoint. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13614) Add MigrationVersion.v1_9
Till Rohrmann created FLINK-13614: - Summary: Add MigrationVersion.v1_9 Key: FLINK-13614 URL: https://issues.apache.org/jira/browse/FLINK-13614 Project: Flink Issue Type: Sub-task Components: Tests Affects Versions: 1.10.0 Reporter: Till Rohrmann Fix For: 1.10.0 Add {{MigrationVersion.v1_9}} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13615) Update CEPMigrationTest to restore from 1.9 savepoint
Till Rohrmann created FLINK-13615: - Summary: Update CEPMigrationTest to restore from 1.9 savepoint Key: FLINK-13615 URL: https://issues.apache.org/jira/browse/FLINK-13615 Project: Flink Issue Type: Sub-task Components: Tests Affects Versions: 1.10.0 Reporter: Till Rohrmann Fix For: 1.10.0 Update {{CEPMigrationTest}} to restore from 1.9 savepoint -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13613) Update migration tests for Flink 1.9
Till Rohrmann created FLINK-13613: - Summary: Update migration tests for Flink 1.9 Key: FLINK-13613 URL: https://issues.apache.org/jira/browse/FLINK-13613 Project: Flink Issue Type: Task Components: Tests Affects Versions: 1.10.0 Reporter: Till Rohrmann Fix For: 1.10.0 Once the Flink {{1.9.0}} release is out, we should update existing migration tests to cover restoring from {{1.9.0}} savepoints. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13608) Update upgrade compatibility table (docs/ops/upgrading.md) for 1.9.0
Till Rohrmann created FLINK-13608: - Summary: Update upgrade compatibility table (docs/ops/upgrading.md) for 1.9.0 Key: FLINK-13608 URL: https://issues.apache.org/jira/browse/FLINK-13608 Project: Flink Issue Type: Task Components: Documentation Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 Update upgrade compatibility table (docs/ops/upgrading.md) for 1.9.0 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13607) TPC-H end-to-end test (Blink planner) failed on Travis
Till Rohrmann created FLINK-13607: - Summary: TPC-H end-to-end test (Blink planner) failed on Travis Key: FLINK-13607 URL: https://issues.apache.org/jira/browse/FLINK-13607 Project: Flink Issue Type: Bug Components: Table SQL / API, Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 The {{TPC-H end-to-end test (Blink planner)}} fail consistently on Travis with {code} Generating test data... Error: Could not find or load main class org.apache.flink.table.tpch.TpchDataGenerator {code} https://api.travis-ci.org/v3/job/568280203/log.txt https://api.travis-ci.org/v3/job/568280209/log.txt https://api.travis-ci.org/v3/job/568280215/log.txt -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13606) PrometheusReporterEndToEndITCase.testReporter unstable on Travis
Till Rohrmann created FLINK-13606: - Summary: PrometheusReporterEndToEndITCase.testReporter unstable on Travis Key: FLINK-13606 URL: https://issues.apache.org/jira/browse/FLINK-13606 Project: Flink Issue Type: Bug Components: Runtime / Metrics Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 The {{PrometheusReporterEndToEndITCase.testReporter}} is unstable on Travis. It fails with {{java.io.IOException: Process failed due to timeout.}} https://api.travis-ci.org/v3/job/568280216/log.txt -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13599) Kinesis end-to-end test failed on Travis
Till Rohrmann created FLINK-13599: - Summary: Kinesis end-to-end test failed on Travis Key: FLINK-13599 URL: https://issues.apache.org/jira/browse/FLINK-13599 Project: Flink Issue Type: Test Components: Connectors / Kinesis, Tests Affects Versions: 1.9.0, 1.10.0 Reporter: Till Rohrmann Fix For: 1.9.0 The {{Kinesis end-to-end test}} failed on Travis with {code} 2019-08-06 08:48:20,177 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command. org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Unable to execute HTTP request: Connect to localhost:4567 [localhost/127.0.0.1] failed: Connection refused (Connection refused) at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:593) at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438) at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274) at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746) at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273) at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205) at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010) at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083) at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083) Caused by: org.apache.flink.kinesis.shaded.com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to localhost:4567 [localhost/127.0.0.1] failed: Connection refused (Connection refused) at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1116) at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1066) at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743) at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717) at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699) at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667) at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649) at org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513) at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2388) at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2364) at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeDescribeStream(AmazonKinesisClient.java:754) at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.describeStream(AmazonKinesisClient.java:729) at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.describeStream(AmazonKinesisClient.java:766) at org.apache.flink.streaming.kinesis.test.KinesisPubsubClient.createTopic(KinesisPubsubClient.java:63) at org.apache.flink.streaming.kinesis.test.KinesisExampleTest.main(KinesisExampleTest.java:57) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576) ... 9 more Caused by: org.apache.flink.kinesis.shaded.org.apache.http.conn.HttpHostConnectException: Connect to localhost:4567 [localhost/127.0.0.1] failed: Connection refused (Connection refused) at org.apache.flink.kinesis.shaded.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159) at org.apache.flink.kinesis.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
[jira] [Created] (FLINK-13597) Test Flink on cluster
Till Rohrmann created FLINK-13597: - Summary: Test Flink on cluster Key: FLINK-13597 URL: https://issues.apache.org/jira/browse/FLINK-13597 Project: Flink Issue Type: Test Components: Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 In order to verify the Flink {{1.9.0}} release, we should try it out on one of the available cloud providers (AWS, GCP). I would suggest to run a non-trivial workload on the cluster with using some of the new features: - SQL job - Filesystem being loaded as a plugin - Enable fine-grained recovery for batch job - Java 9 - Checkpoints to GFS -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13595) KafkaITCase.testBigRecordJob fails on Travis
Till Rohrmann created FLINK-13595: - Summary: KafkaITCase.testBigRecordJob fails on Travis Key: FLINK-13595 URL: https://issues.apache.org/jira/browse/FLINK-13595 Project: Flink Issue Type: Bug Components: Connectors / Kafka, Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 The {{KafkaITCase.testBigRecordJob}} failed with a {{TestTimedOutException}} {code} Test testBigRecordJob(org.apache.flink.streaming.connectors.kafka.KafkaITCase) failed with: org.junit.runners.model.TestTimedOutException: test timed out after 6 milliseconds at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1252) at java.lang.Thread.join(Thread.java:1326) at org.apache.kafka.clients.admin.KafkaAdminClient.close(KafkaAdminClient.java:476) at org.apache.kafka.clients.admin.AdminClient.close(AdminClient.java:92) at org.apache.kafka.clients.admin.AdminClient.close(AdminClient.java:75) at org.apache.flink.streaming.connectors.kafka.KafkaTestEnvironmentImpl.deleteTestTopic(KafkaTestEnvironmentImpl.java:153) at org.apache.flink.streaming.connectors.kafka.KafkaTestBase.deleteTestTopic(KafkaTestBase.java:204) at org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase.runBigRecordTestTopology(KafkaConsumerTestBase.java:1336) at org.apache.flink.streaming.connectors.kafka.KafkaITCase.testBigRecordJob(KafkaITCase.java:121) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298) at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.lang.Thread.run(Thread.java:748) {code} https://api.travis-ci.org/v3/job/568176170/log.txt -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13567) Avro Confluent Schema Registry nightly end-to-end test failed on Travis
Till Rohrmann created FLINK-13567: - Summary: Avro Confluent Schema Registry nightly end-to-end test failed on Travis Key: FLINK-13567 URL: https://issues.apache.org/jira/browse/FLINK-13567 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.9.0, 1.10.0 Reporter: Till Rohrmann The {{Avro Confluent Schema Registry nightly end-to-end test}} failed on Travis with {code} [FAIL] 'Avro Confluent Schema Registry nightly end-to-end test' failed after 2 minutes and 11 seconds! Test exited with exit code 1 No taskexecutor daemon (pid: 29044) is running anymore on travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501. No standalonesession daemon to stop on host travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501. rm: cannot remove '/home/travis/build/apache/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/plugins': No such file or directory {code} https://api.travis-ci.org/v3/job/567273939/log.txt -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13557) Improve failover strategy documentation with explanation of concepts
Till Rohrmann created FLINK-13557: - Summary: Improve failover strategy documentation with explanation of concepts Key: FLINK-13557 URL: https://issues.apache.org/jira/browse/FLINK-13557 Project: Flink Issue Type: Improvement Components: Documentation Affects Versions: 1.9.0, 1.10.0 Reporter: Till Rohrmann The current failover strategy configuration documentation (https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#failover-strategies) could be improved by explaining the concept of pipelined regions in more detail. It could benefit from some figures visualizing the concept. We should even consider adding the explanation of pipelined regions to Flink's concept page because it is quite central to how Flink (will) work. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13556) Python profile failed on Travis with setup problem
Till Rohrmann created FLINK-13556: - Summary: Python profile failed on Travis with setup problem Key: FLINK-13556 URL: https://issues.apache.org/jira/browse/FLINK-13556 Project: Flink Issue Type: Bug Components: API / Python Affects Versions: 1.9.0, 1.10.0 Reporter: Till Rohrmann Fix For: 1.9.0 The Python profile failed on Travis because conda install 3.5 failed with the following error {{Error([('SSL routines', 'ssl3_get_record', 'decryption failed or bad record mac')])}}. https://api.travis-ci.org/v3/job/566835602/log.txt -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis
Till Rohrmann created FLINK-13553: - Summary: KvStateServerHandlerTest.readInboundBlocking unstable on Travis Key: FLINK-13553 URL: https://issues.apache.org/jira/browse/FLINK-13553 Project: Flink Issue Type: Bug Components: Runtime / Queryable State Affects Versions: 1.10.0 Reporter: Till Rohrmann Fix For: 1.10.0 The {{KvStateServerHandlerTest.readInboundBlocking}} and {{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a {{TimeoutException}}. https://api.travis-ci.org/v3/job/566420641/log.txt -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13497) Checkpoints can complete after CheckpointFailureManager fails job
Till Rohrmann created FLINK-13497: - Summary: Checkpoints can complete after CheckpointFailureManager fails job Key: FLINK-13497 URL: https://issues.apache.org/jira/browse/FLINK-13497 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.9.0, 1.10.0 Reporter: Till Rohrmann I think that we introduced with FLINK-12364 an inconsistency wrt to job termination a checkpointing. In FLINK-9900 it was discovered that checkpoints can complete even after the {{CheckpointFailureManager}} decided to fail a job. I think the expected behaviour should be that we fail all pending checkpoints once the {{CheckpointFailureManager}} decides to fail the job. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13476) Release partitions for FINISHED or FAILED tasks if they are cancelled
Till Rohrmann created FLINK-13476: - Summary: Release partitions for FINISHED or FAILED tasks if they are cancelled Key: FLINK-13476 URL: https://issues.apache.org/jira/browse/FLINK-13476 Project: Flink Issue Type: Bug Components: Runtime / Coordination Reporter: Till Rohrmann Fix For: 1.9.0 With FLINK-12615 we removed that partitions are being explicitly released from the JM if an {{Execution}} which is in state {{FINISHED}} or {{FAILED}} is being cancelled. In order to not have resource leak when using pipelined result partitions whose consumers fail before start consuming, we should re-introduce the deleted else branch (removed via 408f6b67aefaccfc76708b2d4772eb7f0a8fd984). Once we properly wait that a {{Task}} does not finish until its produced results have been either persisted or sent to a consumer, then we should be able to remove this branch again. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13475) Reduce dependency on third-party maven repositories
Till Rohrmann created FLINK-13475: - Summary: Reduce dependency on third-party maven repositories Key: FLINK-13475 URL: https://issues.apache.org/jira/browse/FLINK-13475 Project: Flink Issue Type: Improvement Components: Connectors / Hive Affects Versions: 1.9.0 Reporter: Till Rohrmann A user reported that Flink's Hive connectors requires third-party maven repositories which are not everywhere accessible in order to build. Concretely, the hive connector requires access to Conjars for org.pentaho:pentaho-aggdesigner-algorithm and javax.jms:jms:jar:1.1. It would be great to reduce the dependency on third-party maven repositories if possible. For future reference, other projects faced similar problems: CALCITE-605, CALCITE-1474 -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13441) Add batch E2E test which runs with fewer slots than parallelism
Till Rohrmann created FLINK-13441: - Summary: Add batch E2E test which runs with fewer slots than parallelism Key: FLINK-13441 URL: https://issues.apache.org/jira/browse/FLINK-13441 Project: Flink Issue Type: Test Components: API / DataSet, Tests Reporter: Till Rohrmann Fix For: 1.9.0 We should adapt the existing batch E2E test to use the newly introduced {{ScheduleMode#LAZY_FROM_SOURCES_WITH_BATCH_SLOT_REQUEST}} and verify that the job runs on a cluster with fewer slots than the job's parallelism. In order to make this work, we need to set the shuffles to be blocking via {{ExecutionMode#BATCH}}. As a batch job we should use the {{DataSetAllroundTestProgram}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13439) Run existing SQL/Table API E2E tests with blink runner
Till Rohrmann created FLINK-13439: - Summary: Run existing SQL/Table API E2E tests with blink runner Key: FLINK-13439 URL: https://issues.apache.org/jira/browse/FLINK-13439 Project: Flink Issue Type: Test Components: Table SQL / API, Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 We should run all existing SQL/Table API E2E tests with the blink runner. As part of FLINK-13273 the {{test_sql_client.sh}} test will already be ported to run with blink. Additionally we also need to enable the {{test_streaming_sql.sh}} test to run with the blink runner. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13437) Add Hive SQL E2E test
Till Rohrmann created FLINK-13437: - Summary: Add Hive SQL E2E test Key: FLINK-13437 URL: https://issues.apache.org/jira/browse/FLINK-13437 Project: Flink Issue Type: Test Components: Connectors / Hive, Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 We should add an E2E test for the Hive integration: List all tables and read some metadata, read from an existing table, register a new table in Hive, use a registered function, write to an existing table, write to a new table. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13436) Add TPC-H queries as E2E tests
Till Rohrmann created FLINK-13436: - Summary: Add TPC-H queries as E2E tests Key: FLINK-13436 URL: https://issues.apache.org/jira/browse/FLINK-13436 Project: Flink Issue Type: Test Components: Table SQL / Planner, Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 We should add the TPC-H queries as E2E tests in order to verify the blink planner. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13434) Add E2E test for stop-with-savepoint
Till Rohrmann created FLINK-13434: - Summary: Add E2E test for stop-with-savepoint Key: FLINK-13434 URL: https://issues.apache.org/jira/browse/FLINK-13434 Project: Flink Issue Type: Test Components: Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 We should modify the existing cancel-with-savepoint E2E test to use the new stop-with-savepoint command. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13249) Distributed Jepsen test fails with blocked TaskExecutor
Till Rohrmann created FLINK-13249: - Summary: Distributed Jepsen test fails with blocked TaskExecutor Key: FLINK-13249 URL: https://issues.apache.org/jira/browse/FLINK-13249 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 The distributed Jepsen test which kills {{JobMasters}} started to fail recently. From a first glance, it looks as if the {{TaskExecutor's}} main thread is blocked by some operation. Further investigation is required. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13228) HadoopRecoverableWriterTest.testCommitAfterNormalClose fails on Travis
Till Rohrmann created FLINK-13228: - Summary: HadoopRecoverableWriterTest.testCommitAfterNormalClose fails on Travis Key: FLINK-13228 URL: https://issues.apache.org/jira/browse/FLINK-13228 Project: Flink Issue Type: Bug Components: FileSystems Affects Versions: 1.9.0 Reporter: Till Rohrmann {{HadoopRecoverableWriterTest.testCommitAfterNormalClose}} failed on Travis with {code} HadoopRecoverableWriterTest.testCommitAfterNormalClose » IO The stream is closed {code} https://api.travis-ci.org/v3/job/557293706/log.txt -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13226) KafkaProducerExactlyOnceITCase.testMultipleSinkOperators fails on Travis
Till Rohrmann created FLINK-13226: - Summary: KafkaProducerExactlyOnceITCase.testMultipleSinkOperators fails on Travis Key: FLINK-13226 URL: https://issues.apache.org/jira/browse/FLINK-13226 Project: Flink Issue Type: Bug Components: Connectors / Kafka Affects Versions: 1.9.0 Reporter: Till Rohrmann The {{KafkaProducerExactlyOnceITCase.testMultipleSinkOperators}} fails on Travis with not producing output for 300 s. https://api.travis-ci.org/v3/job/557290235/log.txt -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13216) AggregateITCase.testNestedGroupByAgg fails on Travis
Till Rohrmann created FLINK-13216: - Summary: AggregateITCase.testNestedGroupByAgg fails on Travis Key: FLINK-13216 URL: https://issues.apache.org/jira/browse/FLINK-13216 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 The {{AggregateITCase.testNestedGroupByAgg}} fails on Travis with {code} AggregateITCase.testNestedGroupByAgg:472 expected: but was: {code} https://api.travis-ci.org/v3/job/557214216/log.txt -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (FLINK-13187) Expose batch slot request via new scheduling mode
Till Rohrmann created FLINK-13187: - Summary: Expose batch slot request via new scheduling mode Key: FLINK-13187 URL: https://issues.apache.org/jira/browse/FLINK-13187 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.9.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.9.0 In order to leverage the newly introduced batch slot request feature introduced with FLINK-13166, we should introduce a new {{ScheduleMode#LAZY_FROM_SOURCES_WITH_BATCH_SLOT_REQUESTS}}. Users who enable this, must make sure that the generated job does not contain any pipelined shuffles because this can lead to resource deadlocks (not timing out requests with too few slots to make a pipelined region run). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-13166) Support batch slot requests
Till Rohrmann created FLINK-13166: - Summary: Support batch slot requests Key: FLINK-13166 URL: https://issues.apache.org/jira/browse/FLINK-13166 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 In order to support the execution of batch jobs with fewer slots than requested we need to introduce a special slot request notion which does not register an eager timeout. Moreover, this slot request should not react to failure signals from the {{ResourceManager}} and only time out if there is not available or allocated slot which can fulfill the requested {{ResourceProfile}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-13165) Complete requested slots in request order
Till Rohrmann created FLINK-13165: - Summary: Complete requested slots in request order Key: FLINK-13165 URL: https://issues.apache.org/jira/browse/FLINK-13165 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 When executing batch jobs with fewer slots than requested we should make sure that the slot requests are being completed in the order in which they were enqueued into the {{SlotPool}}. Otherwise we might risk that a consumer task gets deployed before a producer causing a resource deadlock situation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-13156) Elasticsearch (v1.7.1) sink end-to-end test failed on Travis
Till Rohrmann created FLINK-13156: - Summary: Elasticsearch (v1.7.1) sink end-to-end test failed on Travis Key: FLINK-13156 URL: https://issues.apache.org/jira/browse/FLINK-13156 Project: Flink Issue Type: Bug Components: Connectors / ElasticSearch Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 The {{Elasticsearch (v1.7.1) sink end-to-end test}} failed on Travis because the test waited for Elasticsearch records indefinitely. https://api.travis-ci.org/v3/job/554991871/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-13155) SQL Client end-to-end test fails on Travis
Till Rohrmann created FLINK-13155: - Summary: SQL Client end-to-end test fails on Travis Key: FLINK-13155 URL: https://issues.apache.org/jira/browse/FLINK-13155 Project: Flink Issue Type: Bug Components: Table SQL / Client Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 The {{SQL Client end-to-end test}} which executes {{test-scripts/test_sql_client.sh}} fails on Travis with non-empty out files. https://api.travis-ci.org/v3/job/554991859/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-13154) Broken links in documentation
Till Rohrmann created FLINK-13154: - Summary: Broken links in documentation Key: FLINK-13154 URL: https://issues.apache.org/jira/browse/FLINK-13154 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 Flink's broken link verification failed on Travis as it has discovered some broken links. We need to fix this for the upcoming release. {code} [2019-07-07 10:54:21] ERROR `/tutorials/datastream_api.html' not found. [2019-07-07 10:54:21] ERROR `/tutorials/local_setup.html' not found. [2019-07-07 10:54:21] ERROR `/tutorials/flink_on_windows.html' not found. [2019-07-07 10:54:21] ERROR `/examples' not found. [2019-07-07 10:54:23] ERROR `/zh/dev/connectors/pubsub.html' not found. [2019-07-07 10:54:24] ERROR `/ops/filesystems.html' not found. [2019-07-07 10:54:24] ERROR `/dev/table/catalog.html' not found. [2019-07-07 10:54:28] ERROR `/zh/tutorials/datastream_api.html' not found. [2019-07-07 10:54:28] ERROR `/zh/tutorials/local_setup.html' not found. http://localhost:4000/tutorials/datastream_api.html: Remote file does not exist -- broken link!!! http://localhost:4000/tutorials/local_setup.html: Remote file does not exist -- broken link!!! -- http://localhost:4000/tutorials/flink_on_windows.html: Remote file does not exist -- broken link!!! -- http://localhost:4000/examples: Remote file does not exist -- broken link!!! -- http://localhost:4000/zh/dev/connectors/pubsub.html: Remote file does not exist -- broken link!!! -- http://localhost:4000/ops/filesystems.html: Remote file does not exist -- broken link!!! -- http://localhost:4000/dev/table/catalog.html: Remote file does not exist -- broken link!!! -- http://localhost:4000/zh/tutorials/datastream_api.html: Remote file does not exist -- broken link!!! http://localhost:4000/zh/tutorials/local_setup.html: Remote file does not exist -- broken link!!! {code} https://api.travis-ci.org/v3/job/554991858/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-13153) SplitAggregateITCase.testMinMaxWithRetraction failed on Travis
Till Rohrmann created FLINK-13153: - Summary: SplitAggregateITCase.testMinMaxWithRetraction failed on Travis Key: FLINK-13153 URL: https://issues.apache.org/jira/browse/FLINK-13153 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 {{SplitAggregateITCase.testMinMaxWithRetraction}} failed on Travis with {code} Failures: 10:50:43.355 [ERROR] SplitAggregateITCase.testMinMaxWithRetraction:195 expected: but was: {code} https://api.travis-ci.org/v3/job/554991853/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-13140) PushFilterIntoTableSourceScanRuleTest.testWithUd fails on Travis
Till Rohrmann created FLINK-13140: - Summary: PushFilterIntoTableSourceScanRuleTest.testWithUd fails on Travis Key: FLINK-13140 URL: https://issues.apache.org/jira/browse/FLINK-13140 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 {{PushFilterIntoTableSourceScanRuleTest.testWithUd}} fails on Travis with {code} 06:22:11.290 [ERROR] Failures: 06:22:11.290 [ERROR] PushFilterIntoTableSourceScanRuleTest.testWithUdf:93 planAfter expected:<...ssions$utils$Func1$$[a39386268ffec8461452460bcbe089ad]($2), 32)]) +- Lo...> but was:<...ssions$utils$Func1$$[8a89e0d5a022a06a00c7734a25295ff4]($2), 32)]) +- Lo...> {code} https://api.travis-ci.org/v3/job/555252046/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-13139) Various Hive tests fail on Travis
Till Rohrmann created FLINK-13139: - Summary: Various Hive tests fail on Travis Key: FLINK-13139 URL: https://issues.apache.org/jira/browse/FLINK-13139 Project: Flink Issue Type: Bug Components: Connectors / Hive Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 Various Hive related tests fail on Travis: {code} 06:06:49.654 [ERROR] Errors: 06:06:49.654 [ERROR] HiveInputFormatTest.createCatalog:66 » Catalog Failed to create Hive Metastore... 06:06:49.654 [ERROR] HiveTableFactoryTest.init:55 » Catalog Failed to create Hive Metastore client 06:06:49.654 [ERROR] HiveTableOutputFormatTest.createCatalog:72 » Catalog Failed to create Hive Met... 06:06:49.654 [ERROR] HiveTableSinkTest.createCatalog:72 » Catalog Failed to create Hive Metastore c... 06:06:49.654 [ERROR] HiveTableSourceTest.createCatalog:67 » Catalog Failed to create Hive Metastore... 06:06:49.654 [ERROR] HiveCatalogGenericMetadataTest.init:49 » Catalog Failed to create Hive Metasto... 06:06:49.654 [ERROR] HiveCatalogHiveMetadataTest.init:55 » Catalog Failed to create Hive Metastore ... 06:06:49.654 [ERROR] HiveGenericUDFTest.testCeil:193->init:387 » ExceptionInInitializer 06:06:49.654 [ERROR] HiveGenericUDFTest.testDecode:160 » NoClassDefFound Could not initialize class... 06:06:49.654 [ERROR] HiveSimpleUDFTest.testUDFArray_singleArray:202->init:237 » NoClassDefFound Cou... 06:06:49.654 [ERROR] HiveSimpleUDFTest.testUDFBin:60->init:237 » NoClassDefFound Could not initiali... 06:06:49.654 [ERROR] HiveSimpleUDFTest.testUDFConv:67->init:237 » NoClassDefFound Could not initial... 06:06:49.654 [ERROR] HiveSimpleUDFTest.testUDFJson:85->init:237 » NoClassDefFound Could not initial... 06:06:49.654 [ERROR] HiveSimpleUDFTest.testUDFMinute:126->init:237 » ExceptionInInitializer 06:06:49.654 [ERROR] HiveSimpleUDFTest.testUDFRand:51->init:237 » NoClassDefFound Could not initial... 06:06:49.654 [ERROR] HiveSimpleUDFTest.testUDFRegExpExtract:153->init:237 » NoClassDefFound Could n... 06:06:49.654 [ERROR] HiveSimpleUDFTest.testUDFToInteger:188->init:237 » NoClassDefFound Could not i... 06:06:49.654 [ERROR] HiveSimpleUDFTest.testUDFUnbase64:166->init:237 » NoClassDefFound Could not in... 06:06:49.654 [ERROR] HiveSimpleUDFTest.testUDFUnhex:177->init:237 » NoClassDefFound Could not initi... 06:06:49.654 [ERROR] HiveSimpleUDFTest.testUDFWeekOfYear:139->init:237 » NoClassDefFound Could not ... {code} https://api.travis-ci.org/v3/job/555252043/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12916) KeyedComplexChainTest.testMigrationAndRestore failed on Travis
Till Rohrmann created FLINK-12916: - Summary: KeyedComplexChainTest.testMigrationAndRestore failed on Travis Key: FLINK-12916 URL: https://issues.apache.org/jira/browse/FLINK-12916 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing, Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 The test case {{KeyedComplexChainTest.testMigrationAndRestore}} failed on Travis because a Task received the cancellation from one of its inputs {code} Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task received cancellation from one of its inputs at org.apache.flink.streaming.runtime.io.BarrierBuffer.notifyAbortOnCancellationBarrier(BarrierBuffer.java:428) at org.apache.flink.streaming.runtime.io.BarrierBuffer.processCancellationBarrier(BarrierBuffer.java:327) at org.apache.flink.streaming.runtime.io.BarrierBuffer.pollNext(BarrierBuffer.java:208) at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.pollNextNullable(StreamTaskNetworkInput.java:102) at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.pollNextNullable(StreamTaskNetworkInput.java:47) at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:128) at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.performDefaultAction(OneInputStreamTask.java:101) at org.apache.flink.streaming.runtime.tasks.StreamTask.run(StreamTask.java:268) at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:376) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:676) ... 1 more {code} https://api.travis-ci.org/v3/job/548181384/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12915) AbstractOperatorRestoreTestBase can deadlock if one test fails
Till Rohrmann created FLINK-12915: - Summary: AbstractOperatorRestoreTestBase can deadlock if one test fails Key: FLINK-12915 URL: https://issues.apache.org/jira/browse/FLINK-12915 Project: Flink Issue Type: Bug Components: API / Type Serialization System, Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann The {{AbstractOperatorRestoreTestBase}} can deadlock in case of a failing test case. The problem is that if one test fails then the corresponding test job won't be canceled. Due to that a succeeding test case might not get enough slots to execute because the job of the failing test is still running. This leads to a deadlocked test as happened here: https://api.travis-ci.org/v3/job/548181384/log.txt A side effect is that the original test failure won't be properly displayed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12863) Race condition between slot offerings and AllocatedSlotReport
Till Rohrmann created FLINK-12863: - Summary: Race condition between slot offerings and AllocatedSlotReport Key: FLINK-12863 URL: https://issues.apache.org/jira/browse/FLINK-12863 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 With FLINK-11059 we introduced the {{AllocatedSlotReport}} which is used by the {{TaskExecutor}} to synchronize its internal view on slot allocations with the view of the {{JobMaster}}. It seems that there is a race condition between offering slots and receiving the report because the {{AllocatedSlotReport}} is sent by the {{HeartbeatManagerSenderImpl}} from a separate thread. Due to that it can happen that we generate an {{AllocatedSlotReport}} just before getting new slots offered. Since the report is sent from a different thread, it can then happen that the response to the slot offerings is sent earlier than the {{AllocatedSlotReport}}. Consequently, we might receive an outdated slot report on the {{TaskExecutor}} causing active slots to be released. In order to solve the problem I propose to add a fencing token to the {{AllocatedSlotReport}} which is being updated whenever we offer new slots to the {{JobMaster}}. When we receive the {{AllocatedSlotReport}} on the {{TaskExecutor}} we compare the current slot report fencing token with the received one and only process the report if they are equal. Otherwise we wait for the next heartbeat to send us an up to date {{AllocatedSlotReport}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12768) FlinkKinesisConsumerTest.testSourceSynchronization unstable on Travis
Till Rohrmann created FLINK-12768: - Summary: FlinkKinesisConsumerTest.testSourceSynchronization unstable on Travis Key: FLINK-12768 URL: https://issues.apache.org/jira/browse/FLINK-12768 Project: Flink Issue Type: Bug Components: Connectors / Kinesis Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 The {{FlinkKinesisConsumerTest.testSourceSynchronization}} seems to be unstable on Travis. It fails with {code} [ERROR] testSourceSynchronization(org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumerTest) Time elapsed: 10.031 s <<< FAILURE! java.lang.AssertionError: Expected: iterable containing ["1", ] but: No item matched: at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumerTest.testSourceSynchronization(FlinkKinesisConsumerTest.java:950) {code} https://api.travis-ci.org/v3/job/541845510/log.txt While locking into the problem, I noticed that the test case takes 1 second to execute on my machine. I'm wondering whether this really needs to take this long. Moreover, the test code contains {{Thread.sleeps}} and uses {{Whiteboxing}} which we should avoid. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12670) Implement FailureRateRestartBackoffTimeStrategy
Till Rohrmann created FLINK-12670: - Summary: Implement FailureRateRestartBackoffTimeStrategy Key: FLINK-12670 URL: https://issues.apache.org/jira/browse/FLINK-12670 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 We should implement a {{FailureRateRestartBackoffTimeStrategy}} similar to the {{FailureRateRestartStrategy}} which allows to configure the number of allowed restarts and the fixed delay in between restart attempts. In order to be backwards compatible, we should respect the configuration values used to configure the {{FailureRateRestartStrategy}} (see the documentation for more information: https://ci.apache.org/projects/flink/flink-docs-stable/dev/restart_strategies.html). Additionally, we should also respect the {{FailureRateRestartStrategyConfiguration}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12669) Implement FixedDelayRestartBackoffTimeStrategy
Till Rohrmann created FLINK-12669: - Summary: Implement FixedDelayRestartBackoffTimeStrategy Key: FLINK-12669 URL: https://issues.apache.org/jira/browse/FLINK-12669 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.9.0 Reporter: Till Rohrmann Fix For: 1.9.0 We should implement a {{FixedDelayRestartBackoffTimeStrategy}} similar to the {{FixedDelayRestartStrategy}} which allows to configure the number of allowed restarts and the fixed delay in between restart attempts. In order to be backwards compatible, we should respect the configuration values used to configure the {{FixedDelayRestartStrategy}} (see the documentation for more information: https://ci.apache.org/projects/flink/flink-docs-stable/dev/restart_strategies.html). Additionally, we should also respect the {{FixedDelayRestartStrategyConfiguration}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12654) Enable Azure FS E2E tests on Travis
Till Rohrmann created FLINK-12654: - Summary: Enable Azure FS E2E tests on Travis Key: FLINK-12654 URL: https://issues.apache.org/jira/browse/FLINK-12654 Project: Flink Issue Type: Improvement Components: Connectors / FileSystem, Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann We should think about enabling the {{test_azure_fs.sh}} E2E test on Travis. This would improve our test coverage. In order to do this, we need to setup an Azure account, set up the right environment variables {{$IT_CASE_AZURE_ACCOUNT}}, {{$IT_CASE_AZURE_ACCESS_KEY}}, {{$IT_CASE_AZURE_CONTAINER}} and upload a static {{words}} file to {{wasbs://$IT_CASE_AZURE_CONTAINER@$IT_CASE_AZURE_ACCOUNT.blob.core.windows.net/words}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12592) StreamTableEnvironment object has no attribute connect
Till Rohrmann created FLINK-12592: - Summary: StreamTableEnvironment object has no attribute connect Key: FLINK-12592 URL: https://issues.apache.org/jira/browse/FLINK-12592 Project: Flink Issue Type: Bug Components: API / Python, Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann The Python build module failed on Travis with the following problem: {{'StreamTableEnvironment' object has no attribute 'connect'}}. https://api.travis-ci.org/v3/job/535684431/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12545) TableSourceTest.testNestedProject failed on Travis
Till Rohrmann created FLINK-12545: - Summary: TableSourceTest.testNestedProject failed on Travis Key: FLINK-12545 URL: https://issues.apache.org/jira/browse/FLINK-12545 Project: Flink Issue Type: Bug Components: Table SQL / Planner Affects Versions: 1.9.0 Reporter: Till Rohrmann The {{TableSourceTest.testNestedProject}} failed on Travis with {code} 07:15:38.909 [ERROR] Failures: 07:15:38.909 [ERROR] TableSourceTest.testNestedProject:375 null expected:<...deepNested.nested2.f[lag AS nestedFlag, deepNested.nested2.num AS nestedNum]) StreamTableSourceScan(table=[[T]], fields=[id, deepNested, nested], source=[TestSource(read nested fields: id.*, deepNested.nested2.num, deepNested.nested2.flag], deepNested.nested1...> but was:<...deepNested.nested2.f[1 AS nestedFlag, deepNested.nested2.f0 AS nestedNum]) StreamTableSourceScan(table=[[T]], fields=[id, deepNested, nested], source=[TestSource(read nested fields: id.*, deepNested.nested2.f1, deepNested.nested2.f0], deepNested.nested1...> {code} https://api.travis-ci.org/v3/job/533646868/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12502) Harden JobMasterTest#testRequestNextInputSplitWithDataSourceFailover
Till Rohrmann created FLINK-12502: - Summary: Harden JobMasterTest#testRequestNextInputSplitWithDataSourceFailover Key: FLINK-12502 URL: https://issues.apache.org/jira/browse/FLINK-12502 Project: Flink Issue Type: Bug Components: Runtime / Coordination, Tests Affects Versions: 1.9.0 Reporter: Till Rohrmann The {{JobMasterTest#testRequestNextInputSplitWithDataSourceFailover}} relies on how many files you have in your working directory. This assumption is quite brittle. Instead we should explicitly instantiate an {{InputSplitAssigner}} with a defined number of input splits. Moreover, we should make the assertions more explicit: Input split comparisons should not rely solely on the length of the input split data. Maybe it is also not necessary to capture the full {{TaskDeploymentDescriptor}} because we could already know the producer's and consumer's {{JobVertexID}} when we create the {{JobGraph}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12324) Remove ActorGateway interface
Till Rohrmann created FLINK-12324: - Summary: Remove ActorGateway interface Key: FLINK-12324 URL: https://issues.apache.org/jira/browse/FLINK-12324 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.9.0 Remove the {{ActorGateway}} interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12323) Remove legacy ActorGateway implementations
Till Rohrmann created FLINK-12323: - Summary: Remove legacy ActorGateway implementations Key: FLINK-12323 URL: https://issues.apache.org/jira/browse/FLINK-12323 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.9.0 Remove the legacy {{ActorGateway}} based implementations. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12322) Remove legacy ActorTaskManagerGateway
Till Rohrmann created FLINK-12322: - Summary: Remove legacy ActorTaskManagerGateway Key: FLINK-12322 URL: https://issues.apache.org/jira/browse/FLINK-12322 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Till Rohrmann Fix For: 1.9.0 Remove the legacy {{ActorTaskManagerGateway}} component from Flink. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12122) Spread out tasks evenly across all available registered TaskManagers
Till Rohrmann created FLINK-12122: - Summary: Spread out tasks evenly across all available registered TaskManagers Key: FLINK-12122 URL: https://issues.apache.org/jira/browse/FLINK-12122 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Affects Versions: 1.7.2, 1.6.4, 1.8.0 Reporter: Till Rohrmann Fix For: 1.7.3, 1.9.0, 1.8.1 With Flip-6, we changed the default behaviour how slots are assigned to {{TaskManages}}. Instead of evenly spreading it out over all registered {{TaskManagers}}, we randomly pick slots from {{TaskManagers}} with a tendency to first fill up a TM before using another one. This is a regression wrt the pre Flip-6 code. I suggest to change the behaviour so that we try to evenly distribute slots across all available {{TaskManagers}} by considering how many of their slots are already allocated. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12070) Make blocking result partitions consumable multiple times
Till Rohrmann created FLINK-12070: - Summary: Make blocking result partitions consumable multiple times Key: FLINK-12070 URL: https://issues.apache.org/jira/browse/FLINK-12070 Project: Flink Issue Type: Improvement Components: Runtime / Network Reporter: Till Rohrmann In order to avoid writing produced results multiple times for multiple consumers and in order to speed up batch recoveries, we should make the blocking result partitions to be consumable multiple times. At the moment a blocking result partition will be released once the consumers has processed all data. Instead the result partition should be released once the next blocking result has been produced and all consumers of a blocking result partition have terminated. Moreover, blocking results should not hold on slot resources like network buffers or memory as it is currently the case with {{SpillableSubpartitions}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12069) Add proper lifecycle management for intermediate result partitions
Till Rohrmann created FLINK-12069: - Summary: Add proper lifecycle management for intermediate result partitions Key: FLINK-12069 URL: https://issues.apache.org/jira/browse/FLINK-12069 Project: Flink Issue Type: Improvement Components: Runtime / Coordination, Runtime / Network Affects Versions: 1.8.0, 1.9.0 Reporter: Till Rohrmann In order to properly execute batch jobs, we should make the lifecycle management of intermediate result partitions the responsibility of the {{JobMaster}}/{{Scheduler}} component. The {{Scheduler}} knows best when an intermediate result partition is no longer needed and, thus, can be freed. So for example, a blocking intermediate result should only be released after all subsequent blocking intermediate results have been completed in order to speed up potential failovers. Moreover, having explicit control over intermediate result partitions, could also enable use cases like result partition sharing between jobs and even across clusters (by simply not releasing the result partitions). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12068) Backtrack fail over regions if intermediate results are unavailable
Till Rohrmann created FLINK-12068: - Summary: Backtrack fail over regions if intermediate results are unavailable Key: FLINK-12068 URL: https://issues.apache.org/jira/browse/FLINK-12068 Project: Flink Issue Type: Sub-task Components: Runtime / Coordination Reporter: Till Rohrmann The batch failover strategy needs to be able to backtrack fail over regions if an intermediate result is unavailable. Either by explicitly checking whether the intermediate result partition is available or via a special exception indicating that a result partition is no longer available. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12058) Cancel checkpoint operations belonging to a discarded/aborted checkpoint
Till Rohrmann created FLINK-12058: - Summary: Cancel checkpoint operations belonging to a discarded/aborted checkpoint Key: FLINK-12058 URL: https://issues.apache.org/jira/browse/FLINK-12058 Project: Flink Issue Type: Improvement Components: Runtime / Checkpointing Affects Versions: 1.7.2, 1.8.0 Reporter: Till Rohrmann In order to save CPU cycles and reduce disk and network I/O, we should try to cancel local checkpoint operations belonging to discarded aborted or subsumed checkpoints. For example, if a {{Task}} declines a checkpoint, the {{CheckpointCoordinator}} will discard this checkpoint. However, other checkpointing operations belonging to this checkpoint won't be necessarily notified and canceled. The notification mechanism could piggy back on the existing {{CancelCheckpointMarker}} or be a separate signal sent to all participating {{Tasks}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12036) Simplify ExecutionGraph by removing legacy concurrency code
Till Rohrmann created FLINK-12036: - Summary: Simplify ExecutionGraph by removing legacy concurrency code Key: FLINK-12036 URL: https://issues.apache.org/jira/browse/FLINK-12036 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann With FLINK-11417, we made the {{ExecutionGraph}} single threaded. Consequently all the logic which handles concurrent state changes is no longer needed (e.g. {{AtomicReferenceFieldUpdater}}, checking for concurrent changes, etc.). This issue is the umbrella to collect every legacy concurrency code which can be removed from the {{ExecutionGraph}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12034) KafkaITCase.testAllDeletes failed on Travis
Till Rohrmann created FLINK-12034: - Summary: KafkaITCase.testAllDeletes failed on Travis Key: FLINK-12034 URL: https://issues.apache.org/jira/browse/FLINK-12034 Project: Flink Issue Type: Bug Components: Connectors / Kafka, Tests Affects Versions: 1.8.0 Reporter: Till Rohrmann The {{KafkaITCase.testAllDeletes}} failed on Travis because it timed out: {code} 12:36:24.782 [ERROR] Errors: 12:36:24.783 [ERROR] KafkaITCase.testAllDeletes:131->KafkaConsumerTestBase.runAllDeletesTest:1526->KafkaTestBase.deleteTestTopic:192->Object.wait:-2 » TestTimedOut 12:36:24.783 [INFO] 12:36:24.783 [ERROR] Tests run: 46, Failures: 0, Errors: 1, Skipped: 0 {code} https://api.travis-ci.org/v3/job/511419474/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12033) KafkaITCase.testCancelingFullTopic failed on Travis
Till Rohrmann created FLINK-12033: - Summary: KafkaITCase.testCancelingFullTopic failed on Travis Key: FLINK-12033 URL: https://issues.apache.org/jira/browse/FLINK-12033 Project: Flink Issue Type: Bug Components: Connectors / Kafka, Test Infrastructure Affects Versions: 1.8.0 Reporter: Till Rohrmann The {{KafkaITCase.testCancelingFullTopic}} failed on Travis because it timed out. https://api.travis-ci.org/v3/job/511419185/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12032) KafkaProducerAtLeastOnceITCase failed to bind to address
Till Rohrmann created FLINK-12032: - Summary: KafkaProducerAtLeastOnceITCase failed to bind to address Key: FLINK-12032 URL: https://issues.apache.org/jira/browse/FLINK-12032 Project: Flink Issue Type: Bug Components: Connectors / Kafka, Tests Affects Versions: 1.8.0 Reporter: Till Rohrmann The {{KafkaProducerAtLeastOnceITCase}} failed on Travis because it could not bind to the specified address. https://api.travis-ci.org/v3/job/511419185/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12021) Let ResultConjunctFuture return future results in same order as futures
Till Rohrmann created FLINK-12021: - Summary: Let ResultConjunctFuture return future results in same order as futures Key: FLINK-12021 URL: https://issues.apache.org/jira/browse/FLINK-12021 Project: Flink Issue Type: Improvement Components: Runtime / Coordination Affects Versions: 1.7.2, 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 The {{ResultConjunctFuture}} should return future results in the same order as the futures were specified. The advantage would be that we maintain the same output order as the input order if the input has been specified with a special ordering. This should also fix a current performance regression where we don't maintain the topological ordering when deploying tasks. Due to this, we need to make an additional result partition lookup which adds additional latency. The problem has occurred due to FLINK-10431 which changes the interleaving how the slot futures are completed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12020) Add documentation for mesos-appmaster-job.sh
Till Rohrmann created FLINK-12020: - Summary: Add documentation for mesos-appmaster-job.sh Key: FLINK-12020 URL: https://issues.apache.org/jira/browse/FLINK-12020 Project: Flink Issue Type: Improvement Components: Deployment / Mesos, Documentation Affects Versions: 1.7.2, 1.8.0 Reporter: Till Rohrmann The Flink documentation is currently lacking information about the {{mesos-appmaster-job.sh}} and how to use it. It would be helpful for our users to add documentation and examples how to use it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11855) Race condition in EmbeddedLeaderService
Till Rohrmann created FLINK-11855: - Summary: Race condition in EmbeddedLeaderService Key: FLINK-11855 URL: https://issues.apache.org/jira/browse/FLINK-11855 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.7.2, 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.7.3, 1.8.0 There is a race condition in the {{EmbeddedLeaderService}} which can occur if the {{EmbeddedLeaderService}} is shut down before the {{GrantLeadershipCall}} has been executed. In this case, the {{contender}} is nulled which leads to a NPE. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11851) ClusterEntrypoint provides wrong executor to HaServices
Till Rohrmann created FLINK-11851: - Summary: ClusterEntrypoint provides wrong executor to HaServices Key: FLINK-11851 URL: https://issues.apache.org/jira/browse/FLINK-11851 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.7.2, 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.7.3, 1.8.0 The {{ClusterEntrypoint}} provides the executor of the common {{RpcService}} to the {{HighAvailabilityServices}} which uses the executor to run io operations. In I/O heavy cases, this can block all {{RpcService}} threads and make the {{RpcEndpoints}} running in the respective {{RpcService}} unresponsive. I suggest to introduce a dedicated I/O executor which is used for io heavy operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11846) Duplicate job submission delete HA files
Till Rohrmann created FLINK-11846: - Summary: Duplicate job submission delete HA files Key: FLINK-11846 URL: https://issues.apache.org/jira/browse/FLINK-11846 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 Due to changes for FLINK-11383, the {{Dispatcher}} now delete HA files if the client submits twice a job. A duplicate job submission should, however, simply be rejected but not cause that HA files are being deleted. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11843) Dispatcher fails to recover jobs if leader change happens during JobManagerRunner termination
Till Rohrmann created FLINK-11843: - Summary: Dispatcher fails to recover jobs if leader change happens during JobManagerRunner termination Key: FLINK-11843 URL: https://issues.apache.org/jira/browse/FLINK-11843 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.7.2, 1.8.0 Reporter: Till Rohrmann The {{Dispatcher}} fails to recover jobs if a leader change happens during the {{JobManagerRunner}} termination of the previous run. The problem is that we schedule the start future of the recovered {{JobGraph}} using the {{MainThreadExecutor}} and additionally require that this future is completed before any other recovery operation from a subsequent leadership session is executed. If now the leadership changes, the {{MainThreadExecutor}} will be invalidated and the scheduled future will never be completed. The relevant ML thread: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-7-1-job-stuck-in-suspended-state-td26439.html -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11826) Kafka09ITCase.testRateLimitedConsumer fails on Travis
Till Rohrmann created FLINK-11826: - Summary: Kafka09ITCase.testRateLimitedConsumer fails on Travis Key: FLINK-11826 URL: https://issues.apache.org/jira/browse/FLINK-11826 Project: Flink Issue Type: Bug Components: Connectors / Kafka, Tests Affects Versions: 1.8.0 Reporter: Till Rohrmann The {{Kafka09ITCase.testRateLimitedConsumer}} fails on Travis with: {code} 20:33:49.887 [ERROR] Errors: 20:33:49.887 [ERROR] Kafka09ITCase.testRateLimitedConsumer:204 » JobExecution Job execution failed. {code} https://api.travis-ci.org/v3/job/501660504/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11813) Standby per job mode Dispatchers don't know job's JobSchedulingStatus
Till Rohrmann created FLINK-11813: - Summary: Standby per job mode Dispatchers don't know job's JobSchedulingStatus Key: FLINK-11813 URL: https://issues.apache.org/jira/browse/FLINK-11813 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.7.2, 1.6.4, 1.8.0 Reporter: Till Rohrmann At the moment, it can happen that standby {{Dispatchers}} in per job mode will restart a terminated job after they gained leadership. The problem is that we currently clear the {{RunningJobsRegistry}} once a job has reached a globally terminal state. After the leading {{Dispatcher}} terminates, a standby {{Dispatcher}} will gain leadership. Without having the information from the {{RunningJobsRegistry}} it cannot tell whether the job has been executed or whether the {{Dispatcher}} needs to re-execute the job. At the moment, the {{Dispatcher}} will assume that there was a fault and hence re-execute the job. This can lead to duplicate results. I think we need some way to tell standby {{Dispatchers}} that a certain job has been successfully executed. One trivial solution could be to not clean up the {{RunningJobsRegistry}} but then we will clutter ZooKeeper. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11789) Checkpoint directories are not cleaned up after job termination
Till Rohrmann created FLINK-11789: - Summary: Checkpoint directories are not cleaned up after job termination Key: FLINK-11789 URL: https://issues.apache.org/jira/browse/FLINK-11789 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.9.0 Reporter: Till Rohrmann Flink currently does not clean up all checkpoint directories when a job reaches a globally terminal state. Having configured the checkpoint directory {{checkpoints}}, I observe that after cancelling the job {{JOB_ID}} there are still {code} checkpoints/JOB_ID/shared checkpoints/JOB_ID/taskowned {code} I think it would be good if would delete {{checkpoints/JOB_ID}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11719) Remove JobMaster#start(JobMasterId) and #suspend
Till Rohrmann created FLINK-11719: - Summary: Remove JobMaster#start(JobMasterId) and #suspend Key: FLINK-11719 URL: https://issues.apache.org/jira/browse/FLINK-11719 Project: Flink Issue Type: Improvement Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Currently, the {{JobMaster}} contains a lot of mutable state which is necessary because it is used across different leadership sessions by the {{JobManagerRunner}}. For this purpose, we have the methods {{JobMaster#start(JobMasterId)}} and {{#suspend}}. The mutable state management makes things on the {{JobMaster}} side more complicated than they need to be. In order to improve the {{JobMaster's}} maintainability I suggest to remove this logic and instead terminate the {{JobMaster}} if the {{JobManagerRunner}} loses leadership. This entails that for every leadership we will create a new {{JobMaster}} instance. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11718) Add onStart method to RpcEndpoint which is run in the actor's main thread
Till Rohrmann created FLINK-11718: - Summary: Add onStart method to RpcEndpoint which is run in the actor's main thread Key: FLINK-11718 URL: https://issues.apache.org/jira/browse/FLINK-11718 Project: Flink Issue Type: Improvement Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann I propose to introduce a {{RpcEndpoint#onStart}} method which is called when the {{RpcEndpoint}} is started via {{RpcEndpoint#start}}. At the moment, users will override {{#start}} where they need to remember to also call {{super.start()}} in order to actually start the {{RpcEndpoint}}. Moreover, the logic executed by {{start}} won't be run in the actor's main thread. This is problematic if the method triggers asynchronous behaviour which is executed in the actor's main thread. If that is the case, it can happen that the asynchronous operation is executed before the {{start}} method has been finished. Due to these problems, I suggest to introduce a {{RpcEndpoint#onStart}} method which can be overriden by sub classes similarly to the {{RpcEndpoint#onStop}} method. The {{onStart}} method can be used to setup the {{RpcEndpoint}} and is guaranteed to be executed before any other message is processed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11678) Remove suspend logic from ExecutionGraphCache
Till Rohrmann created FLINK-11678: - Summary: Remove suspend logic from ExecutionGraphCache Key: FLINK-11678 URL: https://issues.apache.org/jira/browse/FLINK-11678 Project: Flink Issue Type: Sub-task Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 Since we no longer obtain the {{ExecutionGraph}} from the {{JobManager}} we can now remove the suspending logic in the {{ExecutionGraphCache}} which will simplify this component. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11677) Remove ResourceManagerRunner
Till Rohrmann created FLINK-11677: - Summary: Remove ResourceManagerRunner Key: FLINK-11677 URL: https://issues.apache.org/jira/browse/FLINK-11677 Project: Flink Issue Type: Improvement Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: TisonKun Fix For: 1.8.0 The {{ResourceManagerRunner}} is no longer needed. Thus, I suggest to remove it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11663) Remove control flow break point from Execution#releaseAssignedResource
Till Rohrmann created FLINK-11663: - Summary: Remove control flow break point from Execution#releaseAssignedResource Key: FLINK-11663 URL: https://issues.apache.org/jira/browse/FLINK-11663 Project: Flink Issue Type: Improvement Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 In {{Execution#releaseAssignedResource}} we release the assigned resource by calling {{LogicalSlot#releaseSlot}} and use {{FutureUtils.whenCompleteAsyncIfNotDone}} to merge the future back into the main thread in order to complete the {{Execution#releaseFuture}}. This is no longer necessary since the returned future is always completed from within the main thread (with the changes from FLINK-10431). In fact this control flow break point makes it hard to properly suspend the {{ExecutionGraph}} atomically as required for FLINK-11537. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11650) Remove legacy FlinkResourceManager
Till Rohrmann created FLINK-11650: - Summary: Remove legacy FlinkResourceManager Key: FLINK-11650 URL: https://issues.apache.org/jira/browse/FLINK-11650 Project: Flink Issue Type: Sub-task Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 Remove the legacy {{FlinkResourceManager}} component. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11649) Remove legacy JobInfo
Till Rohrmann created FLINK-11649: - Summary: Remove legacy JobInfo Key: FLINK-11649 URL: https://issues.apache.org/jira/browse/FLINK-11649 Project: Flink Issue Type: Sub-task Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 Remove legacy {{JobInfo}} class. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11648) Remove MemoryArchivist
Till Rohrmann created FLINK-11648: - Summary: Remove MemoryArchivist Key: FLINK-11648 URL: https://issues.apache.org/jira/browse/FLINK-11648 Project: Flink Issue Type: Sub-task Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 Remove the {{MemoryArchivist}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11645) Remove legacy JobManager
Till Rohrmann created FLINK-11645: - Summary: Remove legacy JobManager Key: FLINK-11645 URL: https://issues.apache.org/jira/browse/FLINK-11645 Project: Flink Issue Type: Sub-task Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 Remove the legacy {{JobManager}} component. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11631) TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination unstable on Travis
Till Rohrmann created FLINK-11631: - Summary: TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination unstable on Travis Key: FLINK-11631 URL: https://issues.apache.org/jira/browse/FLINK-11631 Project: Flink Issue Type: Bug Components: Distributed Coordination, Tests Affects Versions: 1.8.0 Reporter: Till Rohrmann The {{TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination}} is unstable on Travis. It fails with {code} 16:12:04.644 [ERROR] testJobReExecutionAfterTaskExecutorTermination(org.apache.flink.runtime.taskexecutor.TaskExecutorITCase) Time elapsed: 1.257 s <<< ERROR! org.apache.flink.util.FlinkException: Could not close resource. at org.apache.flink.runtime.taskexecutor.TaskExecutorITCase.teardown(TaskExecutorITCase.java:83) Caused by: org.apache.flink.util.FlinkException: Error while shutting the TaskExecutor down. Caused by: org.apache.flink.util.FlinkException: Could not properly shut down the TaskManager services. Caused by: java.lang.IllegalStateException: NetworkBufferPool is not empty after destroying all LocalBufferPools {code} https://api.travis-ci.org/v3/job/493221318/log.txt The problem seems to be caused by the {{TaskExecutor}} not properly waiting for the termination of all running {{Tasks}}. Due to this, there is a race condition which causes that not all buffers are returned to the {{BufferPool}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11630) TaskExecutor does not wait for Task termination when terminating itself
Till Rohrmann created FLINK-11630: - Summary: TaskExecutor does not wait for Task termination when terminating itself Key: FLINK-11630 URL: https://issues.apache.org/jira/browse/FLINK-11630 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann The {{TaskExecutor}} does not properly wait for the termination of {{Tasks}} when terminating. In fact, it does not even trigger the cancellation of the running {{Tasks}}. I think for better lifecycle management it is important that the {{TaskExecutor}} triggers the termination of all running {{Tasks}} and then wait until all {{Tasks}} have terminated before it terminates itself. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11629) Stop AkkaRpcActor immediately after onStop has finished
Till Rohrmann created FLINK-11629: - Summary: Stop AkkaRpcActor immediately after onStop has finished Key: FLINK-11629 URL: https://issues.apache.org/jira/browse/FLINK-11629 Project: Flink Issue Type: Improvement Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 Currently, the {{AkkaRpcActor}} sends a {{Kill}} message to itself after the {{onStop}} future has been completed. Since {{Kill}} messages enqueue into the mail box, the stop operation is not immediate. I suggest to rather use {{Context#stop(ActorRef)}} to immediately stop the {{AkkaRcpActor}} upon termination. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11594) Check & port TaskManagerRegistrationTest to new code base
Till Rohrmann created FLINK-11594: - Summary: Check & port TaskManagerRegistrationTest to new code base Key: FLINK-11594 URL: https://issues.apache.org/jira/browse/FLINK-11594 Project: Flink Issue Type: Sub-task Affects Versions: 1.8.0 Reporter: Till Rohrmann Fix For: 1.8.0 Check and port {{TaskManagerRegistrationTest}} to new code base. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11593) Check & port TaskManagerTest to new code base
Till Rohrmann created FLINK-11593: - Summary: Check & port TaskManagerTest to new code base Key: FLINK-11593 URL: https://issues.apache.org/jira/browse/FLINK-11593 Project: Flink Issue Type: Sub-task Affects Versions: 1.8.0 Reporter: Till Rohrmann Fix For: 1.8.0 Check and port {{TaskManagerTest}} to new code base. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11592) Port TaskManagerFailsWithSlotSharingITCase to new code base
Till Rohrmann created FLINK-11592: - Summary: Port TaskManagerFailsWithSlotSharingITCase to new code base Key: FLINK-11592 URL: https://issues.apache.org/jira/browse/FLINK-11592 Project: Flink Issue Type: Sub-task Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 Port {{TaskManagerFailsWithSlotSharingITCase}} to new code base. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11587) Check and port CoLocationConstraintITCase to new code base
Till Rohrmann created FLINK-11587: - Summary: Check and port CoLocationConstraintITCase to new code base Key: FLINK-11587 URL: https://issues.apache.org/jira/browse/FLINK-11587 Project: Flink Issue Type: Sub-task Affects Versions: 1.8.0 Reporter: Till Rohrmann Fix For: 1.8.0 Check and port {{CoLocationConstraintITCase}} to new code base. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11586) Check and port SlotSharingITCase to new code base
Till Rohrmann created FLINK-11586: - Summary: Check and port SlotSharingITCase to new code base Key: FLINK-11586 URL: https://issues.apache.org/jira/browse/FLINK-11586 Project: Flink Issue Type: Sub-task Affects Versions: 1.8.0 Reporter: Till Rohrmann Fix For: 1.8.0 Check and port {{SlotSharingITCase}} to new code base. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11551) Allow RpcEndpoints to execute asynchronous stop operations
Till Rohrmann created FLINK-11551: - Summary: Allow RpcEndpoints to execute asynchronous stop operations Key: FLINK-11551 URL: https://issues.apache.org/jira/browse/FLINK-11551 Project: Flink Issue Type: Improvement Components: Distributed Coordination Affects Versions: 1.7.1, 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 Some {{RpcEndpoints}} would benefit from being able to run asynchronous stop operations in the main thread executor. At the moment, this is not possible since {{RpcEndpoint#postStop}} is called by by {{UntypedActor#postStop}} which is the last call for an actor. I suggest to introduce a new method {{RpcEndpoint#onStop}} which is executed when the {{RpcEndpoint}} is requested to terminate. Internally, this method will be called when the {{UntypedActor}} is still running, thus, allowing to execute arbitrary other messages (e.g. asynchronous operations in the main thread executor). Only after the future returned by {{onStop}} is completed, the underlying {{UntypedActor}} will shut down. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11537) ExecutionGraph does not reach terminal state when JobMaster lost leadership
Till Rohrmann created FLINK-11537: - Summary: ExecutionGraph does not reach terminal state when JobMaster lost leadership Key: FLINK-11537 URL: https://issues.apache.org/jira/browse/FLINK-11537 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 The {{ExecutionGraph}} sometimes does not reach a terminal state if the {{JobMaster}} lost the leadership. The reason is that we use the fenced main thread executor to execute {{ExecutionGraph}} changes and we don't wait for the {{ExecutionGraph}} to reach the terminal state before we set the fencing token {{null}}. One possible solution would be to wait for the {{ExecutionGraph}} to reach the terminal state before clearing the fencing token. This has, however, the downside that the {{JobMaster}} is still reachable until the {{ExecutionGraph}} has been properly terminated. Alternatively, we could use the unfenced main thread executor to send the cancel calls out. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11524) LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed on Travis
Till Rohrmann created FLINK-11524: - Summary: LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed on Travis Key: FLINK-11524 URL: https://issues.apache.org/jira/browse/FLINK-11524 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.8.0 Reporter: Till Rohrmann The {{LeaderChangeClusterComponentsTest.testReelectionOfJobMaster}} failed on Travis: https://api.travis-ci.org/v3/job/488578456/log.txt It looks as if the {{JobMaster}} cannot reconnect to the {{ResourceManager}}. Maybe this is caused by a wrong interleaving of the revoke and grant leadership calls to the {{TestingEmbeddedHaServices}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11521) Make RetryingRegistration timeouts configurable
Till Rohrmann created FLINK-11521: - Summary: Make RetryingRegistration timeouts configurable Key: FLINK-11521 URL: https://issues.apache.org/jira/browse/FLINK-11521 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 For more control and better testability I suggest to make the {{RetryingRegistration}} configurable. Especially the hard coded timeouts make testing in some scenarios hard. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11472) KafkaITCase.testCancelingEmptyTopic failed on Travis
Till Rohrmann created FLINK-11472: - Summary: KafkaITCase.testCancelingEmptyTopic failed on Travis Key: FLINK-11472 URL: https://issues.apache.org/jira/browse/FLINK-11472 Project: Flink Issue Type: Bug Components: Kafka Connector, Tests Affects Versions: 1.8.0 Reporter: Till Rohrmann Fix For: 1.8.0 The {{KafkaITCase.testCancelingEmptyTopic}} failed on Travis with a {{TimeoutException}}: {code} 05:15:35.521 [ERROR] Errors: 05:15:35.521 [ERROR] KafkaITCase.testCancelingEmptyTopic:85->KafkaConsumerTestBase.runCancelingOnEmptyInputTest:1124->KafkaTestBase.deleteTestTopic:206->Object.wait:502->Object.wait:-2 » TestTimedOut {code} https://api.travis-ci.org/v3/job/485729678/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11446) FlinkKafkaProducer011ITCase.testRecoverCommittedTransaction failed on Travis
Till Rohrmann created FLINK-11446: - Summary: FlinkKafkaProducer011ITCase.testRecoverCommittedTransaction failed on Travis Key: FLINK-11446 URL: https://issues.apache.org/jira/browse/FLINK-11446 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.8.0 Reporter: Till Rohrmann The {{FlinkKafkaProducer011ITCase.testRecoverCommittedTransaction}} failed on Travis with producing no output for 10 minutes: https://api.travis-ci.org/v3/job/485771998/log.txt -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11415) Introduce JobMasterServiceFactor for JobManagerRunner
Till Rohrmann created FLINK-11415: - Summary: Introduce JobMasterServiceFactor for JobManagerRunner Key: FLINK-11415 URL: https://issues.apache.org/jira/browse/FLINK-11415 Project: Flink Issue Type: Improvement Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 In order to better test the {{JobManagerRunner}}, I suggest to introduce a {{JobMasterServiceFactory}} which allows to control how the {{JobManagerRunner}} instantiates a {{JobMasterService}}. At the moment we create a {{JobMaster}} in the constructor of the {{JobManagerRunner}}. This unnecessarily couples these two components together and makes testing harder. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11414) Introduce JobMasterService interface for the JobManagerRunner
Till Rohrmann created FLINK-11414: - Summary: Introduce JobMasterService interface for the JobManagerRunner Key: FLINK-11414 URL: https://issues.apache.org/jira/browse/FLINK-11414 Project: Flink Issue Type: Improvement Components: Distributed Coordination Affects Versions: 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 In order to better separate concerns I suggest to introduce a {{JobMasterService}} interface which only exposes the lifecycle methods the {{JobManagerRunner}} needs instead of directly using the {{JobMaster}}. That way, we can later instantiate the {{JobManagerRunner}} easier for testing purposes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11400) JobManagerRunner does not wait for suspension of JobMaster
Till Rohrmann created FLINK-11400: - Summary: JobManagerRunner does not wait for suspension of JobMaster Key: FLINK-11400 URL: https://issues.apache.org/jira/browse/FLINK-11400 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.7.1, 1.6.3, 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 The {{JobManagerRunner}} does not wait for the suspension of the {{JobMaster}} to finish before granting leadership again. This can lead to a state where the {{JobMaster}} tries to start the {{ExecutionGraph}} but the {{SlotPool}} is still stopped. I suggest to linearize the leadership operations (granting and revoking leadership) similarly to the {{Dispatcher}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11383) Dispatcher does not clean up blobs of failed submissions
Till Rohrmann created FLINK-11383: - Summary: Dispatcher does not clean up blobs of failed submissions Key: FLINK-11383 URL: https://issues.apache.org/jira/browse/FLINK-11383 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.7.1, 1.6.3, 1.8.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.8.0 The {{Dispatcher}} does not clean up blobs originating from a failed submissions. Compared to the legacy code, this is a regression and we should remove relevant blobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11370) Check and port ZooKeeperLeaderElectionITCase to new code base if necessary
Till Rohrmann created FLINK-11370: - Summary: Check and port ZooKeeperLeaderElectionITCase to new code base if necessary Key: FLINK-11370 URL: https://issues.apache.org/jira/browse/FLINK-11370 Project: Flink Issue Type: Sub-task Components: Tests Reporter: Till Rohrmann Check and port {{ZooKeeperLeaderElectionITCase}} to new code base if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11368) Check and port TaskManagerStartupTest to new code base if necessary
Till Rohrmann created FLINK-11368: - Summary: Check and port TaskManagerStartupTest to new code base if necessary Key: FLINK-11368 URL: https://issues.apache.org/jira/browse/FLINK-11368 Project: Flink Issue Type: Sub-task Components: Tests Reporter: Till Rohrmann Check and port {{TaskManagerStartupTest}} to new code base if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11369) Check and port ZooKeeperHAJobManagerTest to new code base if necessary
Till Rohrmann created FLINK-11369: - Summary: Check and port ZooKeeperHAJobManagerTest to new code base if necessary Key: FLINK-11369 URL: https://issues.apache.org/jira/browse/FLINK-11369 Project: Flink Issue Type: Sub-task Components: Tests Reporter: Till Rohrmann Check and port {{ZooKeeperHAJobManagerTest}} to new code base if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11366) Check and port TaskManagerMetricsTest to new code base if necessary
Till Rohrmann created FLINK-11366: - Summary: Check and port TaskManagerMetricsTest to new code base if necessary Key: FLINK-11366 URL: https://issues.apache.org/jira/browse/FLINK-11366 Project: Flink Issue Type: Sub-task Components: Tests Reporter: Till Rohrmann Check and port {{TaskManagerMetricsTest}} to new code base if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11367) Check and port TaskManagerProcessReapingTestBase to new code base if necessary
Till Rohrmann created FLINK-11367: - Summary: Check and port TaskManagerProcessReapingTestBase to new code base if necessary Key: FLINK-11367 URL: https://issues.apache.org/jira/browse/FLINK-11367 Project: Flink Issue Type: Sub-task Components: Tests Reporter: Till Rohrmann Check and port {{TaskManagerProcessReapingTestBase}} to new code base if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11364) Check and port TaskManagerFailsITCase to new code base if necessary
Till Rohrmann created FLINK-11364: - Summary: Check and port TaskManagerFailsITCase to new code base if necessary Key: FLINK-11364 URL: https://issues.apache.org/jira/browse/FLINK-11364 Project: Flink Issue Type: Sub-task Components: Tests Reporter: Till Rohrmann Check and port {{TaskManagerFailsITCase}} to new code base if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-11365) Check and port TaskManagerFailureRecoveryITCase to new code base if necessary
Till Rohrmann created FLINK-11365: - Summary: Check and port TaskManagerFailureRecoveryITCase to new code base if necessary Key: FLINK-11365 URL: https://issues.apache.org/jira/browse/FLINK-11365 Project: Flink Issue Type: Sub-task Components: Tests Reporter: Till Rohrmann Check and port {{TaskManagerFailureRecoveryITCase}} to new code base if necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)