[jira] [Created] (FLINK-13616) Update BucketingSinkMigrationTest to restore from 1.9 savepoint

2019-08-07 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13616:
-

 Summary: Update BucketingSinkMigrationTest to restore from 1.9 
savepoint
 Key: FLINK-13616
 URL: https://issues.apache.org/jira/browse/FLINK-13616
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.10.0


Update {{BucketingSinkMigrationTest}} to restore from 1.9 savepoint.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13614) Add MigrationVersion.v1_9

2019-08-07 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13614:
-

 Summary: Add MigrationVersion.v1_9
 Key: FLINK-13614
 URL: https://issues.apache.org/jira/browse/FLINK-13614
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.10.0


Add {{MigrationVersion.v1_9}}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13615) Update CEPMigrationTest to restore from 1.9 savepoint

2019-08-07 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13615:
-

 Summary: Update CEPMigrationTest to restore from 1.9 savepoint
 Key: FLINK-13615
 URL: https://issues.apache.org/jira/browse/FLINK-13615
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.10.0


Update {{CEPMigrationTest}} to restore from 1.9 savepoint



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13613) Update migration tests for Flink 1.9

2019-08-07 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13613:
-

 Summary: Update migration tests for Flink 1.9
 Key: FLINK-13613
 URL: https://issues.apache.org/jira/browse/FLINK-13613
 Project: Flink
  Issue Type: Task
  Components: Tests
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.10.0


Once the Flink {{1.9.0}} release is out, we should update existing migration 
tests to cover restoring from {{1.9.0}} savepoints.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13608) Update upgrade compatibility table (docs/ops/upgrading.md) for 1.9.0

2019-08-07 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13608:
-

 Summary: Update upgrade compatibility table 
(docs/ops/upgrading.md) for 1.9.0
 Key: FLINK-13608
 URL: https://issues.apache.org/jira/browse/FLINK-13608
 Project: Flink
  Issue Type: Task
  Components: Documentation
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


Update upgrade compatibility table (docs/ops/upgrading.md) for 1.9.0



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13607) TPC-H end-to-end test (Blink planner) failed on Travis

2019-08-07 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13607:
-

 Summary: TPC-H end-to-end test (Blink planner) failed on Travis
 Key: FLINK-13607
 URL: https://issues.apache.org/jira/browse/FLINK-13607
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / API, Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


The {{TPC-H end-to-end test (Blink planner)}} fail consistently on Travis with

{code}
Generating test data...
Error: Could not find or load main class 
org.apache.flink.table.tpch.TpchDataGenerator
{code}

https://api.travis-ci.org/v3/job/568280203/log.txt
https://api.travis-ci.org/v3/job/568280209/log.txt
https://api.travis-ci.org/v3/job/568280215/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13606) PrometheusReporterEndToEndITCase.testReporter unstable on Travis

2019-08-07 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13606:
-

 Summary: PrometheusReporterEndToEndITCase.testReporter unstable on 
Travis
 Key: FLINK-13606
 URL: https://issues.apache.org/jira/browse/FLINK-13606
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Metrics
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


The {{PrometheusReporterEndToEndITCase.testReporter}} is unstable on Travis. It 
fails with {{java.io.IOException: Process failed due to timeout.}}

https://api.travis-ci.org/v3/job/568280216/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13599) Kinesis end-to-end test failed on Travis

2019-08-06 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13599:
-

 Summary: Kinesis end-to-end test failed on Travis
 Key: FLINK-13599
 URL: https://issues.apache.org/jira/browse/FLINK-13599
 Project: Flink
  Issue Type: Test
  Components: Connectors / Kinesis, Tests
Affects Versions: 1.9.0, 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


The {{Kinesis end-to-end test}} failed on Travis with 

{code}
2019-08-06 08:48:20,177 ERROR org.apache.flink.client.cli.CliFrontend   
- Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The main method 
caused an error: Unable to execute HTTP request: Connect to localhost:4567 
[localhost/127.0.0.1] failed: Connection refused (Connection refused)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:593)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438)
at 
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
at 
org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
at 
org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010)
at 
org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083)
at 
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083)
Caused by: org.apache.flink.kinesis.shaded.com.amazonaws.SdkClientException: 
Unable to execute HTTP request: Connect to localhost:4567 [localhost/127.0.0.1] 
failed: Connection refused (Connection refused)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1116)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1066)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.doInvoke(AmazonKinesisClient.java:2388)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.invoke(AmazonKinesisClient.java:2364)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeDescribeStream(AmazonKinesisClient.java:754)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.describeStream(AmazonKinesisClient.java:729)
at 
org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.describeStream(AmazonKinesisClient.java:766)
at 
org.apache.flink.streaming.kinesis.test.KinesisPubsubClient.createTopic(KinesisPubsubClient.java:63)
at 
org.apache.flink.streaming.kinesis.test.KinesisExampleTest.main(KinesisExampleTest.java:57)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576)
... 9 more
Caused by: 
org.apache.flink.kinesis.shaded.org.apache.http.conn.HttpHostConnectException: 
Connect to localhost:4567 [localhost/127.0.0.1] failed: Connection refused 
(Connection refused)
at 
org.apache.flink.kinesis.shaded.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159)
at 
org.apache.flink.kinesis.shaded.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 

[jira] [Created] (FLINK-13597) Test Flink on cluster

2019-08-06 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13597:
-

 Summary: Test Flink on cluster
 Key: FLINK-13597
 URL: https://issues.apache.org/jira/browse/FLINK-13597
 Project: Flink
  Issue Type: Test
  Components: Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


In order to verify the Flink {{1.9.0}} release, we should try it out on one of 
the available cloud providers (AWS, GCP). I would suggest to run a non-trivial 
workload on the cluster with using some of the new features:

- SQL job
- Filesystem being loaded as a plugin
- Enable fine-grained recovery for batch job
- Java 9
- Checkpoints to GFS



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13595) KafkaITCase.testBigRecordJob fails on Travis

2019-08-06 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13595:
-

 Summary: KafkaITCase.testBigRecordJob fails on Travis
 Key: FLINK-13595
 URL: https://issues.apache.org/jira/browse/FLINK-13595
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


The {{KafkaITCase.testBigRecordJob}} failed with a {{TestTimedOutException}}

{code}
Test testBigRecordJob(org.apache.flink.streaming.connectors.kafka.KafkaITCase) 
failed with:
org.junit.runners.model.TestTimedOutException: test timed out after 6 
milliseconds
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at 
org.apache.kafka.clients.admin.KafkaAdminClient.close(KafkaAdminClient.java:476)
at org.apache.kafka.clients.admin.AdminClient.close(AdminClient.java:92)
at org.apache.kafka.clients.admin.AdminClient.close(AdminClient.java:75)
at 
org.apache.flink.streaming.connectors.kafka.KafkaTestEnvironmentImpl.deleteTestTopic(KafkaTestEnvironmentImpl.java:153)
at 
org.apache.flink.streaming.connectors.kafka.KafkaTestBase.deleteTestTopic(KafkaTestBase.java:204)
at 
org.apache.flink.streaming.connectors.kafka.KafkaConsumerTestBase.runBigRecordTestTopology(KafkaConsumerTestBase.java:1336)
at 
org.apache.flink.streaming.connectors.kafka.KafkaITCase.testBigRecordJob(KafkaITCase.java:121)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)
{code}

https://api.travis-ci.org/v3/job/568176170/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13567) Avro Confluent Schema Registry nightly end-to-end test failed on Travis

2019-08-04 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13567:
-

 Summary: Avro Confluent Schema Registry nightly end-to-end test 
failed on Travis
 Key: FLINK-13567
 URL: https://issues.apache.org/jira/browse/FLINK-13567
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.9.0, 1.10.0
Reporter: Till Rohrmann


The {{Avro Confluent Schema Registry nightly end-to-end test}} failed on Travis 
with

{code}
[FAIL] 'Avro Confluent Schema Registry nightly end-to-end test' failed after 2 
minutes and 11 seconds! Test exited with exit code 1

No taskexecutor daemon (pid: 29044) is running anymore on 
travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501.
No standalonesession daemon to stop on host 
travis-job-b0823aec-c4ec-4d4b-8b59-e9f968de9501.
rm: cannot remove 
'/home/travis/build/apache/flink/flink-dist/target/flink-1.10-SNAPSHOT-bin/flink-1.10-SNAPSHOT/plugins':
 No such file or directory
{code}

https://api.travis-ci.org/v3/job/567273939/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13557) Improve failover strategy documentation with explanation of concepts

2019-08-02 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13557:
-

 Summary: Improve failover strategy documentation with explanation 
of concepts
 Key: FLINK-13557
 URL: https://issues.apache.org/jira/browse/FLINK-13557
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.9.0, 1.10.0
Reporter: Till Rohrmann


The current failover strategy configuration documentation 
(https://ci.apache.org/projects/flink/flink-docs-master/dev/task_failure_recovery.html#failover-strategies)
 could be improved by explaining the concept of pipelined regions in more 
detail. It could benefit from some figures visualizing the concept. We should 
even consider adding the explanation of pipelined regions to Flink's concept 
page because it is quite central to how Flink (will) work.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13556) Python profile failed on Travis with setup problem

2019-08-02 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13556:
-

 Summary: Python profile failed on Travis with setup problem
 Key: FLINK-13556
 URL: https://issues.apache.org/jira/browse/FLINK-13556
 Project: Flink
  Issue Type: Bug
  Components: API / Python
Affects Versions: 1.9.0, 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


The Python profile failed on Travis because conda install 3.5 failed with the 
following error {{Error([('SSL routines', 'ssl3_get_record', 'decryption failed 
or bad record mac')])}}.

https://api.travis-ci.org/v3/job/566835602/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13553) KvStateServerHandlerTest.readInboundBlocking unstable on Travis

2019-08-02 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13553:
-

 Summary: KvStateServerHandlerTest.readInboundBlocking unstable on 
Travis
 Key: FLINK-13553
 URL: https://issues.apache.org/jira/browse/FLINK-13553
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Queryable State
Affects Versions: 1.10.0
Reporter: Till Rohrmann
 Fix For: 1.10.0


The {{KvStateServerHandlerTest.readInboundBlocking}} and 
{{KvStateServerHandlerTest.testQueryExecutorShutDown}} fail on Travis with a 
{{TimeoutException}}.

https://api.travis-ci.org/v3/job/566420641/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13497) Checkpoints can complete after CheckpointFailureManager fails job

2019-07-30 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13497:
-

 Summary: Checkpoints can complete after CheckpointFailureManager 
fails job
 Key: FLINK-13497
 URL: https://issues.apache.org/jira/browse/FLINK-13497
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.9.0, 1.10.0
Reporter: Till Rohrmann


I think that we introduced with FLINK-12364 an inconsistency wrt to job 
termination a checkpointing. In FLINK-9900 it was discovered that checkpoints 
can complete even after the {{CheckpointFailureManager}} decided to fail a job. 
I think the expected behaviour should be that we fail all pending checkpoints 
once the {{CheckpointFailureManager}} decides to fail the job.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13476) Release partitions for FINISHED or FAILED tasks if they are cancelled

2019-07-29 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13476:
-

 Summary: Release partitions for FINISHED or FAILED tasks if they 
are cancelled
 Key: FLINK-13476
 URL: https://issues.apache.org/jira/browse/FLINK-13476
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Reporter: Till Rohrmann
 Fix For: 1.9.0


With FLINK-12615 we removed that partitions are being explicitly released from 
the JM if an {{Execution}} which is in state {{FINISHED}} or {{FAILED}} is 
being cancelled. In order to not have resource leak when using pipelined result 
partitions whose consumers fail before start consuming, we should re-introduce 
the deleted else branch (removed via 408f6b67aefaccfc76708b2d4772eb7f0a8fd984).

Once we properly wait that a {{Task}} does not finish until its produced 
results have been either persisted or sent to a consumer, then we should be 
able to remove this branch again.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13475) Reduce dependency on third-party maven repositories

2019-07-29 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13475:
-

 Summary: Reduce dependency on third-party maven repositories
 Key: FLINK-13475
 URL: https://issues.apache.org/jira/browse/FLINK-13475
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / Hive
Affects Versions: 1.9.0
Reporter: Till Rohrmann


A user reported that Flink's Hive connectors requires third-party maven 
repositories which are not everywhere accessible in order to build. Concretely, 
the hive connector requires access to Conjars for 
org.pentaho:pentaho-aggdesigner-algorithm and javax.jms:jms:jar:1.1.

It would be great to reduce the dependency on third-party maven repositories if 
possible. For future reference, other projects faced similar problems: 
CALCITE-605, CALCITE-1474



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13441) Add batch E2E test which runs with fewer slots than parallelism

2019-07-26 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13441:
-

 Summary: Add batch E2E test which runs with fewer slots than 
parallelism
 Key: FLINK-13441
 URL: https://issues.apache.org/jira/browse/FLINK-13441
 Project: Flink
  Issue Type: Test
  Components: API / DataSet, Tests
Reporter: Till Rohrmann
 Fix For: 1.9.0


We should adapt the existing batch E2E test to use the newly introduced 
{{ScheduleMode#LAZY_FROM_SOURCES_WITH_BATCH_SLOT_REQUEST}} and verify that the 
job runs on a cluster with fewer slots than the job's parallelism. In order to 
make this work, we need to set the shuffles to be blocking via 
{{ExecutionMode#BATCH}}. As a batch job we should use the 
{{DataSetAllroundTestProgram}}.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13439) Run existing SQL/Table API E2E tests with blink runner

2019-07-26 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13439:
-

 Summary: Run existing SQL/Table API E2E tests with blink runner
 Key: FLINK-13439
 URL: https://issues.apache.org/jira/browse/FLINK-13439
 Project: Flink
  Issue Type: Test
  Components: Table SQL / API, Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


We should run all existing SQL/Table API E2E tests with the blink runner. As 
part of FLINK-13273 the {{test_sql_client.sh}} test will already be ported to 
run with blink. Additionally we also need to enable the 
{{test_streaming_sql.sh}} test to run with the blink runner.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13437) Add Hive SQL E2E test

2019-07-26 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13437:
-

 Summary: Add Hive SQL E2E test
 Key: FLINK-13437
 URL: https://issues.apache.org/jira/browse/FLINK-13437
 Project: Flink
  Issue Type: Test
  Components: Connectors / Hive, Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


We should add an E2E test for the Hive integration: List all tables and read 
some metadata, read from an existing table, register a new table in Hive, use a 
registered function, write to an existing table, write to a new table.




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13436) Add TPC-H queries as E2E tests

2019-07-26 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13436:
-

 Summary: Add TPC-H queries as E2E tests
 Key: FLINK-13436
 URL: https://issues.apache.org/jira/browse/FLINK-13436
 Project: Flink
  Issue Type: Test
  Components: Table SQL / Planner, Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


We should add the TPC-H queries as E2E tests in order to verify the blink 
planner.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13434) Add E2E test for stop-with-savepoint

2019-07-26 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13434:
-

 Summary: Add E2E test for stop-with-savepoint
 Key: FLINK-13434
 URL: https://issues.apache.org/jira/browse/FLINK-13434
 Project: Flink
  Issue Type: Test
  Components: Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


We should modify the existing cancel-with-savepoint E2E test to use the new 
stop-with-savepoint command.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13249) Distributed Jepsen test fails with blocked TaskExecutor

2019-07-12 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13249:
-

 Summary: Distributed Jepsen test fails with blocked TaskExecutor
 Key: FLINK-13249
 URL: https://issues.apache.org/jira/browse/FLINK-13249
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


The distributed Jepsen test which kills {{JobMasters}} started to fail 
recently. From a first glance, it looks as if the {{TaskExecutor's}} main 
thread is blocked by some operation. Further investigation is required.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13228) HadoopRecoverableWriterTest.testCommitAfterNormalClose fails on Travis

2019-07-11 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13228:
-

 Summary: HadoopRecoverableWriterTest.testCommitAfterNormalClose 
fails on Travis
 Key: FLINK-13228
 URL: https://issues.apache.org/jira/browse/FLINK-13228
 Project: Flink
  Issue Type: Bug
  Components: FileSystems
Affects Versions: 1.9.0
Reporter: Till Rohrmann


{{HadoopRecoverableWriterTest.testCommitAfterNormalClose}} failed on Travis with

{code}
HadoopRecoverableWriterTest.testCommitAfterNormalClose » IO The stream is 
closed
{code}

https://api.travis-ci.org/v3/job/557293706/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13226) KafkaProducerExactlyOnceITCase.testMultipleSinkOperators fails on Travis

2019-07-11 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13226:
-

 Summary: KafkaProducerExactlyOnceITCase.testMultipleSinkOperators 
fails on Travis
 Key: FLINK-13226
 URL: https://issues.apache.org/jira/browse/FLINK-13226
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.9.0
Reporter: Till Rohrmann


The {{KafkaProducerExactlyOnceITCase.testMultipleSinkOperators}} fails on 
Travis with not producing output for 300 s.

https://api.travis-ci.org/v3/job/557290235/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13216) AggregateITCase.testNestedGroupByAgg fails on Travis

2019-07-11 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13216:
-

 Summary: AggregateITCase.testNestedGroupByAgg fails on Travis
 Key: FLINK-13216
 URL: https://issues.apache.org/jira/browse/FLINK-13216
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


The {{AggregateITCase.testNestedGroupByAgg}} fails on Travis with

{code}
AggregateITCase.testNestedGroupByAgg:472 expected: but was:
{code}

https://api.travis-ci.org/v3/job/557214216/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (FLINK-13187) Expose batch slot request via new scheduling mode

2019-07-10 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13187:
-

 Summary: Expose batch slot request via new scheduling mode
 Key: FLINK-13187
 URL: https://issues.apache.org/jira/browse/FLINK-13187
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.9.0


In order to leverage the newly introduced batch slot request feature introduced 
with FLINK-13166, we should introduce a new 
{{ScheduleMode#LAZY_FROM_SOURCES_WITH_BATCH_SLOT_REQUESTS}}. Users who enable 
this, must make sure that the generated job does not contain any pipelined 
shuffles because this can lead to resource deadlocks (not timing out requests 
with too few slots to make a pipelined region run). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13166) Support batch slot requests

2019-07-09 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13166:
-

 Summary: Support batch slot requests
 Key: FLINK-13166
 URL: https://issues.apache.org/jira/browse/FLINK-13166
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


In order to support the execution of batch jobs with fewer slots than requested 
we need to introduce a special slot request notion which does not register an 
eager timeout. Moreover, this slot request should not react to failure signals 
from the {{ResourceManager}} and only time out if there is not available or 
allocated slot which can fulfill the requested {{ResourceProfile}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13165) Complete requested slots in request order

2019-07-09 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13165:
-

 Summary: Complete requested slots in request order
 Key: FLINK-13165
 URL: https://issues.apache.org/jira/browse/FLINK-13165
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


When executing batch jobs with fewer slots than requested we should make sure 
that the slot requests are being completed in the order in which they were 
enqueued into the {{SlotPool}}. Otherwise we might risk that a consumer task 
gets deployed before a producer causing a resource deadlock situation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13156) Elasticsearch (v1.7.1) sink end-to-end test failed on Travis

2019-07-08 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13156:
-

 Summary: Elasticsearch (v1.7.1) sink end-to-end test failed on 
Travis
 Key: FLINK-13156
 URL: https://issues.apache.org/jira/browse/FLINK-13156
 Project: Flink
  Issue Type: Bug
  Components: Connectors / ElasticSearch
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


The {{Elasticsearch (v1.7.1) sink end-to-end test}} failed on Travis because 
the test waited for Elasticsearch records indefinitely.

https://api.travis-ci.org/v3/job/554991871/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13155) SQL Client end-to-end test fails on Travis

2019-07-08 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13155:
-

 Summary: SQL Client end-to-end test fails on Travis
 Key: FLINK-13155
 URL: https://issues.apache.org/jira/browse/FLINK-13155
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Client
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


The {{SQL Client end-to-end test}} which executes 
{{test-scripts/test_sql_client.sh}} fails on Travis with non-empty out files.

https://api.travis-ci.org/v3/job/554991859/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13154) Broken links in documentation

2019-07-08 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13154:
-

 Summary: Broken links in documentation
 Key: FLINK-13154
 URL: https://issues.apache.org/jira/browse/FLINK-13154
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


Flink's broken link verification failed on Travis as it has discovered some 
broken links. We need to fix this for the upcoming release.

{code}
[2019-07-07 10:54:21] ERROR `/tutorials/datastream_api.html' not found.
[2019-07-07 10:54:21] ERROR `/tutorials/local_setup.html' not found.
[2019-07-07 10:54:21] ERROR `/tutorials/flink_on_windows.html' not found.
[2019-07-07 10:54:21] ERROR `/examples' not found.
[2019-07-07 10:54:23] ERROR `/zh/dev/connectors/pubsub.html' not found.
[2019-07-07 10:54:24] ERROR `/ops/filesystems.html' not found.
[2019-07-07 10:54:24] ERROR `/dev/table/catalog.html' not found.
[2019-07-07 10:54:28] ERROR `/zh/tutorials/datastream_api.html' not found.
[2019-07-07 10:54:28] ERROR `/zh/tutorials/local_setup.html' not found.
http://localhost:4000/tutorials/datastream_api.html:
Remote file does not exist -- broken link!!!
http://localhost:4000/tutorials/local_setup.html:
Remote file does not exist -- broken link!!!
--
http://localhost:4000/tutorials/flink_on_windows.html:
Remote file does not exist -- broken link!!!
--
http://localhost:4000/examples:
Remote file does not exist -- broken link!!!
--
http://localhost:4000/zh/dev/connectors/pubsub.html:
Remote file does not exist -- broken link!!!
--
http://localhost:4000/ops/filesystems.html:
Remote file does not exist -- broken link!!!
--
http://localhost:4000/dev/table/catalog.html:
Remote file does not exist -- broken link!!!
--
http://localhost:4000/zh/tutorials/datastream_api.html:
Remote file does not exist -- broken link!!!
http://localhost:4000/zh/tutorials/local_setup.html:
Remote file does not exist -- broken link!!!
{code}

https://api.travis-ci.org/v3/job/554991858/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13153) SplitAggregateITCase.testMinMaxWithRetraction failed on Travis

2019-07-08 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13153:
-

 Summary: SplitAggregateITCase.testMinMaxWithRetraction failed on 
Travis
 Key: FLINK-13153
 URL: https://issues.apache.org/jira/browse/FLINK-13153
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


{{SplitAggregateITCase.testMinMaxWithRetraction}} failed on Travis with

{code}
Failures: 
10:50:43.355 [ERROR]   SplitAggregateITCase.testMinMaxWithRetraction:195 
expected: but was:
{code}

https://api.travis-ci.org/v3/job/554991853/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13140) PushFilterIntoTableSourceScanRuleTest.testWithUd fails on Travis

2019-07-08 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13140:
-

 Summary: PushFilterIntoTableSourceScanRuleTest.testWithUd fails on 
Travis
 Key: FLINK-13140
 URL: https://issues.apache.org/jira/browse/FLINK-13140
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


{{PushFilterIntoTableSourceScanRuleTest.testWithUd}} fails on Travis with

{code}
06:22:11.290 [ERROR] Failures: 
06:22:11.290 [ERROR]   PushFilterIntoTableSourceScanRuleTest.testWithUdf:93 
planAfter 
expected:<...ssions$utils$Func1$$[a39386268ffec8461452460bcbe089ad]($2), 32)])
   +- Lo...> but 
was:<...ssions$utils$Func1$$[8a89e0d5a022a06a00c7734a25295ff4]($2), 32)])
   +- Lo...>
{code}

https://api.travis-ci.org/v3/job/555252046/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-13139) Various Hive tests fail on Travis

2019-07-08 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-13139:
-

 Summary: Various Hive tests fail on Travis
 Key: FLINK-13139
 URL: https://issues.apache.org/jira/browse/FLINK-13139
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Hive
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


Various Hive related tests fail on Travis:

{code}
06:06:49.654 [ERROR] Errors: 
06:06:49.654 [ERROR]   HiveInputFormatTest.createCatalog:66 » Catalog Failed 
to create Hive Metastore...
06:06:49.654 [ERROR]   HiveTableFactoryTest.init:55 » Catalog Failed to create 
Hive Metastore client
06:06:49.654 [ERROR]   HiveTableOutputFormatTest.createCatalog:72 » Catalog 
Failed to create Hive Met...
06:06:49.654 [ERROR]   HiveTableSinkTest.createCatalog:72 » Catalog Failed to 
create Hive Metastore c...
06:06:49.654 [ERROR]   HiveTableSourceTest.createCatalog:67 » Catalog Failed 
to create Hive Metastore...
06:06:49.654 [ERROR]   HiveCatalogGenericMetadataTest.init:49 » Catalog Failed 
to create Hive Metasto...
06:06:49.654 [ERROR]   HiveCatalogHiveMetadataTest.init:55 » Catalog Failed to 
create Hive Metastore ...
06:06:49.654 [ERROR]   HiveGenericUDFTest.testCeil:193->init:387 » 
ExceptionInInitializer
06:06:49.654 [ERROR]   HiveGenericUDFTest.testDecode:160 » NoClassDefFound 
Could not initialize class...
06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFArray_singleArray:202->init:237 
» NoClassDefFound Cou...
06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFBin:60->init:237 » 
NoClassDefFound Could not initiali...
06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFConv:67->init:237 » 
NoClassDefFound Could not initial...
06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFJson:85->init:237 » 
NoClassDefFound Could not initial...
06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFMinute:126->init:237 » 
ExceptionInInitializer
06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFRand:51->init:237 » 
NoClassDefFound Could not initial...
06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFRegExpExtract:153->init:237 » 
NoClassDefFound Could n...
06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFToInteger:188->init:237 » 
NoClassDefFound Could not i...
06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFUnbase64:166->init:237 » 
NoClassDefFound Could not in...
06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFUnhex:177->init:237 » 
NoClassDefFound Could not initi...
06:06:49.654 [ERROR]   HiveSimpleUDFTest.testUDFWeekOfYear:139->init:237 » 
NoClassDefFound Could not ...
{code}

https://api.travis-ci.org/v3/job/555252043/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12916) KeyedComplexChainTest.testMigrationAndRestore failed on Travis

2019-06-20 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12916:
-

 Summary: KeyedComplexChainTest.testMigrationAndRestore failed on 
Travis
 Key: FLINK-12916
 URL: https://issues.apache.org/jira/browse/FLINK-12916
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing, Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


The test case {{KeyedComplexChainTest.testMigrationAndRestore}} failed on 
Travis because a Task received the cancellation from one of its inputs
{code}
Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Task 
received cancellation from one of its inputs
at 
org.apache.flink.streaming.runtime.io.BarrierBuffer.notifyAbortOnCancellationBarrier(BarrierBuffer.java:428)
at 
org.apache.flink.streaming.runtime.io.BarrierBuffer.processCancellationBarrier(BarrierBuffer.java:327)
at 
org.apache.flink.streaming.runtime.io.BarrierBuffer.pollNext(BarrierBuffer.java:208)
at 
org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.pollNextNullable(StreamTaskNetworkInput.java:102)
at 
org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.pollNextNullable(StreamTaskNetworkInput.java:47)
at 
org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:128)
at 
org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.performDefaultAction(OneInputStreamTask.java:101)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.run(StreamTask.java:268)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:376)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:676)
... 1 more
{code}

https://api.travis-ci.org/v3/job/548181384/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12915) AbstractOperatorRestoreTestBase can deadlock if one test fails

2019-06-20 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12915:
-

 Summary: AbstractOperatorRestoreTestBase can deadlock if one test 
fails
 Key: FLINK-12915
 URL: https://issues.apache.org/jira/browse/FLINK-12915
 Project: Flink
  Issue Type: Bug
  Components: API / Type Serialization System, Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann


The {{AbstractOperatorRestoreTestBase}} can deadlock in case of a failing test 
case. The problem is that if one test fails then the corresponding test job 
won't be canceled. Due to that a succeeding test case might not get enough 
slots to execute because the job of the failing test is still running. This 
leads to a deadlocked test as happened here: 
https://api.travis-ci.org/v3/job/548181384/log.txt

A side effect is that the original test failure won't be properly displayed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12863) Race condition between slot offerings and AllocatedSlotReport

2019-06-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12863:
-

 Summary: Race condition between slot offerings and 
AllocatedSlotReport
 Key: FLINK-12863
 URL: https://issues.apache.org/jira/browse/FLINK-12863
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


With FLINK-11059 we introduced the {{AllocatedSlotReport}} which is used by the 
{{TaskExecutor}} to synchronize its internal view on slot allocations with the 
view of the {{JobMaster}}. It seems that there is a race condition between 
offering slots and receiving the report because the {{AllocatedSlotReport}} is 
sent by the {{HeartbeatManagerSenderImpl}} from a separate thread. 

Due to that it can happen that we generate an {{AllocatedSlotReport}} just 
before getting new slots offered. Since the report is sent from a different 
thread, it can then happen that the response to the slot offerings is sent 
earlier than the {{AllocatedSlotReport}}. Consequently, we might receive an 
outdated slot report on the {{TaskExecutor}} causing active slots to be 
released.

In order to solve the problem I propose to add a fencing token to the 
{{AllocatedSlotReport}} which is being updated whenever we offer new slots to 
the {{JobMaster}}. When we receive the {{AllocatedSlotReport}} on the 
{{TaskExecutor}} we compare the current slot report fencing token with the 
received one and only process the report if they are equal. Otherwise we wait 
for the next heartbeat to send us an up to date {{AllocatedSlotReport}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12768) FlinkKinesisConsumerTest.testSourceSynchronization unstable on Travis

2019-06-06 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12768:
-

 Summary: FlinkKinesisConsumerTest.testSourceSynchronization 
unstable on Travis
 Key: FLINK-12768
 URL: https://issues.apache.org/jira/browse/FLINK-12768
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kinesis
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


The {{FlinkKinesisConsumerTest.testSourceSynchronization}} seems to be unstable 
on Travis. It fails with 

{code}
[ERROR] 
testSourceSynchronization(org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumerTest)
  Time elapsed: 10.031 s  <<< FAILURE!
java.lang.AssertionError: 

Expected: iterable containing ["1", ]
 but: No item matched: 
at 
org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumerTest.testSourceSynchronization(FlinkKinesisConsumerTest.java:950)
{code}

https://api.travis-ci.org/v3/job/541845510/log.txt

While locking into the problem, I noticed that the test case takes 1 second to 
execute on my machine. I'm wondering whether this really needs to take this 
long. Moreover, the test code contains {{Thread.sleeps}} and uses 
{{Whiteboxing}} which we should avoid.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12670) Implement FailureRateRestartBackoffTimeStrategy

2019-05-29 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12670:
-

 Summary: Implement FailureRateRestartBackoffTimeStrategy
 Key: FLINK-12670
 URL: https://issues.apache.org/jira/browse/FLINK-12670
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


We should implement a {{FailureRateRestartBackoffTimeStrategy}} similar to the 
{{FailureRateRestartStrategy}} which allows to configure the number of allowed 
restarts and the fixed delay in between restart attempts.

In order to be backwards compatible, we should respect the configuration values 
used to configure the {{FailureRateRestartStrategy}} (see the documentation for 
more information: 
https://ci.apache.org/projects/flink/flink-docs-stable/dev/restart_strategies.html).
 Additionally, we should also respect the 
{{FailureRateRestartStrategyConfiguration}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12669) Implement FixedDelayRestartBackoffTimeStrategy

2019-05-29 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12669:
-

 Summary: Implement FixedDelayRestartBackoffTimeStrategy
 Key: FLINK-12669
 URL: https://issues.apache.org/jira/browse/FLINK-12669
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.9.0
Reporter: Till Rohrmann
 Fix For: 1.9.0


We should implement a {{FixedDelayRestartBackoffTimeStrategy}} similar to the 
{{FixedDelayRestartStrategy}} which allows to configure the number of allowed 
restarts and the fixed delay in between restart attempts.

In order to be backwards compatible, we should respect the configuration values 
used to configure the {{FixedDelayRestartStrategy}} (see the documentation for 
more information: 
https://ci.apache.org/projects/flink/flink-docs-stable/dev/restart_strategies.html).
 Additionally, we should also respect the 
{{FixedDelayRestartStrategyConfiguration}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12654) Enable Azure FS E2E tests on Travis

2019-05-28 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12654:
-

 Summary: Enable Azure FS E2E tests on Travis
 Key: FLINK-12654
 URL: https://issues.apache.org/jira/browse/FLINK-12654
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / FileSystem, Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann


We should think about enabling the {{test_azure_fs.sh}} E2E test on Travis. 
This would improve our test coverage. 

In order to do this, we need to setup an Azure account, set up the right 
environment variables {{$IT_CASE_AZURE_ACCOUNT}}, 
{{$IT_CASE_AZURE_ACCESS_KEY}}, {{$IT_CASE_AZURE_CONTAINER}} and upload a static 
{{words}} file to 
{{wasbs://$IT_CASE_AZURE_CONTAINER@$IT_CASE_AZURE_ACCOUNT.blob.core.windows.net/words}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12592) StreamTableEnvironment object has no attribute connect

2019-05-22 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12592:
-

 Summary: StreamTableEnvironment object has no attribute connect
 Key: FLINK-12592
 URL: https://issues.apache.org/jira/browse/FLINK-12592
 Project: Flink
  Issue Type: Bug
  Components: API / Python, Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann


The Python build module failed on Travis with the following problem: 
{{'StreamTableEnvironment' object has no attribute 'connect'}}.

https://api.travis-ci.org/v3/job/535684431/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12545) TableSourceTest.testNestedProject failed on Travis

2019-05-17 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12545:
-

 Summary: TableSourceTest.testNestedProject failed on Travis
 Key: FLINK-12545
 URL: https://issues.apache.org/jira/browse/FLINK-12545
 Project: Flink
  Issue Type: Bug
  Components: Table SQL / Planner
Affects Versions: 1.9.0
Reporter: Till Rohrmann


The {{TableSourceTest.testNestedProject}} failed on Travis with

{code}
07:15:38.909 [ERROR] Failures: 
07:15:38.909 [ERROR]   TableSourceTest.testNestedProject:375 null 
expected:<...deepNested.nested2.f[lag AS nestedFlag, deepNested.nested2.num AS 
nestedNum])
StreamTableSourceScan(table=[[T]], fields=[id, deepNested, nested], 
source=[TestSource(read nested fields: id.*, deepNested.nested2.num, 
deepNested.nested2.flag], deepNested.nested1...> but 
was:<...deepNested.nested2.f[1 AS nestedFlag, deepNested.nested2.f0 AS 
nestedNum])
StreamTableSourceScan(table=[[T]], fields=[id, deepNested, nested], 
source=[TestSource(read nested fields: id.*, deepNested.nested2.f1, 
deepNested.nested2.f0], deepNested.nested1...>
{code}

https://api.travis-ci.org/v3/job/533646868/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12502) Harden JobMasterTest#testRequestNextInputSplitWithDataSourceFailover

2019-05-13 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12502:
-

 Summary: Harden 
JobMasterTest#testRequestNextInputSplitWithDataSourceFailover
 Key: FLINK-12502
 URL: https://issues.apache.org/jira/browse/FLINK-12502
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination, Tests
Affects Versions: 1.9.0
Reporter: Till Rohrmann


The {{JobMasterTest#testRequestNextInputSplitWithDataSourceFailover}} relies on 
how many files you have in your working directory. This assumption is quite 
brittle. Instead we should explicitly instantiate an {{InputSplitAssigner}} 
with a defined number of input splits. 

Moreover, we should make the assertions more explicit: Input split comparisons 
should not rely solely on the length of the input split data.

Maybe it is also not necessary to capture the full {{TaskDeploymentDescriptor}} 
because we could already know the producer's and consumer's {{JobVertexID}} 
when we create the {{JobGraph}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12324) Remove ActorGateway interface

2019-04-24 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12324:
-

 Summary: Remove ActorGateway interface
 Key: FLINK-12324
 URL: https://issues.apache.org/jira/browse/FLINK-12324
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.9.0


Remove the {{ActorGateway}} interface.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12323) Remove legacy ActorGateway implementations

2019-04-24 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12323:
-

 Summary: Remove legacy ActorGateway implementations
 Key: FLINK-12323
 URL: https://issues.apache.org/jira/browse/FLINK-12323
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.9.0


Remove the legacy {{ActorGateway}} based implementations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12322) Remove legacy ActorTaskManagerGateway

2019-04-24 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12322:
-

 Summary: Remove legacy ActorTaskManagerGateway
 Key: FLINK-12322
 URL: https://issues.apache.org/jira/browse/FLINK-12322
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Till Rohrmann
 Fix For: 1.9.0


Remove the legacy {{ActorTaskManagerGateway}} component from Flink.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12122) Spread out tasks evenly across all available registered TaskManagers

2019-04-08 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12122:
-

 Summary: Spread out tasks evenly across all available registered 
TaskManagers
 Key: FLINK-12122
 URL: https://issues.apache.org/jira/browse/FLINK-12122
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Affects Versions: 1.7.2, 1.6.4, 1.8.0
Reporter: Till Rohrmann
 Fix For: 1.7.3, 1.9.0, 1.8.1


With Flip-6, we changed the default behaviour how slots are assigned to 
{{TaskManages}}. Instead of evenly spreading it out over all registered 
{{TaskManagers}}, we randomly pick slots from {{TaskManagers}} with a tendency 
to first fill up a TM before using another one. This is a regression wrt the 
pre Flip-6 code.

I suggest to change the behaviour so that we try to evenly distribute slots 
across all available {{TaskManagers}} by considering how many of their slots 
are already allocated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12070) Make blocking result partitions consumable multiple times

2019-03-29 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12070:
-

 Summary: Make blocking result partitions consumable multiple times
 Key: FLINK-12070
 URL: https://issues.apache.org/jira/browse/FLINK-12070
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Network
Reporter: Till Rohrmann


In order to avoid writing produced results multiple times for multiple 
consumers and in order to speed up batch recoveries, we should make the 
blocking result partitions to be consumable multiple times. At the moment a 
blocking result partition will be released once the consumers has processed all 
data. Instead the result partition should be released once the next blocking 
result has been produced and all consumers of a blocking result partition have 
terminated. Moreover, blocking results should not hold on slot resources like 
network buffers or memory as it is currently the case with 
{{SpillableSubpartitions}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12069) Add proper lifecycle management for intermediate result partitions

2019-03-29 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12069:
-

 Summary: Add proper lifecycle management for intermediate result 
partitions
 Key: FLINK-12069
 URL: https://issues.apache.org/jira/browse/FLINK-12069
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination, Runtime / Network
Affects Versions: 1.8.0, 1.9.0
Reporter: Till Rohrmann


In order to properly execute batch jobs, we should make the lifecycle 
management of intermediate result partitions the responsibility of the 
{{JobMaster}}/{{Scheduler}} component. The {{Scheduler}} knows best when an 
intermediate result partition is no longer needed and, thus, can be freed. So 
for example, a blocking intermediate result should only be released after all 
subsequent blocking intermediate results have been completed in order to speed 
up potential failovers.

Moreover, having explicit control over intermediate result partitions, could 
also enable use cases like result partition sharing between jobs and even 
across clusters (by simply not releasing the result partitions). 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12068) Backtrack fail over regions if intermediate results are unavailable

2019-03-29 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12068:
-

 Summary: Backtrack fail over regions if intermediate results are 
unavailable
 Key: FLINK-12068
 URL: https://issues.apache.org/jira/browse/FLINK-12068
 Project: Flink
  Issue Type: Sub-task
  Components: Runtime / Coordination
Reporter: Till Rohrmann


The batch failover strategy needs to be able to backtrack fail over regions if 
an intermediate result is unavailable. Either by explicitly checking whether 
the intermediate result partition is available or via a special exception 
indicating that a result partition is no longer available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12058) Cancel checkpoint operations belonging to a discarded/aborted checkpoint

2019-03-28 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12058:
-

 Summary: Cancel checkpoint operations belonging to a 
discarded/aborted checkpoint
 Key: FLINK-12058
 URL: https://issues.apache.org/jira/browse/FLINK-12058
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Checkpointing
Affects Versions: 1.7.2, 1.8.0
Reporter: Till Rohrmann


In order to save CPU cycles and reduce disk and network I/O, we should try to 
cancel local checkpoint operations belonging to discarded aborted or subsumed 
checkpoints. For example, if a {{Task}} declines a checkpoint, the 
{{CheckpointCoordinator}} will discard this checkpoint. However, other 
checkpointing operations belonging to this checkpoint won't be necessarily 
notified and canceled.

The notification mechanism could piggy back on the existing 
{{CancelCheckpointMarker}} or be a separate signal sent to all participating 
{{Tasks}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12036) Simplify ExecutionGraph by removing legacy concurrency code

2019-03-27 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12036:
-

 Summary: Simplify ExecutionGraph by removing legacy concurrency 
code
 Key: FLINK-12036
 URL: https://issues.apache.org/jira/browse/FLINK-12036
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann


With FLINK-11417, we made the {{ExecutionGraph}} single threaded. Consequently 
all the logic which handles concurrent state changes is no longer needed (e.g. 
{{AtomicReferenceFieldUpdater}}, checking for concurrent changes, etc.). This 
issue is the umbrella to collect every legacy concurrency code which can be 
removed from the {{ExecutionGraph}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12034) KafkaITCase.testAllDeletes failed on Travis

2019-03-27 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12034:
-

 Summary: KafkaITCase.testAllDeletes failed on Travis
 Key: FLINK-12034
 URL: https://issues.apache.org/jira/browse/FLINK-12034
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Tests
Affects Versions: 1.8.0
Reporter: Till Rohrmann


The {{KafkaITCase.testAllDeletes}} failed on Travis because it timed out:

{code}
12:36:24.782 [ERROR] Errors: 
12:36:24.783 [ERROR]   
KafkaITCase.testAllDeletes:131->KafkaConsumerTestBase.runAllDeletesTest:1526->KafkaTestBase.deleteTestTopic:192->Object.wait:-2
 » TestTimedOut
12:36:24.783 [INFO] 
12:36:24.783 [ERROR] Tests run: 46, Failures: 0, Errors: 1, Skipped: 0
{code}

https://api.travis-ci.org/v3/job/511419474/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12033) KafkaITCase.testCancelingFullTopic failed on Travis

2019-03-27 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12033:
-

 Summary: KafkaITCase.testCancelingFullTopic failed on Travis
 Key: FLINK-12033
 URL: https://issues.apache.org/jira/browse/FLINK-12033
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Test Infrastructure
Affects Versions: 1.8.0
Reporter: Till Rohrmann


The {{KafkaITCase.testCancelingFullTopic}} failed on Travis because it timed 
out.

https://api.travis-ci.org/v3/job/511419185/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12032) KafkaProducerAtLeastOnceITCase failed to bind to address

2019-03-27 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12032:
-

 Summary: KafkaProducerAtLeastOnceITCase failed to bind to address
 Key: FLINK-12032
 URL: https://issues.apache.org/jira/browse/FLINK-12032
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Tests
Affects Versions: 1.8.0
Reporter: Till Rohrmann


The {{KafkaProducerAtLeastOnceITCase}} failed on Travis because it could not 
bind to the specified address.

https://api.travis-ci.org/v3/job/511419185/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12021) Let ResultConjunctFuture return future results in same order as futures

2019-03-26 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12021:
-

 Summary: Let ResultConjunctFuture return future results in same 
order as futures
 Key: FLINK-12021
 URL: https://issues.apache.org/jira/browse/FLINK-12021
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.7.2, 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


The {{ResultConjunctFuture}} should return future results in the same order as 
the futures were specified. The advantage would be that we maintain the same 
output order as the input order if the input has been specified with a special 
ordering.

This should also fix a current performance regression where we don't maintain 
the topological ordering when deploying tasks. Due to this, we need to make an 
additional result partition lookup which adds additional latency. The problem 
has occurred due to FLINK-10431 which changes the interleaving how the slot 
futures are completed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-12020) Add documentation for mesos-appmaster-job.sh

2019-03-26 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-12020:
-

 Summary: Add documentation for mesos-appmaster-job.sh
 Key: FLINK-12020
 URL: https://issues.apache.org/jira/browse/FLINK-12020
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Mesos, Documentation
Affects Versions: 1.7.2, 1.8.0
Reporter: Till Rohrmann


The Flink documentation is currently lacking information about the 
{{mesos-appmaster-job.sh}} and how to use it. It would be helpful for our users 
to add documentation and examples how to use it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11855) Race condition in EmbeddedLeaderService

2019-03-07 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11855:
-

 Summary: Race condition in EmbeddedLeaderService
 Key: FLINK-11855
 URL: https://issues.apache.org/jira/browse/FLINK-11855
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.7.2, 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.7.3, 1.8.0


There is a race condition in the {{EmbeddedLeaderService}} which can occur if 
the {{EmbeddedLeaderService}} is shut down before the {{GrantLeadershipCall}} 
has been executed. In this case, the {{contender}} is nulled which leads to a 
NPE.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11851) ClusterEntrypoint provides wrong executor to HaServices

2019-03-07 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11851:
-

 Summary: ClusterEntrypoint provides wrong executor to HaServices
 Key: FLINK-11851
 URL: https://issues.apache.org/jira/browse/FLINK-11851
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.7.2, 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.7.3, 1.8.0


The {{ClusterEntrypoint}} provides the executor of the common {{RpcService}} to 
the {{HighAvailabilityServices}} which uses the executor to run io operations. 
In I/O heavy cases, this can block all {{RpcService}} threads and make the 
{{RpcEndpoints}} running in the respective {{RpcService}} unresponsive.

I suggest to introduce a dedicated I/O executor which is used for io heavy 
operations.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11846) Duplicate job submission delete HA files

2019-03-06 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11846:
-

 Summary: Duplicate job submission delete HA files
 Key: FLINK-11846
 URL: https://issues.apache.org/jira/browse/FLINK-11846
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


Due to changes for FLINK-11383, the {{Dispatcher}} now delete HA files if the 
client submits twice a job. A duplicate job submission should, however, simply 
be rejected but not cause that HA files are being deleted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11843) Dispatcher fails to recover jobs if leader change happens during JobManagerRunner termination

2019-03-06 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11843:
-

 Summary: Dispatcher fails to recover jobs if leader change happens 
during JobManagerRunner termination
 Key: FLINK-11843
 URL: https://issues.apache.org/jira/browse/FLINK-11843
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.7.2, 1.8.0
Reporter: Till Rohrmann


The {{Dispatcher}} fails to recover jobs if a leader change happens during the 
{{JobManagerRunner}} termination of the previous run. The problem is that we 
schedule the start future of the recovered {{JobGraph}} using the 
{{MainThreadExecutor}} and additionally require that this future is completed 
before any other recovery operation from a subsequent leadership session is 
executed. If now the leadership changes, the {{MainThreadExecutor}} will be 
invalidated and the scheduled future will never be completed.

The relevant ML thread: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-7-1-job-stuck-in-suspended-state-td26439.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11826) Kafka09ITCase.testRateLimitedConsumer fails on Travis

2019-03-05 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11826:
-

 Summary: Kafka09ITCase.testRateLimitedConsumer fails on Travis
 Key: FLINK-11826
 URL: https://issues.apache.org/jira/browse/FLINK-11826
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Tests
Affects Versions: 1.8.0
Reporter: Till Rohrmann


The {{Kafka09ITCase.testRateLimitedConsumer}} fails on Travis with:
{code}
20:33:49.887 [ERROR] Errors: 
20:33:49.887 [ERROR]   Kafka09ITCase.testRateLimitedConsumer:204 » 
JobExecution Job execution failed.
{code}

https://api.travis-ci.org/v3/job/501660504/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11813) Standby per job mode Dispatchers don't know job's JobSchedulingStatus

2019-03-04 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11813:
-

 Summary: Standby per job mode Dispatchers don't know job's 
JobSchedulingStatus
 Key: FLINK-11813
 URL: https://issues.apache.org/jira/browse/FLINK-11813
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.7.2, 1.6.4, 1.8.0
Reporter: Till Rohrmann


At the moment, it can happen that standby {{Dispatchers}} in per job mode will 
restart a terminated job after they gained leadership. The problem is that we 
currently clear the {{RunningJobsRegistry}} once a job has reached a globally 
terminal state. After the leading {{Dispatcher}} terminates, a standby 
{{Dispatcher}} will gain leadership. Without having the information from the 
{{RunningJobsRegistry}} it cannot tell whether the job has been executed or 
whether the {{Dispatcher}} needs to re-execute the job. At the moment, the 
{{Dispatcher}} will assume that there was a fault and hence re-execute the job. 
This can lead to duplicate results.

I think we need some way to tell standby {{Dispatchers}} that a certain job has 
been successfully executed. One trivial solution could be to not clean up the 
{{RunningJobsRegistry}} but then we will clutter ZooKeeper.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11789) Checkpoint directories are not cleaned up after job termination

2019-03-01 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11789:
-

 Summary: Checkpoint directories are not cleaned up after job 
termination
 Key: FLINK-11789
 URL: https://issues.apache.org/jira/browse/FLINK-11789
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Checkpointing
Affects Versions: 1.9.0
Reporter: Till Rohrmann


Flink currently does not clean up all checkpoint directories when a job reaches 
a globally terminal state. Having configured the checkpoint directory 
{{checkpoints}}, I observe that after cancelling the job {{JOB_ID}} there are 
still

{code}
checkpoints/JOB_ID/shared
checkpoints/JOB_ID/taskowned
{code}

I think it would be good if would delete {{checkpoints/JOB_ID}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11719) Remove JobMaster#start(JobMasterId) and #suspend

2019-02-21 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11719:
-

 Summary: Remove JobMaster#start(JobMasterId) and #suspend
 Key: FLINK-11719
 URL: https://issues.apache.org/jira/browse/FLINK-11719
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann


Currently, the {{JobMaster}} contains a lot of mutable state which is necessary 
because it is used across different leadership sessions by the 
{{JobManagerRunner}}. For this purpose, we have the methods 
{{JobMaster#start(JobMasterId)}} and {{#suspend}}. The mutable state management 
makes things on the {{JobMaster}} side more complicated than they need to be. 
In order to improve the {{JobMaster's}} maintainability I suggest to remove 
this logic and instead terminate the {{JobMaster}} if the {{JobManagerRunner}} 
loses leadership. This entails that for every leadership we will create a new 
{{JobMaster}} instance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11718) Add onStart method to RpcEndpoint which is run in the actor's main thread

2019-02-21 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11718:
-

 Summary: Add onStart method to RpcEndpoint which is run in the 
actor's main thread
 Key: FLINK-11718
 URL: https://issues.apache.org/jira/browse/FLINK-11718
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann


I propose to introduce a {{RpcEndpoint#onStart}} method which is called when 
the {{RpcEndpoint}} is started via {{RpcEndpoint#start}}. At the moment, users 
will override {{#start}} where they need to remember to also call 
{{super.start()}} in order to actually start the {{RpcEndpoint}}. Moreover, the 
logic executed by {{start}} won't be run in the actor's main thread. This is 
problematic if the method triggers asynchronous behaviour which is executed in 
the actor's main thread. If that is the case, it can happen that the 
asynchronous operation is executed before the {{start}} method has been 
finished. 

Due to these problems, I suggest to introduce a {{RpcEndpoint#onStart}} method 
which can be overriden by sub classes similarly to the {{RpcEndpoint#onStop}} 
method. The {{onStart}} method can be used to setup the {{RpcEndpoint}} and is 
guaranteed to be executed before any other message is processed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11678) Remove suspend logic from ExecutionGraphCache

2019-02-20 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11678:
-

 Summary: Remove suspend logic from ExecutionGraphCache
 Key: FLINK-11678
 URL: https://issues.apache.org/jira/browse/FLINK-11678
 Project: Flink
  Issue Type: Sub-task
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


Since we no longer obtain the {{ExecutionGraph}} from the {{JobManager}} we can 
now remove the suspending logic in the {{ExecutionGraphCache}} which will 
simplify this component.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11677) Remove ResourceManagerRunner

2019-02-20 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11677:
-

 Summary: Remove ResourceManagerRunner
 Key: FLINK-11677
 URL: https://issues.apache.org/jira/browse/FLINK-11677
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: TisonKun
 Fix For: 1.8.0


The {{ResourceManagerRunner}} is no longer needed. Thus, I suggest to remove it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11663) Remove control flow break point from Execution#releaseAssignedResource

2019-02-19 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11663:
-

 Summary: Remove control flow break point from 
Execution#releaseAssignedResource
 Key: FLINK-11663
 URL: https://issues.apache.org/jira/browse/FLINK-11663
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


In {{Execution#releaseAssignedResource}} we release the assigned resource by 
calling {{LogicalSlot#releaseSlot}} and use 
{{FutureUtils.whenCompleteAsyncIfNotDone}} to merge the future back into the 
main thread in order to complete the {{Execution#releaseFuture}}. This is no 
longer necessary since the returned future is always completed from within the 
main thread (with the changes from FLINK-10431).

In fact this control flow break point makes it hard to properly suspend the 
{{ExecutionGraph}} atomically as required for FLINK-11537.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11650) Remove legacy FlinkResourceManager

2019-02-18 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11650:
-

 Summary: Remove legacy FlinkResourceManager
 Key: FLINK-11650
 URL: https://issues.apache.org/jira/browse/FLINK-11650
 Project: Flink
  Issue Type: Sub-task
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


Remove the legacy {{FlinkResourceManager}} component.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11649) Remove legacy JobInfo

2019-02-18 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11649:
-

 Summary: Remove legacy JobInfo
 Key: FLINK-11649
 URL: https://issues.apache.org/jira/browse/FLINK-11649
 Project: Flink
  Issue Type: Sub-task
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


Remove legacy {{JobInfo}} class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11648) Remove MemoryArchivist

2019-02-18 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11648:
-

 Summary: Remove MemoryArchivist
 Key: FLINK-11648
 URL: https://issues.apache.org/jira/browse/FLINK-11648
 Project: Flink
  Issue Type: Sub-task
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


Remove the {{MemoryArchivist}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11645) Remove legacy JobManager

2019-02-18 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11645:
-

 Summary: Remove legacy JobManager
 Key: FLINK-11645
 URL: https://issues.apache.org/jira/browse/FLINK-11645
 Project: Flink
  Issue Type: Sub-task
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


Remove the legacy {{JobManager}} component.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11631) TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination unstable on Travis

2019-02-15 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11631:
-

 Summary: 
TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination unstable on 
Travis
 Key: FLINK-11631
 URL: https://issues.apache.org/jira/browse/FLINK-11631
 Project: Flink
  Issue Type: Bug
  Components: Distributed Coordination, Tests
Affects Versions: 1.8.0
Reporter: Till Rohrmann


The {{TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination}} is 
unstable on Travis. It fails with 
{code}
16:12:04.644 [ERROR] 
testJobReExecutionAfterTaskExecutorTermination(org.apache.flink.runtime.taskexecutor.TaskExecutorITCase)
  Time elapsed: 1.257 s  <<< ERROR!
org.apache.flink.util.FlinkException: Could not close resource.
at 
org.apache.flink.runtime.taskexecutor.TaskExecutorITCase.teardown(TaskExecutorITCase.java:83)
Caused by: org.apache.flink.util.FlinkException: Error while shutting the 
TaskExecutor down.
Caused by: org.apache.flink.util.FlinkException: Could not properly shut down 
the TaskManager services.
Caused by: java.lang.IllegalStateException: NetworkBufferPool is not empty 
after destroying all LocalBufferPools
{code} 

https://api.travis-ci.org/v3/job/493221318/log.txt

The problem seems to be caused by the {{TaskExecutor}} not properly waiting for 
the termination of all running {{Tasks}}. Due to this, there is a race 
condition which causes that not all buffers are returned to the {{BufferPool}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11630) TaskExecutor does not wait for Task termination when terminating itself

2019-02-15 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11630:
-

 Summary: TaskExecutor does not wait for Task termination when 
terminating itself
 Key: FLINK-11630
 URL: https://issues.apache.org/jira/browse/FLINK-11630
 Project: Flink
  Issue Type: Bug
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann


The {{TaskExecutor}} does not properly wait for the termination of {{Tasks}} 
when terminating. In fact, it does not even trigger the cancellation of the 
running {{Tasks}}. I think for better lifecycle management it is important that 
the {{TaskExecutor}} triggers the termination of all running {{Tasks}} and then 
wait until all {{Tasks}} have terminated before it terminates itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11629) Stop AkkaRpcActor immediately after onStop has finished

2019-02-15 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11629:
-

 Summary: Stop AkkaRpcActor immediately after onStop has finished
 Key: FLINK-11629
 URL: https://issues.apache.org/jira/browse/FLINK-11629
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


Currently, the {{AkkaRpcActor}} sends a {{Kill}} message to itself after the 
{{onStop}} future has been completed. Since {{Kill}} messages enqueue into the 
mail box, the stop operation is not immediate. I suggest to rather use 
{{Context#stop(ActorRef)}} to immediately stop the {{AkkaRcpActor}} upon 
termination.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11594) Check & port TaskManagerRegistrationTest to new code base

2019-02-13 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11594:
-

 Summary: Check & port TaskManagerRegistrationTest to new code base
 Key: FLINK-11594
 URL: https://issues.apache.org/jira/browse/FLINK-11594
 Project: Flink
  Issue Type: Sub-task
Affects Versions: 1.8.0
Reporter: Till Rohrmann
 Fix For: 1.8.0


Check and port {{TaskManagerRegistrationTest}} to new code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11593) Check & port TaskManagerTest to new code base

2019-02-13 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11593:
-

 Summary: Check & port TaskManagerTest to new code base
 Key: FLINK-11593
 URL: https://issues.apache.org/jira/browse/FLINK-11593
 Project: Flink
  Issue Type: Sub-task
Affects Versions: 1.8.0
Reporter: Till Rohrmann
 Fix For: 1.8.0


Check and port {{TaskManagerTest}} to new code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11592) Port TaskManagerFailsWithSlotSharingITCase to new code base

2019-02-13 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11592:
-

 Summary: Port TaskManagerFailsWithSlotSharingITCase to new code 
base
 Key: FLINK-11592
 URL: https://issues.apache.org/jira/browse/FLINK-11592
 Project: Flink
  Issue Type: Sub-task
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


Port {{TaskManagerFailsWithSlotSharingITCase}} to new code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11587) Check and port CoLocationConstraintITCase to new code base

2019-02-12 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11587:
-

 Summary: Check and port CoLocationConstraintITCase to new code base
 Key: FLINK-11587
 URL: https://issues.apache.org/jira/browse/FLINK-11587
 Project: Flink
  Issue Type: Sub-task
Affects Versions: 1.8.0
Reporter: Till Rohrmann
 Fix For: 1.8.0


Check and port {{CoLocationConstraintITCase}} to new code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11586) Check and port SlotSharingITCase to new code base

2019-02-12 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11586:
-

 Summary: Check and port SlotSharingITCase to new code base
 Key: FLINK-11586
 URL: https://issues.apache.org/jira/browse/FLINK-11586
 Project: Flink
  Issue Type: Sub-task
Affects Versions: 1.8.0
Reporter: Till Rohrmann
 Fix For: 1.8.0


Check and port {{SlotSharingITCase}} to new code base.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11551) Allow RpcEndpoints to execute asynchronous stop operations

2019-02-07 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11551:
-

 Summary: Allow RpcEndpoints to execute asynchronous stop operations
 Key: FLINK-11551
 URL: https://issues.apache.org/jira/browse/FLINK-11551
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Coordination
Affects Versions: 1.7.1, 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


Some {{RpcEndpoints}} would benefit from being able to run asynchronous stop 
operations in the main thread executor. At the moment, this is not possible 
since {{RpcEndpoint#postStop}} is called by by {{UntypedActor#postStop}} which 
is the last call for an actor.

I suggest to introduce a new method {{RpcEndpoint#onStop}} which is executed 
when the {{RpcEndpoint}} is requested to terminate. Internally, this method 
will be called when the {{UntypedActor}} is still running, thus, allowing to 
execute arbitrary other messages (e.g. asynchronous operations in the main 
thread executor). Only after the future returned by {{onStop}} is completed, 
the underlying {{UntypedActor}} will shut down.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11537) ExecutionGraph does not reach terminal state when JobMaster lost leadership

2019-02-06 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11537:
-

 Summary: ExecutionGraph does not reach terminal state when 
JobMaster lost leadership
 Key: FLINK-11537
 URL: https://issues.apache.org/jira/browse/FLINK-11537
 Project: Flink
  Issue Type: Bug
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


The {{ExecutionGraph}} sometimes does not reach a terminal state if the 
{{JobMaster}} lost the leadership. The reason is that we use the fenced main 
thread executor to execute {{ExecutionGraph}} changes and we don't wait for the 
{{ExecutionGraph}} to reach the terminal state before we set the fencing token 
{{null}}.

One possible solution would be to wait for the {{ExecutionGraph}} to reach the 
terminal state before clearing the fencing token. This has, however, the 
downside that the {{JobMaster}} is still reachable until the {{ExecutionGraph}} 
has been properly terminated. Alternatively, we could use the unfenced main 
thread executor to send the cancel calls out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11524) LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed on Travis

2019-02-05 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11524:
-

 Summary: 
LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed on Travis
 Key: FLINK-11524
 URL: https://issues.apache.org/jira/browse/FLINK-11524
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.8.0
Reporter: Till Rohrmann


The {{LeaderChangeClusterComponentsTest.testReelectionOfJobMaster}} failed on 
Travis: https://api.travis-ci.org/v3/job/488578456/log.txt

It looks as if the {{JobMaster}} cannot reconnect to the {{ResourceManager}}. 
Maybe this is caused by a wrong interleaving of the revoke and grant leadership 
calls to the {{TestingEmbeddedHaServices}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11521) Make RetryingRegistration timeouts configurable

2019-02-02 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11521:
-

 Summary: Make RetryingRegistration timeouts configurable
 Key: FLINK-11521
 URL: https://issues.apache.org/jira/browse/FLINK-11521
 Project: Flink
  Issue Type: Bug
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


For more control and better testability I suggest to make the 
{{RetryingRegistration}} configurable.  Especially the hard coded timeouts make 
testing in some scenarios hard.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11472) KafkaITCase.testCancelingEmptyTopic failed on Travis

2019-01-30 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11472:
-

 Summary: KafkaITCase.testCancelingEmptyTopic failed on Travis
 Key: FLINK-11472
 URL: https://issues.apache.org/jira/browse/FLINK-11472
 Project: Flink
  Issue Type: Bug
  Components: Kafka Connector, Tests
Affects Versions: 1.8.0
Reporter: Till Rohrmann
 Fix For: 1.8.0


The {{KafkaITCase.testCancelingEmptyTopic}} failed on Travis with a 
{{TimeoutException}}:
{code}
05:15:35.521 [ERROR] Errors: 
05:15:35.521 [ERROR]   
KafkaITCase.testCancelingEmptyTopic:85->KafkaConsumerTestBase.runCancelingOnEmptyInputTest:1124->KafkaTestBase.deleteTestTopic:206->Object.wait:502->Object.wait:-2
 » TestTimedOut
{code}

https://api.travis-ci.org/v3/job/485729678/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11446) FlinkKafkaProducer011ITCase.testRecoverCommittedTransaction failed on Travis

2019-01-29 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11446:
-

 Summary: 
FlinkKafkaProducer011ITCase.testRecoverCommittedTransaction failed on Travis
 Key: FLINK-11446
 URL: https://issues.apache.org/jira/browse/FLINK-11446
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.8.0
Reporter: Till Rohrmann


The {{FlinkKafkaProducer011ITCase.testRecoverCommittedTransaction}} failed on 
Travis with producing no output for 10 minutes: 
https://api.travis-ci.org/v3/job/485771998/log.txt



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11415) Introduce JobMasterServiceFactor for JobManagerRunner

2019-01-23 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11415:
-

 Summary: Introduce JobMasterServiceFactor for JobManagerRunner
 Key: FLINK-11415
 URL: https://issues.apache.org/jira/browse/FLINK-11415
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


In order to better test the {{JobManagerRunner}}, I suggest to introduce a 
{{JobMasterServiceFactory}} which allows to control how the 
{{JobManagerRunner}} instantiates a {{JobMasterService}}. At the moment we 
create a {{JobMaster}} in the constructor of the {{JobManagerRunner}}. This 
unnecessarily couples these two components together and makes testing harder.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11414) Introduce JobMasterService interface for the JobManagerRunner

2019-01-23 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11414:
-

 Summary: Introduce JobMasterService interface for the 
JobManagerRunner
 Key: FLINK-11414
 URL: https://issues.apache.org/jira/browse/FLINK-11414
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Coordination
Affects Versions: 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


In order to better separate concerns I suggest to introduce a 
{{JobMasterService}} interface which only exposes the lifecycle methods the 
{{JobManagerRunner}} needs instead of directly using the {{JobMaster}}. That 
way, we can later instantiate the {{JobManagerRunner}} easier for testing 
purposes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11400) JobManagerRunner does not wait for suspension of JobMaster

2019-01-21 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11400:
-

 Summary: JobManagerRunner does not wait for suspension of JobMaster
 Key: FLINK-11400
 URL: https://issues.apache.org/jira/browse/FLINK-11400
 Project: Flink
  Issue Type: Bug
  Components: Distributed Coordination
Affects Versions: 1.7.1, 1.6.3, 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


The {{JobManagerRunner}} does not wait for the suspension of the {{JobMaster}} 
to finish before granting leadership again. This can lead to a state where the 
{{JobMaster}} tries to start the {{ExecutionGraph}} but the {{SlotPool}} is 
still stopped.

I suggest to linearize the leadership operations (granting and revoking 
leadership) similarly to the {{Dispatcher}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11383) Dispatcher does not clean up blobs of failed submissions

2019-01-17 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11383:
-

 Summary: Dispatcher does not clean up blobs of failed submissions
 Key: FLINK-11383
 URL: https://issues.apache.org/jira/browse/FLINK-11383
 Project: Flink
  Issue Type: Bug
  Components: Distributed Coordination
Affects Versions: 1.7.1, 1.6.3, 1.8.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.8.0


The {{Dispatcher}} does not clean up blobs originating from a failed 
submissions. Compared to the legacy code, this is a regression and we should 
remove relevant blobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11370) Check and port ZooKeeperLeaderElectionITCase to new code base if necessary

2019-01-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11370:
-

 Summary: Check and port ZooKeeperLeaderElectionITCase to new code 
base if necessary
 Key: FLINK-11370
 URL: https://issues.apache.org/jira/browse/FLINK-11370
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Reporter: Till Rohrmann


Check and port {{ZooKeeperLeaderElectionITCase}} to new code base if necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11368) Check and port TaskManagerStartupTest to new code base if necessary

2019-01-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11368:
-

 Summary: Check and port TaskManagerStartupTest to new code base if 
necessary
 Key: FLINK-11368
 URL: https://issues.apache.org/jira/browse/FLINK-11368
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Reporter: Till Rohrmann


Check and port {{TaskManagerStartupTest}} to new code base if necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11369) Check and port ZooKeeperHAJobManagerTest to new code base if necessary

2019-01-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11369:
-

 Summary: Check and port ZooKeeperHAJobManagerTest to new code base 
if necessary
 Key: FLINK-11369
 URL: https://issues.apache.org/jira/browse/FLINK-11369
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Reporter: Till Rohrmann


Check and port {{ZooKeeperHAJobManagerTest}} to new code base if necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11366) Check and port TaskManagerMetricsTest to new code base if necessary

2019-01-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11366:
-

 Summary: Check and port TaskManagerMetricsTest to new code base if 
necessary
 Key: FLINK-11366
 URL: https://issues.apache.org/jira/browse/FLINK-11366
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Reporter: Till Rohrmann


Check and port {{TaskManagerMetricsTest}} to new code base if necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11367) Check and port TaskManagerProcessReapingTestBase to new code base if necessary

2019-01-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11367:
-

 Summary: Check and port TaskManagerProcessReapingTestBase to new 
code base if necessary
 Key: FLINK-11367
 URL: https://issues.apache.org/jira/browse/FLINK-11367
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Reporter: Till Rohrmann


Check and port {{TaskManagerProcessReapingTestBase}} to new code base if 
necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11364) Check and port TaskManagerFailsITCase to new code base if necessary

2019-01-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11364:
-

 Summary: Check and port TaskManagerFailsITCase to new code base if 
necessary
 Key: FLINK-11364
 URL: https://issues.apache.org/jira/browse/FLINK-11364
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Reporter: Till Rohrmann


Check and port {{TaskManagerFailsITCase}} to new code base if necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-11365) Check and port TaskManagerFailureRecoveryITCase to new code base if necessary

2019-01-16 Thread Till Rohrmann (JIRA)
Till Rohrmann created FLINK-11365:
-

 Summary: Check and port TaskManagerFailureRecoveryITCase to new 
code base if necessary
 Key: FLINK-11365
 URL: https://issues.apache.org/jira/browse/FLINK-11365
 Project: Flink
  Issue Type: Sub-task
  Components: Tests
Reporter: Till Rohrmann


Check and port {{TaskManagerFailureRecoveryITCase}} to new code base if 
necessary.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


<    1   2   3   4   5   6   7   8   9   10   >