[jira] [Created] (FLINK-7361) flink-web doesn't build with ruby 2.4
Nico Kruber created FLINK-7361: -- Summary: flink-web doesn't build with ruby 2.4 Key: FLINK-7361 URL: https://issues.apache.org/jira/browse/FLINK-7361 Project: Flink Issue Type: Bug Components: Project Website Reporter: Nico Kruber Assignee: Nico Kruber The dependencies pulled in by the old jekyll version do not build with ruby 2.4 and fail with something like {code} yajl_ext.c:881:22: error: 'rb_cFixnum' undeclared (first use in this function); did you mean 'rb_isalnum'? {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7355) YARNSessionFIFOITCase#testfullAlloc does not run anymore
Nico Kruber created FLINK-7355: -- Summary: YARNSessionFIFOITCase#testfullAlloc does not run anymore Key: FLINK-7355 URL: https://issues.apache.org/jira/browse/FLINK-7355 Project: Flink Issue Type: Bug Components: Tests, YARN Affects Versions: 1.4.0, 1.3.2 Reporter: Nico Kruber Priority: Minor {{YARNSessionFIFOITCase#testfullAlloc}} is a test case that is ignored because of a too high resource consumption but if run manually, it fails with {code} Error while deploying YARN cluster: Couldn't deploy Yarn session cluster java.lang.RuntimeException: Couldn't deploy Yarn session cluster at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:367) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:663) at org.apache.flink.yarn.YarnTestBase$Runner.run(YarnTestBase.java:680) Caused by: java.lang.IllegalArgumentException: The configuration value 'containerized.heap-cutoff-min' is higher (600) than the requested amount of memory 256 at org.apache.flink.yarn.Utils.calculateHeapSize(Utils.java:101) at org.apache.flink.yarn.AbstractYarnClusterDescriptor.setupApplicationMasterContainer(AbstractYarnClusterDescriptor.java:1356) at org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster(AbstractYarnClusterDescriptor.java:840) at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:456) at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:362) ... 2 more {code} in current master and {code} Error while starting the YARN Client: The JobManager memory (256) is below the minimum required memory amount of 768 MB java.lang.IllegalArgumentException: The JobManager memory (256) is below the minimum required memory amount of 768 MB at org.apache.flink.yarn.AbstractYarnClusterDescriptor.setJobManagerMemory(AbstractYarnClusterDescriptor.java:187) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.createDescriptor(FlinkYarnSessionCli.java:314) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:622) at org.apache.flink.yarn.YarnTestBase$Runner.run(YarnTestBase.java:645) {code} in Flink 1.3.2 RC2 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7354) test instability in LocalFlinkMiniClusterITCase#testLocalFlinkMiniClusterWithMultipleTaskManagers
Nico Kruber created FLINK-7354: -- Summary: test instability in LocalFlinkMiniClusterITCase#testLocalFlinkMiniClusterWithMultipleTaskManagers Key: FLINK-7354 URL: https://issues.apache.org/jira/browse/FLINK-7354 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.2.1, 1.1.4, 1.4.0, 1.3.2 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Critical During {{mvn clean install}} on the 1.3.2 RC2, I found an inconsistently failing test at {{LocalFlinkMiniClusterITCase#testLocalFlinkMiniClusterWithMultipleTaskManagers}}: {code} testLocalFlinkMiniClusterWithMultipleTaskManagers(org.apache.flink.test.runtime.minicluster.LocalFlinkMiniClusterITCase) Time elapsed: 34.978 sec <<< FAILURE! java.lang.AssertionError: Thread Thread[initialSeedUniquifierGenerator,5,main] was started by the mini cluster, but not shut down at org.junit.Assert.fail(Assert.java:88) at org.apache.flink.test.runtime.minicluster.LocalFlinkMiniClusterITCase.testLocalFlinkMiniClusterWithMultipleTaskManagers(LocalFlinkMiniClusterITCase.java:168) {code} Searching the web for that error yields one previous thread on the dev-list, so this seems to be valid for quite old versions of flink, too, but apparently, was never solved: https://lists.apache.org/thread.html/07ce439bf6d358bd3139541b52ef6b8e8af249a27e09ae10b6698f81@%3Cdev.flink.apache.org%3E -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7352) ExecutionGraphRestartTest timeouts
Nico Kruber created FLINK-7352: -- Summary: ExecutionGraphRestartTest timeouts Key: FLINK-7352 URL: https://issues.apache.org/jira/browse/FLINK-7352 Project: Flink Issue Type: Bug Components: Distributed Coordination, Tests Affects Versions: 1.4.0, 1.3.2 Reporter: Nico Kruber Priority: Critical Recently, I received timeouts from some tests in {{ExecutionGraphRestartTest}} like this {code} Tests in error: ExecutionGraphRestartTest.testConcurrentLocalFailAndRestart:638 » Timeout {code} This particular instance is from 1.3.2 RC2 and stuck in {{ExecutionGraphTestUtils#waitUntilDeployedAndSwitchToRunning()}} but I also had instances stuck in {{ExecutionGraphTestUtils#waitUntilJobStatus}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7351) test instability in JobClientActorRecoveryITCase#testJobClientRecovery
Nico Kruber created FLINK-7351: -- Summary: test instability in JobClientActorRecoveryITCase#testJobClientRecovery Key: FLINK-7351 URL: https://issues.apache.org/jira/browse/FLINK-7351 Project: Flink Issue Type: Bug Components: Job-Submission, Tests Affects Versions: 1.3.2 Reporter: Nico Kruber Priority: Minor On a 16-core VM, the following test failed during {{mvn clean verify}} {code} Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 22.814 sec <<< FAILURE! - in org.apache.flink.runtime.client.JobClientActorRecoveryITCase testJobClientRecovery(org.apache.flink.runtime.client.JobClientActorRecoveryITCase) Time elapsed: 21.299 sec <<< ERROR! org.apache.flink.runtime.client.JobExecutionException: Job execution failed. at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:933) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:876) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:876) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0 at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:334) at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.allocateSlot(Scheduler.java:139) at org.apache.flink.runtime.executiongraph.Execution.allocateSlotForExecution(Execution.java:368) at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:309) at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:596) at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:450) at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleLazy(ExecutionGraph.java:834) at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:814) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:1425) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1372) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:1372) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7346) EventTimeWindowCheckpointingITCase.testPreAggregatedSlidingTimeWindow killed by maven watchdog on Travis
Nico Kruber created FLINK-7346: -- Summary: EventTimeWindowCheckpointingITCase.testPreAggregatedSlidingTimeWindow killed by maven watchdog on Travis Key: FLINK-7346 URL: https://issues.apache.org/jira/browse/FLINK-7346 Project: Flink Issue Type: Bug Components: DataStream API, Tests Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber Several test runs fail with the watchdog killing the tests after receiving no output for 300s showing {{EventTimeWindowCheckpointingITCase.testPreAggregatedSlidingTimeWindow}} as one of the tests, sometimes with another test running in parallel. It does not seem to be hanging though and the only reason for this behaviour may be that the tests, especially with RocksDB, take very long with no output by each of the test methods in {{AbstractEventTimeWindowCheckpointingITCase}}. Thus adding output per method should fix the spurious test failures. Some failing instances: https://travis-ci.org/apache/flink/jobs/259460738 https://travis-ci.org/apache/flink/jobs/259748656 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7316) always use off-heap network buffers
Nico Kruber created FLINK-7316: -- Summary: always use off-heap network buffers Key: FLINK-7316 URL: https://issues.apache.org/jira/browse/FLINK-7316 Project: Flink Issue Type: Sub-task Components: Core, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber In order to send flink buffers through netty into the network, we need to make the buffers use off-heap memory. Otherwise, there will be a hidden copy happening in the NIO stack. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7315) use flink's buffers in netty
Nico Kruber created FLINK-7315: -- Summary: use flink's buffers in netty Key: FLINK-7315 URL: https://issues.apache.org/jira/browse/FLINK-7315 Project: Flink Issue Type: Improvement Components: Core, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber The goal of this change is to avoid the step in the channel encoder and decoder pipelines where flink buffers are copied into netty buffers. Instead, netty should directly send flink buffers to the network. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7312) activate checkstyle for flink/core/memory/*
Nico Kruber created FLINK-7312: -- Summary: activate checkstyle for flink/core/memory/* Key: FLINK-7312 URL: https://issues.apache.org/jira/browse/FLINK-7312 Project: Flink Issue Type: Improvement Components: Checkstyle, Core Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Minor -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7311) refrain from using fail(Exception#getMessage()) in core memory tests
Nico Kruber created FLINK-7311: -- Summary: refrain from using fail(Exception#getMessage()) in core memory tests Key: FLINK-7311 URL: https://issues.apache.org/jira/browse/FLINK-7311 Project: Flink Issue Type: Improvement Components: Core, Tests Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Minor Most unit tests in the {{flink/core/memory/}} domain still relies on a code pattern like this: {code} try { // ... } catch (Exception e) { e.printStackTrace(); fail(e.getMessage()); } {code} This does hides the exception details and should be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7310) always use HybridMemorySegment
Nico Kruber created FLINK-7310: -- Summary: always use HybridMemorySegment Key: FLINK-7310 URL: https://issues.apache.org/jira/browse/FLINK-7310 Project: Flink Issue Type: Improvement Components: Core Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber For future changes to the network buffers (sending our own off-heap buffers through to netty), we cannot use {{HeapMemorySegment}} anymore and need to rely on {{HybridMemorySegment}} instead. We should thus drop any code that loads the {{HeapMemorySegment}} (it is still available if needed) in favour of the {{HybridMemorySegment}} which is able to work on both heap and off-heap memory. FYI: For the performance penalty of this change compared to using {{HeapMemorySegment}} alone, see this interesting blob article (from 2015): https://flink.apache.org/news/2015/09/16/off-heap-memory.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7287) test instability in Kafka010ITCase.testCommitOffsetsToKafka
Nico Kruber created FLINK-7287: -- Summary: test instability in Kafka010ITCase.testCommitOffsetsToKafka Key: FLINK-7287 URL: https://issues.apache.org/jira/browse/FLINK-7287 Project: Flink Issue Type: Improvement Components: Kafka Connector, Tests Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber sporadically, {{Kafka010ITCase.testCommitOffsetsToKafka}} seems to be failing, e.g. {code} Test testCommitOffsetsToKafka(org.apache.flink.streaming.connectors.kafka.Kafka010ITCase) is running. 12:29:31,597 INFO org.apache.flink.streaming.connectors.kafka.KafkaTestBase - === == Writing sequence of 50 into testCommitOffsetsToKafkaTopic with p=3 === 12:29:31,597 INFO org.apache.flink.streaming.connectors.kafka.KafkaTestBase - Writing attempt #1 12:29:31,598 INFO org.apache.flink.streaming.connectors.kafka.KafkaTestEnvironmentImpl - Creating topic testCommitOffsetsToKafkaTopic-1 12:29:31,598 INFO org.I0Itec.zkclient.ZkEventThread - Starting ZkClient event thread. 12:29:31,599 INFO org.I0Itec.zkclient.ZkClient - Waiting for keeper state SyncConnected 12:29:31,601 INFO org.I0Itec.zkclient.ZkClient - zookeeper state changed (SyncConnected) 12:29:31,615 INFO org.I0Itec.zkclient.ZkEventThread - Terminate ZkClient event thread. 12:29:31,719 INFO org.I0Itec.zkclient.ZkEventThread - Starting ZkClient event thread. 12:29:31,722 INFO org.I0Itec.zkclient.ZkClient - Waiting for keeper state SyncConnected 12:29:31,728 INFO org.I0Itec.zkclient.ZkClient - zookeeper state changed (SyncConnected) 12:29:31,729 INFO org.I0Itec.zkclient.ZkEventThread - Terminate ZkClient event thread. 12:29:31,832 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase - Starting FlinkKafkaProducer (3/3) to produce into default topic testCommitOffsetsToKafkaTopic-1 12:29:31,840 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase - Starting FlinkKafkaProducer (2/3) to produce into default topic testCommitOffsetsToKafkaTopic-1 12:29:31,842 WARN org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase - Flushing on checkpoint is enabled, but checkpointing is not enabled. Disabling flushing. 12:29:31,844 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase - Starting FlinkKafkaProducer (1/3) to produce into default topic testCommitOffsetsToKafkaTopic-1 12:29:31,844 WARN org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase - Flushing on checkpoint is enabled, but checkpointing is not enabled. Disabling flushing. 12:29:31,846 WARN org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducerBase - Flushing on checkpoint is enabled, but checkpointing is not enabled. Disabling flushing. 12:29:31,998 INFO org.apache.flink.streaming.connectors.kafka.KafkaTestBase - Finished writing sequence 12:29:31,998 INFO org.apache.flink.streaming.connectors.kafka.KafkaTestBase - Validating sequence 12:29:32,123 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase - No restore state for FlinkKafkaConsumer. 12:29:32,129 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase - No restore state for FlinkKafkaConsumer. 12:29:32,136 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase - No restore state for FlinkKafkaConsumer. 12:29:32,139 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase - Consumer subtask 0 will start reading the following 1 partitions from the committed group offsets in Kafka: [KafkaTopicPartition{topic='testCommitOffsetsToKafkaTopic-1', partition=1}] 12:29:32,154 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase - Consumer subtask 2 will start reading the following 1 partitions from the committed group offsets in Kafka: [KafkaTopicPartition{topic='testCommitOffsetsToKafkaTopic-1', partition=0}] 12:29:32,236 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase - Consumer subtask 1 will start reading the following 1 partitions from the committed group offsets in Kafka: [KafkaTopicPartition{topic='testCommitOffsetsToKafkaTopic-1', partition=2}] 12:29:32,496 INFO org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase - No restore state for FlinkKafkaConsumer. 12:29:32,507 INFO
[jira] [Created] (FLINK-7285) allow submission of big job graphs
Nico Kruber created FLINK-7285: -- Summary: allow submission of big job graphs Key: FLINK-7285 URL: https://issues.apache.org/jira/browse/FLINK-7285 Project: Flink Issue Type: Improvement Components: Distributed Coordination, Job-Submission Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber In order to support submission of big job graphs, i.e. from user programs with a lot of code that does not fit into {{akka.framesize}}, we could try to store the {{JobGraph}} in the {{BlobServer}}. Note that in the communication between the job and task managers, we also need the offloading of the {{TaskDeploymentDescriptor}} (FLINK-6046) to complete this feature. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7283) PythonPlanBinderTest issues with python paths
Nico Kruber created FLINK-7283: -- Summary: PythonPlanBinderTest issues with python paths Key: FLINK-7283 URL: https://issues.apache.org/jira/browse/FLINK-7283 Project: Flink Issue Type: Bug Components: Python API, Tests Affects Versions: 1.3.1, 1.3.0, 1.4.0, 1.3.2 Reporter: Nico Kruber Assignee: Nico Kruber There are some issues with {{PythonPlanBinderTest}} and the Python2/3 tests: - the path is not set correctly (only inside {{config}}, not the {{configuration}} that is passed on to the {{PythonPlanBinder}} - linux distributions have become quite inventive regarding python binary names: some offer {{python}} as Python 2, some as Python 3. Similarly, {{python3}} and/or {{python2}} may not be available. If we really want to test both, we need to take this into account. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7262) remove unused FallbackLibraryCacheManager
Nico Kruber created FLINK-7262: -- Summary: remove unused FallbackLibraryCacheManager Key: FLINK-7262 URL: https://issues.apache.org/jira/browse/FLINK-7262 Project: Flink Issue Type: Sub-task Components: Distributed Coordination, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber {{FallbackLibraryCacheManager}} is basically only used in unit tests nowadays and should probably be removed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7261) avoid unnecessary exceptions in the logs in non-HA cases
Nico Kruber created FLINK-7261: -- Summary: avoid unnecessary exceptions in the logs in non-HA cases Key: FLINK-7261 URL: https://issues.apache.org/jira/browse/FLINK-7261 Project: Flink Issue Type: Sub-task Components: Distributed Coordination, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber {{PermanentBlobCache#getHAFileInternal}} first tries to download files from the HA store but if it does not exist, it will log an exception from the attempt to move the {{incomingFile}} to its destination which is misleading to the user. We should extend {{BlobView#get}} to return whether a file was actually copied or not, e.g. in the {{VoidBlobStore}} to keep the abstraction of the BLOB stores but to not report errors in expected cases (recall that {{FileSystemBlobStore#get}} will already throw an exception if anything failed in there and if successful but the succeeding move fails the exception from the move should still be prevailed). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7212) JobManagerLeaderSessionIDITSuite not executed
Nico Kruber created FLINK-7212: -- Summary: JobManagerLeaderSessionIDITSuite not executed Key: FLINK-7212 URL: https://issues.apache.org/jira/browse/FLINK-7212 Project: Flink Issue Type: Bug Components: JobManager, Tests Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber {{JobManagerLeaderSessionIDITSuite}} is currently not executed due to its naming scheme. Only {{*ITCase}} and {{*Test}} classes are run, except for inside {{flink-ml}} which adds more patters to the {{scalatest}} plugin. Also, {{JobManagerLeaderSessionIDITSuite}} needs to be adapted slightly so that it runs successfully. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7196) add a TTL to transient BLOB files
Nico Kruber created FLINK-7196: -- Summary: add a TTL to transient BLOB files Key: FLINK-7196 URL: https://issues.apache.org/jira/browse/FLINK-7196 Project: Flink Issue Type: Sub-task Components: Distributed Coordination, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber Transient BLOB files are not automatically cleaned up unless the {{BlobCache}}/{{BlobServer}} are shut down or the files are deleted via the {{delete}} methods. Additionally, they should have a default time-to-live (TTL) so that they may be cleaned up in failure cases. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7166) generated avro sources not cleaned up or re-created after changes
Nico Kruber created FLINK-7166: -- Summary: generated avro sources not cleaned up or re-created after changes Key: FLINK-7166 URL: https://issues.apache.org/jira/browse/FLINK-7166 Project: Flink Issue Type: Bug Components: Build System Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber Since the AVRO upgrade to 1.8.2, I could compile the flink-avro module any more with a failure like this in {{mvn clean install -DskipTests -pl flink-connectors/flink-avro}}: {code} Compilation failure [ERROR] flink-connectors/flink-avro/src/test/java/org/apache/flink/api/io/avro/generated/Fixed16.java:[10,8] org.apache.flink.api.io.avro.generated.Fixed16 is not abstract and does not override abstract method readExternal(java.io.ObjectInput) in org.apache.avro.specific.SpecificFixed {code} This was caused by maven both not cleaning up the generated sources and also not overwriting them with new ones itself. Only a manual {{rm -rf flink-connectors/flink-avro/src/test/java/org/apache/flink/api/io/avro/generated}} solved the issue. The cause for this, though, is that the avro files are generated under the {{src}} directory, not {{target/generated-test-sources}} as they should be. Either the generated sources should be cleaned up as well, or the generated files should be moved to this directory which is a more invasive change due to some hacks with respect to these files. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7141) enable travis cache again
Nico Kruber created FLINK-7141: -- Summary: enable travis cache again Key: FLINK-7141 URL: https://issues.apache.org/jira/browse/FLINK-7141 Project: Flink Issue Type: Improvement Components: Build System, Travis Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber In the past, we had some troubles with the travis cache but in general it may be a good idea to include it again to speed up build times by reducing the time the maven downloads take. This time, we should also deal with corrupt files in the maven repository and [tune travis|https://docs.travis-ci.com/user/caching/#Caches-and-build-matrices] so that it does not create corrupt caches in the first place. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7140) include a UUID/random name into the BlobKey
Nico Kruber created FLINK-7140: -- Summary: include a UUID/random name into the BlobKey Key: FLINK-7140 URL: https://issues.apache.org/jira/browse/FLINK-7140 Project: Flink Issue Type: Sub-task Components: Distributed Coordination, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber Currently, at the {{BlobServer}}, the id of a BLOB is only based on its contents. This may cause issues during cleanup of transient BLOBs with the same content (we don't do ref-counting on a per-blob basis anymore) or at the rare occasion of a hash collision. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7115) test instability in MiniClusterITCase.runJobWithMultipleJobManagers
Nico Kruber created FLINK-7115: -- Summary: test instability in MiniClusterITCase.runJobWithMultipleJobManagers Key: FLINK-7115 URL: https://issues.apache.org/jira/browse/FLINK-7115 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.4.0 Reporter: Nico Kruber In a test run with unrelated changes, I to have one case of {{MiniClusterITCase}} hanging: https://s3.amazonaws.com/archive.travis-ci.org/jobs/250775808/log.txt?X-Amz-Expires=30=20170706T151556Z=AWS4-HMAC-SHA256=AKIAJRYRXRSVGNKPKO5A/20170706/us-east-1/s3/aws4_request=host=5b7c512137c7cbd82dcb77a98aeadc3d761b7055bea6d8f07ad6b786dc196f37 {code} == Maven produced no output for 300 seconds. == == The following Java processes are running (JPS) == 12852 Jps 9166 surefirebooter1705381973603203163.jar 4966 Launcher == Printing stack trace of Java process 12865 == 12865: No such process == Printing stack trace of Java process 9166 == 2017-07-06 15:05:52 Full thread dump OpenJDK 64-Bit Server VM (24.131-b00 mixed mode): "Attach Listener" daemon prio=10 tid=0x7fc520b1a000 nid=0x3266 waiting on condition [0x] java.lang.Thread.State: RUNNABLE "flink-akka.actor.default-dispatcher-9" daemon prio=10 tid=0x7fc520b0e800 nid=0x23fd waiting on condition [0x7fc51abcb000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0007a0ca2c78> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool) at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) "flink-akka.actor.default-dispatcher-8" daemon prio=10 tid=0x7fc520bb9800 nid=0x23fc waiting on condition [0x7fc51aaca000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0007a0ca2c78> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool) at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) "Flink-MetricRegistry-1" prio=10 tid=0x7fc520ba7800 nid=0x23f9 waiting on condition [0x7fc51a4c4000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0007a09699c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1092) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) "flink-akka.actor.default-dispatcher-7" daemon prio=10 tid=0x7fc520b9d800 nid=0x23f7 waiting on condition [0x7fc51a6c6000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0007a0ca2c78> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool) at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) "flink-akka.actor.default-dispatcher-6" daemon prio=10 tid=0x7fc520aab000 nid=0x23ef waiting for monitor entry [0x7fc51b0d7000] java.lang.Thread.State: BLOCKED (on object monitor)
[jira] [Created] (FLINK-7102) improve ClassLoaderITCase
Nico Kruber created FLINK-7102: -- Summary: improve ClassLoaderITCase Key: FLINK-7102 URL: https://issues.apache.org/jira/browse/FLINK-7102 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber {{ClassLoaderITCase}}... * unnecessarily runs multiple tests in a single test case * {{#testDisposeSavepointWithCustomKvState()}} does not cancel its job (thus the order of execution of test cases defines the outcome) * uses {{e.getCause().getCause()}} which may cause {{NullPointerException}}s hiding the original error -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7068) change BlobService sub-classes for permanent and transient BLOBs
Nico Kruber created FLINK-7068: -- Summary: change BlobService sub-classes for permanent and transient BLOBs Key: FLINK-7068 URL: https://issues.apache.org/jira/browse/FLINK-7068 Project: Flink Issue Type: Sub-task Components: Distributed Coordination, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber A {{PermanentBlobStore}} should resemble use cases for BLOBs that are permanently stored for a job's life time (HA and non-HA). A {{TransientBlobStore}} should reflect BLOB offloading for logs, RPC, etc. which even does not have to be reflected by files. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7064) test instability in WordCountMapreduceITCase
Nico Kruber created FLINK-7064: -- Summary: test instability in WordCountMapreduceITCase Key: FLINK-7064 URL: https://issues.apache.org/jira/browse/FLINK-7064 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.4.0 Reporter: Nico Kruber Although already mentioned in FLINK-7004, this does not seem fixed yet and apparently now became an instable test: {code} Running org.apache.flink.test.hadoop.mapred.WordCountMapredITCase Inflater has been closed java.lang.NullPointerException: Inflater has been closed at java.util.zip.Inflater.ensureOpen(Inflater.java:389) at java.util.zip.Inflater.inflate(Inflater.java:257) at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:152) at java.io.FilterInputStream.read(FilterInputStream.java:133) at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283) at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325) at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177) at java.io.InputStreamReader.read(InputStreamReader.java:184) at java.io.BufferedReader.fill(BufferedReader.java:154) at java.io.BufferedReader.readLine(BufferedReader.java:317) at java.io.BufferedReader.readLine(BufferedReader.java:382) at javax.xml.parsers.FactoryFinder.findJarServiceProvider(FactoryFinder.java:319) at javax.xml.parsers.FactoryFinder.find(FactoryFinder.java:255) at javax.xml.parsers.DocumentBuilderFactory.newInstance(DocumentBuilderFactory.java:121) at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2467) at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2444) at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2361) at org.apache.hadoop.conf.Configuration.get(Configuration.java:968) at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:2010) at org.apache.hadoop.mapred.JobConf.(JobConf.java:423) at org.apache.flink.hadoopcompatibility.HadoopInputs.readHadoopFile(HadoopInputs.java:63) at org.apache.flink.test.hadoop.mapred.WordCountMapredITCase.internalRun(WordCountMapredITCase.java:79) at org.apache.flink.test.hadoop.mapred.WordCountMapredITCase.testProgram(WordCountMapredITCase.java:67) at org.apache.flink.test.util.JavaProgramTestBase.testJobWithObjectReuse(JavaProgramTestBase.java:127) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155) at
[jira] [Created] (FLINK-7063) test instability in OperatorStateBackendTest.testSnapshotAsyncCancel
Nico Kruber created FLINK-7063: -- Summary: test instability in OperatorStateBackendTest.testSnapshotAsyncCancel Key: FLINK-7063 URL: https://issues.apache.org/jira/browse/FLINK-7063 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing, Tests Affects Versions: 1.3.1, 1.4.0 Reporter: Nico Kruber Assignee: Stefan Richter Priority: Minor {{OperatorStateBackendTest.testSnapshotAsyncCancel}} seems to be instable and sometimes fails: {code} testSnapshotAsyncCancel(org.apache.flink.runtime.state.OperatorStateBackendTest) Time elapsed: 0.036 sec <<< ERROR! java.util.concurrent.ExecutionException: java.io.IOException: Stream closed. at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:206) at org.apache.flink.runtime.state.OperatorStateBackendTest.testSnapshotAsyncCancel(OperatorStateBackendTest.java:636) Caused by: java.io.IOException: Stream closed. at org.apache.flink.runtime.util.BlockerCheckpointStreamFactory$1.write(BlockerCheckpointStreamFactory.java:95) at java.io.DataOutputStream.writeInt(DataOutputStream.java:197) at org.apache.flink.core.io.VersionedIOReadableWritable.write(VersionedIOReadableWritable.java:40) at org.apache.flink.runtime.state.OperatorBackendSerializationProxy.write(OperatorBackendSerializationProxy.java:65) at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:255) at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:233) at org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:748) {code} logs: * https://s3.amazonaws.com/archive.travis-ci.org/jobs/248822546/log.txt?X-Amz-Expires=30=20170703T092940Z=AWS4-HMAC-SHA256=AKIAJRYRXRSVGNKPKO5A/20170703/us-east-1/s3/aws4_request=host=f468cd238236d7038a1e12086dd4a0e3ba538d93c883790d180e4c63b973a5f2 * https://transfer.sh/MHawk/17392.5.tar.gz -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7057) move BLOB ref-counting from LibraryCacheManager to BlobCache
Nico Kruber created FLINK-7057: -- Summary: move BLOB ref-counting from LibraryCacheManager to BlobCache Key: FLINK-7057 URL: https://issues.apache.org/jira/browse/FLINK-7057 Project: Flink Issue Type: Sub-task Components: Distributed Coordination, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber Currently, the {{LibraryCacheManager}} is doing some ref-counting for JAR files managed by it. Instead, we want the {{BlobCache}} to do that itself for all job-related BLOBs. Also, we do not want to operate on a per-{{BlobKey}} level but rather per job. Therefore, the cleanup process should be adapted, too. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7056) add API to allow job-related BLOBs to be stored
Nico Kruber created FLINK-7056: -- Summary: add API to allow job-related BLOBs to be stored Key: FLINK-7056 URL: https://issues.apache.org/jira/browse/FLINK-7056 Project: Flink Issue Type: Sub-task Components: Distributed Coordination, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber To ease cleanup, we will make job-related BLOBs be reflected in the blob storage so that they may be removed along with the job. This adds the jobId to many methods similar to the previous code from the {{NAME_ADDRESSABLE}} mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7055) refactor BlobService#getURL() methods to return a File object
Nico Kruber created FLINK-7055: -- Summary: refactor BlobService#getURL() methods to return a File object Key: FLINK-7055 URL: https://issues.apache.org/jira/browse/FLINK-7055 Project: Flink Issue Type: Sub-task Components: Distributed Coordination, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber As a relic from its use by the {{UrlClassLoader}}, {{BlobService#getURL()}} methods always returned {{URL}} objects although they were always pointing to locally cached files. As a step towards a better architecture and API, these should return a File object instead. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7054) remove LibraryCacheManager#getFile()
Nico Kruber created FLINK-7054: -- Summary: remove LibraryCacheManager#getFile() Key: FLINK-7054 URL: https://issues.apache.org/jira/browse/FLINK-7054 Project: Flink Issue Type: Sub-task Components: Distributed Coordination, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber {{LibraryCacheManager#getFile()}} was only used in tests where it is avoidable but if used anywhere else, it may have caused cleanup issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7053) improve code quality in some tests
Nico Kruber created FLINK-7053: -- Summary: improve code quality in some tests Key: FLINK-7053 URL: https://issues.apache.org/jira/browse/FLINK-7053 Project: Flink Issue Type: Sub-task Components: Distributed Coordination, Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber * {{BlobClientTest}} and {{BlobClientSslTest}} share a lot of common code * the received buffers there are currently not verified for being equal to the expected one * {{TemporarFolder}} should be used throughout blob store tests -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7052) remove NAME_ADDRESSABLE mode
Nico Kruber created FLINK-7052: -- Summary: remove NAME_ADDRESSABLE mode Key: FLINK-7052 URL: https://issues.apache.org/jira/browse/FLINK-7052 Project: Flink Issue Type: Sub-task Components: Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber Remove the BLOB store's {{NAME_ADDRESSABLE}} mode as it is currently not used and partly broken. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-7012) remove user-JAR upload when disposing a savepoint the old way
Nico Kruber created FLINK-7012: -- Summary: remove user-JAR upload when disposing a savepoint the old way Key: FLINK-7012 URL: https://issues.apache.org/jira/browse/FLINK-7012 Project: Flink Issue Type: Bug Components: Client, State Backends, Checkpointing Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Minor Inside {{CliFrontend#disposeSavepoint()}}, user JAR files are being uploaded to the {{BlobServer}} but they are actually not used (also also not cleaned up) in the job manager's handling of the {{DisposeSavepoint}} message. Since removing new savepoints is as simple as deleting files and old savepoints have always worked without these user JARs, we should remove the upload to be able to make the JAR file upload jobId-dependent for FLINK-6916. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-6916) FLIP-19: Improved BLOB storage architecture
Nico Kruber created FLINK-6916: -- Summary: FLIP-19: Improved BLOB storage architecture Key: FLINK-6916 URL: https://issues.apache.org/jira/browse/FLINK-6916 Project: Flink Issue Type: Improvement Components: Network Affects Versions: 1.4.0 Reporter: Nico Kruber Assignee: Nico Kruber The current architecture around the BLOB server and cache components seems rather patched up and has some issues regarding concurrency ([FLINK-6380]), cleanup, API inconsistencies / currently unused API ([FLINK-6329], [FLINK-6008]). These make future integration with FLIP-6 or extensions like offloading oversized RPC messages ([FLINK-6046]) difficult. We therefore propose an improvement on the current architecture as described below which tackles these issues, provides some cleanup, and enables further BLOB server use cases. Please refer to https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture for a full overview on the proposed changes. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-6791) Using MemoryStateBackend as checkpoint stream back-end may block checkpoint/savepoint creation
Nico Kruber created FLINK-6791: -- Summary: Using MemoryStateBackend as checkpoint stream back-end may block checkpoint/savepoint creation Key: FLINK-6791 URL: https://issues.apache.org/jira/browse/FLINK-6791 Project: Flink Issue Type: Bug Components: State Backends, Checkpointing Affects Versions: 1.2.1, 1.3.0 Reporter: Nico Kruber If the `MemoryStateBackend` is used as the checkpoint stream back-end in e.g. RocksDBStateBackend, it will block further checkpoint/savepoint creation if the checkpoint data reaches the back-end's max state size. In that case, an error message is logged at the task manager but the save-/checkpoint never completes and although the job continues, no further checkpoints will be made. Please see the following example that should be reproducible: {code:java} env.setStateBackend(new RocksDBStateBackend(new MemoryStateBackend(1000 * 1024 * 1024, false), false)); env.enableCheckpointing(100L); final long numKeys = 100_000L; DataStreamSourcesource1 = env.addSource(new RichParallelSourceFunction () { private volatile boolean running = true; @Override public void run(SourceContext ctx) throws Exception { long counter = 0; while (running) { synchronized (ctx.getCheckpointLock()) { ctx.collect(Tuple1.of(counter % numKeys)); counter++; } Thread.yield(); } } @Override public void cancel() { running = false; } }); source1.keyBy(0) .map(new RichMapFunction () { private transient ValueState val; @Override public Tuple1 map(Tuple1 value) throws Exception { val.update(Collections.nCopies(100, value.f0)); return value; } @Override public void open(final Configuration parameters) throws Exception { ValueStateDescriptor
descriptor = new ValueStateDescriptor<>( "data", // the state name TypeInformation.of(new TypeHint
() { }) // type information ); val = getRuntimeContext().getState(descriptor); } }).uid("identity-map-with-state") .addSink(new DiscardingSink
()); env.execute("failingsnapshots"); {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6784) Add some notes about externalized checkpoints and the difference to savepoints
Nico Kruber created FLINK-6784: -- Summary: Add some notes about externalized checkpoints and the difference to savepoints Key: FLINK-6784 URL: https://issues.apache.org/jira/browse/FLINK-6784 Project: Flink Issue Type: Sub-task Components: Documentation, State Backends, Checkpointing Affects Versions: 1.3.0 Reporter: Nico Kruber Assignee: Nico Kruber while externalized checkpoints are described somehow, there does not seem to be any paragraph explaining the difference to savepoints, also there are two checkpointing docs which could at least be linked somehow -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6782) Update savepoint documentation
Nico Kruber created FLINK-6782: -- Summary: Update savepoint documentation Key: FLINK-6782 URL: https://issues.apache.org/jira/browse/FLINK-6782 Project: Flink Issue Type: Sub-task Components: Documentation, State Backends, Checkpointing Affects Versions: 1.3.0 Reporter: Nico Kruber Assignee: Nico Kruber Savepoint documentation is a bit outdated regarding full data being stored in the savepoint path, not just a metadata file -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6774) build-helper-maven-plugin version not set
Nico Kruber created FLINK-6774: -- Summary: build-helper-maven-plugin version not set Key: FLINK-6774 URL: https://issues.apache.org/jira/browse/FLINK-6774 Project: Flink Issue Type: Improvement Components: Build System Affects Versions: 1.3.0 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Minor some modules forgot to specify the version of their {{build-helper-maven-plugin}} which causes the following warning in a maven build: {code} [WARNING] 'build.plugins.plugin.version' for org.codehaus.mojo:build-helper-maven-plugin is missing. @ org.apache.flink:flink-connector-kafka-base_${scala.binary.version}:[unknown-version], /home/nico/Projects/flink/flink-connectors/flink-connector-kafka-base/pom.xml, line 216, column 12 [WARNING] It is highly recommended to fix these problems because they threaten the stability of your build. {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6706) remove ChaosMonkeyITCase
Nico Kruber created FLINK-6706: -- Summary: remove ChaosMonkeyITCase Key: FLINK-6706 URL: https://issues.apache.org/jira/browse/FLINK-6706 Project: Flink Issue Type: Improvement Components: Tests Affects Versions: 1.3.0 Reporter: Nico Kruber {{ChaosMonkeyITCase}} has been disabled since Dec 2015 and is probably outdated. In its current form, it doesn't make sense to keep it and would need a replacement if desired. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6702) SIGABRT after CEPOperatorTest#testCEPOperatorSerializationWRocksDB() during GC
Nico Kruber created FLINK-6702: -- Summary: SIGABRT after CEPOperatorTest#testCEPOperatorSerializationWRocksDB() during GC Key: FLINK-6702 URL: https://issues.apache.org/jira/browse/FLINK-6702 Project: Flink Issue Type: Bug Components: CEP, Tests Affects Versions: 1.3.0 Reporter: Nico Kruber During the CEP unit tests, when garbage collection kicks in and tries to finalize RocksDB, it may fail with {code} pure virtual method called terminate called without an active exception Process finished with exit code 134 (interrupted by signal 6: SIGABRT) {code} Reason is a missing {{harness.close()}} call in {{CEPOperatorTest#testCEPOperatorSerializationWRocksDB()}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6689) Remote StreamExecutionEnvironment fails to submit jobs against LocalFlinkMiniCluster
Nico Kruber created FLINK-6689: -- Summary: Remote StreamExecutionEnvironment fails to submit jobs against LocalFlinkMiniCluster Key: FLINK-6689 URL: https://issues.apache.org/jira/browse/FLINK-6689 Project: Flink Issue Type: Bug Components: Job-Submission Affects Versions: 1.3.0 Reporter: Nico Kruber Fix For: 1.3.0 The following Flink programs fails to execute with the current 1.3 branch (1.2 works): {code:java} final String jobManagerAddress = "localhost"; final int jobManagerPort = ConfigConstants.DEFAULT_JOB_MANAGER_IPC_PORT; final Configuration config = new Configuration(); config.setString(ConfigConstants.JOB_MANAGER_IPC_ADDRESS_KEY, jobManagerAddress); config.setInteger(ConfigConstants.JOB_MANAGER_IPC_PORT_KEY, jobManagerPort); config.setBoolean(ConfigConstants.LOCAL_START_WEBSERVER, true); final LocalFlinkMiniCluster cluster = new LocalFlinkMiniCluster(config, false); cluster.start(true); final StreamExecutionEnvironment env = StreamExecutionEnvironment.createRemoteEnvironment(jobManagerAddress, jobManagerPort); env.fromElements(1l).addSink(new DiscardingSink()); // fails due to leader session id being wrong: env.execute("test"); {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6680) App & Flink migration guide: updates for the 1.3 release
Nico Kruber created FLINK-6680: -- Summary: App & Flink migration guide: updates for the 1.3 release Key: FLINK-6680 URL: https://issues.apache.org/jira/browse/FLINK-6680 Project: Flink Issue Type: Sub-task Reporter: Nico Kruber The "Upgrading Applications and Flink Versions" at {{docs/ops/upgrading.md}} does not contain any info on Flink 1.3 yet. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6678) Migration guide: add note about removed log4j default logger from core artefacts
Nico Kruber created FLINK-6678: -- Summary: Migration guide: add note about removed log4j default logger from core artefacts Key: FLINK-6678 URL: https://issues.apache.org/jira/browse/FLINK-6678 Project: Flink Issue Type: Sub-task Components: Build System, Documentation Affects Versions: 1.3.0 Reporter: Nico Kruber Fix For: 1.3.0 The migration guide at {{docs/dev/migration.md}} needs to be extended with some notes about the removed specific logger dependencies in the Flink core artefacts (FLINK-6415). This is valid for applications embedding flink. Examples and quickstarts have been adding their loggers already but other projects may need to add those. In maven's {{pom.xml}}, for example, by adding the following dependencies {code:xml} org.slf4j slf4j-log4j12 1.7.7 log4j log4j 1.2.17 {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6677) Add Table API changes to the migration guide
Nico Kruber created FLINK-6677: -- Summary: Add Table API changes to the migration guide Key: FLINK-6677 URL: https://issues.apache.org/jira/browse/FLINK-6677 Project: Flink Issue Type: Sub-task Components: Documentation, Table API & SQL Affects Versions: 1.3.0 Reporter: Nico Kruber Fix For: 1.3.0 The migration guide at {{docs/dev/migration.md}} needs to be extended with some notes about the API changes. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6676) Add QueryableStateClient changed to the migration guide
Nico Kruber created FLINK-6676: -- Summary: Add QueryableStateClient changed to the migration guide Key: FLINK-6676 URL: https://issues.apache.org/jira/browse/FLINK-6676 Project: Flink Issue Type: Sub-task Components: Documentation, Queryable State Affects Versions: 1.3.0 Reporter: Nico Kruber Fix For: 1.3.0 The migration guide at {{docs/dev/migration.md}} needs to be extended with some notes about the API changes: * changes in the constructor * more? -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6674) Update release 1.3 docs
Nico Kruber created FLINK-6674: -- Summary: Update release 1.3 docs Key: FLINK-6674 URL: https://issues.apache.org/jira/browse/FLINK-6674 Project: Flink Issue Type: Improvement Components: Documentation Affects Versions: 1.3.0 Reporter: Nico Kruber Fix For: 1.3.0 Umbrella issue to track required updates to the documentation for the 1.3 release. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6670) remove CommonTestUtils.createTempDirectory()
Nico Kruber created FLINK-6670: -- Summary: remove CommonTestUtils.createTempDirectory() Key: FLINK-6670 URL: https://issues.apache.org/jira/browse/FLINK-6670 Project: Flink Issue Type: Bug Components: Tests Reporter: Nico Kruber Priority: Minor {{CommonTestUtils.createTempDirectory()}} encourages a dangerous design pattern with potential concurrency issues in the unit tests which could be solved by using the following pattern instead. {code:java} @Rule public TemporaryFolder tempFolder = new TemporaryFolder(); {code} We should therefore remove {{CommonTestUtils.createTempDirectory()}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6659) RocksDBMergeIteratorTest leaving temporary directories behind
Nico Kruber created FLINK-6659: -- Summary: RocksDBMergeIteratorTest leaving temporary directories behind Key: FLINK-6659 URL: https://issues.apache.org/jira/browse/FLINK-6659 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.2.1, 1.2.0, 1.3.0, 1.2.2 Reporter: Nico Kruber Assignee: Nico Kruber {{RocksDBMergeIteratorTest}} uses a newly created temporary directory for its RocksDB instance but does not delete is when finished. We should better replace this pattern with a proper {{@Rule}}-based approach -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6654) missing maven dependency on "flink-shaded-hadoop2-uber" in flink-dist
Nico Kruber created FLINK-6654: -- Summary: missing maven dependency on "flink-shaded-hadoop2-uber" in flink-dist Key: FLINK-6654 URL: https://issues.apache.org/jira/browse/FLINK-6654 Project: Flink Issue Type: Bug Components: Build System Affects Versions: 1.3.0 Reporter: Nico Kruber Assignee: Nico Kruber Fix For: 1.3.0 Since applying FLINK-6514, flink-dist includes {{flink-shaded-hadoop2-uber-*.jar}} but without giving this dependency in its {{pom.xml}}. This may lead to concurrency issues during builds but also fails building the flink-dist module only (with dependencies) as in {code} mvn clean install -pl flink-dist -am {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6380) BlobService concurrency issues between delete and put/get methods
Nico Kruber created FLINK-6380: -- Summary: BlobService concurrency issues between delete and put/get methods Key: FLINK-6380 URL: https://issues.apache.org/jira/browse/FLINK-6380 Project: Flink Issue Type: Bug Components: Network Affects Versions: 1.3.0 Reporter: Nico Kruber {{BlobCache#deleteAll(JobID)}} deletes the job directory which is only created at the start of {{BlobCache#getURL(BlobKey)}} which then relies on the directory being present. This is not restricted to the {{BlobCache}}, though, but also affects the {{BlobServer}} in two ways: 1) its own local storage and 2) its backing {{BlobStore}} For the latter, i.e. in {{FileSystemBlobStore}}, there is no guarantee that a directory will not be deleted concurrently (from a {{delete}} method) between its creation and writing a file (in a {{get}} method): * the {{delete}} method for name-addressable blobs always deletes the job-specific storage directory if there is no further blob for this job * the content-addressable blobs do that similarly but are shared among jobs and thus only delete directories if there is no other blob. Since name-addressable blobs have not been used so far and the latter case typically does not occur concurrently with get/put requests, this has not been a problem so far but is more relevant after applying FLINK-6046. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6329) refactor name-addressable accessors in BlobService classes from returning a URL to a File or InputStream
Nico Kruber created FLINK-6329: -- Summary: refactor name-addressable accessors in BlobService classes from returning a URL to a File or InputStream Key: FLINK-6329 URL: https://issues.apache.org/jira/browse/FLINK-6329 Project: Flink Issue Type: Improvement Components: Network Reporter: Nico Kruber Priority: Minor The fact that the BLOB-by-hash, i.e. {{CONTENT_ADDRESSABLE}}, methods are typed to {{URL}} is an artefact of it being used for libraries with a {{URLClassLoader}}. For the {{NAME_ADDRESSABLE}} blobs, we could type all the methods like {{get(JobId, String)}} to {{File}} or to {{InputStream}}, rather then to {{URL}}. That means we should also change the method names to {{getBlob}} or something like that. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6320) Flakey JobManagerHAJobGraphRecoveryITCase
Nico Kruber created FLINK-6320: -- Summary: Flakey JobManagerHAJobGraphRecoveryITCase Key: FLINK-6320 URL: https://issues.apache.org/jira/browse/FLINK-6320 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Nico Kruber it looks as if there is a race condition in the cleanup of {{JobManagerHAJobGraphRecoveryITCase}}. {code} Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 50.271 sec <<< FAILURE! - in org.apache.flink.test.recovery.JobManagerHAJobGraphRecoveryITCase testJobPersistencyWhenJobManagerShutdown(org.apache.flink.test.recovery.JobManagerHAJobGraphRecoveryITCase) Time elapsed: 0.129 sec <<< ERROR! java.io.FileNotFoundException: File does not exist: /tmp/9b63934b-789d-428c-aa9e-47d5d8fa1e32/recovery/submittedJobGraphf763d61fba47 at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2275) at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) at org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1535) at org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2270) at org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1653) at org.apache.flink.test.recovery.JobManagerHAJobGraphRecoveryITCase.cleanUp(JobManagerHAJobGraphRecoveryITCase.java:112) {code} Full log: https://s3.amazonaws.com/archive.travis-ci.org/jobs/223124016/log.txt Maybe a rule-based temporary directory is a better solution: {code:java} @Rule public TemporaryFolder tempFolder = new TemporaryFolder(); {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6305) flink-ml tests are executed in all flink-fast-test profiles
Nico Kruber created FLINK-6305: -- Summary: flink-ml tests are executed in all flink-fast-test profiles Key: FLINK-6305 URL: https://issues.apache.org/jira/browse/FLINK-6305 Project: Flink Issue Type: Improvement Components: Tests Affects Versions: 1.3.0 Reporter: Nico Kruber Priority: Minor The {{flink-fast-tests-\*}} profiles partition the unit tests based on their starting letter. However, this does not affect the Scala tests run via the ScalaTest plugin and therefore, {{flink-ml}} tests are executed in all three currently existing profiles. While this may not be that grave, it does run for about 2.5 minutes on Travis CI which may be saved in 2/3 of the profiles there. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6299) make all IT cases extend from TestLogger
Nico Kruber created FLINK-6299: -- Summary: make all IT cases extend from TestLogger Key: FLINK-6299 URL: https://issues.apache.org/jira/browse/FLINK-6299 Project: Flink Issue Type: Improvement Components: Tests Affects Versions: 1.3.0 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Minor Not all of the integration tests extend from {{TestLogger}} but this is a very helpful tool so the currently running tests are written to the logs as well as their failures, especially for those tests where errors are often burried in the logs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6293) Flakey JobManagerITCase
Nico Kruber created FLINK-6293: -- Summary: Flakey JobManagerITCase Key: FLINK-6293 URL: https://issues.apache.org/jira/browse/FLINK-6293 Project: Flink Issue Type: Bug Components: Job-Submission, Tests Affects Versions: 1.3.0 Reporter: Nico Kruber Quite seldomly, {{JobManagerITCase}} seems to hang, e.g. see https://api.travis-ci.org/jobs/220888193/log.txt?deansi=true The maven watchdog kills the build due to not output being produced within 300s and {{JobManagerITCase}} seems to hang in line 772, i.e. {code:title=JobManagerITCase lines 770-772|language=java|linenumbers=true|firstline=770} // Trigger savepoint for non-existing job jobManager.tell(TriggerSavepoint(jobId, Option.apply("any")), testActor) val response = expectMsgType[TriggerSavepointFailure](deadline.timeLeft) {code} Although the (downloaded) logs do not quite allow a precise mapping to this test case, it looks as if the following block may be related: {code} 09:34:47,684 INFO org.apache.flink.runtime.minicluster.FlinkMiniCluster - Akka ask timeout set to 100s 09:34:47,777 INFO org.apache.flink.runtime.minicluster.FlinkMiniCluster - Disabled queryable state server 09:34:47,777 INFO org.apache.flink.runtime.minicluster.FlinkMiniCluster - Starting FlinkMiniCluster. 09:34:47,809 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started 09:34:47,837 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-eab23d04-ea18-4dc5-b1df-fcf9fc295062 09:34:47,838 WARN org.apache.flink.runtime.net.SSLUtils - Not a SSL socket, will skip setting tls version and cipher suites. 09:34:47,839 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:36745 - max concurrent requests: 50 - max backlog: 1000 09:34:47,840 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported. 09:34:47,850 INFO org.apache.flink.runtime.testingUtils.TestingMemoryArchivist - Started memory archivist akka://flink/user/archive_1 09:34:47,860 INFO org.apache.flink.runtime.testutils.TestingResourceManager - Trying to associate with JobManager leader akka://flink/user/jobmanager_1 09:34:47,861 INFO org.apache.flink.runtime.testingUtils.TestingJobManager - Starting JobManager at akka://flink/user/jobmanager_1. 09:34:47,862 WARN org.apache.flink.runtime.testingUtils.TestingJobManager - Discard message LeaderSessionMessage(----,TriggerSavepoint(6e813070338a23b0ff571646bca56521,Some(any))) because there is currently no valid leader id known. 09:34:47,862 INFO org.apache.flink.runtime.testingUtils.TestingJobManager - JobManager akka://flink/user/jobmanager_1 was granted leadership with leader session ID Some(----). 09:34:47,867 INFO org.apache.flink.runtime.testutils.TestingResourceManager - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager_1#-652927556] - leader session ---- {code} If so, then this may be related to FLINK-6287 and may possibly even be a duplicate. What is strange though is that the timeout for the expected message to arrive is no more than 2m and thus the test should properly fail within 300s. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6292) Travis: transfer.sh not accepting uploads via http:// anymore
Nico Kruber created FLINK-6292: -- Summary: Travis: transfer.sh not accepting uploads via http:// anymore Key: FLINK-6292 URL: https://issues.apache.org/jira/browse/FLINK-6292 Project: Flink Issue Type: Bug Components: Tests Affects Versions: 1.2.0, 1.3.0, 1.1.5 Reporter: Nico Kruber Assignee: Nico Kruber The {{travis_mvn_watchdog.sh}} script tries to upload the logs to transfer.sh but it seems like they do not accept uploads to {{http://transfer.sh}} anymore and only accept {{https}} nowadays. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6287) Flakey JobManagerRegistrationTest
Nico Kruber created FLINK-6287: -- Summary: Flakey JobManagerRegistrationTest Key: FLINK-6287 URL: https://issues.apache.org/jira/browse/FLINK-6287 Project: Flink Issue Type: Bug Components: JobManager Affects Versions: 1.3.0 Environment: unit tests Reporter: Nico Kruber There seems to be a race condition in the "{{JobManagerRegistrationTest.The JobManager should handle repeated registration calls}}" (scala) unit test. Every so often, especially when my system is under load, this test fails with a timeout after seeing the following messages in the log4j INFO outputs: {code} 14:18:42,257 INFO org.apache.flink.runtime.testutils.TestingResourceManager - Trying to associate with JobManager leader akka://flink/user/$f#-1062324203 14:18:42,253 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka://flink/user/$f. 14:18:42,258 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(----,RegisterTaskManager(8b6ad3eccdbfcb199df630b794b6ec0c,8b6ad3eccdbfcb199df630b794b6ec0c @ nico-work.fritz.box (dataPort=1),cores=4, physMem=16686931968, heap=1556938752, managed=10,1)) because there is currently no valid leader id known. 14:18:42,259 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(----,RegisterTaskManager(8b6ad3eccdbfcb199df630b794b6ec0c,8b6ad3eccdbfcb199df630b794b6ec0c @ nico-work.fritz.box (dataPort=1),cores=4, physMem=16686931968, heap=1556938752, managed=10,1)) because there is currently no valid leader id known. 14:18:42,259 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(----,RegisterTaskManager(8b6ad3eccdbfcb199df630b794b6ec0c,8b6ad3eccdbfcb199df630b794b6ec0c @ nico-work.fritz.box (dataPort=1),cores=4, physMem=16686931968, heap=1556938752, managed=10,1)) because there is currently no valid leader id known. 14:18:42,259 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(----,RegisterResourceManager akka://flink/user/resourcemanager-61e00f37-9e99-4355-9099-53b992e8232e) because there is currently no valid leader id known. 14:18:42,259 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka://flink/user/$f was granted leadership with leader session ID Some(----). {code} Full log: {code} 14:18:42,230 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-d09b13ee-26bb-4ec8-950a-956f4a6c16cf 14:18:42,231 WARN org.apache.flink.runtime.net.SSLUtils - Not a SSL socket, will skip setting tls version and cipher suites. 14:18:42,236 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:39695 - max concurrent requests: 50 - max backlog: 1000 14:18:42,247 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported. 14:18:42,249 WARN org.apache.flink.runtime.metrics.MetricRegistry - Could not start MetricDumpActor. No metrics will be submitted to the WebInterface. akka.actor.InvalidActorNameException: actor name [MetricQueryService] is not unique! at akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:130) at akka.actor.dungeon.Children$class.reserveChild(Children.scala:76) at akka.actor.ActorCell.reserveChild(ActorCell.scala:369) at akka.actor.dungeon.Children$class.makeChild(Children.scala:201) at akka.actor.dungeon.Children$class.attachChild(Children.scala:41) at akka.actor.ActorCell.attachChild(ActorCell.scala:369) at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:553) at org.apache.flink.runtime.metrics.dump.MetricQueryService.startMetricQueryService(MetricQueryService.java:170) at org.apache.flink.runtime.metrics.MetricRegistry.startQueryService(MetricRegistry.java:166) at org.apache.flink.runtime.jobmanager.JobManager$.startJobManagerActors(JobManager.scala:2720) at org.apache.flink.runtime.jobmanager.JobManagerRegistrationTest.org$apache$flink$runtime$jobmanager$JobManagerRegistrationTest$$startTestingJobManager(JobManagerRegistrationTest.scala:182) at org.apache.flink.runtime.jobmanager.JobManagerRegistrationTest$$anonfun$1$$anonfun$apply$mcV$sp$4.apply$mcV$sp(JobManagerRegistrationTest.scala:123) at org.apache.flink.runtime.jobmanager.JobManagerRegistrationTest$$anonfun$1$$anonfun$apply$mcV$sp$4.apply(JobManagerRegistrationTest.scala:121) at org.apache.flink.runtime.jobmanager.JobManagerRegistrationTest$$anonfun$1$$anonfun$apply$mcV$sp$4.apply(JobManagerRegistrationTest.scala:121) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at
[jira] [Created] (FLINK-6270) Port several network config parameters to ConfigOption
Nico Kruber created FLINK-6270: -- Summary: Port several network config parameters to ConfigOption Key: FLINK-6270 URL: https://issues.apache.org/jira/browse/FLINK-6270 Project: Flink Issue Type: Improvement Components: Network Affects Versions: 1.3.0 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Minor I'd like to port some memory and network buffers related config options to new {{ConfigOption}} instances before continuing with FLINK-4545. These include: * {{taskmanager.memory.size}} * {{taskmanager.memory.fraction}} * {{taskmanager.memory.off-heap}} * {{taskmanager.memory.preallocate}} * {{taskmanager.network.numberOfBuffers}} * {{taskmanager.memory.segment-size}} Some of these already existed as {{ConfigOption}} instances in {{MiniClusterConfiguration}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6064) wrong BlobServer hostname in TaskExecutor
Nico Kruber created FLINK-6064: -- Summary: wrong BlobServer hostname in TaskExecutor Key: FLINK-6064 URL: https://issues.apache.org/jira/browse/FLINK-6064 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.3.0 Reporter: Nico Kruber Assignee: Nico Kruber The hostname currently used for the connection to the blob server inside TaskExecutor is the akka RPC address which is invalid here. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6046) Add support for oversized messages during deployment
Nico Kruber created FLINK-6046: -- Summary: Add support for oversized messages during deployment Key: FLINK-6046 URL: https://issues.apache.org/jira/browse/FLINK-6046 Project: Flink Issue Type: New Feature Components: Distributed Coordination Reporter: Nico Kruber Assignee: Nico Kruber This is the non-FLIP6 version of FLINK-4346, restricted to deployment messages: Currently, messages larger than the maximum Akka Framesize cause an error when being transported. We should add a way to pass messages that are larger than {{akka.framesize}} as may happen for task deployments via the {{TaskDeploymentDescriptor}}. We should use the {{BlobServer}} to offload big data items (if possible) and make use of any potential distributed file system behind. This way, not only do we avoid the akka framesize restriction, but may also be able to speed up deployment. I suggest the following changes: - the sender, i.e. the {{Execution}} class, tries to store the serialized job information and serialized task information (if oversized) from the {{TaskDeploymentDescriptor}} (tdd) on the {{BlobServer}} as a single {{NAME_ADDRESSABLE}} blob under its job ID (if this does not work, we send the whole tdd as usual via akka) - if stored in a blob, these data items are removed from the tdd - the receiver, i.e. the {{TaskManager}} class, tries to retrieve any offloaded data after receiving the {{TaskDeploymentDescriptor}} from akka; it re-assembles the original tdd - as all {{NAME_ADDRESSABLE}} blobs, these offloaded blobs are removed when the job enters a final state Further (future) changes may include: - separating the serialized job information and serialized task information into two files and re-use the first one for all tasks - not re-deploying these two during job recovery (if possible) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6008) collection of BlobServer improvements
Nico Kruber created FLINK-6008: -- Summary: collection of BlobServer improvements Key: FLINK-6008 URL: https://issues.apache.org/jira/browse/FLINK-6008 Project: Flink Issue Type: Improvement Components: Network Affects Versions: 1.3.0 Reporter: Nico Kruber Assignee: Nico Kruber The following things should be removed around the BlobServer/BlobCache: * update config uptions with non-deprecated ones, e.g. {{high-availability.cluster-id}} and {{high-availability.storageDir}} * promote {{BlobStore#deleteAll(JobID)}} to the {{BlobService}} * extend the {{BlobService}} to work with {{NAME_ADDRESSABLE}} blobs (prepares FLINK-4399] * remove {{NAME_ADDRESSABLE}} blobs after job/task termination * do not fail the {{BlobServer}} when a delete operation fails * code style, like using {{Preconditions.checkArgument}} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6005) unit test ArrayList initializations without default size
Nico Kruber created FLINK-6005: -- Summary: unit test ArrayList initializations without default size Key: FLINK-6005 URL: https://issues.apache.org/jira/browse/FLINK-6005 Project: Flink Issue Type: Improvement Components: Tests Affects Versions: 1.3.0 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Minor I found some ArrayList initializations without a default size although it is possible to select one. The following PR will show some cases that I'd like to fix. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5973) check whether the direct memory size is always correctly calculated
Nico Kruber created FLINK-5973: -- Summary: check whether the direct memory size is always correctly calculated Key: FLINK-5973 URL: https://issues.apache.org/jira/browse/FLINK-5973 Project: Flink Issue Type: Bug Components: Local Runtime Reporter: Nico Kruber Priority: Minor A user on the mailing list ran into the problem that the {{directMemorySize}} was incorrectly set too high which may happen if the following code path gets {{maxMemory}} from 1/4*> instead of the calculation, {{taskmanager.sh}} is doing (in his case via the discouraged {{start-local.sh}} script). It be the case that other code paths also exhibit this issue, which should be checked. {code:title=TaskManagerServices#createMemoryManager()} } else if (memType == MemoryType.OFF_HEAP) { // The maximum heap memory has been adjusted according to the fraction long maxMemory = EnvironmentInformation.getMaxJvmHeapMemory(); long directMemorySize = (long) (maxMemory / (1.0 - memoryFraction) * memoryFraction); if (preAllocateMemory) { LOG.info("Using {} of the maximum memory size for managed off-heap memory ({} MB)." , memoryFraction, directMemorySize >> 20); } else { LOG.info("Limiting managed memory to {} of the maximum memory size ({} MB)," + " memory will be allocated lazily.", memoryFraction, directMemorySize >> 20); } memorySize = directMemorySize; } else { {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5814) flink-dist creates wrong symlink when not used with cleaned before
Nico Kruber created FLINK-5814: -- Summary: flink-dist creates wrong symlink when not used with cleaned before Key: FLINK-5814 URL: https://issues.apache.org/jira/browse/FLINK-5814 Project: Flink Issue Type: Bug Components: Build System Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Minor If {{/build-target}} already exists, 'mvn package' for flink-dist will create a symbolic link *inside* {{/build-target}} instead of replacing that symlink. This is due to the behaviour of {{ln \-sf}} for target links that point to directories and may be solved by adding the {{--no-dereference}} parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5651) state backends: forbid Iterator#remove() from the Iterable returned by HeapListState#get()
Nico Kruber created FLINK-5651: -- Summary: state backends: forbid Iterator#remove() from the Iterable returned by HeapListState#get() Key: FLINK-5651 URL: https://issues.apache.org/jira/browse/FLINK-5651 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber The idiom behind AppendingState#get() is to return a copy of the value behind or at least not to allow changes to the underlying state storage. However, the heap state backend returns the original list and thus is prone to changes. The returned Iterable offers Iterable#iterator() which in turn offers Iterator#remove() that would change the stored state. While we cannot efficiently block the user from modifying the objects in the list, we can at least prevent him from doing structural changes to the list by forbidding Iterator#remove() there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5642) queryable state: race condition with HeadListState
Nico Kruber created FLINK-5642: -- Summary: queryable state: race condition with HeadListState Key: FLINK-5642 URL: https://issues.apache.org/jira/browse/FLINK-5642 Project: Flink Issue Type: Bug Components: Queryable State Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber If queryable state accesses a HeapListState instance that is being modified during the value's serialisation, it may crash, e.g. with a NullPointerException during the serialisation or with an EOFException during de-serialisation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5615) queryable state: execute the QueryableStateITCase for all three state back-ends
Nico Kruber created FLINK-5615: -- Summary: queryable state: execute the QueryableStateITCase for all three state back-ends Key: FLINK-5615 URL: https://issues.apache.org/jira/browse/FLINK-5615 Project: Flink Issue Type: Improvement Components: Queryable State Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber The QueryableStateITCase currently is only tested with the MemoryStateBackend but as has been seen in the past, some errors or inconsistent behaviour only appeared with different state back-ends. It should thus be extended to be tested with all three currently existing state back-ends. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5613) QueryableState: requesting a non-existing key in RocksDBStateBackend is not consistent with the MemoryStateBackend and FsStateBackend
Nico Kruber created FLINK-5613: -- Summary: QueryableState: requesting a non-existing key in RocksDBStateBackend is not consistent with the MemoryStateBackend and FsStateBackend Key: FLINK-5613 URL: https://issues.apache.org/jira/browse/FLINK-5613 Project: Flink Issue Type: Bug Components: Queryable State Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber Querying for a non-existing key for a state that has a default value set currently results in the default value being returned in the RocksDBStateBackend only. MemoryStateBackend or FsStateBackend will return null which results in an UnknownKeyOrNamespace exception. Default values are now deprecated and will be removed eventually so we should not introduce them into the new queryable state API and thus adapt the RocksDBStateBackend accordingly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5591) queryable state: request failures failing on the server side do not contain client stack trace
Nico Kruber created FLINK-5591: -- Summary: queryable state: request failures failing on the server side do not contain client stack trace Key: FLINK-5591 URL: https://issues.apache.org/jira/browse/FLINK-5591 Project: Flink Issue Type: Improvement Components: Queryable State Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber Failures during queries using QueryableStateClient throw exceptions like UnknownKeyOrNamespace but these are thrown on the server side with its stacktrace included, then serialised and sent over to the client where they are deserialised and re-thrown. For debugging it would help a lot if these "server-exceptions" would be encapsulated in new client-exception instances so that we also get the client's stack trace. Especially since failing requests may be caused by the client itself, e.g. non-existing jobs/keys, wrong namespace/serialisers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5576) extend deserialization functions of KvStateRequestSerializer to detect unconsumed bytes
Nico Kruber created FLINK-5576: -- Summary: extend deserialization functions of KvStateRequestSerializer to detect unconsumed bytes Key: FLINK-5576 URL: https://issues.apache.org/jira/browse/FLINK-5576 Project: Flink Issue Type: Improvement Components: Queryable State Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Minor KvStateRequestSerializer#deserializeValue and KvStateRequestSerializer#deserializeList both deserialize a given byte array. This is used by clients and unit tests and it is fair to assume that these byte arrays represent a complete value since we do not offer a method to continue reading form the middle of the array anyway. Therefore, we can treat unconsumed bytes as errors, e.g. from a wrong serializer being used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5561) DataInputDeserializer#available returns one too few
Nico Kruber created FLINK-5561: -- Summary: DataInputDeserializer#available returns one too few Key: FLINK-5561 URL: https://issues.apache.org/jira/browse/FLINK-5561 Project: Flink Issue Type: Bug Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber DataInputDeserializer#available seems to assume that the position points to the last read byte but instead it points to the next byte. Therefore, it returns a value which is 1 smaller than the correct one. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5559) queryable state: KvStateRequestSerializer#deserializeKeyAndNamespace() throws an IOException without own failure message if deserialisation fails
Nico Kruber created FLINK-5559: -- Summary: queryable state: KvStateRequestSerializer#deserializeKeyAndNamespace() throws an IOException without own failure message if deserialisation fails Key: FLINK-5559 URL: https://issues.apache.org/jira/browse/FLINK-5559 Project: Flink Issue Type: Improvement Components: Queryable State Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Minor KvStateRequestSerializer#deserializeKeyAndNamespace() throws an IOException, e.g. EOFException, if the deserialisation fails, e.g. there are not enough available bytes. In these cases, it should instead also throw an IllegalArgumentException with a message containing "This indicates a mismatch in the key/namespace serializers used by the KvState instance and this access." as the other error cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5530) race condition in AbstractRocksDBState#getSerializedValue
Nico Kruber created FLINK-5530: -- Summary: race condition in AbstractRocksDBState#getSerializedValue Key: FLINK-5530 URL: https://issues.apache.org/jira/browse/FLINK-5530 Project: Flink Issue Type: Bug Components: Queryable State Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber Priority: Blocker AbstractRocksDBState#getSerializedValue() uses the same key serialisation stream as the ordinary state access methods but is called in parallel during state queries thus violating the assumption of only one thread accessing it. This may lead to either wrong results in queries or corrupt data while queries are executed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5528) tests: reduce the retry delay in QueryableStateITCase
Nico Kruber created FLINK-5528: -- Summary: tests: reduce the retry delay in QueryableStateITCase Key: FLINK-5528 URL: https://issues.apache.org/jira/browse/FLINK-5528 Project: Flink Issue Type: Improvement Components: Queryable State Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber The QueryableStateITCase uses a retry of 1 second, e.g. if a queried key does not exist yet. This seems a bit too conservative as the job may not take that long to deploy and especially since getKvStateWithRetries() recovers from failures by retrying. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5527) QueryableState: requesting a non-existing key in MemoryStateBackend or FsStateBackend does not return the default value
Nico Kruber created FLINK-5527: -- Summary: QueryableState: requesting a non-existing key in MemoryStateBackend or FsStateBackend does not return the default value Key: FLINK-5527 URL: https://issues.apache.org/jira/browse/FLINK-5527 Project: Flink Issue Type: Improvement Components: Queryable State Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber Querying for a non-existing key for a state that has a default value set currently results in an UnknownKeyOrNamespace exception when the MemoryStateBackend or FsStateBackend is used. It should return the default value instead just like the RocksDBStateBackend. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5526) QueryableState: notify upon receiving a query but having queryable state disabled
Nico Kruber created FLINK-5526: -- Summary: QueryableState: notify upon receiving a query but having queryable state disabled Key: FLINK-5526 URL: https://issues.apache.org/jira/browse/FLINK-5526 Project: Flink Issue Type: Improvement Components: Queryable State Affects Versions: 1.2.0 Reporter: Nico Kruber Priority: Minor When querying state but having it disabled in the config, a warning should be presented to the user that a query was received but the component is disabled. This is in addition to the query itself failing with a rather generic exception that is not pointing to this fact. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5521) remove unused KvStateRequestSerializer#serializeList
Nico Kruber created FLINK-5521: -- Summary: remove unused KvStateRequestSerializer#serializeList Key: FLINK-5521 URL: https://issues.apache.org/jira/browse/FLINK-5521 Project: Flink Issue Type: Improvement Components: Queryable State Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber KvStateRequestSerializer#serializeList is unused and instead the state backends' serialisation functions are used. Therefore, remove this one and make sure KvStateRequestSerializer#deserializeList works with the state backends' ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5515) fix unused kvState.getSerializedValue call in KvStateServerHandler
Nico Kruber created FLINK-5515: -- Summary: fix unused kvState.getSerializedValue call in KvStateServerHandler Key: FLINK-5515 URL: https://issues.apache.org/jira/browse/FLINK-5515 Project: Flink Issue Type: Improvement Reporter: Nico Kruber This was added in 4809f5367b08a9734fc1bd4875be51a9f3bb65aa and is probably a left-over from a merge. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5507) remove queryable list state sink
Nico Kruber created FLINK-5507: -- Summary: remove queryable list state sink Key: FLINK-5507 URL: https://issues.apache.org/jira/browse/FLINK-5507 Project: Flink Issue Type: Improvement Components: State Backends, Checkpointing Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber The queryable state "sink" using ListState (".asQueryableState(, ListStateDescriptor)") stores all incoming data forever and is never cleaned. Eventually, it will pile up too much memory and is thus of limited use. We should remove it from the API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5482) QueryableStateClient does not recover from a failed lookup due to a non-running job
Nico Kruber created FLINK-5482: -- Summary: QueryableStateClient does not recover from a failed lookup due to a non-running job Key: FLINK-5482 URL: https://issues.apache.org/jira/browse/FLINK-5482 Project: Flink Issue Type: Bug Affects Versions: 1.2.0 Reporter: Nico Kruber Assignee: Nico Kruber When the QueryableStateClient is used to issue a query but the job is not running yet, its internal lookup result is cached with an IllegalStateException that the job was not found. It does, however, never recover from that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5334) outdated scala SBT quickstart example
Nico Kruber created FLINK-5334: -- Summary: outdated scala SBT quickstart example Key: FLINK-5334 URL: https://issues.apache.org/jira/browse/FLINK-5334 Project: Flink Issue Type: Bug Components: Quickstarts Reporter: Nico Kruber The scala quickstart set up via sbt-quickstart.sh or from the repository at https://github.com/tillrohrmann/flink-project seems outdated compared to what is set up with the maven archetype, e.g. Job.scala vs. BatchJob.scala and StreamingJob.scala. This should probably be updated and also the hard-coded example in sbt-quickstart.sh on the web page could be removed and download the newest version instead as the mvn command does. see https://ci.apache.org/projects/flink/flink-docs-release-1.2/quickstart/scala_api_quickstart.html for these two paths (SBT vs. Maven) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5309) documentation links on the home page point to 1.2-SNAPSHOT
Nico Kruber created FLINK-5309: -- Summary: documentation links on the home page point to 1.2-SNAPSHOT Key: FLINK-5309 URL: https://issues.apache.org/jira/browse/FLINK-5309 Project: Flink Issue Type: Bug Components: Project Website Reporter: Nico Kruber The main website at https://flink.apache.org/ has several links to the documentation but despite advertising a stable release download, all of those links point to the 1.2 branch. This should be set to the same stable version's documentation instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5308) download links to previous releases are incomplete
Nico Kruber created FLINK-5308: -- Summary: download links to previous releases are incomplete Key: FLINK-5308 URL: https://issues.apache.org/jira/browse/FLINK-5308 Project: Flink Issue Type: Bug Components: Project Website Reporter: Nico Kruber Assignee: Nico Kruber The list of all releases under https://flink.apache.org/downloads.html#all-releases does not contain several previous releases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5212) Flakey ScalaShellITCase#testPreventRecreationBatch (out of Java heap space)
Nico Kruber created FLINK-5212: -- Summary: Flakey ScalaShellITCase#testPreventRecreationBatch (out of Java heap space) Key: FLINK-5212 URL: https://issues.apache.org/jira/browse/FLINK-5212 Project: Flink Issue Type: Bug Components: Flink on Tez, Scala Shell Affects Versions: 2.0.0 Environment: TravisCI Reporter: Nico Kruber {code:none} Tests run: 9, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 255.399 sec <<< FAILURE! - in org.apache.flink.api.scala.ScalaShellITCase testPreventRecreationBatch(org.apache.flink.api.scala.ScalaShellITCase) Time elapsed: 198.128 sec <<< ERROR! java.lang.OutOfMemoryError: Java heap space at scala.reflect.internal.Names$class.enterChars(Names.scala:70) at scala.reflect.internal.Names$class.body$1(Names.scala:116) at scala.reflect.internal.Names$class.newTermName(Names.scala:127) at scala.reflect.internal.SymbolTable.newTermName(SymbolTable.scala:16) at scala.reflect.internal.Names$class.newTermName(Names.scala:83) at scala.reflect.internal.SymbolTable.newTermName(SymbolTable.scala:16) at scala.reflect.internal.Names$class.newTermName(Names.scala:144) at scala.reflect.internal.SymbolTable.newTermName(SymbolTable.scala:16) at scala.tools.nsc.symtab.classfile.ClassfileParser$ConstantPool.getName(ClassfileParser.scala:206) at scala.tools.nsc.symtab.classfile.ClassfileParser$ConstantPool.getExternalName(ClassfileParser.scala:216) at scala.tools.nsc.symtab.classfile.ClassfileParser$ConstantPool.getType(ClassfileParser.scala:286) at scala.tools.nsc.symtab.classfile.ClassfileParser.parseMethod(ClassfileParser.scala:565) at scala.tools.nsc.symtab.classfile.ClassfileParser.scala$tools$nsc$symtab$classfile$ClassfileParser$$queueLoad$1(ClassfileParser.scala:480) at scala.tools.nsc.symtab.classfile.ClassfileParser$$anonfun$parseClass$1.apply$mcV$sp(ClassfileParser.scala:490) at scala.tools.nsc.symtab.classfile.ClassfileParser.parseClass(ClassfileParser.scala:495) at scala.tools.nsc.symtab.classfile.ClassfileParser.parse(ClassfileParser.scala:136) at scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader$$anonfun$doComplete$2.apply$mcV$sp(SymbolLoaders.scala:347) at scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader$$anonfun$doComplete$2.apply(SymbolLoaders.scala:347) at scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader$$anonfun$doComplete$2.apply(SymbolLoaders.scala:347) at scala.reflect.internal.SymbolTable.enteringPhase(SymbolTable.scala:235) at scala.tools.nsc.symtab.SymbolLoaders$ClassfileLoader.doComplete(SymbolLoaders.scala:347) at scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.complete(SymbolLoaders.scala:211) at scala.tools.nsc.symtab.SymbolLoaders$SymbolLoader.load(SymbolLoaders.scala:227) at scala.reflect.internal.Symbols$Symbol.typeParams(Symbols.scala:1708) at scala.reflect.internal.Types$NoArgsTypeRef.typeParams(Types.scala:1926) at scala.reflect.internal.Types$NoArgsTypeRef.isHigherKinded(Types.scala:1925) at scala.reflect.internal.transform.UnCurry$class.scala$reflect$internal$transform$UnCurry$$expandAlias(UnCurry.scala:22) at scala.reflect.internal.transform.UnCurry$$anon$2.apply(UnCurry.scala:26) at scala.reflect.internal.tpe.TypeMaps$TypeMap.applyToSymbolInfo(TypeMaps.scala:218) at scala.reflect.internal.tpe.TypeMaps$TypeMap.loop$1(TypeMaps.scala:227) at scala.reflect.internal.tpe.TypeMaps$TypeMap.noChangeToSymbols(TypeMaps.scala:229) at scala.reflect.internal.tpe.TypeMaps$TypeMap.mapOver(TypeMaps.scala:243) Results : Tests in error: ScalaShellITCase.testPreventRecreationBatch » OutOfMemory Java heap space {code} stdout: https://api.travis-ci.org/jobs/180090640/log.txt?deansi=true full logs: https://transfer.sh/nu2wr/34.1.tar.gz -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5206) Flakey PythonPlanBinderTest
Nico Kruber created FLINK-5206: -- Summary: Flakey PythonPlanBinderTest Key: FLINK-5206 URL: https://issues.apache.org/jira/browse/FLINK-5206 Project: Flink Issue Type: Bug Environment: in TravisCI Reporter: Nico Kruber {code:none} --- T E S T S --- Running org.apache.flink.python.api.PythonPlanBinderTest Job execution failed. org.apache.flink.runtime.client.JobExecutionException: Job execution failed. at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:903) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:846) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:846) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.io.IOException: Output path '/tmp/flink/result2' could not be initialized. Canceling task... at org.apache.flink.api.common.io.FileOutputFormat.open(FileOutputFormat.java:233) at org.apache.flink.api.java.io.TextOutputFormat.open(TextOutputFormat.java:78) at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:178) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:654) at java.lang.Thread.run(Thread.java:745) Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 36.891 sec <<< FAILURE! - in org.apache.flink.python.api.PythonPlanBinderTest testJobWithoutObjectReuse(org.apache.flink.python.api.PythonPlanBinderTest) Time elapsed: 11.53 sec <<< FAILURE! java.lang.AssertionError: Error while calling the test program: Job execution failed. at org.junit.Assert.fail(Assert.java:88) at org.apache.flink.test.util.JavaProgramTestBase.testJobWithoutObjectReuse(JavaProgramTestBase.java:180) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.runners.ParentRunner.run(ParentRunner.java:363) at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283) at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173) at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153) at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128) at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203) at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155) at
[jira] [Created] (FLINK-5204) Flakey YARNSessionCapacitySchedulerITCase (NullPointerException)
Nico Kruber created FLINK-5204: -- Summary: Flakey YARNSessionCapacitySchedulerITCase (NullPointerException) Key: FLINK-5204 URL: https://issues.apache.org/jira/browse/FLINK-5204 Project: Flink Issue Type: Bug Components: Tests, YARN Affects Versions: 2.0.0 Environment: TravisCI Reporter: Nico Kruber {code:none} testTaskManagerFailure(org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase) Time elapsed: 0.61 sec <<< ERROR! java.lang.NullPointerException: java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics.getAggregateAppResourceUsage(RMAppAttemptMetrics.java:119) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.getApplicationResourceUsageReport(RMAppAttemptImpl.java:814) at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.createAndGetApplicationReport(RMAppImpl.java:580) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:806) at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplications(ClientRMService.java:670) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplications(ApplicationClientProtocolPBServiceImpl.java:234) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:425) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:408) at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53) at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:107) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplications(ApplicationClientProtocolPBClientImpl.java:254) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy91.getApplications(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:478) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplications(YarnClientImpl.java:455) at org.apache.flink.yarn.YarnTestBase.checkClusterEmpty(YarnTestBase.java:193) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) at org.junit.rules.RunRules.evaluate(RunRules.java:20) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at
[jira] [Created] (FLINK-5203) YARNHighAvailabilityITCase fails to deploy YARN cluster
Nico Kruber created FLINK-5203: -- Summary: YARNHighAvailabilityITCase fails to deploy YARN cluster Key: FLINK-5203 URL: https://issues.apache.org/jira/browse/FLINK-5203 Project: Flink Issue Type: Bug Components: Tests, YARN Affects Versions: 2.0.0 Environment: in TravisCI Reporter: Nico Kruber {code:none} Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 49.883 sec <<< FAILURE! - in org.apache.flink.yarn.YARNHighAvailabilityITCase testMultipleAMKill(org.apache.flink.yarn.YARNHighAvailabilityITCase) Time elapsed: 42.191 sec <<< ERROR! java.lang.RuntimeException: Couldn't deploy Yarn cluster at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:840) at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploy(AbstractYarnClusterDescriptor.java:407) at org.apache.flink.yarn.YARNHighAvailabilityITCase.testMultipleAMKill(YARNHighAvailabilityITCase.java:131) {code} stdout log: https://api.travis-ci.org/jobs/179733979/log.txt?deansi=true full logs: https://transfer.sh/UNjFq/29.5.tar.gz -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5202) test timeout (no output for 300s) in YARNHighAvailabilityITCase
Nico Kruber created FLINK-5202: -- Summary: test timeout (no output for 300s) in YARNHighAvailabilityITCase Key: FLINK-5202 URL: https://issues.apache.org/jira/browse/FLINK-5202 Project: Flink Issue Type: Bug Components: Tests, YARN Affects Versions: 2.0.0 Environment: in TravisCI Reporter: Nico Kruber TravisCI occasionally fails since YARNHighAvailabilityITCase fails to produce an output (or finish) within 300s. See the logs at the URLs below textual log: https://api.travis-ci.org/jobs/179733965/log.txt?deansi=true full log at https://transfer.sh/xcLvY/29.1.tar.gz -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5178) allow BLOB_STORAGE_DIRECTORY_KEY to point to a distributed file system
Nico Kruber created FLINK-5178: -- Summary: allow BLOB_STORAGE_DIRECTORY_KEY to point to a distributed file system Key: FLINK-5178 URL: https://issues.apache.org/jira/browse/FLINK-5178 Project: Flink Issue Type: Improvement Components: Network Reporter: Nico Kruber Assignee: Nico Kruber After FLINK-5129, high availability (HA) mode adds the ability for the BlobCache instances at the task managers to download blobs directly from the distributed file system. It would be nice if this also worked in non-HA mode and BLOB_STORAGE_DIRECTORY_KEY may point to a distributed file system. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5081) unable to set yarn.maximum-failed-containers with flink one-time YARN setup
Nico Kruber created FLINK-5081: -- Summary: unable to set yarn.maximum-failed-containers with flink one-time YARN setup Key: FLINK-5081 URL: https://issues.apache.org/jira/browse/FLINK-5081 Project: Flink Issue Type: Bug Components: Startup Shell Scripts Affects Versions: 1.1.4 Reporter: Nico Kruber When letting flink setup YARN for a one-time job, it apparently does not deliver the {{yarn.maximum-failed-containers}} parameter to YARN as the {{yarn-session.sh}} script does. Adding it to conf/flink-conf.yaml as https://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html#recovery-behavior-of-flink-on-yarn suggested also does not work. example: {code:none} flink run -m yarn-cluster -yn 3 -yjm 1024 -ytm 4096 .jar --parallelism 3 -Dyarn.maximum-failed-containers=100 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5066) prevent LocalInputChannel#getNextBuffer from de-serialising all events when looking for EndOfPartitionEvent only
Nico Kruber created FLINK-5066: -- Summary: prevent LocalInputChannel#getNextBuffer from de-serialising all events when looking for EndOfPartitionEvent only Key: FLINK-5066 URL: https://issues.apache.org/jira/browse/FLINK-5066 Project: Flink Issue Type: Improvement Components: Network Reporter: Nico Kruber Assignee: Nico Kruber LocalInputChannel#getNextBuffer de-serialises all incoming events on the lookout for an EndOfPartitionEvent. Instead, if EventSerializer offered a function to check for an event type only without de-serialising the whole event, we could save some resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5060) only serialise records once in RecordWriter#emit and RecordWriter#broadcastEmit
Nico Kruber created FLINK-5060: -- Summary: only serialise records once in RecordWriter#emit and RecordWriter#broadcastEmit Key: FLINK-5060 URL: https://issues.apache.org/jira/browse/FLINK-5060 Project: Flink Issue Type: Improvement Components: Network Reporter: Nico Kruber Assignee: Nico Kruber Currently, org.apache.flink.runtime.io.network.api.writer.RecordWriter#emit and org.apache.flink.runtime.io.network.api.writer.RecordWriter#broadcastEmit serialise a record once per target channel. Instead, they could serialise the record only once and use the serialised form for every channel and thus save resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5059) only serialise events once in RecordWriter#broadcastEvent
Nico Kruber created FLINK-5059: -- Summary: only serialise events once in RecordWriter#broadcastEvent Key: FLINK-5059 URL: https://issues.apache.org/jira/browse/FLINK-5059 Project: Flink Issue Type: Improvement Components: Network Reporter: Nico Kruber Assignee: Nico Kruber Currently, org.apache.flink.runtime.io.network.api.writer.RecordWriter#broadcastEvent serialises the event once per target channel. Instead, it could serialise the event only once and use the serialised form for every channel and thus save resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-5008) Update IDE setup documentation
Nico Kruber created FLINK-5008: -- Summary: Update IDE setup documentation Key: FLINK-5008 URL: https://issues.apache.org/jira/browse/FLINK-5008 Project: Flink Issue Type: Improvement Components: Documentation Reporter: Nico Kruber Priority: Minor The IDE setup documentation of Flink is outdated in both parts: IntelliJ IDEA was based on an old version and Eclipse/Scala IDE does not work at all anymore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)