[jira] [Created] (FLINK-2677) Add a general-purpose keyed-window operator
Stephan Ewen created FLINK-2677: --- Summary: Add a general-purpose keyed-window operator Key: FLINK-2677 URL: https://issues.apache.org/jira/browse/FLINK-2677 Project: Flink Issue Type: Sub-task Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2683) Add utilities for heap-backed keyed state in time panes
Stephan Ewen created FLINK-2683: --- Summary: Add utilities for heap-backed keyed state in time panes Key: FLINK-2683 URL: https://issues.apache.org/jira/browse/FLINK-2683 Project: Flink Issue Type: Sub-task Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 Time panes are commonly used for aligned time windows (tumbling and sliding). These utilities would represent the panes with per-pane keyed state. The implementation must support: - Evicting panes efficiently - iterating over the union of all panes' state per key for all keys (computing the window function for all keys) - efficient append the state for a key in a random pane -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2646) Rich functions should provide a method "closeAfterFailure()"
Stephan Ewen created FLINK-2646: --- Summary: Rich functions should provide a method "closeAfterFailure()" Key: FLINK-2646 URL: https://issues.apache.org/jira/browse/FLINK-2646 Project: Flink Issue Type: Improvement Components: Core Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 Right now, the {{close()}} method of rich functions is invoked in case of proper completion, and in case of canceling in case of error (to allow for cleanup). In certain cases, the user function needs to know why it is closed, whether the task completed in a regular fashion, or was canceled/failed. I suggest to add a method {{closeAfterFailure()}} to the {{RuchFunction}}. By default, this method calls {{close()}}. The runtime is the changed to call {{close()}} as part of the regular execution and {{closeAfterFailure()}} in case of an irregular exit. Because by default all cases call {{close()}} the change would not be API breaking. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2647) Stream operators need to differentiate between close() and dispose()
Stephan Ewen created FLINK-2647: --- Summary: Stream operators need to differentiate between close() and dispose() Key: FLINK-2647 URL: https://issues.apache.org/jira/browse/FLINK-2647 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 Currently, operators make no distinction between closing (which needs to emit remaining buffered data) and disposing (releasing resources). In case of a failure, we want to only release the resources, and not attempt to send pending buffered data. Effectively, streaming operators need to implement these methods then: - open() - processRecord() - processWatermark() - close() - dispose() -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2635) Unify StreamInputProcessor and OneInputStreamTask
Stephan Ewen created FLINK-2635: --- Summary: Unify StreamInputProcessor and OneInputStreamTask Key: FLINK-2635 URL: https://issues.apache.org/jira/browse/FLINK-2635 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 Working on some windowing operations, I found that these two classes can easily be put into one, making the set of task/operator classes easier to navigate. The {{OneInputStreamTask}} does effectively nothing more than delegate to the {{StreamInputProcessor}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2636) Clean up StreamRecord and Watermark types
Stephan Ewen created FLINK-2636: --- Summary: Clean up StreamRecord and Watermark types Key: FLINK-2636 URL: https://issues.apache.org/jira/browse/FLINK-2636 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 StreamRecords and Watermarks can be returned from streaming inputs alternatingly. They should have a common super type to make the code cleaner and allow for proper generic typing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2641) Integrate the off-heap memory configuration with the TaskManager start script
Stephan Ewen created FLINK-2641: --- Summary: Integrate the off-heap memory configuration with the TaskManager start script Key: FLINK-2641 URL: https://issues.apache.org/jira/browse/FLINK-2641 Project: Flink Issue Type: New Feature Components: Start-Stop Scripts Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 The TaskManager start script needs to adjust the {{-Xmx}}, {{-Xms}}, and {{-XX:MaxDirectMemorySize}} parameters according to the off-heap memory settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2640) Integrate the off-heap configurations with YARN runner
Stephan Ewen created FLINK-2640: --- Summary: Integrate the off-heap configurations with YARN runner Key: FLINK-2640 URL: https://issues.apache.org/jira/browse/FLINK-2640 Project: Flink Issue Type: New Feature Components: YARN Client Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 The YARN runner needs to adjust the {{-Xmx}}, {{-Xms}}, and {{-XX:MaxDirectMemorySize}} parameters according to the off-heap memory settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2638) Add @SafeVarargs to the environment.fromElements(...) method
Stephan Ewen created FLINK-2638: --- Summary: Add @SafeVarargs to the environment.fromElements(...) method Key: FLINK-2638 URL: https://issues.apache.org/jira/browse/FLINK-2638 Project: Flink Issue Type: Improvement Components: Core, Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 This annotation suppresses the warnings when people use these methods with generic types. Since we check types and serialize them in the input format anyways, we should be actually safe. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2624) RabbitMQ source / sink should participate in checkpointing
Stephan Ewen created FLINK-2624: --- Summary: RabbitMQ source / sink should participate in checkpointing Key: FLINK-2624 URL: https://issues.apache.org/jira/browse/FLINK-2624 Project: Flink Issue Type: Bug Components: Streaming Connectors Affects Versions: 0.10 Reporter: Stephan Ewen The RabbitMQ connector does not offer any fault tolerance guarantees right now, because it does not participate in the checkpointing. We should integrate it in a similar was as the {{FlinkKafkaConsumer}} is integrated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2586) Unstable Storm Compatibility Tests
Stephan Ewen created FLINK-2586: --- Summary: Unstable Storm Compatibility Tests Key: FLINK-2586 URL: https://issues.apache.org/jira/browse/FLINK-2586 Project: Flink Issue Type: Bug Components: Storm Compatibility Affects Versions: 0.10 Reporter: Stephan Ewen Priority: Critical Fix For: 0.10 The Storm Compatibility tests frequently fail. The reason is that they kill the topologies after a certain time interval. That may fail on CI infrastructure when certain steps are delayed beyond usual. Trying to guarantee progress by time is inherently problematic: - Waiting too short makes tests unstable - Waiting too long makes tests slow The right way to go is letting the program decide when to terminate, for example by throwing a special {{SuccessException}}. Have a look at the Kafka connector tests, they do this a lot and hence run exactly as short or as long as they need to. Here is an example of a failed run: https://s3.amazonaws.com/archive.travis-ci.org/jobs/77499577/log.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2554) Add exception reporting handler
Stephan Ewen created FLINK-2554: --- Summary: Add exception reporting handler Key: FLINK-2554 URL: https://issues.apache.org/jira/browse/FLINK-2554 Project: Flink Issue Type: Sub-task Components: Webfrontend Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 The web dashboard should report all encountered exceptions: - The root failure cause - A list of exceptions encountered at individual tasks, together with the names of the tasks and the executing task managers -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2547) Provide Cluster Status Info
Stephan Ewen created FLINK-2547: --- Summary: Provide Cluster Status Info Key: FLINK-2547 URL: https://issues.apache.org/jira/browse/FLINK-2547 Project: Flink Issue Type: Sub-task Components: Webfrontend Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Priority: Minor Fix For: 0.10 Add a request to the web server that returns {code} { taskmanagers : numTaskmanagers, slots-total : numSlotsTotal, slots-available : nonSlotsAvailable, jobs-running : numJobsRunning, jobs-finished : numJobsFinished, jobs-cancelled : numJobsCancelled, jobs-failed : numJobsFailed } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2520) StreamFaultToleranceTestBase does not allow for multiple tests
Stephan Ewen created FLINK-2520: --- Summary: StreamFaultToleranceTestBase does not allow for multiple tests Key: FLINK-2520 URL: https://issues.apache.org/jira/browse/FLINK-2520 Project: Flink Issue Type: Improvement Components: Tests Affects Versions: 0.10 Reporter: Stephan Ewen Priority: Minor It would be good to implement more tests into the same class (or user at least the same driver class), in order to re-use mini clusters and reducer overall test times. The {{StreamFaultToleranceTestBase}} does not support that. I think the pattern where the {{@Test}} methods are in a base class, and the actual tests only implement hook methods is not a good pattern. The batch API tests used this initially and we abandoned it everywhere for a more flexible implementation where the base class only sets up the. cluster/environment and the {{@Test}} methods go int the subclass. The {{MultipleProgramsTestBase}} is a good example of a flexible test base. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2524) Add getTaskNameWithSubtasks() to RuntimeContext
Stephan Ewen created FLINK-2524: --- Summary: Add getTaskNameWithSubtasks() to RuntimeContext Key: FLINK-2524 URL: https://issues.apache.org/jira/browse/FLINK-2524 Project: Flink Issue Type: Improvement Components: Local Runtime Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 When printing information to logs or debug output, one frequently needs to identify the statement with the originating task (task name and which subtask). In many places, the system and user code manually construct something like MyTask (2/7). The {{RuntimeContext}} should offer this, because it is too frequently needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2519) BarrierBuffers get stuck infinitely when some inputs end early
Stephan Ewen created FLINK-2519: --- Summary: BarrierBuffers get stuck infinitely when some inputs end early Key: FLINK-2519 URL: https://issues.apache.org/jira/browse/FLINK-2519 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Priority: Blocker Fix For: 0.10 When some sources exit early, the barrier buffer may start an alignment, but will never complete it, because some inputs never deliver a checkpoint barrier. [FLINK-2515] mitigates most of the problem, by not starting checkpoints any more when some sources already finished. When a checkpoint is started while a source is finishing, the barrier buffer may still align infinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2515) CheckpointCoordinator triggers checkpoints even if not all sources are running any more
Stephan Ewen created FLINK-2515: --- Summary: CheckpointCoordinator triggers checkpoints even if not all sources are running any more Key: FLINK-2515 URL: https://issues.apache.org/jira/browse/FLINK-2515 Project: Flink Issue Type: Bug Components: JobManager Affects Versions: 0.10 Reporter: Stephan Ewen Priority: Blocker Fix For: 0.10 When some sources finish early, they will not emit checkpoint barriers any more. That means that pending checkpoint alignments will never be able to complete, locking the flow. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2508) Confusing sharing of StreamExecutionEnvironment
Stephan Ewen created FLINK-2508: --- Summary: Confusing sharing of StreamExecutionEnvironment Key: FLINK-2508 URL: https://issues.apache.org/jira/browse/FLINK-2508 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 In the {{StreamExecutionEnvironment}}, the environment is once created and then shared with a static variable to all successive calls to {{getExecutionEnvironment()}}. But it can be overridden by calls to {{createLocalEnvironment()}} and {{createRemoteEnvironment()}}. This seems a bit un-intuitive, and probably creates confusion when dispatching multiple streaming jobs from within the same JVM. Why is it even necessary to cache the current execution environment? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2509) Improve error messages when user code classes are not found
Stephan Ewen created FLINK-2509: --- Summary: Improve error messages when user code classes are not found Key: FLINK-2509 URL: https://issues.apache.org/jira/browse/FLINK-2509 Project: Flink Issue Type: Bug Components: Core Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 When a job fails because the user code classes are not found, we should add some information about the class loader and class path into the exception message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2510) KafkaConnector should access partition metadata from master/cluster
Stephan Ewen created FLINK-2510: --- Summary: KafkaConnector should access partition metadata from master/cluster Key: FLINK-2510 URL: https://issues.apache.org/jira/browse/FLINK-2510 Project: Flink Issue Type: Improvement Components: Streaming Connectors Affects Versions: 0.10 Reporter: Stephan Ewen Priority: Minor Currently, the Kafka connector assumes that it can access the partition metadata from the client. There may be setups where this is not possible, and where the access needs to happen from within the cluster (due to firewalls). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2492) Rename remaining runtime classes from match to join
Stephan Ewen created FLINK-2492: --- Summary: Rename remaining runtime classes from match to join Key: FLINK-2492 URL: https://issues.apache.org/jira/browse/FLINK-2492 Project: Flink Issue Type: Improvement Components: Local Runtime Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Priority: Minor Fix For: 0.10 While working with the runtime join classes, I saw that many of them still refer to the join as match. Since all other parts now consistently refer to join, we should adjust the runtime classes as well. Makes it easier for new contributors. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2493) Simplify names of example program JARs
Stephan Ewen created FLINK-2493: --- Summary: Simplify names of example program JARs Key: FLINK-2493 URL: https://issues.apache.org/jira/browse/FLINK-2493 Project: Flink Issue Type: Improvement Components: Examples Affects Versions: 0.10 Reporter: Stephan Ewen Priority: Minor I find the names of the example JARs a bit annoying. Why not name the file {{examples/ConnectedComponents.jar}} rather than {{examples/flink-java-examples-0.10-SNAPSHOT-ConnectedComponents.jar}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2474) Occasional failures in PartitionedStateCheckpointingITCase
Stephan Ewen created FLINK-2474: --- Summary: Occasional failures in PartitionedStateCheckpointingITCase Key: FLINK-2474 URL: https://issues.apache.org/jira/browse/FLINK-2474 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen The error message {code} Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 47.301 sec FAILURE! - in org.apache.flink.test.checkpointing.PartitionedStateCheckpointingITCase runCheckpointedProgram(org.apache.flink.test.checkpointing.PartitionedStateCheckpointingITCase) Time elapsed: 42.495 sec FAILURE! java.lang.AssertionError: expected:86678900 but was:3467156 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:144) at org.apache.flink.test.checkpointing.PartitionedStateCheckpointingITCase.runCheckpointedProgram(PartitionedStateCheckpointingITCase.java:117) {code} The detailed CI logs https://s3.amazonaws.com/archive.travis-ci.org/jobs/73928480/log.txt -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2473) Add a timeout to ActorSystem shutdown in the Client
Stephan Ewen created FLINK-2473: --- Summary: Add a timeout to ActorSystem shutdown in the Client Key: FLINK-2473 URL: https://issues.apache.org/jira/browse/FLINK-2473 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 Just ran into what looks like a rare bug in Akka, where the {{awaitTermination()}} method hits a bug and freezes indefinitely. We can work around this by supplying a timeout to the method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2463) Chow execution config in dashboard
Stephan Ewen created FLINK-2463: --- Summary: Chow execution config in dashboard Key: FLINK-2463 URL: https://issues.apache.org/jira/browse/FLINK-2463 Project: Flink Issue Type: Sub-task Components: Webfrontend Affects Versions: 0.10 Reporter: Stephan Ewen Priority: Minor Fix For: 0.10 I wanted to change the configuration display, as in #927, but it wasn't implemented in the new UI. So I decided to create this PR to take care of it. Added handler for new endpoint {{/jobs/:jobid/config}} The endpoint is hit when selecting a job in the UI Example of response: {code} { execution-config: { execution-mode: PIPELINED, job-parallelism: -1, max-execution-retries: -1, object-reuse-mode: false, user-config: { example: true } }, jid: dedc15369efa94eb1e3ad6482f1344b6, name: WordCount Example } {code} The info is stored in the existing $rootScope.job object Added new tab in job display to show the config -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2438) Improve performance of channel events
Stephan Ewen created FLINK-2438: --- Summary: Improve performance of channel events Key: FLINK-2438 URL: https://issues.apache.org/jira/browse/FLINK-2438 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 Until now, channel events were sufficiently rare to not matter in their performance. With frequent checkpointing, their performance becomes a factor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2426) Create a read-only variant of the Configuration
Stephan Ewen created FLINK-2426: --- Summary: Create a read-only variant of the Configuration Key: FLINK-2426 URL: https://issues.apache.org/jira/browse/FLINK-2426 Project: Flink Issue Type: Sub-task Components: Core Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 In order to give the TaskManager configuration to runtime components (or even user code), it should be wrapped in an {{UnmodifiableConfiguration}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2427) Allow the BarrierBuffer to maintain multiple queues of blocked inputs
Stephan Ewen created FLINK-2427: --- Summary: Allow the BarrierBuffer to maintain multiple queues of blocked inputs Key: FLINK-2427 URL: https://issues.apache.org/jira/browse/FLINK-2427 Project: Flink Issue Type: Sub-task Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 In corner cases (dropped barriers due to failures/startup races), this is required for proper operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2425) Give access to TaskManager config and hostname in the Runtime Environment
Stephan Ewen created FLINK-2425: --- Summary: Give access to TaskManager config and hostname in the Runtime Environment Key: FLINK-2425 URL: https://issues.apache.org/jira/browse/FLINK-2425 Project: Flink Issue Type: Improvement Components: TaskManager Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 The RuntimeEnvironment (that is used by the operators to access the context) should give access to the TaskManager's configuration, to allow to read config values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2428) Clean up unused properties in StreamConfig
Stephan Ewen created FLINK-2428: --- Summary: Clean up unused properties in StreamConfig Key: FLINK-2428 URL: https://issues.apache.org/jira/browse/FLINK-2428 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Priority: Minor There is a multitude of unused properties in the {{StreamConfig}}, which should be removed, if no longer relevant. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2429) Remove the enableCheckpointing() without interval variant
Stephan Ewen created FLINK-2429: --- Summary: Remove the enableCheckpointing() without interval variant Key: FLINK-2429 URL: https://issues.apache.org/jira/browse/FLINK-2429 Project: Flink Issue Type: Bug Components: Streaming Reporter: Stephan Ewen -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2418) Add an end-to-end streaming fault tolerance test for the Checkpointed interface
Stephan Ewen created FLINK-2418: --- Summary: Add an end-to-end streaming fault tolerance test for the Checkpointed interface Key: FLINK-2418 URL: https://issues.apache.org/jira/browse/FLINK-2418 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 The current test lacks the following: - Does not validate and exactly-once counts after partitioning step - Does not cover all cases with the {{Checkpointed}} interface, but uses a mix of by-key state and explicitly Checkpointed state - The test uses a non-fault-tolerant deprecated infinite reducer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2421) StreamRecordSerializer incorrectly duplicates and misses tests
Stephan Ewen created FLINK-2421: --- Summary: StreamRecordSerializer incorrectly duplicates and misses tests Key: FLINK-2421 URL: https://issues.apache.org/jira/browse/FLINK-2421 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 The duplication method of the {{StreamRecordSerializer}} does not respect the duplication of the internal type serializer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2407) Add an API switch to select between exactly once and at least once fault tolerance
Stephan Ewen created FLINK-2407: --- Summary: Add an API switch to select between exactly once and at least once fault tolerance Key: FLINK-2407 URL: https://issues.apache.org/jira/browse/FLINK-2407 Project: Flink Issue Type: Sub-task Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 Based on the addition of the BarrierTracker, we can add a switch to choose between the two modes exactly once and at least once. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2406) Abstract BarrierBuffer to an exchangeable BarrierHandler
Stephan Ewen created FLINK-2406: --- Summary: Abstract BarrierBuffer to an exchangeable BarrierHandler Key: FLINK-2406 URL: https://issues.apache.org/jira/browse/FLINK-2406 Project: Flink Issue Type: Sub-task Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 We need to make the Checkpoint handling pluggable, to allow us to use different implementations: - BarrierBuffer for exactly once processing. This inevitably introduces a bit of latency. - BarrierTracker for at least once processing, with no added latency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2404) LongCounters should have an addValue() method for primitive longs
Stephan Ewen created FLINK-2404: --- Summary: LongCounters should have an addValue() method for primitive longs Key: FLINK-2404 URL: https://issues.apache.org/jira/browse/FLINK-2404 Project: Flink Issue Type: Bug Components: Core Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 Since the LongCounter is used heavily for internal statistics reporting, it must have very low overhead. The current addValue() method always boxes and unboxes the values, which is unnecessary overhead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2402) Add a non-blocking BarrierTracker
Stephan Ewen created FLINK-2402: --- Summary: Add a non-blocking BarrierTracker Key: FLINK-2402 URL: https://issues.apache.org/jira/browse/FLINK-2402 Project: Flink Issue Type: Sub-task Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 This issue would add a new tracker for barriers that simply tracks what barriers have been observed from which inputs. It never blocks off buffers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2397) Unify two backend servers to one server
Stephan Ewen created FLINK-2397: --- Summary: Unify two backend servers to one server Key: FLINK-2397 URL: https://issues.apache.org/jira/browse/FLINK-2397 Project: Flink Issue Type: Sub-task Components: Webfrontend Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 Currently, the new dashboard still needs both the old backend server and the new backend server. We need to migrate the requests from {{/jobsInfo}} to the new server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2393) Add a stateless at-least-once mode for streaming
Stephan Ewen created FLINK-2393: --- Summary: Add a stateless at-least-once mode for streaming Key: FLINK-2393 URL: https://issues.apache.org/jira/browse/FLINK-2393 Project: Flink Issue Type: New Feature Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Currently, the checkpointing mechanism provides exactly once guarantees. Part of that is the step that temporarily aligns the data streams. This step increases the tuple latency temporarily. By offering a version that does not provide exactly-once, but only at-least-once, we can avoid the latency increase. For super-low-latency applications, that tolerate duplicates, this may be an interesting option. To realize that, we would use a slightly modified version of the checkpointing algorithm. Effectively, the streams would not be aligned, but tasks would only count the received barriers and emit their own barrier as soon as the saw a barrier from all inputs. My feeling is that it makes not sense to implement state backups, when being concerned with this super low latency. The mode would hence be a purely stateless at-least-once mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2385) Scala DataSet.distinct should have parenthesis
Stephan Ewen created FLINK-2385: --- Summary: Scala DataSet.distinct should have parenthesis Key: FLINK-2385 URL: https://issues.apache.org/jira/browse/FLINK-2385 Project: Flink Issue Type: Bug Components: Scala API Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 The method is not a side-effect free accessor, but defines heavy computation, even if it does not mutate the original data set. This is a somewhat API breaking change. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2389) Add dashboard frontend architecture amd build infrastructue
Stephan Ewen created FLINK-2389: --- Summary: Add dashboard frontend architecture amd build infrastructue Key: FLINK-2389 URL: https://issues.apache.org/jira/browse/FLINK-2389 Project: Flink Issue Type: Sub-task Components: Webfrontend Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 Add the build infrastructure and basic libraries for a modern and modular web dashboard. The dashboard is going to be built using angular.js. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2357) New JobManager Runtime Web Frontend
Stephan Ewen created FLINK-2357: --- Summary: New JobManager Runtime Web Frontend Key: FLINK-2357 URL: https://issues.apache.org/jira/browse/FLINK-2357 Project: Flink Issue Type: New Feature Components: Webfrontend Affects Versions: 0.10 Reporter: Stephan Ewen We need to improve rework the Job Manager Web Frontend. The current web frontend is limited and has a lot of design issues - It does not display and progress while operators are running. This is especially problematic for streaming jobs - It has no graph representation of the data flows - it does not allow to look into execution attempts - it has no hook to deal with the upcoming live accumulators - The architecture is not very modular/extensible I propose to add a new JobManager web frontend: - Based on Netty HTTP (very lightweight) - Using rest-style URLs for jobs and vertices - integrating the D3 graph renderer of the previews with the runtime monitor - with details on execution attempts - first class visualization of records processed and bytes processed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2358) Add Netty-HTTP based server and server handlers
Stephan Ewen created FLINK-2358: --- Summary: Add Netty-HTTP based server and server handlers Key: FLINK-2358 URL: https://issues.apache.org/jira/browse/FLINK-2358 Project: Flink Issue Type: Sub-task Components: Webfrontend Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2351) Deprecate config builders in InputFormats and Output formats
Stephan Ewen created FLINK-2351: --- Summary: Deprecate config builders in InputFormats and Output formats Key: FLINK-2351 URL: https://issues.apache.org/jira/browse/FLINK-2351 Project: Flink Issue Type: Bug Components: Core Affects Versions: 0.10 Reporter: Stephan Ewen Fix For: 0.10 Old APIs used to pass functions as classes and parameters as configs. To support that, all input- and output formats had config builders. As the record API is deprecated and about to be removed, these builders are no longer needed and should be deprecated and removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2350) Deprecate the stream open timeouts in FileInputFormat and FileOutputFormat
Stephan Ewen created FLINK-2350: --- Summary: Deprecate the stream open timeouts in FileInputFormat and FileOutputFormat Key: FLINK-2350 URL: https://issues.apache.org/jira/browse/FLINK-2350 Project: Flink Issue Type: Bug Components: Core Affects Versions: 0.10 Reporter: Stephan Ewen The stream open requests have the ability to fail after a certain timeout. By default, this timeout is deactivated, because loaded HDFS has in the past frequently exceeded this timeout. I think no one ever used this, and I am not sure it is useful, actually. We can remove it to reduce code complexity, in my opinion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2343) Change default garbage collector in streaming environments
Stephan Ewen created FLINK-2343: --- Summary: Change default garbage collector in streaming environments Key: FLINK-2343 URL: https://issues.apache.org/jira/browse/FLINK-2343 Project: Flink Issue Type: Improvement Components: Start-Stop Scripts Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 When starting Flink, we don't pass any particular GC related JVM flags to the system. That means, it uses the default garbage collectors, which are the bulk parallel GCs for both old gen and new gen. For streaming applications, this results in vastly fluctuating latencies. Latencies are much more constant with either the {{CMS}} or {{G1}} GC. I propose to make the CMS the default GC for streaming setups. G1 may become the GC of choice in the future, but fro various articles I found, it is still somewhat in beta status (see for example here: http://jaxenter.com/kirk-pepperdine-on-the-g1-for-java-9-118190.html ) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2339) Prevent asynchronous checkpoint calls from overtaking each other
Stephan Ewen created FLINK-2339: --- Summary: Prevent asynchronous checkpoint calls from overtaking each other Key: FLINK-2339 URL: https://issues.apache.org/jira/browse/FLINK-2339 Project: Flink Issue Type: Improvement Components: TaskManager Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 Currently, when checkpoint state materialization takes very long, and the checkpoint interval is low, the asynchronous calls to trigger checkpoints (on the sources) could overtake prior calls. We can fix that by making sure that all calls are dispatched in order by the same thread, rather than spawning a new thread for each call. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2333) Stream Data Sink that periodically rolls files
Stephan Ewen created FLINK-2333: --- Summary: Stream Data Sink that periodically rolls files Key: FLINK-2333 URL: https://issues.apache.org/jira/browse/FLINK-2333 Project: Flink Issue Type: New Feature Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen It would be useful to have a file data sink for streams that starts a new file every n elements or records. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2327) Log the limit of open file handles at startup
Stephan Ewen created FLINK-2327: --- Summary: Log the limit of open file handles at startup Key: FLINK-2327 URL: https://issues.apache.org/jira/browse/FLINK-2327 Project: Flink Issue Type: Improvement Components: Local Runtime Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Priority: Minor Fix For: 0.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2331) Whitelist some exceptions to be valid in YARN logs
Stephan Ewen created FLINK-2331: --- Summary: Whitelist some exceptions to be valid in YARN logs Key: FLINK-2331 URL: https://issues.apache.org/jira/browse/FLINK-2331 Project: Flink Issue Type: Bug Components: YARN Client Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 The YARN logs may occasionally contain this exception that occurs on shutdown: {code} 14:55:18,819 ERROR akka.remote.RemoteActorRef - swallowing exception during message send akka.remote.RemoteTransportExceptionNoStackTrace: Attempted to send remote message but Remoting is not running. {code} Whitelisting the {{akka.remote.RemoteTransportExceptionNoStackTrace}} should make the tests more reliable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2330) Make FromElementsFunction checkpointable
Stephan Ewen created FLINK-2330: --- Summary: Make FromElementsFunction checkpointable Key: FLINK-2330 URL: https://issues.apache.org/jira/browse/FLINK-2330 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.10 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2314) Make Streaming File Sources Persistent
Stephan Ewen created FLINK-2314: --- Summary: Make Streaming File Sources Persistent Key: FLINK-2314 URL: https://issues.apache.org/jira/browse/FLINK-2314 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Streaming File sources should participate in the checkpointing. They should track the bytes they read from the file and checkpoint it. One can look at the sequence generating source function for an example of a checkpointed source. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2315) Hadoop Writables cannot exploit implementing NormalizableKey
Stephan Ewen created FLINK-2315: --- Summary: Hadoop Writables cannot exploit implementing NormalizableKey Key: FLINK-2315 URL: https://issues.apache.org/jira/browse/FLINK-2315 Project: Flink Issue Type: Bug Components: Java API Affects Versions: 0.9 Reporter: Stephan Ewen When one implements a type that extends {{hadoop.io.Writable}} and it implements {{NormalizableKey}}, this is still not exploited. The Writable comparator fails to recognize that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2284) Confusing/inconsistent PartitioningStrategy
Stephan Ewen created FLINK-2284: --- Summary: Confusing/inconsistent PartitioningStrategy Key: FLINK-2284 URL: https://issues.apache.org/jira/browse/FLINK-2284 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen The PartitioningStrategy in {{org.apache.flink.streaming.runtime.partitioner.StreamPartitioner.java}} is non standard and not easily understandable. What form of partitioning is `SHUFFLE`? Shuffle just means redistribute, it says nothing about what it does. Same with `DISTRIBUTE`. Also `GLOBAL` is not a well-defined/established term. Why is `GROUPBY` a partition type? Doesn't grouping simply hash partition (like I assume SHUFFLE means), so why does it have an extra entry? Sticking with principled and established names/concepts is important to allow people to collaborate on the code. Why not stick with the partitioning types defined in the batch API? They are well defined and named: ``` NONE, FORWARD, RANDOM, HASH, RANGE, FORCED_REBALANCE, BROADCAST, CUSTOM ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2270) Error in KafkaSource Documentation
Stephan Ewen created FLINK-2270: --- Summary: Error in KafkaSource Documentation Key: FLINK-2270 URL: https://issues.apache.org/jira/browse/FLINK-2270 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 0.9 Reporter: Stephan Ewen The Kafka Source documentation still refers to {{enableMonitoring()}} as the method to enable checkpointing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2231) Create a Serializer for Scala Enumerations
Stephan Ewen created FLINK-2231: --- Summary: Create a Serializer for Scala Enumerations Key: FLINK-2231 URL: https://issues.apache.org/jira/browse/FLINK-2231 Project: Flink Issue Type: Improvement Components: Scala API Reporter: Stephan Ewen Scala Enumerations are currently serialized with Kryo, but should be efficiently serialized by just writing the {{initial}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2177) NillPointer in task resource release
Stephan Ewen created FLINK-2177: --- Summary: NillPointer in task resource release Key: FLINK-2177 URL: https://issues.apache.org/jira/browse/FLINK-2177 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Ufuk Celebi Priority: Blocker Fix For: 0.9 {code} == == FATAL === == A fatal error occurred, forcing the TaskManager to shut down: FATAL - exception in task resource cleanup java.lang.NullPointerException at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.cancelRequestFor(PartitionRequestClientHandler.java:89) at org.apache.flink.runtime.io.network.netty.PartitionRequestClient.close(PartitionRequestClient.java:182) at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.releaseAllResources(RemoteInputChannel.java:199) at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.releaseAllResources(SingleInputGate.java:332) at org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:368) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:650) at java.lang.Thread.run(Thread.java:701) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2130) RabbitMQ source does not fail when failing to retrieve elements
Stephan Ewen created FLINK-2130: --- Summary: RabbitMQ source does not fail when failing to retrieve elements Key: FLINK-2130 URL: https://issues.apache.org/jira/browse/FLINK-2130 Project: Flink Issue Type: Bug Reporter: Stephan Ewen The RMQ source only logs when elements cannot be retrieved. Failures are not propagated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2098) Checkpoint barrier initiation at source is not aligned with snapshotting
Stephan Ewen created FLINK-2098: --- Summary: Checkpoint barrier initiation at source is not aligned with snapshotting Key: FLINK-2098 URL: https://issues.apache.org/jira/browse/FLINK-2098 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Priority: Blocker Fix For: 0.9 The stream source does not properly align the emission of checkpoint barriers with the drawing of snapshots. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2070) Confusing methods print() that print on client vs on TaskManager
Stephan Ewen created FLINK-2070: --- Summary: Confusing methods print() that print on client vs on TaskManager Key: FLINK-2070 URL: https://issues.apache.org/jira/browse/FLINK-2070 Project: Flink Issue Type: Bug Components: Core Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 With the {{print()}} method printing on the client, the {{print(sourceIentified)}} method becomes confusing, as it has the same name, but prints on the taskManager (into its out files) and executes lazily. We should clarify the confusion by picking a more descriptive name, like {{printOnTaskManager()}}. I am not sure how common the use case to print into the TaskManager {{out}} files is. We could remove that method and point to the {{PrintingOutputFormat}} for that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2071) Projection function can fail non serializable
Stephan Ewen created FLINK-2071: --- Summary: Projection function can fail non serializable Key: FLINK-2071 URL: https://issues.apache.org/jira/browse/FLINK-2071 Project: Flink Issue Type: Bug Components: Java API Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The projection function creates a full reusable tuple including fields. If the fields are not serializable, the function serialization fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2076) Bug in re-openable hash join
Stephan Ewen created FLINK-2076: --- Summary: Bug in re-openable hash join Key: FLINK-2076 URL: https://issues.apache.org/jira/browse/FLINK-2076 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9 Reporter: Stephan Ewen It happens deterministically in my machine with the following setup: TaskManager: - heap size: 512m - network buffers: 4096 - slots: 32 Job: - ConnectedComponents - 100k vertices - 1.2m edges -- this gives around 260 m Flink managed memory, across 32 slots is 8MB per slot, with several mem consumers in the job, makes the iterative hash join out-of-core {code} java.lang.RuntimeException: Hash Join bug in memory management: Memory buffers leaked. at org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:733) at org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508) at org.apache.flink.runtime.operators.hash.ReOpenableMutableHashTable.prepareNextPartition(ReOpenableMutableHashTable.java:167) at org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:541) at org.apache.flink.runtime.operators.hash.NonReusingBuildSecondHashMatchIterator.callWithNextKey(NonReusingBuildSecondHashMatchIterator.java:102) at org.apache.flink.runtime.operators.AbstractCachedBuildSideMatchDriver.run(AbstractCachedBuildSideMatchDriver.java:155) at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496) at org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:139) at org.apache.flink.runtime.iterative.task.IterationIntermediatePactTask.run(IterationIntermediatePactTask.java:92) at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:560) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2058) Hadoop Input Splits do not use proper UserCodeClassloader
Stephan Ewen created FLINK-2058: --- Summary: Hadoop Input Splits do not use proper UserCodeClassloader Key: FLINK-2058 URL: https://issues.apache.org/jira/browse/FLINK-2058 Project: Flink Issue Type: Bug Components: Hadoop Compatibility Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The Hadoop input splits manually resolve classes without using a proper classloaders. I propose so serialize the classes themselves (rather then the names) as part of the input format, so that the general class loading functionality for user defined types handles this automatically. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2057) Remove IOReadableWritable interface from input splits
Stephan Ewen created FLINK-2057: --- Summary: Remove IOReadableWritable interface from input splits Key: FLINK-2057 URL: https://issues.apache.org/jira/browse/FLINK-2057 Project: Flink Issue Type: Improvement Components: Core Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The {{IOReadableWritable}} interface was used in the previous RPC subsystem. Since we use Java Serialization for RPC / Actor Messages now, we can remove this, which makes the implementation of new Split Types much easier. Removing legacy functionality... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2063) Streaming checkpoints consider only input and output vertices
Stephan Ewen created FLINK-2063: --- Summary: Streaming checkpoints consider only input and output vertices Key: FLINK-2063 URL: https://issues.apache.org/jira/browse/FLINK-2063 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 Currently, the streaming checkpoint coordinator is configured only with the list of input and output vertices as stateful vertices. Checkpoint confirmations from intermediate vertices are ignored. Since the runtime assumes that all vertices participate in checkpointing anyways (all of them send acknowledge messages), the coordinator should be configured accordingly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2046) Quickstarts are missing on the new Website
Stephan Ewen created FLINK-2046: --- Summary: Quickstarts are missing on the new Website Key: FLINK-2046 URL: https://issues.apache.org/jira/browse/FLINK-2046 Project: Flink Issue Type: Bug Components: Project Website Reporter: Stephan Ewen -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2041) Optimizer plans with memory for pipeline breakers, even though they are not used any more
Stephan Ewen created FLINK-2041: --- Summary: Optimizer plans with memory for pipeline breakers, even though they are not used any more Key: FLINK-2041 URL: https://issues.apache.org/jira/browse/FLINK-2041 Project: Flink Issue Type: Bug Components: Optimizer Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2010) Add test that verifies that chained tasks are properly checkpointed
Stephan Ewen created FLINK-2010: --- Summary: Add test that verifies that chained tasks are properly checkpointed Key: FLINK-2010 URL: https://issues.apache.org/jira/browse/FLINK-2010 Project: Flink Issue Type: Bug Components: Streaming Reporter: Stephan Ewen Because this is critical logic, we should have a unit test for the {{StreamingTask}} validating that the state from all chained tasks is taken into account. It needs to check - The trigger checkpoint method, and that the state handle handle in the checkpoint acknowledgement contains all the necessary state - The restore state method, that state is reset properly -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2011) Improve error message when user-defined serialization logic is wrong
Stephan Ewen created FLINK-2011: --- Summary: Improve error message when user-defined serialization logic is wrong Key: FLINK-2011 URL: https://issues.apache.org/jira/browse/FLINK-2011 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2004) Memory leack in presence of failed checkpoints in KafkaSource
Stephan Ewen created FLINK-2004: --- Summary: Memory leack in presence of failed checkpoints in KafkaSource Key: FLINK-2004 URL: https://issues.apache.org/jira/browse/FLINK-2004 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Robert Metzger Priority: Critical Fix For: 0.9 Checkpoints that fail never send a commit message to the tasks. Maintaining a map of all pending checkpoints introduces a memory leak, as entries for failed checkpoints will never be removed. Approaches to fix this: - The source cleans up entries from older checkpoints once a checkpoint is committed (simple implementation in a linked hash map) - The commit message could include the optional state handle (source needs not maintain the map) - The checkpoint coordinator could send messages for failed checkpoints? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1969) Remove old profile code
Stephan Ewen created FLINK-1969: --- Summary: Remove old profile code Key: FLINK-1969 URL: https://issues.apache.org/jira/browse/FLINK-1969 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The old profiler code is not instantiated any more and is basically dead. It has in parts been replaced by the metrics library already. The classes still get in the way during refactoring, which is why I suggest to remove them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1972) Remove instanceID in streaming tasks
Stephan Ewen created FLINK-1972: --- Summary: Remove instanceID in streaming tasks Key: FLINK-1972 URL: https://issues.apache.org/jira/browse/FLINK-1972 Project: Flink Issue Type: Bug Components: Streaming Reporter: Stephan Ewen The streaming tasks have a weird instanceID that is a JVM local counter. Tasks have already proper names, vertex ids, and IDs scoped to the attempt of execution. We should remove this counter ID. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1953) Rework Checkpoint Coordinator
Stephan Ewen created FLINK-1953: --- Summary: Rework Checkpoint Coordinator Key: FLINK-1953 URL: https://issues.apache.org/jira/browse/FLINK-1953 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The checkpoint coordinator currently contains no tests and is vulnerable to a variety of situations. In particular, I propose to add: - Better configurability which tasks receive the trigger checkpoint messages, which tasks need to acknowledge the checkpoint, and which tasks need to receive confirmation messages. - checkpoint timeouts, such that incomplete checkpoints are guaranteed to be cleaned up after a while, regardless of successful checkpoints - better sanity checking of messages and fields, to properly handle/ignore messages for old/expired checkpoints, or invalidly routed messages - Better handling of checkpoint attempts at points where the execution has just failed is is currently being canceled. - Add a good set of tests -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1879) Simplify JobClient
Stephan Ewen created FLINK-1879: --- Summary: Simplify JobClient Key: FLINK-1879 URL: https://issues.apache.org/jira/browse/FLINK-1879 Project: Flink Issue Type: Sub-task Components: Distributed Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The JobClient currently uses two actors, when one is sufficient. The calls in the mini clusters also require tests to handle with ActorRefs, which should not be necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1845) NonReusingSortMergeCoGroupIterator uses ReusingKeyGroupedIterator
Stephan Ewen created FLINK-1845: --- Summary: NonReusingSortMergeCoGroupIterator uses ReusingKeyGroupedIterator Key: FLINK-1845 URL: https://issues.apache.org/jira/browse/FLINK-1845 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Alexander Alexandrov Fix For: 0.9 This is an unexpected mutability bug. [~aalexandrov] already has created a pull request that fixes this... -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1839) Failures in TwitterStreamITCase
Stephan Ewen created FLINK-1839: --- Summary: Failures in TwitterStreamITCase Key: FLINK-1839 URL: https://issues.apache.org/jira/browse/FLINK-1839 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen The test seems unstable and fails occasionally with the error below {code} Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 4.508 sec FAILURE! - in org.apache.flink.streaming.examples.test.twitter.TwitterStreamITCase testJobWithoutObjectReuse(org.apache.flink.streaming.examples.test.twitter.TwitterStreamITCase) Time elapsed: 3.739 sec FAILURE! java.lang.AssertionError: Different number of lines in expected and obtained result. expected:146 but was:144 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.flink.test.util.TestBaseUtils.compareResultsByLinesInMemory(TestBaseUtils.java:256) at org.apache.flink.test.util.TestBaseUtils.compareResultsByLinesInMemory(TestBaseUtils.java:242) at org.apache.flink.streaming.examples.test.twitter.TwitterStreamITCase.postSubmit(TwitterStreamITCase.java:34) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1801) NetworkEnvironment should start without JobManager association
Stephan Ewen created FLINK-1801: --- Summary: NetworkEnvironment should start without JobManager association Key: FLINK-1801 URL: https://issues.apache.org/jira/browse/FLINK-1801 Project: Flink Issue Type: Sub-task Components: TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The NetworkEnvironment should be able to start without a dedicated JobManager association and get one / loose one as the TaskManager connects to different JobManagers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1802) BlobManager directories should be checked before TaskManager startup
Stephan Ewen created FLINK-1802: --- Summary: BlobManager directories should be checked before TaskManager startup Key: FLINK-1802 URL: https://issues.apache.org/jira/browse/FLINK-1802 Project: Flink Issue Type: Sub-task Components: TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 That allows the call to start the taskmanager to fail early and synchronous, improving debugability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1760) Add support for building Flink with Scala 2.11
Stephan Ewen created FLINK-1760: --- Summary: Add support for building Flink with Scala 2.11 Key: FLINK-1760 URL: https://issues.apache.org/jira/browse/FLINK-1760 Project: Flink Issue Type: Improvement Components: Build System Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Alexander Alexandrov Fix For: 0.9 Pull request https://github.com/apache/flink/pull/477 is implementing this feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1762) Make Statefulness of a Streaming Function explicit
Stephan Ewen created FLINK-1762: --- Summary: Make Statefulness of a Streaming Function explicit Key: FLINK-1762 URL: https://issues.apache.org/jira/browse/FLINK-1762 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Currently, the state of streaming functions stored in the {{StreamingRuntimeContext}}. That is rather inexplicit, a function may or may not make use of the state. This also hides from the system whether a function is stateful or not. How about we make this explicit by letting stateful functions extend a special interface (see below). That would allow the stream graph to already know which functions are stateful. Certain vertices would not participate in the checkpointing, if they only contain stateless vertices. We can set up the ExecutionGraph to expect confirmations only from the participating vertices, saving messages. {code} public interface Statehandle { get, put, ... } public interface Stateful { void setStateHandle(Statehandle handle); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1765) Reducer grouping is skippted when parallelism is one
Stephan Ewen created FLINK-1765: --- Summary: Reducer grouping is skippted when parallelism is one Key: FLINK-1765 URL: https://issues.apache.org/jira/browse/FLINK-1765 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 This program (not the parallelism) incorrectly runs a non grouped reduce and fails with a NullPointerException. {code} StreamExecutionEnvironment env = ... env.setDegreeOfParallelism(1); DataStreamString stream = env.addSource(...); stream .filter(...) .map(...) .groupBy(someField) .reduce(new ReduceFunction() {...} ) .addSink(...); env.execute(); {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1763) Remove cancel from streaming SinkFunction
Stephan Ewen created FLINK-1763: --- Summary: Remove cancel from streaming SinkFunction Key: FLINK-1763 URL: https://issues.apache.org/jira/browse/FLINK-1763 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 Since the streaming sink function is called individually for each record, it does not require a {{cancel()}} function. The system can cancel between calls to that function (which it cannot do for the source function). Removing this method removes the need to always implement the unnecessary, and usually empty, method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1764) Rework record copying logic in streaming API
Stephan Ewen created FLINK-1764: --- Summary: Rework record copying logic in streaming API Key: FLINK-1764 URL: https://issues.apache.org/jira/browse/FLINK-1764 Project: Flink Issue Type: Improvement Components: Streaming Affects Versions: 0.9 Reporter: Stephan Ewen The logic for chained tasks in the streaming API does a lot of copying of records. In some cases, a record is copied multiple times before being passed to a function. This seems unnecessary, in the general case. In any case, multiple copies seem incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1721) Flakey Yarn Tests
Stephan Ewen created FLINK-1721: --- Summary: Flakey Yarn Tests Key: FLINK-1721 URL: https://issues.apache.org/jira/browse/FLINK-1721 Project: Flink Issue Type: Bug Components: YARN Client Affects Versions: 0.9 Reporter: Stephan Ewen Failed tests: YARNSessionFIFOITCase.testTaskManagerFailure:276-YarnTestBase.ensureNoProhibitedStringInLogFiles:286 Found a file /home/travis/build/StephanEwen/incubator-flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-0_0/application_1426629317989_0004/container_1426629317989_0004_01_03/taskmanager.log with a prohibited string: [Exception, Started SelectChannelConnector@0.0.0.0:8081] YARNSessionFIFOITCase.testNonexistingQueue:305-YarnTestBase.ensureNoProhibitedStringInLogFiles:286 Found a file /home/travis/build/StephanEwen/incubator-flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-0_0/application_1426629317989_0004/container_1426629317989_0004_01_03/taskmanager.log with a prohibited string: [Exception, Started SelectChannelConnector@0.0.0.0:8081] YARNSessionFIFOITCase.testJavaAPI:460-YarnTestBase.ensureNoProhibitedStringInLogFiles:286 Found a file /home/travis/build/StephanEwen/incubator-flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-0_0/application_1426629317989_0004/container_1426629317989_0004_01_03/taskmanager.log with a prohibited string: [Exception, Started SelectChannelConnector@0.0.0.0:8081] Log details attached -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1730) Add a FlinkTools.persist style method to the Data Set.
Stephan Ewen created FLINK-1730: --- Summary: Add a FlinkTools.persist style method to the Data Set. Key: FLINK-1730 URL: https://issues.apache.org/jira/browse/FLINK-1730 Project: Flink Issue Type: New Feature Reporter: Stephan Ewen Priority: Minor I think this is an operation that will be needed more prominently. Defining a point where one long logical program is broken into different executions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1712) Restructure Maven Projects
Stephan Ewen created FLINK-1712: --- Summary: Restructure Maven Projects Key: FLINK-1712 URL: https://issues.apache.org/jira/browse/FLINK-1712 Project: Flink Issue Type: Improvement Affects Versions: 0.9 Reporter: Stephan Ewen Priority: Minor Project consolidation - flink-hadoop (shaded fat jar) - Core (Core and Java) - (core-scala) - Streaming (core + java) - (streaming-scala) - Runtime - Optimizer (may also be merged with Client) - Client (or Client + Optimizer) - Examples (Java + Scala + Streaming Java + Streaming Scala) - Tests (test-utils (compile) and tests (test)) - Quickstarts - Quickstart Java - Quickstart Scala - connectors / Input/Output Formats - Avro - HBase - HadoopCompartibility - HCatalogue - JDBC - kafka - rabbit - ... - staging - Gelly - ML - spargel (deprecated) - expression API - contrib - yarn - dist - yarn tests - java 8 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1711) Replace all usages off commons.Validate with guava.check
Stephan Ewen created FLINK-1711: --- Summary: Replace all usages off commons.Validate with guava.check Key: FLINK-1711 URL: https://issues.apache.org/jira/browse/FLINK-1711 Project: Flink Issue Type: Improvement Affects Versions: 0.9 Reporter: Stephan Ewen Priority: Minor Fix For: 0.9 Per discussion on the mailing list, we decided to increase homogeneity. One part is to consistently use the Guava methods {{checkNotNull}} and {{checkArgument}}, rather than Apache Commons Lang3 {{Validate}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1705) InstanceConnectionInfo returns wrong hostname when no DNS entry exists
Stephan Ewen created FLINK-1705: --- Summary: InstanceConnectionInfo returns wrong hostname when no DNS entry exists Key: FLINK-1705 URL: https://issues.apache.org/jira/browse/FLINK-1705 Project: Flink Issue Type: Bug Components: TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 If there is no DNS entry for an address (like 10.4.122.43), then the {{InstanceConnectionInfo}} returns the first octet ({{10}}) as the hostame. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1691) Inprove CountCollectITCase
Stephan Ewen created FLINK-1691: --- Summary: Inprove CountCollectITCase Key: FLINK-1691 URL: https://issues.apache.org/jira/browse/FLINK-1691 Project: Flink Issue Type: Bug Components: test Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Maximilian Michels Fix For: 0.9 The CountCollectITCase logs heavily and does not reuse the same cluster across multiple tests. Both can be addressed by letting it extend the MultipleProgramsTestBase -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1671) Add execution modes for programs
Stephan Ewen created FLINK-1671: --- Summary: Add execution modes for programs Key: FLINK-1671 URL: https://issues.apache.org/jira/browse/FLINK-1671 Project: Flink Issue Type: Bug Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 Currently, there is a single way that programs get executed: Pipelined. With the new code for batch shuffles (https://github.com/apache/flink/pull/471), we have much more flexibility and I would like to expose that. I suggest to add more execution modes that can be chosen on the `ExecutionEnvironment`: - {{BATCH}} A mode where every shuffle is executed in a batch way, meaning preceding operators must be done before successors start. Only for the batch programs (d'oh). - {{PIPELINED}} This is the mode corresponding to the current execution mode. It pipelines where possible and batches, where deadlocks would otherwise happen. Initially, I would make this the default (be close to the current behavior). Only available for batch programs. - {{PIPELINED_WITH_BATCH_FALLBACK}} This would start out with pipelining shuffles and fall back to batch shuffles upon failure and recovery, or once it sees that not enough slots are available to bring up all operators at once (requirement for pipelining). - {{STREAMING}} This is the default and only way for streaming programs. All communication is pipelined, and the special streaming checkpointing code is activated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1675) Rework Accumulators
Stephan Ewen created FLINK-1675: --- Summary: Rework Accumulators Key: FLINK-1675 URL: https://issues.apache.org/jira/browse/FLINK-1675 Project: Flink Issue Type: Bug Components: JobManager, TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 The accumulators need an overhaul to address various issues: 1. User defined Accumulator classes crash the client, because it is not using the user code classloader to decode the received message. 2. They should be attached to the ExecutionGraph, not the dedicated AccumulatorManager. That makes them accessible also for archived execution graphs. 3. Accumulators should be sent periodically, as part of the heart beat that sends metrics. This allows them to be updated in real time 4. Accumulators should be stored fine grained (per executionvertex, or per execution) and the final value should be on computed by merging all involved ones. This allows users to access the per-subtask accumulators, which is often interesting. 5. Accumulators should subsume the aggregators by allowing to be versioned with a superstep. The versioned ones should be redistributed to the cluster after each superstep. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1667) Add tests for recovery with distributed process failure
Stephan Ewen created FLINK-1667: --- Summary: Add tests for recovery with distributed process failure Key: FLINK-1667 URL: https://issues.apache.org/jira/browse/FLINK-1667 Project: Flink Issue Type: Improvement Components: test Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The system should have a test that actually spawns multiple TaskManager processes (JVMs) and tests how recovery works when one of them is killed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1668) Add a config option to specify delays between restarts
Stephan Ewen created FLINK-1668: --- Summary: Add a config option to specify delays between restarts Key: FLINK-1668 URL: https://issues.apache.org/jira/browse/FLINK-1668 Project: Flink Issue Type: Improvement Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The system currently introduces a short delay between a failed task execution and the restarted execution. The reason is that this delay seemed to help in letting problems surface that let to the failed task. As an example, if a TaskManager fails, tasks fail due to data transfer errors. The TaskManager is not immediately recognized as failed, though (takes a bit until heartbeats time out). Immediately re-deploying tasks has a very high chance of assigning work to the TaskManager that is actually not responding, causing the execution retry to fail again. The delay gives the system time to figure out that the TaskManager was lost and does not take it into account upon the retry. Currently, the system uses the heartbeat timeout as the default delay value. This may make sense as a default value for critical task failures, but is actually quite high for other types of failures. In any case, I would like to add an option for users to specify the delay (even set it to 0, if desired). The delay is not the best solution, in my opinion, we should eventually move to something better. Ideas are to put TaskManagers responsible for failed tasks in a probationary mode until they have reported back that everything is good (still alive, disk space available, etc) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1648) Add a mode where the system automatically sets the parallelism to the available task slots
Stephan Ewen created FLINK-1648: --- Summary: Add a mode where the system automatically sets the parallelism to the available task slots Key: FLINK-1648 URL: https://issues.apache.org/jira/browse/FLINK-1648 Project: Flink Issue Type: New Feature Components: JobManager Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 This is basically a port of this code form the 0.8 release: https://github.com/apache/flink/pull/410 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1649) Give a good error message when a user program emits a null record
Stephan Ewen created FLINK-1649: --- Summary: Give a good error message when a user program emits a null record Key: FLINK-1649 URL: https://issues.apache.org/jira/browse/FLINK-1649 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1650) Suppress Akka's Netty Shutdown Errors through the log config
Stephan Ewen created FLINK-1650: --- Summary: Suppress Akka's Netty Shutdown Errors through the log config Key: FLINK-1650 URL: https://issues.apache.org/jira/browse/FLINK-1650 Project: Flink Issue Type: Bug Components: other Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 I suggest to set the logging for `org.jboss.netty.channel.DefaultChannelPipeline` to error, in order to get rid of the misleading stack trace caused by an akka/netty hickup on shutdown. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1626) Spurious failure in MatchTask cancelling test
Stephan Ewen created FLINK-1626: --- Summary: Spurious failure in MatchTask cancelling test Key: FLINK-1626 URL: https://issues.apache.org/jira/browse/FLINK-1626 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 This is a problem in the test. Cancelling a task may actually throw an exception (especially interrupted exceptions). The test can be modified to either only call cancel and not call interrupt, or to tolerate exceptions that are followup exceptions of interrupted exceptions. I would prepare a patch for the first approach. The stack trace of the symptom is blow: {code} java.lang.RuntimeException: Hashtable closing was interrupted at org.apache.flink.runtime.operators.hash.MutableHashTable.close(MutableHashTable.java:652) at org.apache.flink.runtime.operators.hash.ReusingBuildFirstHashMatchIterator.close(ReusingBuildFirstHashMatchIterator.java:100) at org.apache.flink.runtime.operators.MatchDriver.cleanup(MatchDriver.java:179) at org.apache.flink.runtime.operators.testutils.DriverTestBase.testDriverInternal(DriverTestBase.java:245) at org.apache.flink.runtime.operators.testutils.DriverTestBase.testDriver(DriverTestBase.java:175) at org.apache.flink.runtime.operators.MatchTaskTest$4.run(MatchTaskTest.java:783) Tests run: 47, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 15.589 sec FAILURE! - in org.apache.flink.runtime.operators.MatchTaskTest testCancelHashMatchTaskWhileBuildFirst[1](org.apache.flink.runtime.operators.MatchTaskTest) Time elapsed: 1.029 sec FAILURE! java.lang.AssertionError: Test threw an exception even though it was properly canceled. at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.flink.runtime.operators.MatchTaskTest.testCancelHashMatchTaskWhileBuildFirst(MatchTaskTest.java:802) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1631) Port collisions in ProcessReaping tests
Stephan Ewen created FLINK-1631: --- Summary: Port collisions in ProcessReaping tests Key: FLINK-1631 URL: https://issues.apache.org/jira/browse/FLINK-1631 Project: Flink Issue Type: Bug Components: Distributed Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 The process reaping tests for the JobManager spawn a process that starts a webserver on the default port. It may happen that this port is not available, due to another concurrently running task. I suggest to add an option to not start the webserver to prevent this, by setting the webserver port to {{-1}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1608) TaskManagers may pick wrong network interface when starting before JobManager
Stephan Ewen created FLINK-1608: --- Summary: TaskManagers may pick wrong network interface when starting before JobManager Key: FLINK-1608 URL: https://issues.apache.org/jira/browse/FLINK-1608 Project: Flink Issue Type: Bug Components: TaskManager Affects Versions: 0.9 Reporter: Stephan Ewen Fix For: 0.9 The taskmanagers use a NetUtils routine to find an interface that lets them talk to the Jobmanager. However, if the JobManager is not online yet, they fall back to localhost. In cases where the TaskManagers start faster than the JobManager, they pick the wrong hostname and interface. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-1598) Give better error messages when serializers run out of space.
Stephan Ewen created FLINK-1598: --- Summary: Give better error messages when serializers run out of space. Key: FLINK-1598 URL: https://issues.apache.org/jira/browse/FLINK-1598 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9 Reporter: Stephan Ewen Assignee: Stephan Ewen Fix For: 0.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)