[jira] [Created] (FLINK-2677) Add a general-purpose keyed-window operator

2015-09-15 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2677:
---

 Summary: Add a general-purpose keyed-window operator
 Key: FLINK-2677
 URL: https://issues.apache.org/jira/browse/FLINK-2677
 Project: Flink
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2683) Add utilities for heap-backed keyed state in time panes

2015-09-15 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2683:
---

 Summary: Add utilities for heap-backed keyed state in time panes
 Key: FLINK-2683
 URL: https://issues.apache.org/jira/browse/FLINK-2683
 Project: Flink
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


Time panes are commonly used for aligned time windows (tumbling and sliding). 
These utilities would represent the panes with per-pane keyed state.

The implementation must support:
  - Evicting panes efficiently
  - iterating over the union of all panes' state per key for all keys 
(computing the window function for all keys)
  - efficient append the state for a key in a random pane



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2646) Rich functions should provide a method "closeAfterFailure()"

2015-09-09 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2646:
---

 Summary: Rich functions should provide a method 
"closeAfterFailure()"
 Key: FLINK-2646
 URL: https://issues.apache.org/jira/browse/FLINK-2646
 Project: Flink
  Issue Type: Improvement
  Components: Core
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10


Right now, the {{close()}} method of rich functions is invoked in case of 
proper completion, and in case of canceling in case of error (to allow for 
cleanup).

In certain cases, the user function needs to know why it is closed, whether the 
task completed in a regular fashion, or was canceled/failed.

I suggest to add a method {{closeAfterFailure()}} to the {{RuchFunction}}. By 
default, this method calls {{close()}}. The runtime is the changed to call 
{{close()}} as part of the regular execution and {{closeAfterFailure()}} in 
case of an irregular exit.

Because by default all cases call {{close()}} the change would not be API 
breaking.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2647) Stream operators need to differentiate between close() and dispose()

2015-09-09 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2647:
---

 Summary: Stream operators need to differentiate between close() 
and dispose()
 Key: FLINK-2647
 URL: https://issues.apache.org/jira/browse/FLINK-2647
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


Currently, operators make no distinction between closing (which needs to emit 
remaining buffered data) and disposing (releasing resources).

In case of a failure, we want to only release the resources, and not attempt to 
send pending buffered data.

Effectively, streaming operators need to implement these methods then:
  - open()
  - processRecord()
  - processWatermark()
  - close()
  - dispose()




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2635) Unify StreamInputProcessor and OneInputStreamTask

2015-09-08 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2635:
---

 Summary: Unify StreamInputProcessor and OneInputStreamTask
 Key: FLINK-2635
 URL: https://issues.apache.org/jira/browse/FLINK-2635
 Project: Flink
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


Working on some windowing operations, I found that these two classes can easily 
be put into one, making the set of task/operator classes easier to navigate.

The {{OneInputStreamTask}} does effectively nothing more than delegate to the 
{{StreamInputProcessor}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2636) Clean up StreamRecord and Watermark types

2015-09-08 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2636:
---

 Summary: Clean up StreamRecord and Watermark types
 Key: FLINK-2636
 URL: https://issues.apache.org/jira/browse/FLINK-2636
 Project: Flink
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


StreamRecords and Watermarks can be returned from streaming inputs 
alternatingly. They should have a common super type to make the code cleaner 
and allow for proper generic typing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2641) Integrate the off-heap memory configuration with the TaskManager start script

2015-09-08 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2641:
---

 Summary: Integrate the off-heap memory configuration with the 
TaskManager start script
 Key: FLINK-2641
 URL: https://issues.apache.org/jira/browse/FLINK-2641
 Project: Flink
  Issue Type: New Feature
  Components: Start-Stop Scripts
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10


The TaskManager start script needs to adjust the {{-Xmx}}, {{-Xms}}, and 
{{-XX:MaxDirectMemorySize}} parameters according to the off-heap memory 
settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2640) Integrate the off-heap configurations with YARN runner

2015-09-08 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2640:
---

 Summary: Integrate the off-heap configurations with YARN runner
 Key: FLINK-2640
 URL: https://issues.apache.org/jira/browse/FLINK-2640
 Project: Flink
  Issue Type: New Feature
  Components: YARN Client
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10


The YARN runner needs to adjust the {{-Xmx}}, {{-Xms}}, and 
{{-XX:MaxDirectMemorySize}} parameters according to the off-heap memory 
settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2638) Add @SafeVarargs to the environment.fromElements(...) method

2015-09-08 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2638:
---

 Summary: Add @SafeVarargs to the environment.fromElements(...) 
method
 Key: FLINK-2638
 URL: https://issues.apache.org/jira/browse/FLINK-2638
 Project: Flink
  Issue Type: Improvement
  Components: Core, Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10


This annotation suppresses the warnings when people use these methods with 
generic types.

Since we check types and serialize them in the input format anyways, we should 
be actually safe.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2624) RabbitMQ source / sink should participate in checkpointing

2015-09-04 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2624:
---

 Summary: RabbitMQ source / sink should participate in checkpointing
 Key: FLINK-2624
 URL: https://issues.apache.org/jira/browse/FLINK-2624
 Project: Flink
  Issue Type: Bug
  Components: Streaming Connectors
Affects Versions: 0.10
Reporter: Stephan Ewen


The RabbitMQ connector does not offer any fault tolerance guarantees right now, 
because it does not participate in the checkpointing.

We should integrate it in a similar was as the {{FlinkKafkaConsumer}} is 
integrated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2586) Unstable Storm Compatibility Tests

2015-08-27 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2586:
---

 Summary: Unstable Storm Compatibility Tests
 Key: FLINK-2586
 URL: https://issues.apache.org/jira/browse/FLINK-2586
 Project: Flink
  Issue Type: Bug
  Components: Storm Compatibility
Affects Versions: 0.10
Reporter: Stephan Ewen
Priority: Critical
 Fix For: 0.10


The Storm Compatibility tests frequently fail.

The reason is that they kill the topologies after a certain time interval. That 
may fail on CI infrastructure when certain steps are delayed beyond usual. 
Trying to guarantee progress by time is inherently problematic:
  - Waiting too short makes tests unstable
  - Waiting too long makes tests slow

The right way to go is letting the program decide when to terminate, for 
example by throwing a special {{SuccessException}}.

Have a look at the Kafka connector tests, they do this a lot and hence run 
exactly as short or as long as they need to.

Here is an example of a failed run: 
https://s3.amazonaws.com/archive.travis-ci.org/jobs/77499577/log.txt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2554) Add exception reporting handler

2015-08-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2554:
---

 Summary: Add exception reporting handler
 Key: FLINK-2554
 URL: https://issues.apache.org/jira/browse/FLINK-2554
 Project: Flink
  Issue Type: Sub-task
  Components: Webfrontend
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


The web dashboard should report all encountered exceptions:
  - The root failure cause
  - A list of exceptions encountered at individual tasks, together with the 
names of the tasks and the executing task managers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2547) Provide Cluster Status Info

2015-08-19 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2547:
---

 Summary: Provide Cluster Status Info
 Key: FLINK-2547
 URL: https://issues.apache.org/jira/browse/FLINK-2547
 Project: Flink
  Issue Type: Sub-task
  Components: Webfrontend
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
Priority: Minor
 Fix For: 0.10


Add a request to the web server that returns

{code}
{
  taskmanagers : numTaskmanagers,
  slots-total : numSlotsTotal, 
  slots-available : nonSlotsAvailable,
  jobs-running : numJobsRunning,
  jobs-finished : numJobsFinished,
  jobs-cancelled : numJobsCancelled,
  jobs-failed : numJobsFailed
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2520) StreamFaultToleranceTestBase does not allow for multiple tests

2015-08-14 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2520:
---

 Summary: StreamFaultToleranceTestBase does not allow for multiple 
tests
 Key: FLINK-2520
 URL: https://issues.apache.org/jira/browse/FLINK-2520
 Project: Flink
  Issue Type: Improvement
  Components: Tests
Affects Versions: 0.10
Reporter: Stephan Ewen
Priority: Minor


It would be good to implement more tests into the same class (or user at least 
the same driver class), in order to re-use mini clusters and reducer overall 
test times.

The {{StreamFaultToleranceTestBase}} does not support that.
I think the pattern where the {{@Test}} methods are in a base class, and the 
actual tests only implement hook methods is not a good pattern.

The batch API tests used this initially and we abandoned it everywhere for a 
more flexible implementation where the base class only sets up the. 
cluster/environment and the {{@Test}} methods go int the subclass.

The {{MultipleProgramsTestBase}} is a good example of a flexible test base.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2524) Add getTaskNameWithSubtasks() to RuntimeContext

2015-08-14 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2524:
---

 Summary: Add getTaskNameWithSubtasks() to RuntimeContext
 Key: FLINK-2524
 URL: https://issues.apache.org/jira/browse/FLINK-2524
 Project: Flink
  Issue Type: Improvement
  Components: Local Runtime
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10


When printing information to logs or debug output, one frequently needs to 
identify the statement with the originating task (task name and which subtask).

In many places, the system and user code manually construct something like 
MyTask (2/7).

The {{RuntimeContext}} should offer this, because it is too frequently needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2519) BarrierBuffers get stuck infinitely when some inputs end early

2015-08-14 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2519:
---

 Summary: BarrierBuffers get stuck infinitely when some inputs end 
early
 Key: FLINK-2519
 URL: https://issues.apache.org/jira/browse/FLINK-2519
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
Priority: Blocker
 Fix For: 0.10


When some sources exit early, the barrier buffer may start an alignment, but 
will never complete it, because some inputs never deliver a checkpoint barrier.

[FLINK-2515] mitigates most of the problem, by not starting checkpoints any 
more when some sources already finished. When a checkpoint is started while a 
source is finishing, the barrier buffer may still align infinitely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2515) CheckpointCoordinator triggers checkpoints even if not all sources are running any more

2015-08-13 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2515:
---

 Summary: CheckpointCoordinator triggers checkpoints even if not 
all sources are running any more
 Key: FLINK-2515
 URL: https://issues.apache.org/jira/browse/FLINK-2515
 Project: Flink
  Issue Type: Bug
  Components: JobManager
Affects Versions: 0.10
Reporter: Stephan Ewen
Priority: Blocker
 Fix For: 0.10


When some sources finish early, they will not emit checkpoint barriers any 
more. That means that pending checkpoint alignments will never be able to 
complete, locking the flow.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2508) Confusing sharing of StreamExecutionEnvironment

2015-08-11 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2508:
---

 Summary: Confusing sharing of StreamExecutionEnvironment
 Key: FLINK-2508
 URL: https://issues.apache.org/jira/browse/FLINK-2508
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10


In the {{StreamExecutionEnvironment}}, the environment is once created and then 
shared with a static variable to all successive calls to 
{{getExecutionEnvironment()}}. But it can be overridden by calls to 
{{createLocalEnvironment()}} and {{createRemoteEnvironment()}}.

This seems a bit un-intuitive, and probably creates confusion when dispatching 
multiple streaming jobs from within the same JVM.

Why is it even necessary to cache the current execution environment?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2509) Improve error messages when user code classes are not found

2015-08-11 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2509:
---

 Summary: Improve error messages when user code classes are not 
found
 Key: FLINK-2509
 URL: https://issues.apache.org/jira/browse/FLINK-2509
 Project: Flink
  Issue Type: Bug
  Components: Core
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


When a job fails because the user code classes are not found, we should add 
some information about the class loader and class path into the exception 
message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2510) KafkaConnector should access partition metadata from master/cluster

2015-08-11 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2510:
---

 Summary: KafkaConnector should access partition metadata from 
master/cluster
 Key: FLINK-2510
 URL: https://issues.apache.org/jira/browse/FLINK-2510
 Project: Flink
  Issue Type: Improvement
  Components: Streaming Connectors
Affects Versions: 0.10
Reporter: Stephan Ewen
Priority: Minor


Currently, the Kafka connector assumes that it can access the partition 
metadata from the client. There may be setups where this is not possible, and 
where the access needs to happen from within the cluster (due to firewalls).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2492) Rename remaining runtime classes from match to join

2015-08-06 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2492:
---

 Summary: Rename remaining runtime classes from match to join
 Key: FLINK-2492
 URL: https://issues.apache.org/jira/browse/FLINK-2492
 Project: Flink
  Issue Type: Improvement
  Components: Local Runtime
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
Priority: Minor
 Fix For: 0.10


While working with the runtime join classes, I saw that many of them still 
refer to the join as match.

Since all other parts now consistently refer to join, we should adjust the 
runtime classes as well. Makes it easier for new contributors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2493) Simplify names of example program JARs

2015-08-06 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2493:
---

 Summary: Simplify names of example program JARs
 Key: FLINK-2493
 URL: https://issues.apache.org/jira/browse/FLINK-2493
 Project: Flink
  Issue Type: Improvement
  Components: Examples
Affects Versions: 0.10
Reporter: Stephan Ewen
Priority: Minor


I find the names of the example JARs a bit annoying.

Why not name the file {{examples/ConnectedComponents.jar}} rather than 
{{examples/flink-java-examples-0.10-SNAPSHOT-ConnectedComponents.jar}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2474) Occasional failures in PartitionedStateCheckpointingITCase

2015-08-03 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2474:
---

 Summary: Occasional failures in PartitionedStateCheckpointingITCase
 Key: FLINK-2474
 URL: https://issues.apache.org/jira/browse/FLINK-2474
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen


The error message

{code}
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 47.301 sec  
FAILURE! - in 
org.apache.flink.test.checkpointing.PartitionedStateCheckpointingITCase
runCheckpointedProgram(org.apache.flink.test.checkpointing.PartitionedStateCheckpointingITCase)
  Time elapsed: 42.495 sec   FAILURE!
java.lang.AssertionError: expected:86678900 but was:3467156
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:144)
at 
org.apache.flink.test.checkpointing.PartitionedStateCheckpointingITCase.runCheckpointedProgram(PartitionedStateCheckpointingITCase.java:117)
{code}

The detailed CI logs

https://s3.amazonaws.com/archive.travis-ci.org/jobs/73928480/log.txt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2473) Add a timeout to ActorSystem shutdown in the Client

2015-08-03 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2473:
---

 Summary: Add a timeout to ActorSystem shutdown in the Client
 Key: FLINK-2473
 URL: https://issues.apache.org/jira/browse/FLINK-2473
 Project: Flink
  Issue Type: Bug
  Components: Distributed Runtime
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


Just ran into what looks like a rare bug in Akka, where the 
{{awaitTermination()}} method hits a bug and freezes indefinitely.

We can work around this by supplying a timeout to the method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2463) Chow execution config in dashboard

2015-08-02 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2463:
---

 Summary: Chow execution config in dashboard
 Key: FLINK-2463
 URL: https://issues.apache.org/jira/browse/FLINK-2463
 Project: Flink
  Issue Type: Sub-task
  Components: Webfrontend
Affects Versions: 0.10
Reporter: Stephan Ewen
Priority: Minor
 Fix For: 0.10


I wanted to change the configuration display, as in #927, but it wasn't 
implemented in the new UI. So I decided to create this PR to take care of it.

Added handler for new endpoint {{/jobs/:jobid/config}}

The endpoint is hit when selecting a job in the UI

Example of response:
{code}
{
execution-config: {
execution-mode: PIPELINED,
job-parallelism: -1,
max-execution-retries: -1,
object-reuse-mode: false,
user-config: {
example: true
}
},
jid: dedc15369efa94eb1e3ad6482f1344b6,
name: WordCount Example
}
{code}

The info is stored in the existing $rootScope.job object

Added new tab in job display to show the config



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2438) Improve performance of channel events

2015-07-30 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2438:
---

 Summary: Improve performance of channel events
 Key: FLINK-2438
 URL: https://issues.apache.org/jira/browse/FLINK-2438
 Project: Flink
  Issue Type: Bug
  Components: Distributed Runtime
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


Until now, channel events were sufficiently rare to not matter in their 
performance. With frequent checkpointing, their performance becomes a factor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2426) Create a read-only variant of the Configuration

2015-07-29 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2426:
---

 Summary: Create a read-only variant of the Configuration
 Key: FLINK-2426
 URL: https://issues.apache.org/jira/browse/FLINK-2426
 Project: Flink
  Issue Type: Sub-task
  Components: Core
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10


In order to give the TaskManager configuration to runtime components (or even 
user code), it should be wrapped in an {{UnmodifiableConfiguration}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2427) Allow the BarrierBuffer to maintain multiple queues of blocked inputs

2015-07-29 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2427:
---

 Summary: Allow the BarrierBuffer to maintain multiple queues of 
blocked inputs
 Key: FLINK-2427
 URL: https://issues.apache.org/jira/browse/FLINK-2427
 Project: Flink
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


In corner cases (dropped barriers due to failures/startup races), this is 
required for proper operation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2425) Give access to TaskManager config and hostname in the Runtime Environment

2015-07-29 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2425:
---

 Summary: Give access to TaskManager config and hostname in the 
Runtime Environment
 Key: FLINK-2425
 URL: https://issues.apache.org/jira/browse/FLINK-2425
 Project: Flink
  Issue Type: Improvement
  Components: TaskManager
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10


The RuntimeEnvironment (that is used by the operators to access the context) 
should give access to the TaskManager's configuration, to allow to read config 
values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2428) Clean up unused properties in StreamConfig

2015-07-29 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2428:
---

 Summary: Clean up unused properties in StreamConfig
 Key: FLINK-2428
 URL: https://issues.apache.org/jira/browse/FLINK-2428
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Priority: Minor


There is a multitude of unused properties in the {{StreamConfig}}, which should 
be removed, if no longer relevant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2429) Remove the enableCheckpointing() without interval variant

2015-07-29 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2429:
---

 Summary: Remove the enableCheckpointing() without interval 
variant
 Key: FLINK-2429
 URL: https://issues.apache.org/jira/browse/FLINK-2429
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Stephan Ewen






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2418) Add an end-to-end streaming fault tolerance test for the Checkpointed interface

2015-07-28 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2418:
---

 Summary: Add an end-to-end  streaming fault tolerance test for the 
Checkpointed interface
 Key: FLINK-2418
 URL: https://issues.apache.org/jira/browse/FLINK-2418
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


The current test lacks the following:
  - Does not validate and exactly-once counts after partitioning step
  - Does not cover all cases with the {{Checkpointed}} interface, but uses a 
mix of by-key state and explicitly Checkpointed state
  - The test uses a non-fault-tolerant deprecated infinite reducer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2421) StreamRecordSerializer incorrectly duplicates and misses tests

2015-07-28 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2421:
---

 Summary: StreamRecordSerializer incorrectly duplicates and misses 
tests
 Key: FLINK-2421
 URL: https://issues.apache.org/jira/browse/FLINK-2421
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


The duplication method of the {{StreamRecordSerializer}} does not respect the 
duplication of the internal type serializer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2407) Add an API switch to select between exactly once and at least once fault tolerance

2015-07-26 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2407:
---

 Summary: Add an API switch to select between exactly once and 
at least once fault tolerance
 Key: FLINK-2407
 URL: https://issues.apache.org/jira/browse/FLINK-2407
 Project: Flink
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10


Based on the addition of the BarrierTracker, we can add a switch to choose 
between the two modes exactly once and at least once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2406) Abstract BarrierBuffer to an exchangeable BarrierHandler

2015-07-26 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2406:
---

 Summary: Abstract BarrierBuffer to an exchangeable BarrierHandler
 Key: FLINK-2406
 URL: https://issues.apache.org/jira/browse/FLINK-2406
 Project: Flink
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


We need to make the Checkpoint handling pluggable, to allow us to use different 
implementations:

  - BarrierBuffer for exactly once processing. This inevitably introduces a 
bit of latency.
  - BarrierTracker for at least once processing, with no added latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2404) LongCounters should have an addValue() method for primitive longs

2015-07-24 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2404:
---

 Summary: LongCounters should have an addValue() method for 
primitive longs
 Key: FLINK-2404
 URL: https://issues.apache.org/jira/browse/FLINK-2404
 Project: Flink
  Issue Type: Bug
  Components: Core
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10


Since the LongCounter is used heavily for internal statistics reporting, it 
must have very low overhead.

The current addValue() method always boxes and unboxes the values, which is 
unnecessary overhead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2402) Add a non-blocking BarrierTracker

2015-07-23 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2402:
---

 Summary: Add a non-blocking BarrierTracker
 Key: FLINK-2402
 URL: https://issues.apache.org/jira/browse/FLINK-2402
 Project: Flink
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


This issue would add a new tracker for barriers that simply tracks what 
barriers have been observed from which inputs. It never blocks off buffers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2397) Unify two backend servers to one server

2015-07-23 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2397:
---

 Summary: Unify two backend servers to one server
 Key: FLINK-2397
 URL: https://issues.apache.org/jira/browse/FLINK-2397
 Project: Flink
  Issue Type: Sub-task
  Components: Webfrontend
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


Currently, the new dashboard still needs both the old backend server and the 
new backend server. We need to migrate the requests from {{/jobsInfo}} to the 
new server.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2393) Add a stateless at-least-once mode for streaming

2015-07-22 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2393:
---

 Summary: Add a stateless at-least-once mode for streaming
 Key: FLINK-2393
 URL: https://issues.apache.org/jira/browse/FLINK-2393
 Project: Flink
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen


Currently, the checkpointing mechanism provides exactly once guarantees. Part 
of that is the step that temporarily aligns the data streams. This step 
increases the tuple latency temporarily.

By offering a version that does not provide exactly-once, but only 
at-least-once, we can avoid the latency increase. For super-low-latency 
applications, that tolerate duplicates, this may be an interesting option.

To realize that, we would use a slightly modified version of the checkpointing 
algorithm. Effectively, the streams would not be aligned, but tasks would only 
count the received barriers and emit their own barrier as soon as the saw a 
barrier from all inputs.

My feeling is that it makes not sense to implement state backups, when being 
concerned with this super low latency. The mode would hence be a purely 
stateless at-least-once mode.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2385) Scala DataSet.distinct should have parenthesis

2015-07-21 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2385:
---

 Summary: Scala DataSet.distinct should have parenthesis
 Key: FLINK-2385
 URL: https://issues.apache.org/jira/browse/FLINK-2385
 Project: Flink
  Issue Type: Bug
  Components: Scala API
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


The method is not a side-effect free accessor, but defines heavy computation, 
even if it does not mutate the original data set.

This is a somewhat API breaking change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2389) Add dashboard frontend architecture amd build infrastructue

2015-07-21 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2389:
---

 Summary: Add dashboard frontend architecture amd build 
infrastructue
 Key: FLINK-2389
 URL: https://issues.apache.org/jira/browse/FLINK-2389
 Project: Flink
  Issue Type: Sub-task
  Components: Webfrontend
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


Add the build infrastructure and basic libraries for a modern and modular web 
dashboard.

The dashboard is going to be built using angular.js.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2357) New JobManager Runtime Web Frontend

2015-07-14 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2357:
---

 Summary: New JobManager Runtime Web Frontend
 Key: FLINK-2357
 URL: https://issues.apache.org/jira/browse/FLINK-2357
 Project: Flink
  Issue Type: New Feature
  Components: Webfrontend
Affects Versions: 0.10
Reporter: Stephan Ewen


We need to improve rework the Job Manager Web Frontend.

The current web frontend is limited and has a lot of design issues
  - It does not display and progress while operators are running. This is 
especially problematic for streaming jobs
  - It has no graph representation of the data flows
  - it does not allow to look into execution attempts
  - it has no hook to deal with the upcoming live accumulators
  - The architecture is not very modular/extensible

I propose to add a new JobManager web frontend:
  - Based on Netty HTTP (very lightweight)
  - Using rest-style URLs for jobs and vertices
  - integrating the D3 graph renderer of the previews with the runtime monitor
  - with details on execution attempts
  - first class visualization of records processed and bytes processed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2358) Add Netty-HTTP based server and server handlers

2015-07-14 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2358:
---

 Summary: Add Netty-HTTP based server and server handlers
 Key: FLINK-2358
 URL: https://issues.apache.org/jira/browse/FLINK-2358
 Project: Flink
  Issue Type: Sub-task
  Components: Webfrontend
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2351) Deprecate config builders in InputFormats and Output formats

2015-07-12 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2351:
---

 Summary: Deprecate config builders in InputFormats and Output 
formats
 Key: FLINK-2351
 URL: https://issues.apache.org/jira/browse/FLINK-2351
 Project: Flink
  Issue Type: Bug
  Components: Core
Affects Versions: 0.10
Reporter: Stephan Ewen
 Fix For: 0.10


Old APIs used to pass functions as classes and parameters as configs. To 
support that, all input- and output formats had config builders.

As the record API is deprecated and about to be removed, these builders are no 
longer needed and should be deprecated and removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2350) Deprecate the stream open timeouts in FileInputFormat and FileOutputFormat

2015-07-12 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2350:
---

 Summary: Deprecate the stream open timeouts in FileInputFormat and 
FileOutputFormat
 Key: FLINK-2350
 URL: https://issues.apache.org/jira/browse/FLINK-2350
 Project: Flink
  Issue Type: Bug
  Components: Core
Affects Versions: 0.10
Reporter: Stephan Ewen


The stream open requests have the ability to fail after a certain timeout. By 
default, this timeout is deactivated, because loaded HDFS has in the past 
frequently exceeded this timeout.

I think no one ever used this, and I am not sure it is useful, actually. We can 
remove it to reduce code complexity, in my opinion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2343) Change default garbage collector in streaming environments

2015-07-10 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2343:
---

 Summary: Change default garbage collector in streaming environments
 Key: FLINK-2343
 URL: https://issues.apache.org/jira/browse/FLINK-2343
 Project: Flink
  Issue Type: Improvement
  Components: Start-Stop Scripts
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


When starting Flink, we don't pass any particular GC related JVM flags to the 
system. That means, it uses the default garbage collectors, which are the bulk 
parallel GCs for both old gen and new gen.

For streaming applications, this results in vastly fluctuating latencies. 
Latencies are much more constant with either the {{CMS}} or {{G1}} GC.

I propose to make the CMS the default GC for streaming setups.

G1 may become the GC of choice in the future, but fro various articles I found, 
it is still somewhat in beta status (see for example here: 
http://jaxenter.com/kirk-pepperdine-on-the-g1-for-java-9-118190.html )



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2339) Prevent asynchronous checkpoint calls from overtaking each other

2015-07-09 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2339:
---

 Summary: Prevent asynchronous checkpoint calls from overtaking 
each other
 Key: FLINK-2339
 URL: https://issues.apache.org/jira/browse/FLINK-2339
 Project: Flink
  Issue Type: Improvement
  Components: TaskManager
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


Currently, when checkpoint state materialization takes very long, and the 
checkpoint interval is low, the asynchronous calls to trigger checkpoints (on 
the sources) could overtake prior calls.

We can fix that by making sure that all calls are dispatched in order by the 
same thread, rather than spawning a new thread for each call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2333) Stream Data Sink that periodically rolls files

2015-07-09 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2333:
---

 Summary: Stream Data Sink that periodically rolls files 
 Key: FLINK-2333
 URL: https://issues.apache.org/jira/browse/FLINK-2333
 Project: Flink
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen


It would be useful to have a file data sink for streams that starts a new file 
every n elements or records.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2327) Log the limit of open file handles at startup

2015-07-08 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2327:
---

 Summary: Log the limit of open file handles at startup
 Key: FLINK-2327
 URL: https://issues.apache.org/jira/browse/FLINK-2327
 Project: Flink
  Issue Type: Improvement
  Components: Local Runtime
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
Priority: Minor
 Fix For: 0.10






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2331) Whitelist some exceptions to be valid in YARN logs

2015-07-08 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2331:
---

 Summary: Whitelist some exceptions to be valid in YARN logs
 Key: FLINK-2331
 URL: https://issues.apache.org/jira/browse/FLINK-2331
 Project: Flink
  Issue Type: Bug
  Components: YARN Client
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


The YARN logs may occasionally contain this exception that occurs on shutdown:

{code}
14:55:18,819 ERROR akka.remote.RemoteActorRef   
 - swallowing exception during message send
akka.remote.RemoteTransportExceptionNoStackTrace: Attempted to send remote 
message but Remoting is not running.
{code}

Whitelisting the {{akka.remote.RemoteTransportExceptionNoStackTrace}} should 
make the tests more reliable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2330) Make FromElementsFunction checkpointable

2015-07-08 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2330:
---

 Summary: Make FromElementsFunction checkpointable
 Key: FLINK-2330
 URL: https://issues.apache.org/jira/browse/FLINK-2330
 Project: Flink
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2314) Make Streaming File Sources Persistent

2015-07-02 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2314:
---

 Summary: Make Streaming File Sources Persistent
 Key: FLINK-2314
 URL: https://issues.apache.org/jira/browse/FLINK-2314
 Project: Flink
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen


Streaming File sources should participate in the checkpointing. They should 
track the bytes they read from the file and checkpoint it.

One can look at the sequence generating source function for an example of a 
checkpointed source.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2315) Hadoop Writables cannot exploit implementing NormalizableKey

2015-07-02 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2315:
---

 Summary: Hadoop Writables cannot exploit implementing 
NormalizableKey
 Key: FLINK-2315
 URL: https://issues.apache.org/jira/browse/FLINK-2315
 Project: Flink
  Issue Type: Bug
  Components: Java API
Affects Versions: 0.9
Reporter: Stephan Ewen


When one implements a type that extends {{hadoop.io.Writable}} and it 
implements {{NormalizableKey}}, this is still not exploited.

The Writable comparator fails to recognize that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2284) Confusing/inconsistent PartitioningStrategy

2015-06-29 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2284:
---

 Summary: Confusing/inconsistent PartitioningStrategy
 Key: FLINK-2284
 URL: https://issues.apache.org/jira/browse/FLINK-2284
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen


The PartitioningStrategy in 
{{org.apache.flink.streaming.runtime.partitioner.StreamPartitioner.java}} is 
non standard and not easily understandable.

What form of partitioning is `SHUFFLE`? Shuffle just means redistribute, it 
says nothing about what it does. Same with `DISTRIBUTE`. Also `GLOBAL` is not a 
well-defined/established term. Why is `GROUPBY` a partition type? Doesn't 
grouping simply hash partition (like I assume SHUFFLE means), so why does it 
have an extra entry?

Sticking with principled and established names/concepts is important to allow 
people to collaborate on the code. 

Why not stick with the partitioning types defined in the batch API? They are 
well defined and named:
```
NONE, FORWARD, RANDOM, HASH, RANGE, FORCED_REBALANCE, BROADCAST, CUSTOM
```




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2270) Error in KafkaSource Documentation

2015-06-24 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2270:
---

 Summary: Error in KafkaSource Documentation
 Key: FLINK-2270
 URL: https://issues.apache.org/jira/browse/FLINK-2270
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 0.9
Reporter: Stephan Ewen


The Kafka Source documentation still refers to {{enableMonitoring()}} as the 
method to enable checkpointing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2231) Create a Serializer for Scala Enumerations

2015-06-16 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2231:
---

 Summary: Create a Serializer for Scala Enumerations
 Key: FLINK-2231
 URL: https://issues.apache.org/jira/browse/FLINK-2231
 Project: Flink
  Issue Type: Improvement
  Components: Scala API
Reporter: Stephan Ewen


Scala Enumerations are currently serialized with Kryo, but should be 
efficiently serialized by just writing the {{initial}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2177) NillPointer in task resource release

2015-06-05 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2177:
---

 Summary: NillPointer in task resource release
 Key: FLINK-2177
 URL: https://issues.apache.org/jira/browse/FLINK-2177
 Project: Flink
  Issue Type: Bug
  Components: Distributed Runtime
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Ufuk Celebi
Priority: Blocker
 Fix For: 0.9


{code}
==
==  FATAL  ===
==

A fatal error occurred, forcing the TaskManager to shut down: FATAL - exception 
in task resource cleanup
java.lang.NullPointerException
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.cancelRequestFor(PartitionRequestClientHandler.java:89)
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClient.close(PartitionRequestClient.java:182)
at 
org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.releaseAllResources(RemoteInputChannel.java:199)
at 
org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.releaseAllResources(SingleInputGate.java:332)
at 
org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:368)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:650)
at java.lang.Thread.run(Thread.java:701)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2130) RabbitMQ source does not fail when failing to retrieve elements

2015-06-01 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2130:
---

 Summary: RabbitMQ source does not fail when failing to retrieve 
elements
 Key: FLINK-2130
 URL: https://issues.apache.org/jira/browse/FLINK-2130
 Project: Flink
  Issue Type: Bug
Reporter: Stephan Ewen


The RMQ source only logs when elements cannot be retrieved. Failures are not 
propagated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2098) Checkpoint barrier initiation at source is not aligned with snapshotting

2015-05-27 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2098:
---

 Summary: Checkpoint barrier initiation at source is not aligned 
with snapshotting
 Key: FLINK-2098
 URL: https://issues.apache.org/jira/browse/FLINK-2098
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
Priority: Blocker
 Fix For: 0.9


The stream source does not properly align the emission of checkpoint barriers 
with the drawing of snapshots.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2070) Confusing methods print() that print on client vs on TaskManager

2015-05-21 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2070:
---

 Summary: Confusing methods print() that print on client vs on 
TaskManager
 Key: FLINK-2070
 URL: https://issues.apache.org/jira/browse/FLINK-2070
 Project: Flink
  Issue Type: Bug
  Components: Core
Affects Versions: 0.9
Reporter: Stephan Ewen
 Fix For: 0.9


With the {{print()}} method printing on the client, the 
{{print(sourceIentified)}} method becomes confusing, as it has the same name, 
but prints on the taskManager (into its out files) and executes lazily.

We should clarify the confusion by picking a more descriptive name, like 
{{printOnTaskManager()}}.

I am not sure how common the use case to print into the TaskManager {{out}} 
files is. We could remove that method and point to the {{PrintingOutputFormat}} 
for that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2071) Projection function can fail non serializable

2015-05-21 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2071:
---

 Summary: Projection function can fail non serializable
 Key: FLINK-2071
 URL: https://issues.apache.org/jira/browse/FLINK-2071
 Project: Flink
  Issue Type: Bug
  Components: Java API
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


The projection function creates a full reusable tuple including fields. If the 
fields are not serializable, the function serialization fails.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2076) Bug in re-openable hash join

2015-05-21 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2076:
---

 Summary: Bug in re-openable hash join
 Key: FLINK-2076
 URL: https://issues.apache.org/jira/browse/FLINK-2076
 Project: Flink
  Issue Type: Bug
  Components: Local Runtime
Affects Versions: 0.9
Reporter: Stephan Ewen


It happens deterministically in my machine with the following setup:

TaskManager:
  - heap size: 512m
  - network buffers: 4096
  - slots: 32

Job:
  - ConnectedComponents
  - 100k vertices
  - 1.2m edges

-- this gives around 260 m Flink managed memory, across 32 slots is 8MB per 
slot, with several mem consumers in the job, makes the iterative hash join 
out-of-core

{code}
java.lang.RuntimeException: Hash Join bug in memory management: 
Memory buffers leaked.
at 
org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:733)
at 
org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508)
at 
org.apache.flink.runtime.operators.hash.ReOpenableMutableHashTable.prepareNextPartition(ReOpenableMutableHashTable.java:167)
at 
org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:541)
at 
org.apache.flink.runtime.operators.hash.NonReusingBuildSecondHashMatchIterator.callWithNextKey(NonReusingBuildSecondHashMatchIterator.java:102)
at 
org.apache.flink.runtime.operators.AbstractCachedBuildSideMatchDriver.run(AbstractCachedBuildSideMatchDriver.java:155)
at 
org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
at 
org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:139)
at 
org.apache.flink.runtime.iterative.task.IterationIntermediatePactTask.run(IterationIntermediatePactTask.java:92)
at 
org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:560)
at java.lang.Thread.run(Thread.java:745)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2058) Hadoop Input Splits do not use proper UserCodeClassloader

2015-05-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2058:
---

 Summary: Hadoop Input Splits do not use proper UserCodeClassloader
 Key: FLINK-2058
 URL: https://issues.apache.org/jira/browse/FLINK-2058
 Project: Flink
  Issue Type: Bug
  Components: Hadoop Compatibility
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


The Hadoop input splits manually resolve classes without using a proper 
classloaders.

I propose so serialize the classes themselves (rather then the names) as part 
of the input format, so that the general class loading functionality for user 
defined types handles this automatically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2057) Remove IOReadableWritable interface from input splits

2015-05-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2057:
---

 Summary: Remove IOReadableWritable interface from input splits
 Key: FLINK-2057
 URL: https://issues.apache.org/jira/browse/FLINK-2057
 Project: Flink
  Issue Type: Improvement
  Components: Core
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


The {{IOReadableWritable}} interface was used in the previous RPC subsystem.

Since we use Java Serialization for RPC / Actor Messages now, we can remove 
this, which makes the implementation of new Split Types much easier.

Removing legacy functionality...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2063) Streaming checkpoints consider only input and output vertices

2015-05-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2063:
---

 Summary: Streaming checkpoints consider only input and output 
vertices
 Key: FLINK-2063
 URL: https://issues.apache.org/jira/browse/FLINK-2063
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


Currently, the streaming checkpoint coordinator is configured only with the 
list of input and output vertices as stateful vertices.

Checkpoint confirmations from intermediate vertices are ignored.

Since the runtime assumes that all vertices participate in checkpointing 
anyways (all of them send acknowledge messages), the coordinator should be 
configured accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2046) Quickstarts are missing on the new Website

2015-05-19 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2046:
---

 Summary: Quickstarts are missing on the new Website
 Key: FLINK-2046
 URL: https://issues.apache.org/jira/browse/FLINK-2046
 Project: Flink
  Issue Type: Bug
  Components: Project Website
Reporter: Stephan Ewen






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2041) Optimizer plans with memory for pipeline breakers, even though they are not used any more

2015-05-18 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2041:
---

 Summary: Optimizer plans with memory for pipeline breakers, even 
though they are not used any more
 Key: FLINK-2041
 URL: https://issues.apache.org/jira/browse/FLINK-2041
 Project: Flink
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2010) Add test that verifies that chained tasks are properly checkpointed

2015-05-13 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2010:
---

 Summary: Add test that verifies that chained tasks are properly 
checkpointed
 Key: FLINK-2010
 URL: https://issues.apache.org/jira/browse/FLINK-2010
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Stephan Ewen


Because this is critical logic, we should have a unit test for the 
{{StreamingTask}} validating that the state from all chained tasks is taken 
into account.

It needs to check
  - The trigger checkpoint method, and that the state handle handle in the 
checkpoint acknowledgement contains all the necessary state

  - The restore state method, that state is reset properly



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2011) Improve error message when user-defined serialization logic is wrong

2015-05-13 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2011:
---

 Summary: Improve error message when user-defined serialization 
logic is wrong
 Key: FLINK-2011
 URL: https://issues.apache.org/jira/browse/FLINK-2011
 Project: Flink
  Issue Type: Bug
  Components: Distributed Runtime
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-2004) Memory leack in presence of failed checkpoints in KafkaSource

2015-05-12 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2004:
---

 Summary: Memory leack in presence of failed checkpoints in 
KafkaSource
 Key: FLINK-2004
 URL: https://issues.apache.org/jira/browse/FLINK-2004
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Robert Metzger
Priority: Critical
 Fix For: 0.9


Checkpoints that fail never send a commit message to the tasks.

Maintaining a map of all pending checkpoints introduces a memory leak, as 
entries for failed checkpoints will never be removed.

Approaches to fix this:
  - The source cleans up entries from older checkpoints once a checkpoint is 
committed (simple implementation in a linked hash map)
  - The commit message could include the optional state handle (source needs 
not maintain the map)
  - The checkpoint coordinator could send messages for failed checkpoints?





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1969) Remove old profile code

2015-05-04 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1969:
---

 Summary: Remove old profile code
 Key: FLINK-1969
 URL: https://issues.apache.org/jira/browse/FLINK-1969
 Project: Flink
  Issue Type: Bug
  Components: Distributed Runtime
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


The old profiler code is not instantiated any more and is basically dead.
It has in parts been replaced by the metrics library already.

The classes still get in the way during refactoring, which is why I suggest to 
remove them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1972) Remove instanceID in streaming tasks

2015-05-04 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1972:
---

 Summary: Remove instanceID in streaming tasks
 Key: FLINK-1972
 URL: https://issues.apache.org/jira/browse/FLINK-1972
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Stephan Ewen


The streaming tasks have a weird instanceID that is a JVM local counter.

Tasks have already proper names, vertex ids, and IDs scoped to the attempt of 
execution. We should remove this counter ID.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1953) Rework Checkpoint Coordinator

2015-04-28 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1953:
---

 Summary: Rework Checkpoint Coordinator
 Key: FLINK-1953
 URL: https://issues.apache.org/jira/browse/FLINK-1953
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


The checkpoint coordinator currently contains no tests and is vulnerable to a 
variety of situations. In particular, I propose to add:

 - Better configurability which tasks receive the trigger checkpoint messages, 
which tasks need to acknowledge the checkpoint, and which tasks need to receive 
confirmation messages.

 - checkpoint timeouts, such that incomplete checkpoints are guaranteed to be 
cleaned up after a while, regardless of successful checkpoints

 - better sanity checking of messages and fields, to properly handle/ignore 
messages for old/expired checkpoints, or invalidly routed messages

 - Better handling of checkpoint attempts at points where the execution has 
just failed is is currently being canceled.

 - Add a good set of tests



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1879) Simplify JobClient

2015-04-13 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1879:
---

 Summary: Simplify JobClient
 Key: FLINK-1879
 URL: https://issues.apache.org/jira/browse/FLINK-1879
 Project: Flink
  Issue Type: Sub-task
  Components: Distributed Runtime
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


The JobClient currently uses two actors, when one is sufficient. The calls in 
the mini clusters also require tests to handle with ActorRefs, which should not 
be necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1845) NonReusingSortMergeCoGroupIterator uses ReusingKeyGroupedIterator

2015-04-08 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1845:
---

 Summary: NonReusingSortMergeCoGroupIterator uses 
ReusingKeyGroupedIterator
 Key: FLINK-1845
 URL: https://issues.apache.org/jira/browse/FLINK-1845
 Project: Flink
  Issue Type: Bug
  Components: Local Runtime
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Alexander Alexandrov
 Fix For: 0.9


This is an unexpected mutability bug. [~aalexandrov] already has created a pull 
request that fixes this...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1839) Failures in TwitterStreamITCase

2015-04-07 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1839:
---

 Summary: Failures in TwitterStreamITCase
 Key: FLINK-1839
 URL: https://issues.apache.org/jira/browse/FLINK-1839
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen


The test seems unstable and fails occasionally with the error below

{code}
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 4.508 sec  
FAILURE! - in 
org.apache.flink.streaming.examples.test.twitter.TwitterStreamITCase
testJobWithoutObjectReuse(org.apache.flink.streaming.examples.test.twitter.TwitterStreamITCase)
  Time elapsed: 3.739 sec   FAILURE!
java.lang.AssertionError: Different number of lines in expected and obtained 
result. expected:146 but was:144
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at 
org.apache.flink.test.util.TestBaseUtils.compareResultsByLinesInMemory(TestBaseUtils.java:256)
at 
org.apache.flink.test.util.TestBaseUtils.compareResultsByLinesInMemory(TestBaseUtils.java:242)
at 
org.apache.flink.streaming.examples.test.twitter.TwitterStreamITCase.postSubmit(TwitterStreamITCase.java:34)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1801) NetworkEnvironment should start without JobManager association

2015-03-30 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1801:
---

 Summary: NetworkEnvironment should start without JobManager 
association
 Key: FLINK-1801
 URL: https://issues.apache.org/jira/browse/FLINK-1801
 Project: Flink
  Issue Type: Sub-task
  Components: TaskManager
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


The NetworkEnvironment should be able to start without a dedicated JobManager 
association and get one / loose one as the TaskManager connects to different 
JobManagers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1802) BlobManager directories should be checked before TaskManager startup

2015-03-30 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1802:
---

 Summary: BlobManager directories should be checked before 
TaskManager startup
 Key: FLINK-1802
 URL: https://issues.apache.org/jira/browse/FLINK-1802
 Project: Flink
  Issue Type: Sub-task
  Components: TaskManager
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


That allows the call to start the taskmanager to fail early and synchronous, 
improving debugability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1760) Add support for building Flink with Scala 2.11

2015-03-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1760:
---

 Summary: Add support for building Flink with Scala 2.11
 Key: FLINK-1760
 URL: https://issues.apache.org/jira/browse/FLINK-1760
 Project: Flink
  Issue Type: Improvement
  Components: Build System
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Alexander Alexandrov
 Fix For: 0.9


Pull request https://github.com/apache/flink/pull/477 is implementing this 
feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1762) Make Statefulness of a Streaming Function explicit

2015-03-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1762:
---

 Summary: Make Statefulness of a Streaming Function explicit
 Key: FLINK-1762
 URL: https://issues.apache.org/jira/browse/FLINK-1762
 Project: Flink
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen


Currently, the state of streaming functions stored in the 
{{StreamingRuntimeContext}}.

That is rather inexplicit, a function may or may not make use of the state. 
This also hides from the system whether a function is stateful or not.

How about we make this explicit by letting stateful functions extend a special 
interface (see below). That would allow the stream graph to already know which 
functions are stateful. Certain vertices would not participate in the 
checkpointing, if they only contain stateless vertices. We can set up the 
ExecutionGraph to expect confirmations only from the participating vertices, 
saving messages.

{code}
public interface Statehandle {

 get, put, ...
}

public interface Stateful {

void setStateHandle(Statehandle handle);
}
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1765) Reducer grouping is skippted when parallelism is one

2015-03-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1765:
---

 Summary: Reducer grouping is skippted when parallelism is one
 Key: FLINK-1765
 URL: https://issues.apache.org/jira/browse/FLINK-1765
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen
 Fix For: 0.9


This program (not the parallelism) incorrectly runs a non grouped reduce and 
fails with a NullPointerException.

{code}
StreamExecutionEnvironment env = ...
env.setDegreeOfParallelism(1);

DataStreamString stream = env.addSource(...);

stream
.filter(...)
.map(...)
.groupBy(someField)
.reduce(new ReduceFunction() {...} )
.addSink(...);

env.execute();
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1763) Remove cancel from streaming SinkFunction

2015-03-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1763:
---

 Summary: Remove cancel from streaming SinkFunction
 Key: FLINK-1763
 URL: https://issues.apache.org/jira/browse/FLINK-1763
 Project: Flink
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen
 Fix For: 0.9


Since the streaming sink function is called individually for each record, it 
does not require a {{cancel()}} function. The system can cancel between calls 
to that function (which it cannot do for the source function).

Removing this method removes the need to always implement the unnecessary, and 
usually empty, method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1764) Rework record copying logic in streaming API

2015-03-20 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1764:
---

 Summary: Rework record copying logic in streaming API
 Key: FLINK-1764
 URL: https://issues.apache.org/jira/browse/FLINK-1764
 Project: Flink
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 0.9
Reporter: Stephan Ewen


The logic for chained tasks in the streaming API does a lot of copying of 
records. In some cases, a record is copied multiple times before being passed 
to a function.

This seems unnecessary, in the general case. In any case, multiple copies seem 
incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1721) Flakey Yarn Tests

2015-03-18 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1721:
---

 Summary: Flakey Yarn Tests
 Key: FLINK-1721
 URL: https://issues.apache.org/jira/browse/FLINK-1721
 Project: Flink
  Issue Type: Bug
  Components: YARN Client
Affects Versions: 0.9
Reporter: Stephan Ewen


Failed tests: 
  
YARNSessionFIFOITCase.testTaskManagerFailure:276-YarnTestBase.ensureNoProhibitedStringInLogFiles:286
 Found a file 
/home/travis/build/StephanEwen/incubator-flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-0_0/application_1426629317989_0004/container_1426629317989_0004_01_03/taskmanager.log
 with a prohibited string: [Exception, Started 
SelectChannelConnector@0.0.0.0:8081]
  
YARNSessionFIFOITCase.testNonexistingQueue:305-YarnTestBase.ensureNoProhibitedStringInLogFiles:286
 Found a file 
/home/travis/build/StephanEwen/incubator-flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-0_0/application_1426629317989_0004/container_1426629317989_0004_01_03/taskmanager.log
 with a prohibited string: [Exception, Started 
SelectChannelConnector@0.0.0.0:8081]
  
YARNSessionFIFOITCase.testJavaAPI:460-YarnTestBase.ensureNoProhibitedStringInLogFiles:286
 Found a file 
/home/travis/build/StephanEwen/incubator-flink/flink-yarn-tests/target/flink-yarn-tests-fifo/flink-yarn-tests-fifo-logDir-nm-0_0/application_1426629317989_0004/container_1426629317989_0004_01_03/taskmanager.log
 with a prohibited string: [Exception, Started 
SelectChannelConnector@0.0.0.0:8081]


Log details attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1730) Add a FlinkTools.persist style method to the Data Set.

2015-03-18 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1730:
---

 Summary: Add a FlinkTools.persist style method to the Data Set.
 Key: FLINK-1730
 URL: https://issues.apache.org/jira/browse/FLINK-1730
 Project: Flink
  Issue Type: New Feature
Reporter: Stephan Ewen
Priority: Minor


I think this is an operation that will be needed more prominently. Defining a 
point where one long logical program is broken into different executions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1712) Restructure Maven Projects

2015-03-17 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1712:
---

 Summary: Restructure Maven Projects
 Key: FLINK-1712
 URL: https://issues.apache.org/jira/browse/FLINK-1712
 Project: Flink
  Issue Type: Improvement
Affects Versions: 0.9
Reporter: Stephan Ewen
Priority: Minor


Project consolidation

 - flink-hadoop (shaded fat jar)

 - Core (Core and Java)
 - (core-scala)
 - Streaming (core + java)
 - (streaming-scala)

 - Runtime
 - Optimizer (may also be merged with Client)
 - Client (or Client + Optimizer)
 
 - Examples (Java + Scala + Streaming Java + Streaming Scala)
 - Tests (test-utils (compile) and tests (test))
 
 - Quickstarts
   - Quickstart Java
   - Quickstart Scala
 
 - connectors / Input/Output Formats
   - Avro
   - HBase
   - HadoopCompartibility
   - HCatalogue
   - JDBC
   - kafka
   - rabbit
   - ...
 
 - staging
   - Gelly
   - ML
   - spargel (deprecated)
   - expression API
 
 - contrib
 
 - yarn
 
 - dist
 
 - yarn tests
 
 - java 8



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1711) Replace all usages off commons.Validate with guava.check

2015-03-17 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1711:
---

 Summary: Replace all usages off commons.Validate with guava.check
 Key: FLINK-1711
 URL: https://issues.apache.org/jira/browse/FLINK-1711
 Project: Flink
  Issue Type: Improvement
Affects Versions: 0.9
Reporter: Stephan Ewen
Priority: Minor
 Fix For: 0.9


Per discussion on the mailing list, we decided to increase homogeneity. One 
part is to consistently use the Guava methods {{checkNotNull}} and 
{{checkArgument}}, rather than Apache Commons Lang3 {{Validate}}.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1705) InstanceConnectionInfo returns wrong hostname when no DNS entry exists

2015-03-14 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1705:
---

 Summary: InstanceConnectionInfo returns wrong hostname when no DNS 
entry exists
 Key: FLINK-1705
 URL: https://issues.apache.org/jira/browse/FLINK-1705
 Project: Flink
  Issue Type: Bug
  Components: TaskManager
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


If there is no DNS entry for an address (like 10.4.122.43), then the 
{{InstanceConnectionInfo}} returns the first octet ({{10}}) as the hostame.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1691) Inprove CountCollectITCase

2015-03-11 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1691:
---

 Summary: Inprove CountCollectITCase
 Key: FLINK-1691
 URL: https://issues.apache.org/jira/browse/FLINK-1691
 Project: Flink
  Issue Type: Bug
  Components: test
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Maximilian Michels
 Fix For: 0.9


The CountCollectITCase logs heavily and does not reuse the same cluster across 
multiple tests.

Both can be addressed by letting it extend the MultipleProgramsTestBase



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1671) Add execution modes for programs

2015-03-10 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1671:
---

 Summary: Add execution modes for programs
 Key: FLINK-1671
 URL: https://issues.apache.org/jira/browse/FLINK-1671
 Project: Flink
  Issue Type: Bug
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


Currently, there is a single way that programs get executed: Pipelined. With 
the new code for batch shuffles (https://github.com/apache/flink/pull/471), we 
have much more flexibility and I would like to expose that.

I suggest to add more execution modes that can be chosen on the 
`ExecutionEnvironment`:

  - {{BATCH}} A mode where every shuffle is executed in a batch way, meaning 
preceding operators must be done before successors start. Only for the batch 
programs (d'oh).

  - {{PIPELINED}} This is the mode corresponding to the current execution mode. 
It pipelines where possible and batches, where deadlocks would otherwise 
happen. Initially, I would make this the default (be close to the current 
behavior). Only available for batch programs.

  - {{PIPELINED_WITH_BATCH_FALLBACK}} This would start out with pipelining 
shuffles and fall back to batch shuffles upon failure and recovery, or once it 
sees that not enough slots are available to bring up all operators at once 
(requirement for pipelining).

  - {{STREAMING}} This is the default and only way for streaming programs. All 
communication is pipelined, and the special streaming checkpointing code is 
activated.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1675) Rework Accumulators

2015-03-10 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1675:
---

 Summary: Rework Accumulators
 Key: FLINK-1675
 URL: https://issues.apache.org/jira/browse/FLINK-1675
 Project: Flink
  Issue Type: Bug
  Components: JobManager, TaskManager
Affects Versions: 0.9
Reporter: Stephan Ewen
 Fix For: 0.9


The accumulators need an overhaul to address various issues:

1.  User defined Accumulator classes crash the client, because it is not using 
the user code classloader to decode the received message.

2.  They should be attached to the ExecutionGraph, not the dedicated 
AccumulatorManager. That makes them accessible also for archived execution 
graphs.

3.  Accumulators should be sent periodically, as part of the heart beat that 
sends metrics. This allows them to be updated in real time

4. Accumulators should be stored fine grained (per executionvertex, or per 
execution) and the final value should be on computed by merging all involved 
ones. This allows users to access the per-subtask accumulators, which is often 
interesting.

5. Accumulators should subsume the aggregators by allowing to be versioned 
with a superstep. The versioned ones should be redistributed to the cluster 
after each superstep.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1667) Add tests for recovery with distributed process failure

2015-03-09 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1667:
---

 Summary: Add tests for recovery with distributed process failure
 Key: FLINK-1667
 URL: https://issues.apache.org/jira/browse/FLINK-1667
 Project: Flink
  Issue Type: Improvement
  Components: test
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


The system should have a test that actually spawns multiple TaskManager 
processes (JVMs) and tests how recovery works when one of them is killed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1668) Add a config option to specify delays between restarts

2015-03-09 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1668:
---

 Summary: Add a config option to specify delays between restarts
 Key: FLINK-1668
 URL: https://issues.apache.org/jira/browse/FLINK-1668
 Project: Flink
  Issue Type: Improvement
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


The system currently introduces a short delay between a failed task execution 
and the restarted execution.

The reason is that this delay seemed to help in letting problems surface that 
let to the failed task. As an example, if a TaskManager fails, tasks fail due 
to data transfer errors. The TaskManager is not immediately recognized as 
failed, though (takes a bit until heartbeats time out). Immediately 
re-deploying tasks has a very high chance of assigning work to the TaskManager 
that is actually not responding, causing the execution retry to fail again. The 
delay gives the system time to figure out that the TaskManager was lost and 
does not take it into account upon the retry.

Currently, the system uses the heartbeat timeout as the default delay value. 
This may make sense as a default value for critical task failures, but is 
actually quite high for other types of failures.

In any case, I would like to add an option for users to specify the delay (even 
set it to 0, if desired).

The delay is not the best solution, in my opinion, we should eventually move to 
something better. Ideas are to put TaskManagers responsible for failed tasks in 
a probationary mode until they have reported back that everything is good 
(still alive, disk space available, etc)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1648) Add a mode where the system automatically sets the parallelism to the available task slots

2015-03-04 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1648:
---

 Summary: Add a mode where the system automatically sets the 
parallelism to the available task slots
 Key: FLINK-1648
 URL: https://issues.apache.org/jira/browse/FLINK-1648
 Project: Flink
  Issue Type: New Feature
  Components: JobManager
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


This is basically a port of this code form the 0.8 release:

https://github.com/apache/flink/pull/410



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1649) Give a good error message when a user program emits a null record

2015-03-04 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1649:
---

 Summary: Give a good error message when a user program emits a 
null record
 Key: FLINK-1649
 URL: https://issues.apache.org/jira/browse/FLINK-1649
 Project: Flink
  Issue Type: Bug
  Components: Local Runtime
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1650) Suppress Akka's Netty Shutdown Errors through the log config

2015-03-04 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1650:
---

 Summary: Suppress Akka's Netty Shutdown Errors through the log 
config
 Key: FLINK-1650
 URL: https://issues.apache.org/jira/browse/FLINK-1650
 Project: Flink
  Issue Type: Bug
  Components: other
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


I suggest to set the logging for 
`org.jboss.netty.channel.DefaultChannelPipeline` to error, in order to get rid 
of the misleading stack trace caused by an akka/netty hickup on shutdown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1626) Spurious failure in MatchTask cancelling test

2015-03-02 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1626:
---

 Summary: Spurious failure in MatchTask cancelling test
 Key: FLINK-1626
 URL: https://issues.apache.org/jira/browse/FLINK-1626
 Project: Flink
  Issue Type: Bug
  Components: Local Runtime
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


This is a problem in the test. Cancelling a task may actually throw an 
exception (especially interrupted exceptions). The test can be modified to 
either only call cancel and not call interrupt, or to tolerate exceptions that 
are followup exceptions of interrupted exceptions. I would prepare a patch for 
the first approach.

The stack trace of the symptom is blow:
{code}
java.lang.RuntimeException: Hashtable closing was interrupted
at 
org.apache.flink.runtime.operators.hash.MutableHashTable.close(MutableHashTable.java:652)
at 
org.apache.flink.runtime.operators.hash.ReusingBuildFirstHashMatchIterator.close(ReusingBuildFirstHashMatchIterator.java:100)
at 
org.apache.flink.runtime.operators.MatchDriver.cleanup(MatchDriver.java:179)
at 
org.apache.flink.runtime.operators.testutils.DriverTestBase.testDriverInternal(DriverTestBase.java:245)
at 
org.apache.flink.runtime.operators.testutils.DriverTestBase.testDriver(DriverTestBase.java:175)
at 
org.apache.flink.runtime.operators.MatchTaskTest$4.run(MatchTaskTest.java:783)
Tests run: 47, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 15.589 sec  
FAILURE! - in org.apache.flink.runtime.operators.MatchTaskTest
testCancelHashMatchTaskWhileBuildFirst[1](org.apache.flink.runtime.operators.MatchTaskTest)
  Time elapsed: 1.029 sec   FAILURE!
java.lang.AssertionError: Test threw an exception even though it was properly 
canceled.
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
org.apache.flink.runtime.operators.MatchTaskTest.testCancelHashMatchTaskWhileBuildFirst(MatchTaskTest.java:802)
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1631) Port collisions in ProcessReaping tests

2015-03-02 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1631:
---

 Summary: Port collisions in ProcessReaping tests
 Key: FLINK-1631
 URL: https://issues.apache.org/jira/browse/FLINK-1631
 Project: Flink
  Issue Type: Bug
  Components: Distributed Runtime
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9


The process reaping tests for the JobManager spawn a process that starts a 
webserver on the default port. It may happen that this port is not available, 
due to another concurrently running task.

I suggest to add an option to not start the webserver to prevent this, by 
setting the webserver port to {{-1}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1608) TaskManagers may pick wrong network interface when starting before JobManager

2015-02-24 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1608:
---

 Summary: TaskManagers may pick wrong network interface when 
starting before JobManager
 Key: FLINK-1608
 URL: https://issues.apache.org/jira/browse/FLINK-1608
 Project: Flink
  Issue Type: Bug
  Components: TaskManager
Affects Versions: 0.9
Reporter: Stephan Ewen
 Fix For: 0.9


The taskmanagers use a NetUtils routine to find an interface that lets them 
talk to the Jobmanager. However, if the JobManager is not online yet, they fall 
back to localhost.

In cases where the TaskManagers start faster than the JobManager, they pick the 
wrong hostname and interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (FLINK-1598) Give better error messages when serializers run out of space.

2015-02-23 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-1598:
---

 Summary: Give better error messages when serializers run out of 
space.
 Key: FLINK-1598
 URL: https://issues.apache.org/jira/browse/FLINK-1598
 Project: Flink
  Issue Type: Bug
  Components: Local Runtime
Affects Versions: 0.9
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.9






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


<    2   3   4   5   6   7   8   >