[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335824
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java
 ---
@@ -131,6 +133,35 @@

private SerializedValue<StateHandle<?>> operatorState;
 
+   /* Lock for updating the accumulators atomically. */
--- End diff --

If you consider comment lines as empty (non-code) lines, then it follows 
the class's style.
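Style question aside, the locking pattern under discussion can be sketched outside of Flink as follows. Class and field names here are hypothetical stand-ins, not the actual `Execution.java` members:

```python
import threading

class AccumulatorHolder:
    """Sketch of a holder that swaps accumulator maps atomically.

    Hypothetical stand-in for the lock discussed in the diff; not
    Flink's actual Execution class.
    """

    def __init__(self):
        self._lock = threading.Lock()  # guards both maps as one unit
        self._flink_accumulators = {}
        self._user_accumulators = {}

    def update(self, flink_accs, user_accs):
        # Swap both maps under the lock so a concurrent snapshot never
        # observes one new map paired with one old map.
        with self._lock:
            self._flink_accumulators = dict(flink_accs)
            self._user_accumulators = dict(user_accs)

    def snapshot(self):
        with self._lock:
            return dict(self._flink_accumulators), dict(self._user_accumulators)
```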


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335830
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java ---
@@ -172,13 +173,20 @@
 
/** The library cache, from which the task can request its required JAR 
files */
private final LibraryCacheManager libraryCache;
-   
+
/** The cache for user-defined files that the invokable requires */
private final FileCache fileCache;
-   
+
/** The gateway to the network stack, which handles inputs and produced 
results */
private final NetworkEnvironment network;
 
+   /** The registry of this task which enables live reporting of 
accumulators */
+   private final AccumulatorRegistry accumulatorRegistry;
--- End diff --

Yes absolutely. This is an artifact of a former design.




[jira] [Commented] (FLINK-2292) Report accumulators periodically while job is running

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621978#comment-14621978
 ] 

ASF GitHub Bot commented on FLINK-2292:
---

Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335830
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java ---
@@ -172,13 +173,20 @@
 
/** The library cache, from which the task can request its required JAR 
files */
private final LibraryCacheManager libraryCache;
-   
+
/** The cache for user-defined files that the invokable requires */
private final FileCache fileCache;
-   
+
/** The gateway to the network stack, which handles inputs and produced 
results */
private final NetworkEnvironment network;
 
+   /** The registry of this task which enables live reporting of 
accumulators */
+   private final AccumulatorRegistry accumulatorRegistry;
--- End diff --

Yes absolutely. This is an artifact of a former design.


 Report accumulators periodically while job is running
 -

 Key: FLINK-2292
 URL: https://issues.apache.org/jira/browse/FLINK-2292
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager, TaskManager
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: 0.10


 Accumulators should be sent periodically, as part of the heartbeat that sends 
 metrics. This allows them to be updated in real time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2292) Report accumulators periodically while job is running

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621981#comment-14621981
 ] 

ASF GitHub Bot commented on FLINK-2292:
---

Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335963
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/accumulators/AccumulatorRegistry.java
 ---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.accumulators;
+
+import org.apache.flink.api.common.JobID;
+import org.apache.flink.api.common.accumulators.Accumulator;
+import org.apache.flink.api.common.accumulators.LongCounter;
+import org.apache.flink.runtime.executiongraph.ExecutionAttemptID;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+
+
+/**
+ * Main accumulator registry which encapsulates internal and user-defined 
accumulators.
+ */
+public class AccumulatorRegistry {
--- End diff --

True. Originally, those two were a bit different. The API can be changed to 
be the same, but I find keeping the two classes useful because it helps to 
differentiate accumulator context by type.




[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335963
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/accumulators/AccumulatorRegistry.java
 ---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.runtime.accumulators;
+
+import org.apache.flink.api.common.JobID;
+import org.apache.flink.api.common.accumulators.Accumulator;
+import org.apache.flink.api.common.accumulators.LongCounter;
+import org.apache.flink.runtime.executiongraph.ExecutionAttemptID;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+
+
+/**
+ * Main accumulator registry which encapsulates internal and user-defined 
accumulators.
+ */
+public class AccumulatorRegistry {
--- End diff --

True. Originally, those two were a bit different. The API can be changed to 
be the same, but I find keeping the two classes useful because it helps to 
differentiate accumulator context by type.




[jira] [Commented] (FLINK-2293) Division by Zero Exception

2015-07-10 Thread Andra Lungu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621986#comment-14621986
 ] 

Andra Lungu commented on FLINK-2293:


Hi [~StephanEwen],

Thank you for your patience. Indeed I had the wrong update for Flink and then I 
had to fix some minor glitches in my code :)
But everything is perfect now! I tested it on all the jobs that previously 
failed with this exception.  

 Division by Zero Exception
 --

 Key: FLINK-2293
 URL: https://issues.apache.org/jira/browse/FLINK-2293
 Project: Flink
  Issue Type: Bug
  Components: Local Runtime
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Assignee: Stephan Ewen
Priority: Critical
 Fix For: 0.10, 0.9.1


 I am basically running an algorithm that simulates a Gather Sum Apply 
 Iteration that performs Triangle Count (Why simulate it? Because you just 
 need a superstep - useless overhead if you use the runGatherSumApply 
 function in Graph).
 What happens, at a high level:
 1). Select neighbors with an ID greater than the one corresponding to the 
 current vertex;
 2). Propagate the received values to neighbors with a higher ID;
 3). Compute the number of triangles by checking whether
 trgVertex.getValue().get(srcVertex.getId());
 As you can see, I *do not* perform any division at all;
 code is here: 
 https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
 Now for small graphs, 50MB max, the computation finishes nicely with the 
 correct result. For a 10GB graph, however, I got this:
 java.lang.ArithmeticException: / by zero
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.insertIntoTable(MutableHashTable.java:836)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:819)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:544)
 at 
 org.apache.flink.runtime.operators.hash.NonReusingBuildFirstHashMatchIterator.callWithNextKey(NonReusingBuildFirstHashMatchIterator.java:104)
 at 
 org.apache.flink.runtime.operators.MatchDriver.run(MatchDriver.java:173)
 at 
 org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
 at 
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
 at java.lang.Thread.run(Thread.java:722)
 see the full log here: https://gist.github.com/andralungu/984774f6348269df7951
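The trace above points into the hash table's bucket arithmetic, not user code. As a hedged illustration of this class of failure (toy formula, not Flink's actual MutableHashTable logic), a bucket count derived from other quantities can silently reach zero, at which point any modulo by it throws:

```python
def assign_bucket(hash_code, num_memory_segments, segments_per_bucket):
    """Toy bucket assignment: derive a bucket count, then take the hash
    modulo that count. Illustrative only; not Flink's real formula."""
    num_buckets = num_memory_segments // segments_per_bucket
    if num_buckets == 0:
        # Without this guard, hash_code % num_buckets raises
        # ZeroDivisionError, the Python analogue of Java's
        # "ArithmeticException: / by zero" from the report above.
        raise ValueError("no memory segments left to form a bucket")
    return hash_code % num_buckets
```

This is why the exception only appears for the 10GB input: only large spilled partitions drive the derived count down to zero.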





[jira] [Resolved] (FLINK-2293) Division by Zero Exception

2015-07-10 Thread Andra Lungu (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andra Lungu resolved FLINK-2293.

Resolution: Fixed

Sorry for reopening it when it was already fixed. Never hurts to make sure that 
it also works in the environment where it previously failed :). Thanks again!



[jira] [Commented] (FLINK-2293) Division by Zero Exception

2015-07-10 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621997#comment-14621997
 ] 

Stephan Ewen commented on FLINK-2293:
-

No worries, just glad the bug is fixed :-)



[jira] [Commented] (FLINK-2292) Report accumulators periodically while job is running

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621969#comment-14621969
 ] 

ASF GitHub Bot commented on FLINK-2292:
---

Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335420
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala
 ---
@@ -1017,6 +1040,9 @@ object TaskManager {
 
   val HEARTBEAT_INTERVAL: FiniteDuration = 5000 milliseconds
 
+  /* Interval to send accumulators to the job manager  */
+  val ACCUMULATOR_REPORT_INTERVAL: FiniteDuration = 10 seconds
--- End diff --

Removed. The update interval corresponds to the heartbeat interval.




[jira] [Commented] (FLINK-2292) Report accumulators periodically while job is running

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621965#comment-14621965
 ] 

ASF GitHub Bot commented on FLINK-2292:
---

Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335345
  
--- Diff: 
flink-staging/flink-streaming/flink-streaming-core/src/main/java/org/apache/flink/streaming/runtime/io/StreamingAbstractRecordReader.java
 ---
@@ -57,6 +58,12 @@
 
private final BarrierBuffer barrierBuffer;
 
+   /**
+* Counters for the number of bytes read and records processed.
+*/
+   private LongCounter numRecordsRead = null;
--- End diff --

Thanks, I didn't know.
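The counters in the diff track bytes and records per reader. A minimal sketch of that bookkeeping pattern, using hypothetical names rather than Flink's actual LongCounter API:

```python
class LongCounter:
    """Tiny counter mirroring the add/get shape of an accumulator."""

    def __init__(self):
        self._value = 0

    def add(self, delta):
        self._value += delta

    def get_local_value(self):
        return self._value


class CountingReader:
    """Wraps a source of byte records and counts records and bytes read.

    Hypothetical stand-in for the record reader under review."""

    def __init__(self, records):
        self._records = iter(records)
        self.num_records_read = LongCounter()
        self.num_bytes_read = LongCounter()

    def next_record(self):
        # Returns the next record, or None when exhausted, updating
        # both counters as a side effect.
        record = next(self._records, None)
        if record is not None:
            self.num_records_read.add(1)
            self.num_bytes_read.add(len(record))
        return record
```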




[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335420
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala
 ---
@@ -1017,6 +1040,9 @@ object TaskManager {
 
   val HEARTBEAT_INTERVAL: FiniteDuration = 5000 milliseconds
 
+  /* Interval to send accumulators to the job manager  */
+  val ACCUMULATOR_REPORT_INTERVAL: FiniteDuration = 10 seconds
--- End diff --

Removed. The update interval corresponds to the heartbeat interval.




[jira] [Commented] (FLINK-2292) Report accumulators periodically while job is running

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621977#comment-14621977
 ] 

ASF GitHub Bot commented on FLINK-2292:
---

Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335824
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java
 ---
@@ -131,6 +133,35 @@

private SerializedValue<StateHandle<?>> operatorState;
 
+   /* Lock for updating the accumulators atomically. */
--- End diff --

If you consider comment lines as empty (non-code) lines, then it follows 
the class's style.




[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread mxm
Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/896#issuecomment-120302065
  
> Looks like this change breaks the YARN integration. The YARN WordCount no 
longer works.

Should be working again now.

> It would be good if the accumulator update interval was configurable.
> Edit: Is that the same value as the heartbeats?

Yes, that was a design rationale to keep the message count low. We could 
send the accumulators only in every Nth heartbeat and make that configurable.
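The "every Nth heartbeat" idea can be sketched as follows (hypothetical shape, not the actual TaskManager code):

```python
class HeartbeatReporter:
    """Emits a heartbeat on every tick, but attaches the accumulator
    snapshot only on every Nth tick. N is the hypothetical knob that
    would make the report interval configurable."""

    def __init__(self, report_every_n):
        self.report_every_n = report_every_n
        self._ticks = 0

    def heartbeat(self, accumulators):
        self._ticks += 1
        message = {"alive": True}
        if self._ticks % self.report_every_n == 0:
            # Piggyback the accumulator snapshot on this heartbeat only.
            message["accumulators"] = dict(accumulators)
        return message
```

With N = 1 this degenerates to the current behavior of reporting on every heartbeat.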

> There is a potential modification conflict: drawing a snapshot for 
serialization and registering a new accumulator can lead to a 
ConcurrentModificationException in the drawing of the snapshot.

I conducted tests with concurrent insertions and deletions and found that 
only concurrent removals cause ConcurrentModificationExceptions. Removals are 
not allowed for accumulators. Anyway, we could switch to a synchronized or 
copy-on-write hash map. If we do, I would opt for the latter.
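The copy-on-write option favored here can be sketched like this (illustrative only, not Flink's registry):

```python
import threading

class CopyOnWriteRegistry:
    """Registrations replace the map wholesale instead of mutating it,
    so a snapshot taken concurrently keeps iterating the old map and can
    never hit a concurrent structural change. Sketch only."""

    def __init__(self):
        self._lock = threading.Lock()
        self._accumulators = {}  # treated as immutable once published

    def register(self, name, accumulator):
        with self._lock:
            updated = dict(self._accumulators)  # copy ...
            updated[name] = accumulator
            self._accumulators = updated        # ... then publish

    def snapshot(self):
        # Reading the reference is atomic; the returned map is never
        # mutated afterwards, so iterating it is always safe.
        return self._accumulators
```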

> The naming of the accumulators sometimes refers to flink vs. 
user-defined, and sometimes to internal vs. external. Can we make this 
consistent? I actually like the flink vs. user-defined naming better.

Then let's stick to the flink vs. user-defined naming scheme.

> I think the code would be simpler if the registry simply always had a 
created map for internal and external accumulators. Also, a reporter object 
would help.

I agree that would be a nicer way of dealing with the API.







[jira] [Commented] (FLINK-2292) Report accumulators periodically while job is running

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621992#comment-14621992
 ] 

ASF GitHub Bot commented on FLINK-2292:
---

Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/896#issuecomment-120302065
  
> Looks like this change breaks the YARN integration. The YARN WordCount no 
longer works.

Should be working again now.

> It would be good if the accumulator update interval was configurable.
> Edit: Is that the same value as the heartbeats?

Yes, that was a design rationale to keep the message count low. We could 
send the accumulators only in every Nth heartbeat and make that configurable.

> There is a potential modification conflict: drawing a snapshot for 
serialization and registering a new accumulator can lead to a 
ConcurrentModificationException in the drawing of the snapshot.

I conducted tests with concurrent insertions and deletions and found that 
only concurrent removals cause ConcurrentModificationExceptions. Removals are 
not allowed for accumulators. Anyway, we could switch to a synchronized or 
copy-on-write hash map. If we do, I would opt for the latter.

> The naming of the accumulators sometimes refers to flink vs. 
user-defined, and sometimes to internal vs. external. Can we make this 
consistent? I actually like the flink vs. user-defined naming better.

Then let's stick to the flink vs. user-defined naming scheme.

> I think the code would be simpler if the registry simply always had a 
created map for internal and external accumulators. Also, a reporter object 
would help.

I agree that would be a nicer way of dealing with the API.







[jira] [Resolved] (FLINK-2339) Prevent asynchronous checkpoint calls from overtaking each other

2015-07-10 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen resolved FLINK-2339.
-
Resolution: Fixed

Fixed in cbde2c2a3d71e17990d76d603e1bb6d275c888be

 Prevent asynchronous checkpoint calls from overtaking each other
 

 Key: FLINK-2339
 URL: https://issues.apache.org/jira/browse/FLINK-2339
 Project: Flink
  Issue Type: Improvement
  Components: TaskManager
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


 Currently, when checkpoint state materialization takes very long, and the 
 checkpoint interval is low, the asynchronous calls to trigger checkpoints (on 
 the sources) could overtake prior calls.
 We can fix that by making sure that all calls are dispatched in order by the 
 same thread, rather than spawning a new thread for each call.
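The fix described above, dispatching all trigger calls in order on one thread instead of spawning a thread per call, can be sketched with a single-worker executor (an illustration of the idea, not the actual commit):

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_in_order(checkpoint_ids):
    """Runs all checkpoint trigger calls on one worker thread so later
    calls cannot overtake earlier ones. Sketch only; not the actual
    Flink change referenced in the fix commit."""
    triggered = []
    with ThreadPoolExecutor(max_workers=1) as executor:
        # A single worker drains its FIFO queue, preserving submit order.
        futures = [executor.submit(triggered.append, cid)
                   for cid in checkpoint_ids]
        for future in futures:
            future.result()  # surface any exception from the worker
    return triggered
```

With one thread per call, by contrast, the OS scheduler could run a later trigger before an earlier, slower one.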





[jira] [Closed] (FLINK-2339) Prevent asynchronous checkpoint calls from overtaking each other

2015-07-10 Thread Stephan Ewen (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Ewen closed FLINK-2339.
---



[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread StephanEwen
Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/896#issuecomment-120305327
  
The reporter object has the advantage that it is more easily extensible. At 
some point we will want to differentiate between locally received bytes, and 
remotely received bytes, for example. Or include wait/stall times to detect 
whether a task is throttled by a slower predecessor.
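The extensibility argument can be made concrete with a small reporter sketch; all metric names below are hypothetical, not Flink's API:

```python
class TaskIOReporter:
    """One reporter object per task: a single place to add new metrics
    later (local vs. remote bytes, wait/stall times, ...). All names
    here are hypothetical illustrations."""

    def __init__(self):
        self.local_bytes_in = 0
        self.remote_bytes_in = 0
        self.stall_nanos = 0

    def report_bytes(self, count, remote):
        # Differentiating local vs. remote input is the kind of future
        # extension the reporter object makes cheap.
        if remote:
            self.remote_bytes_in += count
        else:
            self.local_bytes_in += count

    def report_stall(self, nanos):
        # Time spent waiting on a slower predecessor; would let the
        # runtime detect a throttled task.
        self.stall_nanos += nanos

    def snapshot(self):
        return {"localBytesIn": self.local_bytes_in,
                "remoteBytesIn": self.remote_bytes_in,
                "stallNanos": self.stall_nanos}
```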




[jira] [Commented] (FLINK-2292) Report accumulators periodically while job is running

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622010#comment-14622010
 ] 

ASF GitHub Bot commented on FLINK-2292:
---

Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/896#issuecomment-120305327
  
The reporter object has the advantage that it is more easily extensible. At 
some point we will want to differentiate between locally received bytes, and 
remotely received bytes, for example. Or include wait/stall times to detect 
whether a task is throttled by a slower predecessor.




[jira] [Commented] (FLINK-2180) Streaming iterate test fails spuriously

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621938#comment-14621938
 ] 

ASF GitHub Bot commented on FLINK-2180:
---

Github user gyfora commented on the pull request:

https://github.com/apache/flink/pull/802#issuecomment-120276079
  
I am currently preparing a new PR with a rework of how the iterations are 
constructed in the StreamGraph; it will be ready today. It will solve many 
issues that we currently have and will make the iterations much more robust.

I am also adding a bunch of new tests.

I think this is still a good addition because some of my tests will depend 
on the maxWaitTime. I propose to add your changes afterwards. I volunteer to 
port your commits to my changes :)


 Streaming iterate test fails spuriously
 ---

 Key: FLINK-2180
 URL: https://issues.apache.org/jira/browse/FLINK-2180
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9
Reporter: Márton Balassi
Assignee: Gábor Hermann

 Following output seen occasionally: 
 Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 3.667 sec  
 FAILURE! - in org.apache.flink.streaming.api.IterateTest
 test(org.apache.flink.streaming.api.IterateTest)  Time elapsed: 3.662 sec  
  FAILURE!
 java.lang.AssertionError: null
   at org.junit.Assert.fail(Assert.java:86)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.junit.Assert.assertTrue(Assert.java:52)
   at org.apache.flink.streaming.api.IterateTest.test(IterateTest.java:154)
 See: https://travis-ci.org/mbalassi/flink/jobs/65803465





[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335325
  
--- Diff: 
flink-examples/flink-java-examples/src/main/java/org/apache/flink/examples/java/wordcount/WordCount.java
 ---
@@ -60,6 +60,7 @@ public static void main(String[] args) throws Exception {

// set up the execution environment
final ExecutionEnvironment env = 
ExecutionEnvironment.getExecutionEnvironment();
+   env.setParallelism(1);
--- End diff --

Thanks for spotting it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335345
  
--- Diff: 
flink-staging/flink-streaming/flink-streaming-core/src/main/java/org/apache/flink/streaming/runtime/io/StreamingAbstractRecordReader.java
 ---
@@ -57,6 +58,12 @@
 
private final BarrierBuffer barrierBuffer;
 
+   /**
+* Counters for the number of bytes read and records processed.
+*/
+   private LongCounter numRecordsRead = null;
--- End diff --

Thanks, I didn't know.




[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335337
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java ---
@@ -172,13 +173,20 @@
 
/** The library cache, from which the task can request its required JAR 
files */
private final LibraryCacheManager libraryCache;
-   
+
/** The cache for user-defined files that the invokable requires */
private final FileCache fileCache;
-   
+
/** The gateway to the network stack, which handles inputs and produced 
results */
private final NetworkEnvironment network;
 
+   /** The registry of this task which enables live reporting of 
accumulators */
+   private final AccumulatorRegistry accumulatorRegistry;
+
+   public AccumulatorRegistry getAccumulatorRegistry() {
--- End diff --

I agree.




[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335380
  
--- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/taskmanager/TaskManager.scala
 ---
@@ -306,99 +308,100 @@ extends Actor with ActorLogMessages with ActorSynchronousLogging {
 if (!isConnected) {
   log.debug(s"Dropping message $message because the TaskManager is currently " +
 "not connected to a JobManager.")
-}
+} else {
 
-// we order the messages by frequency, to make sure the code paths for matching
-// are as short as possible
-message match {
+  // we order the messages by frequency, to make sure the code paths for matching
+  // are as short as possible
+  message match {
+
+// tell the task about the availability of a new input partition
+case UpdateTaskSinglePartitionInfo(executionID, resultID, partitionInfo) =>
+  updateTaskInputPartitions(executionID, List((resultID, partitionInfo)))
+
+// tell the task about the availability of some new input partitions
+case UpdateTaskMultiplePartitionInfos(executionID, partitionInfos) =>
+  updateTaskInputPartitions(executionID, partitionInfos)
+
+// discards intermediate result partitions of a task execution on this TaskManager
+case FailIntermediateResultPartitions(executionID) =>
+  log.info("Discarding the results produced by task execution " + executionID)
+  if (network.isAssociated) {
+try {
+  network.getPartitionManager.releasePartitionsProducedBy(executionID)
+} catch {
+  case t: Throwable => killTaskManagerFatal(
+"Fatal leak: Unable to release intermediate result partition data", t)
+}
+  }
 
-  // tell the task about the availability of a new input partition
-  case UpdateTaskSinglePartitionInfo(executionID, resultID, partitionInfo) =>
-updateTaskInputPartitions(executionID, List((resultID, partitionInfo)))
+// notifies the TaskManager that the state of a task has changed.
+// the TaskManager informs the JobManager and cleans up in case the transition
+// was into a terminal state, or in case the JobManager cannot be informed of the
+// state transition
 
-  // tell the task about the availability of some new input partitions
-  case UpdateTaskMultiplePartitionInfos(executionID, partitionInfos) =>
-updateTaskInputPartitions(executionID, partitionInfos)
+case updateMsg@UpdateTaskExecutionState(taskExecutionState: TaskExecutionState) =>
 
-  // discards intermediate result partitions of a task execution on this TaskManager
-  case FailIntermediateResultPartitions(executionID) =>
-log.info("Discarding the results produced by task execution " + executionID)
-if (network.isAssociated) {
-  try {
-network.getPartitionManager.releasePartitionsProducedBy(executionID)
-  } catch {
-case t: Throwable => killTaskManagerFatal(
-"Fatal leak: Unable to release intermediate result partition data", t)
-  }
-}
+  // we receive these from our tasks and forward them to the JobManager
--- End diff --

This is a bug I discovered while reading through the code. It prevents 
processing of messages when the task manager is not connected to the job 
manager. If you look at line 307, it says it would skip the message but 
continues to process it. If you want I can open a separate pull request.
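The bug pattern being described can be shown with a minimal, self-contained sketch (a hypothetical Java stand-in for the Scala actor code; the actual fix in the diff is exactly the added `} else {`):

```java
// Illustration of the guard bug: the "not connected" branch only logs,
// so execution falls through and the message is still processed.
public class GuardSketch {
    static boolean connected = false;
    static boolean processed;

    // Buggy version: logs "dropping" but keeps going.
    static void handleBuggy(String message) {
        if (!connected) {
            System.out.println("Dropping message " + message);
        }
        processed = true; // still runs -- the message was NOT dropped
    }

    // Fixed version: processing moves into the else branch.
    static void handleFixed(String message) {
        if (!connected) {
            System.out.println("Dropping message " + message);
        } else {
            processed = true;
        }
    }

    public static void main(String[] args) {
        processed = false;
        handleBuggy("msg");
        System.out.println(processed);  // prints true  -- the bug
        processed = false;
        handleFixed("msg");
        System.out.println(processed);  // prints false -- actually dropped
    }
}
```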




[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335330
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/RuntimeEnvironment.java
 ---
@@ -212,19 +218,19 @@ public InputGate getInputGate(int index) {
return inputGates;
}
 
-   @Override
-   public void reportAccumulators(Map<String, Accumulator<?, ?>> accumulators) {
-   AccumulatorEvent evt;
-   try {
-   evt = new AccumulatorEvent(getJobID(), accumulators);
-   }
-   catch (IOException e) {
-   throw new RuntimeException("Cannot serialize accumulators to send them to JobManager", e);
-   }
-
-   ReportAccumulatorResult accResult = new 
ReportAccumulatorResult(jobId, executionId, evt);
-   jobManagerActor.tell(accResult, ActorRef.noSender());
-   }
+// @Override
--- End diff --

Yes.




[GitHub] flink pull request: [FLINK-2292][FLINK-1573] add live per-task acc...

2015-07-10 Thread mxm
Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335342
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java ---
@@ -237,12 +246,13 @@ public Task(TaskDeploymentDescriptor tdd,
 
this.memoryManager = checkNotNull(memManager);
this.ioManager = checkNotNull(ioManager);
-   this.broadcastVariableManager =checkNotNull(bcVarManager);
+   this.broadcastVariableManager = checkNotNull(bcVarManager);
+   this.accumulatorRegistry = accumulatorRegistry;
 
this.jobManager = checkNotNull(jobManagerActor);
this.taskManager = checkNotNull(taskManagerActor);
this.actorAskTimeout = new 
Timeout(checkNotNull(actorAskTimeout));
-   
--- End diff --

No auto formats, just removes redundant tabs in empty lines and corrects a 
missing space.




[jira] [Commented] (FLINK-2292) Report accumulators periodically while job is running

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621964#comment-14621964
 ] 

ASF GitHub Bot commented on FLINK-2292:
---

Github user mxm commented on a diff in the pull request:

https://github.com/apache/flink/pull/896#discussion_r34335342
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java ---
@@ -237,12 +246,13 @@ public Task(TaskDeploymentDescriptor tdd,
 
this.memoryManager = checkNotNull(memManager);
this.ioManager = checkNotNull(ioManager);
-   this.broadcastVariableManager =checkNotNull(bcVarManager);
+   this.broadcastVariableManager = checkNotNull(bcVarManager);
+   this.accumulatorRegistry = accumulatorRegistry;
 
this.jobManager = checkNotNull(jobManagerActor);
this.taskManager = checkNotNull(taskManagerActor);
this.actorAskTimeout = new 
Timeout(checkNotNull(actorAskTimeout));
-   
--- End diff --

No auto formats, just removes redundant tabs in empty lines and corrects a 
missing space.


 Report accumulators periodically while job is running
 -

 Key: FLINK-2292
 URL: https://issues.apache.org/jira/browse/FLINK-2292
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager, TaskManager
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: 0.10


 Accumulators should be sent periodically, as part of the heartbeat that sends 
 metrics. This allows them to be updated in real time.











[GitHub] flink pull request: [doc] Fix wordcount example in YARN setup

2015-07-10 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/897#issuecomment-120279506
  
+1 to merge




[GitHub] flink pull request: [FLINK-2310] Add an Adamic Adar Similarity exa...

2015-07-10 Thread vasia
Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/892#issuecomment-120319905
  
Hi @shghatge!

I agree, let's deal with the approximate version as a separate issue. In 
the end though, it would be nice to have a single library method and an input 
parameter to decide whether the computation should be exact or approximate.

Regarding the bloom filter, the idea is for each vertex to build a bloom 
filter with its neighbors and send it to its neighbors. Then, each vertex can 
compare its own neighborhood (the exact one) with the received bloom filter 
neighborhoods. Take a look at how approximate Jaccard is computed in the okapi 
library 
[here](https://github.com/grafos-ml/okapi/blob/master/src/main/java/ml/grafos/okapi/graphs/similarity/Jaccard.java)
 (class `JaccardApproximation`).

Let me know if you have more questions :)
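As a rough, self-contained illustration of that idea (plain Java with a hand-rolled two-hash Bloom filter, not the Gelly or okapi implementation): each vertex encodes its neighbor set into a small bit array and ships the bit array instead of the full set; the receiver then probes its own exact neighborhood against the received filter to estimate the overlap.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Minimal Bloom filter over neighbor IDs: two cheap hash probes per element.
class NeighborBloomFilter {
    private final BitSet bits;
    private final int size;

    NeighborBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    private int h1(long v) { return (int) (Math.abs(v * 31L) % size); }
    private int h2(long v) { return (int) (Math.abs(v * 17L + 7L) % size); }

    void add(long neighborId) {
        bits.set(h1(neighborId));
        bits.set(h2(neighborId));
    }

    boolean mightContain(long neighborId) {
        return bits.get(h1(neighborId)) && bits.get(h2(neighborId));
    }
}

public class ApproxOverlap {
    // Estimated number of common neighbors between 'myNeighbors' (exact)
    // and a remote vertex's neighborhood (received as a Bloom filter).
    static int estimateCommonNeighbors(Set<Long> myNeighbors, NeighborBloomFilter remote) {
        int count = 0;
        for (long n : myNeighbors) {
            if (remote.mightContain(n)) count++; // may overcount (false positives)
        }
        return count;
    }

    public static void main(String[] args) {
        NeighborBloomFilter remote = new NeighborBloomFilter(1024);
        remote.add(1L); remote.add(2L); remote.add(3L);

        Set<Long> mine = new HashSet<>();
        mine.add(2L); mine.add(3L); mine.add(99L);

        System.out.println(estimateCommonNeighbors(mine, remote)); // prints 2
    }
}
```

The trade-off is exactly the one discussed above: the filter is much smaller than the neighbor list, at the cost of occasional false positives that make the similarity estimate approximate.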




[jira] [Commented] (FLINK-2310) Add an Adamic-Adar Similarity example

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622039#comment-14622039
 ] 

ASF GitHub Bot commented on FLINK-2310:
---

Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/892#issuecomment-120319905
  
Hi @shghatge!

I agree, let's deal with the approximate version as a separate issue. In 
the end though, it would be nice to have a single library method and an input 
parameter to decide whether the computation should be exact or approximate.

Regarding the bloom filter, the idea is for each vertex to build a bloom 
filter with its neighbors and send it to its neighbors. Then, each vertex can 
compare its own neighborhood (the exact one) with the received bloom filter 
neighborhoods. Take a look at how approximate Jaccard is computed in the okapi 
library 
[here](https://github.com/grafos-ml/okapi/blob/master/src/main/java/ml/grafos/okapi/graphs/similarity/Jaccard.java)
 (class `JaccardApproximation`).

Let me know if you have more questions :)


 Add an Adamic-Adar Similarity example
 -

 Key: FLINK-2310
 URL: https://issues.apache.org/jira/browse/FLINK-2310
 Project: Flink
  Issue Type: Task
  Components: Gelly
Reporter: Andra Lungu
Assignee: Shivani Ghatge
Priority: Minor

 Just as Jaccard, the Adamic-Adar algorithm measures the similarity between a 
 set of nodes. However, instead of counting the common neighbors and dividing 
 them by the total number of neighbors, the similarity is weighted according 
 to the vertex degrees. In particular, it's equal to log(1/numberOfEdges).
 The Adamic-Adar algorithm can be broken into three steps: 
 1). For each vertex, compute the log of its inverse degrees (with the formula 
 above) and set it as the vertex value. 
 2). Each vertex will then send this new computed value along with a list of 
 neighbors to the targets of its out-edges
 3). Weigh the edges with the Adamic-Adar index: Sum over n from CN of 
 log(1/k_n)(CN is the set of all common neighbors of two vertices x, y. k_n is 
 the degree of node n). See [2]
 Prerequisites: 
 - Full understanding of the Jaccard Similarity Measure algorithm
 - Reading the associated literature: 
 [1] http://social.cs.uiuc.edu/class/cs591kgk/friendsadamic.pdf
 [2] 
 http://stackoverflow.com/questions/22565620/fast-algorithm-to-compute-adamic-adar
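The three steps can be condensed into a small single-machine sketch (plain Java, not Gelly). Note it uses the commonly stated Adamic-Adar weight 1/log(k_n) for a shared neighbor n of degree k_n, whereas the issue text writes log(1/k_n); the code below follows the usual form:

```java
import java.util.*;

// Tiny illustration of the Adamic-Adar index on an undirected graph,
// represented as an adjacency map: vertex -> set of neighbor IDs.
public class AdamicAdarSketch {
    static double adamicAdar(Map<Integer, Set<Integer>> adj, int x, int y) {
        Set<Integer> common = new HashSet<>(adj.get(x));
        common.retainAll(adj.get(y)); // CN: common neighbors of x and y
        double score = 0.0;
        for (int n : common) {
            score += 1.0 / Math.log(adj.get(n).size()); // 1 / log(k_n)
        }
        return score;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        adj.put(1, new HashSet<>(Arrays.asList(2, 3)));
        adj.put(2, new HashSet<>(Arrays.asList(1, 3)));
        adj.put(3, new HashSet<>(Arrays.asList(1, 2, 4)));
        adj.put(4, new HashSet<>(Collections.singletonList(3)));
        // vertices 1 and 2 share neighbor 3 (degree 3): score = 1/log(3)
        System.out.printf("%.4f%n", adamicAdar(adj, 1, 2)); // prints 0.9102
    }
}
```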





[GitHub] flink pull request: [Flink-gelly] Added local clustering coefficie...

2015-07-10 Thread samk3211
GitHub user samk3211 opened a pull request:

https://github.com/apache/flink/pull/898

[Flink-gelly] Added local clustering coefficient (for directed graph) example in the tutorialExamples subfolder.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/samk3211/flink tutorial

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/898.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #898


commit 1cacbd8ab58e431c32db0c2bef2538df81e088b1
Author: Samia samia
Date:   2015-07-10T09:25:46Z

[Flink-gelly] Added local clustering coefficient (directed graph) example 
in the tutorialExamples subfolder.






[GitHub] flink pull request: [Flink-gelly] Added local clustering coefficie...

2015-07-10 Thread samk3211
Github user samk3211 closed the pull request at:

https://github.com/apache/flink/pull/898




[jira] [Created] (FLINK-2342) Add new fit operation and more tests for StandardScaler

2015-07-10 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2342:
--

 Summary: Add new fit operation and more tests for StandardScaler
 Key: FLINK-2342
 URL: https://issues.apache.org/jira/browse/FLINK-2342
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
Priority: Minor
 Fix For: 0.10


StandardScaler currently has a transform operation for types (Vector, Double), 
but no  corresponding fit operation.

The test cases also do not cover all the possible types that we can call fit 
and transform on.
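For readers unfamiliar with the terminology, here is a minimal sketch of what fit and transform mean for a standard scaler (plain Java, not the Flink ML Scala API; the class and method names are illustrative only):

```java
// fit learns per-feature statistics; transform applies them.
public class StandardScalerSketch {
    double mean;
    double std;

    // fit: learn mean and (population) standard deviation from the data.
    void fit(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        mean = sum / values.length;
        double sq = 0;
        for (double v : values) sq += (v - mean) * (v - mean);
        std = Math.sqrt(sq / values.length);
    }

    // transform: shift and scale each value to zero mean, unit variance.
    double[] transform(double[] values) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (values[i] - mean) / std;
        }
        return out;
    }

    public static void main(String[] args) {
        StandardScalerSketch scaler = new StandardScalerSketch();
        scaler.fit(new double[]{1.0, 2.0, 3.0});
        double[] scaled = scaler.transform(new double[]{1.0, 2.0, 3.0});
        System.out.println(scaled[1]); // prints 0.0 (the mean maps to zero)
    }
}
```

The issue is that the library offers the transform half for some types without the matching fit half, so the statistics cannot be learned for those inputs in the first place.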





[jira] [Commented] (FLINK-2342) Add new fit operation and more tests for StandardScaler

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622075#comment-14622075
 ] 

ASF GitHub Bot commented on FLINK-2342:
---

GitHub user thvasilo opened a pull request:

https://github.com/apache/flink/pull/899

[FLINK-2342] [ml]  Add new fit operation and more tests for StandardScaler

StandardScaler currently has a transform operation for types (Vector, 
Double), but no corresponding fit operation.

The test cases also do not cover all the possible types that we can call 
fit and transform on.

This PR addresses this, and removes some unused code.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thvasilo/flink scaler-extra-tests

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/899.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #899


commit aa18b3dc82475c179f56fad25363184a47c36c2f
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-10T09:34:32Z

More tests and a new operation for StandardScaler




 Add new fit operation and more tests for StandardScaler
 ---

 Key: FLINK-2342
 URL: https://issues.apache.org/jira/browse/FLINK-2342
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
Priority: Minor
  Labels: ML
 Fix For: 0.10


 StandardScaler currently has a transform operation for types (Vector, 
 Double), but no  corresponding fit operation.
 The test cases also do not cover all the possible types that we can call fit 
 and transform on.







[GitHub] flink pull request: Make streaming iterations more robust

2015-07-10 Thread gyfora
GitHub user gyfora opened a pull request:

https://github.com/apache/flink/pull/900

Make streaming iterations more robust

This PR reworks the way iterations are constructed in the StreamGraph, from 
the previous eager approach to constructing them when we actually create the JobGraph.

This new approach solves many known and unknown issues that the previous 
approach had, such as issues with multiple tail and head operators 
(FLINK-2328), issues with partitioning on the feedback stream, issues with 
ConnectedIterations and the immutability of the IterativeDataStream, etc.

I also included a new set of tests to validate all the expected 
functionality:

- Tests for simple and connected iterations with more heads and tails
- Parallelism, partitioning and co-location checks
- Tests for the exceptions thrown during job creation for invalid iterative topologies

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gyfora/flink master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/900.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #900


commit 9fef88306bc79f1f2e0520ddeb2ef10be8e4de09
Author: Gyula Fora gyf...@apache.org
Date:   2015-07-10T11:26:44Z

[FLINK-2335] [streaming] Lazy iteration construction in StreamGraph






[GitHub] flink pull request: [FLINK-2310] Add an Adamic Adar Similarity exa...

2015-07-10 Thread shghatge
Github user shghatge commented on the pull request:

https://github.com/apache/flink/pull/892#issuecomment-120366455
  
Hello @vasia 
Thanks for the clarification. I was only thinking of changing the data 
structure :) But this method seems much more sensible. In any case... Should I 
close this pull request? Or should I commit the changes made to it according to 
your first suggestion?




[jira] [Commented] (FLINK-2106) Add outer joins to API, Optimizer, and Runtime

2015-07-10 Thread Johann Kovacs (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622235#comment-14622235
 ] 

Johann Kovacs commented on FLINK-2106:
--

[~r-pogalz] and I would like to start working on this as part of our project in 
the next couple of weeks. Can you please assign (one of) us?

 Add outer joins to API, Optimizer, and Runtime
 --

 Key: FLINK-2106
 URL: https://issues.apache.org/jira/browse/FLINK-2106
 Project: Flink
  Issue Type: Sub-task
  Components: Java API, Local Runtime, Optimizer, Scala API
Reporter: Fabian Hueske
Priority: Minor
 Fix For: pre-apache


 Add left/right/full outer join methods to the DataSet APIs (Java, Scala), to 
 the optimizer, and the runtime of Flink.
 Initially, the execution strategy should be a sort-merge outer join 
 (FLINK-2105) but can later be extended to hash joins for left/right outer 
 joins.
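A sketch of the sort-merge full outer join strategy on two key-sorted inputs (plain Java, simplified to unique keys; a real runtime implementation must also handle duplicate keys, i.e. cross products per key group, and secondary sorting):

```java
import java.util.ArrayList;
import java.util.List;

// Merge two ascending, duplicate-free key streams; keys missing on one
// side are emitted with a "null" partner, giving full outer join semantics.
public class SortMergeOuterJoin {
    static List<String> fullOuterJoin(int[] left, int[] right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] == right[j]) {
                out.add(left[i] + "|" + right[j]); i++; j++;  // matched pair
            } else if (left[i] < right[j]) {
                out.add(left[i] + "|null"); i++;   // left-outer part
            } else {
                out.add("null|" + right[j]); j++;  // right-outer part
            }
        }
        // drain whichever side is left over
        while (i < left.length) out.add(left[i++] + "|null");
        while (j < right.length) out.add("null|" + right[j++]);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fullOuterJoin(new int[]{1, 2, 4}, new int[]{2, 3, 4}));
        // prints [1|null, 2|2, null|3, 4|4]
    }
}
```

Left and right outer joins fall out of the same merge by suppressing one of the two unmatched branches, which is why a working sort-merge outer join (FLINK-2105) unlocks all three API variants.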





[GitHub] flink pull request: [FLINK-2111] Add stop signal to cleanly shut...

2015-07-10 Thread mjsax
Github user mjsax commented on the pull request:

https://github.com/apache/flink/pull/750#issuecomment-120408249
  
Changes applied:
- renamed signal name from TERMINATE to STOP (as decided by community vote)
- reverted changes to execution graph state machine
- added interface `Stopable` and reverted method `AbstractInvocable.stop()`
- added `JobType` enum (Batching / Streaming) [as discussed at last Meetup 
with @StephanEwen ]
- extended test coverage

Please review again.




[GitHub] flink pull request: [doc] Fix wordcount example in YARN setup

2015-07-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/flink/pull/897




[jira] [Commented] (FLINK-2106) Add outer joins to API, Optimizer, and Runtime

2015-07-10 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622248#comment-14622248
 ] 

Fabian Hueske commented on FLINK-2106:
--

Having a working outer join algorithm (FLINK-2105 or FLINK-2107) is a 
prerequisite for this issue. 
I would strongly recommend solving FLINK-2105 before tackling this issue.



[jira] [Commented] (FLINK-2310) Add an Adamic-Adar Similarity example

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622107#comment-14622107
 ] 

ASF GitHub Bot commented on FLINK-2310:
---

Github user shghatge commented on the pull request:

https://github.com/apache/flink/pull/892#issuecomment-120366455
  
Hello @vasia 
Thanks for the clarification. I was only thinking of changing the data 
structure :) But this method seems much more sensible. In any case... Should I 
close this pull request? Or should I commit the changes made to it according to 
your first suggestion?


 Add an Adamic-Adar Similarity example
 -

 Key: FLINK-2310
 URL: https://issues.apache.org/jira/browse/FLINK-2310
 Project: Flink
  Issue Type: Task
  Components: Gelly
Reporter: Andra Lungu
Assignee: Shivani Ghatge
Priority: Minor

 Just as Jaccard, the Adamic-Adar algorithm measures the similarity between a 
 set of nodes. However, instead of counting the common neighbors and dividing 
 them by the total number of neighbors, the similarity is weighted according 
 to the vertex degrees. In particular, it's equal to log(1/numberOfEdges).
 The Adamic-Adar algorithm can be broken into three steps: 
 1). For each vertex, compute the log of its inverse degrees (with the formula 
 above) and set it as the vertex value. 
 2). Each vertex will then send this new computed value along with a list of 
 neighbors to the targets of its out-edges
 3). Weigh the edges with the Adamic-Adar index: Sum over n from CN of 
 log(1/k_n)(CN is the set of all common neighbors of two vertices x, y. k_n is 
 the degree of node n). See [2]
 Prerequisites: 
 - Full understanding of the Jaccard Similarity Measure algorithm
 - Reading the associated literature: 
 [1] http://social.cs.uiuc.edu/class/cs591kgk/friendsadamic.pdf
 [2] 
 http://stackoverflow.com/questions/22565620/fast-algorithm-to-compute-adamic-adar
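The three steps quoted above boil down to the commonly used Adamic-Adar index,
the sum over all common neighbors n of 1/log(k_n), as described in [2]. A
minimal plain-Python sketch (not Gelly code; the graph data and function name
are illustrative assumptions):

```python
import math

def adamic_adar(adjacency, x, y):
    """Adamic-Adar similarity of vertices x and y:
    sum of 1 / log(degree(n)) over all common neighbors n."""
    common = adjacency[x] & adjacency[y]
    return sum(1.0 / math.log(len(adjacency[n])) for n in common)

# Toy undirected graph as an adjacency-set dictionary.
graph = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

# Vertices 2 and 4 share the common neighbors {1, 3},
# both of degree 3, so the score is 2 / log(3).
score = adamic_adar(graph, 2, 4)
```

In Gelly the same computation would be distributed over the vertex-centric
steps listed above; this sketch only illustrates the index itself.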





[GitHub] flink pull request: Make streaming iterations more robust

2015-07-10 Thread senorcarbone
Github user senorcarbone commented on a diff in the pull request:

https://github.com/apache/flink/pull/900#discussion_r34352021
  
--- Diff: 
flink-staging/flink-streaming/flink-streaming-core/src/main/java/org/apache/flink/streaming/api/datastream/IterativeDataStream.java
 ---
@@ -201,14 +204,16 @@ public 
ConnectedIterativeDataStream(IterativeDataStream<I> input, TypeInformatio
 * @return The feedback stream.
 * 
 */
+   @SuppressWarnings({"rawtypes", "unchecked"})
public DataStream<F> closeWith(DataStream<F> feedbackStream) {
-   DataStream<F> iterationSink = new 
DataStreamSink<F>(input.environment, "Iteration Sink",
-   null, null);
+   if (input.closed) {
+   throw new IllegalStateException(
--- End diff --

Since we now build the loops at job graph generation, wouldn't it be 
reasonable to do an implicit union there?




[GitHub] flink pull request: [doc] Fix wordcount example in YARN setup

2015-07-10 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/897#issuecomment-120384704
  
Merging...




[jira] [Commented] (FLINK-2310) Add an Adamic-Adar Similarity example

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622253#comment-14622253
 ] 

ASF GitHub Bot commented on FLINK-2310:
---

Github user vasia commented on the pull request:

https://github.com/apache/flink/pull/892#issuecomment-120404920
  
You can use this PR for the exact computation, no need to open a new one!


 Add an Adamic-Adar Similarity example
 -

 Key: FLINK-2310
 URL: https://issues.apache.org/jira/browse/FLINK-2310
 Project: Flink
  Issue Type: Task
  Components: Gelly
Reporter: Andra Lungu
Assignee: Shivani Ghatge
Priority: Minor

 Just as Jaccard, the Adamic-Adar algorithm measures the similarity between a 
 set of nodes. However, instead of counting the common neighbors and dividing 
 them by the total number of neighbors, the similarity is weighted according 
 to the vertex degrees. In particular, it's equal to log(1/numberOfEdges).
 The Adamic-Adar algorithm can be broken into three steps: 
 1). For each vertex, compute the log of its inverse degrees (with the formula 
 above) and set it as the vertex value. 
 2). Each vertex will then send this new computed value along with a list of 
 neighbors to the targets of its out-edges
 3). Weigh the edges with the Adamic-Adar index: Sum over n from CN of 
 log(1/k_n)(CN is the set of all common neighbors of two vertices x, y. k_n is 
 the degree of node n). See [2]
 Prerequisites: 
 - Full understanding of the Jaccard Similarity Measure algorithm
 - Reading the associated literature: 
 [1] http://social.cs.uiuc.edu/class/cs591kgk/friendsadamic.pdf
 [2] 
 http://stackoverflow.com/questions/22565620/fast-algorithm-to-compute-adamic-adar





[GitHub] flink pull request: Make streaming iterations more robust

2015-07-10 Thread senorcarbone
Github user senorcarbone commented on the pull request:

https://github.com/apache/flink/pull/900#issuecomment-120405183
  
There are quite a few changes in this PR! In general I am very keen on 
eliminating DataStream mutations and adding that logic into the jobgraph 
building phase. The code is clear enough, I didn't find any errors so far, and 
the tests are use-case complete and pass, so :+1: from me.

So, just to confirm (correct me if I am wrong): we now reuse the same src-sink 
pair for loops with the same parallelism, and create additional src-sink pairs 
otherwise?




[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...

2015-07-10 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/885#issuecomment-120554092
  
If I get 1 more LGTM, I'll merge this.




[GitHub] flink pull request: [FLINK-2108] [ml] [WIP] Add score function for...

2015-07-10 Thread thvasilo
GitHub user thvasilo opened a pull request:

https://github.com/apache/flink/pull/902

[FLINK-2108] [ml] [WIP]  Add score function for Predictors

This PR builds upon the evaluation PR currently under review (#871) and adds 
two new operations to the Predictor class: one that takes a scorer and a test 
set and produces a score as an evaluation of the Predictor's performance using 
the provided scorer, and one that takes only a test set and uses a default 
score.

This PR includes implementations for custom scores and simple scores for 
all Predictor implementations, either through the Classifier and Regressor base 
classes, or specific ones, like the one provided for ALS. The provided custom 
score operation currently expects DataSet[LabeledVector] as the type of the 
test set and Double as the type of prediction.

TODO: Docs and code cleanup
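For context, here is a minimal plain-Python sketch of the kind of default 
scores discussed in this thread — accuracy for classifiers and RMSE for 
ALS-style regressors — computed over (truth, prediction) pairs. The function 
names are illustrative assumptions, not part of the Flink ML API:

```python
import math

def accuracy_score(pairs):
    """Fraction of (truth, prediction) pairs that match exactly."""
    pairs = list(pairs)
    return sum(1 for t, p in pairs if t == p) / len(pairs)

def rmse_score(pairs):
    """Root mean squared error over (truth, prediction) pairs."""
    pairs = list(pairs)
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

# Four (truth, prediction) pairs; one prediction is wrong.
predictions = [(1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]
acc = accuracy_score(predictions)  # 3 of 4 correct -> 0.75
err = rmse_score(predictions)      # sqrt(1/4) -> 0.5
```

In the PR, the analogous computation runs on a DataSet of (truth, prediction) 
pairs produced by the evaluate operation.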

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thvasilo/flink score-operation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/902.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #902


commit 305b43a451af3d8bc859671476c215308fbfc7fc
Author: mikiobraun mikiobr...@gmail.com
Date:   2015-06-22T15:04:42Z

Adding some first loss functions for the evaluation framework

commit bdb1a6912d2bcec29446ca4a9fbc550f2ecb8f4a
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-23T14:07:48Z

Scorer for evaluation

commit 4a7593ade68f43d444a6b289191f053a4ea8b031
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-25T09:41:10Z

Adds accuracy score and R^2 score. Also trying out Scores as classes 
instead of functions.

Not too happy with the extra boilerplate of Score as classes, will probably 
revert, and have objects like RegressionsScores, ClassificationScores that 
contain the definitions of the relevant scores.

commit 5c89c478bd00f168bfe48954d06367b28f948571
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-26T11:30:56Z

Adds an evaluate operation for LabeledVector input

commit e7bb4b42424641d640df370cd6ace71f7f42ee8d
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-26T11:32:13Z

Adds Regressor interface, and a score function for regression algorithms.

commit 3d8a6928b02b30c732f282df61613561dbf8d4fc
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-30T14:04:58Z

Added Classifier intermediate class, and default score function for 
classifiers.

commit e1a26ed30bb784633685703892f67d51136f6060
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-01T08:20:41Z

Going back to having scores defined in objects instead of their own classes.

commit 0dd251a5a59cd610c4df3e9a1ea3921b1a9cc2e0
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-01T13:00:37Z

Removed ParameterMap from predict function of PredictOperation

commit 492e9a383af6285f0fdca5031d2bd7bdfe3cd511
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-02T10:21:28Z

Reworked score functionality to allow chained Predictors.

All predictors must now implement a calculateScore function.
We are for now assuming that predictors are supervised learning algorithms,
once unsupervised learning algorithms are added this will need to be 
reworked.

Also added an evaluate dataset operation to ALS, to allow for scoring of the
algorithm. Default performance measure for ALS is RMSE.

commit d9715ed3a6faba78e0b34368425768e826d5a736
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-06T08:50:59Z

Made calculateScore only take DataSet[(Double, Double)]

commit 7f1a6da52dfcd47d39c39cee2141112e5c10ddad
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-07T08:15:58Z

Added test for DataSet.mean()

commit edbe3dd9ea48d168f67a9ff231f8373a6aaee38d
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-07T12:11:45Z

Switched from cross to mapWithBcVariable

commit e840c14032f5fea3b476e1a99122eb9125ba5a4f
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-08T16:48:27Z

Addressed multiple PR comments.

commit 6e48b612f0a367e798e40590f6921d4dc242f2aa
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-08T17:06:42Z

Add approximatelyEquals for Doubles, used for score calculation.

commit 57d0ef2c4bc268d1d870c7aab537dd611f464fcf
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-08T17:23:14Z

Improved docstrings for Score

commit eb66de590947ca8f887a8e52f8c66ec860b82af3
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-08T17:27:32Z

Removed score function from Predictor.

commit 13053ef358091427fa89c66305d602a84a819c87
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-09T13:52:18Z

Added score operation.

This operation is similar to the predict and evaluate operations,
allowing type-dependent 

[GitHub] flink pull request: [FLINK-2337] Multiple SLF4J bindings using Sto...

2015-07-10 Thread mjsax
GitHub user mjsax opened a pull request:

https://github.com/apache/flink/pull/903

[FLINK-2337] Multiple SLF4J bindings using Storm compatibility layer

exclude logback from Storm dependencies

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mjsax/flink flink-2337-slf4j-bindings

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/903.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #903


commit 9bc3e974e7833fd92a2251a38b54ad9d6e09d759
Author: mjsax mj...@informatik.hu-berlin.de
Date:   2015-07-10T13:43:21Z

excluded 'logback' from Storm dependencies
fixes FLINK-2337






[jira] [Commented] (FLINK-2108) Add score function for Predictors

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622406#comment-14622406
 ] 

ASF GitHub Bot commented on FLINK-2108:
---

GitHub user thvasilo opened a pull request:

https://github.com/apache/flink/pull/902

[FLINK-2108] [ml] [WIP]  Add score function for Predictors

This PR builds upon the evaluation PR currently under review (#871) and adds 
two new operations to the Predictor class: one that takes a scorer and a test 
set and produces a score as an evaluation of the Predictor's performance using 
the provided scorer, and one that takes only a test set and uses a default 
score.

This PR includes implementations for custom scores and simple scores for 
all Predictor implementations, either through the Classifier and Regressor base 
classes, or specific ones, like the one provided for ALS. The provided custom 
score operation currently expects DataSet[LabeledVector] as the type of the 
test set and Double as the type of prediction.

TODO: Docs and code cleanup

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thvasilo/flink score-operation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/902.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #902


commit 305b43a451af3d8bc859671476c215308fbfc7fc
Author: mikiobraun mikiobr...@gmail.com
Date:   2015-06-22T15:04:42Z

Adding some first loss functions for the evaluation framework

commit bdb1a6912d2bcec29446ca4a9fbc550f2ecb8f4a
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-23T14:07:48Z

Scorer for evaluation

commit 4a7593ade68f43d444a6b289191f053a4ea8b031
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-25T09:41:10Z

Adds accuracy score and R^2 score. Also trying out Scores as classes 
instead of functions.

Not too happy with the extra boilerplate of Score as classes, will probably 
revert, and have objects like RegressionsScores, ClassificationScores that 
contain the definitions of the relevant scores.

commit 5c89c478bd00f168bfe48954d06367b28f948571
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-26T11:30:56Z

Adds an evaluate operation for LabeledVector input

commit e7bb4b42424641d640df370cd6ace71f7f42ee8d
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-26T11:32:13Z

Adds Regressor interface, and a score function for regression algorithms.

commit 3d8a6928b02b30c732f282df61613561dbf8d4fc
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-30T14:04:58Z

Added Classifier intermediate class, and default score function for 
classifiers.

commit e1a26ed30bb784633685703892f67d51136f6060
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-01T08:20:41Z

Going back to having scores defined in objects instead of their own classes.

commit 0dd251a5a59cd610c4df3e9a1ea3921b1a9cc2e0
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-01T13:00:37Z

Removed ParameterMap from predict function of PredictOperation

commit 492e9a383af6285f0fdca5031d2bd7bdfe3cd511
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-02T10:21:28Z

Reworked score functionality to allow chained Predictors.

All predictors must now implement a calculateScore function.
We are for now assuming that predictors are supervised learning algorithms,
once unsupervised learning algorithms are added this will need to be 
reworked.

Also added an evaluate dataset operation to ALS, to allow for scoring of the
algorithm. Default performance measure for ALS is RMSE.

commit d9715ed3a6faba78e0b34368425768e826d5a736
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-06T08:50:59Z

Made calculateScore only take DataSet[(Double, Double)]

commit 7f1a6da52dfcd47d39c39cee2141112e5c10ddad
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-07T08:15:58Z

Added test for DataSet.mean()

commit edbe3dd9ea48d168f67a9ff231f8373a6aaee38d
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-07T12:11:45Z

Switched from cross to mapWithBcVariable

commit e840c14032f5fea3b476e1a99122eb9125ba5a4f
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-08T16:48:27Z

Addressed multiple PR comments.

commit 6e48b612f0a367e798e40590f6921d4dc242f2aa
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-08T17:06:42Z

Add approximatelyEquals for Doubles, used for score calculation.

commit 57d0ef2c4bc268d1d870c7aab537dd611f464fcf
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-08T17:23:14Z

Improved docstrings for Score

commit eb66de590947ca8f887a8e52f8c66ec860b82af3
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-07-08T17:27:32Z

Removed score function from Predictor.


[jira] [Commented] (FLINK-2337) Multiple SLF4J bindings using Storm compatibility layer

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622407#comment-14622407
 ] 

ASF GitHub Bot commented on FLINK-2337:
---

GitHub user mjsax opened a pull request:

https://github.com/apache/flink/pull/903

[FLINK-2337] Multiple SLF4J bindings using Storm compatibility layer

exclude logback from Storm dependencies

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mjsax/flink flink-2337-slf4j-bindings

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/903.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #903


commit 9bc3e974e7833fd92a2251a38b54ad9d6e09d759
Author: mjsax mj...@informatik.hu-berlin.de
Date:   2015-07-10T13:43:21Z

excluded 'logback' from Storm dependencies
fixes FLINK-2337




 Multiple SLF4J bindings using Storm compatibility layer
 ---

 Key: FLINK-2337
 URL: https://issues.apache.org/jira/browse/FLINK-2337
 Project: Flink
  Issue Type: Bug
  Components: flink-contrib
Reporter: Matthias J. Sax
Assignee: Matthias J. Sax
Priority: Minor

 Storm depends on logback as its SLF4J implementation, but Flink uses log4j. 
 The log shows the following conflict:
 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in 
 [jar:file:/home/cicero/.m2/repository/ch/qos/logback/logback-classic/1.0.13/logback-classic-1.0.13.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in 
 [jar:file:/home/cicero/.m2/repository/org/slf4j/slf4j-log4j12/1.7.7/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
 explanation.
 Need to exclude logback from storm dependencies to fix this.
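In Maven terms, the described fix would look roughly like the following 
dependency declaration — a sketch with assumed coordinates 
(`org.apache.storm:storm-core`); the actual pom of the compatibility layer may 
use different artifacts or versions:

```xml
<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>${storm.version}</version>
    <exclusions>
        <!-- Exclude logback so only Flink's slf4j-log4j12 binding is on the classpath -->
        <exclusion>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```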





[jira] [Commented] (FLINK-2200) Flink API with Scala 2.11 - Maven Repository

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622381#comment-14622381
 ] 

ASF GitHub Bot commented on FLINK-2200:
---

Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/885#issuecomment-120423639
  
Looks very neat!


 Flink API with Scala 2.11 - Maven Repository
 

 Key: FLINK-2200
 URL: https://issues.apache.org/jira/browse/FLINK-2200
 Project: Flink
  Issue Type: Wish
  Components: Build System, Scala API
Reporter: Philipp Götze
Assignee: Chiwan Park
Priority: Trivial
  Labels: maven

 It would be nice if you could upload a pre-built version of the Flink API 
 with Scala 2.11 to the maven repository.





[jira] [Commented] (FLINK-2328) Applying more than one transformation on an IterativeDataStream fails

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622166#comment-14622166
 ] 

ASF GitHub Bot commented on FLINK-2328:
---

GitHub user gyfora opened a pull request:

https://github.com/apache/flink/pull/900

Make streaming iterations more robust

This PR reworks the way iterations are constructed in the streamgraph from 
the previous eager approach to do it when we actually create the jobgraph.

This new approach solves many known and unknown issues that the previous 
approach had, such as issues with multiple tail and head operators 
(FLINK-2328), issues with partitioning on the feedback stream, issues with 
ConnectedIterations and immutability of the IterativeDataStream, etc.

I also included a new set of tests to validate all the expected 
functionality:

- Tests for simple and connected iterations with multiple heads and tails
- Parallelism, partitioning and co-location checks
- Tests for the exceptions thrown during job creation for invalid iterative 
topologies

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gyfora/flink master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/900.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #900


commit 9fef88306bc79f1f2e0520ddeb2ef10be8e4de09
Author: Gyula Fora gyf...@apache.org
Date:   2015-07-10T11:26:44Z

[FLINK-2335] [streaming] Lazy iteration construction in StreamGraph




 Applying more than one transformation on an IterativeDataStream fails
 -

 Key: FLINK-2328
 URL: https://issues.apache.org/jira/browse/FLINK-2328
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Gyula Fora
Assignee: Gyula Fora

 Currently the user cannot apply more than one transformation on the 
 IterativeDataStream directly.
 It fails because instead of creating one iteration source and connecting it 
 to the operators it will try to create two iteration sources which fails on 
 the shared broker slot.
 A workaround is to use a no-op mapper on the iterative stream and apply 
 the two transformations on that.





[jira] [Commented] (FLINK-2343) Change default garbage collector in streaming environments

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622388#comment-14622388
 ] 

ASF GitHub Bot commented on FLINK-2343:
---

GitHub user StephanEwen opened a pull request:

https://github.com/apache/flink/pull/901

[FLINK-2343] [scripts] Make CMS the default GC for streaming setups

When the system is started with `start-cluster-streaming.sh`, this patch 
sets the Garbage Collector to Concurrent Mark Sweep, to keep a low and smooth 
latency. Without these options, the parallel bulk GC introduces frequent 
latency spikes.

The GC option is currently only set if no other java option is set via 
`env.java.opts`. Otherwise, if users define their favorite GC in these options, 
the JVM may not start due to option conflicts.

When setting the CMS GC, we also set the `CMSClassUnloadingEnabled` option, 
to make sure that user-code classes are removed once the classloader expires.
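A rough shell sketch of the guard described above — set the CMS flags only 
when the user has not supplied their own JVM options. Variable names here are 
illustrative assumptions, not necessarily those used in the actual start 
scripts:

```shell
#!/bin/sh
# Would normally be read from flink-conf.yaml (env.java.opts); empty here.
FLINK_ENV_JAVA_OPTS=""

if [ -z "${FLINK_ENV_JAVA_OPTS}" ]; then
    # Concurrent Mark Sweep keeps streaming latency low and smooth;
    # CMSClassUnloadingEnabled lets user-code classes be collected
    # once their classloader expires.
    JVM_ARGS="-XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled"
else
    # Respect user-defined options to avoid conflicting GC flags.
    JVM_ARGS="${FLINK_ENV_JAVA_OPTS}"
fi

echo "${JVM_ARGS}"
```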

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/StephanEwen/incubator-flink gc_option

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/901.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #901


commit 8840d7508668504bade7f28ca128389d5e1392d1
Author: Stephan Ewen se...@apache.org
Date:   2015-07-10T14:37:28Z

[FLINK-2343] [scripts] Make CMS the default GC for streaming setups




 Change default garbage collector in streaming environments
 --

 Key: FLINK-2343
 URL: https://issues.apache.org/jira/browse/FLINK-2343
 Project: Flink
  Issue Type: Improvement
  Components: Start-Stop Scripts
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


 When starting Flink, we don't pass any particular GC related JVM flags to the 
 system. That means, it uses the default garbage collectors, which are the 
 bulk parallel GCs for both old gen and new gen.
 For streaming applications, this results in vastly fluctuating latencies. 
 Latencies are much more constant with either the {{CMS}} or {{G1}} GC.
 I propose to make the CMS the default GC for streaming setups.
 G1 may become the GC of choice in the future, but from various articles I 
 found, it is still somewhat in beta status (see for example here: 
 http://jaxenter.com/kirk-pepperdine-on-the-g1-for-java-9-118190.html )





[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...

2015-07-10 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/885#issuecomment-120411775
  
Hi, I updated PR :)




[GitHub] flink pull request: [FLINK-2008][FLINK-2296] Fix checkpoint commit...

2015-07-10 Thread gyfora
Github user gyfora commented on the pull request:

https://github.com/apache/flink/pull/895#issuecomment-120421266
  
:+1: 




[jira] [Commented] (FLINK-2008) PersistentKafkaSource is sometimes emitting tuples multiple times

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622359#comment-14622359
 ] 

ASF GitHub Bot commented on FLINK-2008:
---

Github user gyfora commented on the pull request:

https://github.com/apache/flink/pull/895#issuecomment-120421266
  
:+1: 


 PersistentKafkaSource is sometimes emitting tuples multiple times
 -

 Key: FLINK-2008
 URL: https://issues.apache.org/jira/browse/FLINK-2008
 Project: Flink
  Issue Type: Bug
  Components: Kafka Connector, Streaming
Affects Versions: 0.9
Reporter: Robert Metzger
Assignee: Robert Metzger

 The PersistentKafkaSource is expected to emit records exactly once.
 Two test cases of the KafkaITCase are sporadically failing because records 
 are emitted multiple times.
 Affected tests:
 {{testPersistentSourceWithOffsetUpdates()}}, after the offsets have been 
 changed manually in ZK:
 {code}
 java.lang.RuntimeException: Expected v to be 3, but was 4 on element 0 
 array=[4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
 2]
 {code}
 {{brokerFailureTest()}} also fails:
 {code}
 05/13/2015 08:13:16   Custom source - Stream Sink(1/1) switched to FAILED 
 java.lang.AssertionError: Received tuple with value 21 twice
   at org.junit.Assert.fail(Assert.java:88)
   at org.junit.Assert.assertTrue(Assert.java:41)
   at org.junit.Assert.assertFalse(Assert.java:64)
   at 
 org.apache.flink.streaming.connectors.kafka.KafkaITCase$15.invoke(KafkaITCase.java:877)
   at 
 org.apache.flink.streaming.connectors.kafka.KafkaITCase$15.invoke(KafkaITCase.java:859)
   at 
 org.apache.flink.streaming.api.operators.StreamSink.callUserFunction(StreamSink.java:39)
   at 
 org.apache.flink.streaming.api.operators.StreamOperator.callUserFunctionAndLogException(StreamOperator.java:137)
   at 
 org.apache.flink.streaming.api.operators.ChainableStreamOperator.collect(ChainableStreamOperator.java:54)
   at 
 org.apache.flink.streaming.api.collector.CollectorWrapper.collect(CollectorWrapper.java:39)
   at 
 org.apache.flink.streaming.connectors.kafka.api.persistent.PersistentKafkaSource.run(PersistentKafkaSource.java:173)
   at 
 org.apache.flink.streaming.api.operators.StreamSource.callUserFunction(StreamSource.java:40)
   at 
 org.apache.flink.streaming.api.operators.StreamOperator.callUserFunctionAndLogException(StreamOperator.java:137)
   at 
 org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:34)
   at 
 org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:139)
   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
   at java.lang.Thread.run(Thread.java:745)
 {code}





[jira] [Created] (FLINK-2343) Change default garbage collector in streaming environments

2015-07-10 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2343:
---

 Summary: Change default garbage collector in streaming environments
 Key: FLINK-2343
 URL: https://issues.apache.org/jira/browse/FLINK-2343
 Project: Flink
  Issue Type: Improvement
  Components: Start-Stop Scripts
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10


When starting Flink, we don't pass any particular GC related JVM flags to the 
system. That means, it uses the default garbage collectors, which are the bulk 
parallel GCs for both old gen and new gen.

For streaming applications, this results in vastly fluctuating latencies. 
Latencies are much more constant with either the {{CMS}} or {{G1}} GC.

I propose to make the CMS the default GC for streaming setups.

G1 may become the GC of choice in the future, but from various articles I found, 
it is still somewhat in beta status (see for example here: 
http://jaxenter.com/kirk-pepperdine-on-the-g1-for-java-9-118190.html )





[jira] [Closed] (FLINK-2328) Applying more than one transformation on an IterativeDataStream fails

2015-07-10 Thread Gyula Fora (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora closed FLINK-2328.
-
Resolution: Duplicate

This issue will be solved by this: 
https://issues.apache.org/jira/browse/FLINK-2335

 Applying more than one transformation on an IterativeDataStream fails
 -

 Key: FLINK-2328
 URL: https://issues.apache.org/jira/browse/FLINK-2328
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Gyula Fora
Assignee: Gyula Fora

 Currently the user cannot apply more than one transformation on the 
 IterativeDataStream directly.
 It fails because, instead of creating one iteration source and connecting it 
 to the operators, it will try to create two iteration sources, which fails on 
 the shared broker slot.
 A workaround is to use a no-op mapper on the iterative stream and apply 
 the two transformations to that.





[GitHub] flink pull request: Make streaming iterations more robust

2015-07-10 Thread gyfora
Github user gyfora commented on a diff in the pull request:

https://github.com/apache/flink/pull/900#discussion_r34357746
  
--- Diff: 
flink-staging/flink-streaming/flink-streaming-core/src/main/java/org/apache/flink/streaming/api/datastream/IterativeDataStream.java
 ---
@@ -201,14 +204,16 @@ public 
ConnectedIterativeDataStream(IterativeDataStream<I> input, TypeInformatio
 * @return The feedback stream.
 * 
 */
+   @SuppressWarnings({ "rawtypes", "unchecked" })
public DataStream<F> closeWith(DataStream<F> feedbackStream) {
-   DataStream<F> iterationSink = new 
DataStreamSink<F>(input.environment, "Iteration Sink",
-   null, null);
+   if (input.closed) {
+   throw new IllegalStateException(
--- End diff --

I think this is a fair restriction, the user can always union.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2200] Add Flink with Scala 2.11 in Mave...

2015-07-10 Thread aalexandrov
Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/885#issuecomment-120423639
  
Looks very neat!




[jira] [Commented] (FLINK-2111) Add stop signal to cleanly shutdown streaming jobs

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622264#comment-14622264
 ] 

ASF GitHub Bot commented on FLINK-2111:
---

Github user mjsax commented on the pull request:

https://github.com/apache/flink/pull/750#issuecomment-120408249
  
Changes applied:
- renamed signal name from TERMINATE to STOP (as decided by community vote)
- reverted changes to execution graph state machine
- added interface `Stopable` and reverted method `AbstractInvocable.stop()`
- added `JobType` enum (Batching / Streaming) [as discussed at last Meetup 
with @StephanEwen ]
- extended test coverage

Please review again.


 Add stop signal to cleanly shutdown streaming jobs
 

 Key: FLINK-2111
 URL: https://issues.apache.org/jira/browse/FLINK-2111
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Runtime, JobManager, Local Runtime, 
 Streaming, TaskManager, Webfrontend
Reporter: Matthias J. Sax
Assignee: Matthias J. Sax
Priority: Minor

 Currently, streaming jobs can only be stopped using the cancel command, which 
 is a hard stop with no clean shutdown.
 The newly introduced stop signal will only affect streaming source tasks 
 such that the sources can stop emitting data and shut down cleanly, resulting 
 in a clean shutdown of the whole streaming job.
 This feature is a pre-requirement for 
 https://issues.apache.org/jira/browse/FLINK-1929





[jira] [Commented] (FLINK-2200) Flink API with Scala 2.11 - Maven Repository

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622276#comment-14622276
 ] 

ASF GitHub Bot commented on FLINK-2200:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/885#issuecomment-120411775
  
Hi, I updated PR :)


 Flink API with Scala 2.11 - Maven Repository
 

 Key: FLINK-2200
 URL: https://issues.apache.org/jira/browse/FLINK-2200
 Project: Flink
  Issue Type: Wish
  Components: Build System, Scala API
Reporter: Philipp Götze
Assignee: Chiwan Park
Priority: Trivial
  Labels: maven

 It would be nice if you could upload a pre-built version of the Flink API 
 with Scala 2.11 to the maven repository.





[GitHub] flink pull request: Make streaming iterations more robust

2015-07-10 Thread senorcarbone
Github user senorcarbone commented on a diff in the pull request:

https://github.com/apache/flink/pull/900#discussion_r34359293
  
--- Diff: 
flink-staging/flink-streaming/flink-streaming-core/src/main/java/org/apache/flink/streaming/api/datastream/IterativeDataStream.java
 ---
@@ -201,14 +204,16 @@ public 
ConnectedIterativeDataStream(IterativeDataStream<I> input, TypeInformatio
 * @return The feedback stream.
 * 
 */
+   @SuppressWarnings({ "rawtypes", "unchecked" })
public DataStream<F> closeWith(DataStream<F> feedbackStream) {
-   DataStream<F> iterationSink = new 
DataStreamSink<F>(input.environment, "Iteration Sink",
-   null, null);
+   if (input.closed) {
+   throw new IllegalStateException(
--- End diff --

sure




[GitHub] flink pull request: [FLINK-2343] [scripts] Make CMS the default GC...

2015-07-10 Thread StephanEwen
GitHub user StephanEwen opened a pull request:

https://github.com/apache/flink/pull/901

[FLINK-2343] [scripts] Make CMS the default GC for streaming setups

When the system is started with `start-cluster-streaming.sh`, this patch 
sets the Garbage Collector to Concurrent Mark Sweep, to keep a low and smooth 
latency. Without these options, the parallel bulk GC introduces frequent 
latency spikes.

The GC option is currently only set if no other java option is set via 
`env.java.opts`. Otherwise, if users define their favorite GC in these options, 
the JVM may not start due to option conflicts.

When setting the CMS GC, we also set the `CMSClassUnloadingEnabled` option, 
to make sure that user-code classes are removed once the classloader expires.
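The conditional logic described above can be sketched roughly as follows (variable names are illustrative for this sketch; the real script may differ):

```shell
# Illustrative sketch of the guard: only apply the CMS defaults when the
# user has not supplied custom JVM options via env.java.opts, to avoid
# conflicting GC flags. Variable names are made up for this sketch.
FLINK_ENV_JAVA_OPTS=""   # would normally be read from flink-conf.yaml (env.java.opts)
if [ -z "${FLINK_ENV_JAVA_OPTS}" ]; then
    JVM_ARGS="-XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled"
else
    JVM_ARGS="${FLINK_ENV_JAVA_OPTS}"
fi
echo "${JVM_ARGS}"
```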

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/StephanEwen/incubator-flink gc_option

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/901.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #901


commit 8840d7508668504bade7f28ca128389d5e1392d1
Author: Stephan Ewen se...@apache.org
Date:   2015-07-10T14:37:28Z

[FLINK-2343] [scripts] Make CMS the default GC for streaming setups






[jira] [Created] (FLINK-2344) Deprecate/Remove the old Pact Pair type

2015-07-10 Thread Stephan Ewen (JIRA)
Stephan Ewen created FLINK-2344:
---

 Summary: Deprecate/Remove the old Pact Pair type
 Key: FLINK-2344
 URL: https://issues.apache.org/jira/browse/FLINK-2344
 Project: Flink
  Issue Type: Improvement
  Components: Core
Affects Versions: 0.10
Reporter: Stephan Ewen
Assignee: Stephan Ewen
 Fix For: 0.10








[jira] [Commented] (FLINK-2337) Multiple SLF4J bindings using Storm compatibility layer

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622669#comment-14622669
 ] 

ASF GitHub Bot commented on FLINK-2337:
---

Github user mjsax commented on the pull request:

https://github.com/apache/flink/pull/903#issuecomment-120479371
  
Travis failed due to unrelated error:

Failed tests: 
  
ProcessFailureStreamingRecoveryITCase>AbstractProcessFailureRecoveryTest.testTaskManagerProcessFailure:198
 The program encountered a ProgramInvocationException : The program execution 
failed: Job execution failed.

Does anyone know anything about this failing test?

The log warning `SLF4J: Class path contains multiple SLF4J bindings.` is 
gone for the 4 successful Travis runs.

This PR should be ready to get merged.


 Multiple SLF4J bindings using Storm compatibility layer
 ---

 Key: FLINK-2337
 URL: https://issues.apache.org/jira/browse/FLINK-2337
 Project: Flink
  Issue Type: Bug
  Components: flink-contrib
Reporter: Matthias J. Sax
Assignee: Matthias J. Sax
Priority: Minor

 Storm depends on logback as its slf4j implementation but Flink uses log4j. The log 
 shows the following conflict:
 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in 
 [jar:file:/home/cicero/.m2/repository/ch/qos/logback/logback-classic/1.0.13/logback-classic-1.0.13.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in 
 [jar:file:/home/cicero/.m2/repository/org/slf4j/slf4j-log4j12/1.7.7/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
 explanation.
 Need to exclude logback from storm dependencies to fix this.
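A hedged sketch of the kind of Maven exclusion the description calls for (the storm-core coordinates are assumptions based on Storm's usual groupId/artifactId; the version element is omitted):

```xml
<!-- Sketch only: exclude Storm's logback binding so that Flink's
     slf4j-log4j12 binding is the only one on the classpath.
     Coordinates are assumed, not taken from the actual POM. -->
<dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <exclusions>
        <exclusion>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```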





[jira] [Updated] (FLINK-2218) Web client cannot distinguish between Flink options and program arguments

2015-07-10 Thread Matthias J. Sax (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated FLINK-2218:
---
Summary: Web client cannot distinguish between Flink options and program 
arguments  (was: Web client cannot specify parallelism)

 Web client cannot distinguish between Flink options and program arguments
 -

 Key: FLINK-2218
 URL: https://issues.apache.org/jira/browse/FLINK-2218
 Project: Flink
  Issue Type: Improvement
  Components: Webfrontend
Affects Versions: master
Reporter: Ufuk Celebi
Assignee: Matthias J. Sax

 Programs submitted via the CLI can set the parallelism of a program ({{-p}}). 
 This is not possible for programs submitted via the web client. With the web 
 client, the program needs to take care of setting the parallelism (by having 
 an extra program arg) or the default parallelism is used.





[jira] [Updated] (FLINK-2218) Web client cannot distinguish between Flink options and program arguments

2015-07-10 Thread Matthias J. Sax (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated FLINK-2218:
---
Description: 
WebClient has only one input field for arguments. This field is used for Flink 
options (e.g., `-p`) and program arguments. Thus, supported Flink options 
restrict the possible program arguments. CliFrontend in contrast can 
distinguish both and thus `-p` can also be used as a program argument.

Solution: add a second input field for Flink options to WebClient

  was:
Programs submitted via the CLI can set the parallelism of a program ({{-p}}). 

This is not possible for programs submitted via the web client. With the web 
client, the program needs to take care of setting the parallelism (by having an 
extra program arg) or the default parallelism is used.


 Web client cannot distinguish between Flink options and program arguments
 -

 Key: FLINK-2218
 URL: https://issues.apache.org/jira/browse/FLINK-2218
 Project: Flink
  Issue Type: Improvement
  Components: Webfrontend
Affects Versions: master
Reporter: Ufuk Celebi
Assignee: Matthias J. Sax

 WebClient has only one input field for arguments. This field is used for 
 Flink options (e.g., `-p`) and program arguments. Thus, supported Flink 
 options restrict the possible program arguments. CliFrontend in contrast can 
 distinguish both and thus `-p` can also be used as a program argument.
 Solution: add a second input field for Flink options to WebClient






[jira] [Commented] (FLINK-2292) Report accumulators periodically while job is running

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622590#comment-14622590
 ] 

ASF GitHub Bot commented on FLINK-2292:
---

Github user mxm commented on the pull request:

https://github.com/apache/flink/pull/896#issuecomment-120460741
  
I've addressed your comments in a new commit.


 Report accumulators periodically while job is running
 -

 Key: FLINK-2292
 URL: https://issues.apache.org/jira/browse/FLINK-2292
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager, TaskManager
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: 0.10


 Accumulators should be sent periodically, as part of the heartbeat that sends 
 metrics. This allows them to be updated in real time.






[GitHub] flink pull request: [FLINK-2218] Web client cannot distinguish bet...

2015-07-10 Thread mjsax
GitHub user mjsax opened a pull request:

https://github.com/apache/flink/pull/904

[FLINK-2218] Web client cannot distinguish between Flink options and 
program arguments

added new input fields 'options' to WebClient
adapted WebClient-to-JobManager job submission logic
updated documentation (including new screenshots)
(some additional minor cleanup in launch.html and program.js)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mjsax/flink flink-2218-WebClientDOP

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/904.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #904


commit 3f8a02b8233a136b5b5eb890ed11f7861b346bbc
Author: mjsax mj...@informatik.hu-berlin.de
Date:   2015-07-10T14:53:13Z

[FLINK-2218] Web client cannot distinguish between Flink options and 
program arguments
added new input fields 'options' to WebClient
adapted WebClient-to-JobManager job submission logic
updated documentation (including new screenshots)
(some additional minor cleanup in launch.html and program.js)







[jira] [Commented] (FLINK-2138) PartitionCustom for streaming

2015-07-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622813#comment-14622813
 ] 

ASF GitHub Bot commented on FLINK-2138:
---

Github user gyfora commented on the pull request:

https://github.com/apache/flink/pull/872#issuecomment-120506018
  
Looks good to merge. If no objections, I will merge it tomorrow. :+1: 


 PartitionCustom for streaming
 -

 Key: FLINK-2138
 URL: https://issues.apache.org/jira/browse/FLINK-2138
 Project: Flink
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 0.9
Reporter: Márton Balassi
Assignee: Gábor Hermann
Priority: Minor

 The batch API has support for custom partitioning; this should be added for 
 streaming with a similar signature.




