[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...

2015-06-30 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117049680
  
Further, the probability distribution doesn't need to be scaled down to 
[0,1]. We just take care of that while building the cumulative 
distribution.
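The point about not normalizing can be illustrated with a small sketch (hypothetical code, not the PR's implementation): build the running totals of the raw weights and draw a uniform number in [0, total), so the weights never need to be rescaled to [0,1].

```java
import java.util.Random;

// Hypothetical sketch: sample an index proportionally to unnormalized
// weights by walking a cumulative sum. Not the actual PR code.
public class WeightedSampler {

    // Returns index i with probability weights[i] / sum(weights).
    static int sample(double[] weights, Random rnd) {
        double total = 0.0;
        for (double w : weights) {
            total += w;                       // running total, no normalization
        }
        double r = rnd.nextDouble() * total;  // uniform draw in [0, total)
        double cumulative = 0.0;
        for (int i = 0; i < weights.length; i++) {
            cumulative += weights[i];         // cumulative distribution over raw weights
            if (r < cumulative) {
                return i;
            }
        }
        return weights.length - 1;            // guard against floating-point edge cases
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 3.0, 6.0};   // sums to 10, never scaled to [0,1]
        int[] counts = new int[3];
        Random rnd = new Random(42);
        for (int n = 0; n < 10_000; n++) {
            counts[sample(weights, rnd)]++;
        }
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]);
    }
}
```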


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607912#comment-14607912
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117049680
  
Further, the probability distribution doesn't need to be scaled down to 
[0,1]. We just take care of that while building the cumulative 
distribution.


 Add Initialization schemes for K-means clustering
 -

 Key: FLINK-2131
 URL: https://issues.apache.org/jira/browse/FLINK-2131
 Project: Flink
  Issue Type: Task
  Components: Machine Learning Library
Reporter: Sachin Goel
Assignee: Sachin Goel

 Lloyd's [KMeans] algorithm takes initial centroids as its input. However, 
 if the user doesn't provide the initial centers, they may ask for a 
 particular initialization scheme to be followed. The most commonly used are:
 1. Random initialization: self-explanatory
 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
 For very large data sets, or for large values of k, the kmeans|| method is 
 preferred, as it provides the same approximation guarantees as kmeans++ and 
 requires fewer passes over the input data.
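Scheme 2 (kmeans++ seeding) can be sketched in-memory as follows. This is a hypothetical, non-distributed illustration only; the actual PR works on Flink DataSets:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// In-memory sketch of k-means++ seeding: each new center is drawn with
// probability proportional to the squared distance to the nearest center
// chosen so far. Hypothetical code, not the Flink PR's implementation.
public class KMeansPlusPlus {

    static double sqDist(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) {
            d += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return d;
    }

    static List<double[]> chooseCenters(double[][] points, int k, Random rnd) {
        List<double[]> centers = new ArrayList<>();
        centers.add(points[rnd.nextInt(points.length)]);  // first center uniformly
        while (centers.size() < k) {
            double[] costs = new double[points.length];
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                double best = Double.MAX_VALUE;
                for (double[] c : centers) {
                    best = Math.min(best, sqDist(points[i], c));
                }
                costs[i] = best;                  // D(x)^2 to the nearest chosen center
                total += best;
            }
            double r = rnd.nextDouble() * total;  // sample proportional to D(x)^2
            double cum = 0.0;
            int chosen = points.length - 1;
            for (int i = 0; i < points.length; i++) {
                cum += costs[i];
                if (r < cum) { chosen = i; break; }
            }
            centers.add(points[chosen]);
        }
        return centers;
    }
}
```

kmeans|| follows the same proportional-sampling idea but oversamples several candidates per pass, which is why it needs far fewer passes over a distributed data set.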



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2293) Division by Zero Exception

2015-06-30 Thread Andra Lungu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607965#comment-14607965
 ] 

Andra Lungu commented on FLINK-2293:


Also reproducible with 
https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java
The error is identical. 

The input DataSet is SNAP's orkut graph: 
http://snap.stanford.edu/data/com-Orkut.html

I am running on 30 wally nodes; well, one is down, so 29 (-p224). 
The flink.conf file should also be useful: 
https://gist.github.com/andralungu/3338bbd01ce61e0ce43d

Let me know if I can help you with anything else. Even if it's a memory 
issue (as is often the case), the error is at least misleading... 


 Division by Zero Exception
 --

 Key: FLINK-2293
 URL: https://issues.apache.org/jira/browse/FLINK-2293
 Project: Flink
  Issue Type: Bug
  Components: Local Runtime
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Priority: Critical
 Fix For: 0.9.1


 I am basically running an algorithm that simulates a Gather Sum Apply 
 Iteration that performs Triangle Count. (Why simulate it? Because you just 
 need a single superstep - useless overhead if you use the runGatherSumApply 
 function in Graph.)
 What happens, at a high level:
 1). Select neighbors with an ID greater than the one corresponding to the 
 current vertex;
 2). Propagate the received values to neighbors with higher IDs;
 3). Compute the number of triangles by checking whether
 trgVertex.getValue().get(srcVertex.getId());
 As you can see, I *do not* perform any division at all;
 code is here: 
 https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
 Now for small graphs, 50MB max, the computation finishes nicely with the 
 correct result. For a 10GB graph, however, I got this:
 java.lang.ArithmeticException: / by zero
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.insertIntoTable(MutableHashTable.java:836)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:819)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:544)
 at 
 org.apache.flink.runtime.operators.hash.NonReusingBuildFirstHashMatchIterator.callWithNextKey(NonReusingBuildFirstHashMatchIterator.java:104)
 at 
 org.apache.flink.runtime.operators.MatchDriver.run(MatchDriver.java:173)
 at 
 org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
 at 
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
 at java.lang.Thread.run(Thread.java:722)
 see the full log here: https://gist.github.com/andralungu/984774f6348269df7951
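For reference, the check in step 3 of the description boils down to something like the following simplified sketch (hypothetical names; the real code is in the linked GSATriangleCount.java):

```java
import java.util.Map;
import java.util.Set;

// Simplified sketch of the triangle check from step 3 above: a triangle is
// counted when the target vertex's accumulated neighbor set already contains
// the source vertex's id. Hypothetical code, not the linked implementation.
public class TriangleCheck {

    // vertexValues maps each vertex id to the set of neighbor ids
    // propagated to it in the previous superstep.
    static long countTriangles(long[][] edges, Map<Long, Set<Long>> vertexValues) {
        long triangles = 0;
        for (long[] edge : edges) {
            long src = edge[0];
            long trg = edge[1];
            Set<Long> received = vertexValues.get(trg);
            if (received != null && received.contains(src)) {
                triangles++;   // no division anywhere in this step
            }
        }
        return triangles;
    }
}
```

The division in the stack trace therefore comes from the runtime's hash-table internals, not from the user code.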





[jira] [Commented] (FLINK-2105) Implement Sort-Merge Outer Join algorithm

2015-06-30 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607966#comment-14607966
 ] 

Fabian Hueske commented on FLINK-2105:
--

[~chiwanpark] is right. This issue is about the implementation of an 
{{OuterJoinMergeIterator}} with very good test coverage. 
The integration with Drivers and Operators is part of FLINK-687.



 Implement Sort-Merge Outer Join algorithm
 -

 Key: FLINK-2105
 URL: https://issues.apache.org/jira/browse/FLINK-2105
 Project: Flink
  Issue Type: Sub-task
  Components: Local Runtime
Reporter: Fabian Hueske
Assignee: Ricky Pogalz
Priority: Minor
 Fix For: pre-apache


 Flink does not natively support outer joins at the moment. 
 This issue proposes to implement a sort-merge outer join algorithm that can 
 cover left, right, and full outer joins.
 The implementation can be based on the regular sort-merge join iterators 
 ({{ReusingMergeMatchIterator}} and {{NonReusingMergeMatchIterator}}; see also 
 the {{MatchDriver}} class).
 The Reusing and NonReusing variants differ in whether object instances are 
 reused or new objects are created. I would start with the NonReusing variant, 
 which is safer from a user's point of view and should also be easier to 
 implement.
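The merge logic the issue asks for can be sketched on plain sorted arrays (a hypothetical illustration, assuming unique keys per input; the actual task targets the iterator interfaces named above):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a sort-merge full outer join over two inputs sorted on their key,
// assuming unique keys per input. Integer.MIN_VALUE stands in for a "null"
// side. Hypothetical code; the real work is an OuterJoinMergeIterator inside
// the Flink runtime.
public class SortMergeOuterJoin {

    static List<int[]> fullOuterJoin(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] == right[j]) {
                out.add(new int[]{left[i], right[j]});           // matching keys
                i++;
                j++;
            } else if (left[i] < right[j]) {
                out.add(new int[]{left[i], Integer.MIN_VALUE});  // no right match
                i++;
            } else {
                out.add(new int[]{Integer.MIN_VALUE, right[j]}); // no left match
                j++;
            }
        }
        // Flush whichever side still has keys; the other side is exhausted.
        while (i < left.length) {
            out.add(new int[]{left[i++], Integer.MIN_VALUE});
        }
        while (j < right.length) {
            out.add(new int[]{Integer.MIN_VALUE, right[j++]});
        }
        return out;
    }
}
```

A left or right outer join simply drops the branch that pads the other side; a production merge iterator must additionally pair up duplicate keys on both inputs.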





[jira] [Assigned] (FLINK-1573) Add per-job metrics to flink.

2015-06-30 Thread Maximilian Michels (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maximilian Michels reassigned FLINK-1573:
-

Assignee: Maximilian Michels  (was: Robert Metzger)

 Add per-job metrics to flink.
 -

 Key: FLINK-1573
 URL: https://issues.apache.org/jira/browse/FLINK-1573
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager, TaskManager
Affects Versions: 0.9
Reporter: Robert Metzger
Assignee: Maximilian Michels

 With FLINK-1501, we have JVM-specific metrics (mainly monitoring the TMs).
 With this task, I would like to add metrics which are job-specific.





[jira] [Created] (FLINK-2293) Division by Zero Exception

2015-06-30 Thread Andra Lungu (JIRA)
Andra Lungu created FLINK-2293:
--

 Summary: Division by Zero Exception
 Key: FLINK-2293
 URL: https://issues.apache.org/jira/browse/FLINK-2293
 Project: Flink
  Issue Type: Bug
  Components: Distributed Runtime
Affects Versions: 0.10
Reporter: Andra Lungu







[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...

2015-06-30 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117054526
  
OK, thanks for the explanation. I will look at this PR this week hopefully.




[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607946#comment-14607946
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117054526
  
OK, thanks for the explanation. I will look at this PR this week hopefully.







[jira] [Updated] (FLINK-2293) Division by Zero Exception

2015-06-30 Thread Fabian Hueske (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Hueske updated FLINK-2293:
-
Priority: Critical  (was: Major)






[jira] [Created] (FLINK-2292) Report accumulators periodically while job is running

2015-06-30 Thread Maximilian Michels (JIRA)
Maximilian Michels created FLINK-2292:
-

 Summary: Report accumulators periodically while job is running
 Key: FLINK-2292
 URL: https://issues.apache.org/jira/browse/FLINK-2292
 Project: Flink
  Issue Type: Sub-task
  Components: JobManager, TaskManager
Reporter: Maximilian Michels
Assignee: Maximilian Michels
 Fix For: 0.10


Accumulators should be sent periodically, as part of the heartbeat that sends 
metrics. This allows them to be updated in real time.





[jira] [Updated] (FLINK-2293) Division by Zero Exception

2015-06-30 Thread Fabian Hueske (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Hueske updated FLINK-2293:
-
Component/s: (was: Distributed Runtime)
 Local Runtime






[jira] [Updated] (FLINK-2293) Division by Zero Exception

2015-06-30 Thread Fabian Hueske (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Hueske updated FLINK-2293:
-
Affects Version/s: 0.9






[jira] [Updated] (FLINK-2293) Division by Zero Exception

2015-06-30 Thread Fabian Hueske (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Hueske updated FLINK-2293:
-
Fix Version/s: 0.9.1






[GitHub] flink pull request: [Flink-1999] basic TfidfTransformer

2015-06-30 Thread rbraeunlich
Github user rbraeunlich commented on a diff in the pull request:

https://github.com/apache/flink/pull/730#discussion_r33551797
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/TfIdfTransformer.scala
 ---
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * License); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.feature
+
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer}
+import org.apache.flink.ml.math.SparseVector
+
+import scala.collection.mutable.LinkedHashSet
+import scala.math.log;
+
+/**
+ * This transformer calculates the term frequency times inverse document
+ * frequency (tf-idf) for the given DataSet of documents.
+ * The DataSet will be treated as the corpus of documents. The single
+ * words will be filtered against the regex:
+ * <code>
+ * (?u)\b\w\w+\b
+ * </code>
+ * <p>
+ * The TF is the frequency of a word inside one document.
+ * <p>
+ * The IDF for a word is calculated as:
+ * log(total number of documents / documents that contain the word) + 1
+ * <p>
+ * This transformer returns a SparseVector where the index is the hash of
+ * the word and the value is the tf-idf.
+ * @author Ronny Bräunlich
+ * @author Vassil Dimov
+ * @author Filip Perisic
+ */
+class TfIdfTransformer extends Transformer[(Int, Seq[String]), (Int, SparseVector)] {
+
+  override def transform(
+      input: DataSet[(Int /* docId */ , Seq[String] /* the document */ )],
+      transformParameters: ParameterMap): DataSet[(Int, SparseVector)] = {
--- End diff --

We had some discussion about the type in 
[FLINK-1999](https://issues.apache.org/jira/browse/FLINK-1999), and since there 
were no objections we stuck with Seq[String]. With Seq[String], the caller has 
to decide how to split their text.
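The formula described in the scaladoc, idf = log(totalDocs / docsContainingWord) + 1, can be checked with a tiny in-memory sketch (hypothetical Java, not the Scala transformer, which hashes words into a SparseVector):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory check of the tf-idf formula discussed above:
// tfidf(word, doc) = tf(word, doc) * (log(totalDocs / docsContaining(word)) + 1).
// Hypothetical sketch; the actual transformer operates on a DataSet.
public class TfIdf {

    static Map<String, Double> tfIdf(List<String> doc, List<List<String>> corpus) {
        Map<String, Integer> tf = new HashMap<>();
        for (String word : doc) {
            tf.merge(word, 1, Integer::sum);  // raw term frequency in this document
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            long docsWithWord = corpus.stream()
                    .filter(d -> d.contains(e.getKey()))
                    .count();
            double idf = Math.log((double) corpus.size() / docsWithWord) + 1.0;
            result.put(e.getKey(), e.getValue() * idf);
        }
        return result;
    }
}
```

A word occurring in every document gets idf = log(1) + 1 = 1, so the "+ 1" keeps ubiquitous words from being zeroed out entirely.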




[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...

2015-06-30 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117042891
  
Hello Sachin, could you explain what the discrete sampler does?




[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607916#comment-14607916
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117051023
  
Sorry about the formatting though. I'll fix it. I haven't worked on this in 
a while. 
I'll incorporate your suggestions from the previous PR.







[GitHub] flink pull request: [FLINK-1731] [ml] Implementation of Feature K-...

2015-06-30 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117053448
  
Hello @peedeeX21. The API does not deal with distributed models at the 
moment. In the K-means case, having the model distributed is overkill: it is 
highly unlikely that you will have more than 1000 centroids, so the model is 
tiny, and distributing it actually creates unnecessary overhead.

We can keep the current implementation, but in the future we should really 
test against a non-distributed model, which can be broadcast as a 
DataSet[Seq[LabeledVector]], and compare performance.

Also, could you add an evaluate operation (EvaluateDataSetOperation) for 
Kmeans (and a corresponding test)? It would be parametrized as 
EvaluateDataSetOperation[Kmeans, Vector, Double]
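The "tiny model" point can be sketched as follows (hypothetical code; in Flink the centroid Seq would be shipped to each worker as a broadcast set, not implemented like this):

```java
// Sketch of why a small centroid model needs no distribution: every worker
// can hold all k centroids locally and assign points with a simple
// nearest-centroid scan. Hypothetical code, not the Flink ML predict operation.
public class NearestCentroid {

    // Returns the index of the centroid closest to the point (squared Euclidean).
    static int assign(double[] point, double[][] centroids) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centroids[c][i];
                d += diff * diff;   // accumulate squared distance to centroid c
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }
}
```

Even with 1000 centroids of dimension 100, the model is under a megabyte, which is why broadcasting beats a distributed join here.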




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607941#comment-14607941
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117053448
  
Hello @peedeeX21. The API does not deal with distributed models at the 
moment. In the K-means case, having the model distributed is overkill: it is 
highly unlikely that you will have more than 1000 centroids, so the model is 
tiny, and distributing it actually creates unnecessary overhead.

We can keep the current implementation, but in the future we should really 
test against a non-distributed model, which can be broadcast as a 
DataSet[Seq[LabeledVector]], and compare performance.

Also, could you add an evaluate operation (EvaluateDataSetOperation) for 
Kmeans (and a corresponding test)? It would be parametrized as 
EvaluateDataSetOperation[Kmeans, Vector, Double]


 Add kMeans clustering algorithm to machine learning library
 ---

 Key: FLINK-1731
 URL: https://issues.apache.org/jira/browse/FLINK-1731
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Peter Schrott
  Labels: ML

 The Flink repository already contains a kMeans implementation, but it is not 
 yet ported to the machine learning library. I assume that only the used data 
 types have to be adapted and then it can be more or less directly moved to 
 flink-ml.
 The kMeans++ [1] and kMeans|| [2] algorithms constitute a better 
 implementation because they improve the initial seeding phase to achieve 
 near-optimal clustering. It might be worthwhile to implement kMeans||.
 Resources:
 [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
 [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf





[jira] [Commented] (FLINK-2293) Division by Zero Exception

2015-06-30 Thread Fabian Hueske (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607950#comment-14607950
 ] 

Fabian Hueske commented on FLINK-2293:
--

Is it possible to share the 10GB input data (download link)? 
Could you also share a few details about your execution setup (local machine, 
cluster, #machines, #slots, amount of memory, ...) that would help to reproduce 
the problem? Thanks!

 Division by Zero Exception
 --

 Key: FLINK-2293
 URL: https://issues.apache.org/jira/browse/FLINK-2293
 Project: Flink
  Issue Type: Bug
  Components: Local Runtime
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Priority: Critical
 Fix For: 0.9.1


 I am basically running an algorithm that simulates a Gather Sum Apply 
 Iteration that performs Triangle Count (Why simulate it? Because you just 
 need a superstep - useless overhead if you use the runGatherSumApply 
 function in Graph).
 What happens, at a high level:
 1). Select neighbors with ID greater than the one corresponding to the 
 current vertex;
 2). Propagate the received values to neighbors with higher ID;
 3). compute the number of triangles by checking whether
 trgVertex.getValue().get(srcVertex.getId());
 As you can see, I *do not* perform any division at all;
 code is here: 
 https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
 Now for small graphs, 50MB max, the computation finishes nicely with the 
 correct result. For a 10GB graph, however, I got this:
 java.lang.ArithmeticException: / by zero
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.insertIntoTable(MutableHashTable.java:836)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:819)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:544)
 at 
 org.apache.flink.runtime.operators.hash.NonReusingBuildFirstHashMatchIterator.callWithNextKey(NonReusingBuildFirstHashMatchIterator.java:104)
 at 
 org.apache.flink.runtime.operators.MatchDriver.run(MatchDriver.java:173)
 at 
 org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
 at 
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
 at java.lang.Thread.run(Thread.java:722)
 see the full log here: https://gist.github.com/andralungu/984774f6348269df7951
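 For reference, the three propagation steps listed above can be sketched 
 framework-free in plain Java (illustrative names only, not the Gelly code 
 linked in this issue):

```java
import java.util.*;

// Framework-free sketch of the three steps: 1) keep only higher-ID
// neighbors, 2) propagate once more to higher-ID neighbors, 3) check
// whether the closing edge exists. Illustrative only.
class TriangleCount {

    static int countTriangles(Map<Integer, Set<Integer>> adj) {
        int triangles = 0;
        for (int u : adj.keySet()) {
            for (int v : adj.get(u)) {
                if (v <= u) continue;                // 1) higher-ID neighbors only
                for (int w : adj.get(v)) {
                    if (w <= v) continue;            // 2) propagate upward again
                    if (adj.get(u).contains(w)) {    // 3) does the closing edge exist?
                        triangles++;
                    }
                }
            }
        }
        return triangles;
    }
}
```

 As the sketch shows, the algorithm itself involves no division at all.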





[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607961#comment-14607961
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/696#issuecomment-117059705
  
Hi, I updated this PR. I reimplemented kNN Join with `zipWithIndex` and 
fitted to changed pipeline architecture. 
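The core computation that the distributed kNN join (H-BNLJ/H-BRJ) parallelizes 
is a brute-force distance scan; a minimal single-machine sketch, with 
illustrative names that are not this PR's code:

```java
import java.util.*;

// Minimal single-machine sketch of exact kNN: score every candidate by
// squared Euclidean distance and keep the k closest. Illustrative only;
// the PR distributes this join across blocks of the data set.
class ExactKnn {

    static int[] kNearest(double[][] points, double[] query, int k) {
        Integer[] idx = new Integer[points.length];
        for (int i = 0; i < points.length; i++) idx[i] = i;
        // sort candidate indices by distance to the query point
        Arrays.sort(idx, Comparator.comparingDouble(i -> sqDist(points[i], query)));
        int[] result = new int[k];
        for (int i = 0; i < k; i++) result[i] = idx[i];
        return result;
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```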


 Add exact k-nearest-neighbours algorithm to machine learning library
 

 Key: FLINK-1745
 URL: https://issues.apache.org/jira/browse/FLINK-1745
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Till Rohrmann
  Labels: ML, Starter

 Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, 
 it is still used as a means to classify data and to do regression. This issue 
 focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
 proposed in [2].
 Could be a starter task.
 Resources:
 [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
 [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]





[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607895#comment-14607895
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117042891
  
Hello Sachin, could you explain what the discrete sampler does?


 Add Initialization schemes for K-means clustering
 -

 Key: FLINK-2131
 URL: https://issues.apache.org/jira/browse/FLINK-2131
 Project: Flink
  Issue Type: Task
  Components: Machine Learning Library
Reporter: Sachin Goel
Assignee: Sachin Goel

 The Lloyd's [KMeans] algorithm takes initial centroids as its input. However, 
 in case the user doesn't provide the initial centers, they may ask for a 
 particular initialization scheme to be followed. The most commonly used are 
 these:
 1. Random initialization: Self-explanatory
 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
 For very large data sets, or for large values of k, the kmeans|| method is 
 preferred as it provides the same approximation guarantees as kmeans++ and 
 requires fewer passes over the input data.
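 The kmeans++ scheme from point 2 can be sketched in plain Java (this is an 
 illustration of the D^2 weighting from the referenced paper, not the Flink 
 implementation): each new center is drawn with probability proportional to 
 the squared distance from a point to its closest already-chosen center.

```java
import java.util.*;

// Sketch of k-means++ seeding (the D^2 weighting), illustrative only.
class KMeansPlusPlus {

    static List<double[]> seed(double[][] data, int k, Random rnd) {
        List<double[]> centers = new ArrayList<>();
        centers.add(data[rnd.nextInt(data.length)]);      // first center: uniform
        while (centers.size() < k) {
            double[] d2 = new double[data.length];
            double total = 0.0;
            for (int i = 0; i < data.length; i++) {
                double best = Double.POSITIVE_INFINITY;
                for (double[] c : centers) best = Math.min(best, sqDist(data[i], c));
                d2[i] = best;                              // distance to closest center
                total += best;
            }
            // draw an index with probability d2[i] / total
            double r = rnd.nextDouble() * total, cum = 0.0;
            int pick = data.length - 1;
            for (int i = 0; i < data.length; i++) {
                cum += d2[i];
                if (r < cum) { pick = i; break; }
            }
            centers.add(data[pick]);
        }
        return centers;
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```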





[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607907#comment-14607907
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117047575
  
Hi @thvasilo, thanks for taking the time to go through it. 
Consider for example a probability distribution P(X_0) = 0.2, P(X_1) = 0.3, 
P(X_2) = 0.5
To sample an element out of X_0, X_1 and X_2, we can generate a random 
number but we need to map intervals of real numbers to the values X_0, X_1 and 
X_2. This is what the discreteSampler does.
It forms a cumulative distribution as [0.2, 0.5, 1.0] and then, if the 
generated random number is in [0, 0.2), we pick X_0, and so on.
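The mapping described above can be sketched as follows (illustrative Java, not 
the PR's DiscreteSampler; note the weights need not sum to 1, since the draw is 
scaled by the cumulative total):

```java
import java.util.*;

// Sketch of the discrete sampler described above (illustrative, not the
// PR's code). Weights need not sum to 1: the uniform draw is scaled by
// the cumulative total instead.
class DiscreteSampler {
    private final double[] cumulative;
    private final Random rnd;

    DiscreteSampler(double[] weights, long seed) {
        cumulative = new double[weights.length];
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i];
            cumulative[i] = sum;              // e.g. [0.2, 0.5, 1.0]
        }
        rnd = new Random(seed);
    }

    // Map a uniform draw to the index whose interval contains it:
    // [0, 0.2) -> 0, [0.2, 0.5) -> 1, [0.5, 1.0) -> 2.
    int sample() {
        double r = rnd.nextDouble() * cumulative[cumulative.length - 1];
        for (int i = 0; i < cumulative.length; i++) {
            if (r < cumulative[i]) return i;
        }
        return cumulative.length - 1;
    }
}
```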


 Add Initialization schemes for K-means clustering
 -

 Key: FLINK-2131
 URL: https://issues.apache.org/jira/browse/FLINK-2131
 Project: Flink
  Issue Type: Task
  Components: Machine Learning Library
Reporter: Sachin Goel
Assignee: Sachin Goel

 The Lloyd's [KMeans] algorithm takes initial centroids as its input. However, 
 in case the user doesn't provide the initial centers, they may ask for a 
 particular initialization scheme to be followed. The most commonly used are 
 these:
 1. Random initialization: Self-explanatory
 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
 For very large data sets, or for large values of k, the kmeans|| method is 
 preferred as it provides the same approximation guarantees as kmeans++ and 
 requires fewer passes over the input data.





[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...

2015-06-30 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117047575
  
Hi @thvasilo, thanks for taking the time to go through it. 
Consider for example a probability distribution P(X_0) = 0.2, P(X_1) = 0.3, 
P(X_2) = 0.5
To sample an element out of X_0, X_1 and X_2, we can generate a random 
number but we need to map intervals of real numbers to the values X_0, X_1 and 
X_2. This is what the discreteSampler does.
It forms a cumulative distribution as [0.2, 0.5, 1.0] and then, if the 
generated random number is in [0, 0.2), we pick X_0, and so on.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...

2015-06-30 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117051023
  
Sorry about the formatting though. I'll fix it. I haven't worked on this in 
a while. 
I'll incorporate your suggestions from the previous PR.




[jira] [Updated] (FLINK-766) Web interface progress monitoring for DataSources and DataSinks with splits

2015-06-30 Thread Maximilian Michels (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maximilian Michels updated FLINK-766:
-
Issue Type: Sub-task  (was: Improvement)
Parent: FLINK-456

 Web interface progress monitoring for DataSources and DataSinks with splits
 ---

 Key: FLINK-766
 URL: https://issues.apache.org/jira/browse/FLINK-766
 Project: Flink
  Issue Type: Sub-task
Reporter: GitHub Import
Priority: Minor
  Labels: github-import
 Fix For: pre-apache


 The progress monitoring for DataSources and DataSinks can be improved by 
 including the number of processed vs total splits into the progress.
  Imported from GitHub 
 Url: https://github.com/stratosphere/stratosphere/issues/766
 Created by: [rmetzger|https://github.com/rmetzger]
 Labels: enhancement, gui, simple-issue, 
 Milestone: Release 0.6 (unplanned)
 Created at: Wed May 07 12:05:54 CEST 2014
 State: open





[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607954#comment-14607954
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117055846
  
Okay. I'll update it today itself with a few trivial fixes.


 Add Initialization schemes for K-means clustering
 -

 Key: FLINK-2131
 URL: https://issues.apache.org/jira/browse/FLINK-2131
 Project: Flink
  Issue Type: Task
  Components: Machine Learning Library
Reporter: Sachin Goel
Assignee: Sachin Goel

 The Lloyd's [KMeans] algorithm takes initial centroids as its input. However, 
 in case the user doesn't provide the initial centers, they may ask for a 
 particular initialization scheme to be followed. The most commonly used are 
 these:
 1. Random initialization: Self-explanatory
 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
 For very large data sets, or for large values of k, the kmeans|| method is 
 preferred as it provides the same approximation guarantees as kmeans++ and 
 requires fewer passes over the input data.





[jira] [Updated] (FLINK-2286) Window ParallelMerge sometimes swallows elements of the last window

2015-06-30 Thread Maximilian Michels (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maximilian Michels updated FLINK-2286:
--
Fix Version/s: 0.9.1

 Window ParallelMerge sometimes swallows elements of the last window
 ---

 Key: FLINK-2286
 URL: https://issues.apache.org/jira/browse/FLINK-2286
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9, 0.10
Reporter: Márton Balassi
Assignee: Márton Balassi
 Fix For: 0.9.1


 Last windows in the stream that do not have parts at all of the parallel 
 operator instances get swallowed by the ParallelMerge.
 To resolve this ParallelMerge should be an operator instead of a function, so 
 the close method can access the collector and emit these.





[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...

2015-06-30 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117055846
  
Okay. I'll update it today itself with a few trivial fixes.




[jira] [Updated] (FLINK-2285) Active policy emits elements of the last window twice

2015-06-30 Thread Maximilian Michels (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maximilian Michels updated FLINK-2285:
--
Fix Version/s: 0.9.1

 Active policy emits elements of the last window twice
 -

 Key: FLINK-2285
 URL: https://issues.apache.org/jira/browse/FLINK-2285
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9, 0.10
Reporter: Márton Balassi
Assignee: Márton Balassi
 Fix For: 0.9.1


 The root cause is some duplicate code between the close methods of the 
 GroupedActiveDiscretizer and the GroupedStreamDiscretizer.





[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...

2015-06-30 Thread chiwanpark
Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/696#issuecomment-117059705
  
Hi, I updated this PR. I reimplemented kNN Join with `zipWithIndex` and 
fitted to changed pipeline architecture. 




[GitHub] flink pull request: [FLINK-1731] [ml] Implementation of Feature K-...

2015-06-30 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117060737
  
Hi. IMO, the purpose of learning is to develop a model which compactly 
represents the data somehow. Thus, having a distributed model doesn't make 
sense. Besides, the user might just want to take the model and use it somewhere 
else, in which case it makes sense to have it available not as a distributed 
data set, but simply as a Java/Scala object which the user can easily operate on.




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607968#comment-14607968
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117060737
  
Hi. IMO, the purpose of learning is to develop a model which compactly 
represents the data somehow. Thus, having a distributed model doesn't make 
sense. Besides, the user might just want to take the model and use it somewhere 
else, in which case it makes sense to have it available not as a distributed 
data set, but simply as a Java/Scala object which the user can easily operate on.


 Add kMeans clustering algorithm to machine learning library
 ---

 Key: FLINK-1731
 URL: https://issues.apache.org/jira/browse/FLINK-1731
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Peter Schrott
  Labels: ML

 The Flink repository already contains a kMeans implementation but it is not 
 yet ported to the machine learning library. I assume that only the used data 
 types have to be adapted and then it can be more or less directly moved to 
 flink-ml.
 The kMeans++ [1] and the kMeans|| [2] algorithms constitute a better 
 implementation because they improve the initial seeding phase to achieve 
 near-optimal clustering. It might be worthwhile to implement kMeans||.
 Resources:
 [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
 [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf





[jira] [Commented] (FLINK-2141) Allow GSA's Gather to perform this operation in more than one direction

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609398#comment-14609398
 ] 

ASF GitHub Bot commented on FLINK-2141:
---

GitHub user shghatge opened a pull request:

https://github.com/apache/flink/pull/877

[FLINK-2141] Allow GSA's Gather to perform this operation in more than one 
direction

Added the setDirection() and getDirection() methods to GSAConfiguration.java
Added functionality to gather values from chosen neighbors instead of only 
OUT neighbors
Added a simple example which generates, as the value of each vertex, the list 
of vertices to which there exists a path from that vertex
Added util classes for the example and also added a test.
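The configuration surface described above amounts to a getter/setter pair; a 
minimal sketch (the EdgeDirection values and the OUT default are assumptions 
for illustration, not the merged Gelly code):

```java
// Minimal sketch of the configuration option described above. The
// EdgeDirection values and the OUT default are illustrative assumptions,
// not the actual Gelly implementation.
enum EdgeDirection { IN, OUT, ALL }

class GSAConfiguration {
    // default keeps the previous behaviour: gather from OUT neighbors only
    private EdgeDirection direction = EdgeDirection.OUT;

    public void setDirection(EdgeDirection direction) {
        this.direction = direction;
    }

    public EdgeDirection getDirection() {
        return direction;
    }
}
```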

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shghatge/flink vertex

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/877.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #877


commit d88329024394cef17a3454d20e6905f7d7363004
Author: Shivani shgha...@gmail.com
Date:   2015-07-01T01:03:58Z

[FLINK-2141][gelly] Allow GSA's Gather to perform this operation in more 
than one direction

commit b2a29d6763820a2be6b1a71f0224fda83861c2c4
Author: Shivani shgha...@gmail.com
Date:   2015-07-01T01:10:41Z

[FLINK-2141][gelly] Allow GSA's Gather to perform this operation in more 
than one direction




 Allow GSA's Gather to perform this operation in more than one direction
 ---

 Key: FLINK-2141
 URL: https://issues.apache.org/jira/browse/FLINK-2141
 Project: Flink
  Issue Type: New Feature
  Components: Gelly
Affects Versions: 0.9
Reporter: Andra Lungu
Assignee: Shivani Ghatge

 For the time being, a vertex only gathers information from its in-edges.
 Similarly to the vertex-centric approach, we would like to allow users to 
 gather data from out and all edges as well. 
 This property should be set using a setDirection() method.  





[GitHub] flink pull request: [FLINK-2141] Allow GSA's Gather to perform thi...

2015-06-30 Thread shghatge
GitHub user shghatge opened a pull request:

https://github.com/apache/flink/pull/877

[FLINK-2141] Allow GSA's Gather to perform this operation in more than one 
direction

Added the setDirection() and getDirection() methods to GSAConfiguration.java
Added functionality to gather values from chosen neighbors instead of only 
OUT neighbors
Added a simple example which generates, as the value of each vertex, the list 
of vertices to which there exists a path from that vertex
Added util classes for the example and also added a test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shghatge/flink vertex

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/877.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #877


commit d88329024394cef17a3454d20e6905f7d7363004
Author: Shivani shgha...@gmail.com
Date:   2015-07-01T01:03:58Z

[FLINK-2141][gelly] Allow GSA's Gather to perform this operation in more 
than one direction

commit b2a29d6763820a2be6b1a71f0224fda83861c2c4
Author: Shivani shgha...@gmail.com
Date:   2015-07-01T01:10:41Z

[FLINK-2141][gelly] Allow GSA's Gather to perform this operation in more 
than one direction






[jira] [Updated] (FLINK-2302) Allow multiple instances to run on single host

2015-06-30 Thread Ufuk Celebi (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi updated FLINK-2302:
---
Description: 
The scripts currently can only run a single component like TaskManager per host.

The main reason is that there is a single PID file per running process (e.g. 
TaskManager). If this file exists, no other process of the respective type can 
be started. Instead of having a single file per process with a single PID, we 
can append multiple PIDs to one file, e.g. a {{tm.pid}} can look like:

{code}
pid tm1
pid tm2
{code}

Stopping a TM then removes a single PID in -FIFO fashion (add pids at tail, 
remove pids from head)- in LIFO fashion (add/remove pids at/from tail). LIFO is 
important to ensure that the log files can be automatically enumerated by 
number of pids per host w/o overwriting existing log files.


  was:
The scripts currently can only run a single component like TaskManager per host.

The main reason is that there is a single PID file per running process (e.g. 
TaskManager). If this file exists, no other process of the respective type can 
be started. Instead of having a single file per process with a single PID, we 
can append multiple PIDs to one file, e.g. a {{tm.pid}} can look like:

{code}
pid tm1
pid tm2
{code}

Stopping a TM then removes a single PID in FIFO fashion (add pids at tail, 
remove pids from head).



 Allow multiple instances to run on single host
 --

 Key: FLINK-2302
 URL: https://issues.apache.org/jira/browse/FLINK-2302
 Project: Flink
  Issue Type: Improvement
  Components: Start-Stop Scripts
Affects Versions: 0.10
Reporter: Ufuk Celebi
 Fix For: 0.10


 The scripts currently can only run a single component like TaskManager per 
 host.
 The main reason is that there is a single PID file per running process (e.g. 
 TaskManager). If this file exists, no other process of the respective type 
 can be started. Instead of having a single file per process with a single 
 PID, we can append multiple PIDs to one file, e.g. a {{tm.pid}} can look like:
 {code}
 pid tm1
 pid tm2
 {code}
 Stopping a TM then removes a single PID in -FIFO fashion (add pids at tail, 
 remove pids from head)- in LIFO fashion (add/remove pids at/from tail). LIFO 
 is important to ensure that the log files can be automatically enumerated by 
 number of pids per host w/o overwriting existing log files.
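 The add/remove discipline can be modeled as a tail-ended deque (illustrative 
 Java; the real change lives in the start/stop shell scripts): LIFO guarantees 
 the pid removed belongs to the most recently started instance, so per-host 
 log-file numbering stays stable.

```java
import java.util.*;

// Illustrative model of the pid-file discipline described above. Pids are
// appended at the tail on start and removed from the tail on stop (LIFO),
// so the highest "instance number" per host is always released first.
class PidFile {
    private final Deque<String> pids = new ArrayDeque<>();

    void started(String pid) { pids.addLast(pid); }   // append at tail

    String stopOne() { return pids.pollLast(); }      // LIFO removal from tail

    int running() { return pids.size(); }
}
```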





[GitHub] flink pull request: [FLINK-2138] Added custom partitioning to Data...

2015-06-30 Thread gaborhermann
Github user gaborhermann commented on the pull request:

https://github.com/apache/flink/pull/872#issuecomment-117141916
  
Okay then. These are effects of the change that I did not know about. Let's 
stick to (2); later on, we might reconsider this.




[jira] [Comment Edited] (FLINK-2295) TwoInput Task do not react to/forward checkpoint barriers

2015-06-30 Thread Gyula Fora (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608171#comment-14608171
 ] 

Gyula Fora edited comment on FLINK-2295 at 6/30/15 11:55 AM:
-

I see that the CoReader set in the TwoInput task does not listen to the 
checkpoint events. This should be a trivial fix.


was (Author: gyfora):
I see that the CoReader set in the TwoInput tas does not listen to the 
checkpoint events. This should be a trivial fix.

 TwoInput Task do not react to/forward checkpoint barriers
 -

 Key: FLINK-2295
 URL: https://issues.apache.org/jira/browse/FLINK-2295
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9, 0.10, 0.9.1
Reporter: Aljoscha Krettek
Assignee: Aljoscha Krettek
Priority: Blocker

 The event listener for the checkpoint barriers was never enabled for TwoInput 
 tasks. I have a fix for it and also tests that verify that it actually works.





[GitHub] flink pull request: New operator state interfaces

2015-06-30 Thread rmetzger
Github user rmetzger commented on the pull request:

https://github.com/apache/flink/pull/747#issuecomment-117161545
  
Please create JIRAs for changes in the future.




[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1

2015-06-30 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608236#comment-14608236
 ] 

Robert Metzger commented on FLINK-2294:
---

It would have been nice if you waited until you reached agreement with Aljoscha 
on the implementation, instead of just committing it to master, without a pull 
request.

From my understanding, I agree with Gyula that it is easier to set the 
nextInput outside the operator, instead of inside all operators.

 Keyed State does not work with DOP=1
 

 Key: FLINK-2294
 URL: https://issues.apache.org/jira/browse/FLINK-2294
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Aljoscha Krettek
Assignee: Gyula Fora
Priority: Blocker

 When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test 
 fails. The reason seems to be that the element is not properly set when 
 chaining is happening.
 Also, requiring this:
 {code}
 headContext.setNextInput(nextRecord);
 streamOperator.processElement(nextRecord);
 {code}
 to be called seems rather fragile. Why not set the element in 
 {{processElement()}}. This would also make for cleaner encapsulation, since 
 now all outside code must assume that operators have a 
 {{StreamingRuntimeContext}} on which they set the next element.
 The state/keyed state machinery seems dangerously undertested.





[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1

2015-06-30 Thread Gyula Fora (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608160#comment-14608160
 ] 

Gyula Fora commented on FLINK-2294:
---

The issue was caused by having two output collectors in the output handlers for 
copy/non-copy, and I forgot to set it in the second one. Code duplication = evil.

I'm pushing a fix soon.

 Keyed State does not work with DOP=1
 

 Key: FLINK-2294
 URL: https://issues.apache.org/jira/browse/FLINK-2294
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Aljoscha Krettek
Assignee: Gyula Fora
Priority: Blocker

 When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test 
 fails. The reason seems to be that the element is not properly set when 
 chaining is happening.
 Also, requiring this:
 {code}
 headContext.setNextInput(nextRecord);
 streamOperator.processElement(nextRecord);
 {code}
 to be called seems rather fragile. Why not set the element in 
 {{processElement()}}. This would also make for cleaner encapsulation, since 
 now all outside code must assume that operators have a 
 {{StreamingRuntimeContext}} on which they set the next element.
 The state/keyed state machinery seems dangerously undertested.





[jira] [Commented] (FLINK-2138) PartitionCustom for streaming

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608159#comment-14608159
 ] 

ASF GitHub Bot commented on FLINK-2138:
---

Github user gaborhermann commented on the pull request:

https://github.com/apache/flink/pull/872#issuecomment-117141916
  
Okay then. These are effects of the change that I did not know about. Let's 
stick to (2); later on, we might reconsider this.


 PartitionCustom for streaming
 -

 Key: FLINK-2138
 URL: https://issues.apache.org/jira/browse/FLINK-2138
 Project: Flink
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 0.9
Reporter: Márton Balassi
Assignee: Gábor Hermann
Priority: Minor

 The batch API has support for custom partitioning, this should be added for 
 streaming with a similar signature.





[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1

2015-06-30 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608178#comment-14608178
 ] 

Aljoscha Krettek commented on FLINK-2294:
-

{{processElement}} gets the element. That is the entry point into the operator. 
Why not set it inside the operator where the element is received?
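The encapsulation being argued for can be sketched like this (simplified, 
illustrative types, not Flink's actual operator classes): the operator records 
the element itself on entry to processElement, so callers never need to touch 
the runtime context.

```java
// Illustrative sketch of the encapsulation argued for above (simplified
// types, not Flink's actual classes): processElement records the element
// itself, so callers no longer need to call setNextInput first.
class StreamingRuntimeContext {
    Object nextInput;
    void setNextInput(Object record) { nextInput = record; }
}

abstract class StreamOperator {
    final StreamingRuntimeContext context = new StreamingRuntimeContext();

    // Entry point: the element is set where it is received, inside the
    // operator, instead of by every caller before invoking it.
    final void processElement(Object record) {
        context.setNextInput(record);
        doProcess(record);
    }

    abstract void doProcess(Object record);
}
```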

 Keyed State does not work with DOP=1
 

 Key: FLINK-2294
 URL: https://issues.apache.org/jira/browse/FLINK-2294
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Aljoscha Krettek
Assignee: Gyula Fora
Priority: Blocker

 When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test 
 fails. The reason seems to be that the element is not properly set when 
 chaining is happening.
 Also, requiring this:
 {code}
 headContext.setNextInput(nextRecord);
 streamOperator.processElement(nextRecord);
 {code}
 to be called seems rather fragile. Why not set the element in 
 {{processElement()}}. This would also make for cleaner encapsulation, since 
 now all outside code must assume that operators have a 
 {{StreamingRuntimeContext}} on which they set the next element.
 The state/keyed state machinery seems dangerously undertested.





[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1

2015-06-30 Thread Gyula Fora (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608238#comment-14608238
 ] 

Gyula Fora commented on FLINK-2294:
---

This was a trivial fix for the current implementation; that's why I pushed it. 
Changing the implementation would mean that we need to add additional tests for 
the operators, which we can do afterwards if we decide that way.

 Keyed State does not work with DOP=1
 

 Key: FLINK-2294
 URL: https://issues.apache.org/jira/browse/FLINK-2294
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Aljoscha Krettek
Assignee: Gyula Fora
Priority: Blocker

 When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test 
 fails. The reason seems to be that the element is not properly set when 
 chaining is happening.
 Also, requiring this:
 {code}
 headContext.setNextInput(nextRecord);
 streamOperator.processElement(nextRecord);
 {code}
 to be called seems rather fragile. Why not set the element in 
 {{processElement()}}. This would also make for cleaner encapsulation, since 
 now all outside code must assume that operators have a 
 {{StreamingRuntimeContext}} on which they set the next element.
 The state/keyed state machinery seems dangerously undertested.





[jira] [Commented] (FLINK-2295) TwoInput Task do not react to/forward checkpoint barriers

2015-06-30 Thread Gyula Fora (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608171#comment-14608171
 ] 

Gyula Fora commented on FLINK-2295:
---

I see that the CoReader set in the TwoInput task does not listen to the 
checkpoint events. This should be a trivial fix.

 TwoInput Task do not react to/forward checkpoint barriers
 -

 Key: FLINK-2295
 URL: https://issues.apache.org/jira/browse/FLINK-2295
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9, 0.10, 0.9.1
Reporter: Aljoscha Krettek
Assignee: Aljoscha Krettek
Priority: Blocker

 The event listener for the checkpoint barriers was never enabled for TwoInput 
 tasks. I have a fix for it and also tests that verify that it actually works.





[jira] [Commented] (FLINK-2295) TwoInput Task do not react to/forward checkpoint barriers

2015-06-30 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608183#comment-14608183
 ] 

Aljoscha Krettek commented on FLINK-2295:
-

That's why I said I have the fix, yes. :D The difficult bit is the tests that 
verify that this actually works and is not broken by some future changes.

 TwoInput Task do not react to/forward checkpoint barriers
 -

 Key: FLINK-2295
 URL: https://issues.apache.org/jira/browse/FLINK-2295
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.9, 0.10, 0.9.1
Reporter: Aljoscha Krettek
Assignee: Aljoscha Krettek
Priority: Blocker

 The event listener for the checkpoint barriers was never enabled for TwoInput 
 tasks. I have a fix for it and also tests that verify that it actually works.





[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1

2015-06-30 Thread Gyula Fora (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608195#comment-14608195
 ] 

Gyula Fora commented on FLINK-2294:
---

Every stream operator has a context, and since the call to processElement 
happens outside of the operator implementation (inside the stream task or 
collector), we can set it there. (That's 2 places.)

Setting it inside processElement means that we need to set it in each 
operator implementation (Map, Filter etc.) and the user needs to be aware of 
setting it in custom operator implementations (infinite places). 

So I will probably stick with the current solution.
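The call-site approach described above can be sketched roughly as follows. All class and method names here are illustrative stand-ins, not Flink's actual StreamTask/StreamingRuntimeContext API; the point is only that the element is set in one place (the framework's invoke path) rather than in every operator implementation:

```java
import java.util.ArrayList;
import java.util.List;

interface Operator<T> {
    void processElement(T element, RuntimeCtx<T> ctx);
}

class RuntimeCtx<T> {
    private T nextInput; // the element that keyed-state lookups operate on
    void setNextInput(T element) { this.nextInput = element; }
    T getNextInput() { return nextInput; }
}

class StreamTask<T> {
    private final Operator<T> operator;
    private final RuntimeCtx<T> ctx = new RuntimeCtx<>();

    StreamTask(Operator<T> operator) { this.operator = operator; }

    // The call-site sets the element exactly once, so user-written
    // operators never have to remember to do it themselves.
    void invoke(T record) {
        ctx.setNextInput(record);
        operator.processElement(record, ctx);
    }
}

public class ElementContextSketch {
    public static List<String> run(List<String> records) {
        List<String> seen = new ArrayList<>();
        StreamTask<String> task =
            new StreamTask<>((element, ctx) -> seen.add(ctx.getNextInput()));
        for (String r : records) {
            task.invoke(r);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a", "b"))); // [a, b]
    }
}
```

The trade-off is exactly the one debated here: the framework must set the element at every call-site (few places), versus every operator implementation setting it for itself (unbounded places, including user code).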

 Keyed State does not work with DOP=1
 

 Key: FLINK-2294
 URL: https://issues.apache.org/jira/browse/FLINK-2294
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Aljoscha Krettek
Assignee: Gyula Fora
Priority: Blocker

 When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test 
 fails. The reason seems to be that the element is not properly set when 
 chaining is happening.
 Also, requiring this:
 {code}
 headContext.setNextInput(nextRecord);
 streamOperator.processElement(nextRecord);
 {code}
 to be called seems rather fragile. Why not set the element in 
 {{processElement()}}. This would also make for cleaner encapsulation, since 
 now all outside code must assume that operators have a 
 {{StreamingRuntimeContext}} on which they set the next element.
 The state/keyed state machinery seems dangerously undertested.





[jira] [Closed] (FLINK-2294) Keyed State does not work with DOP=1

2015-06-30 Thread Gyula Fora (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora closed FLINK-2294.
-
Resolution: Fixed

https://github.com/apache/flink/commit/fef9f115838b3ba3d3769f8669ee251c2cd403c6

 Keyed State does not work with DOP=1
 

 Key: FLINK-2294
 URL: https://issues.apache.org/jira/browse/FLINK-2294
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Aljoscha Krettek
Assignee: Gyula Fora
Priority: Blocker

 When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test 
 fails. The reason seems to be that the element is not properly set when 
 chaining is happening.
 Also, requiring this:
 {code}
 headContext.setNextInput(nextRecord);
 streamOperator.processElement(nextRecord);
 {code}
 to be called seems rather fragile. Why not set the element in 
 {{processElement()}}. This would also make for cleaner encapsulation, since 
 now all outside code must assume that operators have a 
 {{StreamingRuntimeContext}} on which they set the next element.
 The state/keyed state machinery seems dangerously undertested.





[jira] [Updated] (FLINK-2290) CoRecordReader Does Not Read Events From Both Inputs When No Elements Arrive

2015-06-30 Thread Aljoscha Krettek (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aljoscha Krettek updated FLINK-2290:

Priority: Blocker  (was: Major)

 CoRecordReader Does Not Read Events From Both Inputs When No Elements Arrive
 

 Key: FLINK-2290
 URL: https://issues.apache.org/jira/browse/FLINK-2290
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Aljoscha Krettek
Assignee: Aljoscha Krettek
Priority: Blocker

 When no elements arrive, the reader will always try to read from the same 
 input index. This means that it will only process elements from this input. 
 This could be problematic with Watermarks/Checkpoint barriers.
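The starvation problem described in this issue, and the round-robin style fix, can be illustrated with a small sketch (hypothetical types and names, not the actual CoRecordReader): if the reader always starts polling at index 0, an empty input 0 means a barrier waiting on input 1 is never observed, whereas rotating the starting index lets both inputs be polled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CoReaderSketch {
    private final Deque<String> input0;
    private final Deque<String> input1;
    private int nextIndex = 0; // rotated so neither input is starved

    public CoReaderSketch(Deque<String> input0, Deque<String> input1) {
        this.input0 = input0;
        this.input1 = input1;
    }

    /** Returns the next available element from either input, or null if both are empty. */
    public String next() {
        for (int attempt = 0; attempt < 2; attempt++) {
            Deque<String> in = (nextIndex == 0) ? input0 : input1;
            nextIndex = 1 - nextIndex; // advance even when this input had nothing
            if (!in.isEmpty()) {
                return in.poll();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Deque<String> in0 = new ArrayDeque<>();
        Deque<String> in1 = new ArrayDeque<>();
        in1.add("barrier-1"); // only input 1 has data
        CoReaderSketch reader = new CoReaderSketch(in0, in1);
        System.out.println(reader.next()); // barrier-1: input 1 is not starved
    }
}
```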





[GitHub] flink pull request: [FLINK-2138] Added custom partitioning to Data...

2015-06-30 Thread StephanEwen
Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/872#issuecomment-117117294
  
How about we leave the batch API as it is for now and address that as a 
separate issue? There are quite some subtleties in how the optimizer assesses 
equality of partitioning (based on partitioners) that would have to be changed 
(and should retain backwards compatibility).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2138) PartitionCustom for streaming

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608088#comment-14608088
 ] 

ASF GitHub Bot commented on FLINK-2138:
---

Github user StephanEwen commented on the pull request:

https://github.com/apache/flink/pull/872#issuecomment-117117294
  
How about we leave the batch API as it is for now and address that as a 
separate issue? There are quite some subtleties in how the optimizer assesses 
equality of partitioning (based on partitioners) that would have to be changed 
(and should retain backwards compatibility).


 PartitionCustom for streaming
 -

 Key: FLINK-2138
 URL: https://issues.apache.org/jira/browse/FLINK-2138
 Project: Flink
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 0.9
Reporter: Márton Balassi
Assignee: Gábor Hermann
Priority: Minor

 The batch API has support for custom partitioning, this should be added for 
 streaming with a similar signature.





[jira] [Created] (FLINK-2294) Keyed State does not work with DOP=1

2015-06-30 Thread Aljoscha Krettek (JIRA)
Aljoscha Krettek created FLINK-2294:
---

 Summary: Keyed State does not work with DOP=1
 Key: FLINK-2294
 URL: https://issues.apache.org/jira/browse/FLINK-2294
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Aljoscha Krettek
Priority: Blocker


When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test 
fails. The reason seems to be that the element is not properly set when 
chaining is happening.

Also, requiring this:
{code}
headContext.setNextInput(nextRecord);
streamOperator.processElement(nextRecord);
{code}

to be called seems rather fragile. Why not set the element in 
{{processElement()}}. This would also make for cleaner encapsulation, since now 
all outside code must assume that operators have a {{StreamingRuntimeContext}} 
on which they set the next element.

The state/keyed state machinery seems dangerously undertested.





[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1

2015-06-30 Thread Gyula Fora (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608100#comment-14608100
 ] 

Gyula Fora commented on FLINK-2294:
---

I don't understand what you mean by setting it in the processElement().

That method is a part of an interface that the user implements.

 Keyed State does not work with DOP=1
 

 Key: FLINK-2294
 URL: https://issues.apache.org/jira/browse/FLINK-2294
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Aljoscha Krettek
Priority: Blocker

 When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test 
 fails. The reason seems to be that the element is not properly set when 
 chaining is happening.
 Also, requiring this:
 {code}
 headContext.setNextInput(nextRecord);
 streamOperator.processElement(nextRecord);
 {code}
 to be called seems rather fragile. Why not set the element in 
 {{processElement()}}. This would also make for cleaner encapsulation, since 
 now all outside code must assume that operators have a 
 {{StreamingRuntimeContext}} on which they set the next element.
 The state/keyed state machinery seems dangerously undertested.





[jira] [Commented] (FLINK-2297) Add threshold setting for SVM binary predictions

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608261#comment-14608261
 ] 

ASF GitHub Bot commented on FLINK-2297:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/874#discussion_r33570883
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/classification/SVMITSuite.scala
 ---
@@ -69,19 +70,38 @@ class SVMITSuite extends FlatSpec with Matchers with 
FlinkTestBase {
 
 svm.fit(trainingDS)
 
-val threshold = 0.0
-
-val predictionPairs = svm.evaluate(test).map {
-  truthPrediction =>
-val truth = truthPrediction._1
-val prediction = truthPrediction._2
-val thresholdedPrediction = if (prediction > threshold) 1.0 else -1.0
-(truth, thresholdedPrediction)
-}
+val predictionPairs = svm.evaluate(test)
 
 val absoluteErrorSum = predictionPairs.collect().map{
  case (truth, prediction) => Math.abs(truth - prediction)}.sum
 
 absoluteErrorSum should be < 15.0
   }
+
+  it should "be possible to get the raw decision function values" in {
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val svm = SVM().
+  setBlocks(env.getParallelism).
+  setIterations(100).
+  setLocalIterations(100).
+  setRegularization(0.002).
+  setStepsize(0.1).
+  setSeed(0).
+  clearThreshold()
+
+val trainingDS = env.fromCollection(Classification.trainingData)
+
+val test = trainingDS.map(x => x.vector)
+
+svm.fit(trainingDS)
+
+val predictions: DataSet[(FlinkVector, Double)] = svm.predict(test)
+
+val preds = predictions.map(vectorLabel => vectorLabel._2).collect()
+
+preds.max should be > 1.0
--- End diff --

Better way to check raw values vs. 1/-1
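The thresholding behaviour discussed in this PR can be sketched roughly as follows. This is not the actual FlinkML API; `predict` here is a hypothetical free function, and `OptionalDouble.empty()` stands in for the state after the equivalent of `clearThreshold()`:

```java
import java.util.OptionalDouble;

public class ThresholdSketch {
    // With a threshold set, the raw decision value is mapped to +/-1.0;
    // with the threshold cleared, the raw value is returned unchanged.
    public static double predict(double rawValue, OptionalDouble threshold) {
        if (threshold.isPresent()) {
            return rawValue > threshold.getAsDouble() ? 1.0 : -1.0;
        }
        return rawValue; // raw decision function value
    }

    public static void main(String[] args) {
        System.out.println(predict(0.3, OptionalDouble.of(0.0)));  // 1.0
        System.out.println(predict(-2.7, OptionalDouble.of(0.0))); // -1.0
        System.out.println(predict(-2.7, OptionalDouble.empty())); // -2.7
    }
}
```

This also explains the test assertion above: a maximum prediction strictly greater than 1.0 can only come from raw decision values, since thresholded output is confined to {-1.0, 1.0}.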



 Add threshold setting for SVM binary predictions
 

 Key: FLINK-2297
 URL: https://issues.apache.org/jira/browse/FLINK-2297
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor
  Labels: ML
 Fix For: 0.10


 Currently SVM outputs the raw decision function values when using the predict 
 function.
 We should have instead the ability to set a threshold above which examples 
 are labeled as positive (1.0) and below negative (-1.0). Then the prediction 
 function can be directly used for evaluation.





[GitHub] flink pull request: [FLINK-2297] [ml] Add threshold setting for SV...

2015-06-30 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/874#discussion_r33570883
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/classification/SVMITSuite.scala
 ---
@@ -69,19 +70,38 @@ class SVMITSuite extends FlatSpec with Matchers with 
FlinkTestBase {
 
 svm.fit(trainingDS)
 
-val threshold = 0.0
-
-val predictionPairs = svm.evaluate(test).map {
-  truthPrediction =>
-val truth = truthPrediction._1
-val prediction = truthPrediction._2
-val thresholdedPrediction = if (prediction > threshold) 1.0 else -1.0
-(truth, thresholdedPrediction)
-}
+val predictionPairs = svm.evaluate(test)
 
 val absoluteErrorSum = predictionPairs.collect().map{
  case (truth, prediction) => Math.abs(truth - prediction)}.sum
 
 absoluteErrorSum should be < 15.0
   }
+
+  it should "be possible to get the raw decision function values" in {
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val svm = SVM().
+  setBlocks(env.getParallelism).
+  setIterations(100).
+  setLocalIterations(100).
+  setRegularization(0.002).
+  setStepsize(0.1).
+  setSeed(0).
+  clearThreshold()
+
+val trainingDS = env.fromCollection(Classification.trainingData)
+
+val test = trainingDS.map(x => x.vector)
+
+svm.fit(trainingDS)
+
+val predictions: DataSet[(FlinkVector, Double)] = svm.predict(test)
+
+val preds = predictions.map(vectorLabel => vectorLabel._2).collect()
+
+preds.max should be > 1.0
--- End diff --

Better way to check raw values vs. 1/-1





[jira] [Assigned] (FLINK-2296) Checkpoint committing broken

2015-06-30 Thread Robert Metzger (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger reassigned FLINK-2296:
-

Assignee: Robert Metzger

 Checkpoint committing broken
 

 Key: FLINK-2296
 URL: https://issues.apache.org/jira/browse/FLINK-2296
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Robert Metzger
Assignee: Robert Metzger
Priority: Blocker

 While working on fixing the failing {{PersistentKafkaSource}} test, I 
 realized that the recent changes introduced in New operator state 
 interfaces https://github.com/apache/flink/pull/747 (sadly, there is no JIRA 
 for this huge change) introduced some changes that I was not aware of.
 * The {{CheckpointCoordinator}} is now sending the StateHandle back to the 
 TaskManager when confirming a checkpoint. For the non-FS case, this means 
 that for checkpoint committed operators, the state is sent twice over the 
 wire for each checkpoint. 
 For the FS case, this means that for every checkpoint commit, the state needs 
 to be retrieved from the file system.
 Did you conduct any tests on a cluster to measure the performance impact of 
 that?
 I see three approaches for fixing the aforementioned issue:
 - keep it this way (probably poor performance)
 - always keep the state for uncommitted checkpoints in the TaskManager's 
 memory. Therefore, we need to come up with a good eviction strategy. I don't 
 know the implications for large state.
 - change the interface and do not provide the state to the user function 
 (=old behavior). This forces users to think about how they want to keep the 
 state (but it is also a bit more work for them)
 I would like to get some feedback on how to solve this issue!
 Also, I discovered the following bugs:
 * Non-source tasks didn't get {{commitCheckpoint}} calls, even though they 
 implemented the {{CheckpointCommitter}} interface. I fixed this issue in my 
 current branch.
 * The state passed to the {{commitCheckpoint}} method did not match with the 
 subtask id. So user functions were receiving states from other parallel 
 instances. This led to faulty behavior in the KafkaSource (that's also the 
 reason why the KafkaITCase was failing more frequently ...). I fixed this 
 issue in my current branch.





[jira] [Created] (FLINK-2297) Add threshold setting for SVM binary predictions

2015-06-30 Thread Theodore Vasiloudis (JIRA)
Theodore Vasiloudis created FLINK-2297:
--

 Summary: Add threshold setting for SVM binary predictions
 Key: FLINK-2297
 URL: https://issues.apache.org/jira/browse/FLINK-2297
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor
 Fix For: 0.10


Currently SVM outputs the raw decision function values when using the predict 
function.

We should have instead the ability to set a threshold above which examples are 
labeled as positive (1.0) and below negative (-1.0). Then the prediction 
function can be directly used for evaluation.





[GitHub] flink pull request: [FLINK-2297] [ml] Add threshold setting for SV...

2015-06-30 Thread thvasilo
GitHub user thvasilo opened a pull request:

https://github.com/apache/flink/pull/874

[FLINK-2297] [ml] Add threshold setting for SVM binary predictions

Currently SVM outputs the raw decision function values when using the 
predict function.

We should have instead the ability to set a threshold above which examples 
are labeled as positive (1.0) and below negative (-1.0). Then the prediction 
function can be directly used for evaluation.

This provides that functionality, as well as the ability to provide the raw 
decision function values.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thvasilo/flink svm-threshold

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/874.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #874


commit 279f12a6031f9d7b73e6b69303ae44627df4c401
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-30T12:49:03Z

Added parameters argument to PredictOperation predict function.

commit 1cea7a9a71ba04b6a9433789c0787112c7a24084
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-30T12:59:57Z

Made the type parameter for Parameter to be covariant.

commit 00cf902d1481d698648abe6f037583d8d543fa53
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-30T13:00:49Z

Added Threshold option for SVM, to determine which predictions are 
positive/negative.






[jira] [Commented] (FLINK-2297) Add threshold setting for SVM binary predictions

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608253#comment-14608253
 ] 

ASF GitHub Bot commented on FLINK-2297:
---

GitHub user thvasilo opened a pull request:

https://github.com/apache/flink/pull/874

[FLINK-2297] [ml] Add threshold setting for SVM binary predictions

Currently SVM outputs the raw decision function values when using the 
predict function.

We should have instead the ability to set a threshold above which examples 
are labeled as positive (1.0) and below negative (-1.0). Then the prediction 
function can be directly used for evaluation.

This provides that functionality, as well as the ability to provide the raw 
decision function values.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thvasilo/flink svm-threshold

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/874.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #874


commit 279f12a6031f9d7b73e6b69303ae44627df4c401
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-30T12:49:03Z

Added parameters argument to PredictOperation predict function.

commit 1cea7a9a71ba04b6a9433789c0787112c7a24084
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-30T12:59:57Z

Made the type parameter for Parameter to be covariant.

commit 00cf902d1481d698648abe6f037583d8d543fa53
Author: Theodore Vasiloudis t...@sics.se
Date:   2015-06-30T13:00:49Z

Added Threshold option for SVM, to determine which predictions are 
positive/negative.




 Add threshold setting for SVM binary predictions
 

 Key: FLINK-2297
 URL: https://issues.apache.org/jira/browse/FLINK-2297
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor
  Labels: ML
 Fix For: 0.10


 Currently SVM outputs the raw decision function values when using the predict 
 function.
 We should have instead the ability to set a threshold above which examples 
 are labeled as positive (1.0) and below negative (-1.0). Then the prediction 
 function can be directly used for evaluation.





[jira] [Commented] (FLINK-2296) Checkpoint committing broken

2015-06-30 Thread Gyula Fora (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608255#comment-14608255
 ] 

Gyula Fora commented on FLINK-2296:
---

The reasons we made these changes for the checkpoint committing are the 
following:

Storing uncommitted changes with ids in the operator, like before, makes no 
sense at all. If the committing is actually important, this will not work, 
because if the operator fails before it receives the commit, the state to be 
committed is lost.

State committing usually involves committing some actual state, which you 
either store or provide a way to access. Since storing locally doesn't work, 
for the specific reasons I mentioned before, we need to send a handle to it.

Also, if we abstract the actual checkpointing away, like we do in the 
OperatorState implementations, the user is not aware of when the checkpointing 
happens, so he would need to mess around with the checkpointing logic as well.

Caching local states in the task managers could be an option to increase 
performance, I agree.
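The handle-plus-cache idea discussed in this thread can be sketched as follows. This is a hedged illustration with made-up names, not Flink's actual StateHandle or CheckpointCoordinator interfaces: the coordinator sends back only a lightweight handle, the task manager tries a local cache first, and the remote (file-system-like) store is consulted only on a miss, e.g. after a failure wiped the local copy.

```java
import java.util.HashMap;
import java.util.Map;

interface StateHandle { String retrieve(); }

class FsStateHandle implements StateHandle {
    final String path;
    final Map<String, String> backingStore; // stands in for a file system
    FsStateHandle(String path, Map<String, String> store) {
        this.path = path;
        this.backingStore = store;
    }
    public String retrieve() { return backingStore.get(path); }
}

public class CommitSketch {
    private final Map<String, String> localCache = new HashMap<>();
    int remoteReads = 0; // counts expensive fetches from the store

    void onCheckpoint(String path, String state, Map<String, String> store) {
        store.put(path, state);      // durable copy survives task failure
        localCache.put(path, state); // cheap local copy for the later commit
    }

    String onCommit(FsStateHandle handle) {
        String cached = localCache.get(handle.path);
        if (cached != null) return cached; // fast path: no remote read
        remoteReads++;
        return handle.retrieve();          // after a failure, cache is empty
    }

    public static void main(String[] args) {
        Map<String, String> fs = new HashMap<>();
        CommitSketch taskManager = new CommitSketch();
        taskManager.onCheckpoint("/ckpt/1", "offsets=42", fs);
        System.out.println(taskManager.onCommit(new FsStateHandle("/ckpt/1", fs)));
        System.out.println(taskManager.remoteReads); // 0: served from cache
    }
}
```

As noted above, the open question for such a cache is the eviction strategy for large state; correctness still relies on the durable copy behind the handle.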

 Checkpoint committing broken
 

 Key: FLINK-2296
 URL: https://issues.apache.org/jira/browse/FLINK-2296
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Robert Metzger
Assignee: Robert Metzger
Priority: Blocker

 While working on fixing the failing {{PersistentKafkaSource}} test, I 
 realized that the recent changes introduced in New operator state 
 interfaces https://github.com/apache/flink/pull/747 (sadly, there is no JIRA 
 for this huge change) introduced some changes that I was not aware of.
 * The {{CheckpointCooridnator}} is now sending the StateHandle back to the 
 TaskManager when confirming a checkpoint. For the non-FS case, this means 
 that for checkpoint committed operators, the state is send twice over the 
 wire for each checkpoint. 
 For the FS case, this means that for every checkpoint commit, the state needs 
 to be retrieved from the file system.
 Did you conduct any tests on a cluster to measure the performance impact of 
 that?
 I see three approaches for fixing the aforementioned issue:
 - keep it this way (probably poor performance)
 - always keep the state for uncommitted checkpoints in the TaskManager's 
 memory. Therefore, we need to come up with a good eviction strategy. I don't 
 know the implications for large state.
 - change the interface and do not provide the state to the user function 
 (=old behavior). This forces users to think about how they want to keep the 
 state (but it is also a bit more work for them)
 I would like to get some feedback on how to solve this issue!
 Also, I discovered the following bugs:
 * Non-source tasks didn't get {{commitCheckpoint}} calls, even though they 
 implemented the {{CheckpointCommitter}} interface. I fixed this issue in my 
 current branch.
 * The state passed to the {{commitCheckpoint}} method did not match with the 
 subtask id. So user functions were receiving states from other parallel 
 instances. This lead to faulty behavior in the KafkaSource (thats also the 
 reason why the KafkaITCase was failing more frequently ...). I fixed this 
 issue in my current branch.





[jira] [Updated] (FLINK-2296) Checkpoint committing broken

2015-06-30 Thread Robert Metzger (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Metzger updated FLINK-2296:
--
Description: 
While working on fixing the failing {{PersistentKafkaSource}} test, I realized 
that the recent changes introduced in New operator state interfaces 
https://github.com/apache/flink/pull/747 (sadly, there is no JIRA for this huge 
change) introduced some changes that I was not aware of.

* The {{CheckpointCoordinator}} is now sending the StateHandle back to the 
TaskManager when confirming a checkpoint. For the non-FS case, this means that 
for checkpoint committed operators, the state is sent twice over the wire for 
each checkpoint. 
For the FS case, this means that for every checkpoint commit, the state needs 
to be retrieved from the file system.
Did you conduct any tests on a cluster to measure the performance impact of 
that?

I see three approaches for fixing the aforementioned issue:
- keep it this way (probably poor performance)
- always keep the state for uncommitted checkpoints in the TaskManager's 
memory. Therefore, we need to come up with a good eviction strategy. I don't 
know the implications for large state.
- change the interface and do not provide the state to the user function (=old 
behavior). This forces users to think about how they want to keep the state 
(but it is also a bit more work for them)

I would like to get some feedback on how to solve this issue!


Also, I discovered the following bugs:
* Non-source tasks didn't get {{commitCheckpoint}} calls, even though they 
implemented the {{CheckpointCommitter}} interface. I fixed this issue in my 
current branch.
* The state passed to the {{commitCheckpoint}} method did not match with the 
subtask id. So user functions were receiving states from other parallel 
instances. This led to faulty behavior in the KafkaSource (that's also the 
reason why the KafkaITCase was failing more frequently ...). I fixed this issue 
in my current branch.


  was:
While working on fixing the failing {{PersistentKafkaSource}} test, I realized 
that the recent changes introduced in New operator state interfaces 
https://github.com/apache/flink/pull/747 (sadly, there is no JIRA for this huge 
change) introduced some changes that I was not aware of.

* The {{CheckpointCooridnator}} is now sending the StateHandle back to the 
TaskManager when confirming a checkpoint. For the non-FS case, this means that 
for checkpoint committed operators, the state is send twice over the wire for 
each checkpoint. 
For the FS case, this means that for every checkpoint commit, the state needs 
to be retrieved from the file system.
Did you conduct any tests on a cluster to measure the performance impact of 
that?

I see three approaches for fixing the aforementioned issue:
- keep it this way (probably poor performance)
- always keep the state for uncommitted checkpoints in the TaskManager's 
memory. Therefore, we need to come up with a good eviction strategy. I don't 
know the implications for large state.
- change the interface and do not provide the state to the user function (=old 
behavior). This forces users to think about how they want to keep the state 
(but it is also a bit more work for them)

I would like to get some feedback on how to solve this issue!


Also, I discovered the following bugs:
* Non-source tasks didn't get {{commitCheckpoint}} calls, even though they 
implemented the {{CheckpointCommitter}} interface. I fixed this issue in my 
current branch.
* The state passed to the {{commitCheckpoint}} method did not match with the 
subtask id. So user functions were receiving states from other parallel 
instances. This lead to faulty behavior in the KafkaSource (thats also the 
reason why the KafkaITCase was failing more frequently ...). I fixed this issue 
in my current branch.



 Checkpoint committing broken
 

 Key: FLINK-2296
 URL: https://issues.apache.org/jira/browse/FLINK-2296
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Robert Metzger
Assignee: Robert Metzger
Priority: Blocker

 While working on fixing the failing {{PersistentKafkaSource}} test, I 
 realized that the recent changes introduced in New operator state 
 interfaces https://github.com/apache/flink/pull/747 (sadly, there is no JIRA 
 for this huge change) introduced some changes that I was not aware of.
 * The {{CheckpointCoordinator}} is now sending the StateHandle back to the 
 TaskManager when confirming a checkpoint. For the non-FS case, this means 
 that for checkpoint committed operators, the state is send twice over the 
 wire for each checkpoint. 
 For the FS case, this means that for every checkpoint commit, the state needs 
 to be retrieved from the file system.
 Did you conduct any tests on a 

[GitHub] flink pull request: [FLINK-2297] [ml] Add threshold setting for SV...

2015-06-30 Thread thvasilo
Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/874#discussion_r33570815
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/classification/SVM.scala
 ---
@@ -242,8 +275,21 @@ object SVM{
 }
   }
 
-  override def predict(value: T, model: DenseVector): Double = {
-value.asBreeze dot model.asBreeze
+  override def predict(value: T, model: DenseVector, 
predictParameters: ParameterMap):
+Double = {
+val thresholdOption = predictParameters.get(Threshold)
+
+val rawValue = value.asBreeze dot model.asBreeze
+// If the Threshold option has been reset, we will get back a 
Some(None) thresholdOption
+// causing the exception when we try to get the value. In that 
case we just return the
+// raw value
+try {
--- End diff --

This part is a bit hacky and will have a performance impact. However, it 
allows us to do what we want without needing to change the way ParameterMap 
works, and with the default behaviour being that the thresholdOption is set, 
the exception should not be triggered most of the time.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2297) Add threshold setting for SVM binary predictions

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608259#comment-14608259
 ] 

ASF GitHub Bot commented on FLINK-2297:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/874#discussion_r33570815
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/classification/SVM.scala
 ---
@@ -242,8 +275,21 @@ object SVM{
 }
   }
 
-  override def predict(value: T, model: DenseVector): Double = {
-value.asBreeze dot model.asBreeze
+  override def predict(value: T, model: DenseVector, 
predictParameters: ParameterMap):
+Double = {
+val thresholdOption = predictParameters.get(Threshold)
+
+val rawValue = value.asBreeze dot model.asBreeze
+// If the Threshold option has been reset, we will get back a 
Some(None) thresholdOption
+// causing the exception when we try to get the value. In that 
case we just return the
+// raw value
+try {
--- End diff --

This part is a bit hacky and will have a performance impact. However, it 
allows us to do what we want without needing to change the way ParameterMap 
works, and with the default behaviour being that the thresholdOption is set, 
the exception should not be triggered most of the time.
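
An exception-free alternative (a sketch only; modelling the `Threshold` lookup as a plain `Option[Double]` is an assumption here, not the actual ParameterMap API) could pattern-match on the option instead of catching:

```scala
// Hypothetical sketch: thresholded SVM prediction without using an exception
// for the "threshold was reset" case. The Option models whether a threshold
// is currently set; this is not the real ParameterMap lookup.
object ThresholdedPredict {
  def predict(rawValue: Double, thresholdOpt: Option[Double]): Double =
    thresholdOpt match {
      case Some(threshold) => if (rawValue > threshold) 1.0 else -1.0 // binary label
      case None            => rawValue // threshold reset: return the raw decision value
    }
}
```

Whether this fits depends on how `ParameterMap.get` signals a reset value, which the try/catch in the diff works around.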


 Add threshold setting for SVM binary predictions
 

 Key: FLINK-2297
 URL: https://issues.apache.org/jira/browse/FLINK-2297
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Priority: Minor
  Labels: ML
 Fix For: 0.10


 Currently SVM outputs the raw decision function values when using the predict 
 function.
 We should have instead the ability to set a threshold above which examples 
 are labeled as positive (1.0) and below negative (-1.0). Then the prediction 
 function can be directly used for evaluation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-2296) Checkpoint committing broken

2015-06-30 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608264#comment-14608264
 ] 

Stephan Ewen commented on FLINK-2296:
-

I think there is a misunderstanding about what the commit messages currently 
are. They are currently a best-effort notification of completed checkpoints. 
There are no delivery guarantees in the presence of any failures after a 
completed checkpoint.

Also, any form of serious distributed commit activity would need a 2-phase or 
consensus protocol, so these messages are not really suitable for that.

Committing does not require the state in many cases. For many interactions with 
the outside world, the actual checkpointing would handle the state, and the 
post-commit notification would mainly require something like a transaction 
ID.

We may want to rename them from commit to notifyCheckpointComplete, though.

I think we have a classical case here of inadequate communication and 
description: both documentation of what the model of the commit messages is, 
and a clear description of the changes performed to the checkpointing 
mechanism.
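
As an illustration of the renaming suggested above (a hypothetical sketch; the actual Flink interfaces may differ), a notification that carries only the checkpoint id, which can double as a transaction ID, avoids shipping any state handle back:

```scala
// Hypothetical sketch of a notification-only interface: the operator learns
// which checkpoint completed, but no state is sent back over the wire.
trait CheckpointNotificationListener {
  def notifyCheckpointComplete(checkpointId: Long): Unit
}

// Example: a sink that commits an external transaction keyed by the id.
class CommitLoggingSink extends CheckpointNotificationListener {
  var lastCommitted: Long = -1L
  override def notifyCheckpointComplete(checkpointId: Long): Unit = {
    // here one would commit the pending external transaction for checkpointId
    lastCommitted = checkpointId
  }
}
```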

 Checkpoint committing broken
 

 Key: FLINK-2296
 URL: https://issues.apache.org/jira/browse/FLINK-2296
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Robert Metzger
Assignee: Robert Metzger
Priority: Blocker

 While working on fixing the failing {{PersistentKafkaSource}} test, I 
 realized that the recent changes introduced in New operator state 
 interfaces https://github.com/apache/flink/pull/747 (sadly, there is no JIRA 
 for this huge change) introduced some changes that I was not aware of.
 * The {{CheckpointCoordinator}} is now sending the StateHandle back to the 
 TaskManager when confirming a checkpoint. For the non-FS case, this means 
 that for checkpoint committed operators, the state is sent twice over the 
 wire for each checkpoint. 
 For the FS case, this means that for every checkpoint commit, the state needs 
 to be retrieved from the file system.
 Did you conduct any tests on a cluster to measure the performance impact of 
 that?
 I see three approaches for fixing the aforementioned issue:
 - keep it this way (probably poor performance)
 - always keep the state for uncommitted checkpoints in the TaskManager's 
 memory. Therefore, we need to come up with a good eviction strategy. I don't 
 know the implications for large state.
 - change the interface and do not provide the state to the user function 
 (=old behavior). This forces users to think about how they want to keep the 
 state (but it is also a bit more work for them)
 I would like to get some feedback on how to solve this issue!
 Also, I discovered the following bugs:
 * Non-source tasks didn't get {{commitCheckpoint}} calls, even though they 
 implemented the {{CheckpointCommitter}} interface. I fixed this issue in my 
 current branch.
 * The state passed to the {{commitCheckpoint}} method did not match with the 
 subtask id. So user functions were receiving states from other parallel 
 instances. This led to faulty behavior in the KafkaSource (that's also the 
 reason why the KafkaITCase was failing more frequently ...). I fixed this 
 issue in my current branch.





[GitHub] flink pull request: [FLINK-2157] [ml] [WIP] Create evaluation fram...

2015-06-30 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/871#issuecomment-117185135
  
Cherry picked the changes to Predictor and SVM from #874 




[jira] [Commented] (FLINK-2157) Create evaluation framework for ML library

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608270#comment-14608270
 ] 

ASF GitHub Bot commented on FLINK-2157:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/871#issuecomment-117185135
  
Cherry picked the changes to Predictor and SVM from #874 


 Create evaluation framework for ML library
 --

 Key: FLINK-2157
 URL: https://issues.apache.org/jira/browse/FLINK-2157
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
  Labels: ML
 Fix For: 0.10


 Currently, FlinkML lacks means to evaluate the performance of trained models. 
 It would be great to add some {{Evaluators}} which can calculate some score 
 based on the information about true and predicted labels. This could also be 
 used for the cross validation to choose the right hyper parameters.
 Possible scores could be F score [1], zero-one-loss score, etc.
 Resources
 [1] [http://en.wikipedia.org/wiki/F1_score]
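
For reference, the two scores mentioned can be computed from paired (true, predicted) labels as follows (a plain-Scala sketch only, not a proposed FlinkML API; all names are illustrative):

```scala
object ScoresSketch {
  // F1 = 2 * precision * recall / (precision + recall),
  // for binary labels encoded as 1.0 (positive) and -1.0 (negative).
  def f1Score(pairs: Seq[(Double, Double)]): Double = {
    val tp = pairs.count { case (t, p) => t == 1.0 && p == 1.0 }  // true positives
    val fp = pairs.count { case (t, p) => t == -1.0 && p == 1.0 } // false positives
    val fn = pairs.count { case (t, p) => t == 1.0 && p == -1.0 } // false negatives
    val precision = tp.toDouble / (tp + fp)
    val recall = tp.toDouble / (tp + fn)
    2 * precision * recall / (precision + recall)
  }

  // Zero-one loss: fraction of misclassified examples.
  def zeroOneLoss(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (t, p) => t != p }.toDouble / pairs.size
}
```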





[jira] [Created] (FLINK-2298) Allow setting custom YARN application names through the CLI

2015-06-30 Thread Robert Metzger (JIRA)
Robert Metzger created FLINK-2298:
-

 Summary: Allow setting custom YARN application names through the 
CLI
 Key: FLINK-2298
 URL: https://issues.apache.org/jira/browse/FLINK-2298
 Project: Flink
  Issue Type: Bug
  Components: YARN Client
Affects Versions: 0.10
Reporter: Robert Metzger
Assignee: Robert Metzger


A Flink user asked for adding an option to the YARN CLI frontend to provide a 
custom application name when starting Flink on YARN.





[jira] [Created] (FLINK-2299) The slot on which the task manager was scheduled was killed

2015-06-30 Thread Andra Lungu (JIRA)
Andra Lungu created FLINK-2299:
--

 Summary: The slot on which the task manager was scheduled was 
killed
 Key: FLINK-2299
 URL: https://issues.apache.org/jira/browse/FLINK-2299
 Project: Flink
  Issue Type: Bug
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Priority: Critical
 Fix For: 0.9.1


The following code: 
https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java

Ran on the twitter follower graph: 
http://twitter.mpi-sws.org/data-icwsm2010.html 

With a similar configuration to the one in FLINK-2293

fails with the following exception:
java.lang.Exception: The slot in which the task was executed has been released. 
Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 
slots - URL: akka.tcp://flink@130.149.249.35:56135/user/taskmanager
at 
org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
at 
org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
at 
org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
at 
org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
at 
org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at 
org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
at 
org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at 
org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at 
akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
at akka.actor.ActorCell.invoke(ActorCell.scala:486)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

06/29/2015 10:33:46 Job execution switched to status FAILING.

The logs are here:
https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing





[jira] [Commented] (FLINK-2299) The slot on which the task manager was scheduled was killed

2015-06-30 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608296#comment-14608296
 ] 

Robert Metzger commented on FLINK-2299:
---

Looks like wally025 broke up with the JobManager ;) 

{code}
11:08:48,028 WARN  akka.remote.ReliableDeliverySupervisor   
 - Association with remote system [akka.tcp://flink@130.149.249.11:6123] has 
failed, address is now gated for [5000] ms. Reason is: [Disassociated].
11:08:55,417 WARN  Remoting 
 - Tried to associate with unreachable remote address 
[akka.tcp://flink@130.149.249.11:6123]. Address is now gated for 5000 ms, all 
messages to this address will be delivered to dead letters. Reason: Connection 
refused: /130.149.249.11:6123
11:08:55,422 INFO  org.apache.flink.runtime.taskmanager.TaskManager 
 - Disconnecting from JobManager: JobManager is no longer reachable
11:08:55,509 INFO  org.apache.flink.runtime.taskmanager.TaskManager 
 - Disassociating from JobManager
{code}

Were there any network issues?

 The slot on which the task manager was scheduled was killed
 ---

 Key: FLINK-2299
 URL: https://issues.apache.org/jira/browse/FLINK-2299
 Project: Flink
  Issue Type: Bug
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Priority: Critical
 Fix For: 0.9.1


 The following code: 
 https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
 Ran on the twitter follower graph: 
 http://twitter.mpi-sws.org/data-icwsm2010.html 
 With a similar configuration to the one in FLINK-2293
 fails with the following exception:
 java.lang.Exception: The slot in which the task was executed has been 
 released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ 
 wally025 - 8 slots - URL: 
 akka.tcp://flink@130.149.249.35:56135/user/taskmanager
 at 
 org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
 at 
 org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
 at 
 org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
 at 
 org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
 at 
 org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
 at 
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at 
 org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at 
 akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
 at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
 at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
 at akka.actor.ActorCell.invoke(ActorCell.scala:486)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 06/29/2015 10:33:46 Job execution switched to status FAILING.
 The logs are here:
 https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing





[GitHub] flink pull request: [FLINK-1731] [ml] Implementation of Feature K-...

2015-06-30 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117194517
  
What we would like to see actually is this PR and #757 to be merged into 
one, so that we can review them as a whole. @sachingoel0101 do you think you 
will be able to do that?




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608302#comment-14608302
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117194517
  
What we would like to see actually is this PR and #757 to be merged into 
one, so that we can review them as a whole. @sachingoel0101 do you think you 
will be able to do that?


 Add kMeans clustering algorithm to machine learning library
 ---

 Key: FLINK-1731
 URL: https://issues.apache.org/jira/browse/FLINK-1731
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Peter Schrott
  Labels: ML

 The Flink repository already contains a kMeans implementation but it is not 
 yet ported to the machine learning library. I assume that only the used data 
 types have to be adapted and then it can be more or less directly moved to 
 flink-ml.
 The kMeans++ [1] and kMeans|| [2] algorithms constitute a better 
 approach because they improve the initial seeding phase to achieve 
 near-optimal clustering. It might be worthwhile to implement kMeans||.
 Resources:
 [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
 [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
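
As a non-distributed illustration of the kMeans++ seeding described in [1] (a plain-Scala sketch under the assumption of an in-memory point set, not the DataSet API): each new center is sampled with probability proportional to the squared distance to the nearest center chosen so far.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // kMeans++ seeding: first center uniform at random; each further center
  // sampled proportionally to the squared distance to its nearest center.
  // Assumes the points are not all identical.
  def seed(points: IndexedSeq[Point], k: Int, rnd: Random): IndexedSeq[Point] = {
    var centers = IndexedSeq(points(rnd.nextInt(points.size)))
    while (centers.size < k) {
      val d2 = points.map(p => centers.map(c => sqDist(p, c)).min)
      var r = rnd.nextDouble() * d2.sum // sample against the unnormalized weights
      var i = 0
      while (i < points.size - 1 && r > d2(i)) { r -= d2(i); i += 1 }
      centers = centers :+ points(i)
    }
    centers
  }
}
```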





[jira] [Commented] (FLINK-2299) The slot on which the task manager was scheduled was killed

2015-06-30 Thread Andra Lungu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608304#comment-14608304
 ] 

Andra Lungu commented on FLINK-2299:


No, there was no problem with the network. I believe the issue is more complex 
than that, unfortunately :(. And it is reproducible; the network cannot crash 
that often...

 The slot on which the task manager was scheduled was killed
 ---

 Key: FLINK-2299
 URL: https://issues.apache.org/jira/browse/FLINK-2299
 Project: Flink
  Issue Type: Bug
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Priority: Critical
 Fix For: 0.9.1


 The following code: 
 https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
 Ran on the twitter follower graph: 
 http://twitter.mpi-sws.org/data-icwsm2010.html 
 With a similar configuration to the one in FLINK-2293
 fails with the following exception:
 java.lang.Exception: The slot in which the task was executed has been 
 released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ 
 wally025 - 8 slots - URL: 
 akka.tcp://flink@130.149.249.35:56135/user/taskmanager
 at 
 org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
 at 
 org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
 at 
 org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
 at 
 org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
 at 
 org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
 at 
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at 
 org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at 
 akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
 at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
 at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
 at akka.actor.ActorCell.invoke(ActorCell.scala:486)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 06/29/2015 10:33:46 Job execution switched to status FAILING.
 The logs are here:
 https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing





[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608308#comment-14608308
 ] 

ASF GitHub Bot commented on FLINK-1745:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/696#issuecomment-117195363
  
Thanks @chiwanpark . I don't think I'll have the time to review this week, 
I'll set a reminder for Monday.


 Add exact k-nearest-neighbours algorithm to machine learning library
 

 Key: FLINK-1745
 URL: https://issues.apache.org/jira/browse/FLINK-1745
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Till Rohrmann
  Labels: ML, Starter

 Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, 
 it is still used as a means to classify data and to do regression. This issue 
 focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as 
 proposed in [2].
 Could be a starter task.
 Resources:
 [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm]
 [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]
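
For intuition only, a brute-force exact kNN classifier can be sketched in a few lines (plain Scala; this is not the distributed H-BNLJ/H-BRJ formulation from [2], and all names are illustrative):

```scala
object KnnSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Exact kNN classification: majority vote over the k nearest labelled points.
  def classify(train: Seq[(Point, Int)], query: Point, k: Int): Int =
    train
      .sortBy { case (p, _) => sqDist(p, query) } // O(n log n) per query
      .take(k)
      .groupBy { case (_, label) => label }
      .maxBy { case (_, group) => group.size }
      ._1
}
```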





[jira] [Comment Edited] (FLINK-2299) The slot on which the task manager was scheduled was killed

2015-06-30 Thread Andra Lungu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608304#comment-14608304
 ] 

Andra Lungu edited comment on FLINK-2299 at 6/30/15 2:00 PM:
-

No, there was no problem with the network. I believe the issue is more complex 
than that, unfortunately :(. And it is reproducible; the network cannot crash 
that often...

wally025 died (the problem is on the TM side IMO) and then I guess it broke the 
JM...
And others have experienced a similar issue (on different clusters) recently.


was (Author: andralungu):
No, there was no problem with the network. I believe the issue is more complex 
than that, unfortunately :(. And it is reproducible; the network cannot crash 
that often...

 The slot on which the task manager was scheduled was killed
 ---

 Key: FLINK-2299
 URL: https://issues.apache.org/jira/browse/FLINK-2299
 Project: Flink
  Issue Type: Bug
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Priority: Critical
 Fix For: 0.9.1


 The following code: 
 https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
 Ran on the twitter follower graph: 
 http://twitter.mpi-sws.org/data-icwsm2010.html 
 With a similar configuration to the one in FLINK-2293
 fails with the following exception:
 java.lang.Exception: The slot in which the task was executed has been 
 released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ 
 wally025 - 8 slots - URL: 
 akka.tcp://flink@130.149.249.35:56135/user/taskmanager
 at 
 org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
 at 
 org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
 at 
 org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
 at 
 org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
 at 
 org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
 at 
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at 
 org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at 
 akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
 at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
 at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
 at akka.actor.ActorCell.invoke(ActorCell.scala:486)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 06/29/2015 10:33:46 Job execution switched to status FAILING.
 The logs are here:
 https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing





[jira] [Commented] (FLINK-2157) Create evaluation framework for ML library

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608346#comment-14608346
 ] 

ASF GitHub Bot commented on FLINK-2157:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/871#issuecomment-117202580
  
I believe that the current code covers the scope of the linked issue. We 
can now start reviewing the relevant changes and continue with code cleanup to 
bring this to a merge-able state.

Once cleanup is done, a few more evaluation scores should be added before 
merging.


 Create evaluation framework for ML library
 --

 Key: FLINK-2157
 URL: https://issues.apache.org/jira/browse/FLINK-2157
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
  Labels: ML
 Fix For: 0.10


 Currently, FlinkML lacks means to evaluate the performance of trained models. 
 It would be great to add some {{Evaluators}} which can calculate some score 
 based on the information about true and predicted labels. This could also be 
 used for the cross validation to choose the right hyper parameters.
 Possible scores could be F score [1], zero-one-loss score, etc.
 Resources
 [1] [http://en.wikipedia.org/wiki/F1_score]





[GitHub] flink pull request: [FLINK-2157] [ml] [WIP] Create evaluation fram...

2015-06-30 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/871#issuecomment-117202580
  
I believe that the current code covers the scope of the linked issue. We 
can now start reviewing the relevant changes and continue with code cleanup to 
bring this to a merge-able state.

Once cleanup is done, a few more evaluation scores should be added before 
merging.




[jira] [Updated] (FLINK-2111) Add stop signal to cleanly shutdown streaming jobs

2015-06-30 Thread Matthias J. Sax (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated FLINK-2111:
---
Summary: Add stop signal to cleanly shutdown streaming jobs  (was: Add 
stop signal to cleanly stop streaming jobs)

 Add stop signal to cleanly shutdown streaming jobs
 

 Key: FLINK-2111
 URL: https://issues.apache.org/jira/browse/FLINK-2111
 Project: Flink
  Issue Type: Improvement
  Components: Distributed Runtime, JobManager, Local Runtime, 
 Streaming, TaskManager, Webfrontend
Reporter: Matthias J. Sax
Assignee: Matthias J. Sax
Priority: Minor

 Currently, streaming jobs can only be stopped using the cancel command, which 
 is a hard stop with no clean shutdown.
 The newly introduced terminate signal will only affect streaming source 
 tasks such that the sources can stop emitting data and terminate cleanly, 
 resulting in a clean termination of the whole streaming job.
 This feature is a prerequisite for 
 https://issues.apache.org/jira/browse/FLINK-1929





[jira] [Commented] (FLINK-2296) Checkpoint committing broken

2015-06-30 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608411#comment-14608411
 ] 

Stephan Ewen commented on FLINK-2296:
-

+1 to do that so the current state is robust and efficient

If we decide to add the state handle back, this should go together with a 
design how to avoid repeated state transfers.

 Checkpoint committing broken
 

 Key: FLINK-2296
 URL: https://issues.apache.org/jira/browse/FLINK-2296
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Robert Metzger
Assignee: Robert Metzger
Priority: Blocker

 While working on fixing the failing {{PersistentKafkaSource}} test, I 
 realized that the recent changes introduced in New operator state 
 interfaces https://github.com/apache/flink/pull/747 (sadly, there is no JIRA 
 for this huge change) introduced some changes that I was not aware of.
 * The {{CheckpointCoordinator}} is now sending the StateHandle back to the 
 TaskManager when confirming a checkpoint. For the non-FS case, this means 
 that for checkpoint-committed operators, the state is sent twice over the 
 wire for each checkpoint. 
 For the FS case, this means that for every checkpoint commit, the state needs 
 to be retrieved from the file system.
 Did you conduct any tests on a cluster to measure the performance impact of 
 that?
 I see three approaches for fixing the aforementioned issue:
 - keep it this way (probably poor performance)
 - always keep the state for uncommitted checkpoints in the TaskManager's 
 memory. Therefore, we need to come up with a good eviction strategy. I don't 
 know the implications for large state.
 - change the interface and do not provide the state to the user function 
 (=old behavior). This forces users to think about how they want to keep the 
 state (but it is also a bit more work for them)
 I would like to get some feedback on how to solve this issue!
 Also, I discovered the following bugs:
 * Non-source tasks didn't get {{commitCheckpoint}} calls, even though they 
 implemented the {{CheckpointCommitter}} interface. I fixed this issue in my 
 current branch.
 * The state passed to the {{commitCheckpoint}} method did not match the 
 subtask id. So user functions were receiving states from other parallel 
 instances. This led to faulty behavior in the KafkaSource (that's also the 
 reason why the KafkaITCase was failing more frequently ...). I fixed this 
 issue in my current branch.





[GitHub] flink pull request: [storm-contib] added named attribute access to...

2015-06-30 Thread mjsax
GitHub user mjsax opened a pull request:

https://github.com/apache/flink/pull/875

[storm-contib] added named attribute access to Storm compatibility layer

  - added support for named attribute access
  - extended FlinkTopologyBuilder to handle declared output schemas
  - adapted JUnit test
  - added new examples and ITCases
  - updated README

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mjsax/flink flink-storm-compatibility

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/875.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #875


commit 1f2b7afe172eef8b4664ccc42449ab8a999bd582
Author: mjsax mj...@informatik.hu-berlin.de
Date:   2015-06-29T15:50:07Z

Storm compatibility improvement
  - added support for named attribute access
  - extended FlinkTopologyBuilder to handle declared output schemas
  - adapted JUnit test
  - added new examples and ITCases
  - updated README






[GitHub] flink pull request: [FLINK-2131][ml]: Initialization schemes for k...

2015-06-30 Thread sachingoel0101
Github user sachingoel0101 closed the pull request at:

https://github.com/apache/flink/pull/757




[jira] [Commented] (FLINK-2296) Checkpoint committing broken

2015-06-30 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608360#comment-14608360
 ] 

Robert Metzger commented on FLINK-2296:
---

+1 for renaming the method to {{notifyCheckpointComplete}}!

+1 for removing the {{StateHandleSerializable state}} argument from the 
commitCheckpoint/notifyCheckpointComplete method.

If nobody disagrees within 24 hours, I will rename the method and remove the 
argument.
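[Editor's note] The agreed shape of the callback can be sketched as follows. Only the method name notifyCheckpointComplete comes from this thread; the interface name and everything else here are hypothetical. The point of the change is that the completion notification carries only the checkpoint id, so no state handle is shipped back over the wire (or re-read from the file system) on commit.

```java
// Sketch of the proposed notification interface: the completion callback
// receives only the checkpoint id, not the state, avoiding the double
// state transfer described in the issue. Interface name is hypothetical.
public interface CheckpointListenerSketch {
    /** Called once the checkpoint with the given id has completed globally. */
    void notifyCheckpointComplete(long checkpointId);
}
```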



[jira] [Updated] (FLINK-2111) Add stop signal to cleanly shutdown streaming jobs

2015-06-30 Thread Matthias J. Sax (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax updated FLINK-2111:
---
Description: 
Currently, streaming jobs can only be stopped using the cancel command, which 
is a hard stop with no clean shutdown.

The newly introduced stop signal will only affect streaming source tasks, so 
that the sources can stop emitting data and shut down cleanly, resulting in a 
clean shutdown of the whole streaming job.

This feature is a prerequisite for 
https://issues.apache.org/jira/browse/FLINK-1929

  was:
Currently, streaming jobs can only be stopped using cancel command, what is a 
hard stop with no clean shutdown.

The new introduced terminate signal, will only affect streaming source tasks 
such that the sources can stop emitting data and terminate cleanly, resulting 
in a clean termination of the whole streaming job.

This feature is a pre-requirment for 
https://issues.apache.org/jira/browse/FLINK-1929




[jira] [Created] (FLINK-2301) In BarrierBuffer newer Barriers trigger old Checkpoints

2015-06-30 Thread Aljoscha Krettek (JIRA)
Aljoscha Krettek created FLINK-2301:
---

 Summary: In BarrierBuffer newer Barriers trigger old Checkpoints
 Key: FLINK-2301
 URL: https://issues.apache.org/jira/browse/FLINK-2301
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Reporter: Aljoscha Krettek
Assignee: Aljoscha Krettek


When the BarrierBuffer has some inputs blocked on barrier 0 and then receives 
barriers for checkpoint 1 on the other inputs, it processes the checkpoint with 
id 0.

I think the BarrierBuffer should drop all previous pending checkpoints when it 
receives a barrier from a more recent checkpoint and unblock the previously 
blocked channels. This will make it ready to correctly react to the other 
barriers of the newer checkpoint. It should also ignore barriers that arrive 
late, when a more recent checkpoint has already been processed.
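[Editor's note] The proposed rule can be modeled as a small sketch of barrier alignment: a newer barrier supersedes the pending older checkpoint, and barriers older than the latest pending checkpoint are ignored. This is an illustrative single-class model, not the actual BarrierBuffer implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative model of the proposed alignment rule: track one pending
// checkpoint; a barrier from a newer checkpoint drops the old pending one
// and unblocks its channels; late barriers for superseded checkpoints are
// ignored. Triggers a checkpoint only when all channels have aligned on it.
public class BarrierAlignmentSketch {
    private long pendingCheckpoint = -1;
    private final Set<Integer> blockedChannels = new HashSet<>();
    private final int numChannels;

    public BarrierAlignmentSketch(int numChannels) {
        this.numChannels = numChannels;
    }

    /** Returns the id of the checkpoint triggered by this barrier, or -1. */
    public long onBarrier(long checkpointId, int channel) {
        if (checkpointId < pendingCheckpoint) {
            return -1; // late barrier from a superseded checkpoint: ignore
        }
        if (checkpointId > pendingCheckpoint) {
            // newer checkpoint arrived: drop the old one, unblock its channels
            blockedChannels.clear();
            pendingCheckpoint = checkpointId;
        }
        blockedChannels.add(channel);
        if (blockedChannels.size() == numChannels) {
            blockedChannels.clear();
            return checkpointId; // all barriers aligned: trigger this checkpoint
        }
        return -1;
    }
}
```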





[GitHub] flink pull request: [FLINK-2131][ml]: Initialization schemes for k...

2015-06-30 Thread sachingoel0101
Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117220314
  
Hey @thvasilo , I'm going to break up this PR further. The motivation is 
that the sampling code should be available as a general feature: given a 
probability distribution over the data, a user should be able to sample as 
many points as they want.

The Sampler will take as input the DataSet, the number of samples required, 
and a function which determines the relative probability of a particular 
element being picked, in addition to specifying whether the elements should 
be sampled with or without replacement. 
Let me know your thoughts. I'll work out a version in the meantime. If this 
is desirable, I will file a JIRA ticket and open a separate PR.
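[Editor's note] The proposed sampler can be sketched on a single machine as follows. As noted elsewhere in the thread, the relative weights need not be normalized to [0, 1]: drawing a uniform value in [0, total) against the unnormalized cumulative sum suffices. Class and method names are hypothetical; the real feature would operate on a distributed DataSet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Single-machine sketch of weighted sampling with replacement: build the
// cumulative sum of the (unnormalized) relative weights, then binary-search
// a uniform draw in [0, total). Names are illustrative only.
public class WeightedSamplerSketch {
    public static <T> List<T> sampleWithReplacement(
            List<T> data, ToDoubleFunction<T> weight, int n, Random rnd) {
        double[] cumulative = new double[data.size()];
        double total = 0.0;
        for (int i = 0; i < data.size(); i++) {
            total += weight.applyAsDouble(data.get(i));
            cumulative[i] = total;
        }
        List<T> samples = new ArrayList<>(n);
        for (int s = 0; s < n; s++) {
            double r = rnd.nextDouble() * total;
            int lo = 0, hi = cumulative.length - 1;
            while (lo < hi) { // find first index with cumulative[i] > r
                int mid = (lo + hi) / 2;
                if (cumulative[mid] > r) { hi = mid; } else { lo = mid + 1; }
            }
            samples.add(data.get(lo));
        }
        return samples;
    }
}
```

Sampling without replacement would additionally remove each picked element's weight from the cumulative structure before the next draw.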




[GitHub] flink pull request: [FLINK-2131][ml]: Initialization schemes for k...

2015-06-30 Thread sachingoel0101
GitHub user sachingoel0101 reopened a pull request:

https://github.com/apache/flink/pull/757

[FLINK-2131][ml]: Initialization schemes for k-means clustering

This adds the two most common initialization strategies for the k-means 
clustering algorithm: random initialization and kmeans++ initialization.
Further details are at https://issues.apache.org/jira/browse/FLINK-2131
[Edit]: Work on kmeans|| has been started and just needs to be finalized.
[Edit]: kmeans|| implementation finished. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachingoel0101/flink 
clustering_initializations

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/757.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #757


commit dc2de88bf5e3148bb116cad607fc3c61d9dceac6
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-02T06:44:30Z

Random and kmeans++ initialization methods added

commit 4a39a19c1425259c71ac6d922b4d9a9f2e7d1c6e
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-02T15:42:58Z

Merge https://github.com/apache/flink into clustering_initializations

commit cdbb3a0801d364935d455798c695f4615ae74e76
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-02T19:49:24Z

Merge https://github.com/apache/flink into clustering_initializations

commit 7496e21462e4efc0813450971ae6cbc94d2b2c15
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-02T22:41:20Z

Initialization costs of random and kmeans++ added

commit 8033c87b71686bd3955281db12583592549406cb
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-05T21:54:10Z

Merge https://github.com/apache/flink into clustering_initializations

commit 29ed1d3fb31aa038d6ed1a5bf16d58f19565cdf8
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-05T22:52:02Z

Removed cost parameter from Algorithm itself. Leaving it to the user for 
now. Also added support for weighted input data sets

commit 5286c3c21d5019f6ba8ab67c2074570087bc1b3a
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-06T05:04:55Z

An initial draft of kmeans-par method

commit f3bfad4fc0c6576af14f1e981f8e778445856355
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-08T10:36:32Z

All three initialization schemes implemented and tested

commit 8496b8fd627ade8dbe7b92949d35d3cce704f1cc
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-08T10:36:58Z

Merge https://github.com/apache/flink into clustering_initializations

commit 3765a3e6a77a8bdbac21d03be1c43263925b1495
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-30T08:57:41Z

Merge remote-tracking branch 'upstream/master' into 
clustering_initializations






[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608434#comment-14608434
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117220314
  
Hey @thvasilo , I'm going to break up this PR further. The motivation is 
that the sampling code should be available as a general feature: given a 
probability distribution over the data, a user should be able to sample as 
many points as they want.

The Sampler will take as input the DataSet, the number of samples required, 
and a function which determines the relative probability of a particular 
element being picked, in addition to specifying whether the elements should 
be sampled with or without replacement. 
Let me know your thoughts. I'll work out a version in the meantime. If this 
is desirable, I will file a JIRA ticket and open a separate PR.


 Add Initialization schemes for K-means clustering
 -

 Key: FLINK-2131
 URL: https://issues.apache.org/jira/browse/FLINK-2131
 Project: Flink
  Issue Type: Task
  Components: Machine Learning Library
Reporter: Sachin Goel
Assignee: Sachin Goel

 The Lloyd's [KMeans] algorithm takes initial centroids as its input. However, 
 in case the user doesn't provide the initial centers, they may ask for a 
 particular initialization scheme to be followed. The most commonly used are 
 these:
 1. Random initialization: Self-explanatory
 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
 For very large data sets, or for large values of k, the kmeans|| method is 
 preferred, as it provides the same approximation guarantees as kmeans++ and 
 requires fewer passes over the input data.
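[Editor's note] For reference, a minimal single-machine sketch of kmeans++ seeding (D² sampling) as described in the linked paper: the first center is chosen uniformly, and each subsequent center is drawn with probability proportional to its squared distance to the nearest already-chosen center. This is illustrative only, not the FlinkML implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative kmeans++ (D^2) seeding: pick the first center uniformly,
// then repeatedly sample a point with probability proportional to its
// squared distance from the nearest chosen center. Unnormalized weights
// suffice; we draw against their running sum directly.
public class KMeansPlusPlusSketch {
    public static List<double[]> chooseCenters(List<double[]> points, int k, Random rnd) {
        List<double[]> centers = new ArrayList<>(k);
        centers.add(points.get(rnd.nextInt(points.size())));
        while (centers.size() < k) {
            double[] d2 = new double[points.size()];
            double total = 0.0;
            for (int i = 0; i < points.size(); i++) {
                d2[i] = nearestSquaredDistance(points.get(i), centers);
                total += d2[i];
            }
            double r = rnd.nextDouble() * total;
            int chosen = 0;
            for (int i = 0; i < d2.length; i++) {
                r -= d2[i];
                if (r <= 0) { chosen = i; break; }
            }
            centers.add(points.get(chosen));
        }
        return centers;
    }

    private static double nearestSquaredDistance(double[] p, List<double[]> centers) {
        double best = Double.MAX_VALUE;
        for (double[] c : centers) {
            double d = 0.0;
            for (int j = 0; j < p.length; j++) d += (p[j] - c[j]) * (p[j] - c[j]);
            best = Math.min(best, d);
        }
        return best;
    }
}
```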





[jira] [Created] (FLINK-2300) Links on FAQ page not rendered correctly

2015-06-30 Thread Robert Metzger (JIRA)
Robert Metzger created FLINK-2300:
-

 Summary: Links on FAQ page not rendered correctly
 Key: FLINK-2300
 URL: https://issues.apache.org/jira/browse/FLINK-2300
 Project: Flink
  Issue Type: Bug
  Components: Project Website
Reporter: Robert Metzger
Priority: Minor


On the Flink website, the links using the github plugin are broken.

For example
{code}
{% github README.md master build instructions %}
{code}
renders to
{code}
https://github.com/apache/flink/tree/master/README.md
{code}
See:
http://flink.apache.org/faq.html#my-job-fails-early-with-a-javaioeofexception-what-could-be-the-cause

I was not able to resolve the issue by using {{a}} tags or markdown links.





[jira] [Comment Edited] (FLINK-2108) Add score function for Predictors

2015-06-30 Thread Theodore Vasiloudis (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608336#comment-14608336
 ] 

Theodore Vasiloudis edited comment on FLINK-2108 at 6/30/15 2:07 PM:
-

The score function can now be considered feature complete, as presented in the 
linked PR: https://github.com/apache/flink/pull/871


was (Author: tvas):
The for the score function can now be considered feature complete, as presented 
in the linked PR: https://github.com/apache/flink/pull/871

 Add score function for Predictors
 -

 Key: FLINK-2108
 URL: https://issues.apache.org/jira/browse/FLINK-2108
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
Priority: Minor
  Labels: ML

 A score function for Predictor implementations should take a DataSet[(I, O)] 
 and an (optional) scoring measure and return a score.
 The DataSet[(I, O)] would probably be the output of the predict function.
 For example in MultipleLinearRegression, we can call predict on a labeled 
 dataset, get back predictions for each item in the data, and then call score 
 with the resulting dataset as an argument and we should get back a score for 
 the prediction quality, such as the R^2 score.
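[Editor's note] A minimal sketch of the R² measure mentioned above, computed over arrays of true and predicted values. The class and method names are hypothetical, not the proposed FlinkML API.

```java
// R^2 (coefficient of determination) over paired true/predicted values:
// 1 - (residual sum of squares) / (total sum of squares). A perfect fit
// gives 1.0; predicting the mean everywhere gives 0.0.
public class RSquaredSketch {
    public static double rSquared(double[] truth, double[] predicted) {
        double mean = 0.0;
        for (double y : truth) mean += y;
        mean /= truth.length;
        double ssRes = 0.0, ssTot = 0.0;
        for (int i = 0; i < truth.length; i++) {
            ssRes += (truth[i] - predicted[i]) * (truth[i] - predicted[i]);
            ssTot += (truth[i] - mean) * (truth[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }
}
```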





[jira] [Commented] (FLINK-2299) The slot on which the task manager was scheduled was killed

2015-06-30 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608412#comment-14608412
 ] 

Stephan Ewen commented on FLINK-2299:
-

We are getting more reports where this happens. We need to figure out:

 - whether these are results of garbage collection stalls. In that case, there 
is not much we could do easily, but we have to think about a long-term solution.

 - or whether this is a problem with the Akka remote watching. In that case, we 
should revert to doing the heartbeats ourselves.

 The slot on which the task manager was scheduled was killed
 ---

 Key: FLINK-2299
 URL: https://issues.apache.org/jira/browse/FLINK-2299
 Project: Flink
  Issue Type: Bug
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Priority: Critical
 Fix For: 0.9.1


 The following code: 
 https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
 Ran on the twitter follower graph: 
 http://twitter.mpi-sws.org/data-icwsm2010.html 
 With a similar configuration to the one in FLINK-2293
 fails with the following exception:
 java.lang.Exception: The slot in which the task was executed has been 
 released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ 
 wally025 - 8 slots - URL: 
 akka.tcp://flink@130.149.249.35:56135/user/taskmanager
 at 
 org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
 at 
 org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
 at 
 org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
 at 
 org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
 at 
 org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
 at 
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at 
 org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at 
 akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
 at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
 at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
 at akka.actor.ActorCell.invoke(ActorCell.scala:486)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 06/29/2015 10:33:46 Job execution switched to status FAILING.
 The logs are here:
 https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing





[jira] [Commented] (FLINK-2296) Checkpoint committing broken

2015-06-30 Thread Ufuk Celebi (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608397#comment-14608397
 ] 

Ufuk Celebi commented on FLINK-2296:


+1 to the renaming.



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608437#comment-14608437
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

GitHub user sachingoel0101 reopened a pull request:

https://github.com/apache/flink/pull/757

[FLINK-2131][ml]: Initialization schemes for k-means clustering

This adds the two most common initialization strategies for the k-means 
clustering algorithm: random initialization and kmeans++ initialization.
Further details are at https://issues.apache.org/jira/browse/FLINK-2131
[Edit]: Work on kmeans|| has been started and just needs to be finalized.
[Edit]: kmeans|| implementation finished. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachingoel0101/flink 
clustering_initializations

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/757.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #757




[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608435#comment-14608435
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 closed the pull request at:

https://github.com/apache/flink/pull/757




[jira] [Commented] (FLINK-2108) Add score function for Predictors

2015-06-30 Thread Theodore Vasiloudis (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608336#comment-14608336
 ] 

Theodore Vasiloudis commented on FLINK-2108:


The score function can now be considered feature complete, as presented 
in the linked PR: https://github.com/apache/flink/pull/871

 Add score function for Predictors
 -

 Key: FLINK-2108
 URL: https://issues.apache.org/jira/browse/FLINK-2108
 Project: Flink
  Issue Type: Improvement
  Components: Machine Learning Library
Reporter: Theodore Vasiloudis
Assignee: Theodore Vasiloudis
Priority: Minor
  Labels: ML

 A score function for Predictor implementations should take a DataSet[(I, O)] 
 and an (optional) scoring measure and return a score.
 The DataSet[(I, O)] would probably be the output of the predict function.
 For example in MultipleLinearRegression, we can call predict on a labeled 
 dataset, get back predictions for each item in the data, and then call score 
 with the resulting dataset as an argument and we should get back a score for 
 the prediction quality, such as the R^2 score.
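 A minimal sketch of the R^2 computation such a score function could perform, 
 assuming plain arrays of true labels and predictions rather than a DataSet 
 (names here are illustrative, not FlinkML's API):

```java
public class RSquared {

    // R^2 = 1 - SS_res / SS_tot over pairs of (true label, prediction).
    static double rSquared(double[] truth, double[] predictions) {
        double mean = 0.0;
        for (double y : truth) {
            mean += y;
        }
        mean /= truth.length;

        double ssRes = 0.0; // residual sum of squares
        double ssTot = 0.0; // total sum of squares
        for (int i = 0; i < truth.length; i++) {
            double residual = truth[i] - predictions[i];
            double deviation = truth[i] - mean;
            ssRes += residual * residual;
            ssTot += deviation * deviation;
        }
        return 1.0 - ssRes / ssTot;
    }

    public static void main(String[] args) {
        double[] truth = {1.0, 2.0, 3.0};
        System.out.println(rSquared(truth, new double[]{1.0, 2.0, 3.0})); // 1.0: perfect fit
        System.out.println(rSquared(truth, new double[]{2.0, 2.0, 2.0})); // 0.0: predicting the mean
    }
}
```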





[jira] [Commented] (FLINK-2296) Checkpoint committing broken

2015-06-30 Thread Gyula Fora (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608364#comment-14608364
 ] 

Gyula Fora commented on FLINK-2296:
---

+1

 Checkpoint committing broken
 

 Key: FLINK-2296
 URL: https://issues.apache.org/jira/browse/FLINK-2296
 Project: Flink
  Issue Type: Bug
  Components: Streaming
Affects Versions: 0.10
Reporter: Robert Metzger
Assignee: Robert Metzger
Priority: Blocker

 While working on fixing the failing {{PersistentKafkaSource}} test, I 
 realized that the recent changes introduced in "New operator state 
 interfaces" (https://github.com/apache/flink/pull/747; sadly, there is no JIRA 
 for this huge change) brought in some changes that I was not aware of.
 * The {{CheckpointCoordinator}} is now sending the StateHandle back to the 
 TaskManager when confirming a checkpoint. For the non-FS case, this means 
 that for checkpoint-committed operators, the state is sent twice over the 
 wire for each checkpoint. 
 For the FS case, this means that for every checkpoint commit, the state needs 
 to be retrieved from the file system.
 Did you conduct any tests on a cluster to measure the performance impact of 
 that?
 I see three approaches for fixing the aforementioned issue:
 - keep it this way (probably poor performance)
 - always keep the state for uncommitted checkpoints in the TaskManager's 
 memory. Therefore, we need to come up with a good eviction strategy. I don't 
 know the implications for large state.
 - change the interface and do not provide the state to the user function 
 (=old behavior). This forces users to think about how they want to keep the 
 state (but it is also a bit more work for them)
 I would like to get some feedback on how to solve this issue!
 Also, I discovered the following bugs:
 * Non-source tasks didn't get {{commitCheckpoint}} calls, even though they 
 implemented the {{CheckpointCommitter}} interface. I fixed this issue in my 
 current branch.
 * The state passed to the {{commitCheckpoint}} method did not match with the 
 subtask id. So user functions were receiving states from other parallel 
 instances. This led to faulty behavior in the KafkaSource (that's also the 
 reason why the KafkaITCase was failing more frequently ...). I fixed this 
 issue in my current branch.





[jira] [Comment Edited] (FLINK-2296) Checkpoint committing broken

2015-06-30 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608360#comment-14608360
 ] 

Robert Metzger edited comment on FLINK-2296 at 6/30/15 2:22 PM:


+1 for renaming the method to {{notifyCheckpointComplete}}!

+1 for removing the {{StateHandleSerializable state}} argument from the 
commitCheckpoint/notifyCheckpointComplete method.

If nobody disagrees within 24 hours, I will rename the method and remove the 
argument. (still, I'm asking for some +1 or comments on my suggestions)


was (Author: rmetzger):
+1 for renaming the method to {{notifyCheckpointComplete}}!

+1 for removing the {{StateHandleSerializable state}} argument from the 
commitCheckpoint/notifyCheckpointComplete method.

If nobody disagrees within 24 hours, I will rename the method and remove the 
argument.






[jira] [Assigned] (FLINK-2157) Create evaluation framework for ML library

2015-06-30 Thread Theodore Vasiloudis (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Theodore Vasiloudis reassigned FLINK-2157:
--

Assignee: Theodore Vasiloudis

 Create evaluation framework for ML library
 --

 Key: FLINK-2157
 URL: https://issues.apache.org/jira/browse/FLINK-2157
 Project: Flink
  Issue Type: New Feature
  Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Theodore Vasiloudis
  Labels: ML
 Fix For: 0.10


 Currently, FlinkML lacks means to evaluate the performance of trained models. 
 It would be great to add some {{Evaluators}} which can calculate a score 
 based on the information about true and predicted labels. This could also be 
 used for cross validation to choose the right hyperparameters.
 Possible scores could be F score [1], zero-one-loss score, etc.
 Resources
 [1] [http://en.wikipedia.org/wiki/F1_score]
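 As a concrete reference for one of the proposed scores, a minimal F1 
 computation for binary labels might look like this (a sketch with made-up 
 names, not a proposed FlinkML interface):

```java
public class F1Score {

    // F1 = 2 * precision * recall / (precision + recall) for binary 0/1 labels.
    static double f1(int[] truth, int[] predictions) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < truth.length; i++) {
            if (predictions[i] == 1 && truth[i] == 1) {
                tp++;
            } else if (predictions[i] == 1 && truth[i] == 0) {
                fp++;
            } else if (predictions[i] == 0 && truth[i] == 1) {
                fn++;
            }
        }
        double precision = tp / (double) (tp + fp);
        double recall = tp / (double) (tp + fn);
        return 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        int[] truth = {1, 1, 0, 0};
        int[] predicted = {1, 0, 1, 0};
        // tp = 1, fp = 1, fn = 1, so precision = recall = 0.5 and F1 = 0.5.
        System.out.println(f1(truth, predicted)); // 0.5
    }
}
```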





[jira] [Commented] (FLINK-2293) Division by Zero Exception

2015-06-30 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608419#comment-14608419
 ] 

Stephan Ewen commented on FLINK-2293:
-

[~fhueske] and I have located this and figured out a fix. I'll try to push a 
fix later today.

The issue occurs only in low-memory settings. You may want to break up the job 
anyway, to reduce memory fragmentation.

 Division by Zero Exception
 --

 Key: FLINK-2293
 URL: https://issues.apache.org/jira/browse/FLINK-2293
 Project: Flink
  Issue Type: Bug
  Components: Local Runtime
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Priority: Critical
 Fix For: 0.9.1


 I am basically running an algorithm that simulates a Gather Sum Apply 
 Iteration that performs Triangle Count (Why simulate it? Because you just 
 need a superstep - useless overhead if you use the runGatherSumApply 
 function in Graph).
 What happens, at a high level:
 1). Select neighbors with ID greater than the one corresponding to the 
 current vertex;
 2). Propagate the received values to neighbors with higher ID;
 3). Compute the number of triangles by checking whether
 trgVertex.getValue().get(srcVertex.getId());
 As you can see, I *do not* perform any division at all;
 code is here: 
 https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
 Now for small graphs, 50MB max, the computation finishes nicely with the 
 correct result. For a 10GB graph, however, I got this:
 java.lang.ArithmeticException: / by zero
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.insertIntoTable(MutableHashTable.java:836)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:819)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508)
 at 
 org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:544)
 at 
 org.apache.flink.runtime.operators.hash.NonReusingBuildFirstHashMatchIterator.callWithNextKey(NonReusingBuildFirstHashMatchIterator.java:104)
 at 
 org.apache.flink.runtime.operators.MatchDriver.run(MatchDriver.java:173)
 at 
 org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
 at 
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
 at java.lang.Thread.run(Thread.java:722)
 see the full log here: https://gist.github.com/andralungu/984774f6348269df7951





[GitHub] flink pull request: [FLINK-2298] Allow setting a custom applicatio...

2015-06-30 Thread hsaputra
Github user hsaputra commented on a diff in the pull request:

https://github.com/apache/flink/pull/876#discussion_r33604736
  
--- Diff: 
flink-yarn/src/main/java/org/apache/flink/yarn/FlinkYarnClient.java ---
@@ -591,14 +592,17 @@ protected AbstractFlinkYarnCluster 
deployInternal(String clusterName) throws Exc
capability.setMemory(jobManagerMemoryMb);
capability.setVirtualCores(1);
 
-   if(clusterName == null) {
-   clusterName = "Flink session with "+taskManagerCount+" TaskManagers";
+   if(defaultName == null) {
+   defaultName = "Flink session with "+taskManagerCount+" TaskManagers";
}
if(detached) {
-   clusterName += " (detached)";
+   defaultName += " (detached)";
}
 
-   appContext.setApplicationName(clusterName); // application name
+   if(this.customName != null) {
--- End diff --

Should probably check for an empty or blank name, I suppose.
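
A hypothetical helper illustrating the suggested check (not the actual 
FlinkYarnClient code): treat null, empty, and whitespace-only names as 
unset and fall back to the default.

```java
public class NameCheck {

    // A null, empty, or whitespace-only custom name falls back to the default.
    static String resolveName(String customName, String defaultName) {
        if (customName == null || customName.trim().isEmpty()) {
            return defaultName;
        }
        return customName;
    }

    public static void main(String[] args) {
        System.out.println(resolveName("   ", "Flink session")); // prints "Flink session"
        System.out.println(resolveName("MyApp", "Flink session")); // prints "MyApp"
    }
}
```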


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (FLINK-2299) The slot on which the task manager was scheduled was killed

2015-06-30 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608810#comment-14608810
 ] 

Stephan Ewen commented on FLINK-2299:
-

[~andralungu] - from some LOG debugging, it seems that the following happens:

The JobManager decides that the TaskManager is unreachable at 07:46:45
At that time, the TaskManager was executing the {{GroupReduce at 
main(TriangleCount.java:51)}}. Can you check what this GroupReduce does? Does 
it create a list of elements that becomes very large? This would lead to long 
GC pauses and could cause the TaskManager to not respond to the JobManager's 
heartbeats.
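
To illustrate the concern, compare a group-reduce that materializes the whole 
group in a list against one that folds incrementally; the sketch below uses 
plain Java iterables rather than Flink's API, and the names are my own:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GroupReduceMemory {

    // Materializing variant: holds every element of the group on the heap
    // at once; for a very large group this is what triggers long GC pauses.
    static long sumByList(Iterable<Long> group) {
        List<Long> all = new ArrayList<>();
        for (Long value : group) {
            all.add(value);
        }
        long sum = 0;
        for (Long value : all) {
            sum += value;
        }
        return sum;
    }

    // Incremental variant: keeps only the running aggregate, so memory use
    // is constant regardless of the group size.
    static long sumIncremental(Iterable<Long> group) {
        long sum = 0;
        for (Long value : group) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> group = Arrays.asList(1L, 2L, 3L, 4L);
        System.out.println(sumByList(group) == sumIncremental(group)); // true
    }
}
```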

 The slot on which the task manager was scheduled was killed
 ---

 Key: FLINK-2299
 URL: https://issues.apache.org/jira/browse/FLINK-2299
 Project: Flink
  Issue Type: Bug
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Priority: Critical
 Fix For: 0.9.1


 The following code: 
 https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
 Ran on the twitter follower graph: 
 http://twitter.mpi-sws.org/data-icwsm2010.html 
 With a similar configuration to the one in FLINK-2293
 fails with the following exception:
 java.lang.Exception: The slot in which the task was executed has been 
 released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ 
 wally025 - 8 slots - URL: 
 akka.tcp://flink@130.149.249.35:56135/user/taskmanager
 at 
 org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
 at 
 org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
 at 
 org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
 at 
 org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
 at 
 org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
 at 
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at 
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at 
 org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at 
 org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at 
 akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
 at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
 at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
 at akka.actor.ActorCell.invoke(ActorCell.scala:486)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 06/29/2015 10:33:46 Job execution switched to status FAILING.
 The logs are here:
 https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing





[jira] [Commented] (FLINK-2299) The slot on which the task manager was scheduled was killed

2015-06-30 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608732#comment-14608732
 ] 

Stephan Ewen commented on FLINK-2299:
-

Aside from the issue that you see, I think you have a problem with data skew in 
the join.

From the logs, it looks like you may be seeing heavy collisions in keys (one key on 
the build side occurring very often).

Can you look into prior ML discussions for hints on how to deal with that?






[GitHub] flink pull request: [FLINK-2298] Allow setting a custom applicatio...

2015-06-30 Thread hsaputra
Github user hsaputra commented on a diff in the pull request:

https://github.com/apache/flink/pull/876#discussion_r33604696
  
--- Diff: 
flink-yarn/src/main/java/org/apache/flink/yarn/FlinkYarnClient.java ---
@@ -341,7 +342,7 @@ public AbstractFlinkYarnCluster run() throws Exception {
 * This method will block until the ApplicationMaster/JobManager have 
been
 * deployed on YARN.
 */
-   protected AbstractFlinkYarnCluster deployInternal(String clusterName) 
throws Exception {
+   protected AbstractFlinkYarnCluster deployInternal(String defaultName) 
throws Exception {
--- End diff --

Since this becomes the default, maybe we could just remove the input argument 
and set the default internally if customName is null or empty.




  1   2   >