[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117049680

Further, the probability distribution doesn't need to be scaled down to [0,1]. We just take care of that while building the cumulative distribution.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
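The point being made can be illustrated with a small sketch (an illustration only, not the PR's Scala code): sampling from unnormalized weights only requires a uniform draw in [0, total) against the running cumulative sums, so no prior scaling to [0,1] is needed.

```python
import bisect
import itertools
import random

def sample_index(weights, rng=random.random):
    """Draw an index i with probability weights[i] / sum(weights).

    The weights need not sum to 1: we build the running (cumulative)
    totals and draw a uniform value in [0, total) instead.
    """
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    u = rng() * total
    # bisect_right finds the first cumulative total strictly above u
    return bisect.bisect_right(cumulative, u)
```

With weights [1, 3], index 1 should come up roughly three times as often as index 0, even though the weights were never normalized.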
[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering
[ https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607912#comment-14607912 ]

ASF GitHub Bot commented on FLINK-2131:
---------------------------------------

Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117049680

Further, the probability distribution doesn't need to be scaled down to [0,1]. We just take care of that while building the cumulative distribution.

Add Initialization schemes for K-means clustering
--------------------------------------------------

Key: FLINK-2131
URL: https://issues.apache.org/jira/browse/FLINK-2131
Project: Flink
Issue Type: Task
Components: Machine Learning Library
Reporter: Sachin Goel
Assignee: Sachin Goel

Lloyd's [KMeans] algorithm takes initial centroids as its input. However, in case the user doesn't provide the initial centers, they may ask for a particular initialization scheme to be followed. The most commonly used are these:
1. Random initialization: self-explanatory.
2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
3. kmeans||: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
For very large data sets, or for large values of k, the kmeans|| method is preferred, as it provides the same approximation guarantees as kmeans++ while requiring fewer passes over the input data.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
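For readers unfamiliar with the schemes listed in the issue, here is a minimal single-machine sketch of k-means++ seeding (an illustration of the idea in the cited paper, not the Flink implementation): each new center is sampled with probability proportional to the squared distance to the nearest center chosen so far.

```python
import random

def squared_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng=None):
    """Pick k initial centers following the k-means++ scheme."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # D(x)^2: squared distance of every point to its nearest center
        d2 = [min(squared_dist(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0:  # all points coincide with existing centers
            centers.append(rng.choice(points))
            continue
        # Sample the next center proportionally to D(x)^2.  Note the
        # weights are used unnormalized, via their cumulative sum.
        u = rng.random() * total
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if u < acc:
                centers.append(p)
                break
    return centers
```

kmeans|| follows the same weighting idea but oversamples several candidates per pass, which is what makes it need fewer passes over distributed data.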
[jira] [Commented] (FLINK-2293) Division by Zero Exception
[ https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607965#comment-14607965 ]

Andra Lungu commented on FLINK-2293:
------------------------------------

Also reproducible with https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/NodeSplittingGSAJaccard.java; the error is identical. The input DataSet is SNAP's Orkut graph: http://snap.stanford.edu/data/com-Orkut.html

I am running on 30 wally nodes; well, one is down, so 29 (-p224). The flink.conf file should also be useful: https://gist.github.com/andralungu/3338bbd01ce61e0ce43d

Let me know if I can help you with anything else. Even if it's a memory issue [as is often the case], the error is at least misleading...

Division by Zero Exception
--------------------------

Key: FLINK-2293
URL: https://issues.apache.org/jira/browse/FLINK-2293
Project: Flink
Issue Type: Bug
Components: Local Runtime
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Priority: Critical
Fix For: 0.9.1

I am basically running an algorithm that simulates a Gather-Sum-Apply iteration that performs Triangle Count. (Why simulate it? Because you just need one superstep; using the runGatherSumApply function in Graph adds useless overhead.) What happens, at a high level:
1) Select neighbors with an ID greater than that of the current vertex;
2) Propagate the received values to neighbors with higher ID;
3) Compute the number of triangles by checking trgVertex.getValue().get(srcVertex.getId()).
As you can see, I *do not* perform any division at all; the code is here: https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
Now for small graphs, 50MB max, the computation finishes nicely with the correct result.
For a 10GB graph, however, I got this:

java.lang.ArithmeticException: / by zero
	at org.apache.flink.runtime.operators.hash.MutableHashTable.insertIntoTable(MutableHashTable.java:836)
	at org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:819)
	at org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508)
	at org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:544)
	at org.apache.flink.runtime.operators.hash.NonReusingBuildFirstHashMatchIterator.callWithNextKey(NonReusingBuildFirstHashMatchIterator.java:104)
	at org.apache.flink.runtime.operators.MatchDriver.run(MatchDriver.java:173)
	at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
	at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
	at java.lang.Thread.run(Thread.java:722)

See the full log here: https://gist.github.com/andralungu/984774f6348269df7951
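For reference, the three steps the reporter describes (keep only higher-ID neighbors, forward them one hop, check for the closing edge) are the classic trick for counting each triangle exactly once. A toy single-machine sketch (not the Gelly job) looks like this:

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs.

    Each vertex keeps only neighbors with a larger ID, so every
    triangle is counted exactly once, at its lowest-ID vertex.
    """
    higher = defaultdict(set)
    for u, v in edges:
        lo, hi = min(u, v), max(u, v)
        higher[lo].add(hi)
    triangles = 0
    for u, nbrs in higher.items():
        for v in nbrs:
            # Propagate u's higher-ID neighbors to v and check which
            # of them v is also connected to (the closing edges).
            triangles += len(nbrs & higher.get(v, set()))
    return triangles
```

Note there is indeed no division anywhere in the algorithm itself; the exception in the report comes out of the hash-join internals, not the user code.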
[jira] [Commented] (FLINK-2105) Implement Sort-Merge Outer Join algorithm
[ https://issues.apache.org/jira/browse/FLINK-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607966#comment-14607966 ]

Fabian Hueske commented on FLINK-2105:
--------------------------------------

[~chiwanpark] is right. This issue is about the implementation of an {{OuterJoinMergeIterator}} with very good test coverage. The integration with Drivers and Operators is part of FLINK-687.

Implement Sort-Merge Outer Join algorithm
-----------------------------------------

Key: FLINK-2105
URL: https://issues.apache.org/jira/browse/FLINK-2105
Project: Flink
Issue Type: Sub-task
Components: Local Runtime
Reporter: Fabian Hueske
Assignee: Ricky Pogalz
Priority: Minor
Fix For: pre-apache

Flink does not natively support outer joins at the moment. This issue proposes to implement a sort-merge outer join algorithm that can cover left, right, and full outer joins.
The implementation can be based on the regular sort-merge join iterators ({{ReusingMergeMatchIterator}} and {{NonReusingMergeMatchIterator}}; see also the {{MatchDriver}} class). The Reusing and NonReusing variants differ in whether object instances are reused or new objects are created. I would start with the NonReusing variant, which is safer from a user's point of view and should also be easier to implement.
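As background on what an outer-join merge iterator has to do, here is a toy sketch of merging two key-sorted inputs (a single-machine illustration with one value per key for brevity; the real iterator must also cross-join groups of duplicate keys):

```python
def sort_merge_outer_join(left, right, join_type="full"):
    """Merge two lists of (key, value) pairs, both sorted by key.

    Emits (key, left_value, right_value), with None filled in on the
    side that has no match, as in a left/right/full outer join.
    """
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:                      # match on both sides
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
        elif lk < rk:                     # left key has no right match
            if join_type in ("left", "full"):
                out.append((lk, left[i][1], None))
            i += 1
        else:                             # right key has no left match
            if join_type in ("right", "full"):
                out.append((rk, None, right[j][1]))
            j += 1
    # one input is exhausted; flush the other side's remainder
    if join_type in ("left", "full"):
        out.extend((k, v, None) for k, v in left[i:])
    if join_type in ("right", "full"):
        out.extend((k, None, v) for k, v in right[j:])
    return out
```

The left/right/full variants differ only in which unmatched side gets emitted, which is why one merge iterator can cover all three.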
[jira] [Assigned] (FLINK-1573) Add per-job metrics to flink.
[ https://issues.apache.org/jira/browse/FLINK-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maximilian Michels reassigned FLINK-1573:
-----------------------------------------

Assignee: Maximilian Michels (was: Robert Metzger)

Add per-job metrics to flink.
-----------------------------

Key: FLINK-1573
URL: https://issues.apache.org/jira/browse/FLINK-1573
Project: Flink
Issue Type: Sub-task
Components: JobManager, TaskManager
Affects Versions: 0.9
Reporter: Robert Metzger
Assignee: Maximilian Michels

With FLINK-1501, we have JVM-specific metrics (mainly monitoring the TMs). With this task, I would like to add metrics which are job-specific.
[jira] [Created] (FLINK-2293) Division by Zero Exception
Andra Lungu created FLINK-2293:
-------------------------------

Summary: Division by Zero Exception
Key: FLINK-2293
URL: https://issues.apache.org/jira/browse/FLINK-2293
Project: Flink
Issue Type: Bug
Components: Distributed Runtime
Affects Versions: 0.10
Reporter: Andra Lungu
[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117054526

OK, thanks for the explanation. I will look at this PR this week hopefully.
[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering
[ https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607946#comment-14607946 ]

ASF GitHub Bot commented on FLINK-2131:
---------------------------------------

Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117054526

OK, thanks for the explanation. I will look at this PR this week hopefully.
[jira] [Updated] (FLINK-2293) Division by Zero Exception
[ https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabian Hueske updated FLINK-2293:
---------------------------------

Priority: Critical (was: Major)
[jira] [Created] (FLINK-2292) Report accumulators periodically while job is running
Maximilian Michels created FLINK-2292:
--------------------------------------

Summary: Report accumulators periodically while job is running
Key: FLINK-2292
URL: https://issues.apache.org/jira/browse/FLINK-2292
Project: Flink
Issue Type: Sub-task
Components: JobManager, TaskManager
Reporter: Maximilian Michels
Assignee: Maximilian Michels
Fix For: 0.10

Accumulators should be sent periodically, as part of the heartbeat that sends metrics. This allows them to be updated in real time.
[jira] [Updated] (FLINK-2293) Division by Zero Exception
[ https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabian Hueske updated FLINK-2293:
---------------------------------

Component/s: Local Runtime (was: Distributed Runtime)
[jira] [Updated] (FLINK-2293) Division by Zero Exception
[ https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabian Hueske updated FLINK-2293:
---------------------------------

Affects Version/s: 0.9
[jira] [Updated] (FLINK-2293) Division by Zero Exception
[ https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fabian Hueske updated FLINK-2293:
---------------------------------

Fix Version/s: 0.9.1
[GitHub] flink pull request: [Flink-1999] basic TfidfTransformer
Github user rbraeunlich commented on a diff in the pull request: https://github.com/apache/flink/pull/730#discussion_r33551797

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/feature/TfIdfTransformer.scala ---
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.feature
+
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.ml.common.{Parameter, ParameterMap, Transformer}
+import org.apache.flink.ml.math.SparseVector
+
+import scala.collection.mutable.LinkedHashSet
+import scala.math.log
+
+/**
+ * This transformer calculates the term frequency times inverse document frequency for the
+ * given DataSet of documents. The DataSet will be treated as the corpus of documents. The
+ * single words will be filtered against the regex:
+ * <code>(?u)\b\w\w+\b</code>
+ * <p>
+ * The TF is the frequency of a word inside one document.
+ * <p>
+ * The IDF for a word is calculated as: log(total number of documents / documents that contain the word) + 1
+ * <p>
+ * This transformer returns a SparseVector where the index is the hash of the word and the
+ * value is the tf-idf.
+ * @author Ronny Bräunlich
+ * @author Vassil Dimov
+ * @author Filip Perisic
+ */
+class TfIdfTransformer extends Transformer[(Int, Seq[String]), (Int, SparseVector)] {
+
+  override def transform(input: DataSet[(Int /* docId */, Seq[String] /* the document */)], transformParameters: ParameterMap): DataSet[(Int, SparseVector)] = {
--- End diff --

We had some discussion about the type in [FLINK-1999](https://issues.apache.org/jira/browse/FLINK-1999) and since there were no objections we stuck with Seq[String]. With Seq[String] the caller has to decide how to split their text.
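The formula described in the scaladoc above (IDF = log(total documents / documents containing the word) + 1, TF = the word's frequency inside one document) can be checked with a short sketch. This is a plain illustration of the formula, keyed by the word itself rather than the transformer's word hash, and using the raw count as TF:

```python
from collections import Counter
from math import log

def tf_idf(documents):
    """documents: list of token sequences.

    Returns one dict per document, mapping token -> tf * idf, with
    idf = log(n_docs / document_frequency) + 1 as in the scaladoc.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # each document counts a word once
    result = []
    for doc in documents:
        tf = Counter(doc)  # raw term count within this document
        result.append({w: tf[w] * (log(n_docs / doc_freq[w]) + 1)
                       for w in tf})
    return result
```

The "+ 1" term keeps words that occur in every document (IDF base of log(1) = 0) from being zeroed out entirely.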
[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117042891

Hello Sachin, could you explain what the discrete sampler does?
[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering
[ https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607916#comment-14607916 ]

ASF GitHub Bot commented on FLINK-2131:
---------------------------------------

Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117051023

Sorry about the formatting though. I'll fix it. I haven't worked on this in a while. I'll incorporate your suggestions from the previous PR.
[GitHub] flink pull request: [FLINK-1731] [ml] Implementation of Feature K-...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/700#issuecomment-117053448

Hello @peedeeX21. The API does not deal with distributed models at the moment. In the K-means case, having the model distributed is overkill, as it is highly unlikely that you will have 1000 centroids; the model is tiny, and distributing it actually creates unnecessary overhead. We can keep the current implementation, but in the future we should really test against a non-distributed model, which can be broadcast in a DataSet[Seq[LabeledVector]], and compare performance.

Also, could you add an evaluate operation (EvaluateDataSetOperation) for Kmeans (and a corresponding test)? It would be parametrized as EvaluateDataSetOperation[Kmeans, Vector, Double].
[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607941#comment-14607941 ]

ASF GitHub Bot commented on FLINK-1731:
---------------------------------------

Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/700#issuecomment-117053448

Hello @peedeeX21. The API does not deal with distributed models at the moment. In the K-means case, having the model distributed is overkill, as it is highly unlikely that you will have 1000 centroids; the model is tiny, and distributing it actually creates unnecessary overhead. We can keep the current implementation, but in the future we should really test against a non-distributed model, which can be broadcast in a DataSet[Seq[LabeledVector]], and compare performance. Also, could you add an evaluate operation (EvaluateDataSetOperation) for Kmeans (and a corresponding test)? It would be parametrized as EvaluateDataSetOperation[Kmeans, Vector, Double].

Add kMeans clustering algorithm to machine learning library
------------------------------------------------------------

Key: FLINK-1731
URL: https://issues.apache.org/jira/browse/FLINK-1731
Project: Flink
Issue Type: New Feature
Components: Machine Learning Library
Reporter: Till Rohrmann
Assignee: Peter Schrott
Labels: ML

The Flink repository already contains a kMeans implementation, but it is not yet ported to the machine learning library. I assume that only the used data types have to be adapted and then it can be more or less directly moved to flink-ml.
The kMeans++ [1] and kMeans|| [2] algorithms constitute a better implementation because they improve the initial seeding phase to achieve near-optimal clustering. It might be worthwhile to implement kMeans||.
Resources:
[1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
[2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
[jira] [Commented] (FLINK-2293) Division by Zero Exception
[ https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607950#comment-14607950 ]

Fabian Hueske commented on FLINK-2293:
--------------------------------------

Is it possible to share the 10GB input data (download link)? Can you also share a few details about your execution setup (local machine, cluster, #machines, #slots, amount of memory, ...) that can help to reproduce the problem? Thanks!
[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607961#comment-14607961 ] ASF GitHub Bot commented on FLINK-1745: --- Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/696#issuecomment-117059705 Hi, I updated this PR. I reimplemented the kNN join with `zipWithIndex` and adapted it to the changed pipeline architecture. Add exact k-nearest-neighbours algorithm to machine learning library Key: FLINK-1745 URL: https://issues.apache.org/jira/browse/FLINK-1745 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Till Rohrmann Labels: ML, Starter Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, it is still used as a means to classify data and to do regression. This issue focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as proposed in [2]. Could be a starter task. Resources: [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf]
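The per-partition core that the H-BNLJ/H-BRJ schemes in [2] parallelize is a plain exact kNN search. A minimal sketch in plain Java (illustrative code, not the PR's Flink implementation; names are made up):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Exact kNN: for a query point, keep the k training points with the
// smallest squared Euclidean distance.
class ExactKnn {
    static double sqDist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Indices of the k nearest training points, closest first.
    static List<Integer> kNearest(double[][] train, double[] query, int k) {
        // Max-heap on distance holds the k best candidates seen so far.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble((double[] p) -> -p[0]));
        for (int i = 0; i < train.length; i++) {
            double d = sqDist(train[i], query);
            if (heap.size() < k) {
                heap.add(new double[]{d, i});
            } else if (d < heap.peek()[0]) {
                heap.poll(); // evict the current farthest candidate
                heap.add(new double[]{d, i});
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(0, (int) heap.poll()[1]); // farthest pops first
        }
        return result;
    }
}
```

The distributed variants shard the training set and run this same search per block, then reduce the per-block candidate lists down to a global top-k.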
[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering
[ https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607895#comment-14607895 ] ASF GitHub Bot commented on FLINK-2131: --- Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117042891 Hello Sachin, could you explain what the discrete sampler does? Add Initialization schemes for K-means clustering - Key: FLINK-2131 URL: https://issues.apache.org/jira/browse/FLINK-2131 Project: Flink Issue Type: Task Components: Machine Learning Library Reporter: Sachin Goel Assignee: Sachin Goel Lloyd's [KMeans] algorithm takes initial centroids as its input. However, in case the user doesn't provide the initial centers, they may ask for a particular initialization scheme to be followed. The most commonly used are these: 1. Random initialization: Self-explanatory 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf For very large data sets, or for large values of k, the kmeans|| method is preferred as it provides the same approximation guarantees as kmeans++ and requires fewer passes over the input data.
[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering
[ https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607907#comment-14607907 ] ASF GitHub Bot commented on FLINK-2131: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117047575 Hi @thvasilo, thanks for taking the time to go through it. Consider for example a probability distribution P(X_0) = 0.2, P(X_1) = 0.3, P(X_2) = 0.5. To sample an element out of X_0, X_1 and X_2, we can generate a random number, but we need to map intervals of real numbers to the values X_0, X_1 and X_2. This is what the discreteSampler does. It forms a cumulative distribution as [0.2, 0.5, 1.0] and then, if the generated random number is in [0, 0.2), we pick X_0, and so on. Add Initialization schemes for K-means clustering - Key: FLINK-2131 URL: https://issues.apache.org/jira/browse/FLINK-2131 Project: Flink Issue Type: Task Components: Machine Learning Library Reporter: Sachin Goel Assignee: Sachin Goel Lloyd's [KMeans] algorithm takes initial centroids as its input. However, in case the user doesn't provide the initial centers, they may ask for a particular initialization scheme to be followed. The most commonly used are these: 1. Random initialization: Self-explanatory 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf For very large data sets, or for large values of k, the kmeans|| method is preferred as it provides the same approximation guarantees as kmeans++ and requires fewer passes over the input data.
[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117047575 Hi @thvasilo, thanks for taking the time to go through it. Consider for example a probability distribution P(X_0) = 0.2, P(X_1) = 0.3, P(X_2) = 0.5. To sample an element out of X_0, X_1 and X_2, we can generate a random number, but we need to map intervals of real numbers to the values X_0, X_1 and X_2. This is what the discreteSampler does. It forms a cumulative distribution as [0.2, 0.5, 1.0] and then, if the generated random number is in [0, 0.2), we pick X_0, and so on. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
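The cumulative-distribution scheme described in the comment above can be sketched in a few lines (illustrative Java, not the PR's actual discreteSampler; as noted elsewhere in the thread, the weights need not be pre-scaled to [0, 1], because drawing from [0, total) handles unnormalized weights):

```java
import java.util.Arrays;

// Sketch of discrete (inverse-CDF) sampling over weighted items.
class DiscreteSampler {
    // Build the cumulative distribution, e.g. [0.2, 0.3, 0.5] -> [0.2, 0.5, 1.0].
    // Weights need not sum to 1: the caller draws u from [0, cdf[last]).
    static double[] cumulative(double[] weights) {
        double[] cdf = new double[weights.length];
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i];
            cdf[i] = sum;
        }
        return cdf;
    }

    // Map a draw u to the index of the first cdf entry strictly greater
    // than u: u in [0, 0.2) -> 0, [0.2, 0.5) -> 1, [0.5, 1.0) -> 2.
    static int sample(double[] cdf, double u) {
        int pos = Arrays.binarySearch(cdf, u);
        // Exact hit at cdf[i] belongs to the next interval; a miss returns
        // -(insertionPoint) - 1, and insertionPoint is the index we want.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }
}
```

The binary search keeps each draw at O(log n) instead of the O(n) linear scan over the cumulative array.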
[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117051023 Sorry about the formatting though. I'll fix it. I haven't worked on this in a while. I'll incorporate your suggestions from the previous PR.
[jira] [Updated] (FLINK-766) Web interface progress monitoring for DataSources and DataSinks with splits
[ https://issues.apache.org/jira/browse/FLINK-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-766: - Issue Type: Sub-task (was: Improvement) Parent: FLINK-456 Web interface progress monitoring for DataSources and DataSinks with splits --- Key: FLINK-766 URL: https://issues.apache.org/jira/browse/FLINK-766 Project: Flink Issue Type: Sub-task Reporter: GitHub Import Priority: Minor Labels: github-import Fix For: pre-apache The progress monitoring for DataSources and DataSinks can be improved by including the number of processed vs total splits into the progress. Imported from GitHub Url: https://github.com/stratosphere/stratosphere/issues/766 Created by: [rmetzger|https://github.com/rmetzger] Labels: enhancement, gui, simple-issue, Milestone: Release 0.6 (unplanned) Created at: Wed May 07 12:05:54 CEST 2014 State: open -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering
[ https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607954#comment-14607954 ] ASF GitHub Bot commented on FLINK-2131: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117055846 Okay. I'll update it today itself with a few trivial fixes. Add Initialization schemes for K-means clustering - Key: FLINK-2131 URL: https://issues.apache.org/jira/browse/FLINK-2131 Project: Flink Issue Type: Task Components: Machine Learning Library Reporter: Sachin Goel Assignee: Sachin Goel Lloyd's [KMeans] algorithm takes initial centroids as its input. However, in case the user doesn't provide the initial centers, they may ask for a particular initialization scheme to be followed. The most commonly used are these: 1. Random initialization: Self-explanatory 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf For very large data sets, or for large values of k, the kmeans|| method is preferred as it provides the same approximation guarantees as kmeans++ and requires fewer passes over the input data.
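The kmeans++ seeding referenced in the issue picks each new center with probability proportional to D(x)^2, the squared distance from x to the nearest center chosen so far, which is exactly where a weighted discrete sampler is needed. A hedged plain-Java sketch of that seeding step (illustrative, not the PR's code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// kmeans++ seeding (D^2 weighting) per Arthur & Vassilvitskii's paper
// linked in the issue; a sequential sketch, not the PR's implementation.
class KMeansPP {
    static List<double[]> seed(double[][] points, int k, Random rnd) {
        List<double[]> centers = new ArrayList<>();
        centers.add(points[rnd.nextInt(points.length)]); // first center: uniform
        double[] d2 = new double[points.length];
        while (centers.size() < k) {
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                d2[i] = nearestSq(points[i], centers);
                total += d2[i];
            }
            // Weighted draw: weights stay unnormalized; drawing u from
            // [0, total) replaces scaling everything down to [0, 1].
            double u = rnd.nextDouble() * total;
            int idx = points.length - 1; // guard against FP rounding at the tail
            double cum = 0.0;
            for (int i = 0; i < points.length; i++) {
                cum += d2[i];
                if (u < cum) { idx = i; break; }
            }
            centers.add(points[idx]);
        }
        return centers;
    }

    static double nearestSq(double[] p, List<double[]> centers) {
        double best = Double.MAX_VALUE;
        for (double[] c : centers) {
            double s = 0.0;
            for (int i = 0; i < p.length; i++) {
                double d = p[i] - c[i];
                s += d * d;
            }
            best = Math.min(best, s);
        }
        return best;
    }
}
```

kmeans|| follows the same weighting idea but oversamples many candidates per pass, which is why it needs fewer passes over distributed data.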
[jira] [Updated] (FLINK-2286) Window ParallelMerge sometimes swallows elements of the last window
[ https://issues.apache.org/jira/browse/FLINK-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-2286: -- Fix Version/s: 0.9.1 Window ParallelMerge sometimes swallows elements of the last window --- Key: FLINK-2286 URL: https://issues.apache.org/jira/browse/FLINK-2286 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9, 0.10 Reporter: Márton Balassi Assignee: Márton Balassi Fix For: 0.9.1 Last windows in the stream that do not have parts at all of the parallel operator instances get swallowed by the ParallelMerge. To resolve this, ParallelMerge should be an operator instead of a function, so that the close method can access the collector and emit these elements. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2131]: Initialization schemes for k-mea...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117055846 Okay. I'll update it today itself with a few trivial fixes.
[jira] [Updated] (FLINK-2285) Active policy emits elements of the last window twice
[ https://issues.apache.org/jira/browse/FLINK-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maximilian Michels updated FLINK-2285: -- Fix Version/s: 0.9.1 Active policy emits elements of the last window twice - Key: FLINK-2285 URL: https://issues.apache.org/jira/browse/FLINK-2285 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9, 0.10 Reporter: Márton Balassi Assignee: Márton Balassi Fix For: 0.9.1 The root cause is some duplicate code between the close methods of the GroupedActiveDiscretizer and the GroupedStreamDiscretizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1745] [ml] [WIP] Add exact k-nearest-ne...
Github user chiwanpark commented on the pull request: https://github.com/apache/flink/pull/696#issuecomment-117059705 Hi, I updated this PR. I reimplemented the kNN join with `zipWithIndex` and adapted it to the changed pipeline architecture.
[GitHub] flink pull request: [FLINK-1731] [ml] Implementation of Feature K-...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/700#issuecomment-117060737 Hi. IMO, the purpose of learning is to develop a model which compactly represents the data somehow. Thus, having a distributed model doesn't make sense. Besides, the user might just want to take the model and use it somewhere else, in which case it makes sense to have it available not as distributed data, but just as a Java/Scala object which the user can easily operate on.
[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607968#comment-14607968 ] ASF GitHub Bot commented on FLINK-1731: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/700#issuecomment-117060737 Hi. IMO, the purpose of learning is to develop a model which compactly represents the data somehow. Thus, having a distributed model doesn't make sense. Besides, the user might just want to take the model and use it somewhere else, in which case it makes sense to have it available not as distributed data, but just as a Java/Scala object which the user can easily operate on. Add kMeans clustering algorithm to machine learning library --- Key: FLINK-1731 URL: https://issues.apache.org/jira/browse/FLINK-1731 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Peter Schrott Labels: ML The Flink repository already contains a kMeans implementation but it is not yet ported to the machine learning library. I assume that only the used data types have to be adapted and then it can be more or less directly moved to flink-ml. The kMeans++ [1] and the kMeans|| [2] algorithms constitute a better implementation because they improve the initial seeding phase to achieve near optimal clustering. It might be worthwhile to implement kMeans||. Resources: [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
[jira] [Commented] (FLINK-2141) Allow GSA's Gather to perform this operation in more than one direction
[ https://issues.apache.org/jira/browse/FLINK-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609398#comment-14609398 ] ASF GitHub Bot commented on FLINK-2141: --- GitHub user shghatge opened a pull request: https://github.com/apache/flink/pull/877 [FLINK-2141] Allow GSA's Gather to perform this operation in more than one direction Added the setDirection() and getDirection() methods to GSAConfiguration.java Added functionality to gather values from chosen neighbors instead of only OUT neighbors Added a simple example which generates, as the value of each vertex, the list of vertices to which there exists a path from that vertex Added util classes for the example and also added a test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/shghatge/flink vertex Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/877.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #877 commit d88329024394cef17a3454d20e6905f7d7363004 Author: Shivani shgha...@gmail.com Date: 2015-07-01T01:03:58Z [FLINK-2141][gelly] Allow GSA's Gather to perform this operation in more than one direction commit b2a29d6763820a2be6b1a71f0224fda83861c2c4 Author: Shivani shgha...@gmail.com Date: 2015-07-01T01:10:41Z [FLINK-2141][gelly] Allow GSA's Gather to perform this operation in more than one direction Allow GSA's Gather to perform this operation in more than one direction --- Key: FLINK-2141 URL: https://issues.apache.org/jira/browse/FLINK-2141 Project: Flink Issue Type: New Feature Components: Gelly Affects Versions: 0.9 Reporter: Andra Lungu Assignee: Shivani Ghatge For the time being, a vertex only gathers information from its in-edges. Similarly to the vertex-centric approach, we would like to allow users to gather data from out and all edges as well. 
This property should be set using a setDirection() method.
[GitHub] flink pull request: [FLINK-2141] Allow GSA's Gather to perform thi...
GitHub user shghatge opened a pull request: https://github.com/apache/flink/pull/877 [FLINK-2141] Allow GSA's Gather to perform this operation in more than one direction Added the setDirection() and getDirection() methods to GSAConfiguration.java Added functionality to gather values from chosen neighbors instead of only OUT neighbors Added a simple example which generates, as the value of each vertex, the list of vertices to which there exists a path from that vertex Added util classes for the example and also added a test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/shghatge/flink vertex Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/877.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #877 commit d88329024394cef17a3454d20e6905f7d7363004 Author: Shivani shgha...@gmail.com Date: 2015-07-01T01:03:58Z [FLINK-2141][gelly] Allow GSA's Gather to perform this operation in more than one direction commit b2a29d6763820a2be6b1a71f0224fda83861c2c4 Author: Shivani shgha...@gmail.com Date: 2015-07-01T01:10:41Z [FLINK-2141][gelly] Allow GSA's Gather to perform this operation in more than one direction
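Conceptually, the direction switch this PR adds just selects which adjacency set feeds the gather step. A toy plain-Java sketch of that selection (the EdgeDirection values mirror Gelly's IN/OUT/ALL enum; the edge-list representation here is illustrative, not Gelly's):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of "gather in more than one direction": pick the
// neighbor set of a vertex according to an edge direction.
class GatherDirection {
    enum EdgeDirection { IN, OUT, ALL }

    // edges[i] = {src, trg}; returns the neighbors vertex v gathers from.
    static List<Integer> gatherNeighbors(int[][] edges, int v, EdgeDirection dir) {
        List<Integer> neighbors = new ArrayList<>();
        for (int[] e : edges) {
            // OUT (and ALL): follow out-edges v -> trg.
            if (dir != EdgeDirection.IN && e[0] == v) neighbors.add(e[1]);
            // IN (and ALL): follow in-edges src -> v.
            if (dir != EdgeDirection.OUT && e[1] == v) neighbors.add(e[0]);
        }
        return neighbors;
    }
}
```

In the real PR the same choice is expressed via configuration, roughly `setDirection(...)` on GSAConfiguration, rather than branching at every gather call site.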
[jira] [Updated] (FLINK-2302) Allow multiple instances to run on single host
[ https://issues.apache.org/jira/browse/FLINK-2302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ufuk Celebi updated FLINK-2302: --- Description: The scripts currently can only run a single component like TaskManager per host. The main reason is that there is a single PID file per running process (e.g. TaskManager). If this file exists, no other process of the respective type can be started. Instead of having a single file per process with a single PID, we can append multiple PIDs to one file, e.g. a {{tm.pid}} can look like: {code} pid tm1 pid tm2 {code} Stopping a TM then removes a single PID in -FIFO fashion (add pids at tail, remove pids from head)- in LIFO fashion (add/remove pids at/from tail). LIFO is important to ensure that the log files can be automatically enumerated by number of pids per host w/o overwriting existing log files. was: The scripts currently can only run a single component like TaskManager per host. The main reason is that there is a single PID file per running process (e.g. TaskManager). If this file exists, no other process of the respective type can be started. Instead of having a single file per process with a single PID, we can append multiple PIDs to one file, e.g. a {{tm.pid}} can look like: {code} pid tm1 pid tm2 {code} Stopping a TM then removes a single PID in FIFO fashion (add pids at tail, remove pids from head). Allow multiple instances to run on single host -- Key: FLINK-2302 URL: https://issues.apache.org/jira/browse/FLINK-2302 Project: Flink Issue Type: Improvement Components: Start-Stop Scripts Affects Versions: 0.10 Reporter: Ufuk Celebi Fix For: 0.10 The scripts currently can only run a single component like TaskManager per host. The main reason is that there is a single PID file per running process (e.g. TaskManager). If this file exists, no other process of the respective type can be started. 
Instead of having a single file per process with a single PID, we can append multiple PIDs to one file, e.g. a {{tm.pid}} can look like:
{code}
pid tm1
pid tm2
{code}
Stopping a TM then removes a single PID in -FIFO fashion (add pids at tail, remove pids from head)- in LIFO fashion (add/remove pids at/from tail). LIFO is important to ensure that the log files can be automatically enumerated by number of pids per host w/o overwriting existing log files.
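The LIFO behavior described above can be modeled in a few lines (a Java sketch for illustration only; the actual change lives in the shell start/stop scripts):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Models the proposed tm.pid handling: each started TaskManager appends
// its pid at the tail; stopping removes from the tail (LIFO). Because the
// last-started instance is stopped first, an instance's position in the
// file - and hence its enumerated log file - never changes while it runs.
class PidFile {
    private final Deque<Long> pids = new ArrayDeque<>();

    // Start an instance; its log index is its position in the file,
    // e.g. a hypothetical taskmanager-<index>.log.
    long start(long pid) {
        pids.addLast(pid);
        return pids.size();
    }

    // Stop the most recently started instance; null if none is running.
    Long stop() {
        return pids.pollLast();
    }
}
```

With FIFO removal, stopping one instance would shift the positions of all remaining pids, breaking the log-file enumeration; LIFO avoids that.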
[GitHub] flink pull request: [FLINK-2138] Added custom partitioning to Data...
Github user gaborhermann commented on the pull request: https://github.com/apache/flink/pull/872#issuecomment-117141916 Okay then. These are effects of the change I did not know about. Let's stick to (2) and later on, we might reconsider this.
[jira] [Comment Edited] (FLINK-2295) TwoInput Task do not react to/forward checkpoint barriers
[ https://issues.apache.org/jira/browse/FLINK-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608171#comment-14608171 ] Gyula Fora edited comment on FLINK-2295 at 6/30/15 11:55 AM: - I see that the CoReader set in the TwoInput task does not listen to the checkpoint events. This should be a trivial fix. was (Author: gyfora): I see that the CoReader set in the TwoInput tas does not listen to the checkpoint events. This should be a trivial fix. TwoInput Task do not react to/forward checkpoint barriers - Key: FLINK-2295 URL: https://issues.apache.org/jira/browse/FLINK-2295 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9, 0.10, 0.9.1 Reporter: Aljoscha Krettek Assignee: Aljoscha Krettek Priority: Blocker The event listener for the checkpoint barriers was never enabled for TwoInput tasks. I have a fix for it and also tests that verify that it actually works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: New operator state interfaces
Github user rmetzger commented on the pull request: https://github.com/apache/flink/pull/747#issuecomment-117161545 Please create JIRAs for changes in the future.
[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1
[ https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608236#comment-14608236 ] Robert Metzger commented on FLINK-2294: --- It would have been nice if you had waited until you reached agreement with Aljoscha on the implementation, instead of just committing it to master without a pull request. From my understanding, I agree with Gyula that it is easier to set the nextInput outside the operator, instead of inside all operators. Keyed State does not work with DOP=1 Key: FLINK-2294 URL: https://issues.apache.org/jira/browse/FLINK-2294 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Aljoscha Krettek Assignee: Gyula Fora Priority: Blocker When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test fails. The reason seems to be that the element is not properly set when chaining is happening. Also, requiring this:
{code}
headContext.setNextInput(nextRecord);
streamOperator.processElement(nextRecord);
{code}
to be called seems rather fragile. Why not set the element in {{processElement()}}? This would also make for cleaner encapsulation, since now all outside code must assume that operators have a {{StreamingRuntimeContext}} on which they set the next element. The state/keyed state machinery seems dangerously undertested.
[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1
[ https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608160#comment-14608160 ] Gyula Fora commented on FLINK-2294: --- The issue was caused by having two output collectors in the output handles for copy/non-copy. And I forgot to set it in the second one. Code duplication = evil. I'm pushing a fix soon. Keyed State does not work with DOP=1 Key: FLINK-2294 URL: https://issues.apache.org/jira/browse/FLINK-2294 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Aljoscha Krettek Assignee: Gyula Fora Priority: Blocker When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test fails. The reason seems to be that the element is not properly set when chaining is happening. Also, requiring this:
{code}
headContext.setNextInput(nextRecord);
streamOperator.processElement(nextRecord);
{code}
to be called seems rather fragile. Why not set the element in {{processElement()}}? This would also make for cleaner encapsulation, since now all outside code must assume that operators have a {{StreamingRuntimeContext}} on which they set the next element. The state/keyed state machinery seems dangerously undertested.
[jira] [Commented] (FLINK-2138) PartitionCustom for streaming
[ https://issues.apache.org/jira/browse/FLINK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608159#comment-14608159 ] ASF GitHub Bot commented on FLINK-2138: --- Github user gaborhermann commented on the pull request: https://github.com/apache/flink/pull/872#issuecomment-117141916 Okay then. These are effects of the change I did not know about. Let's stick to (2) and later on, we might reconsider this. PartitionCustom for streaming - Key: FLINK-2138 URL: https://issues.apache.org/jira/browse/FLINK-2138 Project: Flink Issue Type: New Feature Components: Streaming Affects Versions: 0.9 Reporter: Márton Balassi Assignee: Gábor Hermann Priority: Minor The batch API has support for custom partitioning, this should be added for streaming with a similar signature.
[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1
[ https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608178#comment-14608178 ] Aljoscha Krettek commented on FLINK-2294: - {{processElement}} gets the element. That is the entry point into the operator. Why not set it inside the operator where the element is received? Keyed State does not work with DOP=1 Key: FLINK-2294 URL: https://issues.apache.org/jira/browse/FLINK-2294 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Aljoscha Krettek Assignee: Gyula Fora Priority: Blocker When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test fails. The reason seems to be that the element is not properly set when chaining is happening. Also, requiring this: {code} headContext.setNextInput(nextRecord); streamOperator.processElement(nextRecord); {code} to be called seems rather fragile. Why not set the element in {{processElement()}}. This would also make for cleaner encapsulation, since now all outside code must assume that operators have a {{StreamingRuntimeContext}} on which they set the next element. The state/keyed state machinery seems dangerously undertested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1
[ https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608238#comment-14608238 ] Gyula Fora commented on FLINK-2294: --- This was a trivial fix for the current implementation, that's why I pushed it. Changing the implementation would mean that we need to add additional tests for the operators, which we can do afterwards if we decide that way. Keyed State does not work with DOP=1 Key: FLINK-2294 URL: https://issues.apache.org/jira/browse/FLINK-2294 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Aljoscha Krettek Assignee: Gyula Fora Priority: Blocker When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test fails. The reason seems to be that the element is not properly set when chaining is happening. Also, requiring this:
{code}
headContext.setNextInput(nextRecord);
streamOperator.processElement(nextRecord);
{code}
to be called seems rather fragile. Why not set the element in {{processElement()}}? This would also make for cleaner encapsulation, since now all outside code must assume that operators have a {{StreamingRuntimeContext}} on which they set the next element. The state/keyed state machinery seems dangerously undertested.
[jira] [Commented] (FLINK-2295) TwoInput Task do not react to/forward checkpoint barriers
[ https://issues.apache.org/jira/browse/FLINK-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608171#comment-14608171 ] Gyula Fora commented on FLINK-2295: --- I see that the CoReader set in the TwoInput task does not listen to the checkpoint events. This should be a trivial fix. TwoInput Task do not react to/forward checkpoint barriers - Key: FLINK-2295 URL: https://issues.apache.org/jira/browse/FLINK-2295 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9, 0.10, 0.9.1 Reporter: Aljoscha Krettek Assignee: Aljoscha Krettek Priority: Blocker The event listener for the checkpoint barriers was never enabled for TwoInput tasks. I have a fix for it and also tests that verify that it actually works.
[jira] [Commented] (FLINK-2295) TwoInput Task do not react to/forward checkpoint barriers
[ https://issues.apache.org/jira/browse/FLINK-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608183#comment-14608183 ] Aljoscha Krettek commented on FLINK-2295: - That's why I said I have the fix, yes. :D The difficult bit are the tests that verify that this actually works and is not broken by some future changes. TwoInput Task do not react to/forward checkpoint barriers - Key: FLINK-2295 URL: https://issues.apache.org/jira/browse/FLINK-2295 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.9, 0.10, 0.9.1 Reporter: Aljoscha Krettek Assignee: Aljoscha Krettek Priority: Blocker The event listener for the checkpoint barriers was never enabled for TwoInput tasks. I have a fix for it and also tests that verify that it actually works. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1
[ https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608195#comment-14608195 ] Gyula Fora commented on FLINK-2294: --- Every stream operator has a context, and since the call to processElement happens outside of the operator implementation (inside the stream task or collector) we can set it there (that's 2 places). Setting it inside processElement means that we need to set it in each operator implementation (Map, Filter, etc.) and the user needs to be aware of setting it in custom operator implementations (infinite places). So I will probably stick with the current solution. Keyed State does not work with DOP=1 Key: FLINK-2294 URL: https://issues.apache.org/jira/browse/FLINK-2294 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Aljoscha Krettek Assignee: Gyula Fora Priority: Blocker When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test fails. The reason seems to be that the element is not properly set when chaining is happening. Also, requiring this:
{code}
headContext.setNextInput(nextRecord);
streamOperator.processElement(nextRecord);
{code}
to be called seems rather fragile. Why not set the element in {{processElement()}}? This would also make for cleaner encapsulation, since now all outside code must assume that operators have a {{StreamingRuntimeContext}} on which they set the next element. The state/keyed state machinery seems dangerously undertested.
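For contrast, the alternative debated above, setting the element once inside processElement() in a common base class so that neither call sites nor user-defined operators can forget it, might look roughly like this (names such as Context and handleElement are purely illustrative, not Flink's actual API):

```java
// Sketch of the "set it inside processElement" encapsulation: a final
// entry point records the current record, then delegates to the
// operator-specific logic. All names here are hypothetical.
class OperatorSketch {
    static class Context {
        Object nextInput; // what keyed state would use to derive the current key
    }

    abstract static class StreamOperator {
        final Context context = new Context();

        // Single entry point: record the element first, then delegate.
        // Being final, subclasses (Map, Filter, custom operators) cannot
        // bypass or forget the bookkeeping.
        final void processElement(Object record) {
            context.nextInput = record;
            handleElement(record);
        }

        abstract void handleElement(Object record);
    }

    static class Identity extends StreamOperator {
        Object last;
        @Override
        void handleElement(Object record) { last = record; }
    }
}
```

The trade-off matches the thread: this version sets the element in one place for all operators, at the cost of changing the operator base class, whereas the committed fix keeps the two external call sites.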
[jira] [Closed] (FLINK-2294) Keyed State does not work with DOP=1
[ https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gyula Fora closed FLINK-2294. - Resolution: Fixed https://github.com/apache/flink/commit/fef9f115838b3ba3d3769f8669ee251c2cd403c6 Keyed State does not work with DOP=1 Key: FLINK-2294 URL: https://issues.apache.org/jira/browse/FLINK-2294 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Aljoscha Krettek Assignee: Gyula Fora Priority: Blocker When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test fails. The reason seems to be that the element is not properly set when chaining is happening. Also, requiring this: {code} headContext.setNextInput(nextRecord); streamOperator.processElement(nextRecord); {code} to be called seems rather fragile. Why not set the element in {{processElement()}}. This would also make for cleaner encapsulation, since now all outside code must assume that operators have a {{StreamingRuntimeContext}} on which they set the next element. The state/keyed state machinery seems dangerously undertested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-2290) CoRecordReader Does Not Read Events From Both Inputs When No Elements Arrive
[ https://issues.apache.org/jira/browse/FLINK-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aljoscha Krettek updated FLINK-2290: Priority: Blocker (was: Major) CoRecordReader Does Not Read Events From Both Inputs When No Elements Arrive Key: FLINK-2290 URL: https://issues.apache.org/jira/browse/FLINK-2290 Project: Flink Issue Type: Bug Components: Streaming Reporter: Aljoscha Krettek Assignee: Aljoscha Krettek Priority: Blocker When no elements arrive the reader will always try to read from the same input index. This means that it will only process elements form this input. This could be problematic with Watermarks/Checkpoint barriers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
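The FLINK-2290 starvation problem — always probing the same input index when no elements arrive — can be illustrated with a rotating start index. This is a hedged sketch with invented names, not the actual CoRecordReader; it only shows the fairness idea of giving each input a turn so watermarks and checkpoint barriers on the other input are not ignored.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative two-input reader: rotates the probing start index so that
// neither input is starved, even across calls where both inputs are empty.
class TwoInputReaderSketch {
    private final Queue<String> input0;
    private final Queue<String> input1;
    private int nextToProbe = 0;

    TwoInputReaderSketch(Queue<String> input0, Queue<String> input1) {
        this.input0 = input0;
        this.input1 = input1;
    }

    String poll() {
        for (int i = 0; i < 2; i++) {
            int idx = (nextToProbe + i) % 2;
            Queue<String> q = (idx == 0) ? input0 : input1;
            String record = q.poll();
            if (record != null) {
                nextToProbe = (idx + 1) % 2; // give the other input the next turn
                return record;
            }
        }
        nextToProbe = (nextToProbe + 1) % 2; // rotate even when nothing arrived
        return null;
    }
}
```

A reader that instead reset `nextToProbe` to a fixed index would keep delivering from one input only, which is the bug described above.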
[GitHub] flink pull request: [FLINK-2138] Added custom partitioning to Data...
Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/872#issuecomment-117117294 How about we leave the batch API as it is for now and address that as a separate issue? There are quite a few subtleties in how the optimizer assesses equality of partitioning (based on partitioners) that would have to be changed (and should retain backwards compatibility). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2138) PartitionCustom for streaming
[ https://issues.apache.org/jira/browse/FLINK-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608088#comment-14608088 ] ASF GitHub Bot commented on FLINK-2138: --- Github user StephanEwen commented on the pull request: https://github.com/apache/flink/pull/872#issuecomment-117117294 How about we leave the batch API as it is for now and address that as a separate issue? There are quite some subtleties in how the optimizer assesses equality of partitioning (based on partitioners) that would have to be changed (and should retain backwards compatibility). PartitionCustom for streaming - Key: FLINK-2138 URL: https://issues.apache.org/jira/browse/FLINK-2138 Project: Flink Issue Type: New Feature Components: Streaming Affects Versions: 0.9 Reporter: Márton Balassi Assignee: Gábor Hermann Priority: Minor The batch API has support for custom partitioning, this should be added for streaming with a similar signature. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
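The custom-partitioning contract under discussion boils down to a user function mapping a key to a channel index. The sketch below mimics that contract without depending on Flink (the interface and class names are illustrative, not Flink's actual `Partitioner` API), showing the kind of function a streaming `partitionCustom` would accept.

```java
// Illustrative stand-in for a custom-partitioner contract: map a key to a
// channel index in [0, numPartitions).
interface PartitionerSketch<K> {
    int partition(K key, int numPartitions);
}

// Example partitioner: modulo on the key, using floorMod so that negative
// keys still land in a valid channel.
class ModPartitioner implements PartitionerSketch<Integer> {
    @Override
    public int partition(Integer key, int numPartitions) {
        return Math.floorMod(key, numPartitions);
    }
}
```

The optimizer subtleties Stephan mentions stem from the fact that two such functions can only be compared by object equality, so the batch optimizer must decide when two custom partitionings are "the same".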
[jira] [Created] (FLINK-2294) Keyed State does not work with DOP=1
Aljoscha Krettek created FLINK-2294: --- Summary: Keyed State does not work with DOP=1 Key: FLINK-2294 URL: https://issues.apache.org/jira/browse/FLINK-2294 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Aljoscha Krettek Priority: Blocker When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test fails. The reason seems to be that the element is not properly set when chaining is happening. Also, requiring this: {code} headContext.setNextInput(nextRecord); streamOperator.processElement(nextRecord); {code} to be called seems rather fragile. Why not set the element in {{processElement()}}. This would also make for cleaner encapsulation, since now all outside code must assume that operators have a {{StreamingRuntimeContext}} on which they set the next element. The state/keyed state machinery seems dangerously undertested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2294) Keyed State does not work with DOP=1
[ https://issues.apache.org/jira/browse/FLINK-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608100#comment-14608100 ] Gyula Fora commented on FLINK-2294: --- I don't understand what you mean by setting it in the processElement(). That method is a part of an interface that the user implements. Keyed State does not work with DOP=1 Key: FLINK-2294 URL: https://issues.apache.org/jira/browse/FLINK-2294 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Aljoscha Krettek Priority: Blocker When changing the DOP from 3 to 1 in StatefulOperatorTest.apiTest() the test fails. The reason seems to be that the element is not properly set when chaining is happening. Also, requiring this: {code} headContext.setNextInput(nextRecord); streamOperator.processElement(nextRecord); {code} to be called seems rather fragile. Why not set the element in {{processElement()}}. This would also make for cleaner encapsulation, since now all outside code must assume that operators have a {{StreamingRuntimeContext}} on which they set the next element. The state/keyed state machinery seems dangerously undertested. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2297) Add threshold setting for SVM binary predictions
[ https://issues.apache.org/jira/browse/FLINK-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608261#comment-14608261 ] ASF GitHub Bot commented on FLINK-2297: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/874#discussion_r33570883 --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/classification/SVMITSuite.scala --- @@ -69,19 +70,38 @@ class SVMITSuite extends FlatSpec with Matchers with FlinkTestBase { svm.fit(trainingDS) -val threshold = 0.0 - -val predictionPairs = svm.evaluate(test).map { - truthPrediction => -val truth = truthPrediction._1 -val prediction = truthPrediction._2 -val thresholdedPrediction = if (prediction > threshold) 1.0 else -1.0 -(truth, thresholdedPrediction) -} +val predictionPairs = svm.evaluate(test) val absoluteErrorSum = predictionPairs.collect().map{ case (truth, prediction) => Math.abs(truth - prediction)}.sum absoluteErrorSum should be < 15.0 } + + it should "be possible to get the raw decision function values" in { +val env = ExecutionEnvironment.getExecutionEnvironment + +val svm = SVM(). + setBlocks(env.getParallelism). + setIterations(100). + setLocalIterations(100). + setRegularization(0.002). + setStepsize(0.1). + setSeed(0). + clearThreshold() + +val trainingDS = env.fromCollection(Classification.trainingData) + +val test = trainingDS.map(x => x.vector) + +svm.fit(trainingDS) + +val predictions: DataSet[(FlinkVector, Double)] = svm.predict(test) + +val preds = predictions.map(vectorLabel => vectorLabel._2).collect() + +preds.max should be > 1.0 --- End diff -- Better way to check raw values vs. 
1/-1 Add threshold setting for SVM binary predictions Key: FLINK-2297 URL: https://issues.apache.org/jira/browse/FLINK-2297 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Priority: Minor Labels: ML Fix For: 0.10 Currently SVM outputs the raw decision function values when using the predict function. We should have instead the ability to set a threshold above which examples are labeled as positive (1.0) and below negative (-1.0). Then the prediction function can be directly used for evaluation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2297] [ml] Add threshold setting for SV...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/874#discussion_r33570883 --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/classification/SVMITSuite.scala --- @@ -69,19 +70,38 @@ class SVMITSuite extends FlatSpec with Matchers with FlinkTestBase { svm.fit(trainingDS) -val threshold = 0.0 - -val predictionPairs = svm.evaluate(test).map { - truthPrediction => -val truth = truthPrediction._1 -val prediction = truthPrediction._2 -val thresholdedPrediction = if (prediction > threshold) 1.0 else -1.0 -(truth, thresholdedPrediction) -} +val predictionPairs = svm.evaluate(test) val absoluteErrorSum = predictionPairs.collect().map{ case (truth, prediction) => Math.abs(truth - prediction)}.sum absoluteErrorSum should be < 15.0 } + + it should "be possible to get the raw decision function values" in { +val env = ExecutionEnvironment.getExecutionEnvironment + +val svm = SVM(). + setBlocks(env.getParallelism). + setIterations(100). + setLocalIterations(100). + setRegularization(0.002). + setStepsize(0.1). + setSeed(0). + clearThreshold() + +val trainingDS = env.fromCollection(Classification.trainingData) + +val test = trainingDS.map(x => x.vector) + +svm.fit(trainingDS) + +val predictions: DataSet[(FlinkVector, Double)] = svm.predict(test) + +val preds = predictions.map(vectorLabel => vectorLabel._2).collect() + +preds.max should be > 1.0 --- End diff -- Better way to check raw values vs. 1/-1
[jira] [Assigned] (FLINK-2296) Checkpoint committing broken
[ https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger reassigned FLINK-2296: - Assignee: Robert Metzger Checkpoint committing broken Key: FLINK-2296 URL: https://issues.apache.org/jira/browse/FLINK-2296 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Robert Metzger Assignee: Robert Metzger Priority: Blocker While working on fixing the failing {{PersistentKafkaSource}} test, I realized that the recent changes introduced in New operator state interfaces https://github.com/apache/flink/pull/747 (sadly, there is no JIRA for this huge change) introduced some changes that I was not aware of. * The {{CheckpointCoordinator}} is now sending the StateHandle back to the TaskManager when confirming a checkpoint. For the non-FS case, this means that for checkpoint committed operators, the state is sent twice over the wire for each checkpoint. For the FS case, this means that for every checkpoint commit, the state needs to be retrieved from the file system. Did you conduct any tests on a cluster to measure the performance impact of that? I see three approaches for fixing the aforementioned issue: - keep it this way (probably poor performance) - always keep the state for uncommitted checkpoints in the TaskManager's memory. Therefore, we need to come up with a good eviction strategy. I don't know the implications for large state. - change the interface and do not provide the state to the user function (=old behavior). This forces users to think about how they want to keep the state (but it is also a bit more work for them) I would like to get some feedback on how to solve this issue! Also, I discovered the following bugs: * Non-source tasks didn't get {{commitCheckpoint}} calls, even though they implemented the {{CheckpointCommitter}} interface. I fixed this issue in my current branch. 
* The state passed to the {{commitCheckpoint}} method did not match the subtask id. So user functions were receiving states from other parallel instances. This led to faulty behavior in the KafkaSource (that's also the reason why the KafkaITCase was failing more frequently ...). I fixed this issue in my current branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2297) Add threshold setting for SVM binary predictions
Theodore Vasiloudis created FLINK-2297: -- Summary: Add threshold setting for SVM binary predictions Key: FLINK-2297 URL: https://issues.apache.org/jira/browse/FLINK-2297 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Priority: Minor Fix For: 0.10 Currently SVM outputs the raw decision function values when using the predict function. We should have instead the ability to set a threshold above which examples are labeled as positive (1.0) and below negative (-1.0). Then the prediction function can be directly used for evaluation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
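The behaviour FLINK-2297 asks for — threshold the raw decision function value into a +/-1 label, but allow clearing the threshold to get the raw value back — can be summarized in one function. This is a hedged sketch with illustrative names, not FlinkML's actual API.

```java
import java.util.OptionalDouble;

// Illustrative sketch of the proposed prediction behaviour for a binary SVM.
class DecisionThresholdSketch {
    // With a threshold set, the raw decision value is mapped to +1/-1;
    // with the threshold cleared (empty), the raw value is passed through,
    // e.g. for evaluation or custom calibration.
    static double predict(double rawDecisionValue, OptionalDouble threshold) {
        if (threshold.isPresent()) {
            return rawDecisionValue > threshold.getAsDouble() ? 1.0 : -1.0;
        }
        return rawDecisionValue;
    }
}
```

With a default threshold of 0.0, `predict` returns labels directly comparable against the truth values, which is what lets the prediction function be used for evaluation without a manual thresholding map step.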
[GitHub] flink pull request: [FLINK-2297] [ml] Add threshold setting for SV...
GitHub user thvasilo opened a pull request: https://github.com/apache/flink/pull/874 [FLINK-2297] [ml] Add threshold setting for SVM binary predictions Currently SVM outputs the raw decision function values when using the predict function. We should instead have the ability to set a threshold above which examples are labeled as positive (1.0) and below which negative (-1.0). Then the prediction function can be directly used for evaluation. This provides that functionality, as well as the ability to provide the raw decision function values. You can merge this pull request into a Git repository by running: $ git pull https://github.com/thvasilo/flink svm-threshold Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/874.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #874 commit 279f12a6031f9d7b73e6b69303ae44627df4c401 Author: Theodore Vasiloudis t...@sics.se Date: 2015-06-30T12:49:03Z Added parameters argument to PredictOperation predict function. commit 1cea7a9a71ba04b6a9433789c0787112c7a24084 Author: Theodore Vasiloudis t...@sics.se Date: 2015-06-30T12:59:57Z Made the type parameter for Parameter to be covariant. commit 00cf902d1481d698648abe6f037583d8d543fa53 Author: Theodore Vasiloudis t...@sics.se Date: 2015-06-30T13:00:49Z Added Threshold option for SVM, to determine which predictions are positive/negative.
[jira] [Commented] (FLINK-2297) Add threshold setting for SVM binary predictions
[ https://issues.apache.org/jira/browse/FLINK-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608253#comment-14608253 ] ASF GitHub Bot commented on FLINK-2297: --- GitHub user thvasilo opened a pull request: https://github.com/apache/flink/pull/874 [FLINK-2297] [ml] Add threshold setting for SVM binary predictions Currently SVM outputs the raw decision function values when using the predict function. We should have instead the ability to set a threshold above which examples are labeled as positive (1.0) and below negative (-1.0). Then the prediction function can be directly used for evaluation. This provides that functionality, as well as the ability to provide the raw decision function values. You can merge this pull request into a Git repository by running: $ git pull https://github.com/thvasilo/flink svm-threshold Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/874.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #874 commit 279f12a6031f9d7b73e6b69303ae44627df4c401 Author: Theodore Vasiloudis t...@sics.se Date: 2015-06-30T12:49:03Z Added parameters argument to PredictOperation predict function. commit 1cea7a9a71ba04b6a9433789c0787112c7a24084 Author: Theodore Vasiloudis t...@sics.se Date: 2015-06-30T12:59:57Z Made the type parameter for Parameter to be covariant. commit 00cf902d1481d698648abe6f037583d8d543fa53 Author: Theodore Vasiloudis t...@sics.se Date: 2015-06-30T13:00:49Z Added Threshold option for SVM, to determine which predictions are positive/negative. 
Add threshold setting for SVM binary predictions Key: FLINK-2297 URL: https://issues.apache.org/jira/browse/FLINK-2297 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Priority: Minor Labels: ML Fix For: 0.10 Currently SVM outputs the raw decision function values when using the predict function. We should have instead the ability to set a threshold above which examples are labeled as positive (1.0) and below negative (-1.0). Then the prediction function can be directly used for evaluation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2296) Checkpoint committing broken
[ https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608255#comment-14608255 ] Gyula Fora commented on FLINK-2296: --- The reasons we made these changes to the checkpoint committing are the following: Storing uncommitted changes with ids in the operator like before makes no sense at all. If the committing is actually important, this will not work, because if the operator fails before it receives the commit, the state to be committed is lost. State committing usually involves committing some actual state, which you either store or provide a way to access. Since storing locally doesn't work for the specific reasons I mentioned before, we need to send a handle to it. Also, if we abstract the actual checkpointing away, like we do in the OperatorState implementations, the user is not aware when the checkpointing happens, so they would need to mess around with the checkpointing logic as well. Caching local states in the task managers could be an option to increase performance, I agree. Checkpoint committing broken Key: FLINK-2296 URL: https://issues.apache.org/jira/browse/FLINK-2296 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Robert Metzger Assignee: Robert Metzger Priority: Blocker While working on fixing the failing {{PersistentKafkaSource}} test, I realized that the recent changes introduced in New operator state interfaces https://github.com/apache/flink/pull/747 (sadly, there is no JIRA for this huge change) introduced some changes that I was not aware of. * The {{CheckpointCoordinator}} is now sending the StateHandle back to the TaskManager when confirming a checkpoint. For the non-FS case, this means that for checkpoint committed operators, the state is sent twice over the wire for each checkpoint. For the FS case, this means that for every checkpoint commit, the state needs to be retrieved from the file system. 
Did you conduct any tests on a cluster to measure the performance impact of that? I see three approaches for fixing the aforementioned issue: - keep it this way (probably poor performance) - always keep the state for uncommitted checkpoints in the TaskManager's memory. Therefore, we need to come up with a good eviction strategy. I don't know the implications for large state. - change the interface and do not provide the state to the user function (=old behavior). This forces users to think about how they want to keep the state (but it is also a bit more work for them) I would like to get some feedback on how to solve this issue! Also, I discovered the following bugs: * Non-source tasks didn't get {{commitCheckpoint}} calls, even though they implemented the {{CheckpointCommitter}} interface. I fixed this issue in my current branch. * The state passed to the {{commitCheckpoint}} method did not match the subtask id. So user functions were receiving states from other parallel instances. This led to faulty behavior in the KafkaSource (that's also the reason why the KafkaITCase was failing more frequently ...). I fixed this issue in my current branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-2296) Checkpoint committing broken
[ https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Metzger updated FLINK-2296: -- Description: While working on fixing the failing {{PersistentKafkaSource}} test, I realized that the recent changes introduced in New operator state interfaces https://github.com/apache/flink/pull/747 (sadly, there is no JIRA for this huge change) introduced some changes that I was not aware of. * The {{CheckpointCoordinator}} is now sending the StateHandle back to the TaskManager when confirming a checkpoint. For the non-FS case, this means that for checkpoint committed operators, the state is sent twice over the wire for each checkpoint. For the FS case, this means that for every checkpoint commit, the state needs to be retrieved from the file system. Did you conduct any tests on a cluster to measure the performance impact of that? I see three approaches for fixing the aforementioned issue: - keep it this way (probably poor performance) - always keep the state for uncommitted checkpoints in the TaskManager's memory. Therefore, we need to come up with a good eviction strategy. I don't know the implications for large state. - change the interface and do not provide the state to the user function (=old behavior). This forces users to think about how they want to keep the state (but it is also a bit more work for them) I would like to get some feedback on how to solve this issue! Also, I discovered the following bugs: * Non-source tasks didn't get {{commitCheckpoint}} calls, even though they implemented the {{CheckpointCommitter}} interface. I fixed this issue in my current branch. * The state passed to the {{commitCheckpoint}} method did not match the subtask id. So user functions were receiving states from other parallel instances. This led to faulty behavior in the KafkaSource (that's also the reason why the KafkaITCase was failing more frequently ...). I fixed this issue in my current branch. 
was: While working on fixing the failing {{PersistentKafkaSource}} test, I realized that the recent changes introduced in New operator state interfaces https://github.com/apache/flink/pull/747 (sadly, there is no JIRA for this huge change) introduced some changes that I was not aware of. * The {{CheckpointCooridnator}} is now sending the StateHandle back to the TaskManager when confirming a checkpoint. For the non-FS case, this means that for checkpoint committed operators, the state is send twice over the wire for each checkpoint. For the FS case, this means that for every checkpoint commit, the state needs to be retrieved from the file system. Did you conduct any tests on a cluster to measure the performance impact of that? I see three approaches for fixing the aforementioned issue: - keep it this way (probably poor performance) - always keep the state for uncommitted checkpoints in the TaskManager's memory. Therefore, we need to come up with a good eviction strategy. I don't know the implications for large state. - change the interface and do not provide the state to the user function (=old behavior). This forces users to think about how they want to keep the state (but it is also a bit more work for them) I would like to get some feedback on how to solve this issue! Also, I discovered the following bugs: * Non-source tasks didn't get {{commitCheckpoint}} calls, even though they implemented the {{CheckpointCommitter}} interface. I fixed this issue in my current branch. * The state passed to the {{commitCheckpoint}} method did not match with the subtask id. So user functions were receiving states from other parallel instances. This lead to faulty behavior in the KafkaSource (thats also the reason why the KafkaITCase was failing more frequently ...). I fixed this issue in my current branch. 
Checkpoint committing broken Key: FLINK-2296 URL: https://issues.apache.org/jira/browse/FLINK-2296 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Robert Metzger Assignee: Robert Metzger Priority: Blocker While working on fixing the failing {{PersistentKafkaSource}} test, I realized that the recent changes introduced in New operator state interfaces https://github.com/apache/flink/pull/747 (sadly, there is no JIRA for this huge change) introduced some changes that I was not aware of. * The {{CheckpointCoordinator}} is now sending the StateHandle back to the TaskManager when confirming a checkpoint. For the non-FS case, this means that for checkpoint committed operators, the state is send twice over the wire for each checkpoint. For the FS case, this means that for every checkpoint commit, the state needs to be retrieved from the file system. Did you conduct any tests on a
[GitHub] flink pull request: [FLINK-2297] [ml] Add threshold setting for SV...
Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/874#discussion_r33570815 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/classification/SVM.scala --- @@ -242,8 +275,21 @@ object SVM{ } } - override def predict(value: T, model: DenseVector): Double = { -value.asBreeze dot model.asBreeze + override def predict(value: T, model: DenseVector, predictParameters: ParameterMap): +Double = { +val thresholdOption = predictParameters.get(Threshold) + +val rawValue = value.asBreeze dot model.asBreeze +// If the Threshold option has been reset, we will get back a Some(None) thresholdOption +// causing the exception when we try to get the value. In that case we just return the +// raw value +try { --- End diff -- This part is a bit hacky and will have a performance impact. However, it allows us to do what we want without needing to change the way ParameterMap works, and with the default behaviour being that the thresholdOption is set, the exception should not be triggered most of the time.
[jira] [Commented] (FLINK-2297) Add threshold setting for SVM binary predictions
[ https://issues.apache.org/jira/browse/FLINK-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608259#comment-14608259 ] ASF GitHub Bot commented on FLINK-2297: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/874#discussion_r33570815 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/classification/SVM.scala --- @@ -242,8 +275,21 @@ object SVM{ } } - override def predict(value: T, model: DenseVector): Double = { -value.asBreeze dot model.asBreeze + override def predict(value: T, model: DenseVector, predictParameters: ParameterMap): +Double = { +val thresholdOption = predictParameters.get(Threshold) + +val rawValue = value.asBreeze dot model.asBreeze +// If the Threshold option has been reset, we will get back a Some(None) thresholdOption +// causing the exception when we try to get the value. In that case we just return the +// raw value +try { --- End diff -- This part is a bit hacky and will have a performance impact. However it allows to do what we want without needing to change the way ParameterMap works, and with the default behaviour being that the thresholdOption is set, the exception should be triggered most of the time. Add threshold setting for SVM binary predictions Key: FLINK-2297 URL: https://issues.apache.org/jira/browse/FLINK-2297 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Priority: Minor Labels: ML Fix For: 0.10 Currently SVM outputs the raw decision function values when using the predict function. We should have instead the ability to set a threshold above which examples are labeled as positive (1.0) and below negative (-1.0). Then the prediction function can be directly used for evaluation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2296) Checkpoint committing broken
[ https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608264#comment-14608264 ] Stephan Ewen commented on FLINK-2296: - I think there is a misunderstanding on what the commit messages currently are. They are currently a best effort notification of completed checkpoints. There are no delivery guarantees in the presence of failures after a completed checkpoint. Also, any form of serious distributed commit activity would need a 2-phase or consensus protocol, so these messages are not really suitable for that. Committing does not require the state in many cases. For many interactions with the outside world, the actual checkpointing would handle the state, and the post commit notification would mainly require something like a transaction ID. We may want to rename them from commit to notifyCheckpointComplete, though. I think we have a classic case here of insufficient communication and description: both documentation of what the model of the commit messages is, and a clear description of the changes performed to the checkpointing mechanism. Checkpoint committing broken Key: FLINK-2296 URL: https://issues.apache.org/jira/browse/FLINK-2296 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Robert Metzger Assignee: Robert Metzger Priority: Blocker While working on fixing the failing {{PersistentKafkaSource}} test, I realized that the recent changes introduced in New operator state interfaces https://github.com/apache/flink/pull/747 (sadly, there is no JIRA for this huge change) introduced some changes that I was not aware of. * The {{CheckpointCoordinator}} is now sending the StateHandle back to the TaskManager when confirming a checkpoint. For the non-FS case, this means that for checkpoint committed operators, the state is sent twice over the wire for each checkpoint. 
For the FS case, this means that for every checkpoint commit, the state needs to be retrieved from the file system. Did you conduct any tests on a cluster to measure the performance impact of that? I see three approaches for fixing the aforementioned issue: - keep it this way (probably poor performance) - always keep the state for uncommitted checkpoints in the TaskManager's memory. Therefore, we need to come up with a good eviction strategy. I don't know the implications for large state. - change the interface and do not provide the state to the user function (=old behavior). This forces users to think about how they want to keep the state (but it is also a bit more work for them) I would like to get some feedback on how to solve this issue! Also, I discovered the following bugs: * Non-source tasks didn't get {{commitCheckpoint}} calls, even though they implemented the {{CheckpointCommitter}} interface. I fixed this issue in my current branch. * The state passed to the {{commitCheckpoint}} method did not match the subtask id. So user functions were receiving states from other parallel instances. This led to faulty behavior in the KafkaSource (that's also the reason why the KafkaITCase was failing more frequently ...). I fixed this issue in my current branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
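Stephan's model — a best-effort notification carrying only an identifier, with the operator keeping whatever it needs to finalize external side effects locally — can be sketched as follows. This is a hedged illustration: the interface name echoes the suggested rename to notifyCheckpointComplete, but all classes and methods here are invented for the example, not Flink's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative callback: only the checkpoint id travels over the wire.
interface CheckpointNotifySketch {
    void notifyCheckpointComplete(long checkpointId);
}

// Example sink that stages a transaction id per checkpoint and finalizes it
// when the (best-effort) completion notification arrives.
class TransactionalSink implements CheckpointNotifySketch {
    private final Map<Long, String> pendingTransactions = new HashMap<>();
    private String lastCommitted;

    void stageForCheckpoint(long checkpointId, String txnId) {
        pendingTransactions.put(checkpointId, txnId);
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // The state itself stays local; the notification is just a trigger.
        String txn = pendingTransactions.remove(checkpointId);
        if (txn != null) {
            lastCommitted = txn; // finalize the external transaction here
        }
    }

    String lastCommitted() {
        return lastCommitted;
    }
}
```

Because the notification is best-effort, a robust implementation must tolerate missing or duplicate notifications, which is why Stephan notes that serious distributed commits would need a 2-phase or consensus protocol instead.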
[GitHub] flink pull request: [FLINK-2157] [ml] [WIP] Create evaluation fram...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/871#issuecomment-117185135 Cherry picked the changes to Predictor and SVM from #874 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (FLINK-2157) Create evaluation framework for ML library
[ https://issues.apache.org/jira/browse/FLINK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608270#comment-14608270 ] ASF GitHub Bot commented on FLINK-2157: --- Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/871#issuecomment-117185135 Cherry picked the changes to Predictor and SVM from #874 Create evaluation framework for ML library -- Key: FLINK-2157 URL: https://issues.apache.org/jira/browse/FLINK-2157 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Till Rohrmann Labels: ML Fix For: 0.10 Currently, FlinkML lacks means to evaluate the performance of trained models. It would be great to add some {{Evaluators}} which can calculate some score based on the information about true and predicted labels. This could also be used for the cross validation to choose the right hyper parameters. Possible scores could be F score [1], zero-one-loss score, etc. Resources [1] [http://en.wikipedia.org/wiki/F1_score] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
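As a rough illustration of the scores the issue mentions, here is a plain-Scala sketch (not the FlinkML {{Evaluator}} API, which does not exist yet) of binary F1 and zero-one loss computed from (true, predicted) label pairs:

```scala
// F1 for a binary problem; `positive` marks which label counts as the
// positive class (an assumption of this sketch).
def f1Score(pairs: Seq[(Double, Double)], positive: Double = 1.0): Double = {
  val tp = pairs.count { case (t, p) => t == positive && p == positive }
  val fp = pairs.count { case (t, p) => t != positive && p == positive }
  val fn = pairs.count { case (t, p) => t == positive && p != positive }
  if (tp == 0) 0.0 else 2.0 * tp / (2.0 * tp + fp + fn)
}

// Zero-one loss: fraction of misclassified examples.
def zeroOneLoss(pairs: Seq[(Double, Double)]): Double =
  pairs.count { case (t, p) => t != p }.toDouble / pairs.size
```

A DataSet-based evaluator would compute the same counts with aggregations over a joined (truth, prediction) data set.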
[jira] [Created] (FLINK-2298) Allow setting custom YARN application names through the CLI
Robert Metzger created FLINK-2298: - Summary: Allow setting custom YARN application names through the CLI Key: FLINK-2298 URL: https://issues.apache.org/jira/browse/FLINK-2298 Project: Flink Issue Type: Bug Components: YARN Client Affects Versions: 0.10 Reporter: Robert Metzger Assignee: Robert Metzger A Flink user asked for an option in the YARN CLI frontend to set a custom application name when starting Flink on YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2299) The slot on which the task manager was scheduled was killed
Andra Lungu created FLINK-2299: -- Summary: The slot on which the task manager was scheduled was killed Key: FLINK-2299 URL: https://issues.apache.org/jira/browse/FLINK-2299 Project: Flink Issue Type: Bug Affects Versions: 0.9, 0.10 Reporter: Andra Lungu Priority: Critical Fix For: 0.9.1 The following code: https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java, run on the Twitter follower graph (http://twitter.mpi-sws.org/data-icwsm2010.html) with a configuration similar to the one in FLINK-2293, fails with the following exception: java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 slots - URL: akka.tcp://flink@130.149.249.35:56135/user/taskmanager at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154) at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29) at 
akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) at akka.actor.ActorCell.invoke(ActorCell.scala:486) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) at akka.dispatch.Mailbox.run(Mailbox.scala:221) at akka.dispatch.Mailbox.exec(Mailbox.scala:231) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 06/29/2015 10:33:46 Job execution switched to status FAILING. The logs are here: https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2299) The slot on which the task manager was scheduled was killed
[ https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608296#comment-14608296 ] Robert Metzger commented on FLINK-2299: --- Looks like wally025 broke up with the JobManager ;) {code} 11:08:48,028 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@130.149.249.11:6123] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 11:08:55,417 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@130.149.249.11:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /130.149.249.11:6123 11:08:55,422 INFO org.apache.flink.runtime.taskmanager.TaskManager - Disconnecting from JobManager: JobManager is no longer reachable 11:08:55,509 INFO org.apache.flink.runtime.taskmanager.TaskManager - Disassociating from JobManager {code} Were there any network issues? The slot on which the task manager was scheduled was killed --- Key: FLINK-2299 URL: https://issues.apache.org/jira/browse/FLINK-2299 Project: Flink Issue Type: Bug Affects Versions: 0.9, 0.10 Reporter: Andra Lungu Priority: Critical Fix For: 0.9.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-1731] [ml] Implementation of Feature K-...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/700#issuecomment-117194517 What we would actually like to see is this PR and #757 merged into one, so that we can review them as a whole. @sachingoel0101 do you think you will be able to do that?
[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608302#comment-14608302 ] ASF GitHub Bot commented on FLINK-1731: --- Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/700#issuecomment-117194517 What we would actually like to see is this PR and #757 merged into one, so that we can review them as a whole. @sachingoel0101 do you think you will be able to do that? Add kMeans clustering algorithm to machine learning library --- Key: FLINK-1731 URL: https://issues.apache.org/jira/browse/FLINK-1731 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Peter Schrott Labels: ML The Flink repository already contains a kMeans implementation but it is not yet ported to the machine learning library. I assume that only the used data types have to be adapted and then it can be more or less directly moved to flink-ml. The kMeans++ [1] and kMeans|| [2] algorithms constitute a better implementation because they improve the initial seeding phase to achieve near-optimal clustering. It might be worthwhile to implement kMeans||. Resources: [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332)
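For intuition, the kMeans++ seeding step from [1] can be sketched in memory like this (a toy, not a DataSet implementation; kMeans|| parallelizes essentially the same idea by oversampling candidates in a few rounds):

```scala
import scala.util.Random

// Squared Euclidean distance between two points.
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// kMeans++ seeding: each new center is drawn with probability
// proportional to the squared distance to the nearest center so far.
// Assumes the data holds at least k distinct points.
def kMeansPlusPlusSeeds(points: IndexedSeq[Array[Double]], k: Int,
                        rnd: Random): IndexedSeq[Array[Double]] = {
  var centers = IndexedSeq(points(rnd.nextInt(points.size)))
  while (centers.size < k) {
    // Weight of each point: squared distance to its nearest chosen center.
    val weights = points.map(p => centers.map(c => sqDist(p, c)).min)
    // Draw from the unnormalized cumulative distribution -- no need to
    // scale the weights into [0, 1] first.
    val cumulative = weights.scanLeft(0.0)(_ + _).tail
    val r = rnd.nextDouble() * cumulative.last
    centers = centers :+ points(cumulative.indexWhere(_ > r))
  }
  centers
}
```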
[jira] [Commented] (FLINK-2299) The slot on which the task manager was scheduled was killed
[ https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608304#comment-14608304 ] Andra Lungu commented on FLINK-2299: No, there was no problem with the network. I believe the issue is more complex than that, unfortunately :(. And it is reproducible; the network cannot crash that often... The slot on which the task manager was scheduled was killed --- Key: FLINK-2299 URL: https://issues.apache.org/jira/browse/FLINK-2299 Project: Flink Issue Type: Bug Affects Versions: 0.9, 0.10 Reporter: Andra Lungu Priority: Critical Fix For: 0.9.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-1745) Add exact k-nearest-neighbours algorithm to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608308#comment-14608308 ] ASF GitHub Bot commented on FLINK-1745: --- Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/696#issuecomment-117195363 Thanks @chiwanpark. I don't think I'll have the time to review this week, I'll set a reminder for Monday. Add exact k-nearest-neighbours algorithm to machine learning library Key: FLINK-1745 URL: https://issues.apache.org/jira/browse/FLINK-1745 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Till Rohrmann Labels: ML, Starter Even though the k-nearest-neighbours (kNN) [1,2] algorithm is quite trivial, it is still used as a means to classify data and to do regression. This issue focuses on the implementation of an exact kNN (H-BNLJ, H-BRJ) algorithm as proposed in [2]. Could be a starter task. Resources: [1] [http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm] [2] [https://www.cs.utah.edu/~lifeifei/papers/mrknnj.pdf] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
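The core computation that the H-BNLJ/H-BRJ variants in [2] distribute is plain brute-force kNN. An in-memory sketch of that core (illustrative only, not the proposed Flink implementation):

```scala
// Squared Euclidean distance between two points.
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Exact kNN classification: take the k nearest training points to the
// query and return the majority label among them.
def knnClassify(train: Seq[(Array[Double], Int)],
                query: Array[Double], k: Int): Int =
  train
    .sortBy { case (p, _) => sqDist(p, query) }
    .take(k)
    .groupBy(_._2)          // group the k neighbours by label
    .maxBy(_._2.size)._1    // majority vote

```

The distributed versions avoid the full sort by joining blocks of the two data sets (H-BNLJ) or by pruning with an R-tree per partition (H-BRJ), but compute the same answer.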
[jira] [Comment Edited] (FLINK-2299) The slot on which the task manager was scheduled was killed
[ https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608304#comment-14608304 ] Andra Lungu edited comment on FLINK-2299 at 6/30/15 2:00 PM: - No, there was no problem with the network. I believe the issue is more complex than that, unfortunately :(. And it is reproducible; the network cannot crash that often... wally025 died (the problem is on the TM side IMO) and then I guess it broke the JM... And others have experienced a similar issue (on different clusters) recently. was (Author: andralungu): No, there was no problem with the network. I believe the issue is more complex than that, unfortunately :(. And it is reproducible; the network cannot crash that often... The slot on which the task manager was scheduled was killed --- Key: FLINK-2299 URL: https://issues.apache.org/jira/browse/FLINK-2299 Project: Flink Issue Type: Bug Affects Versions: 0.9, 0.10 Reporter: Andra Lungu Priority: Critical Fix For: 0.9.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2157) Create evaluation framework for ML library
[ https://issues.apache.org/jira/browse/FLINK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608346#comment-14608346 ] ASF GitHub Bot commented on FLINK-2157: --- Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/871#issuecomment-117202580 I believe that the current code covers the scope of the linked issue. We can now start reviewing the relevant changes and continue with code cleanup to bring this to a mergeable state. Once cleanup is done, a few more evaluation scores should be added before merging. Create evaluation framework for ML library -- Key: FLINK-2157 URL: https://issues.apache.org/jira/browse/FLINK-2157 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Till Rohrmann Labels: ML Fix For: 0.10 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [FLINK-2157] [ml] [WIP] Create evaluation fram...
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/871#issuecomment-117202580 I believe that the current code covers the scope of the linked issue. We can now start reviewing the relevant changes and continue with code cleanup to bring this to a mergeable state. Once cleanup is done, a few more evaluation scores should be added before merging.
[jira] [Updated] (FLINK-2111) Add stop signal to cleanly shutdown streaming jobs
[ https://issues.apache.org/jira/browse/FLINK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias J. Sax updated FLINK-2111: --- Summary: Add stop signal to cleanly shutdown streaming jobs (was: Add stop signal to cleanly stop streaming jobs) Add stop signal to cleanly shutdown streaming jobs Key: FLINK-2111 URL: https://issues.apache.org/jira/browse/FLINK-2111 Project: Flink Issue Type: Improvement Components: Distributed Runtime, JobManager, Local Runtime, Streaming, TaskManager, Webfrontend Reporter: Matthias J. Sax Assignee: Matthias J. Sax Priority: Minor Currently, streaming jobs can only be stopped using the cancel command, which is a hard stop with no clean shutdown. The newly introduced terminate signal will only affect streaming source tasks, such that the sources can stop emitting data and terminate cleanly, resulting in a clean termination of the whole streaming job. This feature is a prerequisite for https://issues.apache.org/jira/browse/FLINK-1929 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-2296) Checkpoint committing broken
[ https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608411#comment-14608411 ] Stephan Ewen commented on FLINK-2296: - +1 to do that, so the current state is robust and efficient. If we decide to add the state handle back, this should go together with a design for how to avoid repeated state transfers. Checkpoint committing broken Key: FLINK-2296 URL: https://issues.apache.org/jira/browse/FLINK-2296 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Robert Metzger Assignee: Robert Metzger Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] flink pull request: [storm-contrib] added named attribute access to...
GitHub user mjsax opened a pull request: https://github.com/apache/flink/pull/875 [storm-contrib] added named attribute access to Storm compatibility layer - added support for named attribute access - extended FlinkTopologyBuilder to handle declared output schemas - adapted JUnit test - added new examples and ITCases - updated README You can merge this pull request into a Git repository by running: $ git pull https://github.com/mjsax/flink flink-storm-compatibility Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/875.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #875 commit 1f2b7afe172eef8b4664ccc42449ab8a999bd582 Author: mjsax mj...@informatik.hu-berlin.de Date: 2015-06-29T15:50:07Z Storm compatibility improvement - added support for named attribute access - extended FlinkTopologyBuilder to handle declared output schemas - adapted JUnit test - added new examples and ITCases - updated README
[GitHub] flink pull request: [FLINK-2131][ml]: Initialization schemes for k...
Github user sachingoel0101 closed the pull request at: https://github.com/apache/flink/pull/757
[jira] [Commented] (FLINK-2296) Checkpoint committing broken
[ https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608360#comment-14608360 ] Robert Metzger commented on FLINK-2296: --- +1 for renaming the method to {{notifyCheckpointComplete}}! +1 for removing the {{StateHandleSerializable state}} argument from the commitCheckpoint/notifyCheckpointComplete method. If nobody disagrees within 24 hours, I will rename the method and remove the argument. Checkpoint committing broken Key: FLINK-2296 URL: https://issues.apache.org/jira/browse/FLINK-2296 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Robert Metzger Assignee: Robert Metzger Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (FLINK-2111) Add stop signal to cleanly shutdown streaming jobs
[ https://issues.apache.org/jira/browse/FLINK-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthias J. Sax updated FLINK-2111: --- Description: Currently, streaming jobs can only be stopped using the cancel command, which is a hard stop with no clean shutdown. The newly introduced stop signal will only affect streaming source tasks, such that the sources can stop emitting data and shut down cleanly, resulting in a clean shutdown of the whole streaming job. This feature is a prerequisite for https://issues.apache.org/jira/browse/FLINK-1929 was: Currently, streaming jobs can only be stopped using the cancel command, which is a hard stop with no clean shutdown. The newly introduced terminate signal will only affect streaming source tasks, such that the sources can stop emitting data and terminate cleanly, resulting in a clean termination of the whole streaming job. This feature is a prerequisite for https://issues.apache.org/jira/browse/FLINK-1929 Add stop signal to cleanly shutdown streaming jobs Key: FLINK-2111 URL: https://issues.apache.org/jira/browse/FLINK-2111 Project: Flink Issue Type: Improvement Components: Distributed Runtime, JobManager, Local Runtime, Streaming, TaskManager, Webfrontend Reporter: Matthias J. Sax Assignee: Matthias J. Sax Priority: Minor Currently, streaming jobs can only be stopped using the cancel command, which is a hard stop with no clean shutdown. The newly introduced stop signal will only affect streaming source tasks, such that the sources can stop emitting data and shut down cleanly, resulting in a clean shutdown of the whole streaming job. This feature is a prerequisite for https://issues.apache.org/jira/browse/FLINK-1929 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (FLINK-2301) In BarrierBuffer newer Barriers trigger old Checkpoints
Aljoscha Krettek created FLINK-2301: --- Summary: In BarrierBuffer newer Barriers trigger old Checkpoints Key: FLINK-2301 URL: https://issues.apache.org/jira/browse/FLINK-2301 Project: Flink Issue Type: Bug Components: Streaming Reporter: Aljoscha Krettek Assignee: Aljoscha Krettek When the BarrierBuffer has some inputs blocked on barrier 0 and then receives barriers for checkpoint 1 on the other inputs, it processes the checkpoint with id 0. I think the BarrierBuffer should drop all pending checkpoints when it receives a barrier from a more recent checkpoint and unblock the previously blocked channels. This makes it ready to correctly react to the other barriers of the newer checkpoint. It should also ignore barriers that arrive late, when a more recent checkpoint has already been processed.
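The rule proposed in the issue can be modeled in a few lines. This is a deliberately simplified illustration, not the real BarrierBuffer: a barrier from a newer checkpoint supersedes the pending one and unblocks its channels, while barriers older than the current checkpoint are ignored.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of the proposed barrier-alignment rule (illustrative
// names; not Flink's actual BarrierBuffer implementation).
public class BarrierTracker {
    private final int numChannels;
    private long currentCheckpointId = -1;
    private final Set<Integer> blockedChannels = new HashSet<>();

    public BarrierTracker(int numChannels) {
        this.numChannels = numChannels;
    }

    /** Returns true iff this barrier completes the current checkpoint. */
    public boolean onBarrier(long checkpointId, int channel) {
        if (checkpointId > currentCheckpointId) {
            // Newer checkpoint: drop the old one, unblock all channels.
            blockedChannels.clear();
            currentCheckpointId = checkpointId;
        } else if (checkpointId < currentCheckpointId) {
            // Late barrier of a superseded checkpoint: ignore it.
            return false;
        }
        blockedChannels.add(channel);
        if (blockedChannels.size() == numChannels) {
            blockedChannels.clear();
            return true; // all inputs aligned: trigger this checkpoint
        }
        return false;
    }

    public long currentCheckpoint() {
        return currentCheckpointId;
    }
}
```

With two channels, a barrier for checkpoint 1 arriving while channel 0 is blocked on checkpoint 0 discards checkpoint 0 entirely; the later, stale barrier for 0 is then a no-op.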
[GitHub] flink pull request: [FLINK-2131][ml]: Initialization schemes for k...
Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117220314 Hey @thvasilo , I'm going to break up this PR further. The motivation is that the sampling code should be available as a general feature: given a probability distribution over the data, a user should be able to sample as many points as they want. The Sampler will take as input the DataSet, the number of samples required, and a function that determines the relative probability of a particular element being picked, in addition to a flag specifying whether the elements should be sampled with or without replacement. Let me know your thoughts. I'll work out a version in the meantime. If this is desirable, I will file a JIRA ticket and open a separate PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
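A minimal single-machine sketch of such a sampler, under stated assumptions: the names and API below are illustrative, not the PR's actual code. As noted elsewhere in the thread, the relative weights need not sum to one -- scaling the uniform draw by the running total of the cumulative distribution has the same effect as normalizing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch of the proposed Sampler: data, a relative-weight
// function (unnormalized), and a sample count; draws with replacement.
public class WeightedSampler {
    public static <T> List<T> sampleWithReplacement(
            List<T> data, ToDoubleFunction<T> weight, int n, Random rnd) {
        // Build the (unnormalized) cumulative distribution.
        double[] cumulative = new double[data.size()];
        double total = 0;
        for (int i = 0; i < data.size(); i++) {
            total += weight.applyAsDouble(data.get(i));
            cumulative[i] = total;
        }
        List<T> samples = new ArrayList<>(n);
        for (int s = 0; s < n; s++) {
            // Scale the uniform draw instead of scaling the weights down to [0,1].
            double r = rnd.nextDouble() * total;
            int i = 0;
            while (cumulative[i] < r) {
                i++;
            }
            samples.add(data.get(i));
        }
        return samples;
    }
}
```

A without-replacement variant would remove each chosen element and rebuild (or incrementally adjust) the cumulative totals between draws; a distributed version over a DataSet additionally needs a pass to compute per-partition totals.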
[GitHub] flink pull request: [FLINK-2131][ml]: Initialization schemes for k...
GitHub user sachingoel0101 reopened a pull request: https://github.com/apache/flink/pull/757 [FLINK-2131][ml]: Initialization schemes for k-means clustering This adds the two most common initialization strategies for the k-means clustering algorithm, namely random initialization and kmeans++ initialization. Further details are at https://issues.apache.org/jira/browse/FLINK-2131 [Edit]: Work on kmeans|| has been started and just needs to be finalized. [Edit]: kmeans|| implementation finished. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sachingoel0101/flink clustering_initializations Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/757.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #757 commit dc2de88bf5e3148bb116cad607fc3c61d9dceac6 Author: Sachin Goel sachingoel0...@gmail.com Date: 2015-06-02T06:44:30Z Random and kmeans++ initialization methods added commit 4a39a19c1425259c71ac6d922b4d9a9f2e7d1c6e Author: Sachin Goel sachingoel0...@gmail.com Date: 2015-06-02T15:42:58Z Merge https://github.com/apache/flink into clustering_initializations commit cdbb3a0801d364935d455798c695f4615ae74e76 Author: Sachin Goel sachingoel0...@gmail.com Date: 2015-06-02T19:49:24Z Merge https://github.com/apache/flink into clustering_initializations commit 7496e21462e4efc0813450971ae6cbc94d2b2c15 Author: Sachin Goel sachingoel0...@gmail.com Date: 2015-06-02T22:41:20Z Initialization costs of random and kmeans++ added commit 8033c87b71686bd3955281db12583592549406cb Author: Sachin Goel sachingoel0...@gmail.com Date: 2015-06-05T21:54:10Z Merge https://github.com/apache/flink into clustering_initializations commit 29ed1d3fb31aa038d6ed1a5bf16d58f19565cdf8 Author: Sachin Goel sachingoel0...@gmail.com Date: 2015-06-05T22:52:02Z Removed cost parameter from Algorithm itself. 
Leaving it to the user for now. Also added support for weighted input data sets commit 5286c3c21d5019f6ba8ab67c2074570087bc1b3a Author: Sachin Goel sachingoel0...@gmail.com Date: 2015-06-06T05:04:55Z An initial draft of kmeans-par method commit f3bfad4fc0c6576af14f1e981f8e778445856355 Author: Sachin Goel sachingoel0...@gmail.com Date: 2015-06-08T10:36:32Z All three initialization schemes implemented and tested commit 8496b8fd627ade8dbe7b92949d35d3cce704f1cc Author: Sachin Goel sachingoel0...@gmail.com Date: 2015-06-08T10:36:58Z Merge https://github.com/apache/flink into clustering_initializations commit 3765a3e6a77a8bdbac21d03be1c43263925b1495 Author: Sachin Goel sachingoel0...@gmail.com Date: 2015-06-30T08:57:41Z Merge remote-tracking branch 'upstream/master' into clustering_initializations
[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering
[ https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608434#comment-14608434 ] ASF GitHub Bot commented on FLINK-2131: --- Github user sachingoel0101 commented on the pull request: https://github.com/apache/flink/pull/757#issuecomment-117220314 Hey @thvasilo , I'm going to break up this PR further. The motivation is that the sampling code should be available as a general feature: given a probability distribution over the data, a user should be able to sample as many points as they want. The Sampler will take as input the DataSet, the number of samples required, and a function that determines the relative probability of a particular element being picked, in addition to a flag specifying whether the elements should be sampled with or without replacement. Let me know your thoughts. I'll work out a version in the meantime. If this is desirable, I will file a JIRA ticket and open a separate PR. Add Initialization schemes for K-means clustering - Key: FLINK-2131 URL: https://issues.apache.org/jira/browse/FLINK-2131 Project: Flink Issue Type: Task Components: Machine Learning Library Reporter: Sachin Goel Assignee: Sachin Goel Lloyd's k-means algorithm takes initial centroids as its input. However, if the user doesn't provide initial centers, they may ask for a particular initialization scheme to be followed. The most commonly used are: 1. Random initialization: self-explanatory 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf 3. kmeans||: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf For very large data sets, or for large values of k, the kmeans|| method is preferred, as it provides the same approximation guarantees as kmeans++ while requiring fewer passes over the input data.
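For reference, kmeans++ seeding itself reduces to repeated weighted sampling: each new center is drawn with probability proportional to the squared distance from the nearest already-chosen center. A single-machine sketch on 1-D points follows; this is illustrative only (the PR operates on Flink DataSets of vectors), and all names are assumptions, not the PR's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of kmeans++ seeding on 1-D points (illustrative, not the PR's code).
public class KMeansPlusPlus {
    public static List<Double> chooseCenters(List<Double> points, int k, Random rnd) {
        List<Double> centers = new ArrayList<>();
        // First center: uniformly at random.
        centers.add(points.get(rnd.nextInt(points.size())));
        while (centers.size() < k) {
            // d2[i] = squared distance of point i to its nearest center so far.
            double[] d2 = new double[points.size()];
            double total = 0;
            for (int i = 0; i < points.size(); i++) {
                double best = Double.MAX_VALUE;
                for (double c : centers) {
                    double d = points.get(i) - c;
                    best = Math.min(best, d * d);
                }
                d2[i] = best;
                total += best;
            }
            // Weighted draw over the unnormalized d2 weights.
            double r = rnd.nextDouble() * total;
            int chosen = 0;
            double acc = 0;
            for (int i = 0; i < points.size(); i++) {
                acc += d2[i];
                if (acc >= r) { chosen = i; break; }
            }
            centers.add(points.get(chosen));
        }
        return centers;
    }
}
```

kmeans|| follows the same idea but oversamples several candidates per pass and re-clusters the candidates, which is what makes it need fewer passes over a distributed data set.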
[jira] [Created] (FLINK-2300) Links on FAQ page not rendered correctly
Robert Metzger created FLINK-2300: - Summary: Links on FAQ page not rendered correctly Key: FLINK-2300 URL: https://issues.apache.org/jira/browse/FLINK-2300 Project: Flink Issue Type: Bug Components: Project Website Reporter: Robert Metzger Priority: Minor On the Flink website, the links using the github plugin are broken. For example {code} {% github README.md master build instructions %} {code} renders to {code} https://github.com/apache/flink/tree/master/README.md {code} See: http://flink.apache.org/faq.html#my-job-fails-early-with-a-javaioeofexception-what-could-be-the-cause I was not able to resolve the issue by using {{a}} tags or markdown links.
[jira] [Comment Edited] (FLINK-2108) Add score function for Predictors
[ https://issues.apache.org/jira/browse/FLINK-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608336#comment-14608336 ] Theodore Vasiloudis edited comment on FLINK-2108 at 6/30/15 2:07 PM: - The score function can now be considered feature complete, as presented in the linked PR: https://github.com/apache/flink/pull/871 was (Author: tvas): The for the score function can now be considered feature complete, as presented in the linked PR: https://github.com/apache/flink/pull/871 Add score function for Predictors - Key: FLINK-2108 URL: https://issues.apache.org/jira/browse/FLINK-2108 Project: Flink Issue Type: Improvement Components: Machine Learning Library Reporter: Theodore Vasiloudis Assignee: Theodore Vasiloudis Priority: Minor Labels: ML A score function for Predictor implementations should take a DataSet[(I, O)] and an (optional) scoring measure and return a score. The DataSet[(I, O)] would probably be the output of the predict function. For example in MultipleLinearRegression, we can call predict on a labeled dataset, get back predictions for each item in the data, and then call score with the resulting dataset as an argument and we should get back a score for the prediction quality, such as the R^2 score.
[jira] [Commented] (FLINK-2299) The slot on which the task manager was scheduled was killed
[ https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608412#comment-14608412 ] Stephan Ewen commented on FLINK-2299: - We are getting more reports where this happens. We need to figure out - Whether these are results of garbage collection stalls. In that case, there is not much we could do easily, but we have to think about a long-term solution - Or whether this is a problem with the Akka remote watching. In that case, we should revert back to doing the heartbeats ourselves The slot on which the task manager was scheduled was killed --- Key: FLINK-2299 URL: https://issues.apache.org/jira/browse/FLINK-2299 Project: Flink Issue Type: Bug Affects Versions: 0.9, 0.10 Reporter: Andra Lungu Priority: Critical Fix For: 0.9.1 The following code: https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java Ran on the twitter follower graph: http://twitter.mpi-sws.org/data-icwsm2010.html With a similar configuration to the one in FLINK-2293 fails with the following exception: java.lang.Exception: The slot in which the task was executed has been released. 
Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 slots - URL: akka.tcp://flink@130.149.249.35:56135/user/taskmanager at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151) at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547) at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119) at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154) at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182) at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36) at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) at akka.actor.ActorCell.invoke(ActorCell.scala:486) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) at akka.dispatch.Mailbox.run(Mailbox.scala:221) at akka.dispatch.Mailbox.exec(Mailbox.scala:231) at 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 06/29/2015 10:33:46 Job execution switched to status FAILING. The logs are here: https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing
[jira] [Commented] (FLINK-2296) Checkpoint committing broken
[ https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608397#comment-14608397 ] Ufuk Celebi commented on FLINK-2296: +1 to the renaming. Checkpoint committing broken Key: FLINK-2296 URL: https://issues.apache.org/jira/browse/FLINK-2296 Project: Flink Issue Type: Bug Components: Streaming Affects Versions: 0.10 Reporter: Robert Metzger Assignee: Robert Metzger Priority: Blocker While working on fixing the failing {{PersistentKafkaSource}} test, I realized that the recent changes from the New operator state interfaces PR https://github.com/apache/flink/pull/747 (sadly, there is no JIRA for this huge change) include some changes that I was not aware of. * The {{CheckpointCoordinator}} is now sending the StateHandle back to the TaskManager when confirming a checkpoint. For the non-FS case, this means that for checkpoint-committed operators, the state is sent twice over the wire for each checkpoint. For the FS case, this means that for every checkpoint commit, the state needs to be retrieved from the file system. Did you conduct any tests on a cluster to measure the performance impact of that? I see three approaches for fixing the aforementioned issue: - keep it this way (probably poor performance) - always keep the state for uncommitted checkpoints in the TaskManager's memory. Therefore, we need to come up with a good eviction strategy. I don't know the implications for large state. - change the interface and do not provide the state to the user function (=old behavior). This forces users to think about how they want to keep the state (but it is also a bit more work for them). I would like to get some feedback on how to solve this issue! Also, I discovered the following bugs: * Non-source tasks didn't get {{commitCheckpoint}} calls, even though they implemented the {{CheckpointCommitter}} interface. I fixed this issue in my current branch. * The state passed to the {{commitCheckpoint}} method did not match the subtask id, so user functions were receiving states from other parallel instances. This led to faulty behavior in the KafkaSource (that's also the reason why the KafkaITCase was failing more frequently ...). I fixed this issue in my current branch.
[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering
[ https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608437#comment-14608437 ] ASF GitHub Bot commented on FLINK-2131: --- GitHub user sachingoel0101 reopened a pull request: https://github.com/apache/flink/pull/757 [FLINK-2131][ml]: Initialization schemes for k-means clustering
[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering
[ https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608435#comment-14608435 ] ASF GitHub Bot commented on FLINK-2131: --- Github user sachingoel0101 closed the pull request at: https://github.com/apache/flink/pull/757
[jira] [Commented] (FLINK-2108) Add score function for Predictors
[ https://issues.apache.org/jira/browse/FLINK-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608336#comment-14608336 ] Theodore Vasiloudis commented on FLINK-2108: The score function can now be considered feature complete, as presented in the linked PR: https://github.com/apache/flink/pull/871
[jira] [Commented] (FLINK-2296) Checkpoint committing broken
[ https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608364#comment-14608364 ] Gyula Fora commented on FLINK-2296: --- +1
[jira] [Comment Edited] (FLINK-2296) Checkpoint committing broken
[ https://issues.apache.org/jira/browse/FLINK-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608360#comment-14608360 ] Robert Metzger edited comment on FLINK-2296 at 6/30/15 2:22 PM: +1 for renaming the method to {{notifyCheckpointComplete}}! +1 for removing the {{StateHandle<Serializable> state}} argument from the commitCheckpoint/notifyCheckpointComplete method. If nobody disagrees within 24 hours, I will rename the method and remove the argument. (still, I'm asking for some +1 or comments on my suggestions) was (Author: rmetzger): +1 for renaming the method to {{notifyCheckpointComplete}}! +1 for removing the {{StateHandle<Serializable> state}} argument from the commitCheckpoint/notifyCheckpointComplete method. If nobody disagrees within 24 hours, I will rename the method and remove the argument.
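A sketch of what the proposed interface shape could look like, with a toy implementer. This is hypothetical code under stated assumptions, not the actual Flink interfaces: the callback receives only the checkpoint id, and the user function reacts using state it already holds, so no state needs to be shipped back on commit.

```java
public class CheckpointCommitDemo {
    // Proposed shape: only the checkpoint id is delivered on completion.
    // (Old, problematic shape also handed back the checkpointed state.)
    interface CheckpointListener {
        void notifyCheckpointComplete(long checkpointId);
    }

    // Example: a source that trims its pending-offsets log once a
    // checkpoint is known to be durable (hypothetical, not Flink code).
    static class OffsetCommitter implements CheckpointListener {
        final java.util.TreeMap<Long, Long> pendingOffsets = new java.util.TreeMap<>();

        /** Called when a checkpoint is taken: remember the offset it covers. */
        void onCheckpoint(long checkpointId, long offset) {
            pendingOffsets.put(checkpointId, offset);
        }

        @Override
        public void notifyCheckpointComplete(long checkpointId) {
            // Everything up to and including this checkpoint is durable,
            // so the corresponding pending entries can be dropped.
            pendingOffsets.headMap(checkpointId, true).clear();
        }
    }
}
```

The key point of the proposal is visible here: the notification carries no payload, so the TaskManager never has to re-send or re-read checkpoint state just to deliver a commit.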
[jira] [Assigned] (FLINK-2157) Create evaluation framework for ML library
[ https://issues.apache.org/jira/browse/FLINK-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Theodore Vasiloudis reassigned FLINK-2157: -- Assignee: Theodore Vasiloudis Create evaluation framework for ML library -- Key: FLINK-2157 URL: https://issues.apache.org/jira/browse/FLINK-2157 Project: Flink Issue Type: New Feature Components: Machine Learning Library Reporter: Till Rohrmann Assignee: Theodore Vasiloudis Labels: ML Fix For: 0.10 Currently, FlinkML lacks means to evaluate the performance of trained models. It would be great to add some {{Evaluators}} which can calculate some score based on the information about true and predicted labels. This could also be used for the cross validation to choose the right hyper parameters. Possible scores could be F score [1], zero-one-loss score, etc. Resources [1] [http://en.wikipedia.org/wiki/F1_score]
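Two of the measures mentioned in the issue are easy to state concretely. Assuming plain arrays of true and predicted labels (illustrative code, not FlinkML's eventual API): zero-one loss is the fraction of misclassified items, and binary F1 is 2*TP / (2*TP + FP + FN).

```java
// Illustrative scorers of the kind the issue describes (not FlinkML code):
// score = f(true labels, predicted labels).
public class Scores {
    /** Fraction of items where prediction and truth disagree. */
    public static double zeroOneLoss(int[] truth, int[] pred) {
        int wrong = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] != pred[i]) wrong++;
        }
        return (double) wrong / truth.length;
    }

    /** Binary F1 with label 1 as the positive class.
     *  (Undefined when there are no positives at all: tp+fp+fn == 0.) */
    public static double f1(int[] truth, int[] pred) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < truth.length; i++) {
            if (pred[i] == 1 && truth[i] == 1) tp++;
            else if (pred[i] == 1) fp++;
            else if (truth[i] == 1) fn++;
        }
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }
}
```

In a distributed setting the counts (wrong, tp, fp, fn) are per-partition sums that reduce associatively, which is what makes such scores a good fit for a DataSet-based evaluator.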
[jira] [Commented] (FLINK-2293) Division by Zero Exception
[ https://issues.apache.org/jira/browse/FLINK-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608419#comment-14608419 ] Stephan Ewen commented on FLINK-2293: - [~fhueske] and I have located this and figured out a fix. I'll try to push a fix later today. The issue occurs only in low-memory settings. You may want to break up the job anyway, to reduce memory fragmentation. Division by Zero Exception -- Key: FLINK-2293 URL: https://issues.apache.org/jira/browse/FLINK-2293 Project: Flink Issue Type: Bug Components: Local Runtime Affects Versions: 0.9, 0.10 Reporter: Andra Lungu Priority: Critical Fix For: 0.9.1 I am basically running an algorithm that simulates a Gather-Sum-Apply iteration that performs Triangle Count. (Why simulate it? Because you just need one superstep - using the runGatherSumApply function in Graph would be useless overhead.) What happens, at a high level: 1) Select neighbors with ID greater than the one corresponding to the current vertex; 2) Propagate the received values to neighbors with higher ID; 3) Compute the number of triangles by checking whether trgVertex.getValue().get(srcVertex.getId()); As you can see, I *do not* perform any division at all; the code is here: https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java Now for small graphs, 50MB max, the computation finishes nicely with the correct result. 
For a 10GB graph, however, I got this: java.lang.ArithmeticException: / by zero at org.apache.flink.runtime.operators.hash.MutableHashTable.insertIntoTable(MutableHashTable.java:836) at org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:819) at org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508) at org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:544) at org.apache.flink.runtime.operators.hash.NonReusingBuildFirstHashMatchIterator.callWithNextKey(NonReusingBuildFirstHashMatchIterator.java:104) at org.apache.flink.runtime.operators.MatchDriver.run(MatchDriver.java:173) at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496) at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362) at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559) at java.lang.Thread.run(Thread.java:722) see the full log here: https://gist.github.com/andralungu/984774f6348269df7951
[GitHub] flink pull request: [FLINK-2298] Allow setting a custom applicatio...
Github user hsaputra commented on a diff in the pull request: https://github.com/apache/flink/pull/876#discussion_r33604736

--- Diff: flink-yarn/src/main/java/org/apache/flink/yarn/FlinkYarnClient.java ---
@@ -591,14 +592,17 @@ protected AbstractFlinkYarnCluster deployInternal(String clusterName) throws Exc
 		capability.setMemory(jobManagerMemoryMb);
 		capability.setVirtualCores(1);
-		if(clusterName == null) {
-			clusterName = "Flink session with " + taskManagerCount + " TaskManagers";
+		if(defaultName == null) {
+			defaultName = "Flink session with " + taskManagerCount + " TaskManagers";
 		}
 		if(detached) {
-			clusterName += " (detached)";
+			defaultName += " (detached)";
 		}
-		appContext.setApplicationName(clusterName); // application name
+		if(this.customName != null) {
--- End diff --

Should probably check for empty or blank name I suppose
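The suggested guard could look like the following. This is an illustrative helper under stated assumptions, not the actual FlinkYarnClient code: an empty or whitespace-only custom name is treated like null, so the default name is used instead.

```java
// Sketch of the reviewer's suggestion: fall back to the default name
// unless the custom name is non-null and non-blank. (Hypothetical helper,
// not part of FlinkYarnClient.)
public class AppNameUtil {
    public static String resolveName(String customName, String defaultName) {
        if (customName != null && !customName.trim().isEmpty()) {
            return customName;
        }
        return defaultName;
    }
}
```

This keeps the caller's logic to a single expression: `appContext.setApplicationName(resolveName(customName, defaultName))` in the hypothetical shape above.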
[jira] [Commented] (FLINK-2299) The slot on which the task manager was scheduled was killed
[ https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608810#comment-14608810 ] Stephan Ewen commented on FLINK-2299:

[~andralungu] - from some LOG debugging, it seems that the following happens: the JobManager decides that the TaskManager is unreachable at 07:46:45. At that time, the TaskManager was executing the {{GroupReduce at main(TriangleCount.java:51)}}. Can you check what this GroupReduce does? Does it create a list of elements that becomes very large? That would lead to long GC pauses and could cause the TaskManager to not respond to the JobManager's heartbeats.

The slot on which the task manager was scheduled was killed
---
Key: FLINK-2299
URL: https://issues.apache.org/jira/browse/FLINK-2299
Project: Flink
Issue Type: Bug
Affects Versions: 0.9, 0.10
Reporter: Andra Lungu
Priority: Critical
Fix For: 0.9.1

The following code: https://github.com/andralungu/gelly-partitioning/blob/master/src/main/java/example/GSATriangleCount.java
Ran on the twitter follower graph: http://twitter.mpi-sws.org/data-icwsm2010.html
With a similar configuration to the one in FLINK-2293, it fails with the following exception:

java.lang.Exception: The slot in which the task was executed has been released.
Probably loss of TaskManager 57c67d938c9144bec5ba798bb8ebe636 @ wally025 - 8 slots - URL: akka.tcp://flink@130.149.249.35:56135/user/taskmanager
	at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:151)
	at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:547)
	at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:119)
	at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:154)
	at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:182)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:421)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
	at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
	at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
	at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
	at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
	at akka.actor.ActorCell.invoke(ActorCell.scala:486)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

06/29/2015 10:33:46 Job execution switched to status FAILING.

The logs are here: https://drive.google.com/file/d/0BwnaKJcSLc43M1BhNUt5NWdINHc/view?usp=sharing
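Stephan's hypothesis is that the GroupReduce materializes each group into an ever-growing list, driving heap pressure and GC pauses long enough to miss heartbeats. The contrast can be sketched abstractly as follows; this is illustrative Java over a plain `Iterator`, not the actual `TriangleCount` code or Flink's `GroupReduceFunction` interface.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of the suspected failure mode: counting a group's elements by
// first collecting them into a list uses memory proportional to the group
// size; the incremental version uses constant memory per group.
public class GroupReduceSketch {

    // Materializing variant: heap usage grows with the group, which for a
    // very large group can trigger long GC pauses.
    static int countMaterialized(Iterator<Long> group) {
        List<Long> all = new ArrayList<>();
        while (group.hasNext()) {
            all.add(group.next());
        }
        return all.size();
    }

    // Incremental variant: same result, constant memory.
    static int countIncremental(Iterator<Long> group) {
        int n = 0;
        while (group.hasNext()) {
            group.next();
            n++;
        }
        return n;
    }
}
```

When the per-group logic can be expressed as a running aggregate, the incremental form avoids the large intermediate list entirely and keeps GC pauses short.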
[jira] [Commented] (FLINK-2299) The slot on which the task manager was scheduled was killed
[ https://issues.apache.org/jira/browse/FLINK-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608732#comment-14608732 ] Stephan Ewen commented on FLINK-2299:

Aside from the issue that you see, I think you have a problem with data skew in the join. From the logs it looks like you may see heavy collisions in keys (one key on the build side occurring very often). Can you look into prior ML discussions for hints on how to deal with that?
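A quick way to confirm the skew Stephan suspects is to build a histogram of the build-side join keys and look at the heaviest key relative to the total. The sketch below does this over a plain array; in a Flink job the same check would be a key-count aggregation before the join. All names here are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a skew check: count occurrences per key, then inspect the
// maximum frequency. A single key holding a large fraction of the records
// indicates the hash-join build side will see heavy collisions.
public class SkewCheck {

    static Map<Long, Long> keyHistogram(long[] keys) {
        Map<Long, Long> counts = new HashMap<>();
        for (long k : keys) {
            counts.merge(k, 1L, Long::sum);
        }
        return counts;
    }

    static long maxFrequency(Map<Long, Long> counts) {
        long max = 0L;
        for (long c : counts.values()) {
            if (c > max) {
                max = c;
            }
        }
        return max;
    }
}
```

If the histogram confirms one dominant key, typical mitigations are building the hash table on the less skewed input or choosing a join strategy that broadcasts the small side instead of hash-partitioning the skewed one.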
[GitHub] flink pull request: [FLINK-2298] Allow setting a custom applicatio...
Github user hsaputra commented on a diff in the pull request: https://github.com/apache/flink/pull/876#discussion_r33604696

--- Diff: flink-yarn/src/main/java/org/apache/flink/yarn/FlinkYarnClient.java ---
@@ -341,7 +342,7 @@ public AbstractFlinkYarnCluster run() throws Exception {
 	 * This method will block until the ApplicationMaster/JobManager have been
 	 * deployed on YARN.
 	 */
-	protected AbstractFlinkYarnCluster deployInternal(String clusterName) throws Exception {
+	protected AbstractFlinkYarnCluster deployInternal(String defaultName) throws Exception {
--- End diff --

Since this becomes the default, maybe we could just remove the input argument and set the default internally if customName is null or empty.
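The refactor proposed here is to drop the `defaultName` parameter and derive the application name inside the class from `customName`, falling back to the generated default. A minimal sketch, assuming the `customName`, `taskManagerCount`, and `detached` fields from the diff above; the class itself and its constructor are illustrative, not the actual `FlinkYarnClient` code.

```java
// Sketch of the reviewer's suggestion: resolve the application name
// internally instead of passing a default in as an argument.
public class ApplicationNameResolver {

    private final String customName;
    private final int taskManagerCount;
    private final boolean detached;

    ApplicationNameResolver(String customName, int taskManagerCount, boolean detached) {
        this.customName = customName;
        this.taskManagerCount = taskManagerCount;
        this.detached = detached;
    }

    // Use the custom name when set and non-blank; otherwise build the
    // default the same way the current deployInternal() does.
    String resolveApplicationName() {
        if (customName != null && !customName.trim().isEmpty()) {
            return customName;
        }
        String name = "Flink session with " + taskManagerCount + " TaskManagers";
        if (detached) {
            name += " (detached)";
        }
        return name;
    }
}
```

With the name resolved in one place, `deployInternal()` no longer needs a `String` parameter at all, and the null/blank handling lives next to the default it falls back to.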