[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2017-02-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861919#comment-15861919
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user skonto commented on the issue:

https://github.com/apache/flink/pull/3192
  
@sachingoel0101 No problem, let me know as soon as you are ready. Also, 
@tillrohrmann made some comments that should be taken into consideration.


> Add kMeans clustering algorithm to machine learning library
> ---
>
> Key: FLINK-1731
> URL: https://issues.apache.org/jira/browse/FLINK-1731
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Peter Schrott
>  Labels: ML
>
> The Flink repository already contains a kMeans implementation but it is not 
> yet ported to the machine learning library. I assume that only the used data 
> types have to be adapted and then it can be more or less directly moved to 
> flink-ml.
> The kMeans++ [1] and the kMeans|| [2] algorithms constitute a better 
> implementation because they improve the initial seeding phase to achieve 
> near-optimal clustering. It might be worthwhile to implement kMeans||.
> Resources:
> [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
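
As a rough illustration of the seeding idea referenced in [1], here is a minimal sketch of kMeans++ initialization on plain local Scala collections. It is not the code from any Flink pull request; the object name `KMeansPlusPlusSeeding`, the `Point` alias, and the `seed` signature are assumptions made for this example only.

```scala
import scala.util.Random

// Illustrative sketch only: names and signatures are hypothetical, not flink-ml API.
object KMeansPlusPlusSeeding {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Picks k initial centroids: the first uniformly at random, each further one
  // with probability proportional to its squared distance to the nearest
  // centroid chosen so far.
  def seed(points: IndexedSeq[Point], k: Int, rng: Random): Vector[Point] = {
    require(k >= 1 && k <= points.size, "k must be between 1 and the number of points")
    var centroids = Vector(points(rng.nextInt(points.size)))
    while (centroids.size < k) {
      val weights = points.map(p => centroids.map(c => squaredDistance(p, c)).min)
      val total = weights.sum
      val next =
        if (total == 0.0) {
          // Every point coincides with a chosen centroid: fall back to a uniform pick.
          points(rng.nextInt(points.size))
        } else {
          val threshold = rng.nextDouble() * total
          // Walk the cumulative weights until the threshold is crossed.
          val cumulative = weights.scanLeft(0.0)(_ + _).tail
          val idx = cumulative.indexWhere(_ > threshold)
          points(if (idx >= 0) idx else points.size - 1)
        }
      centroids = centroids :+ next
    }
    centroids
  }
}
```

Points that already serve as centroids have weight zero and are therefore never drawn again, which is what spreads the seeds apart. kMeans|| [2] follows the same weighting idea but samples several candidates per round, so it suits a distributed setting better; that variant is not shown here.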



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2017-02-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859364#comment-15859364
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/3192#discussion_r100282403
  
--- Diff: flink-libraries/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala ---
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BLAS, Vector}
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+
+/**
+  * Implements the KMeans algorithm, which calculates cluster centroids based on a set of
+  * training data points and a set of k initial centroids.
+  *
+  * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of data points and can
+  * then be used to assign new points to the learned cluster centroids.
+  *
+  * The KMeans algorithm works as described on Wikipedia
+  * (http://en.wikipedia.org/wiki/K-means_clustering):
+  *
+  * Given an initial set of k means `m^{(1)}_1, …, m^{(1)}_k` (see below), the algorithm
+  * proceeds by alternating between two steps:
+  *
+  * ===Assignment step:===
+  *
+  * Assign each observation to the cluster whose mean yields the least within-cluster sum of
+  * squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is
+  * intuitively the "nearest" mean. (Mathematically, this means partitioning the observations
+  * according to the Voronoi diagram generated by the means).
+  *
+  * `S^{(t)}_i = { x_p : || x_p - m^{(t)}_i ||^2 ≤ || x_p - m^{(t)}_j ||^2 \forall j, 1 ≤ j ≤ k }`,
+  * where each `x_p` is assigned to exactly one `S^{(t)}_i`, even if it could be assigned to
+  * two or more of them.
+  *
+  * ===Update step:===
+  *
+  * Calculate the new means to be the centroids of the observations in the new clusters.
+  *
+  * `m^{(t+1)}_i = ( 1 / |S^{(t)}_i| ) \sum_{x_j \in S^{(t)}_i} x_j`
+  *
+  * Since the arithmetic mean is a least-squares estimator, this also minimizes the
+  * within-cluster sum of squares (WCSS) objective.
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = env.fromCollection(Clustering.trainingData)
+  *   val initialCentroids: DataSet[LabeledVector] = env.fromCollection(Clustering.initCentroids)
+  *
+  *   val kmeans = KMeans()
+  *     .setInitialCentroids(initialCentroids)
+  *     .setNumIterations(10)
+  *
+  *   kmeans.fit(trainingDS)
+  *
+  *   // getting the computed centroids
+  *   val centroidsResult = kmeans.centroids.get.collect()
+  *
+  *   // get matching clusters for new points
+  *   val testDS: DataSet[Vector] = env.fromCollection(Clustering.testData)
+  *   val clusters: DataSet[LabeledVector] = kmeans.predict(testDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[org.apache.flink.ml.clustering.KMeans.NumIterations]]:
+  * Defines the number of iterations used to recalculate the centroids of the clusters. As it
+  * is a heuristic algorithm, there is no guarantee that it will converge to the global
+  * optimum. The recalculation of the cluster centroids and the reassignment of the data
+  * points are repeated until the given number of iterations is reached.
+  * (Default value: '''10''')
+  *
+  * - [[org.apache.flink.ml.clustering.KMeans.InitialCentroids]]:
+  * Defines the initial k centroids of the k clust
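
The assignment and update steps described in the Scaladoc above can be sketched on plain local collections as follows. This is a minimal illustration under stated assumptions, not the `KMeans.scala` from the pull request: it uses no Flink `DataSet`, and the names `LloydSketch`, `nearest`, `mean`, `step`, and `run` are hypothetical. Keeping a centroid unchanged when no point is assigned to it is one possible choice, not something the thread specifies.

```scala
// Illustrative sketch only: local collections instead of Flink DataSets.
object LloydSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Assignment step: index of the centroid nearest to the point
  // (ties go to the lowest index, so each point lands in exactly one cluster).
  def nearest(centroids: IndexedSeq[Point], p: Point): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), p))

  // Update step: component-wise arithmetic mean of a non-empty cluster.
  def mean(cluster: Seq[Point]): Point =
    cluster.transpose.map(_.sum / cluster.size).toVector

  // One Lloyd iteration; a centroid that attracts no points is kept unchanged.
  def step(points: Seq[Point], centroids: IndexedSeq[Point]): IndexedSeq[Point] = {
    val clusters = points.groupBy(p => nearest(centroids, p))
    centroids.indices.map(i => clusters.get(i).map(mean).getOrElse(centroids(i)))
  }

  // Runs a fixed number of iterations, mirroring the NumIterations parameter.
  def run(points: Seq[Point], initial: IndexedSeq[Point], numIterations: Int): IndexedSeq[Point] =
    (1 to numIterations).foldLeft(initial)((cs, _) => step(points, cs))
}
```

For example, `LloydSketch.run(points, initialCentroids, 10)` returns the centroids after ten iterations. Because the iteration count is fixed, the loop does not check for convergence.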

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2017-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859075#comment-15859075
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user sachingoel0101 commented on the issue:

https://github.com/apache/flink/pull/3192
  
@skonto I'm traveling right now and won't be able to push an update until
Monday/Tuesday.

On Feb 9, 2017 09:31, "Stavros Kontopoulos" wrote:

> @sachingoel0101 could you update the PR so I can do a final review and
> request a merge?
> @tillrohrmann could assist with the forwardedfields question?





[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2017-02-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858967#comment-15858967
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user skonto commented on the issue:

https://github.com/apache/flink/pull/3192
  
@sachingoel0101 could you update the PR so I can do a final review and 
request a merge?
@tillrohrmann could you assist with the ForwardedFields question?




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2017-01-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15834626#comment-15834626
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

GitHub user sachingoel0101 opened a pull request:

https://github.com/apache/flink/pull/3192

[FLINK-1731][ml] Add KMeans clustering(Lloyd's algorithm)

This is split off from https://github.com/apache/flink/pull/757 to add 
Lloyd's algorithm first.
I will follow this up with the initialization schemes in the PR linked above. 

To address a few comments from the previous PR:
We cannot use `DataSet[LabeledVector]` instead of 
`DataSet[Seq[LabeledVector]]` because the model here is of type 
`Seq[LabeledVector]` and the pipeline semantics require it. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachingoel0101/flink kmeans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/3192.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3192


commit 598f1ea9b4a0e1daf1f151c8b69c88bf83224f71
Author: Peter Schrott 
Date:   2015-07-29T22:44:54Z

[FLINK-1731][ml]Added KMeans algorithm to ML library

commit d70c46e71e152b374c9b3f23c9d0bd006bf503ff
Author: Florian Goessler 
Date:   2015-07-29T22:50:22Z

[FLINK-1731][ml]Added unit tests for KMeans algorithm







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2016-02-05 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134250#comment-15134250
 ] 

Till Rohrmann commented on FLINK-1731:
--

There are still some unresolved issues with the implementation. If they get 
fixed soon, we can include kmeans in the 1.0 release. But I cannot give you any 
guarantees on that.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2016-02-05 Thread Simone Robutti (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134002#comment-15134002
 ] 

Simone Robutti commented on FLINK-1731:
---

Hi Till,

any news on the inclusion of K-Means in the 1.0 release? 



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-12-07 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044662#comment-15044662
 ] 

Till Rohrmann commented on FLINK-1731:
--

Hi [~ovidiumarcu], there is a PR open for adding kMeans 
(https://github.com/apache/flink/pull/757/files). Thus, we hope to include it 
in the next release.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-12-06 Thread Ovidiu Marcu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044136#comment-15044136
 ] 

Ovidiu Marcu commented on FLINK-1731:
-

Will you consider this issue within the next release?



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-10-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944880#comment-14944880
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 closed the pull request at:

https://github.com/apache/flink/pull/700




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-07-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611921#comment-14611921
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-118023871
  
Sure, feel free to close this, and link to the new one.




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-07-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611895#comment-14611895
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-118020035
  
@thvasilo I was actually able to create a pull request against 
@sachingoel0101's repository, so everything should be fine now. We can even 
close this PR.




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-07-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611888#comment-14611888
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-118017596
  
Hello @peedeeX21, one thing you could try is to rebase this branch on 
@sachingoel0101's branch, and then do a forced push to this one.




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-07-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610556#comment-14610556
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117731195
  
@peedeeX21, try this link: 
https://github.com/sachingoel0101/flink/compare/clustering_initializations...peedeeX21:feature_kmeans
I had a lot of trouble creating a PR against your repo yesterday.




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-07-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610553#comment-14610553
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117730349
  
@sachingoel0101 It would be best if I created a pull request against your 
repo, but for some reason I can't choose your repo as the base fork. 




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-07-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610506#comment-14610506
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117723450
  
@thvasilo, how do I merge this PR into mine? Maybe @peedeeX21 can create a 
pull request to my branch at 
https://github.com/sachingoel0101/flink/tree/clustering_initializations, or is 
there a better option?




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608302#comment-14608302
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117194517
  
What we would actually like to see is this PR and #757 merged into one, so 
that we can review them as a whole. @sachingoel0101, do you think you will be 
able to do that?




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608003#comment-14608003
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117068361
  
I totally agree with your points. We have only a small number of centroids, 
and the model is not supposed to be distributed in the end.

The question now is: should the resulting `DataSet` of centroids just be 
collected, or should the whole iteration be rewritten to work on a 
non-distributed collection?

Note: Unfortunately I am quite busy with other projects right now, so I 
won't have time to make many changes. Either the people from my group (who 
might actually have the same workload right now) or @sachingoel0101 can work 
on that if it's really urgent.


> Add kMeans clustering algorithm to machine learning library
> ---
>
> Key: FLINK-1731
> URL: https://issues.apache.org/jira/browse/FLINK-1731
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Peter Schrott
>  Labels: ML
>
> The Flink repository already contains a kMeans implementation but it is not 
> yet ported to the machine learning library. I assume that only the used data 
> types have to be adapted and then it can be more or less directly moved to 
> flink-ml.
> The kMeans++ [1] and the kMeans|| [2] algorithms constitute a better 
> implementation because they improve the initial seeding phase to achieve 
> near-optimal clustering. It might be worthwhile to implement kMeans||.
> Resources:
> [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf





[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607968#comment-14607968
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117060737
  
Hi. IMO, the purpose of learning is to develop a model which compactly 
represents the data. Thus, having a distributed model doesn't make sense. 
Besides, the user might just want to take the model and use it somewhere 
else, in which case it makes sense to have it available not as a distributed 
data set, but just as a Java/Scala object which the user can easily operate on.




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607941#comment-14607941
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-117053448
  
Hello @peedeeX21. The API does not deal with distributed models at the 
moment. In the K-means case, having the model distributed is overkill: it is 
highly unlikely that you will have >1000 centroids, so the model is tiny, and 
distributing it only creates unnecessary overhead.

We can keep the current implementation, but in the future we should really 
test against a non-distributed model, which can be broadcast in a 
DataSet[Seq[LabeledVector]], and compare performance.

Also, could you add an evaluate operation (EvaluateDataSetOperation) for 
KMeans (and a corresponding test)? It would be parametrized as 
EvaluateDataSetOperation[KMeans, Vector, Double].
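
As a rough illustration of the non-distributed alternative, the sketch below 
keeps the centroids in one local `Seq` and maps an input vector to the label of 
its nearest centroid as a `Double`. It is plain Scala and deliberately does not 
touch the flink-ml API: `LocalCentroid`, `LocalKMeansModel` and `predict` are 
hypothetical names, and how such a model would be shipped to the workers (for 
example as a broadcast set) is left out as an assumption.

```scala
object LocalModelSketch {
  // Hypothetical centroid type: a cluster label plus coordinates.
  final case class LocalCentroid(label: Double, coords: Vector[Double])

  // Hypothetical non-distributed model: the whole model is one small local collection.
  final case class LocalKMeansModel(centroids: Seq[LocalCentroid]) {
    require(centroids.nonEmpty, "a model needs at least one centroid")

    // Squared Euclidean distance; sufficient for finding the nearest centroid.
    private def sqDist(a: Vector[Double], b: Vector[Double]): Double =
      a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

    // Label of the centroid closest to the given point.
    def predict(point: Vector[Double]): Double =
      centroids.minBy(c => sqDist(point, c.coords)).label
  }
}
```

Because the model is an ordinary object, it can also be used outside of a 
Flink job, e.g. `model.predict(Vector(1.0, 2.0))` on a single point.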




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606437#comment-14606437
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-116850459
  
I am having some trouble fitting our predictor into the new API. 
The problem is that with `PredictOperation` the type of the model has to 
be defined. A `DataSet` of this type is the output of `getModel`. For the 
`predict` method the input is just an object of this type.

In our case our model is a `DataSet` of `LabeledVector`s (the centroids). 
This means I cannot implement a `PredictOperation` due to that restriction.

To me the API feels a bit inconsistent in that case.

For now I implemented only a `PredictDataSetOperation`.




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605768#comment-14605768
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33476507
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.common.functions.RichMapFunction
+import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.Vector
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+import scala.collection.JavaConverters._
+
+
+/**
+ * Implements the KMeans algorithm which calculates cluster centroids based on a set of training data
+ * points and a set of k initial centroids.
+ *
+ * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of data points and can then be
+ * used to assign new points to the learned cluster centroids.
+ *
+ * The KMeans algorithm works as described on Wikipedia
+ * (http://en.wikipedia.org/wiki/K-means_clustering):
+ *
+ * Given an initial set of k means `m_1^{(1)}, …, m_k^{(1)}` (see below), the algorithm proceeds by alternating
+ * between two steps:
+ *
+ * ===Assignment step:===
+ *
+ * Assign each observation to the cluster whose mean yields the least within-cluster sum of
+ * squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively
+ * the "nearest" mean. (Mathematically, this means partitioning the observations according to the
+ * Voronoi diagram generated by the means).
+ *
+ * `S^{(t)}_i = { x_p : || x_p - m^{(t)}_i ||^2 ≤ || x_p - m^{(t)}_j ||^2 \forall j, 1 ≤ j ≤ k }`,
+ * where each `x_p` is assigned to exactly one `S^{(t)}`, even if it could be assigned to two or
+ * more of them.
+ *
+ * ===Update step:===
+ *
+ * Calculate the new means to be the centroids of the observations in the new clusters.
+ *
+ * `m^{(t+1)}_i = ( 1 / |S^{(t)}_i| ) \sum_{x_j \in S^{(t)}_i} x_j`
+ *
+ * Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
+ * sum of squares (WCSS) objective.
+ *
+ * @example
+ * {{{
+ *  val trainingDS: DataSet[Vector] = env.fromCollection(Clustering.trainingData)
+ *  val initialCentroids: DataSet[LabeledVector] = env.fromCollection(Clustering.initCentroids)
+ *
+ *  val kmeans = KMeans()
+ *    .setInitialCentroids(initialCentroids)
+ *    .setNumIterations(10)
+ *
+ *  kmeans.fit(trainingDS)
+ *
+ *  // getting the computed centroids
+ *  val centroidsResult = kmeans.centroids.get.collect()
+ *
+ *  // get matching clusters for new points
+ *  val testDS: DataSet[Vector] = env.fromCollection(Clustering.testData)
+ *  val clusters: DataSet[LabeledVector] = kmeans.predict(testDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.NumIterations]]:
+ * Defines the number of iterations to recalculate the centroids of the clusters. As it
+ * is a heuristic algorithm, there is no guarantee that it will converge to the global optimum. The
+ * recalculation of the cluster centroids and the reassignment of the data points are repeated until the
+ * given number of iterations is reached.
+ * (Default value: '''10''')
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.InitialCentroids]]:
 

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605753#comment-14605753
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33475321
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
 

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605752#comment-14605752
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33475298
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
 

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605726#comment-14605726
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33471983
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
 

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605680#comment-14605680
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33469076
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
 

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605665#comment-14605665
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33468244
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.common.functions.RichMapFunction
+import 
org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.Vector
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+import scala.collection.JavaConverters._
+
+
+/**
+ * Implements the KMeans algorithm which calculates cluster centroids 
based on set of training data
+ * points and a set of k initial centroids.
+ *
+ * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of 
data points and can then be
+ * used to assign new points to the learned cluster centroids.
+ *
+ * The KMeans algorithm works as described on Wikipedia
+ * (http://en.wikipedia.org/wiki/K-means_clustering):
+ *
+ * Given an initial set of k means m1(1),…,mk(1) (see below), the 
algorithm proceeds by alternating
+ * between two steps:
+ *
+ * ===Assignment step:===
+ *
+ * Assign each observation to the cluster whose mean yields the least within-cluster sum of
+ * squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively
+ * the "nearest" mean. (Mathematically, this means partitioning the observations according to the
+ * Voronoi diagram generated by the means).
+ *
+ * `S^{(t)}_i = { x_p : || x_p - m^{(t)}_i ||^2 ≤ || x_p - m^{(t)}_j ||^2 \forall j, 1 ≤ j ≤ k }`,
+ * where each `x_p` is assigned to exactly one `S^{(t)}`, even if it could be assigned to two or
+ * more of them.
+ *
+ * ===Update step:===
+ *
+ * Calculate the new means to be the centroids of the observations in the new clusters.
+ *
+ * `m^{(t+1)}_i = ( 1 / |S^{(t)}_i| ) \sum_{x_j \in S^{(t)}_i} x_j`
+ *
+ * Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
+ * sum of squares (WCSS) objective.
+ *
+ * @example
+ * {{{
+ *  val trainingDS: DataSet[Vector] = env.fromCollection(Clustering.trainingData)
+ *  val initialCentroids: DataSet[LabeledVector] = env.fromCollection(Clustering.initCentroids)
+ *
+ *  val kmeans = KMeans()
+ *    .setInitialCentroids(initialCentroids)
+ *    .setNumIterations(10)
+ *
+ *  kmeans.fit(trainingDS)
+ *
+ *  // getting the computed centroids
+ *  val centroidsResult = kmeans.centroids.get.collect()
+ *
+ *  // get matching clusters for new points
+ *  val testDS: DataSet[Vector] = env.fromCollection(Clustering.testData)
+ *  val clusters: DataSet[LabeledVector] = kmeans.predict(testDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.NumIterations]]:
+ * Defines the number of iterations used to recalculate the cluster centroids. As KMeans is a
+ * heuristic algorithm, there is no guarantee that it will converge to the global optimum. The
+ * recalculation of the centroids and the reassignment of the data points are repeated until the
+ * given number of iterations is reached.
+ * (Default value: '''10''')
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.InitialCentroids]]:
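
The assignment and update steps described in the Scaladoc above can be sketched in plain Scala, without Flink. This is only an illustration under stated assumptions, not the code from the pull request: the names `LloydSketch`, `Point`, `nearest`, `update` and `run` are hypothetical, points are plain `scala.Vector[Double]` values rather than Flink ML vectors, and a centroid that attracts no points is simply left where it is.

```scala
object LloydSketch {
  // Plain Scala vector of coordinates (not org.apache.flink.ml.math.Vector).
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Assignment step: index of the nearest centroid; ties go to the lowest index,
  // so every point ends up in exactly one cluster.
  def nearest(centroids: Seq[Point], p: Point): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), p))

  // Update step: each centroid becomes the arithmetic mean of its assigned points.
  // Assumption: a centroid without any assigned point is kept unchanged.
  def update(centroids: Seq[Point], points: Seq[Point]): Seq[Point] = {
    val clusters = points.groupBy(p => nearest(centroids, p))
    centroids.indices.map { i =>
      clusters.get(i) match {
        case Some(members) => members.transpose.map(_.sum / members.size).toVector
        case None          => centroids(i)
      }
    }
  }

  // Repeat both steps for a fixed number of iterations, as NumIterations does.
  def run(initial: Seq[Point], points: Seq[Point], numIterations: Int): Seq[Point] =
    (1 to numIterations).foldLeft(initial)((current, _) => update(current, points))
}
```

Calling `LloydSketch.run(initialCentroids, points, 10)` mirrors the fixed iteration count of the `NumIterations` parameter; a convergence check on centroid movement would be a natural extension but is not part of the description above.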
 

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605652#comment-14605652
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-116685600
  
I've been following this PR because my PR on initialization schemes can't be 
merged before this one. I already have three initialization mechanisms 
(Random, k-means++, and k-means||). I referenced that PR earlier in this thread.




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605638#comment-14605638
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33466208
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.common.functions.RichMapFunction
+import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.Vector
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+import scala.collection.JavaConverters._
+
+
+/**
+ * Implements the KMeans algorithm, which calculates cluster centroids based on a set of training
+ * data points and a set of k initial centroids.
+ *
+ * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of data points and can then be
+ * used to assign new points to the learned cluster centroids.
+ *
+ * The KMeans algorithm works as described on Wikipedia
+ * (http://en.wikipedia.org/wiki/K-means_clustering):
+ *
+ * Given an initial set of k means `m^{(1)}_1, …, m^{(1)}_k` (see below), the algorithm proceeds
+ * by alternating between two steps:
+ *
+ * ===Assignment step:===
+ *
+ * Assign each observation to the cluster whose mean yields the least within-cluster sum of
+ * squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively
+ * the "nearest" mean. (Mathematically, this means partitioning the observations according to the
+ * Voronoi diagram generated by the means).
+ *
+ * `S^{(t)}_i = { x_p : || x_p - m^{(t)}_i ||^2 ≤ || x_p - m^{(t)}_j ||^2 \forall j, 1 ≤ j ≤ k }`,
+ * where each `x_p` is assigned to exactly one `S^{(t)}`, even if it could be assigned to two or
+ * more of them.
+ *
+ * ===Update step:===
+ *
+ * Calculate the new means to be the centroids of the observations in the new clusters.
+ *
+ * `m^{(t+1)}_i = ( 1 / |S^{(t)}_i| ) \sum_{x_j \in S^{(t)}_i} x_j`
+ *
+ * Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
+ * sum of squares (WCSS) objective.
+ *
+ * @example
+ * {{{
+ *  val trainingDS: DataSet[Vector] = env.fromCollection(Clustering.trainingData)
+ *  val initialCentroids: DataSet[LabeledVector] = env.fromCollection(Clustering.initCentroids)
+ *
+ *  val kmeans = KMeans()
+ *    .setInitialCentroids(initialCentroids)
+ *    .setNumIterations(10)
+ *
+ *  kmeans.fit(trainingDS)
+ *
+ *  // getting the computed centroids
+ *  val centroidsResult = kmeans.centroids.get.collect()
+ *
+ *  // get matching clusters for new points
+ *  val testDS: DataSet[Vector] = env.fromCollection(Clustering.testData)
+ *  val clusters: DataSet[LabeledVector] = kmeans.predict(testDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.NumIterations]]:
+ * Defines the number of iterations used to recalculate the cluster centroids. As KMeans is a
+ * heuristic algorithm, there is no guarantee that it will converge to the global optimum. The
+ * recalculation of the centroids and the reassignment of the data points are repeated until the
+ * given number of iterations is reached.
+ * (Default value: '''10''')
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.InitialCentroids]]:
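
The `predict` call in the Scaladoc example above returns labeled vectors, which suggests that each new point is tagged with the label of its closest centroid. A minimal plain-Scala sketch of that idea follows. It makes several assumptions: `PredictSketch` and `Labeled` are hypothetical stand-ins (the latter for Flink ML's `LabeledVector`), points are `scala.Vector[Double]` values, and the distance is squared Euclidean. It does not show how the pull request implements prediction on a `DataSet`.

```scala
object PredictSketch {
  // Plain Scala vector of coordinates (not org.apache.flink.ml.math.Vector).
  type Point = Vector[Double]

  // Hypothetical stand-in for a labeled vector: a numeric label plus coordinates.
  final case class Labeled(label: Double, vector: Point)

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Tag every point with the label of its nearest centroid.
  def predict(centroids: Seq[Labeled], points: Seq[Point]): Seq[Labeled] = {
    require(centroids.nonEmpty, "at least one centroid is required")
    points.map { p =>
      val closest = centroids.minBy(c => squaredDistance(c.vector, p))
      Labeled(closest.label, p)
    }
  }
}
```

In a distributed setting the centroid set is usually small enough to ship to every worker, which is presumably why the diff imports `RichMapFunction`; that reading is an inference from the imports, not something the truncated diff confirms.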
  

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605632#comment-14605632
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-116669400
  
Another note: It should not be necessary for the user to provide the 
initial centroids; it should be possible to generate them from the algorithm 
itself, ideally with a scheme like kmeans++.
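
For context, k-means++ seeding picks the first centroid uniformly at random and each further centroid with probability proportional to the squared distance from a point to its nearest already-chosen centroid. The sketch below illustrates that scheme in plain Scala. It is not taken from either pull request: `KMeansPlusPlusSketch`, `Point` and `seed` are hypothetical names, and the input is assumed to fit in memory as an `IndexedSeq`.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  // Plain Scala vector of coordinates (not org.apache.flink.ml.math.Vector).
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Choose k initial centroids from the given points.
  def seed(points: IndexedSeq[Point], k: Int, rng: Random): Seq[Point] = {
    require(k >= 1 && k <= points.size, "k must be between 1 and the number of points")
    // First centroid: uniform random choice.
    val first = points(rng.nextInt(points.size))
    (1 until k).foldLeft(Vector(first)) { (chosen, _) =>
      // Weight of each point: squared distance to its nearest chosen centroid.
      val weights = points.map(p => chosen.map(c => squaredDistance(c, p)).min)
      val total = weights.sum
      val next =
        if (total == 0.0) {
          // Every point coincides with a chosen centroid; fall back to a uniform pick.
          points(rng.nextInt(points.size))
        } else {
          // Sample an index with probability weights(i) / total.
          val target = rng.nextDouble() * total
          val cumulative = weights.scanLeft(0.0)(_ + _).tail
          val idx = cumulative.indexWhere(_ > target)
          points(if (idx >= 0) idx else points.size - 1)
        }
      chosen :+ next
    }
  }
}
```

The result of `KMeansPlusPlusSketch.seed(points, k, new Random(seed))` could then be labeled and handed to `setInitialCentroids`, so that the user no longer has to supply the centroids by hand.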




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605627#comment-14605627
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33465223
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/clustering/Clustering.scala
 ---
@@ -0,0 +1,256 @@
+/*
--- End diff --

done




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605615#comment-14605615
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-116662421
  
Hello, I've left some initial comments. Once those have been addressed, I'll 
try to do some more integration testing and then pass the review over to a 
committer.




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605613#comment-14605613
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33463634
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/clustering/Clustering.scala
 ---
@@ -0,0 +1,256 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import breeze.linalg.{DenseVector => BreezeDenseVector, Vector => BreezeVector}
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.{DenseVector, Vector}
+
+/**
+ * Training and test data sets for the K-Means implementation
+ * [[org.apache.flink.ml.clustering.KMeans]].
+ */
+object Clustering {
+
+  /*
+   * Number of iterations for the K-Means algorithm.
+   */
+  val iterations = 10
+
+  /*
+   * Sequence of initial centroids.
+   */
+  val centroidData: Seq[LabeledVector] = Seq(
+    LabeledVector(1, DenseVector(-0.1369104662767052, 0.2949172396037093, -0.01070450818187003)),
+    LabeledVector(2, DenseVector(0.43643950041582885, 0.30117329671833215, 0.20965108353159922)),
+    LabeledVector(3, DenseVector(0.26011627041438423, 0.22954649683337805, 0.2936286262276151)),
+    LabeledVector(4, DenseVector(-0.041980932305508145, 0.03116256923634109, 0.31065743174542293)),
+    LabeledVector(5, DenseVector(0.0984398491976613, -0.21227718242541602, -0.45083084300074255)),
+    LabeledVector(6, DenseVector(-0.216526923545, -0.47142840804338293, -0.02298954070830948)),
+    LabeledVector(7, DenseVector(-0.0632307695567563, 0.2387221400443612, 0.09416850805771804)),
+    LabeledVector(8, DenseVector(0.16383680898916775, -0.24586810465119346, 0.08783590589294081)),
+    LabeledVector(9, DenseVector(-0.24763544645492513, 0.19688995732231254, 0.4520904742796472)),
+    LabeledVector(10, DenseVector(0.16468044138881932, 0.06259522206982082, 0.12145870313604247))
+
+  )
+
+  /*
+   * 3-dimensional DenseVectors from a part of the Cosmo-Gas dataset
+   * Reference: http://nuage.cs.washington.edu/benchmark/
+   */
+  val trainingData: Seq[Vector] = Seq(
+    DenseVector(-0.489811986685, 0.496883004904, -0.483860999346),
+    DenseVector(-0.485296010971, 0.496421992779, -0.484212994576),
+    DenseVector(-0.481514006853, 0.496134012938, -0.48508900404),
+    DenseVector(-0.47854255, 0.496246010065, -0.486301004887),
+    DenseVector(-0.475461006165, 0.496093004942, -0.487686008215),
+    DenseVector(-0.471846997738, 0.496558994055, -0.488242000341),
+    DenseVector(-0.467496991158, 0.497166007757, -0.48861899972),
+    DenseVector(-0.463036000729, 0.497680991888, -0.489721000195),
+    DenseVector(-0.458972990513, 0.4984369874, -0.490575999022),
+    DenseVector(-0.455772012472, 0.499684005976, -0.491737008095),
+    DenseVector(-0.453074991703, -0.499433010817, -0.492006987333),
+    DenseVector(-0.450913995504, -0.499316990376, -0.492769002914),
+    DenseVector(-0.448724985123, -0.499406009912, -0.493508011103),
+    DenseVector(-0.44715899229, -0.499680995941, -0.494500011206),
+    DenseVector(-0.445362001657, -0.499630987644, -0.495151996613),
+    DenseVector(-0.442811012268, -0.499303996563, -0.495151013136),
+    DenseVector(-0.439810991287, -0.499332994223, -0.49529799819),
+    DenseVector(-0.43678098917, -0.499361991882, -0.49545699358),
+    DenseVector(-0.433919012547, -0.499334007502, -0.495705991983),
+    DenseVector(-0.43117800355, -0.499345004559, -0.496196985245),
+    DenseVector(-0.428333997726, -0.499083012342, -0.496385991573),
+    DenseVector(-0.425300985575, -0.49844199419, -0.496405988932),
+    DenseVector(-0.421882003546, -0.497743010521, -0.496706992388),
+DenseVec

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605612#comment-14605612
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33463484
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/clustering/Clustering.scala
 ---
@@ -0,0 +1,256 @@
+/*
--- End diff --

Rename to ClusteringData.scala




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605607#comment-14605607
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33463192
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
@@ -0,0 +1,247 @@
  

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605598#comment-14605598
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33462286
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
@@ -0,0 +1,247 @@
  

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605595#comment-14605595
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33462036
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.common.functions.RichMapFunction
+import 
org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.Vector
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+import scala.collection.JavaConverters._
+
+
+/**
+ * Implements the KMeans algorithm, which calculates cluster centroids based on a set of training
+ * data points and a set of k initial centroids.
+ *
+ * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of data points and can then be
+ * used to assign new points to the learned cluster centroids.
+ *
+ * The KMeans algorithm works as described on Wikipedia
+ * (http://en.wikipedia.org/wiki/K-means_clustering):
+ *
+ * Given an initial set of k means `m^{(1)}_1, …, m^{(1)}_k` (see below), the algorithm proceeds
+ * by alternating between two steps:
+ *
+ * ===Assignment step:===
+ *
+ * Assign each observation to the cluster whose mean yields the least within-cluster sum of
+ * squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively
+ * the "nearest" mean. (Mathematically, this means partitioning the observations according to the
+ * Voronoi diagram generated by the means.)
+ *
+ * `S^{(t)}_i = { x_p : || x_p - m^{(t)}_i ||^2 ≤ || x_p - m^{(t)}_j ||^2 \forall j, 1 ≤ j ≤ k }`,
+ * where each `x_p` is assigned to exactly one `S^{(t)}_i`, even if it could be assigned to two or
+ * more of them.
+ *
+ * ===Update step:===
+ *
+ * Calculate the new means to be the centroids of the observations in the new clusters.
+ *
+ * `m^{(t+1)}_i = ( 1 / |S^{(t)}_i| ) \sum_{x_j \in S^{(t)}_i} x_j`
+ *
+ * Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
+ * sum of squares (WCSS) objective.
+ *
+ * @example
+ * {{{
+ *  val trainingDS: DataSet[Vector] = env.fromCollection(Clustering.trainingData)
+ *  val initialCentroids: DataSet[LabeledVector] = env.fromCollection(Clustering.initCentroids)
+ *
+ *  val kmeans = KMeans()
+ *    .setInitialCentroids(initialCentroids)
+ *    .setNumIterations(10)
+ *
+ *  kmeans.fit(trainingDS)
+ *
+ *  // getting the computed centroids
+ *  val centroidsResult = kmeans.centroids.get.collect()
+ *
+ *  // get matching clusters for new points
+ *  val testDS: DataSet[Vector] = env.fromCollection(Clustering.testData)
+ *  val clusters: DataSet[LabeledVector] = kmeans.predict(testDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.NumIterations]]:
+ * Defines the number of iterations used to recalculate the centroids of the clusters. As it is
+ * a heuristic algorithm, there is no guarantee that it will converge to the global optimum. The
+ * recalculation of the centroids and the reassignment of the data points are repeated until the
+ * given number of iterations is reached.
+ * (Default value: '''10''')
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.InitialCentroids]]:
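
The two alternating steps described in the Scaladoc above can be sketched in plain Scala, without Flink. This is only an illustration under stated assumptions, not the code under review: `LloydSketch`, `nearest`, `update` and `fit` are hypothetical names, points are Scala's built-in `Vector[Double]` rather than Flink's `Vector`, and a centroid that attracts no points is simply kept unchanged.

```scala
object LloydSketch {
  // Plain Scala vector of coordinates (not org.apache.flink.ml.math.Vector).
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Assignment step: index of the centroid with the smallest squared distance.
  def nearest(centroids: Seq[Point], p: Point): Int =
    centroids.indices.minBy(i => squaredDistance(centroids(i), p))

  // Update step: every centroid becomes the arithmetic mean of its assigned points.
  // Assumption of this sketch: a centroid without assigned points stays where it is.
  def update(centroids: Seq[Point], points: Seq[Point]): Seq[Point] = {
    val clusters = points.groupBy(p => nearest(centroids, p))
    centroids.indices.map { i =>
      clusters.get(i) match {
        case Some(members) => members.transpose.map(_.sum / members.size).toVector
        case None          => centroids(i)
      }
    }
  }

  // Runs a fixed number of assignment/update rounds, mirroring a NumIterations-style parameter.
  def fit(initial: Seq[Point], points: Seq[Point], numIterations: Int): Seq[Point] =
    (1 to numIterations).foldLeft(initial)((current, _) => update(current, points))
}
```

Running a fixed number of rounds, rather than testing for convergence, matches the iteration-count parameter documented above; a convergence check would be a separate design choice.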
  

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605585#comment-14605585
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33461173
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.common.functions.RichMapFunction
+import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.Vector
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+import scala.collection.JavaConverters._
+
+
+/**
+ * Implements the KMeans algorithm, which calculates cluster centroids based on a set of training
+ * data points and a set of k initial centroids.
+ *
+ * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of data points and can then be
+ * used to assign new points to the learned cluster centroids.
+ *
+ * The KMeans algorithm works as described on Wikipedia
+ * (http://en.wikipedia.org/wiki/K-means_clustering):
+ *
+ * Given an initial set of k means `m^{(1)}_1, …, m^{(1)}_k` (see below), the algorithm proceeds
+ * by alternating between two steps:
+ *
+ * ===Assignment step:===
+ *
+ * Assign each observation to the cluster whose mean yields the least within-cluster sum of
+ * squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively
+ * the "nearest" mean. (Mathematically, this means partitioning the observations according to the
+ * Voronoi diagram generated by the means.)
+ *
+ * `S^{(t)}_i = { x_p : || x_p - m^{(t)}_i ||^2 ≤ || x_p - m^{(t)}_j ||^2 \forall j, 1 ≤ j ≤ k }`,
+ * where each `x_p` is assigned to exactly one `S^{(t)}_i`, even if it could be assigned to two or
+ * more of them.
+ *
+ * ===Update step:===
+ *
+ * Calculate the new means to be the centroids of the observations in the new clusters.
+ *
+ * `m^{(t+1)}_i = ( 1 / |S^{(t)}_i| ) \sum_{x_j \in S^{(t)}_i} x_j`
+ *
+ * Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
+ * sum of squares (WCSS) objective.
+ *
+ * @example
+ * {{{
+ *  val trainingDS: DataSet[Vector] = env.fromCollection(Clustering.trainingData)
+ *  val initialCentroids: DataSet[LabeledVector] = env.fromCollection(Clustering.initCentroids)
+ *
+ *  val kmeans = KMeans()
+ *    .setInitialCentroids(initialCentroids)
+ *    .setNumIterations(10)
+ *
+ *  kmeans.fit(trainingDS)
+ *
+ *  // getting the computed centroids
+ *  val centroidsResult = kmeans.centroids.get.collect()
+ *
+ *  // get matching clusters for new points
+ *  val testDS: DataSet[Vector] = env.fromCollection(Clustering.testData)
+ *  val clusters: DataSet[LabeledVector] = kmeans.predict(testDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.NumIterations]]:
+ * Defines the number of iterations used to recalculate the centroids of the clusters. As it is
+ * a heuristic algorithm, there is no guarantee that it will converge to the global optimum. The
+ * recalculation of the centroids and the reassignment of the data points are repeated until the
+ * given number of iterations is reached.
+ * (Default value: '''10''')
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.InitialCentroids]]:
  

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605573#comment-14605573
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33460529
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605571#comment-14605571
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on a diff in the pull request:

https://github.com/apache/flink/pull/700#discussion_r33460336
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600798#comment-14600798
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-115133956
  
Thanks, seems like all is fine now. We will start reviewing this in the 
next few days.


> Add kMeans clustering algorithm to machine learning library
> ---
>
> Key: FLINK-1731
> URL: https://issues.apache.org/jira/browse/FLINK-1731
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Peter Schrott
>  Labels: ML
>
> The Flink repository already contains a kMeans implementation but it is not 
> yet ported to the machine learning library. I assume that only the used data 
> types have to be adapted and then it can be more or less directly moved to 
> flink-ml.
> The kMeans++ [1] and kMeans|| [2] algorithms constitute a better 
> implementation because they improve the initial seeding phase to achieve 
> near-optimal clustering. It might be worthwhile to implement kMeans||.
> Resources:
> [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
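
For context, the seeding idea behind kMeans++ [1] can be sketched as follows: the first centroid is drawn uniformly at random, and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. This is a minimal, single-machine illustration and not part of any pull request discussed here; `KMeansPlusPlusSketch` and `seed` are hypothetical names, and the uniform fallback when all weights are zero is an assumption of this sketch.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Chooses k initial centroids from `points` using squared-distance weighted sampling.
  def seed(points: IndexedSeq[Point], k: Int, rng: Random): Vector[Point] = {
    require(k >= 1 && k <= points.size, "k must be between 1 and the number of points")
    // First centroid: uniform random choice.
    var centroids = Vector(points(rng.nextInt(points.size)))
    while (centroids.size < k) {
      // Weight of each point: squared distance to its nearest already chosen centroid.
      val weights = points.map(p => centroids.map(c => squaredDistance(p, c)).min)
      val total = weights.sum
      val next =
        if (total == 0.0) {
          // Assumption: every point coincides with a centroid, so fall back to a uniform draw.
          points(rng.nextInt(points.size))
        } else {
          // Draw a target in [0, total) and pick the first point whose cumulative weight exceeds it.
          val target = rng.nextDouble() * total
          val cumulative = weights.scanLeft(0.0)(_ + _).tail
          val idx = cumulative.indexWhere(_ > target)
          points(if (idx >= 0) idx else points.size - 1)
        }
      centroids :+= next
    }
    centroids
  }
}
```

Because already chosen points have weight zero, they are not selected again as long as some point has a positive weight. The kMeans|| variant [2] draws several candidates per round so that the sampling can be parallelised; that is not shown here.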



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599614#comment-14599614
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user FGoessler commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-114921879
  
Just rebased and force pushed -> hoping for good Travis results :smiley: 




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595995#comment-14595995
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-114135260
  
Hello @peedeeX21, most of the failing Travis tests have been fixed in the 
current master. Could you try rebasing this PR and force-pushing to this 
branch?




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580345#comment-14580345
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-110681501
  
@tillrohrmann great. no worries. was just not sure what is going on. :) 
good luck with the new release!




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580249#comment-14580249
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-110658920
  
Will do @peedeeX21. Currently I'm busy with the upcoming release, but once 
we're done with it, I'll work on this PR.




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580136#comment-14580136
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user peedeeX21 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-110621613
  
@tillrohrmann 
Would you please help me out with that pending pull request? 




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-05 Thread Peter Schrott (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574273#comment-14574273
 ] 

Peter Schrott commented on FLINK-1731:
--

[~till.rohrmann] I am not entirely sure we are talking about the same thing. In 
our opinion the Travis failure is not related to our changes. 
Or do you mean that I should make Travis run on my repository to see whether the 
problem still exists?
If so, I just need to push something to my repository, right? But I don't have 
any changes to make.
- Thanks, Peter



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-04 Thread Peter Schrott (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572577#comment-14572577
 ] 

Peter Schrott commented on FLINK-1731:
--

Hi [~fobeligi],

sure you can pull it from my repository. I am glad someone uses my software. :)

Peter



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-04 Thread Faye Beligianni (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572475#comment-14572475
 ] 

Faye Beligianni commented on FLINK-1731:


Hello [~peedeeX21], 

I am interested in using your implementation of the k-Means algorithm for some 
experiments that I am going to run for my master's thesis.
If that's not a problem, I could pull it from your repository. 

Thank you, 
Faye
 



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-04 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572329#comment-14572329
 ] 

Till Rohrmann commented on FLINK-1731:
--

You can enable Travis support [1] for your repository. Then whenever you push 
something to your repo, Travis will trigger a new build.

[1] https://travis-ci.org/



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570676#comment-14570676
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user FGoessler commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-108310843
  
The Travis build is failing on Oracle JDK 8. According to the build log, 
Maven or Flink is hanging. Can anyone help or at least restart the build? 
Are there any known "flipping tests"? Imo the failure isn't related to our 
changes.




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568664#comment-14568664
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-107831123
  
Hey guys. You might wanna look at the initialization schemes here: 
https://github.com/apache/flink/pull/757




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-06-01 Thread Sachin Goel (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568392#comment-14568392
 ] 

Sachin Goel commented on FLINK-1731:


I'm creating a separate issue for initialization schemes. It would address 
the random, kmeans++ and kmeans|| initialization methods. Since any 
initialization is itself a solution to the kmeans problem, they would all be 
instances of Predictor as well. The user can access the learned centroids via 
instance.centroids and pass them to the KMeans algorithm which has been 
implemented. 
There is another possible way, which takes the burden of figuring out how to 
pass the initial centroids to KMeans off the user. We can have a parameter which 
signifies which initialization scheme to use. The KMeans algorithm would then 
need to call the appropriate initialization scheme in its fit function and use 
the centroids found by that scheme as its initial centroids.
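
A minimal sketch of the second option, assuming plain Scala collections in 
place of Flink DataSets. All names here (InitScheme, RandomInit, 
KMeansPlusPlusInit, InitSketch) are hypothetical and not part of flink-ml; 
kmeans|| is left out to keep the sketch short.

```scala
import scala.util.Random

// Hypothetical names throughout; none of this is the flink-ml API.
sealed trait InitScheme
case object RandomInit extends InitScheme
case object KMeansPlusPlusInit extends InitScheme

object InitSketch {
  type Point = Vector[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sample k input points uniformly at random, without replacement.
  def randomInit(points: Seq[Point], k: Int, rnd: Random): Seq[Point] =
    rnd.shuffle(points).take(k)

  // kmeans++ seeding: every further centroid is sampled with probability
  // proportional to its squared distance to the nearest centroid chosen so far.
  def plusPlusInit(points: Seq[Point], k: Int, rnd: Random): Seq[Point] = {
    val first = points(rnd.nextInt(points.size))
    (1 until k).foldLeft(Vector(first)) { (chosen, _) =>
      val weights = points.map(p => chosen.map(sqDist(p, _)).min)
      val target = rnd.nextDouble() * weights.sum
      // Index of the first point whose cumulative weight exceeds the target.
      val cumulative = weights.scanLeft(0.0)(_ + _).tail
      val idx = cumulative.indexWhere(_ > target)
      chosen :+ points(if (idx >= 0) idx else points.size - 1)
    }
  }

  // The scheme is a parameter, so the caller never handles centroids directly.
  def initialCentroids(points: Seq[Point], k: Int,
                       scheme: InitScheme, seed: Long): Seq[Point] = {
    val rnd = new Random(seed)
    scheme match {
      case RandomInit         => randomInit(points, k, rnd)
      case KMeansPlusPlusInit => plusPlusInit(points, k, rnd)
    }
  }
}
```

A fit function could then call InitSketch.initialCentroids with the configured 
scheme before starting the iteration.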



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-31 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566606#comment-14566606
 ] 

Till Rohrmann commented on FLINK-1731:
--

{{fromBreeze[org.apache.flink.ml.math.Vector]}} should solve the problem
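
A rough illustration of why an explicit type argument helps, using stand-in 
types rather than the real flink-ml classes. Vec, Dense, Sparse, Converter and 
fromRaw are hypothetical names that only mimic the shape of 
BreezeVectorConverter and fromBreeze.

```scala
// Stand-in types that only mimic the shape of the problem; these are
// NOT the real flink-ml or Breeze classes.
sealed trait Vec { def values: Seq[Double] }
final case class Dense(values: Seq[Double]) extends Vec
final case class Sparse(values: Seq[Double]) extends Vec

// One converter instance per target type, looked up implicitly.
trait Converter[T <: Vec] { def convert(raw: Seq[Double]): T }

object Converter {
  implicit val denseConverter: Converter[Dense] =
    new Converter[Dense] { def convert(raw: Seq[Double]): Dense = Dense(raw) }
  implicit val sparseConverter: Converter[Sparse] =
    new Converter[Sparse] { def convert(raw: Seq[Double]): Sparse = Sparse(raw) }
  implicit val vecConverter: Converter[Vec] =
    new Converter[Vec] { def convert(raw: Seq[Double]): Vec = Dense(raw) }
}

object ConvertSyntax {
  implicit class RawOps(raw: Seq[Double]) {
    def fromRaw[T <: Vec](implicit c: Converter[T]): T = c.convert(raw)
  }
}

object AmbiguitySketch {
  import ConvertSyntax._

  val raw: Seq[Double] = Seq(1.0, 2.0)

  // With neither a type argument nor an expected type, T is unconstrained and
  // every converter is eligible, so the compiler is expected to report
  // "ambiguous implicit values":
  //   val broken = (0, raw.fromRaw, 1L)

  // Naming the target type leaves exactly one matching converter.
  val fixed: (Int, Vec, Long) = (0, raw.fromRaw[Vec], 1L)
}
```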



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-31 Thread Alexander Alexandrov (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566564#comment-14566564
 ] 

Alexander Alexandrov commented on FLINK-1731:
-

[~till.rohrmann] I couldn't find it either. I think we were discussing doing 
K-Means|| as a separate issue.

Florian Gößler also reported the following issue when he tried to rebase:

{{{
Error:(200, 75) ambiguous implicit values:
 both value denseVectorConverter in object BreezeVectorConverter of type => org.apache.flink.ml.math.BreezeVectorConverter[org.apache.flink.ml.math.DenseVector]
 and value sparseVectorConverter in object BreezeVectorConverter of type => org.apache.flink.ml.math.BreezeVectorConverter[org.apache.flink.ml.math.SparseVector]
 match expected type org.apache.flink.ml.math.BreezeVectorConverter[T]
.reduce((p1, p2) => (p1._1, (p1._2.asBreeze + p2._2.asBreeze).fromBreeze, p1._3 + p2._3))
}}}

Any idea what might be the cause?



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-31 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566558#comment-14566558
 ] 

Till Rohrmann commented on FLINK-1731:
--

Sorry, I haven't had much time lately to review the PR. I'll try to spend some 
time on it. We recently did a major rewrite of the pipeline mechanism, which 
makes it necessary to rebase the PR. See 
[http://ci.apache.org/projects/flink/flink-docs-master/libs/ml/pipelines.html] 
for further information.

What were [~peedeeX21]'s concerns? I could not find them.

[~sachingoel0101], as far as I know, there is no one working on kmeans||. You 
can take the lead.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-30 Thread Sachin Goel (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14566346#comment-14566346
 ] 

Sachin Goel commented on FLINK-1731:


Is anyone currently working on the kmeans|| implementation? I just finished 
working on it in an OpenMP framework and could easily implement it in a 
Map-Reduce fashion. Besides, it is considered better than the original kmeans++ 
because it requires fewer passes over the data while providing essentially the 
same approximation guarantees as kmeans++.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565275#comment-14565275
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

Github user aalexandrov commented on the pull request:

https://github.com/apache/flink/pull/700#issuecomment-106911412
  
Can anybody with more Apache insight respond to @peedeeX21's concerns? 
Otherwise I suggest merging this and opening a follow-up issue that extends 
the current implementation to KMeans++. 




[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552243#comment-14552243
 ] 

ASF GitHub Bot commented on FLINK-1731:
---

GitHub user peedeeX21 opened a pull request:

https://github.com/apache/flink/pull/700

[FLINK-1731] [ml] Implementation of Feature K-Means and Test Suite

K-Means and a corresponding test suite were implemented as part of the 
IMPRO-3 warm-up task.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/peedeeX21/flink feature_kmeans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/700.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #700


commit 02fe6b2c7ebc6bf4b55e832681286994b03c4d40
Author: Florian Goessler 
Date:   2015-05-20T09:12:20Z

[FLINK-1731] [ml] unit test for KMeans

commit 71aa47bd06ad2e051749ea1b9df923b8eb5bf6e4
Author: Peter Schrott 
Date:   2015-05-20T11:08:36Z

[FLINK-1731] [ml] Implementation of K-Means






[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-20 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552154#comment-14552154
 ] 

Till Rohrmann commented on FLINK-1731:
--

Sure. Maybe in this version we can add a random centroid-picking strategy
for the case where no centroids are provided.






[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-20 Thread Alexander Alexandrov (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14552113#comment-14552113
 ] 

Alexander Alexandrov commented on FLINK-1731:
-

[~till.rohrmann] Can we merge this with the current set of features and then 
add the automatic picking of the initial centroids in another issue?



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-18 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548041#comment-14548041
 ] 

Till Rohrmann commented on FLINK-1731:
--

I think it is a good idea to pass the initial centroids as a parameter to the 
algorithm. But for the case that the user does not provide initial centroids, 
can't the algorithm try to select some? Maybe [this 
paper|http://airccse.org/journal/jcsit/1011csit13.pdf] helps. 



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-18 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548038#comment-14548038
 ] 

Till Rohrmann commented on FLINK-1731:
--

The problem is that you try to divide a vector of doubles by a long value. This 
is not supported by Breeze. If you change {{1L}} to {{1.0}} it should work.
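
The averaging step this refers to can be sketched without Flink or Breeze, 
assuming plain Scala collections as stand-ins for DataSet and the Breeze 
vector. CentroidUpdateSketch and its helpers are hypothetical names. Plain 
Scala would widen a Long divisor on its own, so the explicit conversion below 
only mirrors the change that, per the comment above, Breeze requires.

```scala
object CentroidUpdateSketch {
  type Point = Vector[Double]

  def add(a: Point, b: Point): Point = a.zip(b).map { case (x, y) => x + y }

  // Recompute each centroid as the mean of the points assigned to it.
  // `assigned` pairs a centroid id with a point (hypothetical representation).
  def update(assigned: Seq[(Int, Point)]): Map[Int, Point] =
    assigned
      .map { case (id, p) => (id, p, 1L) } // attach a Long count, as in the thread
      .groupBy(_._1)
      .map { case (id, group) =>
        val (sum, count) = group
          .map { case (_, p, n) => (p, n) }
          .reduce((l, r) => (add(l._1, r._1), l._2 + r._2))
        // Divide by a Double, not by the Long count itself.
        id -> sum.map(_ / count.toDouble)
      }
}
```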



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-18 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548026#comment-14548026
 ] 

Till Rohrmann commented on FLINK-1731:
--

The thing is that you need an {{ExecutionEnvironment}} in order to convert a 
collection into a {{DataSet}}. If you have a program with different 
{{ExecutionEnvironments}}, then the algorithm does not know which one to take. 
Therefore, I'd simply go with a {{setCentroids(DataSet[A])}} method.
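
To illustrate the reasoning, here is a small sketch with stand-in types. Env, 
DataSet and KMeansSketch are hypothetical simplifications and not the Flink 
classes. It assumes only that a data set is always tied to the environment 
that created it, while a plain collection is not.

```scala
// Minimal stand-ins; the real Flink ExecutionEnvironment and DataSet
// are far richer than this.
final class Env(val name: String) {
  def fromCollection[A](items: Seq[A]): DataSet[A] = new DataSet(this, items)
}
final class DataSet[A](val env: Env, val items: Seq[A])

// Hypothetical estimator; only the centroid setters are sketched.
final class KMeansSketch[A] {
  private var centroids: Option[DataSet[A]] = None

  // A DataSet already belongs to an environment, so nothing has to be guessed.
  def setCentroids(ds: DataSet[A]): this.type = { centroids = Some(ds); this }

  // A plain collection carries no environment, so the caller must name one.
  def setCentroids(seq: Seq[A], env: Env): this.type =
    setCentroids(env.fromCollection(seq))

  def centroidEnv: Option[String] = centroids.map(_.env.name)
}
```

With two environments in one program, the Seq overload only works because the 
environment is passed explicitly; the DataSet setter needs no such argument.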



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-17 Thread Hae Joon Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547149#comment-14547149
 ] 

Hae Joon Lee commented on FLINK-1731:
-

Hi, I am testing K-Means right now.
I ran into the error "could not find implicit value for parameter".
I have solved a lot of other problems, but not this one.

* Error:(142, 53) could not find implicit value for parameter op: 
breeze.linalg.operators.OpDiv.Impl2[breeze.linalg.Vector[Double],Long,That]
.map(x => LabeledVector(x._1, x._2.asBreeze / 1L)).withForwardedFields("_1->id")

{code:title=KMeans.scala|borderStyle=solid}
val finalCentroids = centroids.iterate(numIterations) { currentCentroids =>
  val newCentroids: DataSet[LabeledVector] = input
    .map(new SelectNearestCenterMapper).withBroadcastSet(currentCentroids, CENTROIDS)
    .map(x => (x.label, x.vector, 1L)).withForwardedFields("_1; _2")
    .groupBy(x => x._1)
    .reduce((p1, p2) => (p1._1, (p1._2.asBreeze + p2._2.asBreeze).fromBreeze, p1._3 + p2._3)).withForwardedFields("_1")
    .map(x => LabeledVector(x._1, x._2.asBreeze :/ x._3)).withForwardedFields("_1->id")
  newCentroids
}
{code}

As far as I know, the error "could not find implicit value for parameter" can 
usually be solved by adding the right import.

I also added 'import breeze.linalg.operators._' to the imports, but it does 
not help.
Has anyone seen this kind of error before?





[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-14 Thread Theodore Vasiloudis (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544164#comment-14544164
 ] 

Theodore Vasiloudis commented on FLINK-1731:


Yeah that might be the better option. The optimization framework is more 
developer oriented, but since Kmeans is mostly aimed at practitioners it would 
be better to abstract away the complexity.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-14 Thread Alexander Alexandrov (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543734#comment-14543734
 ] 

Alexander Alexandrov commented on FLINK-1731:
-

I would go with a {{DataSet}} for the centroids as well. That said, we can 
reduce syntax on the client side by providing either

- an implicit conversion {{Seq\[A\] => DataSet\[A\]}} (needs to be part of 
the Flink Scala API, could be already there), or
- an overloaded {{setCentroids(Seq\[A\])}} setter.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-14 Thread Theodore Vasiloudis (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543711#comment-14543711
 ] 

Theodore Vasiloudis commented on FLINK-1731:


Since the centroids will have to be broadcast to all task managers, they will 
have to be placed inside a DataSet eventually.

One approach is to accept a sequence, which is then converted into a DataSet 
inside the algorithm; the other is to require that the user provides a DataSet 
as a parameter.

In GradientDescent we use the second option, i.e. we expect a DataSet of 
weights. You can do the same with centroids.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-14 Thread Peter Schrott (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543670#comment-14543670
 ] 

Peter Schrott commented on FLINK-1731:
--

Hi Flink people,

now that we have figured out how to pass in the initial centroids (via 
ParameterMap), there is still the open question of whether we should use a 
Sequence or a DataSet.
As I mentioned before, I am not sure about the side effects on parallelism 
when using the DataSet type.

Thanks for any advice.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-14 Thread Peter Schrott (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543648#comment-14543648
 ] 

Peter Schrott commented on FLINK-1731:
--

Great! Thanks!

> Add kMeans clustering algorithm to machine learning library
> ---
>
> Key: FLINK-1731
> URL: https://issues.apache.org/jira/browse/FLINK-1731
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Alexander Alexandrov
>  Labels: ML
>
> The Flink repository already contains a kMeans implementation but it is not 
> yet ported to the machine learning library. I assume that only the used data 
> types have to be adapted and then it can be more or less directly moved to 
> flink-ml.
> The kMeans++ [1] and the kMeans|| [2] algorithms constitute a better 
> implementation because they improve the initial seeding phase to achieve near 
> optimal clustering. It might be worthwhile to implement kMeans||.
> Resources:
> [1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> [2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-14 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543647#comment-14543647
 ] 

Robert Metzger commented on FLINK-1731:
---

[~aalexandrov]: only users with "Contributor" permissions can be assigned to 
issues.
I made [~peedeeX21] a contributor and assigned him.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-14 Thread Alexander Alexandrov (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543643#comment-14543643
 ] 

Alexander Alexandrov commented on FLINK-1731:
-

[~peedeeX21] for some reason I cannot assign this to you directly. I cleared 
the assignee field so you can assign the issue to yourself. 



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-13 Thread Peter Schrott (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541818#comment-14541818
 ] 

Peter Schrott commented on FLINK-1731:
--

As discussed with [~aalexandrov], the initial centroids are passed to KMeans 
via the parameter map. For convenience, a fluent syntax is added.
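
A minimal sketch of what such a fluent syntax could look like, in plain Scala without any Flink dependency. The class name `KMeans`, the setters `setInitialCentroids` and `setNumIterations`, and the `parameters` accessor are hypothetical names chosen for illustration; they are not taken from the actual flink-ml code.

```scala
object FluentKMeansSketch {
  type Point = Array[Double]

  // Hypothetical learner: each setter stores a value and returns `this`,
  // so that calls can be chained.
  final class KMeans {
    private var initialCentroids: Seq[Point] = Seq.empty
    private var numIterations: Int = 10

    def setInitialCentroids(centroids: Seq[Point]): KMeans = {
      initialCentroids = centroids
      this
    }

    def setNumIterations(n: Int): KMeans = {
      require(n > 0, "numIterations must be positive")
      numIterations = n
      this
    }

    // Read-only view of the configured values (stand-in for a parameter map).
    def parameters: Map[String, Any] =
      Map("initialCentroids" -> initialCentroids, "numIterations" -> numIterations)
  }
}
```

With this shape, configuration reads as one chained expression, for example `new FluentKMeansSketch.KMeans().setInitialCentroids(Seq(Array(0.0, 0.0))).setNumIterations(20)`.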



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-13 Thread Alexander Alexandrov (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541570#comment-14541570
 ] 

Alexander Alexandrov commented on FLINK-1731:
-

I suggest trying to add the initial centroids as a proper parameter rather than 
as part of the ParameterMap, since they are an actual input to the algorithm (as 
opposed to an algorithm hyper-parameter).
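
A rough sketch of this separation, in plain Scala on local collections rather than Flink DataSets: the training data and the initial centroids are explicit arguments of `fit`, while the iteration count lives in a small parameter holder. The names `KMeansSketch`, `Params` and `numIterations` are assumptions made for illustration, and the loop is the standard Lloyd iteration, not the flink-ml implementation.

```scala
object KMeansSketch {
  type Point = Array[Double]

  // Hyper-parameters only; a hypothetical stand-in for a parameter map.
  final case class Params(numIterations: Int = 10)

  private def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to p.
  private def nearest(p: Point, centroids: IndexedSeq[Point]): Int =
    centroids.indices.minBy(i => squaredDistance(p, centroids(i)))

  // Both data inputs are explicit arguments; tuning knobs live in Params.
  def fit(training: Seq[Point],
          initialCentroids: IndexedSeq[Point],
          params: Params): IndexedSeq[Point] = {
    require(initialCentroids.nonEmpty, "need at least one initial centroid")
    (1 to params.numIterations).foldLeft(initialCentroids) { (centroids, _) =>
      // Assignment step: group the points by their nearest centroid.
      val clusters = training.groupBy(p => nearest(p, centroids))
      // Update step: move each centroid to the mean of its assigned points.
      centroids.indices.map { i =>
        clusters.get(i) match {
          case Some(members) =>
            val dim = members.head.length
            Array.tabulate(dim)(d => members.map(_(d)).sum / members.size)
          // A centroid with no assigned points keeps its position.
          case None => centroids(i)
        }
      }
    }
  }
}
```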



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-13 Thread Peter Schrott (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541516#comment-14541516
 ] 

Peter Schrott commented on FLINK-1731:
--

Hi [~chiwanpark],

the thing is, to fit the model, KMeans uses two datasets: the training data 
and the initial centroids. This means the {{fit}} method would have to take 
two arguments at that point. This is why I suggested using the parameter map 
to pass the initial centroids. The training dataset will be passed as the 
argument to the {{fit}} method, as in the CoCoA implementation.





[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-12 Thread Chiwan Park (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541211#comment-14541211
 ] 

Chiwan Park commented on FLINK-1731:


Hello, [~peedeeX21].

I think you can pass the initial centroids like {{fit(centroids: 
DataSet\[Vector\], fitParameters: ParameterMap)}}. The fit method means that 
the Learner creates a model and fits it to the given input (in this case, the 
centroids).

The created model (named something like {{KMeansModel}}) then decides the 
cluster of other points. With this approach, it is better to pass the initial 
centroids as a DataSet.

You can see this approach in the CoCoA implementation. 
(https://github.com/apache/flink/blob/master/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/classification/CoCoA.scala)

I hope these comments help you.
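
The role of the fitted model can be sketched in a few lines of plain Scala. `KMeansModel` is the name suggested above; the `predict` method and the use of local arrays instead of a DataSet are assumptions made only for this illustration.

```scala
object KMeansModelSketch {
  type Point = Array[Double]

  // Result of fitting: holds the final centroids and labels new points.
  final case class KMeansModel(centroids: IndexedSeq[Point]) {
    require(centroids.nonEmpty, "a model needs at least one centroid")

    // Returns the index of the centroid closest to the point
    // (by squared Euclidean distance).
    def predict(point: Point): Int =
      centroids.indices.minBy { i =>
        centroids(i).zip(point).map { case (c, x) => (c - x) * (c - x) }.sum
      }
  }
}
```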



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-12 Thread Peter Schrott (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14540465#comment-14540465
 ] 

Peter Schrott commented on FLINK-1731:
--

Very nice. The implementation with BreezeVector and EuclideanDistanceMetrics 
works out just fine. Thanks for the support on that.

There are two more open questions:
1) How should the initial centroids be passed to the algorithm? We implemented 
KMeans as a derivative of Learner. As there is only one argument to pass 
(the dataset), should we set the initial centroids as a parameter? (We do the 
same for the number of iterations.)
2) Should the initial centroids be passed as a DataSet or a Seq? Are there any 
side effects regarding parallelism when using the DataSet type? 



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-12 Thread Hae Joon Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14539938#comment-14539938
 ] 

Hae Joon Lee commented on FLINK-1731:
-

To load the input `Points` in the fit function for `BreezeVector`, should we 
use input: DataSet[LabeledVector]?
I implemented the input dataset as trainingData = Seq[DenseVector] 
(DenseVector(-0.489811986685, 0.496883004904, -0.483860999346) ... )

In the case of K-means, the datatype of `Centroids` can be LabeledVector 
because each centroid has a centroid number, but the datatype of `Points` does 
not have to be LabeledVector, since a point only consists of its coordinates.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-07 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532722#comment-14532722
 ] 

Till Rohrmann commented on FLINK-1731:
--

I just merged the PR with the distance metrics. You can use it now.

At the moment I think it's easier to simply use the {{BreezeVector}} if you want 
to do vector manipulations.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-07 Thread Chiwan Park (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532636#comment-14532636
 ] 

Chiwan Park commented on FLINK-1731:


Hi, [~peedeeX21]. There are implementations of distance measures between two 
vectors, including Euclidean distance, in FLINK-1933. I have sent a 
[PR|https://github.com/apache/flink/pull/629] and it will be merged soon. You 
can use it for calculating the distance between each pair of data points.



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-07 Thread Peter Schrott (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532619#comment-14532619
 ] 

Peter Schrott commented on FLINK-1731:
--

For the implementation of the KMeans algorithm, some basic operations are 
missing from org.apache.flink.ml.math.Vector (add, euclideanDistance, div). 
Should we implement these, or is it preferable to use BreezeVector? 
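
For reference, the three operations in question are small. The sketch below shows them on plain `Array[Double]` values; it is an illustration only and assumes nothing about the `org.apache.flink.ml.math.Vector` or Breeze APIs. The object name `VectorOps` is hypothetical.

```scala
object VectorOps {
  // Element-wise sum of two vectors of equal length.
  def add(a: Array[Double], b: Array[Double]): Array[Double] = {
    require(a.length == b.length, "vectors must have the same dimension")
    a.zip(b).map { case (x, y) => x + y }
  }

  // Divide every component by a scalar (turns a sum of points into a mean).
  def div(a: Array[Double], s: Double): Array[Double] = {
    require(s != 0.0, "cannot divide by zero")
    a.map(_ / s)
  }

  // Euclidean (L2) distance between two vectors of equal length.
  def euclideanDistance(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }
}
```

In k-means, `add` and `div` together compute a new centroid as the mean of its assigned points, and `euclideanDistance` drives the assignment of points to centroids.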



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-07 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532215#comment-14532215
 ] 

Till Rohrmann commented on FLINK-1731:
--

Great to hear that you guys will pick up this task. There are also the kMeans++ 
[1] and kMeans|| [2] algorithms, which improve the initial seeding to reach 
near optimal clustering. If you want to, you can also implement these 
algorithms.

[1] http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
[2] http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
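
As an illustration of the seeding idea in [1], here is a minimal local sketch in plain Scala: the first centroid is drawn uniformly at random, and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The object and method names are hypothetical, and this single-machine version does not attempt the parallel scheme of kMeans|| [2].

```scala
import scala.util.Random

object KMeansPlusPlusSeeding {
  type Point = Array[Double]

  private def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Pick k initial centroids from the given points using D^2 weighting.
  def seed(points: IndexedSeq[Point], k: Int, rng: Random): IndexedSeq[Point] = {
    require(k >= 1 && k <= points.length, "need 1 <= k <= number of points")
    // First centroid: uniformly at random.
    var centroids = Vector(points(rng.nextInt(points.length)))
    while (centroids.length < k) {
      // Weight of each point: squared distance to its nearest chosen centroid.
      val weights = points.map(p => centroids.map(c => squaredDistance(p, c)).min)
      val total = weights.sum
      val next =
        if (total == 0.0) {
          // All points coincide with a chosen centroid; fall back to uniform.
          points(rng.nextInt(points.length))
        } else {
          // Sample an index from the cumulative weight distribution.
          val target = rng.nextDouble() * total
          val cumulative = weights.scanLeft(0.0)(_ + _).tail
          val idx = cumulative.indexWhere(_ > target)
          points(if (idx >= 0) idx else points.length - 1)
        }
      centroids = centroids :+ next
    }
    centroids
  }
}
```

Because a point that is already a centroid has weight zero, it is practically never drawn again, which is what spreads the initial centroids across the data.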



[jira] [Commented] (FLINK-1731) Add kMeans clustering algorithm to machine learning library

2015-05-06 Thread Peter Schrott (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530424#comment-14530424
 ] 

Peter Schrott commented on FLINK-1731:
--

FGoessler, philjjoon, peedeeX21 & voelkj
are working on that within the IMPRO-3.SS15 warmup task
