[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122477206
  
Thank you for merging it and your continuous support!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/6756





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122465713
  
LGTM thanks for contributing this big feature!
Merging with master





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122462091
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122462050
  
  [Test build #37674 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37674/console) for PR 6756 at commit [`be752de`](https://github.com/apache/spark/commit/be752de88b45a43da8609517796b00633f943e79).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams `
   * `class KMeansModel(JavaModel):`
   * `class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):`






[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122437718
  
  [Test build #37674 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37674/consoleFull) for PR 6756 at commit [`be752de`](https://github.com/apache/spark/commit/be752de88b45a43da8609517796b00633f943e79).





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122437007
  
Oh, I'm sorry for the easy mistakes...





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122437108
  
Merged build started.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122437085
  
 Merged build triggered.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34931021
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -0,0 +1,114 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, SQLContext}
+
+private[clustering] case class TestRow(features: Vector)
+
+object KMeansSuite {
+  def generateKMeansData(sql: SQLContext, rows: Int, dim: Int, k: Int): DataFrame = {
+    val sc = sql.sparkContext
+    val rdd = sc.parallelize(1 to rows)
+      .map(i => Vectors.dense(Array.fill(dim)((i % k).toDouble)))
+      .map(v => new TestRow(v))
+    sql.createDataFrame(rdd)
+  }
+}
+
+class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  final val k = 5
+  @transient var dataset: DataFrame = _
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+
+    dataset = KMeansSuite.generateKMeansData(sqlContext, 50, 3, k)
+  }
+
+  test("default parameters") {
+    val kmeans = new KMeans()
+
+    assert(kmeans.getK === 2)
+    assert(kmeans.getFeaturesCol === "features")
+    assert(kmeans.getPredictionCol === "prediction")
+    assert(kmeans.getMaxIter === 20)
+    assert(kmeans.getRuns === 1)
+    assert(kmeans.getInitMode === MLlibKMeans.K_MEANS_PARALLEL)
+    assert(kmeans.getInitSteps === 5)
+    assert(kmeans.getEpsilon === 1e-4)
+  }
+
+  test("set parameters") {
+    val kmeans = new KMeans()
+      .setK(9)
+      .setFeaturesCol("test_feature")
+      .setPredictionCol("test_prediction")
+      .setMaxIter(33)
+      .setRuns(7)
+      .setInitMode(MLlibKMeans.RANDOM)
+      .setInitSteps(3)
+      .setSeed(123)
+      .setEpsilon(1e-3)
+
+    assert(kmeans.getK === 9)
+    assert(kmeans.getFeaturesCol === "test_feature")
+    assert(kmeans.getPredictionCol === "test_prediction")
+    assert(kmeans.getMaxIter === 33)
+    assert(kmeans.getRuns === 7)
+    assert(kmeans.getInitMode === MLlibKMeans.RANDOM)
+    assert(kmeans.getInitSteps === 3)
+    assert(kmeans.getSeed === 123)
+    assert(kmeans.getEpsilon === 1e-3)
+  }
+
+  test("parameters validation") {
+    intercept[IllegalArgumentException] {
+      new KMeans().setK(1)
+    }
+    intercept[IllegalArgumentException] {
+      new KMeans().setInitMode("no_such_a_mode")
+    }
+    intercept[IllegalArgumentException] {
+      new KMeans().setInitSteps(0)
+    }
+    intercept[IllegalArgumentException] {
+      new KMeans().setRuns(0)
+    }
+  }
+
+  test("fit & transform") {
+    val predictionColName = "kmeans_prediction"
+    val kmeans = new KMeans().setK(k).setPredictionCol(predictionColName).setSeed(1)
+    val model = kmeans.fit(dataset)
+    assert(model.clusterCenters.length === k)
+
+    val transformed = model.transform(dataset)
+    val expectedColumns = Array("features", predictionColName)
+    expectedColumns.foreach { column =>
+      transformed.columns.contains(column)
--- End diff --

need to assert
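The point of the comment above is that the loop body only evaluates `transformed.columns.contains(column)` and discards the boolean, so the test can never fail. A minimal stand-alone sketch of the fix, in plain Java with hypothetical column names standing in for `transformed.columns` and `expectedColumns` (the Spark test harness itself is not used here):

```java
import java.util.Arrays;
import java.util.List;

public class AssertColumns {
    public static void main(String[] args) {
        // Hypothetical stand-ins for the suite's transformed.columns and expectedColumns.
        List<String> columns = Arrays.asList("features", "kmeans_prediction");
        List<String> expectedColumns = Arrays.asList("features", "kmeans_prediction");

        for (String column : expectedColumns) {
            // columns.contains(column) alone computes a boolean and throws it away;
            // wrapping the check in an assertion is what lets the test actually fail.
            if (!columns.contains(column)) {
                throw new AssertionError("missing expected column: " + column);
            }
        }
        System.out.println("all expected columns present");
    }
}
```

In the Scala suite this corresponds to wrapping the membership check in `assert(...)`.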





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34931026
  
--- Diff: mllib/src/test/java/org/apache/spark/ml/clustering/JavaKMeansSuite.java ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import static org.junit.Assert.assertArrayEquals;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.SQLContext;
+
+public class JavaKMeansSuite implements Serializable {
+
+  private transient int k = 5;
+  private transient JavaSparkContext sc;
+  private transient DataFrame dataset;
+  private transient SQLContext sql;
+
+  @Before
+  public void setUp() {
+    sc = new JavaSparkContext("local", "JavaKMeansSuite");
+    sql = new SQLContext(sc);
+
+    dataset = KMeansSuite.generateKMeansData(sql, 50, 3, k);
+  }
+
+  @After
+  public void tearDown() {
+    sc.stop();
+    sc = null;
+  }
+
+  @Test
+  public void fitAndTransform() {
+    KMeans kmeans = new KMeans().setK(k).setSeed(1);
+    KMeansModel model = kmeans.fit(dataset);
+
+    Vector[] centers = model.clusterCenters();
+    assertEquals(k, centers.length);
+
+    DataFrame transformed = model.transform(dataset);
+    List<String> columns = Arrays.asList(transformed.columns());
+    List<String> expectedColumns = Arrays.asList("features", "prediction");
+    for (String column: expectedColumns) {
+      columns.contains(column);
--- End diff --

Need to assertTrue
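As in the Scala suite, the bare `columns.contains(column)` call discards its result. A runnable sketch of the requested fix, with a local stand-in for JUnit's `assertTrue` so the example runs without the test harness (column names are hypothetical):

```java
import java.util.Arrays;
import java.util.List;

public class AssertTrueSketch {
    // Stand-in for org.junit.Assert.assertTrue, so the sketch runs on its own.
    static void assertTrue(boolean condition) {
        if (!condition) throw new AssertionError("assertion failed");
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the transformed DataFrame's columns.
        List<String> columns = Arrays.asList("features", "prediction");
        List<String> expectedColumns = Arrays.asList("features", "prediction");
        for (String column : expectedColumns) {
            // The fix: pass the membership check to assertTrue instead of
            // discarding its boolean result.
            assertTrue(columns.contains(column));
        }
        System.out.println("fitAndTransform column check passed");
    }
}
```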





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122415725
  
The changes look good, save for those 2 tiny items.  That should be all!





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122209483
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122209384
  
  [Test build #37589 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37589/console) for PR 6756 at commit [`a14939b`](https://github.com/apache/spark/commit/a14939bb221e87cceee16119263fae214ae6dd60).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams `
   * `class KMeansModel(JavaModel):`
   * `class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):`






[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122208559
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122208318
  
  [Test build #37587 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37587/console) for PR 6756 at commit [`4c61693`](https://github.com/apache/spark/commit/4c6169357e84bf823c1346c64e1e41e86774e562).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class KMeans(override val uid: String) extends Estimator[KMeansModel] with KMeansParams `
   * `class KMeansModel(JavaModel):`
   * `class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):`






[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122183530
  
  [Test build #37589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37589/consoleFull) for PR 6756 at commit [`a14939b`](https://github.com/apache/spark/commit/a14939bb221e87cceee16119263fae214ae6dd60).





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122183394
  
Merged build started.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122183383
  
 Merged build triggered.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122182463
  
  [Test build #37587 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37587/consoleFull) for PR 6756 at commit [`4c61693`](https://github.com/apache/spark/commit/4c6169357e84bf823c1346c64e1e41e86774e562).





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122182129
  
Merged build started.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122182121
  
 Merged build triggered.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34863112
  
--- Diff: python/docs/pyspark.ml.rst ---
@@ -33,6 +33,14 @@ pyspark.ml.classification module
 :undoc-members:
 :inherited-members:
 
+pyspark.ml.clustering module
+----------------------------
--- End diff --

I didn't know that. Thank you for the comment.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34862779
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -0,0 +1,204 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+  extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", (x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param the number of runs of the algorithm to execute in parallel. We initialize the algorithm
+   * this many times with random starting conditions (configured by the initialization mode), then
+   * return the best clustering found over any run. Default: 1.
--- End diff --

I see.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34862614
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, SQLContext}
+
+private[clustering] case class TestRow(features: Vector)
+
+object KMeansSuite {
+  def generateKMeansData(sql: SQLContext, rows: Int, dim: Int, k: Int): DataFrame = {
+    val sc = sql.sparkContext
+    val rdd = sc.parallelize(1 to rows)
+      .map(i => Vectors.dense(Array.fill(dim)((i % k).toDouble)))
+      .map(v => new TestRow(v))
+    sql.createDataFrame(rdd)
+  }
+}
+
+class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  final val k = 5
+  @transient var dataset: DataFrame = _
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+
+    dataset = KMeansSuite.generateKMeansData(sqlContext, 50, 3, k)
+  }
+
+  test("default parameters") {
+    val kmeans = new KMeans()
+
+    assert(kmeans.getK === 2)
+    assert(kmeans.getFeaturesCol === "features")
+    assert(kmeans.getPredictionCol === "prediction")
+    assert(kmeans.getMaxIter === 20)
+    assert(kmeans.getRuns === 1)
+    assert(kmeans.getInitMode === MLlibKMeans.K_MEANS_PARALLEL)
+    assert(kmeans.getInitSteps === 5)
+    assert(kmeans.getEpsilon === 1e-4)
+  }
+
+  test("set parameters") {
+    val kmeans = new KMeans()
+      .setK(9)
+      .setFeaturesCol("test_feature")
+      .setPredictionCol("test_prediction")
+      .setMaxIter(33)
+      .setRuns(7)
+      .setInitMode(MLlibKMeans.RANDOM)
+      .setInitSteps(3)
+      .setSeed(123)
+      .setEpsilon(1e-3)
+
+    assert(kmeans.getK === 9)
+    assert(kmeans.getFeaturesCol === "test_feature")
+    assert(kmeans.getPredictionCol === "test_prediction")
+    assert(kmeans.getMaxIter === 33)
+    assert(kmeans.getRuns === 7)
+    assert(kmeans.getInitMode === MLlibKMeans.RANDOM)
+    assert(kmeans.getInitSteps === 3)
+    assert(kmeans.getSeed === 123)
+    assert(kmeans.getEpsilon === 1e-3)
+  }
+
+  test("parameters validation") {
+    intercept[IllegalArgumentException] {
+      new KMeans().setK(1)
+    }
+    intercept[IllegalArgumentException] {
+      new KMeans().setInitMode("no_such_a_mode")
+    }
+    intercept[IllegalArgumentException] {
+      new KMeans().setInitSteps(0)
+    }
+    intercept[IllegalArgumentException] {
+      new KMeans().setRuns(0)
+    }
+  }
+
+  test("fit & transform") {
+    val predictionColName = "kmeans_prediction"
+    val kmeans = new KMeans().setK(k).setPredictionCol(predictionColName).setSeed(1)
+    val model = kmeans.fit(dataset)
+    assert(model.clusterCenters.length === k)
+
+    val transformed = model.transform(dataset)
+    transformed.columns.foreach { column =>
--- End diff --

(same as above) This should be switched: We want to make sure 
expectedColumns is a subset of transformed.columns, but it is currently doing 
the opposite (superset).
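The direction of the containment check matters: iterating over the actual columns and asking whether each is expected fails as soon as the output has a legitimate extra column. A small self-contained Java sketch with hypothetical column lists, showing both directions:

```java
import java.util.Arrays;
import java.util.List;

public class SubsetDirection {
    public static void main(String[] args) {
        // Hypothetical values: the transformed frame carries an extra column.
        List<String> transformedColumns = Arrays.asList("features", "kmeans_prediction");
        List<String> expectedColumns = Arrays.asList("features");

        // Wrong direction: requires every actual column to be expected,
        // so any legitimate extra column breaks the test.
        boolean wrongDirection = expectedColumns.containsAll(transformedColumns);

        // Right direction: every expected column must appear among the actual columns.
        boolean rightDirection = transformedColumns.containsAll(expectedColumns);

        System.out.println("wrong direction passes: " + wrongDirection);
        System.out.println("right direction passes: " + rightDirection);
    }
}
```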


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34862620
  
--- Diff: mllib/src/test/java/org/apache/spark/ml/clustering/JavaKMeansSuite.java ---
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering;
+
+import java.io.Serializable;
+import java.util.Arrays;
+import java.util.List;
+
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import static org.junit.Assert.assertArrayEquals;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.SQLContext;
+
+public class JavaKMeansSuite implements Serializable {
+
+  private transient int k = 5;
+  private transient JavaSparkContext sc;
+  private transient DataFrame dataset;
+  private transient SQLContext sql;
+
+  @Before
+  public void setUp() {
+sc = new JavaSparkContext("local", "JavaKMeansSuite");
+sql = new SQLContext(sc);
+
+dataset = KMeansSuite.generateKMeansData(sql, 50, 3, k);
+  }
+
+  @After
+  public void tearDown() {
+sc.stop();
+sc = null;
+  }
+
+  @Test
+  public void fitAndTransform() {
+KMeans kmeans = new KMeans().setK(k).setSeed(1);
+KMeansModel model = kmeans.fit(dataset);
+
+Vector[] centers = model.clusterCenters();
+assertEquals(k, centers.length);
+
+DataFrame transformed = model.transform(dataset);
+List expectedColumns = Arrays.asList("features", "prediction");
+for (String clm: transformed.columns()) {
--- End diff --

This should be switched: We want to make sure expectedColumns is a subset 
of transformed.columns, but it is currently doing the opposite (superset).





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34862616
  
--- Diff: python/docs/pyspark.ml.rst ---
@@ -33,6 +33,14 @@ pyspark.ml.classification module
 :undoc-members:
 :inherited-members:
 
+pyspark.ml.clustering module
+
--- End diff --

This dashed line should be exactly the same length as the above line, to 
avoid style complaints when compiling docs.
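The doc-style rule being flagged — a reST section underline exactly as long as its title — is easy to check mechanically. A small sketch (hypothetical helper, not part of the Spark build):

```python
# Sphinx warns when a reST section underline is shorter than its title;
# Spark's doc style asks for the exact-length form.
def underline_ok(title, underline, char="-"):
    return underline == char * len(title)

title = "pyspark.ml.clustering module"
assert underline_ok(title, "-" * len(title))   # exact length: fine
assert not underline_ok(title, "-" * 4)        # too short: flagged by the docs build
```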





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34862613
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -0,0 +1,204 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param the distance threshold within which we consider centers to have converged.
+   * If all centers move less than this Euclidean distance, we stop iterating one run. Default: 1e-4
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon",
+"distance threshold within which we consider centers to have converged",
+(value: Double) => value >= 0.0)
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group expertParam
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitMode(value))
+
+  /** @group getExpertParam */
--- End diff --

"getExpertParam" --> "expertGetParam"
"setExpertParam" --> "expertSetParam"
(See ml/package.scala)
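The epsilon parameter quoted earlier in this diff drives the k-means stopping rule: a run ends once every center moves less than epsilon between iterations. A toy sketch of that criterion (plain Python, not the Spark implementation):

```python
import math

# One run of k-means stops when every center's Euclidean movement between
# consecutive iterations falls below epsilon (default 1e-4 in the diff above).
def has_converged(old_centers, new_centers, epsilon=1e-4):
    return all(math.dist(old, new) < epsilon
               for old, new in zip(old_centers, new_centers))

assert not has_converged([(0.0, 0.0)], [(0.5, 0.0)])  # moved 0.5: keep iterating
assert has_converged([(0.0, 0.0)], [(1e-5, 0.0)])     # moved 1e-5 < 1e-4: stop
```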





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34862615
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -0,0 +1,113 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, SQLContext}
+
+private[clustering] case class TestRow(features: Vector)
+
+object KMeansSuite {
+  def generateKMeansData(sql: SQLContext, rows: Int, dim: Int, k: Int): DataFrame = {
+val sc = sql.sparkContext
+val rdd = sc.parallelize(1 to rows).map(i => Vectors.dense(Array.fill(dim)((i % k).toDouble)))
+  .map(v => new TestRow(v))
+sql.createDataFrame(rdd)
+  }
+}
+
+class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  final val k = 5
+  @transient var dataset: DataFrame = _
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+
+dataset = KMeansSuite.generateKMeansData(sqlContext, 50, 3, k)
+  }
+
+  test("default parameters") {
+val kmeans = new KMeans()
+
+assert(kmeans.getK === 2)
+assert(kmeans.getFeaturesCol === "features")
+assert(kmeans.getPredictionCol === "prediction")
+assert(kmeans.getMaxIter === 20)
+assert(kmeans.getRuns === 1)
+assert(kmeans.getInitMode === MLlibKMeans.K_MEANS_PARALLEL)
+assert(kmeans.getInitSteps === 5)
+assert(kmeans.getEpsilon === 1e-4)
+  }
+
+  test("set parameters") {
+val kmeans = new KMeans()
+  .setK(9)
+  .setFeaturesCol("test_feature")
+  .setPredictionCol("test_prediction")
+  .setMaxIter(33)
+  .setRuns(7)
+  .setInitMode(MLlibKMeans.RANDOM)
+  .setInitSteps(3)
+  .setSeed(123)
+  .setEpsilon(1e-3)
+
+assert(kmeans.getK === 9)
+assert(kmeans.getFeaturesCol === "test_feature")
+assert(kmeans.getPredictionCol === "test_prediction")
+assert(kmeans.getMaxIter === 33)
+assert(kmeans.getRuns === 7)
+assert(kmeans.getInitMode === MLlibKMeans.RANDOM)
+assert(kmeans.getInitSteps === 3)
+assert(kmeans.getSeed === 123)
+assert(kmeans.getEpsilon === 1e-3)
+  }
+
+  test("parameters validation") {
+intercept[IllegalArgumentException] {
+  new KMeans().setK(1)
+}
+intercept[IllegalArgumentException] {
+  new KMeans().setInitMode("no_such_a_mode")
+}
+intercept[IllegalArgumentException] {
+  new KMeans().setInitSteps(0)
+}
+intercept[IllegalArgumentException] {
+  new KMeans().setRuns(0)
+}
+  }
+
+  test("fit & transform") {
+val predictionColName = "kmeans_prediction"
+val kmeans = new KMeans().setK(k).setPredictionCol(predictionColName).setSeed(1)
+val model = kmeans.fit(dataset)
+assert(model.clusterCenters.length === k)
+
+val transformed = model.transform(dataset)
+transformed.columns.foreach { column =>
+  Array("features", predictionColName).contains(column)
+}
+val clusters = transformed.select(predictionColName).map(_.get(0)).distinct().collect().toSet
--- End diff --

I'd prefer you use getInt instead of get, to be specific.
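The preference for getInt over get is about failing fast when the prediction column does not hold the expected type. A minimal Python stand-in (hypothetical Row class, not the Spark API) illustrates the difference:

```python
# Minimal stand-in for Spark SQL's Row: a typed accessor like Scala's
# Row.getInt rejects an unexpected type up front, while an untyped get()
# silently returns whatever is stored.
class Row:
    def __init__(self, *values):
        self._values = values

    def get(self, i):
        # Untyped access: whatever is stored comes back.
        return self._values[i]

    def get_int(self, i):
        # Typed access: reject non-int values immediately.
        v = self._values[i]
        if not isinstance(v, int):
            raise TypeError("column %d holds %s, expected int" % (i, type(v).__name__))
        return v

prediction = Row(3, "cluster-3")
assert prediction.get_int(0) == 3        # typed read succeeds
assert prediction.get(1) == "cluster-3"  # untyped read hides the type
```

With get, a schema mistake only surfaces later (e.g. when building the set of cluster ids); with getInt it surfaces at the read.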



[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34862617
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -0,0 +1,210 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import keyword_only
+from pyspark.ml.wrapper import JavaEstimator, JavaModel
+from pyspark.ml.param.shared import *
+from pyspark.mllib.common import inherit_doc
+from pyspark.mllib.linalg import _convert_to_vector
+
+__all__ = ['KMeans', 'KMeansModel']
+
+
+class KMeansModel(JavaModel):
+"""
+Model fitted by KMeans.
+"""
+
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy arrays."""
+return [c.toArray() for c in self._call_java("clusterCenters")]
+
+
+@inherit_doc
+class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):
+"""
+K-means Clustering
+
+>>> from pyspark.mllib.linalg import Vectors
+>>> data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
+... (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
+>>> df = sqlContext.createDataFrame(data, ["features"])
+>>> kmeans = KMeans().setK(2).setSeed(1).setFeaturesCol("features")
+>>> model = kmeans.fit(df)
+>>> centers = model.clusterCenters()
+>>> len(centers)
+2
+>>> transformed = model.transform(df).select("features", "prediction")
+>>> "features" in transformed.columns
--- End diff --

select() already checks that features and prediction are in the schema, so 
no need to check again.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-122177561
  
Thanks for the updates!  A few more comments, but only small items





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34862612
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -0,0 +1,204 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
--- End diff --

Just noticed: Here and elsewhere, can you please state in the Param Scala 
doc the constraints (in this case "Must be >= 1")?
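The constraints the reviewer wants stated in the Scala docs mirror the validators the diff attaches to each param (k > 1, runs >= 1, initSteps >= 1, epsilon >= 0). A sketch of the same checks in plain Python (hypothetical helper, not the spark.ml API):

```python
# The validators from the diff above, with the constraints the reviewer
# wants documented spelled out as explicit error messages.
def validate_kmeans_params(k=2, runs=1, init_steps=5, epsilon=1e-4):
    if k <= 1:
        raise ValueError("k must be > 1")
    if runs < 1:
        raise ValueError("runs must be >= 1")
    if init_steps < 1:
        raise ValueError("initSteps must be >= 1")
    if epsilon < 0.0:
        raise ValueError("epsilon must be >= 0")

validate_kmeans_params()             # the defaults satisfy every constraint
validate_kmeans_params(k=9, runs=7)  # the values exercised in the test suite
```

This is exactly what the "parameters validation" test in KMeansSuite exercises: setK(1), setInitSteps(0), and setRuns(0) each raise IllegalArgumentException.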





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34862572
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param the distance threshold within which we consider centers to have converged.
+   * If all centers move less than this Euclidean distance, we stop iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
--- End diff --

I understand. Thanks.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34862337
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param the distance threshold within which we consider centers to have converged.
+   * If all centers move less than this Euclidean distance, we stop iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
--- End diff --

While I agree that changing the name is not great, this has been 
unavoidable anyways due to inconsistencies in spark.mllib.  I'd prefer we 
switch to better names now, rather than be stuck with the old names in this new 
API.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-121915598
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-121915474
  
  [Test build #37479 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37479/console) for PR 6756 at commit [`c8dc6e6`](https://github.com/apache/spark/commit/c8dc6e6f2cd60cdded8978771b8249172613db63).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class KMeans(override val uid: String) extends Estimator[KMeansModel] 
with KMeansParams `
  * `class KMeansModel(JavaModel):`
  * `class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):`






[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-121912405
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-121912153
  
  [Test build #37475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37475/console) for PR 6756 at commit [`19a9d63`](https://github.com/apache/spark/commit/19a9d63bf1ba84b339ccf875ce059fa8eba75a80).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class KMeans(override val uid: String) extends Estimator[KMeansModel] 
with KMeansParams `
  * `class KMeansModel(JavaModel):`
  * `class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):`






[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-121870189
  
  [Test build #37479 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37479/consoleFull) for PR 6756 at commit [`c8dc6e6`](https://github.com/apache/spark/commit/c8dc6e6f2cd60cdded8978771b8249172613db63).





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-121869667
  
Merged build started.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-121869646
  
 Merged build triggered.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-121865074
  
  [Test build #37475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37475/consoleFull) for PR 6756 at commit [`19a9d63`](https://github.com/apache/spark/commit/19a9d63bf1ba84b339ccf875ce059fa8eba75a80).





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-121862254
  
 Merged build triggered.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-121862282
  
Merged build started.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-16 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34762297
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
--- End diff --

I'm with @jkbradley. We don't need to support that.


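The exchange above is about the validation lambda passed to `IntParam` (`(x: Int) => x > 1`), which rejects invalid values of `k` at set time. A minimal plain-Scala sketch of that validated-parameter pattern, with no Spark dependency (`SimpleParam` and `ParamDemo` are hypothetical stand-ins for spark.ml's `Param[T]`/`IntParam`, not the actual API):

```scala
import scala.util.Try

// Hypothetical stand-in for spark.ml's Param[T]/IntParam: a named
// parameter that carries a validation function, mirroring
// `new IntParam(this, "k", "number of clusters to create", (x: Int) => x > 1)`.
final case class SimpleParam[T](name: String, doc: String, isValid: T => Boolean) {
  /** Return the value if it passes validation, otherwise fail loudly. */
  def validate(value: T): T = {
    require(isValid(value), s"Parameter $name got invalid value $value ($doc)")
    value
  }
}

object ParamDemo {
  // k must be at least 2, matching the validator under review
  val k = SimpleParam[Int]("k", "number of clusters to create", _ > 1)

  def main(args: Array[String]): Unit = {
    println(k.validate(3))                 // accepted
    println(Try(k.validate(1)).isFailure)  // rejected, since 1 is not > 1
  }
}
```

Keeping the validator with the param definition means every setter path fails fast, which is why the reviewers see no need to special-case `k = 1`.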



[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-15 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34746765
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance 
threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
--- End diff --

@jkbradley I totally agree with that. Using the short name makes sense. 
What about the method name in `spark.mllib.clustering.KMeans`? Personally, I 
find it a little strange that the method names differ between `spark.ml` and 
`spark.mllib`.


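The `epsilon` doc quoted in this diff says a run stops iterating once every center moves less than that Euclidean distance. A self-contained sketch of that convergence test in plain Scala (`EpsilonDemo` and `converged` are illustrative names of ours, not Spark's actual implementation, which compares squared distances the same way to avoid a square root):

```scala
// Illustrative convergence check matching the `epsilon` param doc:
// stop when every center moves less than epsilon (Euclidean distance).
object EpsilonDemo {
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  /** True when each paired center moved less than eps. */
  def converged(
      oldCenters: Array[Array[Double]],
      newCenters: Array[Array[Double]],
      eps: Double): Boolean =
    oldCenters.zip(newCenters).forall { case (o, n) =>
      // compare squared values so no sqrt is needed
      squaredDistance(o, n) < eps * eps
    }

  def main(args: Array[String]): Unit = {
    val before = Array(Array(0.0, 0.0), Array(1.0, 1.0))
    val after  = Array(Array(0.001, 0.0), Array(1.0, 1.001))
    println(converged(before, after, eps = 1e-2))  // tiny moves: converged
  }
}
```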



[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34380716
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance 
threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  final val initSteps = new IntParam(this, "initSteps", "number of steps 
for k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel p

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-10 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34380371
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  val runs = new Param[Int](this, "runs", "number of runs of the algorithm 
to execute in parallel",
+(value: Int) => value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  val epsilon = new Param[Double](this, "epsilon", "distance threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  val initSteps = new Param[Int](this, "initSteps", "number of steps for 
k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel private[ml] (
+override val uid: String,

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-09 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34323548
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  val runs = new Param[Int](this, "runs", "number of runs of the algorithm 
to execute in parallel",
+(value: Int) => value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  val epsilon = new Param[Double](this, "epsilon", "distance threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  val initSteps = new Param[Int](this, "initSteps", "number of steps for 
k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel private[ml] (
+override val uid: String,
 

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34308107
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
--- End diff --

I don't see a need to support that.  Let's only modify that if there's a 
real use case.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34301648
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance 
threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  final val initSteps = new IntParam(this, "initSteps", "number of steps 
for k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34301567
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
--- End diff --

Should be `x >= 1` (or is there a reason why we can't just have a single 
cluster)?
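
The question is reasonable: a single cluster is a well-defined edge case, since its
center is simply the mean of all points. A minimal plain-Python sketch of Lloyd's
algorithm (an illustration only, not Spark's implementation) shows that the
iteration degenerates gracefully for k = 1:

```python
# Minimal plain-Python sketch (not the Spark implementation) showing that
# k = 1 is a well-defined edge case: the single center converges to the mean.

def kmeans(points, centers, max_iter=20):
    """Lloyd's algorithm on 1-D points; `points` and `centers` are lists of floats."""
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point.
        labels = [min(range(len(centers)), key=lambda j: abs(p - centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j in range(len(centers)):
            assigned = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(sum(assigned) / len(assigned) if assigned else centers[j])
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    return centers

points = [0.0, 1.0, 8.0, 9.0]
print(kmeans(points, [0.0]))               # k = 1: center is the overall mean -> [4.5]
print(sorted(kmeans(points, [0.0, 9.0])))  # k = 2 -> [0.5, 8.5]
```

With k = 1 the assignment step is trivial and the update step returns the global
mean, so the only cost of allowing it is that the clustering is uninformative.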


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34301484
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance 
threshold")
--- End diff --

Add a check that `epsilon >= 0`.
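
The validator matters because `epsilon` is a convergence threshold: a run stops
once every center has moved less than `epsilon` (in Euclidean distance) between
iterations, so a negative value could never be satisfied. A sketch of the check,
using a hypothetical helper name rather than Spark's internals:

```python
import math

# Sketch of the epsilon convergence test (hypothetical helper, not Spark's code):
# stop iterating once every center has moved less than `epsilon`.

def has_converged(old_centers, new_centers, epsilon):
    if epsilon < 0:  # the validation the review asks for
        raise ValueError("epsilon must be >= 0, got %r" % epsilon)
    return all(
        math.dist(a, b) < epsilon  # Euclidean movement of one center
        for a, b in zip(old_centers, new_centers)
    )

print(has_converged([(0.0, 0.0)], [(1e-6, 0.0)], 1e-4))  # True: moved < epsilon
print(has_converged([(0.0, 0.0)], [(1.0, 0.0)], 1e-4))   # False: still moving
```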





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-08 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34223724
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  val runs = new Param[Int](this, "runs", "number of runs of the algorithm 
to execute in parallel",
+(value: Int) => value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  val epsilon = new Param[Double](this, "epsilon", "distance threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  val initSteps = new Param[Int](this, "initSteps", "number of steps for 
k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel private[ml] (
+override val uid: String,

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34220885
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance 
threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  final val initSteps = new IntParam(this, "initSteps", "number of steps 
for k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel private[ml] (

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34219862
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  val runs = new Param[Int](this, "runs", "number of runs of the algorithm 
to execute in parallel",
+(value: Int) => value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  val epsilon = new Param[Double](this, "epsilon", "distance threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  val initSteps = new Param[Int](this, "initSteps", "number of steps for 
k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel private[ml] (
+override val uid: String,

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-119357135
  
You will also need to modify pyspark.ml.rst to add the clustering module.

That's all I see now---mostly small cleanups.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096466
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -0,0 +1,202 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import keyword_only
+from pyspark.ml.wrapper import JavaEstimator, JavaModel
+from pyspark.ml.param.shared import *
+from pyspark.mllib.common import inherit_doc
+from pyspark.mllib.linalg import _convert_to_vector
+
+__all__ = ['KMeans', 'KMeansModel']
+
+
+class KMeansModel(JavaModel):
+"""
+Model fitted by KMeans.
+"""
+
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self._call_java("clusterCenters")]
+
+
+@inherit_doc
+class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):
+"""
+K-means Clustering
+
+>>> from pyspark.mllib.linalg import Vectors
+>>> data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
+... (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
+>>> df = sqlContext.createDataFrame(data, ["features"])
+>>> kmeans = KMeans().setK(2).setSeed(1).setFeaturesCol("features")
+>>> model = kmeans.fit(df)
+>>> centers = model.clusterCenters()
+>>> len(centers)
+2
+>>> transformed = model.transform(df)
+>>> (transformed.columns)[0] == 'features'
+True
+>>> (transformed.columns)[1] == 'prediction'
+True
+>>> rows = sorted(transformed.collect(), key = lambda r: r[0])
+>>> rows[0].prediction == rows[1].prediction
+True
+>>> rows[2].prediction == rows[3].prediction
+True
+>>> kmeans.setParams("features")
+Traceback (most recent call last):
+...
+TypeError: Method setParams forces keyword arguments.
+"""
+
+@keyword_only
+def __init__(self, k=2, maxIter=20, runs=1, epsilon=1e-4, 
initMode="k-means||", initSteps=5):
+super(KMeans, self).__init__()
+self._java_obj = 
self._new_java_obj("org.apache.spark.ml.clustering.KMeans", self.uid)
+self.k = Param(self, "k", "number of clusters to create")
+self.epsilon = Param(self, "epsilon",
+ "distance threshold within which " +
+ "we consider centers to have converged")
+self.runs = Param(self, "runs", "number of runs of the algorithm 
to execute in parallel")
+self.seed = Param(self, "seed", "random seed")
+self.initMode = Param(self, "initMode",
+  "the initialization algorithm. This can be 
either \"random\" to " +
+  "choose random points as initial cluster 
centers, or \"k-means||\" " +
+  "to use a parallel variant of k-means++")
+self.initSteps = Param(self, "initSteps", "steps for k-means 
initialization mode")
+self._setDefault(k=2, maxIter=20, runs=1, epsilon=1e-4, 
initMode="k-means||", initSteps=5)
+kwargs = self.__init__._input_kwargs
+self.setParams(**kwargs)
+
+def _create_model(self, java_model):
+return KMeansModel(java_model)
+
+@keyword_only
+def setParams(self, k=2, maxIter=20, runs=1, epsilon=1e-4, 
initMode="k-means||", initSteps=5):
+"""
+setParams(self, k=2, maxIter=20, runs=1, epsilon=1e-4, 
initMode="k-means||", initSteps=5):
+
+Sets params for KMeans.
+"""
+kwargs = self.setParams._input_kwargs
+return self._set(**kwargs)
+
+def setK(self, value):
+"""
+Sets the value of :py:attr:`k`.
+
+>>> algo = KMeans().setK(10)
+>>> algo.getK()
+10
+"""
+self._paramMap[self.k] = value
+return self

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096452
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -0,0 +1,202 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import keyword_only
+from pyspark.ml.wrapper import JavaEstimator, JavaModel
+from pyspark.ml.param.shared import *
+from pyspark.mllib.common import inherit_doc
+from pyspark.mllib.linalg import _convert_to_vector
+
+__all__ = ['KMeans', 'KMeansModel']
+
+
+class KMeansModel(JavaModel):
+"""
+Model fitted by KMeans.
+"""
+
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self._call_java("clusterCenters")]
+
+
+@inherit_doc
+class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):
+"""
+K-means Clustering
+
+>>> from pyspark.mllib.linalg import Vectors
+>>> data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
+... (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
+>>> df = sqlContext.createDataFrame(data, ["features"])
+>>> kmeans = KMeans().setK(2).setSeed(1).setFeaturesCol("features")
+>>> model = kmeans.fit(df)
+>>> centers = model.clusterCenters()
+>>> len(centers)
+2
+>>> transformed = model.transform(df)
+>>> (transformed.columns)[0] == 'features'
--- End diff --

Test for contains, not order of columns.
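
The point of the suggestion: asserting on column *positions* couples the doctest
to the output schema's column ordering, while a membership check only requires
that both columns exist. Since `transformed.columns` is a plain list of names in
PySpark, the two styles can be sketched side by side:

```python
# Sketch of the review suggestion: assert column *membership*, not position.
columns = ["features", "prediction"]  # stand-in for transformed.columns

# Brittle: breaks if the output schema ever reorders columns.
assert columns[0] == "features" and columns[1] == "prediction"

# Robust: only requires that both columns exist, in any order.
assert {"features", "prediction"}.issubset(columns)
print("both checks pass on this ordering")
```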





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096458
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -0,0 +1,202 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import keyword_only
+from pyspark.ml.wrapper import JavaEstimator, JavaModel
+from pyspark.ml.param.shared import *
+from pyspark.mllib.common import inherit_doc
+from pyspark.mllib.linalg import _convert_to_vector
+
+__all__ = ['KMeans', 'KMeansModel']
+
+
+class KMeansModel(JavaModel):
+"""
+Model fitted by KMeans.
+"""
+
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self._call_java("clusterCenters")]
+
+
+@inherit_doc
+class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):
+"""
+K-means Clustering
+
+>>> from pyspark.mllib.linalg import Vectors
+>>> data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
+... (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
+>>> df = sqlContext.createDataFrame(data, ["features"])
+>>> kmeans = KMeans().setK(2).setSeed(1).setFeaturesCol("features")
+>>> model = kmeans.fit(df)
+>>> centers = model.clusterCenters()
+>>> len(centers)
+2
+>>> transformed = model.transform(df)
+>>> (transformed.columns)[0] == 'features'
+True
+>>> (transformed.columns)[1] == 'prediction'
+True
+>>> rows = sorted(transformed.collect(), key = lambda r: r[0])
+>>> rows[0].prediction == rows[1].prediction
+True
+>>> rows[2].prediction == rows[3].prediction
+True
+>>> kmeans.setParams("features")
--- End diff --

No need to test this here
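
The doctest in question exercises pyspark's `keyword_only` decorator, which
rejects positional arguments so that param names are always explicit. A
simplified re-implementation (a hypothetical sketch; the real decorator in
`pyspark.ml.util` differs in detail) shows the behavior being tested:

```python
import functools

# Simplified re-implementation of pyspark's `keyword_only` decorator
# (hypothetical sketch; the real one lives in pyspark.ml.util).

def keyword_only(func):
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:  # any positional argument is rejected outright
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        self._input_kwargs = kwargs  # stash kwargs for the wrapped method
        return func(self, **kwargs)
    return wrapper

class KMeansSketch(object):
    @keyword_only
    def setParams(self, k=2, maxIter=20):
        return dict(self._input_kwargs)

print(KMeansSketch().setParams(k=3))  # {'k': 3}
try:
    KMeansSketch().setParams(3)       # positional argument -> rejected
except TypeError as e:
    print(e)                          # Method setParams forces keyword arguments.
```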





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096460
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -0,0 +1,202 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import keyword_only
+from pyspark.ml.wrapper import JavaEstimator, JavaModel
+from pyspark.ml.param.shared import *
+from pyspark.mllib.common import inherit_doc
+from pyspark.mllib.linalg import _convert_to_vector
+
+__all__ = ['KMeans', 'KMeansModel']
+
+
+class KMeansModel(JavaModel):
+    """
+    Model fitted by KMeans.
+    """
+
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy arrays."""
+        return [c.toArray() for c in self._call_java("clusterCenters")]
+
+
+@inherit_doc
+class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):
+    """
+    K-means Clustering
+
+    >>> from pyspark.mllib.linalg import Vectors
+    >>> data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
+    ...         (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
+    >>> df = sqlContext.createDataFrame(data, ["features"])
+    >>> kmeans = KMeans().setK(2).setSeed(1).setFeaturesCol("features")
+    >>> model = kmeans.fit(df)
+    >>> centers = model.clusterCenters()
+    >>> len(centers)
+    2
+    >>> transformed = model.transform(df)
+    >>> (transformed.columns)[0] == 'features'
+    True
+    >>> (transformed.columns)[1] == 'prediction'
+    True
+    >>> rows = sorted(transformed.collect(), key = lambda r: r[0])
+    >>> rows[0].prediction == rows[1].prediction
+    True
+    >>> rows[2].prediction == rows[3].prediction
+    True
+    >>> kmeans.setParams("features")
+    Traceback (most recent call last):
+        ...
+    TypeError: Method setParams forces keyword arguments.
+    """
+
--- End diff --

Should add placeholder for the various params.  For examples of how to do 
this, look in regression.py and search for "placeholder to make it appear in 
the generated doc"
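For readers without regression.py at hand, a hedged sketch of the placeholder pattern being referenced: params are declared at class level so they appear in the generated doc, then re-created per instance in `__init__`. `Param` and `KMeansSketch` are minimal stand-ins, not pyspark's actual classes:

```python
class Param(object):
    """Minimal stand-in for pyspark.ml.param.Param."""
    def __init__(self, parent, name, doc):
        self.parent, self.name, self.doc = parent, name, doc

class KMeansSketch(object):
    # placeholders to make the params appear in the generated doc
    k = Param(None, "k", "number of clusters to create")
    initMode = Param(None, "initMode", "initialization algorithm")

    def __init__(self):
        # real per-instance params shadow the class-level placeholders
        self.k = Param(self, "k", "number of clusters to create")
        self.initMode = Param(self, "initMode", "initialization algorithm")

assert KMeansSketch.k.parent is None        # class-level placeholder
assert KMeansSketch().k.parent is not None  # instance-level param
```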





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096449
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, SQLContext}
+
+private[clustering] case class TestRow(features: Vector)
+
+object KMeansSuite {
+  def generateKMeansData(sql: SQLContext, rows: Int, dim: Int, k: Int): DataFrame = {
+val sc = sql.sparkContext
+val rdd = sc.parallelize(1 to rows).map(i => Vectors.dense(Array.fill(dim)((i % k).toDouble)))
+  .map(v => new TestRow(v))
+sql.createDataFrame(rdd)
+  }
+}
+
+class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  final val k = 5
+  @transient var dataset: DataFrame = _
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+
+dataset = KMeansSuite.generateKMeansData(sqlContext, 50, 3, k)
+  }
+
+  test("default parameters") {
+val kmeans = new KMeans()
+
+assert(kmeans.getK === 2)
+assert(kmeans.getFeaturesCol === "features")
+assert(kmeans.getPredictionCol === "prediction")
+assert(kmeans.getMaxIter === 20)
+assert(kmeans.getRuns === 1)
+assert(kmeans.getInitializationMode === MLlibKMeans.K_MEANS_PARALLEL)
+assert(kmeans.getInitializationSteps === 5)
+assert(kmeans.getEpsilon === 1e-4)
+  }
+
+  test("set parameters") {
+val kmeans = new KMeans()
+  .setK(9)
+  .setFeaturesCol("test_feature")
+  .setPredictionCol("test_prediction")
+  .setMaxIter(33)
+  .setRuns(7)
+  .setInitializationMode(MLlibKMeans.RANDOM)
+  .setInitializationSteps(3)
+  .setSeed(123)
+  .setEpsilon(1e-3)
+
+assert(kmeans.getK === 9)
+assert(kmeans.getFeaturesCol === "test_feature")
+assert(kmeans.getPredictionCol === "test_prediction")
+assert(kmeans.getMaxIter === 33)
+assert(kmeans.getRuns === 7)
+assert(kmeans.getInitializationMode === MLlibKMeans.RANDOM)
+assert(kmeans.getInitializationSteps === 3)
+assert(kmeans.getSeed === 123)
+assert(kmeans.getEpsilon === 1e-3)
+  }
+
+  test("parameters validation") {
+intercept[IllegalArgumentException] {
+  new KMeans().setK(1)
+}
+intercept[IllegalArgumentException] {
+  new KMeans().setInitializationMode("no_such_a_mode")
+}
+intercept[IllegalArgumentException] {
+  new KMeans().setInitializationSteps(0)
+}
+intercept[IllegalArgumentException] {
+  new KMeans().setRuns(0)
+}
+  }
+
+  test("fit & transform") {
+val predictionColName = "kmeans_prediction"
+val kmeans = new KMeans().setK(k).setPredictionCol(predictionColName).setSeed(1)
+val model = kmeans.fit(dataset)
+assert(model.clusterCenters.length === k)
+
+val transformed = model.transform(dataset)
+assert(transformed.columns === Array("features", predictionColName))
+val clusters = transformed.select(predictionColName)
+  .map(row => row.apply(0)).distinct().collect().toSet
--- End diff --

Use ```_.getInt(0)``` instead of apply, to be specific about type.
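A hedged Python analog of the typed-accessor point: prefer an accessor that is explicit about the value's type over a generic positional apply. `Row` here is a hypothetical stand-in (its `get_int` mirrors Scala Row's `getInt`), not Spark's class:

```python
class Row:
    def __init__(self, *values):
        self._values = values

    def __getitem__(self, i):
        # generic access: the static type of the result is unknown
        return self._values[i]

    def get_int(self, i):
        # explicit access: always an int, or a clear error
        v = self._values[i]
        if not isinstance(v, int):
            raise TypeError("column %d is not an int" % i)
        return v

rows = [Row(0), Row(1), Row(0)]
clusters = {r.get_int(0) for r in rows}
assert clusters == {0, 1}
```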



--

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096457
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -0,0 +1,202 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from pyspark.ml.util import keyword_only
+from pyspark.ml.wrapper import JavaEstimator, JavaModel
+from pyspark.ml.param.shared import *
+from pyspark.mllib.common import inherit_doc
+from pyspark.mllib.linalg import _convert_to_vector
+
+__all__ = ['KMeans', 'KMeansModel']
+
+
+class KMeansModel(JavaModel):
+    """
+    Model fitted by KMeans.
+    """
+
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy arrays."""
+        return [c.toArray() for c in self._call_java("clusterCenters")]
+
+
+@inherit_doc
+class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):
+    """
+    K-means Clustering
+
+    >>> from pyspark.mllib.linalg import Vectors
+    >>> data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
+    ...         (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
+    >>> df = sqlContext.createDataFrame(data, ["features"])
+    >>> kmeans = KMeans().setK(2).setSeed(1).setFeaturesCol("features")
+    >>> model = kmeans.fit(df)
+    >>> centers = model.clusterCenters()
+    >>> len(centers)
+    2
+    >>> transformed = model.transform(df)
+    >>> (transformed.columns)[0] == 'features'
+    True
+    >>> (transformed.columns)[1] == 'prediction'
+    True
+    >>> rows = sorted(transformed.collect(), key = lambda r: r[0])
--- End diff --

Use ```r.features``` instead of ```r[0]``` since you should not assume 
column order.  If you select a list of cols from transformed, then you can 
assume that order.
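A hedged sketch of why name-based access is safer than positional access; a namedtuple stands in for pyspark.sql.Row, with the fields deliberately reordered:

```python
from collections import namedtuple

# fields deliberately reordered: prediction first, features second
Row = namedtuple("Row", ["prediction", "features"])

rows = [Row(1, [9.0, 8.0]), Row(0, [0.0, 0.0]), Row(0, [1.0, 1.0])]

# r[0] would silently sort by prediction here because of the column order;
# r.features is robust to any column ordering
by_name = sorted(rows, key=lambda r: r.features)
assert by_name[0].features == [0.0, 0.0]
```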





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096441
  
--- Diff: 
mllib/src/test/java/org/apache/spark/ml/clustering/JavaKMeansSuite.java ---
@@ -0,0 +1,65 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering;
+
+import java.io.Serializable;
+
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+import static org.junit.Assert.assertArrayEquals;
+import static org.junit.Assert.assertEquals;
+
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.SQLContext;
+
+public class JavaKMeansSuite implements Serializable {
+
+  private transient int k = 5;
+  private transient JavaSparkContext sc;
+  private transient DataFrame dataset;
+  private transient SQLContext sql;
+
+  @Before
+  public void setUp() {
+sc = new JavaSparkContext("local", "JavaKMeansSuite");
+sql = new SQLContext(sc);
+
+dataset = KMeansSuite.generateKMeansData(sql, 50, 3, k);
+  }
+
+  @After
+  public void tearDown() {
+sc.stop();
+sc = null;
+  }
+
+  @Test
+  public void fitAndTransform() {
+KMeans kmeans = new KMeans().setK(k).setSeed(1);
+KMeansModel model = kmeans.fit(dataset);
+
+Vector[] centers = model.clusterCenters();
+assertEquals(k, centers.length);
+
+DataFrame transformed = model.transform(dataset);
+assertArrayEquals(new String[]{"features", "prediction"}, transformed.columns());
--- End diff --

Check using ```contains``` so that you do not depend on the order.  (Also, 
we might add other output columns in the future.)
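A hedged sketch of the order-independent assertion, in Python for brevity (the column list is hypothetical): check membership with a set instead of comparing the full array, so the test survives reordering and new output columns:

```python
expected = {"features", "prediction"}

# hypothetical output columns: reordered, with an extra column added later
transformed_columns = ["prediction", "extra_output", "features"]

# passes regardless of column order or additional output columns
assert expected.issubset(transformed_columns)
```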





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096425
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to have converged.
+   * If all centers move less than this Euclidean distance, we stop iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance 
threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  final val initSteps = new IntParam(this, "initSteps", "number of steps 
for k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel p

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096419
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to have converged.
+   * If all centers move less than this Euclidean distance, we stop iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance 
threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  final val initSteps = new IntParam(this, "initSteps", "number of steps 
for k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel p

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096435
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to have converged.
+   * If all centers move less than this Euclidean distance, we stop iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance 
threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  final val initSteps = new IntParam(this, "initSteps", "number of steps 
for k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel p

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096438
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -521,6 +519,13 @@ object KMeans {
   v2: VectorWithNorm): Double = {
 MLUtils.fastSquaredDistance(v1.vector, v1.norm, v2.vector, v2.norm)
   }
+
+  private[spark] def validateInitializationMode(initializationMode: 
String): Boolean = {
--- End diff --

This should return true or false, not throw an exception.  The caller can 
throw an exception as needed.  (spark.ml Params handle throwing the exception 
if isValid fails.)
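A hedged sketch of the suggested contract, in Python for brevity: the validator returns a boolean and the caller decides whether to raise. Names are illustrative, not Spark's API:

```python
K_MEANS_PARALLEL = "k-means||"
RANDOM = "random"

def validate_initialization_mode(mode):
    # validator: reports validity, never raises
    return mode in (RANDOM, K_MEANS_PARALLEL)

def set_initialization_mode(mode):
    # caller: raises if validation fails
    if not validate_initialization_mode(mode):
        raise ValueError("Unknown initialization mode: %s" % mode)
    return mode

assert validate_initialization_mode("random")
assert not validate_initialization_mode("no_such_a_mode")
```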





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096414
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, 
ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", 
(x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+"number of runs of the algorithm to execute in parallel", (value: Int) 
=> value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to have converged.
+   * If all centers move less than this Euclidean distance, we stop iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance 
threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  final val initSteps = new IntParam(this, "initSteps", "number of steps 
for k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel p

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096444
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -0,0 +1,112 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.sql.{DataFrame, SQLContext}
+
+private[clustering] case class TestRow(features: Vector)
+
+object KMeansSuite {
+  def generateKMeansData(sql: SQLContext, rows: Int, dim: Int, k: Int): DataFrame = {
+    val sc = sql.sparkContext
+    val rdd = sc.parallelize(1 to rows).map(i => Vectors.dense(Array.fill(dim)((i % k).toDouble)))
+      .map(v => new TestRow(v))
+    sql.createDataFrame(rdd)
+  }
+}
+
+class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  final val k = 5
+  @transient var dataset: DataFrame = _
+
+  override def beforeAll(): Unit = {
+    super.beforeAll()
+
+    dataset = KMeansSuite.generateKMeansData(sqlContext, 50, 3, k)
+  }
+
+  test("default parameters") {
+    val kmeans = new KMeans()
+
+    assert(kmeans.getK === 2)
+    assert(kmeans.getFeaturesCol === "features")
+    assert(kmeans.getPredictionCol === "prediction")
+    assert(kmeans.getMaxIter === 20)
+    assert(kmeans.getRuns === 1)
+    assert(kmeans.getInitializationMode === MLlibKMeans.K_MEANS_PARALLEL)
+    assert(kmeans.getInitializationSteps === 5)
+    assert(kmeans.getEpsilon === 1e-4)
+  }
+
+  test("set parameters") {
+    val kmeans = new KMeans()
+      .setK(9)
+      .setFeaturesCol("test_feature")
+      .setPredictionCol("test_prediction")
+      .setMaxIter(33)
+      .setRuns(7)
+      .setInitializationMode(MLlibKMeans.RANDOM)
+      .setInitializationSteps(3)
+      .setSeed(123)
+      .setEpsilon(1e-3)
+
+    assert(kmeans.getK === 9)
+    assert(kmeans.getFeaturesCol === "test_feature")
+    assert(kmeans.getPredictionCol === "test_prediction")
+    assert(kmeans.getMaxIter === 33)
+    assert(kmeans.getRuns === 7)
+    assert(kmeans.getInitializationMode === MLlibKMeans.RANDOM)
+    assert(kmeans.getInitializationSteps === 3)
+    assert(kmeans.getSeed === 123)
+    assert(kmeans.getEpsilon === 1e-3)
+  }
+
+  test("parameters validation") {
+    intercept[IllegalArgumentException] {
+      new KMeans().setK(1)
+    }
+    intercept[IllegalArgumentException] {
+      new KMeans().setInitializationMode("no_such_a_mode")
+    }
+    intercept[IllegalArgumentException] {
+      new KMeans().setInitializationSteps(0)
+    }
+    intercept[IllegalArgumentException] {
+      new KMeans().setRuns(0)
+    }
+  }
+
+  test("fit & transform") {
+    val predictionColName = "kmeans_prediction"
+    val kmeans = new KMeans().setK(k).setPredictionCol(predictionColName).setSeed(1)
+    val model = kmeans.fit(dataset)
+    assert(model.clusterCenters.length === k)
+
+    val transformed = model.transform(dataset)
+    assert(transformed.columns === Array("features", predictionColName))
--- End diff --

ditto: Test using ```contains```
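
The suggestion could look like the following standalone sketch. The array here is a stand-in for `transformed.columns` (no SparkContext is available outside the test), and the prediction column name is the one set in the test above:

```scala
// Sketch of the suggested assertion style: membership checks via `contains`
// instead of exact array equality, so the test keeps passing if transform()
// ever appends additional columns.
object ContainsCheck extends App {
  // Stand-in for transformed.columns from the test suite.
  val transformedColumns = Array("features", "kmeans_prediction")
  assert(transformedColumns.contains("features"))
  assert(transformedColumns.contains("kmeans_prediction"))
  println("column membership checks passed")
}
```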


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096423
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096433
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096403
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+  /**
+   * Param the distance threshold within which we've consider centers to have converged.
+   * If all centers move less than this Euclidean distance, we stop iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance threshold")
--- End diff --

Please fill out the built-in doc.  "distance threshold" should be replaced 
with a full explanation.  Feel free to just copy the Scala doc.  Same for other 
parameters.
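
Applied to the `epsilon` parameter, the requested change might look like this sketch (hypothetical wording copied from the Scaladoc; not runnable on its own, since it assumes the surrounding `KMeansParams` trait):

```scala
// Hypothetical sketch: the built-in doc string carries the full explanation
// instead of the bare "distance threshold". Assumes the KMeansParams trait.
final val epsilon = new DoubleParam(this, "epsilon",
  "distance threshold within which we consider centers to have converged. " +
  "If all centers move less than this Euclidean distance, we stop iterating one run.")
```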





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096400
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+  /**
+   * Param the distance threshold within which we've consider centers to have converged.
+   * If all centers move less than this Euclidean distance, we stop iterating one run.
--- End diff --

state default value





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096408
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. This is an advanced
--- End diff --

To mark something as an expert-only Param, use tags:
```
@group expertParam
@group expertSetParam
@group expertGetParam
```
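
For instance, the `initSteps` declaration above might carry the tags like this (a sketch of the reviewer's suggestion, not runnable outside the `KMeansParams` trait):

```scala
/**
 * Param for the number of steps for the k-means|| initialization mode. This is an advanced
 * setting -- the default of 5 is almost always enough. Default: 5.
 * @group expertParam
 */
final val initSteps = new IntParam(this, "initSteps", "number of steps for k-means||",
  (value: Int) => value > 0)

/** @group expertGetParam */
def getInitializationSteps: Int = $(initSteps)
```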





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096416
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096411
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, ParamMap}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, HasPredictionCol, HasSeed}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+  extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  final val k = new IntParam(this, "k", "number of clusters to create", (x: Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param the number of runs of the algorithm to execute in parallel. We initialize the algorithm
+   * this many times with random starting conditions (configured by the initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  final val runs = new IntParam(this, "runs",
+    "number of runs of the algorithm to execute in parallel", (value: Int) => value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param the distance threshold within which we've consider centers to have converged.
+   * If all centers move less than this Euclidean distance, we stop iterating one run.
+   * @group param
+   */
+  final val epsilon = new DoubleParam(this, "epsilon", "distance threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  final val initMode = new Param[String](this, "initMode", "initialization algorithm",
+    (value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  final val initSteps = new IntParam(this, "initSteps", "number of steps for k-means||",
+    (value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
--- End diff --

getInitSteps
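The `epsilon` doc in the quoted diff says a run stops iterating once every center moves less than that Euclidean distance. As a rough, self-contained sketch of that convergence test (illustrative Java, not Spark's implementation; all names here are made up):

```java
public class ConvergenceCheck {
    // Squared Euclidean distance between two points.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // True when every center moved less than epsilon (Euclidean distance),
    // comparing squared distances to avoid a sqrt per center.
    static boolean converged(double[][] oldCenters, double[][] newCenters, double epsilon) {
        for (int i = 0; i < oldCenters.length; i++) {
            if (squaredDistance(oldCenters[i], newCenters[i]) > epsilon * epsilon) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] oldC = {{0.0, 0.0}, {5.0, 5.0}};
        double[][] moved = {{0.001, 0.0}, {5.0, 5.001}};  // tiny movement
        double[][] jumped = {{1.0, 0.0}, {5.0, 5.0}};     // one center jumps
        System.out.println(converged(oldC, moved, 1e-2));   // true
        System.out.println(converged(oldC, jumped, 1e-2));  // false
    }
}
```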


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34096404
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,202 @@
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
--- End diff --

getInitMode
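The `initMode` param quoted above delegates validation to `MLlibKMeans.validateInitializationMode`, which accepts the two documented modes. A minimal sketch of what such a check amounts to (illustrative Java; the helper here is an assumption, not Spark's actual code):

```java
public class InitModeCheck {
    // Hypothetical stand-in for MLlibKMeans.validateInitializationMode:
    // only the two documented initialization modes are accepted.
    static boolean isValidInitMode(String mode) {
        return "random".equals(mode) || "k-means||".equals(mode);
    }

    public static void main(String[] args) {
        System.out.println(isValidInitMode("k-means||")); // true
        System.out.println(isValidInitMode("kmeans++"));  // false: not a recognized mode name
    }
}
```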





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-07 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-119344394
  
Reviewing now





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-06 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r34007996
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: Int) => x > 1)
--- End diff --

We need specialized param types since Java doesn't handle Scala's boxed 
types very nicely.  I forget the exact bug it caused...but @mengxr might recall.

We do not need StringParam since Scala's String is actually 
java.lang.String, not a boxed type.
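The point about boxed types can be sketched outside Spark. A generic `Param<T>` setter erases to `set(Object)` on the JVM, so numeric arguments must be boxed at every call site, while a specialized subclass can expose a primitive overload. An illustrative Java analogy of `Param[Int]` vs `IntParam` (these mock classes are not Spark's):

```java
public class ParamErasure {
    // Generic param: the setter erases to set(Object), so an int caller boxes.
    static class Param<T> {
        final String name;
        T value;
        Param(String name) { this.name = name; }
        void set(T value) { this.value = value; }
    }

    // Specialized param: adds a primitive overload so Java callers
    // can pass a plain int without going through Integer.
    static class IntParam extends Param<Integer> {
        IntParam(String name) { super(name); }
        void set(int value) { super.set(value); }
    }

    public static void main(String[] args) {
        IntParam k = new IntParam("k");
        k.set(2);                    // overload resolution picks set(int)
        System.out.println(k.value); // 2
    }
}
```

This is only an analogy: in Scala the problem shows up because `Param[Int]` uses Scala's value type `Int`, which Java sees through erasure and boxing rather than as a plain `int`.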





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33900137
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33839844
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: Int) => x > 1)
--- End diff --

@yu-iskw I'm actually not totally clear why we need `IntParam` for Java 
compatibility; I suspect it's got something to do with `Param[T]` and type 
erasure on the JVM. Do you know why?





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33839758
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-118215081
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-118215045
  
  [Test build #36450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36450/console) for PR 6756 at commit [`2f392e1`](https://github.com/apache/spark/commit/2f392e116f69b6f2dfa8ff372c0288ea1f5b52a6).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class KMeans(override val uid: String) extends Estimator[KMeansModel] 
with KMeansParams `
  * `class KMeansModel(JavaModel):`
  * `class KMeans(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed):`






[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33834966
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: Int) => x > 1)
--- End diff --

@feynmanliang I mostly understand now why we need such primitive param classes. From a
developer's point of view, though, it is a little strange that `StringParam` is not supported.
Of course we can use Scala's String class from Java directly. Still, I think mixing
`Param[String]` with primitive param classes such as `IntParam` is a little inconsistent.
What do you think -- should we support a `StringParam` class or not?
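The asymmetry discussed here is that `String` needs no specialization: Scala's `String` is `java.lang.String`, so a generic param typed to `String` is already natural to call from Java, with no boxing involved. A small illustrative mock (not Spark's `Param` class):

```java
import java.util.function.Predicate;

public class StringParamSketch {
    // Generic param with a validator; for T = String, Java callers pass
    // plain string literals with no boxing or conversion needed.
    static class Param<T> {
        final String name;
        final Predicate<T> isValid;
        Param(String name, Predicate<T> isValid) {
            this.name = name;
            this.isValid = isValid;
        }
        boolean validate(T value) { return isValid.test(value); }
    }

    public static void main(String[] args) {
        Param<String> initMode =
            new Param<>("initMode", v -> v.equals("random") || v.equals("k-means||"));
        System.out.println(initMode.validate("random"));   // true
        System.out.println(initMode.validate("k-means#")); // false
    }
}
```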





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-118202981
  
  [Test build #36450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36450/consoleFull) for PR 6756 at commit [`2f392e1`](https://github.com/apache/spark/commit/2f392e116f69b6f2dfa8ff372c0288ea1f5b52a6).





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-118202853
  
Merged build started.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-118202856
  
@feynmanliang Thank you for your feedback. I modified the points.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-118202835
  
 Merged build triggered.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33833577
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  val runs = new Param[Int](this, "runs", "number of runs of the algorithm 
to execute in parallel",
+(value: Int) => value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  val epsilon = new Param[Double](this, "epsilon", "distance threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  val initSteps = new Param[Int](this, "initSteps", "number of steps for 
k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel private[ml] (
+override val uid: String,
 

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33804988
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  val runs = new Param[Int](this, "runs", "number of runs of the algorithm 
to execute in parallel",
+(value: Int) => value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  val epsilon = new Param[Double](this, "epsilon", "distance threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  val initSteps = new Param[Int](this, "initSteps", "number of steps for 
k-means||",
--- End diff --

`IntParam`

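The one-word review above refers to the quoted `initSteps` declaration. A sketch of what the reviewer's suggestion would look like, replacing the generic `Param[Int]` with the specialized `IntParam` (illustrative only; names follow the spark.ml param API of this era, and the validator is carried over from the quoted diff):

```scala
// Sketch of the suggested change: IntParam instead of Param[Int].
// IntParam is the idiomatic spark.ml choice for integer-valued params
// and avoids boxing the value on every access.
val initSteps = new IntParam(this, "initSteps", "number of steps for k-means||",
  (value: Int) => value > 0)
```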




[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33804769
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
--- End diff --

Also, maybe consider making them `final`?

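Marking the params `final` prevents subclasses from overriding them, keeping param semantics stable across the trait hierarchy. A sketch of the combined suggestions for the quoted `k` declaration (`final` plus the specialized param class; illustrative, not the merged code):

```scala
// Illustrative only: final + IntParam, as the review suggests.
// final guarantees every class mixing in KMeansParams sees the same param.
final val k = new IntParam(this, "k", "number of clusters to create",
  (x: Int) => x > 1)
```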




[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33804680
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  val runs = new Param[Int](this, "runs", "number of runs of the algorithm 
to execute in parallel",
+(value: Int) => value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  val epsilon = new Param[Double](this, "epsilon", "distance threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  val initSteps = new Param[Int](this, "initSteps", "number of steps for 
k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel private[ml] (
+override val uid: String,

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33804498
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  val runs = new Param[Int](this, "runs", "number of runs of the algorithm 
to execute in parallel",
+(value: Int) => value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  val epsilon = new Param[Double](this, "epsilon", "distance threshold")
--- End diff --

`DoubleParam`

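As with the integer params, the quoted `epsilon` declaration would use the specialized class. A sketch (the non-negativity validator is an assumption added for illustration; the quoted diff declares no validator for `epsilon`):

```scala
// Sketch: DoubleParam instead of Param[Double] for epsilon.
// The >= 0.0 check is illustrative, not part of the quoted diff.
val epsilon = new DoubleParam(this, "epsilon",
  "distance threshold within which we consider centers to have converged",
  (value: Double) => value >= 0.0)
```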




[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33804468
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  val runs = new Param[Int](this, "runs", "number of runs of the algorithm 
to execute in parallel",
--- End diff --

`IntParam`





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33804453
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
--- End diff --

nit: `IntParam` instead of `Param[Int]`





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/6756#discussion_r33803659
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, 
HasPredictionCol, HasSeed}
+import org.apache.spark.ml.param.{Param, ParamMap, Params}
+import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
+import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{IntegerType, StructType}
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.util.Utils
+
+
+/**
+ * Common params for KMeans and KMeansModel
+ */
+private[clustering] trait KMeansParams
+extends Params with HasMaxIter with HasFeaturesCol with HasSeed with 
HasPredictionCol {
+
+  /**
+   * Set the number of clusters to create (k). Default: 2.
+   * @group param
+   */
+  val k = new Param[Int](this, "k", "number of clusters to create", (x: 
Int) => x > 1)
+
+  /** @group getParam */
+  def getK: Int = $(k)
+
+  /**
+   * Param for the number of runs of the algorithm to execute in parallel. We 
initialize the algorithm
+   * this many times with random starting conditions (configured by the 
initialization mode), then
+   * return the best clustering found over any run. Default: 1.
+   * @group param
+   */
+  val runs = new Param[Int](this, "runs", "number of runs of the algorithm 
to execute in parallel",
+(value: Int) => value >= 1)
+
+  /** @group getParam */
+  def getRuns: Int = $(runs)
+
+  /**
+   * Param for the distance threshold within which we consider centers to 
have converged.
+   * If all centers move less than this Euclidean distance, we stop 
iterating one run.
+   * @group param
+   */
+  val epsilon = new Param[Double](this, "epsilon", "distance threshold")
+
+  /** @group getParam */
+  def getEpsilon: Double = $(epsilon)
+
+  /**
+   * Param for the initialization algorithm. This can be either "random" 
to choose random points as
+   * initial cluster centers, or "k-means||" to use a parallel variant of 
k-means++
+   * (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
+   * @group param
+   */
+  val initMode = new Param[String](this, "initMode", "initialization 
algorithm",
+(value: String) => MLlibKMeans.validateInitializationMode(value))
+
+  /** @group getParam */
+  def getInitializationMode: String = $(initMode)
+
+  /**
+   * Param for the number of steps for the k-means|| initialization mode. 
This is an advanced
+   * setting -- the default of 5 is almost always enough. Default: 5.
+   * @group param
+   */
+  val initSteps = new Param[Int](this, "initSteps", "number of steps for 
k-means||",
+(value: Int) => value > 0)
+
+  /** @group getParam */
+  def getInitializationSteps: Int = $(initSteps)
+
+  /**
+   * Validates and transforms the input schema.
+   * @param schema input schema
+   * @return output schema
+   */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
+SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Model fitted by KMeans.
+ *
+ * @param parentModel a model trained by spark.mllib.clustering.KMeans.
+ */
+@Experimental
+class KMeansModel private[ml] (
+override val uid: String,

[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-117974897
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [Spark-7879][MLlib] KMeans API for spark.ml Pi...

2015-07-02 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/6756#issuecomment-117975579
  
@jkbradley After rebasing on the master branch, I addressed the points. Could 
you review it when you have time? Thanks!




