[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239701916
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. 
`spark.assignClusters` method runs the PIC algorithm and returns a cluster 
assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
--- End diff --

BTW, when I added that to https://spark.apache.org/contributing.html, we 
also agreed to follow the committer's judgement based on the guide, because 
the guide says:

> The coding conventions described above should be followed, unless there 
is good reason to do otherwise. Exceptions include legacy code and modifying 
third-party code.

We do have a legacy reason here, and there is also a good reason: consistency 
and the committer's judgement.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239701364
  
--- Diff: R/pkg/tests/fulltests/test_mllib_clustering.R ---
@@ -319,4 +319,18 @@ test_that("spark.posterior and spark.perplexity", {
   expect_equal(length(local.posterior), sum(unlist(local.posterior)))
 })
 
+test_that("spark.assignClusters", {
+  df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+ list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+ list(4L, 0L, 0.1)), schema = c("src", "dst", 
"weight"))
+  clusters <- spark.assignClusters(df, initMode = "degree", weightCol = 
"weight")
+  expected_result <- createDataFrame(list(list(4L, 1L),
+  list(0L, 0L),
+  list(1L, 0L),
+  list(3L, 1L),
+  list(2L, 0L)),
+  schema = c("id", "cluster"))
--- End diff --

ditto for style





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239701069
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. 
`spark.assignClusters` method runs the PIC algorithm and returns a cluster 
assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
--- End diff --

IIRC, two separate styles are already mixed in the R code:

```r
df <- createDataFrame(
  list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
  list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
```

or

```r
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)),
                      schema = c("src", "dst", "weight"))
```

Let's avoid mixing styles, and go for the latter one when possible, 
because it at least looks more compliant with the code style guide.





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239700846
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. 
`spark.assignClusters` method runs the PIC algorithm and returns a cluster 
assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
--- End diff --

Yea, we do have an indentation rule: see "Code Style Guide" at 
https://spark.apache.org/contributing.html -> 
https://google.github.io/styleguide/Rguide.xml. I know the code style is not 
perfectly documented, but at least there are some examples. I think the correct 
indentation is:

```r
df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
                           list(1L, 2L, 1.0), list(3L, 4L, 1.0),
                           list(4L, 0L, 0.1)),
                      schema = c("src", "dst", "weight"))
``` 





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239700056
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. 
`spark.assignClusters` method runs the PIC algorithm and returns a cluster 
assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
--- End diff --

Do we have an indentation rule for this? This PR uses two types of 
indentation for the same statement.
- For docs (sparkr-vignettes.Rmd, mllib_clustering.R), this line is aligned 
with the first `list`.
- For real code (test_mllib_clustering.R, powerIterationClustering.R), this 
line is aligned with the second `list`.

Can we use the same indentation rule?






[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239626824
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,58 @@ setMethod("write.ml", signature(object = "LDAModel", 
path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call 
\code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input 
vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param sourceCol Param for the name of the input column for source 
vertex IDs.
+#' @param destinationCol Name of the input column for destination vertex 
IDs.
+#' @param weightCol Param for weight column name. If this is not set or 
\code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the 
corresponding cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases 
assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'   list(4L, 0L, 0.1)), schema = c("src", "dst", 
"weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", 
weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+  signature(data = "SparkDataFrame"),
+  function(data, k = 2L, initMode = c("random", "degree"), maxIter 
= 20L,
+sourceCol = "src", destinationCol = "dst", weightCol = NULL) {
+if (!is.numeric(k) || k < 1) {
+  stop("k should be a number with value >= 1.")
+}
+if (!is.integer(maxIter) || maxIter <= 0) {
+  stop("maxIter should be a number with value > 0.")
+}
--- End diff --

It seems to me that R is a thin wrapper: we only need to create a PIC object 
and call the corresponding Scala method. The SparkDataFrame's column types are 
only checked in Scala, not in R. 





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-06 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239626871
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API 
docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
+developed by http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and 
Cohen.
--- End diff --

Thanks. I will change the hyperlink. 





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239259444
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,58 @@ setMethod("write.ml", signature(object = "LDAModel", 
path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call 
\code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input 
vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param sourceCol Param for the name of the input column for source 
vertex IDs.
+#' @param destinationCol Name of the input column for destination vertex 
IDs.
+#' @param weightCol Param for weight column name. If this is not set or 
\code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the 
corresponding cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases 
assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'   list(4L, 0L, 0.1)), schema = c("src", "dst", 
"weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", 
weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+  signature(data = "SparkDataFrame"),
+  function(data, k = 2L, initMode = c("random", "degree"), maxIter 
= 20L,
+sourceCol = "src", destinationCol = "dst", weightCol = NULL) {
+if (!is.numeric(k) || k < 1) {
+  stop("k should be a number with value >= 1.")
+}
+if (!is.integer(maxIter) || maxIter <= 0) {
+  stop("maxIter should be a number with value > 0.")
+}
--- End diff --

I mean the `data` SparkDataFrame's column types, if possible. If you remove 
'L' from '0L' in your example dataset, you can see the failure.
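
The failure traces back to how R types numeric literals. A minimal base-R sketch, assuming nothing beyond base R (no Spark session is started, and `edge_types` is a hypothetical helper name used only for illustration): a literal with the `L` suffix is an integer, while a bare literal is a double, so an edge record built from `0` instead of `0L` carries double-typed vertex IDs.

```r
# Integer vs. double literals in base R.
typeof(0L)  # "integer"
typeof(0)   # "double"

# Hypothetical helper: report the R type of each field in one edge record.
edge_types <- function(edge) {
  vapply(edge, typeof, character(1))
}

with_suffix <- edge_types(list(0L, 1L, 1.0))
without_suffix <- edge_types(list(0, 1, 1.0))

print(with_suffix)     # "integer" "integer" "double"
print(without_suffix)  # "double"  "double"  "double"
```

Only the first two fields (the vertex IDs) change type when the suffix is dropped; the weight is a double either way.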





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239258564
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API 
docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
+developed by http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and 
Cohen.
--- End diff --

Actually, I built this PR on my Mac, and found that the hyperlink is not 
generated.





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239258366
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API 
docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
+developed by http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and 
Cohen.
--- End diff --

You need to build from the Spark repository because Jekyll handles it 
differently from GitHub. Please try to build in the `docs` directory. There is a 
`README.md` for that.





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239257840
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API 
docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
+developed by http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and 
Cohen.
+From the abstract: PIC finds a very low-dimensional embedding of a dataset
+using truncated power iteration on a normalized pair-wise similarity 
matrix of the data.
+
+`spark.ml`'s PowerIterationClustering implementation takes the following 
parameters:
+
+* `k`: the number of clusters to create
+* `initMode`: param for the initialization algorithm
+* `maxIter`: param for maximum number of iterations
+* `srcCol`: param for the name of the input column for source vertex IDs
+* `dstCol`: name of the input column for destination vertex IDs
+* `weightCol`: Param for weight column name
+
+**Examples**
+
+
+
+
+Refer to the [Scala API 
docs](api/scala/index.html#org.apache.spark.ml.clustering.PowerIterationClustering)
 for more details.
+
+{% include_example 
scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
+
+
+
+Refer to the [Java API 
docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) 
for more details.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
+
+
+
--- End diff --

Thanks. Got it.





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239250376
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API 
docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
+developed by http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and 
Cohen.
--- End diff --

I normally check the md file on GitHub, and the link works OK there. Is there a 
better way to check? @dongjoon-hyun @felixcheung 

https://github.com/apache/spark/blob/9158da8cb76cc13f3011deaa7ac2c290eef62389/docs/ml-clustering.md
I guess I will still remove the ```a href=``` since no other place in the 
doc uses `<a>`.





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239250335
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,58 @@ setMethod("write.ml", signature(object = "LDAModel", 
path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call 
\code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input 
vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param sourceCol Param for the name of the input column for source 
vertex IDs.
+#' @param destinationCol Name of the input column for destination vertex 
IDs.
+#' @param weightCol Param for weight column name. If this is not set or 
\code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the 
corresponding cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases 
assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'   list(4L, 0L, 0.1)), schema = c("src", "dst", 
"weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", 
weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+  signature(data = "SparkDataFrame"),
+  function(data, k = 2L, initMode = c("random", "degree"), maxIter 
= 20L,
+sourceCol = "src", destinationCol = "dst", weightCol = NULL) {
+if (!is.numeric(k) || k < 1) {
+  stop("k should be a number with value >= 1.")
+}
+if (!is.integer(maxIter) || maxIter <= 0) {
+  stop("maxIter should be a number with value > 0.")
+}
--- End diff --

@dongjoon-hyun ```src``` and ```dst``` are character columns. I have the 
check for character type. 
```
as.character(sourceCol),
as.character(destinationCol)
```
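
For context, `sourceCol` and `destinationCol` hold column names, and `as.character` coerces its argument rather than validating it. A minimal base-R sketch of that distinction, assuming only base R (`check_col_name` is a hypothetical helper, not part of SparkR or of this PR):

```r
# as.character() converts silently; it never rejects a non-character value.
as.character(42L)  # "42"
is.character(42L)  # FALSE

# Hypothetical helper: fail fast when a column-name argument is not a
# single string, instead of coercing it.
check_col_name <- function(name, arg) {
  if (!is.character(name) || length(name) != 1) {
    stop(arg, " should be a single character string.")
  }
  invisible(name)
}

check_col_name("src", "sourceCol")  # passes silently

# A non-character argument is rejected rather than converted.
rejected <- tryCatch(
  {
    check_col_name(42L, "sourceCol")
    FALSE
  },
  error = function(e) TRUE
)
print(rejected)  # TRUE
```

Neither form inspects the data types of the `src` and `dst` columns themselves; both only concern the name arguments.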





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239238873
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API 
docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
+developed by http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and 
Cohen.
+From the abstract: PIC finds a very low-dimensional embedding of a dataset
+using truncated power iteration on a normalized pair-wise similarity 
matrix of the data.
+
+`spark.ml`'s PowerIterationClustering implementation takes the following 
parameters:
+
+* `k`: the number of clusters to create
+* `initMode`: param for the initialization algorithm
+* `maxIter`: param for maximum number of iterations
+* `srcCol`: param for the name of the input column for source vertex IDs
+* `dstCol`: name of the input column for destination vertex IDs
+* `weightCol`: Param for weight column name
+
+**Examples**
+
+
+
+
+Refer to the [Scala API 
docs](api/scala/index.html#org.apache.spark.ml.clustering.PowerIterationClustering)
 for more details.
+
+{% include_example 
scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
+
+
+
+Refer to the [Java API 
docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) 
for more details.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
+
+
+
--- End diff --

@dongjoon-hyun 
https://github.com/apache/spark/pull/22996
I will add the python example in the doc once the above PR is merged in. 





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239228923
  
--- Diff: examples/src/main/r/ml/powerIterationClustering.R ---
@@ -0,0 +1,37 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/powerIterationClustering.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-powerIterationCLustering-example")
+
+# $example on$
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+   list(4L, 0L, 0.1)), schema = c("src", "dst", 
"weight"))
+#assign clusters
--- End diff --

nit. `#assign` -> `# assign`.





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239224970
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API 
docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
+developed by http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and 
Cohen.
+From the abstract: PIC finds a very low-dimensional embedding of a dataset
+using truncated power iteration on a normalized pair-wise similarity 
matrix of the data.
+
+`spark.ml`'s PowerIterationClustering implementation takes the following 
parameters:
+
+* `k`: the number of clusters to create
+* `initMode`: param for the initialization algorithm
+* `maxIter`: param for maximum number of iterations
+* `srcCol`: param for the name of the input column for source vertex IDs
+* `dstCol`: name of the input column for destination vertex IDs
+* `weightCol`: Param for weight column name
+
+**Examples**
+
+
+
+
+Refer to the [Scala API 
docs](api/scala/index.html#org.apache.spark.ml.clustering.PowerIterationClustering)
 for more details.
+
+{% include_example 
scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
+
+
+
+Refer to the [Java API 
docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) 
for more details.
+
+{% include_example 
java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
+
+
+
--- End diff --

It seems that `Python` is missing here. Could you check and add it?
cc @HyukjinKwon 





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239224498
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API 
docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
+developed by http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf>Lin and 
Cohen.
--- End diff --

It seems that the `<a>` tag doesn't work here. Could you check the 
generated document and try `[Lin and 
Cohen](http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf)` instead?





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239203950
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,58 @@ setMethod("write.ml", signature(object = "LDAModel", 
path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call 
\code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input 
vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param sourceCol Param for the name of the input column for source 
vertex IDs.
+#' @param destinationCol Name of the input column for destination vertex 
IDs.
+#' @param weightCol Param for weight column name. If this is not set or 
\code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the 
corresponding cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases 
assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'   list(4L, 0L, 0.1)), schema = c("src", "dst", 
"weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", 
weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+  signature(data = "SparkDataFrame"),
+  function(data, k = 2L, initMode = c("random", "degree"), maxIter 
= 20L,
+sourceCol = "src", destinationCol = "dst", weightCol = NULL) {
+if (!is.numeric(k) || k < 1) {
+  stop("k should be a number with value >= 1.")
+}
+if (!is.integer(maxIter) || maxIter <= 0) {
+  stop("maxIter should be a number with value > 0.")
+}
--- End diff --

Can we make sure that the `src` and `dst` columns are int or bigint, 
too? Otherwise, we may hit an `IllegalArgumentException` from the Scala side.
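
One way such a check could look on the R side, as a minimal sketch only. It assumes the column types are available as a list of `c(name, type)` pairs, which is the shape SparkR's `dtypes()` returns; `check_vertex_cols` is a hypothetical helper, and the accepted type names (`int`, `bigint`) are taken from the question above rather than from the Scala implementation.

```r
# Hypothetical helper: verify that the vertex ID columns have an integral
# Spark SQL type. `types` is a list of c(name, type) pairs, e.g. the
# result of SparkR's dtypes(df); here it is built by hand so the sketch
# runs without a Spark session.
check_vertex_cols <- function(types, cols = c("src", "dst")) {
  # Build a lookup from column name to type string.
  type_of <- setNames(vapply(types, `[`, character(1), 2),
                      vapply(types, `[`, character(1), 1))
  for (col in cols) {
    if (!(col %in% names(type_of))) {
      stop("Column '", col, "' not found.")
    }
    if (!(type_of[[col]] %in% c("int", "bigint"))) {
      stop("Column '", col, "' should be int or bigint, but is ",
           type_of[[col]], ".")
    }
  }
  invisible(TRUE)
}

good <- list(c("src", "int"), c("dst", "int"), c("weight", "double"))
bad <- list(c("src", "double"), c("dst", "double"), c("weight", "double"))

check_vertex_cols(good)  # passes silently

# Capture the error message raised for the double-typed columns.
msg <- tryCatch(check_vertex_cols(bad), error = function(e) conditionMessage(e))
print(msg)
```

Whether a check like this belongs in the R wrapper or should stay on the Scala side is a design choice for the PR; the sketch only shows that the information needed is available from the schema.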





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239198848
  
--- Diff: R/pkg/tests/fulltests/test_mllib_clustering.R ---
@@ -319,4 +319,18 @@ test_that("spark.posterior and spark.perplexity", {
   expect_equal(length(local.posterior), sum(unlist(local.posterior)))
 })
 
+test_that("spark.assignClusters", {
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
--- End diff --

indentation?





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239197337
  
--- Diff: R/pkg/tests/fulltests/test_mllib_fpm.R ---
@@ -84,19 +84,21 @@ test_that("spark.fpGrowth", {
 })
 
 test_that("spark.prefixSpan", {
-df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
-  list(list(list(1L), list(3L, 2L), list(1L, 2L))),
-  list(list(list(1L, 2L), list(5L))),
-  list(list(list(6L, schema = c("sequence"))
-result1 <- spark.findFrequentSequentialPatterns(df, minSupport = 0.5, 
maxPatternLength = 5L,
-maxLocalProjDBSize = 
3200L)
-
-expected_result <- createDataFrame(list(list(list(list(1L)), 3L),
-list(list(list(3L)), 2L),
-list(list(list(2L)), 3L),
-list(list(list(1L, 2L)), 3L),
-list(list(list(1L), list(3L)), 
2L)),
-schema = c("sequence", "freq"))
-  })
+  df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
+list(list(list(1L), list(3L, 2L), list(1L, 2L))),
+list(list(list(1L, 2L), list(5L))),
+list(list(list(6L, schema = c("sequence"))
+  result1 <- spark.findFrequentSequentialPatterns(df, minSupport = 0.5, 
maxPatternLength = 5L,
+  maxLocalProjDBSize = 
3200L)
+
+  expected_result <- createDataFrame(list(list(list(list(1L)), 3L),
+  list(list(list(3L)), 2L),
+  list(list(list(2L)), 3L),
+  list(list(list(1L, 2L)), 3L),
+  list(list(list(1L), list(3L)), 
2L)),
+  schema = c("sequence", "freq"))
+
+  expect_equivalent(expected_result, result1)
--- End diff --

The `spark.prefixSpan` test case is outside the scope of this PR.
If we want to add the line `expect_equivalent(expected_result, result1)`, 
let's add it in another PR.





[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-05 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r239194803
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,58 @@ setMethod("write.ml", signature(object = "LDAModel", 
path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param sourceCol Param for the name of the input column for source vertex IDs.
+#' @param destinationCol Name of the input column for destination vertex IDs.
--- End diff --

nit. Here, `Name` -> `Param for the name` for consistency with the other
param descriptions?

Or, is it better to remove the `Param for` prefix in the other descriptions?


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-12-01 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r238087240
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/FPGrowthExample.scala ---
@@ -64,4 +64,3 @@ object FPGrowthExample {
 spark.stop()
   }
 }
-// scalastyle:on println
--- End diff --

yes, println is not used


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r237985559
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -1209,9 +1209,9 @@ class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReada
 .. note:: Experimental
 
 Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
-`Lin and Cohen `_. From the abstract:
+`Lin and Cohen `_. From the
 PIC finds a very low-dimensional embedding of a dataset using truncated power
-iteration on a normalized pair-wise similarity matrix of the data.
+abstract: iteration on a normalized pair-wise similarity matrix of the data.
--- End diff --

Could you check this again? It seems to break the original sentence 
accidentally. Maybe, `From the abstract:`?
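
For readers skimming the docstring quoted above, here is a minimal plain-Python sketch of the idea in that abstract: truncated power iteration on a row-normalized similarity matrix, followed by a split of the resulting one-dimensional embedding. It is an illustration only, not Spark's implementation. The function name `power_iteration_clusters` is hypothetical, only two clusters are produced, and a largest-gap split stands in for the k-means step. The edge list is the one used in the PR's R example.

```python
from collections import defaultdict


def power_iteration_clusters(edges, max_iter=10):
    """Toy two-cluster PIC sketch on an undirected, weighted edge list."""
    # Build a symmetric affinity map: affinity[i][j] = weight of edge (i, j).
    affinity = defaultdict(dict)
    for src, dst, weight in edges:
        affinity[src][dst] = weight
        affinity[dst][src] = weight
    vertices = sorted(affinity)
    degree = {i: sum(affinity[i].values()) for i in vertices}

    # "degree"-style initialization: start vector proportional to vertex degree.
    total = sum(degree.values())
    v = {i: degree[i] / total for i in vertices}

    # Truncated power iteration with W = D^-1 A, L1-normalized after each step.
    for _ in range(max_iter):
        nxt = {
            i: sum(w * v[j] for j, w in affinity[i].items()) / degree[i]
            for i in vertices
        }
        norm = sum(abs(x) for x in nxt.values())
        v = {i: x / norm for i, x in nxt.items()}

    # Assumption: instead of k-means, split the 1-D embedding at its largest gap.
    ordered = sorted(vertices, key=lambda i: v[i])
    gaps = [v[ordered[n + 1]] - v[ordered[n]] for n in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return {i: (0 if ordered.index(i) < cut else 1) for i in vertices}


# Same graph as the PR's example: a triangle {0, 1, 2}, a pair {3, 4},
# and one weak edge (weight 0.1) joining the two groups.
edges = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (3, 4, 1.0), (4, 0, 0.1)]
clusters = power_iteration_clusters(edges)
print(clusters)
```

The iteration is stopped early on purpose: run to convergence, the vector becomes constant across vertices, so it is the intermediate iterates that carry the cluster structure.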


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r237984857
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/PowerIterationClusteringWrapper.scala ---
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.ml.clustering.PowerIterationClustering
+
+private[r] object PowerIterationClusteringWrapper {
+  def getPowerIterationClustering(
+  k: Int,
+  initMode: String,
+  maxIter: Int,
+  srcCol: String,
+  dstCol: String,
+  weightCol: String): PowerIterationClustering = {
+val pic = new PowerIterationClustering()
+.setK(k)
--- End diff --

Indentation with two spaces?


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r237983768
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/FPGrowthExample.scala ---
@@ -64,4 +64,3 @@ object FPGrowthExample {
 spark.stop()
   }
 }
-// scalastyle:on println
--- End diff --

Of course, sure!


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-30 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r237966508
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/FPGrowthExample.scala ---
@@ -64,4 +64,3 @@ object FPGrowthExample {
 spark.stop()
   }
 }
-// scalastyle:on println
--- End diff --

@dongjoon-hyun Sorry, I missed the `// scalastyle:off println`.
Is it OK with you if I remove `// scalastyle:off println` too, since
`println` is not used in the example?


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r237956662
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/FPGrowthExample.scala ---
@@ -64,4 +64,3 @@ object FPGrowthExample {
 spark.stop()
   }
 }
-// scalastyle:on println
--- End diff --

Hi, @huaxingao. Let's not remove this. I understand the intention, but we
had better keep it, because it marks the end of the scope that starts at line 20.


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-28 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r237333561
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
--- End diff --

OK sounds good. Let's merge this one first just as a matter of process.


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-28 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r237332601
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
--- End diff --

The doc change will be in both 2.4 and master, but the R related code will 
be in master only. I think that's why @felixcheung asked me to open a separate 
PR to merge in the doc change for 2.4.


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-28 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r237330636
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
--- End diff --

Pardon, I'm catching up -- why just commit this doc to 2.4 and not master?


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-27 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r236787704
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
--- End diff --

sure


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-27 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r236771417
  
--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
 
 
 
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is  a scalable graph clustering algorithm
--- End diff --

could you open a separate PR with just this file (minus R) and 
FPGrowthExample.scala on branch-2.4?


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-17 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r234432181
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,57 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param srcCol Param for the name of the input column for source vertex IDs.
+#' @param dstCol Name of the input column for destination vertex IDs.
+#' @param weightCol Param for weight column name. If this is not set or \code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'   list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+  signature(data = "SparkDataFrame"),
+  function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src",
+dstCol = "dst", weightCol = NULL) {
--- End diff --

I think we try to avoid `srcCol` and `dstCol` in R (I think there are other R ml
APIs like that)


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-17 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r234432019
  
--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,57 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+#  Run the PIC algorithm and returns a cluster assignment for each input vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param srcCol Param for the name of the input column for source vertex IDs.
+#' @param dstCol Name of the input column for destination vertex IDs.
+#' @param weightCol Param for weight column name. If this is not set or \code{NULL},
+#'  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
+#' The schema of it will be:
+#' \code{id: Long}
+#' \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'   list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'   list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+  signature(data = "SparkDataFrame"),
+  function(data, k = 2L, initMode = "random", maxIter = 20L, srcCol = "src",
--- End diff --

Set valid values for `initMode` and check for them, e.g.
https://github.com/apache/spark/pull/23072/files#diff-d9f92e07db6424e2527a7f9d7caa9013R355

and use `match.arg(initMode)`


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-17 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/23072#discussion_r234432049
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -968,6 +970,17 @@ predicted <- predict(model, df)
 head(predicted)
 ```
 
+ Power Iteration Clustering
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm. `spark.assignClusters` method runs the PIC algorithm and returns a cluster assignment for each input vertex.
+
+```{r}
+df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+  list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+  list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
+head(spark.assignClusters(df, initMode="degree", weightCol="weight"))
--- End diff --

spacing: `initMode = "degree", weightCol = "weight"`


---




[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC

2018-11-17 Thread huaxingao
GitHub user huaxingao opened a pull request:

https://github.com/apache/spark/pull/23072

[SPARK-19827][R]spark.ml R API for PIC

## What changes were proposed in this pull request?

Add PowerIterationClustering (PIC) in R
## How was this patch tested?
Add test case


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/huaxingao/spark spark-19827

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23072.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23072


commit 9e2b0f9ffe0866fa328bc677500e4f3a49ff384b
Author: Huaxin Gao 
Date:   2018-11-17T21:25:46Z

[SPARK-19827][R]spark.ml R API for PIC




---
