[GitHub] spark issue #21465: [SPARK-24333][ML][PYTHON]Add fit with validation set to ...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21465

@BryanCutler Thank you very much for your help!
[GitHub] spark pull request #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to ...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/23256

[SPARK-24207][R] follow-up PR for SPARK-24207 to fix code style problems

## What changes were proposed in this pull request?

Follow-up PR for SPARK-24207 to fix code style problems.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-24207-cnt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23256.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23256

commit 04ec3e18e71b0f0e0d212a3532a6dc06518d61c0
Author: Huaxin Gao
Date: 2018-12-07T19:55:50Z

    fix code style problems
[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r239904265

--- Diff: python/pyspark/ml/param/shared.py ---
@@ -814,3 +814,25 @@ def getDistanceMeasure(self):
         """
         return self.getOrDefault(self.distanceMeasure)
+
+class HasValidationIndicatorCol(Params):
--- End diff --

You are right. DecisionTreeParams should be at the bottom.
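For context, shared params in ```python/pyspark/ml/param/shared.py``` follow a simple mixin pattern. A minimal sketch of what a ```HasValidationIndicatorCol``` mixin typically looks like (the doc string wording here is illustrative, not the exact generated code):

```
from pyspark.ml.param import Param, Params, TypeConverters


class HasValidationIndicatorCol(Params):
    """Mixin for param validationIndicatorCol."""

    validationIndicatorCol = Param(
        Params._dummy(), "validationIndicatorCol",
        "name of the column that indicates whether each row is for training "
        "or for validation.",
        typeConverter=TypeConverters.toString)

    def getValidationIndicatorCol(self):
        """Gets the value of validationIndicatorCol or its default value."""
        return self.getOrDefault(self.validationIndicatorCol)
```

```getOrDefault``` resolves an explicitly set value first and falls back to any default registered for the param, so estimators pick the param up simply by inheriting the mixin.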
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239626824

--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,58 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"),
           function(object, path, overwrite = FALSE) {
             write_internal(object, path, overwrite)
           })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+# Run the PIC algorithm and returns a cluster assignment for each input vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param sourceCol Param for the name of the input column for source vertex IDs.
+#' @param destinationCol Name of the input column for destination vertex IDs.
+#' @param weightCol Param for weight column name. If this is not set or \code{NULL},
+#'                  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
+#'         The schema of it will be:
+#'         \code{id: Long}
+#'         \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'                            list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'                            list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+          signature(data = "SparkDataFrame"),
+          function(data, k = 2L, initMode = c("random", "degree"), maxIter = 20L,
+                   sourceCol = "src", destinationCol = "dst", weightCol = NULL) {
+            if (!is.numeric(k) || k < 1) {
+              stop("k should be a number with value >= 1.")
+            }
+            if (!is.integer(maxIter) || maxIter <= 0) {
+              stop("maxIter should be a number with value > 0.")
+            }
--- End diff --

It seems to me that R is a thin wrapper here: we only need to create a PIC object and call the corresponding Scala method. SparkDataFrame column types are only checked in Scala, not in R.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239626871

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
+developed by <a href="http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf">Lin and Cohen</a>.
--- End diff --

Thanks. I will change the hyperlink.
[GitHub] spark issue #21465: [SPARK-24333][ML][PYTHON]Add fit with validation set to ...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21465

@BryanCutler Thank you very much for your review! I will submit changes soon.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239250376

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
+developed by <a href="http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf">Lin and Cohen</a>.
--- End diff --

I normally check the md file on GitHub, and the link works OK. Is there a better way to check? @dongjoon-hyun @felixcheung https://github.com/apache/spark/blob/9158da8cb76cc13f3011deaa7ac2c290eef62389/docs/ml-clustering.md

I guess I will still remove the ```<a href=...>``` tag, since no other place in the doc uses one.
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/23072

@dongjoon-hyun Thank you very much for your review. I will make the changes soon.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239250335

--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,58 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"),
           function(object, path, overwrite = FALSE) {
             write_internal(object, path, overwrite)
           })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+# Run the PIC algorithm and returns a cluster assignment for each input vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param sourceCol Param for the name of the input column for source vertex IDs.
+#' @param destinationCol Name of the input column for destination vertex IDs.
+#' @param weightCol Param for weight column name. If this is not set or \code{NULL},
+#'                  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
+#'         The schema of it will be:
+#'         \code{id: Long}
+#'         \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'                            list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'                            list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+          signature(data = "SparkDataFrame"),
+          function(data, k = 2L, initMode = c("random", "degree"), maxIter = 20L,
+                   sourceCol = "src", destinationCol = "dst", weightCol = NULL) {
+            if (!is.numeric(k) || k < 1) {
+              stop("k should be a number with value >= 1.")
+            }
+            if (!is.integer(maxIter) || maxIter <= 0) {
+              stop("maxIter should be a number with value > 0.")
+            }
--- End diff --

@dongjoon-hyun ```src``` and ```dst``` are character columns. I have a check for the character type:

```
as.character(sourceCol), as.character(destinationCol)
```
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239238873

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
+developed by <a href="http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf">Lin and Cohen</a>.
+From the abstract: PIC finds a very low-dimensional embedding of a dataset
+using truncated power iteration on a normalized pair-wise similarity matrix of the data.
+
+`spark.ml`'s PowerIterationClustering implementation takes the following parameters:
+
+* `k`: the number of clusters to create
+* `initMode`: param for the initialization algorithm
+* `maxIter`: param for maximum number of iterations
+* `srcCol`: param for the name of the input column for source vertex IDs
+* `dstCol`: name of the input column for destination vertex IDs
+* `weightCol`: Param for weight column name
+
+**Examples**
+
+Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.PowerIterationClustering) for more details.
+
+{% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
+
+Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.
+
+{% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
--- End diff --

@dongjoon-hyun https://github.com/apache/spark/pull/22996

I will add the Python example in the doc once the above PR is merged in.
[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r239173098

--- Diff: python/pyspark/ml/classification.py ---
@@ -1174,9 +1165,31 @@ def trees(self):
         return [DecisionTreeClassificationModel(m) for m in list(self._call_java("trees"))]
+
+class GBTClassifierParams(GBTParams, HasVarianceImpurity):
--- End diff --

@BryanCutler Thanks for your review. It seems that https://github.com/apache/spark/pull/22986 recently added ```trait HasVarianceImpurity``` and made ```private[ml] trait GBTClassifierParams extends GBTParams with HasVarianceImpurity```.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r237966508

--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/FPGrowthExample.scala ---
@@ -64,4 +64,3 @@ object FPGrowthExample {
     spark.stop()
   }
 }
-// scalastyle:on println
--- End diff --

@dongjoon-hyun Sorry, I missed the ```// scalastyle:off println```. Is it OK with you if I remove ```// scalastyle:off println``` too, since ```println``` is not used in the example?
[GitHub] spark pull request #23168: [SPARK-26207][doc]add PowerIterationClustering (P...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23168#discussion_r237661961

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,38 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
+developed by <a href="http://www.icml2010.org/papers/387.pdf">Lin and Cohen</a>.
--- End diff --

@shahidki31 Changed. Thanks!
[GitHub] spark pull request #23161: [SPARK-26189][R]Fix unionAll doc in SparkR
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23161#discussion_r237628719

--- Diff: R/pkg/R/DataFrame.R ---
@@ -2732,13 +2732,24 @@ setMethod("union",
             dataFrame(unioned)
           })
-#' Return a new SparkDataFrame containing the union of rows
-#'
-#' This is an alias for `union`.
+#' Return a new SparkDataFrame containing the union of rows.
+#' This is an alias for \code{union}.
--- End diff --

You are right: it is the title, and a newline is required. I checked manually.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r237332601

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
--- End diff --

The doc change will be in both 2.4 and master, but the R-related code will be in master only. I think that's why @felixcheung asked me to open a separate PR to merge the doc change into 2.4.
[GitHub] spark issue #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/23168

@srowen It's not in master yet. The PR is here: https://github.com/apache/spark/pull/23072
[GitHub] spark issue #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/23168

@felixcheung Could you please review?
[GitHub] spark pull request #23168: [SPARK-26207][doc]add PowerIterationClustering (P...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/23168

[SPARK-26207][doc]add PowerIterationClustering (PIC) doc in 2.4 branch

## What changes were proposed in this pull request?

Add the PIC doc in 2.4.

## How was this patch tested?

Manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-26207

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23168.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23168

commit 11363fbfd765d53f87c8908266540f8fec77fe7a
Author: Huaxin Gao
Date: 2018-11-28T17:52:37Z

    [SPARK-26207][doc]add PowerIterationClustering (PIC) doc in 2.4 branch
[GitHub] spark pull request #23161: [SPARK-26189][R]Fix unionAll doc in SparkR
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/23161

[SPARK-26189][R]Fix unionAll doc in SparkR

## What changes were proposed in this pull request?

Fix the unionAll doc in SparkR.

## How was this patch tested?

Manually ran tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-26189

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23161.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23161

commit 423711ec45883822942be309d8052cee976ef8c0
Author: Huaxin Gao
Date: 2018-11-28T04:22:33Z

    [SPARK-26189][R]Fix unionAll doc in SparkR
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r236787704

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
--- End diff --

Sure.
[GitHub] spark pull request #23157: [SPARK-26185][PYTHON]add weightCol in python Mult...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/23157

[SPARK-26185][PYTHON]add weightCol in python MulticlassClassificationEvaluator

## What changes were proposed in this pull request?

Add weightCol for the Python versions of MulticlassClassificationEvaluator and MulticlassMetrics.

## How was this patch tested?

Added a doctest.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-26185

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23157.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23157

commit e7cf84521f331b75d645654573696be74d7b0b06
Author: Huaxin Gao
Date: 2018-11-27T17:52:58Z

    [SPARK-26185][PYTHON]add weightCol in python MulticlassClassificationEvaluator
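To illustrate what the new parameter enables, here is a minimal usage sketch (the tiny dataset, the column names, and an already-created SparkSession named ```spark``` are assumptions for illustration):

```
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Rows of (prediction, label, weight); the weight lets a row count
# more or less toward the metric.
df = spark.createDataFrame(
    [(0.0, 0.0, 1.0), (1.0, 1.0, 2.0), (0.0, 1.0, 1.5)],
    ["prediction", "label", "weight"])

evaluator = MulticlassClassificationEvaluator(
    predictionCol="prediction", labelCol="label",
    metricName="accuracy", weightCol="weight")

# Weighted accuracy: (1.0 + 2.0) / (1.0 + 2.0 + 1.5) ~= 0.667
print(evaluator.evaluate(df))
```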
[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r235488017

--- Diff: python/pyspark/ml/classification.py ---
@@ -1176,8 +1176,8 @@ def trees(self):
 @inherit_doc
 class GBTClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
-                    GBTParams, HasCheckpointInterval, HasStepSize, HasSeed, JavaMLWritable,
-                    JavaMLReadable):
+                    GBTParams, HasCheckpointInterval, HasStepSize, HasSeed,
+                    HasValidationIndicatorCol, JavaMLWritable, JavaMLReadable):
--- End diff --

@BryanCutler Thank you very much for reviewing my PR. I moved HasValidationIndicatorCol, HasMaxIter, and HasStepSize to GBTParams.
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/23072

retest this please
[GitHub] spark issue #20442: [SPARK-23265][ML]Update multi-column error handling logi...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/20442

retest this please
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/23072

[SPARK-19827][R]spark.ml R API for PIC

## What changes were proposed in this pull request?

Add PowerIterationClustering (PIC) in R.

## How was this patch tested?

Added a test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-19827

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23072.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23072

commit 9e2b0f9ffe0866fa328bc677500e4f3a49ff384b
Author: Huaxin Gao
Date: 2018-11-17T21:25:46Z

    [SPARK-19827][R]spark.ml R API for PIC
[GitHub] spark issue #22996: [SPARK-25997][ML]add Python example code for Power Itera...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22996

@holdenk Yes, it is. I will include the examples in ml-clustering.md.
[GitHub] spark pull request #22996: add Python example code for Power Iteration Clust...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22996

add Python example code for Power Iteration Clustering in spark.ml

## What changes were proposed in this pull request?

Add a Python example for Power Iteration Clustering in spark.ml.

## How was this patch tested?

Manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-25997

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22996.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22996

commit 905b542a8618269bdc079f3c335a80c13d2214fa
Author: Huaxin Gao
Date: 2018-11-09T22:32:17Z

    add Python example code for Power Iteration Clustering in spark.ml
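The spark.ml API this example targets can be exercised roughly like this (a sketch assuming an active SparkSession named ```spark```; the edge list mirrors the doc example):

```
from pyspark.ml.clustering import PowerIterationClustering

# Each row is a weighted edge of the similarity graph: (src, dst, weight).
df = spark.createDataFrame(
    [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (3, 4, 1.0), (4, 0, 0.1)],
    ["src", "dst", "weight"])

pic = PowerIterationClustering(k=2, maxIter=20, initMode="degree",
                               weightCol="weight")

# PIC returns the cluster assignments directly rather than fitting a model.
assignments = pic.assignClusters(df)  # columns: id, cluster
assignments.show()
```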
[GitHub] spark issue #22788: [SPARK-25769][SQL]make UnresolvedAttribute.sql escape ne...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22788

@cloud-fan @dongjoon-hyun Because of the above test failures in ```ExpressionTypeCheckingSuite```, shall I revert to the previous change?

```
override def sql: String = nameParts.map { part =>
  part match {
    case ParserUtils.escapedIdentifier(_) | ParserUtils.qualifiedEscapedIdentifier(_, _) => part
    case _ => quoteIdentifier(part)
  }
}.mkString(".")
```

Or change ```ExpressionTypeCheckingSuite```?
[GitHub] spark issue #22788: [SPARK-25769][SQL]make UnresolvedAttribute.sql escape ne...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22788

I have a question regarding the test failures in ```ExpressionTypeCheckingSuite```. Most of the tests in this suite failed after I changed ```UnresolvedAttribute.sql``` to be the same as ```UnresolvedAttribute.name```. For example,

```
test("check types for unary arithmetic") {
  assertError(BitwiseNot('stringField), "requires integral type")
}
```

The test failed with

```
"cannot resolve '~`stringField`' due to data type mismatch: argument 1 requires integral type, however, '`stringField`' is of string type.;" did not contain "cannot resolve '~stringField' due to data type mismatch:"
```

It seems that the root cause of the failure is that ```UnresolvedAttribute.sql``` no longer matches ```AttributeReference.sql``` after my fix. I doubt whether it is correct to make ```UnresolvedAttribute.sql``` the same as ```UnresolvedAttribute.name```.
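For intuition, the quoting rule under discussion can be modeled in a few lines (a toy Python sketch of the backtick convention, not Spark's actual Scala ```quoteIdentifier```):

```
def quote_identifier(name: str) -> str:
    # Wrap the identifier in backticks, doubling any embedded backtick,
    # which is the escaping convention Spark SQL uses in error messages.
    return "`" + name.replace("`", "``") + "`"


# The failing assertion compares a quoted form against an unquoted one:
print(quote_identifier("stringField"))        # `stringField`
print("~" + quote_identifier("stringField"))  # ~`stringField`, vs expected ~stringField
```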
[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r229583440

--- Diff: sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out ---
@@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1
 struct<>
 -- !query 18 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
+cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
--- End diff --

@dilipbiswal Currently ```sql``` is not the same as ```name```.
[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r229479893

--- Diff: sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out ---
@@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1
 struct<>
 -- !query 18 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
+cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
--- End diff --

Yes, I agree. For the four examples, we will have the following results:

```
$"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
```
with my previous fix:
```
`a`.`b`
```
make sql the same as name:
```
a.b
```

```
$"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
```
with my previous fix:
```
`a`.`b`
```
make sql the same as name:
```
`a.b`
```

```
$"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
```
with my previous fix:
```
`a`.`b`
```
make sql the same as name:
```
a.b
```

```
$"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
```
with my previous fix:
```
`a.b`.`c`
```
make sql the same as name:
```
`a.b`.c
```

Does this look good to everybody?
[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r229354247

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2856,6 +2856,21 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     checkAnswer(sql("select 26393499451 / (1e6 * 1000)"), Row(BigDecimal("26.393499451")))
   }
 }
+
+  test("SPARK-25769 escape nested columns") {
+    import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
+    val sql1 = $"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
+    assert(sql1 === "`a`.`b`")
+
+    val sql2 = $"`a.b`".expr.asInstanceOf[UnresolvedAttribute].sql
+    assert(sql2 === "`a.b`")
+
+    val sql3 = $"`a`.b".expr.asInstanceOf[UnresolvedAttribute].sql
+    assert(sql3 === "`a`.`b`")
+
+    val sql4 = $"`a.b`.c".expr.asInstanceOf[UnresolvedAttribute].sql
+    assert(sql4 === "`a.b`.`c`")
+  }
--- End diff --

Merged. Thank you very much! @dongjoon-hyun
[GitHub] spark issue #22863: [SPARK-25859][ML]add scala/java/python example and doc f...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22863

Thanks @felixcheung
[GitHub] spark pull request #22863: [SPARK-25859][ML]add scala/java/python example an...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/22863
[GitHub] spark issue #22863: [SPARK-25859][ML]add scala/java/python example and doc f...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22863

@felixcheung
[GitHub] spark pull request #22863: [SPARK-25859][ML]add scala/java/python example an...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22863

[SPARK-25859][ML]add scala/java/python example and doc for PrefixSpan

## What changes were proposed in this pull request?

Add Scala/Java/Python examples and doc for PrefixSpan in branch 2.4.

## How was this patch tested?

Manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark mydocbranch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22863.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22863

commit 3109c213c2f875ea7099929621a3be18b5f02862
Author: Huaxin Gao
Date: 2018-10-27T18:14:36Z

    [SPARK-25859][ML]add scala/java/python example and doc for PrefixSpan
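For reference, the PrefixSpan API the examples document can be exercised like this (a sketch assuming an active SparkSession named ```spark```; the sequence data is the usual toy example):

```
from pyspark.ml.fpm import PrefixSpan
from pyspark.sql import Row

# Each row is a sequence of itemsets.
df = spark.createDataFrame([
    Row(sequence=[[1, 2], [3]]),
    Row(sequence=[[1], [3, 2], [1, 2]]),
    Row(sequence=[[1, 2], [5]]),
    Row(sequence=[[6]])])

prefix_span = PrefixSpan(minSupport=0.5, maxPatternLength=5)

# Returns a DataFrame of (sequence, freq) rows.
prefix_span.findFrequentSequentialPatterns(df).show()
```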
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21710

@felixcheung I am terribly sorry that I missed your comment about the ml doc and example for 2.4. Is there still time to merge into 2.4? I saw one of my PRs get merged into 2.4 last night. I can submit a PR for the 2.4 doc and example now. Please let me know. Thanks a lot!
[GitHub] spark issue #22295: [SPARK-25255][PYTHON]Add getActiveSession to SparkSessio...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22295

Thank you very much for your help! @holdenk @HyukjinKwon
[GitHub] spark pull request #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for class...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22790#discussion_r228274470

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala ---
@@ -109,7 +109,7 @@ class BisectingKMeansModel private[clustering] (
   @Since("2.0.0")
   override def save(sc: SparkContext, path: String): Unit = {
-    BisectingKMeansModel.SaveLoadV1_0.save(sc, this, path)
+    BisectingKMeansModel.SaveLoadV2_0.save(sc, this, path)
   }

   override protected def formatVersion: String = "1.0"
--- End diff --

I changed the ```formatVersion``` to 2.0. There are quite a few files that implement the trait ```Saveable``` and have a ```formatVersion```, and I don't feel comfortable changing other files in this PR. Maybe I will open a separate JIRA to remove ```formatVersion``` from the trait ```Saveable```?
[GitHub] spark issue #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for classNameV2_...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22790

I added a regression test in ```org.apache.spark.mllib.clustering.BisectingKMeansSuite```. I could add the following test in the ml package:

```
test("SPARK-25793") {
  val bisectingKMeans = new BisectingKMeans()
  bisectingKMeans.setDistanceMeasure(DistanceMeasure.COSINE)
  val readBisectingKMeans = testDefaultReadWrite(bisectingKMeans)
  assert(bisectingKMeans.distanceMeasure === readBisectingKMeans.distanceMeasure)
}
```

But the bug doesn't really affect the above test. With the bug, even though the mllib ```BisectingKMeansModel.load``` will call the V1_0 load and give a model with the default value of distanceMeasure, in the ml package ```BisectingKMeansModelReader.load``` will call ```metadata.getAndSetParams(model)```, which sets distanceMeasure to the correct value (DistanceMeasure.COSINE):

```
override def load(path: String): BisectingKMeansModel = {
  val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
  val dataPath = new Path(path, "data").toString
  val mllibModel = MLlibBisectingKMeansModel.load(sc, dataPath)
  val model = new BisectingKMeansModel(metadata.uid, mllibModel)
  metadata.getAndSetParams(model)
  model
}
```
[GitHub] spark issue #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for classNameV2_...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22790

retest this please
[GitHub] spark pull request #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for class...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22790#discussion_r227229331

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala ---
@@ -126,7 +126,7 @@ object BisectingKMeansModel extends Loader[BisectingKMeansModel] {
     val model = SaveLoadV1_0.load(sc, path)
     model
   case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) =>
-    val model = SaveLoadV1_0.load(sc, path)
+    val model = SaveLoadV2_0.load(sc, path)
--- End diff --

@viirya @mgaido91 I will change ```BisectingKMeansModel.save``` to

```
BisectingKMeansModel.SaveLoadV2_0.save(sc, this, path)
```
[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r227152273

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2702,7 +2702,7 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     val e = intercept[AnalysisException](sql("SELECT v.i from (SELECT i FROM v)"))
     assert(e.message ==
-      "cannot resolve '`v.i`' given input columns: [__auto_generated_subquery_name.i]")
--- End diff --

The outermost backticks are added in the ```case _``` branch of ```case class UnresolvedAttribute(nameParts: Seq[String])```:

```
override def sql: String = name match {
  case ParserUtils.escapedIdentifier(_) | ParserUtils.qualifiedEscapedIdentifier(_, _) => name
  case _ => quoteIdentifier(name)
}
```

Here ```nameParts``` is a Seq of ```"v"```, ```"i"``` and ```name``` is ```v.i```.
[GitHub] spark pull request #22793: [SPARK-25793][ML]Call SaveLoadV2_0.load for class...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/22793
[GitHub] spark issue #22793: [SPARK-25793][ML]Call SaveLoadV2_0.load for classNameV2_...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22793

@WeichenXu123 I created two PRs for this JIRA. I had trouble creating the first one, so I created another. I will close this PR; please use the other one. Thanks!
[GitHub] spark pull request #22793: [SPARK-25793][ML]Call SaveLoadV2_0.load for class...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22793

[SPARK-25793][ML]Call SaveLoadV2_0.load for classNameV2_0

## What changes were proposed in this pull request?

The wrong version of load is called in BisectingKMeansModel.load:

```
case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) =>
  val model = SaveLoadV1_0.load(sc, path)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark25793

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22793.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22793
[GitHub] spark pull request #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for class...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22790

[SPARK-25793][ML]call SaveLoadV2_0.load for classNameV2_0

## What changes were proposed in this pull request?

The following code in BisectingKMeansModel.load calls the wrong version of load:

```
case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) =>
  val model = SaveLoadV1_0.load(sc, path)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-25793

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22790.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22790
[GitHub] spark pull request #22788: [SPARK-25769][SQL]change nested columns from `a.b...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r226872842

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala ---
@@ -98,8 +98,18 @@ case class UnresolvedTableValuedFunction(
  */
 case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute with Unevaluable {
-  def name: String =
-    nameParts.map(n => if (n.contains(".")) s"`$n`" else n).mkString(".")
+  def name: String = {
+    nameParts.map(n =>
+      if (n.contains(".")) s"`$n`"
+      else {
+        if (nameParts.length > 1) {
+          s"`$n`"
+        } else {
+          n
+        }
+      }
+    ).mkString(".")
+  }
--- End diff --

@dongjoon-hyun Thanks! Fixed.
[GitHub] spark pull request #22788: [SPARK-25769][SQL]change nested columns from `a.b...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22788

[SPARK-25769][SQL]change nested columns from `a.b` to `a`.`b`

## What changes were proposed in this pull request?

Currently, ```$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql``` returns

```
res1: String = `a.b`
```

This PR will change the nested column to

```
`a`.`b`
```

## How was this patch tested?

I didn't add any new tests. However, I changed several of the existing tests to expect nested columns in the format

```
`a`.`b`
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-25769

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22788.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22788

commit 2aa4ad97c8b9df5fdd022653549a0252dbb819b6
Author: Huaxin Gao
Date: 2018-10-21T16:13:49Z

    [SPARK-25769][SQL]change nested columns from `a.b` to `a`.`b`
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r226178191

--- Diff: python/pyspark/sql/tests.py ---
@@ -3863,6 +3863,145 @@ def test_jvm_default_session_already_set(self):
         spark.stop()
+
+class SparkSessionTests2(unittest.TestCase):
+
+    def test_active_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            activeSession = SparkSession.getActiveSession()
+            df = activeSession.createDataFrame([(1, 'Alice')], ['age', 'name'])
+            self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')])
+        finally:
+            spark.stop()
+
+    def test_get_active_session_when_no_active_session(self):
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, spark)
+        spark.stop()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
+
+    def test_SparkSession(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .config("some-config", "v2") \
+            .getOrCreate()
+        try:
+            self.assertEqual(spark.conf.get("some-config"), "v2")
+            self.assertEqual(spark.sparkContext._conf.get("some-config"), "v2")
+            self.assertEqual(spark.version, spark.sparkContext.version)
+            spark.sql("CREATE DATABASE test_db")
+            spark.catalog.setCurrentDatabase("test_db")
+            self.assertEqual(spark.catalog.currentDatabase(), "test_db")
+            spark.sql("CREATE TABLE table1 (name STRING, age INT) USING parquet")
+            self.assertEqual(spark.table("table1").columns, ['name', 'age'])
+            self.assertEqual(spark.range(3).count(), 3)
+        finally:
+            spark.stop()
+
+    def test_global_default_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            self.assertEqual(SparkSession.builder.getOrCreate(), spark)
+        finally:
+            spark.stop()
+
+    def test_default_and_active_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        activeSession = spark._jvm.SparkSession.getActiveSession()
+        defaultSession = spark._jvm.SparkSession.getDefaultSession()
+        try:
+            self.assertEqual(activeSession, defaultSession)
+        finally:
+            spark.stop()
+
+    def test_config_option_propagated_to_existing_SparkSession(self):
--- End diff --

Will change. Thanks!
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r226178127

--- Diff: python/pyspark/sql/tests.py ---
@@ -3863,6 +3863,145 @@ def test_jvm_default_session_already_set(self):
         spark.stop()
+
+class SparkSessionTests2(unittest.TestCase):
+
+    def test_active_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            activeSession = SparkSession.getActiveSession()
+            df = activeSession.createDataFrame([(1, 'Alice')], ['age', 'name'])
+            self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')])
+        finally:
+            spark.stop()
+
+    def test_get_active_session_when_no_active_session(self):
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, spark)
+        spark.stop()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
+
+    def test_SparkSession(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .config("some-config", "v2") \
+            .getOrCreate()
+        try:
+            self.assertEqual(spark.conf.get("some-config"), "v2")
+            self.assertEqual(spark.sparkContext._conf.get("some-config"), "v2")
+            self.assertEqual(spark.version, spark.sparkContext.version)
+            spark.sql("CREATE DATABASE test_db")
+            spark.catalog.setCurrentDatabase("test_db")
+            self.assertEqual(spark.catalog.currentDatabase(), "test_db")
+            spark.sql("CREATE TABLE table1 (name STRING, age INT) USING parquet")
+            self.assertEqual(spark.table("table1").columns, ['name', 'age'])
+            self.assertEqual(spark.range(3).count(), 3)
+        finally:
+            spark.stop()
+
+    def test_global_default_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            self.assertEqual(SparkSession.builder.getOrCreate(), spark)
+        finally:
+            spark.stop()
+
+    def test_default_and_active_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        activeSession = spark._jvm.SparkSession.getActiveSession()
+        defaultSession = spark._jvm.SparkSession.getDefaultSession()
+        try:
+            self.assertEqual(activeSession, defaultSession)
+        finally:
+            spark.stop()
+
+    def test_config_option_propagated_to_existing_SparkSession(self):
+        session1 = SparkSession.builder \
+            .master("local") \
+            .config("spark-config1", "a") \
+            .getOrCreate()
+        self.assertEqual(session1.conf.get("spark-config1"), "a")
+        session2 = SparkSession.builder \
+            .config("spark-config1", "b") \
+            .getOrCreate()
+        try:
+            self.assertEqual(session1, session2)
+            self.assertEqual(session1.conf.get("spark-config1"), "b")
+        finally:
+            session1.stop()
+
+    def test_new_session(self):
+        session = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        newSession = session.newSession()
+        try:
+            self.assertNotEqual(session, newSession)
+        finally:
+            session.stop()
+            newSession.stop()
+
+    def test_create_new_session_if_old_session_stopped(self):
+        session = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        session.stop()
+        newSession = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            self.assertNotEqual(session, newSession)
+        finally:
+            newSession.stop()
+
+    def test_active_session_with_None_and_not_None_context(self):
+        from pyspark.context import SparkContext
+        from pyspark.conf import SparkConf
+        sc = SparkContext._active_spark_co
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r226178054

--- Diff: python/pyspark/sql/functions.py ---
@@ -2713,6 +2713,25 @@ def from_csv(col, schema, options={}):
     return Column(jc)
+
+@since(3.0)
+def _getActiveSession():
--- End diff --

Do you mean the _ prefix or the function itself?
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r225667299

--- Diff: python/pyspark/sql/tests.py ---
@@ -3654,6 +3654,109 @@ def test_jvm_default_session_already_set(self):
         spark.stop()
+
+class SparkSessionTests2(unittest.TestCase):
+
+    def test_active_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            activeSession = SparkSession.getActiveSession()
+            df = activeSession.createDataFrame([(1, 'Alice')], ['age', 'name'])
+            self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')])
+        finally:
+            spark.stop()
+
+    def test_get_active_session_when_no_active_session(self):
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, spark)
+        spark.stop()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
--- End diff --

Thanks @holdenk. I will add a test for the above comment, and also add a test for your comment regarding

```
self._jvm.SparkSession.setActiveSession(self._jsparkSession)
```
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r225667174

--- Diff: python/pyspark/sql/functions.py ---
@@ -2633,6 +2633,23 @@ def sequence(start, stop, step=None):
         _to_java_column(start), _to_java_column(stop), _to_java_column(step)))
+
+@since(3.0)
+def getActiveSession():
+    """
+    Returns the active SparkSession for the current thread
+    """
+    from pyspark.sql import SparkSession
+    sc = SparkContext._active_spark_context
--- End diff --

@holdenk @HyukjinKwon Thanks for the comments. I checked Scala's behavior:

```
test("my test") {
  val cx = SparkContext.getActive
  val session = SparkSession.getActiveSession
  println(cx)
  println(session)
}
```

The result is

```
None
None
```

So it returns None if sc is None. Actually, my current code already returns None if sc is None, but I will change the code a bit to make it more obvious. I will also add a _ prefix to the function name and mention in the docstring that this function is not supposed to be called directly.
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r225666954

--- Diff: python/pyspark/sql/session.py ---
@@ -231,6 +231,7 @@ def __init__(self, sparkContext, jsparkSession=None):
                 or SparkSession._instantiatedSession._sc._jsc is None:
             SparkSession._instantiatedSession = self
             self._jvm.SparkSession.setDefaultSession(self._jsparkSession)
+            self._jvm.SparkSession.setActiveSession(self._jsparkSession)
--- End diff --

@holdenk @HyukjinKwon Thanks for the comments. I looked at the Scala code; it calls ```setActiveSession``` in ```createDataFrame```:

```
def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame = {
  SparkSession.setActiveSession(this)
  ...
}
```

I will do the same for Python:

```
def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True):
    SparkSession._activeSession = self
    self._jvm.SparkSession.setActiveSession(self._jsparkSession)
```

I will also add a test.
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r223760003

--- Diff: examples/src/main/python/ml/prefixspan_example.py ---
@@ -0,0 +1,48 @@
+#
--- End diff --

@felixcheung I don't think the doc and the Java/Scala/Python examples are in 2.4. Shall I open a separate JIRA to merge in the fix?
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r223165392

--- Diff: python/pyspark/sql/session.py ---
@@ -252,6 +255,20 @@ def newSession(self):
         """
         return self.__class__(self._sc, self._jsparkSession.newSession())
+
+    @classmethod
+    @since(2.5)
+    def getActiveSession(cls):
+        """
+        Returns the active SparkSession for the current thread, returned by the builder.
+        >>> s = SparkSession.getActiveSession()
+        >>> l = [('Alice', 1)]
+        >>> rdd = s.sparkContext.parallelize(l)
+        >>> df = s.createDataFrame(rdd, ['name', 'age'])
+        >>> df.select("age").collect()
+        [Row(age=1)]
+        """
+        return cls._activeSession
--- End diff --

@HyukjinKwon I am not sure if I follow your suggestion correctly. Does the following look right to you?

session.py:

```
@classmethod
@since(3.0)
def getActiveSession(cls):
    from pyspark.sql import functions
    return functions.getActiveSession()
```

functions.py:

```
@since(3.0)
def getActiveSession():
    from pyspark.sql import SparkSession
    sc = SparkContext._active_spark_context
    if sc is None:
        sc = SparkContext()
    if sc._jvm.SparkSession.getActiveSession().isDefined():
        SparkSession(sc, sc._jvm.SparkSession.getActiveSession().get())
        return SparkSession._activeSession
    else:
        return None
```
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r221394694

--- Diff: python/pyspark/sql/session.py ---
@@ -231,6 +231,7 @@ def __init__(self, sparkContext, jsparkSession=None):
                 or SparkSession._instantiatedSession._sc._jsc is None:
             SparkSession._instantiatedSession = self
             self._jvm.SparkSession.setDefaultSession(self._jsparkSession)
+            self._jvm.SparkSession.setActiveSession(self._jsparkSession)
--- End diff --

@HyukjinKwon Do you mean something like this:

```
def test_two_spark_session(self):
    session1 = None
    session2 = None
    try:
        session1 = SparkSession.builder.config("key1", "value1").getOrCreate()
        session2 = SparkSession.builder.config("key2", "value2").getOrCreate()
        self.assertEqual(session1, session2)
        df = session1.createDataFrame([(1, 'Alice')], ['age', 'name'])
        self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')])
        activeSession1 = session1.getActiveSession()
        activeSession2 = session2.getActiveSession()
        self.assertEqual(activeSession1, activeSession2)
    finally:
        if session1 is not None:
            session1.stop()
        if session2 is not None:
            session2.stop()
```
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r221089916

--- Diff: python/pyspark/sql/session.py ---
@@ -231,6 +231,7 @@ def __init__(self, sparkContext, jsparkSession=None):
                 or SparkSession._instantiatedSession._sc._jsc is None:
             SparkSession._instantiatedSession = self
             self._jvm.SparkSession.setDefaultSession(self._jsparkSession)
+            self._jvm.SparkSession.setActiveSession(self._jsparkSession)
--- End diff --

@HyukjinKwon It seems to me that the active session is set correctly in ```__init__```. When ```createDataFrame``` is called, we already have a session, and the active session was already set in ```__init__```.
[GitHub] spark issue #22295: [SPARK-25255][PYTHON]Add getActiveSession to SparkSessio...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22295

I just saw this fix: [SPARK-25525][SQL][PYSPARK] Do not update conf for existing SparkContext in SparkSession.getOrCreate. #22545

I will remove ```test_create_SparkContext_then_SparkSession```.
[GitHub] spark issue #22537: [SPARK-21291][R] add R partitionBy API in DataFrame
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22537 Thanks! @HyukjinKwon @felixcheung --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22537: [SPARK-21291][R] add R partitionBy API in DataFra...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22537 [SPARK-21291][R] add R partitionBy API in DataFrame ## What changes were proposed in this pull request? Add an R partitionBy API in write.df. I didn't add bucketBy in write.df, because the last line of write.df is
```
write <- handledCallJMethod(write, "save")
```
and save doesn't support bucketBy right now:
```
assertNotBucketed("save")
```
## How was this patch tested? Added a unit test in test_sparkSQL.R You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-21291 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22537.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22537 commit 7544ee503f45bb2890d12263aa5a27b19797123d Author: Huaxin Gao Date: 2018-09-24T17:42:16Z [SPARK-21291] add R partitionBy API In DataFrame --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r218237306
--- Diff: python/pyspark/sql/session.py ---
@@ -231,6 +231,7 @@ def __init__(self, sparkContext, jsparkSession=None):
                 or SparkSession._instantiatedSession._sc._jsc is None:
             SparkSession._instantiatedSession = self
             self._jvm.SparkSession.setDefaultSession(self._jsparkSession)
+            self._jvm.SparkSession.setActiveSession(self._jsparkSession)
--- End diff --
Thank you very much for your comments. I have a question here: in the stop() method, shall we clear the active session too? Currently, it has
```
def stop(self):
    """Stop the underlying :class:`SparkContext`.
    """
    self._jvm.SparkSession.clearDefaultSession()
    SparkSession._instantiatedSession = None
```
Do I need to add the following?
```
self._jvm.SparkSession.clearActiveSession()
```
To test getActiveSession when there is no active session, I am thinking of adding
```
def test_get_active_session_when_no_active_session(self):
    spark = SparkSession.builder \
        .master("local") \
        .getOrCreate()
    spark.stop()
    active = spark.getActiveSession()
    self.assertEqual(active, None)
```
The test didn't pass because stop() does not clear the active session.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
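For reference, a minimal sketch of what ```stop()``` could look like with the active session cleared as well, assuming the JVM side exposes ```clearActiveSession()``` (the Scala ```SparkSession``` companion object does):
```python
# Sketch only: the quoted stop() excerpt with the proposed active-session
# cleanup added, so a later getActiveSession() returns None.
def stop(self):
    """Stop the underlying :class:`SparkContext`.
    """
    self._jvm.SparkSession.clearDefaultSession()
    self._jvm.SparkSession.clearActiveSession()   # proposed addition
    SparkSession._instantiatedSession = None
    SparkSession._activeSession = None            # proposed addition
```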
[GitHub] spark issue #21649: [SPARK-23648][R][SQL]Adds more types for hint in SparkR
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21649 @felixcheung Thanks for your comments. I changed ```stopifnot```. At L3925 I could add
```
hintList <- list("hint2", "hint3", "hint4")
hint(df, "hint1", 1.2345, hintList)
```
but I am a little hesitant to give an example. Even though we relax the check for hint, I am not sure how a numeric or list hint would be used in real SQL.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21649#discussion_r216727458
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3939,7 +3929,15 @@ setMethod("hint",
           signature(x = "SparkDataFrame", name = "character"),
           function(x, name, ...) {
             parameters <- list(...)
-            stopifnot(all(sapply(parameters, isTypeAllowedForSqlHint)))
+            stopifnot(all(sapply(parameters, function(x) {
--- End diff --
My bad. Will change.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21649#discussion_r216413819
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3905,6 +3905,16 @@ setMethod("rollup",
             groupedData(sgd)
           })

+isTypeAllowedForSqlHint <- function(x) {
+  if (is.character(x) || is.numeric(x)) {
+    TRUE
+  } else if (is.list(x)) {
+    all(sapply(x, (function(y) is.character(y) || is.numeric(y))))
--- End diff --
I think multiple hints of different types in a single list work OK. I tested ```df.hint("hint1", Seq(1, 2, "a"))``` in Scala, and it works fine.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r216115581
--- Diff: python/pyspark/sql/session.py ---
@@ -252,6 +252,16 @@ def newSession(self):
         """
         return self.__class__(self._sc, self._jsparkSession.newSession())

+    @since(2.4)
+    def getActiveSession(self):
+        """
+        Returns the active SparkSession for the current thread, returned by the builder.
+        >>> s = spark.getActiveSession()
+        >>> spark._jsparkSession.getDefaultSession().get().equals(s.get())
+        True
+        """
+        return self._jsparkSession.getActiveSession()
--- End diff --
@HyukjinKwon I added a set of tests. Some of them are borrowed from ```SparkSessionBuilderSuite.scala```.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21710 @felixcheung Are there any other things I need to change? If not, could this PR be merged in 2.4? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21649: [SPARK-23648][R][SQL]Adds more types for hint in SparkR
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21649 @felixcheung Are there any other things I need to change? If not, could this PR be merged in 2.4? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20442: [SPARK-23265][ML]Update multi-column error handling logi...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/20442 Any more comments? @MLnick @jkbradley --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r215022091
--- Diff: python/pyspark/sql/session.py ---
@@ -252,6 +252,16 @@ def newSession(self):
         """
         return self.__class__(self._sc, self._jsparkSession.newSession())

+    @since(2.4)
+    def getActiveSession(self):
+        """
+        Returns the active SparkSession for the current thread, returned by the builder.
+        >>> s = spark.getActiveSession()
+        >>> spark._jsparkSession.getDefaultSession().get().equals(s.get())
--- End diff --
@holdenk @felixcheung Thanks for the review. I will change this.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r215022059
--- Diff: python/pyspark/sql/session.py ---
@@ -252,6 +252,16 @@ def newSession(self):
         """
         return self.__class__(self._sc, self._jsparkSession.newSession())

+    @since(2.4)
+    def getActiveSession(self):
+        """
+        Returns the active SparkSession for the current thread, returned by the builder.
+        >>> s = spark.getActiveSession()
+        >>> spark._jsparkSession.getDefaultSession().get().equals(s.get())
+        True
+        """
+        return self._jsparkSession.getActiveSession()
--- End diff --
@HyukjinKwon Sorry for the late reply. Yes, this returns a JVM instance. In the Scala code, ```SparkSession.getActiveSession``` returns an ```Option[SparkSession]```. I am not sure how to do a Python equivalent of Scala's ```Option```. In the following code, is there a way to wrap the Python ```session``` in the else path into something equivalent to Scala's ```Option```? If not, can I just return the Python ```session```?
```
if self._jsparkSession.getActiveSession() is None:
    return None
else:
    return self.__class__(self._sc, self._jsparkSession.getActiveSession().get())
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
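For what it's worth, Python has no built-in ```Option``` type; the idiomatic equivalent of ```Option[SparkSession]``` is simply "a ```SparkSession``` or ```None```". Below is a minimal sketch of that shape as a method on the Python SparkSession class, reusing the ```isDefined()```/```get()``` calls quoted earlier in the thread (a hypothetical sketch, not the merged implementation):
```python
@classmethod
def getActiveSession(cls):
    """Return the wrapped active SparkSession, or None if there is none."""
    # SparkContext here is pyspark.SparkContext.
    sc = SparkContext._active_spark_context
    if sc is None:
        return None
    jsession = sc._jvm.SparkSession.getActiveSession()  # a Scala Option
    if jsession.isDefined():
        # Wrap the JVM session in a Python SparkSession rather than
        # returning the raw py4j object.
        return cls(sc, jsession.get())
    return None
```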
[GitHub] spark pull request #22228: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/22228 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22291: [SPARK-25007][R]Add array_intersect/array_except/...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22291#discussion_r214472480
--- Diff: R/pkg/R/generics.R ---
@@ -799,10 +807,18 @@ setGeneric("array_sort", function(x) { standardGeneric("array_sort") })
 #' @name NULL
 setGeneric("arrays_overlap", function(x, y) { standardGeneric("arrays_overlap") })

+#' @rdname column_collection_functions
+#' @name NULL
+setGeneric("array_union", function(x, y) { standardGeneric("array_union") })
+
 #' @rdname column_collection_functions
 #' @name NULL
 setGeneric("arrays_zip", function(x, ...) { standardGeneric("arrays_zip") })

+#' @rdname column_collection_functions
+#' @name NULL
+setGeneric("shuffle", function(x) { standardGeneric("shuffle") })
--- End diff --
@felixcheung Changed. Thanks!
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22295 [SPARK-25255][PYTHON]Add getActiveSession to SparkSession in PySpark ## What changes were proposed in this pull request? add getActiveSession in session.py ## How was this patch tested? add doctest You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark25255 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22295.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22295 commit e9885b38e35dcfd01a40e43cf442beeaea226b98 Author: Huaxin Gao Date: 2018-08-31T00:50:23Z [SPARK-25255][PYTHON]Add getActiveSession to SparkSession in PySpark --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22291: [SPARK-25007][R]Add array_intersect/array_except/array_u...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22291 @felixcheung @HyukjinKwon Sorry I couldn't figure out how to make the ```sequence``` work in the other PR. I will work on this one first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22291: [SPARK-25007][R]Add array_intersect/array_except/...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22291 [SPARK-25007][R]Add array_intersect/array_except/array_union/shuffle to SparkR ## What changes were proposed in this pull request? Add the R version of array_intersect/array_except/array_union/shuffle ## How was this patch tested? Add test in test_sparkSQL.R You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-25007 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22291.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22291 commit 4a4507ff8c9764933bc1386ff57cfba1d7b18fac Author: Huaxin Gao Date: 2018-08-30T20:30:47Z [SPARK-25007][R]Add array_intersect/array_except/array_union/array_shuffle to SparkR --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22228: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22228 @jkbradley This is the backport to 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22228: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22228 [SPARK-25124][ML]VectorSizeHint setSize and getSize don't return values backport to 2.3 ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return a value. Added the missing return statements. ## How was this patch tested? Unit test added You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-25124-2.3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22228.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22228 commit 35092f0b829dea4520014eb980e31bb3511c9e7f Author: Huaxin Gao Date: 2018-08-24T21:10:05Z [SPARK-25124][ML]VectorSizeHint setSize and getSize don't return values backport to 2.3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22136#discussion_r212088986
--- Diff: python/pyspark/ml/tests.py ---
@@ -844,6 +844,28 @@ def test_string_indexer_from_labels(self):
             .select(model_default.getOrDefault(model_default.outputCol)).collect()
         self.assertEqual(len(transformed_list), 5)

+    def test_vector_size_hint(self):
--- End diff --
@jkbradley Sorry, my bad. I added set/getSize and removed VectorAssembler from the test to simplify it.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22136 [SPARK-25124][ML]VectorSizeHint setSize and getSize don't return values ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return a value. Added the missing return statements. ## How was this patch tested? I tested the changes locally. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-25124 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22136.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22136 commit 91a819af424778d284d66893cc6b11e1015720e0 Author: Huaxin Gao Date: 2018-08-17T18:15:53Z [SPARK-25124]VectorSizeHint setSize and getSize don't return values --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
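The shape of the fix is just adding the missing ```return``` to the usual Params accessor pattern; a hedged excerpt of what the corrected methods look like (decorators and the surrounding class omitted):
```python
# Sketch of the corrected accessors on VectorSizeHint (excerpt only).
def setSize(self, value):
    """Sets size param, the size of vectors in `inputCol`."""
    return self._set(size=value)          # the `return` here was missing

def getSize(self):
    """Gets size param, the size of vectors in `inputCol`."""
    return self.getOrDefault(self.size)   # likewise missing before the fix
```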
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21835#discussion_r210326849 --- Diff: R/pkg/R/functions.R --- @@ -3320,7 +3321,7 @@ setMethod("explode", #' @aliases sequence sequence,Column-method #' @note sequence since 2.4.0 setMethod("sequence", - signature(x = "Column", y = "Column"), + signature(), --- End diff -- @felixcheung gentle ping --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21439 Sure. I will work on it. Thanks for letting me know. @viirya --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21925: [SPARK-24973][PYTHON]Add numIter to Python Cluste...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/21925 [SPARK-24973][PYTHON]Add numIter to Python ClusteringSummary ## What changes were proposed in this pull request? Add numIter to Python version of ClusteringSummary ## How was this patch tested? Modified existing UT test_multiclass_logistic_regression_summary You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-24973 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21925.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21925 commit 7cf662b46a5da5af2e4c5df2c27aa4b14cb46830 Author: Huaxin Gao Date: 2018-07-30T22:35:05Z [SPARK-24973][PYTHON]Add numIter to Python ClusteringSummary --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21835#discussion_r205609369
--- Diff: R/pkg/R/functions.R ---
@@ -3320,7 +3321,7 @@ setMethod("explode",
 #' @aliases sequence sequence,Column-method
 #' @note sequence since 2.4.0
 setMethod("sequence",
-          signature(x = "Column", y = "Column"),
+          signature(),
--- End diff --
I will need to keep the ```sequence``` method callable without the base:: prefix, right? Is the following implementation OK?
```
setMethod("sequence",
          signature(),
          function(x, y = NULL, step = NULL) {
            if (is.null(y) && is.null(step)) {
              base::sequence(x)
            } else {
              jc <- if (is.null(step)) {
                callJStatic("org.apache.spark.sql.functions", "sequence", x@jc, y@jc)
              } else {
                stopifnot(class(step) == "Column")
                callJStatic("org.apache.spark.sql.functions", "sequence", x@jc, y@jc, step@jc)
              }
              column(jc)
            }
          })
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21835#discussion_r205538890
--- Diff: R/pkg/R/functions.R ---
@@ -3320,7 +3321,7 @@ setMethod("explode",
 #' @aliases sequence sequence,Column-method
 #' @note sequence since 2.4.0
 setMethod("sequence",
-          signature(x = "Column", y = "Column"),
+          signature(),
--- End diff --
@felixcheung It seems that if I only have ... in the generic function,
```
setGeneric("sequence", function(...) { standardGeneric("sequence") })
```
I can't have anything in the signature; otherwise I get an error in matchSignature.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21835#discussion_r204861059 --- Diff: R/pkg/tests/fulltests/test_context.R --- @@ -21,10 +21,11 @@ test_that("Check masked functions", { # Check that we are not masking any new function from base, stats, testthat unexpectedly # NOTE: We should avoid adding entries to *namesOfMaskedCompletely* as masked functions make it # hard for users to use base R functions. Please check when in doubt. - namesOfMaskedCompletely <- c("cov", "filter", "sample", "not") + namesOfMaskedCompletely <- c("cov", "filter", "sample", "not", "sequence") --- End diff -- Thanks @HyukjinKwon @felixcheung for your review. I will change this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21835: [SPARK-24779]Add sequence / map_concat / map_from_entrie...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21835 @HyukjinKwon @felixcheung Could you please review? Thank you very much in advance! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/21835 [SPARK-24779]Add sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off ## What changes were proposed in this pull request? Add the R version of sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off ## How was this patch tested? Add test in test_sparkSQL.R You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-24779 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21835.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21835 commit faff20b70e3577d58976283ac7157717d8b34c8d Author: Huaxin Gao Date: 2018-07-21T02:20:50Z [SPARK-24779]Add sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21820 @HyukjinKwon Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21820: [SPARK-24868][PYTHON]add sequence function in Pyt...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21820#discussion_r203934505
--- Diff: python/pyspark/sql/functions.py ---
@@ -2551,6 +2551,27 @@ def map_concat(*cols):
     return Column(jc)

+@since(2.4)
+def sequence(start, stop, step=None):
+    """
+    Generate a sequence of integers from start to stop, incrementing by step.
+    If step is not set, incrementing by 1 if start is less than or equal to stop, otherwise -1.
+
+    >>> df1 = spark.createDataFrame([(-2, 2)], ('C1', 'C2'))
+    >>> df1.select(sequence('C1', 'C2').alias('r')).collect()
+    [Row(r=[-2, -1, 0, 1, 2])]
+    >>> df2 = spark.createDataFrame([(4, -4, -2)], ('C1', 'C2', 'C3'))
--- End diff --
It will throw ```java.lang.IllegalArgumentException```. There is a check in ```collectionOperations.scala```:
```
require(
  (step > num.zero && start <= stop) ||
    (step < num.zero && start >= stop) ||
    (step == num.zero && start == stop),
  s"Illegal sequence boundaries: $start to $stop by $step")
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
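So, for example, flipping the step sign against the boundaries should trip that ```require``` check (a sketch; in PySpark the JVM exception surfaces wrapped in a py4j error):
```python
from pyspark.sql.functions import sequence

# start=4 is greater than stop=-4 but step=2 is positive, so the require()
# above fails with something like:
#   java.lang.IllegalArgumentException: requirement failed:
#   Illegal sequence boundaries: 4 to -4 by 2
df = spark.createDataFrame([(4, -4, 2)], ('C1', 'C2', 'C3'))
df.select(sequence('C1', 'C2', 'C3').alias('r')).collect()
```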
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203526021
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/PrefixSpanWrapper.scala ---
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.ml.fpm.PrefixSpan
+
+private[r] object PrefixSpanWrapper {
+  def getPrefixSpan(
+      minSupport: Double,
+      maxPatternLength: Int,
+      maxLocalProjDBSize: Double,
--- End diff --
```maxLocalProjDBSize``` is a ```LongParam``` in Scala, but it seems that R doesn't have a ```long``` type. See ```help(integer)```:
```
Note that on almost all implementations of R the range of representable
integers is restricted to about +/-2*10^9: 'double's can hold much larger
integers exactly.
```
so I used ```double``` in R and convert with ```toLong``` later.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203481597
--- Diff: R/pkg/R/generics.R ---
@@ -1415,6 +1415,13 @@ setGeneric("spark.freqItemsets", function(object) { standardGeneric("spark.freqI
 #' @rdname spark.fpGrowth
 setGeneric("spark.associationRules", function(object) { standardGeneric("spark.associationRules") })

+#' @rdname spark.prefixSpan
+setGeneric("spark.prefixSpan", function(...) { standardGeneric("spark.prefixSpan") })
+
+#' @rdname spark.prefixSpan
--- End diff --
OK, thanks. I guess I will keep both spark.prefixSpan and spark.findFrequentSequentialPatterns in the same Rd file.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203229835
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/PrefixSpanWrapper.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.ml.fpm.PrefixSpan
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class PrefixSpanWrapper private (val prefixSpan: PrefixSpan) {
--- End diff --
You are right. I will remove
```
private[r] class PrefixSpanWrapper private (val prefixSpan: PrefixSpan) {
  def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame = {
    prefixSpan.findFrequentSequentialPatterns(dataset)
  }
}
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203229794
--- Diff: R/pkg/tests/fulltests/test_mllib_fpm.R ---
@@ -82,4 +82,26 @@ test_that("spark.fpGrowth", {
 })

+test_that("spark.prefixSpan", {
+  df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
+                             list(list(list(1L), list(3L, 2L), list(1L, 2L))),
+                             list(list(list(1L, 2L), list(5L))),
+                             list(list(list(6L)))), schema = c("sequence"))
+  prefix_Span1 <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 5L,
+                                   maxLocalProjDBSize = 3200L)
+  result1 <- spark.findFrequentSequentialPatterns(prefix_Span1, df)
+
+  expected_result <- createDataFrame(list(list(list(list(1L)), 3L),
+                                          list(list(list(3L)), 2L),
+                                          list(list(list(2L)), 3L),
+                                          list(list(list(1L, 2L)), 3L),
+                                          list(list(list(1L), list(3L)), 2L)),
+                                     schema = c("sequence", "freq"))
+  expect_equivalent(expected_result, result1)
+
+  prefix_Span2 <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 5L)
+  result2 <- spark.findFrequentSequentialPatterns(prefix_Span2, df)
+  expect_equivalent(expected_result, result2)
--- End diff --
Just trying to test that the default value (```maxLocalProjDBSize = 3200L```) is used when the parameter isn't set explicitly.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203229733 --- Diff: R/pkg/R/generics.R --- @@ -1415,6 +1415,13 @@ setGeneric("spark.freqItemsets", function(object) { standardGeneric("spark.freqI #' @rdname spark.fpGrowth setGeneric("spark.associationRules", function(object) { standardGeneric("spark.associationRules") }) +#' @rdname spark.prefixSpan +setGeneric("spark.prefixSpan", function(...) { standardGeneric("spark.prefixSpan") }) + +#' @rdname spark.prefixSpan --- End diff -- @viirya Thanks for your review. Do you mean to use ```findFrequentSequentialPatterns``` instead of ```spark.findFrequentSequentialPatterns```? I have a question about when we need to add ```spark``` in front of the method name. In FPGrowth, some methods have ```spark.```, e.g. ```spark.freqItemsets```, ```spark.associationRules```. Some methods don't have ```spark.```, e.g. ```predict``` and ```write.ml``` . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21645: [SPARK-24537][R]Add array_remove / array_zip / map_from_...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21645 Thanks! @HyukjinKwon @felixcheung --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21710 @felixcheung Can I open a new JIRA for the code example and documentation? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21645: [SPARK-24537][R]Add array_remove / array_zip / ma...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21645#discussion_r201827579
--- Diff: R/pkg/R/functions.R ---
@@ -3071,6 +3085,19 @@ setMethod("array_position",
             column(jc)
           })

+#' @details
+#' \code{array_remove}: Removes all elements that equal to element from the given array.
+#'
+#' @rdname column_collection_functions
+#' @aliases array_remove array_remove,Column-method
+#' @note array_remove since 2.4.0
+setMethod("array_remove",
+          signature(x = "Column", value = "ANY"),
--- End diff --
Thanks for your review. I will add it to the doc.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21678: [SPARK-23461][R]vignettes should include model predictio...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21678 @felixcheung Thanks a lot for your help! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org