[GitHub] spark issue #21465: [SPARK-24333][ML][PYTHON]Add fit with validation set to ...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21465

@BryanCutler Thank you very much for your help!
[GitHub] spark pull request #23256: [SPARK-24207][R] follow-up PR for SPARK-24207 to ...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/23256

[SPARK-24207][R] follow-up PR for SPARK-24207 to fix code style problems

## What changes were proposed in this pull request?

Follow-up PR for SPARK-24207 to fix code style problems.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-24207-cnt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23256.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23256

commit 04ec3e18e71b0f0e0d212a3532a6dc06518d61c0
Author: Huaxin Gao
Date: 2018-12-07T19:55:50Z

    fix code style problems
[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r239904265

--- Diff: python/pyspark/ml/param/shared.py ---
@@ -814,3 +814,25 @@ def getDistanceMeasure(self):
         """
         return self.getOrDefault(self.distanceMeasure)
+
+class HasValidationIndicatorCol(Params):
--- End diff --

You are right. DecisionTreeParams should be at the bottom.
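For context, shared params in ```python/pyspark/ml/param/shared.py``` follow a simple mixin pattern. A minimal sketch of what a ```HasValidationIndicatorCol``` mixin typically looks like (the doc string wording here is illustrative, not the exact generated code):

```
from pyspark.ml.param import Param, Params, TypeConverters


class HasValidationIndicatorCol(Params):
    """Mixin for param validationIndicatorCol."""

    validationIndicatorCol = Param(
        Params._dummy(), "validationIndicatorCol",
        "name of the column that indicates whether each row is for training "
        "or for validation.",
        typeConverter=TypeConverters.toString)

    def getValidationIndicatorCol(self):
        """Gets the value of validationIndicatorCol or its default value."""
        return self.getOrDefault(self.validationIndicatorCol)
```

```getOrDefault``` resolves an explicitly set value first and falls back to any default registered for the param, so estimators pick the param up simply by inheriting the mixin.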
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239626824

--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,58 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"),
           function(object, path, overwrite = FALSE) {
             write_internal(object, path, overwrite)
           })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+# Run the PIC algorithm and returns a cluster assignment for each input vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param sourceCol Param for the name of the input column for source vertex IDs.
+#' @param destinationCol Name of the input column for destination vertex IDs.
+#' @param weightCol Param for weight column name. If this is not set or \code{NULL},
+#'                  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
+#'         The schema of it will be:
+#'         \code{id: Long}
+#'         \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'                            list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'                            list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+          signature(data = "SparkDataFrame"),
+          function(data, k = 2L, initMode = c("random", "degree"), maxIter = 20L,
+                   sourceCol = "src", destinationCol = "dst", weightCol = NULL) {
+            if (!is.numeric(k) || k < 1) {
+              stop("k should be a number with value >= 1.")
+            }
+            if (!is.integer(maxIter) || maxIter <= 0) {
+              stop("maxIter should be a number with value > 0.")
+            }
--- End diff --

It seems to me that R is a thin wrapper here: we only need to create a PIC object and call the corresponding Scala method. SparkDataFrame column types are only checked in Scala, not in R.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239626871

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
+developed by <a href="http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf">Lin and Cohen</a>.
--- End diff --

Thanks. I will change the hyperlink.
[GitHub] spark issue #21465: [SPARK-24333][ML][PYTHON]Add fit with validation set to ...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21465

@BryanCutler Thank you very much for your review! I will submit changes soon.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239250376

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
+developed by <a href="http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf">Lin and Cohen</a>.
--- End diff --

I normally check the md file on GitHub, and the link works OK. Is there a better way to check? @dongjoon-hyun @felixcheung https://github.com/apache/spark/blob/9158da8cb76cc13f3011deaa7ac2c290eef62389/docs/ml-clustering.md

I guess I will still remove the ```<a href=...>``` tag, since no other place in the doc uses one.
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/23072

@dongjoon-hyun Thank you very much for your review. I will make the changes soon.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239250335

--- Diff: R/pkg/R/mllib_clustering.R ---
@@ -610,3 +616,58 @@ setMethod("write.ml", signature(object = "LDAModel", path = "character"),
           function(object, path, overwrite = FALSE) {
             write_internal(object, path, overwrite)
           })
+
+#' PowerIterationClustering
+#'
+#' A scalable graph clustering algorithm. Users can call \code{spark.assignClusters} to
+#' return a cluster assignment for each input vertex.
+#'
+# Run the PIC algorithm and returns a cluster assignment for each input vertex.
+#' @param data A SparkDataFrame.
+#' @param k The number of clusters to create.
+#' @param initMode Param for the initialization algorithm.
+#' @param maxIter Param for maximum number of iterations.
+#' @param sourceCol Param for the name of the input column for source vertex IDs.
+#' @param destinationCol Name of the input column for destination vertex IDs.
+#' @param weightCol Param for weight column name. If this is not set or \code{NULL},
+#'                  we treat all instance weights as 1.0.
+#' @param ... additional argument(s) passed to the method.
+#' @return A dataset that contains columns of vertex id and the corresponding cluster for the id.
+#'         The schema of it will be:
+#'         \code{id: Long}
+#'         \code{cluster: Int}
+#' @rdname spark.powerIterationClustering
+#' @aliases assignClusters,PowerIterationClustering-method,SparkDataFrame-method
+#' @examples
+#' \dontrun{
+#' df <- createDataFrame(list(list(0L, 1L, 1.0), list(0L, 2L, 1.0),
+#'                            list(1L, 2L, 1.0), list(3L, 4L, 1.0),
+#'                            list(4L, 0L, 0.1)), schema = c("src", "dst", "weight"))
+#' clusters <- spark.assignClusters(df, initMode="degree", weightCol="weight")
+#' showDF(clusters)
+#' }
+#' @note spark.assignClusters(SparkDataFrame) since 3.0.0
+setMethod("spark.assignClusters",
+          signature(data = "SparkDataFrame"),
+          function(data, k = 2L, initMode = c("random", "degree"), maxIter = 20L,
+                   sourceCol = "src", destinationCol = "dst", weightCol = NULL) {
+            if (!is.numeric(k) || k < 1) {
+              stop("k should be a number with value >= 1.")
+            }
+            if (!is.integer(maxIter) || maxIter <= 0) {
+              stop("maxIter should be a number with value > 0.")
+            }
--- End diff --

@dongjoon-hyun ```src``` and ```dst``` are character columns. I have a check for the character type:

```
as.character(sourceCol), as.character(destinationCol)
```
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r239238873

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
+developed by <a href="http://www.cs.cmu.edu/~frank/papers/icml2010-pic-final.pdf">Lin and Cohen</a>.
+From the abstract: PIC finds a very low-dimensional embedding of a dataset
+using truncated power iteration on a normalized pair-wise similarity matrix of the data.
+
+`spark.ml`'s PowerIterationClustering implementation takes the following parameters:
+
+* `k`: the number of clusters to create
+* `initMode`: param for the initialization algorithm
+* `maxIter`: param for maximum number of iterations
+* `srcCol`: param for the name of the input column for source vertex IDs
+* `dstCol`: name of the input column for destination vertex IDs
+* `weightCol`: Param for weight column name
+
+**Examples**
+
+Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.PowerIterationClustering) for more details.
+
+{% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
+
+Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.
+
+{% include_example java/org/apache/spark/examples/ml/JavaPowerIterationClusteringExample.java %}
--- End diff --

@dongjoon-hyun https://github.com/apache/spark/pull/22996

I will add the Python example in the doc once the above PR is merged in.
[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r239173098

--- Diff: python/pyspark/ml/classification.py ---
@@ -1174,9 +1165,31 @@ def trees(self):
         return [DecisionTreeClassificationModel(m) for m in list(self._call_java("trees"))]
+
+class GBTClassifierParams(GBTParams, HasVarianceImpurity):
--- End diff --

@BryanCutler Thanks for your review. It seems that https://github.com/apache/spark/pull/22986 recently added ```trait HasVarianceImpurity``` and made ```private[ml] trait GBTClassifierParams extends GBTParams with HasVarianceImpurity```.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r237966508

--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/FPGrowthExample.scala ---
@@ -64,4 +64,3 @@ object FPGrowthExample {
     spark.stop()
   }
 }
-// scalastyle:on println
--- End diff --

@dongjoon-hyun Sorry, I missed the ```// scalastyle:off println```. Is it OK with you if I remove ```// scalastyle:off println``` too, since ```println``` is not used in the example?
[GitHub] spark pull request #23168: [SPARK-26207][doc]add PowerIterationClustering (P...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23168#discussion_r237661961

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,38 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
+developed by <a href="http://www.icml2010.org/papers/387.pdf">Lin and Cohen</a>.
--- End diff --

@shahidki31 Changed. Thanks!
[GitHub] spark pull request #23161: [SPARK-26189][R]Fix unionAll doc in SparkR
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23161#discussion_r237628719

--- Diff: R/pkg/R/DataFrame.R ---
@@ -2732,13 +2732,24 @@ setMethod("union",
             dataFrame(unioned)
           })
-#' Return a new SparkDataFrame containing the union of rows
-#'
-#' This is an alias for `union`.
+#' Return a new SparkDataFrame containing the union of rows.
+#' This is an alias for \code{union}.
--- End diff --

You are right: it is the title, and a newline is required. I checked manually.
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r237332601

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
--- End diff --

The doc change will be in both 2.4 and master, but the R-related code will be in master only. I think that's why @felixcheung asked me to open a separate PR to merge the doc change into 2.4.
[GitHub] spark issue #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/23168

@srowen It's not in master yet. The PR is here: https://github.com/apache/spark/pull/23072
[GitHub] spark issue #23168: [SPARK-26207][doc]add PowerIterationClustering (PIC) doc...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/23168

@felixcheung Could you please review?
[GitHub] spark pull request #23168: [SPARK-26207][doc]add PowerIterationClustering (P...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/23168

[SPARK-26207][doc]add PowerIterationClustering (PIC) doc in 2.4 branch

## What changes were proposed in this pull request?

Add the PIC doc in 2.4.

## How was this patch tested?

Manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-26207

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23168.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23168

commit 11363fbfd765d53f87c8908266540f8fec77fe7a
Author: Huaxin Gao
Date: 2018-11-28T17:52:37Z

    [SPARK-26207][doc]add PowerIterationClustering (PIC) doc in 2.4 branch
[GitHub] spark pull request #23161: [SPARK-26189][R]Fix unionAll doc in SparkR
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/23161

[SPARK-26189][R]Fix unionAll doc in SparkR

## What changes were proposed in this pull request?

Fix the unionAll doc in SparkR.

## How was this patch tested?

Manually ran tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-26189

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23161.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23161

commit 423711ec45883822942be309d8052cee976ef8c0
Author: Huaxin Gao
Date: 2018-11-28T04:22:33Z

    [SPARK-26189][R]Fix unionAll doc in SparkR
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/23072#discussion_r236787704

--- Diff: docs/ml-clustering.md ---
@@ -265,3 +265,44 @@ Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.
+
+## Power Iteration Clustering (PIC)
+
+Power Iteration Clustering (PIC) is a scalable graph clustering algorithm
--- End diff --

Sure.
[GitHub] spark pull request #23157: [SPARK-26185][PYTHON]add weightCol in python Mult...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/23157

[SPARK-26185][PYTHON]add weightCol in python MulticlassClassificationEvaluator

## What changes were proposed in this pull request?

Add weightCol for the Python versions of MulticlassClassificationEvaluator and MulticlassMetrics.

## How was this patch tested?

Added a doctest.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-26185

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23157.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23157

commit e7cf84521f331b75d645654573696be74d7b0b06
Author: Huaxin Gao
Date: 2018-11-27T17:52:58Z

    [SPARK-26185][PYTHON]add weightCol in python MulticlassClassificationEvaluator
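To illustrate what the new parameter enables, here is a minimal usage sketch (the tiny dataset, the column names, and an already-created SparkSession named ```spark``` are assumptions for illustration):

```
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Rows of (prediction, label, weight); the weight lets a row count
# more or less toward the metric.
df = spark.createDataFrame(
    [(0.0, 0.0, 1.0), (1.0, 1.0, 2.0), (0.0, 1.0, 1.5)],
    ["prediction", "label", "weight"])

evaluator = MulticlassClassificationEvaluator(
    predictionCol="prediction", labelCol="label",
    metricName="accuracy", weightCol="weight")

# Weighted accuracy: (1.0 + 2.0) / (1.0 + 2.0 + 1.5) ~= 0.667
print(evaluator.evaluate(df))
```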
[GitHub] spark pull request #21465: [SPARK-24333][ML][PYTHON]Add fit with validation ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21465#discussion_r235488017

--- Diff: python/pyspark/ml/classification.py ---
@@ -1176,8 +1176,8 @@ def trees(self):
 @inherit_doc
 class GBTClassifier(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasMaxIter,
-                    GBTParams, HasCheckpointInterval, HasStepSize, HasSeed, JavaMLWritable,
-                    JavaMLReadable):
+                    GBTParams, HasCheckpointInterval, HasStepSize, HasSeed,
+                    HasValidationIndicatorCol, JavaMLWritable, JavaMLReadable):
--- End diff --

@BryanCutler Thank you very much for reviewing my PR. I moved HasValidationIndicatorCol, HasMaxIter, and HasStepSize to GBTParams.
[GitHub] spark issue #23072: [SPARK-19827][R]spark.ml R API for PIC
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/23072

retest this please
[GitHub] spark issue #20442: [SPARK-23265][ML]Update multi-column error handling logi...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/20442

retest this please
[GitHub] spark pull request #23072: [SPARK-19827][R]spark.ml R API for PIC
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/23072

[SPARK-19827][R]spark.ml R API for PIC

## What changes were proposed in this pull request?

Add PowerIterationClustering (PIC) in R.

## How was this patch tested?

Added a test case.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-19827

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23072.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23072

commit 9e2b0f9ffe0866fa328bc677500e4f3a49ff384b
Author: Huaxin Gao
Date: 2018-11-17T21:25:46Z

    [SPARK-19827][R]spark.ml R API for PIC
[GitHub] spark issue #22996: [SPARK-25997][ML]add Python example code for Power Itera...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22996

@holdenk Yes, it is. I will include the examples in ml-clustering.md.
[GitHub] spark pull request #22996: add Python example code for Power Iteration Clust...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22996

add Python example code for Power Iteration Clustering in spark.ml

## What changes were proposed in this pull request?

Add a Python example for Power Iteration Clustering in spark.ml.

## How was this patch tested?

Manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-25997

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22996.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22996

commit 905b542a8618269bdc079f3c335a80c13d2214fa
Author: Huaxin Gao
Date: 2018-11-09T22:32:17Z

    add Python example code for Power Iteration Clustering in spark.ml
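The spark.ml API this example targets can be exercised roughly like this (a sketch assuming an active SparkSession named ```spark```; the edge list mirrors the doc example):

```
from pyspark.ml.clustering import PowerIterationClustering

# Each row is a weighted edge of the similarity graph: (src, dst, weight).
df = spark.createDataFrame(
    [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0), (3, 4, 1.0), (4, 0, 0.1)],
    ["src", "dst", "weight"])

pic = PowerIterationClustering(k=2, maxIter=20, initMode="degree",
                               weightCol="weight")

# PIC returns the cluster assignments directly rather than fitting a model.
assignments = pic.assignClusters(df)  # columns: id, cluster
assignments.show()
```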
[GitHub] spark issue #22788: [SPARK-25769][SQL]make UnresolvedAttribute.sql escape ne...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22788

@cloud-fan @dongjoon-hyun Because of the above test failures in ```ExpressionTypeCheckingSuite```, shall I revert to the previous change?

```
override def sql: String = nameParts.map { part =>
  part match {
    case ParserUtils.escapedIdentifier(_) | ParserUtils.qualifiedEscapedIdentifier(_, _) => part
    case _ => quoteIdentifier(part)
  }
}.mkString(".")
```

Or change ```ExpressionTypeCheckingSuite```?
[GitHub] spark issue #22788: [SPARK-25769][SQL]make UnresolvedAttribute.sql escape ne...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22788

I have a question regarding the test failures in ```ExpressionTypeCheckingSuite```. Most of the tests in this suite failed after I changed ```UnresolvedAttribute.sql``` to be the same as ```UnresolvedAttribute.name```. For example,

```
test("check types for unary arithmetic") {
  assertError(BitwiseNot('stringField), "requires integral type")
}
```

The test failed with

```
"cannot resolve '~`stringField`' due to data type mismatch: argument 1 requires integral type, however, '`stringField`' is of string type.;" did not contain "cannot resolve '~stringField' due to data type mismatch:"
```

It seems that the root cause of the failure is that ```UnresolvedAttribute.sql``` no longer matches ```AttributeReference.sql``` after my fix. I doubt whether it is correct to make ```UnresolvedAttribute.sql``` the same as ```UnresolvedAttribute.name```.
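For intuition, the quoting rule under discussion can be modeled in a few lines (a toy Python sketch of the backtick convention, not Spark's actual Scala ```quoteIdentifier```):

```
def quote_identifier(name: str) -> str:
    # Wrap the identifier in backticks, doubling any embedded backtick,
    # which is the escaping convention Spark SQL uses in error messages.
    return "`" + name.replace("`", "``") + "`"


# The failing assertion compares a quoted form against an unquoted one:
print(quote_identifier("stringField"))        # `stringField`
print("~" + quote_identifier("stringField"))  # ~`stringField`, vs expected ~stringField
```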
[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r229583440

--- Diff: sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out ---
@@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1
 struct<>
 -- !query 18 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
+cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
--- End diff --

@dilipbiswal Currently ```sql``` is not the same as ```name```.
[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r229479893

--- Diff: sql/core/src/test/resources/sql-tests/results/columnresolution-negative.sql.out ---
@@ -161,7 +161,7 @@ SELECT db1.t1.i1 FROM t1, mydb2.t1
 struct<>
 -- !query 18 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
+cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
--- End diff --

Yes, I agree. For the four examples, we will have the following results:

```
$"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
```
with my previous fix:
```
`a`.`b`
```
make sql the same as name:
```
a.b
```

```
$"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
```
with my previous fix:
```
`a`.`b`
```
make sql the same as name:
```
`a.b`
```

```
$"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
```
with my previous fix:
```
`a`.`b`
```
make sql the same as name:
```
a.b
```

```
$"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
```
with my previous fix:
```
`a.b`.`c`
```
make sql the same as name:
```
`a.b`.c
```

Does this look good to everybody?
[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r229354247

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2856,6 +2856,21 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     checkAnswer(sql("select 26393499451 / (1e6 * 1000)"), Row(BigDecimal("26.393499451")))
   }
 }
+
+  test("SPARK-25769 escape nested columns") {
+    import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
+    val sql1 = $"a.b".expr.asInstanceOf[UnresolvedAttribute].sql
+    assert(sql1 === "`a`.`b`")
+
+    val sql2 = $"`a.b`".expr.asInstanceOf[UnresolvedAttribute].sql
+    assert(sql2 === "`a.b`")
+
+    val sql3 = $"`a`.b".expr.asInstanceOf[UnresolvedAttribute].sql
+    assert(sql3 === "`a`.`b`")
+
+    val sql4 = $"`a.b`.c".expr.asInstanceOf[UnresolvedAttribute].sql
+    assert(sql4 === "`a.b`.`c`")
+  }
--- End diff --

Merged. Thank you very much! @dongjoon-hyun
[GitHub] spark issue #22863: [SPARK-25859][ML]add scala/java/python example and doc f...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22863

Thanks @felixcheung
[GitHub] spark pull request #22863: [SPARK-25859][ML]add scala/java/python example an...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/22863
[GitHub] spark issue #22863: [SPARK-25859][ML]add scala/java/python example and doc f...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22863

@felixcheung
[GitHub] spark pull request #22863: [SPARK-25859][ML]add scala/java/python example an...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22863

[SPARK-25859][ML]add scala/java/python example and doc for PrefixSpan

## What changes were proposed in this pull request?

Add Scala/Java/Python examples and doc for PrefixSpan in branch 2.4.

## How was this patch tested?

Manually tested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark mydocbranch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22863.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22863

commit 3109c213c2f875ea7099929621a3be18b5f02862
Author: Huaxin Gao
Date: 2018-10-27T18:14:36Z

    [SPARK-25859][ML]add scala/java/python example and doc for PrefixSpan
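For reference, the PrefixSpan API the examples document can be exercised like this (a sketch assuming an active SparkSession named ```spark```; the sequence data is the usual toy example):

```
from pyspark.ml.fpm import PrefixSpan
from pyspark.sql import Row

# Each row is a sequence of itemsets.
df = spark.createDataFrame([
    Row(sequence=[[1, 2], [3]]),
    Row(sequence=[[1], [3, 2], [1, 2]]),
    Row(sequence=[[1, 2], [5]]),
    Row(sequence=[[6]])])

prefix_span = PrefixSpan(minSupport=0.5, maxPatternLength=5)

# Returns a DataFrame of (sequence, freq) rows.
prefix_span.findFrequentSequentialPatterns(df).show()
```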
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21710

@felixcheung I am terribly sorry that I missed your comment about the ml doc and example for 2.4. Is there still time to merge into 2.4? I saw one of my PRs get merged into 2.4 last night. I can submit a PR for the 2.4 doc and example now. Please let me know. Thanks a lot!
[GitHub] spark issue #22295: [SPARK-25255][PYTHON]Add getActiveSession to SparkSessio...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22295

Thank you very much for your help! @holdenk @HyukjinKwon
[GitHub] spark pull request #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for class...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22790#discussion_r228274470

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala ---
@@ -109,7 +109,7 @@ class BisectingKMeansModel private[clustering] (
   @Since("2.0.0")
   override def save(sc: SparkContext, path: String): Unit = {
-    BisectingKMeansModel.SaveLoadV1_0.save(sc, this, path)
+    BisectingKMeansModel.SaveLoadV2_0.save(sc, this, path)
   }

   override protected def formatVersion: String = "1.0"
--- End diff --

I changed the ```formatVersion``` to 2.0. There are quite a few files that implement the trait ```Saveable``` and have a ```formatVersion```, and I don't feel comfortable changing other files in this PR. Maybe I will open a separate JIRA to remove ```formatVersion``` from the trait ```Saveable```?
[GitHub] spark issue #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for classNameV2_...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22790

I added a regression test in ```org.apache.spark.mllib.clustering.BisectingKMeansSuite```. I could add the following test in the ml package:

```
test("SPARK-25793") {
  val bisectingKMeans = new BisectingKMeans()
  bisectingKMeans.setDistanceMeasure(DistanceMeasure.COSINE)
  val readBisectingKMeans = testDefaultReadWrite(bisectingKMeans)
  assert(bisectingKMeans.distanceMeasure === readBisectingKMeans.distanceMeasure)
}
```

But the bug doesn't really affect the above test. With the bug, even though the mllib ```BisectingKMeansModel.load``` will call the V1_0 load and give a model with the default value of distanceMeasure, in the ml package ```BisectingKMeansModelReader.load``` will call ```metadata.getAndSetParams(model)```, which sets distanceMeasure to the correct value (DistanceMeasure.COSINE):

```
override def load(path: String): BisectingKMeansModel = {
  val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
  val dataPath = new Path(path, "data").toString
  val mllibModel = MLlibBisectingKMeansModel.load(sc, dataPath)
  val model = new BisectingKMeansModel(metadata.uid, mllibModel)
  metadata.getAndSetParams(model)
  model
}
```
[GitHub] spark issue #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for classNameV2_...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22790

retest this please
[GitHub] spark pull request #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for class...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22790#discussion_r227229331

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala ---
@@ -126,7 +126,7 @@ object BisectingKMeansModel extends Loader[BisectingKMeansModel] {
     val model = SaveLoadV1_0.load(sc, path)
     model
   case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) =>
-    val model = SaveLoadV1_0.load(sc, path)
+    val model = SaveLoadV2_0.load(sc, path)
--- End diff --

@viirya @mgaido91 I will change ```BisectingKMeansModel.save``` to

```
BisectingKMeansModel.SaveLoadV2_0.save(sc, this, path)
```
[GitHub] spark pull request #22788: [SPARK-25769][SQL]escape nested columns by backti...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r227152273

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2702,7 +2702,7 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     val e = intercept[AnalysisException](sql("SELECT v.i from (SELECT i FROM v)"))
     assert(e.message ==
-      "cannot resolve '`v.i`' given input columns: [__auto_generated_subquery_name.i]")
--- End diff --

The outermost backticks are added in the ```case _``` branch of ```case class UnresolvedAttribute(nameParts: Seq[String])```:

```
override def sql: String = name match {
  case ParserUtils.escapedIdentifier(_) | ParserUtils.qualifiedEscapedIdentifier(_, _) => name
  case _ => quoteIdentifier(name)
}
```

Here ```nameParts``` is a Seq of ```"v"```, ```"i"``` and ```name``` is ```v.i```.
[GitHub] spark pull request #22793: [SPARK-25793][ML]Call SaveLoadV2_0.load for class...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/22793
[GitHub] spark issue #22793: [SPARK-25793][ML]Call SaveLoadV2_0.load for classNameV2_...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22793

@WeichenXu123 I created two PRs for this JIRA. I had trouble creating the first one, so I created another. I will close this PR; please use the other one. Thanks!
[GitHub] spark pull request #22793: [SPARK-25793][ML]Call SaveLoadV2_0.load for class...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22793

[SPARK-25793][ML]Call SaveLoadV2_0.load for classNameV2_0

## What changes were proposed in this pull request?

The wrong version of load is called in BisectingKMeansModel.load:

```
case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) =>
  val model = SaveLoadV1_0.load(sc, path)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark25793

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22793.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22793
[GitHub] spark pull request #22790: [SPARK-25793][ML]call SaveLoadV2_0.load for class...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22790

[SPARK-25793][ML]call SaveLoadV2_0.load for classNameV2_0

## What changes were proposed in this pull request?

The following code in BisectingKMeansModel.load calls the wrong version of load:

```
case (SaveLoadV2_0.thisClassName, SaveLoadV2_0.thisFormatVersion) =>
  val model = SaveLoadV1_0.load(sc, path)
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-25793

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22790.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22790
[GitHub] spark pull request #22788: [SPARK-25769][SQL]change nested columns from `a.b...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22788#discussion_r226872842

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala ---
@@ -98,8 +98,18 @@ case class UnresolvedTableValuedFunction(
  */
 case class UnresolvedAttribute(nameParts: Seq[String]) extends Attribute with Unevaluable {
-  def name: String =
-    nameParts.map(n => if (n.contains(".")) s"`$n`" else n).mkString(".")
+  def name: String = {
+    nameParts.map(n =>
+      if (n.contains(".")) s"`$n`"
+      else {
+        if (nameParts.length > 1) {
+          s"`$n`"
+        } else {
+          n
+        }
+      }
+    ).mkString(".")
+  }
--- End diff --

@dongjoon-hyun Thanks! Fixed.
[GitHub] spark pull request #22788: [SPARK-25769][SQL]change nested columns from `a.b...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22788

[SPARK-25769][SQL]change nested columns from `a.b` to `a`.`b`

## What changes were proposed in this pull request?

Currently, ```$"a.b".expr.asInstanceOf[UnresolvedAttribute].sql``` returns

```
res1: String = `a.b`
```

This PR will change the nested column to

```
`a`.`b`
```

## How was this patch tested?

I didn't add any new tests. However, I changed several of the existing tests to expect nested columns in the format

```
`a`.`b`
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-25769

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22788.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22788

commit 2aa4ad97c8b9df5fdd022653549a0252dbb819b6
Author: Huaxin Gao
Date: 2018-10-21T16:13:49Z

    [SPARK-25769][SQL]change nested columns from `a.b` to `a`.`b`
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r226178191

--- Diff: python/pyspark/sql/tests.py ---
@@ -3863,6 +3863,145 @@ def test_jvm_default_session_already_set(self):
         spark.stop()
+
+class SparkSessionTests2(unittest.TestCase):
+
+    def test_active_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            activeSession = SparkSession.getActiveSession()
+            df = activeSession.createDataFrame([(1, 'Alice')], ['age', 'name'])
+            self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')])
+        finally:
+            spark.stop()
+
+    def test_get_active_session_when_no_active_session(self):
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, spark)
+        spark.stop()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
+
+    def test_SparkSession(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .config("some-config", "v2") \
+            .getOrCreate()
+        try:
+            self.assertEqual(spark.conf.get("some-config"), "v2")
+            self.assertEqual(spark.sparkContext._conf.get("some-config"), "v2")
+            self.assertEqual(spark.version, spark.sparkContext.version)
+            spark.sql("CREATE DATABASE test_db")
+            spark.catalog.setCurrentDatabase("test_db")
+            self.assertEqual(spark.catalog.currentDatabase(), "test_db")
+            spark.sql("CREATE TABLE table1 (name STRING, age INT) USING parquet")
+            self.assertEqual(spark.table("table1").columns, ['name', 'age'])
+            self.assertEqual(spark.range(3).count(), 3)
+        finally:
+            spark.stop()
+
+    def test_global_default_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            self.assertEqual(SparkSession.builder.getOrCreate(), spark)
+        finally:
+            spark.stop()
+
+    def test_default_and_active_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        activeSession = spark._jvm.SparkSession.getActiveSession()
+        defaultSession = spark._jvm.SparkSession.getDefaultSession()
+        try:
+            self.assertEqual(activeSession, defaultSession)
+        finally:
+            spark.stop()
+
+    def test_config_option_propagated_to_existing_SparkSession(self):
--- End diff --

Will change. Thanks!
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r226178127

--- Diff: python/pyspark/sql/tests.py ---
@@ -3863,6 +3863,145 @@ def test_jvm_default_session_already_set(self):
         spark.stop()
+
+class SparkSessionTests2(unittest.TestCase):
+
+    def test_active_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            activeSession = SparkSession.getActiveSession()
+            df = activeSession.createDataFrame([(1, 'Alice')], ['age', 'name'])
+            self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')])
+        finally:
+            spark.stop()
+
+    def test_get_active_session_when_no_active_session(self):
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, spark)
+        spark.stop()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
+
+    def test_SparkSession(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .config("some-config", "v2") \
+            .getOrCreate()
+        try:
+            self.assertEqual(spark.conf.get("some-config"), "v2")
+            self.assertEqual(spark.sparkContext._conf.get("some-config"), "v2")
+            self.assertEqual(spark.version, spark.sparkContext.version)
+            spark.sql("CREATE DATABASE test_db")
+            spark.catalog.setCurrentDatabase("test_db")
+            self.assertEqual(spark.catalog.currentDatabase(), "test_db")
+            spark.sql("CREATE TABLE table1 (name STRING, age INT) USING parquet")
+            self.assertEqual(spark.table("table1").columns, ['name', 'age'])
+            self.assertEqual(spark.range(3).count(), 3)
+        finally:
+            spark.stop()
+
+    def test_global_default_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            self.assertEqual(SparkSession.builder.getOrCreate(), spark)
+        finally:
+            spark.stop()
+
+    def test_default_and_active_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        activeSession = spark._jvm.SparkSession.getActiveSession()
+        defaultSession = spark._jvm.SparkSession.getDefaultSession()
+        try:
+            self.assertEqual(activeSession, defaultSession)
+        finally:
+            spark.stop()
+
+    def test_config_option_propagated_to_existing_SparkSession(self):
+        session1 = SparkSession.builder \
+            .master("local") \
+            .config("spark-config1", "a") \
+            .getOrCreate()
+        self.assertEqual(session1.conf.get("spark-config1"), "a")
+        session2 = SparkSession.builder \
+            .config("spark-config1", "b") \
+            .getOrCreate()
+        try:
+            self.assertEqual(session1, session2)
+            self.assertEqual(session1.conf.get("spark-config1"), "b")
+        finally:
+            session1.stop()
+
+    def test_new_session(self):
+        session = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        newSession = session.newSession()
+        try:
+            self.assertNotEqual(session, newSession)
+        finally:
+            session.stop()
+            newSession.stop()
+
+    def test_create_new_session_if_old_session_stopped(self):
+        session = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        session.stop()
+        newSession = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            self.assertNotEqual(session, newSession)
+        finally:
+            newSession.stop()
+
+    def test_active_session_with_None_and_not_None_context(self):
+        from pyspark.context import SparkContext
+        from pyspark.conf import SparkConf
+        sc = SparkContext._active_spark_co
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r226178054

--- Diff: python/pyspark/sql/functions.py ---
@@ -2713,6 +2713,25 @@ def from_csv(col, schema, options={}):
     return Column(jc)
+
+@since(3.0)
+def _getActiveSession():
--- End diff --

Do you mean the _ prefix or the function itself?
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r225667299

--- Diff: python/pyspark/sql/tests.py ---
@@ -3654,6 +3654,109 @@ def test_jvm_default_session_already_set(self):
         spark.stop()
+
+class SparkSessionTests2(unittest.TestCase):
+
+    def test_active_session(self):
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        try:
+            activeSession = SparkSession.getActiveSession()
+            df = activeSession.createDataFrame([(1, 'Alice')], ['age', 'name'])
+            self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')])
+        finally:
+            spark.stop()
+
+    def test_get_active_session_when_no_active_session(self):
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
+        spark = SparkSession.builder \
+            .master("local") \
+            .getOrCreate()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, spark)
+        spark.stop()
+        active = SparkSession.getActiveSession()
+        self.assertEqual(active, None)
--- End diff --

Thanks @holdenk. I will add a test for the above comment, and also add a test for your comment regarding

```
self._jvm.SparkSession.setActiveSession(self._jsparkSession)
```
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r225667174

--- Diff: python/pyspark/sql/functions.py ---
@@ -2633,6 +2633,23 @@ def sequence(start, stop, step=None):
         _to_java_column(start), _to_java_column(stop), _to_java_column(step)))
+
+@since(3.0)
+def getActiveSession():
+    """
+    Returns the active SparkSession for the current thread
+    """
+    from pyspark.sql import SparkSession
+    sc = SparkContext._active_spark_context
--- End diff --

@holdenk @HyukjinKwon Thanks for the comments. I checked Scala's behavior:

```
test("my test") {
  val cx = SparkContext.getActive
  val session = SparkSession.getActiveSession
  println(cx)
  println(session)
}
```

The result is

```
None
None
```

So it returns None if sc is None. Actually, my current code already returns None if sc is None, but I will change the code a bit to make it more obvious. I will also add a _ prefix to the function name and mention in the docstring that this function is not supposed to be called directly.
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r225666954

--- Diff: python/pyspark/sql/session.py ---
@@ -231,6 +231,7 @@ def __init__(self, sparkContext, jsparkSession=None):
                 or SparkSession._instantiatedSession._sc._jsc is None:
             SparkSession._instantiatedSession = self
             self._jvm.SparkSession.setDefaultSession(self._jsparkSession)
+            self._jvm.SparkSession.setActiveSession(self._jsparkSession)
--- End diff --

@holdenk @HyukjinKwon Thanks for the comments. I looked at the Scala code; it calls ```setActiveSession``` in ```createDataFrame```:

```
def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame = {
  SparkSession.setActiveSession(this)
  ...
}
```

I will do the same for Python:

```
def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True):
    SparkSession._activeSession = self
    self._jvm.SparkSession.setActiveSession(self._jsparkSession)
```

I will also add a test.
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r223760003

--- Diff: examples/src/main/python/ml/prefixspan_example.py ---
@@ -0,0 +1,48 @@
+#
--- End diff --

@felixcheung I don't think the doc and the Java/Scala/Python examples are in 2.4. Shall I open a separate JIRA to merge in the fix?
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r223165392

--- Diff: python/pyspark/sql/session.py ---
@@ -252,6 +255,20 @@ def newSession(self):
         """
         return self.__class__(self._sc, self._jsparkSession.newSession())
+
+    @classmethod
+    @since(2.5)
+    def getActiveSession(cls):
+        """
+        Returns the active SparkSession for the current thread, returned by the builder.
+        >>> s = SparkSession.getActiveSession()
+        >>> l = [('Alice', 1)]
+        >>> rdd = s.sparkContext.parallelize(l)
+        >>> df = s.createDataFrame(rdd, ['name', 'age'])
+        >>> df.select("age").collect()
+        [Row(age=1)]
+        """
+        return cls._activeSession
--- End diff --

@HyukjinKwon I am not sure if I follow your suggestion correctly. Does the following look right to you?

session.py:

```
@classmethod
@since(3.0)
def getActiveSession(cls):
    from pyspark.sql import functions
    return functions.getActiveSession()
```

functions.py:

```
@since(3.0)
def getActiveSession():
    from pyspark.sql import SparkSession
    sc = SparkContext._active_spark_context
    if sc is None:
        sc = SparkContext()
    if sc._jvm.SparkSession.getActiveSession().isDefined():
        SparkSession(sc, sc._jvm.SparkSession.getActiveSession().get())
        return SparkSession._activeSession
    else:
        return None
```
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r221394694

--- Diff: python/pyspark/sql/session.py ---
@@ -231,6 +231,7 @@ def __init__(self, sparkContext, jsparkSession=None):
                 or SparkSession._instantiatedSession._sc._jsc is None:
             SparkSession._instantiatedSession = self
             self._jvm.SparkSession.setDefaultSession(self._jsparkSession)
+            self._jvm.SparkSession.setActiveSession(self._jsparkSession)
--- End diff --

@HyukjinKwon Do you mean something like this:

```
def test_two_spark_session(self):
    session1 = None
    session2 = None
    try:
        session1 = SparkSession.builder.config("key1", "value1").getOrCreate()
        session2 = SparkSession.builder.config("key2", "value2").getOrCreate()
        self.assertEqual(session1, session2)
        df = session1.createDataFrame([(1, 'Alice')], ['age', 'name'])
        self.assertEqual(df.collect(), [Row(age=1, name=u'Alice')])
        activeSession1 = session1.getActiveSession()
        activeSession2 = session2.getActiveSession()
        self.assertEqual(activeSession1, activeSession2)
    finally:
        if session1 is not None:
            session1.stop()
        if session2 is not None:
            session2.stop()
```
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r221089916

--- Diff: python/pyspark/sql/session.py ---
@@ -231,6 +231,7 @@ def __init__(self, sparkContext, jsparkSession=None):
                 or SparkSession._instantiatedSession._sc._jsc is None:
             SparkSession._instantiatedSession = self
             self._jvm.SparkSession.setDefaultSession(self._jsparkSession)
+            self._jvm.SparkSession.setActiveSession(self._jsparkSession)
--- End diff --

@HyukjinKwon It seems to me that the active session is set correctly in ```__init__```. When ```createDataFrame``` is called, we already have a session, and the active session was already set in ```__init__```.
[GitHub] spark issue #22295: [SPARK-25255][PYTHON]Add getActiveSession to SparkSessio...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22295

I just saw this fix: [SPARK-25525][SQL][PYSPARK] Do not update conf for existing SparkContext in SparkSession.getOrCreate. #22545

I will remove ```test_create_SparkContext_then_SparkSession```.
[GitHub] spark issue #22537: [SPARK-21291][R] add R partitionBy API in DataFrame
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22537 Thanks! @HyukjinKwon @felixcheung --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22537: [SPARK-21291][R] add R partitionBy API in DataFra...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22537 [SPARK-21291][R] add R partitionBy API in DataFrame ## What changes were proposed in this pull request? Add an R partitionBy API in write.df. I didn't add bucketBy in write.df, because the last line of write.df is
```
write <- handledCallJMethod(write, "save")
```
and save doesn't support bucketBy right now:
```
assertNotBucketed("save")
```
## How was this patch tested? Added a unit test in test_sparkSQL.R You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-21291 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22537.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22537 commit 7544ee503f45bb2890d12263aa5a27b19797123d Author: Huaxin Gao Date: 2018-09-24T17:42:16Z [SPARK-21291] add R partitionBy API In DataFrame --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r218237306
--- Diff: python/pyspark/sql/session.py ---
@@ -231,6 +231,7 @@ def __init__(self, sparkContext, jsparkSession=None):
                 or SparkSession._instantiatedSession._sc._jsc is None:
             SparkSession._instantiatedSession = self
             self._jvm.SparkSession.setDefaultSession(self._jsparkSession)
+            self._jvm.SparkSession.setActiveSession(self._jsparkSession)
--- End diff --
Thank you very much for your comments. I have a question here: in the stop() method, shall we clear the active session too? Currently, it has
```
def stop(self):
    """Stop the underlying :class:`SparkContext`.
    """
    self._jvm.SparkSession.clearDefaultSession()
    SparkSession._instantiatedSession = None
```
Do I need to add the following?
```
self._jvm.SparkSession.clearActiveSession()
```
To test getActiveSession when there is no active session, I am thinking of adding
```
def test_get_active_session_when_no_active_session(self):
    spark = SparkSession.builder \
        .master("local") \
        .getOrCreate()
    spark.stop()
    active = spark.getActiveSession()
    self.assertEqual(active, None)
```
The test didn't pass because stop() does not clear the active session.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
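For reference, a minimal sketch of what ```stop()``` could look like with the active session cleared as well, assuming the JVM side exposes ```clearActiveSession()``` (the Scala ```SparkSession``` companion object does):
```python
# Sketch only: the quoted stop() excerpt with the proposed active-session
# cleanup added, so a later getActiveSession() returns None.
def stop(self):
    """Stop the underlying :class:`SparkContext`.
    """
    self._jvm.SparkSession.clearDefaultSession()
    self._jvm.SparkSession.clearActiveSession()   # proposed addition
    SparkSession._instantiatedSession = None
    SparkSession._activeSession = None            # proposed addition
```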
[GitHub] spark issue #21649: [SPARK-23648][R][SQL]Adds more types for hint in SparkR
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21649 @felixcheung Thanks for your comments. I changed ```stopifnot```. At L3925 I could add
```
hintList <- list("hint2", "hint3", "hint4")
hint(df, "hint1", 1.2345, hintList)
```
but I am a little hesitant to give an example. Even though we relax the check for hint, I am not sure how a numeric or list hint would be used in real SQL.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21649#discussion_r216727458
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3939,7 +3929,15 @@ setMethod("hint",
           signature(x = "SparkDataFrame", name = "character"),
           function(x, name, ...) {
             parameters <- list(...)
-            stopifnot(all(sapply(parameters, isTypeAllowedForSqlHint)))
+            stopifnot(all(sapply(parameters, function(x) {
--- End diff --
My bad. Will change.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21649: [SPARK-23648][R][SQL]Adds more types for hint in ...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21649#discussion_r216413819
--- Diff: R/pkg/R/DataFrame.R ---
@@ -3905,6 +3905,16 @@ setMethod("rollup",
             groupedData(sgd)
           })

+isTypeAllowedForSqlHint <- function(x) {
+  if (is.character(x) || is.numeric(x)) {
+    TRUE
+  } else if (is.list(x)) {
+    all(sapply(x, (function(y) is.character(y) || is.numeric(y))))
--- End diff --
I think multiple hints of different types in a single list work OK. I tested ```df.hint("hint1", Seq(1, 2, "a"))``` in Scala, and it works fine.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r216115581
--- Diff: python/pyspark/sql/session.py ---
@@ -252,6 +252,16 @@ def newSession(self):
         """
         return self.__class__(self._sc, self._jsparkSession.newSession())

+    @since(2.4)
+    def getActiveSession(self):
+        """
+        Returns the active SparkSession for the current thread, returned by the builder.
+        >>> s = spark.getActiveSession()
+        >>> spark._jsparkSession.getDefaultSession().get().equals(s.get())
+        True
+        """
+        return self._jsparkSession.getActiveSession()
--- End diff --
@HyukjinKwon I added a set of tests. Some of them are borrowed from ```SparkSessionBuilderSuite.scala```.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21710 @felixcheung Are there any other things I need to change? If not, could this PR be merged in 2.4? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21649: [SPARK-23648][R][SQL]Adds more types for hint in SparkR
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21649 @felixcheung Are there any other things I need to change? If not, could this PR be merged in 2.4? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20442: [SPARK-23265][ML]Update multi-column error handling logi...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/20442 Any more comments? @MLnick @jkbradley --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r215022091
--- Diff: python/pyspark/sql/session.py ---
@@ -252,6 +252,16 @@ def newSession(self):
         """
         return self.__class__(self._sc, self._jsparkSession.newSession())

+    @since(2.4)
+    def getActiveSession(self):
+        """
+        Returns the active SparkSession for the current thread, returned by the builder.
+        >>> s = spark.getActiveSession()
+        >>> spark._jsparkSession.getDefaultSession().get().equals(s.get())
--- End diff --
@holdenk @felixcheung Thanks for the review. I will change this.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22295#discussion_r215022059
--- Diff: python/pyspark/sql/session.py ---
@@ -252,6 +252,16 @@ def newSession(self):
         """
         return self.__class__(self._sc, self._jsparkSession.newSession())

+    @since(2.4)
+    def getActiveSession(self):
+        """
+        Returns the active SparkSession for the current thread, returned by the builder.
+        >>> s = spark.getActiveSession()
+        >>> spark._jsparkSession.getDefaultSession().get().equals(s.get())
+        True
+        """
+        return self._jsparkSession.getActiveSession()
--- End diff --
@HyukjinKwon Sorry for the late reply. Yes, this returns a JVM instance. In the Scala code, ```SparkSession.getActiveSession``` returns an ```Option[SparkSession]```. I am not sure how to do a Python equivalent of Scala's ```Option```. In the following code, is there a way to wrap the Python ```session``` in the else path into something equivalent to Scala's ```Option```? If not, can I just return the Python ```session```?
```
if self._jsparkSession.getActiveSession() is None:
    return None
else:
    return self.__class__(self._sc, self._jsparkSession.getActiveSession().get())
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
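For what it's worth, Python has no built-in ```Option``` type; the idiomatic equivalent of ```Option[SparkSession]``` is simply "a ```SparkSession``` or ```None```". Below is a minimal sketch of that shape as a method on the Python SparkSession class, reusing the ```isDefined()```/```get()``` calls quoted earlier in the thread (a hypothetical sketch, not the merged implementation):
```python
@classmethod
def getActiveSession(cls):
    """Return the wrapped active SparkSession, or None if there is none."""
    # SparkContext here is pyspark.SparkContext.
    sc = SparkContext._active_spark_context
    if sc is None:
        return None
    jsession = sc._jvm.SparkSession.getActiveSession()  # a Scala Option
    if jsession.isDefined():
        # Wrap the JVM session in a Python SparkSession rather than
        # returning the raw py4j object.
        return cls(sc, jsession.get())
    return None
```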
[GitHub] spark pull request #22228: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
Github user huaxingao closed the pull request at: https://github.com/apache/spark/pull/22228 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22291: [SPARK-25007][R]Add array_intersect/array_except/...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22291#discussion_r214472480
--- Diff: R/pkg/R/generics.R ---
@@ -799,10 +807,18 @@ setGeneric("array_sort", function(x) { standardGeneric("array_sort") })
 #' @name NULL
 setGeneric("arrays_overlap", function(x, y) { standardGeneric("arrays_overlap") })

+#' @rdname column_collection_functions
+#' @name NULL
+setGeneric("array_union", function(x, y) { standardGeneric("array_union") })
+
 #' @rdname column_collection_functions
 #' @name NULL
 setGeneric("arrays_zip", function(x, ...) { standardGeneric("arrays_zip") })

+#' @rdname column_collection_functions
+#' @name NULL
+setGeneric("shuffle", function(x) { standardGeneric("shuffle") })
--- End diff --
@felixcheung Changed. Thanks!
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22295: [SPARK-25255][PYTHON]Add getActiveSession to Spar...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22295 [SPARK-25255][PYTHON]Add getActiveSession to SparkSession in PySpark ## What changes were proposed in this pull request? add getActiveSession in session.py ## How was this patch tested? add doctest You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark25255 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22295.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22295 commit e9885b38e35dcfd01a40e43cf442beeaea226b98 Author: Huaxin Gao Date: 2018-08-31T00:50:23Z [SPARK-25255][PYTHON]Add getActiveSession to SparkSession in PySpark --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22291: [SPARK-25007][R]Add array_intersect/array_except/array_u...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22291 @felixcheung @HyukjinKwon Sorry I couldn't figure out how to make the ```sequence``` work in the other PR. I will work on this one first. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22291: [SPARK-25007][R]Add array_intersect/array_except/...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22291 [SPARK-25007][R]Add array_intersect/array_except/array_union/shuffle to SparkR ## What changes were proposed in this pull request? Add the R version of array_intersect/array_except/array_union/shuffle ## How was this patch tested? Add test in test_sparkSQL.R You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-25007 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22291.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22291 commit 4a4507ff8c9764933bc1386ff57cfba1d7b18fac Author: Huaxin Gao Date: 2018-08-30T20:30:47Z [SPARK-25007][R]Add array_intersect/array_except/array_union/array_shuffle to SparkR --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22228: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/22228 @jkbradley This is the backport to 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22228: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22228 [SPARK-25124][ML]VectorSizeHint setSize and getSize don't return values backport to 2.3 ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return a value. Added the missing return statements. ## How was this patch tested? Unit test added You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-25124-2.3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22228.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22228 commit 35092f0b829dea4520014eb980e31bb3511c9e7f Author: Huaxin Gao Date: 2018-08-24T21:10:05Z [SPARK-25124][ML]VectorSizeHint setSize and getSize don't return values backport to 2.3 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/22136#discussion_r212088986
--- Diff: python/pyspark/ml/tests.py ---
@@ -844,6 +844,28 @@ def test_string_indexer_from_labels(self):
             .select(model_default.getOrDefault(model_default.outputCol)).collect()
         self.assertEqual(len(transformed_list), 5)

+    def test_vector_size_hint(self):
--- End diff --
@jkbradley Sorry, my bad. I added set/getSize and removed VectorAssembler from the test to simplify it.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22136: [SPARK-25124][ML]VectorSizeHint setSize and getSi...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/22136 [SPARK-25124][ML]VectorSizeHint setSize and getSize don't return values ## What changes were proposed in this pull request? In feature.py, VectorSizeHint setSize and getSize don't return a value. Added the missing return statements. ## How was this patch tested? I tested the changes locally. You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-25124 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22136.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22136 commit 91a819af424778d284d66893cc6b11e1015720e0 Author: Huaxin Gao Date: 2018-08-17T18:15:53Z [SPARK-25124]VectorSizeHint setSize and getSize don't return values --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
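The shape of the fix is just adding the missing ```return``` to the usual Params accessor pattern; a hedged excerpt of what the corrected methods look like (decorators and the surrounding class omitted):
```python
# Sketch of the corrected accessors on VectorSizeHint (excerpt only).
def setSize(self, value):
    """Sets size param, the size of vectors in `inputCol`."""
    return self._set(size=value)          # the `return` here was missing

def getSize(self):
    """Gets size param, the size of vectors in `inputCol`."""
    return self.getOrDefault(self.size)   # likewise missing before the fix
```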
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21835#discussion_r210326849 --- Diff: R/pkg/R/functions.R --- @@ -3320,7 +3321,7 @@ setMethod("explode", #' @aliases sequence sequence,Column-method #' @note sequence since 2.4.0 setMethod("sequence", - signature(x = "Column", y = "Column"), + signature(), --- End diff -- @felixcheung gentle ping --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21439: [SPARK-24391][SQL] Support arrays of any types by from_j...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21439 Sure. I will work on it. Thanks for letting me know. @viirya --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21925: [SPARK-24973][PYTHON]Add numIter to Python Cluste...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/21925 [SPARK-24973][PYTHON]Add numIter to Python ClusteringSummary ## What changes were proposed in this pull request? Add numIter to Python version of ClusteringSummary ## How was this patch tested? Modified existing UT test_multiclass_logistic_regression_summary You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-24973 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21925.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21925 commit 7cf662b46a5da5af2e4c5df2c27aa4b14cb46830 Author: Huaxin Gao Date: 2018-07-30T22:35:05Z [SPARK-24973][PYTHON]Add numIter to Python ClusteringSummary --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21835#discussion_r205609369
--- Diff: R/pkg/R/functions.R ---
@@ -3320,7 +3321,7 @@ setMethod("explode",
 #' @aliases sequence sequence,Column-method
 #' @note sequence since 2.4.0
 setMethod("sequence",
-          signature(x = "Column", y = "Column"),
+          signature(),
--- End diff --
I will need to keep the ```sequence``` method callable without the base:: prefix, right? Is the following implementation OK?
```
setMethod("sequence",
          signature(),
          function(x, y = NULL, step = NULL) {
            if (is.null(y) && is.null(step)) {
              base::sequence(x)
            } else {
              jc <- if (is.null(step)) {
                callJStatic("org.apache.spark.sql.functions", "sequence", x@jc, y@jc)
              } else {
                stopifnot(class(step) == "Column")
                callJStatic("org.apache.spark.sql.functions", "sequence", x@jc, y@jc, step@jc)
              }
              column(jc)
            }
          })
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21835#discussion_r205538890
--- Diff: R/pkg/R/functions.R ---
@@ -3320,7 +3321,7 @@ setMethod("explode",
 #' @aliases sequence sequence,Column-method
 #' @note sequence since 2.4.0
 setMethod("sequence",
-          signature(x = "Column", y = "Column"),
+          signature(),
--- End diff --
@felixcheung It seems that if I only have ... in the generic function,
```
setGeneric("sequence", function(...) { standardGeneric("sequence") })
```
I can't have anything in the signature; otherwise I get an error in matchSignature.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21835#discussion_r204861059 --- Diff: R/pkg/tests/fulltests/test_context.R --- @@ -21,10 +21,11 @@ test_that("Check masked functions", { # Check that we are not masking any new function from base, stats, testthat unexpectedly # NOTE: We should avoid adding entries to *namesOfMaskedCompletely* as masked functions make it # hard for users to use base R functions. Please check when in doubt. - namesOfMaskedCompletely <- c("cov", "filter", "sample", "not") + namesOfMaskedCompletely <- c("cov", "filter", "sample", "not", "sequence") --- End diff -- Thanks @HyukjinKwon @felixcheung for your review. I will change this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21835: [SPARK-24779]Add sequence / map_concat / map_from_entrie...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21835 @HyukjinKwon @felixcheung Could you please review? Thank you very much in advance! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21835: [SPARK-24779]Add sequence / map_concat / map_from...
GitHub user huaxingao opened a pull request: https://github.com/apache/spark/pull/21835 [SPARK-24779]Add sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off ## What changes were proposed in this pull request? Add the R version of sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off ## How was this patch tested? Add test in test_sparkSQL.R You can merge this pull request into a Git repository by running: $ git pull https://github.com/huaxingao/spark spark-24779 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21835.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21835 commit faff20b70e3577d58976283ac7157717d8b34c8d Author: Huaxin Gao Date: 2018-07-21T02:20:50Z [SPARK-24779]Add sequence / map_concat / map_from_entries / an option in months_between UDF to disable rounding-off --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21820: [SPARK-24868][PYTHON]add sequence function in Python
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21820 @HyukjinKwon Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21820: [SPARK-24868][PYTHON]add sequence function in Pyt...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21820#discussion_r203934505
--- Diff: python/pyspark/sql/functions.py ---
@@ -2551,6 +2551,27 @@ def map_concat(*cols):
     return Column(jc)

+@since(2.4)
+def sequence(start, stop, step=None):
+    """
+    Generate a sequence of integers from start to stop, incrementing by step.
+    If step is not set, incrementing by 1 if start is less than or equal to stop, otherwise -1.
+
+    >>> df1 = spark.createDataFrame([(-2, 2)], ('C1', 'C2'))
+    >>> df1.select(sequence('C1', 'C2').alias('r')).collect()
+    [Row(r=[-2, -1, 0, 1, 2])]
+    >>> df2 = spark.createDataFrame([(4, -4, -2)], ('C1', 'C2', 'C3'))
--- End diff --
It will throw ```java.lang.IllegalArgumentException```. There is a check in ```collectionOperations.scala```:
```
require(
  (step > num.zero && start <= stop) ||
    (step < num.zero && start >= stop) ||
    (step == num.zero && start == stop),
  s"Illegal sequence boundaries: $start to $stop by $step")
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
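So, for example, flipping the step sign against the boundaries should trip that ```require``` check (a sketch; in PySpark the JVM exception surfaces wrapped in a py4j error):
```python
from pyspark.sql.functions import sequence

# start=4 is greater than stop=-4 but step=2 is positive, so the require()
# above fails with something like:
#   java.lang.IllegalArgumentException: requirement failed:
#   Illegal sequence boundaries: 4 to -4 by 2
df = spark.createDataFrame([(4, -4, 2)], ('C1', 'C2', 'C3'))
df.select(sequence('C1', 'C2', 'C3').alias('r')).collect()
```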
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203526021
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/PrefixSpanWrapper.scala ---
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.ml.fpm.PrefixSpan
+
+private[r] object PrefixSpanWrapper {
+  def getPrefixSpan(
+      minSupport: Double,
+      maxPatternLength: Int,
+      maxLocalProjDBSize: Double,
--- End diff --
```maxLocalProjDBSize``` is a ```LongParam``` in Scala, but it seems that R doesn't have a ```long``` type. See ```help(integer)```:
```
Note that on almost all implementations of R the range of representable
integers is restricted to about +/-2*10^9: 'double's can hold much larger
integers exactly.
```
so I used ```double``` in R and convert with ```toLong``` later.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203481597
--- Diff: R/pkg/R/generics.R ---
@@ -1415,6 +1415,13 @@ setGeneric("spark.freqItemsets", function(object) { standardGeneric("spark.freqI
 #' @rdname spark.fpGrowth
 setGeneric("spark.associationRules", function(object) { standardGeneric("spark.associationRules") })

+#' @rdname spark.prefixSpan
+setGeneric("spark.prefixSpan", function(...) { standardGeneric("spark.prefixSpan") })
+
+#' @rdname spark.prefixSpan
--- End diff --
OK, thanks. I guess I will keep both spark.prefixSpan and spark.findFrequentSequentialPatterns in the same Rd file.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203229835
--- Diff: mllib/src/main/scala/org/apache/spark/ml/r/PrefixSpanWrapper.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.spark.ml.fpm.PrefixSpan
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class PrefixSpanWrapper private (val prefixSpan: PrefixSpan) {
--- End diff --
You are right. I will remove
```
private[r] class PrefixSpanWrapper private (val prefixSpan: PrefixSpan) {
  def findFrequentSequentialPatterns(dataset: Dataset[_]): DataFrame = {
    prefixSpan.findFrequentSequentialPatterns(dataset)
  }
}
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203229794
--- Diff: R/pkg/tests/fulltests/test_mllib_fpm.R ---
@@ -82,4 +82,26 @@ test_that("spark.fpGrowth", {
 })

+test_that("spark.prefixSpan", {
+  df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
+                             list(list(list(1L), list(3L, 2L), list(1L, 2L))),
+                             list(list(list(1L, 2L), list(5L))),
+                             list(list(list(6L)))), schema = c("sequence"))
+  prefix_Span1 <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 5L,
+                                   maxLocalProjDBSize = 3200L)
+  result1 <- spark.findFrequentSequentialPatterns(prefix_Span1, df)
+
+  expected_result <- createDataFrame(list(list(list(list(1L)), 3L),
+                                          list(list(list(3L)), 2L),
+                                          list(list(list(2L)), 3L),
+                                          list(list(list(1L, 2L)), 3L),
+                                          list(list(list(1L), list(3L)), 2L)),
+                                     schema = c("sequence", "freq"))
+  expect_equivalent(expected_result, result1)
+
+  prefix_Span2 <- spark.prefixSpan(minSupport = 0.5, maxPatternLength = 5L)
+  result2 <- spark.findFrequentSequentialPatterns(prefix_Span2, df)
+  expect_equivalent(expected_result, result2)
--- End diff --
Just trying to test that the default value (```maxLocalProjDBSize = 3200L```) is used when the parameter isn't set explicitly.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21710#discussion_r203229733 --- Diff: R/pkg/R/generics.R --- @@ -1415,6 +1415,13 @@ setGeneric("spark.freqItemsets", function(object) { standardGeneric("spark.freqI #' @rdname spark.fpGrowth setGeneric("spark.associationRules", function(object) { standardGeneric("spark.associationRules") }) +#' @rdname spark.prefixSpan +setGeneric("spark.prefixSpan", function(...) { standardGeneric("spark.prefixSpan") }) + +#' @rdname spark.prefixSpan --- End diff -- @viirya Thanks for your review. Do you mean to use ```findFrequentSequentialPatterns``` instead of ```spark.findFrequentSequentialPatterns```? I have a question about when we need to add ```spark``` in front of the method name. In FPGrowth, some methods have ```spark.```, e.g. ```spark.freqItemsets```, ```spark.associationRules```. Some methods don't have ```spark.```, e.g. ```predict``` and ```write.ml``` . --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21645: [SPARK-24537][R]Add array_remove / array_zip / map_from_...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21645 Thanks! @HyukjinKwon @felixcheung --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21710: [SPARK-24207][R]add R API for PrefixSpan
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21710 @felixcheung Can I open a new JIRA for the code example and documentation? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21645: [SPARK-24537][R]Add array_remove / array_zip / ma...
Github user huaxingao commented on a diff in the pull request: https://github.com/apache/spark/pull/21645#discussion_r201827579
--- Diff: R/pkg/R/functions.R ---
@@ -3071,6 +3085,19 @@ setMethod("array_position",
             column(jc)
           })

+#' @details
+#' \code{array_remove}: Removes all elements that equal to element from the given array.
+#'
+#' @rdname column_collection_functions
+#' @aliases array_remove array_remove,Column-method
+#' @note array_remove since 2.4.0
+setMethod("array_remove",
+          signature(x = "Column", value = "ANY"),
--- End diff --
Thanks for your review. I will add it to the doc.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21678: [SPARK-23461][R]vignettes should include model predictio...
Github user huaxingao commented on the issue: https://github.com/apache/spark/pull/21678 @felixcheung Thanks a lot for your help! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org