[GitHub] spark pull request #21492: [SPARK-24300][ML] change the way to set seed in m...
GitHub user ludatabricks opened a pull request: https://github.com/apache/spark/pull/21492

[SPARK-24300][ML] change the way to set seed in ml.cluster.LDASuite.generateLDAData

## What changes were proposed in this pull request?

Use a different RNG in each partition.

## How was this patch tested?

Manually.

Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-24300

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21492.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21492

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
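The per-partition seeding idea behind this change can be sketched outside Spark as follows. This is an illustrative toy, not the actual LDASuite.generateLDAData code: the function name and shape of the data are assumptions; the point is that each partition derives its own RNG from one base seed, so partitions produce independent yet reproducible streams.

```python
import random

def generate_lda_data(base_seed, num_partitions, rows_per_partition):
    """Derive a distinct RNG per partition from one base seed, so each
    partition draws an independent, reproducible stream (illustrative of
    the per-partition seeding idea; not the actual Spark test code)."""
    out = []
    for pid in range(num_partitions):
        rng = random.Random(base_seed + pid)  # one RNG per partition
        out.append([rng.random() for _ in range(rows_per_partition)])
    return out

# Same base seed reproduces the same data; different partitions differ.
a = generate_lda_data(7, 4, 3)
b = generate_lda_data(7, 4, 3)
```

Sharing a single RNG (or the same seed) across partitions would make every partition generate identical rows, which is the failure mode this PR avoids.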
[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...
Github user ludatabricks commented on the issue: https://github.com/apache/spark/pull/21340

Thanks for the PR. LGTM.
[GitHub] spark issue #21344: [SPARK-24114] Add instrumentation to FPGrowth.
Github user ludatabricks commented on the issue: https://github.com/apache/spark/pull/21344

LGTM.
[GitHub] spark pull request #21347: [SPARK-24290][ML] add support for Array input for...
GitHub user ludatabricks opened a pull request: https://github.com/apache/spark/pull/21347

[SPARK-24290][ML] add support for Array input for instrumentation.logNamedValue

## What changes were proposed in this pull request?

- Extend instrumentation.logNamedValue to support Array input
- Change the logging for "clusterSizes" to the new method

## How was this patch tested?

N/A

Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-24290

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21347.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21347

commit a98fbc3654a3fde6b7d7f9189a6f48034fb3a94d
Author: Lu WANG <lu.wang@...>
Date: 2018-05-16T20:19:25Z

add support for Array input for instrumentation.logNamedValue
[GitHub] spark pull request #21335: [SPARK-24231][PYSPARK][ML] Provide Python API for...
GitHub user ludatabricks opened a pull request: https://github.com/apache/spark/pull/21335

[SPARK-24231][PYSPARK][ML] Provide Python API for evaluateEachIteration for spark.ml GBTs

## What changes were proposed in this pull request?

Add evaluateEachIteration for GBTClassificationModel and GBTRegressionModel.

## How was this patch tested?

doctest

Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-14682

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21335.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21335

commit 4265a1a9deaf30185f7dee993979ef89afc18ee1
Author: Lu WANG <lu.wang@...>
Date: 2018-05-14T23:55:20Z

add function in GBTClassifier

commit ae8eb4c0cf6d49259d174390f8ccd8a8fbe674cc
Author: Lu WANG <lu.wang@...>
Date: 2018-05-15T17:34:50Z

add evaluateEachIteration to GBTClassificationModel and GBTRegressionModel

commit c25c5a6c2b4bff44816a8760a38877844f532141
Author: Lu WANG <lu.wang@...>
Date: 2018-05-15T17:40:50Z

fix minor typos
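The idea behind evaluateEachIteration can be sketched without Spark: a gradient-boosted model is a weighted sum of trees, and the method scores the partial ensemble after each boosting iteration. The sketch below assumes precomputed per-tree predictions and uses mean squared error; the function name and inputs are illustrative, not the Spark API.

```python
def evaluate_each_iteration(tree_preds, weights, labels):
    """Mean squared error of the partial ensemble after each boosting
    iteration (prefix of trees). Illustrates the idea behind
    evaluateEachIteration; not the actual spark.ml implementation.

    tree_preds: list of per-tree prediction lists, one per iteration
    weights:    per-tree weights
    labels:     true labels
    """
    n = len(labels)
    partial = [0.0] * n  # running weighted sum of tree outputs
    errors = []
    for preds, w in zip(tree_preds, weights):
        partial = [p + w * t for p, t in zip(partial, preds)]
        errors.append(sum((p - y) ** 2 for p, y in zip(partial, labels)) / n)
    return errors
```

A typical use is plotting the per-iteration error to pick the number of trees after which the model stops improving.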
[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...
Github user ludatabricks commented on the issue: https://github.com/apache/spark/pull/21183

I tested loading the old saved models from Spark 2.3; they load fine with this change. For the tests in LDASuite, I do see occasional failures without this fix, though not every time. I can remove it if you think it is not necessary.
[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...
Github user ludatabricks commented on a diff in the pull request: https://github.com/apache/spark/pull/21265#discussion_r187144226

--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
     def _create_model(self, java_model):
         return FPGrowthModel(java_model)
+
+
+class PrefixSpan(object):
+    """
+    .. note:: Experimental
+
+    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential
+    Patterns Efficiently by Prefix-Projected Pattern Growth
+    (see http://doi.org/10.1109/ICDE.2001.914830).
+
+    .. versionadded:: 2.4.0
+    """
+    @staticmethod
+    @since("2.4.0")
+    def findFrequentSequentialPatterns(dataset,
+                                       sequenceCol,
+                                       minSupport,
+                                       maxPatternLength,
+                                       maxLocalProjDBSize):
+        """
+        .. note:: Experimental
+
+        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+        :param dataset: A dataset or a dataframe containing a sequence column which is
+                        `Seq[Seq[_]]` type.
+        :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
+                            column are ignored.
+        :param minSupport: The minimal support level of the sequential pattern, any pattern that
+                           appears more than (minSupport * size-of-the-dataset) times will be
+                           output (recommended value: `0.1`).
+        :param maxPatternLength: The maximal length of the sequential pattern
+                                 (recommended value: `10`).
+        :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
+                                   internal storage format) allowed in a projected database before
+                                   local processing. If a projected database exceeds this size,
+                                   another iteration of distributed prefix growth is run
+                                   (recommended value: `3200`).
+        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+                 The schema of it will be:
+                 - `sequence: Seq[Seq[T]]` (T is the item type)
+                 - `freq: Long`
+
+        >>> from pyspark.ml.fpm import PrefixSpan
+        >>> from pyspark.sql import Row
+        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --

One question: should we add something in the example to show a special case or how these parameters work? For example:
- add a pattern that is longer than ``maxPatternLength``
- add nulls in the column
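The interaction of `minSupport` and `maxPatternLength` discussed above can be shown with a toy filter. This is not PrefixSpan itself, only the two thresholds its docstring describes: a pattern survives if its count reaches `minSupport * dataset_size` and its length does not exceed `maxPatternLength`. The function name and the use of `>=` are assumptions for illustration.

```python
def frequent_patterns(pattern_counts, min_support, dataset_size, max_pattern_length):
    """Toy post-filter illustrating PrefixSpan's thresholds: keep patterns
    whose count reaches min_support * dataset_size and whose length is at
    most max_pattern_length. Patterns are tuples of itemsets (tuples)."""
    threshold = min_support * dataset_size
    return {p: c for p, c in pattern_counts.items()
            if c >= threshold and len(p) <= max_pattern_length}

# With minSupport=0.5 over 4 sequences, the threshold is 2 occurrences,
# so the pattern seen 3 times survives and the one seen once is dropped.
counts = {((1, 2), (3,)): 3, ((4,),): 1}
kept = frequent_patterns(counts, 0.5, 4, 10)
```

An example row with a pattern longer than `maxPatternLength`, as suggested in the review comment, would simply never appear in the output even if it clears the support threshold.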
[GitHub] spark pull request #21195: [Spark-23975][ML] Add support of array input for ...
Github user ludatabricks commented on a diff in the pull request: https://github.com/apache/spark/pull/21195#discussion_r186566521

--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -323,4 +324,21 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
     assert(model.getOptimizer === optimizer)
   }
 }
+
+  test("LDA with Array input") {
+    def trainAndLogLikelihoodAndPerplexity(dataset: Dataset[_]): (Double, Double) = {
+      val model = new LDA().setK(k).setOptimizer("online").setMaxIter(1).setSeed(1).fit(dataset)
+      (model.logLikelihood(dataset), model.logPerplexity(dataset))
+    }
+
+    val (newDataset, newDatasetD, newDatasetF) = MLTestingUtils.generateArrayFeatureDataset(dataset)
+    val (ll, lp) = trainAndLogLikelihoodAndPerplexity(newDataset)
--- End diff --

Yes. I want to use this as the base for the comparison after we fix SPARK-22210.
[GitHub] spark pull request #21218: [SPARK-24155][ML] Instrumentation improvements fo...
Github user ludatabricks commented on a diff in the pull request: https://github.com/apache/spark/pull/21218#discussion_r185894432

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala ---
@@ -423,6 +423,8 @@ class GaussianMixture @Since("2.0.0") (
     val summary = new GaussianMixtureSummary(model.transform(dataset), $(predictionCol),
       $(probabilityCol), $(featuresCol), $(k), logLikelihood)
     model.setSummary(Some(summary))
+    instr.logNamedValue("logLikelihood", logLikelihood)
+    instr.logNamedValue("clusterSizes", summary.clusterSizes.toString)
--- End diff --

@WeichenXu123 The function `clusterSizes.mkString(", ")` converts the array to a string, separating the elements with commas. What do you think?
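The point of the `mkString(", ")` suggestion is that `Array.toString` in Scala does not print the elements, while `mkString` renders them as a readable comma-separated list. A Python sketch of the log line that would result (the function name is illustrative, not the Spark Instrumentation API):

```python
def log_named_array(name, values):
    """Render an array for a log line as 'name: v1, v2, v3', the way
    Scala's mkString(", ") flattens it. Illustrative only; not the
    actual spark.ml Instrumentation API."""
    return name + ": " + ", ".join(str(v) for v in values)

# e.g. logging GaussianMixture cluster sizes
line = log_named_array("clusterSizes", [3, 5, 2])
```

This is essentially what the later PR #21347 generalizes by teaching `logNamedValue` to accept Array input directly.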
[GitHub] spark issue #21204: [SPARK-24132][ML] Instrumentation improvement for classi...
Github user ludatabricks commented on the issue: https://github.com/apache/spark/pull/21204

LGTM. Retest this please.
[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...
Github user ludatabricks commented on the issue: https://github.com/apache/spark/pull/13493

LGTM. Retest this please.
[GitHub] spark pull request #21218: [SPARK-24155][ML] Instrument improvements for clu...
GitHub user ludatabricks opened a pull request: https://github.com/apache/spark/pull/21218

[SPARK-24155][ML] Instrument improvements for clustering

## What changes were proposed in this pull request?

Improved the instrumentation for all of the clustering methods.

## How was this patch tested?

N/A

Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-23686-1

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21218.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21218

commit 1810dbe8204981986fa2598b5b022a5c74e43732
Author: Lu WANG <lu.wang@...>
Date: 2018-05-02T04:11:40Z

instrumentation improvement for clustering

commit 07c20a45737f9a4f008eef6da717034670427483
Author: Lu WANG <lu.wang@...>
Date: 2018-05-02T21:18:16Z

add more info for instrument
[GitHub] spark pull request #21204: [SPARK-24132][ML]Expand instrumentation for class...
GitHub user ludatabricks opened a pull request: https://github.com/apache/spark/pull/21204

[SPARK-24132][ML] Expand instrumentation for classification

## What changes were proposed in this pull request?

- Add OptionalInstrumentation as argument for getNumClasses in ml.classification.Classifier
- Change the function call for getNumClasses in train() in ml.classification.DecisionTreeClassifier, ml.classification.RandomForestClassifier, and ml.classification.NaiveBayes
- Modify the instrumentation creation in ml.classification.LinearSVC
- Change the log call in ml.classification.OneVsRest and ml.classification.LinearSVC

## How was this patch tested?

Manual.

Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-23686

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21204.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21204

commit 7b75ed6aaa21faf68f5f2db04eaf550f2d468542
Author: Lu WANG <lu.wang@...>
Date: 2018-05-01T06:13:36Z

Expand instrumentation for classification
[GitHub] spark pull request #21195: [Spark 23975][ML] Add support of array input for ...
GitHub user ludatabricks opened a pull request: https://github.com/apache/spark/pull/21195

[Spark 23975][ML] Add support of array input for all clustering methods

## What changes were proposed in this pull request?

Add support of array input for all of the clustering methods.

## How was this patch tested?

Unit tests added.

Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-23975-1

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21195.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21195

commit 31226b4b8e5aa5fc016f61ec86c42683c452a696
Author: Lu WANG <lu.wang@...>
Date: 2018-04-26T17:46:49Z

add Array input support for BisectingKMeans

commit 45e6e96e974607ed0526401d0fdbb4f1c8161dd6
Author: Lu WANG <lu.wang@...>
Date: 2018-04-30T17:14:41Z

add support of array input for all clustering methods
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
GitHub user ludatabricks opened a pull request: https://github.com/apache/spark/pull/21183

[SPARK-22210][ML] Add seed for LDA variationalTopicInference

## What changes were proposed in this pull request?

- Add a seed parameter for variationalTopicInference
- Pass the seed when calling variationalTopicInference in submitMiniBatch
- Add a var seed in LDAModel so that it can take the seed from LDA and use it in the calls to variationalTopicInference in logLikelihoodBound, topicDistributions, getTopicDistributionMethod, and topicDistribution

## How was this patch tested?

Checked the test results in mllib.clustering.LDASuite to make sure the results are repeatable with the seed.

Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-22210

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21183.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21183

commit e647e85e21c63546611c50e45bc57d232b0cbe83
Author: Lu WANG <lu.wang@...>
Date: 2018-04-27T16:46:45Z

Add seed for LDA variationalTopicInference
[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...
Github user ludatabricks commented on the issue: https://github.com/apache/spark/pull/13493

The bug is confirmed. The fix looks pretty reasonable to me. ping @jkbradley.
[GitHub] spark pull request #21081: [SPARK-23975][ML]Allow Clustering to take Arrays ...
GitHub user ludatabricks opened a pull request: https://github.com/apache/spark/pull/21081

[SPARK-23975][ML] Allow Clustering to take Arrays of Double as input features

## What changes were proposed in this pull request?

- Multiple possible input types are now accepted in validateAndTransformSchema() and computeCost() when checking the column type
- Add an if statement in transform() to support array type as featuresCol
- Add a case statement in fit() when selecting columns from the dataset

These changes will be applied to KMeans first, then to the other clustering methods.

## How was this patch tested?

A unit test is added.

Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-23975

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21081.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21081

commit ed890d35ff1e9edbe2a557f68732835b3e911906
Author: Lu WANG <lu.wang@...>
Date: 2018-04-16T17:32:02Z

add Array input support for KMeans

commit badb0cc5ca6ca69bb8e8fc0fce5ea05a4100bca0
Author: Lu WANG <lu.wang@...>
Date: 2018-04-16T17:49:00Z

remove redundent code
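The multi-type column check described above can be sketched with a minimal schema validator. Everything here is an assumption for illustration: the type names, the dict-as-schema shape, and the function name stand in for Spark's `VectorUDT`/`ArrayType` checks in validateAndTransformSchema().

```python
# Stand-ins for the column types the PR allows for featuresCol
# (hypothetical names, not Spark's actual type objects).
ALLOWED_FEATURE_TYPES = {"vector", "array<double>", "array<float>"}

def validate_features_col(schema, features_col):
    """Accept any of several column types for the features column,
    mirroring the idea of the expanded validateAndTransformSchema()
    check. schema is a simple {column_name: type_name} dict."""
    dtype = schema[features_col]
    if dtype not in ALLOWED_FEATURE_TYPES:
        raise TypeError("column %s must have one of the types %s, got %s"
                        % (features_col, sorted(ALLOWED_FEATURE_TYPES), dtype))
    return dtype
```

With such a check in place, fit() and transform() can branch on the returned type and convert array columns to vectors before running the clustering algorithm.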
[GitHub] spark pull request #21044: Add RawPrediction, numClasses, and numFeatures fo...
GitHub user ludatabricks opened a pull request: https://github.com/apache/spark/pull/21044

Add RawPrediction, numClasses, and numFeatures for OneVsRestModel

add rawPrediction as an output column; add numClasses and numFeatures to OneVsRestModel

## What changes were proposed in this pull request?

- Add two vals, numClasses and numFeatures, in OneVsRestModel so that we can inherit from Classifier in the future
- Add a rawPrediction output column in transform; the prediction label is calculated from the rawPrediction, as in raw2prediction

## How was this patch tested?

Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-9312

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21044.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21044

commit 0cfc20a3637c06071e6fe48ca5db4834b34c889e
Author: Lu WANG <lu.wang@...>
Date: 2018-04-11T19:08:22Z

add rawPrediction as an output column; add numCLasses and numFeatures to OneVsRestModel
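Deriving the prediction from the rawPrediction column, as this PR describes, amounts to taking the index of the largest raw score across the per-class binary classifiers. A minimal sketch of that step (the function is illustrative; the real raw2prediction in spark.ml works on Vector columns):

```python
def raw2prediction(raw_prediction):
    """Pick the predicted label as the index of the largest raw score,
    the core idea of deriving prediction from rawPrediction in
    OneVsRestModel.transform (sketch, not the Spark implementation)."""
    return max(range(len(raw_prediction)), key=lambda i: raw_prediction[i])

# Three one-vs-rest classifiers; class 1 has the highest raw score.
label = raw2prediction([0.1, 0.9, 0.3])
```

Exposing rawPrediction as its own column lets downstream code (e.g. evaluators) see the per-class scores rather than only the final argmax label.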
[GitHub] spark pull request #21015: [SPARK-23944][ML] Add the set method for the two ...
GitHub user ludatabricks opened a pull request: https://github.com/apache/spark/pull/21015

[SPARK-23944][ML] Add the set method for the two LSHModel

## What changes were proposed in this pull request?

Add two set methods for LSHModel in LSH.scala, BucketedRandomProjectionLSH.scala, and MinHashLSH.scala.

## How was this patch tested?

New tests for the param setup were added to:
- BucketedRandomProjectionLSHSuite.scala
- MinHashLSHSuite.scala

Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-23944

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21015.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21015

commit 9f16ea6f15572a8d189fe537844487abdea797b4
Author: Lu WANG <lu.wang@...>
Date: 2018-04-09T21:56:48Z

Add the set method for two LSHModels