[GitHub] spark pull request #21492: [SPARK-24300][ML] change the way to set seed in m...

2018-06-04 Thread ludatabricks
GitHub user ludatabricks opened a pull request:

https://github.com/apache/spark/pull/21492

[SPARK-24300][ML] Change the way to set the seed in ml.clustering.LDASuite.generateLDAData

## What changes were proposed in this pull request?

Use a different RNG in each partition (a short sketch follows), instead of reusing one identically seeded RNG across all partitions.
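
A minimal sketch of the idea, with illustrative names rather than the actual LDASuite code: seed one RNG per partition from the base seed plus the partition index, so partitions stop producing identical pseudo-random streams.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[4]").getOrCreate()
val sc = spark.sparkContext

val seed = 42L
val vocabSize = 10

// One RNG per partition instead of one shared (and identically seeded) RNG.
val data = sc.parallelize(0 until 100, numSlices = 4).mapPartitionsWithIndex {
  (partitionIndex, rows) =>
    val rng = new java.util.Random(seed + partitionIndex)
    rows.map(_ => rng.nextInt(vocabSize))
}
data.collect()
```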

## How was this patch tested?

Manually tested.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-24300

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21492.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21492






---




[GitHub] spark issue #21340: [SPARK-24115] Have logging pass through instrumentation ...

2018-05-17 Thread ludatabricks
Github user ludatabricks commented on the issue:

https://github.com/apache/spark/pull/21340
  
Thanks for the PR. LGTM. 


---




[GitHub] spark issue #21344: [SPARK-24114] Add instrumentation to FPGrowth.

2018-05-16 Thread ludatabricks
Github user ludatabricks commented on the issue:

https://github.com/apache/spark/pull/21344
  
LGTM.


---




[GitHub] spark pull request #21347: [SPARK-24290][ML] add support for Array input for...

2018-05-16 Thread ludatabricks
GitHub user ludatabricks opened a pull request:

https://github.com/apache/spark/pull/21347

[SPARK-24290][ML] add support for Array input for 
instrumentation.logNamedValue

## What changes were proposed in this pull request?

Extend instrumentation.logNamedValue to support Array input, and change the logging of "clusterSizes" to use the new method (a hedged sketch follows).
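
A minimal sketch of what such an overload could look like; the Instrumentation class below is an illustrative stand-in, and the merged implementation may format arrays differently:

```scala
// Illustrative stand-in for Spark's internal Instrumentation class.
class Instrumentation {
  def logNamedValue(name: String, value: String): Unit =
    println(s"$name: $value")

  // Hypothetical Array overload: render the array explicitly instead of
  // relying on Array.toString, which only prints an object reference.
  def logNamedValue(name: String, value: Array[Long]): Unit =
    logNamedValue(name, value.mkString("[", ", ", "]"))
}

new Instrumentation().logNamedValue("clusterSizes", Array(3L, 5L, 2L))
// prints: clusterSizes: [3, 5, 2]
```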


## How was this patch tested?

N/A



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-24290

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21347.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21347


commit a98fbc3654a3fde6b7d7f9189a6f48034fb3a94d
Author: Lu WANG <lu.wang@...>
Date:   2018-05-16T20:19:25Z

add support for Array input for instrumentation.logNamedValue




---




[GitHub] spark pull request #21335: [SPARK-24231][PYSPARK][ML] Provide Python API for...

2018-05-15 Thread ludatabricks
GitHub user ludatabricks opened a pull request:

https://github.com/apache/spark/pull/21335

[SPARK-24231][PYSPARK][ML]  Provide Python API for evaluateEachIteration 
for spark.ml GBTs

## What changes were proposed in this pull request?

Add evaluateEachIteration for GBTClassificationModel and GBTRegressionModel (a hedged Scala-side sketch follows).
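
For reference, a hedged sketch of the Scala-side method this Python API wraps; `train` is an assumed DataFrame with "label" and "features" columns, and the signatures reflect my reading of the Scala API rather than this PR's diff:

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.regression.GBTRegressor

// `train` is assumed: a DataFrame with "label" and "features" columns.
val clfModel = new GBTClassifier().setMaxIter(5).fit(train)
val clfError: Array[Double] = clfModel.evaluateEachIteration(train)

// The regressor variant additionally takes the loss to evaluate with.
val regModel = new GBTRegressor().setMaxIter(5).fit(train)
val regError: Array[Double] = regModel.evaluateEachIteration(train, "squared")
```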

## How was this patch tested?

doctest



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-14682

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21335.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21335


commit 4265a1a9deaf30185f7dee993979ef89afc18ee1
Author: Lu WANG <lu.wang@...>
Date:   2018-05-14T23:55:20Z

add function in GBTClassifier

commit ae8eb4c0cf6d49259d174390f8ccd8a8fbe674cc
Author: Lu WANG <lu.wang@...>
Date:   2018-05-15T17:34:50Z

add evaluateEachIteration to GBTClassificationModel and GBTRegressionModel

commit c25c5a6c2b4bff44816a8760a38877844f532141
Author: Lu WANG <lu.wang@...>
Date:   2018-05-15T17:40:50Z

fix minor typos




---




[GitHub] spark issue #21183: [SPARK-22210][ML] Add seed for LDA variationalTopicInfer...

2018-05-14 Thread ludatabricks
Github user ludatabricks commented on the issue:

https://github.com/apache/spark/pull/21183
  
I tested loading the old saved models from Spark 2.3; they load fine with this change.

For the tests in LDASuite, I do see failures sometimes without this fix, though not on every run. I can remove it if you think it is not necessary.


---




[GitHub] spark pull request #21265: [SPARK-24146][PySpark][ML] spark.ml parity for se...

2018-05-09 Thread ludatabricks
Github user ludatabricks commented on a diff in the pull request:

https://github.com/apache/spark/pull/21265#discussion_r187144226
  
--- Diff: python/pyspark/ml/fpm.py ---
@@ -243,3 +244,75 @@ def setParams(self, minSupport=0.3, minConfidence=0.8, itemsCol="items",
 
     def _create_model(self, java_model):
         return FPGrowthModel(java_model)
+
+
+class PrefixSpan(object):
+    """
+    .. note:: Experimental
+
+    A parallel PrefixSpan algorithm to mine frequent sequential patterns.
+    The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential
+    Patterns Efficiently by Prefix-Projected Pattern Growth
+    (see <a href="http://doi.org/10.1109/ICDE.2001.914830">here</a>).
+
+    .. versionadded:: 2.4.0
+
+    """
+    @staticmethod
+    @since("2.4.0")
+    def findFrequentSequentialPatterns(dataset,
+                                       sequenceCol,
+                                       minSupport,
+                                       maxPatternLength,
+                                       maxLocalProjDBSize):
+        """
+        .. note:: Experimental
+        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
+
+        :param dataset: A dataset or a dataframe containing a sequence column which is
+                        `Seq[Seq[_]]` type.
+        :param sequenceCol: The name of the sequence column in dataset, rows with nulls in this
+                            column are ignored.
+        :param minSupport: The minimal support level of the sequential pattern, any pattern that
+                           appears more than (minSupport * size-of-the-dataset) times will be
+                           output (recommended value: `0.1`).
+        :param maxPatternLength: The maximal length of the sequential pattern
+                                 (recommended value: `10`).
+        :param maxLocalProjDBSize: The maximum number of items (including delimiters used in the
+                                   internal storage format) allowed in a projected database before
+                                   local processing. If a projected database exceeds this size,
+                                   another iteration of distributed prefix growth is run
+                                   (recommended value: `3200`).
+        :return: A `DataFrame` that contains columns of sequence and corresponding frequency.
+                 The schema of it will be:
+                  - `sequence: Seq[Seq[T]]` (T is the item type)
+                  - `freq: Long`
+
+        >>> from pyspark.ml.fpm import PrefixSpan
+        >>> from pyspark.sql import Row
+        >>> df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
--- End diff --

One question: should we add something to the example to show a special case, or how these parameters work? For example (sketched below):
- add a pattern that is longer than ``maxPatternLength``
- add nulls in the column
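
A hedged sketch (Scala side of the API under review) of what such an example could cover; parameter names follow the diff, and the data values are made up:

```scala
import org.apache.spark.ml.fpm.PrefixSpan
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[2]").getOrCreate()
import spark.implicits._

val df = Seq(
  Seq(Seq(1, 2), Seq(3)),            // an ordinary sequence
  Seq(Seq(1), Seq(3, 2), Seq(1, 2)), // long enough to exceed the cap below
  null                               // rows with nulls in sequenceCol are ignored
).toDF("sequence")

val result = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(2) // patterns longer than this are not returned
  .setSequenceCol("sequence")
  .findFrequentSequentialPatterns(df)

result.show()
```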


---




[GitHub] spark pull request #21195: [SPARK-23975][ML] Add support of array input for ...

2018-05-07 Thread ludatabricks
Github user ludatabricks commented on a diff in the pull request:

https://github.com/apache/spark/pull/21195#discussion_r186566521
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -323,4 +324,21 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
       assert(model.getOptimizer === optimizer)
     }
   }
+
+  test("LDA with Array input") {
+    def trainAndLogLikelihoodAndPerplexity(dataset: Dataset[_]): (Double, Double) = {
+      val model = new LDA().setK(k).setOptimizer("online").setMaxIter(1).setSeed(1).fit(dataset)
+      (model.logLikelihood(dataset), model.logPerplexity(dataset))
+    }
+
+    val (newDataset, newDatasetD, newDatasetF) = MLTestingUtils.generateArrayFeatureDataset(dataset)
+    val (ll, lp) = trainAndLogLikelihoodAndPerplexity(newDataset)
--- End diff --

Yes. I want to use this as the baseline for comparison after we fix SPARK-22210.


---




[GitHub] spark pull request #21218: [SPARK-24155][ML] Instrumentation improvements fo...

2018-05-03 Thread ludatabricks
Github user ludatabricks commented on a diff in the pull request:

https://github.com/apache/spark/pull/21218#discussion_r185894432
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala ---
@@ -423,6 +423,8 @@ class GaussianMixture @Since("2.0.0") (
     val summary = new GaussianMixtureSummary(model.transform(dataset),
       $(predictionCol), $(probabilityCol), $(featuresCol), $(k), logLikelihood)
     model.setSummary(Some(summary))
+    instr.logNamedValue("logLikelihood", logLikelihood)
+    instr.logNamedValue("clusterSizes", summary.clusterSizes.toString)
--- End diff --

@WeichenXu123 ``clusterSizes.mkString(", ")`` converts the array to a string, separating the elements with commas (see the snippet below). What do you think?
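
A quick REPL snippet illustrating the difference:

```scala
val clusterSizes = Array(3L, 5L, 2L)

clusterSizes.toString       // something like "[J@1b6d3586": just an object reference
clusterSizes.mkString(", ") // "3, 5, 2": readable in a log line
```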


---




[GitHub] spark issue #21204: [SPARK-24132][ML] Instrumentation improvement for classi...

2018-05-03 Thread ludatabricks
Github user ludatabricks commented on the issue:

https://github.com/apache/spark/pull/21204
  
LGTM. Retest this please.


---




[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...

2018-05-03 Thread ludatabricks
Github user ludatabricks commented on the issue:

https://github.com/apache/spark/pull/13493
  
LGTM. Retest this please.


---




[GitHub] spark pull request #21218: [SPARK-24155][ML] Instrumentation improvements for clu...

2018-05-02 Thread ludatabricks
GitHub user ludatabricks opened a pull request:

https://github.com/apache/spark/pull/21218

[SPARK-24155][ML] Instrumentation improvements for clustering

## What changes were proposed in this pull request?

Improve the instrumentation for all of the clustering methods.

## How was this patch tested?

N/A



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-23686-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21218.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21218


commit 1810dbe8204981986fa2598b5b022a5c74e43732
Author: Lu WANG <lu.wang@...>
Date:   2018-05-02T04:11:40Z

instrumentation improvement for clustering

commit 07c20a45737f9a4f008eef6da717034670427483
Author: Lu WANG <lu.wang@...>
Date:   2018-05-02T21:18:16Z

add more info for instrument




---




[GitHub] spark pull request #21204: [SPARK-24132][ML] Expand instrumentation for class...

2018-05-01 Thread ludatabricks
GitHub user ludatabricks opened a pull request:

https://github.com/apache/spark/pull/21204

[SPARK-24132][ML] Expand instrumentation for classification

## What changes were proposed in this pull request?

- Add OptionalInstrumentation as an argument to getNumClasses in ml.classification.Classifier

- Change the getNumClasses call in train() in ml.classification.DecisionTreeClassifier, ml.classification.RandomForestClassifier, and ml.classification.NaiveBayes

- Modify the instrumentation creation in ml.classification.LinearSVC

- Change the log calls in ml.classification.OneVsRest and ml.classification.LinearSVC (a toy sketch of the pattern follows this list)
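
A minimal, self-contained sketch of the optional-instrumentation pattern described above. The names below (OptionalLogger, NoopLogger, and the toy getNumClasses body) are illustrative stand-ins, not Spark's internal OptionalInstrumentation API:

```scala
trait OptionalLogger {
  def logNamedValue(name: String, value: String): Unit
}

// A no-op default lets callers omit instrumentation entirely.
object NoopLogger extends OptionalLogger {
  def logNamedValue(name: String, value: String): Unit = ()
}

def getNumClasses(labels: Seq[Double], instr: OptionalLogger = NoopLogger): Int = {
  val numClasses = labels.distinct.size
  instr.logNamedValue("numClasses", numClasses.toString)
  numClasses
}

getNumClasses(Seq(0.0, 1.0, 1.0, 2.0)) // 3, logged only if a real logger is passed
```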

## How was this patch tested?

Manual.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-23686

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21204.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21204


commit 7b75ed6aaa21faf68f5f2db04eaf550f2d468542
Author: Lu WANG <lu.wang@...>
Date:   2018-05-01T06:13:36Z

Expand instrumentation for classification




---




[GitHub] spark pull request #21195: [SPARK-23975][ML] Add support of array input for ...

2018-04-30 Thread ludatabricks
GitHub user ludatabricks opened a pull request:

https://github.com/apache/spark/pull/21195

[SPARK-23975][ML] Add support of array input for all clustering methods

## What changes were proposed in this pull request?

Add array-input support for all of the clustering methods.

## How was this patch tested?

unit tests added



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-23975-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21195.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21195


commit 31226b4b8e5aa5fc016f61ec86c42683c452a696
Author: Lu WANG <lu.wang@...>
Date:   2018-04-26T17:46:49Z

add Array input support for BisectingKMeans

commit 45e6e96e974607ed0526401d0fdbb4f1c8161dd6
Author: Lu WANG <lu.wang@...>
Date:   2018-04-30T17:14:41Z

add support of array input for all clustering methods




---




[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...

2018-04-27 Thread ludatabricks
GitHub user ludatabricks opened a pull request:

https://github.com/apache/spark/pull/21183

[SPARK-22210][ML] Add seed for LDA variationalTopicInference

## What changes were proposed in this pull request?

- Add a seed parameter for variationalTopicInference

- Pass the seed when calling variationalTopicInference in submitMiniBatch

- Add a seed var in LDAModel so that it can take the seed from LDA and pass it to the variationalTopicInference calls in logLikelihoodBound, topicDistributions, getTopicDistributionMethod, and topicDistribution. (A sketch of the repeatability this enables follows this list.)
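
A hedged sketch of the repeatability check this enables; the toy DataFrame and parameter values below are made up for illustration:

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[2]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0L, Vectors.dense(1.0, 0.0, 3.0)),
  (1L, Vectors.dense(2.0, 1.0, 0.0)),
  (2L, Vectors.dense(0.0, 2.0, 1.0))
)).toDF("id", "features")

val lda = new LDA().setK(2).setMaxIter(2).setOptimizer("online").setSeed(42)

// With the seed threaded through variationalTopicInference, two fits with
// the same seed should agree.
val m1 = lda.fit(df)
val m2 = lda.fit(df)
assert(m1.logLikelihood(df) == m2.logLikelihood(df))
```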


## How was this patch tested?

Checked the test results in mllib.clustering.LDASuite to make sure the results are repeatable with a fixed seed.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-22210

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21183.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21183


commit e647e85e21c63546611c50e45bc57d232b0cbe83
Author: Lu WANG <lu.wang@...>
Date:   2018-04-27T16:46:45Z

Add seed for LDA variationalTopicInference




---




[GitHub] spark issue #13493: [SPARK-15750][MLLib][PYSPARK] Constructing FPGrowth fail...

2018-04-24 Thread ludatabricks
Github user ludatabricks commented on the issue:

https://github.com/apache/spark/pull/13493
  
The bug is confirmed. The fix looks reasonable to me. Ping @jkbradley.


---




[GitHub] spark pull request #21081: [SPARK-23975][ML] Allow Clustering to take Arrays ...

2018-04-16 Thread ludatabricks
GitHub user ludatabricks opened a pull request:

https://github.com/apache/spark/pull/21081

[SPARK-23975][ML] Allow Clustering to take Arrays of Double as input features

## What changes were proposed in this pull request?

- Multiple possible input types are now accepted in validateAndTransformSchema() and computeCost() when checking the column type

- Add an if statement in transform() to support the array type as featuresCol

- Add a case statement in fit() when selecting columns from the dataset

These changes are applied to KMeans first, then to the other clustering methods (a hedged sketch of the column handling follows).
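
A hedged sketch of the column handling described above. The helper names (validateFeaturesType, arrayToVector, featuresColumn) are illustrative rather than the PR's actual code, and float arrays are omitted for brevity:

```scala
import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

// Schema check: accept a Vector column or an array<double> column.
def validateFeaturesType(schema: StructType, featuresCol: String): Unit = {
  val ok = schema(featuresCol).dataType match {
    case _: VectorUDT => true
    case ArrayType(DoubleType, _) => true // float arrays omitted in this sketch
    case _ => false
  }
  require(ok, s"Column $featuresCol must be a Vector or an array of doubles.")
}

// In fit()/transform(): convert an array column to vectors before training.
val arrayToVector = udf((a: Seq[Double]) => Vectors.dense(a.toArray))

def featuresColumn(schema: StructType, featuresCol: String): Column =
  schema(featuresCol).dataType match {
    case _: VectorUDT => col(featuresCol)
    case _            => arrayToVector(col(featuresCol))
  }
```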

## How was this patch tested?

A unit test is added.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-23975

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21081.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21081


commit ed890d35ff1e9edbe2a557f68732835b3e911906
Author: Lu WANG <lu.wang@...>
Date:   2018-04-16T17:32:02Z

add Array input support for KMeans

commit badb0cc5ca6ca69bb8e8fc0fce5ea05a4100bca0
Author: Lu WANG <lu.wang@...>
Date:   2018-04-16T17:49:00Z

remove redundant code




---




[GitHub] spark pull request #21044: Add RawPrediction, numClasses, and numFeatures fo...

2018-04-11 Thread ludatabricks
GitHub user ludatabricks opened a pull request:

https://github.com/apache/spark/pull/21044

Add RawPrediction, numClasses, and numFeatures for OneVsRestModel

Add rawPrediction as an output column; add numClasses and numFeatures to OneVsRestModel.

## What changes were proposed in this pull request?

- Add two vals, numClasses and numFeatures, in OneVsRestModel so that we can inherit from Classifier in the future

- Add a rawPrediction output column in transform; the prediction label is calculated from the rawPrediction, as in raw2prediction (a small sketch follows this list)
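
A small sketch of that raw2prediction step, i.e. taking the argmax over the per-class raw scores:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// The prediction label is the index of the largest raw score.
def raw2prediction(rawPrediction: Vector): Double = rawPrediction.argmax.toDouble

raw2prediction(Vectors.dense(0.1, 2.3, 0.7)) // 1.0
```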

 

## How was this patch tested?




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-9312

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21044.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21044


commit 0cfc20a3637c06071e6fe48ca5db4834b34c889e
Author: Lu WANG <lu.wang@...>
Date:   2018-04-11T19:08:22Z

add rawPrediction as an output column;
add numClasses and numFeatures to OneVsRestModel




---




[GitHub] spark pull request #21015: [SPARK-23944][ML] Add the set method for the two ...

2018-04-09 Thread ludatabricks
GitHub user ludatabricks opened a pull request:

https://github.com/apache/spark/pull/21015

[SPARK-23944][ML] Add the set method for the two LSHModel

## What changes were proposed in this pull request?

Add two set methods for the LSH models in LSH.scala, BucketedRandomProjectionLSH.scala, and MinHashLSH.scala (a hedged usage sketch follows).
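
A hedged usage sketch, assuming (from the PR title, not verified here) that the new setters are setInputCol and setOutputCol on the fitted models:

```scala
import org.apache.spark.ml.feature.MinHashLSH

// Fit as usual; `df` is assumed to have a sparse "features" vector column.
val model = new MinHashLSH()
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")
  .fit(df)

// With the new setters, a loaded or copied model can be re-pointed at
// differently named columns without refitting:
model.setInputCol("keys").setOutputCol("hashValues")
```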

## How was this patch tested?

New tests for the param setup were added to:

- BucketedRandomProjectionLSHSuite.scala

- MinHashLSHSuite.scala



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-23944

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21015.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21015


commit 9f16ea6f15572a8d189fe537844487abdea797b4
Author: Lu WANG <lu.wang@...>
Date:   2018-04-09T21:56:48Z

Add the set method for two LSHModels




---
