Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17819
@viirya I think it is possible. A similar example is the `HasRegParam` trait:
`setRegParam` is not put in the trait but moved into the concrete
estimator/transformer class, presumably for the same reason.
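A rough illustration of the pattern being described (hypothetical class names, sketched in plain Python rather than the Scala traits in question): the shared-param mixin exposes only the getter, while the concrete estimator owns the setter, so chained setter calls return the concrete type.

```python
class HasRegParam:
    """Shared-param mixin: holds the param and its getter, but no setter."""

    def __init__(self):
        self._regParam = 0.0

    def getRegParam(self):
        return self._regParam


class MyEstimator(HasRegParam):
    """Concrete estimator owns the setter, so chaining returns MyEstimator."""

    def setRegParam(self, value):
        self._regParam = value
        return self  # the concrete type, not the mixin
```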
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17819
@viirya Yes. But if there is a better design I will be happy to hear it.
---
-
To unsubscribe, e-mail: reviews-unsubscr
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
Great! That's it. thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
@viirya I guess the reason is that in the old PR version,
`df.withColumn(..).withColumn(..).withColumn(..)`, the long DataFrame chain
prevented shuffle re-use... but now you merge them into one step
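A toy sketch of the idea (plain Python, hypothetical helpers, no Spark): chaining per-column updates builds one intermediate projection per call, whereas computing all new columns in a single projection keeps the plan to one step.

```python
rows = [{"a": 1}, {"a": 2}]

def with_column(rows, name, fn):
    """One new projection per call, like a chained df.withColumn(...)."""
    return [{**r, name: fn(r)} for r in rows]

def with_columns(rows, **fns):
    """All new columns in a single projection, like one select(...)."""
    return [{**r, **{name: fn(r) for name, fn in fns.items()}} for r in rows]

# chained: an intermediate result per call
chained = with_column(with_column(rows, "b", lambda r: r["a"] + 1),
                      "c", lambda r: r["a"] * 2)
# merged: one step, same result
merged = with_columns(rows, b=lambda r: r["a"] + 1, c=lambda r: r["a"] * 2)
```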
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17819
ok to test.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19152
@marktab You should close the merged PR. Thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18748#discussion_r139161851
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
---
@@ -356,6 +371,40 @@ class ALSModel private[ml
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19208
Jenkins, test this please.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19208
@jkbradley I split this PR and removed the code for "dump models to disk", so
the PR will be smaller and easier to review. When this PR is merged, I will
create a follow-up PR for "dump
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19208
oh... sorry for that. I integrated @hhbyyh's old PR into this new one,
because I found the "dump models to disk" and "collect models" code seem to be
cohesive and s
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19204#discussion_r138763970
--- Diff: python/pyspark/ml/evaluation.py ---
@@ -328,6 +329,87 @@ def setParams(self, predictionCol="prediction",
label
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18924
ping @akopich This is a very useful improvement. Can you update the code
while you're at it?
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19156
ping @yanboliang Any other comments?
We need to merge this before the 2.3 release.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19204
Jenkins, test this please.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19186#discussion_r138577518
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
---
@@ -483,24 +488,24 @@ class LogisticRegression @Since
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19214
cc @srowen Thanks!
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19214
[SPARK-21027][MINOR][FOLLOW-UP] add missing since tag
## What changes were proposed in this pull request?
add missing since tag for `setParallelism` in #19110
## How was
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19110#discussion_r138519719
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala ---
@@ -297,6 +298,16 @@ final class OneVsRest @Since("
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19122
@BryanCutler code updated. thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19122#discussion_r138518283
--- Diff: python/pyspark/ml/tuning.py ---
@@ -193,7 +194,8 @@ class CrossValidator(Estimator, ValidatorParams,
MLReadable, MLWritable
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19122#discussion_r138518235
--- Diff: python/pyspark/ml/tuning.py ---
@@ -208,23 +210,23 @@ class CrossValidator(Estimator, ValidatorParams,
MLReadable, MLWritable
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17862
@hhbyyh The test result looks good!
OWLQN takes longer for each iteration because of the line search in each
iteration, which makes more passes over the da
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r138391134
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -150,20 +150,14 @@ private[ml] object ValidatorParams
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r138393318
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -212,14 +238,12 @@ object CrossValidator extends
MLReadable
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r138389265
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -261,17 +290,40 @@ class CrossValidatorModel private[ml
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19208
cc @jkbradley
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18313
@hhbyyh I apologize; your PR is valuable (in the case where the model list
is very big).
But your PR is now stale, so I integrated it into my new PR #19208.
Would you mind taking a
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/16774
@BryanCutler @MLnick I found a bug in this PR: after saving an estimator (CV or
TVS) and then loading it again, the "parallelism" setting is lost. But I fixed
this in #19208
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19208
[SPARK-21087] [ML] CrossValidator, TrainValidationSplit should preserve all
models after fitting: Scala
## What changes were proposed in this pull request?
1. We add a parameter
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19122#discussion_r138249937
--- Diff: python/pyspark/ml/param/_shared_params_code_gen.py ---
@@ -152,6 +152,8 @@ def get$Name(self):
("varianceCol", "
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19107
OK. Thanks @zhengruifeng. I will close this PR.
Github user WeichenXu123 closed the pull request at:
https://github.com/apache/spark/pull/19107
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/9183
@minixalpha Sorry for the delay; I've been too busy recently. But I will try to
finish and commit my new PR once I get time. Thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19110
Thanks @MLnick @BryanCutler. Would you mind helping review another similar
PR, #19122? Some other features we need are blocked on that PR. Thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19172#discussion_r137922397
--- Diff: python/pyspark/ml/tests.py ---
@@ -1655,6 +1655,25 @@ def
test_multinomial_logistic_regression_with_bound(self
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19172#discussion_r137922474
--- Diff: python/pyspark/ml/classification.py ---
@@ -1425,11 +1425,13 @@ class MultilayerPerceptronClassifier(JavaEstimator,
HasFeaturesCol
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19172#discussion_r137922379
--- Diff: python/pyspark/ml/tests.py ---
@@ -1655,6 +1655,25 @@ def
test_multinomial_logistic_regression_with_bound(self
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19172
Jenkins, test this please.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18748#discussion_r137815796
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
---
@@ -356,6 +371,40 @@ class ALSModel private[ml
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/15770#discussion_r137800867
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
---
@@ -0,0 +1,216 @@
+/*
+ * Licensed to the
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/15770#discussion_r137805843
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
---
@@ -0,0 +1,216 @@
+/*
+ * Licensed to the
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19156#discussion_r137740578
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -94,46 +97,86 @@ object Summarizer extends Logging {
* - min
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19156
Thanks @thunterdb code updated.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/15770
@wangmiao1981 Sorry for the delay; I will take a look later, thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19107
cc @smurching Thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17383
@facaiy So can you do a benchmark first (by generating random test data)?
Then we can see how much this speeds things up.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16158#discussion_r137546848
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -85,6 +86,32 @@ private[ml] trait ValidatorParams extends
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16158#discussion_r137545402
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -85,6 +86,32 @@ private[ml] trait ValidatorParams extends
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16158#discussion_r137542479
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -85,6 +86,32 @@ private[ml] trait ValidatorParams extends
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19156
cc @yanboliang @thunterdb Thanks!
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19156
[SPARK-19634][FOLLOW-UP][ML] Improve interface of dataframe vectorized
summarizer
## What changes were proposed in this pull request?
Make several improvements in dataframe
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19122#discussion_r137264588
--- Diff: python/pyspark/ml/tuning.py ---
@@ -255,18 +257,23 @@ def _fit(self, dataset):
randCol = self.uid + "_rand"
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19110
@MLnick Conflict resolved. Thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19122#discussion_r137175343
--- Diff: python/pyspark/ml/tuning.py ---
@@ -255,18 +257,24 @@ def _fit(self, dataset):
randCol = self.uid + "_rand"
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19020
Looks good. cc @jkbradley Thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19122#discussion_r136934638
--- Diff: python/pyspark/ml/tuning.py ---
@@ -255,18 +257,24 @@ def _fit(self, dataset):
randCol = self.uid + "_rand"
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19122#discussion_r136933807
--- Diff: python/pyspark/ml/tuning.py ---
@@ -255,18 +257,23 @@ def _fit(self, dataset):
randCol = self.uid + "_rand"
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/13794
+1 @jkbradley For now it is better to keep the current implementation for
the 4 meta-algorithms in PySpark.
@yinxusen Would you mind closing this PR? I still appreciate your
contribution
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19108
cc @yanboliang Thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19122#discussion_r136850665
--- Diff: python/pyspark/ml/tuning.py ---
@@ -255,18 +257,23 @@ def _fit(self, dataset):
randCol = self.uid + "_rand"
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19122
[SPARK-21911][ML][PySpark] Parallel Model Evaluation for ML Tuning in
PySpark
## What changes were proposed in this pull request?
Add parallelism support for ML tuning in pyspark
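A minimal sketch of the parallel-evaluation idea (plain Python, all names hypothetical; the real PySpark implementation differs): fit one model per param map on a thread pool, keeping results in input order.

```python
from multiprocessing.pool import ThreadPool

def parallel_fit(param_maps, fit_one, parallelism=4):
    """Evaluate fit_one over all param maps with up to `parallelism` threads.

    pool.map preserves input order, so result i corresponds to
    param_maps[i] exactly as in sequential fitting.
    """
    pool = ThreadPool(processes=parallelism)
    try:
        return pool.map(fit_one, param_maps)
    finally:
        pool.close()
```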
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18902
Sure. I will create a JIRA once this perf gap is confirmed.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18902
hmm... that's interesting. So I found a performance gap between DataFrame
codegen aggregation and the simple RDD aggregation. I will discuss this with
the SQL team later. Thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17014
@zhengruifeng The `KMeans` change is regarded as a bugfix (SPARK-21799)
because the double-cache issue was introduced in 2.2 and causes a perf
regression. Other algos also have the same issue, but the issue
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18902
+1 for using the DataFrame-based version of the code.
@zhengruifeng One thing I want to confirm: I checked your testing
code, and both the RDD-based and DataFrame-based versions will
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19111
I found `NaiveBayes` may also fail. Found here: #18538. Hope this
works!
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81316/console
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18538
Jenkins, test this please.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r136719561
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -87,37 +91,63 @@ class TrainValidationSplit @Since("
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r136719485
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -100,31 +113,53 @@ class CrossValidator @Since("1.2.0"
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r136719383
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -100,31 +113,53 @@ class CrossValidator @Since("1.2.0"
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19110
Jenkins, test this please.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19018
cc @felixcheung
I encountered the R test failure again even with this seed added.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81350/console
error:
```
Failed
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18281
I took this PR over in #19110 because the original author is busy but we
need to merge it soon.
Thanks!
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19110
[SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both
Scala mllib and Pyspark
## What changes were proposed in this pull request?
Added tunable parallelism to
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19106#discussion_r136696592
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
---
@@ -245,6 +245,13 @@ private[ml] object
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19108
[SPARK-21898][ML] Feature parity for KolmogorovSmirnovTest in MLlib
## What changes were proposed in this pull request?
Feature parity for KolmogorovSmirnovTest in MLlib
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19107
cc @jkbradley @smurching
This should be merged and backported to 2.2 ASAP!
Other improvements (e.g. adding a `handlePersistence` param) can be left to
PR #17014
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17014
@zhengruifeng @jkbradley I created PR #19107 as a quick fix for the `KMeans`
perf regression bug.
This PR can continue the work of adding a `handlePersistence` Param, which
is not so urgent
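The quick fix being discussed amounts to caching the input only when the caller has not already cached it, and unpersisting only what was cached internally. A rough sketch with a stand-in dataset class (all names hypothetical, no Spark):

```python
class FakeDataset:
    """Stand-in for a DataFrame, exposing just its storage state."""

    def __init__(self, cached=False):
        self.cached = cached
        self.cache_calls = 0

    def cache(self):
        self.cache_calls += 1
        self.cached = True

    def unpersist(self):
        self.cached = False


def fit(dataset, run_iterations):
    # Cache only if the caller has not cached already; this avoids the
    # double-caching that caused the KMeans perf regression.
    handle_persistence = not dataset.cached
    if handle_persistence:
        dataset.cache()
    try:
        return run_iterations(dataset)
    finally:
        if handle_persistence:
            dataset.unpersist()
```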
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19107
[SPARK-21799][ML] Fix `KMeans` performance regression caused by
double-caching
## What changes were proposed in this pull request?
Fix `KMeans` performance regression caused by
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19106
[SPARK-21770][ML] ProbabilisticClassificationModel fix corner case:
normalization of all-zero raw predictions
## What changes were proposed in this pull request
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/16864
@Bcpoole Thanks for this PR. But I want to ask where in Spark this extension
can apply. E.g., can this algo be used in join cost estimation or
somewhere else? But if there is no
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18538#discussion_r136536168
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala
---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18538#discussion_r136532646
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
---
@@ -0,0 +1,395 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r136482755
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -120,6 +120,33 @@ class CrossValidatorSuite
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17014
@smurching Yes, this should be added as an `ml.Param`; we should not add it
as an argument.
@zhengruifeng Would you mind updating the PR according to our discussion
above?
Make
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17014
I have thought about this double-cache issue for a few days. One big problem is
that it is hard to get precise storage-level info. For example, we may add a
`map` transform on a cached dataset and then
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r136243309
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -120,6 +120,33 @@ class CrossValidatorSuite
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r136071530
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/HuberAggregatorSuite.scala
---
@@ -0,0 +1,170 @@
+/*
+ * Licensed to the
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r136072548
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
---
@@ -146,6 +161,8 @@ class LinearRegressionSuite
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r136067839
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala
---
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r136069679
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala
---
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19078#discussion_r136032375
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -44,6 +44,13 @@ class PCA @Since("1.4.0") (@Since("1.4
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17862
+1 for adding tests on large-scale datasets.
Another thing I would like to see: can you compare the final loss value
of the result coefficients between LIBLINEAR (scikit-learn), LBFGS
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17014
@zhengruifeng OK, so the `KMeans` part of this PR still works. No change
needed, I think.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17014
cc @zhengruifeng
I updated my comment; please check again, thanks!
I read the PR again, and it still does not resolve the double-caching issue in `KMeans`.
In `KMeans`, your code
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19065
@smurching Code updated, thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19065#discussion_r135782045
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala
---
@@ -91,4 +94,54 @@ object
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19078
cc @jkbradley
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19078#discussion_r135751225
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -44,6 +44,13 @@ class PCA @Since("1.4.0") (@Since("1.4
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19078
[SPARK-21862] Add overflow check in PCA
## What changes were proposed in this pull request?
add overflow check in PCA, otherwise it is possible to throw
`NegativeArraySizeException
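The overflow arises because the Gramian of n features needs n*(n+1)/2 entries, which for large n exceeds the JVM's maximum array length; the 32-bit size computation wraps negative, hence the reported `NegativeArraySizeException`. A sketch of the kind of up-front guard (hypothetical; the actual check added in the PR may differ):

```python
MAX_JVM_ARRAY = 2**31 - 1  # upper bound on a JVM array length

def check_gramian_size(num_features):
    """The upper-triangular Gramian of n features holds n*(n+1)/2 entries;
    reject inputs whose Gramian could not fit in a single JVM array."""
    size = num_features * (num_features + 1) // 2
    if size > MAX_JVM_ARRAY:
        raise ValueError(
            "PCA on %d features needs a Gramian of %d entries, exceeding "
            "the max array length %d" % (num_features, size, MAX_JVM_ARRAY))
    return size
```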