Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19381
cc @jkbradley @yanboliang thanks!
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19433#discussion_r144787368
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19433#discussion_r144788264
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19433#discussion_r144790803
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/TrainingInfo.scala ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19433#discussion_r144786233
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/FeatureVector.scala ---
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19433#discussion_r144790384
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/AggUpdateUtils.scala ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19433#discussion_r144789990
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/AggUpdateUtils.scala ---
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19433#discussion_r144789015
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/TrainingInfo.scala ---
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19433
I made a rough pass. I have only a few issues for now; I haven't gone into
the code details yet:
- The `colStoreInit` currently ignores the `subsampleWeights`; it should be
used, isn
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18924
@akopich Yes, you can wait for this to be merged first. I think @jkbradley
will have time to check this next week. Don't
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18924
@akopich LGTM. And do you have time to create a PR to resolve the
random-seed-not-working issue mentioned by @hhbyyh? Thanks
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r143727850
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -208,6 +292,26 @@ class LinearRegression @Since("
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r143726258
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -744,11 +754,20 @@ object LinearRegressionModel extends
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r143727489
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala
---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17968
@gatorsmile Please add this to the whitelist!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17968#discussion_r143702830
--- Diff: python/pyspark/ml/linalg/__init__.py ---
@@ -976,14 +976,18 @@ def __getitem__(self, indices):
return self.values[i + j
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17968#discussion_r143702719
--- Diff: python/pyspark/mllib/linalg/__init__.py ---
@@ -1131,14 +1131,17 @@ def __getitem__(self, indices):
return self.values[i + j
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17968#discussion_r143656736
--- Diff: python/pyspark/ml/linalg/__init__.py ---
@@ -976,14 +976,20 @@ def __getitem__(self, indices):
return self.values[i + j
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17968#discussion_r143656820
--- Diff: python/pyspark/mllib/linalg/__init__.py ---
@@ -1131,14 +1131,20 @@ def __getitem__(self, indices):
return self.values[i + j
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19438#discussion_r143492975
--- Diff:
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala
---
@@ -58,7 +58,7 @@ class
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17968
@gglanzani And the `ml.linalg.DenseMatrix` looks like it has the same bug.
Can you also update it?
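The indexing bug under discussion can be illustrated with a toy column-major matrix. This is a simplified, hypothetical stand-in class for illustration only, not the actual pyspark `DenseMatrix` code:

```python
# Toy stand-in for a column-major ("Fortran-order") dense matrix, the
# default layout of pyspark's DenseMatrix. Hypothetical class, for
# illustration only.
class ColumnMajorMatrix:
    def __init__(self, num_rows, num_cols, values):
        assert len(values) == num_rows * num_cols
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.values = values

    def __getitem__(self, indices):
        i, j = indices
        if not (0 <= i < self.num_rows and 0 <= j < self.num_cols):
            raise IndexError("matrix index out of range")
        # Column-major: element (i, j) lives at flat offset i + j * num_rows.
        # Computing the offset with the wrong stride silently returns the
        # wrong element -- the class of bug being fixed in the PR above.
        return self.values[i + j * self.num_rows]

# 2x2 matrix [[1, 3], [2, 4]], stored column by column: [1, 2, 3, 4].
m = ColumnMajorMatrix(2, 2, [1, 2, 3, 4])
```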
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17968
ping @gglanzani This bug needs to be fixed ASAP. Can you update the code
when you're free? Thanks.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17968#discussion_r143465176
--- Diff: python/pyspark/mllib/linalg/__init__.py ---
@@ -1131,14 +1131,21 @@ def __getitem__(self, indices):
return self.values[i + j
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19106
@srowen Any other comments? Thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/15770#discussion_r143426157
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
---
@@ -0,0 +1,216 @@
+/*
+ * Licensed to the
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19433
@smurching Is this still WIP? If it's done, remove "[WIP]" and I will begin
the review, thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18924
Oh, sorry for that; it should wait for @jkbradley to merge it. Don't worry,
I will contact him!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/13493#discussion_r141852693
--- Diff: python/pyspark/mllib/tests.py ---
@@ -1762,6 +1763,18 @@ def test_pca(self):
self.assertEqualUpToSign(pcs.toArray()[:, k
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/13493
ping @zjffdu Looks like a reasonable fix, but please resolve the conflicts! Thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19106#discussion_r141616141
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
---
@@ -230,21 +230,19 @@ private[ml] object
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/8883
I'm taking this over in #19381
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19381
[SPARK-10884][ML] Support prediction on single instance for regression and
classification related models
## What changes were proposed in this pull request?
Support prediction on
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19208
cc @smurching code updated, thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r141391733
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -261,17 +290,40 @@ class CrossValidatorModel private[ml
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19122
ping @jkbradley
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19020
I also vote to combine them into one estimator; here are my two cents:
1. Regression with Huber loss is one kind of linear regression. It makes
sense to switch between different loss
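For reference, the Huber loss mentioned above is quadratic near zero and linear in the tails, which is why it slots naturally into a linear-regression estimator as just another loss choice. A textbook sketch for intuition only — not Spark's implementation, which also jointly estimates a scale parameter:

```python
def huber_loss(residual, epsilon=1.35):
    """Textbook Huber loss: quadratic for |r| <= epsilon, linear beyond.

    Sketch for intuition; Spark's huber objective also estimates a scale
    parameter, so this is not its exact formula.
    """
    r = abs(residual)
    if r <= epsilon:
        return 0.5 * r * r
    # Linear tail, shifted so the two pieces join continuously at epsilon.
    return epsilon * r - 0.5 * epsilon * epsilon
```

Small residuals are penalized like squared loss; large ones grow only linearly, which makes the fit robust to outliers.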
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r141357565
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -276,12 +315,32 @@ object TrainValidationSplitModel
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r141356070
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -261,17 +290,40 @@ class CrossValidatorModel private[ml
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18924
LGTM. Thanks! ping @jkbradley
Github user WeichenXu123 closed the pull request at:
https://github.com/apache/spark/pull/19350
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19337
+1 for updating ML API.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17503
Can you run some benchmarks to show how much improvement this change will
bring?
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19337
`require` is better, I think
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19337
Why not possible? `meanGammaChange` is always positive, so the loop would
become `while (true) {...}`,
and we should use `require` instead of `assert
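The dead-loop concern above can be sketched as a convergence loop with a precondition check. A Python illustration under assumed names, not the actual LDA optimizer code:

```python
def iterate_until_converged(step, initial, tol, max_iter=1000):
    """Iterate `step` until the change drops below `tol`.

    Hypothetical helper illustrating the concern above: if `tol` were
    accidentally <= 0 while the change stays positive, `change <= tol`
    could never hold and the loop would effectively be `while (true)`.
    A precondition check (Scala's `require`) fails fast instead.
    """
    if tol <= 0:
        raise ValueError("tol must be positive, got %r" % tol)
    value = initial
    for _ in range(max_iter):
        new_value = step(value)
        change = abs(new_value - value)
        value = new_value
        if change <= tol:
            return value
    return value
```

The difference from `assert` is that `require`-style checks validate caller input and stay active in production builds, rather than guarding internal invariants.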
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19208
I will update this PR after #19350 gets merged. We need to address another
issue first. Thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19350
retest this please.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19337
OK. But I'm afraid an accidentally negative value could cause the code to
run into a dead loop. Maybe adding an `assert` is better.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18034#discussion_r141016974
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala ---
@@ -468,7 +469,16 @@ object LocalLDAModel extends Loader
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19350#discussion_r141015509
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -82,5 +85,32 @@ abstract class Estimator[M <: Model[M]] exte
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19350
[SPARK-22126][ML] Fix model-specific optimization support for ML tuning
## What changes were proposed in this pull request?
Push down fitting parallelization code from
CrossValidator
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19229#discussion_r140690182
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -223,20 +223,18 @@ class ImputerModel private[ml
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19229#discussion_r140689908
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -223,20 +223,18 @@ class ImputerModel private[ml
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19229#discussion_r140675967
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -223,20 +223,18 @@ class ImputerModel private[ml
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19317
It would be better to add more perf tests for the `OpenHashSet` replacement
to avoid a perf regression. And I found that `reduceByKeyLocally` also uses
`JHashSet`; I am not sure whether there is some special
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19317
@ConeyLiu Yes, tree aggregation introduces an extra shuffle. But it can
improve perf when the total data size the driver collects from the executors
is large and there are many partitions.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18924
@akopich Follow-up JIRA created here:
https://issues.apache.org/jira/browse/SPARK-22111
Can you create a follow-up PR after this PR is merged
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19318
`aggregateByKeyLocally` hasn't been implemented yet. Please wait until
#19317 is finished, or merge that PR's contents into this one; otherwise this
PR cannot be used. We should not open a PR which c
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19317
Oh, I get your point. This is different from `RDD.aggregate`: it directly
returns a Map and avoids shuffling. It seems useful when numKeys is small.
But, I checked the final `reduce` step, it
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19317
Yes. I guess the perf gain is because this PR uses a local hashmap which can
use unlimited memory, while the current Spark aggregation impl will
automatically spill the local hashmap when it exceeds a threshold
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19317
And I have to point out that your impl has a high risk of causing OOM. The
current impl will automatically spill when the local hashmap is too large and
can take advantage of Spark's automatic memory management
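The trade-off under discussion — a plain local hashmap avoids a shuffle and is fast, but its memory use grows unboundedly with the number of distinct keys, whereas Spark's aggregation spills to disk — can be sketched with plain Python dicts. A hypothetical helper, not Spark's implementation:

```python
def aggregate_by_key_locally(partitions, zero, seq_op, comb_op):
    """Hypothetical sketch of aggregating (key, value) pairs locally.

    Each partition folds into its own plain dict, and the per-partition
    dicts are then merged into one result dict, avoiding a shuffle.
    The dicts grow with the number of distinct keys and never spill,
    which is the OOM risk pointed out above; Spark's own aggregation
    spills to disk under memory pressure.
    """
    result = {}
    for partition in partitions:
        local = {}
        for key, value in partition:
            # Fold each value into the per-partition accumulator.
            local[key] = seq_op(local.get(key, zero), value)
        for key, acc in local.items():
            # Merge partition results, like a driver-side reduce.
            result[key] = comb_op(result[key], acc) if key in result else acc
    return result
```

This is only attractive when numKeys is small enough that all accumulators fit comfortably in memory.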
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
@viirya Yeah, for the perf gap I only focused on `mean`, which can take
advantage of codegen.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
For the performance gap issue (compared with the RDD version), I created a
separate JIRA to track it:
https://issues.apache.org/jira/browse/SPARK-22105
As a result of the discussion with @cloud-fan
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19156
@cloud-fan Can you help review the part of the code related to the SQL
interface?
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19278
@jkbradley Sure, I tested the backwards compatibility. Part of the reason I
changed to `DefaultParamReader.getAndSetParams` is backwards
compatibility
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19122#discussion_r140402700
--- Diff: python/pyspark/ml/tests.py ---
@@ -836,6 +836,27 @@ def test_save_load_simple_estimator(self):
loadedModel
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/13794
cc @srowen Can you help close this? We won't need this feature for now.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19279
cc @srowen Can you help close this?
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19278#discussion_r140398933
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -159,12 +159,15 @@ class CrossValidatorSuite
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19278#discussion_r140398857
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala
---
@@ -160,11 +160,13 @@ class TrainValidationSplitSuite
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r140249325
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r140188363
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/14325
ping @gatorsmile Add this to test.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/14325
You should override `override def evaluate(dataset: Dataset[_])` (without
the label param).
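The advice above is about overriding the base class's exact `evaluate(dataset)` signature rather than a variant with an extra label argument. A plain-Python stand-in (hypothetical classes, not the actual pyspark `Evaluator` API) shows why an extra required parameter breaks the framework's call:

```python
class Evaluator:
    """Stand-in base class; the "framework" only calls evaluate(dataset)."""
    def evaluate(self, dataset):
        raise NotImplementedError

def run_evaluation(evaluator, dataset):
    # Framework code knows only the one-argument signature.
    return evaluator.evaluate(dataset)

class WrongEvaluator(Evaluator):
    # Adding a required label parameter changes the signature, so the
    # framework's one-argument call raises TypeError at runtime.
    def evaluate(self, dataset, label_col):
        return sum(row[label_col] for row in dataset)

class RightEvaluator(Evaluator):
    # Override the exact signature; configuration such as the label
    # column belongs on the instance, not in the call.
    def __init__(self, label_col):
        self.label_col = label_col

    def evaluate(self, dataset):
        return sum(row[self.label_col] for row in dataset)
```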
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19106
ping @srowen Any other comments?
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19288
OK I agree.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r140111453
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,17 +518,15 @@ final class OnlineLDAOptimizer extends
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
@viirya Thanks very much! Although the perf gap exists (when numCols is
large), it won't block this PR. I will create a JIRA to track
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17819
@MLnick Yeah, you're right; only moving `setXXX` to the concrete class also
works fine. The root cause is the `setXXX` return type. But I think the
multi/single logic can be merged, because s
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19208
@smurching I will update this PR after #19278 is merged, because this PR now
depends on that one. Thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19278
@BryanCutler The reason I added `skipParams` is that, if we don't use
`DefaultParamReader.getAndSetParams`, we have to hardcode all the params,
which is very troublesome. And every time we ad
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19278#discussion_r139855087
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -303,16 +304,16 @@ object CrossValidatorModel extends
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19278#discussion_r139854984
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -212,14 +213,12 @@ object CrossValidator extends
MLReadable
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19278#discussion_r139854872
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -122,7 +123,7 @@ class TrainValidationSplit @Since("
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19278#discussion_r139854853
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -283,6 +282,8 @@ object CrossValidatorModel extends
MLReadable
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19106
Jenkins, test this please.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/12066
@yanboliang Are you still working on this?
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19278
cc @BryanCutler Thanks!
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19278
[SPARK-22060][ML] Fix CrossValidator/TrainValidationSplit param
persist/load bug
## What changes were proposed in this pull request?
Currently the param of CrossValidator
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19186
@zhengruifeng You should write a succinct design doc and link it from the
JIRA first. After we reach agreement on the design, this PR can move forward.
Thanks
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
Oh. That's what was done in the old PR #18902. (Because the RDD version is
not in the master branch, only a personal impl; sorry for putting the wrong
link, the code link is here:
https://githu
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17819
Yes, you can just move `setInputCols` into the outer class to resolve this
issue. But I prefer merging them together. I think we can unify the
`transform` method. (First we check the param `inputCol
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17819
@viirya Oh, I am not talking about compatibility with old-version Scala
applications. What I mean concerns the new-version `Bucketizer`: when a Spark
user uses the Java language (not Scala), call
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19208
@smurching Thanks! I will update it later. And note that I will split part
of this PR out into a new PR (the separated part will be a bugfix for #16774
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17819
@viirya Scala's `with trait` is a complex mechanism, and a `trait` isn't
equivalent to Java's `interface`. The Scala compiler will precompile and
generate a lot of other code, so the Java-side c
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/16774
OK. I will split out a separate PR. Thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
@viirya No, keep the DataFrame-version code. I only want to confirm how much
of a performance gap there is between this and the RDD version (for possible
future improvements, because in a similar test
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r139467949
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +462,44 @@ final class OnlineLDAOptimizer extends
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r139470472
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +462,44 @@ final class OnlineLDAOptimizer extends
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
@viirya I ran the code; you're right, most of the time is spent on the
executedPlan generation (the old-version code). Thanks!
But can you append a benchmark comparison with `RDD.aggregate` ve
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19229
That doesn't look like the reason; maybe the issue is somewhere else. Let me
run the test later. Thanks!
But there are some small issues in the test:
Don't include the data-generation time:
```
val
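The benchmarking advice in the last comment — keep data generation outside the timed region — can be sketched as follows (hypothetical helper and workload, not the PR's actual benchmark code):

```python
import time

def benchmark(fn, *args, repeat=3):
    """Return the best wall-clock time of fn(*args) over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Generate test data OUTSIDE the timed region; otherwise the measurement
# mixes generation cost into the number being compared.
data = list(range(100_000))
elapsed = benchmark(sum, data)
```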