Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19229#discussion_r140689908
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -223,20 +223,18 @@ class ImputerModel private[ml
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19337
+1 for updating the ML API.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail
Github user WeichenXu123 closed the pull request at:
https://github.com/apache/spark/pull/19350
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18924
LGTM. Thanks! ping @jkbradley
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r141356070
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -261,17 +290,40 @@ class CrossValidatorModel private[ml
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r141357565
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -276,12 +315,32 @@ object TrainValidationSplitModel
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19020
I also vote to combine them into one estimator; here are my two cents:
1. Regression with Huber loss is one kind of linear regression. It makes
sense to switch between different loss
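The loss being discussed can be sketched outside Spark. This is a minimal, hedged illustration of the Huber loss itself (the `delta` threshold is an illustrative parameter, not Spark's API): quadratic near zero like squared loss, linear in the tails, which is why it can plausibly share one estimator with ordinary linear regression.

```python
def huber_loss(residual: float, delta: float = 1.35) -> float:
    """Huber loss: 0.5*r^2 for |r| <= delta, else delta*(|r| - 0.5*delta)."""
    r = abs(residual)
    if r <= delta:
        # quadratic region: behaves like squared loss
        return 0.5 * r * r
    # linear region: grows linearly, so outliers are penalized less harshly
    return delta * (r - 0.5 * delta)

print(huber_loss(0.5))   # quadratic region, equals 0.5 * 0.5**2
print(huber_loss(10.0))  # linear region
```

At `|r| = delta` the two branches agree, so the loss is continuous and differentiable, which matters for the gradient-based solvers linear regression already uses.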
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17373
Jenkins, test this please.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/15435
Jenkins, test this please.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17849
What do you think about this ? @jkbradley
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18924
Thanks! I will take a look later.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18538#discussion_r134449164
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala
---
@@ -0,0 +1,91 @@
+/*
+ * Licensed
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/15435
Jenkins, test this please.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19018
ping @felixcheung We can make all R tests for trees deterministic (not only
random forests), and leave other problems to a separate PR. It would be great
to fix this soon, thanks!
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19065
[SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure
consistent output columns
## What changes were proposed in this pull request?
Add test for prediction using
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19065
Jenkins, test this please.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17014#discussion_r135534873
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -85,6 +86,10 @@ abstract class Predictor[
M <: PredictionMo
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19018
@felixcheung In the Jenkins log I only found the random forest and decision
tree tests failing; random forest failed more frequently. Thanks!
Github user WeichenXu123 closed the pull request at:
https://github.com/apache/spark/pull/19026
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19072
[SPARK-17133][ML][FOLLOW-UP] Add convenient method `asBinary` for casting
to BinaryLogisticRegressionSummary
## What changes were proposed in this pull request?
add an "asB
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19018
@felixcheung I think this error occurs in the OneHotEncoder inside the
RFormula. After searching the project, only OneHotEncoder prints this error
message...
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/15435
Jenkins test this please
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19029#discussion_r135186430
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -438,6 +438,10 @@ private[ml] object SummaryBuilderImpl extends Logging
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/15435
Jenkins test this please
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18992#discussion_r134109929
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/optim/loss/DifferentiableRegularization.scala
---
@@ -57,6 +61,11 @@ private[ml] class
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19078
cc @jkbradley
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19078
[SPARK-21862] Add overflow check in PCA
## What changes were proposed in this pull request?
Add an overflow check in PCA; otherwise it is possible to throw
`NegativeArraySizeException
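A rough, stdlib-only illustration of why PCA can hit this exception (an assumption about the failure mode, not the actual Spark patch): a covariance-based PCA allocates a packed upper-triangular Gramian of `n*(n+1)/2` entries, and for large `n` that product overflows a 32-bit Java int, yielding a negative array size. The function names below are hypothetical.

```python
JAVA_INT_MAX = 2 ** 31 - 1  # Scala/Java Int.MaxValue

def packed_gramian_size(num_features: int) -> int:
    # number of entries in the packed upper triangle of an n x n matrix
    return num_features * (num_features + 1) // 2

def check_pca_feasible(num_features: int) -> None:
    size = packed_gramian_size(num_features)
    if size > JAVA_INT_MAX:
        # in JVM code the unchecked Int multiplication would wrap negative,
        # and `new Array[Double](negative)` throws NegativeArraySizeException
        raise ValueError(
            f"packed Gramian needs {size} entries for {num_features} "
            f"features, exceeding Int.MaxValue")

check_pca_feasible(1000)  # fine: 500500 entries
```

Python integers never overflow, so the check must compare against the JVM limit explicitly; in Scala the guard would sit before the allocation.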
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19078#discussion_r135751225
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/PCA.scala ---
@@ -44,6 +44,13 @@ class PCA @Since("1.4.0") (@Since("1.4
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17014#discussion_r135695930
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Predictor.scala ---
@@ -85,6 +86,10 @@ abstract class Predictor[
M <: PredictionMo
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19072
Jenkins, test this please.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17014
I have thought about this double-cache issue for a few days. One big problem
is that it is hard to get precise storage-level info. For example, we may add
a `map` transform on a cached dataset
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19122#discussion_r136850665
--- Diff: python/pyspark/ml/tuning.py ---
@@ -255,18 +257,23 @@ def _fit(self, dataset):
randCol = self.uid + "_rand"
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19108
cc @yanboliang Thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18902
Sure. I will create a JIRA after this perf gap is confirmed.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18902
Hmm... that's interesting. So I found a performance gap between DataFrame
codegen aggregation and the simple RDD aggregation. I will discuss this with
the SQL team later. Thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17014
@zhengruifeng `KMeans` is regarded as a bugfix (SPARK-21799) because the
double-cache issue was introduced in 2.2 and causes a perf regression.
Other algos also have the same issue, but the issue
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19110
Jenkins, test this please.
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r136719383
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -100,31 +113,53 @@ class CrossValidator @Since("1.2.0"
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r136719561
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -87,37 +91,63 @@ class TrainValidationSplit @Since("
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r136719485
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -100,31 +113,53 @@ class CrossValidator @Since("1.2.0"
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19122
[SPARK-21911][ML][PySpark] Parallel Model Evaluation for ML Tuning in
PySpark
## What changes were proposed in this pull request?
Add parallelism support for ML tuning in pyspark
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19106
[SPARK-21770][ML] ProbabilisticClassificationModel fix corner case:
normalization of all-zero raw predictions
## What changes were proposed in this pull request
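The corner case this PR targets can be sketched in plain Python. This is my reading of the problem, not the exact Spark fix: converting a raw-prediction vector to probabilities by dividing by its sum divides by zero when every raw value is zero, and one reasonable fallback is a uniform distribution.

```python
def normalize_to_probability(raw: list[float]) -> list[float]:
    """Normalize raw prediction scores into a probability vector."""
    total = sum(raw)
    if total == 0.0:
        # all-zero raw predictions: no class is preferred, fall back to uniform
        return [1.0 / len(raw)] * len(raw)
    return [v / total for v in raw]

print(normalize_to_probability([2.0, 1.0, 1.0]))  # [0.5, 0.25, 0.25]
print(normalize_to_probability([0.0, 0.0]))       # [0.5, 0.5]
```

Without the zero-sum branch the second call would raise `ZeroDivisionError`; in JVM code the analogous division produces NaN probabilities instead of an exception, which is harder to spot downstream.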
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17014
@zhengruifeng @jkbradley I created PR #19107 as a quick fix for the `KMeans`
perf regression bug.
This PR can continue to work on adding the `handlePersistence` Param, which
is not so urgent
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19107
[SPARK-21799][ML] Fix `KMeans` performance regression caused by
double-caching
## What changes were proposed in this pull request?
Fix `KMeans` performance regression caused
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19107
cc @jkbradley @smurching
This should be merged and backported to 2.2 ASAP!
Other improvements (e.g. adding the `handlePersistence` param) can be left in
PR #17014
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/16864
@Bcpoole Thanks for this PR. But I want to ask: where in Spark can this
extension apply? E.g., can this algo be used in join cost estimation or
somewhere else
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19108
[SPARK-21898][ML] Feature parity for KolmogorovSmirnovTest in MLlib
## What changes were proposed in this pull request?
Feature parity for KolmogorovSmirnovTest in MLlib
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19106#discussion_r136696592
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
---
@@ -245,6 +245,13 @@ private[ml] object
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r136071530
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/HuberAggregatorSuite.scala
---
@@ -0,0 +1,170 @@
+/*
+ * Licensed
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r136067839
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala
---
@@ -0,0 +1,141 @@
+/*
+ * Licensed
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r136069679
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala
---
@@ -0,0 +1,141 @@
+/*
+ * Licensed
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r136072548
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
---
@@ -146,6 +161,8 @@ class LinearRegressionSuite
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19065#discussion_r135782045
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/classification/ProbabilisticClassifierSuite.scala
---
@@ -91,4 +94,54 @@ object
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19065
@smurching Code updated, thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19018
cc @felixcheung
I encountered the R tests failing again even with this seed added.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81350/console
error:
```
Failed
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19110
[SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both
Scala mllib and Pyspark
## What changes were proposed in this pull request?
Added tunable parallelism
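The idea behind "tunable parallelism" for one-vs-rest can be sketched with the standard library (a toy stand-in, not the PR's Scala/PySpark code; `fit_binary` is a hypothetical placeholder for training one binary classifier): each class defines an independent binary problem, so the fits can run on a thread pool whose size is the parallelism parameter.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_binary(label: int) -> str:
    # stand-in for fitting one binary "class vs rest" model
    return f"model_for_class_{label}"

def fit_one_vs_rest(num_classes: int, parallelism: int = 2) -> list[str]:
    # parallelism bounds how many binary fits run concurrently;
    # pool.map preserves class order regardless of completion order
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(fit_binary, range(num_classes)))

print(fit_one_vs_rest(3))
# ['model_for_class_0', 'model_for_class_1', 'model_for_class_2']
```

A thread pool (rather than processes) matches the Spark driver pattern: each "fit" mostly blocks waiting on cluster jobs, so threads overlap the waits without extra serialization cost.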
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18281
I took over this PR in #19110 because the original author is busy but we
need to merge this PR soon.
Thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18538
Jenkins, test this please.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19111
I found `NaiveBayes` can also fail. Found here: #18538. Hope this works!
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81316/console
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18902
+1 for using the DataFrame-based version of the code.
@zhengruifeng One thing I want to confirm: I checked your testing code;
both the RDD-based version and the DataFrame-based version code
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r136243309
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -120,6 +120,33 @@ class CrossValidatorSuite
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/16774#discussion_r136482755
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -120,6 +120,33 @@ class CrossValidatorSuite
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17014
@smurching Yes, this should be added as an `ml.Param`; we should not add it
as an argument.
@zhengruifeng Would you mind updating the PR according to our discussion
above?
Make
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18538#discussion_r136536168
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/evaluation/ClusteringEvaluatorSuite.scala
---
@@ -0,0 +1,91 @@
+/*
+ * Licensed
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18538#discussion_r136532646
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
---
@@ -0,0 +1,395 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17014
cc @zhengruifeng
I updated my comment, you need to check again, thanks!
I read the PR again; it still does not resolve the double-caching issue in
KMeans. In KMeans, your code
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18924
Oh, sorry for that, it should wait for @jkbradley to merge it. Don't worry,
I will contact him!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18924
@akopich LGTM. And do you have time to create a PR to resolve the issue of
the random seed not working, mentioned by @hhbyyh? Thanks
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/15770#discussion_r143426157
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
---
@@ -0,0 +1,216 @@
+/*
+ * Licensed
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19433
@smurching Is it still WIP? If done, remove "[WIP]" and I will begin review,
thanks!
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19106
@srowen Any other comments? Thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r145371903
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r145371704
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r145369694
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/10466
@hhbyyh Do you have time to continue this PR? Thanks!
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19558
[SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionally failing (caused
by nondeterministic test dataset)
## What changes were proposed in this pull request?
Fix NaiveBayes unit
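The general idea of the fix can be illustrated generically (this is not the PR's code; `generate_dataset` is a hypothetical stand-in for the suite's data generator): derive the test dataset from a fixed seed via a local RNG, so every run sees identical data and assertions on the fitted model cannot flake.

```python
import random

def generate_dataset(n: int, seed: int = 42) -> list[float]:
    # a local Random instance is isolated from global RNG state,
    # so other tests cannot perturb this dataset
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

# same seed, same data: the test dataset is now deterministic
assert generate_dataset(100) == generate_dataset(100)
```

The key point is seeding the generator per dataset rather than relying on process-global seeding, which breaks as soon as tests run in a different order.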
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19439
@hhbyyh Thanks for your comments!
> Another option is that to support all bytes[], short[], int[], float[]
and double[] as data storage type candidates, and switch among them accord
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19516
I thought about this: because `ChiSqSelector` only works for categorical
features, marking features without attributes as `NominalAttribute` after
processing is reasonable; the problem
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19516#discussion_r146513000
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19516#discussion_r146531755
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19516#discussion_r146546628
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17972#discussion_r150470524
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
---
@@ -129,7 +129,7 @@ private[recommendation] trait ALSModelParams
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19588#discussion_r150445283
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
@@ -37,7 +38,25 @@ import org.apache.spark.sql.types.{StructField
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19525#discussion_r150486465
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19525#discussion_r150482756
--- Diff:
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -476,6 +476,10 @@ class DenseMatrix @Since("
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19753
@smurching The getter/setter is included in the super class
`HasHandleInvalid`. I can add test
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19588
Python API jira created here:
https://issues.apache.org/jira/browse/SPARK-22521
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19621
@viirya @MLnick Thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19588#discussion_r151055221
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
@@ -311,22 +346,39 @@ class VectorIndexerModel private[ml
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19621
I want to ask: for the option `StringIndexer.frequencyDesc`, in the case
where two labels have the same frequency, which of them will be put first?
If this is not specified
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19753
[SPARK-22521][ML] VectorIndexerModel support handle unseen categories via
handleInvalid: Python API
## What changes were proposed in this pull request?
Add python api
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19753#discussion_r151311569
--- Diff: python/pyspark/ml/feature.py ---
@@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol,
HasOutputCol, JavaMLReadable, Ja
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19753#discussion_r151311832
--- Diff: python/pyspark/ml/feature.py ---
@@ -2565,22 +2575,28 @@ class VectorIndexer(JavaEstimator, HasInputCol,
HasOutputCol, JavaMLReadable, Ja
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19621
I checked the failed tests in sparkR. There's some trouble in the failed
`glm` sparkR tests.
These tests compare sparkR glm and R-lib glm results on test data "iris",
b
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19621
Jenkins retest this please.
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19621
@MLnick Will RDD "count by value" aggregation be deterministic? E.g., for 2
RDDs with the same elements but different element order and a different
number of partitions, will `rdd.co
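The determinism question has a simple answer for the counts themselves, which can be shown in plain Python standing in for the RDD operation: counting by value is commutative and associative, so element order and partitioning cannot change the resulting multiset of counts (ties in any subsequent ordering are a separate issue).

```python
from collections import Counter

data_a = ["x", "y", "x", "z"]
data_b = ["z", "x", "y", "x"]  # same elements, different order

# simulate a different partitioning: count per "partition", then merge,
# mirroring how a distributed count-by-value combines partial maps
partitions = [["z"], ["x", "y"], ["x"]]
merged = Counter()
for part in partitions:
    merged.update(part)

assert Counter(data_a) == Counter(data_b) == merged
print(dict(merged))  # {'z': 1, 'x': 2, 'y': 1} up to key order
```

What is *not* deterministic is iteration order over the result map, which is why a tie-breaking rule still matters when counts are turned into an ordering.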
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19621
@MLnick Ah, I didn't express it exactly. In the first case, what I mean is:
sort by frequency, but if frequencies are equal, sort alphabetically
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19621
@viirya @MLnick Code updated. Thanks!
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19758#discussion_r152749515
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/tree/impl/TreeSplitUtilsSuite.scala ---
@@ -0,0 +1,280 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19621
@MLnick How about this way:
for "frequencyAsc/Desc", sort first by frequency and then alphabetically;
for "alphabetAsc/Desc", sort alphabetically (and if
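The proposed tie-break can be sketched as follows (my reading of the comment, not the merged Spark behavior; the function name is illustrative): for a frequency-descending order, sort by descending count with alphabetical order as the secondary key, which makes the index assignment deterministic even when two labels are equally frequent.

```python
from collections import Counter

def frequency_desc_order(labels: list[str]) -> list[str]:
    """Order distinct labels by descending frequency, ties broken alphabetically."""
    freq = Counter(labels)
    # primary key: higher count first (negated); secondary key: the label itself
    return sorted(freq, key=lambda label: (-freq[label], label))

# "a" and "b" both appear twice, so the alphabet decides their order
print(frequency_desc_order(["b", "a", "c", "a", "b", "c", "c"]))
# ['c', 'a', 'b']
```

Because the composite sort key is a total order over distinct labels, the result no longer depends on hash-map iteration order or partitioning, which is exactly what the flaky-ordering discussion above is about.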