Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19588
Sure, I will add the Python API after this is merged.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19588#discussion_r150771136
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
@@ -311,22 +342,39 @@ class VectorIndexerModel private[ml
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19588#discussion_r150761099
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
@@ -311,22 +346,39 @@ class VectorIndexerModel private[ml
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19588#discussion_r150760733
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
@@ -311,22 +346,39 @@ class VectorIndexerModel private[ml
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19208
Jenkins, test this please.
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19208
I manually tested backwards compatibility and it works fine. I pasted the
test code for `CrossValidator` here.
Run the following code in a spark-2.2 shell first:
```
import
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17972
Have you checked other algorithms to which this parameter could also apply?
---
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r150731403
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -177,7 +202,9 @@ class TrainValidationSplit @Since("
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19666
OK. I will wait for @smurching's split-out parts of #19433 to get merged
first, and then I will update this PR.
---
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19525#discussion_r150482756
--- Diff:
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -476,6 +476,10 @@ class DenseMatrix @Since("
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19525#discussion_r150486465
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17972#discussion_r150470524
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
---
@@ -129,7 +129,7 @@ private[recommendation] trait ALSModelParams
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19588#discussion_r150445283
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
@@ -37,7 +38,25 @@ import org.apache.spark.sql.types.{StructField
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/18624
But I agree with the issue @MLnick mentioned: the code now looks convoluted.
Can you try to simplify it?
---
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/18624#discussion_r150170451
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
---
@@ -286,40 +288,119 @@ object
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/15770
LGTM. ping @yanboliang
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19666
@facaiy Your idea also looks reasonable. So we can use the condition
"exclude the first bin" to do the pruning (filter out the other half of the
symmetric splits). This condition looks si
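The pruning condition discussed here can be sketched outside Spark as follows; this is a hypothetical illustration of the idea, not the code from the PR. Fixing the first category to always land on the complement side generates each unordered split {S, complement} exactly once:

```python
from itertools import combinations

def unordered_splits(categories):
    """Yield the left side of each unordered split exactly once.

    Only subsets that exclude the first category are generated, so a
    split and its mirror image (its complement) are never both produced.
    """
    rest = categories[1:]
    for size in range(1, len(categories)):
        for subset in combinations(rest, size):
            yield set(subset)  # the complement implicitly holds categories[0]

splits = list(unordered_splits([0, 1, 2, 3]))
print(len(splits))  # 2^(k-1) - 1 = 7 distinct splits for k = 4 categories
```

For k categories this enumerates 2^(k-1) - 1 candidate splits instead of 2^k - 2 half-redundant ones.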
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19156#discussion_r149956415
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -527,27 +570,28 @@ private[ml] object SummaryBuilderImpl extends
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19156#discussion_r149941345
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -94,46 +98,87 @@ object Summarizer extends Logging {
* - min
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19156#discussion_r149893125
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -94,46 +97,86 @@ object Summarizer extends Logging {
* - min
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19156#discussion_r149855295
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -197,14 +240,14 @@ private[ml] object SummaryBuilderImpl extends
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19156#discussion_r149854985
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
@@ -94,46 +97,86 @@ object Summarizer extends Logging {
* - min
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19662#discussion_r149567769
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala ---
@@ -126,4 +126,25 @@ class VectorAssemblerSuite
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19666#discussion_r149567340
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
@@ -631,6 +614,42 @@ class RandomForestSuite extends
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19565
OK, I agree with this change. @jkbradley, can you take a look?
---
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19666#discussion_r149561550
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -741,17 +678,43 @@ private[spark] object RandomForest extends
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19666
Also cc @smurching Thanks!
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19666
@facaiy Thanks for your review! I added more explanation of the design
purpose of `traverseUnorderedSplits`. But if you have a better solution, don't
hesitate to tell me
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19685
Have you run any tests to check the performance difference for this?
---
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19685#discussion_r149554146
--- Diff: mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
---
@@ -289,9 +289,11 @@ class ALSModel private[ml] (
private
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19662
Looks reasonable. Have you checked other places that have a similar issue?
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19020
LGTM. Thanks!
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19666
@smurching I guess iterating over Gray codes will have a higher time
complexity, O(n * 2^n) (not very sure, maybe there are more efficient
algorithms?), while the recursive traversal in my PR
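For reference, a minimal sketch of the Gray-code iteration under discussion (the standard binary-reflected Gray code, not code from the PR): consecutive codes differ in exactly one bit, so in principle a subset statistic could be updated in O(1) per step, for O(2^n) total work:

```python
def gray_codes(n_bits):
    """Binary-reflected Gray code sequence: consecutive values differ in
    exactly one bit, so an aggregate over the encoded subset can be
    updated incrementally instead of being rebuilt each step."""
    return [i ^ (i >> 1) for i in range(1 << n_bits)]

codes = gray_codes(4)
# every adjacent pair of codes differs in exactly one bit position
assert all(bin(a ^ b).count("1") == 1 for a, b in zip(codes, codes[1:]))
print(len(codes))  # 16 codes for 4 bits
```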
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19666#discussion_r149274660
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -976,6 +930,44 @@ private[spark] object RandomForest extends
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r149269770
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ---
@@ -101,6 +101,20 @@ class TrainValidationSplit @Since("
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/16864
@jiangxb1987 yes I agree to close it.
---
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19666
[SPARK-22451][ML] Reduce decision tree aggregate size for unordered
features from O(2^numCategories) to O(numCategories)
## What changes were proposed in this pull request?
We do not
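The idea in the PR title can be illustrated with a toy sketch (hypothetical data and helper, not the Spark implementation): store sufficient stats once per category, O(numCategories), and derive any unordered split's stats on demand instead of materializing an aggregate entry per subset, O(2^numCategories):

```python
# Toy per-category aggregate: label sums for the 4 categories of one feature.
cat_stats = {0: 5.0, 1: 2.0, 2: 8.0, 3: 1.0}

def split_stat(left_subset):
    """Left-child stat for an unordered split, derived on demand from
    the O(k) per-category aggregate rather than pre-stored per subset."""
    return sum(cat_stats[c] for c in left_subset)

print(split_stat({0, 2}))  # 5.0 + 8.0 = 13.0
```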
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/17000
@MLnick It looks like VF-LBFGS has a different scenario. In VF algorithms, the
vectors are too large to store in driver memory, so we slice the vectors
across different machines (stored by `RDD
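The vector-slicing scheme sketched in this comment can be illustrated in plain Python (a hedged sketch of the general idea with made-up helpers; the real VF work keeps the slices in an RDD across machines):

```python
def slice_vector(vec, slice_size):
    """Split a long vector into (slice_index, chunk) pairs; in the VF
    setting each chunk would live on a different machine via an RDD."""
    return [(i // slice_size, vec[i:i + slice_size])
            for i in range(0, len(vec), slice_size)]

def sliced_dot(slices_a, slices_b):
    # each "worker" computes a partial dot over its slice; only the
    # per-slice scalars are combined, never the full vectors
    return sum(sum(x * y for x, y in zip(a, b))
               for (_, a), (_, b) in zip(slices_a, slices_b))

a = slice_vector([1.0, 2.0, 3.0, 4.0], 2)
b = slice_vector([2.0, 0.0, 1.0, 1.0], 2)
print(sliced_dot(a, b))  # 1*2 + 2*0 + 3*1 + 4*1 = 9.0
```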
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19661
And I don't know whether injecting these class dependencies into the spark-core
lib is reasonable ...
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19661
So why do you include classes such as
`org.apache.spark.ml.feature.Instance`?
If you look into a lot of algorithms in the `ml` package (not `mllib`), many still
use something like `RDD[Instance
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r148926895
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -117,6 +123,12 @@ class CrossValidator @Since("1.2.0"
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19586
And in `ml`, if we want to register classes before running algorithms, some
other classes like `LabeledPoint` and `Instance` also need to be registered.
And there are some classes temporarily defined in
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19586
We can configure the classes to register via the
`spark.kryo.classesToRegister` config; does it need to be added into the Spark code
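For context, registration via configuration (rather than code changes) would look roughly like this; the invocation and class name are illustrative only, since some candidates such as `Instance` are `private[ml]`:

```shell
# Hypothetical spark-submit invocation; the registered class is an example.
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.classesToRegister=org.apache.spark.ml.feature.LabeledPoint \
  my_app.py
```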
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19641
retest this please.
---
GitHub user WeichenXu123 reopened a pull request:
https://github.com/apache/spark/pull/19350
[SPARK-22126][ML] Fix model-specific optimization support for ML tuning
## What changes were proposed in this pull request?
Push down fitting parallelization code from
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19621
Jenkins, test this please.
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19588
@hhbyyh comments addressed. Thanks!
---
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19588#discussion_r148734195
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
@@ -311,22 +342,39 @@ class VectorIndexerModel private[ml
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19208
ping @jkbradley All comments addressed! Please take a look again. Thanks!
---
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19439#discussion_r148706148
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r148701873
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -236,12 +252,17 @@ object CrossValidator extends
MLReadable
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19208#discussion_r148701451
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -117,6 +123,12 @@ class CrossValidator @Since("1.2.0"
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19439#discussion_r148700390
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19439#discussion_r148700189
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
---
@@ -0,0 +1,236 @@
+/*
+ * Licensed to the Apache Software
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19641
[SPARK-21911][ML][PySpark][DOC] Fix doc for parallel ML Tuning in PySpark
## What changes were proposed in this pull request?
Fix doc issue mentioned here:
https://github.com/apache
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19621
@viirya Code updated. Thanks!
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19627
Jenkins, test this please.
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19627
Jenkins, retest this.
---
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19627
[SPARK-21088][ML][WIP] CrossValidator, TrainValidationSplit support collect
all models when fitting: Python API
## What changes were proposed in this pull request?
CrossValidator
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19565
Yes, I think when the dataset is large enough, using the same
`miniBatchFraction`, the resulting RDD sizes of "filter before sample" and "filter
after sample" will be asymptotica
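This claim can be checked with a quick sketch (hypothetical toy data, not the LDA code): the filter and a Bernoulli sample with fraction `miniBatchFraction` are both independent per-element predicates, so applying them in either order keeps the same elements, and the expected batch size is N * p * f either way:

```python
import random

random.seed(7)
N, frac, p_nonempty = 200_000, 0.05, 0.8
nonempty = [random.random() < p_nonempty for _ in range(N)]  # filter predicate
kept = [random.random() < frac for _ in range(N)]            # sample predicate

filter_then_sample = sum(1 for d, k in zip(nonempty, kept) if d and k)
sample_then_filter = sum(1 for k, d in zip(kept, nonempty) if k and d)

assert filter_then_sample == sample_then_filter  # same per-element decisions
print(filter_then_sample, N * p_nonempty * frac)  # observed vs expected size
```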
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19621#discussion_r148174902
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -130,21 +152,33 @@ class StringIndexer @Since("
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/15770
@wangmiao1981 Oh, not a big deal. What I thought is that a user may use the
`graphx` package to get a `Graph[Double, Double]`, but the `ml` package
cannot accept this format; it require
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/15770#discussion_r148047597
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/clustering/PowerIterationClustering.scala
---
@@ -0,0 +1,216 @@
+/*
+ * Licensed to the
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19621
[SPARK-11215][ml] Add multiple columns support to StringIndexer
## What changes were proposed in this pull request?
Add multiple columns support to StringIndexer.
## How was
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19588#discussion_r147542224
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala ---
@@ -311,22 +342,39 @@ class VectorIndexerModel private[ml
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19122
@jkbradley Sure I will!
---
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19588
cc @hhbyyh @MrBago Thanks!
---
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19588
[SPARK-12375][ML] VectorIndexerModel support handle unseen categories via
handleInvalid
## What changes were proposed in this pull request?
Support skip/error/keep strategy, similar
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19433
After discussion and modifications, I approve this PR overall. Ping
@jkbradley, can you take a look now?
---
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19433#discussion_r147317401
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19565
Yes, it indeed changes the probability of samples compared with the current
code.
But according to the comments from @jkbradley in #18924, "in order
to make **corpusSize**, batc
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19565
@akopich IMO the filter won't cost too much; don't worry about the
performance. (Or you can run a test to
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19439#discussion_r147075121
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
---
@@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19565
@akopich If you want to cache the input dataset, create a JIRA to discuss it
first. It's another issue, I think. This JIRA is also related to input caching
issues: https://issues.apache.org
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19433
> We'll actually only have to run an O(n log n) sort on continuous feature
values once (i.e. in the FeatureVector constructor), since once the continuous
features are sorted we can upd
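The quoted idea can be sketched with a toy example (hypothetical data; not the Spark implementation): sort the continuous feature once, then at each split stably partition the sorted order, so every child's slice stays sorted with no further O(n log n) pass:

```python
# One continuous feature value per row; sort the row indices by value ONCE.
values = [0.9, 0.1, 0.5, 0.3, 0.7]
sorted_rows = sorted(range(len(values)), key=lambda r: values[r])

goes_left = [True, False, True, False, True]  # split decision per row

# Stable partition of the sorted order: relative order is preserved,
# so each child's rows are still sorted by feature value.
left = [r for r in sorted_rows if goes_left[r]]
right = [r for r in sorted_rows if not goes_left[r]]

assert [values[r] for r in left] == sorted(values[r] for r in left)
assert [values[r] for r in right] == sorted(values[r] for r in right)
print(left, right)
```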
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19433#discussion_r147036693
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/10466
@hhbyyh OK. I will take this over. Our team needs this feature now.
---
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r146810442
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r146799989
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -497,40 +495,38 @@ final class OnlineLDAOptimizer extends
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19433#discussion_r146735946
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19558
cc @jkbradley @MrBago
---
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19516#discussion_r146546628
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19516#discussion_r146531755
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19516#discussion_r146513000
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -291,9 +291,13 @@ final class ChiSqSelectorModel private[ml
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19516
I thought about this: because `ChiSqSelector` only works for categorical
features, marking features without attributes as `NominalAttribute` after
processing is reasonable; the problem is it
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/10466
@hhbyyh Did you get time to continue this PR? Thanks!
---
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/19558
[SPARK-22332][ML][TEST] Fix NaiveBayes unit test occasionly fail (cause by
test dataset not deterministic)
## What changes were proposed in this pull request?
Fix NaiveBayes unit
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19439
@hhbyyh Thanks for your comments!
> Another option is that to support all bytes[], short[], int[], float[]
and double[] as data storage type candidates, and switch among them accord
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r146167919
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r146167706
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19439#discussion_r146163447
--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala
---
@@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19439#discussion_r146163650
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,122 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19439#discussion_r146164215
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,122 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19439#discussion_r146163724
--- Diff: python/pyspark/ml/image.py ---
@@ -0,0 +1,122 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19527#discussion_r145911522
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala
---
@@ -0,0 +1,464 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/19527#discussion_r145913386
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala
---
@@ -0,0 +1,464 @@
+/*
+ * Licensed to the Apache
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r145371903
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r145369694
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r145371704
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19433
@smurching I found some issues and have some thoughts on the columnar
features format:
- In your doc, you said "Specifically, we only need to store sufficient
stats for each bin