Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r147327448
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
---
@@ -998,6 +1047,198 @@ class LinearRegressionSuite
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r147323528
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -69,25 +70,103 @@ private[regression] trait
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r147316970
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
---
@@ -998,6 +1047,198 @@ class LinearRegressionSuite
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r147327208
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -480,10 +638,14 @@ object LinearRegression extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r147322978
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala
---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r147322642
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala
---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r147319678
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala
---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19020#discussion_r147321479
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HuberAggregator.scala
---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r147226715
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/19565
I'm curious about the performance comparison. If "filter before sample"
triggers a filter over the whole dataset for each `submitMiniBatch`, then
there'll be some performance imp
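The trade-off being discussed can be sketched in plain Scala (collections standing in for an RDD; the data and names here are illustrative, not Spark's actual code): "filter before sample" touches every record on each mini-batch, while sampling first only touches the sampled fraction.

```scala
import scala.util.Random

// Illustrative stand-in for the LDA corpus: (docId, termCounts) pairs,
// where an empty map models a document that the filter would drop.
object MiniBatchSketch {
  val docs: Seq[(Long, Map[Int, Int])] =
    Seq((0L, Map(1 -> 2)), (1L, Map.empty[Int, Int]), (2L, Map(3 -> 1)))

  // "filter before sample": the nonEmpty check touches every record
  // on every call, i.e. once per submitMiniBatch.
  def filterThenSample(fraction: Double, seed: Long): Seq[(Long, Map[Int, Int])] = {
    val rng = new Random(seed)
    docs.filter(_._2.nonEmpty).filter(_ => rng.nextDouble() < fraction)
  }

  // "sample then filter": only the sampled fraction is checked.
  def sampleThenFilter(fraction: Double, seed: Long): Seq[(Long, Map[Int, Int])] = {
    val rng = new Random(seed)
    docs.filter(_ => rng.nextDouble() < fraction).filter(_._2.nonEmpty)
  }
}
```

Both orderings keep only sampled non-empty documents; the difference is that the first adds a full pass over the data per iteration, which is the performance concern raised above.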
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r147207042
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r147020853
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19565#discussion_r147021004
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -446,14 +445,14 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/19565
I wonder if we should add cache() for the LDA training data, even if not for
this feature.
@srowen Not sure where we stand on caching the training data for different
algorithms. Appre
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/19439
@thunterdb @WeichenXu123 Let's keep only Array[Byte] for now.
@WeichenXu123 for the origin column. Surely it may be handy in some
scenarios, but I'm most concerned about the objec
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/10466
Feel free to work on it. I can help review.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r146169405
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r146168734
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") (
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r146168660
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/19439
@thunterdb Thanks for the reply.
> It does, indirectly: this is what the field types CV_32FXX do. You need
to do some low-level casting to convert the byte array to array of numbers,
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/17862
Thanks @WeichenXu123 for the comments.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r146166006
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") (
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r146165706
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -282,8 +348,27 @@ class LinearSVC @Since("2.2.0") (
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17862#discussion_r146165449
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -42,7 +44,26 @@ import org.apache.spark.sql.functions.{col, lit
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19525#discussion_r145331849
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/linalg/JsonMatrixConverter.scala ---
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19525#discussion_r145333064
--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -122,17 +124,33 @@ private[ml] object Param {
/** Decodes a param
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19525#discussion_r145330685
--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/params.scala ---
@@ -122,17 +124,33 @@ private[ml] object Param {
/** Decodes a param
GitHub user hhbyyh opened a pull request:
https://github.com/apache/spark/pull/19525
[SPARK-22289] [ML] Add JSON support for Matrix parameters (LR with
coefficients bound)
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK
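A rough sketch of the kind of encoding such a converter might use, with the matrix flattened to its dimensions plus column-major values; the field names and `type` tag here are assumptions for illustration, not the PR's actual format:

```scala
// Minimal dense-matrix holder; Spark's ml.linalg.DenseMatrix plays this
// role in the actual PR.
case class DenseMat(numRows: Int, numCols: Int, values: Array[Double])

// Hand-rolled JSON encoding: dimensions plus column-major values.
def matrixToJson(m: DenseMat): String =
  s"""{"type":1,"numRows":${m.numRows},"numCols":${m.numCols},""" +
    s""""values":[${m.values.mkString(",")}]}"""
```

Representing the matrix as a JSON object like this lets it round-trip through the existing string-based Param persistence, which is what a coefficients-bound Matrix param needs.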
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/17862
Please let me know if there's any unresolved comments. Thanks.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19337#discussion_r143818776
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,24 @@ private[clustering] trait LDAParams extends Params
with
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143112965
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143081342
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143080051
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143080481
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143080675
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -503,21 +533,22 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143077626
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143068229
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143067455
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143056727
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143055573
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143058244
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r143057944
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,31 +463,60 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19337#discussion_r142854372
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -322,6 +326,13 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19337#discussion_r142853109
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params
with
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/19337#discussion_r142853643
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params
with
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142833499
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142833374
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18924
Yes, I think a local test is enough for both correctness and performance.
For consistency with the old LDA, some manual local testing would be
sufficient. You may well just use the LDA example
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142831316
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142571627
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142572013
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142574222
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142574453
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142571342
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142571728
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142571603
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18924#discussion_r142571685
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -462,36 +462,55 @@ final class OnlineLDAOptimizer extends
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/19208
It's OK with me to include the "dump model to disk" change
https://github.com/apache/spark/pull/18313 in this or another PR (or not).
After reading the discussion, I feel it's an o
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18313
That's all right. Please just proceed.
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/17862
Tested with several larger data sets using the hinge loss function, to
compare the l-bfgs and owlqn solvers.
Ran until convergence or until maxIter (2000) was exceeded.
dataset | numRecords | numFeatures | l
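For reference, the loss underlying this comparison: for a label y in {-1, +1} and margin m = y·(w·x), the hinge loss is max(0, 1 - m). A minimal Scala sketch over plain arrays (not Spark's actual aggregator):

```scala
// Hinge loss for a single example: zero once the example is classified
// with margin >= 1, linear in the margin violation otherwise.
def hingeLoss(w: Array[Double], x: Array[Double], y: Double): Double = {
  val margin = y * w.zip(x).map { case (wi, xi) => wi * xi }.sum
  math.max(0.0, 1.0 - margin)
}
```

Because the hinge loss is not differentiable at margin = 1, solver behavior can vary between datasets, which is consistent with comparing the two solvers empirically rather than picking one a priori.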
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/16158
Update:
To support pipeline estimators, changed the tuning summary column name to
include the full param reference:
![image](https://user-images.githubusercontent.com/7981698/30287417
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/16158#discussion_r138133273
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -85,6 +86,32 @@ private[ml] trait ValidatorParams extends HasSeed
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/16158#discussion_r138133238
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala ---
@@ -85,6 +86,32 @@ private[ml] trait ValidatorParams extends HasSeed
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17461#discussion_r135430463
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -32,10 +34,7 @@ import org.apache.spark.ml.param._
import
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17461#discussion_r135430545
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -180,6 +179,29 @@ private[clustering] trait LDAParams extends Params
with
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18610
Thanks for the reply. Since there's already an agreement, I will hold my
suggestion on the initialModel data type.
---
If your project is set up for it, you can reply to this email and have your
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/17862
Sure, I can find some larger dataset to test with.
But I guess, as shown in the PR description, LBFGS will generally
outperform OWLQN, but not in all cases. I assume single large scale
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/17862
Given the discussion above, I plan to replace OWLQN with LBFGS. I will send
an update soon.
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18610#discussion_r135418170
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -226,6 +246,12 @@ class LinearRegression @Since("
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18610#discussion_r135418289
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -72,6 +72,22 @@ private[regression] trait LinearRegressionParams
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18610
Just to confirm: have we agreed that the initialModel should be of type
[T <: Model[T]] rather than a String type (path to the saved model)? Sorry,
I didn't find the related discussion.
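The two candidate designs under discussion can be sketched roughly as follows; the trait and method names are made up for illustration and are not Spark's API:

```scala
// Option 1: a typed initial model (in Spark this would be bound as
// T <: Model[T]); type-safe, but the model object must already be loaded.
trait HasInitialModel[T] {
  var initialModel: Option[T] = None
  def setInitialModel(model: T): this.type = { initialModel = Some(model); this }
}

// Option 2: a String path to a saved model; easy to persist as a Param,
// but the type is only checked when the model is actually loaded.
trait HasInitialModelPath {
  var initialModelPath: Option[String] = None
  def setInitialModelPath(path: String): this.type = { initialModelPath = Some(path); this }
}
```

The typed version catches mismatches at compile time; the path version keeps the Param serializable with the existing string-based persistence.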
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17461#discussion_r135382461
--- Diff: docs/mllib-clustering.md ---
@@ -243,6 +243,9 @@ configuration), this parameter specifies the frequency
with which
checkpoints will be
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17461#discussion_r135382496
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/LDAIncrementalExample.scala
---
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17461#discussion_r135382509
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/LDAIncrementalExample.scala
---
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17461#discussion_r135382471
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/LDAIncrementalExample.scala
---
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17461#discussion_r135382491
--- Diff:
examples/src/main/scala/org/apache/spark/examples/ml/LDAIncrementalExample.scala
---
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/17461
Got it. Will make a pass today.
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18315
Thanks for the comment @sethah and @yanboliang .
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18315#discussion_r134858430
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/HingeAggregatorSuite.scala
---
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18315#discussion_r134104799
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -219,8 +219,17 @@ class LinearSVC @Since("2.2.0") (
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18315#discussion_r134104829
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/HingeAggregator.scala
---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18315#discussion_r134104881
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/HingeAggregatorSuite.scala
---
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18315#discussion_r134104883
--- Diff:
mllib/src/test/scala/org/apache/spark/ml/optim/aggregator/HingeAggregatorSuite.scala
---
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18902
Thanks for the quick update. The implementation may be improved in some
details, but first I'd want to confirm that the "convert to null" method does
not have any defect.
@MLnick @sro
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18538#discussion_r133571846
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
---
@@ -0,0 +1,240 @@
+/*
+ * Licensed to the Apache
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18902
Eh, I meant that it may be possible to get the mean values purely with the
DataFrame API (converting missingValue/NaN to null) in one pass, so we may
need to check the performance comparison. But I guess
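The one-pass idea can be sketched in plain Scala (a Seq standing in for a DataFrame column): once missingValue/NaN is mapped to null, SQL's `avg` ignores it, which is what the filter below emulates. Names here are illustrative:

```scala
// Mean of a column, treating NaN and a sentinel missingValue as null;
// mirrors what avg() would return after the "convert to null" step.
def meanIgnoringMissing(values: Seq[Double], missingValue: Double): Option[Double] = {
  val present = values.filter(v => !v.isNaN && v != missingValue)
  if (present.isEmpty) None else Some(present.sum / present.size)
}
```

Done this way, all column means come out of a single scan, which is the point of the performance comparison mentioned above.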
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17583#discussion_r132605415
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/FuncTransformer.scala ---
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18902
Hi @zhengruifeng Thanks for the idea and implementation. Definitely
something worth exploring.
As I understand it, the new implementation improves locality yet it
leverages the RDD API
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/16158
Moved the tuningSummary to Models and updated the name of the metrics
column.
![image](https://user-images.githubusercontent.com/7981698/29146612-e3a7ac78-7d16-11e7-9a4d-9ece0935bd70.png
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17583#discussion_r132058489
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/FuncTransformer.scala ---
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/17583
A gentle ping since I think this is quite helpful.
@jkbradley @MLnick @yanboliang @srowen @holdenk
Github user hhbyyh closed the pull request at:
https://github.com/apache/spark/pull/18733
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/16158#discussion_r131454343
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -133,7 +134,10 @@ class CrossValidator @Since("1.2.0") (@Si
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/16158#discussion_r131270741
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -133,7 +134,10 @@ class CrossValidator @Since("1.2.0") (@Si
Github user hhbyyh commented on a diff in the pull request:
https://github.com/apache/spark/pull/18733#discussion_r131268294
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala ---
@@ -112,16 +112,16 @@ class CrossValidator @Since("1.2.0") (@Si
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/16774
I'm confused by your suggestions here and in #18733.
I don't think it's appropriate to just "include" similar work that
originated from another PR, and sugg
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18733
Features should be merged when they are reasonable and ready, rather than
waiting on uncertain changes, especially when there are no conflicts. Spark
is already way too slow.
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18733
Nothing in this change depends on #16774.
The basic idea is that we should release the driver memory as soon as a
trained model is evaluated. I don't see any conflict.
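The basic idea can be sketched generically; `train` and `evaluate` below are placeholders for fitting and scoring one candidate model, not Spark's actual CrossValidator code. Because only the metric is retained, at most one trained model is reachable at a time:

```scala
// Evaluate each model immediately after training instead of keeping
// every trained model alive until the end of the parameter sweep.
def tuneSequentially[M](params: Seq[Int], train: Int => M, evaluate: M => Double): Seq[Double] =
  params.map { p =>
    val model = train(p)  // only this model is held by the driver
    evaluate(model)       // after this, the model can be garbage-collected
  }
```

Compared with training all candidates first and evaluating afterwards, peak driver memory drops from O(numModels) to O(1) model objects.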
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18313
@jkbradley Thanks for the suggestion. After the discussion, I found that we
can actually reduce the memory requirement for the tuning process. Check
https://issues.apache.org/jira/browse/SPARK-21535
Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/18315
Thanks for the review. Updated to address the comments.
101 - 200 of 974 matches