Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/19318
thanks :)
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/18936
Hi Sean, sorry for the late reply. Yeah, actually we do have some performance
data on F2J vs. OpenBLAS. It seems there is no performance gain from OpenBLAS,
not even at the unit-test level. We
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/19317
Nice catch, thanks. The perf gain is truly narrow.
I believe this impl just tried to align with the impl of
'reduceByKeyLocally'.
@ConeyLiu maybe we should revisit the code, along
GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/19318
[SPARK-22096][ML] use aggregateByKeyLocally in feature frequency calc…
## What changes were proposed in this pull request?
NaiveBayes currently takes aggregateByKey followed
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/18936
Okay. We will benchmark on OpenBLAS. Thanks.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/18936
@srowen currently, what we see is that, with the default thread setting (taking up all
computation resources available) for native BLAS, the No. 1 hot spot (with 95%+
self time
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/18936
thanks, Sean and Nick.
To @srowen, I think the difference is the finding from our previous
investigation that the thread setting in native BLAS impacts the overall
performance of a method
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/18936
Yes, they are not the only place, but we only tested on the dense dataset
and got the performance data shown above. We are conservative on sparse data,
so keep the sparse path the way
GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/18936
[SPARK-21688][ML][MLLIB] make native BLAS the first choice for BLAS level 1
operations for dense data
## What changes were proposed in this pull request?
In this PR, we make native BLAS
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/17894
@sethah yes, we only took 100 samples and trained for 3 iterations;
numClasses is 20 in our test dataset for single-node testing.
Yeah, I also believe it'd have a better result if it's
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/17894
Forgot to mention, we observed a nearly 2x performance gain with the help
of native BLAS (MKL), without any fine-tuning, so if we can also make the F2J version
run faster in a distributed cluster than
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/17894
Sorry for the late update!
We tested this PR against the current implementation with both dense and
sparse (0.95 sparsity) data:
![image](https://cloud.githubusercontent.com/assets/2673819
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/17894
@sethah Sorry for the late response. Setting as WIP. We have performance
data for dense features; data for sparse features will be ready soon. Thanks.
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/17894
@hhbyyh performance testing is ongoing, thanks!
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17894#discussion_r115415823
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
---
@@ -1722,25 +1723,22 @@ private class LogisticAggregator
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17894#discussion_r115415580
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
---
@@ -23,6 +23,7 @@ import scala.collection.mutable
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/17237#discussion_r115186158
--- Diff: python/pyspark/ml/feature.py ---
@@ -1936,6 +1935,14 @@ class StringIndexer(JavaEstimator, HasInputCol,
HasOutputCol, HasHandleInvalid
GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/17894
[SPARK-17134][ML] Use level 2 BLAS operations in LogisticAggregator
## What changes were proposed in this pull request?
Multinomial logistic regression uses LogisticAggregator class
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/17237
Sure. No problem!
GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/17237
[SPARK-19852][PYSPARK][ML] Update Python API setHandleInvalid for
StringIndexer
## What changes were proposed in this pull request?
This PR is to maintain API parity with changes made
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/16883
Sure, I can work on that :) @jkbradley
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/16883
updated. Thank you both @imatiach-msft @jkbradley
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/16883#discussion_r103599555
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -17,14 +17,16 @@
package org.apache.spark.ml.feature
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/16883#discussion_r103597822
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -163,25 +190,28 @@ class StringIndexerModel
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/16883
gotcha, will update soon.
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/16922#discussion_r101183452
--- Diff: python/pyspark/ml/feature.py ---
@@ -1178,7 +1178,17 @@ class QuantileDiscretizer(JavaEstimator,
HasInputCol, HasOutputCol, JavaMLReadab
GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/16922
[SPARK-19590][pyspark][ML] update the document for QuantileDiscretize…
## What changes were proposed in this pull request?
This PR is to document the changes on QuantileDiscretizer
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/16883
@srowen @jkbradley do you have time to take a look?
GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/16883
[SPARK-17498][ML] enhance StringIndexer to handle unseen labels
## What changes were proposed in this pull request?
This PR is an enhancement to ML StringIndexer.
Before this PR
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/15055#discussion_r88427044
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala ---
@@ -34,6 +34,7 @@ import org.apache.spark.rdd.RDD
import
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/15055#discussion_r88376643
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala ---
@@ -34,6 +34,7 @@ import org.apache.spark.rdd.RDD
import
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/15055
@srowen @jkbradley do you have time to take a look at this one?
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/14640
@rdelassus Agree. There are a number of folding methods, so some code
refactoring should be done if more folding methods are to be supported in the
future. But for now, I guess we will just
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/15428
sorry, I must have forgotten to commit the changes.
All done now. Thanks for reviewing.
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/15428
Thanks for your valuable suggestions. @jkbradley @srowen
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/15428#discussion_r84205467
--- Diff: python/pyspark/ml/feature.py ---
@@ -1157,9 +1157,11 @@ class QuantileDiscretizer(JavaEstimator,
HasInputCol, HasOutputCol, JavaMLReadab
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/15428#discussion_r84205458
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -66,11 +67,13 @@ private[feature] trait
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/15428#discussion_r84205414
--- Diff: docs/ml-features.md ---
@@ -1104,9 +1104,11 @@ for more details on the API.
`QuantileDiscretizer` takes a column with continuous features
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/15428
typo corrected. Thank you all. @srowen @jkbradley
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/15428#discussion_r82743072
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala
---
@@ -73,15 +78,27 @@ final class Bucketizer @Since("1.4.0") (@Si
GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/15428
[SPARK-17219][ML] enhanced NaN value handling in Bucketizer
## What changes were proposed in this pull request?
This PR is an enhancement of PR with commit
ID
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r78513514
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -109,7 +114,7 @@ final class QuantileDiscretizer @Since
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/14858
@srowen Updated. Thanks.
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/14640
@finleyb indeed, thank you for pointing it out. I have put it right and
added a test to guard this issue. Many thanks. And feel free to let us know if
you have any problem with this class or any
GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/15055
[SPARK-17462][MLLIB]use VersionUtils to parse Spark version strings
## What changes were proposed in this pull request?
Several places in MLlib use custom regexes or other approaches
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r77465278
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r77283036
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
## QuantileDiscretizer
`QuantileDiscretizer` takes
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r77145656
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
## QuantileDiscretizer
`QuantileDiscretizer` takes
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r77138983
--- Diff: docs/ml-features.md ---
@@ -1102,7 +1102,8 @@ for more details on the API.
## QuantileDiscretizer
`QuantileDiscretizer` takes
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r77138037
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r77134887
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +115,10 @@ final class QuantileDiscretizer @Since
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/14640
Updates:
1. code refactoring. Rename the API to align with Sklearn changes
2. add implementation in CrossValidator
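
For readers unfamiliar with the idea, the grouped split discussed in this thread can be sketched in plain Python. This is only an illustration, assuming labelKFold follows the semantics of scikit-learn's LabelKFold (later renamed GroupKFold): samples that share a label never appear on both the training and the validation side of a split. The function name `label_k_fold` and the greedy balancing strategy are hypothetical and are not taken from the PR.

```python
from collections import defaultdict


def label_k_fold(labels, k):
    """Yield (training_indices, validation_indices) for k group-aware folds.

    Hypothetical helper: every sample with the same label lands in the
    same validation fold, so no label is split across train/validation.
    """
    # Collect the sample indices that belong to each distinct label.
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    if len(groups) < k:
        raise ValueError("need at least as many distinct labels as folds")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(k)]
    ordered = sorted(groups, key=lambda g: (-len(groups[g]), str(g)))
    for label in ordered:
        lightest = min(range(k), key=lambda f: len(folds[f]))
        folds[lightest].extend(groups[label])

    all_indices = set(range(len(labels)))
    for fold in folds:
        validation = sorted(fold)
        training = sorted(all_indices - set(fold))
        yield training, validation


sample_labels = ["a", "a", "b", "b", "c", "c", "c", "d"]
splits = list(label_k_fold(sample_labels, k=2))
```

Each element of `splits` is one training/validation pair; a cross-validation loop would fit on the first list of indices and evaluate on the second.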
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/14858
updated tests and documents related to this change
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r76773626
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +114,10 @@ final class QuantileDiscretizer @Since
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r76738479
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala
---
@@ -106,18 +106,19 @@ final class Bucketizer @Since("1.4.0"
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r76738333
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -114,10 +114,10 @@ final class QuantileDiscretizer @Since
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r76572410
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -116,8 +116,7 @@ final class QuantileDiscretizer @Since
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r76571166
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala
---
@@ -63,7 +63,7 @@ final class Bucketizer @Since("1.4.0") (@Si
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r76570646
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala ---
@@ -116,8 +116,7 @@ final class QuantileDiscretizer @Since
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r76569942
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala
---
@@ -63,7 +63,7 @@ final class Bucketizer @Since("1.4.0") (@Si
Github user VinceShieh commented on a diff in the pull request:
https://github.com/apache/spark/pull/14858#discussion_r76569900
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala
---
@@ -129,17 +129,21 @@ object Bucketizer extends
DefaultParamsReadable
GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/14858
[SPARK-17219][ML] Add NaN value handling in Bucketizer
## What changes were proposed in this pull request?
This PR fixes an issue when a cutpoints vector containing NaN is sent
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/14640
@holdenk thanks for your comments. :) You are right. But as you can see,
this is a variant of kFold, so I think it's better to stay close to it,
otherwise, it would seem confusing, don't you
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/14640
if one understands the underlying ideas behind this method (labelKFold),
it's easy to take it as a class/category of data, though I do think it's not
that straightforward, even a bit confusing
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/14747
It seems Array.distinct will not break the sequence of the elements. But
you are right, we need to guarantee the array is sorted.
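
The two points in this comment can be illustrated with a small Python analogue. This is a sketch, not the Scala code under review: `distinct_preserving_order` mimics the first-occurrence ordering of Scala's `Array.distinct`, and `is_strictly_increasing` is a hypothetical guard for the sortedness requirement. Both helper names are made up for this example.

```python
def distinct_preserving_order(values):
    """Drop duplicates while keeping the first occurrence of each value."""
    # dict preserves insertion order, so the original sequence survives.
    return list(dict.fromkeys(values))


def is_strictly_increasing(values):
    """Return True when every element is smaller than its successor."""
    return all(a < b for a, b in zip(values, values[1:]))


# Duplicated but sorted candidate splits: deduplication keeps them sorted.
candidate_splits = [0.0, 1.0, 1.0, 2.0, 2.0, 3.0]
splits = distinct_preserving_order(candidate_splits)

# Unsorted input stays unsorted after deduplication, so the sortedness
# guarantee has to come from elsewhere (an explicit sort or a check).
unsorted_result = distinct_preserving_order([3, 1, 3, 2])
```

Deduplication alone never reorders elements, which is why the sortedness of the input still has to be established separately.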
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/14747
yes, the output from approxQuantile is a sorted array.
GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/14747
[SPARK-17086] Fix an issue in QuantileDiscretizer
## What changes were proposed in this pull request?
In cases when QuantileDiscretizerSuite is called upon a numeric array
GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/14640
[SPARK-17055] add labelKFold to CrossValidator
## What changes were proposed in this pull request?
This patch improves the CrossValidator by adding a new training/validation
split