Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-59885635
Because this patch is not fit for the Spark design concept, I close this PR
without merging.
Github user yu-iskw closed the pull request at:
https://github.com/apache/spark/pull/1964
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-55825405
Hi @mengxr, @erikerlandson, @rnowling
Sorry to comment again. Could you review it?
thanks
---
If your project is set up for it, you can reply to this email and
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-54952704
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20038/consoleFull)
for PR 1964 at commit
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-54953348
@mengxr, @erikerlandson, @rnowling Sorry for having been kept waiting. I
redesigned the interface of distance functions. Here is the full file change.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-54960119
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20038/consoleFull)
for PR 1964 at commit
Github user aarondav commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-54802662
Jenkins, ok to test.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-54806883
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19976/consoleFull)
for PR 1964 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-54814790
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19976/consoleFull)
for PR 1964 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-54694468
Can one of the admins verify this patch?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-54564146
I'm sorry for delay to reply in replying for you.
Because I didn't concern about Python API, I rethink of the design for
distance now.
Please give me a few days.
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-54254431
@yu-iskw @erikerlandson @dlwh I prefer simple types for parameters for
model serialization and consistent APIs across languages. In a predictive
model, we should store
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-54255195
@yu-iskw Thanks for contributing to Breeze! I linked this JIRA to
SPARK-3219, which tries to generalize k-means to support more distance
measures, and we will try to use
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53845417
@mengxr, @dlwh, @erikerlandson, @rnowling,
Thank you for your feedback.
I agree with that idea which distance metrics/measures are implemented in
Breeze.
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53936014
@yu-iskw, your use case above with `setMeasure` is definitely how I
envisioned it working, in terms of passing in metric functions as parameters.
I'm
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53547074
@yu-iskw Sorry for the delay in code review! What do you expect users to do
with the distances?
For example, users can pick different distance measures in
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53567544
@mengxr Thank you for your comment.
I expect users are able to pick different distance measure in clustering
algorithm, such as k-means.
And I agree with
Github user rnowling commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53570185
@mengxr , @yu-iskw
I think it is valuable to contribute distance metrics to Breeze, but not
all of the metrics provided by @yu-iskw may be of interest to
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53581400
It seems most elegant and scala-like to directly supply the distance
function as the parameter, instead of just a 'type' that selects some internal
function. If
Github user dlwh commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53598635
We'd happily take a distance ufunc, or perhaps a collection of distance
ufuncs along the lines of what's here. I'd want it to be more Breeze-like
than what's in the
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53445323
@yu-iskw, I'm in favor of adopting a standardized distance metric class.
How best to proceed is a question of architecture and road map. I'm
interested in
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53525367
@mengxr, Will you please review the detail on this pull request.
Thank you for your trouble in advance.
---
If your project is set up for it, you can reply to this
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53341467
You want to use abs(vector) instead of vector.map(abs(_)):
https://gist.github.com/erikerlandson/488275019e4b54f5cdaf
With the test harness above, I am
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53347789
Hi @erikerlandson,
Thank you for your comment.
I use breeze.numerics.abs instead of Math.abs.
I tried to check the performance of both again. That of
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53348770
One thing I noticed was that this will not compile:
val v = new DenseBreezeVector((1 to 10).map(j = 1.0))
val a = abs(v) // fails
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53351219
Hi @erikerlandson,
I appriciate your example code.
I only use abs(v.toArray) in ChebyshevDistanceMetric.
But i didn't change in
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53378581
If the interface of distance function is valid, merging this issue, and
then trying the performance improvement in another issue would be a good idea.
Since the
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53149119
BYW, I checked the performance of Math.abs() and breeze.numerics.abs.
It seems that Math.abs() performs better than breeze.numerics.abs.
A performs better than B.
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53105970
Hi @yu-iskw,
Universal element-wise functions (U-Funcs), including element-wise abs(v),
are listed here:
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53108224
the implicit conversion function doesn't have to throw an exception, it
just needs to override the breeze vector overloading and inherit the default
spark vector
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53129462
Good Morning @erikerlandson,
Thank you for telling me the documentation.
But I can not compile, if I changed the code in ChebyshevDistanceMetric.
I guess
Github user rnowling commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52924889
@erikerlandson Breeze uses non-JVM data structures so it can use BLAS. The
malloc and copying could be both expensive and lead to 2-3x higher memory usage.
This
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52943537
Hi @erikerlandson and @rnowling,
Thank you for your advice.
I modified my code. Would you please review it.
- Improve the performance of calculating
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52994477
Sorry for immediate modification.
- CosineDistanceMeasure and TanimotoDistanceMeasure are reverted before
64f7389
-- Because both classes can not be applied
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52996580
I think the design can be made somewhat less complex. I coded up an
example here:
https://gist.github.com/erikerlandson/f4b9b9a5c9469f2d9006
A couple
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53003052
Given how small these classes are, it seems like it would be efficient to
define them all in a single file
---
If your project is set up for it, you can reply to
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53006322
Hello @erikerlandson,
Thank you for your comment.
Its' in the morning in Japan :)
I don't think a Weighted trait improves the design.
I agree
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53008478
@yu-iskw I took a fast look and that seems good to me. I will try to
review it more tomorrow.
---
If your project is set up for it, you can reply to this
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53008699
one brief comment: with breeze vectors you can say `abs(v)` instead of
`v.map(Math.abs(_))`
import breeze.numerics.abs
---
If your project is set up for
Github user rnowling commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53010747
Overall, changes look great! Thank you for considering our feedback.
I like that you were able to reduce the number of files. I think
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53012573
one brief comment: with breeze vectors you can say abs(v) instead of
v.map(Math.abs(_))
Sorry, I can't find abs API for a vector in Breeze API documentation.
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-53013311
@rnowling
Thank you for your comment.
- Remove SquaredEuclideanDistanceMeasure.scala
- Aggregate tests to reduce the number of files
---
If your
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52756284
Jenkins, test this please.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52756243
Hi @erikerlandson,
Thank you for your comment.
I modify the code according to your advice.
Would you please check it.
- DistanceMeasure trait
Github user rnowling commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52810571
Great work!
If a class implements the DistanceMetric trait, it should probably be
renamed ---DistanceMetric. For example, CosineDistanceMeasure should be
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52848633
I like where this RFE is going.
One comment about metric versus measure - cosine distance is a subclass
of DistanceMeasure, as it is *not* a metric.
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52858495
I feel ambiguous about whether there needs to be a sub-trait for
WeightedDistanceMeasure, but if we have that then we will also need a
WeightedDistanceMetric as a
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52516267
I'm wondering if it might be simpler and more idiomatic to just define
distance measure directly as any subclass of Function2, like:
trait DistanceMeasure
Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52527707
Jenkins, test this please.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52528032
QA tests have started for PR 1964. This patch merges cleanly. brView
progress:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18761/consoleFull
---
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52537821
QA results for PR 1964:br- This patch FAILED unit tests.br- This patch
merges cleanlybr- This patch adds the following public classes
(experimental):brclass
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52554072
It might be useful to distinguish true metrics from 'measures'. For
example, cosine distance is not a true distance metric. In some algorithms,
that difference
GitHub user yu-iskw opened a pull request:
https://github.com/apache/spark/pull/1964
[SPARK-3012] Standardized Distance Functions between two Vectors for MLlib
https://issues.apache.org/jira/browse/SPARK-3012
I implemented some distance measures between two Vector
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52282833
Can one of the admins verify this patch?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your
Github user yu-iskw commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52375350
Jenkins, test this please.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have
55 matches
Mail list logo