Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52848633
I like where this RFE is going.
One comment about "metric" versus "measure" - cosine distance is a subclass
of DistanceMeasure, as
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52554072
It might be useful to distinguish true metrics from 'measures'. For
example, cosine distance is not a true distance metric. In some algorithms,
that
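To make the metric-versus-measure distinction concrete: cosine distance is symmetric and non-negative, but it violates the triangle inequality, so it is a "measure" rather than a true metric. A minimal self-contained sketch (plain Scala, with `Array[Double]` standing in for MLlib's `Vector`):

```scala
// Cosine distance: 1 - cos(angle between a and b).
def cosineDistance(a: Array[Double], b: Array[Double]): Double = {
  val dot  = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
  1.0 - dot / (norm(a) * norm(b))
}

val a = Array(1.0, 0.0)
val b = Array(1.0, 1.0)
val c = Array(0.0, 1.0)

// A true metric would require d(a, c) <= d(a, b) + d(b, c),
// but here d(a, c) = 1.0 while d(a, b) + d(b, c) ≈ 0.586.
```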
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52524657
It might also be handy to define an implicit conversion from straight
functions:
implicit def functionToDistanceMeasure(f: (Vector, Vector)=>Double) =
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1964#issuecomment-52516267
I'm wondering if it might be simpler and more idiomatic to just define a
distance measure directly as a subclass of Function2, like:
trait DistanceMe
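Taken together, the two suggestions above (a Function2-based trait, plus an implicit conversion from plain functions) might look like the following sketch. `DistanceMeasure` and `functionToDistanceMeasure` are the names floated in the discussion, not Spark's final API, and `Array[Double]` stands in for MLlib's `Vector` to keep the example self-contained:

```scala
import scala.language.implicitConversions

// Hypothetical sketch: a distance measure is just a Function2 mapping a
// pair of vectors to a Double.
trait DistanceMeasure extends ((Array[Double], Array[Double]) => Double)

// Any plain function of the right shape can be lifted into a
// DistanceMeasure automatically:
implicit def functionToDistanceMeasure(
    f: (Array[Double], Array[Double]) => Double): DistanceMeasure =
  new DistanceMeasure {
    def apply(a: Array[Double], b: Array[Double]): Double = f(a, b)
  }

// Example: Manhattan (L1) distance written as an ordinary function...
val manhattan = (a: Array[Double], b: Array[Double]) =>
  a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

// ...usable anywhere a DistanceMeasure is expected, via the conversion:
val d: DistanceMeasure = manhattan
```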
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1689#issuecomment-52336202
Latest push updates the RangePartitioner sampling job to be async, and updates
the async action functions so that they will properly enclose the sampling job
induced by
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1909#issuecomment-51977051
@rxin I created an umbrella:
https://issues.apache.org/jira/browse/SPARK-2992
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well.
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1909#issuecomment-51965848
Good point, I will look into those.
GitHub user erikerlandson opened a pull request:
https://github.com/apache/spark/pull/1909
[SPARK-2991] Implement RDD lazy transforms for scanLeft and scan
Discussion of implementations:
http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with
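For readers unfamiliar with scanLeft semantics, the proposed RDD transform mirrors Scala's collection method, which yields every running accumulation, seeded with the initial value (the PR's contribution is providing this lazily on RDDs rather than as an action):

```scala
// scanLeft on a plain Scala collection: running sums, prefixed by the seed.
val runningSums = List(1, 2, 3, 4).scanLeft(0)(_ + _)
// → List(0, 1, 3, 6, 10)
```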
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1839#issuecomment-51806430
Assuming this is correct, "okay" is not the same as "ok":
> The following regex checks that: .*ok\W+to\W+test.*
> So I think yo
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1839#issuecomment-51720727
Jenkins is still not getting the memo. How strict is Jenkins with commands?
Is 'okay' the same as 'ok'?
GitHub user erikerlandson opened a pull request:
https://github.com/apache/spark/pull/1858
[SPARK-2911] apply parent[T](j) to clarify UnionRDD code
References to dependencies(j) for actually obtaining RDD parents are less
common than I originally estimated. It does clarify
GitHub user erikerlandson opened a pull request:
https://github.com/apache/spark/pull/1841
[SPARK-2911]: provide rdd.parent[T](j) to obtain jth parent RDD
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/erikerlandson/spark spark
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1839#issuecomment-51496145
This is a reboot of:
https://github.com/apache/spark/pull/1254
GitHub user erikerlandson opened a pull request:
https://github.com/apache/spark/pull/1839
[SPARK-2315] Implement drop, dropRight and dropWhile for RDDs, which
take an RDD as input and return a new RDD with elements dropped.
These methods are now implemented as lazy RDD
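The proposed methods follow the semantics of Scala's collection operations of the same names, shown here on a plain List (no Spark dependency) for the typical header/trailer-stripping use case:

```scala
val lines = List("header", "a", "b", "c")

val noHeader = lines.drop(1)                  // drop the first n elements
val noLast   = lines.dropRight(1)             // drop the last n elements
val afterHdr = lines.dropWhile(_ == "header") // drop leading matches
```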
Github user erikerlandson closed the pull request at:
https://github.com/apache/spark/pull/1254
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-51495882
I'm going to try closing this PR and rebooting with a fresh one
Github user erikerlandson commented on a diff in the pull request:
https://github.com/apache/spark/pull/1689#discussion_r15931660
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -222,7 +228,8 @@ class RangePartitioner[K : Ordering : ClassTag, V
Github user erikerlandson commented on a diff in the pull request:
https://github.com/apache/spark/pull/1689#discussion_r15931609
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-51142701
Should I consider creating a fresh PR, or is there some better way to get
Jenkins to test?
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-50966402
Starting to worry I confused it by pushing the PR branch using '+'
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-50905932
Jenkins appears to be AWOL
GitHub user erikerlandson opened a pull request:
https://github.com/apache/spark/pull/1689
[SPARK-1021] Defer the data-driven computation of partition bounds in
sortByKey() until evaluation.
You can merge this pull request into a Git repository by running:
$ git pull
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-50759649
O Jenkins Where Art Thou?
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-50648091
should Jenkins run an automatic build on PR update?
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-50554859
I updated this PR so that drop(), dropRight() and dropWhile() are now lazy
transforms. A description of what I did is here:
http://erikerlandson.github.io
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-47429672
I also envision typical use cases as being either pre- or post-processing.
That is, not something that would often appear inside a tight loop.
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-47428789
Note, in a typical case where one is invoking something like rdd.drop(1)
(or some other small number), only one partition gets evaluated by drop: the
first one
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-47426817
Tangentially, one thing I noticed is that currently all the
"XxxRDDFunctions" implicits are automatically defined in SparkContext, and so I
held to that
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-47426628
It will scan one partition twice: the one containing the "boundary"
between things dropped and not-dropped. Any partitions prior to that boundary
are
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/1254#issuecomment-47420288
My reasoning is that most use cases (or at least the ones I had in mind)
are something like rdd.drop(n), where n is much smaller than rdd.count(),
generally 1 or
GitHub user erikerlandson opened a pull request:
https://github.com/apache/spark/pull/1254
[SPARK-2315] Implement drop, dropRight and dropWhile for RDDs
drop, dropRight and dropWhile methods for RDDs that return a new RDD as the
result.
// example: load in some text and
Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/714#issuecomment-42983694
I coded up a proposal where SparkConf owns the checking inside of getInt(),
getDouble() and friends, described here:
https://issues.apache.org/jira/browse