[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

2014-08-20 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1964#issuecomment-52848633 I like where this RFE is going. One comment about "metric" versus "measure" - cosine distance is a subclass of DistanceMeasure, as

[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

2014-08-18 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1964#issuecomment-52554072 It might be useful to distinguish true metrics from 'measures'. For example, cosine distance is not a true distance metric. In some algorithms, that
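
The point that cosine distance is not a true metric can be checked numerically. The sketch below assumes the common definition of cosine distance as 1 minus cosine similarity and uses plain arrays rather than any Spark type; the object and helper names are hypothetical and are not taken from the pull request. It shows three vectors for which the triangle inequality fails.

```scala
object CosineNotMetric {
  // Stand-in vector type so the sketch runs without Spark on the classpath.
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum
  def norm(a: Vec): Double = math.sqrt(dot(a, a))

  // Assumed definition: 1 - cosine similarity.
  def cosineDistance(a: Vec, b: Vec): Double = 1.0 - dot(a, b) / (norm(a) * norm(b))

  val a: Vec = Array(1.0, 0.0)
  val b: Vec = Array(1.0, 1.0) // 45 degrees between a and c
  val c: Vec = Array(0.0, 1.0)

  // Going directly from a to c versus going through b.
  val direct: Double = cosineDistance(a, c)
  val viaB: Double = cosineDistance(a, b) + cosineDistance(b, c)

  // A metric requires direct <= viaB; here the direct distance is larger.
  def violatesTriangleInequality: Boolean = direct > viaB
}
```

An algorithm that relies on the triangle inequality for pruning could therefore give wrong answers with this measure, which is one reason to keep the two notions apart in the type hierarchy.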

[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

2014-08-18 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1964#issuecomment-52524657 It might also be handy to define an implicit conversion from straight functions: implicit def functionToDistanceMeasure(f: (Vector, Vector)=>Double) =
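
The body of the implicit conversion is cut off in this excerpt. As a rough illustration of the idea only, the sketch below assumes a hypothetical `DistanceMeasure` trait and a plain array standing in for MLlib's `Vector`; neither is taken from the pull request, and `distanceFromOrigin` is an invented helper.

```scala
import scala.language.implicitConversions

object ImplicitDistanceSketch {
  // Stand-in for MLlib's Vector so the sketch runs without Spark.
  type Vec = Array[Double]

  // Hypothetical trait; the definition proposed in the PR is not visible here.
  trait DistanceMeasure extends ((Vec, Vec) => Double)

  // Wrap any plain two-argument function as a DistanceMeasure.
  implicit def functionToDistanceMeasure(f: (Vec, Vec) => Double): DistanceMeasure =
    new DistanceMeasure {
      def apply(a: Vec, b: Vec): Double = f(a, b)
    }

  // Invented API that insists on a DistanceMeasure argument.
  def distanceFromOrigin(v: Vec, measure: DistanceMeasure): Double =
    measure(Array.fill(v.length)(0.0), v)

  // An ordinary function value, not a DistanceMeasure.
  val euclidean: (Vec, Vec) => Double =
    (a, b) => math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // The implicit conversion is applied at this call site.
  val result: Double = distanceFromOrigin(Array(3.0, 4.0), euclidean)
}
```

With the conversion in scope, callers can hand an existing function to code that expects a measure without writing a wrapper class by hand.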

[GitHub] spark pull request: [SPARK-3012] Standardized Distance Functions b...

2014-08-18 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1964#issuecomment-52516267 I'm wondering if it might be simpler and more idiomatic to just define distance measure directly as any subclass of Function2, like: trait DistanceMe
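
The trait definition is truncated in this excerpt. One way to read the suggestion, offered as a sketch under assumptions rather than as the code from the comment: if a distance measure is itself a `Function2`, it can be passed anywhere a two-argument function is accepted. The `Vec` alias, the `Manhattan` object, and the sample data are all hypothetical.

```scala
object Function2Sketch {
  // Stand-in for MLlib's Vector so the sketch runs without Spark.
  type Vec = Array[Double]

  // Hypothetical: a distance measure is simply a Function2 over vectors.
  trait DistanceMeasure extends ((Vec, Vec) => Double)

  // Example measure: sum of absolute coordinate differences.
  object Manhattan extends DistanceMeasure {
    def apply(a: Vec, b: Vec): Double =
      a.zip(b).map { case (x, y) => math.abs(x - y) }.sum
  }

  val pairs: Seq[(Vec, Vec)] = Seq(
    (Array(0.0, 0.0), Array(1.0, 2.0)),
    (Array(1.0, 1.0), Array(4.0, 5.0))
  )

  // Function2 already provides `tupled`, so no adapter is needed.
  val distances: Seq[Double] = pairs.map(Manhattan.tupled)
}
```

The appeal of this design is that combinators such as `tupled` and `curried` come for free from `Function2`.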

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-52336202 Latest push updates RangePartition sampling job to be async, and updates the async action functions so that they will properly enclose the sampling job induced by

[GitHub] spark pull request: [SPARK-2991] Implement RDD lazy transforms for...

2014-08-12 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1909#issuecomment-51977051 @rxin I created an umbrella: https://issues.apache.org/jira/browse/SPARK-2992 --- If your project is set up for it, you can reply to this email and have your

[GitHub] spark pull request: [SPARK-2991] Implement RDD lazy transforms for...

2014-08-12 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1909#issuecomment-51965848 Good point, I will look into those.

[GitHub] spark pull request: [SPARK-2991] Implement RDD lazy transforms for...

2014-08-12 Thread erikerlandson
GitHub user erikerlandson opened a pull request: https://github.com/apache/spark/pull/1909 [SPARK-2991] Implement RDD lazy transforms for scanLeft and scan Discussion of implementations: http://erikerlandson.github.io/blog/2014/08/09/implementing-an-rdd-scanleft-transform-with

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-08-11 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1839#issuecomment-51806430 Assuming this is correct, "okay" is not the same as "ok": > The following regex checks that: .*ok\W+to\W+test.* > So I think yo
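
The quoted regex can be tried directly. The sketch below assumes the pattern is applied to the whole comment as a case-sensitive full match, which is how `String.matches` behaves; whether Jenkins applies it the same way is not confirmed by this thread, and the object and method names are hypothetical.

```scala
object JenkinsTriggerSketch {
  // Pattern quoted in the thread, escaped for a Scala string literal.
  val pattern: String = ".*ok\\W+to\\W+test.*"

  // True when the whole comment matches the pattern.
  def triggers(comment: String): Boolean = comment.matches(pattern)
}
```

Under that assumption, `triggers("Jenkins, this is ok to test")` is true, while `triggers("okay to test")` is false because `\W+` needs at least one non-word character immediately after "ok" and "okay" continues with letters.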

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-08-10 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1839#issuecomment-51720727 Jenkins still not getting the memo. How strict is Jenkins with commands? Is 'okay' the same as 'ok'?

[GitHub] spark pull request: [SPARK-2911] apply parent[T](j) to clarify Uni...

2014-08-08 Thread erikerlandson
GitHub user erikerlandson opened a pull request: https://github.com/apache/spark/pull/1858 [SPARK-2911] apply parent[T](j) to clarify UnionRDD code References to dependencies(j) for actually obtaining RDD parents are less common than I originally estimated. It does clarify

[GitHub] spark pull request: [SPARK-2911]: provide rdd.parent[T](j) to obta...

2014-08-07 Thread erikerlandson
GitHub user erikerlandson opened a pull request: https://github.com/apache/spark/pull/1841 [SPARK-2911]: provide rdd.parent[T](j) to obtain jth parent RDD You can merge this pull request into a Git repository by running: $ git pull https://github.com/erikerlandson/spark spark

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-08-07 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1839#issuecomment-51496145 This is a reboot of: https://github.com/apache/spark/pull/1254

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-08-07 Thread erikerlandson
GitHub user erikerlandson opened a pull request: https://github.com/apache/spark/pull/1839 [SPARK-2315] Implement drop, dropRight and dropWhile for RDDs, which take RDD as input and return new RDD with elements dropped. These methods are now implemented as lazy RDD

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-08-07 Thread erikerlandson
Github user erikerlandson closed the pull request at: https://github.com/apache/spark/pull/1254

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-08-07 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-51495882 I'm going to try closing this PR and rebooting with a fresh one

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-07 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/1689#discussion_r15931660 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -222,7 +228,8 @@ class RangePartitioner[K : Ordering : ClassTag, V

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-07 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/1689#discussion_r15931609 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-08-04 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-51142701 Should I consider creating a fresh PR, or is there some better way to get Jenkins to test?

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-08-02 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-50966402 Starting to worry I confused it by pushing the PR branch using '+'

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-08-01 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-50905932 Jenkins appears to be AWOL.

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-07-31 Thread erikerlandson
GitHub user erikerlandson opened a pull request: https://github.com/apache/spark/pull/1689 [SPARK-1021] Defer the data-driven computation of partition bounds in so... ...rtByKey() until evaluation. You can merge this pull request into a Git repository by running: $ git pull

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-07-31 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-50759649 O Jenkins Where Art Thou?

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-07-30 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-50648091 Should Jenkins run an automatic build on PR update?

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-07-29 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-50554859 I updated this PR so that drop(), dropRight() and dropWhile() are now lazy transforms. A description of what I did is here: http://erikerlandson.github.io

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-06-28 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-47429672 I also envision typical use cases as being either pre- or post-processing. That is, not something that would often appear inside a tight loop.

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-06-28 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-47428789 Note, in a typical case where one is invoking something like rdd.drop(1), or other small number, only one partition gets evaluated by drop - the first one

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-06-28 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-47426817 Tangentially, one thing I noticed is that currently all the "XxxRDDFunctions" implicits are automatically defined in SparkContext, and so I held to that

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-06-28 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-47426628 It will scan one partition twice: the one containing the "boundary" between things dropped and not-dropped. Any partitions prior to that boundary are
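
The idea of a "boundary" partition can be illustrated without Spark. The sketch below models partitions as local sequences and is not the implementation from the pull request: the `drop` helper, its signature, and the handling of partitions before the boundary are assumptions made for illustration only.

```scala
object DropBoundarySketch {
  // Partitions are modelled as in-memory sequences, a hypothetical stand-in for an RDD.
  def drop[T](partitions: Seq[Seq[T]], n: Int): Seq[Seq[T]] = {
    var remaining = math.max(n, 0)
    var boundary = 0
    // First look: walk forward, consuming whole partitions while they fit in what is left to drop.
    while (boundary < partitions.length && partitions(boundary).size <= remaining) {
      remaining -= partitions(boundary).size
      boundary += 1
    }
    partitions.zipWithIndex.map { case (part, i) =>
      if (i < boundary) Seq.empty[T]               // assumed: fully dropped
      else if (i == boundary) part.drop(remaining) // second look at the boundary partition
      else part                                    // after the boundary: untouched
    }
  }
}
```

For small `n`, such as `drop(1)`, the loop stops at the first non-empty partition, so only that partition is inspected before the result is produced.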

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-06-28 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1254#issuecomment-47420288 My reasoning is that most use cases (or at least the ones I had in mind) are something like rdd.drop(n), where n is much smaller than rdd.count(), generally 1 or

[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-06-27 Thread erikerlandson
GitHub user erikerlandson opened a pull request: https://github.com/apache/spark/pull/1254 [SPARK-2315] Implement drop, dropRight and dropWhile for RDDs drop, dropRight and dropWhile methods for RDDs that return a new RDD as the result. // example: load in some text and

[GitHub] spark pull request: [SPARK-1779] add warning when memoryFraction i...

2014-05-13 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/714#issuecomment-42983694 I coded up a proposal where SparkConf owns the checking inside of getInt(), getDouble() and friends, described here: https://issues.apache.org/jira/browse
