[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-07-10 Thread koertkuipers
Github user koertkuipers closed the pull request at:

https://github.com/apache/spark/pull/3632


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3632#issuecomment-9676
  
Can one of the admins verify this patch?





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-11 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22770867
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+    val h1 = if (x == null) 0 else x.hashCode()
+    val h2 = if (y == null) 0 else y.hashCode()
+    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
+
+private[spark] class NoOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = 0
+}
+
+private[spark] class KeyValueOrdering[A, B](
+  ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
+) extends Ordering[Product2[A, B]] {
+  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
+  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
+
+  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
+    val c1 = ord1.compare(x._1, y._1)
+    if (c1 != 0) c1 else ord2.compare(x._2, y._2)
--- End diff --

@markhamstra any preference between the 2 options?





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22420650
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+    val h1 = if (x == null) 0 else x.hashCode()
+    val h2 = if (y == null) 0 else y.hashCode()
+    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
+
+private[spark] class NoOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = 0
+}
+
+private[spark] class KeyValueOrdering[A, B](
+  ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
+) extends Ordering[Product2[A, B]] {
+  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
+  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
--- End diff --

What is the expected scenario in which a `KeyValueOrdering` is called for 
with `B` unordered? You're setting up `KeyValueOrdering` to be more general 
than your needs for its only current usage in `OrderedValueRDDFunctions`, but 
I'm not quite grasping how and where else you are expecting `KeyValueOrdering` 
to be used.

It seems to me that `KeyValueOrdering` should have two ctors: 
```scala
KeyValueOrdering[A, B](keyOrdering: Ordering[A], valueOrdering: Ordering[B])

...

def this(valueOrdering: Ordering[B]) = this(new HashOrdering[A], valueOrdering)
``` 





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22422723
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+    val h1 = if (x == null) 0 else x.hashCode()
+    val h2 = if (y == null) 0 else y.hashCode()
+    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
+
+private[spark] class NoOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = 0
+}
+
+private[spark] class KeyValueOrdering[A, B](
+  ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
+) extends Ordering[Product2[A, B]] {
+  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
+  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
+
+  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
+    val c1 = ord1.compare(x._1, y._1)
+    if (c1 != 0) c1 else ord2.compare(x._2, y._2)
--- End diff --

What happens when `ord1` is `HashOrdering` and `c1 == 0` but `x._1 != 
y._1`?  More generally, what happens when `ord1` isn't actually a full 
ordering? 
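To make the question concrete, here is a hedged sketch (the `String` keys are illustrative, not from the PR) showing that the `HashOrdering` from the diff above reports two distinct keys as equal whenever their hash codes collide:

```scala
// HashOrdering as in the diff above (with the comparison operator restored).
class HashOrdering[A] extends Ordering[A] {
  override def compare(x: A, y: A): Int = {
    val h1 = if (x == null) 0 else x.hashCode()
    val h2 = if (y == null) 0 else y.hashCode()
    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
  }
}

// "Aa" and "BB" are a well-known JVM String hashCode collision.
val ord = new HashOrdering[String]
println(ord.compare("Aa", "BB"))  // 0, even though "Aa" != "BB"
```

So a sort that uses `HashOrdering` alone cannot distinguish colliding keys; whatever consumes the sorted stream must handle that.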





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22421802
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+    val h1 = if (x == null) 0 else x.hashCode()
+    val h2 = if (y == null) 0 else y.hashCode()
+    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
--- End diff --

`ExternalSorter#keyComparator` should be refactored to use 
`spark.util.HashOrdering`.





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread markhamstra
Github user markhamstra commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22422719
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
--- End diff --

This isn't actually true.  The `compare` method only produces a partial 
ordering.  `ExternalSorter#keyComparator` gets away with the `Ordering[K]` 
falsehood only because later passes resolve hash collisions.





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22428452
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+    val h1 = if (x == null) 0 else x.hashCode()
+    val h2 = if (y == null) 0 else y.hashCode()
+    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
+
+private[spark] class NoOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = 0
+}
+
+private[spark] class KeyValueOrdering[A, B](
+  ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
+) extends Ordering[Product2[A, B]] {
+  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
+  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
+
+  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
+    val c1 = ord1.compare(x._1, y._1)
+    if (c1 != 0) c1 else ord2.compare(x._2, y._2)
--- End diff --

I see two options:
1) Do something similar to what happens in `ExternalSorter.mergeWithAggregation`, where `groupByKeyAndSortValues` is aware that it might be processing multiple keys (with the same hashCode) at once and checks for key equality. This increases memory requirements: all values for all keys with the same hashCode have to fit in memory, as opposed to all values for a single key.
2) Require an ordering for `K` that can be used as a tie-breaker when the hashCodes of the keys are the same, so that I have a total ordering for `K`.

Thoughts?

I will add a unit test with multiple keys that share the same hashCode.
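Option 2 could be sketched as a composite ordering that sorts by hashCode and falls back to a supplied total `Ordering[K]` only on hash ties (the class name here is illustrative, not from the PR):

```scala
// Order by hashCode first; break ties with a caller-supplied total ordering,
// so distinct keys with colliding hashCodes still compare as non-equal.
class HashThenTotalOrdering[K](total: Ordering[K]) extends Ordering[K] {
  override def compare(x: K, y: K): Int = {
    val h1 = if (x == null) 0 else x.hashCode()
    val h2 = if (y == null) 0 else y.hashCode()
    if (h1 < h2) -1
    else if (h1 > h2) 1
    else total.compare(x, y)  // hashCodes collide: use the tie-breaker
  }
}

val ord = new HashThenTotalOrdering[String](Ordering.String)
println(ord.compare("Aa", "BB") != 0)  // true: colliding keys stay distinct
```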





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22423736
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+    val h1 = if (x == null) 0 else x.hashCode()
+    val h2 = if (y == null) 0 else y.hashCode()
+    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
+
+private[spark] class NoOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = 0
+}
+
+private[spark] class KeyValueOrdering[A, B](
+  ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
+) extends Ordering[Product2[A, B]] {
+  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
+  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
+
+  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
+    val c1 = ord1.compare(x._1, y._1)
+    if (c1 != 0) c1 else ord2.compare(x._2, y._2)
--- End diff --

Good point, that doesn't look right. It could lead to keys being interleaved 
in the output.





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2015-01-02 Thread koertkuipers
Github user koertkuipers commented on a diff in the pull request:

https://github.com/apache/spark/pull/3632#discussion_r22423573
  
--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.util
+
+private[spark] class HashOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = {
+    val h1 = if (x == null) 0 else x.hashCode()
+    val h2 = if (y == null) 0 else y.hashCode()
+    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
+  }
+}
+
+private[spark] class NoOrdering[A] extends Ordering[A] {
+  override def compare(x: A, y: A): Int = 0
+}
+
+private[spark] class KeyValueOrdering[A, B](
+  ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
+) extends Ordering[Product2[A, B]] {
+  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
+  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
--- End diff --

Yeah, that's right. I copied it from another pull request of mine that needed a 
more general version. I can simplify it.





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-26 Thread koertkuipers
Github user koertkuipers commented on the pull request:

https://github.com/apache/spark/pull/3632#issuecomment-68150977
  
@markhamstra take a look now.
I ignored the situation of `K` and `V` having the same type, since I think it 
can be dealt with by using a simple wrapper (value) class for the Vs.
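One hedged sketch of what that wrapper could look like (the wrapper name and its ordering are illustrative, not from the PR):

```scala
// Wrapping same-typed values gives them a distinct type, so the key and
// value orderings can both live in implicit scope without clashing.
case class SortVal(v: Int)

implicit val sortValOrdering: Ordering[SortVal] = Ordering.by(_.v)

val pairs = Seq((1, SortVal(3)), (1, SortVal(1)))
println(pairs.map(_._2).sorted)  // List(SortVal(1), SortVal(3))
```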





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/3632#issuecomment-68069421
  
The reason for separate classes is to cleanly segregate the 
available/supportable functionality.  Not every `PairRDD` has keys that can be 
ordered, so `sortByKey` shouldn't be part of `PairRDD`.  When keys can be 
ordered, there is often a natural ordering that is already implicitly in scope. 
When that is true, we don't want to force the user to explicitly provide an 
`Ordering` -- e.g. if you have an `RDD[(Int, Foo)]`, then `rdd.sortByKey()` 
should just work. If you want a different ordering, you just need to bring a 
new implicit `Ordering` for that key type into scope.

Things aren't as cleanly separated in the Java API because of the lack of 
support for implicits there, but that doesn't mean that we should abandon the 
separation between `PairRDD` and `OrderedRDD` on the Scala side or start 
dirtying-up `PairRDD.scala` when we want to provide new methods for RDDs whose 
keys and values can both be ordered.

I really think that we want to repeat the pattern of `OrderedRDD` in a 
`DoublyOrderedRDD` -- or whatever better name you can come up with.  The 
biggest quirk I can see right now is if the types of both keys and values are 
the same but you want to order them one way when sorting by key and a different 
way when doing the secondary sort on values.  That won't work with implicits 
since there can only be one implicit `Ordering` for the type in scope at a 
time.  The problem could either be avoided by using distinct types for the key 
and value roles, or a method signature with explicit orderings could be added 
to address this corner case. 
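The pattern being described can be sketched outside Spark with a plain `Seq` standing in for the RDD (class and method names here are illustrative, not an actual Spark API):

```scala
import scala.language.implicitConversions

// Extra methods become available only when implicit Orderings exist for both
// the key and the value type, mirroring how OrderedRDDFunctions is gated on
// an Ordering for the key alone.
class DoublyOrderedFunctions[K: Ordering, V: Ordering](self: Seq[(K, V)]) {
  def groupByKeyAndSortValues: Map[K, Seq[V]] =
    self.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sorted }
}

implicit def toDoublyOrdered[K: Ordering, V: Ordering](
    s: Seq[(K, V)]): DoublyOrderedFunctions[K, V] =
  new DoublyOrderedFunctions(s)

val grouped = Seq((1, "b"), (1, "a"), (2, "c")).groupByKeyAndSortValues
println(grouped(1))  // List(a, b)
```

Because both `Ordering`s are context bounds, the call compiles only for pair types whose key and value each have an implicit ordering in scope, which is exactly the gating this pattern provides.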





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread koertkuipers
Github user koertkuipers commented on the pull request:

https://github.com/apache/spark/pull/3632#issuecomment-68081488
  
Mhhh, I don't really agree with you. I find `OrderedRDD` confusing because:
1) You kind of have to know that there is an implicit conversion to 
`OrderedRDD` somewhere out there that only works under certain conditions. A 
method on `PairRDDFunctions` with an implicit `Ordering[V]` parameter is part 
of the public API and also very clear in the restrictions it imposes (since 
it's documented in the required implicit parameter that is part of the API).
2) It's much harder to override the `Ordering[V]`. Forcing the user to bring a 
new implicit `Ordering[V]` into scope just to use it once is a somewhat bad use 
of implicits (and hard to debug). An implicit parameter should allow for an 
explicit override, in my opinion.

Anyhow, you are right that what I did does not conform to `OrderedRDD`. So if 
others agree with you I will rewrite it like that, no problem! Good point on 
the corner case of K and V having the same type... let me think about that.
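Point 2 can be illustrated with a standalone toy method (not the PR's API): an implicit parameter permits an explicit override at the call site.

```scala
// The ordering is implicit by default but can be passed explicitly.
def sortValues[V](vs: Seq[V])(implicit ord: Ordering[V]): Seq[V] =
  vs.sorted(ord)

println(sortValues(Seq(3, 1, 2)))                        // List(1, 2, 3)
println(sortValues(Seq(3, 1, 2))(Ordering.Int.reverse))  // List(3, 2, 1)
```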






[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread koertkuipers
GitHub user koertkuipers reopened a pull request:

https://github.com/apache/spark/pull/3632

SPARK-3655 GroupByKeyAndSortValues

See https://issues.apache.org/jira/browse/SPARK-3655

This pullreq is based on the approach that uses 
repartitionAndSortWithinPartition, but only implements GroupByKeyAndSortValues 
and not foldLeft.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tresata/spark 
feat-group-by-key-and-sort-values

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3632.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3632


commit 7e3cde989ec93849d60988e6d9fae729ca0c46a4
Author: Koert Kuipers ko...@tresata.com
Date:   2014-12-07T20:16:53Z

works but Iterables in signature are not right

commit 42075338a32c40e4b962b547dbc74aad89351207
Author: Koert Kuipers ko...@tresata.com
Date:   2014-12-07T21:57:25Z

change groupByKeyAndSortValues to return RDD[(K, TraversableOnce[V])] 
instead of RDD[(K, Iterable[V])]. i dont think the Iterable version can be 
implemented efficiently

commit 4f7defe86c514f3d153feaed804cf77f1d402f63
Author: Koert Kuipers ko...@tresata.com
Date:   2014-12-10T14:44:18Z

change groupByKeyAndSortValues to return RDD[(K, Iterable[V])] where the 
values (the iterables) are in-memory arrays







[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread koertkuipers
Github user koertkuipers commented on the pull request:

https://github.com/apache/spark/pull/3632#issuecomment-68081763
  
I will work on an updated version in early January.





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread koertkuipers
Github user koertkuipers closed the pull request at:

https://github.com/apache/spark/pull/3632





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread koertkuipers
Github user koertkuipers commented on the pull request:

https://github.com/apache/spark/pull/3632#issuecomment-68081928
  
Whoops, sorry, I hit the wrong button there. I didn't mean to close this pull 
request.

@markhamstra
I will try to update this pull request sometime in the first few weeks of 
January to address the issues you raised.





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-24 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/3632#issuecomment-68082299
  
@koertkuipers Don't get me wrong, I'm not arguing that the way that 
`PairRDDFunctions` and `OrderedRDDFunctions` work is objectively the best and 
unquestionably an exemplary use of implicits.  It is, however, what we've got 
and does follow a certain logic.  What I'm saying is that it is good to 
maintain that logic and pattern for an API extension that is similar to 
`OrderedRDDFunctions`.

Changing to a different pattern is something we could consider for Spark 
2.0 when we can break the established public API.  





[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-23 Thread koertkuipers
Github user koertkuipers commented on the pull request:

https://github.com/apache/spark/pull/3632#issuecomment-68011681
  
Hey @markhamstra,

I assume you are referring to the one groupByKeyAndSortValues method that 
has an implicit Ordering[V] parameter, since the other groupByKeyAndSortValues 
methods do not require an implicit Ordering to be available for the values (the 
ordering needs to be provided explicitly and is typically use-case specific, 
not the natural ordering of the values).

What is the benefit of an implicit conversion to a class that has this one 
method without the implicit parameter, versus a single method with an implicit 
parameter? It seems to me this makes the API harder to understand. Is this for 
Java interoperability, perhaps?
I have to admit I am not sure I understand the reason for 
OrderedRDDFunctions' existence either (again, two methods on PairRDDFunctions 
with implicits seem preferable to me over a new class).
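
To make the distinction concrete, the two styles can be contrasted on a plain Seq. This is a self-contained sketch that compiles without Spark; all names here are made up for illustration and appear nowhere in the PR.

```scala
import scala.language.implicitConversions

// Style 1 (this PR): one method that takes an implicit Ordering[V].
class PairFunctions[K, V](self: Seq[(K, V)]) {
  def groupByKeyAndSortValues(implicit ord: Ordering[V]): Map[K, Seq[V]] =
    self.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sorted }
}

// Style 2 (the OrderedRDDFunctions pattern): the enrichment class itself
// carries the Ordering context bound, the method has no implicit
// parameter, and an implicit conversion exposes the method only when an
// Ordering[V] exists.
class ValueOrderedPairFunctions[K, V: Ordering](self: Seq[(K, V)]) {
  def groupByKeyAndSortValues: Map[K, Seq[V]] =
    self.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sorted }
}

object Enrichment {
  implicit def toValueOrdered[K, V: Ordering](s: Seq[(K, V)]): ValueOrderedPairFunctions[K, V] =
    new ValueOrderedPairFunctions(s)
}
```

The call site looks identical either way; the practical difference is where a missing `Ordering[V]` is reported (at the method call versus at the implicit conversion), which is the API-design question being debated here.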






[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-20 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/3632#issuecomment-67757353
  
On a first pass, this doesn't look right.  If you are providing additional 
methods that should be available for `RDD[(K, V)]` where there is an `Ordering` 
available for both `K` and `V`, then the strategy should not be to add methods 
directly to `PairRDDFunctions`, but rather to go one step further than 
`OrderedRDDFunctions`, whose purpose is to provide additional methods for 
`RDD[(K, V)]` where there is an `Ordering` available for `K`.

Please read this comment: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala#L26

...and look at how the `Ordering` context bound is used:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala#L26

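For readers without the source at hand, the `Ordering` context bound being pointed at desugars to an implicit constructor parameter, roughly like this. This is a paraphrased sketch with an invented class name, not the verbatim Spark source.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// `K: Ordering` below is sugar for an extra implicit parameter
// `(implicit ord: Ordering[K])` on the constructor, so every method in
// the class can rely on the key ordering without redeclaring it.
class OrderedFunctionsSketch[K: Ordering: ClassTag, V: ClassTag](
    self: RDD[(K, V)]) extends Serializable {

  private val ordering = implicitly[Ordering[K]]

  // Example method using the class-level ordering directly.
  def sortByKeySketch(ascending: Boolean = true): RDD[(K, V)] =
    self.sortBy(_._1, ascending)(ordering, implicitly[ClassTag[K]])
}
```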




[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-07 Thread koertkuipers
GitHub user koertkuipers opened a pull request:

https://github.com/apache/spark/pull/3632

SPARK-3655 GroupByKeyAndSortValues

See https://issues.apache.org/jira/browse/SPARK-3655

This pull request is based on the approach that uses 
`repartitionAndSortWithinPartitions`, but it only implements 
GroupByKeyAndSortValues and not foldLeft.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tresata/spark 
feat-group-by-key-and-sort-values

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3632.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3632


commit 7e3cde989ec93849d60988e6d9fae729ca0c46a4
Author: Koert Kuipers ko...@tresata.com
Date:   2014-12-07T20:16:53Z

works but Iterables in signature are not right

commit 42075338a32c40e4b962b547dbc74aad89351207
Author: Koert Kuipers ko...@tresata.com
Date:   2014-12-07T21:57:25Z

change groupByKeyAndSortValues to return RDD[(K, TraversableOnce[V])] 
instead of RDD[(K, Iterable[V])]. I don't think the Iterable version can be 
implemented efficiently
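
The approach described in this PR can be sketched as follows. This is a simplified illustration, not the PR's code: `KeyOnlyPartitioner` and the method name are invented here, and each group is buffered in an ArrayBuffer for clarity, whereas the commit message's move from Iterable to TraversableOnce reflects the goal of streaming groups without such buffering.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

// Routes a composite (K, V) key by K alone, so all values for a key
// land in the same partition even though sorting uses the full pair.
class KeyOnlyPartitioner[K, V](partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int =
    delegate.getPartition(key.asInstanceOf[(K, V)]._1)
}

// Sort by (K, V) within partitions, then cut a new group whenever K changes.
def groupByKeyAndSortValuesSketch[K: Ordering: ClassTag, V: Ordering: ClassTag](
    rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, TraversableOnce[V])] = {
  implicit val pairOrdering: Ordering[(K, V)] = Ordering.Tuple2[K, V]
  rdd.map { case (k, v) => ((k, v), ()) }
    .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner[K, V](numPartitions))
    .mapPartitions { iter =>
      val pairs = iter.map(_._1).buffered
      new Iterator[(K, TraversableOnce[V])] {
        def hasNext: Boolean = pairs.hasNext
        def next(): (K, TraversableOnce[V]) = {
          val k = pairs.head._1
          val values = ArrayBuffer.empty[V]
          // The partition is sorted by (K, V), so runs of equal keys are
          // consecutive and each group's values arrive already in order.
          // (== assumes K has a consistent equals; Ordering.equiv would
          // be the stricter choice.)
          while (pairs.hasNext && pairs.head._1 == k) values += pairs.next()._2
          (k, values)
        }
      }
    }
}
```

Because the partitioner hashes only `K`, all pairs for a key end up in one partition, and the composite sort delivers them contiguously and value-sorted, so a single pass per partition suffices.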







[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues

2014-12-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3632#issuecomment-65958988
  
Can one of the admins verify this patch?

