[GitHub] spark pull request: SPARK-3655 GroupByKeyAndSortValues
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/3632

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-9676 Can one of the admins verify this patch?
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22770867

--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala (new file; Apache license header omitted) ---

```scala
package org.apache.spark.util

private[spark] class HashOrdering[A] extends Ordering[A] {
  override def compare(x: A, y: A): Int = {
    val h1 = if (x == null) 0 else x.hashCode()
    val h2 = if (y == null) 0 else y.hashCode()
    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
  }
}

private[spark] class NoOrdering[A] extends Ordering[A] {
  override def compare(x: A, y: A): Int = 0
}

private[spark] class KeyValueOrdering[A, B](
    ordering1: Option[Ordering[A]], ordering2: Option[Ordering[B]]
) extends Ordering[Product2[A, B]] {
  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
  private val ord2 = ordering2.getOrElse(new NoOrdering[B])

  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
    val c1 = ord1.compare(x._1, y._1)
    if (c1 != 0) c1 else ord2.compare(x._2, y._2)
  }
}
```

--- End diff --

@markhamstra any preference between the 2 options?
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22420650

--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---

```scala
  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
```

--- End diff --

What is the expected scenario in which a `KeyValueOrdering` is called for with `B` unordered? You're setting up `KeyValueOrdering` to be more general than your needs for its only current usage in `OrderedValueRDDFunctions`, but I'm not quite grasping how and where else you are expecting `KeyValueOrdering` to be used. It seems to me that `KeyValueOrdering` should have two ctors:

```scala
KeyValueOrdering[A, B](keyOrdering: Ordering[A], valueOrdering: Ordering[B])
...
this(valueOrdering: Ordering[B]) = this(new HashOrdering[A], valueOrdering)
```
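The two-ctor suggestion might look like the following self-contained sketch. All of it is hypothetical illustration rather than code from this PR, and it uses `Tuple2` in place of `Product2` for brevity:

```scala
// Hash-based ordering, as in the PR's util/Ordering.scala.
class HashOrdering[A] extends Ordering[A] {
  override def compare(x: A, y: A): Int = {
    val h1 = if (x == null) 0 else x.hashCode()
    val h2 = if (y == null) 0 else y.hashCode()
    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
  }
}

// Two-constructor variant: explicit orderings for both sides, plus an
// auxiliary ctor that defaults the key side to HashOrdering.
class KeyValueOrdering[A, B](keyOrdering: Ordering[A], valueOrdering: Ordering[B])
    extends Ordering[(A, B)] {
  def this(valueOrdering: Ordering[B]) = this(new HashOrdering[A], valueOrdering)

  override def compare(x: (A, B), y: (A, B)): Int = {
    val c1 = keyOrdering.compare(x._1, y._1)
    if (c1 != 0) c1 else valueOrdering.compare(x._2, y._2)
  }
}

// Keys tie, so the value ordering decides.
val kv = new KeyValueOrdering[String, Int](Ordering.String, Ordering.Int)
println(kv.compare(("a", 2), ("a", 1)))  // 1

// Auxiliary ctor: keys compared by hashCode, values explicitly.
println(new KeyValueOrdering[String, Int](Ordering.Int).compare(("a", 1), ("a", 2)))  // -1
```

This removes the `Option` plumbing and `NoOrdering` entirely; a caller who really wants unordered values can still pass an ordering that always returns 0.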
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22422723

--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---

```scala
  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
    val c1 = ord1.compare(x._1, y._1)
    if (c1 != 0) c1 else ord2.compare(x._2, y._2)
```

--- End diff --

What happens when `ord1` is `HashOrdering` and `c1 == 0` but `x._1 != y._1`? More generally, what happens when `ord1` isn't actually a full ordering?
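The collision hazard can be made concrete. Below is a minimal, self-contained sketch (not code from this PR) built on the well-known JVM `String` collision `"Aa".hashCode == "BB".hashCode`:

```scala
// Re-creation of the HashOrdering from the diff: it orders by hashCode only,
// so two distinct keys with colliding hashes compare as equal.
class HashOrdering[A] extends Ordering[A] {
  override def compare(x: A, y: A): Int = {
    val h1 = if (x == null) 0 else x.hashCode()
    val h2 = if (y == null) 0 else y.hashCode()
    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
  }
}

val ord = new HashOrdering[String]
// "Aa" and "BB" are a classic String hashCode collision on the JVM.
assert("Aa".hashCode == "BB".hashCode)
// compare reports equality for two distinct keys: only a partial ordering.
assert(ord.compare("Aa", "BB") == 0)
// A stable sort then leaves the two keys interleaved rather than grouped.
val sorted = Seq(("Aa", 1), ("BB", 2), ("Aa", 3)).sortBy(_._1)(ord)
println(sorted.map(_._1).mkString(","))  // Aa,BB,Aa
```

A consumer that assumes "sorted by key" means "each key's pairs are contiguous" would silently produce wrong groups here.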
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22421802

--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---

```scala
private[spark] class HashOrdering[A] extends Ordering[A] {
  override def compare(x: A, y: A): Int = {
    val h1 = if (x == null) 0 else x.hashCode()
    val h2 = if (y == null) 0 else y.hashCode()
    if (h1 < h2) -1 else if (h1 == h2) 0 else 1
  }
}
```

--- End diff --

`ExternalSorter#keyComparator` should be refactored to use `spark.util.HashOrdering`.
Github user markhamstra commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22422719

--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---

```scala
private[spark] class HashOrdering[A] extends Ordering[A] {
```

--- End diff --

This isn't actually true. The `compare` method only produces a partial ordering. `ExternalSorter#keyComparator` gets away with the `Ordering[K]` falsehood only because later passes resolve hash collisions.
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22428452

--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---

```scala
  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
    val c1 = ord1.compare(x._1, y._1)
    if (c1 != 0) c1 else ord2.compare(x._2, y._2)
```

--- End diff --

i see 2 options:

1) do something similar to what happens in ExternalSorter.mergeWithAggregation, where in groupByKeyAndSortValues i am aware that i might be processing multiple keys (with the same hashCode) at once and check for key equality. this increases the memory requirements (all values for all keys with the same hashCode have to fit in memory, as opposed to all values for a single key).

2) require an ordering for K which can be used as a tie breaker when the hashCodes of the keys are the same, so that i have a total ordering for K.

thoughts? i will add a unit test where i have multiple keys with the same hashCode.
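Option 2 can be sketched as a hash-first comparison that falls back to the key's own `Ordering` on ties. The class name below is hypothetical, not code from this PR:

```scala
// Hypothetical sketch of option 2: compare by hashCode first (cheap), and
// break hash ties with a real Ordering[A], yielding a total ordering.
class HashThenKeyOrdering[A](tieBreaker: Ordering[A]) extends Ordering[A] {
  override def compare(x: A, y: A): Int = {
    val h1 = if (x == null) 0 else x.hashCode()
    val h2 = if (y == null) 0 else y.hashCode()
    if (h1 != h2) Integer.compare(h1, h2)
    else tieBreaker.compare(x, y)  // resolves "Aa" vs "BB"-style collisions
  }
}

val ord = new HashThenKeyOrdering[String](Ordering.String)
// Colliding hashes no longer compare as equal...
assert(ord.compare("Aa", "BB") != 0)
// ...so a sort groups equal keys contiguously instead of interleaving them.
val sorted = Seq(("Aa", 1), ("BB", 2), ("Aa", 3)).sortBy(_._1)(ord)
println(sorted.map(_._1).mkString(","))  // Aa,Aa,BB
```

The trade-off against option 1 is an extra `Ordering[K]` requirement on the caller instead of extra memory in the consumer.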
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22423736

--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---

```scala
  override def compare(x: Product2[A, B], y: Product2[A, B]): Int = {
    val c1 = ord1.compare(x._1, y._1)
    if (c1 != 0) c1 else ord2.compare(x._2, y._2)
```

--- End diff --

good point, that doesn't look right. it could lead to keys being interleaved in the output.
Github user koertkuipers commented on a diff in the pull request: https://github.com/apache/spark/pull/3632#discussion_r22423573

--- Diff: core/src/main/scala/org/apache/spark/util/Ordering.scala ---

```scala
  private val ord1 = ordering1.getOrElse(new HashOrdering[A])
  private val ord2 = ordering2.getOrElse(new NoOrdering[B])
```

--- End diff --

yeah thats right, i copied it from another pullreq by me that needed a more general version. i can simplify it.
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68150977 @markhamstra take a look now. i ignored the situation of K and V having the same type, since i think it can be dealt with by using a simple wrapper (value) class for the Vs.
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68069421

The reason for separate classes is to cleanly segregate the available/supportable functionality. Not every `PairRDD` has keys that can be ordered, so `sortByKey` shouldn't be part of `PairRDD`. When keys can be ordered, there is often a natural ordering that is already implicitly in scope. When that is true, we don't want to force the user to explicitly provide an `Ordering` -- e.g. if you have an `RDD[(Int, Foo)]`, then `rdd.sortByKey()` should just work. If you want a different ordering, you just need to bring a new implicit `Ordering` for that key type into scope.

Things aren't as cleanly separated in the Java API because of the lack of support for implicits there, but that doesn't mean that we should abandon the separation between `PairRDD` and `OrderedRDD` on the Scala side, or start dirtying up `PairRDD.scala` when we want to provide new methods for RDDs whose keys and values can both be ordered. I really think that we want to repeat the pattern of `OrderedRDD` for this `DoublyOrderedRDD` -- or whatever better name you can come up with.

The biggest quirk I can see right now is if the types of both keys and values are the same but you want to order them one way when sorting by key and a different way when doing the secondary sort on values. That won't work with implicits, since there can only be one implicit `Ordering` for the type in scope at a time. The problem could either be avoided by using distinct types for the key and value roles, or a method signature with explicit orderings could be added to address this corner case.
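The pattern being defended -- extension methods that appear only when an implicit `Ordering` for the key is in scope -- can be illustrated without Spark. Everything below is a hypothetical miniature using a plain `Seq` in place of an `RDD`:

```scala
import scala.language.implicitConversions

// Mimics the OrderedRDDFunctions idea: sortByKey exists only for pair
// collections whose key type K has an implicit Ordering in scope.
class OrderedPairFunctions[K: Ordering, V](self: Seq[(K, V)]) {
  def sortByKey(ascending: Boolean = true): Seq[(K, V)] = {
    val ord = implicitly[Ordering[K]]
    self.sortBy(_._1)(if (ascending) ord else ord.reverse)
  }
}

// The implicit conversion only applies when an Ordering[K] can be summoned,
// so sortByKey is simply unavailable for unordered key types.
implicit def toOrderedPairFunctions[K: Ordering, V](
    self: Seq[(K, V)]): OrderedPairFunctions[K, V] =
  new OrderedPairFunctions(self)

// "Just works" for Int keys via the natural implicit Ordering[Int].
println(Seq((3, "c"), (1, "a"), (2, "b")).sortByKey())
```

Overriding the ordering means bringing a different implicit `Ordering[K]` into scope, which is exactly the ergonomic point koertkuipers objects to below.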
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68081488

mhhh i dont really agree with you. i find OrderedRDD confusing because:

1) you kind of have to know that there is an implicit conversion to OrderedRDD somewhere out there that only works under certain conditions. a method on PairRDDFunctions with an implicit Ordering[V] parameter is part of the public API and also very clear in the restrictions it imposes (since they are documented in the required implicit parameter that is part of the API).

2) its much harder to override the Ordering[V]. forcing the user to bring a new implicit Ordering[V] into scope just to use it once is a somewhat bad use of implicits (and hard to debug). an implicit parameter should allow for an explicit override in my opinion.

anyhow, you are right that what i did does not conform with OrderedRDD. so if others agree with you i will rewrite it like that, no problem! good point on the corner case of K and V having the same type... let me think about that.
GitHub user koertkuipers reopened a pull request: https://github.com/apache/spark/pull/3632

SPARK-3655 GroupByKeyAndSortValues

See https://issues.apache.org/jira/browse/SPARK-3655. This pullreq is based on the approach that uses repartitionAndSortWithinPartition, but only implements GroupByKeyAndSortValues and not foldLeft.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tresata/spark feat-group-by-key-and-sort-values

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3632.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3632

commit 7e3cde989ec93849d60988e6d9fae729ca0c46a4
Author: Koert Kuipers ko...@tresata.com
Date: 2014-12-07T20:16:53Z
    works but Iterables in signature are not right

commit 42075338a32c40e4b962b547dbc74aad89351207
Author: Koert Kuipers ko...@tresata.com
Date: 2014-12-07T21:57:25Z
    change groupByKeyAndSortValues to return RDD[(K, TraversableOnce[V])] instead of RDD[(K, Iterable[V])]. i dont think the Iterable version can be implemented efficiently

commit 4f7defe86c514f3d153feaed804cf77f1d402f63
Author: Koert Kuipers ko...@tresata.com
Date: 2014-12-10T14:44:18Z
    change groupByKeyAndSortValues to return RDD[(K, Iterable[V])] where the values (the iterables) are in-memory arrays
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68081763 i will work on an updated version early january
Github user koertkuipers closed the pull request at: https://github.com/apache/spark/pull/3632
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68081928 woops sorry i hit the wrong button there. didnt mean to close this pullreq. @markhamstra i will try to update this pullreq sometime in first few weeks of january to address issues you raised
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68082299 @koertkuipers Don't get me wrong, I'm not arguing that the way that `PairRDDFunctions` and `OrderedRDDFunctions` work is objectively the best and unquestionably an exemplary use of implicits. It is, however, what we've got, and it does follow a certain logic. What I'm saying is that it is good to maintain that logic and pattern for an API extension that is similar to `OrderedRDDFunctions`. Changing to a different pattern is something we could consider for Spark 2.0, when we can break the established public API.
Github user koertkuipers commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-68011681 hey @markhamstra i assume you are referring to the one method groupByKeyAndSortValues that has an implicit Ordering[V] parameter, since the other groupByKeyAndSortValues methods do not require an implicit Ordering to be available for the values (it needs to be explicitly provided and is typically use-case specific, not the natural ordering of the values). what is the benefit of an implicit conversion to a class that will have this one method without the implicit parameter, versus a single method with an implicit parameter? it seems to me this makes the api harder to understand. is this for java perhaps? i have to admit i am not sure i understand the reason for OrderedRDDFunctions existence either (again, 2 methods on PairRDDFunctions with implicits seem preferable to me over a new class).
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-67757353 On a first pass, this doesn't look right. If you are providing additional methods that should be available for `RDD[(K, V)]` where there is an `Ordering` available for both `K` and `V`, then the strategy you should be following is not to add methods directly to `PairRDD`, but rather to go one step further than does `OrderedRDD`, whose purpose is to provide additional methods for `RDD[(K, V)]` where there is an `Ordering` available for `K`. Please read this comment and look at how the `Ordering` context bound is used: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala#L26
GitHub user koertkuipers opened a pull request: https://github.com/apache/spark/pull/3632

SPARK-3655 GroupByKeyAndSortValues

See https://issues.apache.org/jira/browse/SPARK-3655. This pullreq is based on the approach that uses repartitionAndSortWithinPartition, but only implements GroupByKeyAndSortValues and not foldLeft.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tresata/spark feat-group-by-key-and-sort-values

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3632.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #3632

commit 7e3cde989ec93849d60988e6d9fae729ca0c46a4
Author: Koert Kuipers ko...@tresata.com
Date: 2014-12-07T20:16:53Z
    works but Iterables in signature are not right

commit 42075338a32c40e4b962b547dbc74aad89351207
Author: Koert Kuipers ko...@tresata.com
Date: 2014-12-07T21:57:25Z
    change groupByKeyAndSortValues to return RDD[(K, TraversableOnce[V])] instead of RDD[(K, Iterable[V])]. i dont think the Iterable version can be implemented efficiently
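The repartitionAndSortWithinPartition approach can be sketched in miniature without Spark: within a partition, sort the (key, value) pairs by a composite (key, value) ordering, then slice out contiguous runs of equal keys, so each key's values emerge already sorted. All names below are hypothetical illustrations, not code from this PR:

```scala
// Miniature of the secondary-sort idea behind groupByKeyAndSortValues:
// sort pairs by (key, value) so equal keys are contiguous and their values
// arrive in order, then fold contiguous runs into (key, sorted-values) groups.
def groupByKeyAndSortValues[K: Ordering, V: Ordering](
    pairs: Seq[(K, V)]): Seq[(K, Seq[V])] = {
  val sorted = pairs.sorted(Ordering.Tuple2(Ordering[K], Ordering[V]))
  // Contiguous runs share a key because the sort grouped them together.
  sorted.foldRight(List.empty[(K, List[V])]) {
    case ((k, v), (k2, vs) :: rest) if k == k2 => (k, v :: vs) :: rest
    case ((k, v), acc)                         => (k, List(v)) :: acc
  }
}

val out = groupByKeyAndSortValues(Seq(("b", 2), ("a", 3), ("b", 1), ("a", 1)))
println(out)  // List((a,List(1, 3)), (b,List(1, 2)))
```

In Spark the sort is done per partition by the shuffle (via repartitionAndSortWithinPartition), so only one key's values need to be materialized at a time rather than the whole partition.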
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3632#issuecomment-65958988 Can one of the admins verify this patch?