I filed it and submitted the PR that Josh suggested:

https://spark-project.atlassian.net/browse/SPARK-1021
https://github.com/apache/incubator-spark/pull/379
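
For anyone skimming the thread, here is a minimal sketch of the lazy val idea Josh describes in the quoted message below. It is not the actual patch in the PR and it does not use Spark's classes: `FakeRdd`, `LazyRangePartitioner`, and `boundsComputations` are hypothetical names made up for illustration, and the bounds are computed by sorting an in-memory Seq rather than by running count and sample jobs.

```scala
// Hypothetical stand-in for an RDD: an id plus some keys. Not Spark's API.
final case class FakeRdd(id: Int, keys: Seq[Int])

// Hypothetical, simplified partitioner used only to illustrate the idea.
class LazyRangePartitioner(
    val partitions: Int,
    val rdd: FakeRdd,
    val ascending: Boolean = true) {

  // Counts how often the bounds are computed, to make the laziness observable.
  var boundsComputations: Int = 0

  // Deferred until first access, so constructing the partitioner does no work.
  // (Assumption: a plain in-memory sort stands in for sampling the RDD.)
  lazy val rangeBounds: Seq[Int] = {
    boundsComputations += 1
    val sorted = rdd.keys.sorted
    if (partitions <= 1 || sorted.isEmpty) Seq.empty
    else (1 until partitions).map { i =>
      sorted(math.min(sorted.length - 1, i * sorted.length / partitions))
    }
  }

  // Compares only the partition count, the source RDD id, and the sort order,
  // so an equality check never forces rangeBounds.
  override def equals(other: Any): Boolean = other match {
    case p: LazyRangePartitioner =>
      p.partitions == partitions && p.rdd.id == rdd.id && p.ascending == ascending
    case _ => false
  }

  override def hashCode(): Int = (partitions, rdd.id, ascending).hashCode()
}
```

With this shape, creating two partitioners over the same `FakeRdd` and comparing them leaves `boundsComputations` at 0; the bounds are computed once, on the first read of `rangeBounds`. The trade-off is the one Josh mentions: two partitioners built from different RDDs never compare equal, even if their bounds happen to match.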

On Wed, Jan 8, 2014 at 9:56 AM, Andrew Ash <and...@andrewash.com> wrote:

> And at the moment we should use the atlassian.net Jira instance, not the
> apache.org one? The apache one looks empty.
>
> https://spark-project.atlassian.net/browse/SPARK
> https://issues.apache.org/jira/browse/SPARK
>
>
> On Wed, Jan 8, 2014 at 9:04 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> Feel free to always file official bugs in Jira, as long as it's not
>> already there!
>>
>>
>> On Tue, Jan 7, 2014 at 9:47 PM, Andrew Ash <and...@andrewash.com> wrote:
>>
>>> Hi Josh,
>>>
>>> I just ran into this again myself and noticed that the source hasn't
>>> changed since we discussed in December. Should I file an official bug in
>>> Jira?
>>>
>>> Andrew
>>>
>>>
>>> On Tue, Dec 10, 2013 at 8:34 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>>>
>>>> I wonder whether making RangePartitioner.rangeBounds into a lazy val
>>>> would fix this (
>>>> https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>>>> We'd need to make sure that rangeBounds() is never called before an action
>>>> is performed. This could be tricky because it's called in the
>>>> RangePartitioner.equals() method. Maybe it's sufficient to just compare
>>>> the number of partitions, the ids of the RDDs used to create the
>>>> RangePartitioner, and the sort ordering. This still supports the case
>>>> where I range-partition one RDD and pass the same partitioner to a
>>>> different RDD. It breaks support for the case where two range partitioners
>>>> created on different RDDs happened to have the same rangeBounds(), but it
>>>> seems unlikely that this would really harm performance since it's probably
>>>> unlikely that the range partitioners are equal by chance.
>>>>
>>>>
>>>> On Tue, Dec 10, 2013 at 8:18 AM, Ryan Prenger <r...@tracevector.com> wrote:
>>>>
>>>>> Thanks for the responses! I agree that b seems like it would be
>>>>> better. I could imagine optimizations that could be made if a filter call
>>>>> came after the sortByKey that would make the initial partitioning
>>>>> sub-optimal. Plus this way, it's a pain to use in the REPL.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Ryan
>>>>>
>>>>>
>>>>> On Tue, Dec 10, 2013 at 7:06 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>
>>>>>> Since sortByKey() invokes those right now, we should either a) change
>>>>>> the documentation to note that it kicks off actions or b) change the
>>>>>> method to execute those things lazily.
>>>>>>
>>>>>> Personally I'd prefer b but don't know how difficult that would be.
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 1:52 AM, Jason Lenderman <jslender...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Ryan,
>>>>>>>
>>>>>>> The *sortByKey* method creates a *RangePartitioner* (see
>>>>>>> Partitioner.scala), and the initialization code of the
>>>>>>> *RangePartitioner* invokes actions *count* and *sample*.
>>>>>>>
>>>>>>>
>>>>>>> Jason
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Dec 9, 2013 at 7:01 PM, Ryan Prenger <r...@tracevector.com> wrote:
>>>>>>>
>>>>>>>> sortByKey is listed as a data transformation, not an action, yet it
>>>>>>>> launches a job. This doesn't seem to square with the documentation.
>>>>>>>>
>>>>>>>> Ryan