Re: Why does sortByKey launch cluster job?

Aaron Davidson Wed, 08 Jan 2014 09:06:22 -0800

Feel free to always file official bugs in Jira, as long as it's not already
there!



On Tue, Jan 7, 2014 at 9:47 PM, Andrew Ash <and...@andrewash.com> wrote:

> Hi Josh,
>
> I just ran into this again myself and noticed that the source hasn't
> changed since we discussed in December.  Should I file an official bug in
> Jira?
>
> Andrew
>
>
> On Tue, Dec 10, 2013 at 8:34 AM, Josh Rosen <rosenvi...@gmail.com> wrote:
>
>> I wonder whether making RangePartitoner .rangeBounds into a lazy val
>> would fix this (
>> https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>>  We'd need to make sure that rangeBounds() is never called before an action
>> is performed.  This could be tricky because it's called in the
>> RangePartitioner.equals() method.  Maybe it's sufficient to just compare
>> the number of partitions, the ids of the RDDs used to create the
>> RangePartitioner, and the sort ordering.  This still supports the case
>> where I range-partition one RDD and pass the same partitioner to a
>> different RDD.  It breaks support for the case where two range partitioners
>> created on different RDDs happened to have the same rangeBounds(), but it
>> seems unlikely that this would really harm performance since it's probably
>> unlikely that the range partitioners are equal by chance.
>>
>>
>> On Tue, Dec 10, 2013 at 8:18 AM, Ryan Prenger <r...@tracevector.com>wrote:
>>
>>> Thanks for the responses!  I agree that b seems like it would be better.
>>>  I could imagine optimizations that could be made if a filter call came
>>> after the sortByKey that would make the initial partitioning sub-optimal.
>>>  Plus this way, it's a pain to use in the REPL.
>>>
>>> Cheers,
>>>
>>> Ryan
>>>
>>>
>>> On Tue, Dec 10, 2013 at 7:06 AM, Andrew Ash <and...@andrewash.com>wrote:
>>>
>>>> Since sortByKey() invokes those right now, we should either a) change
>>>> the documentation to treat note that it kicks off actions or b) change the
>>>> method to execute those things lazily.
>>>>
>>>> Personally I'd prefer b but don't know how difficult that would be.
>>>>
>>>>
>>>> On Tue, Dec 10, 2013 at 1:52 AM, Jason Lenderman <jslender...@gmail.com
>>>> > wrote:
>>>>
>>>>> Hey Ryan,
>>>>>
>>>>> The *sortByKey* method creates a *RangePartitioner* (see
>>>>> Partitioner.scala), and the initialization code of the
>>>>> *RangePartitioner* invokes actions *count* and *sample*.
>>>>>
>>>>>
>>>>> Jason
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Dec 9, 2013 at 7:01 PM, Ryan Prenger <r...@tracevector.com>wrote:
>>>>>
>>>>>> sortByKey is listed as a data transformation, not an action, yet it
>>>>>> launches a job.  This doesn't seem to square with the documentation.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Why does sortByKey launch cluster job?

Reply via email to