While I was at Two Sigma I ended up implementing something similar to what
Koert described... you can check it out here:
https://github.com/twosigma/flint/blob/master/src/main/scala/com/twosigma/flint/rdd/OrderedRDD.scala.
They've built a lot more on top of this (including support for DataFrames,
etc.).

2017-03-10 9:45 GMT-05:00 Koert Kuipers <ko...@tresata.com>:

> This shouldn't be too hard. Adding something to spark-sorted or to the
> DataFrame/Dataset logical plan that says "trust me, I am already
> partitioned and sorted" seems doable. However, you most likely need a
> custom hash partitioner, and you have to be careful to read the data in
> without file splitting.
>
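
To make that concrete, here is a rough sketch of the "trust me, I am already
partitioned" idea (illustrative only, not the Flint or spark-sorted API; the
class names are made up). It assumes the feed was written by a Hadoop job
using Hadoop's HashPartitioner and is read back with exactly one Spark
partition per part file, in part-file order (getting that read path right is
a separate problem):

import scala.reflect.ClassTag
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Reproduces Hadoop's HashPartitioner placement:
//   (key.hashCode & Integer.MAX_VALUE) % numPartitions
// which can differ from Spark's HashPartitioner for keys whose hashCode is
// negative -- hence the need for a custom partitioner.
class HadoopHashPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int =
    (key.hashCode & Integer.MAX_VALUE) % numPartitions
  override def equals(other: Any): Boolean = other match {
    case p: HadoopHashPartitioner => p.numPartitions == numPartitions
    case _                        => false
  }
  override def hashCode(): Int = numPartitions
}

// "Trust me" wrapper: attaches a partitioner to an RDD whose on-disk layout
// already matches it. No data moves; if the claim is wrong, a later join
// will silently misplace records.
class AssumePartitionedRDD[K, V](prev: RDD[(K, V)], part: Partitioner)
                                (implicit kvTag: ClassTag[(K, V)])
    extends RDD[(K, V)](prev) {

  require(prev.partitions.length == part.numPartitions,
    "number of partitions must match the claimed partitioner")

  override val partitioner: Option[Partitioner] = Some(part)

  override protected def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, context)
}
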
> On Mar 10, 2017 9:10 AM, "sourabh chaki" <chaki.sour...@gmail.com> wrote:
>
>> My use case is also quite similar. I have 2 feeds: one 3TB and another
>> 100GB. Both feeds are generated by a Hadoop reduce operation and
>> partitioned by the Hadoop HashPartitioner. The 3TB feed has 10K partitions
>> whereas the 100GB feed has 200 partitions.
>>
>> Now when I do a join between these two feeds using Spark, Spark shuffles
>> both RDDs and it takes a long time to complete. Can we do something so
>> that Spark can recognise the existing partitions of the 3TB feed and
>> shuffle only the 100GB feed?
>> Could it be a map-side scan for the bigger RDD and a shuffle read from the
>> smaller RDD?
>>
>> I have looked at the spark-sorted project, but that project does not
>> utilise the pre-existing partitions in the feed.
>> Any pointer will be helpful.
>>
>> Thanks
>> Sourabh
>>
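
For the 3TB / 100GB scenario above, the payoff with a wrapper like the one
sketched earlier would look roughly like this (illustrative; readPartFiles is
a stand-in for whatever reads one Spark partition per Hadoop part file, and
the key/value types are made up):

// Hypothetical readers: the big feed must come back as exactly one Spark
// partition per Hadoop part file, in part-file order.
val rawBig: RDD[(String, String)] = readPartFiles(sc, "hdfs:///feeds/big")   // 10K part files
val small:  RDD[(String, String)] = readPartFiles(sc, "hdfs:///feeds/small") // 200 part files

val big = new AssumePartitionedRDD(rawBig, new HadoopHashPartitioner(10000))

// join() picks up the partitioner of the side that has one, so only the
// ~100GB side is shuffled into the existing 10K partitions; the 3TB side is
// scanned in place (map-side), which is the behaviour asked about above.
val joined = big.join(small)
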
>> On Thu, Mar 12, 2015 at 6:35 AM, Imran Rashid <iras...@cloudera.com>
>> wrote:
>>
>>> Hi Jonathan,
>>>
>>> you might be interested in https://issues.apache.org/jira/browse/SPARK-3655
>>> (not yet available) and https://github.com/tresata/spark-sorted (not part
>>> of Spark, but it is available right now). Hopefully that's what you are
>>> looking for. To the best of my knowledge that covers what is available
>>> now / what is being worked on.
>>>
>>> Imran
>>>
>>> On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney <jcove...@gmail.com>
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am wondering if Spark already has support for optimizations on sorted
>>>> data and/or if such support could be added (I am comfortable dropping to a
>>>> lower level if necessary to implement this, but I'm not sure if it is
>>>> possible at all).
>>>>
>>>> Context: we have a number of data sets which are essentially already
>>>> sorted on a key. With our current systems, we can take advantage of this to
>>>> do a lot of analysis in a very efficient fashion...merges and joins, for
>>>> example, can be done very efficiently, as can folds on a secondary key and
>>>> so on.
>>>>
>>>> I was wondering if Spark would be a fit for implementing these sorts of
>>>> optimizations? Obviously it is sort of a niche case, but would this be
>>>> achievable? Any pointers on where I should look?
>>>>
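
On the folds-on-a-secondary-key point: repartitionAndSortWithinPartitions can
already give you values grouped by key and sorted by a secondary key inside
each partition, so the fold itself is a single streaming pass with no
groupByKey. (This still shuffles once; exploiting data that is already sorted
on disk is what SPARK-3655 and the wrappers sketched earlier are about.) A
rough sketch with made-up field names, for events keyed by (userId,
timestamp) where we fold each user's values in time order:

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Partition by userId only, so each user's events land in one partition,
// while the sort key is the full (userId, timestamp) pair.
class KeyOnlyPartitioner(val partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case (userId: String, _) => (userId.hashCode & Integer.MAX_VALUE) % partitions
  }
}

def foldByTime(events: RDD[((String, Long), Double)],
               numPartitions: Int): RDD[(String, Double)] = {
  events
    // default tuple ordering sorts by userId first, then timestamp
    .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(numPartitions))
    .mapPartitions { iter =>
      // records arrive grouped by user, each group in time order
      val out = scala.collection.mutable.ArrayBuffer.empty[(String, Double)]
      var currentUser: String = null
      var acc = 0.0
      for (((user, _), value) <- iter) {
        if (currentUser != null && currentUser != user) {
          out += ((currentUser, acc))
          acc = 0.0
        }
        currentUser = user
        acc += value // swap in any order-dependent fold here
      }
      if (currentUser != null) out += ((currentUser, acc))
      out.iterator
    }
}
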
>>>
>>>
>>
