This shouldn't be too hard. Adding something to spark-sorted, or to the
DataFrame/Dataset logical plan, that says "trust me, I am already
partitioned and sorted" seems doable. However, you most likely need a custom
hash partitioner, and you have to be careful to read the data in without
file splitting.
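
For what it's worth, here is a rough, untested sketch of that idea on the RDD
API. Everything below is illustrative: the names (HadoopHashPartitioner,
AssumePartitionedRDD, readPrePartitioned) are made up, keys are assumed to be
Strings in tab-delimited lines, and one part-file per upstream reducer is
assumed.

import org.apache.spark.{Partition, Partitioner, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Mimics Hadoop's default HashPartitioner: (hash & Integer.MAX_VALUE) % numPartitions.
// Caveat: the upstream job hashed Writable keys (e.g. Text), whose hashCode is not
// the same as String.hashCode, so the hash here must match whatever the producing
// job actually used. equals/hashCode matter too: Spark only skips the shuffle when
// the two sides' partitioners compare equal.
class HadoopHashPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int =
    (key.hashCode & Integer.MAX_VALUE) % numPartitions
  override def equals(other: Any): Boolean = other match {
    case p: HadoopHashPartitioner => p.numPartitions == numPartitions
    case _                        => false
  }
  override def hashCode: Int = numPartitions
}

// "Trust me, I am already partitioned": wrap the parent RDD one-to-one and claim
// a partitioner without shuffling. Spark cannot check the claim; if the physical
// layout does not really match, joins will silently produce wrong results.
class AssumePartitionedRDD(prev: RDD[(String, String)], part: Partitioner)
    extends RDD[(String, String)](prev) {
  override val partitioner: Option[Partitioner] = Some(part)
  override protected def getPartitions: Array[Partition] = prev.partitions
  override def compute(split: Partition, context: TaskContext): Iterator[(String, String)] =
    prev.iterator(split, context)
}

def readPrePartitioned(sc: SparkContext, path: String, numParts: Int): RDD[(String, String)] = {
  // Blunt, global setting that makes the minimum split size effectively unlimited,
  // so each part-file becomes one Spark partition. This alone does not guarantee
  // that Spark partition i lines up with part-0000i; that ordering still has to
  // be verified against how the input format lists the files.
  sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.minsize",
    Long.MaxValue.toString)
  val kv = sc.textFile(path).map { line =>
    val Array(k, v) = line.split("\t", 2) // assumes "key<TAB>value" lines
    (k, v)
  }
  new AssumePartitionedRDD(kv, new HadoopHashPartitioner(numParts))
}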

On Mar 10, 2017 9:10 AM, "sourabh chaki" <chaki.sour...@gmail.com> wrote:

> My use case is also quite similar. I have two feeds, one 3TB and another
> 100GB. Both feeds are generated by a Hadoop reduce operation and
> partitioned by Hadoop's HashPartitioner. The 3TB feed has 10K partitions,
> whereas the 100GB feed has 200 partitions.
>
> Now, when I do a join between these two feeds using Spark, Spark shuffles
> both RDDs and it takes a long time to complete. Can we do something so
> that Spark recognises the existing partitions of the 3TB feed and shuffles
> only the 100GB feed?
> Could it be a map-side scan for the bigger RDD and a shuffle read from the
> smaller RDD?
>
> I have looked at the spark-sorted project, but that project does not utilise
> the pre-existing partitions in the feed.
> Any pointer will be helpful.
>
> Thanks
> Sourabh
>
> On Thu, Mar 12, 2015 at 6:35 AM, Imran Rashid <iras...@cloudera.com>
> wrote:
>
>> Hi Jonathan,
>>
>> You might be interested in https://issues.apache.org/jira/browse/SPARK-3655
>> (not yet available) and https://github.com/tresata/spark-sorted (not part of
>> Spark, but it is available right now). Hopefully that's what you are looking
>> for. To the best of my knowledge, that covers what is available now / what
>> is being worked on.
>>
>> Imran
>>
>> On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney <jcove...@gmail.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> I am wondering if Spark already has support for optimizations on sorted
>>> data, and/or whether such support could be added (I am comfortable dropping
>>> to a lower level if necessary to implement this, but I'm not sure whether
>>> it is possible at all).
>>>
>>> Context: we have a number of data sets which are essentially already
>>> sorted on a key. With our current systems, we can take advantage of this to
>>> do a lot of analysis very efficiently: merges and joins, for example, are
>>> cheap, as are folds on a secondary key, and so on.
>>>
>>> Would Spark be a fit for implementing these sorts of optimizations?
>>> Obviously it is something of a niche case, but would this be achievable?
>>> Any pointers on where I should look?
>>>
>>
>>
>

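Regarding Sourabh's question about shuffling only the smaller feed: reusing the
sketch above, and with made-up paths and partition counts, the join could look
roughly like this. Only the 100GB side gets shuffled, because both sides end up
with the same (asserted) partitioner.

// Assumes a live SparkContext `sc` plus the HadoopHashPartitioner,
// AssumePartitionedRDD and readPrePartitioned sketches above; paths are made up.
val part  = new HadoopHashPartitioner(10000)
val big   = readPrePartitioned(sc, "hdfs:///feeds/big-3tb", 10000) // no shuffle, partitioner asserted
val small = sc.textFile("hdfs:///feeds/small-100gb")
  .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
  .partitionBy(part)                                               // shuffles only the 100GB feed
val joined = big.join(small)                                       // co-partitioned, so no further shuffle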