My use case is also quite similar. I have two feeds, one 3TB and the other
100GB. Both feeds are generated by a Hadoop reduce operation and
partitioned by Hadoop's HashPartitioner. The 3TB feed has 10K partitions,
whereas the 100GB feed has 200 partitions.

Now when I do a join between these two feeds in Spark, Spark shuffles
both RDDs, which takes a long time to complete. Can we do something so
that Spark recognises the existing partitions of the 3TB feed and shuffles
only the 100GB feed?
That is, a map-side scan of the bigger RDD and a shuffle read of the smaller RDD?
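As a point of reference, for Spark to ever treat the Hadoop partitioning as its own, the two partition functions would have to agree. Below is a sketch (reimplemented here from memory, not the actual Hadoop/Spark sources, so please check them before relying on this): Hadoop's HashPartitioner masks the sign bit of the hash code before taking the modulo, while Spark's HashPartitioner uses a non-negative modulo, and the two agree whenever the key's hashCode is non-negative:

```scala
// Sketch: compare the partition index a key would get under Hadoop's
// HashPartitioner vs. Spark's HashPartitioner. Both are reimplemented
// here from memory -- verify against the real sources.
object PartitionerParity {
  // Hadoop-style: mask off the sign bit, then mod.
  def hadoopPartition(key: Any, numPartitions: Int): Int =
    (key.hashCode & Integer.MAX_VALUE) % numPartitions

  // Spark-style: non-negative modulo.
  def sparkPartition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val n = 10000 // e.g. the 10K partitions of the 3TB feed
    for (k <- Seq("user-1", "user-2", 42, 7L)) {
      println(s"$k -> hadoop=${hadoopPartition(k, n)} spark=${sparkPartition(k, n)}")
    }
  }
}
```

If the partition functions do line up, the remaining problem is getting Spark to believe the data is already partitioned (which, as far as I understand, is what SPARK-3655 is about). In the meantime, one workaround is to pay a single up-front `largeRdd.partitionBy(new HashPartitioner(10000)).persist()`; subsequent joins against that RDD should shuffle only the other side.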

I have looked at the spark-sorted project, but it does not make use of
the pre-existing partitions in the feed.
Any pointers would be helpful.

Thanks
Sourabh

On Thu, Mar 12, 2015 at 6:35 AM, Imran Rashid <iras...@cloudera.com> wrote:

> Hi Jonathan,
>
> you might be interested in
> https://issues.apache.org/jira/browse/SPARK-3655 (not yet available) and
> https://github.com/tresata/spark-sorted (not part of Spark, but it is
> available right now). Hopefully that's what you are looking for. To the
> best of my knowledge, that covers what is available now and what is being
> worked on.
>
> Imran
>
> On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney <jcove...@gmail.com>
> wrote:
>
>> Hello all,
>>
>> I am wondering whether Spark already has support for optimizations on
>> sorted data, and/or whether such support could be added (I am comfortable
>> dropping to a lower level if necessary to implement this, but I'm not sure
>> whether it is possible at all).
>>
>> Context: we have a number of data sets which are essentially already
>> sorted on a key. With our current systems, we can take advantage of this
>> to do a lot of analysis very efficiently: merges and joins, for example,
>> can be done very cheaply, as can folds on a secondary key, and so on.
>>
>> I was wondering whether Spark would be a good fit for implementing these
>> sorts of optimizations. It is admittedly something of a niche case, but
>> would it be achievable? Any pointers on where I should look?
>>
>
>
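For what it's worth, the merge-style join Jonathan describes is simple to express once both sides are iterated in key order; this is the single-pass, shuffle-free access pattern that pre-sorted data makes possible. A pure-Scala sketch of a sort-merge inner join (not Spark code, types are illustrative only):

```scala
// Sketch of a sort-merge inner join over two key-sorted sequences.
// Each side is scanned exactly once; no hashing or re-sorting needed.
object MergeJoin {
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)])
                   (implicit ord: Ordering[K]): Seq[(K, (V, W))] = {
    val out = Seq.newBuilder[(K, (V, W))]
    var i = 0
    var j = 0
    while (i < left.length && j < right.length) {
      val c = ord.compare(left(i)._1, right(j)._1)
      if (c < 0) i += 1        // left key too small: advance left
      else if (c > 0) j += 1   // right key too small: advance right
      else {
        // Keys match: pair this left record with every right record
        // sharing the key, then advance left. j stays put so the next
        // left record with the same key can also pair with them.
        val k = left(i)._1
        var jj = j
        while (jj < right.length && ord.equiv(right(jj)._1, k)) {
          out += ((k, (left(i)._2, right(jj)._2)))
          jj += 1
        }
        i += 1
      }
    }
    out.result()
  }
}
```

The same two-cursor pattern extends naturally to outer joins and to folds grouped by a secondary key, which is why pre-sorted inputs are so attractive for this kind of analysis.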
