This shouldn't be too hard. Adding something to spark-sorted, or to the DataFrame/Dataset logical plan, that says "trust me, I am already partitioned and sorted" seems doable. However, you most likely need a custom hash partitioner that places keys exactly where Hadoop's HashPartitioner did, and you have to be careful to read the data in without file splitting, so that each input file maps back to exactly one Spark partition. A sketch of the wrapper idea follows, and a usage sketch for the join case appears after the quoted thread.
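A minimal spark-shell-style sketch in Scala of what I mean. Everything here is hypothetical and untested (HadoopTextHashPartitioner and AssumePartitionedRDD are made-up names, not Spark APIs), and it assumes the feed was written with Text keys by Hadoop's default HashPartitioner and read back with one file per Spark partition:

import scala.reflect.ClassTag

import org.apache.hadoop.io.Text
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

// Hadoop's HashPartitioner places a record in partition
// (key.hashCode & Integer.MAX_VALUE) % numPartitions, where key is the
// *Writable* (e.g. Text). Text.hashCode differs from String.hashCode, which
// is why a custom partitioner is needed on the Spark side: rehash String
// keys through Text to reproduce the writer's placement.
class HadoopTextHashPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int =
    (new Text(key.toString).hashCode & Int.MaxValue) % numPartitions
}

// "Trust me" wrapper: asserts that `prev` is already laid out according to
// `part` (one Hadoop output file per Spark partition, read without file
// splitting, partition indices matching part-file numbers). No data moves;
// correctness rests entirely on that promise being true.
class AssumePartitionedRDD[K, V](prev: RDD[(K, V)], part: Partitioner)
                                (implicit ct: ClassTag[(K, V)])
    extends RDD[(K, V)](prev) {
  override val partitioner: Option[Partitioner] = Some(part)
  override protected def getPartitions: Array[Partition] = prev.partitions
  override def compute(split: Partition, ctx: TaskContext): Iterator[(K, V)] =
    prev.iterator(split, ctx)
}

The rehash through Text is the whole reason a custom partitioner is needed: Hadoop partitioned on the Writable's hashCode, and Text.hashCode (a hash over the UTF-8 bytes) does not agree with java.lang.String.hashCode.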
On Mar 10, 2017 9:10 AM, "sourabh chaki" <chaki.sour...@gmail.com> wrote:

> My use case is also quite similar. I have 2 feeds. One is 3TB and the
> other 100GB. Both feeds are generated by a Hadoop reduce operation and
> partitioned by Hadoop's HashPartitioner. The 3TB feed has 10K partitions,
> whereas the 100GB feed has 200 partitions.
>
> Now when I join these two feeds using Spark, Spark shuffles both RDDs,
> and it takes a long time to complete. Can we do something so that Spark
> recognises the existing partitions of the 3TB feed and shuffles only the
> 100GB feed? Could it be a map-side scan for the bigger RDD and a shuffle
> read from the smaller RDD?
>
> I have looked at the spark-sorted project, but that project does not
> utilise the pre-existing partitions in the feed.
> Any pointer will be helpful.
>
> Thanks
> Sourabh
>
> On Thu, Mar 12, 2015 at 6:35 AM, Imran Rashid <iras...@cloudera.com> wrote:
>
>> Hi Jonathan,
>>
>> you might be interested in
>> https://issues.apache.org/jira/browse/SPARK-3655 (not yet available) and
>> https://github.com/tresata/spark-sorted (not part of Spark, but it is
>> available right now). Hopefully that's what you are looking for. To the
>> best of my knowledge, that covers what is available now / what is being
>> worked on.
>>
>> Imran
>>
>> On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> I am wondering if Spark already has support for optimizations on
>>> sorted data and/or if such support could be added (I am comfortable
>>> dropping to a lower level if necessary to implement this, but I'm not
>>> sure if it is possible at all).
>>>
>>> Context: we have a number of data sets which are essentially already
>>> sorted on a key. With our current systems, we can take advantage of
>>> this to do a lot of analysis in a very efficient fashion: merges and
>>> joins, for example, can be done very efficiently, as can folds on a
>>> secondary key, and so on.
>>>
>>> I was wondering if Spark would be a fit for implementing these sorts
>>> of optimizations? Obviously it is sort of a niche case, but would this
>>> be achievable? Any pointers on where I should look?
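To make the "shuffle only the small feed" idea from Sourabh's mail concrete: in Spark's cogroup/join machinery, any input whose partitioner equals the join's partitioner gets a narrow one-to-one dependency instead of a shuffle dependency. A hedged usage sketch building on the classes above; readFeedWithoutSplitting, the paths, and the split-size trick are assumptions to verify against your Hadoop version, not established API usage:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("presorted-join"))

// Hypothetical helper: force one input split per file so the Hadoop
// partition layout survives the read. On Hadoop 2.x the old-API
// FileInputFormat also honours this key; it further assumes the part files
// enumerate in part-number order, so Spark partition i holds Hadoop
// partition i.
def readFeedWithoutSplitting(path: String): RDD[(String, String)] = {
  sc.hadoopConfiguration.set(
    "mapreduce.input.fileinputformat.split.minsize", Long.MaxValue.toString)
  sc.sequenceFile[String, String](path)
}

val part = new HadoopTextHashPartitioner(10000) // the 3TB feed's 10K partitions

// Assert the big feed's existing layout; nothing is shuffled here.
val big   = new AssumePartitionedRDD(readFeedWithoutSplitting("/data/feed-3tb"), part)
val small = sc.sequenceFile[String, String]("/data/feed-100gb")

// Because big.partitioner == Some(part), the cogroup underlying this join
// gives `big` a narrow (one-to-one) dependency and shuffles only `small`,
// i.e. only the 100GB side moves over the network.
val joined = big.join(small, part)

If any of the layout promises are wrong (split files, reordered partitions, a different hash at write time), the join will silently drop matches rather than fail, so this is worth validating on a small feed first.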