While I was at Two Sigma I ended up implementing something similar to what Koert describes below... you can check it out here: https://github.com/twosigma/flint/blob/master/src/main/scala/com/twosigma/flint/rdd/OrderedRDD.scala. They've built a lot more on top of it since, including DataFrame support.
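For anyone who wants a feel for the "trust me, i am already partitioned" idea Koert describes below, here is a rough sketch. It is not flint's actual API: every name is made up, and it assumes the big feed is read so that each Hadoop reducer output file becomes exactly one, unsplit Spark partition, in file order.

    import scala.reflect.ClassTag

    import org.apache.spark.{Partition, Partitioner, TaskContext}
    import org.apache.spark.rdd.RDD

    // Reproduces Hadoop's HashPartitioner placement; Spark's own
    // HashPartitioner disagrees with it for keys whose hashCode is negative.
    class HadoopHashPartitioner(override val numPartitions: Int)
        extends Partitioner {
      override def getPartition(key: Any): Int =
        (key.hashCode & Integer.MAX_VALUE) % numPartitions
    }

    // "Trust me": wraps an RDD and declares a partitioner without moving
    // any data. If the declared layout is not actually true, downstream
    // joins will silently produce wrong results.
    class AssumePartitionedRDD[K: ClassTag, V: ClassTag](
        prev: RDD[(K, V)],
        part: Partitioner)
      extends RDD[(K, V)](prev) {

      require(prev.partitions.length == part.numPartitions,
        "partition count must match the declared partitioner")

      override val partitioner: Option[Partitioner] = Some(part)

      override protected def getPartitions: Array[Partition] = prev.partitions

      override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
        prev.iterator(split, context)
    }

    // The 3TB/100GB case from the thread: join() adopts the big side's
    // partitioner, so the big feed stays put (narrow dependency) and only
    // the small feed is shuffled.
    def joinWithoutBigShuffle(
        bigFeed: RDD[(String, String)],   // 10K Hadoop reducer outputs, unsplit
        smallFeed: RDD[(String, String)]): RDD[(String, (String, String))] = {
      val big = new AssumePartitionedRDD(bigFeed, new HadoopHashPartitioner(10000))
      big.join(smallFeed)
    }

The custom partitioner is the "most likely need a custom hash partitioner" part: Spark's HashPartitioner and Hadoop's place keys with negative hashCodes in different partitions, so the declared layout has to use Hadoop's rule.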
2017-03-10 9:45 GMT-05:00 Koert Kuipers <ko...@tresata.com>:

> this shouldn't be too hard. adding something to spark-sorted or to the
> dataframe/dataset logical plan that says "trust me, i am already
> partitioned and sorted" seems doable. however you most likely need a
> custom hash partitioner, and you have to be careful to read the data in
> without file splitting.
>
> On Mar 10, 2017 9:10 AM, "sourabh chaki" <chaki.sour...@gmail.com> wrote:
>
>> My use case is also quite similar. I have two feeds, one 3TB and the
>> other 100GB. Both feeds are generated by a Hadoop reduce operation and
>> partitioned by Hadoop's HashPartitioner. The 3TB feed has 10K partitions
>> whereas the 100GB feed has 200 partitions.
>>
>> Now when I do a join between these two feeds using Spark, it shuffles
>> both RDDs, and that takes a long time to complete. Can we do something
>> so that Spark recognises the existing partitions of the 3TB feed and
>> shuffles only the 100GB feed? It could be a map-side scan for the bigger
>> RDD and a shuffle read from the smaller RDD.
>>
>> I have looked at the spark-sorted project, but it does not utilise the
>> pre-existing partitions in the feed. Any pointer will be helpful.
>>
>> Thanks
>> Sourabh
>>
>> On Thu, Mar 12, 2015 at 6:35 AM, Imran Rashid <iras...@cloudera.com>
>> wrote:
>>
>>> Hi Jonathan,
>>>
>>> you might be interested in
>>> https://issues.apache.org/jira/browse/SPARK-3655 (not yet available)
>>> and https://github.com/tresata/spark-sorted (not part of Spark, but
>>> available right now). Hopefully that's what you are looking for. To the
>>> best of my knowledge, that covers what is available now and what is
>>> being worked on.
>>>
>>> Imran
>>>
>>> On Wed, Mar 11, 2015 at 4:38 PM, Jonathan Coveney <jcove...@gmail.com>
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am wondering if Spark already has support for optimizations on
>>>> sorted data and/or whether such support could be added (I am
>>>> comfortable dropping to a lower level if necessary to implement this,
>>>> but I'm not sure if it is possible at all).
>>>>
>>>> Context: we have a number of data sets which are essentially already
>>>> sorted on a key. With our current systems, we can take advantage of
>>>> this to do a lot of analysis in a very efficient fashion... merges and
>>>> joins, for example, can be done very efficiently, as can folds on a
>>>> secondary key, and so on.
>>>>
>>>> Would Spark be a fit for implementing these sorts of optimizations?
>>>> Obviously it is something of a niche case, but would it be achievable?
>>>> Any pointers on where I should look?
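On Jonathan's original question above about folds on a secondary key: short of declaring an existing layout as sketched earlier, one built-in primitive that helps is repartitionAndSortWithinPartitions, which pays for one shuffle and then hands each partition to you as sorted runs. A minimal sketch, assuming composite (primaryKey: String, secondaryKey: Long) keys (the names and key types here are made up):

    import scala.reflect.ClassTag

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD

    // Route records by the primary key only, so every secondary key for a
    // given primary key lands in the same partition.
    class PrimaryKeyPartitioner(override val numPartitions: Int)
        extends Partitioner {
      override def getPartition(key: Any): Int =
        java.lang.Math.floorMod(
          key.asInstanceOf[(String, Long)]._1.hashCode, numPartitions)
    }

    // One shuffle: partition by primary key, sort by (primary, secondary).
    // Each partition then folds in a single streaming pass, since all the
    // values for one primary key arrive consecutively, in secondary-key order.
    def foldBySecondaryKey[A: ClassTag](
        rdd: RDD[((String, Long), A)],
        numPartitions: Int)(zero: => A)(f: (A, A) => A): RDD[(String, A)] =
      rdd
        .repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(numPartitions))
        .mapPartitions { it =>
          val buffered = it.buffered
          new Iterator[(String, A)] {
            def hasNext: Boolean = buffered.hasNext
            def next(): (String, A) = {
              val primary = buffered.head._1._1
              var acc = zero
              while (buffered.hasNext && buffered.head._1._1 == primary)
                acc = f(acc, buffered.next()._2)
              (primary, acc)
            }
          }
        }

spark-sorted generalizes this same shuffle-then-stream pattern, and SPARK-3655 is the ticket tracking a secondary-sort API in Spark itself.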