When using partitioning with DataFrames, how can we retrieve information about the partitions, for example the partition bounds?

Thanks,
Shaira

2016-07-07 6:30 GMT+02:00 Koert Kuipers <ko...@tresata.com>:

> spark does keep some information on the partitions of an RDD, namely the
> partitioning/partitioner.
>
> GroupSorted is an extension for key-value RDDs that also keeps track of
> the ordering, allowing for faster joins, non-reduce type operations on very
> large groups of values per key, etc.
> see here:
> https://github.com/tresata/spark-sorted
> however no support for streaming (yet)...
>
> On Wed, Jul 6, 2016 at 11:55 PM, Omid Alipourfard <ecyn...@gmail.com>
> wrote:
>
>> Hi,
>>
>> Why doesn't Spark keep information about the structure of the RDDs or the
>> partitions within RDDs? Say that I use
>> repartitionAndSortWithinPartitions, which results in sorted partitions.
>> With sorted partitions, lookups should be super fast (binary search?), yet
>> I still need to go through the whole partition to perform a lookup, using,
>> say, filter.
>>
>> To give more context into a use case, let me give a very simple example
>> where having this feature seems extremely useful: consider that you have a
>> stream of incoming keys, where for each key you need to look up the
>> associated value in a large RDD and perform operations on the values.
>> Right now, performing a join between the RDDs in the DStream and the large
>> RDD seems to be the way to go. I.e.:
>>
>> incomingData.transform { rdd => largeRdd.join(rdd) }
>>   .map(performAdditionalOperations).save(...)
>>
>> Assuming that the largeRdd is sorted or contains an index, and each window
>> of incomingData is small, this join operation can be performed in
>> O(incomingData * (log(largeRDD) | 1)). Yet, right now, I believe this
>> operation is much more expensive than that.
>>
>> I have just started using Spark, so it's highly likely that I am using it
>> wrong. So any thoughts are appreciated!
>>
>> TL;DR. Why not keep an index/info with each partition or RDD to speed up
>> operations such as lookups, filters, etc.?
>>
>> Thanks,
>> Omid
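
To make the lookup idea from Omid's message concrete, here is a minimal sketch in plain Scala, with no Spark dependency. It only models the concept: data is split into key-sorted partitions, the (min, max) bounds of each partition are computed, and a lookup uses those bounds to pick one partition and then binary-searches inside it. Everything here (SortedPartitions, build, bounds, searchPartition, lookup) is a hypothetical name invented for illustration, not a Spark or spark-sorted API, and the sketch assumes unique Int keys.

```scala
// Plain-Scala model of range-partitioned, sorted data; not a Spark API.
// All names in this object are hypothetical and exist only for illustration.
object SortedPartitions {
  type Partition = Array[(Int, String)]

  // Sort the pairs by key and split them into chunks of at most `size` pairs.
  def build(data: Seq[(Int, String)], size: Int): Array[Partition] =
    data.sortBy(_._1).grouped(size).map(_.toArray).toArray

  // (min key, max key) of each partition; partitions from `build` are never empty.
  def bounds(parts: Array[Partition]): Array[(Int, Int)] =
    parts.map(p => (p.head._1, p.last._1))

  // Binary search inside one sorted partition instead of scanning all of it.
  def searchPartition(p: Partition, key: Int): Option[String] = {
    var lo = 0
    var hi = p.length - 1
    while (lo <= hi) {
      val mid = lo + (hi - lo) / 2
      val k = p(mid)._1
      if (k == key) return Some(p(mid)._2)
      else if (k < key) lo = mid + 1
      else hi = mid - 1
    }
    None
  }

  // Use the bounds to find the one partition that can hold the key, then search it.
  def lookup(parts: Array[Partition], key: Int): Option[String] = {
    val idx = bounds(parts).indexWhere { case (min, max) => key >= min && key <= max }
    if (idx < 0) None else searchPartition(parts(idx), key)
  }
}
```

For example, building from the pairs (5,"e"), (1,"a"), (3,"c"), (2,"b"), (4,"d") with a chunk size of 2 gives three partitions with bounds (1,2), (3,4) and (5,5), and looking up key 4 touches only the second partition. In Spark itself the per-partition data lives on executors, so a real version would need to go through the RDD or DataFrame API; this sketch only shows why keeping bounds and ordering makes a lookup cheaper than a filter over every element.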