Spark SQL's in-memory columnar cache stores per-column statistics, which are in turn used to skip batches (default size 10,000 rows) within a partition:
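As a rough illustration of the idea (a minimal Python sketch under assumed simplifications, not Spark's actual Scala implementation — `build_batches` and `scan_equal` are hypothetical names), per-batch min/max statistics let a scan skip whole batches whose value range cannot contain the predicate value:

```python
BATCH_SIZE = 10000  # Spark's default in-memory columnar batch size

def build_batches(values, batch_size=BATCH_SIZE):
    """Split a column into batches, recording min/max stats per batch."""
    batches = []
    for i in range(0, len(values), batch_size):
        batch = values[i:i + batch_size]
        batches.append({"min": min(batch), "max": max(batch), "rows": batch})
    return batches

def scan_equal(batches, target):
    """Scan for rows equal to `target`, skipping any batch whose
    [min, max] range cannot contain it."""
    hits, scanned = [], 0
    for b in batches:
        if target < b["min"] or target > b["max"]:
            continue  # entire batch pruned via its stats
        scanned += 1
        hits.extend(v for v in b["rows"] if v == target)
    return hits, scanned
```

On a sorted column this prunes most batches; on unsorted data the min/max ranges overlap more and fewer batches can be skipped.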
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala#L25

Hope this helps

Thanks
-Nitin

On Tue, Dec 15, 2015 at 12:28 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
> Hi,
>
> This may be a silly question… couldn’t find the answer on my own…
>
> I’m trying to find out if anyone has implemented secondary indexing on
> Spark’s RDDs.
>
> If anyone could point me to some references, it would be helpful.
>
> I’ve seen some stuff on Succinct Spark (see:
> https://amplab.cs.berkeley.edu/succinct-spark-queries-on-compressed-rdds/)
> but was more interested in integration with SparkSQL and SparkSQL support
> for secondary indexing.
>
> Also the reason I’m posting this to the dev list is that there’s more to
> this question …
>
> Thx
>
> -Mike