Spark SQL's in-memory cache stores statistics per column, which in turn are
used to skip batches (default size 10,000 rows) within a partition.

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala#L25
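
In case it helps, here is a minimal sketch of exercising that path (this
assumes a local SparkSession and the 2.x+ API; the batch-size config key
shown is spark.sql.inMemoryColumnarStorage.batchSize, whose default is
10000 rows):

    import org.apache.spark.sql.SparkSession

    // Assumed setup: a local SparkSession for illustration only.
    val spark = SparkSession.builder()
      .appName("columnar-cache-stats")
      .master("local[*]")
      // Row count per in-memory columnar batch (default 10000).
      .config("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
      .getOrCreate()

    import spark.implicits._

    val df = spark.range(0, 1000000).toDF("id")
    df.cache()
    df.count()  // materialize the cached columnar batches

    // A filter like this can prune whole cached batches whose per-column
    // [min, max] stats (collected via ColumnStats) don't overlap the value.
    df.filter($"id" === 123456L).show()

So it's batch-level pruning on the cached data rather than a secondary
index in the classic sense.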

Hope this helps

Thanks
-Nitin

On Tue, Dec 15, 2015 at 12:28 AM, Michael Segel <msegel_had...@hotmail.com>
wrote:

> Hi,
>
> This may be a silly question… couldn’t find the answer on my own…
>
> I’m trying to find out if anyone has implemented secondary indexing on
> Spark’s RDDs.
>
> If anyone could point me to some references, it would be helpful.
>
> I’ve seen some stuff on Succinct Spark (see:
> https://amplab.cs.berkeley.edu/succinct-spark-queries-on-compressed-rdds/
>  )
> but was more interested in integration with SparkSQL and SparkSQL support
> for secondary indexing.
>
> Also the reason I’m posting this to the dev list is that there’s more to
> this question …
>
>
> Thx
>
> -Mike
>
>