Re: Efficient filtering on Spark SQL dataframes with ordered keys

Michael David Pedersen Mon, 31 Oct 2016 03:56:29 -0700

Hi Mich,

Thank you for your quick reply!


> What type of table is the underlying table? Is it Hbase, Hive ORC or what?
>

It is a custom datasource, but ultimately backed by HBase.


> By Key you mean a UNIQUE ID or something similar and then you do multiple
> scans on the tempTable which stores data using in-memory columnar format.
>

The key is a unique ID, yes. But note that I don't actually do multiple
scans on the same temp table: I create a new temp table for every query I
want to run, because each query will be based on a different key range. The
caching is at the level of the full key-value RDD.

If I cached the temp table instead, I don't see how I could exploit the key
ordering for key range filters. Is there a way to do that?
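
To make the key-ordering point concrete, here is a minimal plain-Python
sketch of the difference I have in mind. It is not Spark code and not my
actual datasource: the function names and the sample data are made up for
illustration. A plain filter has to test every row, whereas with keys known
to be sorted a range filter can binary-search for the two boundaries and
take the slice in between.

```python
from bisect import bisect_left, bisect_right


def range_filter_scan(rows, lo, hi):
    """Full scan: tests every row and ignores key order, O(n)."""
    return [(k, v) for k, v in rows if lo <= k <= hi]


def range_filter_sorted(keys, rows, lo, hi):
    """Uses key order: binary-search both bounds, then slice.

    Assumes `rows` is sorted by key and `keys` holds the same keys
    in the same order. Cost is O(log n) plus the number of matches.
    """
    start = bisect_left(keys, lo)   # first index with key >= lo
    end = bisect_right(keys, hi)    # first index with key > hi
    return rows[start:end]


# Hypothetical data: rows sorted by a unique integer key.
rows = [(k, f"value-{k}") for k in range(0, 100, 5)]
keys = [k for k, _ in rows]

# Both approaches return the same rows for the inclusive range [10, 25].
matches = range_filter_sorted(keys, rows, 10, 25)
```

What I am after is the second behaviour, but applied to the cached data
inside Spark rather than to a local list.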


> That is the optimisation of tempTable storage as far as I know.
>

So it seems to me that my current solution won't be using this
optimisation, as I'm caching the RDD rather than the temp table.


> Have you tried it using predicate push-down on the underlying table itself?
>

No, because I essentially want to load the entire table into memory before
doing any queries. At that point I have nothing to push down.

Cheers,
Michael
