Re: Efficient filtering on Spark SQL dataframes with ordered keys

Michael David Pedersen Mon, 31 Oct 2016 07:18:08 -0700

Hi Mich,

Thank you again for your reply.


> As I see you are caching the table already sorted
>
> val keyValRDDSorted = keyValRDD.sortByKey().cache
>
> and the next stage is you are creating multiple tempTables (different
> ranges) that cache a subset of rows already cached in RDD. The data stored
> in tempTable is in Hive columnar format (I assume that means ORC format)
>

But the thing is that I don't explicitly cache the tempTables, and I don't
really want to because I'll only run a single query on each tempTable. So I
expect the SQL query processor to operate directly on the underlying
key-value RDD, and my concern is that this may be inefficient.


> Well that is all you can do.
>

Ok, thanks - that's really what I wanted to get confirmation of.


> Bear in mind that these tempTables are immutable and I do not know any way
> of dropping tempTable to free more memory.
>

I'm assuming there won't be any (significant) memory overhead of
registering the temp tables as long as I don't explicitly cache them. Am I
wrong? In any case I'll be calling sqlContext.dropTempTable once the query
has completed, which according to the documentation should also free up
memory.
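
For concreteness, the pattern I have in mind looks roughly like the sketch
below. It is only an illustration: the KeyVal case class, the sample data,
the key range and the table name "range_100_200" are all made up here, and
it assumes the Spark 1.6-style SQLContext API (registerTempTable,
isCached, dropTempTable).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type; the real schema is not given in this thread.
case class KeyVal(key: Long, value: String)

object TempTableSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("temp-table-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Stand-in data; in the real job keyValRDD comes from elsewhere.
    val keyValRDD = sc.parallelize((1L to 1000L).map(k => (k, s"v$k")))

    // Only the sorted key-value RDD is cached explicitly.
    val keyValRDDSorted = keyValRDD.sortByKey().cache()

    // Select one key range and expose it as a DataFrame.
    val subset = keyValRDDSorted
      .filter { case (k, _) => k >= 100L && k < 200L }
      .map { case (k, v) => KeyVal(k, v) }
      .toDF()

    // Registering only binds a name to the DataFrame; nothing here calls
    // cacheTable or cache() on it.
    subset.registerTempTable("range_100_200")

    // Ask the SQLContext whether the temp table itself is cached.
    println(sqlContext.isCached("range_100_200"))

    // Run the single query for this range.
    val result = sqlContext.sql(
      "SELECT key, value FROM range_100_200 WHERE key % 2 = 0")
    result.show()

    // Drop the temp table once the query has completed.
    sqlContext.dropTempTable("range_100_200")

    sc.stop()
  }
}
```

The isCached call is just a way to check my assumption that registering a
temp table does not by itself cache anything.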

Cheers,
Michael
