Hello,

I've got a Spark SQL DataFrame containing a "key" column. The queries I
want to run start by filtering on the key range. My question in outline: is
it possible to sort the DataFrame by key so that key range filters can be
evaluated efficiently, before running a more complex SQL query on the
filtered result?

I'm aware that such efficient filtering is possible for key-value RDDs,
i.e. RDDs over Tuple2, via the sortByKey and filterByRange methods of
OrderedRDDFunctions. My workflow currently looks as follows:

import org.apache.spark.sql.{DataFrame, Row}

// Create a dataframe and pair each row with its key.
val df: DataFrame = sqlContext.sql("SELECT * FROM ...")
val keyValRDD = df.rdd.map( (r: Row) => (r.getAs[String]("key"), r) )

// Sort by key - and cache.
val keyValRDDSorted = keyValRDD.sortByKey().cache

// Define a function to run an SQL query on a key range.
def queryRange(lower: String, upper: String, sql: String, tableName: String) = {
    // Keep only the partitions overlapping the key range.
    val rangeRDD = keyValRDDSorted.filterByRange(lower, upper)
    // Rebuild a DataFrame from the filtered rows, reusing the original schema.
    val rangeDF = sqlContext.createDataFrame(rangeRDD.map{ _._2 }, df.schema)
    rangeDF.createTempView(tableName)
    sqlContext.sql(sql)
}

// Invoke multiple times.
queryRange(...)
queryRange(...)
...
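
For concreteness, a single invocation might look something like this (the
key bounds, view name and query are made-up placeholders; note that the SQL
string has to refer to the view registered under tableName):

// Hypothetical example with placeholder arguments.
queryRange("key0001", "key0999", "SELECT key, count(*) FROM range_1 GROUP BY key", "range_1")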

This works, and is efficient in that only the partitions containing the
relevant key range are processed. However, I understand that Spark SQL
caches DataFrames in an optimised in-memory columnar format, whereas plain
RDDs are cached as-is. The above workflow can't take advantage of the
columnar format, as it is the key-value RDD that is cached.
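
For comparison, a pure-DataFrame version of the same idea would look
roughly like the sketch below (the function name is just a placeholder;
whether the range filter can actually exploit the sort order of the cached
DataFrame is exactly what I'm unsure about):

// Sort by key and cache the DataFrame itself, so that Spark SQL's
// in-memory columnar format is used for the cached data.
val dfSorted = df.sort("key").cache()

// Filter the key range with the DataFrame API, then run the SQL query
// against a temp view over the filtered result.
def queryRangeDF(lower: String, upper: String, sql: String, tableName: String) = {
    val rangeDF = dfSorted.filter(dfSorted("key") >= lower && dfSorted("key") <= upper)
    rangeDF.createOrReplaceTempView(tableName)
    sqlContext.sql(sql)
}

As far as I can tell, though, the filter here would still scan every cached
partition rather than skipping irrelevant ones the way filterByRange does.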

So, my specific question: Is there a more efficient way of achieving the
desired result?

Any pointers would be much appreciated.

Many thanks,
Michael

PS: This question was also asked on StackOverflow:
http://stackoverflow.com/questions/40129411/efficient-filtering-on-spark-sql-dataframes-with-ordered-keys
