Hello, I've got a Spark SQL DataFrame containing a "key" column. The queries I want to run start by filtering on the key range. My question, in outline: is it possible to sort the dataset by key so that key-range filters are efficient, before then running a more complex SQL query on the filtered rows?
I'm aware that such efficient filtering is possible for key-value RDDs, i.e. RDDs over Tuple2, using OrderedRDDFunctions (sortByKey / filterByRange). My workflow currently looks as follows:

// Create a DataFrame.
val df: DataFrame = sqlContext.sql("SELECT * FROM ...")
val keyValRDD = df.rdd.map( (r: Row) => (r.getAs[String]("key"), r) )

// Sort by key - and cache.
val keyValRDDSorted = keyValRDD.sortByKey().cache

// Define a function to run an SQL query on a key range.
def queryRange(lower: String, upper: String, sql: String, tableName: String) = {
  val rangeRDD = keyValRDDSorted.filterByRange(lower, upper)
  val rangeDF = sqlContext.createDataFrame(rangeRDD.map{ _._2 }, df.schema)
  rangeDF.createTempView(tableName)
  sqlContext.sql(sql)
}

// Invoke multiple times.
queryRange(...)
queryRange(...)
...

This works, and is efficient in that only the partitions containing the relevant key range are processed. However, I understand that Spark SQL uses an optimised columnar in-memory format when caching DataFrames, as compared to plain RDDs. The above workflow can't take advantage of this, since it is the key-value RDD, not the DataFrame, that is cached.

So, my specific question: is there a more efficient way of achieving the desired result? (A sketch of the DataFrame-only variant I have in mind is in the PPS below.) Any pointers would be much appreciated.

Many thanks,
Michael

PS: This question was also asked on StackOverflow - http://stackoverflow.com/questions/40129411/efficient-filtering-on-spark-sql-dataframes-with-ordered-keys .
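PPS: For concreteness, the purely DataFrame-based variant I would like to make efficient looks roughly like the sketch below. This is only a sketch, not something I have benchmarked; the names dfSorted and queryRangeDF are made up for illustration, the column "key" is from my schema, and df / sqlContext are as in the snippet above. What I don't know is whether caching the sorted DataFrame lets Spark evaluate the range filter only on the relevant partitions, the way filterByRange does on the sorted RDD.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sort by key and cache the DataFrame itself, so the columnar cache is used.
val dfSorted: DataFrame = df.sort("key").cache()

// Run an SQL query restricted to a key range - analogous to queryRange above,
// but staying entirely in the DataFrame API.
def queryRangeDF(lower: String, upper: String, sql: String, tableName: String) = {
  val rangeDF = dfSorted.filter(col("key") >= lower && col("key") <= upper)
  rangeDF.createTempView(tableName)
  sqlContext.sql(sql)
}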