Hi Spark users,

I have run into a situation where Spark SQL is much slower than parquet-mr for processing Parquet files. Can you provide some guidance on optimization?
Suppose I have a table "person" with columns gender, age, name, address, etc., stored in Parquet files. I tried two ways to read the table in Spark:

1) Define case class Person(gender: String, age: Int, ...) and use the Spark SQL Dataset API:

    val ds = spark.read.parquet("...").as[Person]

2) Define an Avro record "record Person { string gender; int age; ... }", then use parquet-avro <https://github.com/apache/parquet-mr/tree/master/parquet-avro> with newAPIHadoopFile:

    val rdd = sc.newAPIHadoopFile(
        path,
        classOf[ParquetInputFormat[avro.Person]],
        classOf[Void],
        classOf[avro.Person],
        job.getConfiguration
      ).values

Then I compared three actions (Spark 2.1.1, parquet-avro 1.8.1):

a) ds.filter("gender = 'female' and age > 50").count                // 1 min
b) ds.filter("gender = 'female'").filter(_.age > 50).count          // 15 min
c) rdd.filter(r => r.gender == "female" && r.age > 50).count        // 7 min

I can understand why a) is faster than c): a) is a pure SQL query, so Spark can do a lot to optimize it (such as not fully deserializing the objects). But I don't understand why b) is so much slower than c), because I assume both require full deserialization.

Is there anything I can try to improve b)?

Thanks,
Mike
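P.S. For completeness, here is a minimal, self-contained sketch of the Dataset side of the comparison. The path, app name, and extra columns are placeholders, and the comments only reflect my current understanding of what Catalyst can and cannot optimize:

    import org.apache.spark.sql.SparkSession

    // Case class mirroring the Parquet schema (only the fields used here).
    case class Person(gender: String, age: Int, name: String, address: String)

    object FilterComparison {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("parquet-filter-comparison")
          .getOrCreate()
        import spark.implicits._

        // Placeholder path; the real table lives elsewhere.
        val ds = spark.read.parquet("/data/person").as[Person]

        // a) Both predicates are SQL expressions, so Catalyst can prune
        //    columns and push the filters down to the Parquet scan.
        val a = ds.filter("gender = 'female' and age > 50").count()

        // b) The second filter is a Scala lambda, which (as far as I
        //    understand) is opaque to Catalyst, so every row that passes
        //    the gender filter gets deserialized into a Person object
        //    before the age check runs.
        val b = ds.filter("gender = 'female'").filter(_.age > 50).count()

        println(s"a = $a, b = $b")
        spark.stop()
      }
    }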