Hi Spark User,

I have run into a situation where Spark SQL is much slower than Parquet
MR at processing parquet files. Can you provide some guidance on
optimization?

Suppose I have a table "person" with columns gender, age, name, address,
etc., stored in parquet files. I tried two ways to read the table in
Spark:

1) Define a case class Person(gender: String, age: Int, etc.) and use the
Spark SQL Dataset API:
val ds = spark.read.parquet("...").as[Person]
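
For completeness, the full setup for 1) looks roughly like this (only the
gender and age columns are spelled out; the remaining columns and the path
are left elided):

import org.apache.spark.sql.SparkSession

// Only the two columns used in the filters below; other columns elided.
case class Person(gender: String, age: Int)

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._   // provides the Encoder[Person] needed by .as[Person]

val ds = spark.read.parquet("...").as[Person]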

2) Define an Avro record "record Person { string gender; int age; etc }",
then use parquet-avro
<https://github.com/apache/parquet-mr/tree/master/parquet-avro> and
newAPIHadoopFile:

import org.apache.parquet.hadoop.ParquetInputFormat

// Key type is Void; values are the Avro-generated Person records.
val rdd = sc.newAPIHadoopFile(
    path,
    classOf[ParquetInputFormat[avro.Person]],
    classOf[Void],
    classOf[avro.Person],
    job.getConfiguration).values
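
Here job is a Hadoop Job configured so that ParquetInputFormat materializes
Avro records; roughly (a sketch, class names from parquet-avro 1.8.1):

import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.avro.AvroReadSupport

// Carries the read configuration passed to newAPIHadoopFile above.
val job = Job.getInstance()
// Tell ParquetInputFormat to materialize rows as Avro-generated Person objects.
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[avro.Person]])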


Then I compared three actions (Spark 2.1.1, parquet-avro 1.8.1):

a) ds.filter(" gender='female' and age > 50").count          // 1 min

b) ds.filter(" gender='female'").filter(_.age > 50).count    // 15 min

c) rdd.filter(r => r.gender == "female" && r.age > 50).count // 7 min

I can understand why a) is faster than c): a) is a plain SQL-style query,
so Spark can do a lot to optimize it (such as not fully deserializing the
objects). But I don't understand why b) is so much slower than c), since
I assume both require full deserialization.
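
To make that concrete, my understanding is that a Column-based filter stays
inside Catalyst (so the gender/age predicates can be optimized and pushed
down to the parquet scan), while a typed lambda filter is opaque to the
optimizer and each row has to be decoded into a Person first. Roughly:

import org.apache.spark.sql.functions.col

// Column expressions: Catalyst sees the predicates, as in a) above.
ds.filter(col("gender") === "female" && col("age") > 50).count

// Typed lambdas: opaque functions, so every row is deserialized into a
// Person before the predicate runs, as in b) above.
ds.filter(_.gender == "female").filter(_.age > 50).count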

Is there anything I can try to improve b)?

Thanks,


Mike
