I recently ran a query of the following form:

    select a.*, b.*
    from some_small_table a
    inner join (
        select things
        from some_other_table
        lateral view explode(s) ss as sss
        where a_key in (x, y, z)
    ) b
    on a.key = b.key
    where someothercriterion

On Hive, this query took about five minutes. In Spark, whether I used the same syntax in a spark.sql call or the DataFrame API, it looked as if it would take on the order of 10 hours, so I didn't let it finish.

The data underlying the Hive table are sequence files of ~30 MB each, with ~1000 files per partition, and my query ran over only five partitions. A single partition is about 25 GB.

How can Spark perform so badly? Do I need to handle sequence files in a special way?
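
To put those figures in perspective, here is a rough back-of-the-envelope sketch in plain Python. It uses only the approximate numbers quoted above; the variable and function names are made up for illustration, decimal units (1 GB = 1000 MB) are assumed, and the 10-hour figure is only my extrapolation, not a measured runtime.

```python
# Approximate figures from the description above (all rough).
FILE_SIZE_MB = 30            # size of one sequence file
FILES_PER_PARTITION = 1000   # file count per partition
PARTITIONS_SCANNED = 5       # partitions the query touched
PARTITION_SIZE_GB = 25       # stated size of one partition

HIVE_RUNTIME_S = 5 * 60          # about five minutes on Hive
SPARK_RUNTIME_S = 10 * 60 * 60   # extrapolated ~10 hours on Spark (not measured)


def scan_estimate(file_size_mb, files_per_partition, partitions, partition_size_gb):
    """Return (file count, GB implied by file sizes, GB implied by partition size)."""
    total_files = files_per_partition * partitions
    gb_from_files = total_files * file_size_mb / 1000
    gb_from_partitions = partitions * partition_size_gb
    return total_files, gb_from_files, gb_from_partitions


def throughput_mb_per_s(total_gb, runtime_s):
    """Average scan rate implied by reading total_gb in runtime_s seconds."""
    return total_gb * 1000 / runtime_s


total_files, gb_from_files, gb_from_partitions = scan_estimate(
    FILE_SIZE_MB, FILES_PER_PARTITION, PARTITIONS_SCANNED, PARTITION_SIZE_GB
)
hive_rate = throughput_mb_per_s(gb_from_partitions, HIVE_RUNTIME_S)
spark_rate = throughput_mb_per_s(gb_from_partitions, SPARK_RUNTIME_S)

print(f"files scanned: {total_files}")
print(f"data scanned: {gb_from_partitions}-{gb_from_files:.0f} GB")
print(f"implied rate on Hive: {hive_rate:.0f} MB/s")
print(f"implied rate on Spark: {spark_rate:.1f} MB/s")
```

So the query reads on the order of 5000 small files and somewhere between 125 and 150 GB, and the two runtimes imply average scan rates that differ by roughly a factor of 120.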