[Spark Dataframe] How can I write a correct filter so the Hive table partitions are pruned correctly

Patrick Duin Wed, 13 Sep 2017 06:06:56 -0700

Hi Spark users,

I've got an issue where I wrote a filter on a Hive table using dataframes
and despite setting:
spark.sql.hive.metastorePartitionPruning=true no partitions are being
pruned.


In short:

Doing this: table.filter("partition=x or partition=y") will result in Spark
fetching all partition metadata from the Hive metastore and doing the
filtering after fetching the partitions.

On the other hand if my filter is "simple":
table.filter("partition=x ")
Spark does a call to the metastore that passes along the filter and fetches
just the ones it needs.

Our case is where we have a lot of partitions on a table and the calls that
result in all the partitions take minutes as well as causing us memory
issues. Is this a bug or is there a better way of doing the filter call?

Thanks,
 Patrick

PS:
Sorry for crossposting I wasn't sure if the user list was the correct place
to ask and I understood to go via stackoverflow first so my question is
also here in more detail:
https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions

[Spark Dataframe] How can I write a correct filter so the Hive table partitions are pruned correctly

Reply via email to