Hi Spark users,

I've run into an issue where I wrote a filter on a Hive table using
DataFrames, and despite setting
spark.sql.hive.metastorePartitionPruning=true, no partitions are being
pruned.
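
For reference, we enable the setting roughly like this (a sketch; the
session is built with Hive support, and the property can equally be
passed via --conf on spark-submit):

    import org.apache.spark.sql.SparkSession

    // Session with Hive support; the pruning flag can also be set
    // via --conf spark.sql.hive.metastorePartitionPruning=true
    val spark = SparkSession.builder()
      .enableHiveSupport()
      .config("spark.sql.hive.metastorePartitionPruning", "true")
      .getOrCreate()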

In short:

Doing this: table.filter("partition=x or partition=y") results in Spark
fetching all partition metadata from the Hive metastore and doing the
filtering on the Spark side after the fetch.
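
A minimal sketch of what triggers it ("mytable" and the "partition"
column are stand-ins for our real names):

    // Disjunctive predicate: all partition metadata is fetched from the
    // metastore and the filtering happens on the Spark side afterwards.
    val table = spark.table("mytable")
    table.filter("partition = 'x' OR partition = 'y'").show()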

On the other hand, if my filter is "simple":
table.filter("partition=x")
Spark makes a call to the metastore that passes the filter along and
fetches just the partitions it needs.
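
Sketch of the working case, same stand-in names as above:

    // Single-value predicate: the filter is passed to the metastore call
    // and only the matching partition metadata comes back.
    table.filter("partition = 'x'").show()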

In our case the table has a lot of partitions, so the calls that fetch
all of them take minutes and also cause us memory issues. Is this a bug,
or is there a better way of writing the filter call?
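
One workaround I've been considering, though I haven't verified that it
keeps the pushdown on both sides, is to split the disjunction into two
"simple" filters and union the results:

    // Untested sketch: each branch has a single-value predicate, so each
    // should be eligible for metastore-side pruning on its own.
    val xs = table.filter("partition = 'x'")
    val ys = table.filter("partition = 'y'")
    val combined = xs.union(ys)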

Thanks,
 Patrick

PS:
Sorry for crossposting; I wasn't sure whether the user list was the right
place to ask, and I understood I should go via Stack Overflow first, so my
question is also there in more detail:
https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions
