Hi Spark users,
I've got an issue where I wrote a filter on a Hive table using dataframes
and despite setting:
spark.sql.hive.metastorePartitionPruning=true no partitions are being
pruned.
In short:
Doing this: table.filter("partition=x or partition=y") will result in Spark
fetching all partition metadata from the Hive metastore and doing the
filtering after fetching the partitions.
On the other hand if my filter is "simple":
table.filter("partition=x ")
Spark does a call to the metastore that passes along the filter and fetches
just the ones it needs.
Our case is where we have a lot of partitions on a table and the calls that
result in all the partitions take minutes as well as causing us memory
issues. Is this a bug or is there a better way of doing the filter call?
Thanks,
Patrick
PS:
Sorry for crossposting I wasn't sure if the user list was the correct place
to ask and I understood to go via stackoverflow first so my question is
also here in more detail:
https://stackoverflow.com/questions/46152526/how-should-i-configure-spark-to-correctly-prune-hive-metastore-partitions