> It's really a very simple query that I'm trying to run:
> select ... bloom_contains(a_id, b_id_bloom)

That's nearly impossible to optimize directly - there is no way to limit
the number of table_b rows which may match table_a. More than one bloom
filter can successfully match a single row from a_id, so this reduces to

    for (a_id in a) {
      for (b_id_bloom in b) {
        if (bloom_contains(a_id, b_id_bloom)) {
          emitRow();
        }
      }
    }

> The sizes of the tables I'm running against are small, roughly 50-100Mb,
> but this query would need to be expanded to run on a table that is
> >100Gb (table_b would likely max out around 100Mb).

As long as table_b is <1Gb, this will be executed in parallel across all
the splits (so there's an invisible for-parallel (s in splits) outside).
You need to increase the noconditionaltask size & the container size to
hold the entire table_b in memory.

As you can probably tell, it would've been much, much faster if Hive had
generated the dimension-table loop on the outside for this case:

    for (b_id_bloom in b) {
      for (a_id in split) {
        if (bloom_contains(a_id, b_id_bloom)) {
          emitRow();
        }
      }
    }

so that each bloom filter is deserialized exactly once & reused.

I can think of one way to leverage Hive UDTFs to generate something like
that, but it won't be any sort of standard SQL syntax (i.e. lateral
view). I'll probably have to think more about it, though.

But before that, do run this code with the JVM opts -Xprof and check the
stdout to see whether the bloom filter deserialization is hurting you or
not.

Cheers,
Gopal
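P.S. To make the deserialize-once point concrete, here's a minimal, self-contained Java sketch of the inverted loop ordering. The bloom filter here is a toy (fixed 256 bits, two hash probes) with made-up hash constants, not Hive's implementation; the point is only that each serialized filter is deserialized once per filter, not once per probed row.

```java
import java.util.BitSet;

public class BloomJoinSketch {
    static int deserializations = 0;  // instrumentation: counts deserialize() calls

    // Toy bloom filter: 256 bits, 2 hash probes (hypothetical, not Hive's).
    static final int BITS = 256;

    static int h1(long id) { return (int) Math.floorMod(id * 0x9E3779B97F4A7C15L, (long) BITS); }
    static int h2(long id) { return (int) Math.floorMod(id * 0xC2B2AE3D27D4EB4FL + 1, (long) BITS); }

    static BitSet build(long[] ids) {
        BitSet b = new BitSet(BITS);
        for (long id : ids) { b.set(h1(id)); b.set(h2(id)); }
        return b;
    }

    static byte[] serialize(BitSet b) { return b.toByteArray(); }

    static BitSet deserialize(byte[] blob) {
        deserializations++;
        return BitSet.valueOf(blob);
    }

    static boolean bloomContains(BitSet bloom, long id) {
        return bloom.get(h1(id)) && bloom.get(h2(id));
    }

    public static void main(String[] args) {
        long[] aIds = {1, 2, 3, 4, 5};  // fact-side rows within one split
        byte[][] bBlobs = {             // table_b: serialized bloom filters
            serialize(build(new long[]{2, 4})),
            serialize(build(new long[]{99}))
        };

        // Dimension-table loop outside: each blob deserialized exactly once,
        // then reused against every row of the split.
        int matches = 0;
        for (byte[] blob : bBlobs) {
            BitSet bloom = deserialize(blob);  // once per filter, not per row
            for (long id : aIds) {
                if (bloomContains(bloom, id)) matches++;
            }
        }
        System.out.println("matches=" + matches
                + " deserializations=" + deserializations);
    }
}
```

With the naive ordering (row loop outside), deserializations would be |a| x |b| instead of |b|, which is exactly what -Xprof would show up as deserialization cost.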
That's nearly impossible to optimize directly - there is no way to limit the number of table_b rows which may match table a. More than one bloom filter can successfully match a single row from a_id, so this reduces to for (a_id in a) { for (b_id_bloom in b) { if (bloom_contains(a_id, b_id_bloom)) { emitRow(); } } } >The sizes of the tables I¹m running against are small roughly 50-100Mb > but this query would need to be expanded to run on a table that is >>100Gb (table_b would likely max out around 100Mb). As long as table_b is <1Gb, this will be executed in parallel across all the splits (so there's an invisible for-parallel( s in splits) outside). You need to increase the noconditional task size & the container size to hold the entire table_b in memory. As you can probably tell, it would've been much much faster if hive generated the dimension table loop outside for this case. for (b_id_bloom in b) { for (a_id in split) { if (bloom_contains(a_id, b_id_bloom) { emitRow(); } } } So that each bloom filter is deserialized exactly once & reused. I can think of one way to leverage hive UDTFs to generate something like that, but it won't be any sort of standard SQL syntax (i.e lateral view). I'll probably have to think more about it though. But before that, do run this code with JVM opts -Xprof and check the stdout to see if the bloom filter deserialization is hurting you or not. Cheers, Gopal