>It's really a very simple query that I'm trying to run:
>select
...
>    bloom_contains(a_id, b_id_bloom)

That's nearly impossible to optimize directly - there is no way to limit
the number of table_b rows which may match a given row of table_a.

More than one bloom filter can successfully match a single a_id value,
so this reduces to a full nested-loop join

for (a_id in a) {
  for (b_id_bloom in b) {
    if (bloom_contains(a_id, b_id_bloom)) {
      emitRow();
    }
  }
}


>The sizes of the tables I'm running against are small -- roughly 50-100Mb
>-- but this query would need to be expanded to run on a table that is
>>100Gb (table_b would likely max out around 100Mb).

As long as table_b is <1Gb, this will be executed in parallel across all
the splits (so there's an invisible for-parallel (s in splits) loop outside).

You need to increase the no-conditional-task size
(hive.auto.convert.join.noconditionaltask.size) and the container size so
that the entire table_b can be held in memory.
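A sketch of the relevant settings, assuming Hive on Tez -- the property names are the stock Hive ones, but the sizes below are purely illustrative and have to be tuned to your actual table_b:

```sql
-- Illustrative values; size the thresholds to your own table_b.
SET hive.auto.convert.join.noconditionaltask=true;
SET hive.auto.convert.join.noconditionaltask.size=268435456; -- bytes, > table_b
SET hive.tez.container.size=4096;                            -- MB
SET hive.tez.java.opts=-Xmx3276m;                            -- ~80% of container
```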

As you can probably tell, it would've been much, much faster if Hive
generated the loop with the dimension table on the outside for this case:

for (b_id_bloom in b) {
  for (a_id in split) {
    if (bloom_contains(a_id, b_id_bloom)) {
      emitRow();
    }
  }
}

That way, each bloom filter is deserialized exactly once & reused across
the whole split.
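The inverted shape can be sketched in plain Java. Note this is not Hive's actual bloom filter implementation -- the SimpleBloom class, its double-hashing scheme, and the bloomJoin helper below are all illustrative assumptions, just to show where the deserialization cost lands:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative bloom filter -- NOT Hive's implementation.
class SimpleBloom {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloom(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Kirsch-Mitzenmacher double hashing: k indexes from two hash values.
    private int index(long key, int i) {
        int h1 = Long.hashCode(key);
        int h2 = Long.hashCode(Long.rotateLeft(key, 31) ^ 0x9E3779B97F4A7C15L);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(long key) {
        for (int i = 0; i < numHashes; i++) bits.set(index(key, i));
    }

    boolean contains(long key) {
        for (int i = 0; i < numHashes; i++)
            if (!bits.get(index(key, i))) return false;
        return true; // "probably present" -- false positives are possible
    }

    byte[] serialize() {
        return bits.toByteArray();
    }

    static SimpleBloom deserialize(byte[] raw, int numBits, int numHashes) {
        SimpleBloom b = new SimpleBloom(numBits, numHashes);
        b.bits.or(BitSet.valueOf(raw));
        return b;
    }

    // The inverted join: each serialized filter is deserialized exactly once,
    // then probed with every a_id in the split.
    static List<Long> bloomJoin(List<byte[]> serializedFilters, long[] splitRows,
                                int numBits, int numHashes) {
        List<Long> matches = new ArrayList<>();
        for (byte[] raw : serializedFilters) {
            SimpleBloom bloom = deserialize(raw, numBits, numHashes); // once per filter
            for (long aId : splitRows) {
                if (bloom.contains(aId)) matches.add(aId); // emitRow()
            }
        }
        return matches;
    }
}
```

With the loops the other way around, deserialize() would sit inside the inner loop and run once per (a_id, filter) pair instead of once per filter.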

I can think of one way to leverage Hive UDTFs to generate something like
that, but it won't be any sort of standard SQL syntax (i.e. LATERAL VIEW).
I'll probably have to think more about it, though.

But before that, do run this code with the JVM option -Xprof and check the
stdout to see whether the bloom filter deserialization is hurting you or not.

Cheers,

Gopal

