Cool, that's a good find. Re-stating what you're seeing: the region
splits of your HBase table don't evenly divide the data in the table.
Some regions hold more data than others.
Typically, applications reading from HBase will launch workers based
on the table's region boundaries.
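To illustrate why that matters, here's a toy sketch (plain Python, not HBase or Spark code; the region sizes and throughput are made-up numbers): with one task per region, the job finishes only when the largest region is done, so skewed region sizes turn directly into straggler tasks.

```python
# Hypothetical, skewed region sizes (GB) for a table whose splits
# don't match the data distribution. One task is launched per region.
region_sizes_gb = [2, 3, 2, 40, 2, 3]
throughput_gb_per_min = 1.0  # assumed per-task scan rate

task_minutes = [size / throughput_gb_per_min for size in region_sizes_gb]

# Wall-clock time is bounded by the slowest task, not the average:
print(max(task_minutes))                       # the 40 GB region dominates
print(sum(task_minutes) / len(task_minutes))   # average is far lower
```

Re-splitting the big region (or otherwise balancing the input splits) brings the slowest task down toward the average.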
Hi Marcell,
Yes, that's correct - the cache we build for the RHS is only kept around
while the join query is being executed. It'd be interesting to explore
keeping the cache around longer for cases like yours (and probably not too
difficult). We'd need to keep a map from the RHS query to its cached
contents.
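A minimal sketch of that idea, in plain Python rather than Phoenix internals (all names here are illustrative, not real Phoenix APIs): the build-side hash table is keyed by the RHS query text, so a repeated join with the same RHS can reuse it instead of rebuilding.

```python
# Hypothetical cross-query cache for the join's right-hand side.
# Maps RHS query text -> hash table of {join_key: row}.
rhs_cache = {}

def build_rhs(query, rows, key_col):
    """Build the RHS hash table, or reuse a previously cached one."""
    if query not in rhs_cache:
        rhs_cache[query] = {row[key_col]: row for row in rows}
    return rhs_cache[query]

rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
ht = build_rhs("SELECT id, v FROM t", rows, "id")
print(ht[2]["v"])  # prints "b"
```

In a real implementation you'd also need invalidation (the cached RHS goes stale when the underlying table changes) and an eviction policy so the cache doesn't grow without bound.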
The table is about 300GB in HBase.
I've done some more research, and now my test is very simple - I'm trying
to calculate the record count of the table. No DISTINCTs or anything
else, just phoenixTableAsDataFrame(...).count().
And now I see the issue - Spark creates about 400 tasks (14 executors),
starts