Re: Phoenix as a source for Spark processing

2018-03-15 Thread Josh Elser
Cool, that's a good find. Re-stating what you're seeing: the region splits of your HBase table don't match an even distribution of the data in the table; some regions hold more data than others. Typically, applications reading from HBase will launch workers based …
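One way to see that uneven region layout directly is to list the table's regions and their start keys with the HBase client API; a minimal sketch, assuming an HBase 1.x/2.x-style client and an illustrative table name `MY_TABLE` (uneven key ranges between consecutive regions are a hint of data skew):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

object RegionSkewCheck {
  def main(args: Array[String]): Unit = {
    // Assumes hbase-site.xml is on the classpath; table name is illustrative.
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    try {
      val locator = conn.getRegionLocator(TableName.valueOf("MY_TABLE"))
      // Print each region's start key; very unequal key ranges suggest
      // that some regions hold much more data than others.
      locator.getAllRegionLocations.asScala.foreach { loc =>
        println(Bytes.toStringBinary(loc.getRegionInfo.getStartKey))
      }
    } finally {
      conn.close()
    }
  }
}
```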

Re: Direct HBase vs. Phoenix query performance

2018-03-15 Thread James Taylor
Hi Marcell, Yes, that's correct - the cache we build for the RHS is only kept around while the join query is being executed. It'd be interesting to explore keeping the cache around longer for cases like yours (and probably not too difficult). We'd need to keep a map from the RHS query to its …
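For context, the RHS cache being discussed is the server-side hash cache Phoenix builds for the right-hand side of a hash join before probing the left side. A minimal JDBC sketch of such a join, with the ZooKeeper host and both table names purely illustrative; each execution of the statement rebuilds the RHS cache from scratch:

```scala
import java.sql.DriverManager

object HashJoinSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative Phoenix JDBC URL; point it at your ZooKeeper quorum.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    try {
      val stmt = conn.createStatement()
      // For a hash join, Phoenix ships SMALL_TABLE (the RHS) to the region
      // servers as a hash cache, then probes it while scanning BIG_TABLE.
      // The cache only lives for the duration of this one statement.
      val rs = stmt.executeQuery(
        """SELECT b.id, s.label
          |FROM BIG_TABLE b
          |JOIN SMALL_TABLE s ON b.key = s.key""".stripMargin)
      while (rs.next()) {
        println(rs.getString(1) + " " + rs.getString(2))
      }
      rs.close()
      stmt.close()
    } finally {
      conn.close()
    }
  }
}
```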

RE: Phoenix as a source for Spark processing

2018-03-15 Thread Stepan Migunov
The table is about 300 GB in HBase. I've done some more research and now my test is very simple - I'm trying to compute the record count of the table. No "distinct"s etc., just phoenixTableAsDataFrame(...).count(). And now I see the issue - Spark creates about 400 tasks (14 executors), starts …
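A minimal sketch of that count test using the phoenix-spark integration; the table name and ZooKeeper quorum are illustrative, and the exact `phoenixTableAsDataFrame` signature may vary between Phoenix versions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._  // adds phoenixTableAsDataFrame to SQLContext

object PhoenixCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("phoenix-count"))
    val sqlContext = new SQLContext(sc)
    // phoenix-spark typically creates one input partition per HBase region,
    // so a table with ~400 regions yields ~400 Spark tasks regardless of
    // how many executors are available.
    val df = sqlContext.phoenixTableAsDataFrame(
      "MY_TABLE",                    // illustrative table name
      Seq("ID"),                     // project a single column; enough for count()
      zkUrl = Some("zk-host:2181"))  // illustrative ZooKeeper quorum
    println(df.rdd.getNumPartitions) // roughly the table's region count
    println(df.count())
    sc.stop()
  }
}
```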