On 10/29/18 11:39 AM, Nicolas Paris wrote:
> Thanks Josh,
>
> On Mon, Oct 29, 2018 at 10:47:42AM -0400, Josh Elser wrote:
>> Use Hive when Hive does things well, and use Phoenix when Phoenix does
>> it well.
>
> That would be great. My concern is the phoenix "joins" do not compete
> with postgresql in my actual tests.
> Phoenix + hive is ok, however
> Phoenix + hive + postgres is not.
>
> Am I wrong with the bad performances of joins in the context of large
> tables (> 10M) ?


I think trying to phrase "JOIN efficiency" in terms of data set size is the wrong way to go about an appropriate explanation.

Phoenix has limitations here, which I would summarize as "things HBase can handle as push-downs" and "the lack of a distributed execution engine".

For example, you found few-to-many joins worked well with Phoenix, but you would find that (in most cases) many-to-many joins will be slow. This is largely because of the constructs that HBase provides as a data store and what Phoenix can "work with". When Phoenix can push down one side of the join, you get a fast, (often) parallelized scan from Phoenix. When both sides of the relation are large, you end up running a sort-merge join, which pulls everything back to the client.
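
To make that distinction a bit more concrete, here is a toy sketch in plain Python. It is not Phoenix code, and the function names are made up for illustration. It only shows the shape of the two strategies: a hash join keeps the small side in memory and streams the large side past it, while a sort-merge join has to sort and walk both sides in full.

```python
from collections import defaultdict


def hash_join(small, large):
    """Build an in-memory table from the small side, then stream the large side."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    # Only the small side is held in memory; the large side is scanned once.
    for key, value in large:
        for match in table.get(key, ()):
            yield key, match, value


def sort_merge_join(left, right):
    """Sort both sides by key, then walk them in step."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    i = j = 0
    while i < len(left) and j < len(right):
        key_l, key_r = left[i][0], right[j][0]
        if key_l < key_r:
            i += 1
        elif key_l > key_r:
            j += 1
        else:
            # Find the run of rows on each side that share this key.
            i_end = i
            while i_end < len(left) and left[i_end][0] == key_l:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == key_l:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, value_l in left[i:i_end]:
                for _, value_r in right[j:j_end]:
                    yield key_l, value_l, value_r
            i, j = i_end, j_end
```

Both functions return the same rows. The difference is where the work happens: in the second one, every row from both inputs has to be gathered and sorted in one place before anything can be matched.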

The first step is understanding what Phoenix is actually doing to run your query (JOIN or otherwise), and then working out whether you can rephrase your JOIN (or really, the application-level "question") in such a way that Phoenix can run it efficiently.
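
Phoenix's EXPLAIN statement is the usual way to see what it plans to do. A minimal sketch, assuming you already have a Python DB-API style cursor connected to Phoenix (for example through the Phoenix Query Server); the helper name, tables, and columns below are hypothetical:

```python
def explain(cursor, sql):
    """Return Phoenix's plan for `sql` as a list of text lines.

    `cursor` is assumed to be a DB-API style cursor that is already
    connected to Phoenix; the plan text is read from the first column
    of each returned row.
    """
    cursor.execute("EXPLAIN " + sql)
    return [str(row[0]) for row in cursor.fetchall()]


# Hypothetical tables and columns, for illustration only.
QUERY = (
    "SELECT o.order_id, c.name "
    "FROM orders o JOIN customers c ON o.customer_id = c.customer_id "
    "WHERE c.region = 'EU'"
)

# Usage, given a connected cursor:
#     for line in explain(cursor, QUERY):
#         print(line)
```

Reading the plan for your own query should tell you which join strategy was chosen and how each table is being scanned, which is usually enough to decide whether the query is worth rephrasing.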

Hope that helps.
