On 10/29/18 11:39 AM, Nicolas Paris wrote:
Thanks Josh,
On Mon, Oct 29, 2018 at 10:47:42AM -0400, Josh Elser wrote:
Use Hive when Hive does things well, and use Phoenix when Phoenix does
it well.
That would be great. My concern is that Phoenix "joins" have not been
competitive with PostgreSQL in my current tests.
Phoenix + hive is ok, however
Phoenix + hive + postgres is not.
Am I wrong about the poor performance of joins on large tables
(> 10M)?
I think trying to phrase "JOIN efficiency" in terms of data set size is
the wrong way to arrive at a useful explanation.
There are limitations that Phoenix has which I would summarize as
"things HBase can handle as push-downs" and "the lack of a distributed
execution engine".
For example, you found few-to-many joins worked well with Phoenix, but
you would find that (in most cases) many-to-many joins will be slow. This
is largely because of the constructs that HBase provides as a data store
and what Phoenix can "work with". When Phoenix can push down one side of
the join, you get a fast, (often) parallelized scan from Phoenix. When
both sides of the relation are large, you end up running a sort-merge
join which pulls everything back to the client.
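To make the difference between the two cases concrete, here is a toy
Python sketch. It is not Phoenix code: the function names and the
in-memory lists of (key, value) rows are made up for illustration. The
first function mimics the case where one side is small enough to be
hashed and probed while the other side is scanned; the second mimics a
sort-merge join, which needs both inputs in key order before it can
emit anything.

```python
from collections import defaultdict


def broadcast_hash_join(small, large):
    """Join (key, value) rows by building a hash table on the small side."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    # The large side is streamed once; each row only probes the hash table.
    return [(key, s, l) for key, l in large for s in table.get(key, [])]


def sort_merge_join(left, right):
    """Join (key, value) rows by sorting both sides and merging them."""
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out
```

Both functions return the same matches. The point of the sketch is the
shape of the work: the hash variant touches the large side once, while
the merge variant has to collect and order both sides in one place
first.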
The first step is understanding what Phoenix is actually doing to run
your query (JOIN or otherwise) and then understanding if you can
rephrase your JOIN (or really, the application-level "question") in such
a way that Phoenix can run an efficient execution over it.
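As a rough sketch of that first step, the helper below prefixes a query
with Phoenix's EXPLAIN statement and returns the plan text. It assumes
a Python DB-API style cursor (for example one obtained through the
phoenixdb driver; that choice is an assumption, any client that can
run EXPLAIN works). The helper name, the ORDERS and CUSTOMERS tables,
and their columns are hypothetical.

```python
def explain(cursor, sql):
    """Run EXPLAIN on `sql` through a DB-API cursor and return plan lines."""
    cursor.execute("EXPLAIN " + sql)
    # The plan comes back as rows; the first column holds the plan text.
    return [row[0] for row in cursor.fetchall()]


# Hypothetical query: ORDERS and CUSTOMERS are made-up table names.
QUERY = (
    "SELECT o.ID, c.NAME "
    "FROM ORDERS o JOIN CUSTOMERS c ON o.CUSTOMER_ID = c.ID"
)
```

With a live connection you would print each line of
explain(conn.cursor(), QUERY) and check which join strategy and which
scans show up before deciding how to rephrase the query.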
Hope that helps.