Re: Phoenix Performances & Uses Cases

2018-10-29 Thread Josh Elser
Specifically to your last two points about windowing, transforming, 
grouping, etc: my current opinion is that Hive does certain analytical 
style operations much better than Phoenix. Personally, I don't think it 
makes sense for Phoenix to try to "catch up". It would take years for us 
to build such capabilities on par with what they have.


Some of us have been working to ease data access between Hive and 
Phoenix via the PhoenixStorageHandler for Hive. The goal is to make it 
easier for you to use the correct tool for the job. Use 
Hive when Hive does things well, and use Phoenix when Phoenix does it well.


(Again, this is my opinion. It is not meant to be some declaration of 
direction by the entire Apache Phoenix community)


On 10/27/18 7:50 AM, Nicolas Paris wrote:
> Hi
>
> I am benchmarking Phoenix to better understand its strengths and
> weaknesses. My basis is to compare it to PostgreSQL for OLTP workloads
> and Hive LLAP for OLAP workloads. I am testing on a 10-computer cluster
> instance with Hive (2.1) and Phoenix (4.8), 220 GB RAM/32 CPU, versus a
> PostgreSQL (9.6) 128 GB RAM/32 CPU.
>
> Right now, my opinion is:
> - when getting a subset of a large table, Phoenix performs the
>   best
> - when getting a subset from multiple large tables, PostgreSQL performs
>   the best
> - when getting a subset from a large table joining one to many small
>   tables, Phoenix performs the best
> - when ingesting high-frequency data, Phoenix performs the best
> - for GROUP BY queries, Hive > PostgreSQL > Phoenix
> - when windowing, transforming, grouping, Hive performs the best,
>   Phoenix the worst
>
> Finally, my conclusion is that Phoenix is not intended at all for
> analytics queries such as grouping, windowing, and joining large tables.
> It is well suited to very specific use cases, like maintaining a very
> large table with possibly small tables to join with (such as time-series
> data, or binary storage data with HBase MOB enabled).
>
> Am I missing something?
>
> Thanks,



Re: Phoenix Performances & Uses Cases

2018-10-29 Thread Nicolas Paris
Thanks Josh,

On Mon, Oct 29, 2018 at 10:47:42AM -0400, Josh Elser wrote:
> Use Hive when Hive does things well, and use Phoenix when Phoenix does
> it well.

That would be great. My concern is that Phoenix "joins" do not compete
with PostgreSQL in my actual tests.
Phoenix + Hive is OK; however,
Phoenix + Hive + PostgreSQL is not.


Am I wrong about the poor performance of joins on large
tables (> 10M)?

-- 
nicolas


Re: Phoenix Performances & Uses Cases

2018-10-29 Thread Josh Elser

On 10/29/18 11:39 AM, Nicolas Paris wrote:
> Thanks Josh,
>
> On Mon, Oct 29, 2018 at 10:47:42AM -0400, Josh Elser wrote:
>> Use Hive when Hive does things well, and use Phoenix when Phoenix does
>> it well.
>
> That would be great. My concern is that Phoenix "joins" do not compete
> with PostgreSQL in my actual tests.
> Phoenix + Hive is OK; however,
> Phoenix + Hive + PostgreSQL is not.
>
> Am I wrong about the poor performance of joins on large
> tables (> 10M)?



I think trying to phrase "JOIN efficiency" in terms of data sets is the 
wrong way to go about an appropriate explanation.


There are limitations that Phoenix has which I would summarize as 
"things HBase can handle as push-downs" and "the lack of a distributed 
execution engine".


For example, you found few-to-many joins worked well with Phoenix, but 
you would find that (in most cases) many-to-many joins will be slow. This 
is largely because of the constructs that HBase provides as a data store 
and what Phoenix can "work with". When Phoenix can push down one side of 
the join, you get a fast, (often) parallelized scan from Phoenix. When 
both sides of the relation are large, you end up running a sort-merge 
join which pulls everything back to the client.
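
To make the contrast between the two strategies concrete, here is a toy 
Python sketch. It is not Phoenix code: the function names and the 
dict-per-row shape are invented for illustration, and it assumes that in 
the real hash-join case the probe step runs next to the data rather than 
in one process. The point is only that the first strategy needs one side 
small enough to hold in memory, while the second has to sort and walk 
both sides in full.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, large_rows, key):
    """Hash the small side once, then stream the large side past it."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    # Only the small side is held in memory; the large side is scanned once.
    for row in large_rows:
        for match in table.get(row[key], []):
            yield {**match, **row}


def sort_merge_join(left_rows, right_rows, key):
    """Sort both sides on the join key, then merge them in one place."""
    left = sorted(left_rows, key=lambda r: r[key])
    right = sorted(right_rows, key=lambda r: r[key])
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-side rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row with that key against the whole run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    yield {**left[i], **r}
                i += 1
            j = j_end


# Hypothetical sample data: a small dimension table and a larger fact table.
dims = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
facts = [{"id": 1, "v": 10}, {"id": 1, "v": 11}, {"id": 3, "v": 12}]

hash_result = list(broadcast_hash_join(dims, facts, "id"))
merge_result = list(sort_merge_join(dims, facts, "id"))
```

Both functions return the same joined rows; what differs is how much 
data each one has to hold and move to produce them.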


The first step is understanding what Phoenix is actually doing to run 
your query (JOIN or otherwise) and then understanding if you can 
rephrase your JOIN (or really, the application-level "question") in such 
a way that Phoenix can run an efficient execution over it.


Hope that helps.


Re: Phoenix Performances & Uses Cases

2018-11-02 Thread Neelesh
Another observation, about Phoenix global indexes: at very large write
volumes, a single region server failure cascades to the entire cluster
very quickly.
