Re: Phoenix Performances & Uses Cases
Specifically to your last two points about windowing, transforming, grouping, etc.: my current opinion is that Hive does certain analytical-style operations much better than Phoenix. Personally, I don't think it makes sense for Phoenix to try to "catch up". It would take years for us to build such capabilities on par with what they have.

Some of us have been making efforts to ease data access between Hive and Phoenix via the PhoenixStorageHandler for Hive. The goal is to make it easier for you to use the correct tool for the job. Use Hive when Hive does things well, and use Phoenix when Phoenix does it well.

(Again, this is my opinion. It is not meant to be some declaration of direction by the entire Apache Phoenix community.)

On 10/27/18 7:50 AM, Nicolas Paris wrote:
> Hi
>
> I am benchmarking phoenix to better understand its strengths and
> weaknesses. My basis is to compare to postgresql for OLTP workload and
> hive llap for OLAP workload. I am testing on a 10-computer cluster
> instance with hive (2.1) and phoenix (4.8), 220 GB RAM/32 CPU, versus a
> postgresql (9.6), 128 GB RAM/32 CPU.
>
> Right now, my opinion is:
> - when getting a subset of a large table, phoenix performs the best
> - when getting a subset from multiple large tables, postgres performs
>   the best
> - when getting a subset from a large table joining one to many small
>   tables, phoenix performs the best
> - when ingesting high frequency data, Phoenix performs the best
> - for GROUP BY queries, hive > postgresql > phoenix
> - when windowing, transforming, grouping, hive performs the best,
>   phoenix the worst
>
> Finally, my conclusion is that phoenix is not intended at all for
> analytics queries such as grouping, windowing, and joining large tables.
> It suits very specific use cases well, like maintaining a very large
> table, possibly with small tables to join with (such as timeseries data,
> or binary storage data with hbase MOB enabled).
>
> Am I missing something?
>
> Thanks,
Re: Phoenix Performances & Uses Cases
Thanks Josh,

On Mon, Oct 29, 2018 at 10:47:42AM -0400, Josh Elser wrote:
> Use Hive when Hive does things well, and use Phoenix when Phoenix does
> it well.

That would be great. My concern is that the phoenix "joins" do not compete with postgresql in my actual tests. Phoenix + hive is ok, however Phoenix + hive + postgres is not.

Am I wrong about the bad performance of joins in the context of large tables (> 10M)?

--
nicolas
Re: Phoenix Performances & Uses Cases
On 10/29/18 11:39 AM, Nicolas Paris wrote:
> Thanks Josh,
>
> On Mon, Oct 29, 2018 at 10:47:42AM -0400, Josh Elser wrote:
>> Use Hive when Hive does things well, and use Phoenix when Phoenix does
>> it well.
>
> That would be great. My concern is that the phoenix "joins" do not
> compete with postgresql in my actual tests. Phoenix + hive is ok,
> however Phoenix + hive + postgres is not.
>
> Am I wrong about the bad performance of joins in the context of large
> tables (> 10M)?

I think trying to phrase "JOIN efficiency" in terms of data set sizes is the wrong way to go about an appropriate explanation. There are limitations that Phoenix has which I would summarize as "things HBase can handle as push-downs" and "the lack of a distributed execution engine".

For example, you found few-to-many joins worked well with Phoenix, but you would find that (in most cases) many-to-many joins will be slow. This is largely because of the constructs that HBase provides as a data store and what Phoenix can "work with". When Phoenix can push down one side of the join, you get a fast, (often) parallelized scan from Phoenix. When both sides of the relation are large, you end up running a sort-merge join which pulls everything back to the client.

The first step is understanding what Phoenix is actually doing to run your query (JOIN or otherwise), and then understanding whether you can rephrase your JOIN (or really, the application-level "question") in such a way that Phoenix can run an efficient execution over it.

Hope that helps.
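To make the two join shapes described above concrete, here is a toy, in-memory Python sketch. It is not Phoenix's implementation and uses no Phoenix API; the function names and the sample rows are made up for illustration. The first function builds a hash table from the small side and streams the large side past it, which is roughly the shape that lets the small side be shipped to where the large table is scanned. The second sorts both sides in full and merges them, which is the shape you fall back to when neither side is small.

```python
from collections import defaultdict


def hash_join(small, large, key_small, key_large):
    """Build a lookup table on the small side, then stream the large side."""
    table = defaultdict(list)
    for row in small:
        table[row[key_small]].append(row)
    for row in large:
        # Only the small side is held in memory; the large side is scanned once.
        for match in table.get(row[key_large], ()):
            yield {**match, **row}


def sort_merge_join(left, right, key_left, key_right):
    """Sort both sides in full, then walk them in step, pairing equal keys."""
    left = sorted(left, key=lambda r: r[key_left])
    right = sorted(right, key=lambda r: r[key_right])
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key_left], right[j][key_right]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right-hand rows that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][key_right] == lk:
                j_end += 1
            # Pair every left-hand row with that key against the run.
            while i < len(left) and left[i][key_left] == lk:
                for r in right[j:j_end]:
                    yield {**left[i], **r}
                i += 1
            j = j_end


# Hypothetical sample data: a small dimension table and a larger fact table.
sensors = [{"sensor_id": 1, "site": "A"}, {"sensor_id": 2, "site": "B"}]
readings = [
    {"sid": 2, "value": 5},
    {"sid": 1, "value": 7},
    {"sid": 1, "value": 9},
    {"sid": 3, "value": 1},
]

hashed = list(hash_join(sensors, readings, "sensor_id", "sid"))
merged = list(sort_merge_join(sensors, readings, "sensor_id", "sid"))
```

Both functions return the same joined rows; the difference is how much data has to be materialized and sorted in one place before any output is produced.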
Re: Phoenix Performances & Uses Cases
Another observation with Phoenix global indexes: at very large volumes of writes, a single region server failure cascades to the entire cluster very quickly.

On Sat, Oct 27, 2018, 4:50 AM Nicolas Paris wrote:
> Hi
>
> I am benchmarking phoenix to better understand its strengths and
> weaknesses. My basis is to compare to postgresql for OLTP workload and
> hive llap for OLAP workload. I am testing on a 10-computer cluster
> instance with hive (2.1) and phoenix (4.8), 220 GB RAM/32 CPU, versus a
> postgresql (9.6), 128 GB RAM/32 CPU.
>
> Right now, my opinion is:
> - when getting a subset of a large table, phoenix performs the best
> - when getting a subset from multiple large tables, postgres performs
>   the best
> - when getting a subset from a large table joining one to many small
>   tables, phoenix performs the best
> - when ingesting high frequency data, Phoenix performs the best
> - for GROUP BY queries, hive > postgresql > phoenix
> - when windowing, transforming, grouping, hive performs the best,
>   phoenix the worst
>
> Finally, my conclusion is that phoenix is not intended at all for
> analytics queries such as grouping, windowing, and joining large tables.
> It suits very specific use cases well, like maintaining a very large
> table, possibly with small tables to join with (such as timeseries data,
> or binary storage data with hbase MOB enabled).
>
> Am I missing something?
>
> Thanks,
>
> --
> nicolas