Thanks James for the reply. Please see my comments below.

> Secondary indexes[1] on the non-primary key columns are the way to improve
> performance for these cases. Take a look at this[2] presentation for more
> detail.
I have done a brief reading on secondary indexes and I will go through the
video for a detailed study. I understand that they can help improve
performance, but they will probably not be a good fit for the OLAP workload
we plan to use HBase + Phoenix for: with OLAP we have no control over the
queries that are executed during interactive analysis, so we cannot predict
ahead of time which columns to index. Let me know if you have thoughts on
this.

> I'd recommend at least 6 nodes and 10-20 nodes would be better in your
> test environment.

We are in a proof-of-concept phase and have not yet finalized how big the
HBase cluster will be in production. Initially we plan to start with 4-5
nodes. Our data size would be 10-100 million records (definitely not
billions). Do you see a reason we should increase the cluster size?

> Have you seen our Pherf tool[3] that will help you benchmark your queries
> under representative data sizes?

I will look at the tool. Thanks again for sharing your inputs.

On Mon, Apr 11, 2016 at 9:29 PM, James Taylor <jamestay...@apache.org>
wrote:

> Hi Amit,
>
> If a query doesn't filter on the primary key columns, the entire table
> must be scanned (hence it'll be slower). Secondary indexes[1] on the
> non-primary key columns are the way to improve performance for these
> cases. Take a look at this[2] presentation for more detail.
>
> Also, a 3 node cluster is not ideal for benchmarking, but it might be fine
> to just play around with a bit. There's always overhead with a distributed
> system that you won't see in a more typical single node RDBMS. If you can
> mimic the cluster size you'll use in production, that'd be ideal.
> Otherwise, I'd recommend at least 6 nodes, and 10-20 nodes would be better
> in your test environment. Have you seen our Pherf tool[3] that will help
> you benchmark your queries under representative data sizes?
>
> Thanks,
> James
>
> [1] https://phoenix.apache.org/secondary_indexing.html
> [2] https://www.youtube.com/watch?v=f4Nmh5KM6gI&feature=youtu.be
> [3] https://phoenix.apache.org/pherf.html
>
> On Mon, Apr 11, 2016 at 6:37 AM, Amit Shah <amits...@gmail.com> wrote:
>
>> Hi Mujtaba,
>>
>> I observed that if the where clause and group by are applied on the
>> primary key columns, the queries are superfast (~200 ms). This is not the
>> case for queries that have non-primary key columns in the where clause or
>> group by. I tried configuring the bucket cache, but surprisingly it
>> doesn't give much of the speed improvement I had expected: one of these
>> queries takes 26 secs with the bucket cache and 31 secs without it on a
>> 3 node HBase cluster.
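>>
>> To make the difference concrete, here are the two query shapes I am
>> comparing. The point-lookup example is hypothetical (it assumes
>> TRANSACTIONS has the same columns as the TRANSACTIONS_TEST schema below,
>> and made-up key values), and the plan comments are roughly what sqlline's
>> EXPLAIN prints, quoted from memory:
>>
>> -- filters on the full primary key (PRODUCT_ID, CUSTOMER_ID): a point lookup
>> EXPLAIN SELECT CS_NAME FROM TRANSACTIONS
>> WHERE PRODUCT_ID = 1001 AND CUSTOMER_ID = 42;
>> -- CLIENT 1-CHUNK PARALLEL 1-WAY POINT LOOKUP ON 1 KEY OVER TRANSACTIONS
>>
>> -- aggregates over a non-primary-key column: a full table scan
>> EXPLAIN SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS
>> GROUP BY T_COUNTRY;
>> -- CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS
>> --     SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY]
>> -- CLIENT MERGE SORT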
>>
>> Let me know if you have suggestions that we could try out.
>>
>> Regards,
>> Amit.
>>
>> On Tue, Mar 29, 2016 at 10:55 PM, Amit Shah <amits...@gmail.com> wrote:
>>
>>> Hi Mujtaba,
>>>
>>> Could these improvements be because of the region distribution across
>>> region servers? Along with the optimizations you had suggested, I had
>>> also used hbase-region-inspector to move regions evenly across the
>>> region servers.
>>>
>>> Below is the table schema for the TRANSACTIONS table:
>>>
>>> CREATE TABLE TRANSACTIONS_TEST (
>>>     AGENT_ID BIGINT
>>>     ,A_NAME VARCHAR
>>>     ,A_ROLE VARCHAR
>>>     ,TERRITORY_ID BIGINT
>>>     ,T_ZIPCODE BIGINT
>>>     ,T_PLACE_NAME VARCHAR
>>>     ,GRPBYCF.T_STATE VARCHAR
>>>     ,GRPBYCF.T_COUNTRY VARCHAR
>>>     ,PRODUCT_ID BIGINT NOT NULL
>>>     ,P_NAME VARCHAR
>>>     ,P_CATEGORY VARCHAR
>>>     ,CHANNEL_ID BIGINT
>>>     ,CH_NAME VARCHAR
>>>     ,CH_TYPE VARCHAR
>>>     ,CUSTOMER_ID BIGINT NOT NULL
>>>     ,CS_NAME VARCHAR
>>>     ,CS_TYPE VARCHAR
>>>     ,IS_NEW_CUSTOMER BOOLEAN
>>>     ,CLOSE_DATE DATE
>>>     ,DAY_CNT_SPENT INTEGER
>>>     ,TOTAL_EXPENSES BIGINT
>>>     ,FORCASTED_SALES BIGINT
>>>     ,GRPBYCF.UNIT_CNT_SOLD BIGINT
>>>     ,PRICE_PER_UNIT BIGINT
>>>     ,DISCOUNT_PERCENT BIGINT
>>>     ,GRPBYCF.TOTAL_SALES BIGINT
>>>     ,CONSTRAINT pk PRIMARY KEY (PRODUCT_ID, CUSTOMER_ID)
>>> ) COMPRESSION='SNAPPY';
>>>
>>> I will try out the guidepost width reduction and let you know the
>>> results.
>>>
>>> Thank you,
>>> Amit.
>>>
>>> On Tue, Mar 29, 2016 at 10:50 PM, Mujtaba Chohan <mujt...@apache.org>
>>> wrote:
>>>
>>>> Optimization did help somewhat, but not to the extent I was expecting.
>>>> See chart below.
>>>>
>>>> [image: Inline image 1]
>>>>
>>>> Can you share your table schema so I can experiment with it? Another
>>>> thing you can try is reducing the guidepost width
>>>> <https://phoenix.apache.org/tuning.html> for this table by executing:
>>>>
>>>> UPDATE STATISTICS TRANSACTIONS SET
>>>> "phoenix.stats.guidepost.width"=50000000;
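>>>>
>>>> To check that the new width took effect, you can count the guideposts
>>>> that statistics collection wrote for the table - a smaller width should
>>>> mean more guideposts and therefore more parallel scans. A sketch (I'm
>>>> quoting the SYSTEM.STATS column name from memory, so double-check it):
>>>>
>>>> SELECT COUNT(*) FROM SYSTEM.STATS WHERE PHYSICAL_NAME = 'TRANSACTIONS';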
>>>>
>>>> On Tue, Mar 29, 2016 at 6:45 AM, Amit Shah <amits...@gmail.com> wrote:
>>>>
>>>>> Hi Mujtaba,
>>>>>
>>>>> I did try the two optimization techniques by recreating the table and
>>>>> loading it again with 10 mil records. They do not seem to help much in
>>>>> terms of the timings. Kindly find the phoenix log file attached. Let
>>>>> me know if I am missing anything.
>>>>>
>>>>> Thanks,
>>>>> Amit.
>>>>>
>>>>> On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan <mujt...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Here's the chart for the time each of the parallel scans takes after
>>>>>> the split. On the RS where data is not read from disk, the scan gets
>>>>>> back in ~20 secs, but for the RS which holds 6 regions it's ~45 secs.
>>>>>>
>>>>>> [image: Inline image 2]
>>>>>>
>>>>>>> Yes I see disk reads with 607 ios/second on the host that stores 6
>>>>>>> regions
>>>>>>
>>>>>> Two things that you should try to reduce disk reads, or maybe a
>>>>>> combination of both:
>>>>>>
>>>>>> 1. Have only the columns used in your group by query in a separate
>>>>>> column family:
>>>>>>
>>>>>> CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD integer,
>>>>>> GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
>>>>>>
>>>>>> 2. Turn on snappy compression for your table:
>>>>>>
>>>>>> ALTER TABLE T SET COMPRESSION='SNAPPY'
>>>>>>
>>>>>> followed by a major compaction.
>>>>>>
>>>>>>> I tried to compact the table from the hbase web UI
>>>>>>
>>>>>> You need to do *major_compact* from the HBase shell. From the UI it's
>>>>>> a minor compaction.
>>>>>>
>>>>>> - mujtaba
>>>>>>
>>>>>> On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <amits...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Mujtaba and James for replying back.
>>>>>>>
>>>>>>> Mujtaba, below are the details for your follow-up queries.
>>>>>>>
>>>>>>>> 1. How wide is your table?
>>>>>>>
>>>>>>> I have 26 columns in the TRANSACTIONS table, with two of the columns
>>>>>>> combined into a composite primary key.
>>>>>>>
>>>>>>>> 2. How many region servers is your data distributed on and what's
>>>>>>>> the heap size?
>>>>>>>
>>>>>>> When I posted the initial readings of the query taking around 2
>>>>>>> minutes, I had one region server storing 4 regions of the 10 mil
>>>>>>> record TRANSACTIONS table. The heap size on the master server is
>>>>>>> 1 GB, while the region server has a 3.63 GB heap setting.
>>>>>>>
>>>>>>> Later I added 2 more nodes to the cluster and configured them as
>>>>>>> data nodes and region servers. After this step, the regions got
>>>>>>> split across two region servers, with 2 regions on one region server
>>>>>>> and 6 on another. I didn't follow what action caused this region
>>>>>>> move, or whether it was done automatically by HBase (the load
>>>>>>> balancer?).
>>>>>>>
>>>>>>>> 3. Do you see lots of disk I/O on region servers during
>>>>>>>> aggregation?
>>>>>>>
>>>>>>> Yes, I see disk reads with 607 ios/second on the host that stores 6
>>>>>>> regions. Kindly find the disk I/O statistics attached as images.
>>>>>>>
>>>>>>>> 4. Can you try your query after major compacting your table?
>>>>>>>
>>>>>>> I tried to compact the table from the HBase web UI. For some reason,
>>>>>>> the compaction table attribute on the web UI is still shown as NONE.
>>>>>>> After these changes, the query time is down to *42 secs*.
>>>>>>> Is compression different from compaction? Would the query
>>>>>>> performance improve by compressing the data with one of the
>>>>>>> algorithms? Logically it doesn't sound right, though.
>>>>>>>
>>>>>>>> Can you also replace log4j.properties with the attached one and
>>>>>>>> reply back with phoenix.log created by executing your query in
>>>>>>>> sqlline?
>>>>>>>
>>>>>>> After replacing log4j.properties, I have captured and attached the
>>>>>>> logs for the group by query execution.
>>>>>>>
>>>>>>> James,
>>>>>>> If I follow the queries that you pasted, I see the index getting
>>>>>>> used, but if I explain the query plan on the pre-loaded TRANSACTIONS
>>>>>>> table, I do not see the index being used. Probably the query plan
>>>>>>> changes based on whether the table has data or not.
>>>>>>>
>>>>>>> The query time is down to 42 secs right now. Let me know if you
>>>>>>> have more suggestions to improve it further.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Amit.
>>>>>>>
>>>>>>> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <
>>>>>>> jamestay...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi Amit,
>>>>>>>> Using the 4.7.0-HBase-1.1 release, I see the index being used for
>>>>>>>> that query (see below). An index will help some, as the aggregation
>>>>>>>> can be done in place as the scan over the index is occurring (as
>>>>>>>> opposed to having to hold the distinct values found during grouping
>>>>>>>> in memory per chunk of work and sort each chunk on the client).
>>>>>>>> It's not going to prevent the entire index from being scanned,
>>>>>>>> though. You'll need a WHERE clause to prevent that.
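>>>>>>>>
>>>>>>>> For example, since T_COUNTRY is the leading column of the index, a
>>>>>>>> filter on it should turn the full index scan into a range scan over
>>>>>>>> just the matching countries - an untested sketch with made-up
>>>>>>>> values:
>>>>>>>>
>>>>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS
>>>>>>>> WHERE T_COUNTRY IN ('US', 'CA') GROUP BY T_COUNTRY;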
>>>>>>>>
>>>>>>>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer
>>>>>>>> primary key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY
>>>>>>>> varchar);
>>>>>>>> No rows affected (1.32 seconds)
>>>>>>>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX
>>>>>>>> ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>>> No rows affected (6.452 seconds)
>>>>>>>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
>>>>>>>> SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>>>>>> +--------------------------------------------------------------------------+
>>>>>>>> |                                   PLAN                                   |
>>>>>>>> +--------------------------------------------------------------------------+
>>>>>>>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX  |
>>>>>>>> |     SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]         |
>>>>>>>> | CLIENT MERGE SORT                                                        |
>>>>>>>> +--------------------------------------------------------------------------+
>>>>>>>> 3 rows selected (0.028 seconds)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> James
>>>>>>>>
>>>>>>>> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <
>>>>>>>> mujt...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> That seems excessively slow for 10M rows, which should take on the
>>>>>>>>> order of a few seconds at most, even without an index. 1. How wide
>>>>>>>>> is your table? 2. How many region servers is your data distributed
>>>>>>>>> on and what's the heap size? 3. Do you see lots of disk I/O on
>>>>>>>>> region servers during aggregation? 4. Can you try your query after
>>>>>>>>> major compacting your table?
>>>>>>>>>
>>>>>>>>> Can you also replace log4j.properties with the attached one and
>>>>>>>>> reply back with phoenix.log created by executing your query in
>>>>>>>>> sqlline?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Mujtaba
>>>>>>>>>
>>>>>>>>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <amits...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am trying to evaluate Apache HBase (version 1.0.0) and Phoenix
>>>>>>>>>> (version 4.6), deployed through Cloudera, for our OLAP workload.
>>>>>>>>>> I have a table that has 10 mil rows. I try to execute the below
>>>>>>>>>> roll-up query and it takes around 2 mins to return 1,850 rows.
>>>>>>>>>>
>>>>>>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS
>>>>>>>>>> GROUP BY T_COUNTRY;
>>>>>>>>>>
>>>>>>>>>> I tried applying the "joining with indices" example given on the
>>>>>>>>>> website <https://phoenix.apache.org/joins.html> to the
>>>>>>>>>> TRANSACTIONS table by creating an index on the grouped-by column
>>>>>>>>>> as below, but that doesn't help.
>>>>>>>>>>
>>>>>>>>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS
>>>>>>>>>> (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>>>>>
>>>>>>>>>> This index is not getting used when the query is executed. The
>>>>>>>>>> query plan is as below:
>>>>>>>>>>
>>>>>>>>>> +-------------------------------------------------------------+
>>>>>>>>>> |                            PLAN                             |
>>>>>>>>>> +-------------------------------------------------------------+
>>>>>>>>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>>>>>>>>>> | SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY]          |
>>>>>>>>>> | CLIENT MERGE SORT                                           |
>>>>>>>>>> +-------------------------------------------------------------+
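>>>>>>>>>>
>>>>>>>>>> From the secondary indexing docs[1] I gather a hint should force
>>>>>>>>>> the optimizer to use the index - a sketch I have not verified
>>>>>>>>>> against this table:
>>>>>>>>>>
>>>>>>>>>> SELECT /*+ INDEX(TRANSACTIONS TRANSACTIONS_COUNTRY_INDEX) */
>>>>>>>>>> SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS
>>>>>>>>>> GROUP BY T_COUNTRY;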
>>>>>>>>>>
>>>>>>>>>> Theoretically, can secondary indexes help improve the
>>>>>>>>>> performance of group by queries?
>>>>>>>>>>
>>>>>>>>>> Any suggestions on what different options in Phoenix I could try
>>>>>>>>>> out to speed up GROUP BY queries?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Amit.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>