Hi Mujtaba,

I observed that where-clause and group-by queries on the primary key columns are very fast (~200 ms). This is not the case for queries that filter or group by non-primary-key columns. I tried configuring the bucket cache, but surprisingly it doesn't give the speed improvement I had expected: one of these queries takes 26 secs with the bucket cache and 31 secs without it on a 3-node HBase cluster.
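For context, a bucket cache is typically enabled via hbase-site.xml along these lines (a sketch; the ioengine choice and size are illustrative and need tuning per host):

```xml
<!-- Sketch: enable an off-heap bucket cache. Values are illustrative. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <!-- size in MB; must fit within the region server's direct memory -->
  <value>4096</value>
</property>
```

Note that a bucket cache mostly helps repeated random reads of hot blocks; a full-scan aggregation over a table larger than the cache still goes to disk, which may explain the small difference observed.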
Let me know if you have suggestions that we could try out.

Regards,
Amit.

On Tue, Mar 29, 2016 at 10:55 PM, Amit Shah <amits...@gmail.com> wrote:
> Hi Mujtaba,
>
> Could these improvements be because of region distribution across region servers? Along with the optimizations you had suggested, I had also used hbase-region-inspector to move regions evenly across the region servers.
>
> Below is the table schema for the TRANSACTIONS table:
>
> CREATE TABLE TRANSACTIONS_TEST (
>     AGENT_ID BIGINT
>     ,A_NAME VARCHAR
>     ,A_ROLE VARCHAR
>     ,TERRITORY_ID BIGINT
>     ,T_ZIPCODE BIGINT
>     ,T_PLACE_NAME VARCHAR
>     ,GRPBYCF.T_STATE VARCHAR
>     ,GRPBYCF.T_COUNTRY VARCHAR
>     ,PRODUCT_ID BIGINT NOT NULL
>     ,P_NAME VARCHAR
>     ,P_CATEGORY VARCHAR
>     ,CHANNEL_ID BIGINT
>     ,CH_NAME VARCHAR
>     ,CH_TYPE VARCHAR
>     ,CUSTOMER_ID BIGINT NOT NULL
>     ,CS_NAME VARCHAR
>     ,CS_TYPE VARCHAR
>     ,IS_NEW_CUSTOMER BOOLEAN
>     ,CLOSE_DATE DATE
>     ,DAY_CNT_SPENT INTEGER
>     ,TOTAL_EXPENSES BIGINT
>     ,FORCASTED_SALES BIGINT
>     ,GRPBYCF.UNIT_CNT_SOLD BIGINT
>     ,PRICE_PER_UNIT BIGINT
>     ,DISCOUNT_PERCENT BIGINT
>     ,GRPBYCF.TOTAL_SALES BIGINT
>     ,CONSTRAINT pk PRIMARY KEY (PRODUCT_ID, CUSTOMER_ID)
> ) COMPRESSION='SNAPPY';
>
> I will try out the guidepost width reduction and let you know the results.
>
> Thank you,
> Amit.
>
> On Tue, Mar 29, 2016 at 10:50 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>> The optimization did help somewhat, but not to the extent I was expecting. See the chart below.
>>
>> [image: Inline image 1]
>>
>> Can you share your table schema so I can experiment with it? Another thing you can try is reducing the guidepost width <https://phoenix.apache.org/tuning.html> for this table by executing:
>>
>> UPDATE STATISTICS TRANSACTIONS SET "phoenix.stats.guidepost.width"=50000000;
>>
>> On Tue, Mar 29, 2016 at 6:45 AM, Amit Shah <amits...@gmail.com> wrote:
>>> Hi Mujtaba,
>>>
>>> I did try the two optimization techniques by recreating the table and then loading it again with 10 mil records.
>>> They do not seem to help much in terms of the timings. Kindly find the phoenix log file attached. Let me know if I am missing anything.
>>>
>>> Thanks,
>>> Amit.
>>>
>>> On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>>>> Here's the chart of the time each of the parallel scans takes after the split. On the RS where data is not read from disk the scan gets back in ~20 secs, but for the RS which holds 6 regions it's ~45 secs.
>>>>
>>>> [image: Inline image 2]
>>>>
>>>>> Yes I see disk reads with 607 ios/second on the host that stores 6 regions
>>>>
>>>> Two things that you should try to reduce disk reads, or maybe a combination of both:
>>>> 1. Have only the columns used in your group-by query in a separate column family:
>>>>    CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD integer, GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
>>>> 2. Turn on snappy compression for your table (ALTER TABLE T SET COMPRESSION='SNAPPY'), followed by a major compaction.
>>>>
>>>>> I tried to compact the table from the hbase web UI
>>>>
>>>> You need to run *major_compact* from the HBase shell. From the UI it's a minor compaction.
>>>>
>>>> - mujtaba
>>>>
>>>> On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <amits...@gmail.com> wrote:
>>>>> Thanks Mujtaba and James for replying back.
>>>>>
>>>>> Mujtaba, below are the details for your follow-up questions.
>>>>>
>>>>>> 1. How wide is your table?
>>>>>
>>>>> I have 26 columns in the TRANSACTIONS table, with a couple of columns combined to form the primary key.
>>>>>
>>>>>> 2. How many region servers is your data distributed on and what's the heap size?
>>>>>
>>>>> When I posted the initial readings of the query taking around 2 minutes, I had one region server storing 4 regions of the 10 mil record TRANSACTIONS table. The heap size on the master server is 1 GB, while the region server has a 3.63 GB heap setting.
>>>>> Later I added 2 more region servers to the cluster and configured them as data nodes and region servers. After this step, the regions were split across two region servers, with 2 regions on one region server and 6 on the other. I didn't follow what action caused this region split; was it done automatically by HBase (the load balancer?)?
>>>>>
>>>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>>>
>>>>> Yes, I see disk reads with 607 ios/second on the host that stores 6 regions. Kindly find the disk I/O statistics attached as images.
>>>>>
>>>>>> 4. Can you try your query after major compacting your table?
>>>>>
>>>>> I tried to compact the table from the hbase web UI. For some reason, the compaction table attribute on the web UI is still shown as NONE. After these changes, the query time is down to *42 secs*.
>>>>> Is compression different from compaction? Would the query performance improve by compressing the data with one of the algorithms? Logically it doesn't sound right though.
>>>>>
>>>>>> Can you also replace log4j.properties with the attached one and reply back with phoenix.log created by executing your query in sqlline?
>>>>>
>>>>> After replacing the log4j.properties, I have captured the logs for the group-by query execution and attached them.
>>>>>
>>>>> James,
>>>>> If I follow the queries that you pasted, I see the index getting used, but if I explain the query plan on the pre-loaded TRANSACTIONS table I do not see the index being used. Probably the query plan changes based on whether the table has data or not.
>>>>>
>>>>> The query time is down to 42 secs right now. Let me know if you have more suggestions on how to improve it further.
>>>>>
>>>>> Thanks,
>>>>> Amit.
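For anyone following along, the major compaction Mujtaba recommends is issued from the HBase shell, not the web UI; a sketch, assuming the table is named TRANSACTIONS:

```shell
$ hbase shell
# Inside the shell: trigger a major compaction (the web UI only queues minor compactions)
hbase(main):001:0> major_compact 'TRANSACTIONS'
```

On the compression-vs-compaction question above: compression is the storage codec (e.g. SNAPPY), while compaction rewrites store files; a newly set codec only applies to files written afterwards, which is why a major compaction is needed to rewrite existing data, and fewer bytes read from disk can help an I/O-bound scan.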
>>>>>
>>>>> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <jamestay...@apache.org> wrote:
>>>>>> Hi Amit,
>>>>>> Using the 4.7.0-HBase-1.1 release, I see the index being used for that query (see below). An index will help some, as the aggregation can be done in place as the scan over the index occurs (as opposed to having to hold the distinct values found during grouping in memory per chunk of work and sorting each chunk on the client). It's not going to prevent the entire index from being scanned, though. You'll need a WHERE clause to prevent that.
>>>>>>
>>>>>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer primary key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY varchar);
>>>>>> No rows affected (1.32 seconds)
>>>>>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>> No rows affected (6.452 seconds)
>>>>>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>>>> +--------------------------------------------------------------------------+
>>>>>> |                                   PLAN                                   |
>>>>>> +--------------------------------------------------------------------------+
>>>>>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX  |
>>>>>> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]             |
>>>>>> | CLIENT MERGE SORT                                                        |
>>>>>> +--------------------------------------------------------------------------+
>>>>>> 3 rows selected (0.028 seconds)
>>>>>>
>>>>>> Thanks,
>>>>>> James
>>>>>>
>>>>>> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <mujt...@apache.org> wrote:
>>>>>>> That seems excessively slow for 10M rows, which should take a few seconds at most even without an index.
>>>>>>> 1. How wide is your table?
>>>>>>> 2. How many region servers is your data distributed on and what's the heap size?
>>>>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>>>>> 4. Can you try your query after major compacting your table?
>>>>>>>
>>>>>>> Can you also replace log4j.properties with the attached one and reply back with phoenix.log created by executing your query in sqlline?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Mujtaba
>>>>>>>
>>>>>>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <amits...@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am trying to evaluate apache hbase (version 1.0.0) and phoenix (version 4.6), deployed through cloudera, for our OLAP workload. I have a table that has 10 mil rows. I try to execute the below roll-up query and it takes around 2 mins to return 1,850 rows.
>>>>>>>>
>>>>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>>>>>>
>>>>>>>> I tried applying the "joining with indices" example given on the website <https://phoenix.apache.org/joins.html> to the TRANSACTIONS table by creating an index on the grouped-by column as below, but that doesn't help.
>>>>>>>>
>>>>>>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>>>
>>>>>>>> This index is not used when the query is executed. The query plan is as below:
>>>>>>>>
>>>>>>>> +------------------------------------------+
>>>>>>>> | PLAN                                     |
>>>>>>>> +------------------------------------------+
>>>>>>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>>>>>>>> | SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY] |
>>>>>>>> | CLIENT MERGE SORT |
>>>>>>>> +------------------------------------------+
>>>>>>>>
>>>>>>>> Theoretically, can secondary indexes help improve the performance of group-by queries?
>>>>>>>>
>>>>>>>> Any suggestions on what different options in Phoenix I could try out to speed up GROUP BY queries?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Amit.
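As an aside, the benefit James describes above (aggregating in place while the index is scanned in sorted key order, instead of holding every distinct group in memory per chunk and merge-sorting on the client) can be sketched generically. This is an illustrative Python model, not Phoenix internals:

```python
from itertools import groupby
from operator import itemgetter

def aggregate_sorted(rows):
    """Stream over rows already sorted by key, like a scan over a covered
    index on T_COUNTRY: only one group's running sums are live at a time."""
    out = []
    for country, group in groupby(rows, key=itemgetter(0)):
        units = sales = 0
        for _, u, s in group:
            units += u
            sales += s
        out.append((country, units, sales))
    return out  # already in key order; no client-side sort needed

def aggregate_unsorted(rows):
    """Hash-aggregate rows in arbitrary order, like a full scan of the data
    table: every distinct group stays in memory until the scan finishes,
    and the result must be sorted afterwards."""
    acc = {}
    for country, u, s in rows:
        units, sales = acc.get(country, (0, 0))
        acc[country] = (units + u, sales + s)
    return sorted((k, u, s) for k, (u, s) in acc.items())

rows = [("DE", 1, 10), ("IN", 2, 20), ("IN", 3, 30), ("US", 4, 40)]
assert aggregate_sorted(rows) == aggregate_unsorted(rows)
```

Both paths produce the same result; the sorted-scan path just bounds memory and skips the final sort, which is why the covered index changes the plan to ORDERED DISTINCT ROWS even though it cannot avoid scanning every row without a WHERE clause.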