Hi Mujtaba,

Could these improvements be because of region distribution across the region servers? Along with the optimizations you suggested, I had also used hbase-region-inspector to move regions evenly across the region servers.
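(In case it is relevant for reproducing this, region balance can also be checked and triggered from the HBase shell instead of hbase-region-inspector. A rough sketch; the region and server names below are placeholders.)

    $ hbase shell
    hbase> # make sure the automatic balancer is enabled, then run it
    hbase> balance_switch true
    hbase> balancer
    hbase> # or move one region by hand to a specific region server
    hbase> move 'ENCODED_REGION_NAME', 'HOST,PORT,STARTCODE'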
Below is the table schema for the TRANSACTIONS table:

CREATE TABLE TRANSACTIONS_TEST (
    AGENT_ID BIGINT,
    A_NAME VARCHAR,
    A_ROLE VARCHAR,
    TERRITORY_ID BIGINT,
    T_ZIPCODE BIGINT,
    T_PLACE_NAME VARCHAR,
    GRPBYCF.T_STATE VARCHAR,
    GRPBYCF.T_COUNTRY VARCHAR,
    PRODUCT_ID BIGINT NOT NULL,
    P_NAME VARCHAR,
    P_CATEGORY VARCHAR,
    CHANNEL_ID BIGINT,
    CH_NAME VARCHAR,
    CH_TYPE VARCHAR,
    CUSTOMER_ID BIGINT NOT NULL,
    CS_NAME VARCHAR,
    CS_TYPE VARCHAR,
    IS_NEW_CUSTOMER BOOLEAN,
    CLOSE_DATE DATE,
    DAY_CNT_SPENT INTEGER,
    TOTAL_EXPENSES BIGINT,
    FORCASTED_SALES BIGINT,
    GRPBYCF.UNIT_CNT_SOLD BIGINT,
    PRICE_PER_UNIT BIGINT,
    DISCOUNT_PERCENT BIGINT,
    GRPBYCF.TOTAL_SALES BIGINT,
    CONSTRAINT pk PRIMARY KEY (PRODUCT_ID, CUSTOMER_ID)
) COMPRESSION='SNAPPY';

I will try out the guidepost width reduction and let you know the results.

Thank you,
Amit.

On Tue, Mar 29, 2016 at 10:50 PM, Mujtaba Chohan <[email protected]> wrote:

> The optimization did help somewhat, but not to the extent I was
> expecting. See chart below.
>
> [image: Inline image 1]
>
> Can you share your table schema so I can experiment with it? Another
> thing you can try is reducing the guidepost width
> <https://phoenix.apache.org/tuning.html> for this table by executing:
>
> UPDATE STATISTICS TRANSACTIONS SET "phoenix.stats.guidepost.width"=50000000;
>
> On Tue, Mar 29, 2016 at 6:45 AM, Amit Shah <[email protected]> wrote:
>
>> Hi Mujtaba,
>>
>> I did try the two optimization techniques by recreating the table and
>> loading it again with 10 mil records. They do not seem to help much in
>> terms of the timings. Kindly find the phoenix log file attached. Let me
>> know if I am missing anything.
>>
>> Thanks,
>> Amit.
>>
>> On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan <[email protected]> wrote:
>>
>>> Here's a chart of the time each of the parallel scans takes after the
>>> split. On the RS where data is not read from disk, the scan gets back
>>> in ~20 secs, but for the RS that holds 6 regions it's ~45 secs.
>>>
>>> [image: Inline image 2]
>>>
>>>> Yes I see disk reads with 607 ios/second on the hosts that stores 6
>>>> regions
>>>
>>> Two things you should try to reduce disk reads, or maybe a combination
>>> of both:
>>>
>>> 1. Keep only the columns used in your GROUP BY query in a separate
>>> column family:
>>>
>>> CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD integer,
>>> GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
>>>
>>> 2. Turn on Snappy compression for your table, followed by a major
>>> compaction:
>>>
>>> ALTER TABLE T SET COMPRESSION='SNAPPY'
>>>
>>>> I tried to compact the table from the hbase web UI
>>>
>>> You need to do *major_compact* from the HBase shell. From the UI it's
>>> a minor compaction.
>>>
>>> - mujtaba
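(For completeness, the major_compact step Mujtaba mentions would look roughly like this in the HBase shell. A sketch, assuming the physical table name is TRANSACTIONS; the compaction_state command may not be available in every HBase version.)

    $ hbase shell
    hbase> major_compact 'TRANSACTIONS'
    hbase> # optionally poll until the state goes back to NONE
    hbase> compaction_state 'TRANSACTIONS'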
>>> On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <[email protected]> wrote:
>>>
>>>> Thanks Mujtaba and James for replying back.
>>>>
>>>> Mujtaba, below are the details for your follow-up queries.
>>>>
>>>>> 1. How wide is your table?
>>>>
>>>> I have 26 columns in the TRANSACTIONS table, with a couple of columns
>>>> combined to form the primary key.
>>>>
>>>>> 2. How many region servers is your data distributed on and what's
>>>>> the heap size?
>>>>
>>>> When I posted the initial readings of the query taking around 2
>>>> minutes, I had one region server storing 4 regions of the 10 mil
>>>> record TRANSACTIONS table. The heap size on the master server is 1 GB
>>>> while the region server has a 3.63 GB heap setting.
>>>>
>>>> Later I added 2 more region servers to the cluster and configured
>>>> them as data nodes and region servers. After this step, the regions
>>>> got split across two region servers, with 2 regions on one region
>>>> server and 6 on the other. I didn't follow what action caused this
>>>> region split, or whether it was done automatically by HBase (load
>>>> balancer??).
>>>>
>>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>>
>>>> Yes, I see disk reads at 607 ios/second on the host that stores 6
>>>> regions. Kindly find the disk I/O statistics attached as images.
>>>>
>>>>> 4. Can you try your query after major compacting your table?
>>>>
>>>> I tried to compact the table from the HBase web UI. For some reason,
>>>> the compaction table attribute on the web UI is still shown as NONE.
>>>> After these changes, the query time is down to *42 secs*.
>>>> Is compression different from compaction? Would the query performance
>>>> improve by compressing the data with one of the algorithms? Logically
>>>> it doesn't sound right though.
>>>>
>>>>> Can you also replace log4j.properties with the attached one and
>>>>> reply back with phoenix.log created by executing your query in
>>>>> sqlline?
>>>>
>>>> After replacing log4j.properties, I have captured and attached the
>>>> logs for the group by query execution.
>>>>
>>>> James,
>>>> If I follow the queries that you pasted, I see the index getting
>>>> used, but if I explain the query plan on the pre-loaded TRANSACTIONS
>>>> table, I do not see the index being used. Probably the query plan
>>>> changes based on whether the table has data or not.
>>>>
>>>> The query time is down to 42 secs right now. Let me know if you have
>>>> more suggestions to improve it further.
>>>>
>>>> Thanks,
>>>> Amit.
>>>>
>>>> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <[email protected]> wrote:
>>>>
>>>>> Hi Amit,
>>>>> Using the 4.7.0-HBase-1.1 release, I see the index being used for
>>>>> that query (see below). An index will help some, as the aggregation
>>>>> can be done in place as the scan over the index occurs (as opposed
>>>>> to having to hold the distinct values found during grouping in
>>>>> memory per chunk of work and sort each chunk on the client). It's
>>>>> not going to prevent the entire index from being scanned, though.
>>>>> You'll need a WHERE clause to prevent that.
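(A hedged illustration of James's last point; the country value below is made up. With an equality filter on the leading indexed column, the plan should become a range scan over TRANSACTIONS_COUNTRY_INDEX instead of a full scan.)

    SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
    FROM TRANSACTIONS
    WHERE T_COUNTRY = 'US'
    GROUP BY T_COUNTRY;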
>>>>>
>>>>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer
>>>>> primary key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY
>>>>> varchar);
>>>>> No rows affected (1.32 seconds)
>>>>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX
>>>>> ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>> No rows affected (6.452 seconds)
>>>>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
>>>>> SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>>>
>>>>> +--------------------------------------------------------------------------+
>>>>> |                                   PLAN                                   |
>>>>> +--------------------------------------------------------------------------+
>>>>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX  |
>>>>> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]             |
>>>>> | CLIENT MERGE SORT                                                        |
>>>>> +--------------------------------------------------------------------------+
>>>>> 3 rows selected (0.028 seconds)
>>>>>
>>>>> Thanks,
>>>>> James
>>>>>
>>>>> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <[email protected]> wrote:
>>>>>
>>>>>> That seems excessively slow for 10M rows, which should take a few
>>>>>> seconds at most even without an index.
>>>>>>
>>>>>> 1. How wide is your table?
>>>>>> 2. How many region servers is your data distributed on and what's
>>>>>> the heap size?
>>>>>> 3. Do you see lots of disk I/O on region servers during
>>>>>> aggregation?
>>>>>> 4. Can you try your query after major compacting your table?
>>>>>>
>>>>>> Can you also replace log4j.properties with the attached one and
>>>>>> reply back with the phoenix.log created by executing your query in
>>>>>> sqlline?
>>>>>>
>>>>>> Thanks,
>>>>>> Mujtaba
>>>>>>
>>>>>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to evaluate Apache HBase (version 1.0.0) and Phoenix
>>>>>>> (version 4.6), deployed through Cloudera, for our OLAP workload.
>>>>>>> I have a table with 10 mil rows. When I execute the below roll-up
>>>>>>> query, it takes around 2 mins to return 1,850 rows.
>>>>>>>
>>>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS
>>>>>>> GROUP BY T_COUNTRY;
>>>>>>>
>>>>>>> I tried applying the "joining with indices" example given on the
>>>>>>> website <https://phoenix.apache.org/joins.html> to the
>>>>>>> TRANSACTIONS table by creating an index on the grouped-by column
>>>>>>> as below, but that doesn't help.
>>>>>>>
>>>>>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS
>>>>>>> (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>>
>>>>>>> This index is not getting used when the query is executed. The
>>>>>>> query plan is as below:
>>>>>>>
>>>>>>> +-------------------------------------------------------------+
>>>>>>> |                            PLAN                             |
>>>>>>> +-------------------------------------------------------------+
>>>>>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>>>>>>> | SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY]          |
>>>>>>> | CLIENT MERGE SORT                                           |
>>>>>>> +-------------------------------------------------------------+
>>>>>>>
>>>>>>> Theoretically, can secondary indexes help improve the
>>>>>>> performance of GROUP BY queries?
>>>>>>>
>>>>>>> Any suggestions on different options in Phoenix I could try out
>>>>>>> to speed up GROUP BY queries?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Amit.
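(One more hedged option for the original "index is not getting used" problem: Phoenix supports forcing a global index through a query hint. A sketch using the table and index names from this thread; whether it actually helps depends on how much of the data the query touches.)

    SELECT /*+ INDEX(TRANSACTIONS TRANSACTIONS_COUNTRY_INDEX) */
           SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
    FROM TRANSACTIONS
    GROUP BY T_COUNTRY;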
