Hi Amit,

If a query doesn't filter on the primary key columns, the entire table must be scanned (hence it'll be slower). Secondary indexes[1] on the non-primary-key columns are the way to improve performance for these cases. Take a look at this[2] presentation for more detail.
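For example, assuming a sales table with a non-primary-key column T_STATE (table and column names here are just illustrative, not from your schema), a covered index lets the filter and aggregate run entirely from the index:

```sql
-- Hypothetical table, for illustration only
CREATE TABLE SALES (
    SALE_ID BIGINT NOT NULL PRIMARY KEY,
    T_STATE VARCHAR,
    TOTAL_SALES BIGINT
);

-- Covered index: T_STATE becomes the leading part of the index row key,
-- and INCLUDE stores TOTAL_SALES in the index so no join back to the
-- data table is needed
CREATE INDEX SALES_STATE_IDX ON SALES (T_STATE) INCLUDE (TOTAL_SALES);

-- This filter can now be served by a range scan over SALES_STATE_IDX
-- instead of a full scan over SALES
SELECT SUM(TOTAL_SALES) FROM SALES WHERE T_STATE = 'CA';
```

You can verify which plan is chosen by prefixing the query with EXPLAIN in sqlline.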
Also, a 3-node cluster is not ideal for benchmarking, but it might be fine to just play around with a bit. There's always overhead in a distributed system that you won't see in a more typical single-node RDBMS. If you can mimic the cluster size you'll use in production, that'd be ideal. Otherwise, I'd recommend at least 6 nodes in your test environment, and 10-20 nodes would be better. Have you seen our Pherf tool[3]? It will help you benchmark your queries at representative data sizes.

Thanks,
James

[1] https://phoenix.apache.org/secondary_indexing.html
[2] https://www.youtube.com/watch?v=f4Nmh5KM6gI&feature=youtu.be
[3] https://phoenix.apache.org/pherf.html

On Mon, Apr 11, 2016 at 6:37 AM, Amit Shah <[email protected]> wrote:

> Hi Mujtaba,
>
> I observed that if the where-clause and group-by queries are applied on the primary key columns, then they are super fast (~200 ms). This is not the case with queries that have non-primary-key columns in the where clause and group-by queries. I tried configuring the bucket cache, but surprisingly it doesn't give much of the speed improvement I had expected. One of these queries takes 26 secs with the bucket cache and 31 secs without it on a 3-node HBase cluster.
>
> Let me know if you have suggestions that we could try out.
>
> Regards,
> Amit.
>
> On Tue, Mar 29, 2016 at 10:55 PM, Amit Shah <[email protected]> wrote:
>
>> Hi Mujtaba,
>>
>> Could these improvements be because of region distribution across region servers? Along with the optimizations you had suggested, I had also used hbase-region-inspector to move regions evenly across the region servers.
>>
>> Below is the table schema for the TRANSACTIONS table:
>>
>> CREATE TABLE TRANSACTIONS_TEST (
>>     AGENT_ID BIGINT
>>     ,A_NAME VARCHAR
>>     ,A_ROLE VARCHAR
>>     ,TERRITORY_ID BIGINT
>>     ,T_ZIPCODE BIGINT
>>     ,T_PLACE_NAME VARCHAR
>>     ,GRPBYCF.T_STATE VARCHAR
>>     ,GRPBYCF.T_COUNTRY VARCHAR
>>     ,PRODUCT_ID BIGINT NOT NULL
>>     ,P_NAME VARCHAR
>>     ,P_CATEGORY VARCHAR
>>     ,CHANNEL_ID BIGINT
>>     ,CH_NAME VARCHAR
>>     ,CH_TYPE VARCHAR
>>     ,CUSTOMER_ID BIGINT NOT NULL
>>     ,CS_NAME VARCHAR
>>     ,CS_TYPE VARCHAR
>>     ,IS_NEW_CUSTOMER BOOLEAN
>>     ,CLOSE_DATE DATE
>>     ,DAY_CNT_SPENT INTEGER
>>     ,TOTAL_EXPENSES BIGINT
>>     ,FORCASTED_SALES BIGINT
>>     ,GRPBYCF.UNIT_CNT_SOLD BIGINT
>>     ,PRICE_PER_UNIT BIGINT
>>     ,DISCOUNT_PERCENT BIGINT
>>     ,GRPBYCF.TOTAL_SALES BIGINT
>>     ,CONSTRAINT pk PRIMARY KEY (PRODUCT_ID, CUSTOMER_ID)
>> ) COMPRESSION='SNAPPY';
>>
>> I will try out the guidepost width reduction and let you know the results.
>>
>> Thank you,
>> Amit.
>>
>> On Tue, Mar 29, 2016 at 10:50 PM, Mujtaba Chohan <[email protected]> wrote:
>>
>>> Optimization did help somewhat but not to the extent I was expecting. See chart below.
>>>
>>> [image: Inline image 1]
>>>
>>> Can you share your table schema so I can experiment with it? Another thing you can try is reducing the guidepost width <https://phoenix.apache.org/tuning.html> for this table by executing:
>>>
>>> UPDATE STATISTICS TRANSACTIONS SET "phoenix.stats.guidepost.width"=50000000;
>>>
>>> On Tue, Mar 29, 2016 at 6:45 AM, Amit Shah <[email protected]> wrote:
>>>
>>>> Hi Mujtaba,
>>>>
>>>> I did try the two optimization techniques by recreating the table and then loading it again with 10 mil records. They do not seem to help much in terms of the timings. Kindly find the phoenix log file attached. Let me know if I am missing anything.
>>>>
>>>> Thanks,
>>>> Amit.
>>>>
>>>> On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan <[email protected]> wrote:
>>>>
>>>>> Here's the chart of the time taken by each of the parallel scans after the split. On the RS where data is not read from disk, the scan gets back in ~20 secs, but for the RS which has 6 regions it's ~45 secs.
>>>>>
>>>>> [image: Inline image 2]
>>>>>
>>>>>> Yes I see disk reads with 607 ios/second on the hosts that store 6 regions
>>>>>
>>>>> Two things that you should try to reduce disk reads, or maybe a combination of both:
>>>>>
>>>>> 1. Have only the columns used in your group-by query in a separate column family:
>>>>>    CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD integer, GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
>>>>> 2. Turn on snappy compression for your table:
>>>>>    ALTER TABLE T SET COMPRESSION='SNAPPY'
>>>>>    followed by a major compaction.
>>>>>
>>>>>> I tried to compact the table from the hbase web UI
>>>>>
>>>>> You need to do *major_compact* from the HBase shell. From the UI it's minor.
>>>>>
>>>>> - mujtaba
>>>>>
>>>>> On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <[email protected]> wrote:
>>>>>
>>>>>> Thanks Mujtaba and James for replying back.
>>>>>>
>>>>>> Mujtaba, below are details for your follow-up queries.
>>>>>>
>>>>>>> 1. How wide is your table
>>>>>>
>>>>>> I have 26 columns in the TRANSACTIONS table, with a couple of columns combined to form the primary key.
>>>>>>
>>>>>>> 2. How many region servers is your data distributed on and what's the heap size?
>>>>>>
>>>>>> When I posted the initial readings of the query taking around 2 minutes, I had one region server storing 4 regions for the 10 mil record TRANSACTIONS table. The heap size on the master server is 1 GB, while the region server has a 3.63 GB heap setting.
>>>>>>
>>>>>> Later I added 2 more region servers to the cluster and configured them as data nodes and region servers.
>>>>>> After this step, the regions got split across two region servers, with the count as 2 on one region server and 6 on the other. I didn't follow what action caused this region split, or whether it was done automatically by HBase (load balancer??).
>>>>>>
>>>>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>>>>
>>>>>> Yes, I see disk reads at 607 ios/second on the host that stores 6 regions. Kindly find the disk I/O statistics attached as images.
>>>>>>
>>>>>>> 4. Can you try your query after major compacting your table?
>>>>>>
>>>>>> I tried to compact the table from the hbase web UI. For some reason, the compaction table attribute on the web UI is still shown as NONE. After these changes, the query time is down to *42 secs.*
>>>>>> Is compression different from compaction? Would the query performance improve by compressing the data with one of the algorithms? Logically it doesn't sound right though.
>>>>>>
>>>>>>> Can you also replace log4j.properties with the attached one and reply back with phoenix.log created by executing your query in sqlline?
>>>>>>
>>>>>> After replacing log4j.properties, I have captured and attached the logs for the group-by query execution.
>>>>>>
>>>>>> James,
>>>>>> If I follow the queries that you pasted, I see the index getting used, but if I try to explain the query plan on the pre-loaded TRANSACTIONS table, I do not see the index being used. Probably the query plan changes based on whether the table has data or not.
>>>>>>
>>>>>> The query time is down to 42 secs right now. Let me know if you have more suggestions to improve it further.
>>>>>>
>>>>>> Thanks,
>>>>>> Amit.
>>>>>>
>>>>>> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Amit,
>>>>>>> Using the 4.7.0-HBase-1.1 release, I see the index being used for that query (see below). An index will help some, as the aggregation can be done in place as the scan over the index is occurring (as opposed to having to hold the distinct values found during grouping in memory per chunk of work and sorting each chunk on the client). It's not going to prevent the entire index from being scanned, though. You'll need a WHERE clause to prevent that.
>>>>>>>
>>>>>>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer primary key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY varchar);
>>>>>>> No rows affected (1.32 seconds)
>>>>>>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>> No rows affected (6.452 seconds)
>>>>>>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>>>>>
>>>>>>> +--------------------------------------------------------------------------+
>>>>>>> |                                   PLAN                                   |
>>>>>>> +--------------------------------------------------------------------------+
>>>>>>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX  |
>>>>>>> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]             |
>>>>>>> | CLIENT MERGE SORT                                                        |
>>>>>>> +--------------------------------------------------------------------------+
>>>>>>> 3 rows selected (0.028 seconds)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> James
>>>>>>>
>>>>>>> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <[email protected]> wrote:
>>>>>>>
>>>>>>>> That seems excessively slow for 10M rows, which should take on the order of a few seconds at most without an index.
>>>>>>>>
>>>>>>>> 1. How wide is your table?
>>>>>>>> 2. How many region servers is your data distributed on and what's the heap size?
>>>>>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>>>>>> 4. Can you try your query after major compacting your table?
>>>>>>>>
>>>>>>>> Can you also replace log4j.properties with the attached one and reply back with phoenix.log created by executing your query in sqlline?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Mujtaba
>>>>>>>>
>>>>>>>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am trying to evaluate Apache HBase (version 1.0.0) and Phoenix (version 4.6), deployed through Cloudera, for our OLAP workload. I have a table that has 10 mil rows. I try to execute the below roll-up query and it takes around 2 mins to return 1,850 rows.
>>>>>>>>>
>>>>>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>>>>>>>
>>>>>>>>> I tried applying the "joining with indices" example given on the website <https://phoenix.apache.org/joins.html> to the TRANSACTIONS table by creating an index on the grouped-by column as below, but that doesn't help.
>>>>>>>>>
>>>>>>>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>>>>
>>>>>>>>> This index is not getting used when the query is executed.
>>>>>>>>> The query plan is as below:
>>>>>>>>>
>>>>>>>>> +-------------------------------------------------------------+
>>>>>>>>> | PLAN                                                        |
>>>>>>>>> +-------------------------------------------------------------+
>>>>>>>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>>>>>>>>> | SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY]          |
>>>>>>>>> | CLIENT MERGE SORT                                           |
>>>>>>>>> +-------------------------------------------------------------+
>>>>>>>>>
>>>>>>>>> Theoretically, can secondary indexes help improve the performance of group-by queries?
>>>>>>>>>
>>>>>>>>> Any suggestions on what different options in Phoenix I could try out to speed up GROUP BY queries?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Amit.
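For anyone landing on this thread later, the tuning steps suggested above, collected in one place (statements are taken from the messages above, using Amit's table and column-family names; adjust them to your own schema):

```sql
-- 1. Group the columns referenced by the GROUP BY query into their own
--    column family (e.g. GRPBYCF.UNIT_CNT_SOLD, GRPBYCF.TOTAL_SALES,
--    GRPBYCF.T_COUNTRY) so only that family is read from disk.

-- 2. Enable SNAPPY compression, then run a MAJOR compaction from the
--    HBase shell (the web UI only triggers a minor compaction):
ALTER TABLE TRANSACTIONS SET COMPRESSION='SNAPPY';
-- hbase shell> major_compact 'TRANSACTIONS'

-- 3. Reduce the guidepost width for the table to increase scan parallelism:
UPDATE STATISTICS TRANSACTIONS SET "phoenix.stats.guidepost.width"=50000000;

-- 4. Add a covered index on the grouped column, and filter with a WHERE
--    clause on it so the entire index need not be scanned:
CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY)
    INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
```

In the thread, these steps brought the group-by query from ~2 minutes down to ~42 seconds on a 3-node cluster.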
