Optimization did help somewhat, but not to the extent I was expecting. See the chart below.
[image: Inline image 1]

Can you share your table schema so I can experiment with it? Another thing
you can try is reducing guidepost width
<https://phoenix.apache.org/tuning.html> for this table by executing:

UPDATE STATISTICS TRANSACTIONS SET "phoenix.stats.guidepost.width"=50000000;

On Tue, Mar 29, 2016 at 6:45 AM, Amit Shah <amits...@gmail.com> wrote:

> Hi Mujtaba,
>
> I did try the two optimization techniques by recreating the table and
> then loading it again with 10 mil records. They do not seem to help much
> in terms of the timings. Kindly find the phoenix log file attached. Let
> me know if I am missing anything.
>
> Thanks,
> Amit.
>
> On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan <mujt...@apache.org>
> wrote:
>
>> Here's the chart for the time each of the parallel scans takes after
>> the split. On the RS where data is not read from disk, the scan gets
>> back in ~20 secs, but for the RS which has 6 regions it's ~45 secs.
>>
>> [image: Inline image 2]
>>
>>> Yes I see disk reads with 607 ios/second on the host that stores 6
>>> regions
>>
>> Two things that you should try to reduce disk reads, or maybe a
>> combination of both:
>>
>> 1. Have only the columns used in your group by query in a separate
>> column family:
>>
>> CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD integer,
>> GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
>>
>> 2. Turn on snappy compression for your table with ALTER TABLE T SET
>> COMPRESSION='SNAPPY', followed by a major compaction.
>>
>>> I tried to compact the table from the hbase web UI
>>
>> You need to do *major_compact* from the HBase shell. From the UI it's
>> a minor compaction.
>>
>> - mujtaba
>>
>> On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <amits...@gmail.com> wrote:
>>
>>> Thanks Mujtaba and James for replying back.
>>>
>>> Mujtaba, below are the details for your follow-up questions.
>>>
>>>> 1. How wide is your table?
>>>
>>> I have 26 columns in the TRANSACTIONS table, with a couple of columns
>>> combined to form the primary key.
>>>
>>>> 2. How many region servers is your data distributed on and what's
>>>> the heap size?
>>>
>>> When I posted the initial readings of the query taking around 2
>>> minutes, I had one region server storing 4 regions for the 10 mil
>>> record TRANSACTIONS table. The heap size on the master server is
>>> 1 GB, while the region server has a 3.63 GB heap setting.
>>>
>>> Later I added 2 more region servers to the cluster and configured
>>> them as data nodes and region servers. After this step, the regions
>>> got split across two region servers, with 2 regions on one and 6 on
>>> the other. I didn't follow what action caused this region split, or
>>> whether it was done automatically by HBase (load balancer??).
>>>
>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>
>>> Yes, I see disk reads with 607 ios/second on the host that stores 6
>>> regions. Kindly find the disk I/O statistics attached as images.
>>>
>>>> 4. Can you try your query after major compacting your table?
>>>
>>> I tried to compact the table from the HBase web UI. For some reason,
>>> the compaction table attribute on the web UI is still shown as NONE.
>>> After these changes, the query time is down to *42 secs.*
>>> Is compression different from compaction? Would the query performance
>>> improve by compressing the data with one of the algorithms? Logically
>>> it doesn't sound right though.
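
An aside on the compression-vs-compaction question above: they are
different things. Compression is a storage codec that HBase applies when
it writes store files, while a major compaction rewrites all of a table's
store files, and rewriting is also what applies a newly set codec to data
already on disk. A rough sketch of the sequence suggested earlier in the
thread, assuming the TRANSACTIONS table name used here:

    -- from sqlline (Phoenix):
    ALTER TABLE TRANSACTIONS SET COMPRESSION='SNAPPY';

    # from the HBase shell (the web UI only triggers a minor compaction):
    major_compact 'TRANSACTIONS'

Since compression trades some CPU for fewer disk reads, it can help an
I/O-bound GROUP BY even though it may not sound right logically.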
>>>
>>>> Can you also replace log4j.properties with the attached one and
>>>> reply back with phoenix.log created by executing your query in
>>>> sqlline?
>>>
>>> After replacing the log4j.properties, I have captured the logs for
>>> the group by query execution and attached them.
>>>
>>> James,
>>> If I follow the queries that you pasted, I see the index getting
>>> used, but if I explain the query plan on the pre-loaded TRANSACTIONS
>>> table, I do not see the index being used. Probably the query plan
>>> changes based on whether the table has data or not.
>>>
>>> The query time is reduced to 42 secs right now. Let me know if you
>>> have more suggestions on how to improve it further.
>>>
>>> Thanks,
>>> Amit.
>>>
>>> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <jamestay...@apache.org>
>>> wrote:
>>>
>>>> Hi Amit,
>>>> Using the 4.7.0-HBase-1.1 release, I see the index being used for
>>>> that query (see below). An index will help some, as the aggregation
>>>> can be done in place as the scan over the index is occurring (as
>>>> opposed to having to hold the distinct values found during grouping
>>>> in memory per chunk of work and sorting each chunk on the client).
>>>> It's not going to prevent the entire index from being scanned,
>>>> though. You'll need a WHERE clause to prevent that.
>>>>
>>>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer
>>>> primary key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY
>>>> varchar);
>>>> No rows affected (1.32 seconds)
>>>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX
>>>> ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>> No rows affected (6.452 seconds)
>>>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
>>>> SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>> +-------------------------------------------------------------------------+
>>>> |                                  PLAN                                   |
>>>> +-------------------------------------------------------------------------+
>>>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX |
>>>> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]            |
>>>> | CLIENT MERGE SORT                                                       |
>>>> +-------------------------------------------------------------------------+
>>>> 3 rows selected (0.028 seconds)
>>>>
>>>> Thanks,
>>>> James
>>>>
>>>> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <mujt...@apache.org>
>>>> wrote:
>>>>
>>>>> That seems excessively slow for 10M rows, which should take on the
>>>>> order of a few seconds at most without an index.
>>>>> 1. How wide is your table?
>>>>> 2. How many region servers is your data distributed on and what's
>>>>> the heap size?
>>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>>> 4. Can you try your query after major compacting your table?
>>>>>
>>>>> Can you also replace log4j.properties with the attached one and
>>>>> reply back with phoenix.log created by executing your query in
>>>>> sqlline?
>>>>>
>>>>> Thanks,
>>>>> Mujtaba
>>>>>
>>>>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <amits...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to evaluate Apache HBase (version 1.0.0) and Phoenix
>>>>>> (version 4.6) deployed through Cloudera for our OLAP workload. I
>>>>>> have a table that has 10 mil rows. I try to execute the below
>>>>>> roll-up query and it takes around 2 mins to return 1,850 rows.
>>>>>>
>>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS
>>>>>> GROUP BY T_COUNTRY;
>>>>>>
>>>>>> I tried applying the "joining with indices" example given on the
>>>>>> website <https://phoenix.apache.org/joins.html> to the
>>>>>> TRANSACTIONS table by creating an index on the grouped-by column
>>>>>> as below, but that doesn't help.
>>>>>>
>>>>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY)
>>>>>> INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>
>>>>>> This index is not getting used when the query is executed. The
>>>>>> query plan is as below:
>>>>>>
>>>>>> +-------------------------------------------------------------+
>>>>>> |                            PLAN                             |
>>>>>> +-------------------------------------------------------------+
>>>>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>>>>>> | SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY]          |
>>>>>> | CLIENT MERGE SORT                                           |
>>>>>> +-------------------------------------------------------------+
>>>>>>
>>>>>> Theoretically, can secondary indexes help improve the performance
>>>>>> of group by queries?
>>>>>>
>>>>>> Any suggestions on what different options in Phoenix I could try
>>>>>> out to speed up GROUP BY queries?
>>>>>>
>>>>>> Thanks,
>>>>>> Amit.
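
A follow-up sketch on James' point that only a WHERE clause prevents the
entire index from being scanned: since TRANSACTIONS_COUNTRY_INDEX leads
on T_COUNTRY, filtering on that column should turn the full scan into a
range scan over the index. The 'US' below is just a stand-in value:

    SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
    FROM TRANSACTIONS
    WHERE T_COUNTRY = 'US'
    GROUP BY T_COUNTRY;

EXPLAIN on a query like this should show a RANGE SCAN OVER
TRANSACTIONS_COUNTRY_INDEX instead of a FULL SCAN.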