Here's the chart for time it takes for each of the parallel scans after split. On RS where data is not read from disk scan gets back in ~20 secs but for the RS which has 6 it's ~45 secs.
[image: Inline image 2] Yes I see disk reads with 607 ios/second on the hosts that stores 6 regions > Two things that you should try to reduce disk reads or maybe a combination of both 1. Have only the columns used in your group by query in a separate column family CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD integer, GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...) 2. Turn on snappy compression for your table ALTER TABLE T SET COMPRESSION='SNAPPY' followed by a major compaction. I tried to compact the table from the hbase web UI > You need to do *major_compact* from HBase shell. From UI it's minor. - mujtaba On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <amits...@gmail.com> wrote: > Thanks Mujtaba and James for replying back. > > Mujtaba, Below are details to your follow up queries > > 1. How wide is your table > > > I have 26 columns in the TRANSACTIONS table with a couple of columns > combined to be marked as a primary key > > 2. How many region servers is your data distributed on and what's the heap >> size? > > > When I posted the initial readings of the query taking around 2 minutes, I > had one region server storing 4 regions for the 10 mil records TRANSACTIONS > table. The heap size on the master server is 1 GB while the region server > has 3.63 GB heap setting. > > Later I added 2 more region servers to the cluster and configured them as > data nodes and region servers. After this step, the regions got split on > two region servers with the count as 2 on one region server and 6 on > another. I didn't follow what action caused this region split or was it > automatically done by hbase (load balancer??) > > 3. Do you see lots of disk I/O on region servers during aggregation? > > > Yes I see disk reads with 607 ios/second on the hosts that stores 6 > regions. Kindly find the disk io statistics attached as images. > > 4. Can you try your query after major compacting your table? > > > I tried to compact the table from the hbase web UI. For some reason, the > compaction table attribute on the web ui is still shown as NONE. After > these changes, the query time is down to *42 secs. * > Is compression different from compaction? Would the query performance > improve by compressing the data by one of the algorithms? Logically it > doesn't sound right though. > > Can you also replace log4j.properties with the attached one and reply back >> with phoenix.log created by executing your query in sqlline? > > > After replacing the log4j.properties, I have captured the logs for the > group by query execution and attached. > > > James, > If I follow the queries that you pasted, I see the index getting used but > if I try to explain the query plan on the pre-loaded TRANSACTIONS table I > do not see the index being used. Probably the query plan is changing based > on whether the table has data or not. > > The query time is reduced down to 42 secs right now. Let me know if you > have more suggestions on to improve it further. > > Thanks, > Amit. > > On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <jamestay...@apache.org> > wrote: > >> Hi Amit, >> Using 4.7.0-HBase-1.1 release, I see the index being used for that query >> (see below). An index will help some, as the aggregation can be done in >> place as the scan over the index is occurring (as opposed to having to hold >> the distinct values found during grouping in memory per chunk of work and >> sorting each chunk on the client). It's not going to prevent the entire >> index from being scanned though. You'll need a WHERE clause to prevent that. >> >> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer primary >> key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY varchar); >> No rows affected (1.32 seconds) >> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON >> TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES); >> No rows affected (6.452 seconds) >> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD), >> SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY; >> >> +--------------------------------------------------------------------------+ >> | PLAN >> | >> >> +--------------------------------------------------------------------------+ >> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX >> | >> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"] >> | >> | CLIENT MERGE SORT >> | >> >> +--------------------------------------------------------------------------+ >> 3 rows selected (0.028 seconds) >> >> Thanks, >> James >> >> >> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <mujt...@apache.org> >> wrote: >> >>> That seems excessively slow for 10M rows which should be in order of few >>> seconds at most without index. 1. How wide is your table 2. How many region >>> servers is your data distributed on and what's the heap size? 3. Do you see >>> lots of disk I/O on region servers during aggregation? 4. Can you try your >>> query after major compacting your table? >>> >>> Can you also replace log4j.properties with the attached one and reply >>> back with phoenix.log created by executing your query in sqlline? >>> >>> Thanks, >>> Mujtaba >>> >>> >>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <amits...@gmail.com> wrote: >>> >>>> Hi, >>>> >>>> I am trying to evaluate apache hbase (version 1.0.0) and phoenix >>>> (version 4.6) deployed through cloudera for our OLAP workfload. I have >>>> a table that has 10 mil rows. I try to execute the below roll up query and >>>> it takes around 2 mins to return 1,850 rows. >>>> >>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY >>>> T_COUNTRY; >>>> >>>> I tried applying the "joining with indices" example given on the >>>> website <https://phoenix.apache.org/joins.html> on the TRANSACTIONS >>>> table by creating an index on the grouped by column as below but that >>>> doesn't help. >>>> >>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY) >>>> INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES); >>>> >>>> This index is not getting used when the query is executed. The query >>>> plan is as below >>>> >>>> +------------------------------------------+ >>>> | PLAN | >>>> +------------------------------------------+ >>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS | >>>> | SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY] | >>>> | CLIENT MERGE SORT | >>>> +------------------------------------------+ >>>> >>>> Theoretically can secondary indexes help improve the performance of >>>> group by queries? >>>> >>>> Any suggestions on what are different options in phoenix I could try >>>> out to speed up GROUP BY queries? >>>> >>>> Thanks, >>>> Amit. >>>> >>> >>> >> >