Hi Mujtaba,

I observed that where-clause and group-by queries on the primary key columns are very fast (~200 ms). This is not the case for queries that filter or group by non-primary-key columns. I tried configuring the bucket cache, but surprisingly it doesn't give the speed improvement I had expected: one of these queries takes 26 secs with the bucket cache and 31 secs without it on a 3-node HBase cluster.
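For context, a bucket cache is typically enabled via hbase-site.xml along these lines (a sketch; the ioengine choice and size are illustrative and need tuning per host):

```xml
<!-- Sketch: enable an off-heap bucket cache. Values are illustrative. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <!-- size in MB; must fit within the region server's direct memory -->
  <value>4096</value>
</property>
```

Note that a bucket cache mostly helps repeated random reads of hot blocks; a full-scan aggregation over a table larger than the cache still goes to disk, which may explain the small difference observed.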
Let me know if you have suggestions that we could try out.

Regards,
Amit.

On Tue, Mar 29, 2016 at 10:55 PM, Amit Shah <amits...@gmail.com> wrote:
> Hi Mujtaba,
>
> Could these improvements be because of region distribution across region servers? Along with the optimizations you had suggested, I had also used hbase-region-inspector to move regions evenly across the region servers.
>
> Below is the table schema for the TRANSACTIONS table:
>
> CREATE TABLE TRANSACTIONS_TEST (
>     AGENT_ID BIGINT
>     ,A_NAME VARCHAR
>     ,A_ROLE VARCHAR
>     ,TERRITORY_ID BIGINT
>     ,T_ZIPCODE BIGINT
>     ,T_PLACE_NAME VARCHAR
>     ,GRPBYCF.T_STATE VARCHAR
>     ,GRPBYCF.T_COUNTRY VARCHAR
>     ,PRODUCT_ID BIGINT NOT NULL
>     ,P_NAME VARCHAR
>     ,P_CATEGORY VARCHAR
>     ,CHANNEL_ID BIGINT
>     ,CH_NAME VARCHAR
>     ,CH_TYPE VARCHAR
>     ,CUSTOMER_ID BIGINT NOT NULL
>     ,CS_NAME VARCHAR
>     ,CS_TYPE VARCHAR
>     ,IS_NEW_CUSTOMER BOOLEAN
>     ,CLOSE_DATE DATE
>     ,DAY_CNT_SPENT INTEGER
>     ,TOTAL_EXPENSES BIGINT
>     ,FORCASTED_SALES BIGINT
>     ,GRPBYCF.UNIT_CNT_SOLD BIGINT
>     ,PRICE_PER_UNIT BIGINT
>     ,DISCOUNT_PERCENT BIGINT
>     ,GRPBYCF.TOTAL_SALES BIGINT
>     ,CONSTRAINT pk PRIMARY KEY (PRODUCT_ID, CUSTOMER_ID)
> ) COMPRESSION='SNAPPY';
>
> I will try out the guidepost width reduction and let you know the results.
>
> Thank you,
> Amit.
>
> On Tue, Mar 29, 2016 at 10:50 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>> The optimization did help somewhat, but not to the extent I was expecting. See the chart below.
>>
>> [image: Inline image 1]
>>
>> Can you share your table schema so I can experiment with it? Another thing you can try is reducing the guidepost width <https://phoenix.apache.org/tuning.html> for this table by executing:
>>
>> UPDATE STATISTICS TRANSACTIONS SET "phoenix.stats.guidepost.width"=50000000;
>>
>> On Tue, Mar 29, 2016 at 6:45 AM, Amit Shah <amits...@gmail.com> wrote:
>>> Hi Mujtaba,
>>>
>>> I did try the two optimization techniques by recreating the table and then loading it again with 10 mil records.
>>> They do not seem to help much in terms of the timings. Kindly find the phoenix log file attached. Let me know if I am missing anything.
>>>
>>> Thanks,
>>> Amit.
>>>
>>> On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
>>>> Here's the chart of the time each of the parallel scans takes after the split. On the RS where data is not read from disk the scan gets back in ~20 secs, but for the RS which holds 6 regions it's ~45 secs.
>>>>
>>>> [image: Inline image 2]
>>>>
>>>>> Yes I see disk reads with 607 ios/second on the host that stores 6 regions
>>>>
>>>> Two things that you should try to reduce disk reads, or maybe a combination of both:
>>>> 1. Have only the columns used in your group-by query in a separate column family:
>>>>    CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD integer, GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
>>>> 2. Turn on snappy compression for your table (ALTER TABLE T SET COMPRESSION='SNAPPY'), followed by a major compaction.
>>>>
>>>>> I tried to compact the table from the hbase web UI
>>>>
>>>> You need to run *major_compact* from the HBase shell. From the UI it's a minor compaction.
>>>>
>>>> - mujtaba
>>>>
>>>> On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <amits...@gmail.com> wrote:
>>>>> Thanks Mujtaba and James for replying back.
>>>>>
>>>>> Mujtaba, below are the details for your follow-up questions.
>>>>>
>>>>>> 1. How wide is your table?
>>>>>
>>>>> I have 26 columns in the TRANSACTIONS table, with a couple of columns combined to form the primary key.
>>>>>
>>>>>> 2. How many region servers is your data distributed on and what's the heap size?
>>>>>
>>>>> When I posted the initial readings of the query taking around 2 minutes, I had one region server storing 4 regions of the 10 mil record TRANSACTIONS table. The heap size on the master server is 1 GB, while the region server has a 3.63 GB heap setting.
>>>>> Later I added 2 more region servers to the cluster and configured them as data nodes and region servers. After this step, the regions were split across two region servers, with 2 regions on one region server and 6 on the other. I didn't follow what action caused this region split; was it done automatically by HBase (the load balancer?)?
>>>>>
>>>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>>>
>>>>> Yes, I see disk reads with 607 ios/second on the host that stores 6 regions. Kindly find the disk I/O statistics attached as images.
>>>>>
>>>>>> 4. Can you try your query after major compacting your table?
>>>>>
>>>>> I tried to compact the table from the hbase web UI. For some reason, the compaction table attribute on the web UI is still shown as NONE. After these changes, the query time is down to *42 secs*.
>>>>> Is compression different from compaction? Would the query performance improve by compressing the data with one of the algorithms? Logically it doesn't sound right though.
>>>>>
>>>>>> Can you also replace log4j.properties with the attached one and reply back with phoenix.log created by executing your query in sqlline?
>>>>>
>>>>> After replacing the log4j.properties, I have captured the logs for the group-by query execution and attached them.
>>>>>
>>>>> James,
>>>>> If I follow the queries that you pasted, I see the index getting used, but if I explain the query plan on the pre-loaded TRANSACTIONS table I do not see the index being used. Probably the query plan changes based on whether the table has data or not.
>>>>>
>>>>> The query time is down to 42 secs right now. Let me know if you have more suggestions on how to improve it further.
>>>>>
>>>>> Thanks,
>>>>> Amit.
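For anyone following along, the major compaction Mujtaba recommends is issued from the HBase shell, not the web UI; a sketch, assuming the table is named TRANSACTIONS:

```shell
$ hbase shell
# Inside the shell: trigger a major compaction (the web UI only queues minor compactions)
hbase(main):001:0> major_compact 'TRANSACTIONS'
```

On the compression-vs-compaction question above: compression is the storage codec (e.g. SNAPPY), while compaction rewrites store files; a newly set codec only applies to files written afterwards, which is why a major compaction is needed to rewrite existing data, and fewer bytes read from disk can help an I/O-bound scan.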
>>>>>
>>>>> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <jamestay...@apache.org> wrote:
>>>>>> Hi Amit,
>>>>>> Using the 4.7.0-HBase-1.1 release, I see the index being used for that query (see below). An index will help some, as the aggregation can be done in place as the scan over the index occurs (as opposed to having to hold the distinct values found during grouping in memory per chunk of work and sorting each chunk on the client). It's not going to prevent the entire index from being scanned, though. You'll need a WHERE clause to prevent that.
>>>>>>
>>>>>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer primary key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY varchar);
>>>>>> No rows affected (1.32 seconds)
>>>>>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>> No rows affected (6.452 seconds)
>>>>>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>>>> +--------------------------------------------------------------------------+
>>>>>> |                                   PLAN                                   |
>>>>>> +--------------------------------------------------------------------------+
>>>>>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX  |
>>>>>> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]             |
>>>>>> | CLIENT MERGE SORT                                                        |
>>>>>> +--------------------------------------------------------------------------+
>>>>>> 3 rows selected (0.028 seconds)
>>>>>>
>>>>>> Thanks,
>>>>>> James
>>>>>>
>>>>>> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <mujt...@apache.org> wrote:
>>>>>>> That seems excessively slow for 10M rows, which should take a few seconds at most even without an index.
>>>>>>> 1. How wide is your table?
>>>>>>> 2. How many region servers is your data distributed on and what's the heap size?
>>>>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>>>>> 4. Can you try your query after major compacting your table?
>>>>>>>
>>>>>>> Can you also replace log4j.properties with the attached one and reply back with phoenix.log created by executing your query in sqlline?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Mujtaba
>>>>>>>
>>>>>>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <amits...@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am trying to evaluate apache hbase (version 1.0.0) and phoenix (version 4.6), deployed through cloudera, for our OLAP workload. I have a table that has 10 mil rows. I try to execute the below roll-up query and it takes around 2 mins to return 1,850 rows.
>>>>>>>>
>>>>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>>>>>>
>>>>>>>> I tried applying the "joining with indices" example given on the website <https://phoenix.apache.org/joins.html> to the TRANSACTIONS table by creating an index on the grouped-by column as below, but that doesn't help.
>>>>>>>>
>>>>>>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>>>
>>>>>>>> This index is not used when the query is executed. The query plan is as below:
>>>>>>>>
>>>>>>>> +------------------------------------------+
>>>>>>>> | PLAN                                     |
>>>>>>>> +------------------------------------------+
>>>>>>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>>>>>>>> | SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY] |
>>>>>>>> | CLIENT MERGE SORT |
>>>>>>>> +------------------------------------------+
>>>>>>>>
>>>>>>>> Theoretically, can secondary indexes help improve the performance of group-by queries?
>>>>>>>>
>>>>>>>> Any suggestions on what different options in Phoenix I could try out to speed up GROUP BY queries?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Amit.
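As an aside, the benefit James describes above (aggregating in place while the index is scanned in sorted key order, instead of holding every distinct group in memory per chunk and merge-sorting on the client) can be sketched generically. This is an illustrative Python model, not Phoenix internals:

```python
from itertools import groupby
from operator import itemgetter

def aggregate_sorted(rows):
    """Stream over rows already sorted by key, like a scan over a covered
    index on T_COUNTRY: only one group's running sums are live at a time."""
    out = []
    for country, group in groupby(rows, key=itemgetter(0)):
        units = sales = 0
        for _, u, s in group:
            units += u
            sales += s
        out.append((country, units, sales))
    return out  # already in key order; no client-side sort needed

def aggregate_unsorted(rows):
    """Hash-aggregate rows in arbitrary order, like a full scan of the data
    table: every distinct group stays in memory until the scan finishes,
    and the result must be sorted afterwards."""
    acc = {}
    for country, u, s in rows:
        units, sales = acc.get(country, (0, 0))
        acc[country] = (units + u, sales + s)
    return sorted((k, u, s) for k, (u, s) in acc.items())

rows = [("DE", 1, 10), ("IN", 2, 20), ("IN", 3, 30), ("US", 4, 40)]
assert aggregate_sorted(rows) == aggregate_unsorted(rows)
```

Both paths produce the same result; the sorted-scan path just bounds memory and skips the final sort, which is why the covered index changes the plan to ORDERED DISTINCT ROWS even though it cannot avoid scanning every row without a WHERE clause.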