Hi Mujtaba,

Could these improvements be because of region distribution across the region servers? Along with the optimizations you suggested, I had also used hbase-region-inspector to move regions evenly across the region servers.
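(In case it is relevant for reproducing this, region balance can also be checked and triggered from the HBase shell instead of hbase-region-inspector. A rough sketch; the region and server names below are placeholders.)

    $ hbase shell
    hbase> # make sure the automatic balancer is enabled, then run it
    hbase> balance_switch true
    hbase> balancer
    hbase> # or move one region by hand to a specific region server
    hbase> move 'ENCODED_REGION_NAME', 'HOST,PORT,STARTCODE'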
Below is the table schema for the TRANSACTIONS table:

CREATE TABLE TRANSACTIONS_TEST (
    AGENT_ID BIGINT,
    A_NAME VARCHAR,
    A_ROLE VARCHAR,
    TERRITORY_ID BIGINT,
    T_ZIPCODE BIGINT,
    T_PLACE_NAME VARCHAR,
    GRPBYCF.T_STATE VARCHAR,
    GRPBYCF.T_COUNTRY VARCHAR,
    PRODUCT_ID BIGINT NOT NULL,
    P_NAME VARCHAR,
    P_CATEGORY VARCHAR,
    CHANNEL_ID BIGINT,
    CH_NAME VARCHAR,
    CH_TYPE VARCHAR,
    CUSTOMER_ID BIGINT NOT NULL,
    CS_NAME VARCHAR,
    CS_TYPE VARCHAR,
    IS_NEW_CUSTOMER BOOLEAN,
    CLOSE_DATE DATE,
    DAY_CNT_SPENT INTEGER,
    TOTAL_EXPENSES BIGINT,
    FORCASTED_SALES BIGINT,
    GRPBYCF.UNIT_CNT_SOLD BIGINT,
    PRICE_PER_UNIT BIGINT,
    DISCOUNT_PERCENT BIGINT,
    GRPBYCF.TOTAL_SALES BIGINT,
    CONSTRAINT pk PRIMARY KEY (PRODUCT_ID, CUSTOMER_ID)
) COMPRESSION='SNAPPY';

I will try out the guidepost width reduction and let you know the results.

Thank you,
Amit.

On Tue, Mar 29, 2016 at 10:50 PM, Mujtaba Chohan <[email protected]> wrote:

> The optimization did help somewhat, but not to the extent I was
> expecting. See chart below.
>
> [image: Inline image 1]
>
> Can you share your table schema so I can experiment with it? Another
> thing you can try is reducing the guidepost width
> <https://phoenix.apache.org/tuning.html> for this table by executing:
>
> UPDATE STATISTICS TRANSACTIONS SET "phoenix.stats.guidepost.width"=50000000;
>
> On Tue, Mar 29, 2016 at 6:45 AM, Amit Shah <[email protected]> wrote:
>
>> Hi Mujtaba,
>>
>> I did try the two optimization techniques by recreating the table and
>> loading it again with 10 mil records. They do not seem to help much in
>> terms of the timings. Kindly find the phoenix log file attached. Let me
>> know if I am missing anything.
>>
>> Thanks,
>> Amit.
>>
>> On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan <[email protected]> wrote:
>>
>>> Here's a chart of the time each of the parallel scans takes after the
>>> split. On the RS where data is not read from disk, the scan gets back
>>> in ~20 secs, but for the RS that holds 6 regions it's ~45 secs.
>>>
>>> [image: Inline image 2]
>>>
>>>> Yes I see disk reads with 607 ios/second on the hosts that stores 6
>>>> regions
>>>
>>> Two things you should try to reduce disk reads, or maybe a combination
>>> of both:
>>>
>>> 1. Keep only the columns used in your GROUP BY query in a separate
>>> column family:
>>>
>>> CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD integer,
>>> GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
>>>
>>> 2. Turn on Snappy compression for your table, followed by a major
>>> compaction:
>>>
>>> ALTER TABLE T SET COMPRESSION='SNAPPY'
>>>
>>>> I tried to compact the table from the hbase web UI
>>>
>>> You need to do *major_compact* from the HBase shell. From the UI it's
>>> a minor compaction.
>>>
>>> - mujtaba
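(For completeness, the major_compact step Mujtaba mentions would look roughly like this in the HBase shell. A sketch, assuming the physical table name is TRANSACTIONS; the compaction_state command may not be available in every HBase version.)

    $ hbase shell
    hbase> major_compact 'TRANSACTIONS'
    hbase> # optionally poll until the state goes back to NONE
    hbase> compaction_state 'TRANSACTIONS'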
>>> On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <[email protected]> wrote:
>>>
>>>> Thanks Mujtaba and James for replying back.
>>>>
>>>> Mujtaba, below are the details for your follow-up queries.
>>>>
>>>>> 1. How wide is your table?
>>>>
>>>> I have 26 columns in the TRANSACTIONS table, with a couple of columns
>>>> combined to form the primary key.
>>>>
>>>>> 2. How many region servers is your data distributed on and what's
>>>>> the heap size?
>>>>
>>>> When I posted the initial readings of the query taking around 2
>>>> minutes, I had one region server storing 4 regions of the 10 mil
>>>> record TRANSACTIONS table. The heap size on the master server is 1 GB
>>>> while the region server has a 3.63 GB heap setting.
>>>>
>>>> Later I added 2 more region servers to the cluster and configured
>>>> them as data nodes and region servers. After this step, the regions
>>>> got split across two region servers, with 2 regions on one region
>>>> server and 6 on the other. I didn't follow what action caused this
>>>> region split, or whether it was done automatically by HBase (load
>>>> balancer??).
>>>>
>>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>>
>>>> Yes, I see disk reads at 607 ios/second on the host that stores 6
>>>> regions. Kindly find the disk I/O statistics attached as images.
>>>>
>>>>> 4. Can you try your query after major compacting your table?
>>>>
>>>> I tried to compact the table from the HBase web UI. For some reason,
>>>> the compaction table attribute on the web UI is still shown as NONE.
>>>> After these changes, the query time is down to *42 secs*.
>>>> Is compression different from compaction? Would the query performance
>>>> improve by compressing the data with one of the algorithms? Logically
>>>> it doesn't sound right though.
>>>>
>>>>> Can you also replace log4j.properties with the attached one and
>>>>> reply back with phoenix.log created by executing your query in
>>>>> sqlline?
>>>>
>>>> After replacing log4j.properties, I have captured and attached the
>>>> logs for the group by query execution.
>>>>
>>>> James,
>>>> If I follow the queries that you pasted, I see the index getting
>>>> used, but if I explain the query plan on the pre-loaded TRANSACTIONS
>>>> table, I do not see the index being used. Probably the query plan
>>>> changes based on whether the table has data or not.
>>>>
>>>> The query time is down to 42 secs right now. Let me know if you have
>>>> more suggestions to improve it further.
>>>>
>>>> Thanks,
>>>> Amit.
>>>>
>>>> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <[email protected]> wrote:
>>>>
>>>>> Hi Amit,
>>>>> Using the 4.7.0-HBase-1.1 release, I see the index being used for
>>>>> that query (see below). An index will help some, as the aggregation
>>>>> can be done in place as the scan over the index occurs (as opposed
>>>>> to having to hold the distinct values found during grouping in
>>>>> memory per chunk of work and sort each chunk on the client). It's
>>>>> not going to prevent the entire index from being scanned, though.
>>>>> You'll need a WHERE clause to prevent that.
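(A hedged illustration of James's last point; the country value below is made up. With an equality filter on the leading indexed column, the plan should become a range scan over TRANSACTIONS_COUNTRY_INDEX instead of a full scan.)

    SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
    FROM TRANSACTIONS
    WHERE T_COUNTRY = 'US'
    GROUP BY T_COUNTRY;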
>>>>>
>>>>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer
>>>>> primary key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY
>>>>> varchar);
>>>>> No rows affected (1.32 seconds)
>>>>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX
>>>>> ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>> No rows affected (6.452 seconds)
>>>>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
>>>>> SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>>>
>>>>> +--------------------------------------------------------------------------+
>>>>> |                                   PLAN                                   |
>>>>> +--------------------------------------------------------------------------+
>>>>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX  |
>>>>> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]             |
>>>>> | CLIENT MERGE SORT                                                        |
>>>>> +--------------------------------------------------------------------------+
>>>>> 3 rows selected (0.028 seconds)
>>>>>
>>>>> Thanks,
>>>>> James
>>>>>
>>>>> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <[email protected]> wrote:
>>>>>
>>>>>> That seems excessively slow for 10M rows, which should take a few
>>>>>> seconds at most even without an index.
>>>>>>
>>>>>> 1. How wide is your table?
>>>>>> 2. How many region servers is your data distributed on and what's
>>>>>> the heap size?
>>>>>> 3. Do you see lots of disk I/O on region servers during
>>>>>> aggregation?
>>>>>> 4. Can you try your query after major compacting your table?
>>>>>>
>>>>>> Can you also replace log4j.properties with the attached one and
>>>>>> reply back with the phoenix.log created by executing your query in
>>>>>> sqlline?
>>>>>>
>>>>>> Thanks,
>>>>>> Mujtaba
>>>>>>
>>>>>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to evaluate Apache HBase (version 1.0.0) and Phoenix
>>>>>>> (version 4.6), deployed through Cloudera, for our OLAP workload.
>>>>>>> I have a table with 10 mil rows. When I execute the below roll-up
>>>>>>> query, it takes around 2 mins to return 1,850 rows.
>>>>>>>
>>>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS
>>>>>>> GROUP BY T_COUNTRY;
>>>>>>>
>>>>>>> I tried applying the "joining with indices" example given on the
>>>>>>> website <https://phoenix.apache.org/joins.html> to the
>>>>>>> TRANSACTIONS table by creating an index on the grouped-by column
>>>>>>> as below, but that doesn't help.
>>>>>>>
>>>>>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS
>>>>>>> (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>>
>>>>>>> This index is not getting used when the query is executed. The
>>>>>>> query plan is as below:
>>>>>>>
>>>>>>> +-------------------------------------------------------------+
>>>>>>> |                            PLAN                             |
>>>>>>> +-------------------------------------------------------------+
>>>>>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>>>>>>> | SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY]          |
>>>>>>> | CLIENT MERGE SORT                                           |
>>>>>>> +-------------------------------------------------------------+
>>>>>>>
>>>>>>> Theoretically, can secondary indexes help improve the
>>>>>>> performance of GROUP BY queries?
>>>>>>>
>>>>>>> Any suggestions on different options in Phoenix I could try out
>>>>>>> to speed up GROUP BY queries?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Amit.
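(One more hedged option for the original "index is not getting used" problem: Phoenix supports forcing a global index through a query hint. A sketch using the table and index names from this thread; whether it actually helps depends on how much of the data the query touches.)

    SELECT /*+ INDEX(TRANSACTIONS TRANSACTIONS_COUNTRY_INDEX) */
           SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
    FROM TRANSACTIONS
    GROUP BY T_COUNTRY;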
