Optimization did help somewhat, but not to the extent I was expecting. See the chart below.
[image: Inline image 1]

Can you share your table schema so I can experiment with it? Another thing
you can try is reducing guidepost width
<https://phoenix.apache.org/tuning.html> for this table by executing:

UPDATE STATISTICS TRANSACTIONS SET "phoenix.stats.guidepost.width"=50000000;

On Tue, Mar 29, 2016 at 6:45 AM, Amit Shah <amits...@gmail.com> wrote:

> Hi Mujtaba,
>
> I did try the two optimization techniques by recreating the table and
> then loading it again with 10 mil records. They do not seem to help much
> in terms of the timings. Kindly find the phoenix log file attached. Let
> me know if I am missing anything.
>
> Thanks,
> Amit.
>
> On Mon, Mar 28, 2016 at 11:44 PM, Mujtaba Chohan <mujt...@apache.org>
> wrote:
>
>> Here's the chart for the time each of the parallel scans takes after
>> the split. On the RS where data is not read from disk, the scan gets
>> back in ~20 secs, but for the RS which has 6 regions it's ~45 secs.
>>
>> [image: Inline image 2]
>>
>>> Yes I see disk reads with 607 ios/second on the host that stores 6
>>> regions
>>
>> Two things that you should try to reduce disk reads, or maybe a
>> combination of both:
>>
>> 1. Have only the columns used in your group by query in a separate
>> column family:
>>
>> CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD integer,
>> GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...)
>>
>> 2. Turn on snappy compression for your table with ALTER TABLE T SET
>> COMPRESSION='SNAPPY', followed by a major compaction.
>>
>>> I tried to compact the table from the hbase web UI
>>
>> You need to do *major_compact* from the HBase shell. From the UI it's
>> a minor compaction.
>>
>> - mujtaba
>>
>> On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <amits...@gmail.com> wrote:
>>
>>> Thanks Mujtaba and James for replying back.
>>>
>>> Mujtaba, below are the details for your follow-up questions.
>>>
>>>> 1. How wide is your table?
>>>
>>> I have 26 columns in the TRANSACTIONS table, with a couple of columns
>>> combined to form the primary key.
>>>
>>>> 2. How many region servers is your data distributed on and what's
>>>> the heap size?
>>>
>>> When I posted the initial readings of the query taking around 2
>>> minutes, I had one region server storing 4 regions for the 10 mil
>>> record TRANSACTIONS table. The heap size on the master server is
>>> 1 GB, while the region server has a 3.63 GB heap setting.
>>>
>>> Later I added 2 more region servers to the cluster and configured
>>> them as data nodes and region servers. After this step, the regions
>>> got split across two region servers, with 2 regions on one and 6 on
>>> the other. I didn't follow what action caused this region split, or
>>> whether it was done automatically by HBase (load balancer??).
>>>
>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>
>>> Yes, I see disk reads with 607 ios/second on the host that stores 6
>>> regions. Kindly find the disk I/O statistics attached as images.
>>>
>>>> 4. Can you try your query after major compacting your table?
>>>
>>> I tried to compact the table from the HBase web UI. For some reason,
>>> the compaction table attribute on the web UI is still shown as NONE.
>>> After these changes, the query time is down to *42 secs.*
>>> Is compression different from compaction? Would the query performance
>>> improve by compressing the data with one of the algorithms? Logically
>>> it doesn't sound right though.
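
An aside on the compression-vs-compaction question above: they are
different things. Compression is a storage codec that HBase applies when
it writes store files, while a major compaction rewrites all of a table's
store files, and rewriting is also what applies a newly set codec to data
already on disk. A rough sketch of the sequence suggested earlier in the
thread, assuming the TRANSACTIONS table name used here:

    -- from sqlline (Phoenix):
    ALTER TABLE TRANSACTIONS SET COMPRESSION='SNAPPY';

    # from the HBase shell (the web UI only triggers a minor compaction):
    major_compact 'TRANSACTIONS'

Since compression trades some CPU for fewer disk reads, it can help an
I/O-bound GROUP BY even though it may not sound right logically.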
>>>
>>>> Can you also replace log4j.properties with the attached one and
>>>> reply back with phoenix.log created by executing your query in
>>>> sqlline?
>>>
>>> After replacing the log4j.properties, I have captured the logs for
>>> the group by query execution and attached them.
>>>
>>> James,
>>> If I follow the queries that you pasted, I see the index getting
>>> used, but if I explain the query plan on the pre-loaded TRANSACTIONS
>>> table, I do not see the index being used. Probably the query plan
>>> changes based on whether the table has data or not.
>>>
>>> The query time is reduced to 42 secs right now. Let me know if you
>>> have more suggestions on how to improve it further.
>>>
>>> Thanks,
>>> Amit.
>>>
>>> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <jamestay...@apache.org>
>>> wrote:
>>>
>>>> Hi Amit,
>>>> Using the 4.7.0-HBase-1.1 release, I see the index being used for
>>>> that query (see below). An index will help some, as the aggregation
>>>> can be done in place as the scan over the index is occurring (as
>>>> opposed to having to hold the distinct values found during grouping
>>>> in memory per chunk of work and sorting each chunk on the client).
>>>> It's not going to prevent the entire index from being scanned,
>>>> though. You'll need a WHERE clause to prevent that.
>>>>
>>>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer
>>>> primary key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY
>>>> varchar);
>>>> No rows affected (1.32 seconds)
>>>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX
>>>> ON TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>> No rows affected (6.452 seconds)
>>>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
>>>> SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>>> +-------------------------------------------------------------------------+
>>>> |                                  PLAN                                   |
>>>> +-------------------------------------------------------------------------+
>>>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX |
>>>> | SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]            |
>>>> | CLIENT MERGE SORT                                                       |
>>>> +-------------------------------------------------------------------------+
>>>> 3 rows selected (0.028 seconds)
>>>>
>>>> Thanks,
>>>> James
>>>>
>>>> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <mujt...@apache.org>
>>>> wrote:
>>>>
>>>>> That seems excessively slow for 10M rows, which should take on the
>>>>> order of a few seconds at most without an index.
>>>>> 1. How wide is your table?
>>>>> 2. How many region servers is your data distributed on and what's
>>>>> the heap size?
>>>>> 3. Do you see lots of disk I/O on region servers during aggregation?
>>>>> 4. Can you try your query after major compacting your table?
>>>>>
>>>>> Can you also replace log4j.properties with the attached one and
>>>>> reply back with phoenix.log created by executing your query in
>>>>> sqlline?
>>>>>
>>>>> Thanks,
>>>>> Mujtaba
>>>>>
>>>>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <amits...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to evaluate Apache HBase (version 1.0.0) and Phoenix
>>>>>> (version 4.6) deployed through Cloudera for our OLAP workload. I
>>>>>> have a table that has 10 mil rows. I try to execute the below
>>>>>> roll-up query and it takes around 2 mins to return 1,850 rows.
>>>>>>
>>>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS
>>>>>> GROUP BY T_COUNTRY;
>>>>>>
>>>>>> I tried applying the "joining with indices" example given on the
>>>>>> website <https://phoenix.apache.org/joins.html> to the
>>>>>> TRANSACTIONS table by creating an index on the grouped-by column
>>>>>> as below, but that doesn't help.
>>>>>>
>>>>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY)
>>>>>> INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>>>
>>>>>> This index is not getting used when the query is executed. The
>>>>>> query plan is as below:
>>>>>>
>>>>>> +-------------------------------------------------------------+
>>>>>> |                            PLAN                             |
>>>>>> +-------------------------------------------------------------+
>>>>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>>>>>> | SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY]          |
>>>>>> | CLIENT MERGE SORT                                           |
>>>>>> +-------------------------------------------------------------+
>>>>>>
>>>>>> Theoretically, can secondary indexes help improve the performance
>>>>>> of group by queries?
>>>>>>
>>>>>> Any suggestions on what different options in Phoenix I could try
>>>>>> out to speed up GROUP BY queries?
>>>>>>
>>>>>> Thanks,
>>>>>> Amit.
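
A follow-up sketch on James' point that only a WHERE clause prevents the
entire index from being scanned: since TRANSACTIONS_COUNTRY_INDEX leads
on T_COUNTRY, filtering on that column should turn the full scan into a
range scan over the index. The 'US' below is just a stand-in value:

    SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES)
    FROM TRANSACTIONS
    WHERE T_COUNTRY = 'US'
    GROUP BY T_COUNTRY;

EXPLAIN on a query like this should show a RANGE SCAN OVER
TRANSACTIONS_COUNTRY_INDEX instead of a FULL SCAN.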