Re: Speeding Up Group By Queries

Mujtaba Chohan Mon, 28 Mar 2016 11:15:01 -0700

Here's the chart for time it takes for each of the parallel scans after
split. On RS where data is not read from disk scan gets back in ~20 secs
but for the RS which has 6 it's ~45 secs.


[image: Inline image 2]

 Yes I see disk reads with 607 ios/second on the hosts that stores 6 regions
>

Two things that you should try to reduce disk reads or maybe a combination
of both 1. Have only the columns used in your group by query in a separate
column family CREATE TABLE T (K integer primary key, GRPBYCF.UNIT_CNT_SOLD
integer, GRPBYCF.TOTAL_SALES integer, GRPBYCF.T_COUNTRY varchar, ...) 2.
Turn on snappy compression for your table ALTER TABLE T SET
COMPRESSION='SNAPPY' followed by a major compaction.

I tried to compact the table from the hbase web UI
>

You need to do *major_compact* from HBase shell. From UI it's minor.

- mujtaba

On Mon, Mar 28, 2016 at 12:32 AM, Amit Shah <[email protected]> wrote:

> Thanks Mujtaba and James for replying back.
>
> Mujtaba, Below are details to your follow up queries
>
> 1. How wide is your table
>
>
> I have 26 columns in the TRANSACTIONS table with a couple of columns
> combined to be marked as a primary key
>
> 2. How many region servers is your data distributed on and what's the heap
>> size?
>
>
> When I posted the initial readings of the query taking around 2 minutes, I
> had one region server storing 4 regions for the 10 mil records TRANSACTIONS
> table. The heap size on the master server is 1 GB while the region server
> has 3.63 GB heap setting.
>
> Later I added 2 more region servers to the cluster and configured them as
> data nodes and region servers. After this step, the regions got split on
> two region servers with the count as 2 on one region server and 6 on
> another. I didn't follow what action caused this region split or was it
> automatically done by hbase (load balancer??)
>
> 3. Do you see lots of disk I/O on region servers during aggregation?
>
>
>  Yes I see disk reads with 607 ios/second on the hosts that stores 6
> regions. Kindly find the disk io statistics attached as images.
>
> 4. Can you try your query after major compacting your table?
>
>
> I tried to compact the table from the hbase web UI. For some reason, the
> compaction table attribute on the web ui is still shown as NONE. After
> these changes, the query time is down to *42 secs. *
> Is compression different from compaction? Would the query performance
> improve by compressing the data by one of the algorithms? Logically it
> doesn't sound right though.
>
> Can you also replace log4j.properties with the attached one and reply back
>> with phoenix.log created by executing your query in sqlline?
>
>
> After replacing the log4j.properties, I have captured the logs for the
> group by query execution and attached.
>
>
> James,
> If I follow the queries that you pasted, I see the index getting used but
> if I try to explain the query plan on the pre-loaded TRANSACTIONS table I
> do not see the index being used. Probably the query plan is changing based
> on whether the table has data or not.
>
> The query time is reduced down to 42 secs right now. Let me know if you
> have more suggestions on to improve it further.
>
> Thanks,
> Amit.
>
> On Sat, Mar 26, 2016 at 4:21 AM, James Taylor <[email protected]>
> wrote:
>
>> Hi Amit,
>> Using 4.7.0-HBase-1.1 release, I see the index being used for that query
>> (see below). An index will help some, as the aggregation can be done in
>> place as the scan over the index is occurring (as opposed to having to hold
>> the distinct values found during grouping in memory per chunk of work and
>> sorting each chunk on the client). It's not going to prevent the entire
>> index from being scanned though. You'll need a WHERE clause to prevent that.
>>
>> 0: jdbc:phoenix:localhost> create table TRANSACTIONS (K integer primary
>> key, UNIT_CNT_SOLD integer, TOTAL_SALES integer, T_COUNTRY varchar);
>> No rows affected (1.32 seconds)
>> 0: jdbc:phoenix:localhost> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON
>> TRANSACTIONS (T_COUNTRY) INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>> No rows affected (6.452 seconds)
>> 0: jdbc:phoenix:localhost> explain SELECT SUM(UNIT_CNT_SOLD),
>> SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY T_COUNTRY;
>>
>> +--------------------------------------------------------------------------+
>> |                                   PLAN
>>   |
>>
>> +--------------------------------------------------------------------------+
>> | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER TRANSACTIONS_COUNTRY_INDEX
>>  |
>> |     SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["T_COUNTRY"]
>>   |
>> | CLIENT MERGE SORT
>>  |
>>
>> +--------------------------------------------------------------------------+
>> 3 rows selected (0.028 seconds)
>>
>> Thanks,
>> James
>>
>>
>> On Fri, Mar 25, 2016 at 10:37 AM, Mujtaba Chohan <[email protected]>
>> wrote:
>>
>>> That seems excessively slow for 10M rows which should be in order of few
>>> seconds at most without index. 1. How wide is your table 2. How many region
>>> servers is your data distributed on and what's the heap size? 3. Do you see
>>> lots of disk I/O on region servers during aggregation? 4. Can you try your
>>> query after major compacting your table?
>>>
>>> Can you also replace log4j.properties with the attached one and reply
>>> back with phoenix.log created by executing your query in sqlline?
>>>
>>> Thanks,
>>> Mujtaba
>>>
>>>
>>> On Fri, Mar 25, 2016 at 6:56 AM, Amit Shah <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to evaluate apache hbase (version 1.0.0) and phoenix
>>>> (version 4.6) deployed through cloudera for our OLAP workfload. I have
>>>> a table that has 10 mil rows. I try to execute the below roll up query and
>>>> it takes around 2 mins to return 1,850 rows.
>>>>
>>>> SELECT SUM(UNIT_CNT_SOLD), SUM(TOTAL_SALES) FROM TRANSACTIONS GROUP BY
>>>> T_COUNTRY;
>>>>
>>>> I tried applying the "joining with indices" example given on the
>>>> website <https://phoenix.apache.org/joins.html> on the TRANSACTIONS
>>>> table by creating an index on the grouped by column as below but that
>>>> doesn't help.
>>>>
>>>> CREATE INDEX TRANSACTIONS_COUNTRY_INDEX ON TRANSACTIONS (T_COUNTRY)
>>>> INCLUDE (UNIT_CNT_SOLD, TOTAL_SALES);
>>>>
>>>> This index is not getting used when the query is executed. The query
>>>> plan is as below
>>>>
>>>> +------------------------------------------+
>>>> |                   PLAN                   |
>>>> +------------------------------------------+
>>>> | CLIENT 31-CHUNK PARALLEL 31-WAY FULL SCAN OVER TRANSACTIONS |
>>>> |     SERVER AGGREGATE INTO DISTINCT ROWS BY [T_COUNTRY] |
>>>> | CLIENT MERGE SORT                        |
>>>> +------------------------------------------+
>>>>
>>>> Theoretically can secondary indexes help improve the performance of
>>>> group by queries?
>>>>
>>>> Any suggestions on what are different options in phoenix I could try
>>>> out to speed up GROUP BY queries?
>>>>
>>>> Thanks,
>>>> Amit.
>>>>
>>>
>>>
>>
>

Re: Speeding Up Group By Queries

Reply via email to