Hi Maryann

Please find my answers inline.

Thanks,
Pradheep

From: Maryann Xue <maryann....@gmail.com<mailto:maryann....@gmail.com>>
Reply-To: "user@phoenix.apache.org<mailto:user@phoenix.apache.org>" 
<user@phoenix.apache.org<mailto:user@phoenix.apache.org>>
Date: Wednesday, February 22, 2017 at 2:22 PM
To: "user@phoenix.apache.org<mailto:user@phoenix.apache.org>" 
<user@phoenix.apache.org<mailto:user@phoenix.apache.org>>
Subject: Re: Phoenix query performance

Hi Pradheep,

Thank you for posting the query and the log file! There are two things going on 
on the server side at the same time here. I think it'd be a good idea to 
isolate the problem first. So a few questions:
1. When you say data size went from "< 1M" to 30M, did the data from both LHS 
and RHS grow proportionately?
It is basically the same table..the query is like a self join..yes, you can say 
that it is proportional.
2. If yes to (1), what if we only increase the data in LHS, but keep it small 
for RHS? Would the query run significantly faster?
When RHS count is 420336, time taken is 37 seconds
When RHS count is 63575, time taken is 32 seconds (not a significant difference)
3. What if we only do group by on LHS? Would the query time be linear to the 
data size?
After Removing group by on RHS
When RHS count is 420336, time taken is 34 seconds
When RHS count is 63575, time taken is 32 seconds
4. How was GC when running the query?
About 12ms in 1 RS, 10ms in 1 RS, 4-5ms in couple of them and less than 1ms in 
the rest of the region servers when the query is runnning.

Thanks,
Maryann

On Wed, Feb 22, 2017 at 9:28 AM, Pradheep Shanmugam 
<pradheep.shanmu...@infor.com<mailto:pradheep.shanmu...@infor.com>> wrote:
Hi,

We have a hbase cluster with 8 region servers with 20G memory
We have a table  with 1 column family along with a secondary index.
Following query took only few milliseconds when we had less data(< 1 million)
After adding more data(~30M rows) the performance declined and took about a 
minute or more(not stable)

select msbo1.PARENTID
  from msbo_phoenix_comp_rowkey msbo1
  left outer join (
         select PARENTID,MILESTONETYPEID
           from msbo_phoenix_comp_rowkey
          where PARENTREFERENCETIME between 1479964000 and 1480464000
            and OWNERORGID = 100
            and PARENTTYPE = 'SHIPMENT'
            and MILESTONETYPEID = 19661
            group by PARENTID,MILESTONETYPEID
             ) msbo2
  on msbo1.PARENTID = msbo2.PARENTID
  where msbo1.PARENTTYPE = 'SHIPMENT'
   and msbo1.OWNERORGID = 100
   and msbo2.MILESTONETYPEID is null
   and msbo1.PARENTREFERENCETIME between 1479964000 and 1480464000
group by msbo1.PARENTID
  order by msbo1.PARENTID

The RHS return about a 500K rows ..LHS about 18M rows…final result about 500K 
rows

MSBO_PHOENIX_COMP_ROWKEY_INDEX_PARENTTYPE_OWNERORGID_MILESTONETYPEID_PARENTREFERENCETIME
  is the index

Query plan:
CLIENT 8-CHUNK PARALLEL 8-WAY RANGE SCAN OVER 
MSBO_PHOENIX_COMP_ROWKEY_INDEX_PARENTTYPE_OWNERORGID_MILESTONETYPEID_PARENTREFERENCETIME
 [0,'SHIPMENT',100]
    SERVER FILTER BY FIRST KEY ONLY AND 
(TO_UNSIGNED_LONG("PARENTREFERENCETIME") >= 1477958400 AND 
TO_UNSIGNED_LONG("PARENTREFERENCETIME") <= 1480464000)
    SERVER AGGREGATE INTO DISTINCT ROWS BY ["MSBO1.:PARENTID"]
CLIENT MERGE SORT
    PARALLEL LEFT-JOIN TABLE 0
        CLIENT 8-CHUNK PARALLEL 8-WAY RANGE SCAN OVER 
MSBO_PHOENIX_COMP_ROWKEY_INDEX_PARENTTYPE_OWNERORGID_MILESTONETYPEID_PARENTREFERENCETIME
 [0,'SHIPMENT',100,19661,1,477,958,400] - [0,'SHIPMENT',100,19661,1,480,464,000]
            SERVER FILTER BY FIRST KEY ONLY
            SERVER AGGREGATE INTO DISTINCT ROWS BY ["PARENTID", 
"MILESTONETYPEID"]
        CLIENT MERGE SORT
    AFTER-JOIN SERVER FILTER BY MSBO2.MILESTONETYPEID IS NULL

Attached the phoenix log.
I see the caching to set as 100..and "maxResultSize”:2097152..is that something 
that can be tuned will help?
Is that the client merge sort consuming more time can be improved? Is there any 
other tuning possible?

Thanks,
Pradheep

Reply via email to