No problem. Thanks for reporting the problem, Ralph!

On Thu, Dec 18, 2014 at 11:49 AM, Perko, Ralph J <ralph.pe...@pnnl.gov>
wrote:
>
>   Thank you for your quick response to this.  I built and deployed 4.2.3
> and there is a dramatic difference in performance.  Both the subselect and
> join query went from around 177s to 8s!  The explain plan now shows the
> entire pk being used.
>
>    __________________________________________________
> *Ralph Perko*
> Pacific Northwest National Laboratory
>   (509) 375-2272
> ralph.pe...@pnnl.gov
>
>
>   From: Maryann Xue <maryann....@gmail.com>
> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Date: Monday, December 15, 2014 at 3:14 PM
>
> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
> Subject: Re: Query performance question
>
>   Thanks again for the input, Ralph!
>
>  Pinpointed the problem and filed a JIRA (
> https://issues.apache.org/jira/browse/PHOENIX-1533). Will try to work out
> a solution ASAP.
>
> On Mon, Dec 15, 2014 at 3:38 PM, Perko, Ralph J <ralph.pe...@pnnl.gov>
> wrote:
>>
>>   I am using version 4.2.2
>>
>>
>>      From: <Perko>, Ralph Perko <ralph.pe...@pnnl.gov>
>> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> Date: Monday, December 15, 2014 at 12:37 PM
>>
>> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> Subject: Re: Query performance question
>>
>>    DDL is attached – thanks!
>>
>>  Ralph
>>
>>
>>   From: Maryann Xue <maryann....@gmail.com>
>> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> Date: Monday, December 15, 2014 at 12:21 PM
>> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>> Subject: Re: Query performance question
>>
>>   Hi Ralph,
>>
>>  Thank you very much for the information! Your answers to my questions
>> are very helpful.
>> The numbers look reasonable, but the query plan does not: both file_id
>> and recnum should be used in the "DYNAMIC SERVER FILTER" (which turns
>> into a skip-scan), yet the plan you got only uses file_id.
>> So could you please attach the DDL of your table and index so that I can
>> try reproducing the problem?
>>
>>
>>  Thanks,
>> Maryann
>>
>>
>> On Mon, Dec 15, 2014 at 2:46 PM, Perko, Ralph J <ralph.pe...@pnnl.gov>
>> wrote:
>>>
>>>   My apologies for the delay.  I had to switch clusters and use a
>>> smaller dataset.  The discrepancy still exists but the numbers are a little
>>> different:
>>>
>>>  I ran the same queries as in the original email (below)
>>>
>>>  Total records: 581M
>>>
>>>  Simple query based on secondary index value used in the subselect:
>>> 1747 recs – 0.256 sec
>>>
>>>  Subselect query:  177s
>>>
>>>  Join query: 179s
>>>
>>>  The answers to your questions are below.
>>>
>>>  1. What is the Primary Key definition of your BULK_TABLE?
>>>
>>>  CONSTRAINT pkey PRIMARY KEY (file_id,recnum)
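>>>
>>>  (The actual DDL was sent as an attachment and isn't reproduced here; a
>>> minimal sketch of a Phoenix table and index with this shape, with
>>> hypothetical column names and types beyond the PK and saddr, would be:)
>>>
>>> ```sql
>>> -- Sketch only: column types and the extra data columns are hypothetical.
>>> CREATE TABLE BULK_TABLE (
>>>     file_id UNSIGNED_INT NOT NULL,
>>>     recnum  UNSIGNED_LONG NOT NULL,
>>>     saddr   VARCHAR,
>>>     -- ... roughly 50 additional data columns ...
>>>     CONSTRAINT pkey PRIMARY KEY (file_id, recnum)
>>> );
>>>
>>> -- Single-field secondary index used by the fast initial query
>>> CREATE INDEX BULK_TABLE_SADDR_IDX ON BULK_TABLE (saddr);
>>> ```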
>>>
>>>  2. How many (approximately) distinct "file_id" values are there in the
>>> BULK_TABLE? (If you don't know for sure, you can just run a query to find
>>> out).
>>>
>>>  "select count(distinct(file_id)) from BULK_TABLE" - 7902 distinct values - 92s
>>>
>>>  "select distinct(file_id) from BULK_TABLE" - returns in about 25 min
>>>
>>>  3. How long does it take to run a full-scan query on BULK_TABLE, like
>>> "select * from BULK_TABLE"?
>>>
>>>  Results began returning after about 25 min
>>>
>>>  4. How long does it take to run a full-scan join query on BULK_TABLE,
>>> like "select * from BULK_TABLE join (select file_id, recnum from
>>> BULK_TABLE) as SS on BULK_TABLE.file_id = SS.file_id and BULK_TABLE.recnum
>>> = SS.recnum"?
>>>
>>>  The full-scan join fails with a MaxServerCacheSizeExceededException
>>> (server cache set to 1 GB).
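>>>
>>>  (Not part of the original thread: if the join's right-hand side
>>> legitimately needs to be that large, the ceiling is controlled by the
>>> client-side phoenix.query.maxServerCacheBytes setting in hbase-site.xml;
>>> a sketch of raising it to 2 GB is below. The value shown is illustrative.)
>>>
>>> ```xml
>>> <!-- hbase-site.xml on the client; value is in bytes (sketch only) -->
>>> <property>
>>>   <name>phoenix.query.maxServerCacheBytes</name>
>>>   <value>2147483648</value>
>>> </property>
>>> ```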
>>>
>>>  Custom hbase/phoenix settings are attached.
>>>
>>>  Thanks,
>>> Ralph
>>>
>>>
>>>   From: Maryann Xue <maryann....@gmail.com>
>>> Reply-To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>>> Date: Friday, December 12, 2014 at 8:07 AM
>>> To: "user@phoenix.apache.org" <user@phoenix.apache.org>
>>> Subject: Re: Query performance question
>>>
>>>   Hi Ralph,
>>>
>>>  Thanks for the question!
>>> According to the "explain" result you got, the optimization worked
>>> exactly as expected with this query:
>>>
>>>  "DYNAMIC SERVER FILTER BY FILE_ID IN (SS.FILE_ID)" means a skip-scan
>>> instead of a full-scan over BULK_TABLE will be executed at runtime based on
>>> the values of "file_id" it got from the inner query.
>>>
>>>  So I need to know a few more things:
>>> 1. What is the Primary Key definition of your BULK_TABLE?
>>> 2. How many (approximately) distinct "file_id" values are there in the
>>> BULK_TABLE? (If you don't know for sure, you can just run a query to find
>>> out).
>>> 3. How long does it take to run a full-scan query on BULK_TABLE, like
>>> "select * from BULK_TABLE"?
>>> 4. How long does it take to run a full-scan join query on BULK_TABLE,
>>> like "select * from BULK_TABLE join (select file_id, recnum from
>>> BULK_TABLE) as SS on BULK_TABLE.file_id = SS.file_id and
>>> BULK_TABLE.recnum = SS.recnum"?
>>>
>>>
>>>  Thanks,
>>> Maryann
>>>
>>>
>>> On Thu, Dec 11, 2014 at 6:28 PM, Perko, Ralph J <ralph.pe...@pnnl.gov>
>>> wrote:
>>>>
>>>>  Hi,
>>>>
>>>>  Thanks for all your help thus far with Phoenix.
>>>>
>>>>  I am trying to understand the best way to construct a query that
>>>> returns all the fields from a table while still taking advantage of a
>>>> single-field secondary index.  I have a table with upwards of 50 fields
>>>> and do not wish to index them all, but the use case requires returning
>>>> them all.
>>>>
>>>>  My general approach is to first select the records I want using an
>>>> indexed field then use the returned pk values to get the entire record in
>>>> the form of a subselect or join.
>>>>
>>>>  The initial select executes very fast (sub-second), returning close
>>>> to 3000 records.  When used as a subselect or join, the entire query
>>>> takes very long (over 15 min) or does not return.  Based on the query
>>>> plans, it appears there is a lot more going on than a simple look-up of
>>>> the values returned in the subselect.  Is there a way to do this using
>>>> Phoenix SQL syntax?
>>>>
>>>>  Any suggestions are appreciated.
>>>>
>>>>  Initial indexed query (very fast):
>>>>
>>>>  SELECT file_id,recnum
>>>>
>>>> FROM BULK_TABLE
>>>>
>>>> WHERE saddr IN (ip1,ip2,ip3);
>>>>
>>>>
>>>>  file_id and recnum make up the primary key
>>>>
>>>>
>>>>  Plan:
>>>>
>>>> | CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX
>>>> [0,1,000,004,076] - [9,1,000,142,114] |
>>>>
>>>> | CLIENT MERGE SORT |
>>>>
>>>>
>>>>  Used as a subselect (times out):
>>>>
>>>>   SELECT * FROM BULK_TABLE
>>>>
>>>> WHERE (file_id,recnum) IN(SELECT file_id,recnum
>>>>
>>>>              FROM BULK_TABLE
>>>>
>>>>       WHERE saddr IN (ip1,ip2,ip3));
>>>>
>>>> Plan:
>>>>
>>>> | CLIENT PARALLEL 10-WAY FULL SCAN OVER BULK_TABLE |
>>>>
>>>> | CLIENT MERGE SORT |
>>>>
>>>> |     PARALLEL SEMI-JOIN TABLE 0 (SKIP MERGE) |
>>>>
>>>> |         CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS
>>>> OVER BULK_TABLE_SADDR_IDX [0,1,000,004,076] - [9,1,000,142,114] |
>>>>
>>>> |             SERVER AGGREGATE INTO DISTINCT ROWS BY [RECNUM, FILE_ID] |
>>>>
>>>> |         CLIENT MERGE SORT |
>>>>
>>>> |     DYNAMIC SERVER FILTER BY FILE_ID IN ($1.$3) |
>>>>
>>>>
>>>>  Another approach, using a join instead:
>>>>
>>>>
>>>>  SELECT *
>>>>
>>>> FROM BULK_TABLE
>>>>
>>>> JOIN
>>>>
>>>>     (SELECT file_id, recnum
>>>>
>>>>      FROM BULK_TABLE
>>>>
>>>>      WHERE saddr in (ip1,ip2,ip3)) AS SS
>>>>
>>>> ON BULK_TABLE.file_id = SS.file_id AND BULK_TABLE.recnum = SS.recnum;
>>>>
>>>>
>>>>  Runs faster, but can still take about 15 min
>>>>
>>>>
>>>>  Plan:
>>>>
>>>>
>>>>  | CLIENT PARALLEL 10-WAY FULL SCAN OVER BULK_TABLE |
>>>>
>>>> | CLIENT MERGE SORT |
>>>>
>>>> |     PARALLEL INNER-JOIN TABLE 0 |
>>>>
>>>> |         CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS
>>>> OVER BULK_TABLE_SADDR_IDX [0,1,000,004,076] - [9,1,000,142,114] |
>>>>
>>>> |         CLIENT MERGE SORT |
>>>>
>>>> |     DYNAMIC SERVER FILTER BY FILE_ID IN (SS.FILE_ID) |
>>>>
>>>>
>>>>
>>>>  Is there a more efficient way to run a query such as this?
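>>>>
>>>>  (A manual two-step workaround, not proposed in the thread: run the
>>>> indexed query first, then issue a second query that fetches the full
>>>> rows by literal PK pairs, which Phoenix can optimize into point gets /
>>>> a skip scan over the primary key. The PK values shown are hypothetical
>>>> placeholders, not results from the dataset above:)
>>>>
>>>> ```sql
>>>> -- Step 1: fast indexed query for the matching PKs (as above)
>>>> SELECT file_id, recnum FROM BULK_TABLE WHERE saddr IN (ip1, ip2, ip3);
>>>>
>>>> -- Step 2: look up the full rows by the literal PK pairs returned by
>>>> -- step 1 (values below are hypothetical placeholders)
>>>> SELECT * FROM BULK_TABLE
>>>> WHERE (file_id, recnum) IN ((1, 1000004076), (9, 1000142114));
>>>> ```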
>>>>
>>>>  Thanks!
>>>>  Ralph
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>  --
>>> Thanks,
>>> Maryann
>>>
>>
>>
>>  --
>> Thanks,
>> Maryann
>>
>
>
>  --
> Thanks,
> Maryann
>


-- 
Thanks,
Maryann
