My apologies for the delay.  I had to switch clusters and use a smaller 
dataset.  The discrepancy still exists but the numbers are a little different:

I ran the same queries as in the original email (below)

Total records: 581M

Simple query based on secondary index value used in the subselect:
1747 recs – 0.256 sec

Subselect query:  177s

Join query: 179s

The answers to your questions are below.

1. What is the Primary Key definition of your BULK_TABLE?

CONSTRAINT pkey PRIMARY KEY (file_id,recnum)

2. How many (approximately) distinct "file_id" values are there in the 
BULK_TABLE? (If you don't know for sure, you can just run a query to find out).

"select count(distinct(file_id)) from BULK_TABLE” - 7902 - 92s

"select distinct(file_id) from BULK_TABLE”  - returns in 25m

3. How long does it take to run a full-scan query on BULK_TABLE, like "select * 
from BULK_TABLE”?

Results began returning after about 25min

4. How long does it take to run a full-scan join query on BULK_TABLE, like 
"select * from BULK_TABLE join (select file_id, recnum from BULK_TABLE) as SS 
on BULK_TABLE.file_id = SS.file_id and BULK_TABLE.recnum = SS.recnum”?

The full-scan join fails with a MaxServerCacheSizeExceededException  - server 
cache set to 1G.

Custom hbase/phoenix settings are attached.

Thanks,
Ralph


From: Maryann Xue <maryann....@gmail.com<mailto:maryann....@gmail.com>>
Reply-To: "user@phoenix.apache.org<mailto:user@phoenix.apache.org>" 
<user@phoenix.apache.org<mailto:user@phoenix.apache.org>>
Date: Friday, December 12, 2014 at 8:07 AM
To: "user@phoenix.apache.org<mailto:user@phoenix.apache.org>" 
<user@phoenix.apache.org<mailto:user@phoenix.apache.org>>
Subject: Re: Query performance question

Hi Ralph,

Thanks for the question!
According to the "explain" result you got, the optimization worked exactly as 
expected with this query:

"DYNAMIC SERVER FILTER BY FILE_ID IN (SS.FILE_ID)" means a skip-scan instead of 
a full-scan over BULK_TABLE will be executed at runtime based on the values of 
"file_id" it got from the inner query.

So I need to know a few more things:
1. What is the Primary Key definition of your BULK_TABLE?
2. How many (approximately) distinct "file_id" values are there in the 
BULK_TABLE? (If you don't know for sure, you can just run a query to find out).
3. How long does it take to run a full-scan query on BULK_TABLE, like "select * 
from BULK_TABLE"?
4. How long does it take to run a full-scan join query on BULK_TABLE, like 
"select * from BULK_TABLE join (select file_id, recnum from BULK_TABLE) as SS 
on BULK_TABLE.file_id = SS.file_id and BULK_TABLE.recnum = SS.recnum"?


Thanks,
Maryann


On Thu, Dec 11, 2014 at 6:28 PM, Perko, Ralph J 
<ralph.pe...@pnnl.gov<mailto:ralph.pe...@pnnl.gov>> wrote:
Hi,

Thanks for all your help thus far with Phoenix.

I am trying to understand the best way to construct a query that returns all 
the fields from a table but still takes advantage of a single field secondary 
index.   I have a table with upwards of 50 fields and do not wish to index them 
all but the use case exists to return them all.

My general approach is to first select the records I want using an indexed 
field then use the returned pk values to get the entire record in the form of a 
subselect or join.

The initial select executes very fast, sub-second , returning close to 3000 
records.  When used as a subselect or join the entire query takes very long 
(over 15min)  or does not return.   Based on the processing plans it appears 
there is a lot more going on than just a simple look-up of the values returned 
in the subselect.  Is there a way to do this using Phoenix SQL syntax?

Any suggestions are appreciated.

Initial indexed query (very fast):

SELECT file_id,recnum
FROM BULK_TABLE
WHERE saddr IN (ip1,ip2,ip3))

file_id and recnum make up the primary key

Plan:
| CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX 
[0,1,000,004,076] - [9,1,000,142,114] |
| CLIENT MERGE SORT |

Used as a subselect (times out):

SELECT * FROM BULK_TABLE
WHERE (file_id,recnum) IN(SELECT file_id,recnum
             FROM BULK_TABLE
      WHERE saddr IN (ip1,ip2,ip3));
Plan:
| CLIENT PARALLEL 10-WAY FULL SCAN OVER BULK_TABLE |
| CLIENT MERGE SORT |
|     PARALLEL SEMI-JOIN TABLE 0 (SKIP MERGE) |
|         CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX 
[0,1,000,004,076] - [9,1,000,142,114] |
|             SERVER AGGREGATE INTO DISTINCT ROWS BY [RECNUM, FILE_ID] |
|         CLIENT MERGE SORT |
|     DYNAMIC SERVER FILTER BY FILE_ID IN ($1.$3) |

Another approach using using a join instead:

SELECT *
FROM BULK_TABLE
JOIN
    (SELECT file_id, recnum
     FROM BULK_TABLE
     WHERE saddr in (ip1,ip2,ip3)) AS SS
ON BULK_TABLE.file_id = SS.file_id AND BULK_TABLE.recnum = SS.recnum;

Runs faster but still can take about 15min

Plan:

| CLIENT PARALLEL 10-WAY FULL SCAN OVER BULK_TABLE |
| CLIENT MERGE SORT |
|     PARALLEL INNER-JOIN TABLE 0 |
|         CLIENT PARALLEL 10-WAY SKIP SCAN ON 30 KEYS OVER BULK_TABLE_SADDR_IDX 
[0,1,000,004,076] - [9,1,000,142,114] |
|         CLIENT MERGE SORT |
|     DYNAMIC SERVER FILTER BY FILE_ID IN (SS.FILE_ID) |


Is there a more efficient way to run a query such as this?

Thanks!
Ralph







--
Thanks,
Maryann

Attachment: hbase-site.xml
Description: hbase-site.xml

Reply via email to