[ 
https://issues.apache.org/jira/browse/PHOENIX-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15552558#comment-15552558
 ] 

Sumit Nigam commented on PHOENIX-3354:
--------------------------------------

Here are the 2 DDLs (for data and index tables):

CREATE TABLE IF NOT EXISTS "ldmns:exDocStoreb" (CURRENT_TIMESTAMP BIGINT NOT 
NULL, ID VARCHAR(96), BINARY_CURR_EXDOC VARBINARY, CURR_CHECKSUM VARCHAR(32), 
BINARY_PREV_EXDOC VARBINARY, PREV_CHECKSUM VARCHAR(32), PREV_TIMESTAMP BIGINT, 
SUMMARY VARCHAR, OBJ_SUMMARY VARCHAR, PARAM_SAMPLES VARCHAR, BULK_PUBLISH_UUID  
VARCHAR, TOTAL_FACTS INTEGER, CURR_EXDOC VARCHAR, PREV_EXDOC VARCHAR CONSTRAINT 
PK PRIMARY KEY(CURRENT_TIMESTAMP, ID)) COMPRESSION = 'SNAPPY', BLOCKCACHE =  
false, SALT_BUCKETS = 36


CREATE INDEX IF NOT EXISTS "ldmns:indx_exdocb" ON "ldmns:exDocStoreb"(ID) 
INCLUDE (SUMMARY, OBJ_SUMMARY, PARAM_SAMPLES, BULK_PUBLISH_UUID)


Here is the upsert query for this table:

UPSERT INTO "ldmns:exDocStoreb" (CURRENT_TIMESTAMP, BULK_PUBLISH_UUID, ID, 
CURR_CHECKSUM, CURR_EXDOC, SUMMARY, PREV_EXDOC, PREV_CHECKSUM, PREV_TIMESTAMP, 
OBJ_SUMMARY, PARAM_SAMPLES, TOTAL_FACTS, BINARY_CURR_EXDOC, BINARY_PREV_EXDOC) 
VALUES (TO_NUMBER(CURRENT_TIME()), ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)


Here is explain plan of a SELECT with merge sort:

explain   select  /* +USE_SORT_MERGE_JOIN*/  ID, CURR_EXDOC, BINARY_CURR_EXDOC, 
CURRENT_TIMESTAMP, CURR_CHECKSUM, PREV_EXDOC, BINARY_PREV_EXDOC, PREV_CHECKSUM, 
PREV_TIMESTAMP from "ldmns:exDocStoreb"  as a inner join (select 
max(CURRENT_TIMESTAMP) as mct, ID as tid from "ldmns:exDocStoreb" where ID like 
' 006389a6b10667f39bdbbdafdc4611 e03cc04418cbc2619ddc01f54d88d7 c3bf%' group by 
ID) as tmp on a.ID=tmp.tid and a.CURRENT_TIMESTAMP=tmp.mct where id like ' 
006389a6b10667f39bdbbdafdc4611 e03cc04418cbc2619ddc01f54d88d7 c3bf%' ;
+----------------------------- -------------+
|                   PLAN                   |
+----------------------------- -------------+
| SORT-MERGE-JOIN (INNER) TABLES           |
|     CLIENT 36-CHUNK PARALLEL 36-WAY RANGE SCAN OVER ldmns:exDocStoreb [0] |
|         SERVER FILTER BY ID LIKE ' 006389a6b10667f39bdbbdafdc4611 
e03cc04418cbc2619ddc01f54d88d7 c3bf%' |
|         SERVER SORTED BY [A.ID, A.CURRENT_TIMESTAMP] |
|     CLIENT MERGE SORT                    |
| AND (SKIP MERGE)                         |
|     CLIENT 36-CHUNK PARALLEL 36-WAY RANGE SCAN OVER ldmns:indx_exdocb [0,' 
006389a6b10667f39bdbbdafdc4611 e03cc04418cbc2619ddc01f54d88d7 c3bf'] - 
[0,'006389 |
|         SERVER FILTER BY FIRST KEY ONLY  |
|         SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["ID"] |
|     CLIENT MERGE SORT                    |
|     CLIENT SORTED BY ["ID", MAX("CURRENT_TIMESTAMP")] |
+----------------------------- -------------+
11 rows selected (0.025 seconds)


Here is explain plan with default join:

explain SELECT   ID, CURR_EXDOC, BINARY_CURR_EXDOC, CURRENT_TIMESTAMP, 
CURR_CHECKSUM, PREV_EXDOC, BINARY_PREV_EXDOC, PREV_CHECKSUM, PREV_TIMESTAMP 
from "ldmns:exDocStoreb" as a inner join (select max(CURRENT_TIMESTAMP) as mct, 
ID as tid from "ldmns:exDocStoreb" where ID like ' 
42ecf4abd4bd7e7606025dc8eee3de 6a3cc04418cbc2619ddc01f54d88d7 c3bf%' group by 
ID) as tmp on a.ID=tmp.tid and a.CURRENT_TIMESTAMP=tmp.mct where ID like ' 
42ecf4abd4bd7e7606025dc8eee3de 6a3cc04418cbc2619ddc01f54d88d7 c3bf%' ;
+----------------------------- -------------+
|                   PLAN                   |
+----------------------------- -------------+
| CLIENT 3-CHUNK PARALLEL 3-WAY RANGE SCAN OVER ldmns:exDocStoreb [0] |
|     SERVER FILTER BY ID LIKE ' 42ecf4abd4bd7e7606025dc8eee3de 
6a3cc04418cbc2619ddc01f54d88d7 c3bf%' |
|     PARALLEL INNER-JOIN TABLE 0 (SKIP MERGE) |
|         CLIENT 3-CHUNK PARALLEL 3-WAY RANGE SCAN OVER ldmns:indx_exdocb [0,' 
42ecf4abd4bd7e7606025dc8eee3de 6a3cc04418cbc2619ddc01f54d88d7 c3bf'] - [0,' 
42ecf4abd4bd7e7606025dc8eee3de 6a3cc0 |
|             SERVER FILTER BY FIRST KEY ONLY |
|             SERVER AGGREGATE INTO ORDERED DISTINCT ROWS BY ["ID"] |
|         CLIENT MERGE SORT                |
|     DYNAMIC SERVER FILTER BY (A.CURRENT_TIMESTAMP, A.ID) IN ((TMP.MCT, 
TMP.TID)) |
+----------------------------- -------------+
8 rows selected (0.033 seconds)

I am on phoenix 4.6/ hbase 1.1

Just changing the join algorithm should be enough. I would assume that changing 
the hash join to sort-merge join would not alter the query results which it 
does.

> SORT_MERGE join on single tenant, salted table misses records randomly.
> -----------------------------------------------------------------------
>
>                 Key: PHOENIX-3354
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3354
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.5.0, 4.6.0
>         Environment: Hbase 1.1.2
>            Reporter: Sumit Nigam
>            Priority: Critical
>
> Single-tenant, salted table on Hbase 1.1.2 has ~3 million records. It has a 
> secondary index as well. The SORT_MERGE join hint misses data randomly when 
> used in queries. 
> I will add the DDLs and explain plan for all the queries that access this 
> table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to