[ 
https://issues.apache.org/jira/browse/PHOENIX-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15651887#comment-15651887
 ] 

James Taylor edited comment on PHOENIX-3438 at 11/9/16 7:57 PM:
----------------------------------------------------------------

[~jpalmert] - our current optimization for distinct is limited to usage when 
the order is preserved for the rows being output. That's not the case here. For 
example, let's say you have these rows:
| organization_id | score | entity_id |
| org1 | 50 | e1 |
| org1 | 30 | e5 |
| org1 | 20 | e3 |
| org2 | 90 | e4 |
| org2 | 30 | e2 |

The rows need to be output in this order, which is different than the row key 
order:
| score | entity_id |
| 90 | e4 |
| 50 | e1 |
| 30 | e5 |
| 30 | e2 |
| 20 | e3 |

If you had an index on SCORE+ENITY_ID+ORGANIZATION_ID it might work, but you 
wouldn't be able to do that with your base table being a multi-tenant table 
(since ORGANIZATION_ID will always be the first row key column for indexes on 
multi-tenant tables). But I'm not sure if you're really varying the 
ORGANIZATION_ID. It might help if you wrote up an example here with your 
CONTAINER_ID instead. In that case if you replace ORGANIZATION_ID above with 
CONTAINER_ID, then I think the optimization will be possible. But, I still 
think there's a missing optimization for Phoenix, though. I think the limit 
needs to be pushed to the DistinctPrefixFilter (or it might work to have a 
PageFilter *after* the DistinctPrefixFilter). I'll see if that can be added, as 
it would not difficult.

FYI, [~lhofhansl].


was (Author: jamestaylor):
[~jpalmert] - our current optimization for distinct is limited to usage when 
the order is preserved for the rows being output. That's not the case here. For 
example, let's say you have these rows:
| organization_id | score | entity_id |
----------------
| org1 | 50 | e1 |
| org1 | 30 | e5 |
| org1 | 20 | e3 |
| org2 | 90 | e4 |
| org2 | 30 | e2 |
---------------

The rows need to be output in this order, which is different than the row key 
order:
| score | entity_id |
----------------
| 90 | e4 |
| 50 | e1 |
| 30 | e5 |
| 30 | e2 |
| 20 | e3 |
---------------

If you had an index on SCORE+ENITY_ID+ORGANIZATION_ID it might work, but you 
wouldn't be able to do that with your base table being a multi-tenant table 
(since ORGANIZATION_ID will always be the first row key column for indexes on 
multi-tenant tables). But I'm not sure if you're really varying the 
ORGANIZATION_ID. It might help if you wrote up an example here with your 
CONTAINER_ID instead. In that case if you replace ORGANIZATION_ID above with 
CONTAINER_ID, then I think the optimization will be possible. But, I still 
think there's a missing optimization for Phoenix, though. I think the limit 
needs to be pushed to the DistinctPrefixFilter (or it might work to have a 
PageFilter *after* the DistinctPrefixFilter). I'll see if that can be added, as 
it would not difficult.

FYI, [~lhofhansl].

> Phoenix should be able to optimize LIMIT query with IN clause
> -------------------------------------------------------------
>
>                 Key: PHOENIX-3438
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3438
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.8.0
>            Reporter: Joel Palmert
>            Assignee: James Taylor
>            Priority: Critical
>             Fix For: 4.8.3
>
>
> This is taken from a real production issue we're having but has been 
> simplified here. Steps to repro:
> CREATE TABLE IF NOT EXISTS TEST.TEST (
>     ORGANIZATION_ID CHAR(15) NOT NULL,
>     SCORE INTEGER NOT NULL,
>     ENTITY_ID CHAR(15)
>     CONSTRAINT TEST_PK PRIMARY KEY (
>         ORGANIZATION_ID,
>         SCORE DESC
>     )
> ) VERSIONS=1, MULTI_TENANT=TRUE, REPLICATION_SCOPE=1, TTL=43200
> EXPLAIN
> SELECT entity_id, MAX(score) max_score
> FROM test.test
> WHERE organization_id IN ('org1','org3')
> GROUP BY entity_id
> ORDER BY max_score DESC
> LIMIT 1
> CLIENT 1-CHUNK PARALLEL 1-WAY SKIP SCAN ON 2 KEYS OVER TEST.TEST ['org1       
>     '] - ['org2           ']
>     SERVER AGGREGATE INTO DISTINCT ROWS BY [ENTITY_ID]
> CLIENT MERGE SORT
> CLIENT TOP 1 ROW SORTED BY [MAX(SCORE) DESC]
> Notice that Phoenix gets and sends all the rows for the two organizations to 
> the client. Since the ORDER BY is in PK order I would expect it to just 
> 1. Get the first row for each organization (in index order) rather than 
> scanning all the rows.
> 2. Merge them (only 2 rows in this case)
> 3. Limit 1
> Consider the following query and plan for what I would expect to see:
> EXPLAIN
> SELECT entity_id, MAX(score) max_score
> FROM(
>       SELECT entity_id, score
>       FROM(
>               SELECT entity_id, score
>               FROM test.test
>               WHERE organization_id = 'org1'
>               ORDER BY score DESC
>               LIMIT 1)
>       UNION ALL
>       SELECT entity_id, score
>               FROM(
>               SELECT entity_id, score
>               FROM test.test
>               WHERE organization_id = 'org2'
>               ORDER BY score DESC
>               LIMIT 1))
> GROUP BY entity_id
> ORDER BY max_score DESC
> LIMIT 1
> UNION ALL OVER 2 QUERIES
>     CLIENT 1-CHUNK SERIAL 1-WAY RANGE SCAN OVER TEST.TEST ['org1           ']
>         SERVER 1 ROW LIMIT
>     CLIENT 1 ROW LIMIT
>     CLIENT 1-CHUNK SERIAL 1-WAY RANGE SCAN OVER TEST.TEST ['org2           ']
>         SERVER 1 ROW LIMIT
>     CLIENT 1 ROW LIMIT
> CLIENT SORTED BY [ENTITY_ID]
> CLIENT AGGREGATE INTO DISTINCT ROWS BY [ENTITY_ID]
> CLIENT TOP 1 ROW SORTED BY [MAX(SCORE) DESC]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to