[ 
https://issues.apache.org/jira/browse/KUDU-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241691#comment-15241691
 ] 

Dan Burkert commented on KUDU-1363:
-----------------------------------

I thought about this a little more after commenting yesterday evening.  It's 
not necessarily true that IN list predicates are inefficient, only that making 
them efficient is going to be a little bit tricky.  As an example, consider the 
following schema:

{code:SQL}
CREATE TABLE machine_metrics (
  host STRING, metric STRING, time TIMESTAMP, value DOUBLE,
  PRIMARY KEY (host, metric, time)
);
{code}

So we have a pretty ordinary time series schema, with the somewhat unusual 
characteristic of sorting first by the host and metric instead of timestamp.  
With a table like this we may want to have a query that retrieves a few metrics 
across a few different hosts for a single day, such as:

{code:SQL}
SELECT * FROM machine_metrics
WHERE host IN ('host-001', 'host-235')
  AND metric IN ('load-avg-1min', 'load-avg-5min')
  AND time >= '2016-04-01T00:00:00'
  AND time < '2016-04-02T00:00:00';
{code}

The most naive way to satisfy this scan is a full table scan that applies the 
predicates to each record as it is read.  But since the predicates are 
specified on primary key columns, Kudu could be a little bit smarter and 
convert the full table scan into four individual scanners, one per host/metric 
combination, each scanning only the rows that match the predicates.  The 
scanners would have the following primary key bounds:

{code:SQL}
PK >= ('host-001', 'load-avg-1min', 2016-04-01T00:00:00)
  AND PK < ('host-001', 'load-avg-1min', 2016-04-02T00:00:00)
PK >= ('host-001', 'load-avg-5min', 2016-04-01T00:00:00)
  AND PK < ('host-001', 'load-avg-5min', 2016-04-02T00:00:00)
PK >= ('host-235', 'load-avg-1min', 2016-04-01T00:00:00)
  AND PK < ('host-235', 'load-avg-1min', 2016-04-02T00:00:00)
PK >= ('host-235', 'load-avg-5min', 2016-04-01T00:00:00)
  AND PK < ('host-235', 'load-avg-5min', 2016-04-02T00:00:00)
{code}
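
To make the expansion concrete, here is a minimal sketch in Java of how one 
range per host/metric combination falls out of the two IN lists.  It is only 
an illustration under some assumptions: InListExpansion, KeyRange, and expand 
are hypothetical names, not part of the Kudu client or server, and keys are 
modeled as plain lists of strings rather than encoded primary keys.  The lower 
bound is inclusive and the upper bound exclusive, mirroring the query's 
time >= ... AND time < ... predicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of the Kudu API. */
final class InListExpansion {

    /** A half-open primary key range: lower is inclusive, upper is exclusive. */
    static final class KeyRange {
        final List<String> lower;
        final List<String> upper;

        KeyRange(List<String> lower, List<String> upper) {
            this.lower = lower;
            this.upper = upper;
        }

        @Override
        public String toString() {
            return "PK >= " + lower + " AND PK < " + upper;
        }
    }

    /**
     * Builds one key range per combination of IN-list values. Each entry of
     * inLists holds the allowed values for one leading key column, in key
     * column order; [lowerInclusive, upperExclusive) is the range predicate
     * on the key column that follows them.
     */
    static List<KeyRange> expand(List<List<String>> inLists,
                                 String lowerInclusive,
                                 String upperExclusive) {
        // Cartesian product of the IN lists, built one key column at a time.
        List<List<String>> prefixes = new ArrayList<>();
        prefixes.add(new ArrayList<String>());
        for (List<String> values : inLists) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : prefixes) {
                for (String value : values) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(value);
                    next.add(extended);
                }
            }
            prefixes = next;
        }

        // Append the range endpoints to each prefix to form the two bounds.
        List<KeyRange> ranges = new ArrayList<>();
        for (List<String> prefix : prefixes) {
            List<String> lower = new ArrayList<>(prefix);
            lower.add(lowerInclusive);
            List<String> upper = new ArrayList<>(prefix);
            upper.add(upperExclusive);
            ranges.add(new KeyRange(lower, upper));
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<KeyRange> ranges = expand(
            Arrays.asList(
                Arrays.asList("host-001", "host-235"),
                Arrays.asList("load-avg-1min", "load-avg-5min")),
            "2016-04-01T00:00:00",
            "2016-04-02T00:00:00");
        for (KeyRange range : ranges) {
            System.out.println(range);
        }
    }
}
```

For the example query this yields four ranges.  If each IN list is sorted, the 
ranges come out in primary key order and do not overlap, so in principle they 
could be scanned one after another.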

Today Kudu is smart enough to push equality and range predicates into a single 
primary key bound (see the optimization guide linked above for examples), but 
only a single primary key bound is supported, not multiple.  As a bonus, I 
think adding this level of optimization would negate the need for a multi-get 
API.

> Add Multiple column range predicates for the same column in a single scan
> -------------------------------------------------------------------------
>
>                 Key: KUDU-1363
>                 URL: https://issues.apache.org/jira/browse/KUDU-1363
>             Project: Kudu
>          Issue Type: New Feature
>            Reporter: Chris George
>
> Currently, adding multiple column range predicates for the same column 
> essentially ANDs the predicates together, which causes no results to be 
> returned. 
> Supporting this would greatly increase performance, since I could complete 
> in one scan what would otherwise take two.
> As an example using the Java API:
> ColumnRangePredicate columnRangePredicateColumnNameA = new ColumnRangePredicate(
>     new ColumnSchema.ColumnSchemaBuilder("column_name", Type.STRING).build());
> columnRangePredicateColumnNameA.setLowerBound("A");
> columnRangePredicateColumnNameA.setUpperBound("A");
> ColumnRangePredicate columnRangePredicateColumnNameB = new ColumnRangePredicate(
>     new ColumnSchema.ColumnSchemaBuilder("column_name", Type.STRING).build());
> columnRangePredicateColumnNameB.setLowerBound("B");
> columnRangePredicateColumnNameB.setUpperBound("B");
> which would be equivalent to:
> select * from some_table where column_name = 'A' or column_name = 'B'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
