[ 
https://issues.apache.org/jira/browse/PHOENIX-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Taylor updated PHOENIX-1940:
----------------------------------
    Description: 
Looks like quite a bit of time is spent in the binary search done to get the 
latest Cell value when we're evaluating expressions on the server side (up to 
60% is spent in KeyValueUtil.getColumnLatest()). Since we know the set of 
column qualifiers being projected into the scan, we could push the expected 
position (assuming all columns have values). If the Cell is not in that 
position, we could fall back to a binary search.
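
A minimal sketch of the "try the expected position, else binary search" idea, under simplifying assumptions: the row is modeled as a sorted list of String qualifiers rather than HBase Cells, and the class and method names (PositionalLookup, find) are hypothetical, not Phoenix or HBase APIs.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration only; real rows hold Cells keyed by byte[] qualifiers. */
final class PositionalLookup {
    private PositionalLookup() {}

    /**
     * Returns the index of qualifier in sortedQualifiers, or -1 if it is absent.
     * The expected position is checked first; a miss falls back to binary search.
     */
    static int find(List<String> sortedQualifiers, String qualifier, int expectedPos) {
        // Fast path: the cell sits where the projection order says it should.
        if (expectedPos >= 0
                && expectedPos < sortedQualifiers.size()
                && sortedQualifiers.get(expectedPos).equals(qualifier)) {
            return expectedPos;
        }
        // Slow path: a column had no value (or extra columns exist), so positions shifted.
        int idx = Collections.binarySearch(sortedQualifiers, qualifier);
        return idx >= 0 ? idx : -1;
    }
}
```

When every projected column has a value, each lookup is a single comparison; the binary search is only paid for rows where the guess misses.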

Further enhancements could be to: allow a NOT NULL constraint on KeyValue 
columns and either a) require values for all NOT NULL columns to be provided on 
an UPSERT, or b) do a check-and-put to enforce it (for transactional tables this 
could be enforced).

Additionally, the table could declare that dynamic columns are not allowed. If 
both of the above are true, then we'd be guaranteed positional access to the 
List<Cell> that we get back from an HBase Scanner.

One further enhancement would be to collect a set of all ColumnExpression 
instances on the server side for all expressions sent over. Then, we'd bind 
them once, outside of the general expression evaluation of all expressions in a 
statement for a given row. An example of where this would save time would be in 
evaluating the following TPCH-Q1 aggregate query:

{code}
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
FROM
    lineitem
WHERE
    l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;
{code}
During aggregation, the KeyValueColumnExpression for l_extendedprice would be 
evaluated four times currently, once per occurrence in different SELECT 
expressions. This enhancement would cut that down to once.
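
A rough sketch of the bind-once idea, assuming a simplified model in which each expression is reduced to the list of column names it references. The names BindOnce, CountingRow, distinctColumns, and bind are hypothetical and do not correspond to Phoenix classes.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; not how Phoenix represents expressions or rows. */
final class BindOnce {
    private BindOnce() {}

    /** Stand-in for a row that counts how many column lookups it serves. */
    static final class CountingRow {
        private final Map<String, Double> cells;
        int lookups = 0;

        CountingRow(Map<String, Double> cells) {
            this.cells = cells;
        }

        Double lookup(String column) {
            lookups++;
            return cells.get(column);
        }
    }

    /** Collects the distinct columns referenced by all expressions (once per statement). */
    static Set<String> distinctColumns(List<List<String>> columnsPerExpression) {
        Set<String> distinct = new LinkedHashSet<>();
        for (List<String> columns : columnsPerExpression) {
            distinct.addAll(columns);
        }
        return distinct;
    }

    /** Resolves each distinct column exactly once for the given row. */
    static Map<String, Double> bind(Set<String> distinct, CountingRow row) {
        Map<String, Double> bound = new HashMap<>();
        for (String column : distinct) {
            bound.put(column, row.lookup(column));
        }
        return bound;
    }
}
```

The expressions would then read their inputs from the bound map rather than searching the row again, so a column such as l_extendedprice is looked up once per row regardless of how many SELECT expressions reference it.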

  was:
Looks like quite a bit of time is spent in the binary search done to get the 
latest Cell value when we're evaluating expressions on the server side (up to 
60% is spent in KeyValueUtil.getColumnLatest()). Since we know the set of 
column qualifiers being projected into the scan, we could push the expected 
position (assuming all columns have values). If the Cell is not in that 
position, we could fall back to a binary search.

Further enhancements could be to: allow a not null constraint on KeyValue 
columns and either a) require all non null values to be provided on an UPSERT, 
or b) do a check and put to enforce it (for transactional tables this could be 
enforced).

Additionally, the table could declare that dynamic columns are not allowed. If 
both of the above are true, when we'd be guaranteed to be able to positionally 
access the List<Cell> that we get back from an HBase Scanner.

One further enhancement would be to collect a set of all ColumnExpression 
instances on the server side for all expressions sent over. Then, we'd bind 
them once, outside of the general expression evaluation for each row. An 
example of where this would save time would be in evaluating the following 
TPCH-Q1 aggregate query:

{code}
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
FROM
    lineitem
WHERE
    l_shipdate <= date '1998-12-01' - interval '90' day
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;
{code}
During aggregation, the KeyValueColumnExpression for l_extendedprice would be 
evaluated four times, once per occurrence in different SELECT expressions.


> Push expected List<Cell> ordinal position in KeyValueColumnExpression
> ---------------------------------------------------------------------
>
>                 Key: PHOENIX-1940
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1940
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
