[ https://issues.apache.org/jira/browse/PHOENIX-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942176#comment-14942176 ]

Samarth Jain commented on PHOENIX-914:
--------------------------------------

I was hoping to get a good perf boost even in cases where the query is not over 
the data created towards the tail end. A query that uses just the filter on the 
row_timestamp column should ideally be able to skip the HFiles which don't have 
any data in that time range. 

It turns out region splits work against us. On a region split, HBase isn't able 
to figure out what the min and max timestamps of the newly created store files 
should be (because it doesn't actually go through every row while splitting a 
region). As a result, future scans are not able to filter the store files 
created from splits. Over time, with region splits happening automatically, we 
end up unable to filter out any of the HFiles, resulting in the same performance 
for tables with and without a row_timestamp column. 

To confirm the theory, I created two pre-split tables of 4 regions each (one 
with a row timestamp column and one without) on my laptop and loaded 20 million 
rows into each table, making sure the rows were evenly distributed. The amount 
of data wasn't enough to cause automatic region splits either.
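
For reference, the row timestamp table could have been declared along the 
following lines. This is only a sketch: the table name EVENTS_TS, the second 
key column, and the value column are assumptions (only pk1 and the 4-region 
pre-split come from the test above), ROW_TIMESTAMP is the primary key 
designation this work adds, and the actual DDL and split points used in the 
test are not shown here.

CREATE TABLE EVENTS_TS (
    PK1 DATE NOT NULL,
    PK2 VARCHAR NOT NULL,
    VAL VARCHAR
    CONSTRAINT PK PRIMARY KEY (PK1 ROW_TIMESTAMP, PK2)
);

The standard table would be identical except that PK1 carries no ROW_TIMESTAMP 
designation, so its values are unrelated to the cell timestamps and the filter 
on PK1 cannot prune any store files.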

Query:
select count(*) from <table>  where pk1 > to_date('1970-01-01 01:05:00.000') 
and pk1 < to_date('1970-01-01 01:20:00.000');

Standard table - 18 seconds.
Row timestamp table - 6 seconds.

As expected, with the time range available on the store files, the query 
returned much faster (3x) on the row timestamp table. This is because it didn't 
need to scan the other 3 regions, unlike the query on the standard table. The 
perf gain would have been even more prominent with a larger number of rows and 
pre-split regions.
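
The tail-end case mentioned at the start should benefit from the same 
mechanism: with only a lower bound on the row timestamp column, every store 
file whose time range ends before that bound can be skipped. A sketch against 
the hypothetical EVENTS_TS table above (the cutoff value here is made up):

SELECT count(*) FROM EVENTS_TS
WHERE PK1 > TO_DATE('1970-01-01 01:55:00.000');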

> Native HBase timestamp support to optimize date range queries in Phoenix 
> -------------------------------------------------------------------------
>
>                 Key: PHOENIX-914
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-914
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.0.0
>            Reporter: Vladimir Rodionov
>            Assignee: Samarth Jain
>         Attachments: PHOENIX-914.patch, PHOENIX-914.patch, 
> PHOENIX-914_v2.patch, wip.patch
>
>
> For many applications, one of the columns of a table can be (and must be) 
> naturally mapped to the HBase timestamp. What this gives us is the 
> optimization in StoreScanner where HFiles with timestamps out of the range 
> of a Scan will be omitted. Let us say that we have time-series type data 
> (EVENTS) and a custom compaction that creates a series of HFiles with 
> contiguous, non-overlapping timestamp ranges.
> CREATE TABLE IF NOT EXISTS ODS.EVENTS (
>     METRICID  VARCHAR NOT NULL,
>     METRICNAME VARCHAR,
>     SERVICENAME VARCHAR NOT NULL,
>     ORIGIN VARCHAR NOT NULL,
>     APPID VARCHAR,
>     IPID VARCHAR,
>     NVALUE DOUBLE,
>     TIME TIMESTAMP NOT NULL  /+ TIMESTAMP +/,
>     DATA VARCHAR,
>     SVALUE VARCHAR
>     CONSTRAINT PK PRIMARY KEY (METRICID, SERVICENAME, ORIGIN, APPID, IPID, 
> TIME)
> ) SALT_BUCKETS=40, IMMUTABLE_ROWS=true,VERSIONS=1,DATA_BLOCK_ENCODING='NONE';
> Note the /+ TIMESTAMP +/ on TIME TIMESTAMP NOT NULL - this is the hint to 
> Phoenix that the column TIME must be mapped to the HBase timestamp. 
> The Query:
> Select all events of type 'X' for the last 7 days:
> SELECT * from EVENTS WHERE METRICID = 'X' and TIME < NOW() and TIME > NOW() - 
> 7*24*3600000; (this may not be correct SQL syntax, of course)
> These types of queries will be efficiently optimized if:
> 1. Phoenix maps the TIME column to the HBase timestamp
> 2. Phoenix is smart enough to map the WHERE clause on the TIME attribute to 
> the Scan time range 
> Although this:
> Properties props = new Properties();
> props.setProperty(PhoenixRuntime.CURRENT_SCN_ATTRIB, Long.toString(ts));
> Connection conn = DriverManager.getConnection(myUrl, props);
> conn.createStatement().execute("UPSERT INTO myTable VALUES ('a')");
> conn.commit();
> will work in my case, it may not be efficient from a performance point of 
> view because a new Connection object and a new Statement are created for 
> every INSERT/UPSERT; besides this, we still need optimization 2 (see above). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
