[jira] [Commented] (PHOENIX-3744) Support snapshot scanners for MR-based queries

James Taylor (JIRA) Fri, 24 Mar 2017 12:02:11 -0700

    [ 
https://issues.apache.org/jira/browse/PHOENIX-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940951#comment-15940951
 ]


James Taylor commented on PHOENIX-3744:
---------------------------------------

Here's an idea on how this can be implemented:
- At the beginning of PhoenixInputFormat.getQueryPlan(), take a snapshot so we 
can get the now-unchanging region boundaries
- Later in PhoenixInputFormat.getQueryPlan(), when we call 
statement.optimizeQuery(), provide an overloaded version that passes through an 
interface from which we can get the region boundaries. Have two implementations 
of this interface: one that does what we do today in 
BaseResultIterators.getParallelScans():
{code}
        List<HRegionLocation> regionLocations = context.getConnection().getQueryServices()
                .getAllTableRegions(physicalTableName);
{code}
The other implementation would use the snapshot to get the region boundaries 
instead. This prevents a race condition in which a split occurs after we've 
already got the region boundaries but before the scans run. It also avoids 
stale region boundaries, since today we get these from the cache on the 
HConnection. You'd use a new job configuration parameter to determine which 
implementation to use based on whether or not a snapshot read is being done.
- As a side note, we might want to leverage the ParallelScanGrouper interface 
that's already in place to get the region boundaries, as it'll be somewhat 
tricky to thread a new interface to the BaseResultIterators class and we 
already do this with an alternate ParallelScanGrouper implementation for the MR 
jobs.
- In PhoenixRecordReader.initialize(), when doing a snapshot read, instead of 
instantiating a TableResultIterator (which is the thing that does an 
htable.getScanner()), instantiate a new TableSnapshotResultIterator which uses 
the snapshot scanner instead. The ResultIterator interface is very simple - you 
just need to implement two methods (and the explain method can be a noop):
{code}
public interface ResultIterator extends SQLCloseable {
    /**
     * Grab the next row's worth of values. The iterator will return a Tuple.
     * @return Tuple object if there is another row, null if the scanner is
     * exhausted.
     * @throws SQLException e
     */
    public Tuple next() throws SQLException;
    
    public void explain(List<String> planSteps);
}
{code}
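
To make the shape of the steps above a bit more concrete, here is a rough, self-contained sketch. It is not Phoenix code: RegionBoundarySource, both of its implementations, the choose() helper, the configuration key, and the simplified RowIterator (String rows instead of Tuple, no SQLException) are all hypothetical stand-ins invented for illustration, and region boundaries are modelled as plain start-key strings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class SnapshotReadSketch {
    // Hypothetical job configuration key, invented for this sketch.
    static final String SNAPSHOT_NAME_KEY = "example.mapreduce.snapshot.name";

    // Hypothetical interface that an overloaded optimizeQuery() could pass through.
    interface RegionBoundarySource {
        List<String> regionStartKeys(String physicalTableName);
    }

    // Stand-in for today's lookup, which reads whatever the region cache currently holds.
    static final class LiveRegionBoundarySource implements RegionBoundarySource {
        private final Map<String, List<String>> regionCache;
        LiveRegionBoundarySource(Map<String, List<String>> regionCache) {
            this.regionCache = regionCache;
        }
        @Override public List<String> regionStartKeys(String physicalTableName) {
            List<String> keys = regionCache.get(physicalTableName);
            return keys == null ? Collections.<String>emptyList() : keys;
        }
    }

    // Boundaries are copied once at snapshot time, so later splits cannot change them.
    static final class SnapshotRegionBoundarySource implements RegionBoundarySource {
        private final List<String> frozenKeys;
        SnapshotRegionBoundarySource(List<String> keysAtSnapshot) {
            this.frozenKeys = Collections.unmodifiableList(new ArrayList<>(keysAtSnapshot));
        }
        @Override public List<String> regionStartKeys(String physicalTableName) {
            return frozenKeys;
        }
    }

    // Pick the implementation based on whether the job asks for a snapshot read.
    static RegionBoundarySource choose(Map<String, String> jobConf,
                                       RegionBoundarySource live,
                                       RegionBoundarySource snapshot) {
        return jobConf.containsKey(SNAPSHOT_NAME_KEY) ? snapshot : live;
    }

    // Simplified stand-in for ResultIterator.
    interface RowIterator extends AutoCloseable {
        String next();                          // null once the scanner is exhausted
        void explain(List<String> planSteps);
        @Override void close();
    }

    // Stand-in for the proposed TableSnapshotResultIterator: it wraps a
    // snapshot-backed scanner (here just a list iterator) instead of a live scanner.
    static final class SnapshotRowIterator implements RowIterator {
        private final Iterator<String> snapshotScanner;
        SnapshotRowIterator(List<String> snapshotRows) {
            this.snapshotScanner = snapshotRows.iterator();
        }
        @Override public String next() {
            return snapshotScanner.hasNext() ? snapshotScanner.next() : null;
        }
        @Override public void explain(List<String> planSteps) {
            // Intentionally a no-op, as suggested above.
        }
        @Override public void close() {
            // Nothing to release in this sketch; a real iterator would close its scanner.
        }
    }
}
```

The point of the split is that the record reader and the query plan only ever see the two small interfaces, so the snapshot path can be swapped in from the job configuration without touching the rest of the scan logic.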

FYI, [~akshita.malhotra], [~churromorales], [~samarthjain]

> Support snapshot scanners for MR-based queries
> ----------------------------------------------
>
>                 Key: PHOENIX-3744
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3744
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>            Assignee: Akshita Malhotra
>
> HBase supports scanning over snapshots, with a SnapshotScanner that accesses 
> the region directly in HDFS. We should make sure that Phoenix can support 
> that.
> Not sure how we'd want to decide when to run a query over a snapshot. Some 
> ideas:
> - if there's an SCN set (i.e. the query is running at a point in time in the 
> past)
> - if the memstore is empty
> - if the query is being run at a timestamp earlier than any memstore data
> - as a config option on the table
> - as a query hint
> - based on some kind of optimizer rule (i.e. based on estimated # of bytes 
> that will be scanned)
> Phoenix typically runs a query at the timestamp at which it was compiled. Any 
> data committed after this time should not be seen while a query is running.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
