[
https://issues.apache.org/jira/browse/PHOENIX-838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Purtell updated PHOENIX-838:
-----------------------------------
Description:
Support continuous queries.
As a coprocessor application, Phoenix is well positioned to observe mutations
and treat those observations as an event stream.
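To make the event-stream idea concrete, the observation hook might look like the minimal sketch below, written against the HBase 2.x coprocessor API; the class name and the in-memory queue are hypothetical illustrations, not Phoenix code.
{code:java}
import java.io.IOException;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessor;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.coprocessor.RegionObserver;
import org.apache.hadoop.hbase.wal.WALEdit;

public class MutationEventObserver implements RegionCoprocessor, RegionObserver {

  // Hypothetical in-memory event stream consumed by the continuous query task.
  private final BlockingQueue<Put> events = new LinkedBlockingQueue<>();

  @Override
  public Optional<RegionObserver> getRegionObserver() {
    return Optional.of(this);
  }

  // Fires after each committed write; the mutation becomes a stream event.
  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Put put, WALEdit edit, Durability durability) throws IOException {
    events.offer(put);
  }
}
{code}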
Continuous queries are persistent queries that run server-side, typically
expressed as structured queries using some extensions for defining a bounded
subset of a potentially unbounded tuple stream. A Phoenix user could create a
materialized view using WINDOW and other OLAP extensions to SQL discussed on
PHOENIX-154 to define time- or tuple-based sliding windows, possibly
partitioned, and an aggregating or filtering operation over those windows. This
would trigger instantiation of a long-running distributed task on the cluster
for incrementally maintaining the view. ("Task" is meant here as a logical
notion; it may not be a separate thread of execution.) As the task receives
observer events and performs work, it would update state in memory for
on-demand retrieval. For state reconstruction after failure, the WAL could be
overloaded with in-window event history and/or the in-memory state could be
periodically checkpointed into shadow stores in the region.
Users would pick up the latest state maintained by the continuous query by
querying the view, or perhaps Phoenix could do this transparently on any query
if the optimizer determines equivalence.
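Purely as an illustration, view creation and retrieval through the JDBC driver might look like the following; the WINDOW syntax is hypothetical (in the spirit of the PHOENIX-154 discussion), and the table, view, and column names are made up.
{code:java}
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ContinuousQueryExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
         Statement stmt = conn.createStatement()) {

      // Hypothetical DDL: an incrementally maintained view over a five
      // minute sliding window, partitioned by host, aggregating a metric.
      stmt.execute(
          "CREATE MATERIALIZED VIEW recent_load AS "
          + "SELECT host, AVG(cpu) AS avg_cpu FROM metrics "
          + "WINDOW RANGE INTERVAL '5' MINUTE PRECEDING "
          + "GROUP BY host");

      // Picking up the latest window state is an ordinary query on the view.
      try (ResultSet rs = stmt.executeQuery("SELECT * FROM recent_load")) {
        while (rs.next()) {
          System.out.println(rs.getString("host") + " " + rs.getDouble("avg_cpu"));
        }
      }
    }
  }
}
{code}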
This could be an important feature for Phoenix. Generally, Phoenix and HBase
are meant to handle high data volumes that overwhelm other data management
options, so even subsets of the full data may present scale challenges. Many
use cases mix ad hoc or exploratory full table scans with aggregates, rollups,
or sampling queries over a subset or sample, and users want the latter queries
to run as fast as possible. If that work can be done inline with the process of
initially persisting mutations, then we trade some memory and CPU resources up
front to eliminate significant IO time later that would otherwise dominate.
An initial implementation could automatically partition continuous queries on
region boundaries. If this can be done, then failure handling and state
reconstruction for continuous queries would map naturally onto existing HBase
mechanisms for detecting and recovering from regionserver failure. The
following constructs should be excluded:
- DISTINCT (might require too much in-memory state)
- Joins (defeat partitioning)
- Subqueries (implementation complexity)
Queries not meeting the constraints would generate an exception at view
creation time. Partitioning could be exposed explicitly to the user, or the
JDBC driver could pick up global results in parallel using an Endpoint
invocation over all regions and perform a final global aggregation or filtering
step at the client.
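The client-side merge might look like the sketch below. Only the Table.coprocessorService() call is existing HBase client API; the ContinuousQueryService stub stands in for a protobuf-generated endpoint service and is entirely hypothetical, as is the per-region partial count it returns.
{code:java}
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class GlobalMergeExample {

  // Hypothetical stand-in for a protobuf-generated endpoint service that
  // returns one partial aggregate per region.
  public interface ContinuousQueryService extends com.google.protobuf.Service {
    long getPartialCount() throws IOException;
  }

  public static void main(String[] args) throws Throwable {
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("metrics"))) {

      // Invoke the endpoint in parallel over all regions (null start/end keys).
      Map<byte[], Long> partials = table.coprocessorService(
          ContinuousQueryService.class, null, null,
          service -> service.getPartialCount());

      // Final global aggregation step at the client.
      long total = partials.values().stream().mapToLong(Long::longValue).sum();
      System.out.println("global count = " + total);
    }
  }
}
{code}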
Follow-on work could enable subqueries by stacking them in the event model. The
inner query would generate an event that notifies the outer query when new
results are ready, and the outer query would pick up the results and process
them further.
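A minimal sketch of that stacking, with all names hypothetical: the inner query publishes a result-ready event, and the stacked outer query consumes it as input.
{code:java}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of stacking: the inner query publishes result-ready
// events, and a stacked outer query consumes them as its input.
public class StackedQuery {

  interface ResultListener {
    void onResults(List<Object[]> rows);
  }

  private final List<ResultListener> listeners = new CopyOnWriteArrayList<>();

  void addListener(ResultListener listener) {
    listeners.add(listener);
  }

  // Invoked by this query's task when a window closes and results are final.
  void publish(List<Object[]> rows) {
    listeners.forEach(l -> l.onResults(rows));
  }

  // Register this (outer) query on an inner query, consuming the inner
  // query's published results as input events.
  void stackOn(StackedQuery inner) {
    inner.addListener(this::consume);
  }

  private void consume(List<Object[]> rows) {
    // Process the inner results further, then publish(...) downstream.
  }
}
{code}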
It might also be useful follow-on work to extend server-side persistent query
management with an inactive-but-resident state. This would allow users to shed
load by deactivating a subset of persistent queries without requiring expensive
reconstruction or losing state.
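One way to model that lifecycle, again purely as a hypothetical sketch:
{code:java}
// Hypothetical lifecycle for a server-side persistent query. Deactivation
// stops event delivery but keeps accumulated window state resident in
// memory, so reactivation needs no expensive reconstruction.
public class PersistentQuery {

  enum State {
    ACTIVE,            // receiving events and maintaining state
    INACTIVE_RESIDENT, // shedding load: no events, state kept in memory
    STOPPED            // removed: state discarded or checkpointed
  }

  private volatile State state = State.ACTIVE;

  void deactivate() {
    state = State.INACTIVE_RESIDENT; // stop consuming events, retain state
  }

  void reactivate() {
    state = State.ACTIVE;            // resume from retained state
  }

  boolean acceptsEvents() {
    return state == State.ACTIVE;
  }
}
{code}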
> Continuous queries
> ------------------
>
> Key: PHOENIX-838
> URL: https://issues.apache.org/jira/browse/PHOENIX-838
> Project: Phoenix
> Issue Type: New Feature
> Reporter: Andrew Purtell
>
--
This message was sent by Atlassian JIRA
(v6.2#6252)