[jira] [Commented] (LENS-251) Support to query streaming data sources

Aniruddha Gangopadhyay (JIRA) Wed, 04 Feb 2015 23:44:11 -0800

    [ 
https://issues.apache.org/jira/browse/LENS-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306788#comment-14306788
 ]


Aniruddha Gangopadhyay commented on LENS-251:
---------------------------------------------

bq. Is knowing the data availability for the window of the storage table 
necessary? If yes, how will the user know that data is actually available for 
the range queried, and that it is neither partial nor stale? If not, should we 
mandate that the user choose a time field, so it is clear for which time the 
query is being answered?
Knowing the time range for data availability is needed because, with the 
current query execution logic, we might end up returning an empty data set even 
when we have the capacity to answer the query. As for the user's knowledge of 
data availability or staleness, it's not something the user should need to be 
aware of. The server's query execution logic should return, in the result set, 
whatever data is applicable for the queried time range and granularity.
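
To make the availability point concrete, here is a minimal Python sketch (not 
Lens code) of clipping a queried time range to the window a storage table 
actually holds. The names TimeRange and answerable_range are hypothetical, and 
the half-open [start, end) convention is an assumption.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class TimeRange(NamedTuple):
    """Hypothetical half-open time range: start inclusive, end exclusive."""
    start: datetime
    end: datetime


def answerable_range(query: TimeRange, available: TimeRange) -> Optional[TimeRange]:
    """Return the part of `query` that lies inside `available`, or None if none does."""
    start = max(query.start, available.start)
    end = min(query.end, available.end)
    return TimeRange(start, end) if start < end else None


# Assumed example: a storage table holding the last 5 hours of data.
now = datetime(2015, 2, 5, 12, 0)
window = TimeRange(now - timedelta(hours=5), now)

# Query overlaps the window only partially: the overlap is what can be served.
partial = answerable_range(
    TimeRange(now - timedelta(hours=8), now - timedelta(hours=2)), window
)

# Query lies entirely before the window: nothing can be served from this table.
missing = answerable_range(
    TimeRange(now - timedelta(hours=10), now - timedelta(hours=6)), window
)
```

With a check like this, an empty result would mean "no overlap with the 
window" rather than an accidental by-product of partition lookup.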
bq. Do we have to union such data with other tables which are partitioned, for 
time range query that is overlapping?
Having the union across storage tables at query time is a good-to-have feature. 
However, as a first step, we can try to bring in just the notion of a time 
range in storage tables and update the query execution logic to take that into 
consideration. If a query range overlaps multiple stores, we can create the 
notion of Realtime and Batch modes for querying, each with rules for the time 
ranges it is allowed to query. (Just thinking aloud here; there may be, or 
rather there surely are, better approaches to handle this if we do not go for 
unions in the 1st iteration of this feature.)
bq. Do we have to union storage tables with two different windows, for time 
range query that is overlapping?
Just a thought here: can we not have HiveStorageHandler implementations for 
other stores (most major stores already have this in place) and pass the 
responsibility for unions to Hive itself (let Hive be the only driver for 
cross-storage result sets)?

> Support to query streaming data sources
> ---------------------------------------
>
>                 Key: LENS-251
>                 URL: https://issues.apache.org/jira/browse/LENS-251
>             Project: Apache Lens
>          Issue Type: New Feature
>          Components: cube
>            Reporter: Sharad Agarwal
>
> For certain stores that allow streaming ingestion, to make the data 
> available immediately for querying, we need the notion of a streaming update 
> period or some such.
> Describing by example:
> - Let's say there are two drivers - D1 and D2.
> - D1 queries storage S1
> - D2 queries storage S2
> S1 and S2 are part of the same Cube.
> In S1, data is at hourly granularity and loaded every hour, while S2 has data 
> being ingested continuously in streaming fashion and maintains a window of 
> the last 5 hours of data.
> Now, based on the query time range, we want to select the right driver.
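
The D1/D2 example in the description can be sketched as follows. This is an 
illustrative Python sketch, not the Lens API: Storage, select_driver, the 
concrete dates, the priority order of the list, and the rule that a storage is 
a candidate only when it covers the whole queried range are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Storage:
    """Hypothetical storage descriptor with the time window it can answer."""
    name: str
    driver: str
    available_from: datetime
    available_to: datetime

    def covers(self, start: datetime, end: datetime) -> bool:
        # True only if the whole queried range lies inside the window.
        return self.available_from <= start and end <= self.available_to


def select_driver(storages: List[Storage], start: datetime, end: datetime) -> Optional[str]:
    """Return the driver of the first storage (in list order) covering the range."""
    for storage in storages:
        if storage.covers(start, end):
            return storage.driver
    return None  # no single storage covers the range; a union would be needed


now = datetime(2015, 2, 5, 12, 30)
# S1: hourly batch loads, so data is assumed complete up to the last full hour.
s1 = Storage("S1", "D1", datetime(2015, 1, 1), now.replace(minute=0))
# S2: streaming ingestion that keeps only the last 5 hours.
s2 = Storage("S2", "D2", now - timedelta(hours=5), now)
storages = [s1, s2]

recent = select_driver(storages, now - timedelta(hours=2), now)
historical = select_driver(storages, datetime(2015, 2, 1), datetime(2015, 2, 2))
spanning = select_driver(storages, datetime(2015, 2, 5, 6), now)
```

Under these assumptions the last two hours go to D2, a day from earlier in the 
month goes to D1, and a range starting at 06:00 and ending now is covered by 
neither storage alone, which is the overlapping case discussed in the comment.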



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
