[
https://issues.apache.org/jira/browse/LENS-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306788#comment-14306788
]
Aniruddha Gangopadhyay commented on LENS-251:
---------------------------------------------
bq. Is knowing the data availability for the window of the storage table
necessary? If yes, how will the user know that data is actually available for
the range he queried, and that it is not partial or stale? If not, should we
mandate that the user choose a time field to know for which time the data is
getting answered?
Knowing the time range for data availability is needed because, with the
current query execution logic, we might end up returning an empty data set even
when we have the capacity to answer the query. As for the user's knowledge of
data availability or staleness, it's not something the user should need to be
aware of. The server's query execution logic should return whatever data is
applicable for the queried time range and granularity in the result set.
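To make the point about availability windows concrete, here is a minimal sketch of how a time range attached to a storage table could be intersected with the queried range. Everything in it is an assumption for illustration: `StorageTable`, `answerable_range`, and the one-hour batch lag are hypothetical and are not existing Lens classes or settings; only the 5-hour streaming window comes from the issue description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class StorageTable:
    """Hypothetical model of a storage table with an availability window."""
    name: str
    available_from: Optional[datetime]  # None means no lower bound
    available_to: Optional[datetime]    # exclusive; None means no upper bound


def answerable_range(
    table: StorageTable, query_from: datetime, query_to: datetime
) -> Optional[Tuple[datetime, datetime]]:
    """Return the part of [query_from, query_to) the table can answer, or None."""
    start = query_from
    if table.available_from is not None:
        start = max(start, table.available_from)
    end = query_to
    if table.available_to is not None:
        end = min(end, table.available_to)
    return (start, end) if start < end else None


now = datetime(2015, 2, 5, 12, 0)
# S1: hourly batch table; the one-hour lag is an assumption for this sketch.
s1 = StorageTable("S1", None, now - timedelta(hours=1))
# S2: streaming table that keeps only the last 5 hours (from the issue text).
s2 = StorageTable("S2", now - timedelta(hours=5), now)

query_from, query_to = now - timedelta(hours=8), now
for table in (s1, s2):
    print(table.name, answerable_range(table, query_from, query_to))
```

With such a check in place, a table whose window does not intersect the queried range is skipped up front instead of being queried and returning an empty result.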
bq. Do we have to union such data with other tables which are partitioned, for
an overlapping time range query?
Having the union across storage tables at query time is a good-to-have feature.
However, as a first step, we can try to bring in just the notion of time range
in storage tables and update the query execution logic to take that into
consideration. If a query range overlaps multiple stores, we can create the
notion of Realtime and Batch modes for querying, each with rules for the time
ranges it allows (just thinking aloud here; there are surely better approaches
to handle this if we do not go for unions in the first iteration of this
feature).
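As a rough illustration of the Realtime and Batch idea, the sketch below validates a queried time range against per-mode rules. The names (`QueryMode`, `is_range_allowed`) and both limits are assumptions made up for this example, not anything that exists in Lens today; the 5-hour figure is borrowed from the example in the issue description.

```python
from datetime import datetime, timedelta
from enum import Enum


class QueryMode(Enum):
    """Hypothetical query modes, one per kind of store."""
    REALTIME = "realtime"
    BATCH = "batch"


# Assumed limits for this sketch only.
REALTIME_WINDOW = timedelta(hours=5)  # streaming store keeps the last 5 hours
BATCH_LAG = timedelta(hours=1)        # batch store is loaded once an hour


def is_range_allowed(
    mode: QueryMode, query_from: datetime, query_to: datetime, now: datetime
) -> bool:
    """Check whether [query_from, query_to) is allowed for the given mode."""
    if query_from >= query_to:
        return False  # empty or inverted range
    if mode is QueryMode.REALTIME:
        # Realtime queries must stay inside the streaming window.
        return query_from >= now - REALTIME_WINDOW and query_to <= now
    # Batch queries must end before the data that is not yet loaded.
    return query_to <= now - BATCH_LAG


now = datetime(2015, 2, 5, 12, 0)
print(is_range_allowed(QueryMode.REALTIME, now - timedelta(hours=2), now, now))
print(is_range_allowed(QueryMode.BATCH, now - timedelta(hours=8), now, now))
```

A range that spans both windows is rejected in either mode, which is the trade-off of skipping unions in a first iteration.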
bq. Do we have to union storage tables with two different windows, for an
overlapping time range query?
Just a thought here: can we not have HiveStorageHandler implementations for
other stores (most major stores already have this in place) and pass the
responsibility for unions to Hive itself (let Hive be the only driver for
cross-storage result sets)?
> Support to query streaming data sources
> ---------------------------------------
>
> Key: LENS-251
> URL: https://issues.apache.org/jira/browse/LENS-251
> Project: Apache Lens
> Issue Type: New Feature
> Components: cube
> Reporter: Sharad Agarwal
>
> For certain stores that allow streaming ingestion, to make the data
> available immediately for querying, we need the notion of streaming update
> period or some such.
> Describing by example:
> - let's say there are two drivers - D1 and D2.
> - D1 queries storage S1
> - D2 queries storage S2
> S1 and S2 are part of the same Cube.
> In S1, data is at hourly granularity and loaded every hour, while S2 has data
> being ingested continuously in streaming fashion and maintains a window of
> the last 5 hours of data.
> Now based on the query time range we want to select the right driver.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)