Hi Stephan,

Thanks for your reply.

Data never expires automatically.

If data retention is needed, users can choose one of the following
options:
- Filter out expired data in the SQL queries that read the managed table.
- Define a time partition and drop expired partitions manually
(DROP PARTITION ...).
- In a future version, we will support the "DELETE FROM" statement, so
users can delete expired data that matches a condition.
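
As a rough sketch of the first two options (not the final syntax from the
FLIP): this assumes window_end is a plain TIMESTAMP column and that
the_table is partitioned by a date column that I call `dt` here purely for
illustration. The exact DROP PARTITION syntax may differ in the final design.

```sql
-- Option 1: filter out expired rows when reading the managed table.
-- LOCALTIMESTAMP is used so both sides of the comparison are TIMESTAMP
-- values without a time zone (assumption about the type of window_end).
SELECT *
FROM the_table
WHERE window_end >= LOCALTIMESTAMP - INTERVAL '14' DAY;

-- Option 2: drop an expired partition manually.
-- `dt` is a hypothetical partition column, not something defined in the FLIP.
ALTER TABLE the_table DROP PARTITION (dt = '2021-11-01');
```

With option 1 the expired rows stay in storage and are only hidden from
that query. With option 2 they are physically removed, but only at
partition granularity.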

So to answer your question:

> Will the VMQ send retractions so that the data will be removed from the table 
> (via compactions)?

The current implementation does not send retractions. I think that, in
theory, they should be sent; for now, users can filter out the expired
data with conditions in their subsequent queries.
And yes, the subscriber would not see a strictly correct result. I
think this is something we can improve for Flink SQL.

> Do we want time retention semantics handled by the compaction?

Currently, no. Data never expires automatically.

> Do we want to declare those types of queries "out of scope" initially?

I think we want users to be able to use the three options above to
meet their requirements.

I will update the FLIP to make the definition clearer and more explicit.

Best,
Jingsong

On Wed, Nov 24, 2021 at 5:01 AM Stephan Ewen <ewenstep...@gmail.com> wrote:
>
> Thanks for digging into this.
> Regarding this query:
>
> INSERT INTO the_table
>   SELECT window_end, COUNT(*)
>     FROM (TUMBLE(TABLE interactions, DESCRIPTOR(ts), INTERVAL '5' MINUTES))
> GROUP BY window_end
>   HAVING now() - window_end <= INTERVAL '14' DAYS;
>
> I am not sure I understand what the conclusion is on the data retention 
> question, where the continuous streaming SQL query has retention semantics. I 
> think we would need to answer the following questions (I will call the query 
> that computed the managed table the "view materializer query" - VMQ).
>
> (1) I guess the VMQ will send no updates for windows beyond the "retention 
> period" is over (14 days), as you said. That makes sense.
>
> (2) Will the VMQ send retractions so that the data will be removed from the 
> table (via compactions)?
>   - if yes, this seems semantically better for users, but it will be 
> expensive to keep the timers for retractions.
>   - if not, we can still solve this by adding filters to queries against the 
> managed table, as long as these queries are in Flink.
>   - any subscriber to the changelog stream would not see strictly a correct 
> result if we are not doing the retractions
>
> (3) Do we want time retention semantics handled by the compaction?
>   - if we say that we lazily apply the deletes in the queries that read the 
> managed tables, then we could also age out the old data during compaction.
>   - that is cheap, but it might be too much of a special case to be very 
> relevant here.
>
> (4) Do we want to declare those types of queries "out of scope" initially?
>   - if yes, how many users are we affecting? (I guess probably not many, but 
> would be good to hear some thoughts from others on this)
>   - should we simply reject such queries in the optimizer as "not possible to 
> support in managed tables"? I would suggest that, always better to tell users 
> exactly what works and what not, rather than letting them be surprised in the 
> end. Users can still remove the HAVING clause if they want the query to run, 
> and that would be better than if the VMQ just silently ignores those 
> semantics.
>
> Thanks,
> Stephan
>


-- 
Best, Jingsong Lee