Hello Ron!

> I've updated the FLIP, and added the Outline Design section, to introduce how
> Materialized Table interacts with the Workflow Scheduler in Full Refresh
> mode via a timing diagram, it can help to understand this proposal design.

Thank you for the additions, the sequence diagram says a lot.
I have a question there: how can the gateway update the refreshHandler in the 
Catalog before getting it from the scheduler?

Just a nit, in the FLIP:
> WorkflowOperation implementation class is provided by the engine to the 
> WorkflowScheudler. Currently, its implementation class would include 
> CreatePeriodicWorkflowOperation, SuspendWorkflowOperation, 
> ResumeWorkflowOperation, and ModifyWorkflowCronOperation.

You have a typo here: WorkflowScheudler -> WorkflowScheduler :)

For the operations part, I still think that the FLIP would benefit from
specifying a concrete pattern for operations. You could propose either a command
pattern [1] or a visitor pattern (where the scheduler visits the operation to
get the relevant info) [2], whichever you prefer.
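
To make the command-pattern option concrete, here is a minimal sketch. The operation class names come from the FLIP text quoted above; everything else (the `execute` method, `SchedulerClient`, `RecordingClient`, the workflow name and cron string) is hypothetical and only for illustration, not what the FLIP defines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WorkflowCommandSketch {

    /** Hypothetical scheduler-side client; a real plugin would talk to Airflow, DolphinScheduler, etc. */
    interface SchedulerClient {
        void create(String workflowName, String cron);
        void suspend(String workflowName);
        void resume(String workflowName);
    }

    /** Command pattern: each operation knows how to apply itself to a scheduler client. */
    interface WorkflowOperation {
        void execute(SchedulerClient client);
    }

    static final class CreatePeriodicWorkflowOperation implements WorkflowOperation {
        private final String workflowName;
        private final String cron;

        CreatePeriodicWorkflowOperation(String workflowName, String cron) {
            this.workflowName = workflowName;
            this.cron = cron;
        }

        @Override
        public void execute(SchedulerClient client) {
            client.create(workflowName, cron);
        }
    }

    static final class SuspendWorkflowOperation implements WorkflowOperation {
        private final String workflowName;

        SuspendWorkflowOperation(String workflowName) {
            this.workflowName = workflowName;
        }

        @Override
        public void execute(SchedulerClient client) {
            client.suspend(workflowName);
        }
    }

    static final class ResumeWorkflowOperation implements WorkflowOperation {
        private final String workflowName;

        ResumeWorkflowOperation(String workflowName) {
            this.workflowName = workflowName;
        }

        @Override
        public void execute(SchedulerClient client) {
            client.resume(workflowName);
        }
    }

    /** In-memory client that only records the calls it receives (illustration only). */
    static final class RecordingClient implements SchedulerClient {
        final List<String> calls = new ArrayList<>();

        @Override
        public void create(String workflowName, String cron) {
            calls.add("create " + workflowName + " " + cron);
        }

        @Override
        public void suspend(String workflowName) {
            calls.add("suspend " + workflowName);
        }

        @Override
        public void resume(String workflowName) {
            calls.add("resume " + workflowName);
        }
    }

    /** Executes three operations in order and returns the recorded calls. */
    static List<String> runDemo() {
        RecordingClient client = new RecordingClient();
        List<WorkflowOperation> operations = Arrays.asList(
                new CreatePeriodicWorkflowOperation("mt_refresh", "0 0 * * *"),
                new SuspendWorkflowOperation("mt_refresh"),
                new ResumeWorkflowOperation("mt_refresh"));
        for (WorkflowOperation operation : operations) {
            // The caller never inspects the concrete operation type.
            operation.execute(client);
        }
        return client.calls;
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

With this shape the scheduler plugin does not need `instanceof` checks on the operation. The visitor variant would invert the call direction, but serve the same purpose.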

> This means that the Scheduler will periodically call the REST
> API, passing parameters such as scheduleTime and scheduleTimeFormat

About "isPeriodic" and the date type, I got your point, thank you for the 
context!
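
Just to check my understanding, here is a minimal sketch of how the two strings could be consumed on the engine side. The class and method names, the sample values, and the `yyyy-MM-dd` partition format are my assumptions, not something taken from the FLIP.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ScheduleTimeSketch {

    /** Parses the string sent by the scheduler using the pattern sent alongside it. */
    static LocalDateTime parseScheduleTime(String scheduleTime, String scheduleTimeFormat) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(scheduleTimeFormat);
        return LocalDateTime.parse(scheduleTime, formatter);
    }

    /** Formats a day-level partition value; the 'yyyy-MM-dd' layout is an assumption. */
    static String dayPartition(LocalDateTime time) {
        return time.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"));
    }

    public static void main(String[] args) {
        // Both values arrive as plain strings over REST, so any scheduler language can produce them.
        LocalDateTime parsed = parseScheduleTime("2024-04-25 00:00:00", "yyyy-MM-dd HH:mm:ss");
        System.out.println(dayPartition(parsed));
    }
}
```

If I read you correctly, this parsing would live in the materialized table manager, while the gateway only forwards the two strings.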

About the REST API, I will wait for your offline discussion :)


[1] https://en.wikipedia.org/wiki/Command_pattern
[2] https://en.wikipedia.org/wiki/Visitor_pattern

On Apr 25, 2024 at 13:22 +0200, Ron Liu <ron9....@gmail.com> wrote:
> Hi, Lorenzo and Feng
>
> Thanks for joining this discussion thread.
>
> Sorry for the late response. Regarding your questions:
>
> > About the Operations interfaces, how can they be empty?
> > Should not they provide at least a `run` or `execute` method (similar to
> > the command pattern)?
> > In this way, their implementation can wrap all the implementations details
> > of particular schedulers, and the scheduler can simply execute the command.
> > In general, I think a simple sequence diagram showcasing the interaction
> > between the interfaces would be awesome to better understand the concept.
>
> I've updated the FLIP, and added the Outline Design section, to introduce how
> Materialized Table interacts with the Workflow Scheduler in Full Refresh
> mode via a timing diagram, it can help to understand this proposal design.
>
> > What about the RefreshHandler, I cannot find a definition of its
> > interface here.
> > Is it out of scope for this FLIP?
>
> There is some missing context here: RefreshHandler was proposed
> in FLIP-435; you can find more details in [1].
>
> > If it is periodic, where is the period?
> > For the scheduleTime and format, why not simply pass an instance of
> > LocalDateTime or similar? The gateway should not have the responsibility to
> > parse the time.
>
> This might require a bit of context for clarity. In Full Refresh mode, the
> Materialized Table requires the Scheduler to periodically trigger refresh
> operations. This means that the Scheduler will periodically call the REST
> API, passing parameters such as scheduleTime and scheduleTimeFormat. The
> materialized table manager (to be introduced) relies on this information to
> accurately calculate the correct time partitions. At the same time, we also
> support manual refreshes of materialized tables, and in the future, we will
> support manual cascading refreshes on a multi-table granularity. For cases
> of manual cascading refresh, we will also register a one-time refresh
> workflow with the Scheduler, which then triggers the execution via the REST
> API call. However, during a manual refresh, users typically specify
> partition information, so there's no need for the engine to deduce it
> and the scheduleTime is not needed.
>
> Taking the above into account, there are two types of refresh workflows:
> periodic workflows and one-time workflows. The engine requires different
> information for each type of workflow. When designing the REST API, we aim
> for this API to support both types of workflows simultaneously. Hence we
> introduce the isPeriodic parameter for differentiation. Then the engine
> will know what to do accordingly.
>
> The scheduleTime and scheduleTimeFormat are passed from Scheduler to the
> Gateway via the REST API. Firstly, in the HTTP protocol, there is no type
> equivalent to Java's LocalDateTime. Secondly, Schedulers can potentially be
> written in different programming languages; for example, Airflow uses
> Python to develop its workflows. Hence, it's obvious that we cannot limit
> the Scheduler to the use of Java LocalDateTime type. Therefore, a String
> type is the most suitable. Lastly, the purpose of the scheduleTime is to
> determine the time partitioning details of the partitioned table. This
> parsing responsibility falls upon the materialized table manager and not
> the SqlGateway, which only passes the parameters through.
>
> You may refer to the Outline Design section of this FLIP, specifically the
> Partitioned Table Full Refresh part in FLIP-435, to further comprehend the
> overall design principles.
>
> > For the REST API:
> > wouldn't it be better (more REST) to move the `mt_identifier` to the URL?
> > E.g.: v3/materialized_tables/<mt_identifier>/refresh
>
> I think this is a good idea. I have another consideration, though: should
> this API support passing multiple materialized tables at the same time? If
> it does, the identifiers will have to go in the request body. I will discuss
> the design of this API offline with ShengKai Fang, who is the owner of the
> Gateway module. Either way, your proposal is a good option.
>
>
> > From my current understanding, the workflow handle should not be bound
> > to the Dynamic Table. Therefore, if the workflow is modified, does it mean
> > that the scheduling information corresponding to the Dynamic Table will be
> > lost?
>
> You can see the FLIP Outline Design section to understand the overall
> design further. The refresh handler is just a pointer that can locate the
> workflow info in the scheduler, so the scheduling info is persisted in
> the Scheduler and will not be lost.
>
> > Regarding the status information of the workflow, I am wondering if it
> > is necessary to provide an interface to display the backend scheduling
> > information? This would make it more convenient to view the execution
> > status of backend jobs.
>
> The RefreshHandler#asSummaryString will return the summary information of
> the background refresh job; you can get it via DESC TABLE xxx. If you
> want detailed information about background jobs, you should go to the
> Scheduler, which provides the most detailed information. Even if such an
> interface were provided, it would not return the complete information, and
> it is unclear how it would present the background job details. So I don't
> think it is necessary.
>
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-435%3A+Introduce+a+New+Materialized+Table+for+Simplifying+Data+Pipelines
>
> Best,
> Ron
>
>
>
> On Thu, Apr 25, 2024 at 00:46, Feng Jin <jinfeng1...@gmail.com> wrote:
>
> > Hi Ron
> >
> > Thank you for initiating this FLIP.
> >
> > My current questions are as follows:
> >
> > 1. From my current understanding, the workflow handle should not be bound
> > to the Dynamic Table. Therefore, if the workflow is modified, does it mean
> > that the scheduling information corresponding to the Dynamic Table will be
> > lost?
> >
> > 2. Regarding the status information of the workflow, I am wondering if it
> > is necessary to provide an interface to display the backend scheduling
> > information? This would make it more convenient to view the execution
> > status of backend jobs.
> >
> >
> > Best,
> > Feng
> >
> >
> > On Wed, Apr 24, 2024 at 3:24 PM <lorenzo.affe...@ververica.com.invalid>
> > wrote:
> >
> > > > Hello Ron Liu! Thank you for your FLIP!
> > > >
> > > > Here are my considerations:
> > > >
> > > > 1.
> > > > About the Operations interfaces, how can they be empty?
> > > > Should not they provide at least a `run` or `execute` method (similar to
> > > > the command pattern)?
> > > > In this way, their implementation can wrap all the implementations details
> > > > of particular schedulers, and the scheduler can simply execute the command.
> > > > In general, I think a simple sequence diagram showcasing the interaction
> > > > between the interfaces would be awesome to better understand the concept.
> > > >
> > > > 2.
> > > > What about the RefreshHandler, I cannot find a definition of its interface
> > > > here.
> > > > Is it out of scope for this FLIP?
> > > >
> > > > 3.
> > > > For the SqlGatewayService arguments:
> > > >
> > > > boolean isPeriodic,
> > > > @Nullable String scheduleTime,
> > > > @Nullable String scheduleTimeFormat,
> > > >
> > > > If it is periodic, where is the period?
> > > > For the scheduleTime and format, why not simply pass an instance of
> > > > LocalDateTime or similar? The gateway should not have the responsibility to
> > > > parse the time.
> > > >
> > > > 4.
> > > > For the REST API:
> > > > wouldn't it be better (more REST) to move the `mt_identifier` to the URL?
> > > > E.g.: v3/materialized_tables/<mt_identifier>/refresh
> > > >
> > > > Thank you!
> > > > On Apr 22, 2024 at 08:42 +0200, Ron Liu <ron9....@gmail.com>, wrote:
> > > > > > Hi, Dev
> > > > > >
> > > > > > I would like to start a discussion about FLIP-448: Introduce Pluggable
> > > > > > Workflow Scheduler Interface for Materialized Table.
> > > > > >
> > > > > > In FLIP-435[1], we proposed Materialized Table, which has two types of data
> > > > > > refresh modes: Full Refresh & Continuous Refresh Mode. In Full Refresh
> > > > > > mode, the Materialized Table relies on a workflow scheduler to perform
> > > > > > periodic refresh operation to achieve the desired data freshness.
> > > > > >
> > > > > > There are numerous open-source workflow schedulers available, with popular
> > > > > > ones including Airflow and DolphinScheduler. To enable Materialized Table
> > > > > > to work with different workflow schedulers, we propose a pluggable workflow
> > > > > > scheduler interface for Materialized Table in this FLIP.
> > > > > >
> > > > > > For more details, see FLIP-448 [2]. Looking forward to your feedback.
> > > > > >
> > > > > > [1] https://lists.apache.org/thread/c1gnn3bvbfs8v1trlf975t327s4rsffs
> > > > > > [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-448%3A+Introduce+Pluggable+Workflow+Scheduler+Interface+for+Materialized+Table
> > > > > >
> > > > > > Best,
> > > > > > Ron
> > > >
> >
