Re: Re: Re: [DISCUSS] FLIP-305: Support atomic for CREATE TABLE AS SELECT(CTAS) statement

Jing Ge Fri, 28 Apr 2023 06:51:28 -0700

Hi Mang,

Boundedness and execution modes are two orthogonal concepts. Since atomic
CTAS will be only supported for bounded data, which means it does not
depend on the execution modes. I was wondering if it is possible to only
provide (or call) twoPhaseCreateTable for bounded data (in both streaming
and batch mode) and let unbounded data use the non-atomic CTAS? In this
way, we could avoid the selector argument code smell.


Best regards,
Jing

On Tue, Apr 25, 2023 at 10:04 AM Mang Zhang <[email protected]> wrote:

> Hi Jing,
> Yes, the atomic CTAS will be only supported for bounded data, but the
> execution modes can be stream or batch.
> I introduced the isStreamingMode parameter in the twoPhaseCreateTable API
> to make it easier for users to provide different levels of atomicity
> implementation depending on the capabilities of the backend service.
> For example, in the case of data synchronization, it is common to run the
> job using Stream mode, but also expect the data to be visible to the user
> only after the synchronization is complete.
> flink cdc's synchronized data scenario, where the user must first write to
> a temporary table and then manually rename it to the final table;
> unfriendly to user experience.
> Developers providing twoPhaseCreateTable capability in Catalog can decide
> whether to support atomicity based on the execution mode, or they can
> choose to provide lightweight atomicity support in Stream mode, such as
> automatically renaming the table name for the user.
>
>
>
> --
>
> Best regards,
>
> Mang Zhang
>
>
>
> At 2023-04-24 15:41:31, "Jing Ge" <[email protected]> wrote:
> >Hi Mang,
> >
> >
> >
> >Thanks for clarifying it. I am trying to understand your thoughts. Do you
> >actually mean the boundedness[1] instead of the execution modes[2]? I.e.
> >the atomic CTAS will be only supported for bounded data.
> >
> >
> >
> >Best regards,
> >
> >Jing
> >
> >
> >
> >[1] https://flink.apache.org/what-is-flink/flink-architecture/
> >
> >[2]
> >https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/execution_mode/#execution-mode-batchstreaming
> >
> >On Wed, Apr 19, 2023 at 9:14 AM Mang Zhang <[email protected]> wrote:
> >
> >> hi, Jing
> >>
> >> Thank you for your reply.
> >>
> >> >1. It looks like you found another way to design the atomic CTAS with new
> >> >serializable TwoPhaseCatalogTable instead of making Catalog serializable 
> >> >as
> >> >described in FLIP-218. Did I understand correctly?
> >> Yes, when I was implementing the FLIP-218 solution, I encountered problems 
> >> with Catalog/CatalogTable serialization deserialization, for example, 
> >> after deserialization CatalogTable could not be converted to Hive Table. 
> >> Also, Catalog serialization is still a heavy operation, but it may not 
> >> actually be necessary, we just need Create Table.
> >> Therefore, the TwoPhaseCatalogTable program is proposed, which also 
> >> facilitates the implementation of the subsequent data lake, ReplaceTable 
> >> and other functions.
> >>
> >> >2. I am a little bit confused about the isStreamingMode parameter of
> >> >Catalog#twoPhaseCreateTable(...), since it is the selector argument(code
> >> >smell) we should commonly avoid in the public interface. According to the
> >> >FLIP,  isStreamingMode will be used by the Catalog to determine whether to
> >> >support atomic or not. With this selector argument, there will be two
> >> >different logics built within one method and it is hard to follow without
> >> >reading the code or the doc carefully(another concern is to keep the doc
> >> >and code alway be consistent) i.e. sometimes there will be no difference 
> >> >by
> >> >using true/false isStreamingMode, sometimes they are quite different -
> >> >atomic vs. non-atomic. Another question is, before we call
> >> >Catalog#twoPhaseCreateTable(...), we have to know the value of
> >> >isStreamingMode. In case only non-atomic is supported for streaming mode,
> >> >we could just follow FLIP-218 instead of (twistedly) calling
> >> >Catalog#twoPhaseCreateTable(...) with a false isStreamingMode. Did I miss
> >> >anything here?
> >>
> >> Here's what I think about this issue, atomic CTAS wants to be the default
> >> behavior and only fall back to non-atomic CTAS if it's completely
> >> unattainable. Atomic CTAS will bring a better experience to users.
> >> Flink is already a stream batch unified engine, In our company kwai, many
> >> users are also using flink to do batch data processing, but still running
> >> in Stream mode.
> >> The boundary between stream and batch is gradually blurred, stream mode
> >> jobs may also FINISH, so I added the isStreamingMode parameter, this
> >> provides different atomicity implementations in Batch and Stream modes.
> >> Not only to determine if atomicity is supported, but also to help select
> >> different TwoPhaseCatalogTable implementations to provide different levels
> >> of atomicity!
> >>
> >> Looking forward to more feedback.
> >>
> >>
> >>
> >> --
> >>
> >> Best regards,
> >>
> >> Mang Zhang
> >>
> >>
> >>
> >> At 2023-04-15 04:20:40, "Jing Ge" <[email protected]> wrote:
> >> >Hi Mang,
> >> >
> >> >This is the FLIP I was looking forward to after FLIP-218. Thanks for
> >> >driving it. I have two questions and would like to know your thoughts,
> >> >thanks:
> >> >
> >> >1. It looks like you found another way to design the atomic CTAS with new
> >> >serializable TwoPhaseCatalogTable instead of making Catalog serializable 
> >> >as
> >> >described in FLIP-218. Did I understand correctly?
> >> >2. I am a little bit confused about the isStreamingMode parameter of
> >> >Catalog#twoPhaseCreateTable(...), since it is the selector argument(code
> >> >smell) we should commonly avoid in the public interface. According to the
> >> >FLIP,  isStreamingMode will be used by the Catalog to determine whether to
> >> >support atomic or not. With this selector argument, there will be two
> >> >different logics built within one method and it is hard to follow without
> >> >reading the code or the doc carefully(another concern is to keep the doc
> >> >and code alway be consistent) i.e. sometimes there will be no difference 
> >> >by
> >> >using true/false isStreamingMode, sometimes they are quite different -
> >> >atomic vs. non-atomic. Another question is, before we call
> >> >Catalog#twoPhaseCreateTable(...), we have to know the value of
> >> >isStreamingMode. In case only non-atomic is supported for streaming mode,
> >> >we could just follow FLIP-218 instead of (twistedly) calling
> >> >Catalog#twoPhaseCreateTable(...) with a false isStreamingMode. Did I miss
> >> >anything here?
> >> >
> >> >Best regards,
> >> >Jing
> >> >
> >> >On Fri, Apr 14, 2023 at 1:55 PM yuxia <[email protected]> wrote:
> >> >
> >> >> Hi, Mang.
> >> >> +1 for completing the support for atomicity of CTAS, this is very useful
> >> >> in batch scenarios and integrate with the data lake which support
> >> >> transcation.
> >> >>
> >> >> I just have one question, IIUC, the DynamiacTableSink will need to know
> >> >> it's for normal case or the atomicity with CTAS as well as neccessary
> >> >> context.
> >> >> Take jdbc catalog as an example, if it's CTAS with atomicity supports, 
> >> >> the
> >> >> jdbc DynamiacTableSink will write the temp table defined in the
> >> >> TwoPhaseCatalogTable which is different from normal case.
> >> >>
> >> >> How can the DynamiacTableSink can get it? Could you give some 
> >> >> explanation
> >> >> or example in this FLIP?
> >> >>
> >> >>
> >> >> Best regards,
> >> >> Yuxia
> >> >>
> >> >> ----- 原始邮件 -----
> >> >> 发件人: "zhangmang1" <[email protected]>
> >> >> 收件人: "dev" <[email protected]>, "ron9 liu" <[email protected]>,
> >> >> "lincoln 86xy" <[email protected]>
> >> >> 发送时间: 星期五, 2023年 4 月 14日 下午 2:50:40
> >> >> 主题: Re:Re: [DISCUSS] FLIP-305: Support atomic for CREATE TABLE AS
> >> >> SELECT(CTAS) statement
> >> >>
> >> >> Hi, Lincoln and Ron
> >> >>
> >> >>
> >> >> Thank you for your reply.
> >> >> On the naming wise I think OK, the future expansion of new features more
> >> >> uniform. I have updated the FLIP.
> >> >>
> >> >>
> >> >> About Hive support atomicity CTAS, Hive is rich in usage scenarios and 
> >> >> can
> >> >> be divided into three scenarios: 1. writing Hive tables 2. writing Hive
> >> >> tables with speculative execution 3. writing Hive table with small file
> >> >> merge
> >> >>
> >> >>
> >> >> The main purpose of FLIP-305 is to implement support for CTAS atomicity 
> >> >> in
> >> >> the Flink framework,
> >> >> so I only poc to verify the first scenario of writing to the Hive table,
> >> >> and we can subsequently split the sub-task to support the other two
> >> >> scenarios.
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >>
> >> >> Best regards,
> >> >> Mang Zhang
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> At 2023-04-13 12:27:24, "Lincoln Lee" <[email protected]> wrote:
> >> >> >Hi, Mang
> >> >> >
> >> >> >+1 for completing the support for atomicity of CTAS, this is very 
> >> >> >useful
> >> >> in
> >> >> >batch scenarios.
> >> >> >
> >> >> >I have two questions:
> >> >> >1. naming wise:
> >> >> >  a) can we rename the `Catalog#getTwoPhaseCommitCreateTable` to
> >> >> >`Catalog#twoPhaseCreateTable` (and we may add
> >> >> >twoPhaseReplaceTable/twoPhaseCreateOrReplaceTable later)
> >> >> >  b) for the `TwoPhaseCommitCatalogTable`, may it be better using
> >> >> >`TwoPhaseCatalogTable`?
> >> >> >  c) `TwoPhaseCommitCatalogTable#beginTransaction`, the word 
> >> >> > 'transaction'
> >> >> >in the method name, which may remind users of the relevance of 
> >> >> >transaction
> >> >> >support (however, it is not strictly so), so I suggest changing it to
> >> >> >`begin`
> >> >> >2. Has this design been validated by any relevant Poc on hive or other
> >> >> >catalogs?
> >> >> >
> >> >> >Best,
> >> >> >Lincoln Lee
> >> >> >
> >> >> >
> >> >> >liu ron <[email protected]> 于2023年4月13日周四 10:17写道：
> >> >> >
> >> >> >> Hi, Mang
> >> >> >> Atomicity is very important for CTAS, especially for batch jobs. This
> >> >> FLIP
> >> >> >> is a continuation of FLIP-218, which is valuable for CTAS.
> >> >> >> I just have one question, in the Motivation part of FLIP-218, we
> >> >> mentioned
> >> >> >> three levels of atomicity semantics, can this current design do the
> >> >> same as
> >> >> >> Spark's DataSource V2, which can guarantee both atomicity and 
> >> >> >> isolation,
> >> >> >> for example, can it be done by writing to Hive tables using CTAS?
> >> >> >>
> >> >> >> Best,
> >> >> >> Ron
> >> >> >>
> >> >> >> Mang Zhang <[email protected]> 于2023年4月10日周一 11:03写道：
> >> >> >>
> >> >> >> > Hi, everyone
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > I'd like to start a discussion about FLIP-305: Support atomic for
> >> >> CREATE
> >> >> >> > TABLE AS SELECT(CTAS) statement [1].
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > CREATE TABLE AS SELECT(CTAS) statement has been support, but it's 
> >> >> >> > not
> >> >> >> > atomic. It will create the table first before job running. If the 
> >> >> >> > job
> >> >> >> > execution fails, or is cancelled, the table will not be dropped.
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > So I want Flink to support atomic CTAS, where only the table is
> >> >> created
> >> >> >> > when the Job succeeds. Improve user experience.
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > Looking forward to your feedback.
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > [1]
> >> >> >> >
> >> >> >>
> >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-305%3A+Support+atomic+for+CREATE+TABLE+AS+SELECT%28CTAS%29+statement
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > --
> >> >> >> >
> >> >> >> > Best regards,
> >> >> >> > Mang Zhang
> >> >> >>
> >> >>
> >>
> >>
>
>

Re: Re: Re: [DISCUSS] FLIP-305: Support atomic for CREATE TABLE AS SELECT(CTAS) statement

Reply via email to