Re: [DISCUSS] PIP-5: Paimon Table And Data Lineage For Flink

Shammon FY Wed, 21 Jun 2023 00:42:30 -0700

Hi dev,

Thanks for all the feedback. If there are no more comments, I will start a
vote on this PIP later. Thanks.


Best,
Shammon FY


On Wed, Jun 21, 2023 at 12:08 PM Jingsong Li <[email protected]> wrote:

> Thanks for the update.
>
> Looks good to me!
>
> Best,
> Jingsong
>
> On Wed, Jun 21, 2023 at 9:59 AM Shammon FY <[email protected]> wrote:
> >
> > Thanks Jingsong.
> >
> > As we discussed offline, the `metadata.store` will store the table
> > lineage and data lineage information, which is orthogonal to `metastore`.
> > We can introduce an option `lineage-meta` as follows.
> >
> > CREATE CATALOG paimon_catalog1 WITH (
> >     ... -- other options
> >     'metastore' = 'hive',
> >     'url' = 'XXXXX',
> >     'lineage-meta' = 'jdbc',
> >     'jdbc.driver' = 'com.mysql.jdbc.Driver',
> >     -- The default Lineage Meta Database name is `paimon`
> >     'jdbc.database' = 'paimon_cata1',
> >     'jdbc.username' = 'XXX',
> >     'jdbc.password' = 'XXX'
> > );
> >
> > Then we can support `lineage-meta` for both the `filesystem` and `hive`
> > metastores. I have updated the PIP with the new options and interfaces.
> >
> >
> > Best,
> > Shammon FY
> >
> >
> > On Tue, Jun 20, 2023 at 8:13 PM Jingsong Li <[email protected]> wrote:
> >>
> >> Thanks Shammon,
> >>
> >> Is the metadata.store just the metastore we have now?
> >>
> >> I mean, can we manage this meta information through the current Catalog
> >> interface (which in fact uses `metastore` as its key)?
> >>
> >> For example,
> >>
> >> CREATE CATALOG paimon_catalog1 WITH (
> >>     ... -- other options
> >>     'metastore' = 'jdbc',
> >>     'url' = 'XXXXX',
> >>     'jdbc.driver' = 'com.mysql.jdbc.Driver',
> >>     -- The default Metadata Database name is `paimon`
> >>     'jdbc.database' = 'paimon_cata1',
> >>     'jdbc.username' = 'XXX',
> >>     'jdbc.password' = 'XXX'
> >> );
> >>
> >> JDBC manages not only the table information (which is what Catalog
> >> used to do), but also the data lineage information.
> >>
> >> What do you think?
> >>
> >> Or do you still want to separate their responsibilities?
> >>
> >> Best,
> >> Jingsong
> >>
> >> On Thu, Jun 15, 2023 at 1:46 PM Shammon FY <[email protected]> wrote:
> >> >
> >> > Hi Jingsong,
> >> >
> >> > I have updated this PIP and added the implementation of the System
> >> > Database. The main changes are as follows:
> >> >
> >> > 1. Introduce MetadataStore and MetadataStoreFactory to store table
> >> >    and data lineage information.
> >> > 2. Use JDBC as the default metadata store.
> >> > 3. Users can query the table and data lineage tables, and delete lineages
> >> >    with actions.
> >> >
> >> > Looking forward to your feedback, thanks
> >> >
> >> > Best,
> >> > Shammon FY
> >> >
> >> >
> >> > On Wed, Jun 14, 2023 at 11:17 AM Shammon FY <[email protected]> wrote:
> >> >>
> >> >> Hi Jingsong,
> >> >>
> >> >> That's a good point about the detailed implementation of the System
> >> >> Database. I'll update the PIP soon.
> >> >>
> >> >> Best,
> >> >> Shammon FY
> >> >>
> >> >> On Wed, Jun 14, 2023 at 8:48 AM Shammon FY <[email protected]> wrote:
> >> >>>
> >> >>> Hi Jingsong,
> >> >>>
> >> >>> Thanks for your comments.
> >> >>>
> >> >>> > We should document what is based on FLIP-314.
> >> >>>
> >> >>> I have updated the operations supported by FLIP-314 in the future
> >> >>> work section.
> >> >>>
> >> >>> > Is the current Source interface sufficient for your functionality?
> >> >>>
> >> >>> In our design, the current Source interface fulfills our requirements.
> >> >>> As described in PIP-5, `AlignedEnumerator` will send checkpoint events to
> >> >>> `AlignedSourceReader`, which will align the checkpoint and snapshot and
> >> >>> then send splits to the next operator. More detailed information can be
> >> >>> provided by @liming.
> >> >>>
> >> >>> > Can we currently achieve the ability to flush all data in a snapshot before snapshot?
> >> >>>
> >> >>> Can you provide a more detailed description of this? Do you mean that
> >> >>> if the source aligns the checkpoint and snapshot, there may be too much
> >> >>> data in a snapshot, making it too large to flush?
> >> >>>
> >> >>>
> >> >>> Best,
> >> >>> Shammon FY
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Mon, Jun 12, 2023 at 4:30 PM Jingsong Li <[email protected]> wrote:
> >> >>>>
> >> >>>> System Database looks very good~ But perhaps some design details
> >> >>>> are still missing here? What API should we use? The Paimon Java API?
> >> >>>> And should we commit every operation?
> >> >>>>
> >> >>>> Best,
> >> >>>> Jingsong
> >> >>>>
> >> >>>> On Mon, Jun 12, 2023 at 4:27 PM Jingsong Li <[email protected]> wrote:
> >> >>>> >
> >> >>>> > Thanks Shammon,
> >> >>>> >
> >> >>>> > The overall design looks good to me!
> >> >>>> >
> >> >>>> > ## Plan For The Future
> >> >>>> >
> >> >>>> > We should document what is based on FLIP-314.
> >> >>>> >
> >> >>>> > ## AlignedEnumerator and AlignedSourceReader
> >> >>>> >
> >> >>>> > Is the current Source interface sufficient for your functionality?
> >> >>>> >
> >> >>>> > Can we currently achieve the ability to flush all data in a snapshot
> >> >>>> > before snapshot?
> >> >>>> >
> >> >>>> > Best,
> >> >>>> > Jingsong
> >> >>>> >
> >> >>>> > On Mon, Jun 5, 2023 at 7:57 PM Shammon FY <[email protected]> wrote:
> >> >>>> > >
> >> >>>> > > Hi Kelu,
> >> >>>> > >
> >> >>>> > > Thanks for your feedback. In the first stage, we do not want to introduce a
> >> >>>> > > server, but instead store information directly in the Paimon table when
> >> >>>> > > creating and running Flink jobs. A server will be considered when we
> >> >>>> > > encounter more requirements in the future and need a resident service
> >> >>>> > > for management.
> >> >>>> > >
> >> >>>> > > Best,
> >> >>>> > > Shammon FY
> >> >>>> > >
> >> >>>> > > On Fri, Jun 2, 2023 at 5:55 PM Kelu Tao <[email protected]> wrote:
> >> >>>> > >
> >> >>>> > > > +1
> >> >>>> > > >
> >> >>>> > > > cool job ~
> >> >>>> > > >
> >> >>>> > > > For this PIP, do we need to introduce a new server to serve the
> >> >>>> > > > information?
> >> >>>> > > >
> >> >>>> > > > On 2023/05/31 02:28:21 Shammon FY wrote:
> >> >>>> > > > > Hi devs,
> >> >>>> > > > >
> >> >>>> > > > > We would like to start a discussion about PIP-5: Paimon Table And Data
> >> >>>> > > > > Lineage For Flink[1].
> >> >>>> > > > >
> >> >>>> > > > > Paimon is a streaming lake, and users can integrate it with Flink to
> >> >>>> > > > > complete the entire ETL processing. In this process, users need to
> >> >>>> > > > > manage batch & streaming jobs and data streams, including batch &
> >> >>>> > > > > streaming data validation, job debugging, and data revision. To support
> >> >>>> > > > > these abilities, we introduce table and data lineage for Flink & Paimon.
> >> >>>> > > > > Users can conveniently manage the entire ETL processing based on lineage
> >> >>>> > > > > information.
> >> >>>> > > > >
> >> >>>> > > > > Looking forward to hearing from you, thanks.
> >> >>>> > > > >
> >> >>>> > > > >
> >> >>>> > > > > [1] https://cwiki.apache.org/confluence/display/PAIMON/PIP-5%3A+Paimon+Table+And+Data+Lineage+For+Flink
> >> >>>> > > > >
> >> >>>> > > > >
> >> >>>> > > > > Best,
> >> >>>> > > > > Shammon FY
> >> >>>> > > > >
> >> >>>> > > >
>
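
For reference, the thread proposes that `lineage-meta` should work with the `filesystem` metastore as well as `hive`. Below is a minimal sketch of what such a catalog might look like, reusing the option names from the `hive` example in the thread. The catalog name `paimon_catalog2` and the `warehouse` path are hypothetical placeholders, and the final option names are whatever the PIP defines, not this sketch.

```sql
-- Sketch only: a filesystem-metastore catalog that keeps lineage in JDBC.
-- 'lineage-meta' and the 'jdbc.*' keys follow the proposal in this thread;
-- the catalog name and warehouse path are illustrative assumptions.
CREATE CATALOG paimon_catalog2 WITH (
    'type' = 'paimon',
    'metastore' = 'filesystem',
    'warehouse' = 'file:///tmp/paimon',
    'lineage-meta' = 'jdbc',
    'jdbc.driver' = 'com.mysql.jdbc.Driver',
    -- If omitted, the lineage meta database name would default to `paimon`
    'jdbc.database' = 'paimon_cata1',
    'jdbc.username' = 'XXX',
    'jdbc.password' = 'XXX'
);
```

Here table metadata stays on the filesystem while lineage records go to the JDBC database, which is the separation of responsibilities the thread settles on.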
