Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

2022-09-26 Thread Martijn Visser
Hi Yun Tang,

Sorry for the late reply. I haven't seen any tickets related to this topic.
I still think this is an important feature for Flink to support, and I would
love some volunteers on this topic.

Best regards,

Martijn

On Tue, Sep 13, 2022 at 7:47 AM Yun Tang wrote:

> An interesting topic. I noticed that the DataHub community has opened a
> feature-request discussion on Flink integration [1].
>
> @Martijn Visser Has the Flink community created tickets to track this
> topic?
> From my current understanding, Flink's FlinkJobListener lacks the rich
> information needed to send data lineage to external systems, just as Feng
> mentioned, whereas Spark already supports this well.
>
>
>
> [1] https://feature-requests.datahubproject.io/p/flink-integration
>
>
> Best
> Yun Tang
> --
> *From:* wangqinghuan <1095193...@qq.com>
> *Sent:* Monday, January 17, 2022 18:27
> *To:* user@flink.apache.org 
> *Subject:* Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage
> integration
>
>
> We are using Datahub to address table-level lineage and column-level
> lineage for Flink SQL.
> On 2022/1/13 23:27, Martijn Visser wrote:
>
> Hi everyone,
>
> I'm currently checking out different metadata platforms, such as Amundsen
> [1] and Datahub [2]. In short, these types of tools try to address problems
> related to topics such as data discovery, data lineage and an overall data
> catalogue.
>
> I'm reaching out to the Dev and User mailing lists to get some feedback.
> It would really help if you could spend a couple of minutes to let me know
> if you already use either one of the two mentioned metadata platforms or
> another one, or whether you are evaluating such tools. If so, is that for
> use as a catalogue, for lineage or anything else? Any type of
> feedback on these types of tools is appreciated.
>
> Best regards,
>
> Martijn
>
> [1] https://github.com/amundsen-io/amundsen/
> [2] https://github.com/linkedin/datahub
>
>


Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

2022-09-12 Thread Yun Tang
An interesting topic. I noticed that the DataHub community has opened a
feature-request discussion on Flink integration [1].

@Martijn Visser Has the Flink community created tickets to track this topic?
From my current understanding, Flink's FlinkJobListener lacks the rich
information needed to send data lineage to external systems, just as Feng
mentioned, whereas Spark already supports this well.



[1] https://feature-requests.datahubproject.io/p/flink-integration


Best
Yun Tang

From: wangqinghuan <1095193...@qq.com>
Sent: Monday, January 17, 2022 18:27
To: user@flink.apache.org 
Subject: Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration


We are using Datahub to address table-level lineage and column-level lineage
for Flink SQL.

On 2022/1/13 23:27, Martijn Visser wrote:
Hi everyone,

I'm currently checking out different metadata platforms, such as Amundsen [1] 
and Datahub [2]. In short, these types of tools try to address problems related 
to topics such as data discovery, data lineage and an overall data catalogue.

I'm reaching out to the Dev and User mailing lists to get some feedback. It 
would really help if you could spend a couple of minutes to let me know if you 
already use either one of the two mentioned metadata platforms or another one, 
or whether you are evaluating such tools. If so, is that for use as a
catalogue, for lineage or anything else? Any type of feedback on these types of
tools is appreciated.

Best regards,

Martijn

[1] https://github.com/amundsen-io/amundsen/
[2] https://github.com/linkedin/datahub



Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

2022-01-17 Thread wangqinghuan
We are using Datahub to address table-level lineage and column-level
lineage for Flink SQL.


On 2022/1/13 23:27, Martijn Visser wrote:

Hi everyone,

I'm currently checking out different metadata platforms, such as 
Amundsen [1] and Datahub [2]. In short, these types of tools try to 
address problems related to topics such as data discovery, data 
lineage and an overall data catalogue.


I'm reaching out to the Dev and User mailing lists to get some 
feedback. It would really help if you could spend a couple of minutes 
to let me know if you already use either one of the two mentioned 
metadata platforms or another one, or whether you are evaluating such tools.
If so, is that for use as a catalogue, for lineage or anything
else? Any type of feedback on these types of tools is appreciated.


Best regards,

Martijn

[1] https://github.com/amundsen-io/amundsen/
[2] https://github.com/linkedin/datahub


Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

2022-01-13 Thread JIN FENG
Hi,

I am a software engineer at Xiaomi.

Last year we used metacat (https://github.com/Netflix/metacat) to manage all
metadata, including Hive, Kudu, Doris, Iceberg, Elasticsearch, Talos
(Xiaomi's self-developed message queue), MySQL and TiDB.

Metacat is highly compatible with the hive-metastore protocol. Therefore, we
can use FlinkHiveCatalog directly to connect to metacat and create different
tables, including Hive tables and other generic types of tables.

All systems are abstracted into a catalog.database.table structure, so in
Flink SQL we can access any registered table through catalog.database.table.
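
To illustrate the catalog.database.table addressing described above, here is a
minimal Python sketch. It is not metacat or Flink code: the TableId type, the
REGISTRY contents and both helper functions are hypothetical, and it ignores
details such as quoted identifiers that contain dots.

```python
from typing import NamedTuple


class TableId(NamedTuple):
    catalog: str
    database: str
    table: str


def parse_table_id(path: str) -> TableId:
    """Split a 'catalog.database.table' path into its three parts."""
    parts = path.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.database.table, got {path!r}")
    return TableId(*parts)


# Hypothetical registry: catalog name -> database name -> set of table names.
REGISTRY = {
    "hive": {"dw": {"orders"}},
    "kudu": {"rt": {"clicks"}},
}


def is_registered(path: str) -> bool:
    """Return True if the fully qualified table exists in the registry."""
    tid = parse_table_id(path)
    return tid.table in REGISTRY.get(tid.catalog, {}).get(tid.database, set())
```

The point of the sketch is only that every system, whatever its storage engine,
is reachable through the same three-part name.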

In addition, metacat uniformly manages all table creation, deletion, and
partitioning operations. By analyzing the audit log of metacat, we can
easily obtain the DDL lineage of different tables.
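
As a rough sketch of the audit-log analysis mentioned above: the log format
below (one JSON object per line with "ts", "op" and "table" fields) is an
assumption made for illustration, since the real metacat audit format is not
shown in this thread. The sketch reads "DDL lineage" as the ordered history of
DDL operations per table, which is also an assumption.

```python
import json
from collections import defaultdict

# Hypothetical audit-log excerpt, one JSON event per line.
AUDIT_LOG = """\
{"ts": 1, "op": "CREATE_TABLE", "table": "hive.dw.orders"}
{"ts": 2, "op": "ADD_PARTITION", "table": "hive.dw.orders"}
{"ts": 3, "op": "CREATE_TABLE", "table": "kudu.rt.clicks"}
{"ts": 4, "op": "DROP_TABLE", "table": "kudu.rt.clicks"}
"""


def ddl_history(log_text: str) -> dict:
    """Group DDL operations by table, preserving the order of the log."""
    history = defaultdict(list)
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        history[event["table"]].append(event["op"])
    return dict(history)
```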

At the same time, using ranger (https://github.com/ranger/ranger),
we have added permission control to the Flink framework, and all permission
information is saved in the form of catalog.database.table.

We also modified the logic related to FlinkJobListener so that it exposes the
JobGraph. By parsing the JobGraph, we can obtain the lineage information of
the job.
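
The idea of deriving lineage from a job graph can be sketched as a plain graph
traversal. The Python below does not use Flink's JobGraph classes; the vertex
and edge representation, the table names and the table_lineage function are
all hypothetical stand-ins chosen for illustration.

```python
def table_lineage(vertices: dict, edges: list) -> dict:
    """Map each sink table to the set of source tables that can reach it.

    vertices: vertex id -> (kind, table), where kind is "source",
              "operator" or "sink" and table is None for operators.
    edges:    list of (upstream id, downstream id) pairs.
    """
    downstream = {}
    for src, dst in edges:
        downstream.setdefault(src, []).append(dst)

    lineage = {}
    for vid, (kind, table) in vertices.items():
        if kind != "source":
            continue
        # Depth-first walk from this source to every reachable sink.
        stack, seen = [vid], set()
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            cur_kind, cur_table = vertices[cur]
            if cur_kind == "sink":
                lineage.setdefault(cur_table, set()).add(table)
            stack.extend(downstream.get(cur, []))
    return lineage


# Hypothetical job: two sources joined by one operator into one sink.
VERTICES = {
    "v1": ("source", "kafka.rt.clicks"),
    "v2": ("source", "hive.dw.users"),
    "v3": ("operator", None),
    "v4": ("sink", "hive.dw.enriched_clicks"),
}
EDGES = [("v1", "v3"), ("v2", "v3"), ("v3", "v4")]
```

Calling table_lineage(VERTICES, EDGES) yields table-level lineage only;
column-level lineage would need more information than this graph carries.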

To sum up, unified metadata management is convenient for managing different
systems and connecting to Flink, and at the same time, it is convenient for
unified permission management and obtaining table-related lineage
information.


On Fri, Jan 14, 2022 at 3:14 AM Maciej Obuchowski <
obuchowski.mac...@gmail.com> wrote:

> Hello,
>
> I'm an OpenLineage committer - and previously, a minor Flink contributor.
> The OpenLineage community is very interested in the conversation about Flink
> metadata, and we'll be happy to cooperate with the Flink community.
>
> Best,
> Maciej Obuchowski
>
>
>
> On Thu, 13 Jan 2022 at 18:12, Martijn Visser wrote:
> >
> > Hi all,
> >
> > @Andrew thanks for sharing that!
> >
> > @Tero good point, I should have clarified the purpose. I want to understand
> > which "metadata platform" tools are used or evaluated by the Flink
> > community, what their purpose is for using such a tool (is it as a generic
> > catalogue, as a data discovery tool, is lineage the important part, etc.)
> > and what problems people are trying to solve with them. This space is
> > developing rapidly and there are many open source and commercial tools
> > popping up/growing, which is also why I'm trying to keep an open mind about
> > how this space is evolving.
> >
> > If the Flink community wants to integrate with metadata tools, I fully
> > agree that ideally we do that via standards. My perception is at this
> > moment that no clear standard has yet been established. You mentioned
> > open-metadata.org, but I believe https://openlineage.io/ is also an
> > alternative standard.
> >
> > Best regards,
> >
> > Martijn
> >
> > On Thu, 13 Jan 2022 at 17:00, Tero Paananen wrote:
> >
> > > > I'm currently checking out different metadata platforms, such as
> > > > Amundsen [1] and Datahub [2]. In short, these types of tools try to
> > > > address problems related to topics such as data discovery, data lineage
> > > > and an overall data catalogue.
> > > >
> > > > I'm reaching out to the Dev and User mailing lists to get some feedback.
> > > > It would really help if you could spend a couple of minutes to let me
> > > > know if you already use either one of the two mentioned metadata
> > > > platforms or another one, or whether you are evaluating such tools. If
> > > > so, is that for use as a catalogue, for lineage or anything else? Any
> > > > type of feedback on these types of tools is appreciated.
> > >
> > > I hope you don't mind answers off-list.
> > >
> > > You didn't say what purpose you're evaluating these tools for, but if
> > > you're evaluating platforms for integration with Flink, I wouldn't
> > > approach it with a particular product in mind. Rather I'd create some
> > > sort of facility to propagate metadata and/or lineage information in a
> > > generic way and allow Flink users to plug in their favorite metadata
> > > tool. Using standards like OpenLineage, for example. I believe Egeria
> > > is also trying to create an open standard for metadata.
> > >
> > > If you're evaluating data catalogs for personal use or use in a
> > > particular project, Andrew's answer about the Wikimedia evaluation is
> > > a good start. It's missing OpenMetadata (https://open-metadata.org/).
> > > That one is showing a LOT of promise. Wikimedia's evaluation is also
> > > missing industry leading commercial products (understandably, given
> > > their mission). Collibra and Alation are probably the ones that pop up
> > > most often.
> > >
> > > I have personally looked into both DataHub and Amundsen. My high level
> > > feedback is that DataHub is overengineered and uses proprietary
> > > LinkedIn technology platform(s), which aren't widely used anywhere.
> > > Amundsen is much less flexible than DataHub and quite basic in its
> > > functionality. If you need anything beyond wh

Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

2022-01-13 Thread Maciej Obuchowski
Hello,

I'm an OpenLineage committer - and previously, a minor Flink contributor.
The OpenLineage community is very interested in the conversation about Flink
metadata, and we'll be happy to cooperate with the Flink community.

Best,
Maciej Obuchowski



On Thu, 13 Jan 2022 at 18:12, Martijn Visser wrote:
>
> Hi all,
>
> @Andrew thanks for sharing that!
>
> @Tero good point, I should have clarified the purpose. I want to understand
> which "metadata platform" tools are used or evaluated by the Flink
> community, what their purpose is for using such a tool (is it as a generic
> catalogue, as a data discovery tool, is lineage the important part, etc.) and
> what problems people are trying to solve with them. This space is
> developing rapidly and there are many open source and commercial tools
> popping up/growing, which is also why I'm trying to keep an open mind about
> how this space is evolving.
>
> If the Flink community wants to integrate with metadata tools, I fully
> agree that ideally we do that via standards. My perception is at this
> moment that no clear standard has yet been established. You mentioned
> open-metadata.org, but I believe https://openlineage.io/ is also an
> alternative standard.
>
> Best regards,
>
> Martijn
>
> On Thu, 13 Jan 2022 at 17:00, Tero Paananen  wrote:
>
> > > I'm currently checking out different metadata platforms, such as
> > > Amundsen [1] and Datahub [2]. In short, these types of tools try to
> > > address problems related to topics such as data discovery, data lineage
> > > and an overall data catalogue.
> > >
> > > I'm reaching out to the Dev and User mailing lists to get some feedback.
> > > It would really help if you could spend a couple of minutes to let me
> > > know if you already use either one of the two mentioned metadata
> > > platforms or another one, or whether you are evaluating such tools. If
> > > so, is that for use as a catalogue, for lineage or anything else? Any
> > > type of feedback on these types of tools is appreciated.
> >
> > I hope you don't mind answers off-list.
> >
> > You didn't say what purpose you're evaluating these tools for, but if
> > you're evaluating platforms for integration with Flink, I wouldn't
> > approach it with a particular product in mind. Rather I'd create some
> > sort of facility to propagate metadata and/or lineage information in a
> > generic way and allow Flink users to plug in their favorite metadata
> > tool. Using standards like OpenLineage, for example. I believe Egeria
> > is also trying to create an open standard for metadata.
> >
> > If you're evaluating data catalogs for personal use or use in a
> > particular project, Andrew's answer about the Wikimedia evaluation is
> > a good start. It's missing OpenMetadata (https://open-metadata.org/).
> > That one is showing a LOT of promise. Wikimedia's evaluation is also
> > missing industry leading commercial products (understandably, given
> > their mission). Collibra and Alation are probably the ones that pop up
> > most often.
> >
> > I have personally looked into both DataHub and Amundsen. My high level
> > feedback is that DataHub is overengineered and uses proprietary
> > LinkedIn technology platform(s), which aren't widely used anywhere.
> > Amundsen is much less flexible than DataHub and quite basic in its
> > functionality. If you need anything beyond what it already offers,
> > good luck.
> >
> > We dumped Amundsen in favor of OpenMetadata a few months back. We
> > don't have enough data points to fully evaluate OpenMetadata yet.
> >
> > -TPP
> >


Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

2022-01-13 Thread Martijn Visser
Hi all,

@Andrew thanks for sharing that!

@Tero good point, I should have clarified the purpose. I want to understand
which "metadata platform" tools are used or evaluated by the Flink
community, what their purpose is for using such a tool (is it as a generic
catalogue, as a data discovery tool, is lineage the important part, etc.) and
what problems people are trying to solve with them. This space is
developing rapidly and there are many open source and commercial tools
popping up/growing, which is also why I'm trying to keep an open mind about
how this space is evolving.

If the Flink community wants to integrate with metadata tools, I fully
agree that ideally we do that via standards. My perception is at this
moment that no clear standard has yet been established. You mentioned
open-metadata.org, but I believe https://openlineage.io/ is also an
alternative standard.
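
To make the "integrate via standards" idea more concrete, here is a minimal
Python sketch of a pluggable lineage emitter. It is an illustration only and
not a Flink or OpenLineage client API: LineageEmitter, InMemoryEmitter and
build_run_event are hypothetical names, the event dict only approximates the
top-level shape of an OpenLineage run event, and the namespace and producer
values are placeholders.

```python
import uuid
from datetime import datetime, timezone
from typing import Protocol


class LineageEmitter(Protocol):
    """Anything that can receive a lineage event (hypothetical interface)."""

    def emit(self, event: dict) -> None: ...


class InMemoryEmitter:
    """Test double; a real emitter might POST the event to a metadata tool."""

    def __init__(self) -> None:
        self.events = []

    def emit(self, event: dict) -> None:
        self.events.append(event)


def build_run_event(job_name, inputs, outputs, namespace="flink-demo"):
    """Build a dict loosely shaped like an OpenLineage run event."""

    def dataset(name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        # Placeholder URI, not a real producer.
        "producer": "https://example.invalid/hypothetical-flink-listener",
    }
```

With this split, the job only depends on the emitter interface, and users can
plug in whichever metadata tool they prefer behind it.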

Best regards,

Martijn

On Thu, 13 Jan 2022 at 17:00, Tero Paananen  wrote:

> > I'm currently checking out different metadata platforms, such as
> > Amundsen [1] and Datahub [2]. In short, these types of tools try to
> > address problems related to topics such as data discovery, data lineage
> > and an overall data catalogue.
> >
> > I'm reaching out to the Dev and User mailing lists to get some feedback.
> > It would really help if you could spend a couple of minutes to let me
> > know if you already use either one of the two mentioned metadata
> > platforms or another one, or whether you are evaluating such tools. If
> > so, is that for use as a catalogue, for lineage or anything else? Any
> > type of feedback on these types of tools is appreciated.
>
> I hope you don't mind answers off-list.
>
> You didn't say what purpose you're evaluating these tools for, but if
> you're evaluating platforms for integration with Flink, I wouldn't
> approach it with a particular product in mind. Rather I'd create some
> sort of facility to propagate metadata and/or lineage information in a
> generic way and allow Flink users to plug in their favorite metadata
> tool. Using standards like OpenLineage, for example. I believe Egeria
> is also trying to create an open standard for metadata.
>
> If you're evaluating data catalogs for personal use or use in a
> particular project, Andrew's answer about the Wikimedia evaluation is
> a good start. It's missing OpenMetadata (https://open-metadata.org/).
> That one is showing a LOT of promise. Wikimedia's evaluation is also
> missing industry leading commercial products (understandably, given
> their mission). Collibra and Alation are probably the ones that pop up
> most often.
>
> I have personally looked into both DataHub and Amundsen. My high level
> feedback is that DataHub is overengineered and uses proprietary
> LinkedIn technology platform(s), which aren't widely used anywhere.
> Amundsen is much less flexible than DataHub and quite basic in its
> functionality. If you need anything beyond what it already offers,
> good luck.
>
> We dumped Amundsen in favor of OpenMetadata a few months back. We
> don't have enough data points to fully evaluate OpenMetadata yet.
>
> -TPP
>


Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

2022-01-13 Thread Pedro Silva
Hello,

I'm part of the DataHub community and working in collaboration with the
company behind it: http://acryldata.io
Happy to have a conversation or clarify any questions you may have on
DataHub :)

Have a nice day!

On Thu, 13 Jan 2022 at 15:33, Andrew Otto wrote:

> Hello!  The Wikimedia Foundation is currently doing a similar evaluation
> (although we are not currently including any Flink considerations).
>
>
> https://wikitech.wikimedia.org/wiki/Data_Catalog_Application_Evaluation_Rubric
>
> More details will be published there as folks keep working on this.
> Hope that helps a little bit! :)
>
> -Andrew Otto
>
> On Thu, Jan 13, 2022 at 10:27 AM Martijn Visser 
> wrote:
>
>> Hi everyone,
>>
>> I'm currently checking out different metadata platforms, such as Amundsen
>> [1] and Datahub [2]. In short, these types of tools try to address problems
>> related to topics such as data discovery, data lineage and an overall data
>> catalogue.
>>
>> I'm reaching out to the Dev and User mailing lists to get some feedback.
>> It would really help if you could spend a couple of minutes to let me know
>> if you already use either one of the two mentioned metadata platforms or
>> another one, or whether you are evaluating such tools. If so, is that for
>> use as a catalogue, for lineage or anything else? Any type of
>> feedback on these types of tools is appreciated.
>>
>> Best regards,
>>
>> Martijn
>>
>> [1] https://github.com/amundsen-io/amundsen/
>> [2] https://github.com/linkedin/datahub
>>
>>
>>


Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

2022-01-13 Thread Andrew Otto
Hello!  The Wikimedia Foundation is currently doing a similar evaluation
(although we are not currently including any Flink considerations).

https://wikitech.wikimedia.org/wiki/Data_Catalog_Application_Evaluation_Rubric

More details will be published there as folks keep working on this.
Hope that helps a little bit! :)

-Andrew Otto

On Thu, Jan 13, 2022 at 10:27 AM Martijn Visser 
wrote:

> Hi everyone,
>
> I'm currently checking out different metadata platforms, such as Amundsen
> [1] and Datahub [2]. In short, these types of tools try to address problems
> related to topics such as data discovery, data lineage and an overall data
> catalogue.
>
> I'm reaching out to the Dev and User mailing lists to get some feedback.
> It would really help if you could spend a couple of minutes to let me know
> if you already use either one of the two mentioned metadata platforms or
> another one, or whether you are evaluating such tools. If so, is that for
> use as a catalogue, for lineage or anything else? Any type of
> feedback on these types of tools is appreciated.
>
> Best regards,
>
> Martijn
>
> [1] https://github.com/amundsen-io/amundsen/
> [2] https://github.com/linkedin/datahub
>
>
>