Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration
Hi Yun Tang,

Sorry for the late reply. I haven't seen any tickets related to this topic. I still think this is an important feature to have supported in Flink, and I would love some volunteers on this topic.

Best regards,

Martijn

On Tue, Sep 13, 2022 at 7:47 AM Yun Tang wrote:

> An interesting topic. I noticed that the DataHub community has launched a
> feature request discussion about Flink integration [1].
>
> @Martijn Visser Has the Flink community created tickets to track this
> topic? From my current understanding, Flink lacks rich information in
> FlinkJobListener, just as Feng mentioned, for sending data lineage to
> external systems, which Spark already supports well.
>
> [1] https://feature-requests.datahubproject.io/p/flink-integration
>
> Best
> Yun Tang
Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration
An interesting topic. I noticed that the DataHub community has launched a feature request discussion about Flink integration [1].

@Martijn Visser Has the Flink community created tickets to track this topic?
From my current understanding, Flink lacks rich information in FlinkJobListener, just as Feng mentioned, for sending data lineage to external systems, which Spark already supports well.

[1] https://feature-requests.datahubproject.io/p/flink-integration

Best
Yun Tang

From: wangqinghuan <1095193...@qq.com>
Sent: Monday, January 17, 2022 18:27
To: user@flink.apache.org
Subject: Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration

We are using Datahub to address table-level lineage and column-level lineage for Flink SQL.
Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration
We are using Datahub to address table-level lineage and column-level lineage for Flink SQL.

On 2022/1/13 23:27, Martijn Visser wrote:

Hi everyone,

I'm currently checking out different metadata platforms, such as Amundsen [1] and Datahub [2]. In short, these types of tools try to address problems related to topics such as data discovery, data lineage and an overall data catalogue.

I'm reaching out to the Dev and User mailing lists to get some feedback. It would really help if you could spend a couple of minutes to let me know if you already use either one of the two mentioned metadata platforms or another one, or are you evaluating such tools? If so, is that for the purpose as a catalogue, for lineage or anything else? Any type of feedback on these types of tools is appreciated.

Best regards,

Martijn

[1] https://github.com/amundsen-io/amundsen/
[2] https://github.com/linkedin/datahub
Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration
Hi,

I am a software engineer from Xiaomi. Last year we used metacat (https://github.com/Netflix/metacat) to manage all metadata, including Hive, Kudu, Doris, Iceberg, Elasticsearch, Talos (Xiaomi's self-developed message queue), MySQL and TiDB.

Metacat is well compatible with the hive-metastore protocol. Therefore, we can directly use FlinkHiveCatalog to connect to metacat and create different tables, including Hive tables and other generic types of tables. All systems are abstracted into a catalog.database.table structure, so in Flink SQL we can access any registered table through catalog.database.table.

In addition, metacat uniformly manages all table creation, deletion, and partitioning operations. By analyzing the audit log of metacat, we can easily obtain the DDL lineage of different tables. At the same time, with the use of ranger (https://github.com/ranger/ranger), we have added permission control to the Flink framework, and all permission information is saved in the form of catalog.database.table.

We also modified the logic related to FlinkJobListener. By exposing the JobGraph, we can obtain the lineage information of a job by parsing its JobGraph.

To sum up, unified metadata management makes it convenient to manage different systems and connect them to Flink, and at the same time it makes unified permission management and table-level lineage collection convenient.

On Fri, Jan 14, 2022 at 3:14 AM Maciej Obuchowski <obuchowski.mac...@gmail.com> wrote:

> Hello,
>
> I'm an OpenLineage committer - and previously, a minor Flink contributor.
> OpenLineage community is very interested in conversation about Flink
> metadata, and we'll be happy to cooperate with the Flink community.
>
> Best,
> Maciej Obuchowski
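
For readers who want a concrete picture of the catalog.database.table convention described in the message from Xiaomi above, here is a minimal Python sketch. It is not metacat's or Flink's API: the `TableId` class, the `build_lineage` function, the shape of the audit events and the table names are all assumptions made up for illustration, since the thread does not show the real audit log format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TableId:
    """A fully qualified table name of the form catalog.database.table."""

    catalog: str
    database: str
    table: str

    @classmethod
    def parse(cls, name: str) -> "TableId":
        parts = name.split(".")
        # Reject anything that is not exactly three non-empty parts.
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected catalog.database.table, got {name!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return f"{self.catalog}.{self.database}.{self.table}"


def build_lineage(audit_events):
    """Map each target table to the set of source tables written into it.

    `audit_events` is a hypothetical, simplified audit-log shape: an iterable
    of dicts with a "target" name and a list of "sources" names.
    """
    upstream = defaultdict(set)
    for event in audit_events:
        target = TableId.parse(event["target"])
        for source in event["sources"]:
            upstream[target].add(TableId.parse(source))
    return dict(upstream)


# Hypothetical events; the table names are invented for this example.
events = [
    {"target": "hive.dw.orders_daily",
     "sources": ["talos.raw.orders", "mysql.shop.users"]},
    {"target": "iceberg.mart.revenue",
     "sources": ["hive.dw.orders_daily"]},
]
lineage = build_lineage(events)
```

Because every system is addressed by the same three-part name, the same key can serve lineage lookups and permission records alike, which appears to be the point of the unified structure.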
Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration
Hello,

I'm an OpenLineage committer - and previously, a minor Flink contributor. OpenLineage community is very interested in conversation about Flink metadata, and we'll be happy to cooperate with the Flink community.

Best,
Maciej Obuchowski

On Thu, 13 Jan 2022 at 18:12, Martijn Visser wrote:

> Hi all,
>
> @Andrew thanks for sharing that!
>
> @Tero good point, I should have clarified the purpose. I want to understand
> what "metadata platforms" tools are used or evaluated by the Flink
> community, what their purpose is for using such a tool (is it as a generic
> catalogue, as a data discovery tool, is lineage the important part etc) and
> what problems people are trying to solve with them. This space is
> developing rapidly and there are many open source and commercial tools
> popping up/growing, which is also why I'm trying to keep an open view of
> how this space is evolving.
>
> If the Flink community wants to integrate with metadata tools, I fully
> agree that ideally we do that via standards. My perception is at this
> moment that no clear standard has yet been established. You mentioned
> open-metadata.org, but I believe https://openlineage.io/ is also an
> alternative standard.
>
> Best regards,
>
> Martijn
Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration
Hi all,

@Andrew thanks for sharing that!

@Tero good point, I should have clarified the purpose. I want to understand what "metadata platforms" tools are used or evaluated by the Flink community, what their purpose is for using such a tool (is it as a generic catalogue, as a data discovery tool, is lineage the important part etc) and what problems people are trying to solve with them. This space is developing rapidly and there are many open source and commercial tools popping up/growing, which is also why I'm trying to keep an open view of how this space is evolving.

If the Flink community wants to integrate with metadata tools, I fully agree that ideally we do that via standards. My perception is at this moment that no clear standard has yet been established. You mentioned open-metadata.org, but I believe https://openlineage.io/ is also an alternative standard.

Best regards,

Martijn

On Thu, 13 Jan 2022 at 17:00, Tero Paananen wrote:

> > I'm currently checking out different metadata platforms, such as
> > Amundsen [1] and Datahub [2]. In short, these types of tools try to
> > address problems related to topics such as data discovery, data lineage
> > and an overall data catalogue.
> >
> > I'm reaching out to the Dev and User mailing lists to get some feedback.
> > It would really help if you could spend a couple of minutes to let me
> > know if you already use either one of the two mentioned metadata
> > platforms or another one, or are you evaluating such tools? If so, is
> > that for the purpose as a catalogue, for lineage or anything else? Any
> > type of feedback on these types of tools is appreciated.
>
> I hope you don't mind answers off-list.
>
> You didn't say what purpose you're evaluating these tools for, but if
> you're evaluating platforms for integration with Flink, I wouldn't
> approach it with a particular product in mind. Rather I'd create some
> sort of facility to propagate metadata and/or lineage information in a
> generic way and allow Flink users to plug in their favorite metadata
> tool. Using standards like OpenLineage, for example. I believe Egeria
> is also trying to create an open standard for metadata.
>
> If you're evaluating data catalogs for personal use or use in a
> particular project, Andrew's answer about the Wikimedia evaluation is
> a good start. It's missing OpenMetadata (https://open-metadata.org/).
> That one is showing a LOT of promise. Wikimedia's evaluation is also
> missing industry leading commercial products (understandably, given
> their mission). Collibra and Alation are probably the ones that pop up
> most often.
>
> I have personally looked into both DataHub and Amundsen. My high level
> feedback is that DataHub is overengineered, and using proprietary
> LinkedIn technology platform(s), which aren't widely used anywhere.
> Amundsen is much less flexible than DataHub and quite basic in its
> functionality. If you need anything beyond what it already offers,
> good luck.
>
> We dumped Amundsen in favor of OpenMetadata a few months back. We
> don't have enough data points to fully evaluate OpenMetadata yet.
>
> -TPP
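
Tero's suggestion in the quoted message above, a generic facility that propagates lineage and lets users plug in their preferred metadata tool, could look roughly like the Python sketch below. Nothing here is an existing Flink or OpenLineage interface: `LineageEvent`, `LineageReporter`, `InMemoryReporter`, `LineageDispatcher` and the dataset names are hypothetical and only illustrate the plug-in shape.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEvent:
    """Tool-neutral description of what a job read and wrote (hypothetical)."""

    job_name: str
    inputs: tuple
    outputs: tuple


class LineageReporter(ABC):
    """Plug-in point: one implementation per metadata tool."""

    @abstractmethod
    def report(self, event: LineageEvent) -> None:
        ...


class InMemoryReporter(LineageReporter):
    """Stand-in backend that only records events, useful for testing."""

    def __init__(self) -> None:
        self.events = []

    def report(self, event: LineageEvent) -> None:
        self.events.append(event)


class LineageDispatcher:
    """Fans each event out to every registered reporter."""

    def __init__(self) -> None:
        self._reporters = []

    def register(self, reporter: LineageReporter) -> None:
        self._reporters.append(reporter)

    def emit(self, event: LineageEvent) -> None:
        for reporter in self._reporters:
            reporter.report(event)


# Usage: the engine emits one neutral event; each backend translates it.
dispatcher = LineageDispatcher()
memory = InMemoryReporter()
dispatcher.register(memory)
dispatcher.emit(
    LineageEvent(
        job_name="orders_etl",
        inputs=("kafka.raw.orders",),
        outputs=("hive.dw.orders_daily",),
    )
)
```

The design choice is that the engine only knows the neutral event type, so supporting another catalog means writing one more reporter rather than changing the engine.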
Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration
Hello,

I'm part of the DataHub community and working in collaboration with the company behind it: http://acryldata.io

Happy to have a conversation or clarify any questions you may have on DataHub :)

Have a nice day!

On Thu, 13 Jan 2022 at 15:33, Andrew Otto wrote:

> Hello! The Wikimedia Foundation is currently doing a similar evaluation
> (although we are not currently including any Flink considerations).
>
> https://wikitech.wikimedia.org/wiki/Data_Catalog_Application_Evaluation_Rubric
>
> More details will be published there as folks keep working on this.
> Hope that helps a little bit! :)
>
> -Andrew Otto
Re: [FEEDBACK] Metadata Platforms / Catalogs / Lineage integration
Hello! The Wikimedia Foundation is currently doing a similar evaluation (although we are not currently including any Flink considerations).

https://wikitech.wikimedia.org/wiki/Data_Catalog_Application_Evaluation_Rubric

More details will be published there as folks keep working on this.
Hope that helps a little bit! :)

-Andrew Otto

On Thu, Jan 13, 2022 at 10:27 AM Martijn Visser wrote:

> Hi everyone,
>
> I'm currently checking out different metadata platforms, such as Amundsen
> [1] and Datahub [2]. In short, these types of tools try to address problems
> related to topics such as data discovery, data lineage and an overall data
> catalogue.
>
> I'm reaching out to the Dev and User mailing lists to get some feedback.
> It would really help if you could spend a couple of minutes to let me know
> if you already use either one of the two mentioned metadata platforms or
> another one, or are you evaluating such tools? If so, is that for
> the purpose as a catalogue, for lineage or anything else? Any type of
> feedback on these types of tools is appreciated.
>
> Best regards,
>
> Martijn
>
> [1] https://github.com/amundsen-io/amundsen/
> [2] https://github.com/linkedin/datahub