Re: [DISCUSS] Apache Amoro proposal
Hi,

As the discussion seems to have died down, I’ll put this up for a vote.

Kind Regards,
Justin

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
Re: Re: [DISCUSS] Apache Amoro proposal
Hi Nathan,

Thanks for the detailed information. Much appreciated.

I now have a better understanding of the goals. It looks interesting.

Happy to help as a mentor if you need.

Thanks!
Regards
JB

On Sat, Feb 24, 2024 at 6:24 AM nathan ma wrote:
> Hi JB,
>
> As co-creator of this project, I’d love to explain more about the
> positioning of a lakehouse management system.
> [...]
Re: [DISCUSS] Apache Amoro proposal
+1, I'm glad to be one of the mentors.

I had a discussion with Nathan and Jinsong two years ago. They expressed interest in open-sourcing Arctic (as the project was formerly named) and donating it to the ASF Incubator in the future. I am happy to witness the community's growth and the proposal that has finally been put forward.

Kent Yao

On 2024/02/26 03:18:02 Xinyu Zhou wrote:
> +1. As one of the mentors, I have seen significant progress within
> this community over the past few months.
>
> Regards,
> Xinyu Zhou
> [...]
Re: [DISCUSS] Apache Amoro proposal
+1. As one of the mentors, I have seen significant progress within this community over the past few months.

Regards,
Xinyu Zhou

On Mon, Feb 26, 2024 at 10:53 AM Xavier Bai wrote:
> +1, I was also one of the early developers on the project, focusing on
> solving optimization and compaction issues with the company's Iceberg
> tables. I believe that many teams using data lakes need a system like
> Amoro for effective data lake management and to reduce the complexity
> of data lake maintenance. Therefore, contributing it to the ASF can
> enrich the usage scenarios and enhance data lake management
> capabilities.
>
> Thanks,
> Xu
> [...]
Re: [DISCUSS] Apache Amoro proposal
+1, I was also one of the early developers on the project, focusing on solving optimization and compaction issues with the company's Iceberg tables. I believe that many teams using data lakes need a system like Amoro for effective data lake management and to reduce the complexity of data lake maintenance. Therefore, contributing it to the ASF can enrich the usage scenarios and enhance data lake management capabilities.

Thanks,
Xu

On Mon, Feb 26, 2024 at 10:12, ConradJam wrote:
> +1, I'm one of the developers. At present, I think the community is
> developing well, and this project can help everyone better control the
> data lake. I suggest joining the ASF incubator to let more people know
> about this project and participate in it.
> [...]
> --
> Best
>
> ConradJam
Re: [DISCUSS] Apache Amoro proposal
+1, I'm one of the developers. At present, I think the community is developing well, and this project can help everyone better control the data lake. I suggest joining the ASF incubator to let more people know about this project and participate in it.

On Fri, Feb 23, 2024 at 16:44, Justin Mclean wrote:
> Hi,
>
> I would like to propose a new project to the ASF incubator - Apache Amoro.
> I’m one of the mentors, but there are a lot of other people involved who
> have done all of the hard work.
>
> Amoro is a Lakehouse management system built on open data lake formats
> like Apache Iceberg and Apache Paimon (Incubating). Working with compute
> engines including Apache Flink, Apache Spark, and Trino, Amoro brings
> pluggable and self-managed features for Lakehouse to provide out-of-the-box
> data warehouse experience, and helps data platforms or products easily
> build infra-decoupled, stream-and-batch-fused and lake-native architecture.
> You can find the proposal here. [1]
>
> We are looking forward to anyone's feedback or questions.
>
> Thanks,
> Justin
>
> [1] https://cwiki.apache.org/confluence/display/INCUBATOR/AmoroProposal

--
Best

ConradJam
Re: [DISCUSS] Apache Amoro proposal
+1. I'm happy to be one of the mentors.

I have discussed with Jinsong, Nathan and the team, and am impressed by their openness and passion for improving the Amoro community through incubation. From my observation, it's a well-developed community with a governance philosophy similar to the Apache Way. I believe joining the Apache Incubator could help the community become more vibrant and diverse.

Personally, I think a dedicated project whose purpose is to better manage the Lakehouse is necessary. IMHO, compared to traditional DBMS solutions, the existing open source Lakehouse solutions focus more on the "DB" part and put less effort into the "MS" part.

Best Regards,
Yu

On Fri, 23 Feb 2024 at 22:54, 周劲松 wrote:
> Hi JB,
>
> Yes, you can say it is an abstraction layer on top of data lake table
> formats and query engines; we often call it the service layer in the
> Lakehouse architecture. The service layer primarily provides unified
> metadata and access control, as well as common audit services, and so on.
>
> Amoro currently focuses on automatic optimizing, helping users use the
> data lake more easily and achieve the desired analytical performance on
> it. Amoro can work with other software in the service layer and can also
> be extended with plugins to integrate more capabilities.
> [...]
RE: Re: [DISCUSS] Apache Amoro proposal
+1. Looks like a good candidate with a good number of contributors already.

On 2024/02/24 05:24:33 nathan ma wrote:
> Hi JB,
>
> As co-creator of this project, I’d love to explain more about the
> positioning of a lakehouse management system.
> [...]
RE: Re: [DISCUSS] Apache Amoro proposal
Hi JB,

As co-creator of this project, I’d love to explain more about the positioning of a lakehouse management system.

When discussing databases or traditional data warehouses, we often use the term DBMS (Database Management System) to describe them. Traditional databases, including MPP databases, are typically considered “out-of-box” solutions. Unlike big data systems, they don’t require various components like compute engines, data lake formats, or metadata stores. When we need a database management tool, lightweight options like Navicat are commonly used.

If we further abstract the capabilities of a DBMS and map them to the modern data stack, we find that the data read/write part of a DBMS is now shared among different compute engines such as Spark, Flink, Trino, and cloud-native services like Athena. Another part of a DBMS deals with the maintenance of data files, index files, and metadata (also known as the information schema). Currently, there are successful open-source and commercial projects dedicated to managing metadata, such as HiveMetastore, UnityCatalog, and more recently, Gravitino. In practice, developers often combine these projects with compute engines to optimize data files. For example, many commercial compute engines include an optimize command.

Amoro, as a lakehouse management system, aims to encapsulate the maintenance and management of data lake files, index files, and metadata in a way that is transparent and easy to use. The richness of diverse computing engines is a distinctive feature of the modern data stack, opening up a multitude of possibilities for various application scenarios. For the part analogous to a DBMS, we aspire to have a mature system in place, one that accommodates data written to the lakehouse by any engine, in any manner, and keeps that data highly available to all other engines. For instance, when Flink writes to Iceberg, Amoro’s self-optimizing mechanism keeps analysis by Trino or other engines efficient while controlling compaction costs. Additionally, Amoro handles historical data, snapshots, and orphan file cleanup in the background.

By positioning Amoro in this way, we aim to provide an “out-of-box” experience that feels as straightforward as a traditional DBMS while staying open to various computing engines. At the same time, Amoro hopes to empower data product builders with a lightweight solution that integrates seamlessly into their modern data workflows.

Thanks.
Jin Ma

On 2024/02/23 14:16:43 Jean-Baptiste Onofré wrote:
> Hi Justin
>
> Even if it looks interesting, I'm not sure I understand exactly the
> purpose of the proposal.
>
> What does lakehouse management system mean exactly? Is it an abstraction
> layer on top of Iceberg, Paimon + query engine powered by Flink,
> Spark, Trino?
>
> Please let me know if you want an additional mentor, I would be happy to help.
>
> Thanks!
> Regards
> JB
> [...]
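As a rough illustration of the orphan file cleanup mentioned in the message above, the Python sketch below shows the general idea: a file is a cleanup candidate when no live table metadata references it and it is older than a safety window. This is only a conceptual sketch and not Amoro's implementation. The function name `find_orphan_files`, its inputs, and the three-day default window are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def find_orphan_files(listed_files, referenced_files, now, min_age=timedelta(days=3)):
    """Return paths that nothing references and that are old enough to delete.

    listed_files: dict mapping path -> last-modified datetime (a hypothetical
        storage listing, not a real Amoro or Iceberg API).
    referenced_files: set of paths reachable from the table's live snapshots.
    min_age: safety window so files from in-flight writes are left alone.
    """
    cutoff = now - min_age
    return sorted(
        path
        for path, modified in listed_files.items()
        if path not in referenced_files and modified < cutoff
    )


if __name__ == "__main__":
    now = datetime(2024, 2, 26, tzinfo=timezone.utc)
    listed = {
        "data/a.parquet": now - timedelta(days=10),  # still referenced
        "data/b.parquet": now - timedelta(days=10),  # unreferenced and old
        "data/c.parquet": now - timedelta(hours=1),  # unreferenced but recent
    }
    print(find_orphan_files(listed, {"data/a.parquet"}, now))
```

The age check matters because a writer may have created files that are not yet committed to any snapshot; deleting them immediately would break that in-flight commit.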
Re: [DISCUSS] Apache Amoro proposal
Hi JB,

Yes, you can say it is an abstraction layer on top of data lake table formats and query engines; we often call it the service layer in the Lakehouse architecture. The service layer primarily provides unified metadata and access control, as well as common audit services, and so on.

Amoro currently focuses on automatic optimizing, helping users use the data lake more easily and achieve the desired analytical performance on it. Amoro can work with other software in the service layer and can also be extended with plugins to integrate more capabilities.

On Fri, Feb 23, 2024 at 10:18 PM Jean-Baptiste Onofré wrote:
> Hi Justin
>
> Even if it looks interesting, I'm not sure I understand exactly the
> purpose of the proposal.
>
> What does lakehouse management system mean exactly? Is it an abstraction
> layer on top of Iceberg, Paimon + query engine powered by Flink,
> Spark, Trino?
>
> Please let me know if you want an additional mentor, I would be happy to
> help.
>
> Thanks!
> Regards
> JB
> [...]
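To give a feel for what "automatic optimizing" can involve, the Python sketch below plans a small-file compaction: it groups files under a size threshold into rewrite tasks that each stay within a target output size, using a first-fit-decreasing heuristic. This is an assumption-laden illustration, not Amoro's actual planner. The name `plan_compaction`, the parameters, and the choice of heuristic are all hypothetical.

```python
def plan_compaction(file_sizes, target_size, small_file_threshold):
    """Group small files into rewrite tasks of at most target_size each.

    file_sizes: iterable of file sizes (any consistent unit, e.g. MB).
    target_size: maximum combined size of the files in one rewrite task.
    small_file_threshold: only files smaller than this are considered.
    Returns a list of groups (lists of sizes); single-file groups are dropped
    because rewriting one file on its own gains nothing.
    """
    # First-fit decreasing: place the largest small files first.
    small = sorted((s for s in file_sizes if s < small_file_threshold), reverse=True)
    groups = []
    for size in small:
        for group in groups:
            if sum(group) + size <= target_size:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size])
    return [g for g in groups if len(g) > 1]


if __name__ == "__main__":
    # Four small files and one large file that is left untouched.
    print(plan_compaction([10, 20, 30, 200, 40], target_size=64, small_file_threshold=50))
```

Capping each task at a target size is one simple way to trade read performance against compaction cost: fewer, larger files speed up scans, while bounded tasks keep each rewrite cheap.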
Re: [DISCUSS] Apache Amoro proposal
Hi Ayush,

I am Jinsong from the Amoro community. Thank you very much for your attention and feedback on Amoro.

Amoro aims to support multiple versions of Hadoop and Hive clusters as much as possible, allowing users to specify versions at build time, but as you said, our default version should remain the latest. I have created an issue [1] to track this problem and will work to resolve it as soon as possible.

[1] https://github.com/NetEase/amoro/issues/2564

On Fri, Feb 23, 2024 at 17:13 PM Ayush Saxnea wrote:
> +1,
>
> I remember exploring this while looking for a way to compact Iceberg
> tables for a Hive use case, and I got some good pointers for cleaning up
> orphan files. I think it was using a pretty old version of Hive (3.1.1,
> I believe), so I couldn't pull it in as a dependency in the Hive master
> branch itself, which was my initial plan.
>
> But overall, it was some good code.
>
> Good Luck!!!
>
> -Ayush
Re: [DISCUSS] Apache Amoro proposal
Hi Justin,

Even though it looks interesting, I'm not sure I understand exactly the purpose of the proposal.

What does "lakehouse management system" mean exactly? Is it an abstraction layer on top of Iceberg, Paimon + query engine powered by Flink, Spark, Trino?

Please let me know if you want an additional mentor; I would be happy to help.

Thanks!
Regards
JB

On Fri, Feb 23, 2024 at 9:44 AM Justin Mclean wrote:
> Hi,
>
> I would like to propose a new project to the ASF incubator - Apache Amoro. I’m one of the mentors, but there are a lot of other people involved who have done all of the hard work.
>
> Amoro is a Lakehouse management system built on open data lake formats like Apache Iceberg and Apache Paimon (Incubating). Working with compute engines including Apache Flink, Apache Spark, and Trino, Amoro brings pluggable and self-managed features for Lakehouse to provide an out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused, and lake-native architectures. You can find the proposal here. [1]
>
> We are looking forward to anyone's feedback or questions.
>
> Thanks,
> Justin
>
> [1] https://cwiki.apache.org/confluence/display/INCUBATOR/AmoroProposal
Re: [DISCUSS] Apache Amoro proposal
+1,

I remember exploring this while looking for a way to compact Iceberg tables for a Hive use case, and got some good pointers for cleaning up orphan files. I think it was using a pretty old version of Hive (3.1.1, I believe), so I couldn't pull it in as a dependency in the Hive master branch itself, which was my initial plan.

But overall, it was some good code.

Good Luck!!!

-Ayush

On Fri, 23 Feb 2024 at 14:15, Justin Mclean wrote:
> Hi,
>
> I would like to propose a new project to the ASF incubator - Apache Amoro. I’m one of the mentors, but there are a lot of other people involved who have done all of the hard work.
>
> Amoro is a Lakehouse management system built on open data lake formats like Apache Iceberg and Apache Paimon (Incubating). Working with compute engines including Apache Flink, Apache Spark, and Trino, Amoro brings pluggable and self-managed features for Lakehouse to provide an out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused, and lake-native architectures. You can find the proposal here. [1]
>
> We are looking forward to anyone's feedback or questions.
>
> Thanks,
> Justin
>
> [1] https://cwiki.apache.org/confluence/display/INCUBATOR/AmoroProposal
[DISCUSS] Apache Amoro proposal
Hi,

I would like to propose a new project to the ASF incubator - Apache Amoro. I’m one of the mentors, but there are a lot of other people involved who have done all of the hard work.

Amoro is a Lakehouse management system built on open data lake formats like Apache Iceberg and Apache Paimon (Incubating). Working with compute engines including Apache Flink, Apache Spark, and Trino, Amoro brings pluggable and self-managed features for Lakehouse to provide an out-of-the-box data warehouse experience, and helps data platforms or products easily build infra-decoupled, stream-and-batch-fused, and lake-native architectures. You can find the proposal here. [1]

We are looking forward to anyone's feedback or questions.

Thanks,
Justin

[1] https://cwiki.apache.org/confluence/display/INCUBATOR/AmoroProposal