Re: Re: [DISCUSS] Apache Amoro proposal

2024-03-01 Thread Jean-Baptiste Onofré
Hi Nathan,

Thanks for the detailed information. Much appreciated.

I now have a better understanding of the goals. It looks interesting.

Happy to help as a mentor if needed.

Thanks!
Regards
JB

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



RE: Re: [DISCUSS] Apache Amoro proposal

2024-02-24 Thread nathan ma
Hi JB,

As co-creator of this project, I’d love to explain more about the
positioning of a lakehouse management system.

When discussing databases or traditional data warehouses, we often use the
term DBMS (Database Management System) to describe them. Traditional
databases, including MPP databases, are typically considered
“out-of-the-box” solutions. Unlike big data systems, they don’t require
assembling separate components such as compute engines, data lake formats,
or metadata stores. When we need a database management tool, lightweight
options like Navicat are commonly used.

If we further abstract the capabilities of a DBMS and map them to the
modern data stack, we find that the data read/write part of a DBMS is now
shared among different compute engines such as Spark, Flink, Trino, and
cloud-native services like Athena. Another part of a DBMS deals with the
maintenance of data files, index files, and metadata (also known as the
information schema). Currently, there are successful open-source and
commercial projects dedicated to managing metadata, such as Hive Metastore,
Unity Catalog, and more recently, Gravitino. In practice, developers often
combine these projects with compute engines to optimize data files. For
example, many commercial compute engines include an optimize command.

Amoro, as a lakehouse management system, aims to encapsulate the
maintenance and management of data lake files, index files, and metadata in
a way that is transparent and easy to use. The richness of diverse compute
engines is a distinctive feature of the modern data stack, opening up a
multitude of possibilities for various application scenarios. For the part
analogous to a DBMS, we aspire to have a mature system in place, one that
seamlessly accommodates data written to the lakehouse by any engine, in any
manner, and keeps that data highly available to all other engines. For
instance, when Flink writes to Iceberg, Amoro’s self-optimizing mechanism
keeps analysis by Trino or other engines efficient while controlling
compaction costs. Amoro also handles cleanup of historical data, snapshots,
and orphan files in the background.

By positioning Amoro in this way, we aim to provide an “out-of-the-box”
experience that feels as straightforward as a traditional DBMS while
staying open to various compute engines. At the same time, Amoro hopes to
empower data product builders with a lightweight solution that integrates
seamlessly into their modern data workflows.



Thanks.
Jin Ma

On 2024/02/23 14:16:43 Jean-Baptiste Onofré wrote:
> Hi Justin
>
> Even if it looks interesting, I'm not sure I understand exactly the
> purpose of the proposal.
>
> What does lakehouse management system mean exactly? Is it an abstraction
> layer on top of Iceberg, Paimon + query engine powered by Flink,
> Spark, Trino?
>
> Please let me know if you want an additional mentor, I would be happy to
> help.
>
> Thanks!
> Regards
> JB
>
> On Fri, Feb 23, 2024 at 9:44 AM Justin Mclean wrote:
> >
> > Hi,
> >
> > I would like to propose a new project to the ASF incubator - Apache
> > Amoro. I’m one of the mentors, but there are a lot of other people
> > involved who have done all of the hard work.
> >
> > Amoro is a Lakehouse management system built on open data lake formats
> > like Apache Iceberg and Apache Paimon (Incubating). Working with compute
> > engines including Apache Flink, Apache Spark, and Trino, Amoro brings
> > pluggable and self-managed features for Lakehouse to provide an
> > out-of-the-box data warehouse experience, and helps data platforms or
> > products easily build infra-decoupled, stream-and-batch-fused and
> > lake-native architecture. You can find the proposal here. [1]
> >
> > We are looking forward to anyone's feedback or questions.
> >
> > Thanks,
> > Justin
> >
> > [1] https://cwiki.apache.org/confluence/display/INCUBATOR/AmoroProposal


RE: Re: [DISCUSS] Apache Amoro proposal

2024-02-24 Thread PJ Fanning
+1. Looks like a good candidate with a good number of contributors already.



