Re: [new RFC Request] The need of Multiple event_time fields verification

2022-08-16 Thread Xinyao Tian
Thank you in advance for your time and effort! Since we didn't find a way to 
invite code reviewers on the PR, we mentioned you in a comment. The URL of our 
PR is: https://github.com/apache/hudi/pull/6382
By the way, could you please tell us whether there is a better way to notify 
you and others in the project? The only way we knew of was to mention GitHub 
accounts in comments on the PR page, so if there is a better channel (such as 
Jira, the dev mailing list, or something else) we would really like to know :)
Also, for your convenience, we have exported the RFC material, including the 
pictures, to a PDF for easier reading. I'll send it to you through Slack 
because it can't be sent by email.
Regards,
Xinyao
On 08/17/2022 02:11,Sivabalan wrote:
Yes, sounds good. Appreciate that. My GitHub profile is
https://github.com/nsivabalan


On Fri, 12 Aug 2022 at 01:25, 田昕峣 (Xinyao Tian)  wrote:




Hi Sivabalan,




Hope you are doing well. As promised, we have finished writing the RFC proposal
and are now ready to submit it as a PR with confidence.

According to the RFC process, in order to have our carefully designed RFC
proposal checked, we need to add at least two PMC members as reviewers to
examine it. Therefore, we would sincerely like to invite you as one of the
reviewers to check our RFC proposal and give us comments and feedback.

Since we put a lot of effort into writing this RFC proposal, and you did not
hesitate to sacrifice your time to help us land our first PR and give valuable
comments, we sincerely hope that you can accept our invitation so that I can
put your GitHub account in the RFC.

Likewise, if you have other suggested candidates, we'd be happy to invite them
as reviewers, since there is no limit on the number of reviewers.

Wish you all the best, and we look forward to receiving your reply.




Sincerely,

Xinyao Tian

On 08/9/2022 21:46,Sivabalan wrote:
Eagerly looking forward to the RFC, Xinyao. I definitely see a lot of folks
benefiting from this.

On Sun, 7 Aug 2022 at 20:00, 田昕峣 (Xinyao Tian) 
wrote:

Hi Shiyan,


Thanks so much for your feedback and your kind encouragement! It's always our
honor to contribute our effort to everyone and make Hudi even more awesome :)


We are now carefully preparing materials for the new RFC. Once we have
finished, we will strictly follow the RFC process described in the official
Hudi documentation to propose the new RFC and share all the details of the new
feature, as well as the related code, with everyone. Since we benefit from the
Hudi community, we would like to give our effort back to the community and
help Hudi benefit more people!


As always, please stay healthy and keep safe.


Kind regards,
Xinyao Tian
On 08/6/2022 10:11,Shiyan Xu wrote:
Hi Xinyao, awesome achievement! And really appreciate your keenness in
contributing to Hudi. Certainly we'd love to see an RFC for this.

On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) 
wrote:

Greetings everyone,


My name is Xinyao and I'm currently working for an insurance company. We found
that Apache Hudi is an extremely awesome utility, and when it cooperates with
Apache Flink it can be even more powerful. Thus, we have been using it for
months and keep benefiting from it.


However, there is one feature that we really need but Hudi doesn't currently
have: "Multiple event_time fields verification". In the insurance industry,
data is often stored distributed across dozens of tables that are conceptually
connected by the same primary keys. When the data is used, we often need to
associate several or even dozens of tables through join operations, stitching
all the partial columns into an entire record with dozens or even hundreds of
columns for downstream services to use.


Here comes the problem. If we want to guarantee that every part of the data
being joined is up to date, Hudi must be able to filter on multiple event_time
timestamps in a table and keep the most recent records. In this scenario, the
single event_time filtering field provided by Hudi (i.e., the option
'write.precombine.field' in Hudi 0.10.0) is not enough. To cope with use cases
involving complex join operations like the one above, and to give Hudi the
potential to support more application scenarios and industries, Hudi needs to
support filtering on multiple event_time timestamps within a single table.
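To make the intended semantics concrete, here is a minimal, hypothetical sketch in Python. The field names, column groups, and helper functions are invented for illustration; this is not Hudi's actual merge API.

```python
# Hypothetical illustration only: field names and functions are invented for
# this sketch and are NOT Hudi's actual merge API.

def merge_single_precombine(stored, incoming, precombine):
    """Hudi-0.10.0-style semantics: the whole incoming record wins or loses
    based on a single event_time field ('write.precombine.field')."""
    return incoming if incoming[precombine] >= stored[precombine] else stored

def merge_multi_event_time(stored, incoming, groups):
    """Sketch of the proposed semantics: each column group is kept or
    replaced independently, based on its own event_time field."""
    merged = dict(stored)
    for ts_field, columns in groups.items():
        if incoming.get(ts_field, float("-inf")) >= stored.get(ts_field, float("-inf")):
            for col in columns + [ts_field]:
                if col in incoming:
                    merged[col] = incoming[col]
    return merged

# One wide record stitched from two source tables, each with its own event_time:
stored   = {"id": 1, "policy_ts": 100, "policy": "P1", "claim_ts": 200, "claim": "C2"}
incoming = {"id": 1, "policy_ts": 150, "policy": "P2", "claim_ts": 120, "claim": "C1"}
groups   = {"policy_ts": ["policy"], "claim_ts": ["claim"]}

# A single precombine field (policy_ts) takes the whole incoming row, silently
# regressing the claim columns to an older state (claim_ts 120 < 200):
assert merge_single_precombine(stored, incoming, "policy_ts")["claim"] == "C1"

# Per-field verification updates the policy columns but keeps the newer claim:
assert merge_multi_event_time(stored, incoming, groups) == {
    "id": 1, "policy_ts": 150, "policy": "P2", "claim_ts": 200, "claim": "C2"}
```

Under these semantics, a join pipeline could upsert partial updates arriving from any source table without rolling back fresher columns that came from the other tables.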


The good news is that, after more than two months of development, my colleagues
and I have made some changes in the hudi-flink and hudi-common modules based on
Hudi 0.10.0 and have basically implemented this feature. Currently, my team is
using the enhanced source code with Kafka and Flink 1.13.2 to conduct
end-to-end testing on a dataset of more than 140 million real-world insurance
records and to verify the accuracy of the data. The results are quite good:
every part of the extremely wide records has been updated to the latest status,
based on our continuous observations over these weeks. We're very keen to make
this new feature available to everyone. We benefit from the Hudi community, so
we really want to give back to the community with our efforts.


The only remaining question is that we are not sure whether we need to create
an RFC to illustrate our design and implementation in detail. According to the
"RFC Process" in the official Hudi documentation, we have to confirm that this
feature does not already exist before we can create a new RFC to share the
concept and code and explain them in detail. Thus, we would really like to
create a new RFC that explains our implementation in detail with theory and
code, and makes it easy for everyone to understand it and improve on it.


We look forward to your feedback on whether we should create a new RFC and make
Hudi better and better to benefit everyone.


Kind regards,
Xinyao Tian



Re: [new RFC Request] The need of Multiple event_time fields verification

2022-08-12 Thread Xinyao Tian
Hi Shiyan,




Hope you are doing well. As promised, we have finished writing the RFC proposal
and are now ready to submit it as a PR with confidence.

According to the RFC process, in order to have our carefully designed RFC
proposal checked, we need to add at least two PMC members as reviewers to
examine it. Therefore, we would sincerely like to invite you as one of the
reviewers to check our RFC proposal and give us comments and feedback.

Since we put a lot of effort into writing this RFC proposal, and you were the
first person to give us feedback at the very beginning, we sincerely hope that
you can accept our invitation so that I can put your GitHub account in the RFC.

Likewise, if you have other suggested candidates, we'd be happy to invite them
as reviewers, since there is no limit on the number of reviewers.

Wish you all the best, and we look forward to receiving your reply.




Sincerely,

Xinyao Tian






Re: [new RFC Request] The need of Multiple event_time fields verification

2022-08-10 Thread Xinyao Tian
I just saw that the PR has been approved. Thanks a lot for your time!


We will submit the RFC materials as soon as possible (within a few days, to the
best of our ability). We look forward to receiving your further feedback then.


Wish you a good day :)
Xinyao
On 08/10/2022 13:44,Sivabalan wrote:
sure. Approved and landed!

On Tue, 9 Aug 2022 at 18:55, 田昕峣 (Xinyao Tian)  wrote:

Hi Sivabalan,




Thanks for your kind words! We have been working very hard to prepare
materials for the RFC this week, ever since we got your feedback on our idea,
and I promise it will be very soon (within a few days) that everyone can read
our RFC and see every detail of this feature. It's our pleasure to make Hudi
even more powerful by making this feature available to everyone.




However, there's one thing we really need your help with. According to the
RFC process described in the Hudi docs, we first have to raise a PR that adds
an entry to rfc/README.md. But since this is the first PR we have raised
against Hudi, a maintainer with write permission needs to approve it. We have
been waiting for days, but the PR is still pending.




Therefore, may I ask you to help us by approving our first PR so that we can
submit our further materials to Hudi? The URL of our pending PR is
https://github.com/apache/hudi/pull/6328 and the corresponding Jira is
https://issues.apache.org/jira/browse/HUDI-4569




We appreciate your help very much :)




Kind regards,

Xinyao Tian









Re: [new RFC Request] The need of Multiple event_time fields verification

2022-08-09 Thread Xinyao Tian
Hi Sivabalan,




Thanks for you kind words! We have been working very hard to prepare materials 
for the RFC this week since we got your feedback about our idea, and I promise 
it will be very soon (within a few days) that everyone can read our RFC and 
realize every details about this feature. It’s our pleasure to make Hudi even 
more powerful by making this feature available to everyone.




However, there’s one thing that we really need your help. According to the RFC 
Process shown in Hudi Docs, we have to first raise a PR and add an entry to 
rfc/README.md. But since this is the first time we raise a PR to Hudi, it’s 
necessary to have a maintainer with write permission to approve our PR. We have 
been wait for days but the PR is still in a pending status.




Therefore, may I ask you to help us to approve our first PR so that we could 
submit our further materials to Hudi? The url of our pending PR is: 
https://github.com/apache/hudi/pull/6328 and the corresponding Jira is: 
https://issues.apache.org/jira/browse/HUDI-4569 




Appreciate you so much for your help :)




Kind regards,

Xinyao Tian







On 08/9/2022 21:46,Sivabalan wrote:
Eagerly looking forward for the RFC Xinyao. Definitely see a lot of folks
benefitting from this.

On Sun, 7 Aug 2022 at 20:00, 田昕峣 (Xinyao Tian)  wrote:

Hi Shiyan,


Thanks so much for your feedback as well as your kind encouragement! It’s
always our honor to contribute our effort to everyone and make Hudi much
awesome :)


We are now carefully preparing materials for the new RFC. Once we
finished, we would strictly follow the RFC process shown in the Hudi
official documentation to propose the new RFC and share all details of the
new feature as well as related code to everyone. Since we benefit from Hudi
community, we would like to give back our effort to the community and make
Hudi benefit more people!


As always, please stay healthy and keep safe.


Kind regards,
Xinyao Tian
On 08/6/2022 10:11,Shiyan Xu wrote:
Hi Xinyao, awesome achievement! And really appreciate your keenness in
contributing to Hudi. Certainly we'd love to see an RFC for this.

On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) 
wrote:

Greetings everyone,


My name is Xinyao and I'm currently working for an insurance company. We
found that Apache Hudi is an extremely useful tool, and when it
cooperates with Apache Flink it can be even more powerful. Thus, we have
been using it for months and keep benefiting from it.


However, there is one feature that we really need but Hudi doesn't
currently have: "multiple event_time fields verification".
In the insurance industry, data is often distributed across
dozens of tables that are conceptually connected by shared primary keys. When
the data is used, we often need to associate several or even dozens of
tables through join operations, and stitch the partial columns into an
entire record with dozens or even hundreds of columns for downstream
services to use.


Here is the problem: if we want to guarantee that every part of the
data being joined is up to date, Hudi must be able to filter on
multiple event_time timestamps within a table and keep the most recent records.
In this scenario, the single event_time filtering field provided by
Hudi (i.e. the 'write.precombine.field' option in Hudi 0.10.0) is
inadequate. To cope with use cases involving complex join
operations like the above, and to give Hudi the potential to
support more application scenarios and industries, Hudi
needs to support filtering on multiple event_time timestamps
in a single table.
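To make the idea concrete, here is a small illustrative sketch of the difference between the two merge behaviors. This is not Hudi's actual API or our implementation; the field names, the column-group mapping, and both helper functions are hypothetical, written only to show the semantics described above:

```python
# Illustrative sketch only: field names and merge helpers are hypothetical,
# not Hudi's real API. Hudi 0.10.0's 'write.precombine.field' keeps one whole
# record by a single timestamp; the proposed feature would merge column
# groups independently, each guarded by its own event_time field.

def single_precombine(old, new, field="ts"):
    """Single-field precombine: the record whose precombine field is
    larger (or equal) wins as a whole."""
    return new if new[field] >= old[field] else old

def multi_event_time_merge(old, new, groups):
    """Proposed behavior: for each column group, keep whichever side has
    the fresher event_time for that group."""
    merged = dict(old)
    for ts_field, columns in groups.items():
        if new[ts_field] >= old[ts_field]:
            merged[ts_field] = new[ts_field]
            for col in columns:
                merged[col] = new[col]
    return merged

# Two partial updates to the same policy record, joined from two source tables:
old = {"id": 1, "premium": 100, "premium_ts": 5, "claim": "open", "claim_ts": 9}
new = {"id": 1, "premium": 120, "premium_ts": 8, "claim": "stale", "claim_ts": 2}

# Single-field precombine on claim_ts would discard the fresher premium
# update; per-group merging keeps the newest value of every part.
merged = multi_event_time_merge(
    old, new, {"premium_ts": ["premium"], "claim_ts": ["claim"]})
# merged == {"id": 1, "premium": 120, "premium_ts": 8,
#            "claim": "open", "claim_ts": 9}
```

In this sketch, picking either single timestamp would lose one of the two fresh column groups, which is exactly the inadequacy described above.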


The good news is that, after more than two months of development, my
colleagues and I have made some changes in the hudi-flink and hudi-common
modules based on Hudi 0.10.0 and have essentially implemented this feature.
Currently, my team is using the enhanced source code together with Kafka
and Flink 1.13.2 to conduct end-to-end testing on a dataset of more
than 140 million real-world insurance records and to verify the accuracy of
the data. The results are quite good: based on our continuous observations
over these weeks, every part of the extremely wide records has been updated
to the latest status. We're very keen to make this new feature
available to everyone. We benefit from the Hudi community, so we really
want to give back to the community with our efforts.


The only problem is that we are not sure whether we need to create an RFC
to illustrate our design and implementation in detail. According to the "RFC
Process" in the Hudi official documentation, we have to confirm that this
feature does not already exist before we can create a new RFC to share the
concept and code and explain them in detail. Thus, we really would
like to create a new RFC that explains our implementation in detail,
with theory and code, and makes it easier for everyone to understand
and make improvements based on our RFC.
