Hi Shiyan,

Thanks so much for your feedback and your kind encouragement! It's always an
honor to contribute our efforts to everyone and help make Hudi even more
awesome :)


We are now carefully preparing materials for the new RFC. Once we have
finished, we will strictly follow the RFC process described in the Hudi
official documentation to propose the new RFC and share all the details of the
new feature, along with the related code, with everyone. Since we benefit from
the Hudi community, we would like to give back our efforts and help Hudi
benefit even more people!


As always, please stay healthy and safe.


Kind regards,
Xinyao Tian
On 08/6/2022 10:11, Shiyan Xu <xu.shiyan.raym...@gmail.com> wrote:
Hi Xinyao, awesome achievement! And really appreciate your keenness in
contributing to Hudi. Certainly we'd love to see an RFC for this.

On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) <xinyaot...@yeah.net>
wrote:

Greetings everyone,


My name is Xinyao and I'm currently working for an insurance company. We
have found Apache Hudi to be an extremely awesome tool, and when it
cooperates with Apache Flink it becomes even more powerful. We have
therefore been using it for months and keep benefiting from it.


However, there is one feature that we really need but Hudi doesn't
currently have: "multiple event_time fields verification". In the insurance
industry, data is often distributed across dozens of tables that are
conceptually connected by the same primary keys. When the data is used, we
often need to join several or even dozens of tables and stitch the partial
columns together into a single record with dozens or even hundreds of
columns for downstream services to use.
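
To make the scenario concrete, here is a minimal Flink SQL sketch of such a
join (all table and column names are hypothetical, purely for illustration):

  -- Stitch partial columns from several tables, connected by the
  -- same primary key, into one wide record for downstream services.
  SELECT p.policy_id,
         p.premium_amount,
         c.claim_status,
         h.holder_name
  FROM policy p
  JOIN claim  c ON c.policy_id = p.policy_id
  JOIN holder h ON h.policy_id = p.policy_id;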


Here comes the problem. If we want to guarantee that every part of the
joined data is up to date, Hudi must be able to filter on multiple
event_time timestamps within a single table and keep the most recent
records. In this scenario, the single event_time filtering field provided
by Hudi (i.e., the option 'write.precombine.field' in Hudi 0.10.0) is
inadequate. To cope with use cases involving complex joins like the one
above, and to give Hudi the potential to support more application scenarios
and industries, Hudi definitely needs to support filtering on multiple
event_time timestamps in a single table.
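
For illustration, here is how a single precombine field is configured in a
Flink SQL DDL today, alongside a purely hypothetical multi-field variant
(the option name 'write.precombine.fields' below is our own sketch, not an
existing Hudi option):

  CREATE TABLE wide_policy_record (
    policy_id  STRING PRIMARY KEY NOT ENFORCED,
    premium_ts TIMESTAMP(3),
    claim_ts   TIMESTAMP(3)
  ) WITH (
    'connector' = 'hudi',
    'path' = 'hdfs:///tables/wide_policy_record',
    -- Today: only one ordering/dedup field per table.
    'write.precombine.field' = 'premium_ts'
    -- Hypothetical sketch of the proposed feature:
    -- 'write.precombine.fields' = 'premium_ts,claim_ts'
  );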


The good news is that, after more than two months of development, my
colleagues and I have made some changes to the hudi-flink and hudi-common
modules based on Hudi 0.10.0 and have essentially implemented this feature.
Currently, my team is using the enhanced source code, together with Kafka
and Flink 1.13.2, to conduct end-to-end testing on a dataset of more than
140 million real-world insurance records and to verify the accuracy of the
data. The results are quite good: based on our continuous observations over
these weeks, every part of the extremely wide records has been updated to
the latest status. We're very keen to make this new feature available to
everyone. We benefit from the Hudi community, so we really want to give
back to the community with our efforts.


The only question is whether we need to create an RFC to illustrate our
design and implementation in detail. According to the "RFC Process" in the
Hudi official documentation, we must first confirm that this feature does
not already exist before we can create a new RFC to share the concept and
code and explain them in detail. We would therefore really like to create a
new RFC that explains our implementation in detail, with both theory and
code, and makes it easier for everyone to understand and improve upon our
work.


We look forward to your feedback on whether we should create a new RFC, and
to making Hudi better and better to benefit everyone.


Kind regards,
Xinyao Tian



--
Best,
Shiyan
