Hi Xinyao, awesome achievement! We really appreciate your keenness to
contribute to Hudi. Certainly we'd love to see an RFC for this.

On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) <xinyaot...@yeah.net>
wrote:

> Greetings everyone,
>
>
> My name is Xinyao and I'm currently working for an insurance company. We
> have found Apache Hudi to be an extremely useful tool, and it becomes even
> more powerful when combined with Apache Flink. We have been using it for
> months and continue to benefit from it.
>
>
> However, there is one feature that we really desire but Hudi doesn't
> currently have: "multiple event_time fields verification". In the
> insurance industry, data is often distributed across dozens of tables that
> are conceptually connected by the same primary keys. When the data is
> used, we often need to join several or even dozens of tables and stitch
> the partial columns into a single record with dozens or even hundreds of
> columns for downstream services to use.
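>
> To make the scenario concrete, here is a minimal Flink Table API sketch
> (not our actual code; the table and column names such as policies, claims
> and policy_id are hypothetical) of the kind of join we run to stitch
> partial rows into one wide record, where each source table carries its own
> event_time column:
>
> // Minimal sketch: joining two Hudi tables on a shared primary key.
> // Table/column names are hypothetical; real pipelines join dozens of tables.
> import org.apache.flink.table.api.EnvironmentSettings;
> import org.apache.flink.table.api.TableEnvironment;
>
> public class WideRowJoinSketch {
>     public static void main(String[] args) {
>         TableEnvironment tEnv = TableEnvironment.create(
>             EnvironmentSettings.newInstance().inStreamingMode().build());
>
>         // Each source table has its own event_time column.
>         tEnv.executeSql(
>             "CREATE TABLE policies (policy_id STRING, holder STRING, "
>           + "policy_event_time TIMESTAMP(3)) WITH ("
>           + "'connector' = 'hudi', 'path' = 'hdfs:///tables/policies')");
>         tEnv.executeSql(
>             "CREATE TABLE claims (policy_id STRING, claim_amount DECIMAL(10,2), "
>           + "claim_event_time TIMESTAMP(3)) WITH ("
>           + "'connector' = 'hudi', 'path' = 'hdfs:///tables/claims')");
>
>         // Stitch the partial columns into one wide record keyed by policy_id.
>         tEnv.executeSql(
>             "CREATE TEMPORARY VIEW policy_wide AS "
>           + "SELECT p.policy_id, p.holder, p.policy_event_time, "
>           + "       c.claim_amount, c.claim_event_time "
>           + "FROM policies p JOIN claims c ON p.policy_id = c.policy_id");
>     }
> }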
>
>
> Here is the problem. If we want to guarantee that every part of the joined
> data is up to date, Hudi must be able to filter on multiple event_time
> timestamps within a table and keep only the most recent records. In this
> scenario, the single event_time filtering field that Hudi provides (i.e.,
> the 'write.precombine.field' option in Hudi 0.10.0) is inadequate. To cope
> with use cases involving complex joins like the one above, and to open
> Hudi up to more application scenarios and industries, Hudi needs to
> support filtering on multiple event_time timestamps within a single table.
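>
> For reference, this is how the single precombine field is configured with
> the Hudi Flink connector today (a minimal sketch, assuming Hudi 0.10.x
> option names; the table definition is hypothetical). Only one of the
> event_time columns can be chosen, which is exactly the limitation
> described above; the multi-field option named in the comment is purely
> illustrative and would be defined by the RFC:
>
> import org.apache.flink.table.api.EnvironmentSettings;
> import org.apache.flink.table.api.TableEnvironment;
>
> public class SinglePrecombineSketch {
>     public static void main(String[] args) {
>         TableEnvironment tEnv = TableEnvironment.create(
>             EnvironmentSettings.newInstance().inStreamingMode().build());
>
>         // Today: only ONE column can serve as the precombine (event_time) field,
>         // even though the wide table carries several event_time columns.
>         tEnv.executeSql(
>             "CREATE TABLE policy_wide ("
>           + "  policy_id STRING,"
>           + "  claim_amount DECIMAL(10,2),"
>           + "  claim_event_time TIMESTAMP(3),"
>           + "  payment_event_time TIMESTAMP(3),"
>           + "  PRIMARY KEY (policy_id) NOT ENFORCED"
>           + ") WITH ("
>           + "  'connector' = 'hudi',"
>           + "  'path' = 'hdfs:///tables/policy_wide',"
>           + "  'table.type' = 'MERGE_ON_READ',"
>           + "  'write.precombine.field' = 'claim_event_time'" // single field only
>           + ")");
>         // Hypothetical extension (name and semantics to be defined in the RFC),
>         // e.g. 'write.precombine.fields' = 'claim_event_time,payment_event_time'.
>     }
> }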
>
>
> The good news is that, after more than two months of development, my
> colleagues and I have made changes to the hudi-flink and hudi-common
> modules based on Hudi 0.10.0 and have essentially implemented this
> feature. My team is currently running the enhanced source code with Kafka
> and Flink 1.13.2 to conduct end-to-end tests on a dataset of more than 140
> million real-world insurance records and to verify the accuracy of the
> data. The results are quite good: based on our continuous observations
> over the past weeks, every part of these extremely wide records has been
> updated to its latest state. We're very keen to make this new feature
> available to everyone. We have benefited from the Hudi community, so we
> would really like to give back with our efforts.
>
>
> The only question is whether we should create an RFC to illustrate our
> design and implementation in detail. According to the "RFC Process" in the
> official Hudi documentation, we first have to confirm that this feature
> does not already exist before creating a new RFC to share the concept and
> code and explain them in detail. We would really like to create a new RFC
> that explains our implementation in detail, with both the theory and the
> code, and makes it easy for everyone to understand it and build
> improvements on top of it.
>
>
> We look forward to your feedback on whether we should create a new RFC and
> help make Hudi even better for everyone.
>
>
> Kind regards,
> Xinyao Tian



-- 
Best,
Shiyan
