Re: 0.12.0 Release Timeline

2022-08-05 Thread sagar sumit
Hi folks,

Thanks for voting on RC1.
I will be preparing RC2 by end of day Monday, August 8th, PST,
and I will send out a separate voting email for RC2.

Regards,
Sagar

On Fri, Jul 29, 2022 at 6:08 PM sagar sumit  wrote:

> We can now resume merging to master branch.
> Thanks for your patience.
>
> Regards,
> Sagar
>


Re: [DISCUSS] Diagnostic reporter

2022-08-05 Thread Shiyan Xu
Sure, Zhang Yue, feel free to initiate the RFC!

On Fri, Aug 5, 2022 at 4:57 AM 田昕峣 (Xinyao Tian) 
wrote:

> Hi Shiyan and everyone,
>
>
> This feature is definitely very important. We really need to gather error
> info to fix bugs more efficiently.
>
>
> If there's anything I can help with, please feel free to let me know :)
>
>
> Regards,
> Xinyao
>
>
>
>
> Hi Shiyan and everyone,
> This is a great idea! As a Hudi user, I also struggle with Hudi
> troubleshooting sometimes. This feature would definitely reduce that
> burden.
> So I volunteer to draft a discussion and maybe raise an RFC about it, if
> you don't mind. Thanks :)
>
>
> Yue Zhang
> zhangyue921...@163.com
>
> On 08/3/2022 00:44, 冯健 wrote:
> Maybe we can start this with an audit feature? Since we need some sort of
> "image" to represent the facts, we can create a writer identity to link
> operations together, and in the audit file we can label each operation
> with IP, environment, platform, version, write config, etc.
>
> On Sun, 31 Jul 2022 at 12:18, Shiyan Xu 
> wrote:
>
> To bubble this up
>
> On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar  wrote:
>
> +1 from me.
>
> It will be very useful if we can have something that gathers
> troubleshooting info easily.
> This part currently takes a while.
>
> On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
> wrote:
>
> Hi all,
>
> When troubleshooting Hudi jobs in users' environments, we always ask
> users to share configs and environment info, check the Spark UI, etc.
> Here is an RFC idea: can we extend the Hudi metrics system and make a
> diagnostic reporter?
> It can be turned on like a normal metrics reporter. It should collect
> common troubleshooting info and save it to JSON or another human-readable
> text format. Users should be able to run with it and share the diagnosis
> file.
> The RFC should discuss what info should / can be collected.
>
> Does this make sense? Anyone interested in driving the RFC design and
> implementation work?
>
> --
> Best,
> Shiyan
>
>
> --
> Best,
> Shiyan
>
>

-- 
Best,
Shiyan
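
A minimal sketch of what such a diagnostic reporter could look like, to make
the proposal above concrete. The class name, the fields collected, and the
output format are all assumptions for illustration; none of this is an
existing Hudi API.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class DiagnosticReporter {

      // Gathers basic environment info plus the effective write config.
      public Map<String, String> collect(Map<String, String> writeConfig) {
        Map<String, String> report = new LinkedHashMap<>();
        report.put("java.version", System.getProperty("java.version"));
        report.put("os.name", System.getProperty("os.name"));
        report.put("available.processors",
            String.valueOf(Runtime.getRuntime().availableProcessors()));
        writeConfig.forEach((k, v) -> report.put("write." + k, v));
        return report;
      }

      // Writes the report as simple "key: value" lines that a user can
      // share; JSON or any other human-readable format would work too.
      public void save(Map<String, String> report, String path) throws IOException {
        String text = report.entrySet().stream()
            .map(e -> e.getKey() + ": " + e.getValue())
            .collect(Collectors.joining("\n"));
        Files.write(Paths.get(path), text.getBytes());
      }
    }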


Re: [new RFC Request] The need of Multiple event_time fields verification

2022-08-05 Thread Shiyan Xu
Hi Xinyao, awesome achievement! We really appreciate your keenness to
contribute to Hudi. We'd certainly love to see an RFC for this.

On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) 
wrote:

> Greetings everyone,
>
>
> My name is Xinyao and I'm currently working for an insurance company. We
> found Apache Hudi to be an extremely useful tool, and when it cooperates
> with Apache Flink it can be even more powerful. We have been using it for
> months and keep benefiting from it.
>
>
> However, there is one feature that we really want but Hudi doesn't
> currently have: "Multiple event_time fields verification". In the
> insurance industry, data is often distributed across dozens of tables and
> conceptually connected by the same primary keys. When the data is used, we
> often need to associate several or even dozens of tables through join
> operations, stitching partial columns together into a single record with
> dozens or even hundreds of columns for downstream services to use.
>
>
> Here is the problem: if we want to guarantee that every part of the
> joined data is up to date, Hudi must be able to filter on multiple
> event_time timestamps in a table and keep the most recent records. In
> this scenario, the single event_time filtering field provided by Hudi
> (i.e. the option 'write.precombine.field' in Hudi 0.10.0) is inadequate.
> To cope with complex join use cases like the one above, and to give Hudi
> the potential to support more application scenarios and industries, Hudi
> needs to support filtering on multiple event_time timestamps in a single
> table.
>
>
> The good news is that, after more than two months of development, my
> colleagues and I have made some changes in the hudi-flink and hudi-common
> modules based on Hudi 0.10.0 and have essentially implemented this
> feature. Currently, my team is using the enhanced source code with Kafka
> and Flink 1.13.2 to run end-to-end tests on a dataset of more than 140
> million real-world insurance records and to verify the accuracy of the
> data. The results are quite good: based on our continuous observations
> over these weeks, every part of the extremely wide records has been
> updated to the latest status. We're very keen to make this new feature
> available to everyone. We benefit from the Hudi community, so we really
> want to give back to it with our efforts.
>
>
> The only question is whether we need to create an RFC to illustrate our
> design and implementation in detail. According to the "RFC Process" in
> the official Hudi documentation, we have to confirm that this feature
> does not already exist before we can create a new RFC to share and
> explain the concept and code. We would therefore like to create a new RFC
> that explains our implementation in detail, with theory and code, and
> makes it easier for everyone to understand and improve upon it.
>
>
> We look forward to your feedback on whether we should create a new RFC
> and make Hudi better for everyone.
>
>
> Kind regards,
> Xinyao Tian



-- 
Best,
Shiyan
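
To make the multi event_time idea above concrete, here is a minimal sketch,
under assumed names, of a merge in which each event_time field governs one
group of columns and the newer timestamp wins per group. This is an
illustration only, not Xinyao's actual implementation and not a Hudi API.

    import java.util.List;
    import java.util.Map;

    public class MultiEventTimeMerge {

      // Merges an incoming wide record into the current one. "groups" maps
      // each event_time field to the columns that field governs, e.g.
      // {"policy_ts": ["policy_status"], "claim_ts": ["claim_amount"]}.
      // "current" must be a mutable map; it is updated in place.
      public static Map<String, Object> merge(Map<String, Object> current,
                                              Map<String, Object> incoming,
                                              Map<String, List<String>> groups) {
        groups.forEach((eventTimeField, columns) -> {
          long curTs = ((Number) current.getOrDefault(eventTimeField, 0L)).longValue();
          long newTs = ((Number) incoming.getOrDefault(eventTimeField, 0L)).longValue();
          if (newTs >= curTs) {
            // The incoming side is fresher for this column group: take its values.
            current.put(eventTimeField, newTs);
            columns.forEach(col -> current.put(col, incoming.get(col)));
          } // otherwise keep the current group's values
        });
        return current;
      }
    }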


Re: [DISCUSS] Diagnostic reporter

2022-08-05 Thread Xinyao Tian
Hi Shiyan and everyone,


This feature is definitely very important. We really need to gather error
info to fix bugs more efficiently.


If there's anything I can help with, please feel free to let me know :)


Regards,
Xinyao




Hi Shiyan and everyone,
This is a great idea! As a Hudi user, I also struggle with Hudi
troubleshooting sometimes. This feature would definitely reduce that burden.
So I volunteer to draft a discussion and maybe raise an RFC about it, if you
don't mind. Thanks :)


Yue Zhang
zhangyue921...@163.com


On 08/3/2022 00:44, 冯健 wrote:
Maybe we can start this with an audit feature? Since we need some sort of
"image" to represent the facts, we can create a writer identity to link
operations together, and in the audit file we can label each operation with
IP, environment, platform, version, write config, etc.

On Sun, 31 Jul 2022 at 12:18, Shiyan Xu  wrote:

To bubble this up

On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar  wrote:

+1 from me.

It will be very useful if we can have something that gathers
troubleshooting info easily.
This part currently takes a while.

On Mon, May 30, 2022 at 9:52 AM Shiyan Xu 
wrote:

Hi all,

When troubleshooting Hudi jobs in users' environments, we always ask users
to share configs and environment info, check the Spark UI, etc. Here is an
RFC idea: can we extend the Hudi metrics system and make a diagnostic
reporter?
It can be turned on like a normal metrics reporter. It should collect
common troubleshooting info and save it to JSON or another human-readable
text format. Users should be able to run with it and share the diagnosis
file.
The RFC should discuss what info should / can be collected.

Does this make sense? Anyone interested in driving the RFC design and
implementation work?

--
Best,
Shiyan


--
Best,
Shiyan
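
A rough sketch of what one entry in the audit file proposed by 冯健 above
might carry. Every name here is hypothetical; no such class exists in Hudi.

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class WriterAuditEntry {

      // Builds one audit record for a write operation. The writerId is
      // created once per writer so all its operations can be linked.
      public static Map<String, String> forOperation(String writerId,
                                                     String operation,
                                                     Map<String, String> writeConfig)
          throws UnknownHostException {
        Map<String, String> entry = new LinkedHashMap<>();
        entry.put("writerId", writerId);
        entry.put("operation", operation);  // e.g. "upsert" or "compact"
        entry.put("ip", InetAddress.getLocalHost().getHostAddress());
        entry.put("platform", System.getProperty("os.name"));
        entry.put("javaVersion", System.getProperty("java.version"));
        entry.put("writeConfig", writeConfig.toString());
        return entry;
      }
    }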



[new RFC Request] The need of Multiple event_time fields verification

2022-08-05 Thread Xinyao Tian
Greetings everyone,


My name is Xinyao and I'm currently working for an insurance company. We
found Apache Hudi to be an extremely useful tool, and when it cooperates
with Apache Flink it can be even more powerful. We have been using it for
months and keep benefiting from it.


However, there is one feature that we really want but Hudi doesn't
currently have: "Multiple event_time fields verification". In the insurance
industry, data is often distributed across dozens of tables and
conceptually connected by the same primary keys. When the data is used, we
often need to associate several or even dozens of tables through join
operations, stitching partial columns together into a single record with
dozens or even hundreds of columns for downstream services to use.


Here is the problem: if we want to guarantee that every part of the joined
data is up to date, Hudi must be able to filter on multiple event_time
timestamps in a table and keep the most recent records. In this scenario,
the single event_time filtering field provided by Hudi (i.e. the option
'write.precombine.field' in Hudi 0.10.0) is inadequate. To cope with
complex join use cases like the one above, and to give Hudi the potential
to support more application scenarios and industries, Hudi needs to support
filtering on multiple event_time timestamps in a single table.


The good news is that, after more than two months of development, my
colleagues and I have made some changes in the hudi-flink and hudi-common
modules based on Hudi 0.10.0 and have essentially implemented this feature.
Currently, my team is using the enhanced source code with Kafka and Flink
1.13.2 to run end-to-end tests on a dataset of more than 140 million
real-world insurance records and to verify the accuracy of the data. The
results are quite good: based on our continuous observations over these
weeks, every part of the extremely wide records has been updated to the
latest status. We're very keen to make this new feature available to
everyone. We benefit from the Hudi community, so we really want to give
back to it with our efforts.


The only question is whether we need to create an RFC to illustrate our
design and implementation in detail. According to the "RFC Process" in the
official Hudi documentation, we have to confirm that this feature does not
already exist before we can create a new RFC to share and explain the
concept and code. We would therefore like to create a new RFC that explains
our implementation in detail, with theory and code, and makes it easier for
everyone to understand and improve upon it.


We look forward to your feedback on whether we should create a new RFC and
make Hudi better for everyone.


Kind regards,
Xinyao Tian
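
For contrast, a small sketch of today's single ordering field next to what
a multi-field option might look like. 'write.precombine.field' is the real
option mentioned above; the 'write.precombine.fields' key and its group
syntax are invented here purely for illustration.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class PrecombineOptionsSketch {
      public static void main(String[] args) {
        // Today: one ordering field decides freshness for the whole row.
        Map<String, String> current = new LinkedHashMap<>();
        current.put("write.precombine.field", "ts");

        // Hypothetical: one event_time per column group from each joined table.
        Map<String, String> proposed = new LinkedHashMap<>();
        proposed.put("write.precombine.fields",
            "policy_ts:policy_status;claim_ts:claim_amount");

        System.out.println(current);
        System.out.println(proposed);
      }
    }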