Re: 0.12.0 Release Timeline
Hi folks,

Thanks for voting on RC1. I will be preparing RC2 by end of day PST on Monday, August 8th, and I will send out a separate voting email for RC2.

Regards,
Sagar

On Fri, Jul 29, 2022 at 6:08 PM sagar sumit wrote:
> We can now resume merging to the master branch.
> Thanks for your patience.
>
> Regards,
> Sagar
Re: [DISCUSS] Diagnostic reporter
Sure, Zhang Yue, feel free to initiate the RFC!

On Fri, Aug 5, 2022 at 4:57 AM 田昕峣 (Xinyao Tian) wrote:
> Hi Shiyan and everyone,
>
> This feature is definitely very important. We really need to gather error
> info to fix bugs more efficiently.
>
> If there is anything I can help with, please feel free to let me know. :)
>
> Regards,
> Xinyao
>
> Hi Shiyan and everyone,
> This is a great idea! As a Hudi user, I also struggle with Hudi
> troubleshooting sometimes. This feature would definitely reduce that
> burden. So I volunteer to draft a discussion and maybe raise an RFC about
> it, if you don't mind. Thanks :)
>
> Yue Zhang
> zhangyue921...@163.com
>
> On 08/3/2022 00:44, 冯健 wrote:
> Maybe we can start this with an audit feature? Since we need some sort of
> "images" to represent "facts", we can create an identity for a writer to
> link them, and in this audit file we can label each operation with IP,
> environment, platform, version, write config, etc.
>
> On Sun, 31 Jul 2022 at 12:18, Shiyan Xu wrote:
> To bubble this up.
>
> On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar wrote:
> +1 from me.
>
> It will be very useful if we can have something that gathers
> troubleshooting info easily. This part currently takes a while.
>
> On Mon, May 30, 2022 at 9:52 AM Shiyan Xu wrote:
> Hi all,
>
> When troubleshooting Hudi jobs in users' environments, we always ask
> users to share configs and environment info, check the Spark UI, etc.
> Here is an RFC idea: can we extend the Hudi metrics system and make a
> diagnostic reporter? It could be turned on like a normal metrics
> reporter. It should collect common troubleshooting info and save it to
> JSON or another human-readable text format. Users should be able to run
> with it and share the diagnosis file. The RFC should discuss what info
> should / can be collected.
>
> Does this make sense? Anyone interested in driving the RFC design and
> implementation work?
>
> --
> Best,
> Shiyan

--
Best,
Shiyan
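The reporter proposed in this thread — collect common troubleshooting facts and save them in a human-readable format that users can share — could look roughly like the following standalone Java sketch. All names here (DiagnosticReporter, collect, toJson) are invented for illustration and are not actual Hudi APIs, and the JSON rendering is deliberately naive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a diagnostic reporter: gathers key/value
// troubleshooting facts (configs, environment info, versions) and
// renders them as a human-readable JSON string for users to share.
public class DiagnosticReporter {
    // LinkedHashMap keeps insertion order, so the report is stable.
    private final Map<String, String> info = new LinkedHashMap<>();

    public void collect(String key, String value) {
        info.put(key, value);
    }

    // Render collected info as a flat JSON object.
    // Escaping is limited to double quotes, for brevity.
    public String toJson() {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : info.entrySet()) {
            if (!first) sb.append(",");
            sb.append("\"").append(e.getKey()).append("\":\"")
              .append(e.getValue().replace("\"", "\\\"")).append("\"");
            first = false;
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        DiagnosticReporter reporter = new DiagnosticReporter();
        reporter.collect("hoodie.table.name", "trips");
        reporter.collect("java.version", System.getProperty("java.version"));
        System.out.println(reporter.toJson());
    }
}
```

A real implementation would presumably plug into Hudi's existing metrics reporter abstraction and write the diagnosis file alongside the job's logs, which is exactly the kind of detail the RFC would need to pin down.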
Re: [new RFC Request] The need for multiple event_time fields verification
Hi Xinyao, awesome achievement! And we really appreciate your keenness in contributing to Hudi. We'd certainly love to see an RFC for this.

On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) wrote:
> Greetings everyone,
>
> My name is Xinyao and I'm currently working for an insurance company. We
> found that Apache Hudi is an extremely awesome utility, and when it
> cooperates with Apache Flink it can be even more powerful. Thus, we have
> been using it for months and keep benefiting from it.
>
> However, there is one feature that we really desire but Hudi doesn't
> currently have, called "multiple event_time fields verification". In the
> insurance industry, data is often stored distributed across dozens of
> tables that are conceptually connected by the same primary keys. When the
> data is used, we often need to associate several or even dozens of tables
> through join operations, stitching all the partial columns into an entire
> record with dozens or even hundreds of columns for downstream services.
>
> Here is the problem: if we want to guarantee that every part of the
> joined data is up to date, Hudi must be able to filter on multiple
> event_time timestamps in a table and keep the most recent records. In
> this scenario, the single event_time filtering field provided by Hudi
> (i.e. the option 'write.precombine.field' in Hudi 0.10.0) is inadequate.
> To cope with use cases involving complex joins like the above, and to
> give Hudi the potential to support more application scenarios and
> industries, Hudi needs to support filtering on multiple event_time
> timestamps in a single table.
>
> The good news is that, after more than two months of development, my
> colleagues and I have made changes in the hudi-flink and hudi-common
> modules based on Hudi 0.10.0 and have largely achieved this feature.
> Currently, my team is using the enhanced source code with Kafka and
> Flink 1.13.2 to run end-to-end tests on a dataset of more than 140
> million real-world insurance records and to verify the accuracy of the
> data. The results are quite good: based on our continuous observations
> over these weeks, every part of the extremely wide records has been
> updated to the latest status. We're very keen to make this new feature
> available to everyone. We benefit from the Hudi community, so we really
> want to give back to the community with our efforts.
>
> The only question is whether we need to create an RFC to illustrate our
> design and implementation in detail. According to the "RFC Process" in
> the Hudi official documentation, we have to confirm that this feature
> does not already exist before we can create a new RFC to share the
> concept and code and explain them in detail. Thus, we would really like
> to create a new RFC that explains our implementation in detail with
> theory and code, and makes it easier for everyone to understand and
> improve on it.
>
> Looking forward to your feedback on whether we should create a new RFC
> and make Hudi better and better to benefit everyone.
>
> Kind regards,
> Xinyao Tian

--
Best,
Shiyan
Re: [DISCUSS] Diagnostic reporter
Hi Shiyan and everyone,

This feature is definitely very important. We really need to gather error info to fix bugs more efficiently.

If there is anything I can help with, please feel free to let me know. :)

Regards,
Xinyao

Hi Shiyan and everyone,
This is a great idea! As a Hudi user, I also struggle with Hudi troubleshooting sometimes. This feature would definitely reduce that burden. So I volunteer to draft a discussion and maybe raise an RFC about it, if you don't mind. Thanks :)

Yue Zhang
zhangyue921...@163.com

On 08/3/2022 00:44, 冯健 wrote:
Maybe we can start this with an audit feature? Since we need some sort of "images" to represent "facts", we can create an identity for a writer to link them, and in this audit file we can label each operation with IP, environment, platform, version, write config, etc.

On Sun, 31 Jul 2022 at 12:18, Shiyan Xu wrote:
To bubble this up.

On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar wrote:
+1 from me.

It will be very useful if we can have something that gathers troubleshooting info easily. This part currently takes a while.

On Mon, May 30, 2022 at 9:52 AM Shiyan Xu wrote:
Hi all,

When troubleshooting Hudi jobs in users' environments, we always ask users to share configs and environment info, check the Spark UI, etc. Here is an RFC idea: can we extend the Hudi metrics system and make a diagnostic reporter? It could be turned on like a normal metrics reporter. It should collect common troubleshooting info and save it to JSON or another human-readable text format. Users should be able to run with it and share the diagnosis file. The RFC should discuss what info should / can be collected.

Does this make sense? Anyone interested in driving the RFC design and implementation work?

--
Best,
Shiyan
[new RFC Request] The need for multiple event_time fields verification
Greetings everyone,

My name is Xinyao and I'm currently working for an insurance company. We found that Apache Hudi is an extremely awesome utility, and when it cooperates with Apache Flink it can be even more powerful. Thus, we have been using it for months and keep benefiting from it.

However, there is one feature that we really desire but Hudi doesn't currently have, called "multiple event_time fields verification". In the insurance industry, data is often stored distributed across dozens of tables that are conceptually connected by the same primary keys. When the data is used, we often need to associate several or even dozens of tables through join operations, stitching all the partial columns into an entire record with dozens or even hundreds of columns for downstream services.

Here is the problem: if we want to guarantee that every part of the joined data is up to date, Hudi must be able to filter on multiple event_time timestamps in a table and keep the most recent records. In this scenario, the single event_time filtering field provided by Hudi (i.e. the option 'write.precombine.field' in Hudi 0.10.0) is inadequate. To cope with use cases involving complex joins like the above, and to give Hudi the potential to support more application scenarios and industries, Hudi needs to support filtering on multiple event_time timestamps in a single table.

The good news is that, after more than two months of development, my colleagues and I have made changes in the hudi-flink and hudi-common modules based on Hudi 0.10.0 and have largely achieved this feature. Currently, my team is using the enhanced source code with Kafka and Flink 1.13.2 to run end-to-end tests on a dataset of more than 140 million real-world insurance records and to verify the accuracy of the data. The results are quite good: based on our continuous observations over these weeks, every part of the extremely wide records has been updated to the latest status. We're very keen to make this new feature available to everyone. We benefit from the Hudi community, so we really want to give back to the community with our efforts.

The only question is whether we need to create an RFC to illustrate our design and implementation in detail. According to the "RFC Process" in the Hudi official documentation, we have to confirm that this feature does not already exist before we can create a new RFC to share the concept and code and explain them in detail. Thus, we would really like to create a new RFC that explains our implementation in detail with theory and code, and makes it easier for everyone to understand and improve on it.

Looking forward to your feedback on whether we should create a new RFC and make Hudi better and better to benefit everyone.

Kind regards,
Xinyao Tian
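To make the requested semantics concrete, here is a minimal, self-contained Java sketch of the merge rule described above: each column group carries its own event_time field, and on merge the newer side wins per group rather than per whole record (as a single precombine field would). This is an illustration only, not the actual hudi-flink/hudi-common implementation; the class and the group maps are hypothetical names invented for this sketch, and both records are assumed to carry all fields.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-group event_time merging: instead of one
// global precombine field deciding which whole record survives, each
// column group is compared on its own event_time field, and the newer
// side's values are kept for that group only.
public class MultiEventTimeMerger {

    // oldRec/newRec: flat field->value maps holding all columns.
    // groups: group name -> the data columns belonging to that group.
    // groupEventTimeField: group name -> the field holding its event_time.
    // Ties go to the incoming (new) record.
    public static Map<String, Object> merge(Map<String, Object> oldRec,
                                            Map<String, Object> newRec,
                                            Map<String, String[]> groups,
                                            Map<String, String> groupEventTimeField) {
        Map<String, Object> merged = new HashMap<>(oldRec);
        for (Map.Entry<String, String> e : groupEventTimeField.entrySet()) {
            String group = e.getKey();
            String tsField = e.getValue();
            long oldTs = ((Number) oldRec.getOrDefault(tsField, Long.MIN_VALUE)).longValue();
            long newTs = ((Number) newRec.getOrDefault(tsField, Long.MIN_VALUE)).longValue();
            if (newTs >= oldTs) {
                // The incoming record is fresher for this group: take its
                // columns and its event_time, leaving other groups untouched.
                for (String col : groups.get(group)) {
                    merged.put(col, newRec.get(col));
                }
                merged.put(tsField, newTs);
            }
        }
        return merged;
    }
}
```

With, say, a "policy" group and a "claim" group on the same primary key, one incoming record can refresh the policy columns while the claim columns keep their older but still-newest values, which is the behavior the single write.precombine.field cannot express.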