Re: [DISCUSS] [RFC] Hudi bundle standards
Hi Shiyan, Having carefully read RFC-63.md on the PR, I really think this feature is crucial for everyone who builds Hudi from source. For example, when I tried to compile Hudi 0.12.0 with Flink 1.15, I used the command `mvn clean package -DskipTests -Dflink1.15 -Dscala-2.12` but still got the Flink 1.14 bundle. Also, the documentation highlights support for Flink 1.15, but the Hudi 0.12.0 GitHub README.md doesn't mention Flink 1.15 anywhere in its compile section. All in all, there are many misleading points around Hudi bundles, and they need to be addressed as soon as possible. I really appreciate having this RFC to solve all these problems.

On 10/10/2022 13:36, Shiyan Xu wrote: Hi Hudi devs and users, I've raised an RFC around Hudi bundles, aiming to address issues around dependency conflicts and to establish standards for bundle jar usage and the change process. Please have a look. Thanks! https://github.com/apache/hudi/pull/6902 -- Best, Shiyan
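For anyone hitting the same issue, here is a hedged sketch of how the build selection can be double-checked. The profile name and the bundle path below are assumptions based on the 0.12.x source layout; verify them against the README and pom.xml of the exact version you are building.

```shell
# Try activating the Flink profile explicitly with -P in addition to the
# -D property, in case property-based profile activation does not kick in
# (profile name is an assumption; check the pom.xml of your Hudi version):
mvn clean package -DskipTests -Dscala-2.12 -Pflink1.15

# Then confirm which Flink version the built bundle actually targets by
# inspecting the produced jar name (path is illustrative):
ls packaging/hudi-flink-bundle/target/hudi-flink*-bundle_*.jar
```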
Re: Release managers for 0.12.1 and 0.13.0
Got it! I will volunteer next time. Thanks in advance for the effort by Zhaojing and Ethan :)

On 08/30/2022 09:24, Shiyan Xu wrote: Hi Xinyao, thank you for being keen on this. Zhaojing and Ethan have already volunteered for the RM roles.

On Sun, Aug 28, 2022 at 8:56 PM 田昕峣 (Xinyao Tian) wrote: Hi Shiyan, I would also like to be an RM and help the team manage these things. Although I just graduated from university and definitely lack experience in project management, I promise I'll do my best if there's still a position. Regards, Xinyao

On 08/26/2022 07:02, Shiyan Xu wrote: Hi everyone, As we finished 0.12.0, we're planning the next major and minor releases, so I would like to call for volunteers to be the release managers. The RM plays a very important role in ensuring the release schedule is followed through, the targeted features/tickets are properly closed, and the release notes are well prepared. If you haven't been an RM before, a PMC member will help you along the way. Please respond to this thread to claim the role. -- Best, Shiyan -- Best, Shiyan
Re: Release managers for 0.12.1 and 0.13.0
Hi Shiyan, I would also like to be an RM and help the team manage these things. Although I just graduated from university and definitely lack experience in project management, I promise I'll do my best if there's still a position. Regards, Xinyao

On 08/26/2022 07:02, Shiyan Xu wrote: Hi everyone, As we finished 0.12.0, we're planning the next major and minor releases, so I would like to call for volunteers to be the release managers. The RM plays a very important role in ensuring the release schedule is followed through, the targeted features/tickets are properly closed, and the release notes are well prepared. If you haven't been an RM before, a PMC member will help you along the way. Please respond to this thread to claim the role. -- Best, Shiyan
Re: [new RFC Request] The need of Multiple event_time fields verification
Thank you in advance for your time and effort! Since we couldn't find any way to invite code reviewers on the PR, we mentioned you in a comment. The URL of our PR is: https://github.com/apache/hudi/pull/6382 By the way, could you please tell us whether there is a better way to notify you and others in the project? The only way we knew was to mention GitHub accounts on the PR page, so if there's a better channel (Jira, the dev mailing list, or something else) we'd really like to know :) Also, for your convenience, we exported the RFC material with pictures to a PDF for easier reading. I'll send it to you through Slack because it can't be sent by email. Regards, Xinyao

On 08/17/2022 02:11, Sivabalan wrote: yes, sounds good. Appreciate that. My GitHub profile is https://github.com/nsivabalan

On Fri, 12 Aug 2022 at 01:25, 田昕峣 (Xinyao Tian) wrote: Hi Sivabalan, Hope you are doing well. As promised, we finished writing the RFC proposal and are now ready to submit it as a PR with confidence. According to the RFC process, to have our carefully designed RFC proposal reviewed, we need to add at least two PMC members as reviewers. Therefore, we would sincerely like to invite you as one of the reviewers to examine our RFC proposal and give us comments and feedback. We put a lot of effort into writing this RFC proposal, and you did not hesitate to sacrifice your time to help us land our first PR and give valuable comments, so we sincerely hope you will accept our invitation and let me put your GitHub account in the RFC. Likewise, if you have other suggested candidates, we'd be happy to invite them as reviewers, since there is no limit on the number of reviewers. Wishing you all the best and looking forward to your reply. Sincerely, Xinyao Tian

On 08/9/2022 21:46, Sivabalan wrote: Eagerly looking forward to the RFC, Xinyao. I definitely see a lot of folks benefitting from this.
On Sun, 7 Aug 2022 at 20:00, 田昕峣 (Xinyao Tian) wrote: Hi Shiyan, Thanks so much for your feedback and your kind encouragement! It's always our honor to contribute our effort to everyone and make Hudi even more awesome :) We are now carefully preparing materials for the new RFC. Once we finish, we will strictly follow the RFC process shown in the official Hudi documentation to propose the new RFC and share all details of the new feature, as well as the related code, with everyone. Since we benefit from the Hudi community, we would like to give back to the community and make Hudi benefit more people! As always, please stay healthy and keep safe. Kind regards, Xinyao Tian

On 08/6/2022 10:11, Shiyan Xu wrote: Hi Xinyao, awesome achievement! And I really appreciate your keenness in contributing to Hudi. Certainly we'd love to see an RFC for this.

On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) wrote: Greetings everyone, My name is Xinyao and I'm currently working for an insurance company. We found that Apache Hudi is an extremely awesome utility, and when it cooperates with Apache Flink it can be even more powerful. Thus, we have been using it for months and still keep benefiting from it. However, there is one feature that we really desire but Hudi doesn't currently have: "Multiple event_time fields verification". In the insurance industry, data is often stored distributed across dozens of tables and conceptually connected by the same primary keys. When the data is used, we often need to associate several or even dozens of tables through join operations and stitch all the partial columns into an entire record with dozens or even hundreds of columns for downstream services to use. Here comes the problem: if we want to guarantee that every part of the joined data is up to date, Hudi must be able to filter on multiple event_time timestamps in a table and keep the most recent records. In this scenario, the single event_time filtering field provided by Hudi (i.e. the option 'write.precombine.field' in Hudi 0.10.0) is a bit inadequate. To cope with complex join use cases like the above, and to give Hudi the potential to support more application scenarios and industries, Hudi definitely needs to support filtering on multiple event_time timestamps in a single table. The good news is that, after more than two months of development, my colleagues and I have made some changes in the hudi-flink and hudi-common modules based on hudi-0.10.0 and have basically achieved this feature. Currently, my team is using the enhanced source code with Kafka and Flink 1.13.2 to run end-to-end tests on a dataset of more than 140 million real-world insurance records and verify the accuracy of the data. The result is quite good: every part of the extremely wide records has been updated to the latest status based on our continuous observations during these weeks.
Re: [new RFC Request] The need of Multiple event_time fields verification
Hi Sivabalan, Hope you are doing well. As promised, we finished writing the RFC proposal and are now ready to submit it as a PR with confidence. According to the RFC process, to have our carefully designed RFC proposal reviewed, we need to add at least two PMC members as reviewers. Therefore, we would sincerely like to invite you as one of the reviewers to examine our RFC proposal and give us comments and feedback. We put a lot of effort into writing this RFC proposal, and you did not hesitate to sacrifice your time to help us land our first PR and give valuable comments, so we sincerely hope you will accept our invitation and let me put your GitHub account in the RFC. Likewise, if you have other suggested candidates, we'd be happy to invite them as reviewers, since there is no limit on the number of reviewers. Wishing you all the best and looking forward to your reply. Sincerely, Xinyao Tian

On 08/9/2022 21:46, Sivabalan wrote: Eagerly looking forward to the RFC, Xinyao. I definitely see a lot of folks benefitting from this.

On Sun, 7 Aug 2022 at 20:00, 田昕峣 (Xinyao Tian) wrote: Hi Shiyan, Thanks so much for your feedback and your kind encouragement! It's always our honor to contribute our effort to everyone and make Hudi even more awesome :) We are now carefully preparing materials for the new RFC. Once we finish, we will strictly follow the RFC process shown in the official Hudi documentation to propose the new RFC and share all details of the new feature, as well as the related code, with everyone. Since we benefit from the Hudi community, we would like to give back to the community and make Hudi benefit more people! As always, please stay healthy and keep safe. Kind regards, Xinyao Tian

On 08/6/2022 10:11, Shiyan Xu wrote: Hi Xinyao, awesome achievement! And I really appreciate your keenness in contributing to Hudi. Certainly we'd love to see an RFC for this.
On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) wrote: Greetings everyone, My name is Xinyao and I'm currently working for an insurance company. We found that Apache Hudi is an extremely awesome utility, and when it cooperates with Apache Flink it can be even more powerful. Thus, we have been using it for months and still keep benefiting from it. However, there is one feature that we really desire but Hudi doesn't currently have: "Multiple event_time fields verification". In the insurance industry, data is often stored distributed across dozens of tables and conceptually connected by the same primary keys. When the data is used, we often need to associate several or even dozens of tables through join operations and stitch all the partial columns into an entire record with dozens or even hundreds of columns for downstream services to use. Here comes the problem: if we want to guarantee that every part of the joined data is up to date, Hudi must be able to filter on multiple event_time timestamps in a table and keep the most recent records. In this scenario, the single event_time filtering field provided by Hudi (i.e. the option 'write.precombine.field' in Hudi 0.10.0) is a bit inadequate. To cope with complex join use cases like the above, and to give Hudi the potential to support more application scenarios and industries, Hudi definitely needs to support filtering on multiple event_time timestamps in a single table. The good news is that, after more than two months of development, my colleagues and I have made some changes in the hudi-flink and hudi-common modules based on hudi-0.10.0 and have basically achieved this feature. Currently, my team is using the enhanced source code with Kafka and Flink 1.13.2 to run end-to-end tests on a dataset of more than 140 million real-world insurance records and verify the accuracy of the data. The result is quite good: every part of the extremely wide records has been updated to the latest status based on our continuous observations during these weeks. We're very keen to make this new feature available to everyone. We benefit from the Hudi community, so we really want to give back to the community with our efforts. The only problem is that we are not sure whether we need to create an RFC to illustrate our design and implementation in detail. According to the "RFC Process" in the Hudi official documentation, we have to confirm that this feature does not already exist before we can create a new RFC to share the concept and code and explain them in detail. Thus, we would really like to create a new RFC that explains our implementation in detail with theory and code and makes it easier for everyone to understand and improve upon. Looking forward to your feedback on whether we should create a new RFC and make Hudi better and better to benefit everyone.
Re: [new RFC Request] The need of Multiple event_time fields verification
Hi Shiyan, Hope you are doing well. As promised, we finished writing the RFC proposal and are now ready to submit it as a PR with confidence. According to the RFC process, to have our carefully designed RFC proposal reviewed, we need to add at least two PMC members as reviewers. Therefore, we would sincerely like to invite you as one of the reviewers to examine our RFC proposal and give us comments and feedback. We put a lot of effort into writing this RFC proposal, and you are the first person who gave us feedback at the very beginning stage, so we sincerely hope you will accept our invitation and let me put your GitHub account in the RFC. Likewise, if you have other suggested candidates, we'd be happy to invite them as reviewers, since there is no limit on the number of reviewers. Wishing you all the best and looking forward to your reply. Sincerely, Xinyao Tian

On 08/6/2022 10:11, Shiyan Xu wrote: Hi Xinyao, awesome achievement! And I really appreciate your keenness in contributing to Hudi. Certainly we'd love to see an RFC for this.

On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) wrote: Greetings everyone, My name is Xinyao and I'm currently working for an insurance company. We found that Apache Hudi is an extremely awesome utility, and when it cooperates with Apache Flink it can be even more powerful. Thus, we have been using it for months and still keep benefiting from it. However, there is one feature that we really desire but Hudi doesn't currently have: "Multiple event_time fields verification". In the insurance industry, data is often stored distributed across dozens of tables and conceptually connected by the same primary keys. When the data is used, we often need to associate several or even dozens of tables through join operations and stitch all the partial columns into an entire record with dozens or even hundreds of columns for downstream services to use. Here comes the problem: if we want to guarantee that every part of the joined data is up to date, Hudi must be able to filter on multiple event_time timestamps in a table and keep the most recent records. In this scenario, the single event_time filtering field provided by Hudi (i.e. the option 'write.precombine.field' in Hudi 0.10.0) is a bit inadequate. To cope with complex join use cases like the above, and to give Hudi the potential to support more application scenarios and industries, Hudi definitely needs to support filtering on multiple event_time timestamps in a single table. The good news is that, after more than two months of development, my colleagues and I have made some changes in the hudi-flink and hudi-common modules based on hudi-0.10.0 and have basically achieved this feature. Currently, my team is using the enhanced source code with Kafka and Flink 1.13.2 to run end-to-end tests on a dataset of more than 140 million real-world insurance records and verify the accuracy of the data. The result is quite good: every part of the extremely wide records has been updated to the latest status based on our continuous observations during these weeks. We're very keen to make this new feature available to everyone. We benefit from the Hudi community, so we really want to give back to the community with our efforts. The only problem is that we are not sure whether we need to create an RFC to illustrate our design and implementation in detail. According to the "RFC Process" in the Hudi official documentation, we have to confirm that this feature does not already exist before we can create a new RFC to share the concept and code and explain them in detail. Thus, we would really like to create a new RFC that explains our implementation in detail with theory and code and makes it easier for everyone to understand and improve upon. Looking forward to your feedback on whether we should create a new RFC and make Hudi better and better to benefit everyone. Kind regards, Xinyao Tian -- Best, Shiyan
Re: [new RFC Request] The need of Multiple event_time fields verification
Just saw that the PR has been approved. Thanks a lot for your time! We will submit the RFC materials as soon as possible (within a few days, to the best of our ability). Looking forward to your further feedback then. Have a good day :) Xinyao

On 08/10/2022 13:44, Sivabalan wrote: sure. Approved and landed!

On Tue, 9 Aug 2022 at 18:55, 田昕峣 (Xinyao Tian) wrote: Hi Sivabalan, Thanks for your kind words! We have been working very hard to prepare materials for the RFC this week since we got your feedback on our idea, and I promise it will be very soon (within a few days) that everyone can read our RFC and learn every detail about this feature. It's our pleasure to make Hudi even more powerful by making this feature available to everyone. However, there is one thing we really need your help with. According to the RFC process shown in the Hudi docs, we have to first raise a PR and add an entry to rfc/README.md. But since this is the first time we have raised a PR to Hudi, a maintainer with write permission needs to approve it. We have been waiting for days, but the PR is still pending. Therefore, may I ask you to help us approve our first PR so that we can submit our further materials to Hudi? The URL of our pending PR is: https://github.com/apache/hudi/pull/6328 and the corresponding Jira is: https://issues.apache.org/jira/browse/HUDI-4569 Thank you so much for your help :) Kind regards, Xinyao Tian

On 08/9/2022 21:46, Sivabalan wrote: Eagerly looking forward to the RFC, Xinyao. I definitely see a lot of folks benefitting from this.

On Sun, 7 Aug 2022 at 20:00, 田昕峣 (Xinyao Tian) wrote: Hi Shiyan, Thanks so much for your feedback and your kind encouragement! It's always our honor to contribute our effort to everyone and make Hudi even more awesome :) We are now carefully preparing materials for the new RFC.
Once we finish, we will strictly follow the RFC process shown in the official Hudi documentation to propose the new RFC and share all details of the new feature, as well as the related code, with everyone. Since we benefit from the Hudi community, we would like to give back to the community and make Hudi benefit more people! As always, please stay healthy and keep safe. Kind regards, Xinyao Tian

On 08/6/2022 10:11, Shiyan Xu wrote: Hi Xinyao, awesome achievement! And I really appreciate your keenness in contributing to Hudi. Certainly we'd love to see an RFC for this.

On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) wrote: Greetings everyone, My name is Xinyao and I'm currently working for an insurance company. We found that Apache Hudi is an extremely awesome utility, and when it cooperates with Apache Flink it can be even more powerful. Thus, we have been using it for months and still keep benefiting from it. However, there is one feature that we really desire but Hudi doesn't currently have: "Multiple event_time fields verification". In the insurance industry, data is often stored distributed across dozens of tables and conceptually connected by the same primary keys. When the data is used, we often need to associate several or even dozens of tables through join operations and stitch all the partial columns into an entire record with dozens or even hundreds of columns for downstream services to use. Here comes the problem: if we want to guarantee that every part of the joined data is up to date, Hudi must be able to filter on multiple event_time timestamps in a table and keep the most recent records. In this scenario, the single event_time filtering field provided by Hudi (i.e. the option 'write.precombine.field' in Hudi 0.10.0) is a bit inadequate. To cope with complex join use cases like the above, and to give Hudi the potential to support more application scenarios and industries, Hudi definitely needs to support filtering on multiple event_time timestamps in a single table. The good news is that, after more than two months of development, my colleagues and I have made some changes in the hudi-flink and hudi-common modules based on hudi-0.10.0 and have basically achieved this feature. Currently, my team is using the enhanced source code with Kafka and Flink 1.13.2 to run end-to-end tests on a dataset of more than 140 million real-world insurance records and verify the accuracy of the data. The result is quite good: every part of the extremely wide records has been updated to the latest status based on our continuous observations during these weeks. We're very keen to make this new feature available to everyone. We benefit from the Hudi community, so we really want to give back to the community with our efforts. The only problem is that we are not sure whether we need to create an RFC to illustrate our design and implementation in detail. According to the "RFC Process" in the Hudi official documentation, we have to confirm that this feature does not already exist before we can create a new RFC.
Re: [new RFC Request] The need of Multiple event_time fields verification
Hi Sivabalan, Thanks for your kind words! We have been working very hard to prepare materials for the RFC this week since we got your feedback on our idea, and I promise it will be very soon (within a few days) that everyone can read our RFC and learn every detail about this feature. It's our pleasure to make Hudi even more powerful by making this feature available to everyone. However, there is one thing we really need your help with. According to the RFC process shown in the Hudi docs, we have to first raise a PR and add an entry to rfc/README.md. But since this is the first time we have raised a PR to Hudi, a maintainer with write permission needs to approve it. We have been waiting for days, but the PR is still pending. Therefore, may I ask you to help us approve our first PR so that we can submit our further materials to Hudi? The URL of our pending PR is: https://github.com/apache/hudi/pull/6328 and the corresponding Jira is: https://issues.apache.org/jira/browse/HUDI-4569 Thank you so much for your help :) Kind regards, Xinyao Tian

On 08/9/2022 21:46, Sivabalan wrote: Eagerly looking forward to the RFC, Xinyao. I definitely see a lot of folks benefitting from this.

On Sun, 7 Aug 2022 at 20:00, 田昕峣 (Xinyao Tian) wrote: Hi Shiyan, Thanks so much for your feedback and your kind encouragement! It's always our honor to contribute our effort to everyone and make Hudi even more awesome :) We are now carefully preparing materials for the new RFC. Once we finish, we will strictly follow the RFC process shown in the official Hudi documentation to propose the new RFC and share all details of the new feature, as well as the related code, with everyone. Since we benefit from the Hudi community, we would like to give back to the community and make Hudi benefit more people! As always, please stay healthy and keep safe. Kind regards, Xinyao Tian

On 08/6/2022 10:11, Shiyan Xu wrote: Hi Xinyao, awesome achievement!
And I really appreciate your keenness in contributing to Hudi. Certainly we'd love to see an RFC for this.

On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) wrote: Greetings everyone, My name is Xinyao and I'm currently working for an insurance company. We found that Apache Hudi is an extremely awesome utility, and when it cooperates with Apache Flink it can be even more powerful. Thus, we have been using it for months and still keep benefiting from it. However, there is one feature that we really desire but Hudi doesn't currently have: "Multiple event_time fields verification". In the insurance industry, data is often stored distributed across dozens of tables and conceptually connected by the same primary keys. When the data is used, we often need to associate several or even dozens of tables through join operations and stitch all the partial columns into an entire record with dozens or even hundreds of columns for downstream services to use. Here comes the problem: if we want to guarantee that every part of the joined data is up to date, Hudi must be able to filter on multiple event_time timestamps in a table and keep the most recent records. In this scenario, the single event_time filtering field provided by Hudi (i.e. the option 'write.precombine.field' in Hudi 0.10.0) is a bit inadequate. To cope with complex join use cases like the above, and to give Hudi the potential to support more application scenarios and industries, Hudi definitely needs to support filtering on multiple event_time timestamps in a single table. The good news is that, after more than two months of development, my colleagues and I have made some changes in the hudi-flink and hudi-common modules based on hudi-0.10.0 and have basically achieved this feature. Currently, my team is using the enhanced source code with Kafka and Flink 1.13.2 to run end-to-end tests on a dataset of more than 140 million real-world insurance records and verify the accuracy of the data. The result is quite good: every part of the extremely wide records has been updated to the latest status based on our continuous observations during these weeks. We're very keen to make this new feature available to everyone. We benefit from the Hudi community, so we really want to give back to the community with our efforts. The only problem is that we are not sure whether we need to create an RFC to illustrate our design and implementation in detail. According to the "RFC Process" in the Hudi official documentation, we have to confirm that this feature does not already exist before we can create a new RFC to share the concept and code and explain them in detail. Thus, we would really like to create a new RFC that explains our implementation in detail with theory and code and makes it easier for everyone to understand and improve upon.
Re: [new RFC Request] The need of Multiple event_time fields verification
Hi Shiyan, Thanks so much for your feedback and your kind encouragement! It's always our honor to contribute our effort to everyone and make Hudi even more awesome :) We are now carefully preparing materials for the new RFC. Once we finish, we will strictly follow the RFC process shown in the official Hudi documentation to propose the new RFC and share all details of the new feature, as well as the related code, with everyone. Since we benefit from the Hudi community, we would like to give back to the community and make Hudi benefit more people! As always, please stay healthy and keep safe. Kind regards, Xinyao Tian

On 08/6/2022 10:11, Shiyan Xu wrote: Hi Xinyao, awesome achievement! And I really appreciate your keenness in contributing to Hudi. Certainly we'd love to see an RFC for this.

On Fri, Aug 5, 2022 at 4:21 AM 田昕峣 (Xinyao Tian) wrote: Greetings everyone, My name is Xinyao and I'm currently working for an insurance company. We found that Apache Hudi is an extremely awesome utility, and when it cooperates with Apache Flink it can be even more powerful. Thus, we have been using it for months and still keep benefiting from it. However, there is one feature that we really desire but Hudi doesn't currently have: "Multiple event_time fields verification". In the insurance industry, data is often stored distributed across dozens of tables and conceptually connected by the same primary keys. When the data is used, we often need to associate several or even dozens of tables through join operations and stitch all the partial columns into an entire record with dozens or even hundreds of columns for downstream services to use. Here comes the problem: if we want to guarantee that every part of the joined data is up to date, Hudi must be able to filter on multiple event_time timestamps in a table and keep the most recent records. In this scenario, the single event_time filtering field provided by Hudi (i.e. the option 'write.precombine.field' in Hudi 0.10.0) is a bit inadequate. To cope with complex join use cases like the above, and to give Hudi the potential to support more application scenarios and industries, Hudi definitely needs to support filtering on multiple event_time timestamps in a single table. The good news is that, after more than two months of development, my colleagues and I have made some changes in the hudi-flink and hudi-common modules based on hudi-0.10.0 and have basically achieved this feature. Currently, my team is using the enhanced source code with Kafka and Flink 1.13.2 to run end-to-end tests on a dataset of more than 140 million real-world insurance records and verify the accuracy of the data. The result is quite good: every part of the extremely wide records has been updated to the latest status based on our continuous observations during these weeks. We're very keen to make this new feature available to everyone. We benefit from the Hudi community, so we really want to give back to the community with our efforts. The only problem is that we are not sure whether we need to create an RFC to illustrate our design and implementation in detail. According to the "RFC Process" in the Hudi official documentation, we have to confirm that this feature does not already exist before we can create a new RFC to share the concept and code and explain them in detail. Thus, we would really like to create a new RFC that explains our implementation in detail with theory and code and makes it easier for everyone to understand and improve upon. Looking forward to your feedback on whether we should create a new RFC and make Hudi better and better to benefit everyone. Kind regards, Xinyao Tian -- Best, Shiyan
Re: [DISCUSS] Diagnostic reporter
Hi Shiyan and everyone, This feature is definitely very important. We really need to gather error info to fix bugs more efficiently. If there's anything I can help with, please feel free to let me know :) Regards, Xinyao

Hi Shiyan and everyone, This is a great idea! As a Hudi user, I also struggle with Hudi troubleshooting sometimes. This feature will definitely reduce that burden, so I volunteer to draft a discussion and maybe raise an RFC about it, if you don't mind. Thanks :) Yue Zhang, zhangyue921...@163.com

On 08/3/2022 00:44, 冯健 wrote: Maybe we can start this with an audit feature? Since we need some sort of "images" to represent "facts", we can create an identity for a writer to link them, and in this audit file we can label each operation with IP, environment, platform, version, write config, etc.

On Sun, 31 Jul 2022 at 12:18, Shiyan Xu wrote: To bubble this up

On Wed, Jun 15, 2022 at 11:47 PM Vinoth Chandar wrote: +1 from me. It will be very useful if we can have something that gathers troubleshooting info easily. This part currently takes a while.

On Mon, May 30, 2022 at 9:52 AM Shiyan Xu wrote: Hi all, When troubleshooting Hudi jobs in users' environments, we always ask users to share configs and environment info, check the Spark UI, etc. Here is an RFC idea: can we extend the Hudi metrics system and make a diagnostic reporter? It could be turned on like a normal metrics reporter. It should collect common troubleshooting info and save it to JSON or another human-readable text format. Users should be able to run with it and share the diagnosis file. The RFC should discuss what info should / can be collected. Does this make sense? Anyone interested in driving the RFC design and implementation work? -- Best, Shiyan -- Best, Shiyan
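To make the proposal a bit more concrete, here is a minimal sketch of what such a reporter could collect. None of these names are Hudi APIs; the fields, helper functions, and file format are illustrative assumptions only.

```python
# Minimal sketch of a diagnostic reporter: gather common troubleshooting
# info (environment, versions, user-supplied write configs) into a single
# human-readable JSON file that a user could attach to a bug report.
# All names here are hypothetical, not actual Hudi APIs.
import json
import platform
import sys
import time


def collect_diagnostics(job_config):
    """Assemble common troubleshooting info into one dictionary."""
    return {
        "collected_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "python_version": sys.version.split()[0],
        "os": platform.platform(),
        "job_config": job_config,  # e.g. the writer options the user set
    }


def write_report(job_config, path="diagnosis.json"):
    """Write the diagnosis file and return its contents."""
    report = collect_diagnostics(job_config)
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    return report


report = write_report({"write.precombine.field": "ts"})
```

The point of the design is that the user runs their job with the reporter turned on and then shares one file, instead of collecting configs, versions, and environment details by hand.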
[new RFC Request] The need of Multiple event_time fields verification
Greetings everyone, My name is Xinyao and I'm currently working for an insurance company. We found that Apache Hudi is an extremely awesome utility, and when it cooperates with Apache Flink it can be even more powerful. Thus, we have been using it for months and keep benefiting from it. However, there is one feature that we really desire but Hudi doesn't currently have: "Multiple event_time fields verification". In the insurance industry, data is often distributed across dozens of tables, conceptually connected by the same primary keys. When the data is used, we often need to associate several or even dozens of tables through Join operations, and stitch all the partial columns into an entire record with dozens or even hundreds of columns for downstream services to use. Here is the problem: if we want to guarantee that every part of the joined data is up to date, Hudi must be able to filter on multiple event_time timestamps in a table and keep the most recent records. In this scenario, the single event_time filtering field provided by Hudi (i.e. option 'write.precombine.field' in Hudi 0.10.0) is a bit inadequate. Obviously, in order to cope with use cases involving complex Join operations like the above, as well as to give Hudi the potential to support more application scenarios and industries, Hudi definitely needs to support multiple event_time timestamp filtering in a single table. The good news is that, after more than two months of development, my colleagues and I have made some changes in the hudi-flink and hudi-common modules based on Hudi 0.10.0 and have basically achieved this feature. Currently, my team is using the enhanced source code with Kafka and Flink 1.13.2 to conduct end-to-end testing on a dataset of more than 140 million real-world insurance records and to verify the accuracy of the data.
The result is quite good: every part of the extremely wide records has been updated to the latest status, based on our continuous observations during these weeks. We're very keen to make this new feature available to everyone. We benefit from the Hudi community, so we really want to give back to the community with our efforts. The only problem is that we are not sure whether we need to create an RFC to illustrate our design and implementation in detail. According to the "RFC Process" in the official Hudi documentation, we have to confirm that this feature does not already exist before we can create a new RFC to share the concept and code and explain them in detail. Thus, we would really like to create a new RFC that explains our implementation in detail, with theory and code, and makes it easier for everyone to understand and improve upon. We look forward to your feedback on whether we should create a new RFC and make Hudi better and better to benefit everyone. Kind regards, Xinyao Tian
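The per-column-group semantics requested above can be sketched in a few lines. This is an illustrative Python toy only, with hypothetical column-group and field names; it is not the actual implementation (which lives in the hudi-flink and hudi-common Java modules), and it simply shows why a single precombine field is insufficient when different column groups of one wide record carry their own event_time.

```python
# Illustrative sketch (NOT Hudi code): merge two versions of a wide record
# where each column group has its own event_time, keeping each group from
# whichever version is newer FOR THAT GROUP - a per-group precombine.

def merge_by_group_event_time(old, new, groups):
    """groups maps a group name to (event_time_field, [columns])."""
    merged = dict(old)
    for _, (ts_field, cols) in groups.items():
        # With a single precombine field, the whole record would win or
        # lose at once; here each group is compared on its own timestamp.
        if new.get(ts_field, 0) >= old.get(ts_field, 0):
            for col in cols + [ts_field]:
                merged[col] = new[col]
    return merged


# Hypothetical example: policy info and claim info joined into one wide row.
groups = {
    "policy": ("policy_ts", ["policy_status"]),
    "claim": ("claim_ts", ["claim_status"]),
}
old = {"id": 1, "policy_ts": 10, "policy_status": "active",
       "claim_ts": 20, "claim_status": "open"}
new = {"id": 1, "policy_ts": 5, "policy_status": "lapsed",
       "claim_ts": 30, "claim_status": "closed"}
merged = merge_by_group_event_time(old, new, groups)
# The policy fields keep the old (newer) values while the claim fields
# take the new ones - something a single 'write.precombine.field' cannot do.
```

A single-field precombine would have either accepted or rejected the incoming record wholesale, losing the fresher claim update or reviving the stale policy status.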
Re: Would like to become a code contributor
Hi Shiyan, Just received your feedback. Thanks for your kind advice! I'm really excited, since this is the first time for me to join an open-source project as a code contributor. I'm sure I have many things to learn in order to become a good contributor, so if there's anything you'd like to mention, please feel free to let me know :) As always, keep safe and have a good one. Regards, Xinyao Tian On 08/4/2022 05:35, Shiyan Xu wrote: Added and welcome! Look forward to your contributions. FYI there is nothing stopping you from creating GitHub PRs to the project. The JIRA is just for assigning tickets. On Wed, Aug 3, 2022 at 1:15 AM 田昕峣 (Xinyao Tian) wrote: Hi there, My name is Richard. I just graduated from the University of Sydney (Master of Data Science) and am currently working in the IT department of an insurance company. For the past 3 months I have read the documentation as well as the source code of Hudi 0.10.0, and have also made some improvements and bug fixes to Hudi inside my department. I really want to share those improvements with everyone and contribute code to this awesome project. Thus, I would like to become a code contributor. Please let me join your developer group as a contributor. My Jira username is: xinyaotian8647 Thanks so much. Kind regards, Richard Xinyao Tian -- Best, Shiyan
Would like to become a code contributor
Hi there, My name is Richard. I just graduated from the University of Sydney (Master of Data Science) and am currently working in the IT department of an insurance company. For the past 3 months I have read the documentation as well as the source code of Hudi 0.10.0, and have also made some improvements and bug fixes to Hudi inside my department. I really want to share those improvements with everyone and contribute code to this awesome project. Thus, I would like to become a code contributor. Please let me join your developer group as a contributor. My Jira username is: xinyaotian8647 Thanks so much. Kind regards, Richard Xinyao Tian