Re: [ANNOUNCE] Apache Hudi 0.12.1 released
Congrats!

On Sun, Oct 23, 2022 at 3:57 PM Zhuoluo Yang wrote:
> Congrats!
>
> Thanks,
> Zhuoluo

On Thu, Oct 20, 2022 at 09:03, leesf wrote:
> Great job!

On Thu, Oct 20, 2022 at 03:45, Alexey Kudinkin wrote:
> Thanks Zhaojing for masterfully navigating this release!

On Wed, Oct 19, 2022 at 7:46 AM Vinoth Chandar wrote:
> Great job everyone!

On Wed, Oct 19, 2022 at 07:11, zhaojing yu wrote:
> The Apache Hudi team is pleased to announce the release of Apache Hudi
> 0.12.1.
>
> Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
> and Incrementals. Apache Hudi manages storage of large analytical
> datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem compatible
> storage) and provides the ability to query them.
>
> This release comes 2 months after 0.12.0. It includes more than
> 150 resolved issues, comprising a few new features as well as
> general improvements and bug fixes. You can read the release
> highlights at https://hudi.apache.org/releases/release-0.12.1.
>
> For details on how to use Hudi, please look at the quick start page
> located at https://hudi.apache.org/docs/quick-start-guide.html
>
> If you'd like to download the source release, you can find it here:
> https://github.com/apache/hudi/releases/tag/release-0.12.1
>
> Release notes including the resolved issues can be found here:
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352182
>
> We welcome your help and feedback. For more information on how to
> report problems, and to get involved, visit the project website at
> https://hudi.apache.org
>
> Thanks to everyone involved!
>
> Release Manager

--
Best,
Shiyan
Re: [PSA] CI flakiness in master branch
CI is back to stably passing now. Thank you all.

On Fri, Oct 21, 2022 at 11:20 PM Shiyan Xu wrote:
> Amendment: cancel CI jobs for non-urgent PRs *when needed*

On Fri, Oct 21, 2022 at 11:18 PM Shiyan Xu wrote:
> Hi all,
>
> We have seen the CI passing rate being very low due to flakiness in the
> master branch. Let's tackle the failing tests first.
>
> We'll prioritize landing flakiness-fixing PRs. Due to CI resource
> limitations, we'll cancel CI jobs for non-urgent PRs. Apologies for the
> inconvenience caused.

--
Best,
Shiyan
Re: [Discuss] SCD-2 Payload
To Raymond: currently combineAndGetUpdateValue can only return one
IndexedRecord, but in the SCD-2 case both the old and the new record need
to be stored.

To Alexey: yeah, this feature should be designed on top of RFC-46. Can
HoodieRecordMerger return two HoodieRecords in this case?

On Tue, 25 Oct 2022 at 03:55, Alexey Kudinkin wrote:
> Hey, hey, Fengjian!
>
> With the landing of RFC-46 we'll be kick-starting a process of phasing
> out HoodieRecordPayload as an abstraction and instead migrating to the
> HoodieRecordMerger interface.
> I'd recommend basing your design considerations on the new
> HoodieRecordMerger interface instead of the legacy HoodieRecordPayload
> to make sure it's future-proof.
>
> On Thu, Oct 20, 2022 at 10:08 AM 冯健 wrote:
> > Hi guys,
> >
> > After reading this article on how to implement SCD-2 with Hudi,
> > "Build Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and
> > Apache Hudi on Amazon EMR"
> > <https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/>,
> > I have an idea for implementing embedded SCD-2 support in Hudi by
> > using a new Payload. Users wouldn't need to manually join the data and
> > then update end_date and status.
> >
> > For example, say the record key is 'id,end_date', the current record's
> > id is 1, and its end_date is 2099-12-31. When a new record with id=1
> > arrives, it would update the current record's end_date to 2022-10-21
> > and also insert the new record with end_date '2099-12-31'. So this
> > Payload would generate two records in combineAndGetUpdateValue. There
> > would be no join cost, and the whole process would be transparent to
> > users.
> >
> > Any thoughts?
Re: [Discuss] SCD-2 Payload
Hey, hey, Fengjian!

With the landing of RFC-46 we'll be kick-starting a process of phasing out
HoodieRecordPayload as an abstraction and instead migrating to the
HoodieRecordMerger interface.
I'd recommend basing your design considerations on the new
HoodieRecordMerger interface instead of the legacy HoodieRecordPayload to
make sure it's future-proof.

On Thu, Oct 20, 2022 at 10:08 AM 冯健 wrote:
> Hi guys,
>
> After reading this article on how to implement SCD-2 with Hudi, "Build
> Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and Apache
> Hudi on Amazon EMR"
> <https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/>,
> I have an idea for implementing embedded SCD-2 support in Hudi by using
> a new Payload. Users wouldn't need to manually join the data and then
> update end_date and status.
>
> For example, say the record key is 'id,end_date', the current record's
> id is 1, and its end_date is 2099-12-31. When a new record with id=1
> arrives, it would update the current record's end_date to 2022-10-21 and
> also insert the new record with end_date '2099-12-31'. So this Payload
> would generate two records in combineAndGetUpdateValue. There would be
> no join cost, and the whole process would be transparent to users.
>
> Any thoughts?
Re: [Discuss] SCD-2 Payload
Interesting thoughts. Not sure if I fully understand this part: "generate
2 records in combineAndGetUpdateValue" - isn't the API defined to return
just 1 record?

On Fri, Oct 21, 2022 at 1:07 AM 冯健 wrote:
> Hi guys,
>
> After reading this article on how to implement SCD-2 with Hudi, "Build
> Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and Apache
> Hudi on Amazon EMR"
> <https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/>,
> I have an idea for implementing embedded SCD-2 support in Hudi by using
> a new Payload. Users wouldn't need to manually join the data and then
> update end_date and status.
>
> For example, say the record key is 'id,end_date', the current record's
> id is 1, and its end_date is 2099-12-31. When a new record with id=1
> arrives, it would update the current record's end_date to 2022-10-21 and
> also insert the new record with end_date '2099-12-31'. So this Payload
> would generate two records in combineAndGetUpdateValue. There would be
> no join cost, and the whole process would be transparent to users.
>
> Any thoughts?

--
Best,
Shiyan
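To make the intended two-record output concrete, here is a minimal sketch in plain Java. It is only an illustration of the proposed SCD-2 semantics, not the real HoodieRecordPayload / combineAndGetUpdateValue API (which, as noted in this thread, returns a single record); the merge helper, the field names (effective_date, end_date), and the open-ended sentinel date are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of SCD-2 merge semantics: one incoming update yields TWO rows,
// the closed-out old version and the newly opened current version.
public class ScdTwoMergeSketch {

    // Hypothetical sentinel meaning "row is still current".
    static final String OPEN_END_DATE = "2099-12-31";

    // Given the stored (current) row and an incoming row with the same id,
    // emit two rows: the old row closed out at the incoming row's effective
    // date, and the incoming row left open-ended.
    static List<Map<String, String>> merge(Map<String, String> current,
                                           Map<String, String> incoming) {
        Map<String, String> closed = new HashMap<>(current);
        closed.put("end_date", incoming.get("effective_date"));

        Map<String, String> opened = new HashMap<>(incoming);
        opened.put("end_date", OPEN_END_DATE);

        List<Map<String, String>> out = new ArrayList<>();
        out.add(closed);  // old version, now closed
        out.add(opened);  // new current version
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> current = new HashMap<>();
        current.put("id", "1");
        current.put("end_date", OPEN_END_DATE);

        Map<String, String> incoming = new HashMap<>();
        incoming.put("id", "1");
        incoming.put("effective_date", "2022-10-21");

        for (Map<String, String> row : merge(current, incoming)) {
            System.out.println(row);
        }
    }
}
```

Note how this sidesteps the single-record constraint of the existing payload API by returning a list; supporting it natively would require an interface that can emit multiple records per merge, which is exactly the open question about HoodieRecordMerger raised in this thread.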
Re: [DISCUSS] [RFC] Hudi bundle standards
Thanks Xinyao for raising the problem. Let's align more on the RFC to help
clarify usage. Agreed on the importance - the bundle artifacts are the
user-facing components of this project.

On Mon, Oct 10, 2022 at 4:44 PM 田昕峣 (Xinyao Tian) wrote:
> Hi Shiyan,
>
> Having carefully read RFC-63.md on the PR, I really think this feature
> is crucial for everyone who builds Hudi from source. For example, when I
> tried to compile Hudi 0.12.0 with Flink 1.15, I used the command 'mvn
> clean package -DskipTests -Dflink1.15 -Dscala-2.12' but still got the
> flink1.14 bundle. Also, the documentation highlights support for Flink
> 1.15, but the compile section of the Hudi 0.12.0 GitHub README.md
> doesn't mention flink1.15 at all. All in all, there are many misleading
> points around Hudi bundles that need to be addressed as soon as
> possible.
>
> I really appreciate having this RFC to try to solve all these problems.
>
> On 10/10/2022 13:36, Shiyan Xu wrote:
> > Hi Hudi devs and users,
> >
> > I've raised an RFC around Hudi bundles, aiming to address issues
> > around dependency conflicts and to establish standards for bundle jar
> > usage and the change process. Please have a look. Thanks!
> >
> > https://github.com/apache/hudi/pull/6902

--
Best,
Shiyan
Re: [DISCUSS] Build tool upgrade
Thank you all for the valuable inputs! I think we can close this topic for
now, given that the majority is leaning towards continuing with Maven.

On Mon, Oct 17, 2022 at 8:48 PM zhaojing yu wrote:
> I have worked on some Gradle development projects and want to share some
> thoughts.
>
> The flexibility and faster speed of Gradle itself can certainly bring
> some advantages, but it will also greatly increase troubleshooting time
> due to bugs in Gradle itself, and the Gradle DSL is very different from
> Maven's. There are also significant learning costs for developers in the
> community.
>
> I agree that building does consume too much time on code release, but
> users and developers usually only compile part of the modules.
>
> So I think a certain advantage in build time alone is not enough to
> cover so much cost.
>
> Best,
> Zhaojing

On Mon, Oct 17, 2022 at 19:22, Gary Li wrote:
> Hi folks,
>
> I'd like to share my thoughts as well. I personally don't build the
> whole project too often - only before pushing to the remote branch or
> making big changes across modules. If I just make some changes and run
> a test, the IDE will only build the necessary modules, I believe. In
> addition, each time I deal with dependency issues, my years of Maven
> experience help me locate the issue quickly, especially when the
> dependency tree is pretty complicated. The learning effort and the new
> environment setup effort are considerable as well.
>
> Happy to learn if there are other benefits Gradle or Bazel could bring
> us, but if the only benefit is an xx% faster build time, I am a bit
> unconvinced to make this change.
>
> Best,
> Gary
>
> On Mon, Oct 17, 2022 at 2:58 PM Danny Chan wrote:
> > I have full experience of how Apache Calcite switched from Maven to
> > Gradle, and I want to share some thoughts.
> >
> > The Gradle build is fast, but it relies heavily on its local cache;
> > it usually takes too much time to download those cached jars because
> > Gradle upgrades itself very frequently.
> >
> > Gradle is very flexible for building, but it also has many bugs; you
> > may need more time to debug them compared with building with Maven.
> >
> > The Gradle build DSL is a must-learn for all developers.
> >
> > For all the above reasons, I don't think switching to Gradle was the
> > right decision for Apache Calcite. Julian Hyde, the creator of
> > Calcite, may have more to say here.
> >
> > So I would not suggest we do that for Hudi.
> >
> > Best,
> > Danny Chan
> >
> > On Sat, Oct 1, 2022 at 13:48, Shiyan Xu wrote:
> > > Hi all,
> > >
> > > I'd like to raise a discussion around the build tool for Hudi.
> > >
> > > Maven is a mature yet slow build tool (10 minutes to build on a
> > > 2021 MacBook Pro) compared to modern ones like Gradle or Bazel. We
> > > all want faster builds; however, we also need to consider the
> > > effort and risk of upgrading, and the developers' feedback on
> > > usability.
> > >
> > > What do you all think about upgrading to Gradle or Bazel? Please
> > > share your thoughts. Thanks.

--
Best,
Shiyan