Re: [ANNOUNCE] Apache Hudi 0.12.1 released

2022-10-24 Thread Shiyan Xu
Congrats!

On Sun, Oct 23, 2022 at 3:57 PM Zhuoluo Yang  wrote:

> Congrats!
>
> Thanks,
> Zhuoluo
>
>
> leesf wrote on Thu, Oct 20, 2022 at 09:03:
>
> > Great job!
> >
> > Alexey Kudinkin wrote on Thu, Oct 20, 2022 at 03:45:
> >
> > > Thanks Zhaojing for masterfully navigating this release!
> > >
> > > On Wed, Oct 19, 2022 at 7:46 AM Vinoth Chandar wrote:
> > >
> > > > Great job everyone!
> > > >
> > > > On Wed, Oct 19, 2022 at 07:11 zhaojing yu wrote:
> > > >
> > > > > The Apache Hudi team is pleased to announce the release of Apache
> > > > > Hudi 0.12.1.
> > > > >
> > > > > Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
> > > > > and Incrementals. Apache Hudi manages storage of large analytical
> > > > > datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem
> > > > > compatible storage) and provides the ability to query them.
> > > > >
> > > > > This release comes 2 months after 0.12.0. It includes more than
> > > > > 150 resolved issues, comprising a few new features as well as
> > > > > general improvements and bug fixes. You can read the release
> > > > > highlights at https://hudi.apache.org/releases/release-0.12.1.
> > > > >
> > > > > For details on how to use Hudi, please look at the quick start page
> > > > > located at https://hudi.apache.org/docs/quick-start-guide.html
> > > > >
> > > > > If you'd like to download the source release, you can find it here:
> > > > > https://github.com/apache/hudi/releases/tag/release-0.12.1
> > > > >
> > > > > Release notes including the resolved issues can be found here:
> > > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352182
> > > > >
> > > > > We welcome your help and feedback. For more information on how to
> > > > > report problems, and to get involved, visit the project website at
> > > > > https://hudi.apache.org
> > > > >
> > > > > Thanks to everyone involved!
> > > > >
> > > > > Release Manager
> > > > >
> > > >
> > >
> >
>


-- 
Best,
Shiyan


Re: [PSA] CI flakiness in master branch

2022-10-24 Thread Shiyan Xu
CI is back to stably passing now. Thank you all.

On Fri, Oct 21, 2022 at 11:20 PM Shiyan Xu wrote:

> Amendment: cancel CI jobs for non-urgent PRs *when needed*
>
> On Fri, Oct 21, 2022 at 11:18 PM Shiyan Xu wrote:
>
>> Hi all,
>>
>> We have seen a very low CI pass rate due to flakiness in the master
>> branch. Let's tackle the failing tests first.
>>
>> We'll prioritize landing PRs that fix flakiness. Because CI resources are
>> limited, we'll cancel CI jobs for non-urgent PRs. Apologies for the
>> inconvenience.
>>
>> --
>> Best,
>> Shiyan
>>
>
>
> --
> Best,
> Shiyan
>


-- 
Best,
Shiyan


Re: [Discuss] SCD-2 Payload

2022-10-24 Thread 冯健
To Raymond: combineAndGetUpdateValue can currently return only one
IndexedRecord, but in the SCD-2 case both the old and the new record need
to be stored.
To Alexey: Yes, this feature should be designed on top of RFC-46. Can
HoodieRecordMerger return two HoodieRecords in this case?
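
To make the one-in, two-out behavior discussed in this thread concrete, here
is a minimal sketch in plain Python. It is not Hudi's API: DimRecord,
scd2_merge, and OPEN_END are hypothetical names made up for illustration,
and the only point is that an SCD-2 merge has to hand back a list of records
(the closed-out old version plus the new open version) rather than a single
one.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import List, Optional

# Hypothetical sentinel marking the currently active version of a row.
OPEN_END = date(2099, 12, 31)


@dataclass(frozen=True)
class DimRecord:
    """Hypothetical dimension row; not a Hudi class."""
    id: int
    value: str
    end_date: date


def scd2_merge(current: Optional[DimRecord],
               incoming: DimRecord,
               as_of: date) -> List[DimRecord]:
    """Return every record that must be written for one incoming update.

    With no existing version, only the new open-ended record is written.
    Otherwise the existing version is closed at `as_of` and the incoming
    record becomes the new open-ended version, so two records come back.
    """
    new_version = replace(incoming, end_date=OPEN_END)
    if current is None:
        return [new_version]
    closed_version = replace(current, end_date=as_of)
    return [closed_version, new_version]


# The example from the proposal: id=1 is open until 2099-12-31 and a new
# record for id=1 arrives on 2022-10-21.
current = DimRecord(id=1, value="old", end_date=OPEN_END)
incoming = DimRecord(id=1, value="new", end_date=OPEN_END)
result = scd2_merge(current, incoming, as_of=date(2022, 10, 21))
```

Here result holds two rows for id=1: the old one with end_date 2022-10-21
and the new one with end_date 2099-12-31. A merge hook that can return only
one record cannot express this, which is the limitation raised above.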



On Tue, 25 Oct 2022 at 03:55, Alexey Kudinkin  wrote:

> Hey, hey, Fengjian!
>
> With the landing of RFC-46, we'll be kick-starting the process of phasing
> out HoodieRecordPayload as an abstraction and migrating to the
> HoodieRecordMerger interface instead.
> I'd recommend basing your design considerations on the new
> HoodieRecordMerger interface instead of the legacy HoodieRecordPayload to
> make sure it's future-proof.
>
> On Thu, Oct 20, 2022 at 10:08 AM 冯健  wrote:
>
> > Hi guys,
> > After reading this article on how to implement SCD-2 with Hudi, "Build
> > Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and Apache
> > Hudi on Amazon EMR"
> > <https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/>
> > I have an idea for implementing embedded SCD-2 support in Hudi by using
> > a new Payload, so that users don't need to manually join the data and
> > then update end_date and status.
> > For example, say the record key is 'id,end_date' and the current record
> > has id 1 and end_date 2099-12-31. When a new record with id=1 arrives,
> > the Payload updates the current record's end_date to 2022-10-21 and also
> > inserts the new record with end_date '2099-12-31'. So this Payload
> > generates two records in combineAndGetUpdateValue. There is no join
> > cost, and the whole process is transparent to users.
> >
> > Any thoughts?
> >
>


Re: [Discuss] SCD-2 Payload

2022-10-24 Thread Alexey Kudinkin
Hey, hey, Fengjian!

With the landing of RFC-46, we'll be kick-starting the process of phasing
out HoodieRecordPayload as an abstraction and migrating to the
HoodieRecordMerger interface instead.
I'd recommend basing your design considerations on the new
HoodieRecordMerger interface instead of the legacy HoodieRecordPayload to
make sure it's future-proof.

On Thu, Oct 20, 2022 at 10:08 AM 冯健  wrote:

> Hi guys,
> After reading this article on how to implement SCD-2 with Hudi, "Build
> Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and Apache Hudi
> on Amazon EMR"
> <https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/>
> I have an idea for implementing embedded SCD-2 support in Hudi by using a
> new Payload, so that users don't need to manually join the data and then
> update end_date and status.
> For example, say the record key is 'id,end_date' and the current record
> has id 1 and end_date 2099-12-31. When a new record with id=1 arrives, the
> Payload updates the current record's end_date to 2022-10-21 and also
> inserts the new record with end_date '2099-12-31'. So this Payload
> generates two records in combineAndGetUpdateValue. There is no join cost,
> and the whole process is transparent to users.
>
> Any thoughts?
>


Re: [Discuss] SCD-2 Payload

2022-10-24 Thread Shiyan Xu
Interesting thoughts. I'm not sure I fully understand this part: "generate
2 records in combineAndGetUpdateValue". Isn't the API defined to return
just 1 record?

On Fri, Oct 21, 2022 at 1:07 AM 冯健  wrote:

> Hi guys,
> After reading this article on how to implement SCD-2 with Hudi, "Build
> Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and Apache Hudi
> on Amazon EMR"
> <https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/>
> I have an idea for implementing embedded SCD-2 support in Hudi by using a
> new Payload, so that users don't need to manually join the data and then
> update end_date and status.
> For example, say the record key is 'id,end_date' and the current record
> has id 1 and end_date 2099-12-31. When a new record with id=1 arrives, the
> Payload updates the current record's end_date to 2022-10-21 and also
> inserts the new record with end_date '2099-12-31'. So this Payload
> generates two records in combineAndGetUpdateValue. There is no join cost,
> and the whole process is transparent to users.
>
> Any thoughts?
>


-- 
Best,
Shiyan


Re: [DISCUSS] [RFC] Hudi bundle standards

2022-10-24 Thread Shiyan Xu
Thanks Xinyao for raising the problem. Let's align more on the RFC to help
clarify usage. Agree on the importance - the bundle artifacts are the
user-facing components from this project.

On Mon, Oct 10, 2022 at 4:44 PM 田昕峣 (Xinyao Tian) wrote:

> Hi Shiyan,
>
>
> Having carefully read the RFC-63.md in the PR, I really think this feature
> is crucial for everyone who builds Hudi from source. For example, when I
> tried to compile Hudi 0.12.0 with Flink 1.15, I used the command 'mvn clean
> package -DskipTests -Dflink1.15 -Dscala-2.12' but still got a Flink 1.14
> bundle. Also, the documentation highlights Flink 1.15 support, but the
> Hudi 0.12.0 GitHub readme.md doesn't mention Flink 1.15 in its compile
> section. All in all, there are many misleading points about Hudi bundles,
> which need to be fixed ASAP.
>
>
> I really appreciate this RFC and its attempt to solve all these problems.
> On 10/10/2022 13:36,Shiyan Xu wrote:
> Hi Hudi devs and users,
>
> I've raised an RFC around Hudi bundles, aiming to address issues around
> dependency conflicts, and to establish standards for bundle jar usage and
> change process. Please have a look. Thanks!
>
> https://github.com/apache/hudi/pull/6902
>
> --
> Best,
> Shiyan
>


-- 
Best,
Shiyan


Re: [DISCUSS] Build tool upgrade

2022-10-24 Thread Shiyan Xu
Thank you all for the valuable inputs! I think we can close this topic for
now, given the majority is leaning towards continuing with maven.

On Mon, Oct 17, 2022 at 8:48 PM zhaojing yu  wrote:

> I have worked on some Gradle-based projects and want to share some
> thoughts.
>
> Gradle's flexibility and speed can certainly bring some advantages, but
> bugs in Gradle itself will also greatly increase troubleshooting time, and
> the Gradle DSL is very different from Maven's. There is also a significant
> learning cost for developers in the community.
>
> I agree that the build does take too much time for a code release, but
> users and developers usually compile only some of the modules.
>
> So I think an advantage in build time alone is not enough to justify so
> much cost.
>
> Best,
> Zhaojing
>
> Gary Li wrote on Mon, Oct 17, 2022 at 19:22:
>
> > Hi folks,
> >
> > I'd like to share my thoughts as well. I personally don't build the whole
> > project too often, only before pushing to the remote branch or after
> > making big changes in different modules. If I just make some changes and
> > run a test, I believe the IDE only builds the necessary modules. In
> > addition, each time I deal with dependency issues, my years of Maven
> > experience help me locate the issue quickly, especially when the
> > dependency tree is pretty complicated. The learning effort and the effort
> > of setting up a new environment are considerable as well.
> >
> > Happy to learn if there are other benefits Gradle or Bazel could bring to
> > us, but if the only benefit is an xx% faster build time, I am not
> > convinced we should make this change.
> >
> > Best,
> > Gary
> >
> > On Mon, Oct 17, 2022 at 2:58 PM Danny Chan  wrote:
> >
> > > I experienced first-hand how Apache Calcite switched from Maven to
> > > Gradle, and I want to share some thoughts.
> > >
> > > The Gradle build is fast, but it relies heavily on its local cache, and
> > > it usually takes a long time to download the cached jars again because
> > > Gradle upgrades itself very frequently.
> > >
> > > Gradle is very flexible for building, but it also has many bugs, so you
> > > may need more time to debug them than you would when building with
> > > Maven.
> > >
> > > All developers would have to learn the Gradle build DSL.
> > >
> > > For all the reasons above, I don't think switching to Gradle was the
> > > right decision for Apache Calcite. Julian Hyde, the creator of Calcite,
> > > may have more to say here.
> > >
> > > So I would not suggest we do that for Hudi.
> > >
> > >
> > > Best,
> > > Danny Chan
> > >
> > > Shiyan Xu wrote on Sat, Oct 1, 2022 at 13:48:
> > > >
> > > > Hi all,
> > > >
> > > > I'd like to raise a discussion around the build tool for Hudi.
> > > >
> > > > Maven has been a mature yet slow (10 min to build on a 2021 MacBook
> > > > Pro) build tool compared to modern ones like Gradle or Bazel. We all
> > > > want faster builds; however, we also need to consider the effort and
> > > > risk of upgrading, and the developers' feedback on usability.
> > > >
> > > > What do you all think about upgrading to Gradle or Bazel? Please share
> > > > your thoughts. Thanks.
> > > >
> > > > --
> > > > Best,
> > > > Shiyan
> > >
> >
>


-- 
Best,
Shiyan