Re: PR 7984 implements hash partitioning

2023-02-16 Thread Alexey Kudinkin
Thanks for your contribution, Lvhu!

I think we should actually kick-start this effort with a small RFC
outlining the proposed changes first, as this modifies the core read-flow
for all Hudi tables and we want to make sure our approach there is
rock-solid.

On Thu, Feb 16, 2023 at 6:34 AM 吕虎  wrote:

> Hi folks,
>   PR 7984 ( https://github.com/apache/hudi/pull/7984 ) implements hash
> partitioning.
>   As you know, it is often difficult to find an appropriate partition
> key in existing big data. Hash partitioning can easily solve this
> problem, and it can greatly improve the performance of Hudi's big data
> processing.
>   The idea is to use the hash partition field as one of the partition
> fields of the ComplexKeyGenerator, so this PR's implementation does not
> involve modifying the logic of the core code.
>   The code is easy to review, and I think hash partitioning is very
> useful. We really need it.
>   For how to use hash partitioning in the Spark data source, refer to
> https://github.com/lvhu-goodluck/hudi/blob/hash_partition_spark_data_source/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
>  #testHashPartition
>
>   There is no public API change, user-facing feature change, or
> performance impact if the hash partition parameters are not specified.
>
>   When hash.partition.fields is specified and partition.fields
> contains _hoodie_hash_partition, a column named _hoodie_hash_partition will
> be added to the table as one of the partition keys.
>
>   If predicates of hash.partition.fields appear in the query
> statement, the _hoodie_hash_partition = X predicate will be automatically
> added to the query statement for partition pruning.
>
> I hope folks can help review it!
>   Thanks!
> Lvhu
>
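
The mechanism described in the quoted message (derive a _hoodie_hash_partition
value from the hash fields on write, then add a matching equality predicate on
read) can be sketched roughly as below. This is only an illustration under
stated assumptions: the CRC32 hash, the bucket count, and every function name
are hypothetical and are not taken from PR 7984; only the column name
_hoodie_hash_partition comes from the thread.

```python
import zlib

HASH_PARTITION_COLUMN = "_hoodie_hash_partition"  # column name used in the thread


def hash_partition_value(record, hash_fields, num_buckets):
    """Map the values of the hash fields to a bucket in [0, num_buckets)."""
    # Assumption: CRC32 over the joined field values; the PR may hash differently.
    key = "\x00".join(str(record[f]) for f in hash_fields)
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def with_hash_partition(record, hash_fields, num_buckets):
    """Return a copy of the record with the derived partition column added."""
    out = dict(record)
    out[HASH_PARTITION_COLUMN] = hash_partition_value(record, hash_fields, num_buckets)
    return out


def derive_pruning_predicate(equality_predicates, hash_fields, num_buckets):
    """Return (column, bucket) if every hash field has an equality predicate, else None."""
    if not all(f in equality_predicates for f in hash_fields):
        return None
    bucket = hash_partition_value(equality_predicates, hash_fields, num_buckets)
    return (HASH_PARTITION_COLUMN, bucket)
```

A query that filters on all hash fields with equality can then be narrowed to a
single hash partition; a query that does not mention them gets no extra
predicate and scans as before.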


Re: [VOTE] Release 0.12.2, release candidate #1

2022-12-23 Thread Alexey Kudinkin
+1 (non-binding)

[OK] Built successfully for Spark 2.4, 3.x
[OK] Run Spark SQL tests

On Fri, Dec 23, 2022 at 12:19 PM Y Ethan Guo  wrote:

> +1 non-binding
>
> [OK] checksums and signatures
> [OK] ran release validation script
> [OK] built successfully (Spark 2.4, 3.3)
> [OK] Spark 3.3.1 quickstart guide
>
> On Fri, Dec 23, 2022 at 1:30 AM Bhavani Sudha 
> wrote:
>
> > +1 binding
> >
> > [OK] Built successfully for multiple supported Spark versions
> >
> > [OK] Ran validation script
> >
> > [OK] Ran QuickStart on spark 3.2
> >
> >
> > ./release/validate_staged_release.sh --release=0.12.2 --rc_num=1
> >
> > /tmp/validation_scratch_dir_001 ~/hudi/scripts
> >
> > Downloading from svn co https://dist.apache.org/repos/dist/dev/hudi
> >
> > Validating hudi-0.12.2-rc1 with release type "dev"
> >
> > Checking Checksum of Source Release
> >
> > Checksum Check of Source Release - [OK]
> >
> >
> >   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
> >
> >                                  Dload  Upload   Total   Spent    Left  Speed
> >
> > 100 69274  100 69274    0     0  97810      0 --:--:-- --:--:-- --:--:-- 98962
> >
> > Checking Signature
> >
> > Signature Check - [OK]
> >
> >
> > Checking for binary files in source release
> >
> > No Binary Files in Source Release? - [OK]
> >
> >
> > Checking for DISCLAIMER
> >
> > DISCLAIMER file exists ? [OK]
> >
> >
> > Checking for LICENSE and NOTICE
> >
> > License file exists ? [OK]
> >
> > Notice file exists ? [OK]
> >
> >
> > Performing custom Licensing Check
> >
> > Licensing Check Passed [OK]
> >
> >
> > Running RAT Check
> >
> > RAT Check Passed [OK]
> >
> >
> > ~/hudi/scripts
> >
> >
> >
> > On Thu, Dec 22, 2022 at 8:18 PM sagar sumit  wrote:
> >
> > > +1 (non-binding)
> > >
> > > Ran long-running deltastreamer.
> > > Validated meta sync and queried tables through Presto/Trino.
> > >
> > > On Fri, Dec 23, 2022 at 5:14 AM Sivabalan  wrote:
> > >
> > > > +1 binding.
> > > >
> > > > release Validation script succeeded.
> > > > > Ran tests w/ diff spark versions for spark-ds writers and deltastreamer
> > > > > for a few hours.
> > > > Ran multi-writer tests.
> > > >
> > > >
> > > > On Thu, 22 Dec 2022 at 08:04, Satish Kotha 
> > > wrote:
> > > >
> > > > > > We discussed on Slack. Because the commits below didn't meet the code
> > > > > > freeze date, we are skipping them in the 0.12.2 release.
> > > > > >
> > > > > > Please test the release; feedback is appreciated.
> > > > >
> > > > > On Tue, Dec 20, 2022 at 10:41 PM Danny Chan 
> > > > wrote:
> > > > >
> > > > > > > Hi, there are another 2 fixes that I want to include:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/hudi/commit/c288a506d4c0b7c1272538d95928df118e4d79ac
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/hudi/commit/211af1a4fd76ce84ce80f4d1b2befe5fc9954888
> > > > > >
> > > > > > Best,
> > > > > > Danny
> > > > > >
> > > > > > > On Tue, Dec 20, 2022 at 11:50, Satish Kotha wrote:
> > > > > > >
> > > > > > > small correction in the first line: Please review and vote on
> > > > > > > the release candidate #1 for the version 0.12.2,
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Dec 19, 2022 at 6:37 PM Satish Kotha <
> > > satish.ko...@gmail.com
> > > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > Please review and vote on the release candidate #1 for the
> > > version
> > > > > > 0.12.1,
> > > > > > > > as follows:
> > > > > > > >
> > > > > > > > [ ] +1, Approve the release
> > > > > > > > [ ] -1, Do not approve the release (please provide specific
> > > > comments)
> > > > > > > >
> > > > > > > > The complete staging area is available for your review, which
> > > > > includes:
> > > > > > > >
> > > > > > > > * JIRA release notes [1],
> > > > > > > > * the official Apache source release and binary convenience
> > > > releases
> > > > > > to be
> > > > > > > > deployed to dist.apache.org [2], which are signed with the
> key
> > > > with
> > > > > > > > fingerprint 6DA0B39A13C2658D22AE7D14D08C4B6BD98EA659 [3],
> > > > > > > > * all artifacts to be deployed to the Maven Central
> Repository
> > > [4],
> > > > > > > > * source code tag "release-0.12.2-rc1" [5],
> > > > > > > >
> > > > > > > > The vote will be open for at least 72 hours. It is adopted by
> > > > > majority
> > > > > > > > approval, with at least 3 PMC affirmative votes.
> > > > > > > >
> > > > > > > > Thanks to Sagar, Raymond and Shiva and others from the
> > community
> > > > for
> > > > > > > > helping me through the release process.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Release Manager
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12352249&styleName=Html&projectId=12322822&Create=Create&atl_token=A5KQ-2QAV-T4JA-FDED_88b472602a0f3c72f949e98ae8087a47c815053b_lin
> > > > > > > >
> > > > > > > > [

Re: RFC-46 Status Update

2022-12-06 Thread Alexey Kudinkin
Thanks for bubbling this up, Siva!

Merge has not happened yet.

What you're saying makes sense to me -- after merging RFC-46 everyone will
likely have to rebase their changes, since it touches quite a few core
components, so it'd make it harder to cherry-pick 0.12.2 changes.

What do other folks think?

On Tue, Dec 6, 2022 at 2:06 PM Sivabalan  wrote:

> Hey folks,
>  Has this happened already? If not, with 0.12.2 underway, is it
> possible to delay this until the end of this week? Not sure if the code freeze
> for 0.12.2 will be this Friday, but at least we can do our best and ask the RM
> to cherry-pick the commits until Friday. Sorry, I also don't want to drag out
> RFC-46 any longer; open to hearing others' thoughts.
>
> On Mon, 5 Dec 2022 at 09:56, Alexey Kudinkin  wrote:
>
> > Hey, everyone!
> >
> > Long-awaited RFC-46 update: It took the crew quite a bit more time than
> > originally anticipated to stabilize and make sure everything is working
> as
> > expected and now we believe it's finally ready for prime time!
> >
> > As such, we're going to execute on a following plan:
> >
> >1. As called out originally, to move forward with the PR of this
> >magnitude we will be instituting *temporary code-freeze* on master
> >starting at *00:00 (midnight) 12/06 PST* and lasting for *24h. *
> >2. During that window we're planning to do one final rebase and land
> it
> >on master.
> >3. We will lift the code-freeze as soon as PR lands, but reserve 24h
> for
> >us to have ample buffer just in case.
> >
> > Let me know if you have any questions or concerns with this plan
> >
> > Alexey, on behalf of RFC-46 team
> >
> > On Wed, Sep 28, 2022 at 9:17 AM Alexey Kudinkin 
> > wrote:
> >
> > > @Ken yes, that's the plan eventually -- to rely on Execution (Query)
> > > Engines to provide their own representation that Hudi will be handling
> > w/o
> > > any intermediate transitions. If you're curious to learn more i'd
> > encourage
> > > you (and everyone) to check out the RFC-46 itself.
> > >
> > > In the Phase 1 though, we're only focusing on Spark integration for now
> > > (as proof of concept) and then later on after the new infra is hardened
> > > enough we can expand it to other engines as well.
> > >
> > > On Tue, Sep 27, 2022 at 7:07 PM Gary Li 
> > wrote:
> > >
> > >> Great work! Really excited about this feature. Kudos to RFC-46 team.
> > >>
> > >> Best,
> > >> Gary
> > >>
> > >> On Wed, Sep 28, 2022 at 7:22 AM Ken Krugler <
> > kkrugler_li...@transpac.com>
> > >> wrote:
> > >>
> > >> > Hi Alexey,
> > >> >
> > >> > Thanks for the update!
> > >> >
> > >> > So for maximum performance when writing to Hudi from a low-level
> > >> > (DataStream, not Table) Flink workflow, we’d be creating RowData
> > >> records?
> > >> >
> > >> > — Ken
> > >> >
> > >> >
> > >> > > On Sep 27, 2022, at 2:08 PM, Alexey Kudinkin 
> > >> wrote:
> > >> > >
> > >> > > Hello, everyone!
> > >> > >
> > >> > > As you might be aware, community has been very busy at work on
> > RFC-46
> > >> > > aiming to bring long-awaited cutting edge level of performance to
> > >> Hudi by
> > >> > > avoiding using Avro as an intermediate representation, instead
> > >> relying on
> > >> > > individual engines to host data in their own formats (InternalRow
> > for
> > >> > > Spark, RowData for Flink, etc)
> > >> > >
> > >> > > We wanted to share an update in terms of where we are and what are
> > the
> > >> > next
> > >> > > steps from here:
> > >> > >
> > >> > >   - We're very close to completing the work and are already
> > preparing
> > >> to
> > >> > >   be landing complete implementation of the Phase 1 of the RFC-46
> > >> > currently
> > >> > >   being developed in a feature branch
> > >> > >   <https://github.com/apache/hudi/tree/release-feature-rfc46>
> > >> > >   - To be able to successfully merge the change of such scale, we
> > will
> > >> > >   have to do a *code freeze* for the m

Re: RFC-46 Status Update

2022-12-05 Thread Alexey Kudinkin
Hey, everyone!

Long-awaited RFC-46 update: It took the crew quite a bit more time than
originally anticipated to stabilize and make sure everything is working as
expected, and now we believe it's finally ready for prime time!

As such, we're going to execute on the following plan:

   1. As called out originally, to move forward with a PR of this
   magnitude we will be instituting a *temporary code-freeze* on master
   starting at *00:00 (midnight) 12/06 PST* and lasting for *24h*.
   2. During that window we're planning to do one final rebase and land it
   on master.
   3. We will lift the code-freeze as soon as the PR lands, but reserve 24h for
   us to have ample buffer just in case.

Let me know if you have any questions or concerns with this plan.

Alexey, on behalf of RFC-46 team

On Wed, Sep 28, 2022 at 9:17 AM Alexey Kudinkin  wrote:

> @Ken yes, that's the plan eventually -- to rely on Execution (Query)
> Engines to provide their own representation that Hudi will be handling w/o
> any intermediate transitions. If you're curious to learn more i'd encourage
> you (and everyone) to check out the RFC-46 itself.
>
> In the Phase 1 though, we're only focusing on Spark integration for now
> (as proof of concept) and then later on after the new infra is hardened
> enough we can expand it to other engines as well.
>
> On Tue, Sep 27, 2022 at 7:07 PM Gary Li  wrote:
>
>> Great work! Really excited about this feature. Kudos to RFC-46 team.
>>
>> Best,
>> Gary
>>
>> On Wed, Sep 28, 2022 at 7:22 AM Ken Krugler 
>> wrote:
>>
>> > Hi Alexey,
>> >
>> > Thanks for the update!
>> >
>> > So for maximum performance when writing to Hudi from a low-level
>> > (DataStream, not Table) Flink workflow, we’d be creating RowData
>> records?
>> >
>> > — Ken
>> >
>> >
>> > > On Sep 27, 2022, at 2:08 PM, Alexey Kudinkin 
>> wrote:
>> > >
>> > > Hello, everyone!
>> > >
>> > > As you might be aware, community has been very busy at work on RFC-46
>> > > aiming to bring long-awaited cutting edge level of performance to
>> Hudi by
>> > > avoiding using Avro as an intermediate representation, instead
>> relying on
>> > > individual engines to host data in their own formats (InternalRow for
>> > > Spark, RowData for Flink, etc)
>> > >
>> > > We wanted to share an update in terms of where we are and what are the
>> > next
>> > > steps from here:
>> > >
>> > >   - We're very close to completing the work and are already preparing
>> to
>> > >   be landing complete implementation of the Phase 1 of the RFC-46
>> > currently
>> > >   being developed in a feature branch
>> > >   <https://github.com/apache/hudi/tree/release-feature-rfc46>
>> > >   - To be able to successfully merge the change of such scale, we will
>> > >   have to do a *code freeze* for the master branch barring any
>> changes to
>> > >   land before we're able to merge the feature-branch.
>> > >   - To make sure that this activity doesn't interrupt the 0.12.1
>> release
>> > >   that is currently in progress we're tentatively planning to schedule
>> > this
>> > >   code-freeze *after* successful finalization of the release process
>> with
>> > >   RC branch being cut and validated for release. As of now, provided
>> RC
>> > >   candidate will be cut tomorrow on 09/28 we're aiming to schedule a
>> > merge
>> > >   attempt somewhere mid to late next week.
>> > >   - We will follow-up on this thread separately at least *24h* before
>> the
>> > >   scheduled code-freeze with an exact date and time frame for it. Stay
>> > tuned.
>> > >
>> > >
>> > > Alexey, on behalf of the RFC-46 group
>> >
>> > --
>> > Ken Krugler
>> > http://www.scaleunlimited.com
>> > Custom big data solutions
>> > Flink, Pinot, Solr, Elasticsearch
>> >
>> >
>> >
>> >
>>
>


[RFC] RFC-64: New APIs to facilitate faster Query Engine integrations

2022-11-10 Thread Alexey Kudinkin
Hello, everyone!

Recently we've been hard at work holistically evaluating how we can
streamline our current integration model and enable faster turnaround time
for building new Query Integrations.

We already have an impressive portfolio of integrations natively supporting
Hudi today, like Spark, Flink, Presto, Trino, Hive, and more, but we're also
looking forward to supporting many more in the future.

As such we've come up with a proposal outlined in RFC-64 that we'd like to
share and solicit feedback on from the broader Hudi community.

Please take a look and share your thoughts either in the PR or in this
thread
https://github.com/apache/hudi/pull/7080


Re: [Discuss] SCD-2 Payload

2022-10-24 Thread Alexey Kudinkin
Hey, hey, Fengjian!

With the landing of RFC-46 we'll be kick-starting the process of phasing
out HoodieRecordPayload as an abstraction and migrating to the
HoodieRecordMerger interface instead.
I'd recommend basing your design considerations on the new
HoodieRecordMerger interface instead of the legacy HoodieRecordPayload to make
sure it's future-proof.

On Thu, Oct 20, 2022 at 10:08 AM 冯健  wrote:

> Hi guys,
> After reading this article on how to implement SCD-2 with Hudi, "Build
> Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and
> Apache Hudi on Amazon EMR"
> <
> https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/
> >
> I have an idea for implementing embedded SCD-2 support in Hudi by
> using a new Payload, so that users don't need to manually join the data and
> then update end_date and status.
>    For example, say the record key is 'id,end_date', the current
> record's id is 1, and its end_date is 2099-12-31. When a new record with id=1
> arrives, the Payload will update the current record's end_date to 2022-10-21
> and also insert the new record with end_date '2099-12-31'. So this Payload
> will generate two records in combineAndGetUpdateValue. There will be no
> join cost, and the whole process is transparent to users.
>
>Any thoughts?
>
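
The merge behaviour proposed in the quoted message (close out the current row
and open a new one for the same id) can be sketched as follows. This is a
plain-Python illustration only, not a HoodieRecordPayload or
HoodieRecordMerger implementation; the function name, the dict-based records
and the OPEN_END_DATE constant are hypothetical, and only the id and end_date
fields and the 2099-12-31 sentinel come from the thread.

```python
from datetime import date

OPEN_END_DATE = date(2099, 12, 31)  # sentinel marking the "current" row, per the thread


def scd2_merge(current, incoming, effective_date):
    """Close the current row and open a new one for the same id.

    Returns two records: the old row with its end_date set to effective_date,
    and the incoming row marked as current via the open-ended end_date.
    """
    if current["id"] != incoming["id"]:
        raise ValueError("records must share the same id")
    closed = dict(current, end_date=effective_date)  # copy, do not mutate the input
    opened = dict(incoming, end_date=OPEN_END_DATE)
    return [closed, opened]
```

For the example in the thread, merging a new record with id=1 on 2022-10-21
yields the old row with end_date 2022-10-21 and the new row with end_date
2099-12-31, with no join against the existing table.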


Re: [ANNOUNCE] Apache Hudi 0.12.1 released

2022-10-19 Thread Alexey Kudinkin
Thanks Zhaojing for masterfully navigating this release!

On Wed, Oct 19, 2022 at 7:46 AM Vinoth Chandar  wrote:

> Great job everyone!
>
> On Wed, Oct 19, 2022 at 07:11 zhaojing yu  wrote:
>
> > The Apache Hudi team is pleased to announce the release of Apache Hudi
> > 0.12.1.
> >
> > Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes
> > and Incrementals. Apache Hudi manages storage of large analytical
> > datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible
> > storage) and provides the ability to query them.
> >
> > This release comes 2 months after 0.12.0. It includes more than
> > 150 resolved issues, comprising a few new features as well as
> > general improvements and bug fixes. You can read the release
> > highlights at https://hudi.apache.org/releases/release-0.12.1.
> >
> > For details on how to use Hudi, please look at the quick start page
> located
> > at https://hudi.apache.org/docs/quick-start-guide.html
> >
> > If you'd like to download the source release, you can find it here:
> > https://github.com/apache/hudi/releases/tag/release-0.12.1
> >
> > Release notes including the resolved issues can be found here:
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352182
> >
> > We welcome your help and feedback. For more information on how to report
> > problems, and to get involved, visit the project website at
> > https://hudi.apache.org
> >
> > Thanks to everyone involved!
> >
> > Release Manager
> >
>


Re: [DISCUSS] Hudi data TTL

2022-10-18 Thread Alexey Kudinkin
That's a very interesting idea.

Do you want to take a stab at writing a full proposal (in the form of RFC)
for it?

On Tue, Oct 18, 2022 at 10:20 AM Bingeng Huang  wrote:

> Hi all,
>
> Do we have a plan to integrate data TTL into Hudi, so that we don't have to
> schedule an offline Spark job to delete outdated data? Users would just set a
> TTL config, and the writer or some offline service would delete old data as
> expected.
>


[Action Required] Spark Bloom Index Metadata Regression in 0.12

2022-10-11 Thread Alexey Kudinkin
Hello, everyone!

Recently a regression in the Hudi 0.12 release was discovered, related to Bloom
Index metadata persisted w/in Parquet footers (HUDI-4992).

The crux of the problem was that min/max statistics for the record keys were
computed incorrectly during the (Spark-specific) row-writing Bulk Insert
operation. This affected the Key Range Pruning flow w/in the Hoodie Bloom Index
tagging sequence, resulting in updated records being incorrectly tagged as
"inserts" rather than "updates", leading to duplicated records in the table.
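
To make the failure mode concrete, here is a minimal sketch of key-range
pruning, assuming a simplified model in which each file carries a
[min_key, max_key] range and a plain set lookup stands in for the bloom filter
and the actual key check. The function names and data layout are hypothetical
and do not mirror Hudi's internal classes.

```python
def candidate_files(record_key, file_key_ranges):
    """Return files whose [min_key, max_key] range could contain record_key."""
    return [
        name
        for name, (min_key, max_key) in file_key_ranges.items()
        if min_key <= record_key <= max_key
    ]


def tag(record_key, file_key_ranges, keys_in_file):
    """Tag a key as 'update' if a candidate file holds it, else 'insert'."""
    for name in candidate_files(record_key, file_key_ranges):
        # Stands in for the bloom filter probe plus the actual key lookup.
        if record_key in keys_in_file[name]:
            return "update"
    return "insert"


keys_in_file = {"f1": {"key-001", "key-500", "key-900"}}
correct_ranges = {"f1": ("key-001", "key-900")}
wrong_ranges = {"f1": ("key-001", "key-500")}  # max key understated

print(tag("key-900", correct_ranges, keys_in_file))  # update
print(tag("key-900", wrong_ranges, keys_in_file))    # insert
```

With the understated max key, the file that actually holds key-900 is pruned
before it is ever probed, so the incoming record is written as a new insert
next to the existing copy.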

A PR addressing the problem has already landed on master and *is also going
to be incorporated into the upcoming Hudi 0.12.1 release.*

If all of the following apply to you:

   1. Using Spark as an execution engine
   2. Using Bulk Insert (using row-writing, enabled *by default*)
   3. Using Bloom Index (with range-pruning, enabled *by default*) for
   "UPSERT" operations

Please consider one of the following potential remediations to avoid
getting duplicate records in your pipeline:

   - Disabling the Bloom Index range-pruning flow (might affect performance
   of upsert operations)
   - Upgrading to 0.12.1 (which is targeted to be released this week)
   - Making sure that the fix is included in your custom artifacts (if you're
   building and using your own)


Please let me know if you have any questions.


Re: [VOTE] Release 0.12.1, release candidate #1

2022-10-04 Thread Alexey Kudinkin
Apologies for the confusion! I missed that the reverting commit has already
been successfully cherry-picked.

Taking back my -1, and voting +1 (non-binding)

On Tue, Oct 4, 2022 at 9:46 PM Y Ethan Guo  wrote:

> +1 (non-binding)
>
> - [OK] checksums and signatures
> - [OK] ran release validation script
> - [OK] built successfully
> - [OK] error injection tests
> - [OK] table upgrade and downgrade tests
>
> On Tue, Oct 4, 2022 at 11:06 PM zhaojing yu  wrote:
>
> > This commit has been reverted in version 0.12.1.
> >
> > On Wed, Oct 5, 2022 at 03:45, Alexey Kudinkin wrote:
> >
> > > -1
> > >
> > > Unfortunately, we will have to revert
> > > commit 830e35c3f1d5663c9e96d36da4af67928e9d598b, as it plants a
> > performance
> > > regression that the author is currently working on to address.
> > >
> > >
> > > On Tue, Oct 4, 2022 at 10:08 AM Sivabalan  wrote:
> > >
> > > > Sorry about that. Raymond referred me to apache policy around license
> > > > headers <
> https://www.apache.org/legal/src-headers.html#faq-exceptions
> > >.
> > > > So,
> > > > reverting my vote to +1.
> > > >
> > > > Ran Deltastreamer tests, lock provider tests, structured spark
> > streaming
> > > > tests.
> > > >
> > > > On Tue, 4 Oct 2022 at 08:48, zhaojing yu 
> wrote:
> > > >
> > > > > Confirm in
> > > https://www.apache.org/legal/src-headers.html#faq-exceptions
> > > > > that such files do not require LICENSE HEADER, have been modified
> > > > > validate_staged_release.sh skip the corresponding checks.
> > > > >
> > > > > On Tue, Oct 4, 2022 at 12:01, Sivabalan wrote:
> > > > >
> > > > > > -1 Looks like we missed to add license header to a text file.
> > > > > >
> > > > > > ./release/validate_staged_release.sh --release=0.12.1 --rc_num=1
> > > > > > /tmp/validation_scratch_dir_001
> > > > > > ~/Documents/personal/projects/nov26/hudi/scripts
> > > > > > Downloading from svn co
> > https://dist.apache.org/repos/dist/dev/hudi
> > > > > > Validating hudi-0.12.1-rc1 with release type "dev"
> > > > > > Checking Checksum of Source Release
> > > > > > Checksum Check of Source Release - [OK]
> > > > > >
> > > > > > >   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
> > > > > > >                                  Dload  Upload   Total   Spent    Left  Speed
> > > > > > > 100 65803  100 65803    0     0   116k      0 --:--:-- --:--:-- --:--:--  116k
> > > > > > Checking Signature
> > > > > > Signature Check - [OK]
> > > > > >
> > > > > > Checking for binary files in source release
> > > > > > No Binary Files in Source Release? - [OK]
> > > > > >
> > > > > > Checking for DISCLAIMER
> > > > > > DISCLAIMER file exists ? [OK]
> > > > > >
> > > > > > Checking for LICENSE and NOTICE
> > > > > > License file exists ? [OK]
> > > > > > Notice file exists ? [OK]
> > > > > >
> > > > > > Performing custom Licensing Check
> > > > > > There were some source files that did not have Apache
> > > > > > License*./hudi-cli/src/main/resources/banner.txt*
> > > > > >
> > > > > >
> > > > > > On Sat, 1 Oct 2022 at 05:56, zhaojing yu 
> > > wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > Please review and vote on the release candidate #1 for the
> > version
> > > > > > 0.12.1,
> > > > > > > as follows:
> > > > > > >
> > > > > > > [ ] +1, Approve the release
> > > > > > > [ ] -1, Do not approve the release (please provide specific
> > > comments)
> > > > > > >
> > > > > > > The complete staging area is available for your review, which
> > > > includes:
> > > &g

Re: [VOTE] Release 0.12.1, release candidate #1

2022-10-04 Thread Alexey Kudinkin
-1

Unfortunately, we will have to revert
commit 830e35c3f1d5663c9e96d36da4af67928e9d598b, as it introduces a performance
regression that the author is currently working to address.


On Tue, Oct 4, 2022 at 10:08 AM Sivabalan  wrote:

> Sorry about that. Raymond referred me to the Apache policy around license
> headers <https://www.apache.org/legal/src-headers.html#faq-exceptions>.
> So, reverting my vote to +1.
>
> Ran Deltastreamer tests, lock provider tests, structured spark streaming
> tests.
>
> On Tue, 4 Oct 2022 at 08:48, zhaojing yu  wrote:
>
> > Confirmed in https://www.apache.org/legal/src-headers.html#faq-exceptions
> > that such files do not require a license header; validate_staged_release.sh
> > has been modified to skip the corresponding checks.
> >
> > On Tue, Oct 4, 2022 at 12:01, Sivabalan wrote:
> >
> > > -1 Looks like we missed to add license header to a text file.
> > >
> > > ./release/validate_staged_release.sh --release=0.12.1 --rc_num=1
> > > /tmp/validation_scratch_dir_001
> > > ~/Documents/personal/projects/nov26/hudi/scripts
> > > Downloading from svn co https://dist.apache.org/repos/dist/dev/hudi
> > > Validating hudi-0.12.1-rc1 with release type "dev"
> > > Checking Checksum of Source Release
> > > Checksum Check of Source Release - [OK]
> > >
> > >   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
> > >                                  Dload  Upload   Total   Spent    Left  Speed
> > > 100 65803  100 65803    0     0   116k      0 --:--:-- --:--:-- --:--:--  116k
> > > Checking Signature
> > > Signature Check - [OK]
> > >
> > > Checking for binary files in source release
> > > No Binary Files in Source Release? - [OK]
> > >
> > > Checking for DISCLAIMER
> > > DISCLAIMER file exists ? [OK]
> > >
> > > Checking for LICENSE and NOTICE
> > > License file exists ? [OK]
> > > Notice file exists ? [OK]
> > >
> > > Performing custom Licensing Check
> > > There were some source files that did not have Apache License:
> > > *./hudi-cli/src/main/resources/banner.txt*
> > >
> > >
> > > On Sat, 1 Oct 2022 at 05:56, zhaojing yu  wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Please review and vote on the release candidate #1 for the version
> > > 0.12.1,
> > > > as follows:
> > > >
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > The complete staging area is available for your review, which
> includes:
> > > >
> > > > * JIRA release notes [1],
> > > > * the official Apache source release and binary convenience releases
> to
> > > be
> > > > deployed to dist.apache.org [2], which are signed with the key with
> > > > fingerprint B4305519F36DD7E8B7E6A68458B85B8147783CE2 [3],
> > > > * all artifacts to be deployed to the Maven Central Repository [4],
> > > > * source code tag "release-0.12.1-rc1" [5],
> > > >
> > > > The vote will be open for at least 72 hours. It is adopted by
> majority
> > > > approval, with at least 3 PMC affirmative votes.
> > > >
> > > > Thanks,
> > > > Release Manager
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12352182
> > > > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.12.1-rc1/
> > > > [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
> > > > [4]
> > > https://repository.apache.org/content/repositories/orgapachehudi-1093/
> > > > [5] https://github.com/apache/hudi/releases/tag/release-0.12.1-rc1
> > > >
> > >
> > >
> > > --
> > > Regards,
> > > -Sivabalan
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Re: [DISCUSS] Build tool upgrade

2022-10-03 Thread Alexey Kudinkin
I think a full project build is slowly gravitating towards 15min already (it's
about 12-14min on my 2021 Macbook).

@Vinoth the most important thing that Maven can't provide us with is
local incremental builds. Currently you have to build the full dependency
hierarchy of the project whenever you change even a single file.
There are some limited workarounds, but they aren't really a replacement for
fully incremental builds.

Fully incremental builds would be a huge boost to dev productivity.

On Sun, Oct 2, 2022 at 11:40 PM Pratyaksh Sharma 
wrote:

> My two cents. I have seen open source projects take more than 20-25 minutes
> for building on maven, so I guess we are fine for now. But we can
> definitely investigate and try to optimize if we can.
>
> On Sun, Oct 2, 2022 at 9:33 AM Shiyan Xu 
> wrote:
>
> > Yes, Vinoth, agree on the efforts and impact being big.
> >
> > Some perf comparison on gradle vs maven can be found in
> > https://gradle.org/gradle-vs-maven-performance/ where it claims multi-fold
> > build time reduction. Based on that, I'd estimate maybe 2-4 min for a full
> > build.
> >
> > I mainly hope to collect some feedback on if build time is a dev
> experience
> > concern or if it's okay for people in general. If it's the latter case,
> > then no need to investigate further at this point.
> >
> > On Sat, Oct 1, 2022 at 1:52 PM Vinoth Chandar  wrote:
> >
> > > Hi Raymond.
> > >
> > > This would be a large undertaking and a big change for everyone.
> > >
> > > What does the build time look like if we switch to gradle or bazel? And
> > do
> > > we know why it takes 10 min to build and why is that not okay? Given we
> > all
> > > use IDEs mostly anyway
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Fri, Sep 30, 2022 at 22:48 Shiyan Xu 
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'd like to raise a discussion around the build tool for Hudi.
> > > >
> > > > Maven has been a mature yet slow (10min to build on 2021 macbook pro)
> > > build
> > > > tool compared to modern ones like gradle or bazel. We all want faster
> > > > builds, however, we also need to consider the efforts and risks to
> > > upgrade,
> > > > and the developers' feedback on usability.
> > > >
> > > > What do you all think about upgrading to gradle or bazel? Please
> share
> > > your
> > > > thoughts. Thanks.
> > > >
> > > > --
> > > > Best,
> > > > Shiyan
> > > >
> > >
> >
> >
> > --
> > Best,
> > Shiyan
> >
>


Re: RFC-46 Status Update

2022-09-28 Thread Alexey Kudinkin
@Ken yes, that's the plan eventually -- to rely on Execution (Query)
Engines to provide their own representation that Hudi will be handling w/o
any intermediate transitions. If you're curious to learn more, I'd encourage
you (and everyone) to check out RFC-46 itself.

In Phase 1, though, we're only focusing on Spark integration for now (as a
proof of concept), and later on, after the new infra is hardened enough,
we can expand it to other engines as well.

On Tue, Sep 27, 2022 at 7:07 PM Gary Li  wrote:

> Great work! Really excited about this feature. Kudos to RFC-46 team.
>
> Best,
> Gary
>
> On Wed, Sep 28, 2022 at 7:22 AM Ken Krugler 
> wrote:
>
> > Hi Alexey,
> >
> > Thanks for the update!
> >
> > So for maximum performance when writing to Hudi from a low-level
> > (DataStream, not Table) Flink workflow, we’d be creating RowData records?
> >
> > — Ken
> >
> >
> > > On Sep 27, 2022, at 2:08 PM, Alexey Kudinkin wrote:
> > > >
> > > > Hello, everyone!
> > > >
> > > > As you might be aware, community has been very busy at work on RFC-46
> > > > aiming to bring long-awaited cutting edge level of performance to Hudi
> > > > by avoiding using Avro as an intermediate representation, instead
> > > > relying on individual engines to host data in their own formats
> > > > (InternalRow for Spark, RowData for Flink, etc)
> > > >
> > > > We wanted to share an update in terms of where we are and what are the
> > > > next steps from here:
> > > >
> > > >   - We're very close to completing the work and are already preparing
> > > >   to be landing complete implementation of the Phase 1 of the RFC-46
> > > >   currently being developed in a feature branch
> > > >   <https://github.com/apache/hudi/tree/release-feature-rfc46>
> > > >   - To be able to successfully merge the change of such scale, we will
> > > >   have to do a *code freeze* for the master branch barring any changes
> > > >   to land before we're able to merge the feature-branch.
> > > >   - To make sure that this activity doesn't interrupt the 0.12.1
> > > >   release that is currently in progress we're tentatively planning to
> > > >   schedule this code-freeze *after* successful finalization of the
> > > >   release process with RC branch being cut and validated for release.
> > > >   As of now, provided RC candidate will be cut tomorrow on 09/28 we're
> > > >   aiming to schedule a merge attempt somewhere mid to late next week.
> > > >   - We will follow-up on this thread separately at least *24h* before
> > > >   the scheduled code-freeze with an exact date and time frame for it.
> > > >   Stay tuned.
> > >
> > >
> > > Alexey, on behalf of the RFC-46 group
> >
> > --
> > Ken Krugler
> > http://www.scaleunlimited.com
> > Custom big data solutions
> > Flink, Pinot, Solr, Elasticsearch
> >
> >
> >
> >
>


RFC-46 Status Update

2022-09-27 Thread Alexey Kudinkin
Hello, everyone!

As you might be aware, the community has been hard at work on RFC-46,
which aims to bring a long-awaited, cutting-edge level of performance to
Hudi by no longer using Avro as an intermediate representation and instead
relying on individual engines to host data in their own formats
(InternalRow for Spark, RowData for Flink, etc.).

We wanted to share an update on where we are and what the next steps are:

   - We're very close to completing the work and are already preparing to
   land the complete implementation of Phase 1 of RFC-46, which is
   currently being developed in a feature branch
   <https://github.com/apache/hudi/tree/release-feature-rfc46>.
   - To be able to successfully merge a change of such scale, we will
   have to do a *code freeze* for the master branch, barring any other
   changes from landing until we're able to merge the feature branch.
   - To make sure this activity doesn't interrupt the 0.12.1 release
   that is currently in progress, we're tentatively planning to schedule
   this code freeze *after* the release process is finalized, with the RC
   branch cut and validated for release. As of now, provided the RC is
   cut tomorrow (09/28), we're aiming to schedule a merge attempt
   somewhere mid to late next week.
   - We will follow up on this thread separately at least *24h* before the
   scheduled code freeze with an exact date and time frame for it. Stay tuned.


Alexey, on behalf of the RFC-46 group


Re: 0.12.1 release timeline

2022-09-27 Thread Alexey Kudinkin
Zhaojing, we synced up with the authors of the aforementioned PRs, and
everyone feels comfortable that we should be able to land these w/in the
currently set deadline of 09/28 11:59 PST.

As such, I'd suggest we keep the existing deadline intact to avoid delaying
the release. What do you think?

On Tue, Sep 27, 2022 at 3:37 AM zhaojing yu  wrote:

> Hi folks,
>
> There are still a few critical PRs pending, e.g.,
> HUDI-2780 <https://github.com/apache/hudi/pull/4015>,
> HUDI-3636 <https://github.com/apache/hudi/pull/5269>,
> HUDI-4453 <https://github.com/apache/hudi/pull/6676>,
> HUDI-4687 <https://github.com/apache/hudi/pull/6657>,
> HUDI-4855 <https://github.com/apache/hudi/pull/6694>, etc.
>
> We should get those landed for 0.12.1.
> With that in mind, I'm proposing a new date for the code freeze: *Oct 2nd,
> 11:59 PM PST*.
> Kindly let me know if you have any concerns.
>
>
> Thanks,
> - Zhaojing
>
> zhaojing yu wrote on Fri, Sep 23, 2022 at 15:10:
>
> > After discussion, our goal will be to freeze the code on 28th Sep.
> > Until then I will follow the progress of the above blocker PRs.
> >
> > Sivabalan wrote on Wed, Sep 21, 2022 at 11:02:
> >
> >> We are targeting to land all of these by 28th Sep. We will try our best
> >> to land them before the start of next week, but the 28th would be more
> >> practical.
> >>
> >> On Tue, 20 Sept 2022 at 18:21, Vinoth Chandar wrote:
> >>
> >> > Thanks for sharing. Do we have an ETA for these?
> >> >
> >> > Zhaojing - please chime with your thoughts as well
> >> >
> >> > On Wed, Sep 21, 2022 at 06:34 Y Ethan Guo  wrote:
> >> >
> >> > > Hi Zhaojing,
> >> > >
> >> > > It would be good if we can land the following bootstrap fixes for
> >> > > 0.12.1 release.  I'm working on getting them merged.
> >> > >
> >> > > HUDI-4855: https://github.com/apache/hudi/pull/6694
> >> > > HUDI-4453: https://github.com/apache/hudi/pull/6676
> >> > >
> >> > > Thanks,
> >> > > - Ethan
> >> > >
> >> > > On Tue, Sep 20, 2022 at 12:03 PM Alexey Kudinkin <ale...@onehouse.ai> wrote:
> >> > >
> >> > > > There are also a few critical issues we want to address before
> >> > > > cutting the 0.12.1 release:
> >> > > >
> >> > > > HUDI-4760 <https://issues.apache.org/jira/browse/HUDI-4760>
> >> > > > HUDI-3636 <https://issues.apache.org/jira/browse/HUDI-3636>
> >> > > > HUDI-4885 <https://issues.apache.org/jira/browse/HUDI-4885>
> >> > > > HUDI-2780 <https://issues.apache.org/jira/browse/HUDI-2780>
> >> > > >
> >> > > > On Tue, Sep 20, 2022 at 10:14 AM Sivabalan wrote:
> >> > > >
> >> > > > > sure. We have few critical PRs that we are looking to land. Few
> >> > > > > notable ones are
> >> > > > >
> >> > > > > ClassNotFoundException when using hudi-spark-bundle to write
> >> > > > > table with hbase index <https://github.com/apache/hudi/pull/6715>
> >> > > > > Fix fq can not be queried in pending compaction when query ro
> >> > > > > table with spark <https://github.com/apache/hudi/pull/6516>
> >> > > > > Syncing non-partitioned table has bugs around partition
> >> > > > > parameters <https://github.com/apache/hudi/pull/6525>
> >> > > > > bootstrap bug fixes: https://github.com/apache/hudi/pull/6694
> >> > > > > and https://github.com/apache/hudi/pull/6676
> >> > > > >
> >> > > > >
> >> > > > > On Mon, 19 Sept 2022 at 20:24, Vinoth Chandar <vin...@apache.org> wrote:
> >> > > > >
> >> > > > > > tbh the RM can make this call. Whether or not 1 week is
> >> > > > > > aggressive, really depends on the scope of release, whats left
> >> > > > > > to land/test.
> >> > > > > >
> >> > > > > > Would it be useful to frame the discussion in that way?
> >> > > > > >
> >> > > > > > On M

Re: 0.12.1 release timeline

2022-09-20 Thread Alexey Kudinkin
There are also a few critical issues we want to address before cutting the
0.12.1 release:

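HUDI-4760 <https://issues.apache.org/jira/browse/HUDI-4760>
HUDI-3636 <https://issues.apache.org/jira/browse/HUDI-3636>
HUDI-4885 <https://issues.apache.org/jira/browse/HUDI-4885>
HUDI-2780 <https://issues.apache.org/jira/browse/HUDI-2780>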

On Tue, Sep 20, 2022 at 10:14 AM Sivabalan  wrote:

> sure. We have few critical PRs that we are looking to land. Few notable
> ones are
>
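> ClassNotFoundException when using hudi-spark-bundle to write table with
> hbase index <https://github.com/apache/hudi/pull/6715>
> Fix fq can not be queried in pending compaction when query ro table with
> spark <https://github.com/apache/hudi/pull/6516>
> Syncing non-partitioned table has bugs around partition parameters
> <https://github.com/apache/hudi/pull/6525>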
> bootstrap bug fixes: https://github.com/apache/hudi/pull/6694 and
> https://github.com/apache/hudi/pull/6676
>
>
> On Mon, 19 Sept 2022 at 20:24, Vinoth Chandar  wrote:
>
> > tbh the RM can make this call. Whether or not 1 week is aggressive,
> > really depends on the scope of release, whats left to land/test.
> >
> > Would it be useful to frame the discussion in that way?
> >
> > On Mon, Sep 19, 2022 at 1:25 PM zhaojing yu  wrote:
> >
> > > Does anyone else have any suggestions?
> > > We will determine the time of the code freeze tomorrow.
> > >
> > > Sivabalan wrote on Mon, Sep 19, 2022 at 14:05:
> > >
> > > > Hey hi Zhaojing,
> > > >   Announcing a code freeze just 1 week ahead might be too
> > > > aggressive. Do you think we can make it sometime next week (week of
> > > > 26th) to give some buffer for folks to push any critical fixes in?
> > > > Open to hear what others have to say.
> > > >
> > > >
> > > >
> > > > On Fri, 16 Sept 2022 at 01:47, zhaojing yu wrote:
> > > >
> > > > > To clarify, 09/21 is to cut RC1 and it will be released if all
> > > > > testing/checks pass.
> > > > >
> > > > > zhaojing yu wrote on Fri, Sep 16, 2022 at 16:45:
> > > > >
> > > > > > Hi folks,
> > > > > >
> > > > > > As the RM for the 0.12.1 release, I'd like to propose the code
> > > > > > freeze on Sep 21 (Wed) for any bug fixes that are going to be
> > > > > > included in the minor release, about a month after the 0.12.0
> > > > > > release.  Let me know if you need more time for fixing any issues.
> > > > > >
> > > > > > Please tag any fix that you think we should include in 0.12.1, by
> > > > > > setting the "Fix Version/s" to "0.12.1" in the corresponding Jira
> > > > > > ticket. As the RM, I will make the final decision.  I have started
> > > > > > cherry-picking the commits from the master.  I will watch out for
> > > > > > ongoing critical fixes and remind authors and reviewers in the PRs
> > > > > > along the way so they can land in time.
> > > > > >
> > > > > > cherry-picking link:
> > > > > > https://docs.google.com/document/d/1G4eeZkSUqMgONRI2YE0o_XBxy9kBaq_Cznolfp2VhS8/edit#heading=h.3fl3egu0kv0z
> > > > > >
> > > > > > Thanks,
> > > > > > - Zhaojing
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > > -Sivabalan
> > > >
> > >
> >
>
>
> --
> Regards,
> -Sivabalan
>


Survey around using Apache Orc in Hudi

2022-09-16 Thread Alexey Kudinkin
Hello, everyone!

We have recently discovered that Apache Orc support is unfortunately broken
in the Spark 3.x modules of Apache Hudi, because Spark 3.x switched from
the "nohive" flavor of Apache Orc to the conventional one (which brings in
some common Hive-originating interfaces). More details on that can be found
in HUDI-4496.

As such, we're trying to understand a little more about how Apache Orc is
used with Hudi, and would be very grateful if everyone currently running
such a setup could respond on this thread w/ answers to the following
questions:

   - What Spark version are you using?
   - What Hudi version are you using?
   - Are you considering upgrading to Spark 3.x?
   - Are you considering upgrading to the latest Hudi release(s)?


As always, appreciate your help and look forward to hearing back from you.


Re: [VOTE] Release 0.12.0, release candidate #1

2022-08-01 Thread Alexey Kudinkin
Hello, everyone!

We've found that Orc support is broken for Spark >= 3.1, and we'd really
like to make sure the fix for this makes it into 0.12.

-1 from my end.

On Sun, Jul 31, 2022 at 11:51 PM Danny Chan  wrote:

> Hi, sorry for bothering, but I would like to include HUDI-4504 and
> HUDI-4505, which are critical fixes for the Flink side.
>
> so -1 from my side.
>
> Best,
> Danny
>
> sagar sumit wrote on Sat, Jul 30, 2022 at 18:16:
> >
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version
> > 0.12.0, as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> >
> > * JIRA release notes [1],
> > * the official Apache source release and binary convenience releases to
> > be deployed to dist.apache.org [2], which are signed with the key with
> > fingerprint FD215342E3199419ADFBF41DD4623E3AA16D75B0 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "release-0.12.0-rc1" [5],
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Release Manager
> >
> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12351209
> > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.12.0-rc1/
> > [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
> > [4] https://repository.apache.org/content/repositories/orgapachehudi-1086/
> > [5] https://github.com/apache/hudi/releases/tag/release-0.12.0-rc1
>


Re: [DISSCUSS][NEW FEATURE] Hudi Lake Manager

2022-04-21 Thread Alexey Kudinkin
Hey, folks!

I feel there's quite a bit of confusion in this thread, so let's try to
clear it up: my understanding (please correct me if I'm wrong) is that
Lake Manager was referred to as a service in the same sense in which we
call compaction, clustering and cleaning *table services*.

So, I'd suggest we be extra careful when using familiar terms to avoid
adding to the confusion: all things related to *RPC services* (like
Metastore Server) we can call "servers", and for compaction, clustering
and the rest we stick w/ "table services".

If my understanding of the proposal is correct, then I think the proposal
is to consolidate the knobs and levers for Data Governance, Data Management,
etc. w/in a layer called *Lake Manager*, which will orchestrate the already
existing table services through a nicely abstracted high-level API.

Regarding adding any new *server* components: given Hudi's *stateless*
architecture, where we rely on standalone execution engines (like Spark or
Flink) to operate, I don't really see us introducing a server component
directly into Hudi's core. Metastore Server, on the other hand, will be a
*standalone* component that Hudi (as well as other processes) could
rely on to access the metadata.
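
To make the "planner, not executor" reading of the proposal a bit more
tangible, here is a rough Python sketch of how I picture it. Everything in
it is hypothetical and only for illustration: `TableStats`, `Action`,
`LakeManager`, the `submit` callback and the thresholds are made-up names
and numbers, not Hudi APIs or defaults.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TableStats:
    """Hypothetical per-table metadata summary the planner looks at."""
    name: str
    pending_log_files: int
    small_files: int
    stale_file_versions: int


@dataclass
class Action:
    """A table-service run that should be launched for a table."""
    table: str
    service: str  # "compaction", "clustering" or "cleaning"


class LakeManager:
    """Plans table services; execution is delegated to an engine callback."""

    def __init__(self, submit: Callable[[Action], None]):
        # `submit` stands in for "launch a Spark/Flink job"; the manager
        # itself never touches data files.
        self._submit = submit

    def plan(self, stats: TableStats) -> List[Action]:
        actions = []
        # Thresholds below are arbitrary assumptions for the sketch.
        if stats.pending_log_files >= 10:
            actions.append(Action(stats.name, "compaction"))
        if stats.small_files >= 50:
            actions.append(Action(stats.name, "clustering"))
        if stats.stale_file_versions > 0:
            actions.append(Action(stats.name, "cleaning"))
        return actions

    def run(self, tables: List[TableStats]) -> List[Action]:
        planned = []
        for stats in tables:
            for action in self.plan(stats):
                self._submit(action)  # hand off to the execution engine
                planned.append(action)
        return planned
```

The point of the split is that the manager only decides *what* should run;
the heavy lifting stays with the existing table services on the engine side.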

On Mon, Apr 18, 2022 at 10:07 PM Yue Zhang wrote:

> Thanks for all your attention.
> Sure, we do need to take care of high availability in design.
>
> Also, in my opinion this lake manager wouldn't turn Hudi into a database
> on the cloud. It is just an official option, something like
> HoodieDeltaStreamer, that helps users reduce maintenance and Hudi data
> governance efforts.
>
> As for resource and performance concerns, this lake manager should be
> designed as a planner/master. For example, the lake manager will call
> cleaner APIs to launch a (Spark/Flink) execution that deletes files under
> certain conditions based on table metadata, rather than doing the work
> itself, so its workload and resource requirements are much smaller.
> But in general, I agree that we have to consider failure recovery and high
> availability, etc.
>
> On 2022/04/19 04:30:22 Simon Su wrote:
> > I agree with what Danny said. IMO, there are two points that should be
> > considered:
> >
> > 1. If Lake Manager is designed as a service, we should consider its
> > high availability, dynamic expanding/shrinking, and state consistency.
> > 2. How many resources Lake Manager will use to execute Hudi actions
> > such as compaction, clustering, etc.
> >
>


Re: [VOTE] Release 0.11.0, release candidate #2

2022-04-18 Thread Alexey Kudinkin
-1

Found a pretty substantial perf degradation in 0.11 RC2 as compared to a
vanilla Parquet table in Spark (which is currently being addressed).
More details can be found in HUDI-3891.


On Mon, Apr 18, 2022 at 4:31 PM Y Ethan Guo wrote:

> -1
> The Kafka Connect Sink for Hudi cannot ingest data using
> hudi-kafka-connect-bundle from 0.11.0-rc2 due to NoClassDefFoundError.  The
> following fix is put up.
> https://github.com/apache/hudi/pull/5353
>
> Best,
> - Ethan
>
> On Fri, Apr 15, 2022 at 5:20 AM Shiyan Xu wrote:
>
> > Hi everyone,
> >
> > Please review and vote on the release candidate #2 for the version
> > 0.11.0, as follows:
> >
> > [ ] +1, Approve the release
> >
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> >
> >
> > The complete staging area is available for your review, which includes:
> >
> > * JIRA release notes [1],
> >
> > * the official Apache source release and binary convenience releases to
> > be deployed to dist.apache.org [2], which are signed with the key with
> > fingerprint E1FACC15B67B2C5149224052D3B314F3B6E9C746 [3],
> >
> > * all artifacts to be deployed to the Maven Central Repository [4],
> >
> > * source code tag "0.11.0-rc2" [5],
> >
> >
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> >
> >
> > Thanks,
> >
> > Release Manager
> >
> >
> >
> > [1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350673
> >
> > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.11.0-rc2/
> >
> > [3] https://dist.apache.org/repos/dist/release/hudi/KEYS
> >
> > [4] https://repository.apache.org/content/repositories/orgapachehudi-1060/
> >
> > [5] https://github.com/apache/hudi/releases/tag/release-0.11.0-rc2
> >
>


Re: [DISCUSS] New RFC to create LogCompaction action for MOR tables?

2022-03-21 Thread Alexey Kudinkin
Hello, everyone!

@Surya, first of all, I wanted to say that I think this is a great proposal!

> A new compaction strategy can be added, but we thought it might
> complicate the existing logic and need to rely on some hacks, especially
> since Compaction action writes to a base file and places a .commit file
> upon completion. Whereas, in our use case we are not concerned with the
> base file at all, instead we are merging blocks and writing back to the log
> file. So, we thought it is better to use a new action(called
> LogCompaction), which works at a log file level and writes back to the log
> file. Since log files are in general added by deltacommit, upon completion
> LogCompaction can place a .deltacommit.


Did I understand your proposal correctly: you're suggesting that the minor
compaction, unlike the major one, be performed as part of the Delta
Commit?

I think Sagar's question is very valid from the perspective of configuring
and balancing major/minor compactions. Nevertheless, I think we can cover
and discuss it as part of the RFC process.

> LogCompaction is not a replacement for regular compaction. LogCompaction
> is performed as a minor compaction so as to reduce the no. of log blocks to
> consider. It does not consider base files while merging the log blocks. To
> merge log files with base file Compaction action is still needed. By using
> LogCompaction action frequently, the frequency with which we do full scale
> compaction is reduced.
> Consider a scenario in which, after 'X' no. of LogCompaction actions, for
> some file groups the log files size becomes comparable to that of base file
> size, in this scenario LogCompaction action is going to take close to the
> same amount of time as compaction action. So, now instead of LogCompaction,
> full scale Compaction needs to be performed on those file groups. In future
> we can also introduce logic to determine what is the right
> action(Compaction or LogCompaction) to be performed depending on the state
> of the file group.


To be able to reason about a strategy for both major and minor compactions,
we need to clearly define the criteria we're optimizing for (is it the # of
RPCs to HDFS/object storage, I/O, etc.). That would also help us objectively
measure the performance improvement from introducing minor compaction.
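
As a strawman for that discussion, the per-file-group choice Surya
describes (minor compaction while logs are small relative to the base file,
full compaction once they become comparable) could be sketched roughly as
below. This is only an illustration in Python: `FileGroupState`,
`choose_action` and both thresholds are assumptions of mine, not anything
that exists in Hudi or in the proposal.

```python
from dataclasses import dataclass


@dataclass
class FileGroupState:
    """Hypothetical snapshot of one file group's sizes."""
    base_file_bytes: int
    log_bytes: int
    log_block_count: int


def choose_action(fg: FileGroupState,
                  min_blocks: int = 4,
                  size_ratio: float = 0.5) -> str:
    """Return "compaction", "log_compaction" or "none" for a file group.

    min_blocks and size_ratio are illustrative knobs, not real configs.
    """
    if fg.log_block_count == 0:
        return "none"  # nothing to merge
    # Assumption: a file group w/o a base file goes to full compaction.
    if fg.base_file_bytes == 0:
        return "compaction"
    # Once the logs are comparable in size to the base file, merging them
    # costs about as much as a full compaction, so do the full one.
    if fg.log_bytes / fg.base_file_bytes >= size_ratio:
        return "compaction"
    # Otherwise merge log blocks only, and only if there are enough of them.
    if fg.log_block_count >= min_blocks:
        return "log_compaction"
    return "none"
```

Whatever criteria we settle on (RPC count, I/O, etc.) would end up replacing
the simple size ratio used here.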

On Mon, Mar 21, 2022 at 8:26 AM Vinoth Chandar  wrote:

> +1 overall
>
> On Sat, Mar 19, 2022 at 5:02 PM Surya Prasanna wrote:
>
> > Hi Sagar,
> > Sorry for the delay in response. Thanks for the questions.
> >
> > 1. Trying to understand the main goal. Is it to balance the tradeoff
> > between read and write amplification for metadata table? Or is it purely
> > to optimize for reads?
> > > On large tables, write amplification is a side effect of frequent
> > compactions. So, instead of increasing the frequency of full compaction,
> > we are proposing minor compaction(LogCompaction) to be done frequently to
> > merge only the log blocks and write a new log block. By merging the
> > blocks, there are less no. of blocks to deal with during read, that way
> > we are optimizing for read performance and potentially avoiding the
> > write amplification problem.
> >
> > 2. Why do we need a separate action? Why can't any of the existing
> > compaction strategies (or a new one if needed) help to achieve this?
> > > A new compaction strategy can be added, but we thought it might
> > complicate the existing logic and need to rely on some hacks, especially
> > since Compaction action writes to a base file and places a .commit file
> > upon completion. Whereas, in our use case we are not concerned with the
> > base file at all, instead we are merging blocks and writing back to the
> > log file. So, we thought it is better to use a new action(called
> > LogCompaction), which works at a log file level and writes back to the
> > log file. Since log files are in general added by deltacommit, upon
> > completion LogCompaction can place a .deltacommit.
> >
> > 3. Is the proposed LogCompaction a replacement for regular compaction for
> > metadata table i.e. if LogCompaction is enabled then compaction cannot be
> > done?
> > > LogCompaction is not a replacement for regular compaction.
> > LogCompaction is performed as a minor compaction so as to reduce the no.
> > of log blocks to consider. It does not consider base files while merging
> > the log blocks. To merge log files with base file Compaction action is
> > still needed. By using LogCompaction action frequently, the frequency
> > with which we do full scale compaction is reduced.
> > Consider a scenario in which, after 'X' no. of LogCompaction actions, for
> > some file groups the log files size becomes comparable to that of base
> > file size, in this scenario LogCompaction action is going to take close
> > to the same amount of time as compaction action. So, now instead of
> > LogCompaction,
> > full scale Compaction needs to be performed on those file groups. In
> 

Unbundling "spark-avro" dependency

2022-03-08 Thread Alexey Kudinkin
Hello, everyone!

While working on HUDI-3549, we were surprised to discover that Hudi
actually bundles the "spark-avro" dependency *by default*.

This is problematic b/c "spark-avro" is tightly coupled with some of the
other Spark components that make up its core distribution (i.e. those
packaged in Spark itself rather than shipped as external packages;
"spark-sql" is one example).

With regard to HUDI-3549 itself, the problem unfolded as follows:

   1. We built "hudi-spark-bundle", which got "spark-avro" 3.2.1 bundled
   along with it
   2. @Sivabalan tried to use this Hudi bundle w/ Spark 3.2.0
   3. It failed b/c "spark-avro" 3.2.1 is *not compatible* w/ "spark-sql"
   3.2.0 (b/c of https://github.com/apache/spark/pull/34978, which fixed a
   typo and renamed internal API methods in DataSourceUtils)


To avoid such problems going forward, our proposal is to:

   1. *Unbundle* "spark-avro" from Hudi bundles by default (in practice,
   this means Hudi users would now need to specify spark-avro via the
   `--packages` flag, since it's not part of Spark's core distribution)
   2. (Optional) If the community still sees value in bundling (and shading)
   "spark-avro" in some cases, we can add a Maven profile that would allow
   doing that *ad hoc*.
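
For illustration, once "spark-avro" is unbundled, the coordinate passed to
`--packages` has to match the Spark version actually running, which is
exactly what went wrong in the 3.2.1 vs 3.2.0 case above. A small Python
sketch of deriving it is below; the helper names are hypothetical, and the
Scala 2.12 default is an assumption that should be adjusted to the build
in use.

```python
def spark_avro_coordinate(spark_version: str,
                          scala_binary_version: str = "2.12") -> str:
    """Build the Maven coordinate of the spark-avro build for a Spark version.

    Pinning spark-avro to the exact Spark version avoids mixing, e.g.,
    spark-avro 3.2.1 with spark-sql 3.2.0.
    """
    parts = spark_version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a version like '3.2.0', got {spark_version!r}")
    return f"org.apache.spark:spark-avro_{scala_binary_version}:{spark_version}"


def packages_args(spark_version: str,
                  scala_binary_version: str = "2.12") -> list:
    """Return the extra command-line arguments for spark-submit/spark-shell."""
    return ["--packages",
            spark_avro_coordinate(spark_version, scala_binary_version)]
```

For Spark 3.2.0 on Scala 2.12 this yields
`--packages org.apache.spark:spark-avro_2.12:3.2.0`, to be passed alongside
the Hudi bundle.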

We've put up PR#4955 with the proposed changes.

Looking forward to your feedback.