[DISCUSS] Metadata based bloom index

2021-11-05 Thread Manoj Govindassamy
Hi Hudi Community,

Hudi has several indices to help look up records. The most commonly used one
is the BloomFilter-based index. Today this index works by loading the bloom
filters from all the data files in the partitions of interest, which is a
time-consuming operation. It would be better if we could leverage the
metadata table infrastructure of Hudi tables. That is, if all the bloom
filters could be loaded directly from a single metadata table partition, it
would greatly speed up the entire record key lookup process.
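
To make the comparison concrete, here is a minimal Python sketch of the two
lookup strategies described above. It is only an illustration under stated
assumptions: the names BloomFilter, DataFileStore, candidates_from_data_files
and candidates_from_metadata, as well as the sample table layout, are
hypothetical and are not Hudi APIs, and a plain dictionary stands in for the
proposed metadata table partition.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: num_hashes probes into a num_bits-wide bit set."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key):
        # Derive one bit position per seed from a SHA-256 digest of the key.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def build_filter(keys):
    bloom = BloomFilter()
    for key in keys:
        bloom.add(key)
    return bloom


# Hypothetical layout: partition -> {data file name -> record keys in it}.
TABLE = {
    "2021/11/05": {"f1.parquet": ["k1", "k2"], "f2.parquet": ["k3"]},
    "2021/11/06": {"f3.parquet": ["k4", "k5"]},
}


class DataFileStore:
    """Stand-in for storage; each read_footer call models opening one file."""

    def __init__(self, table):
        self.footers = {
            (partition, name): build_filter(keys)
            for partition, files in table.items()
            for name, keys in files.items()
        }
        self.files_opened = 0

    def read_footer(self, partition, name):
        self.files_opened += 1
        return self.footers[(partition, name)]


def candidates_from_data_files(store, partition, file_names, key):
    """Current approach: open every data file of the partition."""
    return [
        name
        for name in file_names
        if store.read_footer(partition, name).might_contain(key)
    ]


def candidates_from_metadata(bloom_partition, partition, key):
    """Proposed approach: scan one pre-built index, opening no data files."""
    return [
        name
        for (part, name), bloom in bloom_partition.items()
        if part == partition and bloom.might_contain(key)
    ]


if __name__ == "__main__":
    store = DataFileStore(TABLE)
    files = list(TABLE["2021/11/05"])
    print(candidates_from_data_files(store, "2021/11/05", files, "k1"))
    print("data files opened:", store.files_opened)
    # Assumption: the index is written at commit time; here it is just copied.
    index = dict(store.footers)
    print(candidates_from_metadata(index, "2021/11/05", "k1"))
```

Both functions return the same candidate files because they consult the same
filters. The difference in this sketch is only the number of data files that
have to be opened to get at them.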

Let me know your thoughts on this high-level idea. I am planning to start an
RFC on this and can share more details on the design and implementation.

Regards,
Manoj


Re: [DISCUSS] RFC for Synchronous Metadata table for File listing

2021-11-13 Thread Manoj Govindassamy
+1 for the synchronous metadata updates. Looking forward to the RFC.


On Fri, Nov 12, 2021 at 4:46 PM Vinoth Chandar  wrote:

> +1 on this.
>
> On Fri, Nov 5, 2021 at 9:17 AM Sivabalan  wrote:
>
> > RFC-15
> > <
> >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements
> > >
> > made an attempt to boost the performance of file listing by storing all
> > file information in the metadata table. As we are looking to build more
> > infra around the metadata table (RFC-27 for data skipping, etc.), we felt
> > that a synchronous design would make it tighter and avoid some of the
> > corner cases of the async approach.
> >
> > So, we will write up a new RFC for file listing based on metadata table
> > with synchronous updates.
> >
> > --
> > Regards,
> > -Sivabalan
> >
>


Re: [DISCUSS] Propose to implement a deltastreamer source for Debezium

2021-11-19 Thread Manoj Govindassamy
+1

On Fri, Nov 19, 2021 at 1:42 PM Rajesh Mahindra  wrote:

> Hi Community,
>
> We intend to implement a source for ingesting Debezium Change Data Capture
> (CDC) logs into Deltastreamer/Hudi. With this capability, we can
> continuously capture the row-level changes (inserts, updates and deletes)
> that were committed to a database. While Debezium supports multiple
> databases, we will focus on Postgres and MySQL initially.
>
> More details are published on the cwiki:
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+38
> This doc is not a formal RFC yet; it is just meant to aid the discussion
> here. We will publish the RFC based on early feedback.
>
> Thanks
> Rajesh Mahindra
>


Re: [DISCUSS] Hudi 0.10.0 Release

2021-11-19 Thread Manoj Govindassamy
Hi Danny,

I am good with the Nov 26th cutoff as well. I am working on the in-progress
items below and have one other pending. For everything else on the list, PRs
are either out or have landed. Thanks for compiling the list.

*InProgress:*
 - [HUDI-2763] Avoid persisting redundant key field in the Metadata table
   record payload (Owner: Manoj Govindassamy)
 - [HUDI-2475] Rolling Upgrade downgrade story for 0.10 & enabling
   metadata (Owner: Manoj Govindassamy)

*Pending:*
 - [HUDI-2590] Validate Diff key gen w/ and w/o glob path with and w/o
   metadata enabled

*Completed:*
 - [HUDI-2716] Fix InLineFS path conversions for S3FS paths (Owner: Manoj
   Govindassamy)
 - [HUDI-2593] Virtual keys support for metadata table (Owner: Manoj
   Govindassamy)
 - [HUDI-2472] Tests failure follow up when metadata is enabled by
   default (Owner: Manoj Govindassamy)
 - [HUDI-2666] async compaction failing with timeline mismatches between
   server and client when metadata is enabled (Owner: Manoj Govindassamy)
 - [HUDI-2764] Address test failures after enabling virtual keys support
   for the metadata table (Owner: Manoj Govindassamy)

On Fri, Nov 19, 2021 at 12:12 AM Danny Chan  wrote:

> Hi Community,
>
> As we draw close to doing Hudi 0.10.0 release, I am happy to share a
> summary of the key features/improvements that would be going in the release
> and the current blockers for everyone's visibility.
>
> *Highlights*
>
> - [HUDI-1290] Implement Debezium avro source for Delta Streamer
> - [HUDI-1491] Support partition pruning for MOR snapshot query
> - [HUDI-1763] DefaultHoodieRecordPayload does not honor ordering value
>   when records within multiple log files are merged
> - [HUDI-1827] Add ORC support in Bootstrap Op
> - [HUDI-1869] Upgrading Spark3 To 3.1
> - [HUDI-2101] support z-order for hudi
> - [HUDI-2276] Enable Metadata Table by default for both writers and
>   readers
> - [HUDI-2581] Analyze metadata size estimate in hudi with Hfile for col
>   stats partition
> - [HUDI-2634] Improve bootstrap performance for very large tables
> - [HUDI-2086] redo the logical of mor_incremental_view for hive
> - [HUDI-2191] Bump flink version to 1.13.1
> - [HUDI-2285] Metadata Table Synchronous Design
> - [HUDI-2316] Support Flink batch upsert
> - [HUDI-2371] Improve flink streaming reader
> - [HUDI-2394] [Kafka Connect Mileston 1] Implement kafka connect for
>   immutable data
> - [HUDI-2449] Incremental read for Flink
> - [HUDI-2562] Embedded timeline server on JobManager
>
> *Current Blockers*
>
> - [HUDI-1856] Upstream changes made in PrestoDB to eliminate file
>   listing to Trino (Owner: Sagar Sumit)
> - [HUDI-1912] Presto defaults to GenericHiveRecordCursor for all Hudi
>   tables (Owner: Sagar Sumit)
> - [HUDI-1932] Hive Sync should not always update last_commit_time_sync
>   (Owner: Raymond Xu)
> - [HUDI-1937] When clustering fail, generating unfinished replacecommit
>   timeline. (Owner: Sagar Sumit)
> - [HUDI-2077] Flaky test: TestHoodieDeltaStreamer (Owner: Sagar Sumit)
> - [HUDI-2314] Add DynamoDb based lock provider (Owner: Wenning Ding)
> - [HUDI-2325] Implement and test Hive Sync support for Kafka Connect
>   (Owner: Rajesh Mahindra)
> - [HUDI-2332] Implement scheduling of compaction/ clustering for Kafka
>   Connect (Owner: Ethan Guo)
> - [HUDI-2362] Hudi external configuration file support (Owner: Wenning
>   Ding)
> - [HUDI-2409] Using HBase shaded jars in Hudi presto bundle (Owner:
>   Sagar Sumit)
> - [HUDI-2443] KVComparator in HFile for metadata table is tied to HBase
>   version and shading (Owner: Sagar Sumit)
> - [HUDI-2472] Tests failure follow up when metadata is enabled by
>   default (Owner: Manoj Govindassamy)
> - [HUDI-2475] Rolling Upgrade downgrade story for 0.10 & enabling
>   metadata (Owner: Manoj Govindassamy)
> - [HUDI-2478] Handle failure mid-way during init buckets (Owner: Vinoth
>   Chandar)
> - [HUDI-2480] FileSlice after pending compaction-requested instant-time
>   is ignored by MOR snapshot reader (Owner: Danny Chen)
> - [HUDI-2488] Support bootstrapping a single or more partitions in
>   metadata table while regular writers and table services are in progress
>   (Owner: Vinoth Chandar)
> - [HUDI-2527] Flaky test:
>   TestHoodieClientMultiWriter.testMultiWriterWithAsyncTableServicesWithConflict
>   (Owner: sivabalan narayanan)
> - [HUDI-2559] Ensure unique timestamps are generated for commit times
>   with concurrent writers (Owner: sivabalan narayanan)
> - [HUDI-2593] Virtual keys support for metadata table (Owner: Manoj
>   Govindassamy)
> - [HUDI-2599] [Performance] Lower parallelism with snapshot query on COW
>   table

Re: [DISCUSS] Hudi 0.10.0 Release

2021-11-26 Thread Manoj Govindassamy
> > > > > > > Hi Danny,
> > > > > > >
> > > > > > > I have one blocker. I plan to complete it by end of next week.
> > > > > > > I am good with the prior Nov 26 cutoff.
> > > > > > > Does that work for everyone?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Vinoth
> > > > > > >
> > > > > > > On Fri, Nov 19, 2021 at 12:12 AM Danny Chan <
> > danny0...@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Community,
> > > > > > > >
> > > > > > > > As we draw close to doing Hudi 0.10.0 release, I am happy to
> > > share
> > > > a
> > > > > > > > summary of the key features/improvements that would be going
> in
> > > the
> > > > > > release
> > > > > > > > and the current blockers for everyone's visibility.
> > > > > > > >
> > > > > > > > *Highlights*
> > > > > > > >
> > > > > > > >- [HUDI-1290] Implement Debezium avro source for Delta
> > > Streamer
> > > > > > > >- [HUDI-1491] Support partition pruning for MOR snapshot
> > query
> > > > > > > >- [HUDI-1763] DefaultHoodieRecordPayload does not honor
> > > ordering
> > > > > > value
> > > > > > > >when records within multiple log files are merged
> > > > > > > >- [HUDI-1827] Add ORC support in Bootstrap Op
> > > > > > > >- [HUDI-1869] Upgrading Spark3 To 3.1
> > > > > > > >- [HUDI-2101] support z-order for hudi
> > > > > > > >- [HUDI-2276] Enable Metadata Table by default for both
> > > writers
> > > > > and
> > > > > > > >readers
> > > > > > > >- [HUDI-2581] Analyze metadata size estimate in hudi with
> > > Hfile
> > > > > for
> > > > > > col
> > > > > > > >stats partition
> > > > > > > >- [HUDI-2634] Improve bootstrap performance for very large
> > > > tables
> > > > > > > >- [HUDI-2086] redo the logical of mor_incremental_view for
> > > hive
> > > > > > > >- [HUDI-2191] Bump flink version to 1.13.1
> > > > > > > >- [HUDI-2285] Metadata Table Synchronous Design
> > > > > > > >- [HUDI-2316] Support Flink batch upsert
> > > > > > > >- [HUDI-2371] Improve flink streaming reader
> > > > > > > >- [HUDI-2394] [Kafka Connect Mileston 1] Implement kafka
> > > connect
> > > > > for
> > > > > > > >immutable data
> > > > > > > >- [HUDI-2449] Incremental read for Flink
> > > > > > > >- [HUDI-2562] Embedded timeline server on JobManager
> > > > > > > >
> > > > > > > > *Current Blockers*
> > > > > > > >
> > > > > > > >- [HUDI-1856] Upstream changes made in PrestoDB to
> eliminate
> > > > file
> > > > > > > >listing to Trino (Owner: Sagar Sumit)
> > > > > > > >- [HUDI-1912] Presto defaults to GenericHiveRecordCursor
> for
> > > all
> > > > > > Hudi
> > > > > > > >tables (Owner: Sagar Sumit)
> > > > > > > >- [HUDI-1932] Hive Sync should not always update
> > > > > > last_commit_time_sync
> > > > > > > >(Owner: Raymond Xu)
> > > > > > > >- [HUDI-1937] When clustering fail, generating unfinished
> > > > > > replacecommit
> > > > > > > >timeline. (Owner: Sagar Sumit)
> > > > > > > >- [HUDI-2077] Flaky test: TestHoodieDeltaStreamer (Owner:
> > > Sagar
> > > > > > Sumit)
> > > > > > > >- [HUDI-2314] Add DynamoDb based lock provider (Owner:
> > Wenning
> > > > > Ding)
> > > > > > > >- [HUDI-2325] Implement and test Hive Sync support for
> Kafka
> > > > > Connect

Re: [DISCUSS] Hudi 0.10.0 Release

2021-11-26 Thread Manoj Govindassamy
Hi Danny,

All the planned tickets have landed in master and we are good for cutting
0.10 RC. Please let us know if you see any CI issues with the latest master
and we can jump in to do the needful. Thanks for your patience.

thanks,
Manoj




On Fri, Nov 26, 2021 at 8:07 PM Manoj Govindassamy <
manoj.govindass...@gmail.com> wrote:

> Hi Danny,
>
> We have one last PR, https://github.com/apache/hudi/pull/4114, to land on
> master. We are seeing one flaky test with this last pending PR, although
> the same test passes consistently in the local setup. We are waiting for
> the CI to finish before merging to master. After this PR lands, we are good
> for cutting the 0.10 RC. Will keep you posted on the status.
>
> thanks,
> Manoj
>
>
>
>
> On Sat, Nov 20, 2021 at 2:10 PM Raymond Xu 
> wrote:
>
>> Hi Danny, I'm good with the timeline.
>>
>> Cheers,
>> Raymond
>>
>> On Fri, Nov 19, 2021 at 7:34 PM sagar sumit 
>> wrote:
>>
>> > Hi Danny,
>> >
>> > I've added one more blocker: HUDI-2742
>> > <https://issues.apache.org/jira/browse/HUDI-2742>
>> > I am also good with the timelines.
>> >
>> > Regards,
>> > Sagar
>> >
>> > On Sat, Nov 20, 2021 at 8:14 AM Sivabalan  wrote:
>> >
>> > > Hi Danny,
>> > >  I am good with the timelines. All my jiras should be completed by
>> > > then.
>> > >
>> > >
>> > > On Fri, Nov 19, 2021 at 8:41 PM Y Ethan Guo wrote:
>> > >
>> > > > Hi Danny,
>> > > >
>> > > > Thanks for summarizing the current progress towards the 0.10.0
>> > > > release. I'm good with Nov 26th cutoff.
>> > > >
>> > > > Regarding my blockers:
>> > > > - [HUDI-2332] Implement scheduling of compaction/ clustering for
>> > > >   Kafka Connect (Owner: Ethan Guo)
>> > > > PR is up.  I'm addressing comments.
>> > > >
>> > > > - [HUDI-2737] Use earliest instant by default for compaction and
>> > > >   clustering job (Owner: Ethan Guo)
>> > > > PR is up and approved.  It's near-landing after fixing CI failures.
>> > > >
>> > > > - [HUDI-2745] Record count does not match input after compaction is
>> > > >   scheduled when running Hudi Kafka Connect sink (Owner: Ethan Guo)
>> > > > HUDI-2745 is going to be blocked on HUDI-2480, which is going to
>> > > > resolve this issue once done.
>> > > >
>> > > > - [HUDI-2735] Fix archival of commits in Java client for Kafka
>> > > >   Connect (Owner: Ethan Guo)
>> > > > This is pending and requires investigation into the archival logic,
>> > > > which is not Kafka-connect specific.
>> > > >
>> > > > Best,
>> > > > - Ethan
>> > > >
>> > > >
>> > > > On Fri, Nov 19, 2021 at 4:41 PM Rajesh Mahindra
>> > > > <rmahin...@gmail.com> wrote:
>> > > >
>> > > > > Hi Danny,
>> > > > >
>> > > > > I have the following blockers that have a PR up. I am working on
>> > > > > a PR for the Debezium Source. I am fine with Nov 26th as cut off.
>> > > > >
>> > > > > - [HUDI-2325] Implement and test Hive Sync support for Kafka
>> > > > >   Connect (Owner: Rajesh Mahindra)
>> > > > > - [HUDI-2671] Fix record offset handling in Kafka connect
>> > > > >   transaction participant (Owner: Rajesh Mahindra)
>> > > > > - [HUDI-2672] Avoid empty commits and rollbacks when there is no
>> > > > >   event from the topic (Owner: Rajesh Mahindra)
>> > > > >
>> > > > > ** Pending
>> > > > > - [HUDI-1290] Implement Debezium avro source for Delta Streamer
>> > > > >
>> > > > > Thanks
>> > > > > Rajesh
>> > > > >
>> > > > >
>> > > > > On Fri, Nov 19, 2021 at 4:01 PM Udit Mehrotra wrote:
>> > > > >
>> > > > > > Hi Danny,
>> > > > > >
>> > > > > > I have a blocker as well

Re: [VOTE] Release 0.10.0, release candidate #1

2021-11-27 Thread Manoj Govindassamy
+1

On Sat, Nov 27, 2021 at 4:49 AM Danny Chan  wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version 0.10.0,
> as follows:
>
> [ ] +1, Approve the release
>
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
>
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint 9A48922F682AB05D1AE4A3E7C2931E4BDB03D5AE [3],
>
> * all artifacts to be deployed to the Maven Central Repository [4],
>
> * source code tag "release-0.10.0-rc1" [5],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
>
> Release Manager
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12322822&version=12350285
>
> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.10.0-rc1/
>
> [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
>
> [4]
>
> https://repository.apache.org/content/repositories/orgapachehudi-1045/org/apache/hudi/
>
> [5] https://github.com/apache/hudi/tree/release-0.10.0-rc1
>


Re: [VOTE] Release 0.10.0, release candidate #3

2021-12-06 Thread Manoj Govindassamy
+1 (non-binding)

- Release validation script - passed
- Spark quick start guide using Spark 2.4.4 - passed
- Hudi table write and other operations from spark data source - passed

thanks,
Manoj



On Mon, Dec 6, 2021 at 8:23 PM sagar sumit  wrote:

> +1 (non-binding)
>
> - Builds for Spark2/3 [OK]
> - Spark quickstart [OK]
> - Docker Demo (Hive/Presto querying) [OK]
> - Long-running deltastreamer continuous mode [OK]
>
> Regards,
> Sagar
>
> On Tue, Dec 7, 2021 at 2:07 AM Udit Mehrotra  wrote:
>
> > +1 (binding)
> >
> > - Builds successfully
> > - RC validation successful
> > - Ran quickstart Scala/Spark SQL against EMR and S3
> >
> > Thanks,
> > Udit
> >
> > On Mon, Dec 6, 2021 at 10:12 AM Balaji Varadarajan wrote:
> > >
> > > +1 (binding)
> > > - Package Build successful
> > > - Overnight staging test - Data Validation successful for COW upsert
> > >   workload.
> > >
> > >
> > > On Monday, December 6, 2021, 06:40:32 AM PST, vino yang
> > > <yanghua1...@gmail.com> wrote:
> > >
> > >  +1 (binding)
> > >
> > > - build successfully
> > > - ran spark quickstart
> > > - verified checksum
> > >
> > > Best,
> > > Vino
> > >
> > > > On Mon, Dec 6, 2021 at 14:25, Y Ethan Guo wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > - [OK] Ran release validation script [1]
> > > > - [OK] Built the source (Spark 2/3)
> > > > - [OK] Ran Spark Guide in Quick Start using Spark 3.1.2
> > > >
> > > > [1] https://gist.github.com/yihua/39ef5b07a08ed5780fa9c43819b326cb
> > > >
> > > > Best,
> > > > - Ethan
> > > >
> > > > > On Sat, Dec 4, 2021 at 1:27 PM Bhavani Sudha
> > > > > <bhavanisu...@apache.org> wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > - [OK] checksums and signatures
> > > > > - [OK] ran validation script
> > > > > - [OK] built successfully
> > > > > - [OK] ran spark quickstart
> > > > > - [OK] Ran few tests in IDE
> > > > >
> > > > >
> > > > >
> > > > > bsaktheeswaran@Bhavanis-MacBook-Pro scripts %
> > > > > ./release/validate_staged_release.sh --release=0.10.0 --rc_num=3
> > > > > /tmp/validation_scratch_dir_001 ~/Sudha/hudi/scripts
> > > > > > Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi
> > > > > > Validating hudi-0.10.0-rc3 with release type "dev"
> > > > > > Checking Checksum of Source Release
> > > > > > Checksum Check of Source Release - [OK]
> > > > > >
> > > > > >   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
> > > > > >                                  Dload  Upload   Total   Spent    Left  Speed
> > > > > > 100 45904  100 45904    0     0  85323      0 --:--:-- --:--:-- --:--:-- 85165
> > > > > Checking Signature
> > > > > Signature Check - [OK]
> > > > >
> > > > > Checking for binary files in source release
> > > > > No Binary Files in Source Release? - [OK]
> > > > >
> > > > > Checking for DISCLAIMER
> > > > > DISCLAIMER file exists ? [OK]
> > > > >
> > > > > Checking for LICENSE and NOTICE
> > > > > License file exists ? [OK]
> > > > > Notice file exists ? [OK]
> > > > >
> > > > > Performing custom Licensing Check
> > > > > Licensing Check Passed [OK]
> > > > >
> > > > > Running RAT Check
> > > > > RAT Check Passed [OK]
> > > > >
> > > > > Thanks,
> > > > > Sudha
> > > > >
> > > > > > On Sat, Dec 4, 2021 at 6:59 AM Vinoth Chandar wrote:
> > > > > >
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > Ran the RC checks in [1]. This is a huge release, thanks
> > > > > > > everyone for all the hard work!
> > > > > > >
> > > > > > > [1]
> > > > > > > https://gist.github.com/vinothchandar/68b34f3051e41752ebffd6a3edeb042b
> > > > > > >
> > > > > > > On Sat, Dec 4, 2021 at 5:20 AM Danny Chan wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > > Please review and vote on the release candidate #3 for the
> > > > > > > > version 0.10.0, as follows:
> > > > > > > >
> > > > > > > > [ ] +1, Approve the release
> > > > > > > >
> > > > > > > > [ ] -1, Do not approve the release (please provide specific
> > > > > > > > comments)
> > > > > > > >
> > > > > > > > The complete staging area is available for your review, which
> > > > > > > > includes:
> > > > > > > >
> > > > > > > > * JIRA release notes [1],
> > > > > > > >
> > > > > > > > * the official Apache source release and binary convenience
> > > > > > > > releases to be deployed to dist.apache.org [2], which are
> > > > > > > > signed with the key with fingerprint
> > > > > > > > 9A48922F682AB05D1AE4A3E7C2931E4BDB03D5AE [3],
> > > > > > > >
> > > > > > > > * all artifacts to be deployed to the Maven Central Repository
> > > > > > > > [4],
> > > > > > > >
> > > > > > > > * source code tag "release-0.10.0-rc3" [5],
> > > > > > > >
> > > > > > > > The vote will be open for at least 72 hours. It is adopted by
> > > > > > > > majority approval, with at least 3 PMC affirmative votes.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Release Manager
> > > > > > > >
> > > > > > > > [1]
> > > > > > > > https://issues.apache.org/jira/secure/ReleaseNote