Re: [VOTE] Release 0.12.3, release candidate #1

2023-04-05 Thread sagar sumit
+1 (non-binding)

Long running deltastreamer OK
Query using Presto and Trino OK
Spark quickstart and docker demo OK

Regards,
Sagar

On Fri, Mar 31, 2023 at 10:41 PM Sivabalan  wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version 0.12.3,
> as follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint ACD52A06633DB3B2C7D0EA5642CA2D3ED5895122 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "release-0.12.3-rc1" [5],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Release Manager
>
>
> [1]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12352934&styleName=Html&projectId=12322822
> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.12.3-rc1/
> [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachehudi-1119
> [5] https://github.com/apache/hudi/releases/tag/release-0.12.3-rc1
>
> --
> Regards,
> -Sivabalan
>


Re: [VOTE] Release 0.12.3, release candidate #1

2023-04-05 Thread Danny Chan
-1(binding) for picking up https://github.com/apache/hudi/pull/8374

Best,
Danny

On Sat, Apr 1, 2023 at 01:11, Sivabalan wrote:
>
> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version 0.12.3,
> as follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> The complete staging area is available for your review, which includes:
>
> * JIRA release notes [1],
> * the official Apache source release and binary convenience releases to be
> deployed to dist.apache.org [2], which are signed with the key with
> fingerprint ACD52A06633DB3B2C7D0EA5642CA2D3ED5895122 [3],
> * all artifacts to be deployed to the Maven Central Repository [4],
> * source code tag "release-0.12.3-rc1" [5],
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Release Manager
>
>
> [1] 
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12352934&styleName=Html&projectId=12322822
> [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.12.3-rc1/
> [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
> [4] https://repository.apache.org/content/repositories/orgapachehudi-1119
> [5] https://github.com/apache/hudi/releases/tag/release-0.12.3-rc1
>
> --
> Regards,
> -Sivabalan


Re: What precombine field really is used for and its future?

2023-04-05 Thread Ken Krugler
Hi Vinoth,

I just want to make sure my issue was clear - it seems like Spark shouldn’t be 
requiring a precombined field (or checking that it exists) when dropping 
partitions.

Thanks,

— Ken


> On Apr 4, 2023, at 7:31 AM, Vinoth Chandar  wrote:
> 
> Thanks for raising this issue.
> 
> Love to use this opp to share more context on why the preCombine field
> exists.
> 
>   - As you probably inferred already, we needed to eliminate duplicates,
>   while dealing with out-of-order data (e.g. database change records arriving
>   in different orders from two Kafka clusters in two zones). So it was
>   necessary to preCombine by an "event" field, rather than just the arrival
>   time (which is what _hoodie_commit_time is).
>   - This comes from stream processing concepts like
>   https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/ ,
>   which build upon inadequacies in traditional database systems to deal with
>   things like this. At the end of the day, we are solving a "processing"
>   problem IMO with Hudi - Hudi replaces existing batch/streaming pipelines,
>   not OLTP databases. That's at least the lens we approached it from.
>   - For this to work end-end, it is not sufficient to just precombine
>   within a batch of incoming writes, we also need to consistently apply the
>   same against data in storage. In CoW, we implicitly merge against storage,
>   so it's simpler. But for MoR, we simply append records to log files, so we
>   needed to make this a table property - such that queries/compaction can
>   later do the right preCombine. Hope that clarifies the CoW vs MoR
>   differences.
> 
> On the issues raised/proposals here.
> 
>   1. I think we need some dedicated efforts across the different writer
>   paths to make it easier; probably some low-hanging fruit here. Some of
>   it simply results from different authors contributing to different code
>   paths in an OSS project.
>   2. On picking a sane default precombine field: _hoodie_commit_time is a
>   good candidate; as you point out, we would then just pick one of many
>   records with the same key arbitrarily in that scenario. On
>   storage/across commits, we would pick the value with the latest
>   commit_time (last writer wins) - which would make queries repeatedly provide
>   the same consistent values as well. Needs more thought.
>   3. If the user desires to customize this behavior, they could supply a
>   preCombine field that is different. This would be similar to semantics of
>   event time vs arrival order processing in streaming systems. Personally, I
>   need to spend a bit more time digging to come up with an elegant solution
>   here.
>   4. For the proposals on how Hudi could de-duplicate after the fact, when
>   inserts introduced duplicates - I think the current behavior is a bit more
>   lenient than what I'd like, tbh. It updates both records, IIRC. I think
>   Hudi should ensure record key uniqueness across different paths and fail
>   the write if it's violated - if we think of this through an RDBMS lens,
>   that's what would happen, correct?
> 
> 
> Love to hear your thoughts. If we can file a JIRA or compile JIRAs with
> issues around this, we could discuss short- and long-term plans?
> 
> Thanks
> Vinoth
> 
> On Sat, Apr 1, 2023 at 3:13 PM Ken Krugler 
> wrote:
> 
>> Hi Daniel,
>> 
>> Thanks for the detailed write-up.
>> 
>> I can’t add much to the discussion, other than noting we also recently ran
>> into the related oddity that we don’t need to define a precombine when
>> writing data to a COW table (using Flink), but then trying to use Spark to
>> drop partitions failed because there’s a default precombine field name (set
>> to “ts”), and if that field doesn’t exist then the Spark job fails.
>> 
>> — Ken
>> 
>> 
>>> On Mar 31, 2023, at 1:20 PM, Daniel Kaźmirski 
>> wrote:
>>> 
>>> Hi all,
>>> 
>>> I would like to bring up the topic of how the precombine field is used and
>>> what its purpose is. I would also like to know what the plans for it are
>>> in the future.
>>> 
>>> At first glance the precombine field looks like it's only used to deduplicate
>>> records in the incoming batch.
>>> But when digging deeper, it looks like it can also be used to:
>>> 1. combine records not before, but on write, to decide whether to update an
>>> existing record (e.g. with DefaultHoodieRecordPayload)
>>> 2. combine records on read for a MoR table, to merge log and base files
>>> correctly.
>>> 3. satisfy Spark SQL UPDATE, which requires a precombine field even though
>>> the user can't introduce duplicates with this statement anyway.
>>> 
>>> Regarding [3], there's an inconsistency, as a precombine field is not
>>> required in MERGE INTO UPDATE. Underneath, UPSERT is switched to INSERT in
>>> upsert mode to update existing records.
>>> 
>>> I know that Hudi does a lot of work to ensure PK uniqueness across/within
>>> partitions and there is a need to deduplicate records before write or to
>>> deduplicate existing data if duplicates were introduced.

Re: What precombine field really is used for and its future?

2023-04-05 Thread Daniel Kaźmirski
Hi Vinoth,

Thanks for your reply!

Regarding the first part, I agree that precombine solves a lot of issues,
especially during the ingestion.
I think this is a valid behavior and should be preserved so that we can
enjoy out-of-order events and duplicates handled by the framework.
I'm also aware there are many use cases people want to solve with Hudi,
ranging from "I don't need a precombine field" to "I need many precombine
fields" or "my deduplication logic is complex and I just need to do it
myself before the write, outside Hudi".
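That framework-level handling (per key, keep the record with the highest precombine value, both within a batch and against storage) can be sketched in a few lines of Python. This is a simplified model for illustration only, not Hudi's actual implementation; the record layout and field names are made up:

```python
def precombine(batch, key_field, precombine_field):
    """Within a batch, keep one record per key: the one with the
    highest precombine value (first seen wins on a tie)."""
    best = {}
    for rec in batch:
        k = rec[key_field]
        if k not in best or rec[precombine_field] > best[k][precombine_field]:
            best[k] = rec
    return list(best.values())


def merge_into_storage(storage, batch, key_field, precombine_field):
    """Apply the same rule against records already on storage
    (a CoW-style merge); on a tie, the incoming write wins."""
    merged = {r[key_field]: r for r in storage}
    for rec in precombine(batch, key_field, precombine_field):
        k = rec[key_field]
        if k not in merged or rec[precombine_field] >= merged[k][precombine_field]:
            merged[k] = rec
    return list(merged.values())


# Out-of-order change records: the latest event wins, regardless of
# arrival order.
storage = [{"id": 1, "ts": 100, "val": "a"}]
incoming = [{"id": 1, "ts": 300, "val": "c"}, {"id": 1, "ts": 200, "val": "b"}]
print(merge_into_storage(storage, incoming, "id", "ts"))
# -> [{'id': 1, 'ts': 300, 'val': 'c'}]
```

With "ts" as the precombine field, the ts=200 record never overwrites the ts=300 one even though it arrives later, which is exactly the out-of-order protection described above.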

Maybe I should share my use case to show where this topic comes from and
what my point of view is.
I use precombine myself for handling events from various applications,
delivered via Kafka and stored in a Hudi table.
Then once these are stored, I have a nice deduplicated dataset that I can
use to create derived tables (App -> Kafka -> Hudi table "raw" -> n derived
Hudi tables). Then in the second part (hudi raw -> hudi derived) I can
safely assume there will be no duplicates and out-of-order events coming
from the first Hudi table, therefore I don't really need precombine field
and it makes modeling this derived table easier as I don't need to consider
how precombine will behave. At the same time, it does not make sense to
introduce another tool and grow my stack for it if Hudi can handle it.
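The difference between the two stages of a pipeline like this mostly comes down to the write options passed to the Spark datasource. A sketch, with table and field names made up and option keys as commonly documented for Hudi:

```python
# Illustrative Hudi Spark datasource options; table/field names are invented.
# Stage 1: raw table ingesting possibly duplicated, out-of-order Kafka events.
raw_opts = {
    "hoodie.table.name": "events_raw",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_ts",  # event time, not arrival time
    "hoodie.datasource.write.operation": "upsert",
}

# Stage 2: derived table fed only from the already-deduplicated raw table;
# no natural precombine field exists, so a processing timestamp is supplied.
derived_opts = {
    "hoodie.table.name": "events_derived",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "_ingest_ts",  # fallback: processing time
    "hoodie.datasource.write.operation": "upsert",
}

# Usage (requires a SparkSession with the Hudi bundle on the classpath):
# df.write.format("hudi").options(**raw_opts).mode("append").save(path)
```

The stage-2 options illustrate the awkwardness being discussed: a precombine field must be supplied even though, by construction, the input can no longer contain duplicates or out-of-order events.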

Sometimes I just don't have any field in the source system/table that can
be used as a precombine field, and am forced to provide Hudi with something
like an ingestion/processing timestamp or some constant.


Regarding the proposal comments, I agree overall and understand that these
are not easy choices to make.
From a user perspective, it would be great to have coherent APIs with
consistent behavior and no surprises. I agree with Ken, there
are operations where it does not feel like precombine should be needed (or
should be something internal and abstracted away from the user).
You're right, Vinoth, and I was wrong: the update does not deduplicate
existing records; as I checked, it will rather take the latest record based
on the precombine field, update its values, and replace the duplicates with it.
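That observed behavior can be modeled roughly as follows. This is a simplified model of the observation, not Hudi's code; names are illustrative:

```python
def sql_update(records, key_field, precombine_field, key, changes):
    """Model of the observed UPDATE behavior: among duplicates sharing a
    key, keep only the record with the latest precombine value, apply the
    changes to it, and drop the other duplicates."""
    keep = [r for r in records if r[key_field] != key]
    dups = [r for r in records if r[key_field] == key]
    if dups:
        latest = max(dups, key=lambda r: r[precombine_field])
        keep.append({**latest, **changes})
    return keep


rows = [
    {"id": 1, "ts": 100, "val": "old"},
    {"id": 1, "ts": 200, "val": "newer"},  # duplicate key, later precombine
    {"id": 2, "ts": 100, "val": "x"},
]
print(sql_update(rows, "id", "ts", key=1, changes={"val": "updated"}))
# id=1 collapses to the ts=200 record with val="updated"; id=2 is untouched
```

So an UPDATE on a key with duplicates is also, as a side effect, a deduplication of that key.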
Regarding the RDBMS lens, some systems introduce a PK NOT ENFORCED mode to
allow duplicates on insert, but always deduplicate on update. If users want
to preserve duplicates but also want to update values, they need to delete
and insert them again on their own. This is a clean way to do it imo, but it
is opinionated and some may dislike it; at the same time, PK uniqueness and
deduplication are among Hudi's core principles and are a differentiator.
This article from MS Synapse has nice examples around this:
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-table-constraints
I myself have cases where I care about not missing any data and decide to
store duplicates and handle these later. I think right now Hudi gives us
nice control here with insert modes. If one thinks Hudi should fail on
write, then strict insert mode does the trick, though it is not the default.
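For reference, the insert mode mentioned above is switched with a session-level config in Spark SQL. A sketch, with the option name and values as documented for the 0.12.x line and the table names invented:

```sql
-- Illustrative Spark SQL session setting (per Hudi 0.12.x docs):
set hoodie.sql.insert.mode = strict;     -- fail the INSERT on duplicate keys
-- alternatives: non-strict (keep duplicates), upsert (deduplicate on write)
insert into events_raw select * from staging_events;
```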


I gathered some JIRA tickets related to precombine/no precombine model:
https://issues.apache.org/jira/browse/HUDI-2633
https://issues.apache.org/jira/browse/HUDI-4701
https://issues.apache.org/jira/browse/HUDI-5848
https://issues.apache.org/jira/browse/HUDI-2681

But maybe you'd like to have a new ticket for this work?
I'm always happy to help.

BR,
Daniel

On Wed, Apr 5, 2023 at 18:59, Ken Krugler wrote:

> Hi Vinoth,
>
> I just want to make sure my issue was clear - it seems like Spark
> shouldn’t be requiring a precombined field (or checking that it exists)
> when dropping partitions.
>
> Thanks,
>
> — Ken
>
>
> > On Apr 4, 2023, at 7:31 AM, Vinoth Chandar  wrote:
> >
> > Thanks for raising this issue.
> >
> > Love to use this opp to share more context on why the preCombine field
> > exists.
> >
> >   - As you probably inferred already, we needed to eliminate duplicates,
> >   while dealing with out-of-order data (e.g. database change records
> arriving
> >   in different orders from two Kafka clusters in two zones). So it was
> >   necessary to preCombine by an "event" field, rather than just the
> arrival
> >   time (which is what _hoodie_commit_time is).
> >   - This comes from stream processing concepts like
> >   https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101/ ,
> >   which build upon inadequacies in traditional database systems to deal
> with
> >   things like this. At the end of the day, we are solving a "processing"
> >   problem IMO with Hudi - Hudi replaces existing batch/streaming
> pipelines,
> >   not OLTP databases. That's at least the lens we approached it from.
> >   - For this to work end-end, it is not sufficient to just precombine
> >   within a batch of incoming writes, we also need to consistently apply
> the
> >   same against data in storage. In CoW, we implicitly merge against
> storage,
> >   so it's simpler. But for MoR, we simply append records to log files, so
> >   we needed to make this a table property - such that queries/compaction
> >   can later do the right preCombine.

Re: [VOTE] Release 0.12.3, release candidate #1

2023-04-05 Thread Sivabalan
Will cancel RC1 and work on RC2. Thanks for bringing it up.

On Wed, 5 Apr 2023 at 08:29, Danny Chan  wrote:

> -1(binding) for picking up https://github.com/apache/hudi/pull/8374
>
> Best,
> Danny
>
> On Sat, Apr 1, 2023 at 01:11, Sivabalan wrote:
> >
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version
> 0.12.3,
> > as follows:
> >
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > The complete staging area is available for your review, which includes:
> >
> > * JIRA release notes [1],
> > * the official Apache source release and binary convenience releases to
> be
> > deployed to dist.apache.org [2], which are signed with the key with
> > fingerprint ACD52A06633DB3B2C7D0EA5642CA2D3ED5895122 [3],
> > * all artifacts to be deployed to the Maven Central Repository [4],
> > * source code tag "release-0.12.3-rc1" [5],
> >
> > The vote will be open for at least 72 hours. It is adopted by majority
> > approval, with at least 3 PMC affirmative votes.
> >
> > Thanks,
> > Release Manager
> >
> >
> > [1]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12352934&styleName=Html&projectId=12322822
> > [2] https://dist.apache.org/repos/dist/dev/hudi/hudi-0.12.3-rc1/
> > [3] https://dist.apache.org/repos/dist/dev/hudi/KEYS
> > [4]
> https://repository.apache.org/content/repositories/orgapachehudi-1119
> > [5] https://github.com/apache/hudi/releases/tag/release-0.12.3-rc1
> >
> > --
> > Regards,
> > -Sivabalan
>


-- 
Regards,
-Sivabalan