codope commented on code in PR #7942:
URL: https://github.com/apache/hudi/pull/7942#discussion_r1106755374


##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload 
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is used in the Hudi upsert path.
+
+<figure>
+    <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+Figure above shows the main stages that records go through while being written to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays the role of an ordering field. This

Review Comment:
   Not much. `DefaultHoodieRecordPayload` maintains some additional metadata to 
track latency and freshness. I wanted to write about 
`DefaultHoodieRecordPayload`, but its name suggests it is the default when it 
is not, so I left it out to avoid confusion. Perhaps we should make it the 
actual default. Is that covered in the config simplification story? cc @bhasudha 
   Let me just add some notes about `DefaultHoodieRecordPayload`.
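
For readers following the thread, the `OverwriteWithLatestAvroPayload` behaviour described in the diff above can be sketched as a small toy model. This is not Hudi code: `LatestWriteWinsSketch`, `Rec`, `preCombine` and `merge` are hypothetical names made up for illustration, the ordering value is assumed to be a `Long`, and "latest record" is read here as "the incoming record", which is an assumption about the wording in the doc.

```java
// Toy model only; none of these names come from the Hudi code base.
class LatestWriteWinsSketch {

    // Hypothetical stand-in for a record: key, precombine (ordering) value, data.
    static final class Rec {
        final String key;
        final Long orderingValue;
        final String data;

        Rec(String key, Long orderingValue, String data) {
            this.key = key;
            this.orderingValue = orderingValue;
            this.data = data;
        }
    }

    // Precombine stage: of two incoming records with the same key, keep the one
    // whose ordering value is greater according to compareTo().
    // On an exact tie this sketch keeps the first argument (an arbitrary choice).
    static Rec preCombine(Rec a, Rec b) {
        return a.orderingValue.compareTo(b.orderingValue) >= 0 ? a : b;
    }

    // Merge stage: the incoming record replaces the stored one, without
    // consulting the ordering value (latest-write-wins).
    static Rec merge(Rec stored, Rec incoming) {
        return incoming;
    }

    public static void main(String[] args) {
        Rec older = new Rec("k1", 5L, "v5");
        Rec newer = new Rec("k1", 9L, "v9");
        Rec stored = new Rec("k1", 20L, "stored");

        Rec winner = preCombine(older, newer);
        Rec written = merge(stored, winner);
        System.out.println(winner.data + " " + written.data);
    }
}
```

In this sketch the stored record is replaced even though its ordering value (20) is higher than the incoming one (9), which is what distinguishes latest-write-wins merging from merging by an ordering field.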


