lq252392 opened a new issue, #6206:
URL: https://github.com/apache/paimon/issues/6206

   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/paimon/issues) 
and found nothing similar.
   
   
   ### Motivation
   
   ## Background
   
   In a typical data warehouse architecture, joining multiple streams often 
requires separate Flink jobs (due to different sources, processing logic, or 
temporal reasons). This inherently leads to a scenario where **multiple 
independent tasks (potentially on different machines) need to write to the same 
Paimon table, and thus, likely to the same buckets**.
   
   From the documentation and community best practices, I understand that the 
recommended pattern is to have a **1:1 relationship between a sink task and a 
bucket** (i.e., `sink.parallelism == num-buckets`). My goal is to understand 
the **specific technical challenges and potential risks** when deviating from 
this pattern, and to discuss possible solutions or best practices for the 
multi-job writing scenario.
   
   ## Research & Analysis
   
   I have studied the relevant source code and documented my findings in this 
blog post: [Can Paimon write to the same bucket with multiple tasks 
simultaneously?](https://blog.csdn.net/lifallen/article/details/150935597?spm=1011.2415.3001.5331).
   
   Based on my research, here are my current understandings and questions:
   
   ### 1. Async Compaction & Distributed Snapshots  
   Paimon's support for **asynchronous compaction** and **distributed snapshot 
commits** appears to decouple writing from compaction physically. My question:  
   - Does this architecture inherently allow **safe concurrent commits** to the 
same bucket from multiple tasks? (My current assumption is yes.)
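
   To make the question concrete, here is a toy model of optimistic snapshot 
commits: each committer reads the latest snapshot id and tries to publish the 
next one with a compare-and-swap, retrying if it loses the race. The class and 
method names are mine and this is not Paimon's actual commit code (which, as I 
understand it, additionally checks for conflicting file changes before 
retrying); it only illustrates why independent committers can all succeed 
without coordinating up front:

   ```java
   import java.util.concurrent.atomic.AtomicLong;

   // Toy model of optimistic snapshot commits. Names are illustrative,
   // not Paimon's actual commit classes.
   public class OptimisticCommitToy {

       // Plays the role of the "latest snapshot" pointer in table metadata.
       static final AtomicLong latestSnapshotId = new AtomicLong(0);

       static long commit() {
           while (true) {
               long base = latestSnapshotId.get();
               if (latestSnapshotId.compareAndSet(base, base + 1)) {
                   return base + 1; // this committer's snapshot id
               }
               // Lost the race: another committer published first; retry.
           }
       }

       public static void main(String[] args) throws InterruptedException {
           Thread[] committers = new Thread[4];
           for (int i = 0; i < committers.length; i++) {
               committers[i] = new Thread(() -> {
                   for (int c = 0; c < 1000; c++) commit();
               });
               committers[i].start();
           }
           for (Thread t : committers) t.join();
           // All 4000 commits succeeded with unique, consecutive ids.
           System.out.println(latestSnapshotId.get()); // 4000
       }
   }
   ```

   Whether the real commit path resolves compaction-induced file conflicts 
equally well for writers in separate jobs is exactly the part I would like 
confirmed.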
   
   ### 2. `sequenceNumber` Behavior in Multi-Writer Scenarios  
   Each bucket tracks a `sequenceNumber` (plain `long`), but my analysis 
suggests:  
   - **No Shared State**: Since independent Flink jobs don't share the same 
`MergeTreeWriter` instance, each writer maintains its own sequence counter.  
   - **Sorting Scope**: The `sequenceNumber` is used exclusively within 
`SortBuffer` and compaction to determine the output row order. Crucially:
       - If multiple writers emit rows to the same bucket, the interleaving of 
their independent `sequenceNumber` values only affects the physical arrangement 
of rows in files, not the logical correctness of partial update.
   
   This leads me to hypothesize:  
   - If writers are inserting **non-overlapping fields** (e.g., different 
columns), could interleaved sequence numbers from different writers be 
harmless?  
   - Is the `sequenceNumber` irrelevant for correctness in multi-writer cases?  
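
   The hypothesis above can be sketched with a toy partial-update merge. The 
`Row` record and `merge` function here are illustrative stand-ins, not 
Paimon's `MergeTreeWriter` or merge-function internals: with disjoint columns 
the merged result is the same no matter how the two writers' sequence numbers 
interleave, while an overlapping column with tied sequence numbers is resolved 
by arbitrary encounter order rather than event time.

   ```java
   import java.util.*;

   // Toy model of partial-update merging; names are illustrative,
   // not Paimon internals.
   public class PartialUpdateToy {

       // One row for a key: its sequenceNumber plus the columns it sets.
       record Row(long seq, Map<String, String> fields) {}

       // Apply rows in ascending sequence order (stable on ties);
       // later columns overwrite earlier ones, partial-update style.
       static Map<String, String> merge(List<Row> rows) {
           Map<String, String> out = new TreeMap<>();
           rows.stream()
               .sorted(Comparator.comparingLong(Row::seq))
               .forEach(r -> out.putAll(r.fields));
           return out;
       }

       public static void main(String[] args) {
           // Two independent writers, each starting its own counter at 0.
           Row w1 = new Row(0, Map.of("colA", "a1"));
           Row w2 = new Row(0, Map.of("colB", "b1"));
           // Disjoint columns: the union is the same in either order.
           System.out.println(merge(List.of(w1, w2))); // {colA=a1, colB=b1}
           System.out.println(merge(List.of(w2, w1))); // {colA=a1, colB=b1}

           // Same column with tied sequence numbers: the winner depends
           // on encounter order (here, list order), not on event time.
           Row c1 = new Row(0, Map.of("col", "from-writer-1"));
           Row c2 = new Row(0, Map.of("col", "from-writer-2"));
           System.out.println(merge(List.of(c1, c2))); // {col=from-writer-2}
       }
   }
   ```

   If this model matches Paimon's behavior, the non-overlapping-fields case 
would indeed be safe, and only overlapping fields would suffer from 
nondeterministic "last writer wins".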
   
   
   ## Questions for the Community
   
   Could the maintainers or experienced users please help clarify:
   *   Are my analyses above correct? What are the **other potential risks** I 
might have missed?
   *   What is the **recommended architecture** for writing to the same Paimon 
table from multiple independent Flink jobs?
   
   Thank you for your time and insights! This discussion will be incredibly 
helpful for designing robust pipelines with Paimon.
   
   ### Solution
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!

