u-ranjith-kumar opened a new issue, #17535:
URL: https://github.com/apache/pinot/issues/17535

   
   We are using **OFFLINE dimension tables** in Apache Pinot and are facing 
**duplicate rows with the same primary key** during batch ingestion.
   
   Currently:
   
   * `APPEND` ingestion is not supported for dimension tables
   * `REFRESH` ingestion keeps re-reading the same input files
   * This results in **duplicate primary keys** in the dimension table
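   
   For context, a minimal sketch of what our OFFLINE dimension table config looks like with REFRESH batch ingestion (the table name, quota, and frequency below are illustrative, not our exact values):
   
   ```json
   {
     "tableName": "storeDim",
     "tableType": "OFFLINE",
     "isDimTable": true,
     "quota": {
       "storage": "200M"
     },
     "segmentsConfig": {
       "schemaName": "storeDim",
       "replication": "1"
     },
     "tenants": {
       "broker": "DefaultTenant",
       "server": "DefaultTenant"
     },
     "tableIndexConfig": {
       "loadMode": "MMAP"
     },
     "ingestionConfig": {
       "batchIngestionConfig": {
         "segmentIngestionType": "REFRESH",
         "segmentIngestionFrequency": "DAILY"
       }
     },
     "metadata": {}
   }
   ```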
   
   Pinot recently added support to **detect and error on duplicate primary 
keys** using:
   
   ```json
   "dimensionTableConfig": {
     "errorOnDuplicatePrimaryKey": true
   }
   ```
   
   (PR: #12290)
   
   While this helps catch the issue, it does not solve the core use case where 
we want to overwrite existing rows by primary key (UPSERT semantics) instead of 
failing ingestion.
   
   
   Current Behavior
   
   * OFFLINE dimension tables do not support UPSERT
   * Duplicate primary keys are either:
   
     * silently allowed (default), or
     * rejected using `errorOnDuplicatePrimaryKey=true`
   * There is no way to overwrite an existing row for a primary key during 
OFFLINE ingestion
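   
   In both cases, the primary key is whatever the table schema declares via `primaryKeyColumns`. A minimal schema sketch (column names are illustrative):
   
   ```json
   {
     "schemaName": "storeDim",
     "dimensionFieldSpecs": [
       { "name": "storeId", "dataType": "STRING" },
       { "name": "storeName", "dataType": "STRING" },
       { "name": "area", "dataType": "STRING" }
     ],
     "primaryKeyColumns": ["storeId"]
   }
   ```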
   
   ---
   
   Expected Behavior
   
   Support UPSERT semantics for OFFLINE dimension tables, similar to REALTIME 
upsert tables:
   
   * If a record with an existing primary key is ingested:
   
     * overwrite the existing row
     * do not create duplicate records
   * Allow deterministic, idempotent batch ingestion
   * Enable safe reprocessing and reruns
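   
   For comparison, this is roughly the config shape that gives REALTIME tables these semantics today (excerpt only; a real REALTIME upsert table additionally needs `streamConfigs`, a partitioned input topic, and `primaryKeyColumns` in the schema):
   
   ```json
   {
     "tableName": "storeDim",
     "tableType": "REALTIME",
     "routing": {
       "instanceSelectorType": "strictReplicaGroup"
     },
     "upsertConfig": {
       "mode": "FULL"
     }
   }
   ```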
   
   ---
   
   Why this is needed
   
   Offline dimension tables are commonly used for:
   
   * Slowly changing dimensions (store, area, category, mappings)
   * Periodic full refreshes or partial backfills
   * Reference data that naturally evolves over time
   
   Without upsert support:
   
   * Pipelines are fragile
   * Reruns cause duplicates
   * Users are forced to move dimension data to REALTIME ingestion, which is 
not always desirable
   
   ---
   
   Workarounds today
   
   1. Enable strict validation:
   
   ```json
   "dimensionTableConfig": {
     "errorOnDuplicatePrimaryKey": true
   }
   ```
   
   → Prevents bad data, but fails the ingestion job instead of resolving the duplicates
   
   2. Move dimension data to a REALTIME upsert table
      → Works, but adds operational complexity and is not ideal for 
batch-managed dimensions
   
   ---
   
   Proposal
   
   Add support for OFFLINE UPSERT dimension tables, where:
   
   * Primary key uniqueness is enforced
   * Latest record overwrites the previous one
   * Behavior is deterministic and rerun-safe
   
   This would align OFFLINE dimension tables with REALTIME upsert capabilities 
and significantly simplify batch ingestion workflows.
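   
   As a purely illustrative sketch, the table config could look something like the snippet below; the `overwriteDuplicatePrimaryKey` flag does not exist today and is only a placeholder name for whatever option the community prefers:
   
   ```json
   {
     "isDimTable": true,
     "dimensionTableConfig": {
       "_comment": "overwriteDuplicatePrimaryKey is hypothetical, proposed in this issue",
       "errorOnDuplicatePrimaryKey": false,
       "overwriteDuplicatePrimaryKey": true
     }
   }
   ```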
   
   ---
   
   Related Issues / PRs
   
   * Duplicate primary key handling for dimension tables: #12284
   * Disallow duplicate primary keys (error-only): 

