yzeng1618 opened a new issue, #10301:
URL: https://github.com/apache/seatunnel/issues/10301

   ### Search before asking
   
   - [x] I had searched in the 
[feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22)
 and found no similar feature requirement.
   
   
   ### Description
   
   HBaseSource currently supports rowkey range scanning via start_rowkey / 
end_rowkey and configurable inclusive/exclusive boundary policies (#9983, 
#10011). This approach only filters data by rowkey ordering (a continuous 
interval in the rowkey keyspace).
   
   In many real-world HBase tables, the rowkey is a business key (e.g., 
order_id, user_id). Updates happen based on business events and are therefore 
scattered across the entire keyspace, with no correlation between “update time” 
and “rowkey order”. In such cases, rowkey range scanning cannot efficiently or 
accurately cover common offline incremental extraction needs:
   
   - Using one large rowkey range to “cover all possible updated keys” 
scans a large number of unchanged rows/cells and is too expensive.
   
   - Splitting into many rowkey ranges still cannot enumerate all updated keys 
within a time window, leading to missing data.
   
   HBase natively supports time-range filtering by cell timestamp (version 
timestamp) using Scan#setTimeRange(minTimestamp, maxTimestamp). SeaTunnel 
HBaseSource does not expose this capability today: there are no 
min_timestamp/max_timestamp options in HbaseSourceOptions, and HbaseClient#scan 
builds a Scan without applying any time range. As a result, users 
cannot perform timestamp-window scans and must fall back to inefficient full 
scans or rowkey-based workarounds.
   
   This feature request proposes adding optional min_timestamp / max_timestamp 
options to HBaseSource and applying them to the HBase Scan so users can scan 
data within a timestamp window.
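
   A minimal sketch of how the two optional bounds could be resolved before 
they are handed to the Scan. The class name TimestampWindow and its methods are 
hypothetical (nothing like it is claimed to exist in the connector), and the 
fallback values of 0 and Long.MAX_VALUE are an assumption chosen to represent 
"no lower bound" and "no upper bound". Only Scan#setTimeRange(minTimestamp, 
maxTimestamp) comes from the HBase API, and it appears only in a comment so the 
sketch runs without HBase on the classpath.

```java
/** Hypothetical helper: turns optional min/max options into a half-open [min, max) window. */
final class TimestampWindow {
    final long minInclusive;
    final long maxExclusive;

    private TimestampWindow(long minInclusive, long maxExclusive) {
        this.minInclusive = minInclusive;
        this.maxExclusive = maxExclusive;
    }

    /**
     * A null argument means "option not configured".
     * Assumption: an unset lower bound becomes 0 and an unset upper bound becomes Long.MAX_VALUE.
     */
    static TimestampWindow resolve(Long minTimestamp, Long maxTimestamp) {
        long min = (minTimestamp == null) ? 0L : minTimestamp;
        long max = (maxTimestamp == null) ? Long.MAX_VALUE : maxTimestamp;
        if (min < 0 || max < 0) {
            throw new IllegalArgumentException("timestamps must be non-negative");
        }
        if (max < min) {
            throw new IllegalArgumentException("max_timestamp must be >= min_timestamp");
        }
        return new TimestampWindow(min, max);
    }

    /** True when neither bound narrows the scan, so no time range needs to be applied. */
    boolean isUnbounded() {
        return minInclusive == 0L && maxExclusive == Long.MAX_VALUE;
    }

    /** Lower bound is inclusive, upper bound is exclusive. */
    boolean contains(long cellTimestamp) {
        return cellTimestamp >= minInclusive && cellTimestamp < maxExclusive;
    }

    // Where the Scan is built, the window could then be applied roughly like this:
    //   if (!window.isUnbounded()) {
    //       scan.setTimeRange(window.minInclusive, window.maxExclusive);
    //   }
}
```

   With this shape, a job that sets neither option keeps today's behavior, and 
a job that sets only one bound still gets a valid window.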
   
   ### Usage Scenario
   
   Offline time-window extraction for business-key tables:
   
   - Table rowkey is order_id (business key). Columns like info:status / 
info:amount are updated over time.
   
   - Requirement: export only the data written/updated during a specific time 
window [T1, T2) (based on HBase cell timestamps), without scanning the whole 
table.
   
   - Configuration example:
     - min_timestamp = T1
     - max_timestamp = T2
   
   - Expected behavior: HBaseSource performs a scan with setTimeRange(T1, T2) 
and outputs only the cells/columns whose timestamps fall in the window (note: 
columns not updated in the window may be absent/null in the output).
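
   The expected behavior above can be illustrated with a small in-memory 
simulation. This is only a sketch of the window semantics, not HBase code: the 
Cell class, the sample rows, and the scan helper are all hypothetical, and 
T1 = 200, T2 = 300 are arbitrary illustrative timestamps. It shows that the 
upper bound is exclusive, that rows with no cell in the window disappear, and 
that columns not updated in the window are absent from the rows that remain.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TimeWindowScanDemo {
    /** Simplified stand-in for an HBase cell: one versioned value of one column in one row. */
    static final class Cell {
        final String rowkey;
        final String column;
        final long timestamp;
        final String value;

        Cell(String rowkey, String column, long timestamp, String value) {
            this.rowkey = rowkey;
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Keeps cells with minTs <= timestamp < maxTs; the newest in-window version of a column wins. */
    static Map<String, Map<String, String>> scan(List<Cell> cells, long minTs, long maxTs) {
        Map<String, Map<String, Cell>> newest = new TreeMap<>();
        for (Cell c : cells) {
            if (c.timestamp < minTs || c.timestamp >= maxTs) {
                continue; // outside the half-open window [minTs, maxTs)
            }
            newest.computeIfAbsent(c.rowkey, k -> new TreeMap<>())
                  .merge(c.column, c, (a, b) -> a.timestamp >= b.timestamp ? a : b);
        }
        Map<String, Map<String, String>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Cell>> row : newest.entrySet()) {
            Map<String, String> columns = new TreeMap<>();
            for (Cell c : row.getValue().values()) {
                columns.put(c.column, c.value);
            }
            result.put(row.getKey(), columns);
        }
        return result;
    }

    /** Hypothetical sample data: rowkey is order_id, updates are scattered over time. */
    static List<Cell> sampleCells() {
        return Arrays.asList(
            new Cell("order_1001", "info:status", 100L, "CREATED"),
            new Cell("order_1001", "info:amount", 100L, "25.00"),
            new Cell("order_1001", "info:status", 250L, "PAID"),    // inside [200, 300)
            new Cell("order_1002", "info:status", 120L, "CREATED"), // before the window
            new Cell("order_1002", "info:amount", 120L, "9.99"),    // before the window
            new Cell("order_1003", "info:status", 300L, "CREATED")  // equals T2, so excluded
        );
    }

    public static void main(String[] args) {
        // prints {order_1001={info:status=PAID}}
        System.out.println(scan(sampleCells(), 200L, 300L));
    }
}
```

   Note that order_1001 comes back without info:amount, which is why a 
downstream consumer should treat a missing column as "not updated in the 
window" rather than as a deleted value.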
   
   ### Related issues
   
   https://github.com/apache/seatunnel/pull/9983
   https://github.com/apache/seatunnel/pull/10011
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
