yzeng1618 opened a new issue, #10301: URL: https://github.com/apache/seatunnel/issues/10301
### Search before asking

- [x] I had searched in the [feature](https://github.com/apache/seatunnel/issues?q=is%3Aissue+label%3A%22Feature%22) and found no similar feature requirement.

### Description

HBaseSource currently supports rowkey range scanning via start_rowkey / end_rowkey and configurable inclusive/exclusive boundary policies (#9983, #10011). This approach filters data only by rowkey ordering (a continuous interval in the rowkey keyspace).

In many real-world HBase tables, the rowkey is a business key (e.g., order_id, user_id). Updates are driven by business events and are therefore scattered across the entire keyspace, with no correlation between “update time” and “rowkey order”. In such cases, rowkey range scanning cannot efficiently or accurately cover common offline incremental extraction needs:

- Using one large rowkey range to “cover all possible updated keys” scans a large number of unchanged rows/cells and becomes too expensive.
- Splitting into many rowkey ranges still cannot enumerate all keys updated within a time window, which leads to missing data.

HBase natively supports time-range filtering by cell timestamp (version timestamp) using Scan#setTimeRange(minTimestamp, maxTimestamp). SeaTunnel HBaseSource does not expose this capability today: there are no min_timestamp/max_timestamp options in HbaseSourceOptions, and HbaseClient#scan builds a Scan without applying any timestamp range. As a result, users cannot perform timestamp-window scans and must fall back to inefficient full scans or rowkey-based workarounds.

This feature request proposes adding optional min_timestamp / max_timestamp options to HBaseSource and applying them to the HBase Scan so that users can scan data within a timestamp window.

### Usage Scenario

Offline time-window extraction for business-key tables:

- Table rowkey is order_id (business key). Columns like info:status / info:amount are updated over time.
- Requirement: export only the data written/updated during a specific time window [T1, T2) (based on HBase cell timestamps), without scanning the whole table.
- Configuration example:
  - min_timestamp = T1
  - max_timestamp = T2
- Expected behavior: HBaseSource performs a scan with setTimeRange(T1, T2) and outputs only the cells/columns whose timestamps fall in the window (note: columns not updated in the window may be absent/null in the output).

### Related issues

https://github.com/apache/seatunnel/pull/9983
https://github.com/apache/seatunnel/pull/10011

### Are you willing to submit a PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
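
The half-open window [T1, T2) requested in this issue, and the note that columns not updated in the window may be absent, can be illustrated without an HBase cluster. The following is a minimal sketch in plain Java that only models the behavior. It does not use any HBase or SeaTunnel classes. The names `TimeRangeScanSketch`, `Cell`, and `scanWithTimeRange` are hypothetical, and both the validation of the range and the choice to keep only the newest version per column inside the window are assumptions of this sketch.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: models a scan restricted to cell timestamps in the half-open
// window [minTimestamp, maxTimestamp). All names here are hypothetical.
class TimeRangeScanSketch {

    /** One versioned cell: rowkey, column ("family:qualifier"), timestamp, value. */
    static final class Cell {
        final String rowkey;
        final String column;
        final long timestamp;
        final String value;

        Cell(String rowkey, String column, long timestamp, String value) {
            this.rowkey = rowkey;
            this.column = column;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Returns rowkey -> (column -> value) for cells whose timestamp lies in
     * [minTimestamp, maxTimestamp). Rows with no cell in the window are omitted,
     * and columns not written in the window are absent from the row.
     */
    static Map<String, Map<String, String>> scanWithTimeRange(
            List<Cell> cells, long minTimestamp, long maxTimestamp) {
        // Assumption of this sketch: reject negative or inverted ranges up front.
        if (minTimestamp < 0 || maxTimestamp < minTimestamp) {
            throw new IllegalArgumentException("invalid time range");
        }
        Map<String, Map<String, String>> rows = new LinkedHashMap<>();
        Map<String, Long> newestSeen = new HashMap<>();
        for (Cell c : cells) {
            // Lower bound inclusive, upper bound exclusive.
            if (c.timestamp < minTimestamp || c.timestamp >= maxTimestamp) {
                continue;
            }
            // Keep only the newest in-window version of each (row, column).
            String key = c.rowkey + "\u0000" + c.column;
            Long seen = newestSeen.get(key);
            if (seen == null || c.timestamp > seen) {
                newestSeen.put(key, c.timestamp);
                rows.computeIfAbsent(c.rowkey, k -> new LinkedHashMap<>())
                        .put(c.column, c.value);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Cell> cells = List.of(
                new Cell("order_1", "info:status", 100L, "CREATED"),
                new Cell("order_1", "info:amount", 100L, "9.99"),
                new Cell("order_1", "info:status", 250L, "PAID"),
                new Cell("order_2", "info:status", 300L, "CREATED"));
        // Window [200, 300): order_2 is written exactly at the exclusive upper bound.
        System.out.println(scanWithTimeRange(cells, 200L, 300L));
    }
}
```

In this model, a window of [200, 300) keeps only the `info:status` update of `order_1`. The `info:amount` column is absent because it was last written before the window, and `order_2` is excluded because its timestamp equals the exclusive upper bound. Downstream consumers of a time-window extraction would need to tolerate such partial rows.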
