[
https://issues.apache.org/jira/browse/PHOENIX-7684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viraj Jasani resolved PHOENIX-7684.
-----------------------------------
Resolution: Fixed
> Introduce Segment Scan
> ----------------------
>
> Key: PHOENIX-7684
> URL: https://issues.apache.org/jira/browse/PHOENIX-7684
> Project: Phoenix
> Issue Type: New Feature
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
> Fix For: 5.3.0
>
>
> HBase stores rows of data in tables. Tables are split into groups of
> lexicographically adjacent rows called regions: all rows in the table that
> sort between a region’s start row key and end row key are stored in that
> region. A large table holding multiple terabytes or petabytes of data can
> have its data spread across hundreds of thousands of regions.
> Depending on the size of the table, a full table scan can take from a few
> seconds to several hours. When a client executes a "SELECT * FROM <table>"
> query using the Phoenix JDBC client, that single client is responsible for
> retrieving all the rows from all the regions of the given table. Although a
> single Phoenix client does divide the scan range into the table region
> ranges and submit the scans in parallel, a single client application can
> still become a bottleneck for the end user when the table is very large,
> because it always has limited memory, CPU and IO capabilities.
> The purpose of this Jira is to introduce segment scan. A segment is a
> logical chunk of the given table, similar to an HBase table region. While
> scanning the full table, the client application does not have any insight
> into how the table data are distributed among the regions. The concept of
> segment scan allows an application to define the number of segments into
> which the table data can be divided. A new function TOTAL_SEGMENTS() can be
> used to retrieve the scan boundary of each segment. The retrieved scan
> boundaries can be handed to individual client workers (threads or VMs) so
> that a given worker scans only one segment of the table. Additional
> functions SCAN_START_KEY() and SCAN_END_KEY() are used first to retrieve the
> segment boundaries and later to submit the segment scan request with those
> boundaries.
> {code:java}
> SELECT SCAN_START_KEY(), SCAN_END_KEY() FROM T1 WHERE TOTAL_SEGMENTS() = 10
> {code}
> The above SELECT query is meant to retrieve a total of 10 segment boundaries
> from the table T1 by bucketing the table region boundaries into 10 ranges.
> For example:
> * If the table has 12 regions and the client needs 4 segments: each segment
> will contain 3 regions
> * If the table has 10 regions and the client needs 3 segments: you get
> segments of sizes 4, 3, and 3 regions
> * If the table has 3 regions and the client needs 10 segments: you get 3
> segments (one per region)
> One advantage of the segment scan: external frameworks like Spark,
> Trino or MapReduce can now ask for "give me exactly 100 work units" instead
> of dealing with an unknown number of HBase regions. The segment scan approach
> provides horizontal scaling of the client workers.
> After retrieving the above segment boundaries, any of the segments can be
> scanned using
> {code:java}
> SELECT * FROM <table> WHERE SCAN_START_KEY() = ? AND SCAN_END_KEY() = ?
> {code}
> The above query takes the parameter values of SCAN_START_KEY() and
> SCAN_END_KEY() as the VARBINARY values retrieved earlier by the first query.
> With the scan boundaries set to the given segment boundaries, only that
> segment's worth of data is scanned by the query.
> Overall, this feature provides deterministic data partitioning for parallel
> processing, making Phoenix more suitable for integration with modern big data
> processing frameworks.
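
The bucketing examples in the description (12 regions into 4 segments, 10 into 3, 3 into 10) can be reproduced with a short sketch. This is a minimal Python illustration, not the Phoenix implementation: the names `segment_sizes` and `bucket_regions` are hypothetical, regions are modeled as (start key, end key) byte pairs with an empty key meaning unbounded, and placing the larger segments first is an assumption chosen to match the "4, 3, and 3" example.

```python
from typing import List, Tuple

# (start key inclusive, end key exclusive); b"" stands for an unbounded key.
Range = Tuple[bytes, bytes]


def segment_sizes(region_count: int, total_segments: int) -> List[int]:
    """Return how many adjacent regions each segment would cover."""
    if total_segments < 1:
        raise ValueError("total_segments must be at least 1")
    if region_count == 0:
        return []
    # Never produce more segments than there are regions.
    n = min(total_segments, region_count)
    base, extra = divmod(region_count, n)
    # Assumption: the first `extra` segments take one additional region.
    return [base + 1 if i < extra else base for i in range(n)]


def bucket_regions(regions: List[Range], total_segments: int) -> List[Range]:
    """Merge sorted, adjacent region ranges into segment ranges."""
    segments: List[Range] = []
    first = 0
    for size in segment_sizes(len(regions), total_segments):
        last = first + size - 1
        # A segment spans from its first region's start to its last region's end.
        segments.append((regions[first][0], regions[last][1]))
        first += size
    return segments


if __name__ == "__main__":
    print(segment_sizes(12, 4))  # [3, 3, 3, 3]
    print(segment_sizes(10, 3))  # [4, 3, 3]
    print(segment_sizes(3, 10))  # [1, 1, 1]
```

Each resulting (start, end) pair plays the role of one SCAN_START_KEY() / SCAN_END_KEY() row from the first query, which a worker would then bind as the two parameters of the second query.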
--
This message was sent by Atlassian Jira
(v8.20.10#820010)