virajjasani opened a new pull request, #2255:
URL: https://github.com/apache/phoenix/pull/2255

   Jira: PHOENIX-7684
   
   HBase stores rows of data in tables. Tables are split into groups of 
lexicographically adjacent rows. These groups are called regions. 
"Lexicographically adjacent" means that all rows in the table that sort between 
a region’s start row key and end row key are stored in that region. Large 
tables holding multiple terabytes or petabytes of data can have their data 
spread across hundreds of thousands of regions.
   
   Depending on the size of the table, a full table scan can take from a few 
seconds to several hours. When a client executes a "SELECT * FROM <table>" 
query using the Phoenix JDBC client, a single client is responsible for 
retrieving all the rows from all the regions of the given table. Although a 
single Phoenix client does divide the scan range into the table region ranges 
and submits the scans in parallel, that one client application can still become 
a bottleneck for the end user when the table is very large, because a single 
client application always has limited memory, CPU and IO capacity.
   
   The purpose of this Jira is to introduce segment scan. A segment is a 
logical chunk of the given table, similar to an HBase table region. While 
scanning the full table, the client application does not have any insight into 
how the table data is distributed among the regions. Segment scan lets the 
application define the number of segments into which the table data is divided. 
A new function, TOTAL_SEGMENTS(), specifies that number in the query that 
retrieves the scan boundary of each segment. Additional functions, 
SCAN_START_KEY() and SCAN_END_KEY(), return the segment boundaries in that 
query and are later used to submit the segment scan request with those 
boundaries. Each retrieved boundary pair can be handed to an individual client 
worker (thread or VM) so that a given worker scans only one segment of the 
table.
   
   ```
   SELECT SCAN_START_KEY(), SCAN_END_KEY() FROM T1 WHERE TOTAL_SEGMENTS() = 10 
   ```
   The above SELECT query is meant to retrieve a total of 10 segment boundaries 
from the table T1 by bucketing the table region boundaries into 10 ranges.
   
   For example:
   
   - If the table has 12 regions and the client needs 4 segments: each segment 
contains 3 regions
   - If the table has 10 regions and the client needs 3 segments: the segments 
contain 4, 3, and 3 regions
   - If the table has 3 regions and the client needs 10 segments: only 3 
segments are returned (one per region)
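
   The bucketing arithmetic in the examples above can be sketched as follows. 
This is a minimal illustration, not the code in this PR: the class and method 
names are hypothetical, and placing the larger segments first is an assumption 
that merely matches the "4, 3, and 3" example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Phoenix: reproduces the region-to-segment
// arithmetic from the examples in this description.
final class SegmentBucketing {

    /** Returns the number of regions assigned to each segment. */
    static List<Integer> segmentSizes(int regionCount, int requestedSegments) {
        if (regionCount < 1 || requestedSegments < 1) {
            throw new IllegalArgumentException("counts must be positive");
        }
        // A segment never splits a region, so there cannot be more
        // segments than regions.
        int segments = Math.min(regionCount, requestedSegments);
        int base = regionCount / segments;
        int remainder = regionCount % segments;

        List<Integer> sizes = new ArrayList<>(segments);
        for (int i = 0; i < segments; i++) {
            // Assumption: leftover regions go to the earliest segments.
            sizes.add(i < remainder ? base + 1 : base);
        }
        return sizes;
    }
}
```

   Calling `SegmentBucketing.segmentSizes(10, 3)` yields the sizes 4, 3, and 3, 
and `segmentSizes(3, 10)` yields three segments of one region each.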
   
   One advantage of segment scan is that external frameworks like Spark, Trino 
or MapReduce can now ask for "give me exactly 100 work units" instead of 
dealing with an unknown number of HBase regions. The segment scan approach 
lets the client workers scale horizontally.
   
   After retrieving the above segment boundaries, any of the segments can be 
scanned using:
   ```
   SELECT * FROM <table> WHERE SCAN_START_KEY() = ? AND SCAN_END_KEY() = ?
   ```
   The above query takes the VARBINARY values retrieved earlier by the first 
query as the parameter values of SCAN_START_KEY() and SCAN_END_KEY(). With the 
scan boundaries set to the given segment boundary, the query scans only that 
segment's worth of data.
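
   The two-step flow can be sketched with plain JDBC as below. This is an 
illustrative sketch only: the class, the `Segment` holder and the method names 
are hypothetical, the table name is assumed to be a trusted identifier, and the 
sketch simply passes the boundary bytes back unchanged because this description 
does not say how the boundaries of the first and last segment are represented.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side sketch; only the SQL text comes from this PR.
final class SegmentScanSketch {

    /** One boundary pair returned by the first query. */
    static final class Segment {
        final byte[] startKey;
        final byte[] endKey;

        Segment(byte[] startKey, byte[] endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** SQL that retrieves the segment boundaries (table is a trusted name). */
    static String boundaryQuery(String table, int totalSegments) {
        return "SELECT SCAN_START_KEY(), SCAN_END_KEY() FROM " + table
                + " WHERE TOTAL_SEGMENTS() = " + totalSegments;
    }

    /** SQL that scans a single segment, with both keys bound as parameters. */
    static String segmentQuery(String table) {
        return "SELECT * FROM " + table
                + " WHERE SCAN_START_KEY() = ? AND SCAN_END_KEY() = ?";
    }

    /** Step 1: fetch every segment boundary once, e.g. on a coordinator. */
    static List<Segment> fetchSegments(Connection conn, String table, int totalSegments)
            throws SQLException {
        List<Segment> segments = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(boundaryQuery(table, totalSegments))) {
            while (rs.next()) {
                segments.add(new Segment(rs.getBytes(1), rs.getBytes(2)));
            }
        }
        return segments;
    }

    /** Step 2: a worker scans one segment; here it only counts the rows. */
    static long countRows(Connection conn, String table, Segment segment)
            throws SQLException {
        long rows = 0;
        try (PreparedStatement ps = conn.prepareStatement(segmentQuery(table))) {
            ps.setBytes(1, segment.startKey);
            ps.setBytes(2, segment.endKey);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows++;
                }
            }
        }
        return rows;
    }
}
```

   In a real job, each worker would typically open its own connection and 
receive one `Segment` from the coordinator, so that no single client has to 
read the whole table.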
   
   Overall, this feature provides deterministic data partitioning for parallel 
processing, making Phoenix more suitable for integration with modern big data 
processing frameworks.

