[ 
https://issues.apache.org/jira/browse/HBASE-26466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-26466:
---------------------------------
    Description: 
For the use case of inserting immutable data (specifically time-series data),
the region split mechanism does not seem to improve availability when the
ingestion rate is very high. When we ingest a lot of data, the region split
policy tries to split the hot region based on size (either the combined size of
all stores, or the size of any single store exceeding the configured max file
size), if we consider the default {_}SteppingSplitPolicy{_}. The most recent hot
region tends to receive all of the latest inserts. When the region is split, the
first half (say daughterA) stays on the same server, whereas the second half
(daughterB) is moved out to another server in the cluster. daughterB is likely
to become the next hot region, because with sequential writes all new updates
land in the second half. Hence, once the new daughter region is created, client
traffic is redirected to another server. Client requests pile up from the moment
the split is triggered until the new daughters come online; after that, the
client has to query meta for the updated daughter region and redirect traffic to
the new server.
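
To make the hot-spot behaviour concrete, here is a minimal toy model in plain
Java (not HBase code). It assumes monotonically increasing row keys and a
hypothetical MAX_ROWS_PER_REGION threshold standing in for the configured max
file size; the class and method names are made up for illustration. It records
which region each write lands in, and under these assumptions the writes only
ever move to the newest upper daughter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model: size-based midpoint splits under monotonically increasing row keys. */
class SequentialSplitDemo {
    // Hypothetical threshold standing in for the configured max file size.
    static final int MAX_ROWS_PER_REGION = 4;

    /** Writes keys 0..n-1 in order; returns the start key of the region hit by each write. */
    static List<Long> regionStartKeysHit(int n) {
        // start key -> rows held by that region; a region covers [start, next start).
        TreeMap<Long, List<Long>> regions = new TreeMap<>();
        regions.put(Long.MIN_VALUE, new ArrayList<>());
        List<Long> hit = new ArrayList<>();
        for (long key = 0; key < n; key++) {
            Long start = regions.floorKey(key);
            List<Long> rows = regions.get(start);
            rows.add(key);
            hit.add(start);
            if (rows.size() > MAX_ROWS_PER_REGION) {
                // Split at the middle row: the lower half keeps the old start key
                // (daughterA), the upper half becomes a new region (daughterB).
                int mid = rows.size() / 2;
                List<Long> upper = new ArrayList<>(rows.subList(mid, rows.size()));
                rows.subList(mid, rows.size()).clear();
                regions.put(upper.get(0), upper);
            }
        }
        return hit;
    }

    public static void main(String[] args) {
        // Each entry is the start key of the region that received that write.
        System.out.println(regionStartKeysHit(12));
    }
}
```

In this model the lower daughters never receive another write after a split,
which is the pattern the description above is concerned with.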

If we could have a configurable region creation strategy that 1) keeps splits
disabled for the given table, and 2) dynamically creates a new region with a
lexicographically higher start key on the same server and updates the existing
region's boundary accordingly, the client would have to look up meta only once
and could continue ingestion without any SLA degradation caused by region split
transitions.
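
As a rough sketch of the idea (plain Java, not HBase code, and not an existing
HBase feature), the strategy could be modelled as follows. The Region class, the
MAX_ROWS_PER_REGION threshold and the server name are all hypothetical, a null
end key stands in for EMPTY_END_ROW, and the model assumes every incoming row
key sorts after all earlier ones.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an "append a new region instead of splitting" strategy. */
class AppendRegionDemo {
    /** Key range [startKey, endKey); endKey == null stands in for EMPTY_END_ROW. */
    static final class Region {
        final String startKey;
        String endKey;        // mutable on purpose: the idea requires capping a live region
        final String server;
        int rows;

        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }
    }

    // Hypothetical size threshold that triggers creation of a new region.
    static final int MAX_ROWS_PER_REGION = 3;

    final List<Region> regions = new ArrayList<>();

    AppendRegionDemo(String server) {
        regions.add(new Region("", null, server));
    }

    /** Accepts a row key; assumes it sorts after every key written so far. */
    void write(String rowKey) {
        Region last = regions.get(regions.size() - 1);
        if (last.rows >= MAX_ROWS_PER_REGION) {
            // Cap the full region at the incoming key and open a new,
            // open-ended region on the same server. No data is moved.
            last.endKey = rowKey;
            last = new Region(rowKey, null, last.server);
            regions.add(last);
        }
        last.rows++;
    }

    public static void main(String[] args) {
        AppendRegionDemo table = new AppendRegionDemo("rs1");
        for (int i = 1; i <= 7; i++) {
            table.write(String.format("k%02d", i));
        }
        for (Region r : table.regions) {
            System.out.println("[" + r.startKey + ", " + r.endKey + ") on " + r.server
                    + " rows=" + r.rows);
        }
    }
}
```

Because the old region only has its end key capped and the new region starts
empty on the same server, nothing in this model corresponds to the daughter
reassignment that a split involves.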

Note: a region split might also run into complications that require the
procedure to be rolled back from some step or to continue with internal
retries, further delaying ingestion from clients.

There are some complications around updating a live region's start and end
keys, as this key range is currently immutable. We could brainstorm ideas for
making them optionally mutable, and any issues that would raise. For instance, a
client might continue writing to the region whose end key was updated, but
writes for out-of-range keys will fail; the client will then look up meta for
the updated key-space range of the table (the new region is created with end
key EMPTY_END_ROW).
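
The client-side flow described above can be sketched as a toy model in plain
Java (not HBase client code). All names here are hypothetical: the meta table is
a TreeMap from start key to region, a null end key stands in for EMPTY_END_ROW,
and an IllegalArgumentException stands in for whatever error the server would
return for an out-of-range key.

```java
import java.util.TreeMap;

/** Toy model: a client with a stale cached region retries after an out-of-range failure. */
class StaleCacheRetryDemo {
    static final class Region {
        final String startKey;
        String endKey; // null stands in for EMPTY_END_ROW (open-ended)

        Region(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        boolean contains(String key) {
            return key.compareTo(startKey) >= 0
                    && (endKey == null || key.compareTo(endKey) < 0);
        }
    }

    // Stand-in for the meta table: start key -> region.
    final TreeMap<String, Region> meta = new TreeMap<>();
    Region cached;       // the client's cached region location
    int metaLookups = 0; // how often the client had to consult meta

    StaleCacheRetryDemo() {
        meta.put("", new Region("", null));
    }

    /** Server side: cap the last region at boundary and append a new open-ended region. */
    void rollOver(String boundary) {
        meta.lastEntry().getValue().endKey = boundary;
        meta.put(boundary, new Region(boundary, null));
    }

    Region lookupMeta(String key) {
        metaLookups++;
        return meta.floorEntry(key).getValue();
    }

    /** Server side: reject keys that fall outside the region's current range. */
    static void serverPut(Region region, String key) {
        if (!region.contains(key)) {
            throw new IllegalArgumentException("key out of range: " + key);
        }
    }

    /** Client side: try the cached region first, refresh from meta once on failure. */
    Region write(String key) {
        if (cached == null) {
            cached = lookupMeta(key);
        }
        try {
            serverPut(cached, key);
        } catch (IllegalArgumentException outOfRange) {
            cached = lookupMeta(key);
            serverPut(cached, key);
        }
        return cached;
    }

    public static void main(String[] args) {
        StaleCacheRetryDemo client = new StaleCacheRetryDemo();
        client.write("a");
        client.rollOver("m");
        client.write("n"); // fails against the capped region, then retries
        client.write("o"); // served from the refreshed cache
        System.out.println("meta lookups: " + client.metaLookups);
    }
}
```

In this model the client pays one extra meta lookup per newly created region,
and keys that still fall inside the capped region keep succeeding against the
cached location.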

  was:
For insertion of immutable data usecase (specifically time-series data), region 
split mechanism doesn't seem to provide better availability when ingestion rate 
is very high. When we ingest lot of data, the region split policy tries to 
split the given hot region based on the size (either size of all stores 
combined or size of any single store exceeding max file size configured) if we 
consider default {_}SteppingSplitPolicy{_}. The latest hot regions tend to 
receive all latest inserts. When the region is split, the first half of the 
region (say daughterA) stays on the same server whereas the second half 
(daughterB) region – likely to become another hot region because all new latest 
updates come to second half region in the sequential write fashion – is moved 
out to other servers in the cluster. Hence, once new daughter region is 
created, client traffic will be redirected to another server. Client requests 
will be piled up when region split is triggered till new daughters come alive 
and once done, client will have to request meta for updated daughter region and 
redirect traffic to new server.

If we could have configurable region creation strategy that 1) keeps the split 
disabled for the given table, and 2) create new region dynamically with 
lexicographically higher start key on the same server and update it's own 
region boundary, the client will have to look up meta once and continue 
ingestion without any degraded SLA caused by region split transitions.

Note: region split might also encounter some complications, requiring the 
procedure to be rolled back from some step, or continue with internal retries, 
eventually further delaying the ingestion from clients.

 

There are some complications around updating live region's start and end keys 
as this key range is immutable. We could brainstorm ideas around making them 
optionally mutable and any issues around them. For instance, client might 
continue writing data to the region with updated end key but writes will fail 
and hence, they will lookup in meta for updated key-space range of the table.


> Immutable timeseries usecase - Create new region rather than split existing 
> one
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-26466
>                 URL: https://issues.apache.org/jira/browse/HBASE-26466
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Viraj Jasani
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
