Agreed, the key is going to be changing the design of references to support
this new capability.
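For context on what that redesign implies: today's
org.apache.hadoop.hbase.io.Reference records a split row plus which half (top
or bottom) of the parent HFile a daughter reads. A purely illustrative sketch
of what a range-capable reference might have to carry instead; this class is
hypothetical and not part of HBase:

// Hypothetical sketch, for illustration only.
public final class RangeReference {
  private final byte[] startRow; // inclusive lower bound of the referenced slice
  private final byte[] stopRow;  // exclusive upper bound of the referenced slice

  public RangeReference(byte[] startRow, byte[] stopRow) {
    this.startRow = startRow;
    this.stopRow = stopRow;
  }

  public byte[] getStartRow() { return startRow; }
  public byte[] getStopRow() { return stopRow; }

  // A reader over the parent HFile would clamp its scans to
  // [startRow, stopRow), generalizing what HalfStoreFileReader does today
  // for a single split row.
}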
Splits must continue to minimize offline time. Increasing that time at all
would not be acceptable.

> On Feb 8, 2024, at 5:46 PM, 张铎 <palomino...@gmail.com> wrote:
>
> As I said above, split is not like merge; simply changing the Admin API
> to take a byte[][] does not actually help here.
>
> For online splitting of a region, our algorithm only supports splitting
> into two sub regions, and then we do a compaction to clean up all the
> reference files and prepare for the next split.
>
> I'm not saying this is impossible, but I think the most difficult
> challenge here is how to design a new reference file algorithm that can
> reference a 'range' of an HFile, not only the top half or bottom half.
> In this way we could support splitting a region directly into more than
> 2 sub regions.
>
> Or maybe we could change the way we split regions: instead of creating
> reference files, we directly read an HFile and output multiple HFiles
> for the different ranges, put them in different region directories, and
> also make the flush write to multiple places (the current region and
> all the sub regions). Once everything is fine, we offline the old
> region (also making it do a final flush), and then online all the sub
> regions. In this way we have much lower overall write amplification,
> but the problem is that it will take a very long time to split, and
> failure recovery will also be more complicated.
>
> Thanks.
>
> Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Fri, Feb 9, 2024 at 09:27:
>>
>> Yep, I forgot about that nuance. I agree we can add a splitRegion
>> overload which takes a byte[][] for multiple split points.
>>
>>> On Thu, Feb 8, 2024 at 8:23 PM Andrew Purtell <apurt...@apache.org> wrote:
>>>
>>> Rushabh already covered this, but splitting is not complete until the
>>> region can be split again. This is a very important nuance. The
>>> daughter regions come online very quickly, as designed, but then
>>> background housekeeping (compaction) must copy the data before the
>>> daughters become splittable. Depending on compaction pressure,
>>> compaction queue depth, and the settings of various tunables, waiting
>>> for some split daughters to become ready to split again can take many
>>> minutes to hours.
>>>
>>> So let's say we have replicas of a table at two sites, site A and
>>> site B. The region boundaries of this table in A and B will be
>>> different. Now let's also say that table data is stored with a key
>>> prefix mapping to each unique tenant. When migrating a tenant, the
>>> data copy will hotspot on the region(s) hosting keys with the
>>> tenant's prefix. This is fine if there are enough regions to absorb
>>> the load. We run into trouble when the region boundaries in the
>>> sub-keyspace of interest are quite different in B versus A. We get
>>> hotspotting and impact to operations until organic splitting
>>> eventually mitigates it, but this might also require many minutes to
>>> hours, with noticeable performance degradation in the meantime. To
>>> avoid that degradation we pace the sender, but then the copy may take
>>> so long as to miss the SLA for the migration. To make the data
>>> movement performant and stay within SLA, we want to apply one or more
>>> splits or merges so that the region boundaries in B roughly align
>>> with those in A, avoiding hotspotting. This will also make shipping
>>> this data by bulk load more efficient, by minimizing the amount of
>>> HFile splitting necessary to load the files at the receiver.
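To make the direct-rewrite alternative 张铎 sketches earlier in this thread
concrete, here is a rough, unhardened sketch assuming roughly the HBase 2.x
HFile API (reader and scanner signatures vary across versions); the real split
path would additionally need the multi-place flush and the failure recovery he
mentions:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellComparator;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public final class MultiRangeHFileRewrite {
  /**
   * Reads one parent HFile and writes splitPoints.length + 1 child HFiles,
   * one per key range; outputDirs[i] receives range i. splitPoints must be
   * sorted, and outputDirs.length must be splitPoints.length + 1.
   */
  public static void rewrite(Configuration conf, FileSystem fs, Path parent,
      byte[][] splitPoints, Path[] outputDirs) throws Exception {
    CacheConfig cacheConf = CacheConfig.DISABLED;
    HFile.Reader reader = HFile.createReader(fs, parent, cacheConf, true, conf);
    List<HFile.Writer> writers = new ArrayList<>();
    for (Path dir : outputDirs) {
      writers.add(HFile.getWriterFactory(conf, cacheConf)
          .withPath(fs, new Path(dir, parent.getName()))
          .withFileContext(new HFileContextBuilder().build())
          .create());
    }
    // Single pass over the parent: route each cell to the writer for its
    // range. (In older 2.x releases getScanner takes no Configuration.)
    HFileScanner scanner = reader.getScanner(conf, false, false);
    int range = 0;
    if (scanner.seekTo()) {
      do {
        Cell cell = scanner.getCell();
        // Advance past every split point the current row has reached.
        while (range < splitPoints.length && CellComparator.getInstance()
            .compareRows(cell, splitPoints[range], 0, splitPoints[range].length) >= 0) {
          range++;
        }
        writers.get(range).append(cell);
      } while (scanner.next());
    }
    for (HFile.Writer w : writers) {
      w.close();
    }
    reader.close();
  }
}

As noted above, the trade-off of this approach is a much longer split and more
complicated failure recovery, in exchange for avoiding reference files and the
cleanup compaction entirely.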
>>>
>>> So let's say we have some regions that need to be split N ways, where
>>> N is on the order of ~10 (by which I mean more than 1 and less than
>>> 100), in order to (roughly) align region boundaries. We think this
>>> calls for an enhancement to the split request API where the split
>>> should produce a requested number of daughter pairs. Today that is
>>> always 1 pair. Instead we might want 2, 5, 10, conceivably more. And
>>> it would be nice if the guideposts for multi-way splitting could be
>>> sent over in a byte[][].
>>>
>>> On Wed, Feb 7, 2024 at 10:03 AM Bryan Beaudreault <bbeaudrea...@apache.org> wrote:
>>>
>>>> This is the first time I've heard of a region split taking 4
>>>> minutes. For us, it's always on the order of seconds. That's true
>>>> even for a large 50+ GB region. It might be worth looking into why
>>>> that's so slow for you.
>>>>
>>>> On Wed, Feb 7, 2024 at 12:50 PM Rushabh Shah
>>>> <rushabh.s...@salesforce.com.invalid> wrote:
>>>>
>>>>> Thank you Andrew, Bryan and Duo for your responses.
>>>>>
>>>>>> My main thought is that a migration like this should use bulk loading,
>>>>>> But also, I think, that data transfer should be in bulk
>>>>>
>>>>> We are working on moving to bulk loading.
>>>>>
>>>>>> With Admin.splitRegion, you can specify a split point. You can use
>>>>>> that to iteratively add a bunch of regions wherever you need them
>>>>>> in the keyspace. Yes, it's 2 at a time, but it should still be
>>>>>> quick enough in the grand scheme of a large migration.
>>>>>
>>>>> Trying to do some back-of-the-envelope calculations:
>>>>> In a production environment, it took around 4 minutes to split a
>>>>> recently split region which had 4 store files with a total of 5 GB
>>>>> of data. Assume we are migrating 5000 tenants at a time, and that,
>>>>> as is typical for us, around 10% of the tenants (500 tenants) have
>>>>> data spread across more than 1000 regions. We have around 10 huge
>>>>> tables where we store the tenants' data for different use cases.
>>>>> All the above numbers are on the *conservative* side.
>>>>>
>>>>> To create a split structure of 1000 regions, we need 10 rounds of
>>>>> splits (2^10 = 1024), assuming the regions within each round are
>>>>> split in parallel. Each split takes around 4 minutes, so creating
>>>>> 1000 regions for just 1 tenant and 1 table takes around 40 minutes.
>>>>> For 10 tables for 1 tenant, it takes around 400 minutes.
>>>>>
>>>>> For 500 tenants, this will take around *140 days*. To reduce this
>>>>> time further, we could also create the split structure for each
>>>>> tenant and each table in parallel, but this would put a lot of
>>>>> pressure on the cluster, it would require a lot of operational
>>>>> overhead, and we would still end up with the whole process taking
>>>>> days, if not months.
>>>>>
>>>>> Since we are moving our infrastructure to the public cloud, we
>>>>> anticipate such a huge migration happening once every month.
>>>>>
>>>>>> Adding a splitRegion method that takes byte[][] for multiple split
>>>>>> points would be a nice UX improvement, but not strictly necessary.
>>>>>
>>>>> IMHO, for all the reasons stated above, I believe this is necessary.
>>>>>
>>>>> On Mon, Jan 29, 2024 at 6:25 AM 张铎 (Duo Zhang) <palomino...@gmail.com> wrote:
>>>>>
>>>>>> As it is called a 'pre' split, it can only happen when there is no
>>>>>> data in the table.
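As a strawman, the enhanced split request Andrew describes above might look
roughly like the following; the interface name, method shape, and semantics
are purely illustrative, and nothing like this exists in Admin today:

import java.util.concurrent.Future;

// Hypothetical API sketch only.
public interface MultiWaySplitAdmin {
  /**
   * Split the region identified by regionName at every given (sorted) split
   * point, producing splitPoints.length + 1 daughters in one operation.
   * Today's two-way split is the special case of a single split point.
   */
  Future<Void> splitRegionAsync(byte[] regionName, byte[][] splitPoints);
}

Something of this shape would collapse the 10 iterative rounds (at roughly 4
minutes each) in Rushabh's calculation above into a single operation per
parent region, which is the motivation for the proposal.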
>>>>>>
>>>>>> If there is already data in the table, you cannot always create
>>>>>> 'empty' regions, as you do not know whether there is already data
>>>>>> in the given range...
>>>>>>
>>>>>> And technically, if you want to split an HFile into more than 2
>>>>>> parts, you need to design a new algorithm, as HBase currently only
>>>>>> supports top references and bottom references...
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Bryan Beaudreault <bbeaudrea...@apache.org> wrote on Sat, Jan 27, 2024 at 02:16:
>>>>>>>
>>>>>>> My main thought is that a migration like this should use bulk
>>>>>>> loading, which should be relatively easy given you already use MR
>>>>>>> (HFileOutputFormat2). It doesn't solve the region-splitting
>>>>>>> problem. With Admin.splitRegion, you can specify a split point.
>>>>>>> You can use that to iteratively add a bunch of regions wherever
>>>>>>> you need them in the keyspace. Yes, it's 2 at a time, but it
>>>>>>> should still be quick enough in the grand scheme of a large
>>>>>>> migration. Adding a splitRegion method that takes byte[][] for
>>>>>>> multiple split points would be a nice UX improvement, but not
>>>>>>> strictly necessary.
>>>>>>>
>>>>>>> On Fri, Jan 26, 2024 at 12:10 PM Rushabh Shah
>>>>>>> <rushabh.s...@salesforce.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>> At my workplace, we use HBase + Phoenix to run our customer
>>>>>>>> workloads. Most of our Phoenix tables are multi-tenant, and we
>>>>>>>> store the tenantID as the leading part of the rowkey. Each
>>>>>>>> tenant belongs to only one HBase cluster. Due to capacity
>>>>>>>> planning, hardware refresh cycles, and most recently our move to
>>>>>>>> the public cloud, we have to migrate tenants from one HBase
>>>>>>>> cluster (source cluster) to another HBase cluster (target
>>>>>>>> cluster). Normally we migrate a lot of tenants (in the tens of
>>>>>>>> thousands) at a time, and hence we have to copy a huge amount of
>>>>>>>> data (in TBs) from multiple source clusters to a single target
>>>>>>>> cluster. We have an internal tool which uses the MapReduce
>>>>>>>> framework to copy the data. Since these tenants don't have any
>>>>>>>> presence on the target cluster (note that the table is NOT
>>>>>>>> empty, since we have data for other tenants on the target
>>>>>>>> cluster), they start with one region, and through the organic
>>>>>>>> split process the data gets distributed among different regions
>>>>>>>> and regionservers. But the organic splitting process takes a lot
>>>>>>>> of time, and due to the distributed nature of the MR framework,
>>>>>>>> it causes hotspotting on the target cluster which often lasts
>>>>>>>> for days. This causes availability issues, with CPU and/or disk
>>>>>>>> saturation on the regionservers ingesting the data, as well as a
>>>>>>>> lot of replication-related alerts (age of last ship, LogQueue
>>>>>>>> size) which go on for days.
>>>>>>>>
>>>>>>>> In order to handle the huge influx of data, we should ideally
>>>>>>>> pre-split the table on the target based on the split structure
>>>>>>>> present on the source cluster. If we pre-split and create empty
>>>>>>>> regions with the right region boundaries, it will help
>>>>>>>> distribute the load to different regions and region servers and
>>>>>>>> will prevent hotspotting.
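For reference, a minimal sketch of the iterative workaround Bryan proposes
above, using the existing two-way Admin.split(TableName, byte[]) to impose the
source cluster's boundaries on an existing (non-empty) target table; retries,
readiness checks, and error handling are elided:

import java.util.List;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;

public final class IterativePreSplit {
  public static void applySourceBoundaries(Admin admin, TableName table,
      List<byte[]> sortedSplitPoints) throws Exception {
    for (byte[] splitPoint : sortedSplitPoints) {
      // Each call picks the region containing the split point and produces
      // exactly one daughter pair; N boundaries need N calls, or about
      // log2(N) rounds if independent regions are split in parallel.
      admin.split(table, splitPoint);
    }
  }
}

As Andrew and Rushabh note elsewhere in the thread, the catch is that a
daughter cannot be split again until compaction has rewritten its reference
files, which is what can stretch each iteration from seconds to minutes.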
>>>>>>>>
>>>>>>>> Problems with the above approach:
>>>>>>>> 1. Currently we allow pre-splitting only while creating a new
>>>>>>>> table, but in our production environment the table has already
>>>>>>>> been created for other tenants. So we would like to pre-split an
>>>>>>>> existing table for new tenants.
>>>>>>>> 2. Currently we split a given region into just 2 daughter
>>>>>>>> regions. But if we have the split point information from the
>>>>>>>> source cluster, and the data for the to-be-migrated tenant is
>>>>>>>> split across 100 regions on the source side, we would ideally
>>>>>>>> like to create 100 empty regions on the target cluster.
>>>>>>>>
>>>>>>>> Trying to get early feedback from the community. Do you all
>>>>>>>> think this is a good idea? Open to other suggestions also.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Rushabh.
>>>
>>> --
>>> Best regards,
>>> Andrew
>>>
>>> Unrest, ignorance distilled, nihilistic imbeciles -
>>> It's what we've earned
>>> Welcome, apocalypse, what's taken you so long?
>>> Bring us the fitting end that we've been counting on
>>>    - A23, Welcome, Apocalypse