I'm new to HBase and to Phoenix.
I needed to build a GUI off of a huge data set from HDFS, so I decided to
create a couple of Phoenix tables, dump the data using the CSV bulk load tool,
and serve the GUI from there.
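For context, the load step uses Phoenix's MapReduce CSV bulk load tool; my invocation is roughly the following (jar name, table name, and input path are placeholders, not my actual values):

```shell
# Sketch of the Phoenix CSV bulk load invocation; names/paths are placeholders.
hadoop jar phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table MY_TABLE \
    --input /path/to/data.csv
```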
This all 'works', but as the data set grows, I would like to improve my table
design.
Currently, I have a table (spread across 6 region servers) with about 8 billion
rows and about a dozen columns.
The primary key is a combination of a customer code (4 letters) and a serial
number (8-12 digits).
So I pre-split the table with the idea of creating 2-3 regions per starting letter:
SPLIT ON ('AM', 'AZ', 'BK', 'BZ', 'CE', 'CM', 'CZ', 'DK', 'DZ', 'EK', 'EZ',
'FK', 'FZ', 'GK', 'GZ', 'HK', 'HZ', 'IK', 'IZ', 'JK', 'JZ', 'KK', 'KZ', 'LF',
'LZ', 'MK', 'MZ', 'NK', 'NZ', 'OK', 'OZ', 'PK', 'PZ', 'RK', 'RZ', 'SK', 'SZ',
'TK', 'TZ', 'UK', 'UZ', 'VK', 'VZ', 'WK', 'WW', 'WZ');
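For completeness, the DDL looks roughly like this (table and non-key column names are placeholders; only the key structure and the SPLIT ON list above reflect my actual table):

```sql
-- Sketch only: non-key columns are placeholders standing in for my ~dozen columns.
CREATE TABLE MY_TABLE (
    CUSTOMER_CODE CHAR(4) NOT NULL,      -- 4-letter customer code
    SERIAL_NO     VARCHAR(12) NOT NULL,  -- 8-12 digit serial number
    COL1          VARCHAR,
    COL2          VARCHAR,
    -- ... remaining columns omitted
    CONSTRAINT pk PRIMARY KEY (CUSTOMER_CODE, SERIAL_NO)
)
SPLIT ON ('AM', 'AZ', 'BK', /* ... full list as above ... */ 'WZ');
```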
This performs somewhat better than if I just create the table and give no
guidance to Phoenix.
But I'm wondering if I could do better. Key-based queries are very fast, but
data ingest is surprisingly slow. Ingesting 1 billion rows takes on the order
of hours.
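By "key-based queries" I mean point lookups on the full composite key, along the lines of (placeholder table name and values):

```sql
-- Hypothetical point lookup on the composite primary key; names and values
-- are placeholders. Queries of this shape are the fast case for me.
SELECT *
FROM MY_TABLE
WHERE CUSTOMER_CODE = 'ABCD'
  AND SERIAL_NO = '123456789012';
```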
When I look at the stats, this table has a fairly skewed distribution of data
across the 6 region servers: something like 15, 13, 13, 3, 3, and 2 regions
per server.
Can anyone give me some guidance on how to improve this design?
Really, any suggestions at this point would be much appreciated, as I'm just
getting started.