I'm new to HBase and to Phoenix.
I needed to build a GUI off of a huge data set from HDFS, so I decided to
create a couple of Phoenix tables, dump the data using the CSV bulk load tool,
and serve the GUI from there.
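For context, the load step uses Phoenix's MapReduce CSV bulk load tool; my invocation is roughly the following (jar name, table name, and input path are placeholders, not my actual values):

```shell
# Sketch of the Phoenix CSV bulk load invocation; names/paths are placeholders.
hadoop jar phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table MY_TABLE \
    --input /path/to/data.csv
```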
This all 'works', but as the data set grows, I would like to improve my table
design.
Currently, I have a table (spread across 6 region servers) with about 8 billion
rows and about a dozen columns.
The primary key is a combination of a customer code (4 letters) and a serial
number (8-12 digits).
So I pre-split the table with the idea of creating 2-3 regions per starting letter:
SPLIT ON ('AM', 'AZ', 'BK', 'BZ', 'CE', 'CM', 'CZ', 'DK', 'DZ', 'EK', 'EZ',
'FK', 'FZ', 'GK', 'GZ', 'HK', 'HZ', 'IK', 'IZ', 'JK', 'JZ', 'KK', 'KZ', 'LF',
'LZ', 'MK', 'MZ', 'NK', 'NZ', 'OK', 'OZ', 'PK', 'PZ', 'RK', 'RZ', 'SK', 'SZ',
'TK', 'TZ', 'UK', 'UZ', 'VK', 'VZ', 'WK', 'WW', 'WZ');
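For completeness, the DDL looks roughly like this (table and non-key column names are placeholders; only the key structure and the SPLIT ON list above reflect my actual table):

```sql
-- Sketch only: non-key columns are placeholders standing in for my ~dozen columns.
CREATE TABLE MY_TABLE (
    CUSTOMER_CODE CHAR(4) NOT NULL,      -- 4-letter customer code
    SERIAL_NO     VARCHAR(12) NOT NULL,  -- 8-12 digit serial number
    COL1          VARCHAR,
    COL2          VARCHAR,
    -- ... remaining columns omitted
    CONSTRAINT pk PRIMARY KEY (CUSTOMER_CODE, SERIAL_NO)
)
SPLIT ON ('AM', 'AZ', 'BK', /* ... full list as above ... */ 'WZ');
```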
This performs somewhat better than if I just create the table and give no
guidance to Phoenix.
But I'm wondering if I could do better. Key-based queries are very fast, but
data ingest is surprisingly slow. Ingesting 1 billion rows takes on the order
of hours.
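By "key-based queries" I mean point lookups on the full composite key, along the lines of (placeholder table name and values):

```sql
-- Hypothetical point lookup on the composite primary key; names and values
-- are placeholders. Queries of this shape are the fast case for me.
SELECT *
FROM MY_TABLE
WHERE CUSTOMER_CODE = 'ABCD'
  AND SERIAL_NO = '123456789012';
```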
When I look at the stats, this table has a fairly skewed distribution of data
across the 6 region servers: something like 15, 13, 13, 3, 3, and 2 regions
per server.
Can anyone give me some guidance on how to improve this design?
Really, any suggestions at this point would be much appreciated, as I'm just
getting started.