Hi Zack,
A good place to start is our website: http://phoenix.apache.org. Take
a look at the Features menu and you'll find Bulk Loading:
http://phoenix.apache.org/bulk_dataload.html
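For reference, the MapReduce bulk loader is kicked off from the command
line roughly like this (the table name, input path, jar version, and
ZooKeeper quorum below are placeholders; the page above has the
authoritative options):

```shell
# Run Phoenix's MapReduce CSV bulk load tool (placeholder names throughout).
# The --input argument is a CSV file, or a directory of CSV files, in HDFS.
hadoop jar phoenix-<version>-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table EXAMPLE_TABLE \
    --input /hdfs/path/to/input.csv \
    --zookeeper zk-host:2181
```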
Thanks,
James

On Mon, Jun 15, 2015 at 4:20 AM, Riesland, Zack
<[email protected]> wrote:
> MR Bulkload tool sounds promising.
>
> Is there a link that provides some instructions?
>
> Does it take a HDFS folder as input? Or a Hive table?
>
> Thanks!
>
> From: Puneet Kumar Ojha [mailto:[email protected]]
> Sent: Monday, June 15, 2015 7:10 AM
> To: [email protected]
> Subject: RE: Guidance on table splitting
>
> Can you provide the Queries which you would be running on your table?
>
> Also use the MR Bulkload instead of using the CSV load tool.
>
> From: Riesland, Zack [mailto:[email protected]]
> Sent: Monday, June 15, 2015 4:03 PM
> To: [email protected]
> Subject: Guidance on table splitting
>
> I’m new to HBase and to Phoenix.
>
> I needed to build a GUI on top of a huge data set in HDFS, so I decided to
> create a couple of Phoenix tables, load the data using the CSV bulk load
> tool, and serve the GUI from there.
>
> This all ‘works’, but as the data set grows, I would like to improve my
> table design.
>
> Currently, I have a table (6 region servers) with about 8 billion rows and
> about a dozen columns.
>
> The primary key is a combination of customer code (4 letters) and serial
> number (8-12 digits).
>
> So I split the table with the idea of creating 2-3 regions per starting
> letter:
>
> SPLIT ON ('AM', 'AZ', 'BK', 'BZ', 'CE', 'CM', 'CZ', 'DK', 'DZ', 'EK', 'EZ',
> 'FK', 'FZ', 'GK', 'GZ', 'HK', 'HZ', 'IK', 'IZ', 'JK', 'JZ', 'KK', 'KZ',
> 'LF', 'LZ', 'MK', 'MZ', 'NK', 'NZ', 'OK', 'OZ', 'PK', 'PZ', 'RK', 'RZ',
> 'SK', 'SZ', 'TK', 'TZ', 'UK', 'UZ', 'VK', 'VZ', 'WK', 'WW', 'WZ');
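(As an aside: rather than hand-picking alphabet boundaries, split points
like these can be derived from a sample of the actual keys by taking
evenly spaced entries from the sorted sample, so each region gets a
roughly equal share. A rough sketch — the key format, file names, and
counts below are made up for illustration:)

```shell
# Generate a demo sample of "customer code + serial number" style keys;
# in practice you would sample real keys from the input data instead.
awk 'BEGIN { for (i = 1; i <= 470; i++) printf "CUST%08d\n", i }' > sample_keys.txt

SPLITS=46                            # how many split points you want
TOTAL=$(wc -l < sample_keys.txt)
STEP=$(( TOTAL / (SPLITS + 1) ))     # sampled keys per region, roughly

# Every STEP-th key of the sorted sample becomes a split point, putting
# an approximately equal share of the sampled keys in each region.
sort sample_keys.txt \
  | awk -v step="$STEP" 'NR % step == 0' \
  | head -n "$SPLITS" > split_points.txt

wc -l < split_points.txt             # 46 evenly spaced split points
```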
>
> This performs somewhat better than if I just create the table and give no
> guidance to Phoenix.
>
> But I’m wondering if I could do better. Key-based queries are very fast, but
> data ingest is surprisingly slow. Ingesting 1 billion rows takes on the
> order of hours.
>
> When I look at the stats, this table has a fairly skewed distribution of
> data across the 6 region servers: something like 15 regions, 13, 13, 3, 3,
> and 2.
>
> Can anyone give me some guidance on how to improve this design?
>
> Really, any suggestions at this point would be much appreciated, as I’m just
> getting started.