Hi Zack,

A good place to start is our web site: http://phoenix.apache.org. Take a look at the Feature menu and you'll find Bulk Loading: http://phoenix.apache.org/bulk_dataload.html

Thanks,
James
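(For readers following along: the MR bulk load referenced above is Phoenix's CsvBulkLoadTool, which runs as a MapReduce job and reads delimited files from an HDFS path rather than from a Hive table. A rough sketch of an invocation — the jar version, table name, input path, and ZooKeeper quorum below are all placeholders to adapt to your cluster:)

```shell
# Run the Phoenix CSV bulk load as a MapReduce job.
# Jar version, table name, HDFS input path, and ZK quorum are placeholders.
hadoop jar phoenix-<version>-client.jar \
    org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    --table EXAMPLE_TABLE \
    --input /hdfs/path/to/input.csv \
    --zookeeper zk1,zk2,zk3
```

The tool writes HFiles directly and hands them to HBase, which is why it is usually much faster for billions of rows than row-at-a-time upserts through psql.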
On Mon, Jun 15, 2015 at 4:20 AM, Riesland, Zack <[email protected]> wrote:
> MR Bulkload tool sounds promising.
>
> Is there a link that provides some instructions?
>
> Does it take an HDFS folder as input? Or a Hive table?
>
> Thanks!
>
> From: Puneet Kumar Ojha [mailto:[email protected]]
> Sent: Monday, June 15, 2015 7:10 AM
> To: [email protected]
> Subject: RE: Guidance on table splitting
>
> Can you provide the queries which you would be running on your table?
>
> Also, use the MR Bulkload instead of the CSV load tool.
>
> From: Riesland, Zack [mailto:[email protected]]
> Sent: Monday, June 15, 2015 4:03 PM
> To: [email protected]
> Subject: Guidance on table splitting
>
> I’m new to HBase and to Phoenix.
>
> I needed to build a GUI off of a huge data set from HDFS, so I decided to
> create a couple of Phoenix tables, dump the data using the CSV bulk load
> tool, and serve the GUI from there.
>
> This all ‘works’, but as the data set grows, I would like to improve my
> table design.
>
> Currently, I have a table (6 region servers) with about 8 billion rows and
> about a dozen columns.
>
> The primary key is a combination of customer code (4 letters) and serial
> number (8-12 digits).
>
> So I split the table with the idea of creating 2-3 regions per starting
> letter:
>
> SPLIT ON ('AM', 'AZ', 'BK', 'BZ', 'CE', 'CM', 'CZ', 'DK', 'DZ', 'EK', 'EZ',
> 'FK', 'FZ', 'GK', 'GZ', 'HK', 'HZ', 'IK', 'IZ', 'JK', 'JZ', 'KK', 'KZ',
> 'LF', 'LZ', 'MK', 'MZ', 'NK', 'NZ', 'OK', 'OZ', 'PK', 'PZ', 'RK', 'RZ',
> 'SK', 'SZ', 'TK', 'TZ', 'UK', 'UZ', 'VK', 'VZ', 'WK', 'WW', 'WZ');
>
> This performs somewhat better than if I just create the table and give no
> guidance to Phoenix.
>
> But I’m wondering if I could do better. Key-based queries are very fast, but
> data ingest is surprisingly slow. Ingesting 1 billion rows takes on the
> order of hours.
> When I look at the stats, this table has a fairly skewed distribution of
> data across 6 region servers. Something like 15 regions, 13, 13, 3, 3, and
> 2.
>
> Can anyone give me some guidance on how to improve this design?
>
> Really, any suggestions at this point would be much appreciated, as I’m just
> getting started.
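(One way to attack the skew described above, beyond hand-picked two-letter split points, is to sample the actual row keys and choose split points at even quantiles, so each region holds roughly the same share of rows. Phoenix also offers salting via the SALT_BUCKETS table option to spread sequential writes, at the cost of extra work on range scans. A minimal sketch of quantile-based splits — the function name and sample keys are hypothetical:)

```python
def even_split_points(sample_keys, num_regions):
    """Pick split points at even quantiles of a sorted key sample,
    so each region holds roughly the same share of rows."""
    keys = sorted(sample_keys)
    points = []
    for i in range(1, num_regions):
        # Index of the key that begins the i-th of num_regions equal slices.
        idx = (i * len(keys)) // num_regions
        point = keys[idx]
        if point not in points:  # skip duplicates from heavy skew
            points.append(point)
    return points

# Hypothetical skewed sample: many customer codes start with 'A' and 'S'.
sample = ['AAAA', 'AABC', 'ABCD', 'ABZZ', 'ACME',
          'MAAA', 'SAAA', 'SABC', 'SZZZ', 'ZAAA']
print(even_split_points(sample, 5))
# -> ['ABCD', 'ACME', 'SAAA', 'SZZZ']
```

Splits derived from a real key sample tend to balance regions far better than alphabet-spaced guesses when customer codes cluster around a few prefixes.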
