Austin is right. Pre-splitting mainly matters for generating and loading HFiles: during a bulk load, each generated HFile is loaded into the region whose key range covers that HFile's rowkey interval. Without pre-splitting, all HFiles end up in a single region, so the bulk load is time-consuming and that region easily becomes a hotspot once queries come in.
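
A minimal sketch of the pre-splitting step, assuming the HBase 2.x Java client; the table name "my_table", family "cf", and the split points are made-up placeholders and would come from your expected rowkey distribution:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Split points chosen from the expected rowkey distribution; each
      // HFile produced by the bulk-ingest job should fall into one region.
      byte[][] splitKeys = new byte[][] {
          Bytes.toBytes("row-25000"),
          Bytes.toBytes("row-50000"),
          Bytes.toBytes("row-75000")
      };
      admin.createTable(
          TableDescriptorBuilder.newBuilder(TableName.valueOf("my_table"))
              .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
              .build(),
          splitKeys);
    }
  }
}

With the regions created up front, the bulk-ingest job can partition its output so each reducer writes HFiles for exactly one region.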
About the demo, you can see here:

[1]. https://hbase.apache.org/book.html#arch.bulk.load
[2]. http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/

Thanks.

On Thu, Jul 18, 2019 at 9:21 PM Austin Heyne <[email protected]> wrote:

> Bulk importing requires the table the data is being bulk imported into
> to already exist. This is because the mapreduce job needs to extract
> the region start/end keys in order to drive the reducers. This means
> that you need to create your table beforehand, providing the
> appropriate pre-splitting, and then run your bulk ingest and bulk load
> to get the data into the table. If you were to not pre-split your table
> then you would end up with one reducer in your bulk ingest job. This
> also means that your bulk ingest cluster will need to be able to
> communicate with your HBase instance.
>
> -Austin
>
> On 7/18/19 4:39 AM, Michael wrote:
> > Hi,
> >
> > I looked at the possibility of bulk importing into hbase, but somehow I
> > don't get it. I am not able to perform a presplitting of the data, so
> > does bulk importing work without presplitting?
> > As I understand it, instead of putting the data, I create the hbase
> > region files, but all tutorials I read mentioned presplitting...
> >
> > So, is presplitting essential for bulk importing?
> >
> > It would be really helpful, if someone could point me to a demo
> > implementation of a bulk import.
> >
> > Thanks for helping
> > Michael
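
The reducer behavior Austin describes (one reducer per region, driven by the region start/end keys) is what HFileOutputFormat2.configureIncrementalLoad wires up in the bulk-ingest job. Below is a minimal sketch, not a drop-in job: it assumes HBase 2.x, the pre-split table "my_table" from above, a made-up "rowkey,value" CSV input, and input/output paths passed as arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkIngestJob {

  // Hypothetical mapper: turns "rowkey,value" CSV lines into Puts for cf:col.
  public static class CsvToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split(",", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "bulk-ingest-hfiles");
    job.setJarByClass(BulkIngestJob.class);
    job.setMapperClass(CsvToPutMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    TableName table = TableName.valueOf("my_table");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table htable = conn.getTable(table);
         RegionLocator locator = conn.getRegionLocator(table)) {
      // Reads the region start/end keys of the pre-split table and configures
      // a total-order partitioner plus one reducer per region, so each reducer
      // writes HFiles that fall inside exactly one region's key range.
      HFileOutputFormat2.configureIncrementalLoad(job, htable, locator);
    }

    FileInputFormat.addInputPath(job, new Path(args[0]));   // source CSV data
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Once the job completes, the generated HFiles under the output directory are handed to the completebulkload step described in [1], which moves each file into the region covering its key range.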
