+1 to that last statement. (I think the split is done locally where you run the command, not on the master, but I may be wrong.) That means if you have a single giant file and 200 regions, it will require a lot of non-distributed work...
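For what it's worth, one way to avoid that re-split work is to pre-create the table with split points so that the HFiles you generate already line up with the region boundaries. Here is a minimal sketch, assuming the HBase 2.x Java client API; the table name ('mytable'), column family ('cf') and split keys are made up for illustration, not taken from this thread:

// Sketch only: pre-create the target table with explicit split points so the
// bulk-ingest job produces HFiles that match the region boundaries.
// Table name, family and split keys below are invented examples.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // Split keys chosen to roughly balance the expected rowkey range.
      byte[][] splits = new byte[][] {
          Bytes.toBytes("row-250000"),
          Bytes.toBytes("row-500000"),
          Bytes.toBytes("row-750000")
      };
      admin.createTable(
          TableDescriptorBuilder.newBuilder(TableName.valueOf("mytable"))
              .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
              .build(),
          splits);
    }
  }
}

The HBase shell equivalent is: create 'mytable', 'cf', SPLITS => ['row-250000', 'row-500000', 'row-750000']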
On Thu, Jul 18, 2019 at 1:03 PM, Austin Heyne <[email protected]> wrote:
> To add to that, the split will be done on the master, so if you
> anticipate a lot of splits it can be an issue.
>
> -Austin
>
> On 7/18/19 12:32 PM, Jean-Marc Spaggiari wrote:
> > One thing to add: when you bulkload your files, they will, if needed,
> > be split according to the region boundaries.
> >
> > Because between when you start your job and when you push your files,
> > there might have been some "natural" splits on the table side, so the
> > bulkloader has to be able to re-split your generated data.
> >
> > JMS
> >
> > On Thu, Jul 18, 2019 at 9:55 AM, OpenInx <[email protected]> wrote:
> >
> >> Austin is right. Pre-splitting mainly matters for generating and
> >> loading the HFiles: when you do the bulkload, each generated HFile is
> >> loaded into the region whose rowkey interval contains the HFile's keys.
> >> Without pre-splitting, all HFiles end up in a single region, the
> >> bulkload is time-consuming, and that region easily becomes a hotspot
> >> once queries come in.
> >>
> >> About a demo, you can look here:
> >> [1]. https://hbase.apache.org/book.html#arch.bulk.load
> >> [2]. http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/
> >>
> >> Thanks.
> >>
> >> On Thu, Jul 18, 2019 at 9:21 PM Austin Heyne <[email protected]> wrote:
> >>
> >>> Bulk importing requires that the table the data is being bulk imported
> >>> into already exists. This is because the mapreduce job needs to extract
> >>> the region start/end keys in order to drive the reducers. This means
> >>> that you need to create your table beforehand, providing the
> >>> appropriate pre-splitting, and then run your bulk ingest and bulk load
> >>> to get the data into the table. If you were not to pre-split your table,
> >>> then you would end up with one reducer in your bulk ingest job. This
> >>> also means that your bulk ingest cluster will need to be able to
> >>> communicate with your HBase instance.
> >>>
> >>> -Austin
> >>>
> >>> On 7/18/19 4:39 AM, Michael wrote:
> >>>> Hi,
> >>>>
> >>>> I looked at the possibility of bulk importing into HBase, but somehow
> >>>> I don't get it. I am not able to perform a pre-splitting of the data,
> >>>> so does bulk importing work without pre-splitting?
> >>>> As I understand it, instead of putting the data, I create the HBase
> >>>> region files, but all the tutorials I read mentioned pre-splitting...
> >>>>
> >>>> So, is pre-splitting essential for bulk importing?
> >>>>
> >>>> It would be really helpful if someone could point me to a demo
> >>>> implementation of a bulk import.
> >>>>
> >>>> Thanks for helping
> >>>> Michael
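
To complement the links above, here is a rough sketch of what a bulk-ingest MapReduce driver can look like, built around HFileOutputFormat2.configureIncrementalLoad, which is the call that reads the table's current region boundaries and sets up one reducer per region. It assumes the HBase 2.x MapReduce API; the input layout (tab-separated "rowkey<TAB>value" text files), table name 'mytable', column family 'cf' and qualifier 'q' are invented for illustration:

// Sketch of a bulk-ingest driver; input format, table and column names are
// made-up examples, and the HBase 2.x client/mapreduce API is assumed.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkIngestDriver {

  // Maps "rowkey<TAB>value" lines to Puts keyed by row; the per-region
  // sorting reducer is wired in by configureIncrementalLoad below.
  public static class TsvToPutMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length < 2) {
        return; // skip malformed lines in this sketch
      }
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tableName = TableName.valueOf("mytable");

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName)) {

      Job job = Job.getInstance(conf, "bulk-ingest mytable");
      job.setJarByClass(BulkIngestDriver.class);
      job.setMapperClass(TsvToPutMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // input text on HDFS
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // HFile output dir

      // Reads the table's current region boundaries and configures the
      // partitioner, reducer and output format so the job writes one set
      // of HFiles per region.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }
}

The HFile directory written by the job is then handed to the bulk load step described in [1], e.g. hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles <hdfs-output-dir> mytable (in HBase 1.x the class lives under org.apache.hadoop.hbase.mapreduce instead). That load step is where the re-splitting discussed above happens if regions have split since the job ran.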
