> To add to that, the split will be done on the master,

It's done locally, not on the master: the LoadIncrementalHFiles tool will
split an HFile locally if it finds that the file crosses two or more
regions.
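
For reference, here is a minimal sketch of driving that tool from the
Java client API (HBase 1.x; in 2.x the class also lives in
org.apache.hadoop.hbase.tool). The HFile directory and table name are
placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

    public class BulkLoadDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("mytable");  // placeholder
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(name);
             RegionLocator locator = conn.getRegionLocator(name)) {
          // Any HFile whose key range crosses a region boundary is
          // split locally, in this client JVM, before the pieces are
          // handed to the region servers.
          new LoadIncrementalHFiles(conf)
              .doBulkLoad(new Path("/tmp/hfiles"), admin, table, locator);
        }
      }
    }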

On Fri, Jul 19, 2019 at 1:27 AM Jean-Marc Spaggiari <[email protected]>
wrote:

> +1 to that last statement. (I think the split is done locally where you
> run the command, not on the master, but I could be wrong.) It means that
> if you have a single giant file and 200 regions, it will require a lot of
> non-distributed work...
>
> On Thu, Jul 18, 2019 at 1:03 PM, Austin Heyne <[email protected]> wrote:
>
> > To add to that, the split will be done on the master, so if you
> > anticipate a lot of splits it can be an issue.
> >
> > -Austin
> >
> > On 7/18/19 12:32 PM, Jean-Marc Spaggiari wrote:
> > > One thing to add: when you bulkload your files, they will, if
> > > needed, be split according to the region boundaries.
> > >
> > > Because between when you start your job and when you push your
> > > files there might have been some "natural" splits on the table
> > > side, the bulkloader has to be able to re-split your generated
> > > data.
> > >
> > > JMS
> > >
> > > On Thu, Jul 18, 2019 at 9:55 AM, OpenInx <[email protected]> wrote:
> > >
> > >> Austin is right. Pre-splitting mainly matters when you generate
> > >> and load HFiles: during the bulkload, each generated HFile is
> > >> loaded into the region whose key range covers the rowkey interval
> > >> of that HFile. Without pre-splitting, all the HFiles end up in a
> > >> single region, the bulkload becomes time-consuming, and that
> > >> single region easily becomes a hotspot once queries come in.
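
To illustrate the pre-splitting itself, here is a minimal sketch using
the HBase 2.x Admin API (the table name, column family, and split
points below are made up; in practice, pick split points from the
rowkey distribution of the data you are about to bulkload):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplitDemo {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(
                 HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
          // Three split points give four regions instead of one.
          byte[][] splits = {
              Bytes.toBytes("g"), Bytes.toBytes("m"), Bytes.toBytes("t")
          };
          TableDescriptor desc = TableDescriptorBuilder
              .newBuilder(TableName.valueOf("mytable"))
              .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
              .build();
          admin.createTable(desc, splits);
        }
      }
    }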
> > >>
> > >> About the demo, you can see here:
> > >> [1]. https://hbase.apache.org/book.html#arch.bulk.load
> > >> [2]. http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/
> > >>
> > >> Thanks.
> > >>
> > >> On Thu, Jul 18, 2019 at 9:21 PM Austin Heyne <[email protected]> wrote:
> > >>
> > >>> Bulk importing requires the table the data is being bulk
> > >>> imported into to already exist. This is because the mapreduce
> > >>> job needs to extract the region start/end keys in order to drive
> > >>> the reducers. This means that you need to create your table
> > >>> beforehand, providing the appropriate pre-splitting, and then run
> > >>> your bulk ingest and bulk load to get the data into the table. If
> > >>> you were not to pre-split your table, you would end up with one
> > >>> reducer in your bulk ingest job. This also means that your bulk
> > >>> ingest cluster will need to be able to communicate with your
> > >>> HBase instance.
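
The piece that ties the reducers to the region boundaries is the job
setup; a minimal sketch (table name and job name are placeholders, and
the table must already exist with its splits):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.mapreduce.Job;

    public class IngestJobSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("mytable");  // must exist
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(name);
             RegionLocator locator = conn.getRegionLocator(name)) {
          Job job = Job.getInstance(conf, "bulk-ingest");
          // Reads the region start/end keys from the table and sets up
          // a TotalOrderPartitioner, so there is one reducer per
          // region; with an unsplit table that is exactly one reducer.
          HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
        }
      }
    }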
> > >>>
> > >>> -Austin
> > >>>
> > >>> On 7/18/19 4:39 AM, Michael wrote:
> > >>>> Hi,
> > >>>>
> > >>>> I looked at the possibility of bulk importing into HBase, but
> > >>>> somehow I don't get it. I am not able to perform a pre-splitting
> > >>>> of the data, so does bulk importing work without pre-splitting?
> > >>>> As I understand it, instead of putting the data, I create the
> > >>>> HBase region files, but all the tutorials I read mentioned
> > >>>> pre-splitting...
> > >>>>
> > >>>> So, is pre-splitting essential for bulk importing?
> > >>>>
> > >>>> It would be really helpful if someone could point me to a demo
> > >>>> implementation of a bulk import.
> > >>>>
> > >>>> Thanks for helping
> > >>>>    Michael
