+1 to that last statement. (I think the split is done locally where you run
the command, not on the master, but I could be wrong.) That means if you
have a single giant file and 200 regions, it will require a lot of
non-distributed work...
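
For reference, here is a minimal sketch of the client-side load step,
assuming the HFiles were already produced by HFileOutputFormat2; the table
name and output directory are only placeholders, and LoadIncrementalHFiles
lives in org.apache.hadoop.hbase.mapreduce on 1.x versus
org.apache.hadoop.hbase.tool on 2.x. If I read the loader right, an HFile
that crosses a current region boundary gets re-split in this client JVM
before the region servers pick it up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles;

public class BulkLoadStep {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName name = TableName.valueOf("mytable");   // placeholder table name
    Path hfileDir = new Path("/tmp/bulk-output");    // placeholder HFile dir from the ingest job
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin();
         Table table = conn.getTable(name);
         RegionLocator locator = conn.getRegionLocator(name)) {
      // The loader runs in this JVM: it groups the HFiles by region and, when a
      // file spans a region boundary, splits it here before handing it to the servers.
      LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
      loader.doBulkLoad(hfileDir, admin, table, locator);
    }
  }
}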

On Thu, Jul 18, 2019 at 1:03 PM Austin Heyne <[email protected]> wrote:

> To add to that, the split will be done on the master, so if you
> anticipate a lot of splits it can be an issue.
>
> -Austin
>
> On 7/18/19 12:32 PM, Jean-Marc Spaggiari wrote:
> > One thing to add: when you bulk load your files, they will be split
> > according to the region boundaries if needed.
> >
> > Because some "natural" splits might have happened on the table side
> > between when you start your job and when you push your files, the bulk
> > loader has to be able to re-split your generated data.
> >
> > JMS
> >
> > On Thu, Jul 18, 2019 at 9:55 AM OpenInx <[email protected]> wrote:
> >
> >> Austin is right. Pre-splitting mainly matters when generating and loading
> >> the HFiles: when you do the bulk load, each generated HFile is loaded into
> >> the region whose rowkey interval contains the HFile's keys. Without
> >> pre-splitting, all the HFiles will end up in a single region, the bulk
> >> load will be time-consuming, and that region can easily become a hotspot
> >> once queries come in.
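> >>
> >> For example, a pre-split table can be created with something like the
> >> sketch below (2.x descriptor-builder API; the table name, column family
> >> and split points are only placeholders, and the real split points should
> >> come from your rowkey distribution):
> >>
> >> import org.apache.hadoop.hbase.HBaseConfiguration;
> >> import org.apache.hadoop.hbase.TableName;
> >> import org.apache.hadoop.hbase.client.Admin;
> >> import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
> >> import org.apache.hadoop.hbase.client.Connection;
> >> import org.apache.hadoop.hbase.client.ConnectionFactory;
> >> import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
> >> import org.apache.hadoop.hbase.util.Bytes;
> >>
> >> public class CreatePreSplitTable {
> >>   public static void main(String[] args) throws Exception {
> >>     try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
> >>          Admin admin = conn.getAdmin()) {
> >>       // Split points chosen from what you know about the rowkey distribution.
> >>       byte[][] splits = { Bytes.toBytes("d"), Bytes.toBytes("h"),
> >>                           Bytes.toBytes("m"), Bytes.toBytes("s") };
> >>       admin.createTable(
> >>           TableDescriptorBuilder.newBuilder(TableName.valueOf("mytable"))
> >>               .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
> >>               .build(),
> >>           splits);
> >>     }
> >>   }
> >> }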
> >>
> >> About the demo, you can see here:
> >> [1]. https://hbase.apache.org/book.html#arch.bulk.load
> >> [2]. http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/
> >>
> >> Thanks.
> >>
> >> On Thu, Jul 18, 2019 at 9:21 PM Austin Heyne <[email protected]> wrote:
> >>
> >>> Bulk importing requires that the table the data is being bulk imported
> >>> into already exists. This is because the MapReduce job needs to extract
> >>> the region start/end keys in order to drive the reducers. This means
> >>> that you need to create your table beforehand with the appropriate
> >>> pre-splits, then run your bulk ingest and bulk load to get the data into
> >>> the table. If you don't pre-split your table, you will end up with a
> >>> single reducer in your bulk ingest job. This also means that your bulk
> >>> ingest cluster needs to be able to communicate with your HBase instance.
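> >>>
> >>> A minimal sketch of that job wiring, assuming simple "rowkey,value" text
> >>> input and a placeholder table 'mytable' with family 'cf'. The call to
> >>> HFileOutputFormat2.configureIncrementalLoad is what reads the region
> >>> start keys from the live table and sets up one reducer per region via a
> >>> TotalOrderPartitioner:
> >>>
> >>> import java.io.IOException;
> >>> import org.apache.hadoop.conf.Configuration;
> >>> import org.apache.hadoop.fs.Path;
> >>> import org.apache.hadoop.hbase.HBaseConfiguration;
> >>> import org.apache.hadoop.hbase.TableName;
> >>> import org.apache.hadoop.hbase.client.Connection;
> >>> import org.apache.hadoop.hbase.client.ConnectionFactory;
> >>> import org.apache.hadoop.hbase.client.Put;
> >>> import org.apache.hadoop.hbase.client.RegionLocator;
> >>> import org.apache.hadoop.hbase.client.Table;
> >>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> >>> import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
> >>> import org.apache.hadoop.hbase.util.Bytes;
> >>> import org.apache.hadoop.io.LongWritable;
> >>> import org.apache.hadoop.io.Text;
> >>> import org.apache.hadoop.mapreduce.Job;
> >>> import org.apache.hadoop.mapreduce.Mapper;
> >>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> >>> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> >>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> >>>
> >>> public class BulkIngestJob {
> >>>
> >>>   // Illustrative mapper: turns "rowkey,value" lines into Puts keyed by rowkey.
> >>>   public static class CsvToPutMapper
> >>>       extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
> >>>     @Override
> >>>     protected void map(LongWritable key, Text line, Context ctx)
> >>>         throws IOException, InterruptedException {
> >>>       String[] parts = line.toString().split(",", 2);
> >>>       byte[] row = Bytes.toBytes(parts[0]);
> >>>       Put put = new Put(row);
> >>>       put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
> >>>       ctx.write(new ImmutableBytesWritable(row), put);
> >>>     }
> >>>   }
> >>>
> >>>   public static void main(String[] args) throws Exception {
> >>>     Configuration conf = HBaseConfiguration.create();
> >>>     Job job = Job.getInstance(conf, "bulk-ingest");
> >>>     job.setJarByClass(BulkIngestJob.class);
> >>>     job.setMapperClass(CsvToPutMapper.class);
> >>>     job.setMapOutputKeyClass(ImmutableBytesWritable.class);
> >>>     job.setMapOutputValueClass(Put.class);
> >>>     job.setInputFormatClass(TextInputFormat.class);
> >>>     FileInputFormat.addInputPath(job, new Path(args[0]));
> >>>     FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HFile output dir
> >>>
> >>>     TableName name = TableName.valueOf("mytable");
> >>>     try (Connection conn = ConnectionFactory.createConnection(conf);
> >>>          Table table = conn.getTable(name);
> >>>          RegionLocator locator = conn.getRegionLocator(name)) {
> >>>       // Reads the table's region start keys and configures the partitioner and
> >>>       // reducer so each reducer writes HFiles for exactly one region.
> >>>       HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
> >>>     }
> >>>     System.exit(job.waitForCompletion(true) ? 0 : 1);
> >>>   }
> >>> }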
> >>>
> >>> -Austin
> >>>
> >>> On 7/18/19 4:39 AM, Michael wrote:
> >>>> Hi,
> >>>>
> >>>> I looked at the possibility of bulk importing into HBase, but somehow I
> >>>> don't get it. I am not able to pre-split the data, so does bulk
> >>>> importing work without pre-splitting?
> >>>> As I understand it, instead of putting the data, I create the HBase
> >>>> region files myself, but all the tutorials I read mentioned
> >>>> pre-splitting...
> >>>>
> >>>> So, is presplitting essential for bulk importing?
> >>>>
> >>>> It would be really helpful if someone could point me to a demo
> >>>> implementation of a bulk import.
> >>>>
> >>>> Thanks for helping
> >>>>    Michael
> >>>>
> >>>>
>
