Even if you have 100 files, HBase will still need to read them to split them. Each file might contain keys for both of the two regions, so HBase will read 200GB and write 100GB on each side.
Last, I don't think the max file size will have any impact on the BulkLoad side. It's the way you generate your files that matters. Can you have a look at your output folder?

JM

2014-10-03 1:56 GMT-04:00 Serega Sheypak <[email protected]>:

> There are several files generated. I suppose there are 20 files because
> it's a setting for HBase to have 10GB files.
>
> On 03.10.2014 at 1:01, "Jean-Marc Spaggiari" <[email protected]> wrote:
>
> > If it's a single 200GB file, when HBase splits this region, this file
> > will have to be split and re-written into 2 x 100GB files.
> >
> > How is the file generated? You should really think about splitting it
> > first...
> >
> > 2014-10-02 15:49 GMT-04:00 Jerry He <[email protected]>:
> >
> > > The reference files will be rewritten during compaction, which
> > > normally happens right after splits.
> > >
> > > You did not mention if your 200GB of data is one file or many HFiles.
> > >
> > > Jerry
> > >
> > > On Oct 2, 2014 12:26 PM, "Serega Sheypak" <[email protected]> wrote:
> > >
> > > > Sorry, I meant massive IO.
> > > > This table is read-only, so HBase should just place reference
> > > > files. Why would HBase rewrite the files?
> > > >
> > > > 2014-10-02 23:24 GMT+04:00 Serega Sheypak <[email protected]>:
> > > >
> > > > > Hi!
> > > > > http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
> > > > > says that splitting is just placing a 'reference' file.
> > > > > Why should there be massive splitting?
> > > > >
> > > > > 2014-10-02 23:08 GMT+04:00 Jean-Marc Spaggiari <[email protected]>:
> > > > >
> > > > > > Hi Serega,
> > > > > >
> > > > > > Bulk load just "pushes" the file into an HBase region, so there
> > > > > > should not be any issue. The split, however, might take some
> > > > > > time because HBase will have to split it again and again until
> > > > > > it becomes small enough. So if your max file size is 10GB, it
> > > > > > will split it to 100GB, then 50GB, then 25GB, then 12GB, then
> > > > > > 6GB... Each time, everything will be re-written: a LOT of
> > > > > > wasted IOs.
> > > > > >
> > > > > > So the response is: yes, HBase can handle it, BUT it's not a
> > > > > > good practice. Better to split the table beforehand and
> > > > > > generate the bulk load based on the pre-split regions. It might
> > > > > > also affect the other tables and their performance, because
> > > > > > HBase will have to do massive IOs.
> > > > > >
> > > > > > JM
> > > > > >
> > > > > > 2014-10-02 15:03 GMT-04:00 Serega Sheypak <[email protected]>:
> > > > > >
> > > > > > > Hi, I'm doing an HBase bulk load into an empty table.
> > > > > > > The input data size is 200GB.
> > > > > > > Is it OK to load the data into one default region and then
> > > > > > > wait while HBase splits the 200GB region?
> > > > > > >
> > > > > > > I don't have any SLA for the initial load; I can wait until
> > > > > > > HBase splits the initially loaded files.
> > > > > > > This table is READ only.
> > > > > > >
> > > > > > > The only consideration is not to affect other tables and not
> > > > > > > to cause HBase cluster degradation.
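The repeated-split cost JM describes can be quantified with a small model: each split pass halves every oversized region and rewrites its data, so a 200GB region in a table with a 10GB max region size ends up rewriting roughly 1TB before the splits settle. This is a deliberate simplification (real splits first place reference files and do the rewrite during the follow-up compactions, as Jerry notes), but the arithmetic illustrates the "LOT of wasted IOs" point:

```python
def total_rewritten_gb(initial_gb: float, max_region_gb: float) -> float:
    """GB rewritten while one region is repeatedly split in half until
    every resulting region is at or below max_region_gb.
    Simplified model: each split pass reads and rewrites all of the data."""
    total = 0.0
    region_size = initial_gb
    while region_size > max_region_gb:
        total += initial_gb   # every region at this level gets rewritten
        region_size /= 2.0    # each region splits into two halves
    return total

print(total_rewritten_gb(200, 10))  # 1000.0 GB written (and as much read)
```

Five split passes (200 -> 100 -> 50 -> 25 -> 12.5 -> 6.25) each rewrite the full 200GB, versus zero rewrites if the table is pre-split before the bulk load.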
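The "split the table beforehand" advice amounts to choosing split keys up front and creating the table with them (e.g. HBase shell's `create 'mytable', 'cf', SPLITS => [...]`). The helper below is hypothetical and assumes row keys are uniformly distributed over a fixed-width binary key space; in practice split points should come from sampling the actual data, which is what the bulk-load tooling (HFileOutputFormat2 configured against a pre-split table) relies on:

```python
def even_split_keys(num_regions: int, key_bytes: int = 8) -> list[str]:
    """Return num_regions - 1 evenly spaced split points over a
    key_bytes-wide binary key space, as hex strings suitable for
    pasting into an HBase shell `create ... SPLITS => [...]` call."""
    space = 2 ** (8 * key_bytes)
    step = space // num_regions
    return [(i * step).to_bytes(key_bytes, "big").hex()
            for i in range(1, num_regions)]

# 19 boundaries -> 20 regions of ~10GB each for 200GB of input
print(even_split_keys(20))
```

With the table pre-split this way, the bulk-load output files land directly in their final regions and no post-load split cascade is needed.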
