Even if you have 100 files, HBase will still need to read them to split them. Each file might contain keys for both of the two regions, so HBase will read 200GB and write 100GB on each side.
Last, I don't think the max file size will have any impact on the BulkLoad side. It's the way you generate your files that matters. Can you have a look at your output folder?

JM

2014-10-03 1:56 GMT-04:00 Serega Sheypak <[email protected]>:

> There are several files generated. I suppose there are 20 files because
> it's a setting for HBase to have 10GB files.
>
> On 03.10.2014 at 1:01, "Jean-Marc Spaggiari" <[email protected]> wrote:
>
> > If it's a single 200GB file, when HBase splits this region, this file
> > will have to be split and re-written into 2 x 100GB files.
> >
> > How is the file generated? You should really think about splitting it
> > first...
> >
> > 2014-10-02 15:49 GMT-04:00 Jerry He <[email protected]>:
> >
> > > The reference files will be rewritten during compaction, which
> > > normally happens right after splits.
> > >
> > > You did not mention if your 200GB of data is one file or many HFiles.
> > >
> > > Jerry
> > >
> > > On Oct 2, 2014 12:26 PM, "Serega Sheypak" <[email protected]> wrote:
> > >
> > > > Sorry, I meant massive IO.
> > > > This table is read-only, so HBase should just place reference
> > > > files. Why would HBase rewrite the files?
> > > >
> > > > 2014-10-02 23:24 GMT+04:00 Serega Sheypak <[email protected]>:
> > > >
> > > > > Hi!
> > > > > http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/
> > > > > says that splitting is just placing a 'reference' file.
> > > > > Why should there be massive splitting?
> > > > >
> > > > > 2014-10-02 23:08 GMT+04:00 Jean-Marc Spaggiari <[email protected]>:
> > > > >
> > > > > > Hi Serega,
> > > > > >
> > > > > > Bulk load just "pushes" the file into an HBase region, so there
> > > > > > should not be any issue. The split, however, might take some
> > > > > > time because HBase will have to split it again and again until
> > > > > > it becomes small enough. So if your max file size is 10GB, it
> > > > > > will split it to 100GB, then 50GB, then 25GB, then 12GB, then
> > > > > > 6GB... Each time, everything will be re-written: a LOT of
> > > > > > wasted IOs.
> > > > > >
> > > > > > So the response is: yes, HBase can handle it, BUT it's not a
> > > > > > good practice. Better to split the table beforehand and
> > > > > > generate the bulk load based on the pre-split regions. It might
> > > > > > also affect the other tables and their performance, because
> > > > > > HBase will have to do massive IOs.
> > > > > >
> > > > > > JM
> > > > > >
> > > > > > 2014-10-02 15:03 GMT-04:00 Serega Sheypak <[email protected]>:
> > > > > >
> > > > > > > Hi, I'm doing an HBase bulk load into an empty table.
> > > > > > > The input data size is 200GB.
> > > > > > > Is it OK to load the data into one default region and then
> > > > > > > wait while HBase splits the 200GB region?
> > > > > > >
> > > > > > > I don't have any SLA for the initial load; I can wait until
> > > > > > > HBase splits the initially loaded files.
> > > > > > > This table is READ only.
> > > > > > >
> > > > > > > The only consideration is not to affect other tables and not
> > > > > > > to cause HBase cluster degradation.
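The repeated-split cost JM describes can be quantified with a small model: each split pass halves every oversized region and rewrites its data, so a 200GB region in a table with a 10GB max region size ends up rewriting roughly 1TB before the splits settle. This is a deliberate simplification (real splits first place reference files and do the rewrite during the follow-up compactions, as Jerry notes), but the arithmetic illustrates the "LOT of wasted IOs" point:

```python
def total_rewritten_gb(initial_gb: float, max_region_gb: float) -> float:
    """GB rewritten while one region is repeatedly split in half until
    every resulting region is at or below max_region_gb.
    Simplified model: each split pass reads and rewrites all of the data."""
    total = 0.0
    region_size = initial_gb
    while region_size > max_region_gb:
        total += initial_gb   # every region at this level gets rewritten
        region_size /= 2.0    # each region splits into two halves
    return total

print(total_rewritten_gb(200, 10))  # 1000.0 GB written (and as much read)
```

Five split passes (200 -> 100 -> 50 -> 25 -> 12.5 -> 6.25) each rewrite the full 200GB, versus zero rewrites if the table is pre-split before the bulk load.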
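The "split the table beforehand" advice amounts to choosing split keys up front and creating the table with them (e.g. HBase shell's `create 'mytable', 'cf', SPLITS => [...]`). The helper below is hypothetical and assumes row keys are uniformly distributed over a fixed-width binary key space; in practice split points should come from sampling the actual data, which is what the bulk-load tooling (HFileOutputFormat2 configured against a pre-split table) relies on:

```python
def even_split_keys(num_regions: int, key_bytes: int = 8) -> list[str]:
    """Return num_regions - 1 evenly spaced split points over a
    key_bytes-wide binary key space, as hex strings suitable for
    pasting into an HBase shell `create ... SPLITS => [...]` call."""
    space = 2 ** (8 * key_bytes)
    step = space // num_regions
    return [(i * step).to_bytes(key_bytes, "big").hex()
            for i in range(1, num_regions)]

# 19 boundaries -> 20 regions of ~10GB each for 200GB of input
print(even_split_keys(20))
```

With the table pre-split this way, the bulk-load output files land directly in their final regions and no post-load split cascade is needed.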
