Hi,

For my project, HBase would come to a halt after about 8 hours. I managed to reduce the load time to 10 minutes.
What gave me the best results was: splitting regions to best fit my data, compacting them manually whenever the tables changed, and using Snappy for compression.

I have data coming from 70 different sources, so depending on the source I added an offset of 10M to the primary key. The primary key data was 11 bits, so 10 million was more than enough headroom: offset 0 for the first source, 10M for the second, 20M for the third, and so on. Then I created 70 regions, one for each data source.

The second task was that when loading data or deleting old data, I run compaction manually, as I mentioned. This way, I control when compaction is done.

Hope this helps.

Behdad

On Wed, Jul 8, 2015 at 12:30 PM, Dejan Menges <dejan.men...@gmail.com> wrote:

> Hi Behdad,
>
> Thanks a lot, but this part I do already. My question was more about which
> metrics (exposed or not) to use to figure out most intelligently where
> major compaction is needed the most.
>
> Currently, choosing the region which has the biggest number of store files
> plus the biggest total store file size is doing the job, but I wasn't sure
> if there's maybe something better to choose from.
>
> Cheers,
> Dejan
>
> On Wed, Jul 8, 2015 at 7:19 PM Behdad Forghani <beh...@exapackets.com>
> wrote:
>
> > To start a major compaction of tablename from the CLI, you need to run:
> >
> > echo "major_compact 'tablename'" | hbase shell
> >
> > I do this after bulk loading into the table.
> >
> > FYI, to avoid surprises, I also turn off the load balancer and rebalance
> > regions manually.
> >
> > The CLI command to turn off the balancer is:
> >
> > echo balance_switch false | hbase shell
> >
> > To rebalance regions after a bulk load or other changes, run:
> >
> > echo balancer | hbase shell
> >
> > You can run these two commands over ssh. I use Ansible to do these.
> > Assuming you have defined hbase_master in your hosts file, you can run:
> >
> > ansible -i hosts hbase_master -m shell -a "echo \"major_compact 'tablename'\" | hbase shell"
> >
> > Behdad Forghani
> >
> > On Wed, Jul 8, 2015 at 8:03 AM, Dejan Menges <dejan.men...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > What's the best way to automate major compactions without just enabling
> > > them during the off-peak period?
> > >
> > > What I was testing is a simple script which runs on every node in the
> > > cluster, checks if a major compaction is already running on that node,
> > > and if not, picks one region and runs compaction on that one region.
> > >
> > > It's been running for some time and it has helped us get our data into
> > > much better shape, but now I'm not quite sure anymore how to choose
> > > which region to compact. So far I was reading rs-status#regionStoreStats
> > > for that node, first choosing the regions with the biggest number of
> > > store files, and then those with the biggest store file sizes.
> > >
> > > Is there maybe something more intelligent I could/should do?
> > >
> > > Thanks a lot!
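[Editor's note: the selection heuristic discussed in this thread (most store files first, largest total store file size as the tie-breaker) can be sketched as a small shell filter. The input format, the region names, and the pick_region helper below are illustrative assumptions, not part of the thread; the counts would come from wherever you scrape per-region stats, e.g. the rs-status#regionStoreStats page mentioned above.]

```shell
# pick_region: choose the next region to major-compact.
# Input lines: <region-name> <storefile-count> <storefile-size-bytes>
# Sorts by storefile count, then by total storefile size, both descending,
# and prints the name of the top region.
pick_region() {
  sort -k2,2nr -k3,3nr | head -n 1 | awk '{print $1}'
}

# Usage (region names and numbers are made up for illustration):
printf '%s\n' \
  'region-aaa 4 1048576' \
  'region-bbb 9 524288' \
  'region-ccc 9 2097152' \
| pick_region
# prints: region-ccc
# (region-bbb and region-ccc tie on storefile count, so the larger
# total storefile size wins)
```

The winner could then be fed to a manual major compaction, in the spirit of the commands above, e.g. `echo "major_compact 'region-ccc'" | hbase shell`.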