HBase is open source. You can check out the source code and read it for yourself:
$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/hbase/branches/0.94
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 1525061

On Fri, Sep 20, 2013 at 6:46 PM, James Birchfield <jbirchfi...@stumbleupon.com> wrote:

Ted,

My apologies if I am being thick, but I am looking at the API docs here: http://hbase.apache.org/apidocs/index.html and I do not see that package. And the coprocessor package only contains an exception.

Ok, weird. Those classes do not show up through normal navigation from that link; however, the documentation does exist if I google for it directly. Maybe the javadocs need to be regenerated? Dunno, but I will check it out.

Birch

On Sep 20, 2013, at 6:32 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Please take a look at the javadoc for src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java.

As long as the machine can reach your HBase cluster, you should be able to run AggregationClient and utilize the AggregateImplementation endpoint in the region servers.

Cheers

On Fri, Sep 20, 2013 at 6:26 PM, James Birchfield <jbirchfi...@stumbleupon.com> wrote:

Thanks Ted.

That was the direction I have been working towards as I have been learning today. Much appreciation for all the replies to this thread.

Whether I keep the MapReduce job or utilize the Aggregation coprocessor (which, it is turning out, should be possible for me here), I need to make sure I am running the client in an efficient manner. Lars may have hit upon the core problem: I am not running the MapReduce job on the cluster, but rather from a standalone remote Java client executing the job in process. This may very well turn out to be the number one issue, and I would love it if it turns out to be true. It would make this a great learning lesson for me as a relative newcomer to HBase, and potentially allow me to finish this initial task much quicker than I was thinking.

So assuming the MapReduce jobs need to be run on the cluster instead of locally, does a coprocessor endpoint client need to be run the same way, or is it safe to run it on a remote machine, since the work gets distributed out to the region servers? Just wondering if I would run into the same issues if what I said above holds true.

Thanks!
Birch

On Sep 20, 2013, at 6:17 PM, Ted Yu <yuzhih...@gmail.com> wrote:

In 0.94, we have AggregateImplementation, an endpoint coprocessor, which implements getRowNum().

Example is in AggregationClient.java.

Cheers
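For illustration, a minimal sketch of a row count through that endpoint against 0.94 might look like the following. The table name and column family are placeholders, and it assumes AggregateImplementation has already been registered on the region servers (for example via hbase.coprocessor.region.classes); none of these specifics come from the thread itself.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
    import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EndpointRowCount {
        public static void main(String[] args) throws Throwable {
            Configuration conf = HBaseConfiguration.create();
            AggregationClient aggregationClient = new AggregationClient(conf);
            Scan scan = new Scan();
            // Limit the scan to a single family so each region touches as
            // little data as possible; "cf" is a placeholder family name.
            scan.addFamily(Bytes.toBytes("cf"));
            // rowCount() fans out to the AggregateImplementation endpoint in
            // every region and sums the partial counts on the client.
            long rowCount = aggregationClient.rowCount(
                    Bytes.toBytes("myTable"), new LongColumnInterpreter(), scan);
            System.out.println("rows: " + rowCount);
        }
    }

Since the counting runs inside the region servers and only per-region totals travel back, no row data crosses the wire, unlike the MapReduce approach.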
On Fri, Sep 20, 2013 at 6:09 PM, lars hofhansl <la...@apache.org> wrote:

From your numbers below you have about 26k regions, so each region is about 545 TB / 26k = 20 GB. Good.

How many mappers are you running? And just to rule out the obvious: the M/R is running on the cluster and not locally, right? (It will default to a local runner when it cannot use the M/R cluster.)

Some back-of-the-envelope calculations tell me that, assuming 1GbE network cards, the best you can expect for 110 machines to map through this data is about 10h, so way faster than what you see: 545 TB / (110 × 1/8 GB/s) ≈ 40,000 s ≈ 11 h.

We should really add a rowcounting coprocessor to HBase and allow using it via M/R.

-- Lars

________________________________
From: James Birchfield <jbirchfi...@stumbleupon.com>
To: user@hbase.apache.org
Sent: Friday, September 20, 2013 5:09 PM
Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help

I did not implement accurate timing, but the current table being counted has been running for about 10 hours, and the log is estimating the map portion at 10%:

2013-09-20 23:40:24,099 INFO [main] Job: map 10% reduce 0%

So a loooong time. Like I mentioned, we have billions, if not trillions, of rows potentially.

Thanks for the feedback on the approaches I mentioned. I was not sure if they would have any effect overall.

I will look further into coprocessors.

Thanks!
Birch

On Sep 20, 2013, at 4:58 PM, Vladimir Rodionov <vrodio...@carrieriq.com> wrote:

How long does the RowCounter job for your largest table take to finish on your cluster? Just curious.

On your options:

1. Probably not worth it - you may overload your cluster.
2. Not sure this one differs from 1. Looks the same to me, but more complex.
3. The same as 1 and 2.

Counting rows can be done efficiently if you sacrifice some accuracy:

http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html

Yeah, you will need coprocessors for that.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodio...@carrieriq.com
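To illustrate the trade-accuracy-for-speed idea behind that link: each region server (or mapper) could fold the row keys it sees into a HyperLogLog sketch and ship only the tiny sketch back for merging. Because row keys are unique, the distinct-count estimate approximates the row count. A rough sketch using the stream-lib library follows; the library choice, register size, and key format are assumptions for illustration, not anything proposed in the thread.

    import com.clearspring.analytics.stream.cardinality.HyperLogLog;

    public class ApproximateRowCount {
        public static void main(String[] args) {
            // 2^14 registers gives roughly 1% relative error while using
            // only a few kilobytes of memory.
            HyperLogLog hll = new HyperLogLog(14);

            // In a real job, each row key observed by the scan would be
            // offered; this loop just simulates a million distinct keys.
            for (long i = 0; i < 1000000; i++) {
                hll.offer("row-" + i);
            }

            // Partial sketches from different regions could be combined
            // with HyperLogLog.merge(...) before reading the estimate.
            System.out.println("estimated rows: " + hll.cardinality());
        }
    }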
________________________________________
From: James Birchfield [jbirchfi...@stumbleupon.com]
Sent: Friday, September 20, 2013 3:50 PM
To: user@hbase.apache.org
Subject: Re: HBase Table Row Count Optimization - A Solicitation For Help

Hadoop 2.0.0-cdh4.3.1
HBase 0.94.6-cdh4.3.1
110 servers, 0 dead, 238.2364 average load

Some other info, not sure if it helps or not:

Configured Capacity: 1295277834158080 (1.15 PB)
Present Capacity: 1224692609430678 (1.09 PB)
DFS Remaining: 624376503857152 (567.87 TB)
DFS Used: 600316105573526 (545.98 TB)
DFS Used%: 49.02%
Under replicated blocks: 0
Blocks with corrupt replicas: 1
Missing blocks: 0

It is hitting a production cluster, but I am not really sure how to calculate the load placed on the cluster.

On Sep 20, 2013, at 3:19 PM, Ted Yu <yuzhih...@gmail.com> wrote:

How many nodes do you have in your cluster?

When counting rows, what other load would be placed on the cluster?

What is the HBase version you're currently using / planning to use?

Thanks

On Fri, Sep 20, 2013 at 2:47 PM, James Birchfield <jbirchfi...@stumbleupon.com> wrote:

After reading the documentation and scouring the mailing list archives, I understand there is no real support for fast row counting in HBase unless you build some sort of tracking logic into your code. In our case, we do not have such logic, and we have massive amounts of data already persisted. I am running into the issue of very long execution of the RowCounter MapReduce job against very large tables (multi-billion rows for many of them, by our estimate). I understand why this issue exists and am slowly accepting it, but I am hoping I can solicit some ideas to help speed things up a little.

My current task is to provide total row counts on about 600 tables, some extremely large, some not so much. Currently, I have a process that executes the MapReduce job in process like so:

    Job job = RowCounter.createSubmittableJob(
            ConfigManager.getConfiguration(), new String[]{tableName});
    boolean waitForCompletion = job.waitForCompletion(true);
    Counters counters = job.getCounters();
    Counter rowCounter = counters.findCounter(hbaseadminconnection.Counters.ROWS);
    return rowCounter.getValue();

At the moment, each MapReduce job is executed in serial order, counting one table at a time. For the current implementation of this whole process, my rough timing calculations indicate that fully counting all the rows of these 600 tables will take anywhere between 11 and 22 days. This is not what I consider a desirable timeframe.

I have considered three alternative approaches to speed things up.

First, since the application is not heavily CPU bound, I could use a ThreadPool and execute multiple MapReduce jobs at the same time, looking at different tables (see the sketch following this message). I have never done this, so I am unsure if it would cause any unanticipated side effects.

Second, I could distribute the processes. I could find as many machines as can successfully talk to the desired cluster, give each a subset of tables to work on, and then combine the results post process.

Third, I could combine both of the above approaches and run a distributed set of multithreaded processes to execute the MapReduce jobs in parallel.

Although it seems to have been asked and answered many times, I will ask once again: without the need to change our current configurations or restart the clusters, is there a faster approach to obtaining row counts? FYI, my cache size for the Scan is set to 1000. I have experimented with different numbers, but nothing made a noticeable difference. Any advice or feedback would be greatly appreciated!

Thanks,
Birch
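As a sketch of the first approach above (a thread pool submitting several RowCounter jobs concurrently from one client): the pool size, table names, and the counter group string below are assumptions to verify, and, per Lars's point, the client must be configured to submit to the real cluster, otherwise the local runner will serialize everything anyway.

    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.mapreduce.RowCounter;
    import org.apache.hadoop.mapreduce.Job;

    public class ParallelRowCount {
        // Counter group/name written by 0.94's RowCounter mapper; verify
        // this string against the version actually deployed.
        private static final String GROUP =
                "org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters";

        public static void main(String[] args) throws Exception {
            List<String> tables = Arrays.asList("table1", "table2"); // placeholders
            // Size the pool to what the cluster can absorb; each thread
            // submits and waits on its own independent Job instance.
            ExecutorService pool = Executors.newFixedThreadPool(4);
            Map<String, Future<Long>> pending = new LinkedHashMap<String, Future<Long>>();

            for (final String tableName : tables) {
                pending.put(tableName, pool.submit(new Callable<Long>() {
                    public Long call() throws Exception {
                        Configuration conf = HBaseConfiguration.create();
                        Job job = RowCounter.createSubmittableJob(
                                conf, new String[]{tableName});
                        job.waitForCompletion(true);
                        return job.getCounters().findCounter(GROUP, "ROWS").getValue();
                    }
                }));
            }

            for (Map.Entry<String, Future<Long>> e : pending.entrySet()) {
                System.out.println(e.getKey() + ": " + e.getValue().get());
            }
            pool.shutdown();
        }
    }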