Hey,

Yes, all improvements should be done via JIRA.
Thanks,
-ryan

On Sun, May 23, 2010 at 4:32 PM, Michael Segel <[email protected]> wrote:
>
>
> Ryan & JD
>
> I'm aware of the difficulties in trying to maintain an accurate row count.
> It's not trivial, but it's not rocket science either.
>
> There are a couple of ways of doing this, and it will take some time to think
> through the benefits vs. the costs of how you do it.
>
> You're right. It's more difficult than a C-ISAM single-machine database.
> There are tricks one could take.
>
> But I think that this should be taken offline and maybe open a JIRA issue, if
> one doesn't already exist?
>
> -Mike
>
>
>> Date: Sun, 23 May 2010 11:20:17 -0700
>> Subject: Re: RowCounter example run time
>> From: [email protected]
>> To: [email protected]
>>
>> The select count(*) optimization is a classic in databases - some
>> people argue that it's really important and should be optimized for
>> (MyISAM, for example), and others note that it's a trick and real DB
>> loads rarely use it on a sizable table. Note that MyISAM locks the
>> entire table for each update (only one update at a time), so comparing
>> HBase to it is odd. InnoDB doesn't (maintaining global stats without
>> hurting performance can be difficult). Oracle doesn't (but may be able to use
>> a primary index to reduce the blocks read).
>>
>> Implementing this in HBase might be difficult - when a new column is
>> inserted into a table, the regionserver doesn't know whether that row
>> already exists - to know that, it would have to read some data,
>> potentially from disk, first. Any scheme that requires the
>> regionserver to increment a "rowsForRegion" counter during certain inserts
>> would therefore be problematic.
>>
>> As JD noted, the likely cause here is scanner pre-fetch caching. We
>> ship with very conservative scanner pre-fetch values, because if a
>> client takes too long it will get a fatal exception. RowCounter MR
>> jobs shouldn't be like that, however.
>>
>> As for cluster sizing - 6-10 nodes is really the minimum.
>> With 3 nodes you
>> are replicating data to every node, and you aren't getting the benefits
>> of a clustered solution. At higher node counts you get some disjoint
>> parallelism underway, and things really pick up on the larger datasets
>> (I can do MapReduces at 7-8M rows/sec for 20+ minutes on end).
>>
>> -ryan
>>
>>
>> On Sun, May 23, 2010 at 7:58 AM, Edward Capriolo <[email protected]>
>> wrote:
>> > On Sun, May 23, 2010 at 10:36 AM, Michael Segel
>> > <[email protected]> wrote:
>> >
>> >>
>> >> J-D,
>> >>
>> >> Here's the problem: you go to any relational database, do a select
>> >> count(*), and you get a response back fairly quickly.
>> >> The difference is that in HBase you're doing a physical count, while
>> >> with the relational engine you're pulling it from metadata.
>> >>
>> >> I have a couple of ideas on how we could do this...
>> >>
>> >> -Mike
>> >>
>> >> > Date: Sat, 22 May 2010 09:25:51 -0700
>> >> > Subject: Re: RowCounter example run time
>> >> > From: [email protected]
>> >> > To: [email protected]
>> >> >
>> >> > My first question would be, what do you expect exactly? Would 5 min be
>> >> > enough? Or are you expecting something more like 1-2 secs (which is
>> >> > impossible since this is mapreduce)?
>> >> >
>> >> > Then there's also Jon's questions.
>> >> >
>> >> > Finally, did you set a higher scanner caching on that job?
>> >> > hbase.client.scanner.caching is the name of the config, which defaults
>> >> > to 1. When mapping an HBase table, if you don't set it higher you're
>> >> > basically benchmarking the RPC layer, since it does one call per next()
>> >> > invocation. Setting the right value depends on the size of your rows,
>> >> > e.g. are you storing 60 bytes or something high like 100KB? On our
>> >> > 13B-row table (each row is a few bytes), we set it to 10k.
>> >> >
>> >> > J-D
>> >> >
>> >> > On Sat, May 22, 2010 at 8:40 AM, Andrew Nguyen
>> >> > <[email protected]> wrote:
>> >> > > Hello,
>> >> > >
>> >> > > I finally got some decent hardware to put together a 1-master,
>> >> > > 4-slave Hadoop/HBase cluster. However, I'm still waiting for space
>> >> > > in the datacenter to clear out, and only have 3 of the nodes
>> >> > > deployed (master + 2 slaves). Each node is a quad-core AMD with 8G
>> >> > > of RAM, running on a GigE network. HDFS is configured to run on a
>> >> > > separate (from the OS drive) U320 drive. The master has RAID1
>> >> > > mirrored drives only.
>> >> > >
>> >> > > I've installed HBase with slave1 and slave2 as regionservers, and
>> >> > > master, slave1, and slave2 as the ZK quorum. The master serves as
>> >> > > the NN and JT, and the slaves as DN and TT.
>> >> > >
>> >> > > Now my question:
>> >> > >
>> >> > > I've imported 22.5M rows into HBase, into a single table. Each row
>> >> > > has 8 or so columns. I just ran the RowCounter MR example and it
>> >> > > takes about 25 minutes to complete. Is a 3-node setup too
>> >> > > underpowered to combat the overhead of Hadoop and HBase? Or could
>> >> > > it be something with my configuration? I've been playing around
>> >> > > with Hadoop some, but this is my first attempt at anything HBase.
>> >> > >
>> >> > > Thanks!
>> >> > >
>> >> > > --Andrew
>> >>
>> >> _________________________________________________________________
>> >> The New Busy is not the too busy. Combine all your e-mail accounts with
>> >> Hotmail.
>> >> http://www.windowslive.com/campaign/thenewbusy?tile=multiaccount&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_4
>> >>
>> >
>> > Every system has its tradeoffs. In the example above:
>> >
>> >>> select count(*) and you get a response back fairly quickly.
>> >
>> > Try this with MyISAM: very fast. Try it with InnoDB: it takes a very
>> > long time. Some systems maintain a row count and some do not.
>> >
>> > Now if you are using InnoDB, there is a quick way to get an approximate
>> > row count:
>> >
>> > explain select count(*)
>> >
>> > This causes the InnoDB engine to use index statistics to produce an
>> > approximate table size.
>> >
>> > HBase does not maintain a row count. Counting rows is an intensive
>> > process, as it scans every row. Such is life.
>> >
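For readers hitting the same slow RowCounter runs, J-D's scanner-caching advice above boils down to one client-side setting. A minimal sketch, assuming an hbase-site.xml on the client/job classpath; the value 500 is only an illustrative starting point (J-D's 13B-row table used 10k, and the right number depends on your row size):

```xml
<!-- hbase-site.xml (client side): raise scanner pre-fetch from the
     default of 1, so each next() call can be served from a locally
     cached batch of rows instead of costing one RPC per row -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>500</value>
</property>
```

Larger values mean fewer round trips but more memory per batch, and a higher risk of the scanner timing out if the client processes each batch slowly - which is why the shipped default is so conservative, as Ryan notes above.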
