Hey,

Yes, all improvements should be done via JIRA.
Thanks,
-ryan

On Sun, May 23, 2010 at 4:32 PM, Michael Segel <[email protected]> wrote:
>
>
> Ryan & JD
>
> I'm aware of the difficulties in trying to maintain an accurate row count.
> It's not trivial, but it's not rocket science either.
>
> There are a couple of ways of doing this, and it will take some time to think
> through the benefits vs. the costs of how you do it.
>
> You're right. It's more difficult than a C-ISAM single-machine database.
> There are tricks one could take.
>
> But I think that this should be taken offline and maybe open a JIRA issue, if
> one doesn't already exist?
>
> -Mike
>
>
>> Date: Sun, 23 May 2010 11:20:17 -0700
>> Subject: Re: RowCounter example run time
>> From: [email protected]
>> To: [email protected]
>>
>> The select count(*) optimization is a classic in databases - some
>> people argue that it's really important and should be optimized for
>> (MyISAM, for example), and others note that it's a trick and real DB
>> loads rarely use it on a sizable table. Note that MyISAM locks the
>> entire table for each update (only one update at a time), so comparing
>> HBase to it is odd. InnoDB doesn't (maintaining global stats without
>> hurting performance can be difficult). Oracle doesn't (but may be able to use
>> a primary index to reduce the blocks read).
>>
>> Implementing this in HBase might be difficult - when a new column is
>> inserted into a table, the regionserver doesn't know whether that row
>> already exists - to know that, it would have to read some data,
>> potentially from disk, first. Any scheme that requires the
>> regionserver to increment a "rowsForRegion" counter during certain inserts
>> would therefore be problematic.
>>
>> As JD noted, the likely cause here is scanner pre-fetch caching. We
>> ship with very conservative scanner pre-fetch values, because if a
>> client takes too long it will get a fatal exception. RowCounter MR
>> jobs shouldn't be like that, however.
>>
>> As for cluster sizing - 6-10 nodes is really the minimum.
>> With 3 nodes you
>> are replicating data to every node, and you aren't getting the benefits
>> of a clustered solution. At higher node counts you get some disjoint
>> parallelism underway, and things really pick up on the larger datasets
>> (I can do MapReduces at 7-8M rows/sec for 20+ minutes on end).
>>
>> -ryan
>>
>>
>> On Sun, May 23, 2010 at 7:58 AM, Edward Capriolo <[email protected]>
>> wrote:
>> > On Sun, May 23, 2010 at 10:36 AM, Michael Segel
>> > <[email protected]> wrote:
>> >
>> >>
>> >> J-D,
>> >>
>> >> Here's the problem: you go to any relational database, do a select
>> >> count(*), and you get a response back fairly quickly.
>> >> The difference is that in HBase you're doing a physical count, while
>> >> with the relational engine you're pulling it from metadata.
>> >>
>> >> I have a couple of ideas on how we could do this...
>> >>
>> >> -Mike
>> >>
>> >> > Date: Sat, 22 May 2010 09:25:51 -0700
>> >> > Subject: Re: RowCounter example run time
>> >> > From: [email protected]
>> >> > To: [email protected]
>> >> >
>> >> > My first question would be, what do you expect exactly? Would 5 min be
>> >> > enough? Or are you expecting something more like 1-2 secs (which is
>> >> > impossible since this is mapreduce)?
>> >> >
>> >> > Then there's also Jon's questions.
>> >> >
>> >> > Finally, did you set a higher scanner caching on that job?
>> >> > hbase.client.scanner.caching is the name of the config, which defaults
>> >> > to 1. When mapping an HBase table, if you don't set it higher you're
>> >> > basically benchmarking the RPC layer, since it does one call per next()
>> >> > invocation. Setting the right value depends on the size of your rows,
>> >> > e.g. are you storing 60 bytes or something high like 100KB? On our
>> >> > 13B-row table (each row is a few bytes), we set it to 10k.
>> >> >
>> >> > J-D
>> >> >
>> >> > On Sat, May 22, 2010 at 8:40 AM, Andrew Nguyen
>> >> > <[email protected]> wrote:
>> >> > > Hello,
>> >> > >
>> >> > > I finally got some decent hardware to put together a 1-master,
>> >> > > 4-slave Hadoop/HBase cluster. However, I'm still waiting for space
>> >> > > in the datacenter to clear out, and only have 3 of the nodes
>> >> > > deployed (master + 2 slaves). Each node is a quad-core AMD with 8G
>> >> > > of RAM, running on a GigE network. HDFS is configured to run on a
>> >> > > separate (from the OS drive) U320 drive. The master has RAID1
>> >> > > mirrored drives only.
>> >> > >
>> >> > > I've installed HBase with slave1 and slave2 as regionservers, and
>> >> > > master, slave1, and slave2 as the ZK quorum. The master serves as
>> >> > > the NN and JT, and the slaves as DN and TT.
>> >> > >
>> >> > > Now my question:
>> >> > >
>> >> > > I've imported 22.5M rows into HBase, into a single table. Each row
>> >> > > has 8 or so columns. I just ran the RowCounter MR example and it
>> >> > > takes about 25 minutes to complete. Is a 3-node setup too
>> >> > > underpowered to combat the overhead of Hadoop and HBase? Or could
>> >> > > it be something with my configuration? I've been playing around
>> >> > > with Hadoop some, but this is my first attempt at anything HBase.
>> >> > >
>> >> > > Thanks!
>> >> > >
>> >> > > --Andrew
>> >>
>> >> _________________________________________________________________
>> >> The New Busy is not the too busy. Combine all your e-mail accounts with
>> >> Hotmail.
>> >> http://www.windowslive.com/campaign/thenewbusy?tile=multiaccount&ocid=PID28326::T:WLMTAGL:ON:WL:en-US:WM_HMP:042010_4
>> >>
>> >
>> > Every system has its tradeoffs. In the example above:
>> >
>> >>> select count(*) and you get a response back fairly quickly.
>> >
>> > Try this with MyISAM: very fast. Try it with InnoDB: it takes a very
>> > long time. Some systems maintain a row count and some do not.
>> >
>> > Now if you are using InnoDB, there is a quick way to get an approximate
>> > row count:
>> >
>> > explain select count(*)
>> >
>> > This causes the InnoDB engine to use index statistics to produce an
>> > approximate table size.
>> >
>> > HBase does not maintain a row count. Counting rows is an intensive
>> > process, as it scans every row. Such is life.
>> >
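For readers hitting the same slow RowCounter runs, J-D's scanner-caching advice above boils down to one client-side setting. A minimal sketch, assuming an hbase-site.xml on the client/job classpath; the value 500 is only an illustrative starting point (J-D's 13B-row table used 10k, and the right number depends on your row size):

```xml
<!-- hbase-site.xml (client side): raise scanner pre-fetch from the
     default of 1, so each next() call can be served from a locally
     cached batch of rows instead of costing one RPC per row -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>500</value>
</property>
```

Larger values mean fewer round trips but more memory per batch, and a higher risk of the scanner timing out if the client processes each batch slowly - which is why the shipped default is so conservative, as Ryan notes above.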
