Lesson learned...restart thrift servers *after* restarting hadoop+hbase.

On Thu, Dec 30, 2010 at 3:39 PM, Wayne <[email protected]> wrote:
> We have restarted with lzop compression, and now I am seeing some really long and frequent stop the world pauses of the entire cluster. The load requests for all regions all go to zero except for the meta table region. No data batches are getting in (no loads are occurring) and everything seems frozen. It seems to last for 5+ seconds. Is this GC on the master or GC in the meta region? What could cause everything to stop for several seconds? It appears to happen on a recurring basis as well. I think we saw it before switching to lzo but it seems much worse now (lasts longer and occurs more frequently).
>
> Thanks.
>
> On Thu, Dec 30, 2010 at 12:20 PM, Wayne <[email protected]> wrote:
>
>> HBase Version 0.89.20100924, r1001068 w/ 8GB heap
>>
>> I plan to run for 1 week straight maxed out. I am worried about GC pauses, especially concurrent mode failures (does hbase/hadoop suffer these under extended load?). What should I be looking for in the gc log in terms of problem signs? The ParNews are quick but the CMS concurrent marks are taking as much as 4 mins with an average of 20-30 secs.
>>
>> Thanks.
>>
>> On Thu, Dec 30, 2010 at 12:00 PM, Stack <[email protected]> wrote:
>>
>>> Oh, what versions are you using?
>>> St.Ack
>>>
>>> On Thu, Dec 30, 2010 at 9:00 AM, Stack <[email protected]> wrote:
>>>
>>>> Keep going. Let it run longer. Get the servers as loaded as you think they'll be in production. Make sure the perf numbers are not because cluster is 'fresh'.
>>>> St.Ack
>>>>
>>>> On Thu, Dec 30, 2010 at 5:51 AM, Wayne <[email protected]> wrote:
>>>>
>>>>> We finally got our cluster up and running and write performance looks very good. We are getting sustained 8-10k writes/sec/node on a 10 node cluster from Python through thrift. These are values written to 3 CFs so actual hbase performance is 25-30k writes/sec/node. The nodes are currently disk i/o bound (40-50% utilization) but hopefully once we get lzop working this will go down. We have been running for 12 hours without a problem. We hope to get lzop going today and then load all through the long weekend.
>>>>>
>>>>> We plan to then test reads next week after we get some data in there. Looks good so far! Below are our settings in case there are some suggestions/concerns.
>>>>>
>>>>> Thanks for everyone's help. It is pretty exciting to get performance like this from the start.
>>>>>
>>>>> *Global*
>>>>> client.write.buffer = 10485760 (10MB = 5x default)
>>>>> optionalLogFlushInterval = 10000 (10 secs = 10x default)
>>>>> memstore.flush.size = 268435456 (256MB = 4x default)
>>>>> hregion.max.filesize = 1073741824 (1GB = 4x default)
>>>>>
>>>>> *Table*
>>>>> alter 'xxx', METHOD => 'table_att', DEFERRED_LOG_FLUSH => true
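For reference, the shorthand names in that list presumably correspond to the standard hbase-site.xml properties. A rough sketch of the equivalent entries (the property names are assumed from the defaults quoted above, not taken verbatim from the thread):

    <property>
      <name>hbase.client.write.buffer</name>
      <value>10485760</value> <!-- 10MB; default is 2MB -->
    </property>
    <property>
      <name>hbase.regionserver.optionallogflushinterval</name>
      <value>10000</value> <!-- 10 secs, in ms -->
    </property>
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>268435456</value> <!-- 256MB -->
    </property>
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>1073741824</value> <!-- 1GB -->
    </property>

As I understand it, the optional log flush interval is what governs how often the WAL gets synced for tables running with DEFERRED_LOG_FLUSH, like the alter above.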
>>>>>
>>>>> On Wed, Dec 29, 2010 at 12:55 AM, Stack <[email protected]> wrote:
>>>>>
>>>>>> On Mon, Dec 27, 2010 at 11:47 AM, Wayne <[email protected]> wrote:
>>>>>>
>>>>>>> All data is written to 3 CFs. Basically 2 of the CFs are secondary indexes (manually managed as normal CFs). It sounds like we should try hard to get as much out of thrift as we can before going to a lower level.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> Writes need to be "fast enough", but reads are more important in the end (and are the reason we are switching from a different solution). The numbers you quoted below sound like they are in the ballpark of what we are looking to do.
>>>>>>
>>>>>> Even the tens per second that I threw in there to CMA?
>>>>>>
>>>>>>> Much of our data is cold, and we expect reads to be disk i/o based.
>>>>>>
>>>>>> OK. FYI, we're not the best at this -- cache-miss cold reads -- what w/ a network hop in the way and currently we'll open a socket per access.
>>>>>>
>>>>>>> Given this, is 8GB heap a good place to start on the data nodes (24GB ram)? Is the block cache managed on its own (being it won't blow up causing OOM),
>>>>>>
>>>>>> It won't. It's constrained. Does our home-brewed sizeof. Default, it's 0.2 of total heap. If you think cache will help, you could go up from there. 0.4 or 0.5 of heap.
>>>>>>
>>>>>>> and if we do not use it (block cache) should we go even lower for the heap (we want to avoid CMF and long GC pauses)?
>>>>>>
>>>>>> If you are going to be doing cache-miss most of the time and cold reads, then yes, you can do away with cache.
>>>>>>
>>>>>> In testing of 0.90.x I've been running w/ 1MB heaps with 1k regions but this is my trying to break stuff.
>>>>>>
>>>>>>> Are there any timeouts we need to tweak to make the cluster more "accepting" of long GC pauses while under sustained load (7+ days of 10k inserts/sec/node)?
>>>>>>
>>>>>> If the zookeeper client times out, the regionserver will shut itself down. In 0.90.0RC2, the client session timeout is set high -- 3 minutes. If you time that out, then that's pretty extreme... something badly wrong I'd say. Here's a few notes on the config and others that you might want to twiddle (see previous section on required configs... make sure you've got those too):
>>>>>>
>>>>>> http://people.apache.org/~stack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations
>>>>>>
>>>>>>> Does LZO compression speed up reads/writes where there is excess CPU to do the compression? I assume it would lower disk i/o but increase CPU a lot. Is data compressed on the initial write or only after compaction?
>>>>>>
>>>>>> LZO is pretty frictionless -- i.e. little CPU cost -- and yes, usually helps speed things up (grab more in the one go). What size are your records? You might want to mess w/ hfile block sizes though the 64k default is usually good enough for all but very small cell sizes.
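As an aside, enabling LZO on an existing family is a shell-level change; a minimal sketch, assuming the LZO codec is already installed on every node, with 'cf1' as a stand-in family name and the 64k block size Stack mentions left explicit:

    disable 'xxx'
    alter 'xxx', {NAME => 'cf1', COMPRESSION => 'LZO', BLOCKSIZE => '65536'}
    enable 'xxx'

The exact set of family attributes the shell accepts varies a bit by version, so check the shell's help for alter on the release in use.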
>>>>>>
>>>>>>> With the replication in the HDFS layer how are reads managed in terms of load balancing across region servers? Does HDFS know to spread multiple requests across the 3 region servers that contain the same data?
>>>>>>
>>>>>> You only read from one of the replicas, always the 'closest'. If the DFSClient has trouble getting the first of the replicas, it moves on to the second, etc.
>>>>>>
>>>>>>> For example with 10 data nodes if we have 50 concurrent readers with very "random" key requests we would expect to have 5 reads occurring on each data node at the same time. We plan to have a thrift server on each data node, so 5 concurrent readers will be connected to each thrift server at any given time (50 in aggregate across 10 nodes). We want to be sure everything is designed to evenly spread this load to avoid any possible hot-spots.
>>>>>>
>>>>>> This is different. This is key design. A thrift server will be doing some subset of the key space. If the requests are evenly distributed over all of the key space, then you should be fine; all thrift servers will be evenly loaded. If not, then there could be hot spots.
>>>>>>
>>>>>> We have a balancer that currently only counts regions per server, not regions per server plus hits per region, so it could be the case that a server by chance ends up carrying all of the hot regions. HBase itself is not too smart dealing with this. In 0.90.0, there is facility for manually moving regions -- i.e. closing in current location and moving the region off to another server w/ some outage while the move is happening (usually seconds) -- or you could split the hot region manually and then the daughters could be moved off to other servers... Primitive for now but should be better in next HBase versions.
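For what it's worth, that manual intervention looks roughly like the following in the 0.90 shell; the region and server names here are made-up placeholders:

    # move a hot region to a specific regionserver ('host,port,startcode')
    move 'ENCODED_REGION_NAME', 'rs3.example.com,60020,1293840000000'
    # or split it and let the daughters land on other servers
    split 'REGION_NAME'

Whether move takes the encoded region name and split the full region (or table) name is worth double-checking against the shell help for the exact release.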
>>>>>>
>>>>>> Have you been able to test w/ your data and your query pattern? That'll tell you way more than I ever could.
>>>>>>
>>>>>> Good luck,
>>>>>> St.Ack
>>>>>>
>>>>>>> On Mon, Dec 27, 2010 at 1:49 PM, Stack <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Fri, Dec 24, 2010 at 5:09 AM, Wayne <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> We are in the process of evaluating hbase in an effort to switch from a different nosql solution. Performance is of course an important part of our evaluation. We are a python shop and we are very worried that we can not get any real performance out of hbase using thrift (and must drop down to java). We are aware of the various lower level options for bulk insert or java based inserts with turning off WAL etc. but none of these are available to us in python so are not part of our evaluation.
>>>>>>>>
>>>>>>>> I can understand python for continuous updates from your frontend or whatever, but you might consider hacking up a bit of java to make use of the bulk updater; you'll get upload rates orders of magnitude beyond what you'd achieve going via the API via python (or java for that matter). You can also do incremental updates using the bulk loader.
>>>>>>>>
>>>>>>>>> We have a 10 node cluster (24gb, 6 x 1TB, 16 core) that we are setting up as data/region nodes, and we are looking for suggestions on configuration as well as benchmarks in terms of expectations of performance. Below are some specific questions. I realize there are a million factors that help determine specific performance numbers, so any examples of performance from running clusters would be great as examples of what can be done.
>>>>>>>>
>>>>>>>> Yeah, you have been around the block obviously. It's hard to give out 'numbers' since so many different factors are involved.
>>>>>>>>
>>>>>>>>> Again thrift seems to be our "problem" so non java based solutions are preferred (do any non java based shops run large scale hbase clusters?). Our total production cluster size is estimated to be 50TB.
>>>>>>>>
>>>>>>>> There are some substantial shops running non-java; e.g. the yfrog folks go via REST, the mozilla fellas are python over thrift, Stumbleupon is php over thrift.
>>>>>>>>
>>>>>>>>> Our data model is 3 CFs, one primary and 2 secondary indexes. All writes go to all 3 CFs and are grouped as a batch of row mutations which should avoid row locking issues.
>>>>>>>>
>>>>>>>> A write updates 3CFs and secondary indices? That's an expensive Put relatively. You have to run w/ 3CFs? It facilitates fast querying?
>>>>>>>>
>>>>>>>>> What heap size is recommended for master, and for region servers (24gb ram)?
>>>>>>>>
>>>>>>>> Master doesn't take much heap, at least not in the coming 0.90.0 HBase (Is that what you intend to run?).
>>>>>>>>
>>>>>>>> The more RAM you give the regionservers, the more cache your cluster will have.
>>>>>>>>
>>>>>>>> What's important to you, read or write times?
>>>>>>>>
>>>>>>>>> What other settings can/should be tweaked in hbase to optimize performance (we have looked at the wiki page)?
>>>>>>>>
>>>>>>>> That's a good place to start. Take a look through this mailing list for others (it's time for a trawl of the mailing list and then distilling the findings into a reedit of our perf page).
>>>>>>>>
>>>>>>>>> What is a good batch size for writes? We will start with 10k values/batch.
>>>>>>>>
>>>>>>>> Start small with defaults. Make sure it's all running smooth first. Then ratchet it up.
>>>>>>>>
>>>>>>>>> How many concurrent writers/readers can a single data node handle with evenly distributed load? Are there settings specific to this?
>>>>>>>>
>>>>>>>> How many clients are you going to have writing to HBase?
>>>>>>>>
>>>>>>>>> What is "very good" read/write latency for a single put/get in hbase using thrift?
>>>>>>>>
>>>>>>>> "Very good" would be < a few milliseconds.
>>>>>>>>
>>>>>>>>> What is "very good" read/write throughput per node in hbase using thrift?
>>>>>>>>
>>>>>>>> Thousands of ops per second per regionserver (Sorry, can't be more specific than that). If the Puts are multi-family + updates on secondary indices, hundreds -- maybe even tens... I'm not sure -- rather than thousands.
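To make the "batch of row mutations" over thrift concrete, here is a rough Python sketch against the generated Hbase bindings; the module layout, table name 'xxx', and CF/qualifier names are illustrative assumptions, not code from this thread:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase                    # gen-py package from Hbase.thrift
    from hbase.ttypes import Mutation, BatchMutation

    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Hbase.Client(protocol)
    transport.open()

    # One BatchMutation per row; each row carries the put for the primary CF
    # plus the two manually maintained secondary-index CFs, so the whole
    # batch goes over in a single thrift round trip.
    batch = []
    for row_key, value in rows_to_write:       # rows_to_write supplied by the caller
        batch.append(BatchMutation(row=row_key, mutations=[
            Mutation(column='data:v', value=value),
            Mutation(column='idx1:' + value, value=row_key),
            Mutation(column='idx2:' + value, value=row_key),
        ]))
    client.mutateRows('xxx', batch)

    transport.close()

Batching this way amortizes the thrift round trip, but each row still fans out to three families server-side, which is the cost Stack is pointing at.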
>>>>>>>>
>>>>>>>>> We are looking to get performance numbers in the range of 10k aggregate inserts/sec/node and read latency < 30ms/read with 3-4 concurrent readers/node. Can our expectations be met with hbase through thrift? Can they be met with hbase through java?
>>>>>>>>
>>>>>>>> I wouldn't fixate on the thrift hop. At SU we can do thousands of ops a second per node np from PHP frontend over thrift.
>>>>>>>>
>>>>>>>> 10k inserts a second per node into a single CF might be doable. If into 3CFs, then you need to recalibrate your expectations (I'd say).
>>>>>>>>
>>>>>>>>> Thanks in advance for any help, examples, or recommendations that you can provide!
>>>>>>>>
>>>>>>>> Sorry, the above is light on recommendations (for reasons cited by Ryan above -- smile).
>>>>>>>> St.Ack
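On the GC pauses raised at the top of the thread: turning on GC logging makes it possible to tell ParNew pauses, CMS cycles, and concurrent mode failures apart. A sketch of hbase-env.sh options that would do that; the heap size and log path are placeholders, and the flags are standard HotSpot options rather than anything prescribed in this thread:

    # CMS + ParNew with GC logging; look for "concurrent mode failure"
    # and long ParNew / CMS-remark pause times in the resulting log.
    export HBASE_OPTS="$HBASE_OPTS -Xmx8g \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70 \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -Xloggc:/var/log/hbase/gc-hbase.log"

The long CMS "concurrent mark" phases Wayne mentions run alongside the application; it is the initial-mark/remark pauses and any concurrent mode failures that show up as stop-the-world time.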
