Awesome, thanks!... I will give it a whirl on our test cluster. -Jack
On Mon, Sep 20, 2010 at 10:15 PM, Ryan Rawson <[email protected]> wrote:
> So we are running this code in production:
>
> http://github.com/stumbleupon/hbase
>
> The branch-off point is 8dc5a1a353ffc9fa57ac59618f76928b5eb31f6c, and everything past that is our rebase and cherry-picked changes.
>
> We use git to manage this internally, and don't use svn. Included are the LZO libraries we use, checked directly into the code, and the assembly changes to publish those.
>
> So when we are ready to do a deploy, we do this:
> mvn install assembly:assembly
> (or include -DskipTests to make it go faster)
>
> and then we have a new tarball to deploy.
>
> Note there is absolutely NO warranty here, not even that it will run for a microsecond... furthermore this is NOT an ASF release, just a courtesy. If there ever were to be a release it would look different, because ASF releases can't include GPL code (this does) and depend on commercial releases of Hadoop.
>
> Enjoy,
> -ryan
>
> On Mon, Sep 20, 2010 at 9:57 PM, Ryan Rawson <[email protected]> wrote:
>> no no, 20 GB heap per node. each node with 24-32 GB RAM, etc.
>>
>> we can't rely on the Linux buffer cache to save us, so we have to cache in hbase RAM.
>>
>> :-)
>>
>> -ryan
>>
>> On Mon, Sep 20, 2010 at 9:44 PM, Jack Levin <[email protected]> wrote:
>>> 20GB+?, hmmm..... I do plan to run 50 regionserver nodes though, with 3 GB heap likely; this should be plenty to rip through, say, 350 TB of data.
>>>
>>> -Jack
>>>
>>> On Mon, Sep 20, 2010 at 9:39 PM, Ryan Rawson <[email protected]> wrote:
>>>> yes, that is the new ZK-based coordination. when I publish the SU code we have a patch which limits that and is faster. 2 GB is a little small for regionserver memory... in my ideal world we'll be putting 20GB+ of RAM into regionservers.
>>>>
>>>> I just figured you were using the DEB/RPMs because your files were in /usr/local...
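The build-and-deploy flow Ryan describes above can be sketched roughly as follows; the host names, tarball version, and target path are hypothetical, not taken from the thread:

```shell
# Build the tarball inside the hbase checkout (commented out here
# because it needs a full checkout and toolchain):
#   mvn install assembly:assembly -DskipTests

# Hypothetical regionserver hosts and assembly output path.
HOSTS="rs1 rs2 rs3"
TARBALL="target/hbase-0.89-bin.tar.gz"

# Print the rsync commands that would push the tarball to each node
# as user hadoop; drop the leading echo to actually copy.
deploy() {
  for host in $HOSTS; do
    echo rsync -az "$TARBALL" "hadoop@$host:/home/hadoop/"
  done
}
deploy
```

Running everything out of /home/hadoop, as Ryan notes later in the thread, keeps this push a plain user-level rsync.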
>>>> I usually run everything out of /home/hadoop b/c it allows me to easily rsync as user hadoop.
>>>>
>>>> but you are on the right track, yes :-)
>>>>
>>>> On Mon, Sep 20, 2010 at 9:32 PM, Jack Levin <[email protected]> wrote:
>>>>> Who said anything about deb :). I do use tarballs.... Yes, so what did it is copying that jar to under hbase/lib, and then a full restart.
>>>>> Now here is a funny thing: the master shuddered for about 10 minutes, spewing these messages:
>>>>>
>>>>> 2010-09-20 21:23:45,826 DEBUG org.apache.hadoop.hbase.master.HMaster: Event NodeCreated with state SyncConnected with path /hbase/UNASSIGNED/97999366
>>>>> 2010-09-20 21:23:45,827 DEBUG org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event NodeCreated with path /hbase/UNASSIGNED/97999366
>>>>> 2010-09-20 21:23:45,827 DEBUG org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS: Got zkEvent NodeCreated state:SyncConnected path:/hbase/UNASSIGNED/97999366
>>>>> 2010-09-20 21:23:45,827 DEBUG org.apache.hadoop.hbase.master.RegionManager: Created/updated UNASSIGNED zNode img15,normal052q.jpg,1285001686282.97999366 in state M2ZK_REGION_OFFLINE
>>>>> 2010-09-20 21:23:45,828 INFO org.apache.hadoop.hbase.master.RegionServerOperation: img13,p1000319tq.jpg,1284952655960.812544765 open on 10.103.2.3,60020,1285042333293
>>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.ZKUnassignedWatcher: Got event type [ M2ZK_REGION_OFFLINE ] for region 97999366
>>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.HMaster: Event NodeChildrenChanged with state SyncConnected with path /hbase/UNASSIGNED
>>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.ZKMasterAddressWatcher: Got event NodeChildrenChanged with path /hbase/UNASSIGNED
>>>>> 2010-09-20 21:23:45,828 DEBUG org.apache.hadoop.hbase.master.ZKUnassignedWatcher: ZK-EVENT-PROCESS: Got zkEvent NodeChildrenChanged state:SyncConnected path:/hbase/UNASSIGNED
>>>>> 2010-09-20 21:23:45,830 DEBUG org.apache.hadoop.hbase.master.BaseScanner: Current assignment of img150,,1284859678248.3116007 is not valid; serverAddress=10.103.2.1:60020, startCode=1285038205920 unknown.
>>>>>
>>>>> Does anyone know what they mean? At first it would kill one of my datanodes. But what helped is when I changed the heap size to 4 GB for the master and 2 GB for the datanode that was dying, and after 10 minutes I got into a clean state.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Mon, Sep 20, 2010 at 9:28 PM, Ryan Rawson <[email protected]> wrote:
>>>>>> yes, on every single machine as well, and restart.
>>>>>>
>>>>>> again, not sure how you'd do this in a scalable manner with your deb packages... with the source tarball you can just replace it, rsync it out, and done.
>>>>>>
>>>>>> :-)
>>>>>>
>>>>>> On Mon, Sep 20, 2010 at 8:56 PM, Jack Levin <[email protected]> wrote:
>>>>>>> ok, I found that file, do I replace hadoop-core.*.jar under /usr/lib/hbase/lib?
>>>>>>> Then restart, etc? All regionservers too?
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Mon, Sep 20, 2010 at 8:40 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>> Well, I don't really run CDH; I disagree with their rpm/deb packaging policies, and I highly recommend not using DEBs to install this software...
>>>>>>>>
>>>>>>>> So normally, installing from tarball, the jar is in <installpath>/hadoop-0.20.0-320/hadoop-core-0.20.2+320.jar
>>>>>>>>
>>>>>>>> On the CDH/DEB edition it's somewhere silly... locate and find will be your friends. It should be called hadoop-core-0.20.2+320.jar though!
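The jar swap discussed here (replacing the hadoop-core jar HBase ships with by the one from the running HDFS install, on every machine) can be sketched like this; both paths are illustrative defaults, not taken from the thread:

```shell
# Illustrative locations; find the real jar with locate/find as Ryan suggests.
HBASE_LIB=${HBASE_LIB:-/usr/lib/hbase/lib}
HADOOP_JAR=${HADOOP_JAR:-/usr/lib/hadoop/hadoop-core-0.20.2+320.jar}

swap_hadoop_jar() {
  rm -f "$HBASE_LIB"/hadoop-core-*.jar   # drop the jar bundled with HBase
  cp "$HADOOP_JAR" "$HBASE_LIB"/         # use the cluster's own hadoop jar
}

# After swapping, rsync $HBASE_LIB to every node, then restart HDFS and HBase.
```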
>>>>>>>>
>>>>>>>> I'm working on a github publish of SU's production system, which uses the cloudera maven repo to install the correct JAR in hbase, so when you type 'mvn assembly:assembly' to build your own hbase-*-bin.tar.gz (the * being whatever version you specified in pom.xml), the cdh3b2 jar comes pre-packaged.
>>>>>>>>
>>>>>>>> Stay tuned :-)
>>>>>>>>
>>>>>>>> -ryan
>>>>>>>>
>>>>>>>> On Mon, Sep 20, 2010 at 8:36 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>> Ryan, the hadoop jar: what is the usual path to the file? I just want to be sure, and where do I put it?
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>> On Mon, Sep 20, 2010 at 8:30 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>> you need 2 more things:
>>>>>>>>>>
>>>>>>>>>> - restart hdfs
>>>>>>>>>> - make sure the hadoop jar from your install replaces the one we ship with
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 20, 2010 at 8:22 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>> So, I switched to 0.89, and we already had CDH3 (hadoop-0.20-datanode-0.20.2+320-3.noarch). Even though I added <name>dfs.support.append</name> as true to both hdfs-site.xml and hbase-site.xml, the master still reports this:
>>>>>>>>>>>
>>>>>>>>>>> You are currently running the HMaster without HDFS append support enabled. This may result in data loss. Please see the HBase wiki for details.
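For reference, the append flag Jack mentions is normally set as a property stanza; a minimal sketch for both hdfs-site.xml and hbase-site.xml (per the rest of the thread, it only takes effect with an append-capable hadoop jar and after HDFS is restarted):

```xml
<!-- Same stanza in hdfs-site.xml and hbase-site.xml; requires an
     append-capable hadoop jar and a restart of HDFS to take effect. -->
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
```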
>>>>>>>>>>> Master Attributes
>>>>>>>>>>> Attribute Name        Value                                         Description
>>>>>>>>>>> HBase Version         0.89.20100726, r979826                        HBase version and svn revision
>>>>>>>>>>> HBase Compiled        Sat Jul 31 02:01:58 PDT 2010, stack           When HBase version was compiled and by whom
>>>>>>>>>>> Hadoop Version        0.20.2, r911707                               Hadoop version and svn revision
>>>>>>>>>>> Hadoop Compiled       Fri Feb 19 08:07:34 UTC 2010, chrisdo         When Hadoop version was compiled and by whom
>>>>>>>>>>> HBase Root Directory  hdfs://namenode-rd.imageshack.us:9000/hbase   Location of HBase home directory
>>>>>>>>>>>
>>>>>>>>>>> Any ideas what's wrong?
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:47 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>>>> Hey,
>>>>>>>>>>>>
>>>>>>>>>>>> There is actually only 1 active branch of hbase, that being the 0.89 release, which is based on 'trunk'. We have snapshotted a series of 0.89 "developer releases" in hopes that people would try them out and start thinking about the next major version. One of these is what SU is running prod on.
>>>>>>>>>>>>
>>>>>>>>>>>> At this point, tracking 0.89 and which ones are the 'best' patch sets to run is a bit of a contact sport, but if you are serious about not losing data it is worthwhile. SU is based on the most recent DR with a few minor patches of our own concoction brought in. Current works, but some Master ops are slow, and there are a few patches on top of that. I'll poke about and see if it's possible to publish to a github branch or something.
>>>>>>>>>>>>
>>>>>>>>>>>> -ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 20, 2010 at 5:16 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>>> Sounds good. The only reason I ask is because of this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are currently two active branches of HBase:
>>>>>>>>>>>>>
>>>>>>>>>>>>> * 0.20 - the current stable release series, being maintained with patches for bug fixes only. This release series does not support HDFS durability - edits may be lost in the case of node failure.
>>>>>>>>>>>>> * 0.89 - a development release series with active feature and stability development, not currently recommended for production use. This release does support HDFS durability - cases in which edits are lost are considered serious bugs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are we talking about data loss in case of a datanode going down while being written to, or a RegionServer going down?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -jack
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 4:09 PM, Ryan Rawson <[email protected]> wrote:
>>>>>>>>>>>>>> We run 0.89 in production @ Stumbleupon. We also employ 3 committers...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As for safety, you have no choice but to run 0.89. If you run a 0.20 release you will lose data. You must be on 0.89 and CDH3/append-branch to achieve data durability, and there really is no argument around it. If you are doing your tests with 0.20.6 now, I'd stop and rebase those tests onto the latest DR announced on the list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -ryan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 3:17 PM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>>>>> Hi Stack, see inline:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 2:42 PM, Stack <[email protected]> wrote:
>>>>>>>>>>>>>>>> Hey Jack:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for writing.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> See below for some comments.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <[email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Image-Shack gets close to two million image uploads per day, which are usually stored on regular servers (we have about 700) as regular files, and each server has its own host name, such as (img55). I've been researching how to improve our backend design in terms of data safety and stumbled onto the HBase project.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any other requirements other than data safety? (latency, etc).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Latency is the second requirement. We have some services that are very short-tail and can produce a 95% cache hit rate, so I assume this would really put the cache to good use. Some other services, however, have about a 25% cache hit ratio, in which case the latency should be 'adequate', e.g. if it's slightly worse than getting data off raw disk, then it's good enough. Safety is supremely important, then availability, then speed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Now, I think hbase is the most beautiful thing that has happened to the distributed DB world :).
>>>>>>>>>>>>>>>>> The idea is to store image files (about 400 KB on average) into HBase.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd guess some images are much bigger than this. Do you ever limit the size of images folks can upload to your service?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The setup will include the following configuration:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual-core CPU, 6 x 2 TB disks each.
>>>>>>>>>>>>>>>>> 3 to 5 Zookeepers
>>>>>>>>>>>>>>>>> 2 Masters (one in each datacenter)
>>>>>>>>>>>>>>>>> 10 to 20 Stargate REST instances (one per server, hash loadbalanced)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What's your frontend? Why REST? It might be more efficient if you could run with thrift, given REST base64s its payload IIRC (check the src yourself).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For insertion we use Haproxy, and balance curl PUTs across multiple REST APIs.
>>>>>>>>>>>>>>> For reading, it's an nginx proxy that does Content-Type modification from image/jpeg to octet-stream, and vice versa; it then hits Haproxy again, which hits the balanced REST instances.
>>>>>>>>>>>>>>> Why REST? It was the simplest thing to run, given that it supports HTTP. Potentially we could rewrite something for thrift, as long as we can still use HTTP to send and receive data (has anyone written anything like that, say in Python, C, or Java?)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 40 to 50 RegionServers (will probably keep masters separate on dedicated boxes).
>>>>>>>>>>>>>>>>> 2 Namenode servers (one backup, highly available; will do fsimage and edits snapshots also)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So far I've got about 13 servers running, doing about 20 insertions/second (file sizes ranging from a few KB to 2-3 MB, avg. 400 KB) via the Stargate API. Our frontend servers receive files, and I just fork-insert them into Stargate via http (curl).
>>>>>>>>>>>>>>>>> The inserts are humming along nicely, without any noticeable load on the regionservers; so far I've inserted about 2 TB worth of images.
>>>>>>>>>>>>>>>>> I have adjusted the region file size to 512 MB, and the table block size to about 400 KB, trying to match the average access block size to limit HDFS trips.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As Todd suggests, I'd go up from 512MB... 1G at least. You'll probably want to up your flush size from 64MB to 128MB or maybe 192MB.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yep, I will adjust to 1G. I thought flush was controlled by a function of memstore HEAP, something like 40%? Or are you talking about HDFS block size?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So far the read performance has been more than adequate, and of course write performance is nowhere near capacity.
>>>>>>>>>>>>>>>>> So right now, all newly uploaded images go to HBASE. But we do plan to insert about 170 million images (about 100 days' worth), which is only about 64 TB, or 10% of the planned cluster size of 600 TB.
>>>>>>>>>>>>>>>>> The end goal is to have a storage system that creates data safety, e.g. the system may go down but data cannot be lost.
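The sizing advice above (1 GB regions, 128-192 MB flushes) would land in hbase-site.xml as something like the following; the property names are the 0.20/0.89-era ones, and the values are one possible reading of the advice, not settings confirmed in the thread:

```xml
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value>   <!-- 1 GB region size, up from the 512 MB Jack set -->
</property>
<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>134217728</value>    <!-- 128 MB flush size, up from the 64 MB default -->
</property>
```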
>>>>>>>>>>>>>>>>> Our Front-End servers will continue to serve images from their own file systems (we are serving about 16 Gbit at peak); however, should we need to bring any of those down for maintenance, we will redirect all traffic to HBase (should be no more than a few hundred Mbps) while the front-end server is repaired (for example, having its disk replaced). After the repairs, we quickly repopulate it with the missing files, while serving the remaining missing ones off HBase.
>>>>>>>>>>>>>>>>> All in all, it should be a very interesting project, and I am hoping not to run into any snags; however, should that happen, I am pleased to know that such a great and vibrant tech group exists that supports and uses HBASE :).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We're definitely interested in how your project progresses. If you are ever up in the city, you should drop by for a chat.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cool. I'd like that.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> P.S. I'm also w/ Todd that you should move to 0.89 and blooms.
>>>>>>>>>>>>>>>> P.P.S. I updated the wiki on stargate REST: http://wiki.apache.org/hadoop/Hbase/Stargate
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cool. I assume if we move to that, it won't kill existing meta tables and data? e.g. it's cross-compatible?
>>>>>>>>>>>>>>> Is 0.89 ready for a production environment?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Jack
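The fork-insert path described in the thread (curl PUTs through Haproxy into Stargate) can be sketched as below. The host, port, table, row, and column names are all hypothetical, and the function only prints the command it would run; drop the echo to issue the request for real:

```shell
# Print a Stargate-style PUT of raw image bytes; all names are made up.
put_image() {  # usage: put_image <table> <row> <file>
  echo curl -s -X PUT \
    -H "Content-Type: application/octet-stream" \
    --data-binary "@$3" \
    "http://stargate.example:8080/$1/$2/image:data"
}

put_image img15 normal052q.jpg /tmp/normal052q.jpg
```

As Stack notes above, the REST gateway base64-encodes payloads internally, which adds payload overhead relative to thrift.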
