I don't have empty rows, does that make a difference? E.g., when a row is
inserted, it's always followed by the image data.
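[Editorial sketch, not part of the thread: the row-key bloom behavior Todd describes below can be illustrated with a toy model. This is not HBase code; `TinyBloom`, `StoreFile`, and `get` are illustrative stand-ins for an HFile's bloom metadata and the read path. The point is that a get of a particular filename can skip store files whose bloom says the row is definitely absent.]

```python
# Toy illustration of row-key bloom filters (NOT HBase code): each store
# file keeps a small bit array; a get() probes it and skips files that
# definitely do not contain the row key, saving disk seeks.

import hashlib


class TinyBloom:
    """Minimal Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=8192, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive k independent bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely not present"; True means "maybe".
        return all(self.bits & (1 << pos) for pos in self._positions(key))


class StoreFile:
    """Stands in for one store file: rows plus a row-key bloom."""

    def __init__(self, rows):
        self.rows = rows                      # {row_key: image_bytes}
        self.bloom = TinyBloom()
        for row_key in rows:
            self.bloom.add(row_key)


def get(store_files, row_key):
    """Return (value, files_actually_checked) for a row-key lookup."""
    checked = 0
    for sf in store_files:
        if not sf.bloom.might_contain(row_key):
            continue                          # bloom rules it out: no seek
        checked += 1
        if row_key in sf.rows:
            return sf.rows[row_key], checked
    return None, checked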
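```
With one image per store file, a lookup typically probes only the file that actually holds the row; a filename that was never inserted is rejected by every bloom without touching row data.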
-Jack

On Mon, Sep 20, 2010 at 2:06 PM, Todd Lipcon <[email protected]> wrote:
> On Mon, Sep 20, 2010 at 1:13 PM, Jack Levin <[email protected]> wrote:
>> Todd, I could not get Stargate to work on 0.89 for some reason; that's
>> why we are running 0.20.6. Also, regarding bloom filters, I thought
>> they were mainly for column seeking. In our case we have this schema:
>>
>>   row        att:data
>>   filename   file_data
>>
>
> The bloom filters work on either a ROW basis or a ROW_COL basis. If
> you turn on row-key blooms, then your get of a particular filename
> will avoid looking in the store files that don't have any data for
> that row.
>
> Regarding Stargate in 0.89, it's been renamed to "rest" since the old
> REST server got removed. I haven't used it much, but hopefully someone
> can give you a pointer (or, even better, update the wiki/docs!)
>
> -Todd
>
>> -Jack
>>
>> On Mon, Sep 20, 2010 at 11:53 AM, Todd Lipcon <[email protected]> wrote:
>>> Hey Jack,
>>>
>>> This sounds like a very exciting project! A few thoughts that might
>>> help you:
>>> - Check out the bloom filter support that is in the 0.89 series. It
>>>   sounds like all of your access is going to be random key gets;
>>>   adding blooms will save you lots of disk seeks.
>>> - I might even bump the region size up to 1 GB or more given the
>>>   planned capacity.
>>> - The "HA" setup will be tricky; we don't have a great HA story yet.
>>>   Given you have two DCs, you may want to consider running separate
>>>   HBase clusters, one in each, and either using the new replication
>>>   support, or simply doing "client replication" by writing all images
>>>   to both.
>>>
>>> Good luck with the project, and keep us posted on how it goes.
>>>
>>> Thanks,
>>> -Todd
>>>
>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <[email protected]> wrote:
>>>>
>>>> Greetings all.
>>>> My name is Jack, and I work for an image hosting company,
>>>> ImageShack. We also have a property that's widely used as a Twitter
>>>> app called yfrog (yfrog.com).
>>>>
>>>> ImageShack gets close to two million image uploads per day, which
>>>> are usually stored on regular servers (we have about 700) as regular
>>>> files, and each server has its own host name, such as (img55). I've
>>>> been researching how to improve our backend design in terms of data
>>>> safety and stumbled onto the HBase project.
>>>>
>>>> We have been running Hadoop for data access log analysis for a while
>>>> now, quite successfully. We are receiving about 2 billion hits per
>>>> day and store all of that data in RCFiles (attribution to Facebook
>>>> applies here), which are loadable into Hive (thanks to FB again). So
>>>> we know how to manage HDFS and run MapReduce jobs.
>>>>
>>>> Now, I think HBase is the most beautiful thing that has happened to
>>>> the distributed DB world :). The idea is to store image files (about
>>>> 400 KB on average) in HBase. The setup will include the following
>>>> configuration:
>>>>
>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual-core CPUs, and
>>>> 6 x 2 TB disks each
>>>> 3 to 5 ZooKeepers
>>>> 2 Masters (one in each datacenter)
>>>> 10 to 20 Stargate REST instances (one per server, hash load-balanced)
>>>> 40 to 50 RegionServers (will probably keep masters separate on
>>>> dedicated boxes)
>>>> 2 Namenode servers (one backup, highly available; will do fsimage
>>>> and edits snapshots also)
>>>>
>>>> So far I have about 13 servers running, doing about 20 insertions
>>>> per second (file sizes ranging from a few KB to 2-3 MB, avg. 400 KB)
>>>> via the Stargate API. Our frontend servers receive files, and I just
>>>> fork-insert them into Stargate via HTTP (curl).
>>>> The inserts are humming along nicely, without any noticeable load on
>>>> the regionservers; so far I've inserted about 2 TB worth of images.
>>>> I have adjusted the region file size to 512 MB and the table block
>>>> size to about 400 KB, trying to match the average access block size
>>>> and limit HDFS trips. So far the read performance has been more than
>>>> adequate, and of course write performance is nowhere near capacity.
>>>> So right now, all newly uploaded images go to HBase. But we do plan
>>>> to insert about 170 million images (about 100 days' worth), which is
>>>> only about 64 TB, or 10% of the planned cluster size of 600 TB.
>>>> The end goal is to have a storage system that ensures data safety,
>>>> i.e. the system may go down but data cannot be lost. Our front-end
>>>> servers will continue to serve images from their own file systems
>>>> (we are serving about 16 Gbit/s at peak). However, should we need to
>>>> bring any of those down for maintenance, we will redirect all
>>>> traffic to HBase (should be no more than a few hundred Mbit/s) while
>>>> the front-end server is repaired (for example, having its disk
>>>> replaced). After the repairs, we quickly repopulate it with the
>>>> missing files, while serving the remaining missing ones off HBase.
>>>> All in all it should be a very interesting project, and I am hoping
>>>> not to run into any snags; however, should that happen, I am pleased
>>>> to know that such a great and vibrant tech group exists that
>>>> supports and uses HBase :).
>>>>
>>>> -Jack
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
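[Editorial sketch, not part of the thread: the table tuning Jack describes (512 MB regions, ~400 KB blocks) plus Todd's row-bloom suggestion could be expressed roughly as below. The table/family names are assumptions, and the shell option names follow later HBase releases; the exact syntax on 0.20/0.89 may differ.]

```
# hbase shell -- sketch, assuming table 'images' and family 'att';
# option names may differ on 0.20/0.89-era shells.
create 'images', {NAME => 'att',
                  BLOCKSIZE => 409600,      # ~400 KB, matches avg image size
                  BLOOMFILTER => 'ROW'}     # row-key blooms, per Todd's tip

# The 512 MB region split threshold is set cluster-wide in hbase-site.xml:
#   hbase.hregion.max.filesize = 536870912
```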
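[Editorial sketch, not part of the thread: the fork-insert path Jack describes (a frontend receives a file and PUTs it into Stargate over HTTP) might look roughly like this. The table name `images`, column `att:data`, and port 8080 are assumptions to match the schema mentioned above; adjust for a real cluster. Stargate accepts a raw cell value via PUT with an `application/octet-stream` body.]

```python
# Sketch of a Stargate/REST insert, assuming table "images", column
# "att:data", and Stargate on port 8080 -- all assumptions, adjust as
# needed. The row key is the filename, as in Jack's schema.

import urllib.request
from urllib.parse import quote


def build_insert_request(host, filename, image_bytes,
                         table="images", column="att:data"):
    """Build the HTTP PUT that stores one image under its filename row key."""
    url = "http://{}:8080/{}/{}/{}".format(
        host, table, quote(filename, safe=""), column)
    req = urllib.request.Request(url, data=image_bytes, method="PUT")
    req.add_header("Content-Type", "application/octet-stream")
    return req


def insert_image(host, filename, image_bytes):
    """Send the PUT to Stargate; returns the HTTP status code."""
    req = build_insert_request(host, filename, image_bytes)
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

This mirrors what `curl -X PUT -H "Content-Type: application/octet-stream" --data-binary @file` would do from the frontend.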
