Greetings all.  My name is Jack and I work for an image hosting
company, ImageShack; we also have a property that's widely used as a
Twitter app called yfrog (yfrog.com).

ImageShack gets close to two million image uploads per day, which are
usually stored on regular servers (we have about 700) as regular
files, with each server having its own hostname, such as img55.  I've
been researching how to improve our backend design in terms of data
safety and stumbled onto the HBase project.

We have been running Hadoop for data access log analysis for a while
now, quite successfully.  We receive about 2 billion hits per
day and store all of that data in RCFiles (attribution to Facebook
applies here), which are loadable into Hive (thanks to FB again).  So
we know how to manage HDFS and run MapReduce jobs.

Now, I think HBase is the most beautiful thing that has happened to
the distributed DB world :).  The idea is to store image files (about
400KB on average) in HBase.  The setup will include the following
configuration:

50 servers total (2 datacenters), with 8 GB RAM, dual-core CPUs, and 6 x
2TB disks each
3 to 5 ZooKeepers
2 Masters (one in each datacenter)
10 to 20 Stargate REST instances (one per server, hash load-balanced)
40 to 50 RegionServers (will probably keep masters on separate dedicated boxes)
2 NameNode servers (one backup, highly available; will also do fsimage
and edits snapshots)

So far I have about 13 servers running, doing about 20 insertions per
second (file sizes ranging from a few KB to 2-3MB, avg. 400KB) via the
Stargate API.  Our frontend servers receive files, and I just
fork-insert them into Stargate via HTTP (curl).
The inserts are humming along nicely, without any noticeable load on
the regionservers; so far I have inserted about 2 TB worth of images.
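The fork-insert path above could look roughly like this in Python (a sketch, not our production code: the host/port, the "images" table, and the "image:data" column are illustrative assumptions; Stargate accepts raw cell bytes on a single-cell PUT when the Content-Type is application/octet-stream):

```python
import http.client

# Assumed Stargate endpoint; in our setup there is one instance per
# server behind a hash load balancer.
STARGATE_HOST = "localhost"
STARGATE_PORT = 8080

def cell_path(table, rowkey, column):
    """Build the Stargate URL path for a single-cell PUT:
    /<table>/<rowkey>/<family>:<qualifier>"""
    return "/%s/%s/%s" % (table, rowkey, column)

def put_image(rowkey, data):
    """PUT one image's bytes into the (hypothetical) images table."""
    conn = http.client.HTTPConnection(STARGATE_HOST, STARGATE_PORT)
    conn.request(
        "PUT",
        cell_path("images", rowkey, "image:data"),
        body=data,
        headers={"Content-Type": "application/octet-stream"},
    )
    status = conn.getresponse().status
    conn.close()
    return status
```

The curl equivalent would be a `curl -X PUT --data-binary @file` with the same octet-stream header against the same path, which is essentially what the frontend fork does.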
I have adjusted the region file size to 512MB and the table block size
to about 400KB, trying to match the average access size to limit HDFS
round trips.  So far read performance has been more than adequate, and
of course write performance is nowhere near capacity.
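Concretely, those two knobs would be set along these lines (a sketch; 'images' and 'image' are placeholder table and family names, not the real schema):

```xml
<!-- hbase-site.xml: cap region files at 512MB (bytes) -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>536870912</value>
</property>
```

```
# hbase shell: ~400KB block size on the column family (bytes)
create 'images', {NAME => 'image', BLOCKSIZE => '409600'}
```

The idea is that one HFile block read roughly covers one average-sized image, so a typical GET costs a single trip to HDFS.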
So right now, all newly uploaded images go to HBase.  But we do plan
to insert about 170 million images (about 100 days' worth), which is
only about 64 TB, or roughly 10% of the planned cluster size of 600TB.
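The backfill numbers check out on the back of an envelope (treating KB/TB as binary units, i.e. KiB/TiB):

```python
# Planned backfill: ~100 days of uploads at ~400KB per image.
images = 170_000_000
avg_kib = 400
cluster_tb = 600

total_tib = images * avg_kib / 2**30   # KiB -> TiB
fraction = total_tib / cluster_tb

print(total_tib, fraction)  # ~63 TiB, ~10.6% of the cluster
```

That is pre-replication; with HDFS's default 3x replication the backfill would occupy about a third of the raw 600TB.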
The end goal is to have a storage system that provides data safety,
i.e. the system may go down but data cannot be lost.  Our front-end
servers will continue to serve images from their own file systems (we
are serving about 16 Gbit/s at peak); however, should we need to bring
any of them down for maintenance, we will redirect all of that traffic
to HBase (should be no more than a few hundred Mbps) while the front-end
server is repaired (for example, having its disks replaced).  After the
repairs, we quickly repopulate it with the missing files, while serving
whatever is still missing off HBase.
All in all it should be a very interesting project, and I am hoping not
to run into any snags; however, should that happen, I am pleased to know
that such a great and vibrant tech group exists that supports and uses
HBase :).

-Jack
