I don't have empty rows, does that make a difference? E.g., when a row is
inserted, it's always followed by the image data.
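[Editorial sketch, not part of the thread: the row-key bloom behavior Todd describes below can be illustrated with a toy model. This is not HBase code; `TinyBloom`, `StoreFile`, and `get` are illustrative stand-ins for an HFile's bloom metadata and the read path. The point is that a get of a particular filename can skip store files whose bloom says the row is definitely absent.]

```python
# Toy illustration of row-key bloom filters (NOT HBase code): each store
# file keeps a small bit array; a get() probes it and skips files that
# definitely do not contain the row key, saving disk seeks.

import hashlib


class TinyBloom:
    """Minimal Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=8192, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive k independent bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely not present"; True means "maybe".
        return all(self.bits & (1 << pos) for pos in self._positions(key))


class StoreFile:
    """Stands in for one store file: rows plus a row-key bloom."""

    def __init__(self, rows):
        self.rows = rows                      # {row_key: image_bytes}
        self.bloom = TinyBloom()
        for row_key in rows:
            self.bloom.add(row_key)


def get(store_files, row_key):
    """Return (value, files_actually_checked) for a row-key lookup."""
    checked = 0
    for sf in store_files:
        if not sf.bloom.might_contain(row_key):
            continue                          # bloom rules it out: no seek
        checked += 1
        if row_key in sf.rows:
            return sf.rows[row_key], checked
    return None, checked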
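```
With one image per store file, a lookup typically probes only the file that actually holds the row; a filename that was never inserted is rejected by every bloom without touching row data.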
-Jack

On Mon, Sep 20, 2010 at 2:06 PM, Todd Lipcon <[email protected]> wrote:
> On Mon, Sep 20, 2010 at 1:13 PM, Jack Levin <[email protected]> wrote:
>> Todd, I could not get Stargate to work on 0.89 for some reason; that's
>> why we are running 0.20.6. Also, regarding bloom filters, I thought
>> they were mainly for column seeking. In our case we have this schema:
>>
>>   row        att:data
>>   filename   file_data
>>
>
> The bloom filters work on either a ROW basis or a ROW_COL basis. If
> you turn on row-key blooms, then your get of a particular filename
> will avoid looking in the store files that don't have any data for
> that row.
>
> Regarding Stargate in 0.89, it's been renamed to "rest" since the old
> REST server got removed. I haven't used it much, but hopefully someone
> can give you a pointer (or, even better, update the wiki/docs!)
>
> -Todd
>
>> -Jack
>>
>> On Mon, Sep 20, 2010 at 11:53 AM, Todd Lipcon <[email protected]> wrote:
>>> Hey Jack,
>>>
>>> This sounds like a very exciting project! A few thoughts that might
>>> help you:
>>> - Check out the bloom filter support that is in the 0.89 series. It
>>>   sounds like all of your access is going to be random key gets;
>>>   adding blooms will save you lots of disk seeks.
>>> - I might even bump the region size up to 1 GB or more given the
>>>   planned capacity.
>>> - The "HA" setup will be tricky; we don't have a great HA story yet.
>>>   Given you have two DCs, you may want to consider running separate
>>>   HBase clusters, one in each, and either using the new replication
>>>   support, or simply doing "client replication" by writing all images
>>>   to both.
>>>
>>> Good luck with the project, and keep us posted on how it goes.
>>>
>>> Thanks,
>>> -Todd
>>>
>>> On Mon, Sep 20, 2010 at 11:00 AM, Jack Levin <[email protected]> wrote:
>>>>
>>>> Greetings all.
>>>> My name is Jack, and I work for an image hosting company,
>>>> ImageShack. We also have a property that's widely used as a Twitter
>>>> app called yfrog (yfrog.com).
>>>>
>>>> ImageShack gets close to two million image uploads per day, which
>>>> are usually stored on regular servers (we have about 700) as regular
>>>> files, and each server has its own host name, such as (img55). I've
>>>> been researching how to improve our backend design in terms of data
>>>> safety and stumbled onto the HBase project.
>>>>
>>>> We have been running Hadoop for data access log analysis for a while
>>>> now, quite successfully. We are receiving about 2 billion hits per
>>>> day and store all of that data in RCFiles (attribution to Facebook
>>>> applies here), which are loadable into Hive (thanks to FB again). So
>>>> we know how to manage HDFS and run MapReduce jobs.
>>>>
>>>> Now, I think HBase is the most beautiful thing that has happened to
>>>> the distributed DB world :). The idea is to store image files (about
>>>> 400 KB on average) in HBase. The setup will include the following
>>>> configuration:
>>>>
>>>> 50 servers total (2 datacenters), with 8 GB RAM, dual-core CPUs, and
>>>> 6 x 2 TB disks each
>>>> 3 to 5 ZooKeepers
>>>> 2 Masters (one in each datacenter)
>>>> 10 to 20 Stargate REST instances (one per server, hash load-balanced)
>>>> 40 to 50 RegionServers (will probably keep masters separate on
>>>> dedicated boxes)
>>>> 2 Namenode servers (one backup, highly available; will do fsimage
>>>> and edits snapshots also)
>>>>
>>>> So far I have about 13 servers running, doing about 20 insertions
>>>> per second (file sizes ranging from a few KB to 2-3 MB, avg. 400 KB)
>>>> via the Stargate API. Our frontend servers receive files, and I just
>>>> fork-insert them into Stargate via HTTP (curl).
>>>> The inserts are humming along nicely, without any noticeable load on
>>>> the regionservers; so far I've inserted about 2 TB worth of images.
>>>> I have adjusted the region file size to 512 MB and the table block
>>>> size to about 400 KB, trying to match the average access block size
>>>> and limit HDFS trips. So far the read performance has been more than
>>>> adequate, and of course write performance is nowhere near capacity.
>>>> So right now, all newly uploaded images go to HBase. But we do plan
>>>> to insert about 170 million images (about 100 days' worth), which is
>>>> only about 64 TB, or 10% of the planned cluster size of 600 TB.
>>>> The end goal is to have a storage system that ensures data safety,
>>>> i.e. the system may go down but data cannot be lost. Our front-end
>>>> servers will continue to serve images from their own file systems
>>>> (we are serving about 16 Gbit/s at peak). However, should we need to
>>>> bring any of those down for maintenance, we will redirect all
>>>> traffic to HBase (should be no more than a few hundred Mbit/s) while
>>>> the front-end server is repaired (for example, having its disk
>>>> replaced). After the repairs, we quickly repopulate it with the
>>>> missing files, while serving the remaining missing ones off HBase.
>>>> All in all it should be a very interesting project, and I am hoping
>>>> not to run into any snags; however, should that happen, I am pleased
>>>> to know that such a great and vibrant tech group exists that
>>>> supports and uses HBase :).
>>>>
>>>> -Jack
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
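[Editorial sketch, not part of the thread: the table tuning Jack describes (512 MB regions, ~400 KB blocks) plus Todd's row-bloom suggestion could be expressed roughly as below. The table/family names are assumptions, and the shell option names follow later HBase releases; the exact syntax on 0.20/0.89 may differ.]

```
# hbase shell -- sketch, assuming table 'images' and family 'att';
# option names may differ on 0.20/0.89-era shells.
create 'images', {NAME => 'att',
                  BLOCKSIZE => 409600,      # ~400 KB, matches avg image size
                  BLOOMFILTER => 'ROW'}     # row-key blooms, per Todd's tip

# The 512 MB region split threshold is set cluster-wide in hbase-site.xml:
#   hbase.hregion.max.filesize = 536870912
```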
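[Editorial sketch, not part of the thread: the fork-insert path Jack describes (a frontend receives a file and PUTs it into Stargate over HTTP) might look roughly like this. The table name `images`, column `att:data`, and port 8080 are assumptions to match the schema mentioned above; adjust for a real cluster. Stargate accepts a raw cell value via PUT with an `application/octet-stream` body.]

```python
# Sketch of a Stargate/REST insert, assuming table "images", column
# "att:data", and Stargate on port 8080 -- all assumptions, adjust as
# needed. The row key is the filename, as in Jack's schema.

import urllib.request
from urllib.parse import quote


def build_insert_request(host, filename, image_bytes,
                         table="images", column="att:data"):
    """Build the HTTP PUT that stores one image under its filename row key."""
    url = "http://{}:8080/{}/{}/{}".format(
        host, table, quote(filename, safe=""), column)
    req = urllib.request.Request(url, data=image_bytes, method="PUT")
    req.add_header("Content-Type", "application/octet-stream")
    return req


def insert_image(host, filename, image_bytes):
    """Send the PUT to Stargate; returns the HTTP status code."""
    req = build_insert_request(host, filename, image_bytes)
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

This mirrors what `curl -X PUT -H "Content-Type: application/octet-stream" --data-binary @file` would do from the frontend.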
