Using memcached as a distributed file cache

2009-11-02 Thread Jay Paroline

I'm running this by you guys to make sure we're not trying something
completely insane. ;)

We already rely on memcached quite heavily to minimize load on our DB
with stunning success, but as a music streaming service, we also serve
up lots and lots of 5-6MB files, and right now we don't have a
distributed cache of any kind, just lots and lots of really fast
disks. Due to the nature of our content, we have some files that are
insanely popular, and a lot of long tail content that gets played
infrequently. I don't remember the exact numbers, but I'd guesstimate
that the top 50GB of our many TB of files accounts for 40-60% of our
streams on any given day.

What I'd love to do is get those popular files served from memory,
which should alleviate load on the disks considerably. Obviously the
file system cache does some of this already, but since it's not
distributed it uses the space a lot less efficiently than a
distributed cache would (say one popular file lives on 3 stream nodes,
it's going to be cached in memory 3 separate times instead of just
once).  We have multiple stream servers, obviously, and between them
we could probably scrounge up 50GB or more for memcached,
theoretically removing the disk load for all of the most popular
content.

My favorite memory cache is of course memcached, so I'm wondering if
this would be an appropriate use (with the maximum item/slab size turned
way up, obviously). We're going to start doing some experiments with it, but
I'm wondering what the community thinks.
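
For what it's worth, here's a rough sketch of the chunking that storing
5-6MB files would take, since they won't fit under memcached's default 1MB
item limit. The client (pymemcache), host names, and key scheme are just
assumptions for illustration, not anything we've built:

    from pymemcache.client.hash import HashClient

    CHUNK = 1000 * 1000  # stay safely under memcached's default ~1MB item limit

    # hypothetical stream-node hosts donating spare RAM to the cache
    mc = HashClient([('stream01', 11211), ('stream02', 11211)])

    def cache_file(path, data):
        # split the file into chunks and store a chunk count alongside them
        chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        mc.set('meta:' + path, str(len(chunks)).encode())
        for n, chunk in enumerate(chunks):
            mc.set('chunk:%s:%d' % (path, n), chunk)

    def fetch_file(path):
        count = mc.get('meta:' + path)
        if count is None:
            return None                      # miss: fall back to disk
        parts = []
        for n in range(int(count)):
            part = mc.get('chunk:%s:%d' % (path, n))
            if part is None:
                return None                  # any evicted chunk means a miss
            parts.append(part)
        return b''.join(parts)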

Thanks,

Jay


Re: Using memcached as a distributed file cache

2009-11-02 Thread Adam Lee
I'm guessing you might get better mileage out of using something written
more for this purpose, e.g. squid set up as a reverse proxy.

-- 
awl


Re: Using memcached as a distributed file cache

2009-11-02 Thread Jay Paroline

I'm not sure how well a reverse proxy would fit our needs, having
never used one before. The way we do streaming is that a client sends a
one-time-use key to the stream server. The key is used to determine which
file should be streamed, and then the file is returned. The effect is
that no two requests are identical, and that code must be run for
every single request to verify it and look up the appropriate
file. Is it possible or practical to use a reverse proxy in that way?
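
Purely to illustrate that flow (this is not our actual code, and it
assumes, just for the sketch, that the one-time keys live in memcached and
map to a file path; the key names are made up):

    from pymemcache.client.base import Client

    mc = Client(('127.0.0.1', 11211))

    def resolve_stream_key(key):
        # one-time-use key -> file path; consume the key so it can't be replayed
        path = mc.get('streamkey:' + key)
        if path is None:
            return None                    # unknown, expired, or already used
        mc.delete('streamkey:' + key)      # (a real version would consume atomically)
        return path.decode()               # the file to stream back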

Jay


Re: Using memcached as a distributed file cache

2009-11-02 Thread Adam Lee
So you actually give back the file contents in the response, not the URL to
the media? If so, then that does complicate things a little bit.  I still
think that memcached might not be the best solution for this, though it
could obviously be configured to do it.

-- 
awl


Re: Using memcached as a distributed file cache

2009-11-02 Thread dormando

You could put something like varnish in between that final step and your
client...

So the key is pulled in, the file is looked up, and then the file is fetched
*through* varnish. Of course, I don't know offhand how much work it would be
to make your app deal with that fetch-through scenario.
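
A rough sketch of what the fetch-through could look like from the app side;
the port and URL layout here are just assumptions, nothing varnish requires:

    import urllib.parse
    import urllib.request

    VARNISH = 'http://127.0.0.1:6081'       # local varnish in front of the file origin

    def fetch_through_cache(path):
        # varnish caches by this internal URL, so the same file collapses onto
        # one cache entry even though every client-facing request URL is unique
        url = VARNISH + '/media/' + urllib.parse.quote(path)
        return urllib.request.urlopen(url)  # file-like object; stream it to the client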

Since these files are large, memcached probably isn't the best bet for
this.


Re: Using memcached as a distributed file cache

2009-11-02 Thread Les Mikesell


You could also redirect the client to the proxy/cache after computing 
the filename, but that exposes the name in a way that might be reusable.


--
  Les Mikesell
   lesmikes...@gmail.com



Re: Using memcached as a distributed file cache

2009-11-02 Thread dormando

> You could also redirect the client to the proxy/cache after computing the
> filename, but that exposes the name in a way that might be reusable.

Perlbal is great for this... I think nginx might be able to do it too?
It's an internal "reproxy": the server returns headers telling the load
balancer where to re-run the request. Mostly it's used for looking up
mogilefs addresses, but it could also be used to redirect files through
caches and such.
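
Roughly, assuming a perlbal-style balancer in front that honors an
X-REPROXY-URL header (the cache host and paths are made up):

    def reproxy_response(start_response, path):
        # The app has already validated the one-time key and resolved `path`.
        # Instead of sending the bytes itself, it returns an empty 200 with
        # X-REPROXY-URL; the balancer re-runs the fetch against that URL,
        # which can sit behind a cache.
        start_response('200 OK', [
            ('X-REPROXY-URL', 'http://cache01.internal/media/' + path),
        ])
        return [b'']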


Re: Using memcached as a distributed file cache

2009-11-02 Thread Jay Paroline

Adam: yes, we serve up the file contents, not the URL to the media.
lighttpd makes this simple with the X-Sendfile header.
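
The handoff is tiny; roughly something like this (the paths are made up):

    def sendfile_response(start_response, path):
        # empty body plus X-Sendfile: after the app has done the key check,
        # lighttpd reads /music/<path> itself and serves it with sendfile()
        start_response('200 OK', [('X-Sendfile', '/music/' + path)])
        return [b'']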

Having not used varnish or squid before, do either of them support some form
of distributed memory cache, so that rather than buying a single expensive
box with tons of memory, we can aggregate the free memory of a bunch of less
expensive boxes, as with memcached?

Jay


Re: Using memcached as a distributed file cache

2009-11-02 Thread Adam Lee
You could also do a relatively simple solution: tack a two-digit shard
ID onto the front of your key, then use it to direct the request to a
specific cluster internally. Give the clusters a lot of RAM and rely on OS
filesystem caching to keep frequently requested files in memory. It would
be very easy and cheap to build.
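
Something like this, say (the cluster names are made up):

    import hashlib

    CLUSTERS = ['stream-a.internal', 'stream-b.internal', 'stream-c.internal']

    def shard_id(key):
        # stable two-digit shard ID derived from the key
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % 100

    def prefixed_key(key):
        return '%02d:%s' % (shard_id(key), key)    # "tack it onto the front"

    def cluster_for(key):
        # map the shard to a specific internal cluster; each cluster's OS page
        # cache then keeps its slice of the popular files hot
        return CLUSTERS[shard_id(key) % len(CLUSTERS)]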

-- 
awl


Re: Using memcached as a distributed file cache

2009-11-02 Thread Mark Atwood



Take a look at the Apache module mod_memcached.



--
Mark Atwood 





Re: Using memcached as a distributed file cache

2009-11-02 Thread dormando

If you have control over the reproxy, you can do a simple hash against the
list of all machines you have. Varnish/squid can also do internal forwards
after hashing.

It's a little weird, but varnish affords you a lot of smarts on the
serving end.

Another thing worth noting, I guess, is that with mogilefs (getting even
more off topic...) a given file is usually served from one server at a
time. So once it gets into the filesystem cache on that server, sendfile(2)
from lighttpd or nginx or whatever is very fast. If you're round-robining
across a bunch of servers, or don't have enough RAM on your storage nodes,
that isn't as helpful.


Re: Using memcached as a distributed file cache

2009-11-02 Thread Vladimir Vuksan


Perhaps using tmpfs is an option. The benefit of tmpfs is that you can
create a filesystem larger than physical memory, and the virtual memory
manager will swap unused items out to disk. You could then NFS-export the
filesystem or do something else. Difficult to say without additional
details.


Vladimir
