Grid data placement

2013-01-15 Thread Dimitri Maziuk
Hi everyone,

quick question: can I get Ceph to replicate a bunch of files to every
host in the compute cluster and then have those hosts read those files
from local disk?

From TFM it looks like a custom CRUSH map should get the files to [an
OSD on] every host, but I'm not clear on the read step: do I need an MDS
on every host and mount the fs off localhost's MDS?
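
A rough sketch of the kind of CRUSH rule I'd guess this takes (rule
name, numbers, and pool name below are made up; it assumes the pool's
replica count is raised to the number of hosts so that "firstn 0" ends
up picking every host):

    rule one-per-host {
            ruleset 1
            type replicated
            min_size 1
            max_size 64
            step take default
            # one replica per distinct host, up to the pool's size
            step chooseleaf firstn 0 type host
            step emit
    }

    # e.g. with 12 hosts, on the (default) "data" pool:
    ceph osd pool set data size 12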

(We have $APP running on the cluster, normally one instance per CPU
core, that mmap's (read-only) ~30GB of binary files. I/O over NFS kills
the cluster even with just a few hosts. Currently the files are rsync'ed
to every host at the start of the batch; that'll only scale to a few
dozen hosts at best.)
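
The idea would be to replace that rsync step with a CephFS mount on each
compute node and have $APP mmap the files off the mount instead -- very
roughly, with the monitor address, mount point, and secret file below
just placeholders:

    # on each compute node (kernel CephFS client)
    mount -t ceph 10.0.0.1:6789:/ /mnt/ceph \
        -o name=admin,secretfile=/etc/ceph/admin.secret

    # $APP then reads under /mnt/ceph instead of the local rsync'ed copy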

TIA,
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu







Re: Grid data placement

2013-01-15 Thread Dimitri Maziuk
On 01/15/2013 12:36 PM, Gregory Farnum wrote:
> On Tue, Jan 15, 2013 at 10:33 AM, Dimitri Maziuk dmaz...@bmrb.wisc.edu
> wrote:
>
>> At the start of the batch #cores-in-the-cluster processes try to mmap
>> the same 2GB and start reading it from SEEK_SET at the same time. I
>> won't know until I try but I suspect it won't like that.
>
> Well, it'll be #servers-in-cluster serving up 4MB chunks out of cache.
> It's possible you could overwhelm their networking but my bet is
> they'll just get spread out slightly on the first block and then not
> contend in the future.

In the future the application spreads out the reads as well: running
instances go through the data at different speeds, and when one's
finished, the next one starts on the same core and it mmap's the first
chunk again.

> Just as long as you're thinking of it as a test system that would make
> us very happy. :)

Well, IRL this is throw-away data generated at the start of a batch, and
we're good if one batch a month runs to completion. So if it doesn't
crash all the time every time, that actually should be good enough for
me. However, not all of the nodes have spare disk slots, so I couldn't
do a full-scale deployment anyway, not without rebuilding half the nodes.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: Grid data placement

2013-01-15 Thread Gregory Farnum
On Tue, Jan 15, 2013 at 11:00 AM, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:
> On 01/15/2013 12:36 PM, Gregory Farnum wrote:
>> On Tue, Jan 15, 2013 at 10:33 AM, Dimitri Maziuk dmaz...@bmrb.wisc.edu
>> wrote:
>>
>>> At the start of the batch #cores-in-the-cluster processes try to mmap
>>> the same 2GB and start reading it from SEEK_SET at the same time. I
>>> won't know until I try but I suspect it won't like that.
>>
>> Well, it'll be #servers-in-cluster serving up 4MB chunks out of cache.
>> It's possible you could overwhelm their networking but my bet is
>> they'll just get spread out slightly on the first block and then not
>> contend in the future.
>
> In the future the application spreads out the reads as well: running
> instances go through the data at different speeds, and when one's
> finished, the next one starts on the same core and it mmap's the first
> chunk again.
>
>> Just as long as you're thinking of it as a test system that would make
>> us very happy. :)
>
> Well, IRL this is throw-away data generated at the start of a batch, and
> we're good if one batch a month runs to completion. So if it doesn't
> crash all the time every time, that actually should be good enough for
> me. However, not all of the nodes have spare disk slots, so I couldn't
> do a full-scale deployment anyway, not without rebuilding half the nodes.

In that case you are my favorite kind of user and you should install
and try it out right away! :D