Grid data placement
Hi everyone, quick question: can I get ceph to replicate a bunch of files to every host in a compute cluster and then have those hosts read those files from local disk? TFM looks like a custom crush map should get the files to [an osd on] every host, but I'm not clear on the read step: do I need an mds on every host and mount the fs off localhost's mds?

(We've got $APP running on the cluster, normally one instance per cpu core, that mmap's (read-only) ~30GB of binary files. I/O over NFS kills the cluster even with a few hosts. Currently the files are rsync'ed to every host at the start of the batch; that'll only scale to a few dozen hosts at best.)

TIA,
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
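[For reference, the placement half of this is usually done with a CRUSH rule along these lines. This is only a sketch: the rule name is made up, the syntax is the pre-Hammer text format, and you would still have to set the pool's replica count to the number of hosts (e.g. `ceph osd pool set <pool> size N`) for a copy to actually land on every host.]

```
rule everyhost {
    ruleset 1
    type replicated
    min_size 1
    max_size 64
    step take default
    # pick one leaf (osd) per distinct host, up to the pool's size
    step chooseleaf firstn 0 type host
    step emit
}
```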
Re: Grid data placement
On 01/15/2013 12:36 PM, Gregory Farnum wrote:
> On Tue, Jan 15, 2013 at 10:33 AM, Dimitri Maziuk <dmaz...@bmrb.wisc.edu> wrote:
>> At the start of the batch #cores-in-the-cluster processes try to mmap
>> the same 2GB and start reading it from SEEK_SET at the same time. I
>> won't know until I try but I suspect it won't like that.
>
> Well, it'll be #servers-in-cluster serving up 4MB chunks out of cache.
> It's possible you could overwhelm their networking but my bet is
> they'll just get spread out slightly on the first block and then not
> contend in the future.

In the future the application spreads out the reads as well: running instances go through the data at different speeds, and when one's finished, the next one starts on the same core and mmap's the first chunk again.

> Just as long as you're thinking of it as a test system that would make
> us very happy. :)

Well, IRL this is throw-away data generated at the start of a batch, and we're good if one batch a month runs to completion. So if it doesn't crash all the time every time, that actually should be good enough for me. However, not all of the nodes have spare disk slots, so I couldn't do a full-scale deployment anyway, not without rebuilding half the nodes.

--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
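[The page-cache sharing Greg describes is the crux: however many instances run per host, each page of the file is faulted in from storage at most once. A minimal sketch of that read pattern, in Python for illustration only -- the real $APP and its file layout are not shown in the thread, and the function names here are made up:]

```python
import mmap
import os

def map_readonly(path):
    """Map a file read-only. All processes mapping the same file share
    the kernel page cache, so N workers per host hit disk (or the
    network filesystem) for each page at most once."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        return mmap.mmap(fd, size, prot=mmap.PROT_READ)
    finally:
        os.close(fd)  # the mapping holds its own reference to the file

def sequential_checksum(mm, chunk=4 << 20):
    """Walk the mapping front to back in 4MB steps -- the same
    'start at SEEK_SET and read straight through' pattern as the
    batch jobs described above."""
    if hasattr(mm, "madvise"):  # Python 3.8+ on Unix
        mm.madvise(mmap.MADV_SEQUENTIAL)  # hint: aggressive read-ahead
    total = 0
    for off in range(0, len(mm), chunk):
        total += sum(mm[off:off + chunk])
    return total
```

[Whether the first wave of simultaneous readers contends is then mostly a question of how fast the serving hosts can fill their caches for the first few chunks.]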
Re: Grid data placement
On Tue, Jan 15, 2013 at 11:00 AM, Dimitri Maziuk <dmaz...@bmrb.wisc.edu> wrote:
> [...]
> Well, IRL this is throw-away data generated at the start of a batch,
> and we're good if one batch a month runs to completion. So if it
> doesn't crash all the time every time, that actually should be good
> enough for me. However, not all of the nodes have spare disk slots, so
> I couldn't do a full-scale deployment anyway, not without rebuilding
> half the nodes.

In that case you are my favorite kind of user and you should install and try it out right away! :D