Re: [9fans] ceph
> 2. do we have anybody successfully managing that much storage that is
> also spread across the nodes? And if so, what are the best practices
> out there for making the client not worry about where the storage
> actually comes from (IOW, any kind of proxying of I/O, etc)?

http://labs.google.com/papers/gfs.html
http://hadoop.apache.org/common/docs/current/hdfs_design.html

> I'm trying to see what life after NFSv4 or AFS might look like for the
> clients still clinging to the old ways of doing things, yet trying to
> cooperatively use hundreds of T of storage.

the two I mention above are both used in conjunction with distributed
map/reduce calculations. Calculations are done on the nodes where the
data is stored...

> Thanks, Roman.

Tim Newsham
http://www.thenewsh.com/~newsham/
Re: [9fans] ceph
> Google?

the exception that proves the rule? they emphatically don't go for
posix semantics...

why would purveyors of 9p give a rip about posix semantics?

- erik
Re: [9fans] ceph
> Well, with Linux, at least you have the benefit of a gazillion FS
> clients being available either natively or via FUSE.

Do you have a link to a site which lists interesting FUSE filesystems?
I am definitely not trying to troll, I am always intrigued by others'
ideas of how to represent data/APIs as a fs.

Sadly, the FUSE filesystems I have seen have mostly been disappointing.
There are a few which would be handy on plan9 (gmail, ipod, svn) but
most seem less useful.

-Steve
Re: [9fans] ceph
2009/8/4 erik quanstrom quans...@quanstro.net:
> Google? the exception that proves the rule? they emphatically don't go
> for posix semantics... why would purveyors of 9p give a rip about
> posix semantics?

from ron:

> 10,000 machines, working on a single app, must have access to a common
> file store with full posix semantics and it all has to work like it
> were one machine (their desktop, of course).

... and gfs does things that aren't easily compatible with 9p either,
such as returning the actually-written offset when appending to a file.
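To make that mismatch concrete: a 9P Twrite (like a POSIX pwrite) carries
an offset chosen by the client, while GFS's record append lets the server
pick the offset and report it back in the reply. A rough C sketch of the
difference; gfs_record_append() is a made-up name standing in for the
non-public GFS client call, and the stub here only imitates the behaviour
on a local file:

/*
 * Illustration only: GFS-style record append vs. POSIX/9P-style write.
 * gfs_record_append() is invented; a real client would do an RPC to the
 * chunkservers, which append atomically and choose the offset.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static long long
gfs_record_append(int fd, const void *buf, size_t n)
{
	/* pretend the server appended atomically and returned the offset */
	off_t off = lseek(fd, 0, SEEK_END);
	if (write(fd, buf, n) != (ssize_t)n)
		return -1;
	return (long long)off;
}

int
main(void)
{
	const char msg[] = "one record\n";
	int fd = open("log", O_RDWR | O_CREAT, 0666);
	if (fd < 0)
		return 1;

	/* POSIX/9P style: the client names the offset in the request. */
	pwrite(fd, msg, sizeof msg - 1, 0);

	/* GFS style: the client learns the offset from the reply. */
	long long off = gfs_record_append(fd, msg, sizeof msg - 1);
	printf("record landed at offset %lld\n", off);

	close(fd);
	return 0;
}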
Re: [9fans] ceph
> they emphatically don't go for posix semantics...

what are posix semantics?
Re: [9fans] ceph
2009/8/4 C H Forsyth fors...@vitanuova.com:
>> they emphatically don't go for posix semantics...
> what are posix semantics?

perhaps wrongly, i'd assumed that the posix standard implied some
semantics in defining its file API, and ron was referring to those.
perhaps it defines less than i assume - i've not studied it.

i was alluding to this sentence from the paper:

: GFS provides a familiar file system interface, though it
: does not implement a standard API such as POSIX.
Re: [9fans] ceph
On Tue, Aug 4, 2009 at 2:55 AM, C H Forsyth fors...@vitanuova.com wrote:
>> they emphatically don't go for posix semantics...
> what are posix semantics?

whatever today's customer happens to think they are.

ron
Re: [9fans] ceph
On Mon, 2009-08-03 at 19:56 -0700, ron minnich wrote:
>> 2. do we have anybody successfully managing that much storage that is
>> also spread across the nodes? And if so, what are the best practices
>> out there for making the client not worry about where the storage
>> actually comes from (IOW, any kind of proxying of I/O, etc)?
> Google?

By "we" I mostly meant this community, but even if we don't focus on
9fans, Google is a non-example. They have no clients for this
filesystem per se.

>>> The request: for each of the (lots of) compute nodes, have them mount
>>> over 9p to, say, 100x fewer io nodes, each of those to run lustre.
>> Sorry for being dense, but what exactly is going to be accomplished by
>> proxying I/O in such a way?
> it makes the unscalable distributed lock manager and other such stuff
> work, because you stop asking it to scale.

So strictly speaking you are not really using 9P as a filesystem
protocol, but rather as a convenient way of doing RPC, right?

Thanks,
Roman.
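For what it's worth, a schematic of the io-node idea described above; the
request struct, handler names and the /lustre path are invented for
illustration, and the actual 9P message handling is omitted. The point is
only that the io node is the lone Lustre client: every compute node's 9P
Tread/Twrite becomes a pread/pwrite on a descriptor held locally, so the
Lustre lock manager sees ~100 io nodes rather than ~10,000 compute nodes:

/*
 * Schematic of the io-node proxy (not a real 9P server).
 */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

struct ioreq {			/* what a 9P Tread/Twrite boils down to */
	unsigned long long offset;
	unsigned int count;
	unsigned char *data;
};

static int lustrefd;		/* one shared fd, opened on the Lustre mount */

static int
handle_tread(struct ioreq *r, unsigned char *reply)
{
	/* the offset comes from the 9P message, not from a local seek pointer */
	ssize_t n = pread(lustrefd, reply, r->count, r->offset);
	return n < 0 ? -1 : (int)n;
}

static int
handle_twrite(struct ioreq *r)
{
	ssize_t n = pwrite(lustrefd, r->data, r->count, r->offset);
	return n < 0 ? -1 : (int)n;
}

int
main(void)
{
	lustrefd = open("/lustre/scratch/data", O_RDWR);	/* hypothetical path */
	if (lustrefd < 0) {
		perror("open");
		return 1;
	}
	/* a real server would loop here: accept 9P messages from compute
	 * nodes, dispatch Tread/Twrite to the handlers above, and send
	 * Rread/Rwrite replies back; omitted. */
	unsigned char buf[8192];
	struct ioreq r = { 0, sizeof buf, 0 };
	printf("read %d bytes at offset 0\n", handle_tread(&r, buf));
	close(lustrefd);
	return 0;
}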
Re: [9fans] ceph
On Tue, 2009-08-04 at 10:55 +0100, C H Forsyth wrote:
>> they emphatically don't go for posix semantics...
> what are posix semantics?

I'll bite:

http://www.opengroup.org/onlinepubs/009695399/

[ anything else that would take an FD as an argument ]

http://www.opengroup.org/onlinepubs/009695399/

Thanks,
Roman.
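One concrete item from that spec which bites on a cluster filesystem:
fcntl() advisory record locks are supposed to work between any two
processes that can open the file, even when they sit on different nodes,
which is exactly the distributed-lock-manager problem mentioned earlier in
the thread. A minimal sketch (the path is hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int fd = open("/cluster/shared/counter", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct flock fl = {
		.l_type = F_WRLCK,	/* exclusive lock */
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0,		/* whole file */
	};

	/* POSIX says this blocks until no other process - on any node -
	 * holds a conflicting lock on the same byte range. */
	if (fcntl(fd, F_SETLKW, &fl) < 0) {
		perror("fcntl");
		return 1;
	}

	/* ... read-modify-write under the lock ... */

	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);
	close(fd);
	return 0;
}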
Re: [9fans] ceph
On Tue, 2009-08-04 at 09:43 +0100, Steve Simon wrote:
>> Well, with Linux, at least you have the benefit of a gazillion FS
>> clients being available either natively or via FUSE.
> Do you have a link to a site which lists interesting FUSE filesystems?
> I am definitely not trying to troll, I am always intrigued by others'
> ideas of how to represent data/APIs as a fs.

I don't, and I probably should start documenting it. The easiest way to
find them, though, is to be subscribed to the fuse ML and collect the
domain names of posters.

Turns out that anybody who's doing cloud storage these days does it via
FUSE (which might not be as surprising if you think about what the
dominant OS on EC2 is). You have companies ranging from startups
(http://www.nirvanix.com/) all the way to tyrannosaurs like EMC and IBM
betting on FUSE to get them to storage in the cloud. Sadly, none of
them are open source as far as I can tell.

> Sadly, the FUSE filesystems I have seen have mostly been disappointing.
> There are a few which would be handy on plan9 (gmail, ipod, svn) but
> most seem less useful.

The open-source ones are not all that impressive, I agree.

Thanks,
Roman.
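For anyone who hasn't written one: the FUSE programming model is just a
table of callbacks handed to fuse_main(). A stripped-down read-only
filesystem along the lines of the hello example that ships with libfuse
2.x (the API differs in later FUSE versions):

/*
 * One read-only file, /hello, containing a fixed string.
 * Build, roughly: gcc hellofs.c `pkg-config fuse --cflags --libs` -o hellofs
 * Run: ./hellofs /mnt/point
 */
#define FUSE_USE_VERSION 26

#include <fuse.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <sys/stat.h>

static const char *hello_str = "hello from user space\n";
static const char *hello_path = "/hello";

static int hello_getattr(const char *path, struct stat *st)
{
	memset(st, 0, sizeof *st);
	if (strcmp(path, "/") == 0) {
		st->st_mode = S_IFDIR | 0755;
		st->st_nlink = 2;
	} else if (strcmp(path, hello_path) == 0) {
		st->st_mode = S_IFREG | 0444;
		st->st_nlink = 1;
		st->st_size = strlen(hello_str);
	} else
		return -ENOENT;
	return 0;
}

static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
	(void)offset; (void)fi;
	if (strcmp(path, "/") != 0)
		return -ENOENT;
	filler(buf, ".", NULL, 0);
	filler(buf, "..", NULL, 0);
	filler(buf, hello_path + 1, NULL, 0);
	return 0;
}

static int hello_open(const char *path, struct fuse_file_info *fi)
{
	if (strcmp(path, hello_path) != 0)
		return -ENOENT;
	if ((fi->flags & O_ACCMODE) != O_RDONLY)
		return -EACCES;
	return 0;
}

static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{
	(void)fi;
	size_t len = strlen(hello_str);
	if (strcmp(path, hello_path) != 0)
		return -ENOENT;
	if ((size_t)offset >= len)
		return 0;
	if (offset + size > len)
		size = len - offset;
	memcpy(buf, hello_str + offset, size);
	return (int)size;
}

static struct fuse_operations hello_ops = {
	.getattr = hello_getattr,
	.readdir = hello_readdir,
	.open    = hello_open,
	.read    = hello_read,
};

int main(int argc, char *argv[])
{
	return fuse_main(argc, argv, &hello_ops, NULL);
}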
Re: [9fans] ceph
On Mon, 2009-08-03 at 21:23 -1000, Tim Newsham wrote:
>> 2. do we have anybody successfully managing that much storage that is
>> also spread across the nodes? And if so, what are the best practices
>> out there for making the client not worry about where the storage
>> actually comes from (IOW, any kind of proxying of I/O, etc)?
> http://labs.google.com/papers/gfs.html
> http://hadoop.apache.org/common/docs/current/hdfs_design.html
>> I'm trying to see what life after NFSv4 or AFS might look like for the
>> clients still clinging to the old ways of doing things, yet trying to
>> cooperatively use hundreds of T of storage.
> the two I mention above are both used in conjunction with distributed
> map/reduce calculations. Calculations are done on the nodes where the
> data is stored...

Hadoop and GFS are good examples and they work great for the single
distributed application that is *written* with them in mind.
Unfortunately, I cannot stretch my imagination hard enough to see them
as general purpose filesystems backing up data for gazillions of
non-cooperative applications, which is the sort of thing NFS and AFS
were built to accomplish.

In that respect, ceph is more what I have in mind: it assembles storage
from clusters of unrelated OSDs into a hierarchy with a single point of
entry for every user/application. The question, however, is how to
avoid the complexity of ceph and still have it look like a humongous
kenfs or fossil from the outside.

Thanks,
Roman.
Re: [9fans] ceph
> Hadoop and GFS are good examples and they work great for the single
> distributed application that is *written* with them in mind.
> Unfortunately, I cannot stretch my imagination hard enough to see them
> as general purpose filesystems backing up data for gazillions of
> non-cooperative applications, which is the sort of thing NFS and AFS
> were built to accomplish.

I *think* the folks at google also use GFS for shared $HOME (ie. to
stash files they want to share with others). I could be wrong.

> Roman.

Tim Newsham
http://www.thenewsh.com/~newsham/
Re: [9fans] ceph
On Sat, 2009-08-01 at 08:47 -0700, ron minnich wrote:
>> What are their requirements as far as POSIX is concerned?
> 10,000 machines, working on a single app, must have access to a common
> file store with full posix semantics and it all has to work like it
> were one machine (their desktop, of course). This gets messy. It turns
> into an exercise of attempting to manage a competing set of race
> conditions. It's like tuning a multi-carbureted engine from years gone
> by, assuming we ever had an engine with 10,000 cylinders.

Well, with Linux, at least you have the benefit of a gazillion FS
clients being available either natively or via FUSE. With Solaris...
oh well...

>> How much storage are we talking about?
> In round numbers, for the small clusters, usually a couple hundred T.
> For anything else, more.

Is all of this storage attached to a very small number of IO nodes, or
is it evenly spread across the cluster? In fact, I'm interested in both
scenarios, so here come two questions:

1. do we have anybody successfully managing that much storage (let's
say ~100T) via something like a humongous fossil installation (or kenfs
for that matter)?

2. do we have anybody successfully managing that much storage that is
also spread across the nodes? And if so, what are the best practices
out there for making the client not worry about where the storage
actually comes from (IOW, any kind of proxying of I/O, etc)?

I'm trying to see what life after NFSv4 or AFS might look like for the
clients still clinging to the old ways of doing things, yet trying to
cooperatively use hundreds of T of storage.

>> I'd be interested in discussing some aspects of what you're trying to
>> accomplish with 9P for the HPC guys.
> The request: for each of the (lots of) compute nodes, have them mount
> over 9p to, say, 100x fewer io nodes, each of those to run lustre.

Sorry for being dense, but what exactly is going to be accomplished by
proxying I/O in such a way?

Thanks,
Roman.
Re: [9fans] ceph
2009/8/4 ron minnich rminn...@gmail.com:
>> 2. do we have anybody successfully managing that much storage that is
>> also spread across the nodes? And if so, what are the best practices
>> out there for making the client not worry about where the storage
>> actually comes from (IOW, any kind of proxying of I/O, etc)?
> Google?

the exception that proves the rule? they emphatically don't go for
posix semantics...
Re: [9fans] ceph
On Fri, Jul 31, 2009 at 10:53 PM, Roman Shaposhnik r...@sun.com wrote:
> What are your clients running?

Linux

> What are their requirements as far as POSIX is concerned?

10,000 machines, working on a single app, must have access to a common
file store with full posix semantics and it all has to work like it
were one machine (their desktop, of course). This gets messy. It turns
into an exercise of attempting to manage a competing set of race
conditions. It's like tuning a multi-carbureted engine from years gone
by, assuming we ever had an engine with 10,000 cylinders.

> How much storage are we talking about?

In round numbers, for the small clusters, usually a couple hundred T.
For anything else, more.

> I'd be interested in discussing some aspects of what you're trying to
> accomplish with 9P for the HPC guys.

The request: for each of the (lots of) compute nodes, have them mount
over 9p to, say, 100x fewer io nodes, each of those to run lustre.
Which tells you right away that our original dreams for lustre did not
quite work out.

In all honesty, however, the 20K node Jaguar machine at ORNL claims to
run lustre and have it all just work. I know as many people who have
de-installed lustre as use it, however.

ron
Re: [9fans] ceph
On Jul 30, 2009, at 9:31 AM, sqweek wrote:
> 2009/7/30 Roman V Shaposhnik r...@sun.com:
>> This is sort of off-topic, but does anybody have any experience with
>> Ceph? http://ceph.newdream.net/
>> Good or bad war stories (and general thoughts) would be quite welcome.
> Not with ceph itself, but the description and terminology they use
> remind me a lot of lustre (seems like it's a userspace version) which
> we use at work. Does a damn fine job - as long as you get a stable
> version. We have run into issues trying out new versions several
> times...

I guess that sums up my impression of ceph so far: I don't see where it
would fit. I think that in HPC it is 99% Lustre, in enterprise it is
either CIFS or NFS, etc. There's some internal push for it around here
so I was wondering whether I missed a memo once again...

Thanks,
Roman.
Re: [9fans] ceph
I'm not a big fan of lustre. In fact I'm talking to someone who really wants 9p working well so he can have lustre on all but a few nodes, and those lustre nodes export 9p. ron
Re: [9fans] ceph
On Jul 31, 2009, at 10:41 PM, ron minnich wrote:
> I'm not a big fan of lustre. In fact I'm talking to someone who really
> wants 9p working well so he can have lustre on all but a few nodes,
> and those lustre nodes export 9p.

What are your clients running? What are their requirements as far as
POSIX is concerned? How much storage are we talking about?

I'd be interested in discussing some aspects of what you're trying to
accomplish with 9P for the HPC guys.

Thanks,
Roman.

P.S. If it is ok with everybody else -- I'll keep the conversation on
the list.
Re: [9fans] ceph
2009/7/30 Roman V Shaposhnik r...@sun.com:
> This is sort of off-topic, but does anybody have any experience with
> Ceph? http://ceph.newdream.net/
> Good or bad war stories (and general thoughts) would be quite welcome.

Not with ceph itself, but the description and terminology they use
remind me a lot of lustre (seems like it's a userspace version) which
we use at work. Does a damn fine job - as long as you get a stable
version. We have run into issues trying out new versions several
times...

-sqweek