Re: [9fans] ceph

2009-08-04 Thread Tim Newsham

 2. do we have anybody successfully managing that much storage that is
also spread across the nodes? And if so, what are the best practices
out there to make the client not worry about where the storage
actually comes from (IOW, any kind of proxying of I/O, etc.)?


http://labs.google.com/papers/gfs.html
http://hadoop.apache.org/common/docs/current/hdfs_design.html


 I'm trying to see how life after NFSv4 or AFS might look for
the clients still clinging to the old ways of doing things, yet
trying to cooperatively use hundreds of T of storage.


the two I mention above are both used in conjunction with
distributed map/reduce calculations.  Calculations are done
on the nodes where the data is stored...


Thanks,
Roman.


Tim Newsham
http://www.thenewsh.com/~newsham/



Re: [9fans] ceph

2009-08-04 Thread erik quanstrom
 
  Google?
 
 the exception that proves the rule? they emphatically
 don't go for posix semantics...

why would purveyors of 9p give a rip about posix semantics?

- erik



Re: [9fans] ceph

2009-08-04 Thread Steve Simon
 Well, with Linux, at least you have the benefit of gazillions of FS
 clients being available either natively or via FUSE.

Do you have a link to a site which lists interesting FUSE filesystems?
I am definitely not trying to troll; I am always intrigued by others'
ideas of how to represent data/APIs as filesystems.

Sadly, the FUSE filesystems I have seen have mostly been disappointing. There
are a few which would be handy on plan9 (gmail, ipod, svn) but most
seem less useful.

-Steve



Re: [9fans] ceph

2009-08-04 Thread roger peppe
2009/8/4 erik quanstrom quans...@quanstro.net:
 
  Google?

 the exception that proves the rule? they emphatically
 don't go for posix semantics...

 why would purveyors of 9p give a rip about posix semantics?

from ron:
 10,000 machines, working on a single app, must have access to a common
 file store with full posix semantics and it all has to work as if it
 were one machine (their desktop, of course).

... and gfs does things that aren't easily compatible with 9p either,
such as returning the actually-written offset when appending to a file.
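
To make that last point concrete, here is a rough sketch of the message
shapes involved (the struct and field names are illustrative only,
paraphrasing write(5) and the GFS paper, not copied from any real header):
a 9P write names its offset in the request and the reply carries only a
byte count, so there is no slot for the server-chosen offset that GFS
record append hands back.

#include <stdint.h>

/* 9P: the client picks the offset up front... */
struct Twrite {
	uint32_t fid;
	uint64_t offset;	/* chosen by the client */
	uint32_t count;
	char	*data;
};

/* ...and the reply only says how many bytes were written. */
struct Rwrite {
	uint32_t count;
};

/* GFS record append: the client sends only data; the server picks the
 * offset and the reply reports where the record actually landed,
 * a field that Rwrite has no room for. */
struct GfsRecordAppendReply {
	uint64_t offset;	/* chosen by the server */
};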



Re: [9fans] ceph

2009-08-04 Thread C H Forsyth
they emphatically don't go for posix semantics...

what are posix semantics?



Re: [9fans] ceph

2009-08-04 Thread roger peppe
2009/8/4 C H Forsyth fors...@vitanuova.com:
they emphatically don't go for posix semantics...

 what are posix semantics?

perhaps wrongly, i'd assumed that the posix standard
implied some semantics in defining its file API, and
ron was referring to those. perhaps it defines less
than i assume - i've not studied it.

i was alluding to this sentence from the paper:

: GFS provides a familiar file system interface, though it
: does not implement a standard API such as POSIX.
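
For one concrete example of the kind of semantics that sentence waives:
POSIX requires an O_APPEND write to land atomically at the current end of
file, and once write() returns, any read() of those bytes must see them.
A minimal sketch (the file name is arbitrary):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *rec = "one record\n";
	int fd = open("/tmp/log", O_WRONLY|O_CREAT|O_APPEND, 0644);
	if (fd < 0)
		return 1;
	/* With O_APPEND the seek-to-end and the write are one atomic step,
	 * so concurrent appenders never clobber each other; a cluster
	 * filesystem that relaxes this is already "not posix". */
	if (write(fd, rec, strlen(rec)) != (ssize_t)strlen(rec))
		return 1;
	close(fd);
	return 0;
}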



Re: [9fans] ceph

2009-08-04 Thread ron minnich
On Tue, Aug 4, 2009 at 2:55 AM, C H Forsyth fors...@vitanuova.com wrote:
they emphatically don't go for posix semantics...

 what are posix semantics?

whatever today's customer happens to think they are.

ron



Re: [9fans] ceph

2009-08-04 Thread Roman V Shaposhnik
On Mon, 2009-08-03 at 19:56 -0700, ron minnich wrote:
   2. do we have anybody successfully managing that much storage that is
  also spread across the nodes? And if so, what are the best practices
  out there to make the client not worry about where the storage
  actually comes from (IOW, any kind of proxying of I/O, etc.)?
 
 Google?

By "we" I mostly meant this community, but even if we don't focus on
9fans, Google is a non-example. They have no clients for this filesystem
per se.

  The request: for each of the (lots of) compute nodes, have them mount
  over 9p to, say, 100x fewer io nodes, each of those to run lustre.
 
  Sorry for being dense, but what exactly is going to be accomplished
  by proxying I/O in such a way?
 
 it makes the unscalable distributed lock manager and other such stuff
 work, because you stop asking it to scale.

So, strictly speaking, you are not really using 9P as a filesystem
protocol, but rather as a convenient way of doing RPC, right?

Thanks,
Roman.




Re: [9fans] ceph

2009-08-04 Thread Roman V Shaposhnik
On Tue, 2009-08-04 at 10:55 +0100, C H Forsyth wrote:
 they emphatically don't go for posix semantics...
 
 what are posix semantics?

I'll bite:
   http://www.opengroup.org/onlinepubs/009695399/
   
   [ anything else that would take an FD as an argument ]

Thanks,
Roman.
  




Re: [9fans] ceph

2009-08-04 Thread Roman V Shaposhnik
On Tue, 2009-08-04 at 09:43 +0100, Steve Simon wrote:
  Well, with Linux, at least you have the benefit of gazillions of FS
  clients being available either natively or via FUSE.
 
 Do you have a link to a site which lists interesting FUSE filesystems?
 I am definitely not trying to troll; I am always intrigued by others'
 ideas of how to represent data/APIs as filesystems.

I don't, and I probably should start documenting it. The easiest way
to find them, though, is to be subscribed to the FUSE ML and collect
the domain names of posters. Turns out that anybody who's doing cloud
storage these days does it via FUSE (which might not be as surprising
if you think about what the dominant OS on EC2 is). You have companies
ranging from startups:
   http://www.nirvanix.com/
all the way to tyrannosaurs like EMC and IBM betting on FUSE to get
them to storage in the cloud.

Sadly, none of them are open source as far as I can tell.

 Sadly, the FUSE filesystems I have seen have mostly been disappointing. There
 are a few which would be handy on plan9 (gmail, ipod, svn) but most
 seem less useful.

The open source ones are not all that impressive. I agree.

Thanks,
Roman.




Re: [9fans] ceph

2009-08-04 Thread Roman V Shaposhnik
On Mon, 2009-08-03 at 21:23 -1000, Tim Newsham wrote:
   2. do we have anybody successfully managing that much storage that is
  also spread across the nodes? And if so, what are the best practices
  out there to make the client not worry about where the storage
  actually comes from (IOW, any kind of proxying of I/O, etc.)?
 
 http://labs.google.com/papers/gfs.html
 http://hadoop.apache.org/common/docs/current/hdfs_design.html
 
  I'm trying to see how life after NFSv4 or AFS might look for
  the clients still clinging to the old ways of doing things, yet
  trying to cooperatively use hundreds of T of storage.
 
 the two I mention above are both used in conjunction with
 distributed map/reduce calculations.  Calculations are done
 on the nodes where the data is stored...

Hadoop and GFS are good examples and they work great for the
single distributed application that is *written* with them
in mind.

Unfortunately, I cannot stretch my imagination hard enough
to see them as general purpose filesystems backing up data
for gazillions of non-cooperative applications. The sort
of thing NFS and AFS were built to accomplish.

In that respect, ceph is more what I have in mind: it 
assembles storage from clusters of unrelated OSDs into a
hierarchy with a single point of entry for every
user/application.

The question, however, is how to avoid the complexity of
ceph and still have it look like a humongous kenfs or
fossil from the outside. 

Thanks,
Roman.




Re: [9fans] ceph

2009-08-04 Thread Tim Newsham

Hadoop and GFS are good examples and they work great for the
single distributed application that is *written* with them
in mind.

Unfortunately, I cannot stretch my imagination hard enough
to see them as general purpose filesystems backing up data
for gazillions of non-cooperative applications. The sort
of thing NFS and AFS were built to accomplish.


I *think* the folks at google also use GFS for shared $HOME
(i.e. to stash files they want to share with others).  I
could be wrong.


Roman.


Tim Newsham
http://www.thenewsh.com/~newsham/



Re: [9fans] ceph

2009-08-03 Thread Roman V Shaposhnik
On Sat, 2009-08-01 at 08:47 -0700, ron minnich wrote:
  What are their requirements as
  far as POSIX is concerned?
 
 10,000 machines, working on a single app, must have access to a common
 file store with full posix semantics and it all has to work as if it
 were one machine (their desktop, of course).
 
 This gets messy. It turns into an exercise of attempting to manage a
 competing set of race conditions. It's like tuning
 a multi-carbureted engine from years gone by, assuming we ever had an
 engine with 10,000 cylinders.

Well, with Linux, at least you have the benefit of gazillions of FS
clients being available either natively or via FUSE. With Solaris...
oh well...

  How much storage are we talking about?
 In round numbers, for the small clusters, usually a couple hundred T.
 For anything else, more.

Is all of this storage attached to a very small number of IO nodes, or
is it evenly spread across the cluster?

In fact, I'm interested in both scenarios, so here come two questions:
  1. do we have anybody successfully managing that much storage (let's
     say ~100T) via something like a humongous fossil installation (or
     kenfs for that matter)?

  2. do we have anybody successfully managing that much storage that is
     also spread across the nodes? And if so, what are the best practices
     out there to make the client not worry about where the storage
     actually comes from (IOW, any kind of proxying of I/O, etc.)?

I'm trying to see how life after NFSv4 or AFS might look for
the clients still clinging to the old ways of doing things, yet
trying to cooperatively use hundreds of T of storage.

  I'd be interested in discussing some aspects of what you're trying to
  accomplish with 9P for the HPC guys.
 
 The request: for each of the (lots of) compute nodes, have them mount
 over 9p to, say, 100x fewer io nodes, each of those to run lustre.

Sorry for being dense, but what exactly is going to be accomplished
by proxying I/O in such a way?

Thanks,
Roman.




Re: [9fans] ceph

2009-08-03 Thread roger peppe
2009/8/4 ron minnich rminn...@gmail.com:
  2. do we have anybody successfully managing that much storage that is
     also spread across the nodes? And if so, what are the best practices
     out there to make the client not worry about where the storage
     actually comes from (IOW, any kind of proxying of I/O, etc.)?

 Google?

the exception that proves the rule? they emphatically
don't go for posix semantics...



Re: [9fans] ceph

2009-08-01 Thread ron minnich
On Fri, Jul 31, 2009 at 10:53 PM, Roman Shaposhnik r...@sun.com wrote:

 What are your clients running?

Linux

 What are their requirements as
 far as POSIX is concerned?

10,000 machines, working on a single app, must have access to a common
file store with full posix semantics and it all has to work as if it
were one machine (their desktop, of course).

This gets messy. It turns into an exercise of attempting to manage a
competing set of race conditions. It's like tuning
a multi-carbureted engine from years gone by, assuming we ever had an
engine with 10,000 cylinders.

 How much storage are we talking about?
In round numbers, for the small clusters, usually a couple hundred T.
For anything else, more.


 I'd be interested in discussing some aspects of what you're trying to
 accomplish with 9P for the HPC guys.

The request: for each of the (lots of) compute nodes, have them mount
over 9p to, say, 100x fewer io nodes, each of those to run lustre.
Which tells you right away that our original dreams for lustre did not
quite work out.
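
To pin down what that arrangement looks like from the Linux client side,
here is a minimal sketch using the kernel's 9P (v9fs) client; the
hostname, port, mount point and exported path are made up, and the io
node is assumed to be running some 9P server (u9fs or the like) in front
of its local lustre mount.

#include <stdio.h>
#include <sys/mount.h>

int
main(void)
{
	/* v9fs mount options: TCP transport, the io node's 9P port, and
	 * the attach name of the lustre tree exported by that node. */
	const char *opts = "trans=tcp,port=564,aname=/lustre";

	/* roughly: mount -t 9p -o trans=tcp,port=564,aname=/lustre
	 *          io-node-17 /mnt/lustre */
	if (mount("io-node-17", "/mnt/lustre", "9p", 0, opts) < 0) {
		perror("mount 9p");
		return 1;
	}
	return 0;
}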

In all honesty, however, the 20K node Jaguar machine at ORNL claims to
run lustre and have it all just work. I know as many people who have
de-installed lustre as use it, though.

ron



Re: [9fans] ceph

2009-07-31 Thread Roman Shaposhnik


On Jul 30, 2009, at 9:31 AM, sqweek wrote:


2009/7/30 Roman V Shaposhnik r...@sun.com:

This is sort of off-topic, but does anybody have any experience with
Ceph?
  http://ceph.newdream.net/

Good or bad war stories (and general thoughts) would be quite welcome.


Not with ceph itself, but the description and terminology they use
remind me a lot of lustre (seems like it's a userspace version) which
we use at work. Does a damn fine job - as long as you get a stable
version. We have run into issues trying out new versions several
times...


I guess that sums up my impression of ceph so far: I don't see where it
would fit. I think that in HPC it is 99% Lustre, in enterprise it is
either CIFS or NFS, etc.

There's some internal push for it around here so I was wondering
whether I missed a memo once again...

Thanks,
Roman.



Re: [9fans] ceph

2009-07-31 Thread ron minnich
I'm not a big fan of lustre. In fact I'm talking to someone who really
wants 9p working well so he can have lustre on all but a few nodes,
and those lustre nodes export 9p.

ron



Re: [9fans] ceph

2009-07-31 Thread Roman Shaposhnik

On Jul 31, 2009, at 10:41 PM, ron minnich wrote:

I'm not a big fan of lustre. In fact I'm talking to someone who really
wants 9p working well so he can have lustre on all but a few nodes,
and those lustre nodes export 9p.



What are your clients running? What are their requirements as
far as POSIX is concerned? How much storage are we talking about?

I'd be interested in discussing some aspects of what you're trying to
accomplish with 9P for the HPC guys.

Thanks,
Roman.

P.S. If it is ok with everybody else -- I'll keep the conversation on  
the list.




Re: [9fans] ceph

2009-07-30 Thread sqweek
2009/7/30 Roman V Shaposhnik r...@sun.com:
 This is sort of off-topic, but does anybody have any experience with
 Ceph?
   http://ceph.newdream.net/

 Good or bad war stories (and general thoughts) would be quite welcome.

 Not with ceph itself, but the description and terminology they use
remind me a lot of lustre (seems like it's a userspace version) which
we use at work. Does a damn fine job - as long as you get a stable
version. We have run into issues trying out new versions several
times...
-sqweek