Re: stuff for v0.56.4

2013-03-06 Thread Wido den Hollander

On 03/06/2013 12:10 AM, Sage Weil wrote:

There have been a few important bug fixes that people are hitting or
want:

- the journal replay bug (5d54ab154ca790688a6a1a2ad5f869c17a23980a)
- the '-' vs '_' pool name cap parsing thing that is biting OpenStack users
- ceph-disk-* changes to support latest ceph-deploy

If there are other things that we want to include in 0.56.4, let's get them
into the bobtail branch sooner rather than later.

Possible items:

- pg log trimming (probably a conservative subset) to avoid memory bloat
- omap scrub?
- pg temp collection removal?
- buffer::cmp fix from loic?

Are there other items that we are missing?



I'm still seeing #3816 on my systems. The fix in wip-3816 did not 
resolve it for me.


Wido


sage





--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: [ceph-users] Using different storage types on same osd hosts?

2013-03-06 Thread Martin B Nielsen
Hi,

We did the opposite here: we added some SSDs in free slots after having a
normal cluster running on SATA.

We just created a new pool for them and separated the two types. I
used this as a template:
http://ceph.com/docs/master/rados/operations/crush-map/?highlight=ssd#placing-different-pools-on-different-osds
and left out the part about placing a master copy on each SSD.

I had to create the pool, rack and host in the CRUSH map for the
first server (it wouldn't let me do it from the command line using 'ceph osd
crush set ...'); after that I could just add servers/OSDs to it like
normal.

I think unless you really need two separate clusters, I'd just go with
different pools; with two clusters you'll need a copy of every
service (mons, storage nodes, etc.).

More info on running multiple clusters here:
http://ceph.com/docs/master/rados/configuration/ceph-conf/#running-multiple-clusters

Cheers,
Martin

On Tue, Mar 5, 2013 at 9:48 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 Hi,

 right now I have a bunch of OSD hosts (servers) which have just 4 disks
 each. All of them use SSDs right now.

 So I have a lot of free hard-disk slots in the chassis. My idea was to
 create a second Ceph system using these free slots. Is this possible? Or
 should I just use the first one with different rules? Any hints?

 Greets,
 Stefan



Re: RGW Blocking on 1-2 PG's - argonaut

2013-03-06 Thread Sławomir Skowron
Hi, I ran some tests to reproduce this problem.

As you can see, only one drive (each drive in the same PG) is much more
utilized than the others, and there are some ops queued on this slow
OSD. The test fetches the heads of S3 objects, alphabetically
sorted. This is strange: why do these requests go, for the most part, to
this one triple of OSDs?

Checking which OSDs are in this PG:

 ceph pg map 7.35b
osdmap e117008 pg 7.35b (7.35b) -> up [18,61,133] acting [18,61,133]

On osd.61

{ num_ops: 13,
  ops: [
{ description: osd_sub_op(client.10376104.0:961532 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370134
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.448543,
  age: 0.032431,
  flag_point: started},
{ description: osd_sub_op(client.10376110.0:972570 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370135
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.453829,
  age: 0.027145,
  flag_point: started},
{ description: osd_sub_op(client.10376104.0:961534 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370136
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.454012,
  age: 0.026962,
  flag_point: started},
{ description: osd_sub_op(client.10376107.0:952760 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370137
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.458980,
  age: 0.021994,
  flag_point: started},
{ description: osd_sub_op(client.10376110.0:972572 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370138
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.459546,
  age: 0.021428,
  flag_point: started},
{ description: osd_sub_op(client.10376110.0:972574 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370139
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.463680,
  age: 0.017294,
  flag_point: started},
{ description: osd_sub_op(client.10376107.0:952762 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370140
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.464660,
  age: 0.016314,
  flag_point: started},
{ description: osd_sub_op(client.10376104.0:961536 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370141
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.468076,
  age: 0.012898,
  flag_point: started},
{ description: osd_sub_op(client.10376110.0:972576 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370142
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.468332,
  age: 0.012642,
  flag_point: started},
{ description: osd_sub_op(client.10376107.0:952764 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370143
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.470480,
  age: 0.010494,
  flag_point: started},
{ description: osd_sub_op(client.10376107.0:952766 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370144
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.475372,
  age: 0.005602,
  flag_point: started},
{ description: osd_sub_op(client.10376104.0:961538 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370145
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.479391,
  age: 0.001583,
  flag_point: started},
{ description: osd_sub_op(client.10376107.0:952768 7.35b
2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370146
snapset=0=[]:[] snapc=0=[]),
  received_at: 2013-03-06 13:59:18.480276,
  age: 0.000698,
  flag_point: started}]}

On osd.18

{ num_ops: 9,
  ops: [
{ description: osd_op(client.10391092.0:718883
2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b),
  received_at: 2013-03-06 13:57:52.929677,
  age: 0.025480,
  flag_point: waiting for sub ops,
  client_info: { client: client.10391092,
  tid: 718883}},
{ description: osd_op(client.10373691.0:956595
2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b),
  received_at: 2013-03-06 13:57:52.934533,
  age: 0.020624,
  flag_point: waiting for sub ops,
  client_info: { client: client.10373691,
  tid: 956595}},
{ description: osd_op(client.10391092.0:718885
2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b),
  received_at: 2013-03-06 13:57:52.937101,
  age: 0.018056,
  flag_point: waiting for sub ops,
  client_info: { client: client.10391092,
  tid: 718885}},
{ description: 

ceph -v doesn't work

2013-03-06 Thread Olivier Bonvalet
Hi,

since I compile my own Debian packages, ceph -v doesn't work.

I followed these steps:
- git clone XXX
- git checkout origin/bobtail
- dch -i
- dpkg-source -b ceph
- cowbuilder --build ceph*dsc


and I get:

root! okko:~# ceph -v
ceph version  ()
root! okko:~# 

or with strace :

write(1, "ceph version  ()\n", 17ceph version  ()
)  = 17
exit_group(0)   = ?
root! okko:~# 


Do you know how I can fix that?

Thanks,

Olivier

PS: I compile these packages to enable syncfs support on Debian 6
(Squeeze), since I use a recent kernel.



Re: RGW Blocking on 1-2 PG's - argonaut

2013-03-06 Thread Yehuda Sadeh
On Wed, Mar 6, 2013 at 5:06 AM, Sławomir Skowron szi...@gmail.com wrote:
 Hi, i do some test, to reproduce this problem.

 As you can see, only one drive (each drive in same PG) is much more
 utilize, then others, and there are some ops in queue on this slow
 osd. This test is getting heads from s3 objects, alphabetically
 sorted. This is strange. why this files is going in much part only
 from this triple osd's.

 checking what osd are in this pg.

  ceph pg map 7.35b
 osdmap e117008 pg 7.35b (7.35b) - up [18,61,133] acting [18,61,133]

 On osd.61

 { num_ops: 13,
   ops: [
 { description: osd_sub_op(client.10376104.0:961532 7.35b
 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370134

The ops log is slowing you down. Unless you really need it, set 'rgw
enable ops log = false'. This is off by default in bobtail.
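
For example, in ceph.conf on the gateway host (the section name below is just
the conventional one from the docs; use whatever section your radosgw instance
actually reads), then restart radosgw:

  [client.radosgw.gateway]
      rgw enable ops log = false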


Yehuda


Usable Space

2013-03-06 Thread Patrick McGarry
Proxying this in for a user I had a discussion with on IRC this morning:

The question is: is there a way to display usable space based on
replication level?

Ultimately what would be nice is to see something like the following:

---
$: sudo ceph --usable-space

Total Space: X / Y  ||  Usable Space: A / B

By Pools:
rbd -- J / K
foo -- F / G
bar -- H / I
baz -- C / D

-

Would it be possible to add this in at some point? Seems like a great
addition to go with some of the other 'usability enhancements' that
are planned.  Or would this get computationally sticky based on having
many pools with different replication levels?


Best Regards,

-- 
Patrick McGarry
Director, Community
Inktank

http://ceph.com  ||  http://inktank.com
@scuttlemonkey || @ceph || @inktank


Re: Usable Space

2013-03-06 Thread Sylvain Munaut
 Total Space: X / Y  ||  Usable Space: A / B

 Would it be possible to add this in at some point? Seems like a great
 addition to go with some of the other 'usability enhancements' that
 are planned.  Or would this get computationally sticky based on having
 many pools with different replication levels?

How would you even compute it?

I mean, if the underlying storage is shared between several pools with
different replication levels, the usable space depends on which
pool you actually put your data in...

You could do it per pool, but even then I think it can get tricky,
because the CRUSH map could very well distribute the data in a pool on
a subset of the OSDs, so you'd need to take that into account as well.
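
A rough back-of-envelope is about the best that can be done today. A sketch
below; it ignores uneven CRUSH distribution and space shared with other pools,
and the exact output formats may differ between versions:

  # raw free space across the whole cluster
  ceph -s | grep avail
  # replica count of each pool
  ceph osd dump | grep 'rep size'
  # usable space for *new* data in a given pool is then roughly
  #   raw_free / rep_size
  # e.g. 30 TB raw free and a 3x pool => at most ~10 TB of new data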

Cheers,

   Sylvain


Re: CephFS First product release discussion

2013-03-06 Thread Wido den Hollander

On 03/05/2013 08:33 PM, Sage Weil wrote:

On Tue, 5 Mar 2013, Wido den Hollander wrote:

Wido, by 'user quota' do you mean something that is uid-based, or would
enforcement on subtree/directory quotas be sufficient for your use cases?
I've been holding out hope that uid-based usage accounting is a thing of
the past and that subtrees are sufficient for real users... in which case
adding enforcement to the existing rstats infrastructure is a very
manageable task.



I mean actual uid-based quotas. That still plays nicely with shared environments
like Samba, where you have all home directories on a shared filesystem
and you set per-user quotas. Samba reads out those quotas and propagates them
to the (Windows) client.


Does samba propagate the quota information (how much space is
used/available) or do enforcement on the client side?  (Is client
enforcement even necessary/useful if the backend will stop writes when the
quota is exceeded?)



I'm not sure. It will at least tell the user how much he/she is using on 
that volume and what the quota is. Not sure who enforces it, Samba or the 
filesystem.


From a quick Google it seems like the filesystem has to enforce the 
quota, Samba doesn't.



I know this was a problem with ZFS as well. They also said they could do
per-filesystem quotas, so that would be sufficient, but NFS for example doesn't
export filesystems mounted inside an export, so if you have a bunch of
home directories on the filesystem and you want to account for the usage of each
user it gets kind of hard.

This could be solved if the clients directly mounted CephFS though.

I'm talking about setups where you have 100k users in LDAP who all have
their data on a single filesystem, and you want to track the usage of each
user; that's not an easy task without uid-based quotas.


Wouldn't each user live in a sub- or home directory?  If so, it seems like
the existing rstats would be sufficient to do the accounting piece; only
enforcement is missing.


Running 'du' on each directory would be much faster with Ceph since it
tracks the subdirectories and shows their total size with an 'ls
-al'.

Environments with 100k users also tend to be very dynamic with adding and
removing users all the time, so creating separate filesystems for them would
be very time consuming.

Now, I'm not talking about enforcing soft or hard quotas, I'm just talking
about knowing how much space uid X and Y consume on the filesystem.


The part I'm most unclear on is what use cases people have where uid X and
Y are spread around the file system (not in a single or small set of sub
directories) and per-user (not, say, per-project) quotas are still
necessary.  In most environments, users get their own home directory and
everything lives there...



I see a POSIX filesystem as being partially legacy, and a part of that 
legacy is user quotas.


If you want existing applications that rely on user quotas to switch 
seamlessly from NFS to CephFS, they will need this to work.


We only talked about user quotas, but group quotas are just as important.

If you have 10 users where 5 of them are in the group webdev and you 
want to know how much space is being used by the group webdev, you 
just probe the group quotas and you are done.


In some setups, like ours, users have data in different directories 
outside their home directories / NFS exports. On one machine you just 
run 'quota -u uid' and you know how much user X is using, spread out 
over all the filesystems.


With rstats you would be able to achieve the same with some scripting, 
but that doesn't make the migration seamless.


Wido


sage




Wido


sage




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on







--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: Usable Space

2013-03-06 Thread Dan Mick
You're aware of the just-added ceph df?  I don't know it well enough to know if 
it's a solution, but it's in that space...
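
For anyone who wants to try it (with the caveat that the output of a
just-added command may still change):

  ceph df

which should print cluster-wide usage plus a per-pool breakdown.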

On Mar 6, 2013, at 6:48 AM, Patrick McGarry patr...@inktank.com wrote:

 Proxy-ing this in for a user I had a discussion with on irc this morning:
 
 The question is is there a way to display usable space based on
 replication level?
 
 Ultimately what would be nice is to see something like the following:
 
 ---
 $: sudo ceph --usable-space
 
 Total Space: X / Y  ||  Usable Space: A / B
 
 By Pools:
 rbd -- J / K
 foo -- F / G
 bar -- H / I
 baz -- C / D
 
 -
 
 Would it be possible to add this in at some point? Seems like a great
 addition to go with some of the other 'usability enhancements' that
 are planned.  Or would this get computationally sticky based on having
 many pools with different replication levels?
 
 
 Best Regards,
 
 -- 
 Patrick McGarry
 Director, Community
 Inktank
 
 http://ceph.com  ||  http://inktank.com
 @scuttlemonkey || @ceph || @inktank


Re: CephFS First product release discussion

2013-03-06 Thread Jim Schutt
On 03/05/2013 12:33 PM, Sage Weil wrote:
  Running 'du' on each directory would be much faster with Ceph since it
  accounts tracks the subdirectories and shows their total size with an 'ls
  -al'.
  
  Environments with 100k users also tend to be very dynamic with adding and
  removing users all the time, so creating separate filesystems for them 
  would
  be very time consuming.
  
  Now, I'm not talking about enforcing soft or hard quotas, I'm just talking
  about knowing how much space uid X and Y consume on the filesystem.
 The part I'm most unclear on is what use cases people have where uid X and 
 Y are spread around the file system (not in a single or small set of sub 
 directories) and per-user (not, say, per-project) quotas are still 
 necessary.  In most environments, users get their own home directory and 
 everything lives there...

Hmmm, is there a tool I should be using that will return the space
used by a directory, and all its descendants?

If it's 'du', that tool is definitely not fast for me.

I'm doing an 'strace du -s path', where path has one
subdirectory which contains ~600 files.  I've got ~200 clients
mounting the file system, and each client wrote 3 files in that
directory.

I'm doing the 'du' from one of those nodes, and the strace is showing
me du is doing a 'newfstat' for each file.  For each file that was
written on a different client from where du is running, that 'newfstat'
takes tens of seconds to return.  Which means my 'du' has been running
for quite some time and hasn't finished yet

I'm hoping there's another tool I'm supposed to be using that I
don't know about yet.  Our use case includes tens of millions
of files written from thousands of clients, and whatever tool
we use to do space accounting needs to not walk an entire directory
tree, checking each file.

-- Jim


 
 sage
 
 
  
  Wido




CephFS Space Accounting and Quotas (was: CephFS First product release discussion)

2013-03-06 Thread Greg Farnum
On Wednesday, March 6, 2013 at 11:07 AM, Jim Schutt wrote:
 On 03/05/2013 12:33 PM, Sage Weil wrote:
Running 'du' on each directory would be much faster with Ceph since it
accounts tracks the subdirectories and shows their total size with an 
'ls
-al'.
 
Environments with 100k users also tend to be very dynamic with adding 
and
removing users all the time, so creating separate filesystems for them 
would
be very time consuming.
 
Now, I'm not talking about enforcing soft or hard quotas, I'm just 
talking
about knowing how much space uid X and Y consume on the filesystem.

   
   
  The part I'm most unclear on is what use cases people have where uid X and  
  Y are spread around the file system (not in a single or small set of sub  
  directories) and per-user (not, say, per-project) quotas are still  
  necessary. In most environments, users get their own home directory and  
  everything lives there...
  
  
  
 Hmmm, is there a tool I should be using that will return the space
 used by a directory, and all its descendants?
  
 If it's 'du', that tool is definitely not fast for me.
  
 I'm doing an 'strace du -s path', where path has one
 subdirectory which contains ~600 files. I've got ~200 clients
 mounting the file system, and each client wrote 3 files in that
 directory.
  
 I'm doing the 'du' from one of those nodes, and the strace is showing
 me du is doing a 'newfstat' for each file. For each file that was
 written on a different client from where du is running, that 'newfstat'
 takes tens of seconds to return. Which means my 'du' has been running
 for quite some time and hasn't finished yet
  
 I'm hoping there's another tool I'm supposed to be using that I
 don't know about yet. Our use case includes tens of millions
 of files written from thousands of clients, and whatever tool
 we use to do space accounting needs to not walk an entire directory
 tree, checking each file.

Check out the directory sizes with ls -l or whatever — those numbers are 
semantically meaningful! :)
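
For example (the mount point, user and sizes here are only illustrative):

  $ ls -ldh /mnt/ceph/home/alice
  drwxr-xr-x 3 alice alice 1.3T Mar  6 10:02 /mnt/ceph/home/alice

The directory "size" is the recursive byte count of everything underneath it,
as maintained by the MDS.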

Unfortunately we can't (currently) use those recursive statistics to do 
proper hard quotas on subdirectories as they're lazily propagated following 
client ops, not as part of the updates. (Lazily in the technical sense — it's 
actually quite fast in general). But they'd work fine for soft quotas if 
somebody wrote the code, or to block writes on a slight time lag.
-Greg



Re: OpenStack summit : Ceph design session

2013-03-06 Thread Neil Levine
I think the multi-site RGW stuff is somewhat orthogonal to OpenStack,
whereas the RBD backups need to factor in Horizon, the Cinder APIs and
where the logic for managing the backups sits.

Ross is looking to get a wiki set up for Ceph blueprints so we can
document the incremental snapshot stuff and then use this as a basis
for creating the OpenStack blueprint.

Who approves the session at ODS and when is this decision made?

Neil

On Sun, Mar 3, 2013 at 1:37 AM, Loic Dachary l...@dachary.org wrote:
 Hi Neil,

 I've updated http://summit.openstack.org/cfp/details/38 with the Geographic 
 DR related threads

 Geo-replication with RADOS GW 
 http://marc.info/?l=ceph-devel&m=135939566407623&w=4
 Geographic DR for RGW http://marc.info/?l=ceph-devel&m=136191479931880&w=4

 I'm increasingly interested in figuring out how it fits with OpenStack.

 Cheers

 On 02/25/2013 11:04 AM, Loic Dachary wrote:
 Hi Neil,

 I've added RBD backups secondary clusters within Openstack to the list of 
 blueprints. Do you have links to mail threads / chat logs related to this 
 topic ?

 I moved the content of the session to an etherpad for collaborative editing

 https://etherpad.openstack.org/roadmap-for-ceph-integration-with-openstack

 and it is now linked from

 http://summit.openstack.org/cfp/details/38

 Cheers

 On 02/25/2013 07:12 AM, Neil Levine wrote:
 Thanks for taking the lead on this Loic.

 As a blueprint, I'd like to look at RBD backups to secondary clusters
 within Openstack. Nick Barcet and others have mentioned ideas for this
 now that Cinder is multi-cluster aware.

 Neil

 On Sun, Feb 24, 2013 at 3:16 PM, Josh Durgin josh.dur...@inktank.com 
 wrote:
 On 02/23/2013 02:33 AM, Loic Dachary wrote:

 Hi,

 In anticipation of the next OpenStack summit
 http://www.openstack.org/summit/portland-2013/, I proposed a session to
 discuss OpenStack and Ceph integration. Our meeting during FOSDEM earlier
 this month was a great experience although it was planned at the last
 minute. I hope we can organize something even better for the summit.

 For developers and contributors to both Ceph and OpenStack such as myself,
 it would be a great opportunity to figure out a sensible roadmap for the
 next six months. I realize this roadmap is already clear for Josh Durgin 
 and
 other Ceph / OpenStack developers who are passionately invested in both
 projects for a long time. However I am new to both projects and such a
 session would be a precious guide and highly motivating.

 http://summit.openstack.org/cfp/details/38

 What do you think ?


 Sounds like a great idea!
 Thanks for putting together the session!

 Josh



 --
 Loïc Dachary, Artisan Logiciel Libre



Re: CephFS Space Accounting and Quotas

2013-03-06 Thread Jim Schutt
On 03/06/2013 12:13 PM, Greg Farnum wrote:
 On Wednesday, March 6, 2013 at 11:07 AM, Jim Schutt wrote:
 On 03/05/2013 12:33 PM, Sage Weil wrote:
 Running 'du' on each directory would be much faster with Ceph since it
 accounts tracks the subdirectories and shows their total size with an 'ls
 -al'.
  
 Environments with 100k users also tend to be very dynamic with adding and
 removing users all the time, so creating separate filesystems for them 
 would
 be very time consuming.
  
 Now, I'm not talking about enforcing soft or hard quotas, I'm just talking
 about knowing how much space uid X and Y consume on the filesystem.
  
  
  
 The part I'm most unclear on is what use cases people have where uid X and  
 Y are spread around the file system (not in a single or small set of sub  
 directories) and per-user (not, say, per-project) quotas are still  
 necessary. In most environments, users get their own home directory and  
 everything lives there...
  
  
  
 Hmmm, is there a tool I should be using that will return the space
 used by a directory, and all its descendants?
  
 If it's 'du', that tool is definitely not fast for me.
  
 I'm doing an 'strace du -s path', where path has one
 subdirectory which contains ~600 files. I've got ~200 clients
 mounting the file system, and each client wrote 3 files in that
 directory.
  
 I'm doing the 'du' from one of those nodes, and the strace is showing
 me du is doing a 'newfstat' for each file. For each file that was
 written on a different client from where du is running, that 'newfstat'
 takes tens of seconds to return. Which means my 'du' has been running
 for quite some time and hasn't finished yet
  
 I'm hoping there's another tool I'm supposed to be using that I
 don't know about yet. Our use case includes tens of millions
 of files written from thousands of clients, and whatever tool
 we use to do space accounting needs to not walk an entire directory
 tree, checking each file.
 
 Check out the directory sizes with ls -l or whatever — those numbers are 
 semantically meaningful! :)

That is just exceptionally cool!

 
 Unfortunately we can't (currently) use those recursive statistics
 to do proper hard quotas on subdirectories as they're lazily
 propagated following client ops, not as part of the updates. (Lazily
 in the technical sense — it's actually quite fast in general). But
 they'd work fine for soft quotas if somebody wrote the code, or to
 block writes on a slight time lag.

'ls -lh dir' seems to be just the thing if you already know dir.

And it's perfectly suitable for our use case of not scheduling
new jobs for users consuming too much space.

I was thinking I might need to find a subtree where all the
subdirectories are owned by the same user, on the theory that
all the files in such a subtree would be owned by that same
user.  E.g., we might want such a capability to manage space per
user in shared project directories.

So, I tried 'find dir -type d -exec ls -lhd {} \;'

Unfortunately, that ended up doing a 'newfstatat' on each file
under dir, evidently to learn if it was a directory.  The
result was that same slowdown for files written on other clients.

Is there some other way I should be looking for directories if I
don't already know what they are?

Also, this issue of stat on files created on other clients seems
like it's going to be problematic for many interactions our users
will have with the files created by their parallel compute jobs -
any suggestion on how to avoid or fix it?

Thanks!

-- Jim

 -Greg
 
 
 




Re: CephFS Space Accounting and Quotas

2013-03-06 Thread Greg Farnum
On Wednesday, March 6, 2013 at 11:58 AM, Jim Schutt wrote:
 On 03/06/2013 12:13 PM, Greg Farnum wrote:
  Check out the directory sizes with ls -l or whatever — those numbers are 
  semantically meaningful! :)
  
  
 That is just exceptionally cool!
  
   
  Unfortunately we can't (currently) use those recursive statistics
  to do proper hard quotas on subdirectories as they're lazily
  propagated following client ops, not as part of the updates. (Lazily
  in the technical sense — it's actually quite fast in general). But
  they'd work fine for soft quotas if somebody wrote the code, or to
  block writes on a slight time lag.
  
  
  
 'ls -lh dir' seems to be just the thing if you already know dir.
  
 And it's perfectly suitable for our use case of not scheduling
 new jobs for users consuming too much space.
  
 I was thinking I might need to find a subtree where all the
 subdirectories are owned by the same user, on the theory that
 all the files in such a subtree would be owned by that same
 user. E.g., we might want such a capability to manage space per
 user in shared project directories.
  
 So, I tried 'find dir -type d -exec ls -lhd {} \;'
  
 Unfortunately, that ended up doing a 'newfstatat' on each file
 under dir, evidently to learn if it was a directory. The
 result was that same slowdown for files written on other clients.
  
 Is there some other way I should be looking for directories if I
 don't already know what they are?
  
 Also, this issue of stat on files created on other clients seems
 like it's going to be problematic for many interactions our users
 will have with the files created by their parallel compute jobs -
 any suggestion on how to avoid or fix it?
  

Brief background: stat is required to provide file size information, and so 
when you do a stat Ceph needs to find out the actual file size. If the file is 
currently in use by somebody, that requires gathering up the latest metadata 
from them.
Separately, while Ceph allows a client and the MDS to proceed with a bunch of 
operations (i.e., mknod) without having them go to disk first, it requires that anything 
which is visible to a third party (another client) be durable on disk for 
consistency reasons.

These combine to mean that if you do a stat on a file which a client currently 
has buffered writes for, that buffer must be flushed out to disk before the 
stat can return. This is the usual cause of the slow stats you're seeing. You 
should be able to adjust dirty data thresholds to encourage faster writeouts, 
do fsyncs once a client is done with a file, etc in order to minimize the 
likelihood of running into this.
Also, I'd have to check but I believe opening a file with LAZY_IO or whatever 
will weaken those requirements — it's probably not the solution you'd like here 
but it's an option, and if this turns out to be a serious issue then config 
options to reduce consistency on certain operations are likely to make their 
way into the roadmap. :)
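
As a rough sketch of the first suggestion for the kernel client (where page-cache
writeback is driven by the usual VM knobs; the values below are only illustrative,
and having the application fsync() files it has finished writing is the more
direct fix):

  # flush dirty data sooner and in smaller batches
  sysctl -w vm.dirty_background_bytes=$((64*1024*1024))
  sysctl -w vm.dirty_expire_centisecs=500
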
-Greg



Re: CephFS Space Accounting and Quotas

2013-03-06 Thread Jim Schutt
On 03/06/2013 01:21 PM, Greg Farnum wrote:
  Also, this issue of stat on files created on other clients seems
  like it's going to be problematic for many interactions our users
  will have with the files created by their parallel compute jobs -
  any suggestion on how to avoid or fix it?
   
 Brief background: stat is required to provide file size information,
 and so when you do a stat Ceph needs to find out the actual file
 size. If the file is currently in use by somebody, that requires
 gathering up the latest metadata from them. Separately, while Ceph
 allows a client and the MDS to proceed with a bunch of operations
 (ie, mknod) without having it go to disk first, it requires anything
 which is visible to a third party (another client) be durable on disk
 for consistency reasons.
 
 These combine to mean that if you do a stat on a file which a client
 currently has buffered writes for, that buffer must be flushed out to
 disk before the stat can return. This is the usual cause of the slow
 stats you're seeing. You should be able to adjust dirty data
 thresholds to encourage faster writeouts, do fsyncs once a client is
 done with a file, etc in order to minimize the likelihood of running
 into this. Also, I'd have to check but I believe opening a file with
 LAZY_IO or whatever will weaken those requirements — it's probably
 not the solution you'd like here but it's an option, and if this
 turns out to be a serious issue then config options to reduce
 consistency on certain operations are likely to make their way into
 the roadmap. :)

That all makes sense.

But, it turns out the files in question were written yesterday,
and I did the stat operations today.

So, shouldn't the dirty buffer issue not be in play here?

Is there anything else that might be going on?

Thanks -- Jim

 -Greg
 
 
 




Re: RGW Blocking on 1-2 PG's - argonaut

2013-03-06 Thread Sławomir Skowron
Great, thanks. Now I understand everything.

Best Regards
SS

On 6 Mar 2013, at 15:04, Yehuda Sadeh yeh...@inktank.com wrote:

 On Wed, Mar 6, 2013 at 5:06 AM, Sławomir Skowron szi...@gmail.com wrote:
 Hi, i do some test, to reproduce this problem.

 As you can see, only one drive (each drive in same PG) is much more
 utilize, then others, and there are some ops in queue on this slow
 osd. This test is getting heads from s3 objects, alphabetically
 sorted. This is strange. why this files is going in much part only
 from this triple osd's.

 checking what osd are in this pg.

 ceph pg map 7.35b
 osdmap e117008 pg 7.35b (7.35b) - up [18,61,133] acting [18,61,133]

 On osd.61

 { num_ops: 13,
  ops: [
{ description: osd_sub_op(client.10376104.0:961532 7.35b
 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370134

 The ops log is slowing you down. Unless you really need it, set 'rgw
 enable ops log = false'. This is off by default in bobtail.


 Yehuda


Re: CephFS Space Accounting and Quotas

2013-03-06 Thread Greg Farnum
On Wednesday, March 6, 2013 at 1:28 PM, Jim Schutt wrote:
 On 03/06/2013 01:21 PM, Greg Farnum wrote:
Also, this issue of stat on files created on other clients seems
like it's going to be problematic for many interactions our users
will have with the files created by their parallel compute jobs -
any suggestion on how to avoid or fix it?

   
   
  Brief background: stat is required to provide file size information,
  and so when you do a stat Ceph needs to find out the actual file
  size. If the file is currently in use by somebody, that requires
  gathering up the latest metadata from them. Separately, while Ceph
  allows a client and the MDS to proceed with a bunch of operations
  (ie, mknod) without having it go to disk first, it requires anything
  which is visible to a third party (another client) be durable on disk
  for consistency reasons.
   
  These combine to mean that if you do a stat on a file which a client
  currently has buffered writes for, that buffer must be flushed out to
  disk before the stat can return. This is the usual cause of the slow
  stats you're seeing. You should be able to adjust dirty data
  thresholds to encourage faster writeouts, do fsyncs once a client is
  done with a file, etc in order to minimize the likelihood of running
  into this. Also, I'd have to check but I believe opening a file with
  LAZY_IO or whatever will weaken those requirements — it's probably
  not the solution you'd like here but it's an option, and if this
  turns out to be a serious issue then config options to reduce
  consistency on certain operations are likely to make their way into
  the roadmap. :)
  
  
  
 That all makes sense.
  
 But, it turns out the files in question were written yesterday,
 and I did the stat operations today.
  
 So, shouldn't the dirty buffer issue not be in play here?
Probably not. :/


 Is there anything else that might be going on?
In that case it sounds like either there's a slowdown on disk access that is 
propagating up the chain very bizarrely, there's a serious performance issue on 
the MDS (ie, swapping for everything), or the clients are still holding onto 
capabilities for the files in question and you're running into some issues with 
the capability revocation mechanisms.
Can you describe your setup a bit more? What versions are you running, kernel 
or userspace clients, etc. What config options are you setting on the MDS? 
Assuming you're on something semi-recent, getting a perfcounter dump from the 
MDS might be illuminating as well.

We'll probably want to get a high-debug log of the MDS during these slow stats 
as well.
-Greg



Re: CephFS Space Accounting and Quotas

2013-03-06 Thread Sage Weil
On Wed, 6 Mar 2013, Greg Farnum wrote:
  'ls -lh dir' seems to be just the thing if you already know dir.
   
  And it's perfectly suitable for our use case of not scheduling
  new jobs for users consuming too much space.
   
  I was thinking I might need to find a subtree where all the
  subdirectories are owned by the same user, on the theory that
  all the files in such a subtree would be owned by that same
  user. E.g., we might want such a capability to manage space per
  user in shared project directories.
   
  So, I tried 'find dir -type d -exec ls -lhd {} \;'
   
  Unfortunately, that ended up doing a 'newfstatat' on each file
  under dir, evidently to learn if it was a directory. The
  result was that same slowdown for files written on other clients.
   
  Is there some other way I should be looking for directories if I
  don't already know what they are?

Normally the readdir result has the d_type field filled in to indicate 
whether the dentry is a directory or not, which makes the stat 
unnecessary.  I'm surprised that find isn't doing that properly already!  
It's possible we aren't populating a field we should be in our readdir 
code...

  Also, this issue of stat on files created on other clients seems
  like it's going to be problematic for many interactions our users
  will have with the files created by their parallel compute jobs -
  any suggestion on how to avoid or fix it?
   
 
 Brief background: stat is required to provide file size information, and 
 so when you do a stat Ceph needs to find out the actual file size. If 
 the file is currently in use by somebody, that requires gathering up the 
 latest metadata from them. Separately, while Ceph allows a client and 
 the MDS to proceed with a bunch of operations (ie, mknod) without having 
 it go to disk first, it requires anything which is visible to a third 
 party (another client) be durable on disk for consistency reasons.
 
 These combine to mean that if you do a stat on a file which a client 
 currently has buffered writes for, that buffer must be flushed out to 
 disk before the stat can return. This is the usual cause of the slow 
 stats you're seeing. You should be able to adjust dirty data thresholds 
 to encourage faster writeouts, do fsyncs once a client is done with a 
 file, etc in order to minimize the likelihood of running into this.

This is the current behavior.  There is a bug in the tracker to introduce 
a new lock state to optimize the stat case so that writers are paused but 
buffers aren't flushed.  It hasn't been prioritized, but is not terribly 
complex.

sage


[PATCH] libceph: fix decoding of pgids

2013-03-06 Thread Sage Weil
In 4f6a7e5ee1393ec4b243b39dac9f36992d161540 we effectively dropped support
for the legacy encoding for the OSDMap and incremental.  However, we didn't
fix the decoding for the pgid.

Signed-off-by: Sage Weil s...@inktank.com
---
 net/ceph/osdmap.c |   40 +++-
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index a47ee06..6975102 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -654,6 +654,24 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
 	return 0;
 }
 
+static int __decode_pgid(void **p, void *end, struct ceph_pg *pg)
+{
+	u8 v;
+
+	ceph_decode_need(p, end, 1+8+4+4, bad);
+	v = ceph_decode_8(p);
+	if (v != 1)
+		goto bad;
+	pg->pool = ceph_decode_64(p);
+	pg->seed = ceph_decode_32(p);
+	*p += 4; /* skip preferred */
+	return 0;
+
+bad:
+	dout("error decoding pgid\n");
+	return -EINVAL;
+}
+
 /*
  * decode a full map.
  */
@@ -745,13 +763,11 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
 	for (i = 0; i < len; i++) {
 		int n, j;
 		struct ceph_pg pgid;
-		struct ceph_pg_v1 pgid_v1;
 		struct ceph_pg_mapping *pg;
 
-		ceph_decode_need(p, end, sizeof(u32) + sizeof(u64), bad);
-		ceph_decode_copy(p, &pgid_v1, sizeof(pgid_v1));
-		pgid.pool = le32_to_cpu(pgid_v1.pool);
-		pgid.seed = le16_to_cpu(pgid_v1.ps);
+		err = __decode_pgid(p, end, &pgid);
+		if (err)
+			goto bad;
 		n = ceph_decode_32(p);
 		err = -EINVAL;
 		if (n > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
@@ -818,8 +834,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	u16 version;
 
 	ceph_decode_16_safe(p, end, version, bad);
-	if (version > 6) {
-		pr_warning("got unknown v %d > %d of inc osdmap\n", version, 6);
+	if (version != 6) {
+		pr_warning("got unknown v %d != 6 of inc osdmap\n", version);
 		goto bad;
 	}
 
@@ -963,15 +979,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	while (len--) {
 		struct ceph_pg_mapping *pg;
 		int j;
-		struct ceph_pg_v1 pgid_v1;
 		struct ceph_pg pgid;
 		u32 pglen;
-		ceph_decode_need(p, end, sizeof(u64) + sizeof(u32), bad);
-		ceph_decode_copy(p, &pgid_v1, sizeof(pgid_v1));
-		pgid.pool = le32_to_cpu(pgid_v1.pool);
-		pgid.seed = le16_to_cpu(pgid_v1.ps);
-		pglen = ceph_decode_32(p);
 
+		err = __decode_pgid(p, end, &pgid);
+		if (err)
+			goto bad;
+		pglen = ceph_decode_32(p);
 		if (pglen) {
 			ceph_decode_need(p, end, pglen*sizeof(u32), bad);
-- 
1.7.9.5



Re: OpenStack summit : Ceph design session

2013-03-06 Thread Loic Dachary
Hi Neil,

On 03/06/2013 08:27 PM, Neil Levine wrote:
 I think the multi-site RGW stuff is somewhat orthogonal to OpenStack

Even when Keystone is involved?

 where as the RBD backups needs to factor in Horizon, Cinder APIs and
 where the logic for managing the backups sits.
 
 Ross is looking to get a wiki setup for Ceph blueprints so we can
 document the incremental snapshot stuff and then use this as a basis
 for creating the OpenStack blueprint.

Great.

 Who approves the session at ODS and when is this decision made?

I suspect Josh knows more than I do about this. During the Cinder meeting 
earlier today J. Griffith said that if the Nova track is too busy to host the 
"Roadmap for Ceph integration with OpenStack" session he was in favor of having 
it in the Cinder track. Following his advice I suggested to Thierry Carrez to 
open a "Cross project" track ( 
http://lists.openstack.org/pipermail/openstack-dev/2013-March/006365.html )

Cheers

 Neil
 
 On Sun, Mar 3, 2013 at 1:37 AM, Loic Dachary l...@dachary.org wrote:
 Hi Neil,

 I've updated http://summit.openstack.org/cfp/details/38 with the Geographic 
 DR related threads

 Geo-replication with RADOS GW 
 http://marc.info/?l=ceph-develm=135939566407623w=4
 Geographic DR for RGW 
 http://marc.info/?l=ceph-develm=136191479931880w=4

 I'm increasingly interested in figuring out how it fits with OpenStack.

 Cheers

 On 02/25/2013 11:04 AM, Loic Dachary wrote:
 Hi Neil,

 I've added RBD backups secondary clusters within Openstack to the list of 
 blueprints. Do you have links to mail threads / chat logs related to this 
 topic ?

 I moved the content of the session to an etherpad for collaborative editing

 https://etherpad.openstack.org/roadmap-for-ceph-integration-with-openstack

 and it is now linked from

 http://summit.openstack.org/cfp/details/38

 Cheers

 On 02/25/2013 07:12 AM, Neil Levine wrote:
 Thanks for taking the lead on this Loic.

 As a blueprint, I'd like to look at RBD backups to secondary clusters
 within Openstack. Nick Barcet and others have mentioned ideas for this
 now that Cinder is multi-cluster aware.

 Neil

 On Sun, Feb 24, 2013 at 3:16 PM, Josh Durgin josh.dur...@inktank.com 
 wrote:
 On 02/23/2013 02:33 AM, Loic Dachary wrote:

 Hi,

 In anticipation of the next OpenStack summit
 http://www.openstack.org/summit/portland-2013/, I proposed a session to
 discuss OpenStack and Ceph integration. Our meeting during FOSDEM earlier
 this month was a great experience although it was planned at the last
 minute. I hope we can organize something even better for the summit.

 For developers and contributors to both Ceph and OpenStack such as 
 myself,
 it would be a great opportunity to figure out a sensible roadmap for the
 next six months. I realize this roadmap is already clear for Josh Durgin 
 and
 other Ceph / OpenStack developers who are passionately invested in both
 projects for a long time. However I am new to both projects and such a
 session would be a precious guide and highly motivating.

 http://summit.openstack.org/cfp/details/38

 What do you think ?


 Sounds like a great idea!
 Thanks for putting together the session!

 Josh



 --
 Loïc Dachary, Artisan Logiciel Libre


-- 
Loïc Dachary, Artisan Logiciel Libre





Re: [PATCH] libceph: fix decoding of pgids

2013-03-06 Thread Yehuda Sadeh
On Wed, Mar 6, 2013 at 2:15 PM, Sage Weil s...@inktank.com wrote:
 In 4f6a7e5ee1393ec4b243b39dac9f36992d161540 we effectively dropped support
 for the legacy encoding for the OSDMap and incremental.  However, we didn't
 fix the decoding for the pgid.

 Signed-off-by: Sage Weil s...@inktank.com
 ---
  net/ceph/osdmap.c |   40 +++-
  1 file changed, 27 insertions(+), 13 deletions(-)

 diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
 index a47ee06..6975102 100644
 --- a/net/ceph/osdmap.c
 +++ b/net/ceph/osdmap.c
 @@ -654,6 +654,24 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, 
 int max)
 return 0;
  }

 +static int __decode_pgid(void **p, void *end, struct ceph_pg *pg)
 +{
 +   u8 v;
 +
 +   ceph_decode_need(p, end, 1+8+4+4, bad);
 +   v = ceph_decode_8(p);
 +   if (v != 1)
 +   goto bad;
 +   pg-pool = ceph_decode_64(p);
 +   pg-seed = ceph_decode_32(p);
 +   *p += 4; /* skip preferred */
 +   return 0;
 +
 +bad:
 +   dout(error decoding pgid\n);
 +   return -EINVAL;
 +}
 +
  /*
   * decode a full map.
   */
 @@ -745,13 +763,11 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
 for (i = 0; i  len; i++) {
 int n, j;
 struct ceph_pg pgid;
 -   struct ceph_pg_v1 pgid_v1;
 struct ceph_pg_mapping *pg;

 -   ceph_decode_need(p, end, sizeof(u32) + sizeof(u64), bad);
 -   ceph_decode_copy(p, pgid_v1, sizeof(pgid_v1));
 -   pgid.pool = le32_to_cpu(pgid_v1.pool);
 -   pgid.seed = le16_to_cpu(pgid_v1.ps);
 +   err = __decode_pgid(p, end, pgid);
 +   if (err)
 +   goto bad;
 n = ceph_decode_32(p);
 err = -EINVAL;
 if (n  (UINT_MAX - sizeof(*pg)) / sizeof(u32))
 @@ -818,8 +834,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, 
 void *end,
 u16 version;

 ceph_decode_16_safe(p, end, version, bad);
 -   if (version  6) {
 -   pr_warning(got unknown v %d  %d of inc osdmap\n, version, 
 6);
 +   if (version != 6) {
 +   pr_warning(got unknown v %d != 6 of inc osdmap\n, version);
 goto bad;
 }

 @@ -963,15 +979,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, 
 void *end,
 while (len--) {
 struct ceph_pg_mapping *pg;
 int j;
 -   struct ceph_pg_v1 pgid_v1;
 struct ceph_pg pgid;
 u32 pglen;
 -   ceph_decode_need(p, end, sizeof(u64) + sizeof(u32), bad);
 -   ceph_decode_copy(p, pgid_v1, sizeof(pgid_v1));
 -   pgid.pool = le32_to_cpu(pgid_v1.pool);
 -   pgid.seed = le16_to_cpu(pgid_v1.ps);
 -   pglen = ceph_decode_32(p);

 +   err = __decode_pgid(p, end, pgid);
 +   if (err)
 +   goto bad;

maybe missing?

ceph_decode_need(p, end, sizeof(u32), bad);

 +   pglen = ceph_decode_32(p);
 if (pglen) {
 ceph_decode_need(p, end, pglen*sizeof(u32), bad);

 --
 1.7.9.5



Re: [PATCH] libceph: fix decoding of pgids

2013-03-06 Thread Sage Weil
On Wed, 6 Mar 2013, Yehuda Sadeh wrote:
 On Wed, Mar 6, 2013 at 2:15 PM, Sage Weil s...@inktank.com wrote:
  In 4f6a7e5ee1393ec4b243b39dac9f36992d161540 we effectively dropped support
  for the legacy encoding for the OSDMap and incremental.  However, we didn't
  fix the decoding for the pgid.
 
  Signed-off-by: Sage Weil s...@inktank.com
  ---
   net/ceph/osdmap.c |   40 +++-
   1 file changed, 27 insertions(+), 13 deletions(-)
 
  diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
  index a47ee06..6975102 100644
  --- a/net/ceph/osdmap.c
  +++ b/net/ceph/osdmap.c
  @@ -654,6 +654,24 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, 
  int max)
  return 0;
   }
 
  +static int __decode_pgid(void **p, void *end, struct ceph_pg *pg)
  +{
  +   u8 v;
  +
  +   ceph_decode_need(p, end, 1+8+4+4, bad);
  +   v = ceph_decode_8(p);
  +   if (v != 1)
  +   goto bad;
  +   pg-pool = ceph_decode_64(p);
  +   pg-seed = ceph_decode_32(p);
  +   *p += 4; /* skip preferred */
  +   return 0;
  +
  +bad:
  +   dout(error decoding pgid\n);
  +   return -EINVAL;
  +}
  +
   /*
* decode a full map.
*/
  @@ -745,13 +763,11 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
  for (i = 0; i  len; i++) {
  int n, j;
  struct ceph_pg pgid;
  -   struct ceph_pg_v1 pgid_v1;
  struct ceph_pg_mapping *pg;
 
  -   ceph_decode_need(p, end, sizeof(u32) + sizeof(u64), bad);
  -   ceph_decode_copy(p, pgid_v1, sizeof(pgid_v1));
  -   pgid.pool = le32_to_cpu(pgid_v1.pool);
  -   pgid.seed = le16_to_cpu(pgid_v1.ps);
  +   err = __decode_pgid(p, end, pgid);
  +   if (err)
  +   goto bad;
  n = ceph_decode_32(p);
  err = -EINVAL;
  if (n  (UINT_MAX - sizeof(*pg)) / sizeof(u32))
  @@ -818,8 +834,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, 
  void *end,
  u16 version;
 
  ceph_decode_16_safe(p, end, version, bad);
  -   if (version  6) {
  -   pr_warning(got unknown v %d  %d of inc osdmap\n, 
  version, 6);
  +   if (version != 6) {
  +   pr_warning(got unknown v %d != 6 of inc osdmap\n, 
  version);
  goto bad;
  }
 
  @@ -963,15 +979,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void 
  **p, void *end,
  while (len--) {
  struct ceph_pg_mapping *pg;
  int j;
  -   struct ceph_pg_v1 pgid_v1;
  struct ceph_pg pgid;
  u32 pglen;
  -   ceph_decode_need(p, end, sizeof(u64) + sizeof(u32), bad);
  -   ceph_decode_copy(p, pgid_v1, sizeof(pgid_v1));
  -   pgid.pool = le32_to_cpu(pgid_v1.pool);
  -   pgid.seed = le16_to_cpu(pgid_v1.ps);
  -   pglen = ceph_decode_32(p);
 
  +   err = __decode_pgid(p, end, pgid);
  +   if (err)
  +   goto bad;
 
 maybe missing?
 
 ceph_decode_need(p, end, sizeof(u32), bad);

Yup, for both call sites.  Pushed updated patch to testing.

Thanks!
sage


 
  +   pglen = ceph_decode_32(p);
  if (pglen) {
  ceph_decode_need(p, end, pglen*sizeof(u32), bad);
 
  --
  1.7.9.5
 
 
 


Re: OpenStack summit : Ceph design session

2013-03-06 Thread Neil Levine
On Wed, Mar 6, 2013 at 2:45 PM, Loic Dachary l...@dachary.org wrote:
 Hi Neil,

 On 03/06/2013 08:27 PM, Neil Levine wrote:
 I think the multi-site RGW stuff is somewhat orthogonal to OpenStack

 Even when keystone is involved ?

Good question.

Yehuda: how would the asynchronously replicated user metadata interact
with Keystone?

 Who approves the session at ODS and when is this decision made?

 I suspect Josh knows more than I do about this. During the cinder meeting 
 earlier today J. Griffith said that if the nova track is too busy to host the 
 Roadmap for Ceph integration with OpenStack session he was in favor of 
 having it in the cinder track. Following his advice I suggested to Thierry 
 Carrez to open a Cross project track ( 
 http://lists.openstack.org/pipermail/openstack-dev/2013-March/006365.html )

I don't think Cinder is such a bad place for it to be, as presumably
the interaction to copy the block device to a secondary location would
be triggered through a Cinder API call, no?

Neil


Re: Approaches to wrapping aio_exec

2013-03-06 Thread Noah Watkins
So I've been playing with the ObjectOperationCompletion code a bit. It seems to 
be really important to be able to handle decoding errors in the 
handle_completion() callback. In particular, I'd like to be able to reach out 
and set the return value the user will see in the AioCompletion.

Any thoughts on dealing with this some how?

-Noah

On Mar 4, 2013, at 11:44 AM, Yehuda Sadeh yeh...@inktank.com wrote:

 On Mon, Mar 4, 2013 at 11:34 AM, Noah Watkins jayh...@cs.ucsc.edu wrote:
 
 On Mar 3, 2013, at 6:31 PM, Yehuda Sadeh yeh...@inktank.com wrote:
 
 I pushed the wip-librados-exec branch last week that solves a similar
 issue. I added two more ObjectOperation::exec() api calls. The more
 interesting one added a callback context that is called with the
 output buffer of the completed sub-op. Currently in order to use it
 you'll need to use operate()/aio_operate(), however, a similar
 aio_exec interface can be added.
 
 Thanks for the pointer to the branch. So, if I understand correctly,
 we might have a new librados::aio_exec_completion call that accepts
 a completion object? For example:
 
  aio_exec_completion(AioCompletion *c, bufferlist *outbl,
                      ObjectOperationCompletion* completion)
  {
   Context *onack = new C_aio_Ack(c);
  
   ::ObjectOperation rd;
   ObjectOpCompletionCtx *ctx = new ObjectOpCompletionCtx(completion);
   rd.call(cls, method, inbl, &ctx->outbl, ctx, NULL);
   objecter->read(oid, oloc, rd, snap_seq, outbl, 0, onack, &c->objver);
  
   return 0;
  }
 
  where the caller would provide an ObjectOperationCompletion whose
  finish(..) would unwrap the protocol?
 
 Right.
 
 
  Do you expect wip-librados-exec to go upstream pretty soon, and would
 
 We can push it ahead if needed, it doesn't depend on any of the stuff
 I'm working on right now. It just waits for someone to properly review
 it.
 
 something like librados::aio_exec_completion be a candidate for adding
 to librados?
 
 
 Sure, if there's a need then I don't see why not.
 
 Yehuda



Re: CephFS Space Accounting and Quotas

2013-03-06 Thread Greg Farnum
On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
 When I'm doing these stat operations the file system is otherwise
 idle.

What's the cluster look like? This is just one active MDS and a couple hundred 
clients?

 What is happening is that once one of these slow stat operations
 on a file completes, it never happens again for that file, from
 any client. At least, that's the case if I'm not writing to
 the file any more. I haven't checked if appending to the files
 restarts the behavior.

I assume it'll come back, but if you could verify that'd be good.

 
 On the client side I'm running with 3.8.2 + the ceph patch queue
 that was merged into 3.9-rc1.
 
 On the server side I'm running recent next branch (commit 0f42eddef5),
 with the tcp receive socket buffer option patches cherry-picked.
 I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num
 rather than pg_bits to set initial number of PGs (same for pgp_num),
 and a patch that lets me run with just one pool that contains both
 data and metadata. I'm testing data distribution uniformity with 512K PGs.
 
 My MDS tunables are all at default settings.
 
  
  We'll probably want to get a high-debug log of the MDS during these slow 
  stats as well.
 
 OK.
 
 Do you want me to try to reproduce with a more standard setup?
No, this is fine. 
 
 Also, I see Sage just pushed a patch to pgid decoding - I expect
 I need that as well, if I'm running the latest client code.

Yeah, if you've got the commit it references you'll want it.

 Do you want the MDS log at 10 or 20?
More is better. ;)
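
(For reference, one way to crank the MDS logging up on a running system, assuming
a single mds.0; alternatively set the equivalent in ceph.conf and restart:)

  ceph mds tell 0 injectargs '--debug_mds 20 --debug_ms 1'

  # or, in ceph.conf on the MDS host:
  [mds]
      debug mds = 20
      debug ms = 1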

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MDS running at 100% CPU, no clients

2013-03-06 Thread Noah Watkins

On Mar 6, 2013, at 5:57 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:

 The MDS process in my cluster is running at 100% CPU. In fact I thought the 
 cluster came down, but rather an ls was taking a minute. There aren't any 
 clients active. I've left the process running in case there is any probing 
 you'd like to do on it:
 
 virt   res  cpu
 4629m  88m 5260 S   92  1.1 113:32.79 ceph-mds
 
 Thanks,
 Noah
 


This is a ceph-mds child thread under strace. The only thread
that appears to be doing anything.

root@issdm-44:/home/hadoop/hadoop-common# strace -p 3372
Process 3372 attached - interrupt to quit
read(1649, 7f0203235000-7f0203236000 ---p 0..., 8191) = 4050
read(1649, 7f0205053000-7f0205054000 ---p 0..., 8191) = 4050
read(1649, 7f0206e71000-7f0206e72000 ---p 0..., 8191) = 4050
read(1649, 7f0214144000-7f0214244000 rw-p 0..., 8191) = 4020
read(1649, 7f0215f62000-7f0216062000 rw-p 0..., 8191) = 4020
read(1649, 7f0217d8-7f0217e8 rw-p 0..., 8191) = 4020
read(1649, 7f0219b9e000-7f0219c9e000 rw-p 0..., 8191) = 4020
...

That file looks to be:

ceph-mds 3337 root 1649r   REG0,30   266903 /proc/3337/maps

(3337 is the parent process).


Re: MDS running at 100% CPU, no clients

2013-03-06 Thread Noah Watkins
Which looks to be in a tight loop in the memory model _sample…

(gdb) bt
#0  0x7f0270d84d2d in read () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x7f027046dd88 in std::__basic_file<char>::xsgetn(char*, long) () from 
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x7f027046f4c5 in std::basic_filebuf<char, std::char_traits<char> >::underflow() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x7f0270467ceb in std::basic_istream<char, std::char_traits<char> >& std::getline<char, std::char_traits<char>, std::allocator<char> >(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x0072bdd4 in MemoryModel::_sample(MemoryModel::snap*) ()
#5  0x005658db in MDCache::check_memory_usage() ()
#6  0x004ba929 in MDS::tick() ()
#7  0x00794c65 in SafeTimer::timer_thread() ()
#8  0x007958ad in SafeTimerThread::entry() ()
#9  0x7f0270d7de9a in start_thread () from 
/lib/x86_64-linux-gnu/libpthread.so.0

On Mar 6, 2013, at 6:18 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:

 
 On Mar 6, 2013, at 5:57 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:
 
 The MDS process in my cluster is running at 100% CPU. In fact I thought the 
 cluster came down, but rather an ls was taking a minute. There aren't any 
 clients active. I've left the process running in case there is any probing 
 you'd like to do on it:
 
 virt   res  cpu
 4629m  88m 5260 S   92  1.1 113:32.79 ceph-mds
 
 Thanks,
 Noah
 
 
 
 This is a ceph-mds child thread under strace. The only thread
 that appears to be doing anything.
 
 root@issdm-44:/home/hadoop/hadoop-common# strace -p 3372
 Process 3372 attached - interrupt to quit
 read(1649, 7f0203235000-7f0203236000 ---p 0..., 8191) = 4050
 read(1649, 7f0205053000-7f0205054000 ---p 0..., 8191) = 4050
 read(1649, 7f0206e71000-7f0206e72000 ---p 0..., 8191) = 4050
 read(1649, 7f0214144000-7f0214244000 rw-p 0..., 8191) = 4020
 read(1649, 7f0215f62000-7f0216062000 rw-p 0..., 8191) = 4020
 read(1649, 7f0217d8-7f0217e8 rw-p 0..., 8191) = 4020
 read(1649, 7f0219b9e000-7f0219c9e000 rw-p 0..., 8191) = 4020
 ...
 
 That file looks to be:
 
 ceph-mds 3337 root 1649r   REG0,30   266903 /proc/3337/maps
 
 (3337 is the parent process).
