Re: stuff for v0.56.4
On 03/06/2013 12:10 AM, Sage Weil wrote:
> There have been a few important bug fixes that people are hitting or want:
> - the journal replay bug (5d54ab154ca790688a6a1a2ad5f869c17a23980a)
> - the '-' vs '_' pool name cap parsing issue that is biting OpenStack users
> - ceph-disk-* changes to support the latest ceph-deploy
>
> If there are other things that we want to include in 0.56.4, let's get them into the bobtail branch sooner rather than later. Possible items:
> - pg log trimming (probably a conservative subset) to avoid memory bloat
> - omap scrub?
> - pg temp collection removal?
> - buffer::cmp fix from Loic?
>
> Are there other items that we are missing?

I'm still seeing #3816 on my systems. The fix in wip-3816 did not resolve it for me.

Wido

> sage

--
Wido den Hollander
42on B.V.
Phone: +31 (0)20 700 9902
Skype: contact42on
Re: [ceph-users] Using different storage types on same osd hosts?
Hi,

We did the opposite here: we added some SSDs in free slots after having a normal cluster running on SATA. We just created a new pool for them and separated the two types. I used this as a template: http://ceph.com/docs/master/rados/operations/crush-map/?highlight=ssd#placing-different-pools-on-different-osds and left out the part about placing a master copy on each SSD. I had to create the pool, rack and host in the CRUSH rules for the first server (it wouldn't let me do it from the command line using 'ceph osd crush set ...'); after that I could just add servers/OSDs to it like normal.

Unless you really need two separate clusters, I'd go with just having different pools for it; with two clusters you'd need a copy of every service (mons, storage nodes, etc.). More info on running multiple clusters here: http://ceph.com/docs/master/rados/configuration/ceph-conf/#running-multiple-clusters

Cheers,
Martin

On Tue, Mar 5, 2013 at 9:48 PM, Stefan Priebe s.pri...@profihost.ag wrote:
> Hi,
> right now I have a bunch of OSD hosts (servers) which have just 4 disks each. All of them use SSDs right now, so I have a lot of free hard disk slots in the chassis. My idea was to create a second Ceph system using these free slots. Is this possible? Or should I just use the first one with different rules? Any hints?
> Greets,
> Stefan
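For readers who want the concrete steps, the approach in that doc section boils down to editing the CRUSH map by hand. This is only a sketch; the pool name, rule name, and ruleset id below are made up, so adapt them to your own map:

    # dump and decompile the current CRUSH map
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # edit crushmap.txt: add a separate root (e.g. "ssd") with its own hosts
    # and OSDs, plus a rule whose first step is "step take ssd"

    # compile and inject the edited map
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

    # create a pool and point it at the new rule (ruleset id 3 is just an example)
    ceph osd pool create ssd 512 512
    ceph osd pool set ssd crush_ruleset 3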
Re: RGW Blocking on 1-2 PG's - argonaut
Hi, i do some test, to reproduce this problem. As you can see, only one drive (each drive in same PG) is much more utilize, then others, and there are some ops in queue on this slow osd. This test is getting heads from s3 objects, alphabetically sorted. This is strange. why this files is going in much part only from this triple osd's. checking what osd are in this pg. ceph pg map 7.35b osdmap e117008 pg 7.35b (7.35b) - up [18,61,133] acting [18,61,133] On osd.61 { num_ops: 13, ops: [ { description: osd_sub_op(client.10376104.0:961532 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370134 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.448543, age: 0.032431, flag_point: started}, { description: osd_sub_op(client.10376110.0:972570 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370135 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.453829, age: 0.027145, flag_point: started}, { description: osd_sub_op(client.10376104.0:961534 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370136 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.454012, age: 0.026962, flag_point: started}, { description: osd_sub_op(client.10376107.0:952760 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370137 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.458980, age: 0.021994, flag_point: started}, { description: osd_sub_op(client.10376110.0:972572 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370138 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.459546, age: 0.021428, flag_point: started}, { description: osd_sub_op(client.10376110.0:972574 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370139 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.463680, age: 0.017294, flag_point: started}, { description: osd_sub_op(client.10376107.0:952762 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370140 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.464660, age: 0.016314, flag_point: started}, { description: osd_sub_op(client.10376104.0:961536 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370141 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.468076, age: 0.012898, flag_point: started}, { description: osd_sub_op(client.10376110.0:972576 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370142 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.468332, age: 0.012642, flag_point: started}, { description: osd_sub_op(client.10376107.0:952764 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370143 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.470480, age: 0.010494, flag_point: started}, { description: osd_sub_op(client.10376107.0:952766 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370144 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.475372, age: 0.005602, flag_point: started}, { description: osd_sub_op(client.10376104.0:961538 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370145 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.479391, age: 0.001583, flag_point: started}, { description: osd_sub_op(client.10376107.0:952768 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370146 snapset=0=[]:[] snapc=0=[]), received_at: 2013-03-06 13:59:18.480276, age: 0.000698, flag_point: started}]} On osd.18 { num_ops: 9, ops: [ { description: 
osd_op(client.10391092.0:718883 2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b), received_at: 2013-03-06 13:57:52.929677, age: 0.025480, flag_point: waiting for sub ops, client_info: { client: client.10391092, tid: 718883}}, { description: osd_op(client.10373691.0:956595 2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b), received_at: 2013-03-06 13:57:52.934533, age: 0.020624, flag_point: waiting for sub ops, client_info: { client: client.10373691, tid: 956595}}, { description: osd_op(client.10391092.0:718885 2013-03-06-13-8700.1-ocdn [append 0~299] 7.2b11a75b), received_at: 2013-03-06 13:57:52.937101, age: 0.018056, flag_point: waiting for sub ops, client_info: { client: client.10391092, tid: 718885}}, { description:
ceph -v doesn't work
Hi,

Since I compile my own Debian packages, ceph -v doesn't work. I follow these steps:
- git clone XXX
- git checkout origin/bobtail
- dch -i
- dpkg-source -b ceph
- cowbuilder --build ceph*dsc

and I obtain:

root! okko:~# ceph -v
ceph version ()
root! okko:~#

or with strace:

write(1, "ceph version ()\n", 17) = 17
exit_group(0) = ?

Do you know how I can fix that?

Thanks,
Olivier

PS: I compile these packages to enable syncfs support on Debian 6 (Squeeze), since I use a recent kernel.
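A hedged guess at what to check: the version string is generated from the git metadata at build time, so if the tree that actually gets built has lost it (for example an exported source tree without .git), the binary ends up reporting an empty version. The file names below are from memory of the ceph source tree and may differ by branch:

    # inside the checkout, before running dpkg-source / cowbuilder:
    git describe              # should print something like v0.56.x-...-g<sha>
    cat src/.git_version      # should contain a commit sha, not be empty or missing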
Re: RGW Blocking on 1-2 PG's - argonaut
On Wed, Mar 6, 2013 at 5:06 AM, Sławomir Skowron szi...@gmail.com wrote:
> Hi, i do some test, to reproduce this problem. As you can see, only one drive (each drive in same PG) is much more utilize, then others, and there are some ops in queue on this slow osd. This test is getting heads from s3 objects, alphabetically sorted. This is strange. why this files is going in much part only from this triple osd's. checking what osd are in this pg.
>
> ceph pg map 7.35b
> osdmap e117008 pg 7.35b (7.35b) - up [18,61,133] acting [18,61,133]
>
> On osd.61 { num_ops: 13, ops: [ { description: osd_sub_op(client.10376104.0:961532 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370134

The ops log is slowing you down. Unless you really need it, set 'rgw enable ops log = false'. This is off by default in bobtail.

Yehuda
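For anyone wanting to turn this off on argonaut, the option lives in the gateway's section of ceph.conf and takes effect on a radosgw restart. The section name below is just an example; use whatever section your gateway instance reads:

    [client.radosgw.gateway]
        rgw enable ops log = false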
Usable Space
Proxying this in for a user I had a discussion with on IRC this morning.

The question is: is there a way to display usable space based on replication level? Ultimately what would be nice is to see something like the following:

$: sudo ceph --usable-space
Total Space: X / Y  ||  Usable Space: A / B
By Pools:
  rbd -- J / K
  foo -- F / G
  bar -- H / I
  baz -- C / D

Would it be possible to add this in at some point? It seems like a great addition to go with some of the other 'usability enhancements' that are planned. Or would this get computationally sticky based on having many pools with different replication levels?

Best Regards,

--
Patrick McGarry
Director, Community
Inktank
http://ceph.com || http://inktank.com
@scuttlemonkey || @ceph || @inktank
Re: Usable Space
> Total Space: X / Y || Usable Space: A / B
>
> Would it be possible to add this in at some point? Seems like a great addition to go with some of the other 'usability enhancements' that are planned. Or would this get computationally sticky based on having many pools with different replication levels?

How would you even compute it? If the underlying storage is shared between several pools with different replication levels, the usable space depends on which pool you actually put the data in: for example, 30 TB of free raw space is roughly 15 TB usable for a size=2 pool but only 10 TB for a size=3 pool, and you can't know in advance how the pools will split it.

You could do it per pool, but even then I think it can get tricky, because the CRUSH map could very well distribute the data in a pool over a subset of the OSDs, so you'd need to take that into account as well.

Cheers,
Sylvain
Re: CephFS First product release discussion
On 03/05/2013 08:33 PM, Sage Weil wrote:
> On Tue, 5 Mar 2013, Wido den Hollander wrote:
>
> Wido, by 'user quota' do you mean something that is uid-based, or would enforcement on subtree/directory quotas be sufficient for your use cases? I've been holding out hope that uid-based usage accounting is a thing of the past and that subtrees are sufficient for real users... in which case adding enforcement to the existing rstats infrastructure is a very manageable task.

I mean actual uid-based quotas. That still plays nicely with shared environments like Samba, where you have all home directories on a shared filesystem and you set per-user quotas. Samba reads out those quotas and propagates them to the (Windows) client.

> Does Samba propagate the quota information (how much space is used/available) or do enforcement on the client side? (Is client enforcement even necessary/useful if the backend will stop writes when the quota is exceeded?)

I'm not sure. It will at least tell the user how much he/she is using on that volume and what the quota is. I'm not sure who enforces it, Samba or the filesystem; from a quick Google it seems the filesystem has to enforce the quota, Samba doesn't. I know this was a problem with ZFS as well. They also said they could do per-filesystem quotas, so that would be sufficient, but for example NFS doesn't export filesystems mounted inside an export, so if you have a bunch of home directories on the filesystem and you want to account the usage of each user it gets kind of hard. This could be solved if the clients directly mounted CephFS, though. I'm talking about setups where you have 100k users in LDAP and they all have their data in a single filesystem and you want to track the usage of each user; that's not an easy task without uid-based quotas.

> Wouldn't each user live in a sub- or home directory? If so, it seems like the existing rstats would be sufficient to do the accounting piece; only enforcement is missing.

Running 'du' on each directory would be much faster with Ceph since it tracks the subdirectories and shows their total size with an 'ls -al'. Environments with 100k users also tend to be very dynamic, with users added and removed all the time, so creating separate filesystems for them would be very time consuming. Now, I'm not talking about enforcing soft or hard quotas; I'm just talking about knowing how much space uid X and Y consume on the filesystem.

> The part I'm most unclear on is what use cases people have where uid X and Y are spread around the file system (not in a single or small set of sub directories) and per-user (not, say, per-project) quotas are still necessary. In most environments, users get their own home directory and everything lives there...

I see a POSIX filesystem as being partially legacy, and a part of that legacy is user quotas. If you want existing applications that rely on user quotas to seamlessly switch from NFS to CephFS, they will need this to work. We only talked about user quotas, but group quotas are just as important: if you have 10 users where 5 of them are in the group webdev and you want to know how much space is being used by webdev, you query the group quota and you are done.

In some setups, like ours, users have data in directories outside their home directories / NFS exports. On one machine you just run quota -u uid and you know how much user X is using spread out over all the filesystems.
With rstats you would be able to achieve the same with some scripting, but that doesn't make the migration seamless.

Wido

> sage

--
Wido den Hollander
42on B.V.
Phone: +31 (0)20 700 9902
Skype: contact42on
Re: Usable Space
You're aware of the just-added "ceph df"? I don't know it well enough to know if it's a solution, but it's in that space...

On Mar 6, 2013, at 6:48 AM, Patrick McGarry patr...@inktank.com wrote:
> Proxy-ing this in for a user I had a discussion with on irc this morning: The question is is there a way to display usable space based on replication level? Ultimately what would be nice is to see something like the following:
>
> $: sudo ceph --usable-space
> Total Space: X / Y || Usable Space: A / B
> By Pools:
>   rbd -- J / K
>   foo -- F / G
>   bar -- H / I
>   baz -- C / D
>
> Would it be possible to add this in at some point? Seems like a great addition to go with some of the other 'usability enhancements' that are planned. Or would this get computationally sticky based on having many pools with different replication levels?
>
> Best Regards,
> --
> Patrick McGarry
> Director, Community
> Inktank
> http://ceph.com || http://inktank.com
> @scuttlemonkey || @ceph || @inktank
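For the user who asked, the command being referred to is simply:

    ceph df

It prints a global section (total and available raw space) and a per-pool section. Whether the per-pool numbers account for replication is worth checking against the version you run, since the command was brand new at the time.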
Re: CephFS First product release discussion
On 03/05/2013 12:33 PM, Sage Weil wrote:
>> Running 'du' on each directory would be much faster with Ceph since it tracks the subdirectories and shows their total size with an 'ls -al'. Environments with 100k users also tend to be very dynamic with adding and removing users all the time, so creating separate filesystems for them would be very time consuming. Now, I'm not talking about enforcing soft or hard quotas, I'm just talking about knowing how much space uid X and Y consume on the filesystem.
>
> The part I'm most unclear on is what use cases people have where uid X and Y are spread around the file system (not in a single or small set of sub directories) and per-user (not, say, per-project) quotas are still necessary. In most environments, users get their own home directory and everything lives there...

Hmmm, is there a tool I should be using that will return the space used by a directory and all its descendants? If it's 'du', that tool is definitely not fast for me.

I'm doing an 'strace du -s path', where path has one subdirectory which contains ~600 files. I've got ~200 clients mounting the file system, and each client wrote 3 files in that directory. I'm doing the 'du' from one of those nodes, and the strace is showing me du is doing a 'newfstat' for each file. For each file that was written on a different client from where du is running, that 'newfstat' takes tens of seconds to return. Which means my 'du' has been running for quite some time and hasn't finished yet.

I'm hoping there's another tool I'm supposed to be using that I don't know about yet. Our use case includes tens of millions of files written from thousands of clients, and whatever tool we use to do space accounting needs to not walk an entire directory tree, checking each file.

-- Jim
CephFS Space Accounting and Quotas (was: CephFS First product release discussion)
On Wednesday, March 6, 2013 at 11:07 AM, Jim Schutt wrote:
> On 03/05/2013 12:33 PM, Sage Weil wrote:
>>> Running 'du' on each directory would be much faster with Ceph since it tracks the subdirectories and shows their total size with an 'ls -al'. Environments with 100k users also tend to be very dynamic with adding and removing users all the time, so creating separate filesystems for them would be very time consuming. Now, I'm not talking about enforcing soft or hard quotas, I'm just talking about knowing how much space uid X and Y consume on the filesystem.
>>
>> The part I'm most unclear on is what use cases people have where uid X and Y are spread around the file system (not in a single or small set of sub directories) and per-user (not, say, per-project) quotas are still necessary. In most environments, users get their own home directory and everything lives there...
>
> Hmmm, is there a tool I should be using that will return the space used by a directory and all its descendants? If it's 'du', that tool is definitely not fast for me. I'm doing an 'strace du -s path', where path has one subdirectory which contains ~600 files. I've got ~200 clients mounting the file system, and each client wrote 3 files in that directory. I'm doing the 'du' from one of those nodes, and the strace is showing me du is doing a 'newfstat' for each file. For each file that was written on a different client from where du is running, that 'newfstat' takes tens of seconds to return. Which means my 'du' has been running for quite some time and hasn't finished yet.
>
> I'm hoping there's another tool I'm supposed to be using that I don't know about yet. Our use case includes tens of millions of files written from thousands of clients, and whatever tool we use to do space accounting needs to not walk an entire directory tree, checking each file.

Check out the directory sizes with ls -l or whatever — those numbers are semantically meaningful! :)

Unfortunately we can't (currently) use those recursive statistics to do proper hard quotas on subdirectories, as they're lazily propagated following client ops, not as part of the updates. (Lazily in the technical sense — it's actually quite fast in general.) But they'd work fine for soft quotas if somebody wrote the code, or to block writes on a slight time lag.

-Greg
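A detail that may help here: the recursive statistics are also exposed as virtual extended attributes on directories (at least in clients of roughly this vintage), so a single directory can be queried without parsing ls output. The attribute names are from memory and the mount point is hypothetical:

    # recursive bytes / files / subdirs below a directory, as tracked by the MDS
    getfattr -n ceph.dir.rbytes   /mnt/ceph/some/dir
    getfattr -n ceph.dir.rfiles   /mnt/ceph/some/dir
    getfattr -n ceph.dir.rsubdirs /mnt/ceph/some/dir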
Re: OpenStack summit : Ceph design session
I think the multi-site RGW stuff is somewhat orthogonal to OpenStack, whereas the RBD backups need to factor in Horizon, the Cinder APIs, and where the logic for managing the backups sits.

Ross is looking to get a wiki set up for Ceph blueprints, so we can document the incremental snapshot stuff and then use this as a basis for creating the OpenStack blueprint.

Who approves the session at ODS, and when is this decision made?

Neil

On Sun, Mar 3, 2013 at 1:37 AM, Loic Dachary l...@dachary.org wrote:
> Hi Neil,
> I've updated http://summit.openstack.org/cfp/details/38 with the Geographic DR related threads:
> Geo-replication with RADOS GW http://marc.info/?l=ceph-devel&m=135939566407623&w=4
> Geographic DR for RGW http://marc.info/?l=ceph-devel&m=136191479931880&w=4
> I'm increasingly interested in figuring out how it fits with OpenStack.
> Cheers
>
> On 02/25/2013 11:04 AM, Loic Dachary wrote:
>> Hi Neil,
>> I've added "RBD backups to secondary clusters within Openstack" to the list of blueprints. Do you have links to mail threads / chat logs related to this topic? I moved the content of the session to an etherpad for collaborative editing https://etherpad.openstack.org/roadmap-for-ceph-integration-with-openstack and it is now linked from http://summit.openstack.org/cfp/details/38
>> Cheers
>>
>> On 02/25/2013 07:12 AM, Neil Levine wrote:
>>> Thanks for taking the lead on this Loic. As a blueprint, I'd like to look at RBD backups to secondary clusters within Openstack. Nick Barcet and others have mentioned ideas for this now that Cinder is multi-cluster aware.
>>> Neil
>>>
>>> On Sun, Feb 24, 2013 at 3:16 PM, Josh Durgin josh.dur...@inktank.com wrote:
>>>> On 02/23/2013 02:33 AM, Loic Dachary wrote:
>>>>> Hi,
>>>>> In anticipation of the next OpenStack summit http://www.openstack.org/summit/portland-2013/, I proposed a session to discuss OpenStack and Ceph integration. Our meeting during FOSDEM earlier this month was a great experience although it was planned at the last minute. I hope we can organize something even better for the summit. For developers and contributors to both Ceph and OpenStack such as myself, it would be a great opportunity to figure out a sensible roadmap for the next six months. I realize this roadmap is already clear for Josh Durgin and other Ceph / OpenStack developers who are passionately invested in both projects for a long time. However I am new to both projects and such a session would be a precious guide and highly motivating. http://summit.openstack.org/cfp/details/38 What do you think?
>>>>
>>>> Sounds like a great idea! Thanks for putting together the session!
>>>> Josh
>
> --
> Loïc Dachary, Artisan Logiciel Libre
Re: CephFS Space Accounting and Quotas
On 03/06/2013 12:13 PM, Greg Farnum wrote:
> On Wednesday, March 6, 2013 at 11:07 AM, Jim Schutt wrote:
>> On 03/05/2013 12:33 PM, Sage Weil wrote:
>>>> Running 'du' on each directory would be much faster with Ceph since it tracks the subdirectories and shows their total size with an 'ls -al'. Environments with 100k users also tend to be very dynamic with adding and removing users all the time, so creating separate filesystems for them would be very time consuming. Now, I'm not talking about enforcing soft or hard quotas, I'm just talking about knowing how much space uid X and Y consume on the filesystem.
>>>
>>> The part I'm most unclear on is what use cases people have where uid X and Y are spread around the file system (not in a single or small set of sub directories) and per-user (not, say, per-project) quotas are still necessary. In most environments, users get their own home directory and everything lives there...
>>
>> Hmmm, is there a tool I should be using that will return the space used by a directory and all its descendants? If it's 'du', that tool is definitely not fast for me. I'm doing an 'strace du -s path', where path has one subdirectory which contains ~600 files. I've got ~200 clients mounting the file system, and each client wrote 3 files in that directory. I'm doing the 'du' from one of those nodes, and the strace is showing me du is doing a 'newfstat' for each file. For each file that was written on a different client from where du is running, that 'newfstat' takes tens of seconds to return. Which means my 'du' has been running for quite some time and hasn't finished yet. I'm hoping there's another tool I'm supposed to be using that I don't know about yet. Our use case includes tens of millions of files written from thousands of clients, and whatever tool we use to do space accounting needs to not walk an entire directory tree, checking each file.
>
> Check out the directory sizes with ls -l or whatever — those numbers are semantically meaningful! :)

That is just exceptionally cool!

> Unfortunately we can't (currently) use those recursive statistics to do proper hard quotas on subdirectories as they're lazily propagated following client ops, not as part of the updates. (Lazily in the technical sense — it's actually quite fast in general). But they'd work fine for soft quotas if somebody wrote the code, or to block writes on a slight time lag.

'ls -lh dir' seems to be just the thing if you already know dir. And it's perfectly suitable for our use case of not scheduling new jobs for users consuming too much space.

I was thinking I might need to find a subtree where all the subdirectories are owned by the same user, on the theory that all the files in such a subtree would be owned by that same user. E.g., we might want such a capability to manage space per user in shared project directories. So, I tried 'find dir -type d -exec ls -lhd {} \;'. Unfortunately, that ended up doing a 'newfstatat' on each file under dir, evidently to learn if it was a directory. The result was that same slowdown for files written on other clients. Is there some other way I should be looking for directories if I don't already know what they are?

Also, this issue of stat on files created on other clients seems like it's going to be problematic for many interactions our users will have with the files created by their parallel compute jobs - any suggestion on how to avoid or fix it?

Thanks!
-- Jim
Re: CephFS Space Accounting and Quotas
On Wednesday, March 6, 2013 at 11:58 AM, Jim Schutt wrote:
> On 03/06/2013 12:13 PM, Greg Farnum wrote:
>> Check out the directory sizes with ls -l or whatever — those numbers are semantically meaningful! :)
>
> That is just exceptionally cool!
>
>> Unfortunately we can't (currently) use those recursive statistics to do proper hard quotas on subdirectories as they're lazily propagated following client ops, not as part of the updates. (Lazily in the technical sense — it's actually quite fast in general). But they'd work fine for soft quotas if somebody wrote the code, or to block writes on a slight time lag.
>
> 'ls -lh dir' seems to be just the thing if you already know dir. And it's perfectly suitable for our use case of not scheduling new jobs for users consuming too much space. I was thinking I might need to find a subtree where all the subdirectories are owned by the same user, on the theory that all the files in such a subtree would be owned by that same user. E.g., we might want such a capability to manage space per user in shared project directories. So, I tried 'find dir -type d -exec ls -lhd {} \;'. Unfortunately, that ended up doing a 'newfstatat' on each file under dir, evidently to learn if it was a directory. The result was that same slowdown for files written on other clients. Is there some other way I should be looking for directories if I don't already know what they are?
>
> Also, this issue of stat on files created on other clients seems like it's going to be problematic for many interactions our users will have with the files created by their parallel compute jobs - any suggestion on how to avoid or fix it?

Brief background: stat is required to provide file size information, so when you do a stat Ceph needs to find out the actual file size. If the file is currently in use by somebody, that requires gathering up the latest metadata from them. Separately, while Ceph allows a client and the MDS to proceed with a bunch of operations (e.g., mknod) without having them go to disk first, it requires that anything which is visible to a third party (another client) be durable on disk, for consistency reasons. These combine to mean that if you do a stat on a file which a client currently has buffered writes for, that buffer must be flushed out to disk before the stat can return. This is the usual cause of the slow stats you're seeing.

You should be able to adjust dirty data thresholds to encourage faster writeouts, and do fsyncs once a client is done with a file, etc., to minimize the likelihood of running into this. Also, I'd have to check, but I believe opening a file with LAZY_IO or whatever will weaken those requirements — it's probably not the solution you'd like here, but it's an option, and if this turns out to be a serious issue then config options to reduce consistency on certain operations are likely to make their way into the roadmap. :)

-Greg
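To make the "dirty data thresholds / fsync" suggestion concrete for kernel clients: cephfs data goes through the normal page cache, so the generic VM writeback knobs are the ones to turn. This is only an illustration; the values are arbitrary and whether it helps depends on the workload:

    # push dirty pages out sooner, so a later cross-client stat has less to flush
    sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at 64 MB dirty
    sysctl -w vm.dirty_expire_centisecs=500        # consider dirty pages writeback-eligible after 5 s

    # and/or have the writing application call fsync() on a file when it is
    # finished with it, before other nodes need to stat it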
Re: CephFS Space Accounting and Quotas
On 03/06/2013 01:21 PM, Greg Farnum wrote:
>> Also, this issue of stat on files created on other clients seems like it's going to be problematic for many interactions our users will have with the files created by their parallel compute jobs - any suggestion on how to avoid or fix it?
>
> Brief background: stat is required to provide file size information, and so when you do a stat Ceph needs to find out the actual file size. If the file is currently in use by somebody, that requires gathering up the latest metadata from them. Separately, while Ceph allows a client and the MDS to proceed with a bunch of operations (ie, mknod) without having it go to disk first, it requires anything which is visible to a third party (another client) be durable on disk for consistency reasons. These combine to mean that if you do a stat on a file which a client currently has buffered writes for, that buffer must be flushed out to disk before the stat can return. This is the usual cause of the slow stats you're seeing. You should be able to adjust dirty data thresholds to encourage faster writeouts, do fsyncs once a client is done with a file, etc in order to minimize the likelihood of running into this. Also, I'd have to check but I believe opening a file with LAZY_IO or whatever will weaken those requirements — it's probably not the solution you'd like here but it's an option, and if this turns out to be a serious issue then config options to reduce consistency on certain operations are likely to make their way into the roadmap. :)

That all makes sense. But it turns out the files in question were written yesterday, and I did the stat operations today. So shouldn't the dirty buffer issue not be in play here? Is there anything else that might be going on?

Thanks

-- Jim
Re: RGW Blocking on 1-2 PG's - argonaut
Great, thanks. Now I understand everything.

Best Regards,
SS

On 6 Mar 2013, at 15:04, Yehuda Sadeh yeh...@inktank.com wrote:
> On Wed, Mar 6, 2013 at 5:06 AM, Sławomir Skowron szi...@gmail.com wrote:
>> Hi, i do some test, to reproduce this problem. As you can see, only one drive (each drive in same PG) is much more utilize, then others, and there are some ops in queue on this slow osd. This test is getting heads from s3 objects, alphabetically sorted. This is strange. why this files is going in much part only from this triple osd's. checking what osd are in this pg.
>> ceph pg map 7.35b
>> osdmap e117008 pg 7.35b (7.35b) - up [18,61,133] acting [18,61,133]
>> On osd.61 { num_ops: 13, ops: [ { description: osd_sub_op(client.10376104.0:961532 7.35b 2b11a75b\/2013-03-06-13-8700.1-ocdn\/head\/\/7 [] v 117008'1370134
>
> The ops log is slowing you down. Unless you really need it, set 'rgw enable ops log = false'. This is off by default in bobtail.
>
> Yehuda
Re: CephFS Space Accounting and Quotas
On Wednesday, March 6, 2013 at 1:28 PM, Jim Schutt wrote:
> On 03/06/2013 01:21 PM, Greg Farnum wrote:
>>> Also, this issue of stat on files created on other clients seems like it's going to be problematic for many interactions our users will have with the files created by their parallel compute jobs - any suggestion on how to avoid or fix it?
>>
>> Brief background: stat is required to provide file size information, and so when you do a stat Ceph needs to find out the actual file size. If the file is currently in use by somebody, that requires gathering up the latest metadata from them. Separately, while Ceph allows a client and the MDS to proceed with a bunch of operations (ie, mknod) without having it go to disk first, it requires anything which is visible to a third party (another client) be durable on disk for consistency reasons. These combine to mean that if you do a stat on a file which a client currently has buffered writes for, that buffer must be flushed out to disk before the stat can return. This is the usual cause of the slow stats you're seeing. You should be able to adjust dirty data thresholds to encourage faster writeouts, do fsyncs once a client is done with a file, etc in order to minimize the likelihood of running into this. Also, I'd have to check but I believe opening a file with LAZY_IO or whatever will weaken those requirements — it's probably not the solution you'd like here but it's an option, and if this turns out to be a serious issue then config options to reduce consistency on certain operations are likely to make their way into the roadmap. :)
>
> That all makes sense. But, it turns out the files in question were written yesterday, and I did the stat operations today. So, shouldn't the dirty buffer issue not be in play here?

Probably not. :/

> Is there anything else that might be going on?

In that case it sounds like either there's a slowdown on disk access that is propagating up the chain very bizarrely, there's a serious performance issue on the MDS (i.e., swapping for everything), or the clients are still holding onto capabilities for the files in question and you're running into some issues with the capability revocation mechanisms.

Can you describe your setup a bit more? What versions are you running, kernel or userspace clients, etc.? What config options are you setting on the MDS? Assuming you're on something semi-recent, getting a perfcounter dump from the MDS might be illuminating as well. We'll probably want to get a high-debug log of the MDS during these slow stats, too.

-Greg
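For completeness, the perfcounter dump can be pulled over the MDS admin socket and the debug level can be bumped at runtime. The socket path and the daemon id below are just the usual defaults, so adjust them to your install:

    # perfcounter dump from the running MDS (older releases spell it "perfcounters_dump")
    ceph --admin-daemon /var/run/ceph/ceph-mds.<id>.asok perf dump

    # raise MDS debugging while reproducing the slow stat, then drop it again
    ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'
    ceph mds tell 0 injectargs '--debug-mds 1 --debug-ms 0'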
Re: CephFS Space Accounting and Quotas
On Wed, 6 Mar 2013, Greg Farnum wrote:
>> 'ls -lh dir' seems to be just the thing if you already know dir. And it's perfectly suitable for our use case of not scheduling new jobs for users consuming too much space. I was thinking I might need to find a subtree where all the subdirectories are owned by the same user, on the theory that all the files in such a subtree would be owned by that same user. E.g., we might want such a capability to manage space per user in shared project directories. So, I tried 'find dir -type d -exec ls -lhd {} \;'. Unfortunately, that ended up doing a 'newfstatat' on each file under dir, evidently to learn if it was a directory. The result was that same slowdown for files written on other clients. Is there some other way I should be looking for directories if I don't already know what they are?

Normally the readdir result has the d_type field filled in to indicate whether the dentry is a directory or not, which makes the stat unnecessary. I'm surprised that find isn't doing that properly already! It's possible we aren't populating a field we should be in our readdir code...

>> Also, this issue of stat on files created on other clients seems like it's going to be problematic for many interactions our users will have with the files created by their parallel compute jobs - any suggestion on how to avoid or fix it?
>
> Brief background: stat is required to provide file size information, and so when you do a stat Ceph needs to find out the actual file size. If the file is currently in use by somebody, that requires gathering up the latest metadata from them. Separately, while Ceph allows a client and the MDS to proceed with a bunch of operations (ie, mknod) without having it go to disk first, it requires anything which is visible to a third party (another client) be durable on disk for consistency reasons. These combine to mean that if you do a stat on a file which a client currently has buffered writes for, that buffer must be flushed out to disk before the stat can return. This is the usual cause of the slow stats you're seeing. You should be able to adjust dirty data thresholds to encourage faster writeouts, do fsyncs once a client is done with a file, etc in order to minimize the likelihood of running into this.

This is the current behavior. There is a bug in the tracker to introduce a new lock state to optimize the stat case so that writers are paused but buffers aren't flushed. It hasn't been prioritized, but it is not terribly complex.

sage
[PATCH] libceph: fix decoding of pgids
In 4f6a7e5ee1393ec4b243b39dac9f36992d161540 we effectively dropped support for the legacy encoding for the OSDMap and incremental. However, we didn't fix the decoding for the pgid.

Signed-off-by: Sage Weil s...@inktank.com
---
 net/ceph/osdmap.c | 40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index a47ee06..6975102 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -654,6 +654,24 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max)
 	return 0;
 }
 
+static int __decode_pgid(void **p, void *end, struct ceph_pg *pg)
+{
+	u8 v;
+
+	ceph_decode_need(p, end, 1+8+4+4, bad);
+	v = ceph_decode_8(p);
+	if (v != 1)
+		goto bad;
+	pg->pool = ceph_decode_64(p);
+	pg->seed = ceph_decode_32(p);
+	*p += 4; /* skip preferred */
+	return 0;
+
+bad:
+	dout("error decoding pgid\n");
+	return -EINVAL;
+}
+
 /*
  * decode a full map.
  */
@@ -745,13 +763,11 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end)
 	for (i = 0; i < len; i++) {
 		int n, j;
 		struct ceph_pg pgid;
-		struct ceph_pg_v1 pgid_v1;
 		struct ceph_pg_mapping *pg;
 
-		ceph_decode_need(p, end, sizeof(u32) + sizeof(u64), bad);
-		ceph_decode_copy(p, &pgid_v1, sizeof(pgid_v1));
-		pgid.pool = le32_to_cpu(pgid_v1.pool);
-		pgid.seed = le16_to_cpu(pgid_v1.ps);
+		err = __decode_pgid(p, end, &pgid);
+		if (err)
+			goto bad;
 		n = ceph_decode_32(p);
 		err = -EINVAL;
 		if (n > (UINT_MAX - sizeof(*pg)) / sizeof(u32))
@@ -818,8 +834,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	u16 version;
 
 	ceph_decode_16_safe(p, end, version, bad);
-	if (version > 6) {
-		pr_warning("got unknown v %d > %d of inc osdmap\n", version, 6);
+	if (version != 6) {
+		pr_warning("got unknown v %d != 6 of inc osdmap\n", version);
 		goto bad;
 	}
 
@@ -963,15 +979,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end,
 	while (len--) {
 		struct ceph_pg_mapping *pg;
 		int j;
-		struct ceph_pg_v1 pgid_v1;
 		struct ceph_pg pgid;
 		u32 pglen;
 
-		ceph_decode_need(p, end, sizeof(u64) + sizeof(u32), bad);
-		ceph_decode_copy(p, &pgid_v1, sizeof(pgid_v1));
-		pgid.pool = le32_to_cpu(pgid_v1.pool);
-		pgid.seed = le16_to_cpu(pgid_v1.ps);
-		pglen = ceph_decode_32(p);
+		err = __decode_pgid(p, end, &pgid);
+		if (err)
+			goto bad;
+		pglen = ceph_decode_32(p);
 		if (pglen) {
 			ceph_decode_need(p, end, pglen*sizeof(u32), bad);
-- 
1.7.9.5
Re: OpenStack summit : Ceph design session
Hi Neil,

On 03/06/2013 08:27 PM, Neil Levine wrote:
> I think the multi-site RGW stuff is somewhat orthogonal to OpenStack

Even when Keystone is involved?

> where as the RBD backups needs to factor in Horizon, Cinder APIs and where the logic for managing the backups sits. Ross is looking to get a wiki setup for Ceph blueprints so we can document the incremental snapshot stuff and then use this as a basis for creating the OpenStack blueprint.

Great.

> Who approves the session at ODS and when is this decision made?

I suspect Josh knows more than I do about this. During the cinder meeting earlier today J. Griffith said that if the nova track is too busy to host the "Roadmap for Ceph integration with OpenStack" session he was in favor of having it in the cinder track. Following his advice I suggested to Thierry Carrez to open a "Cross project" track ( http://lists.openstack.org/pipermail/openstack-dev/2013-March/006365.html ).

Cheers

> Neil
>
> On Sun, Mar 3, 2013 at 1:37 AM, Loic Dachary l...@dachary.org wrote:
>> Hi Neil,
>> I've updated http://summit.openstack.org/cfp/details/38 with the Geographic DR related threads:
>> Geo-replication with RADOS GW http://marc.info/?l=ceph-devel&m=135939566407623&w=4
>> Geographic DR for RGW http://marc.info/?l=ceph-devel&m=136191479931880&w=4
>> I'm increasingly interested in figuring out how it fits with OpenStack.
>> Cheers
>>
>> On 02/25/2013 11:04 AM, Loic Dachary wrote:
>>> Hi Neil,
>>> I've added "RBD backups to secondary clusters within Openstack" to the list of blueprints. Do you have links to mail threads / chat logs related to this topic? I moved the content of the session to an etherpad for collaborative editing https://etherpad.openstack.org/roadmap-for-ceph-integration-with-openstack and it is now linked from http://summit.openstack.org/cfp/details/38
>>> Cheers
>>>
>>> On 02/25/2013 07:12 AM, Neil Levine wrote:
>>>> Thanks for taking the lead on this Loic. As a blueprint, I'd like to look at RBD backups to secondary clusters within Openstack. Nick Barcet and others have mentioned ideas for this now that Cinder is multi-cluster aware.
>>>> Neil
>>>>
>>>> On Sun, Feb 24, 2013 at 3:16 PM, Josh Durgin josh.dur...@inktank.com wrote:
>>>>> On 02/23/2013 02:33 AM, Loic Dachary wrote:
>>>>>> Hi,
>>>>>> In anticipation of the next OpenStack summit http://www.openstack.org/summit/portland-2013/, I proposed a session to discuss OpenStack and Ceph integration. Our meeting during FOSDEM earlier this month was a great experience although it was planned at the last minute. I hope we can organize something even better for the summit. For developers and contributors to both Ceph and OpenStack such as myself, it would be a great opportunity to figure out a sensible roadmap for the next six months. I realize this roadmap is already clear for Josh Durgin and other Ceph / OpenStack developers who are passionately invested in both projects for a long time. However I am new to both projects and such a session would be a precious guide and highly motivating. http://summit.openstack.org/cfp/details/38 What do you think?
>>>>>
>>>>> Sounds like a great idea! Thanks for putting together the session!
>>>>> Josh
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre

--
Loïc Dachary, Artisan Logiciel Libre
Re: [PATCH] libceph: fix decoding of pgids
On Wed, Mar 6, 2013 at 2:15 PM, Sage Weil s...@inktank.com wrote: In 4f6a7e5ee1393ec4b243b39dac9f36992d161540 we effectively dropped support for the legacy encoding for the OSDMap and incremental. However, we didn't fix the decoding for the pgid. Signed-off-by: Sage Weil s...@inktank.com --- net/ceph/osdmap.c | 40 +++- 1 file changed, 27 insertions(+), 13 deletions(-) diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c index a47ee06..6975102 100644 --- a/net/ceph/osdmap.c +++ b/net/ceph/osdmap.c @@ -654,6 +654,24 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max) return 0; } +static int __decode_pgid(void **p, void *end, struct ceph_pg *pg) +{ + u8 v; + + ceph_decode_need(p, end, 1+8+4+4, bad); + v = ceph_decode_8(p); + if (v != 1) + goto bad; + pg-pool = ceph_decode_64(p); + pg-seed = ceph_decode_32(p); + *p += 4; /* skip preferred */ + return 0; + +bad: + dout(error decoding pgid\n); + return -EINVAL; +} + /* * decode a full map. */ @@ -745,13 +763,11 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end) for (i = 0; i len; i++) { int n, j; struct ceph_pg pgid; - struct ceph_pg_v1 pgid_v1; struct ceph_pg_mapping *pg; - ceph_decode_need(p, end, sizeof(u32) + sizeof(u64), bad); - ceph_decode_copy(p, pgid_v1, sizeof(pgid_v1)); - pgid.pool = le32_to_cpu(pgid_v1.pool); - pgid.seed = le16_to_cpu(pgid_v1.ps); + err = __decode_pgid(p, end, pgid); + if (err) + goto bad; n = ceph_decode_32(p); err = -EINVAL; if (n (UINT_MAX - sizeof(*pg)) / sizeof(u32)) @@ -818,8 +834,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end, u16 version; ceph_decode_16_safe(p, end, version, bad); - if (version 6) { - pr_warning(got unknown v %d %d of inc osdmap\n, version, 6); + if (version != 6) { + pr_warning(got unknown v %d != 6 of inc osdmap\n, version); goto bad; } @@ -963,15 +979,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end, while (len--) { struct ceph_pg_mapping *pg; int j; - struct ceph_pg_v1 pgid_v1; struct ceph_pg pgid; u32 pglen; - ceph_decode_need(p, end, sizeof(u64) + sizeof(u32), bad); - ceph_decode_copy(p, pgid_v1, sizeof(pgid_v1)); - pgid.pool = le32_to_cpu(pgid_v1.pool); - pgid.seed = le16_to_cpu(pgid_v1.ps); - pglen = ceph_decode_32(p); + err = __decode_pgid(p, end, pgid); + if (err) + goto bad; maybe missing? ceph_decode_need(p, end, sizeof(u32), bad); + pglen = ceph_decode_32(p); if (pglen) { ceph_decode_need(p, end, pglen*sizeof(u32), bad); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: fix decoding of pgids
On Wed, 6 Mar 2013, Yehuda Sadeh wrote: On Wed, Mar 6, 2013 at 2:15 PM, Sage Weil s...@inktank.com wrote: In 4f6a7e5ee1393ec4b243b39dac9f36992d161540 we effectively dropped support for the legacy encoding for the OSDMap and incremental. However, we didn't fix the decoding for the pgid. Signed-off-by: Sage Weil s...@inktank.com --- net/ceph/osdmap.c | 40 +++- 1 file changed, 27 insertions(+), 13 deletions(-) diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c index a47ee06..6975102 100644 --- a/net/ceph/osdmap.c +++ b/net/ceph/osdmap.c @@ -654,6 +654,24 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, int max) return 0; } +static int __decode_pgid(void **p, void *end, struct ceph_pg *pg) +{ + u8 v; + + ceph_decode_need(p, end, 1+8+4+4, bad); + v = ceph_decode_8(p); + if (v != 1) + goto bad; + pg-pool = ceph_decode_64(p); + pg-seed = ceph_decode_32(p); + *p += 4; /* skip preferred */ + return 0; + +bad: + dout(error decoding pgid\n); + return -EINVAL; +} + /* * decode a full map. */ @@ -745,13 +763,11 @@ struct ceph_osdmap *osdmap_decode(void **p, void *end) for (i = 0; i len; i++) { int n, j; struct ceph_pg pgid; - struct ceph_pg_v1 pgid_v1; struct ceph_pg_mapping *pg; - ceph_decode_need(p, end, sizeof(u32) + sizeof(u64), bad); - ceph_decode_copy(p, pgid_v1, sizeof(pgid_v1)); - pgid.pool = le32_to_cpu(pgid_v1.pool); - pgid.seed = le16_to_cpu(pgid_v1.ps); + err = __decode_pgid(p, end, pgid); + if (err) + goto bad; n = ceph_decode_32(p); err = -EINVAL; if (n (UINT_MAX - sizeof(*pg)) / sizeof(u32)) @@ -818,8 +834,8 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end, u16 version; ceph_decode_16_safe(p, end, version, bad); - if (version 6) { - pr_warning(got unknown v %d %d of inc osdmap\n, version, 6); + if (version != 6) { + pr_warning(got unknown v %d != 6 of inc osdmap\n, version); goto bad; } @@ -963,15 +979,13 @@ struct ceph_osdmap *osdmap_apply_incremental(void **p, void *end, while (len--) { struct ceph_pg_mapping *pg; int j; - struct ceph_pg_v1 pgid_v1; struct ceph_pg pgid; u32 pglen; - ceph_decode_need(p, end, sizeof(u64) + sizeof(u32), bad); - ceph_decode_copy(p, pgid_v1, sizeof(pgid_v1)); - pgid.pool = le32_to_cpu(pgid_v1.pool); - pgid.seed = le16_to_cpu(pgid_v1.ps); - pglen = ceph_decode_32(p); + err = __decode_pgid(p, end, pgid); + if (err) + goto bad; maybe missing? ceph_decode_need(p, end, sizeof(u32), bad); Yup, for both call sites. Pushed updated patch to testing. Thanks! sage + pglen = ceph_decode_32(p); if (pglen) { ceph_decode_need(p, end, pglen*sizeof(u32), bad); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OpenStack summit : Ceph design session
On Wed, Mar 6, 2013 at 2:45 PM, Loic Dachary l...@dachary.org wrote:
> Hi Neil,
> On 03/06/2013 08:27 PM, Neil Levine wrote:
>> I think the multi-site RGW stuff is somewhat orthogonal to OpenStack
>
> Even when keystone is involved ?

Good question. Yehuda: how would the asynchronously replicated user metadata interact with Keystone?

>> Who approves the session at ODS and when is this decision made?
>
> I suspect Josh knows more than I do about this. During the cinder meeting earlier today J. Griffith said that if the nova track is too busy to host the "Roadmap for Ceph integration with OpenStack" session he was in favor of having it in the cinder track. Following his advice I suggested to Thierry Carrez to open a "Cross project" track ( http://lists.openstack.org/pipermail/openstack-dev/2013-March/006365.html )

I don't think Cinder is such a bad place for it to be, as presumably the interaction to copy the block device to a secondary location would be triggered through a Cinder API call, no?

Neil
Re: Approaches to wrapping aio_exec
So I've been playing with the ObjectOperationCompletion code a bit. It seems to be really important to be able to handle decoding errors in the handle_completion() callback. In particular, I'd like to be able to reach out and set the return value the user will see in the AioCompletion. Any thoughts on dealing with this somehow?

-Noah

On Mar 4, 2013, at 11:44 AM, Yehuda Sadeh yeh...@inktank.com wrote:
> On Mon, Mar 4, 2013 at 11:34 AM, Noah Watkins jayh...@cs.ucsc.edu wrote:
>> On Mar 3, 2013, at 6:31 PM, Yehuda Sadeh yeh...@inktank.com wrote:
>>> I pushed the wip-librados-exec branch last week that solves a similar issue. I added two more ObjectOperation::exec() api calls. The more interesting one added a callback context that is called with the output buffer of the completed sub-op. Currently in order to use it you'll need to use operate()/aio_operate(), however, a similar aio_exec interface can be added.
>>
>> Thanks for the pointer to the branch. So, if I understand correctly, we might have a new librados::aio_exec_completion call that accepts a completion object? For example:
>>
>>   aio_exec_completion(AioCompletion *c, bufferlist *outbl,
>>                       ObjectOperationCompletion *completion)
>>   {
>>     Context *onack = new C_aio_Ack(c);
>>     ::ObjectOperation rd;
>>     ObjectOpCompletionCtx *ctx = new ObjectOpCompletionCtx(completion);
>>     rd.call(cls, method, inbl, &ctx->outbl, ctx, NULL);
>>     objecter->read(oid, oloc, rd, snap_seq, outbl, 0, onack, &c->objver);
>>     return 0;
>>   }
>>
>> where the caller would provide an ObjectOperationCompletion whose finish(..) would unwrap the protocol?
>
> Right.
>
>> Do you expect wip-librados-exec going upstream pretty soon, and would
>
> We can push it ahead if needed, it doesn't depend on any of the stuff I'm working on right now. It just waits for someone to properly review it.
>
>> something like librados::aio_exec_completion be a candidate for adding to librados?
>
> Sure, if there's a need then I don't see why not.
>
> Yehuda
Re: CephFS Space Accounting and Quotas
On Wednesday, March 6, 2013 at 3:14 PM, Jim Schutt wrote:
> When I'm doing these stat operations the file system is otherwise idle.

What's the cluster look like? This is just one active MDS and a couple hundred clients?

> What is happening is that once one of these slow stat operations on a file completes, it never happens again for that file, from any client. At least, that's the case if I'm not writing to the file any more. I haven't checked if appending to the files restarts the behavior.

I assume it'll come back, but if you could verify that'd be good.

> On the client side I'm running with 3.8.2 + the ceph patch queue that was merged into 3.9-rc1. On the server side I'm running recent next branch (commit 0f42eddef5), with the tcp receive socket buffer option patches cherry-picked. I've also got a patch that allows mkcephfs to use osd_pool_default_pg_num rather than pg_bits to set initial number of PGs (same for pgp_num), and a patch that lets me run with just one pool that contains both data and metadata. I'm testing data distribution uniformity with 512K PGs. My MDS tunables are all at default settings.
>
>> We'll probably want to get a high-debug log of the MDS during these slow stats as well.
>
> OK. Do you want me to try to reproduce with a more standard setup?

No, this is fine.

> Also, I see Sage just pushed a patch to pgid decoding - I expect I need that as well, if I'm running the latest client code.

Yeah, if you've got the commit it references you'll want it.

> Do you want the MDS log at 10 or 20?

More is better. ;)
Re: MDS running at 100% CPU, no clients
On Mar 6, 2013, at 5:57 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:
> The MDS process in my cluster is running at 100% CPU. In fact I thought the cluster had come down, but rather an ls was taking a minute. There aren't any clients active. I've left the process running in case there is any probing you'd like to do on it:
>
>   virt   res         cpu
>   4629m  88m  5260 S 92  1.1  113:32.79  ceph-mds
>
> Thanks, Noah

This is a ceph-mds child thread under strace -- the only thread that appears to be doing anything:

root@issdm-44:/home/hadoop/hadoop-common# strace -p 3372
Process 3372 attached - interrupt to quit
read(1649, "7f0203235000-7f0203236000 ---p 0"..., 8191) = 4050
read(1649, "7f0205053000-7f0205054000 ---p 0"..., 8191) = 4050
read(1649, "7f0206e71000-7f0206e72000 ---p 0"..., 8191) = 4050
read(1649, "7f0214144000-7f0214244000 rw-p 0"..., 8191) = 4020
read(1649, "7f0215f62000-7f0216062000 rw-p 0"..., 8191) = 4020
read(1649, "7f0217d8-7f0217e8 rw-p 0"..., 8191) = 4020
read(1649, "7f0219b9e000-7f0219c9e000 rw-p 0"..., 8191) = 4020
...

That file descriptor looks to be:

ceph-mds 3337 root 1649r REG 0,30 266903 /proc/3337/maps

(3337 is the parent process).
Re: MDS running at 100% CPU, no clients
Which looks to be in a tight loop in the memory model _sample…

(gdb) bt
#0  0x7f0270d84d2d in read () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x7f027046dd88 in std::__basic_file<char>::xsgetn(char*, long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x7f027046f4c5 in std::basic_filebuf<char, std::char_traits<char> >::underflow() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x7f0270467ceb in std::basic_istream<char, std::char_traits<char> >& std::getline<char, std::char_traits<char>, std::allocator<char> >(std::basic_istream<char, std::char_traits<char> >&, std::basic_string<char, std::char_traits<char>, std::allocator<char> >&, char) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x0072bdd4 in MemoryModel::_sample(MemoryModel::snap*) ()
#5  0x005658db in MDCache::check_memory_usage() ()
#6  0x004ba929 in MDS::tick() ()
#7  0x00794c65 in SafeTimer::timer_thread() ()
#8  0x007958ad in SafeTimerThread::entry() ()
#9  0x7f0270d7de9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

On Mar 6, 2013, at 6:18 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:
> On Mar 6, 2013, at 5:57 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:
>> The MDS process in my cluster is running at 100% CPU. In fact I thought the cluster came down, but rather an ls was taking a minute. There aren't any clients active. I've left the process running in case there is any probing you'd like to do on it:
>> virt res cpu 4629m 88m 5260 S 92 1.1 113:32.79 ceph-mds
>> Thanks, Noah
>
> This is a ceph-mds child thread under strace. The only thread that appears to be doing anything.
>
> root@issdm-44:/home/hadoop/hadoop-common# strace -p 3372
> Process 3372 attached - interrupt to quit
> read(1649, "7f0203235000-7f0203236000 ---p 0"..., 8191) = 4050
> read(1649, "7f0205053000-7f0205054000 ---p 0"..., 8191) = 4050
> read(1649, "7f0206e71000-7f0206e72000 ---p 0"..., 8191) = 4050
> read(1649, "7f0214144000-7f0214244000 rw-p 0"..., 8191) = 4020
> read(1649, "7f0215f62000-7f0216062000 rw-p 0"..., 8191) = 4020
> read(1649, "7f0217d8-7f0217e8 rw-p 0"..., 8191) = 4020
> read(1649, "7f0219b9e000-7f0219c9e000 rw-p 0"..., 8191) = 4020
> ...
>
> That file looks to be:
> ceph-mds 3337 root 1649r REG 0,30 266903 /proc/3337/maps
> (3337 is the parent process).
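If it really is stuck walking /proc/<pid>/maps (that is what MemoryModel::_sample reads, and the backtrace shows it being called from MDS::tick via MDCache::check_memory_usage), one cheap thing to check is how large that file has become for the MDS process; a huge number of mappings would make every tick expensive. The pid below is taken from Noah's output:

    wc -l /proc/3337/maps            # number of memory mappings the MDS currently has
    ls /proc/3337/task | wc -l       # thread count, for comparison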