make check bot paused
Hi, The make check bot [1] that executes run-make-check.sh [2] on pull requests and reports results as comments [3] is experiencing problems. It may be a hardware issue and the bot is paused while the issue is investigated [4] to avoid sending confusing false negatives. In the meantime the run-make-check.sh [2] script can be run locally, before sending the pull request, to confirm the commits to be sent do not break the tests. It is expected to run in less than 15 minutes including compilation on a fast machine with an SSD (or RAM disk), 8 cores and 32GB of RAM, and may take up to two hours on a machine with a spinning disk and two cores. Thanks for your patience. Cheers [1] bot running on pull requests http://jenkins.ceph.dachary.org/job/ceph/ [2] run-make-check.sh http://workbench.dachary.org/ceph/ceph/blob/master/run-make-check.sh [3] make check results example : https://github.com/ceph/ceph/pull/3946#issuecomment-93286840 [4] possible RAM failure http://tracker.ceph.com/issues/11399 -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature
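For reference, a minimal local run of the script looks like the following; the clone URL and path are only illustrative, the point is that run-make-check.sh sits at the top of the source tree:

    # fetch the tree and run the same checks the bot would have run
    git clone https://github.com/ceph/ceph.git
    cd ceph
    ./run-make-check.sh

It configures and builds the tree and then runs make check, so expect roughly the run times quoted above depending on hardware.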
Re: Regarding newstore performance
On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy somnath@sandisk.com wrote: Hi Sage/Mark, I did some WA experiment with newstore with the similar settings I mentioned yesterday. Test: --- 64K Random write with 64 QD and writing total of 1 TB of data. Newstore: Fio output at the end of 1 TB write. --- rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.9 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 iops] [eta 00m:00s] rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015 write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec slat (usec): min=43, max=9480, avg=116.45, stdev=10.99 clat (msec): min=13, max=1331, avg=83.55, stdev=52.97 lat (msec): min=14, max=1331, avg=83.67, stdev=52.97 clat percentiles (msec): | 1.00th=[ 60], 5.00th=[ 68], 10.00th=[ 71], 20.00th=[ 74], | 30.00th=[ 76], 40.00th=[ 78], 50.00th=[ 81], 60.00th=[ 83], | 70.00th=[ 86], 80.00th=[ 90], 90.00th=[ 94], 95.00th=[ 98], | 99.00th=[ 109], 99.50th=[ 114], 99.90th=[ 1188], 99.95th=[ 1221], | 99.99th=[ 1270] bw (KB /s): min= 62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03 lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03% lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20% cpu : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, =64=97.9% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, =64=0.0% issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, mint=21421419msec, maxt=21421419msec So, iops getting is ~764. 99th percentile latency should be 100ms. Write amplification at disk level: -- SanDisk SSDs have some disk level counters that can measure number of host writes with flash logical page size and number of actual flash writes with the same flash logical page size. The difference between these two is the actual WA causing to disk. Please find the data in the following xls. https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlxcX5TLMRzdXyJE/edit?usp=sharing Total host writes in this period = 923896266 Total flash writes in this period = 1465339040 FileStore: - Fio output at the end of 1 TB write. 
--- rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.9 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 iops] [eta 00m:01s] rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015 write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec slat (usec): min=42, max=7144, avg=120.45, stdev=45.80 clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25 lat (msec): min=1, max=3954, avg=40.90, stdev=81.23 clat percentiles (msec): | 1.00th=[7], 5.00th=[ 11], 10.00th=[ 13], 20.00th=[ 16], | 30.00th=[ 18], 40.00th=[ 20], 50.00th=[ 22], 60.00th=[ 25], | 70.00th=[ 30], 80.00th=[ 40], 90.00th=[ 67], 95.00th=[ 114], | 99.00th=[ 433], 99.50th=[ 570], 99.90th=[ 914], 99.95th=[ 1090], | 99.99th=[ 1647] bw (KB /s): min= 32, max=243072, per=100.00%, avg=103148.37, stdev=63090.00 lat (usec) : 1000=0.01% lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42% lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14% lat (msec) : 2000=0.06%, =2000=0.01% cpu : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, =64=19.1% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, =64=0.0% issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, mint=10636117msec, maxt=10636117msec Disk stats (read/write): sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00% So, iops here is ~1500. 99th percentile latency should be within 50ms Write amplification at disk level: -- Total host writes in this period =
04/15/2015 Weekly Ceph Performance Meeting IS ON!
8AM PST as usual! IE in 10 minutes. :D Forgot to send this out earlier, sorry! Anyway, the new hotness is newstore, and there's been a lot of testing going on! Here's the links: Etherpad URL: http://pad.ceph.com/p/performance_weekly To join the Meeting: https://bluejeans.com/268261044 To join via Browser: https://bluejeans.com/268261044/browser To join with Lync: https://bluejeans.com/268261044/lync To join via Room System: Video Conferencing System: bjn.vc -or- 199.48.152.152 Meeting ID: 268261044 To join via Phone: 1) Dial: +1 408 740 7256 +1 888 240 2560(US Toll Free) +1 408 317 9253(Alternate Number) (see all numbers - http://bluejeans.com/numbers) 2) Enter Conference ID: 268261044 Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 04/08/2015 Weekly Ceph Performance Meeting IS ON!
Nuts, this got out before I hit cancel. This should be 04/15/2015. :) On 04/15/2015 09:51 AM, Mark Nelson wrote: 8AM PST as usual! IE in 10 minutes. :D Forgot to send this out earlier, sorry! Anyway, the new hotness is newstore, and there's been a lot of testing going on! Here's the links: Etherpad URL: http://pad.ceph.com/p/performance_weekly To join the Meeting: https://bluejeans.com/268261044 To join via Browser: https://bluejeans.com/268261044/browser To join with Lync: https://bluejeans.com/268261044/lync To join via Room System: Video Conferencing System: bjn.vc -or- 199.48.152.152 Meeting ID: 268261044 To join via Phone: 1) Dial: +1 408 740 7256 +1 888 240 2560(US Toll Free) +1 408 317 9253(Alternate Number) (see all numbers - http://bluejeans.com/numbers) 2) Enter Conference ID: 268261044 Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[rgw][Swift API] Reusing WSGI middleware of Swift
Hello Guys, Swift -- besides implementing the OpenStack Object Storage API v1 -- provides a bunch of huge extensions like SLO/DLO (static/dynamic large objects), bulk operations (inc. server-side archive extraction), staticweb and many more. Full list is available in Swift's source code [1]. I'm pretty sure it would be nice to have at least some degree of compatibility with them. Reimplementing those features could be painful. Maintaining them would be another challenge. Even after accomplishing that we still would not provide users with ability to use (and easily create) a third party extensions like ClamAV-based virus scanner [2]. However, maybe there is another solution. Swift internally utilizes a pipeline-based architecture. Component delivering the base of OSOS API (the proxy-server WSGI application) could be seen as the last stage of such pipeline. Between it and user a lot of additional middleware modules exist. All of them are interconnected with the WSGI interface. In a pure theory, if we provide a WSGI-RGW bridge, we would be able to reuse them and save a lot of efforts. I created a really stupid proof of concept [3] built on top of wsgiproxy [4] and civetweb frontend of RGW. Its primary role is a WSGI-HTTP mediator. The code is terrible, definitely needs rework but already allowed to pass a few middleware-related API tests from Tempest. What would be necessary on the RGW's side? Basically: * support for extraction of account (aka tenant) name from URL; * good conformance with OSOS API; * optionally: a configuration parameter to disable auth mechanisms inside RGW and delegate the work to Swift middleware. What's your opinion? I would like to start a discussion to verify the concept. Regards, Radoslaw Zarzynski [1] https://github.com/openstack/swift/tree/master/swift/common/middleware [2] https://julien.danjou.info/blog/2013/extending-swift-with-a-middleware-clamav [3] https://github.com/rzarzynski/rgwift [4] http://pythonpaste.org/wsgiproxy/ -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
04/08/2015 Weekly Ceph Performance Meeting IS ON!
8AM PST as usual! IE in 10 minutes. :D Forgot to send this out earlier, sorry! Anyway, the new hotness is newstore, and there's been a lot of testing going on! Here's the links: Etherpad URL: http://pad.ceph.com/p/performance_weekly To join the Meeting: https://bluejeans.com/268261044 To join via Browser: https://bluejeans.com/268261044/browser To join with Lync: https://bluejeans.com/268261044/lync To join via Room System: Video Conferencing System: bjn.vc -or- 199.48.152.152 Meeting ID: 268261044 To join via Phone: 1) Dial: +1 408 740 7256 +1 888 240 2560(US Toll Free) +1 408 317 9253(Alternate Number) (see all numbers - http://bluejeans.com/numbers) 2) Enter Conference ID: 268261044 Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: Regarding newstore performance
Hoamai, Yes, separating out the kvdb directory is the path I will take to identify the cause of the WA. This tool I have written on top of these disk counters. I can share that but you need SanDisk optimus echo (or max) drive to make it work :-) Thanks Regards Somnath -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Wednesday, April 15, 2015 5:23 AM To: Somnath Roy Cc: ceph-devel Subject: Re: Regarding newstore performance On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy somnath@sandisk.com wrote: Hi Sage/Mark, I did some WA experiment with newstore with the similar settings I mentioned yesterday. Test: --- 64K Random write with 64 QD and writing total of 1 TB of data. Newstore: Fio output at the end of 1 TB write. --- rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.9 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 iops] [eta 00m:00s] rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015 write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec slat (usec): min=43, max=9480, avg=116.45, stdev=10.99 clat (msec): min=13, max=1331, avg=83.55, stdev=52.97 lat (msec): min=14, max=1331, avg=83.67, stdev=52.97 clat percentiles (msec): | 1.00th=[ 60], 5.00th=[ 68], 10.00th=[ 71], 20.00th=[ 74], | 30.00th=[ 76], 40.00th=[ 78], 50.00th=[ 81], 60.00th=[ 83], | 70.00th=[ 86], 80.00th=[ 90], 90.00th=[ 94], 95.00th=[ 98], | 99.00th=[ 109], 99.50th=[ 114], 99.90th=[ 1188], 99.95th=[ 1221], | 99.99th=[ 1270] bw (KB /s): min= 62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03 lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03% lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20% cpu : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, =64=97.9% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, =64=0.0% issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, mint=21421419msec, maxt=21421419msec So, iops getting is ~764. 99th percentile latency should be 100ms. Write amplification at disk level: -- SanDisk SSDs have some disk level counters that can measure number of host writes with flash logical page size and number of actual flash writes with the same flash logical page size. The difference between these two is the actual WA causing to disk. Please find the data in the following xls. https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlx cX5TLMRzdXyJE/edit?usp=sharing Total host writes in this period = 923896266 Total flash writes in this period = 1465339040 FileStore: - Fio output at the end of 1 TB write. 
--- rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.9 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 iops] [eta 00m:01s] rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015 write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec slat (usec): min=42, max=7144, avg=120.45, stdev=45.80 clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25 lat (msec): min=1, max=3954, avg=40.90, stdev=81.23 clat percentiles (msec): | 1.00th=[7], 5.00th=[ 11], 10.00th=[ 13], 20.00th=[ 16], | 30.00th=[ 18], 40.00th=[ 20], 50.00th=[ 22], 60.00th=[ 25], | 70.00th=[ 30], 80.00th=[ 40], 90.00th=[ 67], 95.00th=[ 114], | 99.00th=[ 433], 99.50th=[ 570], 99.90th=[ 914], 99.95th=[ 1090], | 99.99th=[ 1647] bw (KB /s): min= 32, max=243072, per=100.00%, avg=103148.37, stdev=63090.00 lat (usec) : 1000=0.01% lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42% lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14% lat (msec) : 2000=0.06%, =2000=0.01% cpu : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, =64=19.1% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, =64=0.0% issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0 latency : target=0,
Re: [rgw][Swift API] Reusing WSGI middleware of Swift
- Original Message - From: Radoslaw Zarzynski rzarzyn...@mirantis.com To: Ceph Development ceph-devel@vger.kernel.org Sent: Wednesday, April 15, 2015 8:31:22 AM Subject: [rgw][Swift API] Reusing WSGI middleware of Swift Hello Guys, Swift -- besides implementing the OpenStack Object Storage API v1 -- provides a bunch of huge extensions like SLO/DLO (static/dynamic large objects), bulk operations (inc. server-side archive extraction), staticweb and many more. Full list is available in Swift's source code [1]. I'm pretty sure it would be nice to have at least some degree of compatibility with them. Reimplementing those features could be painful. Maintaining them would be another challenge. Even after accomplishing that we still would not provide users with ability to use (and easily create) a third party extensions like ClamAV-based virus scanner [2]. However, maybe there is another solution. Swift internally utilizes a pipeline-based architecture. Component delivering the base of OSOS API (the proxy-server WSGI application) could be seen as the last stage of such pipeline. Between it and user a lot of additional middleware modules exist. All of them are interconnected with the WSGI interface. In a pure theory, if we provide a WSGI-RGW bridge, we would be able to reuse them and save a lot of efforts. I created a really stupid proof of concept [3] built on top of wsgiproxy [4] and civetweb frontend of RGW. Its primary role is a WSGI-HTTP mediator. The code is terrible, definitely needs rework but already allowed to pass a few middleware-related API tests from Tempest. What would be necessary on the RGW's side? Basically: * support for extraction of account (aka tenant) name from URL; * good conformance with OSOS API; * optionally: a configuration parameter to disable auth mechanisms inside RGW and delegate the work to Swift middleware. What's your opinion? I would like to start a discussion to verify the concept. That is really interesting. A while back I actually floated around such an idea. My main motivation was for keeping up to date with keystone, but I can see other uses for it. Generally I think it would be great to have such a module. I'm not sure that it would be the way to go with regard to SLO/DLO, these would result in a hacky and racy solutions at best. Yehuda Regards, Radoslaw Zarzynski [1] https://github.com/openstack/swift/tree/master/swift/common/middleware [2] https://julien.danjou.info/blog/2013/extending-swift-with-a-middleware-clamav [3] https://github.com/rzarzynski/rgwift [4] http://pythonpaste.org/wsgiproxy/ -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: radosgw and the next giant release v0.87.2
- Original Message - From: Loic Dachary l...@dachary.org To: Yehuda Sadeh yeh...@redhat.com Cc: Ceph Development ceph-devel@vger.kernel.org, Abhishek L abhishek.lekshma...@gmail.com Sent: Wednesday, April 15, 2015 2:43:12 AM Subject: radosgw and the next giant release v0.87.2 Hi Yehuda, The next giant release as found at https://github.com/ceph/ceph/tree/giant passed the rgw suite (http://tracker.ceph.com/issues/11153#rgw). One run had a transient failure (http://tracker.ceph.com/issues/11259) that did not repeat. You will also find traces of failed run because of http://tracker.ceph.com/issues/11180 but that was resolved by backporting https://github.com/ceph/ceph-qa-suite/pull/375. Do you think it is ready for QE to start their own round of testing ? Yes. Note that it will be the last giant release. Cheers P.S. http://tracker.ceph.com/issues/11153#Release-information has direct links to the pull requests merged into giant since v0.87.1 in case you need more context about one of them. -- Loïc Dachary, Artisan Logiciel Libre -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
leaking mons on a latest dumpling
Hello, there is a slow leak which I assume is present in all ceph versions, but it is positively exposed only over large time spans and on large clusters. It looks like the lower a monitor is placed in the quorum hierarchy, the higher the leak is: {"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05 13:48:54.696784","created":"2015-03-05 13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}} ceph heap stats -m 10.0.1.95:6789 | grep Actual MALLOC: =427626648 ( 407.8 MiB) Actual memory used (physical + swap) ceph heap stats -m 10.0.1.94:6789 | grep Actual MALLOC: =289550488 ( 276.1 MiB) Actual memory used (physical + swap) ceph heap stats -m 10.0.1.93:6789 | grep Actual MALLOC: =230592664 ( 219.9 MiB) Actual memory used (physical + swap) ceph heap stats -m 10.0.1.92:6789 | grep Actual MALLOC: =253710488 ( 242.0 MiB) Actual memory used (physical + swap) ceph heap stats -m 10.0.1.91:6789 | grep Actual MALLOC: = 97112216 ( 92.6 MiB) Actual memory used (physical + swap) For almost the same uptime, the data difference is: rd KB 55365750505 wr KB 82719722467 The leak itself is not very critical, but of course it requires some script work to restart monitors at least once per month on a 300TB cluster to prevent 1G memory consumption by monitor processes. Given the current status of dumpling, it would probably be possible to identify the leak source and then forward-port the fix to the newer releases, as the freshest version I am running at a large scale is the top of the dumpling branch; otherwise it would require an enormous amount of time to check fix proposals. Thanks! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
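To make the collection above repeatable, here is a small sketch of the kind of script work mentioned, using the monitor addresses from the monmap; it only reports the numbers, and when or whether to restart a monitor is left to the operator:

    # report tcmalloc heap usage for every monitor in the quorum
    for mon in 10.0.1.91 10.0.1.92 10.0.1.93 10.0.1.94 10.0.1.95; do
        printf '%s: ' "$mon"
        ceph heap stats -m $mon:6789 | grep Actual
    done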
Re: [Ceph-maintainers] statically allocated uid/gid for ceph
Hi Ken Dreyer kdre...@redhat.com writes: On 04/14/2015 09:21 AM, Sage Weil wrote: I think we still want them to be static across a distro; it's the cross-distro change that will be relatively rare. So a fixed ID from each distro family ought to be okay? Sounds sane to me. I've filed https://fedorahosted.org/fpc/ticket/524 to request one from Fedora. I have now requested the same for Debian. If the request is granted we will most likely get the uid/gid 64045. Maybe others could use the same. It seems that only Debian has a range of reserved ids for this purpose. I would expect Ubuntu to use the same id, but that's up to them finally. Gaudenz signature.asc Description: PGP signature
Request for static uid/gid for ceph
Hi, Ceph is currently developing unprivileged user support, and the upstream developers are asking each of the distros to allocate a static numerical UID for the ceph user and group. The reason for the static number is to allow Ceph users to hot-swap hard drives from one OSD server to another. Currently this practice works because Ceph writes its files as uid 0, but when Ceph starts writing its files as an unprivileged user, it will be nice to allow administrators to unplug a drive, plug it into another computer, and have everything just work. For this I'm requesting a uid and gid from the reserved range 60000-64999 according to the process outlined in https://anonscm.debian.org/cgit/users/cjwatson/base-passwd.git/tree/README. Thanks, Gaudenz signature.asc Description: PGP signature
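For illustration only, the packaging side of such an allocation would boil down to something like the lines below; 64045 is just the id mentioned in the earlier reply as the one Debian would most likely grant, and other distros would substitute whatever number they reserve:

    # create the group and user with a fixed, distro-reserved id (example id: 64045)
    groupadd --system --gid 64045 ceph
    useradd --system --uid 64045 --gid 64045 \
            --home-dir /var/lib/ceph --shell /usr/sbin/nologin \
            --comment "Ceph storage service" ceph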
Few CEPH questions.
Hi All, I am new to this mailing list. I have few basic questions on the CEPH, hope someone can answer me. Thanks in advance.!! I would like to understand more the the placement policy, especially on the failure path. 1. Machines going up and down is fairly common in a data center. How often does the cluster map change? Every machine bounce causes an update/distribution of cluster map? and affect the CRUSH? Does it cause cluster network too chatty? 2. Ceph mainly depends on the primary OSD in a given PG. What happens in the read/write path if that OSD is down at that moment? There can be cases, OSD is down but cluster map is not up to date. When the write/read fail, does the client retry it after populating clustermap? 3. Whenever the cluster map changes, it may not propagate to the entire cluster. Some clients may be running with old map, and may end-up in wrong OSD. Do we depend on peering to take care of this situation? 4. If a OSD becomes too full, which may take reads, but no writes anymore. Does CRUSH() take that into account? Does it generate different maps for reads vs writes? or this case is handled by distributing(moving) data off of the OSD? 6. Does CRUSH() takes OSD size into consideration? 7. Does CEPH support quorum writes? (2 out of 3 is a success. ) -- Jvrao --- First they ignore you, then they laugh at you, then they fight you, then you win. - Mahatma Gandhi -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
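Not answers, but a few read-only commands that make the cluster map and CRUSH placement visible while exploring these questions; the pool and object names are placeholders:

    # current osdmap epoch -- it bumps every time the map changes (OSDs going up/down, etc.)
    ceph osd dump | head -1
    # ask CRUSH where an object would land: prints the PG and the acting set, primary OSD first
    ceph osd map rbd some-object
    # per-OSD utilisation, relevant to the "too full" question (available on recent releases)
    ceph osd df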
RE: Regarding newstore performance
Hi Somnath, You could try apply this one:) https://github.com/ceph/ceph/pull/4356 BTW the previous RocksDB configuration has a bug that set rocksdb_disableDataSync to true by default, which may cause data loss in failure. So pls update the newstore to latest or manually set it to false. I suspect the KVDB performance will be worse after doing this...but that's the way we need to go. Xiaoxi -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Thursday, April 16, 2015 12:07 AM To: Haomai Wang Cc: ceph-devel Subject: RE: Regarding newstore performance Hoamai, Yes, separating out the kvdb directory is the path I will take to identify the cause of the WA. This tool I have written on top of these disk counters. I can share that but you need SanDisk optimus echo (or max) drive to make it work :-) Thanks Regards Somnath -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Wednesday, April 15, 2015 5:23 AM To: Somnath Roy Cc: ceph-devel Subject: Re: Regarding newstore performance On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy somnath@sandisk.com wrote: Hi Sage/Mark, I did some WA experiment with newstore with the similar settings I mentioned yesterday. Test: --- 64K Random write with 64 QD and writing total of 1 TB of data. Newstore: Fio output at the end of 1 TB write. --- rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.9 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 iops] [eta 00m:00s] rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015 write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec slat (usec): min=43, max=9480, avg=116.45, stdev=10.99 clat (msec): min=13, max=1331, avg=83.55, stdev=52.97 lat (msec): min=14, max=1331, avg=83.67, stdev=52.97 clat percentiles (msec): | 1.00th=[ 60], 5.00th=[ 68], 10.00th=[ 71], 20.00th=[ 74], | 30.00th=[ 76], 40.00th=[ 78], 50.00th=[ 81], 60.00th=[ 83], | 70.00th=[ 86], 80.00th=[ 90], 90.00th=[ 94], 95.00th=[ 98], | 99.00th=[ 109], 99.50th=[ 114], 99.90th=[ 1188], 99.95th=[ 1221], | 99.99th=[ 1270] bw (KB /s): min= 62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03 lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03% lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20% cpu : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, =64=97.9% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, =64=0.0% issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, mint=21421419msec, maxt=21421419msec So, iops getting is ~764. 99th percentile latency should be 100ms. Write amplification at disk level: -- SanDisk SSDs have some disk level counters that can measure number of host writes with flash logical page size and number of actual flash writes with the same flash logical page size. The difference between these two is the actual WA causing to disk. Please find the data in the following xls. 
https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlx cX5TLMRzdXyJE/edit?usp=sharing Total host writes in this period = 923896266 Total flash writes in this period = 1465339040 FileStore: - Fio output at the end of 1 TB write. --- rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.9 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 iops] [eta 00m:01s] rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015 write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec slat (usec): min=42, max=7144, avg=120.45, stdev=45.80 clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25 lat (msec): min=1, max=3954, avg=40.90, stdev=81.23 clat percentiles (msec): | 1.00th=[7], 5.00th=[ 11], 10.00th=[ 13], 20.00th=[ 16], | 30.00th=[ 18], 40.00th=[ 20], 50.00th=[ 22], 60.00th=[ 25], | 70.00th=[ 30], 80.00th=[ 40], 90.00th=[ 67], 95.00th=[ 114], | 99.00th=[ 433], 99.50th=[ 570], 99.90th=[ 914],
RE: Regarding newstore performance
Thanks Xiaoxi.. But, I have already initiated test by making db/ a symbolic link to another SSD..Will share the result soon. Regards Somnath -Original Message- From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com] Sent: Wednesday, April 15, 2015 6:48 PM To: Somnath Roy; Haomai Wang Cc: ceph-devel Subject: RE: Regarding newstore performance Hi Somnath, You could try apply this one:) https://github.com/ceph/ceph/pull/4356 BTW the previous RocksDB configuration has a bug that set rocksdb_disableDataSync to true by default, which may cause data loss in failure. So pls update the newstore to latest or manually set it to false. I suspect the KVDB performance will be worse after doing this...but that's the way we need to go. Xiaoxi -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy Sent: Thursday, April 16, 2015 12:07 AM To: Haomai Wang Cc: ceph-devel Subject: RE: Regarding newstore performance Hoamai, Yes, separating out the kvdb directory is the path I will take to identify the cause of the WA. This tool I have written on top of these disk counters. I can share that but you need SanDisk optimus echo (or max) drive to make it work :-) Thanks Regards Somnath -Original Message- From: Haomai Wang [mailto:haomaiw...@gmail.com] Sent: Wednesday, April 15, 2015 5:23 AM To: Somnath Roy Cc: ceph-devel Subject: Re: Regarding newstore performance On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy somnath@sandisk.com wrote: Hi Sage/Mark, I did some WA experiment with newstore with the similar settings I mentioned yesterday. Test: --- 64K Random write with 64 QD and writing total of 1 TB of data. Newstore: Fio output at the end of 1 TB write. --- rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.9 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 iops] [eta 00m:00s] rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015 write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec slat (usec): min=43, max=9480, avg=116.45, stdev=10.99 clat (msec): min=13, max=1331, avg=83.55, stdev=52.97 lat (msec): min=14, max=1331, avg=83.67, stdev=52.97 clat percentiles (msec): | 1.00th=[ 60], 5.00th=[ 68], 10.00th=[ 71], 20.00th=[ 74], | 30.00th=[ 76], 40.00th=[ 78], 50.00th=[ 81], 60.00th=[ 83], | 70.00th=[ 86], 80.00th=[ 90], 90.00th=[ 94], 95.00th=[ 98], | 99.00th=[ 109], 99.50th=[ 114], 99.90th=[ 1188], 99.95th=[ 1221], | 99.99th=[ 1270] bw (KB /s): min= 62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03 lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03% lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20% cpu : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, =64=97.9% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, =64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, =64=0.0% issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, mint=21421419msec, maxt=21421419msec So, iops getting is ~764. 99th percentile latency should be 100ms. 
Write amplification at disk level: -- SanDisk SSDs have some disk level counters that can measure number of host writes with flash logical page size and number of actual flash writes with the same flash logical page size. The difference between these two is the actual WA causing to disk. Please find the data in the following xls. https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlx cX5TLMRzdXyJE/edit?usp=sharing Total host writes in this period = 923896266 Total flash writes in this period = 1465339040 FileStore: - Fio output at the end of 1 TB write. --- rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.9 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 iops] [eta 00m:01s] rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015 write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec slat (usec): min=42, max=7144, avg=120.45, stdev=45.80 clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25 lat (msec): min=1,
Re: Newstore get_omap_iterator
On 04/13/2015 10:27 AM, Sage Weil wrote: [adding ceph-devel] On Mon, 13 Apr 2015, Chen, Xiaoxi wrote: Hi, Actually I have done the tuning survey on RocksDB when I was updating RocksDB to a newer version, and exposed the tuning in ceph.conf. What we need to ensure is that the WAL never hits the disk. The rocksdb write ahead log already introduces a 1X write; if the data is flushed to an SST in level 0, that becomes 2X, not to mention any further compaction. [Sage] We'll always have to pay that 1x write to the log; we just want to make sure it doesn't turn into 2x. I take it you're assuming the log is on an SSD (not disk)? [Xiaoxi] The tunings that make the difference are: write_buffer_size, max_write_buffer_number, min_write_buffer_number_to_merge. Say we have write_buffer_size = 512M, max_write_buffer_number = 6, min_write_buffer_number_to_merge = 2. [Mark] Attached are tests for a single PCIe SSD with filestore, newstore + fsync + default tunables, newstore + fsync + Xiaoxi's tunables, and also a test using Xiaoxi's tunables with fdatasync. Basically Xiaoxi's tunables help, and fdatasync helps a little more (mostly at small IO sizes), but still not enough to get us to beat filestore, though newstore *does* do consistently better than filestore with 4MB writes now. Mark newstore_xiaoxi_fdatasync.pdf Description: Adobe PDF document
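For anyone who wants to reproduce the tuned runs, Xiaoxi's values translate into a ceph.conf fragment along these lines; the exact option names depend on how the RocksDB settings are exposed in the newstore branch, so treat them as placeholders rather than the definitive spelling:

    [osd]
        # keep more and larger memtables so writes are merged in memory before
        # being flushed to level 0 (512 MB buffers, up to 6, merge 2 before flushing)
        rocksdb_write_buffer_size = 536870912
        rocksdb_max_write_buffer_number = 6
        rocksdb_min_write_buffer_number_to_merge = 2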
Re: how to test hammer rbd objectmap feature ?
Yeah, once we're confident in it in master. The idea behind this feature was to allow using object maps with existing images. There just wasn't time to include it in the base hammer release. Ok, thanks Josh ! (I'm planning to implement this feature in proxmox when it'll be released). - Mail original - De: Josh Durgin jdur...@redhat.com À: aderumier aderum...@odiso.com, ceph-devel ceph-devel@vger.kernel.org Envoyé: Mercredi 15 Avril 2015 01:12:38 Objet: Re: how to test hammer rbd objectmap feature ? On 04/14/2015 12:48 AM, Alexandre DERUMIER wrote: Hi, I would like to known how to enable object map on hammer ? I found a post hammer commit here: https://github.com/ceph/ceph/commit/3a7b28d9a2de365d515ea1380ee9e4f867504e10 rbd: add feature enable/disable support - Specifies which RBD format 2 features are to be enabled when creating - an image. The numbers from the desired features below should be added - to compute the parameter value: + Specifies which RBD format 2 feature should be enabled when creating + an image. Multiple features can be enabled by repeating this option + multiple times. The following features are supported: -.. option:: --image-features features +.. option:: --image-feature feature - +1: layering support - +2: striping v2 support - +4: exclusive locking support - +8: object map support + layering: layering support + striping: striping v2 support + exclusive-lock: exclusive locking support + object-map: object map support (requires exclusive-lock) So, in current hammer release, we can only setup objectmap and other features on rbd volume creation ? Yes, that's right. Do this patch allow to change features on the fly ? Yup, just for exclusive-lock and object map (since they don't affect object layout). If yes, is it planned to backport it to hammer soon ? Yeah, once we're confident in it in master. The idea behind this feature was to allow using object maps with existing images. There just wasn't time to include it in the base hammer release. Josh -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
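To make the practical difference concrete: on hammer the features have to be chosen when the image is created, roughly as below (pool, image and size are placeholders); the quoted post-hammer commit adds the per-feature flags and the on-the-fly enable shown in the comments:

    # hammer: object map (value 8) requires exclusive locking (value 4); layering is 1
    rbd create mypool/myimage --size 10240 --image-format 2 --image-features 13
    # post-hammer syntax from the commit quoted above (not in the base hammer release):
    #   rbd create mypool/myimage --size 10240 --image-feature layering \
    #       --image-feature exclusive-lock --image-feature object-map
    #   rbd feature enable mypool/myimage object-map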
Re: v0.80.8 and librbd performance
On 04/14/2015 08:01 PM, shiva rkreddy wrote: The clusters are in a test environment, so it's a new deployment of 0.80.9. The OS on the cluster nodes is reinstalled as well, so there shouldn't be any fs aging unless the disks are slowing down. The perf measurement is done by initiating multiple cinder create/delete commands and tracking the volume to be in available or completely gone from cinder list output. Even running the rbd rm command from the cinder node results in similar behaviour. I'll try with increasing rbd_concurrent_management in ceph.conf. Is the param name rbd_concurrent_management or rbd-concurrent-management ? 'rbd concurrent management ops' - spaces, hyphens, and underscores are equivalent in ceph configuration. A log with 'debug ms = 1' and 'debug rbd = 20' from 'rbd rm' on both versions might give clues about what's going slower. Josh On Tue, Apr 14, 2015 at 12:36 PM, Josh Durgin jdur...@redhat.com wrote: I don't see any commits that would be likely to affect that between 0.80.7 and 0.80.9. Is this after upgrading an existing cluster? Could this be due to fs aging beneath your osds? How are you measuring create/delete performance? You can try increasing rbd concurrent management ops in ceph.conf on the cinder node. This affects delete speed, since rbd tries to delete each object in a volume. Josh *From:* shiva rkreddy shiva.rkre...@gmail.com *Sent:* Apr 14, 2015 5:53 AM *To:* Josh Durgin *Cc:* Ken Dreyer; Sage Weil; Ceph Development; ceph-us...@ceph.com *Subject:* Re: v0.80.8 and librbd performance Hi Josh, We are using firefly 0.80.9 and see both cinder create/delete numbers slow down compared to 0.80.7. I don't see any specific tuning requirements and our cluster is run pretty much on the default configuration. Do you recommend any tuning or can you please suggest some log signatures we need to be looking at? Thanks shiva On Wed, Mar 4, 2015 at 1:53 PM, Josh Durgin jdur...@redhat.com wrote: On 03/03/2015 03:28 PM, Ken Dreyer wrote: On 03/03/2015 04:19 PM, Sage Weil wrote: Hi, This is just a heads up that we've identified a performance regression in v0.80.8 from previous firefly releases. A v0.80.9 is working its way through QA and should be out in a few days. If you haven't upgraded yet you may want to wait. Thanks! sage Hi Sage, I've seen a couple Redmine tickets on this (eg http://tracker.ceph.com/issues/9854 , http://tracker.ceph.com/issues/10956). It's not totally clear to me which of the 70+ unreleased commits on the firefly branch fix this librbd issue. Is it only the three commits in https://github.com/ceph/ceph/pull/3410 , or are there more? Those are the only ones needed to fix the librbd performance regression, yes. Josh -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
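A sketch of how such a log could be captured and how the concurrency knob is usually set; the [client] section values, the log path and the volume name are only examples:

    # ceph.conf on the cinder/client node
    [client]
        debug ms = 1
        debug rbd = 20
        log file = /var/log/ceph/client.$pid.log
        # default is 10; raising it increases delete parallelism
        rbd concurrent management ops = 20

    # then time the delete on both 0.80.7 and 0.80.9 and compare the logs
    time rbd rm volumes/volume-test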
RE: Regarding newstore performance
Hi Sage/Mark, I did some WA experiment with newstore with the similar settings I mentioned yesterday. Test: --- 64K Random write with 64 QD and writing total of 1 TB of data. Newstore: Fio output at the end of 1 TB write. --- rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.9 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 iops] [eta 00m:00s] rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015 write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec slat (usec): min=43, max=9480, avg=116.45, stdev=10.99 clat (msec): min=13, max=1331, avg=83.55, stdev=52.97 lat (msec): min=14, max=1331, avg=83.67, stdev=52.97 clat percentiles (msec): | 1.00th=[ 60], 5.00th=[ 68], 10.00th=[ 71], 20.00th=[ 74], | 30.00th=[ 76], 40.00th=[ 78], 50.00th=[ 81], 60.00th=[ 83], | 70.00th=[ 86], 80.00th=[ 90], 90.00th=[ 94], 95.00th=[ 98], | 99.00th=[ 109], 99.50th=[ 114], 99.90th=[ 1188], 99.95th=[ 1221], | 99.99th=[ 1270] bw (KB /s): min= 62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03 lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03% lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20% cpu : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0% issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, mint=21421419msec, maxt=21421419msec So, the iops we are getting is ~764. 99th percentile latency should be 100ms. Write amplification at disk level: -- SanDisk SSDs have some disk level counters that can measure the number of host writes with flash logical page size and the number of actual flash writes with the same flash logical page size. The difference between these two is the actual WA caused to the disk. Please find the data in the following xls. https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlxcX5TLMRzdXyJE/edit?usp=sharing Total host writes in this period = 923896266 Total flash writes in this period = 1465339040 FileStore: - Fio output at the end of 1 TB write.
--- rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, iodepth=64 fio-2.1.11-20-g9a44 Starting 1 process rbd engine: RBD version: 0.1.9 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 iops] [eta 00m:01s] rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015 write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec slat (usec): min=42, max=7144, avg=120.45, stdev=45.80 clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25 lat (msec): min=1, max=3954, avg=40.90, stdev=81.23 clat percentiles (msec): | 1.00th=[7], 5.00th=[ 11], 10.00th=[ 13], 20.00th=[ 16], | 30.00th=[ 18], 40.00th=[ 20], 50.00th=[ 22], 60.00th=[ 25], | 70.00th=[ 30], 80.00th=[ 40], 90.00th=[ 67], 95.00th=[ 114], | 99.00th=[ 433], 99.50th=[ 570], 99.90th=[ 914], 99.95th=[ 1090], | 99.99th=[ 1647] bw (KB /s): min= 32, max=243072, per=100.00%, avg=103148.37, stdev=63090.00 lat (usec) : 1000=0.01% lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42% lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14% lat (msec) : 2000=0.06%, >=2000=0.01% cpu : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0% issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0 latency : target=0, window=0, percentile=100.00%, depth=64 Run status group 0 (all jobs): WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, mint=10636117msec, maxt=10636117msec Disk stats (read/write): sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00% So, the iops here is ~1500. 99th percentile latency should be within 50ms. Write amplification at disk level: -- Total host writes in this period = 643611346 Total flash writes in this period = 1157304512 https://docs.google.com/spreadsheets/d/1gbIATBerS8COzSsJRMbkFXCSbLjn61Fz49CLH8WPh7Q/edit?pli=1#gid=95373000 Summary:
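For completeness, the workload above corresponds to an fio job file along these lines; the pool, image and client names are placeholders, and the parameters are taken from the output above:

    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=test-image
    rw=randwrite
    bs=64k
    iodepth=64
    size=1000g

    [rbd_iodepth32]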
Re: [Ceph-maintainers] statically allocated uid/gid for ceph
On 04/15/2015 01:21 AM, Sage Weil wrote: On Tue, 14 Apr 2015, Tim Serong wrote: On 04/14/2015 11:05 AM, Sage Weil wrote: Tim, Owen: Can we get a 'ceph' user/group uid/gid allocated for SUSE to get this unstuck? IMO the radosgw systemd stuff is blocked behind this too. I haven't yet been able to get a good answer to assignment of static UIDs and GIDs (I was told all the ones between 0-99 are taken already). But, if it's OK for the UID and GID numbers to potentially be different on different systems, adding a ceph user and ceph group is easy, we just add appropriate `groupadd -r ceph` and `useradd -r ceph` invocations to the rpm %pre script, which will give a UID/GID somewhere in the 100-499 range (see https://en.opensuse.org/openSUSE:Packaging_guidelines#Users_and_Groups for some notes on this). We'd also want to update rpmlint to not whine about the ceph name. I originally thought the risk of non-static UID/GID numbers on different systems was terrible, but... I think we still want them to be static across a distro; it's the cross-distro change that will be relatively rare. So a fixed ID from each distro family ought to be okay? Optimally, yes, I too want at least a fixed ID per distro. I'm presently attempting to find out exactly how we (SUSE) do this in some officially recognised way. I think a osd-prestart.sh snippet (or similar) that does a chown -R of any non-root osd data to the local ceph user prior to starting the daemon will handle the cross-distro changes without too much trouble. I'd lean toward not going from root - ceph, though, and have the start script stay root and not drop privs if the data is owned by root.. that covers upgrades without interruption. What do you think? ...that sounds reasonable, and I think it would also handle the case where, say, you move an OSD from one SUSE host to another - if the UID/GID doesn't match (maybe some other `useradd`ing software was installed first on the other host), the chown will fix it anyway. Are there any holes in this? It would be nicer if the suse-suse case didn't require a chown, but yeah, it'd still work just fine... OK. So at least that's a technically viable but undesirable fallback position. Tim sage Regards, Tim On Thu, 11 Dec 2014, Tim Serong wrote: On 12/11/2014 05:48 AM, Sage Weil wrote: +ceph-devel On Wed, 10 Dec 2014, Ken Dreyer wrote: On 12/06/2014 01:54 PM, Sage Weil wrote: Hi Colin, Boris, Owen, We would like to choose a statically allocated uid and gid for use by Ceph storage servers. The basic goals are: - run daemons as non-root (right now everything is uid 0 (runtime and on-disk data) and this is clearly not ideal) - enable hot swap of disks between storage servers - standardize across distros so that we can build clusters with a mix To support the hot swap, we can't use the usual uids allocated dynamically during package installation. Disks will completely filled with Ceph data files with the uid from one machine and will not be usable on another machine. I'm hoping we can choose a static uid/gid pair that is unused for Debian (and Ubuntu), Fedora (and RHEL/CentOS), and OpenSUSE/SLES. This will let us maintain consistency across the entire ecosystem. How many system users should I request from the Fedora Packaging Committee, and what should their names be? For example, are ceph-mon and ceph-osd going to run under the same non-privileged system account? Hmm, my first impulse was to make a single user and group. But it might make sense that e.g. rgw should run in a different context than ceph-osd or ceph-mon. 
If we go down that road, then maybe ceph-osd ceph-mon ceph-mds ceph-rgw ceph-calamari and a 'ceph' group that we can use for /var/log/ceph etc for the qemu and other librados users? Alternatively, if we just do user+group ceph, then rgw can run as www-data or apache (as it does now). Not sure what makes the most sense for ceph-calamari. FWIW my gut says go with a single ceph user+group and leave rgw running as the apache user. Calamari consists of a few pieces - the web-accessible bit runs as the apache user, then there's the cthulhu daemon, as well as carbon-cache for the graphite stuff. These latter two I believe run as root (at least, they do with my SUSE packages which have systemd units for each of these services, and I assume they run as root on other distros where they're run under supervisord). Now that I think of it though, I wonder if it makes sense to just run the whole lot as the apache user...? Regards, Tim -- Tim Serong Senior Clustering Engineer SUSE tser...@suse.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Tim Serong Senior Clustering Engineer SUSE
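A minimal sketch of the prestart idea discussed in this thread: leave root-owned data alone (so existing deployments keep running as root across an upgrade) and chown anything else to the local ceph user before the daemon starts; the path and the way the OSD id is passed in are illustrative:

    #!/bin/sh
    # ceph-osd prestart sketch: adopt the data dir for the local ceph uid/gid if needed
    osd_data="/var/lib/ceph/osd/ceph-$1"
    owner="$(stat -c %U "$osd_data")"
    if [ "$owner" != "root" ] && [ "$owner" != "ceph" ]; then
        # disk came from a host where the ceph user had a different uid -- fix ownership
        chown -R ceph:ceph "$osd_data"
    fi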
CephFS and the next giant release v0.87.2
Hi Greg, The next giant release as found at https://github.com/ceph/ceph/tree/giant passed the fs suite (http://tracker.ceph.com/issues/11153#fs). Do you think it is ready for QE to start their own round of testing ? Note that it will be the last giant release. Cheers P.S. http://tracker.ceph.com/issues/11153#Release-information has direct links to the pull requests merged into giant since v0.87.1 in case you need more context about one of them. -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature
radosgw and the next giant release v0.87.2
Hi Yehuda, The next giant release as found at https://github.com/ceph/ceph/tree/giant passed the rgw suite (http://tracker.ceph.com/issues/11153#rgw). One run had a transient failure (http://tracker.ceph.com/issues/11259) that did not repeat. You will also find traces of failed run because of http://tracker.ceph.com/issues/11180 but that was resolved by backporting https://github.com/ceph/ceph-qa-suite/pull/375. Do you think it is ready for QE to start their own round of testing ? Note that it will be the last giant release. Cheers P.S. http://tracker.ceph.com/issues/11153#Release-information has direct links to the pull requests merged into giant since v0.87.1 in case you need more context about one of them. -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature
rados and the next giant release v0.87.2
Hi Sam, The next giant release as found at https://github.com/ceph/ceph/tree/giant passed the rados suite (http://tracker.ceph.com/issues/11153#rados). Do you think it is ready for QE to start their own round of testing ? Note that it will be the last giant release. Cheers P.S. http://tracker.ceph.com/issues/11153#Release-information has direct links to the pull requests merged into giant since v0.87.1 in case you need more context about one of them. -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature
rbd and the next giant release v0.87.2
Hi Josh, The next giant release as found at https://github.com/ceph/ceph/tree/giant passed the rbd suite (http://tracker.ceph.com/issues/11153#rbd). Do you think it is ready for QE to start their own round of testing ? Note that it will be the last giant release. Cheers P.S. http://tracker.ceph.com/issues/11153#Release-information has direct links to the pull requests merged into giant since v0.87.1 in case you need more context about one of them. -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature
Re: Backporting to Firefly
Ping ? On 08/04/2015 11:22, Loic Dachary wrote: Hi, I see you have been busy backporting issues to Firefly today, this is great :-) https://github.com/ceph/ceph/pulls/xinxinsh It would be helpful if you could update the pull requests (and the corresponding issues) as explained at http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_backport_commits. Once it's done I propose we move to the next step, as explained at http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO: merging your pull requests in the integration branch ( step 5 http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_populate_the_integration_branch ) and running tests on them ( step 6 http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_run_integration_and_upgrade_tests). Cheers -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature