make check bot paused

2015-04-15 Thread Loic Dachary
Hi,

The make check bot [1] that executes run-make-check.sh [2] on pull requests and 
reports results as comments [3] is experiencing problems. It may be a hardware 
issue and the bot is paused while the issue is investigated [4] to avoid 
sending confusing false negatives. In the meantime the run-make-check.sh 
script [2] can be run locally, before sending a pull request, to confirm that 
the commits do not break the build or tests. It is expected to run in under 15 
minutes, including compilation, on a fast machine with an SSD (or RAM disk), 8 
cores, and 32GB of RAM, and may take up to two hours on a machine with a 
spinning disk and two cores.
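As a rough companion to that sizing guidance, here is a small Python sketch (a hypothetical helper, not part of the Ceph tree) that checks the local core and RAM counts against the "fast machine" profile before you commit to a local run:

```python
import os

def estimate_make_check_runtime():
    """Rough preflight check before running run-make-check.sh locally.

    Mirrors the sizing guidance from the announcement: ~15 minutes on
    8 cores / 32 GB RAM with an SSD, up to ~2 hours on 2 cores with a
    spinning disk. The thresholds are illustrative, not authoritative.
    """
    cores = os.cpu_count() or 1
    # Total physical RAM in GiB (POSIX/Linux).
    ram_gib = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30
    if cores >= 8 and ram_gib >= 32:
        return "fast machine: expect roughly 15 minutes (with an SSD)"
    return "modest machine: budget up to two hours"

print(estimate_make_check_runtime())
```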

Thanks for your patience.

Cheers

[1] bot running on pull requests http://jenkins.ceph.dachary.org/job/ceph/
[2] run-make-check.sh 
http://workbench.dachary.org/ceph/ceph/blob/master/run-make-check.sh
[3] make check results example : 
https://github.com/ceph/ceph/pull/3946#issuecomment-93286840
[4] possible RAM failure http://tracker.ceph.com/issues/11399
-- 
Loïc Dachary, Artisan Logiciel Libre





Re: Regarding newstore performance

2015-04-15 Thread Haomai Wang
On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy somnath@sandisk.com wrote:
 Hi Sage/Mark,
 I did a WA experiment with newstore, with settings similar to those I 
 mentioned yesterday.

 Test:
 ---

 64K Random write with 64 QD and writing total of 1 TB of data.


 Newstore:
 

 Fio output at the end of 1 TB write.
 ---

 rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, 
 iodepth=64
 fio-2.1.11-20-g9a44
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 iops] [eta 
 00m:00s]
 rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 
 2015
   write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
 slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
 clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
  lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
 clat percentiles (msec):
  |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
  | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
  | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
  | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
  | 99.99th=[ 1270]
 bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, 
 stdev=7320.03
 lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
 lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
   cpu  : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=64

 Run status group 0 (all jobs):
   WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, 
 mint=21421419msec, maxt=21421419msec


 So, the iops we are getting is ~764.
 The 99th percentile latency is around 100ms.
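As a quick sanity check, the numbers in that fio summary are internally consistent; a short sketch (fio reports sizes in binary units):

```python
# Cross-check the newstore fio summary against first principles.
total_bytes = 1000 * 2**30       # io=1000.0GB
block_size = 64 * 2**10          # bs=64K
ios = total_bytes // block_size
assert ios == 16384000           # matches "issued: ... w=16384000"

runtime_s = 21421419 / 1000      # runt=21421419msec
iops = ios / runtime_s           # ~764.8; fio reports the truncated 764
bw_kib_s = iops * 64             # 64 KiB per IO -> ~48950 KB/s as reported
print(round(iops, 1), round(bw_kib_s))
```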

 Write amplification at disk level:
 --

 SanDisk SSDs have disk-level counters that measure the number of host 
 writes and the number of actual flash writes, both in units of the flash 
 logical page size. The ratio between these two is the actual write 
 amplification (WA) caused at the disk.

 Please find the data in the following xls.

 https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlxcX5TLMRzdXyJE/edit?usp=sharing

 Total host writes in this period = 923896266

 Total flash writes in this period = 1465339040
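Since both counters are in the same unit, the disk-level WA follows directly as their ratio:

```python
# Disk-level write amplification for the newstore run, using the
# SanDisk counters quoted above (both in flash logical pages).
host_writes = 923896266
flash_writes = 1465339040
wa = flash_writes / host_writes
print(f"disk-level WA: {wa:.2f}x")  # ~1.59x
```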


 FileStore:
 -

 Fio output at the end of 1 TB write.
 ---

 rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, 
 iodepth=64
 fio-2.1.11-20-g9a44
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 iops] [eta 
 00m:01s]
 rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 
 2015
   write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
 slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
 clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
  lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
 clat percentiles (msec):
  |  1.00th=[7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
  | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
  | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
  | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 99.95th=[ 1090],
  | 99.99th=[ 1647]
 bw (KB  /s): min=   32, max=243072, per=100.00%, avg=103148.37, 
 stdev=63090.00
 lat (usec) : 1000=0.01%
 lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42%
 lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14%
 lat (msec) : 2000=0.06%, >=2000=0.01%
   cpu  : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0%
  issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=64

 Run status group 0 (all jobs):
   WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, 
 mint=10636117msec, maxt=10636117msec

 Disk stats (read/write):
   sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00%

 So, the iops here is ~1500.
 The 99th percentile latency should be within 50ms.


 Write amplification at disk level:
 --

 Total host writes in this period = 

04/15/2015 Weekly Ceph Performance Meeting IS ON!

2015-04-15 Thread Mark Nelson
8AM PST as usual, i.e. in 10 minutes. :D  Forgot to send this out earlier, 
sorry!  Anyway, the new hotness is newstore, and there's been a lot of 
testing going on!


Here's the links:

Etherpad URL:
http://pad.ceph.com/p/performance_weekly

To join the Meeting:
https://bluejeans.com/268261044

To join via Browser:
https://bluejeans.com/268261044/browser

To join with Lync:
https://bluejeans.com/268261044/lync


To join via Room System:
Video Conferencing System: bjn.vc -or- 199.48.152.152
Meeting ID: 268261044

To join via Phone:
1) Dial:
  +1 408 740 7256
  +1 888 240 2560(US Toll Free)
  +1 408 317 9253(Alternate Number)
  (see all numbers - http://bluejeans.com/numbers)
2) Enter Conference ID: 268261044

Mark
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 04/08/2015 Weekly Ceph Performance Meeting IS ON!

2015-04-15 Thread Mark Nelson

Nuts, this got out before I hit cancel.  This should be 04/15/2015.  :)

On 04/15/2015 09:51 AM, Mark Nelson wrote:

8AM PST as usual, i.e. in 10 minutes. :D  Forgot to send this out earlier,
sorry!  Anyway, the new hotness is newstore, and there's been a lot of
testing going on!

Here's the links:

Etherpad URL:
http://pad.ceph.com/p/performance_weekly

To join the Meeting:
https://bluejeans.com/268261044

To join via Browser:
https://bluejeans.com/268261044/browser

To join with Lync:
https://bluejeans.com/268261044/lync


To join via Room System:
Video Conferencing System: bjn.vc -or- 199.48.152.152
Meeting ID: 268261044

To join via Phone:
1) Dial:
   +1 408 740 7256
   +1 888 240 2560(US Toll Free)
   +1 408 317 9253(Alternate Number)
   (see all numbers - http://bluejeans.com/numbers)
2) Enter Conference ID: 268261044

Mark


[rgw][Swift API] Reusing WSGI middleware of Swift

2015-04-15 Thread Radoslaw Zarzynski
Hello Guys,

Swift -- besides implementing the OpenStack Object Storage API v1 -- provides
a bunch of huge extensions like SLO/DLO (static/dynamic large objects), bulk
operations (inc. server-side archive extraction), staticweb and many more.
The full list is available in Swift's source code [1]. I'm pretty sure it
would be nice to have at least some degree of compatibility with them.

Reimplementing those features could be painful. Maintaining them would be
another challenge. Even after accomplishing that, we still would not give
users the ability to use (and easily create) third-party extensions like a
ClamAV-based virus scanner [2]. However, maybe there is another solution.

Swift internally utilizes a pipeline-based architecture. The component
delivering the base of the OSOS API (the proxy-server WSGI application) can be
seen as the last stage of that pipeline. Between it and the user sit many
additional middleware modules, all interconnected through the WSGI interface.
In theory, if we provided a WSGI-RGW bridge, we would be able to reuse them
and save a lot of effort.

I created a really stupid proof of concept [3] built on top of wsgiproxy [4]
and the civetweb frontend of RGW. Its primary role is to act as a WSGI-HTTP
mediator. The code is terrible and definitely needs rework, but it already
allowed a few middleware-related API tests from Tempest to pass.
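To make the pipeline idea concrete, here is a minimal, generic WSGI sketch (names are illustrative only, not Swift's actual middleware code): each stage wraps the next callable and can modify the request or response as it passes through.

```python
def base_app(environ, start_response):
    """Stand-in for the final pipeline stage (a proxy-server-like app)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"stored"]

class AddHeader:
    """A trivial middleware stage: injects one response header."""
    def __init__(self, app, header, value):
        self.app, self.header, self.value = app, header, value

    def __call__(self, environ, start_response):
        def sr(status, headers, exc_info=None):
            return start_response(status, headers + [(self.header, self.value)],
                                  exc_info)
        return self.app(environ, sr)

# Build a two-stage pipeline, outermost stage first.
pipeline = AddHeader(base_app, "X-Pipeline", "demo")

# Drive it with a fake WSGI call to show the flow end to end.
captured = {}
def fake_start_response(status, headers, exc_info=None):
    captured["status"], captured["headers"] = status, headers

body = b"".join(pipeline({"PATH_INFO": "/v1/acct/cont/obj"}, fake_start_response))
print(body, captured["status"])  # b'stored' 200 OK
```

A WSGI-RGW bridge would put RGW in the `base_app` position, letting existing Swift middleware stack on top unchanged.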

What would be necessary on RGW's side? Basically:
* support for extracting the account (aka tenant) name from the URL;
* good conformance with the OSOS API;
* optionally: a configuration parameter to disable the auth mechanisms inside
  RGW and delegate that work to Swift middleware.

What's your opinion? I would like to start a discussion to verify the concept.

Regards,
Radoslaw Zarzynski

[1] https://github.com/openstack/swift/tree/master/swift/common/middleware
[2] 
https://julien.danjou.info/blog/2013/extending-swift-with-a-middleware-clamav
[3] https://github.com/rzarzynski/rgwift
[4] http://pythonpaste.org/wsgiproxy/


04/08/2015 Weekly Ceph Performance Meeting IS ON!

2015-04-15 Thread Mark Nelson
8AM PST as usual, i.e. in 10 minutes. :D  Forgot to send this out earlier, 
sorry!  Anyway, the new hotness is newstore, and there's been a lot of 
testing going on!


Here's the links:

Etherpad URL:
http://pad.ceph.com/p/performance_weekly

To join the Meeting:
https://bluejeans.com/268261044

To join via Browser:
https://bluejeans.com/268261044/browser

To join with Lync:
https://bluejeans.com/268261044/lync


To join via Room System:
Video Conferencing System: bjn.vc -or- 199.48.152.152
Meeting ID: 268261044

To join via Phone:
1) Dial:
  +1 408 740 7256
  +1 888 240 2560(US Toll Free)
  +1 408 317 9253(Alternate Number)
  (see all numbers - http://bluejeans.com/numbers)
2) Enter Conference ID: 268261044

Mark


RE: Regarding newstore performance

2015-04-15 Thread Somnath Roy
Haomai,
Yes, separating out the kvdb directory is the path I will take to identify the 
cause of the WA.
I have written this tool on top of these disk counters. I can share it, but 
you need a SanDisk Optimus Eco (or Max) drive to make it work :-)

Thanks & Regards
Somnath

-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com] 
Sent: Wednesday, April 15, 2015 5:23 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Regarding newstore performance

On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy somnath@sandisk.com wrote:
 Hi Sage/Mark,
 I did a WA experiment with newstore, with settings similar to those I 
 mentioned yesterday.

 Test:
 ---

 64K Random write with 64 QD and writing total of 1 TB of data.


 Newstore:
 

 Fio output at the end of 1 TB write.
 ---

 rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
 ioengine=rbd, iodepth=64
 fio-2.1.11-20-g9a44
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 
 iops] [eta 00m:00s]
 rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 
 2015
   write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
 slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
 clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
  lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
 clat percentiles (msec):
  |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
  | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
  | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
  | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
  | 99.99th=[ 1270]
 bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, 
 stdev=7320.03
 lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
 lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
   cpu  : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=64

 Run status group 0 (all jobs):
   WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, 
 mint=21421419msec, maxt=21421419msec


 So, the iops we are getting is ~764.
 The 99th percentile latency is around 100ms.

 Write amplification at disk level:
 --

 SanDisk SSDs have disk-level counters that measure the number of host 
 writes and the number of actual flash writes, both in units of the flash 
 logical page size. The ratio between these two is the actual write 
 amplification (WA) caused at the disk.

 Please find the data in the following xls.

 https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlx
 cX5TLMRzdXyJE/edit?usp=sharing

 Total host writes in this period = 923896266

 Total flash writes in this period = 1465339040


 FileStore:
 -

 Fio output at the end of 1 TB write.
 ---

 rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
 ioengine=rbd, iodepth=64
 fio-2.1.11-20-g9a44
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 
 iops] [eta 00m:01s]
 rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 
 2015
   write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
 slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
 clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
  lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
 clat percentiles (msec):
  |  1.00th=[7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
  | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
  | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
  | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 99.95th=[ 1090],
  | 99.99th=[ 1647]
 bw (KB  /s): min=   32, max=243072, per=100.00%, avg=103148.37, 
 stdev=63090.00
 lat (usec) : 1000=0.01%
 lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42%
 lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14%
 lat (msec) : 2000=0.06%, >=2000=0.01%
   cpu  : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0%
  issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
  latency   : target=0, 

Re: [rgw][Swift API] Reusing WSGI middleware of Swift

2015-04-15 Thread Yehuda Sadeh-Weinraub


- Original Message -
 From: Radoslaw Zarzynski rzarzyn...@mirantis.com
 To: Ceph Development ceph-devel@vger.kernel.org
 Sent: Wednesday, April 15, 2015 8:31:22 AM
 Subject: [rgw][Swift API] Reusing WSGI middleware of Swift
 
 Hello Guys,
 
 Swift -- besides implementing the OpenStack Object Storage API v1 -- provides
 a bunch of huge extensions like SLO/DLO (static/dynamic large objects), bulk
 operations (inc. server-side archive extraction), staticweb and many more.
 Full list is available in Swift's source code [1]. I'm pretty sure it would
 be nice to have at least some degree of compatibility with them.
 
 Reimplementing those features could be painful. Maintaining them would be
 another challenge. Even after accomplishing that we still would not provide
 users with ability to use (and easily create) a third party extensions like
 ClamAV-based virus scanner [2]. However, maybe there is another solution.
 
 Swift internally utilizes a pipeline-based architecture. Component delivering
 the base of OSOS API (the proxy-server WSGI application) could be seen
 as the last stage of such pipeline. Between it and user a lot of additional
 middleware modules exist. All of them are interconnected with the WSGI
 interface. In a pure theory, if we provide a WSGI-RGW bridge, we would be
 able
 to reuse them and save a lot of efforts.
 
 I created a really stupid proof of concept [3] built on top of wsgiproxy [4]
 and civetweb frontend of RGW. Its primary role is a WSGI-HTTP mediator.
 The code is terrible, definitely needs rework but already allowed to pass
 a few middleware-related API tests from Tempest.
 
 What would be necessary on the RGW's side? Basically:
 * support for extraction of account (aka tenant) name from URL;
 * good conformance with OSOS API;
 * optionally: a configuration parameter to disable auth mechanisms inside RGW
   and delegate the work to Swift middleware.
 
 What's your opinion? I would like to start a discussion to verify the
 concept.

That is really interesting. A while back I actually floated around such an 
idea. My main motivation was for keeping up to date with keystone, but I can 
see other uses for it.
Generally I think it would be great to have such a module. I'm not sure that 
it would be the way to go with regard to SLO/DLO, though; those would result 
in a hacky and racy solution at best.

Yehuda

 
 Regards,
 Radoslaw Zarzynski
 
 [1] https://github.com/openstack/swift/tree/master/swift/common/middleware
 [2]
 https://julien.danjou.info/blog/2013/extending-swift-with-a-middleware-clamav
 [3] https://github.com/rzarzynski/rgwift
 [4] http://pythonpaste.org/wsgiproxy/


Re: radosgw and the next giant release v0.87.2

2015-04-15 Thread Yehuda Sadeh-Weinraub


- Original Message -
 From: Loic Dachary l...@dachary.org
 To: Yehuda Sadeh yeh...@redhat.com
 Cc: Ceph Development ceph-devel@vger.kernel.org, Abhishek L 
 abhishek.lekshma...@gmail.com
 Sent: Wednesday, April 15, 2015 2:43:12 AM
 Subject: radosgw and the next giant release v0.87.2
 
 Hi Yehuda,
 
 The next giant release as found at https://github.com/ceph/ceph/tree/giant
 passed the rgw suite (http://tracker.ceph.com/issues/11153#rgw). One run had
 a transient failure (http://tracker.ceph.com/issues/11259) that did not
 repeat. You will also find traces of failed runs caused by
 http://tracker.ceph.com/issues/11180, but that was resolved by backporting
 https://github.com/ceph/ceph-qa-suite/pull/375.
 
 Do you think it is ready for QE to start their own round of testing?

Yes.

 
 Note that it will be the last giant release.
 
 Cheers
 
 P.S. http://tracker.ceph.com/issues/11153#Release-information has direct
 links to the pull requests merged into giant since v0.87.1 in case you need
 more context about one of them.
 
 --
 Loïc Dachary, Artisan Logiciel Libre
 
 
 
 
 
 
 
 
 
 


leaking mons on a latest dumpling

2015-04-15 Thread Andrey Korolyov
Hello,

there is a slow leak which I assume is present in all Ceph versions, but
it is clearly exposed only over long time spans and on large clusters. It
looks like the lower a monitor is placed in the quorum hierarchy, the
higher the leak:

{"election_epoch":26,"quorum":[0,1,2,3,4],"quorum_names":["0","1","2","3","4"],"quorum_leader_name":"0","monmap":{"epoch":1,"fsid":"a2ec787e-3551-4a6f-aa24-deedbd8f8d01","modified":"2015-03-05
13:48:54.696784","created":"2015-03-05
13:48:54.696784","mons":[{"rank":0,"name":"0","addr":"10.0.1.91:6789\/0"},{"rank":1,"name":"1","addr":"10.0.1.92:6789\/0"},{"rank":2,"name":"2","addr":"10.0.1.93:6789\/0"},{"rank":3,"name":"3","addr":"10.0.1.94:6789\/0"},{"rank":4,"name":"4","addr":"10.0.1.95:6789\/0"}]}}

ceph heap stats -m 10.0.1.95:6789 | grep Actual
MALLOC: =427626648 (  407.8 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.94:6789 | grep Actual
MALLOC: =289550488 (  276.1 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.93:6789 | grep Actual
MALLOC: =230592664 (  219.9 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.92:6789 | grep Actual
MALLOC: =253710488 (  242.0 MiB) Actual memory used (physical + swap)
ceph heap stats -m 10.0.1.91:6789 | grep Actual
MALLOC: = 97112216 (   92.6 MiB) Actual memory used (physical + swap)
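Comparing these figures by hand is tedious; a small parser for the tcmalloc "Actual memory used" line (a hypothetical helper, not an existing Ceph tool) makes them easy to tabulate:

```python
import re

# Parse the "Actual memory used" line from `ceph heap stats` (tcmalloc output).
LINE_RE = re.compile(r"MALLOC:\s*=\s*(\d+)\s+\(\s*([\d.]+) MiB\)")

def actual_memory(line):
    """Return (bytes, mib) parsed from a tcmalloc heap-stats line."""
    m = LINE_RE.search(line)
    if not m:
        raise ValueError("not an 'Actual memory used' line")
    return int(m.group(1)), float(m.group(2))

sample = "MALLOC: =427626648 (  407.8 MiB) Actual memory used (physical + swap)"
nbytes, mib = actual_memory(sample)
print(nbytes, mib, round(nbytes / 2**20, 1))  # the two figures agree
```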

for almost the same uptime, the amount of data handled is:
rd KB 55365750505
wr KB 82719722467

The leak itself is not very critical, but it of course requires some
scripting to restart monitors at least once per month on a 300 TB cluster
to keep monitor processes from reaching 1 GB of memory consumption. Given
the current status of dumpling, it should be possible to identify the leak
source and then forward-port the fix to the newer releases; the freshest
version I am running at large scale is the top of the dumpling branch, and
checking fix proposals on anything else would require an enormous amount
of time.
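The script work mentioned above could be as simple as periodically comparing each mon's heap usage against a threshold; a sketch, where the 1 GiB default is taken from the post and the restart command itself (e.g. via the init system) would be issued by the surrounding cron job:

```python
def mons_to_restart(heap_bytes_by_mon, limit_bytes=1 * 2**30):
    """Given {mon_name: actual_memory_bytes}, pick mons over the limit.

    The 1 GiB default mirrors the consumption the post wants to avoid.
    """
    return sorted(name for name, b in heap_bytes_by_mon.items()
                  if b >= limit_bytes)

# Using the heap figures quoted above (all still well under 1 GiB):
heap = {"0": 97112216, "1": 253710488, "2": 230592664,
        "3": 289550488, "4": 427626648}
print(mons_to_restart(heap))                           # nothing to do yet
print(mons_to_restart(heap, limit_bytes=300 * 2**20))  # only mon.4 over 300 MiB
```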

Thanks!


Re: [Ceph-maintainers] statically allocated uid/gid for ceph

2015-04-15 Thread Gaudenz Steinlin

Hi

Ken Dreyer kdre...@redhat.com writes:

 On 04/14/2015 09:21 AM, Sage Weil wrote:
 I think we still want them to be static across a distro; it's the 
 cross-distro change that will be relatively rare.  So a fixed ID from each 
 distro family ought to be okay?

 Sounds sane to me. I've filed https://fedorahosted.org/fpc/ticket/524 to
 request one from Fedora.

I have now requested the same for Debian. If the request is granted we
will most likely get the uid/gid 64045. Maybe other distros could use the
same number. It seems that only Debian has a range of reserved ids for
this purpose. I would expect Ubuntu to use the same id, but that is
ultimately up to them.

Gaudenz




Request for static uid/gid for ceph

2015-04-15 Thread Gaudenz Steinlin

Hi

Ceph is currently developing unprivileged user support, and the upstream
developers are asking each of the distros to allocate a static numerical
UID for the ceph user and group.

The reason for the static number is to allow Ceph users to hot-swap hard
drives from one OSD server to another. Currently this practice works
because Ceph writes its files as uid 0, but once Ceph starts writing its
files as an unprivileged user, it will be nice to allow administrators to
unplug a drive, plug it into another computer, and have everything just
work.
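The underlying reason is that on-disk ownership is stored as a bare numeric id, not a user name, so a moved drive only "just works" if the ceph user has the same number on both hosts; a quick illustration:

```python
import os
import tempfile

# File ownership on disk is a plain number: st_uid/st_gid carry no name.
# Name resolution happens on each host via its own passwd database, which
# is why the numeric ceph uid must match across machines.
fd, path = tempfile.mkstemp()
os.close(fd)
st = os.stat(path)
print(type(st.st_uid), st.st_uid == os.getuid())
os.unlink(path)
```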

For this I'm requesting a uid and gid from the reserved range
60000-64999, according to the process outlined in
https://anonscm.debian.org/cgit/users/cjwatson/base-passwd.git/tree/README.

Thanks,
Gaudenz





Few CEPH questions.

2015-04-15 Thread Venkateswara Rao Jujjuri
Hi all, I am new to this mailing list.
I have a few basic questions on Ceph; I hope someone can answer them.
Thanks in advance!


I would like to understand more about the placement policy, especially
its failure path.

1. Machines going up and down is fairly common in a data center.
 How often does the cluster map change?
 Does every machine bounce cause an update/distribution of the cluster
 map and affect CRUSH? Does it make the cluster network too chatty?

2. Ceph mainly depends on the primary OSD in a given PG.
What happens in the read/write path if that OSD is down at that moment?
There can be cases where the OSD is down but the cluster map is not up to
date. When the write/read fails, does the client retry after refreshing
the cluster map?

3. Whenever the cluster map changes, it may not propagate to the entire
cluster. Some clients may be running with an old map and may end up at
the wrong OSD. Do we depend on peering to take care of this situation?

4. If an OSD becomes too full, it may still take reads but no more
writes. Does CRUSH take that into account? Does it generate different
maps for reads vs. writes, or is this case handled by moving data off
the OSD?

5. Does CRUSH take OSD size into consideration?

6. Does Ceph support quorum writes? (2 out of 3 is a success.)

--
Jvrao
---
First they ignore you, then they laugh at you, then they fight you,
then you win. - Mahatma Gandhi


RE: Regarding newstore performance

2015-04-15 Thread Chen, Xiaoxi
Hi Somnath,
 You could try applying this one :)
 https://github.com/ceph/ceph/pull/4356

  BTW, the previous RocksDB configuration had a bug that set 
rocksdb_disableDataSync to true by default, which may cause data loss on 
failure. So please update newstore to the latest or manually set it to false. 
I suspect the KVDB performance will be worse after doing this... but that's 
the way we need to go.

Xiaoxi

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, April 16, 2015 12:07 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Regarding newstore performance

Haomai,
Yes, separating out the kvdb directory is the path I will take to identify the 
cause of the WA.
I have written this tool on top of these disk counters. I can share it, but 
you need a SanDisk Optimus Eco (or Max) drive to make it work :-)

Thanks & Regards
Somnath

-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Wednesday, April 15, 2015 5:23 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Regarding newstore performance

On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy somnath@sandisk.com wrote:
 Hi Sage/Mark,
 I did a WA experiment with newstore, with settings similar to those I 
 mentioned yesterday.

 Test:
 ---

 64K Random write with 64 QD and writing total of 1 TB of data.


 Newstore:
 

 Fio output at the end of 1 TB write.
 ---

 rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
 ioengine=rbd, iodepth=64
 fio-2.1.11-20-g9a44
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 
 iops] [eta 00m:00s]
 rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 
 2015
   write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
 slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
 clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
  lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
 clat percentiles (msec):
  |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
  | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
  | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
  | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
  | 99.99th=[ 1270]
 bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, 
 stdev=7320.03
 lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
 lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
   cpu  : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=64

 Run status group 0 (all jobs):
   WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, 
 mint=21421419msec, maxt=21421419msec


 So, the iops we are getting is ~764.
 The 99th percentile latency is around 100ms.

 Write amplification at disk level:
 --

 SanDisk SSDs have disk-level counters that measure the number of host 
 writes and the number of actual flash writes, both in units of the flash 
 logical page size. The ratio between these two is the actual write 
 amplification (WA) caused at the disk.

 Please find the data in the following xls.

 https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlx
 cX5TLMRzdXyJE/edit?usp=sharing

 Total host writes in this period = 923896266

 Total flash writes in this period = 1465339040


 FileStore:
 -

 Fio output at the end of 1 TB write.
 ---

 rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
 ioengine=rbd, iodepth=64
 fio-2.1.11-20-g9a44
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 
 iops] [eta 00m:01s]
 rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 
 2015
   write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
 slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
 clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
  lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
 clat percentiles (msec):
  |  1.00th=[7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
  | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
  | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
  | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 

RE: Regarding newstore performance

2015-04-15 Thread Somnath Roy
Thanks, Xiaoxi.
But I have already started a test by making db/ a symbolic link to another 
SSD. Will share the result soon.
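The symlink approach can be sketched as follows (paths are illustrative, it assumes the newstore kvdb lives in the db/ directory under the OSD data dir as mentioned above, and it should only be run with the OSD stopped):

```python
import os
import shutil
import tempfile

def relocate_db(osd_data, fast_disk_dir):
    """Move <osd_data>/db to faster storage and leave a symlink behind.

    newstore keeps opening the same <osd_data>/db path, so the symlink
    makes the relocation transparent. Run only with the OSD stopped.
    """
    src = os.path.join(osd_data, "db")
    dst = os.path.join(fast_disk_dir, "db")
    shutil.move(src, dst)   # move the kvdb onto the other SSD
    os.symlink(dst, src)    # the original path now points at the new home
    return src, dst

# Demonstration in a scratch directory standing in for the real mounts:
root = tempfile.mkdtemp()
osd = os.path.join(root, "osd.0")
os.makedirs(os.path.join(osd, "db"))
ssd = os.path.join(root, "ssd")
os.makedirs(ssd)
link, target = relocate_db(osd, ssd)
print(os.path.islink(link), os.path.isdir(target))
```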

Regards
Somnath

-Original Message-
From: Chen, Xiaoxi [mailto:xiaoxi.c...@intel.com] 
Sent: Wednesday, April 15, 2015 6:48 PM
To: Somnath Roy; Haomai Wang
Cc: ceph-devel
Subject: RE: Regarding newstore performance

Hi Somnath,
 You could try applying this one :)
 https://github.com/ceph/ceph/pull/4356

  BTW, the previous RocksDB configuration had a bug that set 
rocksdb_disableDataSync to true by default, which may cause data loss on 
failure. So please update newstore to the latest or manually set it to false. 
I suspect the KVDB performance will be worse after doing this... but that's 
the way we need to go.

Xiaoxi

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Thursday, April 16, 2015 12:07 AM
To: Haomai Wang
Cc: ceph-devel
Subject: RE: Regarding newstore performance

Haomai,
Yes, separating out the kvdb directory is the path I will take to identify the
cause of the WA.
I have written this tool on top of these disk counters. I can share it, but
you need a SanDisk Optimus Echo (or Max) drive to make it work :-)

Thanks & Regards
Somnath

-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Wednesday, April 15, 2015 5:23 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Regarding newstore performance

On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy somnath@sandisk.com wrote:
 Hi Sage/Mark,
 I did some WA experiment with newstore with the similar settings I mentioned 
 yesterday.

 Test:
 ---

 64K Random write with 64 QD and writing total of 1 TB of data.


 Newstore:
 

 Fio output at the end of 1 TB write.
 ---

 rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
 ioengine=rbd, iodepth=64
 fio-2.1.11-20-g9a44
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 
 iops] [eta 00m:00s]
 rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 
 2015
   write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
 slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
 clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
  lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
 clat percentiles (msec):
  |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
  | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
  | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
  | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
  | 99.99th=[ 1270]
 bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, 
 stdev=7320.03
 lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
 lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
   cpu  : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=64

 Run status group 0 (all jobs):
   WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, 
 mint=21421419msec, maxt=21421419msec


 So, iops getting is ~764.
 99th percentile latency should be < 100ms.

 Write amplification at disk level:
 --

 SanDisk SSDs have disk-level counters that measure the number of host writes
 and the number of actual flash writes, both in units of the flash logical
 page size. The ratio between these two is the actual WA induced at the disk.

 Please find the data in the following xls.

 https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlx
 cX5TLMRzdXyJE/edit?usp=sharing

 Total host writes in this period = 923896266

 Total flash writes in this period = 1465339040


 FileStore:
 -

 Fio output at the end of 1 TB write.
 ---

 rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, 
 ioengine=rbd, iodepth=64
 fio-2.1.11-20-g9a44
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 
 iops] [eta 00m:01s]
 rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 
 2015
   write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
 slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
 clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
  lat (msec): min=1, 

Re: Newstore get_omap_iterator

2015-04-15 Thread Mark Nelson



On 04/13/2015 10:27 AM, Sage Weil wrote:

[adding ceph-devel]

On Mon, 13 Apr 2015, Chen, Xiaoxi wrote:

Hi,

   Actually I have done a tuning survey on RocksDB when I was
updating RocksDB to a newer version, and exposed the tunables in
ceph.conf.

   What we need to ensure is the WAL never hits the disk. The rocksdb


We'll always have to pay that 1x write to the log; we just want to make
sure it doesn't turn into 2x.  I take it you're assuming the log is on an
SSD (not disk)?


write ahead log already introduces a 1X write; if the data is flushed to
SST in level 0, that will be 2X, not to mention any further compaction.

   The tuning that makes the differences are :
write_buffer_size
max_write_buffer_number
min_write_buffer_number_to_merge

   Say if we have
write_buffer_size =512M
max_write_buffer_number = 6
min_write_buffer_number_to_merge =2
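With those values, up to six 512 MB memtables can sit in RAM and flushes merge at least two of them, so short bursts never leave the WAL for an SST. A hypothetical ceph.conf sketch of the same settings (the option names below are a guess at how Xiaoxi's patch might expose them; verify against your build with `ceph daemon osd.N config show` before relying on them):

```ini
[osd]
# Hypothetical sketch only: option names are illustrative of the
# RocksDB tunables discussed above, not verified against any build.
# 512 MB memtable
rocksdb_write_buffer_size = 536870912
# keep up to 6 memtables in RAM
rocksdb_max_write_buffer_number = 6
# merge at least 2 memtables per flush
rocksdb_min_write_buffer_number_to_merge = 2
# and per Xiaoxi's earlier note, never run with data sync disabled
rocksdb_disableDataSync = false
```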


Attached are tests for a single PCIE ssd with filestore, newstore + 
fsync + default tunables, newstore+fsync + Xiaoxi's tunables, and also a 
test using xiaoxi's tunables with fdatasync.


Basically Xiaoxi's tunables help, and fdatasync helps a little more 
(mostly at small IO sizes), but still not enough to get us to beat 
filestore, though newstore *does* do consistently better than filestore 
with 4MB writes now.


Mark



newstore_xiaoxi_fdatasync.pdf
Description: Adobe PDF document


Re: how to test hammer rbd objectmap feature ?

2015-04-15 Thread Alexandre DERUMIER
Yeah, once we're confident in it in master. The idea behind this 
feature was to allow using object maps with existing images. There 
just wasn't time to include it in the base hammer release. 

Ok, thanks Josh !

(I'm planning to implement this feature in Proxmox when it's released).


- Mail original -
De: Josh Durgin jdur...@redhat.com
À: aderumier aderum...@odiso.com, ceph-devel ceph-devel@vger.kernel.org
Envoyé: Mercredi 15 Avril 2015 01:12:38
Objet: Re: how to test hammer rbd objectmap feature ?

On 04/14/2015 12:48 AM, Alexandre DERUMIER wrote: 
 Hi, 
 
 I would like to know how to enable object map on hammer? 
 
 I found a post hammer commit here: 
 
 https://github.com/ceph/ceph/commit/3a7b28d9a2de365d515ea1380ee9e4f867504e10 
 rbd: add feature enable/disable support 
 
 - Specifies which RBD format 2 features are to be enabled when creating 
 - an image. The numbers from the desired features below should be added 
 - to compute the parameter value: 
 + Specifies which RBD format 2 feature should be enabled when creating 
 + an image. Multiple features can be enabled by repeating this option 
 + multiple times. The following features are supported: 
 
 -.. option:: --image-features features 
 +.. option:: --image-feature feature 
 
 - +1: layering support 
 - +2: striping v2 support 
 - +4: exclusive locking support 
 - +8: object map support 
 + layering: layering support 
 + striping: striping v2 support 
 + exclusive-lock: exclusive locking support 
 + object-map: object map support (requires exclusive-lock) 
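For reference, under the old numeric scheme the `--image-features` value is just the sum of the bits listed in the removed lines; a quick sketch (plain arithmetic, not the librbd API):

```python
# Feature bits quoted in the diff above; the pre-patch --image-features
# value is simply their sum.
FEATURE_BITS = {
    "layering": 1,
    "striping": 2,
    "exclusive-lock": 4,
    "object-map": 8,
}

def features_mask(*names):
    """Compute the numeric mask for a set of named features."""
    return sum(FEATURE_BITS[n] for n in names)

# object-map requires exclusive-lock, so a typical object-map image uses:
print(features_mask("layering", "exclusive-lock", "object-map"))  # 13
```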
 
 
 So, in current hammer release, we can only setup objectmap and other features 
 on rbd volume creation ? 

Yes, that's right. 

 Does this patch allow changing features on the fly? 

Yup, just for exclusive-lock and object map (since they don't 
affect object layout). 

 If yes, is it planned to backport it to hammer soon ? 

Yeah, once we're confident in it in master. The idea behind this 
feature was to allow using object maps with existing images. There 
just wasn't time to include it in the base hammer release. 

Josh 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: v0.80.8 and librbd performance

2015-04-15 Thread Josh Durgin

On 04/14/2015 08:01 PM, shiva rkreddy wrote:

The clusters are in test environment, so its a new deployment of 0.80.9.
OS on the cluster nodes is reinstalled as well, so there shouldn't be
any fs aging unless the disks are slowing down.

The perf measurement is done by initiating multiple cinder create/delete
commands and tracking when the volume becomes available or is completely gone
from the cinder list output.

Even running the 'rbd rm' command from the cinder node results in similar
behaviour.

I'll try increasing rbd_concurrent_management in ceph.conf.
  Is the param name rbd_concurrent_management or rbd-concurrent-management ?


'rbd concurrent management ops' - spaces, hyphens, and underscores are
equivalent in ceph configuration.
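That normalization can be sketched as follows (illustrative only, not Ceph's actual implementation):

```python
def normalize_option(name):
    """Canonicalize a Ceph option name: spaces, hyphens and underscores
    are interchangeable in ceph.conf, so map them all to underscores."""
    return name.replace("-", "_").replace(" ", "_")

# All three spellings resolve to the same option:
print(normalize_option("rbd concurrent management ops"))
# rbd_concurrent_management_ops
```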

A log with 'debug ms = 1' and 'debug rbd = 20' from 'rbd rm' on both 
versions might give clues about what's going slower.


Josh


On Tue, Apr 14, 2015 at 12:36 PM, Josh Durgin jdur...@redhat.com wrote:

I don't see any commits that would be likely to affect that between
0.80.7 and 0.80.9.

Is this after upgrading an existing cluster?
Could this be due to fs aging beneath your osds?

How are you measuring create/delete performance?

You can try increasing rbd concurrent management ops in ceph.conf on
the cinder node. This affects delete speed, since rbd tries to
delete each object in a volume.
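That last point is why the option matters for delete speed: the number of deletes `rbd rm` must issue is the object count of the volume. A rough sketch (assumes the default 4 MB object size and a fully written volume; not librbd code):

```python
def rbd_delete_ops(volume_bytes, object_bytes=4 << 20):
    """Number of object deletes 'rbd rm' issues for a fully written
    volume (assuming the default 4 MB object size)."""
    return -(-volume_bytes // object_bytes)  # ceiling division

# A 40 GB volume means 10240 individual deletes; raising
# 'rbd concurrent management ops' lets more of them run in parallel.
print(rbd_delete_ops(40 * 2**30))  # 10240
```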

Josh


*From:* shiva rkreddy shiva.rkre...@gmail.com
*Sent:* Apr 14, 2015 5:53 AM
*To:* Josh Durgin
*Cc:* Ken Dreyer; Sage Weil; Ceph Development; ceph-us...@ceph.com
*Subject:* Re: v0.80.8 and librbd performance

Hi Josh,

We are using firefly 0.80.9 and see both cinder create/delete
numbers slow down compared 0.80.7.
I don't see any specific tuning requirements and our cluster is
run pretty much on default configuration.
Do you recommend any tuning or can you please suggest some log
signatures we need to be looking at?

Thanks
shiva

On Wed, Mar 4, 2015 at 1:53 PM, Josh Durgin jdur...@redhat.com wrote:

On 03/03/2015 03:28 PM, Ken Dreyer wrote:

On 03/03/2015 04:19 PM, Sage Weil wrote:

Hi,

This is just a heads up that we've identified a
performance regression in
v0.80.8 from previous firefly releases.  A v0.80.9
is working its way
through QA and should be out in a few days.  If you
haven't upgraded yet
you may want to wait.

Thanks!
sage


Hi Sage,

I've seen a couple Redmine tickets on this (eg
http://tracker.ceph.com/issues/9854 ,
http://tracker.ceph.com/issues/10956). It's not
totally clear to me
which of the 70+ unreleased commits on the firefly
branch fix this
librbd issue.  Is it only the three commits in
https://github.com/ceph/ceph/pull/3410 , or are there
more?


Those are the only ones needed to fix the librbd performance
regression, yes.

Josh



RE: Regarding newstore performance

2015-04-15 Thread Somnath Roy
Hi Sage/Mark,
I did some WA experiment with newstore with the similar settings I mentioned 
yesterday.

Test:
---

64K Random write with 64 QD and writing total of 1 TB of data.


Newstore:


Fio output at the end of 1 TB write.
---

rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, 
iodepth=64
fio-2.1.11-20-g9a44
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 iops] [eta 
00m:00s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 2015
  write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
 lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
clat percentiles (msec):
 |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
 | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
 | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
 | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
 | 99.99th=[ 1270]
bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, stdev=7320.03
lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
  cpu  : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, 
mint=21421419msec, maxt=21421419msec


So, the iops we are getting is ~764.
99th percentile latency should be < 100ms.

Write amplification at disk level:
--

SanDisk SSDs have disk-level counters that measure the number of host writes
and the number of actual flash writes, both in units of the flash logical page
size. The ratio between these two is the actual WA induced at the disk.

Please find the data in the following xls.

https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlxcX5TLMRzdXyJE/edit?usp=sharing

Total host writes in this period = 923896266

Total flash writes in this period = 1465339040


FileStore:
-

Fio output at the end of 1 TB write.
---

rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, 
iodepth=64
fio-2.1.11-20-g9a44
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 iops] [eta 
00m:01s]
rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 2015
  write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
 lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
clat percentiles (msec):
 |  1.00th=[7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
 | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
 | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
 | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 99.95th=[ 1090],
 | 99.99th=[ 1647]
bw (KB  /s): min=   32, max=243072, per=100.00%, avg=103148.37, 
stdev=63090.00
lat (usec) : 1000=0.01%
lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42%
lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14%
lat (msec) : 2000=0.06%, >=2000=0.01%
  cpu  : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0%
 issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, 
mint=10636117msec, maxt=10636117msec

Disk stats (read/write):
  sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00%

So, iops here is ~1500.
99th percentile latency should be within 50ms


Write amplification at disk level:
--

Total host writes in this period = 643611346

Total flash writes in this period = 1157304512



https://docs.google.com/spreadsheets/d/1gbIATBerS8COzSsJRMbkFXCSbLjn61Fz49CLH8WPh7Q/edit?pli=1#gid=95373000
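Pulling the two counter pairs together, the device-level write amplification for each run is just the flash/host ratio (plain arithmetic on the numbers quoted above):

```python
# Counters quoted above, both in flash-logical-page units.
newstore_host, newstore_flash = 923896266, 1465339040
filestore_host, filestore_flash = 643611346, 1157304512

wa_newstore = newstore_flash / newstore_host
wa_filestore = filestore_flash / filestore_host

print(round(wa_newstore, 2))   # 1.59 -> newstore's device-level WA
print(round(wa_filestore, 2))  # 1.8  -> filestore's, on ~30% fewer host writes
```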





Summary:

Re: [Ceph-maintainers] statically allocated uid/gid for ceph

2015-04-15 Thread Tim Serong
On 04/15/2015 01:21 AM, Sage Weil wrote:
 On Tue, 14 Apr 2015, Tim Serong wrote:
 On 04/14/2015 11:05 AM, Sage Weil wrote:
 Tim, Owen:

 Can we get a 'ceph' user/group uid/gid allocated for SUSE to get this 
 unstuck?  IMO the radosgw systemd stuff is blocked behind this too.

 I haven't yet been able to get a good answer to assignment of static
 UIDs and GIDs (I was told all the ones between 0-99 are taken already).

 But, if it's OK for the UID and GID numbers to potentially be different
 on different systems, adding a ceph user and ceph group is easy, we
 just add appropriate `groupadd -r ceph` and `useradd -r ceph`
 invocations to the rpm %pre script, which will give a UID/GID somewhere
 in the 100-499 range (see
 https://en.opensuse.org/openSUSE:Packaging_guidelines#Users_and_Groups
 for some notes on this).  We'd also want to update rpmlint to not whine
 about the ceph name.

 I originally thought the risk of non-static UID/GID numbers on different
 systems was terrible, but...
 
 I think we still want them to be static across a distro; it's the 
 cross-distro change that will be relatively rare.  So a fixed ID from each 
 distro family ought to be okay?

Optimally, yes, I too want at least a fixed ID per distro.  I'm
presently attempting to find out exactly how we (SUSE) do this in some
officially recognised way.

 I think a osd-prestart.sh snippet (or similar) that does a chown -R of any 
 non-root osd data to the local ceph user prior to starting the daemon will 
 handle the cross-distro changes without too much trouble.  I'd lean toward 
 not going from root -> ceph, though, and have the start script stay root 
 and not drop privs if the data is owned by root.. that covers upgrades 
 without interruption.

 What do you think?

 ...that sounds reasonable, and I think it would also handle the case
 where, say, you move an OSD from one SUSE host to another - if the
 UID/GID doesn't match (maybe some other `useradd`ing software was
 installed first on the other host), the chown will fix it anyway.

 Are there any holes in this?
 
 It would be nicer if the suse->suse case didn't require a chown, but yeah, 
 it'd still work just fine...

OK.  So at least that's a technically viable but undesirable fallback
position.

Tim
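The upgrade rule Sage and Tim converge on above can be sketched as a tiny decision function (illustrative only; the real hook would be a shell osd-prestart script doing the `chown -R` before the daemon drops privileges):

```python
def osd_run_user(data_dir_owner):
    """Decide which user the OSD should run as, given who currently owns
    its data directory (sketch of the upgrade-friendly rule above)."""
    if data_dir_owner == "root":
        # legacy data: stay root, don't chown, no interruption on upgrade
        return "root"
    # any non-root owner (including a mismatched uid from another host):
    # chown -R to the local ceph user, then run as ceph
    return "ceph"

print(osd_run_user("root"))  # root
print(osd_run_user("ceph"))  # ceph
```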

 
 sage
 
 

 Regards,

 Tim




 On Thu, 11 Dec 2014, Tim Serong wrote:

 On 12/11/2014 05:48 AM, Sage Weil wrote:
 +ceph-devel

 On Wed, 10 Dec 2014, Ken Dreyer wrote:
 On 12/06/2014 01:54 PM, Sage Weil wrote:
 Hi Colin, Boris, Owen,

 We would like to choose a statically allocated uid and gid for use by 
 Ceph 
 storage servers.  The basic goals are:

  - run daemons as non-root (right now everything is uid 0 (runtime and 
 on-disk data) and this is clearly not ideal)
  - enable hot swap of disks between storage servers
  - standardize across distros so that we can build clusters with a mix

 To support the hot swap, we can't use the usual uids allocated 
 dynamically 
 during package installation.  Disks will be completely filled with Ceph 
 data 
 files with the uid from one machine and will not be usable on another 
 machine.

 I'm hoping we can choose a static uid/gid pair that is unused for 
 Debian 
 (and Ubuntu), Fedora (and RHEL/CentOS), and OpenSUSE/SLES.  This will 
 let 
 us maintain consistency across the entire ecosystem.

 How many system users should I request from the Fedora Packaging
 Committee, and what should their names be?

 For example, are ceph-mon and ceph-osd going to run under the same
 non-privileged system account?

 Hmm, my first impulse was to make a single user and group.  But it might 
 make sense that e.g. rgw should run in a different context than ceph-osd 
 or ceph-mon.

 If we go down that road, then maybe

  ceph-osd
  ceph-mon
  ceph-mds
  ceph-rgw
  ceph-calamari

 and a 'ceph' group that we can use for /var/log/ceph etc for the qemu 
 and other librados users?

 Alternatively, if we just do user+group ceph, then rgw can run as 
 www-data 
 or apache (as it does now).  Not sure what makes the most sense for 
 ceph-calamari.

 FWIW my gut says go with a single ceph user+group and leave rgw running
 as the apache user.

 Calamari consists of a few pieces - the web-accessible bit runs as the
 apache user, then there's the cthulhu daemon, as well as carbon-cache
 for the graphite stuff.  These latter two I believe run as root (at
 least, they do with my SUSE packages which have systemd units for each
 of these services, and I assume they run as root on other distros where
 they're run under supervisord).  Now that I think of it though, I wonder
 if it makes sense to just run the whole lot as the apache user...?

 Regards,

 Tim
 -- 
 Tim Serong
 Senior Clustering Engineer
 SUSE
 tser...@suse.com





 -- 
 Tim Serong
 Senior Clustering Engineer
 SUSE
 

CephFS and the next giant release v0.87.2

2015-04-15 Thread Loic Dachary
Hi Greg,

The next giant release as found at https://github.com/ceph/ceph/tree/giant 
passed the fs suite (http://tracker.ceph.com/issues/11153#fs). Do you think it 
is ready for QE to start their own round of testing ?

Note that it will be the last giant release. 

Cheers

P.S. http://tracker.ceph.com/issues/11153#Release-information has direct links 
to the pull requests merged into giant since v0.87.1 in case you need more 
context about one of them.

-- 
Loïc Dachary, Artisan Logiciel Libre









signature.asc
Description: OpenPGP digital signature


radosgw and the next giant release v0.87.2

2015-04-15 Thread Loic Dachary
Hi Yehuda,

The next giant release as found at https://github.com/ceph/ceph/tree/giant 
passed the rgw suite (http://tracker.ceph.com/issues/11153#rgw). One run had a 
transient failure (http://tracker.ceph.com/issues/11259) that did not repeat. 
You will also find traces of failed run because of 
http://tracker.ceph.com/issues/11180 but that was resolved by backporting 
https://github.com/ceph/ceph-qa-suite/pull/375. 

Do you think it is ready for QE to start their own round of testing ?

Note that it will be the last giant release. 

Cheers

P.S. http://tracker.ceph.com/issues/11153#Release-information has direct links 
to the pull requests merged into giant since v0.87.1 in case you need more 
context about one of them.

-- 
Loïc Dachary, Artisan Logiciel Libre













rados and the next giant release v0.87.2

2015-04-15 Thread Loic Dachary
Hi Sam,

The next giant release as found at https://github.com/ceph/ceph/tree/giant 
passed the rados suite (http://tracker.ceph.com/issues/11153#rados). Do you 
think it is ready for QE to start their own round of testing ?

Note that it will be the last giant release. 

Cheers

P.S. http://tracker.ceph.com/issues/11153#Release-information has direct links 
to the pull requests merged into giant since v0.87.1 in case you need more 
context about one of them.

-- 
Loïc Dachary, Artisan Logiciel Libre













rbd and the next giant release v0.87.2

2015-04-15 Thread Loic Dachary
Hi Josh,

The next giant release as found at https://github.com/ceph/ceph/tree/giant 
passed the rbd suite (http://tracker.ceph.com/issues/11153#rbd). Do you think 
it is ready for QE to start their own round of testing ?

Note that it will be the last giant release. 

Cheers

P.S. http://tracker.ceph.com/issues/11153#Release-information has direct links 
to the pull requests merged into giant since v0.87.1 in case you need more 
context about one of them.

-- 
Loïc Dachary, Artisan Logiciel Libre













Re: Backporting to Firefly

2015-04-15 Thread Loic Dachary
Ping ?

On 08/04/2015 11:22, Loic Dachary wrote:
 Hi,
 
 I see you have been busy backporting issues to Firefly today; this is great 
 :-) 
 
 https://github.com/ceph/ceph/pulls/xinxinsh
 
 It would be helpful if you could update the pull requests (and the 
 corresponding issues) as explained at 
 http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_backport_commits. 
 
 Once it's done I propose we move to the next step, as explained at 
 http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO: merging your pull 
 requests in the integration branch ( step 5 
 http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_populate_the_integration_branch
  ) and running tests on them ( step 6 
 http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_run_integration_and_upgrade_tests).
 
 Cheers
 

-- 
Loïc Dachary, Artisan Logiciel Libre


