[ceph-users] Purpose of the s3gw.fcgi script?

2015-04-11 Thread Greg Meier
From my observation, the s3gw.fcgi script seems to be completely
superfluous in the operation of Ceph. With or without the script, swift
requests execute correctly, as long as a radosgw daemon is running.

Is there something I'm missing here?


[ceph-users] Auth URL not found when using object gateway

2015-03-24 Thread Greg Meier
Hi,

I'm having trouble setting up an object gateway on an existing cluster. The
cluster I'm trying to add the gateway to is running on a Precise 12.04
virtual machine.

The cluster is up and running, with a monitor, two OSDs, and a metadata
server. It returns HEALTH_OK and active+clean, so I am somewhat assured
that it is running correctly.

I've:
 - set up an apache2 webserver with the fastcgi mod installed
 - created an rgw.conf file
 - added an s3gw.fcgi script
 - enabled the rgw.conf site and disabled the default
 - created a keyring and gateway user with appropriate caps
 - restarted ceph, apache2, and the radosgw daemon
 - created a user and subuser
 - tested both s3 and swift calls

Unfortunately, both s3 and swift fail to authorize. An attempt to create a
new bucket with s3 using a python script returns:

Traceback (most recent call last):
  File "s3test.py", line 13, in 
bucket = conn.create_bucket('my-new-bucket')
  File "/usr/lib/python2.7/dist-packages/boto/s3/connection.py", line 422,
in create_bucket
response.status, response.reason, body)
boto.exception.S3ResponseError: S3ResponseError: 404 Not Found
None
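
(For reference, the failing script itself is not included above. A minimal boto
test script of the kind the radosgw docs describe looks roughly like the sketch
below; the keys, host and port are placeholders, and radosgw is assumed to be
reachable through Apache on port 80.)

import boto
import boto.s3.connection

access_key = 'ACCESS_KEY'   # placeholder -- use the keys from radosgw-admin user info
secret_key = 'SECRET_KEY'   # placeholder

conn = boto.connect_s3(
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
    host='localhost', port=80, is_secure=False,
    # path-style bucket addressing, since there is no wildcard DNS here
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
bucket = conn.create_bucket('my-new-bucket')
print(bucket.name)

(A 404 with an Apache-generated error body, as in the swift output below, is
consistent with the request never being handed off to radosgw at all.)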

And an attempt to post a container using the python-swiftclient from the
command line with command:

swift --debug --info -A http://localhost/auth/1.0 -U gatewayuser:swift -K
 post new_container

returns:

INFO:urllib3.connectionpool:Starting new HTTP connection (1): localhost
DEBUG:urllib3.connectionpool:"GET /auth/1.0 HTTP/1.1" 404 180
INFO:swiftclient:REQ: curl -i http://localhost/auth/1.0 -X GET
INFO:swiftclient:RESP STATUS: 404 Not Found
INFO:swiftclient:RESP HEADERS: [('content-length', '180'),
('content-encoding', 'gzip'), ('date', 'Tue, 24 Mar 2015 23:19:50 GMT'),
('content-type', 'text/html; charset=iso-8859-1'), ('vary',
'Accept-Encoding'), ('server', 'Apache/2.2.22 (Ubuntu)')]
INFO:swiftclient:RESP BODY: [gzip-compressed Apache 404 error page; binary body omitted]
ERROR:swiftclient:Auth GET failed: http://localhost/auth/1.0 404 Not Found
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 1181,
in _retry
self.url, self.token = self.get_auth()
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 1155,
in get_auth
insecure=self.insecure)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 318,
in get_auth
insecure=insecure)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 241,
in get_auth_1_0
http_reason=resp.reason)
ClientException: Auth GET failed: http://localhost/auth/1.0 404 Not Found
INFO:urllib3.connectionpool:Starting new HTTP connection (1): localhost
DEBUG:urllib3.connectionpool:"GET /auth/1.0 HTTP/1.1" 404 180
INFO:swiftclient:REQ: curl -i http://localhost/auth/1.0 -X GET
INFO:swiftclient:RESP STATUS: 404 Not Found
INFO:swiftclient:RESP HEADERS: [('content-length', '180'),
('content-encoding', 'gzip'), ('date', 'Tue, 24 Mar 2015 23:19:50 GMT'),
('content-type', 'text/html; charset=iso-8859-1'), ('vary',
'Accept-Encoding'), ('server', 'Apache/2.2.22 (Ubuntu)')]
INFO:swiftclient:RESP BODY: [gzip-compressed Apache 404 error page; binary body omitted]
ERROR:swiftclient:Auth GET failed: http://localhost/auth/1.0 404 Not Found
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 1181,
in _retry
self.url, self.token = self.get_auth()
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 1155,
in get_auth
insecure=self.insecure)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 318,
in get_auth
insecure=insecure)
  File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 241,
in get_auth_1_0
http_reason=resp.reason)
ClientException: Auth GET failed: http://localhost/auth/1.0 404 Not Found
Auth GET failed: http://localhost/auth/1.0 404 Not Found

I'm not at all sure why it doesn't work when I've followed the
documentation for setting it up. Please find attached the config files for
rgw.conf, ceph.conf, and apache2.conf.


apache2.conf
Description: Binary data


ceph.conf
Description: Binary data


rgw.conf
Description: Binary data


Re: [ceph-users] Monitor failure after series of traumatic network failures

2015-03-24 Thread Greg Chavez
This was excellent advice. It should be on some official Ceph
troubleshooting page. It takes a while for the monitors to deal with new
info, but it works.

Thanks again!
--Greg

On Wed, Mar 18, 2015 at 5:24 PM, Sage Weil  wrote:

> On Wed, 18 Mar 2015, Greg Chavez wrote:
> > We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network
> > availability several times since this past Thursday and whose nodes were
> all
> > rebooted twice (hastily and inadvisably each time). The final reboot,
> which
> > was supposed to be "the last thing" before recovery according to our data
> > center team, resulted in a failure of the cluster's 4 monitors. This
> > happened yesterday afternoon.
> >
> > [ By the way, we use Ceph to back Cinder and Glance in our OpenStack
> Cloud,
> > block storage only; also these network problems were the result of our
> data
> > center team executing maintenance on our switches that was supposed to be
> > quick and painless ]
> >
> > After working all day on various troubleshooting techniques found here
> and
> > there, we have this situation on our monitor nodes (debug 20):
> >
> >
> > node-10: dead. ceph-mon will not start
> >
> > node-14: Seemed to rebuild its monmap. The log has stopped reporting with
> > this final tail -100: http://pastebin.com/tLiq2ewV
> >
> > node-16: Same as 14, similar outcome in the
> > log: http://pastebin.com/W87eT7Mw
> >
> > node-15: ceph-mon starts but even at debug 20, it will only output this
> line,
> > over and over again:
> >
> >2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0)
> > AdminSocket: request 'mon_status' not defined
> >
> > node-02: I added this guy to replace node-10. I updated ceph.conf and
> pushed
> > it to all the monitor nodes (the osd nodes without monitors did not get
> the
> > config push). Since he's a new guy the log output is obviously different,
> but
> > again, here are the last 50 lines: http://pastebin.com/pfixdD3d
> >
> >
> > I run my ceph client from my OpenStack controller. All ceph -s shows me
> is
> > faults, albeit only to node-15
> >
> > 2015-03-18 16:47:27.145194 7ff762cff700  0 -- 192.168.241.100:0/15112 >>
> > 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0
> l=1).fault
> >
> >
> > Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S
> >
> > So that's where we stand. Did we kill our Ceph Cluster (and thus our
> > OpenStack Cloud)?
>
> Unlikely!  You have 5 copies, and I doubt all of them are unrecoverable.
>
> > Or is there hope? Any suggestions would be greatly
> > appreciated.
>
> Stop all mons.
>
> Make a backup copy of each mon data dir.
>
> Copy the node-14 data dir over the node-15 and/or node-10 and/or
> node-02.
>
> Start all mons, see if they form a quorum.
>
> Once things are working again, at the *very* least upgrade to dumpling,
> and preferably then upgrade to firefly!!  Cuttlefish was EOL more than a
> year ago, and dumpling is EOL in a couple months.
>
> sage
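
(A rough sketch of the copy-the-good-mon-store step Sage describes, assuming
the default /var/lib/ceph/mon/ceph-<id> data dir layout, that every ceph-mon is
stopped, and that node-14's data dir has already been copied to the host being
repaired, e.g. with rsync. Paths and mon IDs are placeholders.)

import os
import shutil
import tarfile
import time

GOOD_STORE = '/var/lib/ceph/mon/ceph-node-14'   # healthy mon's data dir, staged locally
BROKEN_IDS = ['node-15']                        # adjust to whichever mon lives on this host

def backup(path):
    # tar up a mon data dir before touching it
    dest = '%s.backup-%d.tar.gz' % (path, int(time.time()))
    with tarfile.open(dest, 'w:gz') as tar:
        tar.add(path, arcname=os.path.basename(path))
    return dest

for mon_id in BROKEN_IDS:
    target = '/var/lib/ceph/mon/ceph-%s' % mon_id
    if os.path.isdir(target):
        print('backed up %s to %s' % (target, backup(target)))
        shutil.rmtree(target)
    shutil.copytree(GOOD_STORE, target)
    print('replaced %s with a copy of %s' % (target, GOOD_STORE))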


[ceph-users] Monitor failure after series of traumatic network failures

2015-03-18 Thread Greg Chavez
We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network
availability several times since this past Thursday and whose nodes were
all rebooted twice (hastily and inadvisably each time). The final reboot,
which was supposed to be "the last thing" before recovery according to our
data center team, resulted in a failure of the cluster's 4 monitors. This
happened yesterday afternoon.

[ By the way, we use Ceph to back Cinder and Glance in our OpenStack Cloud,
block storage only; also these network problems were the result of our data
center team executing maintenance on our switches that was supposed to be
quick and painless ]

After working all day on various troubleshooting techniques found here and
there, we have this situation on our monitor nodes (debug 20):


node-10: dead. ceph-mon will not start

node-14: Seemed to rebuild its monmap. The log has stopped reporting with
this final tail -100: http://pastebin.com/tLiq2ewV

node-16: Same as 14, similar outcome in the log:
http://pastebin.com/W87eT7Mw

node-15: ceph-mon starts but even at debug 20, it will only output this
line, over and over again:

   2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0)
AdminSocket: request 'mon_status' not defined

node-02: I added this guy to replace node-10. I updated ceph.conf and
pushed it to all the monitor nodes (the osd nodes without monitors did not
get the config push). Since he's a new guy the log output is obviously
different, but again, here are the last 50 lines:
http://pastebin.com/pfixdD3d


I run my ceph client from my OpenStack controller. All ceph -s shows me is
faults, albeit only to node-15

2015-03-18 16:47:27.145194 7ff762cff700  0 -- 192.168.241.100:0/15112 >>
192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault


Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S

So that's where we stand. Did we kill our Ceph Cluster (and thus our
OpenStack Cloud)? Or is there hope? Any suggestions would be greatly
appreciated.


-- 
\*..+.-
--Greg Chavez
+//..;};


Re: [ceph-users] OSD turned itself off

2015-02-16 Thread Greg Farnum
4 7f6d5c155700  0 -- 10.168.7.23:6819/10217 
> submit_message osd_op_reply(949861 rbd_data.1c56a792eb141f2.6200 
> [stat,write 2228224~12288] ondisk = 0) v4 remote, 10.168.7.54:0/1025735, 
> failed lossy con, dropping message 0x1bc00400
>  -976> 2015-01-05 07:10:01.763055 7f6d5c155700  0 -- 10.168.7.23:6819/10217 
> submit_message osd_op_reply(11034565 
> rbd_data.1cc69562eb141f2.03ce [stat,write 1925120~4096] ondisk = 
> 0) v4 remote, 10.168.7.54:0/2007323, failed lossy con, dropping message 
> 0x12989400
>  -855> 2015-01-10 22:01:36.589036 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
> submit_message osd_op_reply(727627 rbd_data.1cc69413d1b58ba.0055 
> [stat,write 2289664~4096] ondisk = 0) v4 remote, 10.168.7.54:0/1007323, 
> failed lossy con, dropping message 0x24f68800
>  -819> 2015-01-12 05:25:06.229753 7f6d3646c700  0 -- 10.168.7.23:6819/10217 
> >> 10.168.7.53:0/2019809 pipe(0x1f0e9680 sd=460 :6819 s=0 pgs=0 cs=0 l=1 
> c=0x13090420).accept replacing existing (lossy) channel (new one lossy=1)
>  -818> 2015-01-12 05:25:06.581703 7f6d37534700  0 -- 10.168.7.23:6819/10217 
> >> 10.168.7.53:0/1025252 pipe(0x1b67a780 sd=71 :6819 s=0 pgs=0 cs=0 l=1 
> c=0x16311e40).accept replacing existing (lossy) channel (new one lossy=1)
>  -817> 2015-01-12 05:25:21.342998 7f6d41167700  0 -- 10.168.7.23:6819/10217 
> >> 10.168.7.53:0/1025579 pipe(0x114e8000 sd=502 :6819 s=0 pgs=0 cs=0 l=1 
> c=0x16310160).accept replacing existing (lossy) channel (new one lossy=1)
>  -808> 2015-01-12 16:01:35.783534 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
> submit_message osd_op_reply(752034 rbd_data.1cc69413d1b58ba.0055 
> [stat,write 2387968~8192] ondisk = 0) v4 remote, 10.168.7.54:0/1007323, 
> failed lossy con, dropping message 0x1fde9a00
>  -515> 2015-01-25 18:44:23.303855 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
> submit_message osd_op_reply(46402240 rbd_data.4b8e9b3d1b58ba.0471 
> [read 1310720~4096] ondisk = 0) v4 remote, 10.168.7.51:0/1017204, failed 
> lossy con, dropping message 0x250bce00
>  -303> 2015-02-02 22:30:03.140599 7f6d5c155700  0 -- 10.168.7.23:6819/10217 
> submit_message osd_op_reply(17710313 
> rbd_data.1cc69562eb141f2.03ce [stat,write 4145152~4096] ondisk = 
> 0) v4 remote, 10.168.7.54:0/2007323, failed lossy con, dropping message 
> 0x1c5d4200
>  -236> 2015-02-05 15:29:04.945660 7f6d3d357700  0 -- 10.168.7.23:6819/10217 
> >> 10.168.7.51:0/1026961 pipe(0x1c63e780 sd=203 :6819 s=0 pgs=0 cs=0 l=1 
> c=0x11dc8dc0).accept replacing existing (lossy) channel (new one lossy=1)
>   -66> 2015-02-10 20:20:36.673969 7f6d5b954700  0 -- 10.168.7.23:6819/10217 
> submit_message osd_op_reply(11088 rbd_data.10b8c82eb141f2.4459 
> [stat,write 749568~8192] ondisk = 0) v4 remote, 10.168.7.55:0/1005630, failed 
> lossy con, dropping message 0x138db200
> 
> Could this have lead to the data being erroneous, or is the -5 return code 
> just a sign of a broken hard drive?
> 

These are the OSDs creating new connections to each other because the previous 
ones failed. That's not necessarily a problem (although here it's probably a 
symptom of some kind of issue, given the frequency) and cannot introduce data 
corruption of any kind.
I’m not seeing any -5 return codes as part of that messenger debug output, so 
unless you were referring to your EIO from last June I’m not sure what that’s 
about? (If you do mean EIOs, yes, they’re still a sign of a broken hard drive 
or local FS.)

> Cheers,
> Josef
> 
>> On 14 Jun 2014, at 02:38, Josef Johansson  wrote:
>> 
>> Thanks for the quick response.
>> 
>> Cheers,
>> Josef
>> 
>> Gregory Farnum wrote on 2014-06-14 02:36:
>>> On Fri, Jun 13, 2014 at 5:25 PM, Josef Johansson  wrote:
>>>> Hi Greg,
>>>> 
>>>> Thanks for the clarification. I believe the OSD was in the middle of a deep
>>>> scrub (sorry for not mentioning this straight away), so it could've
>>>> been a silent error that surfaced during the scrub?
>>> Yeah.
>>> 
>>>> What's best practice when the store is corrupted like this?
>>> Remove the OSD from the cluster, and either reformat the disk or
>>> replace as you judge appropriate.
>>> -Greg
>>> 
>>>> Cheers,
>>>> Josef
>>>> 
>>>> Gregory Farnum wrote on 2014-06-14 02:21:
>>>> 
>>>>> The OSD did a read off of the local filesystem and it got back the EIO
>>>>> error code. That means the store got corrupted or something, so it
>>>>> killed itself to avoid spreading bad data to the rest of the cluster.
>>>

Re: [ceph-users] Poor performance on all SSD cluster

2014-06-23 Thread Greg Poirier
On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson 
wrote:

> RBD Cache is definitely going to help in this use case.  This test is
> basically just sequentially writing a single 16k chunk of data out, one at
> a time.  IE, entirely latency bound.  At least on OSDs backed by XFS, you
> have to wait for that data to hit the journals of every OSD associated with
> the object before the acknowledgement gets sent back to the client.
>

Again, I can reproduce this with replication disabled.


>  If you are using the default 4MB block size, you'll hit the same OSDs
> over and over again and your other OSDs will sit there twiddling their
> thumbs waiting for IO until you hit the next block, but then it will just
> be a different set of OSDs getting hit.  You should be able to verify this by
> using iostat or collectl or something to look at the behaviour of the SSDs
> during the test.  Since this is all sequential though, switching to
>  buffered IO (ie coalesce IOs at the buffercache layer) or using RBD cache
> for direct IO (coalesce IOs below the block device) will dramatically
> improve things.
>

This makes sense.

Given the following scenario:

- No replication
- osd_op time average is .015 seconds (stddev ~.003 seconds)
- Network latency is approximately .000237 seconds on avg

I should be getting 60 IOPS from the OSD reporting this time, right?

So 60 * 16kB = 960kB.  That's slightly slower than we're getting because
I'm only able to sample the slowest ops. We're getting closer to 100 IOPS.
But that does make sense, I suppose.
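
(For reference, the back-of-the-envelope arithmetic being used here, with the
sampled figures from above:)

# latency-bound throughput of a single synchronous 16k stream: each write must
# complete before the next one is issued
op_latency = 0.015        # average osd_op time in seconds (sampled above)
net_rtt = 0.000237        # client <-> OSD network round trip in seconds
block_size = 16 * 1024    # dd block size in bytes

iops = 1.0 / (op_latency + net_rtt)
print('~%.0f IOPS -> ~%.0f kB/s' % (iops, iops * block_size / 1024.0))
# prints roughly 66 IOPS -> ~1050 kB/s, the same order as the dd results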

So the only way to improve performance would be to not use O_DIRECT (as
this should bypass rbd cache as well, right?).


> Ceph is pretty good at small random IO with lots of parallelism on
> spinning disk backed OSDs (So long as you use SSD journals or controllers
> with WB cache).  It's much harder to get native-level IOPS rates with SSD
> backed OSDs though.  The latency involved in distributing and processing
> all of that data becomes a much bigger deal.  Having said that, we are
> actively working on improving latency as much as we can. :)


And this is true because flushing from the journal to spinning disks is
going to coalesce the writes into the appropriate blocks in a meaningful
way, right? Or I guess... Why is this?

Why doesn't that happen with SSD journals and SSD OSDs?


Re: [ceph-users] Poor performance on all SSD cluster

2014-06-23 Thread Greg Poirier
10 OSDs per node
12 physical cores hyperthreaded (24 logical cores exposed to OS)
64GB RAM

Negligible load

iostat shows the disks are largely idle except for bursty writes
occasionally.

Results of fio from one of the SSDs in the cluster:

fiojob: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=128
fio-2.1.3
Starting 1 process
fiojob: Laying out IO file(s) (1 file(s) / 400MB)
Jobs: 1 (f=1): [w] [-.-% done] [0KB/155.5MB/0KB /s] [0/39.8K/0 iops] [eta
00m:00s]
fiojob: (groupid=0, jobs=1): err= 0: pid=21845: Mon Jun 23 13:23:47 2014
  write: io=409600KB, bw=157599KB/s, iops=39399, runt=  2599msec
slat (usec): min=6, max=2149, avg=22.13, stdev=23.08
clat (usec): min=70, max=10700, avg=3220.76, stdev=521.44
 lat (usec): min=90, max=10722, avg=3243.13, stdev=523.70
clat percentiles (usec):
 |  1.00th=[ 2736],  5.00th=[ 2864], 10.00th=[ 2896], 20.00th=[ 2928],
 | 30.00th=[ 2960], 40.00th=[ 3024], 50.00th=[ 3056], 60.00th=[ 3184],
 | 70.00th=[ 3344], 80.00th=[ 3440], 90.00th=[ 3504], 95.00th=[ 3632],
 | 99.00th=[ 5856], 99.50th=[ 6240], 99.90th=[ 7136], 99.95th=[ 7584],
 | 99.99th=[ 8160]
bw (KB  /s): min=139480, max=173320, per=99.99%, avg=157577.60,
stdev=16122.77
lat (usec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.08%, 4=95.89%, 10=3.98%, 20=0.01%
  cpu  : usr=14.05%, sys=46.73%, ctx=72243, majf=0, minf=186
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>=64=99.9%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.1%
 issued: total=r=0/w=102400/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=409600KB, aggrb=157599KB/s, minb=157599KB/s, maxb=157599KB/s,
mint=2599msec, maxt=2599msec

Disk stats (read/write):
  sda: ios=0/95026, merge=0/0, ticks=0/3016, in_queue=2972, util=82.27%

All of the disks are identical.

The same fio from the host with the RBD volume mounted:

fiojob: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
iodepth=128
fio-2.1.3
Starting 1 process
fiojob: Laying out IO file(s) (1 file(s) / 400MB)
Jobs: 1 (f=1): [w] [100.0% done] [0KB/5384KB/0KB /s] [0/1346/0 iops] [eta
00m:00s]
fiojob: (groupid=0, jobs=1): err= 0: pid=30070: Mon Jun 23 13:25:50 2014
  write: io=409600KB, bw=9264.3KB/s, iops=2316, runt= 44213msec
slat (usec): min=17, max=154210, avg=84.83, stdev=535.40
clat (msec): min=10, max=1294, avg=55.17, stdev=103.43
 lat (msec): min=10, max=1295, avg=55.25, stdev=103.43
clat percentiles (msec):
 |  1.00th=[   17],  5.00th=[   21], 10.00th=[   24], 20.00th=[   28],
 | 30.00th=[   31], 40.00th=[   34], 50.00th=[   37], 60.00th=[   40],
 | 70.00th=[   44], 80.00th=[   50], 90.00th=[   63], 95.00th=[  103],
 | 99.00th=[  725], 99.50th=[  906], 99.90th=[ 1106], 99.95th=[ 1172],
 | 99.99th=[ 1237]
bw (KB  /s): min= 3857, max=12416, per=100.00%, avg=9280.09,
stdev=1233.63
lat (msec) : 20=3.76%, 50=76.60%, 100=14.45%, 250=2.98%, 500=0.72%
lat (msec) : 750=0.56%, 1000=0.66%, 2000=0.27%
  cpu  : usr=3.50%, sys=19.31%, ctx=131358, majf=0, minf=986
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>=64=99.9%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.1%
 issued: total=r=0/w=102400/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=409600KB, aggrb=9264KB/s, minb=9264KB/s, maxb=9264KB/s,
mint=44213msec, maxt=44213msec

Disk stats (read/write):
  rbd2: ios=0/102499, merge=0/1818, ticks=0/5593828, in_queue=5599520,
util=99.85%


On Sun, Jun 22, 2014 at 6:42 PM, Christian Balzer  wrote:

> On Sun, 22 Jun 2014 12:14:38 -0700 Greg Poirier wrote:
>
> > We actually do have a use pattern of large batch sequential writes, and
> > this dd is pretty similar to that use case.
> >
> > A round-trip write with replication takes approximately 10-15ms to
> > complete. I've been looking at dump_historic_ops on a number of OSDs and
> > getting mean, min, and max for sub_op and ops. If these were on the order
> > of 1-2 seconds, I could understand this throughput... But we're talking
> > about fairly fast SSDs and a 20Gbps network with <1ms latency for TCP
> > round-trip between the client machine and all of the OSD hosts.
> >
> > I've gone so far as disabling replication entirely (which had almost no
> > impact) and putting journals on separate SSDs from the data disks (which
> > are ALSO SSDs).
> >
> > This just doesn't make sense to me.
> >
> A lot of this sounds like my "Slow IOPS on RBD compared to journal and
> backing devices" thread a few weeks ago.
> Though those results are even 

Re: [ceph-users] Poor performance on all SSD cluster

2014-06-22 Thread Greg Poirier
How does RBD cache work? I wasn't able to find an adequate explanation in
the docs.

On Sunday, June 22, 2014, Mark Kirkwood 
wrote:

> Good point, I had neglected to do that.
>
> So, amending my ceph.conf [1]:
>
> [client]
> rbd cache = true
> rbd cache size = 2147483648
> rbd cache max dirty = 1073741824
> rbd cache max dirty age = 100
>
> and also the VM's xml def to include cache to writeback:
>
> [libvirt <disk> definition with cache='writeback'; the XML markup was lost in the archive]
>
> Retesting from inside the VM:
>
> $ dd if=/dev/zero of=/mnt/vol1/scratch/file bs=16k count=65535 oflag=direct
> 65535+0 records in
> 65535+0 records out
> 1073725440 bytes (1.1 GB) copied, 8.1686 s, 131 MB/s
>
> Which is much better, so certainly for the librbd case enabling the rbd
> cache seems to nail this particular issue.
>
> Regards
>
> Mark
>
> [1] possibly somewhat aggressively set, but at least a noticeable
> difference :-)
>
> On 22/06/14 19:02, Haomai Wang wrote:
>
>> Hi Mark,
>>
>> Do you enable rbdcache? I tested on my ssd cluster (only one ssd), it seemed
>> ok.
>>
>>  dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
>>>
>>
>> 82.3MB/s
>>
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
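
(For completeness: the [client] settings quoted above can also be handed
straight to librbd when a client opens an image. A minimal python-rbd sketch,
with a placeholder pool name 'rbd' and image name 'test', and deliberately
small cache values for illustration. This path only applies to librbd clients
such as qemu; the kernel RBD module does not use the rbd cache.)

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                      conf={'rbd_cache': 'true',
                            'rbd_cache_size': str(64 * 1024 * 1024),
                            'rbd_cache_max_dirty': str(32 * 1024 * 1024)})
cluster.connect()
ioctx = cluster.open_ioctx('rbd')        # placeholder pool name
try:
    image = rbd.Image(ioctx, 'test')     # placeholder, image must already exist
    # small writes land in the client-side cache and are flushed later,
    # which is what coalesces the 16k sequential writes discussed here
    image.write(b'x' * 16384, 0)
    image.close()                        # closing the image flushes the cache
finally:
    ioctx.close()
    cluster.shutdown()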


Re: [ceph-users] Poor performance on all SSD cluster

2014-06-22 Thread Greg Poirier
We actually do have a use pattern of large batch sequential writes, and
this dd is pretty similar to that use case.

A round-trip write with replication takes approximately 10-15ms to
complete. I've been looking at dump_historic_ops on a number of OSDs and
getting mean, min, and max for sub_op and ops. If these were on the order
of 1-2 seconds, I could understand this throughput... But we're talking
about fairly fast SSDs and a 20Gbps network with <1ms latency for TCP
round-trip between the client machine and all of the OSD hosts.

I've gone so far as disabling replication entirely (which had almost no
impact) and putting journals on separate SSDs from the data disks (which are
ALSO SSDs).

This just doesn't make sense to me.


On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson 
wrote:

> On 06/22/2014 02:02 AM, Haomai Wang wrote:
>
>> Hi Mark,
>>
>> Do you enable rbdcache? I tested on my ssd cluster (only one ssd), it seemed
>> ok.
>>
>>  dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
>>>
>>
>> 82.3MB/s
>>
>
> RBD Cache is definitely going to help in this use case.  This test is
> basically just sequentially writing a single 16k chunk of data out, one at
> a time.  IE, entirely latency bound.  At least on OSDs backed by XFS, you
> have to wait for that data to hit the journals of every OSD associated with
> the object before the acknowledgement gets sent back to the client.  If you
> are using the default 4MB block size, you'll hit the same OSDs over and
> over again and your other OSDs will sit there twiddling their thumbs
> waiting for IO until you hit the next block, but then it will just be a
> different set of OSDs getting hit.  You should be able to verify this by using
> iostat or collectl or something to look at the behaviour of the SSDs during
> the test.  Since this is all sequential though, switching to  buffered IO
> (ie coalesce IOs at the buffercache layer) or using RBD cache for direct IO
> (coalesce IOs below the block device) will dramatically improve things.
>
> The real question here though, is whether or not a synchronous stream of
> sequential 16k writes is even remotely close to the IO patterns that would
> be seen in actual use for MySQL.  Most likely in actual use you'll be
> seeing a big mix of random reads and writes, and hopefully at least some
> parallelism (though this depends on the number of databases, number of
> users, and the workload!).
>
> Ceph is pretty good at small random IO with lots of parallelism on
> spinning disk backed OSDs (So long as you use SSD journals or controllers
> with WB cache).  It's much harder to get native-level IOPS rates with SSD
> backed OSDs though.  The latency involved in distributing and processing
> all of that data becomes a much bigger deal.  Having said that, we are
> actively working on improving latency as much as we can. :)
>
> Mark
>
>
>
>>
>> On Sun, Jun 22, 2014 at 11:50 AM, Mark Kirkwood
>>  wrote:
>>
>>> On 22/06/14 14:09, Mark Kirkwood wrote:
>>>
>>> Upgrading the VM to 14.04 and retesting the case *without* direct I get:
>>>
>>> - 164 MB/s (librbd)
>>> - 115 MB/s (kernel 3.13)
>>>
>>> So managing to almost get native performance out of the librbd case. I
>>> tweaked both filestore max and min sync intervals (100 and 10 resp) just
>>> to
>>> see if I could actually avoid writing to the spinners while the test was
>>> in
>>> progress (still seeing some, but clearly fewer).
>>>
>>> However no improvement at all *with* direct enabled. The output of
>>> iostat on
>>> the host while the direct test is in progress is interesting:
>>>
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>11.730.005.040.760.00   82.47
>>>
>>> Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
>>> avgrq-sz
>>> avgqu-sz   await r_await w_await  svctm  %util
>>> sda   0.00 0.000.00   11.00 0.00 4.02 749.09
>>> 0.14   12.360.00   12.36   6.55   7.20
>>> sdb   0.00 0.000.00   11.00 0.00 4.02 749.09
>>> 0.14   12.360.00   12.36   5.82   6.40
>>> sdc   0.00 0.000.00  435.00 0.00 4.29 20.21
>>> 0.531.210.001.21   1.21  52.80
>>> sdd   0.00 0.000.00  435.00 0.00 4.29 20.21
>>> 0.521.200.001.20   1.20  52.40
>>>
>>> (sda,b are the spinners sdc,d the ssds). Something is making the journal
>>> work very hard for its 4.29 MB/s!
>>>
>>> regards
>>>
>>> Mark
>>>
>>>
>>>  Leaving
 off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11
 [2]). The ssd's can do writes at about 180 MB/s each... which is
 something to look at another day[1].

>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.

Re: [ceph-users] Poor performance on all SSD cluster

2014-06-22 Thread Greg Poirier
I'm using Crucial M500s.


On Sat, Jun 21, 2014 at 7:09 PM, Mark Kirkwood <
mark.kirkw...@catalyst.net.nz> wrote:

> I can reproduce this in:
>
> ceph version 0.81-423-g1fb4574
>
> on Ubuntu 14.04. I have a two osd cluster with data on two sata spinners
> (WD blacks) and journals on two ssd (Crucial m4's). I'm getting about 3.5
> MB/s (kernel and librbd) using your dd command with direct on. Leaving off
> direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11 [2]).
> The ssd's can do writes at about 180 MB/s each... which is something to
> look at another day[1].
>
> It would be interesting to know what version of Ceph Tyler is using, as his
> setup seems not nearly impacted by adding direct. Also it might be useful
> to know what make and model of ssd you both are using (some of 'em do not
> like a series of essentially sync writes). Having said that testing my
> Crucial m4's shows they can do the dd command (with direct *on*) at about
> 180 MB/s...hmmm...so it *is* the Ceph layer it seems.
>
> Regards
>
> Mark
>
> [1] I set filestore_max_sync_interval = 100 (30G journal...ssd able to do
> 180 MB/s etc), however I am still seeing writes to the spinners during the
> 8s or so that the above dd tests take).
> [2] Ubuntu 13.10 VM - I'll upgrade it to 14.04 and see if that helps at
> all.
>
>
> On 21/06/14 09:17, Greg Poirier wrote:
>
>> Thanks Tyler. So, I'm not totally crazy. There is something weird going
>> on.
>>
>> I've looked into things about as much as I can:
>>
>> - We have tested with collocated journals and dedicated journal disks.
>> - We have bonded 10Gb nics and have verified network configuration and
>> connectivity is sound
>> - We have run dd independently on the SSDs in the cluster and they are
>> performing fine
>> - We have tested both in a VM and with the RBD kernel module and get
>> identical performance
>> - We have pool size = 3, pool min size = 2 and have tested with min size
>> of 2 and 3 -- the performance impact is not bad
>> - osd_op times are approximately 6-12ms
>> - osd_sub_op times are 6-12 ms
>> - iostat reports service time of 6-12ms
>> - Latency between the storage and rbd client is approximately .1-.2ms
>> - Disabling replication entirely did not help significantly
>>
>>
>>
>>
>> On Fri, Jun 20, 2014 at 2:13 PM, Tyler Wilson > <mailto:k...@linuxdigital.net>> wrote:
>>
>> Greg,
>>
>> Not a real fix for you but I too run a full-ssd cluster and am able
>> to get 112MB/s with your command;
>>
>> [root@plesk-test ~]# dd if=/dev/zero of=testfilasde bs=16k
>> count=65535 oflag=direct
>> 65535+0 records in
>> 65535+0 records out
>> 1073725440 bytes (1.1 GB) copied, 9.59092 s, 112 MB/s
>>
>> This of course is in a VM, here is my ceph config
>>
>> [global]
>> fsid = 
>> mon_initial_members = node-1 node-2 node-3
>> mon_host = 192.168.0.3 192.168.0.4 192.168.0.5
>>     auth_supported = cephx
>> osd_journal_size = 2048
>> filestore_xattr_use_omap = true
>> osd_pool_default_size = 2
>> osd_pool_default_min_size = 1
>> osd_pool_default_pg_num = 1024
>> public_network = 192.168.0.0/24 <http://192.168.0.0/24>
>> osd_mkfs_type = xfs
>> cluster_network = 192.168.1.0/24 <http://192.168.1.0/24>
>>
>>
>>
>>
>> On Fri, Jun 20, 2014 at 11:08 AM, Greg Poirier
>> mailto:greg.poir...@opower.com>> wrote:
>>
>> I recently created a 9-node Firefly cluster backed by all SSDs.
>> We have had some pretty severe performance degradation when
>> using O_DIRECT in our tests (as this is how MySQL will be
>> interacting with RBD volumes, this makes the most sense for a
>> preliminary test). Running the following test:
>>
>> dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct
>>
>> 779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
>>
>> Shows us only about 1.5 MB/s throughput and 100 IOPS from the
>> single dd thread. Running a second dd process does show
>> increased throughput which is encouraging, but I am still
>> concerned by the low throughput of a single thread w/ O_DIRECT.
>>
>> Two threads:
>> 779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
>> 126271488 bytes (126 MB) copied, 99.2069 s, 1.3 MB/s
>>
>> I am testing with an RBD volume m

Re: [ceph-users] Poor performance on all SSD cluster

2014-06-20 Thread Greg Poirier
Thanks Tyler. So, I'm not totally crazy. There is something weird going on.

I've looked into things about as much as I can:

- We have tested with collocated journals and dedicated journal disks.
- We have bonded 10Gb nics and have verified network configuration and
connectivity is sound
- We have run dd independently on the SSDs in the cluster and they are
performing fine
- We have tested both in a VM and with the RBD kernel module and get
identical performance
- We have pool size = 3, pool min size = 2 and have tested with min size of
2 and 3 -- the performance impact is not bad
- osd_op times are approximately 6-12ms
- osd_sub_op times are 6-12 ms
- iostat reports service time of 6-12ms
- Latency between the storage and rbd client is approximately .1-.2ms
- Disabling replication entirely did not help significantly




On Fri, Jun 20, 2014 at 2:13 PM, Tyler Wilson  wrote:

> Greg,
>
> Not a real fix for you but I too run a full-ssd cluster and am able to get
> 112MB/s with your command;
>
> [root@plesk-test ~]# dd if=/dev/zero of=testfilasde bs=16k count=65535
> oflag=direct
> 65535+0 records in
> 65535+0 records out
> 1073725440 bytes (1.1 GB) copied, 9.59092 s, 112 MB/s
>
> This of course is in a VM, here is my ceph config
>
> [global]
> fsid = 
> mon_initial_members = node-1 node-2 node-3
> mon_host = 192.168.0.3 192.168.0.4 192.168.0.5
> auth_supported = cephx
> osd_journal_size = 2048
> filestore_xattr_use_omap = true
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 1024
> public_network = 192.168.0.0/24
> osd_mkfs_type = xfs
> cluster_network = 192.168.1.0/24
>
>
>
> On Fri, Jun 20, 2014 at 11:08 AM, Greg Poirier 
> wrote:
>
>> I recently created a 9-node Firefly cluster backed by all SSDs. We have
>> had some pretty severe performance degradation when using O_DIRECT in our
>> tests (as this is how MySQL will be interacting with RBD volumes, this
>> makes the most sense for a preliminary test). Running the following test:
>>
>> dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct
>>
>> 779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
>>
>> Shows us only about 1.5 MB/s throughput and 100 IOPS from the single dd
>> thread. Running a second dd process does show increased throughput which is
>> encouraging, but I am still concerned by the low throughput of a single
>> thread w/ O_DIRECT.
>>
>> Two threads:
>> 779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
>> 126271488 bytes (126 MB) copied, 99.2069 s, 1.3 MB/s
>>
>> I am testing with an RBD volume mounted with the kernel module (I have
>> also tested from within KVM, similar performance).
>>
>> If we allow caching, we start to see reasonable numbers from a single dd
>> process:
>>
>> dd if=/dev/zero of=testfilasde bs=16k count=65535
>> 65535+0 records in
>> 65535+0 records out
>> 1073725440 bytes (1.1 GB) copied, 2.05356 s, 523 MB/s
>>
>> I can get >1GB/s from a single host with three threads.
>>
>> Rados bench produces similar results.
>>
>> Is there something I can do to increase the performance of O_DIRECT? I
>> expect performance degradation, but so much?
>>
>> If I increase the blocksize to 4M, I'm able to get significantly higher
>> throughput:
>>
>> 3833593856 bytes (3.8 GB) copied, 44.2964 s, 86.5 MB/s
>>
>> This still seems very low.
>>
>> I'm using the deadline scheduler in all places. With noop scheduler, I do
>> not see a performance improvement.
>>
>> Suggestions?
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>


[ceph-users] Poor performance on all SSD cluster

2014-06-20 Thread Greg Poirier
I recently created a 9-node Firefly cluster backed by all SSDs. We have had
some pretty severe performance degradation when using O_DIRECT in our tests
(as this is how MySQL will be interacting with RBD volumes, this makes the
most sense for a preliminary test). Running the following test:

dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct

779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s

Shows us only about 1.5 MB/s throughput and 100 IOPS from the single dd
thread. Running a second dd process does show increased throughput which is
encouraging, but I am still concerned by the low throughput of a single
thread w/ O_DIRECT.

Two threads:
779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
126271488 bytes (126 MB) copied, 99.2069 s, 1.3 MB/s

I am testing with an RBD volume mounted with the kernel module (I have also
tested from within KVM, similar performance).

If we allow caching, we start to see reasonable numbers from a single dd
process:

dd if=/dev/zero of=testfilasde bs=16k count=65535
65535+0 records in
65535+0 records out
1073725440 bytes (1.1 GB) copied, 2.05356 s, 523 MB/s

I can get >1GB/s from a single host with three threads.

Rados bench produces similar results.

Is there something I can do to increase the performance of O_DIRECT? I
expect performance degradation, but so much?

If I increase the blocksize to 4M, I'm able to get significantly higher
throughput:

3833593856 bytes (3.8 GB) copied, 44.2964 s, 86.5 MB/s

This still seems very low.

I'm using the deadline scheduler in all places. With noop scheduler, I do
not see a performance improvement.

Suggestions?


Re: [ceph-users] Backfill and Recovery traffic shaping

2014-04-19 Thread Greg Poirier
On Saturday, April 19, 2014, Mike Dawson  wrote:
>
>
> With a workload consisting of lots of small writes, I've seen client IO
> starved with as little as 5Mbps of traffic per host due to spindle
> contention once deep-scrub and/or recovery/backfill start. Co-locating OSD
> Journals on the same spinners as you have will double that likelihood.
>

Yeah. We're working on addressing the collocation issues.


> Possible solutions include moving OSD Journals to SSD (with a reasonable
> ratio), expanding the cluster, or increasing the performance of underlying
> storage.
>
>
We are considering an all SSD cluster. If I'm not mistaken, at that point
journal collocation isn't as much of an issue since iops/seek time stop
being an issue.


[ceph-users] Backfill and Recovery traffic shaping

2014-04-19 Thread Greg Poirier
We have a cluster in a sub-optimal configuration with data and journal
colocated on OSDs (that coincidentally are spinning disks).

During recovery/backfill, the entire cluster suffers degraded performance
because of the IO storm that backfills cause. Client IO becomes extremely
latent. I've tried to decrease the impact that recovery/backfill has with
the following:

ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'

The only other option I have left would be to use linux traffic shaping to
artificially reduce the bandwidth available to the interfaced tagged for
cluster traffic (instead of separate physical networks, we use VLAN
tagging). We are nowhere _near_ the point where network saturation would
cause the latency we're seeing, so I am left to believe that it is simply
disk IO saturation.

I could be wrong about this assumption, though, as iostat doesn't terrify
me. This could be suboptimal network configuration on the cluster as well.
I'm still looking into that possibility, but I wanted to get feedback on
what I'd done already first--as well as the proposed traffic shaping idea.

Thoughts?
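
(As a side note on the injectargs above: the running values can be read back
from each OSD's admin socket to confirm they actually took effect, since
injectargs only changes the in-memory value. A small sketch, assuming the ceph
CLI is available on the OSD host and using osd.0 as an example:)

import json
import subprocess

# underscore forms of (most of) the options injected above
SETTINGS = ['osd_max_backfills', 'osd_recovery_op_priority',
            'osd_client_op_priority', 'osd_recovery_max_active']

def running_value(osd_id, option):
    # 'ceph daemon' talks to the local admin socket only, so run this on the
    # host that owns the OSD
    out = subprocess.check_output(
        ['ceph', 'daemon', 'osd.%d' % osd_id, 'config', 'get', option])
    return json.loads(out)[option]

for opt in SETTINGS:
    print('%s = %s' % (opt, running_value(0, opt)))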


Re: [ceph-users] Useful visualizations / metrics

2014-04-13 Thread Greg Poirier
This is fantastic stuff. Thanks!


On Sun, Apr 13, 2014 at 10:17 AM, Dan Van Der Ster <
daniel.vanders...@cern.ch> wrote:

>  For our cluster we monitor write latency by running a short (10s) rados
> bench with one thread writing 64kB objects, every 5 minutes or so. rados
> bench tells you the min, max, and average of those writes -- we plot them
> all. An example is attached.
>
>  The latency and other metrics that we plot (including iops) are here in
> this sensor:
> https://github.com/cernceph/ceph-scripts/blob/master/cern-sls/ceph-sls.py
>   Unfortunately it is not directly usable by others since it has been
> written for our local monitoring system.
>
>  Cheers, Dan
>
>
>
>
>
>
>  --
> *From:* ceph-users-boun...@lists.ceph.com [
> ceph-users-boun...@lists.ceph.com] on behalf of Jason Villalta [
> ja...@rubixnet.com]
> *Sent:* 12 April 2014 16:41
> *To:* Greg Poirier
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Useful visualizations / metrics
>
>   I know ceph throws some warnings if there is high write latency.  But I
> would be most interested in the delay for io requests, linking directly to
> iops.  If iops start to drop because the disk are overwhelmed then latency
> for requests would be increasing.  This would tell me that I need to add
> more OSDs/Nodes.  I am not sure there is a specific metric in ceph for this
> but it would be awesome if there was.
>
>
> On Sat, Apr 12, 2014 at 10:37 AM, Greg Poirier wrote:
>
>> Curious as to how you define cluster latency.
>>
>>
>> On Sat, Apr 12, 2014 at 7:21 AM, Jason Villalta wrote:
>>
>>> Hi, I have not done anything with metrics yet, but the only ones I
>>> personally would be interested in is total capacity utilization and cluster
>>> latency.
>>>
>>>  Just my 2 cents.
>>>
>>>
>>>  On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier >> > wrote:
>>>
>>>>  I'm in the process of building a dashboard for our Ceph nodes. I was
>>>> wondering if anyone out there had instrumented their OSD / MON clusters and
>>>> found particularly useful visualizations.
>>>>
>>>>  At first, I was trying to do ridiculous things (like graphing % used
>>>> for every disk in every OSD host), but I realized quickly that that is
>>>> simply too many metrics and far too visually dense to be useful. I am
>>>> attempting to put together a few simpler, more dense visualizations like...
>>>> overall cluster utilization, aggregate cpu and memory utilization per osd
>>>> host, etc.
>>>>
>>>>  Just looking for some suggestions.  Thanks!
>>>>
>>>>  ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>
>>>
>>>  --
>>> --
>>> *Jason Villalta*
>>> Co-founder
>>> 800.799.4407x1230 | www.RubixTechnology.com
>>>
>>
>>
>
>
>  --
> --
> *Jason Villalta*
> Co-founder
> 800.799.4407x1230 | www.RubixTechnology.com
>


Re: [ceph-users] Useful visualizations / metrics

2014-04-12 Thread Greg Poirier
We are collecting system metrics through sysstat every minute and getting
those to OpenTSDB via Sensu. We have a plethora of metrics, but I am
finding it difficult to create meaningful visualizations. We have alerting
for things like individual OSDs reaching capacity thresholds, memory spikes
on OSD or MON hosts. I am just trying to come up with some visualizations
that could become solid indicators that something is wrong with the cluster
in general, or with a particular host (besides CPU or memory utilization).

This morning, I have thought of things like:

- Stddev of bytes used on all disks in the cluster and individual OSD hosts
- 1st and 2nd derivative of bytes used on all disks in the cluster and
individual OSD hosts
- bytes used in the entire cluster
- % usage of cluster capacity

Stddev should help us identify hotspots. Velocity and acceleration of bytes
used should help us with capacity planning. Bytes used in general is just a
neat thing to see, but doesn't tell us all that much. % usage of cluster
capacity is another thing that's just kind of neat to see.

What would you suggest looking for in dump_historic_ops? Maybe get regular
metrics on things like total transaction length? The only problem is that
dump_historic_ops may not always contain relevant/recent data. It is not as
easily translated into time series data as some other things.
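
(One way to turn dump_historic_ops into time series data despite that: poll it
and reduce each dump to a few numbers per sample. A rough sketch, assuming the
ceph CLI and a local admin socket; the exact field names vary a little between
releases, hence the defensive lookups.)

import json
import subprocess
import time

def historic_op_durations(osd_id):
    # durations, in seconds, of the slow ops the OSD currently retains
    out = subprocess.check_output(
        ['ceph', 'daemon', 'osd.%d' % osd_id, 'dump_historic_ops'])
    dump = json.loads(out)
    ops = dump.get('Ops') or dump.get('ops') or []
    return [op['duration'] for op in ops if 'duration' in op]

# one sample per poll: timestamp, count, max and mean duration for osd.0
durations = historic_op_durations(0)
if durations:
    print('%d osd.0 n=%d max=%.3fs avg=%.3fs'
          % (time.time(), len(durations), max(durations),
             sum(durations) / len(durations)))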




On Sat, Apr 12, 2014 at 9:23 AM, Mark Nelson wrote:

> One thing I do right now for ceph performance testing is run a copy of
> collectl during every test.  This gives you a TON of information about CPU
> usage, network stats, disk stats, etc.  It's pretty easy to import the
> output data into gnuplot.  Mark Seger (the creator of collectl) also has
> some tools to gather aggregate statistics across multiple nodes. Beyond
> collectl, you can get a ton of useful data out of the ceph admin socket.  I
> especially like dump_historic_ops as it some times is enough to avoid
> having to parse through debug 20 logs.
>
> While the following tools have too much overhead to be really useful for
> general system monitoring, they are really useful for specific performance
> investigations:
>
> 1) perf with the dwarf/unwind support
> 2) blktrace (optionally with seekwatcher)
> 3) valgrind (cachegrind, callgrind, massif)
>
> Beyond that, there are some collectd plugins for Ceph and last time I
> checked DreamHost was using Graphite for a lot of visualizations. There's
> always ganglia too. :)
>
> Mark
>
>
> On 04/12/2014 09:41 AM, Jason Villalta wrote:
>
>> I know ceph throws some warnings if there is high write latency.  But I
>> would be most interested in the delay for io requests, linking directly
>> to iops.  If iops start to drop because the disk are overwhelmed then
>> latency for requests would be increasing.  This would tell me that I
>> need to add more OSDs/Nodes.  I am not sure there is a specific metric
>> in ceph for this but it would be awesome if there was.
>>
>>
>> On Sat, Apr 12, 2014 at 10:37 AM, Greg Poirier > <mailto:greg.poir...@opower.com>> wrote:
>>
>> Curious as to how you define cluster latency.
>>
>>
>> On Sat, Apr 12, 2014 at 7:21 AM, Jason Villalta > <mailto:ja...@rubixnet.com>> wrote:
>>
>>> Hi, I have not done anything with metrics yet, but the only ones
>> I personally would be interested in is total capacity
>> utilization and cluster latency.
>>
>> Just my 2 cents.
>>
>>
>> On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier
>> mailto:greg.poir...@opower.com>> wrote:
>>
>> I'm in the process of building a dashboard for our Ceph
>> nodes. I was wondering if anyone out there had instrumented
>> their OSD / MON clusters and found particularly useful
>> visualizations.
>>
>> At first, I was trying to do ridiculous things (like
>> graphing % used for every disk in every OSD host), but I
>> realized quickly that that is simply too many metrics and
>> far too visually dense to be useful. I am attempting to put
>> together a few simpler, more dense visualizations like...
>> overall cluster utilization, aggregate cpu and memory
>> utilization per osd host, etc.
>>
>> Just looking for some suggestions.  Thanks!
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cg

Re: [ceph-users] Useful visualizations / metrics

2014-04-12 Thread Greg Poirier
Curious as to how you define cluster latency.


On Sat, Apr 12, 2014 at 7:21 AM, Jason Villalta  wrote:

> Hi, I have not done anything with metrics yet, but the only ones I
> personally would be interested in is total capacity utilization and cluster
> latency.
>
> Just my 2 cents.
>
>
> On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier wrote:
>
>> I'm in the process of building a dashboard for our Ceph nodes. I was
>> wondering if anyone out there had instrumented their OSD / MON clusters and
>> found particularly useful visualizations.
>>
>> At first, I was trying to do ridiculous things (like graphing % used for
>> every disk in every OSD host), but I realized quickly that that is simply
>> too many metrics and far too visually dense to be useful. I am attempting
>> to put together a few simpler, more dense visualizations like... overall
>> cluster utilization, aggregate cpu and memory utilization per osd host, etc.
>>
>> Just looking for some suggestions.  Thanks!
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> --
> *Jason Villalta*
> Co-founder
> 800.799.4407x1230 | www.RubixTechnology.com
>


[ceph-users] Useful visualizations / metrics

2014-04-12 Thread Greg Poirier
I'm in the process of building a dashboard for our Ceph nodes. I was
wondering if anyone out there had instrumented their OSD / MON clusters and
found particularly useful visualizations.

At first, I was trying to do ridiculous things (like graphing % used for
every disk in every OSD host), but I realized quickly that that is simply
too many metrics and far too visually dense to be useful. I am attempting
to put together a few simpler, more dense visualizations like... overall
cluster utilization, aggregate cpu and memory utilization per osd host, etc.

Just looking for some suggestions.  Thanks!


Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-11 Thread Greg Poirier
So, setting pgp_num to 2048 to match pg_num had a more serious impact than
I expected. The cluster is rebalancing quite substantially (8.5% of objects
being rebalanced)... which makes sense... Disk utilization is evening out
fairly well which is encouraging.

We are a little stumped as to why a few OSDs being full would cause the
entire cluster to stop serving IO. Is this a configuration issue that we
have?

We're slowly recovering:

 health HEALTH_WARN 135 pgs backfill; 187 pgs backfill_toofull; 151 pgs
backfilling; 2 pgs degraded; 369 pgs stuck unclean; 29 requests are blocked
> 32 sec; recovery 2563902/52390259 objects degraded (4.894%); 4 near full
osd(s)
  pgmap v8363400: 5120 pgs, 3 pools, 22635 GB data, 23872 kobjects
48889 GB used, 45022 GB / 93911 GB avail
2563902/52390259 objects degraded (4.894%)
4751 active+clean
  31 active+remapped+wait_backfill
   1 active+backfill_toofull
 103 active+remapped+wait_backfill+backfill_toofull
   1 active+degraded+wait_backfill+backfill_toofull
 150 active+remapped+backfilling
  82 active+remapped+backfill_toofull
   1 active+degraded+remapped+backfilling
recovery io 362 MB/s, 365 objects/s
  client io 1643 kB/s rd, 6001 kB/s wr, 911 op/s


On Fri, Apr 11, 2014 at 5:45 AM, Greg Poirier wrote:

> So... our storage problems persisted for about 45 minutes. I gave an
> entire hypervisor worth of VM's time to recover (approx. 30 vms), and none
> of them recovered on their own. In the end, we had to stop and start every
> VM (easily done, it was just alarming). Once rebooted, the VMs of course
> were fine.
>
> I marked the two full OSDs as down and out. I am a little concerned that
> these two are full while the cluster, in general, is only at 50% capacity.
> It appears we may have a hot spot. I'm going to look into that later today.
> Also, I'm not sure how it happened, but pgp_num is lower than pg_num.  I
> had not noticed that until last night. Will address that as well. This
> probably happened when i last resized placement groups or potentially when
> I setup object storage pools.
>
>
>
>
> On Fri, Apr 11, 2014 at 3:49 AM, Wido den Hollander  wrote:
>
>> On 04/11/2014 09:23 AM, Josef Johansson wrote:
>>
>>>
>>> On 11/04/14 09:07, Wido den Hollander wrote:
>>>
>>>>
>>>>  Op 11 april 2014 om 8:50 schreef Josef Johansson :
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 11/04/14 07:29, Wido den Hollander wrote:
>>>>>
>>>>>> Op 11 april 2014 om 7:13 schreef Greg Poirier <
>>>>>>> greg.poir...@opower.com>:
>>>>>>>
>>>>>>>
>>>>>>> One thing to note
>>>>>>> All of our kvm VMs have to be rebooted. This is something I wasn't
>>>>>>> expecting.  Tried waiting for them to recover on their own, but
>>>>>>> that's not
>>>>>>> happening. Rebooting them restores service immediately. :/ Not ideal.
>>>>>>>
>>>>>>>  A reboot isn't really required though. It could be that the VM
>>>>>> itself is in
>>>>>> trouble, but from a librados/librbd perspective I/O should simply
>>>>>> continue
>>>>>> as
>>>>>> soon as a osdmap has been received without the "full" flag.
>>>>>>
>>>>>> It could be that you have to wait some time before the VM continues.
>>>>>> This
>>>>>> can
>>>>>> take up to 15 minutes.
>>>>>>
>>>>> With other storage solution you would have to change the timeout-value
>>>>> for each disk, i.e. changing to 180 secs from 60 secs, for the VMs to
>>>>> survive storage problems.
>>>>> Does Ceph handle this differently somehow?
>>>>>
>>>>>  It's not that RBD does it differently. Librados simply blocks the I/O
>>>> and thus
>>>> does librbd, which then causes Qemu to block.
>>>>
>>>> I've seen VMs survive RBD issues for longer periods then 60 seconds.
>>>> Gave them
>>>> some time and they continued again.
>>>>
>>>> Which exact setting are you talking about? I'm talking about a Qemu/KVM
>>>> VM
>>>> running with a VirtIO drive.
>>>>
>>> cat /sys/block/*/device/timeout
>>> (http://kb.vmware.com/selfser

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-11 Thread Greg Poirier
So... our storage problems persisted for about 45 minutes. I gave an entire
hypervisor worth of VM's time to recover (approx. 30 vms), and none of them
recovered on their own. In the end, we had to stop and start every VM
(easily done, it was just alarming). Once rebooted, the VMs of course were
fine.

I marked the two full OSDs as down and out. I am a little concerned that
these two are full while the cluster, in general, is only at 50% capacity.
It appears we may have a hot spot. I'm going to look into that later today.
Also, I'm not sure how it happened, but pgp_num is lower than pg_num.  I
had not noticed that until last night. Will address that as well. This
probably happened when I last resized placement groups or potentially when
I set up object storage pools.
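
(For what it's worth, the pg_num/pgp_num mismatch is easy to spot from the osd
dump; a small sketch, assuming the JSON field names used by the releases of
that era, where pgp_num is reported as pg_placement_num:)

import json
import subprocess

dump = json.loads(subprocess.check_output(
    ['ceph', 'osd', 'dump', '--format', 'json']))

for pool in dump['pools']:
    pg_num = pool['pg_num']
    pgp_num = pool['pg_placement_num']
    note = '' if pg_num == pgp_num else '  <-- pgp_num lagging behind pg_num'
    print('%-20s pg_num=%-5d pgp_num=%-5d%s'
          % (pool['pool_name'], pg_num, pgp_num, note))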




On Fri, Apr 11, 2014 at 3:49 AM, Wido den Hollander  wrote:

> On 04/11/2014 09:23 AM, Josef Johansson wrote:
>
>>
>> On 11/04/14 09:07, Wido den Hollander wrote:
>>
>>>
>>>  Op 11 april 2014 om 8:50 schreef Josef Johansson :
>>>>
>>>>
>>>> Hi,
>>>>
>>>> On 11/04/14 07:29, Wido den Hollander wrote:
>>>>
>>>>> Op 11 april 2014 om 7:13 schreef Greg Poirier >>>>> >:
>>>>>>
>>>>>>
>>>>>> One thing to note
>>>>>> All of our kvm VMs have to be rebooted. This is something I wasn't
>>>>>> expecting.  Tried waiting for them to recover on their own, but
>>>>>> that's not
>>>>>> happening. Rebooting them restores service immediately. :/ Not ideal.
>>>>>>
>>>>>>  A reboot isn't really required though. It could be that the VM
>>>>> itself is in
>>>>> trouble, but from a librados/librbd perspective I/O should simply
>>>>> continue
>>>>> as
>>>>> soon as a osdmap has been received without the "full" flag.
>>>>>
>>>>> It could be that you have to wait some time before the VM continues.
>>>>> This
>>>>> can
>>>>> take up to 15 minutes.
>>>>>
>>>> With other storage solution you would have to change the timeout-value
>>>> for each disk, i.e. changing to 180 secs from 60 secs, for the VMs to
>>>> survive storage problems.
>>>> Does Ceph handle this differently somehow?
>>>>
>>>>  It's not that RBD does it differently. Librados simply blocks the I/O
>>> and thus
>>> does librbd, which then causes Qemu to block.
>>>
>>> I've seen VMs survive RBD issues for longer periods then 60 seconds.
>>> Gave them
>>> some time and they continued again.
>>>
>>> Which exact setting are you talking about? I'm talking about a Qemu/KVM
>>> VM
>>> running with a VirtIO drive.
>>>
>> cat /sys/block/*/device/timeout
>> (http://kb.vmware.com/selfservice/microsites/search.
>> do?language=en_US&cmd=displayKC&externalId=1009465)
>>
>> This file is non-existent for my Ceph-VirtIO-drive however, so it seems
>> RBD handles this.
>>
>>
> Well, I don't think it's handled by RBD, but VirtIO simply doesn't have
> the timeout. That's probably only in the SCSI driver.
>
> Wido
>
>
>> I only have para-virtualized VMs to compare with right now, and they
>> don't have it inside the VM, but that's expected. From my understanding
>> it should've been there if it was an HVM. Whenever the timeout was
>> reached, an error occurred and the disk was set in read-only mode.
>>
>> Cheers,
>> Josef
>>
>>> Wido
>>>
>>>  Cheers,
>>>> Josef
>>>>
>>>>> Wido
>>>>>
>>>>>  On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier
>>>>>> wrote:
>>>>>>
>>>>>>  Going to try increasing the full ratio. Disk utilization wasn't
>>>>>>> really
>>>>>>> growing at an unreasonable pace. I'm going to keep an eye on it for
>>>>>>> the
>>>>>>> next couple of hours and down/out the OSDs if necessary.
>>>>>>>
>>>>>>> We have four more machines that we're in the process of adding (which
>>>>>>> doubles the number of OSDs), but got held up by some networking
>>>>>>> nonsense.
>>>>>>>
>>>>>>> Thanks for the tips.
>>>>>>

Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-10 Thread Greg Poirier
One thing to note
All of our kvm VMs have to be rebooted. This is something I wasn't
expecting.  Tried waiting for them to recover on their own, but that's not
happening. Rebooting them restores service immediately. :/ Not ideal.


On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier wrote:

> Going to try increasing the full ratio. Disk utilization wasn't really
> growing at an unreasonable pace. I'm going to keep an eye on it for the
> next couple of hours and down/out the OSDs if necessary.
>
> We have four more machines that we're in the process of adding (which
> doubles the number of OSDs), but got held up by some networking nonsense.
>
> Thanks for the tips.
>
>
> On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil  wrote:
>
>> On Thu, 10 Apr 2014, Greg Poirier wrote:
>> > Hi,
>> > I have about 200 VMs with a common RBD volume as their root filesystem
>> and a
>> > number of additional filesystems on Ceph.
>> >
>> > All of them have stopped responding. One of the OSDs in my cluster is
>> marked
>> > full. I tried stopping that OSD to force things to rebalance or at
>> least go
>> > to degraded mode, but nothing is responding still.
>> >
>> > I'm not exactly sure what to do or how to investigate. Suggestions?
>>
>> Try marking the osd out or partially out (ceph osd reweight N .9) to move
>> some data off, and/or adjust the full ratio up (ceph pg set_full_ratio
>> .95).  Note that this becomes increasingly dangerous as OSDs get closer to
>> full; add some disks.
>>
>> sage
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-10 Thread Greg Poirier
Going to try increasing the full ratio. Disk utilization wasn't really
growing at an unreasonable pace. I'm going to keep an eye on it for the
next couple of hours and down/out the OSDs if necessary.

We have four more machines that we're in the process of adding (which
doubles the number of OSDs), but got held up by some networking nonsense.

Thanks for the tips.


On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil  wrote:

> On Thu, 10 Apr 2014, Greg Poirier wrote:
> > Hi,
> > I have about 200 VMs with a common RBD volume as their root filesystem
> and a
> > number of additional filesystems on Ceph.
> >
> > All of them have stopped responding. One of the OSDs in my cluster is
> marked
> > full. I tried stopping that OSD to force things to rebalance or at least
> go
> > to degraded mode, but nothing is responding still.
> >
> > I'm not exactly sure what to do or how to investigate. Suggestions?
>
> Try marking the osd out or partially out (ceph osd reweight N .9) to move
> some data off, and/or adjust the full ratio up (ceph pg set_full_ratio
> .95).  Note that this becomes increasingly dangerous as OSDs get closer to
> full; add some disks.
>
> sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD full - All RBD Volumes stopped responding

2014-04-10 Thread Greg Poirier
Hi,

I have about 200 VMs with a common RBD volume as their root filesystem and
a number of additional filesystems on Ceph.

All of them have stopped responding. One of the OSDs in my cluster is
marked full. I tried stopping that OSD to force things to rebalance or at
least go to degraded mode, but nothing is responding still.

I'm not exactly sure what to do or how to investigate. Suggestions?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replication lag in block storage

2014-03-14 Thread Greg Poirier
We are stressing these boxes pretty spectacularly at the moment.

On every box I have one OSD that is pegged for IO almost constantly.

ceph-1:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdv               0.00     0.00  104.00  160.00   748.00  1000.00    13.24     1.15    4.36    9.46    1.05   3.70  97.60

ceph-2:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdq               0.00    25.00  109.00  218.00   844.00  1773.50    16.01     1.37    4.20    9.03    1.78   3.01  98.40

ceph-3:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdm               0.00     0.00  126.00   56.00   996.00   540.00    16.88     1.01    5.58    8.06    0.00   5.43  98.80

These are all disks in my block storage pool.

 osdmap e26698: 102 osds: 102 up, 102 in
  pgmap v6752413: 4624 pgs, 3 pools, 14151 GB data, 21729 kobjects
28517 GB used, 65393 GB / 93911 GB avail
4624 active+clean
  client io 1915 kB/s rd, 59690 kB/s wr, 1464 op/s

I don't see any smart errors, but I'm slowly working my way through all of
the disks on these machines with smartctl to see if anything stands out.
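
A rough sketch of that pass (the iostat devices are the busy OSD disks shown
above; smartctl -H only prints the overall health verdict):

    # overall SMART health verdict for every disk on a host
    for d in $(lsblk -dn -o NAME | grep '^sd'); do
        echo "== /dev/$d =="
        smartctl -H /dev/$d
    done

    # watch the suspect disk under load (sdv on ceph-1; sdq on ceph-2, sdm on ceph-3)
    iostat -x 1 /dev/sdv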


On Fri, Mar 14, 2014 at 9:52 AM, Gregory Farnum  wrote:

> On Fri, Mar 14, 2014 at 9:37 AM, Greg Poirier 
> wrote:
> > So, on the cluster that I _expect_ to be slow, it appears that we are
> > waiting on journal commits. I want to make sure that I am reading this
> > correctly:
> >
> >   "received_at": "2014-03-14 12:14:22.659170",
> >
> > { "time": "2014-03-14 12:14:22.660191",
> >   "event": "write_thread_in_journal_buffer"},
> >
> > At this point we have received the write and are attempting to write the
> > transaction to the OSD's journal, yes?
> >
> > Then:
> >
> > { "time": "2014-03-14 12:14:22.900779",
> >   "event": "journaled_completion_queued"},
> >
> > 240ms later we have successfully written to the journal?
>
> Correct. That seems an awfully long time for a 16K write, although I
> don't know how much data I have on co-located journals. (At least, I'm
> assuming it's in the 16K range based on the others, although I'm just
> now realizing that subops aren't providing that information...I've
> created a ticket to include that diagnostic info in future.)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> > I expect this particular slowness is due to colocation of journal and data
> on
> > the same disk (and it's a spinning disk, not an SSD). I expect some of
> this
> > could be alleviated by migrating journals to SSDs, but I am looking to
> > rebuild in the near future--so am willing to hobble in the meantime.
> >
> > I am surprised that our all SSD cluster is also underperforming. I am
> trying
> > colocating the journal on the same disk with all SSDs at the moment and
> will
> > see if the performance degradation is of the same nature.
> >
> >
> >
> > On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum 
> wrote:
> >>
> >> Right. So which is the interval that's taking all the time? Probably
> >> it's waiting for the journal commit, but maybe there's something else
> >> blocking progress. If it is the journal commit, check out how busy the
> >> disk is (is it just saturated?) and what its normal performance
> >> characteristics are (is it breaking?).
> >> -Greg
> >> Software Engineer #42 @ http://inktank.com | http://ceph.com
> >>
> >>
> >> On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier 
> >> wrote:
> >> > Many of the sub ops look like this, with significant lag between
> >> > received_at
> >> > and commit_sent:
> >> >
> >> > { "description": "osd_op(client.6869831.0:1192491
> >> > rbd_data.67b14a2ae8944a.9105 [write 507904~3686400]
> >> > 6.556a4db0
> >> > e660)",
> >> >   "received_at": "2014-03-13 20:42:05.811936",
> >> >   "age": "46.088198",
> >> >   "duration": "0.038328",
> >> > 
> >> > { "time": "2014-03-13 20:42:05.850215",
> >> >

Re: [ceph-users] Replication lag in block storage

2014-03-14 Thread Greg Poirier
So, on the cluster that I _expect_ to be slow, it appears that we are
waiting on journal commits. I want to make sure that I am reading this
correctly:

  "received_at": "2014-03-14 12:14:22.659170",

{ "time": "2014-03-14 12:14:22.660191",
  "event": "write_thread_in_journal_buffer"},

At this point we have received the write and are attempting to write the
transaction to the OSD's journal, yes?

Then:

{ "time": "2014-03-14 12:14:22.900779",
  "event": "journaled_completion_queued"},

240ms later we have successfully written to the journal?

I expect this particular slowness is due to colocation of journal and data on
the same disk (and it's a spinning disk, not an SSD). I expect some of this
could be alleviated by migrating journals to SSDs, but I am looking to
rebuild in the near future--so am willing to hobble in the meantime.

I am surprised that our all-SSD cluster is also underperforming. I am
trying colocated journals on the same disk with all SSDs at the moment
and will see if the performance degradation is of the same nature.
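
For anyone following along, the event timelines above come from the OSD
admin socket; a sketch of pulling them (OSD id is an example):

    # recently completed slow ops, with per-event timestamps
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_historic_ops

    # ops currently in flight, useful while a stall is happening
    ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok dump_ops_in_flight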



On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum  wrote:

> Right. So which is the interval that's taking all the time? Probably
> it's waiting for the journal commit, but maybe there's something else
> blocking progress. If it is the journal commit, check out how busy the
> disk is (is it just saturated?) and what its normal performance
> characteristics are (is it breaking?).
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier 
> wrote:
> > Many of the sub ops look like this, with significant lag between
> received_at
> > and commit_sent:
> >
> > { "description": "osd_op(client.6869831.0:1192491
> > rbd_data.67b14a2ae8944a.9105 [write 507904~3686400]
> 6.556a4db0
> > e660)",
> >   "received_at": "2014-03-13 20:42:05.811936",
> >   "age": "46.088198",
> >   "duration": "0.038328",
> > 
> > { "time": "2014-03-13 20:42:05.850215",
> >   "event": "commit_sent"},
> > { "time": "2014-03-13 20:42:05.850264",
> >   "event": "done"}]]},
> >
> > In this case almost 39ms between received_at and commit_sent.
> >
> > A particularly egregious example of 80+ms lag between received_at and
> > commit_sent:
> >
> >{ "description": "osd_op(client.6869831.0:1190526
> > rbd_data.67b14a2ae8944a.8fac [write 3325952~868352]
> 6.5255f5fd
> > e660)",
> >   "received_at": "2014-03-13 20:41:40.227813",
> >   "age": "320.017087",
> >   "duration": "0.086852",
> > 
> > { "time": "2014-03-13 20:41:40.314633",
> >   "event": "commit_sent"},
> > { "time": "2014-03-13 20:41:40.314665",
> >   "event": "done"}]]},
> >
> >
> >
> > On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum 
> wrote:
> >>
> >> On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier 
> >> wrote:
> >> > We've been seeing this issue on all of our dumpling clusters, and I'm
> >> > wondering what might be the cause of it.
> >> >
> >> > In dump_historic_ops, the time between op_applied and
> sub_op_commit_rec
> >> > or
> >> > the time between commit_sent and sub_op_applied is extremely high.
> Some
> >> > of
> >> > the osd_sub_ops are as long as 100 ms. A sample dump_historic_ops is
> >> > included at the bottom.
> >>
> >> It's important to understand what each of those timestamps are
> reporting.
> >>
> >> op_applied: the point at which an OSD has applied an operation to its
> >> readable backing filesystem in-memory (which for xfs or ext4 will be
> >> after it's committed to the journal)
> >> sub_op_commit_rec: the point at which an OSD has gotten commits from
> >> the replica OSDs
> >> commit_sent: the point at which a replica OSD has sent a commit back
> >> to its primary
> sub_op_applied: the point at which a replica OSD has applied a
> particular operation to its backing filesystem in-memory (again, after
> the journal if using xfs)

Re: [ceph-users] Replication lag in block storage

2014-03-13 Thread Greg Poirier
Many of the sub ops look like this, with significant lag between
received_at and commit_sent:

{ "description": "osd_op(client.6869831.0:1192491
rbd_data.67b14a2ae8944a.9105 [write 507904~3686400] 6.556a4db0
e660)",
  "received_at": "2014-03-13 20:42:05.811936",
  "age": "46.088198",
  "duration": "0.038328",

{ "time": "2014-03-13 20:42:05.850215",
  "event": "commit_sent"},
{ "time": "2014-03-13 20:42:05.850264",
  "event": "done"}]]},

In this case almost 39ms between received_at and commit_sent.

A particularly egregious example of 80+ms lag between received_at and
commit_sent:

   { "description": "osd_op(client.6869831.0:1190526
rbd_data.67b14a2ae8944a.8fac [write 3325952~868352] 6.5255f5fd
e660)",
  "received_at": "2014-03-13 20:41:40.227813",
  "age": "320.017087",
  "duration": "0.086852",

{ "time": "2014-03-13 20:41:40.314633",
  "event": "commit_sent"},
{ "time": "2014-03-13 20:41:40.314665",
  "event": "done"}]]},



On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum  wrote:

> On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier 
> wrote:
> > We've been seeing this issue on all of our dumpling clusters, and I'm
> > wondering what might be the cause of it.
> >
> > In dump_historic_ops, the time between op_applied and sub_op_commit_rec
> or
> > the time between commit_sent and sub_op_applied is extremely high. Some
> of
> > the osd_sub_ops are as long as 100 ms. A sample dump_historic_ops is
> > included at the bottom.
>
> It's important to understand what each of those timestamps are reporting.
>
> op_applied: the point at which an OSD has applied an operation to its
> readable backing filesystem in-memory (which for xfs or ext4 will be
> after it's committed to the journal)
> sub_op_commit_rec: the point at which an OSD has gotten commits from
> the replica OSDs
> commit_sent: the point at which a replica OSD has sent a commit back
> to its primary
> sub_op_applied: the point at which a replica OSD has applied a
> particular operation to its backing filesystem in-memory (again, after
> the journal if using xfs)
>
> Reads are never served from replicas, so a long time between
> commit_sent and sub_op_applied should not in itself be an issue. A lag
> time between op_applied and sub_op_commit_rec means that the OSD is
> waiting on its replicas. A long time there indicates either that the
> replica is processing slowly, or that there's some issue in the
> communications stack (all the way from the raw ethernet up to the
> message handling in the OSD itself).
> So the first thing to look for are sub ops which have a lag time
> between the received_at and commit_sent timestamps. If none of those
> ever turn up, but unusually long waits for sub_op_commit_rec are still
> present, then it'll take more effort to correlate particular subops on
> replicas with the op on the primary they correspond to, and see where
> the time lag is coming into it.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "no user info saved" after user creation / can't create buckets

2014-03-12 Thread Greg Poirier
And, I figured out the issue.

The utility I was using to create pools, zones, and regions automatically
failed to do two things:

- create rgw.buckets and rgw.buckets.index for each zone
- setup placement pools for each zone

I did both of those, and now everything is working.

Thanks, me, for the commitment to figuring this poo out.
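
For reference, a rough sketch of filling those two gaps by hand (pool names
follow the zone-prefixed convention visible in the logs below; pg counts are
examples):

    # per-zone bucket data and index pools
    ceph osd pool create .us-west-1.rgw.buckets 128
    ceph osd pool create .us-west-1.rgw.buckets.index 128

    # point the zone's placement targets at them
    radosgw-admin zone get --rgw-zone=us-west-1 > zone.json
    # ... edit placement_pools in zone.json to reference the pools above ...
    radosgw-admin zone set --rgw-zone=us-west-1 --infile zone.json
    radosgw-admin regionmap update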


On Wed, Mar 12, 2014 at 8:31 PM, Greg Poirier wrote:

> Increasing the logging further, and I notice the following:
>
> 2014-03-13 00:27:28.617100 7f6036ffd700 20 rgw_create_bucket returned
> ret=-1 bucket=test(@.rgw.buckets[us-west-1.15849318.1])
>
> But hope that .rgw.buckets doesn't have to exist... and that @.rgw.buckets
> is perhaps telling of something?
>
> I did notice that .us-west-1.rgw.buckets and .us-west-1.rgw.buckets.index
> weren't created. I created those, restarted radosgw, and still 403 errors.
>
>
> On Wed, Mar 12, 2014 at 8:00 PM, Greg Poirier wrote:
>
>> And the debug log because that last log was obviously not helpful...
>>
>> 2014-03-12 23:57:49.497780 7ff97e7dd700  1 == starting new request
>> req=0x23bc650 =
>> 2014-03-12 23:57:49.498198 7ff97e7dd700  2 req 1:0.000419::PUT
>> /test::initializing
>> 2014-03-12 23:57:49.498233 7ff97e7dd700 10 
>> host=s3.amazonaws.com rgw_dns_name=us-west-1.domain
>> 2014-03-12 23:57:49.498366 7ff97e7dd700 10 s->object= s->bucket=test
>> 2014-03-12 23:57:49.498437 7ff97e7dd700  2 req 1:0.000659:s3:PUT
>> /test::getting op
>> 2014-03-12 23:57:49.498448 7ff97e7dd700  2 req 1:0.000670:s3:PUT
>> /test:create_bucket:authorizing
>> 2014-03-12 23:57:49.498508 7ff97e7dd700 10 cache get:
>> name=.us-west-1.users+BLAHBLAHBLAH : miss
>> 2014-03-12 23:57:49.500852 7ff97e7dd700 10 cache put:
>> name=.us-west-1.users+BLAHBLAHBLAH
>> 2014-03-12 23:57:49.500865 7ff97e7dd700 10 adding
>> .us-west-1.users+BLAHBLAHBLAH to cache LRU end
>> 2014-03-12 23:57:49.500886 7ff97e7dd700 10 moving
>> .us-west-1.users+BLAHBLAHBLAH to cache LRU end
>> 2014-03-12 23:57:49.500889 7ff97e7dd700 10 cache get:
>> name=.us-west-1.users+BLAHBLAHBLAH : type miss (requested=1, cached=6)
>> 2014-03-12 23:57:49.500907 7ff97e7dd700 10 moving
>> .us-west-1.users+BLAHBLAHBLAH to cache LRU end
>> 2014-03-12 23:57:49.500910 7ff97e7dd700 10 cache get:
>> name=.us-west-1.users+BLAHBLAHBLAH : hit
>> 2014-03-12 23:57:49.502663 7ff97e7dd700 10 cache put:
>> name=.us-west-1.users+BLAHBLAHBLAH
>> 2014-03-12 23:57:49.502667 7ff97e7dd700 10 moving
>> .us-west-1.users+BLAHBLAHBLAH to cache LRU end
>> 2014-03-12 23:57:49.502700 7ff97e7dd700 10 cache get:
>> name=.us-west-1.users.uid+test : miss
>> 2014-03-12 23:57:49.505128 7ff97e7dd700 10 cache put:
>> name=.us-west-1.users.uid+test
>> 2014-03-12 23:57:49.505138 7ff97e7dd700 10 adding
>> .us-west-1.users.uid+test to cache LRU end
>> 2014-03-12 23:57:49.505157 7ff97e7dd700 10 moving
>> .us-west-1.users.uid+test to cache LRU end
>> 2014-03-12 23:57:49.505160 7ff97e7dd700 10 cache get:
>> name=.us-west-1.users.uid+test : type miss (requested=1, cached=6)
>> 2014-03-12 23:57:49.505176 7ff97e7dd700 10 moving
>> .us-west-1.users.uid+test to cache LRU end
>> 2014-03-12 23:57:49.505178 7ff97e7dd700 10 cache get:
>> name=.us-west-1.users.uid+test : hit
>> 2014-03-12 23:57:49.507401 7ff97e7dd700 10 cache put:
>> name=.us-west-1.users.uid+test
>> 2014-03-12 23:57:49.507406 7ff97e7dd700 10 moving
>> .us-west-1.users.uid+test to cache LRU end
>> 2014-03-12 23:57:49.507521 7ff97e7dd700 10 get_canon_resource():
>> dest=/test
>> 2014-03-12 23:57:49.507529 7ff97e7dd700 10 auth_hdr:
>> PUT
>>
>> binary/octet-stream
>> Wed, 12 Mar 2014 23:57:51 GMT
>> /test
>> 2014-03-12 23:57:49.507674 7ff97e7dd700  2 req 1:0.009895:s3:PUT
>> /test:create_bucket:reading permissions
>> 2014-03-12 23:57:49.507682 7ff97e7dd700  2 req 1:0.009904:s3:PUT
>> /test:create_bucket:verifying op mask
>> 2014-03-12 23:57:49.507695 7ff97e7dd700  2 req 1:0.009917:s3:PUT
>> /test:create_bucket:verifying op permissions
>> 2014-03-12 23:57:49.509604 7ff97e7dd700  2 req 1:0.011826:s3:PUT
>> /test:create_bucket:verifying op params
>> 2014-03-12 23:57:49.509615 7ff97e7dd700  2 req 1:0.011836:s3:PUT
>> /test:create_bucket:executing
>>  2014-03-12 23:57:49.509694 7ff97e7dd700 10 cache get:
>> name=.us-west-1.domain.rgw+test : miss
>> 2014-03-12 23:57:49.512229 7ff97e7dd700 10 cache put:
>> name=.us-west-1.domain.rgw+test
>> 2014-03-12 23:57:49.512259 7ff97e7dd700 10 adding
>> .us-west-1.domain.rgw+test to cache LRU end
>> 2014-03-12 23:57:49.512333 7ff97e7dd700 10 cache get:
>>

Re: [ceph-users] "no user info saved" after user creation / can't create buckets

2014-03-12 Thread Greg Poirier
Increasing the logging further, and I notice the following:

2014-03-13 00:27:28.617100 7f6036ffd700 20 rgw_create_bucket returned
ret=-1 bucket=test(@.rgw.buckets[us-west-1.15849318.1])

But hope that .rgw.buckets doesn't have to exist... and that @.rgw.buckets
is perhaps telling of something?

I did notice that .us-west-1.rgw.buckets and .us-west-1.rgw.buckets.index
weren't created. I created those, restarted radosgw, and still 403 errors.


On Wed, Mar 12, 2014 at 8:00 PM, Greg Poirier wrote:

> And the debug log because that last log was obviously not helpful...
>
> 2014-03-12 23:57:49.497780 7ff97e7dd700  1 == starting new request
> req=0x23bc650 =
> 2014-03-12 23:57:49.498198 7ff97e7dd700  2 req 1:0.000419::PUT
> /test::initializing
> 2014-03-12 23:57:49.498233 7ff97e7dd700 10 
> host=s3.amazonaws.com rgw_dns_name=us-west-1.domain
> 2014-03-12 23:57:49.498366 7ff97e7dd700 10 s->object= s->bucket=test
> 2014-03-12 23:57:49.498437 7ff97e7dd700  2 req 1:0.000659:s3:PUT
> /test::getting op
> 2014-03-12 23:57:49.498448 7ff97e7dd700  2 req 1:0.000670:s3:PUT
> /test:create_bucket:authorizing
> 2014-03-12 23:57:49.498508 7ff97e7dd700 10 cache get:
> name=.us-west-1.users+BLAHBLAHBLAH : miss
> 2014-03-12 23:57:49.500852 7ff97e7dd700 10 cache put:
> name=.us-west-1.users+BLAHBLAHBLAH
> 2014-03-12 23:57:49.500865 7ff97e7dd700 10 adding
> .us-west-1.users+BLAHBLAHBLAH to cache LRU end
> 2014-03-12 23:57:49.500886 7ff97e7dd700 10 moving
> .us-west-1.users+BLAHBLAHBLAH to cache LRU end
> 2014-03-12 23:57:49.500889 7ff97e7dd700 10 cache get:
> name=.us-west-1.users+BLAHBLAHBLAH : type miss (requested=1, cached=6)
> 2014-03-12 23:57:49.500907 7ff97e7dd700 10 moving
> .us-west-1.users+BLAHBLAHBLAH to cache LRU end
> 2014-03-12 23:57:49.500910 7ff97e7dd700 10 cache get:
> name=.us-west-1.users+BLAHBLAHBLAH : hit
> 2014-03-12 23:57:49.502663 7ff97e7dd700 10 cache put:
> name=.us-west-1.users+BLAHBLAHBLAH
> 2014-03-12 23:57:49.502667 7ff97e7dd700 10 moving
> .us-west-1.users+BLAHBLAHBLAH to cache LRU end
> 2014-03-12 23:57:49.502700 7ff97e7dd700 10 cache get:
> name=.us-west-1.users.uid+test : miss
> 2014-03-12 23:57:49.505128 7ff97e7dd700 10 cache put:
> name=.us-west-1.users.uid+test
> 2014-03-12 23:57:49.505138 7ff97e7dd700 10 adding
> .us-west-1.users.uid+test to cache LRU end
> 2014-03-12 23:57:49.505157 7ff97e7dd700 10 moving
> .us-west-1.users.uid+test to cache LRU end
> 2014-03-12 23:57:49.505160 7ff97e7dd700 10 cache get:
> name=.us-west-1.users.uid+test : type miss (requested=1, cached=6)
> 2014-03-12 23:57:49.505176 7ff97e7dd700 10 moving
> .us-west-1.users.uid+test to cache LRU end
> 2014-03-12 23:57:49.505178 7ff97e7dd700 10 cache get:
> name=.us-west-1.users.uid+test : hit
> 2014-03-12 23:57:49.507401 7ff97e7dd700 10 cache put:
> name=.us-west-1.users.uid+test
> 2014-03-12 23:57:49.507406 7ff97e7dd700 10 moving
> .us-west-1.users.uid+test to cache LRU end
> 2014-03-12 23:57:49.507521 7ff97e7dd700 10 get_canon_resource(): dest=/test
> 2014-03-12 23:57:49.507529 7ff97e7dd700 10 auth_hdr:
> PUT
>
> binary/octet-stream
> Wed, 12 Mar 2014 23:57:51 GMT
> /test
> 2014-03-12 23:57:49.507674 7ff97e7dd700  2 req 1:0.009895:s3:PUT
> /test:create_bucket:reading permissions
> 2014-03-12 23:57:49.507682 7ff97e7dd700  2 req 1:0.009904:s3:PUT
> /test:create_bucket:verifying op mask
> 2014-03-12 23:57:49.507695 7ff97e7dd700  2 req 1:0.009917:s3:PUT
> /test:create_bucket:verifying op permissions
> 2014-03-12 23:57:49.509604 7ff97e7dd700  2 req 1:0.011826:s3:PUT
> /test:create_bucket:verifying op params
> 2014-03-12 23:57:49.509615 7ff97e7dd700  2 req 1:0.011836:s3:PUT
> /test:create_bucket:executing
> 2014-03-12 23:57:49.509694 7ff97e7dd700 10 cache get:
> name=.us-west-1.domain.rgw+test : miss
> 2014-03-12 23:57:49.512229 7ff97e7dd700 10 cache put:
> name=.us-west-1.domain.rgw+test
> 2014-03-12 23:57:49.512259 7ff97e7dd700 10 adding
> .us-west-1.domain.rgw+test to cache LRU end
> 2014-03-12 23:57:49.512333 7ff97e7dd700 10 cache get:
> name=.us-west-1.domain.rgw+.pools.avail : miss
> 2014-03-12 23:57:49.518216 7ff97e7dd700 10 cache put:
> name=.us-west-1.domain.rgw+.pools.avail
> 2014-03-12 23:57:49.518228 7ff97e7dd700 10 adding
> .us-west-1.domain.rgw+.pools.avail to cache LRU end
> 2014-03-12 23:57:49.518248 7ff97e7dd700 10 moving
> .us-west-1.domain.rgw+.pools.avail to cache LRU end
> 2014-03-12 23:57:49.518251 7ff97e7dd700 10 cache get:
> name=.us-west-1.domain.rgw+.pools.avail : type miss (requested=1, cached=6)
> 2014-03-12 23:57:49.518270 7ff97e7dd700 10 moving
> .us-west-1.domain.rgw+.pools.avail to cache LRU end
> 2014-03-12 23:57:49.518272 7ff97e7dd700 10 cache get:
> name=.us-west-1.domain.rgw+.pools.avail : hit
> 2014-03-12 23:57:4

Re: [ceph-users] "no user info saved" after user creation / can't create buckets

2014-03-12 Thread Greg Poirier
And the debug log because that last log was obviously not helpful...

2014-03-12 23:57:49.497780 7ff97e7dd700  1 == starting new request
req=0x23bc650 =
2014-03-12 23:57:49.498198 7ff97e7dd700  2 req 1:0.000419::PUT
/test::initializing
2014-03-12 23:57:49.498233 7ff97e7dd700 10
host=s3.amazonaws.com rgw_dns_name=us-west-1.domain
2014-03-12 23:57:49.498366 7ff97e7dd700 10 s->object= s->bucket=test
2014-03-12 23:57:49.498437 7ff97e7dd700  2 req 1:0.000659:s3:PUT
/test::getting op
2014-03-12 23:57:49.498448 7ff97e7dd700  2 req 1:0.000670:s3:PUT
/test:create_bucket:authorizing
2014-03-12 23:57:49.498508 7ff97e7dd700 10 cache get:
name=.us-west-1.users+BLAHBLAHBLAH : miss
2014-03-12 23:57:49.500852 7ff97e7dd700 10 cache put:
name=.us-west-1.users+BLAHBLAHBLAH
2014-03-12 23:57:49.500865 7ff97e7dd700 10 adding
.us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.500886 7ff97e7dd700 10 moving
.us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.500889 7ff97e7dd700 10 cache get:
name=.us-west-1.users+BLAHBLAHBLAH : type miss (requested=1, cached=6)
2014-03-12 23:57:49.500907 7ff97e7dd700 10 moving
.us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.500910 7ff97e7dd700 10 cache get:
name=.us-west-1.users+BLAHBLAHBLAH : hit
2014-03-12 23:57:49.502663 7ff97e7dd700 10 cache put:
name=.us-west-1.users+BLAHBLAHBLAH
2014-03-12 23:57:49.502667 7ff97e7dd700 10 moving
.us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.502700 7ff97e7dd700 10 cache get:
name=.us-west-1.users.uid+test : miss
2014-03-12 23:57:49.505128 7ff97e7dd700 10 cache put:
name=.us-west-1.users.uid+test
2014-03-12 23:57:49.505138 7ff97e7dd700 10 adding .us-west-1.users.uid+test
to cache LRU end
2014-03-12 23:57:49.505157 7ff97e7dd700 10 moving .us-west-1.users.uid+test
to cache LRU end
2014-03-12 23:57:49.505160 7ff97e7dd700 10 cache get:
name=.us-west-1.users.uid+test : type miss (requested=1, cached=6)
2014-03-12 23:57:49.505176 7ff97e7dd700 10 moving .us-west-1.users.uid+test
to cache LRU end
2014-03-12 23:57:49.505178 7ff97e7dd700 10 cache get:
name=.us-west-1.users.uid+test : hit
2014-03-12 23:57:49.507401 7ff97e7dd700 10 cache put:
name=.us-west-1.users.uid+test
2014-03-12 23:57:49.507406 7ff97e7dd700 10 moving .us-west-1.users.uid+test
to cache LRU end
2014-03-12 23:57:49.507521 7ff97e7dd700 10 get_canon_resource(): dest=/test
2014-03-12 23:57:49.507529 7ff97e7dd700 10 auth_hdr:
PUT

binary/octet-stream
Wed, 12 Mar 2014 23:57:51 GMT
/test
2014-03-12 23:57:49.507674 7ff97e7dd700  2 req 1:0.009895:s3:PUT
/test:create_bucket:reading permissions
2014-03-12 23:57:49.507682 7ff97e7dd700  2 req 1:0.009904:s3:PUT
/test:create_bucket:verifying op mask
2014-03-12 23:57:49.507695 7ff97e7dd700  2 req 1:0.009917:s3:PUT
/test:create_bucket:verifying op permissions
2014-03-12 23:57:49.509604 7ff97e7dd700  2 req 1:0.011826:s3:PUT
/test:create_bucket:verifying op params
2014-03-12 23:57:49.509615 7ff97e7dd700  2 req 1:0.011836:s3:PUT
/test:create_bucket:executing
2014-03-12 23:57:49.509694 7ff97e7dd700 10 cache get:
name=.us-west-1.domain.rgw+test : miss
2014-03-12 23:57:49.512229 7ff97e7dd700 10 cache put:
name=.us-west-1.domain.rgw+test
2014-03-12 23:57:49.512259 7ff97e7dd700 10 adding
.us-west-1.domain.rgw+test to cache LRU end
2014-03-12 23:57:49.512333 7ff97e7dd700 10 cache get:
name=.us-west-1.domain.rgw+.pools.avail : miss
2014-03-12 23:57:49.518216 7ff97e7dd700 10 cache put:
name=.us-west-1.domain.rgw+.pools.avail
2014-03-12 23:57:49.518228 7ff97e7dd700 10 adding
.us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.518248 7ff97e7dd700 10 moving
.us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.518251 7ff97e7dd700 10 cache get:
name=.us-west-1.domain.rgw+.pools.avail : type miss (requested=1, cached=6)
2014-03-12 23:57:49.518270 7ff97e7dd700 10 moving
.us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.518272 7ff97e7dd700 10 cache get:
name=.us-west-1.domain.rgw+.pools.avail : hit
2014-03-12 23:57:49.520295 7ff97e7dd700 10 cache put:
name=.us-west-1.domain.rgw+.pools.avail
2014-03-12 23:57:49.520348 7ff97e7dd700 10 moving
.us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.522672 7ff97e7dd700  2 req 1:0.024893:s3:PUT
/test:create_bucket:http status=403
2014-03-12 23:57:49.523204 7ff97e7dd700  1 == req done req=0x23bc650
http_status=403 ==


On Wed, Mar 12, 2014 at 7:36 PM, Greg Poirier wrote:

> The saga continues...
>
> So, after fiddling with haproxy a bit, I managed to make sure that my
> requests were hitting the RADOS Gateway.
>
> NOW, I get a 403 from my ruby script:
>
> 2014-03-12 23:34:08.289670 7fda9bfbf700  1 == starting new request
> req=0x215a780 =
> 2014-03-12 23:34:08.305105 7fda9bfbf700  1 == req done req=0x215a780
> http_status=403 ==
>
> The aws-s3 gem forces the Host header to be set to s3.amazonaws.com --
> and I am wonderi

Re: [ceph-users] "no user info saved" after user creation / can't create buckets

2014-03-12 Thread Greg Poirier
The saga continues...

So, after fiddling with haproxy a bit, I managed to make sure that my
requests were hitting the RADOS Gateway.

NOW, I get a 403 from my ruby script:

2014-03-12 23:34:08.289670 7fda9bfbf700  1 == starting new request
req=0x215a780 =
2014-03-12 23:34:08.305105 7fda9bfbf700  1 == req done req=0x215a780
http_status=403 ==

The aws-s3 gem forces the Host header to be set to s3.amazonaws.com -- and
I am wondering if this could potentially cause a problem. Is that the case?
Or is radosgw able to just know where it is supposed to look for objects
based on its configuration? I assume the latter, otherwise we would have
multi-tenant radosgw instead of needing to have an instance per zone.

So does it ignore the Host header?

Is this 403 related to my problems with the user error I found earlier
where I'm unable to view the user with the radosgw-admin tool?
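
One thing that can be checked directly is what the gateway thinks its DNS
name is, since that is what the Host header gets compared against; a quick
sketch against the admin socket used earlier in this thread:

    ceph --admin-daemon /var/run/ceph/ceph-client.radosgw..asok config show \
        | grep rgw_dns_name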



On Wed, Mar 12, 2014 at 1:54 PM, Greg Poirier wrote:

> Also... what are linger_ops?
>
> ceph --admin-daemon /var/run/ceph/ceph-client.radosgw..asok
> objecter_requests
> { "ops": [],
>   "linger_ops": [
> { "linger_id": 1,
>   "pg": "7.4322fa9f",
>   "osd": 25,
>   "object_id": "notify.0",
>   "object_locator": "@7",
>   "snapid": "head",
>   "registering": "head",
>   "registered": "head"},
> { "linger_id": 2,
>   "pg": "7.16dafda0",
>   "osd": 132,
>   "object_id": "notify.1",
>   "object_locator": "@7",
>   "snapid": "head",
>   "registering": "head",
>   "registered": "head"},
> { "linger_id": 3,
>   "pg": "7.88aa5c95",
>   "osd": 32,
>   "object_id": "notify.2",
>   "object_locator": "@7",
>   "snapid": "head",
>   "registering": "head",
>   "registered": "head"},
> { "linger_id": 4,
>   "pg": "7.f8c99aee",
>   "osd": 62,
>   "object_id": "notify.3",
>   "object_locator": "@7",
>   "snapid": "head",
>   "registering": "head",
>   "registered": "head"},
> { "linger_id": 5,
>   "pg": "7.a204812d",
>   "osd": 129,
>   "object_id": "notify.4",
>   "object_locator": "@7",
>   "snapid": "head",
>   "registering": "head",
>   "registered": "head"},
> { "linger_id": 6,
>   "pg": "7.31099063",
>   "osd": 28,
>   "object_id": "notify.5",
>   "object_locator": "@7",
>   "snapid": "head",
>   "registering": "head",
>   "registered": "head"},
> { "linger_id": 7,
>   "pg": "7.97c520d4",
>   "osd": 135,
>   "object_id": "notify.6",
>   "object_locator": "@7",
>   "snapid": "head",
>   "registering": "head",
>   "registered": "head"},
> { "linger_id": 8,
>   "pg": "7.84ada7c9",
>   "osd": 94,
>   "object_id": "notify.7",
>   "object_locator": "@7",
>   "snapid": "head",
>   "registering": "head",
>   "registered": "head"}],
>   "pool_ops": [],
>   "pool_stat_ops": [],
>   "statfs_ops": [],
>   "command_ops": []}
>
>
> On Wed, Mar 12, 2014 at 10:45 AM, Greg Poirier wrote:
>
>> Rados GW and Ceph versions installed:
>> Version: 0.67.7-1precise
>>
>> I create a user:
>> radosgw-admin --name client.radosgw. user create --uid test
>> --display-name "Test User"
>>
>> It outputs som

Re: [ceph-users] "no user info saved" after user creation / can't create buckets

2014-03-12 Thread Greg Poirier
Also... what are linger_ops?

ceph --admin-daemon /var/run/ceph/ceph-client.radosgw..asok
objecter_requests
{ "ops": [],
  "linger_ops": [
{ "linger_id": 1,
  "pg": "7.4322fa9f",
  "osd": 25,
  "object_id": "notify.0",
  "object_locator": "@7",
  "snapid": "head",
  "registering": "head",
  "registered": "head"},
{ "linger_id": 2,
  "pg": "7.16dafda0",
  "osd": 132,
  "object_id": "notify.1",
  "object_locator": "@7",
  "snapid": "head",
  "registering": "head",
  "registered": "head"},
{ "linger_id": 3,
  "pg": "7.88aa5c95",
  "osd": 32,
  "object_id": "notify.2",
  "object_locator": "@7",
  "snapid": "head",
  "registering": "head",
  "registered": "head"},
{ "linger_id": 4,
  "pg": "7.f8c99aee",
  "osd": 62,
  "object_id": "notify.3",
  "object_locator": "@7",
  "snapid": "head",
  "registering": "head",
  "registered": "head"},
{ "linger_id": 5,
  "pg": "7.a204812d",
  "osd": 129,
  "object_id": "notify.4",
  "object_locator": "@7",
  "snapid": "head",
  "registering": "head",
  "registered": "head"},
{ "linger_id": 6,
  "pg": "7.31099063",
  "osd": 28,
      "object_id": "notify.5",
  "object_locator": "@7",
  "snapid": "head",
  "registering": "head",
  "registered": "head"},
{ "linger_id": 7,
  "pg": "7.97c520d4",
  "osd": 135,
  "object_id": "notify.6",
  "object_locator": "@7",
  "snapid": "head",
  "registering": "head",
  "registered": "head"},
{ "linger_id": 8,
  "pg": "7.84ada7c9",
  "osd": 94,
  "object_id": "notify.7",
  "object_locator": "@7",
  "snapid": "head",
  "registering": "head",
  "registered": "head"}],
  "pool_ops": [],
  "pool_stat_ops": [],
  "statfs_ops": [],
  "command_ops": []}


On Wed, Mar 12, 2014 at 10:45 AM, Greg Poirier wrote:

> Rados GW and Ceph versions installed:
> Version: 0.67.7-1precise
>
> I create a user:
> radosgw-admin --name client.radosgw. user create --uid test
> --display-name "Test User"
>
> It outputs some JSON that looks convincing:
> { "user_id": "test",
>   "display_name": "test user",
>   "email": "",
>   "suspended": 0,
>   "max_buckets": 1000,
>   "auid": 0,
>   "subusers": [],
>   "keys": [
> { "user": "test",
>   "access_key": "",
>   "secret_key": ""},
> { "user": "test",
>   "access_key": "",
>   "secret_key": ""}],
>   "swift_keys": [],
>   "caps": [],
>   "op_mask": "read, write, delete",
>   "default_placement": "",
>   "placement_tags": []}
>
> There are two keys because I have tried this twice.
>
> I can see it in metadata list:
> radosgw-admin --name client.radosgw. metadata list user
> [
> "test",
> "us-east-2",
> "us-west-1"]
>
> I then try to get user info:
>
> radosgw-admin --name client.radosgw. user info test
> could not fetch user info: no user info saved
>
> I try to create a bucket with the user using Ruby's aws/s3 API:
>
> require 'aws/

[ceph-users] "no user info saved" after user creation / can't create buckets

2014-03-12 Thread Greg Poirier
Rados GW and Ceph versions installed:
Version: 0.67.7-1precise

I create a user:
radosgw-admin --name client.radosgw. user create --uid test
--display-name "Test User"

It outputs some JSON that looks convincing:
{ "user_id": "test",
  "display_name": "test user",
  "email": "",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [],
  "keys": [
{ "user": "test",
  "access_key": "",
  "secret_key": ""},
{ "user": "test",
  "access_key": "",
  "secret_key": ""}],
  "swift_keys": [],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": []}

There are two keys because I have tried this twice.

I can see it in metadata list:
radosgw-admin --name client.radosgw. metadata list user
[
"test",
"us-east-2",
"us-west-1"]

I then try to get user info:

radosgw-admin --name client.radosgw. user info test
could not fetch user info: no user info saved

I try to create a bucket with the user using Ruby's aws/s3 API:

require 'aws/s3'

AWS::S3::Base.establish_connection!(
  access_key_id: '',
  secret_access_key: '',
  use_ssl: true,
  server: '',
  persistent: true
)

AWS::S3::Bucket.create('test')

file = 'sloth.txt'

AWS::S3::S3Object.store(file, open(file), 'test')

bucket = AWS::S3::Bucket.find('test')

puts bucket

bucket.each do |object|
  puts
"#{object.key}\t#{object.about['content-length']}\t#{object.about['last-modified']}"
end

And I get the following:
#
/Users/greg.poirier/.rvm/gems/ruby-1.9.3-p429/gems/aws-s3-0.6.3/lib/aws/s3/base.rb:235:in
`method_missing': undefined local variable or method `name' for
# (NameError)
 from
/Users/greg.poirier/.rvm/gems/ruby-1.9.3-p429/gems/aws-s3-0.6.3/lib/aws/s3/bucket.rb:313:in
`reload!'
from
/Users/greg.poirier/.rvm/gems/ruby-1.9.3-p429/gems/aws-s3-0.6.3/lib/aws/s3/bucket.rb:242:in
`objects'
 from
/Users/greg.poirier/.rvm/gems/ruby-1.9.3-p429/gems/aws-s3-0.6.3/lib/aws/s3/bucket.rb:253:in
`each'
from test.rb:21:in `'

The bucket fails to be created:

radosgw-admin --name client.radosgw. bucket list
[]

And also this:

radosgw-admin --name client.radosgw. metadata list bucket
[]2014-03-12 17:42:42.221112 7f426b779780 -1 failed to list objects
pool_iterate returned r=-2


So clearly there is something going on here. My questions:

Is this failure to create a bucket related to the "no user info saved"
error?

What would cause the "no user info saved" error?

What may be causing the bucket to not be created?
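
Two things worth checking for the "no user info saved" part (an assumption
on my side: user info wants the uid as a flag rather than a positional
argument):

    # fetch the user with an explicit --uid
    radosgw-admin --name client.radosgw. user info --uid=test

    # or read the raw metadata object that 'metadata list user' enumerates
    radosgw-admin --name client.radosgw. metadata get user:test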
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD Snapshots

2014-03-03 Thread Greg Poirier
Interesting. I think this may not be a bad idea. Thanks for the info.
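
A minimal sketch of that sequence (mount point, pool, and image names are
examples; fsfreeze runs inside the guest, the snapshot is taken from a
client with access to the pool):

    # inside the guest: quiesce the filesystem
    fsfreeze --freeze /mnt/data

    # on an RBD client: take the snapshot while I/O is frozen
    rbd snap create volumes/vm-disk-1@pre-backup

    # inside the guest: resume I/O
    fsfreeze --unfreeze /mnt/data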

On Monday, March 3, 2014, Jean-Tiare LE BIGOT 
wrote:

> To get consistent RBD live snapshots, you may want to first freeze the
> guest filesystem (ext4, btrfs, xfs) with a tool like [fsfreeze]. It will
> basically flush the FS state to disk and block any future write access
> while maintaining read access.
>
> [fsfreeze] http://manpages.courier-mta.org/htmlman8/fsfreeze.8.html
>
> On 02/28/2014 11:27 PM, Gregory Farnum wrote:
>
>> RBD itself will behave fine with whenever you take the snapshot. The
>> thing to worry about is that it's a snapshot at the block device
>> layer, not the filesystem layer, so if you don't quiesce IO and sync
>> to disk the filesystem might not be entirely happy with you for the
>> same reasons that it won't be happy if you pull the power plug on it.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Fri, Feb 28, 2014 at 2:12 PM, Greg Poirier 
>> wrote:
>>
>>> According to the documentation at
>>> https://ceph.com/docs/master/rbd/rbd-snapshot/ -- snapshots require
>>> that all
>>> I/O to a block device be stopped prior to making the snapshot. Is there
>>> any
>>> plan to allow for online snapshotting so that we could do incremental
>>> snapshots of running VMs on a regular basis.
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>  ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> --
> Jean-Tiare, shared-hosting team
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD Snapshots

2014-02-28 Thread Greg Poirier
According to the documentation at
https://ceph.com/docs/master/rbd/rbd-snapshot/ -- snapshots require that
all I/O to a block device be stopped prior to making the snapshot. Is there
any plan to allow for online snapshotting so that we could do incremental
snapshots of running VMs on a regular basis.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MON can no longer join quorum

2014-02-05 Thread Greg Poirier
Hi Karan,

I resolved it the same way you did. We had a network partition that caused
the MON to die, it appears.

I'm running 0.72.1

It would be nice if redeploying wasn't the solution, but if it's simply
cleaner to do so, then I will continue along that route.

I think what's more troubling is that when this occurred we lost all
connectivity to the Ceph cluster.
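
Roughly, the remove/re-add procedure amounts to something like this (a
sketch based on the standard add/remove-monitor steps; the monitor name and
IP are taken from the output below, so double-check against current docs):

    # drop the broken monitor from the monmap
    ceph mon remove ceph-mon-2003

    # on ceph-mon-2003: stop it and wipe its store
    stop ceph-mon id=ceph-mon-2003
    rm -rf /var/lib/ceph/mon/ceph-ceph-mon-2003

    # rebuild the store from the current cluster state and rejoin
    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap
    ceph-mon -i ceph-mon-2003 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add ceph-mon-2003 10.30.66.15:6789
    start ceph-mon id=ceph-mon-2003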


On Wed, Feb 5, 2014 at 1:11 AM, Karan Singh  wrote:

> Hi Greg
>
>
> I have seen this problem before in my cluster.
>
>
>
>- What Ceph version are you running?
>- Did you make any change recently in the cluster that resulted in
>this problem?
>
>
> You identified it correctly: the only problem is that ceph-mon-2003 is
> listening on an incorrect port; it should listen on port 6789 (like the
> other two monitors). How I resolved it was by cleanly removing the
> affected monitor node and adding it back to the cluster.
>
>
> Regards
>
> Karan
>
> --
> *From: *"Greg Poirier" 
> *To: *ceph-users@lists.ceph.com
> *Sent: *Tuesday, 4 February, 2014 10:50:21 PM
> *Subject: *[ceph-users] Ceph MON can no longer join quorum
>
>
> I have a MON that at some point lost connectivity to the rest of the
> cluster and now cannot rejoin.
>
> Each time I restart it, it looks like it's attempting to create a new MON
> and join the cluster, but the rest of the cluster rejects it, because the
> new one isn't in the monmap.
>
> I don't know why it suddenly decided it needed to be a new MON.
>
> I am not really sure where to start.
>
> root@ceph-mon-2003:/var/log/ceph# ceph -s
> cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
>  health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2
> pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests
> are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1
> ceph-mon-2001,ceph-mon-2002
>  monmap e2: 3 mons at {ceph-mon-2001=
> 10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0},
> election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002
>
> Notice ceph-mon-2003:6800
>
> If I try to start ceph-mon-all, it will be listening on some other port...
>
> root@ceph-mon-2003:/var/log/ceph# start ceph-mon-all
> ceph-mon-all start/running
> root@ceph-mon-2003:/var/log/ceph# ps -ef | grep ceph
> root  6930 1 31 15:49 ?00:00:00 /usr/bin/ceph-mon
> --cluster=ceph -i ceph-mon-2003 -f
> root  6931 1  3 15:49 ?00:00:00 python
> /usr/sbin/ceph-create-keys --cluster=ceph -i ceph-mon-2003
>
> root@ceph-mon-2003:/var/log/ceph# ceph -s
> 2014-02-04 15:49:56.854866 7f9cf422d700  0 -- :/1007028 >>
> 10.30.66.15:6789/0 pipe(0x7f9cf0021370 sd=3 :0 s=1 pgs=0 cs=0 l=1
> c=0x7f9cf00215d0).fault
> cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
>  health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2
> pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests
> are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1
> ceph-mon-2001,ceph-mon-2002
>  monmap e2: 3 mons at {ceph-mon-2001=
> 10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0},
> election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002
>
> Suggestions?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph MON can no longer join quorum

2014-02-04 Thread Greg Poirier
I have a MON that at some point lost connectivity to the rest of the
cluster and now cannot rejoin.

Each time I restart it, it looks like it's attempting to create a new MON
and join the cluster, but the rest of the cluster rejects it, because the
new one isn't in the monmap.

I don't know why it suddenly decided it needed to be a new MON.

I am not really sure where to start.

root@ceph-mon-2003:/var/log/ceph# ceph -s
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
 health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2
pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests
are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1
ceph-mon-2001,ceph-mon-2002
 monmap e2: 3 mons at {ceph-mon-2001=
10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0},
election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002

Notice ceph-mon-2003:6800

If I try to start ceph-mon-all, it will be listening on some other port...

root@ceph-mon-2003:/var/log/ceph# start ceph-mon-all
ceph-mon-all start/running
root@ceph-mon-2003:/var/log/ceph# ps -ef | grep ceph
root  6930 1 31 15:49 ?00:00:00 /usr/bin/ceph-mon
--cluster=ceph -i ceph-mon-2003 -f
root  6931 1  3 15:49 ?00:00:00 python
/usr/sbin/ceph-create-keys --cluster=ceph -i ceph-mon-2003

root@ceph-mon-2003:/var/log/ceph# ceph -s
2014-02-04 15:49:56.854866 7f9cf422d700  0 -- :/1007028 >>
10.30.66.15:6789/0 pipe(0x7f9cf0021370 sd=3 :0 s=1 pgs=0 cs=0 l=1
c=0x7f9cf00215d0).fault
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
 health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2
pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests
are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1
ceph-mon-2001,ceph-mon-2002
 monmap e2: 3 mons at {ceph-mon-2001=
10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0},
election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002

Suggestions?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW S3 API - Bucket Versions

2014-01-24 Thread Greg Poirier
On Fri, Jan 24, 2014 at 4:28 PM, Yehuda Sadeh  wrote:

> For each object that rgw stores it keeps a version tag. However this
> version is not ascending, it's just used for identifying whether an
> object has changed. I'm not completely sure what problem you're trying
> to solve, though.
>

We have two datacenters. I want to have two regions that are split across
both datacenters. Let's say us-west and us-east are our regions, us-east-1
would live in one datacenter and be the primary zone for the us-east region
while us-east-2 would live in the other datacenter and be secondary zone.
We then do the opposite for us-west.

What I was envisioning, I think will not work. For example:

- write object A.0 to bucket X in us-west-1 (master)
- us-west-1 (master) goes down.
- write a _new_ version of the object, A.1, to bucket X in us-west-2 (secondary)
- us-west-1 comes back up
- read object A.1 from us-west-1

The idea being that if you are versioning objects, you are never updating
them, so it doesn't matter that the copy of the object that is now in
us-west-1 is read-only.

I'm not even sure if this is an accurate description of how replication
operates, but I thought I'd discussed a master-master scenario with someone
who said this _might_ be possible... assuming you had versioned objects.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RadosGW S3 API - Bucket Versions

2014-01-23 Thread Greg Poirier
Hello!

I have a great deal of interest in the ability to version objects in
buckets via the S3 API. Where is this on the roadmap for Ceph?

This is a pretty useful feature during failover scenarios between zones in
a region. For instance, take the example where you have a region with two
zones:

us-east-1
us-west-1

In the us region, east is master, west is secondary.

In the case where us-east-1 fails, we want to fail over to us-west-1. Let us
assume that we have versioned all of our objects in a bucket. This means
that every version of an object should map to a different object in Ceph.
So, in the case of a us-east-1 failure, clients could still write new
versions of an object to us-west-1 until us-east-1 becomes available again.

I am working under the assumption that after the restoration of service of
us-east-1, objects written to us-west-1 will replicate to us-east-1,
however. Is that the case?

I realize that this is just a convenience and that we could embed a version
number in our object names somehow, but that is somewhat less... clean?
Plus we cannot re-use code that was written against S3. :(

Thoughts, information?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 1MB/s throughput to 33-ssd test cluster

2013-12-09 Thread Greg Poirier
On Sun, Dec 8, 2013 at 8:33 PM, Mark Kirkwood  wrote:
>
> I'd suggest testing the components separately - try to rule out NIC (and
> switch) issues and SSD performance issues, then when you are sure the bits
> all go fast individually test how ceph performs again.
>
> What make and model of SSD? I'd check that the firmware is up to date
> (sometimes makes a huge difference). I'm also wondering if you might get
> better performance by having (say) 7 osds and using 4 of the SSD for
> journals for them.


Thanks, Mark.

In my haste, I left out part of a paragraph... probably really a whole
paragraph... that contains a pretty crucial detail.

I had previously run rados bench on this hardware with some success
(24-26MBps throughput w/ 4k blocks).

ceph osd bench looks great.

iperf on the network looks great.

After my last round of testing (with a few aborted rados bench tests), I
deleted the pool and recreated it (same name, crush ruleset, pg num, size,
etc). That is when I started to notice the degraded performance.
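
For anyone wanting to repeat the component checks mentioned above, a rough
sketch (host and pool names are the ones from this cluster; adjust as
needed):

    # per-OSD disk/journal throughput
    ceph tell osd.0 bench

    # raw network throughput between storage nodes
    iperf -s                     # on ssd-1002
    iperf -c ssd-1002 -P 32      # on ssd-1001

    # the rados-level test again, against the rebuilt pool
    rados -p volumes bench 60 write -b 4096 -t 32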
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 1MB/s throughput to 33-ssd test cluster

2013-12-08 Thread Greg Poirier
Hi.

So, I have a test cluster made up of ludicrously overpowered machines with
nothing but SSDs in them. Bonded 10Gbps NICs (802.3ad layer 2+3 xmit hash
policy, confirmed ~19.8 Gbps throughput with 32+ threads). I'm running
rados bench, and I am currently getting less than 1 MBps throughput:

sudo rados -N `hostname` bench 600 write -b 4096 -p volumes --no-cleanup -t
32 > bench_write_4096_volumes_1_32.out 2>&1


Colocated journals on the same disk, so I'm not expecting optimum
throughput, but previous tests on spinning disks have shown reasonable
speeds (23MB/s, 4000-6000 iops) as opposed to the 150-450 iops I'm
currently getting.

ceph_deploy@ssd-1001:~$ sudo ceph -s
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
 health HEALTH_WARN clock skew detected on mon.ssd-1003
 monmap e1: 3 mons at {ssd-1001=
10.20.69.101:6789/0,ssd-1002=10.20.69.102:6789/0,ssd-1003=10.20.69.103:6789/0},
election epoch 20, quorum 0,1,2 ssd-1001,ssd-1002,ssd-1003
 osdmap e344: 33 osds: 33 up, 33 in
  pgmap v10600: 1650 pgs, 6 pools, 289 MB data, 74029 objects
466 GB used, 17621 GB / 18088 GB avail
1650 active+clean
  client io 1263 kB/s wr, 315 op/s

ceph_deploy@ssd-1001:~$ sudo ceph osd tree
# id weight type name up/down reweight
-1 30.03 root default
-2 10.01 host ssd-1001
0 0.91 osd.0 up 1
1 0.91 osd.1 up 1
2 0.91 osd.2 up 1
3 0.91 osd.3 up 1
4 0.91 osd.4 up 1
5 0.91 osd.5 up 1
6 0.91 osd.6 up 1
7 0.91 osd.7 up 1
8 0.91 osd.8 up 1
9 0.91 osd.9 up 1
10 0.91 osd.10 up 1
-3 10.01 host ssd-1002
11 0.91 osd.11 up 1
12 0.91 osd.12 up 1
13 0.91 osd.13 up 1
14 0.91 osd.14 up 1
15 0.91 osd.15 up 1
16 0.91 osd.16 up 1
17 0.91 osd.17 up 1
18 0.91 osd.18 up 1
19 0.91 osd.19 up 1
20 0.91 osd.20 up 1
21 0.91 osd.21 up 1
-4 10.01 host ssd-1003
22 0.91 osd.22 up 1
23 0.91 osd.23 up 1
24 0.91 osd.24 up 1
25 0.91 osd.25 up 1
26 0.91 osd.26 up 1
27 0.91 osd.27 up 1
28 0.91 osd.28 up 1
29 0.91 osd.29 up 1
30 0.91 osd.30 up 1
31 0.91 osd.31 up 1
32 0.91 osd.32 up 1

The clock skew error can safely be ignored. It's something like 2-3 ms
skew, I just haven't bothered configuring away the warning.

This is with a newly-created pool after deleting the last pool used for
testing.

Any suggestions on where to start debugging?

thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] near full osd

2013-11-05 Thread Greg Chavez
Erik, it's utterly non-intuitive and I'd love an explanation other than the
one I've provided.  Nevertheless, the OSDs on my slower PE2970 nodes fill
up much faster than those on HP585s or Dell R820s.  I've handled this by
dropping priorities and, in a couple of cases, outing or removing the OSD.

Kevin, generally speaking, the OSDs that fill up on me are the same ones.
 Once I lower the weights, they stay low or they fill back up again within
days or hours of re-raising the weight.  Please try to lift them up though,
maybe you'll have better luck than me.

--Greg


On Tue, Nov 5, 2013 at 11:30 AM, Kevin Weiler
wrote:

>   All of the disks in my cluster are identical and therefore all have the
> same weight (each drive is 2TB and the automatically generated weight is
> 1.82 for each one).
>
>  Would the procedure here be to reduce the weight, let it rebal, and then
> put the weight back to where it was?
>
>
>  --
>
> *Kevin Weiler*
>
> IT
>
>
>
> IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL
> 60606 | http://imc-chicago.com/
>
> Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: 
> *kevin.wei...@imc-chicago.com
> *
>
>   From: , Erik 
> Date: Tuesday, November 5, 2013 10:27 AM
> To: Greg Chavez , Kevin Weiler <
> kevin.wei...@imc-chicago.com>
> Cc: "ceph-users@lists.ceph.com" 
> Subject: RE: [ceph-users] near full osd
>
>   If there’s an underperforming disk, why on earth would *more* data be
> put on it?  You’d think it would be less….   I would think an
> *overperforming* disk should (desirably) cause that case, right?
>
>
>
> *From:* ceph-users-boun...@lists.ceph.com [
> mailto:ceph-users-boun...@lists.ceph.com]
> *On Behalf Of *Greg Chavez
> *Sent:* Tuesday, November 05, 2013 11:20 AM
> *To:* Kevin Weiler
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] near full osd
>
>
>
> Kevin, in my experience that usually indicates a bad or underperforming
> disk, or a too-high priority.  Try running "ceph osd crush reweight
> osd.<##> 1.0".  If that doesn't do the trick, you may want to just out that
> guy.
>
>
>
> I don't think the crush algorithm guarantees balancing things out in the
> way you're expecting.
>
>
>
> --Greg
>
> On Tue, Nov 5, 2013 at 11:11 AM, Kevin Weiler <
> kevin.wei...@imc-chicago.com> wrote:
>
> Hi guys,
>
>
>
> I have an OSD in my cluster that is near full at 90%, but we're using a
> little less than half the available storage in the cluster. Shouldn't this
> be balanced out?
>
>
>
> --
>
> *Kevin Weiler*
>
> IT
>
>
>
> IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL
> 60606 | http://imc-chicago.com/
>
> Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: 
> *kevin.wei...@imc-chicago.com
> *
>
>
>  --
>
>
> The information in this e-mail is intended only for the person or entity
> to which it is addressed.
>
> It may contain confidential and /or privileged material. If someone other
> than the intended recipient should receive this e-mail, he / she shall not
> be entitled to read, disseminate, disclose or duplicate it.
>
> If you receive this e-mail unintentionally, please inform us immediately
> by "reply" and then delete it from your system. Although this information
> has been compiled with great care, neither IMC Financial Markets & Asset
> Management nor any of its related entities shall accept any responsibility
> for any errors, omissions or other inaccuracies in this information or for
> the consequences thereof, nor shall it be bound in any way by the contents
> of this e-mail or its attachments. In the event of incomplete or incorrect
> transmission, please return the e-mail to the sender and permanently delete
> this message and any attachments.
>
> Messages and attachments are scanned for all known viruses. Always scan
> attachments before opening them.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
>
> The information in this e-mail is intended only for the person or entity
> to which it is addressed.
>
> It may contain confidential and /or privileged material. If someone other
> than the intended recipient should receive this e-mail, he / she shall not
> be entitled to read, disseminate, disclose or duplicate it.
>
> If you receive this e-mail unintentionally, please inform us immediately
> by "reply" and then delete it from your s

Re: [ceph-users] near full osd

2013-11-05 Thread Greg Chavez
Kevin, in my experience that usually indicates a bad or underperforming
disk, or a too-high priority.  Try running "ceph osd crush reweight
osd.<##> 1.0".  If that doesn't do the trick, you may want to just out that
guy.

I don't think the crush algorithm guarantees balancing things out in the
way you're expecting.


--Greg

On Tue, Nov 5, 2013 at 11:11 AM, Kevin Weiler
wrote:

>  Hi guys,
>
>  I have an OSD in my cluster that is near full at 90%, but we're using a
> little less than half the available storage in the cluster. Shouldn't this
> be balanced out?
>
>
>  --
>
> *Kevin Weiler*
>
> IT
>
>
>
> IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL
> 60606 | http://imc-chicago.com/
>
> Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: 
> *kevin.wei...@imc-chicago.com
> *
>
> --
>
> The information in this e-mail is intended only for the person or entity
> to which it is addressed.
>
> It may contain confidential and /or privileged material. If someone other
> than the intended recipient should receive this e-mail, he / she shall not
> be entitled to read, disseminate, disclose or duplicate it.
>
> If you receive this e-mail unintentionally, please inform us immediately
> by "reply" and then delete it from your system. Although this information
> has been compiled with great care, neither IMC Financial Markets & Asset
> Management nor any of its related entities shall accept any responsibility
> for any errors, omissions or other inaccuracies in this information or for
> the consequences thereof, nor shall it be bound in any way by the contents
> of this e-mail or its attachments. In the event of incomplete or incorrect
> transmission, please return the e-mail to the sender and permanently delete
> this message and any attachments.
>
> Messages and attachments are scanned for all known viruses. Always scan
> attachments before opening them.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG repair failing when object missing

2013-10-24 Thread Greg Farnum
I was also able to reproduce this, guys, but I believe it’s specific to the 
mode of testing rather than to anything being wrong with the OSD. In 
particular, after restarting the OSD whose file I removed and running repair, 
it did so successfully.
The OSD has an “fd cacher” which caches open file handles, and we believe this 
is what causes the observed behavior: if the removed object is among the most 
recent  objects touched, the FileStore (an OSD subsystem) has an open fd 
cached, so when manually deleting the file the FileStore now has a deleted file 
open. When the repair happens, it finds that open file descriptor and applies 
the repair to it — which of course doesn’t help put it back into place!
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
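
(A hedged sketch of that workaround, using the pg and OSD ids from Harry's test
below — restarting the OSD drops the cached fd, after which repair can restore
the object:)

    service ceph restart osd.2      # or `restart ceph-osd id=2` on upstart-managed nodes
    ceph pg repair 0.b
    ceph pg scrub 0.b               # re-scrub to confirm the pg stays active+clean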

On October 24, 2013 at 2:52:54 AM, Matt Thompson (watering...@gmail.com) wrote:
>
>Hi Harry,
>
>I was able to replicate this.
>
>What does appear to work (for me) is to do an osd scrub followed by a pg
>repair. I've tried this 2x now and in each case the deleted file gets
>copied over to the OSD from where it was removed. However, I've tried a
>few pg scrub / pg repairs after manually deleting a file and have yet to
>see the file get copied back to the OSD on which it was deleted. Like you
>said, the pg repair sets the health of the PG back to active+clean, but
>then re-running the pg scrub detects the file as missing again and sets it
>back to active+clean+inconsistent.
>
>Regards,
>Matt
>
>
>On Wed, Oct 23, 2013 at 3:45 PM, Harry Harrington wrote:
>
>> Hi,
>>
>> I've been taking a look at the repair functionality in ceph. As I
>> understand it the osds should try to copy an object from another member of
>> the pg if it is missing. I have been attempting to test this by manually
>> removing a file from one of the osds however each time the repair
>> completes the file has not been restored. If I run another scrub on the
>> pg it gets flagged as inconsistent. See below for the output from my
>> testing. I assume I'm missing something obvious, any insight into this
>> process would be greatly appreciated.
>>
>> Thanks,
>> Harry
>>
>> # ceph --version
>> ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
>> # ceph status
>> cluster a4e417fe-0386-46a5-4475-ca7e10294273
>> health HEALTH_OK
>> monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum
>> 0 ceph1
>> osdmap e13: 3 osds: 3 up, 3 in
>> pgmap v232: 192 pgs: 192 active+clean; 44 bytes data, 15465 MB used,
>> 164 GB / 179 GB avail
>> mdsmap e1: 0/0/1 up
>>
>> file removed from osd.2
>>
>> # ceph pg scrub 0.b
>> instructing pg 0.b on osd.1 to scrub
>>
>> # ceph status
>> cluster a4e417fe-0386-46a5-4475-ca7e10294273
>> health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>> monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum
>> 0 ceph1
>> osdmap e13: 3 osds: 3 up, 3 in
>> pgmap v233: 192 pgs: 191 active+clean, 1 active+clean+inconsistent; 44
>> bytes data, 15465 MB used, 164 GB / 179 GB avail
>> mdsmap e1: 0/0/1 up
>>
>> # ceph pg repair 0.b
>> instructing pg 0.b on osd.1 to repair
>>
>> # ceph status
>> cluster a4e417fe-0386-46a5-4475-ca7e10294273
>> health HEALTH_OK
>> monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum
>> 0 ceph1
>> osdmap e13: 3 osds: 3 up, 3 in
>> pgmap v234: 192 pgs: 192 active+clean; 44 bytes data, 15465 MB used,
>> 164 GB / 179 GB avail
>> mdsmap e1: 0/0/1 up
>>
>> # ceph pg scrub 0.b
>> instructing pg 0.b on osd.1 to scrub
>>
>> # ceph status
>> cluster a4e417fe-0386-46a5-4475-ca7e10294273
>> health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
>> monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum
>> 0 ceph1
>> osdmap e13: 3 osds: 3 up, 3 in
>> pgmap v236: 192 pgs: 191 active+clean, 1 active+clean+inconsistent; 44
>> bytes data, 15465 MB used, 164 GB / 179 GB avail
>> mdsmap e1: 0/0/1 up
>>
>>
>>
>> The logs from osd.1:
>> 2013-10-23 14:12:31.188281 7f02a5161700 0 log [ERR] : 0.b osd.2 missing
>> 3a643fcb/testfile1/head//0
>> 2013-10-23 14:12:31.188312 7f02a5161700 0 log [ERR] : 0.b scrub 1
>> missing, 0 inconsistent objects
>> 2013-10-23 14:12:31.188319 7f02a5161700 0 log [ERR] : 0.b scrub 1 errors
>> 2013-10-23 14:13:03.197802 7f02a5161700 0 log [ERR] : 0.b osd.2 missing
>> 3a643fcb/testfile1/head//0
>> 2013-10-23 14:13:03.197837 7f02a5161700 0 log [ERR] : 0.b repair 1
>> missing, 0 incon

[ceph-users] Cluster stuck at 15% degraded

2013-09-19 Thread Greg Chavez
We have an 84-osd cluster with volumes and images pools for OpenStack.  I
was having trouble with full osds, so I increased the pg count from the 128
default to 2700.  This balanced out the osds but the cluster is stuck at
15% degraded

http://hastebin.com/wixarubebe.dos

That's the output of ceph health detail.  I've never seen a pg with the
state active+remapped+wait_backfill+backfill_toofull.  Clearly I should
have increased the pg count more gradually, but here I am. I'm frozen,
afraid to do anything.

Any suggestions? Thanks.
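
(A possible stop-gap, assuming the disks genuinely have headroom: backfill_toofull
means the backfill targets are over osd_backfill_full_ratio, 0.85 by default, so
raising it slightly and draining the fullest OSDs may let backfill proceed — the
OSD id below is a placeholder:)

    ceph tell osd.3 injectargs '--osd-backfill-full-ratio 0.90'   # repeat, or loop, for each OSD
    ceph osd reweight 3 0.8                                       # temporary override to push data off a full OSD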

--Greg Chavez
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] librados pthread_create failure

2013-08-26 Thread Greg Poirier
Gregs are awesome, apparently. Thanks for the confirmation.

I know that threads are light-weight, it's just the first time I've ever
run into something that uses them... so liberally. ^_^


On Mon, Aug 26, 2013 at 10:07 AM, Gregory Farnum  wrote:

> On Mon, Aug 26, 2013 at 9:24 AM, Greg Poirier 
> wrote:
> > So, in doing some testing last week, I believe I managed to exhaust the
> > number of threads available to nova-compute last week. After some
> > investigation, I found the pthread_create failure and increased nproc for
> > our Nova user to, what I considered, a ridiculous 120,000 threads after
> > reading that librados will require a thread per osd, plus a few for
> > overhead, per VM on our compute nodes.
> >
> > This made me wonder: how many threads could Ceph possibly need on one of
> our
> > compute nodes.
> >
> > 32 cores * an overcommit ratio of 16, assuming each one is booted from a
> > Ceph volume, * 300 (approximate number of disks in our soon-to-go-live
> Ceph
> > cluster) = 153,600 threads.
> >
> > So this is where I started to put the truck in reverse. Am I right? What
> > about when we triple the size of our Ceph cluster? I could easily see a
> > future where we have easily 1,000 disks, if not many, many more in our
> > cluster. How do people scale this? Do you RAID to increase the density of
> > your Ceph cluster? I can only imagine that this will also drastically
> > increase the amount of resources required on my data nodes as well.
> >
> > So... suggestions? Reading?
>
> Your math looks right to me. So far though it hasn't caused anybody
> any trouble — Linux threads are much cheaper than people imagine when
> they're inactive. At some point we will certainly need to reduce the
> thread counts of our messenger (using epoll on a bunch of sockets
> instead of 2 threads -> 1 socket), but it hasn't happened yet.
> In terms of things you can do if this does become a problem, the most
> prominent is probably to (sigh) partition your cluster into pods on a
> per-rack basis or something. This is actually not as bad as it sounds
> since your network design probably would prefer not to send all writes
> through your core router, so if you create a pool for each rack and do
> something like this rack, next rack, next row for your replication you
> get better network traffic patterns.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] librados pthread_create failure

2013-08-26 Thread Greg Poirier
So, in doing some testing last week, I believe I managed to exhaust the
number of threads available to nova-compute last week. After some
investigation, I found the pthread_create failure and increased nproc for
our Nova user to, what I considered, a ridiculous 120,000 threads after
reading that librados will require a thread per osd, plus a few for
overhead, per VM on our compute nodes.
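
(For reference, a hedged example of making such a limit persistent via pam_limits —
the username is an assumption, and on RHEL-family systems
/etc/security/limits.d/90-nproc.conf may also cap nproc separately:)

    # /etc/security/limits.conf
    nova  soft  nproc  120000
    nova  hard  nproc  120000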

This made me wonder: how many threads could Ceph possibly need on one of
our compute nodes.

32 cores * an overcommit ratio of 16, assuming each one is booted from a
Ceph volume, * 300 (approximate number of disks in our soon-to-go-live Ceph
cluster) = 153,600 threads.

So this is where I started to put the truck in reverse. Am I right? What
about when we triple the size of our Ceph cluster? I could easily see a
future where we have easily 1,000 disks, if not many, many more in our
cluster. How do people scale this? Do you RAID to increase the density of
your Ceph cluster? I can only imagine that this will also drastically
increase the amount of resources required on my data nodes as well.

So... suggestions? Reading?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-23 Thread Greg Poirier
On Fri, Aug 23, 2013 at 9:53 AM, Gregory Farnum  wrote:

>
> Okay. It's important to realize that because Ceph distributes data
> pseudorandomly, each OSD is going to end up with about the same amount
> of data going to it. If one of your drives is slower than the others,
> the fast ones can get backed up waiting on the slow one to acknowledge
> writes, so they end up impacting the cluster throughput a
> disproportionate amount. :(
>
> Anyway, I'm guessing you have 24 OSDs from your math earlier?
> 47MB/s * 24 / 2 = 564MB/s
> 41MB/s * 24 / 2 = 492MB/s
>

33 OSDs and 3 hosts in the cluster.


> So taking out or reducing the weight on the slow ones might improve
> things a little. But that's still quite a ways off from what you're
> seeing — there are a lot of things that could be impacting this but
> there's probably something fairly obvious with that much of a gap.
> What is the exact benchmark you're running? What do your nodes look like?
>

The write benchmark I am running is Fio with the following configuration:

  ioengine: "libaio"
  iodepth: 16
  runtime: 180
  numjobs: 16
  - name: "128k-500M-write"
description: "128K block 500M write"
bs: "128K"
size: "500M"
rw: "write"

Sorry for the weird yaml formatting but I'm copying it from the config file
of my automation stuff.
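
(The same job expressed as a plain fio invocation, as a rough equivalent of the
settings above:)

    fio --name=128k-500M-write --ioengine=libaio --iodepth=16 --runtime=180 \
        --numjobs=16 --bs=128k --size=500M --rw=write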

I run that on powers of 2 VMs up to 32. Each VM is qemu-kvm with a 50 GB
RBD-backed Cinder volume attached. They are 2 VCPU, 4 GB RAM VMs.

The host machines are Dell C6220s, 16-core, hyperthreaded VMs, 128 GB RAM,
with bonded 10 Gbps NICs (mode 4, 20 Gbps throughput -- tested and verified
that's working correctly). There are 2 host machines with 16 VMs each.

The Ceph cluster is made up of Dell C6220s, same NIC setup, 256 GB RAM,
same CPU, 12 disks each (one for os, 11 for OSDs).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-23 Thread Greg Poirier
Ah thanks, Brian. I will do that. I was going off the wiki instructions on
performing rados benchmarks. If I have the time later, I will change it
there.
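
(The sequence being described, roughly — the pool name is a placeholder:)

    rados bench -p testpool 60 write --no-cleanup   # leave the benchmark objects in place
    rados bench -p testpool 60 seq                  # now seq has data to read back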


On Fri, Aug 23, 2013 at 9:37 AM, Brian Andrus wrote:

> Hi Greg,
>
>
>> I haven't had any luck with the seq bench. It just errors every time.
>>
>
> Can you confirm you are using the --no-cleanup flag with rados write? This
> will ensure there is actually data to read for subsequent seq tests.
>
> ~Brian
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Greg Poirier
On Thu, Aug 22, 2013 at 2:34 PM, Gregory Farnum  wrote:

> You don't appear to have accounted for the 2x replication (where all
>  writes go to two OSDs) in these calculations. I assume your pools have
>

Ah. Right. So I should then be looking at:

# OSDs * Throughput per disk / 2 / repl factor ?

Which makes 300-400 MB/s aggregate throughput actually sort of reasonable.
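
(Plugging in the numbers from this thread as a rough sanity check:

    33 OSDs * ~47 MB/s per-OSD bench / 2 (journal on same disk) / 2 (replication) ~= 388 MB/s

which lands right in the observed 300-400 MB/s range.)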


> size 2 (or 3?) for these tests. 3 would explain the performance
> difference entirely; 2x replication leaves it still a bit low but
> takes the difference down to ~350/600 instead of ~350/1200. :)
>

Yeah. We're doing 2x repl now, and haven't yet made the decision if we're
going to move to 3x repl or not.


> You mentioned that your average osd bench throughput was ~50MB/s;
> what's the range?


41.9 - 54.7 MB/s

The actual average is 47.1 MB/s


> Have you run any rados bench tests?


Yessir.

rados bench write:

2013-08-23 00:18:51.933594min lat: 0.071682 max lat: 1.77006 avg lat:
0.196411
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   900  14 73322 73308   325.764   316   0.13978  0.196411
 Total time run: 900.239317
Total writes made:  73322
Write size: 4194304
Bandwidth (MB/sec): 325.789

Stddev Bandwidth:   35.102
Max bandwidth (MB/sec): 440
Min bandwidth (MB/sec): 0
Average Latency:0.196436
Stddev Latency: 0.121463
Max latency:1.77006
Min latency:0.071682

I haven't had any luck with the seq bench. It just errors every time.



> What is your PG count across the cluster?
>

pgmap v18263: 1650 pgs: 1650 active+clean; 946 GB data, 1894 GB used,
28523 GB / 30417 GB avail; 498MB/s wr, 124op/s

Thanks again.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Greg Poirier
I should have also said that we experienced similar performance on
Cuttlefish. I have run identical benchmarks on both.


On Thu, Aug 22, 2013 at 2:23 PM, Oliver Daudey  wrote:

> Hey Greg,
>
> I encountered a similar problem and we're just in the process of
> tracking it down here on the list.  Try downgrading your OSD-binaries to
> 0.61.8 Cuttlefish and re-test.  If it's significantly faster on RBD,
> you're probably experiencing the same problem I have with Dumpling.
>
> PS: Only downgrade your OSDs.  Cuttlefish-monitors don't seem to want to
> start with a database that has been touched by a Dumpling-monitor and
> don't talk to them, either.
>
> PPS: I've also had OSDs no longer start with an assert while processing
> the journal during these upgrade/downgrade-tests, mostly when coming
> down from Dumpling to Cuttlefish.  If you encounter those, delete your
> journal and re-create with `ceph-osd -i  --mkjournal'.  Your
> data-store will be OK, as far as I can tell.
>
>
>Regards,
>
>  Oliver
>
> On do, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
> > I have been benchmarking our Ceph installation for the last week or
> > so, and I've come across an issue that I'm having some difficulty
> > with.
> >
> >
> > Ceph bench reports reasonable write throughput at the OSD level:
> >
> >
> > ceph tell osd.0 bench
> > { "bytes_written": 1073741824,
> >   "blocksize": 4194304,
> >   "bytes_per_sec": "47288267.00"}
> >
> >
> > Running this across all OSDs produces on average 50-55 MB/s, which is
> > fine with us. We were expecting around 100 MB/s / 2 (journal and OSD
> > on same disk, separate partitions).
> >
> >
> > What I wasn't expecting was the following:
> >
> >
> > I tested 1, 2, 4, 8, 16, 24, and 32 VMSs simultaneously writing
> > against 33 OSDs. Aggregate write throughput peaked under 400 MB/s:
> >
> >
> > 1  196.013671875
> > 2  285.8759765625
> > 4  351.9169921875
> > 8  386.455078125
> > 16 363.8583984375
> > 24 353.6298828125
> > 32 348.9697265625
> >
> >
> >
> > I was hoping to see something closer to # OSDs * Average value for
> > ceph bench (approximately 1.2 GB/s peak aggregate write throughput).
> >
> >
> > We're seeing excellent read, randread performance, but writes are a
> > bit of a bother.
> >
> >
> > Does anyone have any suggestions?
> >
> >
> > We have 20 Gb/s network
> > I used Fio w/ 16 thread concurrency
> > We're running Scientific Linux 6.4
> > 2.6.32 kernel
> > Ceph Dumpling 0.67.1-0.el6
> > OpenStack Grizzly
> > Libvirt 0.10.2
> > qemu-kvm 0.12.1.2-2.355.el6.2.cuttlefish
> >
> > (I'm using qemu-kvm from the ceph-extras repository, which doesn't
> > appear to have a -.dumpling version yet).
> >
> >
> > Thanks very much for any assistance.
> >
> >
> > Greg
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Greg Poirier
I have been benchmarking our Ceph installation for the last week or so, and
I've come across an issue that I'm having some difficulty with.

Ceph bench reports reasonable write throughput at the OSD level:

ceph tell osd.0 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": "47288267.00"}

Running this across all OSDs produces on average 50-55 MB/s, which is fine
with us. We were expecting around 100 MB/s / 2 (journal and OSD on same
disk, separate partitions).
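
(The per-OSD figure above can be collected with a simple loop, assuming OSD ids
0 through 32:)

    for i in $(seq 0 32); do ceph tell osd.$i bench; done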

What I wasn't expecting was the following:

I tested 1, 2, 4, 8, 16, 24, and 32 VMSs simultaneously writing against 33
OSDs. Aggregate write throughput peaked under 400 MB/s:

VMs  Aggregate MB/s
 1   196.013671875
 2   285.8759765625
 4   351.9169921875
 8   386.455078125
16   363.8583984375
24   353.6298828125
32   348.9697265625

I was hoping to see something closer to # OSDs * Average value for ceph
bench (approximately 1.2 GB/s peak aggregate write throughput).

We're seeing excellent read, randread performance, but writes are a bit of
a bother.

Does anyone have any suggestions?

We have 20 Gb/s network
I used Fio w/ 16 thread concurrency
We're running Scientific Linux 6.4
2.6.32 kernel
Ceph Dumpling 0.67.1-0.el6
OpenStack Grizzly
Libvirt 0.10.2
qemu-kvm 0.12.1.2-2.355.el6.2.cuttlefish
(I'm using qemu-kvm from the ceph-extras repository, which doesn't appear
to have a -.dumpling version yet).

Thanks very much for any assistance.

Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Defective ceph startup script

2013-07-31 Thread Greg Chavez
I'm having trouble following this whole thread.  Two questions:

>
>  - is there an upstart or sysvinit file in your /var/lib/ceph/mon/* dirs?
>

upstart.  My first mistake was that I was trying to use both upstart and
sysv, depending on what directions I was following from the list, IRC, or
site documentation.  I should have only been using upstart, but that didn't
matter because the sockets were missing.


>  - are the daemons defined in [mon.xxx] sections in ceph.conf?
>
>
No they are not because I instantiated everything with ceph-deploy.


> That will control whether it is sysvinit or upstart that should be doing
> the restart.
>
> (And note that either way, upgrading the package doesn't restart the
> daemons for you.)


Which is probably good, especially if you are running osd and mon on the
same hosts.




> On Wed, 31 Jul 2013, Greg Chavez wrote:
> > I am running on Ubuntu 13.04.
> >
> > There is something amiss with /etc/init.d/ceph on all of my ceph nodes.
> > I was upgrading to 0.61.7 from what I *thought* was 0.61.5 today when I
> > realized that "service ceph-all restart" wasn't actually doing anything.
>  I
>
> I've only done '[stop|start|restart] ceph-all', or 'service ceph
> [stop|start|restart]'.  I'm not sure what the service command is supposed
> to do with upstart jobs.
>
> sage
>
>
>
> > saw nothing in /var/log/ceph.log - it just kept printing pg statuses -
> and
> > the PIDs of the osd and mon daemons did not change.  Stops failed as
> well.
> >
> > Then, when I tried to do individual osd restarts like this:
> >
> >   root@kvm-cs-sn-14i:/var/lib/ceph/osd# service ceph -v status
> >   osd.10
> > /etc/init.d/ceph: osd.10 not found (/etc/ceph/ceph.conf defines ,
> > /var/lib/ceph defines )
> >
> >
> > Despite the fact that I have this directory: /var/lib/ceph/osd/ceph-10/.
> >
> > I have the same issue with mon restarts:
> >
> >   root@kvm-cs-sn-14i:/var/lib/ceph/mon# ls
> > ceph-kvm-cs-sn-14i
> >
> > root@kvm-cs-sn-14i:/var/lib/ceph/mon# service ceph -v status
> > mon.kvm-cs-sn-14i
> > /etc/init.d/ceph: mon.kvm-cs-sn-14i not found (/etc/ceph/ceph.conf
> > defines , /var/lib/ceph defines )
> >
> >
> > I'm very worried that I have all my packages at  0.61.7 while my osd and
> mon
> > daemons could be running as old as  0.61.1!
> > Can anyone help me figure this out?  Thanks.
> >
> >
> > --
> > \*..+.-
> > --Greg Chavez
> > +//..;};
> >
> >
>



-- 
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Defective ceph startup script

2013-07-31 Thread Greg Chavez
After I did what Eric Eastman, suggested, my mon and osd sockets showed up
in /var/run/ceph:

root@kvm-cs-sn-10i:/etc/ceph# ls /var/run/ceph/
ceph-osd.0.asok  ceph-osd.1.asok  ceph-osd.2.asok  ceph-osd.3.asok
 ceph-osd.4.asok  ceph-osd.5.asok  ceph-osd.6.asok  ceph-osd.7.asok
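
(With the admin sockets back, one way to confirm which version each daemon is
actually running is to query the socket directly, e.g.:)

    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version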

However, while the osd daemons came back on line, the mon did not.  As it
happened, the cause for it is in another thread from today
(Subject: Problem with MON after reboot).  The solution is to upgrade and
restart the other mon nodes.  This worked.

Now the status/stop/start  commands work each and every time.  Somewhere
along the line this got goofed up and the osd and mon sockets either
weren't created or were deleted.  I started my cluster with a devel version
of cuttlefish, so who knows?

Craig, that's good advice re: starting the mon daemons first, but this is
no good if the sockets are missing from /var/run/ceph.  I'll keep on eye on
these directories moving forward to make sure they don't get lost again.

Thanks everyone for their help. Now I hope to engage in some drama free
upgrading on my osd-only nodes.  Ceph is great!


On Wed, Jul 31, 2013 at 4:31 PM, Craig Lewis wrote:

> You do need to use the stop script, not service stop. If you use service
> stop, Upstart will restart the service.  It's ok for start and restart,
> because that's what you want anyway, but service stop is effectively a
> restart.
>
> I wouldn't recommend doing stop ceph-all and start ceph-all after an
> upgrade anyway, at least not with the latest 0.61 upgrades.  Due to the MON
> issues between 61.4, 61.5, and 61.6, it seemed safer to follow the major
> version upgrade procedure (http://ceph.com/docs/next/install/upgrading-ceph/).
>  So I've been restarting MON on all nodes, then all OSDs on all nodes, then
> the remaining services.
>
> That said, stop ceph-all should stop all the daemons.  I just wouldn't
> use this upgrade procedure.
>
>
>
>
>> On all my cluster nodes to upgrade from 0.61.5 to 0.61.7 and then noticed
>> that some of my systems did not restart all the daemons.  I tried:
>>
>> stop ceph-all
>> start ceph-all
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Defective ceph startup script

2013-07-31 Thread Greg Chavez
Blast and gadzooks.  This is a bug then.

What's worse is that on three of my mon nodes have anything in
/var/run/ceph.  The directory is empty!  I can't believe I've basically
been running a busy ceph cluster for the last month.

I'll try what you suggested, thank you.


On Wed, Jul 31, 2013 at 3:48 PM, Eric Eastman  wrote:

> Hi Greg,
>
> I saw about the same thing on Ubuntu 13.04 as you did. I used
>
> apt-get -y update
> apt-get -y upgrade
>
> On all my cluster nodes to upgrade from 0.61.5 to 0.61.7 and then noticed
> that some of my systems did not restart all the daemons.  I tried:
>
> stop ceph-all
> start ceph-all
>
> On those nodes, but that did not kill all the old processes on
> the systems still running old daemons, so I ended up doing a:
>
> ps auxww | grep ceph
>
> On every node, and for any ceph process that was older then
> when I upgraded, I hand killed all the ceph processes on that
> node and then did a:
>
> start ceph-all
>
> Which seemed to fixed the issue.
>
> Eric
>
>
>
>> I am running on Ubuntu 13.04.
>>
>> There is something amiss with /etc/init.d/ceph on all of my ceph nodes.
>>
>> I was upgrading to 0.61.7 from what I *thought* was 0.61.5 today when I
>> realized that "service ceph-all restart" wasn't actually doing anything.
>> I saw nothing in /var/log/ceph.log - it just kept printing pg statuses -
>> and the PIDs of the osd and mon daemons did not change.  Stops failed
>> as well.
>>
>


-- 
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Production/Non-production segmentation

2013-07-31 Thread Greg Poirier
On Wed, Jul 31, 2013 at 12:19 PM, Mike Dawson wrote:

> Due to the speed of releases in the Ceph project, I feel having separate
> physical hardware is the safer way to go, especially in light of your
> mention of an SLA for your production services.
>

Ah. I guess I should offer a little more background as to what I mean by
production vs. non-production: customer-facing, and not.

We're using Ceph primarily for volume storage with OpenStack at the moment
and operate two OS clusters: one for all of our customer-facing services
(which require a higher SLA) and one for all of our internal services. The
idea being that all of the customer-facing stuff is segmented physically
from anything our developers might be testing internally.

What I'm wondering:

Does anyone else here do this?
If so, do you run multiple Ceph clusters?
Do you let Ceph sort itself out?
Can this be done with a single physical cluster, but multiple logical
clusters?
Should it be?

I know that, mathematically speaking, the larger your Ceph cluster is, the
more evenly distributed the load (thanks to CRUSH). I'm wondering if, in
practice, RBD can still create hotspots (say from a runaway service with
multiple instances and volumes that is suddenly doing a ton of IO). This
would increase IO latency across the Ceph cluster, I'd assume, and could
impact the performance of customer-facing services.

So, to some degree, physical segmentation makes sense to me. But can we
simply reserve some OSDs per physical host for a "production" logical
cluster and then use the rest for the "development" logical cluster
(separate MON clusters for each, but all running on the same hardware). Or,
given a sufficiently large cluster, is this not even a concern?
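
(One hedged way to get that "reserved OSDs" layout on shared hardware is a separate
CRUSH root plus a rule and pool bound to it — the bucket, rule, rack and pool names
below are placeholders, and create-simple may need a reasonably recent release:)

    ceph osd crush add-bucket prod root
    ceph osd crush move rack1 root=prod              # or move individual hosts under the new root
    ceph osd crush rule create-simple prod-rule prod host
    ceph osd pool create prod-volumes 1024
    ceph osd pool set prod-volumes crush_ruleset 1   # use the ruleset id reported for prod-rule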

I'm also interested in hearing about experience using CephFS, Swift, and
RBD all on a single cluster or if people have chosen to use multiple
clusters for these as well. For example, if you need faster volume storage
in RBD, so you go for more spindles and smaller disks vs. larger disks with
fewer spindles for object storage, which can have a higher allowance for
latency than volume storage.

A separate non-production cluster will allow you to test and validate new
> versions (including point releases within a stable series) before you
> attempt to upgrade your production cluster.
>

Oh yeah. I'm doing that for sure.

Thanks,

Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Defective ceph startup script

2013-07-31 Thread Greg Chavez
I am running on Ubuntu 13.04.

There is something amiss with /etc/init.d/ceph on all of my ceph nodes.

I was upgrading to 0.61.7 from what I *thought* was 0.61.5 today when I
realized that "service ceph-all restart" wasn't actually doing anything.  I
saw nothing in /var/log/ceph.log - it just kept printing pg statuses - and
the PIDs of the osd and mon daemons did not change.  Stops failed as well.

Then, when I tried to do individual osd restarts like this:

root@kvm-cs-sn-14i:/var/lib/ceph/osd# service ceph -v status osd.10
/etc/init.d/ceph: osd.10 not found (/etc/ceph/ceph.conf defines ,
/var/lib/ceph defines )


Despite the fact that I have this directory: /var/lib/ceph/osd/ceph-10/.

I have the same issue with mon restarts:

root@kvm-cs-sn-14i:/var/lib/ceph/mon# ls
ceph-kvm-cs-sn-14i

root@kvm-cs-sn-14i:/var/lib/ceph/mon# service ceph -v status
mon.kvm-cs-sn-14i
/etc/init.d/ceph: mon.kvm-cs-sn-14i not found (/etc/ceph/ceph.conf defines
, /var/lib/ceph defines )


I'm very worried that I have all my packages at  0.61.7 while my osd and
mon daemons could be running as old as  0.61.1!

Can anyone help me figure this out?  Thanks.


-- 
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Production/Non-production segmentation

2013-07-31 Thread Greg Poirier
Does anyone here have multiple clusters or segment their single cluster in
such a way as to try to maintain different SLAs for production vs
non-production services?

We have been toying with the idea of running separate clusters (on the same
hardware, but reserve a portion of the OSDs for the production cluster),
but I'd rather have a single cluster in order to more evenly distribute
load across all of the spindles.

Thoughts or observations from people with Ceph in production would be
greatly appreciated.

Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is this HEALTH_WARN indicating?

2013-07-25 Thread Greg Chavez
Any idea how we tweak this?  If I want to keep my ceph node root
volume at 85% used, that's my business, man.

Thanks.

--Greg

On Mon, Jul 8, 2013 at 4:27 PM, Mike Bryant  wrote:
> Run "ceph health detail" and it should give you more information.
> (I'd guess an osd or mon has a full hard disk)
>
> Cheers
> Mike
>
> On 8 July 2013 21:16, Jordi Llonch  wrote:
>> Hello,
>>
>> I am testing ceph using ubuntu raring with ceph version 0.61.4
>> (1669132fcfc27d0c0b5e5bb93ade59d147e23404) on 3 virtualbox nodes.
>>
>> What is this HEALTH_WARN indicating?
>>
>> # ceph -s
>>health HEALTH_WARN
>>monmap e3: 3 mons at
>> {node1=192.168.56.191:6789/0,node2=192.168.56.192:6789/0,node3=192.168.56.193:6789/0},
>> election epoch 52, quorum 0,1,2 node1,node2,node3
>>osdmap e84: 3 osds: 3 up, 3 in
>> pgmap v3209: 192 pgs: 192 active+clean; 460 MB data, 1112 MB used, 135
>> GB / 136 GB avail
>>mdsmap e37: 1/1/1 up {0=node3=up:active}, 1 up:standby
>>
>>
>> Thanks,
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Mike Bryant | Systems Administrator | Ocado Technology
> mike.bry...@ocado.com | 01707 382148 | www.ocadotechnology.com
>
> --
> Notice:  This email is confidential and may contain copyright material of
> Ocado Limited (the "Company"). Opinions and views expressed in this message
> may not necessarily reflect the opinions and views of the Company.
>
> If you are not the intended recipient, please notify us immediately and
> delete all copies of this message. Please note that it is your
> responsibility to scan this message for viruses.
>
> Company reg. no. 3875000.
>
> Ocado Limited
> Titan Court
> 3 Bishops Square
> Hatfield Business Park
> Hatfield
> Herts
> AL10 9NE
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proplem about capacity when mount using CephFS?

2013-07-16 Thread Greg Chavez
Watching.  Thanks, Neil.

On Tue, Jul 16, 2013 at 12:43 PM, Neil Levine  wrote:
> This seems like a good feature to have. I've created
> http://tracker.ceph.com/issues/5642
>
> N
>
>
> On Tue, Jul 16, 2013 at 8:05 AM, Greg Chavez  wrote:
>>
>> This is interesting.  So there are no built-in ceph commands that can
>> calculate your usable space?  It just so happened that I was going to
>> try and figure that out today (new Openstack block cluster, 20TB total
>> capacity) by skimming through the documentation.  I figured that there
>> had to be a command that would do this.  Blast and gadzooks.
>>
>> On Tue, Jul 16, 2013 at 10:37 AM, Ta Ba Tuan  wrote:
>> >
>> > Thank Sage,
>> >
>> > tuantaba
>> >
>> >
>> > On 07/16/2013 09:24 PM, Sage Weil wrote:
>> >>
>> >> On Tue, 16 Jul 2013, Ta Ba Tuan wrote:
>> >>>
>> >>> Thanks  Sage,
>> >>> I wories about returned capacity when mounting CephFS.
>> >>> but when disk is full, capacity will showed 50% or 100% Used?
>> >>
>> >> 100%.
>> >>
>> >> sage
>> >>
>> >>>
>> >>> On 07/16/2013 11:01 AM, Sage Weil wrote:
>> >>>>
>> >>>> On Tue, 16 Jul 2013, Ta Ba Tuan wrote:
>> >>>>>
>> >>>>> Hi everyone.
>> >>>>>
>> >>>>> I have 83 osds, and every osds have same 2TB, (Capacity sumary is
>> >>>>> 166TB)
>> >>>>> I'm using replicate 3 for pools ('data','metadata').
>> >>>>>
>> >>>>> But when mounting Ceph filesystem from somewhere (using: mount -t
>> >>>>> ceph
>> >>>>> Monitor_IP:/ /ceph -o name=admin,secret=xx")
>> >>>>> then capacity sumary is showed "160TB"?, I used replicate 3 and I
>> >>>>> think
>> >>>>> that
>> >>>>> it must return 160TB/3=50TB?
>> >>>>>
>> >>>>> Filesystem         Size  Used Avail Use% Mounted on
>> >>>>> 192.168.32.90:/    160T  500G  156T   1% /tmp/ceph_mount
>> >>>>>
>> >>>>> Please, explain this  help me?
>> >>>>
>> >>>> statfs/df show the raw capacity of the cluster, not the usable
>> >>>> capacity.
>> >>>> How much data you can store is a (potentially) complex function of
>> >>>> your
>> >>>> CRUSH rules and replication layout.  If you store 1TB, you'll notice
>> >>>> the
>> >>>> available space will go down by about 2TB (if you're using the
>> >>>> default
>> >>>> 2x).
>> >>>>
>> >>>> sage
>> >>>
>> >>>
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> \*..+.-
>> --Greg Chavez
>> +//..;};
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>



-- 
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Proplem about capacity when mount using CephFS?

2013-07-16 Thread Greg Chavez
This is interesting.  So there are no built-in ceph commands that can
calculate your usable space?  It just so happened that I was going to
try and figure that out today (new Openstack block cluster, 20TB total
capacity) by skimming through the documentation.  I figured that there
had to be a command that would do this.  Blast and gadzooks.
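
(There is no single built-in command for it, but a rough approximation is raw free
space divided by the pool's replication factor, e.g.:

    rados df                             # raw space used / available across the cluster
    ceph osd dump | grep 'rep size'      # replication factor for each pool

so something like 160 TB raw / 3 replicas ~= 53 TB usable for the cluster in this
thread.)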

On Tue, Jul 16, 2013 at 10:37 AM, Ta Ba Tuan  wrote:
>
> Thank Sage,
>
> tuantaba
>
>
> On 07/16/2013 09:24 PM, Sage Weil wrote:
>>
>> On Tue, 16 Jul 2013, Ta Ba Tuan wrote:
>>>
>>> Thanks  Sage,
>>> I wories about returned capacity when mounting CephFS.
>>> but when disk is full, capacity will showed 50% or 100% Used?
>>
>> 100%.
>>
>> sage
>>
>>>
>>> On 07/16/2013 11:01 AM, Sage Weil wrote:
>>>>
>>>> On Tue, 16 Jul 2013, Ta Ba Tuan wrote:
>>>>>
>>>>> Hi everyone.
>>>>>
>>>>> I have 83 osds, and every osds have same 2TB, (Capacity sumary is
>>>>> 166TB)
>>>>> I'm using replicate 3 for pools ('data','metadata').
>>>>>
>>>>> But when mounting Ceph filesystem from somewhere (using: mount -t ceph
>>>>> Monitor_IP:/ /ceph -o name=admin,secret=xx")
>>>>> then capacity sumary is showed "160TB"?, I used replicate 3 and I think
>>>>> that
>>>>> it must return 160TB/3=50TB?
>>>>>
>>>>> Filesystem         Size  Used Avail Use% Mounted on
>>>>> 192.168.32.90:/    160T  500G  156T   1% /tmp/ceph_mount
>>>>>
>>>>> Please, explain this  help me?
>>>>
>>>> statfs/df show the raw capacity of the cluster, not the usable capacity.
>>>> How much data you can store is a (potentially) complex function of your
>>>> CRUSH rules and replication layout.  If you store 1TB, you'll notice the
>>>> available space will go down by about 2TB (if you're using the default
>>>> 2x).
>>>>
>>>> sage
>>>
>>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD recovery stuck

2013-06-27 Thread Greg Chavez
We set up a small ceph cluster of three nodes on top of an OpenStack
deployment of three nodes (that is, each compute node was also an
OSD/MON node).  Worked great until we started to expand the ceph
cluster once the OSDs started to fill up.  I added 4 OSDs two days ago
and the recovery went smoothly.  I added another four last night, but
the recovery is stuck:

root@kvm-sn-14i:~# ceph -s
   health HEALTH_WARN 22 pgs backfill_toofull; 19 pgs degraded; 1 pgs
recovering; 23 pgs stuck unclean; recovery 157614/1775814 degraded
(8.876%);  recovering 2 o/s, 8864KB/s; 1 near full osd(s)
   monmap e1: 3 mons at
{kvm-cs-sn-10i=192.168.241.110:6789/0,kvm-cs-sn-14i=192.168.241.114:6789/0,kvm-cs-sn-15i=192.168.241.115:6789/0},
election epoch 42, quorum 0,1,2
kvm-cs-sn-10i,kvm-cs-sn-14i,kvm-cs-sn-15i
   osdmap e512: 30 osds: 27 up, 27 in
pgmap v1474651: 448 pgs: 425 active+clean, 1
active+recovering+remapped, 3 active+remapped+backfill_toofull, 11
active+degraded+backfill_toofull, 8
active+degraded+remapped+backfill_toofull; 3414 GB data, 6640 GB used,
7007 GB / 13647 GB avail; 0B/s rd, 2363B/s wr, 0op/s; 157614/1775814
degraded (8.876%);  recovering 2 o/s, 8864KB/s
   mdsmap e1: 0/0/1 up

Even after restarting the OSDs, it hangs at 8.876%.  Consequently,
many of our virts have crashed.

I'm hoping someone on this list can provide some suggestions.
Otherwise, I may have to blow this up.  Thanks!

--
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph on mixed AMD/Intel architecture

2013-06-26 Thread Greg Chavez
I could have sworn that I read somewhere, very early on in my
investigation of Ceph, that your OSDs need to run on the same processor
architecture.  Only it suddenly occurred to me that for the last
month, I've been running a small 3-node cluster with two Intel
systems and one AMD system.  I thought they were all AMD!

So... is this a problem?  It seems to be running well.

Thanks.

--
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image copying

2013-05-14 Thread Greg

On 14/05/2013 13:00, Wolfgang Hennerbichler wrote:


On 05/14/2013 12:16 PM, Greg wrote:
...

And try to mount :

root@client2:~# mount /dev/rbd1 /mnt/
root@client2:~# umount /mnt/
root@client2:~# mount /dev/rbd2 /mnt/
mount: you must specify the filesystem type

What strikes me is the copy is superfast and I'm in a pool in format 1
which, as far as I understand, is not supposed to support copy-on-write.

I tried listing the pool (with rados tool) and it show the p2b16.rbd
file is there but no rb.0.X.Y.offset is present for p2b16 (name can be
found from p2b16.rbd), while there is for p1b16.

Did I not understand the copy mechanism ?

you sure did understand it the way it is supposed to be. something's
wrong here. what happens if you dd bs=1024 count=1 | hexdump your
devices, do you see differences there? is your cluster healthy?


Wolfgang,

after a copy, there is an "index" file (.rbd file) but no "data" file.

When I map the block device, I can read/write from/to it, when writing 
the "data" files are created and I can read them back.


Regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image copying

2013-05-14 Thread Greg

Ok now the copy is done to the right pool but the data isn't there.
I mapped both the original and the copy to try and compare :

root@client2:~# rbd showmapped
id pool image snap device
1  sp   p1b16 -/dev/rbd1
2  sp   p2b16 -/dev/rbd2

And try to mount :

root@client2:~# mount /dev/rbd1 /mnt/
root@client2:~# umount /mnt/
root@client2:~# mount /dev/rbd2 /mnt/
mount: you must specify the filesystem type


What strikes me is the copy is superfast and I'm in a pool in format 1 
which, as far as I understand, is not supposed to support copy-on-write.


I tried listing the pool (with rados tool) and it show the p2b16.rbd 
file is there but no rb.0.X.Y.offset is present for p2b16 (name can be 
found from p2b16.rbd), while there is for p1b16.


Did I not understand the copy mechanism ?

Thanks!

On 14/05/2013 11:52, Wolfgang Hennerbichler wrote:

Hi,

I believe this went into the pool named 'rbd'.

if you rbd copy it's maybe easier to do it with explicit destination
pool name:

rbd cp sp/p1b16 sp/p2b16

hth
wolfgang

On 05/14/2013 11:47 AM, Greg wrote:

Hello,

I found some oddity when attempting to copy an rbd image in my pool
(using bobtail 0.56.4), please see this :

I have a built working RBD image name p1b16 :

root@nas16:~# rbd -p sp ls
p1b16

Copying image :

root@nas16:~# rbd -p sp cp p1b16 p2b16
Image copy: 100% complete...done.

Great, seems to go fine, it went superfast (a few seconds), let's check :

root@nas16:~# rbd -p sp ls
p1b16

Uh ? let's try again :

root@nas16:~# rbd -p sp cp p1b16 p2b16
2013-05-14 09:30:42.369917 400b8000 -1 Image copy: 0%
complete...failed.librbd: rbd image p2b16 already exists
rbd: copy failed:
(17) File exists
2013-05-14 09:30:42.369969 400b8000 -1 librbd: header creation failed

Doh! Really ?

root@nas16:~# rbd -p sp ls
p1b16

Hmmm, something hidden? let's try to restart :

root@nas16:~# rbd -p sp rm p2b16
2013-05-14 09:30:19.445336 400c7000 -1 librbd::ImageCtx: error finding
header: (2) No such file or directory
2013-05-14 09:30:19.644381 400c7000 -1 Removing image: librbd: error
removing img from new-style directory: (2) No such file or directory0%
complete...failed.

rbd: delete error: (2) No such file or directory

Damned, let's see at rados level :

root@nas16:~# rados -p sp ls | grep -v "rb\\."
p1b16.rbd
rbd_directory

I downloaded rbd_directory file and took a look inside, I see p1b16
(along with binary data) but no trace of p2b16

I must have missed something somewhere...

Cheers,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image copying

2013-05-14 Thread Greg

Wolfgang,

you are perfectly right, the '-p' switch only applies to the source 
image, this is subtle !


Thanks a lot.

On 14/05/2013 11:52, Wolfgang Hennerbichler wrote:

Hi,

I believe this went into the pool named 'rbd'.

if you rbd copy it's maybe easier to do it with explicit destination
pool name:

rbd cp sp/p1b16 sp/p2b16

hth
wolfgang

On 05/14/2013 11:47 AM, Greg wrote:

Hello,

I found some oddity when attempting to copy an rbd image in my pool
(using bobtail 0.56.4), please see this :

I have a built working RBD image name p1b16 :

root@nas16:~# rbd -p sp ls
p1b16

Copying image :

root@nas16:~# rbd -p sp cp p1b16 p2b16
Image copy: 100% complete...done.

Great, seems to go fine, it went superfast (a few seconds), let's check :

root@nas16:~# rbd -p sp ls
p1b16

Uh ? let's try again :

root@nas16:~# rbd -p sp cp p1b16 p2b16
2013-05-14 09:30:42.369917 400b8000 -1 Image copy: 0%
complete...failed.librbd: rbd image p2b16 already exists
rbd: copy failed:
(17) File exists
2013-05-14 09:30:42.369969 400b8000 -1 librbd: header creation failed

Doh! Really ?

root@nas16:~# rbd -p sp ls
p1b16

Hmmm, something hidden? let's try to restart :

root@nas16:~# rbd -p sp rm p2b16
2013-05-14 09:30:19.445336 400c7000 -1 librbd::ImageCtx: error finding
header: (2) No such file or directory
2013-05-14 09:30:19.644381 400c7000 -1 Removing image: librbd: error
removing img from new-style directory: (2) No such file or directory0%
complete...failed.

rbd: delete error: (2) No such file or directory

Damned, let's see at rados level :

root@nas16:~# rados -p sp ls | grep -v "rb\\."
p1b16.rbd
rbd_directory

I downloaded rbd_directory file and took a look inside, I see p1b16
(along with binary data) but no trace of p2b16

I must have missed something somewhere...

Cheers,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD image copying

2013-05-14 Thread Greg

Hello,

I found some oddity when attempting to copy an rbd image in my pool 
(using bobtail 0.56.4), please see this :


I have a built working RBD image name p1b16 :

root@nas16:~# rbd -p sp ls
p1b16

Copying image :

root@nas16:~# rbd -p sp cp p1b16 p2b16
Image copy: 100% complete...done.

Great, seems to go fine, it went superfast (a few seconds), let's check :

root@nas16:~# rbd -p sp ls
p1b16

Uh ? let's try again :

root@nas16:~# rbd -p sp cp p1b16 p2b16
2013-05-14 09:30:42.369917 400b8000 -1 Image copy: 0% 
complete...failed.librbd: rbd image p2b16 already exists

rbd: copy failed:
(17) File exists
2013-05-14 09:30:42.369969 400b8000 -1 librbd: header creation failed

Doh! Really ?

root@nas16:~# rbd -p sp ls
p1b16

Hmmm, something hidden? let's try to restart :

root@nas16:~# rbd -p sp rm p2b16
2013-05-14 09:30:19.445336 400c7000 -1 librbd::ImageCtx: error finding 
header: (2) No such file or directory
2013-05-14 09:30:19.644381 400c7000 -1 Removing image: librbd: error 
removing img from new-style directory: (2) No such file or directory0% 
complete...failed.


rbd: delete error: (2) No such file or directory

Damned, let's see at rados level :

root@nas16:~# rados -p sp ls | grep -v "rb\\."
p1b16.rbd
rbd_directory
I downloaded rbd_directory file and took a look inside, I see p1b16 
(along with binary data) but no trace of p2b16


I must have missed something somewhere...

Cheers,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD vs RADOS benchmark performance

2013-05-13 Thread Greg

On 13/05/2013 17:01, Gandalf Corvotempesta wrote:

2013/5/13 Greg :

thanks a lot for pointing this out, it indeed makes a *huge* difference !

# dd if=/mnt/t/1 of=/dev/zero bs=4M count=100

100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 5.12768 s, 81.8 MB/s

(caches dropped before each test of course)

What if you set 1024 or greater value ?
Is bandwidth relative to the read ahead size?
Setting the value too high degrades performance, especially random IO 
performance.

You have to determine  the right choice for your usage.
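
(Concretely, for a mapped RBD device — the setting is per block device and does not
survive re-mapping or a reboot:)

    echo 512 > /sys/block/rbd1/queue/read_ahead_kb
    cat /sys/block/rbd1/queue/read_ahead_kb     # verify; the default is 128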

Cheers,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD vs RADOS benchmark performance

2013-05-13 Thread Greg

On 13/05/2013 15:55, Mark Nelson wrote:

On 05/13/2013 07:26 AM, Greg wrote:

On 13/05/2013 07:38, Olivier Bonvalet wrote:

On Friday, 10 May 2013 at 19:16 +0200, Greg wrote:

Hello folks,

I'm in the process of testing CEPH and RBD, I have set up a small
cluster of  hosts running each a MON and an OSD with both journal and
data on the same SSD (ok this is stupid but this is simple to 
verify the
disks are not the bottleneck for 1 client). All nodes are connected 
on a

1Gb network (no dedicated network for OSDs, shame on me :).

Summary : the RBD performance is poor compared to benchmark

A 5 seconds seq read benchmark shows something like this :

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
 Total time run:        4.165747
Total reads made:       95
Read size:              4194304
Bandwidth (MB/sec):     91.220

Average Latency:   0.678901
Max latency:   1.80038
Min latency:   0.104719

91MB read performance, quite good !

Now the RBD performance :

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s

There is a 3x performance factor (same for write: ~60M benchmark, ~20M
dd on block device)

The network is ok, the CPU is also ok on all OSDs.
CEPH is Bobtail 0.56.4, linux is 3.8.1 arm (vanilla release + some
patches for the SoC being used)

Can you show me the starting point for digging into this ?

You should try to increase read_ahead to 512K instead of the defaults
128K (/sys/block/*/queue/read_ahead_kb). I have seen a huge difference
on reads with that.


Olivier,

thanks a lot for pointing this out, it indeed makes a *huge* 
difference !

# dd if=/mnt/t/1 of=/dev/zero bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 5.12768 s, 81.8 MB/s

(caches dropped before each test of course)

Mark, this is probably something you will want to investigate and
explain in a "tweaking" topic of the documentation.

Regards,


Out of curiosity, has your rados bench performance improved as well? 
We've also seen improvements for sequential read throughput when 
increasing read_ahead_kb. (it may decrease random iops in some cases 
though!)  The reason I didn't think to mention it here though is 
because I was just focused on the difference between rados bench and 
rbd.  It would be interesting to know if rbd has improved more 
dramatically than rados bench.
Mark, the read ahead is set on the RBD block device (on the client), so 
it doesn't improve benchmark results as the benchmark doesn't use the 
block layer.


1 question remains : why did I have poor performance with 1 single 
writing thread ?


Regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD vs RADOS benchmark performance

2013-05-13 Thread Greg

On 13/05/2013 07:38, Olivier Bonvalet wrote:

On Friday, 10 May 2013 at 19:16 +0200, Greg wrote:

Hello folks,

I'm in the process of testing CEPH and RBD, I have set up a small
cluster of  hosts running each a MON and an OSD with both journal and
data on the same SSD (ok this is stupid but this is simple to verify the
disks are not the bottleneck for 1 client). All nodes are connected on a
1Gb network (no dedicated network for OSDs, shame on me :).

Summary : the RBD performance is poor compared to benchmark

A 5 seconds seq read benchmark shows something like this :

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
 Total time run:        4.165747
Total reads made:       95
Read size:              4194304
Bandwidth (MB/sec):     91.220

Average Latency:   0.678901
Max latency:   1.80038
Min latency:   0.104719

91MB read performance, quite good !

Now the RBD performance :

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s

There is a 3x performance factor (same for write: ~60M benchmark, ~20M
dd on block device)

The network is ok, the CPU is also ok on all OSDs.
CEPH is Bobtail 0.56.4, linux is 3.8.1 arm (vanilla release + some
patches for the SoC being used)

Can you show me the starting point for digging into this ?

You should try to increase read_ahead to 512K instead of the defaults
128K (/sys/block/*/queue/read_ahead_kb). I have seen a huge difference
on reads with that.


Olivier,

thanks a lot for pointing this out, it indeed makes a *huge* difference !

# dd if=/mnt/t/1 of=/dev/zero bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 5.12768 s, 81.8 MB/s

(caches dropped before each test of course)

Mark, this is probably something you will want to investigate and 
explain in a "tweaking" topic of the documentation.


Regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD vs RADOS benchmark performance

2013-05-11 Thread Greg

On 11/05/2013 13:24, Greg wrote:

On 11/05/2013 02:52, Mark Nelson wrote:

On 05/10/2013 07:20 PM, Greg wrote:

On 11/05/2013 00:56, Mark Nelson wrote:

On 05/10/2013 12:16 PM, Greg wrote:

Hello folks,

I'm in the process of testing CEPH and RBD, I have set up a small
cluster of  hosts running each a MON and an OSD with both journal and
data on the same SSD (ok this is stupid but this is simple to 
verify the
disks are not the bottleneck for 1 client). All nodes are 
connected on a

1Gb network (no dedicated network for OSDs, shame on me :).

Summary : the RBD performance is poor compared to benchmark

A 5 seconds seq read benchmark shows something like this :
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
 Total time run:        4.165747
Total reads made:       95
Read size:              4194304
Bandwidth (MB/sec):     91.220

Average Latency:   0.678901
Max latency:   1.80038
Min latency:   0.104719


91MB read performance, quite good !

Now the RBD performance :

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s


There is a 3x performance factor (same for write: ~60M benchmark, 
~20M

dd on block device)

The network is ok, the CPU is also ok on all OSDs.
CEPH is Bobtail 0.56.4, linux is 3.8.1 arm (vanilla release + some
patches for the SoC being used)

Can you show me the starting point for digging into this ?


Hi Greg, First things first, are you doing kernel rbd or qemu/kvm?  If
you are doing qemu/kvm, make sure you are using virtio disks. This
can have a pretty big performance impact. Next, are you using RBD
cache? With 0.56.4 there are some performance issues with large
sequential writes if cache is on, but it does provide benefit for
small sequential writes.  In general RBD cache behaviour has improved
with Cuttlefish.

Beyond that, are the pools being targeted by RBD and rados bench setup
the same way?  Same number of Pgs?  Same replication?
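
For reference, a quick way to compare the pools (and, for qemu/kvm users, to
switch the client-side cache on) is roughly the following; 'rbd' is only the
usual default pool name and the snippet is a sketch, not a tuned config:

ceph osd dump | grep '^pool'    # shows rep size, pg_num and pgp_num per pool

# client-side ceph.conf snippet enabling the RBD cache (librbd/qemu only,
# the kernel client does not use it)
[client]
rbd cache = true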

Mark, thanks for your prompt reply.

I'm doing kernel RBD and so, I have not enabled the cache (default
setting?)
Sorry, I forgot to mention the pool used for bench and RBD is the same.


Interesting.  Does your rados bench performance change if you run a 
longer test?  So far I've been seeing about a 20-30% performance 
overhead for kernel RBD, but 3x is excessive!  It might be worth 
watching the underlying IO sizes to the OSDs in each case with 
something like "collectl -sD -oT" to see if there's any significant 
differences.

Mark,

I'll gather you some more data with collectl, meanwhile I realized a 
difference : the benchmark performs 16 concurrent reads while RBD only 
does 1. Shouldn't be a problem but still these are 2 different usage 
patterns.


Ok, I run the benchmark with only 1 concurrent thread and here is the 
result :

Total time run:5.118677
Total reads made: 56
Read size:4194304
Bandwidth (MB/sec):43.761

Average Latency:   0.09
Max latency:   0.096591
Min latency:   0.076976
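
For completeness, the two runs compare roughly like this (pool name assumed to
be the one benched above, and seq reads need objects left behind by an earlier
write bench):

rados bench -p rbd 5 seq -t 16    # default of 16 concurrent ops, ~91 MB/s above
rados bench -p rbd 5 seq -t 1     # a single op in flight, ~44 MB/s above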


So we have 36% more performance in the benchmark which correlates with 
your numbers.
Now the question is : why so much difference between 1 and 16 concurrent 
workloads ? I guess I know the answer : because of latency.

So the next question is : how can I optimize latency ? :)

Cheers,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD vs RADOS benchmark performance

2013-05-11 Thread Greg

Le 11/05/2013 02:52, Mark Nelson a écrit :

On 05/10/2013 07:20 PM, Greg wrote:

Le 11/05/2013 00:56, Mark Nelson a écrit :

On 05/10/2013 12:16 PM, Greg wrote:

Hello folks,

I'm in the process of testing CEPH and RBD, I have set up a small
cluster of  hosts running each a MON and an OSD with both journal and
data on the same SSD (ok this is stupid but this is simple to 
verify the
disks are not the bottleneck for 1 client). All nodes are connected 
on a

1Gb network (no dedicated network for OSDs, shame on me :).

Summary : the RBD performance is poor compared to benchmark

A 5 seconds seq read benchmark shows something like this :

   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16        39        23   91.9586        92  0.966117  0.431249
     2      16        64        48   95.9602       100  0.513435   0.53849
     3      16        90        74   98.6317       104   0.25631   0.55494
     4      11        95        84   83.9735        40   1.80038   0.58712

 Total time run:       4.165747
Total reads made:     95
Read size:            4194304
Bandwidth (MB/sec):   91.220

Average Latency:   0.678901
Max latency:   1.80038
Min latency:   0.104719


91MB read performance, quite good !

Now the RBD performance :

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s


There is a 3x performance factor (same for write: ~60M benchmark, ~20M
dd on block device)

The network is ok, the CPU is also ok on all OSDs.
CEPH is Bobtail 0.56.4, linux is 3.8.1 arm (vanilla release + some
patches for the SoC being used)

Can you show me the starting point for digging into this ?


Hi Greg, First things first, are you doing kernel rbd or qemu/kvm?  If
you are doing qemu/kvm, make sure you are using virtio disks. This
can have a pretty big performance impact. Next, are you using RBD
cache? With 0.56.4 there are some performance issues with large
sequential writes if cache is on, but it does provide benefit for
small sequential writes.  In general RBD cache behaviour has improved
with Cuttlefish.

Beyond that, are the pools being targeted by RBD and rados bench setup
the same way?  Same number of Pgs?  Same replication?

Mark, thanks for your prompt reply.

I'm doing kernel RBD and so, I have not enabled the cache (default
setting?)
Sorry, I forgot to mention the pool used for bench and RBD is the same.


Interesting.  Does your rados bench performance change if you run a 
longer test?  So far I've been seeing about a 20-30% performance 
overhead for kernel RBD, but 3x is excessive!  It might be worth 
watching the underlying IO sizes to the OSDs in each case with 
something like "collectl -sD -oT" to see if there's any significant 
differences.

Mark,

I'll gather you some more data with collectl, meanwhile I realized a 
difference : the benchmark performs 16 concurrent reads while RBD only 
does 1. Shouldn't be a problem but still these are 2 different usage 
patterns.


Cheers,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD vs RADOS benchmark performance

2013-05-10 Thread Greg

Le 11/05/2013 00:56, Mark Nelson a écrit :

On 05/10/2013 12:16 PM, Greg wrote:

Hello folks,

I'm in the process of testing CEPH and RBD, I have set up a small
cluster of  hosts running each a MON and an OSD with both journal and
data on the same SSD (ok this is stupid but this is simple to verify the
disks are not the bottleneck for 1 client). All nodes are connected on a
1Gb network (no dedicated network for OSDs, shame on me :).

Summary : the RBD performance is poor compared to benchmark

A 5 seconds seq read benchmark shows something like this :

   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16        39        23   91.9586        92  0.966117  0.431249
     2      16        64        48   95.9602       100  0.513435   0.53849
     3      16        90        74   98.6317       104   0.25631   0.55494
     4      11        95        84   83.9735        40   1.80038   0.58712

 Total time run:       4.165747
Total reads made:     95
Read size:            4194304
Bandwidth (MB/sec):   91.220

Average Latency:   0.678901
Max latency:   1.80038
Min latency:   0.104719


91MB read performance, quite good !

Now the RBD performance :

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s


There is a 3x performance factor (same for write: ~60M benchmark, ~20M
dd on block device)

The network is ok, the CPU is also ok on all OSDs.
CEPH is Bobtail 0.56.4, linux is 3.8.1 arm (vanilla release + some
patches for the SoC being used)

Can you show me the starting point for digging into this ?


Hi Greg, First things first, are you doing kernel rbd or qemu/kvm?  If 
you are doing qemu/kvm, make sure you are using virtio disks.  This 
can have a pretty big performance impact. Next, are you using RBD 
cache? With 0.56.4 there are some performance issues with large 
sequential writes if cache is on, but it does provide benefit for 
small sequential writes.  In general RBD cache behaviour has improved 
with Cuttlefish.


Beyond that, are the pools being targeted by RBD and rados bench setup 
the same way?  Same number of Pgs?  Same replication?

Mark, thanks for your prompt reply.

I'm doing kernel RBD and so, I have not enabled the cache (default setting?)
Sorry, I forgot to mention the pool used for bench and RBD is the same.

Regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD vs RADOS benchmark performance

2013-05-10 Thread Greg

Hello folks,

I'm in the process of testing CEPH and RBD, I have set up a small 
cluster of  hosts running each a MON and an OSD with both journal and 
data on the same SSD (ok this is stupid but this is simple to verify the 
disks are not the bottleneck for 1 client). All nodes are connected on a 
1Gb network (no dedicated network for OSDs, shame on me :).


Summary : the RBD performance is poor compared to benchmark

A 5 seconds seq read benchmark shows something like this :

   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
     0       0         0         0         0         0         -         0
     1      16        39        23   91.9586        92  0.966117  0.431249
     2      16        64        48   95.9602       100  0.513435   0.53849
     3      16        90        74   98.6317       104   0.25631   0.55494
     4      11        95        84   83.9735        40   1.80038   0.58712
 Total time run:       4.165747
Total reads made:     95
Read size:            4194304
Bandwidth (MB/sec):   91.220

Average Latency:   0.678901
Max latency:   1.80038
Min latency:   0.104719


91MB read performance, quite good !

Now the RBD performance :

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s


There is a 3x performance factor (same for write: ~60M benchmark, ~20M 
dd on block device)


The network is ok, the CPU is also ok on all OSDs.
CEPH is Bobtail 0.56.4, linux is 3.8.1 arm (vanilla release + some 
patches for the SoC being used)


Can you show me the starting point for digging into this ?

Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-deploy doesn't update ceph.conf

2013-05-09 Thread Greg Chavez
So I feel like I'm missing something.  I just deployed 3 storage nodes
with ceph-deploy, each with a monitor daemon and 6-8 osd's. All of
them seem to be active with health OK.  However, it doesn't seem that
I ended up with a useful ceph.conf.

( running 0.61-113-g61354b2-1raring )

This is all I get:

[global]
fsid = af1581c1-8c45-4e24-b4f1-9a56e8a62aeb
mon_initial_members = kvm-cs-sn-10i, kvm-cs-sn-14i, kvm-cs-sn-15i
mon_host = 192.168.241.110,192.168.241.114,192.168.241.115
auth_supported = cephx
osd_journal_size = 1024
filestore_xattr_use_omap = true

When you run "service ceph status", it returns nothing because it
can't find any osd stanzas.  Based on the directory names in
/var/lib/ceph/mon and the parsing of the mounted storage volumes I
wrote this out:

http://pastebin.com/DLgtiC23

And then pushed it out.
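
For readers without the pastebin, the stanzas are of this general shape (the
osd numbering and the mon address pairing here are illustrative guesses, the
pastebin has the real file):

[mon.kvm-cs-sn-10i]
host = kvm-cs-sn-10i
mon addr = 192.168.241.110:6789

[osd.0]
host = kvm-cs-sn-10i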

The monitor id names seem odd to me, with the hostname instead of a,
b, and c, but whatever.

Now I get this output:

root@kvm-cs-sn-10i:/etc/ceph# service ceph -a status
=== mon.kvm-cs-sn-10i ===
mon.kvm-cs-sn-10i: not running.
=== mon.kvm-cs-sn-14i ===
mon.kvm-cs-sn-14i: not running.
=== mon.kvm-cs-sn-15i ===
mon.kvm-cs-sn-15i: not running.
=== osd.0 ===
osd.0: not running.
=== osd.1 ===
osd.1: not running.
=== osd.10 ===
osd.10: not running.
...etc

Not true!  Even worse, when I try to run "service ceph -a start", It
freaks and complains about missing keys.  So now I have this process
hanging around:

/usr/bin/python /usr/sbin/ceph-create-keys -i kvm-cs-sn-10i

Here's the output from that attempt:

root@kvm-cs-sn-10i:/tmp# service ceph -a start
=== mon.kvm-cs-sn-10i ===
Starting Ceph mon.kvm-cs-sn-10i on kvm-cs-sn-10i...
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i kvm-cs-sn-10i
--pid-file /var/run/ceph/mon.kvm-cs-sn-10i.pid -c /etc/ceph/ceph.conf
'
Starting ceph-create-keys on kvm-cs-sn-10i...

Luckily I hadn't set up my ssh keys yet, so that's as far as I got.

Would dearly love some guidance.  Thanks in advance!

--Greg Chavez
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] problem readding an osd

2013-05-06 Thread Greg

Le 06/05/2013 20:41, Glen Aidukas a écrit :


New post below...

From: Greg [mailto:it...@itooo.com]
Sent: Monday, May 06, 2013 2:31 PM
To: Glen Aidukas
Subject: Re: [ceph-users] problem readding an osd

Le 06/05/2013 20:05, Glen Aidukas a écrit :

Greg,

Not sure where to use the --d switch.  I tried the following:

Service ceph start --d

Service ceph --d start

Both do not work.

I did see an error in my log though...

2013-05-06 13:03:38.432479 7f0007ef2780 -1
filestore(/srv/ceph/osd/osd.2) limited size xattrs --
filestore_xattr_use_omap enabled

2013-05-06 13:03:38.438563 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is supported and
appears to work

2013-05-06 13:03:38.438591 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is disabled via
'filestore fiemap' config option

2013-05-06 13:03:38.438804 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount did NOT detect btrfs

2013-05-06 13:03:38.484841 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount syncfs(2) syscall fully
supported (by glibc and kernel)

2013-05-06 13:03:38.485010 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount found snaps <>

2013-05-06 13:03:38.488631 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount: enabling WRITEAHEAD journal
mode: btrfs not detected

2013-05-06 13:03:38.488936 7f0007ef2780  1 journal _open
/srv/ceph/osd/osd.2/journal fd 19: 1048576000 bytes, block size
4096 bytes, directio = 1, aio = 0

2013-05-06 13:03:38.489095 7f0007ef2780  1 journal _open
/srv/ceph/osd/osd.2/journal fd 19: 1048576000 bytes, block size
4096 bytes, directio = 1, aio = 0

2013-05-06 13:03:38.490116 7f0007ef2780  1 journal close
/srv/ceph/osd/osd.2/journal

2013-05-06 13:03:38.538302 7f0007ef2780 -1
filestore(/srv/ceph/osd/osd.2) limited size xattrs --
filestore_xattr_use_omap enabled

2013-05-06 13:03:38.559813 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is supported and
appears to work

2013-05-06 13:03:38.559848 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is disabled via
'filestore fiemap' config option

2013-05-06 13:03:38.560082 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount did NOT detect btrfs

2013-05-06 13:03:38.566015 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount syncfs(2) syscall fully
supported (by glibc and kernel)

2013-05-06 13:03:38.566106 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount found snaps <>

2013-05-06 13:03:38.569047 7f0007ef2780  0
filestore(/srv/ceph/osd/osd.2) mount: enabling WRITEAHEAD journal
mode: btrfs not detected

2013-05-06 13:03:38.569237 7f0007ef2780  1 journal _open
/srv/ceph/osd/osd.2/journal fd 27: 1048576000 bytes, block size
4096 bytes, directio = 1, aio = 0

2013-05-06 13:03:38.569316 7f0007ef2780  1 journal _open
/srv/ceph/osd/osd.2/journal fd 27: 1048576000 bytes, block size
4096 bytes, directio = 1, aio = 0

2013-05-06 13:03:38.574317 7f0007ef2780  1 journal close
/srv/ceph/osd/osd.2/journal

2013-05-06 13:03:38.574801 7f0007ef2780 -1  ** ERROR: osd init
failed: (1) Operation not permitted

Glen Aidukas  [Manager IT Infrastructure]



From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Greg
Sent: Monday, May 06, 2013 1:47 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] problem readding an osd

Le 06/05/2013 19:23, Glen Aidukas a écrit :

Hello,

I think this is a newbie question but I tested everything and,
yes I FTFM as best I could.

I'm evaluating ceph and so I setup a cluster of 4 nodes.  The
nodes are KVM virtual machines named ceph01 to ceph04 all
running Ubuntu 12.04.2 LTS each with a single osd named osd.1
through osd.4 respective to the host they were running on. 
Each host also has a 1TB disk for ceph to use '/dev/vdb1'.


After some work I was able to get the cluster up and running
and even mounted it on a test client host (named ceph00).  I
ran into issues when I was testing a failure.  I shut off
ceph02 and watched via (ceph --w) it recover and move the data
around.  At this point all is fine.

When I turned the host back on, it did not auto reconnect.  I
expected this.  I then went through many attempts to re-add it
but all failed.

Here is an output from:  ceph osd tree

# id    weight  type name       up/down reweight

-1  4   root default

-3  4 rack unknownrack

-2 1   host ceph01


Re: [ceph-users] problem readding an osd

2013-05-06 Thread Greg

Le 06/05/2013 19:23, Glen Aidukas a écrit :


Hello,

I think this is a newbie question but I tested everything and, yes I 
FTFM as best I could.


I'm evaluating ceph and so I setup a cluster of 4 nodes.  The nodes 
are KVM virtual machines named ceph01 to ceph04 all running Ubuntu 
12.04.2 LTS each with a single osd named osd.1 through osd.4 respective 
to the host they were running on.  Each host also has a 1TB disk for 
ceph to use '/dev/vdb1'.


After some work I was able to get the cluster up and running and even 
mounted it on a test client host (named ceph00).  I ran into issues 
when I was testing a failure.  I shut off ceph02 and watched via (ceph 
--w) it recover and move the data around.  At this point all is fine.


When I turned the host back on, it did not auto reconnect.  I expected 
this.  I then went through many attempts to re-add it but all failed.


Here is an output from:  ceph osd tree

# id    weight  type name       up/down reweight

-1 4   root default

-3 4   rack unknownrack

-2 1   host ceph01

1 1   osd.1   up  1

-4 1   host ceph02

2 1   osd.2   down0

-5 1   host ceph03

3 1   osd.3   up  1

-6 1   host ceph04

4 1   osd.4   up  1

-7 0   rack unkownrack

ceph -s

health HEALTH_WARN 208 pgs peering; 208 pgs stuck inactive; 208 pgs 
stuck unclean; 1/4 in osds are down


monmap e1: 1 mons at {a=10.30.20.81:6789/0}, election epoch 1, quorum 0 a

osdmap e172: 4 osds: 3 up, 4 in

pgmap v1970: 960 pgs: 752 active+clean, 208 peering; 5917 MB data, 
61702 MB used, 2854 GB / 3068 GB avail


mdsmap e39: 1/1/1 up {0=a=up:active}

While I'm able to get it to be in the 'in' state, I can't seem to bring 
it up.


Any ideas on how to fix this?



Glen,

try to bring up your OSD daemon with -d switch, this will probably give 
you some information. (alternatively look in the logs)
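
For example, something like this on ceph02 (osd.2 is the one your tree shows
down) keeps the daemon in the foreground and prints its log to the console:

ceph-osd -i 2 -d

If it dies, the last few lines it prints usually tell you why.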


Cheers,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd tell bench

2013-05-03 Thread Greg

Le 03/05/2013 16:34, Travis Rhoden a écrit :

I have a question about "tell bench" command.

When I run this, is it behaving more or less like a dd on the drive?  
It appears to be, but I wanted to confirm whether or not it is 
bypassing all the normal Ceph stack that would be writing metadata, 
calculating checksums, etc.


One bit of behavior I noticed a while back that I was not expecting is 
that this command does write to the journal. It made sense when I 
thought about it, but when I have an SSD journal in front of an OSD, I 
can't get the "tell bench" command to really show me accurate numbers 
of the raw speed of the OSD -- instead I get write speeds of the SSD.  
Just a small caveat there.


The upside to that is when you do something like "tell \* bench", you 
are able to see if that SSD becomes a bottleneck by hosting multiple 
journals, so I'm not really complaining.  But it does make it a bit tough 
to see if perhaps one OSD is performing much differently than others.


But really, I'm mainly curious if it skips any normal 
metadata/checksum overhead that may be there otherwise.


Travis,

I'm no expert but, to me, the bench doesn't bypass the ceph stack.
On a test setup, I set up the journal on the same drive as the data 
drive, when I "tell bench" I can see ~160MB/s throughput on the SSD 
block device and the benchmark result is ~80MB/s which leads me to think 
the data is written twice : once to the journal and once to the 
"permanent" storage.
I see almost no read on the block device but the written data probably 
is in the page cache.
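
For anyone who wants to reproduce the observation, the commands in question
look like this, run from any node with admin credentials:

ceph osd tell 0 bench     # bench a single OSD
ceph osd tell \* bench    # bench them all, e.g. to spot a shared journal SSD bottleneck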


Cheers,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replacement hardware

2013-03-20 Thread Greg Farnum
Yeah. If you run "ceph auth list" you'll get a dump of all the users and keys 
the cluster knows about; each of your daemons has that key stored somewhere 
locally (generally in /var/lib/ceph/[osd|mds|mon]/ceph-$id). You can create 
more or copy an unused MDS one. I believe the docs include information on how 
this works. 
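A rough sketch of the create-a-new-one option; the name mds.b and the caps are 
only an illustration, so check an existing mds.* entry in "ceph auth list" and 
mirror its caps, and point -o at wherever your mds data dir actually lives:

ceph auth list
ceph auth get-or-create mds.b mds 'allow' osd 'allow *' mon 'allow rwx' \
    -o /var/lib/ceph/mds/ceph-b/keyring
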
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wednesday, March 20, 2013 at 10:48 AM, Igor Laskovy wrote:

> Well, can you please clarify exactly which key I must use? Do I need to 
> get/generate it somehow from the working cluster?
> 
> 
> On Wed, Mar 20, 2013 at 7:41 PM, Greg Farnum <g...@inktank.com> wrote:
> > The MDS doesn't have any local state. You just need to start up the daemon 
> > somewhere with a name and key that are known to the cluster (these can be 
> > different from or the same as the one that existed on the dead node; 
> > doesn't matter!).
> > -Greg
> > 
> > Software Engineer #42 @ http://inktank.com | http://ceph.com
> > 
> > 
> > On Wednesday, March 20, 2013 at 10:40 AM, Igor Laskovy wrote:
> > 
> > > Actually, I already have recovered OSDs and MON daemon back to the 
> > > cluster according to 
> > > http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ and 
> > > http://ceph.com/docs/master/rados/operations/add-or-rm-mons/ .
> > > 
> > > But the doc is missing info about removing/adding an MDS.
> > > How can I recover the MDS daemon for a failed node?
> > > 
> > > 
> > > 
> > > On Wed, Mar 20, 2013 at 3:23 PM, Dave (Bob) <d...@bob-the-boat.me.uk> wrote:
> > > > Igor,
> > > > 
> > > > I am sure that I'm right in saying that you just have to create a new
> > > > filesystem (btrfs?) on the new block device, mount it, and then
> > > > initialise the osd with:
> > > > 
> > > > ceph-osd -i <id> --mkfs
> > > > 
> > > > Then you can start the osd with:
> > > > 
> > > > ceph-osd -i <id>
> > > > 
> > > > Since you are replacing an osd that already existed, the cluster knows
> > > > about it, and there is a key for it that is known.
> > > > 
> > > > I don't claim any great expertise, but this is what I've been doing, and
> > > > the cluster seems to adopt the new osd and sort everything out.
> > > > 
> > > > David
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > > 
> > > 
> > > 
> > > 
> > > --
> > > Igor Laskovy
> > > facebook.com/igor.laskovy (http://facebook.com/igor.laskovy)
> > > Kiev, Ukraine
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
> 
> 
> 
> -- 
> Igor Laskovy
> facebook.com/igor.laskovy (http://facebook.com/igor.laskovy)
> Kiev, Ukraine 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replacement hardware

2013-03-20 Thread Greg Farnum
The MDS doesn't have any local state. You just need to start up the daemon 
somewhere with a name and key that are known to the cluster (these can be 
different from or the same as the one that existed on the dead node; doesn't 
matter!). 
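Concretely, that boils down to something like the following on the new node,
where the paths are the common defaults (adjust as needed) and "a" stands for
whatever name the cluster already knows:

mkdir -p /var/lib/ceph/mds/ceph-a
ceph auth get mds.a -o /var/lib/ceph/mds/ceph-a/keyring   # fetch the key the cluster already has
ceph-mds -i a                                             # start the daemon under that name
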
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wednesday, March 20, 2013 at 10:40 AM, Igor Laskovy wrote:

> Actually, I already have recovered OSDs and MON daemon back to the cluster 
> according to http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ and 
> http://ceph.com/docs/master/rados/operations/add-or-rm-mons/ . 
> 
> But the doc is missing info about removing/adding an MDS.
> How can I recover the MDS daemon for a failed node?
> 
> 
> 
> On Wed, Mar 20, 2013 at 3:23 PM, Dave (Bob) <d...@bob-the-boat.me.uk> wrote:
> > Igor,
> > 
> > I am sure that I'm right in saying that you just have to create a new
> > filesystem (btrfs?) on the new block device, mount it, and then
> > initialise the osd with:
> > 
> > ceph-osd -i <id> --mkfs
> > 
> > Then you can start the osd with:
> > 
> > ceph-osd -i <id>
> > 
> > Since you are replacing an osd that already existed, the cluster knows
> > about it, and there is a key for it that is known.
> > 
> > I don't claim any great expertise, but this is what I've been doing, and
> > the cluster seems to adopt the new osd and sort everything out.
> > 
> > David
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> -- 
> Igor Laskovy
> facebook.com/igor.laskovy (http://facebook.com/igor.laskovy)
> Kiev, Ukraine 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Status of Mac OS and Windows PC client

2013-03-19 Thread Greg Farnum
At various times the ceph-fuse client has worked on OS X — Noah was the 
last one to do this and the branch for it is sitting in my long-term 
really-like-to-get-this-mainlined-someday queue. OS X is a lot easier than 
Windows though, and nobody's done any planning around that beyond noting that 
there are FUSE-like systems for Windows, and that Samba is a workaround. Sorry. 
 
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tuesday, March 19, 2013 at 8:58 AM, Igor Laskovy wrote:

> Thanks for reply!
>  
> Actually I would like to find some way to use one large scalable central storage 
> across multiple PCs and Macs. CephFS would be most suitable here, but you 
> provide only Linux support.  
> Really no planning here?
>  
>  
> On Tue, Mar 19, 2013 at 3:52 PM, Patrick McGarry <patr...@inktank.com> wrote:
> > Hey Igor,
> >  
> > Currently there are no plans to develop a OS X or Windows-specific
> > client per se. We do provide a number of different ways to expose the
> > cluster in ways that you could use it from these machines, however.
> >  
> > The most recent example of this is the work being done on tgt that can
> > expose Ceph via iSCSI. For reference see:
> > http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11662.html
> >  
> > Keep an eye out for more details in the near future.
> >  
> >  
> > Best Regards,
> >  
> > Patrick McGarry
> > Director, Community || Inktank
> >  
> > http://ceph.com || http://inktank.com
> > @scuttlemonkey || @ceph || @inktank
> >  
> >  
> > On Tue, Mar 19, 2013 at 8:30 AM, Igor Laskovy <igor.lask...@gmail.com> wrote:
> > > Anybody? :)
> > >  
> > > Igor Laskovy
> > > facebook.com/igor.laskovy (http://facebook.com/igor.laskovy)
> > > Kiev, Ukraine
> > >  
> > > On Mar 17, 2013 6:37 PM, "Igor Laskovy" <igor.lask...@gmail.com> wrote:
> > > >  
> > > > Hi there!
> > > >  
> > > > Could you please clarify what is the current status of a development 
> > > > client
> > > > for OS X and Windows desktop editions?
> > > >  
> > > > --
> > > > Igor Laskovy
> > > > facebook.com/igor.laskovy (http://facebook.com/igor.laskovy)
> > > > Kiev, Ukraine
> > >  
> > >  
> > >  
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >  
>  
>  
>  
>  
> --  
> Igor Laskovy
> facebook.com/igor.laskovy (http://facebook.com/igor.laskovy)
> Kiev, Ukraine  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crash and strange things on MDS

2013-03-15 Thread Greg Farnum
On Friday, March 15, 2013 at 3:40 PM, Marc-Antoine Perennou wrote:
> Thank you a lot for these explanations, looking forward to these fixes!
> Do you have some public bug reports regarding this to link us?
> 
> Good luck, thank you for your great job and have a nice weekend
> 
> Marc-Antoine Perennou 
Well, for now the fixes are for stuff like "make analysis take less time, and 
export timing information more easily". The most immediately applicable one is 
probably http://tracker.ceph.com/issues/4354, which I hope to start on next 
week and should be done by the end of the sprint.
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crash and strange things on MDS

2013-03-15 Thread Greg Farnum
On Friday, March 8, 2013 at 3:29 PM, Kevin Decherf wrote:
> On Fri, Mar 01, 2013 at 11:12:17AM -0800, Gregory Farnum wrote:
> > On Tue, Feb 26, 2013 at 4:49 PM, Kevin Decherf <ke...@kdecherf.com> wrote:
> > > You will find the archive here: 
> > > The data is not anonymized. Interesting folders/files here are
> > > /user_309bbd38-3cff-468d-a465-dc17c260de0c/*
> >  
> >  
> >  
> > Sorry for the delay, but I have retrieved this archive locally at
> > least so if you want to remove it from your webserver you can do so.
> > :) Also, I notice when I untar it that the file name includes
> > "filtered" — what filters did you run it through?
>  
>  
>  
> Hi Gregory,
>  
> Do you have any news about it?
>  

I wrote a couple tools to do log analysis and created a number of bugs to make 
the MDS more amenable to analysis as a result of this.
Having spot-checked some of your longer-running requests, they're all getattrs 
or setattrs contending on files in what look to be shared cache and php 
libraries. These cover a range from ~40 milliseconds to ~150 milliseconds. I'd 
look into what your split applications are sharing across those spaces.

On the up side for Ceph, >80% of your requests take "0" milliseconds and ~95% 
of them take less than 2 milliseconds. Hurray, it's not ridiculously slow most 
of the time. :)
-Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck unclean after growing my ceph-cluster

2013-03-13 Thread Greg Farnum
On Wednesday, March 13, 2013 at 5:52 AM, Ansgar Jazdzewski wrote:
> hi,
>  
> i added 10 new OSD's to my cluster, after the growth is done, i got:
>  
> ##
> # ceph -s
> health HEALTH_WARN 217 pgs stuck unclean
> monmap e4: 2 mons at {a=10.100.217.3:6789/0,b=10.100.217.4:6789/0}, election 
> epoch 4, quorum 0,1 a,b
> osdmap e1480: 14 osds: 14 up, 14 in
> pgmap v8690731: 776 pgs: 559 active+clean, 217 active+remapped; 341 GB data, 
> 685 GB used, 15390 GB / 16075 GB avail
> mdsmap e312: 1/1/1 up {0=d=up:active}, 3 up:standby
> ##
>  
> during the growth some vm was online, with rbd! is that the reason for the 
> warning?
Nope, it's not because you were using the cluster. The "unclean" PGs here are 
those which are in the "active+remapped" state. That's actually two states — 
"active", which is good, because it means they're serving reads and writes; 
"remapped" which means that for some reason the current set of OSDs handling 
them isn't the set that CRUSH thinks should be handling them. Given your 
cluster expansion that probably means that your CRUSH map and rules aren't 
behaving themselves and are failing to assign the right number of replicas to 
those PGs. You can check this by looking at the PG dump. If you search for 
"ceph active remapped" it looks to me like you'll get some useful results; you 
might also just be able to enable the CRUSH tunables 
(http://ceph.com/docs/master/rados/operations/crush-map/#tunables).
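
A couple of commands that make that inspection easier:

ceph pg dump_stuck unclean           # list the PGs stuck in active+remapped
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt  # decompile the CRUSH map to review rules and tunables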

John, this is becoming a more common problem; we should generate some more 
targeted documentation around it. :)

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] If one node lost connect to replication network?

2013-03-11 Thread Greg Farnum
There's not a really good per-version list, but tracker.ceph.com is reasonably 
complete and has a number of views.
-Greg

On Monday, March 11, 2013 at 8:22 AM, Igor Laskovy wrote:

> Thanks for the quick reply.
> Ok, so at this time it looks like it's better to avoid splitting networks across 
> network interfaces.
> Where can I find a list of all issues related to a specific version? 
> 
> 
> On Mon, Mar 11, 2013 at 5:16 PM, Gregory Farnum <g...@inktank.com> wrote:
> > On Monday, March 11, 2013, Igor Laskovy wrote:
> > > Hi there!
> > > 
> > > I have Ceph FS cluster version 0.56.3. This is 3 nodes with XFS on disks 
> > > and with minimum options in ceph.conf in my lab and I do some crush 
> > > testing. 
> > > One of the several tests is losing the connection to the replication network only.
> > > What is the expected behavior in this situation? Will the mounted disk on the client 
> > > machine freeze or so?
> > > 
> > > Looks like in my case the whole cluster has gone crazy. 
> > 
> > Yeah, this is a known issue with the way Ceph determines if nodes are up or 
> > down. Basically the OSDs are communicating over the replication network and 
> > reporting to the monitors that the disconnected node is dead, but when they 
> > mark it down it finds out and insists (over the public network) that it's 
> > up. 
> > 
> > I believe Sage fixed this issue in our development releases, but could be 
> > misremembering. Sage?
> > -Greg
> 
> 
> 
> 
> 
> 
> -- 
> Igor Laskovy
> facebook.com/igor.laskovy (http://facebook.com/igor.laskovy)
> Kiev, Ukraine 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD goes up and down - osd suicide?

2013-03-05 Thread Greg Farnum
On Tuesday, March 5, 2013 at 5:53 AM, Marco Aroldi wrote:
> Hi,
> I've collected a osd log with these parameters:
>  
> debug osd = 20
> debug ms = 1
> debug filestore = 20
>  
> You can download it from here:
> https://docs.google.com/file/d/0B1lZcgrNMBAJVjBqa1lJRndxc2M/edit?usp=sharing
>  
> I have also captured a video to show the behavior in 
> realtime:http://youtu.be/708AI8PGy7k
>  
Ah, this is interesting — the ceph-osd processes are using up the time, not the 
filesystem or something. However, I don't see any reason for that in a brief 
look at the OSD log here — can you describe what you did to the OSD during that 
logging period? (In particular I see a lot of pg_log messages, but not the sub 
op messages that would be associated with this OSD doing a deep scrub, nor the 
internal heartbeat timeouts that the other OSDs were generating.)
-Greg

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS First product release discussion

2013-03-05 Thread Greg Farnum
This is a companion discussion to the blog post at 
http://ceph.com/dev-notes/cephfs-mds-status-discussion/ — go read that!

The short and slightly alternate version: I spent most of about two weeks 
working on bugs related to snapshots in the MDS, and we started realizing that 
we could probably do our first supported release of CephFS and the related 
infrastructure much sooner if we didn't need to support all of the whizbang 
features. (This isn't to say that the base feature set is stable now, but it's 
much closer than when you turn on some of the other things.) I'd like to get 
feedback from you in the community on what minimum supported feature set would 
prompt or allow you to start using CephFS in real environments — not what you'd 
*like* to see, but what you *need* to see. This will allow us at Inktank to 
prioritize more effectively and hopefully get out a supported release much more 
quickly! :)

The current proposed feature set is basically what's left over after we've 
trimmed off everything we can think to split off, but if any of the proposed 
included features are also particularly important or don't matter, be sure to 
mention them (NFS export in particular — it works right now but isn't in great 
shape due to NFS filehandle caching).

Thanks,
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com  


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph 0.57

2013-03-05 Thread Greg Farnum
I believe the debian folder only includes stable releases; .57 is a dev 
release. See http://ceph.com/docs/master/install/debian/ for more! :)
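If you do want the dev releases on Ubuntu, the repo line is roughly the
following (from memory of the current layout, so double-check it against the
install docs above):

echo deb http://ceph.com/debian-testing/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
sudo apt-get update && sudo apt-get install ceph
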
-Greg

Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tuesday, March 5, 2013 at 8:44 AM, Scott Kinder wrote:

> When is ceph 0.57 going to be available from the ceph.com 
> PPA? I checked, and all releases under http://ceph.com/debian/dists/ seem to 
> still be 0.56.3. Or am I missing something?
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com)
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com