[ceph-users] Purpose of the s3gw.fcgi script?
From my observation, the s3gw.fcgi script seems to be completely superfluous to the operation of Ceph. With or without the script, Swift requests execute correctly as long as a radosgw daemon is running. Is there something I'm missing here?
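For context, the s3gw.fcgi wrapper shown in the Ceph documentation of this era is a one-line shell script that just execs radosgw; the paths and keyring name below follow the docs' defaults and may differ in your setup:

    #!/bin/sh
    # Hand the FastCGI request straight to radosgw; Apache's mod_fastcgi
    # only runs this when it spawns the gateway process itself.
    exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway

If Apache is instead configured with FastCgiExternalServer pointing at an already-running radosgw socket, the script is never executed at all, which would explain the observation above.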
[ceph-users] Auth URL not found when using object gateway
Hi,

I'm having trouble setting up an object gateway on an existing cluster. The cluster I'm trying to add the gateway to is running on an Ubuntu 12.04 (Precise) virtual machine. The cluster is up and running, with a monitor, two OSDs, and a metadata server. It reports HEALTH_OK and active+clean, so I am somewhat assured that it is running correctly.

I've:
- set up an apache2 webserver with the fastcgi mod installed
- created an rgw.conf file
- added an s3gw.fcgi script
- enabled the rgw.conf site and disabled the default
- created a keyring and gateway user with appropriate caps
- restarted ceph, apache2, and the radosgw daemon
- created a user and subuser
- tested both S3 and Swift calls

Unfortunately, both S3 and Swift fail to authorize. An attempt to create a new bucket with S3 using a python script returns:

    Traceback (most recent call last):
      File "s3test.py", line 13, in <module>
        bucket = conn.create_bucket('my-new-bucket')
      File "/usr/lib/python2.7/dist-packages/boto/s3/connection.py", line 422, in create_bucket
        response.status, response.reason, body)
    boto.exception.S3ResponseError: S3ResponseError: 404 Not Found
    None

And an attempt to post a container using the python-swiftclient from the command line with:

    swift --debug --info -A http://localhost/auth/1.0 -U gatewayuser:swift -K key post new_container

returns:

    INFO:urllib3.connectionpool:Starting new HTTP connection (1): localhost
    DEBUG:urllib3.connectionpool:"GET /auth/1.0 HTTP/1.1" 404 180
    INFO:swiftclient:REQ: curl -i http://localhost/auth/1.0 -X GET
    INFO:swiftclient:RESP STATUS: 404 Not Found
    INFO:swiftclient:RESP HEADERS: [('content-length', '180'), ('content-encoding', 'gzip'), ('date', 'Tue, 24 Mar 2015 23:19:50 GMT'), ('content-type', 'text/html; charset=iso-8859-1'), ('vary', 'Accept-Encoding'), ('server', 'Apache/2.2.22 (Ubuntu)')]
    INFO:swiftclient:RESP BODY: [gzip-compressed Apache error page, binary omitted]
    ERROR:swiftclient:Auth GET failed: http://localhost/auth/1.0 404 Not Found
    Traceback (most recent call last):
      File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 1181, in _retry
        self.url, self.token = self.get_auth()
      File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 1155, in get_auth
        insecure=self.insecure)
      File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 318, in get_auth
        insecure=insecure)
      File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 241, in get_auth_1_0
        http_reason=resp.reason)
    ClientException: Auth GET failed: http://localhost/auth/1.0 404 Not Found

The client then retries once and fails identically, and finally prints:

    Auth GET failed: http://localhost/auth/1.0 404 Not Found

I'm not at all sure why it doesn't work when I've followed the documentation for setting it up. Please find attached the config files: rgw.conf, ceph.conf, and apache2.conf.
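A 404 served by Apache itself on /auth/1.0, with an Apache-branded gzip error body as above, usually means the rewrite rule never handed the request to radosgw. For comparison, a minimal rgw.conf in the shape the documentation of that era suggests looks roughly like the sketch below; the ServerName and socket path are placeholders and must match the rgw socket path in ceph.conf:

    FastCgiExternalServer /var/www/s3gw.fcgi -socket /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock

    <VirtualHost *:80>
        ServerName gateway.example.com
        DocumentRoot /var/www

        RewriteEngine On
        # Forward every request (including /auth/1.0) to the FastCGI script,
        # preserving the Authorization header for radosgw:
        RewriteRule ^/(.*) /s3gw.fcgi?%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

        <Directory /var/www>
            Options +ExecCGI
            AllowOverride All
            SetHandler fastcgi-script
            Order allow,deny
            Allow from all
            AuthBasicAuthoritative Off
        </Directory>

        AllowEncodedSlashes On
        ServerSignature Off
    </VirtualHost>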
Re: [ceph-users] Monitor failure after series of traumatic network failures
This was excellent advice. It should be on some official Ceph troubleshooting page. It takes a while for the monitors to deal with new info, but it works. Thanks again!

--Greg

On Wed, Mar 18, 2015 at 5:24 PM, Sage Weil s...@newdream.net wrote:

On Wed, 18 Mar 2015, Greg Chavez wrote:

We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network availability several times since this past Thursday and whose nodes were all rebooted twice (hastily and inadvisably each time). The final reboot, which was supposed to be the last thing before recovery according to our data center team, resulted in a failure of the cluster's 4 monitors. This happened yesterday afternoon.

[By the way, we use Ceph to back Cinder and Glance in our OpenStack Cloud, block storage only; also, these network problems were the result of our data center team executing maintenance on our switches that was supposed to be quick and painless.]

After working all day on various troubleshooting techniques found here and there, we have this situation on our monitor nodes (debug 20):

node-10: dead. ceph-mon will not start.
node-14: Seemed to rebuild its monmap. The log has stopped reporting with this final tail -100: http://pastebin.com/tLiq2ewV
node-16: Same as 14, similar outcome in the log: http://pastebin.com/W87eT7Mw
node-15: ceph-mon starts, but even at debug 20 it will only output this line, over and over again:

    2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0) AdminSocket: request 'mon_status' not defined

node-02: I added this guy to replace node-10. I updated ceph.conf and pushed it to all the monitor nodes (the osd nodes without monitors did not get the config push). Since he's a new guy the log output is obviously different, but again, here are the last 50 lines: http://pastebin.com/pfixdD3d

I run my ceph client from my OpenStack controller. All ceph -s shows me is faults, albeit only to node-15:

    2015-03-18 16:47:27.145194 7ff762cff700 0 -- 192.168.241.100:0/15112 >> 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault

Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S

So that's where we stand. Did we kill our Ceph Cluster (and thus our OpenStack Cloud)?

Unlikely! You have 5 copies, and I doubt all of them are unrecoverable.

Or is there hope? Any suggestions would be greatly appreciated.

Stop all mons. Make a backup copy of each mon data dir. Copy the node-14 data dir over the node-15 and/or node-10 and/or node-02. Start all mons, see if they form a quorum.

Once things are working again, at the *very* least upgrade to dumpling, and preferably then upgrade to firefly!! Cuttlefish was EOL more than a year ago, and dumpling is EOL in a couple months.

sage
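As a sketch, Sage's procedure translates to roughly the following shell steps; the paths assume the default /var/lib/ceph/mon layout and sysvinit, the mon IDs are this thread's, and all of it should be adjusted to the actual deployment:

    # 1. Stop the mon on every monitor node:
    service ceph stop mon

    # 2. Back up each mon data dir before touching anything:
    cp -a /var/lib/ceph/mon/ceph-node-14 /root/mon-ceph-node-14.bak

    # 3. Copy the healthiest store (node-14 here) over the broken ones:
    rsync -a --delete /var/lib/ceph/mon/ceph-node-14/ node-15:/var/lib/ceph/mon/ceph-node-15/

    # 4. Start the mons again and watch for quorum:
    service ceph start mon
    ceph -s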
[ceph-users] Monitor failure after series of traumatic network failures
We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network availability several times since this past Thursday and whose nodes were all rebooted twice (hastily and inadvisably each time). The final reboot, which was supposed to be the last thing before recovery according to our data center team, resulted in a failure of the cluster's 4 monitors. This happened yesterday afternoon.

[By the way, we use Ceph to back Cinder and Glance in our OpenStack Cloud, block storage only; also, these network problems were the result of our data center team executing maintenance on our switches that was supposed to be quick and painless.]

After working all day on various troubleshooting techniques found here and there, we have this situation on our monitor nodes (debug 20):

node-10: dead. ceph-mon will not start.
node-14: Seemed to rebuild its monmap. The log has stopped reporting with this final tail -100: http://pastebin.com/tLiq2ewV
node-16: Same as 14, similar outcome in the log: http://pastebin.com/W87eT7Mw
node-15: ceph-mon starts, but even at debug 20 it will only output this line, over and over again:

    2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0) AdminSocket: request 'mon_status' not defined

node-02: I added this guy to replace node-10. I updated ceph.conf and pushed it to all the monitor nodes (the osd nodes without monitors did not get the config push). Since he's a new guy the log output is obviously different, but again, here are the last 50 lines: http://pastebin.com/pfixdD3d

I run my ceph client from my OpenStack controller. All ceph -s shows me is faults, albeit only to node-15:

    2015-03-18 16:47:27.145194 7ff762cff700 0 -- 192.168.241.100:0/15112 >> 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault

Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S

So that's where we stand. Did we kill our Ceph Cluster (and thus our OpenStack Cloud)? Or is there hope? Any suggestions would be greatly appreciated.

--Greg Chavez
Re: [ceph-users] OSD turned itself off
    .../2007323, failed lossy con, dropping message 0x12989400
    -855 2015-01-10 22:01:36.589036 7f6d5b954700 0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(727627 rbd_data.1cc69413d1b58ba.0055 [stat,write 2289664~4096] ondisk = 0) v4 remote, 10.168.7.54:0/1007323, failed lossy con, dropping message 0x24f68800
    -819 2015-01-12 05:25:06.229753 7f6d3646c700 0 -- 10.168.7.23:6819/10217 >> 10.168.7.53:0/2019809 pipe(0x1f0e9680 sd=460 :6819 s=0 pgs=0 cs=0 l=1 c=0x13090420).accept replacing existing (lossy) channel (new one lossy=1)
    -818 2015-01-12 05:25:06.581703 7f6d37534700 0 -- 10.168.7.23:6819/10217 >> 10.168.7.53:0/1025252 pipe(0x1b67a780 sd=71 :6819 s=0 pgs=0 cs=0 l=1 c=0x16311e40).accept replacing existing (lossy) channel (new one lossy=1)
    -817 2015-01-12 05:25:21.342998 7f6d41167700 0 -- 10.168.7.23:6819/10217 >> 10.168.7.53:0/1025579 pipe(0x114e8000 sd=502 :6819 s=0 pgs=0 cs=0 l=1 c=0x16310160).accept replacing existing (lossy) channel (new one lossy=1)
    -808 2015-01-12 16:01:35.783534 7f6d5b954700 0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(752034 rbd_data.1cc69413d1b58ba.0055 [stat,write 2387968~8192] ondisk = 0) v4 remote, 10.168.7.54:0/1007323, failed lossy con, dropping message 0x1fde9a00
    -515 2015-01-25 18:44:23.303855 7f6d5b954700 0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(46402240 rbd_data.4b8e9b3d1b58ba.0471 [read 1310720~4096] ondisk = 0) v4 remote, 10.168.7.51:0/1017204, failed lossy con, dropping message 0x250bce00
    -303 2015-02-02 22:30:03.140599 7f6d5c155700 0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(17710313 rbd_data.1cc69562eb141f2.03ce [stat,write 4145152~4096] ondisk = 0) v4 remote, 10.168.7.54:0/2007323, failed lossy con, dropping message 0x1c5d4200
    -236 2015-02-05 15:29:04.945660 7f6d3d357700 0 -- 10.168.7.23:6819/10217 >> 10.168.7.51:0/1026961 pipe(0x1c63e780 sd=203 :6819 s=0 pgs=0 cs=0 l=1 c=0x11dc8dc0).accept replacing existing (lossy) channel (new one lossy=1)
    -66 2015-02-10 20:20:36.673969 7f6d5b954700 0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(11088 rbd_data.10b8c82eb141f2.4459 [stat,write 749568~8192] ondisk = 0) v4 remote, 10.168.7.55:0/1005630, failed lossy con, dropping message 0x138db200

Could this have led to the data being erroneous, or is the -5 return code just a sign of a broken hard drive?

These are the OSDs creating new connections to each other because the previous ones failed. That's not necessarily a problem (although here it's probably a symptom of some kind of issue, given the frequency) and cannot introduce data corruption of any kind.

I'm not seeing any -5 return codes as part of that messenger debug output, so unless you were referring to your EIO from last June I'm not sure what that's about? (If you do mean EIOs, yes, they're still a sign of a broken hard drive or local FS.)

Cheers,
Josef

On 14 Jun 2014, at 02:38, Josef Johansson jo...@oderland.se wrote:

Thanks for the quick response.
Cheers,
Josef

Gregory Farnum skrev 2014-06-14 02:36:

On Fri, Jun 13, 2014 at 5:25 PM, Josef Johansson jo...@oderland.se wrote:

Hi Greg,

Thanks for the clarification. I believe the OSD was in the middle of a deep scrub (sorry for not mentioning this straight away), so it could have been a silent error that the scrub then uncovered?

Yeah.

What's best practice when the store is corrupted like this?

Remove the OSD from the cluster, and either reformat the disk or replace as you judge appropriate.
-Greg

Cheers,
Josef

Gregory Farnum skrev 2014-06-14 02:21:

The OSD did a read off of the local filesystem and it got back the EIO error code. That means the store got corrupted or something, so it killed itself to avoid spreading bad data to the rest of the cluster.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Fri, Jun 13, 2014 at 5:16 PM, Josef Johansson jo...@oderland.se wrote:

Hey,

Just examining what happened to an OSD that was just turned off. Data has been moved away from it, so I'm hesitant to turn it back on. Got the below in the logs; any clues to what the assert talks about?

Cheers,
Josef

    -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const hobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7fdacb88c700 time 2014-06-11 21:13:54.036982
    os/FileStore.cc: 2992: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
    ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
    1: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, bool)+0x653) [0x8ab6c3]
    2: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x350) [0x708230]
    3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x86) [0x713366]
    4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3095
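For reference, "remove the OSD from the cluster" translates to roughly the following sequence; osd.12 is a placeholder id, and the rebalance should finish before the final removal:

    # Mark the OSD out so data migrates off it:
    ceph osd out 12
    # Once the cluster is active+clean again, remove it completely:
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12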
Re: [ceph-users] Poor performance on all SSD cluster
How does RBD cache work? I wasn't able to find an adequate explanation in the docs.

On Sunday, June 22, 2014, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

Good point, I had neglected to do that. So, amending my ceph.conf [1]:

    [client]
    rbd cache = true
    rbd cache size = 2147483648
    rbd cache max dirty = 1073741824
    rbd cache max dirty age = 100

and also the VM's xml definition to set cache to writeback:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback' io='native'/>
      <auth username='admin'>
        <secret type='ceph' uuid='cd2d3ab1-2d31-41e0-ab08-3d0c6e2fafa0'/>
      </auth>
      <source protocol='rbd' name='rbd/vol1'>
        <host name='192.168.1.64' port='6789'/>
      </source>
      <target dev='vdb' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>

Retesting from inside the VM:

    $ dd if=/dev/zero of=/mnt/vol1/scratch/file bs=16k count=65535 oflag=direct
    65535+0 records in
    65535+0 records out
    1073725440 bytes (1.1 GB) copied, 8.1686 s, 131 MB/s

Which is much better, so certainly for the librbd case enabling the rbd cache seems to nail this particular issue.

Regards
Mark

[1] possibly somewhat aggressively set, but at least a noticeable difference :-)

On 22/06/14 19:02, Haomai Wang wrote:

Hi Mark,
Do you enable rbdcache? I tested on my ssd cluster (only one ssd); it seemed ok.

    dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
    82.3MB/s
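As a hypothetical way to confirm librbd actually picked those settings up, the client admin socket can be queried; this assumes an "admin socket = /var/run/ceph/$name.$pid.asok" entry in the [client] section, and the socket path below is a placeholder:

    # Dump the running client's config and check the rbd cache knobs:
    ceph --admin-daemon /var/run/ceph/client.admin.12345.asok config show | grep rbd_cache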
Re: [ceph-users] Poor performance on all SSD cluster
On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson mark.nel...@inktank.com wrote:

RBD Cache is definitely going to help in this use case. This test is basically just sequentially writing a single 16k chunk of data out, one at a time. IE, entirely latency bound. At least on OSDs backed by XFS, you have to wait for that data to hit the journals of every OSD associated with the object before the acknowledgement gets sent back to the client.

Again, I can reproduce this with replication disabled.

If you are using the default 4MB block size, you'll hit the same OSDs over and over again and your other OSDs will sit there twiddling their thumbs waiting for IO until you hit the next block, but then it will just be a different set of OSDs getting hit. You should be able to verify this by using iostat or collectl or something to look at the behaviour of the SSDs during the test. Since this is all sequential though, switching to buffered IO (ie coalesce IOs at the buffercache layer) or using RBD cache for direct IO (coalesce IOs below the block device) will dramatically improve things.

This makes sense. Given the following scenario (the sketch after this message makes the arithmetic explicit):

- No replication
- osd_op time average is .015 seconds (stddev ~.003 seconds)
- Network latency is approximately .000237 seconds on avg

I should be getting 60 IOPS from the OSD reporting this time, right? So 60 * 16 kB = 960 kB/s. That's slightly slower than we're actually getting, because I'm only able to sample the slowest ops; we're getting closer to 100 IOPS. But that does make sense, I suppose. So the only way to improve performance would be to not use O_DIRECT (as this should bypass rbd cache as well, right?).

Ceph is pretty good at small random IO with lots of parallelism on spinning disk backed OSDs (so long as you use SSD journals or controllers with WB cache). It's much harder to get native-level IOPS rates with SSD backed OSDs though. The latency involved in distributing and processing all of that data becomes a much bigger deal. Having said that, we are actively working on improving latency as much as we can. :)

And this is true because flushing from the journal to spinning disks is going to coalesce the writes into the appropriate blocks in a meaningful way, right? Or I guess... why is this? Why doesn't that happen with SSD journals and SSD OSDs?
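A throwaway sketch of the latency-bound estimate above, with the numbers copied from the message (the post rounds 1/0.015 down to 60 IOPS, giving 60 * 16 kB = 960 kB/s):

    # One synchronous 16k write at a time: throughput = (1/latency) * block size
    awk 'BEGIN { lat = 0.015; bs = 16 * 1024; iops = 1 / lat;
                 printf "%.0f IOPS -> %.0f kB/s\n", iops, iops * bs / 1000 }'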
Re: [ceph-users] Poor performance on all SSD cluster
I'm using Crucial M500s.

On Sat, Jun 21, 2014 at 7:09 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

I can reproduce this in ceph version 0.81-423-g1fb4574 on Ubuntu 14.04. I have a two-osd cluster with data on two sata spinners (WD Blacks) and journals on two ssds (Crucial M4s). I'm getting about 3.5 MB/s (kernel and librbd) using your dd command with direct on. Leaving off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11 [2]). The ssds can do writes at about 180 MB/s each... which is something to look at another day [1].

It would be interesting to know what version of Ceph Tyler is using, as his setup seems not nearly as impacted by adding direct. Also it might be useful to know what make and model of ssd you both are using (some of 'em do not like a series of essentially sync writes). Having said that, testing my Crucial M4s shows they can do the dd command (with direct *on*) at about 180 MB/s... hmmm... so it *is* the Ceph layer, it seems.

Regards
Mark

[1] I set filestore_max_sync_interval = 100 (30G journal... ssd able to do 180 MB/s etc), however I am still seeing writes to the spinners during the 8s or so that the above dd tests take.
[2] Ubuntu 13.10 VM - I'll upgrade it to 14.04 and see if that helps at all.

On 21/06/14 09:17, Greg Poirier wrote:

Thanks Tyler. So, I'm not totally crazy. There is something weird going on. I've looked into things about as much as I can:

- We have tested with collocated journals and dedicated journal disks.
- We have bonded 10Gb nics and have verified network configuration and connectivity is sound.
- We have run dd independently on the SSDs in the cluster and they are performing fine.
- We have tested both in a VM and with the RBD kernel module and get identical performance.
- We have pool size = 3, pool min size = 2 and have tested with min size of 2 and 3 -- the performance impact is not bad.
- osd_op times are approximately 6-12ms.
- osd_sub_op times are 6-12ms.
- iostat reports service time of 6-12ms.
- Latency between the storage and rbd client is approximately .1-.2ms.
- Disabling replication entirely did not help significantly.

On Fri, Jun 20, 2014 at 2:13 PM, Tyler Wilson k...@linuxdigital.net wrote:

Greg,

Not a real fix for you, but I too run a full-ssd cluster and am able to get 112 MB/s with your command:

    [root@plesk-test ~]# dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct
    65535+0 records in
    65535+0 records out
    1073725440 bytes (1.1 GB) copied, 9.59092 s, 112 MB/s

This of course is in a VM. Here is my ceph config:

    [global]
    fsid = hidden
    mon_initial_members = node-1 node-2 node-3
    mon_host = 192.168.0.3 192.168.0.4 192.168.0.5
    auth_supported = cephx
    osd_journal_size = 2048
    filestore_xattr_use_omap = true
    osd_pool_default_size = 2
    osd_pool_default_min_size = 1
    osd_pool_default_pg_num = 1024
    public_network = 192.168.0.0/24
    osd_mkfs_type = xfs
    cluster_network = 192.168.1.0/24

On Fri, Jun 20, 2014 at 11:08 AM, Greg Poirier greg.poir...@opower.com wrote:

I recently created a 9-node Firefly cluster backed by all SSDs. We have had some pretty severe performance degradation when using O_DIRECT in our tests (as this is how MySQL will be interacting with RBD volumes, this makes the most sense for a preliminary test).

Running the following test:

    dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct
    779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s

Shows us only about 1.5 MB/s throughput and 100 IOPS from the single dd thread. Running a second dd process does show increased throughput, which is encouraging, but I am still concerned by the low throughput of a single thread w/ O_DIRECT.

Two threads:

    779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
    126271488 bytes (126 MB) copied, 99.2069 s, 1.3 MB/s

I am testing with an RBD volume mounted with the kernel module (I have also tested from within KVM, similar performance). If I allow caching, we start to see reasonable numbers from a single dd process:

    dd if=/dev/zero of=testfilasde bs=16k count=65535
    65535+0 records in
    65535+0 records out
    1073725440 bytes (1.1 GB) copied, 2.05356 s, 523 MB/s

I can get 1 GB/s from a single host with three threads. Rados bench produces similar results.

Is there something I can do to increase the performance of O_DIRECT? I expect performance degradation, but so much? If I increase the blocksize to 4M, I'm able to get significantly higher throughput:

    3833593856 bytes (3.8 GB) copied, 44.2964 s, 86.5 MB/s
Re: [ceph-users] Poor performance on all SSD cluster
We actually do have a use pattern of large batch sequential writes, and this dd is pretty similar to that use case. A round-trip write with replication takes approximately 10-15ms to complete. I've been looking at dump_historic_ops on a number of OSDs and getting mean, min, and max for sub_op and ops. If these were on the order of 1-2 seconds, I could understand this throughput... but we're talking about fairly fast SSDs and a 20Gbps network with 1ms latency for TCP round-trip between the client machine and all of the OSD hosts.

I've gone so far as disabling replication entirely (which had almost no impact) and putting journals on separate SSDs from the data disks (which are ALSO SSDs). This just doesn't make sense to me.

On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson mark.nel...@inktank.com wrote:

On 06/22/2014 02:02 AM, Haomai Wang wrote:

Hi Mark,
Do you enable rbdcache? I tested on my ssd cluster (only one ssd); it seemed ok.

    dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
    82.3MB/s

RBD Cache is definitely going to help in this use case. This test is basically just sequentially writing a single 16k chunk of data out, one at a time. IE, entirely latency bound. At least on OSDs backed by XFS, you have to wait for that data to hit the journals of every OSD associated with the object before the acknowledgement gets sent back to the client. If you are using the default 4MB block size, you'll hit the same OSDs over and over again and your other OSDs will sit there twiddling their thumbs waiting for IO until you hit the next block, but then it will just be a different set of OSDs getting hit. You should be able to verify this by using iostat or collectl or something to look at the behaviour of the SSDs during the test. Since this is all sequential though, switching to buffered IO (ie coalesce IOs at the buffercache layer) or using RBD cache for direct IO (coalesce IOs below the block device) will dramatically improve things.

The real question here though, is whether or not a synchronous stream of sequential 16k writes is even remotely close to the IO patterns that would be seen in actual use for MySQL. Most likely in actual use you'll be seeing a big mix of random reads and writes, and hopefully at least some parallelism (though this depends on the number of databases, number of users, and the workload!). Ceph is pretty good at small random IO with lots of parallelism on spinning disk backed OSDs (so long as you use SSD journals or controllers with WB cache). It's much harder to get native-level IOPS rates with SSD backed OSDs though. The latency involved in distributing and processing all of that data becomes a much bigger deal. Having said that, we are actively working on improving latency as much as we can. :)

Mark

On Sun, Jun 22, 2014 at 11:50 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

On 22/06/14 14:09, Mark Kirkwood wrote:

Upgrading the VM to 14.04 and retesting the case *without* direct I get:

- 164 MB/s (librbd)
- 115 MB/s (kernel 3.13)

So managing to almost get native performance out of the librbd case. I tweaked both filestore max and min sync intervals (100 and 10 respectively) just to see if I could actually avoid writing to the spinners while the test was in progress (still seeing some, but clearly fewer). However, no improvement at all *with* direct enabled.

The output of iostat on the host while the direct test is in progress is interesting:

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              11.73    0.00    5.04    0.76    0.00   82.47

    Device: rrqm/s wrqm/s    r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
    sda       0.00   0.00   0.00   11.00    0.00    4.02   749.09     0.14  12.36    0.00   12.36   6.55   7.20
    sdb       0.00   0.00   0.00   11.00    0.00    4.02   749.09     0.14  12.36    0.00   12.36   5.82   6.40
    sdc       0.00   0.00   0.00  435.00    0.00    4.29    20.21     0.53   1.21    0.00    1.21   1.21  52.80
    sdd       0.00   0.00   0.00  435.00    0.00    4.29    20.21     0.52   1.20    0.00    1.20   1.20  52.40

(sda, sdb are the spinners; sdc, sdd the ssds.) Something is making the journal work very hard for its 4.29 MB/s!

regards
Mark

Leaving off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11 [2]). The ssds can do writes at about 180 MB/s each... which is something to look at another day [1].
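For reference, the sync-interval tweak mentioned above lives in ceph.conf; a sketch with this thread's values, which are aggressive and sized to a journal that can absorb the extra buffering:

    [osd]
    # Let the journal absorb up to 100 s of writes before forcing a
    # filestore sync; the defaults are far lower (0.01 / 5 s).
    filestore min sync interval = 10
    filestore max sync interval = 100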
[ceph-users] Poor performance on all SSD cluster
I recently created a 9-node Firefly cluster backed by all SSDs. We have had some pretty severe performance degradation when using O_DIRECT in our tests (as this is how MySQL will be interacting with RBD volumes, this makes the most sense for a preliminary test).

Running the following test:

    dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct
    779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s

Shows us only about 1.5 MB/s throughput and 100 IOPS from the single dd thread. Running a second dd process does show increased throughput, which is encouraging, but I am still concerned by the low throughput of a single thread w/ O_DIRECT.

Two threads:

    779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
    126271488 bytes (126 MB) copied, 99.2069 s, 1.3 MB/s

I am testing with an RBD volume mounted with the kernel module (I have also tested from within KVM, similar performance). If I allow caching, we start to see reasonable numbers from a single dd process:

    dd if=/dev/zero of=testfilasde bs=16k count=65535
    65535+0 records in
    65535+0 records out
    1073725440 bytes (1.1 GB) copied, 2.05356 s, 523 MB/s

I can get 1 GB/s from a single host with three threads. Rados bench produces similar results.

Is there something I can do to increase the performance of O_DIRECT? I expect performance degradation, but so much? If I increase the blocksize to 4M, I'm able to get significantly higher throughput:

    3833593856 bytes (3.8 GB) copied, 44.2964 s, 86.5 MB/s

This still seems very low. I'm using the deadline scheduler in all places. With the noop scheduler, I do not see a performance improvement. Suggestions?
[ceph-users] Backfill and Recovery traffic shaping
We have a cluster in a sub-optimal configuration with data and journal colocated on OSDs (that coincidentally are spinning disks). During recovery/backfill, the entire cluster suffers degraded performance because of the IO storm that backfills cause. Client IO becomes extremely latent.

I've tried to decrease the impact that recovery/backfill has with the following:

    ceph tell osd.* injectargs '--osd-max-backfills 1'
    ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
    ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
    ceph tell osd.* injectargs '--osd-client-op-priority 63'
    ceph tell osd.* injectargs '--osd-recovery-max-active 1'

The only other option I have left would be to use linux traffic shaping to artificially reduce the bandwidth available to the interface tagged for cluster traffic (instead of separate physical networks, we use VLAN tagging). We are nowhere _near_ the point where network saturation would cause the latency we're seeing, so I am left to believe that it is simply disk IO saturation. I could be wrong about this assumption, though, as iostat doesn't terrify me. This could be suboptimal network configuration on the cluster as well. I'm still looking into that possibility, but I wanted to get feedback on what I'd done already first -- as well as the proposed traffic shaping idea. Thoughts?
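A sketch of the traffic-shaping idea, assuming the cluster network rides a tagged VLAN sub-interface; the interface name and rate are placeholders and untested:

    # Cap egress on the cluster-network VLAN with a token bucket filter:
    tc qdisc add dev eth0.100 root tbf rate 2gbit burst 256k latency 50ms
    # Revert once recovery finishes:
    tc qdisc del dev eth0.100 root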
Re: [ceph-users] Backfill and Recovery traffic shaping
On Saturday, April 19, 2014, Mike Dawson mike.daw...@cloudapt.com wrote:

With a workload consisting of lots of small writes, I've seen client IO starved with as little as 5Mbps of traffic per host due to spindle contention once deep-scrub and/or recovery/backfill start. Co-locating OSD journals on the same spinners as you have will double that likelihood.

Yeah. We're working on addressing the collocation issues.

Possible solutions include moving OSD journals to SSD (with a reasonable ratio), expanding the cluster, or increasing the performance of underlying storage.

We are considering an all-SSD cluster. If I'm not mistaken, at that point journal collocation isn't as much of an issue, since IOPS and seek time stop being the bottleneck.
[ceph-users] Useful visualizations / metrics
I'm in the process of building a dashboard for our Ceph nodes. I was wondering if anyone out there had instrumented their OSD / MON clusters and found particularly useful visualizations.

At first, I was trying to do ridiculous things (like graphing % used for every disk in every OSD host), but I realized quickly that that is simply too many metrics and far too visually dense to be useful. I am attempting to put together a few simpler, more dense visualizations like... overall cluster utilization, aggregate cpu and memory utilization per osd host, etc.

Just looking for some suggestions. Thanks!
Re: [ceph-users] Useful visualizations / metrics
Curious as to how you define cluster latency.

On Sat, Apr 12, 2014 at 7:21 AM, Jason Villalta ja...@rubixnet.com wrote:

Hi, I have not done anything with metrics yet, but the only ones I personally would be interested in are total capacity utilization and cluster latency. Just my 2 cents.

On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier greg.poir...@opower.com wrote:

I'm in the process of building a dashboard for our Ceph nodes. I was wondering if anyone out there had instrumented their OSD / MON clusters and found particularly useful visualizations. At first, I was trying to do ridiculous things (like graphing % used for every disk in every OSD host), but I realized quickly that that is simply too many metrics and far too visually dense to be useful. I am attempting to put together a few simpler, more dense visualizations like... overall cluster utilization, aggregate cpu and memory utilization per osd host, etc. Just looking for some suggestions. Thanks!

--
Jason Villalta
Co-founder
800.799.4407 x1230 | www.RubixTechnology.com
Re: [ceph-users] Useful visualizations / metrics
We are collecting system metrics through sysstat every minute and getting those to OpenTSDB via Sensu. We have a plethora of metrics, but I am finding it difficult to create meaningful visualizations. We have alerting for things like individual OSDs reaching capacity thresholds and memory spikes on OSD or MON hosts. I am just trying to come up with some visualizations that could become solid indicators that something is wrong with the cluster in general, or with a particular host (besides CPU or memory utilization).

This morning, I have thought of things like (a toy sketch of the stddev computation follows this message):

- Stddev of bytes used on all disks in the cluster and individual OSD hosts
- 1st and 2nd derivative of bytes used on all disks in the cluster and individual OSD hosts
- Bytes used in the entire cluster
- % usage of cluster capacity

Stddev should help us identify hotspots. Velocity and acceleration of bytes used should help us with capacity planning. Bytes used in general is just a neat thing to see, but doesn't tell us all that much. % usage of cluster capacity is another thing that's just kind of neat to see.

What would you suggest looking for in dump_historic_ops? Maybe get regular metrics on things like total transaction length? The only problem is that dump_historic_ops may not always contain relevant/recent data. It is not as easily translated into time series data as some other things.

On Sat, Apr 12, 2014 at 9:23 AM, Mark Nelson mark.nel...@inktank.com wrote:

One thing I do right now for ceph performance testing is run a copy of collectl during every test. This gives you a TON of information about CPU usage, network stats, disk stats, etc. It's pretty easy to import the output data into gnuplot. Mark Seger (the creator of collectl) also has some tools to gather aggregate statistics across multiple nodes.

Beyond collectl, you can get a ton of useful data out of the ceph admin socket. I especially like dump_historic_ops, as it sometimes is enough to avoid having to parse through debug 20 logs.

While the following tools have too much overhead to be really useful for general system monitoring, they are really useful for specific performance investigations:

1) perf with the dwarf/unwind support
2) blktrace (optionally with seekwatcher)
3) valgrind (cachegrind, callgrind, massif)

Beyond that, there are some collectd plugins for Ceph, and last time I checked DreamHost was using Graphite for a lot of visualizations. There's always ganglia too. :)

Mark

On 04/12/2014 09:41 AM, Jason Villalta wrote:

I know ceph throws some warnings if there is high write latency. But I would be most interested in the delay for IO requests, linking directly to IOPS. If IOPS start to drop because the disks are overwhelmed, then latency for requests would be increasing. This would tell me that I need to add more OSDs/nodes. I am not sure there is a specific metric in ceph for this, but it would be awesome if there was.

On Sat, Apr 12, 2014 at 10:37 AM, Greg Poirier greg.poir...@opower.com wrote:

Curious as to how you define cluster latency.

On Sat, Apr 12, 2014 at 7:21 AM, Jason Villalta ja...@rubixnet.com wrote:

Hi, I have not done anything with metrics yet, but the only ones I personally would be interested in are total capacity utilization and cluster latency. Just my 2 cents.

On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier greg.poir...@opower.com wrote:

I'm in the process of building a dashboard for our Ceph nodes. I was wondering if anyone out there had instrumented their OSD / MON clusters and found particularly useful visualizations. At first, I was trying to do ridiculous things (like graphing % used for every disk in every OSD host), but I realized quickly that that is simply too many metrics and far too visually dense to be useful. I am attempting to put together a few simpler, more dense visualizations like... overall cluster utilization, aggregate cpu and memory utilization per osd host, etc. Just looking for some suggestions. Thanks!

--
Jason Villalta
Co-founder
800.799.4407 x1230 | www.RubixTechnology.com
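A toy sketch of the stddev metric described above; it reads one bytes-used sample per OSD per line, and how those samples are collected (sysstat, df, the admin socket) is left to the pipeline:

    # Population mean and stddev of per-OSD bytes used:
    awk '{ n++; sum += $1; sumsq += $1 * $1 }
         END { mean = sum / n;
               printf "mean=%.0f stddev=%.0f\n", mean, sqrt(sumsq / n - mean * mean) }' osd_bytes_used.txt

A spike in the stddev relative to the mean is the hotspot signal; the first derivative of the mean over time gives the capacity-planning velocity.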
Re: [ceph-users] OSD full - All RBD Volumes stopped responding
So... our storage problems persisted for about 45 minutes. I gave an entire hypervisor worth of VMs time to recover (approx. 30 VMs), and none of them recovered on their own. In the end, we had to stop and start every VM (easily done, it was just alarming). Once rebooted, the VMs of course were fine.

I marked the two full OSDs as down and out. I am a little concerned that these two are full while the cluster, in general, is only at 50% capacity. It appears we may have a hot spot. I'm going to look into that later today.

Also, I'm not sure how it happened, but pgp_num is lower than pg_num. I had not noticed that until last night. Will address that as well. This probably happened when I last resized placement groups, or potentially when I set up object storage pools.

On Fri, Apr 11, 2014 at 3:49 AM, Wido den Hollander w...@42on.com wrote:

On 04/11/2014 09:23 AM, Josef Johansson wrote:

On 11/04/14 09:07, Wido den Hollander wrote:

Op 11 april 2014 om 8:50 schreef Josef Johansson jo...@oderland.se:

Hi,

On 11/04/14 07:29, Wido den Hollander wrote:

Op 11 april 2014 om 7:13 schreef Greg Poirier greg.poir...@opower.com:

One thing to note: all of our kvm VMs have to be rebooted. This is something I wasn't expecting. Tried waiting for them to recover on their own, but that's not happening. Rebooting them restores service immediately. :/ Not ideal.

A reboot isn't really required though. It could be that the VM itself is in trouble, but from a librados/librbd perspective I/O should simply continue as soon as an osdmap has been received without the full flag. It could be that you have to wait some time before the VM continues. This can take up to 15 minutes.

With other storage solutions you would have to change the timeout value for each disk, i.e. changing to 180 secs from 60 secs, for the VMs to survive storage problems. Does Ceph handle this differently somehow?

It's not that RBD does it differently. Librados simply blocks the I/O, and thus so does librbd, which then causes Qemu to block. I've seen VMs survive RBD issues for longer periods than 60 seconds. Gave them some time and they continued again. Which exact setting are you talking about? I'm talking about a Qemu/KVM VM running with a VirtIO drive.

cat /sys/block/*/device/timeout (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465) This file is non-existent for my Ceph-VirtIO drive, however, so it seems RBD handles this.

Well, I don't think it's handled by RBD; VirtIO simply doesn't have the timeout. That's probably only in the SCSI driver.

Wido

I have just para-virtualized VMs to compare with right now, and they don't have it inside the VM, but that's expected. From my understanding it should have been there if it was a HVM. Whenever the timeout was reached, an error occurred and the disk was set in read-only mode.

Cheers,
Josef

Wido

Cheers,
Josef

Wido

On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier greg.poir...@opower.com wrote:

Going to try increasing the full ratio. Disk utilization wasn't really growing at an unreasonable pace. I'm going to keep an eye on it for the next couple of hours and down/out the OSDs if necessary. We have four more machines that we're in the process of adding (which doubles the number of OSDs), but got held up by some networking nonsense. Thanks for the tips.

On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil s...@inktank.com wrote:

On Thu, 10 Apr 2014, Greg Poirier wrote:

Hi, I have about 200 VMs with a common RBD volume as their root filesystem and a number of additional filesystems on Ceph. All of them have stopped responding. One of the OSDs in my cluster is marked full. I tried stopping that OSD to force things to rebalance or at least go to degraded mode, but nothing is responding still. I'm not exactly sure what to do or how to investigate. Suggestions?

Try marking the osd out or partially out (ceph osd reweight N .9) to move some data off, and/or adjust the full ratio up (ceph pg set_full_ratio .95). Note that this becomes increasingly dangerous as OSDs get closer to full; add some disks.

sage

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on
Re: [ceph-users] OSD full - All RBD Volumes stopped responding
So, setting pgp_num to 2048 to match pg_num had a more serious impact than I expected. The cluster is rebalancing quite substantially (8.5% of objects being rebalanced)... which makes sense... Disk utilization is evening out fairly well, which is encouraging. (A sketch of the pg_num/pgp_num check follows this message.)

We are a little stumped as to why a few OSDs being full would cause the entire cluster to stop serving IO. Is this a configuration issue that we have? We're slowly recovering:

    health HEALTH_WARN 135 pgs backfill; 187 pgs backfill_toofull; 151 pgs backfilling; 2 pgs degraded; 369 pgs stuck unclean; 29 requests are blocked > 32 sec; recovery 2563902/52390259 objects degraded (4.894%); 4 near full osd(s)
    pgmap v8363400: 5120 pgs, 3 pools, 22635 GB data, 23872 kobjects
          48889 GB used, 45022 GB / 93911 GB avail
          2563902/52390259 objects degraded (4.894%)
              4751 active+clean
                31 active+remapped+wait_backfill
                 1 active+backfill_toofull
               103 active+remapped+wait_backfill+backfill_toofull
                 1 active+degraded+wait_backfill+backfill_toofull
               150 active+remapped+backfilling
                82 active+remapped+backfill_toofull
                 1 active+degraded+remapped+backfilling
    recovery io 362 MB/s, 365 objects/s
    client io 1643 kB/s rd, 6001 kB/s wr, 911 op/s

On Fri, Apr 11, 2014 at 5:45 AM, Greg Poirier greg.poir...@opower.com wrote:

So... our storage problems persisted for about 45 minutes. I gave an entire hypervisor worth of VMs time to recover (approx. 30 VMs), and none of them recovered on their own. In the end, we had to stop and start every VM (easily done, it was just alarming). Once rebooted, the VMs of course were fine.

I marked the two full OSDs as down and out. I am a little concerned that these two are full while the cluster, in general, is only at 50% capacity. It appears we may have a hot spot. I'm going to look into that later today.

Also, I'm not sure how it happened, but pgp_num is lower than pg_num. I had not noticed that until last night. Will address that as well. This probably happened when I last resized placement groups, or potentially when I set up object storage pools.

On Fri, Apr 11, 2014 at 3:49 AM, Wido den Hollander w...@42on.com wrote:

On 04/11/2014 09:23 AM, Josef Johansson wrote:

On 11/04/14 09:07, Wido den Hollander wrote:

Op 11 april 2014 om 8:50 schreef Josef Johansson jo...@oderland.se:

Hi,

On 11/04/14 07:29, Wido den Hollander wrote:

Op 11 april 2014 om 7:13 schreef Greg Poirier greg.poir...@opower.com:

One thing to note: all of our kvm VMs have to be rebooted. This is something I wasn't expecting. Tried waiting for them to recover on their own, but that's not happening. Rebooting them restores service immediately. :/ Not ideal.

A reboot isn't really required though. It could be that the VM itself is in trouble, but from a librados/librbd perspective I/O should simply continue as soon as an osdmap has been received without the full flag. It could be that you have to wait some time before the VM continues. This can take up to 15 minutes.

With other storage solutions you would have to change the timeout value for each disk, i.e. changing to 180 secs from 60 secs, for the VMs to survive storage problems. Does Ceph handle this differently somehow?

It's not that RBD does it differently. Librados simply blocks the I/O, and thus so does librbd, which then causes Qemu to block. I've seen VMs survive RBD issues for longer periods than 60 seconds. Gave them some time and they continued again. Which exact setting are you talking about? I'm talking about a Qemu/KVM VM running with a VirtIO drive.

cat /sys/block/*/device/timeout (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465) This file is non-existent for my Ceph-VirtIO drive, however, so it seems RBD handles this.

Well, I don't think it's handled by RBD; VirtIO simply doesn't have the timeout. That's probably only in the SCSI driver.

Wido

I have just para-virtualized VMs to compare with right now, and they don't have it inside the VM, but that's expected. From my understanding it should have been there if it was a HVM. Whenever the timeout was reached, an error occurred and the disk was set in read-only mode.

Cheers,
Josef

Wido

Cheers,
Josef

Wido

On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier greg.poir...@opower.com wrote:

Going to try increasing the full ratio. Disk utilization wasn't really growing at an unreasonable pace. I'm going to keep an eye on it for the next couple of hours and down/out the OSDs if necessary. We have four more machines that we're in the process of adding (which doubles the number of OSDs), but got held up by some networking nonsense. Thanks for the tips.

On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil s...@inktank.com wrote:

On Thu, 10 Apr 2014
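For reference, a sketch of checking and fixing the pg_num/pgp_num mismatch described above; the pool name is a placeholder, and raising pgp_num is exactly what triggered the 8.5% rebalance:

    ceph osd pool get rbd pg_num
    ceph osd pool get rbd pgp_num
    # Bring placement up to match the PG count; expect significant data movement:
    ceph osd pool set rbd pgp_num 2048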
[ceph-users] OSD full - All RBD Volumes stopped responding
Hi,

I have about 200 VMs with a common RBD volume as their root filesystem and a number of additional filesystems on Ceph. All of them have stopped responding. One of the OSDs in my cluster is marked full. I tried stopping that OSD to force things to rebalance or at least go to degraded mode, but nothing is responding still. I'm not exactly sure what to do or how to investigate. Suggestions?
Re: [ceph-users] OSD full - All RBD Volumes stopped responding
Going to try increasing the full ratio. Disk utilization wasn't really growing at an unreasonable pace. I'm going to keep an eye on it for the next couple of hours and down/out the OSDs if necessary. We have four more machines that we're in the process of adding (which doubles the number of OSDs), but got held up by some networking nonsense. Thanks for the tips.

On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil s...@inktank.com wrote:

On Thu, 10 Apr 2014, Greg Poirier wrote:

Hi, I have about 200 VMs with a common RBD volume as their root filesystem and a number of additional filesystems on Ceph. All of them have stopped responding. One of the OSDs in my cluster is marked full. I tried stopping that OSD to force things to rebalance or at least go to degraded mode, but nothing is responding still. I'm not exactly sure what to do or how to investigate. Suggestions?

Try marking the osd out or partially out (ceph osd reweight N .9) to move some data off, and/or adjust the full ratio up (ceph pg set_full_ratio .95). Note that this becomes increasingly dangerous as OSDs get closer to full; add some disks.

sage
Re: [ceph-users] OSD full - All RBD Volumes stopped responding
One thing to note: all of our kvm VMs have to be rebooted. This is something I wasn't expecting. Tried waiting for them to recover on their own, but that's not happening. Rebooting them restores service immediately. :/ Not ideal.

On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier greg.poir...@opower.com wrote:

Going to try increasing the full ratio. Disk utilization wasn't really growing at an unreasonable pace. I'm going to keep an eye on it for the next couple of hours and down/out the OSDs if necessary. We have four more machines that we're in the process of adding (which doubles the number of OSDs), but got held up by some networking nonsense. Thanks for the tips.

On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil s...@inktank.com wrote:

On Thu, 10 Apr 2014, Greg Poirier wrote:

Hi, I have about 200 VMs with a common RBD volume as their root filesystem and a number of additional filesystems on Ceph. All of them have stopped responding. One of the OSDs in my cluster is marked full. I tried stopping that OSD to force things to rebalance or at least go to degraded mode, but nothing is responding still. I'm not exactly sure what to do or how to investigate. Suggestions?

Try marking the osd out or partially out (ceph osd reweight N .9) to move some data off, and/or adjust the full ratio up (ceph pg set_full_ratio .95). Note that this becomes increasingly dangerous as OSDs get closer to full; add some disks.

sage
Re: [ceph-users] Replication lag in block storage
So, on the cluster that I _expect_ to be slow, it appears that we are waiting on journal commits. I want to make sure that I am reading this correctly:

    "received_at": "2014-03-14 12:14:22.659170",
    ...
    { "time": "2014-03-14 12:14:22.660191",
      "event": "write_thread_in_journal_buffer"},

At this point we have received the write and are attempting to write the transaction to the OSD's journal, yes? Then:

    { "time": "2014-03-14 12:14:22.900779",
      "event": "journaled_completion_queued"},

240ms later we have successfully written to the journal?

I expect this particular slowness due to colocation of journal and data on the same disk (and it's a spinning disk, not an SSD). I expect some of this could be alleviated by migrating journals to SSDs, but I am looking to rebuild in the near future -- so am willing to hobble in the meantime.

I am surprised that our all-SSD cluster is also underperforming. I am trying colocating the journal on the same disk with all SSDs at the moment and will see if the performance degradation is of the same nature.

On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum g...@inktank.com wrote:

Right. So which is the interval that's taking all the time? Probably it's waiting for the journal commit, but maybe there's something else blocking progress. If it is the journal commit, check out how busy the disk is (is it just saturated?) and what its normal performance characteristics are (is it breaking?).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier greg.poir...@opower.com wrote:

Many of the sub ops look like this, with significant lag between received_at and commit_sent:

    { "description": "osd_op(client.6869831.0:1192491 rbd_data.67b14a2ae8944a.9105 [write 507904~3686400] 6.556a4db0 e660)",
      "received_at": "2014-03-13 20:42:05.811936",
      "age": "46.088198",
      "duration": "0.038328",
      <snip>
      { "time": "2014-03-13 20:42:05.850215",
        "event": "commit_sent"},
      { "time": "2014-03-13 20:42:05.850264",
        "event": "done"}]]},

In this case, almost 39ms between received_at and commit_sent. A particularly egregious example of 80+ms lag between received_at and commit_sent:

    { "description": "osd_op(client.6869831.0:1190526 rbd_data.67b14a2ae8944a.8fac [write 3325952~868352] 6.5255f5fd e660)",
      "received_at": "2014-03-13 20:41:40.227813",
      "age": "320.017087",
      "duration": "0.086852",
      <snip>
      { "time": "2014-03-13 20:41:40.314633",
        "event": "commit_sent"},
      { "time": "2014-03-13 20:41:40.314665",
        "event": "done"}]]},

On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum g...@inktank.com wrote:

On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier greg.poir...@opower.com wrote:

We've been seeing this issue on all of our dumpling clusters, and I'm wondering what might be the cause of it. In dump_historic_ops, the time between op_applied and sub_op_commit_rec, or the time between commit_sent and sub_op_applied, is extremely high. Some of the osd_sub_ops are as long as 100 ms. A sample dump_historic_ops is included at the bottom.

It's important to understand what each of those timestamps is reporting.

- op_applied: the point at which an OSD has applied an operation to its readable backing filesystem in-memory (which for xfs or ext4 will be after it's committed to the journal)
- sub_op_commit_rec: the point at which an OSD has gotten commits from the replica OSDs
- commit_sent: the point at which a replica OSD has sent a commit back to its primary
- sub_op_applied: the point at which a replica OSD has applied a particular operation to its backing filesystem in-memory (again, after the journal if using xfs)

Reads are never served from replicas, so a long time between commit_sent and sub_op_applied should not in itself be an issue. A lag time between op_applied and sub_op_commit_rec means that the OSD is waiting on its replicas. A long time there indicates either that the replica is processing slowly, or that there's some issue in the communications stack (all the way from the raw ethernet up to the message handling in the OSD itself).

So the first thing to look for are sub ops which have a lag time between the received_at and commit_sent timestamps. If none of those ever turn up, but unusually long waits for sub_op_commit_rec are still present, then it'll take more effort to correlate particular subops on replicas with the op on the primary they correspond to, and see where the time lag is coming into it.

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
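For reference, the dumps quoted in this thread come from the OSD admin socket; a sketch, with the default socket path and a placeholder OSD id:

    # Show the slowest recent ops on osd.0, with per-event timestamps:
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops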
Re: [ceph-users] Replication lag in block storage
We are stressing these boxes pretty spectacularly at the moment. On every box I have one OSD that is pegged for IO almost constantly. ceph-1: Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdv 0.00 0.00 104.00 160.00 748.00 1000.0013.24 1.154.369.461.05 3.70 97.60 ceph-2: Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdq 0.0025.00 109.00 218.00 844.00 1773.5016.01 1.374.209.031.78 3.01 98.40 ceph-3: Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdm 0.00 0.00 126.00 56.00 996.00 540.0016.88 1.015.588.060.00 5.43 98.80 These are all disks in my block storage pool. osdmap e26698: 102 osds: 102 up, 102 in pgmap v6752413: 4624 pgs, 3 pools, 14151 GB data, 21729 kobjects 28517 GB used, 65393 GB / 93911 GB avail 4624 active+clean client io 1915 kB/s rd, 59690 kB/s wr, 1464 op/s I don't see any smart errors, but i'm slowly working my way through all of the disks on these machines with smartctl to see if anything stands out. On Fri, Mar 14, 2014 at 9:52 AM, Gregory Farnum g...@inktank.com wrote: On Fri, Mar 14, 2014 at 9:37 AM, Greg Poirier greg.poir...@opower.com wrote: So, on the cluster that I _expect_ to be slow, it appears that we are waiting on journal commits. I want to make sure that I am reading this correctly: received_at: 2014-03-14 12:14:22.659170, { time: 2014-03-14 12:14:22.660191, event: write_thread_in_journal_buffer}, At this point we have received the write and are attempting to write the transaction to the OSD's journal, yes? Then: { time: 2014-03-14 12:14:22.900779, event: journaled_completion_queued}, 240ms later we have successfully written to the journal? Correct. That seems an awfully long time for a 16K write, although I don't know how much data I have on co-located journals. (At least, I'm assuming it's in the 16K range based on the others, although I'm just now realizing that subops aren't providing that information...I've created a ticket to include that diagnostic info in future.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com I expect this particular slowness due to colocation of journal and data on the same disk (and it's a spinning disk, not an SSD). I expect some of this could be alleviated by migrating journals to SSDs, but I am looking to rebuild in the near future--so am willing to hobble in the meantime. I am surprised that our all SSD cluster is also underperforming. I am trying colocating the journal on the same disk with all SSDs at the moment and will see if the performance degradation is of the same nature. On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum g...@inktank.com wrote: Right. So which is the interval that's taking all the time? Probably it's waiting for the journal commit, but maybe there's something else blocking progress. If it is the journal commit, check out how busy the disk is (is it just saturated?) and what its normal performance characteristics are (is it breaking?). 
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier greg.poir...@opower.com wrote:
Many of the sub ops look like this, with significant lag between received_at and commit_sent:
{ description: osd_op(client.6869831.0:1192491 rbd_data.67b14a2ae8944a.9105 [write 507904~3686400] 6.556a4db0 e660),
received_at: 2014-03-13 20:42:05.811936,
age: 46.088198,
duration: 0.038328,
[snip]
{ time: 2014-03-13 20:42:05.850215, event: commit_sent},
{ time: 2014-03-13 20:42:05.850264, event: done}]]},
In this case almost 39ms between received_at and commit_sent. A particularly egregious example of 80+ms lag between received_at and commit_sent:
{ description: osd_op(client.6869831.0:1190526 rbd_data.67b14a2ae8944a.8fac [write 3325952~868352] 6.5255f5fd e660),
received_at: 2014-03-13 20:41:40.227813,
age: 320.017087,
duration: 0.086852,
[snip]
{ time: 2014-03-13 20:41:40.314633, event: commit_sent},
{ time: 2014-03-13 20:41:40.314665, event: done}]]},
On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum g...@inktank.com wrote:
On Thu, Mar 13, 2014 at 3:56 PM
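The smartctl sweep mentioned above can be scripted; a rough sketch (assumes plain SATA/SAS devices visible as /dev/sd*; disks behind RAID controllers need smartctl's -d flag):

  # Print the SMART overall-health verdict for every sd* device.
  for dev in /dev/sd?; do
      echo "== $dev =="
      smartctl -H "$dev" | grep -i 'test result'
  done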
Re: [ceph-users] Replication lag in block storage
Many of the sub ops look like this, with significant lag between received_at and commit_sent:
[snip: sub op examples quoted in full above]
On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum g...@inktank.com wrote:
On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier greg.poir...@opower.com wrote:
We've been seeing this issue on all of our dumpling clusters, and I'm wondering what might be the cause of it. In dump_historic_ops, the time between op_applied and sub_op_commit_rec, or the time between commit_sent and sub_op_applied, is extremely high. Some of the osd_sub_ops take as long as 100 ms. A sample dump_historic_ops is included at the bottom.
It's important to understand what each of those timestamps is reporting.
[snip: timestamp definitions quoted in full above]
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] no user info saved after user creation / can't create buckets
And the debug log, because that last log was obviously not helpful...

2014-03-12 23:57:49.497780 7ff97e7dd700 1 == starting new request req=0x23bc650 =
2014-03-12 23:57:49.498198 7ff97e7dd700 2 req 1:0.000419::PUT /test::initializing
2014-03-12 23:57:49.498233 7ff97e7dd700 10 host=s3.amazonaws.com rgw_dns_name=us-west-1.domain
2014-03-12 23:57:49.498366 7ff97e7dd700 10 s->object=NULL s->bucket=test
2014-03-12 23:57:49.498437 7ff97e7dd700 2 req 1:0.000659:s3:PUT /test::getting op
2014-03-12 23:57:49.498448 7ff97e7dd700 2 req 1:0.000670:s3:PUT /test:create_bucket:authorizing
2014-03-12 23:57:49.498508 7ff97e7dd700 10 cache get: name=.us-west-1.users+BLAHBLAHBLAH : miss
2014-03-12 23:57:49.500852 7ff97e7dd700 10 cache put: name=.us-west-1.users+BLAHBLAHBLAH
2014-03-12 23:57:49.500865 7ff97e7dd700 10 adding .us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.500886 7ff97e7dd700 10 moving .us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.500889 7ff97e7dd700 10 cache get: name=.us-west-1.users+BLAHBLAHBLAH : type miss (requested=1, cached=6)
2014-03-12 23:57:49.500907 7ff97e7dd700 10 moving .us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.500910 7ff97e7dd700 10 cache get: name=.us-west-1.users+BLAHBLAHBLAH : hit
2014-03-12 23:57:49.502663 7ff97e7dd700 10 cache put: name=.us-west-1.users+BLAHBLAHBLAH
2014-03-12 23:57:49.502667 7ff97e7dd700 10 moving .us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.502700 7ff97e7dd700 10 cache get: name=.us-west-1.users.uid+test : miss
2014-03-12 23:57:49.505128 7ff97e7dd700 10 cache put: name=.us-west-1.users.uid+test
2014-03-12 23:57:49.505138 7ff97e7dd700 10 adding .us-west-1.users.uid+test to cache LRU end
2014-03-12 23:57:49.505157 7ff97e7dd700 10 moving .us-west-1.users.uid+test to cache LRU end
2014-03-12 23:57:49.505160 7ff97e7dd700 10 cache get: name=.us-west-1.users.uid+test : type miss (requested=1, cached=6)
2014-03-12 23:57:49.505176 7ff97e7dd700 10 moving .us-west-1.users.uid+test to cache LRU end
2014-03-12 23:57:49.505178 7ff97e7dd700 10 cache get: name=.us-west-1.users.uid+test : hit
2014-03-12 23:57:49.507401 7ff97e7dd700 10 cache put: name=.us-west-1.users.uid+test
2014-03-12 23:57:49.507406 7ff97e7dd700 10 moving .us-west-1.users.uid+test to cache LRU end
2014-03-12 23:57:49.507521 7ff97e7dd700 10 get_canon_resource(): dest=/test
2014-03-12 23:57:49.507529 7ff97e7dd700 10 auth_hdr: PUT binary/octet-stream Wed, 12 Mar 2014 23:57:51 GMT /test
2014-03-12 23:57:49.507674 7ff97e7dd700 2 req 1:0.009895:s3:PUT /test:create_bucket:reading permissions
2014-03-12 23:57:49.507682 7ff97e7dd700 2 req 1:0.009904:s3:PUT /test:create_bucket:verifying op mask
2014-03-12 23:57:49.507695 7ff97e7dd700 2 req 1:0.009917:s3:PUT /test:create_bucket:verifying op permissions
2014-03-12 23:57:49.509604 7ff97e7dd700 2 req 1:0.011826:s3:PUT /test:create_bucket:verifying op params
2014-03-12 23:57:49.509615 7ff97e7dd700 2 req 1:0.011836:s3:PUT /test:create_bucket:executing
2014-03-12 23:57:49.509694 7ff97e7dd700 10 cache get: name=.us-west-1.domain.rgw+test : miss
2014-03-12 23:57:49.512229 7ff97e7dd700 10 cache put: name=.us-west-1.domain.rgw+test
2014-03-12 23:57:49.512259 7ff97e7dd700 10 adding .us-west-1.domain.rgw+test to cache LRU end
2014-03-12 23:57:49.512333 7ff97e7dd700 10 cache get: name=.us-west-1.domain.rgw+.pools.avail : miss
2014-03-12 23:57:49.518216 7ff97e7dd700 10 cache put: name=.us-west-1.domain.rgw+.pools.avail
2014-03-12 23:57:49.518228 7ff97e7dd700 10 adding .us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.518248 7ff97e7dd700 10 moving .us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.518251 7ff97e7dd700 10 cache get: name=.us-west-1.domain.rgw+.pools.avail : type miss (requested=1, cached=6)
2014-03-12 23:57:49.518270 7ff97e7dd700 10 moving .us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.518272 7ff97e7dd700 10 cache get: name=.us-west-1.domain.rgw+.pools.avail : hit
2014-03-12 23:57:49.520295 7ff97e7dd700 10 cache put: name=.us-west-1.domain.rgw+.pools.avail
2014-03-12 23:57:49.520348 7ff97e7dd700 10 moving .us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.522672 7ff97e7dd700 2 req 1:0.024893:s3:PUT /test:create_bucket:http status=403
2014-03-12 23:57:49.523204 7ff97e7dd700 1 == req done req=0x23bc650 http_status=403 ==

On Wed, Mar 12, 2014 at 7:36 PM, Greg Poirier greg.poir...@opower.com wrote:
The saga continues... So, after fiddling with haproxy a bit, I managed to make sure that my requests were hitting the RADOS Gateway. NOW, I get a 403 from my ruby script:
2014-03-12 23:34:08.289670 7fda9bfbf700 1 == starting new request req=0x215a780 =
2014-03-12 23:34:08.305105 7fda9bfbf700 1 == req done req=0x215a780 http_status=403 ==
The aws-s3 gem forces the Host header to be set to s3.amazonaws.com -- and I am wondering if this could potentially cause
Re: [ceph-users] no user info saved after user creation / can't create buckets
Increasing the logging further, I notice the following:
2014-03-13 00:27:28.617100 7f6036ffd700 20 rgw_create_bucket returned ret=-1 bucket=test(@.rgw.buckets[us-west-1.15849318.1])
I hope that .rgw.buckets doesn't have to exist... though that @.rgw.buckets is perhaps telling of something? I did notice that .us-west-1.rgw.buckets and .us-west-1.rgw.buckets.index weren't created. I created those, restarted radosgw, and still get 403 errors.
On Wed, Mar 12, 2014 at 8:00 PM, Greg Poirier greg.poir...@opower.com wrote:
And the debug log, because that last log was obviously not helpful...
[snip: debug log quoted in full above]
Re: [ceph-users] no user info saved after user creation / can't create buckets
And, I figured out the issue. The utility I was using to create pools, zones, and regions automatically failed to do two things:
- create rgw.buckets and rgw.buckets.index for each zone
- set up placement pools for each zone
I did both of those, and now everything is working. Thanks, me, for the commitment to figuring this poo out.
On Wed, Mar 12, 2014 at 8:31 PM, Greg Poirier greg.poir...@opower.com wrote:
Increasing the logging further, I notice the following:
2014-03-13 00:27:28.617100 7f6036ffd700 20 rgw_create_bucket returned ret=-1 bucket=test(@.rgw.buckets[us-west-1.15849318.1])
[snip: remainder of quoted thread above]
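For anyone who lands here with the same 403s: a hedged sketch of creating the missing per-zone pools by hand (pool names follow the .us-west-1 zone naming in the logs above; the pg count is an arbitrary placeholder):

  # Create the zone's bucket data and index pools.
  ceph osd pool create .us-west-1.rgw.buckets 128
  ceph osd pool create .us-west-1.rgw.buckets.index 128

  # Restart the gateway so it picks up the new pools.
  service radosgw restart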
Re: [ceph-users] RBD Snapshots
Interesting. I think this may not be a bad idea. Thanks for the info.

On Monday, March 3, 2014, Jean-Tiare LE BIGOT jean-tiare.le-bi...@ovh.net wrote:
To get consistent RBD live snapshots, you may want to first freeze the guest filesystem (ext4, btrfs, xfs) with a tool like [fsfreeze]. It will basically flush the FS state to disk and block any future write access while still allowing reads.
[fsfreeze] http://manpages.courier-mta.org/htmlman8/fsfreeze.8.html

On 02/28/2014 11:27 PM, Gregory Farnum wrote:
RBD itself will behave fine whenever you take the snapshot. The thing to worry about is that it's a snapshot at the block device layer, not the filesystem layer, so if you don't quiesce IO and sync to disk, the filesystem might not be entirely happy with you, for the same reasons that it won't be happy if you pull the power plug on it.
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Fri, Feb 28, 2014 at 2:12 PM, Greg Poirier greg.poir...@opower.com wrote:
According to the documentation at https://ceph.com/docs/master/rbd/rbd-snapshot/ -- snapshots require that all I/O to a block device be stopped prior to making the snapshot. Is there any plan to allow for online snapshotting so that we could do incremental snapshots of running VMs on a regular basis?

-- Jean-Tiare, shared-hosting team
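A minimal sketch of the freeze/snapshot/thaw sequence being described (mount point, pool, image, and snapshot names are all placeholders):

  # Inside the guest: flush dirty data and block new writes.
  fsfreeze --freeze /mnt/data

  # From a client with access to the pool: take the RBD snapshot.
  rbd snap create volumes/myimage@nightly

  # Inside the guest again: thaw the filesystem.
  fsfreeze --unfreeze /mnt/data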
[ceph-users] RBD Snapshots
According to the documentation at https://ceph.com/docs/master/rbd/rbd-snapshot/ -- snapshots require that all I/O to a block device be stopped prior to making the snapshot. Is there any plan to allow for online snapshotting so that we could do incremental snapshots of running VMs on a regular basis?
Re: [ceph-users] Ceph MON can no longer join quorum
Hi Karan, I resolved it the same way you did. We had a network partition that caused the MON to die, it appears. I'm running 0.72.1. It would be nice if redeploying weren't the solution, but if it's simply cleaner to do so, then I will continue along that route. I think what's more troubling is that when this occurred we lost all connectivity to the Ceph cluster.

On Wed, Feb 5, 2014 at 1:11 AM, Karan Singh ksi...@csc.fi wrote:
Hi Greg,
I have seen this problem before in my cluster.
- What Ceph version are you running?
- Did you make any change recently in the cluster that resulted in this problem?
You identified it correctly: the only problem is that ceph-mon-2003 is listening on an incorrect port; it should listen on port 6789 (like the other two monitors). The way I resolved it was to cleanly remove the affected monitor node and add it back to the cluster.
Regards, Karan
--
From: Greg Poirier greg.poir...@opower.com
To: ceph-users@lists.ceph.com
Sent: Tuesday, 4 February, 2014 10:50:21 PM
Subject: [ceph-users] Ceph MON can no longer join quorum
I have a MON that at some point lost connectivity to the rest of the cluster and now cannot rejoin. Each time I restart it, it looks like it's attempting to create a new MON and join the cluster, but the rest of the cluster rejects it, because the new one isn't in the monmap. I don't know why it suddenly decided it needed to be a new MON. I am not really sure where to start.

root@ceph-mon-2003:/var/log/ceph# ceph -s
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002
monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002

Notice ceph-mon-2003:6800. If I try to start ceph-mon-all, it will be listening on some other port...

root@ceph-mon-2003:/var/log/ceph# start ceph-mon-all
ceph-mon-all start/running
root@ceph-mon-2003:/var/log/ceph# ps -ef | grep ceph
root 6930 1 31 15:49 ? 00:00:00 /usr/bin/ceph-mon --cluster=ceph -i ceph-mon-2003 -f
root 6931 1 3 15:49 ? 00:00:00 python /usr/sbin/ceph-create-keys --cluster=ceph -i ceph-mon-2003
root@ceph-mon-2003:/var/log/ceph# ceph -s
2014-02-04 15:49:56.854866 7f9cf422d700 0 -- :/1007028 >> 10.30.66.15:6789/0 pipe(0x7f9cf0021370 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f9cf00215d0).fault
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002
monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002
Suggestions?
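For the archives, the redeploy amounts to roughly the following (a sketch only -- hostname and IP taken from the output above, keyring path is a placeholder; check the monitor removal/addition docs for your release before following this):

  # On a healthy node: drop the broken monitor from the monmap.
  ceph mon remove ceph-mon-2003

  # On ceph-mon-2003: stop it, set aside its data dir, and rebuild
  # it from the current monmap before rejoining.
  stop ceph-mon id=ceph-mon-2003
  mv /var/lib/ceph/mon/ceph-ceph-mon-2003{,.old}
  ceph mon getmap -o /tmp/monmap
  ceph-mon -i ceph-mon-2003 --mkfs --monmap /tmp/monmap --keyring /path/to/mon.keyring
  ceph mon add ceph-mon-2003 10.30.66.15:6789
  start ceph-mon id=ceph-mon-2003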
[ceph-users] Ceph MON can no longer join quorum
I have a MON that at some point lost connectivity to the rest of the cluster and now cannot rejoin. Each time I restart it, it looks like it's attempting to create a new MON and join the cluster, but the rest of the cluster rejects it, because the new one isn't in the monmap. I don't know why it suddenly decided it needed to be a new MON. I am not really sure where to start.

root@ceph-mon-2003:/var/log/ceph# ceph -s
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002
monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002

Notice ceph-mon-2003:6800. If I try to start ceph-mon-all, it will be listening on some other port...

root@ceph-mon-2003:/var/log/ceph# start ceph-mon-all
ceph-mon-all start/running
root@ceph-mon-2003:/var/log/ceph# ps -ef | grep ceph
root 6930 1 31 15:49 ? 00:00:00 /usr/bin/ceph-mon --cluster=ceph -i ceph-mon-2003 -f
root 6931 1 3 15:49 ? 00:00:00 python /usr/sbin/ceph-create-keys --cluster=ceph -i ceph-mon-2003
root@ceph-mon-2003:/var/log/ceph# ceph -s
2014-02-04 15:49:56.854866 7f9cf422d700 0 -- :/1007028 >> 10.30.66.15:6789/0 pipe(0x7f9cf0021370 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f9cf00215d0).fault
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002
monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002
Suggestions?
Re: [ceph-users] RadosGW S3 API - Bucket Versions
On Fri, Jan 24, 2014 at 4:28 PM, Yehuda Sadeh yeh...@inktank.com wrote:
For each object that rgw stores it keeps a version tag. However, this version is not ascending; it's just used for identifying whether an object has changed. I'm not completely sure what problem you're trying to solve, though.

We have two datacenters. I want to have two regions that are split across both datacenters. Let's say us-west and us-east are our regions: us-east-1 would live in one datacenter and be the primary zone for the us-east region, while us-east-2 would live in the other datacenter and be the secondary zone. We then do the opposite for us-west. What I was envisioning, I think, will not work. For example:
- write object A.0 to bucket X in us-west-1 (master)
- us-west-1 (master) goes down
- write to us-west-2 (secondary) a _new_ version of object A.1 to bucket X
- us-west-1 comes back up
- read object A.1 from us-west-1
The idea being that if you are versioning objects, you are never updating them, so it doesn't matter that the copy of the object that is now in us-west-1 is read-only. I'm not even sure if this is an accurate description of how replication operates, but I thought I'd discussed a master-master scenario with someone who said this _might_ be possible... assuming you had versioned objects.
Re: [ceph-users] 1MB/s throughput to 33-ssd test cluster
On Sun, Dec 8, 2013 at 8:33 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:
I'd suggest testing the components separately -- try to rule out NIC (and switch) issues and SSD performance issues, then when you are sure the bits all go fast individually, test how Ceph performs again. What make and model of SSD? I'd check that the firmware is up to date (sometimes makes a huge difference). I'm also wondering if you might get better performance by having (say) 7 OSDs and using 4 of the SSDs as journals for them.

Thanks, Mark. In my haste, I left out part of a paragraph... probably really a whole paragraph... that contains a pretty crucial detail. I had previously run rados bench on this hardware with some success (24-26 MBps throughput w/ 4k blocks). ceph osd bench looks great. iperf on the network looks great. After my last round of testing (with a few aborted rados bench tests), I deleted the pool and recreated it (same name, crush ruleset, pg num, size, etc.). That is when I started to notice the degraded performance.
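The component-level checks mentioned here look roughly like this (a sketch; the peer address, stream count, and osd id are placeholders):

  # Raw network throughput between two cluster nodes.
  iperf -s                      # on the receiving node
  iperf -c 10.20.69.101 -P 8    # on the sender, 8 parallel streams

  # Per-OSD write benchmark (1 GB in 4 MB blocks by default),
  # exercising just the disk/journal path with no client involved.
  ceph tell osd.0 bench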
[ceph-users] 1MB/s throughput to 33-ssd test cluster
Hi. So, I have a test cluster made up of ludicrously overpowered machines with nothing but SSDs in them. Bonded 10Gbps NICs (802.3ad layer 2+3 xmit hash policy, confirmed ~19.8 Gbps throughput with 32+ threads). I'm running rados bench, and I am currently getting less than 1 MBps throughput:

sudo rados -N `hostname` bench 600 write -b 4096 -p volumes --no-cleanup -t 32 > bench_write_4096_volumes_1_32.out 2>&1

Colocated journals on the same disk, so I'm not expecting optimum throughput, but previous tests on spinning disks have shown reasonable speeds (23 MB/s, 4000-6000 iops) as opposed to the 150-450 iops I'm currently getting.

ceph_deploy@ssd-1001:~$ sudo ceph -s
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
health HEALTH_WARN clock skew detected on mon.ssd-1003
monmap e1: 3 mons at {ssd-1001=10.20.69.101:6789/0,ssd-1002=10.20.69.102:6789/0,ssd-1003=10.20.69.103:6789/0}, election epoch 20, quorum 0,1,2 ssd-1001,ssd-1002,ssd-1003
osdmap e344: 33 osds: 33 up, 33 in
pgmap v10600: 1650 pgs, 6 pools, 289 MB data, 74029 objects
466 GB used, 17621 GB / 18088 GB avail
1650 active+clean
client io 1263 kB/s wr, 315 op/s

ceph_deploy@ssd-1001:~$ sudo ceph osd tree
# id  weight  type name        up/down  reweight
-1    30.03   root default
-2    10.01     host ssd-1001
0     0.91        osd.0   up   1
1     0.91        osd.1   up   1
2     0.91        osd.2   up   1
3     0.91        osd.3   up   1
4     0.91        osd.4   up   1
5     0.91        osd.5   up   1
6     0.91        osd.6   up   1
7     0.91        osd.7   up   1
8     0.91        osd.8   up   1
9     0.91        osd.9   up   1
10    0.91        osd.10  up   1
-3    10.01     host ssd-1002
11    0.91        osd.11  up   1
12    0.91        osd.12  up   1
13    0.91        osd.13  up   1
14    0.91        osd.14  up   1
15    0.91        osd.15  up   1
16    0.91        osd.16  up   1
17    0.91        osd.17  up   1
18    0.91        osd.18  up   1
19    0.91        osd.19  up   1
20    0.91        osd.20  up   1
21    0.91        osd.21  up   1
-4    10.01     host ssd-1003
22    0.91        osd.22  up   1
23    0.91        osd.23  up   1
24    0.91        osd.24  up   1
25    0.91        osd.25  up   1
26    0.91        osd.26  up   1
27    0.91        osd.27  up   1
28    0.91        osd.28  up   1
29    0.91        osd.29  up   1
30    0.91        osd.30  up   1
31    0.91        osd.31  up   1
32    0.91        osd.32  up   1

The clock skew error can safely be ignored. It's something like 2-3 ms of skew; I just haven't bothered configuring away the warning. This is with a newly-created pool, after deleting the last pool used for testing. Any suggestions on where to start debugging? Thanks.
Re: [ceph-users] near full osd
Kevin, in my experience that usually indicates a bad or underperforming disk, or a too-high weight. Try running ceph osd crush reweight osd.## 1.0. If that doesn't do the trick, you may want to just out that guy. I don't think the CRUSH algorithm guarantees balancing things out in the way you're expecting.
--Greg

On Tue, Nov 5, 2013 at 11:11 AM, Kevin Weiler kevin.wei...@imc-chicago.com wrote:
Hi guys,
I have an OSD in my cluster that is near full at 90%, but we're using a little less than half the available storage in the cluster. Shouldn't this be balanced out?
--
Kevin Weiler
IT
IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/
Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.wei...@imc-chicago.com
Re: [ceph-users] near full osd
Erik, it's utterly non-intuitive, and I'd love another explanation than the one I've provided. Nevertheless, the OSDs on my slower PE2970 nodes fill up much faster than those on HP585s or Dell R820s. I've handled this by dropping weights and, in a couple of cases, outing or removing the OSD.
Kevin, generally speaking, the OSDs that fill up on me are the same ones. Once I lower the weights, they stay low, or they fill back up again within days or hours of re-raising the weight. Please try lifting them back up, though -- maybe you'll have better luck than me.
--Greg

On Tue, Nov 5, 2013 at 11:30 AM, Kevin Weiler kevin.wei...@imc-chicago.com wrote:
All of the disks in my cluster are identical and therefore all have the same weight (each drive is 2TB and the automatically generated weight is 1.82 for each one). Would the procedure here be to reduce the weight, let it rebalance, and then put the weight back to where it was?

From: Aronesty, Erik earone...@expressionanalysis.com
Date: Tuesday, November 5, 2013 10:27 AM
Subject: RE: [ceph-users] near full osd
If there's an underperforming disk, why on earth would *more* data be put on it? You'd think it would be less... I would think an *overperforming* disk should (desirably) cause that case, right?

[snip: earlier messages in this thread, quoted above]
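The reweight-and-watch cycle being described is roughly this (osd id and weights are placeholders; 1.82 is the auto-generated weight mentioned above):

  # Temporarily lower the CRUSH weight of the full OSD so data moves off it.
  ceph osd crush reweight osd.12 1.2

  # Watch the rebalance until the cluster settles back to active+clean.
  ceph -w

  # Optionally restore the original weight afterwards.
  ceph osd crush reweight osd.12 1.82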
Re: [ceph-users] PG repair failing when object missing
I was also able to reproduce this, guys, but I believe it's specific to the mode of testing rather than to anything being wrong with the OSD. In particular, after restarting the OSD whose file I removed and running repair, it repaired successfully.
The OSD has an "fd cacher" which caches open file handles, and we believe this is what causes the observed behavior: if the removed object is among the most recent n objects touched, the FileStore (an OSD subsystem) has an open fd cached, so when you manually delete the file, the FileStore now has a deleted file open. When the repair happens, it finds that open file descriptor and applies the repair to it -- which of course doesn't help put it back into place!
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On October 24, 2013 at 2:52:54 AM, Matt Thompson (watering...@gmail.com) wrote:
Hi Harry,
I was able to replicate this. What does appear to work (for me) is to do an osd scrub followed by a pg repair. I've tried this 2x now, and in each case the deleted file gets copied over to the OSD from which it was removed. However, I've tried a few pg scrub / pg repairs after manually deleting a file and have yet to see the file get copied back to the OSD on which it was deleted. Like you said, the pg repair sets the health of the PG back to active+clean, but then re-running the pg scrub detects the file as missing again and sets it back to active+clean+inconsistent.
Regards, Matt

On Wed, Oct 23, 2013 at 3:45 PM, Harry Harrington wrote:
Hi,
I've been taking a look at the repair functionality in Ceph. As I understand it, the OSDs should try to copy an object from another member of the PG if it is missing. I have been attempting to test this by manually removing a file from one of the OSDs; however, each time the repair completes, the file has not been restored. If I run another scrub on the PG, it gets flagged as inconsistent. See below for the output from my testing. I assume I'm missing something obvious; any insight into this process would be greatly appreciated.
Thanks, Harry

# ceph --version
ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
# ceph status
cluster a4e417fe-0386-46a5-4475-ca7e10294273
health HEALTH_OK
monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e13: 3 osds: 3 up, 3 in
pgmap v232: 192 pgs: 192 active+clean; 44 bytes data, 15465 MB used, 164 GB / 179 GB avail
mdsmap e1: 0/0/1 up

file removed from osd.2

# ceph pg scrub 0.b
instructing pg 0.b on osd.1 to scrub
# ceph status
cluster a4e417fe-0386-46a5-4475-ca7e10294273
health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e13: 3 osds: 3 up, 3 in
pgmap v233: 192 pgs: 191 active+clean, 1 active+clean+inconsistent; 44 bytes data, 15465 MB used, 164 GB / 179 GB avail
mdsmap e1: 0/0/1 up
# ceph pg repair 0.b
instructing pg 0.b on osd.1 to repair
# ceph status
cluster a4e417fe-0386-46a5-4475-ca7e10294273
health HEALTH_OK
monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e13: 3 osds: 3 up, 3 in
pgmap v234: 192 pgs: 192 active+clean; 44 bytes data, 15465 MB used, 164 GB / 179 GB avail
mdsmap e1: 0/0/1 up
# ceph pg scrub 0.b
instructing pg 0.b on osd.1 to scrub
# ceph status
cluster a4e417fe-0386-46a5-4475-ca7e10294273
health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e13: 3 osds: 3 up, 3 in
pgmap v236: 192 pgs: 191 active+clean, 1 active+clean+inconsistent; 44 bytes data, 15465 MB used, 164 GB / 179 GB avail
mdsmap e1: 0/0/1 up

The logs from osd.1:
2013-10-23 14:12:31.188281 7f02a5161700 0 log [ERR] : 0.b osd.2 missing 3a643fcb/testfile1/head//0
2013-10-23 14:12:31.188312 7f02a5161700 0 log [ERR] : 0.b scrub 1 missing, 0 inconsistent objects
2013-10-23 14:12:31.188319 7f02a5161700 0 log [ERR] : 0.b scrub 1 errors
2013-10-23 14:13:03.197802 7f02a5161700 0 log [ERR] : 0.b osd.2 missing 3a643fcb/testfile1/head//0
2013-10-23 14:13:03.197837 7f02a5161700 0 log [ERR] : 0.b repair 1 missing, 0 inconsistent objects
2013-10-23 14:13:03.197850 7f02a5161700 0 log [ERR] : 0.b repair 1 errors, 1 fixed
2013-10-23 14:14:47.232953 7f02a5161700 0 log [ERR] : 0.b osd.2 missing 3a643fcb/testfile1/head//0
2013-10-23 14:14:47.232985 7f02a5161700 0 log [ERR] : 0.b scrub 1 missing, 0 inconsistent objects
2013-10-23 14:14:47.232991 7f02a5161700 0 log [ERR] : 0.b scrub 1 errors
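Given Greg's fd-cacher explanation above, the workaround for this test scenario is simply to bounce the OSD before repairing; a sketch using the osd and pg from the output above:

  # Restart the OSD whose file was removed, so it drops the stale
  # open file descriptor from its fd cache...
  service ceph restart osd.2

  # ...then repair can actually restore the missing object.
  ceph pg repair 0.b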
[ceph-users] Cluster stuck at 15% degraded
We have an 84-osd cluster with volumes and images pools for OpenStack. I was having trouble with full OSDs, so I increased the pg count from the default of 128 to 2700. This balanced out the OSDs, but the cluster is stuck at 15% degraded: http://hastebin.com/wixarubebe.dos
That's the output of ceph health detail. I've never seen a pg with the state active+remapped+wait_backfill+backfill_toofull. Clearly I should have increased the pg count more gradually, but here I am. I'm frozen, afraid to do anything. Any suggestions? Thanks.
--Greg Chavez
[ceph-users] librados pthread_create failure
So, in doing some testing last week, I believe I managed to exhaust the number of threads available to nova-compute. After some investigation, I found the pthread_create failure and increased nproc for our Nova user to, what I considered, a ridiculous 120,000 threads, after reading that librados requires a thread per OSD, plus a few for overhead, per VM on our compute nodes.
This made me wonder: how many threads could Ceph possibly need on one of our compute nodes? 32 cores * an overcommit ratio of 16 (assuming each VM is booted from a Ceph volume) * 300 (the approximate number of disks in our soon-to-go-live Ceph cluster) = 153,600 threads.
So this is where I started to put the truck in reverse. Am I right? What about when we triple the size of our Ceph cluster? I can easily see a future where we have 1,000 disks, if not many, many more, in our cluster. How do people scale this? Do you RAID to increase the density of your Ceph cluster? I can only imagine that this will also drastically increase the amount of resources required on my data nodes as well. So... suggestions? Reading?
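For reference, the nproc bump described above looked something like this (the file name and the 120,000 value are just what I used, not a recommendation):

  # /etc/security/limits.d/91-nova.conf
  # Raise the per-user process/thread cap for the nova user.
  nova  soft  nproc  120000
  nova  hard  nproc  120000

Verify with ulimit -u from a shell running as the nova user.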
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
Ah thanks, Brian. I will do that. I was going off the wiki instructions on performing rados benchmarks. If I have the time later, I will change it there.

On Fri, Aug 23, 2013 at 9:37 AM, Brian Andrus brian.and...@inktank.com wrote:
Hi Greg,
I haven't had any luck with the seq bench. It just errors every time.
Can you confirm you are using the --no-cleanup flag with rados write? This will ensure there is actually data to read for subsequent seq tests.
~Brian
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
On Fri, Aug 23, 2013 at 9:53 AM, Gregory Farnum g...@inktank.com wrote:
Okay. It's important to realize that because Ceph distributes data pseudorandomly, each OSD is going to end up with about the same amount of data going to it. If one of your drives is slower than the others, the fast ones can get backed up waiting on the slow one to acknowledge writes, so they end up impacting the cluster throughput a disproportionate amount. :( Anyway, I'm guessing you have 24 OSDs from your math earlier?
47MB/s * 24 / 2 = 564MB/s
41MB/s * 24 / 2 = 492MB/s

33 OSDs and 3 hosts in the cluster.

So taking out or reducing the weight on the slow ones might improve things a little. But that's still quite a ways off from what you're seeing -- there are a lot of things that could be impacting this, but there's probably something fairly obvious with that much of a gap. What is the exact benchmark you're running? What do your nodes look like?

The write benchmark I am running is fio with the following configuration:
ioengine: libaio
iodepth: 16
runtime: 180
numjobs: 16
- name: 128k-500M-write
  description: 128K block 500M write
  bs: 128K
  size: 500M
  rw: write
Sorry for the weird yaml formatting, but I'm copying it from the config file of my automation stuff (a native fio job file equivalent is sketched below). I run that on powers-of-2 VM counts, up to 32. Each VM is qemu-kvm with a 50 GB RBD-backed Cinder volume attached. They are 2-VCPU, 4 GB RAM VMs. The host machines are Dell C6220s: 16 cores, hyperthreaded, 128 GB RAM, with bonded 10 Gbps NICs (mode 4, 20 Gbps throughput -- tested and verified that's working correctly). There are 2 host machines with 16 VMs each. The Ceph cluster is made up of Dell C6220s, same NIC setup, 256 GB RAM, same CPU, 12 disks each (one for the OS, 11 for OSDs).
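For anyone wanting to reproduce this, the same job in fio's native format would look roughly like the following (an untested transcription of the yaml above):

  [global]
  ioengine=libaio
  iodepth=16
  runtime=180
  numjobs=16

  [128k-500M-write]
  ; 128K sequential writes over a 500M file
  bs=128k
  size=500m
  rw=write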
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
On Thu, Aug 22, 2013 at 2:34 PM, Gregory Farnum g...@inktank.com wrote:
You don't appear to have accounted for the 2x replication (where all writes go to two OSDs) in these calculations. I assume your pools have

Ah. Right. So I should then be looking at: # OSDs * throughput per disk / 2 / repl factor? Which makes 300-400 MB/s aggregate throughput actually sort of reasonable.

size 2 (or 3?) for these tests. 3 would explain the performance difference entirely; 2x replication leaves it still a bit low but takes the difference down to ~350/600 instead of ~350/1200. :)

Yeah. We're doing 2x repl now, and haven't yet made the decision if we're going to move to 3x repl or not.

You mentioned that your average osd bench throughput was ~50MB/s; what's the range?

41.9 - 54.7 MB/s. The actual average is 47.1 MB/s.

Have you run any rados bench tests?

Yessir. rados bench write:
2013-08-23 00:18:51.933594 min lat: 0.071682 max lat: 1.77006 avg lat: 0.196411
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
900 14 73322 73308 325.764 316 0.13978 0.196411
Total time run: 900.239317
Total writes made: 73322
Write size: 4194304
Bandwidth (MB/sec): 325.789
Stddev Bandwidth: 35.102
Max bandwidth (MB/sec): 440
Min bandwidth (MB/sec): 0
Average Latency: 0.196436
Stddev Latency: 0.121463
Max latency: 1.77006
Min latency: 0.071682

I haven't had any luck with the seq bench. It just errors every time.

What is your PG count across the cluster?

pgmap v18263: 1650 pgs: 1650 active+clean; 946 GB data, 1894 GB used, 28523 GB / 30417 GB avail; 498 MB/s wr, 124 op/s

Thanks again.
[ceph-users] Production/Non-production segmentation
Does anyone here have multiple clusters, or segment their single cluster in such a way as to try to maintain different SLAs for production vs. non-production services? We have been toying with the idea of running separate clusters (on the same hardware, but reserving a portion of the OSDs for the production cluster), but I'd rather have a single cluster in order to more evenly distribute load across all of the spindles. Thoughts or observations from people with Ceph in production would be greatly appreciated.
Greg
[ceph-users] Defective ceph startup script
I am running on Ubuntu 13.04. There is something amiss with /etc/init.d/ceph on all of my Ceph nodes. I was upgrading to 0.61.7 from what I *thought* was 0.61.5 today when I realized that "service ceph-all restart" wasn't actually doing anything. I saw nothing in /var/log/ceph.log -- it just kept printing pg statuses -- and the PIDs of the osd and mon daemons did not change. Stops failed as well. Then, when I tried to do individual osd restarts like this:

root@kvm-cs-sn-14i:/var/lib/ceph/osd# service ceph -v status osd.10
/etc/init.d/ceph: osd.10 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

Despite the fact that I have this directory: /var/lib/ceph/osd/ceph-10/. I have the same issue with mon restarts:

root@kvm-cs-sn-14i:/var/lib/ceph/mon# ls
ceph-kvm-cs-sn-14i
root@kvm-cs-sn-14i:/var/lib/ceph/mon# service ceph -v status mon.kvm-cs-sn-14i
/etc/init.d/ceph: mon.kvm-cs-sn-14i not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

I'm very worried that I have all my packages at 0.61.7 while my osd and mon daemons could be running something as old as 0.61.1! Can anyone help me figure this out? Thanks.
-- \*..+.- --Greg Chavez +//..;};
Re: [ceph-users] Production/Non-production segmentation
On Wed, Jul 31, 2013 at 12:19 PM, Mike Dawson mike.daw...@cloudapt.com wrote:
Due to the speed of releases in the Ceph project, I feel having separate physical hardware is the safer way to go, especially in light of your mention of an SLA for your production services.

Ah. I guess I should offer a little more background as to what I mean by production vs. non-production: customer-facing, and not. We're using Ceph primarily for volume storage with OpenStack at the moment and operate two OpenStack clusters: one for all of our customer-facing services (which require a higher SLA) and one for all of our internal services. The idea being that all of the customer-facing stuff is segmented physically from anything our developers might be testing internally.
What I'm wondering: does anyone else here do this? If so, do you run multiple Ceph clusters? Do you let Ceph sort itself out? Can this be done with a single physical cluster but multiple logical clusters? Should it be?
I know that, mathematically speaking, the larger your Ceph cluster is, the more evenly distributed the load (thanks to CRUSH). I'm wondering if, in practice, RBD can still create hotspots (say, from a runaway service with multiple instances and volumes that is suddenly doing a ton of IO). This would increase IO latency across the Ceph cluster, I'd assume, and could impact the performance of customer-facing services. So, to some degree, physical segmentation makes sense to me. But can we simply reserve some OSDs per physical host for a production logical cluster and then use the rest for the development logical cluster (separate MON clusters for each, but all running on the same hardware)? Or, given a sufficiently large cluster, is this not even a concern?
I'm also interested in hearing about experience using CephFS, Swift, and RBD all on a single cluster, or whether people have chosen to use multiple clusters for these as well. For example, if you need faster volume storage in RBD, you might go for more spindles and smaller disks, vs. larger disks with fewer spindles for object storage, which can tolerate higher latency than volume storage.

A separate non-production cluster will allow you to test and validate new versions (including point releases within a stable series) before you attempt to upgrade your production cluster.

Oh yeah. I'm doing that for sure. Thanks, Greg
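To make the "single physical cluster, multiple logical tiers" idea concrete, one approach is separate CRUSH roots with a placement rule per tier; a hedged sketch (dumpling-era syntax; all bucket, host, rule, and pool names here are made up):

  # Separate CRUSH roots for production and development.
  ceph osd crush add-bucket prod root
  ceph osd crush add-bucket dev root
  ceph osd crush move host-prod-01 root=prod
  ceph osd crush move host-dev-01 root=dev

  # One placement rule per root, then point each pool at its rule.
  ceph osd crush rule create-simple prod-rule prod host
  ceph osd crush rule create-simple dev-rule dev host
  ceph osd pool set volumes crush_ruleset 3   # ruleset id from 'ceph osd crush rule dump'

Note that this separates data placement (and failure domains) but not the monitors, so it isn't a full answer to the SLA question on its own.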
Re: [ceph-users] Defective ceph startup script
Blast and gadzooks. This is a bug then. What's worse is that none of my three mon nodes have anything in /var/run/ceph. The directory is empty! I can't believe I've basically been running a busy Ceph cluster like this for the last month. I'll try what you suggested, thank you.

On Wed, Jul 31, 2013 at 3:48 PM, Eric Eastman eri...@aol.com wrote:
Hi Greg,
I saw about the same thing on Ubuntu 13.04 as you did. I used
apt-get -y update
apt-get -y upgrade
on all my cluster nodes to upgrade from 0.61.5 to 0.61.7, and then noticed that some of my systems did not restart all the daemons. I tried:
stop ceph-all
start ceph-all
on those nodes, but that did not kill all the old processes on the systems still running old daemons, so I ended up doing a:
ps auxww | grep ceph
on every node, and for any ceph process that was older than when I upgraded, I hand-killed all the ceph processes on that node and then did a:
start ceph-all
which seemed to fix the issue.
Eric
[snip: original report quoted above]
-- \*..+.- --Greg Chavez +//..;};
Re: [ceph-users] Defective ceph startup script
After I did what Eric Eastman suggested, my mon and osd sockets showed up in /var/run/ceph:

root@kvm-cs-sn-10i:/etc/ceph# ls /var/run/ceph/
ceph-osd.0.asok ceph-osd.1.asok ceph-osd.2.asok ceph-osd.3.asok ceph-osd.4.asok ceph-osd.5.asok ceph-osd.6.asok ceph-osd.7.asok

However, while the osd daemons came back online, the mon did not. As it happens, the cause is covered in another thread from today (Subject: Problem with MON after reboot). The solution is to upgrade and restart the other mon nodes. This worked. Now the status/stop/start commands work each and every time. Somewhere along the line this got goofed up and the osd and mon sockets either weren't created or were deleted. I started my cluster with a devel version of cuttlefish, so who knows?
Craig, that's good advice re: starting the mon daemons first, but it's no good if the sockets are missing from /var/run/ceph. I'll keep an eye on these directories moving forward to make sure they don't get lost again. Thanks, everyone, for your help. Now I hope to engage in some drama-free upgrading on my osd-only nodes. Ceph is great!

On Wed, Jul 31, 2013 at 4:31 PM, Craig Lewis cle...@centraldesktop.com wrote:
You do need to use the stop script, not service stop. If you use service stop, Upstart will restart the service. It's OK for start and restart, because that's what you want anyway, but service stop is effectively a restart.
I wouldn't recommend doing stop ceph-all and start ceph-all after an upgrade anyway, at least not with the latest 0.61 upgrades. Due to the MON issues between 61.4, 61.5, and 61.6, it seemed safer to follow the major version upgrade procedure (http://ceph.com/docs/next/install/upgrading-ceph/). So I've been restarting MONs on all nodes, then all OSDs on all nodes, then the remaining services.
That said, stop ceph-all should stop all the daemons. I just wouldn't use this upgrade procedure.
[snip: Eric's message quoted above]
-- \*..+.- --Greg Chavez +//..;};
Re: [ceph-users] Problem about capacity when mount using CephFS?
This is interesting. So there are no built-in ceph commands that can calculate your usable space? It just so happened that I was going to try and figure that out today (new OpenStack block cluster, 20TB total capacity) by skimming through the documentation. I figured that there had to be a command that would do this. Blast and gadzooks. On Tue, Jul 16, 2013 at 10:37 AM, Ta Ba Tuan tua...@vccloud.vn wrote: Thank Sage, tuantaba On 07/16/2013 09:24 PM, Sage Weil wrote: On Tue, 16 Jul 2013, Ta Ba Tuan wrote: Thanks Sage, I worried about the capacity returned when mounting CephFS, but when the disk is full, will capacity show 50% or 100% Used? 100%. sage On 07/16/2013 11:01 AM, Sage Weil wrote: On Tue, 16 Jul 2013, Ta Ba Tuan wrote: Hi everyone. I have 83 OSDs, and every OSD has the same 2TB (capacity summary is 166TB). I'm using replicate 3 for the pools ('data', 'metadata'). But when mounting the Ceph filesystem from somewhere (using: mount -t ceph Monitor_IP:/ /ceph -o name=admin,secret=xx), the capacity summary shown is 160TB? I used replicate 3, so I think it should return about 160TB/3 = 53TB?

Filesystem       Size  Used  Avail  Use%  Mounted on
192.168.32.90:/  160T  500G  156T   1%    /tmp/ceph_mount

Please explain this to me? statfs/df show the raw capacity of the cluster, not the usable capacity. How much data you can store is a (potentially) complex function of your CRUSH rules and replication layout. If you store 1TB, you'll notice the available space will go down by about 2TB (if you're using the default 2x). sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- \*..+.- --Greg Chavez +//..;}; ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Problem about capacity when mount using CephFS?
Watching. Thanks, Neil. On Tue, Jul 16, 2013 at 12:43 PM, Neil Levine neil.lev...@inktank.com wrote: This seems like a good feature to have. I've created http://tracker.ceph.com/issues/5642 N On Tue, Jul 16, 2013 at 8:05 AM, Greg Chavez greg.cha...@gmail.com wrote: This is interesting. So there are no built-in ceph commands that can calculate your usable space? It just so happened that I was going to try and figure that out today (new OpenStack block cluster, 20TB total capacity) by skimming through the documentation. I figured that there had to be a command that would do this. Blast and gadzooks. On Tue, Jul 16, 2013 at 10:37 AM, Ta Ba Tuan tua...@vccloud.vn wrote: Thank Sage, tuantaba On 07/16/2013 09:24 PM, Sage Weil wrote: On Tue, 16 Jul 2013, Ta Ba Tuan wrote: Thanks Sage, I worried about the capacity returned when mounting CephFS, but when the disk is full, will capacity show 50% or 100% Used? 100%. sage On 07/16/2013 11:01 AM, Sage Weil wrote: On Tue, 16 Jul 2013, Ta Ba Tuan wrote: Hi everyone. I have 83 OSDs, and every OSD has the same 2TB (capacity summary is 166TB). I'm using replicate 3 for the pools ('data', 'metadata'). But when mounting the Ceph filesystem from somewhere (using: mount -t ceph Monitor_IP:/ /ceph -o name=admin,secret=xx), the capacity summary shown is 160TB? I used replicate 3, so I think it should return about 160TB/3 = 53TB?

Filesystem       Size  Used  Avail  Use%  Mounted on
192.168.32.90:/  160T  500G  156T   1%    /tmp/ceph_mount

Please explain this to me? statfs/df show the raw capacity of the cluster, not the usable capacity. How much data you can store is a (potentially) complex function of your CRUSH rules and replication layout. If you store 1TB, you'll notice the available space will go down by about 2TB (if you're using the default 2x). sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- \*..+.- --Greg Chavez +//..;}; ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- \*..+.- --Greg Chavez +//..;}; ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
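For what it's worth, a rough sketch of the back-of-the-envelope calculation Sage describes, assuming a uniformly replicated cluster (usable space is at most raw capacity divided by the pool's replica count; CRUSH rules can reduce it further):

# replica count of a pool
ceph osd pool get data size
# df on the CephFS mount reports raw capacity; divide it by the replica
# count for a ceiling, e.g. 166TB raw / 3 replicas ~ 55TB usable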
[ceph-users] Ceph on mixed AMD/Intel architecture
I could have sworn that I read somewhere, very early on in my investigation of Ceph, that your OSDs need to run on the same processor architecture. Only it suddenly occurred to me that for the last month, I've been running a small 3-node cluster with two Intel systems and one AMD system. I thought they were all AMD! So... is this a problem? It seems to be running well. Thanks. -- \*..+.- --Greg Chavez +//..;}; ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RBD image copying
Hello, I found some oddity when attempting to copy an rbd image in my pool (using bobtail 0.56.4), please see this. I have a built, working RBD image named p1b16:

root@nas16:~# rbd -p sp ls
p1b16

Copying the image:

root@nas16:~# rbd -p sp cp p1b16 p2b16
Image copy: 100% complete...done.

Great, seems to go fine, it went super fast (a few seconds), let's check:

root@nas16:~# rbd -p sp ls
p1b16

Uh? Let's try again:

root@nas16:~# rbd -p sp cp p1b16 p2b16
2013-05-14 09:30:42.369917 400b8000 -1 Image copy: 0% complete...failed. librbd: rbd image p2b16 already exists
rbd: copy failed: (17) File exists
2013-05-14 09:30:42.369969 400b8000 -1 librbd: header creation failed

Doh! Really?

root@nas16:~# rbd -p sp ls
p1b16

Hmmm, something hidden? Let's try to restart:

root@nas16:~# rbd -p sp rm p2b16
2013-05-14 09:30:19.445336 400c7000 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
2013-05-14 09:30:19.644381 400c7000 -1 Removing image: librbd: error removing img from new-style directory: (2) No such file or directory 0% complete...failed.
rbd: delete error: (2) No such file or directory

Damned, let's look at the rados level:

root@nas16:~# rados -p sp ls | grep -v rb\\.
p1b16.rbd
rbd_directory

I downloaded the rbd_directory file and took a look inside; I see p1b16 (along with binary data) but no trace of p2b16. I must have missed something somewhere... Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD image copying
Wolfgang, you are perfectly right, the '-p' switch only applies to the source image, this is subtle! Thanks a lot. On 14/05/2013 11:52, Wolfgang Hennerbichler wrote: Hi, I believe this went into the pool named 'rbd'. If you rbd copy, it's maybe easier to do it with an explicit destination pool name: rbd cp sp/p1b16 sp/p2b16 hth wolfgang On 05/14/2013 11:47 AM, Greg wrote: Hello, I found some oddity when attempting to copy an rbd image in my pool (using bobtail 0.56.4), please see this. I have a built, working RBD image named p1b16:

root@nas16:~# rbd -p sp ls
p1b16

Copying the image:

root@nas16:~# rbd -p sp cp p1b16 p2b16
Image copy: 100% complete...done.

Great, seems to go fine, it went super fast (a few seconds), let's check:

root@nas16:~# rbd -p sp ls
p1b16

Uh? Let's try again:

root@nas16:~# rbd -p sp cp p1b16 p2b16
2013-05-14 09:30:42.369917 400b8000 -1 Image copy: 0% complete...failed. librbd: rbd image p2b16 already exists
rbd: copy failed: (17) File exists
2013-05-14 09:30:42.369969 400b8000 -1 librbd: header creation failed

Doh! Really?

root@nas16:~# rbd -p sp ls
p1b16

Hmmm, something hidden? Let's try to restart:

root@nas16:~# rbd -p sp rm p2b16
2013-05-14 09:30:19.445336 400c7000 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
2013-05-14 09:30:19.644381 400c7000 -1 Removing image: librbd: error removing img from new-style directory: (2) No such file or directory 0% complete...failed.
rbd: delete error: (2) No such file or directory

Damned, let's look at the rados level:

root@nas16:~# rados -p sp ls | grep -v rb\\.
p1b16.rbd
rbd_directory

I downloaded the rbd_directory file and took a look inside; I see p1b16 (along with binary data) but no trace of p2b16. I must have missed something somewhere... Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
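A quick way to see where a copy landed, as a sketch (assuming the default 'rbd' pool exists alongside 'sp'):

root@nas16:~# rbd -p sp ls               # source pool: still only p1b16
root@nas16:~# rbd -p rbd ls              # the unqualified destination ends up here
root@nas16:~# rbd cp sp/p1b16 sp/p2b16   # pool-qualified names avoid the surprise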
Re: [ceph-users] RBD image copying
OK, now the copy is done to the right pool, but the data isn't there. I mapped both the original and the copy to try and compare:

root@client2:~# rbd showmapped
id  pool  image  snap  device
1   sp    p1b16  -     /dev/rbd1
2   sp    p2b16  -     /dev/rbd2

And try to mount:

root@client2:~# mount /dev/rbd1 /mnt/
root@client2:~# umount /mnt/
root@client2:~# mount /dev/rbd2 /mnt/
mount: you must specify the filesystem type

What strikes me is that the copy is super fast, and I'm in a pool in format 1 which, as far as I understand, is not supposed to support copy-on-write. I tried listing the pool (with the rados tool) and it shows the p2b16.rbd file is there, but no rb.0.X.Y.offset objects are present for p2b16 (the name can be found from p2b16.rbd), while there are for p1b16. Did I misunderstand the copy mechanism? Thanks! On 14/05/2013 11:52, Wolfgang Hennerbichler wrote: Hi, I believe this went into the pool named 'rbd'. If you rbd copy, it's maybe easier to do it with an explicit destination pool name: rbd cp sp/p1b16 sp/p2b16 hth wolfgang On 05/14/2013 11:47 AM, Greg wrote: Hello, I found some oddity when attempting to copy an rbd image in my pool (using bobtail 0.56.4), please see this. I have a built, working RBD image named p1b16:

root@nas16:~# rbd -p sp ls
p1b16

Copying the image:

root@nas16:~# rbd -p sp cp p1b16 p2b16
Image copy: 100% complete...done.

Great, seems to go fine, it went super fast (a few seconds), let's check:

root@nas16:~# rbd -p sp ls
p1b16

Uh? Let's try again:

root@nas16:~# rbd -p sp cp p1b16 p2b16
2013-05-14 09:30:42.369917 400b8000 -1 Image copy: 0% complete...failed. librbd: rbd image p2b16 already exists
rbd: copy failed: (17) File exists
2013-05-14 09:30:42.369969 400b8000 -1 librbd: header creation failed

Doh! Really?

root@nas16:~# rbd -p sp ls
p1b16

Hmmm, something hidden? Let's try to restart:

root@nas16:~# rbd -p sp rm p2b16
2013-05-14 09:30:19.445336 400c7000 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
2013-05-14 09:30:19.644381 400c7000 -1 Removing image: librbd: error removing img from new-style directory: (2) No such file or directory 0% complete...failed.
rbd: delete error: (2) No such file or directory

Damned, let's look at the rados level:

root@nas16:~# rados -p sp ls | grep -v rb\\.
p1b16.rbd
rbd_directory

I downloaded the rbd_directory file and took a look inside; I see p1b16 (along with binary data) but no trace of p2b16. I must have missed something somewhere... Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD image copying
On 14/05/2013 13:00, Wolfgang Hennerbichler wrote: On 05/14/2013 12:16 PM, Greg wrote: ... And try to mount:

root@client2:~# mount /dev/rbd1 /mnt/
root@client2:~# umount /mnt/
root@client2:~# mount /dev/rbd2 /mnt/
mount: you must specify the filesystem type

What strikes me is that the copy is super fast, and I'm in a pool in format 1 which, as far as I understand, is not supposed to support copy-on-write. I tried listing the pool (with the rados tool) and it shows the p2b16.rbd file is there, but no rb.0.X.Y.offset objects are present for p2b16 (the name can be found from p2b16.rbd), while there are for p1b16. Did I misunderstand the copy mechanism? You sure did understand it the way it is supposed to be. Something's wrong here. What happens if you dd bs=1024 count=1 | hexdump your devices, do you see differences there? Is your cluster healthy? Wolfgang, after a copy, there is an index file (the .rbd file) but no data file. When I map the block device, I can read/write from/to it; when writing, the data files are created and I can read them back. Regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
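To check whether the copy's data objects exist at all, a small sketch at the rados level (for format 1 images; the block name prefix is reported by rbd info, though field names may vary by release, and the grep pattern below is a hypothetical placeholder to fill in):

root@nas16:~# rbd -p sp info p1b16 | grep block_name_prefix
root@nas16:~# rbd -p sp info p2b16 | grep block_name_prefix
# count the data objects behind each prefix; zero for p2b16 would confirm
# that only the header was written
root@nas16:~# rados -p sp ls | grep -c '^<prefix-from-rbd-info>'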
Re: [ceph-users] RBD vs RADOS benchmark performance
On 13/05/2013 07:38, Olivier Bonvalet wrote: On Friday, May 10, 2013 at 19:16 +0200, Greg wrote: Hello folks, I'm in the process of testing CEPH and RBD. I have set up a small cluster of hosts, each running a MON and an OSD with both journal and data on the same SSD (ok, this is stupid, but it is simple to verify the disks are not the bottleneck for 1 client). All nodes are connected on a 1Gb network (no dedicated network for OSDs, shame on me :). Summary: the RBD performance is poor compared to the benchmark. A 5-second seq read benchmark shows something like this:

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
 Total time run:        4.165747
Total reads made:       95
Read size:              4194304
Bandwidth (MB/sec):     91.220
Average Latency:        0.678901
Max latency:            1.80038
Min latency:            0.104719

91MB/s read performance, quite good! Now the RBD performance:

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s

There is a 3x performance factor (same for write: ~60MB/s benchmark, ~20MB/s dd on the block device). The network is ok, the CPU is also ok on all OSDs. Ceph is bobtail 0.56.4, Linux is 3.8.1 arm (vanilla release + some patches for the SoC being used). Can you show me the starting point for digging into this? You should try to increase read_ahead to 512K instead of the default 128K (/sys/block/*/queue/read_ahead_kb). I have seen a huge difference on reads with that. Olivier, thanks a lot for pointing this out, it indeed makes a *huge* difference!

# dd if=/mnt/t/1 of=/dev/zero bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 5.12768 s, 81.8 MB/s

(caches dropped before each test, of course) Mark, this is probably something you will want to investigate and explain in a tweaking topic of the documentation. Regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
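For anyone wanting to reproduce the tuning, a minimal sketch (rbd1 is an example device name; the setting is per block device and does not persist across reboots unless you add a udev rule):

cat /sys/block/rbd1/queue/read_ahead_kb        # default is 128
echo 512 > /sys/block/rbd1/queue/read_ahead_kb
echo 3 > /proc/sys/vm/drop_caches              # drop caches before re-testing
dd if=/dev/rbd1 of=/dev/null bs=4M count=100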
Re: [ceph-users] RBD vs RADOS benchmark performance
On 13/05/2013 15:55, Mark Nelson wrote: On 05/13/2013 07:26 AM, Greg wrote: On 13/05/2013 07:38, Olivier Bonvalet wrote: On Friday, May 10, 2013 at 19:16 +0200, Greg wrote: Hello folks, I'm in the process of testing CEPH and RBD. I have set up a small cluster of hosts, each running a MON and an OSD with both journal and data on the same SSD (ok, this is stupid, but it is simple to verify the disks are not the bottleneck for 1 client). All nodes are connected on a 1Gb network (no dedicated network for OSDs, shame on me :). Summary: the RBD performance is poor compared to the benchmark. A 5-second seq read benchmark shows something like this:

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
 Total time run:        4.165747
Total reads made:       95
Read size:              4194304
Bandwidth (MB/sec):     91.220
Average Latency:        0.678901
Max latency:            1.80038
Min latency:            0.104719

91MB/s read performance, quite good! Now the RBD performance:

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s

There is a 3x performance factor (same for write: ~60MB/s benchmark, ~20MB/s dd on the block device). The network is ok, the CPU is also ok on all OSDs. Ceph is bobtail 0.56.4, Linux is 3.8.1 arm (vanilla release + some patches for the SoC being used). Can you show me the starting point for digging into this? You should try to increase read_ahead to 512K instead of the default 128K (/sys/block/*/queue/read_ahead_kb). I have seen a huge difference on reads with that. Olivier, thanks a lot for pointing this out, it indeed makes a *huge* difference!

# dd if=/mnt/t/1 of=/dev/zero bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 5.12768 s, 81.8 MB/s

(caches dropped before each test, of course) Mark, this is probably something you will want to investigate and explain in a tweaking topic of the documentation. Regards, Out of curiosity, has your rados bench performance improved as well? We've also seen improvements in sequential read throughput when increasing read_ahead_kb (it may decrease random iops in some cases though!). The reason I didn't think to mention it here is that I was just focused on the difference between rados bench and rbd. It would be interesting to know if rbd has improved more dramatically than rados bench. Mark, the read ahead is set on the RBD block device (on the client), so it doesn't improve benchmark results, as the benchmark doesn't use the block layer. One question remains: why did I have poor performance with a single writing thread? Regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD vs RADOS benchmark performance
On 11/05/2013 02:52, Mark Nelson wrote: On 05/10/2013 07:20 PM, Greg wrote: On 11/05/2013 00:56, Mark Nelson wrote: On 05/10/2013 12:16 PM, Greg wrote: Hello folks, I'm in the process of testing CEPH and RBD. I have set up a small cluster of hosts, each running a MON and an OSD with both journal and data on the same SSD (ok, this is stupid, but it is simple to verify the disks are not the bottleneck for 1 client). All nodes are connected on a 1Gb network (no dedicated network for OSDs, shame on me :). Summary: the RBD performance is poor compared to the benchmark. A 5-second seq read benchmark shows something like this:

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
 Total time run:        4.165747
Total reads made:       95
Read size:              4194304
Bandwidth (MB/sec):     91.220
Average Latency:        0.678901
Max latency:            1.80038
Min latency:            0.104719

91MB/s read performance, quite good! Now the RBD performance:

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s

There is a 3x performance factor (same for write: ~60MB/s benchmark, ~20MB/s dd on the block device). The network is ok, the CPU is also ok on all OSDs. Ceph is bobtail 0.56.4, Linux is 3.8.1 arm (vanilla release + some patches for the SoC being used). Can you show me the starting point for digging into this? Hi Greg, First things first, are you doing kernel rbd or qemu/kvm? If you are doing qemu/kvm, make sure you are using virtio disks. This can have a pretty big performance impact. Next, are you using RBD cache? With 0.56.4 there are some performance issues with large sequential writes if cache is on, but it does provide benefit for small sequential writes. In general RBD cache behaviour has improved with Cuttlefish. Beyond that, are the pools being targeted by RBD and rados bench set up the same way? Same number of PGs? Same replication? Mark, thanks for your prompt reply. I'm doing kernel RBD and so I have not enabled the cache (default setting?). Sorry, I forgot to mention that the pool used for bench and RBD is the same. Interesting. Does your rados bench performance change if you run a longer test? So far I've been seeing about a 20-30% performance overhead for kernel RBD, but 3x is excessive! It might be worth watching the underlying IO sizes to the OSDs in each case with something like collectl -sD -oT to see if there are any significant differences. Mark, I'll gather you some more data with collectl; meanwhile I realized a difference: the benchmark performs 16 concurrent reads while RBD only does 1. Shouldn't be a problem, but still, these are 2 different usage patterns. Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
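To compare like with like, a sketch of driving both access patterns (the pool name 'sp' is an example; older rados versions may lack the --no-cleanup flag, in which case run the seq test immediately after the write test):

rados -p sp bench 30 write --no-cleanup   # populate objects first
rados -p sp bench 30 seq -t 16            # 16 concurrent reads, as in the numbers above
rados -p sp bench 30 seq -t 1             # single-threaded, closer to what dd does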
Re: [ceph-users] problem re-adding an osd
On 06/05/2013 20:41, Glen Aidukas wrote: New post below... From: Greg [mailto:it...@itooo.com] Sent: Monday, May 06, 2013 2:31 PM To: Glen Aidukas Subject: Re: [ceph-users] problem re-adding an osd On 06/05/2013 20:05, Glen Aidukas wrote: Greg, Not sure where to use the --d switch. I tried the following:

service ceph start --d
service ceph --d start

Both do not work. I did see an error in my log though...

2013-05-06 13:03:38.432479 7f0007ef2780 -1 filestore(/srv/ceph/osd/osd.2) limited size xattrs -- filestore_xattr_use_omap enabled
2013-05-06 13:03:38.438563 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is supported and appears to work
2013-05-06 13:03:38.438591 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-05-06 13:03:38.438804 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount did NOT detect btrfs
2013-05-06 13:03:38.484841 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount syncfs(2) syscall fully supported (by glibc and kernel)
2013-05-06 13:03:38.485010 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount found snaps
2013-05-06 13:03:38.488631 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-05-06 13:03:38.488936 7f0007ef2780 1 journal _open /srv/ceph/osd/osd.2/journal fd 19: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-05-06 13:03:38.489095 7f0007ef2780 1 journal _open /srv/ceph/osd/osd.2/journal fd 19: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-05-06 13:03:38.490116 7f0007ef2780 1 journal close /srv/ceph/osd/osd.2/journal
2013-05-06 13:03:38.538302 7f0007ef2780 -1 filestore(/srv/ceph/osd/osd.2) limited size xattrs -- filestore_xattr_use_omap enabled
2013-05-06 13:03:38.559813 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is supported and appears to work
2013-05-06 13:03:38.559848 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-05-06 13:03:38.560082 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount did NOT detect btrfs
2013-05-06 13:03:38.566015 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount syncfs(2) syscall fully supported (by glibc and kernel)
2013-05-06 13:03:38.566106 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount found snaps
2013-05-06 13:03:38.569047 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-05-06 13:03:38.569237 7f0007ef2780 1 journal _open /srv/ceph/osd/osd.2/journal fd 27: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-05-06 13:03:38.569316 7f0007ef2780 1 journal _open /srv/ceph/osd/osd.2/journal fd 27: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-05-06 13:03:38.574317 7f0007ef2780 1 journal close /srv/ceph/osd/osd.2/journal
2013-05-06 13:03:38.574801 7f0007ef2780 -1 ** ERROR: osd init failed: (1) Operation not permitted

Glen Aidukas [Manager IT Infrastructure] From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Greg Sent: Monday, May 06, 2013 1:47 PM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] problem re-adding an osd On 06/05/2013 19:23, Glen Aidukas wrote: Hello, I think this is a newbie question but I tested everything and, yes, I FTFM as best I could.
I'm evaluating ceph, so I set up a cluster of 4 nodes. The nodes are KVM virtual machines named ceph01 to ceph04, all running Ubuntu 12.04.2 LTS, each with a single osd named osd.1 through osd.4, respective to the host they run on. Each host also has a 1TB disk for ceph to use ('/dev/vdb1'). After some work I was able to get the cluster up and running and even mounted it on a test client host (named ceph00). I ran into issues when I was testing a failure. I shut off ceph02 and watched it (via ceph -w) recover and move the data around. At this point all is fine. When I turned the host back on, it did not auto-reconnect. I expected this. I then went through many attempts to re-add it, but all failed. Here is the output from ceph osd tree:

# id    weight  type name       up/down reweight
-1      4       root default
-3      4       rack unknownrack
-2      1       host ceph01
1       1       osd.1   up      1
-4
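The 'osd init failed: (1) Operation not permitted' at the end of the log above is usually an authentication mismatch: the rebuilt OSD's key no longer matches what the cluster has registered. A sketch of re-registering it along the lines of the add-or-rm-osds doc (osd.2 and the data path are taken from this thread; the crush set argument order varies by release, so check your version's docs):

# drop the stale key, then register the key from the OSD's local keyring
ceph auth del osd.2
ceph auth add osd.2 osd 'allow *' mon 'allow rwx' -i /srv/ceph/osd/osd.2/keyring
# make sure the OSD has a CRUSH position, then start it
ceph osd crush set 2 osd.2 1.0 root=default rack=unknownrack host=ceph02
service ceph start osd.2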
Re: [ceph-users] ceph osd tell bench
On 03/05/2013 16:34, Travis Rhoden wrote: I have a question about the tell bench command. When I run this, is it behaving more or less like a dd on the drive? It appears to be, but I wanted to confirm whether or not it is bypassing all the normal Ceph stack that would be writing metadata, calculating checksums, etc. One bit of behavior I noticed a while back that I was not expecting is that this command does write to the journal. It made sense when I thought about it, but when I have an SSD journal in front of an OSD, I can't get the tell bench command to really show me accurate numbers for the raw speed of the OSD -- instead I get the write speed of the SSD. Just a small caveat there. The upside is that when you do something like tell \* bench, you are able to see if that SSD becomes a bottleneck by hosting multiple journals, so I'm not really complaining. But it does make it a bit tough to see if perhaps one OSD is performing much differently than the others. But really, I'm mainly curious if it skips any normal metadata/checksum overhead that may be there otherwise. Travis, I'm no expert but, to me, the bench doesn't bypass the ceph stack. On a test setup, I set up the journal on the same drive as the data drive; when I tell bench I can see ~160MB/s throughput on the SSD block device while the benchmark result is ~80MB/s, which leads me to think the data is written twice: once to the journal and once to the permanent storage. I see almost no reads on the block device, but the written data probably is in the page cache. Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
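For reference, a sketch of kicking off the bench and reading the result (osd 0 is an example id; on releases of this vintage the result is reported asynchronously in the cluster log):

ceph osd tell 0 bench   # writes ~1GB through the normal journal + filestore path
ceph -w                 # the MB/s figure shows up here when the bench completes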
Re: [ceph-users] Replacement hardware
The MDS doesn't have any local state. You just need to start up the daemon somewhere with a name and key that are known to the cluster (these can be different from or the same as the ones that existed on the dead node; it doesn't matter!). -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wednesday, March 20, 2013 at 10:40 AM, Igor Laskovy wrote: Actually, I have already recovered the OSDs and MON daemon back into the cluster according to http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ and http://ceph.com/docs/master/rados/operations/add-or-rm-mons/. But the doc is missing info about removing/adding an MDS. How can I recover the MDS daemon for the failed node? On Wed, Mar 20, 2013 at 3:23 PM, Dave (Bob) d...@bob-the-boat.me.uk wrote: Igor, I am sure that I'm right in saying that you just have to create a new filesystem (btrfs?) on the new block device, mount it, and then initialise the osd with: ceph-osd -i <the osd number> --mkfs Then you can start the osd with: ceph-osd -i <the osd number> Since you are replacing an osd that already existed, the cluster knows about it, and there is a key for it that is known. I don't claim any great expertise, but this is what I've been doing, and the cluster seems to adopt the new osd and sort everything out. David ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Replacement hardware
Yeah. If you run ceph auth list you'll get a dump of all the users and keys the cluster knows about; each of your daemons has that key stored somewhere locally (generally under /var/lib/ceph/[osd|mds|mon]/ceph-$id). You can create more or copy an unused MDS one. I believe the docs include information on how this works. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wednesday, March 20, 2013 at 10:48 AM, Igor Laskovy wrote: Well, can you please clarify exactly what key I must use? Do I need to get/generate it somehow from the working cluster? On Wed, Mar 20, 2013 at 7:41 PM, Greg Farnum g...@inktank.com wrote: The MDS doesn't have any local state. You just need to start up the daemon somewhere with a name and key that are known to the cluster (these can be different from or the same as the ones that existed on the dead node; it doesn't matter!). -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wednesday, March 20, 2013 at 10:40 AM, Igor Laskovy wrote: Actually, I have already recovered the OSDs and MON daemon back into the cluster according to http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ and http://ceph.com/docs/master/rados/operations/add-or-rm-mons/. But the doc is missing info about removing/adding an MDS. How can I recover the MDS daemon for the failed node? On Wed, Mar 20, 2013 at 3:23 PM, Dave (Bob) d...@bob-the-boat.me.uk wrote: Igor, I am sure that I'm right in saying that you just have to create a new filesystem (btrfs?) on the new block device, mount it, and then initialise the osd with: ceph-osd -i <the osd number> --mkfs Then you can start the osd with: ceph-osd -i <the osd number> Since you are replacing an osd that already existed, the cluster knows about it, and there is a key for it that is known. I don't claim any great expertise, but this is what I've been doing, and the cluster seems to adopt the new osd and sort everything out. David ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
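A minimal sketch of bringing a replacement MDS up with a fresh key (the daemon name 'a', the cap strings, and the keyring path are assumptions; compare against ceph auth list and your release's docs):

# create (or fetch) a key for the new mds and store it where the daemon looks
mkdir -p /var/lib/ceph/mds/ceph-a
ceph auth get-or-create mds.a mon 'allow rwx' osd 'allow *' mds 'allow' -o /var/lib/ceph/mds/ceph-a/keyring
# start the daemon under that name
ceph-mds -i a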
Re: [ceph-users] Status of Mac OS and Windows PC client
At various times the ceph-fuse client has worked on OS X. Noah was the last one to do this, and the branch for it is sitting in my long-term really-like-to-get-this-mainlined-someday queue. OS X is a lot easier than Windows, though, and nobody's done any planning around that beyond noting that there are FUSE-like systems for Windows, and that Samba is a workaround. Sorry. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tuesday, March 19, 2013 at 8:58 AM, Igor Laskovy wrote: Thanks for the reply! Actually, I would like to find some way to use one large scalable central storage across multiple PCs and Macs. CephFS would be most suitable here, but you provide only Linux support. Really no planning here? On Tue, Mar 19, 2013 at 3:52 PM, Patrick McGarry patr...@inktank.com wrote: Hey Igor, Currently there are no plans to develop an OS X or Windows-specific client per se. We do provide a number of different ways to expose the cluster so that you could use it from these machines, however. The most recent example of this is the work being done on tgt that can expose Ceph via iSCSI. For reference see: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11662.html Keep an eye out for more details in the near future. Best Regards, Patrick McGarry Director, Community || Inktank http://ceph.com || http://inktank.com @scuttlemonkey || @ceph || @inktank On Tue, Mar 19, 2013 at 8:30 AM, Igor Laskovy igor.lask...@gmail.com wrote: Anybody? :) Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine On Mar 17, 2013 6:37 PM, Igor Laskovy igor.lask...@gmail.com wrote: Hi there! Could you please clarify the current status of development of a client for OS X and Windows desktop editions? -- Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Crash and strange things on MDS
On Friday, March 8, 2013 at 3:29 PM, Kevin Decherf wrote: On Fri, Mar 01, 2013 at 11:12:17AM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 4:49 PM, Kevin Decherf ke...@kdecherf.com wrote: You will find the archive here: <snip> The data is not anonymized. Interesting folders/files here are /user_309bbd38-3cff-468d-a465-dc17c260de0c/* Sorry for the delay, but I have retrieved this archive locally at least, so if you want to remove it from your webserver you can do so. :) Also, I notice when I untar it that the file name includes 'filtered'; what filters did you run it through? Hi Gregory, do you have any news about it? I wrote a couple of tools to do log analysis and created a number of bugs to make the MDS more amenable to analysis as a result of this. Having spot-checked some of your longer-running requests, they're all getattrs or setattrs contending on files in what look to be shared cache and PHP libraries. These cover a range from ~40 milliseconds to ~150 milliseconds. I'd look into what your split applications are sharing across those spaces. On the up side for Ceph, 80% of your requests take 0 milliseconds and ~95% of them take less than 2 milliseconds. Hurray, it's not ridiculously slow most of the time. :) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Crash and strange things on MDS
On Friday, March 15, 2013 at 3:40 PM, Marc-Antoine Perennou wrote: Thank you a lot for these explanations; looking forward to these fixes! Do you have any public bug reports regarding this to link us to? Good luck, thank you for your great work, and have a nice weekend. Marc-Antoine Perennou Well, for now the fixes are for stuff like making analysis take less time and exporting timing information more easily. The most immediately applicable one is probably http://tracker.ceph.com/issues/4354, which I hope to start on next week and should be done by the end of the sprint. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com