Re: [ceph-users] mds isn't working anymore after osd's running full
Unfortunately that doesn't help. I restarted both the active and standby mds, but that doesn't change the state of the mds. Is there a way to force the mds to look at epoch 1832 (or earlier) instead of 1833 ("need osdmap epoch 1833, have 1832")?

Thanks,

Jasper

From: Gregory Farnum [g...@inktank.com]
Sent: Tuesday, 19 August 2014 19:49
To: Jasper Siero
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero
<jasper.si...@target-holding.nl> wrote:
> Hi all,
>
> We have a small ceph cluster running version 0.80.1 with cephfs on five
> nodes. Last week some osd's were full and shut themselves down. To help the
> osd's start again, I added some extra osd's and moved some placement group
> directories on the full osd's (which have a copy on another osd) to another
> place on the node (as mentioned in
> http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/).
>
> After clearing some space on the full osd's I started them again. After a
> lot of deep scrubbing and two pg inconsistencies which needed to be
> repaired, everything looked fine except the mds, which is still in the
> replay state and stays that way. The log below says that the mds needs
> osdmap epoch 1833 but has 1832:
>
> 2014-08-18 12:29:22.268248 7fa786182700  1 mds.-1.0 handle_mds_map standby
> 2014-08-18 12:29:22.273995 7fa786182700  1 mds.0.25 handle_mds_map i am now mds.0.25
> 2014-08-18 12:29:22.273998 7fa786182700  1 mds.0.25 handle_mds_map state change up:standby --> up:replay
> 2014-08-18 12:29:22.274000 7fa786182700  1 mds.0.25 replay_start
> 2014-08-18 12:29:22.274014 7fa786182700  1 mds.0.25  recovery set is
> 2014-08-18 12:29:22.274016 7fa786182700  1 mds.0.25  need osdmap epoch 1833, have 1832
> 2014-08-18 12:29:22.274017 7fa786182700  1 mds.0.25  waiting for osdmap 1833 (which blacklists prior instance)
>
> # ceph status
>     cluster c78209f5-55ea-4c70-8968-2231d2b05560
>      health HEALTH_WARN mds cluster is degraded
>      monmap e3: 3 mons at {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0}, election epoch 362, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003
>      mdsmap e154: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
>      osdmap e1951: 12 osds: 12 up, 12 in
>       pgmap v193685: 492 pgs, 4 pools, 60297 MB data, 470 kobjects
>             124 GB used, 175 GB / 299 GB avail
>                  492 active+clean
>
> # ceph osd tree
> # id    weight    type name          up/down  reweight
> -1      0.2399    root default
> -2      0.05997           host th1-osd001
> 0       0.01999                   osd.0    up       1
> 1       0.01999                   osd.1    up       1
> 2       0.01999                   osd.2    up       1
> -3      0.05997           host th1-osd002
> 3       0.01999                   osd.3    up       1
> 4       0.01999                   osd.4    up       1
> 5       0.01999                   osd.5    up       1
> -4      0.05997           host th1-mon003
> 6       0.01999                   osd.6    up       1
> 7       0.01999                   osd.7    up       1
> 8       0.01999                   osd.8    up       1
> -5      0.05997           host th1-mon002
> 9       0.01999                   osd.9    up       1
> 10      0.01999                   osd.10   up       1
> 11      0.01999                   osd.11   up       1
>
> What is the way to get the mds up and running again? I still have all the
> placement group directories which I moved from the full osds (which were
> down) to create disk space.

Try just restarting the MDS daemon. This sounds a little familiar so I
think it's a known bug which may be fixed in a later dev or point release
on the MDS, but it's a soft-state rather than a disk-state issue.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] some pgs active+remapped, Ceph can not recover itself.
Thanks, Lewis. And I got the suggestion that it is better to use similar OSD sizes.

2014-08-20 9:24 GMT+07:00 Craig Lewis <cle...@centraldesktop.com>:

> I believe you need to remove the authorization for osd.4 and osd.6 before
> re-creating them.
>
> When I re-format disks, I migrate data off of the disk using:
>
>     ceph osd out $OSDID
>
> Then wait for the remapping to finish. Once it does:
>
>     stop ceph-osd id=$OSDID
>     ceph osd out $OSDID
>     ceph auth del osd.$OSDID
>     ceph osd crush remove osd.$OSDID
>     ceph osd rm $OSDID
>
> Ceph will migrate the data off of it. When it's empty, you can delete it
> using the above commands. Since osd.4 and osd.6 are already lost, you can
> just do the part after remapping finishes for them.
>
> You could be having trouble because the sizes of the OSDs are so
> different. I wouldn't mix OSDs that are 100 GB and 1.8 TB. Most of the
> stuck PGs are on osd.5, osd.7, and one of the small OSDs. You can migrate
> data off of those small disks the same way I said to do osd.10.
>
> On Tue, Aug 19, 2014 at 6:34 AM, debian Only <onlydeb...@gmail.com> wrote:
>
>> This happened after some OSDs failed and I recreated them. I did "ceph
>> osd rm osd.4" to remove osd.4 and osd.6. But when I used ceph-deploy to
>> install an OSD with "ceph-deploy osd --zap-disk --fs-type btrfs create
>> ceph0x-vm:sdb", ceph-deploy said the new osd was ready, but the OSD could
>> not start: ceph-disk failed on /var/lib/ceph/bootstrap-osd/ceph.keyring
>> with "auth: error", and I have checked that the ceph.keyring is the same
>> as on the other live OSDs.
>>
>> When I ran ceph-deploy twice, it first created osd.4, which failed but
>> is displayed in the osd tree; then osd.6, the same. The next "ceph-deploy
>> osd" run created osd.10, and this OSD started successfully, but osd.4 and
>> osd.6 are displayed as down in the osd tree.
>>
>> When I ran "ceph osd reweight-by-utilization" once, there were more pgs
>> active+remapped. Ceph cannot recover by itself, and the crush map tunables
>> are already set to optimal. I don't know how to solve it.
>>
>> root@ceph-admin:~# ceph osd crush dump
>> { "devices": [
>>         { "id": 0, "name": "osd.0"},
>>         { "id": 1, "name": "osd.1"},
>>         { "id": 2, "name": "osd.2"},
>>         { "id": 3, "name": "osd.3"},
>>         { "id": 4, "name": "device4"},
>>         { "id": 5, "name": "osd.5"},
>>         { "id": 6, "name": "device6"},
>>         { "id": 7, "name": "osd.7"},
>>         { "id": 8, "name": "osd.8"},
>>         { "id": 9, "name": "osd.9"},
>>         { "id": 10, "name": "osd.10"}],
>>   "types": [
>>         { "type_id": 0, "name": "osd"},
>>         { "type_id": 1, "name": "host"},
>>         { "type_id": 2, "name": "chassis"},
>>         { "type_id": 3, "name": "rack"},
>>         { "type_id": 4, "name": "row"},
>>         { "type_id": 5, "name": "pdu"},
>>         { "type_id": 6, "name": "pod"},
>>         { "type_id": 7, "name": "room"},
>>         { "type_id": 8, "name": "datacenter"},
>>         { "type_id": 9, "name": "region"},
>>         { "type_id": 10, "name": "root"}],
>>   "buckets": [
>>         { "id": -1, "name": "default", "type_id": 10, "type_name": "root",
>>           "weight": 302773, "alg": "straw", "hash": "rjenkins1",
>>           "items": [
>>                 { "id": -2, "weight": 5898, "pos": 0},
>>                 { "id": -3, "weight": 5898, "pos": 1},
>>                 { "id": -4, "weight": 5898, "pos": 2},
>>                 { "id": -5, "weight": 12451, "pos": 3},
>>                 { "id": -6, "weight": 13107, "pos": 4},
>>                 { "id": -7, "weight": 87162, "pos": 5},
>>                 { "id": -8, "weight": 49807, "pos": 6},
>>                 { "id": -9, "weight": 116654, "pos": 7},
>>                 { "id": -10, "weight": 5898, "pos": 8}]},
>>         { "id": -2, "name": "ceph02-vm", "type_id": 1, "type_name": "host",
>>           "weight": 5898, "alg": "straw", "hash": "rjenkins1",
>>           "items": [ { "id": 0, "weight": 5898, "pos": 0}]},
>>         { "id": -3, "name": "ceph03-vm", "type_id": 1, "type_name": "host",
>>           "weight": 5898, "alg": "straw", "hash": "rjenkins1",
>>           "items": [ { "id": 1, "weight": 5898, "pos": 0}]},
>>         { "id": -4, "name": "ceph01-vm", "type_id": 1, "type_name": "host",
>>           "weight": 5898, "alg": "straw", "hash": "rjenkins1",
>>           "items": [ { "id": 2, "weight": 5898, "pos": 0}]},
>>         { "id": -5, "name": "ceph04-vm", "type_id": 1, "type_name": "host",
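(Condensing Craig's drain-and-remove procedure into one sketch; the health-polling loop is just a crude stand-in for watching the remapping finish:)

    OSDID=4
    ceph osd out $OSDID
    # crude wait for the remapping/backfill to finish; watch "ceph -w" for detail
    until ceph health | grep -q HEALTH_OK; do sleep 60; done
    stop ceph-osd id=$OSDID      # upstart syntax; adapt to your init system
    ceph auth del osd.$OSDID
    ceph osd crush remove osd.$OSDID
    ceph osd rm $OSDID

For osds that are already lost (like osd.4 and osd.6 above), only the last three commands apply.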
Re: [ceph-users] Problem when building/running cuttlefish from source on Ubuntu 14.04 Server
Hello Gregory:

I'm doing some comparisons of performance between different combinations of environment, therefore I have to try such an old version. Thanks for your kind help! The solution you provided does work! I think I was relying on ceph-disk too much, therefore I didn't notice this.

2014-08-20 1:44 GMT+08:00 Gregory Farnum <g...@inktank.com>:

> On Thu, Aug 14, 2014 at 2:28 AM, NotExist <notex...@gmail.com> wrote:
>> Hello everyone:
>>
>> Since there's no cuttlefish package for 14.04 server on the ceph
>> repository (only ceph-deploy is there), I tried to build cuttlefish from
>> source on 14.04.
>
> ...why? Cuttlefish is old and no longer provided updates. You really
> want to be using either Dumpling or Firefly.
>
>> Here's what I did:
>>
>> Get the source by following http://ceph.com/docs/master/install/clone-source/
>>
>> Enter the source code directory:
>>
>>     git checkout cuttlefish
>>     git submodule update
>>     rm -rf src/civetweb/ src/erasure-code/ src/rocksdb/
>>
>> to get the latest cuttlefish repo.
>>
>> Build the source by following http://ceph.com/docs/master/install/build-ceph/.
>> Besides the packages that URL mentions for Ubuntu:
>>
>>     sudo apt-get install autotools-dev autoconf automake cdbs gcc g++ git libboost-dev libedit-dev libssl-dev libtool libfcgi libfcgi-dev libfuse-dev linux-kernel-headers libcrypto++-dev libcrypto++ libexpat1-dev pkg-config
>>     sudo apt-get install uuid-dev libkeyutils-dev libgoogle-perftools-dev libatomic-ops-dev libaio-dev libgdata-common libgdata13 libsnappy-dev libleveldb-dev
>>
>> I also found it needs:
>>
>>     sudo apt-get install libboost-filesystem-dev libboost-thread-dev libboost-program-options-dev
>>
>> (And xfsprogs if you need xfs.) After all packages were installed, I
>> started to compile according to the doc:
>>
>>     ./autogen.sh
>>     ./configure
>>     make -j8
>>
>> and installed following
>> http://ceph.com/docs/master/install/install-storage-cluster/#installing-a-build
>>
>>     sudo make install
>>
>> Everything seemed fine, but I found ceph_common.sh had been put in
>> /usr/local/lib/ceph, and some tools were put into /usr/local/usr/local/sbin/
>> (ceph-disk* and ceph-create-keys). I was used to using ceph-disk to
>> prepare the disk on other deployments (on other machines with Emperor),
>> but I can't do it now (and maybe the path is the reason), so I chose to
>> do all the steps manually. I have followed the doc
>> http://ceph.com/docs/master/install/manual-deployment/ to deploy a
>> cluster many times, but it turned out different this time.
>>
>> /etc/ceph isn't there, therefore I sudo mkdir /etc/ceph, put a ceph.conf
>> into /etc/ceph, and generated all required keys in /etc/ceph instead of
>> /tmp/ to keep them:
>>
>>     ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
>>     ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'
>>     ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
>>
>> Generate the monmap with monmaptool:
>>
>>     monmaptool --create --add storage01 192.168.11.1 --fsid 9f8fffe3-040d-4641-b35a-ffa90241f723 /etc/ceph/monmap
>>
>> /var/lib/ceph is not there either:
>>
>>     sudo mkdir -p /var/lib/ceph/mon/ceph-storage01
>>     sudo ceph-mon --mkfs -i storage01 --monmap /etc/ceph/monmap --keyring /etc/ceph/ceph.mon.keyring
>>
>> The log directory is not there either, so I created it manually:
>>
>>     sudo mkdir /var/log/ceph
>>
>> Since service doesn't work, I started the mon daemon manually:
>>
>>     sudo /usr/local/bin/ceph-mon -i storage01
>>
>> and ceph -s looks like this:
>>
>>     storage@storage01:~/ceph$ ceph -s
>>        health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds
>>        monmap e1: 1 mons at {storage01=192.168.11.1:6789/0}, election epoch 2, quorum 0 storage01
>>        osdmap e1: 0 osds: 0 up, 0 in
>>         pgmap v2: 192 pgs: 192 creating; 0 bytes data, 0 KB used, 0 KB / 0 KB avail
>>        mdsmap e1: 0/0/1 up
>>
>> And I added disks as osds with the manual commands:
>>
>>     sudo mkfs -t xfs -f /dev/sdb
>>     sudo mkdir /var/lib/ceph/osd/ceph-1
>>     sudo mount /dev/sdb /var/lib/ceph/osd/ceph-1/
>>     sudo ceph-osd -i 1 --mkfs --mkkey
>>     ceph osd create
>>     ceph osd crush add osd.1 1.0 host=storage01
>>     sudo ceph-osd -i 1
>>
>> repeated for 10 osds, and I got:
>>
>>     storage@storage01:~/ceph$ ceph osd tree
>>     # id    weight  type name       up/down reweight
>>     -2      10      host storage01
>>     0       1               osd.0   up      1
>>     1       1               osd.1   up      1
>>     2       1               osd.2   up      1
>>     3       1               osd.3   up      1
>>     4       1               osd.4   up      1
>>     5       1               osd.5   up      1
>>     6       1               osd.6   up      1
>>     7       1               osd.7   up      1
>>     8       1               osd.8   up      1
>>     9       1               osd.9   up      1
>>     -1      0       root default
>>
>> and
>>
>>     storage@storage01:~/ceph$ ceph -s
>>        health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
>>        monmap e1: 1 mons at {storage01=192.168.11.1:6789/0}, election epoch 2, quorum 0 storage01
>>        osdmap e32: 10 osds: 10 up, 10 in
>>        pgmap v56:
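(A sketch for anyone else hitting the stray /usr/local paths described above: pointing autotools at the system directories should put the scripts and tools where the init glue expects them. These are standard autotools options, not verified against cuttlefish specifically:)

    ./autogen.sh
    ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
    make -j8
    sudo make install

With that, ceph_common.sh and the ceph-disk tools land under /usr rather than /usr/local, and the daemons look for their state under /etc/ceph and /var/lib/ceph as the manual-deployment doc assumes.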
[ceph-users] Starting Ceph OSD
Hi All,

We noticed two of our osds as down using the "ceph osd tree" command. We tried starting them using the following commands, but "ceph osd tree" still reports them as down. Please see below for the commands used.

command: sudo start ceph-osd id=osd.0
output:  ceph-osd (ceph/osd.0) stop/pre-start, process 3831

ceph osd tree output:

    # id    weight  type name       up/down reweight
    -1      5.13    root default
    -2      1.71            host ceph-node1
    0       0.8                     osd.0   down    0
    2       0.91                    osd.2   down    0
    -3      1.71            host ceph-node2

command: sudo start ceph-osd id=0
output:  ceph-osd (ceph/0) start/running, process 3887

ceph osd tree output:

    # id    weight  type name       up/down reweight
    -1      5.13    root default
    -2      1.71            host ceph-node1
    0       0.8                     osd.0   down    0
    2       0.91                    osd.2   down    0
    -3      1.71            host ceph-node2

command: sudo start ceph-osd id=0
output:  ceph-osd (ceph/0) start/running, process 4348

ceph osd tree output:

    # id    weight  type name       up/down reweight
    -1      5.22    root default
    -2      1.8             host ceph-node1
    0       0.8                     osd.0   down    0
    2       0.91                    osd.2   down    0

Is there any other way to start an OSD? I'm out of ideas. What we do is execute the "ceph-deploy activate" command to bring an OSD up. Is that the right way to do it? We are using ceph version 0.80.4.

Thanks!

Regards,
Pons
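(Some generic first checks when an OSD process starts but never gets marked up -- a sketch; paths assume the default locations:)

    ps aux | grep [c]eph-osd                 # is the daemon actually still running?
    tail -n 50 /var/log/ceph/ceph-osd.0.log  # if it aborted, the log usually says why
    mount | grep /var/lib/ceph/osd           # is the osd data directory mounted?
    ceph osd stat                            # how many osds does the monitor see up/in?

A daemon that starts and then exits immediately (missing mount, bad keyring, mismatched fsid) will still show "start/running" from upstart, so the log file is the first place to look.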
Re: [ceph-users] RadosGW problems
Hello,

Yehuda, I know I was using the correct fastcgi module; it was the one from the Ceph repositories. I had also disabled all other modules in apache.

I tried to create a second swift user, using the provided instructions, only to get the following:

# radosgw-admin user create --uid=marcogarces --display-name="Marco Garces"
# radosgw-admin subuser create --uid=marcogarces --subuser=marcogarces:swift --access=full
# radosgw-admin key create --subuser=marcogarces:swift --key-type=swift --gen-secret
could not create key: unable to add access key, unable to store user info
2014-08-20 13:19:33.664945 7f925b130880  0 WARNING: can't store user info, swift id () already mapped to another user (marcogarces)

So I have created another user, some other way:

# radosgw-admin user create --subuser=testuser:swift --display-name="Test User One" --key-type=swift --access=full
{ "user_id": "testuser",
  "display_name": "Test User One",
  "email": "",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [],
  "keys": [],
  "swift_keys": [
        { "user": "testuser:swift",
          "secret_key": "MHA4vFaDy5XsJq+F5NuZLcBMCoJcuot44ASDuReY"}],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false, "max_size_kb": -1, "max_objects": -1},
  "user_quota": { "enabled": false, "max_size_kb": -1, "max_objects": -1},
  "temp_url_keys": []}

Now, when I do, from the client:

swift -V 1 -A http://gateway.bcitestes.local/auth -U testuser:swift -K MHA4vFaDy5XsJq+F5NuZLcBMCoJcuot44ASDuReY stat
       Account: v1
    Containers: 0
       Objects: 0
         Bytes: 0
        Server: Tengine/2.0.3
    Connection: keep-alive
    X-Account-Bytes-Used-Actual: 0
    Content-Type: text/plain; charset=utf-8

If I try using https, I still have errors:

swift --insecure -V 1 -A https://gateway.bcitestes.local/auth -U testuser:swift -K MHA4vFaDy5XsJq+F5NuZLcBMCoJcuot44ASDuReY stat
Account HEAD failed: http://gateway.bcitestes.local:443/swift/v1 400 Bad Request

And I could not validate this account using a Swift client (Cyberduck). Also, there are no S3 credentials! How can I have a user with both S3 and Swift credentials created, valid to use with http/https, and on all clients (command line and GUI)? The first user works great with the S3 credentials in all scenarios.

Thank you,
Marco Garcês

On Tue, Aug 19, 2014 at 7:59 PM, Yehuda Sadeh <yeh...@inktank.com> wrote:

> On Tue, Aug 19, 2014 at 5:32 AM, Marco Garcês <ma...@garces.cc> wrote:
>> UPDATE: I have installed Tengine (nginx fork) and configured both HTTP
>> and HTTPS to use the radosgw socket.
>
> Looking back at this thread, and considering this solution, it seems to
> me that you were running the wrong apache fastcgi module.
>
>> I can login with S3, create buckets and upload objects. It's still not
>> possible to use Swift credentials; can you help me on this part? What do
>> I use when I login (url, username, password)?
>>
>> Here is the info for the user:
>>
>> radosgw-admin user info --uid=mgarces
>> { "user_id": "mgarces",
>>   "display_name": "Marco Garces",
>>   "email": "marco.gar...@bci.co.mz",
>>   "suspended": 0,
>>   "max_buckets": 1000,
>>   "auid": 0,
>>   "subusers": [
>>         { "id": "mgarces:swift",
>>           "permissions": "full-control"}],
>>   "keys": [
>>         { "user": "mgarces:swift",
>>           "access_key": "AJW2BCBXHFJ1DPXT112O",
>>           "secret_key": ""},
>>         { "user": "mgarces",
>>           "access_key": "S88Y6ZJRACZG49JFPY83",
>>           "secret_key": "PlubMMjfQecJ5Py46e2kZz5VuUgHgsjLmYZDRdFg"}],
>>   "swift_keys": [
>>         { "user": "mgarces:swift",
>>           "secret_key": "TtKWhY67ujhjn36\/nhv44A2BVPw5wDi3Sp13YrMM"}],
>>   "caps": [],
>>   "op_mask": "read, write, delete",
>>   "default_placement": "",
>>   "placement_tags": [],
>>   "bucket_quota": { "enabled": false, "max_size_kb": -1, "max_objects": -1},
>>   "user_quota": { "enabled": false, "max_size_kb": -1, "max_objects": -1},
>>   "temp_url_keys": []}
>
> You might be hitting issue #8587 (aka #9155). Try creating a second
> swift user, see if it still happens.
>
> Yehuda
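(For reference, the usual sequence for giving one user both S3 and Swift credentials is the three-step one Marco tried first -- the uid here is hypothetical. The first command generates the S3 key pair; the last generates the Swift secret for the subuser:)

    radosgw-admin user create --uid=demo --display-name="Demo User"
    radosgw-admin subuser create --uid=demo --subuser=demo:swift --access=full
    radosgw-admin key create --subuser=demo:swift --key-type=swift --gen-secret

When that third step fails with "swift id () already mapped to another user", as above, it is the bug Yehuda references rather than a problem with the sequence itself.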
[ceph-users] Serious performance problems with small file writes
We have a ceph system here, and we're seeing performance regularly descend into unusability for periods of minutes at a time (or longer). This appears to be triggered by writing large numbers of small files.

Specifications:

   ceph 0.80.5
   6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
   2 machines running primary and standby MDS
   3 monitors on the same machines as the OSDs
   Infiniband to about 8 CephFS clients (headless, in the machine room)
   Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop machines, in the analysis lab)

The cluster stores home directories of the users and a larger area of scientific data (approx 15 TB) which is being processed and analysed by the users of the cluster.

We have a relatively small number of concurrent users (typically 4-6 at most), who use GUI tools to examine their data, and then complex sets of MATLAB scripts to process it, with processing often being distributed across all the machines using Condor.

It's not unusual to see the analysis scripts write out large numbers (thousands, possibly tens or hundreds of thousands) of small files, often from many client machines at once in parallel. When this happens, the ceph cluster becomes almost completely unresponsive for tens of seconds (or even for minutes) at a time, until the writes are flushed through the system. Given the nature of modern GUI desktop environments (often reading and writing small state files in the user's home directory), this means that desktop interactivity and responsiveness for all the other users of the cluster suffer.

1-minute load on the servers typically peaks at about 8 during these events (on 4-core machines). Load on the clients also peaks high, because of the number of processes waiting for a response from the FS. The MDS shows little sign of stress -- it seems to be entirely down to the OSDs. ceph -w shows requests blocked for more than 10 seconds, and in bad cases, ceph -s shows up to many hundreds of requests blocked for more than 32s.

We've had to turn off scrubbing and deep scrubbing completely -- except between 01.00 and 04.00 every night -- because it triggers the exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets up to 7 PGs being scrubbed, as it did on Monday, it's completely unusable.

Is this problem something that's often seen? If so, what are the best options for mitigation or elimination of the problem? I've found a few references to issue #6278 [1], but that seems to be referencing scrub specifically, not ordinary (if possibly pathological) writes.

What are the sorts of things I should be looking at to work out where the bottleneck(s) are? I'm a bit lost about how to drill down into the ceph system for identifying performance issues. Is there a useful guide to tools somewhere?

Is an upgrade to 0.84 likely to be helpful? How development are the development releases, from a stability / dangerous-bugs point of view?

Thanks,
Hugo.

[1] http://tracker.ceph.com/issues/6278

--
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers
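(A crude sketch of the small-file write pattern described above, for anyone wanting to reproduce it -- the file count and size are arbitrary, and it is most telling when run from several CephFS clients at once while watching ceph -w on a monitor node:)

    # write many small files into a CephFS directory and time it
    mkdir smallfiles && cd smallfiles
    time bash -c 'for i in $(seq 1 10000); do head -c 4096 /dev/urandom > f$i; done'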
Re: [ceph-users] Serious performance problems with small file writes
Hi,

Do you get slow requests during the slowness incidents? What about monitor elections?

Are your MDSs using a lot of CPU? Did you try tuning anything in the MDS? (I think the default config is still conservative, and there are options to cache more entries, etc...)

What about iostat on the OSDs -- are your OSD disks busy reading or writing during these incidents? What are you using for OSD journals?

Also check the CPU usage for the mons and osds...

Does your hardware provide enough IOPS for what your users need? (e.g. what is the op/s from ceph -w?)

If disabling deep scrub helps, then it might be that something else is reading the disks heavily. One thing to check is updatedb -- we had to disable it from indexing /var/lib/ceph on our OSDs.

Best Regards,
Dan

-- Dan van der Ster || Data Storage Services || CERN IT Department --

On 20 Aug 2014, at 16:39, Hugo Mills <h.r.mi...@reading.ac.uk> wrote:

> We have a ceph system here, and we're seeing performance regularly
> descend into unusability for periods of minutes at a time (or longer).
> This appears to be triggered by writing large numbers of small files.
> [snip]
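(A concrete starting point for Dan's checklist, as a sketch -- paths and daemon names may differ on your systems, and iostat comes from the sysstat package:)

    ceph -s                                  # health, quorum/election epoch
    ceph -w                                  # live op/s plus slow-request warnings
    iostat -x 5                              # per-disk %util and await on the OSD hosts
    top -b -n 1 -p "$(pgrep -d, ceph-mds)"   # MDS CPU usage
    grep -c 'slow request' /var/log/ceph/ceph-osd.*.log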
Re: [ceph-users] Serious performance problems with small file writes
Hi,

On 20 Aug 2014, at 16:55, German Anders <gand...@despegar.com> wrote:

> Hi Dan, How are you? I want to know how you disabled the indexing on
> /var/lib/ceph on the OSDs?

# grep ceph /etc/updatedb.conf
PRUNEPATHS = /afs /media /net /sfs /tmp /udev /var/cache/ccache /var/spool/cups /var/spool/squid /var/tmp /var/lib/ceph

> Did you disable deep scrub on your OSDs?

No, but this can be an issue. If you get many PGs scrubbing at once, performance will suffer. There is a new feature in 0.67.10 to sleep between scrubbing "chunks". I set that sleep to 0.1 (and the chunk_max to 5, and the scrub size to 1 MB). In 0.67.10+1 there are some new options to set the io priority of the scrubbing threads. Set that to class = 3, priority = 0 to give the scrubbing thread the idle priority. You need to use the cfq disk scheduler for io priorities to work. (cfq will also help if updatedb is causing any problems, since it runs with ionice -c 3.) I'm pretty sure those features will come in 0.80.6 as well. See the sketch after this message for the corresponding option names.

> Do you have the journals on SSDs or RAMDISK?

Never use RAMDISK. We currently have the journals on the same spinning disk as the OSD, but the iops performance is low for the rbd and fs use-cases. (For an object store it should be OK.) But for rbd or fs, you really need journals on SSDs or your cluster will suffer. We now have SSDs on order to augment our cluster. (The way I justified this is that our cluster has X TB of storage capacity and Y iops capacity. With disk journals we will run out of iops capacity well before we run out of storage capacity. So you can either increase the iops capacity substantially by decreasing the volume of the cluster by 20% and replacing those disks with SSD journals, or you can just leave 50% of the disk capacity empty, since you can't use it anyway.)

> What's the perf of your cluster? rados bench? fio? I've set up a new
> cluster and I want to know what would be the best option scheme to go with.

It's not really meaningful to compare performance of different clusters with different hardware. Some "constants" I can advise:

- with few clients, large write throughput is limited by the clients' bandwidth, as long as you have enough OSDs and the client is striping over many objects.
- with disk journals, small write latency will be ~30-50 ms even when the cluster is idle. If you have SSD journals, maybe ~10 ms.
- count your iops. Each disk OSD can do ~100, and you need to divide by the number of replicas. With SSDs you can do a bit better than this, since the synchronous writes go to the SSDs, not the disks.

In my tests with our hardware, I estimate that going from disk to SSD journals will multiply the iops capacity by around 5x. I also found that I needed to increase some of the journal max write and journal queue max limits, and also the filestore limits, to squeeze the best performance out of the SSD journals. Try increasing filestore queue max ops/bytes, filestore queue committing max ops/bytes, and the filestore wbthrottle xfs * options. (I'm not going to publish exact configs here because I haven't finished tuning yet.)

Cheers,
Dan

> Thanks a lot!!
>
> Best regards,
> German Anders
>
> On Wednesday 20/08/2014 at 11:51, Dan Van Der Ster wrote:
>> Hi,
>> Do you get slow requests during the slowness incidents? What about
>> monitor elections? [snip]
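(A sketch of injecting the scrub settings Dan mentions at runtime. The option names are as of 0.67.10+/0.80.x -- verify them against your release, e.g. with "ceph daemon osd.0 config show | grep scrub", and note the ioprio options only take effect with the cfq scheduler:)

    ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
    ceph tell osd.* injectargs '--osd_scrub_chunk_max 5'
    ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 0'

To make the change persistent, put the same settings in the [osd] section of ceph.conf.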
Re: [ceph-users] Serious performance problems with small file writes
Hi, Dan,

Some questions below I can't answer immediately, but I'll spend tomorrow morning irritating people by triggering these events (I think I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small files in it) and giving you more details. For the ones I can answer right now:

On Wed, Aug 20, 2014 at 02:51:12PM +0000, Dan Van Der Ster wrote:
> Do you get slow requests during the slowness incidents?

Slow requests, yes. ceph -w reports them coming in groups, e.g.:

2014-08-20 15:51:23.911711 mon.1 [INF] pgmap v2287926: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 8449 kB/s rd, 3506 kB/s wr, 527 op/s
2014-08-20 15:51:22.381063 osd.5 [WRN] 6 slow requests, 6 included below; oldest blocked for > 10.133901 secs
2014-08-20 15:51:22.381066 osd.5 [WRN] slow request 10.133901 seconds old, received at 2014-08-20 15:51:12.247127: osd_op(mds.0.101:5528578 10005889b29. [create 0~0,setxattr parent (394)] 0.786a9365 ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381068 osd.5 [WRN] slow request 10.116337 seconds old, received at 2014-08-20 15:51:12.264691: osd_op(mds.0.101:5529006 1000599e576. [create 0~0,setxattr parent (392)] 0.5ccbd6a9 ondisk+write e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381070 osd.5 [WRN] slow request 10.116277 seconds old, received at 2014-08-20 15:51:12.264751: osd_op(mds.0.101:5529009 1000588932d. [create 0~0,setxattr parent (394)] 0.de5eca4e ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381071 osd.5 [WRN] slow request 10.115296 seconds old, received at 2014-08-20 15:51:12.265732: osd_op(mds.0.101:5529042 1000588933e. [create 0~0,setxattr parent (395)] 0.5e4d56be ondisk+write e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381073 osd.5 [WRN] slow request 10.115184 seconds old, received at 2014-08-20 15:51:12.265844: osd_op(mds.0.101:5529047 1000599e58a. [create 0~0,setxattr parent (395)] 0.6a487965 ondisk+write e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:24.381370 osd.5 [WRN] 2 slow requests, 2 included below; oldest blocked for > 10.73 secs
2014-08-20 15:51:24.381373 osd.5 [WRN] slow request 10.73 seconds old, received at 2014-08-20 15:51:14.381267: osd_op(mds.0.101:5529327 100058893ca. [create 0~0,setxattr parent (395)] 0.750c7574 ondisk+write e217298) v4 currently commit sent
2014-08-20 15:51:24.381375 osd.5 [WRN] slow request 10.28 seconds old, received at 2014-08-20 15:51:14.381312: osd_op(mds.0.101:5529329 100058893cb. [create 0~0,setxattr parent (395)] 0.c75853fa ondisk+write e217298) v4 currently commit sent
2014-08-20 15:51:24.913554 mon.1 [INF] pgmap v2287927: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 13218 B/s rd, 3532 kB/s wr, 377 op/s
2014-08-20 15:51:25.381582 osd.5 [WRN] 3 slow requests, 3 included below; oldest blocked for > 10.709989 secs
2014-08-20 15:51:25.381586 osd.5 [WRN] slow request 10.709989 seconds old, received at 2014-08-20 15:51:14.671549: osd_op(mds.0.101:5529457 10005889403. [create 0~0,setxattr parent (407)] 0.e15ab1fa ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381587 osd.5 [WRN] slow request 10.709767 seconds old, received at 2014-08-20 15:51:14.671771: osd_op(mds.0.101:5529462 10005889406. [create 0~0,setxattr parent (406)] 0.70f8a5d3 ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381589 osd.5 [WRN] slow request 10.182354 seconds old, received at 2014-08-20 15:51:15.199184: osd_op(mds.0.101:5529464 10005889407. [create 0~0,setxattr parent (391)] 0.30535d02 ondisk+write e217298) v4 currently no flag points reached
2014-08-20 15:51:25.920298 mon.1 [INF] pgmap v2287928: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 12231 B/s rd, 5534 kB/s wr, 370 op/s
2014-08-20 15:51:26.925996 mon.1 [INF] pgmap v2287929: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 26498 B/s rd, 8121 kB/s wr, 367 op/s
2014-08-20 15:51:27.933424 mon.1 [INF] pgmap v2287930: 704 pgs: 704 active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 706 kB/s rd, 7552 kB/s wr, 444 op/s

> What about monitor elections?

No, that's been reporting monmap e3 and election epoch 130 for a week or two. I assume that to mean we've had no elections. We're actually running without one monitor at the moment, because one machine is down, but we've had the same problems with the machine present.

> Are your MDSs using a lot of CPU?

No, they're showing load averages well under 1 the whole time. Peak load average is about 0.6.

> did you try tuning anything in the MDS (I think the default config is
> still conservative, and there are options to cache more entries, etc...)

Not much. We have:
Re: [ceph-users] mds isn't working anymore after osd's running full
After restarting your MDS, it still says it has epoch 1832 and needs epoch 1833? I think you didn't really restart it. If the epoch numbers have changed, can you restart it with "debug mds = 20", "debug objecter = 20", "debug ms = 1" in the ceph.conf and post the resulting log file somewhere?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Wed, Aug 20, 2014 at 12:49 AM, Jasper Siero
<jasper.si...@target-holding.nl> wrote:
> Unfortunately that doesn't help. I restarted both the active and standby
> mds but that doesn't change the state of the mds. Is there a way to force
> the mds to look at epoch 1832 (or earlier) instead of 1833
> ("need osdmap epoch 1833, have 1832")?
> [snip]
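(For reference, the debug settings Greg mentions go in the [mds] section of ceph.conf on the MDS host before the restart:)

    [mds]
        debug mds = 20
        debug objecter = 20
        debug ms = 1

They can also be injected into a running daemon (roughly: ceph mds tell 0 injectargs '--debug-mds 20 --debug-objecter 20 --debug-ms 1'), but to capture the startup/replay behaviour in question here they need to be in ceph.conf before restarting.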
[ceph-users] Best Practice to Copy/Move Data Across Clusters
Hi guys,

Has anyone done copying/moving data between clusters? If yes, what are the best practices for you?

Thanks
Re: [ceph-users] Best Practice to Copy/Move Data Across Clusters
We do it with rbd volumes. We're using rbd export/import and netcat to transfer them across clusters. This was the most efficient solution that did not require one cluster to have access to the other (though it does require some way of starting the process on the different machines).

On 8/20/2014 12:49 PM, Larry Liu wrote:
> Hi guys,
>
> Has anyone done copying/moving data between clusters? If yes, what are
> the best practices for you?
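(A sketch of the export/import-over-netcat pipeline described above -- pool/image names and the port are hypothetical, and nc option syntax varies between netcat flavours:)

    # on the destination cluster: listen first, write the stream into a new image
    nc -l 5000 | rbd import - backups/vol1

    # on the source cluster: export the image to stdout and pipe it across
    rbd export rbd/vol1 - | nc dest-host 5000

For a consistent copy, snapshot the image (or stop the client using it) before exporting.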
Re: [ceph-users] Translating a RadosGW object name into a filename on disk
Looks like I need to upgrade to Firefly to get ceph-kvstore-tool before I can proceed. I am getting some hits just from grepping the LevelDB store, but so far nothing has panned out.

Thanks for the help!

On Tue, Aug 19, 2014 at 10:27 AM, Gregory Farnum <g...@inktank.com> wrote:

> It's been a while since I worked on this, but let's see what I remember...
>
> On Thu, Aug 14, 2014 at 11:34 AM, Craig Lewis <cle...@centraldesktop.com> wrote:
>> In my effort to learn more of the details of Ceph, I'm trying to figure
>> out how to get from an object name in RadosGW, through the layers, down
>> to the files on disk.
>>
>> clewis@clewis-mac ~ $ s3cmd ls s3://cpltest/
>> 2014-08-13 23:02   14M  28dde9db15fdcb5a342493bc81f91151  s3://cpltest/vmware-freebsd-tools.tar.gz
>>
>> Looking at the .rgw pool's contents tells me that the cpltest bucket is
>> default.73886.55:
>>
>> root@dev-ceph0:/var/lib/ceph/osd/ceph-0/current# rados -p .rgw ls | grep cpltest
>> cpltest
>> .bucket.meta.cpltest:default.73886.55
>
> Okay, what you're seeing here are two different types, whose names I'm
> not going to get right:
> 1) The bucket link "cpltest", which maps from the name cpltest to a
> bucket instance. The contents of cpltest, or one of its xattrs, are
> pointing at .bucket.meta.cpltest:default.73886.55
> 2) The bucket instance ".bucket.meta.cpltest:default.73886.55". I think
> this contains the bucket index (list of all objects), etc.
>
>> The rados objects that belong to that bucket are:
>>
>> root@dev-ceph0:~# rados -p .rgw.buckets ls | grep default.73886.55
>> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_1
>> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_3
>> default.73886.55_vmware-freebsd-tools.tar.gz
>> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_2
>> default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_4
>
> Okay, so when you ask RGW for the object vmware-freebsd-tools.tar.gz
> from the cpltest bucket, it will look up (or, if we're lucky, have
> cached) the cpltest link, and find out that the bucket prefix is
> default.73886.55. It will then try to access the object
> default.73886.55_vmware-freebsd-tools.tar.gz (whose construction I hope
> is obvious -- bucket instance ID as a prefix, "_" as a separator, then
> the object name). This RADOS object is called the "head" for the RGW
> object. In addition to (usually) the beginning bit of data, it will also
> contain some xattrs with things like a tag for any extra RADOS objects
> which include data for this RGW object. In this case, that tag is
> RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ. (This construction is how we do atomic
> overwrites of RGW objects which are larger than a single RADOS object,
> in addition to a few other things.)
>
> I don't think there's any way of mapping from a shadow (tail) object
> name back to its RGW name, but if you look at the rados object xattrs,
> there might (? or might not) be an attr which contains the parent object
> in one form or another. Check that out. (Or, if you want to check out
> the source, I think all the relevant bits for this are somewhere in the
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>> I know those shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_ files are the rest
>> of vmware-freebsd-tools.tar.gz. I can infer that because this bucket only
>> has a single file (and the sum of the sizes matches). With many files, I
>> can't infer the link anymore. How do I look up that link? I tried reading
>> src/rgw/rgw_rados.cc, but I'm getting lost.
>>
>> My real goal is the reverse. I recently repaired an inconsistent PG. The
>> primary replica had the bad data, so I want to verify that the repaired
>> object is correct. I have a database that stores the SHA256 of every
>> object. If I can get from the filename on disk back to an S3 object, I
>> can verify the file. If it's bad, I can restore from the replicated zone.
>>
>> Aside from today's task, I think it's really handy to understand these
>> low-level details. I know it's been handy in the past, when I had disk
>> corruption under my PostgreSQL database. Knowing (and practicing) ahead
>> of time really saved me a lot of downtime then.
>>
>> Thanks for any pointers.
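(A sketch of the xattr inspection Greg suggests. The exact attr names -- e.g. user.rgw.manifest -- vary by version and are an assumption here, so start from listxattr rather than trusting the name:)

    # list the xattrs on a head or shadow object
    rados -p .rgw.buckets listxattr default.73886.55_vmware-freebsd-tools.tar.gz

    # dump one of them for inspection (pick a real name from the list above)
    rados -p .rgw.buckets getxattr default.73886.55_vmware-freebsd-tools.tar.gz user.rgw.manifest > manifest.bin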
Re: [ceph-users] Translating a RadosGW object name into a filename on disk
On Wed, 20 Aug 2014, Craig Lewis wrote:
> Looks like I need to upgrade to Firefly to get ceph-kvstore-tool before I
> can proceed. I am getting some hits just from grepping the LevelDB store,
> but so far nothing has panned out.

FWIW if you just need the tool, you can wget the .deb and 'dpkg -x foo.deb /tmp/whatever' and grab the binary from there.

sage

> Thanks for the help!
>
> On Tue, Aug 19, 2014 at 10:27 AM, Gregory Farnum <g...@inktank.com> wrote:
> [snip]
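(Concretely, something like the following -- the package URL/version is an assumption, browse the firefly repo for the right one, and depending on the release ceph-kvstore-tool may live in the ceph or ceph-test package:)

    wget http://ceph.com/debian-firefly/pool/main/c/ceph/ceph_0.80.5-1trusty_amd64.deb
    dpkg -x ceph_0.80.5-1trusty_amd64.deb /tmp/ceph-firefly
    # with the osd stopped, point the tool at its LevelDB omap directory
    /tmp/ceph-firefly/usr/bin/ceph-kvstore-tool /var/lib/ceph/osd/ceph-0/current/omap list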
Re: [ceph-users] Serious performance problems with small file writes
Hugo,

I would look at setting up a cache pool made of 4-6 ssds to start with. So, if you have 6 osd servers, stick at least 1 ssd disk in each server for the cache pool. It should greatly reduce the osds' stress of writing a large number of small files. Your cluster should become more responsive and the end users' experience should also improve. (A sketch of the commands involved follows this message.)

I am planning on doing so in the near future, but according to my friend's experience, introducing a cache pool has greatly increased the overall performance of his cluster and has removed the performance issues that he was having during scrubbing/deep-scrubbing/recovery activities.

The size of your working data set should determine the size of the cache pool, but in general it will create a nice speedy buffer between your clients and those terribly slow spindles.

Andrei

----- Original Message -----
From: Hugo Mills <h.r.mi...@reading.ac.uk>
To: Dan Van Der Ster <daniel.vanders...@cern.ch>
Cc: Ceph Users List <ceph-users@lists.ceph.com>
Sent: Wednesday, 20 August, 2014 4:54:28 PM
Subject: Re: [ceph-users] Serious performance problems with small file writes

> Hi, Dan,
>
> Some questions below I can't answer immediately, but I'll spend tomorrow
> morning irritating people by triggering these events (I think I have a
> reproducer -- unpacking a 1.2 GiB tarball with 25 small files in it) and
> giving you more details.
> [snip]
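(For anyone trying this, the firefly-era cache-tiering commands look roughly like the following -- pool names, PG count, and the size cap are hypothetical, and the cache pool must first be mapped to the SSDs via a CRUSH rule:)

    # create the SSD-backed pool, then attach it as a writeback cache
    # in front of the existing data pool
    ceph osd pool create cachepool 512
    ceph osd tier add cephfs-data cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay cephfs-data cachepool
    ceph osd pool set cachepool hit_set_type bloom
    ceph osd pool set cachepool target_max_bytes 500000000000   # ~500 GB cap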