Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
Thanks! Just so I understand correctly, the btrfs snapshots are mainly useful if the journals are on the same disk as the OSD, right? Is it indeed safe to turn them off if the journals are on a separate SSD?

Kind regards,
Erik.

On 22-06-15 20:18, Krzysztof Nowicki wrote:
On Mon, 22.06.2015 at 20:09, Lionel Bouton <lionel-subscript...@bouton.name> wrote:
On 06/22/15 17:21, Erik Logtenberg wrote:
I have the journals on a separate disk too. How do you disable the snapshotting on the OSD?

http://ceph.com/docs/master/rados/configuration/filestore-config-ref/ :

filestore btrfs snap = false

Once this is done and verified working (after a restart of the OSD), make sure to remove the now-unnecessary snapshots (snap_xxx) from the OSD filesystem, as failing to do so will cause an increase of occupied space over time (old and unneeded versions of objects will remain stored). This can be done by running:

sudo btrfs subvolume delete /var/lib/ceph/osd/ceph-xx/snap_yy

To verify that the option change is effective, you can observe the 'snap_xxx' directories: after disabling snapshotting, their revision number should not increase any more.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
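The cleanup step above (deleting every stale snap_NNN subvolume after disabling `filestore btrfs snap`) can be scripted. Below is a minimal sketch, not an official tool, that scans an OSD directory and prints the `btrfs subvolume delete` commands to run; the directory layout matches the snap_NNN naming from the post.

```python
import os
import re

def stale_snap_cleanup_cmds(osd_path):
    """Return one `btrfs subvolume delete` command per snap_NNN subvolume.

    After `filestore btrfs snap = false` takes effect, all remaining
    snap_NNN entries in the OSD directory are stale and safe to remove.
    """
    cmds = []
    for name in sorted(os.listdir(osd_path)):
        if re.fullmatch(r"snap_\d+", name):
            cmds.append("btrfs subvolume delete %s" % os.path.join(osd_path, name))
    return cmds

if __name__ == "__main__":
    for cmd in stale_snap_cleanup_cmds("/var/lib/ceph/osd/ceph-5"):
        print(cmd)
```

Review the printed commands before executing them; only run this once the OSD has been restarted with snapshotting disabled.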
Re: [ceph-users] What do internal_safe_to_start_threads and leveldb_compression do?
> What does this do?
> - leveldb_compression: false (default: true)
> - leveldb_block/cache/write_buffer_size (all bigger than default)

I take it you're running these commands on a monitor (from, I think, the Dumpling timeframe, or maybe even Firefly)? These are hitting specific settings in LevelDB which we tune differently for the monitor and OSD, but which were shared config options in older releases. They have their own settings in newer code.
-Greg

You are correct. I started out with Firefly and gradually upgraded the cluster as new releases came out. I am on Hammer (0.94.1) now. The current settings are different from the default. Does this mean that the settings are still Firefly-like and should be changed to the new default, or does this mean that the defaults are still Firefly-like but the settings are actually Hammer-style ;) and thus right?

Thanks,
Erik.
[ceph-users] What do internal_safe_to_start_threads and leveldb_compression do?
Hi,

I ran a config diff, like this:

ceph --admin-daemon (...).asok config diff

There are the obvious things like the fsid and IP ranges, but some settings stand out:

- internal_safe_to_start_threads: true (default: false)
- leveldb_compression: false (default: true)
- leveldb_block/cache/write_buffer_size (all bigger than default)

Any idea what these settings do and whether it is a good idea to have them differ from the default?

Thanks,
Erik.
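For readers unfamiliar with what `config diff` reports: it compares the daemon's running configuration against the built-in defaults and shows only the keys that differ. A toy sketch of that comparison (the dictionaries here are illustrative, not the real option set):

```python
def config_diff(current, defaults):
    """Rough mimic of `ceph --admin-daemon <sock> config diff`: report every
    setting whose running value differs from the built-in default."""
    return {k: {"current": v, "default": defaults.get(k)}
            for k, v in current.items() if defaults.get(k) != v}

defaults = {"leveldb_compression": True,
            "internal_safe_to_start_threads": False,
            "fsid": ""}
running = {"leveldb_compression": False,
           "internal_safe_to_start_threads": True,
           "fsid": "d1b9b8a2-0000-0000-0000-000000000000"}

diff = config_diff(running, defaults)
```

Settings that a daemon changed at runtime (like internal_safe_to_start_threads) show up in this diff just like operator overrides do, which is why the output can be surprising.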
[ceph-users] Mount options nodcache and nofsc
Hi,

Can anyone explain what the mount options nodcache and nofsc are for, and especially why you would want to turn these options on or off? What are the pros and cons either way?

Thanks,
Erik.
[ceph-users] Interesting re-shuffling of pg's after adding new osd
Hi,

Two days ago I added a new osd to one of my ceph machines, because one of the existing osd's got rather full. There was quite a difference in disk space usage between osd's, but I understand this is just how ceph works: it spreads data over osd's, but not perfectly evenly.

Now check out the graph of free disk space. You can clearly see the new 4TB osd being added and how it starts to fill up. It's also quite visible that some existing osd's profit more than others. And not only is data put onto the new osd, data is also exchanged between existing osd's. This is why it takes so incredibly long to fill the new osd up: ceph is spending most of its time shuffling data around between existing osd's instead of moving it to the new one.

Anyway, what is especially troubling is that the osd that was already lowest on disk space is actually filling up even more during this process (!). What's causing that, and how can I get ceph to do the reasonable thing? All crush weights are identical.

Thanks,
Erik.
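The movement between existing osd's is a property of CRUSH-style placement. As a hedged illustration (this is plain highest-random-weight hashing, a much simpler relative of CRUSH's straw buckets, not Ceph's actual algorithm), the sketch below shows that even an "ideal" hash-based placement moves roughly 1/n of the objects when a device is added; CRUSH additionally rebalances between existing devices when the hierarchy changes, which is the extra shuffling observed in the post:

```python
import hashlib

def rendezvous_pick(obj_id, osds):
    # Highest-random-weight hashing: the object lands on the OSD with the
    # largest hash(object, osd) score. CRUSH straw buckets work in the same
    # spirit, though the real algorithm is considerably richer.
    return max(osds, key=lambda o: hashlib.md5(b"%d:%d" % (obj_id, o)).digest())

objects = range(20000)
before = {o: rendezvous_pick(o, [0, 1, 2]) for o in objects}
after = {o: rendezvous_pick(o, [0, 1, 2, 3]) for o in objects}
moved = [o for o in objects if before[o] != after[o]]
# With pure HRW, every moved object goes to the new OSD (id 3) and
# roughly a quarter of all objects move; nothing shifts between 0, 1, 2.
```

That the data is "not perfectly even" is also expected: hash-based placement is statistical, so per-OSD usage varies even with identical crush weights.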
Re: [ceph-users] CephFS unexplained writes
It does sound contradictory: why would read operations in cephfs result in writes to disk? But they do. I upgraded to Hammer last week and I am still seeing this.

The setup is as follows:
- EC pool on hdd's for data
- replicated pool on ssd's for data-cache
- replicated pool on ssd's for meta-data

Now whenever I start doing heavy reads on cephfs, I see intense bursts of write operations on the hdd's. The reads I'm doing are things like reading a large file (streaming a video), or running a big rsync job with --dry-run (so it just checks meta-data). No clue why that would have any effect on the hdd's, but it does.

Now, to further figure out what's going on, I tried using lsof, atop and iotop, but those tools don't provide the necessary information. In lsof I just see a whole bunch of files opened at any time, but it doesn't change much during these tests. In atop and iotop I can clearly see that the hdd's are doing a lot of writes when I'm reading in cephfs, but those tools can't tell me what those writes are. So I tried strace, which can trace file operations and attach to running processes:

# strace -f -e trace=file -p 5076

This gave me an idea of what was going on. 5076 is the process id of the osd for one of the hdd's. I saw mostly stat's and open's, but those are all reads, not writes. Of course btrfs can cause writes when doing reads (atime), but I have the osd mounted with noatime.
The only write operations that I saw a lot of are these:

[pid  5350] getxattr("/var/lib/ceph/osd/ceph-10/current/4.1es1_head/DIR_E/DIR_1/DIR_D/DIR_3", "user.cephos.phash.contents", "\1Q\0\0\0\0\0\0\0\0\0\0\0\4\0\0", 1024) = 17
[pid  5350] setxattr("/var/lib/ceph/osd/ceph-10/current/4.1es1_head/DIR_E/DIR_1/DIR_D/DIR_3", "user.cephos.phash.contents", "\1R\0\0\0\0\0\0\0\0\0\0\0\4\0\0", 17, 0) = 0
[pid  5350] removexattr("/var/lib/ceph/osd/ceph-10/current/4.1es1_head/DIR_E/DIR_1/DIR_D/DIR_3", "user.cephos.phash.contents@1") = -1 ENODATA (No data available)

So it appears that the osd's aren't writing actual data to disk, but metadata in the form of xattr's. Can anyone explain what this setting and removing of xattr's could be for?

Kind regards,
Erik.

On 03/16/2015 10:44 PM, Gregory Farnum wrote:
The information you're giving sounds a little contradictory, but my guess is that you're seeing the impacts of object promotion and flushing. You can sample the operations the OSDs are doing at any given time by running the ops_in_progress (or similar, I forget the exact phrasing) command on the OSD admin socket. I'm not sure if "rados df" is going to report cache movement activity or not. That though would mostly be written to the SSDs, not the hard drives, although the hard drives could still get metadata updates written when objects are flushed.

What data exactly are you seeing that's leading you to believe writes are happening against these drives? What is the exact CephFS and cache pool configuration?
-Greg

On Mon, Mar 16, 2015 at 2:36 PM, Erik Logtenberg <e...@logtenberg.eu> wrote:
Hi, I forgot to mention: while I am seeing these writes in iotop and /proc/diskstats for the hdd's, I am -not- seeing any writes in "rados df" for the pool residing on these disks. There is only one pool active on the hdd's, and according to "rados df" it is getting zero writes when I'm just reading big files from cephfs. So apparently the osd's are doing some non-trivial amount of writing on their own behalf. What could it be?
Thanks,
Erik.

On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
Hi, I am getting relatively bad performance from cephfs. I use a replicated cache pool on ssd in front of an erasure coded pool on rotating media. When reading big files (streaming video), I see a lot of disk i/o, especially writes. I have no clue what could cause these writes. The writes are going to the hdd's and they stop when I stop reading. I mounted everything with noatime and nodiratime, so it shouldn't be that. On a related note, the Cephfs metadata is stored on ssd too, so metadata-related changes shouldn't hit the hdd's anyway, I think. Any thoughts? How can I get more information about what ceph is doing? Using iotop I only see that the osd processes are busy, but it doesn't give many hints as to what they are doing. Thanks, Erik.
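When digging through strace output like the capture above, it helps to tally which xattr syscalls are reads and which are writes. A small hedged sketch (the trace lines below are abbreviated stand-ins for the real ones, with paths shortened):

```python
import re
from collections import Counter

TRACE = '''\
[pid  5350] getxattr("/var/lib/ceph/osd/ceph-10/DIR_3", "user.cephos.phash.contents", ..., 1024) = 17
[pid  5350] setxattr("/var/lib/ceph/osd/ceph-10/DIR_3", "user.cephos.phash.contents", ..., 17, 0) = 0
[pid  5350] removexattr("/var/lib/ceph/osd/ceph-10/DIR_3", "user.cephos.phash.contents@1") = -1 ENODATA
'''

def tally_xattr_ops(trace):
    """Tally xattr syscalls in `strace -f -e trace=file` output.

    setxattr and removexattr modify on-disk metadata (writes); getxattr is
    only a read. A high setxattr count on an OSD that should be idle for
    writes points at internal metadata bookkeeping, as described above.
    """
    return Counter(m.group(1)
                   for m in re.finditer(r"\b(getxattr|setxattr|removexattr)\(", trace))
```

Running the tally over a longer capture makes it easy to see whether the write traffic is dominated by these directory-xattr updates or by actual object writes.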
[ceph-users] CephFS unexplained writes
Hi,

I am getting relatively bad performance from cephfs. I use a replicated cache pool on ssd in front of an erasure coded pool on rotating media. When reading big files (streaming video), I see a lot of disk i/o, especially writes. I have no clue what could cause these writes. The writes are going to the hdd's and they stop when I stop reading. I mounted everything with noatime and nodiratime, so it shouldn't be that. On a related note, the Cephfs metadata is stored on ssd too, so metadata-related changes shouldn't hit the hdd's anyway, I think.

Any thoughts? How can I get more information about what ceph is doing? Using iotop I only see that the osd processes are busy, but it doesn't give many hints as to what they are doing.

Thanks,
Erik.
Re: [ceph-users] CephFS unexplained writes
Hi,

I forgot to mention: while I am seeing these writes in iotop and /proc/diskstats for the hdd's, I am -not- seeing any writes in "rados df" for the pool residing on these disks. There is only one pool active on the hdd's, and according to "rados df" it is getting zero writes when I'm just reading big files from cephfs. So apparently the osd's are doing some non-trivial amount of writing on their own behalf. What could it be?

Thanks,
Erik.

On 03/16/2015 10:26 PM, Erik Logtenberg wrote:
Hi, I am getting relatively bad performance from cephfs. I use a replicated cache pool on ssd in front of an erasure coded pool on rotating media. When reading big files (streaming video), I see a lot of disk i/o, especially writes. I have no clue what could cause these writes. The writes are going to the hdd's and they stop when I stop reading. I mounted everything with noatime and nodiratime, so it shouldn't be that. On a related note, the Cephfs metadata is stored on ssd too, so metadata-related changes shouldn't hit the hdd's anyway, I think. Any thoughts? How can I get more information about what ceph is doing? Using iotop I only see that the osd processes are busy, but it doesn't give many hints as to what they are doing. Thanks, Erik.
Re: [ceph-users] Crush Map and SSD Pools
Hi Lindsay,

Actually you just set up two entries for each host in your crush map: one for the hdd's and one for the ssd's. My osd's look like this:

# id    weight  type name               up/down reweight
-6      1.8     root ssd
-7      0.45            host ceph-01-ssd
0       0.45                    osd.0   up      1
-8      0.45            host ceph-02-ssd
3       0.45                    osd.3   up      1
-9      0.45            host ceph-03-ssd
8       0.45                    osd.8   up      1
-10     0.45            host ceph-04-ssd
11      0.45                    osd.11  up      1
-1      29.12   root default
-2      7.28            host ceph-01
1       3.64                    osd.1   up      1
2       3.64                    osd.2   up      1
-3      7.28            host ceph-02
5       3.64                    osd.5   up      1
4       3.64                    osd.4   up      1
-4      7.28            host ceph-03
6       3.64                    osd.6   up      1
7       3.64                    osd.7   up      1
-5      7.28            host ceph-04
10      3.64                    osd.10  up      1
9       3.64                    osd.9   up      1

As you can see, I have four hosts: ceph-01 ... ceph-04, but eight host entries. This works great.

Regards,
Erik.

On 30-12-14 15:13, Lindsay Mathieson wrote:
I looked at the section for setting up different pools with different OSD's (e.g. an SSD pool):

http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds

And it seems to make the assumption that the ssd's and platters all live on separate hosts. That's not the case at all for my setup, and I imagine not for most people: I have ssd's mixed with platters on the same hosts.
In that case, should one have the root buckets referencing buckets not based on hosts? E.g. something like this:

# devices
# Platters
device 0 osd.0
device 1 osd.1
# SSD
device 2 osd.2
device 3 osd.3

host vnb {
        id -2           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0          # rjenkins1
        item osd.0 weight 1.000
        item osd.2 weight 1.000
}
host vng {
        id -3           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0          # rjenkins1
        item osd.1 weight 1.000
        item osd.3 weight 1.000
}
row disk-platter {
        alg straw
        hash 0          # rjenkins1
        item osd.0 weight 1.000
        item osd.1 weight 1.000
}
row disk-ssd {
        alg straw
        hash 0          # rjenkins1
        item osd.2 weight 1.000
        item osd.3 weight 1.000
}
root default {
        id -1           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0          # rjenkins1
        item disk-platter weight 2.000
}
root ssd {
        id -4
        alg straw
        hash 0
        item disk-ssd weight 2.000
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule ssd {
        ruleset 1
        type replicated
        min_size 0
        max_size 4
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}

--
Lindsay
Re: [ceph-users] Cache tiers flushing logic
Hi Erik,

I have tiering working on a couple of test clusters. It seems to be working with Ceph v0.90 when I set:

ceph osd pool set POOL hit_set_type bloom
ceph osd pool set POOL hit_set_count 1
ceph osd pool set POOL hit_set_period 3600
ceph osd pool set POOL cache_target_dirty_ratio .5
ceph osd pool set POOL cache_target_full_ratio .9

Eric

Hi Eric,

You say it "seems to be working". My setup also seems to be working, in the sense that the pools can be written to and read from. However, the cache flushing doesn't work as expected. Do you mean that all objects in your cache are flushed during idle time?

Thanks,
Erik.
Re: [ceph-users] Crush Map and SSD Pools
No, bucket names in a crush map are completely arbitrary. In fact, crush doesn't really know what a host is. It is just a bucket, like rack or datacenter. They could be called cat and mouse just as well. The only reason to use host names is human readability.

You can then use crush rules to make sure that, for instance, two copies of some object are not on the same host, or in the same rack, or in the same whatever bucket you like. This way you can define your failure domains in correspondence with your physical layout.

Kind regards,
Erik.

On 12/30/2014 10:18 PM, Lindsay Mathieson wrote:
On Tue, 30 Dec 2014 04:18:07 PM Erik Logtenberg wrote:
As you can see, I have four hosts: ceph-01 ... ceph-04, but eight host entries. This works great.

You have:
- host ceph-01
- host ceph-01-ssd

Don't the host names have to match the real host names?
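The failure-domain idea described here (no two replicas in the same bucket) can be sketched in a few lines. This is a toy model of what a `step chooseleaf firstn 0 type host` rule expresses, not CRUSH itself; the osd-to-host layout is hypothetical:

```python
import random

def place_replicas(osd_to_bucket, n, seed=1):
    """Pick n OSDs such that no two share a failure-domain bucket.

    Mirrors the guarantee of a crush rule that chooses leaves of type
    `host`: replicas land on distinct buckets, whatever the buckets are
    named (hosts, racks, or "cat" and "mouse").
    """
    rng = random.Random(seed)
    chosen, used = [], set()
    for osd in rng.sample(sorted(osd_to_bucket), len(osd_to_bucket)):
        bucket = osd_to_bucket[osd]
        if bucket not in used:
            chosen.append(osd)
            used.add(bucket)
        if len(chosen) == n:
            break
    return chosen

layout = {"osd.1": "ceph-01", "osd.2": "ceph-01",
          "osd.4": "ceph-02", "osd.5": "ceph-02",
          "osd.6": "ceph-03", "osd.7": "ceph-03"}
replicas = place_replicas(layout, 3)
```

Because the bucket names are opaque strings, the same function works unchanged whether the buckets model hosts, racks, or datacenters, which is exactly the point made above.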
Re: [ceph-users] Crush Map and SSD Pools
If you want to be able to start your osd's with the /etc/init.d/ceph init script, then you'd better make sure that /etc/ceph/ceph.conf does link the osd's to the actual hostname :) Check out this snippet from my ceph.conf:

[osd.0]
host = ceph-01
osd crush location = host=ceph-01-ssd root=ssd

[osd.1]
host = ceph-01

[osd.2]
host = ceph-01

You see, all osd's are linked to the right hostname, but the ssd osd is then explicitly set to go into the right crush location too.

Kind regards,
Erik.

On 12/30/2014 11:11 PM, Lindsay Mathieson wrote:
On Tue, 30 Dec 2014 10:38:14 PM Erik Logtenberg wrote:
No, bucket names in crush map are completely arbitrary. In fact, crush doesn't really know what a host is. It is just a bucket, like rack or datacenter. But they could be called cat and mouse just as well.

Hmmm, I tried that earlier and ran into problems with starting/stopping the osd, but maybe I screwed something else up. Will give it another go.
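To sanity-check a ceph.conf like the snippet above, one can parse the `[osd.N]` sections and derive the crush location each OSD will register under. This is a hedged sketch: the "host=<host> root=default" fallback for OSDs without an explicit `osd crush location` is an assumption about the usual default placement, not taken from the post.

```python
import configparser

CEPH_CONF = """
[osd.0]
host = ceph-01
osd crush location = host=ceph-01-ssd root=ssd

[osd.1]
host = ceph-01
"""

def crush_locations(conf_text):
    """Map each [osd.N] section to its effective crush location.

    Uses the explicit `osd crush location` when present; otherwise assumes
    the OSD lands under its host bucket in the default root.
    """
    cp = configparser.ConfigParser()
    cp.read_string(conf_text)
    locs = {}
    for sec in cp.sections():
        raw = cp[sec].get("osd crush location",
                          "host=%s root=default" % cp[sec]["host"])
        locs[sec] = dict(part.split("=", 1) for part in raw.split())
    return locs
```

Running this over a full ceph.conf makes it easy to spot an SSD OSD that was accidentally left pointing at the default root.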
[ceph-users] Online converting of pool type
Hi,

Every now and then someone asks if it's possible to convert a pool to a different type (replicated vs erasure / change the amount of pg's / etc), but this is not supported. The advised approach is usually to just create a new pool and somehow copy all data manually to this new pool, removing the old pool afterwards. This is both unpractical and very time consuming.

Recently I saw someone on this list suggest that the cache tiering feature may actually be used to achieve some form of online converting of pool types. Today I ran some tests and I would like to share my results.

I started out with a pool test-A, created an rbd image in the pool, mapped it, created a filesystem in the rbd image, mounted the fs and placed some test files in the fs. Just to have some objects in the test-A pool. I then added a test-B pool and transferred the data using cache tiering as follows:

Step 0: We have a test-A pool and it contains data, some of which is in use.

# rados -p test-A df
test-A - 9941 110 0 0 324 2404 57 4717

Step 1: Create new pool test-B.

# ceph osd pool create test-B 32
pool 'test-B' created

Step 2: Make pool test-A a cache pool for test-B.

# ceph osd tier add test-B test-A --force-nonempty
# ceph osd tier cache-mode test-A forward

Step 3: Move data from test-A to test-B (this potentially takes long).

# rados -p test-A cache-flush-evict-all

This step will move all data except the objects that are in active use, so we are left with some remaining data on the test-A pool.

Step 4: Move the remaining data too. This is the only step that doesn't work online.
Step 4a: Disconnect all clients.

# rbd unmap /dev/rbd/test-A/test-rbd (in my case)

Step 4b: Move remaining objects.

# rados -p test-A cache-flush-evict-all
# rados -p test-A ls (should now be empty)

Step 5: Remove test-A as cache pool.

# ceph osd tier remove test-B test-A

Step 6: Clients are allowed to connect to the test-B pool (we are back in online mode).

# rbd map test-B/test-rbd (in my case)

Step 7: Remove the now empty pool test-A.

# ceph osd pool delete test-A test-A --yes-i-really-really-mean-it

This worked smoothly. In my first try I actually used more steps, by creating
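Why step 4 is needed can be modelled in a few lines. The sketch below is a toy simulation of flush-evict with the tier in forward mode, not the RADOS implementation: objects are flushed to the base pool and evicted unless a client still holds them, and those leftovers are exactly what forces the brief offline window. Pool contents and object names are illustrative.

```python
def cache_flush_evict_all(cache_pool, base_pool, in_use):
    """Toy model of `rados -p test-A cache-flush-evict-all`.

    Each object is flushed to the base pool and evicted from the cache,
    except objects still held open by clients, which stay behind. Returns
    the objects remaining in the cache pool.
    """
    remaining = {}
    for name, data in cache_pool.items():
        if name in in_use:
            remaining[name] = data  # can't evict while a client holds it
        else:
            base_pool[name] = data  # flushed to the base tier
    return remaining

test_a = {"rb.0.x.1": b"chunk1", "rb.0.x.2": b"chunk2", "rbd_header": b"hdr"}
test_b = {}
remaining = cache_flush_evict_all(test_a, test_b, in_use={"rbd_header"})
```

After disconnecting the clients (step 4a), `in_use` becomes empty, a second pass drains the cache completely, and the tier can be removed.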
Re: [ceph-users] Online converting of pool type
Whoops, I accidentally sent my mail before it was finished. Anyway, I have some more testing to do, especially with converting between erasure and replicated pools. But it looks promising.

Thanks,
Erik.

On 23-12-14 16:57, Erik Logtenberg wrote:
Hi,

Every now and then someone asks if it's possible to convert a pool to a different type (replicated vs erasure / change the amount of pg's / etc), but this is not supported. The advised approach is usually to just create a new pool and somehow copy all data manually to this new pool, removing the old pool afterwards. This is both unpractical and very time consuming.

Recently I saw someone on this list suggest that the cache tiering feature may actually be used to achieve some form of online converting of pool types. Today I ran some tests and I would like to share my results.

I started out with a pool test-A, created an rbd image in the pool, mapped it, created a filesystem in the rbd image, mounted the fs and placed some test files in the fs. Just to have some objects in the test-A pool. I then added a test-B pool and transferred the data using cache tiering as follows:

Step 0: We have a test-A pool and it contains data, some of which is in use.

# rados -p test-A df
test-A - 9941 110 0 0 324 2404 57 4717

Step 1: Create new pool test-B.

# ceph osd pool create test-B 32
pool 'test-B' created

Step 2: Make pool test-A a cache pool for test-B.

# ceph osd tier add test-B test-A --force-nonempty
# ceph osd tier cache-mode test-A forward

Step 3: Move data from test-A to test-B (this potentially takes long).

# rados -p test-A cache-flush-evict-all

This step will move all data except the objects that are in active use, so we are left with some remaining data on the test-A pool.

Step 4: Move the remaining data too. This is the only step that doesn't work online.
Step 4a: Disconnect all clients.

# rbd unmap /dev/rbd/test-A/test-rbd (in my case)

Step 4b: Move remaining objects.

# rados -p test-A cache-flush-evict-all
# rados -p test-A ls (should now be empty)

Step 5: Remove test-A as cache pool.

# ceph osd tier remove test-B test-A

Step 6: Clients are allowed to connect to the test-B pool (we are back in online mode).

# rbd map test-B/test-rbd (in my case)

Step 7: Remove the now empty pool test-A.

# ceph osd pool delete test-A test-A --yes-i-really-really-mean-it

This worked smoothly. In my first try I actually used more steps, by creating
[ceph-users] Tip of the week: don't use Intel 530 SSD's for journals
If you are like me, you have the journals for your OSD's with rotating media stored separately on an SSD. If you are even more like me, you happen to use Intel 530 SSD's in some of your hosts. If so, please do check your S.M.A.R.T. statistics regularly, because these SSD's really can't cope with Ceph.

Check out the media-wear graphs for the two Intel 530's in my cluster. As soon as those declining lines get down to 30% or so, they need to be replaced. That means less than half a year between purchase and end-of-life :(

Tip of the week: keep an eye on those statistics, don't let a failing SSD surprise you.

Erik.
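Projecting when a journal SSD will hit that 30% threshold is simple arithmetic over two SMART media-wearout readings. A hedged sketch (the linear-wear assumption and the example numbers are mine; real wear depends on the write pattern):

```python
def days_until_replacement(earlier_pct, later_pct, days_between, floor_pct=30):
    """Linear projection of the SMART media-wearout indicator.

    Given two readings of percent life remaining taken `days_between` days
    apart, estimate how many days remain until the indicator reaches
    `floor_pct`, the level at which the post suggests replacing the drive.
    """
    rate = (earlier_pct - later_pct) / days_between  # percent lost per day
    if rate <= 0:
        return float("inf")  # no measurable wear between the two readings
    return (later_pct - floor_pct) / rate

# Hypothetical example: 100% at purchase, 60% left 120 days later.
days_left = days_until_replacement(100, 60, 120)
```

Wiring this into a monitoring check (smartctl output parsed on each host) turns the "don't let a failing SSD surprise you" tip into an alert.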
[ceph-users] How to mount cephfs from fstab
Hi,

I would like to mount a cephfs share from fstab, but it doesn't completely work. First of all, I followed the documentation [1], which resulted in the following line in fstab:

ceph-01:6789:/ /mnt/cephfs/ ceph name=testhost,secretfile=/root/testhost.key,noacl 0 2

Yes, this works when I manually run "mount /mnt/cephfs", but it does give me the following error/warning:

mount: error writing /etc/mtab: Invalid argument

Now, even though this error doesn't influence the mounting itself, it does prohibit my machine from booting right. Apparently Fedora/systemd doesn't like this error when going through fstab, so booting is not possible. The mtab issue can easily be worked around by calling mount manually and using the -n (--no-mtab) argument, like this:

mount -t ceph -n ceph-01:6789:/ /mnt/cephfs/ -o name=testhost,secretfile=/root/testhost.key,noacl

However, I can't find a way to put that -n option in /etc/fstab itself (since it's not a -o option). Currently, I have the noauto setting in fstab, so it doesn't get mounted at boot at all. Then I have to manually log in and run "mount /mnt/cephfs" to explicitly mount the share. Far from ideal.

So, how do my fellow cephfs users do this?

Thanks,
Erik.

[1] http://ceph.com/docs/giant/cephfs/fstab/
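For reference, the fstab entry above decomposes into a monitor address, mount point, filesystem type, and a comma-joined option list. A small sketch that assembles such a line from its parts (the helper name and its parameters are mine, purely illustrative):

```python
def cephfs_fstab_line(mon_addr, mountpoint, name, secretfile, extra_opts=()):
    """Assemble a cephfs fstab entry of the shape the ceph docs describe:
    <mon>:/ <mountpoint> ceph <options> 0 2
    """
    opts = ["name=%s" % name, "secretfile=%s" % secretfile, *extra_opts]
    return "%s:/ %s ceph %s 0 2" % (mon_addr, mountpoint, ",".join(opts))

line = cephfs_fstab_line("ceph-01:6789", "/mnt/cephfs/", "testhost",
                         "/root/testhost.key", ("noacl",))
```

Generating the line programmatically (e.g. from a config management tool) avoids typos in the option list, which the mount helper tends to report only as cryptic "Invalid argument" errors.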
[ceph-users] How to add/remove/move an MDS?
Hi,

I noticed that the docs [1] on adding and removing an MDS are not yet written...

[1] https://ceph.com/docs/master/rados/deployment/ceph-deploy-mds/

I would like to do exactly that, however. I have an MDS on one machine, but I'd like a faster machine to take over instead. In fact, it would be great to make an active/standby configuration (at least as long as multi-master is not supported). How to do this?

By the way, this is a manually deployed cluster; no ceph-deploy used.

Thanks,
Erik.
Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier
I think I might be running into the same issue. I'm using Giant though. A lot of slow writes. My thoughts went to: the OSD's get too much work to do (commodity hardware), so I'll have to do some performance tuning to limit parallelism a bit. And indeed, limiting the amount of threads for different tasks reduced some of the load, but I keep getting slow writes very often, especially if the load is coming from CephFS (which is the only thing I use a cache tier for).

To answer your question: no, it's not yet production, and it's not suited for production currently either. In my case the slow writes keep stacking up until OSD's commit suicide, and then the recovery process adds even further to the load of the remaining OSD's, causing a chain reaction in which other OSD's also kill themselves. Non-optimal performance could in my case be acceptable for semi-production, but stability is essential. So I hope these issues can be fixed.

Kind regards,
Erik.

On 17-11-14 17:45, Laurent GUERBY wrote:
Hi,

Just a follow-up on this issue; we're probably hitting:

http://tracker.ceph.com/issues/9285

We had the issue a few weeks ago with a replicated SSD pool in front of a rotational pool and turned off cache tiering. Yesterday we made a new test, and activating cache tiering on a single erasure pool threw the whole ceph cluster performance to the floor (including non-cached, non-erasure-coded pools) with frequent slow writes in the logs. Removing cache tiering was enough to go back to normal performance.

I assume no one uses cache tiering on 0.80.7 in production clusters?

Sincerely,
Laurent

On Sunday, 09 November 2014 at 00:24 +0100, Loic Dachary wrote:
On 09/11/2014 00:03, Gregory Farnum wrote:
It's all about the disk accesses. What's the slow part when you dump historic and in-progress ops?
This is what I see on g1 (6% iowait):

root@g1:~# ceph daemon osd.0 dump_ops_in_flight
{ "num_ops": 0,
  "ops": []}
root@g1:~# ceph daemon osd.0 dump_ops_in_flight
{ "num_ops": 1,
  "ops": [
        { "description": "osd_op(client.4407100.0:11030174 rb.0.410809.238e1f29.1038 [set-alloc-hint object_size 4194304 write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write e15613)",
          "received_at": "2014-11-09 00:14:17.385256",
          "age": "0.538802",
          "duration": "0.011955",
          "type_data": [
                "waiting for sub ops",
                { "client": "client.4407100",
                  "tid": 11030174},
                [ { "time": "2014-11-09 00:14:17.385393",
                    "event": "waiting_for_osdmap"},
                  { "time": "2014-11-09 00:14:17.385563",
                    "event": "reached_pg"},
                  { "time": "2014-11-09 00:14:17.385793",
                    "event": "started"},
                  { "time": "2014-11-09 00:14:17.385807",
                    "event": "started"},
                  { "time": "2014-11-09 00:14:17.385875",
                    "event": "waiting for subops from 1,10"},
                  { "time": "2014-11-09 00:14:17.386201",
                    "event": "commit_queued_for_journal_write"},
                  { "time": "2014-11-09 00:14:17.386336",
                    "event": "write_thread_in_journal_buffer"},
                  { "time": "2014-11-09 00:14:17.396293",
                    "event": "journaled_completion_queued"},
                  { "time": "2014-11-09 00:14:17.396332",
                    "event": "op_commit"},
                  { "time": "2014-11-09 00:14:17.396678",
                    "event": "op_applied"},
                  { "time": "2014-11-09 00:14:17.397211",
                    "event": "sub_op_commit_rec"}]]}]}

and it looks ok. When I go to n7, which has 20% iowait, I see a much larger output, http://pastebin.com/DPxsaf6z , which includes a number of "event": "waiting_for_osdmap" entries. I'm not sure what to make of this, and it would certainly be better if n7 had a lower iowait. Also, when I run "ceph -w" I see a new pgmap being created every second, which is also not a good sign.
2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 3889 B/s rd, 2125 kB/s wr, 237 op/s
2014-11-09 00:22:48.143412 mon.0 [INF] pgmap v4389614: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1586 kB/s wr, 204 op/s
2014-11-09 00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 kB/s wr, 88 op/s
2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 412 kB/s wr, 130 op/s
2014-11-09 00:22:51.281294 mon.0 [INF] pgmap v4389617: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1195 kB/s wr, 167 op/s
2014-11-09 00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 active+clean; 2580 GB data,
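When reading a dump_ops_in_flight trace like the one above, the interesting number is the gap between consecutive events, since that shows which stage an op spends its time in. A small sketch using a subset of the event timestamps from the dump (the helper itself is mine, not a ceph tool):

```python
from datetime import datetime

# Timestamps taken from the dump_ops_in_flight output above.
EVENTS = [
    ("2014-11-09 00:14:17.385875", "waiting for subops from 1,10"),
    ("2014-11-09 00:14:17.386201", "commit_queued_for_journal_write"),
    ("2014-11-09 00:14:17.386336", "write_thread_in_journal_buffer"),
    ("2014-11-09 00:14:17.396293", "journaled_completion_queued"),
    ("2014-11-09 00:14:17.396332", "op_commit"),
]

def event_deltas(events):
    """Per-step latencies (in seconds) between consecutive op events,
    so the stage where an op spends most of its time stands out."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    ts = [datetime.strptime(t, fmt) for t, _ in events]
    return [(events[i + 1][1], (ts[i + 1] - ts[i]).total_seconds())
            for i in range(len(events) - 1)]

slowest = max(event_deltas(EVENTS), key=lambda kv: kv[1])
```

For this op the dominant cost is the journal write (roughly 10 ms to reach journaled_completion_queued), which is consistent with an iowait-bound OSD.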
Re: [ceph-users] Cache tiering and cephfs
I know that it is possible to run CephFS with a cache tier on the data pool in Giant, because that's what I do. However, when I configured it, I was on the previous release; when I upgraded to Giant, everything just kept working. By the way, when I set it up, I used the following commands:

ceph osd pool create cephfs-data 192 192 erasure
ceph osd pool create cephfs-metadata 192 192 replicated ssd
ceph osd pool create cephfs-data-cache 192 192 replicated ssd
ceph osd pool set cephfs-data-cache crush_ruleset 1
ceph osd pool set cephfs-metadata crush_ruleset 1
ceph osd tier add cephfs-data cephfs-data-cache
ceph osd tier cache-mode cephfs-data-cache writeback
ceph osd tier set-overlay cephfs-data cephfs-data-cache
ceph osd dump
ceph mds newfs 5 6 --yes-i-really-mean-it

So actually I didn't add a cache tier to an existing CephFS, but first made the pools and added CephFS directly after. In my case the cache pool is ssd-backed (obviously), while the default pool is on rotating media; the crush_ruleset 1 is meant to place both the cache pool and the metadata pool on the ssd's.

Erik.

On 11/16/2014 08:01 PM, Scott Laird wrote:
Is it possible to add a cache tier to cephfs's data pool in giant? I'm getting an error:

$ ceph osd tier set-overlay data data-cache
Error EBUSY: pool 'data' is in use by CephFS via its tier

From what I can see in the code, that comes from OSDMonitor::_check_remove_tier; I don't understand why set-overlay needs to call _check_remove_tier. A quick look makes it look like set-overlay will always fail once MDS has been set up. Is this a bug, or am I doing something wrong?

Scott
Re: [ceph-users] OSD commits suicide
Hi,

Thanks for the tip. I applied these configuration settings and it does lower the load during rebuilding a bit. Are there settings like these that also tune Ceph down a bit during regular operations? The slow requests, timeouts and OSD suicides are killing me.

If I allow the cluster to regain consciousness and stay idle a bit, it all seems to settle down nicely, but as soon as I apply some load it immediately starts to overstress and complain like crazy. I'm also seeing this behaviour:

http://tracker.ceph.com/issues/9844

This was reported by Dmitry Smirnov 26 days ago, but the report has no response yet. Any ideas? In my experience, OSD's are quite unstable in Giant and very easily stressed, causing chain effects that further worsen the issues. It would be nice to know if this is also noticed by other users.

Thanks,
Erik.

On 11/10/2014 08:40 PM, Craig Lewis wrote:
Have you tuned any of the recovery or backfill parameters? My ceph.conf has:

[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1

Still, if it's running for a few hours and then failing, it sounds like there might be something else at play. OSDs use a lot of RAM during recovery. How much RAM and how many OSDs do you have in these nodes? What does memory usage look like after a fresh restart, and what does it look like when the problems start? Even better if you know what it looks like 5 minutes before the problems start. Is there anything interesting in the kernel logs? OOM killers, or memory deadlocks?

On Sat, Nov 8, 2014 at 11:19 AM, Erik Logtenberg <e...@logtenberg.eu> wrote:
Hi, I have some OSD's that keep committing suicide. My cluster has ~1.3M misplaced objects, and it can't really recover, because OSD's keep failing before recovering finishes. The load on the hosts is quite high, but the cluster currently has no other tasks than just the backfilling/recovering. I attached the logfile from a failed OSD.
It shows the suicide, the recent events and also me starting the OSD again after some time. It'll keep running for a couple of hours and then fail again, for the same reason. I noticed a lot of timeouts. Apparently ceph stresses the hosts to the limit with the recovery tasks, so much that they timeout and can't finish that task. I don't understand why. Can I somehow throttle ceph a bit so that it doesn't keep overrunning itself? I kinda feel like it should chill out a bit and simply recover one step at a time instead of full force and then fail. Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
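For reference, the recovery throttles Craig quotes earlier in this thread belong in the [osd] section of ceph.conf on each OSD host. A conservative sketch, using exactly the option names from the thread (on releases of that era they could, as far as I know, also be injected at runtime with ceph tell osd.* injectargs, without restarting the daemons — verify against your release's documentation):

```
[osd]
# throttle backfill/recovery so client I/O and heartbeats keep breathing
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1
```

These only slow recovery down; they do not change behaviour during normal operation, which is why they don't address Erik's question about tuning regular load.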
[ceph-users] OSD always tries to remove non-existent btrfs snapshot
Hi, Every time I start any OSD, it always logs that it tried to remove two btrfs snapshots but failed:

2014-11-15 22:31:08.251600 7f1730f71700 -1 filestore(/var/lib/ceph/osd/ceph-5) unable to destroy snap 'snap_3020746' got (2) No such file or directory
2014-11-15 22:31:09.661161 7f1730f71700 -1 filestore(/var/lib/ceph/osd/ceph-5) unable to destroy snap 'snap_3020758' got (2) No such file or directory

These three snapshots do exist:

drwxr-xr-x. 1 root root 3126 15 nov 22:16 snap_3020728
drwxr-xr-x. 1 root root 3126 15 nov 22:16 snap_3020736
drwxr-xr-x. 1 root root 3126 15 nov 22:16 snap_3020746

So the first one it tries to remove does exist, but somehow removing it fails. The second one it wants to remove does not exist, so of course that fails too. And the two remaining snapshots that do exist are not mentioned in the log file. The OSD always continues to boot successfully, so it doesn't appear to hurt. Nonetheless, it doesn't instill much confidence in its internal administration either... ;) Anyone else seeing this? Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] jbod + SMART : how to identify failing disks ?
I have no experience with the DELL SAS controller, but usually the advantage of using a simple controller (instead of a RAID card) is that you can use full SMART directly.

$ sudo smartctl -a /dev/sda
=== START OF INFORMATION SECTION ===
Device Model:     INTEL SSDSA2BW300G3H
Serial Number:    PEPR2381003E300EGN

Personally, I make sure that I know which serial number drive is in which bay, so I can easily tell which drive I'm talking about. So you can use SMART both to notice (pre)failing disks -and- to physically identify them. The same smartctl command also returns the health status like so:

233 Media_Wearout_Indicator 0x0032 099 099 000 Old_age Always - 0

This specific SSD has 99% media lifetime left, so it's in the green. But it will continue to gradually degrade, and at some point it'll hit a percentage where I like to replace it. To keep an eye on the speed of decay, I'm graphing those SMART values in Cacti. That way I can somewhat predict how long a disk will last, especially SSD's, which die very gradually. Erik. On 12-11-14 14:43, JF Le Fillâtre wrote: Hi, May or may not work depending on your JBOD and the way it's identified and set up by the LSI card and the kernel: cat /sys/block/sdX/../../../../sas_device/end_device-*/bay_identifier The weird path and the wildcards are due to the way the sysfs is set up. That works with a Dell R520, 6GB HBA SAS cards and Dell MD1200s, running CentOS release 6.5. Note that you can make your life easier by writing an udev script that will create a symlink with a sane identifier for each of your external disks. If you match along the lines of KERNEL==sd*[a-z], KERNELS==end_device-*:*:* then you'll just have to cat /sys/class/sas_device/${1}/bay_identifier in a script (with $1 being the $id of udev after that match, so the string end_device-X:Y:Z) to obtain the bay ID. 
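The wearout attribute Erik graphs can be pulled out of smartctl output with a few lines of script. A minimal sketch (the function name is mine, and the column layout matches the attribute line quoted above; a real collector should run `smartctl -A` itself and handle vendor-specific attribute tables):

```python
def parse_wearout(smartctl_output):
    """Return the normalized Media_Wearout_Indicator value (percent of
    rated life remaining), or None if the attribute is not present."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # SMART attribute rows look like:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 4 and fields[1] == "Media_Wearout_Indicator":
            return int(fields[3])  # the normalized VALUE column
    return None

sample = "233 Media_Wearout_Indicator 0x0032 099 099 000 Old_age Always - 0"
print(parse_wearout(sample))  # -> 99
```

Feeding the returned number into Cacti (or Zabbix, as in the earlier thread) gives exactly the decay graph Erik describes.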
Thanks, JF On 12/11/14 14:05, SCHAER Frederic wrote: Hi, I'm used to RAID software giving me the slots of failing disks, and most often blinking the disks in the disk bays. I recently installed a DELL "6GB HBA SAS" JBOD card, said to be an LSI 2008 one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T.). Since this is an LSI, I thought I'd use MegaCli to identify the disk slots, but MegaCli does not see the HBA card. Then I found the LSI "sas2ircu" utility, but again, this one fails at giving me the disk slots (it finds the disks, serials and so on, but the slot is always 0). Because of this, I'm going to head over to the disk bay and unplug the disk which I think corresponds to the alphabetical order in Linux, and see if it's the correct one... But even if this is correct this time, it might not be next time. This makes me wonder: how do you guys, Ceph users, manage your disks if you really have JBOD servers? I can't imagine having to guess slots like that each time, nor can I imagine creating serial number stickers for every single disk I might have to manage... Is there any specific advice regarding JBOD cards people should (not) use in their systems? Any magical way to "blink" a drive in Linux? Thanks, regards ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
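JF's udev suggestion could be sketched roughly like this. The rule file name, helper path and symlink scheme below are all made up for illustration, and the match strings are the ones he quotes, so everything will need adjusting to your controller and enclosure:

```
# /etc/udev/rules.d/99-bay-id.rules (hypothetical example)
# %c expands to the output of PROGRAM; $id is the kernel name of the
# parent device that matched KERNELS, e.g. "end_device-1:0:3"
KERNEL=="sd*[a-z]", KERNELS=="end_device-*:*:*", \
    PROGRAM="/usr/local/bin/bay_id $id", SYMLINK+="disk/by-bay/bay%c"
```

with /usr/local/bin/bay_id being essentially the one-liner from the mail:

```
#!/bin/sh
# print the enclosure bay for a given end_device name
cat /sys/class/sas_device/$1/bay_identifier
```

After a udev reload you would then address disks as /dev/disk/by-bay/bayN instead of guessing at sdX ordering.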
Re: [ceph-users] E-Mail netiquette
Oops, my apologies if the 3MB logfile that I sent to this list yesterday was annoying to anybody. I didn't realize that the combination of low bandwidth / high mobile tariffs and an email client that automatically downloads all attachments was still a thing. Apparently it is. Next time I'll upload a large-ish attachment somewhere else and put a link in the mail. Thanks, Erik. On 11/09/2014 09:04 PM, Manfred Hollstein wrote: Hi there, as we've recently seen some pretty bad examples, is it possible to agree on some topics for posting on a list reaching quite a few subscribers? Logs should be uploaded to some server; if you still believe they must be sent to the list, compress them first. Some people are on mobile tariffs where the monthly quota is quite pressing, and uncompressed logs are very unlikely to help with that... While we're at it: I know HTML e-mails are mostly preferred by managers, but in fact they don't add any real benefit on a mostly technical mailing list, hence text-only format should work. The level of detail and information is really great on this list, so I'd really appreciate it if some people here would hold on and think for a moment before hitting the Send button ;) I hope you don't feel offended, that clearly wasn't my intention! Keep up the good work. Cheers. l8er manfred ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] MDS slow, logging rdlock failures
Hi, My MDS is very slow, and it logs stuff like this: 2014-11-07 23:38:41.154939 7f8180a31700 0 log_channel(default) log [WRN] : 2 slow requests, 1 included below; oldest blocked for 187.777061 secs 2014-11-07 23:38:41.154956 7f8180a31700 0 log_channel(default) log [WRN] : slow request 121.322570 seconds old, received at 2014-11-07 23:36:39.832336: client_request(client.7071:115 getattr pAsLsXsFs #102bdbe 2014-11-07 23:36:39.00) currently failed to rdlock, waiting Any idea what that rdlock is and what might cause it to fail? This is on Giant by the way. Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Bug in Fedora package ceph-0.87-1
Hi, There is a small bug in the Fedora package for ceph-0.87. Two days ago, Boris Ranto built the first 0.87 package, for Fedora 22 (rawhide) [1]. [1] http://koji.fedoraproject.org/koji/buildinfo?buildID=589731 This build was a success, so I took that package and built it for Fedora 20 (which is the current version); however, that failed. The thing is, apparently in Fedora the rbd-replay-prep binary is not built, but there is a specific check for Fedora 20, which nonetheless tries to include it in the file list:

%if (0%{?fedora} == 20 || 0%{?rhel} == 6)
%{_mandir}/man8/rbd-replay-prep.8*
%{_bindir}/rbd-replay-prep
%endif

I really have no idea what this is supposed to accomplish. That binary is not built, and that manpage is even explicitly deleted for that reason:

# do not package man page for binary that is not built
rm -f $RPM_BUILD_ROOT%{_mandir}/man8/rbd-replay-prep.8*

So it is quite obvious that in Fedora those two files do not exist. In all other versions of Fedora than 20 this is no problem, but specifically in Fedora 20 they are explicitly included in the file list, which causes packaging to fail (obviously). The easiest way to fix this is to simply remove those four lines from %if to %endif, so that this package behaves the same for Fedora 20 as for all other versions of Fedora. I haven't tested a build for RHEL 6, but that man page gets deleted anyway on all rpm platforms, so this must fail for RHEL 6 too. Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Negative amount of objects degraded
Hi, Yesterday I removed two OSD's, to replace them with new disks. Ceph was not able to completely reach the all active+clean state, but some degraded objects remain. However, the amount of degraded objects is negative (-82), see below:

2014-10-30 13:31:32.862083 mon.0 [INF] pgmap v209175: 768 pgs: 761 active+clean, 7 active+remapped; 1644 GB data, 2524 GB used, 17210 GB / 19755 GB avail; 2799 B/s wr, 1 op/s; -82/1439391 objects degraded (-0.006%)

According to rados df, the -82 degraded objects are part of the cephfs-data-cache pool, which is an SSD-backed replicated pool that functions as a cache pool for an HDD-backed erasure coded pool for cephfs. The cache should be empty, because I issued a rados cache-flush-evict-all command, and rados -p cephfs-data-cache ls indeed shows zero objects in this pool. rados df however does show 192 objects for this pool, with just 35 KB used and -82 degraded:

pool name          category  KB  objects  clones  degraded  unfound  rd    rd KB   wr       wr KB
cephfs-data-cache  -         35  192      0       -82       0        1119  348800  1198371  1703673493

Please advise... Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Negative amount of objects degraded
Yesterday I removed two OSD's, to replace them with new disks. Ceph was not able to completely reach all active+clean state, but some degraded objects remain. However, the amount of degraded objects is negative (-82), see below: So why didn't it reach that state? Well, I dunno, I was hoping this list would know why? I simply sat there waiting for the process to complete and it didn't. Could you query those PGs and see why they are remapped? $ ceph pg pg id query I queried one of the PG's, see below for the output. Can you tell why they are remapped? # ceph pg 5.af query { state: active+remapped, epoch: 1105, up: [ 14, 11], acting: [ 14, 11, 12], actingbackfill: [ 11, 12, 14], info: { pgid: 5.af, last_update: 533'1976, last_complete: 533'1976, log_tail: 0'0, last_user_version: 1976, last_backfill: MAX, purged_snaps: [], history: { epoch_created: 197, last_epoch_started: 772, last_epoch_clean: 772, last_epoch_split: 0, same_up_since: 721, same_interval_since: 771, same_primary_since: 721, last_scrub: 533'1976, last_scrub_stamp: 2014-10-29 00:09:56.095703, last_deep_scrub: 533'1976, last_deep_scrub_stamp: 2014-10-27 00:09:48.770622, last_clean_scrub_stamp: 2014-10-29 00:09:56.095703}, stats: { version: 533'1976, reported_seq: 2846, reported_epoch: 1105, state: active+remapped, last_fresh: 2014-10-30 01:55:27.177249, last_change: 2014-10-29 23:07:40.579020, last_active: 2014-10-30 01:55:27.177249, last_clean: 2014-10-26 21:49:13.064622, last_became_active: 0.00, last_unstale: 2014-10-30 01:55:27.177249, mapping_epoch: 766, log_start: 0'0, ondisk_log_start: 0'0, created: 197, last_epoch_clean: 772, parent: 0.0, parent_split_bits: 0, last_scrub: 533'1976, last_scrub_stamp: 2014-10-29 00:09:56.095703, last_deep_scrub: 533'1976, last_deep_scrub_stamp: 2014-10-27 00:09:48.770622, last_clean_scrub_stamp: 2014-10-29 00:09:56.095703, log_size: 1976, ondisk_log_size: 1976, stats_invalid: 0, stat_sum: { num_bytes: 4194304, num_objects: 13, num_object_clones: 0, 
num_object_copies: 39, num_objects_missing_on_primary: 0, num_objects_degraded: 0, num_objects_unfound: 0, num_objects_dirty: 13, num_whiteouts: 0, num_read: 36, num_read_kb: 40, num_write: 2047, num_write_kb: 17259, num_scrub_errors: 0, num_shallow_scrub_errors: 0, num_deep_scrub_errors: 0, num_objects_recovered: 26, num_bytes_recovered: 8388608, num_keys_recovered: 222, num_objects_omap: 12, num_objects_hit_set_archive: 0}, stat_cat_sum: {}, up: [ 14, 11], acting: [ 14, 11, 12], up_primary: 14, acting_primary: 14}, empty: 0, dne: 0, incomplete: 0, last_epoch_started: 772, hit_set_history: { current_last_update: 0'0, current_last_stamp: 0.00, current_info: { begin: 0.00, end: 0.00, version: 0'0}, history: []}}, peer_info: [ { peer: 11, pgid: 5.af, last_update: 533'1976, last_complete: 533'1976, log_tail: 0'0, last_user_version: 1976, last_backfill: MAX, purged_snaps: [], history: { epoch_created: 197, last_epoch_started: 772, last_epoch_clean: 772, last_epoch_split: 0, same_up_since: 721, same_interval_since: 771, same_primary_since: 721, last_scrub: 533'1976, last_scrub_stamp: 2014-10-29 00:09:56.095703, last_deep_scrub: 533'1976, last_deep_scrub_stamp: 2014-10-27 00:09:48.770622, last_clean_scrub_stamp: 2014-10-29 00:09:56.095703}, stats: { version: 533'1976, reported_seq: 2430, reported_epoch: 723, state: remapped+peering, last_fresh: 2014-10-29 23:03:18.847590, last_change: 2014-10-29 23:03:17.673820, last_active: 2014-10-29 22:41:29.551558, last_clean: 2014-10-26 21:49:13.064622, last_became_active: 0.00, last_unstale: 2014-10-29 23:03:18.847590, mapping_epoch: 766,
Re: [ceph-users] Negative amount of objects degraded
Thanks for pointing that out. Unfortunately, those tickets contain only a description of the problem, but no solution or workaround. One was opened 8 months ago and the other more than a year ago. No love since. Is there any way I can get my cluster back in a healthy state? Thanks, Erik. On 10/30/2014 05:13 PM, John Spray wrote: There are a couple of open tickets about bogus (negative) stats on PGs: http://tracker.ceph.com/issues/5884 http://tracker.ceph.com/issues/7737 Cheers, John On Thu, Oct 30, 2014 at 12:38 PM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, Yesterday I removed two OSD's, to replace them with new disks. Ceph was not able to completely reach the all active+clean state, but some degraded objects remain. However, the amount of degraded objects is negative (-82), see below: 2014-10-30 13:31:32.862083 mon.0 [INF] pgmap v209175: 768 pgs: 761 active+clean, 7 active+remapped; 1644 GB data, 2524 GB used, 17210 GB / 19755 GB avail; 2799 B/s wr, 1 op/s; -82/1439391 objects degraded (-0.006%) According to rados df, the -82 degraded objects are part of the cephfs-data-cache pool, which is an SSD-backed replicated pool that functions as a cache pool for an HDD-backed erasure coded pool for cephfs. The cache should be empty, because I issued a rados cache-flush-evict-all command, and rados -p cephfs-data-cache ls indeed shows zero objects in this pool. rados df however does show 192 objects for this pool, with just 35 KB used and -82 degraded: pool name category KB objects clones degraded unfound rd rd KB wr wr KB cephfs-data-cache - 35 192 0 -82 0 1119 348800 1198371 1703673493 Please advise... Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph RPM spec removes /etc/ceph
I would like to add that removing log files (/var/log/ceph is also removed on uninstall) is also a bad thing. My suggestion would be to simply drop the whole %postun trigger, since it does only these two very questionable things. Thanks, Erik. On 10/22/2014 09:16 PM, Dmitry Borodaenko wrote: Current version of RPM spec for Ceph removes the whole /etc/ceph directory on uninstall: https://github.com/ceph/ceph/blob/master/ceph.spec.in#L557-L562 I don't think contents of /etc/ceph is disposable and should be silently discarded like that. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Reweight a host
I don't think so, check this out:

# id    weight    type name              up/down  reweight
-6      3.05      root ssd
-7      0.04999       host ceph-01-ssd
11      0.04999           osd.11         up       1
-8      1             host ceph-02-ssd
12      0.04999           osd.12         up       1
-9      1             host ceph-03-ssd
13      0.03999           osd.13         up       1
-10     1             host ceph-04-ssd
14      0.03999           osd.14         up       1

As you can see, only host ceph-01-ssd has the same weight as its osd; the other three hosts have weight 1, which is different from their associated osd. If the weight of the host -should- be the sum of all osd weights on the host, then my question becomes: how do I make that so for the three hosts where this is currently not the case? Thanks, Erik. On 20-10-14 03:55, Lei Dong wrote: According to my understanding, the weight of a host is the sum of all osd weights on this host. So if you reweight any osd on this host, the weight of this host is reweighted. Thanks LeiDong On 10/20/14, 7:11 AM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, Simple question: how do I reweight a host in the crushmap? I can use ceph osd crush reweight to reweight an osd, but I would like to change the weight of a host instead. I tried exporting the crushmap, but I noticed that the weights of all hosts are commented out, like so: # weight 5.460 And they are not the same values as seen in ceph osd tree. So how do I keep everything as it currently is, but simply change one single weight of one single host? Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Reweight a host
Hi, Simple question: how do I reweight a host in the crushmap? I can use ceph osd crush reweight to reweight an osd, but I would like to change the weight of a host instead. I tried exporting the crushmap, but I noticed that the weights of all hosts are commented out, like so: # weight 5.460 And they are not the same values as seen in ceph osd tree. So how do I keep everything as it currently is, but simply change one single weight of one single host? Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Best practice K/M-parameters EC pool
Hi, With EC pools in Ceph you are free to choose any K and M parameters you like. The documentation explains what K and M do, so far so good. Now, there are certain combinations of K and M that appear to have more or less the same result. Do any of these combinations have pro's and con's that I should consider and/or are there best practices for choosing the right K/M-parameters? For instance, if I choose K = 3 and M = 2, then pg's in this pool will use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in this configuration. Now, if I were to choose K = 6 and M = 4, I would end up with pg's that use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not so much different from the first configuration. Also there is the same 40% overhead. One rather obvious difference between the two configurations is that the latter requires a cluster with at least 10 OSD's to make sense. But let's say we have such a cluster, which of the two configurations would be recommended, and why? Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
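The overhead figures in the question are easy to verify; a quick sketch of the arithmetic (here "overhead" means the fraction of raw capacity spent on coding chunks, M/(K+M)):

```python
def ec_overhead(k, m):
    """Fraction of raw capacity spent on coding (redundancy) chunks
    for a K+M erasure-coded pool."""
    return m / (k + m)

for k, m in [(3, 2), (6, 4)]:
    print("K=%d M=%d: %d OSDs per PG, survives %d OSD losses, overhead %.0f%%"
          % (k, m, k + m, m, 100 * ec_overhead(k, m)))
# Both configurations come out at 40% overhead.
```

So 3+2 and 6+4 are indeed identical in space efficiency, which is why the interesting differences are in recovery behaviour and failure statistics, as discussed below in the thread.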
Re: [ceph-users] Best practice K/M-parameters EC pool
Now, there are certain combinations of K and M that appear to have more or less the same result. Do any of these combinations have pro's and con's that I should consider and/or are there best practices for choosing the right K/M-parameters? Loic might have a better answer, but I think that the more segments (K) you have, the heavier recovery is. You have to contact more OSDs to reconstruct the whole object, so that involves more disks doing seeks. I heard somebody from Fujitsu say that he thought 8/3 was best for most situations. That wasn't with Ceph though, but with a different system which implemented Erasure Coding. Performance is definitely lower with more segments in Ceph. I kind of gravitate toward 4/2 or 6/2, though that's just my own preference. This is indeed the kind of pro's and con's I was thinking about. Performance-wise, I would expect differences, but I can think of both positive and negative effects of bigger values for K. For instance, yes, recovery takes more OSD's with bigger values of K, but it seems to me that there are also fewer or smaller items to recover. Also, read performance generally appears to benefit from having a bigger cluster (more parallelism), so I can imagine that bigger values of K also provide an increase in read performance. Mark says more segments hurt performance though; are you referring just to rebuild performance or also basic operational performance (read/write)? For instance, if I choose K = 3 and M = 2, then pg's in this pool will use 5 OSD's and sustain the loss of 2 OSD's. There is 40% overhead in this configuration. Now, if I were to choose K = 6 and M = 4, I would end up with pg's that use 10 OSD's and sustain the loss of 4 OSD's, which is statistically not so much different from the first configuration. Also there is the same 40% overhead. Although I don't have numbers in mind, I think the odds of losing two OSD's simultaneously are a lot smaller than the odds of losing four OSD's simultaneously. 
Or am I misunderstanding you when you write statistically not so much different from the first configuration ? Losing two smaller than losing four? Is that correct, or did you mean it the other way around? I'd say that losing four OSD's simultaneously is less likely to happen than two simultaneously. This is true, though the more disks you spread your objects across, the higher the likelihood that any given object will be affected by a lost OSD. The extreme case being that every object is spread across every OSD and losing any given OSD affects all objects. I suppose the severity depends on the erasure coding parameters relative to the total number of OSDs. I think this is perhaps what Erik was getting at. I haven't done the actual calculations, but given some % chance of disk failure, I would assume that losing x out of y disks has roughly the same chance as losing 2*x out of 2*y disks over the same period. That's also why you generally want to limit RAID5 arrays to maybe 6 disks or so and move to RAID6 for bigger arrays. For arrays bigger than 20 disks you would usually split those into separate arrays, just to keep the (parity disks / total disks) fraction high enough. With regard to data safety I would guess that 3+2 and 6+4 are roughly equal, although the behaviour of 6+4 is probably easier to predict because bigger numbers make your calculations less dependent on individual deviations in reliability. Do you guys feel this argument is valid? Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
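Erik's intuition ("losing x out of y has roughly the same chance as losing 2x out of 2y") can be checked with a binomial model. A rough sketch, assuming independent disk failures with a made-up per-disk probability p within one recovery window; real failures are correlated and wider stripes recover more slowly, so treat this as illustration, not a durability calculation:

```python
from math import comb

def p_data_loss(n, m, p):
    """Probability that more than m out of n disks fail together,
    i.e. a K+M=n erasure-coded PG loses data, assuming independent
    per-disk failure probability p within one recovery window."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(m + 1, n + 1))

p = 0.001  # assumed per-disk failure probability, purely illustrative
print("3+2 :", p_data_loss(5, 2, p))
print("6+4 :", p_data_loss(10, 4, p))
```

Under this naive model the wider 6+4 code comes out several orders of magnitude safer than 3+2 at the same 40% overhead, because five simultaneous failures are far less likely than three. Correlated failures and longer recovery of wider stripes erode that advantage in practice, which is what Loic's durability model tries to capture.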
Re: [ceph-users] Best practice K/M-parameters EC pool
I haven't done the actual calculations, but given some % chance of disk failure, I would assume that losing x out of y disks has roughly the same chance as losing 2*x out of 2*y disks over the same period. That's also why you generally want to limit RAID5 arrays to maybe 6 disks or so and move to RAID6 for bigger arrays. For arrays bigger than 20 disks you would usually split those into separate arrays, just to keep the (parity disks / total disks) fraction high enough. With regard to data safety I would guess that 3+2 and 6+4 are roughly equal, although the behaviour of 6+4 is probably easier to predict because bigger numbers make your calculations less dependent on individual deviations in reliability. Do you guys feel this argument is valid? Here is how I reason about it, roughly: If the probability of losing a disk is 0.1%, the probability of losing two disks simultaneously (i.e. before the failure can be recovered) would be 0.001 * 0.001 = 1e-6 (0.0001%), three disks becomes 1e-9, and four disks 1e-12. Accurately calculating the reliability of the system as a whole is a lot more complex (see https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/ for more information). Cheers Okay, I see that in your calculation, you leave the total amount of disks completely out of the equation. The link you provided is very useful indeed and does some actual calculations. Interestingly, the example in the details page [1] uses k=32 and m=32 for a total of 64 blocks. Those are very much bigger values than Mark Nelson mentioned earlier. Is that example merely meant to demonstrate the theoretical advantages, or would you actually recommend using those numbers in practice? Let's assume that we have at least 64 OSD's available, would you recommend k=32 and m=32? 
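Loic's back-of-the-envelope independence argument, spelled out numerically (note that 0.1% = 0.001, so two simultaneous losses come out at 0.0001%):

```python
p = 0.001  # example: 0.1% chance of losing one disk before recovery completes

simultaneous = {n: p ** n for n in (2, 3, 4)}
for n, prob in simultaneous.items():
    print("%d disks lost simultaneously: %.0e (%.10f%%)" % (n, prob, 100 * prob))
```

Each extra coding chunk (M) multiplies the naive simultaneous-loss probability by another factor of p, which is why adding redundancy is so effective even before the more careful modelling in the linked durability page.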
[1] https://wiki.ceph.com/Development/Add_erasure_coding_to_the_durability_model/Technical_details_on_the_model ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fw: external monitoring tools for processes
Hi, Be sure to check this out: http://ceph.com/community/ceph-calamari-goes-open-source/ Erik. On 11-08-14 08:50, Irek Fasikhov wrote: Hi. I use ZABBIX with the following script:

[ceph@ceph08 ~]$ cat /etc/zabbix/external/ceph
#!/usr/bin/python
import sys
import os
import commands
import json
import datetime
import time

# Check arguments. If count of arguments equals 1, then false.
if len(sys.argv) == 1:
    print "You will need arguments!"
    exit

def generate(data, type):
    JSON = "{\"data\":["
    for js in range(len(splits)):
        JSON += "{\"{#" + type + "}\":\"" + splits[js] + "\"},"
    return JSON[:-1] + "]}"

if sys.argv[1] == "osd":
    if len(sys.argv) == 2:
        splits = commands.getoutput('df | grep osd | awk {\'print $6\'} | sed \'s/[^0-9]//g\' | sed \':a;N;$!ba;s/\\n/,/g\'').split(",")
        print generate(splits, "OSD")
    else:
        ID = sys.argv[2]
        LEVEL = sys.argv[3]
        PERF = sys.argv[4]
        CACHEFILE = "/tmp/zabbix.ceph.osd" + ID + ".cache"
        CACHETTL = 5
        TIME = int(round(float(datetime.datetime.now().strftime("%s"))))
        ## CACHE FOR OPTIMIZATION OF PERFORMANCE
        if os.path.isfile(CACHEFILE):
            CACHETIME = int(round(os.stat(CACHEFILE).st_mtime))
        else:
            CACHETIME = 0
        if TIME - CACHETIME > CACHETTL:
            if os.system('sudo ceph --admin-daemon /var/run/ceph/ceph-osd.' + ID + '.asok perfcounters_dump > ' + CACHEFILE) > 0:
                exit
        json_data = open(CACHEFILE)
        data = json.load(json_data)
        json_data.close()
        ## PARSING
        if LEVEL in data:
            if PERF in data[LEVEL]:
                try:
                    key = data[LEVEL][PERF].has_key("sum")
                    print (data[LEVEL][PERF]["sum"]) / (data[LEVEL][PERF]["avgcount"])
                except AttributeError:
                    print data[LEVEL][PERF]

and zabbix templates: https://dl.dropboxusercontent.com/u/575018/zbx_export_templates.xml 2014-08-11 7:42 GMT+04:00 pragya jain prag_2...@yahoo.co.in mailto:prag_2...@yahoo.co.in: please somebody reply to my question. On Saturday, 9 August 2014 3:34 PM, pragya jain prag_2...@yahoo.co.in mailto:prag_2...@yahoo.co.in wrote: hi all, can somebody suggest me some external monitoring tools which can monitor whether the processes in ceph, such as heartbeating, data scrubbing, authentication, backfilling, recovering, etc. 
are working properly or not. Regards Pragya Jain ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph.com centos7 repository ?
Hi, RHEL7 repository works just as well. CentOS 7 is effectively a copy of RHEL7 anyway. Packages for CentOS 7 wouldn't actually be any different. Erik. On 07/10/2014 06:14 AM, Alexandre DERUMIER wrote: Hi, I would like to known if a centos7 respository will be available soon ? Or can I use current rhel7 for the moment ? http://ceph.com/rpm-firefly/rhel7/x86_64/ Cheers, Alexandre ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Temporary degradation when adding OSD's
Yeah, Ceph will never voluntarily reduce the redundancy. I believe splitting the degraded state into separate wrongly placed and degraded (reduced redundancy) states is currently on the menu for the Giant release, but it's not been done yet. That would greatly improve the accuracy of ceph's status reports. Does ceph currently know about the difference of these states well enough to be smart with prioritizing? Specifically, if I add an OSD and ceph starts moving data around, but during that time an other OSD fails; is ceph smart enough to quickly prioritize reduplicating the lost copies before continuing to move data around (that was still perfectly duplicated)? Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Temporary degradation when adding OSD's
Hi, If you add an OSD to an existing cluster, ceph will move some existing data around so the new OSD gets its respective share of usage right away. Now I noticed that during this moving around, ceph reports the relevant PG's as degraded. I can more or less understand the logic here: if a piece of data is supposed to be in a certain place (the new OSD), but it is not yet there, it's degraded. However I would hope that the movement of data is executed in such a way that first a new copy is made on the new OSD and only after successfully doing that, one of the existing copies is removed. If so, there is never actually any degradation of that PG. More to the point, if I have a PG replicated over three OSD's: 1, 2 and 3; now I add an OSD 4, and ceph decides to move the copy of OSD 3 to the new OSD 4; if it turns out that ceph can't read the copies on OSD 1 and 2 due to some disk error, I would assume that ceph would still use the copy that exists on OSD 3 to populate the copy on OSD 4. Is that indeed the case? I have a very similar question about removing an OSD. You can tell ceph to mark an OSD as out before physically removing it. The OSD is still up but ceph will no longer assign PG's to it, and will make new copies of the PG's that are on this OSD to other OSD's. Now again ceph will report degradation, even though the out OSD is still up, so the existing copies are not actually lost. Does ceph use the OSD that is marked out as a source for making the new copies on other OSD's? Thanks, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Question about placing different pools on different osds
Hi, I have some osd's on hdd's and some on ssd's, just like the example in these docs: http://ceph.com/docs/firefly/rados/operations/crush-map/ Now I'd like to place an erasure encoded pool on the hdd's and a replicated (cache) pool on the ssd's. In order to do that, I have to split the crush maps into two roots, according to the docs. The way the docs describe it, I have to create two roots, and then separate server-nodes for ssd's and hdd's, then the osd's under the right server node. My hdd's and ssd's are mixed within hosts though: I have four physical hosts each with a couple of hdd's and an ssd. So there is not a hdd-server and a ssd-server, like in the docs. Or do I have to create two server-nodes per host? It appears to me that not all rules will still work the same way in that case. Regards, Erik. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
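The pattern from that documentation page is indeed to give each physical machine two logical host buckets, one per media type, under two separate roots. A hypothetical decompiled-crushmap fragment for one of the four hosts (bucket names, ids and weights are purely illustrative):

```
# illustrative crushmap fragment, one physical machine split in two
host ceph-01-hdd {
    id -2
    alg straw
    hash 0
    item osd.0 weight 1.000
    item osd.1 weight 1.000
}
host ceph-01-ssd {
    id -7
    alg straw
    hash 0
    item osd.11 weight 0.050
}
root hdd {
    id -1
    alg straw
    hash 0
    item ceph-01-hdd weight 2.000
    # ... one entry per physical host
}
root ssd {
    id -6
    alg straw
    hash 0
    item ceph-01-ssd weight 0.050
    # ... one entry per physical host
}
```

A ruleset rooted at hdd then serves the erasure-coded pool, and one rooted at ssd serves the cache and metadata pools. The caveat Erik senses is real: with two logical hosts per machine, a rule that separates copies across "host" buckets no longer guarantees separation across physical machines, since losing one box takes out what CRUSH considers two hosts.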
Re: [ceph-users] Permissions spontaneously changing in cephfs
Hi Zheng, Yes, it was mounted implicitly with acl's enabled. I disabled it by adding noacl to the mount command, and now the behaviour is correct! No more changing permissions. So it appears to be related to acl's indeed, even though I didn't actually set any acl's. Simply mounting with acl's enabled was enough to cause the issue apparently. So, do you have enough information to possibly fix it, or is there any way that I can provide additional information? Thanks, Erik. On 06/30/2014 05:13 AM, Yan, Zheng wrote: On Mon, Jun 30, 2014 at 4:25 AM, Erik Logtenberg e...@logtenberg.eu wrote: Hi Zheng, Okay, so on host1 I did: # echo "module ceph +p" > /sys/kernel/debug/dynamic_debug/control # mkdir hoi mkdir: cannot create directory 'hoi': File exists # mkdir hoi2 # ls -al drwxr-xr-x 1 root root 0 29 jun 22:12 hoi drwxr-xr-x 1 root root 0 29 jun 22:16 hoi2 # dmesg > /host1.log Did you have Posix ACL enabled? A bug in Posix ACL support code can cause this issue. Regards Yan, Zheng On host2 I did: # echo "module ceph +p" > /sys/kernel/debug/dynamic_debug/control # ls -al drwxrwxrwx 1 root root 0 29 jun 22:12 hoi drwxrwxrwx 1 root root 0 29 jun 22:16 hoi2 # dmesg > /host2.log Please find attached both host1.log and host2.log Thanks, Erik. On 06/20/2014 08:04 AM, Yan, Zheng wrote: On Fri, Jun 20, 2014 at 6:13 AM, Erik Logtenberg e...@logtenberg.eu wrote: Hi Zheng, Additionally, I notice that as long as I don't do anything with that directory, the permissions stay wrong. Previously I noticed that the permissions eventually got right by themselves, but I don't know what triggered it. Also, the permission problem is not just with the first ever created directory, it happens to files too: [host1 bla]# touch hoi [host1 bla]# ls -al -rw-r--r-- 1 root root 0 20 jun 00:05 hoi [host2 bla]# ls -al -rw-rw-rw- 1 root root 0 20 jun 00:05 hoi Notice the additional group and world writable flags. 
It works the other way round too: [host2 bla]# touch hoi2 [host2 bla]# ls -al -rw-r--r-- 1 root root 0 20 jun 00:09 hoi2 [host1 bla]# ls -al -rw-rw-rw- 1 root root 0 20 jun 00:09 hoi2 However now after a couple of seconds I re-check on host2, and the permissions have changed there as well: [host2 bla]# ls -al -rw-rw-rw- 1 root root 0 20 jun 00:09 hoi2 So now it's group and world writable on both hosts. I can't reproduce this locally. Please enable dynamic debugging for ceph (echo "module ceph +p" > /sys/kernel/debug/dynamic_debug/control) and send kernel log to me. Regards Yan, Zheng Kind regards, Erik. On 06/19/2014 11:37 PM, Erik Logtenberg wrote: I am using the kernel client. kernel: 3.14.4-100.fc19.x86_64 ceph: ceph-0.80.1-0.fc19.x86_64 Actually, I seem to be able to reproduce it quite reliably. I just reset my cephfs (fiddling with erasure coded pools which was no success), so just for kicks tried again with creating a directory. Exactly the same results. Kind regards, Erik. On 06/16/2014 02:32 PM, Yan, Zheng wrote: were you using ceph-fuse or kernel client? ceph version and kernel version? how reliably can you reproduce this problem? Regards Yan, Zheng On Sun, Jun 15, 2014 at 4:42 AM, Erik Logtenberg e...@logtenberg.eu wrote: Hi, So... I wrote some files into that directory to test performance, and now I notice that both hosts see the permissions the right way, like they were when I first created the directory. What is going on here? .. Erik. On 06/14/2014 10:32 PM, Erik Logtenberg wrote: Hi, I ran into a weird issue with cephfs today. I create a directory like this: # mkdir bla # ls -al drwxr-xr-x 1 root root 0 14 jun 22:22 bla Now on another host, with the same cephfs mounted, I see different permissions: # ls -al drwxrwxrwx 1 root root 0 14 jun 22:22 bla Weird, huh? Back to host #1, I unmount cephfs and mount it again. Now it sees the same (changed) permissions as I saw on the second host: # ls -al drwxrwxrwx 1 root root 0 14 jun 22:22 bla So... 
what happened to the original permissions and why did they change? Thanks, Erik.
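For completeness, the noacl workaround for the kernel client can be made persistent via /etc/fstab. This is a sketch only — the monitor address, mount point and secret file path below are placeholders, not values taken from the thread:

```
# /etc/fstab — hypothetical monitor address and paths
192.168.1.10:6789:/  /mnt/cephfs  ceph  name=admin,secretfile=/etc/ceph/admin.secret,noacl,noatime  0 0
```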
Re: [ceph-users] v0.82 released
Ah, okay I missed that. So, what distributions/versions are supported then? I see that the FC20 part of the ceph repository (http://ceph.com/rpm/fc20/x86_64) doesn't contain ceph itself, so I am assuming you'd have to use the ceph package from FC20 itself, however they are still at 0.80.1: ceph-0.80.1-2.fc20.x86_64.rpm There are also el6 packages at http://ceph.com/rpm/el6/ but they are at 0.80.1 as well. The same seems to be true for the RHEL7 packages, so I am a bit at a loss... Thanks, Erik. On 06/30/2014 03:02 PM, Alfredo Deza wrote: Erik, I don't think we are building for FC19 anymore. There are some dependencies that could not be met for Ceph in FC19 so we decided to stop trying to get builds out for that. On Sun, Jun 29, 2014 at 2:52 PM, Erik Logtenberg e...@logtenberg.eu wrote: Nice work! When will the new rpm's be released on http://ceph.com/rpm/fc19/x86_64/ ? Thanks, Erik. On 06/27/2014 10:55 PM, Sage Weil wrote: This is the second post-firefly development release. It includes a range of bug fixes and some usability improvements. There are some MDS debugging and diagnostic tools, an improved 'ceph df', and some OSD backend refactoring and cleanup. 
Notable Changes
---
* ceph-brag: add tox tests (Alfredo Deza)
* common: perfcounters now use atomics and go faster (Sage Weil)
* doc: CRUSH updates (John Wilkins)
* doc: osd primary affinity (John Wilkins)
* doc: pool quotas (John Wilkins)
* doc: pre-flight doc improvements (Kevin Dalley)
* doc: switch to an unencumbered font (Ross Turk)
* doc: update openstack docs (Josh Durgin)
* fix hppa arch build (Dmitry Smirnov)
* init-ceph: continue starting other daemons on crush or mount failure (#8343, Sage Weil)
* keyvaluestore: fix hint crash (#8381, Haomai Wang)
* libcephfs-java: build against older JNI headers (Greg Farnum)
* librados: fix rados_pool_list bounds checks (Sage Weil)
* mds: cephfs-journal-tool (John Spray)
* mds: improve Journaler on-disk format (John Spray)
* mds, libcephfs: use client timestamp for mtime/ctime (Sage Weil)
* mds: misc encoding improvements (John Spray)
* mds: misc fixes for multi-mds (Yan, Zheng)
* mds: OPTracker integration, dump_ops_in_flight (Greg Farnum)
* misc cleanup (Christophe Courtaut)
* mon: fix default replication pool ruleset choice (#8373, John Spray)
* mon: fix set cache_target_full_ratio (#8440, Geoffrey Hartz)
* mon: include per-pool 'max avail' in df output (Sage Weil)
* mon: prevent EC pools from being used with cephfs (Joao Eduardo Luis)
* mon: restore original weight when auto-marked out OSDs restart (Sage Weil)
* mon: use msg header tid for MMonGetVersionReply (Ilya Dryomov)
* osd: fix bogus assert during OSD shutdown (Sage Weil)
* osd: fix clone deletion case (#8334, Sam Just)
* osd: fix filestore removal corner case (#8332, Sam Just)
* osd: fix hang waiting for osdmap (#8338, Greg Farnum)
* osd: fix interval check corner case during peering (#8104, Sam Just)
* osd: fix journal-less operation (Sage Weil)
* osd: include backend information in metadata reported to mon (Sage Weil)
* rest-api: fix help (Ailing Zhang)
* rgw: check entity permission for put_metadata (#8428, Yehuda Sadeh)

Getting Ceph
* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.82.tar.gz
* For packages, see http://ceph.com/docs/master/install/get-packages
* For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph-deploy
[ceph-users] Erasure coded pool suitable for MDS?
Hi, Are erasure coded pools suitable for use with MDS? I tried to give it a go by creating two new pools like so: # ceph osd pool create ecdata 128 128 erasure # ceph osd pool create ecmetadata 128 128 erasure Then looked up their id's: # ceph osd lspools ..., 6 ecdata,7 ecmetadata # ceph mds newfs 7 6 --yes-i-really-mean-it But then when I start MDS, it crashes horribly. I did notice that MDS created a couple of objects in the ecmetadata pool: # rados ls -p ecmetadata mds0_sessionmap mds0_inotable 1..inode 200. mds_anchortable mds_snaptable 100..inode However it crashes immediately after. I started mds manually to try and see what's up: # ceph-mds -i 0 -d This spews out so much information that I saved it in a logfile, added as an attachment. Kind regards, Erik. 2014-06-19 22:07:34.492328 7f3572f6e7c0 0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-mds, pid 2943 starting mds.0 at :/0 2014-06-19 22:07:35.793309 7f356dd88700 1 mds.-1.0 handle_mds_map standby 2014-06-19 22:07:35.876689 7f356dd88700 1 mds.0.15 handle_mds_map i am now mds.0.15 2014-06-19 22:07:35.876695 7f356dd88700 1 mds.0.15 handle_mds_map state change up:standby -- up:creating 2014-06-19 22:07:35.876931 7f356dd88700 0 mds.0.cache creating system inode with ino:1 2014-06-19 22:07:35.877204 7f356dd88700 0 mds.0.cache creating system inode with ino:100 2014-06-19 22:07:35.877209 7f356dd88700 0 mds.0.cache creating system inode with ino:600 2014-06-19 22:07:35.877369 7f356dd88700 0 mds.0.cache creating system inode with ino:601 2014-06-19 22:07:35.877455 7f356dd88700 0 mds.0.cache creating system inode with ino:602 2014-06-19 22:07:35.877519 7f356dd88700 0 mds.0.cache creating system inode with ino:603 2014-06-19 22:07:35.877566 7f356dd88700 0 mds.0.cache creating system inode with ino:604 2014-06-19 22:07:35.877606 7f356dd88700 0 mds.0.cache creating system inode with ino:605 2014-06-19 22:07:35.877683 7f356dd88700 0 mds.0.cache creating system inode with ino:606 
2014-06-19 22:07:35.877723 7f356dd88700 0 mds.0.cache creating system inode with ino:607 2014-06-19 22:07:35.877780 7f356dd88700 0 mds.0.cache creating system inode with ino:608 2014-06-19 22:07:35.877819 7f356dd88700 0 mds.0.cache creating system inode with ino:609 2014-06-19 22:07:35.877858 7f356dd88700 0 mds.0.cache creating system inode with ino:200 mds/CDir.cc: In function 'virtual void C_Dir_Committed::finish(int)' thread 7f356dd88700 time 2014-06-19 22:07:35.881337 mds/CDir.cc: 1809: FAILED assert(r == 0) ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) 1: ceph-mds() [0x75c6f1] 2: (Context::complete(int)+0x9) [0x56cff9] 3: (C_Gather::sub_finish(Context*, int)+0x1f7) [0x56e9a7] 4: (C_Gather::C_GatherSub::finish(int)+0x12) [0x56eab2] 5: (Context::complete(int)+0x9) [0x56cff9] 6: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf4e) [0x7d26ee] 7: (MDS::handle_core_message(Message*)+0xb1f) [0x58e5ef] 8: (MDS::_dispatch(Message*)+0x32) [0x58e7f2] 9: (MDS::ms_dispatch(Message*)+0xa3) [0x5901d3] 10: (DispatchQueue::entry()+0x57a) [0x99d9da] 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x8be63d] 12: (()+0x7c53) [0x7f3572366c53] 13: (clone()+0x6d) [0x7f3571257dbd] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 
2014-06-19 22:07:35.883239 7f356dd88700 -1 mds/CDir.cc: In function 'virtual void C_Dir_Committed::finish(int)' thread 7f356dd88700 time 2014-06-19 22:07:35.881337 mds/CDir.cc: 1809: FAILED assert(r == 0) ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) 1: ceph-mds() [0x75c6f1] 2: (Context::complete(int)+0x9) [0x56cff9] 3: (C_Gather::sub_finish(Context*, int)+0x1f7) [0x56e9a7] 4: (C_Gather::C_GatherSub::finish(int)+0x12) [0x56eab2] 5: (Context::complete(int)+0x9) [0x56cff9] 6: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf4e) [0x7d26ee] 7: (MDS::handle_core_message(Message*)+0xb1f) [0x58e5ef] 8: (MDS::_dispatch(Message*)+0x32) [0x58e7f2] 9: (MDS::ms_dispatch(Message*)+0xa3) [0x5901d3] 10: (DispatchQueue::entry()+0x57a) [0x99d9da] 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x8be63d] 12: (()+0x7c53) [0x7f3572366c53] 13: (clone()+0x6d) [0x7f3571257dbd] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. --- begin dump of recent events --- -144 2014-06-19 22:07:34.489920 7f3572f6e7c0 5 asok(0x1e0) register_command perfcounters_dump hook 0x1dc8010 -143 2014-06-19 22:07:34.489992 7f3572f6e7c0 5 asok(0x1e0) register_command 1 hook 0x1dc8010 -142 2014-06-19 22:07:34.490003 7f3572f6e7c0 5 asok(0x1e0) register_command perf dump hook 0x1dc8010 -141 2014-06-19 22:07:34.490015 7f3572f6e7c0 5 asok(0x1e0) register_command perfcounters_schema hook 0x1dc8010 -140 2014-06-19 22:07:34.490027 7f3572f6e7c0 5 asok(0x1e0) register_command 2 hook 0x1dc8010 -139 2014-06-19 22:07:34.490035 7f3572f6e7c0 5 asok(0x1e0)
Re: [ceph-users] Errors setting up erasure coded pool
Hi Loic, Yes I upgraded the cluster, it is definitely not new. I ran into several other issues as well, for instance legacy tunables that had to be changed. These are the directory contents you asked for: # ls -al /usr/lib64/ceph/erasure-code total 776 drwxr-xr-x. 2 root root 4096 12 jun 21:36 . drwxr-xr-x. 3 root root 4096 12 jun 21:36 .. lrwxrwxrwx. 1 root root 22 12 jun 21:36 libec_example.so -> libec_example.so.0.0.0 lrwxrwxrwx. 1 root root 22 12 jun 21:36 libec_example.so.0 -> libec_example.so.0.0.0 -rwxr-xr-x. 1 root root 381240 14 May 17:44 libec_example.so.0.0.0 lrwxrwxrwx. 1 root root 33 12 jun 21:36 libec_fail_to_initialize.so -> libec_fail_to_initialize.so.0.0.0 lrwxrwxrwx. 1 root root 33 12 jun 21:36 libec_fail_to_initialize.so.0 -> libec_fail_to_initialize.so.0.0.0 -rwxr-xr-x. 1 root root 6032 14 May 17:44 libec_fail_to_initialize.so.0.0.0 lrwxrwxrwx. 1 root root 31 12 jun 21:36 libec_fail_to_register.so -> libec_fail_to_register.so.0.0.0 lrwxrwxrwx. 1 root root 31 12 jun 21:36 libec_fail_to_register.so.0 -> libec_fail_to_register.so.0.0.0 -rwxr-xr-x. 1 root root 6032 14 May 17:44 libec_fail_to_register.so.0.0.0 lrwxrwxrwx. 1 root root 20 12 jun 21:36 libec_hangs.so -> libec_hangs.so.0.0.0 lrwxrwxrwx. 1 root root 20 12 jun 21:36 libec_hangs.so.0 -> libec_hangs.so.0.0.0 -rwxr-xr-x. 1 root root 6024 14 May 17:44 libec_hangs.so.0.0.0 lrwxrwxrwx. 1 root root 23 12 jun 21:36 libec_jerasure.so -> libec_jerasure.so.2.0.0 lrwxrwxrwx. 1 root root 23 12 jun 21:36 libec_jerasure.so.2 -> libec_jerasure.so.2.0.0 -rwxr-xr-x. 1 root root 364696 14 May 17:44 libec_jerasure.so.2.0.0 lrwxrwxrwx. 1 root root 34 12 jun 21:36 libec_missing_entry_point.so -> libec_missing_entry_point.so.0.0.0 lrwxrwxrwx. 1 root root 34 12 jun 21:36 libec_missing_entry_point.so.0 -> libec_missing_entry_point.so.0.0.0 -rwxr-xr-x. 1 root root 5944 14 May 17:44 libec_missing_entry_point.so.0.0.0 Kind regards, Erik. 
On 06/15/2014 09:47 AM, Loic Dachary wrote: Hi Erik, Did you upgrade the cluster or is it a new cluster ? Could you please ls -l /usr/lib64/ceph/erasure-code ? If you're connected on irc.oftc.net#ceph today feel free to ping me ( loicd ). Cheers On 14/06/2014 23:25, Erik Logtenberg wrote: Hi, I'm trying to set up an erasure coded pool, as described in the Ceph docs: http://ceph.com/docs/firefly/dev/erasure-coded-pool/ Unfortunately, creating a pool like that gives me the following error: # ceph osd pool create ecpool 12 12 erasure Error EINVAL: cannot determine the erasure code plugin because there is no 'plugin' entry in the erasure_code_profile {} failed to load plugin using profile default I noticed that the default profile appears to be empty: # ceph osd erasure-code-profile get default # I tried setting the default profile right, but I'm not allowed: # ceph osd erasure-code-profile set default plugin=jerasure technique=reed_sol_van k=2 m=1 Error EPERM: will not override erasure code profile default So I set a new profile with the same configuration: # ceph osd erasure-code-profile set k2m1 plugin=jerasure technique=reed_sol_van k=2 m=1 # ceph osd erasure-code-profile get k2m1 directory=/usr/lib64/ceph/erasure-code k=2 m=1 plugin=jerasure technique=reed_sol_van So, that worked :) However, I still can't use it to create a pool: # ceph osd pool create test123 100 100 erasure k2m1 Error EIO: failed to load plugin using profile k2m1 I double checked that the directory is correct, and it is: # ls -al /usr/lib64/ceph/erasure-code/libec_jerasure.so lrwxrwxrwx. 1 root root 23 12 jun 21:36 /usr/lib64/ceph/erasure-code/libec_jerasure.so -> libec_jerasure.so.2.0.0 It does contain the jerasure library, so now I'm at a loss. What am I doing wrong? By the way, this is ceph-0.80.1. Thanks, Erik. 
Re: [ceph-users] Errors setting up erasure coded pool
Hi Loic, I hope you can sort of read the log snippet below. 2014-06-15 13:20:22.593085 7effee3f0700 0 mon.0@0(leader) e1 handle_command mon_command({"pool_type": "erasure", "prefix": "osd pool create", "pg_num": 100, "erasure_code_profile": "k2m1", "pgp_num": 100, "pool": "test123"} v 0) v1 2014-06-15 13:20:22.593192 7effee3f0700 20 is_capable service=osd command=osd pool create read write on cap allow * 2014-06-15 13:20:22.593197 7effee3f0700 20 allow so far , doing grant allow * 2014-06-15 13:20:22.593208 7effee3f0700 20 allow all 2014-06-15 13:20:22.593209 7effee3f0700 10 mon.0@0(leader) e1 _allowed_command capable 2014-06-15 13:20:22.593212 7effee3f0700 1 mon.0@0(leader).paxos(paxos active c 17561..18177) is_readable now=2014-06-15 13:20:22.593212 lease_expire=0.00 has v0 lc 18177 2014-06-15 13:20:22.593216 7effee3f0700 10 mon.0@0(leader).osd e281 preprocess_query mon_command({"pool_type": "erasure", "prefix": "osd pool create", "pg_num": 100, "erasure_code_profile": "k2m1", "pgp_num": 100, "pool": "test123"} v 0) v1 from client.5815 192.168.1.15:0/1002158 2014-06-15 13:20:22.593544 7effee3f0700 7 mon.0@0(leader).osd e281 prepare_update mon_command({"pool_type": "erasure", "prefix": "osd pool create", "pg_num": 100, "erasure_code_profile": "k2m1", "pgp_num": 100, "pool": "test123"} v 0) v1 from client.5815 192.168.1.15:0/1002158 2014-06-15 13:20:22.593664 7effee3f0700 1 mon.0@0(leader).osd e281 implicitly use ruleset named after the pool: test123 2014-06-15 13:20:22.593801 7effee3f0700 -1 ErasureCodePluginSelectJerasure: load dlopen(/usr/lib64/ceph/erasure-code/libec_jerasure_sse3.so): /usr/lib64/ceph/erasure-code/libec_jerasure_sse3.so: cannot open shared object file: No such file or directory 2014-06-15 13:20:22.601143 7effee3f0700 10 mon.0@0(leader) e1 ms_handle_reset 0x1d2ca40 192.168.1.15:0/1002158 It's quite obvious what goes wrong here: ceph can't find the file /usr/lib64/ceph/erasure-code/libec_jerasure_sse3.so And that's correct. 
I only have: libec_jerasure.so.2.0.0 and two symlinks to this library, called: libec_jerasure.so -> libec_jerasure.so.2.0.0 libec_jerasure.so.2 -> libec_jerasure.so.2.0.0 I don't know why it's looking for an sse3 optimized version (?) of this library and/or why that library is missing. My cpu does have the sse, sse2 and ssse3 flags by the way. Kind regards, Erik. P.S. How do I reset the verbosity of the logging? ;) On 06/15/2014 10:12 AM, Loic Dachary wrote: Hi, It would also help if you could add ceph tell mon.'*' injectargs '--debug-mon 20' check that it is set (replace mon.a with mon.) ceph daemon mon.a config get debug_mon { "debug_mon": "20/20" } and check what the logs have after you get ceph osd pool create test123 100 100 erasure k2m1 Error EIO: failed to load plugin using profile k2m1 I think you are facing two different problems. A) the empty default profile is probably a combination of upgrading an existing cluster (in which case maybe the default profile is not created) and the implicit creation of an empty profile (which should not happen and has been fixed in http://tracker.ceph.com/issues/8599 but not yet released). B) the jerasure plugin fails to load for some reason and the mon logs should tell us why Cheers On 15/06/2014 09:47, Loic Dachary wrote: Hi Erik, Did you upgrade the cluster or is it a new cluster ? Could you please ls -l /usr/lib64/ceph/erasure-code ? If you're connected on irc.oftc.net#ceph today feel free to ping me ( loicd ). 
Cheers On 14/06/2014 23:25, Erik Logtenberg wrote: Hi, I'm trying to set up an erasure coded pool, as described in the Ceph docs: http://ceph.com/docs/firefly/dev/erasure-coded-pool/ Unfortunately, creating a pool like that gives me the following error: # ceph osd pool create ecpool 12 12 erasure Error EINVAL: cannot determine the erasure code plugin because there is no 'plugin' entry in the erasure_code_profile {} failed to load plugin using profile default I noticed that the default profile appears to be empty: # ceph osd erasure-code-profile get default # I tried setting the default profile right, but I'm not allowed: # ceph osd erasure-code-profile set default plugin=jerasure technique=reed_sol_van k=2 m=1 Error EPERM: will not override erasure code profile default So I set a new profile with the same configuration: # ceph osd erasure-code-profile set k2m1 plugin=jerasure technique=reed_sol_van k=2 m=1 # ceph osd erasure-code-profile get k2m1 directory=/usr/lib64/ceph/erasure-code k=2 m=1 plugin=jerasure technique=reed_sol_van So, that worked :) However, I still can't use it to create a pool: # ceph osd pool create test123 100 100 erasure k2m1 Error EIO: failed to load plugin using profile k2m1 I double checked that the directory is correct, and it is: # ls -al /usr/lib64/ceph/erasure-code/libec_jerasure.so lrwxrwxrwx. 1 root root 23 12 jun 21:36 /usr/lib64/ceph/erasure-code/libec_jerasure.so -> libec_jerasure.so.2.0.0 It does
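The proper fix for a missing optimized plugin build is a corrected package, but as a stop-gap some users symlink the sse3 name at the generic library. This is a hedged sketch only: it assumes the generic jerasure plugin is ABI-compatible under the name the loader dlopens, and it is demonstrated in a scratch directory rather than in the real /usr/lib64/ceph/erasure-code:

```shell
# Recreate the layout in a temporary directory; the filenames stand in for
# the real /usr/lib64/ceph/erasure-code/libec_jerasure* files.
dir=$(mktemp -d)
touch "$dir/libec_jerasure.so.2.0.0"
# Point the sse3 name (which the plugin selector tries to dlopen) at the
# generic build.
ln -s libec_jerasure.so.2.0.0 "$dir/libec_jerasure_sse3.so"
ls -l "$dir"
```

On a real system the symlink would be created inside /usr/lib64/ceph/erasure-code before retrying the pool creation; if the loader still fails, the package itself lacks the optimized builds and should be reinstalled from a fixed build.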
[ceph-users] Permissions spontaneously changing in cephfs
Hi, I ran into a weird issue with cephfs today. I create a directory like this: # mkdir bla # ls -al drwxr-xr-x 1 root root 0 14 jun 22:22 bla Now on another host, with the same cephfs mounted, I see different permissions: # ls -al drwxrwxrwx 1 root root 0 14 jun 22:22 bla Weird, huh? Back to host #1, I unmount cephfs and mount it again. Now it sees the same (changed) permissions as I saw on the second host: # ls -al drwxrwxrwx 1 root root 0 14 jun 22:22 bla So... what happened to the original permissions and why did they change? Thanks, Erik.
Re: [ceph-users] Permissions spontaneously changing in cephfs
Hi, So... I wrote some files into that directory to test performance, and now I notice that both hosts see the permissions the right way, like they were when I first created the directory. What is going on here? .. Erik. On 06/14/2014 10:32 PM, Erik Logtenberg wrote: Hi, I ran into a weird issue with cephfs today. I create a directory like this: # mkdir bla # ls -al drwxr-xr-x 1 root root 0 14 jun 22:22 bla Now on another host, with the same cephfs mounted, I see different permissions: # ls -al drwxrwxrwx 1 root root 0 14 jun 22:22 bla Weird, huh? Back to host #1, I unmount cephfs and mount it again. Now it sees the same (changed) permissions as I saw on the second host: # ls -al drwxrwxrwx 1 root root 0 14 jun 22:22 bla So... what happened to the original permissions and why did they change? Thanks, Erik.
[ceph-users] Errors setting up erasure coded pool
Hi, I'm trying to set up an erasure coded pool, as described in the Ceph docs: http://ceph.com/docs/firefly/dev/erasure-coded-pool/ Unfortunately, creating a pool like that gives me the following error: # ceph osd pool create ecpool 12 12 erasure Error EINVAL: cannot determine the erasure code plugin because there is no 'plugin' entry in the erasure_code_profile {} failed to load plugin using profile default I noticed that the default profile appears to be empty: # ceph osd erasure-code-profile get default # I tried setting the default profile right, but I'm not allowed: # ceph osd erasure-code-profile set default plugin=jerasure technique=reed_sol_van k=2 m=1 Error EPERM: will not override erasure code profile default So I set a new profile with the same configuration: # ceph osd erasure-code-profile set k2m1 plugin=jerasure technique=reed_sol_van k=2 m=1 # ceph osd erasure-code-profile get k2m1 directory=/usr/lib64/ceph/erasure-code k=2 m=1 plugin=jerasure technique=reed_sol_van So, that worked :) However, I still can't use it to create a pool: # ceph osd pool create test123 100 100 erasure k2m1 Error EIO: failed to load plugin using profile k2m1 I double checked that the directory is correct, and it is: # ls -al /usr/lib64/ceph/erasure-code/libec_jerasure.so lrwxrwxrwx. 1 root root 23 12 jun 21:36 /usr/lib64/ceph/erasure-code/libec_jerasure.so -> libec_jerasure.so.2.0.0 It does contain the jerasure library, so now I'm at a loss. What am I doing wrong? By the way, this is ceph-0.80.1. Thanks, Erik.
[ceph-users] Recommended way to use Ceph as storage for file server
Hi, In March 2013 Greg wrote an excellent blog posting regarding the (then) current status of MDS/CephFS and the plans for going forward with development. http://ceph.com/dev-notes/cephfs-mds-status-discussion/ Since then, I understand progress has been slow, and Greg confirmed that he didn't want to commit to any release date yet, when I asked him for an update earlier this year. CephFS appears to be a more or less working product, does receive stability fixes every now and then, but I don't think Inktank would call it production ready. So my question is: I would like to use Ceph as a storage for files, as a fileserver or at least as a backend to my fileserver. What is the recommended way to do this? A more or less obvious alternative for CephFS would be to simply create a huge RBD and have a separate file server (running NFS / Samba / whatever) use that block device as backend. Just put a regular FS on top of the RBD and use it that way. Clients wouldn't really have any of the real performance and resilience benefits that Ceph could offer though, because the (single machine?) file server is now the bottleneck. Any advice / best practice would be greatly appreciated. Any real-world experience with current CephFS as well. Kind regards, Erik.
Re: [ceph-users] Idle OSD's keep using a lot of CPU
Hi, I think the high CPU usage was due to the system time not being right. I activated ntp and it had to do quite a big adjustment, and after that the high CPU usage was gone. Anyway, I immediately ran into another issue. I ran a simple benchmark: # rados bench --pool benchmark 300 write --no-cleanup During the benchmark, one of my osd's went down. I checked the logs and apparently there was no hardware failure (the disk is still nicely mounted and the osd is still running), but the logfile fills up rapidly with these messages: 2013-08-02 00:03:40.014982 7fe7336fd700 0 -- 192.168.1.15:6801/1229 >> 192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36884 s=2 pgs=86874 cs=173547 l=0).fault, initiating reconnect 2013-08-02 00:03:40.016682 7fe7336fd700 0 -- 192.168.1.15:6801/1229 >> 192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36885 s=2 pgs=86875 cs=173549 l=0).fault, initiating reconnect 2013-08-02 00:03:40.019241 7fe7336fd700 0 -- 192.168.1.15:6801/1229 >> 192.168.1.16:6801/3001 pipe(0x39e9680 sd=28 :36886 s=2 pgs=86876 cs=173551 l=0).fault, initiating reconnect What could be wrong here? Kind regards, Erik. On 08/01/2013 08:00 AM, Dan Mick wrote: Logging might well help. http://ceph.com/docs/master/rados/troubleshooting/log-and-debug/ On 07/31/2013 03:51 PM, Erik Logtenberg wrote: Hi, I just added a second node to my ceph test platform. The first node has a mon and three osd's, the second node only has three osd's. Adding the osd's was pretty painless, and ceph distributed the data from the first node evenly over both nodes so everything seems to be fine. 
The monitor also thinks everything is fine: 2013-08-01 00:41:12.719640 mon.0 [INF] pgmap v1283: 292 pgs: 292 active+clean; 9264 MB data, 24826 MB used, 5541 GB / 5578 GB avail Unfortunately, the three osd's on the second node keep eating a lot of cpu, while there is no activity whatsoever: PID USER VIRT RES SHR S %CPU %MEM TIME+ COMMAND 21272 root 441440 34632 7848 S 61.8 0.9 4:08.62 ceph-osd 21145 root 439852 29316 8360 S 60.4 0.7 4:04.31 ceph-osd 21036 root 443828 31324 8336 S 60.1 0.8 4:07.55 ceph-osd Any idea why that is and how I can even ask an osd what it's doing? There is no corresponding hdd activity, it's only cpu and hardly any memory usage. Also the monitor on the first node is doing the same thing: PID USER VIRT RES SHR S %CPU %MEM TIME+ COMMAND 12825 root 186900 23492 5540 S 141.1 0.590 9:47.64 ceph-mon I tried stopping the three osd's: that makes the monitor calm down, but after restarting the osd's, the monitor resumes its cpu usage. I also tried stopping the monitor, which makes the three osd's calm down, but once again they will start eating cpu again as soon as the monitor is back online. In the mean time, the first three osd's, the ones on the same machine as the monitor, don't behave like this at all. Currently as there is no activity, they are just idling on low cpu usage, as expected. Kind regards, Erik.
[ceph-users] Idle OSD's keep using a lot of CPU
Hi, I just added a second node to my ceph test platform. The first node has a mon and three osd's, the second node only has three osd's. Adding the osd's was pretty painless, and ceph distributed the data from the first node evenly over both nodes so everything seems to be fine. The monitor also thinks everything is fine:

2013-08-01 00:41:12.719640 mon.0 [INF] pgmap v1283: 292 pgs: 292 active+clean; 9264 MB data, 24826 MB used, 5541 GB / 5578 GB avail

Unfortunately, the three osd's on the second node keep eating a lot of cpu, while there is no activity whatsoever:

  PID USER  VIRT    RES    SHR   S  %CPU  %MEM  TIME+    COMMAND
21272 root  441440  34632  7848  S  61.8  0.9   4:08.62  ceph-osd
21145 root  439852  29316  8360  S  60.4  0.7   4:04.31  ceph-osd
21036 root  443828  31324  8336  S  60.1  0.8   4:07.55  ceph-osd

Any idea why that is, and how I can even ask an osd what it's doing? There is no corresponding hdd activity; it's only cpu and hardly any memory usage. Also the monitor on the first node is doing the same thing:

  PID USER  VIRT    RES    SHR   S  %CPU   %MEM   TIME+    COMMAND
12825 root  186900  23492  5540  S  141.1  0.590  9:47.64  ceph-mon

I tried stopping the three osd's: that makes the monitor calm down, but after restarting the osd's, the monitor resumes its cpu usage. I also tried stopping the monitor, which makes the three osd's calm down, but once again they start eating cpu as soon as the monitor is back online. In the meantime, the first three osd's, the ones on the same machine as the monitor, don't behave like this at all. Currently, as there is no activity, they are just idling on low cpu usage, as expected. Kind regards, Erik.
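[Not part of the original exchange, but one way to "ask an osd what it's doing" is its admin socket. A sketch, assuming the default socket path and using osd.0 as a placeholder id; adjust to whichever osd owns a busy PID from the top output:]

```shell
# Query a running OSD through its admin socket (default path shown;
# substitute the osd id of one of the busy ceph-osd processes):
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight

# At the OS level, a quick syscall profile of a busy daemon
# (PID 21272 taken from the top output above):
strace -c -f -p 21272
```

`perf dump` shows the internal counters (queue lengths, op latencies), and `dump_ops_in_flight` shows whether the daemon is actually processing client ops or just spinning; an empty in-flight list plus high CPU points at internal work (heartbeats, map processing) rather than I/O.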
Re: [ceph-users] Fwd: Small fix for ceph.spec
Hi, Fedora, in this case Fedora 19, x86_64. Kind regards, Erik. On 07/30/2013 09:29 AM, Danny Al-Gaaf wrote: Hi, I think this is a bug in the packaging of the leveldb package in this case, since the spec-file already sets dependencies on leveldb-devel. leveldb depends on snappy, therefore the leveldb package should set a dependency on snappy-devel for leveldb-devel (check the SUSE spec file for leveldb: https://build.opensuse.org/package/view_file/home:dalgaaf:ceph:extra/leveldb/leveldb.spec?expand=1). This way the RPM build process will pick up the correct packages needed to build ceph. Which distro do you use? Danny On 30.07.2013 01:33, Patrick McGarry wrote: -- Forwarded message -- From: Erik Logtenberg e...@logtenberg.eu Date: Mon, Jul 29, 2013 at 7:07 PM Subject: [ceph-users] Small fix for ceph.spec To: ceph-users@lists.ceph.com Hi, The spec file used for building RPMs misses a build-time dependency on snappy-devel. Please see attached patch to fix. Kind regards, Erik.
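[The attached patch itself is not preserved in the archive, but a fix of the shape Erik describes would be a one-line addition to ceph.spec; this fragment is illustrative only, and the exact placement in the real spec file may differ:]

```spec
# ceph.spec -- illustrative fragment, not the actual attachment
BuildRequires:  leveldb-devel
BuildRequires:  snappy-devel
```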
Re: [ceph-users] Fwd: Small fix for ceph.spec
Hi, I will report the issue there as well. Please note that Ceph seems to support Fedora 17, even though that release is considered end-of-life by Fedora. This issue with the leveldb package cannot be fixed for Fedora 17, only for 18 and 19. So if Ceph wants to continue supporting Fedora 17, adding this workaround seems to be the only way to get this (rather minor) bug fixed. Kind regards, Erik. On 07/30/2013 09:56 AM, Danny Al-Gaaf wrote: Hi, then the Fedora package is broken. If you check the spec file of: http://dl.fedoraproject.org/pub/fedora/linux/updates/19/SRPMS/leveldb-1.12.0-3.fc19.src.rpm you can see the spec-file sets a: BuildRequires: snappy-devel but not the corresponding Requires: snappy-devel for the devel package. You should report this issue to your distribution; it needs to be fixed there instead of adding a workaround to the ceph spec. Regards, Danny On 30.07.2013 09:42, Erik Logtenberg wrote: Hi, Fedora, in this case Fedora 19, x86_64. Kind regards, Erik. On 07/30/2013 09:29 AM, Danny Al-Gaaf wrote: Hi, I think this is a bug in the packaging of the leveldb package in this case, since the spec-file already sets dependencies on leveldb-devel. leveldb depends on snappy, therefore the leveldb package should set a dependency on snappy-devel for leveldb-devel (check the SUSE spec file for leveldb: https://build.opensuse.org/package/view_file/home:dalgaaf:ceph:extra/leveldb/leveldb.spec?expand=1). This way the RPM build process will pick up the correct packages needed to build ceph. Which distro do you use? Danny On 30.07.2013 01:33, Patrick McGarry wrote: -- Forwarded message -- From: Erik Logtenberg e...@logtenberg.eu Date: Mon, Jul 29, 2013 at 7:07 PM Subject: [ceph-users] Small fix for ceph.spec To: ceph-users@lists.ceph.com Hi, The spec file used for building RPMs misses a build-time dependency on snappy-devel. Please see attached patch to fix. Kind regards, Erik.
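[For reference, the fix Danny describes on the leveldb side would be a Requires on the -devel subpackage. This is a sketch only; the real Fedora spec's subpackage section is laid out differently, and the version tag shown follows common packaging convention:]

```spec
# leveldb.spec -- sketch of the missing dependency on the -devel subpackage
%package devel
Summary:  Development files for leveldb
Requires: %{name}%{?_isa} = %{version}-%{release}
Requires: snappy-devel
```

With that Requires in place, installing leveldb-devel pulls in snappy-devel automatically, so downstream builds like ceph no longer need to declare it themselves.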
Re: [ceph-users] v0.66 released
* osd: pg log (re)writes are not vastly more efficient (faster peering) (Sam Just) Do you really mean "are not"? I'd think "are now" would make sense (?) - Erik.