Re: [ceph-users] Is there a limit for object size in CephFS?
just tried the 4.0 kernel, and still did not encounter any problem. Please run the test again; when the test hangs, check /sys/kernel/debug/ceph/*/mdsc and /sys/kernel/debug/ceph/*/osdc to find which request is hung. By the way, do you have cephfs mounted on a host which also runs ceph-osd/ceph-mds?

On Wed, Aug 12, 2015 at 11:12 PM, Hadi Montakhabi h...@cs.uh.edu wrote:

4.0.6-300.fc22.x86_64

On Tue, Aug 11, 2015 at 10:24 PM, Yan, Zheng uker...@gmail.com wrote:
On Wed, Aug 12, 2015 at 5:33 AM, Hadi Montakhabi h...@cs.uh.edu wrote:

[sequential read]
readwrite=read
size=2g
directory=/mnt/mycephfs
ioengine=libaio
direct=1
blocksize=${BLOCKSIZE}
numjobs=1
iodepth=1
invalidate=1 # causes the kernel buffer and page cache to be invalidated
#nrfiles=1

[sequential write]
readwrite=write # randread randwrite
size=2g
directory=/mnt/mycephfs
ioengine=libaio
direct=1
blocksize=${BLOCKSIZE}
numjobs=1
iodepth=1
invalidate=1

[random read]
readwrite=randread
size=2g
directory=/mnt/mycephfs
ioengine=libaio
direct=1
blocksize=${BLOCKSIZE}
numjobs=1
iodepth=1
invalidate=1

[random write]
readwrite=randwrite
size=2g
directory=/mnt/mycephfs
ioengine=libaio
direct=1
blocksize=${BLOCKSIZE}
numjobs=1
iodepth=1
invalidate=1

I just tried a 4.2-rc kernel and everything went well. Which version of the kernel were you using?

On Sun, Aug 9, 2015 at 9:27 PM, Yan, Zheng uker...@gmail.com wrote:
On Sun, Aug 9, 2015 at 8:57 AM, Hadi Montakhabi h...@cs.uh.edu wrote:

I am using fio. I use the kernel module to mount CephFS.

Please send the fio job file to us.

On Aug 8, 2015 10:52 AM, Ketor D d.ke...@gmail.com wrote:

Hi Hadi, which bench tool do you use? And how do you mount CephFS, ceph-fuse or kernel cephfs?

On Fri, Aug 7, 2015 at 11:50 PM, Hadi Montakhabi h...@cs.uh.edu wrote:

Hello Cephers, I am benchmarking CephFS. In one of my experiments I change the object size, starting from 64KB. Each time I run reads and writes with different block sizes. After increasing the object size to 64MB and the block size to 64MB, CephFS crashes (shown in the chart below). What I mean by "crash" is that when I run ceph -s or ceph -w, it constantly reports reads to me, but the operation never finishes (even after a few days!). I have repeated this experiment with different underlying file systems (xfs and btrfs), and the same thing happens in both cases. What could be the reason for CephFS crashing? Is there a limit on object size in CephFS? Thank you, Hadi

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
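For reference, the in-flight request check Zheng describes, and the CephFS layout controls used to vary object size in an experiment like this one, look roughly as follows. This is a sketch only: debugfs must be mounted, the 64MB value simply mirrors the experiment in this thread, and the exact layout vxattr names can differ between releases.

# see which MDS/OSD requests are still in flight while the mount is hung
cat /sys/kernel/debug/ceph/*/mdsc
cat /sys/kernel/debug/ceph/*/osdc

# inspect / set the object size used for new files in a directory (bytes)
getfattr -n ceph.dir.layout /mnt/mycephfs
setfattr -n ceph.dir.layout.object_size -v 67108864 /mnt/mycephfs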
Re: [ceph-users] Cache tier best practices
Thanks Nick for your suggestion. Can you also tell me how I can reduce the RBD block size to 512K or 1M? Do I need to put something in the client's ceph.conf, and if so, what parameter do I need to set? Thanks once again - Vickey

On Wed, Aug 12, 2015 at 4:49 PM, Nick Fisk n...@fisk.me.uk wrote:

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dominik Zalewski
Sent: 12 August 2015 14:40
To: ceph-us...@ceph.com
Subject: [ceph-users] Cache tier best practices

Hi, I would like to hear from people who use a cache tier in Ceph about best practices and things I should avoid. I remember hearing that it wasn't that stable back then. Has that changed in the Hammer release?

It's not so much the stability as the performance. If your working set will sit mostly in the cache tier and won't tend to change, then you might be alright. Otherwise you will find that performance is very poor. The only tip I can really give is that I have found dropping the RBD block size down to 512KB-1MB helps quite a bit, as it makes the cache more effective and also minimises the amount of data transferred on each promotion/flush.

Any tips and tricks are much appreciated! Thanks, Dominik

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] mds server(s) crashed
On Thu, Aug 13, 2015 at 5:12 AM, Bob Ababurko b...@ababurko.net wrote:

I am actually looking for the most stable way to implement cephfs at this point. My cephfs cluster contains millions of small files, so many inodes, if that needs to be taken into account. Perhaps I should only be using one MDS node for stability at this point? Is this the best way forward to get a handle on stability? I'm also curious whether I should set my mds cache size to a number greater than the number of files I have in the cephfs cluster? If you can give some key points for configuring cephfs to get the best stability and, if possible, availability, this would be helpful to me.

One active MDS is the most stable setup. Adding a few standby MDSes should not hurt stability. You can't set mds cache size to a number greater than the number of files in the fs; it requires lots of memory.

I'm not sure what amount of RAM you consider to be 'lots', but I would really like to understand a bit more about this. Perhaps a rule of thumb? Is there an advantage to more RAM / a larger mds cache size? We plan on putting close to a billion small files in this pool via cephfs, so what should we be considering when sizing our MDS hosts or changing the MDS config? Basically, what should we or should we not be doing when we have a cluster with this many files? Thanks!

The advantages of setting up a larger cache are:
* We can allow clients to hold more in cache (anything in client cache must also be in MDS cache)
* We are less likely to need to read from disk on a random metadata read
* We are less likely to need to write to disk again if a file was modified (we can just journal it and update it in cache)

None of these outcomes is particularly relevant if your workload is a stream of a billion creates. The reason we're hitting the cache size limit in this case is the size of the directories: some operations during restart of the MDS happen at a per-directory level of granularity.

If you're running up to deploying a billion-file workload, it might be worth doing some experiments on a smaller system with the same file hierarchy structure. You could experiment with enabling inline data, tuning mds_bal_split_size (how large dirs grow before getting fragmented) and mds_cache_size, and see what effect these options have on the rate of file creates you can sustain. For best results, also periodically kill an MDS during a run, to check that the system recovers correctly (i.e. check for bugs like the one you've just hit).

As for the most stable configuration, the CephFS for early adopters page [1] is still current. Enabling inline data and/or directory fragmentation will put you in slightly riskier territory (aka less comprehensively tested by us), but if you can check that the filesystem works correctly for your workload in a POC, then that's the most important measure of whether it's suitable for you to deploy.

John

1. http://ceph.com/docs/master/cephfs/early-adopters/

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
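For anyone wanting to experiment along the lines John suggests, these are ordinary ceph.conf options on the MDS host. The values below are a sketch only (the Hammer defaults are mds cache size = 100000 inodes and mds bal split size = 10000 dirents), not sized recommendations:

[mds]
    mds cache size = 4000000      # max cached inodes; budget roughly a few KB of RAM per inode
    mds bal frag = true           # allow large directories to be fragmented
    mds bal split size = 10000    # dirents at which a directory fragment is split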
Re: [ceph-users] mds server(s) crashed
On Thu, Aug 13, 2015 at 3:29 AM, yangyongp...@bwstor.com.cn yangyongp...@bwstor.com.cn wrote:

I also encountered a problem: a standby MDS cannot be promoted to active when the active MDS service is stopped, which has bothered me for several days. Maybe an MDS cluster could solve those problems, but the ceph team hasn't released this feature.

That sounds like an unrelated issue -- can you give us more details, like the output of ceph status? (possibly in a tracker.ceph.com ticket)

John

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
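A minimal set of commands for gathering the details John asks for (all standard hammer-era CLI):

ceph status      # overall cluster health, including the mdsmap line
ceph mds stat    # compact view of which MDS is active vs. standby
ceph mds dump    # full mdsmap, handy to attach to a tracker.ceph.com ticket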
Re: [ceph-users] Geographical Replication and Disaster Recovery Support
Hi. That document applies only to RadosGW. For data (RBD) replication, you want this blueprint instead: https://wiki.ceph.com/Planning/Blueprints/Hammer/RBD%3A_Mirroring

Best regards,
Irek Fasikhov
Mob.: +79229045757

2015-08-13 11:40 GMT+03:00 Özhan Rüzgar Karaman oruzgarkara...@gmail.com:

[quoted question trimmed; the original message follows this one in the digest]

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Geographical Replication and Disaster Recovery Support
Hi; I would like to learn about Ceph's geographical replication and disaster recovery options. I know that we currently do not have a built-in, official geo-replication or disaster recovery; there are some third-party tools like drbd, but they are not the kind of solution a business needs. I have also read the RGW document on the Ceph wiki: https://wiki.ceph.com/Planning/Blueprints/Dumpling/RGW_Geo-Replication_and_Disaster_Recovery

That document is from the Dumpling release, around 2013. Do we have any active work or effort to bring disaster recovery or geographical replication features to Ceph? Is it on the current road map?

Thanks
Özhan KARAMAN

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CEPH cache layer. Very slow
So, after testing the SSD (I wiped one SSD and used it for tests):

root@ix-s2:~# sudo fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
fio-2.1.3
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0KB/1152KB/0KB /s] [0/288/0 iops] [eta 00m:00s]
journal-test: (groupid=0, jobs=1): err= 0: pid=2849460: Thu Aug 13 10:46:42 2015
  write: io=68972KB, bw=1149.6KB/s, iops=287, runt= 60001msec
    clat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
     lat (msec): min=2, max=15, avg= 3.48, stdev= 1.08
    clat percentiles (usec):
     |  1.00th=[ 2704],  5.00th=[ 2800], 10.00th=[ 2864], 20.00th=[ 2928],
     | 30.00th=[ 3024], 40.00th=[ 3088], 50.00th=[ 3280], 60.00th=[ 3408],
     | 70.00th=[ 3504], 80.00th=[ 3728], 90.00th=[ 3856], 95.00th=[ 4016],
     | 99.00th=[ 9024], 99.50th=[ 9280], 99.90th=[ 9792], 99.95th=[10048],
     | 99.99th=[14912]
    bw (KB /s): min= 1064, max= 1213, per=100.00%, avg=1150.07, stdev=34.31
    lat (msec) : 4=94.99%, 10=4.96%, 20=0.05%
  cpu          : usr=0.13%, sys=0.57%, ctx=17248, majf=0, minf=7
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=17243/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=68972KB, aggrb=1149KB/s, minb=1149KB/s, maxb=1149KB/s, mint=60001msec, maxt=60001msec

Disk stats (read/write):
  sda: ios=0/17224, merge=0/0, ticks=0/59584, in_queue=59576, util=99.30%

So, it's painful... the SSD does only 287 iops at 4K, about 1.1 MB/s. I tried to change the cache mode:

echo "temporary write through" > /sys/class/scsi_disk/2:0:0:0/cache_type
echo "temporary write through" > /sys/class/scsi_disk/3:0:0:0/cache_type

No luck, still the same poor results. I also found this article: https://lkml.org/lkml/2013/11/20/264 which points to an old, very simple patch that disables CMD_FLUSH: https://gist.github.com/TheCodeArtist/93dddcd6a21dc81414ba

Does anybody have better ideas on how to improve this (or how to disable CMD_FLUSH without recompiling the kernel)? I use Ubuntu with kernel 4.0.4 for now (the 4.x branch because the SSD 850 Pro has an issue with NCQ TRIM, and before 4.0.4 this exception was not included in libata-core.c).

2015-08-12 19:17 GMT+03:00 Pieter Koorts pieter.koo...@me.com:

Hi Igor

I suspect you have very much the same problem as me. https://www.mail-archive.com/ceph-users@lists.ceph.com/msg22260.html

Basically Samsung drives (like many SATA SSDs) are very much hit and miss, so you will need to test them as described here to see if they are any good: http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

To give you an idea, my average write performance went from 11MB/s (with a Samsung SSD) to 30MB/s (without any SSD). This is a very small cluster.

Pieter

On Aug 12, 2015, at 04:33 PM, Voloshanenko Igor igor.voloshane...@gmail.com wrote:

Hi all, we have set up a CEPH cluster with 60 OSDs of 2 different types (5 nodes, 12 disks on each: 10 HDD, 2 SSD). We also cover this with a custom crushmap with 2 root leaves:

ID   WEIGHT TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY
-100 5.0    root ssd
-102 1.0        host ix-s2-ssd
2    1.0            osd.2          up      1.0      1.0
9    1.0            osd.9          up      1.0      1.0
-103 1.0        host ix-s3-ssd
3    1.0            osd.3          up      1.0      1.0
7    1.0            osd.7          up      1.0      1.0
-104 1.0        host ix-s5-ssd
1    1.0            osd.1          up      1.0      1.0
6    1.0            osd.6          up      1.0      1.0
-105 1.0        host ix-s6-ssd
4    1.0            osd.4          up      1.0      1.0
8    1.0            osd.8          up      1.0      1.0
-106 1.0        host ix-s7-ssd
0    1.0            osd.0          up      1.0      1.0
5    1.0            osd.5          up      1.0      1.0
-1   5.0    root platter
-2   1.0        host ix-s2-platter
13   1.0            osd.13         up      1.0      1.0
17   1.0            osd.17         up      1.0      1.0
21   1.0            osd.21         up      1.0      1.0
27   1.0            osd.27         up      1.0      1.0
32   1.0            osd.32         up      1.0      1.0
37   1.0            osd.37         up      1.0      1.0
44   1.0            osd.44         up      1.0      1.0
48   1.0            osd.48         up      1.0      1.0
55   1.0            osd.55         up      1.0      1.0
Re: [ceph-users] CEPH cache layer. Very slow
Hi, Igor. Try applying the patch here: http://www.theirek.com/blog/2014/02/16/patch-dlia-raboty-s-enierghoniezavisimym-keshiem-ssd-diskov

P.S. I no longer track changes in this direction (the kernel), because we already use the recommended SSDs.

Best regards,
Irek Fasikhov
Mob.: +79229045757

2015-08-13 11:56 GMT+03:00 Voloshanenko Igor igor.voloshane...@gmail.com:

[quoted fio results and cluster description trimmed; see the previous message in this thread]
[ceph-users] Change protection/profile from a erasure coded pool
Hello everyone,

Today I have a cluster with 4 hosts, and I created a pool that uses the erasure code profile below:

##
directory=/usr/lib/ceph/erasure-code
k=3
m=1
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van
##

This cluster is used only for RGW, and I'm planning to add a new host to it. With the new node I'd like to change this pool to use another profile with k=3, m=2, to increase the protection of the data. I keep in mind that I'll need to create a new pool with the different profile and move all objects from the previous pool to the new one, because the profile of an already-created pool cannot be modified. I'd like to know if someone has already done this, and whether anyone has recommendations on the best way to get it done?

Regards.

Italo Santos
http://italosantos.com.br/

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
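A rough sketch of one way to do that migration. The profile name, pool names and PG counts below are placeholders; 'rados cppool' exists in hammer but copies sequentially, does not carry snapshots, and should be rehearsed on non-production data first, and RGW's zone/pool configuration must end up pointing at a pool with the expected name:

ceph osd erasure-code-profile set ec-k3m2 k=3 m=2 plugin=jerasure technique=reed_sol_van ruleset-failure-domain=host
ceph osd pool create mypool.new 256 256 erasure ec-k3m2
rados cppool mypool mypool.new
ceph osd pool rename mypool mypool.old
ceph osd pool rename mypool.new mypool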
[ceph-users] rbd map failed
Hi,

I have a CEPH cluster running on 4 physical servers; the cluster is up and healthy. So far I have been unable to connect any client to the cluster using krbd or the fio rbd plugin. My clients can see and create images in the rbd pool but cannot map them:

root@r-dcs68 ~ # rbd ls
fio_test
foo
foo1
foo_test

root@r-dcs68 ~ # rbd map foo
rbd: sysfs write failed
rbd: map failed: (95) Operation not supported

Using strace I see that there are no write permissions to /sys/bus/rbd/add:

root@r-dcs68 ~ # echo "192.168.57.102:16789 name=admin,key=client.admin rbd foo -" > /sys/bus/rbd/add
-bash: echo: write error: Operation not permitted

Any idea why I can't map?

Thanks
Adir

-- here's some info about my client:

root@r-dcs68 ~ # uname -r
4.1.0
root@r-dcs68 ~ # cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.0 (Maipo)
root@r-dcs68 ~ # lsmod | grep rbd
rbd        62776  0
libceph   236956  1 rbd

and the tail of the log:

[   49.303330] Key type id_resolver registered
[   49.306638] Key type id_legacy registered
[   70.471743] Key type ceph registered
[   70.474482] libceph: loaded (mon/osd proto 15/24)
[   70.479873] rbd: loaded
[  114.968597] Loading iSCSI transport class v2.0-870.
[  114.975685] iscsi: registered transport (iser)
[  146.061478] ib0: sendonly multicast join failed for ff12:401b::::::0016, status -22
[  148.076070] ib0: sendonly multicast join failed for ff12:401b::::::0016, status -22
[  190.207159] libceph: client4407 fsid 0169d615-9b5b-432f-9241-4aadf71be9cc
[  190.213774] libceph: mon0 192.168.57.102:16789 session established

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CEPH cache layer. Very slow
I tested and can recommend the Samsung 845 DC PRO (make sure it is the DC PRO and not just the PRO or the DC EVO!). Those were very cheap but are out of stock at the moment (here). Faster than Intels, cheaper, and a slightly different technology (3D V-NAND) which IMO makes them superior without needing many tricks to do their job.

Jan

On 13 Aug 2015, at 14:40, Voloshanenko Igor igor.voloshane...@gmail.com wrote:

Thanks, Irek! Will try! But another question to all: which SSDs are good enough for CEPH now? I'm looking into the S3500 240G (I have some S3500 120G which show great results, around 8x better than the Samsungs). Could you possibly advise other vendors/models at the same or lower price level as the S3500 240G?

2015-08-13 12:11 GMT+03:00 Irek Fasikhov malm...@gmail.com:

[earlier quoted messages in this thread trimmed]
Re: [ceph-users] CEPH cache layer. Very slow
So, good, but the price for the 845 DC PRO 400 GB is about 2x higher than the Intel S3500 240G ((( Any other models? (((

2015-08-13 15:45 GMT+03:00 Jan Schermer j...@schermer.cz:

[earlier quoted messages in this thread trimmed]
[ceph-users] OSD space imbalance
Hello,

I'm having an issue where disk usage between OSDs isn't well balanced, thus causing disk space to be wasted. Ceph is the latest 0.94.2, used exclusively through cephfs. Re-weighting helps, but just slightly, and it has to be done on a daily basis, causing constant backfills. In the end I get an OSD with 65% usage while some other goes over 90%. I also set "ceph osd crush tunables optimal", but I didn't notice any changes when it comes to disk usage. Is there anything I can do to get them within a 10% range at least?

 health HEALTH_OK
 mdsmap e2577: 1/1/1 up, 2 up:standby
 osdmap e25239: 48 osds: 48 up, 48 in
  pgmap v3188836: 5184 pgs, 3 pools, 18028 GB data, 6385 kobjects
        36156 GB used, 9472 GB / 45629 GB avail
            5184 active+clean

ID WEIGHT  REWEIGHT SIZE USE  AVAIL   %USE  VAR
37 0.92999 1.0      950G 625G 324G    65.85 0.83
21 0.92999 1.0      950G 649G 300G    68.35 0.86
32 0.92999 1.0      950G 670G 279G    70.58 0.89
 7 0.92999 1.0      950G 676G 274G    71.11 0.90
17 0.92999 1.0      950G 681G 268G    71.73 0.91
40 0.92999 1.0      950G 689G 260G    72.55 0.92
20 0.92999 1.0      950G 690G 260G    72.62 0.92
25 0.92999 1.0      950G 691G 258G    72.76 0.92
 2 0.92999 1.0      950G 694G 256G    73.03 0.92
39 0.92999 1.0      950G 697G 253G    73.35 0.93
18 0.92999 1.0      950G 703G 247G    74.00 0.93
47 0.92999 1.0      950G 703G 246G    74.05 0.93
23 0.92999 0.86693  950G 704G 245G    74.14 0.94
 6 0.92999 1.0      950G 726G 224G    76.39 0.96
 8 0.92999 1.0      950G 727G 223G    76.54 0.97
 5 0.92999 1.0      950G 728G 222G    76.62 0.97
35 0.92999 1.0      950G 728G 221G    76.66 0.97
11 0.92999 1.0      950G 730G 220G    76.82 0.97
43 0.92999 1.0      950G 730G 219G    76.87 0.97
33 0.92999 1.0      950G 734G 215G    77.31 0.98
38 0.92999 1.0      950G 736G 214G    77.49 0.98
12 0.92999 1.0      950G 737G 212G    77.61 0.98
31 0.92999 0.85184  950G 742G 208G    78.09 0.99
28 0.92999 1.0      950G 745G 205G    78.41 0.99
27 0.92999 1.0      950G 751G 199G    79.04 1.00
10 0.92999 1.0      950G 754G 195G    79.40 1.00
13 0.92999 1.0      950G 762G 188G    80.21 1.01
 9 0.92999 1.0      950G 763G 187G    80.29 1.01
16 0.92999 1.0      950G 764G 186G    80.37 1.01
 0 0.92999 1.0      950G 778G 171G    81.94 1.03
 3 0.92999 1.0      950G 780G 170G    82.11 1.04
41 0.92999 1.0      950G 780G 169G    82.13 1.04
34 0.92999 0.87303  950G 783G 167G    82.43 1.04
14 0.92999 1.0      950G 784G 165G    82.56 1.04
42 0.92999 1.0      950G 786G 164G    82.70 1.04
46 0.92999 1.0      950G 788G 162G    82.93 1.05
30 0.92999 1.0      950G 790G 160G    83.12 1.05
45 0.92999 1.0      950G 804G 146G    84.59 1.07
44 0.92999 1.0      950G 807G 143G    84.92 1.07
 1 0.92999 1.0      950G 817G 132G    86.05 1.09
22 0.92999 1.0      950G 825G 125G    86.81 1.10
15 0.92999 1.0      950G 826G 123G    86.97 1.10
19 0.92999 1.0      950G 829G 120G    87.30 1.10
36 0.92999 1.0      950G 831G 119G    87.48 1.10
24 0.92999 1.0      950G 831G 118G    87.50 1.10
26 0.92999 1.0      950G 851G 101692M 89.55 1.13
29 0.92999 1.0      950G 851G 101341M 89.59 1.13
 4 0.92999 1.0      950G 860G 92164M  90.53 1.14
MIN/MAX VAR: 0.83/1.14  STDDEV: 5.94
TOTAL 45629G 36156G 9473G 79.24

Thanks,
Vedran

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Can not active osds (old/different cluster instance?)
Dear all,

I try to create OSDs and get an error message: "(old/different cluster instance?)". The OSDs can be created but not activated. This server has built OSDs before. Please give me some advice.

OS: rhel7
ceph: 0.80 firefly

Best wishes,
Mika

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
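That error message usually indicates leftover state from a previous cluster: a fsid recorded on the disk or under /var/lib/ceph/osd that does not match the current cluster's fsid. A hedged clean-up sketch, assuming /dev/sdb is the OSD data disk (hypothetical device name) and that it holds nothing you need:

ceph-disk zap /dev/sdb        # destructive! wipes the partition table and the old cluster's data
ceph-disk prepare /dev/sdb    # re-create the OSD for the current cluster
ceph-disk activate /dev/sdb1  # activate the freshly prepared data partition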
[ceph-users] ceph distributed osd
Dear Team,

We are using two ceph OSDs with replica 2, and it is working properly. My doubt is this: Pool A's image size will be 10GB, and it is replicated on the two OSDs. What will happen if the size reaches that limit? Is there any way to make the data continue being written onto another two OSDs?

Regards
Prabu

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped
I decided to set OSD 76 out and let the cluster shuffle the data off that disk, then brought the OSD back in. For the most part this seemed to be working, but then I had 1 object degraded and 88xxx objects misplaced:

# ceph health detail
HEALTH_WARN 11 pgs stuck unclean; recovery 1/66089446 objects degraded (0.000%); recovery 88844/66089446 objects misplaced (0.134%)
pg 2.e7f is stuck unclean for 88398.251351, current state active+remapped, last acting [58,5]
pg 2.143 is stuck unclean for 13892.364101, current state active+remapped, last acting [16,76]
pg 2.968 is stuck unclean for 13892.363521, current state active+remapped, last acting [44,76]
pg 2.5f8 is stuck unclean for 13892.377245, current state active+remapped, last acting [17,76]
pg 2.81c is stuck unclean for 13892.363443, current state active+remapped, last acting [25,76]
pg 2.1a3 is stuck unclean for 13892.364400, current state active+remapped, last acting [16,76]
pg 2.2cb is stuck unclean for 13892.374390, current state active+remapped, last acting [14,76]
pg 2.d41 is stuck unclean for 13892.373636, current state active+remapped, last acting [27,76]
pg 2.3f9 is stuck unclean for 13892.373147, current state active+remapped, last acting [35,76]
pg 2.a62 is stuck unclean for 86283.741920, current state active+remapped, last acting [2,38]
pg 2.1b0 is stuck unclean for 13892.363268, current state active+remapped, last acting [3,76]
recovery 1/66089446 objects degraded (0.000%)
recovery 88844/66089446 objects misplaced (0.134%)

I say "seemed" because, with one object degraded, none of the pg's are showing as degraded:

# ceph pg dump_stuck degraded
ok

# ceph pg dump_stuck unclean
ok
pg_stat state           up   up_primary acting  acting_primary
2.e7f   active+remapped [58] 58         [58,5]  58
2.143   active+remapped [16] 16         [16,76] 16
2.968   active+remapped [44] 44         [44,76] 44
2.5f8   active+remapped [17] 17         [17,76] 17
2.81c   active+remapped [25] 25         [25,76] 25
2.1a3   active+remapped [16] 16         [16,76] 16
2.2cb   active+remapped [14] 14         [14,76] 14
2.d41   active+remapped [27] 27         [27,76] 27
2.3f9   active+remapped [35] 35         [35,76] 35
2.a62   active+remapped [2]  2          [2,38]  2
2.1b0   active+remapped [3]  3          [3,76]  3

All of the OSD filesystems are below 85% full.

I then compared against a fresh 0.94.2 cluster that had not been upgraded (the current cluster is 0.94.2 but has been upgraded a couple of times) and noticed its crush map had 'tunable straw_calc_version 1', so I added it to the current cluster. After the data moved around for about 8 hours or so, I'm left with this state:

# ceph health detail
HEALTH_WARN 2 pgs stuck unclean; recovery 16357/66089446 objects misplaced (0.025%)
pg 2.e7f is stuck unclean for 149422.331848, current state active+remapped, last acting [58,5]
pg 2.782 is stuck unclean for 64878.002464, current state active+remapped, last acting [76,31]
recovery 16357/66089446 objects misplaced (0.025%)

I attempted a pg repair on both of the pg's listed above, but it doesn't look like anything is happening. The docs reference an inconsistent state as the use case for the repair command, so that's likely why. These 2 pg's have been the issue throughout this process, so how can I dig deeper to figure out what the problem is?

# ceph pg 2.e7f query: http://pastebin.com/jMMsbsjS
# ceph pg 2.e7f query: http://pastebin.com/0ntBfFK5

On Wed, Aug 12, 2015 at 6:52 PM, yangyongp...@bwstor.com.cn yangyongp...@bwstor.com.cn wrote:

You can try "ceph pg repair pg_id" to repair the unhealthy pg. The "ceph health detail" command is very useful for detecting unhealthy pgs.

yangyongp...@bwstor.com.cn

From: Steve Dainard
Date: 2015-08-12 23:48
To: ceph-users
Subject: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped

I ran a ceph osd reweight-by-utilization yesterday and partway through had a network interruption. After the network was restored the cluster continued to rebalance, but this morning the cluster has stopped rebalancing and the status will not change from:

# ceph status
    cluster af859ff1-c394-4c9a-95e2-0e0e4c87445c
     health HEALTH_WARN
            1 pgs degraded
            1 pgs stuck degraded
            2 pgs stuck unclean
            1 pgs stuck undersized
            1 pgs undersized
            recovery 8163/66089054 objects degraded (0.012%)
            recovery 8194/66089054 objects misplaced (0.012%)
     monmap e24: 3 mons at {mon1=10.0.231.53:6789/0,mon2=10.0.231.54:6789/0,mon3=10.0.231.55:6789/0}
            election epoch 250, quorum 0,1,2 mon1,mon2,mon3
     osdmap e184486: 100 osds: 100 up, 100 in; 1 remapped pgs
      pgmap v3010985: 4144 pgs, 7 pools, 125 TB data, 32270 kobjects
            251 TB used, 111 TB / 363 TB avail
            8163/66089054 objects degraded (0.012%)
            8194/66089054 objects misplaced (0.012%)
                4142 active+clean
                   1 active+undersized+degraded
                   1 active+remapped

# ceph health detail
HEALTH_WARN 1 pgs degraded;
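For reference, the 'tunable straw_calc_version 1' change mentioned above is made by round-tripping the CRUSH map with the stock tools (file names here are arbitrary); note that changing tunables can itself trigger substantial data movement:

ceph osd getcrushmap -o crush.bin      # grab the compiled CRUSH map
crushtool -d crush.bin -o crush.txt    # decompile to text
# edit crush.txt: add "tunable straw_calc_version 1" near the top
crushtool -c crush.txt -o crush.new    # recompile
ceph osd setcrushmap -i crush.new      # inject the new map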
Re: [ceph-users] Cache tier best practices
Use the order parameter when creating an RBD: 22 = 4MB, 20 = 1MB.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vickey Singh
Sent: 13 August 2015 09:31
To: Nick Fisk n...@fisk.me.uk
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Cache tier best practices

[earlier quoted messages in this thread trimmed]

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cache tier best practices
I think you're looking for this: http://ceph.com/docs/master/man/8/rbd/#cmdoption-rbd--order

It's used when you create the RBD images. 1MB is order=20, 512KB is order=19.

Thanks,
Bill Sanders

On Thu, Aug 13, 2015 at 1:31 AM, Vickey Singh vickey.singh22...@gmail.com wrote:

[earlier quoted messages in this thread trimmed]

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
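Putting Nick's and Bill's replies together, a minimal sketch (the pool and image names are made up; --size is in MB here, and 2^19 bytes = 512KB objects):

rbd create rbd/cache-test --size 10240 --order 19   # 10GB image built from 512KB objects
rbd info rbd/cache-test                             # the output includes "order 19"

The order is fixed at image creation time; existing images keep the order they were created with.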
[ceph-users] How to improve single thread sequential reads?
Hi,

I'm trying to use an RBD as a staging area for some data before pushing it down to some LTO6 tapes. As I cannot use striping with the kernel client, I tend to max out at around 80MB/s for reads when testing with dd. Has anyone got any clever suggestions for giving this a bit of a boost? I think I need to get it up to around 200MB/s to make sure there is always a steady flow of data to the tape drive.

rbd-fuse seems to top out at 12MB/s, so there goes that option. I'm thinking that mapping multiple RBDs and then combining them into an mdadm RAID0 stripe might work, but it seems a bit messy.

Any suggestions?

Thanks,
Nick

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
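If the mdadm route is attempted, it might look roughly like this. The image and device names are illustrative, and it assumes four images already exist and map to /dev/rbd0 through /dev/rbd3; a 4096KB chunk matches the default 4MB RBD object size, so each full stripe fans out to four objects on different OSDs:

rbd map rbd/stage0; rbd map rbd/stage1; rbd map rbd/stage2; rbd map rbd/stage3
mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
    /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
mkfs.xfs /dev/md0    # then mount and use as the staging area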
Re: [ceph-users] OSD space imbalance
On 13.08.2015 18:01, GuangYang wrote:
Try 'ceph osd reweight-by-pg <int>' right after creating the pools?

Would it do any good now, when the pool is in use and nearly full, as I can't re-create it now? Also, what's the integer argument in the command above? I failed to find a proper explanation in the docs.

What is the typical object size in the cluster?

Around 50 MB.

Thanks,
Vedran

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped
I don't see anything obvious, sorry. It looks like something with osd.{5, 76, 38}, which are absent from the *up* set even though they are up. How about increasing the log level with 'debug_osd = 20' on osd.76 and restarting the OSD?

Thanks,
Guang

Date: Thu, 13 Aug 2015 09:10:31 -0700
Subject: Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped
From: sdain...@spd1.com
To: yguan...@outlook.com
CC: yangyongp...@bwstor.com.cn; ceph-users@lists.ceph.com

OSD tree: http://pastebin.com/3z333DP4
Crushmap: http://pastebin.com/DBd9k56m

I realize these nodes are quite large; I have plans to break them out into 12 OSDs/node.

On Thu, Aug 13, 2015 at 9:02 AM, GuangYang yguan...@outlook.com wrote:

Could you share the 'ceph osd tree' dump and CRUSH map dump?

Thanks,
Guang

[remaining quoted messages in this thread trimmed; see the earlier messages in this digest]
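If a restart is inconvenient, the same debug level can usually be raised at runtime with the standard admin commands (both forms should exist on hammer; remember to lower it again afterwards, since level 20 logging is extremely verbose):

ceph tell osd.76 injectargs '--debug-osd 20'
# or, from the host running osd.76, via the admin socket:
ceph daemon osd.76 config set debug_osd 20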
Re: [ceph-users] OSD space imbalance
There are three factors that impact the disk utilization of an OSD:
1. the number of PGs on the OSD (determined by CRUSH)
2. the number of objects within each PG (better to pick a power-of-2 PG count to make this more even)
3. object size deviation

With 'ceph osd reweight-by-pg' you can tune (1). And if you would like to get a better understanding of the root cause in your cluster, you can find more information in 'ceph pg dump', which gives you the raw data for 1 and 2. Once the cluster is filled, you would probably go with 'ceph osd reweight-by-utilization'; be careful with that, since it can incur lots of data movement...

To: ceph-users@lists.ceph.com
From: vedran.fu...@gmail.com
Date: Fri, 14 Aug 2015 00:15:17 +0200
Subject: Re: [ceph-users] OSD space imbalance

On 13.08.2015 18:01, GuangYang wrote:
Try 'ceph osd reweight-by-pg <int>' right after creating the pools?

Would it do any good now, when the pool is in use and nearly full, as I can't re-create it now? Also, what's the integer argument in the command above? I failed to find a proper explanation in the docs.

Please check it out here - https://github.com/ceph/ceph/blob/master/src/mon/OSDMonitor.cc#L469

What is the typical object size in the cluster?

Around 50 MB.

Thanks,
Vedran

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
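Concretely, both reweight commands take an oversubscription threshold as a percentage of the cluster average (this reading of the <int> argument matches the OSDMonitor code linked above, though the default may vary by release):

ceph osd reweight-by-pg 110            # reweight OSDs holding more than 110% of the average PG count
ceph osd reweight-by-utilization 110   # same idea, but by actual disk utilization

The utilization variant moves real data, so expect backfill traffic after running it.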
Re: [ceph-users] ceph distributed osd
You can add one or more OSDs, and ceph will rebalance the distribution of PGs. The data will not run out of room as long as you have enough space in the cluster.

yangyongp...@bwstor.com.cn

From: gjprabu
Date: 2015-08-13 22:42
To: ceph-users
CC: Kamala Subramani; Siva Sokkumuthu
Subject: [ceph-users] ceph distributed osd

[quoted original message trimmed; see the previous message in this thread]

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
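Two commands that help here (both available in hammer; the pool and image names are hypothetical). RBD images are thin-provisioned, so the 10GB "size" is a logical limit that can simply be raised, and CRUSH already spreads the image's objects over whatever OSDs the pool's rule selects, so adding OSDs adds usable capacity automatically:

ceph df                                # per-pool usage versus raw cluster capacity
rbd resize PoolA/myimage --size 20480  # grow the logical image size from 10GB to 20GB (MB units)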
Re: [ceph-users] ceph osd map pool object question / bug?
Hi Goncalo, On Fri, 14 Aug 2015 13:30:35 +1000 Goncalo Borges gonc...@physics.usyd.edu.au wrote: Is this expected? Are those PGs actually assigned to something it does not exists? Objects are mapped to PGs algorithmically, based on their names. You can think of that result as telling you where the object *would* be placed if it were created. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
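A quick way to see both halves of this behaviour from the CLI (the object name is the same deliberately nonexistent one used in the question below):

ceph osd map cephfs_dt thisobjectdoesnotexist    # always prints a placement: pure CRUSH arithmetic
rados -p cephfs_dt stat thisobjectdoesnotexist   # fails with "No such file or directory"

The first command computes where the object would live; it never contacts the OSDs to check whether the object actually exists.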
[ceph-users] ceph osd map pool object question / bug?
Hi Ceph gurus...

I am using 0.94.2 in all my Ceph / CephFS installations. While trying to understand how files are translated into objects, I noticed that 'ceph osd map' returns a valid answer even for objects that do not exist:

# ceph osd map cephfs_dt thisobjectdoesnotexist
osdmap e341 pool 'cephfs_dt' (5) object 'thisobjectdoesnotexist' -> pg 5.28aa7f5a (5.35a) -> up ([24,21,15], p24) acting ([24,21,15], p24)

Is this expected? Are those PGs actually assigned to something that does not exist?

Cheers
Goncalo

--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] teuthology: running create_nodes.py hangs
Hi,

When setting up teuthology in my own environment, I found the following problem: in the file teuthology/__init__.py, importing the class gevent.monkey conflicts with paramiko, and if create_nodes.py is used to connect to the paddles/pulpito node, it hangs.

root@ubunut4:~/src/teuthology_master# git diff teuthology/__init__.py
diff --git a/teuthology/__init__.py b/teuthology/__init__.py
index d0bcfc0..b34cf4e 100644
--- a/teuthology/__init__.py
+++ b/teuthology/__init__.py
@@ -1,5 +1,5 @@
-from gevent import monkey
-monkey.patch_all(dns=False)
+#from gevent import monkey
+#monkey.patch_all(dns=False)
 from .orchestra import monkey
 monkey.patch_all()

After this modification, everything looks fine. So I am wondering whether this is a bug? Any reply will be highly appreciated.

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped
OSD tree: http://pastebin.com/3z333DP4
Crushmap: http://pastebin.com/DBd9k56m

I realize these nodes are quite large; I have plans to break them out into 12 OSDs/node.

On Thu, Aug 13, 2015 at 9:02 AM, GuangYang yguan...@outlook.com wrote:

Could you share the 'ceph osd tree' dump and CRUSH map dump?

Thanks,
Guang

[remaining quoted messages in this thread trimmed; see the earlier messages in this digest]
Re: [ceph-users] How to improve single thread sequential reads?
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick Fisk
Sent: 13 August 2015 18:04
To: ceph-users@lists.ceph.com
Subject: [ceph-users] How to improve single thread sequential reads?

Hi,

I'm trying to use an RBD as a staging area for some data before pushing it down to some LTO6 tapes. As I cannot use striping with the kernel client, I tend to max out at around 80MB/s for reads when testing with dd. Has anyone got any clever suggestions for giving this a bit of a boost? I think I need to get it up to around 200MB/s to make sure there is always a steady flow of data to the tape drive.

I've just tried the testing kernel with the blk-mq fixes in it for full-size IOs; combined with bumping readahead up to 4MB, this is now getting me 150MB/s to 200MB/s on average, so it might suffice.

Out of personal interest, I would still like to know if anyone has ideas on how to push much higher bandwidth through an RBD. rbd-fuse seems to top out at 12MB/s, so there goes that option. I'm thinking that mapping multiple RBDs and then combining them into an mdadm RAID0 stripe might work, but it seems a bit messy.

Any suggestions?

Thanks,
Nick
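In case it helps anyone searching the archives, here is a rough sketch of both tweaks mentioned above. The device names, image names, and chunk size are illustrative assumptions, not values from Nick's setup:

    # Bump readahead on a mapped RBD; the sysfs value is in KB (4096 KB = 4MB)
    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb
    # or via blockdev, which counts 512-byte sectors (8192 * 512B = 4MB)
    blockdev --setra 8192 /dev/rbd0

    # Stripe a single sequential stream across several images with mdadm RAID0
    for i in 1 2 3 4; do rbd map rbd/staging$i; done
    mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
          /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
    mkfs.xfs /dev/md0
    mount /dev/md0 /mnt/staging

The RAID0 route keeps several RBD request queues busy for a single reader, at the cost of the management mess Nick mentions; readahead alone is far less invasive.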
Re: [ceph-users] OSD space imbalance
Try 'ceph osd reweight-by-pg <int>' right after creating the pools? What is the typical object size in the cluster?

Thanks,
Guang

To: ceph-users@lists.ceph.com
From: vedran.fu...@gmail.com
Date: Thu, 13 Aug 2015 14:58:11 +0200
Subject: [ceph-users] OSD space imbalance

Hello,

I'm having an issue where disk usage between OSDs isn't well balanced, which causes disk space to be wasted. Ceph is the latest 0.94.2, used exclusively through CephFS. Re-weighting helps, but only slightly, and it has to be done on a daily basis, causing constant backfills. In the end I get an OSD at 65% usage while some others go over 90%. I also set 'ceph osd crush tunables optimal', but I didn't notice any change in disk usage. Is there anything I can do to get them within at least a 10% range?

     health HEALTH_OK
     mdsmap e2577: 1/1/1 up, 2 up:standby
     osdmap e25239: 48 osds: 48 up, 48 in
      pgmap v3188836: 5184 pgs, 3 pools, 18028 GB data, 6385 kobjects
            36156 GB used, 9472 GB / 45629 GB avail
                5184 active+clean

ID  WEIGHT   REWEIGHT  SIZE  USE   AVAIL    %USE   VAR
37  0.92999  1.0       950G  625G  324G     65.85  0.83
21  0.92999  1.0       950G  649G  300G     68.35  0.86
32  0.92999  1.0       950G  670G  279G     70.58  0.89
 7  0.92999  1.0       950G  676G  274G     71.11  0.90
17  0.92999  1.0       950G  681G  268G     71.73  0.91
40  0.92999  1.0       950G  689G  260G     72.55  0.92
20  0.92999  1.0       950G  690G  260G     72.62  0.92
25  0.92999  1.0       950G  691G  258G     72.76  0.92
 2  0.92999  1.0       950G  694G  256G     73.03  0.92
39  0.92999  1.0       950G  697G  253G     73.35  0.93
18  0.92999  1.0       950G  703G  247G     74.00  0.93
47  0.92999  1.0       950G  703G  246G     74.05  0.93
23  0.92999  0.86693   950G  704G  245G     74.14  0.94
 6  0.92999  1.0       950G  726G  224G     76.39  0.96
 8  0.92999  1.0       950G  727G  223G     76.54  0.97
 5  0.92999  1.0       950G  728G  222G     76.62  0.97
35  0.92999  1.0       950G  728G  221G     76.66  0.97
11  0.92999  1.0       950G  730G  220G     76.82  0.97
43  0.92999  1.0       950G  730G  219G     76.87  0.97
33  0.92999  1.0       950G  734G  215G     77.31  0.98
38  0.92999  1.0       950G  736G  214G     77.49  0.98
12  0.92999  1.0       950G  737G  212G     77.61  0.98
31  0.92999  0.85184   950G  742G  208G     78.09  0.99
28  0.92999  1.0       950G  745G  205G     78.41  0.99
27  0.92999  1.0       950G  751G  199G     79.04  1.00
10  0.92999  1.0       950G  754G  195G     79.40  1.00
13  0.92999  1.0       950G  762G  188G     80.21  1.01
 9  0.92999  1.0       950G  763G  187G     80.29  1.01
16  0.92999  1.0       950G  764G  186G     80.37  1.01
 0  0.92999  1.0       950G  778G  171G     81.94  1.03
 3  0.92999  1.0       950G  780G  170G     82.11  1.04
41  0.92999  1.0       950G  780G  169G     82.13  1.04
34  0.92999  0.87303   950G  783G  167G     82.43  1.04
14  0.92999  1.0       950G  784G  165G     82.56  1.04
42  0.92999  1.0       950G  786G  164G     82.70  1.04
46  0.92999  1.0       950G  788G  162G     82.93  1.05
30  0.92999  1.0       950G  790G  160G     83.12  1.05
45  0.92999  1.0       950G  804G  146G     84.59  1.07
44  0.92999  1.0       950G  807G  143G     84.92  1.07
 1  0.92999  1.0       950G  817G  132G     86.05  1.09
22  0.92999  1.0       950G  825G  125G     86.81  1.10
15  0.92999  1.0       950G  826G  123G     86.97  1.10
19  0.92999  1.0       950G  829G  120G     87.30  1.10
36  0.92999  1.0       950G  831G  119G     87.48  1.10
24  0.92999  1.0       950G  831G  118G     87.50  1.10
26  0.92999  1.0       950G  851G  101692M  89.55  1.13
29  0.92999  1.0       950G  851G  101341M  89.59  1.13
 4  0.92999  1.0       950G  860G  92164M   90.53  1.14
MIN/MAX VAR: 0.83/1.14  STDDEV: 5.94
    TOTAL 45629G 36156G 9473G 79.24

Thanks,
Vedran
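For reference, a minimal sketch of the reweighting knobs being discussed; the threshold of 110 (act on OSDs more than 10% above the mean) is an illustrative value, not a tested recommendation:

    # Reweight OSDs whose PG count is more than 10% above the average
    ceph osd reweight-by-pg 110

    # Or reweight based on actual space utilization
    ceph osd reweight-by-utilization 110

    # Manually nudge a single overfull OSD (override reweight runs 0.0-1.0)
    ceph osd reweight 4 0.85

    # Check the resulting spread
    ceph osd df

As Guang suggests, reweight-by-pg is most useful right after pool creation, since it balances on expected PG placement before any data lands.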
Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped
Could you share the 'ceph osd tree' dump and CRUSH map dump?

Thanks,
Guang

Date: Thu, 13 Aug 2015 08:16:09 -0700
From: sdain...@spd1.com
To: yangyongp...@bwstor.com.cn; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cluster health_warn 1 active+undersized+degraded/1 active+remapped
[...]
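For completeness, the requested dumps are typically produced as follows; these are standard Ceph CLI invocations, not commands quoted from the thread:

    # Show the CRUSH hierarchy and weights
    ceph osd tree

    # Export the binary CRUSH map and decompile it to editable text
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt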