Re: [ceph-users] PG num calculator live on Ceph.com
Lindsay, Yes, I would suggest starting with the 'RBD and libRados' use case from the drop-down, then adjusting the percentages / pool names (if you desire) as appropriate. I don't have a ton of experience with CephFS, but I would suspect that the metadata is less than 5% of the total data usage across those two pools. I welcome anyone with more CephFS experience to weigh in on this! :) Thanks, Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat On Wed, Jan 7, 2015 at 3:59 PM, Lindsay Mathieson lindsay.mathie...@gmail.com wrote: With cephfs we have the two pools - data and metadata. Does that affect the pg calculations? The metadata pool will have substantially less data than the data pool. -- Lindsay ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
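For anyone who wants to sanity-check the calculator's output by hand, the underlying arithmetic is simple to reproduce. The sketch below uses purely hypothetical figures (40 OSDs, 3 replicas, a target of ~100 PGs per OSD, and a ~95%/5% data/metadata split on the default CephFS pool names); substitute your own values:

  # total PG budget ~= (target PGs per OSD x OSD count) / replica count
  #   (100 x 40) / 3 ~= 1333 PGs to divide across all pools
  # data pool (~95%)    -> ~1267 -> round to a power of two -> 1024
  # metadata pool (~5%) -> ~67   -> round to a power of two -> 64
  ceph osd pool set data pg_num 1024
  ceph osd pool set data pgp_num 1024
  ceph osd pool set metadata pg_num 64
  ceph osd pool set metadata pgp_num 64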
Re: [ceph-users] PG num calculator live on Ceph.com
Hello Bill, Either 2048 or 4096 should be acceptable. 4096 gives about a 300 PG per OSD ratio, which would leave room for tripling the OSD count without needing to increase the PG number. While 2048 gives about 150 PGs per OSD, not leaving room but for about a 50% OSD count expansion. The high PG count per OSD issue really doesn't manifest aggressively until you get around 1000 PGs per OSD and beyond. At those levels, steady state operation continues without issue.. but recovery within the cluster will see the memory utilization of the OSDs climb and could push into out of memory conditions on the OSD host (or at a minimum, heavy swap usage if enabled). It still depends of course on the # of OSDs per node, and the amount of memory on the node as to if you'll actually experience issues or not. As an example though, I worked on a cluster which was about 5500 PGs per OSD. The cluster experienced a network config issue in the switchgear which isolated 2/3's of the OSD nodes from each other and the other 1/3 of the cluster. When the network issue was cleared, the OSDs started dropping like flies... They'd start up, spool up the memory they needed for map update parsing, and get killed before making any real headway. We were finally able to get the cluster online by limiting what the OSDs were doing to a small slice of the normal start-up, waiting for the OSDs to calm down, then opening up a bit more for them to do (noup, noin, norecover, nobackfill, pause, noscrub, nodeep-scrub were all set, and then unset one at a time until all OSDs were up/in and able to handle the recovery). 6 weeks later, that same cluster lost about 40% of the OSDs during a power outage due to corruption from an HBA bug.. (it didn't flush the write cache to disk). This pushed the PG per OSD count over 9000!! It simply couldn't recover with the available memory at that PG count. Each OSD, started by itself, would consume 60gb of RAM and get killed (the nodes only had 64gb total). While this is an extreme example... we see cases generated with 1000 PGs per OSD on a regular basis. This is the type of thing we're trying to head off. It should be noted that you can increase the PG num of a pool.. but cannot decrease! The only way to reduce your cluster PG count is to create new smaller PG num pools, migrate the data and then delete the old, high PG count pools. You could also simply add more OSDs to reduce the PG per OSD ratio. The issue with too few PGs is poor data distribution. So it's all about having enough PGs to get good data distribution without going too high and having resource exhaustion during recovery. Hope this helps put things into perspective. Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat On Wed, Jan 7, 2015 at 4:34 PM, Sanders, Bill bill.sand...@teradata.com wrote: This is interesting. Kudos to you guys for getting the calculator up, I think this'll help some folks. I have 1 pool, 40 OSDs, and replica of 3. I based my PG count on: http://ceph.com/docs/master/rados/operations/placement-groups/ ''' Less than 5 OSDs set pg_num to 128 Between 5 and 10 OSDs set pg_num to 512 Between 10 and 50 OSDs set pg_num to 4096 ''' But the calculator gives a different result of 2048. Out of curiosity, what sorts of issues might one encounter by having too many placement groups? I understand there's some resource overhead. I don't suppose it would manifest itself in a recognizable way? Bill -- *From:* ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Michael J. 
Kidd [michael.k...@inktank.com] *Sent:* Wednesday, January 07, 2015 3:51 PM *To:* Loic Dachary *Cc:* ceph-us...@ceph.com *Subject:* Re: [ceph-users] PG num calculator live on Ceph.com Where is the source ? On the page.. :) It does link out to jquery and jquery-ui, but all the custom bits are embedded in the HTML. Glad it's helpful :) Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat On Wed, Jan 7, 2015 at 3:46 PM, Loic Dachary l...@dachary.org wrote: On 07/01/2015 23:08, Michael J. Kidd wrote: Hello all, Just a quick heads up that we now have a PG calculator to help determine the proper PG per pool numbers to achieve a target PG per OSD ratio. http://ceph.com/pgcalc Please check it out! Happy to answer any questions, and always welcome any feedback on the tool / verbiage, etc... Great work ! That will be immensely useful :-) Where is the source ? Cheers As an aside, we're also working to update the documentation to reflect the best practices. See Ceph.com tracker for this at: http://tracker.ceph.com/issues/9867 Thanks! Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph
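For reference, the ~150 and ~300 PG-per-OSD figures Michael quotes in his reply to Bill can be reproduced with a quick shell calculation (Bill's numbers of 40 OSDs and 3 replicas are assumed; adjust for your own cluster):

  # PGs per OSD ~= (pg_num x replica count) / OSD count
  pg_num=2048; size=3; osds=40
  echo $(( pg_num * size / osds ))    # -> 153, the ~150 figure for pg_num 2048
  pg_num=4096
  echo $(( pg_num * size / osds ))    # -> 307, the ~300 figure for pg_num 4096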
[ceph-users] PG num calculator live on Ceph.com
Hello all, Just a quick heads up that we now have a PG calculator to help determine the proper PG per pool numbers to achieve a target PG per OSD ratio. http://ceph.com/pgcalc Please check it out! Happy to answer any questions, and always welcome any feedback on the tool / verbiage, etc... As an aside, we're also working to update the documentation to reflect the best practices. See Ceph.com tracker for this at: http://tracker.ceph.com/issues/9867 Thanks! Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PG num calculator live on Ceph.com
Hello Christopher, Keep in mind that the PGs per OSD (and per pool) calculations take into account the replica count ( pool size= parameter ). So, for example.. if you're using a default of 3 replicas.. 16 * 3 = 48 PGs which allows for at least one PG per OSD on that pool. Even with a size=2, 32 PGs total still gives very close to 1 PG per OSD. Being that it's such a low utilization pool, this is still sufficient. Thanks, Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat On Wed, Jan 7, 2015 at 3:17 PM, Christopher O'Connell c...@sendfaster.com wrote: Hi, Im playing with this with a modest sized ceph cluster (36x6TB disks). Based on this it says that small pools (such as .users) would have just 16 PGs. Is this correct? I've historically always made even these small pools have at least as many PGs as the next power of 2 over my number of OSDs (64 in this case). All the best, ~ Christopher On Wed, Jan 7, 2015 at 3:08 PM, Michael J. Kidd michael.k...@inktank.com wrote: Hello all, Just a quick heads up that we now have a PG calculator to help determine the proper PG per pool numbers to achieve a target PG per OSD ratio. http://ceph.com/pgcalc Please check it out! Happy to answer any questions, and always welcome any feedback on the tool / verbiage, etc... As an aside, we're also working to update the documentation to reflect the best practices. See Ceph.com tracker for this at: http://tracker.ceph.com/issues/9867 Thanks! Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
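The same replica-aware arithmetic can be checked against a live pool; a small sketch, assuming the pool is named .users as in Christopher's example:

  ceph osd pool get .users pg_num    # e.g. pg_num: 16
  ceph osd pool get .users size      # e.g. size: 3
  # placements = pg_num x size = 16 x 3 = 48 across 36 OSDs,
  # so even this small pool lands on every OSD at least once on average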
Re: [ceph-users] PG num calculator live on Ceph.com
Where is the source ? On the page.. :) It does link out to jquery and jquery-ui, but all the custom bits are embedded in the HTML. Glad it's helpful :) Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat On Wed, Jan 7, 2015 at 3:46 PM, Loic Dachary l...@dachary.org wrote: On 07/01/2015 23:08, Michael J. Kidd wrote: Hello all, Just a quick heads up that we now have a PG calculator to help determine the proper PG per pool numbers to achieve a target PG per OSD ratio. http://ceph.com/pgcalc Please check it out! Happy to answer any questions, and always welcome any feedback on the tool / verbiage, etc... Great work ! That will be immensely useful :-) Where is the source ? Cheers As an aside, we're also working to update the documentation to reflect the best practices. See Ceph.com tracker for this at: http://tracker.ceph.com/issues/9867 Thanks! Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Loïc Dachary, Artisan Logiciel Libre ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD process exhausting server memory
Hello Lukas, The 'slow request' logs are expected while the cluster is in such a state.. the OSD processes simply aren't able to respond quickly to client IO requests. I would recommend trying to recover without the most problematic disk (seems to be OSD.10?).. Simply shut it down and see if the other OSDs settle down. You should also take a look at the kernel logs for any indications of a problem with the disks themselves, or possibly do an FIO test against the drive with the OSD shut down (write to a file on the OSD filesystem, not to the raw device; a raw-device test would be destructive). Also, you could upgrade to 0.80.7. There are some bug fixes, but I'm not sure if any would specifically help this situation.. not likely to hurt, though. The desired state is for the cluster to be steady-state before the next move (unsetting the next flag). Hopefully this can be achieved without needing to take down OSDs on multiple hosts. I'm also unsure about the cache tiering and how it could relate to the load being seen. Hope this helps... Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat On Thu, Oct 30, 2014 at 4:00 AM, Lukáš Kubín lukas.ku...@gmail.com wrote: Hi, I've noticed the following messages always accumulate in the OSD log before it exhausts all memory: 2014-10-30 08:48:42.994190 7f80a2019700 0 log [WRN] : slow request 38.901192 seconds old, received at 2014-10-30 08:48:04.092889: osd_op(osd.29.3076:207644827 rbd_data.2e4ee3ba663be.363b@17 [copy-get max 8388608] 7.af87e887 ack+read+ignore_cache+ignore_overlay+map_snap_clone e3359) v4 currently reached pg Note this is always from the most frequently failing osd.10 (sata tier) referring to osd.29 (ssd cache tier). That osd.29 is consuming huge CPU and memory resources, but keeps running without failures. Can this be e.g. a bug? Or some erroneous I/O request which initiated this behaviour? Can I e.g. attempt to upgrade Ceph to a more recent release in the current unhealthy state of the cluster? Can I e.g. try disabling the caching tier? Or just somehow evacuate the problematic OSD? I'll welcome any ideas. Currently, I'm keeping osd.10 in an automatic restart loop with a 60-second pause before starting again. Thanks and greetings, Lukas On Wed, Oct 29, 2014 at 8:04 PM, Lukáš Kubín lukas.ku...@gmail.com wrote: I should have figured that out myself since I did that recently. Thanks. Unfortunately, I'm still at the step ceph osd unset noin. After setting all the OSDs in, the original issue reappears, preventing me from proceeding with recovery. It now appears mostly at a single OSD - osd.10 - which consumes ~200% CPU and all memory within 45 seconds before being killed by Linux: Oct 29 18:24:38 q09 kernel: Out of memory: Kill process 17202 (ceph-osd) score 912 or sacrifice child Oct 29 18:24:38 q09 kernel: Killed process 17202, UID 0, (ceph-osd) total-vm:62713176kB, anon-rss:62009772kB, file-rss:328kB I've tried to restart it several times with the same result. Similar situation with OSDs 0 and 13. Also, I've noticed one of the SSD cache tier's OSDs - osd.29 - generating high CPU utilization, around 180%. All the problematic OSDs have been the same ones all the time - OSDs 0, 8, 10, 13 and 29 - they are the ones which I found to be down this morning.
There is some minor load coming from client - Openstack instances, I preferred not to kill them: [root@q04 ceph-recovery]# ceph -s cluster ec433b4a-9dc0-4d08-bde4-f1657b1fdb99 health HEALTH_ERR 31 pgs backfill; 241 pgs degraded; 62 pgs down; 193 pgs incomplete; 13 pgs inconsistent; 62 pgs peering; 12 pgs recovering; 205 pgs recovery_wait; 93 pgs stuck inactive; 608 pgs stuck unclean; 381138 requests are blocked 32 sec; recovery 1162468/35207488 objects degraded (3.302%); 466/17112963 unfound (0.003%); 13 scrub errors; 1/34 in osds are down; nobackfill,norecover,noscrub,nodeep-scrub flag(s) set monmap e2: 3 mons at {q03= 10.255.253.33:6789/0,q04=10.255.253.34:6789/0,q05=10.255.253.35:6789/0}, election epoch 92, quorum 0,1,2 q03,q04,q05 osdmap e2782: 34 osds: 33 up, 34 in flags nobackfill,norecover,noscrub,nodeep-scrub pgmap v7440374: 5632 pgs, 7 pools, 1449 GB data, 16711 kobjects 3148 GB used, 15010 GB / 18158 GB avail 1162468/35207488 objects degraded (3.302%); 466/17112963 unfound (0.003%) 13 active 22 active+recovery_wait+remapped 1 active+recovery_wait+inconsistent 4794 active+clean 193 incomplete 62 down+peering 9 active+degraded+remapped+wait_backfill 182 active+recovery_wait 74 active+remapped 12 active+recovering 12 active+clean+inconsistent 22 active+remapped+wait_backfill 4 active+clean+replay 232
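To expand on the fio suggestion above, a non-destructive run against a scratch file on the stopped OSD's filesystem could look like the sketch below. The OSD number, path, and job parameters are assumptions; make sure the OSD daemon is stopped and there is enough free space first:

  # random 4k direct writes against a scratch file on osd.10's data filesystem
  fio --name=osd10-check --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
      --size=1G --iodepth=16 --runtime=60 --time_based \
      --filename=/var/lib/ceph/osd/ceph-10/fio-scratch
  rm /var/lib/ceph/osd/ceph-10/fio-scratch    # clean up afterwards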
Re: [ceph-users] OSD process exhausting server memory
Hello Lukas, Unfortunately, I'm all out of ideas at the moment. There are some memory profiling techniques which can help identify what is causing the memory utilization, but it's a bit beyond what I typically work on. Others on the list may have experience with this (or otherwise have ideas) and may chip in... Wish I could be more help.. Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat On Thu, Oct 30, 2014 at 11:00 AM, Lukáš Kubín lukas.ku...@gmail.com wrote: Thanks Michael, still no luck. Letting the problematic OSD.10 down has no effect. Within minutes more of OSDs fail on same issue after consuming ~50GB of memory. Also, I can see two of those cache-tier OSDs on separate hosts which remain utilized almost 200% CPU all the time I've performed upgrade of all cluster to 0.80.7. Did not help. I have also tried to unset norecovery+nobackfill flags to attempt a recovery completion. No luck, several OSDs fail with the same issue preventing the recovery to complete. I've performed your fix steps from the start again and currently I'm behind the unset noin step. I could get some of pools to a state with no degraded objects temporarily. Then (within minutes) some OSD fails and it's degraded again. I have also tried to let the OSD processes get restarted automatically to keep them up as much as possible. I consider disabling the tiering pool volumes-cache as that's something I can miss: pool name category KB objects clones degraded backups - 000 0 data- 000 0 images - 777989590950270 8883 metadata- 000 0 rbd - 000 0 volumes - 11560869325965 179 3307 volumes-cache - 649577103 16708730 9894 1144650 Can I just switch it into the forward mode and let it empty (cache-flush-evict-all) to see if that changes anything? Could you or any of your colleagues provide anything else to try? Thank you, Lukas On Thu, Oct 30, 2014 at 3:05 PM, Michael J. Kidd michael.k...@inktank.com wrote: Hello Lukas, The 'slow request' logs are expected while the cluster is in such a state.. the OSD processes simply aren't able to respond quickly to client IO requests. I would recommend trying to recover without the most problematic disk ( seems to be OSD.10? ).. Simply shut it down and see if the other OSDs settle down. You should also take a look at the kernel logs for any indications of a problem with the disks themselves, or possibly do an FIO test against the drive with the OSD shut down (to a file on the OSD filesystem, not the raw drive.. this would be destructive). Also, you could upgrade to 0.80.7. There are some bug fixes, but I'm not sure if any would specifically help this situation.. not likely to hurt through. The desired state is for the cluster to be steady-state before the next move (unsetting the next flag). Hopefully this can be achieved without needing to take down OSDs in multiple hosts. I'm also unsure about the cache tiering and how it could relate to the load being seen. Hope this helps... Michael J. Kidd Sr. 
Storage Consultant Inktank Professional Services - by Red Hat On Thu, Oct 30, 2014 at 4:00 AM, Lukáš Kubín lukas.ku...@gmail.com wrote: Hi, I've noticed the following messages always accumulate in OSD log before it exhausts all memory: 2014-10-30 08:48:42.994190 7f80a2019700 0 log [WRN] : slow request 38.901192 seconds old, received at 2014-10-30 08:48:04.092889: osd_op(osd.29.3076:207644827 rbd_data.2e4ee3ba663be.363b@17 [copy-get max 8388608] 7.af87e887 ack+read+ignore_cache+ignore_overlay+map_snap_clone e3359) v4 currently reached pg Note this is always from the most frequently failing osd.10 (sata tier) referring to osd.29 (ssd cache tier). That osd.29 is consuming huge CPU and memory resources, but keeps running without failures. Can this be eg. a bug? Or some erroneous I/O request which initiated this behaviour? Can I eg. attempt to upgrade the Ceph to a more recent release in the current unhealthy status of the cluster? Can I eg. try disabling the caching tier? Or just somehow evacuate the problematic OSD? I'll welcome any ideas. Currently, I'm keeping the osd.10 in an automatic restart loop with 60 seconds pause before starting again. Thanks and greetings, Lukas On Wed, Oct 29, 2014 at 8:04 PM, Lukáš Kubín lukas.ku...@gmail.com wrote: I should have figured that out myself since I did that recently. Thanks. Unfortunately, I'm still at the step ceph osd unset noin. After setting all the OSDs in, the original
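One of the memory-profiling techniques alluded to above is the tcmalloc heap profiler built into the OSD daemon; a rough sketch of running it against the suspect OSD (this assumes the packages are linked against tcmalloc, which the stock Firefly builds are):

  ceph tell osd.10 heap start_profiler
  # ...wait while the memory usage climbs...
  ceph tell osd.10 heap dump            # writes a .heap file into the OSD's log directory
  ceph tell osd.10 heap stats
  ceph tell osd.10 heap stop_profiler
  # the dump can then be analyzed with google-pprof against the ceph-osd binary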
Re: [ceph-users] OSD process exhausting server memory
Hello Lukas, Please try the following process for getting all your OSDs up and operational... * Set the following flags: noup, noin, noscrub, nodeep-scrub, norecover, nobackfill for i in noup noin noscrub nodeep-scrub norecover nobackfill; do ceph osd set $i; done * Stop all OSDs (I know, this seems counter productive) * Set all OSDs down / out for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd down $i; ceph osd out $i; done * Set recovery / backfill throttles as well as heartbeat and OSD map processing tweaks in the /etc/ceph/ceph.conf file under the [osd] section: [osd] osd_max_backfills = 1 osd_recovery_max_active = 1 osd_recovery_max_single_start = 1 osd_backfill_scan_min = 8 osd_heartbeat_interval = 36 osd_heartbeat_grace = 240 osd_map_message_max = 1000 osd_map_cache_size = 3136 * Start all OSDs * Monitor 'top' for 0% CPU on all OSD processes.. it may take a while.. I usually issue 'top' then, the keys M c - M = Sort by memory usage - c = Show command arguments - This allows to easily monitor the OSD process and know which OSDs have settled, etc.. * Once all OSDs have hit 0% CPU utilization, remove the 'noup' flag - ceph osd unset noup * Again, wait for 0% CPU utilization (may be immediate, may take a while.. just gotta wait) * Once all OSDs have hit 0% CPU again, remove the 'noin' flag - ceph osd unset noin - All OSDs should now appear up/in, and will go through peering.. * Once ceph -s shows no further activity, and OSDs are back at 0% CPU again, unset 'nobackfill' - ceph osd unset nobackfill * Once ceph -s shows no further activity, and OSDs are back at 0% CPU again, unset 'norecover' - ceph osd unset norecover * Monitor OSD memory usage... some OSDs may get killed off again, but their subsequent restart should consume less memory and allow more recovery to occur between each step above.. and ultimately, hopefully... your entire cluster will come back online and be usable. ## Clean-up: * Remove all of the above set options from ceph.conf * Reset the running OSDs to their defaults: ceph tell osd.\* injectargs '--osd_max_backfills 10 --osd_recovery_max_active 15 --osd_recovery_max_single_start 5 --osd_backfill_scan_min 64 --osd_heartbeat_interval 6 --osd_heartbeat_grace 36 --osd_map_message_max 100 --osd_map_cache_size 500' * Unset the noscrub and nodeep-scrub flags: - ceph osd unset noscrub - ceph osd unset nodeep-scrub ## For help identifying why memory usage was so high, please provide: * ceph osd dump | grep pool * ceph osd crush rule dump Let us know if this helps... I know it looks extreme, but it's worked for me in the past.. Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat On Wed, Oct 29, 2014 at 8:51 AM, Lukáš Kubín lukas.ku...@gmail.com wrote: Hello, I've found my ceph v 0.80.3 cluster in a state with 5 of 34 OSDs being down through night after months of running without change. From Linux logs I found out the OSD processes were killed because they consumed all available memory. Those 5 failed OSDs were from different hosts of my 4-node cluster (see below). Two hosts act as SSD cache tier in some of my pools. The other two hosts are the default rotational drives storage. After checking the Linux was not out of memory I've attempted to restart those failed OSDs. 
Most of those OSD daemon exhaust all memory in seconds and got killed by Linux again: Oct 28 22:16:34 q07 kernel: Out of memory: Kill process 24207 (ceph-osd) score 867 or sacrifice child Oct 28 22:16:34 q07 kernel: Killed process 24207, UID 0, (ceph-osd) total-vm:59974412kB, anon-rss:59076880kB, file-rss:512kB On the host I've found lots of similar slow request messages preceding the crash: 2014-10-28 22:11:20.885527 7f25f84d1700 0 log [WRN] : slow request 31.117125 seconds old, received at 2014-10-28 22:10:49.768291: osd_sub_op(client.168752.0:2197931 14.2c7 888596c7/rbd_data.293272f8695e4.006f/head//14 [] v 1551'377417 snapset=0=[]:[] snapc=0=[]) v10 currently no flag points reached 2014-10-28 22:11:21.885668 7f25f84d1700 0 log [WRN] : 67 slow requests, 1 included below; oldest blocked for 9879.304770 secs Apparently I can't get the cluster fixed by restarting the OSDs all over again. Is there any other option then? Thank you. Lukas Kubin [root@q04 ~]# ceph -s cluster ec433b4a-9dc0-4d08-bde4-f1657b1fdb99 health HEALTH_ERR 9 pgs backfill; 1 pgs backfilling; 521 pgs degraded; 425 pgs incomplete; 13 pgs inconsistent; 20 pgs recovering; 50 pgs recovery_wait; 151 pgs stale; 425 pgs stuck inactive; 151 pgs stuck stale; 1164 pgs stuck unclean; 12070270 requests are blocked 32 sec; recovery 887322/35206223 objects degraded (2.520%); 119/17131232 unfound (0.001%); 13 scrub errors monmap e2: 3 mons at {q03= 10.255.253.33:6789/0,q04=10.255.253.34:6789/0,q05=10.255.253.35:6789/0}, election epoch 90, quorum 0,1,2 q03,q04,q05 osdmap e2194: 34 osds: 31 up, 31 in pgmap
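As a side note to the procedure above, once the [osd] settings are in ceph.conf (or after the injectargs clean-up step) it is worth confirming what the daemons are actually running with; a quick check over the admin socket, assuming the default socket path, might look like:

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_max_backfills
  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_recovery_max_active
  ceph osd dump | grep flags    # confirm which of noup/noin/norecover/nobackfill/... are still set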
Re: [ceph-users] OSD process exhausting server memory
Ah, sorry... since they were set out manually, they'll need to be set in manually.. for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd in $i; done Michael J. Kidd Sr. Storage Consultant Inktank Professional Services - by Red Hat On Wed, Oct 29, 2014 at 12:33 PM, Lukáš Kubín lukas.ku...@gmail.com wrote: I've ended up at step ceph osd unset noin. My OSDs are up, but not in, even after an hour: [root@q04 ceph-recovery]# ceph osd stat osdmap e2602: 34 osds: 34 up, 0 in flags nobackfill,norecover,noscrub,nodeep-scrub There seems to be no activity generated by OSD processes, occasionally they show 0,3% which I believe is just some basic communication processing. No load in network interfaces. Is there some other step needed to bring the OSDs in? Thank you. Lukas On Wed, Oct 29, 2014 at 3:58 PM, Michael J. Kidd michael.k...@inktank.com wrote: Hello Lukas, Please try the following process for getting all your OSDs up and operational... * Set the following flags: noup, noin, noscrub, nodeep-scrub, norecover, nobackfill for i in noup noin noscrub nodeep-scrub norecover nobackfill; do ceph osd set $i; done * Stop all OSDs (I know, this seems counter productive) * Set all OSDs down / out for i in $(ceph osd tree | grep osd | awk '{print $3}'); do ceph osd down $i; ceph osd out $i; done * Set recovery / backfill throttles as well as heartbeat and OSD map processing tweaks in the /etc/ceph/ceph.conf file under the [osd] section: [osd] osd_max_backfills = 1 osd_recovery_max_active = 1 osd_recovery_max_single_start = 1 osd_backfill_scan_min = 8 osd_heartbeat_interval = 36 osd_heartbeat_grace = 240 osd_map_message_max = 1000 osd_map_cache_size = 3136 * Start all OSDs * Monitor 'top' for 0% CPU on all OSD processes.. it may take a while.. I usually issue 'top' then, the keys M c - M = Sort by memory usage - c = Show command arguments - This allows to easily monitor the OSD process and know which OSDs have settled, etc.. * Once all OSDs have hit 0% CPU utilization, remove the 'noup' flag - ceph osd unset noup * Again, wait for 0% CPU utilization (may be immediate, may take a while.. just gotta wait) * Once all OSDs have hit 0% CPU again, remove the 'noin' flag - ceph osd unset noin - All OSDs should now appear up/in, and will go through peering.. * Once ceph -s shows no further activity, and OSDs are back at 0% CPU again, unset 'nobackfill' - ceph osd unset nobackfill * Once ceph -s shows no further activity, and OSDs are back at 0% CPU again, unset 'norecover' - ceph osd unset norecover * Monitor OSD memory usage... some OSDs may get killed off again, but their subsequent restart should consume less memory and allow more recovery to occur between each step above.. and ultimately, hopefully... your entire cluster will come back online and be usable. ## Clean-up: * Remove all of the above set options from ceph.conf * Reset the running OSDs to their defaults: ceph tell osd.\* injectargs '--osd_max_backfills 10 --osd_recovery_max_active 15 --osd_recovery_max_single_start 5 --osd_backfill_scan_min 64 --osd_heartbeat_interval 6 --osd_heartbeat_grace 36 --osd_map_message_max 100 --osd_map_cache_size 500' * Unset the noscrub and nodeep-scrub flags: - ceph osd unset noscrub - ceph osd unset nodeep-scrub ## For help identifying why memory usage was so high, please provide: * ceph osd dump | grep pool * ceph osd crush rule dump Let us know if this helps... I know it looks extreme, but it's worked for me in the past.. Michael J. Kidd Sr. 
Storage Consultant Inktank Professional Services - by Red Hat On Wed, Oct 29, 2014 at 8:51 AM, Lukáš Kubín lukas.ku...@gmail.com wrote: Hello, I've found my ceph v 0.80.3 cluster in a state with 5 of 34 OSDs being down through night after months of running without change. From Linux logs I found out the OSD processes were killed because they consumed all available memory. Those 5 failed OSDs were from different hosts of my 4-node cluster (see below). Two hosts act as SSD cache tier in some of my pools. The other two hosts are the default rotational drives storage. After checking the Linux was not out of memory I've attempted to restart those failed OSDs. Most of those OSD daemon exhaust all memory in seconds and got killed by Linux again: Oct 28 22:16:34 q07 kernel: Out of memory: Kill process 24207 (ceph-osd) score 867 or sacrifice child Oct 28 22:16:34 q07 kernel: Killed process 24207, UID 0, (ceph-osd) total-vm:59974412kB, anon-rss:59076880kB, file-rss:512kB On the host I've found lots of similar slow request messages preceding the crash: 2014-10-28 22:11:20.885527 7f25f84d1700 0 log [WRN] : slow request 31.117125 seconds old, received at 2014-10-28 22:10:49.768291: osd_sub_op(client.168752.0:2197931 14.2c7 888596c7/rbd_data.293272f8695e4.006f/head//14 [] v 1551'377417 snapset=0=[]:[] snapc=0=[]) v10
Re: [ceph-users] RBD for ephemeral
Since the status is 'Abandoned', it would appear that the fix has not been merged into any release of OpenStack. Thanks, Michael J. Kidd Sr. Storage Consultant Inktank Professional Services On Sun, May 18, 2014 at 5:13 PM, Yuming Ma (yumima) yum...@cisco.com wrote: Wondering what is the status of this fix https://review.openstack.org/#/c/46879/? Which release has it? — Yuming ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD for ephemeral
After sending my earlier email, I found another commit that was merged in March: https://review.openstack.org/#/c/59149/ It seems to follow the newer image-handling technique that was being sought, which is what prevented the first patch from being merged... Michael J. Kidd Sr. Storage Consultant Inktank Professional Services On Mon, May 19, 2014 at 11:20 AM, Pierre Grandin pierre.gran...@tubemogul.com wrote: Actually you can get the patched code from here for Havana: https://github.com/jdurgin/nova/tree/havana-ephemeral-rbd But I'm still trying to get it to work (in my case the volumes are still copies, and not copy-on-write). On Mon, May 19, 2014 at 7:19 AM, Michael J. Kidd michael.k...@inktank.com wrote: Since the status is 'Abandoned', it would appear that the fix has not been merged into any release of OpenStack. Thanks, Michael J. Kidd Sr. Storage Consultant Inktank Professional Services On Sun, May 18, 2014 at 5:13 PM, Yuming Ma (yumima) yum...@cisco.com wrote: Wondering what is the status of this fix https://review.openstack.org/#/c/46879/? Which release has it? — Yuming -- Pierre Grandin | Senior Site Reliability Engineer ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
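For anyone trying the havana-ephemeral-rbd branch Pierre mentions, the nova.conf settings usually paired with it look roughly like the following. The option names are the Havana-era ones from the Ceph/OpenStack integration docs and should be treated as assumptions to verify against the branch you actually deploy:

  # /etc/nova/nova.conf on the compute nodes (Havana-era option names)
  libvirt_images_type = rbd
  libvirt_images_rbd_pool = vms
  libvirt_images_rbd_ceph_conf = /etc/ceph/ceph.conf
  rbd_user = cinder
  rbd_secret_uuid = <your libvirt secret uuid>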
Re: [ceph-users] Ceph not replicating
You may also want to check your 'min_size'... if it's 2, then you'll be incomplete even with 1 complete copy. ceph osd dump | grep pool You can reduce the min size with the following syntax: ceph osd pool set poolname min_size 1 Thanks, Michael J. Kidd Sent from my mobile device. Please excuse brevity and typographical errors. On Apr 19, 2014 12:50 PM, Jean-Charles Lopez jc.lo...@inktank.com wrote: Hi again Looked at your ceph -s. You have only 2 OSDs, one on each node. The default replica count is 2, the default crush map says each replica on a different host, or may be you set it to 2 different OSDs. Anyway, when one of your OSD goes down, Ceph can no longer find another OSDs to host the second replica it must create. Looking at your crushmap we would know better. Recommendation: for testing efficiently and most options available, functionnally speaking, deploy a cluster with 3 nodes, 3 OSDs each is my best practice. Or make 1 node with 3 OSDs modifying your crushmap to choose type osd in your rulesets. JC On Saturday, April 19, 2014, Gonzalo Aguilar Delgado gagui...@aguilardelgado.com wrote: Hi, I'm building a cluster where two nodes replicate objects inside. I found that shutting down just one of the nodes (the second one), makes everything incomplete. I cannot find why, since crushmap looks good to me. after shutting down one node cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771 health HEALTH_WARN 192 pgs incomplete; 96 pgs stuck inactive; 96 pgs stuck unclean; 1/2 in osds are down monmap e9: 1 mons at {blue-compute=172.16.0.119:6789/0}, election epoch 1, quorum 0 blue-compute osdmap e73: 2 osds: 1 up, 2 in pgmap v172: 192 pgs, 3 pools, 275 bytes data, 1 objects 7552 kB used, 919 GB / 921 GB avail 192 incomplete Both nodes has WD Caviar Black 500MB disk with btrfs filesystem on it. Full disk used. I cannot understand why does not replicate to both nodes. Someone can help? Best regards, -- Sent while moving Pardon my French and any spelling | grammar glitches ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
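If several pools are affected, the same min_size change can be applied across the board; a small sketch that is only sensible on a throwaway test cluster like this two-OSD setup:

  for pool in $(rados lspools); do
      ceph osd pool set "$pool" min_size 1
  done
  ceph osd dump | grep pool    # verify the new min_size on each pool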
Re: [ceph-users] Ceph not replicating
Can I remove safely default pools? Yes, as long as you're not using the default pools to store data, you can delete them. Why total size is about 1GB?, because it should have 500MB since 2 replicas. I'm assuming that you're talking about the output of 'ceph df' or 'rados df'. These commands report *raw* storage capacity.. It's up to you to divide the raw capacity by the number of replicas. It's this way intentionally since you could have multiple pools each with different replica counts. btw.. I'd strongly urge you to re-deploy your OSDs with XFS instead of BTRFS. The last details I've seen show BTRFS slows drastically after only a few hours with a high file count in the filesystem. Better to re-deploy now than when you have data serving in production. Thanks, Michael J. Kidd Sr. Storage Consultant Inktank Professional Services On Sat, Apr 19, 2014 at 5:51 PM, Gonzalo Aguilar Delgado gagui...@aguilardelgado.com wrote: Hi Michael, It worked. I didn't realized of this because docs it installs two osd nodes and says that would become active+clean after installing them. (Something that didn't worked for me because the 3 replicas problem). http://ceph.com/docs/master/start/quick-ceph-deploy/ Now I can shutdown second node and I can retrieve the data stored there. So last questions are: Can I remove safely default pools? Why total size is about 1GB?, because it should have 500MB since 2 replicas. Thank you a lot for your help. PS: I will try now the openstack integration. El sáb, 19 de abr 2014 a las 6:53 , Michael J. Kidd michael.k...@inktank.com escribió: You may also want to check your 'min_size'... if it's 2, then you'll be incomplete even with 1 complete copy. ceph osd dump | grep pool You can reduce the min size with the following syntax: ceph osd pool set poolname min_size 1 Thanks, Michael J. Kidd Sent from my mobile device. Please excuse brevity and typographical errors. On Apr 19, 2014 12:50 PM, Jean-Charles Lopez jc.lo...@inktank.com wrote: Hi again Looked at your ceph -s. You have only 2 OSDs, one on each node. The default replica count is 2, the default crush map says each replica on a different host, or may be you set it to 2 different OSDs. Anyway, when one of your OSD goes down, Ceph can no longer find another OSDs to host the second replica it must create. Looking at your crushmap we would know better. Recommendation: for testing efficiently and most options available, functionnally speaking, deploy a cluster with 3 nodes, 3 OSDs each is my best practice. Or make 1 node with 3 OSDs modifying your crushmap to choose type osd in your rulesets. JC On Saturday, April 19, 2014, Gonzalo Aguilar Delgado gagui...@aguilardelgado.com wrote: Hi, I'm building a cluster where two nodes replicate objects inside. I found that shutting down just one of the nodes (the second one), makes everything incomplete. I cannot find why, since crushmap looks good to me. after shutting down one node cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771 health HEALTH_WARN 192 pgs incomplete; 96 pgs stuck inactive; 96 pgs stuck unclean; 1/2 in osds are down monmap e9: 1 mons at {blue-compute=172.16.0.119:6789/0}, election epoch 1, quorum 0 blue-compute osdmap e73: 2 osds: 1 up, 2 in pgmap v172: 192 pgs, 3 pools, 275 bytes data, 1 objects 7552 kB used, 919 GB / 921 GB avail 192 incomplete Both nodes has WD Caviar Black 500MB disk with btrfs filesystem on it. Full disk used. I cannot understand why does not replicate to both nodes. Someone can help? 
Best regards, -- Sent while moving Pardon my French and any spelling | grammar glitches ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
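To make both answers concrete: unused default pools can be dropped with the confirmation-guarded delete command, and usable space is simply raw space divided by the replica count. For example:

  # delete an unused default pool (the name must be typed twice, plus the safety flag)
  ceph osd pool delete data data --yes-i-really-really-mean-it
  # usable capacity with 2 replicas on ~921 GB of raw space:
  #   921 GB / 2 ~= 460 GB of client-visible capacity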
Re: [ceph-users] No more Journals ?
Journals will default to being on-disk with the OSD if there is nothing specified on the ceph-deploy line. If you have a separate journal device, then you should specify it per the original example syntax. Michael J. Kidd Sr. Storage Consultant Inktank Professional Services On Fri, Mar 14, 2014 at 8:22 AM, Markus Goldberg goldb...@uni-hildesheim.de wrote: Sorry, i should have asked a little bit clearer: Can ceph (or OSDs) be used without journals now ? The Journal-Parameter seems to be optional ( because of '[...]' ) Markus Am 14.03.2014 12:19, schrieb John Spray: Journals have not gone anywhere, and ceph-deploy still supports specifying them with exactly the same syntax as before. The page you're looking at is the simplified quick start, the detail on osd creation including journals is here: http://eu.ceph.com/docs/v0.77/rados/deployment/ceph-deploy-osd/ Cheers, John On Fri, Mar 14, 2014 at 9:47 AM, Markus Goldberg goldb...@uni-hildesheim.de wrote: Hi, i'm a little bit surprised. I read through the new manuals of 0.77 (http://eu.ceph.com/docs/v0.77/start/quick-ceph-deploy/) In the section of creating the osd the manual says: Then, from your admin node, use ceph-deploy to prepare the OSDs. ceph-deploy osd prepare {ceph-node}:/path/to/directory For example: ceph-deploy osd prepare node2:/var/local/osd0 node3:/var/local/osd1 Finally, activate the OSDs. ceph-deploy osd activate {ceph-node}:/path/to/directory For example: ceph-deploy osd activate node2:/var/local/osd0 node3:/var/local/osd1 In former versions the osd was created like: ceph-deploy -v --overwrite-conf osd --fs-type btrfs prepare bd-0:/dev/sdb:/dev/sda5 ^^ Journal As i remember defining and creating a journal for each osd was a must. So the question is: Are Journals obsolet now ? -- MfG, Markus Goldberg -- Markus Goldberg Universität Hildesheim Rechenzentrum Tel +49 5121 88392822 Marienburger Platz 22, D-31141 Hildesheim, Germany Fax +49 5121 88392823 email goldb...@uni-hildesheim.de -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- MfG, Markus Goldberg -- Markus Goldberg Universität Hildesheim Rechenzentrum Tel +49 5121 88392822 Marienburger Platz 22, D-31141 Hildesheim, Germany Fax +49 5121 88392823 email goldb...@uni-hildesheim.de -- ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
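To make the two forms concrete, here is how the same OSD would be prepared with and without a dedicated journal device (the host and device names are just examples):

  # journal co-located on the OSD disk (the default when no journal is specified)
  ceph-deploy osd prepare node2:/dev/sdb
  # journal on a separate device or partition
  ceph-deploy osd prepare node2:/dev/sdb:/dev/sda5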
Re: [ceph-users] Very high latency values
Hello Dan, A couple of quick things... * Latency is shown as a sum of all measured latencies over a period of time, and the count of operations included. So to calculate the average per-op latency, you must divide the sum by the count. The result will be in seconds. * The latency values you're showing there are from 'recoverystate_perf', meaning they're not relevant to normal operations of the OSDs. For that, I'd recommend doing a perf dump against the OSD admin socket and looking at the latency values under the osd section. * I've not seen any documentation on each counter, aside from occasional mailing list posts about specific counters.. Hope this helps! Michael J. Kidd Sr. Storage Consultant Inktank Professional Services On Fri, Mar 7, 2014 at 11:39 AM, Dan Ryder (daryder) dary...@cisco.com wrote: Hello, I'm working with two different Ceph clusters, and in both clusters, I'm seeing very high latency values. Here's part of a sample perf dump: recoverystate_perf: { initial_latency: { avgcount: 338, sum: 0.069851000}, started_latency: { avgcount: 1647, sum: 322317122.940019000}, reset_latency: { avgcount: 1985, sum: 195.935076000}, start_latency: { avgcount: 1985, sum: 0.234355000}, primary_latency: { avgcount: 266, sum: 10819570.688122000}, You can see both started latency and primary latency have extremely high values. Some info about the cluster: All nodes are on the same subnet - 2 VMs, 1 physical node. VM1 is just a Monitor, VM2 is a Monitor and OSD, and the physical node is just an OSD. One additional question: are these latency values in milliseconds? Is there any documentation on the units for the perf dump command? I've looked around but haven't seen anything. Thanks, Dan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
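As a concrete example of the sum/avgcount arithmetic, the average latency for a given counter can be computed straight from the admin socket output. The OSD id, the op_w_latency counter name, and the use of jq below are assumptions; any JSON-aware tool will do:

  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | \
    jq '.osd.op_w_latency.sum / .osd.op_w_latency.avgcount'
  # -> average write-op latency for that OSD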
Re: [ceph-users] pausing recovery when adding new machine
Hello Sid, You may try setting the 'noup' flag (instead of the 'noout' flag). This would prevent new OSDs from being set 'up' and therefore, the data rebalance shouldn't occur. Once you add all OSDs, then unset the 'noup' flag and ensure they're set 'up' automatically... if not, use 'ceph osd up osdid' to bring them up manually. Hope this helps! Michael J. Kidd Sr. Storage Consultant Inktank Professional Services On Fri, Mar 7, 2014 at 3:06 PM, Sidharta Mukerjee smukerje...@gmail.comwrote: When I use ceph-deploy to add a bunch of new OSDs (from a new machine), the ceph cluster starts rebalancing immediately; as a result, the first couple OSDs are started properly; but the last few can't start because I keep getting a timeout problem, as shown here: [root@ia6 ia_scripts]# service ceph start osd.24 === osd.24 === failed: 'timeout 10 /usr/bin/ceph --name=osd.24 --keyring=/var/lib/ceph/osd/ceph-24/keyring osd crush create-or-move -- 24 1.82 root=default host=ia6 Is there a way I can pause the recovery so that the overall system behaves way faster and I can then start all the OSDs, make sure they're up and they look normal (via ceph osd tree) , and then unpause recovery? -Sid ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
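A compressed sketch of that workflow (the flags are the standard ones; the OSD-creation step in the middle is whatever ceph-deploy invocation you already use):

  ceph osd set noup      # newly started OSDs stay 'down', so no rebalance begins
  # ...ceph-deploy osd prepare / activate the new OSDs on the new machine...
  ceph osd unset noup    # the waiting OSDs can now mark themselves up (restart any that don't)
  ceph osd tree          # confirm everything is up/in before letting recovery proceed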
Re: [ceph-users] CephFS: files never stored on OSDs
Seems that you may also need to tell CephFS to use the new pool instead of the default.. After CephFS is mounted, run: # cephfs /mnt/ceph set_layout -p 4 Michael J. Kidd Sr. Storage Consultant Inktank Professional Services On Fri, Feb 28, 2014 at 9:12 AM, Sage Weil s...@inktank.com wrote: Hi Florent, It sounds like the capability for the user you are authenticating as does not have access to the new OSD data pool. Try doing ceph auth list and see if there is an osd cap that mentions the data pool but not the new pool you created; that would explain your symptoms. sage On Fri, 28 Feb 2014, Florent Bautista wrote: Hi all, Today I'm testing CephFS with client-side kernel drivers. My installation is composed of 2 nodes, each one with a monitor and an OSD. One of them is also MDS. root@test2:~# ceph -s cluster 42081905-1a6b-4b9e-8984-145afe0f22f6 health HEALTH_OK monmap e2: 2 mons at {0=192.168.0.202:6789/0,1=192.168.0.200:6789/0 }, election epoch 18, quorum 0,1 0,1 mdsmap e15: 1/1/1 up {0=0=up:active} osdmap e82: 2 osds: 2 up, 2 in pgmap v4405: 384 pgs, 5 pools, 16677 MB data, 4328 objects 43473 MB used, 2542 GB / 2584 GB avail 384 active+clean I added data pool to MDS : ceph mds add_data_pool 4 Then I created keyring for my client : ceph --id admin --keyring /etc/ceph/ceph.client.admin.keyring auth get-or-create client.test mds 'allow' osd 'allow * pool=CephFS' mon 'allow *' /etc/ceph/ceph.client.test.keyring And I mount FS with : mount -o name=test,secret=AQC9YhBT8CE9GhAAdgDiVLGIIgEleen4vkOp5w==,noatime -t ceph 192.168.0.200,192.168.0.202:/ /mnt/ceph The client could be Debian 7.4 (kernel 3.2) or Ubuntu 13.11 (kernel 3.11). Mount is OK. I can write files to it. I can see files on every clients mounted. BUT... Where are stored my files ? My pool stays at 0 disk usage on rados df Disk usage of OSDs never grows... What did I miss ? When client A writes a file, I got Operation not permitted when client B reads the file, even if I sync FS. That sounds very strange to me, I think I missed something but I don't know what. Of course, no error in logs. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
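Two quick checks that usually resolve this combination of symptoms, reusing the client and pool names from Florent's commands (adjust for your own setup):

  ceph auth get client.test    # confirm which pools the osd cap actually allows
  # widen the cap so both the default data pool and the new pool are writable:
  ceph auth caps client.test mon 'allow *' mds 'allow' osd 'allow * pool=data, allow * pool=CephFS'
  # or, as suggested above, point the filesystem layout at the already-permitted pool:
  cephfs /mnt/ceph set_layout -p 4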
Re: [ceph-users] RedHat ceph boot question
While clearly not optimal for long term flexibility, I've found that adding my OSD's to fstab allows the OSDs to mount during boot, and they start automatically when they're already mounted during boot. Hope this helps until a permanent fix is available. Michael J. Kidd Sr. Storage Consultant Inktank Professional Services On Fri, Jan 24, 2014 at 9:08 PM, Derek Yarnell de...@umiacs.umd.edu wrote: So we have a test cluster, and two production clusters all running on RHEL6.5. Two are running Emperor and one of them running Dumpling. On all of them our OSDs do not start at boot it seems via the udev rules. The OSDs were created with ceph-deploy and are all GPT. The OSDs are visable with `ceph-disk list` and running `/usr/sbin/ceph-disk-activate {device}` mounts and adds them. Running a `partprobe {device}` does not seem to trigger the udev rule at all. I had found this issue[1] but we are definitely running code that was released after this ticket was closed. Has there been anyone else that has problems with udev on RHEL mounting their OSDs? [1] - http://tracker.ceph.com/issues/5194 Thanks, derek -- Derek T. Yarnell University of Maryland Institute for Advanced Computer Studies ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
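For reference, the fstab entries in question look something like the lines below; the device paths, mount points, and XFS options are examples only and should be replaced with your own (ideally using filesystem UUIDs rather than /dev names):

  # /etc/fstab
  /dev/sdb1   /var/lib/ceph/osd/ceph-0   xfs   defaults,noatime,inode64   0 0
  /dev/sdc1   /var/lib/ceph/osd/ceph-1   xfs   defaults,noatime,inode64   0 0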
Re: [ceph-users] servers advise (dell r515 or supermicro ....)
It's also good to note that the m500 has built-in RAIN protection (basically, diagonal parity at the NAND level). Should be very good for journal consistency. Sent from my mobile device. Please excuse brevity and typographical errors. On Jan 15, 2014 9:07 AM, Stefan Priebe s.pri...@profihost.ag wrote: On 15.01.2014 15:03, Robert van Leeuwen wrote: Power-Loss Protection: In the rare event that power fails while the drive is operating, power-loss protection helps ensure that data isn’t corrupted. Seems that not all power-protected SSDs are created equal: http://lkcl.net/reports/ssd_analysis.html The m500 is not tested but the m4 is. Up to now it seems that only Intel has done its homework. In general they *seem* to be the most reliable SSD provider. Testing the m4 is useless as it has no power-loss protection. The result should have been known before the test started. But yes, Intel is very reliable in general, though the 520 series and others from Intel aren't. Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] servers advise (dell r515 or supermicro ....)
actually, they're very inexpensive as far as SSD's go. The 960gb m500 can be had on Amazon for $499 US on prime (as of yesterday anyway). Sent from my mobile device. Please excuse brevity and typographical errors. On Jan 15, 2014 9:50 AM, Sebastien Han sebastien@enovance.com wrote: However you have to get 480GB which ridiculously large for a journal. I believe they are pretty expensive too. Sébastien Han Cloud Engineer Always give 100%. Unless you're giving blood.” Phone: +33 (0)1 49 70 99 72 Mail: sebastien@enovance.com Address : 10, rue de la Victoire - 75009 Paris Web : www.enovance.com - Twitter : @enovance On 15 Jan 2014, at 15:49, Sebastien Han sebastien@enovance.com wrote: Sorry I was only looking at the 4K aligned results. Sébastien Han Cloud Engineer Always give 100%. Unless you're giving blood.” Phone: +33 (0)1 49 70 99 72 Mail: sebastien@enovance.com Address : 10, rue de la Victoire - 75009 Paris Web : www.enovance.com - Twitter : @enovance On 15 Jan 2014, at 15:46, Stefan Priebe s.pri...@profihost.ag wrote: Am 15.01.2014 15:44, schrieb Mark Nelson: On 01/15/2014 08:39 AM, Stefan Priebe wrote: Am 15.01.2014 15:34, schrieb Sebastien Han: Hum the Crucial m500 is pretty slow. The biggest one doesn’t even reach 300MB/s. Intel DC S3700 100G showed around 200MB/sec for us. where did you get this values from? I've some 960GB and they all have 450Mb/s write speed. Also in tests like here you see 450MB/s http://www.tomshardware.com/reviews/crucial-m500-1tb-ssd,3551-5.html Looks like at least according to Anand's chart, you'll get full write speed once you buy the 480GB model, but not for the 120 or 240GB models: http://www.anandtech.com/show/6884/crucial-micron-m500-review-960gb-480gb-240gb-120gb that's correct but the sentence was The biggest one doesn’t even reach 300MB/s. Actually, I don’t know the price difference between the crucial and the intel but the intel looks more suitable for me. Especially after Mark’s comment. Sébastien Han Cloud Engineer Always give 100%. Unless you're giving blood.” Phone: +33 (0)1 49 70 99 72 Mail: sebastien@enovance.com Address : 10, rue de la Victoire - 75009 Paris Web : www.enovance.com - Twitter : @enovance On 15 Jan 2014, at 15:28, Mark Nelson mark.nel...@inktank.com wrote: On 01/15/2014 08:03 AM, Robert van Leeuwen wrote: Power-Loss Protection: In the rare event that power fails while the drive is operating, power-loss protection helps ensure that data isn’t corrupted. Seems that not all power protected SSDs are created equal: http://lkcl.net/reports/ssd_analysis.html The m500 is not tested but the m4 is. Up to now it seems that only Intel seems to have done his homework. In general they *seem* to be the most reliable SSD provider. Even at that, there has been some concern on the list (and lkml) that certain older Intel drives without super-capacitors are ignoring ATA_CMD_FLUSH, making them very fast (which I like!) but potentially dangerous (boo!). The 520 in particular is a drive I've used for a lot of Ceph performance testing but I'm afraid that if it's not properly handling CMD FLUSH requests, it may not be indicative of the performance folks would see on other drives that do. On the third hand, if drives with supercaps like the Intel DC S3700 can safely ignore CMD_FLUSH and maintain high performance (even when there are a lot of O_DSYNC calls, ala the journal), that potentially makes them even more attractive (and that drive already has relatively high sequential write performance and high write endurance). 
Cheers, Robert van Leeuwen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com