Re: [ceph-users] One of three monitors can not be started
There is an asok on computer06. I tried to start mon.computer06; about two hours later it still had not started, but there are some different processes on computer06 and I don't know how to handle them:

root  7812     1  0 11:39 pts/4 00:00:00 python /usr/sbin/ceph-create-keys -i computer06
root 11025     1 12 09:02 pts/4 00:32:13 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf
root 35692  7812  0 12:59 pts/4 00:00:00 python /usr/bin/ceph --cluster=ceph --admin-daemon=/var/run/ceph/ceph-mon.computer06.asok mon_status

I got the quorum_status from another running monitor:

{ "election_epoch": 508,
  "quorum": [0, 1],
  "quorum_names": ["computer05", "computer04"],
  "quorum_leader_name": "computer04",
  "monmap": { "epoch": 4,
      "fsid": "471483e5-493f-41f6-b6f4-0187c13d156d",
      "modified": "2014-07-26 09:52:02.411967",
      "created": "0.00",
      "mons": [
        { "rank": 0, "name": "computer04", "addr": "192.168.1.60:6789\/0"},
        { "rank": 1, "name": "computer05", "addr": "192.168.1.65:6789\/0"},
        { "rank": 2, "name": "computer06", "addr": "192.168.1.66:6789\/0"}]}}

> Date: Tue, 31 Mar 2015 12:30:22 -0700
> Subject: Re: [ceph-users] One of three monitors can not be started
> From: g...@gregs42.com
> To: zhanghaoyu1...@hotmail.com
> CC: ceph-users@lists.ceph.com
>
> On Tue, Mar 31, 2015 at 2:50 AM, 张皓宇 wrote:
> > Who can help me?
> >
> > One monitor in my ceph cluster can not be started.
> > Before that, I added '[mon] mon_compact_on_start = true' to
> > /etc/ceph/ceph.conf on three monitor hosts. Then I did 'ceph tell
> > mon.computer05 compact' on computer05, which has a monitor on it.
> > When store.db of computer05 shrank from 108G to 1G, mon.computer06 stopped,
> > and it has not been able to start since then.
> >
> > If I start mon.computer06, it stalls in this state:
> > # /etc/init.d/ceph start mon.computer06
> > === mon.computer06 ===
> > Starting Ceph mon.computer06 on computer06...
> >
> > The process info is like this:
> > root 12149  3807  0 20:46 pts/27 00:00:00 /bin/sh /etc/init.d/ceph start mon.computer06
> > root 12308 12149  0 20:46 pts/27 00:00:00 bash -c ulimit -n 32768; /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf
> > root 12309 12308  0 20:46 pts/27 00:00:00 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf
> > root 12313 12309 19 20:46 pts/27 00:00:01 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf
> >
> > The log on computer06 looks like this:
> > 2015-03-30 20:46:54.152956 7fc5379d07a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 12309
> > ...
> > 2015-03-30 20:46:54.759791 7fc5379d07a0 1 mon.computer06@-1(probing) e4 preinit clean up potentially inconsistent store state
>
> So I haven't looked at this code in a while, but I think the monitor
> is trying to validate that it's consistent with the others. You
> probably want to dig around the monitor admin sockets and see what
> state each monitor is in, plus its perception of the others.
>
> In this case, I think maybe mon.computer06 is trying to examine its
> whole store, but 100GB is a lot (way too much, in fact), so this can
> take a long time.
>
> > Sorry, my English is not good.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
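Given the quorum_status output posted above, the quickest way to see which monitor dropped out is to diff the monmap names against quorum_names. A minimal sketch (the JSON below is trimmed from the output in this thread):

```python
import json

# quorum_status as posted above, trimmed to the relevant fields
quorum_status = json.loads("""
{"election_epoch": 508,
 "quorum": [0, 1],
 "quorum_names": ["computer05", "computer04"],
 "quorum_leader_name": "computer04",
 "monmap": {"epoch": 4,
            "mons": [{"rank": 0, "name": "computer04", "addr": "192.168.1.60:6789/0"},
                     {"rank": 1, "name": "computer05", "addr": "192.168.1.65:6789/0"},
                     {"rank": 2, "name": "computer06", "addr": "192.168.1.66:6789/0"}]}}
""")

def mons_out_of_quorum(status):
    """Names of monitors that are in the monmap but not in quorum."""
    all_mons = {m["name"] for m in status["monmap"]["mons"]}
    return sorted(all_mons - set(status["quorum_names"]))

print(mons_out_of_quorum(quorum_status))  # -> ['computer06']
```

In practice the same JSON comes from `ceph quorum_status` or the per-daemon admin socket query already shown in the process listing above.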
Re: [ceph-users] Cascading Failure of OSDs
Hi,

Quentin Hartman wrote:
> Since I have been in ceph-land today, it reminded me that I needed to close
> the loop on this. I was finally able to isolate this problem down to a
> faulty NIC on the ceph cluster network. It "worked", but it was
> accumulating a huge number of Rx errors. My best guess is that some receive
> buffer cache failed? Anyway, having a NIC go weird like that is totally
> consistent with all the weird problems I was seeing, the corrupted PGs, and
> the inability of the cluster to settle down.
>
> As a result we've added NIC error rates to our monitoring suite on the
> cluster, so we'll hopefully see this coming if it ever happens again.

Good for you. ;)

Could you post here the command that you use to get NIC error rates?

--
François Lafont
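The poster's actual monitoring command isn't given in the thread, but the usual raw sources on Linux are `ip -s link show <iface>` or the per-interface counters under `/sys/class/net/<iface>/statistics/rx_errors`. A minimal sketch of the rate check such a monitor would perform, from two counter samples:

```python
def rx_error_rate(prev_errors, curr_errors, interval_s):
    """Rx errors per second between two samples of the rx_errors counter.

    On Linux the counter can be read from
    /sys/class/net/<iface>/statistics/rx_errors or from `ip -s link show`.
    A sustained non-zero rate is the kind of creeping NIC failure
    described in this thread.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    return (curr_errors - prev_errors) / interval_s

# e.g. two samples of the counter taken 60 s apart
print(rx_error_rate(1200, 1800, 60))  # -> 10.0
```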
Re: [ceph-users] Weird cluster restart behavior
On Tue, Mar 31, 2015 at 3:05 PM, Gregory Farnum wrote:
> On Tue, Mar 31, 2015 at 12:56 PM, Quentin Hartman wrote:
> >
> > My understanding is that the "right" method to take an entire cluster
> > offline is to set noout and then shut everything down. Is there a
> > better way?
>
> That's probably the best way to do it. Like I said, there was also a
> bug here that I think is fixed for Hammer but that might not have been
> backported to Giant. Unfortunately I don't remember the right keywords
> as I wasn't involved in the fix.

I'd hope that the complete shutdown scenario gets some more testing in the future... I know that Ceph is targeted more at "enterprise" situations where things like generators and properly sized battery backups aren't extravagant luxuries, but there are probably a lot of clusters out there that will get shut down completely, planned or unplanned.

--
Jeff Ollie
Re: [ceph-users] Weird cluster restart behavior
On Tue, Mar 31, 2015 at 2:05 PM, Gregory Farnum wrote:
> Github pull requests. :)

Ah, well that's easy: https://github.com/ceph/ceph/pull/4237

QH
Re: [ceph-users] Weird cluster restart behavior
On Tue, Mar 31, 2015 at 12:56 PM, Quentin Hartman wrote:
> Thanks for the extra info Gregory. I did not also set nodown.
>
> I expect that I will very rarely be shutting everything down in the normal
> course of things, but it has come up a couple times when having to do some
> physical re-organizing of racks. Little irritants like this aren't a big
> deal if people know to expect them, but as it is I lost quite a lot of time
> troubleshooting a non-existent problem. What's the best way to get notes to
> that effect added to the docs? It seems something in
> http://ceph.com/docs/master/rados/operations/operating/ would save some
> people some headache. I'm happy to propose edits, but a quick look doesn't
> reveal a process for submitting that sort of thing.

Github pull requests. :)

> My understanding is that the "right" method to take an entire cluster
> offline is to set noout and then shut everything down. Is there a better
> way?

That's probably the best way to do it. Like I said, there was also a
bug here that I think is fixed for Hammer but that might not have been
backported to Giant. Unfortunately I don't remember the right keywords
as I wasn't involved in the fix.
-Greg

> QH
>
> On Tue, Mar 31, 2015 at 1:35 PM, Gregory Farnum wrote:
>>
>> On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman wrote:
>> > I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1.
>> > Last friday I got everything deployed and all was working well, and I
>> > set noout and shut all the OSD nodes down over the weekend. Yesterday
>> > when I spun it back up, the OSDs were behaving very strangely,
>> > incorrectly marking each other down because of missed heartbeats, even
>> > though they were up. It looked like some kind of low-level networking
>> > problem, but I couldn't find any.
>> >
>> > After much work, I narrowed the apparent source of the problem down to
>> > the OSDs running on the first host I started in the morning. They were
>> > the ones that logged the most messages about not being able to ping
>> > other OSDs, and the other OSDs were mostly complaining about them.
>> > After running out of other ideas to try, I restarted them, and then
>> > everything started working. It's still working happily this morning.
>> > It seems as though when that set of OSDs started they got stale OSD
>> > map information from the MON boxes, which failed to be updated as the
>> > other OSDs came up. Does that make sense? I still don't consider
>> > myself an expert on ceph architecture and would appreciate any
>> > corrections or other possible interpretations of events (I'm happy to
>> > provide whatever additional information I can) so I can get a deeper
>> > understanding of things. If my interpretation of events is correct, it
>> > seems that might point at a bug.
>>
>> I can't find the ticket now, but I think we did indeed have a bug
>> around heartbeat failures when restarting nodes. This has been fixed
>> in other branches but might have been missed for giant. (Did you by
>> any chance set the nodown flag as well as noout?)
>>
>> In general Ceph isn't very happy with being shut down completely like
>> that and its behaviors aren't validated, so nothing will go seriously
>> wrong but you might find little irritants like this. It's particularly
>> likely when you're prohibiting state changes with the noout/nodown
>> flags.
>> -Greg
Re: [ceph-users] Weird cluster restart behavior
Thanks for the extra info Gregory. I did not also set nodown.

I expect that I will very rarely be shutting everything down in the normal course of things, but it has come up a couple times when having to do some physical re-organizing of racks. Little irritants like this aren't a big deal if people know to expect them, but as it is I lost quite a lot of time troubleshooting a non-existent problem. What's the best way to get notes to that effect added to the docs? It seems something in http://ceph.com/docs/master/rados/operations/operating/ would save some people some headache. I'm happy to propose edits, but a quick look doesn't reveal a process for submitting that sort of thing.

My understanding is that the "right" method to take an entire cluster offline is to set noout and then shut everything down. Is there a better way?

QH

On Tue, Mar 31, 2015 at 1:35 PM, Gregory Farnum wrote:
> On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman wrote:
> > I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1.
> > Last friday I got everything deployed and all was working well, and I
> > set noout and shut all the OSD nodes down over the weekend. Yesterday
> > when I spun it back up, the OSDs were behaving very strangely,
> > incorrectly marking each other down because of missed heartbeats, even
> > though they were up. It looked like some kind of low-level networking
> > problem, but I couldn't find any.
> >
> > After much work, I narrowed the apparent source of the problem down to
> > the OSDs running on the first host I started in the morning. They were
> > the ones that logged the most messages about not being able to ping
> > other OSDs, and the other OSDs were mostly complaining about them.
> > After running out of other ideas to try, I restarted them, and then
> > everything started working. It's still working happily this morning.
> > It seems as though when that set of OSDs started they got stale OSD map
> > information from the MON boxes, which failed to be updated as the other
> > OSDs came up. Does that make sense? I still don't consider myself an
> > expert on ceph architecture and would appreciate any corrections or
> > other possible interpretations of events (I'm happy to provide whatever
> > additional information I can) so I can get a deeper understanding of
> > things. If my interpretation of events is correct, it seems that might
> > point at a bug.
>
> I can't find the ticket now, but I think we did indeed have a bug
> around heartbeat failures when restarting nodes. This has been fixed
> in other branches but might have been missed for giant. (Did you by
> any chance set the nodown flag as well as noout?)
>
> In general Ceph isn't very happy with being shut down completely like
> that and its behaviors aren't validated, so nothing will go seriously
> wrong but you might find little irritants like this. It's particularly
> likely when you're prohibiting state changes with the noout/nodown
> flags.
> -Greg
Re: [ceph-users] Force an OSD to try to peer
On 03/31/2015 09:23 PM, Sage Weil wrote:
> It's nothing specific to peering (or ceph). The symptom we've seen is
> just that bytes stop passing across a TCP connection, usually when there
> are some largish messages being sent. The ping/heartbeat messages get
> through because they are small and we disable nagle so they never end up
> in large frames.

Is there any special route one should take in order to transition a live cluster to use jumbo frames and avoid such pitfalls with OSD peering?

-K.
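One common pre-check before (and after) raising the MTU is a don't-fragment ping sized to the target MTU between every pair of hosts: the ICMP payload is the MTU minus the 20-byte IPv4 header and 8-byte ICMP header, so MTU 9000 means `-s 8972`. A small sketch that builds the Linux iputils `ping` invocation (the host address here is just the one from the monitor thread above, used as a placeholder):

```python
def df_ping_cmd(host, mtu=9000):
    """Build a Linux `ping` that sends don't-fragment (-M do) packets
    sized to fill `mtu`.  Payload = MTU - 20 (IPv4 header) - 8 (ICMP
    header).  If this ping fails while a default-size ping succeeds,
    some hop in the path is not passing jumbo frames -- the silent
    failure mode described in this thread.
    """
    payload = mtu - 20 - 8
    return ["ping", "-M", "do", "-c", "3", "-s", str(payload), host]

print(" ".join(df_ping_cmd("192.168.1.66")))
# -> ping -M do -c 3 -s 8972 192.168.1.66
```

For a live transition, the usual order is: enable jumbo frames on the switches first, verify with the don't-fragment ping at the old MTU still working, then raise host MTUs, then re-verify at the new size.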
Re: [ceph-users] Weird cluster restart behavior
On Tue, Mar 31, 2015 at 7:50 AM, Quentin Hartman wrote:
> I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1.
> Last friday I got everything deployed and all was working well, and I set
> noout and shut all the OSD nodes down over the weekend. Yesterday when I
> spun it back up, the OSDs were behaving very strangely, incorrectly
> marking each other down because of missed heartbeats, even though they
> were up. It looked like some kind of low-level networking problem, but I
> couldn't find any.
>
> After much work, I narrowed the apparent source of the problem down to the
> OSDs running on the first host I started in the morning. They were the
> ones that logged the most messages about not being able to ping other
> OSDs, and the other OSDs were mostly complaining about them. After running
> out of other ideas to try, I restarted them, and then everything started
> working. It's still working happily this morning. It seems as though when
> that set of OSDs started they got stale OSD map information from the MON
> boxes, which failed to be updated as the other OSDs came up. Does that
> make sense? I still don't consider myself an expert on ceph architecture
> and would appreciate any corrections or other possible interpretations of
> events (I'm happy to provide whatever additional information I can) so I
> can get a deeper understanding of things. If my interpretation of events
> is correct, it seems that might point at a bug.

I can't find the ticket now, but I think we did indeed have a bug
around heartbeat failures when restarting nodes. This has been fixed
in other branches but might have been missed for giant. (Did you by
any chance set the nodown flag as well as noout?)

In general Ceph isn't very happy with being shut down completely like
that and its behaviors aren't validated, so nothing will go seriously
wrong but you might find little irritants like this. It's particularly
likely when you're prohibiting state changes with the noout/nodown
flags.
-Greg
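The whole-cluster stop/start discussed in this thread can be sketched as an ordered command plan: flag first, then stop daemons; on the way back up, unset the flag only after the OSDs have rejoined. The service commands below are illustrative assumptions (Upstart-era syntax, matching the Giant timeframe); adapt them to your init system.

```python
def full_cluster_shutdown_plan():
    """Ordered command lists for taking a whole Ceph cluster offline and
    back, per the noout approach discussed above.  A sketch of the
    procedure, not an official recipe: the `stop/start ceph-*-all`
    service names are assumptions for Upstart-based hosts.
    """
    stop = [
        "ceph osd set noout",   # keep CRUSH from marking OSDs out / rebalancing
        "stop ceph-osd-all",    # run on each OSD host
        "stop ceph-mon-all",    # monitors last
    ]
    start = [
        "start ceph-mon-all",   # monitors first, so quorum can form
        "start ceph-osd-all",   # then OSDs on each host
        "ceph osd unset noout", # only after OSDs are back up and in
    ]
    return stop, start

stop, start = full_cluster_shutdown_plan()
print(stop[0])    # -> ceph osd set noout
print(start[-1])  # -> ceph osd unset noout
```

Note Greg's caveat above still applies: `nodown` is a separate flag, and setting it alongside `noout` changes the behavior on restart.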
Re: [ceph-users] One of three monitors can not be started
On Tue, Mar 31, 2015 at 2:50 AM, 张皓宇 wrote:
> Who can help me?
>
> One monitor in my ceph cluster can not be started.
> Before that, I added '[mon] mon_compact_on_start = true' to
> /etc/ceph/ceph.conf on three monitor hosts. Then I did 'ceph tell
> mon.computer05 compact' on computer05, which has a monitor on it.
> When store.db of computer05 shrank from 108G to 1G, mon.computer06 stopped,
> and it has not been able to start since then.
>
> If I start mon.computer06, it stalls in this state:
> # /etc/init.d/ceph start mon.computer06
> === mon.computer06 ===
> Starting Ceph mon.computer06 on computer06...
>
> The process info is like this:
> root 12149  3807  0 20:46 pts/27 00:00:00 /bin/sh /etc/init.d/ceph start mon.computer06
> root 12308 12149  0 20:46 pts/27 00:00:00 bash -c ulimit -n 32768; /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf
> root 12309 12308  0 20:46 pts/27 00:00:00 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf
> root 12313 12309 19 20:46 pts/27 00:00:01 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf
>
> The log on computer06 looks like this:
> 2015-03-30 20:46:54.152956 7fc5379d07a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 12309
> ...
> 2015-03-30 20:46:54.759791 7fc5379d07a0 1 mon.computer06@-1(probing) e4 preinit clean up potentially inconsistent store state

So I haven't looked at this code in a while, but I think the monitor
is trying to validate that it's consistent with the others. You
probably want to dig around the monitor admin sockets and see what
state each monitor is in, plus its perception of the others.

In this case, I think maybe mon.computer06 is trying to examine its
whole store, but 100GB is a lot (way too much, in fact), so this can
take a long time.

> Sorry, my English is not good.
Re: [ceph-users] Force an OSD to try to peer
On Tue, 31 Mar 2015, Somnath Roy wrote:
> But, do we know why jumbo frames may have an impact on peering?
> In our setup so far, we haven't enabled jumbo frames for anything other
> than performance (if at all).

It's nothing specific to peering (or ceph). The symptom we've seen is
just that bytes stop passing across a TCP connection, usually when there
are some largish messages being sent. The ping/heartbeat messages get
through because they are small and we disable nagle so they never end up
in large frames.

It's a pain to diagnose.

sage

> Thanks & Regards
> Somnath
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Robert LeBlanc
> Sent: Tuesday, March 31, 2015 11:08 AM
> To: Sage Weil
> Cc: ceph-devel; Ceph-User
> Subject: Re: [ceph-users] Force an OSD to try to peer
>
> I was desperate for anything after exhausting every other possibility I
> could think of. Maybe I should put a checklist in the Ceph docs of things
> to look for.
>
> Thanks,
>
> On Tue, Mar 31, 2015 at 11:36 AM, Sage Weil wrote:
> > On Tue, 31 Mar 2015, Robert LeBlanc wrote:
> >> Turns out jumbo frames was not set on all the switch ports. Once that
> >> was resolved the cluster quickly became healthy.
> >
> > I always hesitate to point the finger at the jumbo frames
> > configuration but almost every time that is the culprit!
> >
> > Thanks for the update. :)
> > sage
> >
> >> On Mon, Mar 30, 2015 at 8:15 PM, Robert LeBlanc wrote:
> >> > I've been working at this peering problem all day. I've done a lot
> >> > of testing at the network layer and I just don't believe that we
> >> > have a problem that would prevent OSDs from peering. When looking
> >> > through osd_debug 20/20 logs, it just doesn't look like the OSDs are
> >> > trying to peer. I don't know if it is because there are so many
> >> > outstanding creations or what.
OSDs will peer with OSDs on other > >> > hosts, but for reason only chooses a certain number and not one that it > >> > needs to finish the peering process. > >> > > >> > I've check: firewall, open files, number of threads allowed. These > >> > usually have given me an error in the logs that helped me fix the > >> > problem. > >> > > >> > I can't find a configuration item that specifies how many peers an > >> > OSD should contact or anything that would be artificially limiting > >> > the peering connections. I've restarted the OSDs a number of times, > >> > as well as rebooting the hosts. I beleive if the OSDs finish > >> > peering everything will clear up. I can't find anything in pg query > >> > that would help me figure out what is blocking it (peering blocked > >> > by is empty). The PGs are scattered across all the hosts so we can't pin > >> > it down to a specific host. > >> > > >> > Any ideas on what to try would be appreciated. > >> > > >> > [ulhglive-root@ceph9 ~]# ceph --version ceph version 0.80.7 > >> > (6c0127fcb58008793d3c8b62d925bc91963672a3) > >> > [ulhglive-root@ceph9 ~]# ceph status > >> > cluster 48de182b-5488-42bb-a6d2-62e8e47b435c > >> > health HEALTH_WARN 1 pgs down; 1321 pgs peering; 1321 pgs > >> > stuck inactive; 1321 pgs stuck unclean; too few pgs per osd (17 < min 20) > >> > monmap e2: 3 mons at > >> > {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.2 > >> > 9:6789/0}, election epoch 30, quorum 0,1,2 mon1,mon2,mon3 > >> > osdmap e704: 120 osds: 120 up, 120 in > >> > pgmap v1895: 2048 pgs, 1 pools, 0 bytes data, 0 objects > >> > 11447 MB used, 436 TB / 436 TB avail > >> > 727 active+clean > >> > 990 peering > >> > 37 creating+peering > >> >1 down+peering > >> > 290 remapped+peering > >> >3 creating+remapped+peering > >> > > >> > { "state": "peering", > >> > "epoch": 707, > >> > "up": [ > >> > 40, > >> > 92, > >> > 48, > >> > 91], > >> > "acting": [ > >> > 40, > >> > 92, > >> > 48, > >> > 91], > >> > "info": { "pgid": 
"7.171", > >> > "last_update": "0'0", > >> > "last_complete": "0'0", > >> > "log_tail": "0'0", > >> > "last_user_version": 0, > >> > "last_backfill": "MAX", > >> > "purged_snaps": "[]", > >> > "history": { "epoch_created": 293, > >> > "last_epoch_started": 343, > >> > "last_epoch_clean": 343, > >> > "last_epoch_split": 0, > >> > "same_up_since": 688, > >> > "same_interval_since": 688, > >> > "same_primary_since": 608, > >> > "last_scrub": "0'0", > >> > "last_scrub_stamp": "2015-03-30 11:11:18.872851", > >> > "last_deep_scrub": "0'0", > >> > "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851", > >> > "last_clean_scrub_stamp": "0.00"}, > >> > "stats": { "version": "0'0", > >> > "r
Re: [ceph-users] Force an OSD to try to peer
At the L2 level, if the hosts and switches don't accept jumbo frames, they just drop them because they are too big. They are not fragmented because they don't go through a router. My problem is that OSDs were able to peer with other OSDs on the host, but my guess is that they never sent/received packets larger than 1500 bytes. Then other OSD processes tried to peer but sent packets larger than 1500 bytes causing the packets to be dropped and peering to stall. On Tue, Mar 31, 2015 at 12:10 PM, Somnath Roy wrote: > But, do we know why Jumbo frames may have an impact on peering ? > In our setup so far, we haven't enabled jumbo frames other than performance > reason (if at all). > > Thanks & Regards > Somnath > > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Robert LeBlanc > Sent: Tuesday, March 31, 2015 11:08 AM > To: Sage Weil > Cc: ceph-devel; Ceph-User > Subject: Re: [ceph-users] Force an OSD to try to peer > > I was desperate for anything after exhausting every other possibility I could > think of. Maybe I should put a checklist in the Ceph docs of things to look > for. > > Thanks, > > On Tue, Mar 31, 2015 at 11:36 AM, Sage Weil wrote: >> On Tue, 31 Mar 2015, Robert LeBlanc wrote: >>> Turns out jumbo frames was not set on all the switch ports. Once that >>> was resolved the cluster quickly became healthy. >> >> I always hesitate to point the finger at the jumbo frames >> configuration but almost every time that is the culprit! >> >> Thanks for the update. :) >> sage >> >> >> >>> >>> On Mon, Mar 30, 2015 at 8:15 PM, Robert LeBlanc >>> wrote: >>> > I've been working at this peering problem all day. I've done a lot >>> > of testing at the network layer and I just don't believe that we >>> > have a problem that would prevent OSDs from peering. When looking >>> > though osd_debug 20/20 logs, it just doesn't look like the OSDs are >>> > trying to peer. 
I don't know if it is because there are so many >>> > outstanding creations or what. OSDs will peer with OSDs on other >>> > hosts, but for reason only chooses a certain number and not one that it >>> > needs to finish the peering process. >>> > >>> > I've check: firewall, open files, number of threads allowed. These >>> > usually have given me an error in the logs that helped me fix the problem. >>> > >>> > I can't find a configuration item that specifies how many peers an >>> > OSD should contact or anything that would be artificially limiting >>> > the peering connections. I've restarted the OSDs a number of times, >>> > as well as rebooting the hosts. I beleive if the OSDs finish >>> > peering everything will clear up. I can't find anything in pg query >>> > that would help me figure out what is blocking it (peering blocked >>> > by is empty). The PGs are scattered across all the hosts so we can't pin >>> > it down to a specific host. >>> > >>> > Any ideas on what to try would be appreciated. 
>>> > >>> > [ulhglive-root@ceph9 ~]# ceph --version ceph version 0.80.7 >>> > (6c0127fcb58008793d3c8b62d925bc91963672a3) >>> > [ulhglive-root@ceph9 ~]# ceph status >>> > cluster 48de182b-5488-42bb-a6d2-62e8e47b435c >>> > health HEALTH_WARN 1 pgs down; 1321 pgs peering; 1321 pgs >>> > stuck inactive; 1321 pgs stuck unclean; too few pgs per osd (17 < min 20) >>> > monmap e2: 3 mons at >>> > {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.2 >>> > 9:6789/0}, election epoch 30, quorum 0,1,2 mon1,mon2,mon3 >>> > osdmap e704: 120 osds: 120 up, 120 in >>> > pgmap v1895: 2048 pgs, 1 pools, 0 bytes data, 0 objects >>> > 11447 MB used, 436 TB / 436 TB avail >>> > 727 active+clean >>> > 990 peering >>> > 37 creating+peering >>> >1 down+peering >>> > 290 remapped+peering >>> >3 creating+remapped+peering >>> > >>> > { "state": "peering", >>> > "epoch": 707, >>> > "up": [ >>> > 40, >>> > 92, >>> > 48, >>> > 91], >>> > "acting": [ >>> > 40, >>> > 92, >>> > 48, >>> > 91], >>> > "info": { "pgid": "7.171", >>> > "last_update": "0'0", >>> > "last_complete": "0'0", >>> > "log_tail": "0'0", >>> > "last_user_version": 0, >>> > "last_backfill": "MAX", >>> > "purged_snaps": "[]", >>> > "history": { "epoch_created": 293, >>> > "last_epoch_started": 343, >>> > "last_epoch_clean": 343, >>> > "last_epoch_split": 0, >>> > "same_up_since": 688, >>> > "same_interval_since": 688, >>> > "same_primary_since": 608, >>> > "last_scrub": "0'0", >>> > "last_scrub_stamp": "2015-03-30 11:11:18.872851", >>> > "last_deep_scrub": "0'0", >>> > "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851", >>> > "last_clean_scrub_stamp": "0.00"}, >>> > "stats": { "v
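The `ceph status` output quoted in this thread can be tallied directly: every state containing "peering" contributes to the 1321 stuck PGs reported in the HEALTH_WARN line. A small sketch using the numbers posted above:

```python
# PG state counts from the `ceph status` output quoted in this thread
pg_states = {
    "active+clean": 727,
    "peering": 990,
    "creating+peering": 37,
    "down+peering": 1,
    "remapped+peering": 290,
    "creating+remapped+peering": 3,
}

def stuck_peering(states):
    """Count PGs whose state string includes 'peering' -- the PGs that
    never reached active+clean until the switch-port MTU was fixed."""
    return sum(n for s, n in states.items() if "peering" in s)

total = sum(pg_states.values())
print(stuck_peering(pg_states), "of", total)  # -> 1321 of 2048
```

The sum matches the "1321 pgs peering; 1321 pgs stuck inactive" in the health warning, which is one quick way to confirm that every non-clean PG is blocked on the same cause.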
Re: [ceph-users] Force an OSD to try to peer
But, do we know why jumbo frames may have an impact on peering?
In our setup so far, we haven't enabled jumbo frames for anything other than performance (if at all).

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Robert LeBlanc
Sent: Tuesday, March 31, 2015 11:08 AM
To: Sage Weil
Cc: ceph-devel; Ceph-User
Subject: Re: [ceph-users] Force an OSD to try to peer

I was desperate for anything after exhausting every other possibility I could think of. Maybe I should put a checklist in the Ceph docs of things to look for.

Thanks,

On Tue, Mar 31, 2015 at 11:36 AM, Sage Weil wrote:
> On Tue, 31 Mar 2015, Robert LeBlanc wrote:
>> Turns out jumbo frames was not set on all the switch ports. Once that
>> was resolved the cluster quickly became healthy.
>
> I always hesitate to point the finger at the jumbo frames
> configuration but almost every time that is the culprit!
>
> Thanks for the update. :)
> sage
>
>> On Mon, Mar 30, 2015 at 8:15 PM, Robert LeBlanc wrote:
>> > I've been working at this peering problem all day. I've done a lot
>> > of testing at the network layer and I just don't believe that we
>> > have a problem that would prevent OSDs from peering. When looking
>> > through osd_debug 20/20 logs, it just doesn't look like the OSDs are
>> > trying to peer. I don't know if it is because there are so many
>> > outstanding creations or what. OSDs will peer with OSDs on other
>> > hosts, but for some reason only choose a certain number and not the
>> > ones they need to finish the peering process.
>> >
>> > I've checked: firewall, open files, number of threads allowed. These
>> > usually have given me an error in the logs that helped me fix the problem.
>> >
>> > I can't find a configuration item that specifies how many peers an
>> > OSD should contact or anything that would be artificially limiting
>> > the peering connections. I've restarted the OSDs a number of times,
>> > as well as rebooting the hosts.
I beleive if the OSDs finish >> > peering everything will clear up. I can't find anything in pg query >> > that would help me figure out what is blocking it (peering blocked >> > by is empty). The PGs are scattered across all the hosts so we can't pin >> > it down to a specific host. >> > >> > Any ideas on what to try would be appreciated. >> > >> > [ulhglive-root@ceph9 ~]# ceph --version ceph version 0.80.7 >> > (6c0127fcb58008793d3c8b62d925bc91963672a3) >> > [ulhglive-root@ceph9 ~]# ceph status >> > cluster 48de182b-5488-42bb-a6d2-62e8e47b435c >> > health HEALTH_WARN 1 pgs down; 1321 pgs peering; 1321 pgs >> > stuck inactive; 1321 pgs stuck unclean; too few pgs per osd (17 < min 20) >> > monmap e2: 3 mons at >> > {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.2 >> > 9:6789/0}, election epoch 30, quorum 0,1,2 mon1,mon2,mon3 >> > osdmap e704: 120 osds: 120 up, 120 in >> > pgmap v1895: 2048 pgs, 1 pools, 0 bytes data, 0 objects >> > 11447 MB used, 436 TB / 436 TB avail >> > 727 active+clean >> > 990 peering >> > 37 creating+peering >> >1 down+peering >> > 290 remapped+peering >> >3 creating+remapped+peering >> > >> > { "state": "peering", >> > "epoch": 707, >> > "up": [ >> > 40, >> > 92, >> > 48, >> > 91], >> > "acting": [ >> > 40, >> > 92, >> > 48, >> > 91], >> > "info": { "pgid": "7.171", >> > "last_update": "0'0", >> > "last_complete": "0'0", >> > "log_tail": "0'0", >> > "last_user_version": 0, >> > "last_backfill": "MAX", >> > "purged_snaps": "[]", >> > "history": { "epoch_created": 293, >> > "last_epoch_started": 343, >> > "last_epoch_clean": 343, >> > "last_epoch_split": 0, >> > "same_up_since": 688, >> > "same_interval_since": 688, >> > "same_primary_since": 608, >> > "last_scrub": "0'0", >> > "last_scrub_stamp": "2015-03-30 11:11:18.872851", >> > "last_deep_scrub": "0'0", >> > "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851", >> > "last_clean_scrub_stamp": "0.00"}, >> > "stats": { "version": "0'0", >> > "reported_seq": "326", >> > 
"reported_epoch": "707", >> > "state": "peering", >> > "last_fresh": "2015-03-30 20:10:39.509855", >> > "last_change": "2015-03-30 19:44:17.361601", >> > "last_active": "2015-03-30 11:37:56.956417", >> > "last_clean": "2015-03-30 11:37:56.956417", >> > "last_became_active": "0.00", >> > "last_unstale": "2015-03-30 20:10:39.509855", >> > "mapping_epoch": 683, >> > "log_start": "0'0", >> > "ondisk_log_start": "0'0", >> > "created": 293, >> > "last_epoch_cle
Re: [ceph-users] Force an OSD to try to peer
I was desperate for anything after exhausting every other possibility I could think of. Maybe I should put a checklist in the Ceph docs of things to look for.

Thanks,

On Tue, Mar 31, 2015 at 11:36 AM, Sage Weil wrote:
> On Tue, 31 Mar 2015, Robert LeBlanc wrote:
>> Turns out jumbo frames was not set on all the switch ports. Once that
>> was resolved the cluster quickly became healthy.
>
> I always hesitate to point the finger at the jumbo frames configuration
> but almost every time that is the culprit!
>
> Thanks for the update. :)
> sage
>
>> On Mon, Mar 30, 2015 at 8:15 PM, Robert LeBlanc wrote:
>> > I've been working at this peering problem all day. I've done a lot of
>> > testing at the network layer and I just don't believe that we have a
>> > problem that would prevent OSDs from peering. When looking through
>> > osd_debug 20/20 logs, it just doesn't look like the OSDs are trying to
>> > peer. I don't know if it is because there are so many outstanding
>> > creations or what. OSDs will peer with OSDs on other hosts, but for
>> > some reason only choose a certain number and not the ones they need
>> > to finish the peering process.
>> >
>> > I've checked: firewall, open files, number of threads allowed. These
>> > usually have given me an error in the logs that helped me fix the
>> > problem.
>> >
>> > I can't find a configuration item that specifies how many peers an OSD
>> > should contact or anything that would be artificially limiting the
>> > peering connections. I've restarted the OSDs a number of times, as
>> > well as rebooting the hosts. I believe if the OSDs finish peering
>> > everything will clear up. I can't find anything in pg query that would
>> > help me figure out what is blocking it (peering blocked by is empty).
>> > The PGs are scattered across all the hosts so we can't pin it down to
>> > a specific host.
>> >
>> > Any ideas on what to try would be appreciated.
>> > >> > [ulhglive-root@ceph9 ~]# ceph --version >> > ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) >> > [ulhglive-root@ceph9 ~]# ceph status >> > cluster 48de182b-5488-42bb-a6d2-62e8e47b435c >> > health HEALTH_WARN 1 pgs down; 1321 pgs peering; 1321 pgs stuck >> > inactive; 1321 pgs stuck unclean; too few pgs per osd (17 < min 20) >> > monmap e2: 3 mons at >> > {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}, >> > election epoch 30, quorum 0,1,2 mon1,mon2,mon3 >> > osdmap e704: 120 osds: 120 up, 120 in >> > pgmap v1895: 2048 pgs, 1 pools, 0 bytes data, 0 objects >> > 11447 MB used, 436 TB / 436 TB avail >> > 727 active+clean >> > 990 peering >> > 37 creating+peering >> >1 down+peering >> > 290 remapped+peering >> >3 creating+remapped+peering >> > >> > { "state": "peering", >> > "epoch": 707, >> > "up": [ >> > 40, >> > 92, >> > 48, >> > 91], >> > "acting": [ >> > 40, >> > 92, >> > 48, >> > 91], >> > "info": { "pgid": "7.171", >> > "last_update": "0'0", >> > "last_complete": "0'0", >> > "log_tail": "0'0", >> > "last_user_version": 0, >> > "last_backfill": "MAX", >> > "purged_snaps": "[]", >> > "history": { "epoch_created": 293, >> > "last_epoch_started": 343, >> > "last_epoch_clean": 343, >> > "last_epoch_split": 0, >> > "same_up_since": 688, >> > "same_interval_since": 688, >> > "same_primary_since": 608, >> > "last_scrub": "0'0", >> > "last_scrub_stamp": "2015-03-30 11:11:18.872851", >> > "last_deep_scrub": "0'0", >> > "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851", >> > "last_clean_scrub_stamp": "0.00"}, >> > "stats": { "version": "0'0", >> > "reported_seq": "326", >> > "reported_epoch": "707", >> > "state": "peering", >> > "last_fresh": "2015-03-30 20:10:39.509855", >> > "last_change": "2015-03-30 19:44:17.361601", >> > "last_active": "2015-03-30 11:37:56.956417", >> > "last_clean": "2015-03-30 11:37:56.956417", >> > "last_became_active": "0.00", >> > "last_unstale": "2015-03-30 20:10:39.509855", >> > 
"mapping_epoch": 683, >> > "log_start": "0'0", >> > "ondisk_log_start": "0'0", >> > "created": 293, >> > "last_epoch_clean": 343, >> > "parent": "0.0", >> > "parent_split_bits": 0, >> > "last_scrub": "0'0", >> > "last_scrub_stamp": "2015-03-30 11:11:18.872851", >> > "last_deep_scrub": "0'0", >> > "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851", >> > "last_clean_scrub_stamp": "0.00", >> > "log_size": 0, >> > "ondisk_log_size": 0, >> > "stats_
Re: [ceph-users] Force an OSD to try to peer
On Tue, 31 Mar 2015, Robert LeBlanc wrote: > Turns out jumbo frames were not set on all the switch ports. Once that > was resolved the cluster quickly became healthy. I always hesitate to point the finger at the jumbo frames configuration but almost every time that is the culprit! Thanks for the update. :) sage > > On Mon, Mar 30, 2015 at 8:15 PM, Robert LeBlanc wrote: > > I've been working at this peering problem all day. I've done a lot of > > testing at the network layer and I just don't believe that we have a problem > > that would prevent OSDs from peering. When looking through osd_debug 20/20 > > logs, it just doesn't look like the OSDs are trying to peer. I don't know if > > it is because there are so many outstanding creations or what. OSDs will > > peer with OSDs on other hosts, but for some reason only choose a certain number > > and not the ones they need to finish the peering process. > > > > I've checked: firewall, open files, number of threads allowed. These usually > > have given me an error in the logs that helped me fix the problem. > > > > I can't find a configuration item that specifies how many peers an OSD > > should contact or anything that would be artificially limiting the peering > > connections. I've restarted the OSDs a number of times, as well as rebooting > > the hosts. I believe if the OSDs finish peering everything will clear up. I > > can't find anything in pg query that would help me figure out what is > > blocking it (peering blocked by is empty). The PGs are scattered across all > > the hosts so we can't pin it down to a specific host. > > > > Any ideas on what to try would be appreciated. 
> > > > [ulhglive-root@ceph9 ~]# ceph --version > > ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) > > [ulhglive-root@ceph9 ~]# ceph status > > cluster 48de182b-5488-42bb-a6d2-62e8e47b435c > > health HEALTH_WARN 1 pgs down; 1321 pgs peering; 1321 pgs stuck > > inactive; 1321 pgs stuck unclean; too few pgs per osd (17 < min 20) > > monmap e2: 3 mons at > > {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}, > > election epoch 30, quorum 0,1,2 mon1,mon2,mon3 > > osdmap e704: 120 osds: 120 up, 120 in > > pgmap v1895: 2048 pgs, 1 pools, 0 bytes data, 0 objects > > 11447 MB used, 436 TB / 436 TB avail > > 727 active+clean > > 990 peering > > 37 creating+peering > >1 down+peering > > 290 remapped+peering > >3 creating+remapped+peering > > > > { "state": "peering", > > "epoch": 707, > > "up": [ > > 40, > > 92, > > 48, > > 91], > > "acting": [ > > 40, > > 92, > > 48, > > 91], > > "info": { "pgid": "7.171", > > "last_update": "0'0", > > "last_complete": "0'0", > > "log_tail": "0'0", > > "last_user_version": 0, > > "last_backfill": "MAX", > > "purged_snaps": "[]", > > "history": { "epoch_created": 293, > > "last_epoch_started": 343, > > "last_epoch_clean": 343, > > "last_epoch_split": 0, > > "same_up_since": 688, > > "same_interval_since": 688, > > "same_primary_since": 608, > > "last_scrub": "0'0", > > "last_scrub_stamp": "2015-03-30 11:11:18.872851", > > "last_deep_scrub": "0'0", > > "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851", > > "last_clean_scrub_stamp": "0.00"}, > > "stats": { "version": "0'0", > > "reported_seq": "326", > > "reported_epoch": "707", > > "state": "peering", > > "last_fresh": "2015-03-30 20:10:39.509855", > > "last_change": "2015-03-30 19:44:17.361601", > > "last_active": "2015-03-30 11:37:56.956417", > > "last_clean": "2015-03-30 11:37:56.956417", > > "last_became_active": "0.00", > > "last_unstale": "2015-03-30 20:10:39.509855", > > "mapping_epoch": 683, > > "log_start": "0'0", > > "ondisk_log_start": 
"0'0", > > "created": 293, > > "last_epoch_clean": 343, > > "parent": "0.0", > > "parent_split_bits": 0, > > "last_scrub": "0'0", > > "last_scrub_stamp": "2015-03-30 11:11:18.872851", > > "last_deep_scrub": "0'0", > > "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851", > > "last_clean_scrub_stamp": "0.00", > > "log_size": 0, > > "ondisk_log_size": 0, > > "stats_invalid": "0", > > "stat_sum": { "num_bytes": 0, > > "num_objects": 0, > > "num_object_clones": 0, > > "num_object_copies": 0, > > "num_objects_missing_on_primary": 0, > > "num_objects_degraded": 0, > > "num_objects_unfound": 0, > > "num_object
Re: [ceph-users] Force an OSD to try to peer
Turns out jumbo frames were not set on all the switch ports. Once that was resolved the cluster quickly became healthy. On Mon, Mar 30, 2015 at 8:15 PM, Robert LeBlanc wrote: > I've been working at this peering problem all day. I've done a lot of > testing at the network layer and I just don't believe that we have a problem > that would prevent OSDs from peering. When looking through osd_debug 20/20 > logs, it just doesn't look like the OSDs are trying to peer. I don't know if > it is because there are so many outstanding creations or what. OSDs will > peer with OSDs on other hosts, but for some reason only choose a certain number > and not the ones they need to finish the peering process. > > I've checked: firewall, open files, number of threads allowed. These usually > have given me an error in the logs that helped me fix the problem. > > I can't find a configuration item that specifies how many peers an OSD > should contact or anything that would be artificially limiting the peering > connections. I've restarted the OSDs a number of times, as well as rebooting > the hosts. I believe if the OSDs finish peering everything will clear up. I > can't find anything in pg query that would help me figure out what is > blocking it (peering blocked by is empty). The PGs are scattered across all > the hosts so we can't pin it down to a specific host. > > Any ideas on what to try would be appreciated. 
> > [ulhglive-root@ceph9 ~]# ceph --version > ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3) > [ulhglive-root@ceph9 ~]# ceph status > cluster 48de182b-5488-42bb-a6d2-62e8e47b435c > health HEALTH_WARN 1 pgs down; 1321 pgs peering; 1321 pgs stuck > inactive; 1321 pgs stuck unclean; too few pgs per osd (17 < min 20) > monmap e2: 3 mons at > {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}, > election epoch 30, quorum 0,1,2 mon1,mon2,mon3 > osdmap e704: 120 osds: 120 up, 120 in > pgmap v1895: 2048 pgs, 1 pools, 0 bytes data, 0 objects > 11447 MB used, 436 TB / 436 TB avail > 727 active+clean > 990 peering > 37 creating+peering >1 down+peering > 290 remapped+peering >3 creating+remapped+peering > > { "state": "peering", > "epoch": 707, > "up": [ > 40, > 92, > 48, > 91], > "acting": [ > 40, > 92, > 48, > 91], > "info": { "pgid": "7.171", > "last_update": "0'0", > "last_complete": "0'0", > "log_tail": "0'0", > "last_user_version": 0, > "last_backfill": "MAX", > "purged_snaps": "[]", > "history": { "epoch_created": 293, > "last_epoch_started": 343, > "last_epoch_clean": 343, > "last_epoch_split": 0, > "same_up_since": 688, > "same_interval_since": 688, > "same_primary_since": 608, > "last_scrub": "0'0", > "last_scrub_stamp": "2015-03-30 11:11:18.872851", > "last_deep_scrub": "0'0", > "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851", > "last_clean_scrub_stamp": "0.00"}, > "stats": { "version": "0'0", > "reported_seq": "326", > "reported_epoch": "707", > "state": "peering", > "last_fresh": "2015-03-30 20:10:39.509855", > "last_change": "2015-03-30 19:44:17.361601", > "last_active": "2015-03-30 11:37:56.956417", > "last_clean": "2015-03-30 11:37:56.956417", > "last_became_active": "0.00", > "last_unstale": "2015-03-30 20:10:39.509855", > "mapping_epoch": 683, > "log_start": "0'0", > "ondisk_log_start": "0'0", > "created": 293, > "last_epoch_clean": 343, > "parent": "0.0", > "parent_split_bits": 0, > "last_scrub": "0'0", > 
"last_scrub_stamp": "2015-03-30 11:11:18.872851", > "last_deep_scrub": "0'0", > "last_deep_scrub_stamp": "2015-03-30 11:11:18.872851", > "last_clean_scrub_stamp": "0.00", > "log_size": 0, > "ondisk_log_size": 0, > "stats_invalid": "0", > "stat_sum": { "num_bytes": 0, > "num_objects": 0, > "num_object_clones": 0, > "num_object_copies": 0, > "num_objects_missing_on_primary": 0, > "num_objects_degraded": 0, > "num_objects_unfound": 0, > "num_objects_dirty": 0, > "num_whiteouts": 0, > "num_read": 0, > "num_read_kb": 0, > "num_write": 0, > "num_write_kb": 0, > "num_scrub_errors": 0, > "num_shallow_scrub_errors": 0, > "num_deep_scrub_errors": 0, > "num_objects_recovered": 0, > "num_bytes_recovered": 0, >
Re: [ceph-users] SSD Hardware recommendation
Speaking of SSD IOPS. Running the same tests on my SSDs (LiteOn ECT-480N9S 480GB SSDs): The lines at the bottom are a single 6TB spinning disk for comparison's sake. http://imgur.com/a/fD0Mh Based on these numbers, there is a minimum latency per operation, but multiple operations can be performed simultaneously. The sweet spot for my SSDs is ~8 journals per SSD to maximize IOPS on a per-journal basis. Unfortunately, at 8 journals, the overall IOPS is much less than the stated IOPS for the SSD (~5000 vs 9000 IOPS). Better than spinning disks, but not what I was expecting. The spreadsheet is available here: https://people.beocat.cis.ksu.edu/~mozes/hobbit-ssd-vs-std-iops.ods -- Adam On Tue, Mar 31, 2015 at 7:09 AM, f...@univ-lr.fr wrote: > Hi, > > in our quest to get the right SSD for OSD journals, I managed to benchmark > two kinds of "10 DWPD" SSDs : - Toshiba M2 PX02SMF020 > - Samsung 845DC PRO > > I want to determine if a disk is appropriate considering its absolute > performances, and the optimal number of ceph-osd processes using the SSD as > a journal. The benchmark consists of a fio command, with SYNC and DIRECT access > options, and 4k block write accesses. > > fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --runtime=60 > --time_based --group_reporting --name=journal-test --iodepth=<1 or 16> > --numjobs=< ranging from 1 to 16> > > I think numjobs can represent the concurrent number of OSDs served by this > SSD. Am I right on this ? > > > http://www.4shared.com/download/WOvooKVXce/Fio-Direct-Sync-ToshibaM2-Sams.png?lgfp=3000 > > My understanding of that data is that the 845DC Pro cannot be used for more > than 4 OSDs. > The M2 is very consistent in its behavior. > The iodepth has almost no impact on performance here. > > Could someone with other SSD types run the same test to consolidate the > data ? > > Among the short list that could be considered for that task (for their > price/perfs/DWPD/...) 
: > - Seagate 1200 SSD 200GB, SAS 12Gb/s ST200FM0053 > - Hitachi SSD800MM MLC HUSMM8020ASS200 > - Intel DC3700 > > I've not yet considered write amplification mentioned in other posts. > > Frederic > > Josef Johansson wrote on 20/03/15 10:29 : > > > The 845DC Pro does look really nice, comparable with s3700 with TBW even. > The price is what really does it, as it’s almost a third compared with > s3700.. > > > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
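Adam's "sweet spot" observation can be restated arithmetically: aggregate IOPS grows sub-linearly as more journals share one SSD, so per-journal IOPS keeps falling. A small sketch with invented figures (these are NOT the measurements from the linked spreadsheet, just numbers shaped like them):

```python
# Hypothetical example: numjobs (journals sharing one SSD) -> aggregate
# 4k sync-write IOPS measured by fio. Values invented for illustration.
measured = {1: 1800, 2: 3100, 4: 4300, 8: 5000, 16: 5300}

def per_journal_iops(aggregate):
    """Split the aggregate IOPS across the journals sharing the SSD."""
    return {n: total / n for n, total in aggregate.items()}

for n, per in sorted(per_journal_iops(measured).items()):
    print(f"{n:2d} journals -> {per:7.1f} IOPS each")
```

The "sweet spot" is then a judgment call over this table: the point where adding another journal stops buying meaningful aggregate IOPS while still dividing the per-journal share further.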
Re: [ceph-users] SSD Journaling
Hi Mark, Yes my reads are consistently slower. I have tested both Random and Sequential and various block sizes. Thanks Pankaj -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: Monday, March 30, 2015 1:07 PM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] SSD Journaling On 03/30/2015 03:01 PM, Garg, Pankaj wrote: > Hi, > > I'm benchmarking my small cluster with HDDs vs HDDs with SSD Journaling. > I am using both RADOS bench and Block device (using fio) for testing. > > I am seeing significant Write performance improvements, as expected. I > am however seeing the Reads coming out a bit slower on the SSD > Journaling side. They are not terribly different, but sometimes 10% slower. > > Is that something other folks have also seen, or do I need some > settings to be tuned properly? I'm wondering if accessing 2 drives for > reads, adds latency and hence the throughput suffers. Hi, What kind of reads are you seeing the degradation with? Is it consistent with different sizes and random/seq? Any interesting spikes or valleys during the tests? > > Thanks > > Pankaj > > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Weird cluster restart behavior
I'm working on redeploying a 14-node cluster. I'm running giant 0.87.1. Last Friday I got everything deployed and all was working well, and I set noout and shut all the OSD nodes down over the weekend. Yesterday when I spun it back up, the OSDs were behaving very strangely, incorrectly marking each other down because of missed heartbeats, even though they were up. It looked like some kind of low-level networking problem, but I couldn't find any. After much work, I narrowed the apparent source of the problem down to the OSDs running on the first host I started in the morning. They were the ones that logged the most messages about not being able to ping other OSDs, and the other OSDs were mostly complaining about them. After running out of other ideas to try, I restarted them, and then everything started working. It's still working happily this morning. It seems as though when that set of OSDs started they got stale OSD map information from the MON boxes, which failed to be updated as the other OSDs came up. Does that make sense? I still don't consider myself an expert on ceph architecture and would appreciate any corrections or other possible interpretations of events (I'm happy to provide whatever additional information I can) so I can get a deeper understanding of things. If my interpretation of events is correct, it seems that might point at a bug. QH ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Creating and deploying OSDs in parallel
Hi Somnath, We have deployed many machines in parallel and it generally works. Keep in mind that if you deploy very many (>1000) then this will create so many osdmap incrementals, so quickly, that the memory usage on the OSDs will increase substantially (until you reboot). Best Regards, Dan On Mon, Mar 30, 2015 at 5:29 PM, Somnath Roy wrote: > Hi, > > I am planning to modify our deployment script so that it can create and > deploy multiple OSDs in parallel on the same host as well as on different > hosts. > > Just wanted to check if there is any problem with running, say, ‘ceph-deploy osd > create’ etc. in parallel while deploying the cluster. > > > > Thanks & Regards > > Somnath > > > > > PLEASE NOTE: The information contained in this electronic mail message is > intended only for the use of the designated recipient(s) named above. If the > reader of this message is not the intended recipient, you are hereby > notified that you have received this message in error and that any review, > dissemination, distribution, or copying of this message is strictly > prohibited. If you have received this communication in error, please notify > the sender by telephone or e-mail (as shown above) immediately and destroy > any and all copies of this message in your possession (whether hard copies > or electronically stored copies). > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
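A sketch of what the parallel deployment loop could look like. The `deploy` function here is a placeholder standing in for a real `ceph-deploy osd create <host>` invocation (whether ceph-deploy itself is safe to run concurrently is exactly the question being asked above, so treat this as structure, not a guarantee):

```python
from concurrent.futures import ThreadPoolExecutor

def deploy(host):
    # Placeholder: a real script would shell out here, e.g.
    #   subprocess.run(["ceph-deploy", "osd", "create", host], check=True)
    return f"deployed {host}"

# Hypothetical host names; run up to 4 deployments at a time.
hosts = [f"osd-host{i:02d}" for i in range(1, 5)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(deploy, hosts))
print(results)
```

Capping `max_workers` is one way to limit how fast osdmap incrementals pile up, per Dan's warning about memory growth with very large parallel deployments.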
Re: [ceph-users] SSD Hardware recommendation
Hi, in our quest to get the right SSD for OSD journals, I managed to benchmark two kinds of "10 DWPD" SSDs : - Toshiba M2 PX02SMF020 - Samsung 845DC PRO I want to determine if a disk is appropriate considering its absolute performances, and the optimal number of ceph-osd processes using the SSD as a journal. The benchmark consists of a fio command, with SYNC and DIRECT access options, and 4k block write accesses. fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --runtime=60 --time_based --group_reporting --name=journal-test --iodepth=<1 or 16> --numjobs=< ranging from 1 to 16> I think numjobs can represent the concurrent number of OSDs served by this SSD. Am I right on this ? http://www.4shared.com/download/WOvooKVXce/Fio-Direct-Sync-ToshibaM2-Sams.png?lgfp=3000 My understanding of that data is that the 845DC Pro cannot be used for more than 4 OSDs. The M2 is very consistent in its behavior. The iodepth has almost no impact on performance here. Could someone with other SSD types run the same test to consolidate the data ? Among the short list that could be considered for that task (for their price/perfs/DWPD/...) : - Seagate 1200 SSD 200GB, SAS 12Gb/s ST200FM0053 - Hitachi SSD800MM MLC HUSMM8020ASS200 - Intel DC3700 I've not yet considered write amplification mentioned in other posts. Frederic Josef Johansson wrote on 20/03/15 10:29 : The 845DC Pro does look really nice, comparable with s3700 with TBW even. The price is what really does it, as it’s almost a third compared with s3700.. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Radosgw multi-region user creation question
Hi I'm trying to set up a POC multi-region radosgw configuration (with different ceph clusters). Following the official docs[1], the part about creating zone system users was not very clear. Take an example configuration of 2 regions, US (master zone us-dc1) and EU (master zone eu-dc1), with secondary zones of the other region also created in each. If I create zone users separately in the 2 regions, i.e. a us-dc1 zone user & an eu-dc1 zone user, the metadata sync does occur, but if I try to create a bucket with the location set to the secondary region, it fails with a 403 access denied, as the system user of the secondary region is unknown to the master region. I was able to work around this by creating a system user for the secondary zone of the secondary region in the master region (i.e. creating a system user for the eu secondary zone in the us region) and then recreating the user in the secondary region, passing the --access & --secret-key parameters to recreate the same user with the same keys. This seemed to work, however I'm not sure whether this is the direction to proceed, as the docs do not mention a step like this. [1] http://ceph.com/docs/master/radosgw/federated-config/#configure-a-secondary-region -- Abhishek ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Radosgw authorization failed
> Date: Mon, 30 Mar 2015 12:17:48 -0400 > From: yeh...@redhat.com > To: neville.tay...@hotmail.co.uk > CC: ceph-users@lists.ceph.com > Subject: Re: [ceph-users] Radosgw authorization failed > > > > - Original Message - > > From: "Neville" > > To: "Yehuda Sadeh-Weinraub" > > Cc: ceph-users@lists.ceph.com > > Sent: Monday, March 30, 2015 6:49:29 AM > > Subject: Re: [ceph-users] Radosgw authorization failed > > > > > > > Date: Wed, 25 Mar 2015 11:43:44 -0400 > > > From: yeh...@redhat.com > > > To: neville.tay...@hotmail.co.uk > > > CC: ceph-users@lists.ceph.com > > > Subject: Re: [ceph-users] Radosgw authorization failed > > > > > > > > > > > > - Original Message - > > > > From: "Neville" > > > > To: ceph-users@lists.ceph.com > > > > Sent: Wednesday, March 25, 2015 8:16:39 AM > > > > Subject: [ceph-users] Radosgw authorization failed > > > > > > > > Hi all, > > > > > > > > I'm testing backup product which supports Amazon S3 as target for > > > > Archive > > > > storage and I'm trying to setup a Ceph cluster configured with the S3 > > > > API > > > > to > > > > use as an internal target for backup archives instead of AWS. > > > > > > > > I've followed the online guide for setting up Radosgw and created a > > > > default > > > > region and zone based on the AWS naming convention US-East-1. I'm not > > > > sure > > > > if this is relevant but since I was having issues I thought it might > > > > need > > > > to > > > > be the same. > > > > > > > > I've tested the radosgw using boto.s3 and it seems to work ok i.e. I can > > > > create a bucket, create a folder, list buckets etc. The problem is when > > > > the > > > > backup software tries to create an object I get an authorization > > > > failure. > > > > It's using the same user/access/secret as I'm using from boto.s3 and I'm > > > > sure the creds are right as it lets me create the initial connection, it > > > > just fails when trying to create an object (backup folder). 
> > > > > > > > Here's the extract from the radosgw log: > > > > > > > > - > > > > 2015-03-25 15:07:26.449227 7f1050dc7700 2 req 5:0.000419:s3:GET > > > > /:list_bucket:init op > > > > 2015-03-25 15:07:26.449232 7f1050dc7700 2 req 5:0.000424:s3:GET > > > > /:list_bucket:verifying op mask > > > > 2015-03-25 15:07:26.449234 7f1050dc7700 20 required_mask= 1 > > > > user.op_mask=7 > > > > 2015-03-25 15:07:26.449235 7f1050dc7700 2 req 5:0.000427:s3:GET > > > > /:list_bucket:verifying op permissions > > > > 2015-03-25 15:07:26.449237 7f1050dc7700 5 Searching permissions for > > > > uid=test > > > > mask=49 > > > > 2015-03-25 15:07:26.449238 7f1050dc7700 5 Found permission: 15 > > > > 2015-03-25 15:07:26.449239 7f1050dc7700 5 Searching permissions for > > > > group=1 > > > > mask=49 > > > > 2015-03-25 15:07:26.449240 7f1050dc7700 5 Found permission: 15 > > > > 2015-03-25 15:07:26.449241 7f1050dc7700 5 Searching permissions for > > > > group=2 > > > > mask=49 > > > > 2015-03-25 15:07:26.449242 7f1050dc7700 5 Found permission: 15 > > > > 2015-03-25 15:07:26.449243 7f1050dc7700 5 Getting permissions id=test > > > > owner=test perm=1 > > > > 2015-03-25 15:07:26.449244 7f1050dc7700 10 uid=test requested perm > > > > (type)=1, > > > > policy perm=1, user_perm_mask=1, acl perm=1 > > > > 2015-03-25 15:07:26.449245 7f1050dc7700 2 req 5:0.000437:s3:GET > > > > /:list_bucket:verifying op params > > > > 2015-03-25 15:07:26.449247 7f1050dc7700 2 req 5:0.000439:s3:GET > > > > /:list_bucket:executing > > > > 2015-03-25 15:07:26.449252 7f1050dc7700 10 cls_bucket_list > > > > test1(@{i=.us-east.rgw.buckets.index}.us-east.rgw.buckets[us-east.280959.2]) > > > > start num 1001 > > > > 2015-03-25 15:07:26.450828 7f1050dc7700 2 req 5:0.002020:s3:GET > > > > /:list_bucket:http status=200 > > > > 2015-03-25 15:07:26.450832 7f1050dc7700 1 == req done > > > > req=0x7f107000e2e0 > > > > http_status=200 == > > > > 2015-03-25 15:07:26.516999 7f1069df9700 20 enqueued request > > > > req=0x7f107000f0e0 
> > > > 2015-03-25 15:07:26.517006 7f1069df9700 20 RGWWQ: > > > > 2015-03-25 15:07:26.517007 7f1069df9700 20 req: 0x7f107000f0e0 > > > > 2015-03-25 15:07:26.517010 7f1069df9700 10 allocated request > > > > req=0x7f107000f6b0 > > > > 2015-03-25 15:07:26.517021 7f1058dd7700 20 dequeued request > > > > req=0x7f107000f0e0 > > > > 2015-03-25 15:07:26.517023 7f1058dd7700 20 RGWWQ: empty > > > > 2015-03-25 15:07:26.517081 7f1058dd7700 20 CONTENT_LENGTH=88 > > > > 2015-03-25 15:07:26.517084 7f1058dd7700 20 > > > > CONTENT_TYPE=application/octet-stream > > > > 2015-03-25 15:07:26.517085 7f1058dd7700 20 > > > > CONTEXT_DOCUMENT_ROOT=/var/www > > > > 2015-03-25 15:07:26.517086 7f1058dd7700 20 CONTEXT_PREFIX= > > > > 2015-03-25 15:07:26.517087 7f1058dd7700 20 DOCUMENT_ROOT=/var/www > > > > 2015-03-25 15:07:26.517088 7f1058dd7700 20 FCGI_ROLE=RESPONDER > > > >
Re: [ceph-users] Cannot add OSD node into crushmap or all writes fail
Check firewall rules, network connectivity. Can all nodes and clients reach each other? Can you telnet to OSD ports (note that multiple OSDs may listen on different ports)? On 3/31/15 8:44, Tyler Bishop wrote: I have this ceph node that will correctly recover into my ceph pool and performance looks to be normal for the rbd clients. However, a few minutes after finishing recovery, the rbd clients begin to fall over and cannot write data to the pool. I've been trying to figure this out for weeks! None of the logs contain anything relevant at all. If I disable the node in the crushmap the rbd clients immediately begin writing to the other nodes. Ideas? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
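The telnet check suggested above can be scripted; a minimal sketch using a plain TCP connect with a timeout (the host/port you would feed it — each monitor's 6789 and every OSD's listening ports — come from your own cluster; the demo below uses a local socket instead of a real daemon):

```python
import socket

def can_connect(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout,
    roughly equivalent to `telnet host port`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a locally opened listening socket instead of a real OSD:
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
print(can_connect("127.0.0.1", port))  # True
listener.close()
```

Looping this over every (node, port) pair from both the client and the OSD hosts quickly separates firewall/routing problems from Ceph-level ones.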
[ceph-users] One of three monitors can not be started
Who can help me? One monitor in my ceph cluster can not be started. Before that, I added '[mon] mon_compact_on_start = true' to /etc/ceph/ceph.conf on three monitor hosts. Then I did 'ceph tell mon.computer05 compact ' on computer05, which has a monitor on it. When store.db of computer05 changed from 108G to 1G, mon.computer06 stopped, and it has not been able to start since then. If I start mon.computer06, it gets stuck in this state: # /etc/init.d/ceph start mon.computer06 === mon.computer06 === Starting Ceph mon.computer06 on computer06... The process info is like this: root 12149 3807 0 20:46 pts/27 00:00:00 /bin/sh /etc/init.d/ceph start mon.computer06 root 12308 12149 0 20:46 pts/27 00:00:00 bash -c ulimit -n 32768; /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf root 12309 12308 0 20:46 pts/27 00:00:00 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf root 12313 12309 19 20:46 pts/27 00:00:01 /usr/bin/ceph-mon -i computer06 --pid-file /var/run/ceph/mon.computer06.pid -c /etc/ceph/ceph.conf Log on computer06 is like this: 2015-03-30 20:46:54.152956 7fc5379d07a0 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mon, pid 12309 ... 2015-03-30 20:46:54.759791 7fc5379d07a0 1 mon.computer06@-1(probing) e4 preinit clean up potentially inconsistent store state Sorry, my English is not good. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
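Since the store.db size is central here (108G before compaction on computer05), a quick way to measure how large a monitor's store actually is — which also explains why the startup "clean up" phase can run for hours on a ~100 GB store — is a directory-size sweep. A small sketch (the path is the usual default layout; adjust for your cluster):

```python
import os

def dir_size_bytes(path):
    """Total size of all regular files below path (roughly `du -sb path`)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file vanished mid-walk (compaction may be running)
    return total

# e.g. dir_size_bytes("/var/lib/ceph/mon/ceph-computer06/store.db") / 1e9  # GB
```

Checking this before and after a `ceph tell mon.X compact` makes the effect of compaction visible without guessing from `df` output.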
Re: [ceph-users] One host failure bring down the whole cluster
On 3/31/15 11:27, Kai KH Huang wrote: 1) But Ceph says "...You can run a cluster with 1 monitor." (http://ceph.com/docs/master/rados/operations/add-or-rm-mons/), I assume it should work. And split brain is not my current concern The point is that you must have a majority of monitors up: * In a one-monitor setup you need one monitor running, * In a two-monitor setup you need two monitors running, because if one goes down you do not have a majority up, * In a three-monitor setup you need at least two monitors up, because if one goes down you still have a majority up, * 4 - at least 3 * 5 - at least 3 * etc 2) I've written objects to Ceph, now I just want to get it back Anyway. I tried to reduce the mon number to 1. But after I removed it following the steps, it just cannot start up any more 1. [root~] service ceph -a stop mon.serverB 2. [root~] ceph mon remove serverB ## hangs here forever 3. #Remove the monitor entry from ceph.conf. 4. Restart ceph service It is a grey area for me, but I think you failed to remove that monitor because you didn't have a quorum for the operation to succeed. 
I think you'll need to modify the monmap manually and remove the second monitor from it [root@serverA~]# systemctl status ceph.service -l ceph.service - LSB: Start Ceph distributed file system daemons at boot time Loaded: loaded (/etc/rc.d/init.d/ceph) Active: failed (Result: timeout) since Tue 2015-03-31 15:46:25 CST; 3min 15s ago Process: 2937 ExecStop=/etc/rc.d/init.d/ceph stop (code=exited, status=0/SUCCESS) Process: 3670 ExecStart=/etc/rc.d/init.d/ceph start (code=killed, signal=TERM) Mar 31 15:44:26 serverA ceph[3670]: === osd.6 === Mar 31 15:44:56 serverA ceph[3670]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.6 --keyring=/var/lib/ceph/osd/ceph-6/keyring osd crush create-or-move -- 6 3.64 host=serverA root=default' Mar 31 15:44:56 serverA ceph[3670]: === osd.7 === Mar 31 15:45:26 serverA ceph[3670]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.7 --keyring=/var/lib/ceph/osd/ceph-7/keyring osd crush create-or-move -- 7 3.64 host=serverA root=default' Mar 31 15:45:26 serverA ceph[3670]: === osd.8 === Mar 31 15:45:57 serverA ceph[3670]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.8 --keyring=/var/lib/ceph/osd/ceph-8/keyring osd crush create-or-move -- 8 3.64 host=serverA root=default' Mar 31 15:45:57 serverA ceph[3670]: === osd.9 === Mar 31 15:46:25 serverA systemd[1]: ceph.service operation timed out. Terminating. Mar 31 15:46:25 serverA systemd[1]: Failed to start LSB: Start Ceph distributed file system daemons at boot time. Mar 31 15:46:25 serverA systemd[1]: Unit ceph.service entered failed state. 
/var/log/ceph/ceph.log says: 2015-03-31 15:55:57.648800 mon.0 10.???.78:6789/0 1048 : cluster [INF] osd.21 10.???.78:6855/25598 failed (39 reports from 9 peers after 20.118062 >= grace 20.00) 2015-03-31 15:55:57.931889 mon.0 10.???.78:6789/0 1055 : cluster [INF] osd.15 10..78:6825/23894 failed (39 reports from 9 peers after 20.401379 >= grace 20.00) Obviously serverB is down, but it should not prevent serverA from functioning, right? From: Gregory Farnum [g...@gregs42.com] Sent: Tuesday, March 31, 2015 11:53 AM To: Lindsay Mathieson; Kai KH Huang Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] One host failure bring down the whole cluster On Mon, Mar 30, 2015 at 8:02 PM, Lindsay Mathieson wrote: On Tue, 31 Mar 2015 02:42:27 AM Kai KH Huang wrote: Hi, all I have a two-node Ceph cluster, and both are monitor and osd. When they're both up, osd are all up and in, everything is fine... almost: Two things. 1 - You *really* need a min of three monitors. Ceph cannot form a quorum with just two monitors and you run a risk of split brain. You can form quorums with an even number of monitors, and Ceph does so — there's no risk of split brain. The problem with 2 monitors is that a quorum is always 2 — which is exactly what you're seeing right now. You can't run with only one monitor up (assuming you have a non-zero number of them). 2 - You also probably have a min size of two set (the default). This means that you need a minimum of two copies of each data object for writes to work. So with just two nodes, if one goes down you can't write to the other. Also this. So: - Install a extra monitor node - it doesn't have to be powerful, we just use a Intel Celeron NUC for that. - reduce your minimum size to 1 (One). Yep. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
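The majority rule discussed throughout this thread (both in the bulleted list and in Greg's reply) reduces to a one-liner: with N monitors, a quorum needs floor(N/2) + 1 of them up.

```python
def monitors_needed(n):
    """Minimum monitors that must be up to form a quorum of n total monitors."""
    return n // 2 + 1

for n in range(1, 6):
    print(f"{n} monitor(s) -> need {monitors_needed(n)} up")
# 1 -> 1, 2 -> 2, 3 -> 2, 4 -> 3, 5 -> 3
```

This is why a two-monitor cluster cannot survive losing either monitor, and why odd monitor counts are the usual recommendation: going from 2 to 3 monitors adds fault tolerance, while going from 3 to 4 does not.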
Re: [ceph-users] One host failure bring down the whole cluster
1) But Ceph says "...You can run a cluster with 1 monitor."
(http://ceph.com/docs/master/rados/operations/add-or-rm-mons/), so I assume it
should work. And split brain is not my current concern.

2) I've written objects to Ceph; now I just want to get them back.

Anyway, I tried to reduce the monitor count to 1, but after removing the
monitor with the steps below, the cluster just cannot start up any more:

1. [root~]# service ceph -a stop mon.serverB
2. [root~]# ceph mon remove serverB    ## hangs here forever
3. Remove the monitor entry from ceph.conf.
4. Restart the ceph service:

[root@serverA ~]# systemctl status ceph.service -l
ceph.service - LSB: Start Ceph distributed file system daemons at boot time
   Loaded: loaded (/etc/rc.d/init.d/ceph)
   Active: failed (Result: timeout) since Tue 2015-03-31 15:46:25 CST; 3min 15s ago
  Process: 2937 ExecStop=/etc/rc.d/init.d/ceph stop (code=exited, status=0/SUCCESS)
  Process: 3670 ExecStart=/etc/rc.d/init.d/ceph start (code=killed, signal=TERM)

Mar 31 15:44:26 serverA ceph[3670]: === osd.6 ===
Mar 31 15:44:56 serverA ceph[3670]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.6 --keyring=/var/lib/ceph/osd/ceph-6/keyring osd crush create-or-move -- 6 3.64 host=serverA root=default'
Mar 31 15:44:56 serverA ceph[3670]: === osd.7 ===
Mar 31 15:45:26 serverA ceph[3670]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.7 --keyring=/var/lib/ceph/osd/ceph-7/keyring osd crush create-or-move -- 7 3.64 host=serverA root=default'
Mar 31 15:45:26 serverA ceph[3670]: === osd.8 ===
Mar 31 15:45:57 serverA ceph[3670]: failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.8 --keyring=/var/lib/ceph/osd/ceph-8/keyring osd crush create-or-move -- 8 3.64 host=serverA root=default'
Mar 31 15:45:57 serverA ceph[3670]: === osd.9 ===
Mar 31 15:46:25 serverA systemd[1]: ceph.service operation timed out. Terminating.
Mar 31 15:46:25 serverA systemd[1]: Failed to start LSB: Start Ceph distributed file system daemons at boot time.
Mar 31 15:46:25 serverA systemd[1]: Unit ceph.service entered failed state.

/var/log/ceph/ceph.log says:

2015-03-31 15:55:57.648800 mon.0 10.???.78:6789/0 1048 : cluster [INF] osd.21 10.???.78:6855/25598 failed (39 reports from 9 peers after 20.118062 >= grace 20.00)
2015-03-31 15:55:57.931889 mon.0 10.???.78:6789/0 1055 : cluster [INF] osd.15 10..78:6825/23894 failed (39 reports from 9 peers after 20.401379 >= grace 20.00)

Obviously serverB is down, but it should not keep serverA from functioning, right?

From: Gregory Farnum [g...@gregs42.com]
Sent: Tuesday, March 31, 2015 11:53 AM
To: Lindsay Mathieson; Kai KH Huang
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] One host failure bring down the whole cluster

On Mon, Mar 30, 2015 at 8:02 PM, Lindsay Mathieson wrote:
> On Tue, 31 Mar 2015 02:42:27 AM Kai KH Huang wrote:
>> Hi, all
>> I have a two-node Ceph cluster, and both are monitor and osd. When
>> they're both up, osd are all up and in, everything is fine... almost:
>
> Two things.
>
> 1 - You *really* need a min of three monitors. Ceph cannot form a quorum with
> just two monitors and you run a risk of split brain.

You can form quorums with an even number of monitors, and Ceph does so — there's
no risk of split brain. The problem with 2 monitors is that a quorum is always
2 — which is exactly what you're seeing right now. You can't run with only one
monitor up (assuming you have a non-zero number of them).

> 2 - You also probably have a min size of two set (the default). This means
> that you need a minimum of two copies of each data object for writes to work.
> So with just two nodes, if one goes down you can't write to the other.

Also this.

> So:
> - Install an extra monitor node - it doesn't have to be powerful, we just use
> an Intel Celeron NUC for that.
>
> - reduce your minimum size to 1 (One).

Yep.
-Greg
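Greg's point about quorum can be sketched numerically: a monitor quorum is a strict majority of the monitors in the monmap, which is why 2 monitors give no tolerance at all (quorum is still 2) while 3 tolerate one failure. This is an illustrative helper, not Ceph source code:

```python
def quorum_size(num_mons: int) -> int:
    """Smallest strict majority of num_mons monitors
    (illustrative only, not taken from the Ceph source)."""
    return num_mons // 2 + 1

# With 2 monitors in the monmap, quorum needs both of them up,
# so losing either one stalls the cluster:
for n in (1, 2, 3, 5):
    print(n, "monitors -> quorum of", quorum_size(n))
```

Lindsay's other suggestion maps to a single command on the data side: `ceph osd pool set <pool> min_size 1` lets a pool serve I/O with only one replica available, which is what a two-node cluster needs to survive a host failure.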
[ceph-users] RGW buckets sync to AWS?
Hello, can anyone recommend a script/program to periodically synchronize RGW buckets with Amazon S3?

--
Sincerely
Henrik
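Since RGW speaks the S3 API, one approach is a small cron-driven script that lists both buckets and copies whatever is missing or changed, comparing ETags. The core decision logic can be sketched like this; the listings and the actual uploads would come from an S3 client (e.g. boto) pointed at the RGW endpoint and at AWS respectively, and all names here are hypothetical:

```python
# Sync-decision logic only; fetching the key->ETag maps and copying the
# objects is left to an S3 client library. Names are hypothetical.

def keys_to_sync(src: dict, dst: dict) -> list:
    """Return keys present in src (the RGW bucket) that are missing
    from, or have a different ETag in, dst (the AWS S3 bucket)."""
    return sorted(k for k, etag in src.items() if dst.get(k) != etag)

# Example: one object already in sync, one stale, one missing
rgw_objects = {"a.txt": "etag-1", "b.txt": "etag-2", "c.txt": "etag-3"}
s3_objects = {"a.txt": "etag-1", "b.txt": "stale"}
print(keys_to_sync(rgw_objects, s3_objects))  # ['b.txt', 'c.txt']
```

Note that ETag comparison is only a safe change detector for objects uploaded the same way on both sides (multipart uploads get different ETags), so a production script might fall back to size + mtime.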