Re: [ceph-users] all oas crush on start

2013-07-18 Thread Vladislav Gorbunov
>In the monitor log you sent along, the monitor was crashing on a setcrushmap command. Where in this sequence of events did that happen? It happened after I tried to upload a different crush map, much later, at step 13. >Where are you getting these numbers 82-84 and 92-94 from? They don't appear in any a

Re: [ceph-users] all oas crush on start

2013-07-18 Thread Gregory Farnum
In the monitor log you sent along, the monitor was crashing on a setcrushmap command. Where in this sequence of events did that happen? On Wed, Jul 17, 2013 at 5:07 PM, Vladislav Gorbunov wrote: > That's what I did: > > cluster state HEALTH_OK > > 1. load crush map from cluster: > https://dl.drop

Re: [ceph-users] all oas crush on start

2013-07-17 Thread Vladislav Gorbunov
That's what I did: cluster state HEALTH_OK. 1. Load the crush map from the cluster: https://dl.dropboxusercontent.com/u/2296931/ceph/crushmap1.txt 2. Modify the crush map to add a pool and ruleset iscsi with 2 datacenters, then upload the crush map to the cluster: https://dl.dropboxusercontent.com/u/2296931/ceph/crushma
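For reference, steps 1-2 usually correspond to the standard crushtool round trip; a minimal sketch (file names are placeholders, not the poster's actual paths):

    # pull the compiled crush map from the cluster and decompile it to text
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap1.txt
    # edit the text map (add the iscsi pool/ruleset), then recompile it
    crushtool -c crushmap2.txt -o crushmap2.bin
    # inject the modified map; this is what issues setcrushmap on the monitors
    ceph osd setcrushmap -i crushmap2.bin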

Re: [ceph-users] all oas crush on start

2013-07-17 Thread Gregory Farnum
On Wed, Jul 17, 2013 at 4:40 AM, Vladislav Gorbunov wrote: > Sorry, this was not sent to ceph-users earlier. > > I checked the mon.1 log and found that the cluster was not in HEALTH_OK when I set the > ruleset to iscsi: > 2013-07-14 15:52:15.715871 7fe8a852a700 0 log [INF] : pgmap > v16861121: 19296 pgs: 19052 active+clean

Re: [ceph-users] all oas crush on start

2013-07-17 Thread Vladislav Gorbunov
Sorry, this was not sent to ceph-users earlier. I checked the mon.1 log and found that the cluster was not in HEALTH_OK when I set the ruleset to iscsi: 2013-07-14 15:52:15.715871 7fe8a852a700 0 log [INF] : pgmap v16861121: 19296 pgs: 19052 active+clean, 73 active+remapped+wait_backfill, 171 active+remapped+backfilling; 9
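To check whether the cluster has settled back to HEALTH_OK before changing rulesets again, the usual checks are (a sketch using standard ceph CLI calls):

    ceph -s               # overall status, including the pgmap summary quoted above
    ceph health detail    # lists the PGs still in wait_backfill / backfilling states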

Re: [ceph-users] all oas crush on start

2013-07-16 Thread Vladislav Gorbunov
Yes, I changed the original crushmap; I needed to take the nodes gstore1, gstore2, and cstore5 for a new cluster. I only have the crushmap from the failed cluster, downloaded immediately after the cluster crashed. It is in the attachment. 2013/7/17 Gregory Farnum : > Have you changed either of these maps since you originally

Re: [ceph-users] all oas crush on start

2013-07-16 Thread Vladislav Gorbunov
The output is in the attached files. 2013/7/17 Gregory Farnum : > The maps in the OSDs only would have gotten there from the monitors. > If a bad map somehow got distributed to the OSDs then cleaning it up > is unfortunately going to take a lot of work without any well-defined > processes. > So if you

Re: [ceph-users] all oas crush on start

2013-07-16 Thread Gregory Farnum
Have you changed either of these maps since you originally switched to use rule 3? Can you compare them to what you have on your test cluster? In particular I see that you have 0 weight for all the buckets in the crush pool, which I expect to misbehave but not to cause the OSD to crash everywhere.
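The zero weights mentioned here can be inspected and corrected with the standard CLI; a hedged sketch (osd.12 and the weight 1.0 are placeholders, not values from this cluster):

    ceph osd tree                         # shows the bucket hierarchy and each item's crush weight
    ceph osd crush reweight osd.12 1.0    # give a single OSD a non-zero crush weight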

Re: [ceph-users] all oas crush on start

2013-07-16 Thread Gregory Farnum
The maps in the OSDs only would have gotten there from the monitors. If a bad map somehow got distributed to the OSDs then cleaning it up is unfortunately going to take a lot of work without any well-defined processes. So if you could just do "ceph osd crush dump" and "ceph osd dump" and provide th
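A sketch of capturing those two dumps to files for posting (output file names are arbitrary):

    ceph osd crush dump > crush_dump.json   # the crush map exactly as the monitors hold it
    ceph osd dump > osd_dump.txt            # pool definitions, including each pool's crush_ruleset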

Re: [ceph-users] all oas crush on start

2013-07-16 Thread Vladislav Gorbunov
Gregory, thank you for your help! After all of the OSD servers went down, I set the ruleset for the iscsi pool back to the default rule 0: ceph osd pool set iscsi crush_ruleset 0. It did not help; none of the OSDs started, except the ones without data, with weight 0. Next I removed the ruleset iscsi from the crush map. It did not help to
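To confirm which rule each pool actually points at after a revert like this, something along these lines works (a sketch using standard commands):

    ceph osd dump | grep crush_ruleset       # each pool line shows its crush_ruleset number
    ceph osd pool set iscsi crush_ruleset 0  # point the iscsi pool back at the default rule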

Re: [ceph-users] all oas crush on start

2013-07-16 Thread Gregory Farnum
I notice that your first dump of the crush map didn't include rule #3. Are you sure you've injected it into the cluster? Try extracting it from the monitors and looking at that map directly, instead of a locally cached version. You mentioned some problem with OSDs being positioned wrong too, so you
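Extracting the map from the monitors, as suggested, typically looks like this (a minimal sketch; output paths are placeholders):

    ceph osd getcrushmap -o current.bin      # fetch the map the monitors are actually serving
    crushtool -d current.bin -o current.txt  # decompile it to text
    grep -B2 -A8 'ruleset 3' current.txt     # check whether rule #3 is really present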

Re: [ceph-users] all oas crush on start

2013-07-15 Thread Vladislav Gorbunov
Ruleset 3 is:

rule iscsi {
        ruleset 3
        type replicated
        min_size 1
        max_size 10
        step take iscsi
        step chooseleaf firstn 0 type datacenter
        step chooseleaf firstn 0 type host
        step emit
}

2013/7/16 Vladislav Gorbunov : > Sorry, after I tried
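Note that this rule stacks two chooseleaf steps; the usual way to express "pick datacenters, then a host leaf within each" is a choose step on datacenters followed by a chooseleaf step on hosts. A hedged comparison sketch only, not the poster's rule and not verified against this cluster:

    rule iscsi {
            ruleset 3
            type replicated
            min_size 1
            max_size 10
            step take iscsi
            step choose firstn 0 type datacenter
            step chooseleaf firstn 0 type host
            step emit
    }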

Re: [ceph-users] all oas crush on start

2013-07-15 Thread Vladislav Gorbunov
Sorry, after I tried to apply crush ruleset 3 (iscsi) to the pool iscsi: ceph osd pool set iscsi crush_ruleset 3 2013/7/16 Vladislav Gorbunov : >>Have you run this crush map through any test mappings yet? > Yes, it worked on the test cluster, and after applying the map to the main cluster. > The OSD servers went down after

Re: [ceph-users] all oas crush on start

2013-07-15 Thread Vladislav Gorbunov
>Have you run this crush map through any test mappings yet? Yes, it worked on the test cluster, and after applying the map to the main cluster. The OSD servers went down after I tried to apply crush ruleset 3 (iscsi) to the pool iscsi: ceph osd pool set data crush_ruleset 3 2013/7/16 Gregory Farnum : > It's probably not th

Re: [ceph-users] all oas crush on start

2013-07-15 Thread Gregory Farnum
It's probably not the same issue as that ticket, which was about the OSD handling a lack of output incorrectly. (It might be handling the output incorrectly in some other way, but hopefully not...) Have you run this crush map through any test mappings yet? -Greg Software Engineer #42 @ http://inkt
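A test mapping of the kind asked about here can be run offline with crushtool; a minimal sketch (the rule number, replica count, and file name are placeholders):

    crushtool -i crushmap2.bin --test --rule 3 --num-rep 2 \
              --show-statistics --show-bad-mappings
    # any "bad mapping" lines mean the rule could not place the requested replicas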

Re: [ceph-users] all oas crush on start

2013-07-14 Thread Vladislav Gorbunov
Symptoms like those in http://tracker.ceph.com/issues/4699: on all OSDs the ceph-osd process crashes with a segfault. If I stop the MON daemons then I can start the OSDs, but if I start the MONs back up then all the OSDs die again. More detailed log: 0> 2013-07-15 16:42:05.001242 7ffe5a6fc700 -1 *** Caught signal (Segmenta

[ceph-users] all oas crush on start

2013-07-13 Thread Vladislav Gorbunov
Hello! After changing the crush map, all OSDs (ceph version 0.61.4 (1669132fcfc27d0c0b5e5bb93ade59d147e23404)) in the default pool crash with the error: 2013-07-14 17:26:23.755432 7f0c963ad700 -1 *** Caught signal (Segmentation fault) ** in thread 7f0c963ad700 ...skipping... 10: (OSD::PeeringWQ::_p