Re: OSD doesn't start
Hrm, it looks like the OSD data directory got a little busted somehow. How did
you perform your upgrade? (That is, how did you kill your daemons, in what
order, and when did you bring them back up?)
-Greg

On Wednesday, July 4, 2012 at 8:31 AM, Székelyi Szabolcs wrote:
> Hi,
>
> after upgrading to 0.48 "Argonaut", my OSDs won't start up again. This
> problem might not be related to the upgrade, since the cluster had strange
> behavior before, too: ceph-fuse was spinning the CPU around 70%, and so did
> the OSDs. This happened to both of my clusters. I thought that upgrading
> might solve the problem, but it just got worse.
>
> I've copied the log of the OSD run to http://pastebin.com/XYRtfFMU . I've
> rebooted all the nodes, but they still don't work.
>
> What should I do to resurrect my OSDs?
>
> Thanks,
> --
> cc
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD doesn't start
On 2012. July 4. 09:34:04 Gregory Farnum wrote:
> Hrm, it looks like the OSD data directory got a little busted somehow. How
> did you perform your upgrade? (That is, how did you kill your daemons, in
> what order, and when did you bring them back up?)

Since it would be hard and long to describe in text, I've collected the
relevant log entries, sorted by time, at http://pastebin.com/Ev3M4DQ9 . The
short story is that after seeing that the OSDs wouldn't start, I tried to
bring down the whole cluster and start it up from scratch. That didn't change
anything, so I rebooted the two machines (running all three daemons) to see
if that would help. It didn't, and I gave up.

My ceph config is available at http://pastebin.com/KKNjmiWM .

Since this is my test cluster, I'm not very concerned about the data on it.
But the other one, with the same config, is dying, I think. ceph-fuse is
eating around 75% CPU on the sole monitor ("cc") node, and the monitor about
15%. On the other two nodes, the OSD eats around 50%, the MDS 15%, and the
monitor another 10%. No Ceph filesystem activity is going on at the moment.
Blktrace reports about 1 kB/s of disk traffic on the partition hosting the
OSD data dir. The data seems to be accessible at the moment, but I'm afraid
that my production cluster will end up in a similar situation after the
upgrade, so I don't dare to touch it.

Do you have any suggestions about what I should check?

Thanks,
--
cc

> On Wednesday, July 4, 2012 at 8:31 AM, Székelyi Szabolcs wrote:
> [...]
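The per-daemon CPU figures quoted above can be snapshotted with a one-liner
like the following (a generic sketch, not from the thread; the `ceph` pattern
is an assumption that matches ceph-osd, ceph-mon, ceph-mds, and ceph-fuse
process names, and `--sort` is procps/Linux-specific):

```shell
# Print the header line plus every Ceph-related process, highest CPU first.
ps -eo pcpu,comm --sort=-pcpu | awk 'NR==1 || /ceph/'
```

Running this on each node every few seconds gives a rough picture of which
daemon is spinning, without installing anything beyond procps.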
Re: OSD doesn't start
On 2012. July 5. 16:12:42 Székelyi Szabolcs wrote:
> On 2012. July 4. 09:34:04 Gregory Farnum wrote:
> > Hrm, it looks like the OSD data directory got a little busted somehow.
> > How did you perform your upgrade? (That is, how did you kill your
> > daemons, in what order, and when did you bring them back up?)
>
> [...]
>
> Do you have any suggestions about what I should check?

Yes, it definitely looks like it's dying. Besides the above symptoms, all
clients' ceph-fuse processes burn the CPU, there are unreadable files on the
fs (tar blocks on them indefinitely), and the FUSE clients emit messages like

ceph-fuse: 2012-07-05 23:21:41.583692 7f444dfd5700 0 -- client_ip:0/1181
send_message dropped message ping v1 because of no pipe on con 0x1034000

every 5 seconds. I tried to back up the data on it, but the backup got
blocked in the middle. Since then I've been unable to get any data out of
it, not even by killing ceph-fuse and remounting the fs.

--
cc
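The "killing ceph-fuse and remounting" step mentioned above looks roughly
like this (a hedged sketch: the mount point and monitor address are
hypothetical, and the commands are echoed by default since they need a live
cluster and root; clear RUN to actually execute them):

```shell
#!/bin/sh
# Remount a ceph-fuse filesystem. RUN=echo makes this a dry run.
RUN=echo
$RUN fusermount -u /mnt/ceph                  # unmount (or kill ceph-fuse first if stuck)
$RUN ceph-fuse -m monhost:6789 /mnt/ceph      # remount, pointing at a monitor
```

With a hung client, `fusermount -u` can itself block; a lazy unmount
(`umount -l`) on the mount point is a common fallback.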
Re: OSD doesn't start
On 2012. July 6. 01:33:13 Székelyi Szabolcs wrote:
> On 2012. July 5. 16:12:42 Székelyi Szabolcs wrote:
> > [...]
> > Do you have any suggestion what I should check?
>
> Yes, it definitely looks like dying. Besides the above symptoms all
> clients' ceph-fuse burn the CPU, there are unreadable files on the fs (tar
> blocks on them infinitely), the FUSE clients emit messages like
>
> ceph-fuse: 2012-07-05 23:21:41.583692 7f444dfd5700 0 -- client_ip:0/1181
> send_message dropped message ping v1 because of no pipe on con 0x1034000
>
> every 5 seconds. I tried to backup the data on it, but it got blocked in
> the middle. Since then I'm unable to get any data out of it, not even by
> killing ceph-fuse and remounting the fs.

So it looks like the recent leap second caused all my troubles... After a
colleague applied the workaround described here [0], the load on the nodes
went back to normal, but the cluster was still sick. For example, after
stopping one of the monitors and looking at the output of `ceph -s`, it
still showed all the monitors as up & running, whereas it was clear that at
least one of them should have been marked down (there was no ceph-mon
process there). Finally I stopped the whole cluster (BTW, `ceph stop` as
documented here [1] doesn't work any longer; it replies with something like
'unrecognized subsystem'), rebooted all the nodes, and everything came up as
it should have.

Cheers,
--
cc

[0] http://www.h-online.com/open/news/item/Leap-second-bug-in-Linux-wastes-electricity-1631462.html
[1] http://ceph.com/docs/master/control/
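For reference, the widely circulated workaround for the July 2012 leap
second bug boiled down to re-setting the system clock. A hedged sketch
(service names vary per distro; the commands are echoed by default since
they need root, so clear RUN to actually apply it):

```shell
#!/bin/sh
# July 2012 leap-second workaround sketch: stop ntpd, re-set the clock
# via settimeofday(), restart ntpd. RUN=echo makes this a dry run.
RUN=echo
$RUN service ntp stop
$RUN date -s "$(date)"   # re-setting the clock clears the stuck timer state
$RUN service ntp start
```

This explains the CPU-burn symptom but not the stale monitor state; as the
message above says, that only cleared up after a full restart of the nodes.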
Re: OSD doesn't start
On 2012. July 4. 09:34:04 Gregory Farnum wrote:
> Hrm, it looks like the OSD data directory got a little busted somehow. How
> did you perform your upgrade? (That is, how did you kill your daemons, in
> what order, and when did you bring them back up?)

Just to make sure: what's the recommended upgrade process?

Thanks,
--
cc
Re: OSD doesn't start
On Sun, Jul 8, 2012 at 11:53 AM, Székelyi Szabolcs wrote:
> On 2012. July 4. 09:34:04 Gregory Farnum wrote:
>> Hrm, it looks like the OSD data directory got a little busted somehow.
>> How did you perform your upgrade? (That is, how did you kill your
>> daemons, in what order, and when did you bring them back up?)
>
> Just to make sure: what's the recommended upgrade process?

Nominally, you should be able to upgrade however you like, but this doesn't
get much testing. We normally recommend doing the monitors first, and then
doing the OSDs all together or a rack at a time (depending on cluster size).
In any case, it sounds like you didn't break anything that way; I was just
looking around for clues. :)
-Greg
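The recommended order above can be sketched as a short script (hedged: the
monitor IDs and init-script invocation are assumptions for an Argonaut-era
cluster managed by the stock sysvinit script, not a tested procedure; it
echoes the commands by default, so clear RUN to really restart daemons):

```shell
#!/bin/sh
# Rolling-upgrade sketch: monitors first, one at a time, then the OSDs
# all together (or a rack at a time on larger clusters).
# RUN=echo makes this a dry run.
RUN=echo

# 1. Restart each monitor in turn after installing the new packages.
#    (Monitor IDs below are hypothetical; take yours from ceph.conf.)
for mon in mon.cc mon.1 mon.2; do
    $RUN service ceph restart "$mon"
done

# 2. Then restart the OSDs; "-a" acts on all hosts named in ceph.conf.
$RUN service ceph -a restart osd
```

Between monitor restarts, it is worth waiting for `ceph -s` to show the
monitor back in quorum before moving on to the next one.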