[ceph-users] OSD Down After Reboot

2019-08-29 Thread Thomas Sumpter
Hi Folks, I have found similar reports of this problem in the past but can't seem to find any solution to it. We have a Ceph filesystem running Mimic version 13.2.5. OSDs are running on AWS EC2 instances with CentOS 7. The OSD disk is an AWS NVMe device. Problem is, sometimes when rebooting an OSD in

Re: [ceph-users] OSD down with Ceph version of Kraken

2017-12-05 Thread Brad Hubbard
On Tue, Dec 5, 2017 at 8:14 PM, wrote: > Hi, > > > > Our Ceph version is Kraken and for the storage node we have up to 90 hard > disks that can be used for OSD, we configured the messenger type as > “simple”, I noticed that “simple” type here might create lots of threads and > hence occupied lots

[ceph-users] OSD down with Ceph version of Kraken

2017-12-05 Thread Dave.Chen
Hi, Our Ceph version is Kraken and for the storage nodes we have up to 90 hard disks that can be used for OSDs. We configured the messenger type as "simple"; I noticed that the "simple" type here might create lots of threads and hence occupy lots of resources. We observed the configuration will caus

[ceph-users] OSD down ( rocksdb: submit_transaction error: Corruption: block checksum mismatch)

2017-11-23 Thread Karun Josy
Hi, One OSD in the cluster is down. Tried to restart the service, but its still failing. I can see the below error in log file. Can this be a hardware issue ? - -9> 2017-11-23 09:47:37.768969 7f368686a700 3 rocksdb: [/home/jenkins-build/build/workspace/ceph

[ceph-users] osd down but the service is up

2017-06-22 Thread Alex Wang
Hi All I am currently testing a new ceph cluster with SSD as journal. ceph -v ceph version 10.2.7 cat /etc/redhat-release Red Hat Enterprise Linux Server release 7.4 Beta (Maipo) I followed http://ceph.com/geen-categorie/ceph-recover-osds-after-ssd-journal-failure/ to replace the journal drive. (

[ceph-users] osd down

2017-04-17 Thread ??????
Hi, All: I am installing ceph on 2 nodes using ceph-deploy. node1: monitor and osd.0, ip: 192.168.1.11 node2: osd.1, ip: 192.168.1.12 When I configured node1 as monitor and osd.0, it was OK. But when I added node

Re: [ceph-users] osd down detection broken in jewel?

2016-12-12 Thread Gregory Farnum
On Wed, Nov 30, 2016 at 8:31 AM, Manuel Lausch wrote: > Yes. This parameter is used in the condition described there: > http://docs.ceph.com/docs/jewel/rados/configuration/mon- > osd-interaction/#osds-report-their-status and works. I think the default > timeout of 900s is quite a bit large. > > A

Re: [ceph-users] osd down detection broken in jewel?

2016-11-30 Thread Manuel Lausch
Yes. This parameter is used in the condition described there: http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/#osds-report-their-status and works. I think the default timeout of 900s is quite a bit large. Also in the documentation there is another function which checks the heal

Re: [ceph-users] osd down detection broken in jewel?

2016-11-30 Thread Warren Wang - ISD
OSDs did not have enough peers to exceed the min down reporters. Warren Wang Walmart ✻ From: ceph-users on behalf of John Petrini Date: Wednesday, November 30, 2016 at 9:24 AM To: Manuel Lausch Cc: Ceph Users Subject: Re: [ceph-users] osd down detection broken in jewel? It's right the

Re: [ceph-users] osd down detection broken in jewel?

2016-11-30 Thread John Petrini
It's right there in your config. mon osd report timeout = 900 See: http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/ ___ John Petrini NOC Systems Administrator // *CoreDial, LLC* // coredial.com
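Those settings live in ceph.conf on the monitors. A minimal sketch, assuming Jewel-era option names (the 900 comes straight from the poster's config; the reporter count is only an illustrative value):

    [mon]
        # monitors mark an unresponsive OSD down on their own after this many
        # seconds if no peer failure reports have arrived
        mon osd report timeout = 900
        # how many distinct OSDs must report a peer as failed before the
        # monitors mark it down
        mon osd min down reporters = 2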

[ceph-users] osd down detection broken in jewel?

2016-11-30 Thread Manuel Lausch
Hi, In a test with ceph jewel we tested how long the cluster needs to detect and mark down OSDs after they are killed (with kill -9). The result -> 900 seconds. In Hammer this took about 20 - 30 seconds. In the logfile from the leader monitor are a lot of messages like 2016-11-30 11:32:20.9

Re: [ceph-users] OSD Down but not marked down by cluster

2016-09-29 Thread Tyler Bishop
. - Original Message - From: "Wido den Hollander" To: "ceph-users" , "ceph new" , "Tyler Bishop" Sent: Thursday, September 29, 2016 3:35:14 AM Subject: Re: [ceph-users] OSD Down but not marked down by cluster > On 29 September 2016 at 01:57, Tyle

Re: [ceph-users] OSD Down but not marked down by cluster

2016-09-29 Thread Wido den Hollander
> On 29 September 2016 at 01:57, Tyler Bishop wrote: > > > S1148 is down but the cluster does not mark it as such. > A host will never be marked as down, but the output shows that all OSDs are marked as down, however. Wido > cluster 3aac8ab8-1011-43d6-b281-d16e7a61b2bd > health HEALTH_

[ceph-users] OSD Down but not marked down by cluster

2016-09-28 Thread Tyler Bishop
S1148 is down but the cluster does not mark it as such. cluster 3aac8ab8-1011-43d6-b281-d16e7a61b2bd health HEALTH_WARN 3888 pgs backfill 196 pgs backfilling 6418 pgs degraded 52 pgs down 52 pgs peering 1 pgs recovery_wait 3653 pgs stuck degraded 52 pgs stuck inactive 6088 pgs stuck unc

[ceph-users] OSD down potentially causing no new volume creation

2016-04-30 Thread Jagga Soorma
Hi Guys, I am new to ceph and have an OpenStack deployment that uses ceph as the backend storage. Recently we had a failed OSD disk which seems to have caused issues for any new volume creation. I am really surprised that having a single osd drive down or node down could cause such a large impact

Re: [ceph-users] OSD down

2015-02-05 Thread Steve Anthony
Hi Daniel, When I encounter an OSD which I can start, but which then stops on its own after running for some period of time, the root cause has generally been sectors pending reallocation on the hard drive the OSD is using. The OSD will run fine until it attempts to read from the bad disk sectors

Re: [ceph-users] OSD down

2015-02-05 Thread Daniel Takatori Ohara
Hello Alex, Thanks for the answer. On the servers, I use CentOS 6.6 with kernel 2.6.32, and on the clients I use Ubuntu 14 with kernel 3.16. The version of Ceph is 0.87. Thanks, Att. --- Daniel Takatori Ohara. System Administrator - Lab. of Bioinformatics Molecular Oncology Center

Re: [ceph-users] OSD down

2015-02-05 Thread Alexis KOALLA
Hi Daniel Could you be more precise about your issue please? What is the OS under which your ceph is running and what is the ceph version you are currently running? Anyway, I have experienced an issue that looks like yours. I have installed and configured a small cluster "microceph" on my PC fo

[ceph-users] OSD down

2015-02-05 Thread Daniel Takatori Ohara
Hi, anyone help me please. I have a cluster with 4 OSDs, 1 MDS and 1 MON. The osd.3 was down, and I needed to restart it on the host with the command /etc/init.d/ceph restart osd.3. The osd.0 is marked down sometimes, but it is marked up automatically. [ceph@ceph-admin my-cluster]$ ceph osd tree # id

Re: [ceph-users] osd down

2014-11-10 Thread Craig Lewis
Yes, removing an OSD before re-creating it will give you the same OSD ID. That's my preferred method, because it keeps the crushmap the same. Only PGs that existed on the replaced disk need to be backfilled. I don't know if adding the replacement to the same host then removing the old OSD gives y
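A minimal sketch of that remove-then-recreate sequence, assuming the failed disk is osd.70 as in this thread and a sysvinit-era deployment (IDs and init commands are illustrative):

    ceph osd out osd.70              # stop new data being mapped to it
    service ceph stop osd.70         # or: /etc/init.d/ceph stop osd.70
    ceph osd crush remove osd.70     # take it out of the CRUSH map
    ceph auth del osd.70             # drop its cephx key
    ceph osd rm 70                   # free the ID; the next 'ceph osd create'
                                     # hands back the lowest free ID, i.e. 70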

Re: [ceph-users] osd down

2014-11-10 Thread Shain Miley
Craig, Thanks for the info. I ended up doing a zap and then a create via ceph-deploy. One question that I still have is surrounding adding the failed osd back into the pool. In this example...osd.70 was bad. When I added it back in via ceph-deploy...the disk was brought up as osd.108. Only

Re: [ceph-users] osd down

2014-11-07 Thread Craig Lewis
I'd stop that osd daemon, and run xfs_check / xfs_repair on that partition. If you repair anything, you should probably force a deep-scrub on all the PGs on that disk. I think ceph osd deep-scrub will do that, but you might have to manually grep ceph pg dump . Or you could just treat it like a
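A rough sketch of that sequence, assuming osd.70 on /dev/sdl1 as elsewhere in this thread, the default mount point, and a sysvinit-era deployment (xfs_repair needs the filesystem unmounted):

    service ceph stop osd.70
    umount /var/lib/ceph/osd/ceph-70
    xfs_repair -n /dev/sdl1                     # dry run first; rerun without -n to repair
    mount /dev/sdl1 /var/lib/ceph/osd/ceph-70
    service ceph start osd.70
    ceph osd deep-scrub 70                      # deep-scrub every PG held by that OSD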

Re: [ceph-users] osd down

2014-11-07 Thread Michael Nishimoto
Most likely, the drive mapping to /dev/sdl1 is going bad or is bad. I suggest power cycling it to see if the error is cleared. If the drive comes up, check out the SMART stats to see if sectors are starting to get remapped. It's possible that a transient error occurred. Mike On 11/6/14 5:06

Re: [ceph-users] osd down

2014-11-06 Thread Shain Miley
I tried restarting all the OSDs on that node; osd.70 was the only ceph process that did not come back online. There is nothing in the ceph-osd log for osd.70. However I do see over 13,000 of these messages in the kern.log: Nov 6 19:54:27 hqosd6 kernel: [34042786.392178] XFS (sdl1): xfs_log_fo

Re: [ceph-users] osd down question

2014-11-04 Thread Craig Lewis
The OSDs will heartbeat each other, and report back to the monitors if any other OSD fails to respond. An OSD that fails to respond is effectively down, since it's not doing the things that it's supposed to do. It is possible for this process to cause problems. For example, I've had some OSDs on
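The heartbeat behaviour described here is driven by a couple of ceph.conf options; a minimal sketch with illustrative values:

    [osd]
        osd heartbeat interval = 6    # seconds between pings to peer OSDs
        osd heartbeat grace = 20      # seconds without a reply before the peer
                                      # is reported to the monitors as failed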

[ceph-users] osd down question

2014-11-03 Thread ??
hello, I have been running ceph v0.87 for one week; this week, many OSDs have been marked down, but when I run "ps -ef | grep osd" I can see the osd processes, so the OSDs are not really down. Then I checked the osd logs and I see many lines like "osd.XX from dead osd.YY, marking down". Does 0.87 check other osd processes ?

[ceph-users] osd down/out problem

2014-06-04 Thread Cao, Buddy
Hi, some of the osds in my env keep trying to connect to the monitors/ceph nodes, but get connection refused and go down/out. It is even worse when I try to initialize 100+ osds (800G HDD for each osd); most of the osds run into the same problem connecting to the monitor. I checked the monitor sta

[ceph-users] osd down and autoout in firefly

2014-05-30 Thread Cao, Buddy
Hello, One of the osds in my ceph cluster changed to down and autoout; I did not get the root cause from the osd log. Could you help? 2014-05-30 17:35:55.541353 7f7b03a937a0 0 ceph version 0.80 (b78644e7dee100e48dfeca32c9270a6b210d3003), process ceph-osd, pid 5519 2014-05-30 17:35:55.544601 7f7

Re: [ceph-users] osd down/autoout problem

2014-05-15 Thread Cao, Buddy
Re: [ceph-users] osd down/autoout problem On Thu, 15 May 2014, Cao, Buddy wrote: > "Too many open files not handled on operation 24 (541468.0.1, or op 1, > counting from 0) You need to increase the 'ulimit -n' max open files limit. You can do this in ceph.conf with 'max

Re: [ceph-users] osd down/autoout problem

2014-05-15 Thread Guang
On May 15, 2014, at 6:06 PM, Cao, Buddy wrote: > Hi, > > One of the osd in my cluster downs w no reason, I saw the error message in > the log below, I restarted osd, but after several hours, the problem come > back again. Could you help? > > “Too many open files not handled on operation 24

Re: [ceph-users] osd down/autoout problem

2014-05-15 Thread Sage Weil
On Thu, 15 May 2014, Cao, Buddy wrote: > "Too many open files not handled on operation 24 (541468.0.1, or op 1, > counting from 0) You need to increase the 'ulimit -n' max open files limit. You can do this in ceph.conf with 'max open files' if it's sysvinit or manually in /etc/init/ceph-osd.con
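A minimal ceph.conf sketch for the sysvinit case mentioned here (the value is only an example, not a recommendation from the thread):

    [global]
        # raises the fd limit (ulimit -n) that the init script applies
        # before starting each daemon
        max open files = 131072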

Re: [ceph-users] osd down/autoout problem

2014-05-15 Thread Haomai Wang
"Too many open files not handled on operation 24" This is the reason. You need to increase the fd size limit. On Thu, May 15, 2014 at 6:06 PM, Cao, Buddy wrote: > Hi, > > > > One of the osd in my cluster downs w no reason, I saw the error message in > the log below, I restarted osd, but after se

[ceph-users] osd down/autoout problem

2014-05-15 Thread Cao, Buddy
Hi, One of the osds in my cluster goes down for no reason; I saw the error message in the log below. I restarted the osd, but after several hours the problem came back again. Could you help? "Too many open files not handled on operation 24 (541468.0.1, or op 1, counting from 0) -96> 2014-05-14 22:12:

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Kasper Dieter
On Thu, Mar 13, 2014 at 11:16:45AM +0100, Gandalf Corvotempesta wrote: > 2014-03-13 10:53 GMT+01:00 Kasper Dieter : > > After adding two new pools (each with 2 PGs) > > 100 out of 140 OSDs are going down + out. > > The cluster never recovers. > > In my case, cluster recovered after a couple of

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Gandalf Corvotempesta
So, in normal conditions with RGW enabled, only 2 pools have data in them: "data" and ".rgw.buckets"? In this case, I could use ReplicaNum*2 2014-03-13 11:48 GMT+01:00 Dan Van Der Ster : > On 13 Mar 2014 at 11:41:30, Gandalf Corvotempesta > (gandalf.corvotempe...@gmail.com) wrote: > > 2014-03-13 11:3

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Dan Van Der Ster
On 13 Mar 2014 at 11:41:30, Gandalf Corvotempesta (gandalf.corvotempe...@gmail.com) wrote: 2014-03-13 11:32 GMT+01:00 Dan Van Der Ster : > Do you have any other pools? Remember that you need to include _all_ pools > in the PG calculation, not just a single p

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Gandalf Corvotempesta
2014-03-13 11:32 GMT+01:00 Dan Van Der Ster : > Do you have any other pools? Remember that you need to include _all_ pools > in the PG calculation, not just a single pool. Actually I have only the standard pools (there should be 3). In production I'll also have RGW. So, what is the exact equation to d

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Dan Van Der Ster
On 13 Mar 2014 at 11:26:55, Gandalf Corvotempesta (gandalf.corvotempe...@gmail.com) wrote: I'm also unsure if 8192 PGs are correct for my cluster. At maximum i'll have 168 OSDs (14 servers, 12 disks each, 1 osd per disk), with replica set to 3, so: (168*100

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Gandalf Corvotempesta
2014-03-13 11:26 GMT+01:00 Dan Van Der Ster : > See http://tracker.ceph.com/issues/6922 > > This is explicitly blocked in latest code (not sure if that's released yet). This seems to explain my behaviour

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Dan Van Der Ster
On 13 Mar 2014 at 11:23:44, Gandalf Corvotempesta (gandalf.corvotempe...@gmail.com) wrote: 2014-03-13 11:19 GMT+01:00 Dan Van Der Ster : > Do you mean you used PG splitting? > > You should split PGs by a factor of 2x at a time. So to get from 64 to 8192, > d

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Gandalf Corvotempesta
2014-03-13 11:23 GMT+01:00 Gandalf Corvotempesta : > I've brutally increased, no further steps. > > 64 -> 8192 :-) I'm also unsure if 8192 PGs are correct for my cluster. At maximum I'll have 168 OSDs (14 servers, 12 disks each, 1 osd per disk), with replica set to 3, so: (168*100)/3 = 5600. Roun
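The rule of thumb being applied, written out with the numbers from this thread (rounding up to a power of two is the conventional last step):

    # target total PGs ~= (OSD count * 100) / replica size, rounded up to a power of two
    echo $(( 168 * 100 / 3 ))    # 5600 -> next power of two is 8192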

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Gandalf Corvotempesta
2014-03-13 11:19 GMT+01:00 Dan Van Der Ster : > Do you mean you used PG splitting? > > You should split PGs by a factor of 2x at a time. So to get from 64 to 8192, > do 64->128, then 128->256, ..., 4096->8192. I've brutally increased, no further steps. 64 -> 8192 :-)
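A sketch of the stepwise split being recommended, assuming a hypothetical pool named 'data' and waiting for the cluster to settle between doublings:

    for n in 128 256 512 1024 2048 4096 8192; do
        ceph osd pool set data pg_num $n
        ceph osd pool set data pgp_num $n
        # wait for 'ceph -s' to report the cluster healthy before the next step
    done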

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Dan Van Der Ster
On 13 Mar 2014 at 10:46:13, Gandalf Corvotempesta (gandalf.corvotempe...@gmail.com) wrote: > Yes, if you have essentially high amount of commited data in the cluster > and/or large number of PG(tens of thousands). I've increased from 64 to 8192 PGs Do you

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Gandalf Corvotempesta
2014-03-13 10:53 GMT+01:00 Kasper Dieter : > After adding two new pools (each with 2 PGs) > 100 out of 140 OSDs are going down + out. > The cluster never recovers. In my case, cluster recovered after a couple of hours. How much time did you wait ?

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Dan Van Der Ster
Why do you create so many PGs ?? The goal is 100 per OSD, with your numbers you have 3 * (48000) / 140 ~= 1000 per OSD. -- Dan van der Ster || Data & Storage Services || CERN IT Department -- On 13 Mar 2014 at 11:11:16, Kasper Dieter (dieter.kas...@ts.fujitsu.com

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Kasper Dieter
We have observed a very similar behavior. In a 140 OSD cluster (newly created and idle) ~8000 PGs are available. After adding two new pools (each with 2 PGs) 100 out of 140 OSDs are going down + out. The cluster never recovers. This problem can be reproduced every time with v0.67 and 0.72. Wit

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Gandalf Corvotempesta
2014-03-13 9:02 GMT+01:00 Andrey Korolyov : > Yes, if you have essentially high amount of commited data in the cluster > and/or large number of PG(tens of thousands). I've increased from 64 to 8192 PGs > If you have a room to > experiment with this transition from scratch you may want to play wit

Re: [ceph-users] OSD down after PG increase

2014-03-13 Thread Andrey Korolyov
On 03/13/2014 02:08 AM, Gandalf Corvotempesta wrote: > I've increased PG number to a running cluster. > After this operation, all OSDs from one node was marked as down. > > Now, after a while, i'm seeing that OSDs are slowly coming up again > (sequentially) after rebalancing. > > Is this an expec

[ceph-users] OSD down after PG increase

2014-03-12 Thread Gandalf Corvotempesta
I've increased the PG number on a running cluster. After this operation, all OSDs from one node were marked as down. Now, after a while, I'm seeing that OSDs are slowly coming up again (sequentially) after rebalancing. Is this an expected behaviour?

Re: [ceph-users] osd down

2014-02-16 Thread Jean-Charles LOPEZ
Hi Pavel, It looks like you have deployed your 2 OSDs on the same host. By default, in the CRUSH map, each object is going to be assigned to 2 OSDs that are on different hosts. If you want this to work for testing, you'll have to adapt your CRUSH map so that each copy is dispatched on a bucket of
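For a single-host test setup, one common workaround (an assumption here, not quoted from the thread) is to tell CRUSH to separate replicas across OSDs rather than hosts before the cluster is first created:

    [global]
        # 0 = choose leaves of type 'osd' instead of 'host', so both replicas
        # may land on the same machine (test clusters only)
        osd crush chooseleaf type = 0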

Re: [ceph-users] osd down

2014-02-16 Thread Pavel V. Kaygorodov
Hi! I have tried, but the situation has not changed significantly: # ceph -w cluster e90dfd37-98d1-45bb-a847-8590a5ed8e71 health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 2/2 in osds are down monmap e1: 1 mons at {host1=172.17.0.4:6789/0}, election epoch 1, quorum 0 host1

Re: [ceph-users] osd down

2014-02-16 Thread Karan Singh
Hi Pavel Try to add at least 1 more OSD (the bare minimum) and set pool replication to 2 after that. For osd.0 try: # ceph osd in osd.0. Once the osd is IN, try to bring the osd.0 services up. Finally both your OSDs should be IN and UP, so that your cluster can store data. Regard
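A sketch of those steps, assuming the default 'rbd' pool and sysvinit-era service names (pool name and commands are illustrative):

    ceph osd pool set rbd size 2     # replica count 2 once the second OSD exists
    ceph osd in osd.0                # mark osd.0 back IN
    service ceph start osd.0         # bring the daemon UP
    ceph osd tree                    # both OSDs should now show up/in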

[ceph-users] osd down

2014-02-16 Thread Pavel V. Kaygorodov
Hi, All! I am trying to set up ceph from scratch, without a dedicated drive, with one mon and one osd. After all that, I see the following output from ceph osd tree: # id weight type name up/down reweight -1 1 root default -2 1 host host1 0 1

[ceph-users] osd down due to disk full

2013-10-27 Thread Kalin Bogatzevski
Hello, I have accidentally let 2 OSDs in our small 2-node cluster fill up. Following the docs, I moved 2 PG dirs to another disk to free some disk space. Unfortunately, after this the osd cannot start. Please advise! This happened before the 2:2 replication ended, so it is absolutely needed

Re: [ceph-users] osd down after server failure

2013-10-14 Thread Dong Yuan
From your information, the osd log ended with "2013-10-14 06:21:26.727681 7f02690f9780 10 osd.47 43203 load_pgs 3.df1_TEMP clearing temp". That means the osd is loading all PG directories from the disk. If there is any I/O error (disk or xfs error), the process couldn't finish. Suggest resta

Re: [ceph-users] osd down after server failure

2013-10-14 Thread Sage Weil
Is osd.47 the one with the bad disk? It should not start. If there are other osds on the same host that aren't started with 'service ceph start', you may have to mention them by name (the old version of the script would stop on the first error instead of continuing). e.g., service ceph start

Re: [ceph-users] osd down after server failure

2013-10-14 Thread Dominik Mostowiec
Hi I have found something. After the restart the time was wrong on the server (+2 hours) before ntp fixed it. I restarted these 3 osds - it didn't help. Is it possible that ceph banned these osds? Or, after starting with the wrong time, has the osd broken its filestore? -- Regards Dominik 2013/10/14 Dominik Mostowiec : > H

[ceph-users] osd down after server failure

2013-10-13 Thread Dominik Mostowiec
Hi, I had a server failure that started with one disk failure: Oct 14 03:25:04 s3-10-177-64-6 kernel: [1027237.023986] sd 4:2:26:0: [sdaa] Unhandled error code Oct 14 03:25:04 s3-10-177-64-6 kernel: [1027237.023990] sd 4:2:26:0: [sdaa] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK Oct 14 03:25:04 s