Hi!
@Paul
Thanks! I know, I read the whole thread about size 2 some months ago. But
this was not my decision; I had to set it up like that.
In the meantime I rebooted node1001 and node1002 with the "noout" flag set,
and now peering has finished and only 0.0x% of the objects are being rebalanced.
IO is flowing
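(For the record, the procedure was just the usual one, roughly:

  ceph osd set noout
  # reboot the node, wait for its OSDs to rejoin and peering to settle
  ceph osd unset noout

adjusted to our node names, of course.)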
Check ceph pg query, it will (usually) tell you why something is stuck
inactive.
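For example, something along these lines (put in one of your inactive PG ids):

  ceph pg 1.2f3 query

and look at the "recovery_state" section near the end of the output.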
Also: never do min_size 1.
Paul
2018-05-17 15:48 GMT+02:00 Kevin Olbrich :
> I was able to obtain another NVMe to get the HDDs in node1004 into the
> cluster.
> The number of disks (all 1TB) is now balanced betwe
I was able to obtain another NVMe to get the HDDs in node1004 into the
cluster.
The number of disks (all 1TB) is now balanced between racks, still some
inactive PGs:
data:
pools: 2 pools, 1536 pgs
objects: 639k objects, 2554 GB
usage: 5167 GB used, 14133 GB / 19300 GB avail
p
Ok, I just waited some time but I still got some "activating" issues:
data:
pools: 2 pools, 1536 pgs
objects: 639k objects, 2554 GB
usage: 5194 GB used, 11312 GB / 16506 GB avail
pgs: 7.943% pgs not active
5567/1309948 objects degraded (0.425%)
1
PS: The cluster is currently size 2. I used PGCalc on the Ceph website, which by
default targets 200 PGs per OSD.
I read about the PG-per-OSD protection limit in the docs and later realised I
should have targeted only 100 PGs per OSD.
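(For reference, the rule of thumb behind PGCalc is roughly

  total PGs ~= (number of OSDs x target PGs per OSD) / pool size,
  rounded up to a power of two

so with my ~20 x 1 TB OSDs, size 2 and a 100 PG/OSD target that would have been
(20 x 100) / 2 = 1000, i.e. 1024 PGs, spread over the pools by their share of
the data.)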
2018-05-17 13:35 GMT+02:00 Kevin Olbrich :
> Hi!
>
> Thanks for your quick reply.
> B
Hi!
Thanks for your quick reply.
Before I read your mail, I had applied the following config to my OSDs:
ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'
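(Since injectargs only changes the running daemons, I assume the same option
would also have to go into ceph.conf to survive OSD restarts, e.g.:

  [osd]
  osd_max_pg_per_osd_hard_ratio = 32

I have not made it persistent yet.)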
Status is now:
data:
pools: 2 pools, 1536 pgs
objects: 639k objects, 2554 GB
usage: 5211 GB used, 11295 GB / 16506
Hi,
On 05/17/2018 01:09 PM, Kevin Olbrich wrote:
Hi!
Today I added some new OSDs (nearly doubling the count) to my Luminous cluster.
I then changed pg(p)_num from 256 to 1024 for that pool because it was
complaining about too few PGs. (I have since noticed that this should have been
done in smaller steps.)
This is the cu
Hi!
Today I added some new OSDs (nearly doubling the count) to my Luminous cluster.
I then changed pg(p)_num from 256 to 1024 for that pool because it was
complaining about too few PGs. (I have since noticed that this should have been
done in smaller steps.)
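(The gentler approach would probably have been something like

  ceph osd pool set <pool> pg_num 512
  # wait for the new PGs to be created and the cluster to settle
  ceph osd pool set <pool> pgp_num 512
  # then repeat in further steps up to 1024

instead of jumping from 256 to 1024 in one go.)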
This is the current status:
health: HEALTH_ERR
336
Hi all,
So, using ceph-ansible, I built the below-mentioned cluster with 2 OSD
nodes and 3 mons.
Just after creating the OSDs I started benchmarking the performance using
"rbd bench" and "rados bench" and started seeing the performance drop.
Checking the status shows slow requests.
[root@storage-28-1
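(For context, the benchmarks were nothing exotic, essentially along the lines of

  rados bench -p <pool> 60 write --no-cleanup
  rbd bench --io-type write <pool>/<image>

with default options.)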
instability you mention, experimenting with
BlueStore looks like a better alternative.
Thanks again
Fulvio
Original Message
Subject: Re: [ceph-users] Blocked requests
From: Matthew Stroud
To: Fulvio Galeazzi , Brian Andrus
CC: "ceph-
Original Message
Subject: Re: [ceph-users] Blocked requests
From: Matthew Stroud
To: Brian Andrus
CC: "ceph-users@lists.ceph.com"
Date: 09/07/2017 11:01 PM
> After some troubleshooting, the issues appear to be caused by gnocchi
> using rados. I’
your help
Fulvio
Original Message
Subject: Re: [ceph-users] Blocked requests
From: Matthew Stroud
To: Brian Andrus
CC: "ceph-users@lists.ceph.com"
Date: 09/07/2017 11:01 PM
After some troubleshooting, the issues appear to be caused by gnoc
roud
>
>
>
> From: Brian Andrus
> Date: Thursday, September 7, 2017 at 1:53 PM
> To: Matthew Stroud
> Cc: David Turner , "ceph-users@lists.ceph.com"
>
>
>
> Subject: Re: [ceph-users] Blocked requests
>
>
>
> "ceph osd blocked-by" can do t
ail.com>>
Date: Thursday, September 7, 2017 at 1:17 PM
To: Matthew Stroud mailto:mattstr...@overstock.com>>,
"ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>"
mailto:ceph-users@lists.ceph.com>>
Subject: Re: [ceph-users] Blocked requests
I would recom
ster [WRN] 73 slow requests, 1 included
> below; oldest blocked for > 1821.342499 secs
>
> /var/log/ceph/ceph.log:2017-09-07 13:29:43.979559 osd.10
> 10.20.57.15:6806/7029 9371 : cluster [WRN] slow request 30.452344 seconds
> old, received at 2017-09-07 13:29:13.527157: osd_op(client.1
: Matthew Stroud , "ceph-users@lists.ceph.com"
Subject: Re: [ceph-users] Blocked requests
I would recommend pushing forward with the update instead of rolling back.
Ceph doesn't have a track record of rolling back to a previous version.
I don't have enough information to really make
>
>
> 1 ops are blocked > 524.288 sec on osd.2
>
> 1 ops are blocked > 262.144 sec on osd.2
>
> 2 ops are blocked > 65.536 sec on osd.21
>
> 9 ops are blocked > 1048.58 sec on osd.5
>
> 9 ops are blocked > 524.288 sec on osd.5
>
> 71 ops are blocke
have slow requests
recovery 4678/1097738 objects degraded (0.426%)
recovery 10364/1097738 objects misplaced (0.944%)
From: David Turner
Date: Thursday, September 7, 2017 at 11:33 AM
To: Matthew Stroud , "ceph-users@lists.ceph.com"
Subject: Re: [ceph-users] Blocked requests
To be
To be fair, other times I have to go in and tweak configuration settings
and timings to resolve chronic blocked requests.
On Thu, Sep 7, 2017 at 1:32 PM David Turner wrote:
> `ceph health detail` will give a little more information into the blocked
> requests. Specifically which OSDs are the re
`ceph health detail` will give a little more information into the blocked
requests. Specifically which OSDs are the requests blocked on and how long
have they actually been blocked (as opposed to '> 32 sec'). I usually find
a pattern after watching that for a time and narrow things down to an OSD
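In practice that usually looks something like

  ceph health detail
  # note which "N ops are blocked > ... sec on osd.X" lines keep recurring
  ceph daemon osd.X dump_historic_ops   # on the host carrying osd.X

with the OSD id adjusted, obviously.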
After updating from 10.2.7 to 10.2.9 I have a bunch of blocked requests for
‘currently waiting for missing object’. I have tried bouncing the osds and
rebooting the osd nodes, but that just moves the problems around. Previous to
this upgrade we had no issues. Any ideas of what to look at?
Thank
Finally, problem solved.
First, I set the noscrub, nodeep-scrub, norebalance, nobackfill, norecover, noup
and nodown flags. Then I restarted the OSD which had the problem.
When the OSD daemon started, blocked requests increased (up to 100) and some
misplaced PGs appeared. Then I unset the flags in order: noup,
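(In command form, roughly what I did:

  for f in noscrub nodeep-scrub norebalance nobackfill norecover noup nodown; do ceph osd set $f; done
  systemctl restart ceph-osd@<id>   # the problematic OSD
  # afterwards, unset the flags again one by one with "ceph osd unset <flag>"

in case someone wants to reproduce it.)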
Hi,
Sometimes we have the same issue on our 10.2.9 cluster (24 nodes with 60
OSDs each).
I think there is some race condition or something like that
which results in this state. The blocked requests start exactly at
the time the PG begins to scrub.
You can try the following. The OSD will automatically
Hm. That's quite weird. On our cluster, when I set "noscrub",
"nodeep-scrub", scrubbing will always stop pretty quickly (a few
minutes). I wonder why this doesn't happen on your cluster. When exactly
did you set the flag? Perhaps it just needs some more time... Or there
might be a disk problem w
Hi Ranjan,
Thanks for your reply. I did set the noscrub and nodeep-scrub flags, but the
active scrub does not stop properly. The scrub is always stuck on the same
PG (20.1e).
$ ceph pg dump | grep scrub
dumped all in format plain
pg_stat objects mip degr misp unf bytes log
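For reference, the acting set of that PG comes from

  ceph pg map 20.1e   # the first OSD in the acting set is the primary

Would simply restarting that primary OSD be a reasonable next step?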
Hi Ramazan,
I'm no Ceph expert, but what I can say from my experience using Ceph is:
1) During "Scrubbing", Ceph can be extremely slow. This is probably
where your "blocked requests" are coming from. BTW: Perhaps you can even
find out which processes are currently blocking with: ps aux | grep
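Another thing that sometimes helps on the Ceph side (your mileage may vary) is
asking the OSD daemon directly what it is working on:

  ceph daemon osd.<id> dump_ops_in_flight   # run on the node hosting that OSD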
Hello,
I have a Ceph Cluster with specifications below:
3 x Monitor node
6 x Storage Node (6 disks per storage node, 6 TB SATA disks, all disks have SSD
journals)
Distributed public and private networks. All NICs are 10 Gbit/s.
osd pool default size = 3
osd pool default min size = 2
Ceph version is
On 10.12.2015 at 06:38, Robert LeBlanc wrote:
> Since I'm very interested in
> reducing this problem, I'm willing to try and submit a fix after I'm
> done with the new OP queue I'm working on. I don't know the best
> course of action at the moment, but I hope I can get some input for
> when I do t
Just try to give the booting OSD and all MONs the resources they ask for (CPU,
memory).
Yes, it causes disruption but only for a select group of clients, and only for
a moment (<20s with my extremely high number of PGs).
From a service provider perspective this might break SLAs, but until you get
On 10.12.2015 at 06:38, Robert LeBlanc wrote:
> I noticed this a while back and did some tracing. As soon as the PGs
> are read in by the OSD (very limited amount of housekeeping done), the
> OSD is set to the "in" state so that peering with other OSDs can
> happen and the recovery process can beg
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256
I noticed this a while back and did some tracing. As soon as the PGs
are read in by the OSD (very limited amount of housekeeping done), the
OSD is set to the "in" state so that peering with other OSDs can
happen and the recovery process can begin. Th
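(One crude workaround, untested for this particular case: keep the cluster from
marking booting OSDs in automatically and do it by hand once the daemon is
fully up, i.e.

  ceph osd set noin
  # start/restart the OSD and let it finish reading its PGs
  ceph osd in <id>
  ceph osd unset noin

so that peering only starts once the OSD is really ready.)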
On 09.12.2015 at 11:21, Jan Schermer wrote:
> Are you seeing "peering" PGs when the blocked requests are happening? That's
> what we see regularly when starting OSDs.
Mostly "peering" and "activating".
> I'm not sure this can be solved completely (and whether there are major
> improvements in
Are you seeing "peering" PGs when the blocked requests are happening? That's
what we see regularly when starting OSDs.
I'm not sure this can be solved completely (and whether there are major
improvements in newer Ceph versions), but it can be sped up by
1) making sure you have free (and not dirt
Hi,
I'm getting blocked requests (>30s) every time an OSD is set to "in" in
our clusters. Once that has happened, backfills run smoothly.
I have currently no idea where to start debugging. Has anyone a hint what to
examine first in order to narrow this issue?
TIA
Christian
--
Dipl-Inf. C
Hello,
On Thu, 28 May 2015 12:05:03 +0200 Xavier Serrano wrote:
> On Thu May 28 11:22:52 2015, Christian Balzer wrote:
>
> > > We are testing different scenarios before making our final decision
> > > (cache-tiering, journaling, separate pool,...).
> > >
> > Definitely a good idea to test thing
On Thu May 28 11:22:52 2015, Christian Balzer wrote:
> > We are testing different scenarios before making our final decision
> > (cache-tiering, journaling, separate pool,...).
> >
> Definitely a good idea to test things out and get an idea what Ceph and
> your hardware can do.
>
> From my experi
On Wed, 27 May 2015 15:38:26 +0200 Xavier Serrano wrote:
> Hello,
>
> On Wed May 27 21:20:49 2015, Christian Balzer wrote:
>
> >
> > Hello,
> >
> > On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote:
> >
> > > Hello,
> > >
> > > Slow requests, blocked requests and blocked ops occur quit
Hello,
On Wed May 27 21:20:49 2015, Christian Balzer wrote:
>
> Hello,
>
> On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote:
>
> > Hello,
> >
> > Slow requests, blocked requests and blocked ops occur quite often
> > in our cluster; too often, I'd say: several times during one day.
> >
Hello,
On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote:
> Hello,
>
> Slow requests, blocked requests and blocked ops occur quite often
> in our cluster; too often, I'd say: several times during one day.
> I must say we are running some tests, but we are far from pushing
> the cluster to
Hello,
Slow requests, blocked requests and blocked ops occur quite often
in our cluster; too often, I'd say: several times during one day.
I must say we are running some tests, but we are far from pushing
the cluster to the limit (or at least, that's what I believe).
Every time a blocked request/
Hello,
On Tue, 26 May 2015 10:00:13 -0600 Robert LeBlanc wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I've seen I/O become stuck after we have done network torture tests.
> It seems that after so many retries that the OSD peering just gives up
> and doesn't retry any more. An
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256
I've seen I/O become stuck after we have done network torture tests.
It seems that after so many retries the OSD peering just gives up
and doesn't retry any more. An OSD restart kicks off another round of
retries and the I/O completes. It seems
Hello,
Thanks for your detailed explanation, and for the pointer to the
"Unexplainable slow request" thread.
After investigating osd logs, disk SMART status, etc., the disk under
osd.71 seems OK, so we restarted the osd... And voilà, problem seems
to be solved! (or at least, the "slow request" me
Hello,
Firstly, find my "Unexplainable slow request" thread in the ML archives
and read all of it.
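Beyond that, the usual quick checks for a single slow OSD or disk apply, e.g.

  ceph osd perf           # look for outliers in commit/apply latency
  smartctl -a /dev/sdX    # on the drive behind the suspect OSD

before digging any deeper.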
On Tue, 26 May 2015 07:05:36 +0200 Xavier Serrano wrote:
> Hello,
>
> We have observed that our cluster is often moving back and forth
> from HEALTH_OK to HEALTH_WARN states due to "blocked reque
Hello,
We have observed that our cluster is often moving back and forth
from HEALTH_OK to HEALTH_WARN states due to "blocked requests".
We have also observed "blocked ops". For instance:
# ceph status
cluster 905a1185-b4f0-4664-b881-f0ad2d8be964
health HEALTH_WARN
1 requests
Hello,
On Mon, 4 Aug 2014 11:03:37 +0800 飞 wrote:
> hello, I have been running a Ceph cluster (RBD) in a production environment
> hosting 200 VMs. Under normal circumstances, Ceph's performance is quite
> good, but when I delete a snapshot or image, the cluster shows a lot of
> blocked requests
hello, I have been running a Ceph cluster (RBD) in a production environment
hosting 200 VMs. Under normal circumstances, Ceph's performance is quite good,
but when I delete a snapshot or image, the cluster shows a lot of blocked
requests (generally more than 1000), and then the whole cluster hav
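(I have seen osd_snap_trim_sleep mentioned for this kind of problem, e.g.

  ceph tell 'osd.*' injectargs '--osd_snap_trim_sleep 0.1'

but I am not sure whether that is the right approach here, or which value to
use.)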
[ Re-added the list since I don't have log files. ;) ]
On Mon, Dec 9, 2013 at 5:52 AM, Oliver Schulz wrote:
> Hi Greg,
>
> I'll send this privately, maybe better not to post log-files, etc.
> to the list. :-)
>
>
>> Nobody's reported it before, but I think the CephFS MDS is sending out
>> too man
On Sun, Dec 8, 2013 at 7:16 AM, Oliver Schulz wrote:
> Hello Ceph-Gurus,
>
> a short while ago I reported some trouble we had with our cluster
> suddenly going into a state of "blocked requests".
>
> We did a few tests, and we can reproduce the problem:
> During / after deleting of a substantial c
Hello Ceph-Gurus,
a short while ago I reported some trouble we had with our cluster
suddenly going into a state of "blocked requests".
We did a few tests, and we can reproduce the problem:
During / after deleting of a substantial chunk of data on
CephFS (a few TB), ceph health shows blocked requ