[ceph-users] Re: Global AVAIL vs Pool MAX AVAIL

2021-01-11 Thread Mark Johnson
Thanks Anthony,

Shortly after I made that post, I found a Server Fault post where someone had 
asked the exact same question.  The reply was this - "The 'MAX AVAIL' column 
represents the amount of data that can be used before the first OSD becomes 
full. It takes into account the projected distribution of data across disks 
from the CRUSH map and uses the 'first OSD to fill up' as the target."
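
A toy illustration of that projection, with made-up numbers rather than this
cluster's real ones: if the pool's fullest OSD has ~150G free and receives
roughly 1/10 of the pool's raw writes, then for a size=2 pool:

    echo $(( 150 * 10 / 2 ))   # ~750G more user data before that OSD fills;
                               # that projected figure is what MAX AVAIL reports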

To answer your question, yes, we have a rather unbalanced cluster, which is 
something I'm working on.  When I saw these figures, I got scared that I had 
less time to work on it than I thought.  There are about 10 pools in the 
cluster, but we primarily use one for almost all of our storage, and it only has 
64 PGs & 1 replica across 20 OSDs.  So, as data has grown, it works out that 
each PG in this pool accounts for about 148GB, and the OSDs are about 1.4TB 
each, so it's easy to see how it's found itself way out of balance.

Anyway, once I've added the OSDs and the data has rebalanced, I'm going to start 
incrementally increasing the PG count for this pool in stages, to reduce the 
amount of data per PG and (hopefully) balance out the data distribution better 
than it is now.
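
For the record, the staged approach will look roughly like this (a sketch with a
hypothetical pool name and step sizes, waiting for the cluster to settle between
increases):

    # hypothetical pool name ("volumes") and steps: 64 -> 128 -> 256 -> 512
    for pgs in 128 256 512; do
        ceph osd pool set volumes pg_num  "$pgs"
        ceph osd pool set volumes pgp_num "$pgs"
        # let peering/backfill finish before the next step
        until ceph health | grep -q HEALTH_OK; do sleep 60; done
    done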

This is one big learning process - I just wish I wasn't learning in production 
so much.



On Mon, 2021-01-11 at 15:58 -0800, Anthony D'Atri wrote:

Either you have multiple CRUSH roots or device classes, or you have unbalanced 
OSD utilization.  What version of Ceph?  Do you have any balancing enabled?


Do


ceph osd df | sort -nk8 | head

ceph osd df | sort -nk8 | tail


and I’ll bet you have some OSDs way more full than others.  I suspect the STDDEV 
value that ceph df reports is accordingly high.
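
For reference (a sketch, not advice tailored to this cluster), checking and
enabling the balancer module looks roughly like this:

    ceph balancer status
    ceph balancer mode upmap    # upmap needs all clients to be Luminous or newer
    ceph balancer on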


On Jan 11, 2021, at 2:07 PM, Mark Johnson <ma...@iovox.com> wrote:


Can someone please explain to me the difference between the Global "AVAIL" and 
the "MAX AVAIL" in the pools table when I do a "ceph df detail"?  The reason 
I ask is that we have a total of 14 pools; however, almost all of our data 
exists in one pool.  A "ceph df detail" shows the following:


GLOBAL:
    SIZE     AVAIL   RAW USED   %RAW USED   OBJECTS
    28219G   6840G   19945G     70.68       36112k


But the POOLS table from the same output shows the MAX AVAIL for each pool as 
498G and the pool with all the data shows 9472G used with a %USED of 95.00.  If 
it matters, the pool size is set to 2 so my guess is the global available 
figure is raw, meaning I should still have approx. 3.4TB available, but that 
95% used has me concerned.  I'm going to be adding some OSDs soon but still 
would like to understand the difference and how much trouble I'm in at this 
point.




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph 15.2.3 on Ubuntu 20.04 with odroid xu4 / python thread Problem

2021-01-11 Thread Oliver Weinmann

Hi again,

It took me some time, but I figured out that on Ubuntu Focal there is a 
more recent version of Ceph (15.2.7) available. So I gave it a try and 
replaced the ceph_argparse.py file, but it still gets stuck running the command:


[2021-01-11 23:44:06,340][ceph_volume.process][INFO  ] Running command: 
/usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring 
/var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 
c3a567be-08e0-4802-8b08-07d6891de485


any more clues?


On 08.01.2021 at 10:03, kefu chai wrote:

On Friday, 8 January 2021 at 04:30, Oliver Weinmann wrote:


Ok, I replaced the whole ceph_argparse.py file with the patched one from
github. Instead of throwing an error, it now seems to be stuck forever.
Or am I too impatient? I'm running


I don’t think so. In a healthy cluster, the command should complete in no
more than 1 second. I just checked the revision history of
ceph_argparse.py; there are a bunch of changes since the release of
Nautilus. My guess is that the version in master might include some bits
not compatible with Nautilus, so I’d suggest cherry-picking only the change
in that PR and trying again.


Debian Buster, so this is not the latest Ceph release (Octopus) but Nautilus:

root@odroidxu4:~# dpkg -l ceph
Desired=Unknown/Install/Remove/Purge/Hold
|
Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name   Version            Architecture  Description
+++-==-==-=-===
ii  ceph   14.2.15-3~bpo10+1  armhf         distributed storage and file system
On 07.01.2021 at 13:07, kefu chai wrote:



On Thursday, 7 January 2021 at 16:32, Oliver Weinmann wrote:


Hi,

thanks for the quick reply. I will test it. Do I have to recompile ceph
in order to test it?


No, you just need to apply the change to ceph_argparse.py.



On 07.01.2021 at 02:13, kefu chai wrote:



On Thursday, January 7, 2021, Oliver Weinmann wrote:


Hi,

I have a similar if not the same issue. I run Armbian Buster on my
Odroid HC2, which is the same as an XU4, and I get the following error when
trying to build a cluster with ceph-ansible:


We recently made a fix for a similar issue. See
https://github.com/ceph/ceph/pull/38665. Could you give it a shot? I
will backport it to LTS branches if it helps.




TASK [ceph-osd : use ceph-volume lvm batch to create bluestore osds]
***
Wednesday 06 January 2021  21:46:44 + (0:00:00.073) 0:02:01.697 *
fatal: [192.168.2.123]: FAILED! => changed=true
   cmd:
   - ceph-volume
   - --cluster
   - ceph
   - lvm
   - batch
   - --bluestore
   - --yes
   - /dev/sda
   delta: '0:00:02.979200'
   end: '2021-01-06 22:46:48.049074'
   msg: non-zero return code
   rc: 1
   start: '2021-01-06 22:46:45.069874'
   stderr: |-
 --> DEPRECATION NOTICE
 --> You are using the legacy automatic disk sorting behavior
 --> The Pacific release will change the default to --no-auto
 --> passed data devices: 1 physical, 0 LVM
 --> relative data size: 1.0
 Running command: /usr/bin/ceph-authtool --gen-print-key
 Running command: /usr/bin/ceph --cluster ceph --name
client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i
- osd new 8854fc6d-d637-40a9-a1b1-b8e20afd
  stderr: Traceback (most recent call last):
  stderr: File "/usr/bin/ceph", line 1273, in <module>
  stderr: retval = main()
  stderr: File "/usr/bin/ceph", line 982, in main
  stderr: conffile=conffile)
  stderr: File "/usr/lib/python3/dist-packages/ceph_argparse.py",
line 1320, in run_in_thread
  stderr: raise Exception("timed out")
  stderr: Exception: timed out
 Traceback (most recent call last):
   File "/usr/sbin/ceph-volume", line 11, in 
 load_entry_point('ceph-volume==1.0.0', 'console_scripts',
'ceph-volume')()
   File "/usr/lib/python3/dist-packages/ceph_volume/main.py", line
39, in __init__
 self.main(self.argv)
   File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py",
line 59, in newfunc
 return f(*a, **kw)
   File "/usr/lib/python3/dist-packages/ceph_volume/main.py", line
150, in main
 terminal.dispatch(self.mapper, subcommand_args)
   File "/usr/lib/python3/dist-packages/ceph_volume/terminal.py",
line 194, in dispatch
 instance.main()
   File
"/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/main.py", line 42,
in main
 terminal.dispatch(self.mapper, self.argv)
   File "/usr/lib/python3/dist-packages/ceph_volume/terminal.py",
line 194, in dispatch
 instance.main()
   File "/usr/lib/python3/dist-packages/ceph_volume/decorators.py",
line 16, in is_root
 return func(*a, **kw)
   File
"/usr/lib/python3/dist-packages/ceph_volume/devices/lvm/batch.py", line
415, in main
 self._execute(plan)
   File

[ceph-users] denied reconnect attempt for ceph fs client

2021-01-11 Thread Frank Schilder
Hi all,

I'm not 100% sure, but I believe that since the update from mimic-13.2.8 to 
mimic-13.2.10 I have had a strange issue. If a ceph fs client becomes unresponsive, 
it is evicted, but it cannot reconnect; see the ceph.log extract below. In the 
past, clients would retry after the blacklist period and everything continued 
fine. I'm wondering why the clients cannot reconnect any more. I see this now 
every time a client gets thrown out, and I didn't have this problem before.

Any hints as to what I might want to change are welcome, as well as information on 
what might have changed during the update (e.g. is this expected or not).

2021-01-11 19:45:53.839721 [INF]  denied reconnect attempt (mds is up:active) 
from client.30770997 192.168.56.121:0/2325067585 after 1.82224e+06 (allowed 
interval 45)
2021-01-11 19:45:46.713822 [INF]  Health check cleared: MDS_SLOW_REQUEST (was: 
1 MDSs report slow requests)
2021-01-11 19:45:45.937126 [INF]  MDS health message cleared (mds.0): 1 slow 
requests are blocked > 30 secs
2021-01-11 19:45:39.527180 [INF]  Evicting (and blacklisting) client session 
30770997 (192.168.56.121:0/2325067585)
2021-01-11 19:45:39.527168 [WRN]  evicting unresponsive client 
HOSTNAME:CLIENT_NAME (30770997), after 62.735 seconds
2021-01-11 19:45:39.522141 [WRN]  1 slow requests, 0 included below; oldest 
blocked for > 50.371751 secs
2021-01-11 19:45:34.522085 [WRN]  1 slow requests, 0 included below; oldest 
blocked for > 45.371685 secs
2021-01-11 19:45:29.521991 [WRN]  1 slow requests, 0 included below; oldest 
blocked for > 40.371604 secs
2021-01-11 19:45:24.521907 [WRN]  1 slow requests, 0 included below; oldest 
blocked for > 35.371520 secs
2021-01-11 19:45:19.521895 [WRN]  slow request 30.371469 seconds old, received 
at 2021-01-11 19:44:49.150361: client_request(client.30771333:10419033 getattr 
pAsLsXsFs #0x10014446b0f 2021-01-11 19:44:49.145012 caller_uid=257062, 
caller_gid=257062{}) currently failed to rdlock, waiting
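
For context, the blacklist entries created by these evictions can be inspected
(and, with care, removed) like this; a sketch, not a suggested fix:

    ceph osd blacklist ls
    # ceph osd blacklist rm 192.168.56.121:0/2325067585   # only if the client is really gone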

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Global AVAIL vs Pool MAX AVAIL

2021-01-11 Thread Mark Johnson
Can someone please explain to me the difference between the Global "AVAIL" and 
the "MAX AVAIL" in the pools table when I do a "ceph df detail"?  The reason 
I ask is that we have a total of 14 pools; however, almost all of our data 
exists in one pool.  A "ceph df detail" shows the following:

GLOBAL:
    SIZE     AVAIL   RAW USED   %RAW USED   OBJECTS
    28219G   6840G   19945G     70.68       36112k

But the POOLS table from the same output shows the MAX AVAIL for each pool as 
498G and the pool with all the data shows 9472G used with a %USED of 95.00.  If 
it matters, the pool size is set to 2 so my guess is the global available 
figure is raw, meaning I should still have approx. 3.4TB available, but that 
95% used has me concerned.  I'm going to be adding some OSDs soon but still 
would like to understand the difference and how much trouble I'm in at this 
point.
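
For what it's worth, the arithmetic behind that guess, as a rough sketch:

    # Global AVAIL is raw: 6840G free across all OSDs.  With pool size = 2:
    echo $(( 6840 / 2 ))   # ~3420G usable, but only if data were perfectly balanced
    # Pool MAX AVAIL (498G) is instead the projection until the *first* OSD
    # fills up, so the gap between ~3420G and 498G reflects imbalance,
    # not missing capacity.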


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluefs_buffered_io=false performance regression

2021-01-11 Thread Robert Sander
Hi Marc and Dan,

thanks for your quick responses assuring me that we did nothing totally
wrong.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] "ceph orch restart mgr" command creates mgr restart loop

2021-01-11 Thread Chris Read
Greetings all...

I'm busy testing out Ceph and have hit this troublesome bug while following
the steps outlined here:

https://docs.ceph.com/en/octopus/cephadm/monitoring/#configuring-ssl-tls-for-grafana

When I issue the "ceph orch restart mgr" command, it appears the command is
not cleared from a message queue somewhere (I'm still very unclear on many
ceph specifics), and so each time the mgr process returns from restart it
picks up the message again and keeps restarting itself forever (so far it's
been stuck in this state for 45 minutes).

Watching the logs we see this going on:

$ ceph log last cephadm -w

root@ceph-poc-000:~# ceph log last cephadm -w
  cluster:
id: d23bc326-543a-11eb-bfe0-b324db228b6c
health: HEALTH_OK

  services:
mon: 5 daemons, quorum
ceph-poc-000,ceph-poc-003,ceph-poc-004,ceph-poc-002,ceph-poc-001 (age 2h)
mgr: ceph-poc-000.himivo(active, since 4s), standbys:
ceph-poc-001.unjulx
osd: 10 osds: 10 up (since 2h), 10 in (since 2h)

  data:
pools:   1 pools, 1 pgs
objects: 0 objects, 0 B
usage:   10 GiB used, 5.4 TiB / 5.5 TiB avail
pgs: 1 active+clean


2021-01-11T20:46:32.976606+ mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:32.980749+ mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:46:33.061519+ mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:46:39.156420+ mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:39.160618+ mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:46:39.242603+ mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:46:45.299953+ mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:45.304006+ mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:46:45.733495+ mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:46:51.871903+ mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:51.877107+ mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:46:51.976190+ mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:46:58.000720+ mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:58.006843+ mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:46:58.097163+ mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:47:04.188630+ mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:47:04.193501+ mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:47:04.285509+ mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available
2021-01-11T20:47:10.348099+ mon.ceph-poc-000 [INF] Active manager
daemon ceph-poc-000.himivo restarted
2021-01-11T20:47:10.352340+ mon.ceph-poc-000 [INF] Activating manager
daemon ceph-poc-000.himivo
2021-01-11T20:47:10.752243+ mon.ceph-poc-000 [INF] Manager daemon
ceph-poc-000.himivo is now available

And in the logs for the mgr instance itself we see it keep replaying the
message over and over:

$ docker logs -f
ceph-d23bc326-543a-11eb-bfe0-b324db228b6c-mgr.ceph-poc-000.himivo
debug 2021-01-11T20:47:31.390+ 7f48b0d0d200  0 set uid:gid to 167:167
(ceph:ceph)
debug 2021-01-11T20:47:31.390+ 7f48b0d0d200  0 ceph version 15.2.8
(bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process
ceph-mgr, pid 1
debug 2021-01-11T20:47:31.390+ 7f48b0d0d200  0 pidfile_write: ignore
empty --pid-file
debug 2021-01-11T20:47:31.414+ 7f48b0d0d200  1 mgr[py] Loading python
module 'alerts'
debug 2021-01-11T20:47:31.486+ 7f48b0d0d200  1 mgr[py] Loading python
module 'balancer'
debug 2021-01-11T20:47:31.542+ 7f48b0d0d200  1 mgr[py] Loading python
module 'cephadm'
debug 2021-01-11T20:47:31.742+ 7f48b0d0d200  1 mgr[py] Loading python
module 'crash'
debug 2021-01-11T20:47:31.798+ 7f48b0d0d200  1 mgr[py] Loading python
module 'dashboard'
debug 2021-01-11T20:47:32.258+ 7f48b0d0d200  1 mgr[py] Loading python
module 'devicehealth'
debug 2021-01-11T20:47:32.306+ 7f48b0d0d200  1 mgr[py] Loading python
module 'diskprediction_local'
debug 2021-01-11T20:47:32.498+ 7f48b0d0d200  1 mgr[py] Loading python
module 'influx'
debug 2021-01-11T20:47:32.550+ 7f48b0d0d200  1 mgr[py] Loading python
module 'insights'
debug 2021-01-11T20:47:32.598+ 7f48b0d0d200  1 mgr[py] Loading python
module 'iostat'
debug 2021-01-11T20:47:32.642+ 7f48b0d0d200  1 mgr[py] Loading python
module 'k8sevents'
debug 2021-01-11T20:47:33.034+ 7f48b0d0d200  1 mgr[py] Loading python
module 'localpool'
debug 

[ceph-users] DocuBetter Meeting This Week -- 13 Jan 2021 1730 UTC

2021-01-11 Thread John Zachary Dover
Unless an unforeseen crisis arises, the DocuBetter meetings for the next
two months will focus on ensuring that we have a smooth and
easy-to-understand docs suite for the release of Pacific.

Meeting: https://bluejeans.com/908675367
Etherpad: https://pad.ceph.com/p/Ceph_Documentation
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluefs_buffered_io=false performance regression

2021-01-11 Thread Dan van der Ster
And to add some references, there is a PR on hold here:
https://github.com/ceph/ceph/pull/38044, which links some relevant
tracker entries.
Outside of large block.db removals (e.g. from backfilling or snap
trimming) we didn't notice a huge difference -- though that is not
conclusive.
There are several PG removal optimizations in the pipeline which
hopefully fix the issues in a different way, rather than needing
buffered io.

-- dan

On Mon, Jan 11, 2021 at 5:20 PM Mark Nelson  wrote:
>
> Hi Robert,
>
>
> We are definitely aware of this issue.  It appears to often be related
> to snap trimming and we believe possibly related to excessive thrashing
> of the rocksdb block cache.  I suspect that when bluefs_buffered_io is
> enabled it hides the issue and people don't notice the problem, but that
> might be related to why we see the other issue with the kernel with rgw
> workloads.  I would recommend that if you didn't see issues with
> bluefs_buffered_io enabled, you can re-enable it and periodically check
> to make sure you aren't hitting issues with kernel swap.  Unfortunately
> we are sort of between a rock and a hard place on this one until we
> solve the root cause.
>
>
> Right now we're looking at trying to reduce thrashing in the rocksdb
> block cache(s) by splitting up onode and omap (and potentially pglog and
> allocator) block cache into their own distinct entities.  My hope is
> that we can finesse the situation so that the overall system page cache
> is no longer required to avoid excessive reads assuming enough memory
> has been assigned to the osd_memory_target.
>
>
> Mark
>
>
> On 1/11/21 9:47 AM, Robert Sander wrote:
> > Hi,
> >
> > bluefs_buffered_io was disabled in Ceph version 14.2.11.
> >
> > The cluster started last year with 14.2.5 and got upgraded over the year 
> > now running 14.2.16.
> >
> > The performance was OK first but got abysmal bad at the end of 2020.
> >
> > We checked the components and HDDs and SSDs seem to be fine. Single disk 
> > benchmarks showed performance according the specs.
> >
> > Today we (re-)enabled bluefs_buffered_io and restarted all OSD processes on 
> > 248 HDDs distributed over 12 nodes.
> >
> > Now the benchmarks are fine again: 434MB/s write instead of 60MB/s, 960MB/s 
> > read instead of 123MB/s.
> >
> > This setting was disabled in 14.2.11 because "in some test cases it appears 
> > to cause excessive swap utilization by the linux kernel and a large 
> > negative performance impact after several hours of run time."
> > We have to monitor if this will happen in our cluster. Is there any other 
> > negative side effect currently known?
> >
> > Here are the rados bench values, first with bluefs_buffered_io=false, then 
> > with bluefs_buffered_io=true:
> >
> > Bench Total   Total   Write   Object  BandStddev  Max Min   
> >   Average Stddev  Max Min Average Stddev  Max Min
> >   timewrites  ReadsizewidthBandwidth
> >IOPS Latency (s)
> >   run reads   size(MB/sec)
> >   made
> > false write   33,081  490 4194304 4194304 59,2485 71,3829 264 0 
> >   14  17,8702 66  0   1,07362 2,83017 20,71   0,0741089
> > false seq 15,8226 490 4194304 4194304 123,874   
> >   30  46,8659 174 0   0,51453 9,53873 0,00343417
> > false rand38,2615 21314194304 4194304 222,782   
> >   55  109,374 415 0   0,28191 12,1039 0,00327948
> > true write30,4612 33084194304 4194304 434,389 26,0323 480 376   
> >   108 6,50809 120 94  0,14683 0,07368 0,99791 0,0751249
> > true seq  13,7628 33084194304 4194304 961,429   
> >   240 22,544  280 184 0,06528 0,88676 0,00338191
> > true rand 30,1007 82474194304 4194304 1095,92   
> >   273 25,5066 313 213 0,05719 0,99140 0,00325295
> >
> > Regards
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluefs_buffered_io=false performance regression

2021-01-11 Thread Mark Nelson

Hi Robert,


We are definitely aware of this issue.  It appears to often be related 
to snap trimming and we believe possibly related to excessive thrashing 
of the rocksdb block cache.  I suspect that when bluefs_buffered_io is 
enabled it hides the issue and people don't notice the problem, but that 
might be related to why we see the other issue with the kernel with rgw 
workloads.  I would recommend that if you didn't see issues with 
bluefs_buffered_io enabled, you can re-enable it and periodically check 
to make sure you aren't hitting issues with kernel swap.  Unfortunately 
we are sort of between a rock and a hard place on this one until we 
solve the root cause.
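
Concretely, re-enabling it and watching swap could look roughly like this (a
sketch; restart the OSDs afterwards, as Robert described doing):

    ceph config set osd bluefs_buffered_io true
    # then restart the OSDs and keep an eye on swap over the following days:
    watch -n 60 'free -m'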



Right now we're looking at trying to reduce thrashing in the rocksdb 
block cache(s) by splitting up onode and omap (and potentially pglog and 
allocator) block cache into their own distinct entities.  My hope is 
that we can finesse the situation so that the overall system page cache 
is no longer required to avoid excessive reads assuming enough memory 
has been assigned to the osd_memory_target.



Mark


On 1/11/21 9:47 AM, Robert Sander wrote:

Hi,

bluefs_buffered_io was disabled in Ceph version 14.2.11.

The cluster started last year with 14.2.5 and got upgraded over the year now 
running 14.2.16.

The performance was OK at first but got abysmally bad at the end of 2020.

We checked the components, and the HDDs and SSDs seem to be fine. Single-disk 
benchmarks showed performance according to the specs.

Today we (re-)enabled bluefs_buffered_io and restarted all OSD processes on 248 
HDDs distributed over 12 nodes.

Now the benchmarks are fine again: 434MB/s write instead of 60MB/s, 960MB/s 
read instead of 123MB/s.

This setting was disabled in 14.2.11 because "in some test cases it appears to cause 
excessive swap utilization by the linux kernel and a large negative performance impact 
after several hours of run time."
We have to monitor if this will happen in our cluster. Is there any other 
negative side effect currently known?

Here are the rados bench values, first with bluefs_buffered_io=false, then with 
bluefs_buffered_io=true:

(Write size and object size were 4194304 bytes in all runs; "Ops" is the total
writes/reads made; "-" = not reported by rados bench for read tests.)

Bench        Time(s)  Ops   Bandwidth (MB/sec)               IOPS                        Latency (s)
                            Avg      Stddev   Max   Min      Avg  Stddev   Max  Min      Avg      Stddev   Max      Min
false write  33,081   490   59,2485  71,3829  264   0        14   17,8702  66   0        1,07362  2,83017  20,71    0,0741089
false seq    15,8226  490   123,874  -        -     -        30   46,8659  174  0        0,51453  -        9,53873  0,00343417
false rand   38,2615  2131  222,782  -        -     -        55   109,374  415  0        0,28191  -        12,1039  0,00327948
true write   30,4612  3308  434,389  26,0323  480   376      108  6,50809  120  94       0,14683  0,07368  0,99791  0,0751249
true seq     13,7628  3308  961,429  -        -     -        240  22,544   280  184      0,06528  -        0,88676  0,00338191
true rand    30,1007  8247  1095,92  -        -     -        273  25,5066  313  213      0,05719  -        0,99140  0,00325295

Regards

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] bluefs_buffered_io=false performance regression

2021-01-11 Thread Robert Sander
Hi,

bluefs_buffered_io was disabled in Ceph version 14.2.11.

The cluster started last year with 14.2.5 and got upgraded over the year now 
running 14.2.16.

The performance was OK at first but got abysmally bad at the end of 2020.

We checked the components, and the HDDs and SSDs seem to be fine. Single-disk 
benchmarks showed performance according to the specs.

Today we (re-)enabled bluefs_buffered_io and restarted all OSD processes on 248 
HDDs distributed over 12 nodes.

Now the benchmarks are fine again: 434MB/s write instead of 60MB/s, 960MB/s 
read instead of 123MB/s.

This setting was disabled in 14.2.11 because "in some test cases it appears to 
cause excessive swap utilization by the linux kernel and a large negative 
performance impact after several hours of run time."
We have to monitor if this will happen in our cluster. Is there any other 
negative side effect currently known?

Here are the rados bench values, first with bluefs_buffered_io=false, then with 
bluefs_buffered_io=true:

(Write size and object size were 4194304 bytes in all runs; "Ops" is the total
writes/reads made; "-" = not reported by rados bench for read tests.)

Bench        Time(s)  Ops   Bandwidth (MB/sec)               IOPS                        Latency (s)
                            Avg      Stddev   Max   Min      Avg  Stddev   Max  Min      Avg      Stddev   Max      Min
false write  33,081   490   59,2485  71,3829  264   0        14   17,8702  66   0        1,07362  2,83017  20,71    0,0741089
false seq    15,8226  490   123,874  -        -     -        30   46,8659  174  0        0,51453  -        9,53873  0,00343417
false rand   38,2615  2131  222,782  -        -     -        55   109,374  415  0        0,28191  -        12,1039  0,00327948
true write   30,4612  3308  434,389  26,0323  480   376      108  6,50809  120  94       0,14683  0,07368  0,99791  0,0751249
true seq     13,7628  3308  961,429  -        -     -        240  22,544   280  184      0,06528  -        0,88676  0,00338191
true rand    30,1007  8247  1095,92  -        -     -        273  25,5066  313  213      0,05719  -        0,99140  0,00325295

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG: 
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD Image can't be formatted - blk_error

2021-01-11 Thread Ilya Dryomov
On Mon, Jan 11, 2021 at 10:09 AM Gaël THEROND  wrote:
>
> Hi Ilya,
>
> Here is additional information:
> My cluster is a three OSD Nodes cluster with each node having 24 4TB SSD 
> disks.
>
> The mkfs.xfs command fail with the following error: 
> https://pastebin.com/yTmMUtQs
>
> I'm using the following command to format the image: mkfs.xfs 
> /dev/rbd//
> I'm facing the same problem (and same sectors) if I'm directly targeting the 
> device with mkfs.xfs /dev/rbb
>
> The client authentication caps are as follows: https://pastebin.com/UuAHRycF
>
> Regarding your questions, yes, it is a persistent issue as soon as I try to 
> create a large image from a newly created pool.
> Yes, after the first attempt, all new attempts fail too.
> Yes, it is always the same set of sectors that fails.

Have you tried writing to sector 0, just to take mkfs.xfs out of the
picture?  E.g. "dd if=/dev/zero of=/dev/rbd17 bs=512 count=1 oflag=direct"?

>
> Strange thing is, if I use an already existing pool, and create this 80Tb 
> image within this pool, it formats it correctly.

What do you mean by a newly created pool?  A metadata pool, a data pool
or both?

Are you deleting and re-creating pools (whether metadata or data) with
the same name?  It would help if you paste all commands, starting with
how you create pools all the way to a failing write.

Have you tried mapping using the admin user ("rbd map --id admin ...")?

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [cephadm] Point release minor updates block themselves infinitely

2021-01-11 Thread Paul Browne
The next thing I tried was taking a low-impact host and purging all
Ceph/Podman state from it to re-install it from scratch (a Rados GW
instance, in this case).

But I'm now seeing this strange error just re-adding the host via a "ceph
orch host add", at the point where a disk inventory is attempted to be
taken:

2021-01-11 12:01:16,028 DEBUG Running command: /bin/podman run --rm
--ipc=host --net=host --entrypoint /usr/sbin/ceph-volume --privileged
--group-add=disk -e CONTAINER_IMAGE=docker.io/ceph/ceph:v15.2.7 -e
NODE_NAME=ceph02-rgw-01 -v
/var/run/ceph/fbbe7cac-3324-11eb-8186-34800d5b932c:/var/run/ceph:z -v
/var/log/ceph/fbbe7cac-3324-11eb-8186-34800d5b932c:/var/log/ceph:z -v
/var/lib/ceph/fbbe7cac-3324-11eb-8186-34800d5b932c/crash:/var/lib/ceph/crash:z
-v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
/run/lock/lvm:/run/lock/lvm docker.io/ceph/ceph:v15.2.7 inventory
--format=json --filter-for-batch
2021-01-11 12:01:16,518 INFO /bin/podman:stderr usage: ceph-volume
inventory [-h] [--format {plain,json,json-pretty}] [path]
2021-01-11 12:01:16,519 INFO /bin/podman:stderr ceph-volume inventory:
error: unrecognized arguments: --filter-for-batch

Deployment of RGW container to the host is similarly blocked again.

This seems to be covered by this ceph-volume issue:
https://tracker.ceph.com/issues/48694, but it may not
have been addressed yet...





On Mon, 11 Jan 2021 at 10:36, Paul Browne  wrote:

> Hello all,
>
> I've been having some real troubles in getting cephadm to apply some very
> minor point release updates cleanly, twice now applying the point update of
> 15.2.6 -> 15.2.7 and 15.2.7 to 15.2.8 has gotten blocked somewhere and
> ended up making no progress, requiring digging deep into internals to
> unblock things.
>
> In the most recent attempt of 15.2.7 -> 15.2.8, the Orchestrator cleanly
> replaced Mon and MGR containers in the first steps, but when it came to
> replacing Crash daemon containers the running 15.2.7 Crash container was
> purged but container update operations then seem to get blocked on trying
> to start it again on the older image, leading to an infinite loop of Podman
> trying to start a non-existent container in logging;
>
> https://pastebin.com/9zdMs1XU
>
> Forcing an `ceph orch daemon rm` of the Crash daemon affected for the host
> just repeats the loop again.
>
> I'd then tried removing the Crash service and all daemons through the
> Orchestrator API next.
>
> This purged all running Crash containers from all hosts, and then
> re-applyed a service spec to restart them, hopefully on the new image.
>
> The Orchestrator removal of the Crash containers seems to have left
> container state dangling on hosts however, as now we see the same issue of
> Crash containers not starting on *every* host in the cluster due to
> left-over container state ;
>
> https://pastebin.com/tjaegxqg
>
> At this point I'm not certain if Podman (v1.6.4 EPEL, CentOS7.9) or
> Orchestrator is to blame for leaving this state dangling and blocking new
> container creation, but it's proving a real problem in applying even simple
> minor version point updates.
>
> Has anyone else been seeing similar behaviour in applying minor version
> updates via cephadm+Orchestrator? Are there any good workarounds to clean
> up the dangling container state?
>
> --
> ***
> Paul Browne
> Research Computing Platforms
> University Information Services
> Roger Needham Building
> JJ Thompson Avenue
> University of Cambridge
> Cambridge
> United Kingdom
> E-Mail: pf...@cam.ac.uk
> Tel: 0044-1223-746548
> ***
>


-- 
***
Paul Browne
Research Computing Platforms
University Information Services
Roger Needham Building
JJ Thompson Avenue
University of Cambridge
Cambridge
United Kingdom
E-Mail: pf...@cam.ac.uk
Tel: 0044-1223-746548
***
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] [cephadm] Point release minor updates block themselves infinitely

2021-01-11 Thread Paul Browne
Hello all,

I've been having some real trouble getting cephadm to apply some very
minor point-release updates cleanly. Twice now, applying the point update
(15.2.6 -> 15.2.7, and then 15.2.7 -> 15.2.8) has gotten blocked somewhere and
ended up making no progress, requiring digging deep into internals to
unblock things.

In the most recent attempt, 15.2.7 -> 15.2.8, the Orchestrator cleanly
replaced the Mon and MGR containers in the first steps, but when it came to
replacing the Crash daemon containers, the running 15.2.7 Crash container was
purged and container update operations then got blocked trying to restart it
on the older image, leading to an infinite loop of Podman trying to start
a non-existent container in the logs;

https://pastebin.com/9zdMs1XU

Forcing a `ceph orch daemon rm` of the affected Crash daemon for that host
just repeats the loop again.

I then tried removing the Crash service and all of its daemons through the
Orchestrator API next.

This purged all running Crash containers from all hosts, and then
re-applied a service spec to restart them, hopefully on the new image.

The Orchestrator removal of the Crash containers seems to have left
container state dangling on the hosts, however, as we now see the same issue
of Crash containers not starting on *every* host in the cluster due to
left-over container state;

https://pastebin.com/tjaegxqg

At this point I'm not certain if Podman (v1.6.4 EPEL, CentOS7.9) or
Orchestrator is to blame for leaving this state dangling and blocking new
container creation, but it's proving a real problem in applying even simple
minor version point updates.

Has anyone else been seeing similar behaviour in applying minor version
updates via cephadm+Orchestrator? Are there any good workarounds to clean
up the dangling container state?

-- 
***
Paul Browne
Research Computing Platforms
University Information Services
Roger Needham Building
JJ Thompson Avenue
University of Cambridge
Cambridge
United Kingdom
E-Mail: pf...@cam.ac.uk
Tel: 0044-1223-746548
***
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd gradual reweight question

2021-01-11 Thread mj

Hi Anthony and Frank,

Thanks for your responses!

I think you have answered my question: the impact of one complete sudden 
reweight to zero is bigger, because of the increased peering that is 
happening.


With impact I meant: OSDs being marked down by the cluster (and 
automatically coming back online); client performance seems basically 
unaffected, and no OSDs are crashing, etc.


And yes: the cluster recovers quickly from the OSDs that are 
(temporarily) down. Besides, I also set noout, so the impact was limited 
anyway.


I will next time also set nodown, thanks for that suggestion.

I had already set the osd_op_queue_cutoff and recovery/backfill settings 
to 1.


Thank you both for your answers! We'll continue with the gradual weight 
decreases. :-)
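
For anyone curious, the gradual drain is essentially this (a sketch with a
hypothetical OSD id; we wait for data movement to settle between steps):

    # hypothetical OSD id; drain from crush weight 3.6 to 0 in ~0.5 steps
    for w in 3.1 2.6 2.1 1.6 1.1 0.6 0; do
        ceph osd crush reweight osd.42 "$w"
        # wait for backfill/recovery to finish before the next step
        while ceph -s | grep -Eq 'backfill|recover|peering'; do sleep 60; done
    done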


MJ

On 1/9/21 12:28 PM, Frank Schilder wrote:

One reason for such observations is swap usage. If you have swap configured, 
you should probably disable it. Swap can be useful with ceph, but you really 
need to know what you are doing and how swap actually works (it is not for 
providing more RAM as most people tend to believe).

In my case, I have substantial amounts swap configured. Then one needs to be 
aware of its impact on certain ceph operations. Code and data that's rarely 
used, as well as leaked memory will end up on swap. During normal operations, 
that is not a problem. However, during exceptional operations, you are likely 
in a situation where all OSDs try to swap the same code/data in/out at the same 
time, which can temporarily lead to very large response latencies.

One of these exceptional operations are large peering operations. The code/data 
for peering is rarely used, so it will be on swap. The increased latency can be 
bad enough for MONs to mark OSDs as down for a short while, I have seen that. 
Usually, the cluster recovers very quickly and this is not a real issue if you 
have an actual OSD fail.

If you add/remove disks, it can be irritating. The workaround is to set nodown 
in addition to noout when doing admin. This will not only speed up peering 
dramatically, it will also ignore the increased heartbeat ping times during the 
admin operation. I see the warnings, but no detrimental effects.

In general, deploying swap in a ceph cluster is more an exception than a rule. 
The most common use is to allow a cluster to recover during a period of 
increased RAM requirements. There are cases in this list for both, MDS and OSD 
recoveries where having more address space was the only way forward. If 
deployed during normal operation, swap really needs to be fast and be able to 
handle simultaneous requests from many processes in parallel. Usually, only RAM 
is fast enough, so don't buy NVMe drives, just buy more RAM. Having some fast 
drives in stock for emergency swap deployment is a good idea though.

I deployed swap to cope with a memory leak that was present in mimic 13.2.8. 
Seems to be fixed in 13.2.10. If swap is fast enough, the impact is there but 
harmless. Swap on a crappy disk is dangerous.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: 08 January 2021 23:58:43
To: ceph-users@ceph.io
Subject: [ceph-users] Re: osd gradual reweight question



Hi,

We are replacing HDD with SSD, and we first (gradually) drain (reweight) the 
HDDs with 0.5 steps until 0 = empty.

Works perfectly.

Then (just for kicks) I tried reducing HDD weight from 3.6 to 0 in one large 
step. That seemed to have had more impact on the cluster, and we even noticed 
some OSD's temporarily go down after a few minutes. It all worked out, but the 
impact seemed much larger.


Please clarify “impact”.  Do you mean that client performance was decreased, or 
something else?


We never had OSDs go down when gradually reducing the weight step by step. This 
surprised us.


Please also clarify what you mean by going down — do you mean being marked 
“down” by the mons, or the daemons actually crashing?  I’m not being critical — 
I want to fully understand your situation.


Is it expected that the impact of a sudden reweight from 3.6 to 0 is bigger 
than a gradual step-by-step decrease?


There are a lot of variables there, so It Depends.

For sure going in one step means that more PGs will peer, which can be 
expensive.  I’ll speculate, with incomplete information, that this is what most 
of what you’re seeing.


I would assume the impact to be similar, only the time it takes to reach 
HEALTH_OK to be longer.


The end result, yes — the concern is how we get there.

The strategy of incremental downweighting has some advantages:

* If something goes wrong, you can stop without having a huge delta of data to 
move before health is restored
* Peering is spread out
* Impact on the network and drives *may* be less at a given time

A disadvantage is that you end up moving some data more than once.  This was 
worse with older releases and CRUSH details than with recent 

[ceph-users] Re: performance impact by pool deletion?

2021-01-11 Thread Scheurer François
Many thanks, Eugen, for sharing your experience!! Very useful information.

(I thought I was maybe too paranoid... thank God I asked the mailing list first!)


Cheers

Francois

--


EveryWare AG
François Scheurer
Senior Systems Engineer
Zurlindenstrasse 52a
CH-8003 Zürich

tel: +41 44 466 60 00
fax: +41 44 466 60 10
mail: francois.scheu...@everyware.ch
web: http://www.everyware.ch



From: Eugen Block 
Sent: Wednesday, January 6, 2021 8:17 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: performance impact by pool deletion?

Hi,

one of our customers just recently tried that in a non-production
cluster (Nautilus) and it was horrible. They wanted to test a couple
of failure-resiliency scenarios and needed to purge old data (I
believe around 200 TB or so); we ended up recreating the OSDs instead
of waiting for the deletion to finish. They have HDDs with separate
rocksDB on SSDs (ratio 1:5) and the SSDs were completely saturated
according to iostat.

It's a known issue (see [1], [2]); I'll just quote Igor:

> there are two issues with massive data removals
> 1) It might take a pretty long time to complete since it performs 15
> object removals per second per PG/OSD. Hence it might take days.
> 2) RocksDB starts to perform poorly from a performance point of view
> after (or even during) such removals.
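
Back-of-the-envelope only, with assumed numbers, to get a feel for that rate:

    # at ~15 removals/s per PG, a PG holding ~100k objects needs about
    echo $(( 100000 / 15 ))   # ~6666 seconds, i.e. close to two hours
    # and with many such PGs queued up behind each other per OSD, a
    # 120M-object pool easily turns into days of removal work.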

As far as I can see the backports for the fix are pending for Mimic,
Nautilus and Octopus.

Regards,
Eugen


[1] https://tracker.ceph.com/issues/45765
[2] https://tracker.ceph.com/issues/47044

Quoting Scheurer François:

> Hi everybody
>
>
>
> Does somebody had experience with important performance degradations during
>
> a pool deletion?
>
>
> We are asking because we are going to delete a 370 TiB with 120 M
> objects and have never done this in the past.
>
> The pool is using erasure coding 8+2 on nvme ssd's with rocksdb/wal
> on nvme optane disks.
>
> Openstack VM's are running on the other rbd pools.
>
>
> Thank you in advance for your feedback!
>
>
> Cheers
>
> Francois
>
>
> PS:
>
> this is no option, as it take about 100 year to complete ;-) :
>
> rados -p ch-zh1-az1.rgw.buckets.data ls | while read i; do rados -p
> ch-zh1-az1.rgw.buckets.data rm "$i"; done
>
>
>
>
>
> --
>
>
> EveryWare AG
> François Scheurer
> Senior Systems Engineer
> Zurlindenstrasse 52a
> CH-8003 Zürich
>
> tel: +41 44 466 60 00
> fax: +41 44 466 60 10
> mail: francois.scheu...@everyware.ch
> web: http://www.everyware.ch


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: performance impact by pool deletion?

2021-01-11 Thread Scheurer François
Thank you, Glen and Frank, for sharing your experience!


Cheers

Francois


--


EveryWare AG
François Scheurer
Senior Systems Engineer
Zurlindenstrasse 52a
CH-8003 Zürich

tel: +41 44 466 60 00
fax: +41 44 466 60 10
mail: francois.scheu...@everyware.ch
web: http://www.everyware.ch



From: Frank Schilder 
Sent: Saturday, January 9, 2021 12:10 PM
To: Glen Baars; Scheurer François; ceph-users@ceph.io
Subject: Re: performance impact by pool deletion?

Hi all,

I deleted a ceph fs data pool (EC 8+2) of size 240TB with about 150M objects 
and it had no observable impact at all. Client IO and admin operations worked 
just like before. In fact, I was surprised how fast it went and how fast the 
capacity became available again. It was probably just a few days, but I don't 
remember the exact times any more.

My version back then was mimic 13.2.8. All OSDs had collocated WAL/DB, 
everything on spindles. My impression from reports on this list is that this 
started becoming a problem with changes made in Nautilus.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Glen Baars 
Sent: 09 January 2021 08:15:51
To: Scheurer François; ceph-users@ceph.io
Subject: [ceph-users] Re: performance impact by pool deletion?

I deleted a 240TB rgw pool a few weeks ago and it caused a huge slowdown. Luckily 
it wasn't an important cluster, otherwise it would've taken it down for a week.

From: Scheurer François 
Sent: Wednesday, 6 January 2021 11:32 PM
To: ceph-users@ceph.io
Subject: [ceph-users] performance impact by pool deletion?


Hi everybody





Does somebody have experience with significant performance degradation during

a pool deletion?



We are asking because we are going to delete a 370 TiB pool with 120 M objects and 
have never done this in the past.

The pool is using erasure coding 8+2 on nvme ssd's with rocksdb/wal on nvme 
optane disks.

Openstack VM's are running on the other rbd pools.



Thank you in advance for your feedback!



Cheers

Francois



PS:

this is not an option, as it would take about 100 years to complete ;-) :
rados -p ch-zh1-az1.rgw.buckets.data ls | while read i; do rados -p 
ch-zh1-az1.rgw.buckets.data rm "$i"; done






--


EveryWare AG
François Scheurer
Senior Systems Engineer
Zurlindenstrasse 52a
CH-8003 Zürich

tel: +41 44 466 60 00
fax: +41 44 466 60 10
mail: francois.scheu...@everyware.ch
web: http://www.everyware.ch
This e-mail is intended solely for the benefit of the addressee(s) and any 
other named recipient. It is confidential and may contain legally privileged or 
confidential information. If you are not the recipient, any use, distribution, 
disclosure or copying of this e-mail is prohibited. The confidentiality and 
legal privilege attached to this communication is not waived or lost by reason 
of the mistaken transmission or delivery to you. If you have received this 
e-mail in error, please notify us immediately.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD Image can't be formatted - blk_error

2021-01-11 Thread Gaël THEROND
Hi Ilya,

Here is some additional information:
My cluster has three OSD nodes, with each node having 24 4TB SSD
disks.

The mkfs.xfs command fails with the following error:
https://pastebin.com/yTmMUtQs

I'm using the following command to format the image: mkfs.xfs
/dev/rbd//
I'm facing the same problem (and same sectors) if I'm directly targeting
the device with mkfs.xfs /dev/rbb

The client authentication caps are as follows: https://pastebin.com/UuAHRycF

Regarding your questions, yes, it is a persistent issue as soon as I try to
create a large image from a newly created pool.
Yes, after the first attempt, all new attempts fail too.
Yes, it is always the same set of sectors that fails.

The strange thing is, if I use an already existing pool and create this 80Tb
image within that pool, it formats correctly.

Here is the image rbd info output: https://pastebin.com/sAjnmZ4g

Here is the complete kernel logs: https://pastebin.com/SNucPXZW

Thanks a lot for your answer, I hope these logs can help ^^

On Fri, 8 Jan 2021 at 21:23, Ilya Dryomov wrote:

> On Fri, Jan 8, 2021 at 2:19 PM Gaël THEROND wrote:
> >
> > Hi everyone!
> >
> > I'm facing a weird issue with one of my CEPH clusters:
> >
> > OS: CentOS - 8.2.2004 (Core)
> > CEPH: Nautilus 14.2.11 - stable
> > RBD using erasure code profile (K=3; m=2)
> >
> > When I want to format one of my RBD image (client side) I've got the
> > following kernel messages multiple time with different sector IDs:
> >
> >
> > [2417011.790154] blk_update_request: I/O error, dev rbd23, sector
> > 164743869184 op 0x3:(DISCARD) flags 0x4000 phys_seg 1 prio class 0
> > [2417011.791404] rbd: rbd23: discard at objno 20110336 2490368~1703936
> > result -1
> >
> > At first I thought about a faulty disk BUT the monitoring system is not
> > showing anything faulty so I decided to run manual tests on all my OSDs
> to
> > look at disk health using smartctl etc.
> >
> > None of them is marked as not healthy and actually they don't get any
> > counter with faulty sectors/read or writes and the Wear Level is 99%
> >
> > So, the only particularity of this image is it is a 80Tb image, but it
> > shouldn't be an issue as we already have that kind of image size used on
> > another pool.
> >
> > If anyone have a clue at how I could sort this out, I'll be more than
> happy
>
> Hi Gaël,
>
> What command are you running to format the image?
>
> Is it persistent?  After the first formatting attempt fails, do the
> following attempts fail too?
>
> Is it always the same set of sectors?
>
> Could you please attach the output of "rbd info" for that image and the
> entire kernel log from the time that image is mapped?
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io