[ceph-users] Ceph flash deployment

2020-11-02 Thread Seena Fallah
Hi all,

Is this guide still valid for a BlueStore deployment with Nautilus or
Octopus?
https://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitor persistently out-of-quorum

2020-11-02 Thread Ki Wong
Folks,

We’ve finally found the issue: an MTU mismatch on the switch side.
My colleague noticed that “tracepath” from the other monitors
to the problematic one did not return, and I tracked it down
to an MTU mismatch (jumbo vs. not) on the switch end. After
fixing the mismatch, all is back to normal.

It turned out to be quite the head scratcher.
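For anyone hitting something similar, a quick check along these lines can expose
the mismatch (the interface name and monitor address are placeholders):

ip link show dev eth0 | grep mtu       # configured MTU on the local interface
tracepath <problem-mon-ip>             # reports the path MTU and stalls where it breaks
ping -M do -s 8972 <problem-mon-ip>    # non-fragmentable jumbo ping (8972 = 9000 - 28 header bytes)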

Thanks to all who’ve offered assistance.

-kc

> On Oct 29, 2020, at 2:17 AM, Stefan Kooman  wrote:
> 
> On 2020-10-29 01:26, Ki Wong wrote:
>> Hello,
>> 
>> I am at my wit's end.
>> 
>> So I made a mistake in the configuration of my router and one
>> of the monitors (out of 3) dropped out of the quorum and nothing
>> I’ve done allows it to rejoin. That includes reinstalling the
>> monitor with ceph-ansible.
> 
> What Ceph version?
> What kernel version (on the monitors)?
> 
> 
> Just to check some things:
> 
> make sure the mon-keyring on _all_ monitors is equal and permissions are
> correct (ceph can read the file) and read/write to the monstore.
> 
> Have you enabled msgr v1 and v2?
> Do you use DNS to detect the monitors [1]?
> 
> ceph daemon mon.$id mon_status <- what does this give on the
> out of quorum monitor?
> 
> See the troubleshooting documentation [2] for more information.
> 
> Gr. Stefan
> 
> [1]: https://docs.ceph.com/en/latest/rados/configuration/mon-lookup-dns/
> [2]:
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Frank Schilder
> But there can be an on-chip disk controller on the motherboard, I'm not sure.

There is always some kind of controller. Could be on-board. Usually, the cache 
settings are accessible when booting into the BIOS set-up.

> If your worry is fsync persistence

No, what I worry about is the volatile write cache, which is usually enabled by 
default. This cache exists on the disk as well as on the controller. To avoid losing 
writes on power failure, the controller needs to be in write-through mode and the 
disk write cache disabled. The latter can be done with smartctl, the former in 
the BIOS setup.
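A minimal sketch of checking and disabling the drive-side volatile write cache
(the device name is a placeholder; on many drives the setting does not persist
across power cycles, so it may need to be reapplied at boot):

smartctl -g wcache /dev/sdX      # show the current volatile write cache state
smartctl -s wcache,off /dev/sdX  # disable it
hdparm -W0 /dev/sdX              # alternative for SATA drives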

Did you test power failure? If so, how often? On how many hosts simultaneously? 
Pulling network cables will not trigger cache-related problems. The problem 
with write cache is that you rely on a lot of bells and whistles, some of which 
usually fail. With Ceph, this will lead to exactly the problem you are 
observing now.

Your pool configuration looks OK. You need to find out where exactly the scrub 
errors are situated. It looks like metadata damage and you might lose some 
data. Be careful to do only read-only admin operations for now.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 02 November 2020 16:08:58
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

> Hmm, I'm getting a bit confused. Could you also send the output of "ceph osd 
> pool ls detail".

File ceph-osd-pool-ls-detail.txt attached.


> Did you look at the disk/controller cache settings?

I don't have disk controllers on the Ceph machines. The hard disk is directly 
attached to the motherboard via a SATA cable. But there may be an on-chip disk 
controller on the motherboard; I'm not sure.

If your worry is fsync persistence, I have thoroughly tested database fsync 
reliability on Ceph RBD with hundreds of transactions per second while removing 
the network cable, restarting the database machine, etc. during the inserts, 
and I did not lose a single transaction. I simulated this many times and 
persistence on my Ceph cluster was perfect (i.e. not a single loss).


> I think you should start a deep-scrub with "ceph pg deep-scrub 3.b" and 
> record the output of "ceph -w | grep '3\.b'" (note the single quotes).

> The error messages you included in one of your first e-mails are only on 1 
> out of 3 scrub errors (3 lines for 1 error). We need to find all 3 errors.

I ran the "ceph pg deep-scrub 3.b" again; here is the whole output of ceph -w:


2020-11-02 22:33:48.224392 osd.0 [ERR] 3.b shard 2 soid 
3:d577e975:::123675e.:head : candidate had a missing snapset key, 
candidate had a missing info key


2020-11-02 22:33:48.224396 osd.0 [ERR] 3.b soid 
3:d577e975:::123675e.:head : failed to pick suitable object info


2020-11-02 22:35:30.087042 osd.0 [ERR] 3.b deep-scrub 3 errors


Btw, I'm very grateful for your perseverance on this.


Best regards

Sagara

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: read latency

2020-11-02 Thread Tony Liu
Thanks Vladimir for the clarification!

Tony
> -Original Message-
> From: Vladimir Prokofev 
> Sent: Monday, November 2, 2020 3:46 AM
> Cc: ceph-users 
> Subject: [ceph-users] Re: read latency
> 
> With sequential read you get "read ahead" mechanics attached which helps
> a lot.
> So let's say you do 4KB seq reads with fio.
> By default, Ubuntu, for example, has 128KB read ahead size. That means
> when you request that 4KB of data, driver will actually request 128KB.
> When your IO is served, and you request next seq 4KB, they're already in
> VMs memory, so no new read IO is necessary.
> All those 128KB will likely reside on the same OSD, depending on your
> CEPH object size.
> When you'll reach the end of that 128KB of data, and request next - once
> again it will likely reside in the same rbd object as before, assuming
> 4MB object size, so depending on the internal mechanics which I'm not
> really familiar with, that data can be either in the hosts memory, or at
> least in osd node memory, so no real physical IO will be necessary.
> What you're thinking about is the worst case scenario - when that 128KB
> is split between 2 objects residing on 2 different osds - well, you just
> get 2 real physical IO for your 1 virtual, and in that moment you'll
> have slower request, but after that you get read ahead to help for a lot
> of seq IOs.
> In the end, read ahead with sequential IOs leads to way way less real
> physical reads than random read, hence the IOPS difference.
> 
> Mon, 2 Nov 2020 at 06:20, Tony Liu :
> 
> > Another confusing about read vs. random read. My understanding is
> > that, when fio does read, it reads from the test file sequentially.
> > When it does random read, it reads from the test file randomly.
> > That file read inside VM comes down to volume read handed by RBD
> > client who distributes read to PG and eventually to OSD. So a file
> > sequential read inside VM won't be a sequential read on OSD disk.
> > Is that right?
> > Then what difference seq. and rand. read make on OSD disk?
> > Is it rand. read on OSD disk for both cases?
> > Then how to explain the performance difference between seq. and rand.
> > read inside VM? (seq. read IOPS is 20x than rand. read, Ceph is with
> > 21 HDDs on 3 nodes, 7 on each)
> >
> > Thanks!
> > Tony
> > > -Original Message-
> > > From: Vladimir Prokofev 
> > > Sent: Sunday, November 1, 2020 5:58 PM
> > > Cc: ceph-users 
> > > Subject: [ceph-users] Re: read latency
> > >
> > > Not exactly. You can also tune network/software.
> > > Network - go for lower latency interfaces. If you have 10G go to 25G
> > > or 100G. 40G will not do though, afaik they're just 4x10G so their
> > > latency is the same as in 10G.
> > > Software - it's closely tied to your network card queues and
> > > processor cores. In short - tune affinity so that the packet receive
> > > queues and osds processes run on the same corresponding cores.
> > > Disabling process power saving features helps a lot. Also watch out
> for NUMA interference.
> > > But overall all these tricks will save you less than switching from
> > > HDD to SSD.
> > >
> > Mon, 2 Nov 2020 at 02:45, Tony Liu :
> > >
> > > > Hi,
> > > >
> > > > AFAIK, the read latency primarily depends on HW latency, not much
> > > > can be tuned in SW. Is that right?
> > > >
> > > > I ran a fio random read with iodepth 1 within a VM backed by Ceph
> > > > with HDD OSD and here is what I got.
> > > > =
> > > >read: IOPS=282, BW=1130KiB/s (1157kB/s)(33.1MiB/30001msec)
> > > > slat (usec): min=4, max=181, avg=14.04, stdev=10.16
> > > > clat (usec): min=178, max=393831, avg=3521.86, stdev=5771.35
> > > >  lat (usec): min=188, max=393858, avg=3536.38, stdev=5771.51
> > > > = I checked HDD average latency is 2.9 ms. Looks
> > > > like the test result makes perfect sense, isn't it?
> > > >
> > > > If I want to get shorter latency (more IOPS), I will have to go
> > > > for better disk, eg. SSD. Right?
> > > >
> > > >
> > > > Thanks!
> > > > Tony
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send
> > > > an email to ceph-users-le...@ceph.io
> > > >
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > > email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Sagara Wijetunga
 
 > Hmm, I'm getting a bit confused. Could you also send the output of "ceph osd 
 > pool ls detail".

File ceph-osd-pool-ls-detail.txt attached.

> Did you look at the disk/controller cache settings?
I don't have disk controllers on the Ceph machines. The hard disk is directly 
attached to the motherboard via a SATA cable. But there may be an on-chip disk 
controller on the motherboard; I'm not sure.

If your worry is fsync persistence, I have thoroughly tested database fsync 
reliability on Ceph RBD with hundreds of transactions per second while removing 
the network cable, restarting the database machine, etc. during the inserts, 
and I did not lose a single transaction. I simulated this many times and 
persistence on my Ceph cluster was perfect (i.e. not a single loss).

> I think you should start a deep-scrub with "ceph pg deep-scrub 3.b" and 
> record the output of "ceph -w | grep '3\.b'" (note the single quotes).

> The error messages you included in one of your first e-mails are only on 1 
> out of 3 scrub errors (3 lines for 1 error). We need to find all 3 errors.

I ran the "ceph pg deep-scrub 3.b" again; here is the whole output of ceph -w:

2020-11-02 22:33:48.224392 osd.0 [ERR] 3.b shard 2 soid 
3:d577e975:::123675e.:head : candidate had a missing snapset key, 
candidate had a missing info key

2020-11-02 22:33:48.224396 osd.0 [ERR] 3.b soid 
3:d577e975:::123675e.:head : failed to pick suitable object info

2020-11-02 22:35:30.087042 osd.0 [ERR] 3.b deep-scrub 3 errors

Btw, I'm very grateful for your perseverance on this.

Best regards
Sagara

  ceph osd pool ls detail
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 4051 lfor 
0/0/3797 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 
recovery_priority 5 target_size_ratio 0.8 application cephfs

pool 3 'cephfs_data' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 3736 lfor 
0/3266/3582 flags hashpspool stripe_width 0 target_size_ratio 0.8 application 
cephfs

pool 4 'rbd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 4156 lfor 0/4156/4154 flags 
hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.8 application 
rbd
removed_snaps [1~3]
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pgs stuck backfill_toofull

2020-11-02 Thread Joachim Kraftmayer

Stefan, I agree with you.

In Jewel the recovery process is not really throttled by default.

With Luminous and later you benefit from dynamic resharding and the handling 
of too-big OMAPs.


Regards, Joachim

___

Clyso GmbH

Am 29.10.2020 um 21:30 schrieb Stefan Kooman:

On 2020-10-29 06:55, Mark Johnson wrote:

I've been struggling with this one for a few days now.  We had an OSD report as 
near full a few days ago.  Had this happen a couple of times before and a 
reweight-by-utilization has sorted it out in the past.  Tried the same again 
but this time we ended up with a couple of pgs in a state of backfill_toofull 
and a handful of misplaced objects as a result.

Consider upgrading to luminous (and then later nautilus).

Why? There you can use the ceph balancer in upmap mode (at least when your
clients are new enough). No need to do any manual reweighting anymore.

^^ this besides the tips Frank gave you.

Gr. Stefan
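For reference, a minimal sketch of the upmap balancer setup mentioned above
(assuming all clients are already Luminous or newer; verify that before raising
the compat level):

ceph features                                    # confirm no pre-luminous clients remain
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on
ceph balancer status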
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Frank Schilder
Hmm, I'm getting a bit confused. Could you also send the output of "ceph osd 
pool ls detail".

Did you look at the disk/controller cache settings?

I think you should start a deep-scrub with "ceph pg deep-scrub 3.b" and record 
the output of "ceph -w | grep '3\.b'" (note the single quotes).

The error messages you included in one of your first e-mails are only on 1 out 
of 3 scrub errors (3 lines for 1 error). We need to find all 3 errors.
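A complementary, read-only check that may help enumerate all three errors once
the deep-scrub has finished (assuming the rados CLI is available on an admin
node):

rados list-inconsistent-obj 3.b --format=json-pretty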

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 02 November 2020 14:25:08
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Frank


> the primary OSD is probably not listed as a peer. Can you post the complete 
> output of

> - ceph pg 3.b query
> - ceph pg dump
> - ceph osd df tree

> in a pastebin?

Yes, the Primary OSD is 0.

I have attached above as .txt files. Please let me know if you still cannot 
read them.

Regards

Sagara

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Inconsistent Space Usage reporting

2020-11-02 Thread Vikas Rana
Hi Friends,

 

We have some inconsistent storage space usage reporting. We used only 46TB
with a single copy, but the space used on the pool is close to 128TB.

 

Any idea where the extra space is being used and how to reclaim it?

 

Ceph Version : 12.2.11 with XFS OSDs. We are planning to upgrade soon.

 

# ceph df detail

GLOBAL:
    SIZE   AVAIL  RAW USED %RAW USED OBJECTS
    363TiB 131TiB 231TiB   63.83     43.80M

POOLS:
    NAME ID QUOTA OBJECTS QUOTA BYTES USED    %USED MAX AVAIL OBJECTS  DIRTY  READ    WRITE   RAW USED
    fcp  15 N/A           N/A         23.6TiB 42.69 31.7TiB   3053801  3.05M  6.10GiB 12.6GiB 47.3TiB
    nfs  16 N/A           N/A         128TiB  66.91 63.4TiB   33916181 33.92M 3.93GiB 4.73GiB 128TiB

 

 

 

 

# df -h

Filesystem  Size  Used Avail Use% Mounted on

/dev/nbd0   200T   46T  155T  23% /vol/dir_research

 

 

# ceph osd pool get nfs all
size: 1
min_size: 1
crash_replay_interval: 0
pg_num: 128
pgp_num: 128
crush_rule: replicated_ruleset
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0

 

 

Appreciate your help.

 

Thanks,

-Vikas

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Sagara Wijetunga
 Hi Frank

> the primary OSD is probably not listed as a peer. Can you post the complete 
> output of

> - ceph pg 3.b query
> - ceph pg dump
> - ceph osd df tree

> in a pastebin?

Yes, the Primary OSD is 0.
I have attached above as .txt files. Please let me know if you still cannot 
read them.
Regards
Sagara


  # ceph pg 3.b query 
{
"state": "active+clean+inconsistent",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 4850,
"up": [
0,
1,
2
],
"acting": [
0,
1,
2
],
"acting_recovery_backfill": [
"0",
"1",
"2"
],
"info": {
"pgid": "3.b",
"last_update": "4825'2264303",
"last_complete": "4825'2264303",
"log_tail": "4759'2261298",
"last_user_version": 2263481,
"last_backfill": "MAX",
"last_backfill_bitwise": 1,
"purged_snaps": [],
"history": {
"epoch_created": 3582,
"epoch_pool_created": 22,
"last_epoch_started": 4849,
"last_interval_started": 4848,
"last_epoch_clean": 4849,
"last_interval_clean": 4848,
"last_epoch_split": 3582,
"last_epoch_marked_full": 0,
"same_up_since": 4848,
"same_interval_since": 4848,
"same_primary_since": 4844,
"last_scrub": "4825'2264303",
"last_scrub_stamp": "2020-11-01 18:26:33.496258",
"last_deep_scrub": "4825'2264303",
"last_deep_scrub_stamp": "2020-11-01 18:26:33.496258",
"last_clean_scrub_stamp": "2020-10-30 12:30:17.706147"
},
"stats": {
"version": "4825'2264303",
"reported_seq": "2595188",
"reported_epoch": "4850",
"state": "active+clean+inconsistent",
"last_fresh": "2020-11-02 12:39:53.437354",
"last_change": "2020-11-02 12:39:29.021414",
"last_active": "2020-11-02 12:39:53.437354",
"last_peered": "2020-11-02 12:39:53.437354",
"last_clean": "2020-11-02 12:39:53.437354",
"last_became_active": "2020-11-02 12:39:29.021213",
"last_became_peered": "2020-11-02 12:39:29.021213",
"last_unstale": "2020-11-02 12:39:53.437354",
"last_undegraded": "2020-11-02 12:39:53.437354",
"last_fullsized": "2020-11-02 12:39:53.437354",
"mapping_epoch": 4848,
"log_start": "4759'2261298",
"ondisk_log_start": "4759'2261298",
"created": 3582,
"last_epoch_clean": 4849,
"parent": "0.0",
"parent_split_bits": 5,
"last_scrub": "4825'2264303",
"last_scrub_stamp": "2020-11-01 18:26:33.496258",
"last_deep_scrub": "4825'2264303",
"last_deep_scrub_stamp": "2020-11-01 18:26:33.496258",
"last_clean_scrub_stamp": "2020-10-30 12:30:17.706147",
"log_size": 3005,
"ondisk_log_size": 3005,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 9649392528,
"num_objects": 6992,
"num_object_clones": 0,
"num_object_copies": 20976,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 6992,
"num_whiteouts": 0,
"num_read": 27005,
"num_read_kb": 1096308,
"num_write": 46262,
"num_write_kb": 1240514,
"num_scrub_errors": 3,
"num_shallow_scrub_errors": 3,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 23,
"num_bytes_recovered": 189943,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0,
"num_omap_bytes": 0,
"num_omap_keys": 

[ceph-users] v14.2.13 Nautilus released

2020-11-02 Thread Abhishek Lekshmanan


This is the 13th backport release in the Nautilus series. This release fixes a
regression introduced in v14.2.12 and includes a few ceph-volume & RGW fixes. We
recommend users update to this release.

Notable Changes
---

* Fixed a regression that caused breakage in clusters that referred to ceph-mon
  hosts using dns names instead of ip addresses in the `mon_host` param in
  `ceph.conf` (issue#47951)
* ceph-volume: the ``lvm batch`` subcommand received a major rewrite

Changelog
-
* ceph-volume: major batch refactor (pr#37522, Jan Fajerski)
* mgr/dashboard: Proper format iSCSI target portals (pr#37060, Volker Theile)
* rpm: move python-enum34 into rhel 7 conditional (pr#37747, Nathan Cutler)
* mon/MonMap: fix unconditional failure for init_with_hosts (pr#37816, Nathan 
Cutler, Patrick Donnelly)
* rgw: allow rgw-orphan-list to note when rados objects are in namespace 
(pr#37799, J. Eric Ivancich)
* rgw: fix setting of namespace in ordered and unordered bucket listing 
(pr#37798, J. Eric Ivancich)

--
Abhishek 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Frank Schilder
Hi Sagara,

the primary OSD is probably not listed as a peer. Can you post the complete 
output of

- ceph pg 3.b query
- ceph pg dump
- ceph osd df tree

in a pastebin?

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 02 November 2020 11:53:58
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Frank


> Please note, there is no peer 0 in "ceph pg 3.b query". Also no word osd.


I checked other PGs with "active+clean", there is a "peer": "0".


But "ceph pg pgid query" always shows only two peers, sometime peer 0 and 1, or 
1 and 2, 0 and 2, etc.


Regards


Sagara

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Seriously degraded performance after update to Octopus

2020-11-02 Thread Vladimir Prokofev
Just shooting in the dark here, but you may be affected by similar issue I
had a while back, it was discussed here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ZOPBOY6XQOYOV6CQMY27XM37OC6DKWZ7/

In short - they've changed setting bluefs_buffered_io to false in the
recent Nautilus release. I guess the same was applied to newer releases.
That led to severe performance issues and similar symptoms, i.e. lower
memory usage on OSD nodes. Worth checking out.
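A quick way to inspect and flip that option (a sketch; check the default for
your exact release first, and note that on some versions the OSDs need a
restart for the change to take effect):

ceph config get osd bluefs_buffered_io           # what the cluster config database says
ceph daemon osd.0 config get bluefs_buffered_io  # what a running (hypothetical) osd.0 actually uses
ceph config set osd bluefs_buffered_io true      # re-enable buffered IO for all OSDs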

Of course, it may be something completely different. You should look into
monitoring all your OSDs separately, checking their utilization, await, and
other parameters, at the same time comparing them to pre-upgrade values, to
find the root cause.

Mon, 2 Nov 2020 at 11:55, Marc Roos :

>
> I have been advocating for a long time for publishing test data from some
> basic test cluster against different ceph releases. Just a basic ceph
> cluster that covers most configs and run the same tests, so you can
> compare just ceph performance. That would mean a lot for smaller
> companies that do not have access to a good test environment. I have
> asked also about this at some ceph seminar.
>
>
>
> -Original Message-
> From: Martin Rasmus Lundquist Hansen [mailto:han...@imada.sdu.dk]
> Sent: Monday, November 02, 2020 7:53 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Seriously degraded performance after update to
> Octopus
>
> Two weeks ago we updated our Ceph cluster from Nautilus (14.2.0) to
> Octopus (15.2.5), an update that was long overdue. We used the Ansible
> playbooks to perform a rolling update and except from a few minor
> problems with the Ansible code, the update went well. The Ansible
> playbooks were also used for setting up the cluster in the first place.
> Before updating the Ceph software we also performed a full update of
> CentOS and the Linux kernel (this part of the update had already been
> tested on one of the OSD nodes the week before and we didn't notice any
> problems).
>
> However, after the update we are seeing a serious decrease in
> performance, more than a factor of 10x in some cases. I spent a week
> trying to come up with an explanation or solution, but I am completely
> blank. Independently of Ceph I tested the network performance and the
> performance of the OSD disks, and I am not really seeing any problems
> here.
>
> The specifications of the cluster is:
> - 3x Monitor nodes running mgr+mon+mds (Intel(R) Xeon(R) Silver 4108 CPU
> @ 1.80GHz, 16 cores, 196 GB RAM)
> - 14x OSD nodes, each with 18 HDDs and 1 NVME (Intel(R) Xeon(R) Gold
> 6126 CPU @ 2.60GHz, 24 cores, 384 GB RAM)
> - CentOS 7.8 and Kernel 5.4.51
> - 100 Gbps Infiniband
>
> We are collecting various metrics using Prometheus, and on the OSD nodes
> we are seeing some clear differences when it comes to CPU and Memory
> usage. I collected some graphs here: http://mitsted.dk/ceph . After the
> update the system load is highly reduced, there is almost no longer any
> iowait for the CPU, and the free memory is no longer used for Buffers (I
> can confirm that the changes in these metrics are not due to the update
> of CentOS or the Linux kernel). All in all, now the OSD nodes are almost
> completely idle all the time (and so are the monitors). On the linked
> page I also attached two RADOS benchmarks. The first benchmark was
> performed when the cluster was initially configured, and the second is
> the same benchmark after the update to Octopus. When comparing these
> two, it is clear that the performance has changed dramatically. For
> example, in the write test the bandwidth is reduced from 320 MB/s to 21
> MB/s and the number of IOPS has also dropped significantly.
>
> I temporarily tried to disable the firewall and SELinux on all nodes to
> see if it made any difference, but it didn't look like it (I did not
> restart any services during this test, I am not sure if that could be
> necessary).
>
> Any suggestions for finding the root cause of this performance decrease
> would be greatly appreciated.
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Sagara Wijetunga
 Hi Frank

> Please note, there is no peer 0 in "ceph pg 3.b query". Also no word osd.

I checked other PGs with "active+clean", there is a "peer": "0".  

But "ceph pg pgid query" always shows only two peers, sometime peer 0 and 1, or 
1 and 2, 0 and 2, etc.

Regards

Sagara


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Sagara Wijetunga
Hi Frank

> Please note, there is no peer 0 in "ceph pg 3.b query". Also no word osd.

I checked other PGs with active+clean, there is a "peer": "0".

But "ceph pg pgid query" always shows only two peers, sometimes peers 0 and 1, or 
1 and 2, or 0 and 2, etc.

Regards
Sagara



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs cannot write

2020-11-02 Thread Eugen Block

Hi,

I'm not sure if the ceph-volume error is related to the "operation not  
permitted" error. Have you checked the auth settings for your cephfs  
client? Or did you mount it as admin user?
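If it was mounted with a non-admin key, a sketch of what to compare against
(the client name "client.foo" is hypothetical):

ceph auth get client.foo        # inspect the caps of the key used for the mount
# a typical read/write CephFS client normally carries caps along these lines:
#   caps mds = "allow rw"
#   caps mon = "allow r"
#   caps osd = "allow rw tag cephfs data=cephfs"
# for a fresh client, such caps can be generated with:
ceph fs authorize cephfs client.foo / rw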



Quoting Patrick :


Hi all,


My ceph cluster is HEALTH_OK, but I cannot write to cephfs.
OS: Ubuntu 20.04, ceph version 15.2.5, deploy with cephadm.


root@RK01-OSD-A001:~# ceph -s
  cluster:
    id:     9091b472-1bdb-11eb-b217-abff3468259e
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum RK01-OSD-A001,RK02-OSD-A002,RK03-OSD-A003 (age 18s)
    mgr: RK01-OSD-A001.jwrjgj(active, since 51m), standbys: RK03-OSD-A003.tulrii
    mds: cephfs:1 {0=cephfs.RK02-OSD-A002.lwpgaw=up:active} 1 up:standby
    osd: 6 osds: 6 up (since 44m), 6 in (since 44m)

  task status:
    scrub status:
        mds.cephfs.RK02-OSD-A002.lwpgaw: idle

  data:
    pools:   3 pools, 65 pgs
    objects: 24 objects, 67 KiB
    usage:   6.0 GiB used, 44 TiB / 44 TiB avail
    pgs:     65 active+clean


root@RK01-OSD-A001:~# ceph fs status
cephfs - 1 clients
==
RANK  STATE            MDS                          ACTIVITY      DNS    INOS
 0    active  cephfs.RK02-OSD-A002.lwpgaw   Reqs:    0 /s    13     15

       POOL           TYPE     USED  AVAIL
cephfs.cephfs.meta  metadata  1152k  20.7T
cephfs.cephfs.data    data        0  20.7T

        STANDBY MDS
cephfs.RK03-OSD-A003.xchwqj

MDS version: ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)




root@RK05-FRP-A001:~# df -h|grep "ceph-test"
172.16.65.1,172.16.65.2,172.16.65.3:6789:/   21T     0   21T   0% /ceph-test

root@RK05-FRP-A001:~# echo 123 > /ceph-test/1.txt
-bash: echo: write error: Operation not permitted
root@RK05-FRP-A001:~# ls -l /ceph-test/1.txt
-rw-r--r-- 1 root root 0 Nov  1 09:40 /ceph-test/1.txt
root@RK05-FRP-A001:~# ls -ld /ceph-test/
drwxr-xr-x 2 root root 1 Nov  1 09:40 /ceph-test/


root@RK01-OSD-A001:~# cd /var/log/ceph/`ceph fsid`
root@RK01-OSD-A001:/var/log/ceph/9091b472-1bdb-11eb-b217-abff3468259e# cat ceph-volume.log | grep err | grep sdx
[2020-11-01 08:53:51,384][ceph_volume.process][INFO  ] stderr Failed to find physical volume "/dev/sdx".
[2020-11-01 08:53:51,417][ceph_volume.process][INFO  ] stderr unable to read label for /dev/sdx: (2) No such file or directory
[2020-11-01 08:53:51,445][ceph_volume.process][INFO  ] stderr unable to read label for /dev/sdx: (2) No such file or directory



root@RK01-OSD-A001:~# pvs|grep sdx
  /dev/sdx   ceph-41b09a52-e44b-43c5-ad86-0eada11b48b6 lvm2 a--  <7.28t    0

root@RK01-OSD-A001:~# lsblk|grep sdx
sdx    65:112  0  7.3T  0 disk

root@RK01-OSD-A001:~# parted -s /dev/sdx print
Error: /dev/sdx: unrecognised disk label
Model: LSI MR9261-8i (scsi)
Disk /dev/sdx: 8001GB
Sector size (logical/physical): 512B/4096B
Partition Table: unknown
Disk Flags: 
root@RK01-OSD-A001:~# 



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW seems to not clean up after some requests

2020-11-02 Thread Denis Krienbühl
Hi Abhishek

> On 2 Nov 2020, at 14:54, Abhishek Lekshmanan  wrote:
> 
> There isn't much in terms of code changes in the scheduler from
> v15.2.4->5. Does the perf dump (`ceph daemon <rgw-socket> perf dump`) on the
> RGW socket show any throttle counts?

I know, I was wondering if this somehow might have an influence, but I’m likely 
wrong:
https://github.com/ceph/ceph/commit/c43f71056322e1a149a444735bf65d80fec7a7ae 


As for the perf counters, I don’t see anything interesting. I dumped the 
current state, but I don’t know how interesting this is:
https://gist.github.com/href/a42c30e001789f005e9aa748f6f858fc 


At the moment we don’t see any errors, but I do already count 135 incomplete 
requests in the current log (out of 3 Million).

This number is typical for most days, where we’ll see something like 150 such 
requests. Our working theory is that out of the 1024 maximum outstanding 
requests of the throttler, ~150 get lost every day to those incomplete 
requests, until our need for up to 400 requests per instance can no longer be 
met (first a few will be over the watermark, then more, then all).
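If that theory holds, the arithmetic lines up roughly: with ~150 slots leaked per 
day out of the 1024 budget, the headroom above the ~400 concurrent requests we 
need would be gone after about (1024 - 400) / 150 ≈ 4 days, at which point 
scheduling failures become the norm.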

For those incomplete requests we know that the following line is executed, 
producing “starting new request”:
https://github.com/ceph/ceph/blob/8f393c0fc1886a369d213d5e5791c10cb1591828/src/rgw/rgw_process.cc#L187
 


However, it never reaches “req done” in the same function:
https://github.com/ceph/ceph/blob/master/src/rgw/rgw_process.cc#L350 


That entry, and the “beast” entry is missing for those few requests.

Cheers, Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Restart Error: osd.47 already exists in network host

2020-11-02 Thread Ml Ml
Hello Eugen,

cephadm ls for OSD.41:

   {
"style": "cephadm:v1",
"name": "osd.41",
"fsid": "5436dd5d-83d4-4dc8-a93b-60ab5db145df",
"systemd_unit": "ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41",
"enabled": true,
"state": "error",
"container_id": null,
"container_image_name": "docker.io/ceph/ceph:v15.2.5",
"container_image_id": null,
"version": null,
"started": null,
"created": "2020-07-28T12:42:17.292765",
"deployed": "2020-10-21T11:29:36.284462",
"configured": "2020-10-21T11:29:47.032038"
},



root@ceph06:~# systemctl start
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service
Job for ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service
failed because the control process exited with error code.
See "systemctl status
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service" and
"journalctl -xe" for details.

● ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service - Ceph
osd.41 for 5436dd5d-83d4-4dc8-a93b-60ab5db145df
   Loaded: loaded
(/etc/systemd/system/ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@.service;
enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2020-11-02 10:56:50
CET; 9min ago
  Process: 430022 ExecStartPre=/usr/bin/docker rm
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.41 (code=exited,
status=1/FAILURE)
  Process: 430040 ExecStart=/bin/bash
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/osd.41/unit.run
(code=exited, status=125)
  Process: 430159 ExecStopPost=/bin/bash
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/osd.41/unit.poststop
(code=exited, status=0/SUCCESS)
 Main PID: 430040 (code=exited, status=125)
Tasks: 51 (limit: 9830)
   Memory: 31.0M
   CGroup: 
/system.slice/system-ceph\x2d5436dd5d\x2d83d4\x2d4dc8\x2da93b\x2d60ab5db145df.slice/ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service
   ├─224974 /bin/bash
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/osd.41/unit.run
   └─225079 /usr/bin/docker run --rm --net=host --ipc=host
--privileged --group-add=disk --name
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.41 -e
CONTAINER_IMAGE=docker.io/ceph/ceph:v15.2.5 -e NODE_NAME..

Nov 02 10:56:50 ceph06 systemd[1]: Failed to start Ceph osd.41 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.
Nov 02 11:01:21 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Start
request repeated too quickly.
Nov 02 11:01:21 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Failed with
result 'exit-code'.
Nov 02 11:01:21 ceph06 systemd[1]: Failed to start Ceph osd.41 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.
Nov 02 11:01:49 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Start
request repeated too quickly.
Nov 02 11:01:49 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Failed with
result 'exit-code'.
Nov 02 11:01:49 ceph06 systemd[1]: Failed to start Ceph osd.41 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.
Nov 02 11:05:34 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Start
request repeated too quickly.
Nov 02 11:05:34 ceph06 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service: Failed with
result 'exit-code'.
Nov 02 11:05:34 ceph06 systemd[1]: Failed to start Ceph osd.41 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.



If I run it manually, I get:
root@ceph06:~#  /usr/bin/docker run --rm --net=host --ipc=host
--privileged --group-add=disk --name
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.41 -e
CONTAINER_IMAGE=docker.io/ceph/ceph:v15.2.5 -e NODE_NAME=ceph06 -v
/var/run/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df:/var/run/ceph:z -v
/var/log/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df:/var/log/ceph:z -v
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/crash:/var/lib/ceph/crash:z
-v 
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/osd.41:/var/lib/ceph/osd/ceph-41:z
-v 
/var/lib/ceph/5436dd5d-83d4-4dc8-a93b-60ab5db145df/osd.41/config:/etc/ceph/ceph.conf:z
-v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm
-v /run/lock/lvm:/run/lock/lvm --entrypoint /usr/bin/ceph-osd
docker.io/ceph/ceph:v15.2.5 -n osd.41 -f --setuser ceph --setgroup
ceph --default-log-to-file=false --default-log-to-stderr=true
--default-log-stderr-prefix=debug
/usr/bin/docker: Error response from daemon: endpoint with name
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.41 already exists in
network host.


Can you see a ContainerID Error here?

cluster id is: 5436dd5d-83d4-4dc8-a93b-60ab5db145df
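For what it's worth, a stale endpoint like the one in the error above can
sometimes be cleared without rebooting the host; a minimal sketch (the endpoint
and unit names are copied from the error output above, adjust as needed):

docker network disconnect -f host ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.41
systemctl reset-failed ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service
systemctl start ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.41.service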

On Mon, Nov 2, 2020 at 10:03 AM Eugen Block  wrote:
>
> Hi,
>
> are you sure it's the right container ID you're using for the restart?
> I noticed that 'cephadm ls' shows older containers after a daemon had
> to be recreated (a MGR in my case). Maybe you're trying to restart a
> daemon that was already removed?
>
> Regards,
> Eugen
>
>
> Quoting Ml Ml :
>
> > Hello List,
> > sometimes some OS

[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Sagara Wijetunga
 Hi Frank
   
> looks like you have one on a new and 2 on an old version. Can you add the 
> information about which OSD each version resides?

The "ceph pg 3.b query" shows following:

    "peer_info": [
        {
            "peer": "1",
            "pgid": "3.b",
            "last_update": "4825'2264303",
            "last_complete": "4825'2264303",
            "log_tail": "4759'2261298",
            "last_user_version": 2263481,
:
:
            "stats": {
                "version": "4825'2264301",
 }
},
        {
            "peer": "2",
            "pgid": "3.b",
            "last_update": "4825'2264303",
            "last_complete": "4825'2264303",
            "log_tail": "4759'2261298",
            "last_user_version": 2263481,
:
:
            "stats": {
                "version": "4825'2264301",
    }

}
Please note, there is no peer 0 in "ceph pg 3.b query". Also no word osd.

Does "peer": "1" mean osd.1?

I have osd.0, osd.1 and osd.2.

Note: versions "4825'2264303" and "4825'2264301" appear in both peers 1 and 2 
above.
Thanks.
Sagara


  
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fix PGs states

2020-11-02 Thread Ing . Luis Felipe Domínguez Vega
Of course, yes yes hehe. The thing is that my housing provider has 
problems with the dark fiber that connects the DCs, so I prefer to use 
only 1 DC and replicated PGs.


On 2020-11-02 03:13, Eugen Block wrote:

There's nothing wrong with EC pools or multiple datacenters, you just
need the right configuration to cover the specific requirements ;-)


Zitat von "Ing. Luis Felipe Domínguez Vega" :

Yes, thanks to all. The decision was to remove everything and start from  
scratch: not use EC pools, use only replicated pools, and not distribute over DCs.

On 2020-10-31 14:08, Eugen Block wrote:

To me it looks like a snapshot is not found which seems plausible
because you already encountered missing rbd chunks. Since you said
it's just a test cluster the easiest way would probably be to delete
the affected pools and recreate them when the cluster is healthy
again. With the current situation it's almost impossible to say which
rbd images will be corrupted and which can be rescued. Is that an
option to delete the pools?


Zitat von "Ing. Luis Felipe Domínguez Vega" 
:



https://pastebin.ubuntu.com/p/tHSpzWp8Cx/

On 2020-10-30 11:47, dhils...@performair.com wrote:

This line is telling:
   1 osds down
This is likely the cause of everything else.

Why is one of your OSDs down?

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International, Inc.
dhils...@performair.com
www.PerformAir.com



-Original Message-
From: Ing. Luis Felipe Domínguez Vega 
[mailto:luis.doming...@desoft.cu]

Sent: Thursday, October 29, 2020 7:46 PM
To: Ceph Users
Subject: [ceph-users] Fix PGs states

Hi:

I have this ceph status:
-
cluster:
   id: 039bf268-b5a6-11e9-bbb7-d06726ca4a78
   health: HEALTH_WARN
   noout flag(s) set
   1 osds down
   Reduced data availability: 191 pgs inactive, 2 pgs down, 
35

pgs incomplete, 290 pgs stale
   5 pgs not deep-scrubbed in time
   7 pgs not scrubbed in time
   327 slow ops, oldest one blocked for 233398 sec, daemons
[osd.12,osd.36,osd.5] have slow ops.

 services:
   mon: 1 daemons, quorum fond-beagle (age 23h)
   mgr: fond-beagle(active, since 7h)
   osd: 48 osds: 45 up (since 95s), 46 in (since 8h); 4 remapped 
pgs

flags noout

 data:
   pools:   7 pools, 2305 pgs
   objects: 350.37k objects, 1.5 TiB
   usage:   3.0 TiB used, 38 TiB / 41 TiB avail
   pgs: 6.681% pgs unknown
1.605% pgs not active
1835 active+clean
279  stale+active+clean
154  unknown
22   incomplete
10   stale+incomplete
2down
2remapped+incomplete
1stale+remapped+incomplete


How can I fix all of the unknown, incomplete, remapped+incomplete, 
etc. PGs? I don't care if I need to remove PGs.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW seems to not clean up after some requests

2020-11-02 Thread Abhishek Lekshmanan
Denis Krienbühl  writes:

> Hi everyone
>
> We have faced some RGW outages recently, with the RGW returning HTTP 503. 
> First for a few, then for most, then all requests - in the course of 1-2 
> hours. This seems to have started since we have updated from 15.2.4 to 15.2.5.
>
> The line that accompanies these outages in the log is the following:
>
>   s3:list_bucket Scheduling request failed with -2218
There isn't much in terms of code changes in the scheduler from
v15.2.4->5. Does the perf dump (`ceph daemon <rgw-socket> perf dump`) on the
RGW socket show any throttle counts?

>
> It first pops up a few times here and there, until it eventually applies to 
> all requests. It seems to indicate that the throttler has reached the limit 
> of open connections.
>
> As we run a pair of HAProxy instances in front of RGW, which limit the number 
> of connections to the two RGW instances to 400, this limit should never be 
> reached. We do use RGW metadata sync between the instances, which could 
> account for some extra connections, but if I look at open TCP connections 
> between the instances I can count no more than 20 at any given time.
>
> I also noticed that some connections in the RGW log seem to never complete. 
> That is, I can find a ‘starting new request’ line, but no associated ‘req 
> done’ or ‘beast’ line.
>
> I don’t think there are any hung connections around, as they are killed by 
> HAProxy after a short timeout.
>
> Looking at the code, it seems as if the throttler in use (SimpleThrottler), 
> eventually reaches the maximum count of 1024 connections 
> (outstanding_requests), and never recovers. I believe that the 
> request_complete function is not called in all cases, but I am not familiar 
> with the Ceph codebase, so I am not sure.
>
> See 
> https://github.com/ceph/ceph/blob/cc17681b478594aa39dd80437256a54e388432f0/src/rgw/rgw_dmclock_async_scheduler.h#L166-L214
>  
> 
>
> Does anyone see the same phenomenon? Could this be a bug in the request 
> handling of RGW, or am I wrong in my assumptions?
>
> For now we’re just restarting our RGWs regularly, which seems to keep the 
> problem at bay.
>
> Thanks for any hints.
>
> Denis
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
Abhishek 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW seems to not clean up after some requests

2020-11-02 Thread Denis Krienbühl
Hi everyone

We have faced some RGW outages recently, with the RGW returning HTTP 503. First 
for a few, then for most, then all requests - in the course of 1-2 hours. This 
seems to have started since we have updated from 15.2.4 to 15.2.5.

The line that accompanies these outages in the log is the following:

s3:list_bucket Scheduling request failed with -2218

It first pops up a few times here and there, until it eventually applies to all 
requests. It seems to indicate that the throttler has reached the limit of open 
connections.

As we run a pair of HAProxy instances in front of RGW, which limit the number 
of connections to the two RGW instances to 400, this limit should never be 
reached. We do use RGW metadata sync between the instances, which could account 
for some extra connections, but if I look at open TCP connections between the 
instances I can count no more than 20 at any given time.

I also noticed that some connections in the RGW log seem to never complete. 
That is, I can find a ‘starting new request’ line, but no associated ‘req done’ 
or ‘beast’ line.

I don’t think there are any hung connections around, as they are killed by 
HAProxy after a short timeout.

Looking at the code, it seems as if the throttler in use (SimpleThrottler), 
eventually reaches the maximum count of 1024 connections 
(outstanding_requests), and never recovers. I believe that the request_complete 
function is not called in all cases, but I am not familiar with the Ceph 
codebase, so I am not sure.

See 
https://github.com/ceph/ceph/blob/cc17681b478594aa39dd80437256a54e388432f0/src/rgw/rgw_dmclock_async_scheduler.h#L166-L214
 


Does anyone see the same phenomenon? Could this be a bug in the request 
handling of RGW, or am I wrong in my assumptions?

For now we’re just restarting our RGWs regularly, which seems to keep the 
problem at bay.

Thanks for any hints.

Denis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Frank Schilder
Hi Sagara,

looks like you have one copy on a new version and two on an old version. Can you 
add the information about which OSD each version resides on?

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 02 November 2020 10:10:02
To: ceph-users@ceph.io; Frank Schilder
Subject: Re: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Frank


> I'm not sure if my hypothesis can be correct. Ceph sends an acknowledge of a 
> write only after all copies are on disk. In other words, if PGs end up on 
> different versions after a power outage, one always needs to roll back. Since 
> you have two healthy OSDs in the PG and the PG is active (successfully 
> peered), it might just be a broken disk and read/write errors. I would focus 
> on that.

I tried to revert the PG as follows:

# ceph pg 3.b query | grep version
"last_user_version": 2263481,
"version": "4825'2264303",

"last_user_version": 2263481,
"version": "4825'2264301",

"last_user_version": 2263481,
"version": "4825'2264301",


ceph pg 3.b list_unfound

{
"num_missing": 0,
"num_unfound": 0,
"objects": [],
"more": false
}


# ceph pg 3.b mark_unfound_lost revert
pg has no unfound objects


# ceph pg 3.b revert
Invalid command: revert not in query
pg  query :  show details of a specific pg
Error EINVAL: invalid command


How to revert/rollback a PG?


> Another question, do you have write caches enabled (disk cache and controller 
> cache)? This is known to cause problems on power outages and also degraded 
> performance with ceph. You should check and disable any caches if necessary.

No. HDD is directly connected to motherboard.

Thank you

Sagara

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Sagara Wijetunga
 Hi Frank

> I'm not sure if my hypothesis can be correct. Ceph sends an acknowledge of a 
> write only after all copies are on disk. In other words, if PGs end up on 
> different versions after a power outage, one always needs to roll back. Since 
> you have two healthy OSDs in the PG and the PG is active (successfully 
> peered), it might just be a broken disk and read/write errors. I would focus 
> on that.

I tried to revert the PG as follows:
# ceph pg 3.b query | grep version
        "last_user_version": 2263481,
        "version": "4825'2264303",
        "last_user_version": 2263481,
        "version": "4825'2264301",
        "last_user_version": 2263481,
        "version": "4825'2264301",

# ceph pg 3.b list_unfound
{
    "num_missing": 0,
    "num_unfound": 0,
    "objects": [],
    "more": false
}

# ceph pg 3.b mark_unfound_lost revert
pg has no unfound objects

# ceph pg 3.b revert
Invalid command: revert not in query
pg  query :  show details of a specific pg
Error EINVAL: invalid command

How to revert/rollback a PG?

> Another question, do you have write caches enabled (disk cache and controller 
> cache)? This is known to cause problems on power outages and also degraded 
> performance with ceph. You should check and disable any caches if necessary.

No. HDD is directly connected to motherboard.
Thank you
Sagara

  
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Intel SSD firmware guys contacts, if any

2020-11-02 Thread vitalif
Hi!

I have an interesting question regarding SSDs and I'll try to ask about it here.

During my testing of Ceph & Vitastor & Linstor on servers equipped with Intel 
D3-4510 SSDs I discovered a very funny problem with these SSDs:

They don't like overwrites of the same sector.

That is, if you overwrite the same sector over and over again you get very low 
iops:

$ fio -direct=1 -rw=write -bs=4k -size=4k -loops=10 -iodepth=1
  write: IOPS=3142, BW=12.3MiB/s (12.9MB/s)(97.9MiB/7977msec)

And if you overwrite at least ~128k of other sectors between overwriting the 
same sector you get normal results:

$ fio -direct=1 -rw=write -bs=4k -size=128k -loops=10 -iodepth=1
  write: IOPS=20.8k, BW=81.4MiB/s (85.3MB/s)(543MiB/6675msec)

This slowdown almost doesn't hurt Ceph, slightly hurts Vitastor (the impact was 
greater before I added a fix), and MASSIVELY hurts Linstor/DRBD9 because of its 
"bitmap".

By now I've only seen it on this particular model of SSD. For example, Intel 
P4500, Micron 9300 Pro, Samsung PM983 don't have this issue.

Do you have any contacts of Intel SSD firmware guys to ask them about this 
bug-o-feature? :-)

-- 
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: read latency

2020-11-02 Thread Vladimir Prokofev
With sequential read you get "read ahead" mechanics attached which helps a
lot.
So let's say you do 4KB seq reads with fio.
By default, Ubuntu, for example, has a 128KB read-ahead size. That means when
you request those 4KB of data, the driver will actually request 128KB. When your
IO is served and you request the next sequential 4KB, it is already in the VM's
memory, so no new read IO is necessary.
All those 128KB will likely reside on the same OSD, depending on your CEPH
object size.
When you reach the end of those 128KB of data and request the next chunk, once
again it will likely reside in the same rbd object as before, assuming a 4MB
object size, so depending on the internal mechanics which I'm not really
familiar with, that data can be either in the host's memory, or at least in the
osd node's memory, so no real physical IO will be necessary.
What you're thinking about is the worst case scenario - when those 128KB are
split between 2 objects residing on 2 different osds - well, you just get 2
real physical IOs for your 1 virtual IO, and in that moment you'll have a slower
request, but after that read ahead helps again for a lot of seq IOs.
In the end, read ahead with sequential IOs leads to far fewer real
physical reads than random reads, hence the IOPS difference.
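For reference, a quick way to inspect and tune the read-ahead size inside the
VM (the device name /dev/vda is a placeholder for the VM's virtual disk):

cat /sys/block/vda/queue/read_ahead_kb   # current read-ahead in KB
blockdev --getra /dev/vda                # the same value, in 512-byte sectors
blockdev --setra 8192 /dev/vda           # e.g. raise it to 4MB to match the rbd object size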

Mon, 2 Nov 2020 at 06:20, Tony Liu :

> Another confusing about read vs. random read. My understanding is
> that, when fio does read, it reads from the test file sequentially.
> When it does random read, it reads from the test file randomly.
> That file read inside VM comes down to volume read handed by RBD
> client who distributes read to PG and eventually to OSD. So a file
> sequential read inside VM won't be a sequential read on OSD disk.
> Is that right?
> Then what difference seq. and rand. read make on OSD disk?
> Is it rand. read on OSD disk for both cases?
> Then how to explain the performance difference between seq. and rand.
> read inside VM? (seq. read IOPS is 20x than rand. read, Ceph is
> with 21 HDDs on 3 nodes, 7 on each)
>
> Thanks!
> Tony
> > -Original Message-
> > From: Vladimir Prokofev 
> > Sent: Sunday, November 1, 2020 5:58 PM
> > Cc: ceph-users 
> > Subject: [ceph-users] Re: read latency
> >
> > Not exactly. You can also tune network/software.
> > Network - go for lower latency interfaces. If you have 10G go to 25G or
> > 100G. 40G will not do though, afaik they're just 4x10G so their latency
> > is the same as in 10G.
> > Software - it's closely tied to your network card queues and processor
> > cores. In short - tune affinity so that the packet receive queues and
> > osds processes run on the same corresponding cores. Disabling process
> > power saving features helps a lot. Also watch out for NUMA interference.
> > But overall all these tricks will save you less than switching from HDD
> > to SSD.
> >
> > Mon, 2 Nov 2020 at 02:45, Tony Liu :
> >
> > > Hi,
> > >
> > > AFAIK, the read latency primarily depends on HW latency, not much can
> > > be tuned in SW. Is that right?
> > >
> > > I ran a fio random read with iodepth 1 within a VM backed by Ceph with
> > > HDD OSD and here is what I got.
> > > =
> > >read: IOPS=282, BW=1130KiB/s (1157kB/s)(33.1MiB/30001msec)
> > > slat (usec): min=4, max=181, avg=14.04, stdev=10.16
> > > clat (usec): min=178, max=393831, avg=3521.86, stdev=5771.35
> > >  lat (usec): min=188, max=393858, avg=3536.38, stdev=5771.51
> > > = I checked HDD average latency is 2.9 ms. Looks like
> > > the test result makes perfect sense, isn't it?
> > >
> > > If I want to get shorter latency (more IOPS), I will have to go for
> > > better disk, eg. SSD. Right?
> > >
> > >
> > > Thanks!
> > > Tony
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > > email to ceph-users-le...@ceph.io
> > >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Restart Error: osd.47 already exists in network host

2020-11-02 Thread Ml Ml
Hello List,
sometimes some OSDs get taken out for some reason (I am still looking
for the reason, and I guess it's due to some overload). However, when I
try to restart them I get:

Nov 02 08:05:26 ceph05 bash[9811]: Error: No such container:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.47
Nov 02 08:05:29 ceph05 bash[9811]: /usr/bin/docker: Error response
from daemon: endpoint with name
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.47 already exists in
network host.
Nov 02 08:05:29 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Main process
exited, code=exited, status=125/n/a
Nov 02 08:05:34 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Failed with
result 'exit-code'.
Nov 02 08:05:44 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Service
RestartSec=10s expired, scheduling restart.
Nov 02 08:05:44 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Scheduled
restart job, restart counter is at 5.
Nov 02 08:05:44 ceph05 systemd[1]: Stopped Ceph osd.47 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.
Nov 02 08:05:44 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Start
request repeated too quickly.
Nov 02 08:05:44 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Failed with
result 'exit-code'.
Nov 02 08:05:44 ceph05 systemd[1]: Failed to start Ceph osd.47 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.

I need to reboot the full host to get the OSD back in again. As far as I
can see this is some docker problem?

root@ceph05:~# docker ps | grep osd.47 => no hit
root@ceph05:~# docker network prune => does not solve the problem
Any hint on that?
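
The only other idea I have so far (untested) is that docker might still hold a stale endpoint on the host network; something like this, with the name taken from the error message, might show and clear it:

# is there still an endpoint with that name on the host network?
docker network inspect host | grep osd.47
# if so, force-remove the stale endpoint and retry the unit
docker network disconnect -f host ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.47
systemctl restart ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service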

Thanks,
Michael
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Seriously degraded performance after update to Octopus

2020-11-02 Thread Martin Rasmus Lundquist Hansen
Two weeks ago we updated our Ceph cluster from Nautilus (14.2.0) to Octopus 
(15.2.5), an update that was long overdue. We used the Ansible playbooks to 
perform a rolling update and, except for a few minor problems with the Ansible 
code, the update went well. The Ansible playbooks were also used for setting up 
the cluster in the first place. Before updating the Ceph software we also 
performed a full update of CentOS and the Linux kernel (this part of the update 
had already been tested on one of the OSD nodes the week before and we didn't 
notice any problems).

However, after the update we are seeing a serious decrease in performance, more 
than a factor of 10x in some cases. I spent a week trying to come up with an 
explanation or solution, but I am completely blank. Independently of Ceph I 
tested the network performance and the performance of the OSD disks, and I am 
not really seeing any problems here.

The specifications of the cluster is:
- 3x Monitor nodes running mgr+mon+mds (Intel(R) Xeon(R) Silver 4108 CPU @ 
1.80GHz, 16 cores, 196 GB RAM)
- 14x OSD nodes, each with 18 HDDs and 1 NVME (Intel(R) Xeon(R) Gold 6126 CPU @ 
2.60GHz, 24 cores, 384 GB RAM)
- CentOS 7.8 and Kernel 5.4.51
- 100 Gbps Infiniband

We are collecting various metrics using Prometheus, and on the OSD nodes we are 
seeing some clear differences when it comes to CPU and Memory usage. I 
collected some graphs here: http://mitsted.dk/ceph . After the update the 
system load is highly reduced, there is almost no longer any iowait for the 
CPU, and the free memory is no longer used for Buffers (I can confirm that the 
changes in these metrics are not due to the update of CentOS or the Linux 
kernel). All in all, now the OSD nodes are almost completely idle all the time 
(and so are the monitors). On the linked page I also attached two RADOS 
benchmarks. The first benchmark was performed when the cluster was initially 
configured, and the second is the same benchmark after the update to Octopus. 
When comparing these two, it is clear that the performance has changed 
dramatically. For example, in the write test the bandwidth is reduced from 320 
MB/s to 21 MB/s and the number of IOPS has also dropped significantly.

I temporarily tried to disable the firewall and SELinux on all nodes to see if 
it made any difference, but it didn’t look like it (I did not restart any 
services during this test, I am not sure if that could be necessary).

Any suggestions for finding the root cause of this performance decrease would 
be greatly appreciated.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-02 Thread Frank Schilder
Hi Sagara,

I'm not sure if my hypothesis can be correct. Ceph acknowledges a 
write only after all copies are on disk. In other words, if PGs end up on 
different versions after a power outage, one always needs to roll back. Since 
you have two healthy OSDs in the PG and the PG is active (successfully peered), 
it might just be a broken disk and read/write errors. I would focus on that.

Another question, do you have write caches enabled (disk cache and controller 
cache)? This is known to cause problems on power outages and also degraded 
performance with ceph. You should check and disable any caches if necessary.
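
A quick way to check and turn off the volatile disk cache (device name is just an example; the controller cache is vendor-specific and is usually changed in the controller's own CLI or the BIOS):

# what the drive itself reports
smartctl -g wcache /dev/sda
# disable it on a SATA drive
smartctl -s wcache,off /dev/sda
hdparm -W 0 /dev/sda
# SAS/SCSI variant
sdparm --set WCE=0 --save /dev/sda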

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 01 November 2020 14:37:41
To: Sagara Wijetunga; ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

sorry: *badblocks* can force remappings of broken sectors (non-destructive 
read-write check)

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Frank Schilder 
Sent: 01 November 2020 14:35:35
To: Sagara Wijetunga; ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Sagara,

looks like your situation is more complex. Before doing anything potentially 
destructive, you need to investigate some more. A possible interpretation 
(numbering just for the example):

OSD 0 PG at version 1
OSD 1 PG at version 2
OSD 2 PG has scrub error

Depending on the version of the PG on OSD 2, either OSD 0 needs to roll forward 
(OSD 2 PG at version 2), or OSD 1 needs to roll back (OSD 2 PG at version 1). 
Part of the relevant information on OSD 2 seems to be unreadable, therefore pg 
repair bails out.

You need to find out if you are in this situation or some other case. If you 
are, you need to find out somehow if you need to roll back or forward. I'm 
afraid in your current situation, even taking the OSD with the scrub errors 
down will not rebuild the PG.

I would probably try:

- find out with smartctl if the OSD with scrub errors is in a pre-fail state 
(has remapped sectors)
- if it is:
  * take it down and try to make a full copy with ddrescue
  * if ddrescure manages to copy everything, copy back to a new disk and add to 
ceph
  * if ddrescue fails to copy everything, you could try if badblocks manages to 
get the disk back; ddrescue can force remappings of broken sectors 
(non-destructive read-write check) and it can happen that data becomes readable 
again, exchange the disk as soon as possible thereafter
- if the disk is healthy:
  * try to find out if you can deduce the state of the copies on every OSD
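
To make the disk checks above concrete, roughly (device names are examples only, and the OSD must be stopped before cloning):

# look for reallocated / pending sectors
smartctl -a /dev/sdc | grep -Ei 'realloc|pending|uncorrect'
# clone the disk, keeping a map of unreadable areas
ddrescue -f /dev/sdc /dev/sdd /root/sdc.map
# non-destructive read-write pass that can trigger sector remapping
badblocks -nsv /dev/sdc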

The tool for low-level PG operations is ceph-objectstore-tool. I never used it, 
so you need to look at the documentation.

If everything fails, I guess your last option is to decide on one of the 
copies, export it from one OSD and inject it into another one (but not any of 
0,1,2!). This will establish 2 identical copies and the third one will be 
changed to this one automatically. Note that this may lead to data loss on 
objects that were in the undefined state. As far as I can see, it's only 1 
object and probably possible to recover from (backup, snapshot).
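
If it comes to the export/import route, a rough outline with ceph-objectstore-tool (untested by me; the OSD ids and paths are only examples, the pg id 3.b is taken from your log, and the OSDs involved must be stopped while the tool runs):

# on the host of the source OSD (stopped), export the PG
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
  --pgid 3.b --op export --file /root/pg3.b.export
# on the target OSD (also stopped); if it already holds a copy of 3.b,
# that copy has to be removed there first with --op remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
  --pgid 3.b --op import --file /root/pg3.b.export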

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Sagara Wijetunga 
Sent: 01 November 2020 14:05:36
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from 
active+clean+inconsistent+failed_repair?

Hi Frank

Thanks for the reply.

> I think this happens when a PG has 3 different copies and cannot decide which 
> one is correct. You might have hit a very rare case. You should start with 
> the scrub errors, check which PGs and which copies (OSDs) are affected. It 
> sounds almost like all 3 scrub errors are on the same PG.
Yes, all 3 errors are for the same PG and on the same OSD:
2020-11-01 18:25:09.39 osd.0 [ERR] 3.b shard 2 soid 
3:d577e975:::123675e.:head : candidate had a missing snapset key, 
candidate had a missing info key
2020-11-01 18:25:09.42 osd.0 [ERR] 3.b soid 
3:d577e975:::123675e.:head : failed to pick suitable object info
2020-11-01 18:26:33.496255 osd.0 [ERR] 3.b repair 3 errors, 0 fixed

> You might have had a combination of crash and OSD fail, your situation is 
> probably not covered by "single point of failure".
Yes it was a complex crash, all went down.

> In case you have a PG with scrub errors on 2 copies, you should be able to 
> reconstruct the PG from the third with PG export/PG import commands.
I have not done a PG export/import before. Would you mind sending the 
instructions or a link for it?

Thanks
Sagara
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] how to rbd export image from group snap?

2020-11-02 Thread Timo Weingärtner
Hi,

we're using rbd for VM disk images and want to make consistent backups of
groups of them.

I know I can create a group and make consistent snapshots of all of them:

# rbd --version
ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus 
(stable)
# rbd create test_foo --size 1M
# rbd create test_bar --size 1M
# rbd group create test
# rbd group image add test test_foo
# rbd group image add test test_bar
# rbd group snap create test@1

But how can I export the individual image snapshots? I tried different
ways of addressing them, but nothing worked:

# rbd export test_foo@1 -   

error setting snapshot context: (2) No such file or directory
# rbd export test_foo@test/1 -
rbd: error opening pool 'test_foo@test': (2) No such file or directory
# rbd export rbd/test_foo@test/1 -
error setting snapshot context: (2) No such file or directory
# rbd export test@1/test_foo -
rbd: error opening pool 'test@1': (2) No such file or directory
# rbd export rbd/test@1/test_foo -
rbd: error opening image test: (2) No such file or directory

Am I missing something?
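
One thing that might help narrow it down: the group snapshot is created on each image in its own snapshot namespace, so its name on the image is not simply "1". Listing all namespaces shows what it is actually called (whether export accepts that name directly probably depends on the release):

# rbd snap ls --all test_foo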


Kind regards,
Timo Weingärtner
System Administrator
-- 
ITscope GmbH
Ludwig-Erhard-Allee 20
D-76131 Karlsruhe

Tel: +49 721 62737637
Fax: +49 721 66499175

https://www.itscope.com

Commercial register: AG Mannheim, HRB 232782; registered office: Karlsruhe
Managing directors: Alexander Münkel, Benjamin Mund, Stefan Reger


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Re: 14.2.12 breaks mon_host pointing to Round Robin DNS entry

2020-11-02 Thread Wido den Hollander



On 31/10/2020 11:16, Sasha Litvak wrote:

Hello everyone,

Assuming that backport has been merged for a few days now, is there a 
chance that 14.2.13 will be released?


On the dev list it was posted that .13 will be released this week.

Wido



On Fri, Oct 23, 2020, 6:03 AM Van Alstyne, Kenneth wrote:


Jason/Wido, et al:
      I was hitting this exact problem when attempting to update
from 14.2.11 to 14.2.12.  I reverted the two commits associated with
that pull request and was able to successfully upgrade to 14.2.12. 
Everything seems normal, now.



Thanks,

--
Kenneth Van Alstyne
Systems Architect
M: 804.240.2327
14291 Park Meadow Drive, Chantilly, VA 20151
perspecta


From: Jason Dillaman <jdill...@redhat.com>
Sent: Thursday, October 22, 2020 12:54 PM
To: Wido den Hollander <w...@42on.com>
Cc: ceph-users@ceph.io
Subject: [EXTERNAL] [ceph-users] Re: 14.2.12 breaks mon_host
pointing to Round Robin DNS entry

This backport [1] looks suspicious as it was introduced in v14.2.12
and directly changes the initial MonMap code. If you revert it in a
dev build does it solve your problem?

[1] https://github.com/ceph/ceph/pull/36704

On Thu, Oct 22, 2020 at 12:39 PM Wido den Hollander <w...@42on.com> wrote:
 >
 > Hi,
 >
 > I already submitted a ticket: https://tracker.ceph.com/issues/47951
 >
 > Maybe other people noticed this as well.
 >
 > Situation:
 > - Cluster is running IPv6
 > - mon_host is set to a DNS entry
 > - DNS entry is a Round Robin with three AAAA-records
 >
 > root@wido-standard-benchmark:~# ceph -s
 > unable to parse addrs in 'mon.objects.xx.xxx.net'
 > [errno 22] error connecting to the cluster
 > root@wido-standard-benchmark:~#
 >
 > The relevant part of the ceph.conf:
 >
 > [global]
 > auth_client_required = cephx
 > auth_cluster_required = cephx
 > auth_service_required = cephx
 > mon_host = mon.objects.xxx.xxx.xxx
 > ms_bind_ipv6 = true
 >
 > This works fine with 14.2.11 and breaks under 14.2.12
 >
 > Anybody else seeing this as well?
 >
 > Wido
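
For what it's worth, a quick way to check what such a round-robin entry actually hands back (the hostname below is the placeholder from the config above):

dig +short AAAA mon.objects.xxx.xxx.xxx
getent ahostsv6 mon.objects.xxx.xxx.xxx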
 > ___
 > ceph-users mailing list -- ceph-users@ceph.io
 > To unsubscribe send an email to ceph-users-le...@ceph.io
 >


--
Jason
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Restart Error: osd.47 already exists in network host

2020-11-02 Thread Eugen Block

Hi,

are you sure it's the right container ID you're using for the restart?
I noticed that 'cephadm ls' shows older containers after a daemon had  
to be recreated (a MGR in my case). Maybe you're trying to restart a  
daemon that was already removed?
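
A quick way to cross-check is to compare what cephadm thinks is deployed with what docker actually runs (daemon name taken from the log above):

cephadm ls | grep osd.47
docker ps -a | grep osd.47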


Regards,
Eugen


Quoting Ml Ml :


Hello List,
sometimes some OSDs get taken out for some reason (I am still looking
for the reason, and I guess it's due to some overload). However, when I
try to restart them I get:

Nov 02 08:05:26 ceph05 bash[9811]: Error: No such container:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.47
Nov 02 08:05:29 ceph05 bash[9811]: /usr/bin/docker: Error response
from daemon: endpoint with name
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df-osd.47 already exists in
network host.
Nov 02 08:05:29 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Main process
exited, code=exited, status=125/n/a
Nov 02 08:05:34 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Failed with
result 'exit-code'.
Nov 02 08:05:44 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Service
RestartSec=10s expired, scheduling restart.
Nov 02 08:05:44 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Scheduled
restart job, restart counter is at 5.
Nov 02 08:05:44 ceph05 systemd[1]: Stopped Ceph osd.47 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.
Nov 02 08:05:44 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Start
request repeated too quickly.
Nov 02 08:05:44 ceph05 systemd[1]:
ceph-5436dd5d-83d4-4dc8-a93b-60ab5db145df@osd.47.service: Failed with
result 'exit-code'.
Nov 02 08:05:44 ceph05 systemd[1]: Failed to start Ceph osd.47 for
5436dd5d-83d4-4dc8-a93b-60ab5db145df.

I need to reboot the full host to get the OSD back in again. As far as I
can see this is some docker problem?

root@ceph05:~# docker ps | grep osd.47 => no hit
root@ceph05:~# docker network prune => does not solve the problem
Any hint on that?

Thanks,
Michael
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Seriously degraded performance after update to Octopus

2020-11-02 Thread Marc Roos


I have been advocating for a long time for publishing test data of some 
basic test cluster against different ceph releases. Just a basic ceph 
cluster that covers most configs and runs the same tests, so you can 
compare just ceph performance. That would mean a lot for smaller 
companies that do not have access to a good test environment. I have 
also asked about this at some ceph seminar.

 

-Original Message-
From: Martin Rasmus Lundquist Hansen [mailto:han...@imada.sdu.dk] 
Sent: Monday, November 02, 2020 7:53 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Seriously degraded performance after update to 
Octopus

Two weeks ago we updated our Ceph cluster from Nautilus (14.2.0) to 
Octopus (15.2.5), an update that was long overdue. We used the Ansible 
playbooks to perform a rolling update and, except for a few minor 
problems with the Ansible code, the update went well. The Ansible 
playbooks were also used for setting up the cluster in the first place. 
Before updating the Ceph software we also performed a full update of 
CentOS and the Linux kernel (this part of the update had already been 
tested on one of the OSD nodes the week before and we didn't notice any 
problems).

However, after the update we are seeing a serious decrease in 
performance, more than a factor of 10x in some cases. I spent a week 
trying to come up with an explanation or solution, but I am completely 
blank. Independently of Ceph I tested the network performance and the 
performance of the OSD disks, and I am not really seeing any problems 
here.

The specifications of the cluster is:
- 3x Monitor nodes running mgr+mon+mds (Intel(R) Xeon(R) Silver 4108 CPU 
@ 1.80GHz, 16 cores, 196 GB RAM)
- 14x OSD nodes, each with 18 HDDs and 1 NVME (Intel(R) Xeon(R) Gold 
6126 CPU @ 2.60GHz, 24 cores, 384 GB RAM)
- CentOS 7.8 and Kernel 5.4.51
- 100 Gbps Infiniband

We are collecting various metrics using Prometheus, and on the OSD nodes 
we are seeing some clear differences when it comes to CPU and Memory 
usage. I collected some graphs here: http://mitsted.dk/ceph . After the 
update the system load is highly reduced, there is almost no longer any 
iowait for the CPU, and the free memory is no longer used for Buffers (I 
can confirm that the changes in these metrics are not due to the update 
of CentOS or the Linux kernel). All in all, now the OSD nodes are almost 
completely idle all the time (and so are the monitors). On the linked 
page I also attached two RADOS benchmarks. The first benchmark was 
performed when the cluster was initially configured, and the second is 
the same benchmark after the update to Octopus. When comparing these 
two, it is clear that the performance has changed dramatically. For 
example, in the write test the bandwidth is reduced from 320 MB/s to 21 
MB/s and the number of IOPS has also dropped significantly.

I temporarily tried to disable the firewall and SELinux on all nodes to 
see if it made any difference, but it didn't look like it (I did not 
restart any services during this test, I am not sure if that could be 
necessary).

Any suggestions for finding the root cause of this performance decrease 
would be greatly appreciated.
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fix PGs states

2020-11-02 Thread Eugen Block
There's nothing wrong with EC pools or multiple datacenters, you just  
need the right configuration to cover the specific requirements ;-)
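
As a sketch of what that can look like (profile and pool names, k/m and the number of sites are made up, and it assumes the CRUSH map actually contains datacenter buckets): with four datacenters, k=2/m=2 and the failure domain set to datacenter puts every shard in a different site, so a whole site can fail without losing the pool:

ceph osd erasure-code-profile set ec-dc k=2 m=2 crush-failure-domain=datacenter
ceph osd pool create ecpool 128 128 erasure ec-dc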



Quoting "Ing. Luis Felipe Domínguez Vega" :

Yes, thanks to all. The decision was to remove it all and start from scratch,  
not use EC pools, use only replicated pools, and not distribute over DCs.


On 2020-10-31 14:08, Eugen Block wrote:

To me it looks like a snapshot is not found which seems plausible
because you already encountered missing rbd chunks. Since you said
it's just a test cluster the easiest way would probably be to delete
the affected pools and recreate them when the cluster is healthy
again. With the current situation it's almost impossible to say which
rbd images will be corrupted and which can be rescued. Is deleting
the pools an option for you?
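
In case you go that way, it is roughly the following (the pool name is only an example, and the delete really does destroy all data in that pool):

ceph config set mon mon_allow_pool_delete true
ceph osd pool delete images images --yes-i-really-really-mean-it
ceph osd pool create images 128 128 replicated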


Quoting "Ing. Luis Felipe Domínguez Vega" :


https://pastebin.ubuntu.com/p/tHSpzWp8Cx/

On 2020-10-30 11:47, dhils...@performair.com wrote:

This line is telling:
   1 osds down
This is likely the cause of everything else.

Why is one of your OSDs down?

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International, Inc.
dhils...@performair.com
www.PerformAir.com



-Original Message-
From: Ing. Luis Felipe Domínguez Vega [mailto:luis.doming...@desoft.cu]
Sent: Thursday, October 29, 2020 7:46 PM
To: Ceph Users
Subject: [ceph-users] Fix PGs states

Hi:

I have this ceph status:
-
cluster:
   id: 039bf268-b5a6-11e9-bbb7-d06726ca4a78
   health: HEALTH_WARN
   noout flag(s) set
   1 osds down
   Reduced data availability: 191 pgs inactive, 2 pgs down, 35
pgs incomplete, 290 pgs stale
   5 pgs not deep-scrubbed in time
   7 pgs not scrubbed in time
   327 slow ops, oldest one blocked for 233398 sec, daemons
[osd.12,osd.36,osd.5] have slow ops.

 services:
   mon: 1 daemons, quorum fond-beagle (age 23h)
   mgr: fond-beagle(active, since 7h)
   osd: 48 osds: 45 up (since 95s), 46 in (since 8h); 4 remapped pgs
flags noout

 data:
   pools:   7 pools, 2305 pgs
   objects: 350.37k objects, 1.5 TiB
   usage:   3.0 TiB used, 38 TiB / 41 TiB avail
   pgs: 6.681% pgs unknown
1.605% pgs not active
1835 active+clean
279  stale+active+clean
154  unknown
22   incomplete
10   stale+incomplete
2down
2remapped+incomplete
1stale+remapped+incomplete


How can I fix all of the unknown, incomplete, remapped+incomplete, etc. PGs? I
don't care if I need to remove PGs.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io