[ceph-users] Re: A change in Ceph leadership...

2021-10-17 Thread Wido den Hollander



On 15-10-2021 at 16:40, Sage Weil wrote:

This fall I will be stepping back from a leadership role in the Ceph
project. My primary focus during the next two months will be to work with
developers and community members to ensure a smooth transition to a more
formal system of governance for the Ceph project. My last day at Red Hat
will be in December.

I have been working on Ceph since 2004. I am immensely proud of what we
have accomplished over the last 17 years, and see a bright future for the
project. I continue to be excited by and inspired by Ceph and the impact
that open source storage has on the industry. I am especially impressed by
the recent influx of talented new developers to the project, and I have the
utmost confidence in and respect for the developer leads who are the
driving force behind the project today. I have no doubt that Ceph will
continue to thrive.


A few likely questions…

-- Where am I going / What am I doing next?

As many of you may be aware, I took a nine-month leave in 2020 to work with
VoteAmerica to help people vote in the US elections. This was a great
experience that reminded me what it was like to work on a small,
fast-moving team with a clear mission. I expect to do something similar for
the 2022 midterm elections, but I do not yet have concrete plans beyond
that, except for a desire to work on something with high social impact.

-- Am I leaving Ceph for good?

Initially, I expect to step back from a leadership role but continue to
contribute as a developer--at least for the short term. How much I
contribute will depend on other time commitments, which are not easy to
predict. My best guess is that I will continue to do some work on the
Quincy release but play a minimal role in the R release.

-- What will Ceph project leadership look like going forward?

The Ceph Leadership Team[1] has known about this coming transition for a
couple of weeks now and has been discussing models for more formal project
governance. We’ve been looking at other projects that have made similar
transitions (e.g., Python) for ideas and inspiration.  Expect to see
messages to the email list(s) as we get closer to specific proposals.

-- What does this mean for Red Hat's involvement in Ceph?

My decision to leave Red Hat is unlikely to have any impact whatsoever on
Red Hat's ongoing use of Ceph or its commitment to the Ceph project and
community.

Stay tuned for more information as the discussion around project governance
continues.

It has been a great honor to see a vibrant user and developer community
form around Ceph, and to see so many people devote their time, energy, and
careers to a project that once seemed like such a long shot. Thank you all
for believing in us!



Thank you so much for all the hard work and effort you put into the 
project for all these years!


I can write a super long e-mail or just keep it very short: Thanks for 
everything! And I wish you all the best!


Wido


With gratitude,

sage


[1] https://ceph.io/en/community/team/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Announcing go-ceph v0.11.0

2021-08-11 Thread Wido den Hollander



On 10-08-2021 at 22:32, Xiaolong Jiang wrote:

Thank you for your response.

We are making a choice between Go and Java. Internally, our company has 
very good Java ecosystem support with Spring, which I don't want to lose. 
The Go binding is being actively developed/supported, so it's a good choice 
too. I am still not sure which one I will choose.




I wrote the Java bindings and have been maintaining them ever since. Lately 
I haven't paid much attention to them, as a) I don't use them myself anymore, 
and b) they work for most use-cases.


If fixes are needed then Pull Requests are more than welcome!

Wido




On Tue, Aug 10, 2021 at 1:20 PM John Mulligan 
<phlogistonj...@asynchrono.us> wrote:


On Tuesday, August 10, 2021 2:18:59 PM EDT Xiaolong Jiang wrote:
 > Hi John,
 >
 > I noticed the java binding repo has not been updated for a while.
As the
 > user, is it recommended to use go-binding?

If you want to use Go (aka golang) I can certainly recommend
go-ceph. However
I can't speak much to other language bindings as I don't follow the
development or help with other bindings.

I hope that helps!


 > > On Aug 10, 2021, at 10:31 AM, John Mulligan
<phlogistonj...@asynchrono.us>
 > > wrote:
 > >
 > > I'm happy to announce another release of the go-ceph API
 > > library. This is a regular release following our
every-two-months release
 > > cadence.
 > >
 > > https://github.com/ceph/go-ceph/releases/tag/v0.11.0

 > >
 > > Changes include additions to the cephfs, cephfs admin, rbd, and
rgw admin
 > > packages. More details are available at the link above.
 > >
 > > The library includes bindings that aim to play a similar role
to the
 > > "pybind" python bindings in the ceph tree but for the Go
language. The
 > > library also includes additional APIs that can be used to
administer
 > > cephfs, rbd, and rgw subsystems.
 > > There are already a few consumers of this library in the wild,
including
 > > the ceph-csi project.
 >
 > ___
 > Dev mailing list -- d...@ceph.io 
 > To unsubscribe send an email to dev-le...@ceph.io







--
Best regards,
Xiaolong Jiang

Senior Software Engineer at Netflix
Columbia University

___
Dev mailing list -- d...@ceph.io
To unsubscribe send an email to dev-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: we're living in 2005.

2021-07-27 Thread Wido den Hollander



On 27-07-2021 at 05:11, Fyodor Ustinov wrote:

Hi!


docs.ceph.io ?  If there’s something that you’d like to see added there, you’re
welcome to submit a tracker ticket, or write to me privately.  It is not
uncommon for documentation enhancements to be made based on mailing list
feedback.


Documentation...

Try to install a completely new Ceph cluster from scratch on a freshly installed 
LTS Ubuntu, following this doc: https://docs.ceph.com/en/latest/cephadm/install/ . Many 
interesting discoveries await you.
Nothing special - just a step-by-step installation, as described in the 
documentation. No more and no less.



But whose responsibility is it to write documentation? Ceph is Open 
Source and anybody can help to make it better.


Developing Ceph is not only about writing code; it can also mean writing 
documentation.


Open Source != free: it means that you need to invest time to get it 
working.


And if you spot flaws in the documentation, anybody is more than welcome 
to open a pull request to improve it.


Who else do we expect to write the documentation? In the end everybody 
needs to get paid, and the company which pays that person needs to get 
paid as well. That's how it works.


Suggestions and feedback are always welcome, but don't expect a paved 
road for free.


Wido


WBR,
 Fyodor.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it safe to mix Octopus and Pacific mons?

2021-06-09 Thread Wido den Hollander



On 6/9/21 8:51 PM, Vladimir Brik wrote:
> Hello
> 
> My attempt to upgrade from Octopus to Pacific ran into issues, and I
> currently have one 16.2.4 mon and two 15.2.12 mons. Is this safe to run
> the cluster like this or should I shut down the 16.2.4 mon until I
> figure out what to do next with the upgrade?
> 

I would try to keep this mixed-version situation as short as possible, but
I would *not* choose to shut down a MON.

What is preventing you from updating the other MONs to v16?
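
As a quick check while the versions are mixed, the running daemon versions 
can be listed with:

$ ceph versions

That makes it easy to confirm when all MONs are on 16.2.x.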

Wido

> Thanks,
> 
> Vlad
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance (RBD) regression after upgrading beyond v15.2.8

2021-06-09 Thread Wido den Hollander




On 09/06/2021 14:33, Ilya Dryomov wrote:

On Wed, Jun 9, 2021 at 1:38 PM Wido den Hollander  wrote:


Hi,

While doing some benchmarks I have two identical Ceph clusters:

3x SuperMicro 1U
AMD Epyc 7302P 16C
256GB DDR
4x Samsung PM983 1,92TB
100Gbit networking

I tested on such a setup with v16.2.4 with fio:

bs=4k
qd=1

IOps: 695

That was very low as I was expecting at least >1000 IOps.

I checked with the second Ceph cluster which was still running v15.2.8,
the result: 1364 IOps.

I then upgraded from 15.2.8 to 15.2.13: 725 IOps

Looking at the differences between v15.2.8 and v15.2.13 in options.cc I
saw these options:

bluefs_buffered_io: false -> true
bluestore_cache_trim_max_skip_pinned: 1000 -> 64

The main difference seems to be 'bluefs_buffered_io', but in both cases
this was already explicitly set to 'true'.

So anything beyond 15.2.8 is right now giving me a much lower I/O
performance with Queue Depth = 1 and Block Size = 4k.

15.2.8: 1364 IOps
15.2.13: 725 IOps
16.2.4: 695 IOps

Has anybody else seen this as well? I'm trying to figure out where this
is going wrong.


Hi Wido,

Going by the subject, I assume these are rbd numbers?  If so, did you
run any RADOS-level benchmarks?


Yes, rbd benchmark using fio.

$ rados -p rbd -t 1 -O 4096 -b 4096 bench 60 write

Average IOPS:   1024
Stddev IOPS:29.6598
Max IOPS:   1072
Min IOPS:   918
Average Latency(s): 0.00097
Stddev Latency(s):  0.000306557

So that seems kind of OK. Still roughly 1k IOps and a write latency of ~1ms.

But that was ~0.75ms when writing through RBD.

I now have a 16.2.4 and 15.2.13 cluster with identical hardware to run 
some benchmarks on.


Wido



Thanks,

 Ilya


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Performance (RBD) regression after upgrading beyond v15.2.8

2021-06-09 Thread Wido den Hollander

Hi,

While doing some benchmarks I have two identical Ceph clusters:

3x SuperMicro 1U
AMD Epyc 7302P 16C
256GB DDR
4x Samsung PM983 1,92TB
100Gbit networking

I tested on such a setup with v16.2.4 with fio:

bs=4k
qd=1

IOps: 695

That was very low as I was expecting at least >1000 IOps.

I checked with the second Ceph cluster which was still running v15.2.8, 
the result: 1364 IOps.


I then upgraded from 15.2.8 to 15.2.13: 725 IOps

Looking at the differences between v15.2.8 and v15.2.13 in options.cc I 
saw these options:


bluefs_buffered_io: false -> true
bluestore_cache_trim_max_skip_pinned: 1000 -> 64

The main difference seems to be 'bluefs_buffered_io', but in both cases 
this was already explicitly set to 'true'.
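
One way to double-check what a running OSD actually uses (osd.0 is just an 
example here):

$ ceph config get osd.0 bluefs_buffered_io
# or, on the host where that OSD runs:
$ ceph daemon osd.0 config get bluefs_buffered_io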


So anything beyond 15.2.8 is right now giving me a much lower I/O 
performance with Queue Depth = 1 and Block Size = 4k.


15.2.8: 1364 IOps
15.2.13: 725 IOps
16.2.4: 695 IOps

Has anybody else seen this as well? I'm trying to figure out where this 
is going wrong.


Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Metrics for object sizes

2021-04-22 Thread Wido den Hollander




On 21/04/2021 11:46, Szabo, Istvan (Agoda) wrote:

Hi,

Is there any clusterwise metric regarding object sizes?

I'd like to collect some information about the users: what are the object sizes 
in their buckets?


Are you talking about RADOS objects or objects inside RGW buckets?

I think you are talking about RGW, but I just wanted to check.

Afaik this information is not available for either RADOS or RGW.

Do keep in mind that small objects are much more expensive than large 
objects. The metadata overhead becomes costly and can even become 
problematic if you have millions of tiny (few kB) objects.
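
For RGW there is no per-object size distribution, but a rough average per 
bucket can be derived from the bucket stats (the bucket name is an example, 
and the exact field names can vary a bit by release):

$ radosgw-admin bucket stats --bucket=mybucket
# divide the size_actual in the usage section by num_objects for an average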


Wido






This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: has anyone enabled bdev_enable_discard?

2021-04-15 Thread Wido den Hollander




On 13/04/2021 11:07, Dan van der Ster wrote:

On Tue, Apr 13, 2021 at 9:00 AM Wido den Hollander  wrote:




On 4/12/21 5:46 PM, Dan van der Ster wrote:

Hi all,

bdev_enable_discard has been in ceph for several major releases now
but it is still off by default.
Did anyone try it recently -- is it safe to use? And do you have perf
numbers before and after enabling?



I have done so on SATA SSDs in a few cases, and it worked.

Did I notice a real difference? Not really.



Thanks, I've enabled it on a test box and am draining data to check
that it doesn't crash anything.


It's highly debated whether this still makes a difference with modern flash
devices. I don't think there is a real conclusion on whether you still need to
trim/discard blocks.


Do you happen to have any more info on these debates? As you know we
have seen major performance issues on hypervisors that are not running
a periodic fstrim; we use similar or identical SATA ssds for HV local
storage and our block.db's. If it doesn't hurt anything, why wouldn't
we enable it by default?



These debates are more about whether it really makes sense with modern SSDs, 
as the performance gain seems limited.


With older (SATA) SSDs it might, but with the modern NVMe DC-grade ones 
people are doubting if it is still needed.


SATA 3.0 also had the issue that the TRIM command was a blocking command, 
whereas with SATA 3.1 it became async and thus non-blocking.


With NVMe it is a different story again.

I don't have links or papers for you; it's mainly stories I heard at 
conferences and such.


Wido


Cheers, Dan


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: has anyone enabled bdev_enable_discard?

2021-04-13 Thread Wido den Hollander



On 4/12/21 5:46 PM, Dan van der Ster wrote:
> Hi all,
> 
> bdev_enable_discard has been in ceph for several major releases now
> but it is still off by default.
> Did anyone try it recently -- is it safe to use? And do you have perf
> numbers before and after enabling?
> 

I have done so on SATA SSDs in a few cases, and it worked.

Did I notice a real difference? Not really.

It's highly debated whether this still makes a difference with modern flash
devices. I don't think there is a real conclusion on whether you still need to
trim/discard blocks.
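
For anyone who wants to experiment, enabling it via the config database is a 
one-liner; as far as I know the bdev options are read at OSD start, so a 
restart of the OSDs is needed:

$ ceph config set osd bdev_enable_discard true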

Wido

> Cheers, Dan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: KRBD failed to mount rbd image if mapping it to the host with read-only option

2021-04-08 Thread Wido den Hollander



On 08/04/2021 14:09, Ha, Son Hai wrote:

Hi everyone,

We encountered an issue with KRBD mounting after mapping an image to the host with 
the read-only option.
We tried to pinpoint where the problem is, but were not able to.



See my reply down below.


The image mounts fine if we map it without the "read-only" option.
This leads to an issue where the pod in k8s cannot use the snapshotted 
persistent volume created by the ceph-csi RBD provisioner.
Thank you for reading.

I have reported the bug here: Bug #50234: krbd failed to mount after mapping image with 
read-only option - Ceph

Context
- Using admin keyring
- Linux Kernel: 3.10.0-1160.15.2.el7.x86_64
- Linux Distribution: Red Hat Enterprise Linux Server 7.8 (Maipo)
- Ceph version: "ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) 
nautilus (stable)"

rbd image 'csi-vol-85919409-9797-11eb-80ba-720b2b57c790':
 size 10 GiB in 2560 objects
 order 22 (4 MiB objects)
 snapshot_count: 0
 id: 533a03bba388ea
 block_name_prefix: rbd_data.533a03bba388ea
 format: 2
 features: layering
 op_features:
 flags:
 create_timestamp: Wed Apr  7 13:51:02 2021
 access_timestamp: Wed Apr  7 13:51:02 2021
 modify_timestamp: Wed Apr  7 13:51:02 2021

Bug Reproduction
# Map RBD image WITH read-only option, CANNOT mount with both readonly or 
readwrite option
sudo rbd device map -p k8s-sharedpool 
csi-vol-85919409-9797-11eb-80ba-720b2b57c790 -ro
   /dev/rbd0
sudo mount -v -r -t ext4 /dev/rbd0 /mnt/test1
   mount: cannot mount /dev/rbd0 read-only

sudo mount -v -r -t ext4 /dev/rbd0 /mnt/test1
   mount: /dev/rbd0 is write-protected, mounting read-only
   mount: cannot mount /dev/rbd0 read-only



ext4 will always try to recover its journal during mount, and this means 
it wants to write. That fails.


Try this with mounting:

sudo mount -t ext4 -o norecovery /dev/rbd0 /mnt/test1

or

sudo mount -t ext4 -o noload /dev/rbd0 /mnt/test1
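
A fully read-only variant of the same idea would be something like:

sudo mount -t ext4 -o ro,noload /dev/rbd0 /mnt/test1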

Wido


# Map RBD image WITHOUT read-only option, CAN mount with both readonly or 
readwrite option
sudo rbd device map -p k8s-sharedpool 
csi-vol-85919409-9797-11eb-80ba-720b2b57c790
   /dev/rbd0
sudo mount -v -r -t ext4 /dev/rbd0 /mnt/test1

   mount: /mnt/test1 does not contain SELinux labels.
You just mounted an file system that supports labels which does not
contain labels, onto an SELinux box. It is likely that confined
applications will generate AVC messages and not be allowed access to
this file system.  For more details see restorecon(8) and mount(8).
   mount: /dev/rbd0 mounted on /mnt/test1.

sudo mount -v -t ext4 /dev/rbd0 /mnt/test1
   mount: /mnt/test1 does not contain SELinux labels.
You just mounted an file system that supports labels which does not
contain labels, onto an SELinux box. It is likely that confined
applications will generate AVC messages and not be allowed access to
this file system.  For more details see restorecon(8) and mount(8).
   mount: /dev/rbd0 mounted on /mnt/test1.

With my best regards,
Son Hai Ha


--
KPMG IT Service GmbH
Sitz/Registergericht: Berlin/Amtsgericht Charlottenburg, HRB 87521 B
Geschäftsführer: Hans-Christian Schwieger, Helmar Symmank
Aufsichtsratsvorsitzender: WP StB Klaus Becker
  


The information in this e-mail is confidential and may be legally privileged. 
It is intended solely for the addressee. Access to this e-mail by anyone else 
is unauthorized. If you are not the intended recipient, any disclosure, 
copying, distribution or any action taken or omitted to be taken in reliance on 
it, is prohibited and may be unlawful. Any opinions or advice contained in this 
e-mail are subject to the terms and conditions expressed in the governing KPMG 
client engagement letter.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -

[ceph-users] Re: ceph-ansible in Pacific and beyond?

2021-03-18 Thread Wido den Hollander




On 18/03/2021 09:09, Janne Johansson wrote:

Den ons 17 mars 2021 kl 20:17 skrev Matthew H :


"A containerized environment just makes troubleshooting more difficult, getting 
access and retrieving details on Ceph processes isn't as straightforward as with a non 
containerized infrastructure. I am still not convinced that containerizing everything 
brings any benefits except the collocation of services."

It changes the way you troubleshoot, but I haven't found it more difficult in the 
issues I have seen and had. Even today, without containers, all services can be 
co-located on the same hosts (mons, mgrs, osds, mds). Is there a situation 
you've seen where that has not been the case?


New ceph users pop in all the time on the #ceph IRC and have
absolutely no idea on how to see the relevant logs from the
containerized services.

Me being one of the people that do run services on bare metal (and
VMs) I actually can't help them, and it seems several other old ceph
admins can't either.



Me being one of them.

Yes, it's all possible with containers, but it's different. And I don't 
see the true benefit of running Ceph in Docker just yet.


It's another layer of abstraction which you need to understand. Also, when 
you need to do real emergency work, like using ceph-objectstore-tool to fix 
broken OSDs/PGs, it's just much easier to work on a bare-metal box than with 
containers (if you ask me).


So no, I am not convinced yet. Not against it, but personally I would 
say it's not the only way forward.


DEB and RPM packages are still alive and kicking.

Wido


Not that it is impossible, or even necessarily hard, to get them, but
somewhere in the "it is so easy to get it up and running, just pop a
container and off you go" docs there seems to be a lack of the part
"when the OSD crashes at boot, run this to export the file normally
called /var/log/ceph/ceph-osd.12.log". This means it becomes a black box
to the users and they are left to wipe/reinstall or something else
when it doesn't work. In the end, I guess the project will see fewer
useful reports with assert-failure logs from impossible conditions and
more people turning away from something that could be fixed in the
long run.
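
For what it's worth, with cephadm-deployed clusters the logs can usually be 
reached with something along these lines (the daemon name and <fsid> are 
illustrative):

$ cephadm logs --name osd.12
$ journalctl -u ceph-<fsid>@osd.12.service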

I get some of the advantages, and for stateless services elsewhere it
might be gold to have containers, I am not equally enthusiastic about
it for ceph.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitor leveldb growing without bound v14.2.16

2021-03-03 Thread Wido den Hollander




On 03/03/2021 00:55, Lincoln Bryant wrote:

Hi list,

We recently had a cluster outage over the weekend where several OSDs were 
inaccessible over night for several hours. When I found the cluster in the 
morning, the monitors' root disks (which contained both the monitor's leveldb 
and the Ceph logs) had completely filled.

After restarting OSDs, cleaning out the monitors' logs, moving /var/lib/ceph to 
dedicated disks on the mons, and starting recovery (in which there was 1 
unfound object that I marked lost, if that has any relevancy), the leveldb 
continued/continues to grow without bound. The cluster has all PGs in 
active+clean at this point, yet I'm accumulating what seems like approximately 
~1GB/hr of new leveldb data.

Two of the monitors (a, c) are in quorum, while the third (b) has been 
synchronizing for the last several hours, but doesn't seem to be able to catch 
up. Mon 'b' has been running for 4 hours now in the 'synchronizing' state. The 
mon's log has many messages about compacting and deleting files, yet we never 
exit the synchronization state.

The ceph.log is also rapidly accumulating complaints that the mons are slow 
(not surprising, I suppose, since the levelDBs are ~100GB at this point).

I've found that using the monstore tool to do compaction on mons 'a' and 'c' helps, 
but it is only a temporary fix. Soon the database inflates again and I'm back to 
where I started.


Are all the PGs in the active+clean state? I assume not. That will 
cause the MONs to keep a large history of OSDMaps in their DB, and thus 
it will keep growing.




Thoughts on how to proceed here? Some ideas I had:
- Would it help to add some new monitors that use RocksDB?


They would need to sync which can take a lot of time. Moving to RocksDB 
is a good idea when this is all fixed.



- Stop a monitor and dump the keys via monstoretool, just to get an idea of 
what's going on?
- Increase mon_sync_max_payload_size to try to move data in larger chunks?


I would just try it.


- Drop down to a single monitor, and see if normal compaction triggers and 
stops growing unbounded?


It will keep growing; compaction only helps for a limited time. Make 
sure the PGs become clean again.


In the meantime make sure you have enough disk space.
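
For completeness, compaction can be triggered online or offline; a sketch, 
with the mon id and store path as examples:

$ ceph tell mon.a compact
# or, with the mon stopped:
$ ceph-kvstore-tool leveldb /var/lib/ceph/mon/ceph-a/store.db compact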

Wido


- Stop both 'a' and 'c', compact them, start them, and immediately start 
'b' ?

Appreciate any advice.

Regards,
Lincoln

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: NVMe and 2x Replica

2021-02-05 Thread Wido den Hollander



On 04/02/2021 18:57, Adam Boyhan wrote:

All great input and points guys.

Helps me lean towards 3 copes a bit more.

I mean honestly NVMe cost per TB isn't that much more than SATA SSD now. 
Somewhat surprised the salesmen aren't pitching 3x replication as it makes them 
more money.


To add to this, I have seen real cases as a Ceph consultant where size=2 
and min_size=1 on all-flash led to data loss.


Picture this:

- One node is down (Maintenance, failure, etc, etc)
- NVMe device in other node dies
- You loose data

Although you can bring back the other node, which was down but not broken, 
you are still missing data. The data on the NVMe devices in there is outdated 
and thus the PGs will not become active.


size=2 is only safe with min_size=2, but that doesn't really provide HA.

The same goes for ZFS in mirror, raidz1, etc. If you lose one device, 
the chances are real that you lose the other device before the array has 
healed itself.


With Ceph it's slighly more complex, but the same principles apply.

No, with NVMe I still would highly advise against using size=2, min_size=1

The question is not if you will lose data, but when you will lose data. 
Within one year? 2? 3? 10?
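
For reference, moving an existing pool to 3 copies is straightforward (the 
pool name is an example); note that it does trigger data movement:

$ ceph osd pool set rbd size 3
$ ceph osd pool set rbd min_size 2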


Wido





From: "Anthony D'Atri" 
To: "ceph-users" 
Sent: Thursday, February 4, 2021 12:47:27 PM
Subject: [ceph-users] Re: NVMe and 2x Replica


I searched each to find the section where 2x was discussed. What I found was 
interesting. First, there are really only 2 positions here: Micron's and Red 
Hat's. Supermicro copies Micron's position paragraph word for word. Not 
surprising considering that they are advertising a Supermicro / Micron solution.


FWIW, at Cephalocon another vendor made a similar claim during a talk.

* Failure rates are averages, not minima. Some drives will always fail sooner
* Firmware and other design flaws can result in much higher rates of failure or 
insidious UREs that can result in partial data unavailability or loss
* Latent soft failures may not be detected until a deep scrub succeeds, which 
could be weeks later
* In a distributed system, there are up/down/failure scenarios where the 
location of even one good / canonical / latest copy of data is unclear, 
especially when drive or HBA cache is in play.
* One of these is a power failure. Sure PDU / PSU redundancy helps, but stuff 
happens, like a DC underprovisioning amps, so that a spike in user traffic 
results in the whole row going down :-x Various unpleasant things can happen.

I was championing R3 even pre-Ceph when I was using ZFS or HBA RAID. As others 
have written, as drives get larger the time to fill them with replica data 
increases, as does the chance of overlapping failures. I’ve experienced R2 
overlapping failures more than once, with and before Ceph.

My sense has been that not many people run R2 for data they care about, and as 
has been written recently 2,2 EC is safer with the same raw:usable ratio. I’ve 
figured that vendors make R2 statements like these as a selling point to assert 
lower TCO. My first response is often “How much would it cost you directly, and 
indirectly in terms of user / customer goodwill, to lose data?”.


Personally, this looks like marketing BS to me. SSD shops want to sell SSDs, 
but because of the cost difference they have to convince buyers that their 
products are competitive.


^this. I’m watching the QLC arena with interest for the potential to narrow the 
CapEx gap. Durability has been one concern, though I’m seeing newer products 
claiming that eg. ZNS improves that. It also seems that there are something 
like what, *4* separate EDSFF / ruler form factors, I really want to embrace 
those eg. for object clusters, but I’m VERY wary of the longevity of competing 
standards and any single-source for chassies or drives.


Our products cost twice as much, but LOOK you only need 2/3 as many, and you 
get all these other benefits (performance). Plus, if you replace everything in 
2 or 3 years anyway, then you won't have to worry about them failing.


Refresh timelines. You’re funny ;) Every time, every single time, that I’ve 
worked in an organization that claims a 3- (or 5-, or whatever-) year hardware 
refresh cycle, it hasn’t happened. When you start getting close, the capex 
doesn’t materialize, or the opex for DC hands and operational oversight doesn’t. 
“How do you know that the drives will start failing or getting slower? Let’s 
revisit this in 6 months.” Etc.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Increasing QD=1 performance (lowering latency)

2021-02-02 Thread Wido den Hollander
Hi,

There are many talks and presentations out there about Ceph's
performance. Ceph is great when it comes to parallel I/O, large queue
depths and many applications sending I/O towards Ceph.

One thing where Ceph isn't the fastest is 4k blocks written at Queue
Depth 1.

Some applications benefit very much from high performance/low latency
I/O at qd=1, for example Single Threaded applications which are writing
small files inside a VM running on RBD.

With some tuning you can get to a ~700us latency for a 4k write with
qd=1 (Replication, size=3)

I benchmark this using fio:

$ fio --ioengine=librbd --bs=4k --iodepth=1 --direct=1 .. .. .. ..

700us latency means the result will be about ~1500 IOps (1000 / 0.7)
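
For reference, a more complete invocation along these lines, using fio's rbd 
engine (pool, image and client names are placeholders; the image must exist 
beforehand):

$ fio --name=qd1-write --ioengine=rbd --clientname=admin --pool=rbd \
      --rbdname=bench-image --rw=randwrite --bs=4k --iodepth=1 --direct=1 \
      --numjobs=1 --runtime=60 --time_based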

When comparing this to, let's say, a BSD machine running ZFS, that's on the
low side. With ZFS+NVMe you'll be able to reach somewhere between
7,000 and 10,000 IOps; the latency is simply much lower.

My benchmarking / test setup for this:

- Ceph Nautilus/Octopus (doesn't make a big difference)
- 3x SuperMicro 1U with:
- AMD Epyc 7302P 16-core CPU
- 128GB DDR4
- 10x Samsung PM983 3,84TB
- 10Gbit Base-T networking

Things to configure/tune:

- C-State pinning to 1
- CPU governor set to performance
- Turn off all logging in Ceph (debug_osd, debug_ms, debug_bluestore=0)
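
A sketch of how these are typically applied (the exact C-state mechanism 
depends on the platform; kernel cmdline options such as processor.max_cstate=1 
are another route):

$ sudo cpupower frequency-set -g performance
$ ceph config set osd debug_osd 0/0
$ ceph config set osd debug_ms 0/0
$ ceph config set osd debug_bluestore 0/0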

Higher clock speeds (New AMD Epyc coming in March!) help to reduce the
latency and going towards 25Gbit/100Gbit might help as well.

These are however only very small increments and might help to reduce
the latency by another 15% or so.

It doesn't bring us anywhere near the 10k IOps other applications can do.

And I totally understand that replication over a TCP/IP network takes
time and thus increases latency.

The Crimson project [0] is aiming to lower the latency with many things
like DPDK and SPDK, but this is far from finished and production ready.

In the meantime, am I overlooking something here? Can we reduce the
latency of the current OSDs further?

Reaching a ~500us latency would already be great!

Thanks,

Wido


[0]: https://docs.ceph.com/en/latest/dev/crimson/crimson/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd recommended scheduler

2021-02-01 Thread Wido den Hollander




On 28/01/2021 18:09, Andrei Mikhailovsky wrote:


Hello everyone,

Could someone please let me know what is the recommended modern kernel disk 
scheduler that should be used for SSD and HDD OSDs? The information in the 
manuals is pretty dated and refers to schedulers which have been deprecated 
in recent kernels.



Afaik noop is usually the one used for flash devices.

CFQ is used on HDDs most of the time as it allows for better scheduling/QoS.
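
On recent multi-queue kernels the old names are gone; the rough equivalents 
are 'none' (for noop) and 'bfq' or 'mq-deadline' (instead of CFQ). Checking 
and setting it looks like this (the device name is an example):

$ cat /sys/block/nvme0n1/queue/scheduler
$ echo none | sudo tee /sys/block/nvme0n1/queue/scheduler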

Wido


Thanks

Andrei
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: is unknown pg going to be active after osds are fixed?

2021-02-01 Thread Wido den Hollander




On 01/02/2021 22:48, Tony Liu wrote:

Hi,

With 3 replicas, a pg has 3 osds. If all those 3 osds are down,
the pg becomes unknown. Is that right?



Yes. As no OSD can report the status to the MONs.


If those 3 osds are replaced and back in and up, is that pg eventually
going to be active again? Or does anything else have to be done
to fix it?



If you can bring back the OSDs without wiping them: Yes

As you mention the word 'replaced' I was wondering what you mean by 
that. If you replace the disks without data recovery the PGs will be lost.


So you need to bring back the OSDs with their data intact for the PG to 
come back online.
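
Once the OSDs are back, something like this helps to see where the PG maps 
and why it is (still) stuck (the pgid is an example):

$ ceph pg map 1.2f
$ ceph pg 1.2f query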


Wido



Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Which version of Ceph fully supports CephFS Snapshot?

2021-01-13 Thread Wido den Hollander
In addition: Make sure you are using kernels with the proper fixes.

CephFS is a co-operation between the MDS, OSDs and (Kernel) clients. If
the clients are outdated they can cause all kinds of troubles.

So make sure you are able to update clients to recent versions.

Although a stock CentOS or Ubuntu kernel might allow you to mount and
use the filesystem, they aren't always the proper choice.
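
One way to see what feature level the currently connected clients report 
(output varies by release):

$ ceph features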

Wido

On 1/13/21 7:26 AM, Konstantin Shalygin wrote:
> Nautilus will be a good solution for this 
> 
> 
> k
> 
> Sent from my iPhone
> 
>> On 11 Jan 2021, at 06:25, fantastic2085  wrote:
>>
>> I would like to use the Cephfs Snapshot feature, which version of Ceph 
>> supports it
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mgr's stop responding, dropping out of cluster with _check_auth_rotating

2020-12-11 Thread Wido den Hollander




On 11/12/2020 00:12, David Orman wrote:

Hi Janek,

We realize this, we referenced that issue in our initial email. We do want
the metrics exposed by Ceph internally, and would prefer to work towards a
fix upstream. We appreciate the suggestion for a workaround, however!

Again, we're happy to provide whatever information we can that would be of
assistance. If there's some debug setting that is preferred, we are happy
to implement it, as this is currently a test cluster for us to work through
issues such as this one.



Have you tried disabling Prometheus just to see if this also fixes the 
issue for you?
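
That would be something like:

$ ceph mgr module disable prometheus
# and later, to turn it back on:
$ ceph mgr module enable prometheus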


Wido


David

On Thu, Dec 10, 2020 at 12:02 PM Janek Bevendorff <
janek.bevendo...@uni-weimar.de> wrote:


Do you have the prometheus module enabled? Turn that off, it's causing
issues. I replaced it with another ceph exporter from Github and almost
forgot about it.

Here's the relevant issue report:
https://tracker.ceph.com/issues/39264#change-179946

On 10/12/2020 16:43, Welby McRoberts wrote:

Hi Folks

We've noticed that in a cluster of 21 nodes (5 mgrs & mons, and 504 OSDs with
24 per node) the mgrs are, after an unspecified period of time, dropping
out of the cluster. The logs only show the following:

debug 2020-12-10T02:02:50.409+ 7f1005840700  0 log_channel(cluster) log [DBG] : pgmap v14163: 4129 pgs: 4129 active+clean; 10 GiB data, 31 TiB used, 6.3 PiB / 6.3 PiB avail
debug 2020-12-10T03:20:59.223+ 7f10624eb700 -1 monclient:
_check_auth_rotating possible clock skew, rotating keys expired way too
early (before 2020-12-10T02:20:59.226159+)
debug 2020-12-10T03:21:00.223+ 7f10624eb700 -1 monclient:
_check_auth_rotating possible clock skew, rotating keys expired way too
early (before 2020-12-10T02:21:00.226310+)

The _check_auth_rotating repeats approximately every second. The instances
are all syncing their time with NTP and have no issues on that front. A
restart of the mgr fixes the issue.

It appears that this may be related to
https://tracker.ceph.com/issues/39264.

The suggestion seems to be to disable prometheus metrics, however, this
obviously isn't realistic for a production environment where metrics are
critical for operations.

Please let us know what additional information we can provide to assist
in resolving this critical issue.

Cheers
Welby
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Running Mons on msgrv2/3300 only.

2020-12-09 Thread Wido den Hollander




On 08/12/2020 20:17, Wesley Dillingham wrote:

We rebuilt all of our mons in one cluster such that they bind only to port 3300 
with msgrv2. Previous to this we were binding to both 6789 and 3300. All of our 
server and client components are sufficiently new (14.2.x) and we haven’t 
observed any disruption but I am inquiring if this may be problematic for any 
unforeseen reason. We don’t intend to have any older clients connecting. 
https://docs.ceph.com/en/latest/rados/configuration/msgr2/ doesn’t mention much 
about running with only v2 so I just want to make sure we aren’t setting 
ourselves up for trouble. Thanks.




No, not really. I have a couple of those servers as well where v1 was 
disabled completely.
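
For reference, a v2-only mon_host line in ceph.conf would look roughly like 
this (addresses are examples; the exact bracket/separator handling can differ 
slightly between releases):

mon_host = [v2:192.168.0.1:3300/0],[v2:192.168.0.2:3300/0],[v2:192.168.0.3:3300/0]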


Wido


--
Respectfully,

Wes Dillingham
Site Reliability Engineer IV
Storage Engineering / Ceph
wdillingham(at)godaddy.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Not all OSDs in rack marked as down when the rack fails

2020-11-18 Thread Wido den Hollander



On 30/10/2020 11:28, Wido den Hollander wrote:



On 29/10/2020 18:58, Dan van der Ster wrote:

Hi Wido,

Could it be one of these?

mon osd min up ratio
mon osd min in ratio

36/120 is 0.3 so it might be one of those magic ratios at play.


I thought of those settings and looked at it. The weird thing is that 
all 3 racks are equal and it works as expected in the other racks. There 
all 40 OSDs are marked as down properly.


These settings should also yield log messages in the MON's log, but they 
don't. Searching for 'ratio' in the logfile doesn't show me anything.


It's weird that osd.51 was only marked as down after 15 minutes because 
it didn't send a beacon to the MON. Other OSDs kept sending reports that 
it was down, but the MONs simply didn't act on it.




See: https://tracker.ceph.com/issues/48274


2020-11-18T02:33:21.498-0500 7f849ae8c700  0 log_channel(cluster) log [DBG] : osd.58 reported failed by osd.105

2020-11-18T02:33:21.498-0500 7f849ae8c700 10 mon.CEPH2-MON1-206-U39@0(leader).osd e108196  osd.58 has 1 reporters, 82.847365 grace (20.00 + 62.8474 + 5.67647e-167), max_failed_since 2020-11-18T02:32:59.498216-0500

2020-11-18T02:33:21.498-0500 7f849ae8c700 10 mon.CEPH2-MON1-206-U39@0(leader).log v9218943  logging 2020-11-18T02:33:21.499338-0500 mon.CEPH2-MON1-206-U39 (mon.0) 165 : cluster [DBG] osd.58 reported failed by osd.105


The MONs kept adding time to the grace period.

In the end setting mon_osd_adjust_heartbeat_grace to 'false' solved it.

Why? I'm not sure yet.
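
Applied via the config database, that is simply something like:

$ ceph config set mon mon_osd_adjust_heartbeat_grace false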

root default
  rack 206
   host A
  rack 207
   host B
  rack 208
   host C

The names of the racks are 'integers' so I tried renaming them to 'r206' 
for example, but that didn't work either.


We achieved our goal, but I'm not sure why this setting was preventing 
the OSDs from being marked as down.


Wido


Wido



Cheers,

Dan


On Thu, 29 Oct 2020, 18:05 Wido den Hollander <w...@42on.com> wrote:


    Hi,

    I'm investigating an issue where 4 to 5 OSDs in a rack aren't 
marked as

    down when the network is cut to that rack.

    Situation:

    - Nautilus cluster
    - 3 racks
    - 120 OSDs, 40 per rack

    We performed a test where we turned off the network Top-of-Rack for
    each
    rack. This worked as expected with two racks, but with the third
    something weird happened.

  From the 40 OSDs which were supposed to be marked as down only 36
    were
    marked as down.

    In the end it took 15 minutes for all 40 OSDs to be marked as down.

    $ ceph config set mon mon_osd_reporter_subtree_level rack

    That setting is set to make sure that we only accept reports from 
other

    racks.

    What we saw in the logs for example:

    2020-10-29T03:49:44.409-0400 7fbda185e700 10
    mon.CEPH2-MON1-206-U39@0(leader).osd e107102  osd.51 has 54 
reporters,
    239.856038 grace (20.00 + 219.856 + 7.43801e-23), 
max_failed_since

    2020-10-29T03:47:22.374857-0400

    But osd.51 was still not marked as down after 54 reporters have
    reported
    that it is actually down.

    I checked, no ping or other traffic possible to osd.51. Host is
    unreachable.

    Another osd was marked as down, but it took a couple of minutes as 
well:


    2020-10-29T03:50:54.455-0400 7fbda185e700 10
    mon.CEPH2-MON1-206-U39@0(leader).osd e107102  osd.37 has 48 
reporters,
    221.378970 grace (20.00 + 201.379 + 6.34437e-23), 
max_failed_since

    2020-10-29T03:47:12.761584-0400
    2020-10-29T03:50:54.455-0400 7fbda185e700  1
    mon.CEPH2-MON1-206-U39@0(leader).osd e107102  we have enough 
reporters

    to mark osd.37 down

    In the end osd.51 was marked as down, but only after the MON 
decided to

    do so:

    2020-10-29T03:53:44.631-0400 7fbda185e700  0 log_channel(cluster) log
    [INF] : osd.51 marked down after no beacon for 903.943390 seconds
    2020-10-29T03:53:44.631-0400 7fbda185e700 -1
    mon.CEPH2-MON1-206-U39@0(leader).osd e107104 no beacon from osd.51
    since
    2020-10-29T03:38:40.689062-0400, 903.943390 seconds ago.  marking 
down


    I haven't seen this happen before in any cluster. It's also strange
    that
    this only happens in this rack, the other two racks work fine.

    ID    CLASS  WEIGHT      TYPE NAME
    -1         1545.35999  root default

    -206          515.12000      rack 206

    -7           27.94499          host CEPH2-206-U16
    ...
    -207          515.12000      rack 207

   -17           27.94499          host CEPH2-207-U16
    ...
    -208          515.12000      rack 208

   -31           27.94499          host CEPH2-208-U16
    ...

    That's how the CRUSHMap looks like. Straight forward and 3x 
replication

    over 3 racks.

    This issue only occurs in rack *207*.

    Has anybody seen this before or knows where to start?

    Wido
    ___

[ceph-users] Re: Updating client caps online

2020-11-03 Thread Wido den Hollander




On 03/11/2020 10:02, Dan van der Ster wrote:

Hi all,

We still have legacy caps on our nautilus rbd cluster. I just wanted
to check if this is totally safe (and to post here ftr because I don't
think this has ever been documented)

Here are the current caps:

[client.images]
key = xxx
caps mgr = "allow r"
caps mon = "allow r, allow command \"osd blacklist\""
caps osd = "allow class-read object_prefix rbd_children, allow rwx pool=images"

[client.volumes]
key = xxx
caps mgr = "allow r"
caps mon = "allow r, allow command \"osd blacklist\""
caps osd = "allow class-read object_prefix rbd_children, allow rwx
pool=volumes, allow rx pool=images, allow rwx pool=cinder-critical"

Now that we upgraded to nautilus we would do:

# ceph auth caps client.images mon 'profile rbd' osd 'profile rbd
pool=images' mgr 'profile rbd pool=images'
# ceph auth caps client.volumes mon 'profile rbd' osd 'profile rbd
pool=volumes, profile rbd-read-only pool=images, profile rbd
pool=cinder-critical' mgr 'profile rbd pool=volumes, profile rbd
pool=cinder-critical'

Does that look correct? Does this apply without impacting any client IOs ?



Yes, it looks correct, but what I usually do:

$ ceph auth get client.images -o client.images
$ cp client.images client.images.org
$ edit the client.images file
$ diff -u client.images client.images.org
$ ceph auth import -i client.images

This way I also have a way of reverting quickly if things do go wrong.

What I also did is import the key under a different name (e.g. client.images2) and 
test whether I could manually perform RBD operations with the 'rbd' CLI tool.
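
A sketch of that test, with the temporary name and keyring path as examples:

$ ceph auth get-or-create client.images2 mon 'profile rbd' osd 'profile rbd pool=images' mgr 'profile rbd pool=images' -o /etc/ceph/ceph.client.images2.keyring
$ rbd --id images2 ls images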


Warning: IF you make a mistake (and I have seen this happen!) Ceph will 
start returning 'Operation Not Permitted' to librados, which then causes 
I/O errors inside librbd. Your VMs will go read-only as filesystems 
break and will probably need an fsck to recover.


So triple-check your work before doing this. But if done properly it can 
be done online.


Wido


Thanks!

Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERNAL] Re: 14.2.12 breaks mon_host pointing to Round Robin DNS entry

2020-11-02 Thread Wido den Hollander



On 31/10/2020 11:16, Sasha Litvak wrote:

Hello everyone,

Assuming that backport has been merged for few days now,  is there a 
chance that 14.2.13 will be released? >


On the dev list it was posted that .13 will be released this week.

Wido



On Fri, Oct 23, 2020, 6:03 AM Van Alstyne, Kenneth 
<kenneth.vanalst...@perspecta.com> wrote:


Jason/Wido, et al:
      I was hitting this exact problem when attempting to update
from 14.2.11 to 14.2.12.  I reverted the two commits associated with
that pull request and was able to successfully upgrade to 14.2.12. 
Everything seems normal, now.



Thanks,

--
Kenneth Van Alstyne
Systems Architect
M: 804.240.2327
14291 Park Meadow Drive, Chantilly, VA 20151
perspecta


From: Jason Dillaman <jdill...@redhat.com>
Sent: Thursday, October 22, 2020 12:54 PM
To: Wido den Hollander <w...@42on.com>
Cc: ceph-users@ceph.io
Subject: [EXTERNAL] [ceph-users] Re: 14.2.12 breaks mon_host
pointing to Round Robin DNS entry

This backport [1] looks suspicious as it was introduced in v14.2.12
and directly changes the initial MonMap code. If you revert it in a
dev build does it solve your problem?

[1] https://github.com/ceph/ceph/pull/36704

    On Thu, Oct 22, 2020 at 12:39 PM Wido den Hollander <w...@42on.com> wrote:
 >
 > Hi,
 >
 > I already submitted a ticket: https://tracker.ceph.com/issues/47951
 >
 > Maybe other people noticed this as well.
 >
 > Situation:
 > - Cluster is running IPv6
 > - mon_host is set to a DNS entry
 > - DNS entry is a Round Robin with three -records
 >
 > root@wido-standard-benchmark:~# ceph -s
 > unable to parse addrs in 'mon.objects.xx.xxx.net'
 > [errno 22] error connecting to the cluster
 > root@wido-standard-benchmark:~#
 >
 > The relevant part of the ceph.conf:
 >
 > [global]
 > auth_client_required = cephx
 > auth_cluster_required = cephx
 > auth_service_required = cephx
 > mon_host = mon.objects.xxx.xxx.xxx
 > ms_bind_ipv6 = true
 >
 > This works fine with 14.2.11 and breaks under 14.2.12
 >
 > Anybody else seeing this as well?
 >
 > Wido
 > ___
 > ceph-users mailing list -- ceph-users@ceph.io
 > To unsubscribe send an email to ceph-users-le...@ceph.io
 >


--
Jason
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Not all OSDs in rack marked as down when the rack fails

2020-10-30 Thread Wido den Hollander



On 29/10/2020 18:58, Dan van der Ster wrote:

Hi Wido,

Could it be one of these?

mon osd min up ratio
mon osd min in ratio

36/120 is 0.3 so it might be one of those magic ratios at play.


I thought of those settings and looked at it. The weird thing is that 
all 3 racks are equal and it works as expected in the other racks. There 
all 40 OSDs are marked as down properly.


These settings should also yield log messages in the MON's log, but they 
don't. Searching for 'ratio' in the logfile doesn't show me anything.


It's weird that osd.51 was only marked as down after 15 minutes because 
it didn't send a beacon to the MON. Other OSDs kept sending reports that 
it was down, but the MONs simply didn't act on it.


Wido



Cheers,

Dan


On Thu, 29 Oct 2020, 18:05 Wido den Hollander <w...@42on.com> wrote:


Hi,

I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as
down when the network is cut to that rack.

Situation:

- Nautilus cluster
- 3 racks
- 120 OSDs, 40 per rack

We performed a test where we turned off the network Top-of-Rack for
each
rack. This worked as expected with two racks, but with the third
something weird happened.

  From the 40 OSDs which were supposed to be marked as down only 36
were
marked as down.

In the end it took 15 minutes for all 40 OSDs to be marked as down.

$ ceph config set mon mon_osd_reporter_subtree_level rack

That setting is set to make sure that we only accept reports from other
racks.

What we saw in the logs for example:

2020-10-29T03:49:44.409-0400 7fbda185e700 10
mon.CEPH2-MON1-206-U39@0(leader).osd e107102  osd.51 has 54 reporters,
239.856038 grace (20.00 + 219.856 + 7.43801e-23), max_failed_since
2020-10-29T03:47:22.374857-0400

But osd.51 was still not marked as down after 54 reporters have
reported
that it is actually down.

I checked, no ping or other traffic possible to osd.51. Host is
unreachable.

Another osd was marked as down, but it took a couple of minutes as well:

2020-10-29T03:50:54.455-0400 7fbda185e700 10
mon.CEPH2-MON1-206-U39@0(leader).osd e107102  osd.37 has 48 reporters,
221.378970 grace (20.00 + 201.379 + 6.34437e-23), max_failed_since
2020-10-29T03:47:12.761584-0400
2020-10-29T03:50:54.455-0400 7fbda185e700  1
mon.CEPH2-MON1-206-U39@0(leader).osd e107102  we have enough reporters
to mark osd.37 down

In the end osd.51 was marked as down, but only after the MON decided to
do so:

2020-10-29T03:53:44.631-0400 7fbda185e700  0 log_channel(cluster) log
[INF] : osd.51 marked down after no beacon for 903.943390 seconds
2020-10-29T03:53:44.631-0400 7fbda185e700 -1
mon.CEPH2-MON1-206-U39@0(leader).osd e107104 no beacon from osd.51
since
2020-10-29T03:38:40.689062-0400, 903.943390 seconds ago.  marking down

I haven't seen this happen before in any cluster. It's also strange
that
this only happens in this rack, the other two racks work fine.

ID    CLASS  WEIGHT      TYPE NAME
    -1         1545.35999  root default

-206          515.12000      rack 206

    -7           27.94499          host CEPH2-206-U16
...
-207          515.12000      rack 207

   -17           27.94499          host CEPH2-207-U16
...
-208          515.12000      rack 208

   -31           27.94499          host CEPH2-208-U16
...

That's how the CRUSHMap looks like. Straight forward and 3x replication
over 3 racks.

This issue only occurs in rack *207*.

Has anybody seen this before or knows where to start?

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: monitor sst files continue growing

2020-10-30 Thread Wido den Hollander



On 29/10/2020 19:29, Zhenshi Zhou wrote:

Hi Alex,

We found that there were a huge number of keys in the "logm" and "osdmap"
tables while using ceph-monstore-tool. I think that could be the root cause.



But that is exactly how Ceph works. It might need that very old OSDMap 
to get all the PGs clean again, for example for an OSD which has been gone 
for a very long time and needs to catch up before a PG can become clean.


If not all PGs are active+clean, you can and will see the MON databases 
grow rapidly.
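
A quick way to see how much OSDMap history the MONs are holding on to 
(field names may vary slightly between releases):

$ ceph report 2>/dev/null | grep osdmap_.*_committed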


Therefore I always deploy 1TB SSDs in all Monitors. They're not expensive 
anymore and they give breathing room.


I always deploy physical and dedicated machines for Monitors just to 
prevent these cases.


Wido


Well, some pages also say that disabling the 'insight' module can resolve this
issue, but I checked our cluster and we didn't enable this module. Check this
page.

Anyway, our cluster is unhealthy though; it just needs time to keep recovering
data :)

Thanks

Alex Gracie  wrote on Thursday, October 29, 2020 at 10:57 PM:


We hit this issue over the weekend on our HDD backed EC Nautilus cluster
while removing a single OSD. We also did not have any luck using
compaction. The mon-logs filled up our entire root disk on the mon servers
and we were running on a single monitor for hours while we tried to finish
recovery and reclaim space. The past couple of weeks we also noticed "pg not
scrubbed in time" errors but are unsure if they are related. I'm still unsure of the
exact cause of this (other than the general misplaced/degraded objects) and
what kind of growth is acceptable for these store.db files.

In order to get our downed mons restarted, we ended up backing up and
coping the /var/lib/ceph/mon/* contents to a remote host, setting up an
sshfs mount to that new host with large NVME and SSDs, ensuring the mount
paths were owned by ceph, then clearing up enough space on the monitor host
to start the service. This allowed our store.db directory to grow freely
until the misplaced/degraded objects could recover and monitors all
rejoined eventually.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fix PGs states

2020-10-30 Thread Wido den Hollander



On 30/10/2020 05:20, Ing. Luis Felipe Domínguez Vega wrote:
Great, and thanks. I fixed all the unknown PGs with the command; now the 
incomplete, down, etc. remain.




Start with a query:

$ ceph pg <pgid> query

That will tell you why it's down and incomplete.

The force-create-pg has probably corrupted and destroyed data in your 
cluster.


PGs should recover themselves if all OSDs are back. If not then 
something is very wrong and you need to find the root-cause.


PGs not becoming clean is only the result of an underlying problem.

Wido


On 2020-10-29 23:57, 胡 玮文 wrote:

Hi,

I have not tried it, but maybe this will help with the unknown PGs, if
you don't care about data loss.


ceph osd force-create-pg <pgid>


On Oct 30, 2020 at 10:46, Ing. Luis Felipe Domínguez Vega wrote:

Hi:

I have this ceph status:
- 


cluster:
   id: 039bf268-b5a6-11e9-bbb7-d06726ca4a78
   health: HEALTH_WARN
   noout flag(s) set
   1 osds down
   Reduced data availability: 191 pgs inactive, 2 pgs down, 35
pgs incomplete, 290 pgs stale
   5 pgs not deep-scrubbed in time
   7 pgs not scrubbed in time
   327 slow ops, oldest one blocked for 233398 sec, daemons
[osd.12,osd.36,osd.5] have slow ops.

 services:
   mon: 1 daemons, quorum fond-beagle (age 23h)
   mgr: fond-beagle(active, since 7h)
   osd: 48 osds: 45 up (since 95s), 46 in (since 8h); 4 remapped pgs
    flags noout

 data:
   pools:   7 pools, 2305 pgs
   objects: 350.37k objects, 1.5 TiB
   usage:   3.0 TiB used, 38 TiB / 41 TiB avail
   pgs: 6.681% pgs unknown
    1.605% pgs not active
    1835 active+clean
    279  stale+active+clean
    154  unknown
    22   incomplete
    10   stale+incomplete
    2    down
    2    remapped+incomplete
    1    stale+remapped+incomplete
 



How can I fix all of the unknown, incomplete, remapped+incomplete, etc...
I don't care if I need to remove PGs.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Corrupted RBD image

2020-10-30 Thread Wido den Hollander



On 30/10/2020 06:09, Ing. Luis Felipe Domínguez Vega wrote:

Hi:

I tried get info from a RBD image but:

-
root@fond-beagle:/# rbd list --pool cinder-ceph | grep 
volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda

volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda


root@fond-beagle:/# rbd info --pool cinder-ceph 
volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda
rbd: error opening image volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda: 
(2) No such file or directory

--

Is this a case where the metadata shows the image but the content was removed?


Sometimes it's easier to run with --debug-rados=20

Then you can see what it tries to fetch from RADOS and where it fails. 
That might tell you more.
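
For example, something along these lines (image name taken from above):

$ rbd --debug-rados=20 info --pool cinder-ceph volume-dfcca6c8-cb96-4b79-bc85-b200a061dcda

That should show which RADOS objects (the rbd_id.* and rbd_header.* objects) it 
tries to read and on which one it gets the ENOENT.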


Wido


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Not all OSDs in rack marked as down when the rack fails

2020-10-29 Thread Wido den Hollander

Hi,

I'm investigating an issue where 4 to 5 OSDs in a rack aren't marked as 
down when the network is cut to that rack.


Situation:

- Nautilus cluster
- 3 racks
- 120 OSDs, 40 per rack

We performed a test where we turned off the network Top-of-Rack for each 
rack. This worked as expected with two racks, but with the third 
something weird happened.


From the 40 OSDs which were supposed to be marked as down only 36 were 
marked as down.


In the end it took 15 minutes for all 40 OSDs to be marked as down.

$ ceph config set mon mon_osd_reporter_subtree_level rack

That setting is set to make sure that we only accept reports from other 
racks.
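
To double-check what the monitors are actually running with, the admin socket 
can be queried on a monitor host, for example:

$ ceph daemon mon.<id> config get mon_osd_reporter_subtree_level
$ ceph daemon mon.<id> config get mon_osd_min_down_reporters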


What we saw in the logs for example:

2020-10-29T03:49:44.409-0400 7fbda185e700 10 
mon.CEPH2-MON1-206-U39@0(leader).osd e107102  osd.51 has 54 reporters, 
239.856038 grace (20.00 + 219.856 + 7.43801e-23), max_failed_since 
2020-10-29T03:47:22.374857-0400


But osd.51 was still not marked as down after 54 reporters had reported 
that it was actually down.


I checked, no ping or other traffic possible to osd.51. Host is unreachable.

Another osd was marked as down, but it took a couple of minutes as well:

2020-10-29T03:50:54.455-0400 7fbda185e700 10 
mon.CEPH2-MON1-206-U39@0(leader).osd e107102  osd.37 has 48 reporters, 
221.378970 grace (20.00 + 201.379 + 6.34437e-23), max_failed_since 
2020-10-29T03:47:12.761584-0400
2020-10-29T03:50:54.455-0400 7fbda185e700  1 
mon.CEPH2-MON1-206-U39@0(leader).osd e107102  we have enough reporters 
to mark osd.37 down


In the end osd.51 was marked as down, but only after the MON decided to 
do so:


2020-10-29T03:53:44.631-0400 7fbda185e700  0 log_channel(cluster) log 
[INF] : osd.51 marked down after no beacon for 903.943390 seconds
2020-10-29T03:53:44.631-0400 7fbda185e700 -1 
mon.CEPH2-MON1-206-U39@0(leader).osd e107104 no beacon from osd.51 since 
2020-10-29T03:38:40.689062-0400, 903.943390 seconds ago.  marking down


I haven't seen this happen before in any cluster. It's also strange that 
this only happens in this rack, the other two racks work fine.


ID    CLASS  WEIGHT      TYPE NAME
  -1         1545.35999  root default
-206          515.12000      rack 206
  -7           27.94499          host CEPH2-206-U16
...
-207          515.12000      rack 207
 -17           27.94499          host CEPH2-207-U16
...
-208          515.12000      rack 208
 -31           27.94499          host CEPH2-208-U16
...

That's what the CRUSH map looks like. Straightforward, with 3x replication 
over 3 racks.


This issue only occurs in rack *207*.

Has anybody seen this before or knows where to start?

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 14.2.12 breaks mon_host pointing to Round Robin DNS entry

2020-10-22 Thread Wido den Hollander
Hi,

I already submitted a ticket: https://tracker.ceph.com/issues/47951

Maybe other people noticed this as well.

Situation:
- Cluster is running IPv6
- mon_host is set to a DNS entry
- DNS entry is a Round Robin with three AAAA-records

root@wido-standard-benchmark:~# ceph -s
unable to parse addrs in 'mon.objects.xx.xxx.net'
[errno 22] error connecting to the cluster
root@wido-standard-benchmark:~#

The relevant part of the ceph.conf:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
mon_host = mon.objects.xxx.xxx.xxx
ms_bind_ipv6 = true

This works fine with 14.2.11 and breaks under 14.2.12
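
As a sanity check, the records themselves can be verified with something like:

$ dig +short AAAA mon.objects.xxx.xxx.xxx

which should return the three addresses (hostname masked as in the config above).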

Anybody else seeing this as well?

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Nautilus RGW fails to open Jewel buckets (400 Bad Request)

2020-10-09 Thread Wido den Hollander

Hi,

Most of it is described here: https://tracker.ceph.com/issues/22928

Buckets created under Jewel don't always have the *placement_rule* set 
in their bucket metadata and this causes Nautilus RGWs to not serve 
requests for them.


Snippet from the metadata:

{
"key": "bucket.instance:pbx:ams02.446941181.1",
"ver": {
"tag": "86lc3iVtQpPiJYkh95YCTnhu",
"ver": 2
},
"mtime": "2020-10-09 09:12:04.744423Z",
"data": {
"bucket_info": {
"bucket": {
"name": "pbx",
"marker": "ams02.241978.4",
"bucket_id": "ams02.446941181.1",
"tenant": "",
"explicit_placement": {
"data_pool": ".rgw.buckets",
"data_extra_pool": "",
"index_pool": ".rgw.buckets"
}
},
"creation_time": "2014-02-16 12:32:15.00Z",
"owner": "vdvm",
"flags": 0,
"zonegroup": "eu",
"placement_rule": "",

Notice that *placement_rule* is empty and that this bucket has 
*explicit_placement* set.


There is no way to update the bucket.instance metadata as far as I know, 
otherwise I could have set a placement rule for the bucket.
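
(For reference, the metadata snippet above can be dumped with something like:

$ radosgw-admin metadata get bucket.instance:pbx:ams02.446941181.1

using the bucket name and bucket_id shown above.)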


Earlier on the ML this has been discussed: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ULKK5RU2VXLFXNUJMZBMUG7CQ5UCWJCB/#R6CPZ2TEWRFL2JJWP7TT5GX7DPSV5S7Z


People there compiled a manual version of RGW, something I'd rather stay 
away from.


Has anybody seen this and if so: Have you found a solution?

The commit that breaks these buckets is this one: 
https://github.com/ceph/ceph/commit/2a8e8a98d8c56cc374ec671846a20e2b0484bc75


14.2.0 was the first release with that code in there.

So two things I'm thinking about and I don't know which one is best:

- Update RGW and modify the if-statement added by commit 2a8e8a
- Enhance 'bucket check --fix' to update the placement_rule if none is 
set for a bucket


Any hints or suggestions?

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: another osd_pglog memory usage incident

2020-10-07 Thread Wido den Hollander




On 07/10/2020 16:00, Dan van der Ster wrote:

On Wed, Oct 7, 2020 at 3:29 PM Wido den Hollander  wrote:




On 07/10/2020 14:08, Dan van der Ster wrote:

Hi all,

This morning some osds in our S3 cluster started going OOM, after
restarting them I noticed that the osd_pglog is using >1.5GB per osd.
(This is on an osd with osd_memory_target = 2GB, hosting 112PGs, all
PGs are active+clean).

After reading through this list and trying a few things, I'd like to
share the following observations for your feedback:

1. The pg log contains 3000 entries by default (on nautilus). These
3000 entries can legitimately consume gigabytes of ram for some
use-cases. (I haven't determined exactly which ops triggered this
today).
2. The pg log length is decided by the primary osd -- setting
osd_max_pg_log_entries/osd_min_pg_log_entries on one single OSD does
not have a big effect (because most of the PGs are primaried somewhere
else). You need to set it on all the osds for it to be applied to all
PGs.
3. We eventually set osd_max_pg_log_entries = 500 everywhere. This
decreased the osd_pglog mempool from more than 1.5GB on our largest
osds to less that 500MB.
4. The osd_pglog mempool is not accounted for in the osd_memory_target
(in nautilus).
5. I have opened a feature request to limit the pg_log length by
memory size (https://tracker.ceph.com/issues/47775). This way we could
allocate a fraction of memory to the pg log and it would shorten the
pglog length (budget) accordingly.
6. Would it be feasible to add an osd option to 'trim pg log at boot'
? This way we could avoid the cumbersome ceph-objectstore-tool
trim-pg-log in cases of disaster (osds going oom at boot).

For those that had pglog memory usage incidents -- does this match
your experience?


Not really. I have an active case where reducing the pglog length works for
a short period, after which memory consumption grows again.

These OSDs however show data being used in buffer anon which is probably
something different.


Well in fact at the very beginning of this incident we had excessive
buffer_anon -- and I only rebooted the osds a couple hours ago and
buffer_anon might indeed be growing still:

# ceph daemon osd.245 dump_mempools | jq .mempool.by_pool.buffer_anon
{
   "items": 36762,
   "bytes": 436869187
}

Did you have any clues yet what is triggering that? How do you work around?


In this case writing to the RGW seems to keep it workable. If we stop 
writing to RADOS the OSDs' memory explodes and they OOM.


We do not have a clue or solution yet.

In this case we also see a lot of BlueFS spillovers and RocksDB growing 
almost unbounded, a lot of compactions are required to keep it working.
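
For reference, a manual compaction can be triggered through the admin socket, 
e.g. (the OSD id is just an example):

$ ceph daemon osd.245 compact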



Is there a tracker for this?


No, not yet. We do have a couple of messages on the ML about this.

Wido



-- dan




Regarding the trim on boot, that sounds feasible. I already added a
'compact on boot' setting, but trimming all PGs on boot should be
doable. It loads all the PGs and at that point they can be trimmed.

Wido



Thanks!

Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: another osd_pglog memory usage incident

2020-10-07 Thread Wido den Hollander




On 07/10/2020 14:08, Dan van der Ster wrote:

Hi all,

This morning some osds in our S3 cluster started going OOM, after
restarting them I noticed that the osd_pglog is using >1.5GB per osd.
(This is on an osd with osd_memory_target = 2GB, hosting 112PGs, all
PGs are active+clean).

After reading through this list and trying a few things, I'd like to
share the following observations for your feedback:

1. The pg log contains 3000 entries by default (on nautilus). These
3000 entries can legitimately consume gigabytes of ram for some
use-cases. (I haven't determined exactly which ops triggered this
today).
2. The pg log length is decided by the primary osd -- setting
osd_max_pg_log_entries/osd_min_pg_log_entries on one single OSD does
not have a big effect (because most of the PGs are primaried somewhere
else). You need to set it on all the osds for it to be applied to all
PGs.
3. We eventually set osd_max_pg_log_entries = 500 everywhere. This
decreased the osd_pglog mempool from more than 1.5GB on our largest
osds to less that 500MB.
4. The osd_pglog mempool is not accounted for in the osd_memory_target
(in nautilus).
5. I have opened a feature request to limit the pg_log length by
memory size (https://tracker.ceph.com/issues/47775). This way we could
allocate a fraction of memory to the pg log and it would shorten the
pglog length (budget) accordingly.
6. Would it be feasible to add an osd option to 'trim pg log at boot'
? This way we could avoid the cumbersome ceph-objectstore-tool
trim-pg-log in cases of disaster (osds going oom at boot).

For those that had pglog memory usage incidents -- does this match
your experience?


Not really. I have an active case where reducing the pglog length works for 
a short period, after which memory consumption grows again.


These OSDs however show data being used in buffer anon which is probably 
something different.


Regarding the trim on boot, that sounds feasible. I already added a 
'compact on boot' setting, but trimming all PGs on boot should be 
doable. It loads all the PGs and at that point they can be trimmed.
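
For reference, the offline trim mentioned above looks roughly like this, with 
the OSD stopped (path and pgid are examples):

$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
  --pgid 7.1a --op trim-pg-log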


Wido



Thanks!

Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recover pgs from failed osds

2020-09-07 Thread Wido den Hollander



On 04/09/2020 13:50, Eugen Block wrote:

Hi,

Wido had an idea in a different thread [1], you could try to advise the 
OSDs to compact at boot:


[osd]
osd_compact_on_start = true


This is in master only, not yet in any release.



Can you give that a shot?

Wido also reported something about large OSD memory in [2], but noone 
commented yet.




Still seeing that problem indeed. haven't been able to solve it.

Wido


Regards,
Eugen


[1] 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/EDL7U5EWFHSFK5IIBRBNAIXX7IFWR5QK/ 

[2] 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/F5MOI47FIVSFHULNNPWEJAY6LLDOVUJQ/ 




Zitat von Vahideh Alinouri :


Is there any solution or advice?

On Tue, Sep 1, 2020, 11:53 AM Vahideh Alinouri 


wrote:


One of the failed OSDs with 3G RAM started, and dump_mempools shows total RAM
usage is 18G and buffer_anon uses 17G RAM!

On Mon, Aug 31, 2020 at 6:24 PM Vahideh Alinouri <
vahideh.alino...@gmail.com> wrote:


osd_memory_target of failed osd in one ceph-osd node changed to 6G but
other osd_memory_target is 3G, starting failed osd with 6G 
memory_target

causes other osd "down" in ceph-osd node! and failed osd is still down.

On Mon, Aug 31, 2020 at 2:19 PM Eugen Block  wrote:


Can you try the opposite and turn up the memory_target and only try to
start a single OSD?


Zitat von Vahideh Alinouri :

> osd_memory_target is changed to 3G, starting failed osd causes 
ceph-osd

> nodes crash! and failed osd is still "down"
>
> On Fri, Aug 28, 2020 at 1:13 PM Vahideh Alinouri <
vahideh.alino...@gmail.com>
> wrote:
>
>> Yes, each osd node has 7 osds with 4 GB memory_target.
>>
>>
>> On Fri, Aug 28, 2020, 12:48 PM Eugen Block  wrote:
>>
>>> Just to confirm, each OSD node has 7 OSDs with 4 GB memory_target?
>>> That leaves only 4 GB RAM for the rest, and in case of heavy 
load the
>>> OSDs use even more. I would suggest to reduce the memory_target 
to 3

>>> GB and see if they start successfully.
>>>
>>>
>>> Zitat von Vahideh Alinouri :
>>>
>>> > osd_memory_target is 4294967296.
>>> > Cluster setup:
>>> > 3 mon, 3 mgr, 21 osds on 3 ceph-osd nodes in lvm scenario.
ceph-osd
>>> nodes
>>> > resources are 32G RAM - 4 core CPU - osd disk 4TB - 9 osds have
>>> > block.wal on SSDs.  Public network is 1G and cluster network is
10G.
>>> > Cluster installed and upgraded using ceph-ansible.
>>> >
>>> > On Thu, Aug 27, 2020 at 7:01 PM Eugen Block  
wrote:

>>> >
>>> >> What is the memory_target for your OSDs? Can you share more
details
>>> >> about your setup? You write about high memory, are the OSD 
nodes

>>> >> affected by OOM killer? You could try to reduce the
osd_memory_target
>>> >> and see if that helps bring the OSDs back up. Splitting the PGs
is a
>>> >> very heavy operation.
>>> >>
>>> >>
>>> >> Zitat von Vahideh Alinouri :
>>> >>
>>> >> > Ceph cluster is updated from nautilus to octopus. On ceph-osd
nodes
>>> we
>>> >> have
>>> >> > high I/O wait.
>>> >> >
>>> >> > After increasing one of pool’s pg_num from 64 to 128 
according

to
>>> warning
>>> >> > message (more objects per pg), this lead to high cpu load and
ram
>>> usage
>>> >> on
>>> >> > ceph-osd nodes and finally crashed the whole cluster. Three
osds,
>>> one on
>>> >> > each host, stuck at down state (osd.34 osd.35 osd.40).
>>> >> >
>>> >> > Starting the down osd service causes high ram usage and cpu
load and
>>> >> > ceph-osd node to crash until the osd service fails.
>>> >> >
>>> >> > The active mgr service on each mon host will crash after
consuming
>>> almost
>>> >> > all available ram on the physical hosts.
>>> >> >
>>> >> > I need to recover pgs and solving corruption. How can i 
recover

>>> unknown
>>> >> and
>>> >> > down pgs? Is there any way to starting up failed osd?
>>> >> >
>>> >> >
>>> >> > Below steps are done:
>>> >> >
>>> >> > 1- osd nodes’ kernel was upgraded to 5.4.2 before ceph 
cluster

>>> upgrading.
>>> >> > Reverting to previous kernel 4.2.1 is tested for iowate
decreasing,
>>> but
>>> >> it
>>> >> > had no effect.
>>> >> >
>>> >> > 2- Recovering 11 pgs on failed osds by export them using
>>> >> > ceph-objectstore-tools utility and import them on other osds.
The
>>> result
>>> >> > followed: 9 pgs are “down” and 2 pgs are “unknown”.
>>> >> >
>>> >> > 2-1) 9 pgs export and import successfully but status is 
“down”

>>> because of
>>> >> > "peering_blocked_by" 3 failed osds. I cannot lost osds 
because

of
>>> >> > preventing unknown pgs from getting lost. pgs size in K 
and M.

>>> >> >
>>> >> > "peering_blocked_by": [
>>> >> >
>>> >> > {
>>> >> >
>>> >> > "osd": 34,
>>> >> >
>>> >> > "current_lost_at": 0,
>>> >> >
>>> >> > "comment": "starting or marking this osd lost may let us
proceed"
>>> >> >
>>> >> > },
>>> >> >
>>> >> > {
>>> >> >
>>> >> > "osd": 35,
>>> >> >
>>> >> > "current_lost_at": 0,
>>> >> >
>>> >> > "comment": "starting or marking this osd lost may let us
proceed"
>>> >> >
>>> >> > },
>>> >> >
>>> >> > {
>>> >> >
>>> >> > "osd": 40,
>>> >> >
>>> >> > "current_l

[ceph-users] Re: Change fsid of Ceph cluster after splitting it into two clusters

2020-09-03 Thread Wido den Hollander



On 9/3/20 3:55 PM, Dan van der Ster wrote:
> Hi Wido,
> 
> Out of curiosity, did you ever work out how to do this?

Nope, never did this. So there are two clusters running with the same
fsid :-)

Wido

> 
> Cheers, Dan
> 
> On Tue, Feb 12, 2019 at 6:17 PM Wido den Hollander  wrote:
>>
>> Hi,
>>
>> I've got a situation where I need to split a Ceph cluster into two.
>>
>> This cluster is currently running a mix of RBD and RGW and in this case
>> I am splitting it into two different clusters.
>>
>> A difficult thing to do, but it's possible.
>>
>> One problem that stays though is that after the split both Ceph clusters
>> have the same fsid and that might be confusing.
>>
>> Is there a way to change the fsid of an existing cluster?
>>
>> Injecting an updated MONMAP and OSDMAP into the cluster?
>>
>> It's no problem if this has to be done offline, but I'm just wondering
>> if this is possible.
>>
>> Wido
>> ___
>> ceph-users mailing list
>> ceph-us...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Messenger v2 and IPv6-only still seems to prefer IPv4 (OSDs stuck in booting state)

2020-09-03 Thread Wido den Hollander
Hi,

Last night I spent a couple of hours debugging an issue where OSDs
would be marked as 'up', but then PGs stayed in the 'peering' state.

Looking through the admin socket I saw these OSDs were in the 'booting'
state.

Looking at the OSDMap I saw this:

osd.3 up   in  weight 1 up_from 26 up_thru 700 down_at 0
last_clean_interval [0,0)
[v2:[2a05:xx0:700:2::7]:6816/7923,v1:[2a05:xx:700:2::7]:6817/7923,v2:0.0.0.0:6818/7923,v1:0.0.0.0:6819/7923]
[v2:[2a05:xx:700:2::7]:6820/7923,v1:[2a05:1500:700:2::7]:6821/7923,v2:0.0.0.0:6822/7923,v1:0.0.0.0:6823/7923]
exists,up 786d3e9d-047f-4b09-b368-db9e8dc0805d

In ceph.conf this was set:

ms_bind_ipv6 = true
public_addr = 2a05:xx:700:2::6

On true IPv6-only nodes this works fine. But on nodes where there is
also IPv4 present this can (and will?) cause problems.

I did not use tcpdump/wireshark to investigate, but it seems that the
OSDs tried to contact each other using the 0.0.0.0 IPv4 address.

After adding these settings the problems were resolved:

ms_bind_msgr1 = false
ms_bind_ipv4 = false

This also disables msgrv1 as we didn't need it here. A cluster and
clients all running Octopus.

The OSDMap now showed:

osd.3 up   in  weight 1 up_from 704 up_thru 712 down_at 702
last_clean_interval [26,701) v2:[2a05:xx:700:2::7]:6804/791503
v2:[2a05:xx:700:2::7]:6805/791503 exists,up
786d3e9d-047f-4b09-b368-db9e8dc0805d

OSDs came back right away, PGs peered and the problems were resolved.

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it possible to change the cluster network on a production ceph?

2020-09-03 Thread Wido den Hollander



On 9/3/20 3:38 PM, pso...@alticelabs.com wrote:
> Hello people,
>I am trying to change the cluster network in a production ceph. I'm having 
> problems, after changing the ceph.conf file and restarting a osd the cluster 
> is always going to HEALTH_ERROR with blocked requests. Only by returning to 
> the previous configuration and restarting the same osd make the cluster going 
> to OK. So, my question is: is it possible to change the cluster network on a 
> production ceph without stopping the service and destroying data? If so, how?
> 

Short answer: Yes, Yes, Yes

How? That totally depends. Keep in mind that Ceph just expects that all
daemons can talk to each other over IP.

As long as you make sure the routing works and the firewalls allow the
connections, this can be done.

How exactly depends on how you have set up your network configuration.
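
As a rough sketch (the subnet is just an example), the change itself is only a
config update plus a rolling restart of the OSDs, one at a time:

[global]
cluster_network = 192.168.40.0/24

$ systemctl restart ceph-osd@<id>

The important part is that during the transition the OSDs on the old and the
new network can still reach each other.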

Wido

> Anticipated thanks
>   Paulo Sousa
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD memory (buffer_anon) grows once writing stops

2020-09-02 Thread Wido den Hollander

Hi,

The cluster I'm writing about has a long history (months) of instability 
mainly related to large RocksDB database and high memory consumption.


The use-case is RGW with an EC8+3 pool for data.

In the last months this cluster has been suffering from OSDs using much 
more memory than osd_memory_target, mainly allocated in buffer_anon.


After removing a lot of data from the cluster and re-installing all OSDs 
there is one thing remaining: High memory usage when *NOT* writing data 
to the cluster.


There is a script running which keeps writing data to RADOS at a slow 
pace. Once this stops we observe the memory usage of the OSDs grow 
steadily and also see the RocksDB databases of the BlueStore OSDs grow.


Once we start to write again, memory usage (buffer_anon) reduces.

I think this is related to the pglogs, but even trimming all the pglogs 
does not solve this issue.
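
The growth is easy to watch through the admin socket, e.g.:

$ ceph daemon osd.0 dump_mempools | jq '.mempool.by_pool.buffer_anon'

(osd.0 just as an example.)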


Has anybody seen this before or has any clues where to start looking?

Ceph version 14.2.8

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: slow "rados ls"

2020-09-02 Thread Wido den Hollander




On 02/09/2020 12:07, Stefan Kooman wrote:

On 2020-09-01 10:51, Marcel Kuiper wrote:

As a matter of fact we did. We doubled the storage nodes from 25 to 50.
Total osds now 460.

You want to share your thoughts on that?


Yes. We observed the same thing with expansions. The OSDs will be very
busy (with multiple threads per OSD) on housekeeping after the OMAP data
has been moved to another OSD (and eating up all CPU power it can get).
But even after that there is a lot of garbage left behind that does not get
cleaned up. At least not with regular housekeeping / online compaction.
Manual compaction for clusters with a lot of OMAP data feels like a
necessity (and ideally shouldn't be).


Indeed, it shouldn't be.

This config option should make it easier in a future release: 
https://github.com/ceph/ceph/commit/93e4c56ecc13560e0dad69aaa67afc3ca053fb4c


[osd]
osd_compact_on_start = true

Then just restart the OSDs and they will compact on boot. No need for 
external scripts. Just put this into the ceph.conf.


The mon config store won't work as there is no connection with the 
Monitors at that point in the code.


Wido



Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs needs access from two networks

2020-09-01 Thread Wido den Hollander



On 01/09/2020 08:15, Simon Sutter wrote:

Hello again

So I have changed the network configuration.
Now my Ceph is reachable from outside, this also means all osd’s of all nodes 
are reachable.
I still have the same behaviour which is a timeout.

The client can resolve all nodes with their hostnames.
The mon’s are still listening on the internal network so the nat rule is still 
there.
I have set “public bind addr” to the external ip and restarted the mon but it’s 
still not working.


It could be that the NAT is the problem here.

Just use routing and firewalling. That way clients and OSDs have direct 
IP-access to each other. That will make your life much easier.
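
On the Ceph nodes that usually just means allowing the Ceph ports from the 
client network, e.g. with firewalld something like:

$ firewall-cmd --zone=public --add-service=ceph-mon --add-service=ceph --permanent
$ firewall-cmd --reload

(ceph-mon covers the monitor ports, ceph covers the 6800-7300 range used by the 
OSDs and MDS.)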


Wido



[root@testnode1 ~]# ceph config get mon.public_bind_addr
WHO MASK  LEVEL OPTIONVALUERO
mon   advanced  public_bind_addr  v2:[ext-addr]:0/0 *

Do I have to change them somewhere else too?

Thanks in advance,
Simon


From: Janne Johansson [mailto:icepic...@gmail.com]
Sent: 27 August 2020 20:01
To: Simon Sutter
Subject: Re: [ceph-users] cephfs needs access from two networks

On Thu, 27 Aug 2020 at 12:05, Simon Sutter <ssut...@hosttech.ch> wrote:
Hello Janne

Oh I missed that point. No, the client cannot talk directly to the osds.
In this case it’s extremely difficult to set this up.

This is an absolute requirement to be a ceph client.

How is the mon telling the client, which host and port of the osd, it should 
connect to?

The same port and IP that the OSD called into the mon with when it started up 
and joined the cluster.

Can I have an influence on it?


Well, you set the ip on the OSD hosts, and the port ranges in use for OSDs are 
changeable/settable, but it would not really help the above-mentioned client.

From: Janne Johansson [mailto:icepic...@gmail.com]
Sent: 26 August 2020 15:09
To: Simon Sutter <ssut...@hosttech.ch>
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] cephfs needs access from two networks

On Wed, 26 Aug 2020 at 14:16, Simon Sutter <ssut...@hosttech.ch> wrote:
Hello,
So I know, the mon services can only bind to just one ip.
But I have to make it accessible to two networks because internal and external 
servers have to mount the cephfs.
The internal ip is 10.99.10.1 and the external is some public-ip.
I tried nat'ing it  with this: "firewall-cmd --zone=public 
--add-forward-port=port=6789:proto=tcp:toport=6789:toaddr=10.99.10.1 -permanent"

So the nat is working, because I get a "ceph v027" (alongside with some gibberish) when I 
do a telnet "telnet *public-ip* 6789"
But when I try to mount it, I get just a timeout:
mount - -t ceph *public-ip*:6789:/testing /mnt -o 
name=test,secretfile=/root/ceph.client. test.key
mount error 110 = Connection timed out

The tcpdump also recognizes a "Ceph Connect" packet, coming from the mon.

How can I get around this problem?
Is there something I have missed?

Any ceph client will need direct access to all OSDs involved also. Your mail 
doesn't really say if the cephfs-mounting client can talk to OSDs?

In ceph, traffic is not shuffled via mons, mons only tell the client which OSDs 
it needs to talk to, then all IO goes directly from client to any involved OSD 
servers.

--
May the most significant bit of your life be positive.


--
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bluestore does not defer writes

2020-08-31 Thread Wido den Hollander




On 31/08/2020 11:00, Dennis Benndorf wrote:

Hi,

today I recognized bad performance in our cluster. Running "watch ceph
osd perf |sort -hk 2 -r" I found that all bluestore OSDs are slow on
commit and that the commit timings are equal to their apply timings:

For example
Every 2.0s: ceph osd perf | sort -hk 2 -r

osd  commit_latency(ms)  apply_latency(ms)
440                  82                 82
430                  58                 58
435                  56                 56
449                  53                 53
442                  40                 40
441                  30                 30
439                  27                 27
 99                   0                  1
 98                   0                  0
 97                   0                  2
 96                   0                  6
 95                   0                  2
 94                   0                  6
 93                   0                 13

The ones with zero commit timings are filestore and the others are
bluestore OSDs.
I did not see this right after installing the new bluestore OSDs (maybe this
occurred later).
Both types of osds have nvmes as journal/db. Servers have equal
cpus/ram etc.

The only tuning regarding bluestore is:
   bluestore_block_db_size = 69793218560
   bluestore_prefer_deferred_size_hdd = 524288
This was meant to get filestore-like behavior, but that does not seem to
work.


As far as I know, with BlueStore the apply and commit latencies are 
equal.


Where did you get the idea that you could influence this with these 
settings?


Wido



Any tips?

Regards Dennis
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd regularly wrongly marked down

2020-08-31 Thread Wido den Hollander



On 31/08/2020 15:44, Francois Legrand wrote:

Thanks Igor for your answer,

We could try to do a compaction of RocksDB manually, but it's not clear to 
me if we have to compact on the mon with something like

ceph-kvstore-tool rocksdb  /var/lib/ceph/mon/mon01/store.db/ compact
or on the concerned osd with
ceph-kvstore-tool rocksdb  /var/lib/ceph/osd/ceph-16/ compact
(or for all osd with a script like in 
https://gist.github.com/wido/b0f0200bd1a2cbbe3307265c5cfb2771 )


You would compact the OSDs, not the MONs. So the last command or my 
script which you linked there.
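
With the OSD stopped, an offline compaction boils down to something like this 
(for a BlueStore OSD the 'bluestore-kv' backend is the one to use; osd.16 as in 
your example):

$ systemctl stop ceph-osd@16
$ ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-16 compact
$ systemctl start ceph-osd@16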


For my own understanding, how does compaction work? Is it done automatically in 
the background, regularly, or at startup?


Usually it's done by the OSD in the background, but sometimes an offline 
compact works best.


Because in the logs of the OSD we have, every 10 minutes, some reports about 
compaction (which suggests that compaction occurs regularly), like:




Yes, that is normal. But the offline compaction is sometimes more 
effective than the online ones are.



2020-08-31 15:06:55.448 7f03fb398700  4 rocksdb: [db/db_impl.cc:777] 
--- DUMPING STATS ---

2020-08-31 15:06:55.448 7f03fb398700  4 rocksdb: [db/db_impl.cc:778]
** DB Stats **
Uptime(secs): 449404.8 total, 600.0 interval
Cumulative writes: 136K writes, 692K keys, 136K commit groups, 1.0 
writes per commit group, ingest: 0.28 GB, 0.00 MB/s
Cumulative WAL: 136K writes, 67K syncs, 2.04 writes per sync, written: 
0.28 GB, 0.00 MB/s

Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 128 writes, 336 keys, 128 commit groups, 1.0 writes per 
commit group, ingest: 0.22 MB, 0.00 MB/s
Interval WAL: 128 writes, 64 syncs, 1.97 writes per sync, written: 0.00 
MB, 0.00 MB/s

Interval stall: 00:00:0.000 H:M:S, 0.0 percent

** Compaction Stats [default] **
Level    Files   Size Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) 
Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) 
Comp(cnt) Avg(sec) KeyIn KeyDrop
 

   L0  1/0   60.48 MB   0.2  0.0 0.0 0.0   0.1 0.1   
0.0   1.0  0.0    163.7 0.52  0.40 2    0.258   
0  0
   L1  0/0    0.00 KB   0.0  0.1 0.1 0.0   0.1 0.1   
0.0   0.5 48.2 26.1 2.32  0.64 1    2.319    920K   
197K
   L2 17/0    1.00 GB   0.8  1.1 0.1 1.1   1.1 0.0   
0.0  18.3 69.8 67.5 16.38  4.97 1   16.380   
4747K    82K
   L3 81/0    4.50 GB   0.9  0.6 0.1 0.5   0.3 
-0.2   0.0   4.3 66.9 36.6 9.23  4.95 2
4.617   9544K   802K
   L4    285/0   16.64 GB   0.1  2.4 0.3 2.0   0.2 
-1.8   0.0   0.8    110.3 11.7 21.92  4.37 5
4.384 12M    12M
  Sum    384/0   22.20 GB   0.0  4.2 0.6 3.6   1.8 
-1.8   0.0  21.8 85.2 36.6 50.37 15.32 11
4.579 28M    13M
  Int  0/0    0.00 KB   0.0  0.0 0.0 0.0   0.0 0.0   
0.0   0.0  0.0  0.0 0.00  0.00 0    0.000   
0  0


** Compaction Stats [default] **
Priority    Files   Size Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) 
Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) 
Comp(cnt) Avg(sec) KeyIn KeyDrop
--- 

  Low  0/0    0.00 KB   0.0  4.2 0.6 3.6   1.7 
-1.9   0.0   0.0 86.0 35.3 49.86 14.92 9
5.540 28M    13M
High  0/0    0.00 KB   0.0  0.0 0.0 0.0   0.1 0.1   
0.0   0.0  0.0    150.2 0.40  0.40 1    0.403   
0  0
User  0/0    0.00 KB   0.0  0.0 0.0 0.0   0.0 0.0   
0.0   0.0  0.0    211.7 0.11  0.00 1    0.114   
0  0

Uptime(secs): 449404.8 total, 600.0 interval
Flush(GB): cumulative 0.083, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1.80 GB write, 0.00 MB/s write, 4.19 GB read, 
0.01 MB/s read, 50.4 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 
MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 
level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for 
pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 
memtable_compaction, 0 memtable_slowdown, interval 0 total count




Concerning the data removal, I don't know if this could be the trigger. 
We had some osds marked down before starting the removal, but at that 
time the situation was so confused that I cannot be sure that the origin 
of the problem was

[ceph-users] Re: Large RocksDB (db_slow_bytes) on OSD which is marked as out

2020-08-31 Thread Wido den Hollander




On 31/08/2020 12:31, Igor Fedotov wrote:

Hi Wido,

The 'b' prefix relates to the free list manager, which keeps all the free extents 
for the main device in a bitmap. Its records have a fixed size, hence you can 
easily estimate the overall size for this type of data.




Yes, so I figured.

But I doubt it takes that much. I presume that DB just lacks the proper 
compaction. Which could happen eventually but looks like you interrupted 
the process by going offline.


May be try manual compaction with ceph-kvstore-tool?



This cluster is suffering from a lot of spillovers. So we tested with 
marking one OSD as out.


After being marked as out it still had this large DB. A compact didn't 
work, the RocksDB database just stayed so large.


New OSDs coming into the cluster aren't suffering from this and they 
have a RocksDB of a couple of MB in size.


Old OSDs installed with Luminous and now upgraded to Nautilus are 
suffering from this.


It kind of seems like garbage data stays behind in RocksDB which is 
never cleaned up.


Wido



Thanks,

Igor



On 8/31/2020 10:57 AM, Wido den Hollander wrote:

Hello,

On a Nautilus 14.2.8 cluster I am seeing large RocksDB database with 
many slow DB bytes in use.


To investigate this further I marked one OSD as out and waited for the 
all the backfilling to complete.


Once the backfilling was completed I exported BlueFS and investigated 
the RocksDB using 'ceph-kvstore-tool'. This resulted in 22GB of data.


Listing all the keys in the RocksDB shows me there are 747.000 keys in 
the DB. A small portion are osdmaps, but the biggest amount are keys 
prefixed with 'b'.


I dumped the stats of the RocksDB and this shows me:

L1: 1/0: 439.32 KB
L2: 1/0: 2.65 MB
L3: 5/0: 14.36 MB
L4: 127/0: 7.22 GB
L5: 217/0: 13.73 GB
Sum: 351/0: 20.98 GB

So there is almost 21GB of data in this RocksDB database. Why? Where 
is this coming from?


Throughout this cluster OSDs are suffering from many slow bytes used 
and I can't figure out why.


Has anybody seen this or has a clue on what is going on?

I have an external copy of this RocksDB database to do investigations on.

Thank you,

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Large RocksDB (db_slow_bytes) on OSD which is marked as out

2020-08-31 Thread Wido den Hollander

Hello,

On a Nautilus 14.2.8 cluster I am seeing large RocksDB database with 
many slow DB bytes in use.


To investigate this further I marked one OSD as out and waited for the 
all the backfilling to complete.


Once the backfilling was completed I exported BlueFS and investigated 
the RocksDB using 'ceph-kvstore-tool'. This resulted in 22GB of data.


Listing all the keys in the RocksDB shows me there are 747.000 keys in 
the DB. A small portion are osdmaps, but the biggest amount are keys 
prefixed with 'b'.
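
(For reference, counting keys per prefix on the exported DB can be done with 
something like the following; the exact output format of 'list' may differ per 
version:

$ ceph-kvstore-tool rocksdb ./db list | awk '{print $1}' | sort | uniq -c | sort -rn
)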


I dumped the stats of the RocksDB and this shows me:

L1: 1/0: 439.32 KB
L2: 1/0: 2.65 MB
L3: 5/0: 14.36 MB
L4: 127/0: 7.22 GB
L5: 217/0: 13.73 GB
Sum: 351/0: 20.98 GB

So there is almost 21GB of data in this RocksDB database. Why? Where is 
this coming from?


Throughout this cluster OSDs are suffering from many slow bytes used and 
I can't figure out why.


Has anybody seen this or has a clue on what is going on?

I have an external copy of this RocksDB database to do investigations on.

Thank you,

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radowsgw still needs dedicated clientid?

2020-08-27 Thread Wido den Hollander




On 27/08/2020 14:23, Marc Roos wrote:
  
Can someone shed some light on this? Because it is the difference between
running multiple instances of one task and running multiple different
tasks.
tasks.


As far as I know this is still required, because the clients talk to each 
other using RADOS notifies and thus require different client IDs.
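
A rough sketch of what that usually looks like in ceph.conf (instance names are 
just examples):

[client.rgw.gw1]
rgw_frontends = beast port=7480

[client.rgw.gw2]
rgw_frontends = beast port=7481

Each instance is then started under its own name, e.g. via the 
ceph-radosgw@rgw.gw1 and ceph-radosgw@rgw.gw2 systemd units.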


Wido





-Original Message-
To: ceph-users
Subject: [ceph-users] radowsgw still needs dedicated clientid?


I think I can remember reading somewhere that every radosgw is required
to run with their own clientid. Is this still necessary? Or can I run
multiple instances of radosgw with the same clientid?

So can have something like

rgw: 2 daemons active (rgw1, rgw1, rgw1)

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: slow "rados ls"

2020-08-26 Thread Wido den Hollander




On 26/08/2020 15:59, Stefan Kooman wrote:

On 2020-08-26 15:20, Marcel Kuiper wrote:

Hi Vladimir,

no it is the same on all monitors. Actually I got triggered because I got
slow responses on my rados gateway with the radosgw-admin command and
narrowed it down to slow respons for rados commands anywhere in the
cluster.


Do you have a very large amount of objects. And / or a lot of OMAP data
and thus large rocksdb databases? We have seen slowness (and slow ops)
from having very large rocksdb databases due to a lot of OMAP data
concentrated on only a few nodes (cephfs metadata only). You might
suffer from the same thing.

Manual rocksdb compaction on the OSDs might help.


In addition: Keep in mind that RADOS was never designed to list objects 
fast. The more Placement Groups you have the slower a listing will be.


Wido



Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 5 pgs inactive, 5 pgs incomplete

2020-08-11 Thread Wido den Hollander



On 11/08/2020 20:41, Kevin Myers wrote:

Replica count of 2 is a sure fire way to a crisis !



It is :-)


Sent from my iPad


On 11 Aug 2020, at 18:45, Martin Palma  wrote:

Hello,
after an unexpected power outage our production cluster has 5 PGs
inactive and incomplete. The OSDs on which these 5 PGs are located all
show "stuck requests are blocked":

  Reduced data availability: 5 pgs inactive, 5 pgs incomplete
  98 stuck requests are blocked > 4096 sec. Implicated osds 63,80,492,494

What is the best procedure to get these PGs back? These PGs are all of
pools with a replica of 2.


Are the OSDs online? Or do they refuse to boot?

Can you list the data with ceph-objectstore-tool on these OSDs?
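
For example, with the OSD stopped (osd.63 taken from the implicated OSDs above):

$ ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-63 --op list-pgs

That would at least show whether the PGs in question are still present on that 
OSD.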

Wido



Best,
Martin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg stuck in unknown state

2020-08-11 Thread Wido den Hollander



On 11/08/2020 00:40, Michael Thomas wrote:
On my relatively new Octopus cluster, I have one PG that has been 
perpetually stuck in the 'unknown' state.  It appears to belong to the 
device_health_metrics pool, which was created automatically by the mgr 
daemon(?).


The OSDs that the PG maps to are all online and serving other PGs.  But 
when I list the PGs that belong to the OSDs from 'ceph pg map', the 
offending PG is not listed.


# ceph pg dump pgs | grep ^1.0
dumped pgs
1.0    0   0 0  0    0  
0    0   0  0 0   unknown 
2020-08-08T09:30:33.251653-0500 0'0 0:0
[]  -1 []  -1  0'0  
2020-08-08T09:30:33.251653-0500  0'0 
2020-08-08T09:30:33.251653-0500  0


# ceph osd pool stats device_health_metrics
pool device_health_metrics id 1
   nothing is going on

# ceph pg map 1.0
osdmap e7199 pg 1.0 (1.0) -> up [41,40,2] acting [41,0]

What can be done to fix the PG?  I tried doing a 'ceph pg repair 1.0', 
but that didn't seem to do anything.


Is it safe to try to update the crush_rule for this pool so that the PG 
gets mapped to a fresh set of OSDs?


Yes, it would be. But still, it's weird. Mainly as the acting set is so 
different from the up-set.


You have different CRUSH rules I think?

Marking those OSDs down might work, but otherwise change the crush_rule 
and see how that goes.
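
Something along these lines (the rule name is just an example):

$ ceph osd down 41 40 2
$ ceph osd pool set device_health_metrics crush_rule replicated_rule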


Wido



--Mike
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW Garbage Collection (GC) does not make progress

2020-08-07 Thread Wido den Hollander

Hi,

On a Nautilus 14.2.8 cluster I'm seeing a large amount of GC data and 
the GC on the RGW does not seem to make progress.


The .rgw.gc pool contains 39GB of data spread out over 32 objects.

In the logs we do see references to the RGW GC doing work, and it says it 
is removing objects.


Those objects however still exist and only their 'refcount' attribute is 
updated.


2020-08-07 10:28:01.946 7fbd79f9a7c0  5 garbage collection: 
RGWGC::process removing 
.rgw.buckets.ec-v2:default.1834866551.1__multipart_fedora-data/datastreamStore/XX-YYY-5/5c/f3/info%3Afedora%2FCH-001514-5%3A36%2FORIGINAL%2FORIGINAL.0.2~yKGz1-SLXINhZvm3cQMBWgx9BJVoH5j.1
2020-08-07 10:28:01.946 7fbd79f9a7c0  5 garbage collection: 
RGWGC::process removing 
.rgw.buckets.ec-v2:default.1834866551.1__shadow_fedora-data/datastreamStore/XXX-YYY-5/5c/f3/info%3Afedora%2FCH-001514-5%3A36%2FORIGINAL%2FORIGINAL.0.2~yKGz1-SLXINhZvm3cQMBWgx9BJVoH5j.1_1


These objects however still exist; 'rados stat' shows me:

mtime 2020-08-07 12:28:44.00, size 4194304
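
For completeness, the pending GC entries can be inspected and processing can be 
forced by hand, e.g.:

$ radosgw-admin gc list --include-all | head
$ radosgw-admin gc process --include-all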

Has anybody seen this before and has clues on what this could be?

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Setting rbd_default_data_pool through the config store

2020-07-29 Thread Wido den Hollander



On 29/07/2020 16:54, Wido den Hollander wrote:



On 29/07/2020 16:00, Jason Dillaman wrote:
On Wed, Jul 29, 2020 at 9:07 AM Jason Dillaman  
wrote:


On Wed, Jul 29, 2020 at 9:03 AM Wido den Hollander  
wrote:




On 29/07/2020 14:54, Jason Dillaman wrote:
On Wed, Jul 29, 2020 at 6:23 AM Wido den Hollander  
wrote:


Hi,

I'm trying to have clients read the 'rbd_default_data_pool' config
option from the config store when creating a RBD image.

This doesn't seem to work and I'm wondering if somebody knows why.


It looks like all string-based config overrides for RBD are ignored:

2020-07-29T08:52:44.393-0400 7f2a97fff700  4 set_mon_vals failed to
set rbd_default_data_pool = rbd-data: Configuration option
'rbd_default_data_pool' may not be modified at runtime

librbd always accesses the config options in a thread-safe manner, so
I'll open a tracker ticket to flag all the RBD string config options
are runtime updatable (primitive data type options are implicitly
runtime updatable).


I wasn't updating it at runtime, I just wanted to make sure that I 
don't

have to set this in ceph.conf everywhere (and libvirt doesn't read
ceph.conf)


You weren't updating it at runtime -- the MON's "MConfig" message back
to the client was attempting to set the config option after "rbd" had
already started. However, if it's working under python, perhaps there
is an easy tweak for "rbd" to have it delay flagging the application
as having started until after it has connected to the cluster. Right
now it manages its own CephContext lifetime which it re-uses when
creating a librados connection. It's that CephContext that is flagged
as "running" prior to librados actually connecting to the cluster.


It looks like this is caused by two issues:

-- In [1], this will prevent librados from applying any MON config
overrides (for strings). This line can just be trivially removed.

-- Fixing that, there is a race in librados / MonClient [2] where it
attempts to first pull the config from the MONs, but it uses a
separate thread to actually apply the received config values, which
can race w/ the completion of the bootstrap occurring in the main
thread. This means that the example below may work sometimes -- and
may fail other times.


Interesting! In this case it will be libvirt which runs forever and 
talks to librbd/librados.


I'll need to see how that works out. I'll test and report back.



I can confirm this works with Libvirt. I created an RBD volume through 
Libvirt's RBD storage driver and this resulted in the 'data-pool' 
feature being set and the RBD image using the data pool.


On the hypervisor where libvirt runs no ceph.conf is present. All 
information is provided through Libvirt's XML definitions which only 
contain the Monitors and the Cephx credentials.


In this case librados/librbd fetched the configuration from the Config 
Store and thus detected it needed to use the data pool feature.
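
A quick way to verify both ends is something like (pool and image names are 
examples):

$ ceph config get client rbd_default_data_pool
$ rbd info libvirt-pool/myimage | grep data_pool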


I'll keep an eye out to see if this goes wrong and it by accident 
creates an image without this feature.


Running 15.2.4 in this case on Ubuntu 18.04

Wido


Wido




But it seems that Python works:

#!/usr/bin/python3

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

rbd_inst = rbd.RBD()
size = 4 * 1024**3  # 4 GiB
rbd_inst.create(ioctx, 'myimage', size)

ioctx.close()
cluster.shutdown()


And then:

$ ceph config set client rbd_default_data_pool rbd-data

rbd image 'myimage':
 size 4 GiB in 1024 objects
 order 22 (4 MiB objects)
 snapshot_count: 0
 id: 1aa963a21028
 data_pool: rbd-data
 block_name_prefix: rbd_data.2.1aa963a21028
 format: 2
 features: layering, exclusive-lock, object-map, fast-diff,
deep-flatten, data-pool


I haven't tested this through libvirt yet. That's the next thing to 
test.


Wido




I tried:

$ ceph config set client rbd_default_data_pool rbd-data
$ ceph config set global rbd_default_data_pool rbd-data

They both show up under:

$ ceph config dump

However, newly created RBD images with the 'rbd' CLI tool do not 
use the

data pool.

If I set this in ceph.conf it works:

[client]
rbd_default_data_pool = rbd-data

Somehow librbd isn't fetching these configuration options. Any 
hints on

how to get this working?

The end result is that libvirt (which doesn't read ceph.conf) should
also be able to create RBD images with a different data pool.

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io









--
Jason


[1] https://github.com/ceph/ceph/blob/master/src/tools/rbd/Utils.cc#L680
[2] https://g

[ceph-users] Re: Setting rbd_default_data_pool through the config store

2020-07-29 Thread Wido den Hollander




On 29/07/2020 16:00, Jason Dillaman wrote:

On Wed, Jul 29, 2020 at 9:07 AM Jason Dillaman  wrote:


On Wed, Jul 29, 2020 at 9:03 AM Wido den Hollander  wrote:




On 29/07/2020 14:54, Jason Dillaman wrote:

On Wed, Jul 29, 2020 at 6:23 AM Wido den Hollander  wrote:


Hi,

I'm trying to have clients read the 'rbd_default_data_pool' config
option from the config store when creating a RBD image.

This doesn't seem to work and I'm wondering if somebody knows why.


It looks like all string-based config overrides for RBD are ignored:

2020-07-29T08:52:44.393-0400 7f2a97fff700  4 set_mon_vals failed to
set rbd_default_data_pool = rbd-data: Configuration option
'rbd_default_data_pool' may not be modified at runtime

librbd always accesses the config options in a thread-safe manner, so
I'll open a tracker ticket to flag all the RBD string config options
are runtime updatable (primitive data type options are implicitly
runtime updatable).


I wasn't updating it at runtime, I just wanted to make sure that I don't
have to set this in ceph.conf everywhere (and libvirt doesn't read
ceph.conf)


You weren't updating it at runtime -- the MON's "MConfig" message back
to the client was attempting to set the config option after "rbd" had
already started. However, if it's working under python, perhaps there
is an easy tweak for "rbd" to have it delay flagging the application
as having started until after it has connected to the cluster. Right
now it manages its own CephContext lifetime which it re-uses when
creating a librados connection. It's that CephContext that is flagged
as "running" prior to librados actually connecting to the cluster.


It looks like this is caused by two issues:

-- In [1], this will prevent librados from applying any MON config
overrides (for strings). This line can just be trivially removed.

-- Fixing that, there is a race in librados / MonClient [2] where it
attempts to first pull the config from the MONs, but it uses a
separate thread to actually apply the received config values, which
can race w/ the completion of the bootstrap occurring in the main
thread. This means that the example below may work sometimes -- and
may fail other times.


Interesting! In this case it will be libvirt which runs forever and 
talks to librbd/librados.


I'll need to see how that works out. I'll test and report back.

Wido




But it seems that Python works:

#!/usr/bin/python3

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

rbd_inst = rbd.RBD()
size = 4 * 1024**3  # 4 GiB
rbd_inst.create(ioctx, 'myimage', size)

ioctx.close()
cluster.shutdown()


And then:

$ ceph config set client rbd_default_data_pool rbd-data

rbd image 'myimage':
 size 4 GiB in 1024 objects
 order 22 (4 MiB objects)
 snapshot_count: 0
 id: 1aa963a21028
 data_pool: rbd-data
 block_name_prefix: rbd_data.2.1aa963a21028
 format: 2
 features: layering, exclusive-lock, object-map, fast-diff,
deep-flatten, data-pool


I haven't tested this through libvirt yet. That's the next thing to test.

Wido




I tried:

$ ceph config set client rbd_default_data_pool rbd-data
$ ceph config set global rbd_default_data_pool rbd-data

They both show up under:

$ ceph config dump

However, newly created RBD images with the 'rbd' CLI tool do not use the
data pool.

If I set this in ceph.conf it works:

[client]
rbd_default_data_pool = rbd-data

Somehow librbd isn't fetching these configuration options. Any hints on
how to get this working?

The end result is that libvirt (which doesn't read ceph.conf) should
also be able to create RBD images with a different data pool.

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io









--
Jason


[1] https://github.com/ceph/ceph/blob/master/src/tools/rbd/Utils.cc#L680
[2] https://github.com/ceph/ceph/blob/master/src/mon/MonClient.cc#L445

--
Jason


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High io wait when osd rocksdb is compacting

2020-07-29 Thread Wido den Hollander



On 29/07/2020 14:52, Raffael Bachmann wrote:

Hi All,

I'm kind of crossposting this from here: 
https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-ceph-luminous-to-nautilus.73581/ 

But since I'm more and more sure that it's a ceph problem I'll try my 
luck here.


Since updating from Luminous to Nautilus I have a big problem.

I have a 3 node cluster. Each node has 2 NVMe SSDs and a 10GBASE-T network 
for ceph.
Every few minutes an OSD seems to compact its RocksDB. While doing this 
it uses a lot of I/O and blocks.
This basically blocks the whole cluster and no VM/container can read 
data for some seconds (minutes).


While it happens "iostat -x" looks like this:

Device    r/s w/s rkB/s wkB/s   rrqm/s   wrqm/s  
%rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1  0.00    2.00  0.00 24.00 0.00    46.00   
0.00  95.83    0.00    0.00   0.00 0.00    12.00   2.00   0.40
nvme1n1  0.00 1495.00  0.00   3924.00 0.00  6099.00   
0.00  80.31    0.00  352.39 523.78 0.00 2.62   0.67 100.00


And iotop:

Total DISK READ: 0.00 B/s | Total DISK WRITE:  1573.47 K/s
Current DISK READ:   0.00 B/s | Current DISK WRITE:   3.43 M/s
     TID  PRIO  USER DISK READ  DISK WRITE  SWAPIN IO>    COMMAND
    2306 be/4 ceph    0.00 B/s 1533.22 K/s  0.00 % 99.99 % ceph-osd 
-f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1]



In the ceph-osd log I see that rocksdb is compacting. 
https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08


The pool and one OSD are nearfull. I'd planned to move some data away to 
another ceph pool. But now I'm not sure anymore if I should go with ceph.
I'll move some data away anyway today to see if that helps, but before 
the upgrade there was the same amount of data and I didn't have a problem.


Any hints to solve this are appreciated.


What model/type of NVMe is this?

And on a nearfull cluster these problems can arise, it's usually not a 
good idea to have OSDs be nearfull.


What does 'ceph df' tell you?

Wido



Cheers
Raffael
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Setting rbd_default_data_pool through the config store

2020-07-29 Thread Wido den Hollander




On 29/07/2020 14:54, Jason Dillaman wrote:

On Wed, Jul 29, 2020 at 6:23 AM Wido den Hollander  wrote:


Hi,

I'm trying to have clients read the 'rbd_default_data_pool' config
option from the config store when creating a RBD image.

This doesn't seem to work and I'm wondering if somebody knows why.


It looks like all string-based config overrides for RBD are ignored:

2020-07-29T08:52:44.393-0400 7f2a97fff700  4 set_mon_vals failed to
set rbd_default_data_pool = rbd-data: Configuration option
'rbd_default_data_pool' may not be modified at runtime

librbd always accesses the config options in a thread-safe manner, so
I'll open a tracker ticket to flag all the RBD string config options
as runtime updatable (primitive data type options are implicitly
runtime updatable).


I wasn't updating it at runtime, I just wanted to make sure that I don't 
have to set this in ceph.conf everywhere (and libvirt doesn't read 
ceph.conf)


But it seems that Python works:

#!/usr/bin/python3

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

rbd_inst = rbd.RBD()
size = 4 * 1024**3  # 4 GiB
rbd_inst.create(ioctx, 'myimage', size)

ioctx.close()
cluster.shutdown()


And then:

$ ceph config set client rbd_default_data_pool rbd-data

$ rbd info myimage

rbd image 'myimage':
size 4 GiB in 1024 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 1aa963a21028
data_pool: rbd-data
block_name_prefix: rbd_data.2.1aa963a21028
format: 2
	features: layering, exclusive-lock, object-map, fast-diff, deep-flatten, data-pool



I haven't tested this through libvirt yet. That's the next thing to test.

Wido




I tried:

$ ceph config set client rbd_default_data_pool rbd-data
$ ceph config set global rbd_default_data_pool rbd-data

They both show up under:

$ ceph config dump

However, newly created RBD images with the 'rbd' CLI tool do not use the
data pool.

If I set this in ceph.conf it works:

[client]
rbd_default_data_pool = rbd-data

Somehow librbd isn't fetching these configuration options. Any hints on
how to get this working?

The end result is that libvirt (which doesn't read ceph.conf) should
also be able to create RBD images with a different data pool.

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Setting rbd_default_data_pool through the config store

2020-07-29 Thread Wido den Hollander

Hi,

I'm trying to have clients read the 'rbd_default_data_pool' config 
option from the config store when creating a RBD image.


This doesn't seem to work and I'm wondering if somebody knows why.

I tried:

$ ceph config set client rbd_default_data_pool rbd-data
$ ceph config set global rbd_default_data_pool rbd-data

They both show up under:

$ ceph config dump

However, newly created RBD images with the 'rbd' CLI tool do not use the 
data pool.


If I set this in ceph.conf it works:

[client]
rbd_default_data_pool = rbd-data

Somehow librbd isn't fetching these configuration options. Any hints on 
how to get this working?


The end result is that libvirt (which doesn't read ceph.conf) should 
also be able to create RBD images with a different data pool.


Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw, public and private access on the same cluster ?

2020-07-23 Thread Wido den Hollander



On 7/21/20 6:30 PM, Jean-Sebastien Landry wrote:
> Hi everyone, we have a ceph cluster for object storage only, the rgws are 
> accessible from the internet, and everything is ok.

Is there a HTTP proxy in between?

> 
> Now, one of our team/client required that their data should not ever be 
> accessible from the internet. 

First: Upload with a Private ACL. This means that Authentication is
always required to read the data.
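
As a hedged illustration of that first point (bucket, key and endpoint are
made-up names), an upload with an explicit private ACL through the AWS CLI
against the RGW endpoint could look like:

$ aws s3api put-object --endpoint-url https://rgw.example.com \
    --bucket team-internal --key backup.tar.gz --body backup.tar.gz --acl private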

> In any case of security bug/breach/whatever, they want to limit the access to 
> their data from the local network.
> 
> Before creating a second "private" cluster, is there a way to achieve this on 
> our current "public" cluster?
> 
> Is a multi-zone without replication would help me with that?
> 
> A public rgws for public access on the "pub_zone", and a private rgws for 
> private access on the "prv_zone"?
> 
> pubzone.rgw.buckets.data
> prvzone.rgw.buckets.data
> 
> If the "public" rgws is hacked, without the access_key/secret_key of the 
> private zone, is there any possibilities to access the private zone?
> 
> Does a multi-realms would help me to secure it more?
> 
> Any input would be really appreciated.
> 
> I don't want to put to much energy for false security and/or security by 
> obscurity, 
> so if these scenarios of multi-sites/multi-realms are useless, in a security 
> point of view, please tell me. :-)

Why not work with a HTTP proxy in between that filters out specific
bucket names? Or only allows access to them if the client IP matches X.

This way two barriers need to be crossed:

- Filtering in the proxy
- RGW authentication

Wido

> 
> Thanks!
> JS
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Enabling Multi-MDS under Nautilus after cephfs-data-scan scan_links

2020-07-22 Thread Wido den Hollander
Hi,

I got involved in a case where a Nautilus cluster was experiencing MDSes
asserting showing the backtrace mentioned in this ticket:
https://tracker.ceph.com/issues/36349

ceph_assert(follows >= realm->get_newest_seq());

In the end we needed to use these tooling to get one MDS running again:
https://docs.ceph.com/docs/master/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery

The root-cause seems to be that this Nautilus cluster was running
Multi-MDS with a very high amount of CephFS snapshots.

After a couple of days of scanning (scan_links seems single threaded!)
we finally got a single MDS running again with a usable CephFS filesystem.

At the moment chowns() are running to get all the permissions set back
to what they should be.

The question now outstanding: Is it safe to enable Multi-MDS again on a
CephFS filesystem which still has these many snapshots and is running
single at the moment?

New snapshots are disabled at the moment, so those won't be created.

In addition: How safe is it to remove snapshots? As this will result in
metadata updates.

Thanks

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Octopus: Recovery and backfilling causes OSDs to crash after upgrading from nautilus to octopus

2020-07-05 Thread Wido den Hollander


> On 5 Jul 2020, at 15:26, Wout van Heeswijk  wrote:
> 
> Good point, we've looked at that, but can't see any message regarding OOM 
> Killer:
> 

Have to add here that we looked at changing osd memory target as well, but that 
did not make a difference.

tcmalloc seems to suggest a memory allocation problem, but we haven’t found the 
root cause yet.

Hopefully somebody else on the list here knows where to look.

Wido


> root@st0:~# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 18.04.4 LTS
> Release:18.04
> Codename:   bionic
> root@st0:~# grep -i "out of memory" /var/log/kern.log
> root@st0:~#
> 
> kind regards,
> 
> Wout
> 42on
> 
>> On 2020-07-05 14:45, Lindsay Mathieson wrote:
>>> On 5/07/2020 10:43 pm, Wout van Heeswijk wrote:
>>> After unsetting the norecover and nobackfill flag some OSDs started 
>>> crashing every few minutes. The OSD log, even with high debug settings, 
>>> don't seem to reveal anything, it just stops logging mid log line. 
>> 
>> 
>> POOMA U, but could the OOM Killer be taking them down?
>> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Octopus OSDs dropping out of cluster: _check_auth_rotating possible clock skew, rotating keys expired way too early

2020-06-09 Thread Wido den Hollander
Hi,

On a recently deployed Octopus (15.2.2) cluster (240 OSDs) we are seeing
OSDs randomly drop out of the cluster.

Usually it's 2 to 4 OSDs spread out over different nodes. Each node has
16 OSDs and not all the failing OSDs are on the same node.

The OSDs are marked as down and all they keep print in their logs:

monclient: _check_auth_rotating possible clock skew, rotating keys
expired way too early (before 2020-06-04T07:57:17.706529-0400)

Looking at their status through the admin socket:

{
"cluster_fsid": "68653193-9b84-478d-bc39-1a811dd50836",
"osd_fsid": "87231b5d-ae5f-4901-93c5-18034381e5ec",
"whoami": 206,
"state": "active",
"oldest_map": 73697,
"newest_map": 75795,
"num_pgs": 19
}
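
(For reference, the dump above is the sort of output the OSD's admin socket
gives; on the host carrying the OSD, something like the following should
reproduce it:)

$ ceph daemon osd.206 status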

The message brought me to my own ticket I created 2 years ago:
https://tracker.ceph.com/issues/23460

The first thing I've checked is NTP/time. Double, triple check this. All
the times are in sync on the cluster. Nothing wrong there.

Again, it's not all the OSDs on a node failing. Just 1 or 2 dropping out.

Restarting them brings them back right away and then within 24h some
other OSDs will drop out.

Has anybody seen this behavior with Octopus as well?

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best way to change bucket hierarchy

2020-06-04 Thread Wido den Hollander



On 6/4/20 9:17 AM, Frank Schilder wrote:
>> Yes and No. This will cause many CRUSHMap updates where a manual update
>> is only a single change.
>>
>> I would do:
>>
>> $ ceph osd getcrushmap -o crushmap
> 
> Well, that's a yes and a no as well.
> 
> If you are experienced and edit crush maps on a regular basis, you can go 
> that way. I would still enclose the change in a norebalance setting. If you 
> are not experienced, you are likely to shoot your cluster. In particular, 
> adding and moving buckets is not fun this way. You need to be careful what 
> IDs you assign, and there are many options to choose from with documentation 
> targeted at experienced cephers.
> 
> CLI commands will prevent a lot of stupid typos, errors and forgotten 
> mandatory lines. I learned that the hard way and decided to use a direct edit 
> only when absolutely necessary. A couple of extra peerings is a low-cost 
> operation compared with trying to find a stupid typo that just killed all 
> pools when angry users stand next to you.
> 
> My recommendation would be to save the original crush map, apply commands and 
> look at changes these commands do. That's a great way to learn how to do it 
> right. And in general, better be safe than sorry.
> 

I think we understand each other :-)

Main thing: Backup your crushmap! Then you can always roll back if
things go wrong.

Wido

> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Wido den Hollander 
> Sent: 04 June 2020 08:50:16
> To: Frank Schilder; Kyriazis, George; ceph-users
> Subject: Re: [ceph-users] Re: Best way to change bucket hierarchy
> 
> On 6/4/20 12:24 AM, Frank Schilder wrote:
>> You can use the command-line without editing the crush map. Look at the 
>> documentation of commands like
>>
>> ceph osd crush add-bucket ...
>> ceph osd crush move ...
>>
>> Before starting this, set "ceph osd set norebalance" and unset after you are 
>> happy with the crush tree. Let everything peer. You should see misplaced 
>> objects and remapped PGs, but no degraded objects or PGs.
>>
>> Do this only when the cluster is health_ok, otherwise things can get really 
>> complicated.
>>
> 
> Yes and No. This will cause many CRUSHMap updates where a manual update
> is only a single change.
> 
> I would do:
> 
> $ ceph osd getcrushmap -o crushmap
> $ cp crushmap crushmap.backup
> $ crushtool -d crushmap -o crushmap.txt
> $ vi crushmap.txt (now make your changes)
> $ crushtool -c crushmap.txt -o crushmap.new
> $ crushtool -i crushmap.new --tree (check if all OK)
> $ crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-mappings
> 
> If all is good:
> 
> $ ceph osd setcrushmap -i crushmap.new
> 
> If all goes bad, simply revert to your old crushmap.
> 
> Wido
> 
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: Kyriazis, George 
>> Sent: 03 June 2020 22:45:11
>> To: ceph-users
>> Subject: [ceph-users] Best way to change bucket hierarchy
>>
>> Hello,
>>
>> I have a live ceph cluster, and I’m in need of modifying the bucket 
>> hierarchy.  I am currently using the default crush rule (ie. keep each 
>> replica on a different host).  My need is to add a “chassis” level, and keep 
>> replicas on a per-chassis level.
>>
>> From what I read in the documentation, I would have to edit the crush file 
>> manually, however this sounds kinda scary for a live cluster.
>>
>> Are there any “best known methods” to achieve that goal without messing 
>> things up?
>>
>> In my current scenario, I have one host per chassis, and planning on later 
>> adding nodes where there would be >1 hosts per chassis. It looks like “in 
>> theory” there wouldn’t be a need for any data movement after the crush map 
>> changes.  Will reality match theory?  Anything else I need to watch out for?
>>
>> Thank you!
>>
>> George
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best way to change bucket hierarchy

2020-06-03 Thread Wido den Hollander



On 6/4/20 12:24 AM, Frank Schilder wrote:
> You can use the command-line without editing the crush map. Look at the 
> documentation of commands like
> 
> ceph osd crush add-bucket ...
> ceph osd crush move ...
> 
> Before starting this, set "ceph osd set norebalance" and unset after you are 
> happy with the crush tree. Let everything peer. You should see misplaced 
> objects and remapped PGs, but no degraded objects or PGs.
> 
> Do this only when the cluster is health_ok, otherwise things can get really 
> complicated.
> 

Yes and No. This will cause many CRUSHMap updates where a manual update
is only a single change.

I would do:

$ ceph osd getcrushmap -o crushmap
$ cp crushmap crushmap.backup
$ crushtool -d crushmap -o crushmap.txt
$ vi crushmap.txt (now make your changes)
$ crushtool -c crushmap.txt -o crushmap.new
$ crushtool -i crushmap.new --tree (check if all OK)
$ crushtool -i crushmap.new --test --rule 0 --num-rep 3 --show-mappings

If all is good:

$ ceph osd setcrushmap -i crushmap.new

If all goes bad, simply revert to your old crushmap.
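
For completeness, the CLI-only route Frank describes above could look roughly
like this sketch (the chassis and host names are made up, and the replication
rule itself still needs its chooseleaf step changed to type chassis before
replicas actually spread per chassis):

$ ceph osd set norebalance
$ ceph osd crush add-bucket chassis1 chassis
$ ceph osd crush move chassis1 root=default
$ ceph osd crush move node01 chassis=chassis1
$ ceph osd tree                     # verify the hierarchy before data moves
$ ceph osd unset norebalance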

Wido

> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Kyriazis, George 
> Sent: 03 June 2020 22:45:11
> To: ceph-users
> Subject: [ceph-users] Best way to change bucket hierarchy
> 
> Hello,
> 
> I have a live ceph cluster, and I’m in need of modifying the bucket 
> hierarchy.  I am currently using the default crush rule (ie. keep each 
> replica on a different host).  My need is to add a “chassis” level, and keep 
> replicas on a per-chassis level.
> 
> From what I read in the documentation, I would have to edit the crush file 
> manually, however this sounds kinda scary for a live cluster.
> 
> Are there any “best known methods” to achieve that goal without messing 
> things up?
> 
> In my current scenario, I have one host per chassis, and planning on later 
> adding nodes where there would be >1 hosts per chassis. It looks like “in 
> theory” there wouldn’t be a need for any data movement after the crush map 
> changes.  Will reality match theory?  Anything else I need to watch out for?
> 
> Thank you!
> 
> George
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Change mon bind address / Change IPs with the orchestrator

2020-06-03 Thread Wido den Hollander



On 6/3/20 4:49 PM, Simon Sutter wrote:
> Hello,
> 
> 
> I think I misunderstood the internal / public network concepts in the docs 
> https://docs.ceph.com/docs/master/rados/configuration/network-config-ref/.
> 
> Now there are two questions:
> 
> - Is it somehow possible to bind the MON daemon to 0.0.0.0?

No

> I tried it with manually add the ip in  /var/lib/ceph/{UUID}/mon.node01/config

Won't work :-)

> 
> 
> [mon.node01]
> public bind addr = 0.0.0.0
> 
> 
> But that does not work; in netstat I can see the mon still binds to its 
> internal ip. Is this expected behaviour?
> 

Yes. This is not an orchestrator thing, this is how the MONs work. They
need to bind to a specific IP and that can't be 0.0.0.0

You then need to make sure proper routing is in place so all clients and
OSDs can talk to the MONs.

So don't attempt anything like NAT. Make sure everything works with
proper IP-routing.

Wido

> If I set this value to the public ip, the other nodes cannot communicate with 
> it, so this leads to the next question:
> 
> - What's the Right way to correct the problem with the orchestrator?
> So the correct way to configure the ip's, would be to set every mon, mds and 
> so on, to the public ip and just let the osd's stay on their internal ip. 
> (described here 
> https://docs.ceph.com/docs/master/rados/configuration/network-config-ref/)
> 
> Do I have to remove every daemon and redeploy them with "ceph orch daemon rm" 
> / "ceph orch apply"?
> 
> Or do I have to go to every node and manually apply the settings in the 
> daemon config file?
> 
> 
> Thanks in advance,
> 
> 
> Simon
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD upgrades

2020-06-02 Thread Wido den Hollander



On 6/2/20 5:44 AM, Brent Kennedy wrote:
> We are rebuilding servers and before luminous our process was:
> 
>  
> 
> 1.   Reweight the OSD to 0
> 
> 2.   Wait for rebalance to complete
> 
> 3.   Out the osd
> 
> 4.   Crush remove osd
> 
> 5.   Auth del osd
> 
> 6.   Ceph osd rm #
> 
>  
> 
> Seems the luminous documentation says that you should:
> 
> 1.   Out the osd
> 
> 2.   Wait for the cluster rebalance to finish
> 
> 3.   Stop the osd
> 
> 4.   Osd purge # 
> 
>  
> 
> Is reweighting to 0 no longer suggested?
> 
>  
> 
> Side note:  I tried our existing process and even after reweight, the entire
> cluster restarted the balance again after step 4 ( crush remove osd ) of the
> old process.  I should also note, by reweighting to 0, when I tried to run
> "ceph osd out #", it said it was already marked out.  
> 
>  
> 
> I assume the docs are correct, but just want to make sure since reweighting
> had been previously recommended.

The new commands just make it more simple. There are many ways to
accomplish the same goal, but what the docs describe should work in most
scenarios.
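
A minimal sketch of the newer procedure for a single OSD (the id 12 is made
up; run the systemctl command on the host that owns the OSD):

$ ceph osd out 12
# wait until all PGs are active+clean again (watch ceph -s)
$ systemctl stop ceph-osd@12
$ ceph osd purge 12 --yes-i-really-mean-it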

Wido

> 
>  
> 
> Regards,
> 
> -Brent
> 
>  
> 
> Existing Clusters:
> 
> Test: Nautilus 14.2.2 with 3 osd servers, 1 mon/man, 1 gateway, 2 iscsi
> gateways ( all virtual on nvme )
> 
> US Production(HDD): Nautilus 14.2.2 with 11 osd servers, 3 mons, 4 gateways,
> 2 iscsi gateways
> 
> UK Production(HDD): Nautilus 14.2.2 with 12 osd servers, 3 mons, 4 gateways
> 
> US Production(SSD): Nautilus 14.2.2 with 6 osd servers, 3 mons, 3 gateways,
> 2 iscsi gateways
> 
>  
> 
>  
> 
>  
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Deploy Ceph on the secondary datacenter for DR

2020-06-01 Thread Wido den Hollander


On 6/1/20 6:46 AM, Nghia Viet Tran wrote:
> Hi everyone,
> 
>  
> 
> Currently, our client application and Ceph cluster are running on the
> primary datacenter. We’re planning to deploy Ceph on the secondary
> datacenter for DR. The secondary datacenter is in the standby mode. If
> something went wrong with the primary datacenter, the secondary
> datacenter will take over.
> 
> A possible way that would work in this case is adding hosts from the
> secondary datacenter into the existing Ceph cluster in the primary
> datacenter. However, this would add more latency for client requests,
> since clients from the primary datacenter might connect to OSD hosts in the
> secondary datacenter.
> 
>  
> 
> Are there any special configurations in Ceph that fulfill this requirement?

What is the application? CephFS? RBD? RGW?

The second datacenter indeed adds latency, so I would be very, very
careful with that.

Wido

> 
>  
> 
> I truly appreciate any comments!
> 
> Nghia.
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Nautius not working after setting MTU 9000

2020-05-23 Thread Wido den Hollander



On 5/23/20 12:02 PM, Amudhan P wrote:
> Hi,
> 
> I am using ceph Nautilus on Ubuntu 18.04, working fine with MTU size 1500
> (default); recently I tried to update the MTU size to 9000.
> After setting jumbo frames, running ceph -s is timing out.

Ceph can run just fine with an MTU of 9000. But there is probably
something else wrong on the network which is causing this.

Check the Jumbo Frames settings on all the switches as well to make sure
they forward all the packets.
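
A quick end-to-end check is a don't-fragment ping with a full 9000-byte frame
(8972 bytes of payload plus 28 bytes of IP/ICMP headers); the target is just a
placeholder for another Ceph node:

$ ping -M do -s 8972 <ip-of-another-ceph-node>

If that fails while a normal ping works, something in the path is dropping
jumbo frames.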

This is definitely not a Ceph issue.

Wido

> 
> regards
> Amudhan P
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: nfs migrate to rgw

2020-05-18 Thread Wido den Hollander


On 5/18/20 1:51 PM, Zhenshi Zhou wrote:
> Hi Wido,
> 
> I did some research on the nfs files. I found that they contain many
> pictures of about 50KB, and many video files of around 30MB. The number of
> files is more than 1 million. Maybe I can find a way to separate the files
> into more buckets so that there are no more than 1M objects in each bucket.
> But how about the small files of around 50KB? Does rgw serve small files well?

I would recommend using different buckets. What I've done in such cases
is use the year+month for sharding.

For example: video-2020-05
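
A rough illustration of that naming scheme (s3cmd and the names are only
examples):

$ s3cmd mb s3://video-2020-05
$ s3cmd put ./some-video.mp4 s3://video-2020-05/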

RGW can serve objects which are 50kB in size, but there is overhead
involved. Storing a lot of such small objects comes at a price of overhead.

Wido

> 
> Wido den Hollander  wrote on Tuesday, 12 May 2020 at 14:41:
> 
> 
> 
> On 5/12/20 4:22 AM, Zhenshi Zhou wrote:
> > Hi all,
> >
> > We have several nfs servers providing file storage. There is a
> nginx in
> > front of
> > nfs servers in order to serve the clients. The files are mostly
> small files
> > and
> > nearly about 30TB in total.
> >
> 
> What is small? How many objects/files are you talking about?
> 
> > I'm gonna use ceph rgw as the storage. I wanna know if it's
> appropriate to
> > do so.
> > The data migrating from nfs to rgw is a huge job. Besides I'm not sure
> > whether
> > ceph rgw is suitable in this scenario or not.
> >
> 
> Yes, it is. But make sure you don't put millions of objects into a
> single bucket. Make sure that you spread them out so that you have let's
> say 1M of objects per bucket at max.
> 
> Wido
> 
> > Thanks
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs taking too much memory, for pglog

2020-05-17 Thread Wido den Hollander


On 5/17/20 4:49 PM, Harald Staub wrote:
> tl;dr: this cluster is up again, thank you all (Mark, Wout, Paul
> Emmerich off-list)!
> 

Awesome!

> First we tried to lower max- and min_pg_log_entries on a single running
> OSD, without and with restarting it. There was no effect. Maybe because
> of the unclean state of the cluster.
> 
> Then we tried ceph-objectstore-tool trim-pg-log on an offline OSD. This
> has to be called per PG that is stored on the OSD. At first it seemed to
> be much too slow, took around 20 minutes. But the following PGs were
> much faster (like 1 minute). The trim part of the command was always
> fast, but the compaction part took a long time the first time.
> 
> CEPH_ARGS="--osd-min-pg-log-entries=1500 --osd-max-pg-log-entries=1500"
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$OSD --pgid $pg
> --op trim-pg-log
> 
> Thanks to these pglog trimmings, memory consumption was reduced and we
> could bring up all OSDs. Then recovery was quite fast, no big
> backfilling. We checked a bit later in the evening, there was plenty of
> free RAM.
> 
> Next morning, again free memory was very tight. Although it looked
> differently. dump mempools showed buffer_anon as biggest (should this
> not be tuned down by "osd memory target"?). But also osd_pglog (although
> all PGs were active+clean?).
> 
> Soon there was another OOM killer. Again, we treated this OSD with
> trim-pg-log to bring it back.
> 
> Then we decided to try again to reduce the pg_log parameters, cluster
> wide (from default 3000 to 2000). This time it worked, memory was
> released :-)
> 
> Then we added some RAM to get more to the safe side.
> 
> Some more background. As already mentioned, the number of PGs per OSD is
> ok, but there is a lot of small objects (nearly 1 billion), mostly S3,
> in an EC pool 8+3. So the number of objects lying on the OSDs
> (chunks? shards?) is about 10 billion in total. Per OSD (510 of type
> hdd) this is probably quite a lot. Maybe also a reason for high pglog
> demand. And it is not equally distributed, HDDs are 4 TB and 8 TB.
> 

Small files/objects are always a problem. They were when I was still
fiddling with NFS servers which stored PHP websites, but they still are
in modern systems.

Each object becomes an entry in BlueStore's (Rocks)DB and that can cause
all kinds of slowdowns and other issues.

I would always advise to set quotas on systems to prevent unbounded
growth of small objects in Ceph. CephFS and RGW both have such mechanisms.
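
As a hedged sketch of both mechanisms (paths, bucket name and limits are made
up):

# CephFS: cap a directory at 1M files / 1 TiB
$ setfattr -n ceph.quota.max_files -v 1000000 /mnt/cephfs/somedir
$ setfattr -n ceph.quota.max_bytes -v 1099511627776 /mnt/cephfs/somedir

# RGW: cap a bucket at 1M objects
$ radosgw-admin quota set --quota-scope=bucket --bucket=somebucket --max-objects=1000000
$ radosgw-admin quota enable --quota-scope=bucket --bucket=somebucket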

> Another point: the DB devices lie on SSDs. But they are too small
> nowadays, the sizing was done years ago, for Filestore.
> 

~30GB per OSD is sufficient at the moment with RocksDB's settings. The
next step is 300GB, see:
https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#levels-target-size

> Last not least, probably the trigger was a broken HDD on the Sunday
> before. Rebalancing then takes several days and was ongoing when the
> problems started.
> 
> Cheers
>  Harry
> 
> On 14.05.20 11:08, Wout van Heeswijk wrote:
>> Hi Harald,
>>
>> Your cluster has a lot of objects per osd/pg and the pg logs will grow
>> fast and large because of this. The pg_logs will keep growing as long
>> as your cluster's PGs are not active+clean. This means you are now in
>> a loop where you cannot get stable running OSDs because the pg_logs
>> take too much memory, and therefore the OSDs cannot purge the pg_logs...
>>
>> I suggest you lower the values for both the osd_min_pg_log_entries and
>> the osd_max_pg_log_entries. Lowering these values will cause Ceph to
>> go into backfilling much earlier, but the memory usage of the OSDs
>> will go down significantly enabling them to run stable. The default is
>> 3000 for both of these values.
>>
>> You can lower them to 500 by executing:
>>
>> ceph config set osd osd_min_pg_log_entries 500
>> ceph config set osd osd_max_pg_log_entries 500
>>
>> When you lower these values, you will get more backfilling instead of
>> recoveries but I think it will help you get through this situation.
>>
>> kind regards,
>>
>> Wout
>> 42on
>>
>> On 13-05-2020 07:27, Harald Staub wrote:
>>> Hi Mark
>>>
>>> Thank you for your feedback!
>>>
>>> The maximum number of PGs per OSD is only 123. But we have PGs with a
>>> lot of objects. For RGW, there is an EC pool 8+3 with 1024 PGs with
>>> 900M objects, maybe this is the problematic part. The OSDs are 510
>>> hdd, 32 ssd.
>>>
>>> Not sure, do you suggest to use something like
>>> ceph-objectstore-tool --op trim-pg-log ?
>>>
>>> When done correctly, would the risk be a lot of backfilling? Or also
>>> data loss?
>>>
>>> Also, to get up the cluster is one thing, to keep it running seems to
>>> be a real challenge right now (OOM killer) ...
>>>
>>> Cheers
>>>  Harry
>>>
>>> On 13.05.20 07:10, Mark Nelson wrote:
 Hi Herald,


 Changing the bluestore cache settings will have no effect at all on
 pglog memory consumption.  You can try either reducing the number of

[ceph-users] Re: Zeroing out rbd image or volume

2020-05-12 Thread Wido den Hollander



On 5/12/20 1:54 PM, Paul Emmerich wrote:
> And many hypervisors will turn writing zeroes into an unmap/trim (qemu
> detect-zeroes=unmap), so running trim on the entire empty disk is often the
> same as writing zeroes.
> So +1 for encryption being the proper way here
> 

+1

And to add to this: No, a newly created RBD image will never have 'left
over' bits and bytes from a previous RBD image.

I had to explain this multiple times to people who were used to old
(i)SCSI setups where partitions could have leftover data from a
previously created LUN.

With RBD this won't happen.

Wido

> 
> Paul
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: nfs migrate to rgw

2020-05-11 Thread Wido den Hollander



On 5/12/20 4:22 AM, Zhenshi Zhou wrote:
> Hi all,
> 
> We have several nfs servers providing file storage. There is a nginx in
> front of
> nfs servers in order to serve the clients. The files are mostly small files
> and
> nearly about 30TB in total.
> 

What is small? How many objects/files are you talking about?

> I'm gonna use ceph rgw as the storage. I wanna know if it's appropriate to
> do so.
> The data migrating from nfs to rgw is a huge job. Besides I'm not sure
> whether
> ceph rgw is suitable in this scenario or not.
> 

Yes, it is. But make sure you don't put millions of objects into a
single bucket. Make sure that you spread them out so that you have let's
say 1M of objects per bucket at max.

Wido

> Thanks
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster network and public network

2020-05-10 Thread Wido den Hollander


On 5/8/20 12:13 PM, Willi Schiegel wrote:
> Hello Nghia,
> 
> I once asked a similar question about network architecture and got the
> same answer as Martin wrote from Wido den Hollander:
> 
> There is no need to have a public and cluster network with Ceph. Working
> as a Ceph consultant I've deployed multi-PB Ceph clusters with a single
> public network without any problems. Each node has a single IP-address,
> nothing more, nothing less.
> 
> In the current Ceph manual you can read
> 
> It is possible to run a Ceph Storage Cluster with two networks: a public
> (front-side) network and a cluster (back-side) network. However, this
> approach complicates network configuration (both hardware and software)
> and does not usually have a significant impact on overall performance.
> For this reason, we generally recommend that dual-NIC systems either be
> configured with two IPs on the same network, or bonded.
> 
> I followed the advice from Wido "One system, one IP address" and
> everything works fine. So, you should be fine with one interface for
> MONs, MGRs, and OSDs.
> 

Great to hear! I'm still behind this idea and all the clusters I design
have a single (or LACP) network going to the host.

One IP address per node where all traffic goes over. That's Ceph, SSH,
(SNMP) Monitoring, etc.
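
In ceph.conf terms that single-network layout is simply something like (the
subnet is made up):

[global]
public_network = 192.0.2.0/24
# no cluster_network defined: client I/O, replication and heartbeats
# all travel over this one network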

Wido

> Best
> Willi
> 
> On 5/8/20 11:57 AM, Nghia Viet Tran wrote:
>> Hi Martin,
>>
>> Thanks for your response. You mean one network interface for only MON
>> hosts or for the whole cluster including OSD hosts? I’m confused now
>> because there are some projects that only use one public network for
>> the whole cluster. That means the rebalancing, replication of objects and
>> heartbeats from OSD hosts would affect the performance of Ceph clients.
>>
>> From: Martin Verges 
>> Date: Friday, May 8, 2020 at 16:20
>> To: Nghia Viet Tran 
>> Cc: "ceph-users@ceph.io" 
>> Subject: Re: [ceph-users] Cluster network and public network
>>
>> Hello Nghia,
>>
>> just use one network interface card and use frontend and backend
>> traffic on the same. No problem with that.
>>
>> If you have a dual port card, use both ports as an LACP channel and
>> maybe separate it using VLANs if you want to, but not required as well.
>>
>>
>> -- 
>>
>> Martin Verges
>> Managing director
>>
>> Mobile: +49 174 9335695
>> E-Mail: martin.ver...@croit.io
>> Chat: https://t.me/MartinVerges
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>>
>> Web: https://croit.io
>> YouTube: https://goo.gl/PGE1Bx
>>
>> On Fri, 8 May 2020 at 09:29, Nghia Viet Tran  wrote:
>>
>>     Hi everyone,
>>
>>     I have a question about the network setup. From the document, It’s
>>     recommended to have 2 NICs per hosts as described in below picture
>>
>>     Diagram
>>
>>     In the picture, OSD hosts will connect to the Cluster network for
>>     replication and heartbeats between OSDs; therefore, we definitely need
>>     2 NICs for them. But it seems there are no connections between Ceph MON
>>     and Cluster network. Can we install 1 NIC on Ceph MON then?
>>
>>     I appreciated any comments!
>>
>>     Thank you!
>>
>>     --
>>     Nghia Viet Tran (Mr)
>>
>>     ___
>>     ceph-users mailing list -- ceph-users@ceph.io
>>     To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg incomplete blocked by destroyed osd

2020-05-04 Thread Wido den Hollander


On 5/4/20 10:48 AM, Francois Legrand wrote:
> Hi all,
> During a crash disaster we had destroyed a few osds and reinstalled them
> with different ids.
> As an example, osd 3 was destroyed and recreated with id 101 by running
> ceph osd purge 3 --yes-i-really-mean-it + ceph osd create (to block id
> 3) + ceph-deploy osd create --data /dev/sdxx  and finally ceph
> osd rm 3.
> Some of our pgs are now incomplete (which can be understood) but blocked
> by some of the removed osds:
> ex: here is a part of the ceph pg 30.3 query
> {
>     "state": "incomplete",
>     "snap_trimq": "[]",
>     "snap_trimq_len": 0,
>     "epoch": 384075,
>     "up": [
>     103,
>     43,
>     29,
>     2,
>     66
>     ],
>     "acting": [
>     103,
>     43,
>     29,
>     2,
>     66
>     ],
> 
> 
> "peer_info": [
>     {
>     "peer": "2(3)",
>     "pgid": "30.3s3",
>     "last_update": "373570'105925965",
>     "last_complete": "373570'105925965",
> ...
> },
>     "up": [
>     103,
>     43,
>     29,
>     2,
>     66
>     ],
>     "acting": [
>     103,
>     43,
>     29,
>     2,
>     66
>     ],
>     "avail_no_missing": [],
>     "object_location_counts": [],
>     "blocked_by": [
>         3,
>         49
>     ],
> 
>     "down_osds_we_would_probe": [
>         3
>     ],
>     "peering_blocked_by": [],
>     "peering_blocked_by_detail": [
>     {
>         "detail": "peering_blocked_by_history_les_bound"
>     }
>     ]
> 
> 
> I don't understand why the removed osds are still considered and present
> in the pg info.
> Is there a way to get rid of that?

You can try to set:

osd_find_best_info_ignore_history_les = true

Then restart the OSDs involved with that PG.
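
A sketch for one of the OSDs in the PG's acting set (osd.103 here; repeat for
the others and remove the override once the PG has peered). This assumes a
Mimic-or-newer config store and systemd-managed OSDs; on older releases set
the option in the [osd] section of ceph.conf instead:

$ ceph config set osd.103 osd_find_best_info_ignore_history_les true
$ systemctl restart ceph-osd@103      # on the host that carries osd.103
# once the PG is active+clean again:
$ ceph config rm osd.103 osd_find_best_info_ignore_history_les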

Wido

> Moreover, we have tons of slow ops (more than 15 000) but I guess that
> the two problems are linked.
> Thanks for your help.
> F.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Octopus] OSD overloading

2020-04-08 Thread Wido den Hollander



On 4/8/20 1:38 PM, Jack wrote:
> Hello,
> 
> I've a issue, since my Nautilus -> Octopus upgrade
> 
> My cluster has many rbd images (~3k or something)
> Each of them has ~30 snapshots
> Each day, I create and remove a least a snapshot per image
> 
> Since Octopus, when I remove the "nosnaptrim" flag, each OSD uses 100%
> of its CPU time

Why do you have the 'nosnaptrim' flag set? I'm missing that piece of
information.

> The whole cluster collapses: OSDs no longer see each other, most of
> them are seen as down...
> I do not see any progress being made: it does not appear the problem
> will resolve by itself
> 
> What can I do ?
> 
> Best regards,
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Radosgw WAF

2020-04-06 Thread Wido den Hollander



On 4/5/20 11:53 AM, m.kefay...@afranet.com wrote:
> Hi
> we deployed ceph object storage and want to secure RGW. Is there any solution 
> or any user experience with this?
> Is it common to use a WAF?

I wouldn't say common, but I did this for many customers. I usually
install Varnish Cache in between. In addition to using it as a WAF we can
also easily cache objects so it can serve a lot of traffic.

As the traffic to RGW is predictable in many cases we can create WAF
rules for the RGW.

Wido

> 
> tnx
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v14.2.8 Nautilus released

2020-04-06 Thread Wido den Hollander


On 4/5/20 1:16 PM, Marc Roos wrote:
> No, I didn't get an answer to this. 
> 
> Yes, I thought so too, but recently there has been an issue here with an 
> upgrade to Octopus, where osd's are being changed automatically and 
> consume huge amounts of memory during this. Furthermore, if you have a 
> cluster with hundreds of osds, it is not really acceptable to have to 
> recreate them.
> 

The upgrade to O is not related to this.

But the alloc size of BlueStore is set during mkfs of BlueStore. So yes,
you will need to re-create the OSDs if you want this done.

That takes time:

- Mark out
- Wait for HEALTH_OK
- Re-format OSD
- Mark in
- Repeat

As time goes on the developers improve things which can't always be done
automatically, therefore at some point you will have to do this.
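
A hedged sketch of that re-format loop for one OSD (the id 12 and /dev/sdX are
placeholders); the new OSD picks up the current bluestore_min_alloc_size at
mkfs time:

$ ceph osd out 12
# wait for HEALTH_OK / all PGs active+clean
$ systemctl stop ceph-osd@12
$ ceph osd destroy 12 --yes-i-really-mean-it       # marks it destroyed, id stays reusable
$ ceph-volume lvm zap --destroy /dev/sdX
$ ceph-volume lvm create --osd-id 12 --data /dev/sdX
$ ceph osd in 12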

Wido

> 
>  
> 
> -Original Message-
> From: Brent Kennedy [mailto:bkenn...@cfl.rr.com] 
> Sent: 05 April 2020 04:26
> To: Marc Roos; 'abhishek'; 'ceph-users'
> Subject: RE: [ceph-users] Re: v14.2.8 Nautilus released
> 
> Did you get an answer for this?  My original thought when I read it was 
> that the osd would need to be recreated (as you noted).
> 
> -Brent
> 
> -Original Message-
> From: Marc Roos 
> Sent: Tuesday, March 3, 2020 10:58 AM
> To: abhishek ; ceph-users 
> Subject: [ceph-users] Re: v14.2.8 Nautilus released
> 
>  
> This bluestore_min_alloc_size_ssd=4K, do I need to recreate these osd's? 
> 
> Or does this magically change? What % performance increase can be 
> expected?
> 
> 
> -Original Message-
> To: ceph-annou...@ceph.io; ceph-users@ceph.io; d...@ceph.io; 
> ceph-de...@vger.kernel.org
> Subject: [ceph-users] v14.2.8 Nautilus released
> 
> 
> This is the eighth update to the Ceph Nautilus release series. This 
> release fixes issues across a range of subsystems. We recommend that all 
> 
> users upgrade to this release. Please note the following important 
> changes in this release; as always the full changelog is posted at:
> https://ceph.io/releases/v14-2-8-nautilus-released
> 
> Notable Changes
> ---
> 
> * The default value of `bluestore_min_alloc_size_ssd` has been changed
>   to 4K to improve performance across all workloads.
> 
> * The following OSD memory config options related to bluestore cache 
> autotuning can now
>   be configured during runtime:
> 
> - osd_memory_base (default: 768 MB)
> - osd_memory_cache_min (default: 128 MB)
> - osd_memory_expected_fragmentation (default: 0.15)
> - osd_memory_target (default: 4 GB)
> 
>   The above options can be set with::
> 
> ceph config set osd <option> <value>
> 
> * The MGR now accepts `profile rbd` and `profile rbd-read-only` user 
> caps.
>   These caps can be used to provide users access to MGR-based RBD 
> functionality
>   such as `rbd perf image iostat` an `rbd perf image iotop`.
> 
> * The configuration value `osd_calc_pg_upmaps_max_stddev` used for upmap
>   balancing has been removed. Instead use the mgr balancer config
>   `upmap_max_deviation` which now is an integer number of PGs of 
> deviation
>   from the target PGs per OSD.  This can be set with a command like
>   `ceph config set mgr mgr/balancer/upmap_max_deviation 2`.  The default
>   `upmap_max_deviation` is 1.  There are situations where crush rules
>   would not allow a pool to ever have completely balanced PGs.  For 
> example, if
>   crush requires 1 replica on each of 3 racks, but there are fewer OSDs 
> in 1 of
>   the racks.  In those cases, the configuration value can be increased.
> 
> * RGW: a mismatch between the bucket notification documentation and the 
> actual
>   message format was fixed. This means that any endpoints receiving 
> bucket
>   notification, will now receive the same notifications inside a JSON 
> array
>   named 'Records'. Note that this does not affect pulling bucket 
> notification
>   from a subscription in a 'pubsub' zone, as these are already wrapped 
> inside
>   that array.
> 
> * CephFS: multiple active MDS forward scrub is now rejected. Scrub 
> currently
>   only is permitted on a file system with a single rank. Reduce the 
> ranks to one
>   via `ceph fs set <fs_name> max_mds 1`.
> 
> * Ceph now refuses to create a file system with a default EC data pool. 
> For
>   further explanation, see:
>   https://docs.ceph.com/docs/nautilus/cephfs/createfs/#creating-pools
> 
> * Ceph will now issue a health warning if a RADOS pool has a `pg_num`
>   value that is not a power of two. This can be fixed by adjusting
>   the pool to a nearby power of two::
> 
> ceph osd pool set <pool-name> pg_num <value>
> 
>   Alternatively, the warning can be silenced with::
> 
> ceph config set global mon_warn_on_pool_pg_num_not_power_of_two false
> 
> Getting Ceph
> 
> 
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://download.ceph.com/tarballs/ceph-14.2.8.tar.gz
> * For packages, see 
> http://docs.ceph.com/docs/master/install/get-packages/
> * Release git sha1: 2d095e947a02261ce61424021bb43bd3022d35cb
> 
> --
> Abhishek Lek

[ceph-users] Re: Leave of absence...

2020-03-28 Thread Wido den Hollander


On 3/27/20 7:59 PM, Sage Weil wrote:
> Hi everyone,
> 
> I am taking time off from the Ceph project and from Red Hat, starting in 
> April and extending through the US election in November. I will initially 
> be working with an organization focused on voter registration and turnout 
> and combating voter suppression and disinformation campaigns.
> 
> During this time I will maintain some involvement in the Ceph community, 
> primarily around strategic planning for Pacific and the Ceph Foundation, 
> but most of my time will be focused elsewhere. 
> 

Thanks a lot for all your hard work and time in the last years. What
started as this tiny project more than 10 years ago grew into something
really awesome!

> Most decision making around Ceph will remain in the capable hands of the 
> Ceph Leadership Team and component leads--I have the utmost confidence in 
> their judgement and abilities.  Yehuda Sadeh and Josh Durgin will be 
> filling in to provide high-level guidance where needed.
> 
> I’ll be participating in the Pacific planning meetings planned for next 
> week, which will be important in kicking off development for Pacific: 
> 
>   https://ceph.io/cds/ceph-developer-summit-pacific/
> 
> I am extremely proud of what we have accomplished with the Octopus 
> release, and I believe the Ceph community will continue to do great things 
> with Pacific!  I look forward to returning at the end of the year to help 
> wrap up the release and (hopefully) get things ready for Cephalocon next 
> March.
> 
> Most of all, I am excited to become engaged in another effort that I feel 
> strongly about--one that will have a very real impact on my kids’ 
> futures--and that will be easier to explain to lay people! :)
> 

Family is important. Stay healthy and safe!

Wido

> Thanks!
> sage
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v15.2.0 Octopus released

2020-03-25 Thread Wido den Hollander



On 3/25/20 10:24 AM, Simon Oosthoek wrote:
> On 25/03/2020 10:10, konstantin.ilya...@mediascope.net wrote:
>> That is why I am asking that question about the upgrade instructions.
>> I really don't understand how to upgrade/reinstall CentOS 7 to 8 without 
>> affecting the work of the cluster.
>> As I know, this process is easier on Debian, but we deployed our Nautilus 
>> cluster on CentOS because there weren't any packages for 14.x for Debian 
>> Stretch (9) or Buster (10).
>> P.s.: if this is even possible, I would like to know how to upgrade servers 
>> with CentOS 7 + ceph 14.2.8 to Debian 10 with ceph 15.2.0 (we have servers 
>> with OSD only and 3 servers with Mon/Mgr/Mds)
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
> 
> I guess you could upgrade each node one by one. So upgrade/reinstall the
> OS, install Ceph 15 and re-initialise the OSDs if necessary. Though it
> would be nice if there was a way to re-integrate the OSDs from the
> previous installation...
> 

That works just fine. You can re-install the host OS and have
ceph-volume scan all the volumes. The OSDs should then just come back.
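
A hedged sketch of that re-activation on a freshly reinstalled host (assuming
ceph-volume/LVM based OSDs and that /etc/ceph and the bootstrap keyrings have
been restored):

$ ceph-volume lvm activate --all
# for old ceph-disk style OSDs instead:
$ ceph-volume simple scan
$ ceph-volume simple activate --all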

Or you can take the even safer route by removing OSDs completely from
the cluster and wiping a box.

Did this recently with a customer. In the meantime they took the
opportunity to also flash the firmware of all the components and the
machines came back again with a complete fresh installation.

> Personally, I'm planning to wait for a while to upgrade to Ceph 15, not
> in the least because it's not convenient to do stuff like OS upgrades
> from home ;-)
> 
> Currently we're running ubuntu 18.04 on the ceph nodes, I'd like to
> upgrade to ubuntu 20.04 and then to ceph 15.
> 

I think many people will do this. I wouldn't run 15.2.0 on my production
environment right away.

Wido

> Cheers
> 
> /Simon
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Inactive PGs

2020-03-13 Thread Wido den Hollander


On 3/13/20 5:44 PM, Peter Eisch wrote:
> 
> 
> 
> 
> On 3/13/20, 11:38 AM, "Wido den Hollander"  wrote:
> 
> 
> 
> On 3/13/20 4:09 PM, Peter Eisch wrote:
>> Full cluster is 14.2.8.
>>
>> I had some OSDs drop overnight, which now results in 4 inactive PGs. The
>> pools had three participant (2 ssd, 1 sas) OSDs. In each pool at least 1
>> ssd and 1 sas OSD is working without issue. I’ve run ‘ceph pg repair <pgid>’
>> but it doesn’t seem to make any changes.
>>
>> PG_AVAILABILITY Reduced data availability: 4 pgs inactive, 4 pgs
> incomplete
>> pg 10.2e is incomplete, acting [59,67]
>> pg 10.c3 is incomplete, acting [62,105]
>> pg 10.f3 is incomplete, acting [62,59]
>> pg 10.1d5 is incomplete, acting [87,106]
>>
>> Using `ceph pg <pgid> query` I can see the OSD in each case of the ones
>> which failed. Respectively they are:
>> pg 10.2e participants: 59, 68, 77, 143
>> pg 10.c3 participants: 60, 62, 85, 102, 105, 106
>> pg 10.f3 participants: 59, 64, 75, 107
>> pg 10.1d5 participants: 64, 77, 87, 106
>>
>> The OSDs which are now down/out and have been removed from the crush map
>> and removed the auth are:
>> 62, 64, 68
>>
>> Of course I have lots of reports of slow OSDs now from OSDs worried
>> about the inactive PGs.
>>
>> How do I properly kick these PGs to have them drop their usage of the
>> OSDs which no longer exist?
> 
> You don't. Because those OSDs hold the data you need.
> 
> Why did you remove them from the CRUSHMap, OSDMap and auth? As you need
> these to rebuild the PGs.
> 
> Wido
> 
> The drives failed at a hardware level. I've replaced OSDs with this by
> either planned migration or failure in previous instances without issue.
> I didn't realize all the replicated copies were on just one drive in
> each pool.
> What should my actions have been in this case?

Try to get those OSDs online again. Maybe try a rescue of the disks or
see how the OSDs would be able to start.

A tool like dd_rescue can help in getting such a thing done.
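
A minimal, hedged example of such a rescue copy (device names are
placeholders; copy from the failing disk to a healthy one first and then try
to start the OSD from the copy):

$ dd_rescue /dev/sdX /dev/sdY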

> 
> pool 10 volumes' replicated size 2 min_size 1 crush_rule 1 object_hash
> rjenkins pg_num 512 pgp_num 512 autoscale_mode warn last_change 47570
> lfor 0/0/40781 flags hashpspool,selfmanaged_snaps stripe_width 0
> application rbd

I see you use 2x replication with min_size=1, that's dangerous and can
easily lead to data loss.

I wouldn't say it's impossible to get the data back, but something like
this can take a while (a lot of hours) to be brought back online.

Wido

> 
> Crush rule 1:
> rule ssd_by_host {
> id 1
> type replicated
> min_size 1
> max_size 10
> step take default class ssd
> step chooseleaf firstn 0 type host
> step emit
> }
> 
> peter
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Inactive PGs

2020-03-13 Thread Wido den Hollander


On 3/13/20 4:09 PM, Peter Eisch wrote:
> Full cluster is 14.2.8.
> 
> I had some OSDs drop overnight, which now results in 4 inactive PGs. The
> pools had three participant (2 ssd, 1 sas) OSDs. In each pool at least 1
> ssd and 1 sas OSD is working without issue. I’ve run ‘ceph pg repair <pgid>’
> but it doesn’t seem to make any changes.
> 
> PG_AVAILABILITY Reduced data availability: 4 pgs inactive, 4 pgs incomplete
> pg 10.2e is incomplete, acting [59,67]
> pg 10.c3 is incomplete, acting [62,105]
> pg 10.f3 is incomplete, acting [62,59]
> pg 10.1d5 is incomplete, acting [87,106]
> 
> Using `ceph pg <pgid> query` I can see the OSD in each case of the ones
> which failed. Respectively they are:
> pg 10.2e participants: 59, 68, 77, 143
> pg 10.c3 participants: 60, 62, 85, 102, 105, 106
> pg 10.f3 participants: 59, 64, 75, 107
> pg 10.1d5 participants: 64, 77, 87, 106
> 
> The OSDs which are now down/out and have been removed from the crush map
> and removed the auth are:
> 62, 64, 68
> 
> Of course I have lots of reports of slow OSDs now from OSDs worried
> about the inactive PGs.
> 
> How do I properly kick these PGs to have them drop their usage of the
> OSDs which no longer exist?

You don't. Because those OSDs hold the data you need.

Why did  you remove them from the CRUSHMap, OSDMap and auth? As you need
these to rebuild the PGs.

Wido

> 
> Thanks for you thoughts on this,
> 
> peter
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cancelled: Ceph Day Oslo May 13th

2020-03-13 Thread Wido den Hollander
Hi,

Due to the recent developments around the COVID-19 virus we (the
organizers) have decided to cancel the Ceph Day in Oslo on May 13th.

Although it's still 8 weeks away, we don't know how the situation will
develop and whether travel will be possible or people will be willing to travel.

Therefore we thought it was best to cancel the event for now and to
re-schedule to a later date in 2020.

We haven't picked a date yet. Once chosen we'll communicate it through
the regular channels.

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] IPv6 connectivity gone for Ceph Telemetry

2020-03-12 Thread Wido den Hollander
Hi,

I was just checking on a few (13) IPv6-only Ceph clusters and I noticed
that they couldn't send their Telemetry data anymore:

telemetry.ceph.com has address 8.43.84.137
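
(For reference, the absence of an IPv6 record can be confirmed with the same
tool:)

$ host -t AAAA telemetry.ceph.com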

This server used to have Dual-Stack connectivity while it was still
hosted at OVH.

It seemed to have moved to Red Hat, but lost IPv6 connectivity.

How can we get this back?

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-mon store.db disk usage increase on OSD-Host fail

2020-03-12 Thread Wido den Hollander



On 3/12/20 7:44 AM, Hartwig Hauschild wrote:
> Am 10.03.2020 schrieb Wido den Hollander:
>>
>>
>> On 3/10/20 10:48 AM, Hartwig Hauschild wrote:
>>> Hi, 
>>>
>>> I've done a bit more testing ...
>>>
>>> Am 05.03.2020 schrieb Hartwig Hauschild:
>>>> Hi, 
>>>>
> [ snipped ]
>>> I've read somewhere in the docs that I should provide ample space (tens of
>>> GB) for the store.db, found on the ML and Bugtracker that ~100GB might not
>>> be a bad idea and that large clusters may require space on order of
>>> magnitude greater.
>>> Is there some sort of formula I can use to approximate the space required?
>>
>> I don't know about a formula, but make sure you have enough space. MONs
>> are dedicated nodes in most production environments, so I usually
>> install a 400 ~ 1000GB SSD just to make sure they don't run out of space.
>>
> That seems fair.
>>>
>>> Also: is the db supposed to grow this fast in Nautilus when it did not do
>>> that in Luminous? Is that behaviour configurable somewhere?
>>>
>>
>> The MONs need to cache the OSDMaps when not all PGs are active+clean
>> thus their database grows.
>>
>> You can compact RocksDB in the meantime, but it won't last for ever.
>>
>> Just make sure the MONs have enough space.
>>
> Do you happen to know if that behaved differently in previous releases? I'm
> just asking because I have not found anything about this yet and may need to
> explain that it's different now.
> 

It actually became better in recent releases; Nautilus didn't become worse.

Hammer and Jewel were very bad with this and they grew to hundreds of GB
on large(r) clusters.

So no, I'm not aware of any changes.

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Setting user in rados command line utility

2020-03-11 Thread Wido den Hollander


On 3/11/20 4:22 PM, Rodrigo Severo - Fábrica wrote:
> Em qua., 11 de mar. de 2020 às 12:20, Rodrigo Severo - Fábrica
>  escreveu:
>>
>> Hi,
>>
>>
>> How can I set the user to be used by rados command line utility?
>>
>> I can see no option for that in 
>> https://docs.ceph.com/docs/nautilus/man/8/rados/
> 
> Must I always use the client.admin user?
> 

No, you can use:

$ rados --id admin lspools
$ rados --id myuser lspools

Or you can use:

$ rados -n client.myuser lspools

This also works for the 'ceph' and 'rbd' commands.
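
For example, assuming a user named 'myuser' whose keyring is in the default
location:

$ ceph --id myuser -s
$ rbd --id myuser ls rbd

If the keyring lives somewhere else you can point to it with --keyring.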

Wido


> 
> Rodrigo
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-mon store.db disk usage increase on OSD-Host fail

2020-03-10 Thread Wido den Hollander



On 3/10/20 10:48 AM, Hartwig Hauschild wrote:
> Hi, 
> 
> I've done a bit more testing ...
> 
> Am 05.03.2020 schrieb Hartwig Hauschild:
>> Hi, 
>>
>> I'm (still) testing upgrading from Luminous to Nautilus and ran into the
>> following situation:
>>
>> The lab-setup I'm testing in has three OSD-Hosts. 
>> If one of those hosts dies the store.db in /var/lib/ceph/mon/ on all my
>> Mon-Nodes starts to rapidly grow in size until either the OSD-host comes
>> back up or disks are full.
>>
> This also happens when I take one single OSD offline - /var/lib/ceph/mon/
> grows from around 100MB to ~2GB in about 5 Minutes, then I aborted the test.
> Since we've had an OSD-Host fail over a weekend I know that growing won't
> stop until the disk is full and that usually happens in around 20 Minutes,
> then taking up 17GB of diskspace.
> 
>> On another cluster that's still on Luminous I don't see any growth at all.
>>
> Retested that cluster as well, observing the size on disk of
> /var/lib/ceph/mon/ suggests, that there's writes and deletes / compactions
> going on as it kept floating within +- 5% of the original size.
> 
>> Is that a difference in behaviour between Luminous and Nautilus or is that
>> caused by the lab-setup only having three hosts and one lost host causing
>> all PGs to be degraded at the same time?
>>
> 
> I've read somewhere in the docs that I should provide ample space (tens of
> GB) for the store.db, found on the ML and Bugtracker that ~100GB might not
> be a bad idea and that large clusters may require space on order of
> magnitude greater.
> Is there some sort of formula I can use to approximate the space required?

I don't know about a formula, but make sure you have enough space. MONs
are dedicated nodes in most production environments, so I usually
install a 400 ~ 1000GB SSD just to make sure they don't run out of space.

> 
> Also: is the db supposed to grow this fast in Nautilus when it did not do
> that in Luminous? Is that behaviour configurable somewhere?
> 

The MONs need to cache the OSDMaps when not all PGs are active+clean
thus their database grows.

You can compact RocksDB in the meantime, but it won't last forever.
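
If you want to trigger a compaction by hand, something like this should do
it (adjust the mon ID to your own):

$ ceph tell mon.mon01 compact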

Just make sure the MONs have enough space.

Wido

> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS Issues

2020-03-06 Thread Wido den Hollander



On 3/6/20 7:46 PM, dhils...@performair.com wrote:
> All;
> 
> We are in the middle of upgrading our primary cluster from 14.2.5 to 14.2.8. 
> Our cluster utilizes 6 MDSs for 3 CephFS file systems. 3 MDSs are collocated 
> with MON/MGR, and 3 MDSs are collocated with OSDs.
> 
> At this point we have upgraded all 3 of the MON/MDS/MGR servers. The MDS on 2 
> of the 3 is currently not working, and we are seeing the below log messages.
> 
> 2020-03-06 11:12:56.184 <> -1 mds. unable to obtain rotating service 
> keys; retrying
> 2020-03-06 11:13:26.184 <>  0 monclient: wait_auth_rotating timed out after 30
> 2020-03-06 11:13:26.184 <> -1 mds. ERROR: failed to refresh rotating 
> keys, maximum retry time reached.
> 2020-03-06 11:13:26.184 <>  1 mds. suicide! Wanted state up:boot
> 
> Any ideas?
> 

Double check: Is the time correct on all the machines?

cephx can have issues if there is a clock issue.
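
Something like this can help to verify (the first command only covers the
mons, check the MDS/OSD hosts directly as well):

$ ceph time-sync-status
$ timedatectl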

Wido

> Thank you,
> 
> Dominic L. Hilsbos, MBA 
> Director - Information Technology 
> Perform Air International Inc.
> dhils...@performair.com 
> www.PerformAir.com
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can't add a ceph-mon to existing large cluster

2020-03-05 Thread Wido den Hollander



On 3/5/20 3:22 PM, Sage Weil wrote:
> On Thu, 5 Mar 2020, Dan van der Ster wrote:
>> Hi all,
>>
>> There's something broken in our env when we try to add new mons to
>> existing clusters, confirmed on two clusters running mimic and
>> nautilus. It's basically this issue
>> https://tracker.ceph.com/issues/42830
>>
>> In case something is wrong with our puppet manifests, I'm trying to
>> doing it manually.
>>
>> First we --mkfs the mon and start it, but as soon as the new mon
>> starts synchronizing, the existing leader becomes unresponsive and an
>> election is triggered.
>>
>> Here's exactly what I'm doing:
>>
>> # cd /var/lib/ceph/tmp/
>> # scp cephmon1:/var/lib/ceph/tmp/keyring.mon.cephmon1 keyring.mon.cephmon4
>> # ceph mon getmap -o monmap
>> # ceph-mon --mkfs -i cephmon4 --monmap monmap --keyrin
>> keyring.mon.cephmon4 --setuser ceph --setgroup ceph
>> # vi /etc/ceph/ceph.conf 
>> [mon.cephmon4]
>> host = cephmon4
>> mon addr = a.b.c.d:6790
>> # systemctl start ceph-mon@cephmon4
>>
>> The log file on the new mon shows it start synchronizing, then
>> immediately the CPU usage on the leader goes to 100% and elections
>> start happening, and ceph health shows mon slow ops. perf top of the
>> ceph-mon with 100% CPU is shown below [1].
>> On a small nautilus cluster, the new mon gets added withing a minute
>> or so (but not cleanly -- the leader is unresponsive for quite awhile
>> until the new mon joins). debug_mon=20 on the leader doesn't show
>> anything very interesting.
>> On our large mimic cluster we tried waiting more than 10 minutes --
>> suffering through several mon elections and 100% usage bouncing around
>> between leaders -- until we gave up.
>>
>> I'm pulling my hair out a bit on this -- it's really weird!
> 
> Can you try running a rocksdb compaction on the existing mons before 
> adding the new one and see if that helps?

I can chime in here: I had this happen to a customer as well.

Compact did not work.

Some background:

5 Monitors and the DBs were ~350M in size. They upgraded one MON from
13.2.6 to 13.2.8 and that caused one MON (sync source) to eat 100% CPU.

The logs showed that the upgraded MON (which was restarted) was in the
synchronizing state.

Because they had 5 MONs, they still had 3 healthy ones left, so the cluster kept running.

I left this for about 5 minutes, but it never synced.

I tried a compact, didn't work either.

Eventually I stopped one MON, tarballed its database and used that to
bring back the MON which was upgraded to 13.2.8.

That worked without any hiccups. The MON joined again within a few seconds.
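
Roughly, the procedure was something like this (hostnames and paths are just
examples; do this with care and keep a copy of the broken store):

On a healthy MON:

$ systemctl stop ceph-mon@mon2
$ tar czf /tmp/mon-store.tar.gz -C /var/lib/ceph/mon/ceph-mon2 store.db
$ systemctl start ceph-mon@mon2

On the broken MON:

$ systemctl stop ceph-mon@mon1
$ mv /var/lib/ceph/mon/ceph-mon1/store.db /var/lib/ceph/mon/ceph-mon1/store.db.bad
$ tar xzf /tmp/mon-store.tar.gz -C /var/lib/ceph/mon/ceph-mon1
$ chown -R ceph:ceph /var/lib/ceph/mon/ceph-mon1/store.db
$ systemctl start ceph-mon@mon1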

Wido

> 
> s
> 
>>
>> Did anyone add a new mon to an existing large cluster recently, and it
>> went smoothly?
>>
>> Cheers, Dan
>>
>> [1]
>>
>>   15.12%  ceph-mon [.]
>> MonitorDBStore::Transaction::encode
>>8.95%  libceph-common.so.0  [.]
>> ceph::buffer::v14_2_0::ptr::append
>>8.68%  libceph-common.so.0  [.]
>> ceph::buffer::v14_2_0::list::append
>>7.69%  libceph-common.so.0  [.]
>> ceph::buffer::v14_2_0::ptr::release
>>5.86%  libceph-common.so.0  [.]
>> ceph::buffer::v14_2_0::ptr::ptr
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: consistency of import-diff

2020-03-04 Thread Wido den Hollander



On 3/4/20 3:49 PM, Lars Marowsky-Bree wrote:
> On 2020-03-04T15:44:34, Wido den Hollander  wrote:
> 
>> I understand what you are trying to do, but it's a trade-off. Endless
>> snapshots are also a danger because bit-rot can sneak in somewhere which
>> you might not notice.
>>
>> A fresh export (full copy) every X period protects you against this.
> 
> Hrm. We have checksums on the actual OSD data, so it ought to be
> possible to add these to the export/import/diff bits so it can be
> verified faster.
> 
> (Well, barring bugs.)
> 

I mainly meant bugs, I should have clarified that better.

Do you trust the technology you want to back up to create a proper
backup for you? By that I mean: what if librbd or librados contains a
bug which corrupts all your backups?

You think the backups all went fine because the snapshots seem
consistent on both ends, but you are not sure until you actually test a
restore.

Those are the things I take into consideration when using such technologies.

Wido

> 
> 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: consistency of import-diff

2020-03-04 Thread Wido den Hollander



On 3/3/20 8:46 PM, Stefan Priebe - Profihost AG wrote:
> Hello,
> 
> does anybody know whether there is any mechanism to make sure an image
> looks like the original after an import-diff?
> 
> While doing Ceph backups on another Ceph cluster I currently do a fresh
> import every 7 days. So I'm sure that if something went wrong with
> import-diff I have a fresh one every 7 days.
> 
> Otherwise I waste a lot of backup storage. So I wanted to know if there
> is any way to be sure that the image is OK and safe and matches the
> original snapshot afterwards.

But how can you be sure that the program that verifies this for you
doesn't have a bug?

I understand what you are trying to do, but it's a trade-off. Endless
snapshots are also a danger because bit-rot can sneak in somewhere which
you might not notice.

A fresh export (full copy) every X period protects you against this.
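
As a rough sketch of such a cycle (pool, image and snapshot names are made
up):

# full copy every X period
$ rbd export rbd/vm-100-disk-0@backup-20200304 /backup/vm-100-disk-0.full

# incrementals in between
$ rbd export-diff --from-snap backup-20200303 rbd/vm-100-disk-0@backup-20200304 /backup/vm-100-disk-0.diff
$ rbd import-diff /backup/vm-100-disk-0.diff backup/vm-100-disk-0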

Wido

> 
> Greets,
> Stefan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Error in Telemetry Module

2020-03-04 Thread Wido den Hollander



On 3/4/20 12:35 PM, Tecnologia Charne.Net wrote:
> Hello!
> 
> Today, I started the day with
> 
> # ceph -s
>   cluster:
>     health: HEALTH_ERR
>     Module 'telemetry' has failed:
> HTTPSConnectionPool(host='telemetry.ceph.com', port=443): Max retries
> exceeded with url: /report (Caused by
> NewConnectionError(' at 0x7fa97e5a4f90>: Failed to establish a new connection: [Errno 110]
> Connection timed out'))
> 
> Any thoughts?
> 
> I tried disable an re-enable the module, but the error remains.
> 

The telemetry server seems to be down. People have been notified :-)

Wido

> Thanks in advance!
> 
> 
> -Javier
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Forcibly move PGs from full to empty OSD

2020-03-04 Thread Wido den Hollander



On 3/4/20 11:15 AM, Thomas Schneider wrote:
> Hi,
> 
> Ceph balancer is not working correctly; there's an open bug
>  report, too.
> 
> Until this issue is not solved, I need a workaround because I get more
> and more warnings about "nearfull osd(s)".
> 
> Therefore my question is:
> How can I forcibly move PGs from full OSD to empty OSD?

Yes, you could manually create upmap items to map PGs to a specific OSD
and offload another one.

This is what the balancer also does. Keep in mind though that you should
respect your failure domain (host, rack, etc) when creating these mappings.
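
For example, to move PG 10.2e from OSD 59 to OSD 123 (hypothetical IDs, and
the target must be a valid choice under your CRUSH rule):

$ ceph osd pg-upmap-items 10.2e 59 123

And to remove that mapping again:

$ ceph osd rm-pg-upmap-items 10.2e

Note that this requires require-min-compat-client to be luminous or newer.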

Wido

> 
> THX
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Wido den Hollander


> Op 20 feb. 2020 om 19:54 heeft Dan van der Ster  het 
> volgende geschreven:
> 
> For those following along, the issue is here:
> https://tracker.ceph.com/issues/39525#note-6
> 
> Somehow single bits are getting flipped in the osdmaps -- maybe
> network, maybe memory, maybe a bug.
> 

Weird!

But I did see things like this happen before. This was under Hammer and Jewel 
where I needed to do these kinds of things. Crashes looked very similar.

> To get an osd starting, we have to extract the full osdmap from the
> mon, and set it into the crashing osd. So for the osd.666:
> 
> # ceph osd getmap 2982809 -o 2982809
> # ceph-objectstore-tool --op set-osdmap --data-path
> /var/lib/ceph/osd/ceph-666/ --file 2982809
> 
> Some osds had multiple corrupted osdmaps -- so we scriptified the above.

Were those corrupted ones in sequence?

> As of now our PGs are all active, but we're not confident that this


Awesome!

Wido

> won't happen again (without knowing why the maps were corrupting).
> 
> Thanks to all who helped!
> 
> dan
> 
> 
> 
>> On Thu, Feb 20, 2020 at 1:01 PM Dan van der Ster  wrote:
>> 
>> 680 is epoch 2983572
>> 666 crashes at 2982809 or 2982808
>> 
>>  -407> 2020-02-20 11:20:24.960 7f4d931b5b80 10 osd.666 0 add_map_bl
>> 2982809 612069 bytes
>>  -407> 2020-02-20 11:20:24.966 7f4d931b5b80 -1 *** Caught signal (Aborted) **
>> in thread 7f4d931b5b80 thread_name:ceph-osd
>> 
>> So I grabbed 2982809 and 2982808 and am setting them.
>> 
>> Checking if the osds will start with that.
>> 
>> -- dan
>> 
>> 
>> 
>>> On Thu, Feb 20, 2020 at 12:47 PM Wido den Hollander  wrote:
>>> On 2/20/20 12:40 PM, Dan van der Ster wrote:
>>>> Hi,
>>>> 
>>>> My turn.
>>>> We suddenly have a big outage which is similar/identical to
>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html
>>>> 
>>>> Some of the osds are runnable, but most crash when they start -- crc
>>>> error in osdmap::decode.
>>>> I'm able to extract an osd map from a good osd and it decodes well
>>>> with osdmaptool:
>>>> 
>>>> # ceph-objectstore-tool --op get-osdmap --data-path
>>>> /var/lib/ceph/osd/ceph-680/ --file osd.680.map
>>>> 
>>>> But when I try on one of the bad osds I get:
>>>> 
>>>> # ceph-objectstore-tool --op get-osdmap --data-path
>>>> /var/lib/ceph/osd/ceph-666/ --file osd.666.map
>>>> terminate called after throwing an instance of 
>>>> 'ceph::buffer::malformed_input'
>>>>  what():  buffer::malformed_input: bad crc, actual 822724616 !=
>>>> expected 2334082500
>>>> *** Caught signal (Aborted) **
>>>> in thread 7f600aa42d00 thread_name:ceph-objectstor
>>>> ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic 
>>>> (stable)
>>>> 1: (()+0xf5f0) [0x7f5ffefc45f0]
>>>> 2: (gsignal()+0x37) [0x7f5ffdbae337]
>>>> 3: (abort()+0x148) [0x7f5ffdbafa28]
>>>> 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5ffe4be7d5]
>>>> 5: (()+0x5e746) [0x7f5ffe4bc746]
>>>> 6: (()+0x5e773) [0x7f5ffe4bc773]
>>>> 7: (()+0x5e993) [0x7f5ffe4bc993]
>>>> 8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7f6000f4168e]
>>>> 9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f6000f42e31]
>>>> 10: (get_osdmap(ObjectStore*, unsigned int, OSDMap&,
>>>> ceph::buffer::list&)+0x1d0) [0x55d30a489190]
>>>> 11: (main()+0x5340) [0x55d30a3aae70]
>>>> 12: (__libc_start_main()+0xf5) [0x7f5ffdb9a505]
>>>> 13: (()+0x3a0f40) [0x55d30a483f40]
>>>> Aborted (core dumped)
>>>> 
>>>> 
>>>> 
>>>> I think I want to inject the osdmap, but can't:
>>>> 
>>>> # ceph-objectstore-tool --op set-osdmap --data-path
>>>> /var/lib/ceph/osd/ceph-666/ --file osd.680.map
>>>> osdmap (#-1:b65b78ab:::osdmap.2983572:0#) does not exist.
>>>> 
>>> 
>>> Have you tried to list which epoch osd.680 is at and which one osd.666
>>> is at? And which one the MONs are at?
>>> 
>>> Maybe there is a difference there?
>>> 
>>> Wido
>>> 
>>>> 
>>>> How do I do this?
>>>> 
>>>> Thanks for any help!
>>>> 
>>>> dan
>>>> ___
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osdmap::decode crc error -- 13.2.7 -- most osds down

2020-02-20 Thread Wido den Hollander



On 2/20/20 12:40 PM, Dan van der Ster wrote:
> Hi,
> 
> My turn.
> We suddenly have a big outage which is similar/identical to
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036519.html
> 
> Some of the osds are runnable, but most crash when they start -- crc
> error in osdmap::decode.
> I'm able to extract an osd map from a good osd and it decodes well
> with osdmaptool:
> 
> # ceph-objectstore-tool --op get-osdmap --data-path
> /var/lib/ceph/osd/ceph-680/ --file osd.680.map
> 
> But when I try on one of the bad osds I get:
> 
> # ceph-objectstore-tool --op get-osdmap --data-path
> /var/lib/ceph/osd/ceph-666/ --file osd.666.map
> terminate called after throwing an instance of 'ceph::buffer::malformed_input'
>   what():  buffer::malformed_input: bad crc, actual 822724616 !=
> expected 2334082500
> *** Caught signal (Aborted) **
>  in thread 7f600aa42d00 thread_name:ceph-objectstor
>  ceph version 13.2.7 (71bd687b6e8b9424dd5e5974ed542595d8977416) mimic (stable)
>  1: (()+0xf5f0) [0x7f5ffefc45f0]
>  2: (gsignal()+0x37) [0x7f5ffdbae337]
>  3: (abort()+0x148) [0x7f5ffdbafa28]
>  4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5ffe4be7d5]
>  5: (()+0x5e746) [0x7f5ffe4bc746]
>  6: (()+0x5e773) [0x7f5ffe4bc773]
>  7: (()+0x5e993) [0x7f5ffe4bc993]
>  8: (OSDMap::decode(ceph::buffer::list::iterator&)+0x160e) [0x7f6000f4168e]
>  9: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f6000f42e31]
>  10: (get_osdmap(ObjectStore*, unsigned int, OSDMap&,
> ceph::buffer::list&)+0x1d0) [0x55d30a489190]
>  11: (main()+0x5340) [0x55d30a3aae70]
>  12: (__libc_start_main()+0xf5) [0x7f5ffdb9a505]
>  13: (()+0x3a0f40) [0x55d30a483f40]
> Aborted (core dumped)
> 
> 
> 
> I think I want to inject the osdmap, but can't:
> 
> # ceph-objectstore-tool --op set-osdmap --data-path
> /var/lib/ceph/osd/ceph-666/ --file osd.680.map
> osdmap (#-1:b65b78ab:::osdmap.2983572:0#) does not exist.
> 

Have you tried to list which epoch osd.680 is at and which one osd.666
is at? And which one the MONs are at?

Maybe there is a difference there?

Wido

> 
> How do I do this?
> 
> Thanks for any help!
> 
> dan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph nvme 2x replication

2020-02-19 Thread Wido den Hollander



On 2/19/20 3:17 PM, Frank R wrote:
> Hi all,
> 
> I have noticed that RedHat is willing to support 2x replication with
> NVME drives. Additionally, I have seen CERN presentation where they
> use a 2x replication with NVME for a hyperconverged/HPC/CephFS
> solution.
> 

Don't do this if you care about your data. NVMe isn't anything better or
worse than SSDs. It's still an SSD: the SATA/SAS controller has been
swapped for NVMe, but it's still flash.

> I would like to hear some opinions on whether this is really a good
> idea for production. Is this setup (NVME/2x replication) really only
> meant to be used for data that is backed up and/or can be lost without
> causing a catastrophe.
> 

Yes.

You can still lose data due to a single drive failure or OSD crash.
Let's say you have an OSD/host down for maintenance or due to a network
outage. The OSD's device isn't lost, but it's unavailable.

While that happens you lose another OSD, but this time you actually
lose the device due to a failure.

Now you've lost data, although you *think* you still have another OSD
in a healthy state. If you boot that OSD you'll find out it's
outdated, because writes happened to the OSD you just lost.

Result = data loss

2x replication is a bad thing in production if you care about your data.
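
If in doubt, stick to the usual size 3 / min_size 2 for replicated pools,
for example:

$ ceph osd pool set <pool> size 3
$ ceph osd pool set <pool> min_size 2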

Wido

> Thanks,
> Frank
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [FORGED] Lost all Monitors in Nautilus Upgrade, best way forward?

2020-02-19 Thread Wido den Hollander


On 2/19/20 10:11 AM, Paul Emmerich wrote:
> On Wed, Feb 19, 2020 at 10:03 AM Wido den Hollander  wrote:
>>
>>
>>
>> On 2/19/20 8:49 AM, Sean Matheny wrote:
>>> Thanks,
>>>
>>>> If the OSDs have a newer epoch of the OSDMap than the MON it won't work.
>>>
>>> How can I verify this? (i.e the epoch of the monitor vs the epoch of the
>>> osd(s))
>>>
>>
>> Check the status of the OSDs:
>>
>> $ ceph daemon osd.X status
>>
>> This should tell the newest map it has.
>>
>> Then check on the mons:
>>
>> $ ceph osd dump|head -n 10
> 
> mons are offline

I think he said he got one MON back manually. His 'ceph -s' also shows
it :-)

> 
>> Or using ceph-monstore-tool to see what the latest map is the MON has.
> 
> ceph-monstore-tool  dump-keys
> 
> Also useful:
> 
> ceph-monstore-tool  get osdmap
> 

Indeed. My thought is that there is a mismatch in OSDMaps between the
MONs and OSDs which is causing these problems.

Wido

> Paul
> 
>>
>> Wido
>>
>>> Cheers,
>>> Sean
>>>
>>>
>>>> On 19/02/2020, at 7:25 PM, Wido den Hollander >>> <mailto:w...@42on.com>> wrote:
>>>>
>>>>
>>>>
>>>> On 2/19/20 5:45 AM, Sean Matheny wrote:
>>>>> I wanted to add a specific question to the previous post, in the
>>>>> hopes it’s easier to answer.
>>>>>
>>>>> We have a Luminous monitor restored from the OSDs using
>>>>> ceph-object-tool, which seems like the best chance of any success. We
>>>>> followed this rough process:
>>>>>
>>>>> https://tracker.ceph.com/issues/24419
>>>>>
>>>>> The monitor has come up (as a single monitor cluster), but it’s
>>>>> reporting wildly inaccurate info, such as the number of osds that are
>>>>> down (157 but all 223 are down), and hosts (1, but all 14 are down).
>>>>>
>>>>
>>>> Have you verified that the MON's database has the same epoch of the
>>>> OSDMap (or newer) as all the other OSDs?
>>>>
>>>> If the OSDs have a newer epoch of the OSDMap than the MON it won't work.
>>>>
>>>>> The OSD Daemons are still off, but I’m not sure if starting them back
>>>>> up with this monitor will make things worse. The fact that this mon
>>>>> daemon can’t even see how many OSDs are correctly down makes me think
>>>>> that nothing good will come from turning the OSDs back on.
>>>>>
>>>>> Do I run risk of further corruption (i.e. on the Ceph side, not
>>>>> client data as the cluster is paused) if I proceed and turn on the
>>>>> osd daemons? Or is it worth a shot?
>>>>>
>>>>> Are these Ceph health metrics commonly inaccurate until it can talk
>>>>> to the daemons?
>>>>
>>>> The PG stats will be inaccurate indeed and the number of OSDs can vary
>>>> as long as they aren't able to peer with each other and the MONs.
>>>>
>>>>>
>>>>> (Also other commands like `ceph osd tree` agree with the below `ceph
>>>>> -s` so far)
>>>>>
>>>>> Many thanks for any wisdom… I just don’t want to make things
>>>>> (unnecessarily) much worse.
>>>>>
>>>>> Cheers,
>>>>> Sean
>>>>>
>>>>>
>>>>> root@ntr-mon01:/var/log/ceph# ceph -s
>>>>>  cluster:
>>>>>id: ababdd7f-1040-431b-962c-c45bea5424aa
>>>>>health: HEALTH_WARN
>>>>>pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub
>>>>> flag(s) set
>>>>>157 osds down
>>>>>1 host (15 osds) down
>>>>>Reduced data availability: 12225 pgs inactive, 885 pgs
>>>>> down, 673 pgs peering
>>>>>Degraded data redundancy: 14829054/35961087 objects
>>>>> degraded (41.236%), 2869 pgs degraded, 2995 pgs undersized  services:
>>>>>mon: 1 daemons, quorum ntr-mon01
>>>>>mgr: ntr-mon01(active)
>>>>>osd: 223 osds: 66 up, 223 in
>>>>> flags pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub  data:
>>>>>pools:   14 pools, 15220 pgs
>>>>>objects: 10.58M objects, 40.1TiB
>>>>

[ceph-users] Re: osd_pg_create causing slow requests in Nautilus

2020-02-19 Thread Wido den Hollander



On 2/19/20 9:34 AM, Paul Emmerich wrote:
> On Wed, Feb 19, 2020 at 7:26 AM Wido den Hollander  wrote:
>>
>>
>>
>> On 2/18/20 6:54 PM, Paul Emmerich wrote:
>>> I've also seen this problem on Nautilus with no obvious reason for the
>>> slowness once.
>>
>> Did this resolve itself? Or did you remove the pool?
> 
> I've seen this twice on the same cluster, it fixed itself the first
> time (maybe with some OSD restarts?) and the other time I removed the
> pool after a few minutes because the OSDs were running into heartbeat
> timeouts. There unfortunately seems to be no way to reproduce this :(
> 

Yes, that's the problem. I've been trying to reproduce it, but I can't.
It works on all my Nautilus systems except for this one.

You saw it, Bryan saw it; I expect others to encounter this at some
point as well.

I don't have any extensive logging as this cluster is in production and
I can't simply crank up the logging and try again.

> In this case it wasn't a new pool that caused problems but a very old one.
> 
> 
> Paul
> 
>>
>>> In my case it was a rather old cluster that was upgraded all the way
>>> from firefly
>>>
>>>
>>
>> This cluster has also been installed with Firefly. It was installed in
>> 2015, so a while ago.
>>
>> Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_pg_create causing slow requests in Nautilus

2020-02-19 Thread Wido den Hollander



On 2/19/20 9:21 AM, Dan van der Ster wrote:
> On Wed, Feb 19, 2020 at 7:29 AM Wido den Hollander  wrote:
>>
>>
>>
>> On 2/18/20 6:54 PM, Paul Emmerich wrote:
>>> I've also seen this problem on Nautilus with no obvious reason for the
>>> slowness once.
>>
>> Did this resolve itself? Or did you remove the pool?
>>
>>> In my case it was a rather old cluster that was upgraded all the way
>>> from firefly
>>>
>>>
>>
>> This cluster has also been installed with Firefly. It was installed in
>> 2015, so a while ago.
> 
> FileStore vs. BlueStore relevant ?
> 

We've checked that. All the OSDs involved are SSD and running on
BlueStore. Converted to BlueStore under Luminous.

There are some HDD OSDs left on FileStore.

Wido

> -- dan
> 
> 
>>
>> Wido
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [FORGED] Lost all Monitors in Nautilus Upgrade, best way forward?

2020-02-19 Thread Wido den Hollander


On 2/19/20 8:49 AM, Sean Matheny wrote:
> Thanks,
> 
>> If the OSDs have a newer epoch of the OSDMap than the MON it won't work.
> 
> How can I verify this? (i.e the epoch of the monitor vs the epoch of the
> osd(s))
> 

Check the status of the OSDs:

$ ceph daemon osd.X status

This should tell the newest map it has.

Then check on the mons:

$ ceph osd dump|head -n 10

Or use ceph-monstore-tool to see what the latest map is that the MON has.
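
For example, something along these lines (the store path depends on your mon
name, and the mon should be stopped or you should work on a copy of the
store):

$ ceph-monstore-tool /var/lib/ceph/mon/ceph-mon01 get osdmap -- --out /tmp/osdmap
$ osdmaptool --print /tmp/osdmap | head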

Wido

> Cheers,
> Sean
> 
> 
>> On 19/02/2020, at 7:25 PM, Wido den Hollander > <mailto:w...@42on.com>> wrote:
>>
>>
>>
>> On 2/19/20 5:45 AM, Sean Matheny wrote:
>>> I wanted to add a specific question to the previous post, in the
>>> hopes it’s easier to answer.
>>>
>>> We have a Luminous monitor restored from the OSDs using
>>> ceph-object-tool, which seems like the best chance of any success. We
>>> followed this rough process:
>>>
>>> https://tracker.ceph.com/issues/24419
>>>
>>> The monitor has come up (as a single monitor cluster), but it’s
>>> reporting wildly inaccurate info, such as the number of osds that are
>>> down (157 but all 223 are down), and hosts (1, but all 14 are down).
>>>
>>
>> Have you verified that the MON's database has the same epoch of the
>> OSDMap (or newer) as all the other OSDs?
>>
>> If the OSDs have a newer epoch of the OSDMap than the MON it won't work.
>>
>>> The OSD Daemons are still off, but I’m not sure if starting them back
>>> up with this monitor will make things worse. The fact that this mon
>>> daemon can’t even see how many OSDs are correctly down makes me think
>>> that nothing good will come from turning the OSDs back on.
>>>
>>> Do I run risk of further corruption (i.e. on the Ceph side, not
>>> client data as the cluster is paused) if I proceed and turn on the
>>> osd daemons? Or is it worth a shot?
>>>
>>> Are these Ceph health metrics commonly inaccurate until it can talk
>>> to the daemons?
>>
>> The PG stats will be inaccurate indeed and the number of OSDs can vary
>> as long as they aren't able to peer with each other and the MONs.
>>
>>>
>>> (Also other commands like `ceph osd tree` agree with the below `ceph
>>> -s` so far)
>>>
>>> Many thanks for any wisdom… I just don’t want to make things
>>> (unnecessarily) much worse.
>>>
>>> Cheers,
>>> Sean
>>>
>>>
>>> root@ntr-mon01:/var/log/ceph# ceph -s
>>>  cluster:
>>>    id: ababdd7f-1040-431b-962c-c45bea5424aa
>>>    health: HEALTH_WARN
>>>    pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub
>>> flag(s) set
>>>    157 osds down
>>>    1 host (15 osds) down
>>>    Reduced data availability: 12225 pgs inactive, 885 pgs
>>> down, 673 pgs peering
>>>    Degraded data redundancy: 14829054/35961087 objects
>>> degraded (41.236%), 2869 pgs degraded, 2995 pgs undersized  services:
>>>    mon: 1 daemons, quorum ntr-mon01
>>>    mgr: ntr-mon01(active)
>>>    osd: 223 osds: 66 up, 223 in
>>> flags pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub  data:
>>>    pools:   14 pools, 15220 pgs
>>>    objects: 10.58M objects, 40.1TiB
>>>    usage:   43.0TiB used, 121TiB / 164TiB avail
>>>    pgs: 70.085% pgs unknown
>>> 10.237% pgs not active
>>> 14829054/35961087 objects degraded (41.236%)
>>> 10667 unknown
>>> 2869  active+undersized+degraded
>>> 885   down
>>> 673   peering
>>> 126   active+undersized
>>>
>>>
>>> On 19/02/2020, at 10:18 AM, Sean Matheny >> <mailto:s.math...@auckland.ac.nz><mailto:s.math...@auckland.ac.nz>>
>>> wrote:
>>>
>>> Hi folks,
>>>
>>> Our entire cluster is down at the moment.
>>>
>>> We started upgrading from 12.2.13 to 14.2.7 with the monitors. The
>>> first monitor we upgraded crashed. We reverted to luminous on this
>>> one and tried another, and it was fine. We upgraded the rest, and
>>> they all worked.
>>>
>>> Then we upgraded the first one again, and after it became the leader,
>>> it died. Then the second one became the leader, and it died. Then the
>>> third became the leader, and it died, leaving mon 4 and 5 unable to
>

[ceph-users] Re: Pool on limited number of OSDs

2020-02-18 Thread Wido den Hollander



On 2/18/20 6:56 PM, Jacek Suchenia wrote:
> Hello
> 
> I have a cluster, (Nautilus 14.2.4) where one pool I'd like to keep on a
> dedicated OSDs. So I setup a rule that covers *3* dedicated OSDs (using
> device classes) and assigned it to pool with replication factor *3*. Only
> 10% PGs were assigned and rebalanced, where rest of them stuck in
> *undersized* state.
> 

Can you share the rule and some snippets of the CRUSHMap?
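
For example the output of (the rule name is a guess):

$ ceph osd crush rule dump <rule-name>
$ ceph osd getcrushmap -o /tmp/crushmap
$ crushtool -d /tmp/crushmap -o /tmp/crushmap.txt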

Wido

> What mechanism prevents CRUSH algorithm to assign the same set of OSDs to
> all PGs in a pool? How can I control it?
> 
> Jacek
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_pg_create causing slow requests in Nautilus

2020-02-18 Thread Wido den Hollander



On 2/18/20 6:54 PM, Paul Emmerich wrote:
> I've also seen this problem on Nautilus with no obvious reason for the
> slowness once.

Did this resolve itself? Or did you remove the pool?

> In my case it was a rather old cluster that was upgraded all the way
> from firefly
> 
> 

This cluster has also been installed with Firefly. It was installed in
2015, so a while ago.

Wido
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [FORGED] Lost all Monitors in Nautilus Upgrade, best way forward?

2020-02-18 Thread Wido den Hollander


On 2/19/20 5:45 AM, Sean Matheny wrote:
> I wanted to add a specific question to the previous post, in the hopes it’s 
> easier to answer.
> 
> We have a Luminous monitor restored from the OSDs using ceph-object-tool, 
> which seems like the best chance of any success. We followed this rough 
> process:
> 
> https://tracker.ceph.com/issues/24419
> 
> The monitor has come up (as a single monitor cluster), but it’s reporting 
> wildly inaccurate info, such as the number of osds that are down (157 but all 
> 223 are down), and hosts (1, but all 14 are down).
> 

Have you verified that the MON's database has the same epoch of the
OSDMap (or newer) as all the other OSDs?

If the OSDs have a newer epoch of the OSDMap than the MON it won't work.

> The OSD Daemons are still off, but I’m not sure if starting them back up with 
> this monitor will make things worse. The fact that this mon daemon can’t even 
> see how many OSDs are correctly down makes me think that nothing good will 
> come from turning the OSDs back on.
> 
> Do I run risk of further corruption (i.e. on the Ceph side, not client data 
> as the cluster is paused) if I proceed and turn on the osd daemons? Or is it 
> worth a shot?
> 
> Are these Ceph health metrics commonly inaccurate until it can talk to the 
> daemons?

The PG stats will be inaccurate indeed and the number of OSDs can vary
as long as they aren't able to peer with each other and the MONs.

> 
> (Also other commands like `ceph osd tree` agree with the below `ceph -s` so 
> far)
> 
> Many thanks for any wisdom… I just don’t want to make things (unnecessarily) 
> much worse.
> 
> Cheers,
> Sean
> 
> 
> root@ntr-mon01:/var/log/ceph# ceph -s
>   cluster:
> id: ababdd7f-1040-431b-962c-c45bea5424aa
> health: HEALTH_WARN
> pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub flag(s) set
> 157 osds down
> 1 host (15 osds) down
> Reduced data availability: 12225 pgs inactive, 885 pgs down, 673 
> pgs peering
> Degraded data redundancy: 14829054/35961087 objects degraded 
> (41.236%), 2869 pgs degraded, 2995 pgs undersized  services:
> mon: 1 daemons, quorum ntr-mon01
> mgr: ntr-mon01(active)
> osd: 223 osds: 66 up, 223 in
>  flags pauserd,pausewr,noout,norecover,noscrub,nodeep-scrub  data:
> pools:   14 pools, 15220 pgs
> objects: 10.58M objects, 40.1TiB
> usage:   43.0TiB used, 121TiB / 164TiB avail
> pgs: 70.085% pgs unknown
>  10.237% pgs not active
>  14829054/35961087 objects degraded (41.236%)
>  10667 unknown
>  2869  active+undersized+degraded
>  885   down
>  673   peering
>  126   active+undersized
> 
> 
> On 19/02/2020, at 10:18 AM, Sean Matheny 
> mailto:s.math...@auckland.ac.nz>> wrote:
> 
> Hi folks,
> 
> Our entire cluster is down at the moment.
> 
> We started upgrading from 12.2.13 to 14.2.7 with the monitors. The first 
> monitor we upgraded crashed. We reverted to luminous on this one and tried 
> another, and it was fine. We upgraded the rest, and they all worked.
> 
> Then we upgraded the first one again, and after it became the leader, it 
> died. Then the second one became the leader, and it died. Then the third 
> became the leader, and it died, leaving mon 4 and 5 unable to form a quorum.
> 
> We tried creating a single monitor cluster by editing the monmap of mon05, 
> and it died in the same way, just without the paxos negotiation first.
> 
> We have tried to revert to a luminous (12.2.12) monitor backup taken a few 
> hours before the crash. The mon daemon will start, but is flooded with 
> blocked requests and unknown pgs after a while. For better or worse we 
> removed the “noout” flag and 144 of 232 OSDs are now showing as down:
> 
>  cluster:
>id: ababdd7f-1040-431b-962c-c45bea5424aa
>health: HEALTH_ERR
>noout,nobackfill,norecover flag(s) set
>101 osds down
>9 hosts (143 osds) down
>1 auth entities have invalid capabilities
>Long heartbeat ping times on back interface seen, longest is 
> 15424.178 msec
>Long heartbeat ping times on front interface seen, longest is 
> 14763.145 msec
>Reduced data availability: 521 pgs inactive, 48 pgs stale
>274 slow requests are blocked > 32 sec
>88 stuck requests are blocked > 4096 sec
>1303 slow ops, oldest one blocked for 174 sec, mon.ntr-mon01 has 
> slow ops
>too many PGs per OSD (299 > max 250)  services:
>mon: 1 daemons, quorum ntr-mon01 (age 3m)
>mgr: ntr-mon01(active, since 30m)
>mds: cephfs:1 {0=akld2e18u42=up:active(laggy or crashed)}
>osd: 223 osds: 66 up, 167 in
> flags noout,nobackfill,norecover
>rgw: 2 daemons active (ntr-rgw01, ntr-rgw02)  data:
>pools:   14 pools, 15220 pgs
>objects: 35.26M objects, 134 TiB
>usage:   379 TiB used, 1014 

[ceph-users] Re: osd_pg_create causing slow requests in Nautilus

2020-02-18 Thread Wido den Hollander



On 8/27/19 11:49 PM, Bryan Stillwell wrote:
> We've run into a problem on our test cluster this afternoon which is running 
> Nautilus (14.2.2).  It seems that any time PGs move on the cluster (from 
> marking an OSD down, setting the primary-affinity to 0, or by using the 
> balancer), a large number of the OSDs in the cluster peg the CPU cores 
> they're running on for a while which causes slow requests.  From what I can 
> tell it appears to be related to slow peering caused by osd_pg_create() 
> taking a long time.
> 
> This was seen on quite a few OSDs while waiting for peering to complete:
> 
> # ceph daemon osd.3 ops
> {
> "ops": [
> {
> "description": "osd_pg_create(e179061 287.7a:177739 287.9a:177739 
> 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 
> 287.216:177739 287.306:177739 287.3e6:177739)",
> "initiated_at": "2019-08-27 14:34:46.556413",
> "age": 318.2523453801,
> "duration": 318.2524189532,
> "type_data": {
> "flag_point": "started",
> "events": [
> {
> "time": "2019-08-27 14:34:46.556413",
> "event": "initiated"
> },
> {
> "time": "2019-08-27 14:34:46.556413",
> "event": "header_read"
> },
> {
> "time": "2019-08-27 14:34:46.556299",
> "event": "throttled"
> },
> {
> "time": "2019-08-27 14:34:46.556456",
> "event": "all_read"
> },
> {
> "time": "2019-08-27 14:35:12.456901",
> "event": "dispatched"
> },
> {
> "time": "2019-08-27 14:35:12.456903",
> "event": "wait for new map"
> },
> {
> "time": "2019-08-27 14:40:01.292346",
> "event": "started"
> }
> ]
> }
> },
> ...snip...
> {
> "description": "osd_pg_create(e179066 287.7a:177739 287.9a:177739 
> 287.e2:177739 287.e7:177739 287.f6:177739 287.187:177739 287.1aa:177739 
> 287.216:177739 287.306:177739 287.3e6:177739)",
> "initiated_at": "2019-08-27 14:35:09.908567",
> "age": 294.900191001,
> "duration": 294.9006841689,
> "type_data": {
> "flag_point": "delayed",
> "events": [
> {
> "time": "2019-08-27 14:35:09.908567",
> "event": "initiated"
> },
> {
> "time": "2019-08-27 14:35:09.908567",
> "event": "header_read"
> },
> {
> "time": "2019-08-27 14:35:09.908520",
> "event": "throttled"
> },
> {
> "time": "2019-08-27 14:35:09.908617",
> "event": "all_read"
> },
> {
> "time": "2019-08-27 14:35:12.456921",
> "event": "dispatched"
> },
> {
> "time": "2019-08-27 14:35:12.456923",
> "event": "wait for new map"
> }
> ]
> }
> }
> ],
> "num_ops": 6
> }
> 
> 
> That "wait for new map" message made us think something was getting hung up 
> on the monitors, so we restarted them all without any luck.
> 
> I'll keep investigating, but so far my google searches aren't pulling 
> anything up so I wanted to see if anyone else is running into this?
> 

I've seen this twice now on a ~1400 OSD cluster running Nautilus.

I created a bug report for this: https://tracker.ceph.com/issues/44184

Did you make any progress on this or run into it a second time?

Wido

> Thanks,
> Bryan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph status reports: slow ops - this is related to long running process /usr/bin/ceph-osd

2020-02-18 Thread Wido den Hollander


On 10/8/19 3:53 PM, Thomas wrote:
> Hi,
> ceph status reports:
> root@ld3955:~# ceph -s
>   cluster:
>     id: 6b1b5117-6e08-4843-93d6-2da3cf8a6bae
>     health: HEALTH_ERR
>     1 filesystem is degraded
>     1 filesystem has a failed mds daemon
>     1 filesystem is offline
>     insufficient standby MDS daemons available
>     4 nearfull osd(s)
>     1 pool(s) nearfull
>     Reduced data availability: 59 pgs inactive, 16 pgs peering
>     Degraded data redundancy: 597/153910758 objects degraded
> (0.000%), 2 pgs degraded, 1 pg undersized
>     Degraded data redundancy (low space): 23 pgs backfill_toofull
>     1 pgs not deep-scrubbed in time
>     4 pgs not scrubbed in time
>     3 pools have too many placement groups
>     164 slow requests are blocked > 32 sec
>     1082 stuck requests are blocked > 4096 sec
>     1490 slow ops, oldest one blocked for 19711 sec, daemons
> [osd,0,osd,175,osd,186,osd,5,osd,6,osd,63,osd,68,osd,9,mon,ld5505,mon,ld5506]...
> have slow ops.
> 
>   services:
>     mon: 3 daemons, quorum ld5505,ld5506,ld5507 (age 5h)
>     mgr: ld5507(active, since 5h), standbys: ld5506, ld5505
>     mds: pve_cephfs:0/1, 1 failed
>     osd: 419 osds: 416 up, 416 in; 6024 remapped pgs
> 
>   data:
>     pools:   6 pools, 8864 pgs
>     objects: 51.30M objects, 196 TiB
>     usage:   594 TiB used, 907 TiB / 1.5 PiB avail
>     pgs: 0.666% pgs not active
>  597/153910758 objects degraded (0.000%)
>  52964415/153910758 objects misplaced (34.412%)
>  5954 active+remapped+backfill_wait
>  2786 active+clean
>  40   active+remapped+backfilling
>  35   activating
>  23   active+remapped+backfill_wait+backfill_toofull
>  16   peering
>  7    activating+remapped
>  1    activating+undersized+degraded
>  1    active+clean+scrubbing
>  1    active+recovering+degraded
> 
>   io:
>     client:   3.5 KiB/s wr, 0 op/s rd, 0 op/s wr
>     recovery: 551 MiB/s, 137 objects/s
> 
> I'm concerned about the slow ops on osd.0 and osd.9.
> On the relevant OSD node I can see 2 relevant services running for hours:
> ceph   14795   1 99 09:58 ?    08:49:22 /usr/bin/ceph-osd -f
> --cluster ceph --id 9 --setuser ceph --setgroup ceph
> ceph   15394   1 99 09:58 ?    07:10:00 /usr/bin/ceph-osd -f
> --cluster ceph --id 0 --setuser ceph --setgroup ceph
> 
> In the relevant osd log I can find similar messages:
> root@ld5505:~# tail -f /var/log/ceph/ceph-osd.0.log
> 2019-10-08 15:35:32.830 7ff60c7cc700 -1 osd.0 233323 get_health_metrics
> reporting 236 slow ops, oldest is osd_pg_create(e233257 38.0:199987)
> 2019-10-08 15:35:33.806 7ff60c7cc700 -1 osd.0 233323 get_health_metrics
> reporting 236 slow ops, oldest is osd_pg_create(e233257 38.0:199987)
> 2019-10-08 15:35:34.842 7ff60c7cc700 -1 osd.0 233323 get_health_metrics
> reporting 236 slow ops, oldest is osd_pg_create(e233257 38.0:199987)
> 2019-10-08 15:35:35.862 7ff60c7cc700 -1 osd.0 233323 get_health_metrics
> reporting 236 slow ops, oldest is osd_pg_create(e233257 38.0:199987)
> 

This triggered me as I saw this happening twice on a cluster.

I created an issue in the tracker as I think it might be the same thing:
https://tracker.ceph.com/issues/44184

Wido

> root@ld5505:~# tail -f /var/log/ceph/ceph-osd.9.log
> 2019-10-08 15:35:38.822 7f8957599700 -1 osd.9 233407 get_health_metrics
> reporting 818 slow ops, oldest is osd_op(client.53385387.0:23 30.f7
> 30.bcc140f7 (undecoded) ondisk+retry+read+known_if_redirected e233362)
> 2019-10-08 15:35:39.854 7f8957599700 -1 osd.9 233407 get_health_metrics
> reporting 818 slow ops, oldest is osd_op(client.53385387.0:23 30.f7
> 30.bcc140f7 (undecoded) ondisk+retry+read+known_if_redirected e233362)
> 2019-10-08 15:35:40.850 7f8957599700 -1 osd.9 233407 get_health_metrics
> reporting 818 slow ops, oldest is osd_op(client.53385387.0:23 30.f7
> 30.bcc140f7 (undecoded) ondisk+retry+read+known_if_redirected e233362)
> 2019-10-08 15:35:41.862 7f8957599700 -1 osd.9 233407 get_health_metrics
> reporting 818 slow ops, oldest is osd_op(client.53385387.0:23 30.f7
> 30.bcc140f7 (undecoded) ondisk+retry+read+known_if_redirected e233362)
> 
> Question:
> How can I analyse and solve the issue with slow ops?
> 
> THX
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs slow, howto investigate and tune mds configuration?

2020-02-12 Thread Wido den Hollander



On 2/11/20 2:53 PM, Marc Roos wrote:
> 
> Say I think my cephfs is slow when I rsync to it, slower than it used to 
> be. First of all, I do not get why it reads so much data. I assume the 
> file attributes need to come from the mds server, so the rsync backup 
> should mostly cause writes not?
> 

Are you running one or multiple MDS daemons? I've seen cases where the
synchronization between the different MDS daemons slows down rsync.

The problem is that rsync creates and renames files a lot. When doing
this with small files it can be very heavy for the MDS.
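
To see how many active ranks you have and to watch the MDS while the rsync
runs, something like this helps (run daemonperf on the host of the MDS, the
name is an example):

$ ceph fs status
$ ceph daemonperf mds.mds01

rsync's default write-to-a-temp-file-and-rename pattern is what hurts; the
--inplace option avoids the rename, at the cost of not being atomic.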

Wido

> I think it started being slow, after enabling snapshots on the file 
> system.
> 
> - how can I determine if mds_cache_memory_limit = 80 is still 
> correct?
> 
> - how can I test the mds performance from the command line, so I can 
> experiment with cpu power configurations, and see if this brings a 
> significant change?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Benefits of high RAM on a metadata server?

2020-02-06 Thread Wido den Hollander



On 2/6/20 11:01 PM, Matt Larson wrote:
> Hi, we are planning out a Ceph storage cluster and were choosing
> between 64GB, 128GB, or even 256GB on metadata servers. We are
> considering having 2 metadata servers overall.
> 
> Does going to high levels of RAM possibly yield any performance
> benefits? Is there a size beyond which there are just diminishing
> returns vs cost?
> 

The MDS will try to cache as many inodes as you allow it to.

Neither the number of users nor the total number of bytes matters;
it's the number of inodes, i.e. files and directories.

The more you have of those, the more memory it requires.

A lot of small files? A lot of memory!
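
The knob for this is mds_cache_memory_limit. A rough sketch of raising it to
16 GiB (the value is in bytes):

$ ceph config set mds mds_cache_memory_limit 17179869184

On releases without the config database you would set it in ceph.conf under
[mds] instead. The usual ballpark mentioned on this list is a few kilobytes
of cache per inode.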

Wido

> The expected use case would be for a cluster where there might be
> 10-20 concurrent users working on individual datasets of 5TB in size.
> I expect there would be lots of reads of the 5TB datasets matched with
> the creation of hundreds to thousands of smaller files during
> processing of the images.
> 
> Thanks!
> -Matt
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


  1   2   >