[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-25 Thread Fox, Kevin M
Orchestration is hard, especially with every permutation. The devs have 
implemented what they feel is the right solution for their own needs, from the 
sound of it. The orchestration was made modular to support non-containerized 
deployment; it just takes someone to step up and implement the permutations 
desired. And ultimately that's what open source is geared towards. With 
open source and some desired feature, you can:
1. Implement it
2. Pay someone else to implement it
3. Convince someone else to implement it in their spare time.

The thread currently seems to be focused on #3, but no developer seems to be 
interested in implementing it. So that leaves options 1 and 2?

To move this forward, is anyone interested in developing package support in the 
orchestration system or paying to have it implemented?


From: Oliver Freyermuth 
Sent: Wednesday, June 2, 2021 2:26 PM
To: Matthew Vernon; ceph-users@ceph.io
Subject: [ceph-users] Re: Why you might want packages not containers for Ceph 
deployments



Hi,

that's also a +1 from me — we also use containers heavily for scientific 
workflows, and know their benefits well.
But they are not the "best", or rather, the most fitting tool in every 
situation.
You have provided a great summary, I agree with all points, and thank you for 
this very competent and concise write-up.


Since static linking, and solving the issue of many inter-dependencies for 
production services with containers, have been mentioned as solutions in this 
lengthy thread, I'd like to add another point to your list of complexities:
* Keeping production systems secure may be a lot more of a hassle.

Even though the following article is long and many may regard it as 
controversial, I'd like to link to a concise write-up from a packager 
discussing this topic in a quite generic way:
  
https://blogs.gentoo.org/mgorny/2021/02/19/the-modern-packagers-security-nightmare/
While the article discusses the issues of static linking and package management 
performed in language-specific domains, it applies all the same to containers.

If I operate services in containers built by developers, of course this ensures 
the setup works, and dependencies are well tested, and even upgrades work well 
— but it also means that,
at the end of the day, if I run 50 services in 50 different containers from 50 
different upstreams, I'll have up to 50 different versions of OpenSSL floating 
around my production servers.
If a security issue is found in any of the packages used in all the container 
images, I now need to trust the security teams of all the 50 developer groups 
building these containers
(and most FOSS projects won't have the resources, understandably...),
instead of the one security team of the distro I use. And then, I also have to 
re-pull all these containers, after finding out that a security fix has become 
available.
Or I need to build all these containers myself, and effectively take over the 
complete job, and have my own security team.
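
(As a quick illustration of the point, assuming docker is the container runtime 
(adjust for podman), one way to see how library versions diverge across the 
images running on a single host:)

  for c in $(docker ps -q); do
      docker exec "$c" openssl version 2>/dev/null
  done | sort | uniq -c    # one line per distinct OpenSSL build found in the containers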

Doing all of that yourself may scale somewhat well if you have a team of 50 
people and every person takes care of one service. Containers are often your 
friend in this case[1], since they allow you to isolate the different 
responsibilities along with the service.

But this is rarely the case outside of industry, and especially not in 
academia.
So the approach we chose for us is to have one common OS everywhere, and 
automate all of our deployment and configuration management with Puppet.
Of course, that puts us in one of the many corners out there, but it scales 
extremely well to all services we operate,
and I can still trust the distro maintainers to keep the base OS safe on all 
our servers, automate reboots etc.

For Ceph, we've actually seen questions about security issues already on the 
list[0] (never answered AFAICT).


To conclude, I strongly believe there's no one size fits all here.

That was why I was hopeful when I first heard about the Ceph orchestrator idea, 
when it looked to be planned out to be modular,
with the different tasks being implementable in several backends, so one could 
imagine them being implemented with containers, with classic SSH on bare-metal 
(i.e. ceph-deploy-like), ansible, rook or maybe others.
Sadly, it seems it ended up being "container-only".
Containers certainly have many uses, and we run thousands of them daily, but 
neither do they fit each and every existing requirement,
nor are they a magic bullet to solve all issues.

Cheers,
Oliver


[0] 

[ceph-users] Ceph Disk Prediction module issues

2021-06-25 Thread Justas Balcas
Hello Folks,

We are running Ceph Octopus 15.2.13 release and would like to use the disk
prediction module. So far issues we faced are:
1. The Ceph documentation does not mention installing
`ceph-mgr-diskprediction-local.noarch`.
2. Even if I install the needed package, after an mgr restart it does not
appear on the Ceph cluster. A detailed log is here:
gist:b687798ea97ef13e36d466f2d7b1470a . `ceph -s`
shows [1].

Are you aware of this issue and are there any workarounds?
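
For anyone following along, the sequence that should make the module available
is roughly the following (assuming an RPM-based host; the module name is taken
from the documentation and may differ between releases):

  yum install ceph-mgr-diskprediction-local   # on every mgr host
  systemctl restart ceph-mgr.target           # restart the mgr so it picks up the new plugin
  ceph mgr module enable diskprediction_local
  ceph mgr module ls | grep diskprediction    # the module should now show up as enabled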

Thanks!



[1]
# ceph -s
  cluster:
id: 12d9d70a-e993-464c-a6f8-4f674db35136
health: HEALTH_WARN
no active mgr

  services:
mon: 3 daemons, quorum ceph-mon-cms-1,ceph-mon-cms-2,ceph-mon-cms-3
(age 2d)
mgr: no daemons active (since 11m)
mds: cephfs:1 {0=ceph-mds-cms-1=up:active} 1 up:standby
ceph health detail
HEALTH_WARN no active mgr
[WRN] MGR_DOWN: no active mgr
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rgw multisite sync not syncing data, error: RGW-SYNC:data:init_data_sync_status: ERROR: failed to read remote data log shards

2021-06-25 Thread DHilsbos
Christian;

Do the second site's RGW instance(s) have access to the first site's OSDs?  Is 
the reverse true?

It's been a while since I set up the multi-site sync between our clusters, but 
I seem to remember that, while metadata is exchanged RGW1<-->RGW2, data is 
exchanged OSD1<-->RGW2.

Anyone else on the list, PLEASE correct me if I'm wrong.
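
(One quick sanity check, whichever path the data actually takes, is to confirm 
that the RGW endpoints registered for each zone are reachable from the other 
site's gateway hosts; the hostname below is a placeholder:)

  radosgw-admin zonegroup get | grep -A3 '"endpoints"'   # endpoints registered per zone
  curl -v http://rgw-site1.example.com:8080/              # run this from a site-2 RGW host, and vice versa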

Thank you,

Dominic L. Hilsbos, MBA 
Vice President – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


-Original Message-
From: Christian Rohmann [mailto:christian.rohm...@inovex.de] 
Sent: Friday, June 25, 2021 9:25 AM
To: ceph-users@ceph.io
Subject: [ceph-users] rgw multisite sync not syncing data, error: 
RGW-SYNC:data:init_data_sync_status: ERROR: failed to read remote data log 
shards

Hey ceph-users,


I set up a multisite sync between two freshly installed Octopus clusters.
In the first cluster I created a bucket with some data just to test the 
replication of actual data later.

I then followed the instructions on 
https://docs.ceph.com/en/octopus/radosgw/multisite/#migrating-a-single-site-system-to-multi-site
 
to add a second zone.

Things went well and both zones are now happily reaching each other and 
the API endpoints are talking.
Also the metadata is in sync already - both sides are happy and I can 
see bucket listings and users are "in sync":


> # radosgw-admin sync status
>   realm 13d1b8cb-dc76-4aed-8578-2ce5d3d010e8 (obst)
>   zonegroup 17a06c15-2665-484e-8c61-cbbb806e11d2 (obst-fra)
>    zone 6d2c1275-527e-432f-a57a-9614930deb61 (obst-rgn)
>   metadata sync no sync (zone is master)
>   data sync source: c07447eb-f93a-4d8f-bf7a-e52fade399f3 (obst-az1)
>     init
>     full sync: 128/128 shards
>     full sync: 0 buckets to sync
>     incremental sync: 0/128 shards
>     data is behind on 128 shards
>     behind shards: [0...127]
>

and on the other side ...

> # radosgw-admin sync status
>   realm 13d1b8cb-dc76-4aed-8578-2ce5d3d010e8 (obst)
>   zonegroup 17a06c15-2665-484e-8c61-cbbb806e11d2 (obst-fra)
>    zone c07447eb-f93a-4d8f-bf7a-e52fade399f3 (obst-az1)
>   metadata sync syncing
>     full sync: 0/64 shards
>     incremental sync: 64/64 shards
>     metadata is caught up with master
>   data sync source: 6d2c1275-527e-432f-a57a-9614930deb61 (obst-rgn)
>     init
>     full sync: 128/128 shards
>     full sync: 0 buckets to sync
>     incremental sync: 0/128 shards
>     data is behind on 128 shards
>     behind shards: [0...127]
>


Also the newly created buckets (read: their metadata) are synced.



What is apparently not working is the sync of actual data.

Upon startup the radosgw on the second site shows:

> 2021-06-25T16:15:06.445+ 7fe71eff5700  1 RGW-SYNC:meta: start
> 2021-06-25T16:15:06.445+ 7fe71eff5700  1 RGW-SYNC:meta: realm 
> epoch=2 period id=f4553d7c-5cc5-4759-9253-9a22b051e736
> 2021-06-25T16:15:11.525+ 7fe71dff3700  0 
> RGW-SYNC:data:sync:init_data_sync_status: ERROR: failed to read remote 
> data log shards
>

also when issuing

# radosgw-admin data sync init --source-zone obst-rgn

it throws

> 2021-06-25T16:20:29.167+ 7f87c2aec080 0 
> RGW-SYNC:data:init_data_sync_status: ERROR: failed to read remote data 
> log shards





Does anybody have any hints on where to look for what could be broken here?

Thanks a bunch,
Regards


Christian





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-25 Thread Marc




> Orchestration is hard, especially with every permutation. The devs have
> implemented what they feel is the right solution for their own needs
> from the sound of it. The orchestration was made modular to support non
> containerized deployment. It just takes someone to step up and implement
> the permutations desired. And ultimately that's what opensource is
> geared towards. With opensource and some desired feature, you can:
> 1. Implement it
> 2. Pay someone else to implement it
> 3. Convince someone else to implement it in their spare time.
> 
> The thread seems to be currently focused around #3 but no developer
> seems to be interested in implementing it. So that leaves options 1 and
> 2?
> 

Imho a bit of a simplistic view of open source and the current thread.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rgw multisite sync not syncing data, error: RGW-SYNC:data:init_data_sync_status: ERROR: failed to read remote data log shards

2021-06-25 Thread Christian Rohmann

Hey ceph-users,


I set up a multisite sync between two freshly installed Octopus clusters.
In the first cluster I created a bucket with some data just to test the 
replication of actual data later.


I then followed the instructions on 
https://docs.ceph.com/en/octopus/radosgw/multisite/#migrating-a-single-site-system-to-multi-site 
to add a second zone.


Things went well and both zones are now happily reaching each other and 
the API endpoints are talking.
Also the metadata is in sync already - both sides are happy and I can 
see bucket listings and users are "in sync":




# radosgw-admin sync status
  realm 13d1b8cb-dc76-4aed-8578-2ce5d3d010e8 (obst)
  zonegroup 17a06c15-2665-484e-8c61-cbbb806e11d2 (obst-fra)
   zone 6d2c1275-527e-432f-a57a-9614930deb61 (obst-rgn)
  metadata sync no sync (zone is master)
  data sync source: c07447eb-f93a-4d8f-bf7a-e52fade399f3 (obst-az1)
    init
    full sync: 128/128 shards
    full sync: 0 buckets to sync
    incremental sync: 0/128 shards
    data is behind on 128 shards
    behind shards: [0...127]



and on the other side ...


# radosgw-admin sync status
  realm 13d1b8cb-dc76-4aed-8578-2ce5d3d010e8 (obst)
  zonegroup 17a06c15-2665-484e-8c61-cbbb806e11d2 (obst-fra)
   zone c07447eb-f93a-4d8f-bf7a-e52fade399f3 (obst-az1)
  metadata sync syncing
    full sync: 0/64 shards
    incremental sync: 64/64 shards
    metadata is caught up with master
  data sync source: 6d2c1275-527e-432f-a57a-9614930deb61 (obst-rgn)
    init
    full sync: 128/128 shards
    full sync: 0 buckets to sync
    incremental sync: 0/128 shards
    data is behind on 128 shards
    behind shards: [0...127]




Also the newly created buckets (read: their metadata) are synced.



What is apparently not working is the sync of actual data.

Upon startup the radosgw on the second site shows:


2021-06-25T16:15:06.445+ 7fe71eff5700  1 RGW-SYNC:meta: start
2021-06-25T16:15:06.445+ 7fe71eff5700  1 RGW-SYNC:meta: realm 
epoch=2 period id=f4553d7c-5cc5-4759-9253-9a22b051e736
2021-06-25T16:15:11.525+ 7fe71dff3700  0 
RGW-SYNC:data:sync:init_data_sync_status: ERROR: failed to read remote 
data log shards




also when issuing

# radosgw-admin data sync init --source-zone obst-rgn

it throws

2021-06-25T16:20:29.167+ 7f87c2aec080 0 
RGW-SYNC:data:init_data_sync_status: ERROR: failed to read remote data 
log shards






Does anybody have any hints on where to look for what could be broken here?
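
(For reference, a few generic things worth checking in this situation; the 
source zone name below matches the output above:)

  radosgw-admin sync error list                           # persistent sync errors recorded so far
  radosgw-admin period get                                # do both zones agree on the current period and endpoints?
  radosgw-admin data sync status --source-zone obst-rgn   # per-source detail beyond 'sync status'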

Thanks a bunch,
Regards


Christian





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-25 Thread Nico Schottelius


Hey Sage,

Sage Weil  writes:
> Thank you for bringing this up.  This is in fact a key reason why the
> orchestration abstraction works the way it does--to allow other
> runtime environments to be supported (FreeBSD!
> sysvinit/Devuan/whatever for systemd haters!)

I would like you to stop labeling people who have reasons for not using
a specific piece of software as haters.

It is not productive to call Ceph developers "GlusterFS haters", nor to
call Redhat users Debian haters.

It is simply not an accurate representation.

Cheers,

Nico

--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-25 Thread Nico Schottelius


Hey Sage,

thanks for the reply.

Sage Weil  writes:
> Rook is based on kubernetes, and cephadm on podman or docker.  These
> are well-defined runtimes.  Yes, some have bugs, but our experience so
> far has been a big improvement over the complexity of managing package
> dependencies across even just a handful of distros.  (Podman has been
> the only real culprit here, tbh, but I give them a partial pass as the
> tool is relatively new.)

let me come back to a particular part of your message:

"
our experience so far has been a big improvement over the complexity
of managing package  dependencies across even just a handful of distros.
"

This is something I cannot understand at all. This is a process that most
mid-sized open source projects automated years ago. And you
can even use containers for that!

Does it break on a distro upgrade? Maybe. Is it trivial? Probably, in 90%
of the cases. Is it easy to spot? Almost always, if you use CI.

*What* exactly is the complexity that you deal with? Where are the
problems? Is there an issue that says

  "Reduce time and complexity of package build"

open anywhere? Can we join that discussion?

The message I get here is:

- Building packages is too hard for us
- let's make everyone else's life more complex for a build process
  that works more easily for us

I am not sure whether that's the right direction.

*If* this is really about the complexity of package building, why did you
not reach out to the community and ask for help? I assume that one or
another party on this mailing list is open to helping out.

Nico

--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can not mount rbd device anymore

2021-06-25 Thread Ml Ml
Btw: dd bs=1M count=2048 if=/dev/rbd6 of=/dev/null => gives me 50MB/sec.
So reading the block device seems to work?!

On Fri, Jun 25, 2021 at 12:39 PM Ml Ml  wrote:
>
> I started the mount 15mins ago.:
>   mount -nv /dev/rbd6 /mnt/backup-cluster5
>
> ps:
> root  1143  0.2  0.0   8904  3088 pts/0D+   12:17   0:03  |
>\_ mount -nv /dev/rbd6 /mnt/backup-cluster5
>
>
> There is no timout or ANY msg in dmesg until now.
>
> strace -p 1143  :  seems to do nothing.
> iotop --pid=1143: uses about 50KB/sec
>
> it might mount after a few hours i gues... :-(
>
> On Fri, Jun 25, 2021 at 11:39 AM Ilya Dryomov  wrote:
> >
> > On Fri, Jun 25, 2021 at 11:25 AM Ml Ml  wrote:
> > >
> > > The rbd Client is not on one of the OSD Nodes.
> > >
> > > I now added a "backup-proxmox/cluster5a" to it and it works perfectly.
> > > Just that one rbd image sucks. The last thing i remember was to resize
> > > the Image from 6TB to 8TB and i then did a xfs_grow on it.
> > >
> > > Does that ring a bell?
> >
> > It does seem like a filesystem problem so far but you haven't posted
> > dmesg or other details.  "mount" will not time out, if it's not returning
> > due to hanging somewhere you would likely get "task ... blocked for ..."
> > splats in dmesg.
> >
> > Thanks,
> >
> > Ilya
> >
> > >
> > >
> > > On Wed, Jun 23, 2021 at 11:25 AM Ilya Dryomov  wrote:
> > > >
> > > > On Wed, Jun 23, 2021 at 9:59 AM Matthias Ferdinand 
> > > >  wrote:
> > > > >
> > > > > On Tue, Jun 22, 2021 at 02:36:00PM +0200, Ml Ml wrote:
> > > > > > Hello List,
> > > > > >
> > > > > > oversudden i can not mount a specific rbd device anymore:
> > > > > >
> > > > > > root@proxmox-backup:~# rbd map backup-proxmox/cluster5 -k
> > > > > > /etc/ceph/ceph.client.admin.keyring
> > > > > > /dev/rbd0
> > > > > >
> > > > > > root@proxmox-backup:~# mount /dev/rbd0 /mnt/backup-cluster5/
> > > > > >  (just never times out)
> > > > >
> > > > >
> > > > > Hi,
> > > > >
> > > > > there used to be some kernel lock issues when the kernel rbd client
> > > > > tried to access an OSD on the same machine. Not sure if these issues
> > > > > still exist (but I would guess so) and if you use your proxmox cluster
> > > > > in a hyperconverged manner (nodes providing VMs and storage service at
> > > > > the same time) you may just have been lucky that it had worked before.
> > > > >
> > > > > Instead of the kernel client mount you can try to export the volume as
> > > > > an NBD device (https://docs.ceph.com/en/latest/man/8/rbd-nbd/) and
> > > > > mounting that. rbd-nbd runs in userspace and should not have that
> > > > > locking problem.
> > > >
> > > > rbd-nbd is also susceptible to locking up in such setups, likely more
> > > > so than krbd.  Don't forget that it also has a kernel component and
> > > > there are actually more opportunities for things to go sideways/lock up
> > > > because there is an extra daemon involved allocating some additional
> > > > memory for each I/O request.
> > > >
> > > > Thanks,
> > > >
> > > > Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can not mount rbd device anymore

2021-06-25 Thread Ml Ml
I started the mount 15 mins ago:
  mount -nv /dev/rbd6 /mnt/backup-cluster5

ps:
root  1143  0.2  0.0   8904  3088 pts/0D+   12:17   0:03  |
   \_ mount -nv /dev/rbd6 /mnt/backup-cluster5


There is no timeout or ANY msg in dmesg until now.

strace -p 1143  :  seems to do nothing.
iotop --pid=1143: uses about 50KB/sec

it might mount after a few hours I guess... :-(

On Fri, Jun 25, 2021 at 11:39 AM Ilya Dryomov  wrote:
>
> On Fri, Jun 25, 2021 at 11:25 AM Ml Ml  wrote:
> >
> > The rbd Client is not on one of the OSD Nodes.
> >
> > I now added a "backup-proxmox/cluster5a" to it and it works perfectly.
> > Just that one rbd image sucks. The last thing i remember was to resize
> > the Image from 6TB to 8TB and i then did a xfs_grow on it.
> >
> > Does that ring a bell?
>
> It does seem like a filesystem problem so far but you haven't posted
> dmesg or other details.  "mount" will not time out, if it's not returning
> due to hanging somewhere you would likely get "task ... blocked for ..."
> splats in dmesg.
>
> Thanks,
>
> Ilya
>
> >
> >
> > On Wed, Jun 23, 2021 at 11:25 AM Ilya Dryomov  wrote:
> > >
> > > On Wed, Jun 23, 2021 at 9:59 AM Matthias Ferdinand  
> > > wrote:
> > > >
> > > > On Tue, Jun 22, 2021 at 02:36:00PM +0200, Ml Ml wrote:
> > > > > Hello List,
> > > > >
> > > > > oversudden i can not mount a specific rbd device anymore:
> > > > >
> > > > > root@proxmox-backup:~# rbd map backup-proxmox/cluster5 -k
> > > > > /etc/ceph/ceph.client.admin.keyring
> > > > > /dev/rbd0
> > > > >
> > > > > root@proxmox-backup:~# mount /dev/rbd0 /mnt/backup-cluster5/
> > > > >  (just never times out)
> > > >
> > > >
> > > > Hi,
> > > >
> > > > there used to be some kernel lock issues when the kernel rbd client
> > > > tried to access an OSD on the same machine. Not sure if these issues
> > > > still exist (but I would guess so) and if you use your proxmox cluster
> > > > in a hyperconverged manner (nodes providing VMs and storage service at
> > > > the same time) you may just have been lucky that it had worked before.
> > > >
> > > > Instead of the kernel client mount you can try to export the volume as
> > > > an NBD device (https://docs.ceph.com/en/latest/man/8/rbd-nbd/) and
> > > > mounting that. rbd-nbd runs in userspace and should not have that
> > > > locking problem.
> > >
> > > rbd-nbd is also susceptible to locking up in such setups, likely more
> > > so than krbd.  Don't forget that it also has a kernel component and
> > > there are actually more opportunities for things to go sideways/lock up
> > > because there is an extra daemon involved allocating some additional
> > > memory for each I/O request.
> > >
> > > Thanks,
> > >
> > > Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-25 Thread Marc
> The security issue (50 containers -> 50 versions of openssl to patch)
> also still stands — the earlier question on this list (when to expect
> patched containers for a CVE affecting a library)

I assume they use the default el7/el8 as a base layer, so when that is updated, 
you will get the updates. However, redeploying tasks is not the same as just 
giving them a restart.

> is still unreplied to[1], so these are real-life concerns. In general, I
> don't know any project which ever managed to keep up with the workload
> caused by the requirement to follow
> all CVEs of all dependencies, informing about them and patching them,
> since this is a workload comparable to the one the security teams of
> Linux distributions have to handle.

Indeed, this is the core business of the distro that you choose. No software 
solution should ever make it theirs. E.g. this DC/OS is just a binary blob of a 
CentOS release, from which you have no idea whether it is up to date or not; I do 
not get why people install it.

> 
> Cheers (and congratulations to all who made it to the end of this mail),

I think your text clearly summarizes the point of view of many here.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-25 Thread Marc
> rgw, grafana, prom, haproxy, etc are all optional components.  The

Is this Prometheus stateful? Where is this data stored?

> Early on the team building the container images opted for a single
> image that includes all of the daemons for simplicity.  We could build
> stripped down images for each daemon type, but that's an investment in
> developer time and complexity and we haven't heard any complaints
> about the container size. (Usually a few hundred MB on a large scale
> storage server isn't a problem.)

To me it looks like you do not take the containerization seriously: a container 
development team that does not want to spend time on container images. You create 
something >10x slower to start, using >10x more disk space (times 2 when 
upgrading). HAProxy is 9MB. Your OSD image is 350MB.

> > 5. I have been writing this previously on the mailing list here. Is
> each rgw still requiring its own dedicated client id? Is it still true,
> that if you want to spawn 3 rgw instances, they need to authorize like
> client.rgw1, client.rgw2 and client.rgw3?
> > This does not allow for auto scaling. The idea of using an OC is that
> you launch a task, and that you can scale this task automatically when
> necessary. So you would get multiple instances of rgw1. If this is still
> and issue with rgw, mds and mgr etc. Why even bother doing something
> with an OC and containers?
> 
> The orchestrator automates the creation and cleanup of credentials for
> each rgw instance.  (It also trivially scales them up/down, ala k8s.)

I do not understand this. This sounds more to me like creating a new task, 
instead of scaling a second instance of an existing task. Are you currently 
able to automatically scale instances of an rgw up/down, or is your statement 
hypothetical?

I can remember talk on the mesos mailing list/issue tracker about the 
difficulty of determining a task's 'number', because tasks are being 
killed/started at random, based on resource offers. Thus supplying them with 
the correct, distinct credentials is not as trivial as it would seem.
So I wonder how you are scaling this? If there are already so many differences 
between OCs, I would even reckon they differ in this area quite a lot. So the 
most plausible solution would be fixing this in the rgw daemon.

> If you have an autoscaler, you just need to tell cephadm how many you
> want and it will add/remove daemons.  If you are using cephadm's
> ingress (haproxy) capability, the LB configuration will be adjusted
> for you.  If you are using an external LB, you can query cephadm for a
> description of the current daemons and their endpoints and feed that
> info into your own ingress solution.
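
(For anyone following along, the commands being described are presumably along 
these lines; the service id and count are placeholders and the exact arguments 
vary by release:)

  ceph orch apply rgw myrgw --placement=3   # ask cephadm to run 3 rgw daemons
  ceph orch ls rgw --format json            # dump service/daemon info to feed an external LB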

Forgive me for not looking at all the video links before writing this. But from 
the videos I saw about cephadm, it was always more like a command reference. 
It would be nice to maybe show the above in a Ceph tech talk or so. I think a lot of 
people would be interested in seeing this.

> > 6. As I wrote before I do not want my rgw or haproxy running in a OC
> that has the ability to give tasks capability SYSADMIN. So that would
> mean I have to run my osd daemons/containers separately.
> 
> Only the OSD containers get extra caps to deal with the storage
> hardware.

I know; that is why I choose to run drivers that require such SYSADMIN rights 
outside of my OC environment. My OC environment does not allow any tasks 
to use SYSADMIN.
 
> Memory limits are partially implemented; we haven't gotten to CPU
> limits yet.  It's on the list!
> 

To me it is sort of clear what the focus of the cephadm team is.

> 
> I humbly contend that most users, 

Hmmm, most, most, most... is not "most" mostly the average? Most people drive a 
Toyota, fewer people drive a Porsche and even fewer drive a Ferrari. It is your 
choice who your target audience is and what you are 'selling' them.

> especially those with small
> clusters, would rather issue a single command and have the cluster
> upgrade itself--with all of the latest and often version-specific
> safety checks and any special per-release steps implemented for
> them--than to do it themselves.
> 

The flip side to this approach is that if you guys make a mistake in some 
script, lots of ceph clusters could go down. 
Is this not a bit of a paradox? A team that has problems with their own software 
dependencies (ceph-ansible/ceph-deploy?) is one I should blindly trust to script the 
update of my cluster?

I know I have been very critical/sceptical about this cephadm. Please do also 
note that I just love this ceph storage, and I am advertising it whenever possible. So 
a big thanks to the whole team still!!!






___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-25 Thread Oliver Freyermuth

Am 18.06.21 um 20:42 schrieb Sage Weil:

Following up with some general comments on the main container
downsides and on the upsides that led us down this path in the first
place.
[...]


Thanks, Sage, for the nice and concise summary on the Cephadm benefits, and the 
reasoning on why the path was chosen!
Also thanks for your reply on my question about the modularity of the actual 
orchestrator. I really appreciate this, and will try to reply in one place here.


After the huge activity in this thread, I did take a step back to watch and 
also make up my mind,
trying to condense my main issues with the "containers-only" approach, also 
taking other replies into account.

I hope this is not seen as a rant, but rather a collection of arguments for an 
additional orchestrator module,
or maybe even something different. Unfortunately, it has become a wall of text, 
but I hope at least some will fight their way through.


First of all, I fully agree with the positive points you raised — it's surely a gain for 
devs and many users to ship something tested and "complete"
without having a full and still necessarily incomplete OS test matrix to 
constantly check and extend. It also eases testing especially when trying out 
experimental features,
and takes away usage complexity e.g. in the upgrade path.

Of course, there's also the point that having a large test matrix across OSs 
tends to uncover actual bugs or issues which may not show up in a reduced test 
environment[0],
so reducing the matrix also comes at a price for reliability which has to be 
weighed against the time which is saved.

The security issue (50 containers -> 50 versions of openssl to patch) also 
still stands — the earlier question on this list (when to expect patched 
containers for a CVE affecting a library)
is still unreplied to[1], so these are real-life concerns. In general, I don't 
know any project which ever managed to keep up with the workload caused by the 
requirement to follow
all CVEs of all dependencies, informing about them and patching them, since 
this is a workload comparable to the one the security teams of Linux 
distributions have to handle.
In addition, you'll also need to address the question of when and how to pull 
new images when patched containers become available, how / when to inform the 
administrator,
and orchestrate service restarts as-needed (you'd basically need 
"needs-restarting" and friends). That's still quite a way to go, and will be a 
constant developer effort
from now on.
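
(For comparison, on the package side this machinery already exists; e.g. on an 
RPM-based host something like:)

  yum update openssl      # pull the patched library from the distro
  needs-restarting -s     # which services are still running with the old library mapped?
  needs-restarting -r     # is a full reboot required?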

That being said, Ceph may be the first ever project managing to fulfil 
expectations here due to the close coupling to those guys wearing red hats ;-).


Another point raised on this list is that some users are anxious about pushing a 
"magic" button which upgrades a whole cluster.
Sure, this button is super useful, and incorporates developer wisdom, and 
allows the developers to test the full sequence and ship it to everybody.
So these buttons (e.g. "ceph orch upgrade") are something which is useful and, 
I should say, important.
However, by design, it hides the "inner workings" of Ceph, which is a major 
drawback for some users.
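
(The "magic" button in question being, for the record, roughly:)

  ceph orch upgrade start --ceph-version 16.2.4
  ceph orch upgrade status
  ceph orch upgrade pause    # or: ceph orch upgrade stop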


After this introduction, let me come to my main personal concern: Loss of 
integratibility.

Our model of operation is to have all machines (anything, be it your 
off-the-shelf desktop, a laptop,
a hypervisor, a compute node or a Ceph node) handled by the very same 
configuration management. It means all configuration is self-documenting,
reinstallation is done with the push of a button, and anybody who understands 
the configuration management and the services at a basic level can take over 
operations.
It's the only way we _can_ operate, given the huge number of services requested 
and required in the IT business these days.

To give just one example: We mount kerberised NFS on all our desktop nodes, via 
CephFS exposed via nfs-ganesha. The desktops run Ubuntu/Debian, the file 
servers CentOS.
If I need to change the Kerberos configuration (order of KDCs, roll out new 
principals etc.), for us this is a change in a single place: We perform the 
change in Puppet,
wait 30 minutes, and all systems run the new configuration[2].
When operating a service with its own orchestrator, this means for me: I have 
to manually adapt the configuration of this service.
I need someone who is able to do that (i.e. two configuration systems have to 
be learnt), and who then does it for all instances of the service (e.g. all 
Ceph clusters).
Hence, a previously simple change is multiplied in complexity. I can't just replace the 
OS disk of a Ceph-OSD node and push "reinstall" anymore,
letting Puppet install all services and Ceph packages (so I only have to adopt 
the disks, this is not automated as a safety precaution),
but I also have to talk to the Ceph orchestrator.

So automation is a must, and we also heavily rely on containers for scientific 
workloads to offer a large variety of software stacks to our users.

[ceph-users] Re: ceph fs mv does copy, not move

2021-06-25 Thread Frank Schilder
Dear Marc

> Adding to this. I can remember that I was surprised that a mv on cephfs 
> between directories linked to different pools

This is documented behaviour and should not be surprising. Placement is 
assigned on file creation time. Hence, placement changes only affect newly 
created files, existing files retain their placement. To perform a migration, a 
full copy must be executed.

This is, in fact, what is expected and also wished for (only unavoidable data 
movement should happen automatically, optional data movement on explicit 
request). A move is not a file creation and, therefore, cannot request a change 
of data placement. It is a request for re-linking a file/directory to another 
root (a move on an fs is a pointer-like operation, not a data operation), which 
is the reason why a move should be and usually is atomic and O(1).
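
(A quick way to see this on a mounted cephfs, with the directory paths and pool 
name below being placeholders; the pool must already have been added to the file 
system as a data pool:)

  setfattr -n ceph.dir.layout.pool -v cephfs_data_new /mnt/cephfs/dir_b
  echo test > /mnt/cephfs/dir_a/file
  getfattr -n ceph.file.layout.pool /mnt/cephfs/dir_a/file   # shows the old pool
  mv /mnt/cephfs/dir_a/file /mnt/cephfs/dir_b/file
  getfattr -n ceph.file.layout.pool /mnt/cephfs/dir_b/file   # still the old pool: mv only re-links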

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Marc 
Sent: 25 June 2021 10:21:16
To: Frank Schilder; Patrick Donnelly
Cc: ceph-users@ceph.io
Subject: RE: [ceph-users] Re: ceph fs mv does copy, not move

Adding to this: I can remember being surprised that with a mv on cephfs between 
directories linked to different pools, only some meta(?) data was moved/changed 
and the data itself stayed in the old pool.
I am not sure if this is still the same in newer ceph versions, but I would rather 
see the data being moved completely. That is what everyone expects, regardless of 
whether this would take more time in this case between different pools.


> -Original Message-
> From: Frank Schilder 
> Sent: Thursday, 24 June 2021 17:34
> To: Patrick Donnelly 
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: ceph fs mv does copy, not move
>
> Dear Patrick,
>
> thanks for letting me know.
>
> Could you please consider to make this a ceph client mount option, for
> example, '-o fast_move', that enables a code path that enforces an mv to
> be a proper atomic mv with the risk that in some corner cases the target
> quota is overrun? With this option enabled, a move should either be a
> move or fail outright with "out of disk quota" (no partial move, no
> cp+rm at all). The fail should only occur if it is absolutely obvious
> that the target quota will be exceeded. Any corner cases are the
> responsibility of the operator. Application crashes due to incorrect
> error handling are acceptable.
>
> Reasoning:
>
> From a user's/operator's side, the preferred functionality is that in
> cases where a definite quota overrun can securely be detected in
> advance, the move should actually fail with "out of disk quota" instead
> of resorting to cp+rm, potentially leading to partial moves and a total
> mess for users/operators to clean up. In any other case, the quota
> should simply be ignored and the move should be a complete atomic move
> with the risk of exceeding the target quota and IO to stall. A temporary
> stall or fail of IO until the operator increases the quota again is, in
> my opinion and use case, highly preferable over the alternative of
> cp+rm. A quota or a crashed job is fast to fix, a partial move is not.
>
> Some background:
>
> We use ceph fs as an HPC home file system and as a back-end store. Being
> able to move data quickly across the entire file system is essential,
> because users re-factor their directory structure containing huge
> amounts of data quite often for various reasons.
>
> On our system, we set file system quotas mainly for psychological
> reasons. We run a cron job that adjusts the quotas every day to show
> between 20% and 30% free capacity on the mount points. The psychological
> side here is to give an incentive to users to clean up temporary data.
> It is not intended to limit usage seriously, only to limit what can be
> done in between cron job runs as a safe-guard. The pool quotas set the
> real hard limits.
>
> I'm in the process of migrating 100+TB right now and am really happy
> that I still have a client where I can do an O(1) move. It would be a
> disaster if I had now to use rsync or similar, which would take weeks.
>
> Please, in such situations where developers seem to have to make a
> definite choice, consider the possibility of offering operators to
> choose the alternative that suits their use case best. Adding further
> options seems far better than limiting functionality in a way that
> becomes a terrible burden in certain, if not many use cases.
>
> In ceph fs there have been many such decisions that allow for different
> answers from a user/operator perspective. For example, I would prefer if
> I could get rid of the attempted higher POSIX compliance level of ceph
> fs compared with Lustre, just disable all the client-caps and cache-
> coherence management and turn it into an awesome scale-out parallel file
> system. The attempt of POSIX compliant handling of simultaneous writes
> to files offers nothing to us, but costs huge in performance and forces
> users 

[ceph-users] Re: Can not mount rbd device anymore

2021-06-25 Thread Ml Ml
The rbd Client is not on one of the OSD Nodes.

I now added a "backup-proxmox/cluster5a" to it and it works perfectly.
Just that one rbd image sucks. The last thing I remember was resizing
the image from 6TB to 8TB, and I then did an xfs_grow on it.

Does that ring a bell?


On Wed, Jun 23, 2021 at 11:25 AM Ilya Dryomov  wrote:
>
> On Wed, Jun 23, 2021 at 9:59 AM Matthias Ferdinand  
> wrote:
> >
> > On Tue, Jun 22, 2021 at 02:36:00PM +0200, Ml Ml wrote:
> > > Hello List,
> > >
> > > oversudden i can not mount a specific rbd device anymore:
> > >
> > > root@proxmox-backup:~# rbd map backup-proxmox/cluster5 -k
> > > /etc/ceph/ceph.client.admin.keyring
> > > /dev/rbd0
> > >
> > > root@proxmox-backup:~# mount /dev/rbd0 /mnt/backup-cluster5/
> > >  (just never times out)
> >
> >
> > Hi,
> >
> > there used to be some kernel lock issues when the kernel rbd client
> > tried to access an OSD on the same machine. Not sure if these issues
> > still exist (but I would guess so) and if you use your proxmox cluster
> > in a hyperconverged manner (nodes providing VMs and storage service at
> > the same time) you may just have been lucky that it had worked before.
> >
> > Instead of the kernel client mount you can try to export the volume as
> > an NBD device (https://docs.ceph.com/en/latest/man/8/rbd-nbd/) and
> > mounting that. rbd-nbd runs in userspace and should not have that
> > locking problem.
>
> rbd-nbd is also susceptible to locking up in such setups, likely more
> so than krbd.  Don't forget that it also has a kernel component and
> there are actually more opportunities for things to go sideways/lock up
> because there is an extra daemon involved allocating some additional
> memory for each I/O request.
>
> Thanks,
>
> Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-25 Thread Marc
> but our experience so
> far has been a big improvement over the complexity of managing package
> dependencies across even just a handful of distros

Do you have some charts or docs that show this complexity problem? I 
have trouble understanding it. 
This is very likely because my understanding of ceph internals is limited. 
Take for instance my view of the osd daemon: it works with logical volumes for 
writing/reading data, and then you have osd<->osd,mon,mgr communication. What 
dependency hell is to be expected there?

> (Podman has been
> the only real culprit here, tbh, but I give them a partial pass as the
> tool is relatively new.)

Is it not better for the sake of stability, security and future support to 
choose something with a proven record? 


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-25 Thread Marc



> 
> This thread would not be so long if docker/containers solved the
> problems, but it did not. It solved some, but introduced new ones. So we
> cannot really say its better now.

The only thing I can deduce from this thread is the necessity to create a 
solution for, e.g., 'dentists' to install a ceph cluster. Everything really 
related to container use is pushed to the future. The focus of the cephadm 
development more or less shows this.

> 
> 
> Again, I think focus should more on a working ceph with clean
> documentation while leaving software management, packages to admins. And
> staticilly linked binaries would certinly solve dependecy hell and "how
> to support other environments" for most of the cases.
> 

I agree, and I worry that at some point only docker images are going to be 
available, and/or a CO environment that I do not want.





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: speeding up EC recovery

2021-06-25 Thread Serkan Çoban
You can use clay codes [1].
They read less data for reconstruction.

1- https://docs.ceph.com/en/latest/rados/operations/erasure-code-clay/
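
A minimal sketch of creating such a profile and a pool on top of it (the 
parameter values are only an example; d defaults to k+m-1):

  ceph osd erasure-code-profile set clay_16_3 plugin=clay k=16 m=3 d=18 crush-failure-domain=host
  ceph osd pool create ecpool_clay 128 128 erasure clay_16_3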

On Fri, Jun 25, 2021 at 2:50 PM Andrej Filipcic  wrote:
>
>
> Hi,
>
> on a large cluster with ~1600 OSDs, 60 servers and using 16+3 erasure
> coded pools, the recovery after OSD failure (HDD) is quite slow. Typical
> values are at 4GB/s with 125 ops/s and 32MB object sizes, which then
> takes 6-8 hours, during that time the pgs are degraded. I tried to speed
> it up with
>
>osd advanced  osd_max_backfills 32
>osd advanced  osd_recovery_max_active 10
>osd advanced  osd_recovery_op_priority 63
>osd advanced  osd_recovery_sleep_hdd 0.00
>
> which at least kept the iops/s at a constant level. The recovery does
> not seem to be cpu or memory bound. Is there any way to speed it up?
> While testing the recovery on replicated pools, it reached 50GB/s.
>
> In contrast, replacing the failed drive with a new one and re-adding the
> OSD is  quite fast, with 1GB/s recovery rate of misplaced pgs, or
> ~120MB/s average HDD write speed, which is not very far from HDD throughput.
>
> Regards,
> Andrej
>
> --
> _
> prof. dr. Andrej Filipcic,   E-mail: andrej.filip...@ijs.si
> Department of Experimental High Energy Physics - F9
> Jozef Stefan Institute, Jamova 39, P.o.Box 3000
> SI-1001 Ljubljana, Slovenia
> Tel.: +386-1-477-3674Fax: +386-1-425-7074
> -
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] speeding up EC recovery

2021-06-25 Thread Andrej Filipcic


Hi,

on a large cluster with ~1600 OSDs, 60 servers and 16+3 erasure 
coded pools, the recovery after an OSD failure (HDD) is quite slow. Typical 
values are 4GB/s with 125 ops/s and 32MB object sizes, which then 
takes 6-8 hours, during which time the pgs are degraded. I tried to speed 
it up with


  osd advanced  osd_max_backfills 32
  osd advanced  osd_recovery_max_active 10
  osd advanced  osd_recovery_op_priority 63
  osd advanced  osd_recovery_sleep_hdd 0.00

which at least kept the iops/s at a constant level. The recovery does 
not seem to be cpu or memory bound. Is there any way to speed it up? 
While testing the recovery on replicated pools, it reached 50GB/s.
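
(For reference, settings like the ones above can be changed at runtime through 
the mon config store, e.g.:)

  ceph config set osd osd_max_backfills 32
  ceph config set osd osd_recovery_max_active 10
  ceph config set osd osd_recovery_sleep_hdd 0
  ceph config dump | grep osd_recovery    # verify what is actually active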


In contrast, replacing the failed drive with a new one and re-adding the 
OSD is  quite fast, with 1GB/s recovery rate of misplaced pgs, or 
~120MB/s average HDD write speed, which is not very far from HDD throughput.


Regards,
Andrej

--
_
   prof. dr. Andrej Filipcic,   E-mail: andrej.filip...@ijs.si
   Department of Experimental High Energy Physics - F9
   Jozef Stefan Institute, Jamova 39, P.o.Box 3000
   SI-1001 Ljubljana, Slovenia
   Tel.: +386-1-477-3674Fax: +386-1-425-7074
-
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2021-06-25 Thread Rok Jaklič
This thread would not be so long if docker/containers solved the problems,
but they did not. They solved some, but introduced new ones. So we cannot
really say it's better now.

Again, I think the focus should be more on a working ceph with clean documentation,
while leaving software management and packages to admins. And statically
linked binaries would certainly solve dependency hell and "how to support
other environments" for most of the cases.

On Thu, 24 Jun 2021, 23:06 Sage Weil,  wrote:

> On Tue, Jun 22, 2021 at 1:25 PM Stefan Kooman  wrote:
> > On 6/21/21 6:19 PM, Nico Schottelius wrote:
> > > And while we are at claiming "on a lot more platforms", you are at the
> > > same time EXCLUDING a lot of platforms by saying "Linux based
> > > container" (remember Ceph on FreeBSD? [0]).
> >
> > Indeed, and that is a more fundamental question: how easy it is to make
> > Ceph a first-class citizen on non linux platforms. Was that ever a
> > (design) goal? But then again, if you would be able to port docker
> > natively to say OpenBSD, you should be able to run Ceph on it as well.
>
> Thank you for bringing this up.  This is in fact a key reason why the
> orchestration abstraction works the way it does--to allow other
> runtime environments to be supported (FreeBSD!
> sysvinit/Devuan/whatever for systemd haters!) while ALSO allowing an
> integrated, user-friendly experience in which users workflow for
> adding/removing hosts, replacing failed OSDs, managing services (MDSs,
> RGWs, load balancers, etc) can be consistent across all platforms.
> For 10+ years we basically said "out of scope" to these pesky
> deployment details and left this job to Puppet, Chef, Ansible,
> ceph-deploy, rook, etc., but the result of that strategy was pretty
> clear: ceph was hard to use and the user experience dismal when
> compared to an integrated product from any half-decent enterprise
> storage company, or products like Martin's that capitalize on core
> ceph's bad UX.
>
> The question isn't whether we support other environments, but how.  As
> I mentioned in one of my first messages, we can either (1) generalize
> cephadm to work in other environments (break the current
> systemd+container requirement), or (2) add another orchestrator
> backend that supports a new environment.  I don't have any well-formed
> opinion here.  There is a lot of pretty generic "orchestration" logic
> in cephadm right now that isn't related to systemd or containers that
> could either be pulled out of cephadm into the mgr/ochestrator layer
> or a library.  Or an independent, fresh orch backend implementation
> could opt for a very different approach or set of opinions.
>
> Either way, my assumption has been that these other environments would
> probably not be docker|podman-based.  In the case of FreeBSD we'd
> probably want to use jails or whatever.  But anything is possible.
>
> s
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Missing objects in pg

2021-06-25 Thread Vadim Bulst

Hello Cephers,

it is a mystery. My cluster is out of the error state. How, I don't really 
know. I initiated deep scrubbing for the affected pgs yesterday. Maybe that 
fixed it.
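
(For anyone hitting the same HEALTH_ERR: the usual sequence for inspecting and, 
as a last resort, resolving unfound objects is along these lines; the pg id is 
taken from the log further down.)

  ceph health detail                     # which pgs report unfound objects
  ceph pg 3.9 list_unfound               # which objects exactly
  ceph pg deep-scrub 3.9                 # re-check; this is what appears to have cleared it here
  ceph pg 3.9 mark_unfound_lost revert   # last resort: revert or delete the unfound objects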


Cheers,

Vadim

On 6/24/21 1:15 PM, Vadim Bulst wrote:

Dear List,

since my update yesterday from 14.2.18 to 14.2.20 I have an unhealthy 
cluster. As far as I remember, it appeared after rebooting the second 
server. There are 7 missing objects from pgs of a cache pool (pool 3). 
This pool's cache mode has now been changed from writeback to proxy and I'm 
not able to flush all objects.


root@scvirt06:/home/urzadmin/ceph_issue# ceph -s
  cluster:
    id: 5349724e-fa96-4fd6-8e44-8da2a39253f7
    health: HEALTH_ERR
    7/15893342 objects unfound (0.000%)
    Possible data damage: 7 pgs recovery_unfound
    Degraded data redundancy: 21/47680026 objects degraded 
(0.000%), 7 pgs degraded, 7 pgs undersized

    client is using insecure global_id reclaim
    mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum scvirt03,scvirt06,scvirt01 (age 19h)
    mgr: scvirt04(active, since 21m), standbys: scvirt03, scvirt02
    mds: scfs:1 {0=scvirt04=up:active} 1 up:standby-replay 1 up:standby
    osd: 54 osds: 54 up (since 17m), 54 in (since 10w); 7 remapped pgs

  task status:
    scrub status:
    mds.scvirt03: idle

  data:
    pools:   5 pools, 704 pgs
    objects: 15.89M objects, 49 TiB
    usage:   139 TiB used, 145 TiB / 285 TiB avail
    pgs: 21/47680026 objects degraded (0.000%)
 7/15893342 objects unfound (0.000%)
 694 active+clean
 7 active+recovery_unfound+undersized+degraded+remapped
 3   active+clean+scrubbing+deep

  io:
    client:   3.7 MiB/s rd, 6.6 MiB/s wr, 40 op/s rd, 31 op/s wr

my cluster:

scvirt01 - mon,osds

scvirt02 - mgr,osds

scvirt03 - mon,mgr,mds,osds

scvirt04 - mgr,mds,osds

scvirt05 - osds

scvirt06 - mon,mds,osds


log of osd.49:

root@scvirt03:/home/urzadmin# tail -f /var/log/ceph/ceph-osd.49.log
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.64 GB write, 0.01 MB/s write, 0.54 GB read, 
0.01 MB/s read, 6.5 seconds Interval compaction: 0.00 GB write, 0.00 
MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds Stalls(count): 0 
level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 
0 level0_numfiles_with_compaction, 0 stop for 
pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 
memtable_compaction, 0 memtable_slowdown, interval 0 total count


** File Read Latency Histogram By Level [default] **

2021-06-24 08:53:08.865 7f88ab86c700 -1 log_channel(cluster) log [ERR] 
: 3.9 has 1 objects unfound and apparently lost
2021-06-24 08:53:08.865 7f88a505f700 -1 log_channel(cluster) log [ERR] 
: 3.1e has 1 objects unfound and apparently lost
2021-06-24 08:53:40.570 7f88ab86c700 -1 log_channel(cluster) log [ERR] 
: 3.9 has 1 objects unfound and apparently lost
2021-06-24 08:53:40.570 7f88a9067700 -1 log_channel(cluster) log [ERR] 
: 3.1e has 1 objects unfound and apparently lost
2021-06-24 08:54:45.042 7f88b487e700  4 rocksdb: [db/db_impl.cc:777] 
--- DUMPING STATS ---

2021-06-24 08:54:45.042 7f88b487e700  4 rocksdb: [db/db_impl.cc:778]
** DB Stats **
Uptime(secs): 85202.3 total, 600.0 interval
Cumulative writes: 1148K writes, 8640K keys, 1148K commit groups, 1.0 
writes per commit group, ingest: 1.24 GB, 0.01 MB/s
Cumulative WAL: 1148K writes, 546K syncs, 2.10 writes per sync, 
written: 1.24 GB, 0.01 MB/s

Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 369 writes, 1758 keys, 369 commit groups, 1.0 writes 
per commit group, ingest: 0.41 MB, 0.00 MB/s
Interval WAL: 369 writes, 155 syncs, 2.37 writes per sync, written: 
0.00 MB, 0.00 MB/s

Interval stall: 00:00:0.000 H:M:S, 0.0 percent

** Compaction Stats [default] **
Level    Files   Size Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) 
Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) 
Comp(cnt) Avg(sec) KeyIn KeyDrop
 

  L0  3/0   104.40 MB   0.8  0.0 0.0  0.0 0.2 
0.2   0.0   1.0  0.0 67.8 2.89  2.70 6 
0.482   0  0
  L1  2/0   131.98 MB   0.5  0.2 0.1  0.1 0.2 
0.1   0.0   1.8    149.9    120.9 1.53  1.41 1 1.527   
2293K   140K
  L2 16/0   871.57 MB   0.3  0.3 0.1  0.3 0.3 
-0.0   0.0   5.2    158.1    132.3 2.05 1.93 1 2.052   
3997K  1089K
 Sum 21/0    1.08 GB   0.0  0.5 0.2  0.4 0.6 0.2   
0.0   3.3 85.5    100.8 6.47  6.03 8 0.809   6290K  1229K
 Int  0/0    0.00 KB   0.0  0.0 0.0  0.0 0.0 0.0   

[ceph-users] Re: Can not mount rbd device anymore

2021-06-25 Thread Ilya Dryomov
On Fri, Jun 25, 2021 at 11:25 AM Ml Ml  wrote:
>
> The rbd Client is not on one of the OSD Nodes.
>
> I now added a "backup-proxmox/cluster5a" to it and it works perfectly.
> Just that one rbd image sucks. The last thing i remember was to resize
> the Image from 6TB to 8TB and i then did a xfs_grow on it.
>
> Does that ring a bell?

It does seem like a filesystem problem so far, but you haven't posted
dmesg or other details.  "mount" will not time out; if it's not returning
because it's hanging somewhere, you would likely get "task ... blocked for ..."
splats in dmesg.
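
(Concretely, the places to look; the debugfs path needs a mounted debugfs and 
root, and the pid is a placeholder:)

  dmesg | grep -i "blocked for"       # hung-task splats, if any
  cat /sys/kernel/debug/ceph/*/osdc   # requests the kernel client is still waiting on
  cat /proc/<pid-of-mount>/stack      # where the stuck mount process is sitting in the kernel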

Thanks,

Ilya

>
>
> On Wed, Jun 23, 2021 at 11:25 AM Ilya Dryomov  wrote:
> >
> > On Wed, Jun 23, 2021 at 9:59 AM Matthias Ferdinand  
> > wrote:
> > >
> > > On Tue, Jun 22, 2021 at 02:36:00PM +0200, Ml Ml wrote:
> > > > Hello List,
> > > >
> > > > oversudden i can not mount a specific rbd device anymore:
> > > >
> > > > root@proxmox-backup:~# rbd map backup-proxmox/cluster5 -k
> > > > /etc/ceph/ceph.client.admin.keyring
> > > > /dev/rbd0
> > > >
> > > > root@proxmox-backup:~# mount /dev/rbd0 /mnt/backup-cluster5/
> > > >  (just never times out)
> > >
> > >
> > > Hi,
> > >
> > > there used to be some kernel lock issues when the kernel rbd client
> > > tried to access an OSD on the same machine. Not sure if these issues
> > > still exist (but I would guess so) and if you use your proxmox cluster
> > > in a hyperconverged manner (nodes providing VMs and storage service at
> > > the same time) you may just have been lucky that it had worked before.
> > >
> > > Instead of the kernel client mount you can try to export the volume as
> > > an NBD device (https://docs.ceph.com/en/latest/man/8/rbd-nbd/) and
> > > mounting that. rbd-nbd runs in userspace and should not have that
> > > locking problem.
> >
> > rbd-nbd is also susceptible to locking up in such setups, likely more
> > so than krbd.  Don't forget that it also has a kernel component and
> > there are actually more opportunities for things to go sideways/lock up
> > because there is an extra daemon involved allocating some additional
> > memory for each I/O request.
> >
> > Thanks,
> >
> > Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] native linux distribution host running ceph container ?

2021-06-25 Thread marc boisis
Hi,

We have a containerised ceph cluster at version 16.2.4 (15 hosts, 180 osds) 
deployed with ceph-ansible.
Our hosts run on CentOS 7 (kernel 3.10) with the ceph-daemon docker image based on 
CentOS 8.

I cannot find in the documentation which native distribution is recommended; 
should it be the same as the docker image (CentOS 8)?

With CentOS 8 end of support announced for the end of the year, which 
distribution will ceph use in the docker image?

Thanks





 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW topic created in wrong (default) tenant

2021-06-25 Thread Daniel Iwan
Thanks for clarification


> according to what i tested, this is not the case. deletion of a topic only
> prevents the creation of new notifications with that topic.
> it does not affect the deletion of notifications with that topic, nor the
> actual sending of these notifications.
>
> note that we also added a cascade delete process to delete all
> notifications of a bucket when a bucket is deleted.
> (it should be in pacific: https://github.com/ceph/ceph/pull/38351)
>

You are right, this is cleaned up as expected.
I re-ran my tests, and it turned out that the problems show up after deleting the
internal topic (created upon creation of the notification).
When I did that initially I didn't know about internal topics, and in my
eagerness I deleted all topics to start over. This triggered an issue with
deleting notifications.

In my case with topics as below

# radosgw-admin topic list --tenant marvel --uid none
{
    "topics": [
        {
            "topic": {
                "user": "marvel",
                "name": "MyTopic",
                "dest": {
                    "bucket_name": "",
                    "oid_prefix": "",
                    "push_endpoint": "amqp://127.0.0.1",
                    "push_endpoint_args": "amqp-exchange=rgw-exchange=amqp://127.0.0.1=false=false",
                    "push_endpoint_topic": "MyTopic",
                    "stored_secret": "false",
                    "persistent": "false"
                },
                "arn": "arn:aws:sns:default:marvel:MyTopic",
                "opaqueData": ""
            },
            "subs": []
        },
        {
            "topic": {
                "user": "marvel",
                "name": "bucket-by-thanos_notification_MyTopic",
                "dest": {
                    "bucket_name": "",
                    "oid_prefix": "",
                    "push_endpoint": "amqp://127.0.0.1",
                    "push_endpoint_args": "amqp-exchange=rgw-exchange=amqp://127.0.0.1=false=false",
                    "push_endpoint_topic": "MyTopic",
                    "stored_secret": "false",
                    "persistent": "false"
                },
                "arn": "arn:aws:sns:default:marvel:MyTopic",
                "opaqueData": ""
            },
            "subs": []
        }
    ]
}
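
For reference, removing those two topics was done with the admin tool, roughly
along these lines (flags mirroring the topic list command above):

radosgw-admin topic rm --topic MyTopic --tenant marvel --uid none
radosgw-admin topic rm --topic bucket-by-thanos_notification_MyTopic --tenant marvel --uid none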

Once I delete both of those and attempt to delete the notification on the
bucket, RGW logs the messages below.
The returned status code is 200, but the notification can still be listed.
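
By "listed" I mean reading the bucket notification configuration back over the
S3 API, e.g. with the AWS CLI (endpoint URL is just an example):

aws --endpoint-url http://my-rgw:8080 s3api get-bucket-notification-configuration \
    --bucket bucket-by-thanos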

Log

debug 2021-06-25T08:36:59.823+ 7fefed33a700 10 cache get:
name=my-rgw.rgw.log++pubsub.marvel.bucket.bucket-by-thanos/103132f8-3d78-4592-ae7b-067ecf2a1b92.9081601.36
: hit (requested=0x6, cached=0x17)
debug 2021-06-25T08:36:59.823+ 7fefed33a700 20 get_system_obj_state:
s->obj_tag was set empty
debug 2021-06-25T08:36:59.823+ 7fefed33a700 10 cache get:
name=my-rgw.rgw.log++pubsub.marvel.bucket.bucket-by-thanos/103132f8-3d78-4592-ae7b-067ecf2a1b92.9081601.36
: hit (requested=0x1, cached=0x17)
debug 2021-06-25T08:36:59.823+ 7fefed33a700 20 get_system_obj_state:
rctx=0x558dd865b1c0 obj=my-rgw.rgw.log:pubsub.marvel state=0x558dd5ff9720
s->prefetch_data=0
debug 2021-06-25T08:36:59.823+ 7fefed33a700 10 cache get:
name=my-rgw.rgw.log++pubsub.marvel : hit (requested=0x16, cached=0x17)
debug 2021-06-25T08:36:59.823+ 7fefed33a700 20 get_system_obj_state:
s->obj_tag was set empty
debug 2021-06-25T08:36:59.823+ 7fefed33a700 10 cache get:
name=my-rgw.rgw.log++pubsub.marvel : hit (requested=0x11, cached=0x17)
debug 2021-06-25T08:36:59.823+ 7fefed33a700 10 distributing
notification oid=my-rgw.rgw.control:notify.0 bl.length()=189
debug 2021-06-25T08:36:59.823+ 7ff108570700 10
RGWWatcher::handle_notify()  notify_id 3573413718119 cookie 94067650647552
notifier 14638216 bl.length()=189
debug 2021-06-25T08:36:59.831+ 7fefe3b27700 20 get_system_obj_state:
rctx=0x558dd865b1c0 obj=my-rgw.rgw.log:pubsub.marvel state=0x558dd5ff9720
s->prefetch_data=0
debug 2021-06-25T08:36:59.831+ 7fefe3b27700 10 cache get:
name=my-rgw.rgw.log++pubsub.marvel : hit (requested=0x1, cached=0x17)
debug 2021-06-25T08:36:59.839+ 7fefe3b27700  1 ERROR: topic not found
debug 2021-06-25T08:36:59.839+ 7fefe3b27700  1 ERROR: failed to read
topic info: ret=-2
debug 2021-06-25T08:36:59.839+ 7fefe3b27700  1 failed to remove
notification of topic 'bucket-by-thanos_notification_MyTopic', ret=-2
debug 2021-06-25T08:36:59.839+ 7fefe3b27700 20 get_system_obj_state:
rctx=0x558dd865b1c0 obj=my-rgw.rgw.log:pubsub.marvel state=0x558dd5ff9720
s->prefetch_data=0
debug 2021-06-25T08:36:59.839+ 7fefe3b27700 10 cache get:
name=my-rgw.rgw.log++pubsub.marvel : hit (requested=0x11, cached=0x17)
debug 2021-06-25T08:36:59.843+ 7ff007b6f700 10 cache put:
name=my-rgw.rgw.log++pubsub.marvel info.flags=0x17
debug 2021-06-25T08:36:59.843+ 7ff007b6f700 10 moving
my-rgw.rgw.log++pubsub.marvel to cache LRU end
debug 2021-06-25T08:36:59.843+ 7ff007b6f700 10 distributing
notification 

[ceph-users] Re: ceph fs mv does copy, not move

2021-06-25 Thread Marc
Adding to this: I remember being surprised that for a mv on cephfs between
directories linked to different pools, only some meta(?) data was moved/changed
and some data stayed in the old pool.
I am not sure if this is still the case in newer ceph versions, but I would rather
see the data being moved completely. That is what everyone expects, even if the
move would take more time in this case between different pools.
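
To be concrete, by "directories linked to different pools" I mean directory
layouts set via the virtual xattr, e.g. (pool and path names are just examples):

setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/archive
getfattr -n ceph.dir.layout.pool /mnt/cephfs/archive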


> -Original Message-
> From: Frank Schilder 
> Sent: Thursday, 24 June 2021 17:34
> To: Patrick Donnelly 
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: ceph fs mv does copy, not move
> 
> Dear Patrick,
> 
> thanks for letting me know.
> 
> Could you please consider making this a ceph client mount option, for
> example '-o fast_move', that enables a code path enforcing an mv to
> be a proper atomic mv, with the risk that in some corner cases the target
> quota is overrun? With this option enabled, a move should either be a
> move or fail outright with "out of disk quota" (no partial move, no
> cp+rm at all). The fail should only occur if it is absolutely obvious
> that the target quota will be exceeded. Any corner cases are the
> responsibility of the operator. Application crashes due to incorrect
> error handling are acceptable.
> 
> Reasoning:
> 
> From a user's/operator's side, the preferred functionality is that in
> cases where a definite quota overrun can securely be detected in
> advance, the move should actually fail with "out of disk quota" instead
> of resorting to cp+rm, potentially leading to partial moves and a total
> mess for users/operators to clean up. In any other case, the quota
> should simply be ignored and the move should be a complete atomic move
> with the risk of exceeding the target quota and IO to stall. A temporary
> stall or fail of IO until the operator increases the quota again is, in
> my opinion and use case, highly preferable over the alternative of
> cp+rm. A quota or a crashed job is fast to fix, a partial move is not.
> 
> Some background:
> 
> We use ceph fs as an HPC home file system and as a back-end store. Being
> able to move data quickly across the entire file system is essential,
> because users re-factor their directory structure containing huge
> amounts of data quite often for various reasons.
> 
> On our system, we set file system quotas mainly for psychological
> reasons. We run a cron job that adjusts the quotas every day to show
> between 20% and 30% free capacity on the mount points. The psychological
> side here is to give an incentive to users to clean up temporary data.
> It is not intended to limit usage seriously, only to limit what can be
> done in between cron job runs as a safe-guard. The pool quotas set the
> real hard limits.
> 
> I'm in the process of migrating 100+TB right now and am really happy
> that I still have a client where I can do an O(1) move. It would be a
> disaster if I had now to use rsync or similar, which would take weeks.
> 
> Please, in situations like this where developers seem to have to make a
> definite choice, consider offering operators the possibility to
> choose the alternative that suits their use case best. Adding further
> options seems far better than limiting functionality in a way that
> becomes a terrible burden in certain, if not many use cases.
> 
> In ceph fs there have been many such decisions that allow for different
> answers from a user/operator perspective. For example, I would prefer if
> I could get rid of the attempted higher POSIX compliance level of ceph
> fs compared with Lustre, just disable all the client-caps and cache-
> coherence management and turn it into an awesome scale-out parallel file
> system. The attempt of POSIX compliant handling of simultaneous writes
> to files offers nothing to us, but costs huge in performance and forces
> users to move away from perfectly reasonable HPC work flows. Also, that
> it takes a TTL to expire before changes on one client become visible on
> another (unless direct_io is used for all IO) is perfectly acceptable
> for us given the potential performance gain due to simpler client-MDS
> communication.
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Patrick Donnelly 
> Sent: 24 June 2021 05:29:45
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] ceph fs mv does copy, not move
> 
> Hello Frank,
> 
> On Tue, Jun 22, 2021 at 2:16 AM Frank Schilder  wrote:
> >
> > Dear all,
> >
> > some time ago I reported that the kernel client resorts to a copy
> instead of move when moving a file across quota domains. I was told that
> the fuse client does not have this problem. If enough space is
> available, a move should be a move, not a copy.
> >
> > Today, I tried to move a large file across quota domains, testing both
> the kernel and the fuse client. Both still resort to a copy even though
> this issue