[ceph-users] MDS crashes shortly after starting

2024-05-05 Thread E Taka
Hi all,

we have a serious problem with CephFS. A few days ago, the CephFS file
systems became inaccessible, with the message MDS_DAMAGE: 1 mds daemon
damaged

The cephfs-journal-tool tells us: "Overall journal integrity: OK"
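For reference, the check was of this form (with our file system name and rank
filled in):

cephfs-journal-tool --rank=<fs_name>:0 journal inspect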

The usual attempts with redeploy were unfortunately not successful.

After many attempts to get anywhere with the orchestrator, we set the
MDS to “failed” and forced the creation of a new MDS with “ceph fs reset”.

But this MDS crashes:
ceph-17.2.7/src/mds/MDCache.cc: In function 'void
MDCache::rejoin_send_rejoins()'
ceph-17.2.7/src/mds/MDCache.cc: 4086: FAILED ceph_assert(auth >= 0)

(The full trace is attached).

What can we do now? We are grateful for any help!
May 05 22:42:43 ceph06 bash[707251]: debug -1> 2024-05-05T20:42:43.006+ 
7f6892752700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc:
 In function 'void MDCache::rejoin_send_rejoins()' thread 7f6892752700 time 
2024-05-05T20:42:43.008448+
May 05 22:42:43 ceph06 bash[707251]: 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc:
 4086: FAILED ceph_assert(auth >= 0)
May 05 22:42:43 ceph06 bash[707251]:  ceph version 17.2.7 
(b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
May 05 22:42:43 ceph06 bash[707251]:  1: (ceph::__ceph_assert_fail(char const*, 
char const*, int, char const*)+0x135) [0x7f689fb974a3]
May 05 22:42:43 ceph06 bash[707251]:  2: 
/usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f689fb97669]
May 05 22:42:43 ceph06 bash[707251]:  3: 
(MDCache::rejoin_send_rejoins()+0x216b) [0x5605d03da7eb]
May 05 22:42:43 ceph06 bash[707251]:  4: 
(MDCache::process_imported_caps()+0x1993) [0x5605d03d8353]
May 05 22:42:43 ceph06 bash[707251]:  5: 
(MDCache::rejoin_open_ino_finish(inodeno_t, int)+0x217) [0x5605d03e5837]
May 05 22:42:43 ceph06 bash[707251]:  6: (MDSContext::complete(int)+0x5f) 
[0x5605d05a7f4f]
May 05 22:42:43 ceph06 bash[707251]:  7: (void 
finish_contexts > 
>(ceph::common::CephContext*, std::vector >&, int)+0x8d) [0x5605d024cf5d]
May 05 22:42:43 ceph06 bash[707251]:  8: (MDCache::open_ino_finish(inodeno_t, 
MDCache::open_ino_info_t&, int)+0x138) [0x5605d03cd168]
May 05 22:42:43 ceph06 bash[707251]:  9: 
(MDCache::_open_ino_traverse_dir(inodeno_t, MDCache::open_ino_info_t&, 
int)+0xbb) [0x5605d03cd4bb]
May 05 22:42:43 ceph06 bash[707251]:  10: (MDSContext::complete(int)+0x5f) 
[0x5605d05a7f4f]
May 05 22:42:43 ceph06 bash[707251]:  11: (MDSRank::_advance_queues()+0xaa) 
[0x5605d025b34a]
May 05 22:42:43 ceph06 bash[707251]:  12: 
(MDSRank::ProgressThread::entry()+0xb8) [0x5605d025b918]
May 05 22:42:43 ceph06 bash[707251]:  13: /lib64/libpthread.so.0(+0x81ca) 
[0x7f689eb861ca]
May 05 22:42:43 ceph06 bash[707251]:  14: clone()
May 05 22:42:43 ceph06 bash[707251]: debug  0> 2024-05-05T20:42:43.010+ 
7f6892752700 -1 *** Caught signal (Aborted) **
May 05 22:42:43 ceph06 bash[707251]:  in thread 7f6892752700 
thread_name:mds_rank_progr
May 05 22:42:43 ceph06 bash[707251]:  ceph version 17.2.7 
(b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
May 05 22:42:43 ceph06 bash[707251]:  1: /lib64/libpthread.so.0(+0x12cf0) 
[0x7f689eb90cf0]
May 05 22:42:43 ceph06 bash[707251]:  2: gsignal()
May 05 22:42:43 ceph06 bash[707251]:  3: abort()
May 05 22:42:43 ceph06 bash[707251]:  4: (ceph::__ceph_assert_fail(char const*, 
char const*, int, char const*)+0x18f) [0x7f689fb974fd]
May 05 22:42:43 ceph06 bash[707251]:  5: 
/usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f689fb97669]
May 05 22:42:43 ceph06 bash[707251]:  6: 
(MDCache::rejoin_send_rejoins()+0x216b) [0x5605d03da7eb]
May 05 22:42:43 ceph06 bash[707251]:  7: 
(MDCache::process_imported_caps()+0x1993) [0x5605d03d8353]
May 05 22:42:43 ceph06 bash[707251]:  8: 
(MDCache::rejoin_open_ino_finish(inodeno_t, int)+0x217) [0x5605d03e5837]
May 05 22:42:43 ceph06 bash[707251]:  9: (MDSContext::complete(int)+0x5f) 
[0x5605d05a7f4f]
May 05 22:42:43 ceph06 bash[707251]:  10: (void 
finish_contexts > 
>(ceph::common::CephContext*, std::vector >&, int)+0x8d) [0x5605d024cf5d]
May 05 22:42:43 ceph06 bash[707251]:  11: (MDCache::open_ino_finish(inodeno_t, 
MDCache::open_ino_info_t&, int)+0x138) [0x5605d03cd168]
May 05 22:42:43 ceph06 bash[707251]:  12: 
(MDCache::_open_ino_traverse_dir(inodeno_t, MDCache::open_ino_info_t&, 
int)+0xbb) [0x5605d03cd4bb]
May 05 22:42:43 ceph06 bash[707251]:  13: (MDSContext::complete(int)+0x5f) 
[0x5605d05a7f4f]
May 05 22:42:43 ceph06 bash[707251]:  14: (MDSRank::_advance_queues()+0xaa) 
[0x5605d025b34a]
May 05 22:42:43 ceph06 bash[707251]:  15: 
(MDSRank::ProgressThread::entry()+0xb8) [0x5605d025b918]
May 05 22:42:43 ceph06 bash[707251]:  16: /lib64/libpthread.so.0(+0x81ca) 
[0x7f689eb861ca]
May 05 22:42:43 ceph06 bas

[ceph-users] Best practice in 2024 for simple RGW failover

2024-03-26 Thread E Taka
Hi,

The requirements are actually not high: 1. there should be a generally
known address for access. 2. it should be possible to reboot or shut down a
server without the RGW connections being down the entire time. A downtime
of a few seconds is OK.

Constant load balancing would be nice, but is not necessary. I have found
various approaches on the Internet - what is currently recommended for a
current Ceph installation?
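
One of the approaches I found is the cephadm ingress service (haproxy +
keepalived in front of the RGW daemons). As far as I understand it, a minimal
spec would look roughly like this (service names, placement and virtual IP are
placeholders):

service_type: ingress
service_id: rgw.default
placement:
  count: 2
backend_service: rgw.default
virtual_ip: 192.0.2.10/24
frontend_port: 8080
monitor_port: 1967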


Thanks,
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 6 pgs not deep-scrubbed in time

2024-01-28 Thread E Taka
OSD 22 shows up more often than the others. Other operations may be blocked
because a deep scrub has not finished yet. I would remove OSD 22, just to
be sure: ceph orch osd rm osd.22

If this does not help, just add it again.
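
To see at a glance which OSD appears most often across the affected PGs, a
quick sketch like this should do (PG ids taken from your health detail):

for pg in 6.78 6.60 6.5c 4.12 10.d 5.f; do
    ceph pg map "$pg"        # prints "... up [...] acting [12,36,37]"
done | sed -n 's/.*acting \[\([0-9,]*\)\].*/\1/p' | tr ',' '\n' | sort -n | uniq -c | sort -rn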

Am Fr., 26. Jan. 2024 um 08:05 Uhr schrieb Michel Niyoyita <
mico...@gmail.com>:

> It seems that these are different OSDs, as shown here. How have you managed
> to sort this out?
>
> ceph pg dump | grep -F 6.78
> dumped all
> 6.78   44268   0 0  00
> 1786796401180   0  10099 10099
>  active+clean  2024-01-26T03:51:26.781438+0200  107547'115445304
> 107547:225274427  [12,36,37]  12  [12,36,37]  12
> 106977'114532385  2024-01-24T08:37:53.597331+0200  101161'109078277
> 2024-01-11T16:07:54.875746+0200  0
> root@ceph-osd3:~# ceph pg dump | grep -F 6.60
> dumped all
> 6.60   9   0 0  00
> 179484338742  716  36  10097 10097
>  active+clean  2024-01-26T03:50:44.579831+0200  107547'153238805
> 107547:287193139   [32,5,29]  32   [32,5,29]  32
> 107231'152689835  2024-01-25T02:34:01.849966+0200  102171'147920798
> 2024-01-13T19:44:26.922000+0200  0
> 6.3a   44807   0 0  00
> 1809690056940   0  10093 10093
>  active+clean  2024-01-26T03:53:28.837685+0200  107547'114765984
> 107547:238170093  [22,13,11]  22  [22,13,11]  22
> 106945'113739877  2024-01-24T04:10:17.224982+0200  102863'109559444
> 2024-01-15T05:31:36.606478+0200  0
> root@ceph-osd3:~# ceph pg dump | grep -F 6.5c
> 6.5c   44277   0 0  00
> 1787649782300   0  10051 10051
>  active+clean  2024-01-26T03:55:23.339584+0200  107547'126480090
> 107547:264432655  [22,37,30]  22  [22,37,30]  22
> 107205'125858697  2024-01-24T22:32:10.365869+0200  101941'120957992
> 2024-01-13T09:07:24.780936+0200  0
> dumped all
> root@ceph-osd3:~# ceph pg dump | grep -F 4.12
> dumped all
> 4.12   0   0 0  00
>  00   0  0 0
>  active+clean  2024-01-24T08:36:48.284388+0200   0'0
>  107546:152711   [22,19,7]  22   [22,19,7]  22
>  0'0  2024-01-24T08:36:48.284307+0200   0'0
> 2024-01-13T09:09:22.176240+0200  0
> root@ceph-osd3:~# ceph pg dump | grep -F 10.d
> dumped all
> 10.d   0   0 0  00
>  00   0  0 0
>  active+clean  2024-01-24T04:04:33.641541+0200   0'0
>  107546:142651   [14,28,1]  14   [14,28,1]  14
>  0'0  2024-01-24T04:04:33.641451+0200   0'0
> 2024-01-12T08:04:02.078062+0200  0
> root@ceph-osd3:~# ceph pg dump | grep -F 5.f
> dumped all
> 5.f0   0 0  00
>  00   0  0 0
>  active+clean  2024-01-25T08:19:04.148941+0200   0'0
>  107546:161331  [11,24,35]  11  [11,24,35]  11
>  0'0  2024-01-25T08:19:04.148837+0200   0'0
> 2024-01-12T06:06:00.970665+0200  0
>
>
> On Fri, Jan 26, 2024 at 8:58 AM E Taka <0eta...@gmail.com> wrote:
>
>> We had the same problem. It turned out that one disk was slowly dying. It
>> was easy to identify by the commands (in your case):
>>
>> ceph pg dump | grep -F 6.78
>> ceph pg dump | grep -F 6.60
>> …
>>
>> This command shows the OSDs of a PG in square brackets. If it is always
>> the same number, then you've found the OSD that causes the slow
>> scrubs.
>>
>> Am Fr., 26. Jan. 2024 um 07:45 Uhr schrieb Michel Niyoyita <
>> mico...@gmail.com>:
>>
>>> Hello team,
>>>
>>> I have a cluster in production composed by  3 osds servers with 20 disks
>>> each deployed using ceph-ansibleand ubuntu OS , and the version is
>>> pacific
>>> . These days is in WARN state caused by pgs which are not deep-scrubbed
>>> in
>>> time . I tried to deep-scrubbed some pg manually but seems that the
>>> cluster
>>> can be slow, would like your assistance in order that my cluster can be
>>> in
>>> HEALTH_OK state as before without any interuption of service . The
>>> cluster
>>> is used as openstack backend storage.

[ceph-users] Re: 6 pgs not deep-scrubbed in time

2024-01-25 Thread E Taka
We had the same problem. It turned out that one disk was slowly dying. It
was easy to identify by the commands (in your case):

ceph pg dump | grep -F 6.78
ceph pg dump | grep -F 6.60
…

This command shows the OSDs of a PG in square brackets. If it is always
the same number, then you've found the OSD that causes the slow scrubs.
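
Once you have a suspect, its latency and the disk's health are worth a look,
e.g. (a sketch; the OSD id and device path are examples):

ceph osd perf                       # look for an OSD with clearly higher commit/apply latency
ceph device ls-by-daemon osd.<id>   # which physical device backs the suspect OSD
smartctl -a /dev/sdX                # run on the OSD host against that device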

Am Fr., 26. Jan. 2024 um 07:45 Uhr schrieb Michel Niyoyita <
mico...@gmail.com>:

> Hello team,
>
> I have a cluster in production composed by  3 osds servers with 20 disks
> each deployed using ceph-ansibleand ubuntu OS , and the version is pacific
> . These days is in WARN state caused by pgs which are not deep-scrubbed in
> time . I tried to deep-scrubbed some pg manually but seems that the cluster
> can be slow, would like your assistance in order that my cluster can be in
> HEALTH_OK state as before without any interuption of service . The cluster
> is used as openstack backend storage.
>
> Best Regards
>
> Michel
>
>
>  ceph -s
>   cluster:
> id: cb0caedc-eb5b-42d1-a34f-96facfda8c27
> health: HEALTH_WARN
> 6 pgs not deep-scrubbed in time
>
>   services:
> mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 11M)
> mgr: ceph-mon2(active, since 11M), standbys: ceph-mon3, ceph-mon1
> osd: 48 osds: 48 up (since 11M), 48 in (since 11M)
> rgw: 6 daemons active (6 hosts, 1 zones)
>
>   data:
> pools:   10 pools, 385 pgs
> objects: 5.97M objects, 23 TiB
> usage:   151 TiB used, 282 TiB / 433 TiB avail
> pgs: 381 active+clean
>  4   active+clean+scrubbing+deep
>
>   io:
> client:   59 MiB/s rd, 860 MiB/s wr, 155 op/s rd, 665 op/s wr
>
> root@ceph-osd3:~# ceph health detail
> HEALTH_WARN 6 pgs not deep-scrubbed in time
> [WRN] PG_NOT_DEEP_SCRUBBED: 6 pgs not deep-scrubbed in time
> pg 6.78 not deep-scrubbed since 2024-01-11T16:07:54.875746+0200
> pg 6.60 not deep-scrubbed since 2024-01-13T19:44:26.922000+0200
> pg 6.5c not deep-scrubbed since 2024-01-13T09:07:24.780936+0200
> pg 4.12 not deep-scrubbed since 2024-01-13T09:09:22.176240+0200
> pg 10.d not deep-scrubbed since 2024-01-12T08:04:02.078062+0200
> pg 5.f not deep-scrubbed since 2024-01-12T06:06:00.970665+0200
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD is usable, but not shown in "ceph orch device ls"

2023-12-23 Thread E Taka
Hello,

in our cluster we have one node with SSDs which are in use, but we cannot see
them in "ceph orch device ls". Everything else looks OK. For better
understanding: the disk name is /dev/sda, and it is osd.138:

~# lsblk
NAME MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda  8:01 7T  0 disk

~# wipefs /dev/sda
DEVICE OFFSET TYPE   UUID LABEL
sda0x0ceph_bluestore

~# ceph osd tree
 -9   15.42809  host ceph06
138ssd 6.98630  osd.138  up   1.0  1.0

The file ceph-osd.138.log does not look unusual to me.

ceph-volume.log shows that the SSD is found by the "lsblk" call of the
volume processing.

It is not possible to add the SSD with:
# ceph orch daemon add osd ceph06:/dev/sda

The error message in this case asks whether the device is already in use, even
if the SSD is fully wiped via "wipefs -a" or by overwriting the entire disk
with dd. But it is possible to add it to the cluster by using
the option "--method raw".

Do you have an idea what happened here and how can I debug this behaviour?
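
I can provide more output if needed, e.g. from commands like these (a sketch):

ceph orch device ls ceph06 --wide --refresh   # force a fresh inventory instead of the cached one
cephadm ceph-volume inventory /dev/sda        # run directly on ceph06: what ceph-volume itself reports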
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph osd dump_historic_ops

2023-12-01 Thread E Taka
This small (Bash) wrapper around the "ceph daemon" command, especially its
auto-completion with the TAB key, is quite helpful, IMHO:
https://github.com/test-erik/ceph-daemon-wrapper

Am Fr., 1. Dez. 2023 um 15:03 Uhr schrieb Phong Tran Thanh <
tranphong...@gmail.com>:

> It works!!!
>
> Thanks Kai Stian Olstad
>
> Vào Th 6, 1 thg 12, 2023 vào lúc 17:06 Kai Stian Olstad <
> ceph+l...@olstad.com> đã viết:
>
> > On Fri, Dec 01, 2023 at 04:33:20PM +0700, Phong Tran Thanh wrote:
> > >I have a problem with my osd, i want to show dump_historic_ops of osd
> > >I follow the guide:
> > >
> >
> https://www.ibm.com/docs/en/storage-fusion/2.6?topic=alerts-cephosdslowops
> > >But when i run command
> > >
> > >ceph daemon osd.8 dump_historic_ops show the error, the command run on
> > node
> > >with osd.8
> > >Can't get admin socket path: unable to get conf option admin_socket for
> > >osd: b"error parsing 'osd': expected string of the form TYPE.ID, valid
> > >types are: auth, mon, osd, mds, mgr, client\n"
> > >
> > >I am running ceph cluster reef version by cephadmin install
> > >
> > >What should I do?
> >
> > The easiest is use tell, then you can run it on any node that have access
> > to ceph.
> >
> >  ceph tell osd.8 dump_historic_ops
> >
> >
> >  ceph tell osd.8 help
> > will give you all you can do with tell.
> >
> > --
> > Kai Stian Olstad
> >
>
>
> --
> Trân trọng,
>
> 
>
> *Tran Thanh Phong*
>
> Email: tranphong...@gmail.com
> Skype: tranphong079
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Clients failing to respond to capability release

2023-10-02 Thread E Taka
Same problem here with Ceph 17.2.6 on Ubuntu 22.04 and clients on Debian 11,
kernel 6.0.12-1~bpo11+1.

We are still looking for a solution. For the time being we have the
orchestrator restart the MDS daemons by removing/adding labels on the servers.
We use multiple MDS daemons and have plenty of CPU cores and memory, so the
problem should not be due to a lack of resources.
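
In case it is useful: the label shuffle looks roughly like this on our side
(host and daemon names are examples, and it only works because our MDS
placement is label-based); restarting a single daemon also works:

ceph orch host label rm ceph03 mds    # the orchestrator stops the MDS placed on ceph03
ceph orch host label add ceph03 mds   # ...and deploys a fresh one
ceph orch daemon restart mds.cephfs.ceph03.abcdef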

Am Di., 19. Sept. 2023 um 13:36 Uhr schrieb Tim Bishop <
tim-li...@bishnet.net>:

> Hi,
>
> I've seen this issue mentioned in the past, but with older releases. So
> I'm wondering if anybody has any pointers.
>
> The Ceph cluster is running Pacific 16.2.13 on Ubuntu 20.04. Almost all
> clients are working fine, with the exception of our backup server. This
> is using the kernel CephFS client on Ubuntu 22.04 with kernel 6.2.0 [1]
> (so I suspect a newer Ceph version?).
>
> The backup server has multiple (12) CephFS mount points. One of them,
> the busiest, regularly causes this error on the cluster:
>
> HEALTH_WARN 1 clients failing to respond to capability release
> [WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability
> release
> mds.mds-server(mds.0): Client backupserver:cephfs-backupserver failing
> to respond to capability release client_id: 521306112
>
> And occasionally, which may be unrelated, but occurs at the same time:
>
> [WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
> mds.mds-server(mds.0): 1 slow requests are blocked > 30 secs
>
> The second one clears itself, but the first sticks until I can unmount
> the filesystem on the client after the backup completes.
>
> It appears that whilst it's in this stuck state there may be one or more
> directory trees that are inaccessible to all clients. The backup server
> is walking the whole tree but never gets stuck itself, so either the
> inaccessible directory entry is caused after it has gone past, or it's
> not affected. Maybe the backup server is holding a directory when it
> shouldn't?
>
> It may be that an upgrade to Quincy resolves this, since it's more
> likely to be inline with the kernel client version wise, but I don't
> want to knee-jerk upgrade just to try and fix this problem.
>
> Thanks for any advice.
>
> Tim.
>
> [1] The reason for the newer kernel is that the backup performance from
> CephFS was terrible with older kernels. This newer kernel does at least
> resolve that issue.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS failing to respond to capability release while `ls -lR`

2023-10-01 Thread E Taka
Dockerized Ceph 17.2.6 on Ubuntu 22.04

The CephFS file system has a size of 180 TB; only 66 TB are used.

When running an `ls -lR`, the output stops and all accesses to the directory
stall. ceph health says:

# ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; 1 MDSs
report slow requests
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability
release
   mds.vol.ppc721.mvxstq(mds.0): Client dessert failing to respond to
capability release client_id: 6899709
[WRN] MDS_SLOW_REQUEST: 1 MDSs report slow requests
   mds.vol.ppc721.mvxstq(mds.0): 1 slow requests are blocked > 30 secs

ceph -w shows:

[WRN] slow request 31.421408 seconds old, received at
2023-10-01T09:53:44.634849+: client
_request(client.7360117:2224947 getattr AsLsFs #0x412503f
2023-10-01T09:53:44.631148+ caller_uid=0, caller_gid=0{0,}) currently
failed to rdlock, waiting
[WRN] client.6899709 isn't responding to mclientcaps(revoke), ino
0x412503f pending pAsLs
XsFsc issued pAsLsXsFscb, sent 61.422148 seconds ago

The full output of ceph daemon [mds] dump inode 0x412503f, config show,
dump_ops_in_flight  and "ceph -w" with timestamps can be found on
https://gist.github.com/test-erik/5de4a7bd632f62ab58c3115cfb876ae0

Do you have an idea what we can do about this?
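
A heavy-handed workaround would be to evict the client that holds the caps
(which also blocklists it), e.g. with the MDS and client id from the output
above:

ceph tell mds.vol.ppc721.mvxstq session ls               # find the session of client 6899709
ceph tell mds.vol.ppc721.mvxstq client evict id=6899709  # last resort: evict the stuck client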
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to call cephfs-top

2023-04-28 Thread E Taka
I'm using a dockerized Ceph 17.2.6 under Ubuntu 22.04.

Presumably I'm missing a very basic thing, since this seems a very simple
question: how can I call cephfs-top in my environment? It is not included
in the Docker image which is accessed by "cephadm shell".

And calling the version found in the source code always fails with "[errno
13] RADOS permission denied", even when using "--cluster" with the correct
ID, "--conffile" and "--id".

The auth user client.fstop exists, and "ceph fs perf stats" runs.
What am I missing?

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: excluding from host_pattern

2023-01-27 Thread E Taka
Thanks, Ulrich, but:

# ceph orch host ls --host_pattern="^ceph(0[1-9])|(1[0-9])$"
0 hosts in cluster whose hostname matched "^ceph(0[1-9])|(1[0-9])$"

Bash patterns are not accepted. (I tried numerous other combinations.)
But, as I said, it's not really a problem - just wondering what the right
regex might be.
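
For what it's worth, as a plain regex the alternation in that pattern would
need grouping to mean 01-19, e.g. (untested against whatever matcher the
orchestrator actually uses):

# ceph orch host ls --host_pattern='^ceph(0[1-9]|1[0-9])$'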

Am Fr., 27. Jan. 2023 um 19:09 Uhr schrieb Ulrich Klein <
ulrich.kl...@ulrichklein.de>:

> I use something like "^ceph(0[1-9])|(1[0-9])$", but in a script that
> checks a parameter for a "correct" ceph node name like in:
>
>wantNum=$1
>if [[ $wantNum =~ ^ceph(0[2-9]|1[0-9])$ ]] ; then
>   wantNum=${BASH_REMATCH[1]}
>
> Which gives me the number, if it is in the range 02-19
>
> Dunno, if that helps :)
>
> Ciao, Uli
>
> > On 27. Jan 2023, at 18:17, E Taka <0eta...@gmail.com> wrote:
> >
> > Hi,
> >
> > I wonder if it is possible to define a host pattern, which includes the
> > host names
> > ceph01…ceph19, but no other hosts, especially not ceph00. That means,
> this
> > pattern is wrong: ceph[01][0-9] , since it includes ceph00.
> >
> > Not really a problem, but it seems that the "“host-pattern” is a regex
> that
> > matches against hostnames and returns only matching hosts"¹ is not
> defined
> > more precisely in the docs.
> >
> > 1) https://docs.ceph.com/en/latest/cephadm/host-management/
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] excluding from host_pattern

2023-01-27 Thread E Taka
Hi,

I wonder if it is possible to define a host pattern, which includes the
host names
ceph01…ceph19, but no other hosts, especially not ceph00. That means, this
pattern is wrong: ceph[01][0-9] , since it includes ceph00.

Not really a problem, but it seems that the "“host-pattern” is a regex that
matches against hostnames and returns only matching hosts"¹ is not defined
more precisely in the docs.

1) https://docs.ceph.com/en/latest/cephadm/host-management/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Useful MDS configuration for heavily used Cephfs

2023-01-16 Thread E Taka
Thanks, Frank, for these detailed insights! I really appreciate your help.

Am Mo., 16. Jan. 2023 um 09:49 Uhr schrieb Frank Schilder :

> Hi, we are using ceph fs for data on an HPC cluster and, looking at your
> file size distribution, I doubt that MDS performance is a bottleneck. Your
> limiting factors are super-small files and IOP/s budget of the fs data
> pool. On our system, we moved these workloads to an all-flash beegfs. Ceph
> is very good for cold data, but this kind of hot data on HDD is a pain.
>
> Problems to consider: What is the replication factor of your data pool?
> The IOP/s budget of a pool is approximately 100*#OSDs/replication factor.
> If you use 3xrep, the budget is 1/3rd of what your drives can do
> aggregated. If you use something like 8+2 or 8+3, its 1/10 or 1/11 of
> aggregated. Having WAL/DB on SSD might give you 100% of HDD IOP/s, because
> the bluestore admin IO goes to SSD.
>
> All in all, this is not much. Example: 100HDD OSDs, aggregated IOP/s
> raw=10000, 8+2 EC pool gives 1000 aggregated. If you try to process 10000
> small files, this will feel very slow compared to a crappy desktop-SSD. On
> the other hand, large file-IO (streaming of data) will fly with up to
> 1.2GB/s.
>
> Second issue is allocation amplification. Even with min_alloc_size=4K on
> HDD, with your file size distribution you will have a dramatic allocation
> amplification. Take again EC 8+2. For this pool, the minimum allocation
> size is 8*4K=32K. All files from 0-32K will allocate 32K. All files from
> 32K to 64K will allocate 64K.  Similar but less pessimistic calculation
> applies to replicated pools.
>
> If you have mostly small file IO and only the occasional large files, you
> might consider having 2 data pools on HDD. One 3x-replicated pool as
> default and an 8+2 or 8+3 EC pool for large files. You can give every user
> a folder on the large-file pool (say "archives" under home). For data
> folders users should tell you what it is for.
>
> If you use a replicated HDD pool for all the small files up to 64K you
> will probably get good performance out of it. Still, solid state storage
> would be much preferable. When I look at your small file count (0-64K),
> 2Miox64Kx3 is about 400G. If you increase by a factor of 10, we are talking
> about 4T solid state storage. That is really not much and should be
> affordable if money is not the limit. I would consider a small SSD pool for
> all the small files and use the HDDs for the large ones and possibly
> compressed archives.
>
> You can force users to use the right place by setting appropriate quotas
> on the sub-dirs on different pools. You could offer a hierarchy of rep-HDD
> as default, EC-HDD on "archives" and rep-SSD on demand.
>
> For MDS config, my experience is that human users do actually not allocate
> large amounts of meta-data. We have 8 active MDSes with 24G memory limit
> each and I don't see human users using all that cache. What does happen
> though is that backup daemons allocate a lot, but obviously don't use it
> (its a read-once workload). Our MDSes hold 4M inodes and 4M dentries in 12G
> (I have set mid-point to 0.5) and that seems more than enough. What is
> really important is to pin directories. We use manual pinning over all
> ranks and it works like a charm. If you don't pin, the MDSes will not work
> very well. I had a thread on that 2-3 months ago.
>
> Hope that is helpful.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: E Taka <0eta...@gmail.com>
> Sent: 15 January 2023 17:45:25
> To: Darren Soothill
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: Useful MDS configuration for heavily used Cephfs
>
> Thanks for the detailed inquiry. We use HDD with WAL/DB on SSD. The Ceph
> servers have of lots of RAM and many CPU cores. We are looking for
> a general purpose approach – running reasonably well in most cases is
> better than a perfect solution for one use case.
>
> This is the existing file size distribution. For the future let's multiply
> the number of files by 10 (that's more than 100 TB, I know):
>
>   1k: 532069
>  2k:  54458
>  4k:  36613
>  8k:  37139
> 16k: 726302
> 32k: 286573
> 64k:  55841
> 128k:  30510
> 256k:  37386
> 512k:  48462
>  1M:   9461
>  2M:   4707
>  4M:   9233
>  8M:   4816
> 16M:   3059
> 32M:   2268
> 64M:   4314
> 128M:  17017
> 256M:   7263
> 512M:   1917
>  1G:   1561
>  2G:   1342
>  4G:670
>  8G:493
> 16G:238
> 32G:121
> 64G: 15
> 128G: 10
> 256G:  5
> 512G:  4
>  1T:  3
>  2T:  2

[ceph-users] Re: Useful MDS configuration for heavily used Cephfs

2023-01-15 Thread E Taka
Thanks for the detailed inquiry. We use HDD with WAL/DB on SSD. The Ceph
servers have of lots of RAM and many CPU cores. We are looking for
a general purpose approach – running reasonably well in most cases is
better than a perfect solution for one use case.

This is the existing file size distribution. For the future let's multiply
the number of files by 10 (that's more than 100 TB, I know):

  1k: 532069
 2k:  54458
 4k:  36613
 8k:  37139
16k: 726302
32k: 286573
64k:  55841
128k:  30510
256k:  37386
512k:  48462
 1M:   9461
 2M:   4707
 4M:   9233
 8M:   4816
16M:   3059
32M:   2268
64M:   4314
128M:  17017
256M:   7263
512M:   1917
 1G:   1561
 2G:   1342
 4G:670
 8G:493
16G:238
32G:121
64G: 15
128G: 10
256G:  5
512G:  4
 1T:  3
 2T:  2

There will be home directories for a few hundred users, and dozens of data
dirs with thousands of files between 10-100 kB, which will be processed one
by one or in parallel. In this process, some small files are written, but
in the main usage, many files of rather small size are read.
(A database would be better suited for this, but I have no influence on
that.) I would prefer not to create separate pools on SSD, but to use the
RAM (some spare servers with 128 GB - 512 GB) for Metadata caching.

Thank you for your encouragement!
Erich

Am So., 15. Jan. 2023 um 16:29 Uhr schrieb Darren Soothill <
darren.sooth...@croit.io>:

> There are a few details missing to allow people to provide you with advice.
>
> How many files are you expecting to be in this 100TB of capacity?
> This really dictates what you are looking for. It could be full of 4K
> files which is a very different proposition to it being full of 100M files.
>
> What sort of media is this file system made up of?
> If you have 10’s millions of files on HDD then you are going to be wanting
> a separate metadata pool for CephFS on some much faster storage.
>
> What is the sort of use case that you are expecting for this storage?
> You say it is heavily used but what does that really mean?
> You have a 1000 HPC nodes all trying to access millions of 4K files?
> Or are you using it as a more general purpose file system for say home
> directories?
>
>
>
> Darren Soothill
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io/
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
>
>
>
> On 15 Jan 2023, at 09:26, E Taka <0eta...@gmail.com> wrote:
>
> Ceph 17.2.5:
>
> Hi,
>
> I'm looking for a reasonable and useful MDS configuration for a – in
> future, no experiences until now – heavily used CephFS (~100TB).
> For example, does it make a difference to increase the
> mds_cache_memory_limit or the number of MDS instances?
>
> The hardware does not set any limits, I just want to know where the default
> values can be optimized usefully before problem occur.
>
> Thanks,
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Useful MDS configuration for heavily used Cephfs

2023-01-15 Thread E Taka
Ceph 17.2.5:

Hi,

I'm looking for a reasonable and useful MDS configuration for a heavily used
CephFS (~100TB) – heavily used in the future, that is; we have no experience
with it yet. For example, does it make a difference to increase
mds_cache_memory_limit or the number of MDS instances?

The hardware does not impose any limits; I just want to know where the default
values can usefully be tuned before problems occur.
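
For the record, the two knobs I have in mind would be set roughly like this
(the values are only examples):

ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB instead of the 4 GiB default
ceph fs set <fs_name> max_mds 2                           # a second active MDS rank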

Thanks,
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Removing OSDs - draining but never completes.

2023-01-12 Thread E Taka
You have to wait until the rebalancing has finished.

Am Di., 10. Jan. 2023 um 17:14 Uhr schrieb Wyll Ingersoll <
wyllys.ingers...@keepertech.com>:

> Running ceph-pacific 16.2.9 using ceph orchestrator.
>
> We made a mistake adding a disk to the cluster and immediately issued a
> command to remove it using "ceph orch osd rm ### --replace --force".
>
> This OSD had no data on it at the time and was removed after just a few
> minutes.  "ceph orch osd rm status" shows that it is still "draining".
> ceph osd df shows that the osd being removed has -1 PGs.
>
> So - why is the simple act of removal taking so long and can we abort it
> and manually remove that osd somehow?
>
> Note: the cluster is also doing a rebalance while this is going on, but
> the osd being removed never had any data and should not be affected by the
> rebalance.
>
> thanks!
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Mysterious Disk-Space Eater

2023-01-12 Thread E Taka
We had a similar problem, and in our case it was a (visible) logfile. It is
easy to find with the ncdu utility (`ncdu -x /var`). There is no need for a
reboot; you can get rid of it by restarting the monitor with `ceph orch daemon
restart mon.NODENAME`. You may also lower the debug level.
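
Lowering the debug level would be something like this (a sketch; pick the
subsystem that is actually flooding the log):

ceph config set mon debug_mon 1/5
ceph config set mon debug_paxos 1/5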

Am Do., 12. Jan. 2023 um 09:14 Uhr schrieb Eneko Lacunza :

> Hi,
>
> El 12/1/23 a las 3:59, duluxoz escribió:
> > Got a funny one, which I'm hoping someone can help us with.
> >
> > We've got three identical(?) Ceph Quincy Nodes running on Rocky Linux
> > 8.7. Each Node has 4 OSDs, plus Monitor, Manager, and iSCSI G/W
> > services running on them (we're only a small shop). Each Node has a
> > separate 16 GiB partition mounted as /var. Everything is running well
> > and the Ceph Cluster is handling things very well).
> >
> > However, one of the Nodes (not the one currently acting as the Active
> > Manager) is running out of space on /var. Normally, all of the Nodes
> > have around 10% space used (via a df -H command), but the problem Node
> > only takes 1 to 3 days to run out of space, hence taking it out of
> > Quorum. Its currently at 85% and growing.
> >
> > At first we thought this was caused by an overly large log file, but
> > investigations showed that all the logs on all 3 Nodes were of
> > comparable size. Also, searching for the 20 largest files on the
> > problem Node's /var didn't produce any significant results.
> >
> > Coincidentally, unrelated to this issue, the problem Node (but not the
> > other 2 Nodes) was re-booted a couple of days ago and, when the
> > Cluster had re-balanced itself and everything was back online and
> > reporting as Healthy, the problem Node's /var was back down to around
> > 10%, the same as the other two Nodes.
> >
> > This lead us to suspect that there was some sort of "run-away" process
> > or journaling/logging/temporary file(s) or whatever that the re-boot
> > has "cleaned up". So we've been keeping an eye on things but we can't
> > see anything causing the issue and now, as I said above, the problem
> > Node's /var is back up to 85% and growing.
> >
> > I've been looking at the log files, tying to determine the issue, but
> > as I don't really know what I'm looking for I don't even know if I'm
> > looking in the *correct* log files...
> >
> > Obviously rebooting the problem Node every couple of days is not a
> > viable option, and increasing the size of the /var partition is only
> > going to postpone the issue, not resolve it. So if anyone has any
> > ideas we'd love to hear about it - thanks
>
> This seems one or more files that are removed but some process has their
> handle open (and maybe is still writing...). When rebooting process is
> terminated and file(s) effectively removed.
>
> Try to inspect each process' open files and find what file(s) have no
> longer a directory entry... that would give you a hint.
>
> Cheers
>
>
> Eneko Lacunza
> Zuzendari teknikoa | Director técnico
> Binovo IT Human Project
>
> Tel. +34 943 569 206 |https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Increase the recovery throughput

2022-12-29 Thread E Taka
Ceph 17.2.5, dockerized, Ubuntu 20.04, OSD on HDD with WAL/DB on SSD.

Hi all,

old topic, but the problem still exists. I tested it extensively,
with osd_op_queue set either to mclock_scheduler (and profile set to high
recovery) or wpq and the well known options (sleep_time, max_backfill) from
https://docs.ceph.com/en/quincy/rados/configuration/osd-config-ref/

When removing an OSD with `ceph orch osd rm X` the backfilling always ends
with a large number of misplaced objects at a low recovery rate (right now
"120979/336643536 objects misplaced (0.036%); 10 KiB/s, 2 objects/s
recovering"). The rate drops significantly when there are very few PGs
involved.I wonder if someone a similar installation as we have (see above)
doesn't experience this problem.
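
For reference, the settings I tested were of this form (the values are just
examples; osd_op_queue only takes effect after an OSD restart):

ceph config set osd osd_op_queue wpq
ceph config set osd osd_mclock_profile high_recovery_ops
ceph config set osd osd_max_backfills 8
ceph config set osd osd_recovery_sleep_hdd 0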

Thanks, Erich


Am Mo., 12. Dez. 2022 um 12:28 Uhr schrieb Frank Schilder :

> Hi Monish,
>
> you are probably on mclock scheduler, which ignores these settings. You
> might want to set them back to defaults, change the scheduler to wpq and
> then try again if it needs adjusting. there were several threads about
> "broken" recovery ops scheduling with mclock in the latest versions.
>
> So, back to Eugen's answer: go through this list and try solutions of
> earlier cases.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Monish Selvaraj 
> Sent: 12 December 2022 11:32:26
> To: Eugen Block
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: Increase the recovery throughput
>
> Hi Eugen,
>
> We tried that already. the osd_max_backfills is in 24 and the
> osd_recovery_max_active is in 20.
>
> On Mon, Dec 12, 2022 at 3:47 PM Eugen Block  wrote:
>
> > Hi,
> >
> > there are many threads dicussing recovery throughput, have you tried
> > any of the solutions? First thing to try is to increase
> > osd_recovery_max_active and osd_max_backfills. What are the current
> > values in your cluster?
> >
> >
> > Zitat von Monish Selvaraj :
> >
> > > Hi,
> > >
> > > Our ceph cluster consists of 20 hosts and 240 osds.
> > >
> > > We used the erasure-coded pool with cache-pool concept.
> > >
> > > Some time back 2 hosts went down and the pg are in a degraded state. We
> > got
> > > the 2 hosts back up in some time. After the pg is started recovering
> but
> > it
> > > takes a long time ( months )  . While this was happening we had the
> > cluster
> > > with 664.4 M objects and 987 TB data. The recovery status is not
> changed;
> > > it remains 88 pgs degraded.
> > >
> > > During this period, we increase the pg size from 256 to 512 for the
> > > data-pool ( erasure-coded pool ).
> > >
> > > We also observed (one week ) the recovery to be very slow, the current
> > > recovery around 750 Mibs.
> > >
> > > Is there any way to increase this recovery throughput ?
> > >
> > > *Ceph-version : quincy*
> > >
> > > [image: image.png]
> >
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Removing OSD very slow (objects misplaced)

2022-12-28 Thread E Taka
Thanks, Liang, but this no longer helps since Ceph 17. Setting the mclock
profile to "high recovery" speeds things up a little. The main problem
remains: 95% of the recovery time is spent on just one PG. This was not
the case before Quincy.

郑亮  schrieb am Mo., 26. Dez. 2022, 03:52:

> Hi erich,
> You can reference following link:
> https://www.suse.com/support/kb/doc/?id=19693
>
> Thanks,
> Liang Zheng
>
>
> E Taka <0eta...@gmail.com> 于2022年12月16日周五 01:52写道:
>
>> Hi,
>>
>> when removing some OSD with the command `ceph orch osd rm X`, the
>> rebalancing starts very fast, but after a while it almost stalls with a
>> very low recovering rate:
>>
>> Dec 15 18:47:17 … : cluster [DBG] pgmap v125312: 3361 pgs: 13
>> active+clean+scrubbing+deep, 4 active+remapped+backfilling, 3344
>> active+clean; 95 TiB data, 298 TiB used, 320 TiB / 618 TiB avail; 13 MiB/s
>> rd, 3.9 MiB/s wr, 610 op/s; 403603/330817302 objects misplaced (0.122%);
>> 1.1 MiB/s, 2 objects/s recovering
>>
>> As you can see, the rate is 2 Objects/s for over 400000 objects. `ceph
>> orch
>> osd rm status` shows long running draining processes (now over 4 days):
>>
>> OSD  HOSTSTATE PGS  REPLACE  FORCE  ZAPDRAIN STARTED AT
>> 64   ceph05  draining1  FalseFalse  False  2022-12-11
>> 16:18:14.692636+00:00
>> …
>>
>> Is there any way to increase the speed of the draining/rebalancing?
>>
>> Thanks!
>> Erich
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Removing OSD very slow (objects misplaced)

2022-12-15 Thread E Taka
Hi,

when removing some OSD with the command `ceph orch osd rm X`, the
rebalancing starts very fast, but after a while it almost stalls with a
very low recovering rate:

Dec 15 18:47:17 … : cluster [DBG] pgmap v125312: 3361 pgs: 13
active+clean+scrubbing+deep, 4 active+remapped+backfilling, 3344
active+clean; 95 TiB data, 298 TiB used, 320 TiB / 618 TiB avail; 13 MiB/s
rd, 3.9 MiB/s wr, 610 op/s; 403603/330817302 objects misplaced (0.122%);
1.1 MiB/s, 2 objects/s recovering

As you can see, the rate is 2 Objects/s for over 400000 objects. `ceph orch
osd rm status` shows long running draining processes (now over 4 days):

OSD  HOST    STATE     PGS  REPLACE  FORCE  ZAP    DRAIN STARTED AT
64   ceph05  draining  1    False    False  False  2022-12-11 16:18:14.692636+00:00
…

Is there any way to increase the speed of the draining/rebalancing?

Thanks!
Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cache modes libvirt

2022-11-30 Thread E Taka
Some information is missing to give a helpful answer.

How do you back up? (Files? RBD via Ceph? A block device with qemu-img?) Which
device driver do you use (virtio? SATA?)?

In our production setup we use virtio RBD and the hypervisor's default cache
mode. The disks are snapshotted before the backup and then exported with
'qemu-img', e.g.:
virsh snapshot-create-as VM backup-VM --diskspec
vda,file=/snapshots/backup-snapshot-VM-vda.raw --disk-only --atomic
--quiesce --no-metadata
qemu-img convert -O raw rbd:libvirt-pool/VM.raw /backup/VM.raw -p
virsh blockcommit VM vda --wait --active --pivot

The backup process is as fast as expected.

Am Mi., 30. Nov. 2022 um 16:09 Uhr schrieb Dominique Ramaekers <
dominique.ramaek...@cometal.be>:
>
> Hi,
>
> I was wondering...
>
> In Ceph/Libvirt docs only cachmodes writetrough and writeback are
discussed. My clients's disks are all set to writeback in the libvirt
client xml-definition.
>
> For a backup operation, I notice a severe lag on one of my VM's. Such a
backup operation that takes 1 to 2 hours (on a same machine but) on local
LVM storage, takes on a ceph storage 6 hours. Though I have to say that the
cachemode on the local LVM storage is set on directsync.
>
> => So what if I played around with other cachemodes? Will it make a
difference?
>
> I'm trying now cachemode 'unsafe' so tonight I should measure the
difference if there is any.
>
> Input will be greatly appreciated.
>
> Greetings,
>
> Dominique.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Quincy 17.2.5: proper way to replace OSD (HDD with Wal/DB on SSD)

2022-11-30 Thread E Taka
Ubuntu 20.04, Ceph 17.2.5, dockerized

Hello all,
this is frequently asked, but the answers I found are either old or do not
cover an extra WAL/DB device. The situation: an OSD located on an HDD, with
its WAL/DB on an SSD that is shared by all OSDs of the host. The OSD is in,
up and running.

What is currently the recommended way to replace the HDD? Of course after
the replacement the SSD should be used as before.
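
For reference, the rough flow I have pieced together from the docs looks like
this (a sketch, not verified here; id, host and device names are placeholders):

ceph orch osd rm <osd_id> --replace --zap    # drain, destroy and zap; the OSD id stays reserved
# swap the HDD; a drivegroup/OSD spec with db_devices should then recreate the
# OSD with its DB on the shared SSD, or it can be triggered manually:
ceph orch daemon add osd <host>:data_devices=/dev/sdX,db_devices=/dev/sdY

Is this correct, and does the --zap also clean up the old DB LV on the SSD?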

Thanks, Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Enable Centralized Logging in Dashboard.

2022-11-16 Thread E Taka
Thank you, Nizam. I wasn't aware that the Dashboard login is not the same
as the grafana login. Now I have accass to the logfiles.

Am Mi., 16. Nov. 2022 um 15:06 Uhr schrieb Nizamudeen A :

> Hi,
>
> Did you login to the grafana dashboard? For centralized logging you'll
> need to login to the
> grafana using your grafana username and password. If you do that and
> refresh the dashboard,
> I think the Loki page should be visible from the Daemon Logs page.
>
> Regards,
> Nizam
>
> On Wed, Nov 16, 2022 at 7:31 PM E Taka <0eta...@gmail.com> wrote:
>
>> Ceph: 17.2.5, dockerized with Ubuntu 20.04
>>
>> Hi all,
>>
>> I try to enable the Centralized Logging in Dashboard as described in
>>
>>
>> https://docs.ceph.com/en/quincy/cephadm/services/monitoring/#cephadm-monitoring-centralized-logs
>>
>> Logging into files is enabled:
>>   ceph config set global log_to_file true
>>   ceph config set global mon_cluster_log_to_file true
>>
>> Loki is deployed at one host, promtail on every host:
>>
>>
>> service_type: loki
>> service_name: loki
>> placement:
>>  hosts:
>>  - ceph00
>> ---
>> service_type: promtail
>> service_name: promtail
>> placement:
>>  host_pattern: '*'
>>
>>
>> After applying the YAML above the log messages in »ceph -W cephadm« look
>> good (deploying loki+promtail and reconfiguring grafana). But the
>> Dashboard
>> "Cluster → Logs → Daemon Logs" just shows a standard grafana page without
>> any buttons for the Ceph Cluster. Its URL is https://ceph00.
>>
>> [mydomain]:3000/explore?orgId=1&left=["now-1h","now","Loki",{"refId":"A"}]&kiosk.
>>
>> Did I miss something for the centralized Logging in the Dashboard?
>>
>> Thank!
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Enable Centralized Logging in Dashboard.

2022-11-16 Thread E Taka
Ceph: 17.2.5, dockerized with Ubuntu 20.04

Hi all,

I try to enable the Centralized Logging in Dashboard as described in

https://docs.ceph.com/en/quincy/cephadm/services/monitoring/#cephadm-monitoring-centralized-logs

Logging into files is enabled:
  ceph config set global log_to_file true
  ceph config set global mon_cluster_log_to_file true

Loki is deployed at one host, promtail on every host:


service_type: loki
service_name: loki
placement:
 hosts:
 - ceph00
---
service_type: promtail
service_name: promtail
placement:
 host_pattern: '*'


After applying the YAML above the log messages in »ceph -W cephadm« look
good (deploying loki+promtail and reconfiguring grafana). But the Dashboard
"Cluster → Logs → Daemon Logs" just shows a standard grafana page without
any buttons for the Ceph Cluster. Its URL is https://ceph00.
[mydomain]:3000/explore?orgId=1&left=["now-1h","now","Loki",{"refId":"A"}]&kiosk.

Did I miss something for the centralized Logging in the Dashboard?

Thank!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Mails not getting through?

2022-11-16 Thread E Taka
gmail marks too many messages on this mailing list as spam.

Am Mi., 16. Nov. 2022 um 11:01 Uhr schrieb Kai Stian Olstad <
ceph+l...@olstad.com>:

> On 16.11.2022 00:25, Daniel Brunner wrote:
> > are my mails not getting through?
> >
> > is anyone receiving my emails?
>
> You can check this yourself by checking the archives
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/
> If you see your mail there, they are getting through.
>
> --
> Kai Stian Olstad
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why did ceph turn /etc/ceph/ceph.client.admin.keyring into a directory?

2022-10-25 Thread E Taka
Question 1) makes me wonder too.

This results in errors:

2022-10-25T11:20:00.000109+0200 mon.ceph00 [INF] overall HEALTH_OK
2022-10-25T11:21:05.422793+0200 mon.ceph00 [WRN] Health check failed:
failed to probe daemons or devices (CEPHADM_REFRESH_FAILED)
2022-10-25T11:22:06.037456+0200 mon.ceph00 [INF] Health check cleared:
CEPHADM_REFRESH_FAILED (was: failed to probe daemons or devices)
2022-10-25T11:22:06.037491+0200 mon.ceph00 [INF] Cluster is now healthy
2022-10-25T11:30:00.71+0200 mon.ceph00 [INF] overall HEALTH_OK

I would like to stop this behavior. But how?
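
If it is cephadm itself distributing the admin keyring (e.g. because of the
_admin label), I would expect these to show where it comes from (a sketch):

ceph orch client-keyring ls                                # keyrings cephadm manages, and their placement
ceph config get mgr mgr/cephadm/manage_etc_ceph_ceph_conf  # whether cephadm also manages /etc/ceph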

Am Di., 25. Okt. 2022 um 09:44 Uhr schrieb Marc :

> >
> > 1) Why does ceph delete /etc/ceph/ceph.client.admin.keyring several
> > times a
> > day?
> >
> > 2) Why was it turned into a directory? It contains one file
> > "ceph.client.admin.keyring.new". This then causes an error in the ceph
> > logs
> > when ceph tries to remove the file: "rm: cannot remove
> > '/etc/ceph/ceph.client.admin.keyring': Is a directory".
> >
>
> Are you using the ceph-csi driver? The ceph csi people just delete your
> existing ceph files and mount your root fs when you are not running the
> driver in a container. They seem to think that checking for files and
> validating parameters is not necessary.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recommended procedure in case of OSD_SCRUB_ERRORS / PG_DAMAGED

2022-10-19 Thread E Taka
Thanks, I will try this the next time!

Am Mi., 19. Okt. 2022 um 13:50 Uhr schrieb Eugen Block :

> Hi,
>
> you don't need to stop the OSDs, just query the inconsistent object,
> here's a recent example (form an older cluster though):
>
> ---snip---
>  health: HEALTH_ERR
>  1 scrub errors
>  Possible data damage: 1 pg inconsistent
>
> admin:~ # ceph health detail
> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
>  pg 7.17a is active+clean+inconsistent, acting [15,2,58,33,28,69]
>
> admin:~ # rados -p cephfs_data list-inconsistent-obj 7.17a | jq
> [...]
>"shards": [
>  {
>"osd": 2,
>"primary": false,
>"errors": [],
>"size": 2780496,
>"omap_digest": "0x",
>"data_digest": "0x11e1764c"
>  },
>  {
>"osd": 15,
>"primary": true,
>"errors": [],
>"size": 2780496,
>"omap_digest": "0x",
>"data_digest": "0x11e1764c"
>  },
>  {
>"osd": 28,
>"primary": false,
>"errors": [],
>"size": 2780496,
>"omap_digest": "0x",
>"data_digest": "0x11e1764c"
>  },
>  {
>"osd": 33,
>"primary": false,
>"errors": [
>  "read_error"
>],
>"size": 2780496
>  },
>  {
>"osd": 58,
>"primary": false,
>"errors": [],
>"size": 2780496,
>"omap_digest": "0x",
>"data_digest": "0x11e1764c"
>  },
>  {
>"osd": 69,
>"primary": false,
>"errors": [],
>"size": 2780496,
>"omap_digest": "0x",
>"data_digest": "0x11e1764c"
> ---snip---
>
> Five of the six omap_digest and data_digest values were identical, so
> it was safe to run 'ceph pg repair 7.17a'.
>
> Regards,
> Eugen
>
> Zitat von E Taka <0eta...@gmail.com>:
>
> > (17.2.4, 3 replicated, Container install)
> >
> > Hello,
> >
> > since many of the information found in the WWW or books is outdated, I
> want
> > to ask which procedure is recommended to repair damaged PG with status
> > active+clean+inconsistent for Ceph Quincy.
> >
> > IMHO, the best process for a pool with 3 replicas it would be to check if
> > two of the replicas are identical and replace the third different one.
> >
> > If I understand it correctly, the ceph-objectstore-tool could be used for
> > this approach, but unfortunately it is difficult even to start,
> especially
> > in a Docker environment. (OSD have to marked as "down", the Ubuntu
> package
> > ceph-osd, where ceph-objectstore-tool is included, starts server
> processes
> > which confuse the dockerized environment).
> >
> > Is “ceph pg repair” safe to use, and is there a risk to enable
> > osd_scrub_auto_repair and osd_repair_during_recovery?
> >
> > Thanks!
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Recommended procedure in case of OSD_SCRUB_ERRORS / PG_DAMAGED

2022-10-19 Thread E Taka
(17.2.4, 3 replicated, Container install)

Hello,

since much of the information found on the web or in books is outdated, I want
to ask which procedure is recommended to repair a damaged PG with status
active+clean+inconsistent on Ceph Quincy.

IMHO, the best process for a pool with 3 replicas would be to check whether
two of the replicas are identical and replace the differing third one.

If I understand it correctly, ceph-objectstore-tool could be used for this
approach, but unfortunately it is difficult even to start, especially in a
Docker environment (the OSD has to be marked "down", and the Ubuntu package
ceph-osd, which contains ceph-objectstore-tool, starts server processes that
confuse the dockerized environment).

Is “ceph pg repair” safe to use, and is there a risk in enabling
osd_scrub_auto_repair and osd_repair_during_recovery?

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 17.2.4: mgr/cephadm/grafana_crt is ignored

2022-10-05 Thread E Taka
Thanks, Redouane, that helped! The documentation should of course also be
updated in this context.

Am Mi., 5. Okt. 2022 um 15:33 Uhr schrieb Redouane Kachach Elhichou <
rkach...@redhat.com>:

> Hello,
>
> As of this PR https://github.com/ceph/ceph/pull/47098 grafana cert/key are
> now stored per-node. So instead of *mgr/cephadm/grafana_crt* they are
> stored per-nodee as:
>
> *mgr/cephadm/{hostname}/grafana_crt*
> *mgr/cephadm/{hostname}/grafana_key*
>
> In order to see the config entries that have been generated you can filter
> by:
>
> > ceph config-key dump | grep grafana | grep crt
>
> I hope that helps,
> Redo.
>
>
>
> On Wed, Oct 5, 2022 at 3:19 PM E Taka <0eta...@gmail.com> wrote:
>
> > Hi,
> >
> > since the last update from 17.2.3 to version 17.2.4, the
> > mgr/cephadm/grafana_crt
> > setting is ignored. The output of
> >
> > ceph config-key get mgr/cephadm/grafana_crt
> > ceph config-key get mgr/cephadm/grafana_key
> > ceph dashboard get-grafana-frontend-api-url
> >
> > is correct.
> >
> > Grafana and the Dashboard are re-applied, re-started, and re-configured
> via
> > "ceph orch", even the nodes are rebooted. The dashboard is {dis,en}abled
> as
> > documented via "ceph mgr module en/disable dashboard"
> >
> > But the Grafana-Dashboards still use a self signed certificate, and not
> the
> > provided one from mgr/cephadm/grafana_crt.
> >
> > Prior the update this was never a problem. What did I miss?
> >
> > Thanks,
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 17.2.4: mgr/cephadm/grafana_crt is ignored

2022-10-05 Thread E Taka
Hi,

since the last update from 17.2.3 to version 17.2.4, the
mgr/cephadm/grafana_crt
setting is ignored. The output of

ceph config-key get mgr/cephadm/grafana_crt
ceph config-key get mgr/cephadm/grafana_key
ceph dashboard get-grafana-frontend-api-url

is correct.

Grafana and the Dashboard were re-applied, restarted, and reconfigured via
"ceph orch"; the nodes were even rebooted. The dashboard was disabled and
re-enabled as documented, via "ceph mgr module disable/enable dashboard".

But the Grafana-Dashboards still use a self signed certificate, and not the
provided one from mgr/cephadm/grafana_crt.

Prior the update this was never a problem. What did I miss?

Thanks,
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS crashes after evicting client session

2022-09-22 Thread E Taka
Ceph 17.2.3 (dockerized in Ubuntu 20.04)

The subject says it. The MDS process always crashes after evicting. ceph -w
shows:

2022-09-22T13:26:23.305527+0200 mds.ksz-cephfs2.ceph00.kqjdwe [INF]
Evicting (and blocklisting) client session 5181680 (
10.149.12.21:0/3369570791)
2022-09-22T13:26:35.729317+0200 mon.ceph00 [INF] daemon
mds.ksz-cephfs2.ceph03.vsyrbk restarted
2022-09-22T13:26:36.039678+0200 mon.ceph00 [INF] daemon
mds.ksz-cephfs2.ceph01.xybiqv restarted
2022-09-22T13:29:21.000392+0200 mds.ksz-cephfs2.ceph04.ekmqio [INF]
Evicting (and blocklisting) client session 5249349 (
10.149.12.22:0/2459302619)
2022-09-22T13:29:32.069656+0200 mon.ceph00 [INF] daemon
mds.ksz-cephfs2.ceph01.xybiqv restarted
2022-09-22T13:30:00.000101+0200 mon.ceph00 [INF] overall HEALTH_OK
2022-09-22T13:30:20.710271+0200 mon.ceph00 [WRN] Health check failed: 1
daemons have recently crashed (RECENT_CRASH)

The crash info of the crashed MDS is:
# ceph crash info
2022-09-22T11:26:24.013274Z_b005f3fc-7704-4cfc-96c5-f2a9c993f166
{
   "assert_condition": "!mds->is_any_replay()",
   "assert_file":
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc",

   "assert_func": "void MDLog::_submit_entry(LogEvent*,
MDSLogContextBase*)",
   "assert_line": 283,
   "assert_msg":
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc:
In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)'
thread 7f76fa8f6700 time
2022-09-22T11:26:23.992050+\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc:
283: FAILED ceph_assert(!mds->is_any_replay())\n",
   "assert_thread_name": "ms_dispatch",
   "backtrace": [
   "/lib64/libpthread.so.0(+0x12ce0) [0x7f770231bce0]",
   "gsignal()",
   "abort()",
   "(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1b0) [0x7f770333bcd2]",
   "/usr/lib64/ceph/libceph-common.so.2(+0x283e95) [0x7f770333be95]",
   "(MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x3f)
[0x55991905efdf]",
   "(Server::journal_close_session(Session*, int, Context*)+0x78c)
[0x559918d7d63c]",
   "(Server::kill_session(Session*, Context*)+0x212) [0x559918d7dd92]",
   "(Server::apply_blocklist()+0x10d) [0x559918d7e04d]",
   "(MDSRank::apply_blocklist(std::set, std::allocator > const&, unsigned
int)+0x34) [0x559918d39d74]",
   "(MDSRankDispatcher::handle_osd_map()+0xf6) [0x559918d3a0b6]",
   "(MDSDaemon::handle_core_message(boost::intrusive_ptr
const&)+0x39b) [0x559918d2330b]",
   "(MDSDaemon::ms_dispatch2(boost::intrusive_ptr
const&)+0xc3) [0x559918d23cc3]",
   "(DispatchQueue::entry()+0x14fa) [0x7f77035c240a]",
   "(DispatchQueue::DispatchThread::entry()+0x11) [0x7f7703679481]",
   "/lib64/libpthread.so.0(+0x81ca) [0x7f77023111ca]",
   "clone()"
   ],
   "ceph_version": "17.2.3",
   "crash_id":
"2022-09-22T11:26:24.013274Z_b005f3fc-7704-4cfc-96c5-f2a9c993f166",
   "entity_name": "mds.ksz-cephfs2.ceph03.vsyrbk",
   "os_id": "centos",
   "os_name": "CentOS Stream",
   "os_version": "8",
   "os_version_id": "8",
   "process_name": "ceph-mds",
   "stack_sig":
"b75e46941b5f6b7c05a037f9af5d42bb19d82ab7fc6a3c168533fc31a42b4de8",
   "timestamp": "2022-09-22T11:26:24.013274Z",
   "utsname_hostname": "ceph03",
   "utsname_machine": "x86_64",
   "utsname_release": "5.4.0-125-generic",
   "utsname_sysname": "Linux",
   "utsname_version": "#141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022"
}

(Don't be confused by the time information, "ceph -w" is UTC+2, "crash
info" is UTC)

Should I report this as a bug, or did I miss something that caused the error?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Status occurring several times a day: CEPHADM_REFRESH_FAILED

2022-07-12 Thread E Taka
Yes, "_admin" is set. After some restarting and redeploying the problem
seemed to disappear.

Thanks anyway.
Erich

On Fri., July 8, 2022 at 14:18, Adam King wrote:

> Hello,
>
> Does the MGR node have an "_admin" label on it?
>
> Thanks,
>   - Adam King
>
> On Fri, Jul 8, 2022 at 4:23 AM E Taka <0eta...@gmail.com> wrote:
>
>> Hi,
>>
>> since updating to 17.2.1 we get 5 – 10 times per day the message:
>>
>> [WARN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
>> host cephXX `cephadm gather-facts` failed: Unable to reach remote
>> host cephXX.
>>
>> (cephXX is not always the same node).
>>
>> This status is cleared after one or two minutes.
>>
>> When this happens, the ceph.conf and ceph.client.admin.keyring files are
>> not present for a short time on the MGR node.
>>
>> Do you have an idea what I can do about this?
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Status occurring several times a day: CEPHADM_REFRESH_FAILED

2022-07-08 Thread E Taka
Hi,

since updating to 17.2.1 we get the following message 5–10 times per day:

[WARN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
host cephXX `cephadm gather-facts` failed: Unable to reach remote
host cephXX.

(cephXX is not always the same node).

This status is cleared after one or two minutes.

When this happens, the ceph.conf and ceph.client.admin.keyring files are
not present for a short time on the MGR node.

Do you have an idea what I can do about this?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dashboard: SSL error in the Object gateway menu only

2022-05-23 Thread E Taka
Thanks, Eugen, for pointing that out. The commands for setting some RGW
options disappeared in 16.2.6.
I'm pretty sure that none of the admins used the IP address instead of a
DNS name. We'll  try it with the orch commands.

On Sun., May 22, 2022 at 10:52, Eugen Block wrote:

> Hi,
> in earlier versions (e.g. Nautilus) there was a dashboard command to
> set the RGW hostname, that is not available in Octopus (I didn’t check
> Pacific, probably when cephadm took over), so I would assume that it
> comes from the ‘ceph orch host add’ command and you probably used the
> host’s IP for that? But I’m just guessing here. Not sure if there’s a
> way to reset it with a ceph command or if removing and adding the host
> again would be necessary. There was a thread about that just this week.
>
>
> Zitat von E Taka <0eta...@gmail.com>:
>
> > Version: Pacific 16.2.9
> >
> > Hi,
> >
> > when clicking in the Dashboard at "Object Gateway" submenus, for example
> > "Daemons", the Dashboard gets an HTTP error 500. The logs says about
> this:
> >
> > requests.exceptions.SSLError: HTTPSConnectionPool(host='10.149.12.179',
> > port=8000): Max retries exceeded with url: /admin/metadata/user?myself
> > (Caused by SSLError(CertificateError("hostname '10.149.12.179' doesn't
> > match either of […hostnames…] […]
> >
> > We applied a correct rgw_frontend_ssl_certificate with a FQDN.
> > Obviously the error shows that the Dashboard should use the FQDN instead
> of
> > the correct IP address '10.149.12.179'. But how can I change it?
> >
> > (Yes, there is the workaround  "ceph dashboard set-rgw-api-ssl-verify
> > False", which I try to avoid).
> >
> > Thanks
> > Erich
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Dashboard: SSL error in the Object gateway menu only

2022-05-21 Thread E Taka
Version: Pacific 16.2.9

Hi,

when clicking in the Dashboard at "Object Gateway" submenus, for example
"Daemons", the Dashboard gets an HTTP error 500. The logs says about this:

requests.exceptions.SSLError: HTTPSConnectionPool(host='10.149.12.179',
port=8000): Max retries exceeded with url: /admin/metadata/user?myself
(Caused by SSLError(CertificateError("hostname '10.149.12.179' doesn't
match either of […hostnames…] […]

We applied a correct rgw_frontend_ssl_certificate with a FQDN.
Obviously the error shows that the Dashboard should use the FQDN instead of
the correct IP address '10.149.12.179'. But how can I change it?

(Yes, there is the workaround  "ceph dashboard set-rgw-api-ssl-verify
False", which I try to avoid).

Thanks
Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Importance of CEPHADM_CHECK_KERNEL_VERSION

2022-05-05 Thread E Taka
Hello all,

how important is it to use the same Linux kernel version on all hosts?
The background is that new hosts are installed with the current Ubuntu Server
22.04, while the older ones run Ubuntu 20.04.

In other words: may I disable this check:
 ceph cephadm config-check disable kernel_version

Or should I downgrade the new hosts (upgrading all old hosts would be much
more work since do-release-upgrade is not available yet)?

Thanks, Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Pacific + NFS-Ganesha 4?

2022-03-03 Thread E Taka
Hi,
is there a reason that the current Pacific container of NFS-Ganesha does
not provide the current Ganesha 4 release?

If yes: will Quincy use Ganesha 4?

If not: what's the recommended way to use Ganesha 4 together with Ceph
Pacific?

Thanks
Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to identify the RBD images with the most IO?

2022-02-13 Thread E Taka
That's exactly what I was looking for, thanks!

On Sun., Feb. 13, 2022 at 13:32, Marc wrote:

> >
> > this sounds like a FAQ, but I haven't found a clear answer:
> >
> > How can I identify the most active RBD images, where "most active" means
> > the images with many I/O operations?
> >
>
> rbd perf image iotop
>
> it is in the rbd -h
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to identify the RBD images with the most IO?

2022-02-13 Thread E Taka
Hi,

this sounds like a FAQ, but I haven't found a clear answer:

How can I identify the most active RBD images, where "most active" means
the images with many I/O operations?

Thanks
Erich


(The background of the question: we have many virtual machines running that
we can't monitor directly, and a high I/O rate is a good indication that a VM
is running out of memory after having been allocated its own swap space
against all recommendations.)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: dashboard fails with error code 500 on a particular file system

2022-01-20 Thread E Taka
Hello Ernesto,

I found the reason. One of the users set a directory permission without the
+x bit (drw---). After the command 'chmod 700' everything was OK again.
The MDS log did not help, but with the API call 'ls_dir?path=…' I was able
to iterate to the directory with the wrong permissions.
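In case someone runs into the same symptom: a quick way to locate such
directories from a client mount is something like this — a sketch, assuming
the file system is mounted at /mnt/cephfs; it lists directories whose owner
is missing the execute bit:

find /mnt/cephfs -type d ! -perm -u+x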

IMHO this is not an urgent problem, but a user should not be able to
crash the admin's management interface.

Thanks for your patience
Erich


On Wed., Jan. 19, 2022 at 18:00, Ernesto Puerta <epuer...@redhat.com> wrote:

> Given the error returned from libcephfs is "cephfs.OSError: opendir
> failed: Permission denied [Errno 13]", it could be that the mgr doesn't
> have rights (ceph auth) to access the filesystem. Could you check the mds
> logs for any trace when the Dashboard error appears?
>
> Kind Regards,
> Ernesto
>
>
> On Wed, Jan 19, 2022 at 4:40 PM E Taka <0eta...@gmail.com> wrote:
>
>> Hello Ernesto,
>>
>> the commands worked without any problems, with Ubuntus 20.04 Ceph packages
>> and inside "cephadm shell". I tried all 55k directories of the filesystem.
>>
>> Best,
>> Erich
>>
>> On Mon., Jan. 17, 2022 at 21:10, Ernesto Puerta <epuer...@redhat.com> wrote:
>>
>> > Hi E Taka,
>> >
>> > There's already a report of that issue in 16.2.5 (
>> > https://tracker.ceph.com/issues/51611), stating that it didn't happen
>> in
>> > 16.2.3 (so a regression then), but we couldn't reproduce it so far.
>> >
>> > I just tried creating a regular fresh cephfs filesystem (1 MDS), a
>> > directory inside it (via cephfs-shell) and I could access the directory
>> > from the dashboard with no issues. Is there anything specific on that
>> > deployment? The dashboard basically uses Python libcephfs
>> > <https://docs.ceph.com/en/latest/cephfs/api/libcephfs-py/> for
>> accessing
>> > Cephfs, so could you plz try the same and validate whether it works?
>> >
>> > >>> import cephfs
>> > >>> fs = cephfs.LibCephFS()
>> > >>> fs.conf_read_file('/etc/ceph/ceph.conf')
>> > >>> fs.mount(b'/', b'a')
>> > >>> fs.opendir('/test')
>> >
>> > # NO ERROR
>> >
>> >
>> > Kind Regards,
>> > Ernesto
>> >
>> >
>> > On Sun, Jan 16, 2022 at 11:26 AM E Taka <0eta...@gmail.com> wrote:
>> >
>> >> Dashboard → Filesystems → (filesystem name) → Directories
>> >>
>> >> fails on a particular file system with error "500 - Internal Server
>> >> Error".
>> >>
>> >> The log shows:
>> >>
>> >>  Jan 16 11:22:18 ceph00 bash[96786]:   File
>> >> "/usr/share/ceph/mgr/dashboard/services/cephfs.py", line 57, in opendir
>> >>  Jan 16 11:22:18 ceph00 bash[96786]: d = self.cfs.opendir(dirpath)
>> >>  Jan 16 11:22:18 ceph00 bash[96786]:   File "cephfs.pyx", line 942, in
>> >> cephfs.LibCephFS.opendir
>> >>  Jan 16 11:22:18 ceph00 bash[96786]: cephfs.OSError: opendir failed:
>> >> Permission denied [Errno 13]
>> >>  Jan 16 11:22:18 ceph00 bash[96786]: [:::10.149.249.237:47814]
>> [GET]
>> >> [500] [0.246s] [admin] [513.0B] /ui-api/cephfs/3/ls_dir
>> >>  Jan 16 11:22:18 ceph00 bash[96786]: [b'{"status": "500 Internal Server
>> >> Error", "detail": "The server encountered an unexpected condition which
>> >> prevented it from fulfilling the request.", "request_id":
>> >> "76727248-cf64-4b85-8630-8131e33832f8"}
>> >>
>> >> Do you have an idea what went wrong here and how I can solve this
>> >> issue?
>> >>
>> >> Thanks!
>> >> Erich
>> >> ___
>> >> ceph-users mailing list -- ceph-users@ceph.io
>> >> To unsubscribe send an email to ceph-users-le...@ceph.io
>> >>
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: dashboard fails with error code 500 on a particular file system

2022-01-19 Thread E Taka
Hello Ernesto,

the commands worked without any problems, with Ubuntus 20.04 Ceph packages
and inside "cephadm shell". I tried all 55k directories of the filesystem.

Best,
Erich

On Mon., Jan. 17, 2022 at 21:10, Ernesto Puerta <epuer...@redhat.com> wrote:

> Hi E Taka,
>
> There's already a report of that issue in 16.2.5 (
> https://tracker.ceph.com/issues/51611), stating that it didn't happen in
> 16.2.3 (so a regression then), but we couldn't reproduce it so far.
>
> I just tried creating a regular fresh cephfs filesystem (1 MDS), a
> directory inside it (via cephfs-shell) and I could access the directory
> from the dashboard with no issues. Is there anything specific on that
> deployment? The dashboard basically uses Python libcephfs
> <https://docs.ceph.com/en/latest/cephfs/api/libcephfs-py/> for accessing
> Cephfs, so could you plz try the same and validate whether it works?
>
> >>> import cephfs
> >>> fs = cephfs.LibCephFS()
> >>> fs.conf_read_file('/etc/ceph/ceph.conf')
> >>> fs.mount(b'/', b'a')
> >>> fs.opendir('/test')
>
> # NO ERROR
>
>
> Kind Regards,
> Ernesto
>
>
> On Sun, Jan 16, 2022 at 11:26 AM E Taka <0eta...@gmail.com> wrote:
>
>> Dashboard → Filesystems → (filesystem name) → Directories
>>
>> fails on a particular file system with error "500 - Internal Server
>> Error".
>>
>> The log shows:
>>
>>  Jan 16 11:22:18 ceph00 bash[96786]:   File
>> "/usr/share/ceph/mgr/dashboard/services/cephfs.py", line 57, in opendir
>>  Jan 16 11:22:18 ceph00 bash[96786]: d = self.cfs.opendir(dirpath)
>>  Jan 16 11:22:18 ceph00 bash[96786]:   File "cephfs.pyx", line 942, in
>> cephfs.LibCephFS.opendir
>>  Jan 16 11:22:18 ceph00 bash[96786]: cephfs.OSError: opendir failed:
>> Permission denied [Errno 13]
>>  Jan 16 11:22:18 ceph00 bash[96786]: [:::10.149.249.237:47814] [GET]
>> [500] [0.246s] [admin] [513.0B] /ui-api/cephfs/3/ls_dir
>>  Jan 16 11:22:18 ceph00 bash[96786]: [b'{"status": "500 Internal Server
>> Error", "detail": "The server encountered an unexpected condition which
>> prevented it from fulfilling the request.", "request_id":
>> "76727248-cf64-4b85-8630-8131e33832f8"}
>>
>> Do you have an idea what went wrong here and how I can solve this issue?
>>
>> Thanks!
>> Erich
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] dashboard fails with error code 500 on a particular file system

2022-01-16 Thread E Taka
Dashboard → Filesystems → (filesystem name) → Directories

fails on a particular file system with error "500 - Internal Server Error".

The log shows:

 Jan 16 11:22:18 ceph00 bash[96786]:   File
"/usr/share/ceph/mgr/dashboard/services/cephfs.py", line 57, in opendir
 Jan 16 11:22:18 ceph00 bash[96786]: d = self.cfs.opendir(dirpath)
 Jan 16 11:22:18 ceph00 bash[96786]:   File "cephfs.pyx", line 942, in
cephfs.LibCephFS.opendir
 Jan 16 11:22:18 ceph00 bash[96786]: cephfs.OSError: opendir failed:
Permission denied [Errno 13]
 Jan 16 11:22:18 ceph00 bash[96786]: [:::10.149.249.237:47814] [GET]
[500] [0.246s] [admin] [513.0B] /ui-api/cephfs/3/ls_dir
 Jan 16 11:22:18 ceph00 bash[96786]: [b'{"status": "500 Internal Server
Error", "detail": "The server encountered an unexpected condition which
prevented it from fulfilling the request.", "request_id":
"76727248-cf64-4b85-8630-8131e33832f8"}

Do you have an idea what went wrong here and how I can solve this issue?

Thanks!
Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: jj's "improved" ceph balancer

2021-10-25 Thread E Taka
Hi Jonas,
I'm impressed, Thanks!

I have a question about the usage: do I have to turn off the automatic
balancing feature (ceph balancer off)? Do the upmap balancer and your
customizations get in each other's way, or can I run your script from time
to time?

Thanks
Erich


On Mon., Oct. 25, 2021 at 14:50, Jonas Jelten wrote:

> Hi Dan,
>
> basically it's this: when you have a server that is so big, crush can't
> utilize it the same way as the other smaller servers because of the
> placement constraints,
> the balancer doesn't balance data on the smaller servers any more, because
> it just "sees" the big one to be too empty.
>
> To my understanding the mgr-balancer balances hierarchically, on each
> crush level.
> It moves pgs between buckets on the same level (i.e. from too-full-rack to
> too-empty-rack, from too-full-server to too-empty server, inside a server
> from osd to another osd),
> so when there's e.g. an always-too-empty server, it kinda defeats the
> algorithm and doesn't migrate PGs even when the crush constraints would
> allow it.
> So it won't move PGs from small-server 1 (with osds at ~90% full) to
> small-server 2 (with osds at ~60%), due to server 3 with osds at 30%.
> We have servers with 12T drives and some with 1T drives, and various drive
> counts, so that this situation emerged...
> Since I saw how it could be balanced, but wasn't, I wrote the tool.
>
> I also think that the mgr-balancer approach is good, but the hierarchical
> movements are hard to adjust I think.
> But yes, I see my balancer complementary to the mgr-balancer, and for some
> time I used both (since mgr-balance is happy about my movements and just
> leaves them) and it worked well.
>
> -- Jonas
>
> On 20/10/2021 21.41, Dan van der Ster wrote:
> > Hi,
> >
> > I don't quite understand your "huge server" scenario, other than a basic
> understanding that the balancer cannot do magic in some impossible cases.
> >
> > But anyway, I wonder if this sort of higher order balancing could/should
> be added as a "part two" to the mgr balancer. The existing code does a
> quite good job in many (dare I say most?) cases. E.g. it even balances
> empty clusters perfectly.
> > But after it cannot find a further optimization, maybe a heuristic like
> yours can further refine the placement...
> >
> >  Dan
> >
> >
> > On Wed, 20 Oct 2021, 20:52 Jonas Jelten <jel...@in.tum.de> wrote:
> >
> > Hi Dan,
> >
> > I'm not kidding, these were real-world observations, hence my
> motivation to create this balancer :)
> > First I tried "fixing" the mgr balancer, but after understanding the
> exact algorithm there I thought of a completely different approach.
> >
> > For us the main reason things got out of balance was this (from the
> README):
> > > To make things worse, if there's a huge server in the cluster
> which is so big, CRUSH can't place data often enough on it to fill it to
> the same level as any other server, the balancer will fail moving PGs
> across servers that actually would have space.
> > > This happens since it sees only this server's OSDs as "underfull",
> but each PG has one shard on that server already, so no data can be moved
> on it.
> >
> > But all the aspects in that section play together, and I don't think
> it's easily improvable in mgr-balancer while keeping the same base
> algorithm.
> >
> > Cheers
> >   -- Jonas
> >
> > On 20/10/2021 19.55, Dan van der Ster wrote:
> > > Hi Jonas,
> > >
> > > From your readme:
> > >
> > > "the best possible solution is some OSDs having an offset of 1 PG
> to the ideal count. As a PG-distribution-optimization is done per pool,
> without checking other pool's distribution at all, some devices will be the
> +1 more often than others. At worst one OSD is the +1 for each pool in the
> cluster."
> > >
> > > That's an interesting observation/flaw which hadn't occurred to me
> before. I think we don't ever see it in practice in our clusters because we
> do not have multiple large pools on the same osds.
> > >
> > > How large are the variances in your real clusters? I hope the
> example in your readme isn't from real life??
> > >
> > > Cheers, Dan
> >
> >
> > ___
> > Dev mailing list -- d...@ceph.io
> > To unsubscribe send an email to dev-le...@ceph.io
> >
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Fwd: Dashboard URL

2021-10-24 Thread E Taka
Hi Yuri,

I faced the same problem: recently only IP addresses are listed in
`ceph mgr services`.

As an easy workaround I installed a lighttpd for just one CGI script:


#!/bin/bash

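# Ask the active mgr where the dashboard currently lives
# (jq prints the URL as a quoted string, e.g. "https://10.149.12.22:8443/").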
DASHBOARD=$(ceph mgr services | jq '.dashboard')

DASHIP=$(echo $DASHBOARD | awk -F[/:] '{print $4}')

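# If the mgr only reports an IP address, resolve it back to a hostname via
# reverse DNS so the redirect target matches the TLS certificate.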
if [[ $DASHIP =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
   DASHNAME=$(host $DASHIP | awk '{print $NF}' | sed 's/\.$//')
else
   DASHNAME=$DASHIP
fi

DASHURL='https://'${DASHNAME}:8443/

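# Emit the CGI response: a Refresh header that sends the browser to the
# resolved dashboard URL after 2 seconds, followed by a short body.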
echo "Refresh: 2; url=$DASHURL"
echo "Content-type: text/html"
echo
echo  "You will be redirected to $DASHNAME in 2
seconds..."



Yes, it's only a kludge, but it "works for me".
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS multi active MDS high availability

2021-10-24 Thread E Taka
see https://docs.ceph.com/en/pacific/cephfs/multimds/

If I understand it correctly, do this:

ceph fs set  max_mds 2
ceph fs set  standby_count_wanted 1
ceph orch apply mds  3
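A quick way to verify the result (just a verification hint, nothing specific
to this setup) is

ceph fs status

which shows which MDS daemons are active for the file system and which are
standby.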

On Sun., Oct. 24, 2021 at 09:52, huxia...@horebdata.cn <huxia...@horebdata.cn> wrote:

> Dear Cephers,
>
> When setting up multiple active CephFS MDS, how to make these MDS high
> available? i.e. whenever there is failed MDS, another MDS would quickly
> take over. Does it mean that for N active MDS, I need to set up N standby
> MDS, and make one standby MDS associated with one active MDS?
>
> What would be the best practice for high availability with multiple active
> MDS?
>
> best regards,
>
> samuel
>
>
>
> huxia...@horebdata.cn
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Fwd: Confused dashboard with FQDN

2021-09-05 Thread E Taka
The Dashboard URLs can be very confusing, especially since SSL certificates
require an FQDN while Ceph itself is recommended to be set up with short
hostnames. Not to mention that since 16.2.5 or so, ceph mgr services shows
IP addresses instead of names.

Please read https://docs.ceph.com/en/pacific/mgr/dashboard/

I would try to disable/enable the dashboard, redeploy it, and check the
configuration (ceph orch host ls, ceph config dump | grep dashboard …) for
suspect entries, roughly as sketched below.
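As commands — only the steps already mentioned above, nothing new:

ceph mgr module disable dashboard
ceph mgr module enable dashboard
ceph orch host ls
ceph config dump | grep dashboard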

Best
Erich

On Sat., Sept. 4, 2021 at 19:24, Joseph Timothy Foley wrote:

> Hi there.
> I am learning ceph and made a configuration mistake that I'm trying to fix.
> I have a cluster of Pacific nodes running Centos 8 Stream:  ceph1.ru.is,
> ceph2.ru.is, ceph3.ru.is
> I bootstrapped with ceph1.ru.is (as a FQDN) and added the rest as FQDN.
> Then I discovered that the instructions suggest just using basenames, so I
> used "hostnamectl" to rename them.
> This confused ceph so I used the "remove" and "add" commands to add the
> basenames.
> (That clearly was not the right procedure.  I'm guessing I was supposed to
> remove them somehow before renaming.)
>
> Now things are working, but the web dashboard has entries for both the
> base and FQDN.
> The FQDN entries look like they have daemons running on them, but when you
> expand them under "Cluster >> hosts" all the entries are empty.
> How do I get the dashboard to reflect the real state of things?
>
> Kind regards,
> Joe
> --
> Dr. Joseph T. Foley 
> Lektor | Assistant Professor
> Verkfræðideild | Dept. of Engineering
> Háskólinn í Reykjavík | Reykjavik University
> Menntavegur 1, Nauthólsvík | 102 Reykjavík | Iceland
> Sími/Cell +354-599-6569 | Fax +354-599-6201
> http://www.ru.is
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pacific: access via S3 / Object gateway slow for small files

2021-08-27 Thread E Taka
Hi,

thanks for the answers. My goal was to speed up the S3 interface, and
not only a single program. This was successful with this method:
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#block-and-block-db

However, one major disadvantage was that Cephadm considered the OSDs
as "STRAY DAEMON" and the OSDs could not be administered via the
Dashboard. What really helped was this doc:

https://docs.ceph.com/en/pacific/cephadm/osd/

1. As a prerequisite, one has to turn off the automatic creation of OSDs:

  ceph orch apply osd --all-available-devices --unmanaged=true

2. Then create a YAML specification like this and apply it (the apply command
is sketched right after step 5):

service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0

3. Delete ALL OSDs from one node:
  ceph orch osd rm 
(and probably wait for many hours)

4. Zap those HDD and SSD:
ceph orch device zap  

5. Activate ceph-volume via
  ceph cephadm osd activate 
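
The apply step referenced in step 2 — a minimal sketch, assuming the spec
above was saved as osd_spec.yaml (the file name is just a placeholder);
--dry-run first shows what cephadm would create:

  ceph orch apply -i osd_spec.yaml --dry-run
  ceph orch apply -i osd_spec.yaml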

Et voilà! Now we can use the dashboard, and the SSDs are used for WAL/DB.
This speeds up access to Ceph, especially the S3 API, which is
almost 10 times as fast as before.

For Pacific++ there should be a very prominent reference to the doc
"Cephadm – OSD service", in particular from the "BlueStore Settings"
(first URL above). That would have saved me many hours of testing.

Thanks anyway!

On Tue., Aug. 24, 2021 at 10:41, Janne Johansson wrote:
>
> On Tue, Aug. 24, 2021 at 09:46, Francesco Piraneo G. wrote:
> > Il 24.08.21 09:32, Janne Johansson ha scritto:
> > >> As a simple test I copied an Ubuntu /usr/share/doc (580 MB in 23'000 
> > >> files):
> > >> - rsync -a to a Cephfs took 2 min
> > >> - s3cmd put --recursive took over 70 min
> > >> Users reported that the S3 access is generally slow, not only with 
> > >> s3tools.
> > > Single per-object accesses and writes on S3 are slower, since they
> > > involve both client and server side checksumming, a lot of http(s)
> > > stuff before the actual operations start and I don't think there is a
> > > lot of connection reuse or pipelining being done so you are going to
> > > make some 23k requests, each taking a non-zero time to complete.
> > >
> > Question: Is Swift compatible protocol faster?
>
> Probably not, but make a few tests and find out how it works at your place.
> It's kind of easy to rig both at the same time, so you can test on exactly the
> same setup.
>
> > Use case: I have to store indefinite files quantity for a data storage
> > service; I thought object storage is the unique solution; each file is
> > identified by UUID, no metadata on file, files are chunked 4Mb size each.
>
> That sounds like a better case for S3/Swift.
>
> > In such case cephfs is the best suitable choice?
>
> One factor to add might be "will it be reachable from the outside?",
> since radosgw is kind of easy to put behind a set of load balancers,
> that can wash/clean incoming traffic and handle TLS offload and things
> like that. Putting cephfs out on the internet might have other cons.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Pacific: access via S3 / Object gateway slow for small files

2021-08-24 Thread E Taka
One can find questions about this topic on the web, but most of them are
for older versions of Ceph. So I ask specifically for the current
version:

· Pacific 16.2.5
· 7 nodes (with many cores and RAM) with 111 OSD
· all OSD included by: ceph orch apply osd --all-available-devices
· bucket created in the Dashboard, default-placement
· no slow OSDs

As a simple test I copied an Ubuntu /usr/share/doc (580 MB in 23'000 files):

- rsync -a to a Cephfs took 2 min
- s3cmd put --recursive took over 70 min

Users reported that the S3 access is generally slow, not only with s3tools.

So my question is: How do we speed up access via S3?

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Balanced use of HDD and SSD

2021-08-09 Thread E Taka
Hello all,

a year ago we started with a 3-node-Cluster for Ceph with 21 HDD and 3
SSD, which we installed with Cephadm, configuring the disks with
`ceph orch apply osd --all-available-devices`

Over time the usage grew quite significantly: we now have another
5 nodes with 8-12 HDDs and 1-2 SSDs each; the integration worked without
any problems with `ceph orch add host`. Now we wonder whether the HDDs and
SSDs are used as recommended, so that access is fast, but without

My questions: how can I check what the data_devices and db_devices
are? Can we still apply a setup like, for example, the second one in this
documentation? https://docs.ceph.com/en/latest/cephadm/osd/#the-simple-case
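
A sketch of what can be used for that check — osd.0 is just an arbitrary
example ID:

ceph orch ls osd --export           # the OSD service spec cephadm currently applies
ceph osd metadata 0 | grep -E 'devices|db'   # data and DB devices of one OSD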

Some technical details: Xeons with plenty of RAM and cores, Ceph 16.2.5
with mostly default configuration, Ubuntu 20.04, separate cluster and
public networks (both 10 Gb), usage as RBD (QEMU), CephFS, and the Ceph
Object Gateway. (The latter is surprisingly slow, but I want to sort
out the problem with the underlying configuration first.)

Thanks for any helpful responses,
Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: All OSDs on one host down

2021-08-07 Thread E Taka
A few hours ago we had the same problem, also with Ubuntu 20.04, and
it coincided with the latest Docker update, which was triggered by
Puppet. In the end, all the containers came back up without a reboot.
Thanks for the hint.

Note to myself: change the package parameter for the Ubuntu package
'docker.io' from 'latest' to 'installed'.
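
Side note on the log question in the quoted thread below: with containerized
daemons the per-OSD logs can be reached with something like the following —
a sketch, osd.5 as in the example below, <fsid> is a placeholder:

cephadm logs --name osd.5           # add --fsid <fsid> if more than one cluster is present
journalctl -u ceph-<fsid>@osd.5     # matches the unit names shown by "systemctl list-units"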

On Sat., Aug. 7, 2021 at 11:05, Andrew Walker-Brown wrote:
>
> Thanks David,
>
> Spent some more time digging in the logs/google.  Also had a further 2 nodes 
> fail this morning (different nodes).
>
> Looks like it’s related to apt-auto updates on Ubuntu 20.04, although we 
> don’t run unattended upgrades.  Docker appears to get a terminate signal 
> which shutsdown/restarts all the containers but some don’t come back cleanly. 
>  There’s was also some legacy unused interfaces/bonds in the netplan config.
>
> Anyway, cleaned all that up...so hopefully it’s resolved.
>
> Cheers,
>
> A.
>
>
>
> Sent from Mail for Windows 10
>
> From: David Caro
> Sent: 06 August 2021 09:20
> To: Andrew Walker-Brown
> Cc: Marc; 
> ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: All OSDs on one host down
>
> On 08/06 07:59, Andrew Walker-Brown wrote:
> > Hi Marc,
> >
> > Yes i’m probably doing just that.
> >
> > The ceph admin guides aren’t exactly helpful on this.  The cluster was 
> > deployed using cephadm and it’s been running perfectly until now.
> >
> > Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show me the 
> > logs for osd.5 on that host?
>
> On my containerized setup, the services that cephadm created are:
>
> dcaro@node1:~ $ sudo systemctl list-units | grep ceph
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service   
>   loaded 
> active running   Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service  
>   loaded 
> active running   Ceph mgr.node1.mhqltg for 
> d49b287a-b680-11eb-95d4-e45f010c03a8
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service 
>   loaded 
> active running   Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service 
>   loaded 
> active running   Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.7.service 
>   loaded 
> active running   Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
>   system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice   
>   loaded 
> active active
> system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
>   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target
>   loaded 
> active activeCeph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
>   ceph.target 
>   loaded 
> active activeAll Ceph clusters and services
>
> where the string after 'ceph-' is the fsid of the cluster.
> Hope that helps (you can use the systemctl list-units also to search the 
> specific ones on yours).
>
>
> >
> > Cheers,
> > A
> >
> >
> >
> >
> >
> > Sent from Mail for Windows 
> > 10
> >
> > From: Marc
> > Sent: 06 August 2021 08:54
> > To: Andrew Walker-Brown; 
> > ceph-users@ceph.io
> > Subject: RE: All OSDs on one host down
> >
> > >
> > > I’ve tried restarting on of the osds but that fails, journalctl shows
> > > osd not found.not convinced I’ve got the systemctl command right.
> > >
> >
> > You are not mixing 'not container commands' with 'container commands'. As 
> > in, if you execute this journalctl outside of the container it will not 
> > find anything of course.
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> David Caro
> SRE - Cloud Services
> Wikimedia Foundation 
> PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3
>
> "Imagine a world in which every single human being can freely share in the
> sum of all kn

[ceph-users] Re: Dashboard Monitoring: really suppress messages

2021-08-02 Thread E Taka
Hi, sorry for being so vague in the initial question. We use Ceph
16.2.5 (with Docker and Ubuntu 20.04).

Thanks for opening the issue!

On Mon., Aug. 2, 2021 at 09:57, Patrick Seidensal wrote:
>
> Hi Erich,
>
> I agree that there should be a way to disable an alert in the Ceph Dashboard 
> completely, as the alert is completely silent for the Alertmanager when 
> silenced, it should be the same for the Ceph Dashboard.
>
> May I ask which Ceph version you use?
>
> I created an issue to have that fixed [1].
>
> Thanks
>
> Patrick
>
> [1] https://tracker.ceph.com/issues/51987
>
> ____
> From: E Taka <0eta...@gmail.com>
> Sent: Friday, July 30, 2021 12:39 PM
> To: ceph-users
> Subject: [ceph-users] Dashboard Monitoring: really suppress messages
>
> Hi,
>
> we have enabled Cluster → Monitoring in the Dashboard. Some of the
> regularly shown  messages are not really useful for us (packet drops
> in OVS) and we want to suppress them. Creating a silence does not
> help, because the messages still appear, but in blue instead of red
> color.
>
> Is there a way to turn off popups for the monitoring?
> Or maybe, to turn off an alert? I  did not find the place to configure
> these alerts, beside the very limited dashboard interface.
>
> Thanks
> Erich
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Dashboard Monitoring: really suppress messages

2021-07-30 Thread E Taka
Hi,

we have enabled Cluster → Monitoring in the Dashboard. Some of the
regularly shown  messages are not really useful for us (packet drops
in OVS) and we want to suppress them. Creating a silence does not
help, because the messages still appear, but in blue instead of red
color.

Is there a way to turn off popups for the monitoring?
Or maybe to turn off an alert? I did not find the place to configure
these alerts, besides the very limited dashboard interface.

Thanks
Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] large directory /var/lib/ceph/$FSID/removed/

2021-07-27 Thread E Taka
Hi,

the dashboard warns me about a node that is filling up fast.
Indeed, on this node there is a large directory (now 12 GB):
/var/lib/ceph/$FSID/removed/

Is this directory or its content needed? Can I remove the content, or
is there a Ceph command for purging it?

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Start a service on a specified node

2021-06-24 Thread E Taka
If I understand the documentation for the placements in "ceph orch
apply" correctly, I can place the daemons by number or on a specific
host. But what I want is:

"Start 3 mgr services, and one of them should be started on node ceph01."

How can I achieve this?

Thanks!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Enable Dashboard Active Alerts

2021-04-13 Thread E Taka
Hi,

this is documented with many links to other documents, which
unfortunately only confused me. In our 6-node Ceph cluster (Pacific),
the Dashboard tells me that I should "provide the URL to the API of
Prometheus' Alertmanager". We only use the Grafana and Prometheus that
are deployed by cephadm. We did not configure anything unusual, such as
our own containers; we just use the standard cephadm installation.

What the documentation says it "should look like"
(https://docs.ceph.com/en/pacific/mgr/dashboard/#enabling-prometheus-alerting)
already seems to exist in the Docker container "prom/alertmanager:v0.20.0" in
the file /etc/alertmanager/alertmanager.yml:

[…]
- name: 'ceph-dashboard'
 webhook_configs:
 - url: 'https://ceph01:8443/api/prometheus_receiver'
 - url: 'https://10.149.12.22:8443/api/prometheus_receiver'
[…]

(10.149.12.22 is the IP address for ceph01)

Nevertheless I get the message above from the Dashboard.
My question: what do I have to write in which file, or which commands do I
have to run, so that I can access the alerts via the dashboard? Of course
this should survive reboots and updates.
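
For what it's worth, the command that message seems to ask for is probably
something like this — a sketch, assuming the cephadm-deployed Alertmanager
listens on ceph01 on its default port 9093:

ceph dashboard set-alertmanager-api-host 'http://ceph01:9093'
ceph dashboard get-alertmanager-api-host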

Thanks.
Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] grafana-api-url not only for one host

2021-03-03 Thread E Taka
Hi,

if the host to which the grafana-api-url points fails (in the example
below, ceph01.hostxyz.tld:3000), the Ceph Dashboard can't display Grafana data:

# ceph dashboard get-grafana-api-url
https://ceph01.hostxyz.tld:3000

Is it possible to automagically switch to another host?
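
A manual fallback — not the automatic switch asked for above — is to repoint
the URL to a surviving host, e.g. with ceph02.hostxyz.tld as a placeholder:

ceph dashboard set-grafana-api-url https://ceph02.hostxyz.tld:3000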

Thanks, Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Certificate for Dashboard / Grafana

2020-11-25 Thread E Taka
FYI: I've found a solution for the Grafana Certificate. Just run the
following commands:

1.
ceph config-key set mgr/cephadm/grafana_crt -i  
ceph config-key set mgr/cephadm/grafana_key -i  

2.
ceph orch redeploy grafana

3.
ceph config set mgr mgr/dashboard/GRAFANA_API_URL
https://ceph01.domain.tld:3000
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Certificate for Dashboard / Grafana

2020-11-24 Thread E Taka
Hello,

I'm having difficulties with setting up the web certificates for the
Dashboard on hostnames ceph*01..n*.domain.tld.
I set the keys and crt with "ceph config-key". "ceph config-key get
mgr/dashbord/crt" shows the correct certificate,
and the same applies to mgr/dashbord/key, mgr/cephadm/grafana_key and
mgr/cephadm/grafana_crt.
The hosts have been rebooted since then, and I used different browsers etc.
It is not a cache or proxy problem. But:

1. The browser complains about the certificate of the Grafana web page at
https://ceph01.domain.tld:3000, which still uses the self-signed certificate
from cephadm.

2. The Dashboard always forwards from https://ceph01.domain.tld:8443 to
https://ceph01:8443, which means that the browser complains again, since the
certificate requires the FQDN.

How do you handle this?

Thanks, Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Beginner's installation questions about network

2020-11-13 Thread E Taka
Hi Stefan, the cluster network has its own switch and is faster than the
public network.
Thanks for pointing me to the documentation. I must have overlooked this
sentence.

But let me ask another question: do the OSDs use the cluster network
"magically"? I did not find this in the docs, but that may be my fault…

On Fri., Nov. 13, 2020 at 21:40, Stefan Kooman wrote:

> On 2020-11-13 21:19, E Taka wrote:
> > Hello,
> >
> > I want to install Ceph Octopus on Ubuntu 20.04. The nodes have 2
> > network interfaces: 192.168.1.0/24 for the cluster network, and
> > 10.10.0.0/16 as the public network. When I bootstrap with cephadm, which
> > network do I use? That is, do I use cephadm bootstrap --mon-ip
> > 192.168.1.1 or do I have to use the other network?
> >
> > When adding the other hosts with ceph orch, which network do I have to use:
> > ceph orch host add ceph03i 192.168.20.2 …  (or the 10.10 network)
>
> I found the following info in the documentation:
>
> "You need to know which IP address to use for the cluster’s first
> monitor daemon. This is normally just the IP for the first host. If
> there are multiple networks and interfaces, be sure to choose one that
> will be accessible by any host accessing the Ceph cluster."
>
> So that would be the public network in your case. See [1].
>
> Just curious: why do you want to use separate networks? You might as
> well use bonded interfaces on the public network (i.e. LACP) and have
> more redundancy there. I figure that you might even make more effective
> use of the bandwidth as well.
>
> Gr. Stefan
>
> [1]:
> https://docs.ceph.com/en/latest/cephadm/install/#bootstrap-a-new-cluster
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Beginner's installation questions about network

2020-11-13 Thread E Taka
Hello,

I want to install Ceph Octopus on Ubuntu 20.04. The nodes have 2
network interfaces: 192.168.1.0/24 for the cluster network, and
10.10.0.0/16 as the public network. When I bootstrap with cephadm, which
network do I use? That is, do I use cephadm bootstrap --mon-ip
192.168.1.1, or do I have to use the other network?

When adding the other hosts with ceph orch, which network do I have to use:
ceph orch host add ceph03i 192.168.20.2 …  (or the 10.10 network)?
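
For reference, a hedged sketch of one way to wire this up — bootstrap with the
public IP, as the replies above suggest, and make the cluster network explicit
afterwards; <public-ip> is a placeholder and the 192.168.1.0/24 range is taken
from above (OSDs pick the setting up after a restart):

cephadm bootstrap --mon-ip <public-ip-of-the-first-host>
ceph config set global cluster_network 192.168.1.0/24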

Thanks, Erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io