[ceph-users] Re: upgrading from 15.2.17 to 16.2.11 - Health ERROR

2023-03-10 Thread Adam King
The entries in "ceph orch ps" output are gathered by checking the contents
of the /var/lib/ceph// directory on each host. Those
"cephadm." files get deployed normally, though, and aren't usually
reported in "ceph orch ps", since it should only report things that are
directories rather than plain files. You could try removing them
anyway to see what happens (cephadm should just deploy another one, though).
I'd also be interested in what the contents of
/var/lib/ceph// are on that srvcephprod07 node and in what
"cephadm ls" reports on that node (you would have to put a copy of the
cephadm tool on the host to run it).
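
For reference, a rough sketch of what that inspection could look like on srvcephprod07 (the fsid placeholder and the path the cephadm binary is copied from are assumptions, not something confirmed in this thread):

ls -l /var/lib/ceph/<fsid>/                    # real daemons show up as directories; stray cephadm.<hash> entries are plain files
scp <other-node>:/usr/sbin/cephadm /tmp/       # copy the cephadm tool from any node that already has it
chmod +x /tmp/cephadm
sudo /tmp/cephadm ls                           # list the daemons cephadm can see on this host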

As for the logs, the "cephadm.log" on the host is only a log of what the
cephadm tool has done on that host, not of what the cephadm mgr module is
doing. You could try "ceph mgr fail; ceph -W cephadm" and let it sit
for a bit to see if you get a traceback printed that way.
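
A minimal sketch of that, plus one more hedged option for pulling recent module messages out of the cluster log afterwards:

ceph mgr fail            # fail over to a standby mgr so the cephadm module restarts
ceph -W cephadm          # watch the cephadm channel live; leave it running and look for a traceback
ceph log last cephadm    # or dump the most recent cephadm entries from the cluster log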

On Fri, Mar 10, 2023 at 10:41 AM  wrote:

> Looking at the output of "ceph orch upgrade check",
> I found:
> },
>
> "cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2":
> {
> "current_id": null,
> "current_name": null,
> "current_version": null
> },
>
>
> Could this lead to the issue?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS thrashing through the page cache

2023-03-10 Thread Ashu Pachauri
Also, I am able to reproduce the network read amplification when I try to
do very small reads from larger files. e.g.

for i in $(seq 1 10000); do
  dd if=test_${i} of=/dev/null bs=5k count=10
done


This piece of code generates about 3.3 GB of network traffic while it actually
reads only about 500 MB of data (10,000 files x 50 KB each).
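
For anyone wanting to double-check the measurement, a rough sketch of how the client-side network bytes can be captured around the read loop (the eth0 interface name is a placeholder for whatever carries the CephFS traffic):

rx_before=$(cat /sys/class/net/eth0/statistics/rx_bytes)
for i in $(seq 1 10000); do
  dd if=test_${i} of=/dev/null bs=5k count=10
done
rx_after=$(cat /sys/class/net/eth0/statistics/rx_bytes)
echo "received: $(( (rx_after - rx_before) / 1024 / 1024 )) MiB"   # compare against the ~500 MB actually read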


Thanks and Regards,
Ashu Pachauri

On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri  wrote:

> We have an internal use case where we back the storage of a proprietary
> database with a shared file system. We noticed something very odd when testing
> the same workload on a local block-device-backed file system vs. CephFS: the
> amount of network I/O done by CephFS is almost double the I/O done by a local
> file system backed by an attached block device.
>
> We also noticed that CephFS thrashes through the page cache very quickly
> compared to the amount of data being read, and we think the two issues might
> be related. So, I wrote a simple test.
>
> 1. I wrote 10k files 400KB each using dd (approx 4 GB data).
> 2. I dropped the page cache completely.
> 3. I then read these files serially, again using dd. The page cache usage
> shot up to 39 GB for reading such a small amount of data.
>
> Following is the code used to repro this in bash:
>
> for i in $(seq 1 10000); do
>   dd if=/dev/zero of=test_${i} bs=4k count=100
> done
>
> sync; echo 1 > /proc/sys/vm/drop_caches
>
> for i in $(seq 1 10000); do
>   dd if=test_${i} of=/dev/null bs=4k count=100
> done
>
>
> The ceph version being used is:
> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus
> (stable)
>
> The ceph configs being overridden:
> WHO     MASK  LEVEL     OPTION                                  VALUE        RO
> mon           advanced  auth_allow_insecure_global_id_reclaim   false
> mgr           advanced  mgr/balancer/mode                       upmap
> mgr           advanced  mgr/dashboard/server_addr               127.0.0.1    *
> mgr           advanced  mgr/dashboard/server_port               8443         *
> mgr           advanced  mgr/dashboard/ssl                       false        *
> mgr           advanced  mgr/prometheus/server_addr              0.0.0.0      *
> mgr           advanced  mgr/prometheus/server_port              9283         *
> osd           advanced  bluestore_compression_algorithm         lz4
> osd           advanced  bluestore_compression_mode              aggressive
> osd           advanced  bluestore_throttle_bytes                536870912
> osd           advanced  osd_max_backfills                       3
> osd           advanced  osd_op_num_threads_per_shard_ssd        8            *
> osd           advanced  osd_scrub_auto_repair                   true
> mds           advanced  client_oc                               false
> mds           advanced  client_readahead_max_bytes              4096
> mds           advanced  client_readahead_max_periods            1
> mds           advanced  client_readahead_min                    0
> mds           basic     mds_cache_memory_limit                  21474836480
> client        advanced  client_oc                               false
> client        advanced  client_readahead_max_bytes              4096
> client        advanced  client_readahead_max_periods            1
> client        advanced  client_readahead_min                    0
> client        advanced  fuse_disable_pagecache                  false
>
>
> The cephfs mount options (note that readahead was disabled for this test):
> /mnt/cephfs type ceph (rw,relatime,name=cephfs,secret=,acl,rasize=0)
>
> Any help or pointers are appreciated; this is a major performance issue
> for us.
>
>
> Thanks and Regards,
> Ashu Pachauri
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CephFS thrashing through the page cache

2023-03-10 Thread Ashu Pachauri
We have an internal use case where we back the storage of a proprietary
database with a shared file system. We noticed something very odd when testing
the same workload on a local block-device-backed file system vs. CephFS: the
amount of network I/O done by CephFS is almost double the I/O done by a local
file system backed by an attached block device.

We also noticed that CephFS thrashes through the page cache very quickly
compared to the amount of data being read, and we think the two issues might
be related. So, I wrote a simple test.

1. I wrote 10k files 400KB each using dd (approx 4 GB data).
2. I dropped the page cache completely.
3. I then read these files serially, again using dd. The page cache usage
shot up to 39 GB for reading such a small amount of data.

Following is the code used to repro this in bash:

for i in $(seq 1 10000); do
  dd if=/dev/zero of=test_${i} bs=4k count=100
done

sync; echo 1 > /proc/sys/vm/drop_caches

for i in $(seq 1 10000); do
  dd if=test_${i} of=/dev/null bs=4k count=100
done
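
A rough sketch of how the page cache growth can be observed around the read loop (this just samples /proc/meminfo before and after; exact numbers will vary):

grep -E 'MemFree|^Cached' /proc/meminfo       # baseline right after dropping caches
for i in $(seq 1 10000); do
  dd if=test_${i} of=/dev/null bs=4k count=100
done
grep -E 'MemFree|^Cached' /proc/meminfo       # compare "Cached:" growth against the ~4 GB actually read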


The ceph version being used is:
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus
(stable)

The ceph configs being overridden:
WHO     MASK  LEVEL     OPTION                                  VALUE        RO
mon           advanced  auth_allow_insecure_global_id_reclaim   false
mgr           advanced  mgr/balancer/mode                       upmap
mgr           advanced  mgr/dashboard/server_addr               127.0.0.1    *
mgr           advanced  mgr/dashboard/server_port               8443         *
mgr           advanced  mgr/dashboard/ssl                       false        *
mgr           advanced  mgr/prometheus/server_addr              0.0.0.0      *
mgr           advanced  mgr/prometheus/server_port              9283         *
osd           advanced  bluestore_compression_algorithm         lz4
osd           advanced  bluestore_compression_mode              aggressive
osd           advanced  bluestore_throttle_bytes                536870912
osd           advanced  osd_max_backfills                       3
osd           advanced  osd_op_num_threads_per_shard_ssd        8            *
osd           advanced  osd_scrub_auto_repair                   true
mds           advanced  client_oc                               false
mds           advanced  client_readahead_max_bytes              4096
mds           advanced  client_readahead_max_periods            1
mds           advanced  client_readahead_min                    0
mds           basic     mds_cache_memory_limit                  21474836480
client        advanced  client_oc                               false
client        advanced  client_readahead_max_bytes              4096
client        advanced  client_readahead_max_periods            1
client        advanced  client_readahead_min                    0
client        advanced  fuse_disable_pagecache                  false

The cephfs mount options (note that readahead was disabled for this test):
/mnt/cephfs type ceph (rw,relatime,name=cephfs,secret=,acl,rasize=0)
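
For context, a hedged example of the kind of kernel-client mount that produces those options (monitor address, file system name, and secret file path are placeholders for whatever your cluster uses):

mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
  -o name=cephfs,secretfile=/etc/ceph/cephfs.secret,acl,rasize=0   # rasize=0 turns off client readahead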

Any help or pointers are appreciated; this is a major performance issue for
us.


Thanks and Regards,
Ashu Pachauri
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] pg wait too long when osd restart

2023-03-10 Thread yite gu
Hi all,
osd_heartbeat_grace = 20 and osd_pool_default_read_lease_ratio = 0.8 by
default, so a PG will wait up to 16 s (20 x 0.8) in the worst case when an OSD
restarts. This wait time is too long; client I/O stalls of that length are not
acceptable for us. I think lowering osd_pool_default_read_lease_ratio is one
way to address it. Are there any good suggestions for reducing the PG wait time?
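
If lowering the ratio is the route taken, a minimal sketch of the arithmetic and the config change (the 0.5 value is only an illustrative choice, and how quickly a changed ratio actually takes effect should be verified on your release):

# worst-case read-lease wait = osd_heartbeat_grace * osd_pool_default_read_lease_ratio
#   default:  20 * 0.8 = 16 s
#   example:  20 * 0.5 = 10 s
ceph config set global osd_pool_default_read_lease_ratio 0.5   # set globally so whichever daemon consumes it picks it up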

Best Regards,
Yite Gu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: upgrading from 15.2.17 to 16.2.11 - Health ERROR

2023-03-10 Thread xadhoom76
Looking at the output of "ceph orch upgrade check",
I found:
},

"cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2": {
"current_id": null,
"current_name": null,
"current_version": null
},


Could this lead to the issue?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: upgrading from 15.2.17 to 16.2.11 - Health ERROR

2023-03-10 Thread xadhoom76
I found this with "ceph orch ps":

cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2  srvcephprod04  stopped  4m ago  -
cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2  srvcephprod06  stopped  4m ago  -
cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2  srvcephprod07  stopped  4m ago  -

And I cannot remove them.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: upgrading from 15.2.17 to 16.2.11 - Health ERROR

2023-03-10 Thread xadhoom76
I cannot find anything interesting in the cephadm.log.

Now the error is:
HEALTH_ERR
Module 'cephadm' has failed: 'cephadm'
Any idea how to fix it?
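
For what it's worth, a hedged sketch of where the underlying traceback might be found (standard commands; whether the failure left a crash report depends on how the module died):

ceph health detail            # shows the detail behind the MGR_MODULE_ERROR health check
ceph crash ls                 # the module failure may have produced a crash report
ceph crash info <crash-id>    # full traceback for a given crash id, if one exists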
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pg wait too long when osd restart

2023-03-10 Thread Josh Baergen
Hello,

When you say "osd restart", what sort of restart are you referring to
- planned (e.g. for upgrades or maintenance) or unplanned (OSD
hang/crash, host issue, etc.)? If it's the former, then these
parameters shouldn't matter provided that you're running a recent
enough Ceph with default settings - it's supposed to handle planned
restarts with little I/O wait time. There were some issues with this
mechanism before Octopus 15.2.17 / Pacific 16.2.8 that could cause
planned restarts to wait for the read lease timeout in some
circumstances.
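
For what it's worth, a minimal sketch of the kind of planned-restart flow that should see little I/O wait on a recent release (the OSD id and systemd unit name are placeholders; cephadm clusters use ceph-<fsid>@osd.<id> units instead):

ceph osd set noout                      # avoid unnecessary data movement during the maintenance window
systemctl restart ceph-osd@12.service   # hypothetical OSD id on a package-based deployment
ceph osd unset noout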

Josh

On Fri, Mar 10, 2023 at 1:31 AM yite gu  wrote:
>
> Hi all,
> osd_heartbeat_grace = 20 and osd_pool_default_read_lease_ratio = 0.8 by
> default, so a PG will wait up to 16 s (20 x 0.8) in the worst case when an OSD
> restarts. This wait time is too long; client I/O stalls of that length are not
> acceptable for us. I think lowering osd_pool_default_read_lease_ratio is one
> way to address it. Are there any good suggestions for reducing the PG wait time?
>
> Best Regard
> Yite Gu
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] pg wait too long when osd restart

2023-03-10 Thread yite gu
Hi all,
osd_heartbeat_grace = 20 and osd_pool_default_read_lease_ratio = 0.8 by
default, so a PG will wait up to 16 s (20 x 0.8) in the worst case when an OSD
restarts. This wait time is too long; client I/O stalls of that length are not
acceptable for us. I think lowering osd_pool_default_read_lease_ratio is one
way to address it. Are there any good suggestions for reducing the PG wait time?

Best Regards,
Yite Gu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io