[ceph-users] Getting rid of prometheus messages in /var/log/messages

2019-10-21 Thread Vladimir Brik

Hello

The /var/log/messages files on the machines in our ceph cluster are inundated with 
entries from Prometheus scraping ("GET /metrics HTTP/1.1" 200 - "" 
"Prometheus/2.11.1").


Is it possible to configure Ceph not to send those to syslog? If not, 
can I configure something so that ceph-mgr messages don't go to syslog 
at all and only go to /var/log/ceph/ceph-mgr.log?
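(For context, the knobs I have been looking at are the generic log options below -- untested, 
and assuming the lines reach syslog via the mgr's own logging rather than journald picking 
up stderr, so treat this as a sketch only:)

$ ceph config set mgr log_to_syslog false     # keep the mgr log out of syslog
$ ceph config set mgr log_to_stderr false     # keep it out of the journal as well
$ ceph config set mgr log_to_file true        # still write /var/log/ceph/ceph-mgr.log
$ systemctl restart ceph-mgr.target           # on the active mgr host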


Thanks,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-09 Thread Vladimir Brik
Best I can tell, automatic cache sizing is enabled and all related 
settings are at their default values.
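(For reference, this is how I checked -- run on the OSD's host against its admin socket; 
osd.0 is just an example:)

$ ceph daemon osd.0 config show | egrep 'bluestore_cache_autotune|osd_memory_target|osd_memory_expected_fragmentation'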


Looking through the cache tunables, I came across 
osd_memory_expected_fragmentation, which the docs define as "estimate 
the percent of memory fragmentation". What's the formula to compute the 
actual percentage of memory fragmentation?


Based on /proc/buddyinfo, I suspect that our memory fragmentation is a 
lot worse than the osd_memory_expected_fragmentation default of 0.15. Could 
this be related to many OSDs' RSS far exceeding osd_memory_target?
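(For what it's worth, a rough calculation that could serve as a sanity check -- my own ad-hoc 
metric derived from /proc/buddyinfo, not necessarily what the OSD option refers to: the 
fraction of free memory that is not available in higher-order blocks.)

$ awk '{
    total = 0; high = 0
    for (i = 5; i <= NF; i++) {        # free-block counts start at field 5 (order 0)
        pages = $i * 2 ^ (i - 5)
        total += pages
        if (i - 5 >= 4) high += pages  # order >= 4 is an arbitrary "not fragmented" cutoff
    }
    printf "node %s zone %-9s frag=%.2f\n", $2, $4, (total ? 1 - high / total : 0)
}' /proc/buddyinfo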


So far the high memory consumption hasn't been a problem for us. (I guess 
it's possible that the kernel simply sees no need to reclaim unmapped 
memory until there is actual memory pressure?) It's just a little 
scary not understanding why this started happening when memory usage had 
been so stable before.


Thanks,

Vlad



On 10/9/19 11:51 AM, Gregory Farnum wrote:

On Mon, Oct 7, 2019 at 7:20 AM Vladimir Brik
 wrote:


  > Do you have statistics on the size of the OSDMaps or count of them
  > which were being maintained by the OSDs?
No, I don't think so. How can I find this information?


Hmm I don't know if we directly expose the size of maps. There are
perfcounters which expose the range of maps being kept around but I
don't know their names off-hand.
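(For what it's worth, a concrete way to look at the retained range -- via the admin socket 
on the OSD's host, with osd.0 as a placeholder; the second command is from memory and the 
field names may differ by release:)

$ ceph daemon osd.0 status          # includes "oldest_map" and "newest_map"
$ ceph report | grep -E 'osdmap_(first|last)_committed'   # the mon's view of the same range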

Maybe it's something else involving the bluestore cache or whatever;
if you're not using the newer memory limits I'd switch to those but
otherwise I dunno.
-Greg
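(For anyone following along: by "newer memory limits" I assume the osd_memory_target 
autotuning is meant; a minimal example of pinning it, with 4 GiB as an arbitrary value 
and osd.12 as a placeholder, would be something like:)

$ ceph config set osd osd_memory_target 4294967296   # 4 GiB, example value
$ ceph config get osd.12 osd_memory_target           # confirm what one OSD resolves it to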



Memory consumption started to climb again:
https://icecube.wisc.edu/~vbrik/graph-3.png

Some more info (not sure if relevant or not):

I increased the size of the swap on the servers to 10GB and it's being
completely utilized, even though there is still quite a bit of free memory.

It appears that memory is highly fragmented on NUMA node 0 of all
the servers. Some of the servers have no free page blocks above order 0.
(Memory on NUMA node 1 of the servers appears much less fragmented.)

The servers have 192GB of RAM, 2 NUMA nodes.


Vlad



On 10/4/19 6:09 PM, Gregory Farnum wrote:

Do you have statistics on the size of the OSDMaps or count of them
which were being maintained by the OSDs? I'm not sure why having noout
set would change that if all the nodes were alive, but that's my bet.
-Greg

On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
 wrote:


And, just as unexpectedly, things have returned to normal overnight
https://icecube.wisc.edu/~vbrik/graph-1.png

The change seems to have coincided with the beginning of Rados Gateway
activity (before, it was essentially zero). I can see nothing in the
logs that would explain what happened though.

Vlad



On 10/2/19 3:43 PM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.2 cluster and a few days ago, memory
consumption of our OSDs started to unexpectedly grow on all 5 nodes,
after being stable for about 6 months.

Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very
light (typically <10 iops) during this period, and the number of objects
stayed about the same.

The only unusual occurrence was the reboot of one of the nodes the day
before (a firmware update). For the reboot, I ran "ceph osd set noout",
but forgot to unset it until several days later. Unsetting noout did not
stop the increase in memory consumption.

I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. The resident set size of SSD OSDs is about
3.7GB, while that of HDD OSDs varies from about 5GB to 12GB. I
don't know why there is such a big spread. All HDDs are 10TB, 72-76%
utilized, with 101-104 PGs.

Does anybody know what might be the problem here and how to address or
debug it?


Thanks very much,

Vlad


Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-07 Thread Vladimir Brik

> Do you have statistics on the size of the OSDMaps or count of them
> which were being maintained by the OSDs?
No, I don't think so. How can I find this information?

Memory consumption started to climb again:
https://icecube.wisc.edu/~vbrik/graph-3.png

Some more info (not sure if relevant or not):

I increased the size of the swap on the servers to 10GB and it's being 
completely utilized, even though there is still quite a bit of free memory.


It appears that memory is highly fragmented on NUMA node 0 of all 
the servers. Some of the servers have no free page blocks above order 0. 
(Memory on NUMA node 1 of the servers appears much less fragmented.)


The servers have 192GB of RAM, 2 NUMA nodes.


Vlad



On 10/4/19 6:09 PM, Gregory Farnum wrote:

Do you have statistics on the size of the OSDMaps or count of them
which were being maintained by the OSDs? I'm not sure why having noout
set would change that if all the nodes were alive, but that's my bet.
-Greg

On Thu, Oct 3, 2019 at 7:04 AM Vladimir Brik
 wrote:


And, just as unexpectedly, things have returned to normal overnight
https://icecube.wisc.edu/~vbrik/graph-1.png

The change seems to have coincided with the beginning of Rados Gateway
activity (before, it was essentially zero). I can see nothing in the
logs that would explain what happened though.

Vlad



On 10/2/19 3:43 PM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.2 cluster and a few days ago, memory
consumption of our OSDs started to unexpectedly grow on all 5 nodes,
after being stable for about 6 months.

Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very
light (typically <10 iops) during this period, and the number of objects
stayed about the same.

The only unusual occurrence was the reboot of one of the nodes the day
before (a firmware update). For the reboot, I ran "ceph osd set noout",
but forgot to unset it until several days later. Unsetting noout did not
stop the increase in memory consumption.

I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. The resident set size of SSD OSDs is about
3.7GB, while that of HDD OSDs varies from about 5GB to 12GB. I
don't know why there is such a big spread. All HDDs are 10TB, 72-76%
utilized, with 101-104 PGs.

Does anybody know what might be the problem here and how to address or
debug it?


Thanks very much,

Vlad


Re: [ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-03 Thread Vladimir Brik

And, just as unexpectedly, things have returned to normal overnight
https://icecube.wisc.edu/~vbrik/graph-1.png

The change seems to have coincided with the beginning of Rados Gateway 
activity (before, it was essentially zero). I can see nothing in the 
logs that would explain what happened though.


Vlad



On 10/2/19 3:43 PM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.2 cluster and a few days ago, memory 
consumption of our OSDs started to unexpectedly grow on all 5 nodes, 
after being stable for about 6 months.


Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very 
light (typically <10 iops) during this period, and the number of objects 
stayed about the same.


The only unusual occurrence was the reboot of one of the nodes the day 
before (a firmware update). For the reboot, I ran "ceph osd set noout", 
but forgot to unset it until several days later. Unsetting noout did not 
stop the increase in memory consumption.


I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. The resident set size of SSD OSDs is about 
3.7GB, while that of HDD OSDs varies from about 5GB to 12GB. I 
don't know why there is such a big spread. All HDDs are 10TB, 72-76% 
utilized, with 101-104 PGs.


Does anybody know what might be the problem here and how to address or 
debug it?



Thanks very much,

Vlad


[ceph-users] Unexpected increase in the memory usage of OSDs

2019-10-02 Thread Vladimir Brik

Hello

I am running a Ceph 14.2.2 cluster and a few days ago, memory 
consumption of our OSDs started to unexpectedly grow on all 5 nodes, 
after being stable for about 6 months.


Node memory consumption: https://icecube.wisc.edu/~vbrik/graph.png
Average OSD resident size: https://icecube.wisc.edu/~vbrik/image.png

I am not sure what changed to cause this. Cluster usage has been very 
light (typically <10 iops) during this period, and the number of objects 
stayed about the same.


The only unusual occurrence was the reboot of one of the nodes the day 
before (a firmware update). For the reboot, I ran "ceph osd set noout", 
but forgot to unset it until several days later. Unsetting noout did not 
stop the increase in memory consumption.


I don't see anything unusual in the logs.

Our nodes have SSDs and HDDs. The resident set size of SSD OSDs is about 
3.7GB, while that of HDD OSDs varies from about 5GB to 12GB. I 
don't know why there is such a big spread. All HDDs are 10TB, 72-76% 
utilized, with 101-104 PGs.


Does anybody know what might be the problem here and how to address or 
debug it?
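(In case it helps anyone dig into this: the per-OSD breakdowns I know how to pull, from the 
OSD's host -- osd.12 is just an example -- are below. They show where the allocator and the 
caches think the memory went, though I'm not sure they will explain the SSD/HDD spread.)

$ ceph daemon osd.12 dump_mempools   # bluestore caches, pglog, osdmap, buffer_anon, ...
$ ceph daemon osd.12 heap stats      # tcmalloc: bytes in use vs. freed but not released to the OS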



Thanks very much,

Vlad


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-26 Thread Vladimir Brik

I created a ticket: https://tracker.ceph.com/issues/41511

Note that I think I was mistaken when I said that sometimes the problem 
goes away on its own. I've looked back through our monitoring and it 
looks like when the problem did go away, it was because either the 
machine was rebooted or the radosgw service was restarted.



Vlad



On 8/23/19 10:17 AM, Eric Ivancich wrote:

Good morning, Vladimir,

Please create a tracker for this 
(https://tracker.ceph.com/projects/rgw/issues/new) and include the link 
to it in an email reply. And if you can include any more potentially 
relevant details, please do so. I’ll add my initial analysis to it.


But the threads do seem to be stuck, at least for a while, in 
get_obj_data::flush despite a lack of traffic. And sometimes it 
self-resolves, so it’s not a true “infinite loop”.


Thank you,

Eric

On Aug 22, 2019, at 9:12 PM, Eric Ivancich <ivanc...@redhat.com> wrote:


Thank you for providing the profiling data, Vladimir. There are 5078 
threads and most of them are waiting. Here is a list of the deepest 
call of each thread with duplicates removed.


            + 100.00% epoll_wait
                          + 100.00% get_obj_data::flush(rgw::OwningList&&)
            + 100.00% poll
        + 100.00% poll
      + 100.00% poll
        + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
      + 100.00% pthread_cond_timedwait@@GLIBC_2.3.2
        + 100.00% pthread_cond_wait@@GLIBC_2.3.2
      + 100.00% pthread_cond_wait@@GLIBC_2.3.2
      + 100.00% read
                            + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_


The only interesting ones are the second and last:

* get_obj_data::flush(rgw::OwningList&&)
* _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_


They are essentially part of the same call stack that results from 
processing a GetObj request, and five threads are in this call stack 
(the only difference is whether or not they include the call into the boost 
intrusive list). Here’s the full call stack of those threads:


+ 100.00% clone
  + 100.00% start_thread
    + 100.00% worker_thread
      + 100.00% process_new_connection
        + 100.00% handle_request
          + 100.00% RGWCivetWebFrontend::process(mg_connection*)
            + 100.00% process_request(RGWRados*, RGWREST*, RGWRequest*, std::string const&, rgw::auth::StrategyRegistry const&, RGWRestfulIO*, OpsLogSocket*, optional_yield, rgw::dmclock::Scheduler*, int*)
              + 100.00% rgw_process_authenticated(RGWHandler_REST*, RGWOp*&, RGWRequest*, req_state*, bool)
                + 100.00% RGWGetObj::execute()
                  + 100.00% RGWRados::Object::Read::iterate(long, long, RGWGetDataCB*)
                    + 100.00% RGWRados::iterate_obj(RGWObjectCtx&, RGWBucketInfo const&, rgw_obj const&, long, long, unsigned long, int (*)(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*), void*)
                      + 100.00% _get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
                        + 100.00% RGWRados::get_obj_iterate_cb(rgw_raw_obj const&, long, long, long, bool, RGWObjState*, void*)
                          + 100.00% get_obj_data::flush(rgw::OwningList&&)
                            + 100.00% _ZN5boost9intrusive9list_implINS0_8bhtraitsIN3rgw14AioResultEntryENS0_16list_node_traitsIPvEELNS0_14link_mode_typeE1ENS0_7dft_tagELj1EEEmLb1EvE4sortIZN12get_obj_data5flushEONS3_10OwningListIS4_JUlRKT_RKT0_E_EEvSH_


So this isn’t background processing but request processing. I’m not 
clear why these requests are consuming so much CPU for so long.


From your initial message:
I am running a Ceph 14.2.1 cluster with 3 rados gateways. 
Periodically, radosgw process on those machines starts consuming 100% 
of 5 CPU cores for days at a time, even though the machine is not 
being used for data transfers (nothing in radosgw logs, couple of 
KB/s of network).


This situation can affect any number of our rados gateways, lasts 
from a few hours to a few days, and stops either when the radosgw process 
is restarted or on its own.


I’m going to check with others who’re more familiar with this code path.


Begin forwarded message:

From: Vladimir Brik <vladimir.b...@icecube.wisc.edu>
Subject: Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred
Date: August 21, 2019 at 4:47:01 PM EDT
To: "J. Eric Ivancich" <ivanc...@redhat.com>, Mark Nelson <mnel...@redhat.com>, ceph-users@lists.ceph.com

Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Vladimir Brik

> Are you running multisite?
No

> Do you have dynamic bucket resharding turned on?
Yes. "radosgw-admin reshard list" prints "[]"

> Are you using lifecycle?
I am not sure. How can I check? "radosgw-admin lc list" says "[]"

> And just to be clear -- sometimes all 3 of your rados gateways are
> simultaneously in this state?
Multiple, but I have not seen all 3 being in this state simultaneously. 
Currently one gateway has 1 thread using 100% of CPU, and another has 5 
threads each using 100% CPU.


Here are the fruits of my attempts to capture the call graph using perf 
and gdbpmp:

https://icecube.wisc.edu/~vbrik/perf.data
https://icecube.wisc.edu/~vbrik/gdbpmp.data

These are the commands that I ran and their outputs (note I couldn't get 
perf not to generate the warning):

rgw-3 gdbpmp # ./gdbpmp.py -n 100 -p 73688 -o gdbpmp.data
Attaching to process 73688...Done.
Gathering Samples

Profiling complete with 100 samples.

rgw-3 ~ # perf record --call-graph fp -p 73688 -- sleep 10
[ perf record: Woken up 54 times to write data ]
Warning:
Processed 574207 events and lost 4 chunks!
Check IO/CPU overload!
[ perf record: Captured and wrote 58.866 MB perf.data (233750 samples) ]





Vlad



On 8/21/19 11:16 AM, J. Eric Ivancich wrote:

On 8/21/19 10:22 AM, Mark Nelson wrote:

Hi Vladimir,


On 8/21/19 8:54 AM, Vladimir Brik wrote:

Hello



[much elided]


You might want to try grabbing a callgraph from perf instead of just
running perf top or using my wallclock profiler to see if you can drill
down and find out where in that method it's spending the most time.


I agree with Mark -- a call graph would be very helpful in tracking down
what's happening.

There are background tasks that run. Are you running multisite? Do you
have dynamic bucket resharding turned on? Are you using lifecycle? And
garbage collection is another background task.

And just to be clear -- sometimes all 3 of your rados gateways are
simultaneously in this state?

But the call graph would be incredibly helpful.

Thank you,

Eric




[ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space

2019-08-21 Thread Vladimir Brik

Hello

After increasing the number of PGs in a pool, ceph status is reporting 
"Degraded data redundancy (low space): 1 pg backfill_toofull", but I 
don't understand why, because all OSDs seem to have enough space.


ceph health detail says:
pg 40.155 is active+remapped+backfill_toofull, acting [20,57,79,85]

$ ceph pg map 40.155
osdmap e3952 pg 40.155 (40.155) -> up [20,57,66,85] acting [20,57,79,85]

So I guess Ceph wants to move 40.155 from 66 to 79 (or the other way 
around?). According to "osd df", OSD 66's utilization is 71.90% and OSD 
79's utilization is 58.45%. The OSD with the least free space in the cluster 
is 81.23% full, and it's not any of the ones above.


OSD backfillfull_ratio is 90% (is there a better way to determine this?):
$ ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.7

Does anybody know why a PG could be in the backfill_toofull state if no 
OSD is in the backfillfull state?
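(For completeness, these are the checks I know of; 40.155 and the OSD ids come from the 
output above, and the query field names may differ a bit between releases:)

$ ceph pg 40.155 query | grep -i -A 2 backfill    # which peer reported the toofull condition
$ ceph osd df | egrep '^ *(20|57|66|79|85) '      # utilization of the up/acting OSDs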



Vlad


Re: [ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Vladimir Brik
Correction: the number of threads stuck using 100% of a CPU core varies 
from 1 to 5 (it's not always 5)


Vlad

On 8/21/19 8:54 AM, Vladimir Brik wrote:

Hello

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, 
radosgw process on those machines starts consuming 100% of 5 CPU cores 
for days at a time, even though the machine is not being used for data 
transfers (nothing in radosgw logs, couple of KB/s of network).


This situation can affect any number of our rados gateways, lasts from 
a few hours to a few days, and stops either when the radosgw process is 
restarted or on its own.


Does anybody have an idea what might be going on or how to debug it? I 
don't see anything obvious in the logs. Perf top is saying that the CPU is 
consumed by the radosgw shared object in the symbol get_obj_data::flush, which, 
if I interpret things correctly, is called from a symbol with a long 
name that contains the substring "boost9intrusive9list_impl".


This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s 
ssl_certificate=/etc/ceph/rgw.crt 
error_log_file=/var/log/ceph/civetweb.error.log


(error log file doesn't exist)


Thanks,

Vlad


[ceph-users] radosgw pegging down 5 CPU cores when no data is being transferred

2019-08-21 Thread Vladimir Brik

Hello

I am running a Ceph 14.2.1 cluster with 3 rados gateways. Periodically, 
radosgw process on those machines starts consuming 100% of 5 CPU cores 
for days at a time, even though the machine is not being used for data 
transfers (nothing in radosgw logs, couple of KB/s of network).


This situation can affect any number of our rados gateways, lasts from 
a few hours to a few days, and stops either when the radosgw process is 
restarted or on its own.


Does anybody have an idea what might be going on or how to debug it? I 
don't see anything obvious in the logs. Perf top is saying that the CPU is 
consumed by the radosgw shared object in the symbol get_obj_data::flush, which, 
if I interpret things correctly, is called from a symbol with a long 
name that contains the substring "boost9intrusive9list_impl".


This is our configuration:
rgw_frontends = civetweb num_threads=5000 port=443s 
ssl_certificate=/etc/ceph/rgw.crt 
error_log_file=/var/log/ceph/civetweb.error.log


(error log file doesn't exist)


Thanks,

Vlad


[ceph-users] radosgw daemons constantly reading default.rgw.log pool

2019-05-03 Thread Vladimir Brik

Hello

I have set up rados gateway using "ceph-deploy rgw create" (default 
pools, 3 machines acting as gateways) on Ceph 13.2.5.


For over 2 weeks now, the three rados gateways have been generating 
a constant ~30MB/s and ~4K ops/s of read I/O on default.rgw.log, even though 
nothing is using the rados gateways.


Nothing in the logs except occasional
7fbce9329700  0 RGWReshardLock::lock failed to acquire lock on 
reshard.00 ret=-16
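(For reference, the things I know how to poke at to narrow this down -- pool name is the 
default one created by ceph-deploy:)

$ rados -p default.rgw.log ls | head -20       # which objects are being read
$ radosgw-admin reshard list                   # pending dynamic resharding work
$ radosgw-admin gc list --include-all | head   # pending garbage collection
$ radosgw-admin sync status                    # should be idle on a single-site setup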


Anybody know what might be going on?


Thanks,

Vlad


[ceph-users] Restricting access to RadosGW/S3 buckets

2019-05-02 Thread Vladimir Brik

Hello

I am trying to figure out a way to restrict access to S3 buckets. Is it 
possible to create a RadosGW user that can only access specific bucket(s)?
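(One approach I have been looking at -- untested on our setup, so treat it as a sketch -- is 
an S3 bucket policy that grants only a specific RGW user access; "testbucket" and "datauser" 
are placeholders:)

$ cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam:::user/datauser"]},
    "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
    "Resource": ["arn:aws:s3:::testbucket", "arn:aws:s3:::testbucket/*"]
  }]
}
EOF
$ s3cmd setpolicy policy.json s3://testbucket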



Thanks,

Vlad


[ceph-users] Bluestore nvme DB/WAL size

2018-12-20 Thread Vladimir Brik

Hello

I am considering using logical volumes of an NVMe drive as DB or WAL 
devices for OSDs on spinning disks.


The documentation recommends against DB devices smaller than 4% of the slow 
disk's size. Our servers have 16x 10TB HDDs and a single 1.5TB NVMe, so 
dividing it equally will result in each OSD getting a ~90GB NVMe DB 
volume, which is a lot less than 4%. Will this cause problems down the road?
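(For concreteness, the layout I have in mind looks roughly like this, repeated for each of 
the 16 HDDs -- device and VG/LV names are made up, and the ceph-volume invocation is from 
memory:)

$ vgcreate nvme-db /dev/nvme0n1
$ lvcreate -L 90G -n db-sda nvme-db
$ ceph-volume lvm create --bluestore --data /dev/sda --block.db nvme-db/db-sda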



Thanks

Vlad


[ceph-users] Scrub behavior

2018-12-20 Thread Vladimir Brik

Hello

I am experimenting with how Ceph (13.2.2) deals with on-disk data 
corruption, and I've run into some unexpected behavior. I am wondering 
if somebody could comment on whether I understand things correctly.


In my tests I would dd /dev/urandom onto an OSD's disk and see what 
would happen. I don't fill up the entire disk (that causes the OSD to crash), 
and I choose an OSD that is pretty full.


It looks like regular scrubs don't detect any problems at all, and I 
actually don't see any disk activity. So I guess only the stuff that is 
in memory is getting scrubbed?


When I initiate a deep scrub of an OSD, it looks like only PGs for which 
that OSD is the primary are checked. Is this correct? If so, how is 
corruption of other PGs on that OSD detected?
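(For reference, the commands I have been using -- osd.5 and pg 3.7 are just examples; 
"ceph pg ls-by-osd" lists every PG with a copy on an OSD, primary or not:)

$ ceph osd deep-scrub 5   # in my testing this only seems to reach PGs where osd.5 is primary
$ ceph pg ls-by-osd 5     # all PGs that keep a replica/shard on osd.5
$ ceph pg deep-scrub 3.7  # deep-scrub one specific PG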



Thanks,

Vlad


Re: [ceph-users] What could cause mon_osd_full_ratio to be exceeded?

2018-11-26 Thread Vladimir Brik

> Why didn't it stop at mon_osd_full_ratio (90%)
Should be 95%

Vlad



On 11/26/18 9:28 AM, Vladimir Brik wrote:

Hello

I am doing some Ceph testing on a near-full cluster, and I noticed that, 
after I brought down a node, some OSDs' utilization reached 
osd_failsafe_full_ratio (97%). Why didn't it stop at mon_osd_full_ratio 
(90%) if mon_osd_backfillfull_ratio is 90%?



Thanks,

Vlad


[ceph-users] What could cause mon_osd_full_ratio to be exceeded?

2018-11-26 Thread Vladimir Brik

Hello

I am doing some Ceph testing on a near-full cluster, and I noticed that, 
after I brought down a node, some OSDs' utilization reached 
osd_failsafe_full_ratio (97%). Why didn't it stop at mon_osd_full_ratio 
(90%) if mon_osd_backfillfull_ratio is 90%?



Thanks,

Vlad


[ceph-users] How many PGs per OSD is too many?

2018-11-14 Thread Vladimir Brik

Hello

I have a ceph 13.2.2 cluster comprised of 5 hosts, each with 16 HDDs and 
4 SSDs. HDD OSDs have about 50 PGs each, while SSD OSDs have about 400 
PGs each (a lot more pools use SSDs than HDDs). Servers are fairly 
powerful: 48 HT cores, 192GB of RAM, and 2x25Gbps Ethernet.


The impression I got from the docs is that having more than 200 PGs per 
OSD is not a good thing, but the justifications were vague (no concrete 
numbers): increased peering time, increased resource consumption, 
and possibly decreased recovery performance. None of these appeared to 
be a significant problem in my testing, but the tests were very basic 
and done on a pretty empty cluster under minimal load, so I worry I'll 
run into trouble down the road.


Here are the questions I have:
- In practice, is it a big deal that some OSDs have ~400 PGs?
- In what situations would our cluster most likely fare significantly 
better if I went through the trouble of re-creating pools so that no OSD 
would have more than, say, ~100 PGs?
- What performance metrics could I monitor to detect possible issues due 
to having too many PGs?
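(For anyone wanting to check the same numbers on their cluster, a couple of commands I use; 
osd.0 is a placeholder, and the second counter is just one example of a peering-related 
metric:)

$ ceph osd df tree                                    # PGS column = placement groups per OSD
$ ceph daemon osd.0 perf dump | grep -A 3 peering_latency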


Thanks,

Vlad


[ceph-users] Erasure coding with more chunks than servers

2018-10-04 Thread Vladimir Brik
Hello

I have a 5-server cluster and I am wondering if it's possible to create a
pool that uses a k=5 m=2 erasure code. In my experiments, I ended up with
pools whose PGs are stuck in the creating+incomplete state even when I
created the erasure code profile with --crush-failure-domain=osd.

Assuming that what I want to do is possible, will CRUSH distribute
chunks evenly among servers, so that if I need to bring one server down
(e.g. reboot), clients' ability to write or read any object would not be
disrupted? (I guess something would need to ensure that no server holds
more than two chunks of an object)
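(Roughly what I ran, reconstructed from my notes -- the profile and pool names are
placeholders:)

$ ceph osd erasure-code-profile set ec-5-2 k=5 m=2 crush-failure-domain=osd
$ ceph osd pool create ec-test 256 256 erasure ec-5-2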

Thanks,

Vlad


[ceph-users] NVMe SSD not assigned "nvme" device class

2018-10-01 Thread Vladimir Brik
Hello,

It looks like Ceph (13.2.2) assigns device class "ssd" to our Samsung
PM1725a NVMe SSDs instead of "nvme". Is that a bug or is the "nvme"
class reserved for a different kind of device?
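(For what it's worth, the manual override I am aware of -- osd.42 is a placeholder:)

$ ceph osd crush rm-device-class osd.42
$ ceph osd crush set-device-class nvme osd.42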


Vlad


Re: [ceph-users] Problems after increasing number of PGs in a pool

2018-10-01 Thread Vladimir Brik
Thanks to everybody who responded. The problem was, indeed, that I hit
the limit on the number of PGs per SSD OSD when I increased the number
of PGs in a pool.

One question though: should I have received a warning that some OSDs are
close to their maximum PG limit? A while back, in a Luminous test pool, I
remember seeing something like "too many PGs per OSD" in some of my
testing, but not this time (perhaps because this time I hit the limit
during the resizing operation). Where might such a warning be recorded if
not in "ceph status"?

Thanks,

Vlad



On 09/28/2018 01:04 PM, Paul Emmerich wrote:
> Judging by the name, I guess the pool is mapped to SSDs only, and you only have 20 
> SSDs.
> So you should have about ~2000 effective PGs taking replication into account.
> 
> Your pool has ~10k effective PGs with k+m=5 and you seem to have 5
> more pools
> 
> Check "ceph osd df tree" to see how many PGs per OSD you got.
> 
> Try increasing these two options to "fix" it.
> 
> mon max pg per osd
> osd max pg per osd hard ratio
> 
> 
> Paul
> On Fri, Sep 28, 2018 at 18:05, Vladimir Brik
> wrote:
>>
>> Hello
>>
>> I've attempted to increase the number of placement groups of the pools
>> in our test cluster and now ceph status (below) is reporting problems. I
>> am not sure what is going on or how to fix this. Troubleshooting
>> scenarios in the docs don't seem to quite match what I am seeing.
>>
>> I have no idea how to begin to debug this. I see OSDs listed in
>> "blocked_by" of pg dump, but don't know how to interpret that. Could
>> somebody assist please?
>>
>> I attached output of "ceph pg dump_stuck -f json-pretty" just in case.
>>
>> The cluster consists of 5 hosts, each with 16 HDDs and 4 SSDs. I am
>> running 13.2.2.
>>
>> This is the affected pool:
>> pool 6 'fs-data-ec-ssd' erasure size 5 min_size 4 crush_rule 6
>> object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 2493 lfor
>> 0/2491 flags hashpspool,ec_overwrites stripe_width 12288 application cephfs
>>
>>
>> Thanks,
>>
>> Vlad
>>
>>
>> ceph health
>>
>>   cluster:
>> id: 47caa1df-42be-444d-b603-02cad2a7fdd3
>> health: HEALTH_WARN
>> Reduced data availability: 155 pgs inactive, 47 pgs peering,
>> 64 pgs stale
>> Degraded data redundancy: 321039/114913606 objects degraded
>> (0.279%), 108 pgs degraded, 108 pgs undersized
>>
>>   services:
>> mon: 5 daemons, quorum ceph-1,ceph-2,ceph-3,ceph-4,ceph-5
>> mgr: ceph-3(active), standbys: ceph-2, ceph-5, ceph-1, ceph-4
>> mds: cephfs-1/1/1 up  {0=ceph-5=up:active}, 4 up:standby
>> osd: 100 osds: 100 up, 100 in; 165 remapped pgs
>>
>>   data:
>> pools:   6 pools, 5120 pgs
>> objects: 22.98 M objects, 88 TiB
>> usage:   154 TiB used, 574 TiB / 727 TiB avail
>> pgs: 3.027% pgs not active
>>  321039/114913606 objects degraded (0.279%)
>>  4903 active+clean
>>  105  activating+undersized+degraded+remapped
>>  61   stale+active+clean
>>  47   remapped+peering
>>  3stale+activating+undersized+degraded+remapped
>>  1active+clean+scrubbing+deep


[ceph-users] Problems after increasing number of PGs in a pool

2018-09-28 Thread Vladimir Brik
Hello

I've attempted to increase the number of placement groups of the pools
in our test cluster and now ceph status (below) is reporting problems. I
am not sure what is going on or how to fix this. Troubleshooting
scenarios in the docs don't seem to quite match what I am seeing.

I have no idea how to begin to debug this. I see OSDs listed in
"blocked_by" of pg dump, but don't know how to interpret that. Could
somebody assist please?

I attached output of "ceph pg dump_stuck -f json-pretty" just in case.

The cluster consists of 5 hosts, each with 16 HDDs and 4 SSDs. I am
running 13.2.2.

This is the affected pool:
pool 6 'fs-data-ec-ssd' erasure size 5 min_size 4 crush_rule 6
object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 2493 lfor
0/2491 flags hashpspool,ec_overwrites stripe_width 12288 application cephfs


Thanks,

Vlad


ceph health

  cluster:
id: 47caa1df-42be-444d-b603-02cad2a7fdd3
health: HEALTH_WARN
Reduced data availability: 155 pgs inactive, 47 pgs peering,
64 pgs stale
Degraded data redundancy: 321039/114913606 objects degraded
(0.279%), 108 pgs degraded, 108 pgs undersized

  services:
mon: 5 daemons, quorum ceph-1,ceph-2,ceph-3,ceph-4,ceph-5
mgr: ceph-3(active), standbys: ceph-2, ceph-5, ceph-1, ceph-4
mds: cephfs-1/1/1 up  {0=ceph-5=up:active}, 4 up:standby
osd: 100 osds: 100 up, 100 in; 165 remapped pgs

  data:
pools:   6 pools, 5120 pgs
objects: 22.98 M objects, 88 TiB
usage:   154 TiB used, 574 TiB / 727 TiB avail
pgs: 3.027% pgs not active
 321039/114913606 objects degraded (0.279%)
 4903 active+clean
 105  activating+undersized+degraded+remapped
 61   stale+active+clean
 47   remapped+peering
 3stale+activating+undersized+degraded+remapped
 1active+clean+scrubbing+deep


[Attachment: stuck.json.gz (application/gzip)]


[ceph-users] CephFS on a mixture of SSDs and HDDs

2018-09-06 Thread Vladimir Brik
Hello

I am setting up a new ceph cluster (probably Mimic) made up of servers
that have a mixture of solid state and spinning disks. I'd like CephFS
to store data of some of our applications only on SSDs, and store data
of other applications only on HDDs.

Is there a way of doing this without running multiple filesystems within
the same cluster? (E.g. something like configuring CephFS to store data
of some directory trees in an SSD pool, and storing others in an HDD pool)

If not, can anybody comment on their experience running multiple file
systems in a single cluster? Are there any known issues (I am only aware
of some issues related to security)?

Does anybody know if support/testing of multiple filesystems in a
cluster is something actively being worked on, and if it might stop being
"experimental" in the near future?


Thanks very much,

Vlad