[ceph-users] Re: Monitoring

2024-06-19 Thread adam.ther

Hello,

On this topic, I was trying to use Zabbix for alerting. Is there a way 
to make the API Key used in the dashboard not expire after a period?


Regards,

Adam
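
[Editor's note] If the dashboard's REST token can't be made non-expiring, an alternative is to re-authenticate automatically before each expiry instead of relying on a static key. A rough sketch of that pattern; the idea of fetching a fresh JWT from the dashboard's auth endpoint is an assumption about your setup, and the fetch itself is left as an injected callable:

```python
import time

class TokenCache:
    """Re-fetch an API token before it expires instead of using a static key."""

    def __init__(self, fetch_token, ttl_seconds=28800):
        # fetch_token: callable performing the actual login, e.g. a POST to
        # the dashboard's auth endpoint (assumed) returning a JWT string.
        self._fetch = fetch_token
        self._ttl = ttl_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self, now=None):
        # Refresh lazily: any caller always receives a token younger than ttl.
        now = time.time() if now is None else now
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token

# Hypothetical usage with a stand-in login function:
demo = TokenCache(lambda: "demo-token", ttl_seconds=3600)
print(demo.get(now=0.0))
```

An alerting job built this way never sees an expired key, regardless of the dashboard's TTL setting.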


On 6/18/24 09:12, Anthony D'Atri wrote:

I don't, I have the fleetwide monitoring / observability systems query 
ceph_exporter and a fleetwide node_exporter instance on 9101.  ymmv.



On Jun 18, 2024, at 09:25, Alex  wrote:

Good morning.

Our RH Ceph comes with Prometheus monitoring "built in". How does everyone
integrate that into their existing monitoring infrastructure so Ceph and
other servers are all under one dashboard?

Thanks,
Alex.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: Monitoring

2024-06-18 Thread Alex
But how do you combine it with Prometheus node exporter built into Ceph?


[ceph-users] Re: Monitoring

2024-06-18 Thread Alex
Thanks


[ceph-users] Re: Monitoring

2024-06-18 Thread Anthony D'Atri
Easier to ignore any node_exporter that Ceph (or k8s) deploys and just deploy 
your own on a different port across your whole fleet.

> On Jun 18, 2024, at 13:56, Alex  wrote:
> 
> But how do you combine it with Prometheus node exporter built into Ceph?


[ceph-users] Re: Monitoring

2024-06-18 Thread John Jasen
Kinda what he said, but I use Zabbix.

https://docs.ceph.com/en/latest/mgr/zabbix/

On Tue, Jun 18, 2024 at 11:53 AM Anthony D'Atri 
wrote:

> I don't, I have the fleetwide monitoring / observability systems query
> ceph_exporter and a fleetwide node_exporter instance on 9101.  ymmv.
>
>
> > On Jun 18, 2024, at 09:25, Alex  wrote:
> >
> > Good morning.
> >
> > Our RH Ceph comes with Prometheus monitoring "built in". How does everyone
> > integrate that into their existing monitoring infrastructure so Ceph and
> > other servers are all under one dashboard?
> >
> > Thanks,
> > Alex.


[ceph-users] Re: Monitoring

2024-06-18 Thread Alex
Alright, thanks.


[ceph-users] Re: Monitoring

2024-06-18 Thread Anthony D'Atri
I don't, I have the fleetwide monitoring / observability systems query 
ceph_exporter and a fleetwide node_exporter instance on 9101.  ymmv.


> On Jun 18, 2024, at 09:25, Alex  wrote:
> 
> Good morning.
> 
> Our RH Ceph comes with Prometheus monitoring "built in". How does everyone
> integrate that into their existing monitoring infrastructure so Ceph and
> other servers are all under one dashboard?
> 
> Thanks,
> Alex.


[ceph-users] Re: Monitoring Ceph Bucket and overall ceph cluster remaining space

2024-03-06 Thread Michael Worsham
SW is SolarWinds (www.solarwinds.com), a network and application monitoring and 
alerting platform.

It's not very open source at all, but it's what we use for monitoring all of 
our physical and virtual servers, network switches, SAN and NAS devices, and 
anything else with a network card in it.

From: Konstantin Shalygin 
Sent: Wednesday, March 6, 2024 1:39:43 AM
To: Michael Worsham 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: Monitoring Ceph Bucket and overall ceph cluster 
remaining space

This is an external email. Please take care when clicking links or opening 
attachments. When in doubt, check with the Help Desk or Security.


Hi,

I'm not aware of what SW is, but if this software works with the Prometheus 
metrics format, why not. In any case, the exporters are open source; you can 
modify the existing code for your environment


k

Sent from my iPhone

> On 6 Mar 2024, at 07:58, Michael Worsham  wrote:
>
> This looks interesting, but instead of Prometheus, could the data be exported 
> for SolarWinds?
>
> The intent is to have SW watch the available storage space allocated and then 
> to alert when a certain threshold is reached (75% remaining for a warning; 
> 95% remaining for a critical).

This message and its attachments are from Data Dimensions and are intended only 
for the use of the individual or entity to which it is addressed, and may 
contain information that is privileged, confidential, and exempt from 
disclosure under applicable law. If the reader of this message is not the 
intended recipient, or the employee or agent responsible for delivering the 
message to the intended recipient, you are hereby notified that any 
dissemination, distribution, or copying of this communication is strictly 
prohibited. If you have received this communication in error, please notify the 
sender immediately and permanently delete the original email and destroy any 
copies or printouts of this email as well as any attachments.


[ceph-users] Re: Monitoring Ceph Bucket and overall ceph cluster remaining space

2024-03-05 Thread Konstantin Shalygin
Hi,

I'm not aware of what SW is, but if this software works with the Prometheus 
metrics format, why not. In any case, the exporters are open source; you can 
modify the existing code for your environment


k

Sent from my iPhone

> On 6 Mar 2024, at 07:58, Michael Worsham  wrote:
> 
> This looks interesting, but instead of Prometheus, could the data be exported 
> for SolarWinds?
> 
> The intent is to have SW watch the available storage space allocated and then 
> to alert when a certain threshold is reached (75% remaining for a warning; 
> 95% remaining for a critical).


[ceph-users] Re: Monitoring Ceph Bucket and overall ceph cluster remaining space

2024-03-05 Thread Michael Worsham
This looks interesting, but instead of Prometheus, could the data be exported 
for SolarWinds?

The intent is to have SW watch the available storage space allocated and then 
to alert when a certain threshold is reached (75% remaining for a warning; 95% 
remaining for a critical).

-- Michael

From: Konstantin Shalygin 
Sent: Tuesday, March 5, 2024 11:17:10 PM
To: Michael Worsham 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Monitoring Ceph Bucket and overall ceph cluster 
remaining space


Hi,

For RGW usage statistics you can use radosgw_usage_exporter [1]


k
[1] https://github.com/blemmenes/radosgw_usage_exporter

Sent from my iPhone

On 6 Mar 2024, at 00:21, Michael Worsham  wrote:

Is there an easy way to poll the ceph cluster buckets in a way to see how much 
space is remaining? And is it possible to see how much ceph cluster space is 
remaining overall? I am trying to extract the data from our  Ceph cluster and 
put it into a format that our SolarWinds can understand in whole number 
integers, so we can monitor bucket allocated space and overall cluster space in 
the cluster as a whole.

Via Canonical support, they said I can do something like "sudo ceph df -f 
json-pretty" to pull the information, but what is it I need to look at from the 
output (see below) to display over to SolarWinds?

{
"stats": {
"total_bytes": 960027263238144,
"total_avail_bytes": 403965214187520,
"total_used_bytes": 556062049050624,
"total_used_raw_bytes": 556062049050624,
"total_used_raw_ratio": 0.57921481132507324,
"num_osds": 48,
"num_per_pool_osds": 48,
"num_per_pool_omap_osds": 48
},
"stats_by_class": {
"ssd": {
"total_bytes": 960027263238144,
"total_avail_bytes": 403965214187520,
"total_used_bytes": 556062049050624,
"total_used_raw_bytes": 556062049050624,
"total_used_raw_ratio": 0.57921481132507324
}
},

And a couple of data pools...
{
"name": "default.rgw.jv-va-pool.data",
"id": 65,
"stats": {
"stored": 4343441915904,
"objects": 17466616,
"kb_used": 12774490932,
"bytes_used": 13081078714368,
"percent_used": 0.053900588303804398,
"max_avail": 76535973281792
}
},
{
"name": "default.rgw.jv-va-pool.index",
"id": 66,
"stats": {
"stored": 42533675008,
"objects": 401,
"kb_used": 124610380,
"bytes_used": 127601028363,
"percent_used": 0.00055542576592415571,
"max_avail": 76535973281792
}
},
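
[Editor's note] For the SolarWinds side, whole-number integers can be computed from exactly the fields shown above (total_bytes, total_avail_bytes, and per-pool percent_used). A sketch, reading the 75%/95% thresholds as used-space levels, which is an assumption about the intent:

```python
def cluster_summary(df):
    """Reduce `ceph df -f json` output to whole-number integers for alerting.

    df is the already-parsed JSON dict, e.g. from
    json.loads(subprocess.check_output(["ceph", "df", "-f", "json"])).
    """
    stats = df["stats"]
    used_pct = round(100 * (1 - stats["total_avail_bytes"] / stats["total_bytes"]))
    pools = {
        p["name"]: round(100 * p["stats"]["percent_used"])
        for p in df.get("pools", [])
    }
    # 75% used -> warning, 95% used -> critical (assumed reading of the thread).
    level = "critical" if used_pct >= 95 else "warning" if used_pct >= 75 else "ok"
    return {"cluster_used_pct": used_pct, "level": level, "pool_used_pct": pools}

# With the numbers from the output above:
sample = {
    "stats": {"total_bytes": 960027263238144, "total_avail_bytes": 403965214187520},
    "pools": [{"name": "default.rgw.jv-va-pool.data",
               "stats": {"percent_used": 0.053900588303804398}}],
}
print(cluster_summary(sample))
```

The resulting integers (58% used for the cluster above, 5% for the data pool) are what a SolarWinds poller would consume.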


[ceph-users] Re: Monitoring Ceph Bucket and overall ceph cluster remaining space

2024-03-05 Thread Konstantin Shalygin
Hi, 

For RGW usage statistics you can use radosgw_usage_exporter [1]


k
[1] https://github.com/blemmenes/radosgw_usage_exporter

Sent from my iPhone

> On 6 Mar 2024, at 00:21, Michael Worsham  wrote:
> Is there an easy way to poll the ceph cluster buckets in a way to see how 
> much space is remaining? And is it possible to see how much ceph cluster 
> space is remaining overall? I am trying to extract the data from our  Ceph 
> cluster and put it into a format that our SolarWinds can understand in whole 
> number integers, so we can monitor bucket allocated space and overall cluster 
> space in the cluster as a whole.
> 
> Via Canonical support, they said I can do something like "sudo ceph df -f 
> json-pretty" to pull the information, but what is it I need to look at from 
> the output (see below) to display over to SolarWinds?
> 
> {
> "stats": {
> "total_bytes": 960027263238144,
> "total_avail_bytes": 403965214187520,
> "total_used_bytes": 556062049050624,
> "total_used_raw_bytes": 556062049050624,
> "total_used_raw_ratio": 0.57921481132507324,
> "num_osds": 48,
> "num_per_pool_osds": 48,
> "num_per_pool_omap_osds": 48
> },
> "stats_by_class": {
> "ssd": {
> "total_bytes": 960027263238144,
> "total_avail_bytes": 403965214187520,
> "total_used_bytes": 556062049050624,
> "total_used_raw_bytes": 556062049050624,
> "total_used_raw_ratio": 0.57921481132507324
> }
> },
> 
> And a couple of data pools...
> {
> "name": "default.rgw.jv-va-pool.data",
> "id": 65,
> "stats": {
> "stored": 4343441915904,
> "objects": 17466616,
> "kb_used": 12774490932,
> "bytes_used": 13081078714368,
> "percent_used": 0.053900588303804398,
> "max_avail": 76535973281792
> }
> },
> {
> "name": "default.rgw.jv-va-pool.index",
> "id": 66,
> "stats": {
> "stored": 42533675008,
> "objects": 401,
> "kb_used": 124610380,
> "bytes_used": 127601028363,
> "percent_used": 0.00055542576592415571,
> "max_avail": 76535973281792
> }
> },


[ceph-users] Re: monitoring apply_latency / commit_latency ?

2023-04-02 Thread Konstantin Shalygin

Hi,

> On 2 Apr 2023, at 23:14, Matthias Ferdinand  wrote:
> 
> I understand that grafana graphs are generated from prometheus metrics.
> I just wanted to know which OSD daemon-perf values feed these prometheus
> metrics (or if they are generated in some other way).

Yes, these perf metrics are generated in some way 🙂
You can consult the ceph-mgr prometheus module source code [1]


[1] https://github.com/ceph/ceph/blob/main/src/pybind/mgr/prometheus/module.py#L1656-L1676
k
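
[Editor's note] For anyone graphing these outside Grafana: the mgr module exposes such running averages as paired _sum/_count series, so an average can be recovered from a scraped exposition page. A sketch; the ceph_osd_op_w_latency naming and the ceph_daemon label are assumptions to verify against your own endpoint:

```python
import re

def avg_latency(exposition_text, metric="ceph_osd_op_w_latency"):
    """Compute per-daemon average latency from a Prometheus exposition page.

    Assumes the exporter publishes <metric>_sum (total seconds) and
    <metric>_count (ops) per ceph_daemon label.
    """
    pat = re.compile(
        r'^%s_(sum|count)\{ceph_daemon="([^"]+)"\}\s+([0-9.eE+-]+)' % re.escape(metric),
        re.M)
    acc = {}
    for kind, daemon, value in pat.findall(exposition_text):
        acc.setdefault(daemon, {})[kind] = float(value)
    # seconds per op = cumulative seconds / cumulative op count
    return {d: v["sum"] / v["count"]
            for d, v in acc.items() if "sum" in v and v.get("count")}

page = '''ceph_osd_op_w_latency_sum{ceph_daemon="osd.0"} 2.0
ceph_osd_op_w_latency_count{ceph_daemon="osd.0"} 400
'''
print(avg_latency(page))
```

In practice you would feed this the body of an HTTP GET against the mgr's metrics port rather than a literal string.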


[ceph-users] Re: monitoring apply_latency / commit_latency ?

2023-04-02 Thread Matthias Ferdinand
On Thu, Mar 30, 2023 at 08:56:06PM +0400, Konstantin Shalygin wrote:
> Hi,
> 
> > On 25 Mar 2023, at 23:15, Matthias Ferdinand  wrote:
> > 
> > from "ceph daemon osd.X perf dump"?
> 
> 
> No, from ceph-mgr prometheus exporter
> You can enable it via `ceph mgr module enable prometheus`

Hi Konstantin,

thanks :-)
I understand that grafana graphs are generated from prometheus metrics.
I just wanted to know which OSD daemon-perf values feed these prometheus
metrics (or if they are generated in some other way).


Output for "ceph daemon osd.X perf dump" is quite large; most of the
time I am just looking for some kind of latency indicator, or checking
if there are "slow" bytes in bluestore OSDs. Most of the output lines
get filtered away immediately by the next grep/jq. Can somebody tell me
if asking often (like every second) for full perf dump output could slow
down the OSD?
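
[Editor's note] On cutting down the grep/jq chain: one option is to parse the dump once and keep only the latency counters. Ceph publishes these running averages as avgcount/sum pairs, which the sketch below assumes; whether frequent full dumps load the OSD is a separate question this does not answer:

```python
def latency_metrics(perf_dump):
    """Pull only *latency* avgcount/sum pairs out of a full `perf dump` blob.

    perf_dump is the parsed JSON dict, e.g. from
    json.loads(subprocess.check_output(
        ["ceph", "daemon", "osd.0", "perf", "dump"])).
    """
    out = {}
    for section, counters in perf_dump.items():
        for name, value in counters.items():
            if "latency" in name and isinstance(value, dict) and "avgcount" in value:
                n, s = value["avgcount"], value["sum"]
                # lifetime average seconds per op for this counter
                out[f"{section}.{name}"] = s / n if n else 0.0
    return out

sample = {"osd": {"op_w_latency": {"avgcount": 200, "sum": 1.0},
                  "op_wip": 3}}
print(latency_metrics(sample))
```

A single parse like this replaces several grep/jq invocations per poll, whatever the polling interval.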


Regards
Matthias


[ceph-users] Re: monitoring apply_latency / commit_latency ?

2023-03-30 Thread Konstantin Shalygin
Hi,

> On 25 Mar 2023, at 23:15, Matthias Ferdinand  wrote:
> 
> from "ceph daemon osd.X perf dump"?


No, from ceph-mgr prometheus exporter
You can enable it via `ceph mgr module enable prometheus`

> Please bear with me :-) I just try to get some rough understanding what
> the numbers to be collected and graphed actually mean and how they are
> related to each other.

I think you can find the metric descriptions in the source of the official 
Grafana dashboard [1]


[1] https://github.com/ceph/ceph/blob/main/monitoring/ceph-mixin/dashboards_out/osds-overview.json
k


[ceph-users] Re: monitoring apply_latency / commit_latency ?

2023-03-25 Thread Matthias Ferdinand
On Sat, Mar 25, 2023 at 11:09:58AM +0700, Konstantin Shalygin wrote:
> Hi Matthias,
> 
> The Prometheus exporter already has all these metrics; you can set up Grafana 
> panels as you want
> Also, apply latency is a metric for pre-BlueStore, i.e. FileStore
> For BlueStore, apply latency is the same as commit latency; you can check this 
> via the `ceph osd perf` command


Thanks Konstantin,

do I guess right that the metrics shown in your screenshot correspond to
values

  "bluestore.txc_commit_lat.description": "Average commit latency",
  "bluestore.txc_throttle_lat.description": "Average submit throttle latency",
  "bluestore.txc_submit_lat.description": "Average submit latency",
  "bluestore.read_lat.description": "Average read latency",

from "ceph daemon osd.X perf dump"?


And "ceph osd perf" output would correspond to
  "bluestore.txc_commit_lat.description": "Average commit latency",
or
  "filestore.apply_latency.description": "Apply latency",
  "filestore.journal_latency.description": "Average journal queue completing 
latency",
depending on OSD format?

It looks like "read_lat" is Bluestore only, and there is no comparable
value for Filestore.

There are other, format-agnostic OSD latency values:
  "osd.op_r_latency.description": "Latency of read operation (including queue 
time)",
  "osd.op_w_latency.description": "Latency of write operation (including queue 
time)",
  "osd.op_rw_latency.description": "Latency of read-modify-write operation 
(including queue time)",


More guesswork:
  - is osd.op_X_latency about client->OSD command timing?
  - are bluestore/filestore values about OSD->storage op timing?

Please bear with me :-) I just try to get some rough understanding what
the numbers to be collected and graphed actually mean and how they are
related to each other.
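
[Editor's note] A rough way to turn those cumulative counters into a graphable number: each avgcount/sum pair is monotonic, so dividing the deltas between two polls gives the average latency over that interval. A sketch under that assumption (not the exact ceph-mgr or `ceph osd perf` code):

```python
def interval_latency(prev, curr):
    """Average op latency between two perf-dump samples of one counter.

    prev/curr are {"avgcount": ops_completed, "sum": total_seconds}
    snapshots of the same monotonically increasing counter,
    e.g. osd.op_w_latency.
    """
    dn = curr["avgcount"] - prev["avgcount"]
    ds = curr["sum"] - prev["sum"]
    # seconds accrued in the window / ops completed in the window
    return ds / dn if dn > 0 else 0.0

t0 = {"avgcount": 1000, "sum": 2.0}
t1 = {"avgcount": 1500, "sum": 3.5}
print(interval_latency(t0, t1))  # 1.5 s over 500 ops
```

Sampling this every few seconds yields a time series comparable to what the Grafana dashboards plot from the exporter's _sum/_count pairs.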


Regards
Matthias

> > On 25 Mar 2023, at 00:02, Matthias Ferdinand  wrote:
> > 
> > Hi,
> > 
> > I would like to understand how the per-OSD data from "ceph osd perf"
> > (i.e.  apply_latency, commit_latency) is generated. So far I couldn't
> > find documentation on this. "ceph osd perf" output is nice for a quick
> > glimpse, but is not very well suited for graphing. Output values are
> > from the most recent 5s-averages apparently.
> > 
> > With "ceph daemon osd.X perf dump" OTOH, you get quite a lot of latency
> > metrics, while it is just not obvious to me how they aggregate into
> > apply_latency and commit_latency. Or some comparably easy read latency
> > metric (something that is missing completely in "ceph osd perf").
> > 
> > Can somebody shed some light on this?
> > 
> > 
> > Regards
> > Matthias


[ceph-users] Re: monitoring apply_latency / commit_latency ?

2023-03-24 Thread Konstantin Shalygin
Hi Matthias,

The Prometheus exporter already has all these metrics; you can set up Grafana 
panels as you want.
Also, apply latency is a metric for pre-BlueStore, i.e. FileStore.
For BlueStore, apply latency is the same as commit latency; you can check this 
via the `ceph osd perf` command.




k

> On 25 Mar 2023, at 00:02, Matthias Ferdinand  wrote:
> 
> Hi,
> 
> I would like to understand how the per-OSD data from "ceph osd perf"
> (i.e.  apply_latency, commit_latency) is generated. So far I couldn't
> find documentation on this. "ceph osd perf" output is nice for a quick
> glimpse, but is not very well suited for graphing. Output values are
> from the most recent 5s-averages apparently.
> 
> With "ceph daemon osd.X perf dump" OTOH, you get quite a lot of latency
> metrics, while it is just not obvious to me how they aggregate into
> apply_latency and commit_latency. Or some comparably easy read latency
> metric (something that is missing completely in "ceph osd perf").
> 
> Can somebody shed some light on this?
> 
> 
> Regards
> Matthias


[ceph-users] Re: monitoring drives

2022-10-18 Thread Kai Stian Olstad

On 17.10.2022 12:52, Ernesto Puerta wrote:

   - Ceph already exposes SMART-based health-checks, metrics and alerts
     from the devicehealth/diskprediction modules. I find this kind of
     high-level monitoring more digestible to operators than low-level
     SMART metrics.


Marc, who started this thread, was asking about SAS disks.
smartctl doesn't show many SMART attributes on SAS disks, and some drives 
only have an error counter log like this:


Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0      376907      93335.728           0
write:         0        2         0         2     2113307      17978.600           0
verify:        0        0         0         0         848          0.002           0



But for the drives I have, it looks like they all have a SMART Health Status:

"SMART Health Status: OK"


Ceph doesn't support SMART or any status on SAS disks today; I only get 
the message "No SMART data available".



I have gathered "smartctl -x --json=vo" logs for the 6 types of SAS drives 
I have in my possession.

You can find them here if interested [1]


[1] https://gitlab.com/-/snippets/2431089
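
[Editor's note] Those --json logs can be consumed directly. A sketch that summarizes a parsed SAS report, assuming the smart_status and scsi_error_counter_log keys that recent smartctl versions emit; verify against your own output:

```python
def sas_health(report):
    """Summarize a parsed `smartctl -x --json` report for a SAS drive.

    report is the JSON dict smartctl prints; key names assumed:
    smart_status.passed and scsi_error_counter_log.<op>.total_uncorrected_errors.
    """
    passed = report.get("smart_status", {}).get("passed")
    counters = report.get("scsi_error_counter_log", {})
    uncorrected = {
        op: log.get("total_uncorrected_errors", 0)
        for op, log in counters.items()  # typically read / write / verify
    }
    return {"passed": passed, "uncorrected": uncorrected,
            "ok": bool(passed) and not any(uncorrected.values())}

# Values matching the error counter log shown earlier in the thread:
sample = {"smart_status": {"passed": True},
          "scsi_error_counter_log": {
              "read": {"total_uncorrected_errors": 0},
              "write": {"total_uncorrected_errors": 0},
              "verify": {"total_uncorrected_errors": 0}}}
print(sas_health(sample)["ok"])
```

This gives a single boolean per drive even where no per-attribute SMART table exists.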

--
Kai Stian Olstad


[ceph-users] Re: monitoring drives

2022-10-17 Thread Ernesto Puerta
I see a few (a priori) potential issues with this:

   - Given "disks" is THE key scaling dimension in a Ceph cluster,
   depending on how many metrics per device this exporter generates, it could
   negatively impact Prometheus performance (we already experienced such an
   issue when we explored adding cAdvisor support... and discarded that).
   - Depending on the type of smartctl testing, that might interfere with
   the IO load (after checking the actual metrics exported
   
<https://github.com/prometheus-community/smartctl_exporter/blob/master/metrics.go>,
   that doesn't seem to be the case),
   - Ceph already exposes SMART-based health-checks, metrics and alerts
   from the devicehealth/diskprediction modules
   
<https://docs.ceph.com/en/latest/rados/operations/devices/#enabling-monitoring>.
   I find this kind of high-level monitoring more digestible to operators than
   low-level SMART metrics.

Kind Regards,
Ernesto


On Fri, Oct 14, 2022 at 9:31 PM Fox, Kevin M  wrote:

> Would it cause problems to mix the smartctl exporter along with ceph's
> built in monitoring stuff?
>
> Thanks,
> Kevin
>
> 
> From: Wyll Ingersoll 
> Sent: Friday, October 14, 2022 10:48 AM
> To: Konstantin Shalygin; John Petrini
> Cc: Marc; Paul Mezzanini; ceph-users
> Subject: [ceph-users] Re: monitoring drives
>
> Check twice before you click! This email originated from outside PNNL.
>
>
> This looks very useful.  Has anyone created a grafana dashboard that will
> display the collected data ?
>
>
> 
> From: Konstantin Shalygin 
> Sent: Friday, October 14, 2022 12:12 PM
> To: John Petrini 
> Cc: Marc ; Paul Mezzanini ;
> ceph-users 
> Subject: [ceph-users] Re: monitoring drives
>
> Hi,
>
> You can get this metrics, even wear level, from official smartctl_exporter
> [1]
>
> [1]
> https://github.com/prometheus-community/smartctl_exporter
>
> k
> Sent from my iPhone
>
> > On 14 Oct 2022, at 17:12, John Petrini  wrote:
> >
> > We run a mix of Samsung and Intel SSD's, our solution was to write a
> > script that parses the output of the Samsung SSD Toolkit and Intel
> > ISDCT CLI tools respectively. In our case, we expose those metrics
> > using node_exporter's textfile collector for ingestion by prometheus.
> > It's mostly the same smart data but it helps identify some vendor
> > specific smart metrics, namely SSD wear level, that we were unable to
> > decipher from the raw smart data.


[ceph-users] Re: monitoring drives

2022-10-14 Thread Fox, Kevin M
Would it cause problems to mix the smartctl exporter along with ceph's built in 
monitoring stuff?

Thanks,
Kevin


From: Wyll Ingersoll 
Sent: Friday, October 14, 2022 10:48 AM
To: Konstantin Shalygin; John Petrini
Cc: Marc; Paul Mezzanini; ceph-users
Subject: [ceph-users] Re: monitoring drives



This looks very useful.  Has anyone created a grafana dashboard that will 
display the collected data ?



From: Konstantin Shalygin 
Sent: Friday, October 14, 2022 12:12 PM
To: John Petrini 
Cc: Marc ; Paul Mezzanini ; ceph-users 

Subject: [ceph-users] Re: monitoring drives

Hi,

You can get this metrics, even wear level, from official smartctl_exporter [1]

[1] 
https://github.com/prometheus-community/smartctl_exporter

k
Sent from my iPhone

> On 14 Oct 2022, at 17:12, John Petrini  wrote:
>
> We run a mix of Samsung and Intel SSD's, our solution was to write a
> script that parses the output of the Samsung SSD Toolkit and Intel
> ISDCT CLI tools respectively. In our case, we expose those metrics
> using node_exporter's textfile collector for ingestion by prometheus.
> It's mostly the same smart data but it helps identify some vendor
> specific smart metrics, namely SSD wear level, that we were unable to
> decipher from the raw smart data.


[ceph-users] Re: monitoring drives

2022-10-14 Thread Wyll Ingersoll
This looks very useful.  Has anyone created a grafana dashboard that will 
display the collected data ?



From: Konstantin Shalygin 
Sent: Friday, October 14, 2022 12:12 PM
To: John Petrini 
Cc: Marc ; Paul Mezzanini ; ceph-users 

Subject: [ceph-users] Re: monitoring drives

Hi,

You can get this metrics, even wear level, from official smartctl_exporter [1]

[1] https://github.com/prometheus-community/smartctl_exporter

k
Sent from my iPhone

> On 14 Oct 2022, at 17:12, John Petrini  wrote:
>
> We run a mix of Samsung and Intel SSD's, our solution was to write a
> script that parses the output of the Samsung SSD Toolkit and Intel
> ISDCT CLI tools respectively. In our case, we expose those metrics
> using node_exporter's textfile collector for ingestion by prometheus.
> It's mostly the same smart data but it helps identify some vendor
> specific smart metrics, namely SSD wear level, that we were unable to
> decipher from the raw smart data.


[ceph-users] Re: monitoring drives

2022-10-14 Thread Konstantin Shalygin
Hi,

You can get this metrics, even wear level, from official smartctl_exporter [1]

[1] https://github.com/prometheus-community/smartctl_exporter

k
Sent from my iPhone

> On 14 Oct 2022, at 17:12, John Petrini  wrote:
> 
> We run a mix of Samsung and Intel SSD's, our solution was to write a
> script that parses the output of the Samsung SSD Toolkit and Intel
> ISDCT CLI tools respectively. In our case, we expose those metrics
> using node_exporter's textfile collector for ingestion by prometheus.
> It's mostly the same smart data but it helps identify some vendor
> specific smart metrics, namely SSD wear level, that we were unable to
> decipher from the raw smart data.


[ceph-users] Re: monitoring drives

2022-10-14 Thread John Petrini
We run a mix of Samsung and Intel SSD's, our solution was to write a
script that parses the output of the Samsung SSD Toolkit and Intel
ISDCT CLI tools respectively. In our case, we expose those metrics
using node_exporter's textfile collector for ingestion by prometheus.
It's mostly the same smart data but it helps identify some vendor
specific smart metrics, namely SSD wear level, that we were unable to
decipher from the raw smart data.
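
[Editor's note] The textfile-collector approach described above amounts to writing a small .prom file into node_exporter's --collector.textfile.directory. A minimal sketch; the metric name and wear values are made up for illustration:

```python
import os
import tempfile

def write_wear_metrics(path, wear_by_device):
    """Atomically write vendor wear-level gauges in Prometheus text format."""
    lines = ["# HELP ssd_wear_level_percent Vendor-reported SSD wear level.",
             "# TYPE ssd_wear_level_percent gauge"]
    for dev, pct in sorted(wear_by_device.items()):
        lines.append(f'ssd_wear_level_percent{{device="{dev}"}} {pct}')
    body = "\n".join(lines) + "\n"
    # Write to a temp file and rename so node_exporter never reads a partial file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(body)
    os.replace(tmp, path)

out_dir = tempfile.mkdtemp()  # stand-in for the textfile collector directory
path = os.path.join(out_dir, "ssd_wear.prom")
write_wear_metrics(path, {"sda": 3, "sdb": 7})
print(open(path).read())
```

A cron job or systemd timer running the vendor CLI parser and then this writer is all the "exporter" that is needed.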


[ceph-users] Re: monitoring drives

2022-10-14 Thread Marc
> smartctl can very much read sas drives so I would look into that chain
> first.

I have smartd running and it does recognize the SAS drives; however, collectd 
is grabbing smart data and I am getting nothing from them. This is all the 
stuff I am getting from a SATA drive:

# SELECT * FROM "smart_value" WHERE "host"='c01' AND "instance"='sdb' AND time>=now()-60m limit 50
name: smart_value
time                           host instance type              value
----                           ---- -------- ----              -----
2022-10-14T13:24:04.029043881Z c01  sdb      smart_poweron     118652400
2022-10-14T13:24:04.043975567Z c01  sdb      smart_powercycles 8
2022-10-14T13:24:04.05828545Z  c01  sdb      smart_badsectors  0
2022-10-14T13:24:04.07207858Z  c01  sdb      smart_temperature 30

> SELECT * FROM "smart_pretty" WHERE "host"='c01' AND "instance"='sdb' AND time>=now()-60m limit 50
name: smart_pretty
time                           host instance type            type_instance            value
----                           ---- -------- ----            -------------            -----
2022-10-14T13:24:04.072900793Z c01  sdb      smart_attribute raw-read-error-rate      0
2022-10-14T13:24:04.073731474Z c01  sdb      smart_attribute spin-up-time             5383
2022-10-14T13:24:04.074562994Z c01  sdb      smart_attribute start-stop-count         8
2022-10-14T13:24:04.075397312Z c01  sdb      smart_attribute reallocated-sector-count 0
2022-10-14T13:24:04.07624241Z  c01  sdb      smart_attribute seek-error-rate          0
2022-10-14T13:24:04.077058461Z c01  sdb      smart_attribute power-on-hours           11865240
2022-10-14T13:24:04.077886085Z c01  sdb      smart_attribute spin-retry-count         0
2022-10-14T13:24:04.078708091Z c01  sdb      smart_attribute calibration-retry-count  0
2022-10-14T13:24:04.079542614Z c01  sdb      smart_attribute power-cycle-count        8
2022-10-14T13:24:04.080374422Z c01  sdb      smart_attribute power-off-retract-count  6
2022-10-14T13:24:04.0812049Z   c01  sdb      smart_attribute load-cycle-count         74
2022-10-14T13:24:04.082027399Z c01  sdb      smart_attribute temperature-celsius-2    303150
2022-10-14T13:24:04.082879593Z c01  sdb      smart_attribute reallocated-event-count  0
2022-10-14T13:24:04.083707815Z c01  sdb      smart_attribute current-pending-sector   0
2022-10-14T13:24:04.084536779Z c01  sdb      smart_attribute offline-uncorrectable    0
2022-10-14T13:24:04.085365242Z c01  sdb      smart_attribute udma-crc-error-count     0
2022-10-14T13:24:04.086191201Z c01  sdb      smart_attribute multi-zone-error-rate    0

>   Are they behind a raid controller that is masking the smart
> commands?

No

> As for monitoring, we run the smartd service to keep an eye on drives.
> More often than not I notice weird things with ceph long before smart
> throws an actual error.  Bouncing drives, oddly high latency on our "Max
> OSD Apply Latency" graph. 

Do you only grab one metric in the query, or do you also 'calculate' whether the 
disk is currently being used and compensate for that in the reported latency? 
(Or does this metric not depend on current use?)

What values should I look for, how many hundreds of ms?

I have 106 metrics listed in ceph_latency. These start with osd; which would be 
the apply latency one?

Osd.opBeforeDequeueOpLat
Osd.opBeforeQueueOpLat
Osd.opLatency
Osd.opPrepareLatency
Osd.opProcessLatency
Osd.opRLatency
Osd.opRPrepareLatency
Osd.opRProcessLatency
Osd.opRwLatency
Osd.opRwPrepareLatency
Osd.opRwProcessLatency
Osd.opWLatency
Osd.opWPrepareLatency
Osd.opWProcessLatency
Osd.subopLatency
Osd.subopWLatency
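Those Osd.* latency counters are cumulative. As a sketch (assuming the usual `{avgcount, sum}` pair that `ceph daemon osd.N perf dump` reports for latency counters, with `sum` in seconds), the lifetime average hides spikes, but the delta between two samples gives the average over just that window, which also sidesteps compensating for current use: an idle window simply completes few ops.

```python
# Sketch: derive a recent average op latency from two successive samples of
# an OSD's perf counters (as from `ceph daemon osd.N perf dump`).  Latency
# counters are assumed to be cumulative {avgcount, sum} pairs with sum in
# seconds; the numbers below are made up for illustration.

def window_latency(prev, curr):
    """Average latency (seconds) over the interval between two samples."""
    ops = curr["avgcount"] - prev["avgcount"]
    if ops <= 0:
        return 0.0  # no ops completed in the window
    return (curr["sum"] - prev["sum"]) / ops

sample_t0 = {"avgcount": 100000, "sum": 250.0}
sample_t1 = {"avgcount": 100400, "sum": 252.0}  # 400 ops, 2 s of latency

if __name__ == "__main__":
    # 2.0 s over 400 ops -> 5 ms average for this window
    print(f"{window_latency(sample_t0, sample_t1) * 1000:.1f} ms")
```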

>  Every few months I throw a smart long test
> at the whole cluster and a few days later go back and rake the results.
> Anything that has a failure gets immediately removed from ceph by me
> regardless if smart says it's fine or not.   At least 90% of the drives
> we RMA have smart passed but failures in the read test.  Never had
> pushback from WDC or Seagate on it.
> 


[ceph-users] Re: monitoring drives

2022-10-14 Thread Paul Mezzanini
smartctl can very much read SAS drives, so I would look into that chain first.   
Are they behind a RAID controller that is masking the SMART commands?

As for monitoring, we run the smartd service to keep an eye on drives.   More 
often than not I notice weird things in Ceph long before SMART throws an 
actual error: bouncing drives, oddly high latency on our "Max OSD Apply 
Latency" graph.   Every few months I throw a SMART long test at the whole 
cluster and a few days later go back and rake the results.   Anything that has 
a failure gets immediately removed from Ceph by me, regardless of whether SMART 
says it's fine.   At least 90% of the drives we RMA have SMART passed but 
failures in the read test.  Never had pushback from WDC or Seagate on it.
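As a sketch of that "rake the results" step, assuming the self-test logs have been captured per drive with `smartctl -l selftest` (the sample text below only approximates smartctl's table layout):

```python
# Sketch: scan captured `smartctl -l selftest` output and flag drives whose
# self-test log shows a failure, even when overall SMART health is PASSED.
# The sample report approximates smartctl's table; real output varies by
# drive model and smartctl version.

def failed_selftests(report):
    """Return self-test log lines that report a read failure or other error."""
    bad = []
    for line in report.splitlines():
        if line.lstrip().startswith("#") and "Completed without error" not in line:
            if "failure" in line.lower() or "error" in line.lower():
                bad.append(line.strip())
    return bad

sample_report = """SMART overall-health self-assessment test result: PASSED
Num  Test_Description    Status                  Remaining  LifeTime(hours)
# 1  Extended offline    Completed: read failure       90%     11862
# 2  Extended offline    Completed without error       00%     10430
"""

if __name__ == "__main__":
    for line in failed_selftests(sample_report):
        print("FAIL:", line)
```

The point of the pattern is exactly what's in the sample: overall health PASSED, yet the extended test logged a read failure.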

-paul


From: Marc 
Sent: Thursday, October 13, 2022 4:44 PM
To: ceph-users
Subject: [ceph-users] monitoring drives

I was wondering what is a best practice for monitoring drives. I am 
transitioning from SATA to SAS drives, which expose less smartctl information, 
not even power-on hours.

e.g. does Ceph register somewhere when an OSD has been created?



[ceph-users] Re: Monitoring slow ops

2022-02-09 Thread Trey Palmer
Thank y'all.   This metric is exactly what we need.   Turns out it was
introduced in 14.2.17 and we have 14.2.9.

On Wed, Feb 9, 2022 at 2:32 AM Konstantin Shalygin  wrote:

> Hi,
>
> On 9 Feb 2022, at 09:03, Benoît Knecht  wrote:
>
> I don't remember in which Ceph release it was introduced, but on Pacific
> there's a metric called `ceph_healthcheck_slow_ops`.
>
>
> At least in Nautilus this metric exists
>
>
> k
>
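For anyone wiring this metric into an external alerter, a minimal sketch of evaluating a Prometheus instant-query response for `ceph_healthcheck_slow_ops`: the payload below is fabricated; only the response shape follows the Prometheus HTTP API (`/api/v1/query`).

```python
# Sketch: decide whether to alert on ceph_healthcheck_slow_ops given a
# Prometheus instant-query response.  The sample payload is fabricated;
# only its shape follows the Prometheus HTTP API vector result format.
import json

def slow_ops_firing(payload):
    """True if any returned sample of the metric is non-zero."""
    data = json.loads(payload)
    if data.get("status") != "success":
        raise ValueError("query failed")
    return any(float(r["value"][1]) > 0 for r in data["data"]["result"])

sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {"__name__": "ceph_healthcheck_slow_ops"},
                         "value": [1644400000, "1"]}]},
})

if __name__ == "__main__":
    print("alert!" if slow_ops_firing(sample) else "ok")
```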


[ceph-users] Re: Monitoring slow ops

2022-02-08 Thread Konstantin Shalygin
Hi,

> On 9 Feb 2022, at 09:03, Benoît Knecht  wrote:
> 
> I don't remember in which Ceph release it was introduced, but on Pacific
> there's a metric called `ceph_healthcheck_slow_ops`.

At least in Nautilus this metric exists


k


[ceph-users] Re: Monitoring ceph cluster

2022-01-26 Thread Anthony D'Atri

What David said!

A couple of additional thoughts:

o Nagios (and derivatives like Icinga and check_mk) have been popular for 
years.  Note that they’re monitoring solutions vs metrics solutions — it’s good 
to have both.  One issue I’ve seen multiple times with Nagios-family monitoring 
is that over time as checks and the fleet grow, the server tends to bog down, 
and the litany of active checks starts taking longer to run than the check 
interval.  Prometheus alertmanager is more scalable, and with some thought most 
active checks can be recast in terms of metrics.

o Prometheus (forked node_exporter) was INVALUABLE to me when characterizing 
and engaging two separate SSD firmware design flaw issues. It includes a data 
query interface for ad-hoc queries and expression development.

o Grafana pairs well with Prometheus for dashboard-style visualization and 
trending across many clusters / nodes


> On Jan 26, 2022, at 1:22 PM, David Orman  wrote:
> 
> What version of Ceph are you using? Newer versions deploy a dashboard and
> prometheus module, which has some of this built in. It's a great start to
> seeing what can be done using Prometheus and the built in exporter. Once
> you learn this, if you decide you want something more robust, you can do an
> external deployment of Prometheus (clusters), Alertmanager, Grafana, and
> all the other tooling that might interest you for a more scalable solution
> when dealing with more clusters. It's the perfect way to get your feet wet
> and it showcases a lot of the interesting things you can do with this
> solution!
> 
> https://docs.ceph.com/en/latest/mgr/dashboard/
> https://docs.ceph.com/en/latest/mgr/prometheus/
> 
> David
> 
> On Wed, Jan 26, 2022 at 1:42 AM Michel Niyoyita  wrote:
> 
>> Thank you for your email Szabo, these can be helpful , can you provide
>> links then I start to work on it.
>> 
>> Michel.
>> 
>> On Tue, 25 Jan 2022, 18:51 Szabo, Istvan (Agoda), 
>> wrote:
>> 
>>> Which monitoring tool? Like prometheus or nagios style thing?
>>> We use sensu for keepalive and ceph health reporting + prometheus with
>>> grafana for metrics collection.
>>> 
>>> Istvan Szabo
>>> Senior Infrastructure Engineer
>>> ---
>>> Agoda Services Co., Ltd.
>>> e: istvan.sz...@agoda.com
>>> ---
>>> 
>>> On 2022. Jan 25., at 22:38, Michel Niyoyita  wrote:
>>> 
>>> Email received from the internet. If in doubt, don't click any link nor
>>> open any attachment !
>>> 
>>> 
>>> Hello team,
>>> 
>>> I would like to monitor my ceph cluster using one of the
>>> monitoring tool, does someone has a help on that ?
>>> 
>>> Michel


[ceph-users] Re: Monitoring ceph cluster

2022-01-26 Thread David Orman
What version of Ceph are you using? Newer versions deploy a dashboard and
prometheus module, which has some of this built in. It's a great start to
seeing what can be done using Prometheus and the built in exporter. Once
you learn this, if you decide you want something more robust, you can do an
external deployment of Prometheus (clusters), Alertmanager, Grafana, and
all the other tooling that might interest you for a more scalable solution
when dealing with more clusters. It's the perfect way to get your feet wet
and it showcases a lot of the interesting things you can do with this
solution!

https://docs.ceph.com/en/latest/mgr/dashboard/
https://docs.ceph.com/en/latest/mgr/prometheus/

David

On Wed, Jan 26, 2022 at 1:42 AM Michel Niyoyita  wrote:

> Thank you for your email Szabo, these can be helpful , can you provide
> links then I start to work on it.
>
> Michel.
>
> On Tue, 25 Jan 2022, 18:51 Szabo, Istvan (Agoda), 
> wrote:
>
> > Which monitoring tool? Like prometheus or nagios style thing?
> > We use sensu for keepalive and ceph health reporting + prometheus with
> > grafana for metrics collection.
> >
> > Istvan Szabo
> > Senior Infrastructure Engineer
> > ---
> > Agoda Services Co., Ltd.
> > e: istvan.sz...@agoda.com
> > ---
> >
> > On 2022. Jan 25., at 22:38, Michel Niyoyita  wrote:
> >
> >
> > Hello team,
> >
> > I would like to monitor my ceph cluster using one of the
> > monitoring tool, does someone has a help on that ?
> >
> > Michel


[ceph-users] Re: Monitoring ceph cluster

2022-01-25 Thread Michel Niyoyita
Thank you for your email, Szabo; these can be helpful. Can you provide
links so I can start to work on it?

Michel.

On Tue, 25 Jan 2022, 18:51 Szabo, Istvan (Agoda), 
wrote:

> Which monitoring tool? Like prometheus or nagios style thing?
> We use sensu for keepalive and ceph health reporting + prometheus with
> grafana for metrics collection.
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> On 2022. Jan 25., at 22:38, Michel Niyoyita  wrote:
>
>
> Hello team,
>
> I would like to monitor my ceph cluster using one of the
> monitoring tool, does someone has a help on that ?
>
> Michel