Re: [ceph-users] Use telegraf/influx to detect problems is very difficult

2019-12-12 Thread Miroslav Kalina
I just briefly peeked into the source of the module, and I suppose it's
because the main design idea is just to forward existing metrics from
Ceph core and not to calculate anything.

To me it seems most users probably use Prometheus, which doesn't have
this kind of issue.

Monitor down is also easy as pie, because it's just "num_mon -
mon_quorum". There is also a metric mon_outside_quorum, but it is
always zero for me and I don't really know how it works.

OSD near full will probably be trickier: for each OSD you have to compute
"osd.stat_bytes_used / osd.stat_bytes" and compare it with your own
configured threshold (it is not a metric, so it is not exported).

Or you can just watch the general cluster health metric (which you
should do anyway) and raise a general alarm in this case.
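
The two calculations above could be sketched roughly as follows. This is
only an illustration, not what the mgr module does: the function names,
the dict layout and the 0.85 threshold are assumptions, and the threshold
has to be kept in sync with your own cluster configuration by hand.

```python
# Assumed threshold; it is not exported as a metric, so it is hard-coded here.
NEARFULL_RATIO = 0.85


def monitors_down(num_mon, mon_quorum):
    """Monitors outside quorum, computed as num_mon - mon_quorum."""
    return max(num_mon - mon_quorum, 0)


def nearfull_osds(osd_stats, ratio=NEARFULL_RATIO):
    """Return the ids of OSDs whose used/total ratio reaches the threshold.

    osd_stats is a hypothetical mapping such as
    {"osd.0": {"stat_bytes_used": 90, "stat_bytes": 100}}.
    """
    result = []
    for osd_id, stats in sorted(osd_stats.items()):
        total = stats["stat_bytes"]
        if total <= 0:
            continue  # skip OSDs that report no capacity
        if stats["stat_bytes_used"] / total >= ratio:
            result.append(osd_id)
    return result
```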

M.



On 11. 12. 19 21:18, Mario Giammarco wrote:
> Miroslav replied better for us why "is not so simple" to use math.
> And osd down was the easiest. How can I calculate:
> - monitor down
> - osd near full
>
> ?
>
> I do not understand why ceph plugin cannot send to influx all the
> metrics it has, especially the most useful for creating alarms.
>
> On Wed, 11 Dec 2019 at 04:58, Konstantin Shalygin
> <k0...@k0ste.ru> wrote:
>
>> But it is very difficult/complicated to make simple queries because, for
>> example I have osd up and osd total but not osd down metric.
>>
> To determine how much osds down you don't need special metric,
> because you already have osd_up and osd_in metrics. Just use math.
>
>
>
>
> k
>
>
> _______
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Miroslav Kalina
Systems development specialist

miroslav.kal...@livesport.eu
+420 773 071 848

Livesport s.r.o.
Aspira Business Centre
Bucharova 2928/14a, 158 00 Praha 5
www.livesport.eu



Re: [ceph-users] Use telegraf/influx to detect problems is very difficult

2019-12-11 Thread Miroslav Kalina
As I mentioned here yesterday, there is an issue with the current
metrics scheme.

With the current scheme you cannot do simple math like

> SELECT num_osd - num_osd_up FROM "ceph_cluster_stats"

Instead you would need a query like

> SELECT (SELECT last("value") FROM "ceph_cluster_stats" WHERE
> "type_instance" = 'num_osd') - (SELECT last("value") FROM
> "ceph_cluster_stats" WHERE "type_instance" = 'num_osd_up')

which is not supported by InfluxDB.

I know this type of query works perfectly in the Prometheus or SQL
world, but AFAIK you unfortunately cannot easily combine multiple series
in InfluxDB.
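
One possible workaround is to run the two queries separately and do the
subtraction on the client side. A minimal sketch, assuming each query
result has already been fetched as a list of (timestamp, value) pairs;
the function names are hypothetical and no InfluxDB client API is
implied:

```python
def last_value(points):
    """Return the value with the newest timestamp, or None if there is none."""
    newest = None
    for ts, value in points:
        if value is None:
            continue  # ignore empty points
        if newest is None or ts >= newest[0]:
            newest = (ts, value)
    return None if newest is None else newest[1]


def osds_down(num_osd_points, num_osd_up_points):
    """Compute num_osd - num_osd_up from two separately queried series."""
    total = last_value(num_osd_points)
    up = last_value(num_osd_up_points)
    if total is None or up is None:
        return None  # one of the series returned no data
    return total - up
```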



As for Mario's issue with alerting: maybe you can try Kapacitor for
alerting. I have no direct experience with it, but it should be easy to
control via Chronograf and could solve your issue.

M.


On 12/11/19 4:58 AM, Konstantin Shalygin wrote:
>
>> But it is very difficult/complicated to make simple queries because, for
>> example I have osd up and osd total but not osd down metric.
>>
> To determine how much osds down you don't need special metric, because
> you already have osd_up and osd_in metrics. Just use math.
>
>
>
>
> k
>
>



[ceph-users] Ceph-mgr :: Grafana + Telegraf / InfluxDB metrics format

2019-12-10 Thread Miroslav Kalina
Hello guys,

is anyone using the Telegraf / InfluxDB metrics exporter with Grafana
dashboards? I am asking because I was unable to find any existing
Grafana dashboards based on InfluxDB.

I am having a hard time creating the graphs I want to see. Metrics are
exported in such a way that every single one is stored in a separate
series in Influx, like:

> ceph_pool_stats,cluster=ceph1,metric=read value=1234 15506589110
> ceph_pool_stats,cluster=ceph1,metric=write value=1234 15506589110
> ceph_pool_stats,cluster=ceph1,metric=total value=1234 15506589110

instead of a single series like:

> ceph_pool_stats,cluster=ceph1 read=1234,write=1234,total=1234 15506589110
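
To illustrate the difference between the two layouts, here is a rough
sketch that pivots points of the first form into the second. It is only
an illustration: pivot_lines is a hypothetical helper, it assumes the tag
is called "metric" and the field "value" as in the sample above, and it
does not handle escaped spaces or commas or lines without a timestamp.

```python
def pivot_lines(lines, metric_tag="metric", value_field="value"):
    """Merge one-point-per-metric lines into one multi-field line per series."""
    grouped = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # "measurement,tags fields timestamp"
        series, fields, timestamp = line.split(" ")
        measurement, *tag_pairs = series.split(",")
        tags = dict(pair.split("=", 1) for pair in tag_pairs)
        metric = tags.pop(metric_tag)  # the tag becomes the field name
        field_map = dict(pair.split("=", 1) for pair in fields.split(","))
        key = (measurement, tuple(sorted(tags.items())), timestamp)
        grouped.setdefault(key, {})[metric] = field_map[value_field]

    merged = []
    for (measurement, tags, timestamp), metrics in grouped.items():
        tag_part = "".join(f",{k}={v}" for k, v in tags)
        field_part = ",".join(f"{name}={value}" for name, value in metrics.items())
        merged.append(f"{measurement}{tag_part} {field_part} {timestamp}")
    return merged
```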

This means that when I want to graph something like a % usage ratio
(= bytes_used / bytes_total) or the number of faulty OSDs (= num_osd_up -
num_osd_in), I am unable to do it with a single query like

> SELECT mean("num_osd_up") - mean("num_osd_in") FROM
> "ceph_cluster_stats" WHERE "cluster" =~ /^ceph1$/ AND time >= now() - 6h
> GROUP BY time(5m) fill(null)

but instead it requires two queries followed by a math operation, which
I was unable to get working in either Grafana or InfluxDB (I believe it's
not supported; Influx removed JOIN queries some time ago).
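
For graphing, the "two queries followed by a math operation" step could
in principle be done outside InfluxDB. A minimal sketch, assuming each
query result is already available as a mapping from the 5-minute bucket
timestamp to the mean value (None for empty buckets, as fill(null) would
produce); subtract_series is a hypothetical helper:

```python
def subtract_series(left, right):
    """Return left - right for every time bucket present in both series."""
    result = {}
    for ts in sorted(left.keys() & right.keys()):
        a, b = left[ts], right[ts]
        # keep gaps as None so the graph shows missing data instead of zero
        result[ts] = None if a is None or b is None else a - b
    return result
```

For example, passing the num_osd_up buckets as left and the num_osd_in
buckets as right gives the per-bucket difference from the query above.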

I didn't see any way to modify the format of the metrics exported to
Telegraf. I feel like I am missing something pretty obvious here.

I am currently unable to switch to the Prometheus exporter (which doesn't
have this kind of issue) because of my current infrastructure setup.

Currently I am using the following versions:
 * Ceph 14.2.4
 * InfluxDB 1.6.4
 * Grafana 6.4.2

So ... does anyone have this working? Could you please share your
dashboards?

Best regards

-- 
Miroslav Kalina
Systems development specialist

Livesport s.r.o.
Aspira Business Centre
Bucharova 2928/14a, 158 00 Praha 5
www.livesport.eu
