Re: [prometheus-users] Giving different Dynamic Thresholds for the same alert.

Yagyansh S. Kumar Sat, 14 Mar 2020 14:35:51 -0700

Yes, I did experiment with node_filesystem_device_error earlier, based on 
Ben's suggestion in my earlier thread, but not extensively. Also, I didn't 
know that it reflects the statfs success status. From what I have read so 
far, a statfs call is the best way to find out whether a filesystem is 
hanging. Hence, I'll definitely give node_filesystem_device_error another 
try and see if I can come up with something interesting.
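
For illustration only, here is a minimal Python sketch of the kind of statfs
probe discussed in this thread. It is not how node_exporter implements its
check; the function name statfs_ok, the 5-second default timeout, and the
example path are assumptions made for the sketch. The call runs in a daemon
thread because a statfs on a hung NFS mount can block indefinitely.

```python
import os
import threading


def statfs_ok(path, timeout=5.0):
    """Return True if os.statvfs(path) succeeds within `timeout` seconds.

    Returns False if the call raises an OSError or does not finish in time
    (which is what a hung mount would look like from the caller's side).
    `statfs_ok` is a hypothetical helper, not part of any library.
    """
    result = {}

    def probe():
        try:
            os.statvfs(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # Daemon thread: a probe stuck on a hung mount must not block interpreter exit.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)
    # No entry in `result` means the probe is still blocked, i.e. a timeout.
    return result.get("ok", False)


if __name__ == "__main__":
    # "/" is only an example path; substitute the NFS mount point to check.
    print(statfs_ok("/"))
```

A False result means either an error or a timeout. The probe thread for a
hung mount stays blocked until the mount recovers, so a caller that probes
repeatedly should take care not to pile up threads.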


Thanks a lot for your help. Cheers!

On Sunday, March 15, 2020 at 2:49:01 AM UTC+5:30, Christian Hoffmann wrote:
>
> On 3/14/20 10:01 PM, Yagyansh S. Kumar wrote: 
> > Also, since you mentioned hanging network filesystems, is there any 
> > way/logic to find out whether an NFS mount is hung on a machine or 
> > not? I have busted my ass on getting this result, must have tried more 
> > than 50 things but still have nothing in this matter. 
> > In our setup we use a lot of NFS and some of the mounts are really 
> > critical. All these shared NFS mounts are taken from a 3rd party vendor 
> > and due to network lag or IP mismatch or 10 other reasons, the NFS mount 
> > ends up hung on a machine or two. I need to know whenever this 
> > happens. Anything that can be done here? 
>
> I think I would aim for using the regular node_filesystem_device_error 
> metric nowadays, which is basically the Statfs success status. 
>
> In earlier node_exporter times, a hung nfs mount could easily prevent 
> node_exporter from working reliably, which is why we still have nfs 
> excluded via --collector.filesystem.ignored-fs-types. However, since 
> #997 [1] this should have been improved. Therefore, I plan to give this 
> a go again. 
>
> Other than that, there are nfs client metrics, but I'm not sure if you 
> can derive a hung / not hung result from that. 
>
> I was about to link to another thread from some weeks ago, but I just 
> noticed that it was started by you as well [2]. ;) 
>
> I think that Ben's suggestion is basically the same. Julien's approach 
> regarding separation of collectors into different jobs (in the same 
> mail thread) also sounded interesting. 
>
> Have you done some experiments with node_filesystem_device_error? 
>
> Kind regards, 
> Christian 
>
>
> [1] https://github.com/prometheus/node_exporter/pull/997 
> [2] https://groups.google.com/d/msgid/prometheus-users/CABbyFmqMKQXYNOfdr7BeFA%3Dx%3D5fY%2Bk4EQ8oprL0Wh-8SNqmvoA%40mail.gmail.com?utm_medium=email&utm_source=footer

