I wonder if maybe at one point Netdata returned timestamps that are in the
future for those time series, and now your production Prometheus cannot
scrape the earlier timestamps for the same series anymore. Try setting
-log.level=debug in production and see if there are any out-of-order scrape
error messages of the kind:

level=debug ts=2020-05-10T21:35:17.206Z caller=scrape.go:1245
component="scrape manager" scrape_pool=<scrape pool name> target=<target>
msg="Out of order sample" series=<series>

On Sun, May 10, 2020 at 11:29 PM Yashar Nesabian <[email protected]> wrote:

> I installed Prometheus server on a test machine with the same ansible as
> what we used to install our main Prometheus and added the netdata job and
> metrics are fine on the local Prometheus, now I'm sure this is a Prometheus
> problem
>
> On Monday, May 11, 2020 at 1:07:31 AM UTC+4:30, Julius Volz wrote:
>>
>> Huh! Ok, strange. And I guess you double-checked that that is what the
>> Prometheus server really scrapes... then I'm a bit out of suggestions at
>> the moment without poking at the setup myself.
>>
>> On Sun, May 10, 2020 at 10:34 PM Yashar Nesabian <[email protected]>
>> wrote:
>>
>>> here is the chart for the last 6 hours for the metric: (the last metric
>>> is for 14:43 )
>>>
>>> [image: Screenshot from 2020-05-11 00-58-58.png]
>>>
>>>
>>> On Monday, May 11, 2020 at 12:43:21 AM UTC+4:30, Yashar Nesabian wrote:
>>>>
>>>> The other slaves have 2-3 seconds difference with the timestamp of
>>>> these metrics, and yes the 2:57pm UTC is almost correct (I don't know the
>>>> exact time) and using foo[24h] is not very informative right now because we
>>>> still have the previous metrics when the slaves were on netdata master
>>>> number 1.
>>>> I did another experiment, I downloaded the metric files again and ran
>>>> the command (date +%s) on the Prometheus server almost at the same time,
>>>> The metrics' timestamp was 1589141392868 and the server's timestamp
>>>> was  1589141393 So I think this is not the problem
>>>>
>>>> On Monday, May 11, 2020 at 12:19:23 AM UTC+4:30, Julius Volz wrote:
>>>>>
>>>>> [+CCing back prometheus-users, which I had accidentally removed]
>>>>>
>>>>> How similar are the others? The ones in your example are from this
>>>>> afternoon (2:57pm UTC), I guess that's when you downloaded the file for
>>>>> grepping first?
>>>>>
>>>>> A regular instant vector selector in PromQL (like just "foo") will
>>>>> only select data points up to 5 minutes into the past from the current
>>>>> evaluation timestamp. So the table view would not show samples for any
>>>>> series whose last sample is more than 5m into the past. You could try a
>>>>> range selector like "foo[24h]" on these to see if any historical data is
>>>>> returned (I would expect so).
>>>>>
>>>>> On Sun, May 10, 2020 at 9:37 PM Yashar Nesabian <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Sure, here it is:
>>>>>> if the second parameter is the timestamp, then yes that's the
>>>>>> problem, but I wonder how come other metrics are stored by the Prometheus
>>>>>> server? because they also have a similar timestamp
>>>>>>
>>>>>> grep -i "netdata_web_log_detailed_response_codes_total" 
>>>>>> allmetrics\?format=prometheus_all_hosts\&source=as-collected.2 | grep -i 
>>>>>> "abs"
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="200",instance="abs-02.x.y.zabs"}
>>>>>>  245453 1589122673736
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="400",instance="abs-02.x.y.zabs"}
>>>>>>  82 1589122673736
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="401",instance="abs-02.x.y.zabs"}
>>>>>>  6 1589122673736
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="200",instance="abs-04.x.y.zabs"}
>>>>>>  238105 1589122673017
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="400",instance="abs-04.x.y.zabs"}
>>>>>>  59 1589122673017
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="401",instance="abs-04.x.y.zabs"}
>>>>>>  3 1589122673017
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="200",instance="abs-03.x.y.zabs"}
>>>>>>  241708 1589122673090
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="400",instance="abs-03.x.y.zabs"}
>>>>>>  68 1589122673090
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="401",instance="abs-03.x.y.zabs"}
>>>>>>  5 1589122673090
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="200",instance="abs-01.x.y.zabs"}
>>>>>>  250296 1589122674872
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="400",instance="abs-01.x.y.zabs"}
>>>>>>  81 1589122674872
>>>>>> netdata_web_log_detailed_response_codes_total{chart="web_log_passenger_event.detailed_response_codes",family="responses",dimension="401",instance="abs-01.x.y.zabs"}
>>>>>>  7 1589122674872
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, May 10, 2020 at 10:36 PM Julius Volz <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hmm, odd. Could you share some of the lines that your grep finds in
>>>>>>> the metrics output of the correctly scraped target?
>>>>>>>
>>>>>>> The example at the top of
>>>>>>> https://github.com/netdata/netdata/issues/3891 suggests that
>>>>>>> Netdata sets client-side timestamps for samples (which is uncommon for
>>>>>>> Prometheus otherwise). Maybe those timestamps are too far in the past 
>>>>>>> (more
>>>>>>> than 5 minutes), so they would not be shown anymore?
>>>>>>>
>>>>>>> On Sun, May 10, 2020 at 6:51 PM Yashar Nesabian <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have a job on the Prometheus which gathers metrics from 4 netdata
>>>>>>>> master servers. Here is the scenario I had:
>>>>>>>> - on netdata master number 1, I gather metrics of  about 200 slaves
>>>>>>>> - For some reason, I decided to move 12 slaves
>>>>>>>> (a1,a2,a3,a4,b1,b2,b3,b4,c1,c2,c3,c4) from the first netdata master to 
>>>>>>>> the
>>>>>>>> second netdata master
>>>>>>>> - Now I only see metrics from 8 servers on the Prometheus server
>>>>>>>> a1,a2,a3,a4,b1,b2,b3,b4) coming from the second master
>>>>>>>> - I check the job status in the targets page and I see all 4
>>>>>>>> masters are up and metrics are gathered successfully
>>>>>>>> - Here is the URL which Prometheus uses to read the metrics from
>>>>>>>> the netdata master number 2:
>>>>>>>> http://172.16.76.152:19999/api/v1/allmetrics?format=prometheus_all_hosts
>>>>>>>> - I grep the downloaded file with hosts metrics for the c1,c2,c3,c4
>>>>>>>> hosts and I see netdata is sending all the metrics relevant to these 
>>>>>>>> slaves
>>>>>>>> - But when I search for the metric in the Graph page, I don't see
>>>>>>>> any results:
>>>>>>>>
>>>>>>>> [image: Screenshot from 2020-05-10 20-58-27.png]
>>>>>>>>
>>>>>>>> all the servers' time is synced and are correct.
>>>>>>>> here is the output of systemctl status prometheus:
>>>>>>>>
>>>>>>>> May 10 19:35:07 devops-mon-01 systemd[1]: Reloading Prometheus.
>>>>>>>> May 10 19:35:07 devops-mon-01 prometheus[6076]: level=info
>>>>>>>> ts=2020-05-10T15:05:07.407Z caller=main.go:734 msg="Loading 
>>>>>>>> configuration
>>>>>>>> file" filename=/e
>>>>>>>> tc/prometheus/prometheus.yml
>>>>>>>> May 10 19:35:07 devops-mon-01 prometheus[6076]: level=info
>>>>>>>> ts=2020-05-10T15:05:07.416Z caller=main.go:762 msg="Completed loading 
>>>>>>>> of
>>>>>>>> configuration file
>>>>>>>> " filename=/etc/prometheus/prometheus.yml
>>>>>>>> May 10 19:35:07 devops-mon-01 systemd[1]: Reloaded Prometheus.
>>>>>>>> May 10 19:53:22 devops-mon-01 prometheus[6076]: level=error
>>>>>>>> ts=2020-05-10T15:23:22.621Z caller=api.go:1347 component=web msg="error
>>>>>>>> writing response"
>>>>>>>> bytesWritten=0 err="write tcp 172.16.77.50:9090->
>>>>>>>> 172.16.76.168:56778: write: broken pipe"
>>>>>>>> May 10 20:25:53 devops-mon-01 prometheus[6076]: level=error
>>>>>>>> ts=2020-05-10T15:55:53.058Z caller=api.go:1347 component=web msg="error
>>>>>>>> writing response"
>>>>>>>> bytesWritten=0 err="write tcp 172.16.77.50:9090->
>>>>>>>> 172.16.76.168:41728: write: broken pipe"
>>>>>>>>
>>>>>>>> 172.16.77.50 is our Prometheus server and 172.16.76.168 is our
>>>>>>>> grafana server so I think the last error is not related to my problem
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "Prometheus Users" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to [email protected].
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/prometheus-users/156d8c36-c1de-4ca3-8b2a-2cfbcb5895fc%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/prometheus-users/156d8c36-c1de-4ca3-8b2a-2cfbcb5895fc%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Julius Volz
>>>>>>> PromLabs - promlabs.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> *Best Regards*
>>>>>>
>>>>>> *Yashar Nesabian*
>>>>>>
>>>>>> *Senior Site Reliability Engineer*
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Julius Volz
>>>>> PromLabs - promlabs.com
>>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Prometheus Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/prometheus-users/75c2fc70-af2a-4d23-beb8-0682bb250437%40googlegroups.com
>>> <https://groups.google.com/d/msgid/prometheus-users/75c2fc70-af2a-4d23-beb8-0682bb250437%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>
>>
>> --
>> Julius Volz
>> PromLabs - promlabs.com
>>
> --
> You received this message because you are subscribed to the Google Groups
> "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/515088f5-d330-4c16-92fd-e173f523017c%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-users/515088f5-d330-4c16-92fd-e173f523017c%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>


-- 
Julius Volz
PromLabs - promlabs.com

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CAObpH5wN8Yd2PYmmPkQXrXqcsQnk659nunExQfFESGKvP07iSA%40mail.gmail.com.

Reply via email to