Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

2018-07-30 Thread Tony Wei
Hi Julio,

As Gordon said, the `records_lag_max` metric is a Kafka-shipped metric [1].
I also found this thread [2] on the Kafka mailing list. It seems this is by
design inside Kafka, so I don't think there is anything we can do in the
Flink-Kafka-Connector.
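
Looking at that thread, my understanding (an assumption on my part, not a
quote of Kafka's actual metrics code) is that a windowed "max" starts at
negative infinity and simply stays there when no samples land in the current
window, which is what shows up as -Inf in Prometheus. A toy sketch of that
behaviour:

public class WindowedMaxSketch {

    // Toy illustration only: a max over the samples seen in a window,
    // initialized to -Infinity. With zero samples it never moves.
    static double maxOf(double[] samplesInWindow) {
        double max = Double.NEGATIVE_INFINITY;
        for (double sample : samplesInWindow) {
            max = Math.max(max, sample);
        }
        return max;
    }

    public static void main(String[] args) {
        System.out.println(maxOf(new double[] {83529, 84096})); // 84096.0
        System.out.println(maxOf(new double[] {}));             // -Infinity
    }
}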

BTW, the Kafka documentation [1] says `records_lag_max` is the maximum lag,
in number of records, for any partition in this "window". I'm not sure what
this "window" means or whether it is configurable. If it is, you can pass the
config property directly to the Flink-Kafka-Connector and it will be handed
to the Kafka consumer.
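
If that "window" turns out to be the Kafka client's metrics sampling settings
(an assumption on my part -- metrics.sample.window.ms / metrics.num.samples),
forwarding them would look roughly like this; broker address, group id and
topic below are placeholders:

import java.util.Properties;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class LagWindowExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "my-group");            // placeholder

        // These are plain Kafka consumer properties, so the connector just
        // forwards them to the internally used KafkaConsumer.
        props.setProperty("metrics.sample.window.ms", "60000");
        props.setProperty("metrics.num.samples", "2");

        env.addSource(new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), props))
           .print();

        env.execute("records-lag-max window example");
    }
}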

Best,
Tony Wei

[1] https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics
[2] https://lists.apache.org/thread.html/92475e5eb0c1a5fd08e49c30b3fef45213b8626e8fea8d52993c0d8c@%3Cusers.kafka.apache.org%3E

2018-07-31 1:36 GMT+08:00 Julio Biason :

> Hey Gordon,
>
> (Reviving this long thread) I think I found part of the problem: It seems
> the metric captures the lag from time to time and resets the value
> in-between. I managed to replicate this by attaching a SQL sink
> (JDBCOutputFormat) connecting to an external database -- something that took
> about 2 minutes to write 500 records.
>
> I opened the ticket https://issues.apache.org/jira/browse/FLINK-9998 with
> a bit more information about this ('cause I completely forgot to open a
> ticket a month ago about this).
>
> On Thu, Jun 14, 2018 at 11:31 AM, Julio Biason 
> wrote:
>
>> Hey Gordon,
>>
> The job restarted somewhere in the middle of the night (I haven't checked
> why yet) and now I have this weird situation where the first TaskManager has
> only one valid lag, the second has 2 and the third has none.
>>
> I dunno if I could see the partition assignment in the logs, but all
> "numRecordsOut" values are increasing over time (I've attached a screenshot
> of the graphs).
>>
>> On Thu, Jun 14, 2018 at 5:27 AM, Tzu-Li (Gordon) Tai wrote:
>>
>>> Hi,
>>>
>>> Thanks for the extra information. So, there seems to be 2 separate
>>> issues here. I’ll go through them one by one.
>>>
>>> I was also checking our Grafana and the metric we were using was
>>> "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max",
>>> actually. "flink_taskmanager_job_task_operator_records_lag_max" seems
>>> to be new (with the attempt thingy).
>>>
>>>
>>> After looking at the code changes in FLINK-8419, this is unfortunately an
>>> accidental “break” in the scope of the metric.
>>> In 1.4.0, the Kafka-shipped metrics were exposed under the
>>> “KafkaConsumer” metrics group. After FLINK-8419, this was changed, as you
>>> observed.
>>> In 1.5.0, however, I think the metrics are exposed under both patterns.
>>>
>>> Now, regarding the fact that some subtasks are returning -Inf for
>>> ‘records-lag-max’:
>>> If I understood the metric semantics correctly, this metric represents
>>> the "max record lag across **partitions subscribed by a Kafka consumer
>>> client**".
>>> So, the only possibilities I could think of causing this are that either
>>> the subtask does not have any partitions assigned to it, or there is
>>> simply a bug with the Kafka client returning this value.
>>>
>>> Could you verify that all subtasks have a partition assigned to them?
>>> That should be possible by just checking the job status in the Web UI and
>>> observing the numRecordsOut value for each source subtask.
>>>
>>> Cheers,
>>> Gordon
>>>
>>>
>>> On 13 June 2018 at 2:05:17 PM, Julio Biason (julio.bia...@azion.com)
>>> wrote:
>>>
>>> Hi Gordon,
>>>
>>> We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10
>>> driver.
>>>
>>> There are a bunch of flink_taskmanager_job_task_operator_* metrics,
>>> including some about the committed offset for each partition. It seems I
>>> have 4 different records_lag_max with different attempt_id, though, 3 with
>>> -Inf and 1 with a value -- which means I'll need some more understanding
>>> of Prometheus to extract this properly.
>>>
>>> I was also checking our Grafana and the metric we were using was
>>> "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max",
>>> actually. "flink_taskmanager_job_task_operator_records_lag_max" seems
>>> to be new (with the attempt thingy).
>>>
>>> On the "KafkaConsumer" front, but it only has the "commited_offset" for
>>> each partition.
>>>
>> On Wed, Jun 13, 2018 at 5:41 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote:
>>>
 Hi,

 Which Kafka version are you using?

 AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x
 series would be FLINK-8419 [1].
 The ‘records_lag_max’ metric is a Kafka-shipped metric simply forwarded
 from the internally used Kafka client, so nothing should have been 
 affected.

 Do you see other metrics under the pattern of
 ‘flink_taskmanager_job_task_operator_*’? All Kafka-shipped metrics
 should still follow this pattern.
 If not, could you find the ‘records_lag_max’ metric (or any other
 Kafka-shipped metrics [2]) under the user scope ‘KafkaConsumer’?

Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

2018-07-30 Thread Julio Biason
Hey Gordon,

(Reviving this long thread) I think I found part of the problem: It seems
the metric captures the lag from time to time and resets the value
in-between. I managed to replicate this by attaching a SQL sink
(JDBCOutputFormat) connecting to an external database -- something that took
about 2 minutes to write 500 records.
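
For reference, the sink in my repro was attached more or less like the sketch
below -- driver, connection string, credentials and query are placeholders,
not the actual job:

import org.apache.flink.api.java.io.jdbc.JDBCOutputFormat;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.types.Row;

public class SlowJdbcSinkSketch {

    // Attach a JDBCOutputFormat that writes each Row to an external database.
    // Writing row by row against a remote DB is slow enough to reproduce the
    // lag metric resetting in between.
    public static void attachSink(DataStream<Row> rows) {
        JDBCOutputFormat outputFormat = JDBCOutputFormat.buildJDBCOutputFormat()
                .setDrivername("org.postgresql.Driver")                     // placeholder
                .setDBUrl("jdbc:postgresql://db.example.com:5432/metrics")  // placeholder
                .setUsername("flink")                                       // placeholder
                .setPassword("secret")                                      // placeholder
                .setQuery("INSERT INTO events (id, payload) VALUES (?, ?)") // placeholder
                .setBatchInterval(1) // flush every row, which keeps it slow
                .finish();

        rows.writeUsingOutputFormat(outputFormat);
    }
}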

I opened the ticket https://issues.apache.org/jira/browse/FLINK-9998 with a
bit more information about this ('cause I completely forgot to open a
ticket a month ago about this).

On Thu, Jun 14, 2018 at 11:31 AM, Julio Biason 
wrote:

> Hey Gordon,
>
> The job restarted somewhere in the middle of the night (I haven't checked
> why yet) and now I have this weird situation where the first TaskManager has
> only one valid lag, the second has 2 and the third has none.
>
> I dunno if I could see the partition assignment in the logs, but all
> "numRecordsOut" values are increasing over time (I've attached a screenshot
> of the graphs).
>
> On Thu, Jun 14, 2018 at 5:27 AM, Tzu-Li (Gordon) Tai 
> wrote:
>
>> Hi,
>>
>> Thanks for the extra information. So, there seems to be 2 separate issues
>> here. I’ll go through them one by one.
>>
>> I was also checking our Grafana and the metric we were using was
>> "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max",
>> actually. "flink_taskmanager_job_task_operator_records_lag_max" seems to
>> be new (with the attempt thingy).
>>
>>
>> After looking at the code changes in FLINK-8419, this is unfortunately an
>> accidental “break” in the scope of the metric.
>> In 1.4.0, the Kafka-shipped metrics were exposed under the
>> “KafkaConsumer” metrics group. After FLINK-8419, this was changed, as you
>> observed.
>> In 1.5.0, however, I think the metrics are exposed under both patterns.
>>
>> Now, regarding the fact that some subtasks are returning -Inf for
>> ‘records-lag-max’:
>> If I understood the metric semantics correctly, this metric represents
>> the "max record lag across **partitions subscribed by a Kafka consumer
>> client**".
>> So, the only possibilities I could think of causing this are that either
>> the subtask does not have any partitions assigned to it, or there is simply
>> a bug with the Kafka client returning this value.
>>
>> Could you verify that all subtasks have a partition assigned to them?
>> That should be possible by just checking the job status in the Web UI and
>> observing the numRecordsOut value for each source subtask.
>>
>> Cheers,
>> Gordon
>>
>>
>> On 13 June 2018 at 2:05:17 PM, Julio Biason (julio.bia...@azion.com)
>> wrote:
>>
>> Hi Gordon,
>>
>> We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10
>> driver.
>>
>> There are a bunch of flink_taskmanager_job_task_operator_* metrics,
>> including some about the committed offset for each partition. It seems I
>> have 4 different records_lag_max with different attempt_id, though, 3 with
>> -Inf and 1 with a value -- which means I'll need some more understanding
>> of Prometheus to extract this properly.
>>
>> I was also checking our Grafana and the metric we were using was
>> "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max",
>> actually. "flink_taskmanager_job_task_operator_records_lag_max" seems to
>> be new (with the attempt thingy).
>>
>> On the "KafkaConsumer" front, but it only has the "commited_offset" for
>> each partition.
>>
>> On Wed, Jun 13, 2018 at 5:41 AM, Tzu-Li (Gordon) Tai wrote:
>>
>>> Hi,
>>>
>>> Which Kafka version are you using?
>>>
>>> AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x
>>> series would be FLINK-8419 [1].
>>> The ‘records_lag_max’ metric is a Kafka-shipped metric simply forwarded
>>> from the internally used Kafka client, so nothing should have been affected.
>>>
>>> Do you see other metrics under the pattern of
>>> ‘flink_taskmanager_job_task_operator_*’? All Kafka-shipped metrics
>>> should still follow this pattern.
>>> If not, could you find the ‘records_lag_max’ metric (or any other
>>> Kafka-shipped metrics [2]) under the user scope ‘KafkaConsumer’?
>>>
>>> The above should provide more insight into what may be wrong here.
>>>
>>> - Gordon
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-8419
>>> [2] https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics
>>>
>>> On 12 June 2018 at 11:47:51 PM, Julio Biason (julio.bia...@azion.com)
>>> wrote:
>>>
>>> Hey guys,
>>>
>>> I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus
>>> monitoring is not getting the current Kafka lag.
>>>
>>> After updating to 1.4.2 and symlinking
>>> opt/flink-metrics-prometheus-1.4.2.jar into lib/, I got the metrics back
>>> on Prometheus, but the most important one,
>>> flink_taskmanager_job_task_operator_records_lag_max, is now returning -Inf.
>>>
>>> Did I miss something?
>>>
>>> --
>>> Julio Biason, Software Engineer
>>> AZION  |  Deliver. Accelerate. Protect.
>>> Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
>>>
>>>
>>

Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

2018-06-14 Thread Tzu-Li (Gordon) Tai
Hi,

Thanks for the extra information. So, there seems to be 2 separate issues here. 
I’ll go through them one by one.

I was also checking our Grafana and the metric we were using was 
"flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max", actually. 
"flink_taskmanager_job_task_operator_records_lag_max" seems to be new (with the 
attempt thingy).

After looking at the code changes in FLINK-8419, this is unfortunately an
accidental “break” in the scope of the metric.
In 1.4.0, the Kafka-shipped metrics were exposed under the “KafkaConsumer” 
metrics group. After FLINK-8419, this was changed, as you observed.
In 1.5.0, however, I think the metrics are exposed under both patterns.

Now, regarding the fact that some subtasks are returning -Inf for
‘records-lag-max’:
If I understood the metric semantics correctly, this metric represents the "max
record lag across **partitions subscribed by a Kafka consumer client**".
So, the only possibilities I could think of causing this are that either the
subtask does not have any partitions assigned to it, or there is simply a bug
with the Kafka client returning this value.

Could you verify that all subtasks have a partition assigned to them? That
should be possible by just checking the job status in the Web UI and observing
the numRecordsOut value for each source subtask.

Cheers,
Gordon


On 13 June 2018 at 2:05:17 PM, Julio Biason (julio.bia...@azion.com) wrote:

Hi Gordon,

We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10 driver.

There are a bunch of flink_taskmanager_job_task_operator_* metrics, including 
some about the committed offset for each partition. It seems I have 4 different 
records_lag_max with different attempt_id, though, 3 with -Inf and 1 with a 
value -- which means I'll need some more understanding of Prometheus to
extract this properly.

I was also checking our Grafana and the metric we were using was 
"flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max", actually. 
"flink_taskmanager_job_task_operator_records_lag_max" seems to be new (with the 
attempt thingy).

On the "KafkaConsumer" front, but it only has the "commited_offset" for each 
partition.

On Wed, Jun 13, 2018 at 5:41 AM, Tzu-Li (Gordon) Tai  
wrote:
Hi,

Which Kafka version are you using?

AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x series 
would be FLINK-8419 [1].
The ‘records_lag_max’ metric is a Kafka-shipped metric simply forwarded from 
the internally used Kafka client, so nothing should have been affected.

Do you see other metrics under the pattern of 
‘flink_taskmanager_job_task_operator_*’? All Kafka-shipped metrics should still 
follow this pattern.
If not, could you find the ‘records_lag_max’ metric (or any other Kafka-shipped 
metrics [2]) under the user scope ‘KafkaConsumer’?

The above should provide more insight into what may be wrong here.

- Gordon

[1] https://issues.apache.org/jira/browse/FLINK-8419
[2] https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics

On 12 June 2018 at 11:47:51 PM, Julio Biason (julio.bia...@azion.com) wrote:

Hey guys,

I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus 
monitoring is not getting the current Kafka lag.

After updating to 1.4.2 and symlinking opt/flink-metrics-prometheus-1.4.2.jar
into lib/, I got the metrics back on Prometheus, but the most important one,
flink_taskmanager_job_task_operator_records_lag_max, is now returning -Inf.

Did I miss something?

--
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554



--
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554

Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

2018-06-13 Thread Julio Biason
Just to add some more info, here is the data I have on Prometheus (with
some names redacted):

flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="0",task_attempt_id="fa104111e1f493bbec6f4b2ce44ec1da",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"}
83529
flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="3",task_attempt_id="4c5d8395882fb1ad26bdd6fd7f4e789c",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"}
-Inf
flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="6",task_attempt_id="f05e3171c446c19c7b928eeffe0fa52f",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"}
-Inf
flink_taskmanager_job_task_operator_records_lag_max{host="002s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="9",task_attempt_id="9533c5fa9fafadf4878ce90a08f83213",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="0496ee3dc28c7cad5a512b4aefce67fa"}
84096
flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="1",task_attempt_id="7ea45523850f5d08bab719418321e410",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"}
83867
flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="10",task_attempt_id="cf6c2349ccf818f6870fdf0296be121b",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"}
83829
flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="4",task_attempt_id="b39c5366b90d74e57d058b64e9e08e56",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"}
-Inf
flink_taskmanager_job_task_operator_records_lag_max{host="003s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="7",task_attempt_id="db563e7e360227585cff9fa3d0035b0d",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="8f68b231855c663ae6b3c2362d39568a"}
-Inf
flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="11",task_attempt_id="4e9231f9187b0dffc728d8cd77cfef9e",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"}
83730
flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="2",task_attempt_id="e920a1672b0a31c2d186e3f6fee38bed",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"}
-Inf
flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="5",task_attempt_id="0e22fd213905d4da222e3651e7007106",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"}
83472
flink_taskmanager_job_task_operator_records_lag_max{host="004s",operator_id="cbc357ccb763df2852fee8c4fc7d55f2",subtask_index="8",task_attempt_id="f36fe63b0688a821f5abf685551c47fa",task_attempt_num="11",task_id="cbc357ccb763df2852fee8c4fc7d55f2",tm_id="9098e39a467aa6c255dcf2ec44544cb2"}
83842

What we have are 3 servers running with 4 slots each. Our Kafka topic has 12
partitions and the job is running with a parallelism of 12. In this setup,
I'd expect each slot to grab one partition and process it, giving me a
different lag for each partition. But it seems some slots are stubbornly
refusing to either grab a partition or update their status. It doesn't seem
(from the perspective of someone who doesn't know the code) to be related to
TaskManagers sharing the same Kafka connection, as 004 is reporting lag for 3
partitions while 002 and 003 are reporting it for just 2.
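
For what it's worth, the distribution I was expecting is just partitions
spread round-robin over the parallel subtasks (a toy illustration of my
expectation, not Flink's actual assignment code):

import java.util.ArrayList;
import java.util.List;

public class ExpectedAssignmentSketch {
    public static void main(String[] args) {
        int numPartitions = 12;
        int parallelism = 12;

        // Distribute partitions round-robin: with 12 partitions and 12
        // subtasks, every subtask should end up with exactly one partition,
        // and therefore one lag value.
        List<List<Integer>> perSubtask = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            perSubtask.add(new ArrayList<Integer>());
        }
        for (int partition = 0; partition < numPartitions; partition++) {
            perSubtask.get(partition % parallelism).add(partition);
        }
        for (int i = 0; i < parallelism; i++) {
            System.out.println("subtask " + i + " -> partitions " + perSubtask.get(i));
        }
    }
}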

And I was wrong about the attempt_id: It's not what's messing with my
Prometheus query, it's some slots reporting -Inf on their partitions.

On Wed, Jun 13, 2018 at 9:05 AM, Julio Biason 
wrote:

> Hi Gordon,
>
> We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10
> driver.
>
> There are a bunch of flink_taskmanager_job_task_operator_* metrics,
> including some about the committed offset for each partition. It seems I
> have 4 different records_lag_max with different attempt_id, though, 3 with
> -Inf and 1 with a value -- which means I'll need some more understanding of
> Prometheus to extract this properly.
>
> I was also checking our Grafana and the metric we were using was
> "flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max",
> actually. "flink_taskmana

Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

2018-06-13 Thread Julio Biason
Hi Gordon,

We have Kafka 0.10.1.1 running and use the flink-connector-kafka-0.10
driver.

There are a bunch of flink_taskmanager_job_task_operator_* metrics,
including some about the committed offset for each partition. It seems I
have 4 different records_lag_max with different attempt_id, though, 3 with
-Inf and 1 with a value -- which means I'll need some more understanding of
Prometheus to extract this properly.

I was also checking our Grafana and the metric we were using was
"flink_taskmanager_job_task_operator_KafkaConsumer_records_lag_max",
actually. "flink_taskmanager_job_task_operator_records_lag_max" seems to be
new (with the attempt thingy).

On the "KafkaConsumer" front, but it only has the "commited_offset" for
each partition.

On Wed, Jun 13, 2018 at 5:41 AM, Tzu-Li (Gordon) Tai 
wrote:

> Hi,
>
> Which Kafka version are you using?
>
> AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x
> series would be FLINK-8419 [1].
> The ‘records_lag_max’ metric is a Kafka-shipped metric simply forwarded
> from the internally used Kafka client, so nothing should have been affected.
>
> Do you see other metrics under the pattern of 
> ‘flink_taskmanager_job_task_operator_*’?
> All Kafka-shipped metrics should still follow this pattern.
> If not, could you find the ‘records_lag_max’ metric (or any other
> Kafka-shipped metrics [2]) under the user scope ‘KafkaConsumer’?
>
> The above should provide more insight into what may be wrong here.
>
> - Gordon
>
> [1] https://issues.apache.org/jira/browse/FLINK-8419
> [2] https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics
>
> On 12 June 2018 at 11:47:51 PM, Julio Biason (julio.bia...@azion.com)
> wrote:
>
> Hey guys,
>
> I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus
> monitoring is not getting the current Kafka lag.
>
> After updating to 1.4.2 and symlinking
> opt/flink-metrics-prometheus-1.4.2.jar into lib/, I got the metrics back on
> Prometheus, but the most important one,
> flink_taskmanager_job_task_operator_records_lag_max, is now returning -Inf.
>
> Did I miss something?
>
> --
> Julio Biason, Software Engineer
> AZION  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554
>
>


-- 
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554


Re: flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

2018-06-13 Thread Tzu-Li (Gordon) Tai
Hi,

Which Kafka version are you using?

AFAIK, the only recent changes to Kafka connector metrics in the 1.4.x series 
would be FLINK-8419 [1].
The ‘records_lag_max’ metric is a Kafka-shipped metric simply forwarded from 
the internally used Kafka client, so nothing should have been affected.

Do you see other metrics under the pattern of 
‘flink_taskmanager_job_task_operator_*’? All Kafka-shipped metrics should still 
follow this pattern.
If not, could you find the ‘records_lag_max’ metric (or any other Kafka-shipped 
metrics [2]) under the user scope ‘KafkaConsumer’?

The above should provide more insight into what may be wrong here.

- Gordon

[1] https://issues.apache.org/jira/browse/FLINK-8419
[2] https://docs.confluent.io/current/kafka/monitoring.html#fetch-metrics

On 12 June 2018 at 11:47:51 PM, Julio Biason (julio.bia...@azion.com) wrote:

Hey guys,

I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus 
monitoring is not getting the current Kafka lag.

After updating to 1.4.2 and symlinking opt/flink-metrics-prometheus-1.4.2.jar
into lib/, I got the metrics back on Prometheus, but the most important one,
flink_taskmanager_job_task_operator_records_lag_max, is now returning -Inf.

Did I miss something?

--
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554

flink_taskmanager_job_task_operator_records_lag_max == -Inf on Flink 1.4.2

2018-06-12 Thread Julio Biason
Hey guys,

I just updated our Flink install from 1.4.0 to 1.4.2, but our Prometheus
monitoring is not getting the current Kafka lag.

After updating to 1.4.2 and symlinking opt/flink-metrics-prometheus-1.4.2.jar
into lib/, I got the metrics back on Prometheus, but the most important one,
flink_taskmanager_job_task_operator_records_lag_max, is now returning -Inf.

Did I miss something?

-- 
Julio Biason, Software Engineer
AZION  |  Deliver. Accelerate. Protect.
Office: +55 51 3083 8101  |  Mobile: +55 51 99907 0554