Re: datadog failed to send report

2020-06-26 Thread seeksst
Original Message
From: seeksst seek...@163.com
To: Fanbin bufanbin...@coinbase.com
Sent: Friday, June 26, 2020, 23:36
Subject: Re: datadog failed to send report


Hi, I’m sorry for not explaining it clearly; I misread the exception. Use:
log4j.logger.org.apache.flink.metrics.datadog.DatadogHttpClient=ERROR
log4j.logger.org.apache.flink.runtime.metrics will not work for org.apache.flink.metrics;
it only affects org.apache.flink.runtime.metrics.


Also, you can see that there are several log configuration files in the conf/ folder.
Modifying the config helps control the log output. If it doesn’t work, maybe
log4j.properties is not the file being used.
You can read this article for answers [1]. If you’re still not sure, you can change all
of them, although a more granular configuration is recommended.
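For example, a minimal log4j.properties sketch of the granular approach (assuming the log4j 1.x properties format that ships with Flink 1.10; the root logger line is the one from the default shipped file):

    # Default root logger, as in the log4j.properties shipped with Flink 1.10.
    log4j.rootLogger=INFO, file
    # Silence the noisy "Failed sending request to Datadog" warnings from the reporter's HTTP client.
    log4j.logger.org.apache.flink.metrics.datadog.DatadogHttpClient=ERROR
    # This only affects the runtime metrics system, not the Datadog reporter itself.
    log4j.logger.org.apache.flink.runtime.metrics=ERROR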




I’m not familiar with Datadog (I use InfluxDB to collect metrics), but if it can collect metrics and the network is not a problem, the bottleneck may be in processing the requests, though I’m not sure. A SocketTimeoutException can occur in several situations:
1. The network is down.
You think the network is OK, so this is probably not it.
2. Server processing is slow.
Datadog may be handling many requests and can’t answer fast enough. You can check the CPU usage of the Datadog machine. Sometimes it depends on the program, for example if it uses one thread to handle all requests (this is something I don’t know about Datadog). If CPU usage is high, this may be the reason; if not, you need to learn more about Datadog.
3. Network transmission is slow.
You need to check the network: whether the traffic is saturated or the machines are physically far apart. You can also look for a way to increase the timeout.
4. Your job frequently triggers full GC.
You can check the GC log; this requires editing flink-conf.yaml, with
something like: env.java.opts.taskmanager: -Xloggc:LOG_DIR/taskmanager-gc.log
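A slightly fuller sketch of that flink-conf.yaml entry (hedged: LOG_DIR is a placeholder for your log directory, and the -XX flags are standard Java 8 HotSpot GC-logging options rather than anything Flink-specific; Java 9+ uses -Xlog:gc instead):

    env.java.opts.taskmanager: -Xloggc:LOG_DIR/taskmanager-gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps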
Best wishes to you.


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/logging.html


Original Message
From: Fanbin bufanbin...@coinbase.com
To: seeksst seek...@163.com
Sent: Friday, June 26, 2020, 05:38
Subject: Re: datadog failed to send report


This does not help:
log4j.logger.org.apache.flink.runtime.metrics=ERROR


I believe all machines can telnet to the Datadog port, since there are other metrics reported correctly.
How do I check the request-processing capacity?


On Tue, Jun 23, 2020 at 11:32 PM seeksst seek...@163.com wrote:

Hi,


If you don’t care about losing some metrics, you can edit log4j.properties to ignore it:
log4j.logger.org.apache.flink.runtime.metrics=ERROR
BTW, can all machines telnet to the Datadog port?
Does the number of requests exceed Datadog's processing capacity?


Original Message
From: Fanbin bufanbin...@coinbase.com
To: user u...@flink.apache.org
Sent: Wednesday, June 24, 2020, 12:05
Subject: datadog failed to send report


Hi,



Does anyone have any idea about the following error message? (It flooded my task manager log.)
I do have Datadog metrics present, so this probably only happens for some metrics.


2020-06-24 03:27:15,362 WARN org.apache.flink.metrics.datadog.DatadogHttpClient - Failed sending request to Datadog
java.net.SocketTimeoutException: timeout
    at org.apache.flink.shaded.okio.Okio$4.newTimeoutException(Okio.java:227)
    at org.apache.flink.shaded.okio.AsyncTimeout.exit(AsyncTimeout.java:284)
    at org.apache.flink.shaded.okio.AsyncTimeout$2.read(AsyncTimeout.java:240)
    at org.apache.flink.shaded.okio.RealBufferedSource.indexOf(RealBufferedSource.java:344)
    at org.apache.flink.shaded.okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:216)
    at org.apache.flink.shaded.okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:210)
    at org.apache.flink.shaded.okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
    at org.apache.flink.shaded.okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at org.apache.flink.shaded.okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at org.apache.flink.shaded.okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at org.apache.flink.shaded.okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at org.apache.flink.shaded.okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92

Re: datadog failed to send report

2020-06-23 Thread seeksst
Hi,


If you don’t care about losing some metrics, you can edit log4j.properties to ignore it:
log4j.logger.org.apache.flink.runtime.metrics=ERROR
BTW, can all machines telnet to the Datadog port?
Does the number of requests exceed Datadog's processing capacity?


Original Message
From: Fanbin bufanbin...@coinbase.com
To: user u...@flink.apache.org
Sent: Wednesday, June 24, 2020, 12:05
Subject: datadog failed to send report


Hi,



Does anyone have any idea about the following error message? (It flooded my task manager log.)
I do have Datadog metrics present, so this probably only happens for some metrics.


2020-06-24 03:27:15,362 WARN org.apache.flink.metrics.datadog.DatadogHttpClient - Failed sending request to Datadog
java.net.SocketTimeoutException: timeout
    at org.apache.flink.shaded.okio.Okio$4.newTimeoutException(Okio.java:227)
    at org.apache.flink.shaded.okio.AsyncTimeout.exit(AsyncTimeout.java:284)
    at org.apache.flink.shaded.okio.AsyncTimeout$2.read(AsyncTimeout.java:240)
    at org.apache.flink.shaded.okio.RealBufferedSource.indexOf(RealBufferedSource.java:344)
    at org.apache.flink.shaded.okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:216)
    at org.apache.flink.shaded.okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:210)
    at org.apache.flink.shaded.okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189)
    at org.apache.flink.shaded.okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at org.apache.flink.shaded.okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at org.apache.flink.shaded.okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at org.apache.flink.shaded.okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at org.apache.flink.shaded.okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
    at org.apache.flink.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
    at org.apache.flink.shaded.okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
    at org.apache.flink.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
    at org.apache.flink.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketException: Socket closed
    at java.net.SocketInputStream.read(SocketInputStream.java:204)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    at org.apache.flink.shaded.okio.Okio$2.read(Okio.java:138)
    at org.apache.flink.shaded.okio.AsyncTimeout$2.read(AsyncTimeout.java:236)
    ... 23 more

Flink 1.10.0 failover

2020-04-25 Thread seeksst
Hi,


  Recently, I found a problem when a job failed in 1.10.0: Flink didn’t release the resources first.

  You can see I used Flink on YARN, and it doesn’t allocate a task manager because there is no more memory left.
  If I cancel the job, the cluster has more memory.
  In 1.8.2, the job would restart normally. Is this a bug?
  Thanks.

[Attachments: three PNG screenshots]


Re: Flink 1.10.0 stop command

2020-04-22 Thread seeksst
Thanks a lot. I’m glad to hear that and look forward to 1.10.1.
Is there any further plan for the stop command? Will it replace cancel in the future? Is state.savepoints.dir required in the end?
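For reference, a rough sketch of the 1.10 CLI usage as I understand it (the savepoint path is only a placeholder; in 1.10, stop takes the savepoint directory from -p and otherwise falls back to state.savepoints.dir, while cancel with a savepoint still exists but is deprecated):

    # stop the job with a savepoint; -p overrides state.savepoints.dir
    ./bin/flink stop -p hdfs:///flink/savepoints <jobId>

    # deprecated in 1.10, but still available
    ./bin/flink cancel -s hdfs:///flink/savepoints <jobId>

    # plain cancel, no savepoint
    ./bin/flink cancel <jobId>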


Original Message
From: tison wander4...@gmail.com
To: Yun Tang myas...@live.com
Cc: seeksst seek...@163.com; user u...@flink.apache.org
Sent: Thursday, April 23, 2020, 01:21
Subject: Re: Flink 1.10.0 stop command


To be precise, the cancel command would succeed on the cluster side, but the response *might* be lost, so that the client throws a TimeoutException. If that is the case, this is the root cause, which will be fixed in 1.10.1.


Best,
tison.




tison wander4...@gmail.com wrote on Thursday, April 23, 2020 at 1:20 AM:

'flink cancel' is broken because of https://issues.apache.org/jira/browse/FLINK-16626


Best,
tison.




Yun Tang myas...@live.com wrote on Thursday, April 23, 2020 at 1:18 AM:

Hi


I think you could still use ./bin/flink cancel jobID to cancel the job. What is 
the exception thrown?


Best
Yun Tang

From: seeksst seek...@163.com
 Sent: Wednesday, April 22, 2020 18:17
 To: user user@flink.apache.org
 Subject: Flink 1.10.0 stop command

Hi,


 When I test 1.10.0, I found I must set a savepoint path, otherwise I can’t stop the job. I am confused about this because, as far as I know, a savepoint is often larger than a checkpoint, so I usually resume a job from a checkpoint. Another problem is that sometimes a job throws an exception and I can’t trigger a savepoint, so I cancel the job, change the logic, and resume it from the last checkpoint. Although this sometimes fails, I think it is a better way, because I can choose whether to cancel with a savepoint or not, so I can decide how to resume. But in 1.10.0 I must set it, and it seems the system will trigger a savepoint. I think this carries more risk, and it will delete the checkpoint even if I set retain-on-cancellation, so I have no checkpoint left. If I use cancel jobID, it breaks with an exception.
So how do I work with 1.10.0? Any advice would be helpful.
 Thanks.

Flink 1.10.0 stop command

2020-04-22 Thread seeksst
Hi,


 When I test 1.10.0, I found I must set a savepoint path, otherwise I can’t stop the job. I am confused about this because, as far as I know, a savepoint is often larger than a checkpoint, so I usually resume a job from a checkpoint. Another problem is that sometimes a job throws an exception and I can’t trigger a savepoint, so I cancel the job, change the logic, and resume it from the last checkpoint. Although this sometimes fails, I think it is a better way, because I can choose whether to cancel with a savepoint or not, so I can decide how to resume. But in 1.10.0 I must set it, and it seems the system will trigger a savepoint. I think this carries more risk, and it will delete the checkpoint even if I set retain-on-cancellation, so I have no checkpoint left. If I use cancel jobID, it breaks with an exception.
So how do I work with 1.10.0? Any advice would be helpful.
 Thanks.

Re: Flink upgrade to 1.10: function

2020-04-17 Thread seeksst
Hi,


   Thank you for the reply.


   I found it is caused by SqlStdOperatorTable and have tried many ways to change it, but failed. Finally, I decided to copy it and rename it. Another thing that caught my attention is that I also define a last_value function whose args are the same as the one in SqlStdOperatorTable, and I use CalciteConfig to replace the SqlOperatorTable, so I can use a CustomBasicOperatorTable instead of BasicOperatorTable and register my own last_value function. I changed AggregateUtil to make it work. It seems not perfect; do you have any advice on how to avoid changing AggregateUtil?


  The difference is that my JSON_VALUE function's args are not similar to the ones in SqlStdOperatorTable, so there may be no way to replace it. I have learned more about built-in functions from the source code; JSON_VALUE may not be realizable this way. Is it necessary to define so many unused functions in SqlStdOperatorTable? As far as I know, the parser decides many things and is limited by Calcite, so the conversion may produce information which I don’t want and can’t change. It seems hard to solve, and I have no idea.


   Best,
  L

Original Message
From: Jark Wu
To: Till Rohrmann; Danny Chan
Cc: seeksst; user; Timo Walther
Sent: Friday, April 17, 2020, 22:53
Subject: Re: Flink upgrade to 1.10: function
主题: Re: Flink upgrade to 1.10: function


Hi,


I guess you are already close to the truth. Since Flink 1.10 we upgraded Calcite to 1.21, which reserves JSON_VALUE as a keyword.
So a user-defined function can't use this name anymore. That's why JSON_VALUE(...) will always be parsed as Calcite's built-in function definition.
Currently, I don't have other ideas except using another function name...
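As a hedged illustration of that workaround (JsonValueUdf and MY_JSON_VALUE are hypothetical names standing in for the existing user-defined function; the query is otherwise the same shape as the original one):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.TableEnvironment;

    EnvironmentSettings settings =
        EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
    TableEnvironment tEnv = TableEnvironment.create(settings);
    // Register the UDF under a name that Calcite 1.21 does not reserve.
    tEnv.registerFunction("MY_JSON_VALUE", new JsonValueUdf());
    // Call it under the non-reserved name instead of JSON_VALUE.
    Table result = tEnv.sqlQuery("SELECT MY_JSON_VALUE(col1, col2) FROM table1");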


cc @Danny Chan , do you know why JSON_VALUE is a reserved keyword in Calcite? 
Could we change it into non-reserved keyword?


Best,
Jark






On Fri, 17 Apr 2020 at 21:00, Till Rohrmann  wrote:

Hi,


thanks for reporting these problems. I'm pulling in Timo and Jark who are 
working on the SQL component. They might be able to help you with your problem.


Cheers,
Till


On Thu, Apr 16, 2020 at 11:10 AM seeksst  wrote:

Hi, All


Recently, I tried to upgrade Flink from 1.8.2 to 1.10, but I met some problems with functions. In 1.8.2 there are just built-in functions and user-defined functions, but in 1.10 there are 4 categories of functions.
I defined a function named JSON_VALUE in my system; it doesn’t exist in 1.8.2 but is present in 1.10.0. Of course I use it in SQL, something like 'select JSON_VALUE(string, string) from table1’, with no catalog or database. The problem is that in 1.10.0 my function is recognized as SqlJsonValueFunction, and the args don’t match, so my SQL is wrong.
I read the documentation about Ambiguous Function Reference. In my understanding, my function will be registered as a temporary system function, and it should be chosen first, shouldn’t it? I tried to debug it and found some information:
First, the SQL is parsed by ParserImpl, and JSON_VALUE is parsed as a SqlBasicCall whose operator is SqlJsonValueFunction; it belongs to the SYSTEM catalog and its kind is OTHER_FUNCTION. Then SqlUtil.lookupRoutine will not find this SqlFunction, because it is not in BasicOperatorTable. My function is in FunctionCatalog, but SqlJsonValueFunction belongs to SYSTEM, not USER_DEFINED, so the program will not search for it in FunctionCatalog.
How can I solve this problem without modifying the SQL and the function name? My program can choose the Flink version and has many SQL jobs, so I don’t wish to modify the SQL and function names.
Thanks.

Flink upgrade to 1.10: function

2020-04-16 Thread seeksst
Hi, All


Recently, I tried to upgrade Flink from 1.8.2 to 1.10, but I met some problems with functions. In 1.8.2 there are just built-in functions and user-defined functions, but in 1.10 there are 4 categories of functions.
I defined a function named JSON_VALUE in my system; it doesn’t exist in 1.8.2 but is present in 1.10.0. Of course I use it in SQL, something like 'select JSON_VALUE(string, string) from table1’, with no catalog or database. The problem is that in 1.10.0 my function is recognized as SqlJsonValueFunction, and the args don’t match, so my SQL is wrong.
I read the documentation about Ambiguous Function Reference. In my understanding, my function will be registered as a temporary system function, and it should be chosen first, shouldn’t it? I tried to debug it and found some information:
First, the SQL is parsed by ParserImpl, and JSON_VALUE is parsed as a SqlBasicCall whose operator is SqlJsonValueFunction; it belongs to the SYSTEM catalog and its kind is OTHER_FUNCTION. Then SqlUtil.lookupRoutine will not find this SqlFunction, because it is not in BasicOperatorTable. My function is in FunctionCatalog, but SqlJsonValueFunction belongs to SYSTEM, not USER_DEFINED, so the program will not search for it in FunctionCatalog.
How can I solve this problem without modifying the SQL and the function name? My program can choose the Flink version and has many SQL jobs, so I don’t wish to modify the SQL and function names.
Thanks.

Re: How to debug checkpoints failing to complete

2020-03-23 Thread seeksst
Hi:
    According to my experience, there are several possible reasons for a checkpoint to fail:
    1. If you use RocksDB as the backend, insufficient disk space will cause it, because the files are saved on the local disk, and you may see an exception.
    2. The sink can’t be written. Not all parallel subtasks can complete, and there is often no visible symptom.
    3. Back pressure. Data skew causes one subtask to take on more of the calculation, so the checkpoint can’t finish.
    Here is my advice:
    1. Learn more about how checkpointing works:
https://ci.apache.org/projects/flink/flink-docs-release-1.10/internals/stream_checkpointing.html
    2. Try to measure back pressure:
https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/back_pressure.html
    3. If there is no data skew, you can set more parallelism, or you can adjust the checkpoint parameters (see the sketch below).
On my machine I have a Hadoop environment, so I submit jobs to YARN and can use the dashboard to check back pressure.
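A minimal Java sketch of the checkpoint parameters I mean (the values are only placeholders, not recommendations; it assumes a DataStream job configured through StreamExecutionEnvironment):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Take a checkpoint every 60 seconds.
    env.enableCheckpointing(60_000);
    // Give a slow checkpoint up to 10 minutes before it is considered failed.
    env.getCheckpointConfig().setCheckpointTimeout(10 * 60 * 1000);
    // Leave at least 30 seconds between the end of one checkpoint and the start of the next.
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
    // Never run checkpoints concurrently.
    env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);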

On 2020/03/23 15:14:33, Stephen Connolly  wrote:
> We have a topology and the checkpoints fail to complete a *lot* of the time.
>
> Typically it is just one subtask that fails.
>
> We have a parallelism of 2 on this topology at present and the other
> subtask will complete in 3ms though the end to end duration on the rare
> times when the checkpointing completes is like 4m30
>
> How can I start debugging this? When I run locally on my development
> cluster I have no issues, the issues only seem to show in production.

backpressure and memory

2020-03-22 Thread seeksst
Hi, everyone:


I’m a Flink SQL user, and the version is 1.8.2.
Recently I have been confused about memory and back pressure. I have two jobs on YARN; because they exceed their memory, they are frequently killed by YARN.
For one job, I have 3 task managers and 6 parallelism, and each task manager has 8 GB of memory. It reads from Kafka and uses a one-minute tumbling window to calculate PV and UV. There are many aggregation dimensions; to avoid data skew, it groups by deviceId, TUMBLE(event_time, INTERVAL '1' MINUTE). My question is: the checkpoint is just 60 MB, and I give the job 24 GB of memory, so why was it killed by YARN? I use RocksDB as the backend and the data is big, but I think it should trigger back pressure rather than OOM, although back pressure doesn’t trigger. The in-pool usage is 0.45 normally.
The other job looks different: I use 2 task managers and 4 parallelism, each with 20 GB of memory. I define an aggregate function to calculate complex data, grouped by date, hour, deviceId. It looks like the first job: OOM and no back pressure. But the problem is that when I read one day of data, just one task manager is killed by YARN, and I am confused about this. According to the dashboard I don't see data skew, so why just one task manager?
Maybe it’s the same question or not, but I want to know more about how memory is used in Flink, whether back pressure can stop the source and how to trigger it, and how RocksDB affects Flink.
Thanks for reading; it would be better if there were some suggestions. Thank you.
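For reference, a hedged sketch of the flink-conf.yaml knobs usually involved in this kind of YARN kill on 1.8.x (the keys come from the 1.8 configuration docs; the values are only illustrative, and RocksDB's native memory is what typically needs the extra headroom):

    # Fraction of the YARN container memory reserved for non-heap use (RocksDB native memory lives here).
    containerized.heap-cutoff-ratio: 0.4
    # Never reserve less than this many MB, regardless of the ratio.
    containerized.heap-cutoff-min: 1024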