Re: Re: Issues with RichMapFunction

2020-05-24 Thread guanyq
>> 1. The parallelism of this RichMapFunction won't go up; it caps at 4, and raising it causes all kinds of problems, even though the host memory and the memory allocated to the taskmanager are sufficient. -- Could you paste the code? -- And the submit command. >> 2. All slots of this RichMapFunction are assigned to the same taskmanager, i.e. the same host, and I haven't found an interface to spread them across different taskmanagers. -- In which mode was the job submitted (yarn session, yarn, or standalone)? On 2020-05-25 11:47:48, "tison" wrote:

Pojo List and Map Data Type in UDFs

2020-05-24 Thread lec ssmi
Hi: I received a Java POJO serialized as a JSON string from Kafka, and I want to use a UDTF to restore it to a table with a structure similar to the POJO. Some member variables of the POJO use List or Map types whose generic type is also a POJO. Sample code as below: public class Car

Re: java.lang.NoSuchMethodError while writing to Kafka from Flink

2020-05-24 Thread tison
Could you try to download the binary dist from the Flink download page and re-execute the job? It seems like something is wrong with flink-dist.jar. BTW, please post user questions only on the user mailing list (not dev). Best, tison. Guowei Ma wrote on Mon, May 25, 2020 at 10:49 AM: > Hi > 1. You could check whether the

Re: Issues with RichMapFunction

2020-05-24 Thread Jiacheng Jiang
With Flink 1.10 you can spread slots across taskmanagers via cluster.evenly-spread-out-slots: true ---- Original message ---- From: "xue...@outlook.com"
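The option mentioned above goes in flink-conf.yaml. A minimal fragment (a sketch; the option is available as of Flink 1.9.2/1.10):

```yaml
# Prefer spreading slots across all available TaskManagers
# instead of filling up one TaskManager before using the next.
cluster.evenly-spread-out-slots: true
```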

Re: Issues with RichMapFunction

2020-05-24 Thread tison
Regarding the first question, it would help to spell out what the "various problems" actually are. Regarding the second question, as far as I know Flink currently does not support specifying scheduling locations at the per-parallel-subtask level; a workaround is to configure each TM to hold only one slot. I'm cc'ing Xintong here; his work may help you. Best, tison. xue...@outlook.com wrote on Mon, May 25, 2020 at 11:29 AM: > Two problems: > Background: Flink v1.10 cluster, dozens of hosts, each with 16 CPUs and 50 GB of memory; the job's overall parallelism is 200 > For example, one of my RichMapFunctions loads existing data in open(). >

Issues with RichMapFunction

2020-05-24 Thread xue...@outlook.com
Two problems. Background: Flink v1.10 cluster, dozens of hosts, each with 16 CPUs and 50 GB of memory; the job's overall parallelism is 200. For example, one of my RichMapFunctions loads existing data in open(). Because the dimension data and the main data are very discrete, all of the dimension data needs to be loaded into memory. 1. The parallelism of this RichMapFunction won't go up; it caps at 4, and raising it causes all kinds of problems, even though host memory and the memory allocated to the taskmanager are sufficient. 2. All slots of this RichMapFunction are assigned to the same taskmanager, i.e. the same host, and I haven't found an interface to spread them across different taskmanagers. To put it simply:

Re: Could not find a suitable table factory for 'TableSourceFactory'

2020-05-24 Thread Leonard Xu
Hi, indeed: there are too many connector jars, the DataStream/Table APIs come in two sets, and users also need to import format jars; this is genuinely confusing for users. The community is also discussing a Flink packaging proposal [1] to lower the onboarding cost. Best, Leonard Xu [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Releasing-quot-fat-quot-and-quot-slim-quot-Flink-distributions-tc40237.html#none

Re: Apache Flink - Question about application restart

2020-05-24 Thread Zhu Zhu
Hi M, Regarding your questions: 1. Yes, the id is fixed once the job graph is generated. 2. Yes. Regarding yarn mode: 1. The job id stays the same because the job graph is generated once on the client side and persisted in DFS for reuse. 2. Yes, if high availability is enabled. Thanks, Zhu Zhu M

Re: Flink Dashboard UI Tasks hard limit

2020-05-24 Thread Xintong Song
> > Increasing network memory buffers (fraction, min, max) seems to increase > tasks slightly. That's weird. I don't think the number of network memory buffers has anything to do with the task count. Let me try to clarify a few things. Please be aware that how many tasks a Flink job has, and

Re: java.lang.NoSuchMethodError while writing to Kafka from Flink

2020-05-24 Thread Guowei Ma
Hi 1. You could check whether 'org.apache.flink.api.java.clean' is on your classpath first. 2. Did you follow the doc [1] to deploy your local cluster and run some existing examples such as WordCount? [1]

Re: Could not find a suitable table factory for 'TableSourceFactory'

2020-05-24 Thread Leonard Xu
Hi, > One more question: earlier, following the docs, I could never get `flink-connector-kafka_2.11` to run; after seeing others hit the same problem, I switched to > `flink-sql-connector-kafka-0.11` > and it worked. What's the difference between the two? If they differ, the Table API docs should point out the latter. flink-connector-kafka_2.11 is for DataStream API programs; flink-sql-connector-kafka-0.11_2.11 is for Table API & SQL
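The distinction Leonard describes shows up as two different artifacts in the build file. A hedged sbt sketch using the versions from this thread (which artifact you need depends on the API you program against):

```scala
// DataStream API programs use the plain connector:
libraryDependencies += "org.apache.flink" %% "flink-connector-kafka" % "1.10.0"
// Table API & SQL programs use the fat "sql" connector jar,
// here the one for Kafka 0.11 brokers:
libraryDependencies += "org.apache.flink" % "flink-sql-connector-kafka-0.11_2.11" % "1.10.0"
```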

Re: Flink SQL newbie question: RowTime field should not be null, please convert it to a non-null long value

2020-05-24 Thread Leonard Xu
Hi, > One more small question, similar to the above: how do I write Flink SQL that skips Kafka messages without a ts field? Whether to fail on a parse error or skip the bad record, the JSON format has two options you can configure: 'format.fail-on-missing-field' = 'true', -- optional: flag whether to fail if a field is missing or not, -- 'false' by default
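For context, a hedged sketch of where such a format option sits in a 1.10-style DDL WITH clause (the table, topic, and field names here are made up for illustration):

```sql
-- hypothetical table; only the format option is the point here
CREATE TABLE access_log (
  ts BIGINT
) WITH (
  'connector.type' = 'kafka',
  'connector.version' = '0.11',
  'connector.topic' = 'access-log',
  'format.type' = 'json',
  -- tolerate records with a missing field instead of failing:
  'format.fail-on-missing-field' = 'false'
)
```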

Re: Could not find a suitable table factory for 'TableSourceFactory'

2020-05-24 Thread macia kk
One more question: earlier, following the docs, I could never get `flink-connector-kafka_2.11` to run; after seeing others hit the same problem, I switched to `flink-sql-connector-kafka-0.11` and it worked. What's the difference between the two? If they differ, the Table API docs should point out the latter. macia kk wrote on Mon, May 25, 2020 at 10:05 AM: > built.sbt > > val flinkVersion = "1.10.0" > libraryDependencies ++= Seq( > "org.apache.flink" %%

Re: How do I get the IP of the master and slave files programmatically in Flink?

2020-05-24 Thread Yangze Guo
Glad to see that! However, I was told that directly extending `AbstractUdfStreamOperator` in the DataStream API is not the right approach. This would be fixed at some point (maybe Flink 2.0). The JIRA link is [1]. [1] https://issues.apache.org/jira/browse/FLINK-17862 Best, Yangze Guo On Fri,

Re: Flink SQL newbie question: RowTime field should not be null, please convert it to a non-null long value

2020-05-24 Thread Enzo wang
Hi Leonard, Thanks, you were right: Kafka had some dirty data without the ts field, which caused the problem. Changing 'connector.startup-mode' = 'earliest-offset' to 'connector.startup-mode' = 'latest-offset' made it work. One more small question, similar to the above: how do I write Flink SQL that skips Kafka messages without a ts field? Cheers, Enzo On Mon, 25 May 2020 at 10:01, Leonard Xu wrote: > Hi, > >

Re: kerberos integration with flink

2020-05-24 Thread Yangze Guo
Yes, you can use kinit. But AFAIK, if you deploy Flink on Kubernetes or Mesos, Flink will not ship the ticket cache. If you deploy Flink on Yarn, Flink will acquire delegation tokens with your ticket cache and set tokens for the job manager and task executors. As the document says, the main drawback is

Re: Could not find a suitable table factory for 'TableSourceFactory'

2020-05-24 Thread macia kk
built.sbt val flinkVersion = "1.10.0" libraryDependencies ++= Seq( "org.apache.flink" %% "flink-streaming-scala" % flinkVersion , "org.apache.flink" %% "flink-scala" % flinkVersion, "org.apache.flink" %% "flink-statebackend-rocksdb" % flinkVersion, "org.apache.flink" %

Re: Flink SQL newbie question: RowTime field should not be null, please convert it to a non-null long value

2020-05-24 Thread Leonard Xu
Hi, The error message is fairly clear: the event time must not be null. Check whether the ts field in your Kafka data has null values or is missing entirely; if so, you can use a simple UDF to assign ts a default value when it is absent. Best, Leonard Xu > On May 25, 2020, at 09:52, Enzo wang wrote: > > Could you all help me see what the problem is? > > The data flow is: > Apache -> Logstash -> Kafka -> Flink -> ES -> Kibana > > The logs arriving in Kafka are already JSON, in this format: > { >

Re: Could not find a suitable table factory for 'TableSourceFactory'

2020-05-24 Thread Leonard Xu
Hi, Is the kafka connector version you are using 0.11? The error looks a bit like a version mismatch. Best, Leonard Xu > On May 25, 2020, at 02:44, macia kk wrote: > > Thanks, I found the answer by searching the earlier mail archive. Now I've hit a new problem and have been stuck for quite a while: > > Table API, sink to Kafka > > val result = bsTableEnv.sqlQuery("SELECT * FROM " + "") > > bsTableEnv > .connect( > new

Re: Could not find a suitable table factory for 'TableSourceFactory'

2020-05-24 Thread macia kk
Thanks, I found the answer by searching the earlier mail archive. Now I've hit a new problem and have been stuck for quite a while: Table API, sink to Kafka val result = bsTableEnv.sqlQuery("SELECT * FROM " + "") bsTableEnv .connect( new Kafka() .version("0.11") // required: valid connector versions are .topic("aaa") // required: topic

Multiple Sinks for a Single Source

2020-05-24 Thread Prasanna kumar
Hi, There is a single source of events in my system. I need to process and send the events to multiple destinations/sinks at the same time [a Kafka topic, S3, Kinesis, and a POST to an HTTP endpoint]. I am able to send to one sink. By adding more sink streams to the source stream, could we achieve it
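Conceptually, yes: the same stream can be handed to several sinks. A language-neutral sketch of the fan-out (plain Python standing in for calling addSink() several times on the same DataStream; all names are made up):

```python
def fan_out(events, sinks):
    """Deliver every event to every sink, mimicking a stream
    that has several sinks attached to it."""
    for event in events:
        for sink in sinks:
            sink(event)

# Lists stand in for the Kafka/S3/HTTP sinks.
kafka_out, s3_out, http_out = [], [], []
fan_out(
    ["e1", "e2"],
    [kafka_out.append, s3_out.append, http_out.append],
)
print(kafka_out, s3_out, http_out)  # every sink saw both events
```

In Flink terms this corresponds to attaching several sink operators to one upstream operator; each sink then consumes the full stream independently.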

Re: Re: Flink Window with multiple trigger condition

2020-05-24 Thread Yun Gao
Hi, First, sorry that I'm not an expert on windows, so please correct me if I'm wrong, but from my side it seems the assigner might also be a problem in addition to the trigger: currently Flink window assigners are all time-based (processing time or event time), and it might be hard

Re: Could not find a suitable table factory for 'TableSourceFactory'

2020-05-24 Thread Lijie Wang
Hi, I can't load the images in your email. From the error below, it looks like no matching connector was found; check whether the with properties in your DDL are correct. On 2020-05-25 00:11:16, "macia kk" wrote: Could anyone help me take a look at this? Thanks. org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: findAndCreateTableSource failed. Caused by:

Could not find a suitable table factory for 'TableSourceFactory'

2020-05-24 Thread macia kk
Could anyone help me take a look at this? Thanks. org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: findAndCreateTableSource failed. Caused by: org.apache.flink.table.api.NoMatchingTableFactoryException: Could not find a suitable table factory for

Re: How to ensure the broadcast stream arrives before the data stream?

2020-05-24 Thread kris wu
Best, kris. ---- Original message ---- From: "tison"

Re: Understanding Watermarks

2020-05-24 Thread smq
Got it. I'm still new to Flink, so a lot of things aren't clear to me yet; thanks for the pointers. --- Original message --- From: tison https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#allowed-lateness [2]

Re: Understanding Watermarks

2020-05-24 Thread smq
Thanks! --- Original message --- From: Benchao Li

Re: How to ensure the broadcast stream arrives before the data stream?

2020-05-24 Thread tison
Gao's approach makes sense. Restricting which stream arrives first at the network level is very tricky, and even if it were possible it would touch very low-level parts of Flink's network stack. Usually what "arrives first" really means is "is processed first", so caching the records that physically arrive first and processing them later is enough. Best, tison. 1048262223 <1048262...@qq.com> wrote on Sun, May 24, 2020 at 2:08 PM: > Hello, my understanding is this: > broadcast streams are generally used to reduce access to external config data and improve performance; if that's your scenario, I have a method used in production that you can refer to. > >

Re: Understanding Watermarks

2020-05-24 Thread tison
Your overall understanding is fine, but I saw you say "if the first record's event time is exactly 12:00, then the watermark should be at 11:59". The watermark has nothing to do with allowedLateness; they are independent pieces of logic. For the docs, see [1]; for the source, search for allowedLateness in [2]. Best, tison. [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html#allowed-lateness [2]

Re: Single task backpressure problem with Credit-based Flow Control

2020-05-24 Thread Zhijiang
Hi Weihua, From the info below, this is within expectations for credit-based flow control. I guess one of the sink parallel instances causes the backpressure, so you see no available credits on the Sink side while the outPoolUsage of Map is almost 100%. It really reflects the

Re: Understanding Watermarks

2020-05-24 Thread Benchao Li
Hi, Your understanding is correct: which window a record enters depends entirely on the event time; when a window triggers depends on the watermark. smq <374060...@qq.com> wrote on Sun, May 24, 2020 at 9:46 PM: > > Using event-time windows with an allowed lateness of 1 minute and a window size of 10 minutes: in the 12:00-12:10 window, if a record with event time 12:09:50 arrives at 12:10:50 and the watermark is exactly at 12:09:50, that late record can still be processed; that part I understand. > >

Re: Configuration of the MiniCluster used by LocalExecutionEnvironment

2020-05-24 Thread Jeff Zhang
Zeppelin also integrates Flink's local mode; you can set the number of TMs and slots via local.number-taskmanager and flink.tm.slot. See this video for details: https://www.bilibili.com/video/BV1Te411W73b?p=3 tison wrote on Sun, May 24, 2020 at 9:46 PM: > That's right. > > For the configuration, refer to the two classes [1][2]; the exact code path for your Maven launch also involves [3][4]. > > The docs are indeed a bit lacking here. You can trace how configuration is passed; the number of TMs and the RPC

Understanding Watermarks

2020-05-24 Thread smq
Using event-time windows with an allowed lateness of 1 minute and a window size of 10 minutes: in the 12:00-12:10 window, if a record with event time 12:09:50 arrives at 12:10:50 and the watermark is exactly at 12:09:50, that late record can still be processed; that part I understand. But if the first record's event time is exactly 12:00, the watermark should then be at 11:59. Can that record enter the 12:00-12:10 window and be processed? In principle it should be processed correctly. If so, then window assignment follows the event time while triggering follows the watermark. I'm not sure whether this understanding is right; I've been puzzling over it for a long time.
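The two rules in question (assignment by event time, triggering and dropping by watermark) can be sketched with plain arithmetic. This is an illustration of the semantics only, not Flink code; the timestamps are the ones from the question, expressed in seconds since midnight:

```python
WINDOW = 600     # 10-minute tumbling windows
LATENESS = 60    # allowed lateness: 1 minute

def window_of(event_ts):
    """Window assignment depends only on the event time."""
    start = event_ts - event_ts % WINDOW
    return (start, start + WINDOW)

def is_dropped(event_ts, watermark):
    """A record is dropped only once the watermark has passed
    window end + allowed lateness."""
    _, end = window_of(event_ts)
    return watermark >= end + LATENESS

# 12:00:00 is 43200 s. The record lands in [12:00, 12:10)
# regardless of where the watermark currently stands.
print(window_of(43200))            # (43200, 43800)
# Late record with event time 12:09:50 while the watermark
# is also at 12:09:50: still accepted.
print(is_dropped(43790, 43790))    # False
```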

Re: Configuration of the MiniCluster used by LocalExecutionEnvironment

2020-05-24 Thread tison
That's right. For the configuration, refer to the two classes [1][2]; the exact code path when launching via Maven also involves [3][4]. The docs are indeed a bit lacking here. You can trace how configuration is passed; at least at the programming-interface level, the number of TMs, the RPC sharing format, and other options are all configurable. Best, tison. [1] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/minicluster/MiniCluster.java [2]

Configuration of the MiniCluster used by LocalExecutionEnvironment

2020-05-24 Thread 月月
Hello, When running a project with Maven in single-machine mode, a MiniCluster is started automatically. In this case, is the default configuration one JobManager and one TaskManager? I couldn't find any documentation about this. Thanks!

Re: Does Flink use EMRFS?

2020-05-24 Thread Rafi Aroch
Hi Peter, I've dealt with the cross-account delegation issues in the past (with no relation to Flink) and ran into the same ownership problems (accounts can't access data, account A 'loses' access to its own data). My 2 cents are that: - The account that produces the data (A) should be the

Re: Flink 1.8.3 Kubernetes POD OOM

2020-05-24 Thread Andrey Zagrebin
Hi Josson, Thanks for the details. Sorry, I overlooked that you indeed mentioned the file backend. Looking into the Flink memory model [1], I do not notice any problems related to the types of memory consumption we model in Flink. Direct memory consumption by the network stack corresponds to your

Single task backpressure problem with Credit-based Flow Control

2020-05-24 Thread Weihua Hu
Hi all, I ran into a weird single-task backpressure problem. Job info: DAG: Source (1000) -> Map (2000) -> Sink (1000), linked via rescale. Flink version: 1.9.0. There is no related info in the jobmanager/taskmanager logs. Through metrics, I see that Map (242)'s outPoolUsage is

Re: How to ensure the broadcast stream arrives before the data stream?

2020-05-24 Thread 1048262223
Hello, my understanding is this: broadcast streams are generally used to reduce access to external config data and improve performance. If that's your scenario, here is a method I have used in production that you can refer to. You can initialize the config once in the open() method of the normal data-processing stream, and then use the broadcast data to update the config when it changes later. If certain records strictly must not be processed until some broadcast config update has arrived, you can use the state-buffering approach the others suggested above. -- Original message -- From: Yun Gao
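The state-buffering idea in the last sentence can be sketched like this (plain Python standing in for keyed/list state in a BroadcastProcessFunction; all names are made up):

```python
class BufferingJoin:
    """Hold data records until the broadcast config arrives,
    then flush the buffer and process later records live."""

    def __init__(self):
        self.config = None
        self.buffer = []   # stands in for Flink ListState
        self.out = []

    def on_data(self, record):
        if self.config is None:
            self.buffer.append(record)          # config not here yet
        else:
            self.out.append((record, self.config))

    def on_broadcast(self, config):
        self.config = config
        for record in self.buffer:              # replay buffered records
            self.out.append((record, self.config))
        self.buffer.clear()

j = BufferingJoin()
j.on_data("a")             # buffered: no config yet
j.on_broadcast("cfg-v1")   # flushes the buffer
j.on_data("b")             # processed immediately
print(j.out)               # [('a', 'cfg-v1'), ('b', 'cfg-v1')]
```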