Looking for MultipleLinearRegression in Flink

2021-12-14 Thread thekingofcity
Hi, I'm looking for multiple linear regression in recent Flink versions. I do find it in Flink 1.2 but have no idea where to find it in 1.10+. https://nightlies.apache.org/flink/flink-docs-release-1.2/dev/libs/ml/multiple_linear_regression.html I also find a flink-ml repo but can't find it

Direct buffer memory in job with hbase client

2021-12-14 Thread Anton
Hi, from time to time my job is stopping to process messages with warn message listed below. Tried to increase jobmanager.memory.process.size and taskmanager.memory.process.size but it didn't help. What else can I try? "Framework Off-heap" is 128mb now as seen is task manager dashboard and Task

Re: Hybrid Source with Parquet Files from GCS + KafkaSource

2021-12-14 Thread Meghajit Mazumdar
Hi, Thanks. I was able to get this working. Had to use recordFileFormat though. Is there a performance difference between FileRecordFormat and BulkFormat

Re: [DISCUSS] Change some default config values of blocking shuffle

2021-12-14 Thread Yingjie Cao
Hi Till, Thanks for the suggestion. I think it makes a lot of sense to also extend the documentation for the sort shuffle to include a tuning guide. Best, Yingjie Till Rohrmann 于2021年12月14日周二 18:57写道: > As part of this FLIP, does it make sense to also extend the documentation > for the sort

Re: [DISCUSS] Change some default config values of blocking shuffle

2021-12-14 Thread Yingjie Cao
Hi Till, Thanks for the suggestion. I think it makes a lot of sense to also extend the documentation for the sort shuffle to include a tuning guide. Best, Yingjie Till Rohrmann 于2021年12月14日周二 18:57写道: > As part of this FLIP, does it make sense to also extend the documentation > for the sort

Re:Re: BroadcastConnectedStream处理顺序问题

2021-12-14 Thread casel.chen
嗯,正解,谢谢! 在 2021-12-14 14:26:35,"yidan zhao" 写道: >应该在open中做全量数据的初始化。然后broadcastState做增量的更新。 > >Caizhi Weng 于2021年12月14日周二 09:50写道: > >> Hi! >> >> 可以看一下 event time temporal join [1] 是否满足需求。 >> >> [1] >> >>

retract回撤流上窗口统计问题

2021-12-14 Thread casel.chen
业务需要按每分钟统计不同交易状态的交易数,接了业务mysql库binlog到flink计算。这是一个retract回撤流。是不能够直接使用Tumble window计算的。 1. 那么是不是只能用全量窗口的方式实现?即group by交易时间落到其时长为一分钟的窗口,再配合state TTL来过期不需要的状态? 2. 这样一来的话每来一笔交易都会更新状态,如果直接输出到下游mysql保存的话会对mysql造成很大写压力,那么是不是可以再接一个Tumble window获取每个指标的最新统计值输出? 3.

Re: reading gz files

2021-12-14 Thread Caizhi Weng
Hi! Thanks for raising this issue. This is unfortunately a bug. I've created a JIRA ticket [1] and you can check the progress of this issue there. [1] https://issues.apache.org/jira/browse/FLINK-25311 Egor Ryashin 于2021年12月15日周三 02:33写道: > Hey, > > I’m using Flink 1.14 and having trouble

Re: Java dependencies management in Pyflink

2021-12-14 Thread Paul Lam
Hi Dian, Thanks a lot for your input. That’s a valid solution. We avoid using fat jars in Java API, because it easily leads to class conflicts. But PyFlink is like SQL API, user-imported Java dependencies are comparatively rare, so fat jar is a proper choice. Best, Paul Lam > 2021年12月14日

??????flinksql source ????????????????????????

2021-12-14 Thread ????
---- ??: "user-zh"

Re: Flink SQL 有办法access State吗

2021-12-14 Thread Caizhi Weng
Hi! 可以用 SQL 的聚合函数 [1] 实现。内置聚合函数详见 [2],也可以自定义聚合函数,详见 [3]。 [1] https://nightlies.apache.org/flink/flink-docs-master/zh/docs/dev/table/sql/queries/group-agg/ [2] https://nightlies.apache.org/flink/flink-docs-master/zh/docs/dev/table/functions/systemfunctions/#%E8%81%9A%E5%90%88%E5%87%BD%E6%95%B0

Re: UDF and Broadcast State Pattern

2021-12-14 Thread Caizhi Weng
Hi! Currently you can't use broadcast state in Flink SQL UDF because UDFs are all stateless. However you mentioned your use case that you want to control the logic in UDF with some information. If that is the case, you can just run a thread in your UDF to read that information and change the

Re: Python Interop with Java defined org.apache.flink.api.common.functions.Function classes?

2021-12-14 Thread Dian Fu
Hi Kevin, You could try to use it as following: ``` from pyflink.java_gateway import get_gateway jvm = get_gateway().jvm ds = ( DataStream(ds._j_data_stream.map(jvm.com.example.MyJavaMapFunction())) ) ``` Regards, Dian On Wed, Dec 15, 2021 at 5:41 AM Kevin Lam wrote: > Hi all, > > We

Re: FileSource with Parquet Format - parallelism level

2021-12-14 Thread Krzysztof Chmielewski
Hi Arvid, thank you for your response. I did a little bit more digging and analyzing and I noticed one thing, Please correct me if I'm wrong. Whether the Parquet file will be read in parallel in fact depends on underlying file system. If the file system supports file blocks then we will have

UDF and Broadcast State Pattern

2021-12-14 Thread Krzysztof Chmielewski
Hi, Is there a way to build an UDF [1] for FLink SQL that can be used with Broadcast State Pattern [2]? I have a use case, where I would like to be able to use broadcast control stream to change logic in UDF. Regards, Krzysztof Chmielewski [1]

Python Interop with Java defined org.apache.flink.api.common.functions.Function classes?

2021-12-14 Thread Kevin Lam
Hi all, We currently operate several Flink applications using the Scala API, and run on kubernetes in Application mode. We're interested in researching the Python API and how we can support Python for application developers that prefer to use Python. We have a common library which implements a

unaligned checkpoint for job with large start delay

2021-12-14 Thread Mason Chen
Hi all, I'm using Flink 1.13 and my job is experiencing high start delay, more so than high alignment time. (our flip 27 kafka source is heavily backpressured). Since our alignment timeout is set to 1s, the unaligned checkpoint never triggers since alignment delay is always below the threshold.

reading gz files

2021-12-14 Thread Egor Ryashin
Hey, I’m using Flink 1.14 and having trouble ingesting data from json gz file. I’ve successfully created a table but number of records is wrong. I’m using this SQL: create table i1( line_item_id STRING ) with ( 'connector'='filesystem', 'path'='/Users/egorryashin/temp/test.json', 'format' =

Re: [DISCUSS] Change some default config values of blocking shuffle

2021-12-14 Thread Yun Gao
Hi, > I think setting taskmanager.network.sort-shuffle.min-parallelism to 1 and > using sort-shuffle for all cases by default is a good suggestion. I am not > choosing this value mainly because two reasons: > 1. The first one is that it increases the usage of network memory which may > cause

Re: [DISCUSS] Change some default config values of blocking shuffle

2021-12-14 Thread Yun Gao
Hi, > I think setting taskmanager.network.sort-shuffle.min-parallelism to 1 and > using sort-shuffle for all cases by default is a good suggestion. I am not > choosing this value mainly because two reasons: > 1. The first one is that it increases the usage of network memory which may > cause

Re: Sending an Alert to Slack, AWS sns, mattermost

2021-12-14 Thread Seth Wiesman
Sure, Just implement `RichSinkFunction`. You will initialize your client inside the open method and then send alerts from invoke. Seth On Mon, Dec 13, 2021 at 9:17 PM Robert Cullen wrote: > Yes, That's the correct use case. Will this work with the DataStream > API? UDFs are for the Table

Re: Java dependencies management in Pyflink

2021-12-14 Thread Dian Fu
Hi Paul, For connectors(including Kafka), it's recommended to use the fat jar which contains the dependencies. For example, for kafka, you could use https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.11/1.14.0/flink-sql-connector-kafka_2.11-1.14.0.jar Regards, Dian

Re: [DISCUSS] Change some default config values of blocking shuffle

2021-12-14 Thread Till Rohrmann
As part of this FLIP, does it make sense to also extend the documentation for the sort shuffle [1] to include a tuning guide? I am thinking of a more in depth description of what things you might observe and how to influence them with the configuration options. [1]

Re: [DISCUSS] Change some default config values of blocking shuffle

2021-12-14 Thread Till Rohrmann
As part of this FLIP, does it make sense to also extend the documentation for the sort shuffle [1] to include a tuning guide? I am thinking of a more in depth description of what things you might observe and how to influence them with the configuration options. [1]

Java dependencies management in Pyflink

2021-12-14 Thread Paul Lam
Hi! I’m trying out PyFlink and looking for the best practice to manage Java dependencies. The docs recommends to use ‘pipeline-jars’ configuration or command line options to specify jars for a PyFlink job. However, PyFlink users may not know what Java dependencies is required. For example, a

Re: [DISCUSS] Strong read-after-write consistency of Flink FileSystems

2021-12-14 Thread David Morávek
Any other thoughts on the topic? If there are no concerns, I'd continue with creating a FLIP for changing the "written" contract of the Flink FileSystems to reflect this. Best, D. On Wed, Dec 8, 2021 at 5:53 PM David Morávek wrote: > Hi Martijn, > > I simply wasn't aware of that one :) It