Streaming File Sink??????????

2020-03-16 Thread 58683632
Streaming File Sinkparquet avrobulk writefinal StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); env.setParallelism(1); env.enableCheckpointing(60 * 1000, CheckpointingMode.AT_LEAST_ONCE); env.setStateBackend(new FsStateBackend(new

Re: 读取ORC文件的VectorizedRowBatch的最佳batchSize设置建议

2020-03-16 Thread Jingsong Li
Hi, 1万行太大了,会占用太大内存。而且batchSize太大也不利于cache。 batchSize不一定要和row group一样,这种row group特别大的情况下,batchSize 够用就行了。 Best, Jingsong Lee On Tue, Mar 17, 2020 at 11:52 AM jun su wrote: > hi all: > 在向量化读取orc文件时, 需要配置VectorizedRowBatch的batchSize, 用于设置每次读取的行数, > 我知道根据orc索引, 读取orc文件最小的单位应该是row

读取ORC文件的VectorizedRowBatch的最佳batchSize设置建议

2020-03-16 Thread jun su
hi all: 在向量化读取orc文件时, 需要配置VectorizedRowBatch的batchSize, 用于设置每次读取的行数, 我知道根据orc索引, 读取orc文件最小的单位应该是row group(默认1w行), 底层会根据filter条件来精确到哪些row group, 那之前提到的batchSize设置为1000时 , 那一个row group需要读取10次, 每个row group又是按列存储, 势必会存在非连续读取的可能, 这样岂不是做不到最大优化? 是够将batchSize设置和row group配置一样才能读取效率最大化呢? 不知道我的理解是否正确.

Re: JobMaster does not register with ResourceManager in high availability setup

2020-03-16 Thread Xintong Song
Hi Abhinav, I think you are right. The log confirms that JobMaster has not tried to connect ResourceManager. Most likely the JobMaster requested for RM address but has never received it. I would suggest you to check the ZK logs, see if the request form JM for RM address has been received and

Re: Flink YARN app terminated before the client receives the result

2020-03-16 Thread tison
edit: previously after the cancellation we have a longer call chain to #jobReachedGloballyTerminalState which does the archive job & JM graceful showdown, which might take some time so that ... Best, tison. tison 于2020年3月17日周二 上午10:13写道: > Hi Weike & Till, > > I agree with Till and it is also

Re: Flink YARN app terminated before the client receives the result

2020-03-16 Thread tison
Hi Weike & Till, I agree with Till and it is also the analysis from my side. However, it seems even if we don't have FLINK-15116, it is still possible that we complete the cancel future but the cluster got shutdown before it properly delivered the response. There is one thing strange that this

Re: Flink Conf "yarn.flink-dist-jar" Question

2020-03-16 Thread Yang Wang
Hi Hailu, Sorry for the late response. If the Flink cluster(e.g. Yarn application) is stopped directly by `yarn application -kill`, then the staging directory will be left behind. Since the jobmanager do not have any change to clean up the staging directly. Also it may happen when the jobmanager

Re: Issues with Watermark generation after join

2020-03-16 Thread Kurt Young
Hi, could you share the SQL you written for your original purpose, not the one you attached ProcessFunction for debugging? Best, Kurt On Tue, Mar 17, 2020 at 3:08 AM Dominik Wosiński wrote: > Actually, I just put this process function there for debugging purposes. > My main goal is to join

Re: Very large _metadata file

2020-03-16 Thread Jacob Sevart
Thanks! That would do it. I've disabled the operator for now. The purpose was to know the age of the job's state, so that we could consider its output in terms of how much context it knows. Regular state seemed insufficient because partitions might see their first traffic at different times. How

Re: Flink on Kubernetes Vs Flink Natively on Kubernetes

2020-03-16 Thread Yang Wang
Hi Pankaj, Just like Xintong has said, the biggest difference of Flink on Kubernetes and native integration is dynamic resource allocation. Since the latter has en embedded K8s client and will communicate with K8s Api server directly to allocate/release JM/TM pods. Both for the two ways to run

Re: datadog metrics

2020-03-16 Thread Fanbin Bu
Hi Steve, could you please share your work around solution in more detail in the above ticket? Thanks, Fanbin On Mon, Mar 16, 2020 at 2:50 AM Chesnay Schepler wrote: > I've created https://issues.apache.org/jira/browse/FLINK-16611. > > @Steva Any chance you could contribute your changes, or

Re: Very large _metadata file

2020-03-16 Thread Till Rohrmann
Hi Jacob, I think you are running into some deficiencies of Flink's union state here. The problem is that for every entry in your list state, Flink stores a separate offset (a long value). The reason for this behaviour is that we use the same state implementation for the union state as well as

Re: Cancel the flink task and restore from checkpoint ,can I change the flink operator's parallelism

2020-03-16 Thread Till Rohrmann
If you want to change the max parallelism then you need to take a savepoint and use Flink's state processor API [1] to rewrite the max parallelism by creating a new savepoint from the old one. [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/state_processor_api.html Cheers,

Re: Flink YARN app terminated before the client receives the result

2020-03-16 Thread Till Rohrmann
Hi Weike, could you share the complete logs with us? Attachments are being filtered out by the Apache mail server but it works if you upload the logs somewhere (e.g. https://gist.github.com/) and then share the link with us. Ideally you run the cluster with DEBUG log settings. I assume that you

Re: how to specify yarnqueue when starting a new job programmatically?

2020-03-16 Thread Till Rohrmann
Hi Vitaliy, in the case of a session cluster you cannot influence the queue programmatically since Flink uses the value configured via yarn.application.queue which is read from the flink-conf.yaml. However, there is a way to influence the yarn queue programmatically if you use the per job mode.

Re: Flink gelly dependency in transient EMR cluster

2020-03-16 Thread Till Rohrmann
Alternatively, you could also bundle the Gelly dependency with your user code jar by creating an uber jar. The downside of this approach would be an increased jar size which needs to be uploaded to the cluster. Cheers, Till On Thu, Mar 12, 2020 at 4:13 PM Antonio Martínez Carratalá <

Re: Issues with Watermark generation after join

2020-03-16 Thread Dominik Wosiński
Actually, I just put this process function there for debugging purposes. My main goal is to join the E & C using the Temporal Table function, but I have observed exactly the same behavior i.e. when the parallelism was > 1 there was no output and when I was setting it to 1 then the output was

Re: Issues with Watermark generation after join

2020-03-16 Thread Theo Diefenthal
Hi Dominik, I had the same once with a custom processfunction. My processfunction buffered the data for a while and then output it again. As the proces function can do anything with the data (transforming, buffering, aggregating...), I think it's just not safe for flink to reason about the

Fwd: AfterMatchSkipStrategy for timed out patterns

2020-03-16 Thread Dominik Wosiński
Hey all, I was wondering whether for CEP the *AfterMatchSkipStrategy *is applied during matching or if simply the results are removed after the match. The question is the result of the experiments I was doing with CEP. Say I have the readings from some sensor and I want to detect events over some

How do I get the outPoolUsage value inside my own stream operator?

2020-03-16 Thread Felipe Gutierrez
Hi community, I have built my own operator (not a UDF) and I want to collect the metrics of "outPoolUsage" inside it. How do I do it assuming that I have to do some modifications in the source code? I know that the Gouge comes from

Issues with Watermark generation after join

2020-03-16 Thread Dominik Wosiński
Hey, I have noticed a weird behavior with a job that I am currently working on. I have 4 different streams from Kafka, lets call them A, B, C and D. Now the idea is that first I do SQL Join of A & B based on some field, then I create append stream from Joined A, let's call it E. Then I need to

Re: Implicit Flink Context Documentation

2020-03-16 Thread Padarn Wilson
Thanks for the clarification. I'll dig in then! On Mon, 16 Mar 2020, 3:47 pm Piotr Nowojski, wrote: > Hi, > > We are not maintaining internal docs. We have design docs for newly > proposed features (previously informal design docs published on dev mailing > list and recently as FLIP documents

Re: [EXT.MSG] Re: datadog http reporter metrics

2020-03-16 Thread Chesnay Schepler
It would only be logged when using 1.10 unfortunately; but you should be able to use the 1.10 version of the reporter with your version of Flink to at least confirm that it is the same issue as FLINK-16611. On 16/03/2020 11:35, Yitzchak Lieberman wrote: No, tried to find error/warn logs for

Re: [EXT.MSG] Re: datadog http reporter metrics

2020-03-16 Thread Yitzchak Lieberman
No, tried to find error/warn logs for rejected metrics, nothing... tor that case there should be an error, right? (when report is too large) I saw that there are some changes on version 1.10 for datadog reporter, maybe I should upgrade to this version? On Mon, Mar 16, 2020 at 11:47 AM Chesnay

Re: Automatically Clearing Temporary Directories

2020-03-16 Thread David Maddison
Thanks for the responses and thanks Gary for the confirmation. Just to give some background, we deploy Flink inside Kubernetes so there is a chance that TaskManagers COULD be shut down in a non-graceful way leaving cache artifacts on the temporary volumes. With Gary's confirmation, we'll add an

Re: Flink on Kubernetes Vs Flink Natively on Kubernetes

2020-03-16 Thread Pankaj Chand
Hi Xintong, Thank you for the explanation! If I run Flink "natively" on Kubernetes, will I also be able to run Spark on the same Kubernetes cluster, or will it make the Kubernetes cluster be reserved for Flink only? Thank you! Pankaj On Mon, Mar 16, 2020 at 5:41 AM Xintong Song wrote: >

Re: datadog metrics

2020-03-16 Thread Chesnay Schepler
I've created https://issues.apache.org/jira/browse/FLINK-16611. @Steva Any chance you could contribute your changes, or some insight on what would need to be changed? On 11/03/2020 23:16, Steve Whelan wrote: Hi Fabian, We ran into the same issue. We modified the reporter to emit the

Re: datadog http reporter metrics

2020-03-16 Thread Chesnay Schepler
Do you see anything in the logs? In another thread a user reported that the datadog reporter could stop working when faced with a large number of metrics since datadog was rejecting the report due to being too large. On 15/03/2020 12:22, Yitzchak Lieberman wrote: Anyone? On Wed, Mar 11, 2020

Re: Flink on Kubernetes Vs Flink Natively on Kubernetes

2020-03-16 Thread Xintong Song
Hi Pankaj, "Running Flink on Kubernetes" refers to the old way that basically deploys a Flink standalone cluster on Kubernetes. We leverage scripts to run Flink Master and TaskManager processes inside Kubernetes container. In this way, Flink is not ware of whether it's running in containers or

Re: Flink on Kubernetes Vs Flink Natively on Kubernetes

2020-03-16 Thread Xintong Song
Forgot to mention that "running Flink natively on Kubernetes" is newly introduced and is only available for Flink 1.10 and above. Thank you~ Xintong Song On Mon, Mar 16, 2020 at 5:40 PM Xintong Song wrote: > Hi Pankaj, > > "Running Flink on Kubernetes" refers to the old way that basically

Re: Stop job with savepoint during graceful shutdown on a k8s cluster

2020-03-16 Thread Vijay Bhaskar
For point (1) above: Its up to user to have proper sink and source to choose to have exactly once semantics as per the documentation: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/guarantees.html If we choose the supported source and sink combinations duplicates will be

Re: [DISCUSS] FLIP-111: Docker image unification

2020-03-16 Thread Andrey Zagrebin
Thanks for the further feedback Thomas and Yangze. > A generic, dynamic configuration mechanism based on environment variables is essential and it is already supported via envsubst and an environment variable that can supply a configuration fragment True, we already have this. As I understand

Re: 全量编译报错

2020-03-16 Thread Yangze Guo
报错信息是什么呢?代码分支是哪个? 您可以再pull一下最新的master,目前看最新的master是可以全量编译的[1] [1] https://travis-ci.org/github/apache/flink/builds/662890263?utm_source=github_status_medium=notification Best, Yangze Guo Best, Yangze Guo On Mon, Mar 16, 2020 at 4:48 PM 吴志勇 <1154365...@qq.com> wrote: > > 在项目根目录下执行 `mvn clean

????????????

2020-03-16 Thread ??????
?? `mvn clean install -DskipTests` github

Re: 最新代码编译问题

2020-03-16 Thread tison
Hi, You'd better use English in user mailing list. If you prefer Chinese, you can post the email to user...@flink.apache.org . Best, tison. tison 于2020年3月16日周一 下午4:25写道: > 从 flink/ 根目录运行 mvn clean install -DskipTests > > 你这个问题是因为 impl 那些类是生成类,一般来说从根目录运行一次全量编译可以解决各种疑难杂症 > > Best, > tison. >

Re: 最新代码编译问题

2020-03-16 Thread tison
Hi, You'd better use English in user mailing list. If you prefer Chinese, you can post the email to user-zh@flink.apache.org . Best, tison. tison 于2020年3月16日周一 下午4:25写道: > 从 flink/ 根目录运行 mvn clean install -DskipTests > > 你这个问题是因为 impl 那些类是生成类,一般来说从根目录运行一次全量编译可以解决各种疑难杂症 > > Best, > tison. >

Re: 最新代码编译问题

2020-03-16 Thread tison
从 flink/ 根目录运行 mvn clean install -DskipTests 你这个问题是因为 impl 那些类是生成类,一般来说从根目录运行一次全量编译可以解决各种疑难杂症 Best, tison. 吴志勇 <1154365...@qq.com> 于2020年3月16日周一 下午4:23写道: > 您好, > 我从github上下载了最新的代码。在IDEA中尝试编译,但是flink-table项目flink-sql-parser编译报错, > > test中也同样报错, > > 请问该如何解决呢?flink-sql-parser像是缺少了impl包呀。 >

????????????????

2020-03-16 Thread ??????
?? github??IDEAflink-tableflink-sql-parser?? test?? ??flink-sql-parser??impl??

Streaming File Sink??????????

2020-03-16 Thread cs
Streaming File Sinkparquet avrobulk writefinal StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); env.setParallelism(1); env.enableCheckpointing(60 * 1000, CheckpointingMode.AT_LEAST_ONCE); env.setStateBackend(new FsStateBackend(new

Flink on Kubernetes Vs Flink Natively on Kubernetes

2020-03-16 Thread Pankaj Chand
Hi all, I want to run Flink, Spark and other processing engines on a single Kubernetes cluster. >From the Flink documentation, I did not understand the difference between: (1) Running Flink on Kubernetes, Versus (2) Running Flink natively on Kubernetes. Could someone please explain the

Re: Implicit Flink Context Documentation

2020-03-16 Thread Piotr Nowojski
Hi, We are not maintaining internal docs. We have design docs for newly proposed features (previously informal design docs published on dev mailing list and recently as FLIP documents [1]), but keyed state is such an old concept that dates back so much into the past, that I’m pretty sure it

Re: Communication between two queries

2020-03-16 Thread Piotr Nowojski
Hi, Let us know if something doesn’t work :) Piotrek > On 16 Mar 2020, at 08:42, Mikael Gordani wrote: > > Hi, > I'll try it out =) > > Cheers! > > Den mån 16 mars 2020 kl 08:32 skrev Piotr Nowojski >: > Hi, > > In that case you could try to implement your

Re: Communication between two queries

2020-03-16 Thread Mikael Gordani
Hi, I'll try it out =) Cheers! Den mån 16 mars 2020 kl 08:32 skrev Piotr Nowojski : > Hi, > > In that case you could try to implement your `FilterFunction` as two input > operator, with broadcast control input, that would be setting the > `global_var`. Broadcast control input can be originating

Re: Expected behaviour when changing operator parallelism but starting from an incremental checkpoint

2020-03-16 Thread Piotr Nowojski
Hi Seth, > Currently, all rescaling operations technically work with checkpoints. That > is purely by chance that the implementation supports that, and the line is > because the community is not committed to maintaining that functionality Are you sure that’s the case? Support for rescaling

Re: Communication between two queries

2020-03-16 Thread Piotr Nowojski
Hi, In that case you could try to implement your `FilterFunction` as two input operator, with broadcast control input, that would be setting the `global_var`. Broadcast control input can be originating from some source, or from some operator. Piotrek > On 13 Mar 2020, at 15:47, Mikael