[Problem] Unable to do join on TumblingEventTimeWindows using SQL

2019-10-23 Thread Manoj Kumar
*Hi All,* *[Explanation]* Two tables, say lineitem and orders: Table orderstbl=bsTableEnv.fromDataStream(orders,"a,b,c,d,e,f,g,h,i,orders.rowtime"); Table lineitemtbl=bsTableEnv.fromDataStream(lineitem,"a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,lineitem.rowtime"); bsTableEnv.registerTable("Orders",orderst
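Flink SQL has no direct equivalent of a DataStream TumblingEventTimeWindows join; the usual SQL-level pattern is a time-bounded (interval) join over the two rowtime attributes. A minimal sketch against tables registered as above, with illustrative field names rather than the poster's a..p columns:

    // Assumes Flink 1.9 Table API (Java); ordersStream/lineitemStream and the
    // field names are illustrative, not the poster's actual schema.
    Table ordersTbl = bsTableEnv.fromDataStream(ordersStream, "orderkey, price, o_rowtime.rowtime");
    Table lineitemTbl = bsTableEnv.fromDataStream(lineitemStream, "orderkey, qty, l_rowtime.rowtime");
    bsTableEnv.registerTable("Orders", ordersTbl);
    bsTableEnv.registerTable("Lineitem", lineitemTbl);

    // Event-time join bounded to a one-hour window between the two rowtimes:
    Table joined = bsTableEnv.sqlQuery(
        "SELECT o.orderkey, l.qty FROM Orders o, Lineitem l " +
        "WHERE o.orderkey = l.orderkey " +
        "AND l.l_rowtime BETWEEN o.o_rowtime - INTERVAL '1' HOUR AND o.o_rowtime");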

Re: multi-tenancy without a kafka partition per tenant

2019-10-23 Thread vino yang
Hi Constantinos, I think your analysis is correct: if you have a multi-tenant scenario but there is no distinction between tenants in Kafka, then Flink can't treat different tenants differently. Differences in data volume between tenants can easily create data hotspots. A compromise is

Problem creating tumbling windows based on number of rows

2019-10-23 Thread A. V.
Hi, I am trying to create tumbling windows of 2 rows each in Flink Java. This must be based on the dateTime (TIMESTAMP(3) datatype) or unixDateTime (BIGINT datatype) column. I've added the code of two different versions below, with the error messages I get placed above the code. When I print the
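A tumbling window over a fixed number of rows is a count window rather than a time window; a minimal self-contained sketch (Flink 1.8+/Java, data and names invented for illustration):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CountWindowSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.fromElements(
                    Tuple2.of("a", 1L), Tuple2.of("a", 2L),
                    Tuple2.of("b", 3L), Tuple2.of("b", 4L))
                .keyBy(t -> t.f0)
                .countWindow(2) // tumbling window of exactly 2 rows per key
                .reduce((x, y) -> Tuple2.of(x.f0, x.f1 + y.f1))
                .print();
            env.execute("count-window-sketch");
        }
    }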

Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-23 Thread Yang Wang
Hi Chan, After FLIP-6, the Flink ResourceManager dynamically allocates resources from Yarn on demand. What's your Flink version? On the current code base, if the number of pending containers in the resource manager is zero, then it will release all the excess containers. Could you please check the "Remaining pendi

Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-23 Thread Till Rohrmann
Hi Regina, When using the FLIP-6 mode, you can control how long it takes for an idle TaskManager to be released via resourcemanager.taskmanager-timeout. By default it is set to 30s. In the Flink version you are using, 1.6.4, we do not support TaskManagers with multiple slots properly [1]. The co
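For reference, the setting goes into flink-conf.yaml; the value is in milliseconds, and the line below just spells out the 30s default Till mentions:

    # flink-conf.yaml: release idle TaskManagers after 30 seconds (the default)
    resourcemanager.taskmanager-timeout: 30000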

Re: Monitor number of keys per Taskmanager

2019-10-23 Thread Piotr Nowojski
Hi, This is a known issue in Flink. For example, key groups can differ in size by +/- 1, and they are currently randomly distributed across the cluster, so some machines will get more keys to handle than the others. If the number of keys is relatively small, like 3 keys per key group, the load differenc

Re: Flink grpc-netty-shaded NoClassDefFoundError

2019-10-23 Thread Piotr Nowojski
Hi, Have you checked the build of your job for dependency convergence errors? Either automatically or manually (`mvn dependency:tree` command)? Look for version clashes in the dependencies that are pulling in the `io/grpc/netty/shaded/io/netty/channel/AbstractChannel` class (grpc-netty?). It co
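The manual check Piotr describes can be narrowed to the artifacts that pull in the shaded Netty class; the filter value below is illustrative:

    # Verbose mode lists omitted/conflicting versions; filter to grpc artifacts
    mvn dependency:tree -Dverbose=true -Dincludes=io.grpc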

Re: Problem creating tumbling windows based on number of rows

2019-10-23 Thread Manoj Kumar
Hi A.V., *//When I run the code below I get this error: Caused by: java.lang.RuntimeException: Rowtime timestamp is null. //Please make sure that a proper TimestampAssigner is defined and the stream environment uses the EventTime time characteristic.* You need to assign timestamps and watermarks to d
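A minimal self-contained sketch of the missing step (Flink 1.8/1.9 API; that field f1 carries the epoch-millis event time is an assumption mirroring the unixDateTime column from the question):

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class WatermarkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

            env.fromElements(Tuple2.of("a", 1571800000000L), Tuple2.of("a", 1571800001000L))
                // Without this step, event-time operators fail with
                // "Rowtime timestamp is null", the exact error above.
                .assignTimestampsAndWatermarks(
                    new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(5)) {
                        @Override
                        public long extractTimestamp(Tuple2<String, Long> e) {
                            return e.f1; // epoch-millis event time
                        }
                    })
                .print();
            env.execute("watermark-sketch");
        }
    }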

Docker flink 1.9.1

2019-10-23 Thread Farouk
Hi, Is there an official image published for Flink 1.9.1 on Docker Hub? I can't find anything, yet there is one for the previous Flink version, 1.8.2. Thanks Farouk

How many events can Flink process each second

2019-10-23 Thread A. V.
Hi, My boss wants to know how many events Flink can process, analyse, etc. per second. I can't find this in the documentation.

Re: How many events can Flink process each second

2019-10-23 Thread Andres Angel
Hello A.V., It depends on the underlying resources you are planning for your jobs; memory and processing power will play a principal role in the answer. Keep in mind you are able to break your job down into a number of parallel tasks per environment or even for a specific task within your p

Re: How many events can Flink process each second

2019-10-23 Thread Michael Latta
There are a lot of variables: how many cores are allocated, how much RAM, etc. There are companies doing billions of events per day and more. Tell your boss it has proven to have extremely flat horizontal scaling, meaning you can get it to process almost any number given sufficient hardware.

Re: Docker flink 1.9.1

2019-10-23 Thread vino yang
Hi Farouk, Flink 1.9.1 was released not long ago, so the community may not have had time to provide the corresponding Dockerfiles yet. I can give you some information: Flink's official Docker files are maintained in this repository. [1] I have seen many versions of Docker files contributed by patricklucas[
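Until the official image appears, one option is to build it yourself from that repository; the commands below are a sketch, and the exact directory layout should be checked against the repo:

    git clone https://github.com/docker-flink/docker-flink.git
    cd docker-flink/1.9/scala_2.11-debian   # pick the version/variant directory, if present
    docker build -t flink:1.9.1 .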

Re: How many events can Flink process each second

2019-10-23 Thread vino yang
Hi A.V., To add a few points to the previous two answers: there is no clear answer to this question. In addition to resource issues, it depends on the size of the messages you are dealing with and the complexity of the logic. If you don't consider a lot of extra factors, look at the performance o

Re: Monitor number of keys per Taskmanager

2019-10-23 Thread Till Rohrmann
Currently, we don't try to ensure that the key groups are spread as evenly as possible. As a workaround I would suggest increasing the number of key groups or changing the key function. Cheers, Till On Wed, Oct 23, 2019 at 1:42 PM Piotr Nowojski wrote: > Hi, > > This is a
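A sketch of the first workaround: the number of key groups equals the job's maximum parallelism, so raising it gives the random assignment more groups to average over (the value below is illustrative):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Max parallelism == number of key groups; the default is 128.
    env.setMaxParallelism(720);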

Flink StreamingFileSink part file behavior

2019-10-23 Thread amran dean
Hello, I am using StreamingFileSink with KafkaConsumer010 as a Kafka -> S3 connector (Flink 1.8.1, Kafka 0.10.1). The setup is simple: data is written bucketed first by datetime (granularity of 1 day), then by Kafka partition. I am using *event time* (the Kafka timestamp, recorded at the time of creation
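For context, a minimal sketch of such a sink (Flink 1.8, row format; the stock DateTimeBucketAssigner shown here buckets by day only, so the per-Kafka-partition sub-bucketing amran describes would need a custom BucketAssigner):

    import org.apache.flink.api.common.serialization.SimpleStringEncoder;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
    import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;

    // `stream` is an assumed DataStream<String>; the date pattern yields 1-day buckets.
    StreamingFileSink<String> sink = StreamingFileSink
        .forRowFormat(new Path("s3://my-bucket/output"), new SimpleStringEncoder<String>("UTF-8"))
        .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd"))
        .build();
    stream.addSink(sink);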

Issue with writeAsText() to S3 bucket

2019-10-23 Thread Nguyen, Michael
Hello all, I am running into issues trying to print my DataStreams to an S3 bucket using writeAsText("s3://bucket/result.json") in my Flink job. I used print() on the same DataStream and I see the output I am looking for in standard output. I first confirm that my datastream has d
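One detail worth checking while debugging this (a hedged note, not necessarily the root cause here): with parallelism greater than 1, writeAsText creates result.json as a directory of part files rather than a single object.

    // Forcing a single writer produces one file at the given path (sketch;
    // myDataStream is an assumed DataStream):
    myDataStream.writeAsText("s3://bucket/result.json").setParallelism(1);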

RE: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-23 Thread Chan, Regina
Yeah thanks for the responses. We’re in the process of testing 1.9.1 after we found https://issues.apache.org/jira/browse/FLINK-12342 as the cause of the original issue. FLINK-9455 makes sense as to why it didn’t work on legacy mode. From: Till Rohrmann Sent: Wednesday, October 23, 2019 5:32

Does operator uid() have to be unique across all jobs?

2019-10-23 Thread John Smith
When setting uid() of an operator does it have to be unique across all jobs or just unique within a job? For example can I use env.addSource(myKafkaConsumer).uid("kafka-consumer") in another job?

How to create an empty test stream

2019-10-23 Thread Dmitry Minaev
Hi everyone, I have a pipeline where I union several streams. I want to test it without populating one of the streams. I usually create streams with: DataStreamTestBase.createTestStreamWith(event).close(); The above statement creates a stream and puts the `event` inside. But in my ca
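A hedged workaround using the plain DataStream API (not DataStreamTestBase): build a typed one-element stream and filter everything out, which leaves an empty stream to union with:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // fromElements needs at least one element to infer the type; drop it again.
    DataStream<String> empty = env.fromElements("ignored").filter(x -> false);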

Flink Kafka->S3 exactly once guarantees

2019-10-23 Thread amran dean
Hello, Suppose I am using a *nondeterministic*, time-based partitioning scheme (e.g. Flink processing time) to bucket S3 objects via the *BucketAssigner*, designated using the *BulkFormatBuilder* for StreamingFileSink. Suppose that after an S3 MPU has completed, but *before* Flink internally commits (w

Re: Does operator uid() have to be unique across all jobs?

2019-10-23 Thread Dian Fu
Yes, you can use it in another job. The uid needs only to be unique within a job. > On October 24, 2019, at 5:42 AM, John Smith wrote: > > When setting uid() of an operator does it have to be unique across all jobs > or just unique within a job? > > For example can I use env.addSource(myKafkaConsumer).uid("
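In other words (names taken from the question; envB and the second consumer are hypothetical):

    // Job A
    env.addSource(myKafkaConsumer).uid("kafka-consumer");

    // Job B (a separate job graph, so reusing the same uid is fine)
    envB.addSource(anotherKafkaConsumer).uid("kafka-consumer");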

Using STSAssumeRoleSessionCredentialsProvider for cross account access

2019-10-23 Thread Vinay Patil
Hi, I am trying to access DynamoDB streams from a different AWS account but I am getting a resource-not-found exception while trying to access the streams from the TaskManager. I have provided the following configurations: *dynamodbStreamsConsumerConfig.setProperty(ConsumerConfigConstants.AWS_ROLE_CR
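For reference, the assume-role wiring for the Kinesis/DynamoDB Streams consumer looks roughly like this, using the property constants from flink-connector-kinesis' AWSConfigConstants; treat the exact constant names as an assumption to verify against your Flink version, since the original message is truncated, and the region/ARN values are illustrative:

    import java.util.Properties;
    import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

    Properties cfg = new Properties();
    cfg.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1");
    cfg.setProperty(AWSConfigConstants.AWS_CREDENTIALS_PROVIDER, "ASSUME_ROLE");
    cfg.setProperty(AWSConfigConstants.AWS_ROLE_ARN,
            "arn:aws:iam::123456789012:role/cross-account-role");
    cfg.setProperty(AWSConfigConstants.AWS_ROLE_SESSION_NAME, "flink-ddb-streams");
    // Base credentials used to make the AssumeRole call itself:
    cfg.setProperty(AWSConfigConstants.AWS_ROLE_CREDENTIALS_PROVIDER, "AUTO");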