Struggling with reading the file from s3 as Source

2020-09-10 Thread Vijay Balakrishnan
Hi, I want to *get data from S3, process it, and send it to Kinesis.* 1. Get gzip files from an s3 folder (s3://bucket/prefix) 2. Sort each file 3. Do some map/processing on each record in the file 4. Send to Kinesis. The idea is: env.readTextFile(s3Folder) .sort(SortFunction) .map(MapFunction) .sink(Kines
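The per-file logic in the steps above (decompress, sort, map) can be sketched in plain Python, independent of the Flink wiring and the Kinesis sink; the record format and the upper-casing map step are hypothetical placeholders:

```python
import gzip
import io


def process_file(gzipped_bytes: bytes) -> list:
    """Decompress one gzip file, sort its lines, and apply a map step.

    A non-Flink sketch of steps 2-3 of the pipeline described above;
    the upper-casing map function is only a placeholder.
    """
    with gzip.open(io.BytesIO(gzipped_bytes), mode="rt") as f:
        records = f.read().splitlines()
    records.sort()                       # step 2: sort each file
    return [r.upper() for r in records]  # step 3: map each record


# Build a small gzip payload in memory to exercise the function.
payload = gzip.compress(b"banana\napple\ncherry\n")
print(process_file(payload))  # ['APPLE', 'BANANA', 'CHERRY']
```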

I hit a bad jobmanager address when trying to use Flink SQL Client

2020-09-10 Thread Dan Hill
I just tried using the Flink SQL Client. A simple job is not running because it cannot reach the jobmanager. I'm not sure why the Flink SQL Client is hitting "flink-jobmanager/10.98.253.58:8081". I'd expect either "flink-jobmanager:8081" or "10.98.253.58:8081" (which should work with my kubernetes setup).
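For reference, the endpoint the SQL Client (and other clients) uses to reach the cluster is normally taken from conf/flink-conf.yaml. A minimal fragment for a Kubernetes setup like the one described might look as follows; the service name flink-jobmanager is taken from the message above, the rest is illustrative:

```yaml
# conf/flink-conf.yaml -- client-side connection settings (illustrative)
jobmanager.rpc.address: flink-jobmanager
rest.address: flink-jobmanager   # host clients use for the REST endpoint
rest.port: 8081
```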

Re: [DISCUSS] FLIP-143: Unified Sink API

2020-09-10 Thread Steven Wu
Guowei, Thanks a lot for the proposal and for starting the discussion thread. Very excited. For the big question of "Is the sink an operator or a topology?", I have a few related sub-questions. * Where should we run the committers? * Is the committer parallel or single parallelism? * Can a single cho

Re: arbitrary state handling in python api

2020-09-10 Thread Dian Fu
Hi Georg, It still doesn't support state access in the Python API in the latest version, 1.11. Could you take a look at whether KeyedProcessFunction could meet your requirements? We are planning to support it in the Python DataStream API in 1.12. Regards, Dian > On Sep 9, 2020, at 2:28 PM, Georg Heiler wrote: > > Hi

Measure CPU utilization

2020-09-10 Thread Piper Piper
Hello, What is the best way to measure the CPU utilization of a TaskManager in Flink, as opposed to using Linux's "top" command? Is querying the REST endpoint http://:/taskmanagers//metrics?get=Status.JVM.CPU.Load the best option? Roman's reply (copied below) from the archives suggests that it r
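The metrics endpoint mentioned above returns a JSON array of id/value pairs. A small sketch of building the URL and parsing such a response; the host, port, and taskmanager id are placeholders, and the sample response body is illustrative rather than a live HTTP call:

```python
import json

# Hypothetical host/port/taskmanager-id; substitute your own values.
host, port, tm_id = "localhost", 8081, "tm-1"
url = f"http://{host}:{port}/taskmanagers/{tm_id}/metrics?get=Status.JVM.CPU.Load"

# The endpoint returns a JSON array of {"id": ..., "value": ...} pairs;
# this sample body stands in for the real HTTP response.
body = '[{"id": "Status.JVM.CPU.Load", "value": "0.42"}]'

metrics = {m["id"]: float(m["value"]) for m in json.loads(body)}
cpu_load = metrics["Status.JVM.CPU.Load"]
print(cpu_load)  # 0.42
```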

Re: Flink Table API and not recognizing s3 plugins

2020-09-10 Thread Dan Hill
This is running on my local minikube and is trying to hit minio. On Thu, Sep 10, 2020 at 1:10 PM Dan Hill wrote: > I'm using this Helm chart > . I > start the job by building an image with the job jar and using kubectl apply > t

Re: Flink Table API and not recognizing s3 plugins

2020-09-10 Thread Arvid Heise
In general, I'd assume that JM and TM are enough. However, it seems like the query planner is doing some path sanitization for which it needs the filesystem. Since I don't know this part too well, I'm pulling in Jark and Dawid that may know more. I'm also not sure if this is intentional or a bug.

Re: Flink Table API and not recognizing s3 plugins

2020-09-10 Thread Dan Hill
I'm using this Helm chart . I start the job by building an image with the job jar and using kubectl apply to do a flink run with the jar. The log4j.properties on jobmanager and taskmanager have debug level set and are pretty embed

Re: Flink Table API and not recognizing s3 plugins

2020-09-10 Thread Dan Hill
Copying more of the log 2020-09-10 19:50:17,712 INFO org.apache.flink.client.cli.CliFrontend [] - 2020-09-10 19:50:17,718 INFO org.apache.flink.client.cli.CliFrontend [] - Starting

Re: Flink Table API and not recognizing s3 plugins

2020-09-10 Thread Dan Hill
I was able to get more info to output on jobmanager. 2020-09-10 19:50:17,722 INFO org.apache.flink.client.cli.CliFrontend [] - 2020-09-10 19:50:17,731 INFO org.apache.flink.configuration.GlobalConfig

Re: Flink Table API and not recognizing s3 plugins

2020-09-10 Thread Arvid Heise
Hi Dan, somehow enabling debug statements did not work. However, the logs help to narrow down the issue. The exception occurs neither on the jobmanager nor on the taskmanager. It occurs wherever you execute the command line interface. How do you execute the job? Do you start it from your machine? Can y

Re: Flink Table API and not recognizing s3 plugins

2020-09-10 Thread Dan Hill
I changed the levels to DEBUG. I don't see useful data in the logs. https://drive.google.com/file/d/1ua1zsr3BInY_8xdsWwA__F0uloAqy-vG/view?usp=sharing On Thu, Sep 10, 2020 at 8:45 AM Arvid Heise wrote: > Could you try 1) or 2) and enable debug logging* and share the log with us? > > *Usually b

Re: Flink Table API and not recognizing s3 plugins

2020-09-10 Thread Arvid Heise
Could you try 1) or 2) and enable debug logging* and share the log with us? *Usually by adjusting FLINK_HOME/conf/log4j.properties. On Thu, Sep 10, 2020 at 5:38 PM Dan Hill wrote: > Ah, sorry, it's a copy/paste issue with this email. I've tried both: > 1) using s3a uri with flink-s3-fs-hadoop
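For Flink 1.11, which ships a Log4j 2 "properties"-format config, raising the log level is typically a one-line change in that file. A sketch; the exact property names depend on the bundled log4j config of your Flink version:

```properties
# FLINK_HOME/conf/log4j.properties (Log4j 2 properties format in Flink 1.11)
rootLogger.level = DEBUG
```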

Re: Flink Table API and not recognizing s3 plugins

2020-09-10 Thread Dan Hill
Ah, sorry, it's a copy/paste issue with this email. I've tried both: 1) using s3a uri with flink-s3-fs-hadoop jar in /opt/flink/plugins/s3-fs-hadoop. 2) using s3p uri with flink-s3-fs-presto jar in /opt/flink/plugins/s3-fs-presto. 3) loading both 1 and 2 4) trying s3 uri. When doing 1) Caused by

Re: Flink Table API and not recognizing s3 plugins

2020-09-10 Thread Arvid Heise
Hi Dan, s3p is only provided by the flink-s3-fs-presto plugin. The plugin you used provides s3a (and both provide s3, but it's good to use the more specific prefix). Best, Arvid On Thu, Sep 10, 2020 at 9:24 AM Dan Hill wrote: > *Background* > I'm converting some prototype Flink v1.11.1 code that
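As described above, each filesystem plugin lives in its own subdirectory under the plugins directory and registers its own URI scheme. The layout and jar version below are illustrative (1.11.1 matches the Flink version mentioned in this thread):

```text
/opt/flink/plugins/
├── s3-fs-hadoop/
│   └── flink-s3-fs-hadoop-1.11.1.jar    # registers s3:// and s3a://
└── s3-fs-presto/
    └── flink-s3-fs-presto-1.11.1.jar    # registers s3:// and s3p://
```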

Re: [DISCUSS] Drop Scala 2.11

2020-09-10 Thread Seth Wiesman
@glen Yes, we would absolutely migrate statefun. StateFun can be compiled with Scala 2.12 today; I'm not sure why it's not cross-released. @aljoscha :) @mathieu It's on the roadmap, but it's non-trivial and I'm not aware of anyone actively working on it. On Thu, Sep 10, 2020 at 10:09 AM Matthieu

Streaming data to parquet

2020-09-10 Thread Marek Maj
Hello Flink Community, When designing our data pipelines, we very often encounter the requirement to stream traffic (usually from kafka) to external distributed file system (usually HDFS or S3). This data is typically meant to be queried from hive/presto or similar tools. Preferably data sits in c

Re: [DISCUSS] Drop Scala 2.11

2020-09-10 Thread Aljoscha Krettek
Yes! I would be in favour of this since it's blocking us from upgrading certain dependencies. I would also be in favour of dropping Scala completely but that's a different story. Aljoscha On 10.09.20 16:51, Seth Wiesman wrote: Hi Everyone, Think of this as a pre-flip, but what does everyon

[DISCUSS] Drop Scala 2.11

2020-09-10 Thread Seth Wiesman
Hi Everyone, Think of this as a pre-FLIP, but what does everyone think about dropping Scala 2.11 support from Flink? The last patch release was in 2017, and in that time the Scala community has released 2.13 and is working towards a 3.0 release. Apache Kafka and Spark have both dropped 2.11 suppor

Re: Flink DynamoDB stream connector losing records

2020-09-10 Thread Andrey Zagrebin
Generally speaking, this should not be a problem for exactly-once, but I am not familiar with DynamoDB and its Flink connector. Did you observe any failover in the Flink logs? On Thu, Sep 10, 2020 at 4:34 PM Jiawei Wu wrote: > And I suspect I have throttled by DynamoDB stream, I contacted AWS supp

Re: Flink DynamoDB stream connector losing records

2020-09-10 Thread Jiawei Wu
And I suspect I have been throttled by the DynamoDB stream. I contacted AWS support but got no response except for increasing WCU and RCU. Is it possible that Flink will lose exactly-once semantics when throttled? On Thu, Sep 10, 2020 at 10:31 PM Jiawei Wu wrote: > Hi Andrey, > > Thanks for your suggest

Re: Flink DynamoDB stream connector losing records

2020-09-10 Thread Jiawei Wu
Hi Andrey, Thanks for your suggestion, but I'm using a Kinesis analytics application, which supports only Flink 1.8. Regards, Jiawei On Thu, Sep 10, 2020 at 10:13 PM Andrey Zagrebin wrote: > Hi Jiawei, > > Could you try Flink latest release 1.11? > 1.8 will probably not get bugfix releases. >

Re: Flink DynamoDB stream connector losing records

2020-09-10 Thread Andrey Zagrebin
Hi Jiawei, Could you try the latest Flink release, 1.11? 1.8 will probably not get bugfix releases. I will cc Ying Xu who might have a better idea about the DynamoDB source. Best, Andrey On Thu, Sep 10, 2020 at 3:10 PM Jiawei Wu wrote: > Hi, > > I'm using AWS kinesis analytics application with Flin

Re: How to get Latency Tracking results?

2020-09-10 Thread David Anderson
Glad you got it working, and thanks for letting us know! David On Thu, Sep 10, 2020 at 2:40 PM Pankaj Chand wrote: > Actually, I was wrong. It turns out I was setting the values the wrong way > in conf/flink-conf.yaml. > I set "metrics.latency.interval 100" instead of "metrics.latency.interval:

[DISCUSS] FLIP-143: Unified Sink API

2020-09-10 Thread Guowei Ma
Hi, devs & users As discussed in FLIP-131[1], Flink will deprecate the DataSet API in favor of DataStream API and Table API. Users should be able to use DataStream API to write jobs that support both bounded and unbounded execution modes. However, Flink does not provide a sink API to guarantee the

Flink DynamoDB stream connector losing records

2020-09-10 Thread Jiawei Wu
Hi, I'm using an AWS kinesis analytics application with Flink 1.8. I am using the FlinkDynamoDBStreamsConsumer to consume DynamoDB stream records. But recently I found my internal state is wrong. After I printed some logs I found some DynamoDB stream records are skipped and not consumed by Flink. May

Re: How to get Latency Tracking results?

2020-09-10 Thread Pankaj Chand
Actually, I was wrong. It turns out I was setting the values the wrong way in conf/flink-conf.yaml. I set "metrics.latency.interval 100" instead of "metrics.latency.interval: 100". Sorry about that. T On Thu, Sep 10, 2020 at 7:05 AM Pankaj Chand wrote: > Thank you, David! > > After setting Exe
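As noted above, flink-conf.yaml entries need the colon-separated "key: value" form; for the latency tracking setting, that is:

```yaml
# conf/flink-conf.yaml -- note the colon after the key
metrics.latency.interval: 100
```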

Re: Flink 1.8.3 GC issues

2020-09-10 Thread Nico Kruber
What looks a bit strange to me is that with a running job, the SystemProcessingTimeService should actually not be collected (since it is still in use)! My guess is that something is indeed happening during that time frame (maybe job restarts?) and I would propose to check your logs for anything

Re: Flink 1.8.3 GC issues

2020-09-10 Thread Piotr Nowojski
Hi Josson, Thanks again for the detailed answer, and sorry that I can not help you with some immediate answer. I presume that jvm args for 1.8 are the same? Can you maybe post what exactly has crashed in your cases a) and b)? Re c), in the previously attached word document, it looks like Flink wa

Re: Performance issue associated with managed RocksDB memory

2020-09-10 Thread Juha Mynttinen
Hey I've fixed the code (https://github.com/juha-mynttinen-king/flink/commits/arena_block_sanity_check) slightly. Now it WARNs if there is the memory configuration issue. Also, I think there was a bug in the way the check calculated the mutable memory, fixed that. Also, wrote some tests. I tr

Re: How to get Latency Tracking results?

2020-09-10 Thread Pankaj Chand
Thank you, David! After setting ExecutionConfig for latency tracking in the SocketWindowWordCount Java source code and rebuilding and using that application jar file, I am now getting the latency tracking metrics using the REST endpoint. The below documentation [1] seems to imply that merely setti

Re:

2020-09-10 Thread Timo Walther
Hi Violeta, I just noticed that the plan might be generated from Flink's old planner instead of the new, more performant Blink planner. Which planner are you currently using? Regards, Timo On 08.09.20 17:51, Timo Walther wrote: You are using the old connectors. The new connectors are availa

Re: Watermark generation issues with File sources in Flink 1.11.1

2020-09-10 Thread Aljoscha Krettek
Thanks David! This saved me quite some time. Aljoscha On 09.09.20 19:58, David Anderson wrote: Arti, The problem with watermarks and the File source operator will be fixed in 1.11.2 [1]. This bug was introduced in 1.10.0, and isn't related to the new WatermarkStrategy api. [1] https://issues.

Re: Idle stream does not advance watermark in connected stream

2020-09-10 Thread Pierre Bedoucha
Hi, and thank you for this thread. I'm also experiencing the same bug/limitation when connecting streams. I had a quick look at the announcements but couldn't find any more information: is it planned to propagate the idle stream status to the operator in the upcoming Flink minor versio

Re: Slow Performance inquiry

2020-09-10 Thread Timo Walther
Hi Heidy, I agree with David that a heap-based state backend would reduce the serialization overhead a lot. If you'd like to optimize your serialization further, I would recommend looking at the type that comes out of TypeInformation.of with a debugger. You can find a list of all types and a

Flink Table API and not recognizing s3 plugins

2020-09-10 Thread Dan Hill
*Background* I'm converting some prototype Flink v1.11.1 code that uses DataSet/DataTable APIs to use the Table API. *Problem* When switching to using the Table API, my s3 plugins stopped working. I don't know why. I've added the required maven table dependencies to the job. I've tried us movin