Re: Increase in parallelism has very bad impact on performance

2020-11-02 Thread Yangze Guo
Hi, Sidney,

What is the data generation rate of your Kafka topic? Is it a lot
bigger than 6000?

Best,
Yangze Guo


On Tue, Nov 3, 2020 at 8:45 AM Sidney Feiner  wrote:
>
> Hey,
> I'm writing a Flink app that does some transformation on an event consumed 
> from Kafka, then creates time windows keyed by some field, and applies an 
> aggregation on all those events.
> When I run it with parallelism 1, I get a throughput of around 1.6K events 
> per second (so also 1.6K events per slot). With parallelism 5, that goes down 
> to 1.2K events per slot, and when I increase the parallelism to 10, it drops 
> to 600 events per slot.
> Which means that parallelism 5 and parallelism 10 give me the same total 
> throughput (1.2K x 5 = 6K = 600 x 10).
>
> I noticed that although I have 3 Task Managers, all the tasks run on the 
> same machine, causing its CPU to spike; this is probably the reason the 
> throughput decreases so dramatically. After increasing the parallelism to 15, 
> tasks now run on 2 of the 3 machines, but the average throughput per slot is 
> still around 600.
>
> What could cause this dramatic decrease in performance?
>
> Extra info:
>
> Flink version 1.9.2
> Flink High Availability mode
> 3 task managers, 66 slots total
>
>
> Execution plan:
>
>
> Any help would be much appreciated
>
>
> Sidney Feiner / Data Platform Developer
> M: +972.528197720 / Skype: sidney.feiner.startapp
>
>


Re: Flink on YARN: delegation token expired prevent job restart

2020-11-17 Thread Yangze Guo
Hi, Kien,

Do you configure the "security.kerberos.login.principal" and the
"security.kerberos.login.keytab" together? If you only set the keytab,
it will not take effect.
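
For example, a minimal flink-conf.yaml sketch (the keytab path and principal
below are placeholders, adjust them to your environment):

security.kerberos.login.keytab: /path/to/flink.keytab
security.kerberos.login.principal: flink-user@EXAMPLE.COM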

Best,
Yangze Guo

On Tue, Nov 17, 2020 at 3:03 PM Kien Truong  wrote:
>
> Hi all,
>
> We are having an issue where Flink Application Master is unable to 
> automatically restart Flink job after its delegation token has expired.
>
> We are using Flink 1.11 with YARN 3.1.1 in single-job-per-yarn-cluster mode. 
> We have also added a valid keytab configuration and the taskmanagers are able 
> to log in with keytabs correctly. However, it seems the YARN Application Master 
> still uses delegation tokens instead of the keytab.
>
> Any idea how to resolve this would be much appreciated.
>
> Thanks
> Kien
>
>
>
>


Re: Flink on YARN: delegation token expired prevent job restart

2020-11-17 Thread Yangze Guo
Hi,

AFAIK, Flink does exclude the HDFS_DELEGATION_TOKEN in the
HadoopModule when the user provides the keytab and principal. I'll try to
do a deeper investigation to figure out whether there is any HDFS access
before the HadoopModule is installed.
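
For reference, the keytab-based login conceptually boils down to something like
the following sketch of Hadoop's UserGroupInformation API (a simplified
illustration, not the exact Flink code; the principal and keytab path are
placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabLoginSketch {
    public static void main(String[] args) throws Exception {
        Configuration hadoopConf = new Configuration();
        // Enable Kerberos authentication for the Hadoop client.
        hadoopConf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(hadoopConf);
        // Log in from the keytab so that HDFS access does not depend on a
        // (possibly expired) delegation token.
        UserGroupInformation.loginUserFromKeytab(
                "flink-user@EXAMPLE.COM", "/path/to/flink.keytab");
        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}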

Best,
Yangze Guo


On Tue, Nov 17, 2020 at 4:36 PM Kien Truong  wrote:
>
> Hi,
>
> Yes, I did. There're also logs about logging in using keytab successfully in 
> both Job Manager and Task Manager.
>
> I found some YARN docs about token renewal on AM restart
>
>
> > Therefore, to survive AM restart after token expiry, your AM has to get the 
> > NMs to localize the keytab or make no HDFS accesses until (somehow) a new 
> > token has been passed to them from a client.
>
> Maybe Flink did access HDFS with an expired token, before switching to use 
> the localized keytab ?
>
> Regards,
> Kien
>
>
>
> On 17 Nov 2020 at 15:14, Yangze Guo  wrote:
>
> Hi, Kien,
>
>
>
> Do you config the "security.kerberos.login.principal" and the
>
> "security.kerberos.login.keytab" together? If you only set the keytab,
>
> it will not take effect.
>
>
>
> Best,
>
> Yangze Guo
>
>
>
> On Tue, Nov 17, 2020 at 3:03 PM Kien Truong  wrote:
>
> >
>
> > Hi all,
>
> >
>
> > We are having an issue where Flink Application Master is unable to 
> > automatically restart Flink job after its delegation token has expired.
>
> >
>
> > We are using Flink 1.11 with YARN 3.1.1 in single job per yarn-cluster 
> > mode. We have also add valid keytab configuration and taskmanagers are able 
> > to login with keytabs correctly. However, it seems YARN Application Master 
> > still use delegation tokens instead of the keytab.
>
> >
>
> > Any idea how to resolve this would be much appreciated.
>
> >
>
> > Thanks
>
> > Kien
>
> >
>
> >
>
> >
>
> >
>


Re: Flink on YARN: delegation token expired prevent job restart

2020-11-17 Thread Yangze Guo
Hi,

There is a login operation in
YarnEntrypointUtils.logYarnEnvironmentInformation that does not use the keytab.
One suspicion is that Flink may access HDFS when it tries to build
the PackagedProgram.

Does this issue only happen in application mode? If so, I would cc
@kkloudas.

Best,
Yangze Guo

On Tue, Nov 17, 2020 at 4:52 PM Yangze Guo  wrote:
>
> Hi,
>
> AFAIK, Flink does exclude the HDFS_DELEGATION_TOKEN in the
> HadoopModule when user provides the keytab and principal. I'll try to
> do a deeper investigation to figure out is there any HDFS access
> before the HadoopModule installed.
>
> Best,
> Yangze Guo
>
>
> On Tue, Nov 17, 2020 at 4:36 PM Kien Truong  wrote:
> >
> > Hi,
> >
> > Yes, I did. There're also logs about logging in using keytab successfully 
> > in both Job Manager and Task Manager.
> >
> > I found some YARN docs about token renewal on AM restart
> >
> >
> > > Therefore, to survive AM restart after token expiry, your AM has to get 
> > > the NMs to localize the keytab or make no HDFS accesses until (somehow) a 
> > > new token has been passed to them from a client.
> >
> > Maybe Flink did access HDFS with an expired token, before switching to use 
> > the localized keytab ?
> >
> > Regards,
> > Kien
> >
> >
> >
> > On 17 Nov 2020 at 15:14, Yangze Guo  wrote:
> >
> > Hi, Kien,
> >
> >
> >
> > Do you config the "security.kerberos.login.principal" and the
> >
> > "security.kerberos.login.keytab" together? If you only set the keytab,
> >
> > it will not take effect.
> >
> >
> >
> > Best,
> >
> > Yangze Guo
> >
> >
> >
> > On Tue, Nov 17, 2020 at 3:03 PM Kien Truong  wrote:
> >
> > >
> >
> > > Hi all,
> >
> > >
> >
> > > We are having an issue where Flink Application Master is unable to 
> > > automatically restart Flink job after its delegation token has expired.
> >
> > >
> >
> > > We are using Flink 1.11 with YARN 3.1.1 in single job per yarn-cluster 
> > > mode. We have also add valid keytab configuration and taskmanagers are 
> > > able to login with keytabs correctly. However, it seems YARN Application 
> > > Master still use delegation tokens instead of the keytab.
> >
> > >
> >
> > > Any idea how to resolve this would be much appreciated.
> >
> > >
> >
> > > Thanks
> >
> > > Kien
> >
> > >
> >
> > >
> >
> > >
> >
> > >
> >


Re: taskmanager.cpu.cores 1.7976931348623157E308

2020-12-06 Thread Yangze Guo
Hi, Rex,

Can you share more logs? Did you see something like "The
configuration option taskmanager.cpu.cores required for local
execution is not set, setting it to" in your logs?

Best,
Yangze Guo


On Sat, Dec 5, 2020 at 6:53 PM David Anderson  wrote:
>
> taskmanager.cpu.cores is intended for internal use only -- you aren't meant 
> to set this option. What happens if you leave it alone?
>
> Regards,
> David
>
>
> On Sat, Dec 5, 2020 at 8:04 AM Rex Fenley  wrote:
>>
>> We're running this in a local environment so that may be contributing to 
>> what we're seeing.
>>
>> On Fri, Dec 4, 2020 at 10:41 PM Rex Fenley  wrote:
>>>
>>> Hello,
>>>
>>> I'm tuning flink for parallelism right now and when I look at the 
>>> JobManager I see
>>> taskmanager.cpu.cores  1.7976931348623157E308
>>> Which looks like the maximum double number.
>>>
>>> We have 8 cpu cores, so we figured we'd bump to 16 for hyper threading. We 
>>> have 37 operators so we rounded up and set 40 task slots.
>>>
>>> Here is our configuration
>>>
>>> "vmArgs": "-Xmx16g -Xms16g -XX:MaxDirectMemorySize=1207959552 
>>> -XX:MaxMetaspaceSize=268435456 -Dlog.file=/tmp/flink.log 
>>> -Dtaskmanager.memory.framework.off-heap.size=134217728b 
>>> -Dtaskmanager.memory.network.max=1073741824b 
>>> -Dtaskmanager.memory.network.min=1073741824b 
>>> -Dtaskmanager.memory.framework.heap.size=134217728b 
>>> -Dtaskmanager.memory.managed.size=6335076856b 
>>> -Dtaskmanager.memory.task.heap.size=8160437768b 
>>> -Dtaskmanager.memory.task.off-heap.size=0b 
>>> -Dtaskmanager.numberOfTaskSlots=40 -Dtaskmanager.cpu.cores=16.0"
>>>
>>> We then tried with -Dtaskmanager.cpu.cores=7.0 and still ended up with that 
>>> very odd value for cpu cores.
>>>
>>> How do we correctly adjust this?
>>>
>>> Thanks!
>>> --
>>>
>>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>>
>>>
>>> Remind.com |  BLOG  |  FOLLOW US  |  LIKE US
>>
>>
>>
>> --
>>
>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>
>>
>> Remind.com |  BLOG  |  FOLLOW US  |  LIKE US


Re: taskmanager.cpu.cores 1.7976931348623157E308

2020-12-06 Thread Yangze Guo
My gut feeling is that your "vmArgs" do not take effect.
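
If this is a local (in-IDE) execution, one way to be sure these options reach
the runtime is to pass them programmatically instead of as JVM -D arguments.
A rough sketch (the values are just examples; the keys are the standard
taskmanager options mentioned in this thread):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
conf.setDouble("taskmanager.cpu.cores", 8.0);          // taskmanager.cpu.cores
conf.setInteger("taskmanager.numberOfTaskSlots", 16);  // slots per TM
// Local execution reads the configuration passed in here rather than -D args.
StreamExecutionEnvironment env =
        StreamExecutionEnvironment.createLocalEnvironment(4, conf);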

Best,
Yangze Guo

On Mon, Dec 7, 2020 at 10:32 AM Yangze Guo  wrote:
>
> Hi, Rex,
>
> Can you share more logs for it. Did you see something like "The
> configuration option taskmanager.cpu.cores required for local
> execution is not set, setting it to" in your logs?
>
> Best,
> Yangze Guo
>
> Best,
> Yangze Guo
>
>
> On Sat, Dec 5, 2020 at 6:53 PM David Anderson  wrote:
> >
> > taskmanager.cpu.cores is intended for internal use only -- you aren't meant 
> > to set this option. What happens if you leave it alone?
> >
> > Regards,
> > David
> >
> >
> > On Sat, Dec 5, 2020 at 8:04 AM Rex Fenley  wrote:
> >>
> >> We're running this in a local environment so that may be contributing to 
> >> what we're seeing.
> >>
> >> On Fri, Dec 4, 2020 at 10:41 PM Rex Fenley  wrote:
> >>>
> >>> Hello,
> >>>
> >>> I'm tuning flink for parallelism right now and when I look at the 
> >>> JobManager I see
> >>> taskmanager.cpu.cores1.7976931348623157E308
> >>> Which looks like the maximum double number.
> >>>
> >>> We have 8 cpu cores, so we figured we'd bump to 16 for hyper threading. 
> >>> We have 37 operators so we rounded up and set 40 task slots.
> >>>
> >>> Here is our configuration
> >>>
> >>> "vmArgs": "-Xmx16g -Xms16g -XX:MaxDirectMemorySize=1207959552 
> >>> -XX:MaxMetaspaceSize=268435456 -Dlog.file=/tmp/flink.log 
> >>> -Dtaskmanager.memory.framework.off-heap.size=134217728b 
> >>> -Dtaskmanager.memory.network.max=1073741824b 
> >>> -Dtaskmanager.memory.network.min=1073741824b 
> >>> -Dtaskmanager.memory.framework.heap.size=134217728b 
> >>> -Dtaskmanager.memory.managed.size=6335076856b 
> >>> -Dtaskmanager.memory.task.heap.size=8160437768b 
> >>> -Dtaskmanager.memory.task.off-heap.size=0b 
> >>> -Dtaskmanager.numberOfTaskSlots=40 -Dtaskmanager.cpu.cores=16.0"
> >>>
> >>> We then tried with -Dtaskmanager.cpu.cores=7.0 and still ended up with 
> >>> that very odd value for cpu cores.
> >>>
> >>> How do we correctly adjust this?
> >>>
> >>> Thanks!
> >>> --
> >>>
> >>> Rex Fenley  |  Software Engineer - Mobile and Backend
> >>>
> >>>
> >>> Remind.com |  BLOG  |  FOLLOW US  |  LIKE US
> >>
> >>
> >>
> >> --
> >>
> >> Rex Fenley  |  Software Engineer - Mobile and Backend
> >>
> >>
> >> Remind.com |  BLOG  |  FOLLOW US  |  LIKE US


Re: Main class logs in Yarn Mode

2021-01-12 Thread Yangze Guo
The main function of your WordCountExample is executed in your local
environment. So, the logs you are looking for ("Entering
application.") are located in your console output and in the "log/"
directory of your Flink distribution.

Best,
Yangze Guo

On Tue, Jan 12, 2021 at 4:50 PM bat man  wrote:
>
> Hi,
>
> I am running a sample job as below -
>
> public class WordCountExample {
> static Logger logger = LoggerFactory.getLogger(WordCountExample.class);
>
> public static void main(String[] args) throws Exception {
> final ExecutionEnvironment env = 
> ExecutionEnvironment.getExecutionEnvironment();
>
> logger.info("Entering application.");
>
> DataSet<String> text = env.fromElements(
> "Who's there?",
> "I think I hear them. Stand, ho! Who's there?");
>
> List<Integer> elements = new ArrayList<Integer>();
> elements.add(0);
>
>
> DataSet<TestClass> set = env.fromElements(new TestClass(elements));
>
> DataSet<Tuple2<String, Integer>> wordCounts = text
> .flatMap(new LineSplitter())
> .withBroadcastSet(set, "set")
> .groupBy(0)
> .sum(1);
>
> wordCounts.print();
>
> logger.info("Processing done");
>
> //env.execute("wordcount job complete");
>
> }
>
> public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>
> static Logger loggerLineSplitter = 
> LoggerFactory.getLogger(LineSplitter.class);
>
> @Override
> public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
> loggerLineSplitter.info("Logger in LineSplitter.flatMap");
> for (String word : line.split(" ")) {
> out.collect(new Tuple2<String, Integer>(word, 1));
> }
> }
> }
>
> public static class TestClass implements Serializable {
> private static final long serialVersionUID = -2932037991574118651L;
>
> static Logger loggerTestClass = 
> LoggerFactory.getLogger("WordCountExample.TestClass");
>
> List<Integer> integerList;
> public TestClass(List<Integer> integerList){
> this.integerList=integerList;
> loggerTestClass.info("Logger in TestClass");
> }
>
>
> }
> }
>
> When run in IDE I can see the logs from main class i.e. statements like below 
> in console logs -
>
> 13:40:24.459 [main] INFO  com.flink.transform.WordCountExample - Entering 
> application.
> 13:40:24.486 [main] INFO  WordCountExample.TestClass - Logger in TestClass
>
>
> When run on Yarn with command - flink run -m yarn-cluster  -c 
> com.flink.transform.WordCountExample rt-1.0-jar-with-dependencies.jar
>
> I only see the flatmap logging statements like -
> INFO  com.flink.transform.WordCountExample$LineSplitter - Logger in 
> LineSplitter.flatMap
> INFO  com.flink.transform.WordCountExample$LineSplitter - Logger in 
> LineSplitter.flatMap
>
> I have checked the jobmanager and taskmanager logs from yarn in EMR.
>
> This is my log4j.properties from EMR cluster
>
> log4j.rootLogger=INFO,file,elastic
>
> # Config ES logging appender
> log4j.appender.elastic=com.letfy.log4j.appenders.ElasticSearchClientAppender
> log4j.appender.elastic.elasticHost=http://<>:9200
> log4j.appender.elastic.hostName=<>
> log4j.appender.elastic.applicationName=<>
>
> # more options (see github project for the full list)
> log4j.appender.elastic.elasticIndex=<>
> log4j.appender.elastic.elasticType=<>
>
> # Log all infos in the given file
> log4j.appender.file=org.apache.log4j.FileAppender
> log4j.appender.file.file=${log.file}
> log4j.appender.file.append=false
> log4j.appender.file.layout=org.apache.log4j.PatternLayout
> log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p 
> %-60c %x - %m%n
>
> # suppress the irrelevant (wrong) warnings from the netty channel handler
> log4j.logger.org.jboss.netty.channel.DefaultChannelPipeline=ERROR,file
>
>
> How can I access the main (driver) logs when the job runs on YARN in cluster mode?
>
> Thanks,
> Hemant
>
>
>
>


Re: Main class logs in Yarn Mode

2021-01-12 Thread Yangze Guo
I think you can try the application mode[1].

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/deployment/#application-mode
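
In application mode the main() method runs on the JobManager (inside the YARN
application master), so the "Entering application." style log statements end
up in the JobManager log that YARN collects. A hedged example, reusing the jar
and class from your earlier command (application mode is available since Flink
1.11; please check the linked docs for the exact options of your version):

./bin/flink run-application -t yarn-application \
    -c com.flink.transform.WordCountExample rt-1.0-jar-with-dependencies.jar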

Best,
Yangze Guo

On Tue, Jan 12, 2021 at 5:23 PM bat man  wrote:
>
> Thanks Yangze Guo.
> Is there a way these can be redirected to the YARN logs?
>
> On Tue, 12 Jan 2021 at 2:35 PM, Yangze Guo  wrote:
>>
>> The main function of your WordCountExample is executed in your local
>> environment. So, the logs you are looking for ("Entering
>> application.") are be located in your console output and the "log/"
>> directory of your Flink distribution.
>>
>> Best,
>> Yangze Guo
>>
>> On Tue, Jan 12, 2021 at 4:50 PM bat man  wrote:
>> >
>> > Hi,
>> >
>> > I am running a sample job as below -
>> >
>> > public class WordCountExample {
>> > static Logger logger = LoggerFactory.getLogger(WordCountExample.class);
>> >
>> > public static void main(String[] args) throws Exception {
>> > final ExecutionEnvironment env = 
>> > ExecutionEnvironment.getExecutionEnvironment();
>> >
>> > logger.info("Entering application.");
>> >
>> > DataSet<String> text = env.fromElements(
>> > "Who's there?",
>> > "I think I hear them. Stand, ho! Who's there?");
>> >
>> > List<Integer> elements = new ArrayList<Integer>();
>> > elements.add(0);
>> >
>> >
>> > DataSet<TestClass> set = env.fromElements(new TestClass(elements));
>> >
>> > DataSet<Tuple2<String, Integer>> wordCounts = text
>> > .flatMap(new LineSplitter())
>> > .withBroadcastSet(set, "set")
>> > .groupBy(0)
>> > .sum(1);
>> >
>> > wordCounts.print();
>> >
>> > logger.info("Processing done");
>> >
>> > //env.execute("wordcount job complete");
>> >
>> > }
>> >
>> > public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>> >
>> > static Logger loggerLineSplitter = 
>> > LoggerFactory.getLogger(LineSplitter.class);
>> >
>> > @Override
>> > public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
>> > loggerLineSplitter.info("Logger in LineSplitter.flatMap");
>> > for (String word : line.split(" ")) {
>> > out.collect(new Tuple2<String, Integer>(word, 1));
>> > }
>> > }
>> > }
>> >
>> > public static class TestClass implements Serializable {
>> > private static final long serialVersionUID = -2932037991574118651L;
>> >
>> > static Logger loggerTestClass = 
>> > LoggerFactory.getLogger("WordCountExample.TestClass");
>> >
>> > List<Integer> integerList;
>> > public TestClass(List<Integer> integerList){
>> > this.integerList=integerList;
>> > loggerTestClass.info("Logger in TestClass");
>> > }
>> >
>> >
>> > }
>> > }
>> >
>> > When run in IDE I can see the logs from main class i.e. statements like 
>> > below in console logs -
>> >
>> > 13:40:24.459 [main] INFO  com.flink.transform.WordCountExample - Entering 
>> > application.
>> > 13:40:24.486 [main] INFO  WordCountExample.TestClass - Logger in TestClass
>> >
>> >
>> > When run on Yarn with command - flink run -m yarn-cluster  -c 
>> > com.flink.transform.WordCountExample rt-1.0-jar-with-dependencies.jar
>> >
>> > I only see the flatmap logging statements like -
>> > INFO  com.flink.transform.WordCountExample$LineSplitter - Logger in 
>> > LineSplitter.flatMap
>> > INFO  com.flink.transform.WordCountExample$LineSplitter - Logger in 
>> > LineSplitter.flatMap
>> >
>> > I have checked the jobmanager and taskmanager logs from yarn in EMR.
>> >
>> > This is my log4j.properties from EMR cluster
>> >
>> > log4j.rootLogger=INFO,file,elastic
>> >
>> > # Config ES logging appender
>> > log4j.appender.elastic=com.letfy.log4j.appenders.ElasticSearchClientAppender
>> > log4j.appender.elastic.elasticHost=http://<>:9200
>> > log4j.appender.elastic.hostName=<>
>> > log4j.appender.elastic.applicationName=<>
>> >
>> > # more options (see github project for the full list)
>> > log4j.appender.elastic.elasticIndex=<>
>> > log4j.appender.elastic.elasticType=<>
>> >
>> > # Log all infos in the given file
>> > log4j.appender.file=org.apache.log4j.FileAppender
>> > log4j.appender.file.file=${log.file}
>> > log4j.appender.file.append=false
>> > log4j.appender.file.layout=org.apache.log4j.PatternLayout
>> > log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} 
>> > %-5p %-60c %x - %m%n
>> >
>> > # suppress the irrelevant (wrong) warnings from the netty channel handler
>> > log4j.logger.org.jboss.netty.channel.DefaultChannelPipeline=ERROR,file
>> >
>> >
>> > How can I access main driver logs when run on yarn as master.
>> >
>> > Thanks,
>> > Hemant
>> >
>> >
>> >
>> >


Re: Number of parallel connections for Elasticsearch Connector

2021-01-17 Thread Yangze Guo
Hi, Rex.

> How many connections does the ES connector use to write to Elasticsearch?
I think the number is equal to your parallelism. Each subtask of an
Elasticsearch sink will have its own separate BulkProcessor, as both
the client and the BulkProcessor are private fields of the sink [1]. The
subtasks will be placed into different slots, and each has its own
Elasticsearch sink instance.

[1] 
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-elasticsearch-base/src/main/java/org/apache/flink/streaming/connectors/elasticsearch/ElasticsearchSinkBase.java#L204.
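
If you want more (or fewer) parallel bulk writers than the rest of the job,
you can also set the parallelism on the sink operator explicitly. A rough
sketch with the elasticsearch7 connector (the host, index, and field names are
placeholders, not taken from your job; "stream" is assumed to be an existing
DataStream<String>):

import java.util.Collections;
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.client.Requests;

ElasticsearchSink.Builder<String> builder = new ElasticsearchSink.Builder<>(
        Collections.singletonList(new HttpHost("localhost", 9200, "http")),
        new ElasticsearchSinkFunction<String>() {
            @Override
            public void process(String element, RuntimeContext ctx, RequestIndexer indexer) {
                indexer.add(Requests.indexRequest()
                        .index("my-index")
                        .source(Collections.singletonMap("value", element)));
            }
        });
// These flush settings apply per subtask, i.e. per BulkProcessor instance.
builder.setBulkFlushMaxActions(1000);

stream.addSink(builder.build())
      .setParallelism(4);  // 4 subtasks => 4 clients / BulkProcessors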

Best,
Yangze Guo

On Sun, Jan 17, 2021 at 11:33 AM Rex Fenley  wrote:
>
> I found the following, indicating that there is no concurrency for the 
> Elasticsearch Connector 
> https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-connectors/flink-connector-elasticsearch-base/src/main/java/org/apache/flink/streaming/connectors/elasticsearch/ElasticsearchSinkBase.java#L382
>
> Does each subtask of an Elasticsearch sink have its own separate Bulk 
> Processor to allow for parallel bulk writes?
>
> Thanks!
>
> On Sat, Jan 16, 2021 at 10:33 AM Rex Fenley  wrote:
>>
>> Hello,
>>
>> How many connections does the ES connector use to write to Elasticsearch? We 
>> have a single machine with 16 vCPUs and parallelism of 4 running our job, 
>> with -p 4 I'd expect there to be 4 parallel bulk request writers / 
>> connections to Elasticsearch. Is there a place in the code to confirm this?
>>
>> Thanks!
>>
>> --
>>
>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>
>>
>> Remind.com |  BLOG  |  FOLLOW US  |  LIKE US
>
>
>
> --
>
> Rex Fenley  |  Software Engineer - Mobile and Backend
>
>
> Remind.com |  BLOG  |  FOLLOW US  |  LIKE US


Re: Monitor the Flink

2021-01-17 Thread Yangze Guo
Hi,

First of all, there’s no resource isolation atm between
operators/tasks within a slot, except for managed memory. So,
monitoring of individual tasks might be meaningless.

Regarding TM/JM level cpu/memory metrics, you can refer to [1] and
[2]. Regarding the traffic between tasks, you can refer to [3].

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/metrics.html#cpu
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/metrics.html#memory
[3] 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/metrics.html#default-shuffle-service
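
If it helps, these metrics can be fetched from the REST API (e.g.
http://<jobmanager>:8081/taskmanagers/<tm-id>/metrics?get=Status.JVM.CPU.Load)
or pushed to an external system through a metrics reporter. A small
flink-conf.yaml sketch for the bundled Prometheus reporter (the reporter name
and port range are just examples):

metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9250-9260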

Best,
Yangze Guo

On Sun, Jan 17, 2021 at 6:43 PM penguin.  wrote:
>
> Hello,
>
>
> In the Flink cluster,
>
> How to monitor each taskslot of taskmanager? For example, the CPU and memory 
> usage of each slot and the traffic between slots.
>
> What is the way to get the traffic between nodes?
>
> thank you very much!
>
>
> penguin
>
>
>
>


Re: [ANNOUNCE] Apache Flink 1.12.1 released

2021-01-19 Thread Yangze Guo
Thanks Xintong for the great work!

Best,
Yangze Guo

On Tue, Jan 19, 2021 at 4:47 PM Till Rohrmann  wrote:
>
> Thanks a lot for driving this release Xintong. This was indeed a release with 
> some obstacles to overcome and you did it very well!
>
> Cheers,
> Till
>
> On Tue, Jan 19, 2021 at 5:59 AM Xingbo Huang  wrote:
>>
>> Thanks Xintong for the great work!
>>
>> Best,
>> Xingbo
>>
>> Peter Huang  wrote on Tue, Jan 19, 2021 at 12:51 PM:
>>
>> > Thanks for the great effort to make this happen. It paves us from using
>> > 1.12 soon.
>> >
>> > Best Regards
>> > Peter Huang
>> >
>> > On Mon, Jan 18, 2021 at 8:16 PM Yang Wang  wrote:
>> >
>> > > Thanks Xintong for the great work as our release manager!
>> > >
>> > >
>> > > Best,
>> > > Yang
>> > >
>> > > Xintong Song  wrote on Tue, Jan 19, 2021 at 11:53 AM:
>> > >
>> > >> The Apache Flink community is very happy to announce the release of
>> > >> Apache Flink 1.12.1, which is the first bugfix release for the Apache
>> > Flink
>> > >> 1.12 series.
>> > >>
>> > >> Apache Flink® is an open-source stream processing framework for
>> > >> distributed, high-performing, always-available, and accurate data
>> > streaming
>> > >> applications.
>> > >>
>> > >> The release is available for download at:
>> > >> https://flink.apache.org/downloads.html
>> > >>
>> > >> Please check out the release blog post for an overview of the
>> > >> improvements for this bugfix release:
>> > >> https://flink.apache.org/news/2021/01/19/release-1.12.1.html
>> > >>
>> > >> The full release notes are available in Jira:
>> > >>
>> > >>
>> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12349459
>> > >>
>> > >> We would like to thank all contributors of the Apache Flink community
>> > who
>> > >> made this release possible!
>> > >>
>> > >> Regards,
>> > >> Xintong
>> > >>
>> > >
>> >


Re: [BULK]Re: [SURVEY] Remove Mesos support

2021-03-28 Thread Yangze Guo
+1

Best,
Yangze Guo

On Mon, Mar 29, 2021 at 11:31 AM Xintong Song  wrote:
>
> +1
> It has already been a matter of fact for a while that we no longer port new 
> features to the Mesos deployment.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Fri, Mar 26, 2021 at 10:37 PM Till Rohrmann  wrote:
>>
>> +1 for officially deprecating this component for the 1.13 release.
>>
>> Cheers,
>> Till
>>
>> On Thu, Mar 25, 2021 at 1:49 PM Konstantin Knauf  wrote:
>>>
>>> Hi Matthias,
>>>
>>> Thank you for following up on this. +1 to officially deprecate Mesos in the 
>>> code and documentation, too. It will be confusing for users if this 
>>> diverges from the roadmap.
>>>
>>> Cheers,
>>>
>>> Konstantin
>>>
>>> On Thu, Mar 25, 2021 at 12:23 PM Matthias Pohl  
>>> wrote:
>>>>
>>>> Hi everyone,
>>>> considering the upcoming release of Flink 1.13, I wanted to revive the
>>>> discussion about the Mesos support ones more. Mesos is also already listed
>>>> as deprecated in Flink's overall roadmap [1]. Maybe, it's time to align the
>>>> documentation accordingly to make it more explicit?
>>>>
>>>> What do you think?
>>>>
>>>> Best,
>>>> Matthias
>>>>
>>>> [1] https://flink.apache.org/roadmap.html#feature-radar
>>>>
>>>> On Wed, Oct 28, 2020 at 9:40 AM Till Rohrmann  wrote:
>>>>
>>>> > Hi Oleksandr,
>>>> >
>>>> > yes you are right. The biggest problem is at the moment the lack of test
>>>> > coverage and thereby confidence to make changes. We have some e2e tests
>>>> > which you can find here [1]. These tests are, however, quite coarse 
>>>> > grained
>>>> > and are missing a lot of cases. One idea would be to add a Mesos e2e test
>>>> > based on Flink's end-to-end test framework [2]. I think what needs to be
>>>> > done there is to add a Mesos resource and a way to submit jobs to a Mesos
>>>> > cluster to write e2e tests.
>>>> >
>>>> > [1] https://github.com/apache/flink/tree/master/flink-jepsen
>>>> > [2]
>>>> > https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-end-to-end-tests-common
>>>> >
>>>> > Cheers,
>>>> > Till
>>>> >
>>>> > On Tue, Oct 27, 2020 at 12:29 PM Oleksandr Nitavskyi <
>>>> > o.nitavs...@criteo.com> wrote:
>>>> >
>>>> >> Hello Xintong,
>>>> >>
>>>> >> Thanks for the insights and support.
>>>> >>
>>>> >> Browsing the Mesos backlog and didn't identify anything critical, which
>>>> >> is left there.
>>>> >>
>>>> >> I see that there are were quite a lot of contributions to the Flink 
>>>> >> Mesos
>>>> >> in the recent version:
>>>> >> https://github.com/apache/flink/commits/master/flink-mesos.
>>>> >> We plan to validate the current Flink master (or release 1.12 branch) 
>>>> >> our
>>>> >> Mesos setup. In case of any issues, we will try to propose changes.
>>>> >> My feeling is that our test results shouldn't affect the Flink 1.12
>>>> >> release cycle. And if any potential commits will land into the 1.12.1 it
>>>> >> should be totally fine.
>>>> >>
>>>> >> In the future, we would be glad to help you guys with any
>>>> >> maintenance-related questions. One of the highest priorities around this
>>>> >> component seems to be the development of the full e2e test.
>>>> >>
>>>> >> Kind Regards
>>>> >> Oleksandr Nitavskyi
>>>> >> 
>>>> >> From: Xintong Song 
>>>> >> Sent: Tuesday, October 27, 2020 7:14 AM
>>>> >> To: dev ; user 
>>>> >> Cc: Piyush Narang 
>>>> >> Subject: [BULK]Re: [SURVEY] Remove Mesos support
>>>> >>
>>>> >> Hi Piyush,
>>>> >>
>>>> >> Thanks a lot for sharing the information. It would be a great relief 
>>>> >> that
>>>> >> you are good with Flink on Mesos as is.
>>>> >>

Re: period batch job lead to OutOfMemoryError: Metaspace problem

2021-04-06 Thread Yangze Guo
I think you can try to increase the JVM metaspace option for
TaskManagers through taskmanager.memory.jvm-metaspace.size. [1]

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/memory/mem_trouble/#outofmemoryerror-metaspace
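
For example, in flink-conf.yaml (the value is just a starting point to tune):

taskmanager.memory.jvm-metaspace.size: 512m

The default in recent versions is 256m, so doubling it is a common first step
when class loading is the suspected cause.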

Best,
Yangze Guo


On Tue, Apr 6, 2021 at 4:22 PM 太平洋 <495635...@qq.com> wrote:
>
> batch job:
> read data from S3 by SQL, then through some operators, and write data to 
> ClickHouse and Kafka.
> after some time, the task manager quits with OutOfMemoryError: Metaspace.
>
> env:
> flink version:1.12.2
> task-manager slot count: 5
> deployment: standalone Kubernetes session mode
> dependencies:
>
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-connector-kafka_2.11</artifactId>
>   <version>${flink.version}</version>
> </dependency>
> <dependency>
>   <groupId>com.google.code.gson</groupId>
>   <artifactId>gson</artifactId>
>   <version>2.8.5</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-connector-jdbc_2.11</artifactId>
>   <version>${flink.version}</version>
> </dependency>
> <dependency>
>   <groupId>ru.yandex.clickhouse</groupId>
>   <artifactId>clickhouse-jdbc</artifactId>
>   <version>0.3.0</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-parquet_2.11</artifactId>
>   <version>${flink.version}</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.flink</groupId>
>   <artifactId>flink-json</artifactId>
>   <version>${flink.version}</version>
> </dependency>
>
>
> heap dump1:
>
> Leak Suspects
>
>   Problem Suspect 1
>
> 21 instances of "org.apache.flink.util.ChildFirstClassLoader", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x73b2d42e0" occupy 29,656,880 (41.16%) 
> bytes.
>
> Biggest instances:
>
> org.apache.flink.util.ChildFirstClassLoader @ 0x73ca2a1e8 - 1,474,760 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73d2af820 - 1,474,168 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73cdcaa10 - 1,474,160 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73cf6aab0 - 1,474,160 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73dd8 - 1,474,160 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73d2bb108 - 1,474,128 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73de202e0 - 1,474,120 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73dadc778 - 1,474,112 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73d5f70e8 - 1,474,064 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73d93aa38 - 1,474,064 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73e179638 - 1,474,064 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73dc80418 - 1,474,056 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73dfcda60 - 1,474,056 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73e4bcd38 - 1,474,056 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73d6006e8 - 1,474,032 (2.05%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73c7d2ad8 - 1,461,944 (2.03%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73ca1bb98 - 1,460,752 (2.03%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73bf203f0 - 1,460,744 (2.03%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73e3284a8 - 1,445,232 (2.01%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73e65de00 - 1,445,232 (2.01%) 
> bytes.
>
>
>
> Keywords
> org.apache.flink.util.ChildFirstClassLoader
> sun.misc.Launcher$AppClassLoader @ 0x73b2d42e0
> Details »
>
>   Problem Suspect 2
>
> 34,407 instances of "org.apache.flink.core.memory.HybridMemorySegment", 
> loaded by "sun.misc.Launcher$AppClassLoader @ 0x73b2d42e0" occupy 7,707,168 
> (10.70%) bytes.
>
> Keywords
> org.apache.flink.core.memory.HybridMemorySegment
> sun.misc.Launcher$AppClassLoader @ 0x73b2d42e0
>
> Details »
>
>
>
> heap dump2:
>
> Leak Suspects
>
>   Problem Suspect 1
>
> 21 instances of "org.apache.flink.util.ChildFirstClassLoader", loaded by 
> "sun.misc.Launcher$AppClassLoader @ 0x73b2d42e0" occupy 26,061,408 (30.68%) 
> bytes.
>
> Biggest instances:
>
> org.apache.flink.util.ChildFirstClassLoader @ 0x73e9e9930 - 1,474,224 (1.74%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73edce0b8 - 1,474,224 (1.74%) 
> bytes.
> org.apache.flink.util.ChildFirstClassLoader @ 0x73f1ad7d0 - 1,474,168 (1.74%) 
> bytes.
> org.apache.flink.util.C

Re: period batch job lead to OutOfMemoryError: Metaspace problem

2021-04-06 Thread Yangze Guo
> I have tried this method, but the problem still exists.
How much memory do you configure for it?

> is 21 instances of "org.apache.flink.util.ChildFirstClassLoader" normal
Not quite sure about it. AFAIK, each job will have a classloader.
Multiple tasks of the same job in the same TM will share the same
classloader. The classloader will be removed when no more tasks of that
job are running on the TM. Classloaders without references will
eventually be cleaned up by GC. Could you share the JM and TM logs for
further analysis?
I'll also involve @Guowei Ma in this thread.


Best,
Yangze Guo

On Tue, Apr 6, 2021 at 6:05 PM 太平洋 <495635...@qq.com> wrote:
>
> I have tried this method, but the problem still exists.
> By heap dump analysis, are 21 instances of 
> "org.apache.flink.util.ChildFirstClassLoader" normal?
>
>
> ------ Original Message ------
> From: "Yangze Guo" ;
> Sent: Tuesday, April 6, 2021, 4:32 PM
> To: "太平洋"<495635...@qq.com>;
> Cc: "user";
> Subject: Re: period batch job lead to OutOfMemoryError: Metaspace problem
>
> I think you can try to increase the JVM metaspace option for
> TaskManagers through taskmanager.memory.jvm-metaspace.size. [1]
>
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/memory/mem_trouble/#outofmemoryerror-metaspace
>
> Best,
> Yangze Guo
>
> Best,
> Yangze Guo
>
>
> On Tue, Apr 6, 2021 at 4:22 PM 太平洋 <495635...@qq.com> wrote:
> >
> > batch job:
> > read data from s3 by sql,then by some operators and write data to 
> > clickhouse and kafka.
> > after some times, task-manager quit with OutOfMemoryError: Metaspace.
> >
> > env:
> > flink version:1.12.2
> > task-manager slot count: 5
> > deployment: standalone kubernetes session 模式
> > dependencies:
> >
> > 
> >
> >   org.apache.flink
> >
> >   flink-connector-kafka_2.11
> >
> >   ${flink.version}
> >
> > 
> >
> > 
> >
> >   com.google.code.gson
> >
> >   gson
> >
> >   2.8.5
> >
> > 
> >
> > 
> >
> >   org.apache.flink
> >
> >   flink-connector-jdbc_2.11
> >
> >   ${flink.version}
> >
> > 
> >
> > 
> >
> >   ru.yandex.clickhouse
> >
> >   clickhouse-jdbc
> >
> >   0.3.0
> >
> > 
> >
> > 
> >
> >   org.apache.flink
> >
> > flink-parquet_2.11
> >
> > ${flink.version}
> >
> > 
> >
> > 
> >
> >  org.apache.flink
> >
> >  flink-json
> >
> >  ${flink.version}
> >
> > 
> >
> >
> > heap dump1:
> >
> > Leak Suspects
> >
> > System Overview
> >
> >  Leaks
> >
> >  Overview
> >
> >
> >   Problem Suspect 1
> >
> > 21 instances of "org.apache.flink.util.ChildFirstClassLoader", loaded by 
> > "sun.misc.Launcher$AppClassLoader @ 0x73b2d42e0" occupy 29,656,880 (41.16%) 
> > bytes.
> >
> > Biggest instances:
> >
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73ca2a1e8 - 1,474,760 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73d2af820 - 1,474,168 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73cdcaa10 - 1,474,160 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73cf6aab0 - 1,474,160 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73dd8 - 1,474,160 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73d2bb108 - 1,474,128 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73de202e0 - 1,474,120 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73dadc778 - 1,474,112 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73d5f70e8 - 1,474,064 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73d93aa38 - 1,474,064 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73e179638 - 1,474,064 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73dc80418 - 1,474,056 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73dfcda60 - 1,474,056 
> > (2.05%) bytes.
> > org.apache.flink.util.ChildFirstClassLoader @ 0x73e4bcd38 - 1,474,056 
> > (2.05%) bytes.
> > 

Re: period batch job lead to OutOfMemoryError: Metaspace problem

2021-04-07 Thread Yangze Guo
I went through the JM & TM logs but could not find any valuable clue.
The exception is actually thrown by kafka-producer-network-thread.
Maybe @Qingsheng could also take a look?


Best,
Yangze Guo

On Thu, Apr 8, 2021 at 10:39 AM 太平洋 <495635...@qq.com> wrote:
>
> I have configured it to 512M, but the problem still exists. The memory size 
> shown is still 256M.
> Attachments are TM and JM logs.
>
> Look forward to your reply.
>
> ------ Original Message ------
> From: "Yangze Guo" ;
> Sent: Tuesday, April 6, 2021, 6:35 PM
> To: "太平洋"<495635...@qq.com>;
> Cc: "user";"guowei.mgw";
> Subject: Re: period batch job lead to OutOfMemoryError: Metaspace problem
>
> > I have tried this method, but the problem still exist.
> How much memory do you configure for it?
>
> > is 21 instances of "org.apache.flink.util.ChildFirstClassLoader" normal
> Not quite sure about it. AFAIK, each job will have a classloader.
> Multiple tasks of the same job in the same TM will share the same
> classloader. The classloader will be removed if there is no more task
> running on the TM. Classloader without reference will be finally
> cleanup by GC. Could you share JM and TM logs for further analysis?
> I'll also involve @Guowei Ma in this thread.
>
>
> Best,
> Yangze Guo
>
> On Tue, Apr 6, 2021 at 6:05 PM 太平洋 <495635...@qq.com> wrote:
> >
> > I have tried this method, but the problem still exist.
> > by heap dump analysis, is 21 instances of 
> > "org.apache.flink.util.ChildFirstClassLoader" normal?
> >
> >
> > ------ Original Message ------
> > From: "Yangze Guo" ;
> > Sent: Tuesday, April 6, 2021, 4:32 PM
> > To: "太平洋"<495635...@qq.com>;
> > Cc: "user";
> > Subject: Re: period batch job lead to OutOfMemoryError: Metaspace problem
> >
> > I think you can try to increase the JVM metaspace option for
> > TaskManagers through taskmanager.memory.jvm-metaspace.size. [1]
> >
> > [1] 
> > https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/memory/mem_trouble/#outofmemoryerror-metaspace
> >
> > Best,
> > Yangze Guo
> >
> > Best,
> > Yangze Guo
> >
> >
> > On Tue, Apr 6, 2021 at 4:22 PM 太平洋 <495635...@qq.com> wrote:
> > >
> > > batch job:
> > > read data from s3 by sql,then by some operators and write data to 
> > > clickhouse and kafka.
> > > after some times, task-manager quit with OutOfMemoryError: Metaspace.
> > >
> > > env:
> > > flink version:1.12.2
> > > task-manager slot count: 5
> > > deployment: standalone kubernetes session 模式
> > > dependencies:
> > >
> > > 
> > >
> > >   org.apache.flink
> > >
> > >   flink-connector-kafka_2.11
> > >
> > >   ${flink.version}
> > >
> > > 
> > >
> > > 
> > >
> > >   com.google.code.gson
> > >
> > >   gson
> > >
> > >   2.8.5
> > >
> > > 
> > >
> > > 
> > >
> > >   org.apache.flink
> > >
> > >   flink-connector-jdbc_2.11
> > >
> > >   ${flink.version}
> > >
> > > 
> > >
> > > 
> > >
> > >   ru.yandex.clickhouse
> > >
> > >   clickhouse-jdbc
> > >
> > >   0.3.0
> > >
> > > 
> > >
> > > 
> > >
> > >   org.apache.flink
> > >
> > > flink-parquet_2.11
> > >
> > > ${flink.version}
> > >
> > > 
> > >
> > > 
> > >
> > >  org.apache.flink
> > >
> > >  flink-json
> > >
> > >  ${flink.version}
> > >
> > > 
> > >
> > >
> > > heap dump1:
> > >
> > > Leak Suspects
> > >
> > > System Overview
> > >
> > >  Leaks
> > >
> > >  Overview
> > >
> > >
> > >   Problem Suspect 1
> > >
> > > 21 instances of "org.apache.flink.util.ChildFirstClassLoader", loaded by 
> > > "sun.misc.Launcher$AppClassLoader @ 0x73b2d42e0" occupy 29,656,880 
> > > (41.16%) bytes.
> > >
> > > Biggest instances:
> > >
> > > org.apache.flink.util.

Re: period batch job lead to OutOfMemoryError: Metaspace problem

2021-04-08 Thread Yangze Guo
IIUC, your program will eventually create 100 ChildFirstClassLoaders in
a TM, but they should always be GC'ed when the jobs finish. So, as Arvid
said, you'd better check who is referencing those ChildFirstClassLoaders.
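
For example, in the heap dump analysis tool you already used, you can usually
select one of the ChildFirstClassLoader instances and list its paths to GC
roots while excluding weak/soft/phantom references; lingering threads (e.g. a
Kafka producer network thread still holding the classloader as its context
classloader) typically show up there. This is only a suggested way to narrow
it down, not a confirmed root cause.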


Best,
Yangze Guo

On Thu, Apr 8, 2021 at 5:43 PM 太平洋 <495635...@qq.com> wrote:
>
> My application program looks like this. Does this structure have some problem?
>
> public class StreamingJob {
> public static void main(String[] args) throws Exception {
> int i = 0;
> while (i < 100) {
> try {
> StreamExecutionEnvironment env = 
> StreamExecutionEnvironment.getExecutionEnvironment();
> env.setRuntimeMode(RuntimeExecutionMode.BATCH);
> env.setParallelism(Parallelism);
>
> EnvironmentSettings bsSettings = 
> EnvironmentSettings.newInstance().useBlinkPlanner()
> .inStreamingMode().build();
> StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(env, 
> bsSettings);
>
> bsTableEnv.executeSql("CREATE TEMPORARY TABLE ");
> Table t = bsTableEnv.sqlQuery(query);
>
> DataStream points = bsTableEnv.toAppendStream(t, DataPoint.class);
>
> DataStream weightPoints = points.map();
>
> DataStream predictPoints = weightPoints.keyBy()
> .reduce().map();
>
> // side output
> final OutputTag outPutPredict = new 
> OutputTag("predict") {
> };
>
> SingleOutputStreamOperator mainDataStream = predictPoints
> .process();
>
> DataStream exStream = 
> mainDataStream.getSideOutput(outPutPredict);
>
> //write data to clickhouse
> String insertIntoCKSql = "xxx";
> mainDataStream.addSink(JdbcSink.sink(insertIntoCKSql, new CkSinkBuilder(),
> new JdbcExecutionOptions.Builder().withBatchSize(CkBatchSize).build(),
> new 
> JdbcConnectionOptions.JdbcConnectionOptionsBuilder().withDriverName(CkDriverName)
> .withUrl(CkUrl).withUsername(CkUser).withPassword(CkPassword).build()));
>
> // write data to kafka
> FlinkKafkaProducer producer = new FlinkKafkaProducer<>();
> exStream.map().addSink(producer);
>
> env.execute("Prediction Program");
> } catch (Exception e) {
> e.printStackTrace();
> }
> i++;
> Thread.sleep(window * 1000);
> }
> }
> }
>
>
>
> ------ Original Message ------
> From: "Arvid Heise" ;
> Sent: Thursday, April 8, 2021, 2:33 PM
> To: "Yangze Guo";
> Cc: 
> "太平洋"<495635...@qq.com>;"user";"guowei.mgw";"renqschn";
> Subject: Re: period batch job lead to OutOfMemoryError: Metaspace problem
>
> Hi,
>
> ChildFirstClassLoader are created (more or less) by application jar and 
> seeing so many looks like a classloader leak to me. I'd expect you to see a 
> new ChildFirstClassLoader popping up with each new job submission.
>
> Can you check who is referencing the ChildFirstClassLoader transitively? 
> Usually, it's some thread that is lingering around because some third party 
> library is leaking threads etc.
>
> OneInputStreamTask is legit and just indicates that you have a job running 
> with 4 slots on that TM. It should not hold any dedicated metaspace memory.
>
> On Thu, Apr 8, 2021 at 4:52 AM Yangze Guo  wrote:
>>
>> I went through the JM & TM logs but could not find any valuable clue.
>> The exception is actually thrown by kafka-producer-network-thread.
>> Maybe @Qingsheng could also take a look?
>>
>>
>> Best,
>> Yangze Guo
>>
>> On Thu, Apr 8, 2021 at 10:39 AM 太平洋 <495635...@qq.com> wrote:
>> >
>> > I have configured to 512M, but problem still exist. Now the memory size is 
>> > still 256M.
>> > Attachments are TM and JM logs.
>> >
>> > Look forward to your reply.
>> >
>> > ------ Original Message ------
>> > From: "Yangze Guo" ;
>> > Sent: Tuesday, April 6, 2021, 6:35 PM
>> > To: "太平洋"<495635...@qq.com>;
>> > Cc: "user";"guowei.mgw";
>> > Subject: Re: period batch job lead to OutOfMemoryError: Metaspace problem
>> >
>> > > I have tried this method, but the problem still exist.
>> > How much memory do you configure for it?
>> >
>> > > is 21 instances of "org.apache.flink.util.ChildFirstClassLoader" normal
>> > Not quite sure about it. AFAIK, each job will have a classloader.
>> > Multiple tasks of the same job in the same TM will share the same
>> > classloader. The classloader will be removed if there is no more task
>> > running on the TM. Classloader without reference will be finally
>> > cleanup by GC. Could you share JM and TM logs for furthe

Re: [ANNOUNCE] Apache Flink 1.10.1 released

2020-05-17 Thread Yangze Guo
Thanks Yu for the great job. Congrats everyone who made this release possible.
Best,
Yangze Guo

On Mon, May 18, 2020 at 10:57 AM Leonard Xu  wrote:
>
>
> Thanks Yu for being the release manager, and everyone else who made this 
> possible.
>
> Best,
> Leonard Xu
>
> On May 18, 2020, at 10:43, Zhu Zhu  wrote:
>
> Thanks Yu for being the release manager. Thanks everyone who made this 
> release possible!
>
> Thanks,
> Zhu Zhu
>
> Benchao Li  wrote on Fri, May 15, 2020 at 7:51 PM:
>>
>> Thanks Yu for the great work, and everyone else who made this possible.
>>
>> Dian Fu  wrote on Fri, May 15, 2020 at 6:55 PM:
>>>
>>> Thanks Yu for managing this release and everyone else who made this release 
>>> possible. Good work!
>>>
>>> Regards,
>>> Dian
>>>
>>> On May 15, 2020, at 6:26 PM, Till Rohrmann  wrote:
>>>
>>> Thanks Yu for being our release manager and everyone else who made the 
>>> release possible!
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, May 15, 2020 at 9:15 AM Congxian Qiu  wrote:
>>>>
>>>> Thanks a lot for the release and your great job, Yu!
>>>> Also thanks to everyone who made this release possible!
>>>>
>>>> Best,
>>>> Congxian
>>>>
>>>>
>>>> Yu Li  wrote on Thu, May 14, 2020 at 1:59 AM:
>>>>>
>>>>> The Apache Flink community is very happy to announce the release of 
>>>>> Apache Flink 1.10.1, which is the first bugfix release for the Apache 
>>>>> Flink 1.10 series.
>>>>>
>>>>> Apache Flink® is an open-source stream processing framework for 
>>>>> distributed, high-performing, always-available, and accurate data 
>>>>> streaming applications.
>>>>>
>>>>> The release is available for download at:
>>>>> https://flink.apache.org/downloads.html
>>>>>
>>>>> Please check out the release blog post for an overview of the 
>>>>> improvements for this bugfix release:
>>>>> https://flink.apache.org/news/2020/05/12/release-1.10.1.html
>>>>>
>>>>> The full release notes are available in Jira:
>>>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346891
>>>>>
>>>>> We would like to thank all contributors of the Apache Flink community who 
>>>>> made this release possible!
>>>>>
>>>>> Regards,
>>>>> Yu
>>>
>>>
>>
>>
>> --
>>
>> Benchao Li
>> School of Electronics Engineering and Computer Science, Peking University
>> Tel:+86-15650713730
>> Email: libenc...@gmail.com; libenc...@pku.edu.cn
>
>


Re: How do I get the IP of the master and slave files programmatically in Flink?

2020-05-20 Thread Yangze Guo
Hi, Felipe

Do you mean to get the host and port of the task executor that your
operator is actually running on?

If that is the case, IIUC, two possible components that could contain this
information are the RuntimeContext and the Configuration param of
RichFunction#open. After reading the relevant code path, it seems you
cannot get it at the moment.

Best,
Yangze Guo


On Wed, May 20, 2020 at 11:46 PM Alexander Fedulov
 wrote:
>
> Hi Felippe,
>
> could you clarify in some more details what you are trying to achieve?
>
> Best regards,
>
> --
>
> Alexander Fedulov | Solutions Architect
>
> +49 1514 6265796
>
>
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward - The Apache Flink Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
>
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji 
> (Tony) Cheng
>
>
>
>
> On Wed, May 20, 2020 at 1:14 PM Felipe Gutierrez 
>  wrote:
>>
>> Hi all,
>>
>> I have my own operator that extends the AbstractUdfStreamOperator
>> class and I want to issue some messages to it. Sometimes the operator
>> instances are deployed on different TaskManagers and I would like to
>> set some attributes like the master and slave IPs on it.
>>
>> I am trying to use these values but they only return localhost, not
>> the IP configured at flink-conf.yaml file. (jobmanager.rpc.address:
>> 192.168.56.1).
>>
>> ConfigOption restAddressOption = ConfigOptions
>>.key("rest.address")
>>.stringType()
>>.noDefaultValue();
>> System.out.println("DefaultJobManagerRunnerFactory rest.address: " +
>> jobMasterConfiguration.getConfiguration().getValue(restAddressOption));
>> System.out.println("rpcService: " + rpcService.getAddress());
>>
>>
>> Thanks,
>> Felipe
>>
>> --
>> -- Felipe Gutierrez
>> -- skype: felipe.o.gutierrez
>> -- https://felipeogutierrez.blogspot.com


Re: How do I get the IP of the master and slave files programmatically in Flink?

2020-05-21 Thread Yangze Guo
Hi, Felipe

I see your problem. IIUC, if you use AbstractUdfStreamOperator, you
could indeed get all the configurations (including what you defined in
flink-conf.yaml) through
"AbstractUdfStreamOperator#getRuntimeContext().getTaskManagerRuntimeInfo().getConfiguration()".
However, I guess this is not the intended behavior and it might be changed
in future versions.
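
For illustration, a sketch of what that would look like inside the operator,
e.g. in open() of your AbstractUdfStreamOperator subclass (this relies on
runtime internals, so treat it as unsupported and subject to change; whether
the returned address is externally reachable depends on your setup):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.JobManagerOptions;

// Inside the operator:
Configuration tmConf =
        getRuntimeContext().getTaskManagerRuntimeInfo().getConfiguration();
String jobManagerAddress = tmConf.getString(JobManagerOptions.ADDRESS); // jobmanager.rpc.address
int jobManagerPort = tmConf.getInteger(JobManagerOptions.PORT);         // jobmanager.rpc.port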

Best,
Yangze Guo



On Thu, May 21, 2020 at 3:13 PM Felipe Gutierrez
 wrote:
>
> Hi all,
>
> I would like to have the IP of the JobManager, not the Task Executors.
> I explain why.
>
> I have an operator (my own operator that extends
> AbstractUdfStreamOperator) that sends and receives messages from a
> global controller. So, regardless of which TaskManager these operator
> instances are deployed, they need to send and receive messages from my
> controller. Currently, I am doing this using MQTT broker (this is my
> first approach and I don't know if there is a better way to do it,
> maybe there is...)
>
> The first thing that I do is to start my controller using the
> org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl and subscribe
> it to the JobManager host. I am getting the IP of the JobManager by
> adding this method on the
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory
> class:
>public String getRpcServiceAddress() {
> return this.rpcService.getAddress();
> }
> That is working. Although I am not sure if it is the best approach.
>
> The second thing that I am doing is to make each operator instance
> publish and subscribe to this controller. To do this they need the
> JobManager IP. I could get the TaskManager IPs from the
> AbstractUdfStreamOperator, but not the JobManager IP. So, I am passing
> the JobManager IP as a parameter to the operator at the moment. I
> suppose that it is easy to get the JobManager IP inside the
> AbstractUdfStreamOperator or simply add some method somewhere to get
> this value. However, I don't know where.
>
> Thanks,
> Felipe
>
> --
> -- Felipe Gutierrez
> -- skype: felipe.o.gutierrez
> -- https://felipeogutierrez.blogspot.com
>
> On Thu, May 21, 2020 at 7:13 AM Yangze Guo  wrote:
> >
> > Hi, Felipe
> >
> > Do you mean to get the Host and Port of the task executor where your
> > operator is indeed running on?
> >
> > If that is the case, IIUC, two possible components that contain this
> > information are RuntimeContext and the Configuration param of
> > RichFunction#open. After reading the relevant code path, it seems you
> > could not get it at the moment.
> >
> > Best,
> > Yangze Guo
> >
> > Best,
> > Yangze Guo
> >
> >
> > On Wed, May 20, 2020 at 11:46 PM Alexander Fedulov
> >  wrote:
> > >
> > > Hi Felippe,
> > >
> > > could you clarify in some more details what you are trying to achieve?
> > >
> > > Best regards,
> > >
> > > --
> > >
> > > Alexander Fedulov | Solutions Architect
> > >
> > > +49 1514 6265796
> > >
> > >
> > >
> > > Follow us @VervericaData
> > >
> > > --
> > >
> > > Join Flink Forward - The Apache Flink Conference
> > >
> > > Stream Processing | Event Driven | Real Time
> > >
> > > --
> > >
> > > Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> > >
> > > --
> > >
> > > Ververica GmbH
> > > Registered at Amtsgericht Charlottenburg: HRB 158244 B
> > > Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji 
> > > (Tony) Cheng
> > >
> > >
> > >
> > >
> > > On Wed, May 20, 2020 at 1:14 PM Felipe Gutierrez 
> > >  wrote:
> > >>
> > >> Hi all,
> > >>
> > >> I have my own operator that extends the AbstractUdfStreamOperator
> > >> class and I want to issue some messages to it. Sometimes the operator
> > >> instances are deployed on different TaskManagers and I would like to
> > >> set some attributes like the master and slave IPs on it.
> > >>
> > >> I am trying to use these values but they only return localhost, not
> > >> the IP configured at flink-conf.yaml file. (jobmanager.rpc.address:
> > >> 192.168.56.1).
> > >>
> > >> ConfigOption restAddressOption = ConfigOptions
> > >>.key("rest.address")
> > >>.stringType()
> > >>.noDefaultValue();
> > >> System.out.println("DefaultJobManagerRunnerFactory rest.address: " +
> > >> jobMasterConfiguration.getConfiguration().getValue(restAddressOption));
> > >> System.out.println("rpcService: " + rpcService.getAddress());
> > >>
> > >>
> > >> Thanks,
> > >> Felipe
> > >>
> > >> --
> > >> -- Felipe Gutierrez
> > >> -- skype: felipe.o.gutierrez
> > >> -- https://felipeogutierrez.blogspot.com


Re: kerberos integration with flink

2020-05-21 Thread Yangze Guo
Hi, Nick,

From my understanding, if you configure the
"security.kerberos.login.keytab", Flink will add the
AppConfigurationEntry of this keytab to all the apps defined in
"security.kerberos.login.contexts". If you define
"java.security.auth.login.config" at the same time, Flink will also
keep the configuration in it. For more details, see [1][2].

If you want to use this keytab to interact with HDFS, HBase and Yarn,
you need to set "security.kerberos.login.contexts". See [3][4].

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/ops/security-kerberos.html#jaas-security-module
[2] 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/modules/JaasModule.java
[3] 
https://ci.apache.org/projects/flink/flink-docs-master/ops/security-kerberos.html#hadoop-security-module
[4] 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/modules/HadoopModule.java
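
For example, in flink-conf.yaml (the keytab path and principal below are
placeholders, to be adapted to your setup):

security.kerberos.login.keytab: /path/to/flink.keytab
security.kerberos.login.principal: flink-user@EXAMPLE.COM
security.kerberos.login.contexts: Client,KafkaClient

Here "Client" is the JAAS context typically used for ZooKeeper and
"KafkaClient" the one used for Kafka, as described in [1].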

Best,
Yangze Guo

On Thu, May 21, 2020 at 11:06 PM Nick Bendtner  wrote:
>
> Hi guys,
> Is there any difference in providing kerberos config to the flink jvm using 
> this method in the flink configuration?
>
> env.java.opts:  -Dconfig.resource=qa.conf 
> -Djava.library.path=/usr/mware/flink-1.7.2/simpleapi/lib/ 
> -Djava.security.auth.login.config=/usr/mware/flink-1.7.2/Jaas/kafka-jaas.conf 
> -Djava.security.krb5.conf=/usr/mware/flink-1.7.2/Jaas/krb5.conf
>
> Is there any difference in doing it this way vs providing it from 
> security.kerberos.login.keytab .
>
> Best,
>
> Nick.


Re: kerberos integration with flink

2020-05-24 Thread Yangze Guo
Yes, you can use kinit. But AFAIK, if you deploy Flink on Kubernetes
or Mesos, Flink will not ship the ticket cache. If you deploy Flink on
Yarn, Flink will acquire delegation tokens with your ticket cache and
set tokens for job manager and task executor. As the document said,
the main drawback is that the cluster is necessarily short-lived since
the generated delegation tokens will expire (typically within a week).

Best,
Yangze Guo

On Sat, May 23, 2020 at 1:23 AM Nick Bendtner  wrote:
>
> Hi Guo,
> Even for HDFS I don't really need to set "security.kerberos.login.contexts" . 
> As long as there is the right ticket in the ticket cache before starting the 
> flink cluster it seems to work fine. I think even [4] from your reference 
> seems to do the same thing. I have defined own ticket cache specifically for 
> flink cluster by setting this environment variable. Before starting the 
> cluster I create a ticket by using kinit.
> This is how I make flink read this cache.
> export KRB5CCNAME=/home/was/Jaas/krb5cc . I think even flink tries to find 
> the location of ticket cache using this variable [1].
> Do you see any problems in setting up hadoop security module this way ? And 
> thanks a lot for your help.
>
> [1] 
> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/KerberosUtils.java
>
> Best,
> Nick
>
>
>
> On Thu, May 21, 2020 at 9:54 PM Yangze Guo  wrote:
>>
>> Hi, Nick,
>>
>> From my understanding, if you configure the
>> "security.kerberos.login.keytab", Flink will add the
>> AppConfigurationEntry of this keytab to all the apps defined in
>> "security.kerberos.login.contexts". If you define
>> "java.security.auth.login.config" at the same time, Flink will also
>> keep the configuration in it. For more details, see [1][2].
>>
>> If you want to use this keytab to interact with HDFS, HBase and Yarn,
>> you need to set "security.kerberos.login.contexts". See [3][4].
>>
>> [1] 
>> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-kerberos.html#jaas-security-module
>> [2] 
>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/modules/JaasModule.java
>> [3] 
>> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-kerberos.html#hadoop-security-module
>> [4] 
>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/modules/HadoopModule.java
>>
>> Best,
>> Yangze Guo
>>
>> On Thu, May 21, 2020 at 11:06 PM Nick Bendtner  wrote:
>> >
>> > Hi guys,
>> > Is there any difference in providing kerberos config to the flink jvm 
>> > using this method in the flink configuration?
>> >
>> > env.java.opts:  -Dconfig.resource=qa.conf 
>> > -Djava.library.path=/usr/mware/flink-1.7.2/simpleapi/lib/ 
>> > -Djava.security.auth.login.config=/usr/mware/flink-1.7.2/Jaas/kafka-jaas.conf
>> >  -Djava.security.krb5.conf=/usr/mware/flink-1.7.2/Jaas/krb5.conf
>> >
>> > Is there any difference in doing it this way vs providing it from 
>> > security.kerberos.login.keytab .
>> >
>> > Best,
>> >
>> > Nick.


Re: How do I get the IP of the master and slave files programmatically in Flink?

2020-05-24 Thread Yangze Guo
Glad to see that!

However, I was told that it is not the right approach to directly
extend `AbstractUdfStreamOperator` in DataStream API. This would be
fixed at some point (maybe Flink 2.0). The JIRA link is [1].

[1] https://issues.apache.org/jira/browse/FLINK-17862

Best,
Yangze Guo

On Fri, May 22, 2020 at 9:56 PM Felipe Gutierrez
 wrote:
>
> thanks. it worked!
>
> I add the following method at the
> org.apache.flink.streaming.api.operators.StreamingRuntimeContext
> class:
>
> public Environment getTaskEnvironment() { return this.taskEnvironment; }
>
> Then I am getting the IP using:
>
> ConfigOption<String> restAddressOption = ConfigOptions
>     .key("rest.address")
>     .stringType()
>     .noDefaultValue();
> String restAddress =
>     this.getRuntimeContext().getTaskEnvironment().getTaskManagerInfo().getConfiguration().getValue(restAddressOption);
>
> Thanks!
>
> --
> -- Felipe Gutierrez
> -- skype: felipe.o.gutierrez
> -- https://felipeogutierrez.blogspot.com
>
> On Fri, May 22, 2020 at 3:54 AM Yangze Guo  wrote:
> >
> > Hi, Felipe
> >
> > I see your problem. IIUC, if you use AbstractUdfStreamOperator, you
> > could indeed get all the configurations(including what you defined in
> > flink-conf.yaml) through
> > "AbstractUdfStreamOperator#getRuntimeContext().getTaskManagerRuntimeInfo().getConfiguration()".
> > However, I guess it is not the right behavior and might be fixed in
> > future versions.
> >
> > Best,
> > Yangze Guo
> >
> >
> >
> > On Thu, May 21, 2020 at 3:13 PM Felipe Gutierrez
> >  wrote:
> > >
> > > Hi all,
> > >
> > > I would like to have the IP of the JobManager, not the Task Executors.
> > > I explain why.
> > >
> > > I have an operator (my own operator that extends
> > > AbstractUdfStreamOperator) that sends and receives messages from a
> > > global controller. So, regardless of which TaskManager these operator
> > > instances are deployed, they need to send and receive messages from my
> > > controller. Currently, I am doing this using MQTT broker (this is my
> > > first approach and I don't know if there is a better way to do it,
> > > maybe there is...)
> > >
> > > The first thing that I do is to start my controller using the
> > > org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl and subscribe
> > > it to the JobManager host. I am getting the IP of the JobManager by
> > > adding this method on the
> > > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory
> > > class:
> > >public String getRpcServiceAddress() {
> > > return this.rpcService.getAddress();
> > > }
> > > That is working. Although I am not sure if it is the best approach.
> > >
> > > The second thing that I am doing is to make each operator instance
> > > publish and subscribe to this controller. To do this they need the
> > > JobManager IP. I could get the TaskManager IPs from the
> > > AbstractUdfStreamOperator, but not the JobManager IP. So, I am passing
> > > the JobManager IP as a parameter to the operator at the moment. I
> > > suppose that it is easy to get the JobManager IP inside the
> > > AbstractUdfStreamOperator or simply add some method somewhere to get
> > > this value. However, I don't know where.
> > >
> > > Thanks,
> > > Felipe
> > >
> > > --
> > > -- Felipe Gutierrez
> > > -- skype: felipe.o.gutierrez
> > > -- https://felipeogutierrez.blogspot.com
> > >
> > > On Thu, May 21, 2020 at 7:13 AM Yangze Guo  wrote:
> > > >
> > > > Hi, Felipe
> > > >
> > > > Do you mean to get the Host and Port of the task executor where your
> > > > operator is indeed running on?
> > > >
> > > > If that is the case, IIUC, two possible components that contain this
> > > > information are RuntimeContext and the Configuration param of
> > > > RichFunction#open. After reading the relevant code path, it seems you
> > > > could not get it at the moment.
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > >
> > > > On Wed, May 20, 2020 at 11:46 PM Alexander Fedulov
> > > >  wrote:
> > > > >
> > > > > Hi Felippe,
> > > > >
> > > > > could you clarify in some mor

Re: How do I get the IP of the master and slave files programmatically in Flink?

2020-05-25 Thread Yangze Guo
I'm not quite familiar with that. I'd like to cc @Aljoscha Krettek here.


Best,
Yangze Guo

On Mon, May 25, 2020 at 4:39 PM Felipe Gutierrez
 wrote:
>
> ok, I see.
>
> Do you suggest a better approach to send messages from the JobManager
> to the TaskManagers and my specific operator?
>
> Thanks,
> Felipe
> --
> -- Felipe Gutierrez
> -- skype: felipe.o.gutierrez
> -- https://felipeogutierrez.blogspot.com
>
> On Mon, May 25, 2020 at 4:23 AM Yangze Guo  wrote:
> >
> > Glad to see that!
> >
> > However, I was told that it is not the right approach to directly
> > extend `AbstractUdfStreamOperator` in DataStream API. This would be
> > fixed at some point (maybe Flink 2.0). The JIRA link is [1].
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-17862
> >
> > Best,
> > Yangze Guo
> >
> > On Fri, May 22, 2020 at 9:56 PM Felipe Gutierrez
> >  wrote:
> > >
> > > thanks. it worked!
> > >
> > > I add the following method at the
> > > org.apache.flink.streaming.api.operators.StreamingRuntimeContext
> > > class:
> > >
> > > public Environment getTaskEnvironment() { return this.taskEnvironment; }
> > >
> > > Then I am getting the IP using:
> > >
> > > ConfigOption<String> restAddressOption = ConfigOptions
> > >     .key("rest.address")
> > >     .stringType()
> > >     .noDefaultValue();
> > > String restAddress =
> > >     this.getRuntimeContext().getTaskEnvironment().getTaskManagerInfo().getConfiguration().getValue(restAddressOption);
> > >
> > > Thanks!
> > >
> > > --
> > > -- Felipe Gutierrez
> > > -- skype: felipe.o.gutierrez
> > > -- https://felipeogutierrez.blogspot.com
> > >
> > > On Fri, May 22, 2020 at 3:54 AM Yangze Guo  wrote:
> > > >
> > > > Hi, Felipe
> > > >
> > > > I see your problem. IIUC, if you use AbstractUdfStreamOperator, you
> > > > could indeed get all the configurations(including what you defined in
> > > > flink-conf.yaml) through
> > > > "AbstractUdfStreamOperator#getRuntimeContext().getTaskManagerRuntimeInfo().getConfiguration()".
> > > > However, I guess it is not the right behavior and might be fixed in
> > > > future versions.
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > >
> > > >
> > > > On Thu, May 21, 2020 at 3:13 PM Felipe Gutierrez
> > > >  wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to have the IP of the JobManager, not the Task Executors.
> > > > > I explain why.
> > > > >
> > > > > I have an operator (my own operator that extends
> > > > > AbstractUdfStreamOperator) that sends and receives messages from a
> > > > > global controller. So, regardless of which TaskManager these operator
> > > > > instances are deployed, they need to send and receive messages from my
> > > > > controller. Currently, I am doing this using MQTT broker (this is my
> > > > > first approach and I don't know if there is a better way to do it,
> > > > > maybe there is...)
> > > > >
> > > > > The first thing that I do is to start my controller using the
> > > > > org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl and subscribe
> > > > > it to the JobManager host. I am getting the IP of the JobManager by
> > > > > adding this method on the
> > > > > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory
> > > > > class:
> > > > >public String getRpcServiceAddress() {
> > > > > return this.rpcService.getAddress();
> > > > > }
> > > > > That is working. Although I am not sure if it is the best approach.
> > > > >
> > > > > The second thing that I am doing is to make each operator instance
> > > > > publish and subscribe to this controller. To do this they need the
> > > > > JobManager IP. I could get the TaskManager IPs from the
> > > > > AbstractUdfStreamOperator, but not the JobManager IP. So, I am passing
> > > > > the JobManager IP as a parameter to the operator at the moment. I
> > > > > suppose that it is easy to get the JobManager IP inside the
> > > > > AbstractUdfStreamOperator or

Re: Flink Elastic Sink

2020-05-28 Thread Yangze Guo
Hi, Anuj.

From my understanding, you could send an IndexRequest to the indexer in
`ElasticsearchSink`. It will create a document under the given index
and type. So, it seems you only need to get the timestamp from the event
and append the date to your index name. Am I understanding that correctly?
Or do you want to emit only 1 record per day?
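
If it is the former, a rough sketch of such a sink could look like the following
(MyEvent, getTimestamp() and toJsonMap() are made-up placeholders; it assumes the
flink-connector-elasticsearch6 API and an already-built List<HttpHost> httpHosts):

    import java.time.Instant;
    import java.time.ZoneOffset;
    import org.apache.flink.api.common.functions.RuntimeContext;
    import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
    import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
    import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink;
    import org.elasticsearch.action.index.IndexRequest;
    import org.elasticsearch.client.Requests;

    ElasticsearchSink.Builder<MyEvent> esSinkBuilder = new ElasticsearchSink.Builder<>(
        httpHosts,
        new ElasticsearchSinkFunction<MyEvent>() {
            @Override
            public void process(MyEvent element, RuntimeContext ctx, RequestIndexer indexer) {
                // Derive the daily index name from the event's own timestamp, so a new
                // index is used automatically once the first event of a new day arrives.
                String day = Instant.ofEpochMilli(element.getTimestamp())
                        .atZone(ZoneOffset.UTC)
                        .toLocalDate()
                        .toString();                      // e.g. "2020-05-29"
                IndexRequest request = Requests.indexRequest()
                        .index("my-events-" + day)        // per-day index with date suffix
                        .source(element.toJsonMap());     // Map<String, Object> view of the event
                indexer.add(request);
            }
        });
    stream.addSink(esSinkBuilder.build());                // stream is a DataStream<MyEvent>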

Best,
Yangze Guo

On Fri, May 29, 2020 at 2:43 AM aj  wrote:
>
> Hello All,
>
> I am getting many events in Kafka and I have written a Flink job that sinks 
> the Avro records from Kafka to S3 in Parquet format.
>
> Now, I want to sink these records into Elasticsearch, but the only challenge 
> is that I want to sink the records into time-based indices. Basically, in Elastic, 
> I want to create a per-day index with the date as the suffix.
> So in the Flink streaming job, if I create an ES sink, how will I change the sink to 
> start writing to a new index when the first event of the day arrives?
>
> Thanks,
> Anuj.
>
>
>
>
>


Re: kerberos integration with flink

2020-05-31 Thread Yangze Guo
Hi, Nick.

Do you mean that you manually execute "kinit -R" to renew the ticket cache?
If that is the case, Flink already sets the "renewTGT" to true. If
everything is ok, you do not need to do it yourself. However, it seems
this mechanism has a bug and this bug is not fixed in all JDK
versions. Please refer to [1].

If you mean that you generate a new ticket cache in the same place (by
default /tmp/krb5cc_uid), I'm not sure whether Krb5LoginModule will
re-login with your new ticket cache. I'll try to do a deeper investigation.

[1] https://bugs.openjdk.java.net/browse/JDK-8058290.

Best,
Yangze Guo

On Sat, May 30, 2020 at 3:07 AM Nick Bendtner  wrote:
>
> Hi Guo,
> Thanks again for your inputs. If I periodically renew the kerberos cache 
> using an external process(kinit) on all flink nodes in standalone mode, will 
> the cluster still be short lived or will the new ticket in the cache be used 
> and the cluster can live till the end of the new expiry ?
>
> Best,
> Nick.
>
> On Sun, May 24, 2020 at 9:15 PM Yangze Guo  wrote:
>>
>> Yes, you can use kinit. But AFAIK, if you deploy Flink on Kubernetes
>> or Mesos, Flink will not ship the ticket cache. If you deploy Flink on
>> Yarn, Flink will acquire delegation tokens with your ticket cache and
>> set tokens for job manager and task executor. As the document said,
>> the main drawback is that the cluster is necessarily short-lived since
>> the generated delegation tokens will expire (typically within a week).
>>
>> Best,
>> Yangze Guo
>>
>> On Sat, May 23, 2020 at 1:23 AM Nick Bendtner  wrote:
>> >
>> > Hi Guo,
>> > Even for HDFS I don't really need to set 
>> > "security.kerberos.login.contexts" . As long as there is the right ticket 
>> > in the ticket cache before starting the flink cluster it seems to work 
>> > fine. I think even [4] from your reference seems to do the same thing. I 
>> > have defined own ticket cache specifically for flink cluster by setting 
>> > this environment variable. Before starting the cluster I create a ticket 
>> > by using kinit.
>> > This is how I make flink read this cache.
>> > export KRB5CCNAME=/home/was/Jaas/krb5cc . I think even flink tries to find 
>> > the location of ticket cache using this variable [1].
>> > Do you see any problems in setting up hadoop security module this way ? 
>> > And thanks a lot for your help.
>> >
>> > [1] 
>> > https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/KerberosUtils.java
>> >
>> > Best,
>> > Nick
>> >
>> >
>> >
>> > On Thu, May 21, 2020 at 9:54 PM Yangze Guo  wrote:
>> >>
>> >> Hi, Nick,
>> >>
>> >> From my understanding, if you configure the
>> >> "security.kerberos.login.keytab", Flink will add the
>> >> AppConfigurationEntry of this keytab to all the apps defined in
>> >> "security.kerberos.login.contexts". If you define
>> >> "java.security.auth.login.config" at the same time, Flink will also
>> >> keep the configuration in it. For more details, see [1][2].
>> >>
>> >> If you want to use this keytab to interact with HDFS, HBase and Yarn,
>> >> you need to set "security.kerberos.login.contexts". See [3][4].
>> >>
>> >> [1] 
>> >> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-kerberos.html#jaas-security-module
>> >> [2] 
>> >> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/modules/JaasModule.java
>> >> [3] 
>> >> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-kerberos.html#hadoop-security-module
>> >> [4] 
>> >> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/modules/HadoopModule.java
>> >>
>> >> Best,
>> >> Yangze Guo
>> >>
>> >> On Thu, May 21, 2020 at 11:06 PM Nick Bendtner  wrote:
>> >> >
>> >> > Hi guys,
>> >> > Is there any difference in providing kerberos config to the flink jvm 
>> >> > using this method in the flink configuration?
>> >> >
>> >> > env.java.opts:  -Dconfig.resource=qa.conf 
>> >> > -Djava.library.path=/usr/mware/flink-1.7.2/simpleapi/lib/ 
>> >> > -Djava.security.auth.login.config=/usr/mware/flink-1.7.2/Jaas/kafka-jaas.conf
>> >> >  -Djava.security.krb5.conf=/usr/mware/flink-1.7.2/Jaas/krb5.conf
>> >> >
>> >> > Is there any difference in doing it this way vs providing it from 
>> >> > security.kerberos.login.keytab .
>> >> >
>> >> > Best,
>> >> >
>> >> > Nick.


Re: kerberos integration with flink

2020-06-01 Thread Yangze Guo
It sounds good to me. If your job keeps running (longer than the
expiration time), I think it implies that Krb5LoginModule will use
your newly generated cache. It's my pleasure to help you.

Best,
Yangze Guo

On Mon, Jun 1, 2020 at 10:47 PM Nick Bendtner  wrote:
>
> Hi Guo,
> The auto renewal happens fine; however, I want to generate a new ticket with a 
> new renew-until period so that the job can run longer than 7 days (I am 
> talking about the second paragraph of your email). I have set a custom cache by 
> setting KRB5CCNAME. I just want to make sure that Krb5LoginModule does a 
> re-login like you said. I think it does, because I generated a new ticket while 
> the flink job was running and the job continues to auto-renew the new ticket. 
> Let me know if you can think of any pitfalls. Once again I really want to 
> thank you for your help and your time.
>
> Best,
> Nick.
>
> On Mon, Jun 1, 2020 at 12:29 AM Yangze Guo  wrote:
>>
>> Hi, Nick.
>>
>> Do you mean that you manually execute "kinit -R" to renew the ticket cache?
>> If that is the case, Flink already sets the "renewTGT" to true. If
>> everything is ok, you do not need to do it yourself. However, it seems
>> this mechanism has a bug and this bug is not fixed in all JDK
>> versions. Please refer to [1].
>>
>> If you mean that you generate a new ticket cache in the same place(by
>> default /tmp/krb5cc_uid), I'm not sure will Krb5LoginModule re-login
>> with your new ticket cache. I'll try to do a deeper investigation.
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8058290.
>>
>> Best,
>> Yangze Guo
>>
>> On Sat, May 30, 2020 at 3:07 AM Nick Bendtner  wrote:
>> >
>> > Hi Guo,
>> > Thanks again for your inputs. If I periodically renew the kerberos cache 
>> > using an external process(kinit) on all flink nodes in standalone mode, 
>> > will the cluster still be short lived or will the new ticket in the cache 
>> > be used and the cluster can live till the end of the new expiry ?
>> >
>> > Best,
>> > Nick.
>> >
>> > On Sun, May 24, 2020 at 9:15 PM Yangze Guo  wrote:
>> >>
>> >> Yes, you can use kinit. But AFAIK, if you deploy Flink on Kubernetes
>> >> or Mesos, Flink will not ship the ticket cache. If you deploy Flink on
>> >> Yarn, Flink will acquire delegation tokens with your ticket cache and
>> >> set tokens for job manager and task executor. As the document said,
>> >> the main drawback is that the cluster is necessarily short-lived since
>> >> the generated delegation tokens will expire (typically within a week).
>> >>
>> >> Best,
>> >> Yangze Guo
>> >>
>> >> On Sat, May 23, 2020 at 1:23 AM Nick Bendtner  wrote:
>> >> >
>> >> > Hi Guo,
>> >> > Even for HDFS I don't really need to set 
>> >> > "security.kerberos.login.contexts" . As long as there is the right 
>> >> > ticket in the ticket cache before starting the flink cluster it seems 
>> >> > to work fine. I think even [4] from your reference seems to do the same 
>> >> > thing. I have defined own ticket cache specifically for flink cluster 
>> >> > by setting this environment variable. Before starting the cluster I 
>> >> > create a ticket by using kinit.
>> >> > This is how I make flink read this cache.
>> >> > export KRB5CCNAME=/home/was/Jaas/krb5cc . I think even flink tries to 
>> >> > find the location of ticket cache using this variable [1].
>> >> > Do you see any problems in setting up hadoop security module this way ? 
>> >> > And thanks a lot for your help.
>> >> >
>> >> > [1] 
>> >> > https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/security/KerberosUtils.java
>> >> >
>> >> > Best,
>> >> > Nick
>> >> >
>> >> >
>> >> >
>> >> > On Thu, May 21, 2020 at 9:54 PM Yangze Guo  wrote:
>> >> >>
>> >> >> Hi, Nick,
>> >> >>
>> >> >> From my understanding, if you configure the
>> >> >> "security.kerberos.login.keytab", Flink will add the
>> >> >> AppConfigurationEntry of this keytab to all the apps defined in
>> >> >> "security.kerberos.login.contexts". If you define
>> >> >> &

Re: Native K8S not creating TMs

2020-06-03 Thread Yangze Guo
Hi, Kevin,

Regarding logs, you could follow this guide [1].

BTW, you could execute "kubectl get pod" to get the current pods. If
there is something like "flink-taskmanager-1-1", you could execute
"kubectl describe pod flink-taskmanager-1-1" to see the status of it.

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/native_kubernetes.html#log-files

Best,
Yangze Guo

On Thu, Jun 4, 2020 at 2:28 AM kb  wrote:
>
> Hi
>
> We are using 1.10.1 with native k8s and while the service appears to be
> created and I can submit a job & see it via Web UI, TMs/pods are never
> created thus the jobs never start.
>
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate the required slot within slot request timeout. Please
> make sure that the cluster has enough resources.
>
> Is there somewhere I could see the pod creation logs?
>
> thanks
>
>
>
> --
> Sent from: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Native K8S not creating TMs

2020-06-03 Thread Yangze Guo
Amend: for release 1.10.1, please refer to this guide [1].

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html#log-files

Best,
Yangze Guo

On Thu, Jun 4, 2020 at 9:52 AM Yangze Guo  wrote:
>
> Hi, Kevin,
>
> Regarding logs, you could follow this guide [1].
>
> BTW, you could execute "kubectl get pod" to get the current pods. If
> there is something like "flink-taskmanager-1-1", you could execute
> "kubectl describe pod flink-taskmanager-1-1" to see the status of it.
>
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/native_kubernetes.html#log-files
>
> Best,
> Yangze Guo
>
> On Thu, Jun 4, 2020 at 2:28 AM kb  wrote:
> >
> > Hi
> >
> > We are using 1.10.1 with native k8s and while the service appears to be
> > created and I can submit a job & see it via Web UI, TMs/pods are never
> > created thus the jobs never start.
> >
> > org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> > Could not allocate the required slot within slot request timeout. Please
> > make sure that the cluster has enough resources.
> >
> > Is there somewhere I could see the pod creation logs?
> >
> > thanks
> >
> >
> >
> > --
> > Sent from: 
> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Shipping Filesystem Plugins with YarnClusterDescriptor

2020-06-10 Thread Yangze Guo
Hi, John,

AFAIK, Flink will automatically help you to ship the "plugins/"
directory of your Flink distribution to Yarn[1]. So, you just need to
make a directory in "plugins/" and put your custom jar into it. Do you
meet any problem with this approach?
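
For example, assuming your custom filesystem jar is called my-fs.jar (the names are made up),
something like the following from the root of the Flink distribution you submit from should
be enough:

    mkdir -p plugins/my-fs
    cp /path/to/my-fs.jar plugins/my-fs/
    # the whole plugins/ directory is then shipped automatically on
    # ./bin/flink run -m yarn-cluster ...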

[1] 
https://github.com/apache/flink/blob/216f65fff10fb0957e324570662d075be66bacdf/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L770

Best,
Yangze Guo

On Wed, Jun 10, 2020 at 11:29 PM John Mathews  wrote:
>
> Hello,
>
> I have a custom filesystem that I am trying to migrate to the plugins model 
> described here: 
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/filesystems/#adding-a-new-pluggable-file-system-implementation,
>  but it is unclear to me how to dynamically get the plugins directory to be 
> available when launching using a Yarn Cluster Descriptor. One thought was to 
> add the plugins to the shipFilesList, but I don't think that would result in 
> the plugins being in the correct directory location for Flink to discover it.
>
> Is there another way to get the plugins onto the host when launching the 
> cluster? Or is there a different recommended way of doing this? Happy to 
> answer any questions if something is unclear.
>
> Thanks so much for your help!
>
> John


Re: [ANNOUNCE] Yu Li is now part of the Flink PMC

2020-06-16 Thread Yangze Guo
Congrats, Yu!
Best,
Yangze Guo

On Wed, Jun 17, 2020 at 9:35 AM Xintong Song  wrote:
>
> Congratulations Yu, well deserved~!
>
> Thank you~
>
> Xintong Song
>
>
>
> On Wed, Jun 17, 2020 at 9:15 AM jincheng sun  wrote:
>>
>> Hi all,
>>
>> On behalf of the Flink PMC, I'm happy to announce that Yu Li is now
>> part of the Apache Flink Project Management Committee (PMC).
>>
>> Yu Li has been very active on Flink's Statebackend component, working on 
>> various improvements, for example the RocksDB memory management for 1.10. 
>> and keeps checking and voting for our releases, and also has successfully 
>> produced two releases(1.10.0&1.10.1) as RM.
>>
>> Congratulations & Welcome Yu Li!
>>
>> Best,
>> Jincheng (on behalf of the Flink PMC)


Re: Manual allocation of slot usage

2020-07-07 Thread Yangze Guo
Hi, Mu,

IIUC, cluster.evenly-spread-out-slots would fulfill your demand. Why
do you think it does the opposite of what you want? Do you run your
job in active mode? If so, cluster.evenly-spread-out-slots might not
work very well because there could be insufficient task managers when
requesting slots from the ResourceManager. This has been discussed in
https://issues.apache.org/jira/browse/FLINK-12122 .
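
For reference, the option is a cluster-level setting in flink-conf.yaml, e.g.:

    cluster.evenly-spread-out-slots: true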


Best,
Yangze Guo

On Tue, Jul 7, 2020 at 5:44 PM Mu Kong  wrote:
>
> Hi community,
>
> I'm running an application to consume data from kafka, and process it then 
> put data to the druid.
> I wonder if there is a way where I can allocate the data source consuming 
> process evenly across the task manager to maximize the usage of the network 
> of task managers.
>
> So, for example, I have 15 task managers and I set parallelism for the kafka 
> source as 60, since I have 60 partitions in kafka topic.
> What I want is flink cluster will put 4 kafka source subtasks on each task 
> manager.
>
> Is that possible? I have gone through the document, the only thing we found is
>
> cluster.evenly-spread-out-slots
>
> which does exactly the opposite of what I want. It will put the subtasks of the 
> same operator onto one task manager as much as possible.
>
> So, is some kind of manual resource allocation available?
> Thanks in advance!
>
>
> Best regards,
> Mu


Re: Manual allocation of slot usage

2020-07-07 Thread Yangze Guo
Hi, Mu,

AFAIK, this feature is added to 1.9.2. If you use 1.9.0, would you
like to upgrade your Flink distribution?

Best,
Yangze Guo

On Tue, Jul 7, 2020 at 8:33 PM Mu Kong  wrote:
>
> Hi, Guo,
>
> Thanks for helping out.
>
> My application has a kafka source with 60 subtasks(parallelism), and we have 
> 15 task managers with 15 slots on each.
>
> Before I applied the cluster.evenly-spread-out-slots, meaning it is set to 
> default false, the operator 'kafka source" has 11 subtasks allocated in one 
> single task manager,
> while the remaining 49 subtasks of "kafka source" distributed to the 
> remaining 14 task managers.
>
> After I set cluster.evenly-spread-out-slots to true, the 60 subtasks of 
> "kafka source" were allocated to only 4 task managers, and they took 15 slots 
> on each of these 4 TMs.
>
> What I thought is that this config will make the subtasks of one operator 
> more evenly spread among the task managers, but it seems it made them 
> allocated in the same task manager as much as possible.
>
> The version I'm deploying is 1.9.0.
>
> Best regards,
> Mu
>
> On Tue, Jul 7, 2020 at 7:10 PM Yangze Guo  wrote:
>>
>> Hi, Mu,
>>
>> IIUC, cluster.evenly-spread-out-slots would fulfill your demand. Why
>> do you think it does the opposite of what you want. Do you run your
>> job in active mode? If so, cluster.evenly-spread-out-slots might not
>> work very well because there could be insufficient task managers when
>> request slot from ResourceManager. This has been discussed in
>> https://issues.apache.org/jira/browse/FLINK-12122 .
>>
>>
>> Best,
>> Yangze Guo
>>
>> On Tue, Jul 7, 2020 at 5:44 PM Mu Kong  wrote:
>> >
>> > Hi community,
>> >
>> > I'm running an application to consume data from kafka, and process it then 
>> > put data to the druid.
>> > I wonder if there is a way where I can allocate the data source consuming 
>> > process evenly across the task manager to maximize the usage of the 
>> > network of task managers.
>> >
>> > So, for example, I have 15 task managers and I set parallelism for the 
>> > kafka source as 60, since I have 60 partitions in kafka topic.
>> > What I want is flink cluster will put 4 kafka source subtasks on each task 
>> > manager.
>> >
>> > Is that possible? I have gone through the document, the only thing we 
>> > found is
>> >
>> > cluster.evenly-spread-out-slots
>> >
>> > which does exact the opposite of what I want. It will put the subtasks of 
>> > the same operator onto one task manager as much as possible.
>> >
>> > So, is some kind of manual resource allocation available?
>> > Thanks in advance!
>> >
>> >
>> > Best regards,
>> > Mu


Re: [Third-party Tool] Flink memory calculator

2020-07-07 Thread Yangze Guo
Hi, there,

As Flink 1.11.0 has been released, we provide a new calculator[1] for
this version. Feel free to try it; any feedback or suggestions are
welcome!

[1] 
https://github.com/KarmaGYZ/flink-memory-calculator/blob/master/calculator-1.11.sh
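
Assuming, as with the previous version of the calculator, the script is copied into the
bin/ directory and run from the root of the Flink distribution, the usage would be e.g.:

    cd /path/to/flink-1.11.0
    bin/calculator-1.11.sh -Dtaskmanager.memory.process.size=4g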

Best,
Yangze Guo

On Wed, Apr 1, 2020 at 9:45 PM Yangze Guo  wrote:
>
> @Marta
> Thanks for the tip! I'll do that.
>
> Best,
> Yangze Guo
>
> On Wed, Apr 1, 2020 at 8:05 PM Marta Paes Moreira  wrote:
> >
> > Hey, Yangze.
> >
> > I'd like to suggest that you submit this tool to Flink Community Pages [1]. 
> > That way it can get more exposure and it'll be easier for users to find it.
> >
> > Thanks for your contribution!
> >
> > [1] https://flink-packages.org/
> >
> > On Tue, Mar 31, 2020 at 9:09 AM Yangze Guo  wrote:
> >>
> >> Hi, there.
> >>
> >> In the latest version, the calculator supports dynamic options. You
> >> could append all your dynamic options to the end of "bin/calculator.sh
> >> [-h]".
> >> Since "-tm" will be deprecated eventually, please replace it with
> >> "-Dtaskmanager.memory.process.size=".
> >>
> >> Best,
> >> Yangze Guo
> >>
> >> On Mon, Mar 30, 2020 at 12:57 PM Xintong Song  
> >> wrote:
> >> >
> >> > Hi Jeff,
> >> >
> >> > I think the purpose of this tool it to allow users play with the memory 
> >> > configurations without needing to actually deploy the Flink cluster or 
> >> > even have a job. For sanity checks, we currently have them in the 
> >> > start-up scripts (for standalone clusters) and resource managers (on 
> >> > K8s/Yarn/Mesos).
> >> >
> >> > I think it makes sense do the checks earlier, i.e. on the client side. 
> >> > But I'm not sure if JobListener is the right place. IIUC, JobListener is 
> >> > invoked before submitting a specific job, while the mentioned checks 
> >> > validate Flink's cluster level configurations. It might be okay for a 
> >> > job cluster, but does not cover the scenarios of session clusters.
> >> >
> >> > Thank you~
> >> >
> >> > Xintong Song
> >> >
> >> >
> >> >
> >> > On Mon, Mar 30, 2020 at 12:03 PM Yangze Guo  wrote:
> >> >>
> >> >> Thanks for your feedbacks, @Xintong and @Jeff.
> >> >>
> >> >> @Jeff
> >> >> I think it would always be good to leverage exist logic in Flink, such
> >> >> as JobListener. However, this calculator does not only target to check
> >> >> the conflict, it also targets to provide the calculating result to
> >> >> user before the job is actually deployed in case there is any
> >> >> unexpected configuration. It's a good point that we need to parse the
> >> >> dynamic configs. I prefer to parse the dynamic configs and cli
> >> >> commands in bash instead of adding hook in JobListener.
> >> >>
> >> >> Best,
> >> >> Yangze Guo
> >> >>
> >> >> On Mon, Mar 30, 2020 at 10:32 AM Jeff Zhang  wrote:
> >> >> >
> >> >> > Hi Yangze,
> >> >> >
> >> >> > Does this tool just parse the configuration in flink-conf.yaml ?  
> >> >> > Maybe it could be done in JobListener [1] (we should enhance it via 
> >> >> > adding hook before job submission), so that it could all the cases 
> >> >> > (e.g. parameters coming from command line)
> >> >> >
> >> >> > [1] 
> >> >> > https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/execution/JobListener.java#L35
> >> >> >
> >> >> >
> >> >> > On Mon, Mar 30, 2020 at 9:40 AM Yangze Guo  wrote:
> >> >> >>
> >> >> >> Hi, Yun,
> >> >> >>
> >> >> >> I'm sorry that it currently could not handle it. But I think it is a
> >> >> >> really good idea and that feature would be added to the next version.
> >> >> >>
> >> >> >> Best,
> >> >> >> Yangze Guo
> >> >> >>
> >> >> >> On Mon, Mar 30, 2020 at 12:21 AM Yun Tang  wrote:
> >> >> >> >
> >> >> >> > Very interesting and convenient tool, just a qui

Re: [ANNOUNCE] Apache Flink 1.11.0 released

2020-07-07 Thread Yangze Guo
Thanks, Zhijiang and Piotr. Congrats to everyone involved!

Best,
Yangze Guo

On Wed, Jul 8, 2020 at 10:19 AM Jark Wu  wrote:
>
> Congratulations!
> Thanks Zhijiang and Piotr for the great work as release manager, and thanks
> everyone who makes the release possible!
>
> Best,
> Jark
>
> On Wed, 8 Jul 2020 at 10:12, Paul Lam  wrote:
>
> > Finally! Thanks for Piotr and Zhijiang being the release managers, and
> > everyone that contributed to the release!
> >
> > Best,
> > Paul Lam
> >
> > On Jul 7, 2020, at 22:06, Zhijiang  wrote:
> >
> > The Apache Flink community is very happy to announce the release of
> > Apache Flink 1.11.0, which is the latest major release.
> >
> > Apache Flink® is an open-source stream processing framework for distributed,
> > high-performing, always-available, and accurate data streaming
> > applications.
> >
> > The release is available for download at:
> > https://flink.apache.org/downloads.html
> >
> > Please check out the release blog post for an overview of the improvements 
> > for
> > this new major release:
> > https://flink.apache.org/news/2020/07/06/release-1.11.0.html
> >
> > The full release notes are available in Jira:
> >
> > https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
> >
> > We would like to thank all contributors of the Apache Flink community who 
> > made
> > this release possible!
> >
> > Cheers,
> > Piotr & Zhijiang
> >
> >
> >


Re: Flink Cluster Java 11 support

2020-07-20 Thread Yangze Guo
Hi,

AFAIK, there is no official image with Java 11. However, I think you
could simply build a custom image by changing the base layer[1] to
openjdk:11-jre.

[1] 
https://github.com/apache/flink-docker/blob/949e445006c4fc288813900c264847d23d3e33d4/1.11/scala_2.12-debian/Dockerfile
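
A rough sketch of such a custom image (to the best of my knowledge the official Dockerfile
starts from openjdk:8-jre; the image tag below is made up):

    # Take the official Dockerfile [1] and only swap the base image:
    FROM openjdk:11-jre
    # ...keep the remaining instructions of the official Dockerfile unchanged...

    # then build it and reference it in your Kubernetes manifests:
    #   docker build -t my-registry/flink:1.11.0-scala_2.12-java11 .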

Best,
Yangze Guo


On Mon, Jul 20, 2020 at 7:24 PM Pedro Cardoso  wrote:
>
> Hello,
>
> Are there docker images available for Flink Clusters in Kubernetes that run 
> on Java 11?
>
> Thank you.
> Regards
>
> Pedro Cardoso
>
> Research Data Engineer
>
> pedro.card...@feedzai.com
>
>
>
>
>
>


Re: How to get flink JobId in runtime

2020-07-21 Thread Yangze Guo
Hi Si-li,

Just a reminder that this is not the right way to get the JobId, because
`StreamTask` is actually an internal class. For more discussion about
it, please refer to [1] and [2]. You can still get the JobId this way
at the moment, but please keep in mind that it is not a stable contract.

[1] https://issues.apache.org/jira/browse/FLINK-17862
[2] 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/How-do-I-get-the-IP-of-the-master-and-slave-files-programmatically-in-Flink-td35299.html
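
If you want to stay away from the internal operator API, a sketch of the metric-variable
approach from Congxian's reply quoted below could look like this (the "<job_id>" key is my
assumption based on Flink's metric scope variables, and it is also not a hard contract):

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    public class JobIdAwareMapper extends RichMapFunction<String, String> {
        private transient String jobId;

        @Override
        public void open(Configuration parameters) throws Exception {
            // The metric scope variables exposed by the runtime contain the job id.
            jobId = getRuntimeContext()
                    .getMetricGroup()
                    .getAllVariables()
                    .get("<job_id>");
        }

        @Override
        public String map(String value) {
            return jobId + ": " + value;
        }
    }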

Best,
Yangze Guo

On Tue, Jul 21, 2020 at 4:42 PM Si-li Liu  wrote:
>
> I figure out another way, wrapper my function in a custom StreamOperator that 
> extends AbstractUdfStreamOperator, then I can use 
> this.getContainingTask.getEnvironment.getJobId
>
> On Tue, Jul 21, 2020 at 11:49 AM Congxian Qiu  wrote:
>>
>> Hi Sili
>>
>> I'm not sure if there are other ways to get this value properly. Maybe 
>> you can try 
>> `RuntimeContext.getMetricGroup().getAllVariables().get("")`.
>>
>> Best,
>> Congxian
>>
>>
>>> On Mon, Jul 20, 2020 at 7:38 PM Si-li Liu  wrote:
>>>
>>> Hi
>>>
>>> I want to retrieve flink JobId in runtime, for example, during 
>>> RichFunction's open method. Is there anyway to do it?
>>>
>>> I checked the methods in RuntimeContext and ExecutionConfig, seems I can't 
>>> get this information from them.
>>>
>>> Thanks!
>>>
>>> --
>>> Best regards
>>>
>>> Sili Liu
>
>
>
> --
> Best regards
>
> Sili Liu


Re: Unsubscribe

2020-07-21 Thread Yangze Guo
Hi Harshvardhan,

You need to send an email to user-unsubscr...@flink.apache.org to unsubscribe.

Best,
Yangze Guo

On Tue, Jul 21, 2020 at 7:12 PM Harshvardhan Agrawal
 wrote:

>
> --
> Regards,
> Harshvardhan


Re: [ANNOUNCE] Apache Flink 1.11.1 released

2020-07-22 Thread Yangze Guo
Congrats!

Thanks Dian Fu for being release manager, and everyone involved!

Best,
Yangze Guo

On Wed, Jul 22, 2020 at 3:14 PM Wei Zhong  wrote:
>
> Congratulations! Thanks Dian for the great work!
>
> Best,
> Wei
>
> > On Jul 22, 2020, at 15:09, Leonard Xu  wrote:
> >
> > Congratulations!
> >
> > Thanks Dian Fu for the great work as release manager, and thanks everyone 
> > involved!
> >
> > Best
> > Leonard Xu
> >
> >> On Jul 22, 2020, at 14:52, Dian Fu  wrote:
> >>
> >> The Apache Flink community is very happy to announce the release of Apache 
> >> Flink 1.11.1, which is the first bugfix release for the Apache Flink 1.11 
> >> series.
> >>
> >> Apache Flink® is an open-source stream processing framework for 
> >> distributed, high-performing, always-available, and accurate data 
> >> streaming applications.
> >>
> >> The release is available for download at:
> >> https://flink.apache.org/downloads.html
> >>
> >> Please check out the release blog post for an overview of the improvements 
> >> for this bugfix release:
> >> https://flink.apache.org/news/2020/07/21/release-1.11.1.html
> >>
> >> The full release notes are available in Jira:
> >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12348323
> >>
> >> We would like to thank all contributors of the Apache Flink community who 
> >> made this release possible!
> >>
> >> Regards,
> >> Dian
> >
>


Re: Flink Session TM Logs

2020-07-26 Thread Yangze Guo
Hi, Richard

Before the session has been terminated, you cannot fetch the logs of
terminated TMs. One possible solution could be leveraging log4j2
appenders[1]; Flink uses log4j2 by default in the latest release, 1.11.

[1] https://logging.apache.org/log4j/2.x/manual/appenders.html
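
As a rough example (broker address and topic are placeholders; the Kafka appender additionally
requires kafka-clients on the classpath), conf/log4j.properties in 1.11 uses the log4j2
properties syntax and could be extended with something like:

    # keep the existing appenderRef entries and add:
    rootLogger.appenderRef.kafka.ref = KafkaAppender

    appender.kafka.type = Kafka
    appender.kafka.name = KafkaAppender
    appender.kafka.topic = flink-logs
    appender.kafka.layout.type = PatternLayout
    appender.kafka.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
    appender.kafka.property.type = Property
    appender.kafka.property.name = bootstrap.servers
    appender.kafka.property.value = kafka-broker:9092

This way the TM logs are shipped off the containers while they are still running, so they
remain available after the TMs are gone.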

Best,
Yangze Guo

On Sat, Jul 25, 2020 at 2:37 AM Richard Moorhead
 wrote:
>
>
>
> -- Forwarded message -
> From: Robert Metzger 
> Date: Fri, Jul 24, 2020 at 1:09 PM
> Subject: Re: Flink Session TM Logs
> To: Richard Moorhead 
>
>
> I accidentally replied to you directly, not to the user@ mailing list. Is it 
> okay for you to publish the thread on the list again?
>
>
>
> On Fri, Jul 24, 2020 at 8:01 PM Richard Moorhead  
> wrote:
>>
>> It is enabled. The issue is that for a long running flink session -which may 
>> execute many jobs- the task managers, after a job is completed, are gone, 
>> and their logs arent available.
>>
>> What I have noticed is that when the session is terminated I am able to find 
>> the logs in the job history server under the associated yarn application id.
>>
>> On Fri, Jul 24, 2020 at 12:51 PM Robert Metzger  wrote:
>>>
>>> Hi Richard,
>>>
>>> you need to enable YARN log aggregation to access logs of finished YARN 
>>> applications.
>>>
>>> On Fri, Jul 24, 2020 at 5:58 PM Richard Moorhead 
>>>  wrote:
>>>>
>>>> When running a flink session on YARN, task manager logs for a job are not 
>>>> available after completion. How do we locate these logs?
>>>>


Re: Flink conf/flink-conf.yaml

2020-08-06 Thread Yangze Guo
Hi,

> can I override flink default conf/flink-conf.yaml from flink run command
Yes, you could override it by manually exporting the env variable
FLINK_CONF_DIR.

> can just override selective variables from user define flink-conf.yaml file
No. You could use the default conf/flink-conf.yaml and override
variables with dynamic options (-Dkey=value), or include all the
necessary configuration in your custom flink-conf.yaml.
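
For example (paths and values are placeholders):

    # point the client to an alternative configuration directory
    export FLINK_CONF_DIR=/path/to/my-conf
    ./bin/flink run ...

    # or keep the default conf/flink-conf.yaml and override single options
    ./bin/flink run -Dtaskmanager.memory.process.size=4g ...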

Best,
Yangze Guo

On Fri, Aug 7, 2020 at 4:02 AM Vijayendra Yadav  wrote:
>
> Hi Team,
>
> How can I override flink default conf/flink-conf.yaml from flink run command 
> with custom alternative path.
> Also, when we override flink-conf.yaml, should it contain all variables which 
> are present in flink default conf/flink-conf.yaml or i can just override 
> selective variables from user define flink-conf.yaml file ?
>
> Regards,
> Vijay


Re: [Flink-KAFKA-KEYTAB] Kafkaconsumer error Kerberos

2020-08-13 Thread Yangze Guo
Hi,

When deploying Flink on Yarn, you could ship krb5.conf with the "--ship"
option (-yt). Note that this option only supports shipping folders at the moment.
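
For example (paths are placeholders):

    # krb5.conf has to sit inside a folder, since --ship (-yt) only ships folders
    mkdir -p conf-to-ship
    cp /path/to/krb5.conf conf-to-ship/
    ./bin/flink run -m yarn-cluster -yt conf-to-ship ...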

Best,
Yangze Guo

On Fri, Aug 14, 2020 at 11:22 AM Vijayendra Yadav  wrote:
>
> Any inputs ?
>
> On Tue, Aug 11, 2020 at 10:34 AM Vijayendra Yadav  
> wrote:
>>
>> Dawid, I was able to resolve the keytab issue by passing the service name, 
>> but now I am facing the KRB5 issue.
>>
>> Caused by: org.apache.kafka.common.errors.SaslAuthenticationException: 
>> Failed to create SaslClient with mechanism GSSAPI
>> Caused by: javax.security.sasl.SaslException: Failure to initialize security 
>> context [Caused by GSSException: Invalid name provided (Mechanism level: 
>> KrbException: Cannot locate default realm)]
>>
>> I passed KRB5 from yaml conf file like:
>>
>> env.java.opts.jobmanager: -Djava.security.krb5.conf=/path/krb5.conf
>> env.java.opts.taskmanager: -Djava.security.krb5.conf=/path/krb5.conf
>>
>> How can I resolve this? Is there another way to pass KRB5?
>>
>> I also tried via option#1 from flink run command -D parameter.
>>
>> Regards,
>> Vijay
>>
>>
>> On Tue, Aug 11, 2020 at 1:26 AM Dawid Wysakowicz  
>> wrote:
>>>
>>> Hi,
>>>
>>> As far as I know the approach 2) is the supported way of setting up 
>>> Kerberos authentication in Flink. In the second approach have you tried 
>>> setting the `sasl.kerberos.service.name` in the configuration of your 
>>> KafkaConsumer/Producer[1]? I think this might be the issue.
>>>
>>> Best,
>>>
>>> Dawid
>>>
>>> [1] 
>>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#enabling-kerberos-authentication
>>>
>>>
>>> On 09/08/2020 20:39, Vijayendra Yadav wrote:
>>>
>>> Hi Team,
>>>
>>> I am trying to stream data from kafkaconsumer using: 
>>> https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html
>>>
>>> Here my KAFKA is Kerberos secured and SSL enabled.
>>>
>>> I am running my Flink streaming in yarn-cluster on EMR 5.31.
>>>
>>> I have tried to pass keytab/principal in following 2 Ways:
>>>
>>> 1) Passing as JVM property in Flink run Command.
>>>
>>> /usr/lib/flink/bin/flink run
>>>-yt ${app_install_path}/conf/
>>>  \
>>>>
>>>> -Dsecurity.kerberos.login.use-ticket-cache=false   
>>>>\
>>>> -yDsecurity.kerberos.login.use-ticket-cache=false  
>>>>\
>>>> -Dsecurity.kerberos.login.keytab=${app_install_path}/conf/keytab  \
>>>> -yDsecurity.kerberos.login.keytab=${app_install_path}/conf/.keytab \
>>>> -Djava.security.krb5.conf=${app_install_path}/conf/krb5.conf   
>>>>\
>>>> -yDjava.security.krb5.conf=${app_install_path}/conf/krb5.conf  
>>>>\
>>>> -Dsecurity.kerberos.login.principal=x...@xx.net \
>>>> -yDsecurity.kerberos.login.principal= x...@xx.net\
>>>> -Dsecurity.kerberos.login.contexts=Client,KafkaClient  
>>>>\
>>>> -yDsecurity.kerberos.login.contexts=Client,KafkaClient
>>>
>>>
>>> Here, I am getting the following Error, it seems like KEYTAB Was not 
>>> transported to the run environment and probably not found.
>>>
>>> org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
>>> Caused by: java.lang.IllegalArgumentException: Could not find a 
>>> 'KafkaClient' entry in the JAAS configuration. System property 
>>> 'java.security.auth.login.config'
>>>
>>> 2) Passing from flink config:  /usr/lib/flink/conf/flink-conf.yaml
>>>
>>> security.kerberos.login.use-ticket-cache: false
>>> security.kerberos.login.keytab:  ${app_install_path}/conf/keytab
>>> security.kerberos.login.principal:  x...@xx.net
>>> security.kerberos.login.contexts: Client,KafkaClient
>>>
>>> Here, I am getting the following Error,
>>>
>>> org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
>>> Caused by: org.apache.kafka.common.KafkaException: 
>>> java.lang.IllegalArgumentException: No serviceName defined in either JAAS 
>>> or Kafka config
>>>
>>>
>>> Could you please help find, what are probable causes and resolution?
>>>
>>> Regards,
>>> Vijay
>>>


Re: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread Yangze Guo
Hi,

Do you start the NodeManager on all three machines? If so, could
you check that all the NMs correctly connect to the ResourceManager?

Best,
Yangze Guo

On Tue, Aug 18, 2020 at 10:01 AM 范超  wrote:
>
> Hi, Dev and Users
> I’ve got 3 machines, each with 8 cores and 16GB of memory.
> Following is my Resource Manager screenshot; the cluster has 36GB in total.
> I set the parallelism to 3 or even up to 12, but the task managers always 
> run on two nodes, not all three machines; the third node does not start a 
> task manager.
> I tried setting the -p, -tm and -jm parameters, but it is always the same; the only 
> difference is more containers on the two machines, but still not all three machines 
> start a task manager.
> My question is how to set the CLI parameters so that all three of my machines 
> start task managers.
>
> Thanks a lot
> [screenshot of the YARN Resource Manager UI omitted]
>
>
> Chao fan
>


Re: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread Yangze Guo
Hi,

I think that is only related to the Yarn scheduling strategy. AFAIK,
Flink cannot control it. You could check the RM log to figure out
why it did not schedule the containers onto all three machines. BTW,
if you have a specific requirement to use all three machines, how
about deploying a standalone cluster instead?

Best,
Yangze Guo

On Tue, Aug 18, 2020 at 10:24 AM 范超  wrote:
>
> Thanks Yangze
>
> The NodeManager is started on all 3 machines.
>
> I just don't know why the three machines are not each running a Flink TaskManager, 
> and how to achieve this.
>
> -Original Message-
> From: Yangze Guo [mailto:karma...@gmail.com]
> Sent: Tuesday, August 18, 2020 10:10
> To: 范超 
> Cc: user (user@flink.apache.org) 
> Subject: Re: How to specify the number of TaskManagers in Yarn Cluster using 
> Per-Job Mode
>
> Hi,
>
> Do you start the NodeManager in all the three machines? If so, could you 
> check all the NMs correctly connect to the ResourceManager?
>
> Best,
> Yangze Guo
>
> On Tue, Aug 18, 2020 at 10:01 AM 范超  wrote:
> >
> > Hi, Dev and Users
> > I’ve 3 machines each one is 8 cores and 16GB memory.
> > Following it’s my Resource Manager screenshot the cluster have 36GB total.
> > I specify the paralism to 3 or even up to 12,  But the task manager is 
> > always running on two nodes not all three machine, the third node does not 
> > start the task manager.
> > I tried set the –p –tm –jm parameters, but it always the same, only 
> > different is more container on the two maching but not all three machine 
> > start the task manager.
> > My question is how to set the cli parameter to start all of my three
> > machine (all task manager start on 3 machines)
> >
> > Thanks a lot
> > [screenshot of the YARN Resource Manager UI omitted]
> >
> >
> > Chao fan
> >


Re: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread Yangze Guo
Hi,

Flink can control how many TMs to start, but where to start the TMs
depends on Yarn.

Do you meet any problem when deploying on Yarn or running the Flink job?
Why do you need to start the TMs on all three machines?

Best,
Yangze Guo

On Tue, Aug 18, 2020 at 11:25 AM 范超  wrote:
>
> Thanks Yangze
> The reason why I don't deploy a standalone cluster is that there are 
> Kafka, Kudu, Hadoop and ZooKeeper on these machines; maybe currently using 
> Yarn to manage resources is the best choice for me.
> If Flink cannot control how many TMs to start, could anyone provide me 
> some best practices for deploying on Yarn please? I read [1] and it is still 
> not very clear to me.
>
> [1] 
> https://www.ververica.com/blog/how-to-size-your-apache-flink-cluster-general-guidelines
>
> -Original Message-
> From: Yangze Guo [mailto:karma...@gmail.com]
> Sent: Tuesday, August 18, 2020 10:50
> To: 范超 
> Cc: user (user@flink.apache.org) 
> Subject: Re: How to specify the number of TaskManagers in Yarn Cluster using 
> Per-Job Mode
>
> Hi,
>
> I think that is only related to the Yarn scheduling strategy. AFAIK, Flink 
> could not control it. You could check the RM log to figure out why it did not 
> schedule the containers to all the three machines. BTW, if you have specific 
> requirements to start with all the three machines, how about deploying a 
> standalone cluster instead?
>
> Best,
> Yangze Guo
>
> On Tue, Aug 18, 2020 at 10:24 AM 范超  wrote:
> >
> > Thanks Yangze
> >
> > All 3 machines NodeManager is started.
> >
> > I just don't know why not three machines each running a Flink
> > TaskManager and how to achieve this
> >
> > -Original Message-
> > From: Yangze Guo [mailto:karma...@gmail.com]
> > Sent: Tuesday, August 18, 2020 10:10
> > To: 范超 
> > Cc: user (user@flink.apache.org) 
> > Subject: Re: How to specify the number of TaskManagers in Yarn Cluster
> > using Per-Job Mode
> >
> > Hi,
> >
> > Do you start the NodeManager in all the three machines? If so, could you 
> > check all the NMs correctly connect to the ResourceManager?
> >
> > Best,
> > Yangze Guo
> >
> > On Tue, Aug 18, 2020 at 10:01 AM 范超  wrote:
> > >
> > > Hi, Dev and Users
> > > I’ve 3 machines each one is 8 cores and 16GB memory.
> > > Following it’s my Resource Manager screenshot the cluster have 36GB total.
> > > I specify the paralism to 3 or even up to 12,  But the task manager is 
> > > always running on two nodes not all three machine, the third node does 
> > > not start the task manager.
> > > I tried set the –p –tm –jm parameters, but it always the same, only 
> > > different is more container on the two maching but not all three machine 
> > > start the task manager.
> > > My question is how to set the cli parameter to start all of my three
> > > machine (all task manager start on 3 machines)
> > >
> > > Thanks a lot
> > > [screenshot of the YARN Resource Manager UI omitted]
> > >
> > >
> > > Chao fan
> > >


Re: How to specify the number of TaskManagers in Yarn Cluster using Per-Job Mode

2020-08-17 Thread Yangze Guo
The number of TMs mainly depends on the parallelism and the job graph.
Flink now allows you to set the maximum number of slots
(slotmanager.number-of-slots-max[1]). There is also a plan to support
setting the minimum number of slots[2].

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#slotmanager-number-of-slots-max
[2] https://issues.apache.org/jira/browse/FLINK-15959
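
For example, with illustrative numbers in flink-conf.yaml:

    taskmanager.numberOfTaskSlots: 4
    slotmanager.number-of-slots-max: 12   # at most 12 slots, i.e. at most 3 TMs here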

Best,
Yangze Guo

On Tue, Aug 18, 2020 at 12:21 PM 范超  wrote:
>
> Thanks Yangze
>
> 1. Do you meet any problem when deploying on Yarn or running Flink job?
> My job works well
>
> 2. Why do you need to start the TMs on all the three machines?
> From the cluster perspective, I wonder if the processing pressure can be balanced 
> across the 3 machines.
>
> 3. Flink can control how many TM to start, but where to start the TMs depends 
> on Yarn.
> Yes, where to start the TMs depends on Yarn.
> Could you please tell me which parameter controls how many TMs to start? The -yn 
> parameter was removed in 1.10, as the 1.9 doc sample [1] below shows.
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/cli.html
>
> Run example program using a per-job YARN cluster with 2 TaskManagers:
>
> ./bin/flink run -m yarn-cluster -yn 2 \
>./examples/batch/WordCount.jar \
>--input hdfs:///user/hamlet.txt --output 
> hdfs:///user/wordcount_out
>
> -Original Message-
> From: Yangze Guo [mailto:karma...@gmail.com]
> Sent: Tuesday, August 18, 2020 11:31
> To: 范超 
> Cc: user (user@flink.apache.org) 
> Subject: Re: How to specify the number of TaskManagers in Yarn Cluster using 
> Per-Job Mode
>
> Hi,
>
> Flink can control how many TM to start, but where to start the TMs depends on 
> Yarn.
>
> Do you meet any problem when deploying on Yarn or running Flink job?
> Why do you need to start the TMs on all the three machines?
>
> Best,
> Yangze Guo
>
> On Tue, Aug 18, 2020 at 11:25 AM 范超  wrote:
> >
> > Thanks Yangze
> > The reason why I don’t deploying a standalone cluster, it's because there 
> > kafka, kudu, hadoop, zookeeper on these machines, maybe currently using the 
> > yarn to manage resources is the best choice for me.
> > If Flink can not control how many tm to start , could anyone providing
> > me some best practice for deploying on yarn please? I read the [1] and
> > still don't very clear
> >
> > [1]
> > https://www.ververica.com/blog/how-to-size-your-apache-flink-cluster-g
> > eneral-guidelines
> >
> > -Original Message-
> > From: Yangze Guo [mailto:karma...@gmail.com]
> > Sent: Tuesday, August 18, 2020 10:50
> > To: 范超 
> > Cc: user (user@flink.apache.org) 
> > Subject: Re: How to specify the number of TaskManagers in Yarn Cluster
> > using Per-Job Mode
> >
> > Hi,
> >
> > I think that is only related to the Yarn scheduling strategy. AFAIK, Flink 
> > could not control it. You could check the RM log to figure out why it did 
> > not schedule the containers to all the three machines. BTW, if you have 
> > specific requirements to start with all the three machines, how about 
> > deploying a standalone cluster instead?
> >
> > Best,
> > Yangze Guo
> >
> > On Tue, Aug 18, 2020 at 10:24 AM 范超  wrote:
> > >
> > > Thanks Yangze
> > >
> > > All 3 machines NodeManager is started.
> > >
> > > I just don't know why not three machines each running a Flink
> > > TaskManager and how to achieve this
> > >
> > > -Original Message-
> > > From: Yangze Guo [mailto:karma...@gmail.com]
> > > Sent: Tuesday, August 18, 2020 10:10
> > > To: 范超 
> > > Cc: user (user@flink.apache.org) 
> > > Subject: Re: How to specify the number of TaskManagers in Yarn Cluster
> > > using Per-Job Mode
> > >
> > > Hi,
> > >
> > > Do you start the NodeManager in all the three machines? If so, could you 
> > > check all the NMs correctly connect to the ResourceManager?
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Tue, Aug 18, 2020 at 10:01 AM 范超  wrote:
> > > >
> > > > Hi, Dev and Users
> > > > I’ve 3 machines each one is 8 cores and 16GB memory.
> > > > Following it’s my Resource Manager screenshot the cluster have 36GB 
> > > > total.
> > > > I specify the paralism to 3 or even up to 12,  But the task manager is 
> > > > always running on two nodes not all three machine, the third node does 
> > > > not start the task manager.
> > > > I tried set the –p –tm –jm parameters, but it always the same, only 
> > > > different is more container on the two maching but not all three 
> > > > machine start the task manager.
> > > > My question is how to set the cli parameter to start all of my
> > > > three machine (all task manager start on 3 machines)
> > > >
> > > > Thanks a lot
> > > > [screenshot of the YARN Resource Manager UI omitted]
> > > >
> > > >
> > > > Chao fan
> > > >


Re: Setting job/task manager memory management in kubernetes

2020-08-24 Thread Yangze Guo
Hi, Sakshi

Could you provide more information about:
- What is the Flink version you are using? "taskmanager.heap.size" is
deprecated since 1.10[1].
- How do you deploy the cluster? With the native K8s approach[2] or
standalone on K8s[3]?

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_migration.html
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
[3] 
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html

Best,
Yangze Guo

On Mon, Aug 24, 2020 at 6:31 PM Sakshi Bansal  wrote:
>
> Hello,
>
> I am trying to set the heap size of job and task manager when deploying the 
> job in kubernetes. I have set the jobmanager.heap.size and 
> taskmanager.heap.size. However, the custom values are not being used and it 
> is creating its own values and starting the job. How can I set custom values?
>
> --
> Thanks and Regards
> Sakshi Bansal


Re: Setting job/task manager memory management in kubernetes

2020-08-24 Thread Yangze Guo
Hi,

You need to define them in "flink-configuration-configmap.yaml"[1].
Please also make sure you've created the config map by executing
"kubectl create -f flink-configuration-configmap.yaml".

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/deployment/kubernetes.html

Best,
Yangze Guo

On Mon, Aug 24, 2020 at 9:33 PM Sakshi Bansal  wrote:
>
> The flink version is 1.9 and it is a standalone k8s
>
> On Mon 24 Aug, 2020, 17:17 Yangze Guo,  wrote:
>>
>> Hi, Sakshi
>>
>> Could you provide more information about:
>> - What is the Flink version you are using? "taskmanager.heap.size" is
>> deprecated since 1.10[1].
>> - How do you deploy the cluster? In the approach of native k8s[2] or
>> the standalone k8s[3]?
>>
>> [1] 
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/memory/mem_migration.html
>> [2] 
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/cluster_setup.html
>> [3] 
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html
>>
>> Best,
>> Yangze Guo
>>
>> On Mon, Aug 24, 2020 at 6:31 PM Sakshi Bansal  
>> wrote:
>> >
>> > Hello,
>> >
>> > I am trying to set the heap size of job and task manager when deploying 
>> > the job in kubernetes. I have set the jobmanager.heap.size and 
>> > taskmanager.heap.size. However, the custom values are not being used and 
>> > it is creating its own values and starting the job. How can I set custom 
>> > values?
>> >
>> > --
>> > Thanks and Regards
>> > Sakshi Bansal


Re: [ANNOUNCE] Apache Flink 1.10.2 released

2020-08-24 Thread Yangze Guo
Thanks a lot for being the release manager Zhu Zhu!
Congrats to all others who have contributed to the release!

Best,
Yangze Guo

On Tue, Aug 25, 2020 at 2:42 PM Dian Fu  wrote:
>
> Thanks ZhuZhu for managing this release and everyone else who contributed to 
> this release!
>
> Regards,
> Dian
>
> 在 2020年8月25日,下午2:22,Till Rohrmann  写道:
>
> Great news. Thanks a lot for being our release manager Zhu Zhu and to all 
> others who have contributed to the release!
>
> Cheers,
> Till
>
> On Tue, Aug 25, 2020 at 5:37 AM Zhu Zhu  wrote:
>>
>> The Apache Flink community is very happy to announce the release of Apache 
>> Flink 1.10.2, which is the first bugfix release for the Apache Flink 1.10 
>> series.
>>
>> Apache Flink® is an open-source stream processing framework for distributed, 
>> high-performing, always-available, and accurate data streaming applications.
>>
>> The release is available for download at:
>> https://flink.apache.org/downloads.html
>>
>> Please check out the release blog post for an overview of the improvements 
>> for this bugfix release:
>> https://flink.apache.org/news/2020/08/25/release-1.10.2.html
>>
>> The full release notes are available in Jira:
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12347791
>>
>> We would like to thank all contributors of the Apache Flink community who 
>> made this release possible!
>>
>> Thanks,
>> Zhu
>
>


Re: Re: [ANNOUNCE] New PMC member: Dian Fu

2020-08-27 Thread Yangze Guo
Congrats Dian!

Best,
Yangze Guo

On Thu, Aug 27, 2020 at 6:26 PM Zhu Zhu  wrote:
>
> Congratulations Dian!
>
> Thanks,
> Zhu
>
> Zhijiang  于2020年8月27日周四 下午6:04写道:
>
> > Congrats, Dian!
> >
> > --
> > From:Yun Gao 
> > Send Time:2020年8月27日(星期四) 17:44
> > To:dev ; Dian Fu ; user <
> > user@flink.apache.org>; user-zh 
> > Subject:Re: Re: [ANNOUNCE] New PMC member: Dian Fu
> >
> > Congratulations Dian !
> >
> >  Best
> >  Yun
> >
> >
> > --
> > Sender:Marta Paes Moreira
> > Date:2020/08/27 17:42:34
> > Recipient:Yuan Mei
> > Cc:Xingbo Huang; jincheng sun > >; dev; Dian Fu; user<
> > user@flink.apache.org>; user-zh
> > Theme:Re: [ANNOUNCE] New PMC member: Dian Fu
> >
> > Congrats, Dian!
> > On Thu, Aug 27, 2020 at 11:39 AM Yuan Mei  wrote:
> >
> > Congrats!
> > On Thu, Aug 27, 2020 at 5:38 PM Xingbo Huang  wrote:
> >
> > Congratulations Dian!
> >
> > Best,
> > Xingbo
> > jincheng sun  于2020年8月27日周四 下午5:24写道:
> >
> > Hi all,
> >
> >
> > On behalf of the Flink PMC, I'm happy to announce that Dian Fu is now part 
> > of the Apache Flink Project Management Committee (PMC).
> >
> >
> > Dian Fu has been very active on PyFlink component, working on various 
> > important features, such as the Python UDF and Pandas integration, and 
> > keeps checking and voting for our releases, and also has successfully 
> > produced two releases(1.9.3&1.11.1) as RM, currently working as RM to push 
> > forward the release of Flink 1.12.
> >
> > Please join me in congratulating Dian Fu for becoming a Flink PMC Member!
> >
> > Best,
> > Jincheng(on behalf of the Flink PMC)
> >
> >
> >


Re: Use of slot sharing groups causing workflow to hang

2020-09-02 Thread Yangze Guo
Hi,

The failure of requesting slots is usually because of a lack of
resources. If you put part of the workflow into a specific slot sharing
group, it may require more slots to run the workflow than before.
Could you share the logs of the ResourceManager and SlotManager? I
think there are more clues in them.
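
For illustration, a minimal sketch (Scala, with a hypothetical group
name) of how putting part of a pipeline into its own slot sharing group
raises the number of slots the job requests:

```
import org.apache.flink.streaming.api.scala._

object SlotSharingSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // By default all operators stay in the "default" slot sharing group,
    // so one slot can hold one subtask of every operator.
    val words = env.fromElements("a", "b", "c").map(_.toUpperCase)

    // Moving this part into its own group ("isolated" is just a
    // hypothetical name) prevents its subtasks from sharing slots with
    // the rest, so the job needs more slots in total.
    words.map(w => (w, 1)).slotSharingGroup("isolated").print()

    env.execute("slot-sharing-sketch")
  }
}
```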

Best,
Yangze Guo

On Thu, Sep 3, 2020 at 4:39 AM Ken Krugler  wrote:
>
> Hi all,
>
> I’ve got a streaming workflow (using Flink 1.11.1) that runs fine locally 
> (via Eclipse), with a parallelism of either 3 or 6.
>
> If I set up part of the workflow to use a specific (not “default”) slot 
> sharing group with a parallelism of 3, and the remaining portions of the 
> workflow have a parallelism of either 1 or 2, then the workflow never starts 
> running, and eventually fails due to a slot request not being fulfilled in 
> time.
>
> So I’m wondering how best to debug this.
>
> I don’t see any information (even at DEBUG level) being logged about which 
> operators are in what slot sharing group, or which slots are assigned to what 
> groups.
>
> Thanks,
>
> — Ken
>
> PS - I’ve looked at https://issues.apache.org/jira/browse/FLINK-8712, and 
> tried the approach of setting # of slots in the config, but that didn’t 
> change anything. I see that issue is still open, so wondering what Til and 
> Konstantin have to say about it.
>
> --
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>


Re: Difficulties with Minio state storage

2020-09-08 Thread Yangze Guo
Hi, Rex,

I've tried to use MinIO as the state backend and everything seems to work well.
Just sharing my configuration:
```
s3.access-key:
s3.secret-key:
s3.endpoint: http://localhost:9000
s3.path.style.access: true
state.checkpoints.dir: s3://flink/checkpoints
```

I think the problem might be caused by the following reasons:
- The MinIO is not well configured.
- Maybe you need to create a bucket for it first. In my case, I created
a bucket called "flink" first (see the sketch below).
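
For illustration, a checkpoint URI that matches the configuration above
(a sketch, assuming a bucket named "flink"): the URI names the bucket
and a path inside it, while the MinIO host and port belong in
s3.endpoint, not in the path.

```
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala._

object MinioCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // "flink" is the bucket; the endpoint host/port come from
    // s3.endpoint in flink-conf.yaml, not from this URI.
    env.setStateBackend(
      new RocksDBStateBackend("s3://flink/checkpoints", true))
    env.fromElements(1, 2, 3).print()
    env.execute("minio-checkpoint-sketch")
  }
}
```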

Best,
Yangze Guo

On Wed, Sep 9, 2020 at 9:33 AM Rex Fenley  wrote:
>
> Hello!
>
> I'm trying to test out Minio as state storage backend using docker-compose on 
> my local machine but keep running into errors that seem strange to me. Any 
> help would be much appreciated :)
>
> The problem:
> With the following environment:
>
> environment:
> - |
> FLINK_PROPERTIES=
> jobmanager.rpc.address: flink-jobmanager
> parallelism.default: 2
> s3.access-key: 
> s3.secret-key: 
> s3.path.style.access: true
>
> And the following State Backend (with flink-jdbc-test_graph-minio_1 being the 
> container serving minio):
>
> val bsEnv = StreamExecutionEnvironment.getExecutionEnvironment
> bsEnv.setStateBackend(
> new RocksDBStateBackend(
> "s3://flink-jdbc-test_graph-minio_1/data/checkpoints:9000",
> true
> )
> )
>
> And submitting the flink job and saving from another docker container like so:
>
> flink run -m flink-jdbc-test_flink-jobmanager_1:8081 -c  
> .jar
>
> flink savepoint -m flink-jdbc-test_flink-jobmanager_1:8081  
> s3://flink-jdbc-test_graph-minio_1:9000/data/savepoints
>
> I end up with the following error:
>
> Caused by: 
> com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException:
>  com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: 
> Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 
> A7E3BB7EEFB524FD; S3 Extended Request ID: 
> cJOtc6E3Kb+U5hgbkA+09Dd/ouDHBGL2ftb1pGHpIwFgd6tE461nkaDtjOj40zbWEpFAcMOEmbY=),
>  S3 Extended Request ID: 
> cJOtc6E3Kb+U5hgbkA+09Dd/ouDHBGL2ftb1pGHpIwFgd6tE461nkaDtjOj40zbWEpFAcMOEmbY= 
> (Path: 
> s3://flink-jdbc-test_graph-minio_1:9000/data/savepoints/savepoint-5c4090-5f90e0cdc603/_metadata)
> at 
> com.facebook.presto.hive.s3.PrestoS3FileSystem.lambda$getS3ObjectMetadata$2(PrestoS3FileSystem.java:573)
> at com.facebook.presto.hive.RetryDriver.run(RetryDriver.java:138)
> at 
> com.facebook.presto.hive.s3.PrestoS3FileSystem.getS3ObjectMetadata(PrestoS3FileSystem.java:560)
> at 
> com.facebook.presto.hive.s3.PrestoS3FileSystem.getFileStatus(PrestoS3FileSystem.java:311)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
> at 
> com.facebook.presto.hive.s3.PrestoS3FileSystem.create(PrestoS3FileSystem.java:356)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:887)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:784)
> at 
> org.apache.flink.fs.s3presto.common.HadoopFileSystem.create(HadoopFileSystem.java:141)
> at 
> org.apache.flink.fs.s3presto.common.HadoopFileSystem.create(HadoopFileSystem.java:37)
> at 
> org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.create(PluginFileSystemFactory.java:169)
> at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.(FsCheckpointMetadataOutputStream.java:65)
> at 
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:109)
> at 
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:306)
> ... 10 more
> Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request 
> (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request 
> ID: A7E3BB7EEFB524FD; S3 Extended Request ID: 
> cJOtc6E3Kb+U5hgbkA+09Dd/ouDHBGL2ftb1pGHpIwFgd6tE461nkaDtjOj40zbWEpFAcMOEmbY=)
> at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1799)
>
> If I add to the environment to include:
> ...
> s3.endpoint: s3://flink-jdbc-test_graph-minio_1:9000
> ...
>
> Then I end up with the following error just trying to submit the job:
> Caused by: java.lang.IllegalArgumentException: Endpoint does not contain a 
> valid host name: s3://flink-jdbc-test_graph-minio_1:9000
> at 
> com.amazonaws.AmazonWebServiceClient.computeSignerByURI(AmazonWebServiceClient.java:426)
> at 
> com.amazonaws.AmazonWebServiceClient.setEndpoint(AmazonWebServiceClient.java:318)
>
> Changing s3: to http: like so:
> s3.endpoint: http://flink-jdbc-test_graph-minio_1:9000
>
> Then I receive the same error as 

Re: Use of slot sharing groups causing workflow to hang

2020-09-09 Thread Yangze Guo
Hi, Ken

From the RM perspective, could you share the following logs:
- "Request slot with profile {} for job {} with allocation id {}.".
- "Requesting new slot [{}] and profile {} with allocation id {} from
resource manager."
This will help to figure out how many slots your job indeed requests.
And probably help to figure out what the ExecutionGraph finally looks
like.


Best,
Yangze Guo

On Thu, Sep 10, 2020 at 10:47 AM Ken Krugler
 wrote:
>
> Hi Til,
>
> On Sep 3, 2020, at 12:31 AM, Till Rohrmann  wrote:
>
> Hi Ken,
>
> I believe that we don't have a lot if not any explicit logging about the slot 
> sharing group in the code. You can, however, learn indirectly about it by 
> looking at the required number of AllocatedSlots in the SlotPool. Also the 
> number of "multi task slot" which are created should vary because every group 
> of slot sharing tasks will create one of them. For learning about the 
> SlotPoolImpl's status, you can also take a look at SlotPoolImpl.printStatus.
>
> For the underlying problem, I believe that Yangze could be right. How many 
> resources do you have in your cluster?
>
>
> I've got a Flink MiniCluster with 12 slots. Even with only 6 pipelined
> operators, each with a parallelism of 1, it still hangs while starting. So
> I don't think that it's a resource issue.
>
> One odd thing I've noticed. I've got three streams that I union together.
> Two of the streams are in separate slot sharing groups, the third is not
> assigned to a group. But when I check the logs, I see three "Create multi
> task slot" entries. I'm wondering if unioning streams that are in different
> slot sharing groups creates a problem.
>
> Thanks,
>
> -- Ken
>
> On Thu, Sep 3, 2020 at 4:25 AM Yangze Guo  wrote:
>>
>> Hi,
>>
>> The failure of requesting slots usually because of the lack of
>> resources. If you put part of the workflow to a specific slot sharing
>> group, it may require more slots to run the workflow than before.
>> Could you share logs of the ResourceManager and SlotManager, I think
>> there are more clues in it.
>>
>> Best,
>> Yangze Guo
>>
>> On Thu, Sep 3, 2020 at 4:39 AM Ken Krugler  
>> wrote:
>> >
>> > Hi all,
>> >
>> > I’ve got a streaming workflow (using Flink 1.11.1) that runs fine locally 
>> > (via Eclipse), with a parallelism of either 3 or 6.
>> >
>> > If I set up part of the workflow to use a specific (not “default”) slot 
>> > sharing group with a parallelism of 3, and the remaining portions of the 
>> > workflow have a parallelism of either 1 or 2, then the workflow never 
>> > starts running, and eventually fails due to a slot request not being 
>> > fulfilled in time.
>> >
>> > So I’m wondering how best to debug this.
>> >
>> > I don’t see any information (even at DEBUG level) being logged about which 
>> > operators are in what slot sharing group, or which slots are assigned to 
>> > what groups.
>> >
>> > Thanks,
>> >
>> > — Ken
>> >
>> > PS - I’ve looked at https://issues.apache.org/jira/browse/FLINK-8712, and 
>> > tried the approach of setting # of slots in the config, but that didn’t 
>> > change anything. I see that issue is still open, so wondering what Til and 
>> > Konstantin have to say about it.
>> >
>> > --
>> > Ken Krugler
>> > http://www.scaleunlimited.com
>> > custom big data solutions & training
>> > Hadoop, Cascading, Cassandra & Solr
>> >
>
>
> --
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>


Re: Flink multiple task managers setup

2020-09-17 Thread Yangze Guo
Hi,

From my understanding, you want to set up a standalone cluster in your
local machine. If that is the case, you could simply edit the
$FLINK_DIST/conf/workers, in which each line represents a TM host. By
default, there is only one TM in localhost. In your case, you could
add a line 'localhost' to it. Then, execute the
$FLINK_DIST/bin/start-cluster.sh, you could see a standalone cluster
with two TM in your local machine.
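
For illustration, a sketch of the file contents and the command (on
1.9.x the file is named "slaves" instead of "workers");
start-cluster.sh should then print one "Starting taskexecutor daemon on
host ..." line per entry:

```
$ cat conf/workers        # one TaskManager per line
localhost
localhost
$ ./bin/start-cluster.sh
```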

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/cluster_setup.html#configuring-flink

Best,
Yangze Guo

On Thu, Sep 17, 2020 at 3:16 PM saksham sapra  wrote:
>
> Hi ,
>
> I am unable to set two task managers in my local machine and neither any 
> documentation provided for the same.
>
> I want to run a parallel job in two task managers using flink.
> kindly help me with the same, how can i set up in my local without using any 
> zookeeper or something.
>
>
> Thanks & Regards,
> Saksham Sapra
>
>


Re: Flink multiple task managers setup

2020-09-17 Thread Yangze Guo
Hi,

> I wasnt having "workers" file in conf/workers so i created one, but i have 
> "slaves" file in  conf/workers, so i edited both two localhost like 
> screenshot given below :
Yes, for 1.9.3, you need to edit the 'slaves' file.

I think we need more information to figure out what happened.
- What is the output when you execute ./bin/start-cluster.sh, could
you see two "Starting taskexecutor daemon on host" lines?
- Could you see two flink-xxx-taskexecutor-xxx.log in $FLINK_DIST/log?
If so, could you share these two log files?

Best,
Yangze Guo



On Thu, Sep 17, 2020 at 4:06 PM saksham sapra  wrote:
>
> Hi Yangze,
>
> Thanks for replying, but i still have some questions.
> I wasnt having "workers" file in conf/workers so i created one, but i have 
> "slaves" file in  conf/workers, so i edited both two localhost like 
> screenshot given below :
>
>
>
>
>
>
> and then again started flink , but i can see only one task manager
>
>
> Please find my config.yaml file attached.
>
>
> Thanks for helping.
>
> Thanks & Regards,
> Saksham Sapra
>
> On Thu, Sep 17, 2020 at 12:57 PM Yangze Guo  wrote:
>>
>> Hi,
>>
>> From my understanding, you want to set up a standalone cluster in your
>> local machine. If that is the case, you could simply edit the
>> $FLINK_DIST/conf/workers, in which each line represents a TM host. By
>> default, there is only one TM in localhost. In your case, you could
>> add a line 'localhost' to it. Then, execute the
>> $FLINK_DIST/bin/start-cluster.sh, you could see a standalone cluster
>> with two TM in your local machine.
>>
>> [1] 
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/cluster_setup.html#configuring-flink
>>
>> Best,
>> Yangze Guo
>>
>> On Thu, Sep 17, 2020 at 3:16 PM saksham sapra  
>> wrote:
>> >
>> > Hi ,
>> >
>> > I am unable to set two task managers in my local machine and neither any 
>> > documentation provided for the same.
>> >
>> > I want to run a parallel job in two task managers using flink.
>> > kindly help me with the same, how can i set up in my local without using 
>> > any zookeeper or something.
>> >
>> >
>> > Thanks & Regards,
>> > Saksham Sapra
>> >
>> >


Re: Flink multiple task managers setup

2020-09-17 Thread Yangze Guo
Hi,

It seems you are running it on Windows. In that case, only
start-cluster.bat can be used. However, this script can only start one
TM[1] no matter how you configure the slaves/workers file.

[1] 
https://github.com/apache/flink/blob/release-1.9/flink-dist/src/main/flink-bin/bin/start-cluster.bat

Best,
Yangze Guo

On Thu, Sep 17, 2020 at 4:53 PM saksham sapra  wrote:
>
> HI Yangze,
>
> I tried to run start-cluster.sh and i can see in host , when flink tries to 
> run second task manager or executor, pop up or host gets closed.
> Please find attached logs for two command : start-cluster.sh and 
> start-cluster.bat.
>
> Thanks & Regards,
> Saksham
>
> On Thu, Sep 17, 2020 at 2:00 PM Yangze Guo  wrote:
>>
>> Hi,
>>
>> > I wasnt having "workers" file in conf/workers so i created one, but i have 
>> > "slaves" file in  conf/workers, so i edited both two localhost like 
>> > screenshot given below :
>> Yes, for 1.9.3, you need to edit the 'slaves' file.
>>
>> I think we need more information to figure out what happened.
>> - What is the output when you execute ./bin/start-cluster.sh, could
>> you see two "Starting taskexecutor daemon on host" lines?
>> - Could you see two flink-xxx-taskexecutor-xxx.log in $FLINK_DIST/log?
>> If so, could you share these two log files?
>>
>> Best,
>> Yangze Guo
>>
>> Best,
>> Yangze Guo
>>
>>
>> On Thu, Sep 17, 2020 at 4:06 PM saksham sapra  
>> wrote:
>> >
>> > Hi Yangze,
>> >
>> > Thanks for replying, but i still have some questions.
>> > I wasnt having "workers" file in conf/workers so i created one, but i have 
>> > "slaves" file in  conf/workers, so i edited both two localhost like 
>> > screenshot given below :
>> >
>> >
>> >
>> >
>> >
>> >
>> > and then again started flink , but i can see only one task manager
>> >
>> >
>> > Please find my config.yaml file attached.
>> >
>> >
>> > Thanks for helping.
>> >
>> > Thanks & Regards,
>> > Saksham Sapra
>> >
>> > On Thu, Sep 17, 2020 at 12:57 PM Yangze Guo  wrote:
>> >>
>> >> Hi,
>> >>
>> >> From my understanding, you want to set up a standalone cluster in your
>> >> local machine. If that is the case, you could simply edit the
>> >> $FLINK_DIST/conf/workers, in which each line represents a TM host. By
>> >> default, there is only one TM in localhost. In your case, you could
>> >> add a line 'localhost' to it. Then, execute the
>> >> $FLINK_DIST/bin/start-cluster.sh, you could see a standalone cluster
>> >> with two TM in your local machine.
>> >>
>> >> [1] 
>> >> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/cluster_setup.html#configuring-flink
>> >>
>> >> Best,
>> >> Yangze Guo
>> >>
>> >> On Thu, Sep 17, 2020 at 3:16 PM saksham sapra  
>> >> wrote:
>> >> >
>> >> > Hi ,
>> >> >
>> >> > I am unable to set two task managers in my local machine and neither 
>> >> > any documentation provided for the same.
>> >> >
>> >> > I want to run a parallel job in two task managers using flink.
>> >> > kindly help me with the same, how can i set up in my local without 
>> >> > using any zookeeper or something.
>> >> >
>> >> >
>> >> > Thanks & Regards,
>> >> > Saksham Sapra
>> >> >
>> >> >


Re: Flink multiple task managers setup

2020-09-17 Thread Yangze Guo
Sorry, the community has decided not to maintain the Windows scripts
anymore; you could take a look at [1].

[1] 
https://lists.apache.org/thread.html/r7693d0c06ac5ced9a34597c662bcf37b34ef8e799c32cc0edee373b2%40%3Cdev.flink.apache.org%3E

Best,
Yangze Guo

On Thu, Sep 17, 2020 at 5:21 PM saksham sapra  wrote:
>
> Thanks Yangze, So should i raise a JIRA Ticket for the same on the flink 
> community group?
>
> Thanks & Regards,
> Saksham
>
> On Thu, Sep 17, 2020 at 2:38 PM Yangze Guo  wrote:
>>
>> Hi,
>>
>> It seems you run it in Windows. In that case, only start-cluster.bat
>> could be used. However, this script could only start one TM[1] no
>> matter how you configure the slaves/workers.
>>
>> [1] 
>> https://github.com/apache/flink/blob/release-1.9/flink-dist/src/main/flink-bin/bin/start-cluster.bat
>>
>> Best,
>> Yangze Guo
>>
>> On Thu, Sep 17, 2020 at 4:53 PM saksham sapra  
>> wrote:
>> >
>> > HI Yangze,
>> >
>> > I tried to run start-cluster.sh and i can see in host , when flink tries 
>> > to run second task manager or executor, pop up or host gets closed.
>> > Please find attached logs for two command : start-cluster.sh and 
>> > start-cluster.bat.
>> >
>> > Thanks & Regards,
>> > Saksham
>> >
>> > On Thu, Sep 17, 2020 at 2:00 PM Yangze Guo  wrote:
>> >>
>> >> Hi,
>> >>
>> >> > I wasnt having "workers" file in conf/workers so i created one, but i 
>> >> > have "slaves" file in  conf/workers, so i edited both two localhost 
>> >> > like screenshot given below :
>> >> Yes, for 1.9.3, you need to edit the 'slaves' file.
>> >>
>> >> I think we need more information to figure out what happened.
>> >> - What is the output when you execute ./bin/start-cluster.sh, could
>> >> you see two "Starting taskexecutor daemon on host" lines?
>> >> - Could you see two flink-xxx-taskexecutor-xxx.log in $FLINK_DIST/log?
>> >> If so, could you share these two log files?
>> >>
>> >> Best,
>> >> Yangze Guo
>> >>
>> >> Best,
>> >> Yangze Guo
>> >>
>> >>
>> >> On Thu, Sep 17, 2020 at 4:06 PM saksham sapra  
>> >> wrote:
>> >> >
>> >> > Hi Yangze,
>> >> >
>> >> > Thanks for replying, but i still have some questions.
>> >> > I wasnt having "workers" file in conf/workers so i created one, but i 
>> >> > have "slaves" file in  conf/workers, so i edited both two localhost 
>> >> > like screenshot given below :
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > and then again started flink , but i can see only one task manager
>> >> >
>> >> >
>> >> > Please find my config.yaml file attached.
>> >> >
>> >> >
>> >> > Thanks for helping.
>> >> >
>> >> > Thanks & Regards,
>> >> > Saksham Sapra
>> >> >
>> >> > On Thu, Sep 17, 2020 at 12:57 PM Yangze Guo  wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> From my understanding, you want to set up a standalone cluster in your
>> >> >> local machine. If that is the case, you could simply edit the
>> >> >> $FLINK_DIST/conf/workers, in which each line represents a TM host. By
>> >> >> default, there is only one TM in localhost. In your case, you could
>> >> >> add a line 'localhost' to it. Then, execute the
>> >> >> $FLINK_DIST/bin/start-cluster.sh, you could see a standalone cluster
>> >> >> with two TM in your local machine.
>> >> >>
>> >> >> [1] 
>> >> >> https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/cluster_setup.html#configuring-flink
>> >> >>
>> >> >> Best,
>> >> >> Yangze Guo
>> >> >>
>> >> >> On Thu, Sep 17, 2020 at 3:16 PM saksham sapra 
>> >> >>  wrote:
>> >> >> >
>> >> >> > Hi ,
>> >> >> >
>> >> >> > I am unable to set two task managers in my local machine and neither 
>> >> >> > any documentation provided for the same.
>> >> >> >
>> >> >> > I want to run a parallel job in two task managers using flink.
>> >> >> > kindly help me with the same, how can i set up in my local without 
>> >> >> > using any zookeeper or something.
>> >> >> >
>> >> >> >
>> >> >> > Thanks & Regards,
>> >> >> > Saksham Sapra
>> >> >> >
>> >> >> >


Re: Flink multiple task managers setup

2020-09-21 Thread Yangze Guo
Hi,

As the error message says, it could not find flink-dist.jar in
"/cygdrive/d/Apacheflink/dist/apache-flink-1.9.3/deps/lib". Where is
your Flink distribution, and did you change its directory structure?

Best,
Yangze Guo

On Mon, Sep 21, 2020 at 5:31 PM saksham sapra  wrote:
>
> HI,
>
>  i installed cygdrive and tried to run start-cluster.sh where zookeeper is up 
> and running and defined one job manager and one task manager,
> but getting this issue.
>
> $ start-cluster.sh start
> Starting HA cluster with 1 masters.
> -zTheFIND: Invalid switch
>  system cannot find the file specified.
> [ERROR] Flink distribution jar not found in 
> /cygdrive/d/Apacheflink/dist/apache-flink-1.9.3/deps/lib.
> File not found - 
> D:\Apacheflink\dist\apache-flink-1.9.3\deps/log/flink--standalo
> nesession-7-PLRENT-5LC73H2*
> Starting standalonesession daemon on host PLRENT-5LC73H2.
> -zThe system cannot FIND: Invalid switch
> find the file specified.
> [ERROR] Flink distribution jar not found in 
> /cygdrive/d/Apacheflink/dist/apache-flink-1.9.3/deps/lib.
> File not found - 
> D:\Apacheflink\dist\apache-flink-1.9.3\deps/log/flink--taskexec
> utor-7-PLRENT-5LC73H2*
> Starting taskexecutor daemon on host PLRENT-5LC73H2.
>>>
>>>


[DISCUSS] Adding e2e tests for Flink's Mesos integration

2019-12-06 Thread Yangze Guo
Hi, all,

Currently, there is no end-to-end test or IT case for Mesos deployment,
while common deployment-related development would inevitably touch
the logic of this component. Thus, some work needs to be done to
guarantee the experience for both Mesos users and contributors. After
an offline discussion with Till and Xintong, we have some basic ideas
and would like to start a discussion thread on adding end-to-end tests
for Flink's Mesos integration.

As a first step, we would like to keep the scope of this contribution
to be relative small. This may also help us to quickly get some basic
test cases that might be helpful for the upcoming 1.10 release.

As far as we can think of, what needs to be done is to set up a Mesos
framework during the testing and determine which tests need to be
included.


** Regarding the Mesos framework, after trying out several approaches,
I find that setting up Mesos in Docker is probably what we want. The
resources needed for building and setting up Mesos from source are
probably not affordable in most scenarios. So, the one open question
that is worth discussing is the choice of Docker image. We have come up
with two options.

- Using official Mesos image[1]
The official image was the first alternative that came to our mind,
but we ran into some sort of Java version compatibility problem that
led to failures when launching task executors. Flink supports Java 9
since version 1.9.0 [2]. However, the official Docker image of Mesos
is built with a development version of JDK 9, which probably has
caused this problem. Unless we want to make Flink also compatible
with the JDK development version used by the official Mesos image,
this option does not work out. Besides, according to the official
roadmap[5], Java 9 is not a long-term support version, which may bring
stability risks in the future.

- Build a custom image
I've already tried building a custom image[3] and successfully ran most
of the existing end-to-end test cases with it. The image is built
with Ubuntu 16.04, JDK 8 and Mesos 1.7.1. For the Mesos e2e test
framework, we could either build the image from a Dockerfile or pull
the pre-built image from DockerHub (or other hub services) during the
testing.
If we decide to publish an image on DockerHub, we probably need an
official Flink repository/account to hold it.


** Regarding the test coverage, we think the following three tests
could be a good starting point that covers a very essential set of
behaviors for Mesos deployment.
- Wordcount end-to-end test. For verifying the basic process of Mesos
deployment.
- Multiple submissions of the same job. For preventing resource
management problems on Mesos, such as [4]
- State TTL RocksDb backend end-to-end test. For verifying memory
configuration behaviors, since Mesos has it’s own config options and
logics.

Unfortunately, none of us who participated in the initial offline
discussion has much experience running Flink on Mesos in production.
It would be good if users and experts who actually use Flink on Mesos
could join the discussion and provide some feedback. Any feedback,
idea, suggestion, concern and question will be welcomed and
appreciated.


BTW, we would like to raise a survey on the usages of Flink on Mesos
in the community. For the Flink on Mesos users, we would like to
learn:
- Which version of Mesos do you use and what setups (such as Marathon)
do you need for Mesos
- Is it the Flink job cluster or the session cluster that is mainly used
- What is the scale of the Flink / Mesos cluster


[1]https://hub.docker.com/r/mesosphere/mesos
[2]https://issues.apache.org/jira/browse/FLINK-11307
[3]https://hub.docker.com/repository/docker/karmagyz/mesos-flink
[4]https://issues.apache.org/jira/browse/FLINK-14074
[5]https://www.oracle.com/technetwork/java/java-se-support-roadmap.html


Best,
Yangze Guo


Re: [DISCUSS] Adding e2e tests for Flink's Mesos integration

2019-12-07 Thread Yangze Guo
Thanks for your feedback!

@Till
Regarding the time overhead, I think it mainly comes from the network
transmission. Building the image locally will download about 260MB of
files in total, including the base image and packages. For pulling from
DockerHub, the compressed size of the image is 347MB. Thus, I agree
that it is ok to build the image locally.

@Piyush
Thank you for offering the help and sharing your usage scenario. At
the current stage, it would be really helpful if you could compress
the custom image[1] or reduce the time overhead of building it locally.
Any ideas for improving test coverage will also be appreciated.

[1]https://hub.docker.com/layers/karmagyz/mesos-flink/latest/images/sha256-4e1caefea107818aa11374d6ac8a6e889922c81806f5cd791ead141f18ec7e64

Best,
Yangze Guo

On Sat, Dec 7, 2019 at 3:17 AM Piyush Narang  wrote:
>
> +1 from our end as well. At Criteo, we are running some Flink jobs on Mesos 
> in production to compute short term features for machine learning. We’d love 
> to help out and contribute on this initiative.
>
> Thanks,
> -- Piyush
>
>
> From: Till Rohrmann 
> Date: Friday, December 6, 2019 at 8:10 AM
> To: dev 
> Cc: user 
> Subject: Re: [DISCUSS] Adding e2e tests for Flink's Mesos integration
>
> Big +1 for adding a fully working e2e test for Flink's Mesos integration. 
> Ideally we would have it ready for the 1.10 release. The lack of such a test 
> has bitten us already multiple times.
>
> In general I would prefer to use the official image if possible since it 
> frees us from maintaining our own custom image. Since Java 9 is no longer 
> officially supported as we opted for supporting Java 11 (LTS) it might not be 
> feasible, though. How much longer would building the custom image vs. 
> downloading the custom image from DockerHub be? Maybe it is ok to build the 
> image locally. Then we would not have to maintain the image.
>
> Cheers,
> Till
>
> On Fri, Dec 6, 2019 at 11:05 AM Yangze Guo 
> mailto:karma...@gmail.com>> wrote:
> Hi, all,
>
> Currently, there is no end to end test or IT case for Mesos deployment
> while the common deployment related developing would inevitably touch
> the logic of this component. Thus, some work needs to be done to
> guarantee experience for both Meos users and contributors. After
> offline discussion with Till and Xintong, we have some basic ideas and
> would like to start a discussion thread on adding end to end tests for
> Flink's Mesos integration.
>
> As a first step, we would like to keep the scope of this contribution
> to be relative small. This may also help us to quickly get some basic
> test cases that might be helpful for the upcoming 1.10 release.
>
> As far as we can think of, what needs to be done is to setup a Mesos
> framework during the testing and determine which tests need to be
> included.
>
>
> ** Regarding the Mesos framework, after trying out several approaches,
> I find that setting up Mesos in docker is probably what we want. The
> resources needed for building and setting up Mesos from source is
> probably not affordable in most of the scenarios. So, the one open
> question that worth discussion is the choice of Docker image. We have
> come up with two options.
>
> - Using official Mesos image[1]
> The official image was the first alternative that come to our mind,
> but we run into some sort of Java version compatibility problem that
> leads to failures of launching task executors. Flink supports Java 9
> since version 1.9.0 [2], However, the official Docker image of Mesos
> is built with a development version of JDK 9, which probably has
> caused this problem. Unless we want to make Flink to also be
> compatible with the JDK development version used by the official mesos
> image, this option does not work out. Besides, according to the
> official roadmap[5], Java 9 is not a long-term support version, which
> may bring stability risk in future.
>
> - Build a custom image
> I've already tried build a custom image[3] and successfully run most
> of the existing end to end tests cases with it. The image is built
> with Ubuntu 16.04, JDK 8 and Mesos 1.7.1. For the mesos e2e test
> framework, we could either build the image from a Docker file or pull
> the pre-built image from DockerHub (or other hub services) during the
> testing.
> If we decide to publish the an image on DockerHub, we probably need a
> Flink official  repository/account to hold it.
>
>
> ** Regarding the test coverage, we think the following three tests
> could be a good starting point that covers a very essential set of
> behaviors for Mesos deployment.
> - Wordcount end-to-end test. For verifying the basic process of Mesos
> deployment.
> - Multiple submissions of 

Re: [DISCUSS] Adding e2e tests for Flink's Mesos integration

2019-12-10 Thread Yangze Guo
Thanks for the feedback, Yang.

Some updates I want to share in this thread.
I have built a PoC version of the Mesos e2e test with a WordCount
workflow[1]. Then, I ran it in the testing environment. As the results
shown here[2]:
- Pulling the image from DockerHub took 1 minute and 21 seconds.
- Building it locally took 2 minutes and 54 seconds.

I prefer building it locally. Although it is slower, I think the time
overhead of building or pulling the image, compared to the cost of
maintaining the image on DockerHub and the whole test process, is
trivial.

I look forward to hearing from you. ;)


[1]https://github.com/KarmaGYZ/flink/commit/0406d942446a1b17f81d93235b21a829bf88ccf0
[2]https://travis-ci.org/KarmaGYZ/flink/jobs/623207957
Best,
Yangze Guo

On Mon, Dec 9, 2019 at 2:39 PM Yang Wang  wrote:
>
> Thanks Yangze for starting this discussion.
>
> Just share my thoughts.
>
> If the mesos official docker image could not meet our requirement, i suggest 
> to build the image locally.
> We have done the same things for yarn e2e tests. This way is more flexible 
> and easy to maintain. However,
> i have no idea how long building the mesos image locally will take. Based on 
> previous experience of yarn, i
> think it may not take too much time.
>
>
>
> Best,
> Yang
>
> Yangze Guo  于2019年12月7日周六 下午4:25写道:
>>
>> Thanks for your feedback!
>>
>> @Till
>> Regarding the time overhead, I think it mainly come from the network
>> transmission. For building the image locally, it will totally download
>> 260MB files including the base image and packages. For pulling from
>> DockerHub, the compressed size of the image is 347MB. Thus, I agree
>> that it is ok to build the image locally.
>>
>> @Piyush
>> Thank you for offering the help and sharing your usage scenario. In
>> current stage, I think it will be really helpful if you can compress
>> the custom image[1] or reduce the time overhead to build it locally.
>> Any ideas for improving test coverage will also be appreciated.
>>
>> [1]https://hub.docker.com/layers/karmagyz/mesos-flink/latest/images/sha256-4e1caefea107818aa11374d6ac8a6e889922c81806f5cd791ead141f18ec7e64
>>
>> Best,
>> Yangze Guo
>>
>> On Sat, Dec 7, 2019 at 3:17 AM Piyush Narang  wrote:
>> >
>> > +1 from our end as well. At Criteo, we are running some Flink jobs on 
>> > Mesos in production to compute short term features for machine learning. 
>> > We’d love to help out and contribute on this initiative.
>> >
>> > Thanks,
>> > -- Piyush
>> >
>> >
>> > From: Till Rohrmann 
>> > Date: Friday, December 6, 2019 at 8:10 AM
>> > To: dev 
>> > Cc: user 
>> > Subject: Re: [DISCUSS] Adding e2e tests for Flink's Mesos integration
>> >
>> > Big +1 for adding a fully working e2e test for Flink's Mesos integration. 
>> > Ideally we would have it ready for the 1.10 release. The lack of such a 
>> > test has bitten us already multiple times.
>> >
>> > In general I would prefer to use the official image if possible since it 
>> > frees us from maintaining our own custom image. Since Java 9 is no longer 
>> > officially supported as we opted for supporting Java 11 (LTS) it might not 
>> > be feasible, though. How much longer would building the custom image vs. 
>> > downloading the custom image from DockerHub be? Maybe it is ok to build 
>> > the image locally. Then we would not have to maintain the image.
>> >
>> > Cheers,
>> > Till
>> >
>> > On Fri, Dec 6, 2019 at 11:05 AM Yangze Guo 
>> > mailto:karma...@gmail.com>> wrote:
>> > Hi, all,
>> >
>> > Currently, there is no end to end test or IT case for Mesos deployment
>> > while the common deployment related developing would inevitably touch
>> > the logic of this component. Thus, some work needs to be done to
>> > guarantee experience for both Meos users and contributors. After
>> > offline discussion with Till and Xintong, we have some basic ideas and
>> > would like to start a discussion thread on adding end to end tests for
>> > Flink's Mesos integration.
>> >
>> > As a first step, we would like to keep the scope of this contribution
>> > to be relative small. This may also help us to quickly get some basic
>> > test cases that might be helpful for the upcoming 1.10 release.
>> >
>> > As far as we can think of, what needs to be done is to setup a Mesos
>> > framework during the testing and determine which tests need to be

Re: Running Flink on java 11

2020-01-10 Thread Yangze Guo
Hi Krzysztof

All the tests run with Java 11 after FLINK-13457[1]. Its fix version
is set to 1.10, so I fear 1.9.1 is not guaranteed to run on Java 11. I
suggest you wait for the 1.10 release.

[1]https://issues.apache.org/jira/browse/FLINK-13457

Best,
Yangze Guo

On Fri, Jan 10, 2020 at 5:13 PM Krzysztof Chmielewski
 wrote:
>
> Hi,
> Thank you for your answer. Btw it seams that you send the replay only to my 
> address and not to the mailing list :)
>
> I'm looking forward to try out 1.10-rc then.
>
> Regarding second thing you wrote, that
> "on Java 11, all the tests(including end to end tests) would be run with Java 
> 11 profile now."
> I'm not sure if I get that fully. You meant that currently Flink builds are 
> running on java 11?
>
> I was not rebuilding Flink 1.9.1 sources with JDK 11. I just ran 1.9.1 build 
> on JRE 11 locally on my machine and
> I also modify Job Cluster Dockerfile to use openjdk:13-jdk-alpine as a base 
> image instead openjdk:8-jre-alpine.
>
> Are here any users who are currently running Flink on Java 11 or higher? What 
> are your experiences?
>
> Thanks,
>
> pt., 10 sty 2020 o 03:14 Yangze Guo  napisał(a):
>>
>> Hi, Krzysztof
>>
>> Regarding the release-1.10, the community is now focus on this effort.
>> I believe we will have our first release candidate soon.
>>
>> Regarding the issue when running Flink on Java 11, all the
>> tests(including end to end tests) would be run with Java 11 profile
>> now. If you meet any problem, feel free to open a new JIRA ticket or
>> ask in user/dev ML.
>>
>> Best,
>> Yangze Guo
>>
>> On Fri, Jan 10, 2020 at 1:11 AM KristoffSC
>>  wrote:
>> >
>> > Hi guys,
>> > well We have requirement in our project to use Java 11, although we would
>> > really like to use Flink because it seems to match our needs perfectly.
>> >
>> > We were testing it on java 1.8 and all looks fine.
>> > We tried to run it on Java 11 and also looks fine, at least for now.
>> >
>> > We were also running this as a Job Cluster, and since those images [1] are
>> > based on openjdk:8-jre-alpine we switch to java 13-jdk-alpine. Cluster
>> > started and submitted the job. All seemed fine.
>> >
>> > The Job and 3rd party library that this job is using were compiled with 
>> > Java
>> > 11.
>> > I was looking for any posts related to java 11 issues and I've found this
>> > [2] one.
>> > We are also aware of ongoing FLINK-10725 [3] but this is assigned to 1.10
>> > FLink version
>> >
>> > Having all of this, I would like to ask few questions
>> >
>> > 1. Is there any release date planed for 1.10?
>> > 2. Are you aware of any issues regarding running Flink on Java 11?
>> > 3. If my Job code would not use any code features from java 11, would flink
>> > handle it when running on java 11? Or they are some internal 
>> > functionalities
>> > that would not be working on Java 11 (things that are using unsafe or
>> > reflections?)
>> >
>> > Thanks,
>> > Krzysztof
>> >
>> >
>> > [1]
>> > https://github.com/apache/flink/blob/release-1.9/flink-container/docker/README.md
>> > [2]
>> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/UnsupportedOperationException-from-org-apache-flink-shaded-asm6-org-objectweb-asm-ClassVisitor-visit1-td28571.html
>> > [3] https://issues.apache.org/jira/browse/FLINK-10725
>> >
>> >
>> >
>> > --
>> > Sent from: 
>> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Flink Job cluster scalability

2020-01-10 Thread Yangze Guo
Hi KristoffSC

As Zhu said, Flink enables slot sharing[1] by default. This feature has
nothing to do with the resources of your cluster. The benefit of this
feature is described in [1] as well. I mean, it will not detect how many
slots are in your cluster and adjust its behavior to that number. If
you want to make the best use of your cluster, you can increase the
parallelism of the vertex that has the largest parallelism, or
"disable" slot sharing as described in [2]. IMO, the first way matches
your purpose (see the sketch below).
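
For illustration, a minimal sketch of the first option (the parallelism
value and the source are placeholders):

```
import org.apache.flink.streaming.api.scala._

object UseMoreSlotsSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // With slot sharing on (the default), the job needs roughly as many
    // slots as its highest operator parallelism, so raising that
    // parallelism spreads the work over more of the available slots.
    env.socketTextStream("localhost", 9999)   // placeholder source
      .map(_.toUpperCase)
      .setParallelism(12)                     // hypothetical slot count
      .print()

    env.execute("use-more-slots-sketch")
  }
}
```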

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.9/concepts/runtime.html#task-slots-and-resources
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/#task-chaining-and-resource-groups

Best,
Yangze Guo

On Fri, Jan 10, 2020 at 6:49 PM KristoffSC
 wrote:
>
> Hi Zhu Zhu,
> well In my last test I did not change the job config, so I did not change
> the parallelism level of any operator and I did not change policy regarding
> slot sharing (it stays as default one). Operator Chaining is set to true
> without any extra actions like "start new chain, disable chain etc"
>
> What I assume however is that Flink will try find most efficient way to use
> available resources during job submission.
>
> In the first case, where I had only 6 task managers (which matches max
> parallelism of my JobVertex), Flink reused some TaskSlots. Adding extra task
> slots did was not effective because reason described by David. This is
> understandable.
>
> However, I was assuming that if I submit my job on a cluster that have more
> task managers than 6, Flink will not share task slots by default. That did
> not happen. Flink deployed the job in the same way regardless of extra
> resources.
>
>
> So the conclusion is that simple job resubmitting will not work in this case
> and actually I cant have any certainty that it will. Since in my case Flink
> still reuses slot task.
>
> If this would be the production case, I would have to do a test job
> submission on testing env and potentially change the job. Not the config,
> but adding  slot sharing groups etc.
> So if this would be the production case I will not be able to react fast, I
> would have to deploy new version of my app/job which could be problematic.
>
>
>
>
> --
> Sent from: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: [Question] Failed to submit flink job to secure yarn cluster

2020-01-10 Thread Yangze Guo
Hi, Ethan

You could first check your cluster setup following this guide[1] and
check that all the related config options[2] are set correctly.
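
For illustration, a sketch of the related options in flink-conf.yaml
(the keytab path and principal are placeholders):

```
security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /path/to/userA.keytab
security.kerberos.login.principal: userA@REALM
security.kerberos.login.contexts: Client
```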

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/security-kerberos.html
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/config.html#security-kerberos-login-contexts

Best,
Yangze Guo

On Fri, Jan 10, 2020 at 10:37 AM Ethan Li  wrote:
>
> Hello
>
> I was following  
> https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/yarn_setup.html#run-a-flink-job-on-yarn
>  and trying to submit a flink job on yarn.
>
> I downloaded flink-1.9.1 and pre-bundled Hadoop 2.8.3 from 
> https://flink.apache.org/downloads.html#apache-flink-191. I used default 
> configs except:
>
> security.kerberos.login.keytab: userA.keytab
> security.kerberos.login.principal: userA@REALM
>
>
> I have a secure Yarn cluster set up already. Then I ran “ ./bin/flink run -m 
> yarn-cluster -p 1 -yjm 1024m -ytm 1024m ./examples/streaming/WordCount.jar” 
> and got the following errors:
>
>
> org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't 
> deploy Yarn session cluster
> at 
> org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:385)
> at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:251)
> at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
> at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010)
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
> at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit 
> application_1578605412668_0005 to YARN : Failed to renew token: Kind: kms-dt, 
> Service: host3.com:3456, Ident: (owner=userA, renewer=adminB, realUser=, 
> issueDate=1578606224956, maxDate=1579211024956, sequenceNumber=32, 
> masterKeyId=52)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:275)
> at 
> org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster(AbstractYarnClusterDescriptor.java:1004)
> at 
> org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:507)
> at 
> org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:378)
> ... 9 more
>
>
> Full client 
> log:https://gist.github.com/Ethanlm/221284bcaa272270a799957dc05b94fd
> Resource manager log: 
> https://gist.github.com/Ethanlm/ecd0a3eb25582ad6b1552927fc0e5c47
> Hostname, IP address, username and etc. are anonymized.
>
>
> Not sure how to proceed further. Wondering if anyone in the community has 
> encountered this before. Thank you very much for your time!
>
> Best,
> Ethan
>


Re: Taskmanager fails to connect to Jobmanager [Could not find any IPv4 address that is not loopback or link-local. Using localhost address.]

2020-01-17 Thread Yangze Guo
Hi, Harshith

As a supplementary note to Yang, the issue seems to be that something
went wrong when trying to connect to the ResourceManager.
There are two possibilities: the ResourceManager leader has not written
the znode, or the TaskExecutor fails to connect to it. Turning on the
DEBUG log will help a lot. Also, you could watch the content of the
znode "/leader/resource_manager_lock" in ZooKeeper.
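
For illustration, one way to check that znode with the ZooKeeper CLI (a
sketch; the path prefix assumes the default
high-availability.zookeeper.path.root of /flink and the default
cluster-id):

```
# run from the ZooKeeper installation
./bin/zkCli.sh -server <zookeeper-host>:2181 \
    get /flink/default/leader/resource_manager_lock
```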

Best,
Yangze Guo

On Fri, Jan 17, 2020 at 5:11 PM Yang Wang  wrote:
>
> Hi Kumar Bolar, Harshith,
>
> Could you please check the jobmanager log to find out what address the akka 
> is listening?
> Also the address could be used to connected to the jobmanager on the 
> taskmanger machine.
>
> BTW, if you could share the debug level logs of jobmanger and taskmanger. It 
> will help a lot to find
> the root cause.
>
>
> Best,
> Yang
>
> Kumar Bolar, Harshith  于2020年1月16日周四 下午7:10写道:
>>
>> Hi all,
>>
>>
>>
>> We were previously using RHEL for our Flink machines. I'm currently working 
>> on moving them over to Ubuntu. When I start the task manager, it fails to 
>> connect to the job manager with the following message -
>>
>>
>>
>> 2020-01-16 10:54:42,777 INFO  
>> org.apache.flink.runtime.util.LeaderRetrievalUtils- Trying to 
>> select the network interface and address to use by connecting to the leading 
>> JobManager.
>>
>> 2020-01-16 10:54:42,778 INFO  
>> org.apache.flink.runtime.util.LeaderRetrievalUtils- TaskManager 
>> will try to connect for 1 milliseconds before falling back to heuristics
>>
>> 2020-01-16 10:54:52,780 WARN  org.apache.flink.runtime.net.ConnectionUtils   
>>- Could not find any IPv4 address that is not loopback or 
>> link-local. Using localhost address.
>>
>>
>>
>> The network interface on the machine looks like this -
>>
>>
>>
>>
>>
>> ens5: flags=4163  mtu 9001
>>
>> inet 10.16.75.30  netmask 255.255.255.128  broadcast 10.16.75.127
>>
>> ether 02:f1:8b:34:75:51  txqueuelen 1000  (Ethernet)
>>
>> RX packets 69370  bytes 80369110 (80.3 MB)
>>
>> RX errors 0  dropped 0  overruns 0  frame 0
>>
>> TX packets 28787  bytes 2898540 (2.8 MB)
>>
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>>
>>
>> lo: flags=73  mtu 65536
>>
>> inet 127.0.0.1  netmask 255.0.0.0
>>
>> loop  txqueuelen 1000  (Local Loopback)
>>
>> RX packets 9562  bytes 1596138 (1.5 MB)
>>
>> RX errors 0  dropped 0  overruns 0  frame 0
>>
>> TX packets 9562  bytes 1596138 (1.5 MB)
>>
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>>
>>
>>
>>
>> Note: On RHEL, the primary network interface was eth0. Could this be the 
>> issue?
>>
>>
>>
>> Here's the full task manager log - https://paste.ubuntu.com/p/vgh96FHzRq/
>>
>>
>>
>> Thanks
>>
>> Harshith


Re: [ANNOUNCE] Apache Flink 1.10.0 released

2020-02-12 Thread Yangze Guo
Thanks, Gary & Yu. Congrats to everyone involved!

Best,
Yangze Guo

On Thu, Feb 13, 2020 at 9:23 AM Jingsong Li  wrote:
>
> Congratulations! Great work.
>
> Best,
> Jingsong Lee
>
> On Wed, Feb 12, 2020 at 11:05 PM Leonard Xu  wrote:
>>
>> Great news!
>> Thanks everyone involved !
>> Thanks Gary and Yu for being the release manager !
>>
>> Best,
>> Leonard Xu
>>
>> 在 2020年2月12日,23:02,Stephan Ewen  写道:
>>
>> Congrats to us all.
>>
>> A big piece of work, nicely done.
>>
>> Let's hope that this helps our users make their existing use cases easier 
>> and also opens up new use cases.
>>
>> On Wed, Feb 12, 2020 at 3:31 PM 张光辉  wrote:
>>>
>>> Greet work.
>>>
>>> Congxian Qiu  于2020年2月12日周三 下午10:11写道:
>>>>
>>>> Great work.
>>>> Thanks everyone involved.
>>>> Thanks Gary and Yu for being the release manager
>>>>
>>>>
>>>> Best,
>>>> Congxian
>>>>
>>>>
>>>> Jark Wu  于2020年2月12日周三 下午9:46写道:
>>>>>
>>>>> Congratulations to everyone involved!
>>>>> Great thanks to Yu & Gary for being the release manager!
>>>>>
>>>>> Best,
>>>>> Jark
>>>>>
>>>>> On Wed, 12 Feb 2020 at 21:42, Zhu Zhu  wrote:
>>>>>>
>>>>>> Cheers!
>>>>>> Thanks Gary and Yu for the great job as release managers.
>>>>>> And thanks to everyone whose contribution makes the release possible!
>>>>>>
>>>>>> Thanks,
>>>>>> Zhu Zhu
>>>>>>
>>>>>> Wyatt Chun  于2020年2月12日周三 下午9:36写道:
>>>>>>>
>>>>>>> Sounds great. Congrats & Thanks!
>>>>>>>
>>>>>>> On Wed, Feb 12, 2020 at 9:31 PM Yu Li  wrote:
>>>>>>>>
>>>>>>>> The Apache Flink community is very happy to announce the release of 
>>>>>>>> Apache Flink 1.10.0, which is the latest major release.
>>>>>>>>
>>>>>>>> Apache Flink® is an open-source stream processing framework for 
>>>>>>>> distributed, high-performing, always-available, and accurate data 
>>>>>>>> streaming applications.
>>>>>>>>
>>>>>>>> The release is available for download at:
>>>>>>>> https://flink.apache.org/downloads.html
>>>>>>>>
>>>>>>>> Please check out the release blog post for an overview of the 
>>>>>>>> improvements for this new major release:
>>>>>>>> https://flink.apache.org/news/2020/02/11/release-1.10.0.html
>>>>>>>>
>>>>>>>> The full release notes are available in Jira:
>>>>>>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12345845
>>>>>>>>
>>>>>>>> We would like to thank all contributors of the Apache Flink community 
>>>>>>>> who made this release possible!
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Gary & Yu
>>
>>
>
>
> --
> Best, Jingsong Lee


Re: [DISCUSS] FLIP-111: Docker image unification

2020-03-10 Thread Yangze Guo
Thanks for the reply, Andrey.

Regarding building from local dist:
- Yes, I bring this up mostly for development purposes. Since k8s is
popular, I believe more and more developers would like to test their
work on a k8s cluster. I'm not sure whether all developers should write
a custom Dockerfile themselves in this scenario. Thus, I still prefer
to provide a script for devs.
- I agree to keep the scope of this FLIP mostly to those normal
users. But as far as I can see, supporting building from local dist
would not take much extra effort.
- The Maven Docker plugin sounds good. I'll take a look at it.

Regarding supporting Java 11:
- Not sure if it is necessary to ship Java. Maybe we could just change
the base image from openjdk:8-jre to openjdk:11-jre in the template
Dockerfile[1] (see the sketch below). Correct me if I understand
incorrectly. Also, I agree with moving this out of the scope of this
FLIP if it indeed takes much extra effort.
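
For illustration, a minimal sketch of that change in
Dockerfile-debian.template (assuming no other Java-8-specific steps in
the image):

```
# before: FROM openjdk:8-jre
FROM openjdk:11-jre
```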

Regarding the custom configuration, the mechanism that Thomas mentioned LGTM.

[1] 
https://github.com/apache/flink-docker/blob/master/Dockerfile-debian.template

Best,
Yangze Guo

On Wed, Mar 11, 2020 at 5:52 AM Thomas Weise  wrote:
>
> Thanks for working on improvements to the Flink Docker container images. This 
> will be important as more and more users are looking to adopt Kubernetes and 
> other deployment tooling that relies on Docker images.
>
> A generic, dynamic configuration mechanism based on environment variables is 
> essential and it is already supported via envsubst and an environment 
> variable that can supply a configuration fragment:
>
> https://github.com/apache/flink-docker/blob/09adf2dcd99abfb6180e1e2b5b917b288e0c01f6/docker-entrypoint.sh#L88
> https://github.com/apache/flink-docker/blob/09adf2dcd99abfb6180e1e2b5b917b288e0c01f6/docker-entrypoint.sh#L85
>
> This gives the necessary control for infrastructure use cases that aim to 
> supply deployment tooling other users. An example in this category this is 
> the FlinkK8sOperator:
>
> https://github.com/lyft/flinkk8soperator/tree/master/examples/wordcount
>
> On the flip side, attempting to support a fixed subset of configuration 
> options is brittle and will probably lead to compatibility issues down the 
> road:
>
> https://github.com/apache/flink-docker/blob/09adf2dcd99abfb6180e1e2b5b917b288e0c01f6/docker-entrypoint.sh#L97
>
> Besides the configuration, it may be worthwhile to see in which other ways 
> the base Docker images can provide more flexibility to incentivize wider 
> adoption.
>
> I would second that it is desirable to support Java 11 and in general use a 
> base image that allows the (straightforward) use of more recent versions of 
> other software (Python etc.)
>
> https://github.com/apache/flink-docker/blob/d3416e720377e9b4c07a2d0f4591965264ac74c5/Dockerfile-debian.template#L19
>
> Thanks,
> Thomas
>
> On Tue, Mar 10, 2020 at 12:26 PM Andrey Zagrebin  wrote:
>>
>> Hi All,
>>
>> Thanks a lot for the feedback!
>>
>> *@Yangze Guo*
>>
>> - Regarding the flink_docker_utils#install_flink function, I think it
>> > should also support build from local dist and build from a
>> > user-defined archive.
>>
>> I suppose you bring this up mostly for development purpose or powerful
>> users.
>> Most of normal users are usually interested in mainstream released versions
>> of Flink.
>> Although, you are bring a valid concern, my idea was to keep scope of this
>> FLIP mostly for those normal users.
>> The powerful users are usually capable to design a completely
>> custom Dockerfile themselves.
>> At the moment, we already have custom Dockerfiles e.g. for tests in
>> flink-end-to-end-tests/test-scripts/docker-hadoop-secure-cluster/Dockerfile.
>> We can add something similar for development purposes and maybe introduce a
>> special maven goal. There is a maven docker plugin, afaik.
>> I will add this to FLIP as next step.
>>
>> - It seems that the install_shaded_hadoop could be an option of
>> > install_flink
>>
>> I woud rather think about this as a separate independent optional step.
>>
>> - Should we support JAVA 11? Currently, most of the docker file based on
>> > JAVA 8.
>>
>> Indeed, it is a valid concern. Java version is a fundamental property of
>> the docker image.
>> To customise this in the current mainstream image is difficult, this would
>> require to ship it w/o Java at all.
>> Or this is a separate discussion whether we want to distribute docker hub
>> images with different Java versions or just bump it to Java 11.
>> This should be easy in a custom Dockerfile for development purposes though
>> as mentioned before.
>>
>> - I do n

Re: Flink 1.10 container memory configuration with Mesos.

2020-03-11 Thread Yangze Guo
Hi, Alexander

I could not reproduce it in my local environment. Normally, Mesos RM
will calculate all the mem config and add it to the launch command.
Unfortunately, all the logging I could find for this command is at the
DEBUG level. Would you mind changing the log level to DEBUG or sharing
anything about the taskmanager launch command you can find in the
current log?
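
In case it helps, with the default logging setup shipped in the Flink 1.10
distribution (an assumption; adjust if you use a custom logging configuration),
the level can be raised by changing the root logger in conf/log4j.properties:

log4j.rootLogger=DEBUG, file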


Best,
Yangze Guo

On Thu, Mar 12, 2020 at 1:38 PM Alexander Kasyanenko
 wrote:
>
> Hi folks,
>
> I have a question related to the configuration of the new memory model introduced in Flink 
> 1.10. Has anyone encountered a similar problem?
> I'm trying to make use of the taskmanager.memory.process.size configuration key 
> in combination with a Mesos session cluster, but I get an error like this:
>
> 2020-03-11 11:44:09,771 [main] ERROR 
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner - Error while 
> starting the TaskManager
> org.apache.flink.configuration.IllegalConfigurationException: Failed to 
> create TaskExecutorResourceSpec
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:72)
> at 
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
> at 
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.(TaskManagerRunner.java:152)
> at 
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:308)
> at 
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.lambda$main$0(MesosTaskExecutorRunner.java:106)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
> at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
> at 
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner.main(MesosTaskExecutorRunner.java:105)
> Caused by: org.apache.flink.configuration.IllegalConfigurationException: The 
> required configuration option Key: 'taskmanager.memory.task.heap.size' , 
> default: null (fallback keys: []) is not set
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkConfigOptionIsSet(TaskExecutorResourceUtils.java:90)
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.lambda$checkTaskExecutorResourceConfigSet$0(TaskExecutorResourceUtils.java:84)
> at java.base/java.util.Arrays$ArrayList.forEach(Arrays.java:4390)
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkTaskExecutorResourceConfigSet(TaskExecutorResourceUtils.java:84)
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:70)
> ... 9 more
>
> But when the task manager is launched, it correctly parses the process memory key:
>
> 2020-03-11 11:43:55,376 [main] INFO  
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner - 
> 
> 2020-03-11 11:43:55,377 [main] INFO  
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner -  Starting 
> MesosTaskExecutorRunner (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 
> 19:18:19 CET)
> 2020-03-11 11:43:55,377 [main] INFO  
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner -  OS current 
> user: root
> 2020-03-11 11:43:57,347 [main] WARN  org.apache.hadoop.util.NativeCodeLoader  
>  - Unable to load native-hadoop library for your 
> platform... using builtin-java classes where applicable
> 2020-03-11 11:43:57,535 [main] INFO  
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner -  JVM: OpenJDK 
> 64-Bit Server VM - AdoptOpenJDK - 11/11.0.2+9
> 2020-03-11 11:43:57,535 [main] INFO  
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner -  Maximum heap 
> size: 746 MiBytes
> 2020-03-11 11:43:57,535 [main] INFO  
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner -  JAVA_HOME: 
> (not set)
> 2020-03-11 11:43:57,539 [main] INFO  
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner -  Hadoop 
> version: 2.6.5
> 2020-03-11 11:43:57,539 [main] INFO  
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner -  JVM Options:
> 2020-03-11 11:43:57,539 [main] INFO  
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner - 
> -Xmx781818251
> 2020-03-11 11:43:57,539 [main] INFO  
> org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner - 
> -Xms781818251
> 2020-03-11 11:43:57,540 [main] INFO  
> org.apache.flink.mesos.en

Re: Flink 1.10 container memory configuration with Mesos.

2020-03-12 Thread Yangze Guo
Glad to hear that your issue is fixed.
I'm not sure what you are suggesting to add. Could you describe it more specifically
or create a Jira ticket?

Best,
Yangze Guo


On Thu, Mar 12, 2020 at 3:51 PM Alexander Kasyanenko
 wrote:
>
> Hi Yangze, Xintong,
>
> Thank you for instant response.
>
> And big thanks for the hint on the TM launch command. It was indeed the problem. 
> I've added my own custom mesos-taskmanager.sh to echo the launch command 
> (I've switched to DEBUG level on logging, but it didn't really display 
> anything useful). May I suggest adding something like this in future 
> releases?
>
> As for my particular case, the issue was in mesos-appmaster.sh option:
>
> -Dmesos.resourcemanager.tasks.taskmanager-cmd="/opt/job/custom_launch_tm.sh"
>
> My custom launch script was slicing argument array incorrectly.
>
> Thanks for the help and regards,
> Alex.
>
> чт, 12 мар. 2020 г. в 15:46, Xintong Song :
>>
>> Hi Alex,
>>
>> Could you try to check and post your TM launch command? I suspect that there 
>> might be some unrecognized arguments that prevent the rest of the arguments 
>> from being parsed.
>>
>> The TM memory configuration process works as follows:
>>
>> 1. The resource manager will parse the configurations, checking which options 
>> are configured and which are not, and calculate the size of each memory 
>> component. (This is where 'taskmanager.memory.process.size' is used.)
>> 2. After deriving the memory component sizes, the resource manager will 
>> generate the launch command for the task managers, with dynamic configurations 
>> "-D<key>=<value>" overwriting the memory component sizes. Therefore, even if you 
>> have not configured 'taskmanager.memory.task.heap.size', it is expected that 
>> this config option is available by the time the TM is launched.
>> 3. When a task manager is started, it will not do the calculations again, and 
>> will directly read the memory component sizes calculated by the resource manager 
>> from the dynamic configurations. That means it is not reading 
>> 'taskmanager.memory.process.size' and deriving memory component sizes from 
>> it again.
>>
>> One thing that might have caused your problem is that, when 
>> MesosTaskExecutorRunner parses the command line arguments (that's where the 
>> dynamic configurations are passed in), if it meets an unrecognized token it 
>> will stop parsing the rest of the arguments. That could be the reason that 
>> 'taskmanager.memory.task.heap.size' is missing. You can take a look at the 
>> launching command, see if there's anything unexpected before the memory 
>> dynamic configurations.
>>
>> Thank you~
>>
>> Xintong Song
>>
>>
>>
>> On Thu, Mar 12, 2020 at 2:26 PM Yangze Guo  wrote:
>>>
>>> Hi, Alexander
>>>
>>> I could not reproduce it in my local environment. Normally, Mesos RM
>>> will calculate all the mem config and add it to the launch command.
>>> Unfortunately, all the log I could found for this command is at the
>>> DEBUG level. Would you mind changing the log level to DEBUG or sharing
>>> anything about the taskmanager launch command you could found in the
>>> current log?
>>>
>>>
>>> Best,
>>> Yangze Guo
>>>
>>> On Thu, Mar 12, 2020 at 1:38 PM Alexander Kasyanenko
>>>  wrote:
>>> >
>>> > Hi folks,
>>> >
>>> > I have a question related configuration for new memory introduced in 
>>> > flink 1.10. Has anyone encountered similar problem?
>>> > I'm trying to make use of taskmanager.memory.process.size configuration 
>>> > key in combination with mesos session cluster, but I get an error like 
>>> > this:
>>> >
>>> > 2020-03-11 11:44:09,771 [main] ERROR 
>>> > org.apache.flink.mesos.entrypoint.MesosTaskExecutorRunner - Error 
>>> > while starting the TaskManager
>>> > org.apache.flink.configuration.IllegalConfigurationException: Failed to 
>>> > create TaskExecutorResourceSpec
>>> > at 
>>> > org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:72)
>>> > at 
>>> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
>>> > at 
>>> > org.apache.flink.runtime.taskexecutor.TaskManagerRunner.(TaskManagerRunner.java:152)
>>> > at 
>>> > org.apache.flink.runtime.taskexecutor.TaskMana

Re: Flink 1.10 container memory configuration with Mesos.

2020-03-12 Thread Yangze Guo
It seems we already have such logs in [1]. If that is the case, +1 for
changing it to INFO level.

[1] 
https://github.com/apache/flink/blob/663af45c7f403eb6724852915bf2078241927258/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/LaunchableMesosWorker.java#L341
Best,
Yangze Guo

On Thu, Mar 12, 2020 at 4:03 PM Alexander Kasyanenko
 wrote:
>
> Instead of just launching the TM as it works right now, I suggest logging the launch 
> command first, and then launching the TM. But that might be unnecessary, since the 
> use case is rather specific.
>
> Regards,
> Alex.
>
> чт, 12 мар. 2020 г. в 16:58, Yangze Guo :
>>
>> Glad to hear that your issue is fixed.
>> I'm not sure what you suggest to add. Could you tell it more specific
>> or create a Jira ticket?
>>
>> Best,
>> Yangze Guo
>>
>>
>> On Thu, Mar 12, 2020 at 3:51 PM Alexander Kasyanenko
>>  wrote:
>> >
>> > Hi Yangze, Xintong,
>> >
>> > Thank you for instant response.
>> >
>> > And big thanks for the hint on TM launch command. It indeed was the 
>> > problem. I've added my own custom mesos-taskmanager.sh to echo the launch 
>> > command (I've switched to DEBUG level on logging, but it didn't really 
>> > display anything useful). May I suggest to add something like this in the 
>> > future releases?
>> >
>> > As for my particular case, the issue was in mesos-appmaster.sh option:
>> >
>> > -Dmesos.resourcemanager.tasks.taskmanager-cmd="/opt/job/custom_launch_tm.sh"
>> >
>> > My custom launch script was slicing argument array incorrectly.
>> >
>> > Thanks for the help and regards,
>> > Alex.
>> >
>> > чт, 12 мар. 2020 г. в 15:46, Xintong Song :
>> >>
>> >> Hi Alex,
>> >>
>> >> Could you try to check and post your TM launch command? I suspect that 
>> >> there might be some unrecognized arguments that prevent the rest of 
>> >> arguments being parsed.
>> >>
>> >> The TM memory configuration process works as follow:
>> >>
>> >> The resource manager will parse the configurations, checking which 
>> >> options are configured and which are not, and calculate the size of each 
>> >> memory component. (This is where ‘taskmanager.memory.process.size’ is 
>> >> used.)
>> >> After deriving the memory component sizes, the resource manager will 
>> >> generate launch command for the task managers, with dynamic 
>> >> configurations "-D " overwriting the memory component sizes. 
>> >> Therefore, even you have not configured 
>> >> 'taskmanager.memory.task.heap.size', it is expected that before when the 
>> >> TM is launched this config option should be available.
>> >> When a task manager is started, it will not do the calculations again, 
>> >> and will directly read the memory component sizes calculated by resource 
>> >> manager from the dynamic configurations. That means it is not reading 
>> >> ‘taskmanager.memory.process.size’ and deriving memory component sizes 
>> >> from it again.
>> >>
>> >> One thing that might have caused your problem is that, when 
>> >> MesosTaskExecutorRunner parses the command line arguments (that's where 
>> >> the dynamic configurations are passed in), if it meets an unrecognized 
>> >> token it will stop parsing the rest of the arguments. That could be the 
>> >> reason that 'taskmanager.memory.task.heap.size' is missing. You can take 
>> >> a look at the launching command, see if there's anything unexpected 
>> >> before the memory dynamic configurations.
>> >>
>> >> Thank you~
>> >>
>> >> Xintong Song
>> >>
>> >>
>> >>
>> >> On Thu, Mar 12, 2020 at 2:26 PM Yangze Guo  wrote:
>> >>>
>> >>> Hi, Alexander
>> >>>
>> >>> I could not reproduce it in my local environment. Normally, Mesos RM
>> >>> will calculate all the mem config and add it to the launch command.
>> >>> Unfortunately, all the log I could found for this command is at the
>> >>> DEBUG level. Would you mind changing the log level to DEBUG or sharing
>> >>> anything about the taskmanager launch command you could found in the
>> >>> current log?
>> >>>
>> >>>
>>

Re: Flink 1.10 container memory configuration with Mesos.

2020-03-12 Thread Yangze Guo
BTW, the dynamic config will also appear in the TM-side logs [1]. It would
be good to print it at INFO level as well.

[1] 
https://github.com/apache/flink/blob/663af45c7f403eb6724852915bf2078241927258/flink-mesos/src/main/java/org/apache/flink/mesos/entrypoint/MesosTaskExecutorRunner.java#L77

Best,
Yangze Guo

On Thu, Mar 12, 2020 at 4:06 PM Yangze Guo  wrote:
>
> It seems we already have such logs in [1]. If that is the case, +1 for
> changing it to INFO level.
>
> [1] 
> https://github.com/apache/flink/blob/663af45c7f403eb6724852915bf2078241927258/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/LaunchableMesosWorker.java#L341
> Best,
> Yangze Guo
>
> On Thu, Mar 12, 2020 at 4:03 PM Alexander Kasyanenko
>  wrote:
> >
> > Instead of just launching TM as it works right now, I suggest to log launch 
> > command first, and then launch TM. But that might be unnecessary, since the 
> > use case is rather specific.
> >
> > Regards,
> > Alex.
> >
> > чт, 12 мар. 2020 г. в 16:58, Yangze Guo :
> >>
> >> Glad to hear that your issue is fixed.
> >> I'm not sure what you suggest to add. Could you tell it more specific
> >> or create a Jira ticket?
> >>
> >> Best,
> >> Yangze Guo
> >>
> >>
> >> On Thu, Mar 12, 2020 at 3:51 PM Alexander Kasyanenko
> >>  wrote:
> >> >
> >> > Hi Yangze, Xintong,
> >> >
> >> > Thank you for instant response.
> >> >
> >> > And big thanks for the hint on TM launch command. It indeed was the 
> >> > problem. I've added my own custom mesos-taskmanager.sh to echo the 
> >> > launch command (I've switched to DEBUG level on logging, but it didn't 
> >> > really display anything useful). May I suggest to add something like 
> >> > this in the future releases?
> >> >
> >> > As for my particular case, the issue was in mesos-appmaster.sh option:
> >> >
> >> > -Dmesos.resourcemanager.tasks.taskmanager-cmd="/opt/job/custom_launch_tm.sh"
> >> >
> >> > My custom launch script was slicing argument array incorrectly.
> >> >
> >> > Thanks for the help and regards,
> >> > Alex.
> >> >
> >> > чт, 12 мар. 2020 г. в 15:46, Xintong Song :
> >> >>
> >> >> Hi Alex,
> >> >>
> >> >> Could you try to check and post your TM launch command? I suspect that 
> >> >> there might be some unrecognized arguments that prevent the rest of 
> >> >> arguments being parsed.
> >> >>
> >> >> The TM memory configuration process works as follow:
> >> >>
> >> >> The resource manager will parse the configurations, checking which 
> >> >> options are configured and which are not, and calculate the size of 
> >> >> each memory component. (This is where ‘taskmanager.memory.process.size’ 
> >> >> is used.)
> >> >> After deriving the memory component sizes, the resource manager will 
> >> >> generate launch command for the task managers, with dynamic 
> >> >> configurations "-D " overwriting the memory component sizes. 
> >> >> Therefore, even you have not configured 
> >> >> 'taskmanager.memory.task.heap.size', it is expected that before when 
> >> >> the TM is launched this config option should be available.
> >> >> When a task manager is started, it will not do the calculations again, 
> >> >> and will directly read the memory component sizes calculated by 
> >> >> resource manager from the dynamic configurations. That means it is not 
> >> >> reading ‘taskmanager.memory.process.size’ and deriving memory component 
> >> >> sizes from it again.
> >> >>
> >> >> One thing that might have caused your problem is that, when 
> >> >> MesosTaskExecutorRunner parses the command line arguments (that's where 
> >> >> the dynamic configurations are passed in), if it meets an unrecognized 
> >> >> token it will stop parsing the rest of the arguments. That could be the 
> >> >> reason that 'taskmanager.memory.task.heap.size' is missing. You can 
> >> >> take a look at the launching command, see if there's anything 
> >> >> unexpected before the memory dynamic configurations.
> >> >>
> >> >> Thank you~
> >> >>

Re: Flink YARN app terminated before the client receives the result

2020-03-12 Thread Yangze Guo
Would you mind sharing more information about why the task executor
is killed? If it is killed by Yarn, you might find such info in the Yarn
NM/RM logs.
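
For example (assuming YARN log aggregation is enabled; the application id below
is simply the one from your mail), the aggregated container logs can be fetched
with

yarn logs -applicationId application_1559388106022_9412

and the NodeManager / ResourceManager daemon logs on the respective hosts may
show whether the container was killed by YARN itself.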

Best,
Yangze Guo

Best,
Yangze Guo


On Fri, Mar 13, 2020 at 12:31 PM DONG, Weike  wrote:
>
> Hi,
>
> Recently I have encountered a strange behavior of Flink on YARN, which is 
> that when I try to cancel a Flink job running in per-job mode on YARN using 
> commands like
>
> "cancel -m yarn-cluster -yid application_1559388106022_9412 
> ed7e2e0ab0a7316c1b65df6047bc6aae"
>
> the client happily found and connected to the ResourceManager and then gets stuck at
> Found Web Interface 172.28.28.3:50099 of application 
> 'application_1559388106022_9412'.
>
> And after one minute, an exception is thrown at the client side:
> Caused by: org.apache.flink.util.FlinkException: Could not cancel job 
> ed7e2e0ab0a7316c1b65df6047bc6aae.
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
> at 
> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
> at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
> at 
> org.apache.flink.client.cli.CliFrontend.parseParametersWithException(CliFrontend.java:917)
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$mainWithReturnCodeAndException$10(CliFrontend.java:988)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
> ... 20 more
> Caused by: java.util.concurrent.TimeoutException
> at 
> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
> ... 27 more
>
> Then I discovered that the YARN app has already terminated with FINISHED 
> state and KILLED final status, like below.
>
> And after digging into the log of this finished YARN app, I have found that 
> TaskManager had already received the SIGTERM signal and terminated gracefully.
> org.apache.flink.yarn.YarnTaskExecutorRunner  - RECEIVED SIGNAL 15: SIGTERM. 
> Shutting down as requested.
>
> Also, the log of JobManager shows that it terminated with exit code 0.
> Terminating cluster entrypoint process YarnJobClusterEntrypoint with exit 
> code 0
>
> However, the JobManager did not return anything to the client before its 
> shutdown, which is different from previous versions (like Flink 1.9).
>
> I wonder if this is a new bug in the flink-clients or flink-yarn module?
>
> Thank you : )
>
> Sincerely,
> Weike


Re: [DISCUSS] FLIP-111: Docker image unification

2020-03-17 Thread Yangze Guo
I second Thomas that we can support both Java 8 and 11.

Best,
Yangze Guo

On Wed, Mar 18, 2020 at 12:12 PM Thomas Weise  wrote:
>
> -->
>
> On Mon, Mar 16, 2020 at 1:58 AM Andrey Zagrebin  wrote:
>>
>> Thanks for the further feedback Thomas and Yangze.
>>
>> > A generic, dynamic configuration mechanism based on environment variables
>> is essential and it is already supported via envsubst and an environment
>> variable that can supply a configuration fragment
>>
>> True, we already have this. As I understand this was introduced for
>> flexibility to template a custom flink-conf.yaml with env vars, put it into
>> the FLINK_PROPERTIES and merge it with the default one.
>> Could we achieve the same with the dynamic properties (-Drpc.port=1234),
>> passed as image args to run it, instead of FLINK_PROPERTIES?
>> They could be also parametrised with env vars. This would require
>> jobmanager.sh to properly propagate them to
>> the StandaloneSessionClusterEntrypoint though:
>> https://github.com/docker-flink/docker-flink/pull/82#issuecomment-525285552
>> cc @Till
>> This would provide a unified configuration approach.
>>
>
> How would that look like for the various use cases? The k8s operator would 
> need to generate the -Dabc .. -Dxyz entry point command instead of setting 
> the FLINK_PROPERTIES environment variable? Potentially that introduces 
> additional complexity for little gain. Do most deployment platforms that 
> support Docker containers handle the command line route well? Backward 
> compatibility may also be a concern.
>
>>
>> > On the flip side, attempting to support a fixed subset of configuration
>> options is brittle and will probably lead to compatibility issues down the
>> road
>>
>> I agree with it. The idea was to have just some shortcut scripted functions
>> to set options in flink-conf.yaml for a custom Dockerfile or entry point
>> script.
>> TASK_MANAGER_NUMBER_OF_TASK_SLOTS could be set as a dynamic property of
>> started JM.
>> I am not sure how many users depend on it. Maybe we could remove it.
>> It also looks like we already have a somewhat unclean state in
>> the docker-entrypoint.sh, where some ports are set to hardcoded values
>> and then FLINK_PROPERTIES is applied, potentially duplicating options in
>> the resulting flink-conf.yaml.
>
>
> That is indeed possible and duplicate entries from FLINK_PROPERTIES prevail. 
> Unfortunately, the special cases you mention were already established and the 
> generic mechanism was added later for the k8s operators.
>
>>
>>
>> I can see some potential usage of env vars as standard entry point args but
>> for purposes related to something which cannot be achieved by passing entry
>> point args, like changing flink-conf.yaml options. Nothing comes to my
>> mind at the moment. It could be some setting specific to the running mode
>> of the entry point. The mode itself can stay the first arg of the entry
>> point.
>>
>> > I would second that it is desirable to support Java 11
>>
>> > Regarding supporting JAVA 11:
>> > - Not sure if it is necessary to ship JAVA. Maybe we could just change
>> > the base image from openjdk:8-jre to openjdk:11-jre in template docker
>> > file[1]. Correct me if I understand incorrectly. Also, I agree to move
>> > this out of the scope of this FLIP if it indeed takes much extra
>> > effort.
>>
>> This is what I meant by bumping up the Java version in the docker hub Flink
>> image:
>> FROM openjdk:8-jre -> FROM openjdk:11-jre
>> This can be polled independently on the user mailing list.
>
>
> That sounds reasonable as long as we can still support both Java versions 
> (i.e. provide separate images for 8 and 11).
>
>>
>>
>> > and in general use a base image that allows the (straightforward) use of
>> more recent versions of other software (Python etc.)
>>
>> We can poll whether to always include some version of Python in
>> the docker hub image.
>> A potential problem here is once it is there, it is some hassle to
>> remove/change it in a custom extended Dockerfile.
>>
>> It would also be nice to avoid maintaining images for various combinations
>> of installed Java/Scala/Python in docker hub.
>>
>> > Regarding building from local dist:
>> > - Yes, I bring this up mostly for development purposes. Since k8s is
>> > popular, I believe more and more developers would like to test their
>> > work on a k8s cluster. I'm not sure whether all developers should write a custom
>&g

[Third-party Tool] Flink memory calculator

2020-03-27 Thread Yangze Guo
Hi, there.

In release-1.10, the memory setup of task managers has changed a lot.
I would like to provide here a third-party tool to simulate and get
the calculation result of Flink's memory configuration.

Although there are already a detailed setup guide [1] and a migration
guide [2] in the official docs, the calculator further allows users to:
- Verify whether there is any conflict in their configuration. The
calculator is more lightweight than starting a Flink cluster,
especially when running Flink on Yarn/Kubernetes. Users can make sure
their configuration is correct locally before deploying it to external
resource managers.
- Get all of the memory configurations before deploying. Users may set
taskmanager.memory.task.heap.size and taskmanager.memory.managed.size,
but they may also want to know the total memory consumption of Flink. With
this tool, users can get all of the memory configurations they are
interested in. If anything is unexpected, they can catch it without
having to re-deploy a Flink cluster.

The repo link of this tool is
https://github.com/KarmaGYZ/flink-memory-calculator. It reuses the
BashJavaUtils.jar of Flink and ensures the calculation result is
exactly the same as your Flink dist. For more details, please take a
look at the README.

Any feedback or suggestion is welcomed!

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/ops/memory/mem_setup.html
[2] 
https://ci.apache.org/projects/flink/flink-docs-master/ops/memory/mem_migration.html

Best,
Yangze Guo


Re: [Third-party Tool] Flink memory calculator

2020-03-29 Thread Yangze Guo
Hi, Yun,

I'm sorry that it currently cannot handle that. But I think it is a
really good idea, and that feature will be added in the next version.

Best,
Yangze Guo

On Mon, Mar 30, 2020 at 12:21 AM Yun Tang  wrote:
>
> Very interesting and convenient tool, just a quick question: could this tool 
> also handle deployment cluster commands like "-tm" mixed with configuration 
> in `flink-conf.yaml` ?
>
> Best
> Yun Tang
> ________
> From: Yangze Guo 
> Sent: Friday, March 27, 2020 18:00
> To: user ; user...@flink.apache.org 
> 
> Subject: [Third-party Tool] Flink memory calculator
>
> Hi, there.
>
> In release-1.10, the memory setup of task managers has changed a lot.
> I would like to provide here a third-party tool to simulate and get
> the calculation result of Flink's memory configuration.
>
>  Although there is already a detailed setup guide[1] and migration
> guide[2] officially, the calculator could further allow users to:
> - Verify if there is any conflict in their configuration. The
> calculator is more lightweight than starting a Flink cluster,
> especially when running Flink on Yarn/Kubernetes. User could make sure
> their configuration is correct locally before deploying it to external
> resource managers.
> - Get all of the memory configurations before deploying. User may set
> taskmanager.memory.task.heap.size and taskmanager.memory.managed.size.
> But they also want to know the total memory consumption of Flink. With
> this tool, users could get all of the memory configurations they are
> interested in. If anything is unexpected, they would not need to
> re-deploy a Flink cluster.
>
> The repo link of this tool is
> https://github.com/KarmaGYZ/flink-memory-calculator. It reuses the
> BashJavaUtils.jar of Flink and ensures the calculation result is
> exactly the same as your Flink dist. For more details, please take a
> look at the README.
>
> Any feedback or suggestion is welcomed!
>
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-master/ops/memory/mem_setup.html
> [2] 
> https://ci.apache.org/projects/flink/flink-docs-master/ops/memory/mem_migration.html
>
> Best,
> Yangze Guo


Re: [Third-party Tool] Flink memory calculator

2020-03-29 Thread Yangze Guo
Thanks for your feedback, @Xintong and @Jeff.

@Jeff
I think it would always be good to leverage existing logic in Flink, such
as JobListener. However, this calculator does not only aim to check
for conflicts; it also aims to provide the calculation result to
users before the job is actually deployed, in case there is any
unexpected configuration. It's a good point that we need to parse the
dynamic configs. I prefer to parse the dynamic configs and CLI
commands in bash instead of adding a hook in JobListener.

Best,
Yangze Guo

On Mon, Mar 30, 2020 at 10:32 AM Jeff Zhang  wrote:
>
> Hi Yangze,
>
> Does this tool just parse the configuration in flink-conf.yaml?  Maybe it 
> could be done in JobListener [1] (we should enhance it by adding a hook before 
> job submission), so that it could cover all the cases (e.g. parameters coming from 
> the command line)
>
> [1] 
> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/execution/JobListener.java#L35
>
>
> Yangze Guo  于2020年3月30日周一 上午9:40写道:
>>
>> Hi, Yun,
>>
>> I'm sorry that it currently could not handle it. But I think it is a
>> really good idea and that feature would be added to the next version.
>>
>> Best,
>> Yangze Guo
>>
>> On Mon, Mar 30, 2020 at 12:21 AM Yun Tang  wrote:
>> >
>> > Very interesting and convenient tool, just a quick question: could this 
>> > tool also handle deployment cluster commands like "-tm" mixed with 
>> > configuration in `flink-conf.yaml` ?
>> >
>> > Best
>> > Yun Tang
>> > 
>> > From: Yangze Guo 
>> > Sent: Friday, March 27, 2020 18:00
>> > To: user ; user...@flink.apache.org 
>> > 
>> > Subject: [Third-party Tool] Flink memory calculator
>> >
>> > Hi, there.
>> >
>> > In release-1.10, the memory setup of task managers has changed a lot.
>> > I would like to provide here a third-party tool to simulate and get
>> > the calculation result of Flink's memory configuration.
>> >
>> >  Although there is already a detailed setup guide[1] and migration
>> > guide[2] officially, the calculator could further allow users to:
>> > - Verify if there is any conflict in their configuration. The
>> > calculator is more lightweight than starting a Flink cluster,
>> > especially when running Flink on Yarn/Kubernetes. User could make sure
>> > their configuration is correct locally before deploying it to external
>> > resource managers.
>> > - Get all of the memory configurations before deploying. User may set
>> > taskmanager.memory.task.heap.size and taskmanager.memory.managed.size.
>> > But they also want to know the total memory consumption of Flink. With
>> > this tool, users could get all of the memory configurations they are
>> > interested in. If anything is unexpected, they would not need to
>> > re-deploy a Flink cluster.
>> >
>> > The repo link of this tool is
>> > https://github.com/KarmaGYZ/flink-memory-calculator. It reuses the
>> > BashJavaUtils.jar of Flink and ensures the calculation result is
>> > exactly the same as your Flink dist. For more details, please take a
>> > look at the README.
>> >
>> > Any feedback or suggestion is welcomed!
>> >
>> > [1] 
>> > https://ci.apache.org/projects/flink/flink-docs-master/ops/memory/mem_setup.html
>> > [2] 
>> > https://ci.apache.org/projects/flink/flink-docs-master/ops/memory/mem_migration.html
>> >
>> > Best,
>> > Yangze Guo
>
>
>
> --
> Best Regards
>
> Jeff Zhang


Re: [Third-party Tool] Flink memory calculator

2020-03-31 Thread Yangze Guo
Hi, there.

In the latest version, the calculator supports dynamic options. You
could append all your dynamic options to the end of "bin/calculator.sh
[-h]".
Since "-tm" will be deprecated eventually, please replace it with
"-Dtaskmanager.memory.process.size=".

Best,
Yangze Guo

On Mon, Mar 30, 2020 at 12:57 PM Xintong Song  wrote:
>
> Hi Jeff,
>
> I think the purpose of this tool is to allow users to play with the memory 
> configurations without needing to actually deploy a Flink cluster or even 
> have a job. For sanity checks, we currently have them in the start-up scripts 
> (for standalone clusters) and resource managers (on K8s/Yarn/Mesos).
>
> I think it makes sense to do the checks earlier, i.e. on the client side. But 
> I'm not sure if JobListener is the right place. IIUC, JobListener is invoked 
> before submitting a specific job, while the mentioned checks validate Flink's 
> cluster level configurations. It might be okay for a job cluster, but does 
> not cover the scenarios of session clusters.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Mon, Mar 30, 2020 at 12:03 PM Yangze Guo  wrote:
>>
>> Thanks for your feedbacks, @Xintong and @Jeff.
>>
>> @Jeff
>> I think it would always be good to leverage exist logic in Flink, such
>> as JobListener. However, this calculator does not only target to check
>> the conflict, it also targets to provide the calculating result to
>> user before the job is actually deployed in case there is any
>> unexpected configuration. It's a good point that we need to parse the
>> dynamic configs. I prefer to parse the dynamic configs and cli
>> commands in bash instead of adding hook in JobListener.
>>
>> Best,
>> Yangze Guo
>>
>> On Mon, Mar 30, 2020 at 10:32 AM Jeff Zhang  wrote:
>> >
>> > Hi Yangze,
>> >
>> > Does this tool just parse the configuration in flink-conf.yaml ?  Maybe it 
>> > could be done in JobListener [1] (we should enhance it via adding hook 
>> > before job submission), so that it could all the cases (e.g. parameters 
>> > coming from command line)
>> >
>> > [1] 
>> > https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/execution/JobListener.java#L35
>> >
>> >
>> > Yangze Guo  于2020年3月30日周一 上午9:40写道:
>> >>
>> >> Hi, Yun,
>> >>
>> >> I'm sorry that it currently could not handle it. But I think it is a
>> >> really good idea and that feature would be added to the next version.
>> >>
>> >> Best,
>> >> Yangze Guo
>> >>
>> >> On Mon, Mar 30, 2020 at 12:21 AM Yun Tang  wrote:
>> >> >
>> >> > Very interesting and convenient tool, just a quick question: could this 
>> >> > tool also handle deployment cluster commands like "-tm" mixed with 
>> >> > configuration in `flink-conf.yaml` ?
>> >> >
>> >> > Best
>> >> > Yun Tang
>> >> > 
>> >> > From: Yangze Guo 
>> >> > Sent: Friday, March 27, 2020 18:00
>> >> > To: user ; user...@flink.apache.org 
>> >> > 
>> >> > Subject: [Third-party Tool] Flink memory calculator
>> >> >
>> >> > Hi, there.
>> >> >
>> >> > In release-1.10, the memory setup of task managers has changed a lot.
>> >> > I would like to provide here a third-party tool to simulate and get
>> >> > the calculation result of Flink's memory configuration.
>> >> >
>> >> >  Although there is already a detailed setup guide[1] and migration
>> >> > guide[2] officially, the calculator could further allow users to:
>> >> > - Verify if there is any conflict in their configuration. The
>> >> > calculator is more lightweight than starting a Flink cluster,
>> >> > especially when running Flink on Yarn/Kubernetes. User could make sure
>> >> > their configuration is correct locally before deploying it to external
>> >> > resource managers.
>> >> > - Get all of the memory configurations before deploying. User may set
>> >> > taskmanager.memory.task.heap.size and taskmanager.memory.managed.size.
>> >> > But they also want to know the total memory consumption of Flink. With
>> >> > this tool, users could get all of the memory configurations they are
>> >> > interested in. If anything is unexpected, they would not need to
>> >> > re-deploy a Flink cluster.
>> >> >
>> >> > The repo link of this tool is
>> >> > https://github.com/KarmaGYZ/flink-memory-calculator. It reuses the
>> >> > BashJavaUtils.jar of Flink and ensures the calculation result is
>> >> > exactly the same as your Flink dist. For more details, please take a
>> >> > look at the README.
>> >> >
>> >> > Any feedback or suggestion is welcomed!
>> >> >
>> >> > [1] 
>> >> > https://ci.apache.org/projects/flink/flink-docs-master/ops/memory/mem_setup.html
>> >> > [2] 
>> >> > https://ci.apache.org/projects/flink/flink-docs-master/ops/memory/mem_migration.html
>> >> >
>> >> > Best,
>> >> > Yangze Guo
>> >
>> >
>> >
>> > --
>> > Best Regards
>> >
>> > Jeff Zhang


Re: [Third-party Tool] Flink memory calculator

2020-04-01 Thread Yangze Guo
@Marta
Thanks for the tip! I'll do that.

Best,
Yangze Guo

On Wed, Apr 1, 2020 at 8:05 PM Marta Paes Moreira  wrote:
>
> Hey, Yangze.
>
> I'd like to suggest that you submit this tool to Flink Community Pages [1]. 
> That way it can get more exposure and it'll be easier for users to find it.
>
> Thanks for your contribution!
>
> [1] https://flink-packages.org/
>
> On Tue, Mar 31, 2020 at 9:09 AM Yangze Guo  wrote:
>>
>> Hi, there.
>>
>> In the latest version, the calculator supports dynamic options. You
>> could append all your dynamic options to the end of "bin/calculator.sh
>> [-h]".
>> Since "-tm" will be deprecated eventually, please replace it with
>> "-Dtaskmanager.memory.process.size=".
>>
>> Best,
>> Yangze Guo
>>
>> On Mon, Mar 30, 2020 at 12:57 PM Xintong Song  wrote:
>> >
>> > Hi Jeff,
>> >
>> > I think the purpose of this tool it to allow users play with the memory 
>> > configurations without needing to actually deploy the Flink cluster or 
>> > even have a job. For sanity checks, we currently have them in the start-up 
>> > scripts (for standalone clusters) and resource managers (on 
>> > K8s/Yarn/Mesos).
>> >
>> > I think it makes sense do the checks earlier, i.e. on the client side. But 
>> > I'm not sure if JobListener is the right place. IIUC, JobListener is 
>> > invoked before submitting a specific job, while the mentioned checks 
>> > validate Flink's cluster level configurations. It might be okay for a job 
>> > cluster, but does not cover the scenarios of session clusters.
>> >
>> > Thank you~
>> >
>> > Xintong Song
>> >
>> >
>> >
>> > On Mon, Mar 30, 2020 at 12:03 PM Yangze Guo  wrote:
>> >>
>> >> Thanks for your feedbacks, @Xintong and @Jeff.
>> >>
>> >> @Jeff
>> >> I think it would always be good to leverage exist logic in Flink, such
>> >> as JobListener. However, this calculator does not only target to check
>> >> the conflict, it also targets to provide the calculating result to
>> >> user before the job is actually deployed in case there is any
>> >> unexpected configuration. It's a good point that we need to parse the
>> >> dynamic configs. I prefer to parse the dynamic configs and cli
>> >> commands in bash instead of adding hook in JobListener.
>> >>
>> >> Best,
>> >> Yangze Guo
>> >>
>> >> On Mon, Mar 30, 2020 at 10:32 AM Jeff Zhang  wrote:
>> >> >
>> >> > Hi Yangze,
>> >> >
>> >> > Does this tool just parse the configuration in flink-conf.yaml ?  Maybe 
>> >> > it could be done in JobListener [1] (we should enhance it via adding 
>> >> > hook before job submission), so that it could all the cases (e.g. 
>> >> > parameters coming from command line)
>> >> >
>> >> > [1] 
>> >> > https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/core/execution/JobListener.java#L35
>> >> >
>> >> >
>> >> > Yangze Guo  于2020年3月30日周一 上午9:40写道:
>> >> >>
>> >> >> Hi, Yun,
>> >> >>
>> >> >> I'm sorry that it currently could not handle it. But I think it is a
>> >> >> really good idea and that feature would be added to the next version.
>> >> >>
>> >> >> Best,
>> >> >> Yangze Guo
>> >> >>
>> >> >> On Mon, Mar 30, 2020 at 12:21 AM Yun Tang  wrote:
>> >> >> >
>> >> >> > Very interesting and convenient tool, just a quick question: could 
>> >> >> > this tool also handle deployment cluster commands like "-tm" mixed 
>> >> >> > with configuration in `flink-conf.yaml` ?
>> >> >> >
>> >> >> > Best
>> >> >> > Yun Tang
>> >> >> > 
>> >> >> > From: Yangze Guo 
>> >> >> > Sent: Friday, March 27, 2020 18:00
>> >> >> > To: user ; user...@flink.apache.org 
>> >> >> > 
>> >> >> > Subject: [Third-party Tool] Flink memory calculator
>> >> >> >
>> >> >> > Hi, there.
>> >> >> >
&

Re: on YARN question

2020-04-09 Thread Yangze Guo
Do you mean to run it in detached mode? If so, you could add "-d".

Best,
Yangze Guo

On Fri, Apr 10, 2020 at 1:05 PM Ethan Li  wrote:
>
> I am not a Flink expert. Just out of curiosity,
>
> I am seeing
>
> “YARN application has been deployed successfully“
>
> Does it not mean it’s working properly?
>
>
> Best,
> Ethan
>
> On Apr 9, 2020, at 23:01, 罗杰  wrote:
>
> 
> Hello, could you please tell me how to solve the problem that when I use 
> yarn-session.sh, it gets stuck at the following point and does not continue?
> Hadoop2.7.2  flink 1.10.0
> have: flink/lib/ flink-shaded-hadoop-2-uber-2.7.5-10.0.jar
> [root@hadoop131 bin]# ./yarn-session.sh -n 2 -s 2 -jm 1024 -tm 1024 -nm test 
> -d
> 2020-04-10 11:00:35,434 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: jobmanager.rpc.address, hadoop131
> 2020-04-10 11:00:35,437 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: jobmanager.rpc.port, 6123
> 2020-04-10 11:00:35,437 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: jobmanager.heap.size, 1024m
> 2020-04-10 11:00:35,437 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: taskmanager.memory.process.size, 1568m
> 2020-04-10 11:00:35,437 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: taskmanager.numberOfTaskSlots, 1
> 2020-04-10 11:00:35,437 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: parallelism.default, 1
> 2020-04-10 11:00:35,438 INFO  
> org.apache.flink.configuration.GlobalConfiguration- Loading 
> configuration property: jobmanager.execution.failover-strategy, region
> 2020-04-10 11:00:35,553 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - Found Yarn properties file under /tmp/.yarn-properties-root.
> 2020-04-10 11:00:36,141 WARN  org.apache.hadoop.util.NativeCodeLoader 
>   - Unable to load native-hadoop library for your platform... 
> using builtin-java classes where applicable
> 2020-04-10 11:00:36,323 INFO  
> org.apache.flink.runtime.security.modules.HadoopModule- Hadoop user 
> set to root (auth:SIMPLE)
> 2020-04-10 11:00:36,509 INFO  
> org.apache.flink.runtime.security.modules.JaasModule  - Jaas file 
> will be created as /tmp/jaas-9182197754252132172.conf.
> 2020-04-10 11:00:36,554 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli   
>   - The configuration directory ('/opt/module/flink-1.10.0/conf') 
> already contains a LOG4J config file.If you want to use logback, then please 
> delete or rename the log configuration file.
> 2020-04-10 11:00:36,653 INFO  org.apache.hadoop.yarn.client.RMProxy   
>   - Connecting to ResourceManager at hadoop132/192.168.15.132:8032
> 2020-04-10 11:00:36,903 INFO  
> org.apache.flink.runtime.clusterframework.TaskExecutorProcessUtils  - The 
> derived from fraction jvm overhead memory (156.800mb (164416719 bytes)) is 
> less than its min value 192.000mb (201326592 bytes), min value will be used 
> instead
> 2020-04-10 11:00:37,048 WARN  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment 
> variable is set. The Flink YARN Client needs one of these to be set to 
> properly load the Hadoop configuration for accessing YARN.
> 2020-04-10 11:00:37,109 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Cluster specification: 
> ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=1568, 
> slotsPerTaskManager=1}
> 2020-04-10 11:00:50,693 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Submitting application master application_1586487382351_0001
> 2020-04-10 11:00:51,093 INFO  
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted 
> application application_1586487382351_0001
> 2020-04-10 11:00:51,093 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Waiting for the cluster to be allocated
> 2020-04-10 11:00:51,096 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Deploying cluster, current state ACCEPTED
> 2020-04-10 11:01:04,140 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - YARN application has been deployed successfully.
> 2020-04-10 11:01:04,141 INFO  org.apache.flink.yarn.YarnClusterDescriptor 
>   - Found Web Interface hadoop133:40677 of application 
> 'application_1586487382351_0001'.
> JobManager Web Interface: http://hadoop133:40677
>


Re: flink java.util.concurrent.TimeoutException

2020-04-15 Thread Yangze Guo
From the logs, the TaskManager heartbeat timed out. If the TM is still alive, could it be a network issue? You could try increasing heartbeat.timeout a bit and see if that helps.
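
For example, in flink-conf.yaml (180000 ms below is just an illustrative value; the default heartbeat.timeout is 50000 ms):

heartbeat.timeout: 180000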

Best,
Yangze Guo

On Mon, Apr 13, 2020 at 10:40 AM 欧阳苗  wrote:
>
> The job ran for two days and then failed, throwing the exception below, but the taskManager did not die and other jobs still run on it normally. What could be causing this problem, and is there a good way to solve it?
>
>
> 2020-04-13 06:20:31.379 ERROR 1 --- [ent-IO-thread-3] 
> org.apache.flink.runtime.rest.RestClient.parseResponse:393 : Received 
> response was neither of the expected type ([simple type, class 
> org.apache.flink.runtime.rest.messages.job.JobExecutionResultResponseBody]) 
> nor an error. 
> Response=JsonResponse{json={"status":{"id":"COMPLETED"},"job-execution-result":{"id":"2d2a0b4efc8c3d973e2e9490b7b3b2f1","application-status":"FAILED","accumulator-results":{},"net-runtime":217272900,"failure-cause":{"class":"java.util.concurrent.TimeoutException","stack-trace":"java.util.concurrent.TimeoutException:
>  Heartbeat of TaskManager with id 0a4ea651244982ef4b4b7092d18de776 timed 
> out.\n\tat 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1656)\n\tat
>  
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:339)\n\tat
>  
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat 
> java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat 
> org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154)\n\tat
>  akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)\n\tat 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)\n\tat
>  scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)\n\tat 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)\n\tat
>  
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)\n\tat
>  
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)\n","serialized-throwable":"rO0ABXNyAClvcmcuYXBhY2hlLmZsaW5rLnV0aWwuU2VyaWFsaXplZFRocm93YWJsZWUWnfUfpxPzAgADTAAZZnVsbFN0cmluZ2lmaWVkU3RhY2tUcmFjZXQAEkxqYXZhL2xhbmcvU3RyaW5nO0wAFm9yaWdpbmFsRXJyb3JDbGFzc05hbWVxAH4AAVsAE3NlcmlhbGl6ZWRFeGNlcHRpb250AAJbQnhyABNqYXZhLmxhbmcuRXhjZXB0aW9u0P0fPho7HMQCAAB4cgATamF2YS5sYW5nLlRocm93YWJsZdXGNSc5d7jLAwAETAAFY2F1c2V0ABVMamF2YS9sYW5nL1Rocm93YWJsZTtMAA1kZXRhaWxNZXNzYWdlcQB+AAFbAApzdGFja1RyYWNldAAeW0xqYXZhL2xhbmcvU3RhY2tUcmFjZUVsZW1lbnQ7TAAUc3VwcHJlc3NlZEV4Y2VwdGlvbnN0ABBMamF2YS91dGlsL0xpc3Q7eHBwdABMSGVhcnRiZWF0IG9mIFRhc2tNYW5hZ2VyIHdpdGggaWQgMGE0ZWE2NTEyNDQ5ODJlZjRiNGI3MDkyZDE4ZGU3NzYgdGltZWQgb3V0LnVyAB5bTGphdmEubGFuZy5TdGFja1RyYWNlRWxlbWVudDsCRio8PP0iOQIAAHhwC3NyABtqYXZhLmxhbmcuU3RhY2tUcmFjZUVsZW1lbnRhCcWaJjbdhQIABEkACmxpbmVOdW1iZXJMAA5kZWNsYXJpbmdDbGFzc3EAfgABTAAIZmlsZU5hbWVxAH4AAUwACm1ldGhvZE5hbWVxAH4AAXhwAAAGeHQASW9yZy5hcGFjaGUuZmxpbmsucnVudGltZS5qb2JtYXN0ZXIuSm9iTWFzdGVyJFRhc2tNYW5hZ2VySGVhcnRiZWF0TGlzdGVuZXJ0AA5Kb2JNYXN0ZXIuamF2YXQAFm5vdGlmeUhlYXJ0YmVhdFRpbWVvdXRzcQB+AAwAAAFTdABIb3JnLmFwYWNoZS5mbGluay5ydW50aW1lLmhlYXJ0YmVhdC5IZWFydGJlYXRNYW5hZ2VySW1wbCRIZWFydGJlYXRNb25pdG9ydAAZSGVhcnRiZWF0TWFuYWdlckltcGwuamF2YXQAA3J1bnNxAH4ADf90AC5qYXZhLnV0aWwuY29uY3VycmVudC5FeGVjdXRvcnMkUnVubmFibGVBZGFwdGVydAAORXhlY3V0b3JzLmphdmF0AARjYWxsc3EAfgAMAAABCnQAH2phdmEudXRpbC5jb25jdXJyZW50LkZ1dHVyZVRhc2t0AA9GdXR1cmVUYXNrLmphdmFxAH4AFHNxAH4ADJp0AGBvcmcuYXBhY2hlLmZsaW5rLnJ1bnRpbWUuY29uY3VycmVudC5ha2thLkFjdG9yU3lzdGVtU2NoZWR1bGVkRXhlY3V0b3JBZGFwdGVyJFNjaGVkdWxlZEZ1dHVyZVRhc2t0AChBY3RvclN5c3RlbVNjaGVkdWxlZEV4ZWN1dG9yQWRhcHRlci5qYXZhcQB+ABRzcQB+AAwndAAcYWtrYS5kaXNwYXRjaC5UYXNrSW52b2NhdGlvbnQAGEFic3RyYWN0RGlzcGF0Y2hlci5zY2FsYXEAfgAUc3EAfgAMAAABn3QAO2Fra2EuZGlzcGF0Y2guRm9ya0pvaW5FeGVjdXRvckNvbmZpZ3VyYXRvciRBa2thRm9ya0pvaW5UYXNrcQB+ACF0AARleGVjc3EAfgAMAAABBHQAJnNjYWxhLmNvbmN1cnJlbnQuZm9ya2pvaW4uRm9ya0pvaW5UYXNrdAARRm9ya0pvaW5UYXNrLmphdmF0AAZkb0V4ZWNzcQB+AAwAAAU7dAAwc2NhbGEuY29uY3VycmVudC5mb3Jram9pbi5Gb3JrSm9pblBvb2wkV29ya1F1ZXVldAARRm9ya0pvaW5Qb29sLmphdmF0AAdydW5UYXNrc3EAfgAMAAAHu3QAJnNjYWxhLmNvbmN1cnJlbnQuZm9ya2pvaW4uRm9ya0pvaW5Qb29scQB+ACt0AAlydW5Xb3JrZXJzcQB+AAwAAABrdAAuc2NhbGEuY29uY3VycmVudC5mb3Jram9pbi5Gb3JrSm9pbldvcmtlclRocmVhZHQAGUZvcmtKb2luV29ya2VyVGhyZWFkLmphdmFxAH4AFHNyACZqYXZhLnV0aWwuQ29sbGVjdGlvbnMkVW5tb2RpZmlhYmxlTGlzdPwPJTG17I4QAgABTAAEbGlzdHEAfgAHeHIALGphdmEudXRpbC5Db2xsZWN0aW9ucyRVbm1vZGlmaWFibGVDb2xsZWN0aW9uGUIAgMte9x4CAAFMAAFjdAAWTGphdmEvdXRpbC9Db2xsZWN0aW9uO3hwc3IAE2phdmEudXRpbC5BcnJheUxpc3R4gdIdmcdhnQMAAUkABHNpemV4cAB3BAB4cQB+ADh4dARka

Re: Receiving context information through JobListener interface

2021-04-25 Thread Yangze Guo
It seems that the JobListener interface cannot expose such
information. Maybe you can set the RuleId as the jobName (or a suffix
of the jobName) of the application; then you can get the mapping of
jobId to jobName (RuleId) through /jobs/overview.

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/rest_api/#jobs-overview
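
A minimal sketch of that idea (the class name, the "my-app-" prefix and the rule
id are placeholders for your application, not part of any Flink API):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RuleJob {
    public static void main(String[] args) throws Exception {
        String ruleId = args[0];  // business identifier, e.g. "rule-42"
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(1, 2, 3).print();  // stand-in for the real pipeline
        // Encode the rule id in the job name so it can be recovered later.
        env.execute("my-app-" + ruleId);
    }
}

The /jobs/overview response contains both the job id ("jid") and that name, so a
monitoring component can rebuild the jobId -> RuleId mapping from it.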

Best,
Yangze Guo

On Sun, Apr 25, 2021 at 4:17 PM Barak Ben Nathan
 wrote:
>
>
>
> Hi all,
>
>
>
> I am building an application that launches Flink Jobs and monitors them.
>
>
>
> I want to use the JobListener interface to output job events to a Kafka Topic.
>
>
>
> The problem:
>
> In the application we have a RuleId, i.e. a business-logic identifier for the 
> job, and there's a JobId, which is the internal identifier generated by Flink.
>
> I need the events emitted to Kafka to be partitioned by *RuleId*.
>
>
>
> Is there a way to pass this kind of information to Flink and get it through 
> the JobListener interface?
>
>
>
> Thanks,
>
> Barak


Re: when should `FlinkYarnSessionCli` be included for parsing CLI arguments?

2021-04-25 Thread Yangze Guo
Hi, Tony.

What is the version of your flink-dist? AFAIK, this issue should have been
addressed in FLINK-15852 [1]. Could you share the client log of case
2 (setting the log level to DEBUG would be better)?

[1] https://issues.apache.org/jira/browse/FLINK-15852
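
In case it helps, with the default Log4j 2 setup of recent Flink distributions
(an assumption; adjust if you use a custom logging config), the client log level
can be raised by editing conf/log4j-cli.properties:

rootLogger.level = DEBUG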

Best,
Yangze Guo

On Sun, Apr 25, 2021 at 11:33 AM Tony Wei  wrote:
>
> Hi Experts,
>
> I recently tried to run yarn-application mode on my yarn cluster, and I had a 
> problem related to configuring `execution.target`.
> After reading the source code and doing some experiments, I found that there 
> should be some room for improvement in `FlinkYarnSessionCli` or 
> `AbstractYarnCli`.
>
> My experiments are:
>
> 1. setting `execution.target: yarn-application` in flink-conf.yaml and run 
> `flink run-application -t yarn-application`: run job successfully.
>    - `FlinkYarnSessionCli` is not active
>    - `GenericCLI` is active
>
> 2. setting `execution.target: yarn-per-job` in flink-conf.yaml and run `flink 
> run-application -t yarn-application`: run job failed
>    - failed due to `ClusterDeploymentException` [1]
>    - `FlinkYarnSessionCli` is active
>
> 3. setting `execution.target: yarn-application` in flink-conf.yaml and run 
> `flink run -t yarn-per-job`: run job successfully.
>    - `FlinkYarnSessionCli` is not active
>    - `GenericCLI` is active
>
> 4. setting `execution.target: yarn-per-job` in flink-conf.yaml and run `flink 
> run -t yarn-per-job`: run job successfully.
>    - `FlinkYarnSessionCli` is active
>
> From `AbstractYarnCli#isActive` [2] and `FlinkYarnSessionCli#isActive` [3], 
> `FlinkYarnSessionCli` will be active when `execution.target` is specified 
> with `yarn-per-job` or `yarn-session`.
>
> According to the flink official document [4], I thought the 2nd experiment 
> should also work well, but it didn't.
>>
>> The --target will overwrite the execution.target specified in the 
>> config/flink-config.yaml.
>
>
> The root cause is that `FlinkYarnSessionCli` only overwrites 
> `execution.target` with `yarn-session` or `yarn-per-job` [5], but not 
> `yarn-application`.
> So, my questions are:
>
> - should we use `FlinkYarnSessionCli` in case 2?
> - if we should, how can we improve `FlinkYarnSessionCli` so that we can 
> overwrite `execution.target` via `--target`?
>
> and one more improvement, the config description for `execution.target` [6] 
> should include `yarn-application` as well.
>
> [1] 
> https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L439-L447
> [2] 
> https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/cli/AbstractYarnCli.java#L54-L66
> [3] 
> https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/cli/FlinkYarnSessionCli.java#L373-L377
> [4] 
> https://ci.apache.org/projects/flink/flink-docs-stable/deployment/cli.html#selecting-deployment-targets
> [5] 
> https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/cli/FlinkYarnSessionCli.java#L397-L413
> [6] 
> https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/DeploymentOptions.java#L41-L46
>
> best regards,
>


Re: Kubernetes Setup - JM as job vs JM as deployment

2021-04-25 Thread Yangze Guo
Hi, Gil

IIUC, you want to deploy a Flink cluster using YAML files yourself and
want to know whether the JM should be deployed as a Job [1] or a
Deployment. If that is the case, as Matthias mentioned, Flink provides
two ways to integrate with K8s [2][3]; in [3] the JM will be deployed
as a Deployment.

[1] https://kubernetes.io/docs/concepts/workloads/controllers/job/
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html
[3] 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/standalone/kubernetes.html
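
As a rough illustration of the difference (heavily trimmed; the docs in [3]
contain the complete, authoritative YAML files, and the image tag below is just
an example), the session-cluster JM in [3] is declared roughly as:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
      - name: jobmanager
        image: flink:1.12.1   # any Flink image/tag you use
        args: ["jobmanager"]

A Kubernetes Job [1], by contrast, is meant to run to completion, which matches
a JM that should exit once its single application finishes.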

Best,
Yangze Guo

On Thu, Apr 22, 2021 at 10:46 PM Matthias Pohl  wrote:
>
> Hi Gil,
> I'm not sure whether I understand you correctly. What do you mean by 
> deploying the job manager as "job" or "deployment"? Are you referring to the 
> different deployment modes Flink offers [1]? These would be independent of 
> Kubernetes. Or do you wonder what the differences are between Flink on 
> Kubernetes (native) [2] vs Flink on Kubernetes (standalone using YAML files) [3]?
>
> Best,
> Matthias
>
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/#deployment-modes
> [2] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html
> [3] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/standalone/kubernetes.html
>
> On Wed, Apr 21, 2021 at 11:19 PM Gil Amsalem  wrote:
>>
>> Hi,
>>
>> I found that there are 2 different approaches to set up Flink on Kubernetes.
>> 1. Deploy job manager as Job.
>> 2. Deploy job manager as Deployment.
>>
>> What is the recommended way? What are the benefits of each?
>>
>> Thanks,
>> Gil Amsalem


Re: when should `FlinkYarnSessionCli` be included for parsing CLI arguments?

2021-04-26 Thread Yangze Guo
Hi, Till,

I agree that we need to resolve the issue by overriding the
configuration before selecting the CustomCommandLines. However, IIUC,
after FLINK-15852 the GenericCLI should always be the first choice.
Could you help me understand why the FlinkYarnSessionCli can still be
activated?


Best,
Yangze Guo

On Mon, Apr 26, 2021 at 4:48 PM Till Rohrmann  wrote:
>
> Hi Tony,
>
> I think you are right that Flink's cli does not behave super consistently at 
> the moment. Case 2 should definitely work because `-t yarn-application` 
> should overwrite what is defined in the Flink configuration. The problem 
> seems to be that we don't resolve the configuration wrt the specified command 
> line options before calling into `CustomCommandLine.isActive`. If we first parsed 
> the command line configuration options, which can overwrite 
> flink-conf.yaml options, and then replaced them, then the custom command lines 
> (assuming that they use the Configuration as the ground truth) should behave 
> consistently.
>
> For your questions:
>
> 1. I am not 100% sure. I think the FlinkYarnSessionCli wasn't used on purpose 
> when introducing the yarn application mode.
> 2. See answer 1.
>
> I think it is a good idea to extend the description of the config option 
> `execution.target`. Do you want to create a ticket and a PR for it?
>
> Cheers,
> Till
>
> On Mon, Apr 26, 2021 at 8:37 AM Yangze Guo  wrote:
>>
>> Hi, Tony.
>>
>> What is the version of your flink-dist. AFAIK, this issue should be
>> addressed in FLINK-15852[1]. Could you give the client log of case
>> 2(set the log level to DEBUG would be better).
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-15852
>>
>> Best,
>> Yangze Guo
>>
>> On Sun, Apr 25, 2021 at 11:33 AM Tony Wei  wrote:
>> >
>> > Hi Experts,
>> >
>> > I recently tried to run yarn-application mode on my yarn cluster, and I 
>> > had a problem related to configuring `execution.target`.
>> > After reading the source code and doing some experiments, I found that 
>> > there should be some room of improvement for `FlinkYarnSessionCli` or 
>> > `AbstractYarnCli`.
>> >
>> > My experiments are:
>> >
>> > setting `execution.target: yarn-application` in flink-conf.yaml and run 
>> > `flink run-application -t yarn-application`: run job successfully.
>> >
>> > `FlinkYarnSessionCli` is not active
>> > `GenericCLI` is active
>> >
>> > setting `execution.target: yarn-per-job` in flink-conf.yaml and run `flink 
>> > run-application -t yarn-application`: run job failed
>> >
>> > failed due to `ClusterDeploymentException` [1]
>> > `FlinkYarnSessionCli` is active
>> >
>> > setting `execution.target: yarn-application` in flink-conf.yaml and run 
>> > `flink run -t yarn-per-job`: run job successfully.
>> >
>> > `FlinkYarnSessionCli` is not active
>> > `GenericCLI` is active
>> >
>> > setting `execution.target: yarn-per-job` in flink-conf.yaml and run `flink 
>> > run -t yarn-per-job`: run job successfully.
>> >
>> > `FlinkYarnSessionCli` is active
>> >
>> > From `AbstractYarnCli#isActive` [2] and `FlinkYarnSessionCli#isActive` 
>> > [3], `FlinkYarnSessionCli` will be active when `execution.target` is 
>> > specified with `yarn-per-job` or `yarn-session`.
>> >
>> > According to the flink official document [4], I thought the 2nd experiment 
>> > should also work well, but it didn't.
>> >>
>> >> The --target will overwrite the execution.target specified in the 
>> >> config/flink-config.yaml.
>> >
>> >
>> > The root cause is that `FlinkYarnSessionCli` only overwrite the 
>> > `execution.target` with `yarn-session` or `yarn-per-job` [5], but no 
>> > `yarn-application`.
>> > So, my question is
>> >
>> > should we use `FlinkYarnSessionCli` in case 2?
>> > if we should, how we can improve `FlinkYarnSessionCli` so that we can 
>> > overwrite `execution.target` via `--target`?
>> >
>> > and one more improvement, the config description for `execution.target` 
>> > [6] should include `yarn-application` as well.
>> >
>> > [1] 
>> > https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L439-L447
>> > [2] 
>> > https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/cli/AbstractYarnCli.java#L54-L66
>> > [3] 
>> > https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/cli/FlinkYarnSessionCli.java#L373-L377
>> > [4] 
>> > https://ci.apache.org/projects/flink/flink-docs-stable/deployment/cli.html#selecting-deployment-targets
>> > [5] 
>> > https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/cli/FlinkYarnSessionCli.java#L397-L413
>> > [6] 
>> > https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/DeploymentOptions.java#L41-L46
>> >
>> > best regards,
>> >


Re: when should `FlinkYarnSessionCli` be included for parsing CLI arguments?

2021-04-26 Thread Yangze Guo
If the GenericCLI is selected, then the execution.target should have
been overwritten to "yarn-application" in GenericCLI#toConfiguration.
It is odd that GenericCLI#isActive returns false, as the
execution.target is defined in both flink-conf and the command line.

Best,
Yangze Guo

On Mon, Apr 26, 2021 at 5:14 PM Till Rohrmann  wrote:
>
> I think you are right that the `GenericCLI` should be the first choice. From 
> the top of my head I do not remember why FlinkYarnSessionCli is still used. 
> Maybe it is in order to support some Yarn specific cli option parsing. I 
> assume it is either an oversight or some parsing has not been completely 
> migrated to the GenericCLI.
>
> Cheers,
> Till
>
> On Mon, Apr 26, 2021 at 11:07 AM Yangze Guo  wrote:
>>
>> Hi, Till,
>>
>> I agree that we need to resolve the issue by overriding the
>> configuration before selecting the CustomCommandLines. However, IIUC,
>> after FLINK-15852 the GenericCLI should always be the first choice.
>> Could you help me to understand why the FlinkYarnSessionCli can be
>> activated?
>>
>>
>> Best,
>> Yangze Guo
>>
>> On Mon, Apr 26, 2021 at 4:48 PM Till Rohrmann  wrote:
>> >
>> > Hi Tony,
>> >
>> > I think you are right that Flink's cli does not behave super consistent at 
>> > the moment. Case 2. should definitely work because `-t yarn-application` 
>> > should overwrite what is defined in the Flink configuration. The problem 
>> > seems to be that we don't resolve the configuration wrt the specified 
>> > command line options before calling into `CustomCommandLine.isActive`. If 
>> > we parsed first the command line configuration options which can overwrite 
>> > flink-conf.yaml options and then replaced them, then the custom command 
>> > lines (assuming that they use the Configuration as the ground truth) 
>> > should behave consistently.
>> >
>> > For your questions:
>> >
>> > 1. I am not 100% sure. I think the FlinkYarnSessionCli wasn't used on 
>> > purpose when introducing the yarn application mode.
>> > 2. See answer 1.
>> >
>> > I think it is a good idea to extend the description of the config option 
>> > `execution.target`. Do you want to create a ticket and a PR for it?
>> >
>> > Cheers,
>> > Till
>> >
>> > On Mon, Apr 26, 2021 at 8:37 AM Yangze Guo  wrote:
>> >>
>> >> Hi, Tony.
>> >>
>> >> What is the version of your flink-dist. AFAIK, this issue should be
>> >> addressed in FLINK-15852[1]. Could you give the client log of case
>> >> 2(set the log level to DEBUG would be better).
>> >>
>> >> [1] https://issues.apache.org/jira/browse/FLINK-15852
>> >>
>> >> Best,
>> >> Yangze Guo
>> >>
>> >> On Sun, Apr 25, 2021 at 11:33 AM Tony Wei  wrote:
>> >> >
>> >> > Hi Experts,
>> >> >
>> >> > I recently tried to run yarn-application mode on my yarn cluster, and I 
>> >> > had a problem related to configuring `execution.target`.
>> >> > After reading the source code and doing some experiments, I found that 
>> >> > there should be some room of improvement for `FlinkYarnSessionCli` or 
>> >> > `AbstractYarnCli`.
>> >> >
>> >> > My experiments are:
>> >> >
>> >> > setting `execution.target: yarn-application` in flink-conf.yaml and run 
>> >> > `flink run-application -t yarn-application`: run job successfully.
>> >> >
>> >> > `FlinkYarnSessionCli` is not active
>> >> > `GenericCLI` is active
>> >> >
>> >> > setting `execution.target: yarn-per-job` in flink-conf.yaml and run 
>> >> > `flink run-application -t yarn-application`: run job failed
>> >> >
>> >> > failed due to `ClusterDeploymentException` [1]
>> >> > `FlinkYarnSessionCli` is active
>> >> >
>> >> > setting `execution.target: yarn-application` in flink-conf.yaml and run 
>> >> > `flink run -t yarn-per-job`: run job successfully.
>> >> >
>> >> > `FlinkYarnSessionCli` is not active
>> >> > `GenericCLI` is active
>> >> >
>> >> > setting `execution.target: yarn-per-job` in flink-conf.yaml and run 
>> >> > `flink run -t yarn-per-job`: run job successfully.
>> >> >
>> >> > `F

Re: Deployment/Memory Configuration/Scalability

2021-04-26 Thread Yangze Guo
Hi, Radoslav,

> 1. Is it a good idea to have regular savepoints (say on a daily basis)?
> 2. Is it possible to have high availability with Per-Job mode? Or maybe I 
> should go with session mode and make sure that my flink cluster is running a 
> single job?

Yes, we can achieve HA with per-job mode with ZooKeeper[1]. Looking at
your configuration, you also need to enable checkpointing[2], which is
triggered automatically and lets the program resume after a failure, by
setting execution.checkpointing.interval.
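
For example, a minimal flink-conf.yaml sketch (the ZooKeeper quorum, the
storage path and the interval below are placeholders, not taken from
your setup):

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.storageDir: s3://bucket/flink-ha
execution.checkpointing.interval: 10 min

With this, the JobManager metadata goes to the HA storage dir and the
job can resume from the latest checkpoint after a failure or a
JobManager failover.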

> 3. Let's assume that savepoints should be triggered only before job 
> update/deployment. How can I trigger a savepoint if my job is already 
> consuming more than 80% of the allowed memory per pod in k8s? My observations 
> show that k8s kills task managers (which are running as pods) and I need to 
> retry it a couple of times.

I think that with checkpointing enabled, you no longer need to trigger
savepoints manually under a specific condition, as checkpoints are
triggered periodically.

> 4. Should I consider upgrading to version 1.12.3?
> 5. Should I consider switching off state.backend.rocksdb.memory.managed 
> property even in version 1.12.3?

I'm not an expert on the state backend, but it seems the fix for that
issue is only applied to the Docker image. So I guess you can package a
custom image yourself if you do not want to upgrade. However, if you are
using the Native K8S mode[3] and there is no compatibility issue, I
think it might be good to upgrade because there are also lots of
improvements[4] in 1.12.

> 6. How do I decide when the job parallelism should be increased? Are there 
> some metrics which can lead me to a clue that the parallelism should be 
> increased?

As there are 6 Kafka sources in your job, I think the parallelism should
first be aligned with the topic partition counts. For metrics, you could
refer to the backpressure of tasks and numRecordsOutPerSecond[5].

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/ha/zookeeper_ha/
[2] 
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html
[3] 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/native_kubernetes.html
[4] https://issues.apache.org/jira/browse/FLINK-17709
[5] 
https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/metrics.html#io

Best,
Yangze Guo

On Mon, Apr 26, 2021 at 4:14 PM Radoslav Smilyanov
 wrote:
>
> Hi all,
>
> I am having multiple questions regarding Flink :) Let me give you some 
> background of what I have done so far.
>
> Description
> I am using Flink 1.11.2. My job is doing data enrichment. Data is consumed 
> from 6 different kafka topics and it is joined via multiple 
> CoProcessFunctions. On a daily basis the job is handling ~20 millions events 
> from the source kafka topics.
>
> Configuration
> These are the settings I am using:
>
> jobmanager.memory.process.size: 4096m
> jobmanager.memory.off-heap.size: 512m
> taskmanager.memory.process.size: 12000m
> taskmanager.memory.task.off-heap.size: 512m
> taskmanager.numberOfTaskSlots: 1
> parallelism.default: 5
> taskmanager.rpc.port: 6122
> jobmanager.execution.failover-strategy: region
> state.backend: rocksdb
> state.backend.incremental: true
> state.backend.rocksdb.localdir: /opt/flink/rocksdb
> state.backend.rocksdb.memory.managed: true
> state.backend.rocksdb.predefined-options: FLASH_SSD_OPTIMIZED
> state.backend.rocksdb.block.cache-size: 64mb
> state.checkpoints.dir: s3://bucket/checkpoints
> state.savepoints.dir: s3://bucket/savepoints
> s3.access-key: AWS_ACCESS_KEY_ID
> s3.secret-key: AWS_SECRET_ACCESS_KEY
> s3.endpoint: http://
> s3.path.style.access: true
> s3.entropy.key: _entropy_
> s3.entropy.length: 8
> presto.s3.socket-timeout: 10m
> client.timeout: 60min
>
> Deployment setup
> Flink is deployed in k8s with Per-Job mode having 1 job manager and 5 task 
> managers. I have a daily cron job which triggers savepoint in order to have a 
> fresh copy of the whole state.
>
> Problems with the existing setup
> 1. I observe that savepoints are causing Flink to consume more than the 
> allowed memory. I observe the behavior described in this stackoverflow post 
> (which seems to be solved in 1.12.X if I am getting it right).
> 2. I cannot achieve high availability with Per-Job mode and thus I ended up 
> having a regular savepoint on a daily basis.
>
> Questions
> 1. Is it a good idea to have regular savepoints (say on a daily basis)?
> 2. Is it possible to have high availability with Per-Job mode? Or maybe I 
> should go with session mode and make sure that my flink cluster is running a 
> single job?
> 3. Let's assume that savepoints should be triggered only before job 
> update/deployment. How

Re: [ANNOUNCE] Apache Flink 1.13.0 released

2021-05-07 Thread Yangze Guo
Thanks, Dawid & Guowei for the great work, thanks to everyone involved.

Best,
Yangze Guo

On Thu, May 6, 2021 at 5:51 PM Rui Li  wrote:
>
> Thanks to Dawid and Guowei for the great work!
>
> On Thu, May 6, 2021 at 4:48 PM Zhu Zhu  wrote:
>>
>> Thanks Dawid and Guowei for being the release managers! And thanks everyone 
>> who has made this release possible!
>>
>> Thanks,
>> Zhu
>>
>>> Yun Tang  wrote on Thu, May 6, 2021 at 2:30 PM:
>>>
>>> Thanks for Dawid and Guowei's great work, and thanks for everyone involved 
>>> for this release.
>>>
>>> Best
>>> Yun Tang
>>> 
>>> From: Xintong Song 
>>> Sent: Thursday, May 6, 2021 12:08
>>> To: user ; dev 
>>> Subject: Re: [ANNOUNCE] Apache Flink 1.13.0 released
>>>
>>> Thanks Dawid & Guowei as the release managers, and everyone who has
>>> contributed to this release.
>>>
>>>
>>> Thank you~
>>>
>>> Xintong Song
>>>
>>>
>>>
>>> On Thu, May 6, 2021 at 9:51 AM Leonard Xu  wrote:
>>>
>>> > Thanks Dawid & Guowei for the great work, thanks everyone involved.
>>> >
>>> > Best,
>>> > Leonard
>>> >
>>> > On May 5, 2021, at 17:12, Theo Diefenthal  wrote:
>>> >
>>> > Thanks for managing the release. +1. I like the focus on improving
>>> > operations with this version.
>>> >
>>> > --
>>> > *From: *"Matthias Pohl" 
>>> > *To: *"Etienne Chauchot" 
>>> > *CC: *"dev" , "Dawid Wysakowicz" <
>>> > dwysakow...@apache.org>, "user" ,
>>> > annou...@apache.org
>>> > *Sent: *Tuesday, May 4, 2021 21:53:31
>>> > *Subject: *Re: [ANNOUNCE] Apache Flink 1.13.0 released
>>> >
>>> > Yes, thanks for managing the release, Dawid & Guowei! +1
>>> >
>>> > On Tue, May 4, 2021 at 4:20 PM Etienne Chauchot 
>>> > wrote:
>>> >
>>> >> Congrats to everyone involved !
>>> >>
>>> >> Best
>>> >>
>>> >> Etienne
>>> >> On 03/05/2021 15:38, Dawid Wysakowicz wrote:
>>> >>
>>> >> The Apache Flink community is very happy to announce the release of
>>> >> Apache Flink 1.13.0.
>>> >>
>>> >> Apache Flink® is an open-source stream processing framework for
>>> >> distributed, high-performing, always-available, and accurate data 
>>> >> streaming
>>> >> applications.
>>> >>
>>> >> The release is available for download at:
>>> >> https://flink.apache.org/downloads.html
>>> >>
>>> >> Please check out the release blog post for an overview of the
>>> >> improvements for this bugfix release:
>>> >> https://flink.apache.org/news/2021/05/03/release-1.13.0.html
>>> >>
>>> >> The full release notes are available in Jira:
>>> >>
>>> >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12349287
>>> >>
>>> >> We would like to thank all contributors of the Apache Flink community who
>>> >> made this release possible!
>>> >>
>>> >> Regards,
>>> >> Guowei & Dawid
>>> >>
>>> >>
>>> >
>>> >
>
>
>
> --
> Best regards!
> Rui Li


Re: Enabling Checkpointing using FsStatebackend

2021-05-07 Thread Yangze Guo
Hi,

I think the checkpointing is not the root cause of your job failure.
As the log shows, your job failed due to a Kafka authorization
issue: "Caused by:
org.apache.kafka.common.errors.TransactionalIdAuthorizationException:
Transactional Id authorization failed."

Best,
Yangze Guo

On Fri, May 7, 2021 at 11:29 PM sudhansu jena
 wrote:
>
> Hi Team,
>
> We have recently enabled checking pointing using FsStateBackend where we are 
> trying to use S3 bucket as the persistent storage but after enabling it we 
> are running into issues while submitting the job into the cluster.
>
> Can you please let us know if we are missing anything ?
>
>
> Below is the code sample  for enabling Checkpointing.
>
> env.setStateBackend(new FsStateBackend("s3://flinkcheckpointing/fhirmapper"));
> env.enableCheckpointing(1000);
>
>
>
> Below logs for the issue.
>
>
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=10, 
> backoffTimeMS=15000)
> at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:118)
> at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:80)
> at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:233)
> at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:224)
> at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:215)
> at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:669)
> at 
> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:89)
> at 
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:447)
> at jdk.internal.reflect.GeneratedMethodAccessor366.invoke(Unknown Source)
> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown 
> Source)
> at java.base/java.lang.reflect.Method.invoke(Unknown Source)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
> at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> at akka.actor.Actor.aroundReceive(Actor.scala:517)
> at akka.actor.Actor.aroundReceive$(Actor.scala:515)
> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: 
> org.apache.kafka.common.errors.TransactionalIdAuthorizationException: 
> Transactional Id authorization failed.
>
>
> Thanks,
> Sudhansu
>
>


Re: How to increase the number of task managers?

2021-05-07 Thread Yangze Guo
Hi,

> I wonder if I can tune the number of task managers? Is there a corresponding 
> config?

With the K8S/Yarn resource providers, the task managers are allocated on
demand. So, their number depends on the max parallelism and the slot
sharing group topology of your job.
In standalone mode, you need to configure the "conf/workers" file in
your Flink distribution to decide the number of task managers[3].
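
For example, a sketch of such a workers file (the hostnames are
placeholders); bin/start-cluster.sh starts one TaskManager per line:

# conf/workers
tm-host-1
tm-host-2
tm-host-3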

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/
[2] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/yarn/
[3] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/overview/#starting-and-stopping-a-cluster

Best,
Yangze Guo



On Fri, May 7, 2021 at 7:34 PM Tamir Sagi  wrote:
>
> Hey
>
> num of TMs = parallelism / num of slots
>
> parallelism.default is another config you should consider.
>
> Read also
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/parallel/
>
>
> 
> From: Yik San Chan 
> Sent: Friday, May 7, 2021 1:56 PM
> To: user 
> Subject: How to increase the number of task managers?
>
>
> EXTERNAL EMAIL
>
>
>
> Hi community,
>
> According to the 
> [docs](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/config/):
>
> > taskmanager.numberOfTaskSlots: The number of slots that a TaskManager 
> > offers (default: 1). Each slot can take one task or pipeline. Having 
> > multiple slots in a TaskManager can help amortize certain constant 
> > overheads (of the JVM, application libraries, or network connections) 
> > across parallel tasks or pipelines. See the Task Slots and Resources 
> > concepts section for details.
>
> > Running more smaller TaskManagers with one slot each is a good starting 
> > point and leads to the best isolation between tasks. Dedicating the same 
> > resources to fewer larger TaskManagers with more slots can help to increase 
> > resource utilization, at the cost of weaker isolation between the tasks 
> > (more tasks share the same JVM).
>
> We're able to tune slot count by setting taskmanager.numberOfTaskSlots, that 
> may help parallelize my task.
>
> I wonder if I can tune the number of task managers? Is there a corresponding 
> config?
>
> Best,
> Yik San
>
>


Re: Session mode on Kubernetes and # of TMs

2021-05-10 Thread Yangze Guo
Hi, Youngwoo

In a K8S session cluster, the number of TMs depends on how many slots
your job needs and the number of slots per task manager (config key:
taskmanager.numberOfTaskSlots). In this case,

# of TM  = Ceil(total slots needed / taskmanager.numberOfTaskSlots)

How many slots are needed depends on your job's topology and
parallelism. For streaming SQL, the whole job graph is placed in one
slot sharing group by default. So, the number of slots would be equal to
the parallelism you set.
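
For example (the numbers are only an illustration): an INSERT query
running with parallelism 8 and taskmanager.numberOfTaskSlots set to 2
would make the session request Ceil(8 / 2) = 4 TaskManagers.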

> Is it possible to run pre-spawned TMs for session mode? I'm looking for a way 
> to scale the computing resources. i.e., # of TM for the jobs.

I might not fully understand your problem. Do you mean starting TMs
before submitting the job? If that is the case,
- You can try the standalone k8s mode. [1]
- Warm up the session by submitting some dummy jobs yourself and then
submit your job before those TMs hit the idle timeout.
- In FLINK-15959, we will introduce a minimum number of slots for the
cluster. With this feature, you can configure how many TMs are needed
before submitting the jobs.

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/

Best,
Yangze Guo

On Tue, May 11, 2021 at 12:24 PM Youngwoo Kim (김영우)  wrote:
>
> Hi,
>
> I have deployed a cluster with session mode on kubernetes and I can see one 
> deployment, services and one JM. I'm trying to run a SQL query through sql 
> client. for instance, 'INSERT INTO ... SELECT ...;'
>
> When I run the query in cli, the Flink session is spinning up a TM for the 
> query and then the query is running in a job.
>
> Now, I'm curious. How does Flink calculate the number of TMs for the query? 
> and also, Is it possible to run pre-spawned TMs for session mode? I'm looking 
> for a way to scale the computing resources. i.e., # of TM for the jobs.
>
> Thanks,
> Youngwoo


Re: Customized Metric Reporter can not be found by Flink

2021-05-11 Thread Yangze Guo
Hi, Fan

Flink loads custom reporters through the Java service loader mechanism.[1]
Did you add the service file to the "resources/META-INF/services" directory?
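
As a minimal sketch (the package and class names below come from your
configuration and are assumptions about your code; only
MetricReporterFactory/MetricReporter are Flink API):

// DiagnosticsMessageReporterFactory.java
package org.apache.flink.metrics.reporter;

import java.util.Properties;

public class DiagnosticsMessageReporterFactory implements MetricReporterFactory {
    @Override
    public MetricReporter createMetricReporter(Properties properties) {
        // DiagnosticsMessageReporter is assumed to be your existing reporter class
        return new DiagnosticsMessageReporter();
    }
}

// src/main/resources/META-INF/services/org.apache.flink.metrics.reporter.MetricReporterFactory
// (a plain-text file whose content is the factory's fully qualified name)
org.apache.flink.metrics.reporter.DiagnosticsMessageReporterFactory

The service file must end up inside the jar you copy to the plugins/
directory, next to the reporter classes.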

[1] https://docs.oracle.com/javase/9/docs/api/java/util/ServiceLoader.html

Best,
Yangze Guo

On Wed, May 12, 2021 at 7:53 AM Fan Xie  wrote:
>
> Hi Flink Community,
>
> Recently I implemented a customized metric reporter (named: 
> DiagnosticsMessageReporter) to report Flink metrics to a Kafka topic. I built 
> this reporter into a jar file and copy it to 
> /opt/flink/plugins/DiagnosticsMessageReporter/DiagnosticsMessageReporter.jar 
> for both the Job Manager and task manager's containers. But later on I found 
> the following logs indicated that the metric reporter can not be loaded:
>
> 2021-05-11 23:08:31,523 WARN  org.apache.flink.runtime.metrics.ReporterSetup  
>  [] - The reporter factory 
> (org.apache.flink.metrics.reporter.DiagnosticsMessageReporterFactory) could 
> not be found for reporter DiagnosticsMessageReporter. Available factories: 
> [org.apache.flink.metrics.datadog.DatadogHttpReporterFactory, 
> org.apache.flink.metrics.slf4j.Slf4jReporterFactory, 
> org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporterFactory, 
> org.apache.flink.metrics.graphite.GraphiteReporterFactory, 
> org.apache.flink.metrics.statsd.StatsDReporterFactory, 
> org.apache.flink.metrics.prometheus.PrometheusReporterFactory, 
> org.apache.flink.metrics.jmx.JMXReporterFactory, 
> org.apache.flink.metrics.influxdb.InfluxdbReporterFactory].
> 2021-05-11 23:21:55,698 INFO  
> org.apache.flink.runtime.metrics.MetricRegistryImpl  [] - No metrics 
> reporter configured, no metrics will be exposed/reported.
>
> The Flink configs I used are as following:
>
> #DiagnosticsMessageReporter configs
> metrics.reporters: DiagnosticsMessageReporter
> metrics.reporter.DiagnosticsMessageReporter.factory.class: 
> org.apache.flink.metrics.reporter.DiagnosticsMessageReporterFactory
> metrics.reporter.DiagnosticsMessageReporter.bootstrap.servers: kafka:9092
> metrics.reporter.DiagnosticsMessageReporter.topic: flink-metrics
> metrics.reporter.DiagnosticsMessageReporter.keyBy: task_attempt_id
> metrics.reporter.DiagnosticsMessageReporter.interval: 1 SECONDS
>
> Does anyone have any idea about what happened here? Am I missing some of the 
> steps to load the customized reporter as a plugin? Really appreciate if 
> someone can help to take a look at this!
>
> Best,
> Fan
>


Re: Root Exception can not be shown on Web UI in Flink 1.13.0

2021-05-12 Thread Yangze Guo
Hi, it seems to be related to FLINK-22276. Thus, I'd involve Matthias
to take a look.

@Matthias My gut feeling is that not every execution that has a
failureInfo has been deployed yet?

Best,
Yangze Guo

On Wed, May 12, 2021 at 10:12 PM Gary Wu  wrote:
>
> Hi,
>
> We have upgraded our Flink applications to 1.13.0 but we found that Root 
> Exception can not be shown on Web UI with an internal server error message. 
> After opening browser development console and trace the message, we found 
> that there is a exception in jobmanager:
>
> 2021-05-12 13:30:45,589 ERROR 
> org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler [] - Unhandled 
> exception.
> java.lang.IllegalArgumentException: The location must not be null for a 
> non-global failure.
> at 
> org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:138) 
> ~[flink-dist_2.12-1.13.0.jar:1.13.0]
> at 
> org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler.assertLocalExceptionInfo(JobExceptionsHandler.java:218)
>  ~[flink-dist_2.12-1.13.0.jar:1.13.0]
> at 
> org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler.createRootExceptionInfo(JobExceptionsHandler.java:191)
>  ~[flink-dist_2.12-1.13.0.jar:1.13.0]
> at 
> java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) 
> ~[?:?]
> at java.util.stream.SliceOps$1$1.accept(SliceOps.java:199) ~[?:?]
> at 
> java.util.ArrayList$ArrayListSpliterator.tryAdvance(ArrayList.java:1632) 
> ~[?:?]
> at 
> java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:127)
>  ~[?:?]
> at 
> java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:502)
>  ~[?:?]
> at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:488) 
> ~[?:?]
> at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) 
> ~[?:?]
> at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) 
> ~[?:?]
> at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) 
> ~[?:?]
> at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) 
> ~[?:?]
> at 
> org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler.createJobExceptionHistory(JobExceptionsHandler.java:169)
>  ~[flink-dist_2.12-1.13.0.jar:1.13.0]
> at 
> org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler.createJobExceptionsInfo(JobExceptionsHandler.java:154)
>  ~[flink-dist_2.12-1.13.0.jar:1.13.0]
> at 
> org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler.handleRequest(JobExceptionsHandler.java:101)
>  ~[flink-dist_2.12-1.13.0.jar:1.13.0]
> at 
> org.apache.flink.runtime.rest.handler.job.JobExceptionsHandler.handleRequest(JobExceptionsHandler.java:63)
>  ~[flink-dist_2.12-1.13.0.jar:1.13.0]
> at 
> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$0(AbstractExecutionGraphHandler.java:87)
>  ~[flink-dist_2.12-1.13.0.jar:1.13.0]
> at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642)
>  [?:?]
> at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>  [?:?]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
> at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>  [?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  [?:?]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  [?:?]
> at java.lang.Thread.run(Thread.java:834) [?:?]
>
> We would like to check Is there any configuration change should be done for 
> the application? Thanks!
>
> Regards,
> -Gary
>
>
>


Re: ES sink never receive error code

2021-05-20 Thread Yangze Guo
> So, ES BulkProcessor retried after bulk request was partially rejected. And 
> eventually that request was sent successfully? That is why failure handler 
> was not called?

If the bulk request fails after the max number of retries
(bulk.flush.backoff.retries), the failure handler will still be
called.
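
As a rough sketch of where those knobs live on the DataStream ES sink
builder (assuming esSinkBuilder is the ElasticsearchSink.Builder you
already create; the values are only an illustration):

esSinkBuilder.setBulkFlushBackoff(true);
esSinkBuilder.setBulkFlushBackoffType(ElasticsearchSinkBase.FlushBackoffType.EXPONENTIAL);
esSinkBuilder.setBulkFlushBackoffRetries(5);   // failure handler is only invoked after these retries are exhausted
esSinkBuilder.setBulkFlushBackoffDelay(1000L); // initial backoff delay in ms
esSinkBuilder.setFailureHandler(new RetryRejectedExecutionFailureHandler());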


Best,
Yangze Guo

On Fri, May 21, 2021 at 5:53 AM Qihua Yang  wrote:
>
> Thank you for the reply!
> Yes, we did config bulk.flush.backoff.enable.
> So, ES BulkProcessor retried after bulk request was partially rejected. And 
> eventually that request was sent successfully? That is why failure handler 
> was not called?
>
> Thanks,
> Qihua
>
> On Thu, May 20, 2021 at 2:22 PM Roman Khachatryan  wrote:
>>
>> Hi,
>>
>> Have you tried to change bulk.flush.backoff.enable?
>> According to the docs [1], the underlying ES BulkProcessor will retry
>> (by default), so the provided failure handler might not be called.
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-stable/docs/connectors/datastream/elasticsearch/#configuring-the-internal-bulk-processor
>>
>> Regards,
>> Roman
>>
>> On Thu, May 20, 2021 at 10:08 PM Qihua Yang  wrote:
>> >
>> > Hello,
>> > We are using flink-connector-elasticsearch6_2.11 to ingest stream data to 
>> > ES by using bulk requests. From ES metrics, we observed some bulk thread 
>> > pool rejections. Contacted AWS team, their explanation is part of bulk 
>> > request was rejected. Response body should include status for each item. 
>> > For bulk thread pool rejection, the error code is 429.
>> > Our flink app override FailureHandler to process error cases.
>> > I checked Flink code, it has AfterBulk() method to handle item errors. 
>> > FailureHandler() never received any 429 error.
>> > Is that flink issue? Or we need to config something to make it work?
>> > Thanks,
>> >
>> > Qihua


Re: Issues with forwarding environment variables

2021-05-20 Thread Yangze Guo
Hi, Milind

Could you help to provide the skeleton of your job code? Actually, if
you implement a custom function, like Tokenizer in the WordCount
example, the class member will be initialized on the client side and
then serialized to the task manager. As a result, neither the system
envs nor the system properties of the TaskManager will be used.

If that is the case, you can initialize the `serviceName` field in the
map/flatMap or open() method. Then, it will read the TM's envs or
properties instead.
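
A minimal sketch of that pattern (the class name and the map logic are
made up for illustration; only SERVICE_NAME comes from your setup):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class ServiceNameEnricher extends RichMapFunction<String, String> {

    // not assigned on the client, so nothing stale gets serialized
    private transient String serviceName;

    @Override
    public void open(Configuration parameters) {
        // open() runs on the TaskManager, so this reads the TM's environment
        serviceName = System.getenv("SERVICE_NAME");
        // or, if forwarded via env.java.opts.taskmanager as -DSERVICE_NAME=...:
        // serviceName = System.getProperty("SERVICE_NAME");
    }

    @Override
    public String map(String value) {
        return serviceName + ":" + value;
    }
}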

Best,
Yangze Guo


On Fri, May 21, 2021 at 5:40 AM Milind Vaidya  wrote:
>
> This is java code. I have a flink job running and it is trying to fetch this 
> variable at run time itself. I see the properties getting reflected in the 
> logs as already mentioned but not visible from the code.
>
> On Thu, May 20, 2021 at 1:53 PM Roman Khachatryan  wrote:
>>
>> > private String serviceName = System.getenv("SERVICE_NAME");
>> Is it a scala object? If so, it can be initialized before any
>> properties are set.
>> What happens if the variable/property is read later at run time?
>>
>> Regards,
>> Roman
>>
>> On Thu, May 20, 2021 at 10:41 PM Milind Vaidya  wrote:
>> >
>> > here are the entries from taskmanager logs
>> >
>> > 2021-05-20 13:34:13,739 INFO 
>> > org.apache.flink.configuration.GlobalConfiguration - Loading configuration 
>> > property: env.java.opts.taskmanager, 
>> > "-DSERVICE_NAME=hello-test,-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
>> > 2021-05-20 13:34:13,740 INFO 
>> > org.apache.flink.configuration.GlobalConfiguration - Loading configuration 
>> > property: jobmanager.execution.failover-strategy, region
>> > 2021-05-20 13:34:13,742 INFO 
>> > org.apache.flink.configuration.GlobalConfiguration - Loading configuration 
>> > property: containerized.taskmanager.env.SERVICE_NAME, "hello-test"
>> > 2021-05-20 13:34:13,743 INFO 
>> > org.apache.flink.configuration.GlobalConfiguration - Loading configuration 
>> > property: containerized.master.env.SERVICE_NAME, "hello-test"
>> >
>> > But the error still persists
>> >
>> >
>> > On Thu, May 20, 2021 at 1:20 PM Roman Khachatryan  wrote:
>> >>
>> >> Thanks, it should work. I've created a ticket to track the issue [1].
>> >> Could you please specify Flink and Yarn versions you are using?
>> >>
>> >> You can also use properties (which don't depend on Yarn integration),
>> >> for example like this:
>> >> In flink-conf.yaml: env.java.opts.taskmanager: -DSERVICE_NAME=...
>> >> In the application: System.getProperty("SERVICE_NAME");
>> >>
>> >> Regards,
>> >> Roman
>> >>
>> >> On Thu, May 20, 2021 at 9:50 PM Milind Vaidya  wrote:
>> >> >
>> >> >
>> >> > Hi Roman,
>> >> >
>> >> > I have added following lines to conf/flink-conf.yaml
>> >> >
>> >> > containerized.taskmanager.env.SERVICE_NAME: "test_service_name"
>> >> > containerized.master.env.SERVICE_NAME: "test_service_name"
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Thu, May 20, 2021 at 12:30 PM Roman Khachatryan  
>> >> > wrote:
>> >> >>
>> >> >> Hi,
>> >> >>
>> >> >> Could you please share the relevant parts of your flink-conf.yaml?
>> >> >>
>> >> >> Regards,
>> >> >> Roman
>> >> >>
>> >> >> On Thu, May 20, 2021 at 9:13 PM Milind Vaidya  
>> >> >> wrote:
>> >> >> >
>> >> >> > Hi
>> >> >> >
>> >> >> > Need to forward a few env variables to Job and Task manager.
>> >> >> > I am running jobs in Yarn cluster
>> >> >> > I was referring to this : Forwarding
>> >> >> >
>> >> >> > I also found Stack Overflow
>> >> >> >
>> >> >> > I was able to configure and see the variables in Flink Dashboard
>> >> >> >
>> >> >> > But the task manager logs stills says
>> >> >> >
>> >> >> > `The system environment variable SERVICE_NAME is missing` as an 
>> >> >> > exception message.
>> >> >> >
>> >> >> > The code trying to fetch it is as follows
>> >> >> >
>> >> >> > private String serviceName = System.getenv("SERVICE_NAME");
>> >> >> >
>> >> >> > Is the fetched one not the same as set one ? How to set / fetch 
>> >> >> > environment variables in such case ?
>> >> >> >


Re: ES sink never receive error code

2021-05-24 Thread Yangze Guo
Jacky is right. It's a known issue and will be fixed in FLINK-21511.

Best,
Yangze Guo

On Tue, May 25, 2021 at 8:40 AM Jacky Yin 殷传旺  wrote:
>
> If you are using es connector 6.*, actually there is a deadlock bug if the 
> backoff is enabled. The 'retry' and 'flush' share one thread pool which has 
> only one thread. Sometimes the one holding the thread tries to get the 
> semaphore which is hold by the one who tries to get the thread. Therefore 
> please upgrade to connector 7.*.
>
> 
> 发件人: Qihua Yang 
> 发送时间: 2021年5月24日 23:17
> 收件人: Yangze Guo 
> 抄送: ro...@apache.org ; user 
> 主题: Re: ES sink never receive error code
>
> Got it! thanks for helping.
>
> On Thu, May 20, 2021 at 7:15 PM Yangze Guo  wrote:
>
> > So, ES BulkProcessor retried after bulk request was partially rejected. And 
> > eventually that request was sent successfully? That is why failure handler 
> > was not called?
>
> If the bulk request fails after the max number of retries
> (bulk.flush.backoff.retries), the failure handler will still be
> called.
>
>
> Best,
> Yangze Guo
>
> On Fri, May 21, 2021 at 5:53 AM Qihua Yang  wrote:
> >
> > Thank you for the reply!
> > Yes, we did config bulk.flush.backoff.enable.
> > So, ES BulkProcessor retried after bulk request was partially rejected. And 
> > eventually that request was sent successfully? That is why failure handler 
> > was not called?
> >
> > Thanks,
> > Qihua
> >
> > On Thu, May 20, 2021 at 2:22 PM Roman Khachatryan  wrote:
> >>
> >> Hi,
> >>
> >> Have you tried to change bulk.flush.backoff.enable?
> >> According to the docs [1], the underlying ES BulkProcessor will retry
> >> (by default), so the provided failure handler might not be called.
> >>
> >> [1]
> >> https://ci.apache.org/projects/flink/flink-docs-stable/docs/connectors/datastream/elasticsearch/#configuring-the-internal-bulk-processor
> >>
> >> Regards,
> >> Roman
> >>
> >> On Thu, May 20, 2021 at 10:08 PM Qihua Yang  wrote:
> >> >
> >> > Hello,
> >> > We are using flink-connector-elasticsearch6_2.11 to ingest stream data 
> >> > to ES by using bulk requests. From ES metrics, we observed some bulk 
> >> > thread pool rejections. Contacted AWS team, their explanation is part of 
> >> > bulk request was rejected. Response body should include status for each 
> >> > item. For bulk thread pool rejection, the error code is 429.
> >> > Our flink app override FailureHandler to process error cases.
> >> > I checked Flink code, it has AfterBulk() method to handle item errors. 
> >> > FailureHandler() never received any 429 error.
> >> > Is that flink issue? Or we need to config something to make it work?
> >> > Thanks,
> >> >
> >> > Qihua


Re: DataStream API in Batch Mode job is timing out, please advise on how to adjust the parameters.

2021-05-25 Thread Yangze Guo
Hi, Marco,

The root cause is NoResourceAvailableException. Could you provide the
following information?
- How many slots does each TM have?
- What is your job's topology? It would also be good to share the job manager log.

Best,
Yangze Guo

On Tue, May 25, 2021 at 12:10 PM Marco Villalobos
 wrote:
>
> I am running with one job manager and three task managers.
>
> Each task manager is receiving at most 8 gb of data, but the job is timing 
> out.
>
> What parameters must I adjust?
>
> Sink: back fill db sink) (15/32) (50626268d1f0d4c0833c5fa548863abd) switched 
> from SCHEDULED to FAILED on [unassigned resource].
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Slot request bulk is not fulfillable! Could not allocate the required slot 
> within slot request timeout
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  ~[?:1.8.0_282]
> at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  ~[?:1.8.0_282]
> at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) 
> ~[?:1.8.0_282]
> at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  ~[?:1.8.0_282]
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  ~[?:1.8.0_282]
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  ~[?:1.8.0_282]
> at 
> org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:223)
>  ~[feature-LUM-3882-toledo--850a6747.jar:?]
> at 
> org.apache.flink.runtime.scheduler.SlotSharingExecutionSlotAllocator.cancelLogicalSlotRequest(SlotSharingExecutionSlotAllocator.java:168)
>  ~[feature-LUM-3882-toledo--850a6747.jar:?]
> at 
> org.apache.flink.runtime.scheduler.SharingPhysicalSlotRequestBulk.cancel(SharingPhysicalSlotRequestBulk.java:86)
>  ~[feature-LUM-3882-toledo--850a6747.jar:?]
> at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkWithTimestamp.cancel(PhysicalSlotRequestBulkWithTimestamp.java:66)
>  ~[feature-LUM-3882-toledo--850a6747.jar:?]
> at 
> org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:91)
>  ~[feature-LUM-3882-toledo--850a6747.jar:?]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[?:1.8.0_282]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_282]
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:442)
>  ~[feature-LUM-3882-toledo--850a6747.jar:?]
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:209)
>  ~[feature-LUM-3882-toledo--850a6747.jar:?]
> at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
>  ~[feature-LUM-3882-toledo--850a6747.jar:?]
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:159)
>  ~[feature-LUM-3882-toledo--850a6747.jar:?]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at akka.actor.Actor.aroundReceive(Actor.scala:517) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at akka.actor.Actor.aroundReceive$(Actor.scala:515) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
> at akka.actor.ActorCell.invoke(ActorCell.scala:561) 
> [feature-LUM-3882-toledo--850a6747.jar:?]
>   

Re: Heartbeat Timeout

2021-05-27 Thread Yangze Guo
Hi, Robert,

To mitigate this issue, you can increase the "heartbeat.interval" and
"heartbeat.timeout". However, I think we should first figure out the
root cause. Could you provide the log of the task manager
10.42.0.49:6122-e26293?
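
For example, in flink-conf.yaml (the values below are only an
illustration; the defaults are 10000 ms and 50000 ms):

heartbeat.interval: 20000
heartbeat.timeout: 120000

Note that a larger timeout mostly hides the symptom if the TM is really
stuck in long GC pauses or is overloaded.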

Best,
Yangze Guo

On Thu, May 27, 2021 at 10:44 PM Robert Cullen  wrote:
>
> I have a job that fails after @1 hour due to a TaskManager Timeout. How can I 
> prevent this from happening?
>
> 2021-05-27 10:24:21
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
> at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138)
> at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82)
> at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:207)
> at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:197)
> at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:188)
> at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:677)
> at 
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:51)
> at 
> org.apache.flink.runtime.executiongraph.DefaultExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(DefaultExecutionGraph.java:1462)
> at 
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1139)
> at 
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1079)
> at 
> org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:783)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:195)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:182)
> at 
> org.apache.flink.runtime.scheduler.SharedSlot.lambda$release$4(SharedSlot.java:271)
> at 
> java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
> at 
> java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683)
> at 
> java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010)
> at 
> org.apache.flink.runtime.scheduler.SharedSlot.release(SharedSlot.java:271)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:152)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.releasePayload(DefaultDeclarativeSlotPool.java:385)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.DefaultDeclarativeSlotPool.releaseSlots(DefaultDeclarativeSlotPool.java:361)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.internalReleaseTaskManager(DeclarativeSlotPoolService.java:249)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.DeclarativeSlotPoolService.releaseTaskManager(DeclarativeSlotPoolService.java:230)
> at 
> org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:497)
> at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1295)
> at 
> org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:111)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
> at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> at akka.actor.Actor.aroundRece

Re: [ANNOUNCE] Apache Flink 1.13.1 released

2021-05-31 Thread Yangze Guo
Thanks, Dawid for the great work, thanks to everyone involved.

Best,
Yangze Guo

On Mon, May 31, 2021 at 4:14 PM Youngwoo Kim (김영우)  wrote:
>
> Got it.
> Thanks Dawid for the clarification.
>
> - Youngwoo
>
> On Mon, May 31, 2021 at 4:50 PM Dawid Wysakowicz  
> wrote:
>>
>> Hi Youngwoo,
>>
>> Usually we publish the docker images a day after the general release, so
>> that the artifacts are properly distributed across Apache mirrors. You
>> should be able to download the docker images from apache/flink now. It
>> may take a few extra days to have the images published as the official
>> image, as it depends on the maintainers of docker hub.
>>
>> Best,
>>
>> Dawid
>>
>> On 31/05/2021 08:01, Youngwoo Kim (김영우) wrote:
>> > Great work! Thank you Dawid and all of the contributors.
>> > I'm eager to adopt the new release, however can't find docker images for
>> > that from https://hub.docker.com/_/flink
>> >
>> > Hope it'll be available soon.
>> >
>> > Thanks,
>> > Youngwoo
>> >
>> >
>> > On Sat, May 29, 2021 at 1:49 AM Dawid Wysakowicz 
>> > wrote:
>> >
>> >> The Apache Flink community is very happy to announce the release of Apache
>> >> Flink 1.13.1, which is the first bugfix release for the Apache Flink 1.13
>> >> series.
>> >>
>> >> Apache Flink® is an open-source stream processing framework for
>> >> distributed, high-performing, always-available, and accurate data 
>> >> streaming
>> >> applications.
>> >>
>> >> The release is available for download at:
>> >> https://flink.apache.org/downloads.html
>> >>
>> >> Please check out the release blog post for an overview of the improvements
>> >> for this bugfix release:
>> >> https://flink.apache.org/news/2021/05/28/release-1.13.1.html
>> >>
>> >> The full release notes are available in Jira:
>> >>
>> >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12350058
>> >>
>> >> We would like to thank all contributors of the Apache Flink community who
>> >> made this release possible!
>> >>
>> >> Regards,
>> >> Dawid Wysakowicz
>> >>
>>


Re: Flink app performance test framework

2021-06-06 Thread Yangze Guo
Hi, Luck,

I may not fully understand your requirements. If you just want to test
the performance of typical streaming jobs with Flink, you can
refer to Nexmark[1]. If you only care about performance
regressions of your specific production jobs, I'm not aware of
such a framework.

[1] https://github.com/nexmark/nexmark


Best,
Yangze Guo

On Sun, Jun 6, 2021 at 7:35 AM luck li  wrote:
>
> Hi flink community,
>
> Is there any test framework that we can use to test flink jobs performance?
> We would like to automate process for regression tests during flink version 
> upgrade and job performance tests when rolling out new changes to prod.
>
> Any suggestions would be appreciated!
>
> Thank you
> Best regards
> Luck


Re: Elasticsearch sink connector timeout

2021-06-06 Thread Yangze Guo
Hi, Kai,

I think the exception should be thrown from
RetryRejectedExecutionFailureHandler since you configured
'failure-handler' to 'retry-rejected'. That handler retries the actions
that fail with EsRejectedExecutionException and rethrows all other failures.

AFAIK, there is no way to configure the connection/socket timeout in
the Elasticsearch SQL connector. However, if the root cause is network
jitter, you may increase the sink.bulk-flush.backoff.delay and the
sink.bulk-flush.backoff.max-retries.
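
For example, something along these lines in the table's WITH clause (the
values are only an illustration):

'sink.bulk-flush.backoff.strategy' = 'EXPONENTIAL',
'sink.bulk-flush.backoff.delay' = '1s',
'sink.bulk-flush.backoff.max-retries' = '8'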


Best,
Yangze Guo

On Sat, Jun 5, 2021 at 2:28 PM Kai Fu  wrote:
>
> With some investigation in the task manager's log, the exception was raised 
> from RetryRejectedExecutionFailureHandler path, the related logs are showing 
> below, not sure why it's that.
>
>
> 5978 2021-06-05 05:31:31,529 INFO 
> org.apache.flink.elasticsearch7.shaded.org.elasticsearch.action.bulk.BulkRequestHandler
>  [] - Bulk request 1033 has been cancelled.
> 5979 java.lang.InterruptedException: null
> 5980 at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
>  ~[?:1.8.0_272]
> 5981 at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  ~[?:1.8.0_272]
> 5982 at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) 
> ~[?:1.8.0_272]
> 5983 at 
> org.apache.flink.elasticsearch7.shaded.org.elasticsearch.action.bulk.BulkRequestHandler.execute(BulkRequestHandler.java:78)
>  ~[flink-sql-connector-elasticsearch7_2.11-1.13.1.jar: 1.13.1]
> 5984 at 
> org.apache.flink.elasticsearch7.shaded.org.elasticsearch.action.bulk.BulkProcessor.execute(BulkProcessor.java:455)
>  ~[flink-sql-connector-elasticsearch7_2.11-1.13.1.jar:1.13.1]
> 5985 at 
> org.apache.flink.elasticsearch7.shaded.org.elasticsearch.action.bulk.BulkProcessor.execute(BulkProcessor.java:464)
>  ~[flink-sql-connector-elasticsearch7_2.11-1.13.1.jar:1.13.1]
> 5986 at 
> org.apache.flink.elasticsearch7.shaded.org.elasticsearch.action.bulk.BulkProcessor.awaitClose(BulkProcessor.java:330)
>  ~[flink-sql-connector-elasticsearch7_2.11-1.13.1.jar:1.13. 1]
> 5987 at 
> org.apache.flink.elasticsearch7.shaded.org.elasticsearch.action.bulk.BulkProcessor.close(BulkProcessor.java:300)
>  ~[flink-sql-connector-elasticsearch7_2.11-1.13.1.jar:1.13.1]
> 5988 at 
> org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkBase.close(ElasticsearchSinkBase.java:354)
>  ~[flink-sql-connector-elasticsearch7_2.11-1.13.1.jar:1.13.1]
> 5989 at 
> org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:41)
>  ~[flink-dist_2.11-1.13.1.jar:1.13.1]
> 5990 at 
> org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.dispose(AbstractUdfStreamOperator.java:117)
>  ~[flink-dist_2.11-1.13.1.jar:1.13.1]
> 5991 at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.disposeAllOperators(StreamTask.java:861)
>  ~[flink-dist_2.11-1.13.1.jar:1.13.1]
> 5992 at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runAndSuppressThrowable(StreamTask.java:840)
>  [flink-dist_2.11-1.13.1.jar:1.13.1]
> 5993 at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:753)
>  [flink-dist_2.11-1.13.1.jar:1.13.1]
> 5994 at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:659)
>  [flink-dist_2.11-1.13.1.jar:1.13.1]
> 5995 at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:620)
>  [flink-dist_2.11-1.13.1.jar:1.13.1]
> 5996 at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779) 
> [flink-dist_2.11-1.13.1.jar:1.13.1]
> 5997 at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) 
> [flink-dist_2.11-1.13.1.jar:1.13.1]
> 5998 at java.lang.Thread.run(Thread.java:748) [?:1.8.0_272]
> 5999 2021-06-05 05:31:31,530 ERROR 
> org.apache.flink.streaming.connectors.elasticsearch.util.RetryRejectedExecutionFailureHandler
>  [] - Failed Elasticsearch item request: null
> 6000 java.lang.InterruptedException: null
> 6001 at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
>  ~[?:1.8.0_272]
> 6002 at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  ~[?:1.8.0_272]
> 6003 at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) 
> ~[?:1.8.0_272]
> 6004 at 
> org.apache.flink.elasticsearch7.shaded.org.elasticsearch.action.bulk.BulkRequestHandler.execute(BulkRequestHandler.java:78)
>  ~[flink-sql-connector-elasticsearch7_2.11-1.13.1.jar:1.13.1]
> 6005 at 
> org.apache.flink.elasticsearch7.shaded.org.elasticsear

Re: after upgrade flink1.12 to flink1.13.1, flink web-ui's taskmanager detail page error

2021-06-17 Thread Yangze Guo
Thanks for the report, Yidan.

It is tracked as FLINK-23024 and will hopefully be fixed in 1.13.2.

Best,
Yangze Guo

On Fri, Jun 18, 2021 at 10:00 AM yidan zhao  wrote:
>
>  Yeah, I also think it is a bug.
>
> Arvid Heise  于2021年6月17日周四 下午10:13写道:
> >
> > Hi Yidan,
> >
> > could you check if the bucket exists and is accessible? Seems like this
> > directory cannot be created 
> > bos://flink-bucket/flink/ha/opera_upd_FlinkTestJob3/blob.
> >
> > The second issue looks like a bug. I will create a ticket.
> >
> > On Wed, Jun 16, 2021 at 5:21 AM yidan zhao  wrote:
> >>
> >> Does anyone have any idea? Here is another exception stack trace.
> >>
> >>
> >> Unhandled exception.
> >> org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException: Failed
> >> to serialize the result for RPC call : requestTaskManagerDetailsInfo.
> >> at 
> >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.serializeRemoteResultAndVerifySize(AkkaRpcActor.java:404)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$sendAsyncResponse$0(AkkaRpcActor.java:360)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> >> ~[?:1.8.0_251] at
> >> java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:848)
> >> ~[?:1.8.0_251] at
> >> java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2168)
> >> ~[?:1.8.0_251] at
> >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.sendAsyncResponse(AkkaRpcActor.java:352)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:319)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> >> ~[flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> >> [flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.actor.ActorCell.invoke(ActorCell.scala:561)
> >> [flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> >> [flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.dispatch.Mailbox.run(Mailbox.scala:225)
> >> [flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> >> [flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >> [flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> >> [flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> >> [flink-dist_2.11-1.13.1.jar:1.13.1] at
> >> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> >> [flink-dist_2.11-1.13.1.jar:1.13.1] Caused by:
> >> java.io.NotSerializableException:
> >> org.apache.flink.runtime.resourcemanager.TaskManagerInfoWithSlots at
> >> java.io.ObjectOutputStream.writeOb
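
The root cause is at the bottom of the trace: requestTaskManagerDetailsInfo
returns a TaskManagerInfoWithSlots, the result goes through Java serialization
in serializeRemoteResultAndVerifySize, and that class does not implement
java.io.Serializable in 1.13.1, hence the NotSerializableException. That is
what FLINK-23024 addresses. A tiny self-contained sketch of the same failure
mode, using a stand-in class rather than Flink's:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;

public class SerializationSketch {

    // Stand-in for an RPC result type that does not implement java.io.Serializable.
    static class DetailsInfo {
        final String payload = "taskmanager details";
    }

    public static void main(String[] args) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(new DetailsInfo()); // throws NotSerializableException
        } catch (NotSerializableException e) {
            // Same failure mode the AkkaRpcActor reports for TaskManagerInfoWithSlots.
            System.out.println("not serializable: " + e.getMessage());
        }
    }
}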
