Re: [DISCUSS] Adding e2e tests for Flink's Mesos integration

2019-12-07 Thread Yangze Guo
Thanks for your feedback!

@Till
Regarding the time overhead, I think it mainly comes from the network
transmission. Building the image locally downloads about 260 MB in
total, including the base image and packages, while the compressed size
of the image on DockerHub is 347 MB. Thus, I agree that it is ok to
build the image locally.

@Piyush
Thank you for offering the help and sharing your usage scenario. At the
current stage, it would be really helpful if you could compress the
custom image[1] or reduce the time overhead of building it locally.
Any ideas for improving test coverage would also be appreciated.

[1]https://hub.docker.com/layers/karmagyz/mesos-flink/latest/images/sha256-4e1caefea107818aa11374d6ac8a6e889922c81806f5cd791ead141f18ec7e64

Best,
Yangze Guo

On Sat, Dec 7, 2019 at 3:17 AM Piyush Narang  wrote:
>
> +1 from our end as well. At Criteo, we are running some Flink jobs on Mesos 
> in production to compute short term features for machine learning. We’d love 
> to help out and contribute on this initiative.
>
> Thanks,
> -- Piyush
>
>
> From: Till Rohrmann 
> Date: Friday, December 6, 2019 at 8:10 AM
> To: dev 
> Cc: user 
> Subject: Re: [DISCUSS] Adding e2e tests for Flink's Mesos integration
>
> Big +1 for adding a fully working e2e test for Flink's Mesos integration. 
> Ideally we would have it ready for the 1.10 release. The lack of such a test 
> has bitten us already multiple times.
>
> In general I would prefer to use the official image if possible, since it
> frees us from maintaining our own custom image. However, since Java 9 is no
> longer officially supported and we opted for supporting Java 11 (LTS)
> instead, that might not be feasible. How much longer would building the
> custom image take compared to downloading it from DockerHub? Maybe it is ok
> to build the image locally. Then we would not have to maintain the image.
>
> Cheers,
> Till
>
> On Fri, Dec 6, 2019 at 11:05 AM Yangze Guo <karma...@gmail.com> wrote:
> Hi, all,
>
> Currently, there is no end-to-end test or IT case for Mesos deployment,
> while common deployment-related development inevitably touches the
> logic of this component. Thus, some work needs to be done to guarantee
> the experience for both Mesos users and contributors. After an offline
> discussion with Till and Xintong, we have some basic ideas and would
> like to start a discussion thread on adding end-to-end tests for
> Flink's Mesos integration.
>
> As a first step, we would like to keep the scope of this contribution
> relatively small. This may also help us quickly get some basic test
> cases that might be helpful for the upcoming 1.10 release.
>
> As far as we can think of, what needs to be done is to set up a Mesos
> framework during the testing and to determine which tests need to be
> included.
>
>
> ** Regarding the Mesos framework: after trying out several approaches,
> I find that setting up Mesos in Docker is probably what we want. The
> resources needed for building and setting up Mesos from source are
> probably not affordable in most scenarios. So, the one open question
> worth discussing is the choice of Docker image. We have come up with
> two options.
>
> - Using the official Mesos image[1]
> The official image was the first alternative that came to our minds,
> but we ran into a Java version compatibility problem that led to
> failures when launching task executors. Flink supports Java 9 since
> version 1.9.0 [2]; however, the official Docker image of Mesos is built
> with a development version of JDK 9, which has probably caused this
> problem. Unless we want to make Flink compatible with the JDK
> development version used by the official Mesos image, this option does
> not work out. Besides, according to the official roadmap[5], Java 9 is
> not a long-term support version, which may bring stability risks in
> the future.
>
> - Building a custom image
> I've already tried building a custom image[3] and successfully ran most
> of the existing end-to-end test cases with it. The image is built with
> Ubuntu 16.04, JDK 8 and Mesos 1.7.1. For the Mesos e2e test framework,
> we could either build the image from a Dockerfile or pull the pre-built
> image from DockerHub (or another hub service) during the testing.
> If we decide to publish an image on DockerHub, we probably need an
> official Flink repository/account to hold it.
>
>
> ** Regarding the test coverage, we think the following three tests
> could be a good starting point that covers a very essential set of
> behaviors for Mesos deployment.
> - WordCount end-to-end test, for verifying the basic process of Mesos
> deployment.
> - Multiple submissions of the same job, for preventing resource
> management problems on Mesos, such as [4].
> - State TTL RocksDB backend end-to-end test, for verifying memory
> configuration behaviors, since Mesos has its own config options and
> logic.
>
> Unfortunately, neither of us who participated in the

Re: User program failures cause JobManager to be shutdown

2019-12-07 Thread Robert Metzger
I guess we could manage the security only when calling the user's main()
method.

This problem actually exists for all usercode in Flink: You can also kill
TaskManagers like this.
If we are going to add something like this to Flink, I would only log that
System.exit() has been called by the user code, not intercept and ignore
the call.
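To illustrate the point being discussed, here is a minimal, self-contained sketch of intercepting System.exit() at the JVM level. This is my own illustration (with a hypothetical class name `ExitGuard`), not Flink code; it uses the standard `SecurityManager.checkExit` hook, which is deprecated as of Java 17 and there requires `-Djava.security.manager=allow`:

```java
import java.security.Permission;

// Sketch only: install a SecurityManager that intercepts System.exit()
// calls from user code. This is NOT Flink's actual behavior.
public class ExitGuard {

    public static void install() {
        System.setSecurityManager(new SecurityManager() {
            @Override
            public void checkExit(int status) {
                // Throwing here aborts the exit. A log-only variant would
                // just record the call and return normally.
                throw new SecurityException(
                        "System.exit(" + status + ") called by user code");
            }

            @Override
            public void checkPermission(Permission perm) {
                // Permit everything else.
            }
        });
    }

    public static void main(String[] args) {
        try {
            install();
        } catch (UnsupportedOperationException e) {
            // Java 18+ disallows installing a SecurityManager by default.
            System.out.println("SecurityManager not supported on this JVM");
            return;
        }
        try {
            System.exit(1); // simulated user-code call
        } catch (SecurityException e) {
            System.out.println("intercepted: " + e.getMessage());
        } finally {
            System.setSecurityManager(null);
        }
    }
}
```

A log-only variant, as suggested above, would record the call inside checkExit() and return normally instead of throwing.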

On Fri, Dec 6, 2019 at 10:31 AM Khachatryan Roman <
khachatryan.ro...@gmail.com> wrote:

> Hi Dongwon,
>
> This should work but it could also interfere with Flink itself exiting in
> case of a fatal error.
>
> Regards,
> Roman
>
>
> On Fri, Dec 6, 2019 at 2:54 AM Dongwon Kim  wrote:
>
>> FYI, we've launched a session cluster where multiple jobs are managed by
>> a single job manager. When one job calls System.exit(), all the other
>> jobs also fail because the job manager is shut down and all the task
>> managers get into chaos (failing to connect to the job manager).
>>
>> I just searched for a way to prevent System.exit() calls from terminating
>> JVMs and found [1]. Could it be a possible solution to the problem?
>>
>> [1]
>> https://stackoverflow.com/questions/5549720/how-to-prevent-calls-to-system-exit-from-terminating-the-jvm
>>
>> Best,
>> - Dongwon
>>
>> On Fri, Dec 6, 2019 at 10:39 AM Dongwon Kim 
>> wrote:
>>
>>> Hi Robert and Roman,
>>>
>>> Thank you for taking a look at this.
>>>
>>> what is your main() method / client doing when it's receiving wrong
 program parameters? Does it call System.exit(), or something like that?

>>>
>>> I just found that our HTTP client is programmed to call System.exit(1).
>>> I should advise people not to call System.exit() in Flink applications.
>>>
>>> p.s. Just out of curiosity, is there no way for the web app to intercept
>>> System.exit() and prevent the job manager from being shut down?
>>>
>>> Best,
>>>
>>> - Dongwon
>>>
>>> On Fri, Dec 6, 2019 at 3:59 AM Robert Metzger 
>>> wrote:
>>>
 Hi Dongwon,

 what is your main() method / client doing when it's receiving wrong
 program parameters? Does it call System.exit(), or something like that?

 By the way, the http address from the error message is
 publicly available. Not sure if this is internal data or not.

 On Thu, Dec 5, 2019 at 6:32 PM Khachatryan Roman <
 khachatryan.ro...@gmail.com> wrote:

> Hi Dongwon,
>
> I wasn't able to reproduce your problem with Flink JobManager 1.9.1
> with various kinds of errors in the job.
> I suggest you try it on a fresh Flink installation without any other
> jobs submitted.
>
> Regards,
> Roman
>
>
> On Thu, Dec 5, 2019 at 3:48 PM Dongwon Kim 
> wrote:
>
>> Hi Roman,
>>
>> We're using the latest version 1.9.1 and those two lines are all I've
>> seen after executing the job on the web ui.
>>
>> Best,
>>
>> Dongwon
>>
>> On Thu, Dec 5, 2019 at 11:36 PM r_khachatryan <
>> khachatryan.ro...@gmail.com> wrote:
>>
>>> Hi Dongwon,
>>>
>>> Could you please provide Flink version you are running and the job
>>> manager
>>> logs?
>>>
>>> Regards,
>>> Roman
>>>
>>>
>>> eastcirclek wrote
>>> > Hi,
>>> >
>>> > I tried to run a program by uploading a jar on the Flink UI. When I
>>> > intentionally enter a wrong parameter to my program, the JobManager
>>> > dies. Below are all the log messages I can get from the JobManager;
>>> > it dies as soon as it emits the second line:
>>> >
>>> >> 2019-12-05 04:47:58,623 WARN  org.apache.flink.runtime.webmonitor.handlers.JarRunHandler - Configuring the job submission via query parameters is deprecated. Please migrate to submitting a JSON request instead.
>>> >> 2019-12-05 04:47:59,133 ERROR com.skt.apm.http.HTTPClient - Cannot connect: http://52.141.38.11:8380/api/spec/poc_asset_model_01/model/imbalance/models: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot deserialize instance of `java.util.ArrayList` out of START_OBJECT token at [Source: (String)"{"code":"GB0001","resource":"msg.comm.unknown.error","details":"NullPointerException: "}"; line: 1, column: 1]
>>> >> 2019-12-05 04:47:59,166 INFO  org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6124
>>> >
>>> >
>>> > The second line is obviously from my program and it shouldn't cause
>>> > JobManager to be shut down. Is it intended behavior?
>>> >
>>> > Best,
>>> >
>>> > Dongwon
>>>
>>>
>>>
>>>
>>>

Re: How to install Flink + YARN?

2019-12-07 Thread Pankaj Chand
Please disregard my last question. It is working fine with Hadoop 2.7.5.

Thanks

On Sat, Dec 7, 2019 at 2:13 AM Pankaj Chand 
wrote:

> Is it required that the Hadoop version on the cluster exactly match the
> pre-bundled Hadoop version?
>
> I'm using a Hadoop 2.7.1 cluster with Flink 1.9.1 and the corresponding
> pre-bundled Hadoop 2.7.5.
>
> When I submit a job using:
>
> [vagrant@node1 flink]$ ./bin/flink run -m yarn-cluster
> ./examples/streaming/SocketWindowWordCount.jar --port 9001
>
> the tail of the log is empty, and I get the following messages:
>
> vagrant@node1 flink]$ ./bin/flink run -m yarn-cluster
> ./examples/streaming/SocketWindowWordCount.jar --port 9001
> 2019-12-07 07:04:15,394 INFO  org.apache.hadoop.yarn.client.RMProxy
>   - Connecting to ResourceManager at /0.0.0.0:8032
> 2019-12-07 07:04:15,493 INFO
>  org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path
> for the flink jar passed. Using the location of class
> org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2019-12-07 07:04:15,493 INFO
>  org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path
> for the flink jar passed. Using the location of class
> org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2019-12-07 07:04:16,615 INFO  org.apache.hadoop.ipc.Client
>  - Retrying connect to server: 0.0.0.0/0.0.0.0:8032.
> Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> [... the same message repeats for attempts 1 through 9 ...]
> 2019-12-07 07:04:56,677 INFO  org.apache.hadoop.ipc.Client
>  - Retrying connect to server: 0.0.0.0/0.0.0.0:8032.
> Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> [... a second retry cycle repeats in the same way; the log is truncated here ...]