Re: Running on a firewalled Yarn cluster?

2015-12-10 Thread Robert Metzger
I've finally fixed the issues identified in this thread: the blob manager
and the application master / job manager now allocate their ports within a
specified range.

You can now whitelist a port range in the firewall and Flink services will
only allocate ports in that range:
https://github.com/apache/flink/blob/master/docs/setup/yarn_setup.md#running-flink-on-yarn-behind-firewalls

Please let me know if that fixes your issues.
Note that the fix is only available in 1.0-SNAPSHOT.
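
For anyone who just wants the shape of the configuration: in flink-conf.yaml it looks roughly like this. The key names reflect my reading of the linked docs for 1.0-SNAPSHOT, and the range values are only examples; verify both against the page above for your version.

```yaml
# flink-conf.yaml -- example values only; use whatever range your
# firewall actually whitelists.

# Port (or range of ports) for the ApplicationMaster / JobManager on YARN:
yarn.application-master.port: 50100-50200

# Port (or range of ports) for the BlobServer, which handles jar uploads:
blob.server.port: 50100-50200
```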

On Wed, Nov 25, 2015 at 6:58 PM, Robert Metzger  wrote:

> Hi,
> I just wanted to let you know that I didn't forget about this!
>
> The BlobManager in 1.0-SNAPSHOT already has a configuration parameter to
> use a certain range of ports.
> I'm trying to add the same feature for YARN tomorrow.
> Sorry for the delay.
>
>
> On Tue, Nov 10, 2015 at 9:27 PM, Cory Monty 
> wrote:
>
>> Thanks, Stephan.
>>
>> I'll give those two workarounds a try!
>>
>> On Tue, Nov 10, 2015 at 2:18 PM, Stephan Ewen  wrote:
>>
>>> Hi Cory!
>>>
>>> There is no flag to define the BlobServer port right now, but we should
>>> definitely add this: https://issues.apache.org/jira/browse/FLINK-2996
>>>
>>> If your setup is such that the firewall problem is only between client
>>> and master node (and the workers can reach the master on all ports), then
>>> you can try two workarounds:
>>>
>>> 1) Start the program in the cluster (or on the master node, via ssh).
>>>
>>> 2) Add the program jar to the lib directory of Flink, and start your
>>> program with the RemoteExecutor, without a jar attachment. Then it only
>>> needs to communicate with the actor system (RPC) port, which is not random in
>>> standalone mode (6123 by default).
>>>
>>> Stephan
>>>
>>>
>>>
>>>
>>> On Tue, Nov 10, 2015 at 8:46 PM, Cory Monty wrote:
>>>
 I'm also running into an issue with a non-YARN cluster. When submitting
 a JAR to Flink, we'll need to have an arbitrary port open on all of the
 hosts, which we don't know about until the socket attempts to bind; a bit
 of a problem for us.

 Are there ways to submit a JAR to Flink that bypass the need for the
 BlobServer's random port binding? Or, to control the port the BlobServer
 binds to?

 Cheers,

 Cory

 On Thu, Nov 5, 2015 at 8:07 AM, Niels Basjes  wrote:

> That is what I tried. Couldn't find that port though.
>
> On Thu, Nov 5, 2015 at 3:06 PM, Robert Metzger 
> wrote:
>
>> Hi,
>>
>> cool, that's good news.
>>
>> The RM proxy is only for the web interface of the AM.
>>
>>  I'm pretty sure that the MapReduce AM has at least two ports:
>> - one for the web interface (accessible through the RM proxy, so
>> behind the firewall)
>> - one for the AM RPC (and that port is allocated within the
>> configured range, open through the firewall).
>>
>> You can probably find the RPC port in the log file of the running
>> MapReduce AM (to find that, identify the NodeManager running the AM,
>> access the NM web interface, and retrieve the logs of the container
>> running the AM).
>>
>> Maybe the mapreduce client also logs the AM RPC port when querying
>> the status of a running job.
>>
>>
>> On Thu, Nov 5, 2015 at 2:59 PM, Niels Basjes  wrote:
>>
>>> Hi,
>>>
>>> I checked and this setting has been set to a limited port range of
>>> only 100 port numbers.
>>>
>>> I tried to find the actual port an AM is running on and couldn't
>>> find it (I'm not the admin on that cluster)
>>>
>>> The url to the AM that I use to access it always looks like this:
>>>
>>> http://master-001.xx.net:8088/proxy/application_1443166961758_85492/index.html
>>>
>>> As you can see I never connect directly; always via the proxy that
>>> runs over the master on a single fixed port.
>>>
>>> Niels
>>>
>>> On Thu, Nov 5, 2015 at 2:46 PM, Robert Metzger 
>>> wrote:
>>>
 While discussing the issue with my colleagues today, we came
 up with another approach to resolve it:

 d) Upload the job jar to HDFS (or another FS) and trigger the
 execution of the jar using an HTTP request to the web interface.

 We could add some tooling into the /bin/flink client to submit a
 job like this transparently, so users would not need to bother with the
 file upload and request sending.
 Also, Sachin started a discussion on the dev@ list to add support
 for submitting jobs over the web interface, so maybe we can base the fix
 for FLINK-2960 on that.

 I've also looked into the Hadoop MapReduce code and it seems they
 do the following:

Re: Running on a firewalled Yarn cluster?

2015-11-25 Thread Robert Metzger
Hi,
I just wanted to let you know that I didn't forget about this!

The BlobManager in 1.0-SNAPSHOT already has a configuration parameter to
use a certain range of ports.
I'm trying to add the same feature for YARN tomorrow.
Sorry for the delay.



Re: Running on a firewalled Yarn cluster?

2015-11-05 Thread Robert Metzger
While discussing the issue with my colleagues today, we came up with
another approach to resolve it:

d) Upload the job jar to HDFS (or another FS) and trigger the execution of
the jar using an HTTP request to the web interface.

We could add some tooling into the /bin/flink client to submit a job like
this transparently, so users would not need to bother with the file upload
and request sending.
Also, Sachin started a discussion on the dev@ list to add support for
submitting jobs over the web interface, so maybe we can base the fix for
FLINK-2960 on that.

I've also looked into the Hadoop MapReduce code, and it seems they do the
following:
When submitting a job, they upload the job jar file to HDFS. They also
upload a configuration file that contains all the config options of the
job. Then they submit all of this together as an application to YARN.
So far, no firewall has been involved. They establish a connection between
the JobClient and the ApplicationMaster when the user queries the current
job status, but I could not find any special code getting the status over
HTTP.

But I found the following configuration parameter:
"yarn.app.mapreduce.am.job.client.port-range", so it seems that they try to
allocate the AM port within that range (if specified).
Niels, can you check if this configuration parameter is set in your
environment? I assume your firewall allows outside connections from that
port range.
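
Since Hadoop config files are plain <property> XML, a small stdlib-only Java sketch can check whether the parameter is set, e.g. against a copy of mapred-site.xml pulled from a gateway machine. Only the property name comes from the discussion above; everything else here is illustrative (main() builds a sample file so the sketch runs anywhere):

```java
import java.io.File;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PortRangeCheck {

    /** Returns the value of a Hadoop-style <property> entry, or null if it is not set. */
    static String lookup(File configXml, String propertyName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(configXml);
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            String name = prop.getElementsByTagName("name")
                    .item(0).getTextContent().trim();
            if (name.equals(propertyName)) {
                return prop.getElementsByTagName("value")
                        .item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // In practice, point this at your cluster's mapred-site.xml; the
        // sample file below only stands in for it so the sketch is runnable.
        File conf = File.createTempFile("mapred-site", ".xml");
        Files.write(conf.toPath(), (
                "<configuration><property>"
              + "<name>yarn.app.mapreduce.am.job.client.port-range</name>"
              + "<value>50100-50200</value>"
              + "</property></configuration>").getBytes("UTF-8"));
        // prints "50100-50200"
        System.out.println(
                lookup(conf, "yarn.app.mapreduce.am.job.client.port-range"));
    }
}
```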
So we also have a new approach:

f) Allocate the YARN application master (and blob manager) within a
user-specified port-range.

This would be really easy to implement, because we would just need to go
through the range until we find an available port.
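
That loop is straightforward with plain java.net; here is a sketch of the idea (illustrative only, not Flink's actual implementation):

```java
import java.io.IOException;
import java.net.ServerSocket;

public class PortRangeAllocator {

    /**
     * Walks the inclusive range and returns a ServerSocket bound to the
     * first free port. Ports that are already taken throw on bind and are
     * skipped; only a fully exhausted range is an error.
     */
    static ServerSocket bindInRange(int rangeStart, int rangeEnd) throws IOException {
        for (int port = rangeStart; port <= rangeEnd; port++) {
            try {
                return new ServerSocket(port);
            } catch (IOException e) {
                // Port in use (or not permitted); try the next one.
            }
        }
        throw new IOException("No free port in range " + rangeStart + "-" + rangeEnd);
    }

    public static void main(String[] args) throws IOException {
        // 50100-50200 is only an example range; use your whitelisted one.
        try (ServerSocket socket = bindInRange(50100, 50200)) {
            System.out.println("Bound to port " + socket.getLocalPort());
        }
    }
}
```

A second caller simply lands on the next free port in the range, which is exactly why two job manager containers can share a machine as long as the range has spare ports.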




Re: Running on a firewalled Yarn cluster?

2015-11-05 Thread Niels Basjes
Hi,

I checked, and this setting has been set to a limited range of only 100
port numbers.

I tried to find the actual port an AM is running on, but couldn't find it
(I'm not the admin on that cluster).

The url to the AM that I use to access it always looks like this:
http://master-001.xx.net:8088/proxy/application_1443166961758_85492/index.html

As you can see I never connect directly; always via the proxy that runs
over the master on a single fixed port.

Niels



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Running on a firewalled Yarn cluster?

2015-11-03 Thread Niels Basjes
Great!

I'll watch the issue and give it a test once I see a working patch.

Niels Basjes

On Tue, Nov 3, 2015 at 1:03 PM, Maximilian Michels  wrote:

> Hi Niels,
>
> Thanks a lot for reporting this issue. I think it is a very common setup
> in corporate infrastructure to have restrictive firewall settings. For
> Flink 1.0 (and probably in a minor 0.10.X release) we will have to address
> this issue to ensure proper integration of Flink.
>
> I've created a JIRA to keep track:
> https://issues.apache.org/jira/browse/FLINK-2960
>
> Best regards,
> Max
>


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Running on a firewalled Yarn cluster?

2015-11-03 Thread Niels Basjes
Hi,

I forgot to answer your other question:

On Mon, Nov 2, 2015 at 4:34 PM, Robert Metzger  wrote:

> so the problem is that you can not submit a job to Flink using the
> "/bin/flink" tool, right?
> I assume Flink and its TaskManagers properly start and connect to each
> other (the number of TaskManagers is shown correctly in the web interface).
>

Correct. Flink starts (I see the jobmanager UI) but the actual job is not
started.

Niels Basjes


Re: Running on a firewalled Yarn cluster?

2015-11-02 Thread Robert Metzger
Hi Niels,

so the problem is that you can not submit a job to Flink using the
"/bin/flink" tool, right?
I assume Flink and its TaskManagers properly start and connect to each
other (the number of TaskManagers is shown correctly in the web interface).

I see the following solutions for the problem:
a) Add a new page in the job manager web frontend allowing users to upload
and execute a jar with a Flink job.
b) Add options for starting the jobmanager and blob manager on the job
manager container on fixed ports.
c) Somehow make the Akka RPC requests and blob manager uploads go over HTTP
using the YARN proxy.

The reason we use a free port instead of a fixed port is that this way two
job manager containers can run on the same machine. So solution b) would
only work if users are not running multiple Flink jobs / sessions on YARN
at the same time (or you somehow make sure they are not running on the
same machine).

What's your take on the three solutions?

Does anybody here know how MR is doing it? Are they running the
ApplicationMaster RPC on a fixed port? Do they use HTTP-based calls over
the proxy?

Robert





Running on a firewalled Yarn cluster?

2015-11-02 Thread Niels Basjes
Hi,

Here at work our security guys decided (a long time ago) that the
firewalls should only have the ports open that are needed (I say: good
call!).
For the Yarn cluster this includes things like the proxy to see the
application manager of an application.
For everything we've done so far (i.e. mr/pig/...) this has worked fine.

Now with Flink I run into problems:
When I run either the yarn-session or a job on Yarn, the application
manager gets started and I can see the web interface.
The problem is that the jobmanager.rpc.address points to one of the worker
nodes and the jobmanager.rpc.port is essentially a random value, and that
random value is not accessible because of the firewall rules. So I cannot
reach the jobmanager on the yarn cluster.

How do I tackle this, assuming that opening all the ports on the firewall
is not an option?

Or is this something that should be handled by Flink? (Perhaps the
application manager can proxy the RPC calls?)

-- 
Best regards / Met vriendelijke groeten,

Niels Basjes


Re: Running on a firewalled Yarn cluster?

2015-11-02 Thread Niels Basjes
My take on those 3 options:
a) Bad idea; people need to be able to automate their jobs and run them
from the command line (e.g. bash, cron).
b) Bad idea; same reason you gave. In addition, I do not want to reserve
an open 'flink port' for every user who wants to run a job.
c) From my perspective this sounds like the most viable solution.

I don't know how they implemented this in MR.
I know the way they did it actually works on our clusters (with firewalls).

Niels Basjes



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes