Re: Problem to show logs in task managers

2016-01-11 Thread Ana M. Martinez
Hi Till,

Thanks for that! I can see the "Logger in LineSplitter.flatMap" output if I
retrieve the task manager logs manually (under
/var/log/hadoop-yarn/containers/application_X/…). However, that solution is not
ideal when, for instance, I am using 32 machines for my MapReduce operations.

I would like to know why Yarn’s log aggregation is not working. Can you tell me
how to check whether there are still some Yarn containers running after the
Flink job has finished? I have tried:
hadoop job -list
but I cannot see any jobs there, although I am not sure whether that means that
there are no containers running...

Thanks,
Ana

On 08 Jan 2016, at 16:24, Till Rohrmann <trohrm...@apache.org> wrote:


You’re right that the log statements of the LineSplitter are in the logs of the 
cluster nodes, because that’s where the LineSplitter code is executed. In 
contrast, you create a TestClass on the client when you submit the program. 
Therefore, you see the logging statement “Logger in TestClass” on the command 
line or in the cli log file.

So I would assume that the problem is Yarn’s log aggregation. Either your 
configuration is not correct or there are still some Yarn containers running 
after the Flink job has finished. Yarn will only show you the logs after all 
containers are terminated. Maybe you could check that. Alternatively, you can 
try to retrieve the taskmanager logs manually by going to the machine where 
your yarn container was executed. Then, under hadoop/logs/userlogs, you should
find the logs.

Cheers,
Till

​

On Fri, Jan 8, 2016 at 3:50 PM, Ana M. Martinez <a...@cs.aau.dk> wrote:
Thanks for the tip Robert! It was a good idea to rule out other possible 
causes, but I am afraid that is not the problem. If we stick to the 
WordCountExample (for simplicity), the Exception is thrown if placed into the 
flatMap function.

I am going to try to re-write my problem and all the settings below:

When I try to aggregate all logs:
 $yarn logs -applicationId application_1452250761414_0005

the following message is retrieved:
16/01/08 13:32:37 INFO client.RMProxy: Connecting to ResourceManager at 
ip-172-31-33-221.us-west-2.compute.internal/172.31.33.221:8032
/var/log/hadoop-yarn/apps/hadoop/logs/application_1452250761414_0005 does not
exist.
Log aggregation has not completed or is not enabled.

(Tried the same command a few minutes later and got the same message, so might 
it be that log aggregation is not properly enabled??)

I am going to carefully enumerate all the steps I have followed (and settings) 
to see if someone can identify why the Logger messages from CORE nodes (in an 
Amazon cluster) are not shown.

1) Set the yarn.log-aggregation-enable property to true in
/etc/alternatives/hadoop-conf/yarn-site.xml.

2) Include log messages in my WordCountExample as follows:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;


public class WordCountExample {
    static Logger logger = LoggerFactory.getLogger(WordCountExample.class);

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        logger.info("Entering application.");

        DataSet<String> text = env.fromElements(
                "Who's there?",
                "I think I hear them. Stand, ho! Who's there?");

        List<Integer> elements = new ArrayList<Integer>();
        elements.add(0);

        DataSet<TestClass> set = env.fromElements(new TestClass(elements));

        DataSet<Tuple2<String, Integer>> wordCounts = text
                .flatMap(new LineSplitter())
                .withBroadcastSet(set, "set")
                .groupBy(0)
                .sum(1);

        wordCounts.writeAsText("output.txt", FileSystem.WriteMode.OVERWRITE);
    }

    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {

        static Logger loggerLineSplitter = LoggerFactory.getLogger(LineSplitter.class);

        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            loggerLineSplitter.info("Logger in LineSplitter.flatMap");

            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
                //throw new RuntimeException("LineSplitter class called");
            }
        }
    }

    public static class TestClass implements Serializable {
        private static final long serialVersionUID = -2932037991574118651L;

        static Logger loggerTestClass = LoggerFactory.getLogger("TestClass.class");

        List<Integer> integerList;
        public TestClass(List<Integer> integerL

Re: Problem to show logs in task managers

2016-01-11 Thread Till Rohrmann
Hi Ana,

good to hear that you found the logging statements. You can check in Yarn’s
web interface whether there are still occupied containers. Alternatively
you can go to the different machines and run jps which lists you the
running Java processes. If you see an ApplicationMaster or
YarnTaskManagerRunner process, then there is still a container running with
Flink on this machine. I hope this helps you.

Cheers,
Till
​

On Mon, Jan 11, 2016 at 9:37 AM, Ana M. Martinez  wrote:

> Hi Till,
>
> Thanks for that! I can see the "Logger in LineSplitter.flatMap” output if
> I retrieve the task manager logs manually
> (under /var/log/hadoop-yarn/containers/application_X/…). However that
> solution is not ideal when for instance I am using 32 machines for my
> mapReduce operations.
>
> I would like to know why Yarn’s log aggregation is not working. Can you
> tell me how to check if there are some Yarn containers running after the
> Flink job has finished? I have tried:
> hadoop job -list
> but I cannot see any jobs there, although I am not sure that it means that
> there are not containers running...
>
> Thanks,
> Ana
>
> On 08 Jan 2016, at 16:24, Till Rohrmann  wrote:
>
> You’re right that the log statements of the LineSplitter are in the logs
> of the cluster nodes, because that’s where the LineSplitter code is
> executed. In contrast, you create a TestClass on the client when you
> submit the program. Therefore, you see the logging statement “Logger in
> TestClass” on the command line or in the cli log file.
>
> So I would assume that the problem is Yarn’s log aggregation. Either your
> configuration is not correct or there are still some Yarn containers
> running after the Flink job has finished. Yarn will only show you the logs
> after all containers are terminated. Maybe you could check that.
> Alternatively, you can try to retrieve the taskmanager logs manually by
> going to the machine where your yarn container was executed. Then under
> hadoop/logs/userlogs you should find somewhere the logs.
>
> Cheers,
> Till
> ​
>
> On Fri, Jan 8, 2016 at 3:50 PM, Ana M. Martinez  wrote:
>
>> Thanks for the tip Robert! It was a good idea to rule out other possible
>> causes, but I am afraid that is not the problem. If we stick to the
>> WordCountExample (for simplicity), the Exception is thrown if placed into
>> the flatMap function.
>>
>> I am going to try to re-write my problem and all the settings below:
>>
>> When I try to aggregate all logs:
>>  $yarn logs -applicationId application_1452250761414_0005
>>
>> the following message is retrieved:
>> 16/01/08 13:32:37 INFO client.RMProxy: Connecting to ResourceManager at
>> ip-172-31-33-221.us-west-2.compute.internal/172.31.33.221:8032
>> /var/log/hadoop-yarn/apps/hadoop/logs/application_1452250761414_0005 does
>> not exist.
>> Log aggregation has not completed or is not enabled.
>>
>> (Tried the same command a few minutes later and got the same message, so
>> might it be that log aggregation is not properly enabled??)
>>
>> I am going to carefully enumerate all the steps I have followed (and
>> settings) to see if someone can identify why the Logger messages from CORE
>> nodes (in an Amazon cluster) are not shown.
>>
>> 1) Enable yarn.log-aggregation-enable property to true
>> in /etc/alternatives/hadoop-conf/yarn-site.xml.
>>
>> 2) Include log messages in my WordCountExample as follows:
>>
>> import org.apache.flink.api.common.functions.FlatMapFunction;
>> import org.apache.flink.api.java.DataSet;
>> import org.apache.flink.api.java.ExecutionEnvironment;
>> import org.apache.flink.api.java.tuple.Tuple2;
>> import org.apache.flink.core.fs.FileSystem;
>> import org.apache.flink.util.Collector;
>> import org.slf4j.Logger;
>> import org.slf4j.LoggerFactory;
>>
>> import java.io.Serializable;
>> import java.util.ArrayList;
>> import java.util.List;
>>
>>
>> public class WordCountExample {
>> static Logger logger = LoggerFactory.getLogger(WordCountExample.class);
>>
>> public static void main(String[] args) throws Exception {
>> final ExecutionEnvironment env = 
>> ExecutionEnvironment.getExecutionEnvironment();
>>
>> logger.info("Entering application.");
>>
>> DataSet<String> text = env.fromElements(
>> "Who's there?",
>> "I think I hear them. Stand, ho! Who's there?");
>>
>> List<Integer> elements = new ArrayList<Integer>();
>> elements.add(0);
>>
>>
>> DataSet<TestClass> set = env.fromElements(new TestClass(elements));
>>
>> DataSet<Tuple2<String, Integer>> wordCounts = text
>> .flatMap(new LineSplitter())
>> .withBroadcastSet(set, "set")
>> .groupBy(0)
>> .sum(1);
>>
>> wordCounts.writeAsText("output.txt",
>> FileSystem.WriteMode.OVERWRITE);
>>
>>
>> }
>>
>> public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
>>
>> static Logger loggerLineSplitter = 
>> LoggerFactory.getLogger(LineSplitter.class);
>>
>> @Override
>> 

Re: Machine Learning on Apache Fink

2016-01-11 Thread Till Rohrmann
Hi Ashutosh,

Flink’s ML library FlinkML is maybe not as extensive as Spark’s MLlib.
However, Flink has native support for iterations, which makes them blazingly
fast. Iterations in Flink are a distinct operator, so they don’t have to
communicate with the client after each iteration. Furthermore, Flink has
support for delta iterations, which allow you to iterate over a subset of
the elements in a DataSet. This can speed up appropriate algorithms, such as
many graph algorithms, considerably.
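
For illustration, here is a minimal, hypothetical sketch of a bulk iteration with
the Java DataSet API (the class name and the toy increment step function are made
up for this example):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class IterationSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // start a bulk iteration that runs for at most 100 supersteps
        IterativeDataSet<Integer> initial = env.fromElements(0).iterate(100);

        // the step function of the iteration; here just a toy increment
        DataSet<Integer> step = initial.map(new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer value) {
                return value + 1;
            }
        });

        // close the loop: the step result becomes the input of the next iteration
        DataSet<Integer> result = initial.closeWith(step);

        result.print();
    }
}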

If you have further questions while studying the available material feel
free to write me.

Cheers,
Till
​

On Sat, Jan 9, 2016 at 2:20 PM, Vasiliki Kalavri 
wrote:

> Hi Ashutosh,
>
> Flink has a Machine Learning library, Flink-ML. You can find more
> information and examples in the documentation [1].
> The code is currently in the flink-staging repository. There is also
> material on Slideshare that you might find interesting [2, 3, 4].
>
> I hope this helps!
> -Vasia.
>
> [1]: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
> [2]:
> http://www.slideshare.net/tillrohrmann/machine-learning-with-apache-flink
> [3]:
> http://www.slideshare.net/TheodorosVasiloudis/flinkml-large-scale-machine-learning-with-apache-flink
> [4]:
> http://www.slideshare.net/tillrohrmann/computing-recommendations-at-extreme-scale-with-apache-flink-buzzwords-2015-48879155
>
> On 9 January 2016 at 12:46, Ashutosh Kumar 
> wrote:
>
>> I see a lot of study materials and even books available for ML on Spark.
>> Spark seems to be more mature for analytics-related work as of now. Please
>> correct me if I am wrong. As I have already built my collection and data
>> pre-processing layers on Flink, I want to use Flink for analytics as well.
>> Thanks in advance.
>>
>>
>> Ashutosh
>>
>> On Sat, Jan 9, 2016 at 3:32 PM, Ashutosh Kumar <
>> ashutosh.disc...@gmail.com> wrote:
>>
>>> I am looking for some study material and examples on machine learning.
>>> Are classification, recommendation and clustering libraries available?
>>> What is the timeline for Flink as a backend for Mahout? I am building a
>>> metadata-driven framework over Flink. While building the data collection
>>> and transformation modules was cool, I have been struggling since I started
>>> the analytics module. Thanks in advance.
>>> Ashutosh
>>>
>>
>>
>


Flink streaming Python

2016-01-11 Thread Madhukar Thota
Hi

Is streaming supported in Flink-Python API? If so, can you point me to the
documentation?


Re: Problem to show logs in task managers

2016-01-11 Thread Ana M. Martinez
Hi Till,

Thanks for your help. I have checked both in Yarn’s web interface and through
the command line, and it seems that there are no occupied containers.

Additionally, I have checked the configuration values in the web interface and,
even though I have changed the log-aggregation property in the yarn-site.xml
file to true, it appears as false and with the following source label:

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>false</value>
  <source>java.io.BufferedInputStream@3c407114</source>
</property>


I am not sure if that is relevant. I had assumed that the "./bin/flink run -m
yarn-cluster" command starts a yarn session and thus reloads the yarn-site
file. Is that right? If I am wrong here, how can I restart it so that the
modifications in the yarn-site.xml file are taken into account? (I have also
tried with ./bin/yarn-session.sh and then ./bin/flink run, without success…).

I am not sure if this is related to Flink anymore; should I move my problem to
the Yarn community instead?

Thanks,
Ana

On 11 Jan 2016, at 10:37, Till Rohrmann <trohrm...@apache.org> wrote:


Hi Ana,

good to hear that you found the logging statements. You can check in Yarn’s web 
interface whether there are still occupied containers. Alternatively you can go 
to the different machines and run jps which lists you the running Java 
processes. If you see an ApplicationMaster or YarnTaskManagerRunner process, 
then there is still a container running with Flink on this machine. I hope this 
helps you.

Cheers,
Till

​

On Mon, Jan 11, 2016 at 9:37 AM, Ana M. Martinez <a...@cs.aau.dk> wrote:
Hi Till,

Thanks for that! I can see the "Logger in LineSplitter.flatMap” output if I 
retrieve the task manager logs manually (under 
/var/log/hadoop-yarn/containers/application_X/…). However that solution is not 
ideal when for instance I am using 32 machines for my mapReduce operations.

I would like to know why Yarn’s log aggregation is not working. Can you tell me 
how to check if there are some Yarn containers running after the Flink job has 
finished? I have tried:
hadoop job -list
but I cannot see any jobs there, although I am not sure that it means that 
there are not containers running...

Thanks,
Ana

On 08 Jan 2016, at 16:24, Till Rohrmann <trohrm...@apache.org> wrote:


You’re right that the log statements of the LineSplitter are in the logs of the 
cluster nodes, because that’s where the LineSplitter code is executed. In 
contrast, you create a TestClass on the client when you submit the program. 
Therefore, you see the logging statement “Logger in TestClass” on the command 
line or in the cli log file.

So I would assume that the problem is Yarn’s log aggregation. Either your 
configuration is not correct or there are still some Yarn containers running 
after the Flink job has finished. Yarn will only show you the logs after all 
containers are terminated. Maybe you could check that. Alternatively, you can 
try to retrieve the taskmanager logs manually by going to the machine where 
your yarn container was executed. Then under hadoop/logs/userlogs you should 
find somewhere the logs.

Cheers,
Till

​

On Fri, Jan 8, 2016 at 3:50 PM, Ana M. Martinez <a...@cs.aau.dk> wrote:
Thanks for the tip Robert! It was a good idea to rule out other possible 
causes, but I am afraid that is not the problem. If we stick to the 
WordCountExample (for simplicity), the Exception is thrown if placed into the 
flatMap function.

I am going to try to re-write my problem and all the settings below:

When I try to aggregate all logs:
 $yarn logs -applicationId application_1452250761414_0005

the following message is retrieved:
16/01/08 13:32:37 INFO client.RMProxy: Connecting to ResourceManager at 
ip-172-31-33-221.us-west-2.compute.internal/172.31.33.221:8032
/var/log/hadoop-yarn/apps/hadoop/logs/application_1452250761414_0005does not 
exist.
Log aggregation has not completed or is not enabled.

(Tried the same command a few minutes later and got the same message, so might 
it be that log aggregation is not properly enabled??)

I am going to carefully enumerate all the steps I have followed (and settings) 
to see if someone can identify why the Logger messages from CORE nodes (in an 
Amazon cluster) are not shown.

1) Enable yarn.log-aggregation-enable property to true in 
/etc/alternatives/hadoop-conf/yarn-site.xml.

2) Include log messages in my WordCountExample as follows:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;


public class WordCountExample {
static Logger logger = LoggerFactory.getLogger(WordCountExample.class);

  

Re: Flink streaming Python

2016-01-11 Thread Ufuk Celebi

> On 11 Jan 2016, at 13:03, Madhukar Thota  wrote:
> 
> Hi 
> 
> Is streaming supported in Flink-Python API? If so, can you point me to the 
> documentation?

No. Only the DataSet API has Python support at the moment. I expect this to 
change at some point in time, but I’m not aware of any concrete plans.

– Ufuk



Re: Problem to show logs in task managers

2016-01-11 Thread Till Rohrmann
You have to restart the yarn cluster to let your changes take effect. You
can do that via HADOOP_HOME/sbin/stop-yarn.sh;
HADOOP_HOME/sbin/start-yarn.sh.

The commands yarn-session.sh ... and bin/flink run -m yarn-cluster start a
new yarn application within the yarn cluster.

Cheers,
Till
​

On Mon, Jan 11, 2016 at 1:39 PM, Ana M. Martinez  wrote:

> Hi Till,
>
> Thanks for your help. I have checked both in Yarn’s web interface and
> through command line and it seems that there are not occupied containers.
>
> Additionally, I have checked the configuration values in the web interface
> and even though I have changed the log.aggregation property in the
> yarn-site.xml file to true, it appears as false and with the following
> source label:
> 
> yarn.log-aggregation-enable
> false
> java.io.BufferedInputStream@3c407114
> 
>
> I am not sure if that is relevant. I had assumed that the "./bin/flink run
> -m yarn-cluster" command is starting a yarn session and thus reloading the
> yarn-site file. Is that right? If I am wrong here, then, how can I restart
> it so that the modifications in the yarn-site.xml file are considered? (I
> have also tried with ./bin/yarn-session.sh and then ./bin/flink run without
> success…).
>
> I am not sure if this is related to flink anymore, should I move my
> problem to the yarn community instead?
>
> Thanks,
> Ana
>
> On 11 Jan 2016, at 10:37, Till Rohrmann  wrote:
>
> Hi Ana,
>
> good to hear that you found the logging statements. You can check in
> Yarn’s web interface whether there are still occupied containers.
> Alternatively you can go to the different machines and run jps which
> lists you the running Java processes. If you see an ApplicationMaster or
> YarnTaskManagerRunner process, then there is still a container running
> with Flink on this machine. I hope this helps you.
>
> Cheers,
> Till
> ​
>
> On Mon, Jan 11, 2016 at 9:37 AM, Ana M. Martinez  wrote:
>
>> Hi Till,
>>
>> Thanks for that! I can see the "Logger in LineSplitter.flatMap” output if
>> I retrieve the task manager logs manually
>> (under /var/log/hadoop-yarn/containers/application_X/…). However that
>> solution is not ideal when for instance I am using 32 machines for my
>> mapReduce operations.
>>
>> I would like to know why Yarn’s log aggregation is not working. Can you
>> tell me how to check if there are some Yarn containers running after the
>> Flink job has finished? I have tried:
>> hadoop job -list
>> but I cannot see any jobs there, although I am not sure that it means
>> that there are not containers running...
>>
>> Thanks,
>> Ana
>>
>> On 08 Jan 2016, at 16:24, Till Rohrmann  wrote:
>>
>> You’re right that the log statements of the LineSplitter are in the logs
>> of the cluster nodes, because that’s where the LineSplitter code is
>> executed. In contrast, you create a TestClass on the client when you
>> submit the program. Therefore, you see the logging statement “Logger in
>> TestClass” on the command line or in the cli log file.
>>
>> So I would assume that the problem is Yarn’s log aggregation. Either your
>> configuration is not correct or there are still some Yarn containers
>> running after the Flink job has finished. Yarn will only show you the logs
>> after all containers are terminated. Maybe you could check that.
>> Alternatively, you can try to retrieve the taskmanager logs manually by
>> going to the machine where your yarn container was executed. Then under
>> hadoop/logs/userlogs you should find somewhere the logs.
>>
>> Cheers,
>> Till
>> ​
>>
>> On Fri, Jan 8, 2016 at 3:50 PM, Ana M. Martinez  wrote:
>>
>>> Thanks for the tip Robert! It was a good idea to rule out other possible
>>> causes, but I am afraid that is not the problem. If we stick to the
>>> WordCountExample (for simplicity), the Exception is thrown if placed into
>>> the flatMap function.
>>>
>>> I am going to try to re-write my problem and all the settings below:
>>>
>>> When I try to aggregate all logs:
>>>  $yarn logs -applicationId application_1452250761414_0005
>>>
>>> the following message is retrieved:
>>> 16/01/08 13:32:37 INFO client.RMProxy: Connecting to ResourceManager at
>>> ip-172-31-33-221.us-west-2.compute.internal/172.31.33.221:8032
>>> /var/log/hadoop-yarn/apps/hadoop/logs/application_1452250761414_0005does
>>> not exist.
>>> Log aggregation has not completed or is not enabled.
>>>
>>> (Tried the same command a few minutes later and got the same message, so
>>> might it be that log aggregation is not properly enabled??)
>>>
>>> I am going to carefully enumerate all the steps I have followed (and
>>> settings) to see if someone can identify why the Logger messages from CORE
>>> nodes (in an Amazon cluster) are not shown.
>>>
>>> 1) Enable yarn.log-aggregation-enable property to true
>>> in /etc/alternatives/hadoop-conf/yarn-site.xml.
>>>
>>> 2) Include log messages in my WordCountExample as follows:
>>>
>>> import org.apache.flink.api.common.functions.FlatMapFunction;
>>> import org.apache.flink.api.j

Re: Security in Flink

2016-01-11 Thread Sourav Mazumder
Thanks Stephan for your detailed response. Things are much clearer to me now.

A follow-up question:
Looks like most of the security support depends on Hadoop? What happens if
anyone wants to use Flink without Hadoop (in a cluster where Hadoop is not
there)?

Regards,
Sourav

On Sun, Jan 10, 2016 at 12:41 PM, Stephan Ewen  wrote:

> Hi Sourav!
>
> There is user-authentication support in Flink via the Hadoop / Kerberos
> infrastructure. If you run Flink on YARN, it should seamlessly work that
> Flink acquires the Kerberos tokens of the user that submits programs, and
> authenticate itself at YARN, HDFS, and HBase with that.
>
> If you run Flink standalone, Flink can still authenticate at HDFS/HBase
> via Kerberos, with a bit of manual help by the user (running kinit on the
> workers).
>
> With Kafka 0.9 and Flink's upcoming connector (
> https://github.com/apache/flink/pull/1489), streaming programs can
> authenticate themselves as stream brokers via SSL (and read via encrypted
> connections).
>
>
> What we have on the roadmap for the coming months it the following:
>   - Encrypt in-flight data streams that are exchanged between worker nodes
> (TaskManagers).
>   - Encrypt the coordination messages between client/master/workers.
> Note that these refer to encryption between Flink's own components only,
> which would use transient keys generated just for a specific job or session
> (hence would not need any user involvement).
>
>
> Let us know if that answers your questions, and if that meets your
> requirements.
>
> Greetings,
> Stephan
>
>
> On Fri, Jan 8, 2016 at 3:23 PM, Sourav Mazumder <
> sourav.mazumde...@gmail.com> wrote:
>
>> Hi,
>>
>> Can anyone point me to ant documentation on support for Security in Flink
>> ?
>>
>> The type of information I'm looking for are -
>>
>> 1. How do I do user level authentication to ensure that a job is
>> submitted/deleted/modified by the right user ? Is it possible though the
>> web client ?
>> 2. Authentication across multiple slave nodes (where the task managers
>> are running) and driver program so that they can communicate with each other
>> 3. Support for SSL/encryption for data exchanged happening across the
>> slave nodes
>> 4. Support for pluggable authentication with existing solution like LDAP
>>
>> If not there today is there a roadmap for these security features ?
>>
>> Regards,
>> Sourav
>>
>
>


Flink with Yarn

2016-01-11 Thread Sourav Mazumder
I am going through the documentation of integrating Flink with YARN.

However, I am not sure whether Flink can be run on YARN in two modes (like
Spark): one mode in which the driver/client program of Flink is also managed by
YARN, and a second mode in which the client program is outside the control of
YARN. Is running Flink behind firewalls like the second mode?

Any clarification on this ?

Regards,
Sourav


Re: Flink with Yarn

2016-01-11 Thread Stephan Ewen
Hi!

Flink is different than Spark in that respect. The driver program in Flink
can submit a program to the master (in YARN Application Master) and
disconnect then. It is not a part of the distributed execution - that is
coordinated only by the master (JobManager).
The driver can stay connected to receive progress updates, though.

For programs that consist of multiple parallel executions (that have
count() or collect() statements), the driver needs to stay connected,
because it needs to pull the intermediate results. However, they are all
pulled/proxied through the master (JobManager), so the driver needs not be
able to connect to the workers. The only requirement for firewalled
clusters is to have two ports from the master node reachable by the client.

Greetings,
Stephan


On Mon, Jan 11, 2016 at 5:18 PM, Sourav Mazumder <
sourav.mazumde...@gmail.com> wrote:

> I am going through the documentation of integrating Flink with YARN.
>
> However not sure whether Flink can be run on YARN in two modes (like
> Spark). In one mode the driver/client program of Flink is also managed by
> YARN. In the second mode where the client program is outside the control of
> YARN. Is the running Flinkon behind Firewalls is like the second mode
>
> Any clarification on this ?
>
> Regards,
> Sourav
>


Re: Flink with Yarn

2016-01-11 Thread Sourav Mazumder
Hi Stephan,

Thanks for the explanation.

From your explanation it looks like Flink runs in a mode similar to Spark's
YARN-client mode.

Regards,
Sourav

On Mon, Jan 11, 2016 at 8:27 AM, Stephan Ewen  wrote:

> Hi!
>
> Flink is different than Spark in that respect. The driver program in Flink
> can submit a program to the master (in YARN Application Master) and
> disconnect then. It is not a part of the distributed execution - that is
> coordinated only by the master (JobManager).
> The driver can stay connected to receive progress updates, though.
>
> For programs that do consist of multiple parallel executions (that have
> count() or collect() statements), the driver needs to stay connected,
> because it needs to pull the intermediate results. However, they are all
> pulled/proxied through the master (JobManager), so the driver needs not be
> able to connect to the workers. The only requirement for firewalled
> clusters is to have two ports from the master node reachable by the client.
>
> Greetings,
> Stephan
>
>
> On Mon, Jan 11, 2016 at 5:18 PM, Sourav Mazumder <
> sourav.mazumde...@gmail.com> wrote:
>
>> I am going through the documentation of integrating Flink with YARN.
>>
>> However not sure whether Flink can be run on YARN in two modes (like
>> Spark). In one mode the driver/client program of Flink is also managed by
>> YARN. In the second mode where the client program is outside the control of
>> YARN. Is the running Flinkon behind Firewalls is like the second mode
>>
>> Any clarification on this ?
>>
>> Regards,
>> Sourav
>>
>
>


Re: Flink with Yarn

2016-01-11 Thread Tzu-Li (Gordon) Tai
Hi Sourav,

Let me help with a little more clarification on your last comment.

In sense of "where" the driver program is executed, then yes the Flink
driver program runs in a mode similar to Spark's YARN-client.

However, the "role" of the driver program and the work that it is
responsible of is quite different between Flink and Spark. In Spark, the
driver program is in charge of coordinating Spark workers (executors) and
must listen for and accept incoming connections from the workers throughout
the job's lifetime. Therefore, in Spark's YARN-client mode, you must keep
the driver program process alive otherwise the job will be shutdown.

However, in Flink, the coordination of Flink TaskManagers to complete a job
is handled by Flink's JobManager once the client at the driver program
submits the job to the JobManager. The driver program is solely used for the
job submission and can disconnect afterwards. 

Like what Stephan explained, if the user-defined dataflow defines any
intermediate results to be retrieved via collect() or print(), the results
are transmitted through the JobManager. Only then does the driver program
need to stay connected. Note that this connection still does not need to
have any connections with the workers (Flink TaskManagers), only the
JobManager.

Cheers,
Gordon



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-with-Yarn-tp4224p4227.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: Flink with Yarn

2016-01-11 Thread Sourav Mazumder
Hi Gordon,

Thanks for the explanation. It is much clearer now. Looks like a much cleaner
approach. In that way the driver program can run on a machine which does
not need connectivity to all worker nodes.

Regards,
Sourav

On Mon, Jan 11, 2016 at 9:22 AM, Tzu-Li (Gordon) Tai 
wrote:

> Hi Sourav,
>
> A little help with more clarification on your last comment.
>
> In sense of "where" the driver program is executed, then yes the Flink
> driver program runs in a mode similar to Spark's YARN-client.
>
> However, the "role" of the driver program and the work that it is
> responsible of is quite different between Flink and Spark. In Spark, the
> driver program is in charge of coordinating Spark workers (executors) and
> must listen for and accept incoming connections from the workers throughout
> the job's lifetime. Therefore, in Spark's YARN-client mode, you must keep
> the driver program process alive otherwise the job will be shutdown.
>
> However, in Flink, the coordination of Flink TaskManagers to complete a job
> is handled by Flink's JobManager once the client at the driver program
> submits the job to the JobManager. The driver program is solely used for
> the
> job submission and can disconnect afterwards.
>
> Like what Stephan explained, if the user-defined dataflow defines any
> intermediate results to be retrieved via collect() or print(), the results
> are transmitted through the JobManager. Only then does the driver program
> need to stay connected. Note that this connection still does not need to
> have any connections with the workers (Flink TaskManagers), only the
> JobManager.
>
> Cheers,
> Gordon
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-with-Yarn-tp4224p4227.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


How to sort tuples in DataStream

2016-01-11 Thread Saiph Kappa
Hi,

I'm trying to do a simple application in Flink Streaming to count the top N
words on a window basis, but I don't know how to sort the words by their
frequency in Flink.

In spark streaming, I would do something like this:
«
val isAscending = true
stream.reduceByKeyAndWindow(reduceFunc, Seconds(10), Seconds
(10)).transform(_.sortByKey(isAscending)).map(_._2)
»

How can I do it in Flink Stream?

This is what I have so far:
«
val reduceFunc = (a: String, b: String) => {

  val aElems = a.split(Separator)
  val bElems = b.split(Separator)
  val result = aElems(params.aggParams.get.head.aggIndex).toInt +
    bElems(params.aggParams.get.head.aggIndex).toInt
  result.toString
}

stream.keyBy(0).timeWindow(Time.seconds(10),
Time.seconds(10)).reduce(reduceFunc)
»


My stream is just a series of strings like "field1|field2|field3|..."

Thanks.
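
For reference, one possible way to approach this in Flink streaming is to compute
the per-key counts in a keyed window and then sort them in a second, non-parallel
window over all elements. The following Java sketch only illustrates that idea:
the class and variable names are made up, and the exact AllWindowFunction
signature (assumed here to receive the window's elements as an Iterable) differs
slightly between Flink versions, so check the API of the version you are using.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class SortedWordCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // assumed input: a stream of (word, 1) tuples produced by an earlier flatMap
        DataStream<Tuple2<String, Integer>> words = env.fromElements(
                new Tuple2<String, Integer>("who", 1),
                new Tuple2<String, Integer>("there", 1));

        // per-word counts over 10-second windows
        DataStream<Tuple2<String, Integer>> counts = words
                .keyBy(0)
                .timeWindow(Time.seconds(10))
                .sum(1);

        // sort each window's counts in a single, non-parallel all-window
        DataStream<Tuple2<String, Integer>> sorted = counts
                .timeWindowAll(Time.seconds(10))
                .apply(new AllWindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, TimeWindow>() {
                    @Override
                    public void apply(TimeWindow window,
                                      Iterable<Tuple2<String, Integer>> values,
                                      Collector<Tuple2<String, Integer>> out) {
                        List<Tuple2<String, Integer>> list = new ArrayList<Tuple2<String, Integer>>();
                        for (Tuple2<String, Integer> t : values) {
                            list.add(t);
                        }
                        // ascending by frequency; reverse the comparator for descending order
                        Collections.sort(list, new Comparator<Tuple2<String, Integer>>() {
                            @Override
                            public int compare(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
                                return a.f1.compareTo(b.f1);
                            }
                        });
                        for (Tuple2<String, Integer> t : list) {
                            out.collect(t);
                        }
                    }
                });

        sorted.print();
        env.execute("sorted word count sketch");
    }
}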


DeserializationSchema isEndOfStream usage?

2016-01-11 Thread David Kim
Hello all,

I saw that DeserializationSchema has an API "isEndOfStream()".

https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/util/serialization/DeserializationSchema.java

Can *isEndOfStream* be utilized to somehow terminate a streaming flink job?

I was under the impression that if we return "true" we can control when a
stream can close. The use case I had in mind was controlling when
unit/integration tests would terminate a flink job. We can rely on the fact
that a test/spec would know how many items it expects to consume and then
switch *isEndOfStream* to return true.

Am I misunderstanding the intention for *isEndOfStream*?

I also set a breakpoint on *isEndOfStream* and saw that it never was hit
when using "FlinkKafkaConsumer082" to pass in a DeserializationSchema
implementation.

Currently testing on 1.0-SNAPSHOT.

Cheers!
David
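
For reference, the kind of counting schema described above might look roughly like
the following sketch. The class name and threshold logic are hypothetical, and note
that the counter is local to each parallel source instance:

import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.util.serialization.DeserializationSchema;

// Hypothetical schema that signals end-of-stream after a fixed number of records.
public class CountingStringSchema implements DeserializationSchema<String> {

    private final int expectedRecords; // how many records the test expects to consume
    private int seenRecords;           // records deserialized so far (per parallel instance)

    public CountingStringSchema(int expectedRecords) {
        this.expectedRecords = expectedRecords;
    }

    @Override
    public String deserialize(byte[] message) {
        seenRecords++;
        return new String(message, StandardCharsets.UTF_8);
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        // once enough records have been seen, ask the source to stop
        return seenRecords >= expectedRecords;
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return BasicTypeInfo.STRING_TYPE_INFO;
    }
}

Such a schema could then be passed to the Kafka consumer in place of a plain string
schema, with the caveat that, as discussed in the reply below, the consumer has to
honour the end-of-stream signal for this to take effect.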


Re: DeserializationSchema isEndOfStream usage?

2016-01-11 Thread Robert Metzger
Hi David,

In theory isEndOfStream() is absolutely the right way to go for stopping data 
sources in Flink.
That it's not working as expected is a bug. I have a pending pull request for
adding a Kafka 0.9 connector, which fixes this issue as well (for all supported 
Kafka versions).

Sorry for the inconvenience. If you want, you can check out the branch of the 
PR and build Flink yourself to get the fix.
I hope that I can merge the connector to master this week; then the fix will
be available in 1.0-SNAPSHOT as well.

Regards,
Robert



Sent from my iPhone

> On 11.01.2016, at 21:39, David Kim  wrote:
> 
> Hello all,
> 
> I saw that DeserializationSchema has an API "isEndOfStream()". 
> 
> https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/util/serialization/DeserializationSchema.java
> 
> Can isEndOfStream be utilized to somehow terminate a streaming flink job?
> 
> I was under the impression that if we return "true" we can control when a 
> stream can close. The use case I had in mind was controlling when 
> unit/integration tests would terminate a flink job. We can rely on the fact 
> that a test/spec would know how many items it expects to consume and then 
> switch isEndOfStream to return true.
> 
> Am I misunderstanding the intention for isEndOfStream? 
> 
> I also set a breakpoint on isEndOfStream and saw that it never was hit when 
> using "FlinkKafkaConsumer082" to pass in a DeserializationSchema 
> implementation.
> 
> Currently testing on 1.0-SNAPSHOT.
> 
> Cheers!
> David


SQL query support in Flink

2016-01-11 Thread Sourav Mazumder
Hi,

Just wanted to check whether one can directly run SQL queries on Flink.

For example whether one can define a table on a dataset and then run
queries like dataset.sql ("select column1, column2 from mytable").

I used to think that this is possible right now in 0.10.1. But when I
checked the blog a year in review
http://flink.apache.org/news/2015/12/18/a-year-in-review.html, I found
there is a feature mentioned in 2016's roadmap as - "SQL queries for static
data sets and streams". Hence my doubt.

Regards,
Sourav


Re: Security in Flink

2016-01-11 Thread Welly Tambunan
Hi Stephan,

Do you have any plans for which encryption method and mechanism will be used
in Flink? Could you share the details on this?

We have a very strict requirement from our client that every communication
needs to be encrypted. So any detail would be really appreciated for answering
their security concerns.


Cheers

On Mon, Jan 11, 2016 at 9:46 PM, Sourav Mazumder <
sourav.mazumde...@gmail.com> wrote:

> Thanks Steven for your details response. Things are more clear to me now.
>
> A follow up Qs -
> Looks like most of the security support depends on Hadoop ? What happens
> if anyone wants to use Flink with Hadoop (in a cluster where Hadoop is not
> there) ?
>
> Regards,
> Sourav
>
> On Sun, Jan 10, 2016 at 12:41 PM, Stephan Ewen  wrote:
>
>> Hi Sourav!
>>
>> There is user-authentication support in Flink via the Hadoop / Kerberos
>> infrastructure. If you run Flink on YARN, it should seamlessly work that
>> Flink acquires the Kerberos tokens of the user that submits programs, and
>> authenticate itself at YARN, HDFS, and HBase with that.
>>
>> If you run Flink standalone, Flink can still authenticate at HDFS/HBase
>> via Kerberos, with a bit of manual help by the user (running kinit on the
>> workers).
>>
>> With Kafka 0.9 and Flink's upcoming connector (
>> https://github.com/apache/flink/pull/1489), streaming programs can
>> authenticate themselves as stream brokers via SSL (and read via encrypted
>> connections).
>>
>>
>> What we have on the roadmap for the coming months it the following:
>>   - Encrypt in-flight data streams that are exchanged between worker
>> nodes (TaskManagers).
>>   - Encrypt the coordination messages between client/master/workers.
>> Note that these refer to encryption between Flink's own components only,
>> which would use transient keys generated just for a specific job or session
>> (hence would not need any user involvement).
>>
>>
>> Let us know if that answers your questions, and if that meets your
>> requirements.
>>
>> Greetings,
>> Stephan
>>
>>
>> On Fri, Jan 8, 2016 at 3:23 PM, Sourav Mazumder <
>> sourav.mazumde...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Can anyone point me to ant documentation on support for Security in
>>> Flink ?
>>>
>>> The type of information I'm looking for are -
>>>
>>> 1. How do I do user level authentication to ensure that a job is
>>> submitted/deleted/modified by the right user ? Is it possible though the
>>> web client ?
>>> 2. Authentication across multiple slave nodes (where the task managers
>>> are running) and driver program so that they can communicate with each other
>>> 3. Support for SSL/encryption for data exchanged happening across the
>>> slave nodes
>>> 4. Support for pluggable authentication with existing solution like LDAP
>>>
>>> If not there today is there a roadmap for these security features ?
>>>
>>> Regards,
>>> Sourav
>>>
>>
>>
>


-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com