Re: Apache Flink Operator State as Query Cache

2015-11-11 Thread Stephan Ewen
Hi!

In general, if you can keep state in Flink, you get better
throughput/latency/consistency and have one less system to worry about
(external k/v store). Keeping state outside means that the Flink processes
can be slimmer, need fewer resources, and as such recover a bit faster.
There are use cases for that as well.

Storing the model in OperatorState is a good idea, if you can. Migrating
the operator state to managed memory is on the roadmap as well, so that
should take care of the GC issues.

We are just adding functionality to make the Key/Value operator state
usable in CoMap/CoFlatMap as well (currently it only works in windows and
in Map/FlatMap/Filter functions over the KeyedStream).
Until then, you should be able to use a simple Java HashMap and implement
the "Checkpointed" interface to make it persistent.

Greetings,
Stephan


On Sun, Nov 8, 2015 at 10:11 AM, Welly Tambunan  wrote:

> Thanks for the answer.
>
> The approach I'm currently using is creating a base/marker interface to
> stream different types of messages to the same operator. I'm not sure
> about the performance hit of this compared to the CoFlatMap function.
>
> Basically this one provides a query cache, so I'm thinking that instead
> of using an in-memory cache like Redis, Ignite, etc., I can just use
> operator state for this one.
>
> I just want to gauge whether I need to use a memory cache or whether
> operator state would be just fine.
>
> However, I'm concerned about Gen 2 garbage collection when caching our
> own state without using operator state. Is there any clarification on
> that one?
>
>
>
> On Sat, Nov 7, 2015 at 12:38 AM, Anwar Rizal  wrote:
>
>>
>> Let me understand your case better here. You have a stream of models and
>> a stream of data. To process the data, you will need a way to access your
>> model from the subsequent stream operations (map, filter, flatMap, ...).
>> I'm not sure in which case Operator State is a good choice, but I think
>> you can also live without it.
>>
>> val modelStream = ... // get the model stream
>> val dataStream  = ...
>>
>> modelStream.broadcast.connect(dataStream).coFlatMap(...)
>>
>> Then you can keep the latest model in a RichCoFlatMapFunction, not
>> necessarily as Operator State, although maybe OperatorState is a good
>> choice too.
>>
>> Does it make sense to you?
>>
>> Anwar
>>
>> On Fri, Nov 6, 2015 at 10:21 AM, Welly Tambunan 
>> wrote:
>>
>>> Hi All,
>>>
>>> We have high-density data that requires downsampling. However, the
>>> downsampling model is very flexible, depending on the client device and
>>> user interaction, so it would be wasteful to precompute and store the
>>> results in a DB.
>>>
>>> So we want to use Apache Flink to do the downsampling and cache the
>>> result for subsequent queries.
>>>
>>> We are considering using Flink Operator State for that.
>>>
>>> Is that the right approach to use for a memory cache? Or is it
>>> preferable to use a memory cache like Redis, etc.?
>>>
>>> Any comments will be appreciated.
>>>
>>>
>>> Cheers
>>> --
>>> Welly Tambunan
>>> Triplelands
>>>
>>> http://weltam.wordpress.com
>>> http://www.triplelands.com 
>>>
>>
>>
>
>
> --
> Welly Tambunan
> Triplelands
>
> http://weltam.wordpress.com
> http://www.triplelands.com 
>


Re: Cluster installation gives java.lang.NoClassDefFoundError for everything

2015-11-11 Thread Maximilian Michels
Hi Camelia,

Flink 0.9.X supports Java 6. So this can't be the issue.

Out of curiosity, I gave it a spin on a Linux machine with OpenJDK 6. I was
able to start the command-line interface, job manager and task managers.

java version "1.6.0_36"
OpenJDK Runtime Environment (IcedTea6 1.13.8) (6b36-1.13.8-0ubuntu1~14.04)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)


I think the error must be caused by your NFS setup. Let's start with
./bin/flink. For instance, can you access
"/users/camelia/thecluster/flink-0.9.1/lib/flink-dist-0.9.1.jar" from the
machine where you run ./bin/flink?

Best,
Max


On Wed, Nov 11, 2015 at 10:41 AM, Camelia Elena Ciolac 
wrote:

>  Hello,
>
> As promised, I come back with debugging details.
> So:
>
> *** In start-cluster.sh, the following echoes:
>
> echo "start-cluster ~"
> echo $HOSTLIST
> echo "start-cluster ~"
> echo $FLINK_BIN_DIR
> echo "start-cluster ~"
>
> # cluster mode, bring up job manager locally and a task manager on every
> slave host
> "$FLINK_BIN_DIR"/jobmanager.sh start cluster batch
>
> ==>  gave:
>
> start-cluster ~
> /users/camelia/thecluster/flink-0.9.1/conf/slaves
> start-cluster ~
> /users/camelia/thecluster/flink-0.9.1/bin
> start-cluster ~
>
>
> *** In jobmanager.sh, the following echoes:
>
> echo "Starting Job Manager"
> echo "jobmanager ~"
> pwd
> echo "jobmanager ~"
> echo ${FLINK_ENV_JAVA_OPTS}
> echo "jobmanager ~"
> echo $FLINK_JM_CLASSPATH
> echo "jobmanager ~"
> echo $INTERNAL_HADOOP_CLASSPATHS
> echo "jobmanager "
> echo $FLINK_CONF_DIR
> echo ""
> $JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}"
> -classpath "`manglePathList
> "$FLINK_JM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`"
> org.apache.flink.runtime.jobmanager.JobManager --configDir
> "$FLINK_CONF_DIR" --executionMode $EXECUTIONMODE --streamingMode
> "$STREAMINGMODE" > "$out" 2>&1 < /dev/null &
>
>
>
> ==> gave:
>
> jobmanager ~
> /users/camelia/thecluster
> jobmanager ~
>
> jobmanager ~
>
> /users/camelia/thecluster/flink-0.9.1/lib/flink-dist-0.9.1.jar:/users/camelia/thecluster/flink-0.9.1/lib/flink-python-0.9.1.jar
> jobmanager ~
> ::
> jobmanager 
> /users/camelia/thecluster/flink-0.9.1/conf
> 
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/flink/runtime/jobmanager/JobManager
> Caused by: java.lang.ClassNotFoundException:
> org.apache.flink.runtime.jobmanager.JobManager
> at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
> Could not find the main class:
> org.apache.flink.runtime.jobmanager.JobManager. Program will exit.
>
>
>
> *** In taskmanager.sh, the following echoes:
>
> echo Starting task manager on host $HOSTNAME
> echo "taskmanager ~"
> pwd
> echo "taskmanager ~"
> echo ${FLINK_ENV_JAVA_OPTS}
> echo "taskmanager ~"
> echo $FLINK_JM_CLASSPATH
> echo "taskmanager ~"
> echo $INTERNAL_HADOOP_CLASSPATHS
> echo "taskmanager "
> echo $FLINK_CONF_DIR
> echo ""
> $JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}"
> -classpath "`manglePathList
> "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`"
> org.apache.flink.runtime.taskmanager.TaskManager --configDir
> "$FLINK_CONF_DIR" --streamingMode "$STREAMINGMODE" > "$out" 2>&1 <
> /dev/null &
>
> ==> gave:
>
> taskmanager ~
> /users/camelia/thecluster
> taskmanager ~
>
> taskmanager ~
>
> taskmanager ~
> ::
> taskmanager 
> /users/camelia/thecluster/flink-0.9.1/conf
> 
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/flink/runtime/taskmanager/TaskManager
> Caused by: java.lang.ClassNotFoundException:
> org.apache.flink.runtime.taskmanager.TaskManager
> at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
> at java.security.AccessController.doPrivileged(Native Method)
> at 

Accumulators/Metrics

2015-11-11 Thread Nick Dimiduk
Hello,

I'm interested in exposing metrics from my UDFs. I see FLINK-1501 exposes
task manager metrics via a UI; it would be nice to plug into the same
MetricRegistry to register my own (i.e., gauges). I don't see this exposed
via the runtime context. This did lead me to discover the Accumulators API,
which looks more oriented to simple counts that are summed across the
components of a batch job. In my case, I'd like to expose details of my
stream-processing vertices so that I can monitor their correctness and
health with respect to runtime decisions. For instance, referring back to
my previous thread, I would like to expose the number of filters loaded
into my custom RichCoFlatMap so that I can easily monitor this value.
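
Concretely, something like the following is what I have in mind (a
hypothetical sketch; the class, the types, and the accumulator name are
mine):

import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Sketch: stream 1 carries filter definitions, stream 2 carries events;
// an accumulator tracks how many filters this vertex has loaded.
public class FilteringCoFlatMap
        extends RichCoFlatMapFunction<String, String, String> {

    private final IntCounter filtersLoaded = new IntCounter();

    @Override
    public void open(Configuration parameters) {
        // register the counter under a name a monitoring UI could read back
        getRuntimeContext().addAccumulator("filters-loaded", filtersLoaded);
    }

    @Override
    public void flatMap1(String filterDef, Collector<String> out) {
        filtersLoaded.add(1); // one more filter loaded from the control stream
    }

    @Override
    public void flatMap2(String event, Collector<String> out) {
        out.collect(event); // the real job would apply the filters here
    }
}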

Thanks,
Nick


RE: Cluster installation gives java.lang.NoClassDefFoundError for everything - solved

2015-11-11 Thread Camelia Elena Ciolac
Hello,

Thank you very much for your advice, Stephan, Robert and Maximilian.
Upgrading to Java 1.7 on the cluster did indeed solve the problem.

java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)


Best regards,
Camelia


From: Camelia Elena Ciolac
Sent: Wednesday, November 11, 2015 10:41 AM
To: user@flink.apache.org
Subject: RE: Cluster installation gives java.lang.NoClassDefFoundError for 
everything - detailed debugging

 Hello,

As promised, I come back with debugging details.
So:

*** In start-cluster.sh, the following echoes:

echo "start-cluster ~"
echo $HOSTLIST
echo "start-cluster ~"
echo $FLINK_BIN_DIR
echo "start-cluster ~"

# cluster mode, bring up job manager locally and a task manager on every slave 
host
"$FLINK_BIN_DIR"/jobmanager.sh start cluster batch

==>  gave:

start-cluster ~
/users/camelia/thecluster/flink-0.9.1/conf/slaves
start-cluster ~
/users/camelia/thecluster/flink-0.9.1/bin
start-cluster ~


*** In jobmanager.sh, the following echoes:

echo "Starting Job Manager"
echo "jobmanager ~"
pwd
echo "jobmanager ~"
echo ${FLINK_ENV_JAVA_OPTS}
echo "jobmanager ~"
echo $FLINK_JM_CLASSPATH
echo "jobmanager ~"
echo $INTERNAL_HADOOP_CLASSPATHS
echo "jobmanager "
echo $FLINK_CONF_DIR
echo ""
$JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}" 
-classpath "`manglePathList "$FLINK_JM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" 
org.apache.flink.runtime.jobmanager.JobManager --configDir "$FLINK_CONF_DIR" 
--executionMode $EXECUTIONMODE --streamingMode "$STREAMINGMODE" > "$out" 2>&1 < 
/dev/null &



==> gave:

jobmanager ~
/users/camelia/thecluster
jobmanager ~

jobmanager ~
/users/camelia/thecluster/flink-0.9.1/lib/flink-dist-0.9.1.jar:/users/camelia/thecluster/flink-0.9.1/lib/flink-python-0.9.1.jar
jobmanager ~
::
jobmanager 
/users/camelia/thecluster/flink-0.9.1/conf


Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/flink/runtime/jobmanager/JobManager
Caused by: java.lang.ClassNotFoundException: 
org.apache.flink.runtime.jobmanager.JobManager
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: org.apache.flink.runtime.jobmanager.JobManager. 
Program will exit.



*** In taskmanager.sh, the following echoes:

echo Starting task manager on host $HOSTNAME
echo "taskmanager ~"
pwd
echo "taskmanager ~"
echo ${FLINK_ENV_JAVA_OPTS}
echo "taskmanager ~"
echo $FLINK_JM_CLASSPATH
echo "taskmanager ~"
echo $INTERNAL_HADOOP_CLASSPATHS
echo "taskmanager "
echo $FLINK_CONF_DIR
echo ""
$JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}" 
-classpath "`manglePathList "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" 
org.apache.flink.runtime.taskmanager.TaskManager --configDir "$FLINK_CONF_DIR" 
--streamingMode "$STREAMINGMODE" > "$out" 2>&1 < /dev/null &

==> gave:

taskmanager ~
/users/camelia/thecluster
taskmanager ~

taskmanager ~

taskmanager ~
::
taskmanager 
/users/camelia/thecluster/flink-0.9.1/conf


Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/flink/runtime/taskmanager/TaskManager
Caused by: java.lang.ClassNotFoundException: 
org.apache.flink.runtime.taskmanager.TaskManager
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: 
org.apache.flink.runtime.taskmanager.TaskManager. Program will exit.



*** In flink, the following echoes:


Re: Cluster installation gives java.lang.NoClassDefFoundError for everything

2015-11-11 Thread Robert Metzger
Is "jar tf /users/camelia/thecluster/flink-0.9.1/lib/flink-dist-0.9.1.jar"
listing for example the "org/apache/flink/client/CliFrontend" class?

On Wed, Nov 11, 2015 at 12:09 PM, Maximilian Michels  wrote:

> Hi Camelia,
>
> Flink 0.9.X supports Java 6. So this can't be the issue.
>
> Out of curiosity, I gave it a spin on a Linux machine with OpenJDK 6. I
> was able to start the command-line interface, job manager and task managers.
>
> java version "1.6.0_36"
> OpenJDK Runtime Environment (IcedTea6 1.13.8) (6b36-1.13.8-0ubuntu1~14.04)
> OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)
>
>
> I think the error must be caused by your NFS setup. Let's start with
> ./bin/flink. For instance, can you access
> "/users/camelia/thecluster/flink-0.9.1/lib/flink-dist-0.9.1.jar" from the
> machine where you run ./bin/flink?
>
> Best,
> Max
>
>
> On Wed, Nov 11, 2015 at 10:41 AM, Camelia Elena Ciolac <
> came...@chalmers.se> wrote:
>
>>  Hello,
>>
>> As promised, I come back with debugging details.
>> So:
>>
>> *** In start-cluster.sh, the following echoes:
>>
>> echo "start-cluster ~"
>> echo $HOSTLIST
>> echo "start-cluster ~"
>> echo $FLINK_BIN_DIR
>> echo "start-cluster ~"
>>
>> # cluster mode, bring up job manager locally and a task manager on every
>> slave host
>> "$FLINK_BIN_DIR"/jobmanager.sh start cluster batch
>>
>> ==>  gave:
>>
>> start-cluster ~
>> /users/camelia/thecluster/flink-0.9.1/conf/slaves
>> start-cluster ~
>> /users/camelia/thecluster/flink-0.9.1/bin
>> start-cluster ~
>>
>>
>> *** In jobmanager.sh, the following echoes:
>>
>> echo "Starting Job Manager"
>> echo "jobmanager ~"
>> pwd
>> echo "jobmanager ~"
>> echo ${FLINK_ENV_JAVA_OPTS}
>> echo "jobmanager ~"
>> echo $FLINK_JM_CLASSPATH
>> echo "jobmanager ~"
>> echo $INTERNAL_HADOOP_CLASSPATHS
>> echo "jobmanager "
>> echo $FLINK_CONF_DIR
>> echo ""
>> $JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}"
>> -classpath "`manglePathList
>> "$FLINK_JM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`"
>> org.apache.flink.runtime.jobmanager.JobManager --configDir
>> "$FLINK_CONF_DIR" --executionMode $EXECUTIONMODE --streamingMode
>> "$STREAMINGMODE" > "$out" 2>&1 < /dev/null &
>>
>>
>>
>> ==> gave:
>>
>> jobmanager ~
>> /users/camelia/thecluster
>> jobmanager ~
>>
>> jobmanager ~
>>
>> /users/camelia/thecluster/flink-0.9.1/lib/flink-dist-0.9.1.jar:/users/camelia/thecluster/flink-0.9.1/lib/flink-python-0.9.1.jar
>> jobmanager ~
>> ::
>> jobmanager 
>> /users/camelia/thecluster/flink-0.9.1/conf
>> 
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> org/apache/flink/runtime/jobmanager/JobManager
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.flink.runtime.jobmanager.JobManager
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
>> Could not find the main class:
>> org.apache.flink.runtime.jobmanager.JobManager. Program will exit.
>>
>>
>>
>> *** In taskmanager.sh, the following echoes:
>>
>> echo Starting task manager on host $HOSTNAME
>> echo "taskmanager ~"
>> pwd
>> echo "taskmanager ~"
>> echo ${FLINK_ENV_JAVA_OPTS}
>> echo "taskmanager ~"
>> echo $FLINK_JM_CLASSPATH
>> echo "taskmanager ~"
>> echo $INTERNAL_HADOOP_CLASSPATHS
>> echo "taskmanager "
>> echo $FLINK_CONF_DIR
>> echo ""
>> $JAVA_RUN $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}"
>> -classpath "`manglePathList
>> "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`"
>> org.apache.flink.runtime.taskmanager.TaskManager --configDir
>> "$FLINK_CONF_DIR" --streamingMode "$STREAMINGMODE" > "$out" 2>&1 <
>> /dev/null &
>>
>> ==> gave:
>>
>> taskmanager ~
>> /users/camelia/thecluster
>> taskmanager ~
>>
>> taskmanager ~
>>
>> taskmanager ~
>> ::
>> taskmanager 
>> /users/camelia/thecluster/flink-0.9.1/conf
>> 
>>
>> Exception in 

Re: finite subset of an infinite data stream

2015-11-11 Thread Robert Metzger
I think what you call "union" is a "connected stream" in Flink. Have a look
at this example: https://gist.github.com/fhueske/4ea5422edb5820915fa4
It shows how to dynamically update a list of filters by external requests.
Maybe that's what you are looking for?
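
In essence, the gist boils down to something like this (a condensed sketch
of my own, not the gist verbatim; the String types are placeholders):

import java.util.HashSet;
import java.util.Set;

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

// Sketch: stream 1 carries filter updates (the external requests), stream 2
// carries the data; the operator keeps the current filter set and drops
// matching records.
public class DynamicFilter
        implements CoFlatMapFunction<String, String, String> {

    private final Set<String> filters = new HashSet<String>();

    @Override
    public void flatMap1(String filterUpdate, Collector<String> out) {
        filters.add(filterUpdate); // an external request adds a filter term
    }

    @Override
    public void flatMap2(String record, Collector<String> out) {
        for (String f : filters) {
            if (record.contains(f)) {
                return; // drop records matching any active filter
            }
        }
        out.collect(record);
    }
}

Wired up along the lines of
filterUpdates.broadcast().connect(data).flatMap(new DynamicFilter()), so
that every parallel instance sees all filter updates.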



On Wed, Nov 11, 2015 at 12:15 PM, Stephan Ewen  wrote:

> Hi!
>
> I do not really understand what exactly you want to do, especially the "union
> an infinite real time data stream with filtered persistent data where the
> condition of filtering is provided by external requests".
>
> If you want to work on substreams in general, there are two options:
>
> 1) Create the substream in a streaming window. You can "cut" the stream
> based on special records/events that signal that the subsequence is done.
> Have a look at the "Trigger" class for windows, it can react to elements
> and their contents:
>
>
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#windows-on-keyed-data-streams
> (section on Advanced Windowing).
>
>
> 2) You can trigger sequences of batch jobs. The batch job data source
> (input format) can decide when to stop consuming the stream, at which point
> the remainder of the transformations run, and the batch job finishes.
> You can already run new transformation chains after each call to
> "env.execute()", once the execution has finished, to implement the
> sequence of batch jobs.
>
>
> I would try and go for the windowing solution if that works, because that
> will give you better fault tolerance / high availability. In the repeated
> batch jobs case, you need to worry about what happens when the
> driver program (that calls env.execute()) fails.
>
>
> Hope that helps...
>
> Greetings,
> Stephan
>
>
>
> On Mon, Nov 9, 2015 at 1:24 PM, rss rss  wrote:
>
>> Hello,
>>
>>   thanks for the answer, but windows produce periodic results. I used
>> your example but the data source is changed to TCP stream:
>>
>> DataStream<String> text = env.socketTextStream("localhost", 2015,
>> '\n');
>> DataStream<Tuple2<String, Integer>> wordCounts =
>> text
>> .flatMap(new LineSplitter())
>> .keyBy(0)
>> .timeWindow(Time.of(5, TimeUnit.SECONDS))
>> .sum(1);
>>
>> wordCounts.print();
>> env.execute("WordCount Example");
>>
>>  I see results being printed indefinitely instead of just a single list.
>>
>>  The data source is the following script:
>> -
>> #!/usr/bin/env ruby
>>
>> require 'socket'
>>
>> server = TCPServer.new 2015
>> loop do
>>   Thread.start(server.accept) do |client|
>> puts Time.now.to_s + ': New client!'
>> loop do
>>   client.puts "#{Time.now} #{[*('A'..'Z')].sample(3).join}"
>>   sleep rand(1000)/1000.0
>> end
>> client.close
>>   end
>> end
>> -
>>
>>   My purpose is to union an infinite real-time data stream with filtered
>> persistent data, where the filtering condition is provided by external
>> requests, and only the result of the union is of interest. In this case I
>> guess I need a way to terminate the stream. Maybe I'm wrong.
>>
>>   Moreover, it should be possible to link the streams on the next request
>> with other filtering criteria, that is, to create a new data transformation
>> chain after env.execute("WordCount Example") has run. Is that possible now?
>> If not, is it possible with minimal changes to the core of Flink?
>>
>> Regards,
>> Roman
>>
>> 2015-11-09 12:34 GMT+04:00 Stephan Ewen :
>>
>>> Hi!
>>>
>>> If you want to work on subsets of streams, the answer is usually to use
>>> windows, "stream.keyBy(...).timeWindow(Time.of(1, MINUTE))".
>>>
>>> The transformations that you want to make, do they fit into a window
>>> function?
>>>
>>> There are thoughts to introduce something like global time windows
>>> across the entire stream, inside which you can work more in a batch-style,
>>> but that is quite an extensive change to the core.
>>>
>>> Greetings,
>>> Stephan
>>>
>>>
>>> On Sun, Nov 8, 2015 at 5:15 PM, rss rss  wrote:
>>>
 Hello,



 I need to extract a finite subset, like a data buffer, from an infinite
 data stream. The best way for me would be to obtain a finite stream with
 data accumulated over the preceding minute (as an example). But I have not
 found any existing technique to do it.



 As possible ways to do something close to a stream’s subset, I see the
 following cases:

 -  some transformation operation like ‘take_while’ that produces a new
 stream but is able to switch it to the FINISHED state. Unfortunately, I
 have not found how to switch the state of a stream from the user code of
 transformation functions;

 -  new DataStream or StreamSource constructors which allow connecting a
 data processing chain to the source 

Re: Flink, Kappa and Lambda

2015-11-11 Thread Nick Dimiduk
The first and third points here aren't very fair -- they apply to all data
systems. Systems downstream of your database can lose data in the same way:
the database retention policy expires old data, downstream fails, and back
to the tapes you must go. Likewise with the third point, a bug in any ETL
system can cause problems; that's not specific to streaming in general or
to Kafka/Flink specifically.

I'm much more curious about the second claim. The whole point of high
availability in these systems is to not lose data during failure. The
post's author is not specific on any of these points, but just as I look
to a distributed database community to prove to me it doesn't lose data in
these corner cases, so too do I expect Kafka to prove it is resilient. In
the absence of formally proven correct software, I look to empirical
evidence in the form of chaos-monkey-style tests.

On Wednesday, November 11, 2015, Welly Tambunan  wrote:

> Hi Stephan,
>
>
> Thanks for your response.
>
>
> We are trying to justify whether it's enough to use the Kappa Architecture
> with Flink. This is more about resiliency, message loss issues, etc.
>
> The article worries about message loss even if you are using Kafka.
>
> No matter the message queue or broker you rely on, whether it be RabbitMQ,
> JMS, ActiveMQ, WebSphere, MSMQ, and yes, even Kafka, you can lose messages
> in any of the following ways:
>
>- A downstream system from the broker can have data loss
>- All message queues today can lose already acknowledged messages
>during failover or leader election.
>- A bug can send the wrong messages to the wrong systems.
>
> Cheers
>
> On Wed, Nov 11, 2015 at 4:13 PM, Stephan Ewen wrote:
>
>> Hi!
>>
>> Can you explain a little more what you want to achieve? Maybe then we can
>> give a few more comments...
>>
>> I briefly read through some of the articles you linked, but did not quite
>> understand their train of thoughts.
>> For example, letting Tomcat write to Cassandra directly, and to Kafka,
>> might just be redundant. Why not let the streaming job that reads the Kafka
>> queue
>> move the data to Cassandra as one of its results? Further more, durable
>> storing the sequence of events is exactly what Kafka does, but the article
>> suggests to use Cassandra for that, which I find very counter intuitive.
>> It looks a bit like the suggested approach is only adopting streaming for
>> half the task.
>>
>> Greetings,
>> Stephan
>>
>>
>> On Tue, Nov 10, 2015 at 7:49 AM, Welly Tambunan wrote:
>>> Hi All,
>>>
>>> I read a couple of article about Kappa and Lambda Architecture.
>>>
>>>
>>> http://www.confluent.io/blog/real-time-stream-processing-the-next-step-for-apache-flink/
>>>
>>> I'm convinced that Flink will simplify this one with streaming.
>>>
>>> However, I also stumbled upon this blog post, which has a valid argument
>>> for having a system-of-record storage (event sourcing), and finally the
>>> lambda architecture appears as the solution. Basically it will write
>>> twice, to the queuing system and to C*, for safety. The system of record
>>> here basically stores the events (deltas).
>>>
>>> [image: Inline image 1]
>>>
>>>
>>> https://lostechies.com/ryansvihla/2015/09/17/event-sourcing-and-system-of-record-sane-distributed-development-in-the-modern-era-2/
>>>
>>> Another approach uses the lambda architecture to maintain the
>>> correctness of the system.
>>>
>>>
>>> https://lostechies.com/ryansvihla/2015/09/17/real-time-analytics-with-spark-streaming-and-cassandra/
>>>
>>>
>>> Given that he's using Spark for the streaming processor, do we have to
>>> do the same thing with Apache Flink?
>>>
>>>
>>>
>>> Cheers
>>> --
>>> Welly Tambunan
>>> Triplelands
>>>
>>> http://weltam.wordpress.com
>>> http://www.triplelands.com 
>>>
>>
>>
>
>
> --
> Welly Tambunan
> Triplelands
>
> http://weltam.wordpress.com
> http://www.triplelands.com 
>


Re: Flink, Kappa and Lambda

2015-11-11 Thread Welly Tambunan
Hi Stephan,


Thanks for your response.


We are trying to justify whether it's enough to use the Kappa Architecture
with Flink. This is more about resiliency, message loss issues, etc.

The article worries about message loss even if you are using Kafka.

No matter the message queue or broker you rely on, whether it be RabbitMQ,
JMS, ActiveMQ, WebSphere, MSMQ, and yes, even Kafka, you can lose messages
in any of the following ways:

   - A downstream system from the broker can have data loss
   - All message queues today can lose already acknowledged messages during
   failover or leader election.
   - A bug can send the wrong messages to the wrong systems.

Cheers

On Wed, Nov 11, 2015 at 4:13 PM, Stephan Ewen  wrote:

> Hi!
>
> Can you explain a little more what you want to achieve? Maybe then we can
> give a few more comments...
>
> I briefly read through some of the articles you linked, but did not quite
> understand their train of thoughts.
> For example, letting Tomcat write to Cassandra directly, and to Kafka,
> might just be redundant. Why not let the streaming job that reads the Kafka
> queue
> move the data to Cassandra as one of its results? Further more, durable
> storing the sequence of events is exactly what Kafka does, but the article
> suggests to use Cassandra for that, which I find very counter intuitive.
> It looks a bit like the suggested approach is only adopting streaming for
> half the task.
>
> Greetings,
> Stephan
>
>
> On Tue, Nov 10, 2015 at 7:49 AM, Welly Tambunan  wrote:
>
>> Hi All,
>>
>> I read a couple of article about Kappa and Lambda Architecture.
>>
>>
>> http://www.confluent.io/blog/real-time-stream-processing-the-next-step-for-apache-flink/
>>
>> I'm convinced that Flink will simplify this one with streaming.
>>
>> However, I also stumbled upon this blog post, which has a valid argument
>> for having a system-of-record storage (event sourcing), and finally the
>> lambda architecture appears as the solution. Basically it will write
>> twice, to the queuing system and to C*, for safety. The system of record
>> here basically stores the events (deltas).
>>
>> [image: Inline image 1]
>>
>>
>> https://lostechies.com/ryansvihla/2015/09/17/event-sourcing-and-system-of-record-sane-distributed-development-in-the-modern-era-2/
>>
>> Another approach uses the lambda architecture to maintain the
>> correctness of the system.
>>
>>
>> https://lostechies.com/ryansvihla/2015/09/17/real-time-analytics-with-spark-streaming-and-cassandra/
>>
>>
>> Given that he's using Spark for the streaming processor, do we have to
>> do the same thing with Apache Flink?
>>
>>
>>
>> Cheers
>> --
>> Welly Tambunan
>> Triplelands
>>
>> http://weltam.wordpress.com
>> http://www.triplelands.com 
>>
>
>


-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


RE: Cluster installation gives java.lang.NoClassDefFoundError for everything

2015-11-11 Thread Camelia Elena Ciolac
Good morning,

Thank you Stephan!

I keep on testing, and in the meantime I'm wondering if the Java version on
the cluster may be part of the issue. The version is:
java version "1.6.0_36"
OpenJDK Runtime Environment (IcedTea6 1.13.8) (rhel-1.13.8.1.el6_7-x86_64)
OpenJDK 64-Bit Server VM (build 23.25-b01, mixed mode)

The cluster has a shared network file system, so no problem in that
respect. I keep on trying, and I've put some echoes for debugging in the
Flink start scripts.

Best regards,
Camelia




From: Camelia Elena Ciolac
Sent: Monday, November 09, 2015 2:32 PM
To: user@flink.apache.org
Subject: Cluster installation gives java.lang.NoClassDefFoundError for 
everything

Hello,

I am configuring Flink to run on a cluster with NFS.
I have the Flink 0.9.1 distribution in some path on NFS; I added that path
as FLINK_HOME in ~/.bashrc, and also added the $FLINK_HOME/lib folder to
$PATH.
I have the slaves file and the YAML file configured correctly with the
nodes involved.

Still, when I run the following command I get the following errors:

$FLINK_HOME/bin/start-cluster.sh

(I also tried:
cd $FLINK_HOME
./bin/start-cluster.sh)

--> in /log/flink-theuser-jobmanager-thenode.out
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/flink/runtime/jobmanager/JobManager
Caused by: java.lang.ClassNotFoundException: 
org.apache.flink.runtime.jobmanager.JobManager
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: org.apache.flink.runtime.jobmanager.JobManager. 
Program will exit.

-->in /log/flink-theuser-taskmanager-theslavenode.out
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/flink/runtime/taskmanager/TaskManager
Caused by: java.lang.ClassNotFoundException: 
org.apache.flink.runtime.taskmanager.TaskManager
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: 
org.apache.flink.runtime.taskmanager.TaskManager. Program will exit.


And lastly, my script runs a job, which gives:

Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/flink/client/CliFrontend
Caused by: java.lang.ClassNotFoundException: org.apache.flink.client.CliFrontend
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: org.apache.flink.client.CliFrontend. Program 
will exit.

What could be the cause, and what is the solution?

Looking forward to your answer as soon as possible.
Many thanks,
Camelia