Re: Any good ideas for online/offline detection of devices that send events?

2017-03-06 Thread Tzu-Li (Gordon) Tai
Some more input:

Right now, you can also use the `ProcessFunction` [1] available in Flink 1.2 to 
simulate state TTL.
The `ProcessFunction` should allow you to keep device state and simulate the 
online / offline detection by registering processing timers. In the `onTimer` 
callback, you can emit the “offline” marker event downstream, and in the 
`processElement` method, you can emit the “online” marker event if the device 
sends an event after it was previously determined to be offline.

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/process_function.html
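
To make this concrete, here is a rough, untested sketch of such a function (my 
own illustration, not from the thread). It assumes hypothetical `DeviceEvent` / 
`DeviceStatus` types and a stream keyed by device id; it re-registers a 
processing-time timer on every event and only reacts to the most recent one:

```
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.{ProcessFunction, RichProcessFunction}
import org.apache.flink.util.Collector

// Hypothetical types, for illustration only.
case class DeviceEvent(deviceId: String, payload: String)
case class DeviceStatus(deviceId: String, online: Boolean)

// Meant to run on a keyed stream, e.g.
//   events.keyBy(_.deviceId).process(new OnlineOfflineDetector(5 * 60 * 1000))
// Note: Flink 1.2 needs RichProcessFunction for keyed state; from 1.3 on,
// ProcessFunction itself gives access to state.
class OnlineOfflineDetector(timeoutMs: Long)
    extends RichProcessFunction[DeviceEvent, DeviceStatus] {

  // per-key state: the device id, the latest registered timer, and the offline flag
  private var deviceId: ValueState[String] = _
  private var lastTimer: ValueState[java.lang.Long] = _
  private var offline: ValueState[java.lang.Boolean] = _

  override def open(parameters: Configuration): Unit = {
    deviceId = getRuntimeContext.getState(
      new ValueStateDescriptor("deviceId", classOf[String]))
    lastTimer = getRuntimeContext.getState(
      new ValueStateDescriptor("lastTimer", classOf[java.lang.Long]))
    offline = getRuntimeContext.getState(
      new ValueStateDescriptor("offline", classOf[java.lang.Boolean]))
  }

  override def processElement(
      event: DeviceEvent,
      ctx: ProcessFunction[DeviceEvent, DeviceStatus]#Context,
      out: Collector[DeviceStatus]): Unit = {
    deviceId.update(event.deviceId)
    // the device was previously declared offline -> it just came back online
    if (offline.value() != null && offline.value()) {
      offline.update(false)
      out.collect(DeviceStatus(event.deviceId, online = true))
    }
    // (re)arm the timeout relative to this event's arrival
    val timer = ctx.timerService().currentProcessingTime() + timeoutMs
    lastTimer.update(timer)
    ctx.timerService().registerProcessingTimeTimer(timer)
  }

  override def onTimer(
      timestamp: Long,
      ctx: ProcessFunction[DeviceEvent, DeviceStatus]#OnTimerContext,
      out: Collector[DeviceStatus]): Unit = {
    // only the most recently registered timer counts: no event has arrived since
    if (lastTimer.value() != null && lastTimer.value().longValue() == timestamp) {
      offline.update(true)
      out.collect(DeviceStatus(deviceId.value(), online = false))
    }
  }
}
```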

On March 6, 2017 at 9:40:28 PM, Tzu-Li (Gordon) Tai (tzuli...@apache.org) wrote:

Hi Bruno!

The Flink CEP library also seems like an option you can look into to see if it 
can easily realize what you have in mind.

Basically, the pattern you are detecting is a timeout of 5 minutes after the 
last event. Once that pattern is detected, you emit a “device offline” event 
downstream.
With this, you can also extend the pattern output stream to detect whether a 
device has become online again.

Here are some materials for you to take a look at Flink CEP:
1. http://flink.apache.org/news/2016/04/06/cep-monitoring.html
2. 
https://www.slideshare.net/FlinkForward/fabian-huesketill-rohrmann-declarative-stream-processing-with-streamsql-and-cep?qid=3c13eb7d-ed39-4eae-9b74-a6c94e8b08a3==_search=4

The CEP parts of the slides in 2. also provide some good examples of timeout 
detection using CEP.

Hope this helps!

Cheers,
Gordon

On March 4, 2017 at 1:27:51 AM, Bruno Aranda (bara...@apache.org) wrote:

Hi all,

We are trying to write an online/offline detector for devices that keep 
streaming data through Flink. We know roughly how often to expect events from 
those devices, and we want to be able to detect when any of them stops (goes 
offline) or starts again (comes back online) sending events through the 
pipeline. For instance, if 5 minutes have passed since the last event of a 
device, we would fire an event to indicate that the device is offline.

The data from the devices comes through Kafka, with their own event time. The 
device events are in order within the partitions, and each device goes to a 
specific partition, so in theory we should not see out-of-order events when 
looking at one partition.

We are assuming a good way to do this is by using sliding windows that are big 
enough, so we can see the relevant gap before/after the events for each 
specific device. 

We were wondering if there are other ideas on how to solve this.

Many thanks!

Bruno

Re: How to use 'dynamic' state

2017-03-06 Thread Tzu-Li (Gordon) Tai
Hi Steve,

I’ll try to provide some input for the approaches you’re currently looking into 
(I’ll comment on your email below):

* API based stop and restart of job … ugly. 
Yes, indeed ;) I think this is absolutely not necessary.

* Use a co-map function with the rules as one stream and the events as the 
other. This seems better; however, I would like to have several ‘trigger’ 
functions changed together .. e.g. a tumbling window for one type of criteria 
and a flat map for a different sort … So I’m not sure how to hook this up for 
more than a simple co-map/flatmap. I did see this suggested in one answer and
Do you mean that operators listen only to certain rules / criteria settings 
changes? You could either have separate stream sources for each kind of 
criteria / rule trigger event, or have a single source and split it 
afterwards. Then, you broadcast each of them and connect them with the 
corresponding co-maps / flat-maps.
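
A rough sketch of that wiring (my own illustration, with hypothetical `Rule` / 
`Event` / `Alert` types): the rule stream is broadcast, so every parallel 
instance of the co-flat-map sees every rule update and keeps its own copy of 
the current rule set.

```
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector

// Hypothetical types, for illustration only.
case class Rule(id: String, minPeople: Int)
case class Event(sensorId: String, peopleCount: Int)
case class Alert(ruleId: String, sensorId: String)

// Keeps the latest version of every rule locally in each parallel instance.
class RuleEvaluator extends CoFlatMapFunction[Event, Rule, Alert] {
  private val rules = scala.collection.mutable.Map[String, Rule]()

  override def flatMap1(event: Event, out: Collector[Alert]): Unit = {
    // evaluate the event against all currently known rules
    rules.values.foreach { rule =>
      if (event.peopleCount >= rule.minPeople) out.collect(Alert(rule.id, event.sensorId))
    }
  }

  override def flatMap2(rule: Rule, out: Collector[Alert]): Unit = {
    // a rule update arrived from the UI: replace the previous version
    rules.put(rule.id, rule)
  }
}

// wiring (rules is the stream of rule updates, events the event stream):
//   val alerts = events.connect(rules.broadcast).flatMap(new RuleEvaluator)
// For a real job you would also want to checkpoint the rule map, e.g. via the
// ListCheckpointed / CheckpointedFunction interfaces.
```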

* Use broadcast state : this seems reasonable however I couldn’t tell if the 
broadcast state would be available to all of the processing functions. Is it 
generally available? 
From the context of your description, I think what you want is that the 
injected rules stream can be seen by all operators (instead of “broadcast 
state”, which in Flink streaming refers to something else).

Aljoscha recently consolidated a FLIP for Side Inputs [1], which I think is 
targeted exactly for what you have in mind here. Perhaps you can take a look at 
that and see if it makes sense for your use case? But of course, this isn’t yet 
available as it is still under discussion. I think Side Inputs may be an ideal 
solution for what you have in mind here, as the rule triggers I assume should 
be slowly changing.

I’ve CCed Aljoscha so that he can probably provide more insights, as he has 
worked a lot on the stuff mentioned here.

Cheers,
Gordon

[1] FLIP-17: 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API

On March 7, 2017 at 5:05:04 AM, Steve Jerman (st...@kloudspot.com) wrote:

I’ve been reading the code/user group/SO and haven’t really found a great answer 
to this… so I thought I’d ask.  

I have a UI that allows the user to edit rules which include specific criteria, 
for example: trigger an event if X many people are present for over a minute.  

I would like to have a flink job that processes an event stream and triggers on 
these rules.  

The catch is that I don’t want to have to restart the job if the rules change… 
(it would be easy otherwise :))  

So I found four ways to proceed:  

* API based stop and restart of job … ugly.  

* Use a co-map function with the rules as one stream and the events as the 
other. This seems better; however, I would like to have several ‘trigger’ 
functions changed together .. e.g. a tumbling window for one type of criteria 
and a flat map for a different sort … So I’m not sure how to hook this up for 
more than a simple co-map/flatmap. I did see this suggested in one answer and  

* Use broadcast state : this seems reasonable however I couldn’t tell if the 
broadcast state would be available to all of the processing functions. Is it 
generally available?  

* Implement my own operators… seems complicated ;)  

Are there other approaches?  

Thanks for any advice  
Steve

Re: AWS exception serialization problem

2017-03-06 Thread Shannon Carey
This happened when running Flink with bin/run-local.sh. I notice that there only 
appears to be one Java process. The job manager and the task manager run in the 
same JVM, right? I notice, however, that there are two blob store folders on 
disk. Could the problem be caused by two different FlinkUserCodeClassLoader 
objects pointing to the two different JARs?


From: Shannon Carey
Date: Monday, March 6, 2017 at 6:39 PM
To: "user@flink.apache.org"
Subject: AWS exception serialization problem

Has anyone encountered this or know what might be causing it?



java.lang.RuntimeException: Could not forward element to next operator
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:394)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:376)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:366)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:349)
at 
org.apache.flink.streaming.api.operators.StreamSource$ManualWatermarkContext.collect(StreamSource.java:355)
at 
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:225)
at 
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.run(Kafka09Fetcher.java:253)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.esotericsoftware.kryo.KryoException: Error during Java 
deserialization.
at 
com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:47)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:657)
at 
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:172)
at 
org.apache.flink.api.scala.typeutils.TrySerializer.copy(TrySerializer.scala:51)
at 
org.apache.flink.api.scala.typeutils.TrySerializer.copy(TrySerializer.scala:32)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:389)
... 7 more
Caused by: java.lang.ClassNotFoundException: 
com.amazonaws.services.s3.model.AmazonS3Exception
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:677)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1819)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1986)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at 
java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:552)
at java.lang.Throwable.readObject(Throwable.java:914)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2122)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at 
java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:552)
at java.lang.Throwable.readObject(Throwable.java:914)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2122)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 

AWS exception serialization problem

2017-03-06 Thread Shannon Carey
Has anyone encountered this or know what might be causing it?



java.lang.RuntimeException: Could not forward element to next operator
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:394)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:376)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:366)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:349)
at 
org.apache.flink.streaming.api.operators.StreamSource$ManualWatermarkContext.collect(StreamSource.java:355)
at 
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:225)
at 
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.run(Kafka09Fetcher.java:253)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.esotericsoftware.kryo.KryoException: Error during Java 
deserialization.
at 
com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:47)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:657)
at 
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:172)
at 
org.apache.flink.api.scala.typeutils.TrySerializer.copy(TrySerializer.scala:51)
at 
org.apache.flink.api.scala.typeutils.TrySerializer.copy(TrySerializer.scala:32)
at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:389)
... 7 more
Caused by: java.lang.ClassNotFoundException: 
com.amazonaws.services.s3.model.AmazonS3Exception
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:677)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1819)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1713)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1986)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at 
java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:552)
at java.lang.Throwable.readObject(Throwable.java:914)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2122)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at 
java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:552)
at java.lang.Throwable.readObject(Throwable.java:914)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2122)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
at 
java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:552)
at java.lang.Throwable.readObject(Throwable.java:914)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
at 

How to use 'dynamic' state

2017-03-06 Thread Steve Jerman
I’ve been reading the code/user group/SO and haven’t really found a great answer 
to this… so I thought I’d ask.

I have a UI that allows the user to edit rules which include specific criteria, 
for example: trigger an event if X many people are present for over a minute.

I would like to have a flink job that processes an event stream and triggers on 
these rules.

The catch is that I don’t want to have to restart the job if the rules change… 
(it would be easy otherwise :))

So I found four ways to proceed:

* API based stop and restart of job … ugly.

* Use a co-map function with the rules as one stream and the events as the 
other. This seems better; however, I would like to have several ‘trigger’ 
functions changed together .. e.g. a tumbling window for one type of criteria 
and a flat map for a different sort … So I’m not sure how to hook this up for 
more than a simple co-map/flatmap. I did see this suggested in one answer and 

* Use broadcast state : this seems reasonable however I couldn’t tell if the 
broadcast state would be available to all of the processing functions. Is it 
generally available?

* Implement my own operators… seems complicated ;)

Are there other approaches?

Thanks for any advice
Steve

Issues with Event Time and Kafka

2017-03-06 Thread ext.mwalker
Hi Folks,

We are working on a Flink job to process a large amount of data coming in
from a Kafka stream.

We selected Flink because the data is sometimes out of order or late, and we
need to roll up the data into 30-minutes event time windows, after which we
are writing it back out to an s3 bucket.

We have hit a couple issues:

1) The job works fine using processing time, but when we switch to event
time (almost) nothing seems to be written out.
Our watermark code looks like this:
```
override def getCurrentWatermark(): Watermark = {
  new Watermark(System.currentTimeMillis() - maxLateness)
}
```
And we are doing this:
```
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
```
and this:
```
.assignTimestampsAndWatermarks(new
TimestampAndWatermarkAssigner(Time.minutes(30).toMilliseconds))
```

However, even though we get millions of records per hour (the vast majority
of which are no more than 30 minutes late), we only get about 2 - 10 records per
hour written out to the s3 bucket.
We are using a custom BucketingFileSink Bucketer if folks believe that is
the issue I would be happy to provide that code here as well.
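
For reference, a minimal sketch (not from this message, and assuming a 
hypothetical `Reading` type) of an assigner that derives watermarks from the 
records' own event timestamps rather than from the wall clock:

```
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical record type; eventTime is the record's own timestamp in epoch millis.
case class Reading(deviceId: String, eventTime: Long, value: Double)

// The watermark trails the highest *event* timestamp seen so far by the allowed
// lateness, instead of being derived from the processing-time clock.
class ReadingWatermarkAssigner(maxLateness: Time)
    extends BoundedOutOfOrdernessTimestampExtractor[Reading](maxLateness) {
  override def extractTimestamp(r: Reading): Long = r.eventTime
}

// usage: stream.assignTimestampsAndWatermarks(new ReadingWatermarkAssigner(Time.minutes(30)))
```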

2) On top of all this, we would really prefer to write the records directly
to Aurora in RDS rather than to an intermediate s3 bucket, but it seems that
the JDBC sink connector is unsupported / doesn't exist.
If this is not the case we would love to know.

Thanks in advance for all the help / insight on this,

Max Walker



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Issues-with-Event-Time-and-Kafka-tp12061.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


FlinkKafkaConsumer010 - creating a data stream of type DataStream<ConsumerRecord<K,V>>

2017-03-06 Thread Dominik Safaric
Hi,

Unfortunately I cannot find the option of using raw ConsumerRecord 
instances when creating a Kafka data stream. 

In general, I would like to use an instance of the mentioned type because our 
use case requires certain metadata such as record offset and partition.

So far I’ve examined the source code of the Kafka connector and checked the 
docs, but unfortunately I could not find the option of creating a data stream 
of the type DataStream<ConsumerRecord<K, V>>. 

Am I missing something, or do I have to implement this myself and build Flink 
from source in order to have this ability? 
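
One possibility (my own reading of the 1.2 connector API, so treat the 
following as an unverified sketch): the FlinkKafkaConsumer010 constructors also 
accept a KeyedDeserializationSchema, whose deserialize method is handed the 
topic, partition and offset, so that metadata can be packed into a custom 
record type without patching Flink.

```
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.java.typeutils.TypeExtractor
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema

// Hypothetical wrapper type carrying the Kafka metadata alongside the raw value.
case class RecordWithMetadata(topic: String, partition: Int, offset: Long, value: Array[Byte])

class MetadataDeserializationSchema extends KeyedDeserializationSchema[RecordWithMetadata] {

  override def deserialize(messageKey: Array[Byte], message: Array[Byte],
                           topic: String, partition: Int, offset: Long): RecordWithMetadata =
    RecordWithMetadata(topic, partition, offset, message)

  override def isEndOfStream(nextElement: RecordWithMetadata): Boolean = false

  override def getProducedType: TypeInformation[RecordWithMetadata] =
    TypeExtractor.getForClass(classOf[RecordWithMetadata])
}

// usage (topic name and properties are placeholders):
//   env.addSource(new FlinkKafkaConsumer010[RecordWithMetadata](
//     "my-topic", new MetadataDeserializationSchema, kafkaProps))
```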

Thanks in advance,
Dominik  

Re: OutOfMemory error (Direct buffer memory) while allocating the TaskManager off-heap memory

2017-03-06 Thread Nico Kruber
Hi Yassine,
Thanks for reporting this. The problem you run into is due to start-local.sh, 
which we discourage in favour of start-cluster.sh, as the latter resembles the 
real use case better.

In your case, start-local.sh starts a job manager with an embedded task 
manager but does not parse the task manager config properly to set the right 
parameters.
With start-cluster.sh (or manually via jobmanager.sh and taskmanager.sh), job 
and task manager are started separately in separate JVMs without this issue.

There's an open pull request to make start-cluster.sh cooperate better in 
certain situations, in order to replace start-local.sh completely [1], but it 
hasn't been merged yet, nor has start-local.sh been replaced. In the future, we 
might do that, though.

FYI: I created a Jira issue for us to check further code paths that may lead 
to this problem:
https://issues.apache.org/jira/browse/FLINK-5973


Regards
Nico

[1] https://github.com/apache/flink/pull/3298

On Friday, 3 March 2017 17:07:14 CET Yassine MARZOUGUI wrote:
> Hi all,
> 
> I tried starting a local Flink 1.2.0 cluster using start-local.sh, with the
> following settings for the taskmanager memory:
> 
> taskmanager.heap.mb: 16384
> taskmanager.memory.off-heap: true
> taskmanager.memory.preallocate: true
> 
> That throws an OOM error:
> Caused by: java.lang.Exception: OutOfMemory error (Direct buffer memory)
> while allocating the TaskManager off-heap memory (39017161219 bytes). Try
> increasing the maximum direct memory (-XX:MaxDirectMemorySize)
> 
> However, if I add an absolute taskmanager.memory.size:
> taskmanager.memory.size: 15360
> the cluster starts successfully.
> 
> My understanding is that if taskmanager.memory.size is unspecified then it
> should be equal to 0.7 * taskmanager.heap.mb. So I don't understand why it
> throws an exception, yet works when set explicitly to a value larger than that fraction.
> 
> Any help is appreciated.
> 
> Best,
> Yassine





Re: Amazon EMR Releases

2017-03-06 Thread Meghashyam Sandeep V
I spoke to one of the representatives in AWS EMR team last week. They
mentioned that they usually practice a 4 week cool down period. Hopefully
we will get Flink 1.2 in the next week.

Thanks,
Sandeep

On Mar 6, 2017 9:27 AM, "Chen Qin"  wrote:

> EMR is a team within Amazon Web Services; to them, Flink is one of the
> frameworks they need to support. I haven't seen an SLA around release time lag,
> but it seems common practice to keep a cool-down period and let problems get
> resolved before onboarding new versions.
>
> Sent from my iPhone
>
> On Mar 6, 2017, at 07:23, Torok, David  wrote:
>
> Does anyone have any insight as to how closely Amazon EMR releases will
> track the Flink releases?  For example EMR 5.3.1 on Feb 7 included Flink
> 1.1.4, from Dec 21 … about 1.5 month lag.  With Flink accelerating to a
> timed release schedule,  I wonder how far behind EMR will track the
> official release schedule.
>
>
>
>


Amazon EMR Releases

2017-03-06 Thread Torok, David
Does anyone have any insight as to how closely Amazon EMR releases will track 
the Flink releases?  For example EMR 5.3.1 on Feb 7 included Flink 1.1.4, from 
Dec 21 ... about 1.5 month lag.  With Flink accelerating to a timed release 
schedule,  I wonder how far behind EMR will track the official release schedule.



TTL for State Entries / FLINK-3089

2017-03-06 Thread Johannes Schulte
Hi,

I am trying to achieve a stream-to-stream join with big windows and am
searching for a way to clean up the state of old keys. I am already using a
RichCoProcessFunction.

I found there is already an existing ticket

https://issues.apache.org/jira/browse/FLINK-3089

but I have doubts that registering a timer for every incoming event is
feasible, as the timers seem to reside in an in-memory queue.

The task is somewhat similar to the following blog post:
http://devblog.mediamath.com/real-time-streaming-attribution-using-apache-flink

Is the implementation of a custom window operator a necessity for achieving
such functionality?

Thanks a lot,

Johannes


Re: AsyncIO/QueryableStateClient hanging with high parallelism

2017-03-06 Thread Yassine MARZOUGUI
I think I found the reason for what happened. The way I used the
QueryableStateClient is that I wrapped scala.concurrent.Future in a
FlinkFuture and then called FlinkFuture.thenAccept. It turns out
thenAccept doesn't throw exceptions, and when an exception happens (which
likely happened once I increased the parallelism) the job simply doesn't
finish. I solved the problem by using resultFuture.get(), which raised the
appropriate exceptions when they happened and failed the job.

Best,
Yassine

2017-03-06 15:53 GMT+01:00 Yassine MARZOUGUI :

> Hi all,
>
> I set up a job with simple queryable state sink and tried to query it from
> another job using the new Async I/O API. Everything worked as expected,
> except when I tried to increase the parallelism of the querying job it
> hanged.
> As you can see in the attached image, when the parallelism is 5 (or even <5)
> the job finishes within 5 seconds, but when it is >5 it hangs. Any idea of
> what might be causing this behaviour? Thank you.
>
> Best,
> Yassine
>


Re: Integrate Flink with S3 on EMR cluster

2017-03-06 Thread vinay patil
Hi Guys,

I am getting the same exception:
EMRFileSystem not Found

I am trying to read an encrypted S3 file using the Hadoop FileSystem class
(using Flink 1.2.0).
When I copy all the libs from /usr/share/aws/emrfs/lib and /usr/lib/hadoop
to the Flink lib folder, it works.

However I see that all these libs are already included in the Hadoop
classpath.

Is there any other way I can make this work ?



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Integrate-Flink-with-S3-on-EMR-cluster-tp5894p12053.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


AsyncIO/QueryableStateClient hanging with high parallelism

2017-03-06 Thread Yassine MARZOUGUI
Hi all,

I set up a job with simple queryable state sink and tried to query it from
another job using the new Async I/O API. Everything worked as expected,
except when I tried to increase the parallelism of the querying job it
hanged.
As you can see in the attached image, when the parallelism is 5 (or even <5) the
job finishes within 5 seconds, but when it is >5 it hangs. Any idea of what
might be causing this behaviour? Thank you.

Best,
Yassine


Re: Flink using notebooks in EMR

2017-03-06 Thread Tzu-Li (Gordon) Tai
Hi,

Are you running Zeppelin on a local machine?

I haven’t tried this before, but you could first try and check if port ‘6123’ 
is publicly accessible in the security group settings of the AWS EMR instances.

- Gordon


On March 3, 2017 at 10:21:41 AM, Meghashyam Sandeep V (vr1meghash...@gmail.com) 
wrote:

Hi there,

Has anyone tried using the Flink interpreter in Zeppelin with AWS EMR? I tried 
creating a new interpreter using host 'localhots' and port '6123', which 
didn't seem to work.

Thanks,
Sandeep

Re: Memory Limits: MiniCluster vs. Local Mode

2017-03-06 Thread Tzu-Li (Gordon) Tai
Hi Dominik,

AFAIK, the local mode executions create a mini cluster within the JVM to run 
the job.

Also, `MiniCluster` seems to be something FLIP-6 related, and since FLIP-6 is 
still work
in progress, I’m not entirely sure if it is viable at the moment. Right now, 
you should look
into using `LocalFlinkMiniCluster`.

In a lot of the Flink integration tests, we use a `LocalFlinkMiniCluster` to 
set up the test cluster programmatically, and instantiate a 
`StreamExecutionEnvironment` against that mini cluster. That would probably be 
helpful for trying out your Flink jobs programmatically in your CI / CD cycles.

You can also take a look at some Flink test utilities such as
`StreamingMultipleProgramsTestBase`, which helps you to set up an environment
that allows you to submit multiple test jobs on a single 
`LocalFlinkMiniCluster`. For a
simple example on how to use it, you can take a look at the tests in the 
Elasticsearch
connector. The Flink Kafka connector tests also have a more complicated test 
environment setup where jobs are submitted to the `LocalFlinkMiniCluster` using 
a remote environment.

As for memory consumption configuration for the created 
`LocalFlinkMiniCluster`, I think
you should be able to tune it using the `Configuration` instance passed to it.
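
A rough, untested sketch of what that could look like (class and method names 
as used by the Flink 1.2 test utilities, to the best of my knowledge; the 
configuration keys and values below are placeholders to adjust):

```
import org.apache.flink.configuration.Configuration
import org.apache.flink.runtime.minicluster.LocalFlinkMiniCluster
import org.apache.flink.streaming.api.scala._

object MiniClusterSmokeTest {
  def main(args: Array[String]): Unit = {
    val config = new Configuration()
    // plain string keys to keep the sketch self-contained; values are examples only
    config.setInteger("taskmanager.numberOfTaskSlots", 2)
    config.setInteger("taskmanager.memory.size", 64)   // managed memory (MB)
    config.setInteger("taskmanager.heap.mb", 512)

    val cluster = new LocalFlinkMiniCluster(config, false)
    cluster.start()

    try {
      // submit a job to the embedded cluster through a remote environment
      val env = StreamExecutionEnvironment.createRemoteEnvironment(
        "localhost", cluster.getLeaderRPCPort)
      env.fromElements(1, 2, 3).map(_ * 2).print()
      env.execute("mini-cluster smoke test")
    } finally {
      cluster.stop()
    }
  }
}
```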

Hope this helps!

Cheers,
Gordon

On March 4, 2017 at 12:27:53 AM, domi...@dbruhn.de (domi...@dbruhn.de) wrote:

Hey,  
for our CI/CD cycle I'd like to try out our Flink Jobs in an development  
environment without running them against a huge EMR cluster (which is  
what we do for production), so something like a standalone mode.  

Until now, for this standalone running, I just started the job jar. As  
the "env.execute()" is in the main-method, this works. I think this is  
called "Local Mode" by the Flink devs. I packaged the whole thing in a  
docker container so I have a deployable artefact.  

The problem with that is that the memory constraints seem to be  
difficult to control: setting Xmx and Xms for the job doesn't seem to  
limit the memory. This is most likely due to Flink's off-heap memory  
allocation.  

Now, I got as feedback that perhaps the MiniCluster is the way to go  
instead of the "Local Mode".  

My questions:  
1. Is the MiniCluster better than the local mode? What are the use-cases  
in which you would choose one over the other?  
2. Is there an example how to use the MiniCluster? I see that I need a  
JobGraph, how do I get one?  
3. What are the tuning parameters to limit the memory consumption of the  
MiniCluster (and maybe the local mode)?  

Thanks for your help,  
Dominik  


Re: Any good ideas for online/offline detection of devices that send events?

2017-03-06 Thread Tzu-Li (Gordon) Tai
Hi Bruno!

The Flink CEP library also seems like an option you can look into to see if it 
can easily realize what you have in mind.

Basically, the pattern you are detecting is a timeout of 5 minutes after the 
last event. Once that pattern is detected, you emit a “device offline” event 
downstream.
With this, you can also extend the pattern output stream to detect whether a 
device has become online again.

Here are some materials for you to take a look at Flink CEP:
1. http://flink.apache.org/news/2016/04/06/cep-monitoring.html
2. 
https://www.slideshare.net/FlinkForward/fabian-huesketill-rohrmann-declarative-stream-processing-with-streamsql-and-cep?qid=3c13eb7d-ed39-4eae-9b74-a6c94e8b08a3==_search=4

The CEP parts of the slides in 2. also provide some good examples of timeout 
detection using CEP.

Hope this helps!

Cheers,
Gordon

On March 4, 2017 at 1:27:51 AM, Bruno Aranda (bara...@apache.org) wrote:

Hi all,

We are trying to write an online/offline detector for devices that keep 
streaming data through Flink. We know roughly how often to expect events from 
those devices, and we want to be able to detect when any of them stops (goes 
offline) or starts again (comes back online) sending events through the 
pipeline. For instance, if 5 minutes have passed since the last event of a 
device, we would fire an event to indicate that the device is offline.

The data from the devices comes through Kafka, with their own event time. The 
device events are in order within the partitions, and each device goes to a 
specific partition, so in theory we should not see out-of-order events when 
looking at one partition.

We are assuming a good way to do this is by using sliding windows that are big 
enough, so we can see the relevant gap before/after the events for each 
specific device. 

We were wondering if there are other ideas on how to solve this.

Many thanks!

Bruno

Re: Flink 1.2 and Cassandra Connector

2017-03-06 Thread Chesnay Schepler

Hello,

I believe the Cassandra connector is not shading its dependencies 
properly. This didn't cause issues in the past since Flink used to have a 
dependency on Codahale metrics as well.

Please open a JIRA for this issue.
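
A possible stop-gap until that is fixed (my own suggestion, not verified 
against this exact setup) is to bundle the missing Codahale classes with the 
job jar; the com.codahale.metrics package ships in the Dropwizard metrics-core 
artifact:

```
<!-- assumption: the class reported as missing (com.codahale.metrics.Metric) lives in
     io.dropwizard.metrics:metrics-core; adding it to the job jar (i.e. not 'provided')
     may work around the connector's missing shading -->
<dependency>
    <groupId>io.dropwizard.metrics</groupId>
    <artifactId>metrics-core</artifactId>
    <version>3.1.2</version>
</dependency>
```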

Regards,
Chesnay

On 06.03.2017 11:32, Tarandeep Singh wrote:

Hi Robert & Nico,

I am facing the same problem (java.lang.NoClassDefFoundError: 
com/codahale/metrics/Metric)

Can you help me identify the shading issue in my pom.xml file?

My pom.xml content-
-
http://maven.apache.org/POM/4.0.0; 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; 
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
http://maven.apache.org/xsd/maven-4.0.0.xsd;> 4.0.0 rfk-dataplatform stream-processing 0.1.0 jar Stream processing UTF-8 1.2.0 1.7.7 1.2.17   org.apache.flink flink-streaming-java_2.10 ${flink.version}   org.apache.flink flink-clients_2.10 ${flink.version}   org.apache.flink flink-connector-cassandra_2.10 1.2.0   org.apache.flink flink-statebackend-rocksdb_2.10 1.2.0   org.slf4j slf4j-log4j12 ${slf4j.version}   log4j log4j ${log4j.version}   org.apache.avro avro 1.8.1   org.testng testng 6.8 test
org.apache.flink flink-connector-kafka-0.8_2.10 1.2.0
org.influxdb influxdb-java 2.5  build-jar  falseorg.apache.flink flink-java ${flink.version} provided   org.apache.flink flink-streaming-java_2.10 ${flink.version} provided   org.apache.flink flink-clients_2.10 ${flink.version} provided   org.slf4j slf4j-log4j12 ${slf4j.version} provided   log4j log4j ${log4j.version} provided   org.apache.maven.plugins maven-shade-plugin 2.4.1   package  shadecombine.self="override">   
  
org.apache.maven.plugins maven-shade-plugin 2.4.1   
 package  shade  org.apache.flink:flink-annotations org.apache.flink:flink-shaded-hadoop2 org.apache.flink:flink-shaded-curator-recipes org.apache.flink:flink-core org.apache.flink:flink-java org.apache.flink:flink-scala_2.10 org.apache.flink:flink-runtime_2.10 org.apache.flink:flink-optimizer_2.10 org.apache.flink:flink-clients_2.10 org.apache.flink:flink-avro_2.10 org.apache.flink:flink-examples-batch_2.10 org.apache.flink:flink-examples-streaming_2.10 org.apache.flink:flink-streaming-java_2.10 org.apache.flink:flink-streaming-scala_2.10 org.apache.flink:flink-scala-shell_2.10 org.apache.flink:flink-python org.apache.flink:flink-metrics-core org.apache.flink:flink-metrics-jmx org.apache.flink:flink-statebackend-rocksdb_2.10  log4j:log4j org.scala-lang:scala-library org.scala-lang:scala-compiler org.scala-lang:scala-reflect com.data-artisans:flakka-actor_* com.data-artisans:flakka-remote_* com.data-artisans:flakka-slf4j_* io.netty:netty-all io.netty:netty commons-fileupload:commons-fileupload org.apache.avro:avro commons-collections:commons-collections org.codehaus.jackson:jackson-core-asl org.codehaus.jackson:jackson-mapper-asl com.thoughtworks.paranamer:paranamer org.xerial.snappy:snappy-java org.apache.commons:commons-compress org.tukaani:xz com.esotericsoftware.kryo:kryo com.esotericsoftware.minlog:minlog org.objenesis:objenesis com.twitter:chill_* com.twitter:chill-java commons-lang:commons-lang junit:junit org.apache.commons:commons-lang3 org.slf4j:slf4j-api org.slf4j:slf4j-log4j12 log4j:log4j org.apache.commons:commons-math org.apache.sling:org.apache.sling.commons.json commons-logging:commons-logging commons-codec:commons-codec com.fasterxml.jackson.core:jackson-core com.fasterxml.jackson.core:jackson-databind com.fasterxml.jackson.core:jackson-annotations stax:stax-api com.typesafe:config org.uncommons.maths:uncommons-maths com.github.scopt:scopt_* commons-io:commons-io commons-cli:commons-cli org.apache.flink:*   org/apache/flink/shaded/com/** web-docs/***:*  META-INF/*.SF META-INF/*.DSA META-INF/*.RSAfalse   
   org.apache.maven.plugins maven-compiler-plugin 3.1  1.8 1.8org.apache.avro avro-maven-plugin 1.8.1   generate-sources  schema   String PRIVATE ${project.basedir}/src/main/schema/ ${project.basedir}/src/main/java/
  org.apache.maven.plugins maven-compiler-plugin  1.8 1.8 


-


On Sun, Feb 12, 2017 at 1:56 AM, Robert Metzger wrote:


Hi Nico,
The cassandra connector should be available on Maven central:

http://search.maven.org/#artifactdetails%7Corg.apache.flink%7Cflink-connector-cassandra_2.10%7C1.2.0%7Cjar



Potentially, the issue you've mentioned is due to some shading
issue. Is the "com/codahale/metrics/Metric" class in your user
code jar?

On Thu, Feb 9, 2017 at 2:56 PM, Nico wrote:

Hi,

I would like to upgrade to the new stable version 1.2 - but I get a
ClassNotFound exception when I start the application.

Caused 

Re: Flink 1.2 and Cassandra Connector

2017-03-06 Thread Tarandeep Singh
Hi Robert & Nico,

I am facing the same problem (java.lang.NoClassDefFoundError:
com/codahale/metrics/Metric)
Can you help me identify the shading issue in my pom.xml file?

My pom.xml content-
-


http://maven.apache.org/POM/4.0.0;
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance;
   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd;>
   4.0.0

   rfk-dataplatform
   stream-processing
   0.1.0
   jar

   Stream processing

   
  UTF-8
  1.2.0
  1.7.7
  1.2.17
   

   
  
 org.apache.flink
 flink-streaming-java_2.10
 ${flink.version}
  
  
 org.apache.flink
 flink-clients_2.10
 ${flink.version}
  
  
 org.apache.flink
 flink-connector-cassandra_2.10
 1.2.0
  

org.apache.flink
flink-statebackend-rocksdb_2.10
1.2.0


  
 org.slf4j
 slf4j-log4j12
 ${slf4j.version}
  
  
 log4j
 log4j
 ${log4j.version}
  


org.apache.avro
avro
1.8.1



org.testng
testng
6.8
test




org.apache.flink
flink-connector-kafka-0.8_2.10
1.2.0




org.influxdb
influxdb-java
2.5




   
  
 
 build-jar

 
false
 

 

   org.apache.flink
   flink-java
   ${flink.version}
   provided


   org.apache.flink
   flink-streaming-java_2.10
   ${flink.version}
   provided


   org.apache.flink
   flink-clients_2.10
   ${flink.version}
   provided


   org.slf4j
   slf4j-log4j12
   ${slf4j.version}
   provided


   log4j
   log4j
   ${log4j.version}
   provided

 




   
   
  org.apache.maven.plugins
  maven-shade-plugin
  2.4.1
  
 
package

   shade


   
  
   

 
  
   

 
  
   

   
  
 
org.apache.maven.plugins
maven-shade-plugin
2.4.1

   
   
  package
  
 shade
  
  
 

   
   org.apache.flink:flink-annotations

org.apache.flink:flink-shaded-hadoop2

org.apache.flink:flink-shaded-curator-recipes
   org.apache.flink:flink-core
   org.apache.flink:flink-java
   org.apache.flink:flink-scala_2.10

org.apache.flink:flink-runtime_2.10

org.apache.flink:flink-optimizer_2.10

org.apache.flink:flink-clients_2.10
   org.apache.flink:flink-avro_2.10

org.apache.flink:flink-examples-batch_2.10

org.apache.flink:flink-examples-streaming_2.10

org.apache.flink:flink-streaming-java_2.10

org.apache.flink:flink-streaming-scala_2.10

org.apache.flink:flink-scala-shell_2.10
   org.apache.flink:flink-python

org.apache.flink:flink-metrics-core
   org.apache.flink:flink-metrics-jmx

org.apache.flink:flink-statebackend-rocksdb_2.10

   

   log4j:log4j
   org.scala-lang:scala-library
   org.scala-lang:scala-compiler
   org.scala-lang:scala-reflect
   com.data-artisans:flakka-actor_*
   com.data-artisans:flakka-remote_*
   com.data-artisans:flakka-slf4j_*
   io.netty:netty-all
   io.netty:netty

commons-fileupload:commons-fileupload
   org.apache.avro:avro

commons-collections:commons-collections

org.codehaus.jackson:jackson-core-asl

org.codehaus.jackson:jackson-mapper-asl

com.thoughtworks.paranamer:paranamer
   org.xerial.snappy:snappy-java

org.apache.commons:commons-compress