Re: Flink 1.3 REST API wrapper for Scala

2017-06-12 Thread Flavio Pompermaier
Nice lib! Is it also available on Maven Central?

On 13 Jun 2017 4:36 am, "Michael Reid"  wrote:

> I'm currently working on a project where I need to manage jobs
> programmatically without being tied to Flink, so I wrote a small,
> asynchronous Scala wrapper around the Flink REST API. I decided to open
> source the project in case anyone else needs something similar. Source is
> at https://github.com/mjreid/flink-rest-scala-wrapper , library is
> available on maven central at com.github.mjreid:flink-wrapper_2.11:0.0.1.
>
> Currently it supports basic job management functionality (uploading JARs,
> starting jobs, cancelling jobs) and a few of the job monitoring REST
> endpoints. There are a few monitoring endpoints which I didn't need for my
> use case, so they aren't implemented right now, but I would be happy to
> implement them if there's any interest. Any comments and/or feedback is
> appreciated!
>
> Thanks,
>
> - Mike
>


Flink 1.3 REST API wrapper for Scala

2017-06-12 Thread Michael Reid
I'm currently working on a project where I need to manage jobs
programmatically without being tied to Flink, so I wrote a small,
asynchronous Scala wrapper around the Flink REST API. I decided to open
source the project in case anyone else needs something similar. Source is
at https://github.com/mjreid/flink-rest-scala-wrapper , library is
available on maven central at com.github.mjreid:flink-wrapper_2.11:0.0.1.

Currently it supports basic job management functionality (uploading JARs,
starting jobs, cancelling jobs) and a few of the job monitoring REST
endpoints. There are a few monitoring endpoints which I didn't need for my
use case, so they aren't implemented right now, but I would be happy to
implement them if there's any interest. Any comments and/or feedback is
appreciated!

Thanks,

- Mike


Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-12 Thread Kurt Young
Hi,

I think the reason is that your record is too large to do an in-memory combine.
You can try to disable your combiner.

Best,
Kurt

On Mon, Jun 12, 2017 at 9:55 PM, Sebastian Neef <
gehax...@mailbox.tu-berlin.de> wrote:

> Hi,
>
> when I'm running my Flink job on a small dataset, it successfully
> finishes. However, when a bigger dataset is used, I get multiple
> exceptions:
>
> -  Caused by: java.io.IOException: Cannot write record to fresh sort
> buffer. Record too large.
> - Thread 'SortMerger Reading Thread' terminated due to an exception: null
>
> A full stack trace can be found here [0].
>
> I tried to reduce the taskmanager.memory.fraction (or so) and also the
> amount of parallelism, but that did not help much.
>
> Flink 1.0.3-Hadoop2.7 was used.
>
> Any tips are appreciated.
>
> Kind regards,
> Sebastian
>
> [0]:
> http://paste.gehaxelt.in/?1f24d0da3856480d#/dR8yriXd/
> VQn5zTfZACS52eWiH703bJbSTZSifegwI=
>


Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-12 Thread Ted Yu
Sebastian:
Are you using jdk 7 or jdk 8 ?

For JDK 7, there was a bug w.r.t. the code cache getting full which affects
performance.

https://bugs.openjdk.java.net/browse/JDK-8051955

https://bugs.openjdk.java.net/browse/JDK-8074288

http://blog.andresteingress.com/2016/10/19/java-codecache

Cheers

On Mon, Jun 12, 2017 at 1:08 PM, Flavio Pompermaier 
wrote:

> Try to see if the output of the dmesg command contains some log about an
> OOM. The OS logs such info there. I had a similar experience recently...
> see [1]
>
> Best,
> Flavio
>
> [1] http://apache-flink-user-mailing-list-archive.2336050.
> n4.nabble.com/Flink-and-swapping-question-td13284.html
>
> On 12 Jun 2017 21:51, "Sebastian Neef" 
> wrote:
>
>> Hi Stefan,
>>
> thanks for the answer and the advice, which I've already seen in another
>> email.
>>
>> Anyway, I played around with the taskmanager.numberOfTaskSlots and
>> taskmanager.memory.fraction options. I noticed that decreasing the
> former and increasing the latter led to longer execution and more
>> processed data before the failure.
>>
>> The error messages and exceptions from an affected TaskManager are here
>> [1]. Unfortunately, I cannot find a java.lang.OutOfMemoryError in here.
>>
>> Do you have another idea or something to try?
>>
>> Thanks in advance,
>> Sebastian
>>
>>
>> [1]
>> http://paste.gehaxelt.in/?e669fabc1d4c15be#G1Ioq/ASwGUdCaK2r
>> Q1AY3ZmCkA7LN4xVOHvM9NeI2g=
>>
>


Re: Exception in Flink 1.3.0

2017-06-12 Thread rhashmi
Any update on when 1.3.1 will be available?

Our current copy is 1.2.0, but that has a separate issue (invalid type code:
00). 

http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/invalid-type-code-00-td13326.html#a13332



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Exception-in-Flink-1-3-0-tp13493p13664.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: Window Function on AllWindowed Stream - Combining Kafka Topics

2017-06-12 Thread Meera
Did this problem get resolved?

- I am running into this problem when I parallelize the tasks:
"Unexpected key group index. This indicates a bug."

- It runs fine with parallelism 1. This suggests there is some key grouping
issue. I checked my Watermark and KeySelector; they look okay.

Here are the snippets of my KeySelector and the Watermark assigner attached to the KeyedStream:
public class DimensionKeySelector<T extends MetricSignalSet> implements
        KeySelector<T, String> {

    private static final long serialVersionUID = 7666263008141606451L;
    private final String[] dimKeys;

    public DimensionKeySelector(Map<String, String> conf) {
        if (!conf.containsKey("dimKeys")) {
            throw new RuntimeException("Required 'dimKeys' missing.");
        }
        this.dimKeys = conf.get("dimKeys").split(",");
    }

    @Override
    public String getKey(T signalSet) throws Exception {
        StringBuffer group = new StringBuffer(signalSet.namespace());
        if (signalSet.size() != 0) {
            for (String dim : dimKeys) {
                if (signalSet.dimensions().containsKey(dim)) {
                    group.append(signalSet.dimensions().get(dim));
                }
            }
        }
        return group.toString();
    }
}

and Watermark
public class WaterMarks extends
        BoundedOutOfOrdernessTimestampExtractor<MetricSignalSet> {

    public WaterMarks(Time maxOutOfOrderness) {
        super(maxOutOfOrderness);
    }

    private static final long serialVersionUID = 1L;

    @Override
    public long extractTimestamp(MetricSignalSet element) {
        return element.get(0).timestamp().getTime();
    }
}

Any thoughts?




--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Window-Function-on-AllWindowed-Stream-Combining-Kafka-Topics-tp12941p13663.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-12 Thread Flavio Pompermaier
Try to see if the output of the dmesg command contains some log about an
OOM. The OS logs such info there. I had a similar experience recently...
see [1]

Best,
Flavio

[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-swapping-question-td13284.html

On 12 Jun 2017 21:51, "Sebastian Neef" 
wrote:

> Hi Stefan,
>
> thanks for the answer and the advice, which I've already seen in another
> email.
>
> Anyway, I played around with the taskmanager.numberOfTaskSlots and
> taskmanager.memory.fraction options. I noticed that decreasing the
> former and increasing the latter led to longer execution and more
> processed data before the failure.
>
> The error messages and exceptions from an affected TaskManager are here
> [1]. Unfortunately, I cannot find a java.lang.OutOfMemoryError in here.
>
> Do you have another idea or something to try?
>
> Thanks in advance,
> Sebastian
>
>
> [1]
> http://paste.gehaxelt.in/?e669fabc1d4c15be#G1Ioq/
> ASwGUdCaK2rQ1AY3ZmCkA7LN4xVOHvM9NeI2g=
>


Re: At what point do watermarks get injected into the stream?

2017-06-12 Thread Fabian Hueske
Hi,

each operator keeps track of the latest (and therefore maximum) watermark
received from each of its inputs and sets its own internal time to the
minimum of these per-input watermarks.
In the example, Window(1) has two inputs (Map(1) and Map(2)). If Window(1)
first receives a watermark of 33 from Map(1), it won't emit a watermark of 33
until it has received a watermark >= 33 from Map(2).

All operators use this logic for watermark propagation. So it does not
matter whether this is a window operator or a CEP operator.

Let me know if you have further questions,
Fabian
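
To make the propagation rule concrete, here is a minimal, self-contained sketch
(plain Java, purely illustrative; this is not Flink's actual implementation) of an
operator with two inputs that tracks the maximum watermark seen per input and
forwards the minimum across inputs:

public class TwoInputWatermarkTracker {

    private final long[] maxPerInput = {Long.MIN_VALUE, Long.MIN_VALUE};
    private long lastEmitted = Long.MIN_VALUE;

    /** Called when a watermark arrives on input 0 or 1. Returns the watermark to
     *  forward downstream, or null if nothing new can be emitted yet. */
    public Long onWatermark(int input, long watermark) {
        // keep the maximum watermark seen per input
        maxPerInput[input] = Math.max(maxPerInput[input], watermark);

        // the operator's internal time is the minimum across all inputs
        long combined = Math.min(maxPerInput[0], maxPerInput[1]);
        if (combined > lastEmitted) {
            lastEmitted = combined;
            return combined;
        }
        return null;
    }

    public static void main(String[] args) {
        TwoInputWatermarkTracker window1 = new TwoInputWatermarkTracker();
        System.out.println(window1.onWatermark(0, 33)); // null: the other input is still at -infinity
        System.out.println(window1.onWatermark(1, 17)); // 17:   min(33, 17)
        System.out.println(window1.onWatermark(1, 33)); // 33:   both inputs have reached 33
    }
}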

2017-06-12 21:55 GMT+02:00 Ray Ruvinskiy :

> Thanks!
>
>
>
> I had some follow-up questions about the example in the
> documentation. Suppose Source 1 sends a watermark of 33, and Source 2 sends
> a watermark of 17. If I understand correctly, map(1) will forward the
> watermark of 33 to window(1) and window(2), and map(2) will forward the
> watermark of 17 to the same window operators. I’m assuming there is nothing
> to prevent window(1) and window(2) from getting the watermark of 33 before
> the watermark of 17, right? In that case, how do window(1) and window(2)
> compute the minimum watermark to forward to the next operator downstream?
> Will it be a per-window watermark?
>
>
>
> What would happen if, instead of a window operator in that position, we had
> something like the CEP operator, which in effect maintains state and does
> aggregations without windowing (or another similar such operator)? How does
> it determine what the minimum watermark is at any given time, in light of
> the fact that, in principle, it might receive a watermark value smaller
> than anything it’s seen before from a parallel source?
>
>
>
> Ray
>
>
>
> From: Fabian Hueske 
> Date: Sunday, June 11, 2017 at 5:54 PM
>
> To: Ray Ruvinskiy 
> Cc: "user@flink.apache.org" 
> Subject: Re: At what point do watermarks get injected into the stream?
>
>
>
> Each parallel instance of a TimestampAssigner independently assigns
> timestamps.
>
> After a shuffle, operators forward the minimum watermark across all input
> connections. For details have a look at the watermarks documentation [1].
>
> Best, Fabian
>
> [1] https://ci.apache.org/projects/flink/flink-docs-
> release-1.3/dev/event_time.html#watermarks-in-parallel-streams
>
>
>
> 2017-06-11 17:22 GMT+02:00 Ray Ruvinskiy :
>
> Thanks for the explanation, Fabian.
>
>
>
> Suppose I have a parallel source that does not inject watermarks, and the
> first operation on the DataStream is assignTimestampsAndWatermarks. Does
> each parallel task that makes up the source independently inject watermarks
> for the records that it has read? Suppose I then call keyBy and a shuffle
> ensues. Will the resulting partitions after the shuffle have interleaved
> watermarks from the various source tasks?
>
>
>
> More concretely, suppose a source has a degree of parallelism of two. One
> of the source tasks injects the watermarks 2 and 5, while the other injects
> 3 and 10. There is then a shuffle, creating two different partitions. Will
> all the watermarks be broadcast to all the partitions? Or is it possible
> for, say, one partition to end up with watermarks 2 and 10 and another with
> 3 and 5? And after the shuffle, how do we ensure that the watermarks are
> processed in order by the operators receiving them?
>
>
>
> Thanks,
>
>
>
> Ray
>
>
>
> From: Fabian Hueske 
> Date: Saturday, June 10, 2017 at 3:56 PM
> To: Ray Ruvinskiy 
> Cc: "user@flink.apache.org" 
> Subject: Re: At what point do watermarks get injected into the stream?
>
>
>
> Hi Ray,
>
> in principle, watermarks can be injected anywhere in a stream by calling
> DataStream.assignTimestampsAndWatermarks().
>
> However, timestamps are usually injected as soon as possible after a
> stream is ingested (before the first shuffle). The reason is that
> watermarks depend on the order of events (and their timestamps) in the
> stream. While Flink guarantees the order of events within a partition, a
> shuffle interleaves events of different partitions in an unpredictable way
> such that it is not possible to reason about the order of timestamps
> afterwards.
>
> The most common way to inject watermarks is directly inside of a
> SourceFunction or with a TimestampAssigner before the first shuffle.
>
> Best, Fabian
>
>
>
> 2017-06-09 0:46 GMT+02:00 Ray Ruvinskiy :
>
> I’m trying to build a mental model of how watermarks get injected into the
> stream. Suppose I have a stream with a parallel source, and I’m running a
> cluster with multiple task managers. Does each parallel source reader
> inject watermarks, which are then forwarded to downstream consumers and
> shuffled between task managers? Or are watermarks created after the
> shuffle, when the stream records reach their destined task manager and
> right before they’re processed by the operator?
>
>
>
> Thanks,
>
>
>
> Ray
>
>
>
>
>


Re: At what point do watermarks get injected into the stream?

2017-06-12 Thread Ray Ruvinskiy
Thanks!

I had some follow-up questions about the example in the documentation. 
Suppose Source 1 sends a watermark of 33, and Source 2 sends a watermark of 17. 
If I understand correctly, map(1) will forward the watermark of 33 to window(1) 
and window(2), and map(2) will forward the watermark of 17 to the same window 
operators. I’m assuming there is nothing to prevent window(1) and window(2) 
from getting the watermark of 33 before the watermark of 17, right? In that 
case, how do window(1) and window(2) compute the minimum watermark to forward 
to the next operator downstream? Will it be a per-window watermark?

What would happen if, instead of a window operator in that position, we had 
something like the CEP operator, which in effect maintains state and does 
aggregations without windowing (or another similar such operator)? How does it 
determine what the minimum watermark is at any given time, in light of the fact 
that, in principle, it might receive a watermark value smaller than anything 
it’s seen before from a parallel source?

Ray

From: Fabian Hueske 
Date: Sunday, June 11, 2017 at 5:54 PM
To: Ray Ruvinskiy 
Cc: "user@flink.apache.org" 
Subject: Re: At what point do watermarks get injected into the stream?

Each parallel instance of a TimestampAssigner independently assigns timestamps.
After a shuffle, operators forward the minimum watermark across all input 
connections. For details have a look at the watermarks documentation [1].
Best, Fabian

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html#watermarks-in-parallel-streams

2017-06-11 17:22 GMT+02:00 Ray Ruvinskiy <ray.ruvins...@arcticwolf.com>:
Thanks for the explanation, Fabian.

Suppose I have a parallel source that does not inject watermarks, and the first 
operation on the DataStream is assignTimestampsAndWatermarks. Does each 
parallel task that makes up the source independently inject watermarks for the 
records that it has read? Suppose I then call keyBy and a shuffle ensues. Will 
the resulting partitions after the shuffle have interleaved watermarks from the 
various source tasks?

More concretely, suppose a source has a degree of parallelism of two. One of 
the source tasks injects the watermarks 2 and 5, while the other injects 3 and 
10. There is then a shuffle, creating two different partitions. Will all the 
watermarks be broadcast to all the partitions? Or is it possible for, say, one 
partition to end up with watermarks 2 and 10 and another with 3 and 5? And 
after the shuffle, how do we ensure that the watermarks are processed in order 
by the operators receiving them?

Thanks,

Ray

From: Fabian Hueske <fhue...@gmail.com>
Date: Saturday, June 10, 2017 at 3:56 PM
To: Ray Ruvinskiy <ray.ruvins...@arcticwolf.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: At what point do watermarks get injected into the stream?

Hi Ray,
in principle, watermarks can be injected anywhere in a stream by calling 
DataStream.assignTimestampsAndWatermarks().
However, timestamps are usually injected as soon as possible after a stream is 
ingested (before the first shuffle). The reason is that watermarks depend on 
the order of events (and their timestamps) in the stream. While Flink 
guarantees the order of events within a partition, a shuffle interleaves events 
of different partitions in an unpredictable way such that it is not possible to 
reason about the order of timestamps afterwards.
The most common way to inject watermarks is directly inside of a SourceFunction 
or with a TimestampAssigner before the first shuffle.
Best, Fabian

2017-06-09 0:46 GMT+02:00 Ray Ruvinskiy <ray.ruvins...@arcticwolf.com>:
I’m trying to build a mental model of how watermarks get injected into the 
stream. Suppose I have a stream with a parallel source, and I’m running a 
cluster with multiple task managers. Does each parallel source reader inject 
watermarks, which are then forwarded to downstream consumers and shuffled 
between task managers? Or are watermarks created after the shuffle, when the 
stream records reach their destined task manager and right before they’re 
processed by the operator?

Thanks,

Ray




Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-12 Thread Sebastian Neef
Hi Stefan,

thanks for the answer and the advice, which I've already seen in another
email.

Anyway, I played around with the taskmanager.numberOfTaskSlots and
taskmanager.memory.fraction options. I noticed that decreasing the
former and increasing the latter led to longer execution and more
processed data before the failure.

The error messages and exceptions from an affected TaskManager are here
[1]. Unfortunately, I cannot find a java.lang.OutOfMemoryError in here.

Do you have another idea or something to try?

Thanks in advance,
Sebastian


[1]
http://paste.gehaxelt.in/?e669fabc1d4c15be#G1Ioq/ASwGUdCaK2rQ1AY3ZmCkA7LN4xVOHvM9NeI2g=


Re: ReduceFunction mechanism

2017-06-12 Thread Fabian Hueske
No, the reduce() method of a ReduceFunction requires two elements.
The first received element is just put into state. Once the second element
arrives, both are given to the ReduceFunction and the result is put into
state and replaces the first element.

Best, Fabian
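
To illustrate this behaviour, here is a minimal runnable sketch (the element type
and values are made up for the example): the only element for key "a" is emitted
as-is without reduce() ever being called, while the second element for key "b"
triggers one reduce() call.

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReduceFirstElementExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> input = env.fromElements(
                Tuple2.of("a", 1),   // only element for key "a": reduce() is never called for it
                Tuple2.of("b", 2),
                Tuple2.of("b", 3));  // second element for key "b": reduce() is called once

        input.keyBy(0)
             .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
                 @Override
                 public Tuple2<String, Integer> reduce(Tuple2<String, Integer> stored,
                                                       Tuple2<String, Integer> current) {
                     // invoked only when an element is already stored in state for this key
                     return Tuple2.of(stored.f0, stored.f1 + current.f1);
                 }
             })
             .print();

        env.execute("ReduceFunction first-element behaviour");
    }
}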

2017-06-12 18:25 GMT+02:00 nragon :

> Hi,
>
> Regarding ReduceFunction.
> Is reduce() called when there is only one record for a given key?
>
> Thanks
>
>
>
> --
> View this message in context: http://apache-flink-user-
> mailing-list-archive.2336050.n4.nabble.com/ReduceFunction-
> mechanism-tp13651.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Task and Operator Metrics in Flink 1.3

2017-06-12 Thread Dail, Christopher
I’m using the Flink 1.3.0 release and am not seeing all of the metrics that I 
would expect to see. I have flink configured to write out metrics via statsd 
and I am consuming this with telegraf. Initially I thought this was an issue 
with telegraf parsing the data generated. I dumped all of the metrics going 
into telegraf using tcpdump and found that a bunch of data that I expected
was missing.

I’m using this as a reference for what metrics I expect:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/metrics.html

I see all of the JobManager and TaskManager level metrics. Things like 
Status.JVM.* are coming through. TaskManager Status.Network are there (but not 
Task level buffers). The ‘Cluster’ metrics are there.

The IO section there lists task and operator level metrics (like what is 
available on the dashboard). I’m not seeing any of these metrics coming through 
when using statsd.

I’m configuring flink with this configuration:

metrics.reporters: statsd
metrics.reporter.statsd.class: org.apache.flink.metrics.statsd.StatsDReporter
metrics.reporter.statsd.host: hostname
metrics.reporter.statsd.port: 8125

# Customized Scopes
metrics.scope.jm: flink.jm
metrics.scope.jm.job: flink.jm.
metrics.scope.tm: flink.tm.
metrics.scope.tm.job: flink.tm..
metrics.scope.task: flink.tm
metrics.scope.operator: 
flink.tm

I have tried with and without specifically setting the metrics.scope values.

Is anyone else having similar issues with metrics in 1.3?

Thanks

Chris Dail
Director, Software Engineering
Dell EMC | Infrastructure Solutions Group
mobile +1 506 863 4675
christopher.d...@dell.com





Java parallel streams vs IterativeStream

2017-06-12 Thread nragon
In my map functions I have an object containing a list which must be changed
by executing some logic.
So, considering Java 8 parallel streams, would it be worth using them, or does
IterativeStream offer better performance without the overhead of Java 8
parallel streams?

Thanks



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Java-parallel-streams-vs-IterativeStream-tp13655.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: Use Single Sink For All windows

2017-06-12 Thread rhashmi
I think CheckpointListener?




--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Use-Single-Sink-For-All-windows-tp13475p13653.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: coGroup exception or something else in Gelly job

2017-06-12 Thread Kaepke, Marc
I’m working on an implementation of SemiClustering [1].
I used two graph models (Pregel aka. vertexcentric iteration and gather 
scatter).

Short description of the algorithm

  *   input is a weighted, undirected graph
  *   output are greedy clusters


  *   Each vertex V maintains a list containing at most Cmax semi-clusters, 
sorted by score.
  *   In superstep 0 V enters itself in that list as a semi-cluster of size 1 
and score 1, and publishes itself to all of its neighbors.
  *   In subsequent supersteps:
 *   Vertex V iterates over the semi-clusters c1,...,ck sent to it in the 
previous superstep. If a semi-cluster c does not already contain V, and the size of c, 
Vc, is less than Mmax, then V is added to c to form c′.
 *   The semi-clusters c1,…,ck,c′1,…,c′k are sorted by their scores, and 
the best ones are sent to V ’s neighbors.
 *   Vertex V updates its list of semi-clusters with the semi-clusters from 
c1,…,ck, c′1,…,c′k that contain V.
  *   The algorithm terminates either when the semi-clusters stop changing or 
(to improve performance) when the number of supersteps reaches a user-specified 
limit. At that point the list of best semi-cluster candidates for each vertex 
may be aggregated into a global list of best semi-clusters.

It’s a simple algorithm, except for the access to the edges of each vertex. If I 
iterate across all edges every time to find the incident ones, I get a horrible 
time complexity (in Landau / big-O terms).


Best
Marc


[1] https://kowshik.github.io/JPregel/pregel_paper.pdf (page 141-142)
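
As a rough illustration, here is a self-contained Java sketch of the per-superstep
list update described above. SemiCluster and scoreOf(...) are hypothetical
placeholders: the real score from the Pregel paper depends on the weights of the
cluster's internal and boundary edges, which is exactly the edge-access problem
mentioned above.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SemiClusterUpdate {

    static class SemiCluster {
        final Set<Long> members;
        final double score;
        SemiCluster(Set<Long> members, double score) {
            this.members = members;
            this.score = score;
        }
    }

    /** Placeholder scoring function; the paper's score needs the incident edge weights. */
    static double scoreOf(Set<Long> members) {
        return 1.0 / members.size();
    }

    /** One superstep for vertex V: extend incoming clusters, sort, keep the best Cmax. */
    static List<SemiCluster> update(long vertexId, List<SemiCluster> incoming, int cMax, int mMax) {
        List<SemiCluster> candidates = new ArrayList<>(incoming);

        // If a received semi-cluster c does not contain V and is smaller than Mmax,
        // form c' = c + V and add it to the candidate set.
        for (SemiCluster c : incoming) {
            if (!c.members.contains(vertexId) && c.members.size() < mMax) {
                Set<Long> extended = new HashSet<>(c.members);
                extended.add(vertexId);
                candidates.add(new SemiCluster(extended, scoreOf(extended)));
            }
        }

        // Sort by score (descending) and keep at most Cmax candidates.
        candidates.sort((a, b) -> Double.compare(b.score, a.score));
        return new ArrayList<>(candidates.subList(0, Math.min(cMax, candidates.size())));
    }
}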


On 12.06.2017 at 18:28, Greg Hogan <c...@greghogan.com> wrote:

I don't think it is possible for anyone to debug your exception without the 
source code. Storing the adjacency list within the Vertex value is not 
scalable. Can you share a basic description of the algorithm you are working to 
implement?

On Mon, Jun 12, 2017 at 5:47 AM, Kaepke, Marc <marc.kae...@haw-hamburg.de> wrote:
It seems Flink uses a different execution graph outside of my IDE (IntelliJ).

The job anatomy is:
load data from csv and build an initial graph => reduce that graph (remove 
loops and combine multi edges) => extend the modified graph by a new vertex 
value => run a gather-scatter iteration

I have to extend the vertex value, because each vertex needs its incident edges 
inside the iteration. My CustomVertexValue is able to hold all incident edges. 
Vertex vertex

Flink tries to optimize the execution graph, but that’s the issue.
Maybe Flink provides a way for the programmer to influence this?


Best and thanks
Marc


On 10.06.2017 at 00:49, Greg Hogan <c...@greghogan.com> wrote:

Have you looked at 
org.apache.flink.gelly.GraphExtension.CustomVertexValue.createInitSemiCluster(CustomVertexValue.java:51)?


On Jun 9, 2017, at 4:53 PM, Kaepke, Marc <marc.kae...@haw-hamburg.de> wrote:

Hi everyone,

I don’t have any exceptions if I execute my Gelly job in my IDE (local) 
directly.
The next step is an execution with a real kubernetes cluster (1 JobManager and 
3 TaskManager on dedicated machines). The word count example is running without 
exceptions. My Gelly job throws the following exception and I don’t know why.

org.apache.flink.client.program.ProgramInvocationException: The program 
execution failed: Job execution failed.
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:478)
at 
org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:105)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:429)
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
at 
org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:926)
at org.apache.flink.api.java.DataSet.collect(DataSet.java:410)
at 
org.apache.flink.gelly.Algorithm.SemiClusteringPregel.run(SemiClusteringPregel.java:84)
at 
org.apache.flink.gelly.Algorithm.SemiClusteringPregel.run(SemiClusteringPregel.java:34)
at org.apache.flink.graph.Graph.run(Graph.java:1850)
at org.apache.flink.gelly.job.GellyMain.main(GellyMain.java:128)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:419)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:381)
at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)
at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend

Re: Use Single Sink For All windows

2017-06-12 Thread rhashmi
Thanks Aljoscha for your response.

I will give it a try.

1- Flink calls the *invoke* method of SinkFunction to dispatch aggregated
information. My follow-up question here is: while the snapshotState method is
in progress, if the sink receives another update then we might have mixed
records; however, per the documentation, all updates stop during a checkpoint.
I assume this works the same way.


https://ci.apache.org/projects/flink/flink-docs-release-1.3/internals/stream_checkpointing.html

*"As soon as the operator receives snapshot barrier n from an incoming
stream, it cannot process any further records from that stream until it has
received the barrier n from the other inputs as well. Otherwise, it would
mix records that belong to snapshot n and with records that belong to
snapshot n+1."*

*"Streams that report barrier n are temporarily set aside. Records that are
received from these streams are not processed, but put into an input
buffer".
*


2- The snapshotState method is called when a checkpoint is requested. Is there an
interface that tells me when the checkpoint is complete? I mean: I would add my
flush logic right after completion of the snapshot and before Flink resumes the
stream. With this approach we can ensure that we update state only if the
checkpoint was successful.
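
Regarding question 2, a minimal sketch of reacting to checkpoint completion via
the CheckpointListener interface (the in-memory list and flushToDatabase() are
placeholders; a production sink would also have to snapshot the pending rows so
they survive a failure):

import org.apache.flink.runtime.state.CheckpointListener;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.ArrayList;
import java.util.List;

public class FlushOnCheckpointCompleteSink extends RichSinkFunction<String>
        implements CheckpointListener {

    private final List<String> pending = new ArrayList<>();

    @Override
    public void invoke(String value) {
        pending.add(value);
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // called once the checkpoint has been acknowledged by all tasks
        flushToDatabase(pending);
        pending.clear();
    }

    private void flushToDatabase(List<String> rows) {
        // placeholder for a real batch insert
    }
}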



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Use-Single-Sink-For-All-windows-tp13475p13652.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


ReduceFunction mechanism

2017-06-12 Thread nragon
Hi,

Regarding ReduceFunction.
Is reduce() called when there is only one record for a given key?

Thanks



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ReduceFunction-mechanism-tp13651.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: coGroup exception or something else in Gelly job

2017-06-12 Thread Greg Hogan
I don't think it is possible for anyone to debug your exception without the
source code. Storing the adjacency list within the Vertex value is not
scalable. Can you share a basic description of the algorithm you are
working to implement?

On Mon, Jun 12, 2017 at 5:47 AM, Kaepke, Marc 
wrote:

> It seems Flink uses a different execution graph outside of my IDE
> (IntelliJ).
>
> The job anatomy is:
> load data from csv and build an initial graph => reduce that graph (remove
> loops and combine multi edges) => extend the modified graph by a new vertex
> value => run a gather-scatter iteration
>
> I have to extend the vertex value, because each vertex needs its incident
> edges inside the iteration. My CustomVertexValue is able to hold all
> incident edges. Vertex vertex
>
> Flink tries to optimize the execution graph, but that’s the issue.
> Maybe Flink provides a way for the programmer to influence this?
>
>
> Best and thanks
> Marc
>
>
> On 10.06.2017 at 00:49, Greg Hogan wrote:
>
> Have you looked at org.apache.flink.gelly.GraphExtension.
> CustomVertexValue.createInitSemiCluster(CustomVertexValue.java:51)?
>
>
> On Jun 9, 2017, at 4:53 PM, Kaepke, Marc 
> wrote:
>
> Hi everyone,
>
> I don’t have any exceptions if I execute my Gelly job in my IDE (local)
> directly.
> The next step is an execution with a real kubernetes cluster (1 JobManager
> and 3 TaskManager on dedicated machines). The word count example is running
> without exceptions. My Gelly job throws the following exception and I don’t
> know why.
>
> org.apache.flink.client.program.ProgramInvocationException: The program
> execution failed: Job execution failed.
> at org.apache.flink.client.program.ClusterClient.run(
> ClusterClient.java:478)
> at org.apache.flink.client.program.StandaloneClusterClient.submitJob(
> StandaloneClusterClient.java:105)
> at org.apache.flink.client.program.ClusterClient.run(
> ClusterClient.java:442)
> at org.apache.flink.client.program.ClusterClient.run(
> ClusterClient.java:429)
> at org.apache.flink.client.program.ContextEnvironment.
> execute(ContextEnvironment.java:62)
> at org.apache.flink.api.java.ExecutionEnvironment.execute(
> ExecutionEnvironment.java:926)
> at org.apache.flink.api.java.DataSet.collect(DataSet.java:410)
> at org.apache.flink.gelly.Algorithm.SemiClusteringPregel.run(
> SemiClusteringPregel.java:84)
> at org.apache.flink.gelly.Algorithm.SemiClusteringPregel.run(
> SemiClusteringPregel.java:34)
> at org.apache.flink.graph.Graph.run(Graph.java:1850)
> at org.apache.flink.gelly.job.GellyMain.main(GellyMain.java:128)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.flink.client.program.PackagedProgram.callMainMethod(
> PackagedProgram.java:528)
> at org.apache.flink.client.program.PackagedProgram.
> invokeInteractiveModeForExecution(PackagedProgram.java:419)
> at org.apache.flink.client.program.ClusterClient.run(
> ClusterClient.java:381)
> at org.apache.flink.client.CliFrontend.executeProgram(
> CliFrontend.java:838)
> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)
> at org.apache.flink.client.CliFrontend.parseParameters(
> CliFrontend.java:1086)
> at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)
> at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)
> at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(
> HadoopSecurityContext.java:43)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1657)
> at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(
> HadoopSecurityContext.java:40)
> at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job
> execution failed.
> at org.apache.flink.runtime.jobmanager.JobManager$$
> anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply$
> mcV$sp(JobManager.scala:933)
> at org.apache.flink.runtime.jobmanager.JobManager$$
> anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:876)
> at org.apache.flink.runtime.jobmanager.JobManager$$
> anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:876)
> at scala.concurrent.impl.Future$PromiseCompletingRunnable.
> liftedTree1$1(Future.scala:24)
> at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(
> Future.scala:24)
> at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(
> AbstractDispatcher.scala:397)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.
> runTask(Fork

Re: Use Single Sink For All windows

2017-06-12 Thread Aljoscha Krettek
Ah, I think now I get your problem. You could manually implement batching 
inside your SinkFunction. The SinkFunction would batch values in memory and 
periodically (based on the count of values and on a timeout) send these values 
as a single batch to MySQL. To ensure that data is not lost you can implement 
the CheckpointedFunction interface and make sure to always flush to MySQL when 
a snapshot is happening.

Does that help?
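
A minimal sketch of that approach (generic element type; the batch size is made up
and the timeout-based flush is omitted for brevity; flushToMySql() stands in for a
real JDBC batch insert):

import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.ArrayList;
import java.util.List;

public class BatchingMySqlSink<T> extends RichSinkFunction<T> implements CheckpointedFunction {

    private static final int MAX_BATCH_SIZE = 500;   // arbitrary threshold
    private final List<T> buffer = new ArrayList<>();

    @Override
    public void invoke(T value) {
        buffer.add(value);
        if (buffer.size() >= MAX_BATCH_SIZE) {
            flushToMySql(buffer);
            buffer.clear();
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) {
        // flush on every checkpoint so no buffered data is lost beyond what the
        // sources replay on recovery
        flushToMySql(buffer);
        buffer.clear();
    }

    @Override
    public void initializeState(FunctionInitializationContext context) {
        // nothing to restore in this simple sketch
    }

    private void flushToMySql(List<T> batch) {
        // placeholder for a JDBC batch insert
    }
}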

> On 8. Jun 2017, at 11:46, Nico Kruber  wrote:
> 
> How about using asynchronous I/O operations?
> 
> https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/
> asyncio.html
> 
> 
> Nico
> 
> On Tuesday, 6 June 2017 16:22:31 CEST rhashmi wrote:
>> Because of parallelism I am seeing DB contention. Wondering if I can merge
>> the sinks of multiple windows and insert in batches.
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Use-Sin
>> gle-Sink-For-All-windows-tp13475p13525.html Sent from the Apache Flink User
>> Mailing List archive. mailing list archive at Nabble.com.
> 



Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-12 Thread Stefan Richter
Hi,

can you please take a look at your TM logs? I would expect that you can see an 
java.lang.OutOfMemoryError there.

If this assumption is correct, you can try to:

1. Further decrease the taskmanager.memory.fraction: This will cause the 
TaskManager to allocate less memory for managed memory and leaves more free 
heap memory available
2. Decrease the number of slots on the TaskManager: This will decrease the 
number of concurrently running user functions and thus the number of objects 
which have to be kept on the heap.
3. Increase the number of ALS blocks `als.setBlocks(numberBlocks)`. This will 
increase the number of blocks into which the factor matrices are split up. A 
larger number means that each individual block is smaller and thus will need 
fewer memory to be kept on the heap.

Best,
Stefan

> On 12.06.2017 at 15:55, Sebastian Neef wrote:
> 
> Hi,
> 
> when I'm running my Flink job on a small dataset, it successfully
> finishes. However, when a bigger dataset is used, I get multiple exceptions:
> 
> -  Caused by: java.io.IOException: Cannot write record to fresh sort
> buffer. Record too large.
> - Thread 'SortMerger Reading Thread' terminated due to an exception: null
> 
> A full stack trace can be found here [0].
> 
> I tried to reduce the taskmanager.memory.fraction (or so) and also the
> amount of parallelism, but that did not help much.
> 
> Flink 1.0.3-Hadoop2.7 was used.
> 
> Any tips are appreciated.
> 
> Kind regards,
> Sebastian
> 
> [0]:
> http://paste.gehaxelt.in/?1f24d0da3856480d#/dR8yriXd/VQn5zTfZACS52eWiH703bJbSTZSifegwI=



Flink Streaming and K-Nearest-Neighbours

2017-06-12 Thread Alexandru Ciobanu

Hello Flink community,
I would like to know how you would calculate k-nearest neighbours using
the Flink streaming environment - is this even possible?

What I currently have is a datastream which comes from a socket. The
messages from the socket are run through a map and a reduce function,
thus I have something like
Tuple3. I have seen that there is a Flink
k-means algorithm in Scala working on DataSet[Vector]. Can you point me
in the right direction on how to
transform the Tuples into a DataSet[Vector] in Java? If this is not
possible with Flink streaming, what would you recommend for k-NN of
streams?

Best regards


Re: dynamic add sink to flink

2017-06-12 Thread Tzu-Li (Gordon) Tai
Hi,

The Flink Kafka Producer allows writing to multiple topics beside the default 
topic.
To do this, you can override the configured default topic by implementing the 
`getTargetTopic` method on the `KeyedSerializationSchema`.
That method is invoked for each record, and if a value is returned, the record 
will be routed to that topic instead of the default one.

Does this address what you have in mind?

Cheers,
Gordon
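
A minimal sketch of this, assuming (purely for the example) String records of the
form "topicName;payload"; adapt the parsing to your own record type:

import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema;

import java.nio.charset.StandardCharsets;

public class RoutingSerializationSchema implements KeyedSerializationSchema<String> {

    @Override
    public byte[] serializeKey(String element) {
        return null; // no message key
    }

    @Override
    public byte[] serializeValue(String element) {
        String payload = element.substring(element.indexOf(';') + 1);
        return payload.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public String getTargetTopic(String element) {
        // a non-null return value overrides the producer's default topic for this record
        return element.substring(0, element.indexOf(';'));
    }
}

The versioned FlinkKafkaProducer constructors accept a KeyedSerializationSchema in
place of a plain SerializationSchema, so a single producer instance can serve many
topics.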

On 10 June 2017 at 6:20:59 AM, qq (468137...@qq.com) wrote:

Hi: 
We use Flink as a router for our Kafka data: we read from one Kafka cluster and 
write to a lot of different Kafka clusters and topics. 
However, it costs too much time to start the Flink job whenever we want to add 
another Kafka sink to it. 
Is there any way to dynamically add sinks to Flink, or at least to start the 
Flink job faster? I think the reason for the slow start 
is the large number of Kafka sinks. 

We now use demo code like this: 

splitStream = stream.split(by the content) 
for ((k, v) <- map) { 
  splitStream.select(k).addSink(new kafkaSink(v)) 
} 
 








Re: [DISCUSS] Removal of twitter-inputformat

2017-06-12 Thread sblackmon
Hello,

Apache Streams (incubating) maintains and publishes json-schemas and 
jackson-compatible POJOs for Twitter and other popular third-party APIs.

http://streams.apache.org/site/0.5.1-incubating-SNAPSHOT/streams-project/streams-contrib/streams-provider-twitter/index.html

We also have a repository of public examples, one of which demonstrates how to 
embed various twitter data collectors into Flink.

http://streams.apache.org/site/0.5.1-incubating-SNAPSHOT/streams-examples/streams-examples-flink/flink-twitter-collection/index.html

We’d welcome support from anyone in the Flink project to help us maintain and 
improve these examples. Potentially, Flink could keep the benefit of having 
useful, ready-to-run examples for new Flink users, while getting the boring 
code out of your code base. Also, our examples have integration tests that 
actually connect to Twitter and check that everything continues to work :)

If anyone wants to know more about this, feel free to reach out to the team on 
d...@streams.incubator.apache.org

Steve
sblack...@apache.org
On June 12, 2017 at 7:18:08 AM, Aljoscha Krettek (aljos...@apache.org) wrote:

Bumpety-bump.  

I would be in favour of removing this:  
- It can be implemented as a MapFunction parser after a TextInputFormat  
- Additions, changes, fixes that happen on TextInputFormat are not reflected in 
SimpleTweetInputFormat  
- SimpleTweetInputFormat overrides nextRecord(), which is not something 
DelimitedInputFormats are normally supposed to do, I think  
- The Tweet POJO has a very strange naming scheme  

Best,  
Aljoscha  

> On 7. Jun 2017, at 11:15, Chesnay Schepler  wrote:  
>  
> Hello,  
>  
> I'm proposing to remove the Twitter-InputFormat in FLINK-6710 
> , with an open PR you can 
> find here .  
> The PR currently has a +1 from Robert, but Timo raised some concerns saying 
> that it is useful for prototyping and  
> advised me to start a discussion on the ML.  
>  
> This format is a DelimitedInputFormat that reads JSON objects and turns them 
> into a custom tweet class.  
> I believe this format doesn't provide much value to Flink; there's nothing 
> interesting about it as an InputFormat,  
> as it is purely an exercise in manually converting a JSON object into a POJO. 
>  
> This is apparent since you could just as well use 
> ExecutionEnvironment#readTextFile(...) and throw the parsing logic  
> into a subsequent MapFunction.  
>  
> In the PR i suggested to replace this with a JsonInputFormat, but this was a 
> misguided attempt at getting Timo to agree  
> to the removal. This format has the same problem outlined above, as it could 
> be effectively implemented with a one-liner map function.  
>  
> So the question now is whether we want to keep it, remove it, or replace it 
> with something more general.  
>  
> Regards,  
> Chesnay  



Cannot write record to fresh sort buffer. Record too large.

2017-06-12 Thread Sebastian Neef
Hi,

when I'm running my Flink job on a small dataset, it successfully
finishes. However, when a bigger dataset is used, I get multiple exceptions:

-  Caused by: java.io.IOException: Cannot write record to fresh sort
buffer. Record too large.
- Thread 'SortMerger Reading Thread' terminated due to an exception: null

A full stack trace can be found here [0].

I tried to reduce the taskmanager.memory.fraction (or so) and also the
amount of parallelism, but that did not help much.

Flink 1.0.3-Hadoop2.7 was used.

Any tips are appreciated.

Kind regards,
Sebastian

[0]:
http://paste.gehaxelt.in/?1f24d0da3856480d#/dR8yriXd/VQn5zTfZACS52eWiH703bJbSTZSifegwI=


Re: Flink-derived operator names cause issues in Graphite metrics

2017-06-12 Thread Carst Tankink
Thanks for the quick response :-)

I think the limiting of names might still be good enough for my use-case, 
because the default case is naming operators properly (it helps in creating 
dashboards...) but if we forget/miss one, we do not want to start hammering our 
graphite setup with bad data.

Thanks again,
Carst
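
For anyone hitting the same problem, a small word-count-style sketch (the names
are arbitrary) of giving every chained operator an explicit, short name so the
metric scope never falls back to the generated TriggerWindow(...) string:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class NamedOperatorsJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999).name("socket-source")
           .map(new MapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public Tuple2<String, Integer> map(String line) {
                   return Tuple2.of(line, 1);
               }
           }).name("to-tuple")
           .keyBy(0)
           .timeWindow(Time.minutes(1), Time.seconds(30))
           .sum(1).name("sliding-count")   // without this, the generated window name ends up in the metric path
           .print().name("stdout-sink");

        env.execute("graphite-friendly-names");
    }
}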

From: Chesnay Schepler 
Date: Monday, June 12, 2017 at 15:10
To: "user@flink.apache.org" 
Subject: Re: Flink-derived operator names cause issues in Graphite metrics

So there's 2 issues here:

  1.  The default names for windows are horrible. They are too long, full of 
special characters, and unstable as reported in 
FLINK-6464
  2.  The reporter doesn't filter out metrics it can't report.

For 2) we can do 2 things:

  *   If a fully assembled metric name is too long the graphite reporter will 
ignore the metric and log a warning.
  *   when converting the operator name to a string, limit the total size. Say, 
40-60 characters. This may not be enough for your use-case though.
I'll create JIRAs for 2), and try to fix them as soon as possible.

A more comprehensive solution will be made as part of FLINK-6464, which 
includes a clean-up/refactoring of operator names.

On 12.06.2017 14:45, Carst Tankink wrote:

Hi,



We accidentally forgot to give some operators in our flink stream a 
custom/unique name, and ran into the following exception in Graphite:

‘exceptions.IOError: [Errno 36] File name too long: 
'///TriggerWindow_SlidingEventTimeWindows_60_-60__-FoldingStateDescriptor_serializer=org-apache-flink-api-common-typeutils-base-IntSerializer_655523dd_-initialValue=0_-foldFunction=_24e08d59__-EventTimeTrigger___-WindowedStream-fold_AllWindowedStream-java:
 __/0/buffers/inputQueueLength.wsp'



(some placeholders because it might reveal too much about our platform, sorry. 
The actual filename is quite a bit longer).



The problem seems to be that Flink uses toString for the operator if no name is 
set, and the graphite exporter does not sanitize the output for length.

Is this something that should be posted as a bug?  Or a known limitation that 
we missed in the documentation?



Thanks,

Carst






Re: Flink-derived operator names cause issues in Graphite metrics

2017-06-12 Thread Chesnay Schepler

So there's 2 issues here:

1. The default names for windows are horrible. They are too long, full
   of special characters, and unstable as reported in FLINK-6464
   
2. The reporter doesn't filter out metrics it can't report.

For 2) we can do 2 things:

 * If a fully assembled metric name is too long the graphite reporter
   will ignore the metric and log a warning.
 * when converting the operator name to a string, limit the total size.
   Say, 40-60 characters. This may not be enough for your use-case though.

I'll create JIRAs for 2), and try to fix them as soon as possible.

A more comprehensive solution will be made as part of FLINK-6464, which 
includes a clean-up/refactoring of operator names.


On 12.06.2017 14:45, Carst Tankink wrote:

Hi,

We accidentally forgot to give some operators in our flink stream a 
custom/unique name, and ran into the following exception in Graphite:
‘exceptions.IOError: [Errno 36] File name too long: 
'///TriggerWindow_SlidingEventTimeWindows_60_-60__-FoldingStateDescriptor_serializer=org-apache-flink-api-common-typeutils-base-IntSerializer_655523dd_-initialValue=0_-foldFunction=_24e08d59__-EventTimeTrigger___-WindowedStream-fold_AllWindowedStream-java:
 __/0/buffers/inputQueueLength.wsp'

(some placeholders because it might reveal too much about our platform, sorry. 
The actual filename is quite a bit longer).

The problem seems to be that Flink uses toString for the operator if no name is 
set, and the graphite exporter does not sanitize the output for length.
Is this something that should be posted as a bug?  Or a known limitation that 
we missed in the documentation?

Thanks,
Caarst





Flink-derived operator names cause issues in Graphite metrics

2017-06-12 Thread Carst Tankink
Hi, 

We accidentally forgot to give some operators in our flink stream a 
custom/unique name, and ran into the following exception in Graphite:
‘exceptions.IOError: [Errno 36] File name too long: 
'///TriggerWindow_SlidingEventTimeWindows_60_-60__-FoldingStateDescriptor_serializer=org-apache-flink-api-common-typeutils-base-IntSerializer_655523dd_-initialValue=0_-foldFunction=_24e08d59__-EventTimeTrigger___-WindowedStream-fold_AllWindowedStream-java:
 __/0/buffers/inputQueueLength.wsp'

(some placeholders because it might reveal too much about our platform, sorry. 
The actual filename is quite a bit longer).

The problem seems to be that Flink uses toString for the operator if no name is 
set, and the graphite exporter does not sanitize the output for length. 
Is this something that should be posted as a bug?  Or a known limitation that 
we missed in the documentation? 

Thanks,
Carst



Re: [DISCUSS] Removal of twitter-inputformat

2017-06-12 Thread Aljoscha Krettek
Bumpety-bump.

I would be in favour of removing this:
 - It can be implemented as a MapFunction parser after a TextInputFormat (see the sketch below)
 - Additions, changes, fixes that happen on TextInputFormat are not reflected 
in SimpleTweetInputFormat
 - SimpleTweetInputFormat overrides nextRecord(), which is not something 
DelimitedInputFormats are normally supposed to do, I think
 - The Tweet POJO has a very strange naming scheme

Best,
Aljoscha
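
The first point above could look roughly like this; a hedged sketch (the path and
the use of Jackson's ObjectMapper are just assumptions for the example) of reading
tweets as plain text lines and parsing them in a MapFunction:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class TweetsViaTextInputFormat {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<JsonNode> tweets = env
            .readTextFile("hdfs:///path/to/tweets")            // plain TextInputFormat under the hood
            .map(new RichMapFunction<String, JsonNode>() {
                private transient ObjectMapper mapper;

                @Override
                public void open(Configuration parameters) {
                    mapper = new ObjectMapper();                // create once per task, not per record
                }

                @Override
                public JsonNode map(String line) throws Exception {
                    return mapper.readTree(line);               // one JSON object per line
                }
            });

        tweets.first(10).print();
    }
}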

> On 7. Jun 2017, at 11:15, Chesnay Schepler  wrote:
> 
> Hello,
> 
> I'm proposing to remove the Twitter-InputFormat in FLINK-6710 
> , with an open PR you can 
> find here .
> The PR currently has a +1 from Robert, but Timo raised some concerns saying 
> that it is useful for prototyping and
> advised me to start a discussion on the ML.
> 
> This format is a DelimitedInputFormat that reads JSON objects and turns them 
> into a custom tweet class.
> I believe this format doesn't provide much value to Flink; there's nothing 
> interesting about it as an InputFormat,
> as it is purely an exercise in manually converting a JSON object into a POJO.
> This is apparent since you could just as well use 
> ExecutionEnvironment#readTextFile(...) and throw the parsing logic
> into a subsequent MapFunction.
> 
> In the PR i suggested to replace this with a JsonInputFormat, but this was a 
> misguided attempt at getting Timo to agree
> to the removal. This format has the same problem outlined above, as it could 
> be effectively implemented with a one-liner map function.
> 
> So the question now is whether we want to keep it, remove it, or replace it 
> with something more general.
> 
> Regards,
> Chesnay



Re: Error running Flink job in Yarn-cluster mode

2017-06-12 Thread Aljoscha Krettek
Hi,

What version of Flink are you using? Also, how are you building your job? Is it 
a fat-jar, maybe based in the Flink Quickstart project?

Best,
Aljoscha

> On 9. Jun 2017, at 11:49, Biplob Biswas  wrote:
> 
> One more thing I noticed is that the streaming wordcount from the Flink
> package works when I run it, but when I used the same GitHub code, packaged
> it and uploaded the fat jar with the word count example to the cluster, I
> get the same error.
> 
> I am wondering how making my uber jar can produce such an erroneous result
> when I am basically not changing anything else.
> 
> 
> 
> --
> View this message in context: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Error-running-Flink-job-in-Yarn-cluster-mode-tp13593p13607.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive at 
> Nabble.com.



Re: Java 8 lambdas for CEP patterns won't compile

2017-06-12 Thread Kostas Kloudas
Done.

> On Jun 12, 2017, at 12:24 PM, Ted Yu  wrote:
> 
> Can you add a link to this thread in the JIRA?
> 
> Cheers
> 
> On Mon, Jun 12, 2017 at 3:15 AM, Kostas Kloudas wrote:
> Unfortunately, there was no discussion as this regression came as an 
> artifact of the addition of the IterativeConditions, but it will be fixed.
> 
> This is the JIRA to track it:
> https://issues.apache.org/jira/browse/FLINK-6897 
> 
> 
> Kostas
> 
>> On Jun 12, 2017, at 11:51 AM, Ted Yu wrote:
>> 
>> Do you know which JIRA / discussion thread had the context for this decision?
>> 
>> I did a quick search in JIRA but only found FLINK-3681.
>> 
>> Cheers
>> 
>> On Mon, Jun 12, 2017 at 1:48 AM, Kostas Kloudas wrote:
>> Hi David and Ted,
>> 
>> The documentation is outdated. I will update it today.
>> Java8 Lambdas are NOT supported by CEP in Flink 1.3.
>> 
>> Hopefully this will change soon. I will open a JIRA for this.
>> 
>> Cheers,
>> Kostas
>> 
>>> On Jun 11, 2017, at 11:55 PM, Ted Yu wrote:
>>> 
>>>  
>> 
>> 
> 
> 



Re: Java 8 lambdas for CEP patterns won't compile

2017-06-12 Thread Ted Yu
Can you add a link to this thread in the JIRA?

Cheers

On Mon, Jun 12, 2017 at 3:15 AM, Kostas Kloudas  wrote:

> Unfortunately, there was no discussion as this regression came as an
> artifact of the addition of the IterativeConditions, but it will be fixed.
>
> This is the JIRA to track it:
> https://issues.apache.org/jira/browse/FLINK-6897
>
> Kostas
>
> On Jun 12, 2017, at 11:51 AM, Ted Yu  wrote:
>
> Do you know which JIRA / discussion thread had the context for this decision?
>
> I did a quick search in JIRA but only found FLINK-3681.
>
> Cheers
>
> On Mon, Jun 12, 2017 at 1:48 AM, Kostas Kloudas <
> k.klou...@data-artisans.com> wrote:
>
>> Hi David and Ted,
>>
>> The documentation is outdated. I will update it today.
>> Java8 Lambdas are NOT supported by CEP in Flink 1.3.
>>
>> Hopefully this will change soon. I will open a JIRA for this.
>>
>> Cheers,
>> Kostas
>>
>> On Jun 11, 2017, at 11:55 PM, Ted Yu  wrote:
>>
>>
>>
>>
>>
>
>


Re: Java 8 lambdas for CEP patterns won't compile

2017-06-12 Thread Kostas Kloudas
Unfortunately, there was no discussion as this regression came as an 
artifact of the addition of the IterativeConditions, but it will be fixed.

This is the JIRA to track it:
https://issues.apache.org/jira/browse/FLINK-6897 


Kostas

> On Jun 12, 2017, at 11:51 AM, Ted Yu  wrote:
> 
> Do you know which JIRA / discussion thread had the context for this decision?
> 
> I did a quick search in JIRA but only found FLINK-3681.
> 
> Cheers
> 
> On Mon, Jun 12, 2017 at 1:48 AM, Kostas Kloudas wrote:
> Hi David and Ted,
> 
> The documentation is outdated. I will update it today.
> Java8 Lambdas are NOT supported by CEP in Flink 1.3.
> 
> Hopefully this will change soon. I will open a JIRA for this.
> 
> Cheers,
> Kostas
> 
>> On Jun 11, 2017, at 11:55 PM, Ted Yu wrote:
>> 
>>  
> 
> 



Re: Java 8 lambdas for CEP patterns won't compile

2017-06-12 Thread Ted Yu
Do you know which JIRA / discussion thread had the context for this decision?

I did a quick search in JIRA but only found FLINK-3681.

Cheers

On Mon, Jun 12, 2017 at 1:48 AM, Kostas Kloudas  wrote:

> Hi David and Ted,
>
> The documentation is outdated. I will update it today.
> Java8 Lambdas are NOT supported by CEP in Flink 1.3.
>
> Hopefully this will change soon. I will open a JIRA for this.
>
> Cheers,
> Kostas
>
> On Jun 11, 2017, at 11:55 PM, Ted Yu  wrote:
>
>
>
>
>


Re: coGroup exception or something else in Gelly job

2017-06-12 Thread Kaepke, Marc
It seems Flink uses a different execution graph outside of my IDE (IntelliJ).

The job anatomy is:
load data from csv and build an initial graph => reduce that graph (remove 
loops and combine multi edges) => extend the modified graph by a new vertex 
value => run a gather-scatter iteration

I have to extend the vertex value, because each vertex needs its incident edges 
inside the iteration. My CustomVertexValue is able to hold all incident edges. 
Vertex vertex

Flink tries to optimize the execution graph, but that’s the issue.
Maybe Flink provides a way for the programmer to influence this?


Best and thanks
Marc


On 10.06.2017 at 00:49, Greg Hogan <c...@greghogan.com> wrote:

Have you looked at 
org.apache.flink.gelly.GraphExtension.CustomVertexValue.createInitSemiCluster(CustomVertexValue.java:51)?


On Jun 9, 2017, at 4:53 PM, Kaepke, Marc <marc.kae...@haw-hamburg.de> wrote:

Hi everyone,

I don’t have any exceptions if I execute my Gelly job in my IDE (local) 
directly.
The next step is an execution with a real kubernetes cluster (1 JobManager and 
3 TaskManager on dedicated machines). The word count example is running without 
exceptions. My Gelly job throws the following exception and I don’t know why.

org.apache.flink.client.program.ProgramInvocationException: The program 
execution failed: Job execution failed.
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:478)
at 
org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:105)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:429)
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
at 
org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:926)
at org.apache.flink.api.java.DataSet.collect(DataSet.java:410)
at 
org.apache.flink.gelly.Algorithm.SemiClusteringPregel.run(SemiClusteringPregel.java:84)
at 
org.apache.flink.gelly.Algorithm.SemiClusteringPregel.run(SemiClusteringPregel.java:34)
at org.apache.flink.graph.Graph.run(Graph.java:1850)
at org.apache.flink.gelly.job.GellyMain.main(GellyMain.java:128)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:419)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:381)
at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)
at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)
at 
org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at 
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution 
failed.
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply$mcV$sp(JobManager.scala:933)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:876)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$6.apply(JobManager.scala:876)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.NullPointerException
at 
org.apache.flink.gelly.GraphExtension.CustomVertexValue.createInitSemiCluster(CustomVertexValue.java:51)
at 
org.apache.flink.gelly.PreModification.IncidentEdgesCollector.iterateEdges(IncidentEdgesColl

Re: Java 8 lambdas for CEP patterns won't compile

2017-06-12 Thread Kostas Kloudas
Hi David and Ted,

The documentation is outdated. I will update it today.
Java8 Lambdas are NOT supported by CEP in Flink 1.3.

Hopefully this will change soon. I will open a JIRA for this.

Cheers,
Kostas

> On Jun 11, 2017, at 11:55 PM, Ted Yu  wrote:
> 
>  



Re: Guava version conflict

2017-06-12 Thread Tzu-Li (Gordon) Tai
This seems like a shading problem then.
I’ve tested this again with Maven 3.0.5; even without building against CDH 
Hadoop binaries, the flink-dist jar contains non-shaded Guava dependencies.

Let me investigate a bit and get back to this!

Cheers,
Gordon

On 8 June 2017 at 2:47:02 PM, Flavio Pompermaier (pomperma...@okkam.it) wrote:

On an empty machine (with Ubuntu 14.04.5 LTS) and an empty maven local repo I 
did:
git clone https://github.com/apache/flink.git && cd flink && git checkout 
tags/release-1.2.1
/opt/devel/apache-maven-3.3.9/bin/mvn clean install 
-Dhadoop.version=2.6.0-cdh5.9.0 -Dhbase.version=1.2.0-cdh5.9.0 
-Dhadoop.core.version=2.6.0-mr1-cdh5.9.0 -DskipTests -Pvendor-repos
cd flink-dist
/opt/devel/apache-maven-3.3.9/bin/mvn clean install 
-Dhadoop.version=2.6.0-cdh5.9.0 -Dhbase.version=1.2.0-cdh5.9.0 
-Dhadoop.core.version=2.6.0-mr1-cdh5.9.0 -DskipTests -Pvendor-repos
jar tf target/flink-1.2.1-bin/flink-1.2.1/lib/flink-dist_2.10-1.2.1.jar  | grep 
MoreExecutors
And I still see guava dependencies:

org/apache/flink/hadoop/shaded/com/google/common/util/concurrent/MoreExecutors$1.class
org/apache/flink/hadoop/shaded/com/google/common/util/concurrent/MoreExecutors$SameThreadExecutorService.class
org/apache/flink/hadoop/shaded/com/google/common/util/concurrent/MoreExecutors$ListeningDecorator.class
org/apache/flink/hadoop/shaded/com/google/common/util/concurrent/MoreExecutors$ScheduledListeningDecorator.class
org/apache/flink/hadoop/shaded/com/google/common/util/concurrent/MoreExecutors.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$1.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$2.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$3.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$4.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$Application$1.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$Application.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$DirectExecutor.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$DirectExecutorService.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$ListeningDecorator.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$ScheduledListeningDecorator$ListenableScheduledTask.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors$ScheduledListeningDecorator.class
org/apache/flink/shaded/com/google/common/util/concurrent/MoreExecutors.class
com/google/common/util/concurrent/MoreExecutors$1.class
com/google/common/util/concurrent/MoreExecutors$SameThreadExecutorService.class
com/google/common/util/concurrent/MoreExecutors$ListeningDecorator.class
com/google/common/util/concurrent/MoreExecutors$ScheduledListeningDecorator.class
com/google/common/util/concurrent/MoreExecutors.class

It seems that it mixes together Guava 11 (probably coming from CDH dependencies) 
and Guava 18 classes.

Also, using Maven 3.0.5 led to the same output :(

Best,
Flavio 

On Wed, Jun 7, 2017 at 6:21 PM, Tzu-Li (Gordon) Tai  wrote:
Yes, those should not be in the flink-dist jar, so the root reason should be 
that the shading isn’t working properly for your custom build.

If possible, could you try building Flink again with a lower Maven version as 
specified in the doc, and see if that works?
If so, it could be that Maven 3.3.x simply isn’t shading properly even with the 
double compilation trick.


On 7 June 2017 at 6:17:15 PM, Flavio Pompermaier (pomperma...@okkam.it) wrote:

What I did was to take the sources of the new ES connector and copy them into 
my code.
Flink was compiled with maven 3.3+ but I did the double compilation as 
specified in the Flink build section.
In flink dist I see guava classes, e.g.:

com/google/common/util/concurrent/MoreExecutors$1.class
com/google/common/util/concurrent/MoreExecutors$SameThreadExecutorService.class
com/google/common/util/concurrent/MoreExecutors$ListeningDecorator.class
com/google/common/util/concurrent/MoreExecutors$ScheduledListeningDecorator.class
com/google/common/util/concurrent/MoreExecutors.class

Is it a problem of the shading with Maven 3.3+?

Best,
Flavio

On Wed, Jun 7, 2017 at 5:48 PM, Tzu-Li (Gordon) Tai  wrote:
Ah, I assumed you were running 1.3.0 (since you mentioned “new” ES connector).

Another thing to check: if you built Flink yourself, make sure you’re not using 
Maven 3.3+. There are shading problems when Flink is built with Maven versions 
higher than that.
The flink-dist jar should not contain any non-shaded Guava dependencies, could 
you also quickly check that?

On 7 June 2017 at 5:42:28 PM, Flavio Pompermaier (pomperma...@okkam.it) wrote:

I shaded the Elasticsearch depe