Re: [ANNOUNCE] Welcome Stefan Richter as a new committer

2017-02-10 Thread Stavros Kontopoulos
Congrats! On Fri, Feb 10, 2017 at 9:11 PM, Matthias J. Sax wrote: > Congrats! > > On 2/10/17 2:00 AM, Ufuk Celebi wrote: > > Hey everyone, > > > > I'm very happy to announce that the Flink PMC has accepted Stefan > > Richter to become a commi

Re: Hadoop 2.7.3

2017-02-10 Thread Dean Wampler
I don't have it any more, unfortunately. To be clear, I don't think it was Flink related, but a collision between a Hadoop security library calling into a Google Guava library, where a method was missing on CacheBuilder in the latter. Also, to add to the irritation, it only happened in my OSX envir

Re: Hadoop 2.7.3

2017-02-10 Thread Ted Yu
Dean: Can you pastebin the stack trace around the MethodMissing error? If there was no stack trace, please tell us what the log said. Thanks On Fri, Feb 10, 2017 at 2:26 PM, Dean Wampler wrote: > This is completely unrelated, but I just debugged a MethodMissing error in > an application s

Re: Hadoop 2.7.3

2017-02-10 Thread Dean Wampler
This is completely unrelated, but I just debugged a MethodMissing error in an application stack, where it doesn't occur with Hadoop 2.7.2, but does occur with 2.7.3 (yeah!). I would dig into the appropriate logs to see if an underlying exception is being thrown and you're not seeing enough detail.

Hadoop 2.7.3

2017-02-10 Thread Mohit Anchlia
Does Flink support Hadoop 2.7.3? I installed Flink for Hadoop 2.7.0 but I am seeing this error: 2017-02-10 18:59:52,661 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deployment took more than 60 seconds. Please check if the requested resources are available in the YARN cluster

Re: [ANNOUNCE] Welcome Stefan Richter as a new committer

2017-02-10 Thread Matthias J. Sax
Congrats! On 2/10/17 2:00 AM, Ufuk Celebi wrote: > Hey everyone, > > I'm very happy to announce that the Flink PMC has accepted Stefan > Richter to become a committer of the Apache Flink project. > > Stefan is part of the community for almost a y

Re: parallelism and slots allocated

2017-02-10 Thread bwong247
Hi Kurt, Thanks for the reply. Does this mean that if my job has 3 operators (not chained), it will use at least 3 slots? I thought parallelism was task based. You can define it at an operator level, but that only means that the tasks for that operator are distributed across that many slots.

How important is 'registerType'?

2017-02-10 Thread Dmitry Golubets
The docs say that it may improve performance. How true is that when custom serializers are provided? There is also a 'disableAutoTypeRegistration' method in the config class, implying Flink registers types automatically. So, given that I have a hierarchy: trait A; class B extends A; class C extends A

Re: tasks running in parallel beyond configured parallelism/slots

2017-02-10 Thread Aljoscha Krettek
Hi, Flink operators will not always (in fact almost never) run in a single slot. Mostly the whole parallel sub-slice of a pipeline can run in one slot, so in your case you get three parallel instances for every operator in your topology and then one instance of each operator will sit in a slot. Ch
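As a rough worked example of the slot arithmetic discussed in this thread: assuming Flink's default slot sharing, where one parallel instance of each operator can sit in the same slot, a job needs as many slots as its highest operator parallelism. Without sharing, every parallel operator instance would need a slot of its own. The helper names below are hypothetical, and this is plain Python arithmetic, not a Flink API.

```python
def slots_with_sharing(parallelisms):
    """Slots needed when one instance of each operator may share a slot."""
    return max(parallelisms)


def slots_without_sharing(parallelisms):
    """Slots needed when every parallel operator instance gets its own slot."""
    return sum(parallelisms)


# Three unchained operators, each with parallelism 3 (assumed example job).
job = [3, 3, 3]
shared = slots_with_sharing(job)
unshared = slots_without_sharing(job)
```

Under these assumptions the job fits into 3 slots with sharing, while it would take 9 if no instances could share.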

Re: Start streaming tuples depending on another streams rate

2017-02-10 Thread Aljoscha Krettek
Hi, I think there are two (somewhat) orthogonal problems here: 1) Determining when a stream of input data switches from the "reading old data" to the "reading current data" phase. 2) Blocking/buffering one input of an operator depending on some condition on the other input. I think 1. can only
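The second problem above (buffering one input depending on a condition on the other) can be sketched in plain Python. This is only an illustration under stated assumptions, not Flink code: the class `GatedCoProcessor`, its methods, and the `caught_up` predicate are hypothetical names, and how to detect the "caught up" condition (problem 1) is left to the caller.

```python
class GatedCoProcessor:
    """Buffers elements of input B until input A reports that it has caught up."""

    def __init__(self, caught_up):
        self.caught_up = caught_up  # assumed predicate evaluated on elements of input A
        self.gate_open = False
        self.buffer = []            # elements of B held back so far
        self.output = []

    def on_a(self, element):
        self.output.append(("A", element))
        if not self.gate_open and self.caught_up(element):
            self.gate_open = True
            # Flush everything held back so far, in arrival order.
            self.output.extend(("B", b) for b in self.buffer)
            self.buffer.clear()

    def on_b(self, element):
        if self.gate_open:
            self.output.append(("B", element))
        else:
            self.buffer.append(element)


op = GatedCoProcessor(caught_up=lambda e: e == "latest")
op.on_b(1)
op.on_a("old")
op.on_b(2)
op.on_a("latest")
op.on_b(3)
```

In a real streaming job the buffer would grow without bound while the gate is closed and would have to live in checkpointed state; the sketch ignores both concerns.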

Re: complete digraph

2017-02-10 Thread Aljoscha Krettek
Hi, so you mean data flowing in arbitrary directions? That's certainly interesting but I guess it can lead to all sorts of problems in a distributed system. Cheers, Aljoscha On Wed, 8 Feb 2017 at 01:02 Chen Qin wrote: > Hi there, > > I don't think this would be an urgent topic but definitely see

Re: stream clustering in flink

2017-02-10 Thread Aljoscha Krettek
Hi, I think distributed stream clustering is still a somewhat open field. I'm not aware of popular open source systems that have implementations for that (except maybe Apache SAMOA). Maybe you will have some luck if you try to search for "distributed stream clustering" papers. Cheers, Aljoscha On

Re: Questions about the V-C Iteration in Gelly

2017-02-10 Thread Greg Hogan
Hi Xingcan, FLINK-1885 looked into adding a bulk mode to Gelly's iterative models. As an alternative you could implement your algorithm with Flink operators and a bulk iteration. Most of the Gelly library is written with native operators. Greg On Fri, Feb 10, 2017 at 5:02 AM, Xingcan Cui wrote

Re: Frequent Full GC's in case of FSStateBackend

2017-02-10 Thread Stefan Richter
Async snapshotting is the default. > On 10.02.2017 at 14:03, vinay patil wrote: > > Hi Stephan, > > Thank you for the clarification. > Yes with RocksDB I don't see Full GC happening, also I am using Flink 1.2.0 > version and I have set the statebackend in flink-conf.yaml file to rocksdb, >

Re: Join with Default-Value

2017-02-10 Thread Gábor Gévay
I'm not sure what exactly the problem is, but could you check this FAQ item? http://flink.apache.org/faq.html#why-am-i-getting-a-nonserializableexception- Best, Gábor 2017-02-10 14:16 GMT+01:00 Sebastian Neef : > Hi, > > thanks! That's exactly what I needed. > > I'm not using: DataSetA.leftOute

Re: Join with Default-Value

2017-02-10 Thread Sebastian Neef
Hi, thanks! That's exactly what I needed. I'm not using: DataSetA.leftOuterJoin(DataSetB).where(new KeySelector()).equalTo(new KeySelector()).with(new JoinFunction(...)). Now I get the following error: > Caused by: org.apache.flink.optimizer.CompilerException: Error translating > node 'Map "Ke

Re: Frequent Full GC's in case of FSStateBackend

2017-02-10 Thread vinay patil
Hi Stephan, Thank you for the clarification. Yes, with RocksDB I don't see Full GCs happening. I am using Flink 1.2.0 and have set the state backend in the flink-conf.yaml file to rocksdb. Does this do asynchronous checkpointing by default, or do I have to specify it at the job level? Re

Re: Join with Default-Value

2017-02-10 Thread Gábor Gévay
Hello Sebastian, You can use DataSet.leftOuterJoin for this. Best, Gábor 2017-02-10 12:58 GMT+01:00 Sebastian Neef : > Hi, > > is it possible to assign a "default" value to elements that didn't match? > > For example I have the following two datasets: > > |DataSetA | DataSetB| > -
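A minimal plain-Python sketch of the left-outer-join semantics suggested here, not Flink code: every element of A is kept, and an A element without a partner in B is paired with a caller-supplied default. The function name `left_outer_join_with_default`, the dictionary records, and the default `{"id": -1}` are illustrative assumptions; the ids mirror the example in the original question.

```python
from collections import defaultdict


def left_outer_join_with_default(left, right, key, default):
    """Pair every left element with each matching right element, or with `default`."""
    # Index the right side by key so each lookup is a dictionary access.
    right_by_key = defaultdict(list)
    for right_elem in right:
        right_by_key[key(right_elem)].append(right_elem)

    joined = []
    for left_elem in left:
        matches = right_by_key.get(key(left_elem))
        if matches:
            joined.extend((left_elem, m) for m in matches)
        else:
            # No partner on the right: emit the default instead of dropping the element.
            joined.append((left_elem, default))
    return joined


a = [{"id": 1}, {"id": 2}, {"id": 5}, {"id": 6}]
b = [{"id": 1}, {"id": 3}, {"id": 4}, {"id": 6}]
result = left_outer_join_with_default(a, b, key=lambda e: e["id"], default={"id": -1})
```

With Flink's DataSet outer joins, the join function should receive null for the missing side, so the default value would be substituted inside the JoinFunction rather than passed in as a parameter as it is here.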

Join with Default-Value

2017-02-10 Thread Sebastian Neef
Hi, is it possible to assign a "default" value to elements that didn't match? For example I have the following two datasets:

|DataSetA | DataSetB|
|id=1     | id=1    |
|id=2     | id=3    |
|id=5     | id=4    |
|id=6     | id=6    |

When doing a join with: A.join(B).where( KeySelector(A.id

Re: Frequent Full GC's in case of FSStateBackend

2017-02-10 Thread Stefan Richter
Hi, FSStateBackend operates completely on-heap and only snapshots for checkpoints are written to the file system. This is why the backend is typically faster for small states, but can become problematic for larger states. If your state exceeds a certain size, you should strongly consider using Roc

Frequent Full GC's in case of FSStateBackend

2017-02-10 Thread Vinay Patil
Hi, I am running a performance test for my pipeline with FSStateBackend and have observed frequent Full GCs after processing 20M records. When I did a memory analysis using MAT, it showed that many objects maintained by Flink state are live. Flink keeps the state in memory even after checkpoint

Re: Questions about the V-C Iteration in Gelly

2017-02-10 Thread Xingcan Cui
Hi Vasia, b) As I said, when some vertices have finished their work in the current phase, they have nothing to do (no value updates, no messages received, as if asleep) but wait for the other vertices that have not yet finished the current phase. After that in the next phase, all the vertices should go

[ANNOUNCE] Welcome Stefan Richter as a new committer

2017-02-10 Thread Ufuk Celebi
Hey everyone, I'm very happy to announce that the Flink PMC has accepted Stefan Richter to become a committer of the Apache Flink project. Stefan has been part of the community for almost a year now and has worked on major features of the latest 1.2 release, most notably rescaling and backwards compatibili

Re: parallelism and slots allocated

2017-02-10 Thread Kurt Young
Hi, Parallelism is actually defined at the operator level, and each instance of the operator will occupy one slot. In some cases, Flink uses chaining to let multiple operators share a single slot, but sometimes this cannot be done. If your job contains multiple operators and some of them cannot be cha

Re: Questions about the V-C Iteration in Gelly

2017-02-10 Thread Vasiliki Kalavri
Hi Xingcan, On 9 February 2017 at 18:16, Xingcan Cui wrote: > Hi Vasia, > > thanks for your reply. It helped a lot and I got some new ideas. > > a) As you said, I did use the getPreviousIterationAggregate() method in > preSuperstep() of the next superstep. > However, if the (only?) global (aggre

Re: Start streaming tuples depending on another streams rate

2017-02-10 Thread Jonas
Tzu-Li (Gordon) Tai wrote > Stream A has a rate of 100k tuples/s. After processing the whole Kafka > queue, the rate drops to 10 tuples/s. Absolutely correct. Tzu-Li (Gordon) Tai wrote > So what you are looking for is that flatMap2 for stream B only doing work > after the job reaches the latest re