Execution graph

2015-06-29 Thread Michele Bertoni
Hi, I was trying to run my program in the Flink web environment (the local one). When I run it I get the graph of the planned execution, but each node shows "parallelism = 1"; instead I think it runs with par = 8 (8 cores, I always get 8 outputs). What does that mean? Is that wrong or is it

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

2015-06-29 Thread Andra Lungu
Something similar in flink-0.10-SNAPSHOT: 06/29/2015 10:33:46 CHAIN Join(Join at main(TriangleCount.java:79)) -> Combine (Reduce at main(TriangleCount.java:79))(222/224) switched to FAILED java.lang.Exception: The slot in which the task was executed has been released. Probably loss of TaskMana

Re: cogroup

2015-06-29 Thread Michele Bertoni
Ok, thanks! Then for now I will use it until true outer join is ready. On 29 Jun 2015, at 18:22, Fabian Hueske wrote: Yes, if you need outer join semantics you have to go with CoGroup. Some members of the Flink community are working on true outer joins

Re: cogroup

2015-06-29 Thread Fabian Hueske
Yes, if you need outer join semantics you have to go with CoGroup. Some members of the Flink community are working on true outer joins for Flink, but I don't know what the progress is. Best, Fabian 2015-06-29 18:05 GMT+02:00 Michele Bertoni : > thanks both for answering, > that’s what i expect

Re: cogroup

2015-06-29 Thread Michele Bertoni
Thanks both for answering, that's what I expected. I was using join at first, but sadly I had to move from join to cogroup because I need an outer join. The alternative to the cogroup is to "complete" the inner join by extracting from the original dataset what did not match in the cogroup by differenc

Re: cogroup

2015-06-29 Thread Fabian Hueske
If you just want to do the pairwise comparison try join(). Join is an inner join and will give you all pairs of elements with matching keys. For CoGroup, there is no other way than collecting one side in memory. Best, Fabian 2015-06-29 17:42 GMT+02:00 Matthias J. Sax : > Why do you not use a joi

Re: cogroup

2015-06-29 Thread Matthias J. Sax
Why do you not use a join? CoGroup seems not to be the right operator. -Matthias On 06/29/2015 05:40 PM, Michele Bertoni wrote: > Hi I have a question on cogroup > > when I cogroup two dataset is there a way to compare each element on the left > with each element on the right (inside a group) w

cogroup

2015-06-29 Thread Michele Bertoni
Hi, I have a question on cogroup. When I cogroup two datasets, is there a way to compare each element on the left with each element on the right (inside a group) without collecting one side? Right now I am doing left.cogroup(right).where(0,1,2).equalTo(0,1,2){ (leftIterator, rightIterator,
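The distinction discussed in this thread can be sketched outside Flink with plain JDK collections. This is not the Flink API; it only models the semantics: join emits every left/right pair for keys present on both sides, while coGroup is called once per key with both groups (including keys that exist on only one side), which is why it can emulate an outer join but also why one side must be held in memory for pairwise work. All class and method names here are made up for the demo.

```java
import java.util.*;

public class CoGroupVsJoin {

    // join semantics: for each key present on BOTH sides, emit every
    // left/right pair (the pairwise comparison) -- an inner join.
    static List<String> innerJoin(Map<String, List<String>> left,
                                  Map<String, List<String>> right) {
        List<String> out = new ArrayList<>();
        for (String key : left.keySet()) {
            if (!right.containsKey(key)) continue;
            for (String l : left.get(key))
                for (String r : right.get(key))
                    out.add(key + ":" + l + "/" + r);
        }
        return out;
    }

    // coGroup semantics: invoked once per key with both groups, even when
    // one side is empty -- so keys missing on one side can still be emitted,
    // giving full-outer-join behavior.
    static List<String> coGroupOuter(Map<String, List<String>> left,
                                     Map<String, List<String>> right) {
        Set<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            List<String> ls = left.getOrDefault(key, List.of());
            List<String> rs = right.getOrDefault(key, List.of());
            if (ls.isEmpty())      rs.forEach(r -> out.add(key + ":null/" + r));
            else if (rs.isEmpty()) ls.forEach(l -> out.add(key + ":" + l + "/null"));
            else
                for (String l : ls)
                    for (String r : rs) out.add(key + ":" + l + "/" + r);
        }
        return out;
    }
}
```

Note how the pairwise loop inside `coGroupOuter` needs both groups materialized, matching Fabian's point that for CoGroup there is no way around collecting one side.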

Re: JobManager is no longer reachable

2015-06-29 Thread Flavio Pompermaier
I think that actually there's an exception thrown within the code that I suspect is not reported anywhere... could it be? On Mon, Jun 29, 2015 at 3:28 PM, Flavio Pompermaier wrote: > Which file and which JVM options do I have to modify to try options 1 and > 3..? > >1. Don't fill the JVMs up

Re: Logs meaning states

2015-06-29 Thread Stephan Ewen
Ah, thank you! - If you create a data set from a Java/Scala collection, this data source has the parallelism one. - The map function is chained to that source, so it runs with parallelism one as well. - To run it with a higher parallelism, use "setParallelism(...)" on the mapFunction, or call "

Re: Logs meaning states

2015-06-29 Thread Juan Fumero
Yeah, sorry. I would like to do something simple like this, but using Java threads. DataSet input = env.fromCollection(in); DataSet output = input.map(new HighWorkLoad()); ArrayList result = output.consume(); // ? like collect but in parallel, some operation that consumes the pipeline. return r
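Outside of Flink, the pattern Juan sketches (apply a heavy function to a collection and consume all results in parallel) can be written with plain JDK parallel streams. This is a hedged stand-in, not Flink's API, and `highWorkLoad` here is a hypothetical function playing the role of his `HighWorkLoad` mapper.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ParallelConsume {
    // Hypothetical "heavy" function standing in for the HighWorkLoad mapper.
    static int highWorkLoad(int x) {
        return x * x;
    }

    // Applies the function across the common fork-join pool and collects the
    // results in input order -- a plain-JDK analogue of map(...) + collect().
    static List<Integer> mapInParallel(List<Integer> in) {
        return in.parallelStream()
                 .map(ParallelConsume::highWorkLoad)
                 .collect(Collectors.toList());
    }
}
```

Within Flink itself, Stephan's advice later in the thread still applies: keep the work on the `DataSet` and avoid pulling everything to the client.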

Re: Logs meaning states

2015-06-29 Thread Stephan Ewen
It is not quite easy to understand what you are trying to do. Can you post your program here? Then we can take a look and give you a good answer... On Mon, Jun 29, 2015 at 3:47 PM, Juan Fumero < juan.jose.fumero.alfo...@oracle.com> wrote: > Is there any other way to apply the function in paralle

Re: Logs meaning states

2015-06-29 Thread Juan Fumero
Is there any other way to apply the function in parallel and return the result to the client in parallel? Thanks Juan On Mon, 2015-06-29 at 15:01 +0200, Stephan Ewen wrote: > In general, avoid collect if you can. Collect brings data to the > client, where the computation is not parallel any mor

Re: JobManager is no longer reachable

2015-06-29 Thread Flavio Pompermaier
Which file and which JVM options do I have to modify to try options 1 and 3..?
1. Don't fill the JVMs up to the limit with objects. Give more memory to the JVM, or give less memory to Flink managed memory
2. Use more JVMs, i.e., a higher parallelism
3. Use a concurrent garbage collecto
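For options 1 and 3 the relevant file is `conf/flink-conf.yaml`. A sketch of the kind of settings involved, with keys from the Flink 0.9-era configuration reference (the values are illustrative assumptions, not recommendations; verify the keys against your version's documentation):

```yaml
# conf/flink-conf.yaml -- illustrative values only
taskmanager.heap.mb: 4096              # option 1: give more memory to the TaskManager JVM
taskmanager.memory.fraction: 0.5       # ...or a smaller share to Flink's managed memory
env.java.opts: -XX:+UseConcMarkSweepGC # option 3: a concurrent garbage collector
```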

Re: Logs meaning states

2015-06-29 Thread Stephan Ewen
In general, avoid collect if you can. Collect brings data to the client, where the computation is not parallel any more. Try to do as much on the DataSet as possible. On Mon, Jun 29, 2015 at 2:58 PM, Juan Fumero < juan.jose.fumero.alfo...@oracle.com> wrote: > Hi Stephan, > so should I use ano

Re: Logs meaning states

2015-06-29 Thread Juan Fumero
Hi Stephan, so should I use another method instead of collect? It seems multithreading is not working with this. Juan On Mon, 2015-06-29 at 14:51 +0200, Stephan Ewen wrote: > Hi Juan! > > > This is an artifact of a workaround right now. The actual collect() > logic happens in the flatMap() a

Re: Logs meaning states

2015-06-29 Thread Stephan Ewen
Hi Juan! This is an artifact of a workaround right now. The actual collect() logic happens in the flatMap() and the sink is a dummy that executes nothing. The flatMap writes the data to be collected to the "accumulator" that delivers it back. Greetings, Stephan On Mon, Jun 29, 2015 at 2:30 PM,

Logs meaning states

2015-06-29 Thread Juan Fumero
Hi, I am starting with Flink. I have tried to look at the documentation but I haven't found it clear. I wonder about the difference between these two states: FlatMap RUNNING vs DataSink RUNNING. Is FlatMap doing any data transformation? Compilation? At which point is it actually executing the f

Re: JobManager is no longer reachable

2015-06-29 Thread Stephan Ewen
Hi Flavio! I had a look at the logs. There seems nothing suspicious - at some point, the TaskManager and JobManager declare each other unreachable. A pretty common cause for that is that the JVMs stall for a long time due to garbage collection. The JobManager cannot see the difference between a J

Re: The slot in which the task was scheduled has been killed (probably loss of TaskManager)

2015-06-29 Thread Alexander Alexandrov
I witnessed a similar issue yesterday on a simple job (single task chain, no shuffles) with a release-0.9 based fork. 2015-04-15 14:59 GMT+02:00 Flavio Pompermaier : > Yes , sorry for that..I found it somewhere in the logs..the problem was > that the program didn't die immediately but was somehow

Re: FileSystem exists

2015-06-29 Thread Flavio Pompermaier
Ok... I don't like to repeat myDir twice, but for the moment I can live with it ;) Thanks, Flavio On Mon, Jun 29, 2015 at 12:41 PM, Stephan Ewen wrote: > You can also do "myDir.getFileSystem().exists(myDir)", but I don't think > there is a shorter way... > > On Mon, Jun 29, 2015 at 12:39 PM, Flavi

Re: FileSystem exists

2015-06-29 Thread Stephan Ewen
You can also do "myDir.getFileSystem().exists(myDir)", but I don't think there is a shorter way... On Mon, Jun 29, 2015 at 12:39 PM, Flavio Pompermaier wrote: > Hi to all, > > in my job I have to check if a directory exists and currently I have to > write: > > Path myDir = new Path(...); > boole

FileSystem exists

2015-06-29 Thread Flavio Pompermaier
Hi to all, in my job I have to check if a directory exists and currently I have to write: Path myDir = new Path(...); boolean exists = FileSystem.get(myDir.toUri()).exists(myDir); Is there a better way to achieve this? Best, Flavio
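For comparison with the Flink `FileSystem` idiom discussed in this thread, the plain-JDK equivalent for a local path is a one-liner with `java.nio.file` (this is standard JDK, not Flink's `FileSystem`, so it only covers local files, not HDFS paths):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DirExists {
    // True when the path exists AND is a directory -- the local-filesystem
    // analogue of the FileSystem#exists check Flavio asks about.
    static boolean dirExists(String dir) {
        Path p = Paths.get(dir);
        return Files.isDirectory(p);
    }
}
```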

Re: JobManager is no longer reachable

2015-06-29 Thread Stephan Ewen
Hi Flavio! Can you post the JobManager's log here? It should have the message about what is going wrong... Stephan On Mon, Jun 29, 2015 at 11:43 AM, Flavio Pompermaier wrote: > Hi to all, > > I'm restarting the discussion about a problem I already discussed on this > mailing list (but that star

Python vs Scala - Performance

2015-06-29 Thread Maximilian Alber
Hi Flinksters, we recently had a discussion in our working group about which language we should use with Flink. To bring it to the point: most people would like to use Python because they are familiar with it and there is a nice scientific stack to, e.g., print and analyse the results. But our concern is t

JobManager is no longer reachable

2015-06-29 Thread Flavio Pompermaier
Hi to all, I'm restarting the discussion about a problem I already discussed on this mailing list (but that started with a different subject). I'm running Flink 0.9.0 on CDH 5.1.3, so I compiled the sources as: mvn clean install -Dhadoop.version=2.3.0-cdh5.1.3 -Dhbase.version=0.98.1-cdh5.1.3 -Dhado

Re: ArrayIndexOutOfBoundsException when running job from JAR

2015-06-29 Thread Robert Metzger
@Stephan: I don't think there is a way to deal with this. In my understanding, the (main) purpose of the user@ list is not to report Flink bugs. It is a forum for users to help each other. Flink committers happen to know a lot about the system, so it's easy for them to help users. Also, it's a good w

Re: Best way to write data to HDFS by Flink

2015-06-29 Thread Stephan Ewen
Hi Hawin! The performance tuning of Kafka is much trickier than that of Flink. Your performance bottleneck may be Kafka at this point, not Flink. To make Kafka fast, make sure you have the right setup for the data directories, and you set up zookeeper properly (for good throughput). To test the K

Re: ArrayIndexOutOfBoundsException when running job from JAR

2015-06-29 Thread Stephan Ewen
Static fields are not part of the serialized program (by Java's definition). Whether the static field has the same value in the cluster JVMs depends on how the static field is initialized, and whether it is initialized the same way in the shipped code, without the program's main method. BTW: We are
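Stephan's point that Java serialization never captures static fields can be verified in a few lines of plain JDK code (the class names here are made up for the demo): serialize an object, mutate the static field afterwards, deserialize, and observe that the restored JVM state still sees the post-serialization value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class StaticNotSerialized {
    static class Conf implements Serializable {
        static int staticValue = 1;   // NOT written by Java serialization
        int instanceValue = 1;        // written as part of the object
    }

    // Serialize, then change the static field, then deserialize: the restored
    // object does not bring the old static value back, proving it was never
    // part of the serialized bytes.
    static int roundTripThenReadStatic() throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bos);
        out.writeObject(new Conf());
        out.flush();
        Conf.staticValue = 42;        // mutate AFTER serialization
        ObjectInputStream in =
            new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        Conf restored = (Conf) in.readObject();
        // instanceValue survived the round trip; staticValue was untouched by it
        return restored.instanceValue + Conf.staticValue;
    }
}
```

In a cluster this is exactly why a mapper running in a TaskManager JVM sees whatever that JVM initialized the static field to, not the value set in the client's main method.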

Re: ArrayIndexOutOfBoundsException when running job from JAR

2015-06-29 Thread Vasiliki Kalavri
Thank you for the answer Robert! I realize it's a single JVM running, yet I would expect programs to behave in the same way, i.e. serialization to happen (even if not necessary), in order to catch this kind of bugs before cluster deployment. Is this simply not possible or is it a design choice we

RE: Best way to write data to HDFS by Flink

2015-06-29 Thread Hawin Jiang
Dear Marton, Thanks for asking. Yes, it is working now. But the TPS is not very good. I have met four issues, as below: 1. My TPS is around 2000 events per second. But I saw some companies achieved 132K per second on a single node at the 2015 Los Angeles Big Data Day yesterday.

Re: ArrayIndexOutOfBoundsException when running job from JAR

2015-06-29 Thread Robert Metzger
It is working in the IDE because there we execute everything in the same JVM, so the mapper can access the correct value of the static variable. When submitting a job with the CLI frontend, there are at least two JVMs involved, and code running in the JM/TM can not access the value from the static