DataStream transformation isolation in Flink Streaming

2016-02-23 Thread Juan Rodríguez Hortalá
Hi, I was thinking on a problem and how to solve it with Flink Streaming. Imagine you have a stream of data where you want to apply several transformations, where some transformations depend on previous transformations and there is a final set of actions. This is modeled in a natural way as a DAG

How to use all available task managers

2016-02-23 Thread Saiph Kappa
Hi, I am running a flink stream application on a cluster with 6 slaves/task managers. I have set in flink-conf.yaml of every machine "parallelization.degree.default: 6". However, when I run my application it just uses one task slot and not all of them. Am I missing something? Thanks.

Re:

2016-02-23 Thread Ufuk Celebi
Thanks! :-) I hope we can fix it for the release. On Tue, Feb 23, 2016 at 4:45 PM, Zach Cox wrote: > Hi Ufuk - here is the jira issue with the requested information: > https://issues.apache.org/jira/browse/FLINK-3483 > > -Zach > > > On Tue, Feb 23, 2016 at 8:59 AM Ufuk Celebi wrote: >> >> Hey Z

Re:

2016-02-23 Thread Zach Cox
Hi Ufuk - here is the jira issue with the requested information: https://issues.apache.org/jira/browse/FLINK-3483 -Zach On Tue, Feb 23, 2016 at 8:59 AM Ufuk Celebi wrote: > Hey Zach! I'm not aware of an open issue for this. > > You can go ahead and open an issue for it. It will be very helpful

Re: Dataset filter improvement

2016-02-23 Thread Till Rohrmann
Registering a data type is only relevant for the Kryo serializer or if you want to serialize a subclass of a POJO. Registering has the advantage that you assign an id to the class which is written instead of the full class name. The latter is usually much longer than the id. Cheers, Till On Tue,

Re:

2016-02-23 Thread Ufuk Celebi
Hey Zach! I'm not aware of an open issue for this. You can go ahead and open an issue for it. It will be very helpful to include the following: - exact Chrome and OS X version - the exectuion plan as JSON (via env.getExecutionPlan()) - screenshot Thanks! – Ufuk On Tue, Feb 23, 2016 at 3:46 PM,

[no subject]

2016-02-23 Thread Zach Cox
Hi - I typically use the Chrome browser on OS X, and notice that with 1.0.0-rc0 the job graph visualization displays the nodes in the graph, but not any of the edges. Also the graph does not move around when dragging the mouse. The job graph visualization seems to work perfectly in Safari and Fire

Re: Best way to process data in many files? (FLINK-BATCH)

2016-02-23 Thread Tim Conrad
Hi Till (and others). Thank you very much for your helpful answer. On 23.02.2016 14:20, Till Rohrmann wrote: [...] In contrast, if you had a parallel data source which would consist of multiple source task, then these tasks would be independent and spread out across your cluster [...] Can yo

Re: Dataset filter improvement

2016-02-23 Thread Maximilian Michels
Hi Flavio, I think the point is that Flink can use its serialization tools if you register the class in advance. If you don't do that, it will use Kryo as a fall-back which is slightly less efficient. Equals and hash code have to be implemented correctly if you compare Pojos. For standard types l

Re: Best way to process data in many files? (FLINK-BATCH)

2016-02-23 Thread Till Rohrmann
Hi Tim, depending on how you create the DataSource fileList, Flink will schedule the downstream operators differently. If you used the ExecutionEnvironment.fromCollection method, then it will create a DataSource with a CollectionInputFormat. This kind of DataSource will only be executed with a deg

Re: Optimal Configuration for Cluster

2016-02-23 Thread Ufuk Celebi
I would go with one task manager with 48 slots per machine. This reduces the communication overheads between task managers. Regarding memory configuration: Given that the machines have plenty of memory, I would configure a bigger heap than the 4 GB you had previously. Furhermore, you can also cons

Best way to process data in many files? (FLINK-BATCH)

2016-02-23 Thread Tim Conrad
Dear FLINK community. I was wondering what would be the recommended (best?) way to achieve some kind of file conversion. That runs in parallel on all available Flink Nodes, since it it "embarrassingly parallel" (no dependency between files). Say, I have a HDFS folder that contains multiple

Re: streaming hdfs sub folders

2016-02-23 Thread Martin Neumann
I'm not very familiar with the inner workings of the InputFomat's. calling .open() got rid of the Nullpointer but the stream still produces no output. As a temporary solution I wrote a batch job that just unions all the different datasets and puts them (sorted) into a single folder. cheers Martin

Re: Optimal Configuration for Cluster

2016-02-23 Thread Welly Tambunan
Hi Ufuk and Fabian, Is that better to start 48 task manager ( one slot each ) in one machine than having single task manager with 48 slot ? Any trade-off that we should know etc ? Cheers On Tue, Feb 23, 2016 at 3:03 PM, Welly Tambunan wrote: > Hi Ufuk, > > Thanks for the explanation. > > Yes.

Re: Problem with Kafka 0.9 Client

2016-02-23 Thread Robert Metzger
Great. That's good news. Let us know if you encounter more issues with the Kafka connector. By the way, Kafka released 0.9.0.1, maybe updating your brokers to that version resolves the issues? (Maybe the problems of some of the topics were caused by bugs in Kafka) On Tue, Feb 23, 2016 at 10:23 AM

Re: Use jvm to run flink on single-node machine with many cores

2016-02-23 Thread Ufuk Celebi
On Tue, Feb 23, 2016 at 10:17 AM, Ana M. Martinez wrote: > I believe that setting taskmanager.numberOfTaskSlots is not necessary, but > setParallelism is, as by default 1 was taken. Yes, the number of slots in local execution defaults to the maximum parallelism of the job.

Re: Problem with Kafka 0.9 Client

2016-02-23 Thread Lopez, Javier
Hi Robert, After we restarted our Kafka / Zookeeper cluster the consumer worked. Some of our topics had some problems. The flink's consumer for Kafka 0.9 works as expected. Thanks! On 19 February 2016 at 12:03, Lopez, Javier wrote: > Hi, these are the properties: > > Properties properties = ne

Re: Use jvm to run flink on single-node machine with many cores

2016-02-23 Thread Ana M. Martinez
Hi all, Thank you very much for your help. It worked perfectly like this: Configuration conf = new Configuration(); conf.setInteger("taskmanager.network.numberOfBuffers", 16000); conf.setInteger("taskmanager.numberOfTaskSlots”,32); final ExecutionEnvironment env = ExecutionEnvironment.createLoc

Re: Optimal Configuration for Cluster

2016-02-23 Thread Welly Tambunan
Hi Ufuk, Thanks for the explanation. Yes. Our jobs is all streaming job. Cheers On Tue, Feb 23, 2016 at 2:48 PM, Ufuk Celebi wrote: > The new default is equivalent to the previous "streaming mode". The > community decided to get rid of this distinction, because it was > confusing to users. >