RE: Load balancing

2015-06-15 Thread Kruse, Sebastian
Hi Gianmarco, The processing time is quadratic in the size of the single elements. I was already applying the strategy that you proposed, but tried to find out whether there is a way of balancing the sub-items of these large items across the workers without shuffling the whole dataset. However, I

Re: Flink 0.9 built with Scala 2.11

2015-06-15 Thread Till Rohrmann
+1 for giving a version suffix only to those modules which depend on Scala. On Sun, Jun 14, 2015 at 8:03 PM Robert Metzger wrote: > There was already a discussion regarding the two options here [1], back > then we had a majority for giving all modules a scala suffix. > > I'm against giving all modu

Random Shuffling

2015-06-15 Thread Maximilian Alber
Hi Flinksters, I would like to shuffle the elements in my data set and then split it in two according to some ratio. Each element in the data set has a unique id. Is there a nice way to do this with the Flink API? (It would be nice to have guaranteed random shuffling.) Thanks! Cheers, Max

Random Selection

2015-06-15 Thread Maximilian Alber
Hi Flinksters, I would like to randomly choose an element of my data set. But somehow I cannot use scala.util inside my filter functions: val sample_x = X filter(new RichFilterFunction[Vector](){ var i: Int = -1 override def open(config: Configuration) = { i = scal

Re: Random Selection

2015-06-15 Thread Till Rohrmann
Hi Max, the problem is that you’re trying to serialize the companion object of scala.util.Random. Try to create an instance of the scala.util.Random class and use this instance within your RichFilterFunction to generate the random numbers. Cheers, Till On Mon, Jun 15, 2015 at 1:56 PM Maximilian
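The fix Till describes can be sketched in plain Java (no Flink dependency; the class below is a stand-in for a RichFilterFunction, and the field names are illustrative): the random generator is an instance field created in an open()-style method, so nothing non-serializable is captured from the enclosing scope.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Stand-in for a rich filter function: the Random lives on the
// instance and is created in open(), instead of referencing a
// shared singleton/companion object that cannot be serialized.
public class RandomFilter implements Serializable {
    private static final double FRACTION = 0.5; // keep ~50% of elements
    private transient Random random;            // not serialized; rebuilt in open()

    public void open(long seed) {
        random = new Random(seed);              // fixed seed for reproducibility
    }

    public boolean filter(int value) {
        return random.nextDouble() < FRACTION;
    }

    public static void main(String[] args) {
        RandomFilter f = new RandomFilter();
        f.open(42L);
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            if (f.filter(i)) kept.add(i);
        }
        // Roughly half of the 1000 elements survive the filter.
        System.out.println(kept.size());
    }
}
```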

Re: Random Shuffling

2015-06-15 Thread Matthias J. Sax
I think you need to implement your own Partitioner and hand it over via DataSet.partitionCustom(partitioner, field). (Just specify any field you like; as you don't want to group by key, it doesn't matter.) When implementing the partitioner, you can ignore the key parameter and compute the output c

Re: Random Shuffling

2015-06-15 Thread Till Rohrmann
Hi Max, you can always shuffle your elements using the rebalance method. What Flink does here is distribute the elements of each partition among all available TaskManagers. This happens in a round-robin fashion and is thus not completely random. A different means is the partitionCustom method w
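The round-robin behavior Till describes can be simulated in a few lines (a conceptual sketch, not Flink's actual channel selector): element i goes to channel i % numChannels, so partition sizes differ by at most one, which is balanced but deterministic rather than random.

```java
import java.util.Arrays;

// Conceptual model of rebalance(): round-robin assignment yields
// near-equal partition sizes (max difference of one element), but
// the assignment is fully deterministic, not random.
public class RoundRobin {
    public static int[] distribute(int numElements, int numChannels) {
        int[] counts = new int[numChannels];
        for (int i = 0; i < numElements; i++) {
            counts[i % numChannels]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 10 elements over 4 channels: sizes differ by at most one.
        System.out.println(Arrays.toString(distribute(10, 4)));
    }
}
```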

Monitoring memory usage of a Flink Job

2015-06-15 Thread Tamara Mendt
Hi, I am running some experiments on Flink and was wondering if there is some way to monitor the memory usage of a Flink Job (running locally and on a cluster). I need to run multiple jobs and compare their memory usage. Cheers, Tamara

Re: Monitoring memory usage of a Flink Job

2015-06-15 Thread Fabian Hueske
Hi Tamara, what kind of information do you need? Something like size and usage of in-memory sort buffers or hash tables? Some information might be written to the DEBUG logs, but I'm not sure about that. Beyond the logs, I doubt that Flink monitors memory usage. Cheers, Fabian 2015-06-15 14:34 GMT+02:00 T

Re: Monitoring memory usage of a Flink Job

2015-06-15 Thread Till Rohrmann
Hi Tamara, you can instruct Flink to write the current memory statistics to the log by setting taskmanager.debug.memory.startLogThread: true in the Flink configuration. Furthermore, you can control the logging interval with taskmanager.debug.memory.logIntervalMs where the interval is specified in
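The two settings Till mentions would go into the Flink configuration file like this (a sketch for the Flink 0.9-era configuration; the interval value is an example, not a recommendation):

```yaml
# Have each TaskManager periodically log its memory statistics
taskmanager.debug.memory.startLogThread: true
# Logging interval in milliseconds (example value)
taskmanager.debug.memory.logIntervalMs: 5000
```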

RE: Random Selection

2015-06-15 Thread Kruse, Sebastian
Hi everyone, I did not reenact it, but I think the problem here is rather the anonymous class. It looks like it is created within a class, not an object. Thus it is not “static” in Java terms, which means that its surrounding class (the job class) will also be serialized. And in this job class,
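Sebastian's point can be demonstrated with plain Java serialization (class and method names below are illustrative): an anonymous class defined in an instance context holds a hidden reference to the enclosing instance, so serializing it drags the enclosing "job" class along and fails if that class is not serializable. A static nested class avoids the hidden reference.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo {
    static class Job {                        // deliberately NOT Serializable
        final int threshold = 10;

        Serializable makeAnon() {
            // Anonymous class: reads Job.this.threshold, so it keeps a
            // hidden reference to the whole (non-serializable) Job.
            return new Serializable() {
                int get() { return threshold; }
            };
        }
    }

    // The fix: a static nested class has no hidden outer reference.
    static class StaticFilter implements Serializable {
        final int threshold;
        StaticFilter(int threshold) { this.threshold = threshold; }
    }

    static boolean serializes(Serializable obj) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return true;
        } catch (NotSerializableException e) {
            return false;                      // enclosing Job is not serializable
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(new Job().makeAnon()));  // false
        System.out.println(serializes(new StaticFilter(10)));  // true
    }
}
```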

Re: Monitoring memory usage of a Flink Job

2015-06-15 Thread Tamara Mendt
Ok great, I will try this out and get back to you. Thanks =) On Mon, Jun 15, 2015 at 2:52 PM, Till Rohrmann wrote: > Hi Tamara, > > you can instruct Flink to write the current memory statistics to the log > by setting taskmanager.debug.memory.startLogThread: true in the Flink > configuration. Fu

Re: Random Selection

2015-06-15 Thread Maximilian Alber
Hi everyone! Thanks! It seems it was the variable that caused the problem. Making it an inner class solved the issue. Cheers, Max On Mon, Jun 15, 2015 at 2:58 PM, Kruse, Sebastian wrote: > Hi everyone, > > > > I did not reenact it, but I think the problem here is rather the anonymous > class. It looks li

Re: Random Shuffling

2015-06-15 Thread Maximilian Alber
Thanks! Ok, so for a random shuffle I need partitionCustom. But in that case the data might be out of balance? As for the splitting: is there no way to get exact sizes? Cheers, Max On Mon, Jun 15, 2015 at 2:26 PM, Till Rohrmann wrote: > Hi Max, > > you can always shuffle your elements usin

Re: Random Shuffling

2015-06-15 Thread Matthias J. Sax
Hi, using partitionCustom, the data distribution depends only on your probability distribution. If it is uniform, you should be fine (i.e., choosing the channel like > private final Random random = new Random(System.currentTimeMillis()); > int partition(K key, int numPartitions) { > return random
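The quoted snippet is cut off; a plausible completion (assuming Flink's Partitioner<K> interface, here written as a plain class without the Flink dependency) ignores the key and draws a channel uniformly at random, which balances the data in expectation rather than exactly:

```java
import java.util.Random;

// Completion of the sketch from the thread: the key is ignored and
// a channel is drawn uniformly from [0, numPartitions), so the
// partitions are balanced in expectation (not exactly equal).
public class RandomPartitioner<K> {
    private final Random random = new Random(System.currentTimeMillis());

    public int partition(K key, int numPartitions) {
        return random.nextInt(numPartitions);
    }

    public static void main(String[] args) {
        RandomPartitioner<Long> p = new RandomPartitioner<>();
        int[] counts = new int[4];
        for (int i = 0; i < 100_000; i++) {
            counts[p.partition((long) i, 4)]++;
        }
        for (int c : counts) {
            System.out.println(c); // each close to 25,000 in expectation
        }
    }
}
```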

"No space left on device" IOException when using Cross operator

2015-06-15 Thread Mihail Vieru
Hi, I get the following *"No space left on device" IOException* when using the following Cross operator. The inputs for the operator are each just *10MB* in size (same input for IN1 and IN2; 1000 tuples) and I get the exception after Flink manages to fill *50GB* of SSD space and the partition

Re: "No space left on device" IOException when using Cross operator

2015-06-15 Thread Stephan Ewen
Cross is a quadratic operation. As such, it produces very large results on moderate inputs, which can easily exceed memory and disk space if the subsequent operation requires gathering all data (such as the sort in your case). If you use on both inputs 10 MB of 100 byte elements (100K element

Re: Help with Flink experimental Table API

2015-06-15 Thread Shiti Saxena
Hi, Can I work on the issue with TupleSerializer or is someone working on it? On Mon, Jun 15, 2015 at 11:20 AM, Aljoscha Krettek wrote: > Hi, > the reason why this doesn't work is that the TupleSerializer cannot deal > with null values: > > @Test > def testAggregationWithNull(): Unit = { > > v

Re: Help with Flink experimental Table API

2015-06-15 Thread Aljoscha Krettek
I think you can work on it. By the way, there are actually two serializers. For Scala, CaseClassSerializer is responsible for tuples as well. In Java, TupleSerializer is responsible for, well, Tuples. On Tue, 16 Jun 2015 at 06:25 Shiti Saxena wrote: > Hi, > > Can I work on the issue with TupleSe