Re: Sequential/ordered map

2017-01-05 Thread Fabian Hueske
Please avoid collecting the data to the client using collect(). This operation looks convenient but is only meant for very small data sets; even if it worked for large data sets, it would be much slower and less robust. Instead, set the parallelism of the operator to 1. Fabian 2017-01-05 13:18 GMT+
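
For illustration, an operator running with parallelism 1 amounts to a single task that sees the records one after another and can keep local state. The following is a plain-Java sketch of that idea, not Flink code; the class name SequentialFirstSeen, the method process, and the whitespace tokenization are assumptions made for this example, and it further assumes the texts arrive in their intended order.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink: mimics what a single sequential
// task (parallelism 1) could do with locally kept state.
public class SequentialFirstSeen {

    // Walks the texts in order and records, for each word, the index of the
    // first text it was seen in. Splitting on whitespace is an assumption.
    static Map<String, Integer> process(List<String> texts) {
        Map<String, Integer> firstSeen = new LinkedHashMap<>();
        for (int i = 0; i < texts.size(); i++) {
            for (String word : texts.get(i).trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Only the first occurrence is stored; later ones are ignored.
                    firstSeen.putIfAbsent(word, i);
                }
            }
        }
        return firstSeen;
    }
}
```

Because a single task does all the work, this variant gains nothing from the cluster; it only avoids shipping the data to the client.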

Re: Sequential/ordered map

2017-01-05 Thread Sebastian Neef
Hi Chesnay, thanks for the input. Finding a word's first occurrence is part of the algorithm. To be exact, I'm trying to implement Adler's text authorship tracking in Flink (http://www2007.org/papers/paper692.pdf, page 266). Thanks, Sebastian

Re: Sequential/ordered map

2017-01-05 Thread Chesnay Schepler
So given an ordered list of texts, for each word find the earliest text it appears in? As Kostas said, when splitting the text into words, wrap each word in a Tuple2 containing the word and the text index, and group them by the word. As far as I can tell the next step would be a simple reduce that finds
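
A minimal sketch of the approach described above (pair each word with its text index, group by word, reduce to the smallest index), written in plain Java with streams rather than the Flink API. The class name FirstOccurrence, the methods toPairs and earliestText, and the whitespace tokenization are assumptions for illustration; Map.Entry stands in for Tuple2.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical example class; Map.Entry<String, Integer> plays the role of
// a Tuple2 of (word, text index).
public class FirstOccurrence {

    // Step 1: split every text into words and attach the index of the text.
    static List<Map.Entry<String, Integer>> toPairs(List<String> texts) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            for (String word : texts.get(i).trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, i));
                }
            }
        }
        return pairs;
    }

    // Steps 2 and 3: group by word and keep the minimum text index per word.
    // Taking the minimum is associative and commutative, so the order in
    // which pairs are combined does not matter.
    static Map<String, Integer> earliestText(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream()
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Integer::min));
    }
}
```

Since the reduce only needs the minimum index per word, the result does not depend on the order in which the texts are processed, which is what would make this formulation parallelizable.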

Re: Sequential/ordered map

2017-01-05 Thread Sebastian Neef
Hi Kostas, thanks for the quick reply. > If T_1 must be processed before T_i, i>1, then you cannot parallelize the > algorithm. What would be the best way to process it anyway? DataSet.collect() -> loop over List -> env.fromCollection(...) ? Or with a parallelism of 1 and a .map(...) ? Howeve

Re: Sequential/ordered map

2017-01-05 Thread Kostas Kloudas
Hi Sebastian, If T_1 must be processed before T_i, i>1, then you cannot parallelize the algorithm. If this is not a restriction, then you could: 1) split the text into words and also attach the id of the text they appear in, 2) do a groupBy that will send all the same words to the same node, 3) k

Sequential/ordered map

2017-01-05 Thread Sebastian Neef
Hello, I'd like to implement an algorithm which doesn't really look parallelizable to me, but maybe there's a way around it: In general the algorithm looks like this: 1. Take a list of texts T_1 ... T_n 2. For every text T_i (i > 1) do 2.1: Split text into a list of words W_1 ... W_m 2.2: For ev