Hi Tim,

On 31 Jan 2013, at 10:45, Tim van Heugten <[email protected]> wrote:
> Hi Gabriel,
>
> For the most part it is similar to what was sent around recently on this
> mailing list, see:
> From Dave Beech <[email protected]>
> Subject Question about mapreduce job planner
> Date Tue, 15 Jan 2013 11:41:42 GMT
>
> So, the common path before multiple outputs branch is executed twice.
> Sometimes the issues seem related to unions though, i.e. multiple inputs. We
> seem to have been troubled by a grouped table parallelDo on a table-union-gbk
> that got its data twice (all grouped doubled in size). Inserting a
> materialize between the union and groupByKey solved the issue.
>
> These issues seem very fragile (so they're fixed easily by changing something
> that's irrelevant to the output), so usually we just add or remove a
> materialization to make it run again.
> I'll see if I can cleanly reproduce the data duplication issue later this
> week.

Ok, that would be great if you could replicate it in a small test, thanks!

- Gabriel
