Hi Tim,

On 31 Jan 2013, at 10:45, Tim van Heugten <[email protected]> wrote:

> Hi Gabriel,
> 
> For the most part it is similar to what was sent around recently on this
> mailing list, see:
> From: Dave Beech <[email protected]>
> Subject: Question about mapreduce job planner
> Date: Tue, 15 Jan 2013 11:41:42 GMT
> 
> So, the common path before the multiple outputs branch is executed twice.
> Sometimes the issues seem related to unions though, i.e. multiple inputs. We
> seem to have been troubled by a grouped table parallelDo on a table-union-gbk
> that got its data twice (all groups doubled in size). Inserting a
> materialize between the union and groupByKey solved the issue.
> 
> These issues seem very fragile (so they're fixed easily by changing something 
> that's irrelevant to the output), so usually we just add or remove a 
> materialization to make it run again.
> I'll see if I can cleanly reproduce the data duplication issue later this 
> week.

OK, it would be great if you could replicate it in a small test, thanks!

- Gabriel
