Hi Gabriel,

For the most part it is similar to what was sent around recently on this mailing list, see:

From: Dave Beech <[email protected]>
Subject: Question about mapreduce job planner
Date: Tue, 15 Jan 2013 11:41:42 GMT

So, the common path before the branch into multiple outputs is executed twice. Sometimes the issues seem related to unions though, i.e. multiple inputs. We seem to have been troubled by a parallelDo on a grouped table, built from a table union followed by a groupByKey, that got its data twice (every group doubled in size). Inserting a materialize between the union and the groupByKey solved the issue.

These issues seem very fragile (they are easily fixed by changing something that is irrelevant to the output), so usually we just add or remove a materialization to make it run again.

I'll see if I can cleanly reproduce the data duplication issue later this week.

Cheers,

Tim

On Wed, Jan 30, 2013 at 8:51 PM, Gabriel Reid <[email protected]> wrote:

> Hi Tim,
>
> On Wed, Jan 30, 2013 at 10:33 AM, Tim van Heugten <[email protected]> wrote:
>
>> Since april I'm using Crunch for a project. We're not doing only linear
>> executions of the pipeline, so we're sometimes having issues with how
>> Crunch is optimizing our execution graph. We need to add materializations
>> here and there as hints to what parts of the graph can be shared for
>> outputs and so on.
>
> About the extra calls to materialize to force changes to the execution
> plan: I remember seeing this previously. We've discussed adding something
> specifically for this functionality to the API, although it hasn't yet
> happened.
>
> Could you give an example of a situation where these extra materialize
> calls get added? That would be useful for validating the addition to the
> API.
>
> Thanks,
>
> Gabriel
