Hi,

It turns out the data in the two branches that are unioned in union2 is not mutually exclusive (counter to what I was expecting). So we should probably expect data duplication.
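To illustrate the point (a toy sketch in plain Java, not Crunch; the input values and predicates are made up): when the filters feeding the two branches are not mutually exclusive, the records that pass both filters survive in both branches, and the union keeps them twice, because this kind of union is a bag concatenation rather than a set union.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class UnionOverlap {

    // Union two filtered "branches" of the same input. If the predicates
    // overlap, the overlapping records appear twice in the result, because
    // this union is a bag concatenation, not a set union.
    static List<Integer> unionBranches(List<Integer> input) {
        List<Integer> branchA = input.stream()
                .filter(v -> v <= 4)           // branch A predicate
                .collect(Collectors.toList());
        List<Integer> branchB = input.stream()
                .filter(v -> v >= 3)           // branch B predicate: overlaps A on {3, 4}
                .collect(Collectors.toList());
        List<Integer> union = new ArrayList<>(branchA);
        union.addAll(branchB);
        return union;
    }

    public static void main(String[] args) {
        // 6 input records yield 8 output records: 3 and 4 pass both filters.
        System.out.println(unionBranches(List.of(1, 2, 3, 4, 5, 6)).size());
    }
}
```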
However, this still does not explain why we sometimes see data duplication and sometimes don't. Will keep you posted,

Tim

On Tue, Feb 5, 2013 at 11:32 AM, Tim van Heugten <[email protected]> wrote:
> Hi Gabriel,
>
> I've been unsuccessful so far in reproducing the issue in a controlled
> environment. As said, it's fragile; maybe the types involved play a role,
> because when I tried to simplify those I broke the failure condition.
> So I decided it's time to provide more information without giving an
> explicit example.
>
> The pipeline we build is illustrated here: http://yuml.me/8ef99512.
> Depending on where we materialize, the data occurs twice in UP.
> The EITPI job filters the exact opposite of the filter branch. In PWR only
> data from EITPI is passed through, while the PITP data is used to modify it.
> Below you find the job names as executed when data duplication occurs;
> materializations occur before BTO(*) and after UP.
>
> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch655004156/p4)"
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch655004156/p2)"
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch655004156/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch655004156/p8)"
> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch655004156/p8)]]+GBK+UP+Avro(/tmp/crunch655004156/p6)"
> "Avro(/tmp/crunch655004156/p6)+GetData+Avro(/tmp/crunch655004156/p10)"
> "Avro(/tmp/crunch655004156/p6)+GetTraces+Avro(target/trace-dump/traces)"
>
> Here are the jobs performed when a materialization is added between BTO
> and GBK:
>
> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch-551174870/p4)"
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+UnionCollectionWrapper+Avro(/tmp/crunch-551174870/p2)"
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch-551174870/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch-551174870/p6)"
> "Avro(/tmp/crunch-551174870/p6)+GBK+UP+Avro(/tmp/crunch-551174870/p8)"
> "Avro(/tmp/crunch-551174870/p8)+GetData+Avro(/tmp/crunch-551174870/p10)"
> "Avro(/tmp/crunch-551174870/p8)+GetTraces+Avro(target/trace-dump/traces)"
>
> Without changing anything else, the added materialization fixes the issue
> of data duplication.
>
> If you have any clues how I can extract a clean working example, I'm
> happy to hear them.
>
> *) This materialization probably explains the second job; however, where
> the filtered data is joined is lost on me. This is not the cause, though:
> with just one materialize at the end, after UP, the data count still
> doubled. The jobs then look like this:
>
> "Avro(target/stored/sIPhase)+EITPI+GBK+PITEI+Avro(/tmp/crunch369510677/p4)"
> "[[Avro(target/stored/sIPhase)+PITP]/[Avro(/tmp/crunch369510677/p4)]]+GBK+PWR+BTO+Avro(/tmp/crunch369510677/p6)"
> "[[Avro(target/stored/sIPhase)+S0+BTO]/[Avro(/tmp/crunch369510677/p6)]]+GBK+UP+Avro(/tmp/crunch369510677/p2)"
> "Avro(/tmp/crunch369510677/p2)+GetTraces+Avro(target/trace-dump/traces)"
> "Avro(/tmp/crunch369510677/p2)+GetData+Avro(/tmp/crunch369510677/p8)"
>
> BR,
>
> Tim van Heugten
>
> On Thu, Jan 31, 2013 at 9:27 PM, Gabriel Reid <[email protected]> wrote:
>> Hi Tim,
>>
>> On 31 Jan 2013, at 10:45, Tim van Heugten <[email protected]> wrote:
>>
>> > Hi Gabriel,
>> >
>> > For the most part it is similar to what was sent around recently on
>> > this mailing list, see:
>> > From: Dave Beech <[email protected]>
>> > Subject: Question about mapreduce job planner
>> > Date: Tue, 15 Jan 2013 11:41:42 GMT
>> >
>> > So, the common path before multiple outputs branch is executed twice.
>> > Sometimes the issues seem related to unions though, i.e. multiple
>> > inputs. We seem to have been troubled by a grouped table parallelDo on
>> > a table-union-gbk that got its data twice (all groups doubled in
>> > size). Inserting a materialize between the union and the groupByKey
>> > solved the issue.
>> >
>> > These issues seem very fragile (so they're easily fixed by changing
>> > something that's irrelevant to the output), so usually we just add or
>> > remove a materialization to make it run again.
>> > I'll see if I can cleanly reproduce the data duplication issue later
>> > this week.
>>
>> Ok, that would be great if you could replicate it in a small test,
>> thanks!
>>
>> - Gabriel
