Hi,

I have been using Crunch for a project since April. Our pipelines are not purely linear, so we sometimes run into issues with how Crunch optimizes our execution graph. We need to add materializations here and there as hints about which parts of the graph can be shared between outputs, and so on.
Recently we decided to see if 0.4.0-incubating would provide us any improvements (I'm afraid not yet). Trying to adapt our code to the new API, however, exposed some difficulties and issues. A few bugs have been reported regarding those issues (CRUNCH-152 to CRUNCH-155); thank you for picking them up.

The difficulties arise from the newly introduced tight coupling to TaskInputOutputContext. In our JUnit tests we now need to inject this context before the tests can run (many of our DoFns adjust counters or perform progress() calls). So far so good: I can use CrunchTestSupport.getTestContext(config) with a mocked config and call setContext() on the DoFn. But one thing is unclear:

*Should I call initialize() after setContext()?* I can see that initialize() is called in setContext(), but this doesn't seem to be documented or guaranteed. Should setContext() be made final, so that it can be documented that initialize() does not need to be called afterwards?

In our more elaborate tests we use MemPipeline to see the combined effect of our DoFns. But there:

*MemCollection shows ambiguous behaviour wrt initialize/setContext.* A parallelDo with a PCollection output makes a call to *just* initialize(), and a parallelDo with a PTable output makes a call to *both* initialize() and setContext(). Currently this fails some of our tests, because we use counters and progress().

Finally, I've had to create our own implementation of MemCollection altogether*, because the stubbed TaskInputOutputContext is too limited for our tests. *The stubbed TaskInputOutputContext in MemCollection is unable to handle Counters.* I'm aware that the javadoc says as much, but I can't choose *not* to use counters when testing the business code. Because Counters were handled (and even counted) in 0.3.0, I'm feeling confident enough about this to raise the issue.

I'm very happy with the API of Crunch and would love for this project to become more reliable and widely adopted.
If there is anything I can do (or if you have pointers on where to begin understanding the planning component), please let me know.

Cheers,
Tim van Heugten

* Because I use MemPipeline in test contexts only, I rely on the mocked instance from CrunchTestSupport.getTestContext(conf). Furthermore, replacing MemCollection implies replacing MemTable and MemGroupedTable as well.
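
To make the two requests above more concrete, here is a minimal sketch in plain Java. It does not use any Crunch or Hadoop classes: `RecordingContext`, `SketchFn`, `NonEmptyLength` and `ContextSketch` are hypothetical stand-ins invented for illustration, and the counter group and name are made up as well. It only shows the behaviour I would hope for in tests: a final setContext() that always calls initialize() exactly once, and a test context that actually records counters and progress() calls.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the task context: records counters and progress() calls. */
class RecordingContext {
    private final Map<String, Long> counters = new HashMap<>();
    private int progressCalls = 0;

    void increment(String group, String name, long delta) {
        counters.merge(group + ":" + name, delta, Long::sum);
    }

    void progress() {
        progressCalls++;
    }

    long counter(String group, String name) {
        return counters.getOrDefault(group + ":" + name, 0L);
    }

    int progressCalls() {
        return progressCalls;
    }
}

/** Hypothetical stand-in for a DoFn (not the Crunch class). */
abstract class SketchFn<S, T> {
    private RecordingContext context;
    int initializeCalls = 0; // exposed so a test can check the contract

    // Final, so the "setContext() implies initialize()" contract cannot be
    // broken by subclasses and can be documented once.
    final void setContext(RecordingContext context) {
        this.context = context;
        initializeCalls++;
        initialize();
    }

    /** Optional hook; callers never need to invoke it themselves. */
    void initialize() {
    }

    protected RecordingContext context() {
        if (context == null) {
            throw new IllegalStateException("setContext() was not called");
        }
        return context;
    }

    abstract T process(S input);
}

/** Example function that reports progress and counts empty inputs. */
class NonEmptyLength extends SketchFn<String, Integer> {
    @Override
    Integer process(String input) {
        context().progress();
        if (input.isEmpty()) {
            context().increment("sketch", "EMPTY_INPUT", 1);
        }
        return input.length();
    }
}

class ContextSketch {
    /** Runs fn over the inputs the same way regardless of output type. */
    static RecordingContext runAll(SketchFn<String, Integer> fn, String... inputs) {
        RecordingContext ctx = new RecordingContext();
        fn.setContext(ctx); // always setContext(), which in turn calls initialize()
        for (String s : inputs) {
            fn.process(s);
        }
        return ctx;
    }

    public static void main(String[] args) {
        RecordingContext ctx = runAll(new NonEmptyLength(), "a", "", "abc", "");
        System.out.println("EMPTY_INPUT = " + ctx.counter("sketch", "EMPTY_INPUT"));
        System.out.println("progress calls = " + ctx.progressCalls());
    }
}
```

With something along these lines, a test could assert on counter values after running a function over in-memory data, which is what we relied on in 0.3.0.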
