I created https://issues.apache.org/jira/browse/CRUNCH-157 and https://issues.apache.org/jira/browse/CRUNCH-158 to track these.
On Wed, Jan 30, 2013 at 6:07 AM, Josh Wills <[email protected]> wrote: > Hey Tim, > > Replies inlined. > > > On Wed, Jan 30, 2013 at 1:33 AM, Tim van Heugten <[email protected]> wrote: > >> Hi, >> >> Since april I'm using Crunch for a project. We're not doing only linear >> executions of the pipeline, so we're sometimes having issues with how >> Crunch is optimizing our execution graph. We need to add materializations >> here and there as hints to what parts of the graph can be shared for >> outputs and so on. >> >> Recently we decided to see if 0.4.0-incubating would provide us any >> improvements (I'm afraid not yet). Trying to adapt our code to the new API, >> however, exposed some difficulties and issues. A few bugs have been >> reported regarding those issue (CRUNCH-152 to CRUNCH-155), thank you for >> picking them up. >> >> The difficulties arise from the newly introduced tight bound with >> TaskInputOutputContext. Now in our jUnit tests we need to inject this >> before the tests can run (many of our DoFns adjust counters of perform >> progress() calls). So far so good, I can use >> CrunchTestSupport.getTestContext(config) with a mocked config and call >> setContext() on the DoFn. But there is some unclarity: >> *Should I call initialize() after setContext()? >> *I can see initialize() is called in setContext(), but this doesn't seem >> documented or guarenteed. Should setContext() be made final so it can be >> documented that initialize does not need to be called after? >> > > Yes, I think so. I'm not sure of the implications of doing that, but I'll > create a branch and see what fails and what needs fixing. > > >> >> In our more elaborate tests we use MemPipeline to see the combined effect >> of our DoFns. But there: >> *MemCollection shows ambiguous behaviour wrt initialize/setContext. >> *A parallelDo with a PCollection output makes a call to *just*initialize(), >> and a parallelDo with a PTable output makes a call to >> *both* initialize() and setContext(). Currently this fails some of our >> tests because we use counters and progress(). >> > > Yeah, that's no good. Let's open a JIRA for that one. > > >> >> Finally, I've had to create our own implementation of MemCollection >> altogether*, because the stubbed TaskInputOutputContext is too limited for >> our tests. >> *Stubbed TaskInputOutputContext in MemCollection is unable to handle >> Counters*. >> I'm aware that so much is stated in the javadoc, but I can't choose *not*to >> use counters when testing the business code. Because Counters were >> handled (and even counted) in 0.3.0 I'm feeling confident enough about this >> to raise the issue. >> > > Agreed-- I didn't realize MemCollection gave up the ability to use > Counters in that change-- I'm a bit surprised by that. Let's create a JIRA > to put it back in. > > >> I'm very happy with the api of crunch and would love for this project to >> become more reliable and widely adopted. If there is anything I can do (or >> instruction on where to begin understanding the planning component) please >> let me know. >> >> Cheers, >> >> Tim van Heugten >> >> * Because I use MemPipeline in test contexts only I rely on the mocked >> instance from CrunchTestSupport.getTestContext(conf);, further, replacing >> MemCollection implies replacing MemTable and MemGroupedTable as well. >> > > > > -- > Director of Data Science > Cloudera <http://www.cloudera.com> > Twitter: @josh_wills <http://twitter.com/josh_wills> > -- Director of Data Science Cloudera <http://www.cloudera.com> Twitter: @josh_wills <http://twitter.com/josh_wills>
