MemPipeline and context

Tim van Heugten Wed, 30 Jan 2013 01:34:42 -0800

Hi,

I have been using Crunch for a project since April. Our pipelines are not
purely linear, so we sometimes run into issues with how Crunch optimizes
our execution graph. We need to add materializations here and there as
hints about which parts of the graph can be shared between outputs, and
so on.


Recently we decided to see if 0.4.0-incubating would provide us any
improvements (I'm afraid not yet). Adapting our code to the new API,
however, exposed some difficulties and issues. A few bugs have been
reported regarding those issues (CRUNCH-152 to CRUNCH-155); thank you for
picking them up.

The difficulties arise from the newly introduced tight coupling to
TaskInputOutputContext. In our JUnit tests we now need to inject this
context before the tests can run (many of our DoFns adjust counters or
make progress() calls). So far so good: I can use
CrunchTestSupport.getTestContext(config) with a mocked config and call
setContext() on the DoFn. But some things are unclear:
* Should I call initialize() after setContext()?
* I can see initialize() is called in setContext(), but this doesn't seem
  to be documented or guaranteed. Should setContext() be made final so it
  can be documented that initialize() does not need to be called after it?
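
To make the question concrete, here is a minimal sketch of the contract I
have in mind. It does not use the Crunch classes: SketchContext,
SketchDoFn and UpperCaseFn are hypothetical stand-ins, and the sketch only
assumes that setContext() is final and always calls initialize().

```java
// Hypothetical stand-in for the task context; NOT Hadoop's TaskInputOutputContext.
class SketchContext {
    private int progressCalls = 0;

    void progress() {
        progressCalls++;
    }

    int progressCalls() {
        return progressCalls;
    }
}

// Hypothetical base class; NOT Crunch's DoFn.
abstract class SketchDoFn<S, T> {
    private SketchContext context;

    // final, so a subclass cannot break "setContext() implies initialize()".
    public final void setContext(SketchContext context) {
        this.context = context;
        initialize();
    }

    // Hook for subclasses; the context is already available when this runs.
    protected void initialize() {
    }

    protected final SketchContext context() {
        if (context == null) {
            throw new IllegalStateException("setContext() has not been called");
        }
        return context;
    }

    public abstract T process(S input);
}

// Example function that reports progress for every record it handles.
class UpperCaseFn extends SketchDoFn<String, String> {
    int initializeCalls = 0;

    @Override
    protected void initialize() {
        initializeCalls++;
    }

    @Override
    public String process(String input) {
        context().progress();
        return input.toUpperCase();
    }
}

class ContextContractSketch {
    public static void main(String[] args) {
        SketchContext ctx = new SketchContext();
        UpperCaseFn fn = new UpperCaseFn();
        fn.setContext(ctx); // the only lifecycle call a test has to make
        System.out.println(fn.process("abc"));
        System.out.println("initialize calls: " + fn.initializeCalls);
        System.out.println("progress calls: " + ctx.progressCalls());
    }
}
```

With a contract like this, a unit test would only ever call setContext(),
and a function that was never given a context fails with a clear message
instead of a NullPointerException.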

In our more elaborate tests we use MemPipeline to see the combined effect
of our DoFns. But there:
* MemCollection shows ambiguous behaviour wrt initialize/setContext.
* A parallelDo with a PCollection output makes a call to *just*
  initialize(), and a parallelDo with a PTable output makes a call to
  *both* initialize() and setContext(). Currently this fails some of our
  tests because we use counters and progress().

Finally, I've had to create our own implementation of MemCollection
altogether*, because the stubbed TaskInputOutputContext is too limited for
our tests.
*The stubbed TaskInputOutputContext in MemCollection is unable to handle
Counters.*
I'm aware that this is stated in the javadoc, but I can't choose *not* to
use counters when testing the business code. Because Counters were handled
(and even counted) in 0.3.0, I feel confident enough about this to raise
the issue.
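
As an illustration of what our tests need from a stub, here is a minimal
sketch of a counter store that actually counts, so a test can assert on
it. Again this is not Crunch or Hadoop code: RecordingCounters and
NonEmptyLineFilter are hypothetical names made up for this example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical counter store for tests; not part of Crunch or Hadoop.
class RecordingCounters {
    private final Map<String, Long> values = new TreeMap<>();

    void increment(String group, String name, long delta) {
        values.merge(group + ":" + name, delta, Long::sum);
    }

    // Counters that were never incremented read as zero.
    long get(String group, String name) {
        return values.getOrDefault(group + ":" + name, 0L);
    }
}

// Hypothetical business logic that counts what it keeps and what it skips.
class NonEmptyLineFilter {
    private final RecordingCounters counters;

    NonEmptyLineFilter(RecordingCounters counters) {
        this.counters = counters;
    }

    boolean accept(String line) {
        boolean keep = line != null && !line.trim().isEmpty();
        counters.increment("lines", keep ? "kept" : "skipped", 1);
        return keep;
    }
}

class CounterStubSketch {
    public static void main(String[] args) {
        RecordingCounters counters = new RecordingCounters();
        NonEmptyLineFilter filter = new NonEmptyLineFilter(counters);
        for (String line : new String[] {"a", "", "  ", "b"}) {
            filter.accept(line);
        }
        System.out.println("kept: " + counters.get("lines", "kept"));
        System.out.println("skipped: " + counters.get("lines", "skipped"));
    }
}
```

The point is only that the counters are part of the behaviour under test:
the test checks the counts as well as the output.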

I'm very happy with the API of Crunch and would love for this project to
become more reliable and widely adopted. If there is anything I can do
(or any pointers on where to begin understanding the planning component),
please let me know.

Cheers,

Tim van Heugten

* Because I use MemPipeline in test contexts only, I rely on the mocked
instance from CrunchTestSupport.getTestContext(conf). Furthermore,
replacing MemCollection implies replacing MemTable and MemGroupedTable as
well.
