Re: MemPipeline and context

Josh Wills Wed, 30 Jan 2013 07:45:45 -0800

I created https://issues.apache.org/jira/browse/CRUNCH-157 and
https://issues.apache.org/jira/browse/CRUNCH-158 to track these.



On Wed, Jan 30, 2013 at 6:07 AM, Josh Wills <[email protected]> wrote:

> Hey Tim,
>
> Replies inlined.
>
>
> On Wed, Jan 30, 2013 at 1:33 AM, Tim van Heugten <[email protected]> wrote:
>
>> Hi,
>>
>> Since april I'm using Crunch for a project. We're not doing only linear
>> executions of the pipeline, so we're sometimes having issues with how
>> Crunch is optimizing our execution graph. We need to add materializations
>> here and there as hints to what parts of the graph can be shared for
>> outputs and so on.
>>
>> Recently we decided to see if 0.4.0-incubating would provide us any
>> improvements (I'm afraid not yet). Trying to adapt our code to the new API,
>> however, exposed some difficulties and issues. A few bugs have been
>> reported regarding those issue (CRUNCH-152 to CRUNCH-155), thank you for
>> picking them up.
>>
>> The difficulties arise from the newly introduced tight bound with
>> TaskInputOutputContext. Now in our jUnit tests we need to inject this
>> before the tests can run (many of our DoFns adjust counters of perform
>> progress() calls). So far so good, I can use
>> CrunchTestSupport.getTestContext(config) with a mocked config and call
>> setContext() on the DoFn. But there is some unclarity:
>> *Should I call initialize() after setContext()?
>> *I can see initialize() is called in setContext(), but this doesn't seem
>> documented or guarenteed. Should setContext() be made final so it can be
>> documented that initialize does not need to be called after?
>>
>
> Yes, I think so. I'm not sure of the implications of doing that, but I'll
> create a branch and see what fails and what needs fixing.
>
>
>>
>> In our more elaborate tests we use MemPipeline to see the combined effect
>> of our DoFns. But there:
>> *MemCollection shows ambiguous behaviour wrt initialize/setContext.
>> *A parallelDo with a PCollection output makes a call to *just*initialize(), 
>> and a parallelDo with a PTable output makes a call to
>> *both* initialize() and setContext(). Currently this fails some of our
>> tests because we use counters and progress().
>>
>
> Yeah, that's no good. Let's open a JIRA for that one.
>
>
>>
>> Finally, I've had to create our own implementation of MemCollection
>> altogether*, because the stubbed TaskInputOutputContext is too limited for
>> our tests.
>> *Stubbed TaskInputOutputContext in MemCollection is unable to handle
>> Counters*.
>> I'm aware that so much is stated in the javadoc, but I can't choose *not*to 
>> use counters when testing the business code. Because Counters were
>> handled (and even counted) in 0.3.0 I'm feeling confident enough about this
>> to raise the issue.
>>
>
> Agreed-- I didn't realize MemCollection gave up the ability to use
> Counters in that change-- I'm a bit surprised by that. Let's create a JIRA
> to put it back in.
>
>
>> I'm very happy with the api of crunch and would love for this project to
>> become more reliable and widely adopted. If there is anything I can do (or
>> instruction on where to begin understanding the planning component) please
>> let me know.
>>
>> Cheers,
>>
>> Tim van Heugten
>>
>> * Because I use MemPipeline in test contexts only I rely on the mocked
>> instance from CrunchTestSupport.getTestContext(conf);, further, replacing
>> MemCollection implies replacing MemTable and MemGroupedTable as well.
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: MemPipeline and context

Reply via email to