It works. Thanks! I added a groupByKey to force it into an MR stage.

On Wed, Feb 6, 2013 at 4:04 PM, Gabriel Reid <[email protected]> wrote:

> Hi Chao,
>
> There's currently no way of marking a particular part of the pipeline as
> being CPU intensive -- however, what you can do is force a slightly
> different execution plan by calling "materialize().iterator()" on the
> PCollection containing the results of the "FirstPass" parallelDo. This will
> force Crunch to run the pipeline up to that point and serialize the
> "FirstPass" data, and then use the serialized collection for future
> processing instead of rebuilding it.
>
> The plan for the future is to include functionality like this in the API
> (which could also possibly run somewhat more efficiently by not immediately
> running the pipeline at such a point), but for now the materialize hack is
> the easiest way to achieve this.
>
> - Gabriel
>
>
> On Wed, Feb 6, 2013 at 5:57 AM, Chao Shi <[email protected]> wrote:
>
>> Hi crunch users,
>>
>> The execution plan of my pipeline is attached to this mail. The
>> ParallelDo "FirstPass" (at the top of the graph) is highly CPU intensive,
>> because it needs to call parsers to build ASTs from source code. The best
>> plan I can imagine for my case is to have a map-only job at the front and
>> have the following 3 MRs read its output.
>>
>> I wonder if there's a way to mark my ParallelDo as CPU intensive, so that
>> Crunch only creates a single instance of it.
>>
>> Thanks,
>> Chao
>>
>
>
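
For readers of the archive, the effect discussed in this thread can be sketched in plain Java. This is only an analogy and does not use the Crunch API: the `Lazy` class, `firstPass`, and `countParseCalls` are hypothetical names invented for illustration. The sketch assumes a lazily evaluated collection whose upstream work is redone by every downstream consumer, and shows how storing the intermediate result once (the role played by materialize() or a forced MR stage in the thread) reduces the number of expensive calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

public class MaterializeSketch {
    // Counts how often the expensive step runs.
    static final AtomicInteger parseCalls = new AtomicInteger();

    // Hypothetical stand-in for the CPU-intensive "FirstPass" function.
    static String firstPass(String source) {
        parseCalls.incrementAndGet();
        return "AST(" + source + ")";
    }

    // Hypothetical lazy collection: every run() recomputes from the source.
    static final class Lazy<T> {
        private final Supplier<List<T>> compute;

        Lazy(Supplier<List<T>> compute) { this.compute = compute; }

        static <T> Lazy<T> of(List<T> data) { return new Lazy<>(() -> data); }

        <R> Lazy<R> map(Function<T, R> fn) {
            return new Lazy<>(() -> {
                List<R> out = new ArrayList<>();
                for (T item : compute.get()) {
                    out.add(fn.apply(item));
                }
                return out;
            });
        }

        // Runs everything upstream once and keeps the stored result.
        Lazy<T> materialize() {
            List<T> stored = compute.get();
            return Lazy.of(stored);
        }

        List<T> run() { return compute.get(); }
    }

    // Feeds the "FirstPass" output to three consumers and reports
    // how many times firstPass was invoked in total.
    public static int countParseCalls(boolean materializeFirst) {
        parseCalls.set(0);
        Lazy<String> asts = Lazy.of(Arrays.asList("a.java", "b.java"))
                .map(MaterializeSketch::firstPass);
        if (materializeFirst) {
            asts = asts.materialize();
        }
        asts.map(String::length).run();
        asts.map(String::toUpperCase).run();
        asts.map(String::hashCode).run();
        return parseCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("without materialize: " + countParseCalls(false)); // prints 6
        System.out.println("with materialize: " + countParseCalls(true));     // prints 2
    }
}
```

With two inputs and three consumers, the lazy version calls firstPass six times, while the materialized version calls it twice. Whether a real Crunch plan duplicates the work in the same way depends on how the planner assigns the ParallelDo to jobs, which this sketch does not model.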
