It works. Thanks! I added a groupByKey to force it into an MR stage.

On Wed, Feb 6, 2013 at 4:04 PM, Gabriel Reid <[email protected]> wrote:
> Hi Chao,
>
> There's currently no way of marking a particular part of the pipeline as
> being CPU intensive -- however, what you can do is force a slightly
> different execution plan by calling "materialize().iterator()" on the
> PCollection containing the results of the "FirstPass" parallelDo. This will
> force Crunch to run the pipeline up to that point and serialize the
> "FirstPass" data, and then use the serialized collection for future
> processing instead of rebuilding it.
>
> The plan for the future is to include functionality like this in the API
> (which could also possibly run somewhat more efficiently by not immediately
> running the pipeline at such a point), but for now the materialize hack is
> the easiest way to achieve this.
>
> - Gabriel
>
>
> On Wed, Feb 6, 2013 at 5:57 AM, Chao Shi <[email protected]> wrote:
>
>> Hi crunch users,
>>
>> The execution plan of my pipeline is attached to this mail. The
>> ParallelDo "FirstPass" (at the top of the graph) is highly CPU intensive,
>> as it needs to call parsers to build ASTs from source code. The best plan I
>> can imagine for my case is to have a map-only job in the front and have the
>> following 3 MRs read its output.
>>
>> I wonder if there's a way to mark my ParallelDo as CPU intensive, so that
>> Crunch only creates a single instance of it.
>>
>> Thanks,
>> Chao
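
For readers of the archive, the effect discussed in this thread can be illustrated with a small plain-Java model. This is not Crunch code and does not use the Crunch API: the class `MaterializeSketch`, the method `firstPass`, and the three "consumers" are all hypothetical stand-ins. It only sketches why storing the "FirstPass" output once (as the materialize workaround or an added stage boundary does) avoids repeating the expensive step for each of the three downstream jobs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class MaterializeSketch {
    // Counts how many times the expensive step is executed.
    static final AtomicInteger calls = new AtomicInteger();

    // Hypothetical stand-in for the CPU-heavy "FirstPass" step
    // (the real one parses source code into ASTs).
    static String firstPass(String source) {
        calls.incrementAndGet();
        return source.toUpperCase();
    }

    // Applies firstPass to every input and returns the results.
    static List<String> computeFirstPass(List<String> inputs) {
        List<String> out = new ArrayList<>();
        for (String s : inputs) {
            out.add(firstPass(s));
        }
        return out;
    }

    // Feeds three downstream consumers and returns how many times
    // firstPass ran in total.
    static int run(List<String> inputs, boolean materializeFirst) {
        calls.set(0);
        Supplier<List<String>> firstPassOutput;
        if (materializeFirst) {
            // Compute once, store the result, and hand out the stored copy.
            List<String> stored = computeFirstPass(inputs);
            firstPassOutput = () -> stored;
        } else {
            // Recompute the output every time a consumer asks for it.
            firstPassOutput = () -> computeFirstPass(inputs);
        }
        // Three downstream "jobs", each reading the FirstPass output.
        int totalLength = 0;
        for (String s : firstPassOutput.get()) totalLength += s.length();
        int nonEmpty = 0;
        for (String s : firstPassOutput.get()) if (!s.isEmpty()) nonEmpty++;
        int size = firstPassOutput.get().size();
        return calls.get();
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("a", "b", "c");
        System.out.println("recomputed per consumer: " + run(inputs, false));
        System.out.println("materialized once: " + run(inputs, true));
    }
}
```

With three inputs and three consumers, the first call runs the expensive step nine times and the second runs it three times. In an actual pipeline the stored copy is serialized data read back by the later jobs, so the saving is traded against extra I/O.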
