[ 
https://issues.apache.org/jira/browse/CRUNCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034925#comment-14034925
 ] 

Gabriel Reid commented on CRUNCH-420:
-------------------------------------

I think your analysis is correct, in that there are two situations where we 
want this kind of functionality. 

I don't think that there is a reduce-side version of the problem where we're 
not between two GBKs. A single stream coming out of a GBK will only split into 
multiple streams at the latest point possible, and things will only be run once 
anyhow thanks to the multiple outputs from a reducer.

I think that the more "correct" method to call for this kind of 
functionality (at least in the documentation) is {{PCollection.cache()}} rather 
than {{PCollection.materialize()}}. The cache method is just a call to 
materialize anyhow, but I think it's more consistent with the intended meaning 
of the cache method in a Spark context (is that right?).

The patch goes in the same direction that I was thinking, but there still seem 
to be some issues with it. If the breakpointed pipeline in Breakpoint2IT 
actually gets run, it crashes with a StackOverflowError.

I put together a little mini-test to demonstrate what this patch is doing 
(actually, it might be good to use a simpler situation like this in 
Breakpoint2IT as well to make it easier to debug). My test case simply reads in 
a single PCollection of strings, maps it to a table using an IdentityFn, runs 
the table through an IdentityFn, and then sends it to two GBKs which are then 
ungrouped and written.

Running my mini test without a breakpoint gives a job plan that looks like 
this, as expected:

!withoutbreakpoint.png!

Running the mini test with a breakpoint gives this job plan:

!withbreakpoint.png!

I think we want to have three jobs when the breakpoint is enabled -- a single 
map-only job, and then two jobs that do the grouping stemming from the output 
of the first job.
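
To make the intended difference concrete, here is a toy model in plain Java. 
It is not Crunch code and does not use the planner; every name in it 
(BreakpointSketch, sharedFn, mapStage, groupStage) is made up for illustration. 
It assumes that, without a breakpoint, each of the two grouping jobs re-runs 
the shared map-side function, and it simply counts how often that function is 
called under each plan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BreakpointSketch {
    // Number of times the shared map-side function has been invoked.
    static int calls = 0;

    // Hypothetical stand-in for the shared DoFn (an identity function here).
    static String sharedFn(String s) {
        calls++;
        return s;
    }

    // Stand-in for the map phase: applies sharedFn to every record.
    static List<String> mapStage(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String s : input) {
            out.add(sharedFn(s));
        }
        return out;
    }

    // Stand-in for a job that only groups already-computed records.
    static int groupStage(List<String> records) {
        return records.size();
    }

    // Assumed plan without a breakpoint: two jobs, each re-running the map phase.
    static int callsWithoutBreakpoint(List<String> input) {
        calls = 0;
        groupStage(mapStage(input)); // job 1: map + first GBK
        groupStage(mapStage(input)); // job 2: map + second GBK
        return calls;
    }

    // Desired plan with a breakpoint: one map-only job, then two grouping jobs.
    static int callsWithBreakpoint(List<String> input) {
        calls = 0;
        List<String> materialized = mapStage(input); // job 1: map-only
        groupStage(materialized);                    // job 2: first GBK
        groupStage(materialized);                    // job 3: second GBK
        return calls;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c");
        System.out.println(callsWithoutBreakpoint(input)); // prints 6
        System.out.println(callsWithBreakpoint(input));    // prints 3
    }
}
```

With three input records, the shared function runs six times in the first 
plan and three times in the second, which is the saving a working breakpoint 
should give.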


> Breakpoints Not Working
> -----------------------
>
>                 Key: CRUNCH-420
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-420
>             Project: Crunch
>          Issue Type: Bug
>         Environment: Crunch 0.8.2
>            Reporter: Allan Shoup
>            Assignee: Josh Wills
>         Attachments: Breakpoint2IT.java, CRUNCH-420.patch, 
> testBreakpoint_plan.png, withbreakpoint.png, withoutbreakpoint.png
>
>
> Reading through CRUNCH-294, it looks like materialize is supposed to function 
> as a breakpoint to the planner. I've seen several plans where it appeared to 
> me a particular DoFn shouldn't have been repeated, but it was.
> I'll attach some supporting material.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
