[
https://issues.apache.org/jira/browse/CRUNCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034925#comment-14034925
]
Gabriel Reid commented on CRUNCH-420:
-------------------------------------
I think your analysis is correct, in that there are two situations where we
want this kind of functionality.
I don't think that there is a reduce-side version of the problem where we're
not between two GBKs. A single stream coming out of a GBK will only split into
multiple streams at the latest point possible, and things will only be run once
anyhow thanks to the multiple outputs from a reducer.
I think the more "correct" method to recommend for this kind of functionality
(in documentation) is {{PCollection.cache()}} rather than
{{PCollection.materialize()}}. The cache method is just a call to materialize
anyhow, but I think it's more consistent with the intended meaning of the cache
method in a Spark context (is that right?).
The patch goes in the same direction that I was thinking, but there still seem
to be some issues with it. If the breakpointed pipeline in Breakpoint2IT
actually gets run, it crashes with a StackOverflowError.
I put together a little mini-test to demonstrate what this patch is doing
(actually, it might be good to use a simpler situation like this in
Breakpoint2IT as well, to make it easier to debug). My test case simply reads in
a single PCollection of strings, maps it to a table using an IdentityFn, runs
the table through an IdentityFn, and then sends it to two GBKs which are then
ungrouped and written.
Running my mini test without a breakpoint gives a job plan that looks like
this, as expected:
!withoutbreakpoint.png!
Running the mini test with a breakpoint gives this job plan:
!withbreakpoint.png!
I think we want to have three jobs when the breakpoint is enabled: a single
map-only job, and then two jobs that do the grouping, both reading the output
of the first job.
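To make the intended effect concrete, here is a rough toy model in plain Java. It is not Crunch code, and every name in it ({{BreakpointModel}}, {{sharedStep}}, {{run}}) is made up for illustration. It assumes that without a breakpoint the shared map-side step is folded into each GBK job and so runs once per branch, while with a breakpoint its output is computed once and reused by both branches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

// Toy model only: none of these names are Crunch APIs.
class BreakpointModel {
    // Counts how many records pass through the shared step in one run.
    static int sharedStepCalls = 0;

    // Stand-in for the shared map-side work that feeds both GBK branches.
    static List<String[]> sharedStep(List<String> input) {
        List<String[]> table = new ArrayList<>();
        for (String s : input) {
            sharedStepCalls++;
            table.add(new String[] {s, s}); // key and value are both the input string
        }
        return table;
    }

    // Stand-in for a GBK: collects the values for each key.
    static Map<String, List<String>> groupByKey(List<String[]> table) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : table) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    // Runs both grouping branches and returns how many records the shared step processed.
    static int run(List<String> input, boolean breakpoint) {
        sharedStepCalls = 0;
        Supplier<List<String[]>> shared;
        if (breakpoint) {
            // Breakpoint: compute the shared output once, then hand it to both branches.
            List<String[]> cached = sharedStep(input);
            shared = () -> cached;
        } else {
            // No breakpoint: each branch recomputes the shared step for itself.
            shared = () -> sharedStep(input);
        }
        groupByKey(shared.get()); // first GBK branch
        groupByKey(shared.get()); // second GBK branch
        return sharedStepCalls;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "a");
        System.out.println("without breakpoint: " + run(input, false));
        System.out.println("with breakpoint:    " + run(input, true));
    }
}
```

With three input records, the model processes six records in the shared step without the breakpoint and three with it. That matches the single map-only job followed by two grouping jobs described above, though the real planner behaviour should of course be checked against the actual job plans.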
> Breakpoints Not Working
> -----------------------
>
> Key: CRUNCH-420
> URL: https://issues.apache.org/jira/browse/CRUNCH-420
> Project: Crunch
> Issue Type: Bug
> Environment: Crunch 0.8.2
> Reporter: Allan Shoup
> Assignee: Josh Wills
> Attachments: Breakpoint2IT.java, CRUNCH-420.patch,
> testBreakpoint_plan.png, withbreakpoint.png, withoutbreakpoint.png
>
>
> Reading through CRUNCH-294, it looks like materialize is supposed to function
> as a breakpoint to the planner. I've seen several plans where it appeared to
> me a particular DoFn shouldn't have been repeated, but it was.
> I'll attach some supporting material.