[
https://issues.apache.org/jira/browse/CRUNCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076326#comment-14076326
]
Gabriel Reid commented on CRUNCH-449:
-------------------------------------
Sorry I took so long to take a look at this. Looks interesting -- at first I
found it a bit difficult to figure out exactly what it would be used for (and
what the advantage is over just calling Pipeline.run at certain points), but it
looks like this opens up a whole lot of other opportunities to indirectly
influence the job plan without having to worry about exactly how it's done.
I noticed that SeqDoFn.dependsOn(String, PCollection) is called implicitly from
PCollectionImpl.sequentialDo, but SeqDoFn.dependsOn(String, Target) always
needs to be called explicitly. I guess this makes sense, but maybe it would be
handy to change PCollection.sequentialDo to accept a String argument that would
be used as the label of the incoming PCollection dependency. I'm thinking that
would make it easier to retrieve that PCollection later by name from within the
SeqDoFn.
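
To illustrate the suggestion, here is a minimal sketch that uses stand-in types
rather than the real Crunch classes. FakeCollection, FakeSeqDoFn, and
getCollection are hypothetical names made up for this example, and the method
bodies are assumptions, not what the patch actually does:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a PCollection; NOT the real Crunch interface.
class FakeCollection {
    final String name;

    FakeCollection(String name) {
        this.name = name;
    }

    // Hypothetical variant of sequentialDo: the caller supplies the label
    // under which this collection is registered as a dependency of the fn.
    <T> T sequentialDo(String label, FakeSeqDoFn<T> fn) {
        fn.dependsOn(label, this);
        return fn.call();
    }
}

// Stand-in for SeqDoFn: keeps its collection dependencies keyed by label.
abstract class FakeSeqDoFn<T> {
    private final Map<String, FakeCollection> collections = new HashMap<>();

    FakeSeqDoFn<T> dependsOn(String label, FakeCollection collection) {
        collections.put(label, collection);
        return this;
    }

    // Hypothetical accessor: look a dependency up again by its label.
    protected FakeCollection getCollection(String label) {
        return collections.get(label);
    }

    abstract T call();
}

public class LabelSketch {
    static String run() {
        FakeCollection lines = new FakeCollection("lines");
        return lines.sequentialDo("input", new FakeSeqDoFn<String>() {
            @Override
            String call() {
                // The label chosen at the call site is the lookup key here.
                return "got " + getCollection("input").name;
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

With an explicit label the caller and the fn agree on the name up front, so the
fn does not need to know whatever default name the implicit dependsOn call
would otherwise pick.
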
Can the "Output" generic parameter of SeqDoFn be bounded by PCollection (i.e.
<Output extends PCollection<?>>), just because that might make documenting
things easier? Or is it possible to have a SeqDoFn that is bound to something
other than a PCollection?
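
As a rough sketch of what such a bound would look like, again with stand-in
types (StubCollection and BoundedSeqDoFn are hypothetical and only mimic the
shape of the classes in the patch):

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for PCollection<T>; NOT the real Crunch interface.
class StubCollection<T> {
    final List<T> items;

    StubCollection(List<T> items) {
        this.items = items;
    }
}

// Hypothetical bounded form: Output has to be some kind of collection.
abstract class BoundedSeqDoFn<Output extends StubCollection<?>> {
    abstract Output call();
}

public class BoundSketch {
    static int run() {
        BoundedSeqDoFn<StubCollection<String>> fn =
            new BoundedSeqDoFn<StubCollection<String>>() {
                @Override
                StubCollection<String> call() {
                    return new StubCollection<>(Arrays.asList("a", "b"));
                }
            };
        // Under this bound, BoundedSeqDoFn<Void> or a pair-of-collections
        // output type would be rejected by the compiler.
        return fn.call().items.size();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The trade-off is that a bound like this would rule out a fn that returns
nothing, or one that returns several collections at once, unless those cases
are wrapped in some collection-like type.
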
I noticed that the PCollection class has a commented-out version of the
sequentialDo method that needs to be removed.
I know you're probably on top of this, but I'll just point it out anyway: more
docs in SeqDoFn, particularly on the abstract methods, would be really good.
It's not immediately obvious exactly how it is intended to be used.
Also, more tests demonstrating some more use cases (target isn't created,
dependent on multiple targets, dependent on multiple PCollections, dependent on
a combination of targets and PCollections) would also be really handy, if only
in terms of documenting some use cases for this new functionality.
> Add sequentialDo function for injecting arbitrary non-parallel code
> -------------------------------------------------------------------
>
> Key: CRUNCH-449
> URL: https://issues.apache.org/jira/browse/CRUNCH-449
> Project: Crunch
> Issue Type: Bug
> Components: Core
> Reporter: Josh Wills
> Assignee: Josh Wills
> Attachments: CRUNCH-449.patch, CRUNCH-449b.patch
>
>
> I've been noodling on this one for a while: how to add the ability to execute
> some code if and only if one or more targets are created, and have that
> executed code (optionally) return one or more new PCollections as a result. I
> was thinking that this functionality could be wired in to libraries to do
> things like bulk loading HBase tables or running Sqoop jobs as part of Crunch
> pipelines automatically.