[ 
https://issues.apache.org/jira/browse/CRUNCH-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076326#comment-14076326
 ] 

Gabriel Reid commented on CRUNCH-449:
-------------------------------------

Sorry I took so long to take a look at this. Looks interesting -- at first I 
found it a bit difficult to figure out what exactly it would be used for (and 
what the advantage is over just calling Pipeline.run at certain points), but 
it looks like this opens up a whole lot of other opportunities to indirectly 
influence the job plan without having to worry about exactly how it's done.

I noticed that SeqDoFn.dependsOn(String, PCollection) is called implicitly from 
PCollectionImpl.sequentialDo, but SeqDoFn.dependsOn(String, Target) always 
needs to be called explicitly. I guess this makes sense, but maybe it would be 
handy to change PCollection.sequentialDo to accept a String argument that would 
be used as the label of the incoming PCollection dependency. I'm thinking that 
would make it easier to retrieve that PCollection later by name from within the 
SeqDoFn.
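
To make the suggestion concrete, here is a rough sketch of the shape I have in 
mind. It uses made-up stand-in types (FakeCollection, FakeSeqFn, and the 
dependency(String) lookup are all hypothetical, not the classes in the patch), 
so treat it as an illustration of the idea rather than a proposed 
implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: all types here are hypothetical stand-ins,
// not the Crunch classes from the patch.
final class LabelSketch {
    static String run() {
        FakeCollection lines = new FakeCollection("lines");
        // The caller picks the label ("input") for the incoming collection...
        return lines.sequentialDo("input", new FakeSeqFn<String>() {
            @Override
            String call() {
                // ...and the function retrieves that collection by the same label.
                return "processing " + dependency("input").name;
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}

// Stand-in for a PCollection-like type.
final class FakeCollection {
    final String name;

    FakeCollection(String name) {
        this.name = name;
    }

    // Assumed signature: the label is passed in instead of being chosen implicitly.
    <T> T sequentialDo(String label, FakeSeqFn<T> fn) {
        fn.dependsOn(label, this);
        return fn.call();
    }
}

// Stand-in for a SeqDoFn-like type that tracks labeled dependencies.
abstract class FakeSeqFn<T> {
    private final Map<String, FakeCollection> deps = new LinkedHashMap<>();

    FakeSeqFn<T> dependsOn(String label, FakeCollection collection) {
        deps.put(label, collection);
        return this;
    }

    // Looks up a dependency by the label it was registered under.
    protected FakeCollection dependency(String label) {
        FakeCollection found = deps.get(label);
        if (found == null) {
            throw new IllegalArgumentException("No dependency labeled " + label);
        }
        return found;
    }

    abstract T call();
}
```

The point is just that the label used at the call site and the label used for 
the lookup inside the function are the same string, chosen by the caller.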

Can the "Output" generic parameter of SeqDoFn be bounded by PCollection (i.e. 
<Output extends PCollection<?>>), if only because that might make documenting 
things easier? Or is it possible to have a SeqDoFn that is bound to something 
other than a PCollection?
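
For reference, this is roughly what I mean by the bound, again using 
hypothetical stand-in types (Coll, ListColl and BoundedFn are invented for this 
sketch and are not the real Crunch types):

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical stand-ins, not the Crunch classes.
final class BoundSketch {
    static int run() {
        // Allowed: ListColl<String> is a subtype of Coll<?>.
        BoundedFn<ListColl<String>> fn = new BoundedFn<ListColl<String>>() {
            @Override
            ListColl<String> call() {
                return new ListColl<>(Arrays.asList("a", "b"));
            }
        };
        return fn.call().size();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}

// Stand-in for a PCollection-like interface.
interface Coll<T> {
    int size();
}

// Trivial in-memory implementation of the stand-in interface.
final class ListColl<T> implements Coll<T> {
    private final List<T> items;

    ListColl(List<T> items) {
        this.items = items;
    }

    @Override
    public int size() {
        return items.size();
    }
}

// With the bound in place, only Coll subtypes are valid outputs;
// something like BoundedFn<String> would be rejected by the compiler.
abstract class BoundedFn<Output extends Coll<?>> {
    abstract Output call();
}
```

The trade-off is that a bound like this rules out functions that return 
nothing or return some non-collection value, which is why I'm asking whether 
that case is intended.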

I noticed that the PCollection class has a commented-out version of the 
sequentialDo method that needs to be removed.

I know you're probably on top of this, but I'll just point it out anyway: more 
docs in SeqDoFn, particularly on the abstract methods, would be really good. 
It's not immediately obvious exactly how it is intended to be used.

Also, more tests demonstrating some more use cases (target isn't created, 
dependent on multiple targets, dependent on multiple PCollections, dependent on 
a combination of targets and PCollections) would also be really handy, if only 
in terms of documenting some use cases for this new functionality.

> Add sequentialDo function for injecting arbitrary non-parallel code
> -------------------------------------------------------------------
>
>                 Key: CRUNCH-449
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-449
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-449.patch, CRUNCH-449b.patch
>
>
> I've been noodling on this one for a while: how to add the ability to execute 
> some code if and only if one or more targets are created, and have that 
> executed code (optionally) return one or more new PCollections as a result. I 
> was thinking that this functionality could be wired in to libraries to do 
> things like bulk loading HBase tables or running Sqoop jobs as part of Crunch 
> pipelines automatically.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
