[
https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aniket Mokashi reassigned PIG-506:
----------------------------------
Assignee: Thejas M Nair (was: Aniket Mokashi)
> Does pig need a NATIVE keyword?
> -------------------------------
>
> Key: PIG-506
> URL: https://issues.apache.org/jira/browse/PIG-506
> Project: Pig
> Issue Type: New Feature
> Components: impl
> Reporter: Alan Gates
> Assignee: Thejas M Nair
> Priority: Minor
> Fix For: 0.8.0
>
> Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch,
> NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, TestWordCount.jar
>
>
> Assume a user had a job that broke easily into three pieces. Further assume
> that pieces one and three were easily expressible in pig, but that piece two
> needed to be written in map reduce for whatever reason (performance,
> something that pig could not easily express, legacy job that was too
> important to change, etc.). Today the user would either have to use map
> reduce for the entire job or manually handle the stitching together of pig
> and map reduce jobs. What if instead pig provided a NATIVE keyword that
> would allow the script to pass off the data stream to the underlying system
> (in this case map reduce). The semantics of NATIVE would vary by underlying
> system. In the map reduce case, we would assume that this indicated a
> collection of one or more fully contained map reduce jobs, so that pig would
> store the data, invoke the map reduce jobs, and then read the resulting data
> to continue. It might look something like this:
> {code}
> A = load 'myfile';
> X = load 'myotherfile';
> B = group A by $0;
> C = foreach B generate group, myudf(B);
> D = native (jar=mymr.jar, infile=frompig outfile=topig);
> E = join D by $0, X by $0;
> ...
> {code}
> This differs from streaming in that it allows the user to insert an arbitrary
> amount of native processing, whereas streaming allows the insertion of one
> binary. It also differs in that, for streaming, data is piped directly into
> and out of the binary as part of the pig pipeline. Here the pipeline would
> be broken, data written to disk, and the native block invoked, then data read
> back from disk.
> Another alternative is to say this is unnecessary because the user can do the
> coordination from java, using the PIgServer interface to run pig and calling
> the map reduce job explicitly. The advantages of the native keyword are that
> the user need not be worried about coordination between the jobs, pig will
> take care of it. Also the user can make use of existing java applications
> without being a java programmer.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.