[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thejas M Nair updated PIG-506:
------------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
      Resolution: Fixed

Patch PIG-506.3.patch, with the changes suggested by Daniel, committed to trunk.

> Does pig need a NATIVE keyword?
> -------------------------------
>
>                 Key: PIG-506
>                 URL: https://issues.apache.org/jira/browse/PIG-506
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Aniket Mokashi
>            Priority: Minor
>             Fix For: 0.8.0
>
>         Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch,
> NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.2.patch,
> PIG-506.3.patch, PIG-506.patch, TestWordCount.jar
>
>
> Assume a user had a job that broke easily into three pieces. Further assume
> that pieces one and three were easily expressible in Pig, but that piece two
> needed to be written in MapReduce for whatever reason (performance,
> something that Pig could not easily express, a legacy job that was too
> important to change, etc.). Today the user would have to either use
> MapReduce for the entire job or manually stitch together the Pig and
> MapReduce jobs. What if Pig instead provided a NATIVE keyword that would
> allow the script to pass off the data stream to the underlying system (in
> this case MapReduce)? The semantics of NATIVE would vary by underlying
> system. In the MapReduce case, we would assume that it indicated a
> collection of one or more fully contained MapReduce jobs, so that Pig would
> store the data, invoke the MapReduce jobs, and then read the resulting data
> to continue. It might look something like this:
> {code}
> A = load 'myfile';
> X = load 'myotherfile';
> B = group A by $0;
> C = foreach B generate group, myudf(A);
> D = native (jar=mymr.jar, infile=frompig, outfile=topig);
> E = join D by $0, X by $0;
> ...
> {code}
> This differs from streaming in that it allows the user to insert an arbitrary
> amount of native processing, whereas streaming allows the insertion of one
> binary. It also differs in that, for streaming, data is piped directly into
> and out of the binary as part of the Pig pipeline. Here the pipeline would
> be broken: data would be written to disk, the native block invoked, and the
> data then read back from disk.
>
> Another alternative is to say this is unnecessary because the user can do the
> coordination from Java, using the PigServer interface to run Pig and calling
> the MapReduce job explicitly. The advantages of the NATIVE keyword are that
> the user need not worry about coordination between the jobs, since Pig will
> take care of it, and that the user can make use of existing Java
> applications without being a Java programmer.
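
For comparison with the proposed syntax in the description: as best understood from the Pig Latin reference, the feature shipped in 0.8.0 as a MAPREDUCE operator rather than a NATIVE keyword, with explicit store and load clauses marking where Pig writes the data out and where it reads the result back. What follows is only a sketch of the description's example in that form, not the committed test case. The main class com.example.MyJob, the assumption that mymr.jar takes its input and output paths as arguments, the output schema, and myudf are all hypothetical.

{code}
A = load 'myfile';
X = load 'myotherfile';
B = group A by $0;
C = foreach B generate group, myudf(A);
-- Pig stores C into 'frompig', runs the jar, then loads 'topig' as D.
-- com.example.MyJob, its two path arguments, and the schema are hypothetical.
D = mapreduce 'mymr.jar'
        store C into 'frompig'
        load 'topig' as (key:chararray, total:long)
        `com.example.MyJob frompig topig`;
E = join D by $0, X by $0;
{code}

The store and load clauses spell out the break in the pipeline that the description mentions: data is written to disk, the native job is invoked, and the result is read back from disk.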