[ 
https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733831#comment-14733831
 ] 

argho chatterjee commented on PIG-506:
--------------------------------------

Hi Team,

I tried the above approach, but Pig forces the data to be stored in a 
directory and then read back from that directory. For example:

A = LOAD ...
B = MAPREDUCE 'SomeJar.jar' STORE A INTO 'input' LOAD 'output' AS ...

Here we load the data and simply store it back again just so that the 
MapReduce job can work.
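
To make the round trip concrete, here is a minimal sketch of the pattern, 
following the MAPREDUCE syntax as I understand it from the Pig documentation. 
The file name, jar name, directory names, schema and driver class below are 
placeholders, not our actual job:

```pig
-- Placeholder input; in our case A is loaded by one of our own readers.
A = LOAD 'WordcountInput.txt';

-- Pig first writes A out to 'inputDir' (the STORE clause), then runs the
-- jar with the back-quoted arguments, then reads 'outputDir' back in
-- (the LOAD clause) so the script can continue with B.
B = MAPREDUCE 'wordcount.jar'
      STORE A INTO 'inputDir'
      LOAD 'outputDir' AS (word:chararray, count:int)
      `org.myorg.WordCount inputDir outputDir`;
```

The STORE step is the extra write to disk that I would like to avoid.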

I do not think this is an optimized approach.

Is there a way for the data loaded by A to be fed directly to the 
MapReduce job?

That is, we would implement a Pig reader in our MR job and use it to read 
the data into MR, instead of storing it this way.
We have implemented some smart Pig readers, and I want to use them in my 
MapReduce job rather than the native MR readers.

Please have a look at this use case.

> Does pig need a NATIVE keyword?
> -------------------------------
>
>                 Key: PIG-506
>                 URL: https://issues.apache.org/jira/browse/PIG-506
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Aniket Mokashi
>            Priority: Minor
>              Labels: gsoc, mentor
>             Fix For: 0.8.0
>
>         Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, 
> NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.2.patch, 
> PIG-506.3.patch, PIG-506.patch, TestWordCount.jar
>
>
> Assume a user had a job that broke easily into three pieces.  Further assume 
> that pieces one and three were easily expressible in pig, but that piece two 
> needed to be written in map reduce for whatever reason (performance, 
> something that pig could not easily express, legacy job that was too 
> important to change, etc.).  Today the user would either have to use map 
> reduce for the entire job or manually handle the stitching together of pig 
> and map reduce jobs.  What if instead pig provided a NATIVE keyword that 
> would allow the script to pass off the data stream to the underlying system 
> (in this case map reduce).  The semantics of NATIVE would vary by underlying 
> system.  In the map reduce case, we would assume that this indicated a 
> collection of one or more fully contained map reduce jobs, so that pig would 
> store the data, invoke the map reduce jobs, and then read the resulting data 
> to continue.  It might look something like this:
> {code}
> A = load 'myfile';
> X = load 'myotherfile';
> B = group A by $0;
> C = foreach B generate group, myudf(B);
> D = native (jar=mymr.jar, infile=frompig outfile=topig);
> E = join D by $0, X by $0;
> ...
> {code}
> This differs from streaming in that it allows the user to insert an arbitrary 
> amount of native processing, whereas streaming allows the insertion of one 
> binary.  It also differs in that, for streaming, data is piped directly into 
> and out of the binary as part of the pig pipeline.  Here the pipeline would 
> be broken, data written to disk, and the native block invoked, then data read 
> back from disk.
> Another alternative is to say this is unnecessary because the user can do the 
> coordination from java, using the PigServer interface to run pig and calling 
> the map reduce job explicitly.  The advantages of the native keyword are that 
> the user need not be worried about coordination between the jobs, pig will 
> take care of it.  Also the user can make use of existing java applications 
> without being a java programmer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
