Hi Eduardo,

I didn't try this myself, but how about using a Java action, where you write
custom code that starts PigRunner, grabs stdout, and persists it into HDFS for
later use? You can use the distributed cache to provide pig.jar to each compute
node, so there is no need to install pig.jar across the whole cluster.
Thanks,
Ryota

On 9/5/12 7:11 AM, "Eduardo Afonso Ferreira" <[email protected]> wrote:

>Hey, Kamal,
>
>That would require me to install Pig on the Hadoop/HBase cluster.
>I believe running Pig as an Oozie action makes better use of resources.
>In this case, Oozie runs the Pig script using PigRunner.run().
>
>Anyway, I have not yet figured out how to get the output generated by
>PigRunner.run().
>
>I'm wondering if there's a way to have another action in my workflow
>retrieve that output from the file system (if I can determine how to do
>that), or if there's any EL function or variable that would retrieve it
>for me.
>
>So far, the only possibility I see to get that output is to write a
>Java action (or similar) that receives the launcher Hadoop job ID (I
>still need to figure out how to get that job ID) and uses the
>Hadoop API to get the output.
>
>If somebody out there knows how to do it, would you mind sharing how you
>did it?
>
>What I would really like to see is an EL function that returns that
>output, or a way of redirecting that output to a known location on HDFS.
>Better yet, there could be functions or variables exposing all the
>information I see in that output, such as:
>
>- HadoopVersion
>- PigVersion
>- UserId
>- StartedAt
>- FinishedAt
>- Features
>- List of Hadoop jobs with per-job stats (JobId, Maps, Reduces,
>MaxMapTime, MinMapTime, AvgMapTime, MaxReduceTime, MinReduceTime,
>AvgReduceTime, Alias, Feature, Outputs)
>- Input information (records read and source, e.g. HBase table name)
>- Output information (records written and destination, e.g. HBase table
>name)
>- Indication of success or failure
>- etc.
>
>If I figure out how to get that output file, I can parse it and retrieve
>all the information.
>
>Thank you.
>Eduardo.
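As a sketch of that parsing step, for whoever gets the stdout into a file: Pig separates the columns of its summary report with tabs, so something along these lines could pull out the fields listed above. This is purely illustrative (the function name is made up), keyed to the sample output quoted later in the thread:

```python
import re

SAMPLE = """\
HadoopVersion\tPigVersion\tUserId\tStartedAt\tFinishedAt\tFeatures
0.20.2-cdh3u3\t0.11.0-SNAPSHOT\tmapred\t2012-09-04 17:00:56\t2012-09-04 17:06:26\tGROUP_BY,DISTINCT,FILTER

Success!

Job Stats (time in seconds):
job_201206281058_812355\t14\t4\t20\t11\t16\t250\t235\t246\tB,esi,filt1,keys,slots\tDISTINCT

Input(s):
Successfully read 144190 records (5079 bytes) from: "hbase://events_sessions"

Output(s):
Successfully stored 20835 records in: "hbase://active_video_plays"
"""

def parse_pig_stdout(text):
    """Extract summary fields, success flag, job ids, and I/O record counts
    from the report Pig writes to stdout (columns are tab-separated)."""
    lines = text.splitlines()
    result = {"success": "Success!" in lines,
              "summary": {}, "jobs": [], "inputs": [], "outputs": []}
    for i, line in enumerate(lines):
        if line.startswith("HadoopVersion"):
            # Header row names the fields; the next row carries the values.
            result["summary"] = dict(zip(line.split("\t"),
                                         lines[i + 1].split("\t")))
        elif line.startswith("job_"):
            result["jobs"].append(line.split("\t")[0])
        m = re.match(r'Successfully read (\d+) records .* from: "(.+)"', line)
        if m:
            result["inputs"].append((int(m.group(1)), m.group(2)))
        m = re.match(r'Successfully stored (\d+) records in: "(.+)"', line)
        if m:
            result["outputs"].append((int(m.group(1)), m.group(2)))
    return result

stats = parse_pig_stdout(SAMPLE)
print(stats["summary"]["HadoopVersion"])  # 0.20.2-cdh3u3
print(stats["jobs"])                      # ['job_201206281058_812355']
```

The same patterns would apply to a file pulled back from HDFS with the FileSystem API or `hadoop fs -cat`.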
>
>________________________________
>From: Kamal Hakim <[email protected]>
>To: "[email protected]" <[email protected]>; Eduardo Afonso Ferreira <[email protected]>
>Sent: Tuesday, September 4, 2012 5:40 PM
>Subject: RE: Capturing Pig action output
>
>Hi Eduardo,
>
>You could run the Pig command as an in-line script command and capture
>the output into an output file.
>
>Example:
>
>pig -f pigscript.pig > pig_log.log
>
>Kamal Hakim
>American Express
>Big Data Architecture
>Phone: 602-537-6819
>________________________________________
>From: Eduardo Afonso Ferreira [[email protected]]
>Sent: Tuesday, September 04, 2012 02:24 PM
>To: [email protected]
>Subject: Re: Capturing Pig action output
>
>Hey, thanks for the response.
>
>What I need is not the log file that Pig creates when there's an
>error, the one that lists the stack trace and things like that.
>
>I need the output that Pig sends to stdout/stderr. This is the output
>that includes information about the Hadoop jobs created, started/finished
>timestamps, success/failure, etc.
>
>Here's an example of part of that output:
>
>________________________________
>
>HadoopVersion    PigVersion       UserId  StartedAt            FinishedAt           Features
>0.20.2-cdh3u3    0.11.0-SNAPSHOT  mapred  2012-09-04 17:00:56  2012-09-04 17:06:26  GROUP_BY,DISTINCT,FILTER
>
>Success!
>
>Job Stats (time in seconds):
>JobId                    Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias                   Feature            Outputs
>job_201206281058_812355  14    4        20          11          16          250            235            246            B,esi,filt1,keys,slots  DISTINCT
>job_201206281058_812414  4     4        7           6           7           26             22             24             1-4,C,E                 GROUP_BY,DISTINCT  hbase://active_video_plays,
>
>Input(s):
>Successfully read 144190 records (5079 bytes) from: "hbase://events_sessions"
>
>Output(s):
>Successfully stored 20835 records in: "hbase://active_video_plays"
>
>Counters:
>Total records written : 20835
>Total bytes written : 0
>Spillable Memory Manager spill count : 0
>Total bags proactively spilled: 0
>Total records proactively spilled: 0
>________________________________
>
>Virag,
>
>We're currently using Oozie 2.3.2 (2.3.2-cdh3u3), and I guess the wf
>configuration you mentioned is not available in this version.
>
>Eduardo.
>
>________________________________
>From: Mona Chitnis <[email protected]>
>To: "[email protected]" <[email protected]>; Eduardo Afonso Ferreira <[email protected]>
>Sent: Thursday, August 30, 2012 4:01 PM
>Subject: Re: Capturing Pig action output
>
>Hi Eduardo,
>
>The log file where the Oozie Pig action's output is written is a local
>file in the current working directory of the map task, not on HDFS.
>Also, passing a custom path using "pig -logfile <HDFS_PATH>" is not
>allowed. Can you try passing another argument at the end of your
>Pig action's argument list as a redirection to some file of your choice?
>
>E.g., trying to recreate something like the following; note the redirection
>to 'myfile.txt' at the end:
>
>pig -param xyz=1000 myscript.pig &> myfile.txt
>
>I haven't tried this out myself, but it will be helpful to find out.
>Otherwise, Virag's suggestion can help you access specific stats-related
>information.
>
>Regards,
>
>--
>Mona Chitnis
>
>On 8/30/12 11:59 AM, "Virag Kothari" <[email protected]> wrote:
>
>>Hi,
>>
>>From 3.2 onwards, counters and Hadoop job IDs for Pig and MapReduce
>>actions can be accessed through the API or an EL function.
>>
>>First, the following should be set in the wf configuration. This will
>>store the Pig/MR related statistics in the DB:
>>
>><property>
>>  <name>oozie.action.external.stats.write</name>
>>  <value>true</value>
>></property>
>>
>>Then, the stats and job IDs can be accessed using the verbose API:
>>
>>oozie job -info <jobId> -verbose
>>
>>Also, the Hadoop job IDs can be retrieved for a Pig action through the
>>EL function:
>>
>>wf:actionData(<pig-action-name>)["hadoopJobs"]
>>
>>Detailed docs at
>>http://incubator.apache.org/oozie/docs/3.2.0-incubating/docs/WorkflowFunctionalSpec.html.
>>Look under "4.2.5 Hadoop EL Functions".
>>
>>Thanks,
>>Virag
>>
>>On 8/30/12 10:31 AM, "Eduardo Afonso Ferreira" <[email protected]>
>>wrote:
>>
>>>Hi there,
>>>
>>>I have a Pig script that Oozie runs periodically via a coordinator with
>>>a set frequency.
>>>I want to capture the Pig script's output because I need to look at some
>>>information in the results to keep track of several things.
>>>I know I can look at the output through a whole series of clicks
>>>starting at the Oozie web console, as follows:
>>>
>>>- Open the Oozie web console (e.g. http://localhost:11000/oozie/)
>>>- Find and click the specific job under "Workflow Jobs"
>>>- Select (click) the Pig action in the window that pops up
>>>- Click the magnifying glass icon on the "Console URL" field
>>>- Click the Map of the launcher job
>>>- Click the task ID
>>>- Click "All" under "Task Logs"
>>>
>>>My question is: how can I know the exact name and location of that log
>>>file in HDFS, so I can programmatically retrieve the file from HDFS and
>>>parse it to look for what I need?
>>>
>>>Is this something I can determine ahead of time? For example, can I
>>>pass a parameter/argument to the action/Pig so that it will store the
>>>log where I want, with the file name I want?
>>>
>>>Thanks in advance for your help.
>>>Eduardo.
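For later readers: Ryota's java-action idea at the top of the thread might look roughly like the workflow snippet below. The action name, main class, and paths are all hypothetical; placing pig.jar in the workflow application's lib/ directory is what lets Oozie ship it to each compute node through the distributed cache, so nothing has to be installed cluster-wide.

```xml
<!-- Hypothetical sketch only; com.example.CapturingPigRunner would call
     PigRunner.run(), capture stdout, and copy it to the HDFS path given
     as its last argument. pig.jar sits in the workflow app's lib/ dir. -->
<action name="pig-via-java">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <main-class>com.example.CapturingPigRunner</main-class>
        <arg>-f</arg>
        <arg>myscript.pig</arg>
        <arg>${nameNode}/user/${wf:user()}/pig-stdout/${wf:id()}.log</arg>
    </java>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

On Oozie 3.2+ a downstream action could instead pick up the Hadoop job IDs Virag mentions via `${wf:actionData('pig-action-name')['hadoopJobs']}` and query the JobTracker directly.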
