Re: Resume failed pig script

2012-06-15 Thread Jonathan Coveney
Well, you can do this physically by adding load/store boundaries to your code. Thinking out loud, such a thing could be possible... At any M/R boundary, you store the intermediate in HDFS, and pig is aware of this and doesn't automatically delete it (this part in and of itself is not trivial -- wh

Resume failed pig script

2012-06-15 Thread Russell Jurney
In production I use short Pig scripts and schedule them with Azkaban with dependencies set up, so that I can use Azkaban to restart long data pipelines at the point of failure. I edit the failing Pig script, usually towards the end of the data pipeline, and restart the Azkaban job. This saves hours
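The restart-at-the-point-of-failure idea discussed in this thread can be sketched in a few lines. This is only an illustration in plain Python, not how Azkaban or Pig implement it: the names `run_pipeline`, `stages`, and `run_stage` are hypothetical, and local files stand in for intermediate outputs stored in HDFS. A stage whose output already exists is treated as finished and skipped on the next attempt.

```python
import os


def run_pipeline(stages, run_stage):
    """Run (name, output_path) stages in order and return the names executed.

    A stage whose output path already exists is assumed to have finished in
    an earlier attempt and is skipped, so a rerun resumes at the first stage
    that has not produced its output yet.
    """
    executed = []
    for name, output_path in stages:
        if os.path.exists(output_path):
            continue  # output stored by a previous run, nothing to redo
        # run_stage is a caller-supplied (hypothetical) function that runs
        # the stage and writes output_path only when the stage succeeds.
        run_stage(name, output_path)
        executed.append(name)
    return executed
```

The sketch only works if each stage writes its output atomically on success; a half-written output would be mistaken for a finished stage.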

Re: Pig Meetup Notes

2012-06-15 Thread Russell Jurney
Is that a PMC position? I also do AV and can bounce #credentials :D Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com On Jun 15, 2012, at 3:35 PM, Jonathan Coveney wrote: > +1 > > 2012/6/15 Alan Gates > >> Thanks Russell. I move we make you the official Apache Pig s

Importing python modules in embedded pig

2012-06-15 Thread Chun Yang
Hi all, I'm trying to run the mahout canopy clustering algorithm through a Python-embedded Pig script. The embedded Pig part of the script works (using compileFromFile, bind, runSingle), but I can't figure out how to run mahout from the same script. Originally I tried running mahout via subprocess

Re: Design question - parsing clickstream with query parameters

2012-06-15 Thread Mohit Anchlia
On Fri, Jun 15, 2012 at 3:34 PM, Jonathan Coveney wrote: > We just use the Java Map class, with the restriction that the key must be a > String. There are some helper methods in trunk to work with maps, and you > can use # to dereference, i.e. map#'key' > thanks! If you don't mind could you please s

Re: Pig Meetup Notes

2012-06-15 Thread Jonathan Coveney
+1 2012/6/15 Alan Gates > Thanks Russell. I move we make you the official Apache Pig secretary. :) > > Alan. > > On Jun 12, 2012, at 9:45 PM, Russell Jurney wrote: > > > Tuesday, Pig Meetup > > > > Alan Gates - upcoming improvements in operators/backend physical plan. > > Despaghettification. >

Re: Design question - parsing clickstream with query parameters

2012-06-15 Thread Jonathan Coveney
We just use the Java Map class, with the restriction that the key must be a String. There are some helper methods in trunk to work with maps, and you can use # to dereference, i.e. map#'key' 2012/6/15 Mohit Anchlia > On Fri, Jun 15, 2012 at 9:12 AM, Alan Gates wrote: > > > This seems reasonable, e

Re: please correct the pig web page!

2012-06-15 Thread yonghu
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html. You can find the DIFF built-in function there. Regards! Yong On Fri, Jun 15, 2012 at 6:23 PM, Alan Gates wrote: > Which page is this? > > Alan. > > On Jun 13, 2012, at 5:24 AM, yonghu wrote: > >> Hello, >> >> The example of DIFF is missing the GENERATE.

Re: Design question - parsing clickstream with query parameters

2012-06-15 Thread Mohit Anchlia
On Fri, Jun 15, 2012 at 9:12 AM, Alan Gates wrote: > This seems reasonable, except it seems like it would make more sense to > convert query parameters to maps. By definition a query parameter is > key=value. And a map is easier to work with in general than a bag, since > there's no need to fla

Re: AvroStorage Issue in 0.9.2-cdh4.0.0 -- Schema is unknown

2012-06-15 Thread Russell Jurney
Oh, you maybe also need to load other jars? I load avro this way. REGISTER /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar REGISTER /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar REGISTER /me/pig/contrib/piggybank/java/piggybank.jar Russell Jurney http://datasyndrome.com On Jun 15, 2012, at 1:50 AM,

Re: AvroStorage Issue in 0.9.2-cdh4.0.0 -- Schema is unknown

2012-06-15 Thread Russell Jurney
In my experience, AvroStorage only really works reliably in Pig 0.10. I believe CDH will have Pig 0.10 in v4.1. I suggest you upgrade to v0.10 manually now, and if your data still won't load, please file a JIRA. Russell Jurney http://datasyndrome.com On Jun 15, 2012, at 1:50 AM, Markus Resch wro

opensource groovy workflow for pig, mysql, hadoop

2012-06-15 Thread Gerrit Jansen van Vuuren
Hi, I saw the note about oozie earlier in the list. When I first saw it I thought XML == NO ## WAY. I've developed and opensourced a groovy based workflow software that has been used in production for the last 2 years by 3 companies ( MySpace, SpecificMedia, and Komli ) http://code.go

Re: please correct the pig web page!

2012-06-15 Thread Alan Gates
Which page is this? Alan. On Jun 13, 2012, at 5:24 AM, yonghu wrote: > Hello, > > The example of DIFF is missing the GENERATE. It should be > > X = FOREACH A GENERATE DIFF(B1,B2); > > Regards! > > Yong

Re: Pig Meetup Notes

2012-06-15 Thread Alan Gates
Thanks Russell. I move we make you the official Apache Pig secretary. :) Alan. On Jun 12, 2012, at 9:45 PM, Russell Jurney wrote: > Tuesday, Pig Meetup > > Alan Gates - upcoming improvements in operators/backend physical plan. > Despaghettification. > Reworking UDF interface, keep backward comp

Re: Design question - parsing clickstream with query parameters

2012-06-15 Thread Alan Gates
This seems reasonable, except it seems like it would make more sense to convert query parameters to maps. By definition a query parameter is key=value. And a map is easier to work with in general than a bag, since there's no need to flatten them. Alan. On Jun 11, 2012, at 10:55 AM, Mohit Anc
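To make the "query parameters are key=value, so use a map" point concrete, here is a minimal sketch in plain Python 3. It is not a Pig UDF and not code from the thread; the function name `query_params` and the example URL are made up for illustration. It turns the query string of a clickstream URL into a dictionary with string keys, which is the shape a Pig map would have.

```python
from urllib.parse import parse_qsl, urlsplit


def query_params(url):
    """Return the query parameters of `url` as a dict with string keys."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps parameters such as "ref=" instead of
    # dropping them. If a key is repeated, the last value wins in the dict.
    return dict(parse_qsl(query, keep_blank_values=True))


params = query_params("http://example.com/item?id=42&ref=&utm_source=mail")
print(params["utm_source"])  # look up one parameter by key, like map#'key'
```

A dictionary loses repeated keys; if the clickstream can carry the same parameter several times, a bag of (key, value) pairs may still be the better fit.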

Re: Mixed input formats in LOAD path

2012-06-15 Thread Johannes Schwenk
Well I don't consider this strategy of a data format migration to be a hack. The only thing that is somewhat "hacky" and definitely not elegant is the creation of empty files for each known format by the logger! Do you have any advice on how to design our pig scripts so that they account for migrati

Re: error to generate a map?

2012-06-15 Thread Alan Gates
Maps require string keys. So it should read ['222'#1]. Alan. On Jun 7, 2012, at 8:51 PM, Yang wrote: > I ran the following simple pig script > > > a = load 'a'; > > b = foreach a generate [222#1]; > > dump b; > > > but it gave the following error > > $ pig -x local a.pig > 2012-06-07 20

Re: distributing macros in a jar?

2012-06-15 Thread Alan Gates
It's definitely a reasonable thing to do, but it hasn't been added yet. Ideally registering a jar of macros would make them available, just as registering a jar of UDFs does. Alan. On Jun 7, 2012, at 6:13 PM, Matthew Hayes wrote: > According to the documentation it looks like the only way to

Re: Mixed input formats in LOAD path

2012-06-15 Thread Ruslan Al-Fakikh
Hey, You can keep a single empty file per format. That way pig won't fail. But basically I recommend avoiding situations that need hacks or custom formats. In my experience you'll soon get into trouble with that. Thanks On Fri, Jun 15, 2012 at 5:39 PM, Johannes Schwenk wrote: > Tha

Re: Mixed input formats in LOAD path

2012-06-15 Thread Johannes Schwenk
Thanks a lot Ruslan, that seems one possible direction! One thing remains to be resolved: I don't know whether I will get an Avro in the input or CSV, TSV or all... So how could I get pig not to choke on missing input files? Johannes Am 15.06.2012 15:24, schrieb Ruslan Al-Fakikh: > I guess you c

Re: Mixed input formats in LOAD path

2012-06-15 Thread Ruslan Al-Fakikh
I guess you could use globbing for extracting the files by extensions, like this: $ ls input.avro input.txt $ cat input.avro avro1 avro2 $ cat input.txt txt1 txt2 [cloudera@localhost workpig]$ pig -x local 2012-06-15 17:21:09,613 [main] INFO org.apache.pig.Main - Logging error messages to: /home

Re: Mixed input formats in LOAD path

2012-06-15 Thread Johannes Schwenk
Hi Ruslan, thanks for your answer! I have only the input path, but do not know which file format the different files in that path possess. All files that are in the path belong to one relation however, so I want to load them at once. Though a union of separately loaded files would be ok too, if th

Re: Mixed input formats in LOAD path

2012-06-15 Thread Ruslan Al-Fakikh
Hi Johannes, I guess you'd have to write a custom Loader for such a situation, but why do you need to load everything in one pass? You can load different types of files separately (having multiple LOAD statements) and make a join or a union afterwards. Ruslan On Fri, Jun 15, 2012 at 4:13 PM, Joh
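The "one LOAD per format, then a union" suggestion can be sketched as a small script generator. This is an assumption-laden illustration in plain Python, not something posted in the thread: the function name `build_load_script`, the alias names, and the extension-to-loader table are hypothetical, and it assumes every format yields the same schema. LOAD statements are emitted only for formats that are actually present, so Pig is never pointed at a glob that matches nothing.

```python
import os

# Hypothetical mapping from file extension to a Pig loader expression.
# AvroStorage is the piggybank loader; the jars must be registered separately.
LOADERS = {
    ".avro": "org.apache.pig.piggybank.storage.avro.AvroStorage()",
    ".csv": "PigStorage(',')",
    ".tsv": "PigStorage('\\t')",
}


def build_load_script(input_dir, file_names, out_alias="merged"):
    """Build Pig statements that load each format present and combine them."""
    present = sorted({os.path.splitext(n)[1] for n in file_names} & set(LOADERS))
    if not present:
        raise ValueError("no files in a known format")
    lines, aliases = [], []
    for ext in present:
        alias = "in_" + ext.lstrip(".")
        aliases.append(alias)
        lines.append("%s = LOAD '%s/*%s' USING %s;"
                     % (alias, input_dir, ext, LOADERS[ext]))
    if len(aliases) == 1:
        # UNION needs at least two relations, so just rename the single input.
        lines.append("%s = FOREACH %s GENERATE *;" % (out_alias, aliases[0]))
    else:
        lines.append("%s = UNION %s;" % (out_alias, ", ".join(aliases)))
    return "\n".join(lines)
```

The generated text could then be written to a file or passed to an embedded Pig run; how the file listing is obtained from HDFS is left out of the sketch.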

Mixed input formats in LOAD path

2012-06-15 Thread Johannes Schwenk
Hi all, is it possible to have an input path (as parameter to a LOAD statement) that contains several files in *different formats* - say serialized Avro data and tab separated values - and make pig read the data into one alias? I guess I have to write a UDF for this? How should I start, can you ske

AvroStorage Issue in 0.9.2-cdh4.0.0 -- Schema is unknown

2012-06-15 Thread Markus Resch
Hey all, we're currently testing the switch from CDH3 to CDH4. When I try to read my Avro input data I get a Schema unknown error: bash-3.2$ pig 12/06/15 08:48:08 WARN pig.Main: Cannot write to log file: /usr/lib/pig/pig_1339750088923.log 2012-06-15 08:48:09,415 [main] INFO org.apache.pig.

Re: different mapred.min.split.size within one pig script?

2012-06-15 Thread Dmitriy Ryaboy
Correct; I don't think there is a good way to do that except perhaps by inserting "exec" statements to separate parts of the script that you need to execute with the different settings. D On Wed, Jun 13, 2012 at 11:08 PM, Yang wrote: > thanks, > > I tried, but it does not seem to work,  even aft