Well, you can do this physically by adding load/store boundaries to your
code. Thinking out loud, such a thing could be possible...
At any M/R boundary, you store the intermediate in HDFS, and pig is aware
of this and doesn't automatically delete it (this part in and of itself is
not trivial -- wh
In production I use short Pig scripts and schedule them with Azkaban
with dependencies set up, so that I can use Azkaban to restart long
data pipelines at the point of failure. I edit the failing pig script,
usually towards the end of the data pipeline, and restart the Azkaban
job. This saves hours
Is that a PMC position? I also do AV and can bounce #credentials :D
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com
On Jun 15, 2012, at 3:35 PM, Jonathan Coveney wrote:
> +1
>
> 2012/6/15 Alan Gates
>
>> Thanks Russell. I move we make you the official Apache Pig s
Hi all,
I'm trying to run the mahout canopy clustering algorithm through a
Python-embedded Pig script. The embedded Pig part of the script works (using
compileFromFile, bind, runSingle), but I can't figure out how to run mahout
from the same script. Originally I tried running mahout via subprocess
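For what it's worth, the embedded-Pig portion described above (compileFromFile, bind, runSingle) looks roughly like this. It is only a sketch: the script paths, parameter names, and Mahout flags are all made up, and the subprocess approach assumes the mahout launcher is on the PATH.

```python
# Jython script, run with: pig this_script.py
from org.apache.pig.scripting import Pig
import subprocess

# Compile and run the Pig part (paths and params are hypothetical)
P = Pig.compileFromFile('prepare_vectors.pig')
bound = P.bind({'INPUT': '/data/raw', 'OUTPUT': '/data/vectors'})
stats = bound.runSingle()

if stats.isSuccessful():
    # One way to invoke Mahout from the same script: shell out to
    # the mahout launcher (assumes it is on the PATH)
    subprocess.call(['mahout', 'canopy',
                     '-i', '/data/vectors',
                     '-o', '/data/canopies',
                     '-t1', '3.0', '-t2', '1.5'])
```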
On Fri, Jun 15, 2012 at 3:34 PM, Jonathan Coveney wrote:
> We just use the Java Map class, with the restriction that the key must be a
> String. There are some helper methods in trunk to work with maps, and you
> can use # to dereference, i.e. map#'key'
>
thanks! If you don't mind could you please s
+1
2012/6/15 Alan Gates
> Thanks Russell. I move we make you the official Apache Pig secretary. :)
>
> Alan.
>
> On Jun 12, 2012, at 9:45 PM, Russell Jurney wrote:
>
> > Tuesday, Pig Meetup
> >
> > Alan Gates - upcoming improvements in operators/backend physical plan.
> > De-spaghettification.
>
We just use the Java Map class, with the restriction that the key must be a
String. There are some helper methods in trunk to work with maps, and you
can use # to dereference, i.e. map#'key'
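A minimal sketch of that syntax (the relation and field names are made up for illustration):

```pig
-- Hypothetical data: a field and a map-typed field
a = LOAD 'users' AS (name:chararray, props:map[]);

-- Dereference a map value with #; keys must be chararrays
b = FOREACH a GENERATE name, props#'age' AS age;
DUMP b;
```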
2012/6/15 Mohit Anchlia
> On Fri, Jun 15, 2012 at 9:12 AM, Alan Gates wrote:
>
> > This seems reasonable, e
You can find the DIFF built-in function at
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html.
regards!
Yong
On Fri, Jun 15, 2012 at 6:23 PM, Alan Gates wrote:
> Which page is this?
>
> Alan.
>
> On Jun 13, 2012, at 5:24 AM, yonghu wrote:
>
>> Hello,
>>
>> The example of DIFF lacks the GENERATE.
On Fri, Jun 15, 2012 at 9:12 AM, Alan Gates wrote:
> This seems reasonable, except it seems like it would make more sense to
> convert query parameters to maps. By definition a query parameter is
> key=value. And a map is easier to work with in general than a bag, since
> there's no need to fla
Oh, maybe you also need to load other jars? I load Avro this way.
REGISTER /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
REGISTER /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /me/pig/contrib/piggybank/java/piggybank.jar
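With those jars registered, loading Avro data via Piggybank's AvroStorage looks roughly like this (the path is hypothetical):

```pig
records = LOAD '/data/events.avro'
          USING org.apache.pig.piggybank.storage.avro.AvroStorage();
DESCRIBE records;
```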
Russell Jurney http://datasyndrome.com
On Jun 15, 2012, at 1:50 AM,
In my experience, AvroStorage only really works reliably in Pig 0.10.
I believe CDH will have Pig 0.10 in v4.1. I suggest you upgrade to
v0.10 manually now, and if your data still won't load, please file a
JIRA.
Russell Jurney http://datasyndrome.com
On Jun 15, 2012, at 1:50 AM, Markus Resch wro
Hi,
I saw the note about Oozie earlier in the list. When I first saw it I
thought XML == NO ## WAY.
I've developed and open-sourced Groovy-based workflow software that has
been used in production for the last 2 years by 3 companies
(MySpace, SpecificMedia, and Komli)
http://code.go
Which page is this?
Alan.
On Jun 13, 2012, at 5:24 AM, yonghu wrote:
> Hello,
>
> The example of DIFF lacks the GENERATE. It should be
>
> X = FOREACH A GENERATE DIFF(B1,B2);
>
> Regards!
>
> Yong
Thanks Russell. I move we make you the official Apache Pig secretary. :)
Alan.
On Jun 12, 2012, at 9:45 PM, Russell Jurney wrote:
> Tuesday, Pig Meetup
>
> Alan Gates - upcoming improvements in operators/backend physical plan.
> De-spaghettification.
> Reworking UDF interface, keep backward comp
This seems reasonable, except it seems like it would make more sense to convert
query parameters to maps. By definition a query parameter is key=value. And a
map is easier to work with in general than a bag, since there's no need to
flatten them.
Alan.
On Jun 11, 2012, at 10:55 AM, Mohit Anc
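The difference Alan describes can be sketched like this (the relation and field names are assumptions):

```pig
-- With a map, a query parameter is addressed directly:
requests = LOAD 'access_log' AS (url:chararray, params:map[]);
sources  = FOREACH requests GENERATE url, params#'utm_source' AS utm_source;

-- With a bag of (key, value) tuples, you must FLATTEN first:
requests2 = LOAD 'access_log2'
            AS (url:chararray, params:bag{t:(key:chararray, value:chararray)});
pairs     = FOREACH requests2 GENERATE url, FLATTEN(params);
sources2  = FILTER pairs BY key == 'utm_source';
```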
Well, I don't consider this strategy of a data format migration to be a
hack. The only thing that is somewhat "hacky" and definitely not elegant
is the creation of empty files for each known format by the logger!
Do you have any advice on how to design our pig scripts so that they
account for migrati
Maps require string keys. So it should read ['222'#1].
Alan.
On Jun 7, 2012, at 8:51 PM, Yang wrote:
> I ran the following simple pig script
>
>
> a = load 'a';
>
> b = foreach a generate [222#1];
>
> dump b;
>
>
> but it gave the following error
>
> $ pig -x local a.pig
> 2012-06-07 20
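With the key quoted, as Alan notes, the script above becomes:

```pig
a = LOAD 'a';

-- Map keys must be chararrays, so the literal key is quoted
b = FOREACH a GENERATE ['222'#1];

DUMP b;
```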
It's definitely a reasonable thing to do, but it hasn't been added yet.
Ideally registering a jar of macros would make them available, just as
registering a jar of UDFs does.
Alan.
On Jun 7, 2012, at 6:13 PM, Matthew Hayes wrote:
> According to the documentation it looks like the only way to
Hey,
You can keep a single empty file per format. That way pig won't fail.
But basically I recommend avoiding situations that need hacks or
custom formats. In my experience you'll soon get into trouble
with that.
Thanks
On Fri, Jun 15, 2012 at 5:39 PM, Johannes Schwenk
wrote:
> Tha
Thanks a lot Ruslan, that seems like one possible direction!
One thing remains to be resolved: I don't know whether the input will
be Avro, CSV, TSV, or all of them... So how could I get pig not to
choke on missing input files?
Johannes
Am 15.06.2012 15:24, schrieb Ruslan Al-Fakikh:
> I guess you c
I guess you could use globbing to extract the files by extension,
like this:
$ ls
input.avro input.txt
$ cat input.avro
avro1
avro2
$ cat input.txt
txt1
txt2
[cloudera@localhost workpig]$ pig -x local
2012-06-15 17:21:09,613 [main] INFO org.apache.pig.Main - Logging
error messages to: /home
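A sketch of the globbing approach against those files; the loader choices are assumptions, and each LOAD picks up only the files matching its extension glob:

```pig
txt  = LOAD 'input*.txt' USING PigStorage() AS (line:chararray);
avro = LOAD 'input*.avro'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage();
```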
Hi Ruslan,
thanks for your answer!
I have only the input path, but do not know which file format the
different files in that path possess. All files that are in the path
belong to one relation however, so I want to load them at once. Though a
union of separately loaded files would be ok too, if th
Hi Johannes,
I guess you'd have to write a custom Loader for such a situation, but
why do you need to load everything in one pass? You can load different
types of files separately (having multiple LOAD statements) and make a
join or a union afterwards.
Ruslan
On Fri, Jun 15, 2012 at 4:13 PM, Joh
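A sketch of Ruslan's suggestion, assuming both formats can be coerced to the same schema (names and paths are made up):

```pig
tsv  = LOAD 'input/part.tsv' USING PigStorage('\t')
       AS (id:int, value:chararray);
avro = LOAD 'input/part.avro'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage()
       AS (id:int, value:chararray);

-- Combine the separately loaded formats into one relation
everything = UNION tsv, avro;
```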
Hi all,
is it possible to have an input path (as parameter to a LOAD statement)
that contains several files in *different formats* - say serialized Avro
data and tab-separated values - and make pig read the data into one alias?
I guess I have to write a UDF for this? How should I start, can you
ske
Hey all,
we're currently testing to switch over from CDH3 to CDH4.
When I try to read my Avro input data I get a "Schema unknown" error:
bash-3.2$ pig
12/06/15 08:48:08 WARN pig.Main: Cannot write to log
file: /usr/lib/pig/pig_1339750088923.log
2012-06-15 08:48:09,415 [main] INFO
org.apache.pig.
Correct; I don't think there is a good way to do that except perhaps
by inserting "exec" statements to separate parts of the script that
you need to execute with the different settings.
D
On Wed, Jun 13, 2012 at 11:08 PM, Yang wrote:
> thanks,
>
> I tried, but it does not seem to work, even aft
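A sketch of the exec approach: my understanding is that a bare exec breaks multi-query batching, so the statements above it run before the settings below take effect. Paths and values are hypothetical.

```pig
SET default_parallel 20;
a = LOAD 'big_input';
STORE a INTO 'out_big';

-- Force execution of everything above under the first settings
exec;

SET default_parallel 2;
b = LOAD 'small_input';
STORE b INTO 'out_small';
```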