Multi-way outer joins

2011-03-30 Thread Josh Devins
Hey all, I have several relations that all have the same keys. Something like: A = id,countA B = id,countB C = id,countC ... I would like to do a multi-way, full outer join such that any tuple that is null or non-existent is still joined. The approach right now is to do it stepwise, as the Pig…
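Since the post is cut off, here is a hedged sketch of the usual COGROUP-based workaround (COGROUP keeps every key present in any input, unlike Pig's two-relation OUTER JOIN). The relation and field names come from the post; the zero-substitution and use of SUM are assumptions about what "still joined" should produce:

```pig
-- A: (id, countA), B: (id, countB), C: (id, countC)
-- COGROUP retains every id appearing in ANY input; an input with no
-- tuple for a given id contributes an empty bag, giving full-outer semantics.
G = COGROUP A BY id, B BY id, C BY id;

-- Flatten back out, substituting 0 where a relation had no tuple for the id.
J = FOREACH G GENERATE
        group AS id,
        (IsEmpty(A) ? 0L : SUM(A.countA)) AS countA,
        (IsEmpty(B) ? 0L : SUM(B.countB)) AS countB,
        (IsEmpty(C) ? 0L : SUM(C.countC)) AS countC;
```

This avoids chaining pairwise outer joins, which needs one MR join stage per relation added.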

Re: Storing and reporting off Pig data

2011-03-23 Thread Josh Devins
Hey Jon, I think a common approach is to use Pig (and MR/Hadoop in general) as purely the heavy lifter, doing all the merge-downs, aggregations and such of the data. At Nokia we tend to output a lot of data from Pig/MR as TSV or CSV (using PigStorage) and then use Sqoop to push that into a MySQL D…
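A minimal sketch of the pipeline described, assuming names throughout (the relation, paths, table, and connection string are all hypothetical):

```pig
-- In Pig: write the aggregated relation as tab-separated text.
STORE daily_agg INTO '/reports/daily_agg' USING PigStorage('\t');

-- Then, outside Pig, export into MySQL with Sqoop, e.g.:
--   sqoop export --connect jdbc:mysql://dbhost/reports --table daily_agg \
--       --export-dir /reports/daily_agg --input-fields-terminated-by '\t'
```

The reporting/BI layer then queries MySQL, keeping Hadoop out of the serving path.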

Re: Incorrect header or version mismatch

2011-03-22 Thread Josh Devins
Hey Dan, this usually means that you have mismatched Hadoop jar versions somewhere. I encountered a similar problem with Oozie trying to talk to HDFS. Maybe try posting to the Hadoop user list as well. In general, you should just need the same hadoop-core.jar as on your cluster when you run Pig. Fr…

Re: PiggyBank official repo

2011-03-13 Thread Josh Devins
…March 2011 10:38, Dan Brickley wrote: > On 13 March 2011 10:35, Josh Devins wrote: > > Hey guys, I'm still seeing references (http://wiki.apache.org/pig/PiggyBank) to PiggyBank being in the contrib module in SVN. What is the official PiggyBank…

PiggyBank official repo

2011-03-13 Thread Josh Devins
Hey guys, I'm still seeing references (http://wiki.apache.org/pig/PiggyBank) to PiggyBank being in the contrib module in SVN. What is the official PiggyBank repo at the moment: GitHub/wilbur/Piggybank or Apache SVN contrib/piggybank? Are they out of sync at the moment/which is the current authorit…

Re: Percentile UDF

2011-03-11 Thread Josh Devins
> …That's the path I started down today, I don't suppose the UDF you wrote is in the public domain at all - would you consider contributing it to piggybank.jar at all? How does it fare with large datasets as…

Re: Percentile UDF

2011-03-10 Thread Josh Devins
Hey Jon, I wrote one not long ago that just relies on Apache Commons Math underlying the UDF. It's pretty straightforward as the Percentile implementation will sort your numbers before doing percentile calculations. The UDF then just needs to take a bag/tuple, pull out all the fields as double[] a…
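Usage from a script might look like the sketch below. The UDF name `myudfs.Percentile` and its argument order are hypothetical (the original code isn't shown); note that Commons Math's `Percentile` takes the percentile as a value in (0, 100], not a fraction:

```pig
REGISTER myudfs.jar;  -- hypothetical jar containing the UDF described

grouped = GROUP latencies ALL;
-- The UDF receives the whole bag, copies the values into a double[],
-- and delegates to org.apache.commons.math.stat.descriptive.rank.Percentile.
p95 = FOREACH grouped GENERATE myudfs.Percentile(latencies.ms, 95.0);
```

Since `Percentile` sorts a full in-memory copy, very large bags on a single reducer are the main scaling concern.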

Re: Moving fields to map

2010-11-27 Thread Josh Devins
…in Pig this way. You can of course run your data through a UDF that would take a tuple whose first argument is a list of key names, and invoke it like so: jsonStore = FOREACH thing GENERATE toMap('id foo bar', *) AS json:map[]; -D On Thu, Nov 25…

Moving fields to map

2010-11-25 Thread Josh Devins
Hi all, I have a simple schema that I want to store as JSON. So I've written a simple JsonStorage class, but it requires that the tuple's first field is a map. The problem is in converting a regular tuple into a map: DESCRIBE thing; > thing: {id: chararray, field1: chararray, field2: chararray} W…
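For reference, later Pig releases ship a builtin TOMAP covering exactly this conversion (at the time of this thread a custom UDF was needed). A sketch against the schema in the post; the output alias is made up:

```pig
-- thing: {id: chararray, field1: chararray, field2: chararray}
-- TOMAP pairs literal key names with field values, yielding a map
-- (builtin in Pig 0.9+; older versions need an equivalent UDF).
withMap = FOREACH thing GENERATE
    TOMAP('id', id, 'field1', field1, 'field2', field2) AS m:map[];
```

The resulting relation has a map as its first (and only) field, matching what the custom JsonStorage described here expects.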

Re: Implementation of ORDER and LIMIT

2010-11-18 Thread Josh Devins
>> http://pig.apache.org/docs/r0.7.0/api/ >> PigServer is the starting point, and internally will have formations of logical/physical plans of jobs. The execution engine executes the job. Refer to files under o.a.p.backend.hadoop.executionengine. More details under http…

Implementation of ORDER and LIMIT

2010-11-14 Thread Josh Devins
Hi all, I'm happily using Pig to ORDER BY and LIMIT some large relations quite effectively. However I'm curious about how these are/would be implemented in "raw" MapReduce. Can anyone shed some light/point to some details, examples or pseudo-code somewhere? Cheers, Josh
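For context, a sketch of how Pig itself compiles these operators (based on how Pig's planner is generally described; details vary by version):

```pig
-- ORDER BY compiles to two MR jobs: a lightweight sampling job that
-- estimates the key distribution, then the actual sort using a range
-- partitioner built from that sample, so the reducers' outputs
-- concatenate into one totally ordered result.
ranked = ORDER pages BY views DESC;

-- A LIMIT following ORDER BY can be pushed down: each task keeps only
-- its local top N, and a final pass selects the global top N.
top10 = LIMIT ranked 10;
```

Hand-rolled MR equivalents are a TotalOrderPartitioner-style sampled sort for ORDER, and a per-mapper/per-reducer top-k pass for LIMIT.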

Pig counters and UDF_WARNING_1

2010-11-10 Thread Josh Devins
I just ran a Pig job and for the first time noticed the output at the end of the job (and of course a matching counter): Encountered Warning UDF_WARNING_1 108939522 time(s) What exactly does this count refer to and is there any way to find out what the actual warning is about? I've checked the jo…

Re: JUnit & Pig Script

2010-10-20 Thread Josh Devins
…a Pig 8 solution? > On Oct 20, 2010, at 10:22 AM, Josh Devins wrote: > > You might want to also have a look at Pig trunk/0.8.0 since there's already some work been done on this topic. > > https://issues.apache.org/jira/browse/PIG-1404 …

Re: JUnit & Pig Script

2010-10-20 Thread Josh Devins
You might want to also have a look at Pig trunk/0.8.0 since some work has already been done on this topic. https://issues.apache.org/jira/browse/PIG-1404 Cheers, Josh On 20 October 2010 16:58, Dave Wellman wrote: > All, > > I have a solution for writing unit tests in Java to test pig scri…

Re: Built-in counters

2010-10-18 Thread Josh Devins
Ah, sorry, just saw that this should read: PigStatusReporter.getInstance(), and there is no special counters keyword/variable. However, is this common for Pig, being able to access static methods directly from within a Pig script? Thanks, Josh On 18 October 2010 11:56, Josh Devins wrote…

Re: Built-in counters

2010-10-18 Thread Josh Devins
…rds anywhere. > I am not sure what you mean by 3) -- you can just increment counters: PigStatusReporter.getInstance().getCounter(myEnum).increment(1L); > (watch out for a null reporter when you are still on the client side). > -D > On Sat, Oct 16, 2010 at 2:2…

Built-in counters

2010-10-17 Thread Josh Devins
I've seen a few threads about counters, PigStats, Elephant-Bird's stats utility class, etc. http://www.mail-archive.com/pig-u...@hadoop.apache.org/msg00900.html http://www.mail-archive.com/user%40pig.apache.org/msg00034.html Has any progress been made on this or to provide a comprehensive stats/c…

Re: Converting an inner bag

2010-10-08 Thread Josh Devins
Crap, of course :) Many thanks, that did the job. Cheers, Josh On 8 October 2010 19:12, Mehmet Tepedelenlioglu wrote: > I = foreach A generate group, flatten(items); > I believe that should do it. > On Oct 8, 2010, at 9:13 AM, Josh Devins wrote: > > I have a sim…

Converting an inner bag

2010-10-08 Thread Josh Devins
I have a simple schema that contains an inner bag. Essentially, for each tuple in the inner bag, I need to create a new tuple in a new outer bag. This is easier shown than explained! Consider the following schema and data: DESCRIBE A; A: {id: chararray, items: {item: chara…
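The FLATTEN answer in the reply above is the idiomatic fix; spelled out against the (truncated) schema in this post, assuming the inner bag holds single-field tuples:

```pig
-- A: {id: chararray, items: {(item: chararray)}}
-- FLATTEN un-nests the inner bag: each inner tuple is crossed with the
-- outer fields, producing one top-level tuple per item.
I = FOREACH A GENERATE id, FLATTEN(items);
-- I: {id: chararray, items::item: chararray}
```

No nested FOREACH or UDF is needed; FLATTEN on a bag does the tuple-per-item expansion directly.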