Pig DataGenerator as a MR Job

2010-01-14 Thread Rob Stewart
Hi there. I am well underway with comparing Pig, Hive, JAQL etc... The DataGenerator is proving a valuable tool for me. Thanks for that. I have one query. I am able to use it in local mode, no problem, and some experiments are complete. However, I cannot seem to use it in MapReduce mode on the

Re: Piglet: a Ruby DSL for writing Pig scripts

2010-01-14 Thread Kevin Weil
Theo, this is awesome. We at Twitter are hoping to contribute to and extend the great work you've done. Kevin On Wed, Jan 13, 2010 at 10:01 PM, Theo Hultberg wrote: > Please do! > > T# > > On Thu, Jan 14, 2010 at 12:02 AM, Alan Gates wrote: > > Theo, > > > > This looks really interesting. Ca

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Dmitriy Ryaboy
Rob, You need to tell Hadoop which jars you need it to ship to the worker nodes. You include datagen.jar, etc, on the classpath, which makes them discoverable locally, but you aren't telling Hadoop to ship them. You want to list them, comma-separated, in the -libjars parameter. -D On Thu, Jan 14,
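
A minimal sketch of the comma-separated form Dmitriy describes: the jar paths are joined in the shell before being passed to -libjars. The helper name join_jars and the example paths are hypothetical and do not come from the thread.

```shell
#!/bin/sh
# Join any number of jar paths with commas, the form -libjars expects.
# join_jars is a hypothetical helper; it is not part of Hadoop or Pig.
join_jars() {
    joined=""
    for jar in "$@"; do
        if [ -z "$joined" ]; then
            joined="$jar"
        else
            joined="$joined,$jar"
        fi
    done
    printf '%s\n' "$joined"
}

# Placeholder paths, for illustration only.
libjars=$(join_jars /path/to/first.jar /path/to/second.jar)
echo "-libjars $libjars"
```

The jar given to `hadoop jar` itself is shipped as the job jar, so only the additional dependencies need to appear in the list.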

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Rob Stewart
Hi Dmitriy, OK, well it seems that since 0.20.0 the order as specified on the Pig wiki is no longer relevant: hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec... See this patch over at Hive for 0.20.0: http://mail-archives

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Dmitriy Ryaboy
I think the link you sent got malformatted, but try separating the jars with a comma http://issues.apache.org/jira/browse/HADOOP-4864 On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart wrote: > Hi Dmitriy, > > OK, well it seems that since 0.20.0 the order as specified on the Pig wiki > is no longer rel

Re: Piglet: a Ruby DSL for writing Pig scripts

2010-01-14 Thread Alan Gates
Done. Alan. On Jan 13, 2010, at 10:01 PM, Theo Hultberg wrote: Please do! T# On Thu, Jan 14, 2010 at 12:02 AM, Alan Gates wrote: Theo, This looks really interesting. Can I put a link to it on our page for tools use with Pig, http://wiki.apache.org/pig/PigTools ? Alan. On Jan 13, 2

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Rob Stewart
Hi Dmitriy, No, I do think that there was a change in 0.20.0 See the error I get: Exception in thread "main" java.io.IOException: Error opening job jar: -libjars This is what I am trying to run: hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Dmitriy Ryaboy
Sorry if I am not reading carefully enough -- but the bug report you cite seems to indicate you want hadoop jar org.apache.pig.test.utils.datagen.DataGenerator -libjars $zipfjar $datagenjar -conf $conf_file -rows 1000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0 (possibly separatin

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Rob Stewart
Yeah, unfortunately your suggestion does not work, and neither does the order given on the Pig wiki. Instead, see the Hadoop wiki for -libjars usage: hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar input output So I tried this: hadoop jar $datagenjar org.apache.pi

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Rob Stewart
Hello Dmitriy! I have it solved; it was just a bit of trial and error based on the Hive bug report/fix I found. The report is indeed correct, the following works: > hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -libjars $zipfjar -conf $conf_file -rows 1000 -m 3 -f /scr
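
The argument order reported to work above can be sketched as a dry run: job jar first, then the main class, then generic options such as -libjars and -conf, then the DataGenerator options. The function name datagen_cmd and every path are hypothetical placeholders; the -f target and column spec are left out because they are cut off in the archived message. Nothing is executed; the command line is only printed.

```shell
#!/bin/sh
# Dry run: print a hadoop invocation with the generic options placed
# AFTER the job jar and main class, as reported to work in the thread.
# datagen_cmd and all paths below are hypothetical placeholders.
datagen_cmd() {
    datagenjar=$1
    zipfjar=$2
    conf_file=$3
    shift 3   # whatever remains is passed through as DataGenerator options
    echo "hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator" \
         "-libjars $zipfjar -conf $conf_file $*"
}

datagen_cmd /path/to/datagen.jar /path/to/zipf.jar /path/to/conf.xml -rows 1000 -m 3
```

Putting -libjars directly after `hadoop jar` fails because the launcher treats the first argument as the job jar, which matches the "Error opening job jar: -libjars" message quoted earlier in the thread.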

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Alan Gates
Rob, Feel free to update the wiki with your findings. You don't have to be a committer to change the wiki. Alan. On Jan 14, 2010, at 12:15 PM, Rob Stewart wrote: Hello Dmitry! I have it solved, it was just a bit of trial and error based on the Hive bug report/fix I found. The report

Basic 'SUM' question

2010-01-14 Thread Scott
Hello fellow Pig users. I am brand new to Pig/hadoop, and am having trouble with something that I am guessing is very basic. I have a relation where I did a group by several values, then counted the groups. Here is a description of the relation: count_grouped: {g1: (site: chararray,tf: char

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Rob Stewart
Cheers Alan, Done. Rob. 2010/1/14 Alan Gates > Rob, > > Feel free to update the wiki with your findings. You don't have to be a > committer to change the wiki. > > Alan. > > > On Jan 14, 2010, at 12:15 PM, Rob Stewart wrote: > > Hello Dmitry! >> >> I have it solved, it was just a bit of tri

Re: Basic 'SUM' question

2010-01-14 Thread Alan Gates
A general sum with group all can be done as: A = load 'file' as (x, y); B = group A all; C = foreach B generate SUM(A.x); This will give you the sum of all x. But from the schema you show below I'm not sure this is what you're trying to do. Can you attach your script and an example record

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Dmitriy Ryaboy
Thanks for persevering Rob! :) -D On Thu, Jan 14, 2010 at 4:16 PM, Rob Stewart wrote: > Cheers Alan, > > Done. > > Rob. > > > 2010/1/14 Alan Gates > >> Rob, >> >> Feel free to update the wiki with your findings.  You don't have to be a >> committer to change the wiki. >> >> Alan. >> >> >> On J