Yeah, unfortunately your suggestion does not work, and neither does the order given on the Pig wiki. Instead, see the Hadoop wiki for -libjars usage:

hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars mylib.jar input output

So I tried this:

hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat -libjars $zipfjar s:8:50:z:0

However, the DataGenerator does not accept -libjars as one of its options:

---------
Couldn't parse the command line arguments, Found unknown option (-libjars) at position 5
---------

I'd be happy (and surprised) to hear from anyone who can use the format given on the Pig wiki for the DataGenerator in cluster mode (using the -m parameter).

Any more suggestions, Dmitriy? Thanks for your help, it's much appreciated!

Rob

2010/1/14 Dmitriy Ryaboy <dvrya...@gmail.com>

> Sorry if I am not reading carefully enough -- but the bug report you
> cite seems to indicate you want
>
> hadoop jar org.apache.pig.test.utils.datagen.DataGenerator -libjars
> $zipfjar $datagenjar -conf $conf_file -rows
> 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
>
> (possibly separating zipfjar and datagenjar with commas if that patch
> was applied to your version of 20)
>
> which I don't see in the list of things you tried?
>
> -D
>
> On Thu, Jan 14, 2010 at 10:13 AM, Rob Stewart
> <robstewar...@googlemail.com> wrote:
> > Hi Dmitriy,
> >
> > No, I do think that there was a change in 0.20.0
> >
> > See the error I get:
> > Exception in thread "main" java.io.IOException: Error opening job jar:
> > -libjars
> >
> > This is what I am trying to run:
> > hadoop jar -libjars $zipfjar $datagenjar
> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
> > 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
> >
> > The $zipfjar has only one jar file in this classpath. It seems that there
> > was a change to hadoop 0.20.0, not allowing for the option -libjars
> > immediately after "hadoop jar".
> >
> > This is the extract from the Hive bug report I was talking about:
> > -------------
> >
> > In hadoop-20 - the -libjars has to come after the jar file/class
> >
> > Please try applying this patch to bin/ext/cli.sh
> >
> > --- cli.sh (revision 789726)
> > +++ cli.sh (working copy)
> > @@ -10,7 +10,7 @@
> > exit 3;
> > fi
> >
> > - exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hive_cli.jar $CLASS $HIVE_OPTS "$@"
> > + exec $HADOOP jar ${HIVE_LIB}/hive_cli.jar $CLASS $AUX_JARS_CMD_LINE $HIVE_OPTS "$@"
> > }
> >
> > ----------------
> >
> > I have also tried:
> > hadoop jar -libjars [full_location_to_sdsuLibJKD14.jar] $datagenjar
> > org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -rows
> > 10000000 -f /scratch/tmpHDFS_files/wordsx1_skewed.dat s:8:50:z:0
> >
> > This gives the same error.
> >
> > Rob
> >
> > 2010/1/14 Dmitriy Ryaboy <dvrya...@gmail.com>
> >
> >> I think the link you sent got malformatted, but try separating the
> >> jars with a comma
> >> http://issues.apache.org/jira/browse/HADOOP-4864
> >>
> >> On Thu, Jan 14, 2010 at 7:40 AM, Rob Stewart
> >> <robstewar...@googlemail.com> wrote:
> >> > Hi Dmitriy,
> >> >
> >> > OK, well it seems that since 0.20.0 the order as specified on the Pig wiki
> >> > is no longer relevant:
> >> > hadoop jar -libjars $zipfjar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file [options] colspec...
> >> >
> >> > See this patch over at Hive for 0.20.0:
> >> > http://mail-archives.apache.org/mod_mbox/hadoop-hive-user/200907.mbox/<dfd95197f3ae8c45b0a96c2f4ba3a2556c8358c...@sc-mbxc1.thefacebook.com>
> >> >
> >> > I have tried a few combinations, but I can't seem to fit in the "-libjars
> >> > $zipfjar" in anywhere now.
> >> >
> >> > Any ideas?
> >> >
> >> > Thanks for your help.
> >> >
> >> > Rob
> >> >
> >> > 2010/1/14 Dmitriy Ryaboy <dvrya...@gmail.com>
> >> >
> >> >> Rob,
> >> >> You need to tell Hadoop which jars you need it to ship to the worker
> >> >> nodes. You include datagen.jar, etc, on the classpath, which makes
> >> >> them discoverable locally, but you aren't telling Hadoop to ship them.
> >> >> You want to list them, comma-separated, in the -libjars parameter.
> >> >>
> >> >> -D
> >> >>
> >> >> On Thu, Jan 14, 2010 at 6:49 AM, Rob Stewart
> >> >> <robstewar...@googlemail.com> wrote:
> >> >> > Hi there.
> >> >> >
> >> >> > I am well underway with comparing Pig, Hive, JAQL etc...
> >> >> >
> >> >> > The DataGenerator is proving a valuable tool for me. Thanks for that.
> >> >> >
> >> >> > I have one query. I am able to use it in local mode, no problem, and some
> >> >> > experiments are complete.
> >> >> >
> >> >> > However, I cannot seem to use it in MapReduce mode on the cluster. This is
> >> >> > my file "generateData" contents:
> >> >> > ------------------
> >> >> > export pigjar=$HOME/installation/pig/pig-0.5.0/pig-0.5.0-core.jar
> >> >> > export zipfjar=$HOME/installation/pig/pig-0.5.0/sdsuLibJKD14.jar
> >> >> > export datagenjar=$HOME/rs46/installation/DataGenerator/dist/MyPig.jar
> >> >> > export conf_file=/usr/lib/hadoop/conf/hadoop-site.xml
> >> >> > export HADOOP_CLASSPATH=$pigjar:$zipfjar:$datagenjar
> >> >> > /usr/lib/hadoop/bin/hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -conf $conf_file -m 1 -rows 10000000 -f words.dat s:8:50:z:0
> >> >> > ------------------
> >> >> >
> >> >> > The error I receive when trying to run it with "-m 1" option (in cluster
> >> >> > mode):
> >> >> > Caused by: java.lang.ClassNotFoundException: sdsu.algorithms.data.Zipf
> >> >> >
> >> >> > So in local mode, it successfully picks up the jar file sdsuLibJKD14.jar,
> >> >> > but when running it in cluster mode, this classpath is not found?
> >> >> >
> >> >> > thanks.
> >> >> >
> >> >> > Rob Stewart
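
For reference, the two pieces of advice in this thread (list the jars to ship comma-separated in -libjars, and put -libjars after the jar file and class as on the Hadoop wiki) can be combined in a small sketch. This is only an illustration under assumptions: the /tmp paths and the join_jars helper are hypothetical, and the thread does not confirm that DataGenerator hands its arguments to Hadoop's generic option parsing, so the printed command may still be rejected by it. The script only prints the command rather than running it.

```shell
#!/bin/sh
# join_jars is a hypothetical helper: it joins its arguments with commas,
# the separator suggested in the thread for the -libjars value.
# The function body runs in a subshell so the IFS change stays local.
join_jars() (
  IFS=','
  printf '%s\n' "$*"
)

# Hypothetical paths standing in for the variables used in the thread.
zipfjar=/tmp/sdsuLibJKD14.jar
pigjar=/tmp/pig-0.5.0-core.jar
datagenjar=/tmp/MyPig.jar

libjars=$(join_jars "$zipfjar" "$pigjar")

# Order follows the Hadoop wiki example: jar, main class, generic options
# such as -libjars, then the program's own options.
# Assumption: the main class accepts generic options at this position.
cmd="hadoop jar $datagenjar org.apache.pig.test.utils.datagen.DataGenerator -libjars $libjars -m 1 -rows 10000000 -f words.dat s:8:50:z:0"

# Print instead of executing, so the sketch runs without a Hadoop install.
echo "$cmd"
```

Printing the command first makes it easy to check the argument order before submitting anything to the cluster.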