You can ssh to the EMR cluster if you like.

On Wed, Dec 12, 2012 at 9:38 AM, hellen maziku <nahe...@yahoo.com> wrote:
> Thank you for the advice. But on my machine I do not have Hadoop
> installed. Running the jobs locally with Mahout gives me heap size
> errors, as seen from
> http://en.wikipedia.org/wiki/User:Bloodysnowrocker/Hadoop. I could only
> do recommendations locally; clustering and creating vectors weren't
> possible.
>
> Do you suggest I should use the EMR GUI to submit my jobs, or should I
> just install Hadoop on my machine or on EC2 and perform my tasks?
>
> ________________________________
> From: Ted Dunning <ted.dunn...@gmail.com>
> To: user@mahout.apache.org; hellen maziku <nahe...@yahoo.com>
> Sent: Wednesday, December 12, 2012 10:56 AM
> Subject: Re: Creating vectors from lucene index on EMR via the CLI
>
> I would still recommend that you switch to using the Mahout programs
> directly to submit jobs. Those programs really have an assumption baked
> in that they will be submitting the jobs themselves. The EMR commands
> that you are using take responsibility for creating the environment
> that you need for job submission, but you are probably not getting the
> command-line arguments through to the Mahout program in good order. As
> is typical with shell-script-based utilities, determining how to get
> those across correctly is probably somewhat difficult.
>
> On Wed, Dec 12, 2012 at 7:58 AM, hellen maziku <nahe...@yahoo.com> wrote:
>
> > Hi Ted,
> > If I am running it as a single step, then how come I can add more
> > steps to it? Currently there are 6 steps. Every time I get the
> > errors, I just add another step to the same job ID, so I don't
> > understand.
> >
> > Also, the command to create the job flow is
> >
> > ./elastic-mapreduce --create --alive --log-uri \
> >     s3n://mahout-output/logs/ --name dict_vectorize
> >
> > Doesn't that mean that keep-alive is set?
> > ________________________________
> > From: Ted Dunning <ted.dunn...@gmail.com>
> > To: user@mahout.apache.org; hellen maziku <nahe...@yahoo.com>
> > Sent: Wednesday, December 12, 2012 9:48 AM
> > Subject: Re: Creating vectors from lucene index on EMR via the CLI
> >
> > You are trying to run this job as a single step in an EMR flow.
> > Mahout's command-line programs assume that you are running against a
> > live cluster that will hang around (since many Mahout steps involve
> > more than one map-reduce).
> >
> > It would probably be best to separate the creation of the cluster
> > (with the keep-alive flag set) from the execution of the Mahout jobs,
> > with a subsequent explicit tear-down of the cluster.
> >
> > On Wed, Dec 12, 2012 at 3:55 AM, hellen maziku <nahe...@yahoo.com> wrote:
> >
> > > Hi,
> > > I installed Mahout and Solr.
> > >
> > > I created an index from dictionary.txt using the command below:
> > >
> > > curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
> > >     -F "myfile=@dictionary.txt"
> > >
> > > To create the vectors from my index, I needed the
> > > org.apache.mahout.utils.vectors.lucene.Driver class. I could not
> > > locate this class in mahout-core-0.7-job.jar; I could only locate
> > > it in mahout-examples-0.7-job.jar, so I uploaded
> > > mahout-examples-0.7-job.jar to an S3 bucket.
> > >
> > > I also uploaded the dictionary index to a separate S3 bucket. I
> > > created another bucket with two folders to store my dictOut and
> > > vectors.
> > > I created a job flow on the CLI:
> > >
> > > ./elastic-mapreduce --create --alive --log-uri \
> > >     s3n://mahout-output/logs/ --name dict_vectorize
> > >
> > > I added the step to vectorize my index using the following command:
> > >
> > > ./elastic-mapreduce -j j-2NSJRI6N9EQJ4 \
> > >     --jar s3n://mahout-bucket/jars/mahout-examples-0.7-job.jar \
> > >     --main-class org.apache.mahout.utils.vectors.lucene.Driver \
> > >     --arg --dir s3n://mahout-input/input1/index/ \
> > >     --arg --field doc1 \
> > >     --arg --dictOut s3n://mahout-output/solr-dict-out/dict.txt \
> > >     --arg --output s3n://mahout-output/solr-vect-out/vectors
> > >
> > > But in the logs I get the following error:
> > >
> > > 2012-12-12 09:37:17,883 ERROR org.apache.mahout.utils.vectors.lucene.Driver (main): Exception
> > > org.apache.commons.cli2.OptionException: Missing value(s) --dir
> > >     at org.apache.commons.cli2.option.ArgumentImpl.validate(ArgumentImpl.java:241)
> > >     at org.apache.commons.cli2.option.ParentImpl.validate(ParentImpl.java:124)
> > >     at org.apache.commons.cli2.option.DefaultOption.validate(DefaultOption.java:176)
> > >     at org.apache.commons.cli2.option.GroupImpl.validate(GroupImpl.java:265)
> > >     at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:104)
> > >     at org.apache.mahout.utils.vectors.lucene.Driver.main(Driver.java:197)
> > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >     at java.lang.reflect.Method.invoke(Method.java:597)
> > >     at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
> > >
> > > What am I doing wrong?
> > > Another question: what is the correct value of the --field argument,
> > > is it doc1 (the id) or dictionary (from the filename dictionary.txt)?
> > > I am asking this because when I issue the query with q=doc1 on Solr
> > > I get no results, but when I issue the query with q=dictionary, I
> > > see my content.
> > >
> > > Thank you so much for the help. I am a newbie, so please excuse my
> > > being too verbose.
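Ted's point about the arguments not getting through "in good order" can be made concrete. Assuming the old elastic-mapreduce Ruby CLI forwards to the jar's main class only those tokens explicitly wrapped in `--arg`, the step in the thread passes the `--dir` flag but not its S3 value, which is one likely reading of commons-cli2's "Missing value(s) --dir". The helper below is a hypothetical sketch (not part of either tool) that wraps every token, flags and values alike:

```shell
#!/bin/sh
# Hypothetical helper: emit "--arg <token>" for each token given, so
# that option values travel to the jar's main class the same way the
# option flags do. A value left bare after "--arg --dir" may be dropped
# by the CLI, leaving Driver's parser with a flag and no value.
emr_args() {
  out=""
  for token in "$@"; do
    out="${out}--arg ${token} "
  done
  printf '%s' "$out"
}

# Build the step arguments for the Lucene vector driver:
emr_args --dir s3n://mahout-input/input1/index/ \
         --field doc1 \
         --dictOut s3n://mahout-output/solr-dict-out/dict.txt \
         --output s3n://mahout-output/solr-vect-out/vectors
```

Alternatively, following the keep-alive workflow Ted describes, one can sidestep the `--arg` plumbing entirely: create the cluster with `--create --alive`, ssh to the master node, and run the driver there with `hadoop jar`, passing `--dir`, `--field`, `--dictOut`, and `--output` directly on the command line, then terminate the job flow explicitly when done.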