[ https://issues.apache.org/jira/browse/MAHOUT-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil resolved MAHOUT-598.
-------------------------------

    Resolution: Cannot Reproduce
    
> Downstream steps in the seq2sparse job flow looking in wrong location for output from previous steps when running in Elastic MapReduce (EMR) cluster
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-598
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-598
>             Project: Mahout
>          Issue Type: Bug
>          Components: Integration
>    Affects Versions: 0.4, 0.5
>         Environment: seq2sparse, Mahout 0.4, S3, EMR, Hadoop 0.20.2
>            Reporter: Timothy Potter
>            Assignee: Grant Ingersoll
>             Fix For: 0.8
>
>
> While working on MAHOUT-588, I discovered an issue with the seq2sparse job running on EMR. From what I can tell, this job is made up of multiple MR steps, and downstream steps expect the output of previous steps to be in HDFS even though the job is configured to write to S3 (see errors below). For example, DictionaryVectorizer wrote "dictionary.file-0" to S3, but TFPartialVectorReducer looks for it in HDFS.
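> I haven't fully traced the Mahout code, but the bare path in the exception below ("/asf-mail-archives/..." with the s3n:// scheme and bucket stripped) suggests the dictionary Path is being rebuilt from URI.getPath(), which drops the scheme and authority, so the lookup falls through to the default FileSystem (HDFS on EMR). Here is a minimal sketch of the suspected pattern -- hypothetical, not copied from TFPartialVectorReducer:
> import java.net.URI;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.filecache.DistributedCache;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.SequenceFile;
>
> public class DictionaryLookupSketch {
>   static SequenceFile.Reader openDictionary(Configuration conf) throws Exception {
>     URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
>     // Suspected pattern: URI.getPath() turns
>     // "s3n://thelabdude/.../dictionary.file-0" into
>     // "/asf-mail-archives/.../dictionary.file-0" (the bucket is the URI
>     // authority, so it is dropped), and FileSystem.get(conf) then resolves
>     // the bare path against the default FS, i.e. HDFS on EMR:
>     Path bare = new Path(cacheFiles[0].getPath());
>     FileSystem defaultFs = FileSystem.get(conf);
>     // new SequenceFile.Reader(defaultFs, bare, conf);  // -> FileNotFoundException
>
>     // Keeping the full URI and asking the Path for its own FileSystem
>     // would resolve s3n:// paths against S3 instead:
>     Path full = new Path(cacheFiles[0].toString());
>     return new SequenceFile.Reader(full.getFileSystem(conf), full, conf);
>   }
> }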
> To run this job, I spin up an EMR cluster and then add the following step to it (this is using the elastic-mapreduce-ruby tool):
> elastic-mapreduce --jar s3n://thelabdude/mahout-core-0.4-job.jar \
> --main-class org.apache.mahout.driver.MahoutDriver \
> --arg seq2sparse \
> --arg -i --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files-sm/ \
> --arg -o --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/ \
> --arg --weight --arg tfidf \
> --arg --chunkSize --arg 200 \
> --arg --minSupport --arg 2 \
> --arg --minDF --arg 1 \
> --arg --maxDFPercent --arg 90 \
> --arg --norm --arg 2 \
> --arg --maxNGramSize --arg 2 \
> --arg --overwrite \
> -j JOB_ID
> With these parameters, I see the following errors in the Hadoop logs:
> java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
>       at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>       at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
>       at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
>       at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
>       at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>       at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
>       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
>       at org.apache.hadoop.mapred.Child.main(Child.java:170)
> (the same FileNotFoundException and stack trace repeat for two more failed reduce task attempts)
> Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/partial-vectors-0
>       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
>       at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
>       at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>       at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:933)
>       at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:827)
>       at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>       at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>       at org.apache.mahout.vectorizer.common.PartialVectorMerger.mergePartialVectors(PartialVectorMerger.java:126)
>       at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:176)
>       at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:253)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>       at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>       at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> The final InvalidInputException looks like a knock-on failure: the reduce tasks died, so partial-vectors-0 was presumably never written for PartialVectorMerger to read. I don't think this is a "config" error on my side, because if I change the -o argument to a plain HDFS path:
> /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/
> then the job completes successfully, except that the output is now stored in HDFS and not S3. After the job completes successfully, if I SSH into the EMR master server, I see the following output as expected:
> hadoop@ip-10-170-93-177:~$ hadoop fs -lsr /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/
> drwxr-xr-x   - hadoop supergroup          0 2011-01-24 23:44 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count
> -rw-r--r--   1 hadoop supergroup      26893 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00000
> -rw-r--r--   1 hadoop supergroup      26913 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00001
> -rw-r--r--   1 hadoop supergroup      26893 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00002
> -rw-r--r--   1 hadoop supergroup     104874 2011-01-24 23:42 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
> -rw-r--r--   1 hadoop supergroup      80493 2011-01-24 23:44 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/frequency.file-0
> drwxr-xr-x   - hadoop supergroup          0 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/tf-vectors
> /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/tf-vectors/part-r-00000
> ...
> A path with no scheme resolves against fs.default.name, which on an EMR cluster is the cluster-local HDFS, so that part is expected. The workaround for now is to write all output to HDFS, then SSH into the master server once the job completes and copy the output up to S3 (see the sketch below).
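> For reference, a sketch of that copy step, assuming the same bucket and paths as above; hadoop distcp accepts s3n:// URIs on this Hadoop version, and hadoop fs -cp with full URIs would also work for a small output tree:
> hadoop distcp /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm \
>   s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm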

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
