[ https://issues.apache.org/jira/browse/MAHOUT-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robin Anil resolved MAHOUT-598.
-------------------------------

    Resolution: Cannot Reproduce

> Downstream steps in the seq2sparse job flow looking in wrong location for output from previous steps when running in Elastic MapReduce (EMR) cluster
> ----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-598
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-598
>             Project: Mahout
>          Issue Type: Bug
>          Components: Integration
>    Affects Versions: 0.4, 0.5
>         Environment: seq2sparse, Mahout 0.4, S3, EMR, Hadoop 0.20.2
>            Reporter: Timothy Potter
>            Assignee: Grant Ingersoll
>             Fix For: 0.8
>
>
> While working on MAHOUT-588, I've discovered an issue with the seq2sparse
> job running on EMR. From what I can tell, this job is made up of multiple
> MR steps, and downstream steps expect the output of previous steps to be in
> HDFS, but the output is in S3 (see errors below). For example, the
> DictionaryVectorizer wrote "dictionary.file-0" to S3, but
> TFPartialVectorReducer is looking for it in HDFS.
>
> To run this job, I spin up an EMR cluster and then add the following step
> to it (this is using the elastic-mapreduce-ruby tool):
>
> elastic-mapreduce --jar s3n://thelabdude/mahout-core-0.4-job.jar \
> --main-class org.apache.mahout.driver.MahoutDriver \
> --arg seq2sparse \
> --arg -i --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/sequence-files-sm/ \
> --arg -o --arg s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/ \
> --arg --weight --arg tfidf \
> --arg --chunkSize --arg 200 \
> --arg --minSupport --arg 2 \
> --arg --minDF --arg 1 \
> --arg --maxDFPercent --arg 90 \
> --arg --norm --arg 2 \
> --arg --maxNGramSize --arg 2 \
> --arg --overwrite \
> -j JOB_ID
>
> With these parameters, I see the following errors in the hadoop logs:
>
> java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>     at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
>     at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>     at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
>     at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> java.io.FileNotFoundException: File does not exist: /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>     at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:716)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1476)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1471)
>     at org.apache.mahout.vectorizer.term.TFPartialVectorReducer.setup(TFPartialVectorReducer.java:126)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:575)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:412)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/partial-vectors-0
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
>     at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
>     at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
>     at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:933)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:827)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>     at org.apache.mahout.vectorizer.common.PartialVectorMerger.mergePartialVectors(PartialVectorMerger.java:126)
>     at org.apache.mahout.vectorizer.DictionaryVectorizer.createTermFrequencyVectors(DictionaryVectorizer.java:176)
>     at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:253)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>
> I don't think this is a "config" error on my side because if I change the
> -o argument to:
>
> /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/
>
> then the job completes successfully, except the output is now stored in
> HDFS and not S3.
> After the job completes successfully, if I SSH into the EMR master server,
> then I see the following output as expected:
>
> hadoop@ip-10-170-93-177:~$ hadoop fs -lsr /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/
> drwxr-xr-x   - hadoop supergroup      0 2011-01-24 23:44 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count
> -rw-r--r--   1 hadoop supergroup  26893 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00000
> -rw-r--r--   1 hadoop supergroup  26913 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00001
> -rw-r--r--   1 hadoop supergroup  26893 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/df-count/part-r-00002
> -rw-r--r--   1 hadoop supergroup 104874 2011-01-24 23:42 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
> -rw-r--r--   1 hadoop supergroup  80493 2011-01-24 23:44 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/frequency.file-0
> drwxr-xr-x   - hadoop supergroup      0 2011-01-24 23:43 /thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/tf-vectors
> /part-r-00000
> ...
>
> The work-around is to just write all output to HDFS and then SSH into the
> master server once the job completes and then copy the output to S3.
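
A note on the paths in the traces above: the path that HDFS reports as missing, /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0, is exactly what remains of the S3 output location once its scheme (s3n) and bucket (thelabdude) are dropped. The minimal Java sketch below uses only java.net.URI to show that decomposition. It illustrates the symptom and is not a confirmed root cause (the issue was resolved as Cannot Reproduce). The class and method names are hypothetical, and hdfs://namenode:9000/ is a placeholder for the cluster's default filesystem, not a value taken from the report.

```java
import java.net.URI;

// Hypothetical illustration only; this is not Mahout or Hadoop code.
public class PathResolutionSketch {

    // Returns only the path component of a location, i.e. the location
    // with its scheme and authority (here, the S3 bucket) removed.
    static String pathOnly(String location) {
        return URI.create(location).getPath();
    }

    // Resolves a scheme-less absolute path against a default filesystem URI.
    // The default filesystem value passed in is an assumed placeholder.
    static String resolveAgainstDefault(String defaultFs, String path) {
        return URI.create(defaultFs).resolve(path).toString();
    }

    public static void main(String[] args) {
        String s3Location =
            "s3n://thelabdude/asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0";

        String stripped = pathOnly(s3Location);
        // prints /asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
        System.out.println(stripped);

        // prints hdfs://namenode:9000/asf-mail-archives/mahout-0.4/vectors-sm/dictionary.file-0
        System.out.println(resolveAgainstDefault("hdfs://namenode:9000/", stripped));
    }
}
```

If a downstream step opened the stripped path through the cluster's default filesystem instead of the filesystem of the output URI, it would look in HDFS, which would be consistent with the DistributedFileSystem frames in the trace. That remains an assumption about the cause, not something the report establishes.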