Based on what I have in mind, the usage will just be: mahout vectorize -i s3://input -o s3://output -tmp hdfs://file (here there is a risk of hard-coding an exact path without knowing the Hadoop user; I would have preferred a relative path).
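As a rough sketch of what I mean by a relative temp path (this is not the actual Mahout driver code, and the class/method names are placeholders): a relative Path handed to Hadoop resolves against the filesystem's working directory, which on HDFS defaults to /user/&lt;current user&gt;, so the caller never has to spell out the user's home directory.

    // Hypothetical helper, not part of Mahout: resolve a possibly-relative
    // -tmp argument against the default FileSystem's working directory.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TempPathResolver {
      public static Path resolveTemp(Configuration conf, String tmpArg) throws Exception {
        FileSystem fs = FileSystem.get(conf);  // default FS, e.g. hdfs://namenode/
        Path tmp = new Path(tmpArg);           // may be relative, e.g. "temp/vectorize"
        // Relative paths are qualified against the working directory,
        // which on HDFS is /user/<user> by default, so
        // "temp/vectorize" becomes hdfs://namenode/user/<user>/temp/vectorize.
        return fs.makeQualified(tmp);
      }
    }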
On Thu, Mar 4, 2010 at 8:32 PM, Drew Farris <drew.far...@gmail.com> wrote:
> Interesting. At some point, can you post a patch that shows how this is
> done?
>
> On Mar 4, 2010 8:46 AM, "Robin Anil" <robin.a...@gmail.com> wrote:
>
> Peter below says that we need to put the temp files on HDFS. Currently
> Mahout uses the input/output format; this goes for all the jobs at the
> moment.
>
> With 0.4 we have to move to an input/temp/output format for all the jobs,
> so that temp can be put on HDFS and the rest on s3n.
>
> Robin
>
> On Thu, Mar 4, 2010 at 3:59 AM, Sirota, Peter <sir...@amazon.com> wrote:
>
> > The job flow step consists of at least two jobs. The intermediate data
> > between the jobs is being stored in Amazon S3. It would run faster if the
> > intermediate results were written to HDFS. The upload from the tasks to S3
> > experienced a connection reset. This caused the upload to be repeated,
> > which took more time, and it occurred at the end of the task, while the
> > task was reporting itself as complete. The upload occurred on the slave
> > machines.
> >
> > The best solution here is to increase the number of map tasks to make the
> > individual writes smaller and to have the intermediate data stored in
> > HDFS.
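To make the input/temp/output split concrete, a two-stage driver could be shaped roughly like the sketch below. This is only an illustration of the idea, not a patch: the class name, job names, and the absence of mapper/reducer setup are placeholders, and only the path wiring matters here (intermediate data on HDFS, final output back on s3n).

    // Hypothetical two-stage driver: stage 1 reads from S3 and writes its
    // intermediate results to an HDFS temp path; stage 2 reads those
    // intermediates and writes only the final output back to S3.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TwoStageDriver {
      public static void run(Configuration conf, Path input, Path temp, Path output)
          throws Exception {
        // Stage 1: s3n://... input -> hdfs://... temp
        Job first = new Job(conf, "stage-1");
        // (a real job would also set mapper, reducer, and output format classes)
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, temp);
        if (!first.waitForCompletion(true)) {
          throw new IllegalStateException("stage-1 failed");
        }

        // Stage 2: hdfs://... temp -> s3n://... output
        Job second = new Job(conf, "stage-2");
        FileInputFormat.addInputPath(second, temp);
        FileOutputFormat.setOutputPath(second, output);
        if (!second.waitForCompletion(true)) {
          throw new IllegalStateException("stage-2 failed");
        }
      }
    }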