Based on what I have in mind, the usage will just be:

mahout vectorize -i s3://input -o s3://output -tmp hdfs://file

(Here there is a risk of hardcoding an exact path without knowing the Hadoop
user; I would have preferred a relative path, resolved as sketched below.)
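A rough sketch of how a relative -tmp path could be resolved, using the
standard Hadoop FileSystem API; TempPathResolver is a hypothetical helper,
not existing Mahout code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical helper: resolve the -tmp argument so that a relative path
 * lands under the current Hadoop user's home directory instead of a
 * hard-coded absolute location.
 */
public final class TempPathResolver {

  private TempPathResolver() { }

  public static Path resolveTempPath(Configuration conf, String tmpArg)
      throws IOException {
    Path tmp = new Path(tmpArg);
    if (tmp.toUri().getScheme() != null || tmp.isAbsolute()) {
      // e.g. hdfs://namenode/tmp/vectorize -- use exactly as given
      return tmp;
    }
    // Relative path: qualify it against the user's home directory, e.g.
    // "vectorize-tmp" -> hdfs://namenode/user/<user>/vectorize-tmp
    FileSystem fs = FileSystem.get(conf);
    return new Path(fs.getHomeDirectory(), tmp);
  }
}

With that in place, "-tmp temp" would resolve per user, while
"-tmp hdfs://host/tmp/work" would still be honored verbatim.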


On Thu, Mar 4, 2010 at 8:32 PM, Drew Farris <drew.far...@gmail.com> wrote:

> Interesting. At some point, can you post a patch that shows how this is
> done?
>
>
> On Mar 4, 2010 8:46 AM, "Robin Anil" <robin.a...@gmail.com> wrote:
>
> Peter, below, says that we need to put the temp files on HDFS. Currently
> Mahout's jobs take only an input and an output path; this goes for all the
> jobs at the moment.
>
> With 0.4 we have to move to an input/temp/output format for all the jobs,
> so that temp data can be put on HDFS and the rest on s3n (a rough sketch of
> that split appears just below).
>
>
> Robin
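A minimal sketch of what that input/temp/output split could look like with
the stock Hadoop Job API; the driver class, pass names, and bucket paths are
hypothetical placeholders, not existing Mahout code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Two chained jobs: pass 1 reads from S3 and writes intermediate data to
// HDFS; pass 2 reads that temp data from HDFS and writes final output to S3.
public class TwoPassDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path("s3n://bucket/input");        // placeholder bucket
    Path temp = new Path("hdfs:///tmp/vectorize-tmp");  // intermediate on HDFS
    Path output = new Path("s3n://bucket/output");      // placeholder bucket

    Configuration conf = new Configuration();

    Job pass1 = new Job(conf, "vectorize-pass1");
    // ... set mapper/reducer and key/value classes for the first pass ...
    FileInputFormat.addInputPath(pass1, input);
    FileOutputFormat.setOutputPath(pass1, temp);    // temp stays on HDFS
    if (!pass1.waitForCompletion(true)) {
      System.exit(1);
    }

    Job pass2 = new Job(conf, "vectorize-pass2");
    // ... set mapper/reducer and key/value classes for the second pass ...
    FileInputFormat.addInputPath(pass2, temp);      // read temp from HDFS
    FileOutputFormat.setOutputPath(pass2, output);  // final output to s3n
    System.exit(pass2.waitForCompletion(true) ? 0 : 1);
  }
}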
>
>
> On Thu, Mar 4, 2010 at 3:59 AM, Sirota, Peter <sir...@amazon.com> wrote:
> >
> > The job flow step consists of at least two jobs, and the intermediate
> > data between the jobs is being stored in Amazon S3. It would run faster
> > if the intermediate results were written to HDFS. The upload from the
> > tasks to S3 experienced a connection reset, which caused the upload to
> > be repeated and took more time; this happened at the end of the task,
> > while the task was reporting itself as complete. The upload occurred on
> > the slave machines.
> >
> > The best solution here is to increase the number of map tasks to make
> > the individual writes smaller and to have the intermediate data stored
> > in HDFS.
> >
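For illustration, one way to get more (and therefore smaller-writing) map
tasks with the Hadoop 0.20 API is to lower the maximum input split size; the
class name and the 64 MB figure below are made up for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Cap the input split size so large input files are divided among
// proportionally more map tasks, each writing a smaller output piece.
public class MoreMappers {
  public static Job configure(Configuration conf) throws Exception {
    Job job = new Job(conf, "more-mappers");  // job name is a placeholder
    // 64 MB per split (value in bytes): a 1 GB file then gets ~16 mappers
    // instead of whatever the default block size would allow.
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    return job;
  }
}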
