Interesting. At some point, can you post a patch that shows how this is done?


On Mar 4, 2010 8:46 AM, "Robin Anil" <robin.a...@gmail.com> wrote:

Peter (below) says that we need to put the temp files on HDFS. Currently
Mahout jobs take only an input path and an output path; this goes for all
the jobs at the moment.

With 0.4 we should move to an input/temp/output convention for all jobs, so
that the temp data can be put on HDFS and the rest on s3n.


Robin
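
Not a real patch, but a minimal sketch of the idea Robin describes, written
against the plain Hadoop 0.20 API rather than Mahout's own driver classes: the
driver takes three separate paths, chains two jobs, and keeps only the
intermediate data on HDFS. The class name, path arguments, and the
identity-style passes are placeholders, not anything that exists in Mahout
today.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputTempOutputDriver {
  public static void main(String[] args) throws Exception {
    // e.g. s3n://bucket/input  hdfs:///tmp/job-intermediate  s3n://bucket/output
    Path input = new Path(args[0]);
    Path temp = new Path(args[1]);
    Path output = new Path(args[2]);
    Configuration conf = new Configuration();

    // Pass 1: read the input (possibly on s3n) and write the intermediate
    // data to the temp path, which should be an hdfs:// URI on the cluster.
    Job pass1 = new Job(conf, "pass-1");
    pass1.setJarByClass(InputTempOutputDriver.class);
    // set the real Mapper/Reducer and key/value classes for pass 1 here
    FileInputFormat.addInputPath(pass1, input);
    FileOutputFormat.setOutputPath(pass1, temp);
    if (!pass1.waitForCompletion(true)) {
      System.exit(1);
    }

    // Pass 2: read the intermediate data from HDFS and write the final
    // output (possibly back to s3n).
    Job pass2 = new Job(conf, "pass-2");
    pass2.setJarByClass(InputTempOutputDriver.class);
    // set the real Mapper/Reducer and key/value classes for pass 2 here
    FileInputFormat.addInputPath(pass2, temp);
    FileOutputFormat.setOutputPath(pass2, output);
    System.exit(pass2.waitForCompletion(true) ? 0 : 1);
  }
}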


On Thu, Mar 4, 2010 at 3:59 AM, Sirota, Peter <sir...@amazon.com> wrote:
>
>  The job flow step consists of at least two jobs. The intermediate data
> between the jobs is being stored in Amazon S3; it would run faster if the
> intermediate results were written to HDFS. The upload from the tasks to S3
> experienced a connection reset, which caused the upload to be repeated. The
> retry took additional time and happened at the end of the task, while the
> task was already reporting itself as complete. The upload occurred on the
> slave machines.
>
> The best solution here is to increase the number of map tasks, so that the
> individual writes are smaller, and to store the intermediate data in HDFS.
>
>
>
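
For reference, Peter's two suggestions would change the pass-1 setup in the
driver sketch above to something roughly like the fragment below. The numbers
are arbitrary and the property names are the 0.20-era ones (later Hadoop
releases renamed them), so treat this as illustrative only.

// "mapred.map.tasks" is only a hint; for file-based input the split size is
// what actually drives the number of map tasks, so cap it to get more,
// smaller maps (and hence smaller individual writes).
conf.setInt("mapred.map.tasks", 200);
conf.setLong("mapred.max.split.size", 64L * 1024 * 1024); // 64 MB splits

Job pass1 = new Job(conf, "pass-1");
// Intermediate results go to HDFS on the cluster; only the final pass
// should write to an s3n:// output path.
FileOutputFormat.setOutputPath(pass1, new Path("hdfs:///tmp/intermediate"));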
