One thing to be careful about is the paths of dependent libraries or
executables, such as streaming binaries. In pseudo-distributed mode, all
processes run on the same machine, so they are likely to find paths that
exist only on the machine the job is launched from. When you run the same
jobs in a truly distributed environment, they will start failing unless
these files are packaged and distributed to the cluster in some way.
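
As a rough illustration of the point above, here is a minimal sketch in
plain Java (no Hadoop classes involved). The class and method names are
hypothetical, and it assumes that a file shipped with the job becomes
visible under its bare name in the task's working directory. The idea is
to flag absolute paths, which only mean something on the launching
machine, and to refer to the shipped copy by name instead.

```java
import java.io.File;

// Hypothetical helper, not part of Hadoop: it only illustrates referring
// to a shipped dependency by bare name instead of a machine-local path.
final class DependencyPaths {

    // True if the path is absolute, i.e. it names a location on one
    // particular machine's filesystem.
    static boolean isMachineLocal(String path) {
        return new File(path).isAbsolute();
    }

    // Bare file name, which is how a file shipped with the job would be
    // looked up relative to the task's working directory (assumption).
    static String taskRelativeName(String path) {
        return new File(path).getName();
    }

    public static void main(String[] args) {
        // Example path only; substitute whatever your job actually uses.
        String configured = "/home/user/bin/my-mapper";
        if (isMachineLocal(configured)) {
            System.out.println("Ship this file with the job and run ./"
                    + taskRelativeName(configured));
        }
    }
}
```

A check like this would not make a job portable by itself; it only helps
spot paths that happen to work in pseudo-distributed mode because every
process sees the same local disk.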

Thanks
hemanth

On Fri, Sep 14, 2012 at 1:04 PM, Jason Yang <lin.yang.ja...@gmail.com> wrote:

> All right, I got it.
>
> Thanks to all of you.
>
>
> 2012/9/14 Bertrand Dechoux <decho...@gmail.com>
>
>> The only difference between pseudo-distributed and fully distributed
>> would be scale. You could say that code that runs fine on the former
>> also runs fine on the latter. But that does not necessarily mean that
>> performance will scale the same way (i.e. if you keep a list of elements
>> in memory, at a bigger scale you could hit an OutOfMemoryError).
>>
>> Of course, as has been implied in previous answers, you can't say
>> the same of standalone mode. In that mode, you could use global mutable
>> static state and think it's fine, without caring about distribution
>> between the nodes. In that case, the same code launched in
>> pseudo-distributed mode will fail to reproduce the same results.
>>
>> Regards
>>
>> Bertrand
>>
>>
>> On Fri, Sep 14, 2012 at 9:24 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>>> Hi Jason,
>>>
>>> I think you're confusing the standalone mode with a pseudo-distributed
>>> mode. The former is a limited mode of MR where no daemons need to be
>>> deployed and the tasks run in a single JVM (via threads).
>>>
>>> A pseudo-distributed cluster is a cluster where all daemons are
>>> running on a single node. Hence, it is not "distributed" in the sense
>>> of multiple nodes (no use of any network gear), but the daemons
>>> communicate in the same way (RPC, etc.) as in a fully-distributed one.
>>>
>>> If an MR program works fine in pseudo-distributed mode, it "should"
>>> work fine (no guarantee) in fully-distributed mode iff all nodes
>>> have the same arch/OS, the same JVM, and the same job-specific
>>> configurations. This is because tasks execute on various nodes and may
>>> be affected by a node's behavior or setup that differs from the
>>> others, and that's something you'd have to detect/know about if a node
>>> exhibits more failures than others.
>>>
>>> On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang <lin.yang.ja...@gmail.com>
>>> wrote:
>>> > Hey, Kai
>>> >
>>> > Thanks for your reply.
>>> >
>>> > I was wondering what the difference is between pseudo-distributed and
>>> > fully-distributed Hadoop, apart from the maximum number of map/reduce
>>> > tasks.
>>> >
>>> > And if an MR program works fine in a pseudo-distributed cluster, will
>>> > it work exactly the same in a fully-distributed cluster?
>>> >
>>> >
>>> > 2012/9/14 Kai Voigt <k...@123.org>
>>> >>
>>> >> The default setting is that a tasktracker can run up to two map tasks
>>> >> and two reduce tasks in parallel (mapred.tasktracker.map.tasks.maximum
>>> >> and mapred.tasktracker.reduce.tasks.maximum), so you will actually see
>>> >> some concurrency on your one machine.
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > YANG, Lin
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>>
>> --
>> Bertrand Dechoux
>>
>
>
>
> --
> YANG, Lin
>
>
