Jeff,

Thanks for the advice.  I'll keep that in mind as we grow.  At present,
the cluster is only six machines, and for these small (2K-ish) scripts,
it has worked flawlessly, with the caveats mentioned.  When it starts
failing, or if I need to move beyond these very small sizes, I'll
certainly use the Distributed Cache.
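
For what it's worth, here's roughly what I expect the switch to look
like.  This is just a minimal sketch against the 0.20 API; the HDFS
path, host name, and class name are all made up:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSketch {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CacheSketch.class);
            // Ship the script out of HDFS with the job; the #fragment
            // becomes a symlink in each task's working directory.
            // (Path and names here are hypothetical.)
            DistributedCache.addCacheFile(
                new URI("hdfs://namenode:9000/scripts/transform.sh#transform.sh"),
                conf);
            DistributedCache.createSymlink(conf);
            // ... set mapper/reducer and input/output paths, then
            // submit with JobClient.runJob(conf)
        }
    }

(For the streaming jobs themselves, I gather the equivalent is the
-cacheFile option to the streaming jar, which would replace reading the
scripts off the NFS mount.)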

Brian

Jeff Hammerbacher wrote:
> Hey Brian,
> 
> Having tried and failed to use NFS to store shared resources for a large
> Hadoop cluster, I feel the need to say: you may want to reconsider that
> strategy as your cluster grows. NFS mounts can be quite flaky at scale, as
> Ted mentions. As Allen mentions, the Distributed Cache is intended to allow
> access to shared resources on the cluster; see
> http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html#DistributedCache
> for more information.
> 
> Later,
> Jeff
> 
> On Wed, Sep 23, 2009 at 10:19 AM, Allen Wittenauer
> <awittena...@linkedin.com> wrote:
> 
>>
>> On 9/23/09 10:09 AM, "Brian Vargas" <br...@ardvaark.net> wrote:
>>
>>> Although it can be quite useful to store small shared resources on an
>>> NFS mount.  For example, I find it easier to store various scripts
>>> called by a streaming job on NFS rather than distributing them from the
>>> command-line.
>>>
>>> Of course, then you have to be sure they don't change out from under the
>>> running jobs.  Tradeoffs.  :-)
>> You should probably look into distributed cache archives.  This
>> eliminates the NFS bottleneck, avoids the 'magically changing file'
>> problem, and allows you to use different versions with different job
>> submissions such that you can test changes on the fly without having
>> to redeploy.
> 
