Thank you for the info, Steve.

I have always believed that there is an optimal point where one can plot
the projected NN memory (assuming roughly 1GB of NN heap per 40TB of data)
against the number of nodes. For example, heuristically, how many nodes
would be sufficient for 1PB of storage if each node has 512GB of memory,
50TB of storage and 32 cores? That works out to about 25GB of RAM for the
NN with 20 DNs in the cluster. But one could halve the number of nodes to
10 and increase the storage to 100TB on each. So the question is the
optimal balance between storage and node count: would one go for more DNs
with less storage each, or fewer DNs with more storage per DN? The
proponents of more nodes may argue that they provide better MPP, but at
what cost to operation, start-up and maintenance?
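
To make the arithmetic explicit, here is a rough sizing sketch (Python,
illustrative only; it just applies the ~1GB of NN heap per ~40TB rule of
thumb from this thread, the helper names are my own, and it ignores block
counts, replication and small-file effects):

# Back-of-envelope sizing for the trade-off above. Illustrative only;
# real NN heap depends on the number of files and blocks, not raw terabytes.

def namenode_heap_gb(total_storage_tb, tb_per_gb_heap=40):
    """Rough NN heap estimate from total raw storage (rule of thumb)."""
    return total_storage_tb / tb_per_gb_heap

def datanode_count(total_storage_tb, storage_per_node_tb):
    """Number of DNs needed to hold the total raw storage (ceiling)."""
    return -(-total_storage_tb // storage_per_node_tb)

total_tb = 1000  # ~1PB

for per_node_tb in (50, 100):
    print(f"{per_node_tb} TB/node -> {datanode_count(total_tb, per_node_tb)} DNs, "
          f"NN heap ~{namenode_heap_gb(total_tb):.0f} GB")

# Prints 20 DNs or 10 DNs, but ~25GB of NN heap either way: the NN footprint
# tracks the data volume, so the choice is really about parallelism, failure
# domain and rebuild time per lost node.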

Cheers


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 18 March 2016 at 11:42, Steve Loughran <ste...@hortonworks.com> wrote:

>
> > On 17 Mar 2016, at 12:28, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
> >
> > Thanks Steve,
> >
> > For NN it all depends how fast you want a start-up. 1GB of NameNode
> memory accommodates around 42TB, so if you are talking about 100GB of NN
> memory then SSD may make sense to speed up the start-up. RAID 10 is the
> best one can get, assuming all internal disks.
>
> I wasn't really thinking of startup: in larger clusters startup time is
> often determined by how long it takes for all the datanodes to report in,
> and for HDFS to exit safe mode. But of course, the NN doesn't start
> listening for DN block reports until it's read in the FS image *and
> replayed the log*, so start time will be O(image + log-events + DNs)
>
> >
> > In general it is also suggested that the fsimage is copied across to an NFS
> mount directory shared between the primary and the fail-over in case of an issue.
>
> yes
>
> if you're curious, there's a 2011 paper on Y!s experience
>
> https://www.usenix.org/system/files/login/articles/chansler_0.pdf
>
> there are also traces of HDFS failure events in some of the JIRAs,
> HDFS-599 being the classic, as is HADOOP-572. Both of these document
> cascade failures in Facebook's HDFS clusters. Scale brings interesting
> problems
>
>
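
As a footnote to the fsimage/NFS point quoted above: the usual way to have
the NameNode keep a redundant copy of its metadata on an NFS mount is to
list the extra directory in dfs.namenode.name.dir in hdfs-site.xml; a
sketch, with placeholder paths:

<!-- hdfs-site.xml sketch; the paths are placeholders -->
<property>
  <name>dfs.namenode.name.dir</name>
  <!-- the NameNode writes its fsimage and edits to every directory listed -->
  <value>file:///data/1/dfs/nn,file:///mnt/nfs/dfs/nn</value>
</property>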
