Hi,

Bear in mind that you typically need 1 GB of NameNode memory for every 1 million blocks. So with a 128 MB block size, you can store 128 * 1E6 / (3 * 1024) = 41,666 GB of data for every 1 GB of NameNode memory. The factor of 3 comes from the fact that each block is replicated three times. In other words, just under 42 TB of data. So if you have 5 GB of NameNode cache, you can have up to roughly 210 TB of data on your DataNodes.

You also need to account for each YARN container's resources, which include memory, CPU and disk. This could be up to 8 GB of memory with a minimum allocation of 1 GB. I am not sure if a YARN container can use more than one core (someone please correct me). Regardless, Spark will try to use memory for its work, and that has to fit within a YARN container, whether it is a pure Spark process or Hive running on the Spark engine.

Does the 8 GB limit set by yarn.scheduler.maximum-allocation-mb apply here (meaning with Spark) as well?
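The sizing arithmetic above can be sketched as a small back-of-envelope calculation. This is only a sketch under the thread's stated assumptions (roughly 1 million blocks per 1 GB of NameNode heap, 128 MB block size, replication factor 3); real NameNode memory usage also depends on file and directory counts, not just blocks.

```python
# Back-of-envelope HDFS capacity per GB of NameNode heap.
# Assumptions (from the thread, not hard guarantees):
#   ~1 million blocks addressable per 1 GB of NameNode memory,
#   128 MB block size, replication factor 3.
BLOCKS_PER_GB_HEAP = 1_000_000
BLOCK_SIZE_MB = 128
REPLICATION = 3

def usable_tb_per_gb_heap(blocks=BLOCKS_PER_GB_HEAP,
                          block_mb=BLOCK_SIZE_MB,
                          replication=REPLICATION):
    """Usable (post-replication) data in TB per 1 GB of NameNode heap."""
    raw_gb = blocks * block_mb / 1024   # raw GB across all replicas
    return raw_gb / replication / 1024  # divide out replication, GB -> TB

per_gb = usable_tb_per_gb_heap()  # ~40.7 TB, i.e. "just under 42 TB"
total_5gb_heap = 5 * per_gb       # ~203 TB for a 5 GB NameNode heap
```

Note the precise figure comes out slightly under the round 210 TB quoted above, because 41,666 GB is about 40.7 TB rather than a full 42 TB.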
Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 10 March 2016 at 23:53, Prabhu Joseph <prabhujose.ga...@gmail.com> wrote:

> Ashok,
>
> The cluster nodes have enough memory, but CPU cores are scarcer: 512 GB / 16 = 32 GB, so the cluster has 32 GB of memory per core. Either there should be more cores available to use the available memory efficiently, or don't configure a high executor memory, which will cause a lot of GC.
>
> Thanks,
> Prabhu Joseph
>
> On Fri, Mar 11, 2016 at 3:45 AM, Ashok Kumar <ashok34...@yahoo.com.invalid> wrote:
>
>> Hi,
>>
>> We intend to use 5 servers for building a Big Data Hadoop data warehouse system (not using any proprietary distribution like Hortonworks or Cloudera or others).
>> All servers are configured with 512 GB RAM, 30 TB storage and 16 cores, running Ubuntu Linux. Hadoop will be installed on all the servers/nodes. Server 1 will be used as NameNode plus DataNode. Server 2 will be used as standby NameNode and DataNode. The rest of the servers will be used as DataNodes.
>> Now we would like to install Spark on each server to create a Spark cluster. Is that a good thing to do, or should we buy additional hardware for Spark (minding cost here), or do we simply require additional memory to accommodate Spark as well? In that case, how much memory would you recommend for each Spark node?
>>
>> thanks all
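Prabhu's memory-to-core point in the quoted reply can be checked with the same kind of quick arithmetic. A minimal sketch, assuming the thread's node spec (512 GB RAM, 16 cores) and one vcore per container; the 8 GB cap is the yarn.scheduler.maximum-allocation-mb value discussed above, used here purely for illustration:

```python
# Memory-to-core ratio for the nodes described in the thread.
# Assumptions: 512 GB RAM and 16 cores per node, one vcore per container.
ram_gb, cores = 512, 16
mem_per_core = ram_gb / cores          # 32 GB of RAM for every core

# If each container is capped at 8 GB and takes one core, cores run out
# long before memory does: 16 containers use only 128 GB of the 512 GB.
container_mem_gb = 8
max_containers = cores                 # one core each
mem_in_use = max_containers * container_mem_gb
idle_mem = ram_gb - mem_in_use         # 384 GB of RAM left unused
```

This is the imbalance Prabhu describes: with only 16 cores, most of the 512 GB cannot be handed out to containers, so either more cores are needed or executor memory should be kept modest.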