Hi Nishant! You should be able to look at the datanode and nodemanager log files to find out why they died after you ran the 76 mappers. It is extremely unusual (I haven't heard of a verified case in the last 4-5 years) for a job to kill nodemanagers unless your cluster is configured poorly. Which container-executor do you use? Which user runs the nodemanager and datanode processes? Which user does a MapTask run as?
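A quick way to surface the fatal lines is to grep the daemon logs. A minimal sketch, assuming the default $HADOOP_HOME/logs layout and default log file naming; the sample log file below is invented so the snippet is self-contained:

```shell
#!/bin/sh
# Sketch: scan NodeManager/DataNode logs for fatal events.
# $HADOOP_HOME/logs is the default log directory; adjust if you
# relocated logs via HADOOP_LOG_DIR / YARN_LOG_DIR.
LOG_DIR="${HADOOP_LOG_DIR:-$HADOOP_HOME/logs}"

# Hypothetical sample log so the sketch runs anywhere:
cat > /tmp/sample-nodemanager.log <<'EOF'
2017-07-28 09:40:12,001 INFO  org.apache.hadoop.yarn.server.nodemanager.NodeManager: ...
2017-07-28 09:41:03,114 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
EOF

# Against the real files the same scan would look like:
#   grep -iE 'FATAL|ERROR|OutOfMemory|Shutdown' "$LOG_DIR"/yarn-*-nodemanager-*.log
#   grep -iE 'FATAL|ERROR|OutOfMemory|Shutdown' "$LOG_DIR"/hadoop-*-datanode-*.log
grep -iE 'FATAL|ERROR|OutOfMemory|Shutdown' /tmp/sample-nodemanager.log
```

The last FATAL (or repeated OutOfMemory) line before the timestamp at which the daemons disappeared is usually the actual cause.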
Are you sure the cluster is fine? How many resources do you see available in the ResourceManager? Are you submitting the application to a queue with enough resources?

Ravi

On Fri, Jul 28, 2017 at 5:19 AM, Nishant Verma <nishant.verma0...@gmail.com> wrote:
> Hello,
>
> In my 5-node Hadoop 2.7.3 AWS EC2 instance cluster, things were running
> smoothly before I submitted one query. I tried to create an ORC table
> using the query below:
>
> create table dummy_orc stored as orc tblproperties ("orc.compress"="Lz4")
> as select * from dummy;
>
> The job said it would run 76 mappers and 0 reducers, and the job started.
> After some 10-12 minutes, when the map % reached 100%, the job aborted
> and did not give output. Since the number of records was large, I did not
> mind the long time it took initially. But then all my datanode daemons
> and nodemanager daemons died. The hdfs dfsadmin -report command gave 0
> cluster capacity, 0 live datanodes, etc.
>
> I restarted the cluster completely. Restarted namenode, resource manager,
> datanode, nodemanager, zkfc services, QuorumPeerMain, everything. After
> that, the cluster capacity, etc. is coming back fine. I am able to fire
> normal non-MapReduce queries like select *.
>
> But MapReduce is not starting. Spark jobs are not running now either;
> they are stuck at ACCEPTED state like the MR jobs.
>
> MR is stuck for select count(1) from dummy at:
>
> Query ID = hadoopuser_20170728093320_b1875223-801e-466b-997f-4b58f0e90041
> Total jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks determined at compile time: 1
> In order to change the average load for a reducer (in bytes):
>   set hive.exec.reducers.bytes.per.reducer=<number>
> In order to limit the maximum number of reducers:
>   set hive.exec.reducers.max=<number>
> In order to set a constant number of reducers:
>   set mapreduce.job.reduces=<number>
> Starting Job = job_1501233326257_0003, Tracking URL =
> http://dev-bigdatamaster1:8088/proxy/application_1501233326257_0003/
> Kill Command = /home/hadoopuser/hadoop//bin/hadoop job -kill
> job_1501233326257_0003
>
> Which log would give me a better picture to resolve this error? And what
> went wrong?
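On the ACCEPTED-state question above: a job sits in ACCEPTED when the ResourceManager has no registered nodes or not enough free memory/vcores to launch the ApplicationMaster. A hedged sketch of the first check; the host `dev-bigdatamaster1:8088` comes from the tracking URL in the quoted output, the JSON field names follow the standard YARN cluster-metrics REST response, and the sample response body is invented so the parsing runs offline:

```shell
#!/bin/sh
# Sketch: check cluster capacity as the ResourceManager sees it.
# On the live cluster you would run:
#   yarn node -list -all
#   curl -s http://dev-bigdatamaster1:8088/ws/v1/cluster/metrics
# Hypothetical response body standing in for the curl output:
METRICS='{"clusterMetrics":{"activeNodes":0,"lostNodes":5,"availableMB":0,"availableVirtualCores":0}}'

# Crude extraction without a JSON tool (sed only), good enough for a check:
active=$(printf '%s' "$METRICS" | sed -n 's/.*"activeNodes":\([0-9]*\).*/\1/p')
avail_mb=$(printf '%s' "$METRICS" | sed -n 's/.*"availableMB":\([0-9]*\).*/\1/p')

if [ "$active" -eq 0 ] || [ "$avail_mb" -eq 0 ]; then
  echo "RM reports no usable capacity: activeNodes=$active availableMB=$avail_mb"
fi
```

If activeNodes is 0 after your restart, the nodemanagers never re-registered with the RM, which would explain both symptoms (MR and Spark stuck at ACCEPTED), and the nodemanager logs from the first sketch are the place to look.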