Filemax across the cluster is set to over 6 million. I've checked the open file limits for the accounts used by the Hadoop daemons and they have an open file limit of 32K. This is confirmed by the various .out files, e.g. /var/log/hadoop-hdfs/hadoop-hdfs-datanode-slave1.out contains:

open files                      (-n) 32768

Is this too low? What is the recommended value for open files on all nodes? Also, does my own user need to have the same value?

I've also tried running the same column selection on files crushed by the filecrush program https://github.com/edwardcapriolo/filecrush/
This created 5 large files out of the 10,000 small files (still totalling 2gb compressed), but this job won't progress past 0% map.

From: Ana Gillan <ana.gil...@gmail.com>
Date: Saturday, 2 August 2014 16:36
To: <user@hadoop.apache.org>
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

For my own user? It is as follows:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 483941
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 800
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

From: hadoop hive <hadooph...@gmail.com>
Reply-To: <user@hadoop.apache.org>
Date: Saturday, 2 August 2014 16:34
To: <user@hadoop.apache.org>
Subject: Re: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException)

Can you check the ulimit for your user? That might be causing this.

On Aug 2, 2014 8:54 PM, "Ana Gillan" <ana.gil...@gmail.com> wrote:
> Hi everyone,
>
> I am having an issue with MapReduce jobs running through Hive being killed
> after 600s timeouts and with very simple jobs taking over 3 hours (or just
> failing) for a set of files with a compressed size of only 1-2gb. I will try
> and provide as much information as I can here, so if someone can help, that
> would be really great.
>
> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>> Master node:
>>   2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>>   64GB DDR3 SDRAM
>>   8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>>
>> Slave nodes (each):
>>   Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>>   32GB DDR3 SDRAM
>>   4 x 2TB SATA-3 hard drive
>>
>> Operating system on all nodes: openSUSE Linux 13.1
>
> We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha
> and Hive version 0.11.
> YARN has been configured as per these recommendations:
> http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
>
> I also set the following additional settings before running jobs:
> set yarn.nodemanager.resource.cpu-vcores=4;
> set mapred.tasktracker.map.tasks.maximum=4;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapredfiles=true;
>
> No one else uses this cluster while I am working.
>
> What I'm trying to do:
> I have a bunch of XML files on HDFS, which I am reading into Hive using this
> SerDe https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a
> series of tables from these files and finally run a Python script on one of
> them to perform some scientific calculations. The files are .xml.gz format and
> (uncompressed) are only about 4mb in size each. hive.input.format is set to
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the "small
> files problem."
>
> Problems:
> My HQL statements work perfectly for up to 1000 of these files.
> Even for much larger numbers, doing select * works fine, which means the
> files are being read properly, but if I do something as simple as selecting
> just one column from the whole table for a larger number of files, containers
> start being killed and jobs fail with this error in the container logs:
>
> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0: File does not exist. Holder DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have any open files.
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
>
> Killed jobs show the above and also the following message:
> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600 secsContainer killed by the ApplicationMaster.
>
> Also, in the node logs, I get a lot of pings like this:
> INFO [IPC Server handler 17 on 40961] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0362_m_000002_0
>
> For 5000 files (1gb compressed), the selection of a single column finishes,
> but takes over 3 hours. For 10,000 files, the job hangs on about 4% map and
> then errors out.
>
> While the jobs are running, I notice that the containers are not evenly
> distributed across the cluster. Some nodes lie idle, while the application
> master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop
> on each slave node.
>
> This is the output of netstat -i while the column selection is running:
> Kernel Interface table
> Iface   MTU Met    RX-OK RX-ERR  RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500   0 79515196      0 2265807      0 45694758      0      0      0 BMRU
> eth1   1500   0 77410508      0       0      0 40815746      0      0      0 BMRU
> lo    65536   0 16593808      0       0      0 16593808      0      0      0 LRU
>
> Are there some settings I am missing that mean the cluster isn't processing
> this data as efficiently as it can?
>
> I am very new to Hadoop and there are so many logs, etc, that troubleshooting
> can be a bit overwhelming. Where else should I be looking to try and diagnose
> what is wrong?
>
> Thanks in advance for any help you can give!
>
> Kind regards,
> Ana
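
The ulimit check suggested in this thread can be scripted so that the same comparison is run under each account on each node. The following is only a minimal sketch: it uses Python's standard `resource` module, and the names `DAEMON_NOFILE`, `open_file_limits` and `below_reference` are hypothetical. Using 32768 as the comparison point is an assumption taken from the value the daemon accounts report above, not a recommended setting.

```python
import resource

# Value reported for the Hadoop daemon accounts in this thread. Treating it as
# the comparison point is an assumption, not a recommendation.
DAEMON_NOFILE = 32768


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def below_reference(soft, reference=DAEMON_NOFILE):
    """Return True if the soft limit is finite and lower than the reference."""
    if soft == resource.RLIM_INFINITY:
        return False
    return soft < reference


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={soft} hard={hard}")
    if below_reference(soft):
        print(f"soft limit is below the daemon value of {DAEMON_NOFILE}")
```

Run as the user that submits the Hive job, this reports the limit that user's processes actually get, which may differ from what the daemon .out files show.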