Hi everyone, I am having an issue with MapReduce jobs run through Hive being killed after 600-second timeouts, and with very simple jobs taking over 3 hours (or just failing) on a set of files with a compressed size of only 1-2GB. I will try to provide as much information as I can here, so if someone can help, that would be really great.
I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:

> Master node:
> 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
> 64GB DDR3 SDRAM
> 8 x 2TB SAS 600 hard drives (arranged as RAID 1 and RAID 5)
>
> Slave nodes (each):
> Intel Xeon 4-core E3-1220v3 @ 3.1GHz
> 32GB DDR3 SDRAM
> 4 x 2TB SATA-3 hard drives
>
> Operating system on all nodes: openSUSE Linux 13.1

We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha and Hive version 0.11. YARN has been configured as per these recommendations: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

I also set the following additional settings before running jobs:

set yarn.nodemanager.resource.cpu-vcores=4;
set mapred.tasktracker.map.tasks.maximum=4;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.merge.mapredfiles=true;

No one else uses this cluster while I am working.

What I'm trying to do:

I have a bunch of XML files on HDFS, which I am reading into Hive using this SerDe: https://github.com/dvasilen/Hive-XML-SerDe (a sketch of my DDL is below). I then want to create a series of tables from these files and finally run a Python script on one of them to perform some scientific calculations. The files are in .xml.gz format and are only about 4MB each when uncompressed. hive.input.format is set to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the "small files problem".

Problems:

My HQL statements work perfectly for up to 1000 of these files. Even for much larger numbers, doing select * works fine, which means the files are being read properly. But if I do something as simple as selecting a single column from the whole table for a larger number of files, containers start being killed and jobs fail with this error in the container logs:

2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0: File does not exist. Holder DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)

Killed jobs show the above and also the following message:

AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600 secs
Container killed by the ApplicationMaster.

Also, in the node logs, I get a lot of pings like this:

INFO [IPC Server handler 17 on 40961] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0362_m_000002_0

For 5000 files (1GB compressed), the selection of a single column finishes, but takes over 3 hours. For 10,000 files, the job hangs at about 4% map progress and then errors out. While the jobs are running, I notice that the containers are not evenly distributed across the cluster: some nodes sit idle, while the node running the application master runs 7 containers, maxing out the 28GB of RAM allocated to Hadoop on each slave node.
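In case the table definition matters: my DDL follows the pattern in the SerDe's README. This is only a sketch, with placeholder table/column names, XPaths, and HDFS location rather than my real schema:

-- Sketch of my DDL, following the Hive-XML-SerDe README.
-- Table name, columns, XPaths, and LOCATION are placeholders.
CREATE EXTERNAL TABLE xml_events (
  event_id   STRING,
  event_time STRING
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "column.xpath.event_id" = "/record/@id",
  "column.xpath.event_time" = "/record/time/text()"
)
STORED AS
  INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/data/xml/events'
TBLPROPERTIES (
  "xmlinput.start" = "<record",
  "xmlinput.end" = "</record>"
);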
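One thing I am unsure about is split sizing: as I understand it, CombineHiveInputFormat groups the small .gz files into combined splits according to the properties below. I have not tuned these, so the byte values here are illustrative assumptions rather than settings I have tested:

-- Illustrative values only, not tested on this cluster:
-- ~256MB per combined split, ~128MB minimum per node/rack
-- before files are combined across nodes or racks.
set mapred.max.split.size=268435456;
set mapred.min.split.size.per.node=134217728;
set mapred.min.split.size.per.rack=134217728;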
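Also, the "600 secs" in the kill message matches what I believe is the default task timeout (mapred.task.timeout = 600000 ms). As a diagnostic experiment, not a fix, I could raise it to see whether the stuck tasks eventually finish:

-- Diagnostic only: raise the task timeout from 10 to 30 minutes
set mapred.task.timeout=1800000;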
This is the output of netstat -i while the column selection is running:

Kernel Interface table
Iface   MTU    Met  RX-OK     RX-ERR  RX-DRP   RX-OVR  TX-OK     TX-ERR  TX-DRP  TX-OVR  Flg
eth0    1500   0    79515196  0       2265807  0       45694758  0       0       0       BMRU
eth1    1500   0    77410508  0       0        0       40815746  0       0       0       BMRU
lo      65536  0    16593808  0       0        0       16593808  0       0       0       LRU

Are there some settings I am missing that would explain why the cluster isn't processing this data as efficiently as it could? I am very new to Hadoop, and with so many logs to sift through, troubleshooting can be a bit overwhelming. Where else should I be looking to diagnose what is wrong?

Thanks in advance for any help you can give!

Kind regards,
Ana