Filemax across the cluster is set to over 6 million. I've checked the open
file limits for the accounts used by the Hadoop daemons, and they have an
open file limit of 32K. This is confirmed by the various .out files, e.g.


Contains open files (-n) 32768. Is this too low? What is the recommended
value for open files on all nodes? Also does my own user need to have the
same value?
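
As a cross-check on the .out files, the limit a running daemon actually has can be read from /proc/<pid>/limits on Linux. The following is only a minimal sketch: the helper names and the SAMPLE text are illustrative assumptions, not output taken from this cluster.

```python
import re
from pathlib import Path


def parse_proc_limits(text):
    """Map each limit name to its (soft, hard) values, kept as strings."""
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns in /proc/<pid>/limits are padded with runs of spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def open_file_limit(pid):
    """Return (soft, hard) open-file limits for a running process (Linux only)."""
    text = Path(f"/proc/{pid}/limits").read_text()
    return parse_proc_limits(text)["Max open files"]


# Illustrative sample in the /proc/<pid>/limits layout (hypothetical values).
SAMPLE = (
    "Limit                     Soft Limit           Hard Limit           Units     \n"
    "Max open files            32768                32768                files     \n"
    "Max nice priority         0                    0                    \n"
)

soft, hard = parse_proc_limits(SAMPLE)["Max open files"]
print(f"open files: soft={soft} hard={hard}")
```

On a node, open_file_limit(pid) would be called with the PID of a DataNode or NodeManager process, run as a user allowed to read that process's /proc entry.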

I've also tried running the same column selection on files crushed by the
filecrush program.
This created 5 large files out of the 10,000 small files (still totalling 2gb
compressed), but this job won't progress past 0% map.

For my own user? It is as follows:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 483941
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 800
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
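
To compare a listing like this against the 32768 the daemon accounts report, the `ulimit -a` text can be parsed by flag. This is a rough sketch only: the regular expression, the function names, and the shortened SAMPLE are my own assumptions, written for the bash output layout shown here.

```python
import re

# Matches lines such as "open files                      (-n) 1024"
# and "pipe size            (512 bytes, -p) 8".
ULIMIT_LINE = re.compile(
    r"^(?P<name>.+?)\s+\((?:(?P<unit>[^,()]+),\s*)?(?P<flag>-\w)\)\s+(?P<value>\S+)\s*$"
)


def parse_ulimit(text):
    """Map each ulimit flag (e.g. '-n') to its value as a string."""
    limits = {}
    for line in text.splitlines():
        match = ULIMIT_LINE.match(line.strip())
        if match:
            limits[match.group("flag")] = match.group("value")
    return limits


def is_lower(value, reference):
    """True if a limit is numerically below the reference; 'unlimited' never is."""
    return value != "unlimited" and int(value) < reference


# Three lines copied from the listing above, shortened for illustration.
SAMPLE = """\
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
max user processes              (-u) 800
"""

DAEMON_NOFILE = 32768  # value reported in the daemons' .out files

limits = parse_ulimit(SAMPLE)
print("user -n:", limits["-n"], "below daemon limit:", is_lower(limits["-n"], DAEMON_NOFILE))
```

Keying on the flag rather than the description keeps the lookup short. Whether the user's own limit matters for these jobs is a separate question that the comparison does not answer.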

Can you check the ulimit for your user? That might be causing this.

> Hi everyone,
> I am having an issue with MapReduce jobs running through Hive being killed
> after 600s timeouts and with very simple jobs taking over 3 hours (or just
> failing) for a set of files with a compressed size of only 1-2gb. I will try
> and provide as much information as I can here, so if someone can help, that
> would be really great.
> I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
>> * Master node:
>>   - 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>>   - 64GB DDR3 SDRAM
>>   - 8 x 2TB SAS 600 hard drive (arranged as RAID 1 and RAID 5)
>> * Slave nodes (each):
>>   - Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>>   - 32GB DDR3 SDRAM
>>   - 4 x 2TB SATA-3 hard drive
>> * Operating system on all nodes: openSUSE Linux 13.1
> We have the Apache BigTop package version 0.7, with Hadoop version 2.0.6-alpha
> and Hive version 0.11.
> YARN has been configured as per these recommendations:
> I also set the following additional settings before running jobs:
> set yarn.nodemanager.resource.cpu-vcores=4;
> set;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapredfiles=true;
> No one else uses this cluster while I am working.
> What I'm trying to do:
> I have a bunch of XML files on HDFS, which I am reading into Hive using this
> SerDe. I then want to create a
> series of tables from these files and finally run a Python script on one of
> them to perform some scientific calculations. The files are .xml.gz format and
> (uncompressed) are only about 4mb in size each. hive.input.format is set to
> so as to avoid the "small
> files problem."
> Problems:
> My HQL statements work perfectly for up to 1000 of these files. Even for much
> larger numbers, doing select * works fine, which means the files are being
> read properly, but if I do something as simple as selecting just one column
> from the whole table for a larger number of files, containers start being
> killed and jobs fail with this error in the container logs:
> 2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient:
> Failed to close file
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
> No lease on
> /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0:
> File does not exist. Holder
> DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have
> any open files.
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)
> Killed jobs show the above and also the following message:
> AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600 secs
> Container killed by the ApplicationMaster.
> Also, in the node logs, I get a lot of pings like this:
> INFO [IPC Server handler 17 on 40961]
> org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from
> attempt_1403771939632_0362_m_000002_0
> For 5000 files (1gb compressed), the selection of a single column finishes,
> but takes over 3 hours. For 10,000 files, the job hangs on about 4% map and
> then errors out.
> While the jobs are running, I notice that the containers are not evenly
> distributed across the cluster. Some nodes lie idle, while the application
> master node runs 7 containers, maxing out the 28gb of RAM allocated to Hadoop
> on each slave node.
> This is the output of netstat -i while the column selection is running:
> Kernel Interface table
> Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
> eth0   1500   0 79515196      0 2265807     0 45694758      0      0      0
> eth1   1500   0 77410508      0      0      0 40815746      0      0      0
> lo    65536   0 16593808      0      0      0 16593808      0      0      0
> Are there some settings I am missing that mean the cluster isn't processing
> this data as efficiently as it can?
> I am very new to Hadoop and there are so many logs, etc, that troubleshooting
> can be a bit overwhelming. Where else should I be looking to try and diagnose
> what is wrong?
> Thanks in advance for any help you can give!
> Kind regards,
> Ana 

