Hi everyone,

I am having an issue with MapReduce jobs running through Hive being killed
after 600s timeouts, and with very simple jobs taking over 3 hours (or just
failing) on a set of files whose total compressed size is only 1-2GB. I will
try to provide as much information as I can here, so if someone can help, that
would be really great.

I have a cluster of 7 nodes (1 master, 6 slaves) with the following config:
> * Master node:
>   - 2 x Intel Xeon 6-core E5-2620v2 @ 2.1GHz
>   - 64GB DDR3 SDRAM
>   - 8 x 2TB SAS 600 hard drives (arranged as RAID 1 and RAID 5)
> * Slave nodes (each):
>   - Intel Xeon 4-core E3-1220v3 @ 3.1GHz
>   - 32GB DDR3 SDRAM
>   - 4 x 2TB SATA-3 hard drives
> * Operating system on all nodes: openSUSE Linux 13.1

We have the Apache BigTop package version 0.7, with Hadoop version
2.0.6-alpha and Hive version 0.11.
YARN has been configured as per these recommendations:
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

I also set the following additional settings before running jobs:
set yarn.nodemanager.resource.cpu-vcores=4;
set mapred.tasktracker.map.tasks.maximum=4;
set hive.hadoop.supports.splittable.combineinputformat=true;
set hive.merge.mapredfiles=true;
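
I have also been wondering whether I should be raising the task timeout and
capping the combined split size, along these lines (these are my guesses from
the docs, so please correct me if the property names are wrong for Hadoop
2.0.6-alpha / Hive 0.11):

-- raise the 600s task timeout (value is in milliseconds)
set mapreduce.task.timeout=1800000;
-- cap each combined split at 256MB so no single mapper gets too much work
set mapred.max.split.size=268435456;
set mapred.min.split.size.per.node=134217728;
set mapred.min.split.size.per.rack=134217728;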

No one else uses this cluster while I am working.

What I'm trying to do:
I have a bunch of XML files on HDFS, which I am reading into Hive using this
SerDe: https://github.com/dvasilen/Hive-XML-SerDe. I then want to create a
series of tables from these files and finally run a Python script on one of
them to perform some scientific calculations. The files are in .xml.gz format
and each is only about 4MB uncompressed. hive.input.format is set to
org.apache.hadoop.hive.ql.io.CombineHiveInputFormat so as to avoid the
"small files problem."
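
For context, my tables are created roughly like this (a simplified sketch
following the SerDe's README; the column names, XPaths and location here are
just placeholders, not my real schema):

-- placeholder schema; the real tables have more columns
create external table xml_records (
  id string,
  value double
)
row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
with serdeproperties (
  "column.xpath.id"="/record/@id",
  "column.xpath.value"="/record/value/text()"
)
stored as
  inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
  outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location '/data/xml'
tblproperties (
  "xmlinput.start"="<record",
  "xmlinput.end"="</record>"
);

The Python step is a standard streaming transform (again, placeholder names):

add file calc.py;
select transform (id, value) using 'python calc.py' as (result double)
from xml_records;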

Problems:
My HQL statements work perfectly for up to 1000 of these files. Even for much
larger numbers, select * works fine, which suggests the files themselves are
being read properly. But if I do something as simple as selecting a single
column from the whole table over a larger number of files, containers start
being killed and jobs fail with this error in the container logs:

2014-08-02 14:51:45,137 ERROR [Thread-3] org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /tmp/hive-zslf023/hive_2014-08-02_12-33-59_857_6455822541748133957/_task_tmp.-ext-10001/_tmp.000000_0: File does not exist. Holder DFSClient_attempt_1403771939632_0402_m_000000_0_-1627633686_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2398)

Killed jobs show the above and also the following message (I believe the 600
secs here is the default mapreduce.task.timeout, i.e. the task reported no
progress for 10 minutes):
AttemptID:attempt_1403771939632_0402_m_000000_0 Timed out after 600 secs
Container killed by the ApplicationMaster.

Also, in the node logs, I get a lot of pings like this:
INFO [IPC Server handler 17 on 40961] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1403771939632_0362_m_000002_0

For 5000 files (1GB compressed), the selection of a single column finishes,
but takes over 3 hours. For 10,000 files, the job hangs at about 4% map
progress and then errors out.
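
To be concrete, the kind of statement that fails is as simple as this (table
and column names are placeholders, not my real schema):

select id from xml_records;
-- or, materialising the column into a new table:
create table one_col as select id from xml_records;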

While the jobs are running, I notice that the containers are not evenly
distributed across the cluster: some nodes sit idle, while the node hosting
the application master runs 7 containers, maxing out the 28GB of RAM
allocated to Hadoop on each slave node.

This is the output of netstat -i while the column selection is running:
Kernel Interface table
Iface   MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0 79515196      0 2265807      0 45694758      0      0      0 BMRU
eth1   1500   0 77410508      0       0      0 40815746      0      0      0 BMRU
lo    65536   0 16593808      0       0      0 16593808      0      0      0 LRU

Are there some settings I am missing that mean the cluster isn't processing
this data as efficiently as it can?

I am very new to Hadoop, and there are so many logs that troubleshooting can
be a bit overwhelming. Where else should I be looking to try to diagnose what
is wrong?

Thanks in advance for any help you can give!

Kind regards,
Ana 