Does anyone run Hadoop on a PC cluster?
I just tested WordCount on a PC cluster; my first impressions are as follows:
***
Number of PCs: 7 (512MB RAM, 2.8GHz CPU, 100Mbit NIC, CentOS 5.0, Hadoop
0.16.1, Sun JRE 1.6)
Master (Namenode): 1
Master (Jobtracker): 1
Slaves (Datanode & Tasktracker): 5
1. Writing to HDFS
--
File size: 4,295,341,065 bytes (about 4.0GB)
Time elapsed putting the file into HDFS: 7m57.757s
Average rate: 8,990,583 bytes/sec
Average bandwidth usage: 68.59%
I also tested libhdfs; it performed about as well as the Java client.
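For anyone who wants to reproduce the write test programmatically, the same
kind of write can be done through the Java FileSystem API. A minimal sketch
(the paths are placeholders; the cluster settings are picked up from
hadoop-site.xml on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // reads hadoop-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);      // connects to the configured namenode
    // Copy a local file into HDFS; both paths are placeholders.
    fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                         new Path("/user/test/input.txt"));
    fs.close();
  }
}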
2. Map/Reduce with Java
--
Time elapsed: 19mins, 56sec
Bytes/time rate: 3,591,422 bytes/sec
Job Counters:
Launched map tasks 67
Launched reduce tasks 7
Data-local map tasks 64
Map-Reduce Framework:
Map input records 65,869,800
Map output records 697,923,360
Map input bytes 4,295,341,065
Map output bytes 6,504,944,565
Combine input records 697,923,360
Combine output records 2,330,048
Reduce input groups 5,201
Reduce input records 2,330,048
Reduce output records 5,201
This is acceptable. The main bottleneck was the CPU, which stayed at 100% usage.
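The Java job is essentially the stock WordCount: the mapper emits (word, 1)
per token, the reducer sums the counts, and the reducer is reused as the
combiner, which is why the 697,923,360 combine input records collapse to
2,330,048 combine output records above. A minimal sketch against the old
org.apache.hadoop.mapred API (the driver calls for setting input/output paths
vary slightly between Hadoop releases):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: emit (word, 1) for every whitespace-separated token.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sum the partial counts per word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);  // combiner collapses counts map-side
    conf.setReducerClass(Reduce.class);
    // In 0.16 the paths are set directly on the JobConf; later releases use these helpers.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Run with something like: bin/hadoop jar wordcount.jar WordCount <hdfs input dir> <hdfs output dir>.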
3. Map/Reduce with C++ Pipes (no combiner)
--
Time elapsed: 1hr, 2mins, 47sec
Bytes/time rate: 1,140,255 bytes/sec
Job Counters:
Launched map tasks 68
Launched reduce tasks 5
Data-local map tasks 64
Map-Reduce Framework:
Map input records 65,869,800
Map output records 697,452,105
Map input bytes 4,295,341,065
Map output bytes 5,107,053,975
Combine input records 0
Combine output records 0
Reduce input groups 5,191
Reduce input records 697,452,105
Reduce output records 5,191
My first impression is that the C++ Pipes interface is slower than Java. If I
add a C++ Pipes combiner, the result becomes even worse: the main bottleneck
is RAM, with a great deal of swap space used, processes blocked, and the CPU
stuck waiting...
Adding more RAM might improve performance, but I think it would still be
slower than Java.
4. Map/Reduce with Python streaming (no combiner)
--
Time elapsed: 1hr, 48mins, 53sec
Bytes/time rate: 657,483 bytes/sec
Job Counters:
Launched map tasks 68
Launched reduce tasks 5
Data-local map tasks 64
Map-Reduce Framework:
Map input records 65,869,800
Map output records 697,452,105
Map input bytes 4,295,341,065
Map output bytes 5,107,053,975
Combine input records 0
Combine output records 0
Reduce input groups 5,191
Reduce input records 697,452,105
Reduce output records 5,191
As you can see, the result is not as good as the C++ Pipes interface. Maybe
Python itself is slower; I didn't test other cases.
Are there any suggestions for improving this situation?
--
yingyuan