Does anyone run Hadoop on a PC cluster? I just tested WordCount on one, and my first impressions are as follows:
***************************************************************************************
Number of PCs: 7 (512M RAM, 2.8G CPU, 100M NIC, CentOS 5.0, Hadoop 0.16.1, Sun JRE 1.6)
Master (Namenode): 1
Master (Jobtracker): 1
Slaves (Datanode & Tasktracker): 5

1. Writing to HDFS
----------------------------------------------------------
File size: 4,295,341,065 bytes (4.1G)
Time elapsed putting file into HDFS: 7m57.757s
Average rate: 8,990,583 bytes/sec
Average bandwidth usage: 68.59%

I also tested libhdfs; it works just as well as the Java API.

2. Map/Reduce with Java
----------------------------------------------------------
Time elapsed: 19mins, 56sec
Bytes/time rate: 3,591,422 bytes/sec

Job Counters:
    Launched map tasks        67
    Launched reduce tasks     7
    Data-local map tasks      64
Map-Reduce Framework:
    Map input records         65,869,800
    Map output records        697,923,360
    Map input bytes           4,295,341,065
    Map output bytes          6,504,944,565
    Combine input records     697,923,360
    Combine output records    2,330,048
    Reduce input groups       5,201
    Reduce input records      2,330,048
    Reduce output records     5,201

This is acceptable. The main bottleneck was the CPU, which stayed at 100% usage.

3. Map/Reduce with C++ Pipes (no combiner)
----------------------------------------------------------
Time elapsed: 1hrs, 2mins, 47sec
Bytes/time rate: 1,140,255 bytes/sec

Job Counters:
    Launched map tasks        68
    Launched reduce tasks     5
    Data-local map tasks      64
Map-Reduce Framework:
    Map input records         65,869,800
    Map output records        697,452,105
    Map input bytes           4,295,341,065
    Map output bytes          5,107,053,975
    Combine input records     0
    Combine output records    0
    Reduce input groups       5,191
    Reduce input records      697,452,105
    Reduce output records     5,191

My first impression is that the C++ Pipes interface is slower than Java. If I add a C++ Pipes combiner, the result becomes even worse: the main bottleneck is then RAM, with a great deal of swap space in use, processes blocked, and the CPU mostly waiting. Adding more RAM might improve performance, but I think it would still be slower than Java.

4. Map/Reduce with Python streaming (no combiner)
----------------------------------------------------------
Time elapsed: 1hrs, 48mins, 53sec
Bytes/time rate: 657,483 bytes/sec

Job Counters:
    Launched map tasks        68
    Launched reduce tasks     5
    Data-local map tasks      64
Map-Reduce Framework:
    Map input records         65,869,800
    Map output records        697,452,105
    Map input bytes           4,295,341,065
    Map output bytes          5,107,053,975
    Combine input records     0
    Combine output records    0
    Reduce input groups       5,191
    Reduce input records      697,452,105
    Reduce output records     5,191

As you can see, the result is not as good as with the C++ Pipes interface. Maybe Python is simply slower; I didn't test other cases.

Are there any suggestions for improving this?

-- yingyuan
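
P.S. For anyone who wants to check the arithmetic, the rate figures above can be reproduced with a short script. This is only a sketch: the byte counts and times are copied from the numbers in this post, and the helper names (`seconds`, `rate`, `link_usage`) are made up for illustration. Two things it suggests: the 68.59% bandwidth figure appears to treat "100M" as 100 * 2**20 bit/s (with a decimal 100 Mbit/s link it would be roughly 71.9%), and the Java combiner cut the number of records sent to the reducers by a factor of about 300, which is probably a large part of the gap to the two runs without a combiner.

```python
# Sanity check of the throughput figures quoted in the post.
# All inputs are copied from the post; the helper names are illustrative only.

FILE_BYTES = 4_295_341_065  # input file size from section 1


def seconds(hours=0, minutes=0, secs=0.0):
    """Convert an hours/minutes/seconds duration to seconds."""
    return hours * 3600 + minutes * 60 + secs


def rate(total_bytes, elapsed_seconds):
    """Average throughput in whole bytes per second (fraction truncated)."""
    return int(total_bytes / elapsed_seconds)


def link_usage(bytes_per_sec, link_bits_per_sec):
    """Share of a network link used, in percent."""
    return 100.0 * bytes_per_sec * 8 / link_bits_per_sec


# Job throughput: input bytes divided by wall-clock job time.
job_rates = {
    "java": rate(FILE_BYTES, seconds(minutes=19, secs=56)),
    "cpp_pipes": rate(FILE_BYTES, seconds(hours=1, minutes=2, secs=47)),
    "python_streaming": rate(FILE_BYTES, seconds(hours=1, minutes=48, secs=53)),
}

# HDFS put throughput from the reported 7m57.757s.
hdfs_put_rate = rate(FILE_BYTES, seconds(minutes=7, secs=57.757))

# Assumption: two readings of "100M NIC", binary and decimal megabits.
usage_binary = link_usage(8_990_583, 100 * 2**20)
usage_decimal = link_usage(8_990_583, 100 * 10**6)

# Records entering the Java combiner versus records leaving it.
combiner_reduction = 697_923_360 / 2_330_048

if __name__ == "__main__":
    for name, value in job_rates.items():
        print(f"{name}: {value:,} bytes/sec")
    print(f"hdfs put: {hdfs_put_rate:,} bytes/sec")
    print(f"link usage (binary Mbit): {usage_binary:.2f}%")
    print(f"link usage (decimal Mbit): {usage_decimal:.2f}%")
    print(f"combiner reduction factor: {combiner_reduction:.1f}")
```

The put rate computed this way differs from the quoted 8,990,583 bytes/sec by a few dozen bytes/sec, presumably because the elapsed time was rounded before dividing.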