Try this rather small C++ program... it will more than likely be a LOT faster than anything you could do in Hadoop. Hadoop is not the hammer for every nail. Too many people think that any "cluster" solution will automagically scale their problem... 'tain't true.
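For a fair comparison, compile it with optimization and time the run. Something like the following should work (this assumes GCC on Linux and that you saved the program as wc.cpp; adjust the names and flags to taste):

g++ -O2 -o mywc wc.cpp
time ./mywc yourfile.txt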
I'd appreciate hearing your results with this.

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main(int argc, char *argv[])
{
    if (argc < 2) {
        cerr << "Usage: " << argv[0] << " [filename]" << endl;
        return -1;
    }
    ifstream in(argv[1]);
    if (!in) {
        perror(argv[1]);
        return -1;
    }
    // Read whitespace-delimited words until EOF; operator>> skips leading
    // whitespace, so this counts words the same way wc -w does.
    string str;
    int n = 0;
    while (in >> str) {
        ++n;
        //cout << str << endl;   // uncomment to echo each word
    }
    in.close();
    cout << n << " words" << endl;
    return 0;
}

Michael D. Black
Senior Scientist
NG Information Systems
Advanced Analytics Directorate

________________________________________
From: Igor Bubkin [igb...@gmail.com]
Sent: Tuesday, February 01, 2011 2:19 AM
To: common-iss...@hadoop.apache.org
Cc: common-user@hadoop.apache.org
Subject: EXTERNAL:How to speed up of Map/Reduce job?

Hello everybody,

I have a problem. I installed Hadoop on a 2-node cluster and ran the WordCount example. It takes about 20 seconds to process a 1.5 MB text file. We want to use Map/Reduce in real time (interactively, driven by users' requests), and a user can't wait 20 seconds for a response; this is too long. Is it possible to reduce the time of a Map/Reduce job? Or maybe I misunderstand something?

BR, Igor Babkin, Mifors.com
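One caveat on the comparison: the program above prints the total number of words, while Hadoop's WordCount example emits a count for every distinct word. If you want matching output from a single machine, here is a minimal sketch (assuming a C++11 compiler; it uses the same whitespace tokenization as the program above):

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main(int argc, char *argv[])
{
    if (argc < 2) {
        cerr << "Usage: " << argv[0] << " [filename]" << endl;
        return -1;
    }
    ifstream in(argv[1]);
    if (!in) {
        perror(argv[1]);
        return -1;
    }
    // Tally occurrences of each distinct word, like Hadoop's WordCount.
    unordered_map<string, long> counts;
    string word;
    while (in >> word)
        ++counts[word];
    // Emit "word<TAB>count" pairs, one per line (order is unspecified,
    // unlike the sorted output of Hadoop's reducer).
    for (const auto &kv : counts)
        cout << kv.first << '\t' << kv.second << '\n';
    return 0;
}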