Re: collecting CPU, mem, iops of hadoop jobs
You may need Ganglia. It is cluster-monitoring software. On Tue, Dec 20, 2011 at 2:44 PM, Patai Sangbutsarakum silvianhad...@gmail.com wrote: Hi Hadoopers, We're running Hadoop 0.20 on CentOS 5.5. I am looking for a way to collect the CPU time, memory usage, and IOPS of each Hadoop job. What would be a good starting point? A document? An API? Thanks in advance -P
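A hedged starting point on the API side: per-job metrics live in the job counters, which can be read off the JobTracker with the 0.20-era mapred API. Vanilla 0.20 does not expose CPU time or IOPS as counters, but the built-in file-system counters are a rough proxy for I/O volume (the job id below is a placeholder; exception handling omitted):

import org.apache.hadoop.mapred.*;

JobClient client = new JobClient(new JobConf());
RunningJob job = client.getJob(JobID.forName("job_201112200001_0001"));
Counters counters = job.getCounters();
// "FileSystemCounters" is the built-in counter group in 0.20
long hdfsRead = counters.findCounter("FileSystemCounters", "HDFS_BYTES_READ").getCounter();

For CPU and memory you would still need an external monitor such as Ganglia, as suggested above.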
Re: More cores Vs More Nodes ?
Hi Brad, This is a really interesting experiment. I am curious why you did not use 32 machines with 2 cores each; that would make the number of CPU cores in the two groups equal. Chen

On Tue, Dec 13, 2011 at 7:15 PM, Brad Sarsfield b...@bing.com wrote: Hi Prashant, In each case I had a single tasktracker per node. I oversubscribed the total tasks per tasktracker/node by 1.5 x the number of cores. So, for the 64-core allocation comparison: In A (8 cores): each machine had a single tasktracker with 8 map / 4 reduce slots, for 12 task slots total per machine x 8 machines (including the head node). In B (2 cores): each machine had a single tasktracker with 2 map / 1 reduce slots, for 3 slots total per machine x 29 machines (including the head node, which was running 8 cores). The experiment was done in a cloud-hosted environment running a set of VMs. ~Brad

-Original Message- From: Prashant Kommireddi [mailto:prash1...@gmail.com] Sent: Tuesday, December 13, 2011 9:46 AM To: common-user@hadoop.apache.org Subject: Re: More cores Vs More Nodes ? Hi Brad, How many tasktrackers did you have on each node in both cases? Thanks, Prashant Sent from my iPhone

On Dec 13, 2011, at 9:42 AM, Brad Sarsfield b...@bing.com wrote: Praveenesh, Your question is not naïve; in fact, optimal hardware design can ultimately be a very difficult question to answer. If you made me pick one without much information, I'd go for more machines. But... it all depends, and there is no right answer :)

More machines
+ May run your workload faster
+ Will give you a higher degree of reliability protection from node / hardware / hard drive failure
+ More aggregate IO capabilities
- capex / opex may be higher than allocating more cores

More cores
+ May run your workload faster
+ More cores may allow for more tasks to run on the same machine
+ More cores/tasks may reduce network contention, increasing task-to-task data flow performance

Notice "may run your workload faster" is in both, as it can be very workload dependent. My experience: I did a recent experiment and found that, given the same number of cores (64) with the exact same network / machine configuration: A: I had 8 machines with 8 cores; B: I had 28 machines with 2 cores (and one 8-core head node). B was able to outperform A by 2x using teragen and terasort. These machines were running in a virtualized environment, where some of the IO capabilities behind the scenes were being regulated to 400 Mbps per node in the 2-core configuration vs 1 Gbps in the 8-core one, so I would expect the non-throttled scenario to work even better. ~Brad

-Original Message- From: praveenesh kumar [mailto:praveen...@gmail.com] Sent: Monday, December 12, 2011 8:51 PM To: common-user@hadoop.apache.org Subject: More cores Vs More Nodes ? Hey Guys, I have a very naive question in my mind regarding Hadoop cluster nodes: more cores or more nodes? Shall I spend money on going from 2-core to 4-core machines, or spend it on buying more nodes with fewer cores, e.g. two 2-core machines instead? Thanks, Praveenesh
Re: Matrix multiplication in Hadoop
Did you try Hama? There are many methods: 1) use Hadoop MPI, which allows you to run MPI matrix-multiplication code on Hadoop; 2) Hama is designed for MM; 3) use pure Hadoop Java MapReduce. I did this before, though it may not be the optimal algorithm: put your first matrix in the DistributedCache and take each line of the second matrix as an input split. For each line, use a mapper to multiply that row by the first matrix from the DistributedCache, and use a reducer to collect the result matrix. This algorithm is limited by your DistributedCache size; it is suitable for multiplying a small matrix by a huge matrix. Chen

On Sat, Nov 19, 2011 at 10:34 AM, Tim Broberg tim.brob...@exar.com wrote: Perhaps this is a good candidate for a native library, then? From: Mike Davis [xmikeda...@gmail.com] Sent: Friday, November 18, 2011 7:39 PM To: common-user@hadoop.apache.org Subject: Re: Matrix multiplication in Hadoop

On Friday, November 18, 2011, Mike Spreitzer mspre...@us.ibm.com wrote: Why is matrix multiplication ill-suited for Hadoop? IMHO, a huge issue here is the JVM's inability to fully support CPU-vendor-specific SIMD instructions and, by extension, optimized BLAS routines. Running a large MM task using Intel's MKL, rather than relying on generic compiler optimization, is orders of magnitude faster on a single multicore processor. I see almost no way that Hadoop could win such a CPU-intensive task against an MPI cluster with even a tenth of the nodes running a decently tuned BLAS library. Racing even against a single CPU might be difficult, given the I/O overhead. Still, it's a reasonably common problem and we shouldn't murder the good in favor of the best. I'm certain an MM/LinAlg Hadoop library with even mediocre performance, wrt C, would get used. -- Mike Davis
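A minimal sketch of the mapper Chen describes (old 0.20 mapred API). The input layout (rowIndex, a tab, then space-separated values), the class name, and the loadMatrix helper are my assumptions, not from the thread; the reducer can simply be the identity:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Computes one row of (B x A): each input line is a row of the big matrix B;
// the small matrix A is loaded once per task from the DistributedCache.
public class RowTimesCachedMatrixMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  private double[][] a;  // the cached small matrix

  @Override
  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      a = loadMatrix(cached[0].toString());
    } catch (IOException e) {
      throw new RuntimeException("could not load cached matrix", e);
    }
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<LongWritable, Text> out, Reporter reporter)
      throws IOException {
    String[] parts = line.toString().split("\t");
    long row = Long.parseLong(parts[0]);
    String[] tok = parts[1].trim().split(" +");
    double[] v = new double[tok.length];  // must have as many entries as A has rows
    for (int j = 0; j < tok.length; j++) v[j] = Double.parseDouble(tok[j]);

    double[] result = new double[a[0].length];  // result = v x A
    for (int j = 0; j < v.length; j++)
      for (int k = 0; k < a[0].length; k++)
        result[k] += v[j] * a[j][k];

    StringBuilder sb = new StringBuilder();
    for (double d : result) sb.append(d).append(' ');
    out.collect(new LongWritable(row), new Text(sb.toString().trim()));
  }

  // Parses a whitespace-separated text file into a dense matrix.
  private static double[][] loadMatrix(String path) throws IOException {
    List<double[]> rows = new ArrayList<double[]>();
    BufferedReader in = new BufferedReader(new FileReader(path));
    for (String l; (l = in.readLine()) != null; ) {
      String[] tok = l.trim().split(" +");
      double[] r = new double[tok.length];
      for (int j = 0; j < tok.length; j++) r[j] = Double.parseDouble(tok[j]);
      rows.add(r);
    }
    in.close();
    return rows.toArray(new double[rows.size()][]);
  }
}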
Re: Matrix multiplication in Hadoop
Right, I agree with Edward Capriolo: Hadoop + GPGPU is a better choice. On Sat, Nov 19, 2011 at 10:53 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Sounds like a job for next-gen MapReduce, native libraries, and GPUs. A modern-day Dr. Frankenstein for sure.

On Saturday, November 19, 2011, Tim Broberg tim.brob...@exar.com wrote: Perhaps this is a good candidate for a native library, then? From: Mike Davis [xmikeda...@gmail.com] Sent: Friday, November 18, 2011 7:39 PM To: common-user@hadoop.apache.org Subject: Re: Matrix multiplication in Hadoop

On Friday, November 18, 2011, Mike Spreitzer mspre...@us.ibm.com wrote: Why is matrix multiplication ill-suited for Hadoop? IMHO, a huge issue here is the JVM's inability to fully support CPU-vendor-specific SIMD instructions and, by extension, optimized BLAS routines. Running a large MM task using Intel's MKL, rather than relying on generic compiler optimization, is orders of magnitude faster on a single multicore processor. I see almost no way that Hadoop could win such a CPU-intensive task against an MPI cluster with even a tenth of the nodes running a decently tuned BLAS library. Racing even against a single CPU might be difficult, given the I/O overhead. Still, it's a reasonably common problem and we shouldn't murder the good in favor of the best. I'm certain an MM/LinAlg Hadoop library with even mediocre performance, wrt C, would get used. -- Mike Davis
Re: HDFS file into Blocks
Hi, It is interesting that a guy from Huawei is also working on the Hadoop project. :) Chen

On Sun, Sep 25, 2011 at 11:29 PM, Uma Maheswara Rao G 72686 mahesw...@huawei.com wrote: Hi, You can find the code in DFSOutputStream.java. There is one thread, the DataStreamer thread, which picks packets from the dataQueue and writes them to the sockets. Before this, when actually writing the chunks, it sets the last-packet flag in the Packet based on the block-size parameter passed from the client. If the streamer thread finds that a packet is the last one of the block, it ends the block; that is, it closes the sockets that were used for writing the block. The streamer thread then repeats the loop; when it finds there are no sockets open, it creates the pipeline for the next block. Go through the flow starting from writeChunk in DFSOutputStream.java, which is where the packets are enqueued in the dataQueue. Regards, Uma

- Original Message - From: kartheek muthyala kartheek0...@gmail.com Date: Sunday, September 25, 2011 11:06 am Subject: HDFS file into Blocks To: common-user@hadoop.apache.org Hi all, I am working through the code to understand where HDFS divides a file into blocks. Can anyone point me to this section of the code? Thanks, Kartheek
Re: phases of Hadoop Jobs
Hi Kai, Thank you for the reply. The reduce() will not start because the shuffle phase has not finished, and the shuffle phase will not finish until all mappers end. I am curious about the design purpose of overlapping the map and reduce stages. Was this only to save shuffle time, or are there other reasons? Best wishes! Chen

On Mon, Sep 19, 2011 at 12:36 AM, Kai Voigt k...@123.org wrote: Hi Chen, the periods when nodes run map task instances and reduce task instances overlap, but the execution of map() and reduce() will not. Reduce nodes start copying data from map nodes; that's the shuffle phase, and the map nodes are still running during that copy phase. My observation has been that the map phase's progress from 0 to 100% matches the reduce phase's progress from 0 to 33%. For example, if your map progress shows 60%, reduce might show 20%. But reduce() will not start until all the map() code has processed the entire input, so you will never see reduce progress higher than 66% while map progress hasn't reached 100%. If you see the map phase reach 100% but the reduce phase never get past 66%, it means your reduce() code is broken or slow because it produces no output; an infinite loop is a common mistake. Kai

On 19.09.2011 at 07:29, He Chen wrote: Hi Arun, I have a question: do you know the reason that Hadoop allows the map and reduce stages to overlap? Or does anyone else know? Thank you in advance. Chen

On Sun, Sep 18, 2011 at 11:17 PM, Arun C Murthy a...@hortonworks.com wrote: Nan, The 'phase' is implicitly understood from the 'progress' (value) made by the map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase). For example, Reduce:
0-33% - Shuffle
34-66% - Sort (actually just 'merge'; there is no sort in the reduce since all map outputs are sorted)
67-100% - Reduce
From 0.23 onwards the Map has phases too:
0-90% - Map
91-100% - Final sort/merge
Now, about starting reduces early: this is done to ensure the shuffle can proceed for completed maps while the rest of the maps run, thereby pipelining shuffle and map completion. There is a 'reduce slowstart' feature to control this; by default, reduces aren't started until 5% of the maps are complete. Users can set this higher. Arun

On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote: Hi all, recently I was hit by a question: how is a Hadoop job divided into 2 phases? In textbooks we are told that MapReduce jobs are divided into 2 phases, map and reduce, and that reduce is further divided into 3 stages: shuffle, sort, and reduce. But I had never thought about this question in terms of the Hadoop code; I didn't see any member variables in the JobInProgress class to indicate this information. Also, according to my understanding of the source code, the reduce tasks do not necessarily wait to start until all mappers are finished; on the contrary, we can see reduce tasks in the shuffle stage while some mappers are still running. So how can I tell which phase a job is in? Thanks -- Nan Zhu School of Electronic, Information and Electrical Engineering, 229, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China E-Mail: zhunans...@gmail.com -- Kai Voigt k...@123.org
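For reference, the slowstart threshold Arun mentions is settable per job; a minimal sketch of the 0.20-era knob (the 0.80 value is just an example):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
// Don't launch reducers until 80% of the maps have finished
// (the default is 0.05, i.e. the 5% Arun mentions).
conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);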
Re: phases of Hadoop Jobs
Or we could just separate the shuffle from the reduce stage and integrate it into the map stage. Then we could clearly differentiate the map stage (everything before the shuffle finishes) from the reduce stage (everything after it).

On Mon, Sep 19, 2011 at 1:20 AM, He Chen airb...@gmail.com wrote: Hi Kai, Thank you for the reply. The reduce() will not start because the shuffle phase has not finished, and the shuffle phase will not finish until all mappers end. I am curious about the design purpose of overlapping the map and reduce stages. Was this only to save shuffle time, or are there other reasons? Best wishes! Chen

On Mon, Sep 19, 2011 at 12:36 AM, Kai Voigt k...@123.org wrote: Hi Chen, the periods when nodes run map task instances and reduce task instances overlap, but the execution of map() and reduce() will not. Reduce nodes start copying data from map nodes; that's the shuffle phase, and the map nodes are still running during that copy phase. My observation has been that the map phase's progress from 0 to 100% matches the reduce phase's progress from 0 to 33%. For example, if your map progress shows 60%, reduce might show 20%. But reduce() will not start until all the map() code has processed the entire input, so you will never see reduce progress higher than 66% while map progress hasn't reached 100%. If you see the map phase reach 100% but the reduce phase never get past 66%, it means your reduce() code is broken or slow because it produces no output; an infinite loop is a common mistake. Kai

On 19.09.2011 at 07:29, He Chen wrote: Hi Arun, I have a question: do you know the reason that Hadoop allows the map and reduce stages to overlap? Or does anyone else know? Thank you in advance. Chen

On Sun, Sep 18, 2011 at 11:17 PM, Arun C Murthy a...@hortonworks.com wrote: Nan, The 'phase' is implicitly understood from the 'progress' (value) made by the map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase). For example, Reduce:
0-33% - Shuffle
34-66% - Sort (actually just 'merge'; there is no sort in the reduce since all map outputs are sorted)
67-100% - Reduce
From 0.23 onwards the Map has phases too:
0-90% - Map
91-100% - Final sort/merge
Now, about starting reduces early: this is done to ensure the shuffle can proceed for completed maps while the rest of the maps run, thereby pipelining shuffle and map completion. There is a 'reduce slowstart' feature to control this; by default, reduces aren't started until 5% of the maps are complete. Users can set this higher. Arun

On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote: Hi all, recently I was hit by a question: how is a Hadoop job divided into 2 phases? In textbooks we are told that MapReduce jobs are divided into 2 phases, map and reduce, and that reduce is further divided into 3 stages: shuffle, sort, and reduce. But I had never thought about this question in terms of the Hadoop code; I didn't see any member variables in the JobInProgress class to indicate this information. Also, according to my understanding of the source code, the reduce tasks do not necessarily wait to start until all mappers are finished; on the contrary, we can see reduce tasks in the shuffle stage while some mappers are still running. So how can I tell which phase a job is in? Thanks -- Nan Zhu School of Electronic, Information and Electrical Engineering, 229, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China E-Mail: zhunans...@gmail.com -- Kai Voigt k...@123.org
Re: phases of Hadoop Jobs
Hi Nan, I have had the same question for a while. In some research papers, people like to make the reduce stage slow-start; that way the map stage and the reduce stage are easy to differentiate, and you can use the number of remaining unallocated map tasks to detect which stage your job is in. Letting the reduce stage overlap with the map stage blurs the boundary between the two stages. I think it may decrease the execution time of the whole job (I am not sure whether this is the main reason people allow the early start or not). However, the early start has a side effect: it is hard to get a global view of the map stage's output, so the reduce stage's load balance and data locality are not easy to solve. Chen Research Assistant, Holland Computing Center PhD student, CSE Department, University of Nebraska-Lincoln

On Sun, Sep 18, 2011 at 9:24 PM, Nan Zhu zhunans...@gmail.com wrote: Hi all, recently I was hit by a question: how is a Hadoop job divided into 2 phases? In textbooks we are told that MapReduce jobs are divided into 2 phases, map and reduce, and that reduce is further divided into 3 stages: shuffle, sort, and reduce. But I had never thought about this question in terms of the Hadoop code; I didn't see any member variables in the JobInProgress class to indicate this information. Also, according to my understanding of the source code, the reduce tasks do not necessarily wait to start until all mappers are finished; on the contrary, we can see reduce tasks in the shuffle stage while some mappers are still running. So how can I tell which phase a job is in? Thanks -- Nan Zhu School of Electronic, Information and Electrical Engineering, 229, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China E-Mail: zhunans...@gmail.com
Re: phases of Hadoop Jobs
Hi Arun, I have a question: do you know the reason that Hadoop allows the map and reduce stages to overlap? Or does anyone else know? Thank you in advance. Chen

On Sun, Sep 18, 2011 at 11:17 PM, Arun C Murthy a...@hortonworks.com wrote: Nan, The 'phase' is implicitly understood from the 'progress' (value) made by the map/reduce tasks (see o.a.h.mapred.TaskStatus.Phase). For example, Reduce:
0-33% - Shuffle
34-66% - Sort (actually just 'merge'; there is no sort in the reduce since all map outputs are sorted)
67-100% - Reduce
From 0.23 onwards the Map has phases too:
0-90% - Map
91-100% - Final sort/merge
Now, about starting reduces early: this is done to ensure the shuffle can proceed for completed maps while the rest of the maps run, thereby pipelining shuffle and map completion. There is a 'reduce slowstart' feature to control this; by default, reduces aren't started until 5% of the maps are complete. Users can set this higher. Arun

On Sep 18, 2011, at 7:24 PM, Nan Zhu wrote: Hi all, recently I was hit by a question: how is a Hadoop job divided into 2 phases? In textbooks we are told that MapReduce jobs are divided into 2 phases, map and reduce, and that reduce is further divided into 3 stages: shuffle, sort, and reduce. But I had never thought about this question in terms of the Hadoop code; I didn't see any member variables in the JobInProgress class to indicate this information. Also, according to my understanding of the source code, the reduce tasks do not necessarily wait to start until all mappers are finished; on the contrary, we can see reduce tasks in the shuffle stage while some mappers are still running. So how can I tell which phase a job is in? Thanks -- Nan Zhu School of Electronic, Information and Electrical Engineering, 229, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, China E-Mail: zhunans...@gmail.com
Re: Poor IO performance on a 10 node cluster.
Hi Gyuribácsi, I would suggest you divide the MapReduce program's execution time into three parts:

a) Map stage
In this stage, the framework splits the input data and generates map tasks. Each map task processes one block (by default; you can change this in FileInputFormat.java). As Brian said, if you have a larger block size, you may have fewer map tasks and then probably less overhead.

b) Reduce stage
b1) Shuffle phase: in this phase, each reduce task collects intermediate results from every node that has executed map tasks. Each reduce task can use many concurrent threads to fetch data (configurable in mapred-site.xml via mapreduce.reduce.shuffle.parallelcopies). But be careful about your key popularity. For example, suppose your data is "Hadoop, Hadoop, Hadoop, hello": the default partitioner will assign all three ("Hadoop", 1) key-value pairs to one node. Thus, if two nodes run reduce tasks, one of them will copy three times more data than the other, which will make that node slower. You may need to write your own partitioner.
b2) Sort and reduce phase: I think the Hadoop UI will give you some hints about how long this phase takes.

By dividing a MapReduce application into these three parts, you can easily find which one is your bottleneck and do some profiling. And I don't know why my font changed to this type. :( Hope this helps. Chen

On Mon, May 30, 2011 at 12:32 PM, Harsh J ha...@cloudera.com wrote: Psst. The cats speak in their own language ;-) On Mon, May 30, 2011 at 10:31 PM, James Seigel ja...@tynt.com wrote: Not sure that will help ;) Sent from my mobile. Please excuse the typos. On 2011-05-30, at 9:23 AM, Boris Aleksandrovsky balek...@gmail.com wrote: Ljddfjfjfififfifjftjiifjfjjjffkxbznzsjxodiewisshsudddudsjidhddueiweefiuftttoitfiirriifoiffkllddiririiriioerorooiieirrioeekroooeoooirjjfdijdkkduddjudiiehs

On May 30, 2011 5:28 AM, Gyuribácsi bogyo...@gmail.com wrote: Hi, I have a 10-node cluster (IBM blade servers, 48GB RAM, 2x500GB disks, 16 HT cores). I've uploaded 10 files to HDFS; each file is 10GB. I used the streaming jar with 'wc -l' as the mapper and 'cat' as the reducer. I use a 64MB block size and the default replication (3). The wc on the 100GB took about 220 seconds, which translates to about 3.5 Gbit/s processing speed. One disk can do sequential reads at 1 Gbit/s, so I would expect something around 20 Gbit/s (minus some overhead), and I'm getting only 3.5. Is my expectation valid? I checked the jobtracker and it seems all nodes are working, each reading the right blocks. I have not played with the number of mappers and reducers yet. The number of mappers is the same as the number of blocks, and the number of reducers is 20 (there are 20 disks); this looks OK to me. We also did an experiment with TestDFSIO, with similar results: aggregate read IO speed is around 3.5 Gbit/s. It is just too far from my expectation :( Please help! Thank you, Gyorgy -- Harsh J
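As a sketch of the custom-partitioner idea from Chen's reply (old 0.20 mapred API; treating "Hadoop" as the known-hot key is purely illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SkewAwarePartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) {}

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions == 1) return 0;
    // Give the known-hot key a reducer of its own and hash everything
    // else across the remaining reducers.
    if ("Hadoop".equals(key.toString())) return 0;
    return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
  }
}

It would be wired in with conf.setPartitionerClass(SkewAwarePartitioner.class).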
How can I stop datanodes from keeping blocks from other datanodes after a MapReduce job
Hi all, I remember there is a parameter that can turn this off; I mean, disallow a tasktracker from keeping blocks from other datanodes after a MapReduce job finishes. I met a problem when using hadoop-0.21.0. First of all, I balanced the cluster according to the number of blocks on every datanode. That is to say, for example, under /user/test/ I have 100 blocks of data; the replication number is 2, so there are 200 blocks in total under /user/test. I have 10 datanodes, and what I did was let every datanode hold 20 of those blocks. However, after about 300 MapReduce jobs finished, I found that the number of blocks on the datanodes had changed: it is no longer 20 for every datanode; some have 21 and some have 19. I had turned off the Hadoop balancer. What is the reason for this? Any suggestion will be appreciated! Best wishes! Chen
Change block size from 64M to 128M does not work on Hadoop-0.21
Hi all, I met a problem changing the block size from 64M to 128M. I am sure I modified the correct configuration file, hdfs-site.xml, because I can change the replication number correctly. However, the block-size change does not take effect. For example, I change dfs.block.size to 134217728 bytes, upload a file which is 128M, and use fsck to see how many blocks the file has. It shows:

/user/file1/file 134217726 bytes, 2 block(s): OK
0. blk_xx len=67108864 repl=2 [192.168.0.3:50010, 192.168.0.32:50010]
1. blk_xx len=67108862 repl=2 [192.168.0.9:50010, 192.168.0.8:50010]

The Hadoop version is 0.21. Any suggestion will be appreciated! Thanks, Chen
Re: Change block size from 64M to 128M does not work on Hadoop-0.21
Hi Harsh, Thank you for the reply. Actually, the Hadoop directory is on my NFS server; every node reads the same files from the NFS server, so I think that is not the problem. I like your second solution, but I am not sure whether the namenode will divide those 128MB blocks into smaller ones in the future or not. Chen

On Wed, May 4, 2011 at 3:00 PM, Harsh J ha...@cloudera.com wrote: Your client (put) machine must have the same block-size configuration during the upload as well. Alternatively, you may do something explicit like `hadoop dfs -Ddfs.block.size=size -put file file`

On Thu, May 5, 2011 at 12:59 AM, He Chen airb...@gmail.com wrote: Hi all, I met a problem changing the block size from 64M to 128M. I am sure I modified the correct configuration file, hdfs-site.xml, because I can change the replication number correctly. However, the block-size change does not take effect. For example, I change dfs.block.size to 134217728 bytes, upload a file which is 128M, and use fsck to see how many blocks the file has. It shows:

/user/file1/file 134217726 bytes, 2 block(s): OK
0. blk_xx len=67108864 repl=2 [192.168.0.3:50010, 192.168.0.32:50010]
1. blk_xx len=67108862 repl=2 [192.168.0.9:50010, 192.168.0.8:50010]

The Hadoop version is 0.21. Any suggestion will be appreciated! Thanks, Chen -- Harsh J
Re: Change block size from 64M to 128M does not work on Hadoop-0.21
Tried the second solution. It does not work; still two 64M blocks.

On Wed, May 4, 2011 at 3:16 PM, He Chen airb...@gmail.com wrote: Hi Harsh, Thank you for the reply. Actually, the Hadoop directory is on my NFS server; every node reads the same files from the NFS server, so I think that is not the problem. I like your second solution, but I am not sure whether the namenode will divide those 128MB blocks into smaller ones in the future or not. Chen

On Wed, May 4, 2011 at 3:00 PM, Harsh J ha...@cloudera.com wrote: Your client (put) machine must have the same block-size configuration during the upload as well. Alternatively, you may do something explicit like `hadoop dfs -Ddfs.block.size=size -put file file`

On Thu, May 5, 2011 at 12:59 AM, He Chen airb...@gmail.com wrote: Hi all, I met a problem changing the block size from 64M to 128M. I am sure I modified the correct configuration file, hdfs-site.xml, because I can change the replication number correctly. However, the block-size change does not take effect. For example, I change dfs.block.size to 134217728 bytes, upload a file which is 128M, and use fsck to see how many blocks the file has. It shows:

/user/file1/file 134217726 bytes, 2 block(s): OK
0. blk_xx len=67108864 repl=2 [192.168.0.3:50010, 192.168.0.32:50010]
1. blk_xx len=67108862 repl=2 [192.168.0.9:50010, 192.168.0.8:50010]

The Hadoop version is 0.21. Any suggestion will be appreciated! Thanks, Chen -- Harsh J
Re: Change block size from 64M to 128M does not work on Hadoop-0.21
Got it. Thank you, Harsh. BTW, it is `hadoop dfs -Ddfs.blocksize=size -put file file`; no dot between "block" and "size".

On Wed, May 4, 2011 at 3:18 PM, He Chen airb...@gmail.com wrote: Tried the second solution. It does not work; still two 64M blocks.

On Wed, May 4, 2011 at 3:16 PM, He Chen airb...@gmail.com wrote: Hi Harsh, Thank you for the reply. Actually, the Hadoop directory is on my NFS server; every node reads the same files from the NFS server, so I think that is not the problem. I like your second solution, but I am not sure whether the namenode will divide those 128MB blocks into smaller ones in the future or not. Chen

On Wed, May 4, 2011 at 3:00 PM, Harsh J ha...@cloudera.com wrote: Your client (put) machine must have the same block-size configuration during the upload as well. Alternatively, you may do something explicit like `hadoop dfs -Ddfs.block.size=size -put file file`

On Thu, May 5, 2011 at 12:59 AM, He Chen airb...@gmail.com wrote: Hi all, I met a problem changing the block size from 64M to 128M. I am sure I modified the correct configuration file, hdfs-site.xml, because I can change the replication number correctly. However, the block-size change does not take effect. For example, I change dfs.block.size to 134217728 bytes, upload a file which is 128M, and use fsck to see how many blocks the file has. It shows:

/user/file1/file 134217726 bytes, 2 block(s): OK
0. blk_xx len=67108864 repl=2 [192.168.0.3:50010, 192.168.0.32:50010]
1. blk_xx len=67108862 repl=2 [192.168.0.9:50010, 192.168.0.8:50010]

The Hadoop version is 0.21. Any suggestion will be appreciated! Thanks, Chen -- Harsh J
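For completeness, the block size can also be set programmatically at create time, which sidesteps the client-side configuration question entirely; a minimal sketch (path and replication taken from the thread for illustration, exception handling omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
// create(path, overwrite, bufferSize, replication, blockSize)
FSDataOutputStream out = fs.create(
    new Path("/user/file1/file"),
    true,
    conf.getInt("io.file.buffer.size", 4096),
    (short) 2,                      // replication
    128L * 1024 * 1024);            // 128MB block size, in bytes
out.close();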
Apply HADOOP-4667 to branch-0.20
Hey everyone, I tried to apply the HADOOP-4667 patch to branch-0.20 but always failed. My cluster is based on branch-0.20, and I want to test the delay-scheduling performance without re-formatting HDFS, which is why I tried to apply HADOOP-4667 to branch-0.20. Has anyone done this before, or does anyone have a suggestion? Thank you in advance. Best wishes! Chen
Anyone know where to get Hadoop production cluster logs
Hi all, I am working on Hadoop schedulers, but I do not know where to get logs from Hadoop production clusters. Any suggestions? Bests Chen
Re: Not able to Run C++ code in Hadoop Cluster
Agree with Keith Wiley, we use streaming also. On Mon, Mar 14, 2011 at 11:40 AM, Keith Wiley kwi...@keithwiley.com wrote: Not to speak against pipes because I don't have much experience with it, but I eventually abandoned my pipes efforts and went with streaming. If you don't get pipes to work, you might take a look at streaming as an alternative. Cheers! Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me. -- Abe (Grandpa) Simpson
Anyone know how to attach a figure to a Hadoop Wiki page?
Hi all Any suggestions? Bests Chen
Re: Cuda Program in Hadoop Cluster
Hi, Adarsh Sharma, For C code, my friend employs Hadoop streaming to run CUDA C code. You can send email to him: p...@cse.unl.edu. Best wishes! Chen

On Thu, Mar 3, 2011 at 11:18 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote: Dear all, I followed a fantastic tutorial and was able to run the WordCount C++ program on a Hadoop cluster: http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_--_Running_C%2B%2B_Programs_on_Hadoop But now I want to run a CUDA program on the Hadoop cluster, and it results in errors. Has anyone done this before who can guide me through it? I attached both files; please find the attachment. Thanks and best regards, Adarsh Sharma
Re: hadoop balancer
Thank you very much, icebergs. I rewrote the balancer. Now, given a directory like /user/foo/, I can balance the blocks under that directory evenly across every node in the cluster. Best wishes! Chen

On Thu, Mar 3, 2011 at 11:14 PM, icebergs hkm...@gmail.com wrote: Try this command: hadoop fs -setrep -R -w 2 xx Maybe it helps.

2011/3/2 He Chen airb...@gmail.com: Hi all, I met a problem when trying to balance a certain HDFS directory among the cluster. For example, I have a directory /user/xxx/ with 100 blocks, and I want to balance them among my 5-node cluster so each node has 40 blocks (2 replicas). The problem is transferring a block from one datanode to another. I followed the balancer's method, but it always waits for the response of the destination datanode and halts. I attached the code:

...
Socket sock = new Socket();
DataOutputStream out = null;
DataInputStream in = null;
try {
  sock.connect(NetUtils.createSocketAddr(target.getName()),
               HdfsConstants.READ_TIMEOUT);
  sock.setKeepAlive(true);
  System.out.println(sock.isConnected());
  out = new DataOutputStream(
      new BufferedOutputStream(sock.getOutputStream(),
                               FSConstants.BUFFER_SIZE));
  out.writeShort(DataTransferProtocol.DATA_TRANSFER_VERSION);
  out.writeByte(DataTransferProtocol.OP_REPLACE_BLOCK);
  out.writeLong(block2move.getBlockId());
  out.writeLong(block2move.getGenerationStamp());
  Text.writeString(out, source.getStorageID());
  System.out.println("Ready to move");
  source.write(out);
  System.out.println("Write to output stream");
  out.flush();
  System.out.println("out has been flushed!");
  in = new DataInputStream(
      new BufferedInputStream(sock.getInputStream(),
                              FSConstants.BUFFER_SIZE));
  // It stops here, waiting for the response.
  short status = in.readShort();
  System.out.println("Got the response from input stream! " + status);
  if (status != DataTransferProtocol.OP_STATUS_SUCCESS) {
    throw new IOException("block move failed\t" + status);
  }
} catch (IOException e) {
  LOG.warn("Error moving block " + block2move.getBlockId() +
           " from " + source.getName() + " to " + target.getName() +
           " through " + source.getName() + ": " + e.toString());
} finally {
  IOUtils.closeStream(out);
  IOUtils.closeStream(in);
  IOUtils.closeSocket(sock);
}
...

Any reply will be appreciated. Thank you in advance! Chen
Re: CUDA on Hadoop
Thank you, Steve Loughran. I just created a new page on the Hadoop wiki; however, how can I create a new document page on the Hadoop Wiki? Best wishes, Chen

On Thu, Feb 10, 2011 at 5:38 AM, Steve Loughran ste...@apache.org wrote: On 09/02/11 17:31, He Chen wrote: Hi Sharma, I shared our slides about CUDA performance on Hadoop clusters. Feel free to modify them, but please mention the copyright! This is nice. If you stick it up online you should link to it from the Hadoop wiki pages; maybe start a hadoop+cuda page and refer to it.
Re: CUDA on Hadoop
Hi Sharma, I have some experience working with hybrid Hadoop and GPUs. Our group has tested CUDA performance on Hadoop clusters; we obtained a 20x speedup and saved up to 95% power consumption in some computation-intensive test cases. You can parallelize your Java code by using JCuda, a set of Java bindings that let you call CUDA from Java code. Chen

On Wed, Feb 9, 2011 at 8:45 AM, Steve Loughran ste...@apache.org wrote: On 09/02/11 13:58, Harsh J wrote: You can check out this project, which did some work for Hama+CUDA: http://code.google.com/p/mrcl/ Amazon lets you bring up a Hadoop cluster on machines with GPUs you can code against, but I haven't heard of anyone using it. The big issue is bandwidth; it just doesn't make sense for a classic scan-through-the-logs kind of problem, as the disk:GPU bandwidth ratio is even worse than disk:CPU. That said, if you were doing something that involved a lot of compute on a block of data (e.g. rendering tiles in a map), this could work.
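For a flavor of what calling CUDA from Java looks like, a minimal JCuda initialization sketch (my illustration, not code from Chen's group; it assumes the JCuda jars and native libraries are visible to the task JVM):

import jcuda.driver.CUdevice;
import jcuda.driver.JCudaDriver;

// e.g. inside a mapper's configure():
JCudaDriver.setExceptionsEnabled(true);  // turn CUDA errors into exceptions
JCudaDriver.cuInit(0);                   // initialize the driver API
CUdevice dev = new CUdevice();
JCudaDriver.cuDeviceGet(dev, 0);         // grab GPU 0 on the worker node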
Re: Question about Hadoop Default FCFS Job Scheduler
Hi Nan, Thank you for the reply. I understand what you mean. What I am concerned about is that inside the obtainNewLocalMapTask(...) method, only one task is assigned at a time. Now I understand why: it is because of the outer loop:

for (i = 0; i < MapperCapacity; ++i) {
  (..)
}

I mean, why does this loop exist here? Why does the scheduler use this type of loop? It imposes overhead on the task-assigning process if only one task is assigned at a time. Obviously a node could be assigned all the local tasks it can afford in one obtainNewLocalMapTask(...) call. Bests Chen

On Mon, Jan 17, 2011 at 8:28 AM, Nan Zhu zhunans...@gmail.com wrote: Hi Chen, How is it going recently? Actually I think you misunderstand the code in assignTasks() in JobQueueTaskScheduler.java; see the following structure of the interesting code:

// I'm sorry, I hacked the code so much that the variable names may be
// different from the original version
for (i = 0; i < MapperCapacity; ++i) {
  ...
  for (JobInProgress job : jobQueue) {
    // try to schedule a node-local or rack-local map task
    // here is the interesting place
    t = job.obtainNewLocalMapTask(...);
    if (t != null) {
      ...
      break;  // the break here sends control back to "for (job : jobQueue)",
              // meaning task selection restarts from the first job; so it
              // actually schedules all of the first job's local mappers
              // first, until the map slots are full
    }
  }
}

BTW, we can only schedule one reduce task per heartbeat. Best, Nan

On Sat, Jan 15, 2011 at 1:45 PM, He Chen airb...@gmail.com wrote: Hey all, Why does the FCFS scheduler only let a node choose one task at a time from one job? To increase data locality, it would be reasonable to let a node choose all its local tasks (if it can) from a job at a time. Any reply will be appreciated. Thanks Chen
Question about Hadoop Default FCFS Job Scheduler
Hey all, Why does the FCFS scheduler only let a node choose one task at a time from one job? To increase data locality, it would be reasonable to let a node choose all its local tasks (if it can) from a job at a time. Any reply will be appreciated. Thanks Chen
Re: Why Hadoop uses HTTP for file transmission between Map and Reduce?
Actually, PhEDEx uses GridFTP for its data transfers.

On Thu, Jan 13, 2011 at 5:34 AM, Steve Loughran ste...@apache.org wrote: On 13/01/11 08:34, li ping wrote: That is also my concern: is it efficient for data transmission? It's long-lived TCP connections, reasonably efficient for bulk data transfer, has all the throttling of TCP built in, and comes with some excellently debugged client and server code in the form of Jetty and HttpClient. In maintenance costs alone, those libraries justify HTTP unless you have a vastly superior option *and are willing to maintain it forever*. FTP's limits are well known (security); NFS's limits are well known (security, and the UDP version doesn't throttle); self-developed protocols will have whatever problems you want. There are better protocols for long-haul data transfer over fat pipes, such as GridFTP and PhEDEx ( http://www.gridpp.ac.uk/papers/ah05_phedex.pdf ), which use multiple TCP channels in parallel to reduce the impact of a single lost packet, but within a datacentre you shouldn't have to worry about this. If you do find lots of packets getting lost, raise the issue with the networking team. -Steve

On Thu, Jan 13, 2011 at 4:27 PM, Nan Zhu zhunans...@gmail.com wrote: Hi all, I have a question about the file transfer between the map and reduce stages: in the current implementation, the reducers get the results generated by the mappers through HTTP GET. I don't understand why HTTP was selected; why not FTP, or a self-developed protocol? Is it just because HTTP is simple? Thanks Nan
Jcuda on Hadoop
Hello everyone, I've got a problem when I write a JCuda program based on Hadoop MapReduce. I use jcudaUtill; the KernelLauncherSample can be executed successfully on my worker node. However, when I submit a program containing JCuda to Hadoop MapReduce, I get the following errors. Any reply will be appreciated!

10/12/09 15:41:39 INFO mapred.JobClient: Running job: job_201012091523_0002
10/12/09 15:41:40 INFO mapred.JobClient: map 0% reduce 0%
10/12/09 15:41:53 INFO mapred.JobClient: Task Id : attempt_201012091523_0002_m_00_0, Status : FAILED
Error: Could not load native library
attempt_201012091523_0002_m_00_0: [GC 7402K->1594K(12096K), 0.0045650 secs]
attempt_201012091523_0002_m_00_0: [GC 108666K->104610K(116800K), 0.0106860 secs]
attempt_201012091523_0002_m_00_0: [Full GC 104610K->104276K(129856K), 0.0482530 secs]
attempt_201012091523_0002_m_00_0: Error while loading native library with base name JCudaDriver
attempt_201012091523_0002_m_00_0: Operating system name: Linux
attempt_201012091523_0002_m_00_0: Architecture : amd64
attempt_201012091523_0002_m_00_0: Architecture bit size: 64
10/12/09 15:42:00 INFO mapred.JobClient: Task Id : attempt_201012091523_0002_m_00_1, Status : FAILED
Error: Could not load native library
attempt_201012091523_0002_m_00_1: [GC 7373K->1573K(18368K), 0.0045230 secs]
attempt_201012091523_0002_m_00_1: Error while loading native library with base name JCudaDriver
attempt_201012091523_0002_m_00_1: Operating system name: Linux
attempt_201012091523_0002_m_00_1: Architecture : amd64
attempt_201012091523_0002_m_00_1: Architecture bit size: 64

It looks like the JCuda library file cannot be loaded successfully. Actually, I tried many combinations:
1) I included all the JCuda library files in my TaskTracker's classpath and also included them in my MapReduce program.
2) TaskTracker's classpath without the JCuda libraries, but my program containing them.
3) TaskTracker's classpath with the JCuda libraries, but my program without them.
All of them report the same error above.
Re: Jcuda on Hadoop
I am still in the test stage; I mean I only start one JobTracker and one TaskTracker. I copied all the jcuda.*.jar files into HADOOP_HOME/lib/, and from `ps aux | grep java` I can confirm the JT and TT processes both contain the jcuda.*.jar files.

On Thu, Dec 9, 2010 at 4:05 PM, Mathias Herberts mathias.herbe...@gmail.com wrote: You need to have the native libs on all tasktrackers and have java.library.path correctly set.

On Dec 9, 2010 11:01 PM, He Chen airb...@gmail.com wrote: Hello everyone, I've got a problem when I write a JCuda program based on Hadoop MapReduce. I use jcudaUtill; the KernelLauncherSample can be executed successfully on my worker node. However, when I submit a program containing JCuda to Hadoop MapReduce, I get the following errors. Any reply will be appreciated!

10/12/09 15:41:39 INFO mapred.JobClient: Running job: job_201012091523_0002
10/12/09 15:41:40 INFO mapred.JobClient: map 0% reduce 0%
10/12/09 15:41:53 INFO mapred.JobClient: Task Id : attempt_201012091523_0002_m_00_0, Status : FAILED
Error: Could not load native library
attempt_201012091523_0002_m_00_0: [GC 7402K->1594K(12096K), 0.0045650 secs]
attempt_201012091523_0002_m_00_0: [GC 108666K->104610K(116800K), 0.0106860 secs]
attempt_201012091523_0002_m_00_0: [Full GC 104610K->104276K(129856K), 0.0482530 secs]
attempt_201012091523_0002_m_00_0: Error while loading native library with base name JCudaDriver
attempt_201012091523_0002_m_00_0: Operating system name: Linux
attempt_201012091523_0002_m_00_0: Architecture : amd64
attempt_201012091523_0002_m_00_0: Architecture bit size: 64
10/12/09 15:42:00 INFO mapred.JobClient: Task Id : attempt_201012091523_0002_m_00_1, Status : FAILED
Error: Could not load native library
attempt_201012091523_0002_m_00_1: [GC 7373K->1573K(18368K), 0.0045230 secs]
attempt_201012091523_0002_m_00_1: Error while loading native library with base name JCudaDriver
attempt_201012091523_0002_m_00_1: Operating system name: Linux
attempt_201012091523_0002_m_00_1: Architecture : amd64
attempt_201012091523_0002_m_00_1: Architecture bit size: 64

It looks like the JCuda library file cannot be loaded successfully. Actually, I tried many combinations:
1) I included all the JCuda library files in my TaskTracker's classpath and also included them in my MapReduce program.
2) TaskTracker's classpath without the JCuda libraries, but my program containing them.
3) TaskTracker's classpath with the JCuda libraries, but my program without them.
All of them report the same error above.
Re: Jcuda on Hadoop
Thank you so much, Mathias Herberts. This really helps.

On Thu, Dec 9, 2010 at 4:17 PM, He Chen airb...@gmail.com wrote: I am still in the test stage; I mean I only start one JobTracker and one TaskTracker. I copied all the jcuda.*.jar files into HADOOP_HOME/lib/, and from `ps aux | grep java` I can confirm the JT and TT processes both contain the jcuda.*.jar files.

On Thu, Dec 9, 2010 at 4:05 PM, Mathias Herberts mathias.herbe...@gmail.com wrote: You need to have the native libs on all tasktrackers and have java.library.path correctly set.

On Dec 9, 2010 11:01 PM, He Chen airb...@gmail.com wrote: Hello everyone, I've got a problem when I write a JCuda program based on Hadoop MapReduce. I use jcudaUtill; the KernelLauncherSample can be executed successfully on my worker node. However, when I submit a program containing JCuda to Hadoop MapReduce, I get the following errors. Any reply will be appreciated!

10/12/09 15:41:39 INFO mapred.JobClient: Running job: job_201012091523_0002
10/12/09 15:41:40 INFO mapred.JobClient: map 0% reduce 0%
10/12/09 15:41:53 INFO mapred.JobClient: Task Id : attempt_201012091523_0002_m_00_0, Status : FAILED
Error: Could not load native library
attempt_201012091523_0002_m_00_0: [GC 7402K->1594K(12096K), 0.0045650 secs]
attempt_201012091523_0002_m_00_0: [GC 108666K->104610K(116800K), 0.0106860 secs]
attempt_201012091523_0002_m_00_0: [Full GC 104610K->104276K(129856K), 0.0482530 secs]
attempt_201012091523_0002_m_00_0: Error while loading native library with base name JCudaDriver
attempt_201012091523_0002_m_00_0: Operating system name: Linux
attempt_201012091523_0002_m_00_0: Architecture : amd64
attempt_201012091523_0002_m_00_0: Architecture bit size: 64
10/12/09 15:42:00 INFO mapred.JobClient: Task Id : attempt_201012091523_0002_m_00_1, Status : FAILED
Error: Could not load native library
attempt_201012091523_0002_m_00_1: [GC 7373K->1573K(18368K), 0.0045230 secs]
attempt_201012091523_0002_m_00_1: Error while loading native library with base name JCudaDriver
attempt_201012091523_0002_m_00_1: Operating system name: Linux
attempt_201012091523_0002_m_00_1: Architecture : amd64
attempt_201012091523_0002_m_00_1: Architecture bit size: 64

It looks like the JCuda library file cannot be loaded successfully. Actually, I tried many combinations:
1) I included all the JCuda library files in my TaskTracker's classpath and also included them in my MapReduce program.
2) TaskTracker's classpath without the JCuda libraries, but my program containing them.
3) TaskTracker's classpath with the JCuda libraries, but my program without them.
All of them report the same error above.
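One hedged way to get java.library.path set for the task JVMs, per Mathias's advice, is through the child JVM options (the property is standard in 0.20/0.21; the library path below is a placeholder):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
// Every map/reduce child JVM is launched with these options; note the
// value replaces the default ("-Xmx200m"), so keep a heap setting in it.
conf.set("mapred.child.java.opts",
         "-Xmx512m -Djava.library.path=/usr/local/jcuda/lib");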
Re: Two questions.
Both options 1 and 3 will work.

On Wed, Nov 3, 2010 at 9:28 PM, James Seigel ja...@tynt.com wrote: Option 1 = good. Sent from my mobile. Please excuse the typos.

On 2010-11-03, at 8:27 PM, shangan shan...@corp.kaixin001.com wrote: I don't think the first two options can work; even if you stop the tasktracker, these to-be-retired nodes are still connected to the namenode. Option 3 can work. You only need to add this exclude file on the namenode, and it is a regular file. Add a key named dfs.hosts.exclude to your conf/hadoop-site.xml file; the value associated with this key provides the full path to a file on the NameNode's local file system which contains a list of machines that are not permitted to connect to HDFS. Then you can run the command bin/hadoop dfsadmin -refreshNodes, and the cluster will decommission the nodes in the exclude file. This might take a while, as the cluster needs to move data from the retired nodes to the remaining ones. After this you can use the retired nodes as a new cluster. But remember to remove those nodes from the slaves file; you can delete the exclude file afterwards. 2010-11-04 shangan

From: Raj V Sent: 2010-11-04 10:05:44 To: common-user Cc: Subject: Two questions.
1. I have a 512-node cluster. I need to have 32 nodes do something else. They can be datanodes, but I cannot run any map or reduce jobs on them. So I see three options:
1. Stop the tasktracker on those nodes and leave the datanode running.
2. Set mapred.tasktracker.reduce.tasks.maximum and mapred.tasktracker.map.tasks.maximum to 0 on these nodes and make these final.
3. Use the parameter mapred.hosts.exclude.
I am assuming that any of the three methods would work. To start with, I went with option 3. I used a local file /home/hadoop/myjob.exclude, and the file myjob.exclude had the hostname of one host per line (hadoop-480 .. hadoop-511). But I see both map and reduce jobs being scheduled to all 511 nodes. I understand there is an inherent inefficiency in running only the datanode on these 32 nodes. Here are my questions:
1. Will all three methods work?
2. If I choose method 3, does this file exist as a DFS file or a regular file? If a regular file, does it need to exist on all the nodes or only on the node where the job is submitted?
Many thanks in advance / Raj
Re: Can not upload local file to HDFS
Thanks, but I think that goes too far; let's focus on the problem itself.

On Sun, Sep 26, 2010 at 11:43 AM, Nan Zhu zhunans...@gmail.com wrote: Have you checked the log files in the directory? I always find some important information there. I suggest you recompile Hadoop with ant, since the mapred daemons also don't work. Nan

On Sun, Sep 26, 2010 at 7:29 PM, He Chen airb...@gmail.com wrote: The problem is that every datanode may be listed in the error report; does that mean all my datanodes are bad? One thing I forgot to mention: I cannot use start-all.sh and stop-all.sh to start and stop the dfs and mapred processes on my cluster, but the jobtracker and namenode web interfaces still work. I think I can solve this by ssh-ing to every node, killing the current Hadoop processes, and restarting them; the previous problem should also be solved that way (in my opinion). But I really want to know why HDFS reports the errors above.

On Sat, Sep 25, 2010 at 11:20 PM, Nan Zhu zhunans...@gmail.com wrote: Hi Chen, It seems that you have a bad datanode? Maybe you should reformat them? Nan

On Sun, Sep 26, 2010 at 10:42 AM, He Chen airb...@gmail.com wrote: Hello Neil, No matter how big the file is, it always reports this to me. The file sizes range from 10KB to 100MB.

On Sat, Sep 25, 2010 at 6:08 PM, Neil Ghosh neil.gh...@gmail.com wrote: How big is the file? Did you try formatting the namenode and datanodes?

On Sun, Sep 26, 2010 at 2:12 AM, He Chen airb...@gmail.com wrote: Hello everyone, I cannot upload a local file to HDFS. It gives the following errors:

WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-236192853234282209_419415
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readLong(DataInputStream.java:416)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2397)
10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block blk_-236192853234282209_419415 bad datanode[0] 192.168.0.23:50010
10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block blk_-236192853234282209_419415 in pipeline 192.168.0.23:50010, 192.168.0.39:50010: bad datanode 192.168.0.23:50010

Any response will be appreciated! -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD student, CSE Dept. Research Assistant, Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588
Re: Can not upload local file to HDFS
The problem is that every datanode may be listed in the error report; does that mean all my datanodes are bad? One thing I forgot to mention: I cannot use start-all.sh and stop-all.sh to start and stop the dfs and mapred processes on my cluster, but the jobtracker and namenode web interfaces still work. I think I can solve this by ssh-ing to every node, killing the current Hadoop processes, and restarting them; the previous problem should also be solved that way (in my opinion). But I really want to know why HDFS reports the errors above.

On Sat, Sep 25, 2010 at 11:20 PM, Nan Zhu zhunans...@gmail.com wrote: Hi Chen, It seems that you have a bad datanode? Maybe you should reformat them? Nan

On Sun, Sep 26, 2010 at 10:42 AM, He Chen airb...@gmail.com wrote: Hello Neil, No matter how big the file is, it always reports this to me. The file sizes range from 10KB to 100MB.

On Sat, Sep 25, 2010 at 6:08 PM, Neil Ghosh neil.gh...@gmail.com wrote: How big is the file? Did you try formatting the namenode and datanodes?

On Sun, Sep 26, 2010 at 2:12 AM, He Chen airb...@gmail.com wrote: Hello everyone, I cannot upload a local file to HDFS. It gives the following errors:

WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-236192853234282209_419415
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readLong(DataInputStream.java:416)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2397)
10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block blk_-236192853234282209_419415 bad datanode[0] 192.168.0.23:50010
10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block blk_-236192853234282209_419415 in pipeline 192.168.0.23:50010, 192.168.0.39:50010: bad datanode 192.168.0.23:50010

Any response will be appreciated!
Can not upload local file to HDFS
Hello everyone, I cannot upload a local file to HDFS. It gives the following errors:

WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-236192853234282209_419415
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readLong(DataInputStream.java:416)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2397)
10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block blk_-236192853234282209_419415 bad datanode[0] 192.168.0.23:50010
10/09/25 15:38:25 WARN hdfs.DFSClient: Error Recovery for block blk_-236192853234282209_419415 in pipeline 192.168.0.23:50010, 192.168.0.39:50010: bad datanode 192.168.0.23:50010

Any response will be appreciated! -- Best Wishes! 顺送商祺! -- Chen He
Re: Job performance issue: output.collect()
Hey Oded Rosen, I am not sure what the functionality of your map() method is. Intuitively, if your map() output is the problem, move the map() computation into the reduce() method; I mean, let map() act only as a data reader and divider, and let reduce() do all your computation. That way your intermediate results are smaller than before, and shuffle time can also be reduced. If the computation is still slow, I think it may not be a MapReduce framework problem but your program's. Hope this helps. Chen

On Wed, Sep 1, 2010 at 7:18 AM, Oded Rosen o...@legolas-media.com wrote: Hi all, My job (written in the old 0.18 API, but that's not the issue here) is producing large amounts of map output. Each map() call generates about ~20 output.collect() calls, and each output is pretty big (~1K), so each map() produces about 20K. All of this data is fed to a combiner that really reduces the output's size and count. The job input is not so big: there are about 120M map input records. This job is pretty slow; other jobs that work on the same input are much faster, since they do not produce so much output. Analyzing the job's performance (timing the parts of the map() function), I've seen that much time is spent on the output.collect() line itself. I know that during the output.collect() call the output is written to local-filesystem spills (when the spill buffer reaches the 80% limit), so I guessed that reducing the size of each output would improve performance. This was not the case: after cutting 30% of the map output size, the job took the same amount of time. The thing I cannot reduce is the number of output lines written out of the map. I would like to know what happens in the output.collect() line that takes so much time, in order to cut down this job's running time. Please keep in mind that I have a combiner, and to my understanding different things happen to the map output when a combiner is present. Can anyone help me understand how I can save this precious time? Thanks, -- Oded
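Since the time is going into spills, one knob worth checking (my suggestion, not something Oded mentioned) is the map-side sort buffer; property names are from the 0.18-0.20 defaults:

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
// A bigger sort buffer means fewer spills per map (default is 100 MB).
conf.setInt("io.sort.mb", 200);
// Spill when the buffer is 90% full instead of the default 80%.
conf.setFloat("io.sort.spill.percent", 0.90f);

Fewer, larger spills reduce the merge work each map does at the end.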
Re: Best way to reduce a 8-node cluster in half and get hdfs to come out of safe mode
Way #3:
1) Bring up all 8 datanodes and the namenode.
2) Retire one of your 4 nodes: kill the datanode process, then run hadoop dfsadmin -refreshNodes (this should be done on the nn).
3) Repeat step 2 three more times.

On Fri, Aug 6, 2010 at 1:21 AM, Allen Wittenauer awittena...@linkedin.com wrote: On Aug 5, 2010, at 10:42 PM, Steve Kuo wrote: As part of our experimentation, the plan is to pull 4 slave nodes out of an 8-slave/1-master cluster. With the replication factor set to 3, I thought losing half of the cluster might be too much for HDFS to recover from, so I copied all relevant data out of HDFS to local disk and reconfigured the cluster. It depends. If you have configured Hadoop with a topology such that the 8 nodes were in 2 logical racks, then it would have worked just fine. If you didn't have any topology configured, then each node is considered its own rack, so pulling half of the grid down means you are likely losing a good chunk of all your blocks. The 4 slave nodes started okay, but HDFS never left safe mode. The nn.log has the following line. What is the best way to deal with this? Shall I restart the cluster with 8 nodes and then delete /data/hadoop-hadoop/mapred/system? Or shall I reformat HDFS? Two ways to go:

Way #1:
1) configure dfs.hosts
2) bring up all 8 nodes
3) configure dfs.hosts.exclude to include the 4 you don't want
4) dfsadmin -refreshNodes to start decommissioning the 4 you don't want

Way #2:
1) configure a topology
2) bring up all 8 nodes
3) setrep all files +1
4) wait for the nn to finish replication
5) pull 4 nodes
6) bring down the nn
7) remove the topology
8) bring the nn up
9) setrep -1

-- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD student, CSE Dept. Research Assistant, Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Hey Deepak Diwakar, try to keep the /etc/hosts file the same across all your cluster nodes. See whether the problem disappears. On Tue, Jul 27, 2010 at 2:31 PM, Deepak Diwakar ddeepa...@gmail.com wrote: Hey friends, I got stuck setting up an hdfs cluster and am getting this error while running the simple wordcount example (I did this 2 years back and had no problem). Currently testing on hadoop-0.20.1 with 2 nodes, following the instructions from ( http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29 ). I checked the firewall settings and /etc/hosts; there is no issue there. Also, master and slave are accessible both ways. Also the input size is very low, ~3 MB, so there shouldn't be any issue with ulimit (it is, by the way, 4096). Would be really thankful if anyone can guide me to resolve this. Thanks & regards, - Deepak Diwakar, On 28 June 2010 18:39, bmdevelopment bmdevelopm...@gmail.com wrote: Hi, Sorry for the cross-post. But just trying to see if anyone else has had this issue before. Thanks -- Forwarded message -- From: bmdevelopment bmdevelopm...@gmail.com Date: Fri, Jun 25, 2010 at 10:56 AM Subject: Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. To: mapreduce-u...@hadoop.apache.org Hello, Thanks so much for the reply. See inline. On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala yhema...@gmail.com wrote: Hi, I've been getting the following error when trying to run a very simple MapReduce job. Map finishes without problem, but the error occurs as soon as it enters the Reduce phase. 10/06/24 18:41:00 INFO mapred.JobClient: Task Id : attempt_201006241812_0001_r_00_0, Status : FAILED Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. I am running a 5 node cluster and I believe I have all my settings correct: * ulimit -n 32768 * DNS/RDNS configured properly * hdfs-site.xml : http://pastebin.com/xuZ17bPM * mapred-site.xml : http://pastebin.com/JraVQZcW The program is very simple - just counts a unique string in a log file. See here: http://pastebin.com/5uRG3SFL When I run it, the job fails and I get the following output. http://pastebin.com/AhW6StEb However, it runs fine when I do *not* use substring() on the value (see the map function in the code above). This runs fine and completes successfully: String str = val.toString(); This causes the error and fails: String str = val.toString().substring(0,10); Please let me know if you need any further information. It would be greatly appreciated if anyone could shed some light on this problem. It catches attention that changing the code to use a substring is causing a difference. Assuming it is consistent and not a red herring, Yes, this has been consistent over the last week. I was running 0.20.1 first and then upgraded to 0.20.2, but the results have been exactly the same. can you look at the counters for the two jobs using the JobTracker web UI - things like map records, bytes etc - and see if there is a noticeable difference ? Ok, so here is the first job using write.set(value.toString()); having *no* errors: http://pastebin.com/xvy0iGwL And here is the second job using write.set(value.toString().substring(0, 10)); that fails: http://pastebin.com/uGw6yNqv And here is even another where I used a longer, and therefore unique, string, by write.set(value.toString().substring(0, 20)); This makes every line unique, similar to the first job. Still fails. http://pastebin.com/GdQ1rp8i Also, are the two programs being run against the exact same input data ? Yes, exactly the same input: a single csv file with 23K lines.
Using a shorter string leads to more identical keys and therefore more combining/reducing, but going by the above it seems to fail whether the substring/key is entirely unique (23000 combine output records) or mostly the same (9 combine output records). Also, since the cluster size is small, you could also look at the tasktracker logs on the machines where the maps have run to see if there are any failures when the reduce attempts start failing. Here is the TT log from the last failed job. I do not see anything besides the shuffle failure, but there may be something I am overlooking or simply do not understand. http://pastebin.com/DKFTyGXg Thanks again! Thanks Hemanth -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Research Assistant of Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588
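To make the /etc/hosts advice at the top of this thread concrete: a frequent cause of MAX_FAILED_UNIQUE_FETCHES is a node whose own hostname resolves to the loopback address (Ubuntu's default 127.0.1.1 entry is a classic example), so reducers try to fetch map output from the wrong machine. A sketch of a consistent hosts file, with hypothetical names and addresses, identical on every node:

    127.0.0.1    localhost
    192.168.0.1  master.cluster.local  master
    192.168.0.2  slave1.cluster.local  slave1
    192.168.0.3  slave2.cluster.local  slave2

The machine's own hostname should appear only on its real IP line, never on the 127.0.0.1 line.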
Re: hybrid map/reducer scheduler?
You can write your own scheduler based on them; they are both open source. On Mon, Jun 28, 2010 at 6:13 PM, jiang licht licht_ji...@yahoo.com wrote: In addition to the default FIFO scheduler, there are the fair scheduler and the capacity scheduler. In some sense, the fair scheduler can be considered user-based scheduling while the capacity scheduler does queue-based scheduling. Is there, or will there be, a hybrid scheduler that combines the good parts of the two (or a capacity scheduler that allows preemption, so that different users are asked to submit jobs to different queues, implicitly following user-based scheduling as well, more or less)? Thanks, --Michael
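Both schedulers ship as contrib modules, and the JobTracker loads whichever pluggable scheduler the config names, so a hybrid would be a matter of subclassing org.apache.hadoop.mapred.TaskScheduler and pointing the same property at it. A sketch of the switch, assuming the scheduler jar is already on the JobTracker classpath:

    <!-- mapred-site.xml on the JobTracker -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>

Replace the value with org.apache.hadoop.mapred.CapacityTaskScheduler for the capacity scheduler, or with your own TaskScheduler subclass.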
Re: Shuffle error
Problem solved. This was caused by inconsistent /etc/hosts files. 2010/5/24 He Chen airb...@gmail.com Hey everyone, I have a problem when I run hadoop programs. Some of my worker nodes always report the following shuffle error: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. (the same line repeated for task after task) I tried to find out whether this is caused by certain nodes' hardware, for example memory or hard drive, but it looks like any of the worker nodes can show this error. I am using 0.20.1. Any suggestion will be appreciated! -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588 -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588
Re: Any possible to set hdfs block size to a value smaller than 64MB?
If you know how to use AspectJ for aspect-oriented programming, you can write an aspect class and let it monitor the whole MapReduce process. On Tue, May 18, 2010 at 10:00 AM, Patrick Angeles patr...@cloudera.com wrote: Should be evident in the total job running time... that's the only metric that really matters :) On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT pierre...@gmail.com wrote: Thank you, Any way I can measure the startup overhead in terms of time? On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles patr...@cloudera.com wrote: Pierre, Adding to what Brian has said (some things are not explicitly mentioned in the HDFS design doc)... - If your small files take up less than 64MB, you do not actually use the entire 64MB block on disk. - You *do* use up RAM on the NameNode, as each block represents meta-data that needs to be maintained in-memory in the NameNode. - Hadoop won't perform optimally with very small block sizes. Hadoop I/O is optimized for high sustained throughput per single file/block. There is a penalty for doing too many seeks to get to the beginning of each block. Additionally, you will have a MapReduce task per small file. Each MapReduce task has a non-trivial startup overhead. - The recommendation is to consolidate your small files into large files. One way to do this is via SequenceFiles... put the filename in the SequenceFile key field, and the file's bytes in the SequenceFile value field. In addition to the HDFS design docs, I recommend reading this blog post: http://www.cloudera.com/blog/2009/02/the-small-files-problem/ Happy Hadooping, - Patrick On Tue, May 18, 2010 at 9:11 AM, Pierre ANCELOT pierre...@gmail.com wrote: Okay, thank you :) On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman bbock...@cse.unl.edu wrote: On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote: Hi, thanks for this fast answer :) If so, what do you mean by blocks? If a file has to be split, will it be split only when larger than 64MB? For every 64MB of the file, Hadoop will create a separate block. So, if you have a 32KB file, there will be one block of 32KB. If the file is 65MB, then it will have one block of 64MB and another block of 1MB. Splitting files is very useful for load-balancing and distributing I/O across multiple nodes. At 32KB / file, you don't really need to split the files at all. I recommend reading the HDFS design document for background on issues like this: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html Brian On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Pierre, These are not traditional filesystem blocks - if you save a file smaller than 64MB, you don't lose 64MB of file space. Hadoop will use 32KB to store a 32KB file (ok, plus a KB of metadata or so), not 64MB. Brian On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote: Hi, I'm porting a legacy application to hadoop and it uses a bunch of small files. I'm aware that having such small files ain't a good idea, but I'm not making the technical decisions and the port has to be done for yesterday... Of course such small files are a problem; loading 64MB blocks for a few lines of text is an evident loss. What will happen if I set a smaller, or even way smaller (32kB), block size? Thank you. Pierre ANCELOT.
-- http://www.neko-consulting.com Ego sum quis ego servo Je suis ce que je protège I am what I protect -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588
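A minimal sketch of the consolidation Patrick describes (filename in the key, raw bytes in the value), using the 0.20 SequenceFile API; the local source directory and the HDFS target path are hypothetical:

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/user/pierre/packed.seq"),
            Text.class, BytesWritable.class);
        try {
          // Pack every small local file into one big SequenceFile.
          for (File f : new File("/data/small-files").listFiles()) {
            byte[] bytes = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            try {
              in.readFully(bytes);               // read the whole small file
            } finally {
              in.close();
            }
            writer.append(new Text(f.getName()), new BytesWritable(bytes));
          }
        } finally {
          IOUtils.closeStream(writer);
        }
      }
    }

A job can then consume the packed file with SequenceFileInputFormat, getting one record per original file instead of one map task per file.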
Hadoop does not follow my setting
Hi everyone, I am doing a benchmark using Hadoop 0.20.0's wordcount example. I have a 30GB file, and I plan to test the performance of different numbers of mappers. For example, for a wordcount job, I plan to test 22 mappers, 44 mappers, 66 mappers and 110 mappers. However, I set mapred.map.tasks equal to 22, but when I ran the job it showed 436 mappers total. I thought maybe wordcount sets its parameters inside its own program, so I gave -Dmapred.map.tasks=22 to the program. But it was still 436 in another try. I found out that 30GB divided by 436 is just about 64MB, which is exactly my block size. Any suggestions will be appreciated. Thank you in advance! -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588
Re: Hadoop does not follow my setting
Hey Eric Sammer, thank you for the reply. Actually, I only care about the number of mappers in my circumstance. Looks like I should write the wordcount program with my own InputFormat class. 2010/4/22 Eric Sammer esam...@cloudera.com This is normal and expected. The mapred.map.tasks parameter is only a hint. The InputFormat gets to decide how to calculate splits. FileInputFormat and all subclasses, including TextInputFormat, use a few parameters to figure out what the appropriate split size will be, but under most circumstances this winds up being the block size. If you used fewer map tasks than blocks, you would sacrifice data locality, which would only hurt performance. 2010/4/22 He Chen airb...@gmail.com: Hi everyone, I am doing a benchmark using Hadoop 0.20.0's wordcount example. I have a 30GB file, and I plan to test the performance of different numbers of mappers. For example, for a wordcount job, I plan to test 22 mappers, 44 mappers, 66 mappers and 110 mappers. However, I set mapred.map.tasks equal to 22, but when I ran the job it showed 436 mappers total. I thought maybe wordcount sets its parameters inside its own program, so I gave -Dmapred.map.tasks=22 to the program. But it was still 436 in another try. I found out that 30GB divided by 436 is just about 64MB, which is exactly my block size. Any suggestions will be appreciated. Thank you in advance! -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588 -- Eric Sammer phone: +1-917-287-2675 twitter: esammer data: www.cloudera.com
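Before writing a custom InputFormat, note that the old-API FileInputFormat computes each split as max(minSplitSize, min(goalSize, blockSize)), where goalSize is totalSize / mapred.map.tasks. So raising the minimum split size above the block size is enough to force fewer, larger splits. A sketch for the 30GB / 22-mapper case (the sizes are assumptions):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(WordCount.class);
    long totalBytes = 30L * 1024 * 1024 * 1024;            // ~30 GB of input
    // Force splits of ~1.4 GB so the job gets ~22 maps instead of 436.
    conf.setLong("mapred.min.split.size", totalBytes / 22);
    conf.setNumMapTasks(22);                                // still only a hint

As Allen points out later in this thread, each of those fat splits spans many blocks, so most of the data is read over the network: you trade data locality for fewer task startups.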
Re: Hadoop does not follow my setting
Hi Raymond Jennings III, I use 22 mappers because I have 22 cores in my cluster. Is this what you were asking? On Thu, Apr 22, 2010 at 11:55 AM, Raymond Jennings III raymondj...@yahoo.com wrote: Isn't the number of mappers specified only a suggestion? --- On Thu, 4/22/10, He Chen airb...@gmail.com wrote: From: He Chen airb...@gmail.com Subject: Hadoop does not follow my setting To: common-user@hadoop.apache.org Date: Thursday, April 22, 2010, 12:50 PM Hi everyone, I am doing a benchmark using Hadoop 0.20.0's wordcount example. I have a 30GB file, and I plan to test the performance of different numbers of mappers. For example, for a wordcount job, I plan to test 22 mappers, 44 mappers, 66 mappers and 110 mappers. However, I set mapred.map.tasks equal to 22, but when I ran the job it showed 436 mappers total. I thought maybe wordcount sets its parameters inside its own program, so I gave -Dmapred.map.tasks=22 to the program. But it was still 436 in another try. I found out that 30GB divided by 436 is just about 64MB, which is exactly my block size. Any suggestions will be appreciated. Thank you in advance! -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588 -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588
Re: Hadoop does not follow my setting
Yes, but if you have more mappers, you may have more waves to execute. I mean, if I have 110 mappers for a job and only 22 cores, it will execute about 5 waves (ceil(110/22) = 5). If I have only 22 mappers, it will save that per-wave startup overhead. 2010/4/22 Edward Capriolo edlinuxg...@gmail.com 2010/4/22 He Chen airb...@gmail.com Hi Raymond Jennings III, I use 22 mappers because I have 22 cores in my cluster. Is this what you were asking? On Thu, Apr 22, 2010 at 11:55 AM, Raymond Jennings III raymondj...@yahoo.com wrote: Isn't the number of mappers specified only a suggestion? --- On Thu, 4/22/10, He Chen airb...@gmail.com wrote: From: He Chen airb...@gmail.com Subject: Hadoop does not follow my setting To: common-user@hadoop.apache.org Date: Thursday, April 22, 2010, 12:50 PM Hi everyone, I am doing a benchmark using Hadoop 0.20.0's wordcount example. I have a 30GB file, and I plan to test the performance of different numbers of mappers. For example, for a wordcount job, I plan to test 22 mappers, 44 mappers, 66 mappers and 110 mappers. However, I set mapred.map.tasks equal to 22, but when I ran the job it showed 436 mappers total. I thought maybe wordcount sets its parameters inside its own program, so I gave -Dmapred.map.tasks=22 to the program. But it was still 436 in another try. I found out that 30GB divided by 436 is just about 64MB, which is exactly my block size. Any suggestions will be appreciated. Thank you in advance! -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588 -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588 No matter how many total mappers exist for the job, only a certain number of them run at once. -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588
Re: Hadoop does not follow my setting
To some extent, for a 30GB file, if it is well balanced the overhead from losing data locality may not be too much. We will see; I will report my results to this mailing list. On Thu, Apr 22, 2010 at 2:44 PM, Allen Wittenauer awittena...@linkedin.com wrote: On Apr 22, 2010, at 11:46 AM, He Chen wrote: Yes, but if you have more mappers, you may have more waves to execute. I mean, if I have 110 mappers for a job and only 22 cores, it will execute about 5 waves. If I have only 22 mappers, it will save the overhead time. But you'll sacrifice data locality, which means that instead of testing the cpu, you'll be testing cpu+network. -- Best Wishes! 顺送商祺! -- Chen He (402)613-9298 PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588
Re: Reducer-side join example
For the map function: input key: default; input value: lines of File A and File B; output key: the join field (A, B, C, ..., the first column of the final result); output value: the remaining fields (12, 24, Car, 13, Van, SUV, ...). For the reduce function, take the map output and, for each key: if a value is an integer, save it to array1; else save it to array2. Then for each ith element in array1 and each jth element in array2, output(key, array1[i] + \t + array2[j]). Hope this helps. On Mon, Apr 5, 2010 at 4:10 PM, M B machac...@gmail.com wrote: Hi, I need a good java example to get me started with some joining we need to do, any examples would be appreciated. File A (Field1, Field2): A 12; B 13; C 22; A 24. File B (Field1, Field2, Field3): A Car ...; B Truck ...; B SUV ...; B Van ... So, we need to first join File A and File B on Field1 (say both are string fields). The result would just be: A 12 Car ...; A 24 Car ...; B 13 Truck ...; B 13 SUV ...; B 13 Van ...; and so on - with all the fields from both files returned. Once we have that, we sometimes need to then transform it so we have a single record per key (Field1): A (12,Car) (24,Car) B (13,Truck) (13,SUV) (13,Van) -- however it looks, basically tuples for each key (we'll modify this later to return a concatenated set of fields from B, etc). At other times, instead of transforming to a single row, we just need to modify rows based on values. So if B.Field2 equals Van, we need to set Output.Field2 = whatever and then output to a file ... Are there any good examples of this in native java (we can't use pig/hive/etc)? thanks. -- Best Wishes! -- Chen He PhD. student of CSE Dept. Holland Computing Center University of Nebraska-Lincoln Lincoln NE 68588
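A minimal sketch of this reduce-side join in the old mapred API. Instead of guessing a value's source by whether it is an integer, it tags each value with its originating file (detected via map.input.file; the file names here are assumptions), which stays correct even if File B also contains numeric fields:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ReduceSideJoin {

      public static class JoinMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        private boolean fromA;

        public void configure(JobConf job) {
          // map.input.file holds the path of the file this split came from.
          fromA = job.get("map.input.file", "").contains("fileA");
        }

        public void map(LongWritable offset, Text line,
            OutputCollector<Text, Text> out, Reporter r) throws IOException {
          String[] fields = line.toString().split("\t", 2);
          if (fields.length < 2) return;          // skip malformed lines
          // Key on Field1; prefix the rest with a one-character source tag.
          out.collect(new Text(fields[0]),
              new Text((fromA ? "A" : "B") + fields[1]));
        }
      }

      public static class JoinReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> out, Reporter r) throws IOException {
          List<String> a = new ArrayList<String>();
          List<String> b = new ArrayList<String>();
          while (values.hasNext()) {              // separate values by tag
            String v = values.next().toString();
            (v.charAt(0) == 'A' ? a : b).add(v.substring(1));
          }
          for (String av : a)                     // cross product = the join
            for (String bv : b)
              out.collect(key, new Text(av + "\t" + bv));
        }
      }
    }

The second transformation (one record per key) is the same reducer with the nested loops replaced by concatenating "(" + av + "," + bv + ")" tuples into a single output value.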
Re: Defining the number of map tasks
In the hadoop-site.xml or hadoop-default.xml file you can find a parameter: mapred.map.tasks. Change its value to 3. At the same time, set mapred.tasktracker.map.tasks.maximum to 3 if you use only one tasktracker. On Wed, Dec 16, 2009 at 3:26 PM, psdc1978 psdc1...@gmail.com wrote: Hi, I would like to have several map tasks that execute the same task. For example, I have 3 map tasks (M1, M2 and M3) and 1GB of input data to be read by each map. Each map should read the same input data and send the result to the same reduce. At the end, the reduce should produce the same 3 results. Would putting 3 instances of the same machine in the conf/slaves file, i.e. localhost localhost localhost, solve the problem? How do I define the number of map tasks to run? Best regards, -- xeon Chen
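For the job-side half of this, a sketch using the old JobConf API (keeping in mind, per the "Hadoop does not follow my setting" thread above, that the map count is a hint the InputFormat may override):

    import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder
    conf.setNumMapTasks(3);                    // hint: desired number of maps

    // mapred.tasktracker.map.tasks.maximum is read by each tasktracker at
    // startup, so it belongs in the node's hadoop-site.xml, not in the job:
    //   <property>
    //     <name>mapred.tasktracker.map.tasks.maximum</name>
    //     <value>3</value>
    //   </property>

Note that setNumMapTasks controls how the input is divided, not how many times it is duplicated; for all three maps to read the *same* 1GB, each map's input split would have to cover the whole file (for example, via three copies of the input, or a custom InputFormat that returns the full file as every split).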