Re: Linear slowdown producing streaming output
Mappers don't just write data, they also sort it. Different parameters like the size of the sort buffer can have an impact on these kinds of metrics. On Friday, February 4, 2011, Keith Wiley kwi...@keithwiley.com wrote: I noticed that it takes much longer to write 55 MBs to a streaming output than it takes to write 12 MBs (much more than 4-5X longer), so I broke the output up, writing 1 MB at a time and discovered a perfectly linear slowdown. Bottom line, the more data I have already written to stdout from a streaming task, the longer it takes to write the next block of data. I have no idea if this is intrinsic to producing stdout from any Unix process (I've never heard of such a thing) or if this is a Hadoop issue. Does anyone have any idea what's going on here? From pos 0, wrote 1048576 bytes. Next pos will be 1048576. diffTimeWriteOneBlock: 0.31s From pos 1048576, wrote 1048576 bytes. Next pos will be 2097152. diffTimeWriteOneBlock: 0.9s From pos 2097152, wrote 1048576 bytes. Next pos will be 3145728. diffTimeWriteOneBlock: 1.46s From pos 3145728, wrote 1048576 bytes. Next pos will be 4194304. diffTimeWriteOneBlock: 1.98s From pos 4194304, wrote 1048576 bytes. Next pos will be 5242880. diffTimeWriteOneBlock: 2.47s From pos 5242880, wrote 1048576 bytes. Next pos will be 6291456. diffTimeWriteOneBlock: 3.06s From pos 6291456, wrote 1048576 bytes. Next pos will be 7340032. diffTimeWriteOneBlock: 3.53s From pos 7340032, wrote 1048576 bytes. Next pos will be 8388608. diffTimeWriteOneBlock: 3.96s From pos 8388608, wrote 1048576 bytes. Next pos will be 9437184. diffTimeWriteOneBlock: 4.24s From pos 9437184, wrote 1048576 bytes. Next pos will be 10485760. diffTimeWriteOneBlock: 4.74s From pos 10485760, wrote 1048576 bytes. Next pos will be 11534336. diffTimeWriteOneBlock: 5.24s From pos 11534336, wrote 1048576 bytes. Next pos will be 12582912. diffTimeWriteOneBlock: 5.72s From pos 12582912, wrote 1048576 bytes. Next pos will be 13631488. diffTimeWriteOneBlock: 6.25s From pos 13631488, wrote 1048576 bytes. Next pos will be 14680064. diffTimeWriteOneBlock: 6.77s From pos 14680064, wrote 1048576 bytes. Next pos will be 15728640. diffTimeWriteOneBlock: 7.37s From pos 15728640, wrote 1048576 bytes. Next pos will be 16777216. diffTimeWriteOneBlock: 7.76s From pos 16777216, wrote 1048576 bytes. Next pos will be 17825792. diffTimeWriteOneBlock: 8.74s From pos 17825792, wrote 1048576 bytes. Next pos will be 18874368. diffTimeWriteOneBlock: 8.99s From pos 18874368, wrote 1048576 bytes. Next pos will be 19922944. diffTimeWriteOneBlock: 9.35s From pos 19922944, wrote 1048576 bytes. Next pos will be 20971520. diffTimeWriteOneBlock: 9.85s From pos 20971520, wrote 1048576 bytes. Next pos will be 22020096. diffTimeWriteOneBlock: 10.43s From pos 22020096, wrote 1048576 bytes. Next pos will be 23068672. diffTimeWriteOneBlock: 11.05s From pos 23068672, wrote 1048576 bytes. Next pos will be 24117248. diffTimeWriteOneBlock: 11.52s From pos 24117248, wrote 1048576 bytes. Next pos will be 25165824. diffTimeWriteOneBlock: 12.23s From pos 25165824, wrote 1048576 bytes. Next pos will be 26214400. diffTimeWriteOneBlock: 12.49s From pos 26214400, wrote 1048576 bytes. Next pos will be 27262976. diffTimeWriteOneBlock: 13.1s From pos 27262976, wrote 1048576 bytes. Next pos will be 28311552. diffTimeWriteOneBlock: 13.83s From pos 28311552, wrote 1048576 bytes. Next pos will be 29360128. diffTimeWriteOneBlock: 14.31s From pos 29360128, wrote 1048576 bytes. Next pos will be 30408704. 
diffTimeWriteOneBlock: 14.65s From pos 30408704, wrote 1048576 bytes. Next pos will be 31457280. diffTimeWriteOneBlock: 15.32s From pos 31457280, wrote 1048576 bytes. Next pos will be 32505856. diffTimeWriteOneBlock: 15.88s From pos 32505856, wrote 1048576 bytes. Next pos will be 33554432. diffTimeWriteOneBlock: 16.77s From pos 33554432, wrote 1048576 bytes. Next pos will be 34603008. diffTimeWriteOneBlock: 16.9s From pos 34603008, wrote 1048576 bytes. Next pos will be 35651584. diffTimeWriteOneBlock: 17.39s From pos 35651584, wrote 1048576 bytes. Next pos will be 36700160. diffTimeWriteOneBlock: 18.12s From pos 36700160, wrote 1048576 bytes. Next pos will be 37748736. diffTimeWriteOneBlock: 18.69s From pos 37748736, wrote 1048576 bytes. Next pos will be 38797312. diffTimeWriteOneBlock: 19.09s From
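For reference, the sort buffer the reply mentions is the map-side buffer configured in mapred-site.xml (or per job). A minimal sketch with placeholder values, not recommendations:

<property>
  <name>io.sort.mb</name>
  <value>200</value>
  <description>Size, in MB, of the in-memory buffer used to sort map output before spilling to disk.</description>
</property>
<property>
  <name>io.sort.spill.percent</name>
  <value>0.80</value>
  <description>Fraction of the sort buffer that may fill before a background spill to disk begins.</description>
</property>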
Re: Finding the datanode ID
You can use java.net.InetAddress.getLocalHost to determine what node you are working on. On Saturday, August 7, 2010, Denim Live denim.l...@yahoo.com wrote: Hi all, For some odd processing situation, I want to determine in each map and reduce task the node ID of the physical node on which a record is being processed. Is it possible to determine the node ID programmatically?
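A minimal sketch of how that looks inside a new-API map task; the class name and output types are just for illustration:

import java.io.IOException;
import java.net.InetAddress;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NodeAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String nodeId;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Resolve the hostname of the node this task attempt is running on.
        nodeId = InetAddress.getLocalHost().getHostName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tag every record with the node that processed it.
        context.write(new Text(nodeId), value);
    }
}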
Re: adding new node to hadoop cluster
The slaves file is only consulted by the start and stop scripts. If you want to add a node to the cluster without restarting it, just ssh to the node and use the hadoop-daemon.sh script. The options you want are start datanode and start tasktracker. On 7/22/10, Khaled BEN BAHRI khaled.ben_ba...@it-sudparis.eu wrote: Hi, I want to add a new data node to my hadoop cluster, but the problem is that when I add it in the configuration files, I have to restart the Hadoop daemons for the cluster to know that a new data node has been added. Please, does anyone know how to add new nodes without needing to restart Hadoop? Thanks in advance for the help. Best regards, Khaled
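For example, assuming a tarball install with HADOOP_HOME set and the cluster's conf/ directory already copied to the new node:

# Run these on the new node itself:
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker
# Then add the hostname to conf/slaves on the master so the start/stop
# scripts pick the node up next time.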
Re: Task tracker and Data node not stopping
Inside hadoop-env.sh, you will see a setting that controls the directory the pid files are written to. Check which directory it is and then investigate the possibility that some other process is deleting or overwriting those files. If you are using NFS, with all nodes pointing at the same directory, then it might be a matter of each node overwriting the same file. Either way, the stop scripts look for those pid files and use them to stop the correct daemon. If they are not found, or if a file contains the wrong pid, the script will echo that there is no process to stop. On Thu, Jul 15, 2010 at 4:51 AM, Karthik Kumar karthik84ku...@gmail.com wrote: Hi, I am using a cluster of two machines, one master and one slave. When I try to stop the cluster using stop-all.sh it displays the output below; the task tracker and datanode on the slave are not stopped either. Please help me in solving this. stopping jobtracker 160.110.150.29: no tasktracker to stop stopping namenode 160.110.150.29: no datanode to stop localhost: stopping secondarynamenode -- With Regards, Karthik
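The setting in question lives in conf/hadoop-env.sh; the path below is only an example, any node-local directory outside /tmp (and off NFS) will do:

# conf/hadoop-env.sh
# Directory where the daemon pid files are written. The default is /tmp,
# where tmp cleaners or other processes can remove them before the stop
# scripts get a chance to read them.
export HADOOP_PID_DIR=/var/hadoop/pids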
Re: Problem with socket timeouts
What is your file descriptor limit? If it is still at the default of 1024, you will want to raise it considerably. Don't be afraid to go to 64k. On Thu, Jul 15, 2010 at 3:38 AM, Peter Falk pe...@bugsoft.nu wrote: Hi, We are noticing the following errors in our datanode logs. We are running hbase on top of hdfs, but are not noticing any errors in the hbase logs, so it seems like the hdfs clients are not suffering from these errors. However, it would be nice to understand why they appear. We have upped the number of xcievers to 1024, and are wondering whether there are too many sockets and some of them are timing out because they are not being used. Also, we have set dfs.datanode.socket.write.timeout and dfs.socket.timeout to about 10 minutes. Does anyone know why these errors appear, and perhaps how to get rid of them?
2010-07-15 12:31:02,307 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.10.54:50010, storageID=DS-1341800526-192.168.10.54-50010-1278528499852, infoPort=50075, ipcPort=50020):Got exception while serving blk_3305049902326993023_98358 to /192.168.10.54:
java.net.SocketTimeoutException: 60 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.10.54:50010 remote=/192.168.10.54:46276]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
    at java.lang.Thread.run(Thread.java:619)
2010-07-15 12:31:02,308 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.10.54:50010, storageID=DS-1341800526-192.168.10.54-50010-1278528499852, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 60 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.10.54:50010 remote=/192.168.10.54:46276]
    at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
    at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
    at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
    at java.lang.Thread.run(Thread.java:619)
TIA, Peter
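A quick way to check and raise the limit; the 64k figure mirrors the suggestion above, and the "hadoop" user name is an assumption about which account runs the daemons:

# Check the current limit for the user running the datanode:
ulimit -n
# To raise it permanently, add lines like these to /etc/security/limits.conf,
# then log in again and restart the daemons:
hadoop  soft  nofile  65536
hadoop  hard  nofile  65536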
Re: using 'fs -put' from datanode: all data written to that node's hdfs and not distributed
All writes from a datanode leave one copy on the local node, one copy on another node in the same rack, and a third on another rack if available. That is why a put run on a datanode ends up with a replica of every block on that node; run the put from a machine that is not a datanode (the namenode, as you observed, or an edge node) and the replicas are spread across the cluster. On 7/12/10, Nathan Grice ngr...@gmail.com wrote: We are trying to load data into hdfs from one of the slaves, and when the put command is run from a slave (datanode) all of the blocks are written to that datanode's hdfs and not distributed to all of the nodes in the cluster. It does not seem to matter what destination format we use (/filename vs hdfs://master:9000/filename), it always behaves the same. Conversely, running the same command from the namenode distributes the files across the datanodes. Is there something I am missing? -Nathan
Re: How to control the number of map tasks for each nodes?
If you want to have a different number of tasks for different nodes, you will need to look at one of the more advanced schedulers. FairScheduler and CapacityScheduler are the most common. FairScheduler has extensibility points where you can add your own logic for deciding whether a particular node can schedule another task. I believe CapacityScheduler does this too, but I haven't used it as much. On Thu, Jul 8, 2010 at 6:49 AM, Jones, Nick nick.jo...@amd.com wrote: Vitaliy/Edward, One thing to keep in mind is that overcommitting the number of cores can lead to map timeouts unless the map task submits progress updates to the jobtracker. I found that out the hard way with a few computationally expensive maps. Nick Jones -Original Message- From: Vitaliy Semochkin [mailto:vitaliy...@gmail.com] Sent: Thursday, July 08, 2010 5:15 AM To: common-user@hadoop.apache.org Subject: Re: How to control the number of map tasks for each nodes? Hi, in mapred-site.xml you should place:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
  <description>the number of available cores on the tasktracker machines for map tasks</description>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
  <description>the number of available cores on the tasktracker machines for reduce tasks</description>
</property>

where 8 is the number of your CORES, not CPUS; if you have 8 dual-core processors, place 16 there. I found that having the number of map tasks a bit bigger than the number of cores is better, because sometimes hadoop waits for IO operations and the tasks do nothing. Regards, Vitaliy S On Thu, Jul 8, 2010 at 1:07 PM, edward choi mp2...@gmail.com wrote: Hi, I have a cluster consisting of 11 slaves and a single master. The thing is that 3 of my slaves have i7 cpus, which means that they can run up to 8 simultaneous processes. But the other slaves only have dual core cpus. So I was wondering if I can specify the number of map tasks for each of my slaves. For example, I want to give 8 map tasks to the slaves that have i7 cpus and only two map tasks to the others. Is there a way to do this?
Re: How to access Reporter in new API?
The Reporter and the OutputCollector have both been rolled up into the Context object in the new API. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.Context.html http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Reducer.Context.html On Thu, Jul 8, 2010 at 3:40 AM, Vitaliy Semochkin vitaliy...@gmail.com wrote: Hi, I'm using the Mapper interface from the new Hadoop API. How do I access a Reporter instance in the new API? PS If someone knows of any article on logging and problem reporting in Hadoop, please post a link here. Thanks in Advance, Vitaliy S
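A minimal sketch of a new-API mapper using the Context where old-API code would have used OutputCollector and Reporter; the counter group and names are just examples:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReportingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // OutputCollector.collect() becomes context.write()
        context.write(new Text(value.toString()), new IntWritable(1));
        // Reporter.incrCounter() becomes context.getCounter(...).increment()
        context.getCounter("MyApp", "RecordsSeen").increment(1);
        // Reporter.setStatus() becomes context.setStatus()
        context.setStatus("processing offset " + key.get());
    }
}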
Re: how to add jar file to hadoop
You have a couple of choices. You can either add your jar file to a directory already on the classpath, like hadoop/lib, or you can add your jar file to the classpath set inside hadoop-env.sh, located in the conf directory. Either way, you will need to restart mapred after doing so. After that, your job should see the classes you have added. On 7/7/10, Ahmad Shahzad ashahz...@gmail.com wrote: I want to add some utility to hadoop. For that I wanted to know how I can add a jar file to the hadoop directory so that I can access the files in the jar from the java files in the hadoop directory (for example TaskTracker.java). I am not writing any map-reduce program. Regards, Ahmad Shahzad
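If you go the hadoop-env.sh route, the relevant line looks like the sketch below; the jar path is only an example:

# conf/hadoop-env.sh
# Extra entries appended to the classpath of the Hadoop scripts and daemons.
export HADOOP_CLASSPATH=/opt/myutils/my-utility.jar:$HADOOP_CLASSPATH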
Re: decomission a node
Inside the hdfs conf, set:

<property>
  <name>dfs.hosts.exclude</name>
  <value></value>
  <description>Names a file that contains a list of hosts that are not permitted to connect to the namenode. The full pathname of the file must be specified. If the value is empty, no hosts are excluded.</description>
</property>

Point this property at a file containing a list of the nodes you want to decommission. From there, use the command line: hadoop dfsadmin -refreshNodes. On Tue, Jul 6, 2010 at 7:31 AM, Some Body someb...@squareplanet.de wrote: Hi, Is it possible to move all the data blocks off a cluster node and then decommission the node? I'm asking because, now that my MR job is working, I'd like to see how things scale, i.e. fewer processing nodes, amount of data (number and size of files, etc.). I currently have 8 nodes, and am processing 5GB spread across 2000 files. Alan
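Putting it together, with an illustrative exclude-file location and hostname:

# dfs.hosts.exclude in hdfs-site.xml points at this file on the namenode
echo "datanode7.example.com" >> /etc/hadoop/conf/excludes
hadoop dfsadmin -refreshNodes
# The node shows up as "Decommission In Progress" in the namenode web UI while
# its blocks are re-replicated, then as "Decommissioned" when it is safe to stop it.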
Re: why my Reduce Class does not work?
You need @Override on your reduce method. Right now you are getting the identity reduce method. On 7/4/10, Vitaliy Semochkin vitaliy...@gmail.com wrote: Hi, I rewrote the WordCount sample to use the new Hadoop API, however my reduce task doesn't launch. The result file always looks like:
some_word 1
some_word 1
another_word 1
another_word 1
...
Here is the code:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer st = new StringTokenizer(value.toString());
      while (st.hasMoreTokens()) {
        context.write(new Text(st.nextToken()), new IntWritable(1));
      }
    }
  }

  public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @SuppressWarnings("unchecked")
    public void reduce(Text key, Iterable<IntWritable> values, Reducer.Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    Job job = new Job();
    job.setJobName("WordCounter");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

It looks like WordCountReduce was never launched, but I don't see any warnings or errors in the log file. Any help is highly appreciated. Thanks in Advance, Vitaliy S
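For reference, here is a sketch of a reduce method that actually overrides Reducer.reduce in the new API: the parameter type must be the reducer's own Context (not the raw Reducer.Context), and @Override makes the compiler flag a mismatched signature instead of silently falling back to the identity reducer.

public static class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}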
Re: Intermediate files generated.
You could also use MultipleOutputs from the old API. This will allow you to create multiple output collectors. One collector could be used at the beginning of the reduce call for writing the key-value pairing unaltered, and another collector for writing the results of your processing. On Fri, Jul 2, 2010 at 5:17 AM, Pramy Bhats pramybh...@googlemail.com wrote: Hi, Isn't it possible to hack in the intermediate files generated? I am writing a compilation framework, so I don't want to mess with the existing programming framework. The upper layer or the programmer should write the program the way he normally would, and I want to leverage the intermediate files generated for my analysis. thanks, --PB. On Fri, Jul 2, 2010 at 1:05 PM, Jones, Nick nick.jo...@amd.com wrote: Hi Pramy, I would set up one M/R job to just map (setNumReducers=0) and chain another job that uses a unity mapper to pass the intermediate data to the reduce step. Nick Sent by radiation. - Original Message - From: Pramy Bhats pramybh...@googlemail.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Fri Jul 02 01:05:25 2010 Subject: Re: Intermediate files generated. Hi Hemanth, I need to use the output of the mapper for some other application. As a result, if I can redirect the output of the map into temp files of my choice (which are stored on hdfs), then I can reuse the output later. At the same time, the succeeding reducer can read its input from these temp files without any overhead. thanks, --PB On Fri, Jul 2, 2010 at 3:52 AM, Hemanth Yamijala yhema...@gmail.com wrote: Alex, I don't think this is what I am looking for. Essentially, I wish to run both the mapper and the reducer. But at the same time, I wish to make sure that the temp files that are used between mappers and reducers are of my choice. Here, the choice means that I can specify the files in HDFS that can be used as temp files. Could you explain why you want to do this? thanks, --PB. On Fri, Jul 2, 2010 at 12:14 AM, Alex Loddengaard a...@cloudera.com wrote: You could use the HDFS API from within your mapper, and run with 0 reducers. Alex On Thu, Jul 1, 2010 at 3:07 PM, Pramy Bhats pramybh...@googlemail.com wrote: Hi, I am using the hadoop framework for writing MapReduce jobs. I want to redirect the output of Map into files of my choice and later use those files as input for the Reduce phase. Could you please suggest how to proceed? thanks, --PB.
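Here is a rough sketch of the MultipleOutputs approach in the old API; the named output "raw" and the key/value classes are placeholders, and addNamedOutput must be called on the JobConf before the job is submitted:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// At job setup time, register the named output, e.g.:
//   MultipleOutputs.addNamedOutput(conf, "raw", TextOutputFormat.class, Text.class, IntWritable.class);

public class PassThroughReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs mos;

    @Override
    public void configure(JobConf conf) {
        mos = new MultipleOutputs(conf);
    }

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            IntWritable v = values.next();
            // Write the unaltered pair to the side output...
            mos.getCollector("raw", reporter).collect(key, v);
            sum += v.get();
        }
        // ...and the processed result to the job's normal output.
        output.collect(key, new IntWritable(sum));
    }

    @Override
    public void close() throws IOException {
        mos.close();
    }
}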
Re: Dynamically set mapred.tasktracker.map.tasks.maximum from inside a job.
What you want to do can be accomplished in the scheduler. Take a look at the fair scheduler, specifically the user extensible options. There you will find the ability to add some extra logic for deciding if a task can be launched on a per job basis. It could be as simple as deciding a particular job can't launch more than 12 tasks at a time. Capacity scheduler might be able to do this too, but I'm not sure. On Wednesday, June 30, 2010, Pierre ANCELOT pierre...@gmail.com wrote: ok, well, thanks... I truly hoped a solution would exist for this. Thanks. Pierre. On Wed, Jun 30, 2010 at 3:56 PM, Yu Li car...@gmail.com wrote: Hi Pierre, The setNumReduceTasks method is for setting the number of reduce tasks to launch; it's equivalent to setting the mapred.reduce.tasks parameter, while the mapred.tasktracker.reduce.tasks.maximum parameter decides the number of tasks running *concurrently* on one node. And as Amareshwari mentioned, mapred.tasktracker.map/reduce.tasks.maximum is a cluster configuration which cannot be set per job. If you set mapred.tasktracker.map.tasks.maximum to 20, and the overall number of map tasks is larger than 20 * the number of nodes, there would be 20 map tasks running concurrently on a node. As far as I know, you probably need to restart the tasktracker if you truly need to change the configuration. Best Regards, Carp 2010/6/30 Pierre ANCELOT pierre...@gmail.com Sure, but not the number of tasks running concurrently on a node at the same time. On Wed, Jun 30, 2010 at 1:57 PM, Ted Yu yuzhih...@gmail.com wrote: The number of map tasks is determined by InputSplit. On Wednesday, June 30, 2010, Pierre ANCELOT pierre...@gmail.com wrote: Hi, Okay, so, if I set it to 20 by default, could I maybe limit the number of concurrent maps per node instead? job.setNumReduceTasks exists but I see no equivalent for maps, though I think there was a setNumMapTasks before... Was it removed? Why? Any idea about how to achieve this? Thank you. On Wed, Jun 30, 2010 at 12:08 PM, Amareshwari Sri Ramadasu amar...@yahoo-inc.com wrote: Hi Pierre, mapred.tasktracker.map.tasks.maximum is a cluster level configuration, it cannot be set per job. It is loaded only while bringing up the TaskTracker. Thanks Amareshwari On 6/30/10 3:05 PM, Pierre ANCELOT pierre...@gmail.com wrote: Hi everyone :) There's something I'm probably doing wrong but I can't seem to figure out what. I have two hadoop programs running one after the other. This is done because they don't have the same needs in terms of processor and memory, so by separating them I optimize each task better. The fact is, for the first job I need mapred.tasktracker.map.tasks.maximum set to 12 on every node. For the second job, I need it to be set to 20. So by default I set it to 12, and in the second job's code I set this: Configuration hadoopConfiguration = new Configuration(); hadoopConfiguration.setInt("mapred.tasktracker.map.tasks.maximum", 20); But when running the job, instead of having the 20 tasks on each node as expected, I have 12. Any idea, please? Thank you. Pierre. -- http://www.neko-consulting.com Ego sum quis ego servo Je suis ce que je protège I am what I protect
Re: newbie - job failing at reduce
Have you increased your file handle limits? You can check this with a 'ulimit -n' call. If you are still at 1024, then you will want to increase the limit to something quite a bit higher. On Wed, Jun 30, 2010 at 9:40 AM, Siddharth Karandikar siddharth.karandi...@gmail.com wrote: Yeah. SSH is working as mentioned in the docs. Even the directory mentioned for 'mapred.local.dir' has enough space. - Siddharth On Wed, Jun 30, 2010 at 10:01 PM, Chris Collord ccoll...@lanl.gov wrote: Interesting that the reduce phase makes it that far before failing! Are you able to SSH (without a password) into the failing node? Any possible folder permissions issues? ~Chris On 06/30/2010 10:26 AM, Siddharth Karandikar wrote: Hey Chris, Thanks for your inputs. I have tried most of the stuff, but will surely go through the tutorial you have pointed out. Maybe I will get some hint there. Interestingly, while experimenting with it more, I noticed that if the input file is small (around 50MB) the job works perfectly fine. If I give it a bigger input, it starts hanging at the reduce tasks. The map phase always finishes 100%. - Siddharth On Wed, Jun 30, 2010 at 9:11 PM, Chris Collord ccoll...@lanl.gov wrote: Hi Siddharth, I'm VERY new to this myself, but here are a few thoughts (since nobody else is responding!).
- You might want to set dfs.replication to 2. I have read that for clusters of fewer than 8 nodes you should have replication set to 2; 8+ node clusters use 3. This may make your cluster work, but it won't fix your problem.
- Run a bin/hadoop dfsadmin -report with the hadoop cluster running and see what it shows for your failing node.
- Check your logs/ folder for datanode logs and see if there's anything useful in there before the error you're getting.
- You might try reformatting your hdfs, if you don't have anything important in there: bin/hadoop namenode -format. (Note: this has caused problems for me in the past with namenode IDs; see the bottom of Michael Noll's tutorial if that happens.)
You should check out Michael Noll's tutorial for all the little details: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29 Let me know if anything helps! ~Chris On 06/30/2010 04:02 AM, Siddharth Karandikar wrote: Anyone? On Tue, Jun 29, 2010 at 8:41 PM, Siddharth Karandikar siddharth.karandi...@gmail.com wrote: Hi All, I am new to Hadoop, but by reading online docs and other resources I have moved ahead and am now trying to run a cluster of 3 nodes. Before doing this, I tried my program on standalone and pseudo-distributed setups and it works fine. Now the issue that I am facing: the mapping phase works correctly, but while doing the reduce I am seeing the following error on one of the nodes -
2010-06-29 14:35:01,848 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_201006291958_0001_m_08_0,0) failed : org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_201006291958_0001/attempt_201006291958_0001_m_08_0/output/file.out.index in any of the configured local directories
Let's say this is on Node1. But there is no such directory named 'taskTracker/jobcache/job_201006291958_0001/attempt_201006291958_0001_m_08_0' under /tmp/mapred/local/taskTracker/ on Node1. Interestingly, this directory is available on Node2 (or Node3). I have tried running the job multiple times, but it always fails while reducing, with the same error. I have configured /tmp/mapred/local on each node from mapred-site.xml. I really don't understand why the mappers are misplacing these files?
Or am I missing something in configuration? If someone wants to look @ configurations, I have pasted that below. Thanks, Siddharth

Configurations
==============

conf/core-site.xml
------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.2.115/</value>
  </property>
</configuration>

conf/hdfs-site.xml
------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.2.115</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/siddharth/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/siddharth/hdfs/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>

conf/mapred-site.xml
--------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.2.115:8021</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
Re: copy files to HDFS protocols
Also, hadoop fs -put can read from stdin by specifying '-' as your input file. On Wed, Jun 30, 2010 at 2:19 AM, Jeff Zhang zjf...@gmail.com wrote: 1. You can mount the files on the other machine onto your local machine and then invoke copyFromLocal. 2. Sure, you can write a stream to HDFS, but I'm afraid you have to use the Java API of Hadoop; refer to FileSystem.java in Hadoop. On Wed, Jun 30, 2010 at 4:59 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, I use the HDFS shell to copy files from the local filesystem to Hadoop HDFS (the copyFromLocal command). 1) How can I provide a path to a file which is located on a local FS, but on a different machine than Hadoop? 2) What other protocols (ways) can I use to write files to HDFS? Is it possible to use streaming data to write to HDFS? Thanks in advance, Oleg. -- Best Regards Jeff Zhang
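A small sketch of the FileSystem route Jeff mentions, streaming stdin into an HDFS file; the target path is an example, and the command-line equivalent is piping into hadoop fs -put - /target/path:

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StdinToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml (fs.default.name)
        FileSystem fs = FileSystem.get(conf);
        OutputStream out = fs.create(new Path("/user/oleg/incoming/data.bin"));
        // Copy whatever arrives on stdin straight into the HDFS file, 4 KB at a time,
        // closing both streams when done.
        IOUtils.copyBytes(System.in, out, 4096, true);
    }
}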
Re: do all mappers finish before reducer starts
The reduce function is always called after all map tasks are complete. This is not to be confused with the reduce task. The reduce task can be launched and begin copying data as soon as the first mapper completes. By default though, reduce tasks are not launched until 5% of the mappers are completed. 2010/2/1 Jeyendran Balakrishnan jbalakrish...@docomolabs-usa.com Correct me if I'm wrong, but this: "Yes, any reduce function call should be after all the mappers have done their work." is strictly true only if speculative execution is explicitly turned off. Otherwise there is a chance that some reduce tasks can actually start before all the maps are complete. In case it turns out that some map output key used by one speculative reduce task is output by some other map after this reduce task has started, I think the JT then kills this speculative task. -Original Message- From: Gang Luo [mailto:lgpub...@yahoo.com.cn] Sent: Friday, January 29, 2010 2:27 PM To: common-user@hadoop.apache.org Subject: Re: do all mappers finish before reducer starts It seems this is a hot issue. When any mapper finishes (the sorted intermediate result is on local disk), the shuffle starts to transfer the result to the corresponding reducers, even while other mappers are still working. Since the shuffle is part of the reduce phase, the map phase and reduce phase can be seen as overlapping to some extent. That is why you see such a progress report. What you actually mentioned is the reduce function. Yes, any reduce function call should be after all the mappers have done their work. -Gang - Original Message - From: adeelmahmood adeelmahm...@gmail.com To: core-u...@hadoop.apache.org Sent: 2010/1/29 (Fri) 4:10:50 PM Subject: do all mappers finish before reducer starts I just have a conceptual question. My understanding is that all the mappers have to complete their job for the reducers to start working, because mappers don't know about each other, so we need values for a given key from all the different mappers, so we have to wait until all mappers have collectively given the system all possible values for a key, so that it can then be passed on to the reducer. But when I ran these jobs, almost every time before the mappers were all done the reducers started working, so it would say map 60% reduce 30%. How does this work? Does it find all possible values for a single key from all mappers, pass that on to the reducer, and then work on other keys? Any help is appreciated. -- View this message in context: http://old.nabble.com/do-all-mappers-finish-before-reducer-starts-tp27330927p27330927.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
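The 5% figure comes from the setting below, shown as it would appear in mapred-site.xml (0.05 is the default):

<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.05</value>
  <description>Fraction of map tasks that must complete before reduce tasks are scheduled. The reducers then shuffle map output while the remaining maps run, but reduce() itself is not called until all maps have finished.</description>
</property>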
Re: Could not obtain block
Could not obtain block errors are often caused by running out of available file handles. You can confirm this by going to the shell and entering ulimit -n. If it says 1024, the default, then you will want to increase it to about 64,000. On Fri, Jan 29, 2010 at 4:06 PM, MilleBii mille...@gmail.com wrote: X-POST with Nutch mailing list. HEEELP !!! I'm kind of stuck on this one. I backed up my hdfs data, reformatted the hdfs, put the data back, tried to merge my segments together, and it explodes again. Exception in thread Lucene Merge Thread #0 org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: Could not obtain block: blk_4670839132945043210_1585 file=/user/nutch/crawl/indexed-segments/20100113003609/part-0/_ym.frq at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:309) If I go into the hdfs/data directory I DO find the faulty block. Could it be a synchronization problem in the segment merger code? 2010/1/29 MilleBii mille...@gmail.com I'm looking for some help. I'm a Nutch user, everything was working fine, but now I get the following error when indexing. I have a single-node pseudo-distributed setup. Some people on the Nutch list indicated that it could be full, so I removed many things and hdfs is far from full. This file directory was perfectly OK the day before. I did a hadoop fsck... the report says healthy. What can I do? Is it safe to do a Linux fsck just in case? Caused by: java.io.IOException: Could not obtain block: blk_8851198258748412820_9031 file=/user/nutch/crawl/indexed-segments/20100111233601/part-0/_103.frq -- -MilleBii- -- -MilleBii- -- Ken Goodhope Cell: 425-750-5616 362 Bellevue Way NE Apt N415 Bellevue WA, 98004