Re: different input/output formats
Thanks for the reply, but I already tried this option, and this is the error:

java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
        at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:60)
        at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.Use

Mark

On Tue, May 29, 2012 at 1:05 PM, samir das mohapatra samir.help...@gmail.com wrote: Hi Mark

public void map(LongWritable offset, Text val, OutputCollector<FloatWritable, Text> output, Reporter reporter) throws IOException {
    output.collect(new FloatWritable(1), val); // change 1 to 1.0f then it will work.
}

let me know the status after the change

On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote: Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileOutputFormat. Should be easy but I get the same error. Here is my configuration:

conf.setMapperClass(myMapper.class);
conf.setMapOutputKeyClass(FloatWritable.class);
conf.setMapOutputValueClass(Text.class);
conf.setNumReduceTasks(0);
conf.setOutputKeyClass(FloatWritable.class);
conf.setOutputValueClass(Text.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
TextInputFormat.addInputPath(conf, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));

myMapper class is:

public class myMapper extends MapReduceBase implements Mapper<LongWritable, Text, FloatWritable, Text> {
    public void map(LongWritable offset, Text val, OutputCollector<FloatWritable, Text> output, Reporter reporter) throws IOException {
        output.collect(new FloatWritable(1), val);
    }
}

But I get the following error:

12/05/29 12:54:31 INFO mapreduce.Job: Task Id : attempt_201205260045_0032_m_00_0, Status : FAILED
java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
        at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59)
        at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.Use

Where is the writing of LongWritable coming from?? Thank you, Mark
Re: different input/output formats
Hi Samir, can you email me your main class? Or if you can check mine, it is as follows:

public class SortByNorm1 extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.printf("Usage: bin/hadoop jar norm1.jar inputDir outputDir\n");
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        JobConf conf = new JobConf(new Configuration(), SortByNorm1.class);
        conf.setJobName("SortDocByNorm1");
        conf.setMapperClass(Norm1Mapper.class);
        conf.setMapOutputKeyClass(FloatWritable.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setNumReduceTasks(0);
        conf.setReducerClass(Norm1Reducer.class);
        conf.setOutputKeyClass(FloatWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        TextInputFormat.addInputPath(conf, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new SortByNorm1(), args);
        System.exit(exitCode);
    }
}

On Tue, May 29, 2012 at 1:55 PM, samir das mohapatra samir.help...@gmail.com wrote: Hi Mark, see the output for that same application. I am not getting any error.

On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote: Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileOutputFormat. Should be easy but I get the same error. Here is my configuration:

conf.setMapperClass(myMapper.class);
conf.setMapOutputKeyClass(FloatWritable.class);
conf.setMapOutputValueClass(Text.class);
conf.setNumReduceTasks(0);
conf.setOutputKeyClass(FloatWritable.class);
conf.setOutputValueClass(Text.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
TextInputFormat.addInputPath(conf, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));

myMapper class is:

public class myMapper extends MapReduceBase implements Mapper<LongWritable, Text, FloatWritable, Text> {
    public void map(LongWritable offset, Text val, OutputCollector<FloatWritable, Text> output, Reporter reporter) throws IOException {
        output.collect(new FloatWritable(1), val);
    }
}

But I get the following error:

12/05/29 12:54:31 INFO mapreduce.Job: Task Id : attempt_201205260045_0032_m_00_0, Status : FAILED
java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
        at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
        at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
        at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59)
        at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.Use

Where is the writing of LongWritable coming from?? Thank you, Mark
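For reference, the pieces posted in this thread assemble into the minimal map-only job below (the class names FloatKeySeqFileJob and FloatKeyMapper are made up for this sketch, not from the thread). If a standalone version like this runs cleanly against the same input, whatever differs between it and the real SortByNorm1.java (the stack trace points at line 59/60 of that file) is where the LongWritable key is coming from.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class FloatKeySeqFileJob {

    public static class FloatKeyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, FloatWritable, Text> {
        public void map(LongWritable offset, Text val,
                        OutputCollector<FloatWritable, Text> output, Reporter reporter)
                throws IOException {
            // The collected key is always a FloatWritable, never the LongWritable offset.
            output.collect(new FloatWritable(1.0f), val);
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(FloatKeySeqFileJob.class);
        conf.setJobName("text-to-seqfile");
        conf.setMapperClass(FloatKeyMapper.class);
        conf.setNumReduceTasks(0);                   // map-only: map output goes straight to the OutputFormat
        conf.setOutputKeyClass(FloatWritable.class); // with 0 reduces, these are the classes the SequenceFile writer checks
        conf.setOutputValueClass(Text.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}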
Re: How to add debugging to map- red code
I'm interested in this too, but could you tell me where to apply the patch, and is the following the right one to use: MAPREDUCE-336_0_20090818.patch (https://issues.apache.org/jira/secure/attachment/12416955/MAPREDUCE-336_0_20090818.patch)? Thank you, Mark

On Fri, Apr 20, 2012 at 8:28 AM, Harsh J ha...@cloudera.com wrote: Yes, this is possible, and there are two ways to do it. 1. Use a distro/release that carries the https://issues.apache.org/jira/browse/MAPREDUCE-336 fix. This will let you avoid work (see 2, which is the same as your idea). 2. Configure your implementation's logger object's level in the setup/setConf methods of the task, by looking at some conf prop to decide the level. This will work just as well - and will also avoid changing Hadoop's own Child log levels, unlike method (1).

On Fri, Apr 20, 2012 at 8:47 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, I'm trying to find out the best way to add debugging to map-red code. I have System.out.println() statements that I keep commenting and uncommenting so as not to increase the stdout size, but the problem is that any time I need to debug, I have to re-compile. Is there a way I can define log levels using log4j in map-red code and set the log level as a conf option? Thanks, JJ Sent from my iPhone -- Harsh J
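A minimal sketch of Harsh's option (2) with the old mapred API: the property name myjob.map.log.level below is made up for illustration (pass it on the command line with -Dmyjob.map.log.level=DEBUG), and the logger calls assume log4j 1.x, which is what Hadoop tasks use by default.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class DebuggableMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

    private static final Logger LOG = Logger.getLogger(DebuggableMapper.class);

    @Override
    public void configure(JobConf conf) {
        // "myjob.map.log.level" is a hypothetical property; default to INFO when it is absent.
        LOG.setLevel(Level.toLevel(conf.get("myjob.map.log.level", "INFO")));
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
            throws IOException {
        LOG.debug("processing record at offset " + key); // emitted only when DEBUG is enabled
        output.collect(value, key);
    }
}

The messages end up in the task logs as usual; the point is that switching between quiet and verbose runs becomes a -D option on job submission instead of a recompile.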
Has anyone installed HCE and built it successfully?
Hey guys, I've been stuck with the HCE installation for two days now and can't figure out the problem. The error I get from running (sh build.sh) is "cannot execute binary file". I tried setting my JAVA_HOME and ANT_HOME manually and using the build.sh script, but no luck. So please, if you've used HCE, could you share your knowledge with me? Thank you, Mark
Re: Hadoop streaming or pipes ..
Thanks all. Charles, you pointed me to the Baidu slides titled "Introduction to Hadoop C++ Extension" (http://hic2010.hadooper.cn/dct/attach/Y2xiOmNsYjpwZGY6ODI5), which describe their experience, and the sixth slide shows exactly what I was looking for. Pipes is still hard to manage memory with, besides offering no performance gains, hence the development of HCE. Thanks, Mark

On Thu, Apr 5, 2012 at 2:23 PM, Charles Earl charles.ce...@gmail.com wrote: Also bear in mind that there is a kind of detour involved, in the sense that a pipes map must send key,value data back to the Java process and then to reduce (more or less). I think that the Hadoop C Extension (HCE, there is a patch) is supposed to be faster. Would be interested to know if the community has any experience with HCE performance. C

On Apr 5, 2012, at 3:49 PM, Robert Evans ev...@yahoo-inc.com wrote: Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming and pipes is that streaming uses stdin/stdout for this communication so preexisting processing like grep, sed and awk can be used here. Pipes uses a custom protocol with a C++ library to communicate. The C++ library is tagged with SWIG compatible data so that it can be wrapped to have APIs in other languages like python or perl. I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again. --Bobby Evans

On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote: Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
Hadoop pipes and streaming ..
Hi guys, Two quick questions: 1. Are there any performance gains from hadoop streaming or pipes ? As far as I read, it is to ease testing using your favorite language. Which I think implies that everything is translated to bytecode eventually and executed.
Hadoop streaming or pipes ..
Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
Re: Hadoop streaming or pipes ..
Thanks for the response Robert .. so the overhead will be in read/write and communication. But is the new process spawned a JVM or a regular process? Thanks, Mark On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans ev...@yahoo-inc.com wrote: Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming and pipes is that streaming uses stdin/stdout for this communication so preexisting processing like grep, sed and awk can be used here. Pipes uses a custom protocol with a C++ library to communicate. The C++ library is tagged with SWIG compatible data so that it can be wrapped to have APIs in other languages like python or perl. I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again. --Bobby Evans On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote: Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
Re: Custom Seq File Loader: ClassNotFoundException
Hi Madhu, it has the following line: TermDocFreqArrayWritable () {} but I'll try it with public access in case it's been called outside of my package. Thank you, Mark On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote: Hi, Please make sure that your CustomWritable has a default constructor. On Sat, Mar 3, 2012 at 4:56 AM, Mark question markq2...@gmail.com wrote: Hello, I'm trying to debug my code through eclipse, which worked fine with given Hadoop applications (eg. wordcount), but as soon as I run it on my application with my custom sequence input file/types, I get: Java.lang.runtimeException.java.ioException (Writable name can't load class) SequenceFile$Reader.getValeClass(Sequence File.class) because my valueClass is customed. In other words, how can I add/build my CustomWritable class to be with hadoop LongWritable,IntegerWritable etc. Did anyone used eclipse? Mark -- Join me at http://hadoopworkshop.eventbrite.com/
Re: Custom Seq File Loader: ClassNotFoundException
Unfortunately, public didn't change my error ... Any other ideas? Has anyone ran Hadoop on eclipse with custom sequence inputs ? Thank you, Mark On Mon, Mar 5, 2012 at 9:58 AM, Mark question markq2...@gmail.com wrote: Hi Madhu, it has the following line: TermDocFreqArrayWritable () {} but I'll try it with public access in case it's been called outside of my package. Thank you, Mark On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote: Hi, Please make sure that your CustomWritable has a default constructor. On Sat, Mar 3, 2012 at 4:56 AM, Mark question markq2...@gmail.com wrote: Hello, I'm trying to debug my code through eclipse, which worked fine with given Hadoop applications (eg. wordcount), but as soon as I run it on my application with my custom sequence input file/types, I get: Java.lang.runtimeException.java.ioException (Writable name can't load class) SequenceFile$Reader.getValeClass(Sequence File.class) because my valueClass is customed. In other words, how can I add/build my CustomWritable class to be with hadoop LongWritable,IntegerWritable etc. Did anyone used eclipse? Mark -- Join me at http://hadoopworkshop.eventbrite.com/
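For what it's worth, the usual shape of a custom Writable that SequenceFile can reload by reflection is sketched below. The field (an int array) is only a stand-in, since the real contents of TermDocFreqArrayWritable aren't shown in the thread; the two things that matter are the public no-arg constructor and having the class on the classpath of whatever reads the file (in Eclipse, the project that launches the job), because SequenceFile's reader loads the value class by name and otherwise fails with the "can't load class" error seen here.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class TermDocFreqArrayWritable implements Writable {

    private int[] freqs = new int[0]; // placeholder field, not Mark's actual layout

    // Public no-arg constructor: Hadoop instantiates Writables reflectively.
    public TermDocFreqArrayWritable() {}

    public void write(DataOutput out) throws IOException {
        out.writeInt(freqs.length);
        for (int f : freqs) {
            out.writeInt(f);
        }
    }

    public void readFields(DataInput in) throws IOException {
        freqs = new int[in.readInt()];
        for (int i = 0; i < freqs.length; i++) {
            freqs[i] = in.readInt();
        }
    }
}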
Re: Streaming Hadoop using C
Starfish worked great for wordcount .. I didn't run it on my application because I have only map tasks. Mark On Thu, Mar 1, 2012 at 4:34 AM, Charles Earl charles.ce...@gmail.comwrote: How was your experience of starfish? C On Mar 1, 2012, at 12:35 AM, Mark question wrote: Thank you for your time and suggestions, I've already tried starfish, but not jmap. I'll check it out. Thanks again, Mark On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.com wrote: I assume you have also just tried running locally and using the jdk performance tools (e.g. jmap) to gain insight by configuring hadoop to run absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant? On Feb 29, 2012, at 3:53 PM, Mark question wrote: I've used hadoop profiling (.prof) to show the stack trace but it was hard to follow. jConsole locally since I couldn't find a way to set a port number to child processes when running them remotely. Linux commands (top,/proc), showed me that the virtual memory is almost twice as my physical which means swapping is happening which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun) or would you recommend another monitoring tool? Thank you, Mark On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in MapReduce platform? Have you done profiling of the application? C On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Streaming Hadoop using C
Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Streaming Hadoop using C
Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.comwrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Streaming Hadoop using C
I've used hadoop profiling (.prof) to show the stack trace but it was hard to follow. jConsole locally since I couldn't find a way to set a port number to child processes when running them remotely. Linux commands (top,/proc), showed me that the virtual memory is almost twice as my physical which means swapping is happening which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun) or would you recommend another monitoring tool? Thank you, Mark On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.comwrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in MapReduce platform? Have you done profiling of the application? C On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
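One workaround for getting a monitoring port on the child JVMs (this is not something Hadoop manages for you) is to append the standard JVM JMX flags to mapred.child.java.opts, e.g. in mapred-site.xml as below. The port 8010 is arbitrary, and because every child on a node tries to bind the same port, this is only practical when the node runs one task at a time (e.g. maximum map/reduce tasks per tracker set to 1); any additional child JVM will fail to start because the port is taken.

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=8010 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false</value>
</property>

With that in place, jconsole or VisualVM can attach to tasktracker-host:8010 while a task is running.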
Re: Streaming Hadoop using C
Thank you for your time and suggestions, I've already tried starfish, but not jmap. I'll check it out. Thanks again, Mark On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.comwrote: I assume you have also just tried running locally and using the jdk performance tools (e.g. jmap) to gain insight by configuring hadoop to run absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant? On Feb 29, 2012, at 3:53 PM, Mark question wrote: I've used hadoop profiling (.prof) to show the stack trace but it was hard to follow. jConsole locally since I couldn't find a way to set a port number to child processes when running them remotely. Linux commands (top,/proc), showed me that the virtual memory is almost twice as my physical which means swapping is happening which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun) or would you recommend another monitoring tool? Thank you, Mark On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in MapReduce platform? Have you done profiling of the application? C On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: memory of mappers and reducers
Great! Thanks a lot Srinivas! Mark

On Thu, Feb 16, 2012 at 7:02 AM, Srinivas Surasani vas...@gmail.com wrote: 1) Yes, option 2 is enough. 2) The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child (map/reduce) processes. ** value of mapred.child.ulimit > value of mapred.child.java.opts

On Thu, Feb 16, 2012 at 12:38 AM, Mark question markq2...@gmail.com wrote: Thanks for the reply Srinivas, so option 2 will be enough. However, when I tried setting it to 512MB, I see through the system monitor that the map task is given 275MB of real memory!! Is that normal in hadoop, to go over the upper bound of memory given by the property mapred.child.java.opts? Mark

On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani vas...@gmail.com wrote: Hey Mark, Yes, you can limit the memory for each task with the mapred.child.java.opts property. Set this to final if no developer has to change it. A little intro to mapred.task.default.maxvmem: This property has to be set on both the JobTracker, for making scheduling decisions, and on the TaskTracker nodes for the sake of memory management. If a job doesn't specify its virtual memory requirement by setting mapred.task.maxvmem to -1, tasks are assured a memory limit set to this property. This property is set to -1 by default. This value should in general be less than the cluster-wide configuration mapred.task.limit.maxvmem. If not, or if it is not set, the TaskTracker's memory management will be disabled and a scheduler's memory-based scheduling decisions may be affected.

On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com wrote: Hi, My question is what's the difference between the following two settings: 1. mapred.task.default.maxvmem 2. mapred.child.java.opts The first one is used by the TT to monitor the memory usage of tasks, while the second one is the maximum heap space assigned for each task. I want to limit each task to use up to say 100MB of memory. Can I use only #2? Thank you, Mark

-- Srinivas srini...@cloudwick.com
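To make the 100MB limit from the original question concrete, a minimal mapred-site.xml entry would look like the one below (the <final> flag is what "set this to final" refers to, so individual jobs can't override it). Keep in mind that -Xmx bounds only the Java heap, so the child process's total virtual memory will still be larger; mapred.child.ulimit (a value in kilobytes) is the knob that bounds the whole process, and per the note above it should be set higher than the heap size.

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx100m</value>
  <final>true</final>
</property>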
memory of mappers and reducers
Hi, My question is what's the difference between the following two settings: 1. mapred.task.default.maxvmem 2. mapred.child.java.opts The first one is used by the TT to monitor the memory usage of tasks, while the second one is the maximum heap space assigned for each task. I want to limit each task to use upto say 100MB of memory. Can I use only #2 ?? Thank you, Mark
Re: memory of mappers and reducers
Thanks for the reply Srinivas, so option 2 will be enough, however, when I tried setting it to 512MB, I see through the system monitor that the map task is given 275MB of real memory!! Is that normal in hadoop to go over the upper bound of memory given by the property mapred.child.java.opts. Mark On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani vas...@gmail.com wrote: Hey Mark, Yes, you can limit the memory for each task with mapred.child.java.opts property. Set this to final if no developer has to change it . Little intro to mapred.task.default.maxvmem This property has to be set on both the JobTracker for making scheduling decisions and on the TaskTracker nodes for the sake of memory management. If a job doesn't specify its virtual memory requirement by setting mapred.task.maxvmem to -1, tasks are assured a memory limit set to this property. This property is set to -1 by default. This value should in general be less than the cluster-wide configuration mapred.task.limit.maxvmem. If not or if it is not set, TaskTracker's memory management will be disabled and a scheduler's memory based scheduling decisions may be affected. On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com wrote: Hi, My question is what's the difference between the following two settings: 1. mapred.task.default.maxvmem 2. mapred.child.java.opts The first one is used by the TT to monitor the memory usage of tasks, while the second one is the maximum heap space assigned for each task. I want to limit each task to use upto say 100MB of memory. Can I use only #2 ?? Thank you, Mark -- -- Srinivas srini...@cloudwick.com
Namenode no lease exception ... what does it mean?
Hi guys, Even though there is enough space on HDFS as shown by -report ... I get the following 2 error shown first in the log of a datanode and the second on Namenode log: 1)2012-02-09 10:18:37,519 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_8448117986822173955 is added to invalidSet of 10.0.40.33:50010 2) 2012-02-09 10:18:41,788 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_132544693472320409_2778 on 10.0.40.12:50010 size 67108864 But it does not belong to any file. 2012-02-09 10:18:41,789 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 12123, call addBlock(/user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247, DFSClient_attempt_201202090811_0005_m_000247_0) from 10.0.40.12:34103: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247 File does not exist. Holder DFSClient_attempt_201202090811_0005_m_000247_0 does not have any open files. org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247 File does not exist. Holder DFSClient_attempt_201202090811_0005_m_000247_0 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1323) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1251) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) Any other ways to debug this? Thanks, Mark
Re: Too many open files Error
Hi Harsh and Idris ... so the only drawback of increasing the value of xcievers is the memory issue? In that case I'll set it to a value that doesn't fill the memory ... Thanks, Mark

On Thu, Jan 26, 2012 at 10:37 PM, Idris Ali psychid...@gmail.com wrote: Hi Mark, As Harsh pointed out, it is not a good idea to increase the Xceiver count to an arbitrarily high value; I suggested increasing the xceiver count just to unblock execution of your program temporarily. Thanks, -Idris

On Fri, Jan 27, 2012 at 10:39 AM, Harsh J ha...@cloudera.com wrote: You are technically allowing the DN to run 1 million block transfer (in/out) threads by doing that. It does not take up resources by default, sure, but now it can be abused with requests to make your DN run out of memory and crash because it's not bound to proper limits anymore.

On Fri, Jan 27, 2012 at 5:49 AM, Mark question markq2...@gmail.com wrote: Harsh, could you explain briefly why the 1M setting for xceivers is bad? The job is working now ... about the ulimit -u, it shows 200703, so is that why the connection is reset by peer? How come it's working with the xceiver modification? Thanks, Mark

On Thu, Jan 26, 2012 at 12:21 PM, Harsh J ha...@cloudera.com wrote: Agree with Raj V here - your problem should not be the # of transfer threads nor the number of open files, given that stacktrace. And the values you've set for the transfer threads are far beyond recommendations of 4k/8k - I would not recommend doing that. The default in 1.0.0 is 256, but set it to 2048/4096, which are good values to have when noticing increased HDFS load, or when running services like HBase. You should instead focus on why it's this particular job (or even a particular task, which is important to notice) that fails, and not other jobs (or other task attempts).

On Fri, Jan 27, 2012 at 1:10 AM, Raj V rajv...@yahoo.com wrote: Mark, you have this "Connection reset by peer". Why do you think this problem is related to too many open files? Raj

From: Mark question markq2...@gmail.com To: common-user@hadoop.apache.org Sent: Thursday, January 26, 2012 11:10 AM Subject: Re: Too many open files Error

Hi again, I've tried:

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>1048576</value>
</property>

but I'm still getting the same error ... how high can I go?? Thanks, Mark

On Thu, Jan 26, 2012 at 9:29 AM, Mark question markq2...@gmail.com wrote: Thanks for the reply. I have nothing about dfs.datanode.max.xceivers in my hdfs-site.xml, so hopefully this will solve the problem. About the ulimit -n, I'm running on an NFS cluster, so usually I just start Hadoop with a single bin/start-all.sh ... Do you think I can add it by bin/Datanode -ulimit n? Mark

On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn mapred.le...@gmail.com wrote: You need to set ulimit -n to a bigger value on the datanodes and restart the datanodes. Sent from my iPhone

On Jan 26, 2012, at 6:06 AM, Idris Ali psychid...@gmail.com wrote: Hi Mark, On a lighter note, what is the count of xceivers? The dfs.datanode.max.xceivers property in hdfs-site.xml? Thanks, -idris

On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel michael_se...@hotmail.com wrote: Sorry, going from memory... As user hadoop or mapred or hdfs, what do you see when you do a ulimit -a? That should give you the number of open files allowed by a single user... Sent from a remote device. Please excuse any typos... Mike Segel

On Jan 26, 2012, at 5:13 AM, Mark question markq2...@gmail.com wrote: Hi guys, I get this error from a job trying to process 3 million records.
java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288) When I checked the logfile of the datanode-20, I see : 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration( 192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native
Re: Too many open files Error
Thanks for the reply I have nothing about dfs.datanode.max.xceivers on my hdfs-site.xml so hopefully this would solve the problem and about the ulimit -n , I'm running on an NFS cluster, so usually I just start Hadoop with a single bin/start-all.sh ... Do you think I can add it by bin/Datanode -ulimit n ? Mark On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn mapred.le...@gmail.comwrote: U need to set ulimit -n bigger value on datanode and restart datanodes. Sent from my iPhone On Jan 26, 2012, at 6:06 AM, Idris Ali psychid...@gmail.com wrote: Hi Mark, On a lighter note what is the count of xceivers? dfs.datanode.max.xceivers property in hdfs-site.xml? Thanks, -idris On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel michael_se...@hotmail.com wrote: Sorry going from memory... As user Hadoop or mapred or hdfs what do you see when you do a ulimit -a? That should give you the number of open files allowed by a single user... Sent from a remote device. Please excuse any typos... Mike Segel On Jan 26, 2012, at 5:13 AM, Mark question markq2...@gmail.com wrote: Hi guys, I get this error from a job trying to process 3Million records. java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288) When I checked the logfile of the datanode-20, I see : 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration( 192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202) at sun.nio.ch.IOUtil.read(IOUtil.java:175) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.BufferedInputStream.read1(BufferedInputStream.java:256) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at java.io.DataInputStream.read(DataInputStream.java:132) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103) at java.lang.Thread.run(Thread.java:662) Which is because I'm running 10 maps per taskTracker on a 20 node cluster, each map opens about 300 files so that should give 6000 opened files at the same time ... why is this a problem? 
the maximum # of files per process on one machine is: cat /proc/sys/fs/file-max --- 2403545 Any suggestions? Thanks, Mark
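For anyone hitting this later, the settings discussed above, with the 2048-4096 range Harsh recommends rather than 1M, would go in hdfs-site.xml on the datanodes (note the property name keeps Hadoop's historical misspelling):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>

The open-file limit (ulimit -n) is an OS-level setting inherited from the environment that launches the datanode, not a Hadoop flag, so as far as I know there is no bin/datanode -ulimit option; it typically has to be raised in /etc/security/limits.conf (or in the shell/hadoop-env.sh that start-all.sh runs under) before restarting the datanodes.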
Re: connection between slaves and master
exactly right. Thanks Praveen. Mark On Tue, Jan 10, 2012 at 1:54 AM, Praveen Sripati praveensrip...@gmail.comwrote: Mark, [mark@node67 ~]$ telnet node77 You need to specify the port number along with the server name like `telnet node77 1234`. 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s). Slaves are not able to connect to the master. The configurations ` fs.default.name` and `mapred.job.tracker` should point to the master and not to localhost when the master and slaves are on different machines. Praveen On Mon, Jan 9, 2012 at 11:41 PM, Mark question markq2...@gmail.com wrote: Hello guys, I'm requesting from a PBS scheduler a number of machines to run Hadoop and even though all hadoop daemons start normally on the master and slaves, the slaves don't have worker tasks in them. Digging into that, there seems to be some blocking between nodes (?) don't know how to describe it except that on slave if I telnet master-node it should be able to connect, but I get this error: [mark@node67 ~]$ telnet node77 Trying 192.168.1.77... telnet: connect to address 192.168.1.77: Connection refused telnet: Unable to connect to remote host: Connection refused The log at the slave nodes shows the same thing, even though it has datanode and tasktracker started from the maste (?): 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s). 2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 1 time(s). 2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 2 time(s). 2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 3 time(s). 2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 4 time(s). 2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 5 time(s). 2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 6 time(s). 2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 7 time(s). 2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 8 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 9 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at localhost/ 127.0.0.1:12123 not available yet, Z... Any suggestions of what I can do? Thanks, Mark
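Spelled out, the change Praveen describes goes in core-site.xml and mapred-site.xml on every node. Assuming node77 is the master here, and with the ports as examples only (12123 is the NameNode port that appears in the logs; the JobTracker port below is arbitrary):

core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://node77:12123</value>
</property>

mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>node77:12124</value>
</property>

The connectivity test then becomes telnet node77 12123; plain telnet node77 tries port 23, which is usually refused regardless of Hadoop.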
connection between slaves and master
Hello guys, I'm requesting from a PBS scheduler a number of machines to run Hadoop and even though all hadoop daemons start normally on the master and slaves, the slaves don't have worker tasks in them. Digging into that, there seems to be some blocking between nodes (?) don't know how to describe it except that on slave if I telnet master-node it should be able to connect, but I get this error: [mark@node67 ~]$ telnet node77 Trying 192.168.1.77... telnet: connect to address 192.168.1.77: Connection refused telnet: Unable to connect to remote host: Connection refused The log at the slave nodes shows the same thing, even though it has datanode and tasktracker started from the maste (?): 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s). 2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 1 time(s). 2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 2 time(s). 2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 3 time(s). 2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 4 time(s). 2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 5 time(s). 2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 6 time(s). 2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 7 time(s). 2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 8 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 9 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at localhost/ 127.0.0.1:12123 not available yet, Z... Any suggestions of what I can do? Thanks, Mark
Re: Expected file://// error
mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:10001</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>10</value>
  </property>
</configuration>

The command runs a script which runs a Java program that submits two jobs consecutively, waiting for the first job to finish (this works on my laptop but not on the cluster). On the cluster I get:

hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar, expected: file:///
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
        at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165)
        at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137)
        at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at Main.run(Main.java:304)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at Main.main(Main.java:53)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

The first job's output is:
folder/_logs
folder/part-0

I set folder as the input path to the next job; could it be from the _logs ... ? But again, it worked on my laptop under hadoop-0.21.0. The cluster has hadoop-0.20.2. Thanks, Mark
Re: Expected file://// error
It's already in there ... don't worry about it, I'm submitting the first job then the second job manually for now. export CLASSPATH=/home/mark/hadoop-0.20.2/conf:$CLASSPATH export CLASSPATH=/home/mark/hadoop-0.20.2/lib:$CLASSPATH export CLASSPATH=/home/mark/hadoop-0.20.2/hadoop-0.20.2-core.jar:/home/mark/hadoop-0.20.2/lib/commons-cli-1.2.jar:$CLASSPATH Thank you for your time, Mark On Sun, Jan 8, 2012 at 12:22 PM, Joey Echeverria j...@cloudera.com wrote: What's the classpath of the java program submitting the job? It has to have the configuration directory (e.g. /opt/hadoop/conf) in there or it won't pick up the correct configs. -Joey On Sun, Jan 8, 2012 at 12:59 PM, Mark question markq2...@gmail.com wrote: mapred-site.xml: configuration property namemapred.job.tracker/name valuelocalhost:10001/value /property property namemapred.child.java.opts/name value-Xmx1024m/value /property property namemapred.tasktracker.map.tasks.maximum/name value10/value /property /configuration Command is running a script which runs a java program that submit two jobs consecutively insuring waiting for the first job ( is working on my laptop but on the cluster). On the cluster I get: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:304) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) The first job output is : folder_logs folderpart-0 I'm set folder as input path to the next job, could it be from the _logs ... ? but again it worked on my laptop under hadoop-0.21.0. The cluster has hadoop-0.20.2. Thanks, Mark -- Joseph Echeverria Cloudera, Inc. 443.305.9434
Expected file://// error
Hello, I'm running two jobs on Hadoop-0.20.2 consecutively, such that the second one reads the output of the first which would look like: outputPath/part-0 outputPath/_logs But I get the error: 12/01/06 03:29:34 WARN fs.FileSystem: localhost:12123 is a deprecated filesystem name. Use hdfs://localhost:12123/ instead. java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201060323_0005/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:301) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) This looks similar to the problem described here but for older versions than mine: https://issues.apache.org/jira/browse/HADOOP-5259 I tried applying that patch, but probably due to different versions didn't work. Can anyone help? Thank you, Mark
Re: Expected file://// error
Hi Harsh, thanks for the reply, you were right, I didn't have hdfs://, but even after inserting it I still get the error. java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:304) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Mark On Fri, Jan 6, 2012 at 6:02 AM, Harsh J ha...@cloudera.com wrote: What is your fs.default.name set to? It should be set to hdfs://host:port and not just host:port. Can you ensure this and retry? On 06-Jan-2012, at 5:45 PM, Mark question wrote: Hello, I'm running two jobs on Hadoop-0.20.2 consecutively, such that the second one reads the output of the first which would look like: outputPath/part-0 outputPath/_logs But I get the error: 12/01/06 03:29:34 WARN fs.FileSystem: localhost:12123 is a deprecated filesystem name. Use hdfs://localhost:12123/ instead. 
java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201060323_0005/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:301) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) This looks similar to the problem described here but for older versions than mine: https://issues.apache.org/jira/browse/HADOOP-5259 I tried applying that patch, but probably due to different versions didn't work. Can anyone help? Thank you, Mark
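To make Harsh's suggestion concrete, the value that the deprecation warning complains about lives in core-site.xml, and with the host/port from this thread it would read:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:12123</value>
</property>

If the "expected: file:///" error survives that change, it often means the submitting JVM is not actually reading this file and is falling back to the built-in file:/// default, which is why Joey asks about the classpath elsewhere in the thread: the conf directory itself (e.g. /home/mark/hadoop-0.20.2/conf) has to be on it, not just the Hadoop jars.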
Connection reset by peer Error
Hi, I've been getting this error multiple times now, the namenode mentions something about peer resetting connection, but I don't know why this is happening, because I'm running on a single machine with 12 cores any ideas? The job starting running normally, which contains about 200 mappers each opens 200 files (one file at a time inside mapper code) then: .. . ... 11/11/20 06:27:52 INFO mapred.JobClient: map 55% reduce 0% 11/11/20 06:28:38 INFO mapred.JobClient: map 56% reduce 0% 11/11/20 06:29:18 INFO mapred.JobClient: Task Id : attempt_20200450_0001_m_ 000219_0, Status : FAILED org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/mark/output/_temporary/_attempt_20200450_0001_m_000219_0/part-00219 could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422) at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) at org.apache.hadoop.ipc.Client.call(Client.java:740) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy1.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy1.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2937) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288) ... ... 
Namenode Log: 2011-11-20 06:29:51,964 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1cmd=opensrc=/user/mark/input/G14_10_aldst=null perm=null 2011-11-20 06:29:52,039 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1cmd=opensrc=/user/mark/input/G13_12_aqdst=null perm=null 2011-11-20 06:29:52,178 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1cmd=opensrc=/user/mark/input/G14_10_andst=null perm=null 2011-11-20 06:29:52,348 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_-2308051162058662821_1643 size 20024660 2011-11-20 06:29:52,348 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /user/mark/output/_temporary/_attempt_20200450_0001_m_000222_0/part-00222 is closed by DFSClient_attempt_20200450_0001_m_000222_0 2011-11-20 06:29:52,351 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_9206172750679206987_1639 size 51330092 2011-11-20 06:29:52,352 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /user/mark/output/_temporary/_attempt_20200450_0001_m_000226_0/part-00226 is closed by DFSClient_attempt_20200450_0001_m_000226_0 2011-11-20 06:29:52,416 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1cmd=create src=/user/mark/output/_temporary/_attempt_20200450_0001_m_000223_2/part-00223 dst=nullperm=mark:supergroup:rw-r--r-- 2011-11-20 06:29:52,430 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 12123: readAndProcess threw exception java.io.IOException:Connection reset by peer. Count of bytes read: 0 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202) at sun.nio.ch.IOUtil.read(IOUtil.java:175) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243) at org.apache.hadoop.ipc.Server.channelRead(Server.java:1211) at
reading Hadoop output messages
Hi all, I'm wondering if there is a way to capture the output messages that are printed from the main class of a Hadoop job. Usually redirecting stdout/stderr to out.log (e.g. with 2>&1) would work, but in this case it only saves the messages printed in the main class before the job starts. What I want is the messages that are printed in the main class after the job is done. For example, in my main class: try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } //submit job to JT sLogger.info("\n Job Finished in " + (System.currentTimeMillis() - startTime) / 6.0 + " Minutes."); I can't see that last message unless I'm watching the screen. Any ideas? Thank you, Mark
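For the redirection itself, a sketch of the usual shell invocation (the jar name and class are placeholders, not from the thread): send both stdout and stderr to the log file, since log4j messages such as the sLogger.info(...) line typically go to stderr:

hadoop jar myjob.jar my.package.Main /input /output > out.log 2>&1

If the final message still doesn't show up in out.log, it may be going through a log4j appender configured elsewhere rather than the console.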
Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2
I have the same issue and the output of curl localhost:50030 is like yours, and I'm running on a remote cluster in pseudo-distributed mode. Can anyone help? Thanks, Mark On Mon, Oct 24, 2011 at 11:02 AM, Sameer Farooqui cassandral...@gmail.com wrote: Hi guys, I'm running a 1-node Hadoop 0.20.2 pseudo-distributed node with RedHat 6.1 on Amazon EC2 and while my node is healthy, I can't seem to get the JobTracker GUI working. Running 'curl localhost:50030' from the CMD line returns a valid HTML file. Ports 50030, 50060, 50070 are open in the Amazon Security Group. MapReduce jobs are starting and completing successfully, so my Hadoop install is working fine. But when I try to access the web GUI from a Chrome browser on my local computer, I get nothing. Any thoughts? I tried some Google searches and even did a hail-mary Bing search, but none of them were fruitful. Some troubleshooting I did is below: [root@ip-10-86-x-x ~]# jps 1337 QuorumPeerMain 1494 JobTracker 1410 DataNode 1629 SecondaryNameNode 1556 NameNode 1694 TaskTracker 1181 HRegionServer 1107 HMaster 11363 Jps [root@ip-10-86-x-x ~]# curl localhost:50030 <meta HTTP-EQUIV="REFRESH" content="0;url=jobtracker.jsp"/> <html> <head> <title>Hadoop Administration</title> </head> <body> <h1>Hadoop Administration</h1> <ul> <li><a href="jobtracker.jsp">JobTracker</a></li> </ul> </body> </html>
Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2
Thank you, I'll try it. Mark On Mon, Oct 24, 2011 at 1:50 PM, Sameer Farooqui cassandral...@gmail.comwrote: Mark, We figured it out. It's an issue with RedHat's IPTables. You have to open up those ports: vim /etc/sysconfig/iptables Make the file look like this # Firewall configuration written by system-config-firewall # Manual customization of this file is not recommended. *filter :INPUT ACCEPT [0:0] :FORWARD ACCEPT [0:0] :OUTPUT ACCEPT [0:0] -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT -A INPUT -p icmp -j ACCEPT -A INPUT -i lo -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50030 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50060 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50070 -j ACCEPT -A INPUT -j REJECT --reject-with icmp-host-prohibited -A FORWARD -j REJECT --reject-with icmp-host-prohibited COMMIT Restart the web services /etc/init.d/iptables restart iptables: Flushing firewall rules: [ OK ] iptables: Setting chains to policy ACCEPT: filter [ OK ] iptables: Unloading modules: [ OK ] iptables: Applying firewall rules: [ OK ] On Mon, Oct 24, 2011 at 1:37 PM, Mark question markq2...@gmail.com wrote: I have the same issue and the output of curl localhost:50030 is like yours, and I'm running on a remote cluster on pesudo-distributed mode. Can anyone help? Thanks, Mark On Mon, Oct 24, 2011 at 11:02 AM, Sameer Farooqui cassandral...@gmail.comwrote: Hi guys, I'm running a 1-node Hadoop 0.20.2 pseudo-distributed node with RedHat 6.1 on Amazon EC2 and while my node is healthy, I can't seem to get to the JobTracker GUI working. Running 'curl localhost:50030' from the CMD line returns a valid HTML file. Ports 50030, 50060, 50070 are open in the Amazon Security Group. MapReduce jobs are starting and completing successfully, so my Hadoop install is working fine. But when I try to access the web GUI from a Chrome browser on my local computer, I get nothing. Any thoughts? I tried some Google searches and even did a hail-mary Bing search, but none of them were fruitful. Some troubleshooting I did is below: [root@ip-10-86-x-x ~]# jps 1337 QuorumPeerMain 1494 JobTracker 1410 DataNode 1629 SecondaryNameNode 1556 NameNode 1694 TaskTracker 1181 HRegionServer 1107 HMaster 11363 Jps [root@ip-10-86-x-x ~]# curl localhost:50030 meta HTTP-EQUIV=REFRESH content=0;url=jobtracker.jsp/ html head titleHadoop Administration/title /head body h1Hadoop Administration/h1 ul lia href=jobtracker.jspJobTracker/a/li /ul /body /html
Remote Blocked Transfer count
Hello, I wonder if there is a way to measure how many of the data blocks have transferred over the network? Or more generally, how many times where there a connection/contact between different machines? I thought of checking the Namenode log file which usually shows blk_ from src= to dst ... but I'm not sure if it's correct to count those lines. Any ideas are helpful. Mark
fixing the mapper percentage viewer
Hi all, I've written a custom MapRunner, but it seems to have broken the map progress percentage shown on the console. I want to know which part of the code is responsible for updating the map percentage ... Is it the following in MapRunner: if (incrProcCount) { reporter.incrCounter(SkipBadRecords.COUNTER_GROUP, SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1); } Thank you, Mark
Re: hadoop input buffer size
Thanks for the clarifications guys :) Mark On Mon, Oct 10, 2011 at 8:27 AM, Uma Maheswara Rao G 72686 mahesw...@huawei.com wrote: I think below can give you more info about it. http://developer.yahoo.com/blogs/hadoop/posts/2009/08/the_anatomy_of_hadoop_io_pipel/ Nice explanation by Owen here. Regards, Uma - Original Message - From: Yang Xiaoliang yangxiaoliang2...@gmail.com Date: Wednesday, October 5, 2011 4:27 pm Subject: Re: hadoop input buffer size To: common-user@hadoop.apache.org Hi, Hadoop neither read one line each time, nor fetching dfs.block.size of lines into a buffer, Actually, for the TextInputFormat, it read io.file.buffer.size bytes of text into a buffer each time, this can be seen from the hadoop source file LineReader.java 2011/10/5 Mark question markq2...@gmail.com Hello, Correct me if I'm wrong, but when a program opens n-files at the same time to read from, and start reading from each file at a time 1 line at a time. Isn't hadoop actually fetching dfs.block.size of lines into a buffer? and not actually one line. If this is correct, I set up my dfs.block.size = 3MB and each line takes about 650 bytes only, then I would assume the performance for reading 1-4000 lines would be the same, but it isn't ! Do you know a way to find #n of lines to be read at once? Thank you, Mark
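Following up on the io.file.buffer.size point above: if you want to experiment with how much text LineReader pulls in per read, that buffer size can be set per job. A small sketch (the 128 KB value and the MyJob driver class are just examples, not from the thread):

JobConf conf = new JobConf(MyJob.class);       // MyJob is a placeholder driver class
conf.setInt("io.file.buffer.size", 131072);    // bytes read into the LineReader buffer per fill

The dfs.block.size setting only controls how the file is stored and split, not the size of each read.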
hadoop input buffer size
Hello, Correct me if I'm wrong, but when a program opens n files at the same time and reads from each of them one line at a time, isn't Hadoop actually fetching dfs.block.size worth of lines into a buffer rather than just one line? If that is correct, I set dfs.block.size = 3MB and each line takes only about 650 bytes, so I would expect the performance of reading 1-4000 lines to be the same, but it isn't! Do you know a way to find the number of lines that are read at once? Thank you, Mark
Mapper Progress
Hi, I have my custom MapRunner which apparently seemed to affect the progress report of the mapper and showing 100% while the mapper is still reading files to process. Where can I change/add a progress object to be shown in UI ? Thank you, Mark
Re: One file per mapper
Hi Govind, You should override the FileInputFormat isSplitable function in a class, say myFileInputFormat extends FileInputFormat, as follows: @Override public boolean isSplitable(FileSystem fs, Path filename) { return false; } Then you use your myFileInputFormat class as the job's input format. To know the path, write the following in your mapper class: @Override public void configure(JobConf job) { Path inputPath = new Path(job.get("map.input.file")); } ~cheers, Mark On Tue, Jul 5, 2011 at 1:04 PM, Govind Kothari govindkoth...@gmail.com wrote: Hi, I am new to hadoop. I have a set of files and I want to assign each file to a mapper. Also in the mapper there should be a way to know the complete path of the file. Can you please tell me how to do that? Thanks, Govind -- Govind Kothari Graduate Student Dept. of Computer Science University of Maryland College Park ---Seek Excellence, Success will Follow ---
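To make that concrete, here is a small self-contained sketch of the whole idea for the old (mapred) API; the class names WholeFileTextInputFormat and PathAwareMapper are made up for the example:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// One mapper per file: never split the input files.
class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }
}

class PathAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private Path inputPath;

  @Override
  public void configure(JobConf job) {
    // Full path of the file this map task was assigned.
    inputPath = new Path(job.get("map.input.file"));
  }

  public void map(LongWritable offset, Text line,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text(inputPath.toString()), line);
  }
}

The driver would then call conf.setInputFormat(WholeFileTextInputFormat.class).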
One node with Rack-local mappers ?!!!
Hi, this is weird ... I'm running a job on single node with 32 mappers, running one at a time. Output says this: .. 11/06/16 00:59:43 INFO mapred.JobClient: Rack-local map tasks=18 == 11/06/16 00:59:43 INFO mapred.JobClient: Launched map tasks=32 11/06/16 00:59:43 INFO mapred.JobClient: Data-local map tasks=14 Number of Hadoop nodes specified by user: 1 Received 1 nodes from PBS Clean up node: tcc-5-72 When is that usually possible? Thank you, Mark
Hadoop Runner
Hi, 1) Where can I find the main class of Hadoop, the one that calls the InputFormat and then the MapperRunner, ReducerRunner and the others? This would help me understand what is in memory and what is still on disk, and the exact flow of data between splits and mappers. My problem is: assuming I have a TextInputFormat and would like to modify the input in memory before it is read by the RecordReader, where should I do that? InputFormat was my first guess, but unfortunately it only defines the logical splits ... So the only way I can think of is to use the RecordReader to read all the records in the split into another variable (with the format I want) and then process that variable in the map functions. But is that efficient? So, to understand this, I hope someone can give an answer to Q(1). Thank you, Mark
org.apache.hadoop.mapred.Utils can not be resolved
Hi, My question here is more general than this particular problem: how can you know which jar file will resolve an error such as: org.apache.hadoop.mapred.Utils can not be resolved. I don't plan to include all the Hadoop jars ... well, I hope not to ... Can you tell me your techniques? Thanks, Mark
DiskUsage class DU Error
Hi, Has anyone tried using the DU class to report HDFS file sizes? Both of the following lines are causing errors, running on a Mac: DU DiskUsage = new DU(new File(outDir.getPath()), 12L); DU DiskUsage = new DU(new File(outDir.getName()), (Configuration) conf); where: Path outDir = SequenceFileOutputFormat.getOutputPath(conf); // working fine Exception in thread "main" java.io.IOException: Expecting a line not the end of stream at org.apache.hadoop.fs.DU.parseExecResult(DU.java:185) at org.apache.hadoop.util.Shell.runCommand(Shell.java:238) at org.apache.hadoop.util.Shell.run(Shell.java:183) at org.apache.hadoop.fs.DU.<init>(DU.java:57) at Analysis.analyzeOutput(Analysis.java:22) at Main.main(Main.java:48) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:192) I run this DU check after the job is done. Any hints? Thank you, Mark
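Not a fix for DU itself, but if the goal is just the total size of the job's output on HDFS, a sketch that avoids shelling out to du entirely (reusing the outDir and conf from the snippet above); my suspicion is that DU fails here because it runs the local du command against what is really an HDFS path:

Path outDir = SequenceFileOutputFormat.getOutputPath(conf);
FileSystem fs = outDir.getFileSystem(conf);
long bytes = fs.getContentSummary(outDir).getLength();   // total bytes under outDir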
Re: re-reading
Thanks for the replies, but input doesn't have 'clone' I don't know why ... so I'll have to write my custom inputFormat ... I was hoping for an easier way though. Thank you, Mark On Wed, Jun 8, 2011 at 1:58 AM, Harsh J ha...@cloudera.com wrote: Or if that does not work for any reason (haven't tried it really), try writing your own InputFormat wrapper where in you can have direct access to the InputSplit object to do what you want to (open two record readers, and manage them separately). On Wed, Jun 8, 2011 at 1:48 PM, Stefan Wienert ste...@wienert.cc wrote: Try input.clone()... 2011/6/8 Mark question markq2...@gmail.com: Hi, I'm trying to read the inputSplit over and over using following function in MapperRunner: @Override public void run(RecordReader input, OutputCollector output, Reporter reporter) throws IOException { RecordReader copyInput = input; //First read while(input.next(key,value)); //Second read while(copyInput.next(key,value)); } It can clearly be seen that this won't work because both RecordReaders are actually the same. I'm trying to find a way for the second reader to start reading the split again from beginning ... How can I do that? Thanks, Mark -- Stefan Wienert http://www.wienert.cc ste...@wienert.cc Telefon: +495251-2026838 Mobil: +49176-40170270 -- Harsh J
Re: re-reading
I have a question though for Harsh case... I wrote my custom inputFormat which will create an array of recordReaders and give them to the MapRunner. Will that mean multiple copies of the inputSplit are all in memory? or will there be one copy pointed by all of them .. as if they were pointers ? Thanks, Mark On Wed, Jun 8, 2011 at 9:13 AM, Mark question markq2...@gmail.com wrote: Thanks for the replies, but input doesn't have 'clone' I don't know why ... so I'll have to write my custom inputFormat ... I was hoping for an easier way though. Thank you, Mark On Wed, Jun 8, 2011 at 1:58 AM, Harsh J ha...@cloudera.com wrote: Or if that does not work for any reason (haven't tried it really), try writing your own InputFormat wrapper where in you can have direct access to the InputSplit object to do what you want to (open two record readers, and manage them separately). On Wed, Jun 8, 2011 at 1:48 PM, Stefan Wienert ste...@wienert.cc wrote: Try input.clone()... 2011/6/8 Mark question markq2...@gmail.com: Hi, I'm trying to read the inputSplit over and over using following function in MapperRunner: @Override public void run(RecordReader input, OutputCollector output, Reporter reporter) throws IOException { RecordReader copyInput = input; //First read while(input.next(key,value)); //Second read while(copyInput.next(key,value)); } It can clearly be seen that this won't work because both RecordReaders are actually the same. I'm trying to find a way for the second reader to start reading the split again from beginning ... How can I do that? Thanks, Mark -- Stefan Wienert http://www.wienert.cc ste...@wienert.cc Telefon: +495251-2026838 Mobil: +49176-40170270 -- Harsh J
Re: re-reading
Before reading the split API, I assumed it was the actual split; my bad. Thanks a lot Harsh, it's working great! Mark
re-reading
Hi, I'm trying to read the inputSplit over and over using following function in MapperRunner: @Override public void run(RecordReader input, OutputCollector output, Reporter reporter) throws IOException { RecordReader copyInput = input; //First read while(input.next(key,value)); //Second read while(copyInput.next(key,value)); } It can clearly be seen that this won't work because both RecordReaders are actually the same. I'm trying to find a way for the second reader to start reading the split again from beginning ... How can I do that? Thanks, Mark
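For anyone landing on this thread later, a rough sketch of the suggestion in the replies above (open a second record reader yourself) for the old (mapred) API. It assumes the input is file-based, so the split can be rebuilt from what I believe are the map.input.file / map.input.start / map.input.length properties set for each map task, and that TextInputFormat is in use; the class name TwoPassMapRunner is made up:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunner;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class TwoPassMapRunner extends MapRunner<LongWritable, Text, Text, Text> {
  private JobConf job;

  @Override
  public void configure(JobConf job) {
    super.configure(job);
    this.job = job;
  }

  @Override
  public void run(RecordReader<LongWritable, Text> input,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    LongWritable key = input.createKey();
    Text value = input.createValue();

    // First pass over the split.
    while (input.next(key, value)) { /* ... */ }

    // Rebuild the same split and open a fresh reader for the second pass.
    FileSplit split = new FileSplit(new Path(job.get("map.input.file")),
        job.getLong("map.input.start", 0), job.getLong("map.input.length", 0),
        (String[]) null);
    TextInputFormat fmt = new TextInputFormat();
    fmt.configure(job);
    RecordReader<LongWritable, Text> second = fmt.getRecordReader(split, job, reporter);
    try {
      while (second.next(key, value)) { /* ... */ }
    } finally {
      second.close();
    }
  }
}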
Reducing Mapper InputSplit size
Hi, Does anyone have a way to reduce InputSplit size in general ? By default, the minimum size chunk that map input should be split into is set to 0 (ie.mapred.min.split.size). Can I change dfs.block.size or some other configuration to reduce the split size and spawn many mappers? Thanks, Mark
Re: Reducing Mapper InputSplit size
Great! Thanks guys :) Mark 2011/6/6 Panayotis Antonopoulos antonopoulos...@hotmail.com Hi Mark, Check: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html I think that setMaxInputSplitSize(Job job, long size) will do what you need. Regards, P.A. Date: Mon, 6 Jun 2011 19:31:17 -0700 Subject: Reducing Mapper InputSplit size From: markq2...@gmail.com To: common-user@hadoop.apache.org Hi, Does anyone have a way to reduce InputSplit size in general ? By default, the minimum size chunk that map input should be split into is set to 0 (ie.mapred.min.split.size). Can I change dfs.block.size or some other configuration to reduce the split size and spawn many mappers? Thanks, Mark
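For reference, a short sketch of the suggestion above using the new (mapreduce) API; the 16 MB cap and the input path are just example values:

Job job = new Job(new Configuration());
// Cap split size so large files are broken into more, smaller splits (= more mappers).
FileInputFormat.setMaxInputSplitSize(job, 16 * 1024 * 1024L);
FileInputFormat.addInputPath(job, new Path("/input"));   // example path

Note this is org.apache.hadoop.mapreduce.lib.input.FileInputFormat; as far as I know the older mapred API does not have the same setter.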
SequenceFile.Reader
Hi, Does anyone know if SequenceFile.next(key) actually avoids reading the value into memory? The javadoc says: next(Writable key) - Read the next key in the file into key, skipping its value. (next: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.Reader.html#next%28org.apache.hadoop.io.Writable%29, Writable: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/Writable.html) Or is it reading the value into memory but just not showing it to me? Thanks, Mark
Re: SequenceFile.Reader
Hi John, thanks for the reply. But I'm not asking about key memory allocation here. I'm just asking what the difference is between next(key, value) and next(key). Is the latter still reading the value in order to reach the next key, or does it read the key and then use the record size to skip to the next key? Thanks, Mark On Thu, Jun 2, 2011 at 3:49 PM, John Armstrong john.armstr...@ccri.com wrote: On Thu, 2 Jun 2011 15:43:37 -0700, Mark question markq2...@gmail.com wrote: Does anyone know if SequenceFile.next(key) is actually not reading the value into memory I think what you're confused by is something I stumbled upon quite by accident. The secret is that there is actually only ONE Key object that the RecordReader presents to you. The next() method doesn't create a new Key object (containing the new data) but actually just loads the new data into the existing Key object. The only place I've seen that you absolutely must remember these unusual semantics is when you're trying to copy keys or values for some reason, or to iterate over the Iterable of values more than once. In these cases you must make defensive copies because otherwise you'll just get a big list of copies of the same Key, containing the last Key data you saw. hth
Re: SequenceFile.Reader
Actually, I checked the source code of Reader and it turns it reads the value into a buffer but only returns the key to the user :( how is this different than : Writable value = new Writable(); reader.next(key,value) !!! both are using the same object for multiple reads. I was hoping next(key) would skip reading value from disk. Mark On Thu, Jun 2, 2011 at 6:20 PM, Mark question markq2...@gmail.com wrote: Hi John, thanks for the reply. But I'm not asking about the key memory allocation here. I'm just saying what's the difference between: Next(key,value) and Next(key) . Is the later one still reading the value of the key to reach the next key? or does it read the key then using the recordSize skips to the next key? Thanks, Mark On Thu, Jun 2, 2011 at 3:49 PM, John Armstrong john.armstr...@ccri.comwrote: On Thu, 2 Jun 2011 15:43:37 -0700, Mark question markq2...@gmail.com wrote: Does anyone knows if : SequenceFile.next(key) is actually not reading value into memory I think what you're confused by is something I stumbled upon quite by accident. The secret is that there is actually only ONE Key object that the RecordReader presents to you. The next() method doesn't create a new Key object (containing the new data) but actually just loads the new data into the existing Key object. The only place I've seen that you absolutely must remember these unusual semantics is when you're trying to copy keys or values for some reason, or to iterate over the Iterable of values more than once. In these cases you must make defensive copies because otherwise you'll just git a big list of copies of the same Key, containing the last Key data you saw. hth
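To summarize the two call shapes compared in this thread, a minimal sketch; the key/value types, the fs/conf variables and the path are assumptions, and the two loops are shown back to back only to illustrate the signatures (each would normally run over its own reader):

SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("/data/part-00000"), conf);
Text key = new Text();
Text value = new Text();

// Variant 1: deserialize both key and value.
while (reader.next(key, value)) { /* ... */ }

// Variant 2: only the key is handed back; as noted above, the value bytes
// are still read off the stream internally, so the disk I/O is comparable.
while (reader.next(key)) { /* ... */ }

reader.close();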
UI not working
Hi, My UI for Hadoop 0.20.2 on a single machine is suddenly giving the following errors for the NN and JT web pages respectively: HTTP ERROR: 404 /dfshealth.jsp RequestURI=/dfshealth.jsp Powered by Jetty:// http://jetty.mortbay.org/ HTTP ERROR: 503 SERVICE_UNAVAILABLE RequestURI=/jobtracker.jsp Powered by Jetty:// http://jetty.mortbay.org/ The only thing I can think of is that I also installed version 0.21.0, but had problems with it so I shut it off and went back to 0.20.2. When I check the 0.20.2 system with 'fsck' everything looks fine and jobs work OK. Please let me know how to fix this. Thanks, Mark
Increase node-mappers capacity in single node
Hi, I tried changing mapreduce.job.maps to be more than 2 , but since I'm running in pseudo distributed mode, JobTracker is local and hence this property is not changed. I'm running on a 12 core machine and would like to make use of that ... Is there a way to trick Hadoop? I also tried using my virtual machine name instead of localhost, but no luck. Please help, Thanks, Mark
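In case it's useful to someone hitting the same wall: per-node concurrency is normally controlled by the TaskTracker slot settings rather than mapreduce.job.maps. A sketch of mapred-site.xml for a 12-core box (the exact numbers are just an example, and the TaskTracker needs a restart to pick them up):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>10</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>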
Re: How to copy over using dfs
I don't think so, because I read somewhere that this is to ensure the safety of the produced data. Hence Hadoop forces you to do it this way so you know exactly what is happening. Mark On Fri, May 27, 2011 at 12:28 PM, Mohit Anchlia mohitanch...@gmail.com wrote: If I have to overwrite a file I generally use hadoop dfs -rm file hadoop dfs -copyFromLocal or -put file Is there a command to overwrite/replace the file instead of doing rm first?
Re: web site doc link broken
I also got the following from learn about : Not Found The requested URL /common/docs/stable/ was not found on this server. -- Apache/2.3.8 (Unix) mod_ssl/2.3.8 OpenSSL/1.0.0c Server at hadoop.apache.orgPort 80 Mark On Fri, May 27, 2011 at 8:03 AM, Harsh J ha...@cloudera.com wrote: Am not sure if someone's already fixed this, but I head to the first link and click Learn About, and it gets redirected to the current/ just fine. There's only one such link on the page as well. On Fri, May 27, 2011 at 3:42 AM, Lee Fisher blib...@gmail.com wrote: Th Hadoop Common home page: http://hadoop.apache.org/common/ has a broken link (Learn About) to the docs. It tries to use: http://hadoop.apache.org/common/docs/stable/ which doesn't exist (404). It should probably be: http://hadoop.apache.org/common/docs/current/ Or, someone has deleted the stable docs, which I can't help you with. :-) Thanks. -- Harsh J
Re: Sorting ...
Well, I want something like TeraSort but for sequenceFiles instead of Lines in Text. My goal is efficiency and I'm currently working with Hadoop only. Thanks for your suggestions, Mark On Thu, May 26, 2011 at 8:34 AM, Robert Evans ev...@yahoo-inc.com wrote: Also if you want something that is fairly fast and a lot less dev work to get going you might want to look at pig. They can do a distributed order by that is fairly good. --Bobby Evans On 5/26/11 2:45 AM, Luca Pireddu pire...@crs4.it wrote: On May 25, 2011 22:15:50 Mark question wrote: I'm using SequenceFileInputFormat, but then what to write in my mappers? each mapper is taking a split from the SequenceInputFile then sort its split ?! I don't want that.. Thanks, Mark On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu pire...@crs4.it wrote: On May 25, 2011 01:43:22 Mark question wrote: Thanks Luca, but what other way to sort a directory of sequence files? I don't plan to write a sorting algorithm in mappers/reducers, but hoping to use the sequenceFile.sorter instead. Any ideas? Mark If you want to achieve a global sort, then look at how TeraSort does it: http://sortbenchmark.org/YahooHadoop.pdf The idea is to partition the data so that all keys in part[i] are all keys in part[i+1]. Each partition in individually sorted, so to read the data in globally sorted order you simply have to traverse it starting from the first partition and working your way to the last one. If your keys are already what you want to sort by, then you don't even need a mapper (just use the default identity map). -- Luca Pireddu CRS4 - Distributed Computing Group Loc. Pixina Manna Edificio 1 Pula 09010 (CA), Italy Tel: +39 0709250452
Re: one question about hadoop
web.xml is in: hadoop-releaseNo/webapps/job/WEB-INF/web.xml Mark On Thu, May 26, 2011 at 1:29 AM, Luke Lu l...@vicaya.com wrote: Hadoop embeds jetty directly into hadoop servers with the org.apache.hadoop.http.HttpServer class for servlets. For jsp, web.xml is auto generated with the jasper compiler during the build phase. The new web framework for mapreduce 2.0 (MAPREDUCE-2399) wraps the hadoop HttpServer and doesn't need web.xml and/or jsp support either. On Thu, May 26, 2011 at 12:14 AM, 王晓峰 sanlang2...@gmail.com wrote: hi,admin: I'm a fresh fish from China. I want to know how the Jetty combines with the hadoop. I can't find the file named web.xml that should exist in usual system that combine with Jetty. I'll be very happy to receive your answer. If you have any question,please feel free to contract with me. Best Regards, Jack
Re: I can't see this email ... So to clarify ..
I do ... $ ls -l /cs/student/mark/tmp/hodhod total 4 drwxr-xr-x 3 mark grad 4096 May 24 21:10 dfs and .. $ ls -l /tmp/hadoop-mark total 4 drwxr-xr-x 3 mark grad 4096 May 24 21:10 dfs $ ls -l /tmp/hadoop-maha/dfs/name/only name is created here no data Thanks agian, Mark On Tue, May 24, 2011 at 9:26 PM, Mapred Learn mapred.le...@gmail.comwrote: Do u Hv right permissions on the new dirs ? Try stopping n starting cluster... -JJ On May 24, 2011, at 9:13 PM, Mark question markq2...@gmail.com wrote: Well, you're right ... moving it to hdfs-site.xml had an effect at least. But now I'm in the NameSpace incompatable error: WARN org.apache.hadoop.hdfs.server.common.Util: Path /tmp/hadoop-mark/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration. java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-mark/dfs/data My configuration for this part in hdfs-site.xml: configuration property namedfs.data.dir/name value/tmp/hadoop-mark/dfs/data/value /property property namedfs.name.dir/name value/tmp/hadoop-mark/dfs/name/value /property property namehadoop.tmp.dir/name value/cs/student/mark/tmp/hodhod/value /property /configuration The reason why I want to change hadoop.tmp.dir is because the student quota under /tmp is small so I wanted to mount on /cs/student instead for hadoop.tmp.dir. Thanks, Mark On Tue, May 24, 2011 at 7:25 PM, Joey Echeverria j...@cloudera.com wrote: Try moving the the configuration to hdfs-site.xml. One word of warning, if you use /tmp to store your HDFS data, you risk data loss. On many operating systems, files and directories in /tmp are automatically deleted. -Joey On Tue, May 24, 2011 at 10:22 PM, Mark question markq2...@gmail.com wrote: Hi guys, I'm using an NFS cluster consisting of 30 machines, but only specified 3 of the nodes to be my hadoop cluster. So my problem is this. Datanode won't start in one of the nodes because of the following error: org.apache.hadoop.hdfs.server. common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked I think it's because of the NFS property which allows one node to lock it then the second node can't lock it. So I had to change the following configuration: dfs.data.dir to be /tmp/hadoop-user/dfs/data But this configuration is overwritten by ${hadoop.tmp.dir}/dfs/data where my hadoop.tmp.dir = /cs/student/mark/tmp as you might guess from above. Where is this configuration over-written ? I thought my core-site.xml has the final configuration values. Thanks, Mark -- Joseph Echeverria Cloudera, Inc. 443.305.9434
Re: Sorting ...
I'm using SequenceFileInputFormat, but then what to write in my mappers? each mapper is taking a split from the SequenceInputFile then sort its split ?! I don't want that.. Thanks, Mark On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu pire...@crs4.it wrote: On May 25, 2011 01:43:22 Mark question wrote: Thanks Luca, but what other way to sort a directory of sequence files? I don't plan to write a sorting algorithm in mappers/reducers, but hoping to use the sequenceFile.sorter instead. Any ideas? Mark Maybe this class can help? org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat With it you should be able to read (key,value) records from your sequence files and then do whatever you need with them. -- Luca Pireddu CRS4 - Distributed Computing Group Loc. Pixina Manna Edificio 1 Pula 09010 (CA), Italy Tel: +39 0709250452
UI not working ..
Hi, My UI for Hadoop 0.20.2 on a single machine is suddenly giving the following errors for the NN and JT web pages respectively: HTTP ERROR: 404 /dfshealth.jsp RequestURI=/dfshealth.jsp Powered by Jetty:// http://jetty.mortbay.org/ HTTP ERROR: 503 SERVICE_UNAVAILABLE RequestURI=/jobtracker.jsp Powered by Jetty:// http://jetty.mortbay.org/ The only thing I can think of is that I also installed version 0.21.0, but had problems so I shut it off and went back to 0.20.2. When I check the system using 'fsck' everything looks fine though. Let me know what you think. Thanks, Mark
Re: get name of file in mapper output directory
thanks both for the comments, but even though finally, I managed to get the output file of the current mapper, I couldn't use it because apparently, mappers uses _temporary file while it's in process. So in Mapper.close , the file for eg. part-0 which it wrote to, does not exists yet. There has to be another way to get the produced file. I need to sort it immediately within mappers. Again, your thoughts are really helpful ! Mark On Mon, May 23, 2011 at 5:51 AM, Luca Pireddu pire...@crs4.it wrote: The path is defined by the FileOutputFormat in use. In particular, I think this function is responsible: http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext , java.lang.String) It should give you the file path before all tasks have completed and the output is committed to the final output path. Luca On May 23, 2011 14:42:04 Joey Echeverria wrote: Hi Mark, FYI, I'm moving the discussion over to mapreduce-u...@hadoop.apache.org since your question is specific to MapReduce. You can derive the output name from the TaskAttemptID which you can get by calling getTaskAttemptID() on the context passed to your cleanup() funciton. The task attempt id will look like this: attempt_200707121733_0003_m_05_0 You're interested in the m_05 part, This gets translated into the output file name part-m-5. -Joey On Sat, May 21, 2011 at 8:03 PM, Mark question markq2...@gmail.com wrote: Hi, I'm running a job with maps only and I want by end of each map (ie.Close() function) to open the file that the current map has wrote using its output.collector. I know job.getWorkingDirectory() would give me the parent path of the file written, but how to get the full path or the name (ie. part-0 or part-1). Thanks, Mark -- Luca Pireddu CRS4 - Distributed Computing Group Loc. Pixina Manna Edificio 1 Pula 09010 (CA), Italy Tel: +39 0709250452
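If it helps a later reader, here is a sketch of deriving the task's own part file inside the mapper along the lines of Joey's suggestion, adapted to the old (mapred) API where map-only output files are named part-NNNNN; the class name PartNameMapper is made up. Beware that, as Mark found, the file lives under the _temporary attempt directory while the task runs and may not be fully flushed until the task's RecordWriter is closed:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TaskAttemptID;

public class PartNameMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private JobConf job;
  private String partName;

  @Override
  public void configure(JobConf job) {
    this.job = job;
    TaskAttemptID attempt = TaskAttemptID.forName(job.get("mapred.task.id"));
    partName = "part-" + String.format("%05d", attempt.getTaskID().getId());  // old-API naming
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text(partName), value);
  }

  @Override
  public void close() throws IOException {
    Path workDir = FileOutputFormat.getWorkOutputPath(job);   // .../_temporary/_attempt_...
    Path myPart = new Path(workDir, partName);
    // Reading myPart here may see an incomplete file; the RecordWriter is not
    // closed until after close() returns, so sorting is safer in a follow-up job.
  }
}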
Re: Sorting ...
Thanks Luca, but what other way to sort a directory of sequence files? I don't plan to write a sorting algorithm in mappers/reducers, but hoping to use the sequenceFile.sorter instead. Any ideas? Mark On Mon, May 23, 2011 at 12:33 AM, Luca Pireddu pire...@crs4.it wrote: On May 22, 2011 03:21:53 Mark question wrote: I'm trying to sort Sequence files using the Hadoop-Example TeraSort. But after taking a couple of minutes .. output is empty. snip I'm trying to find what the input format for the TeraSort is, but it is not specified. Thanks for any thought, Mark Terasort sorts lines of text. The InputFormat (for version 0.20.2) is in hadoop-0.20.2/src/examples/org/apache/hadoop/examples/terasort/TeraInputFormat.java The documentation at the top of the class says An input format that reads the first 10 characters of each line as the key and the rest of the line as the value. HTH -- Luca Pireddu CRS4 - Distributed Computing Group Loc. Pixina Manna Edificio 1 Pula 09010 (CA), Italy Tel: +39 0709250452
Cannot lock storage, directory is already locked
Hi guys, I'm using an NFS cluster consisting of 30 machines, but only specified 3 of the nodes to be my hadoop cluster. So my problem is this. Datanode won't start in one of the nodes because of the following error: org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked I think it's because of the NFS property which allows one node to lock it then the second node can't lock it. Any ideas on how to solve this error? Thanks, Mark
I can't see this email ... So to clarify ..
Hi guys, I'm using an NFS cluster consisting of 30 machines, but only specified 3 of the nodes to be my hadoop cluster. So my problem is this. Datanode won't start in one of the nodes because of the following error: org.apache.hadoop.hdfs.server. common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked I think it's because of the NFS property which allows one node to lock it then the second node can't lock it. So I had to change the following configuration: dfs.data.dir to be /tmp/hadoop-user/dfs/data But this configuration is overwritten by ${hadoop.tmp.dir}/dfs/data where my hadoop.tmp.dir = /cs/student/mark/tmp as you might guess from above. Where is this configuration over-written ? I thought my core-site.xml has the final configuration values. Thanks, Mark
Re: I can't see this email ... So to clarify ..
Well, you're right ... moving it to hdfs-site.xml had an effect at least. But now I'm in the NameSpace incompatable error: WARN org.apache.hadoop.hdfs.server.common.Util: Path /tmp/hadoop-mark/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration. java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-maha/dfs/data My configuration for this part in hdfs-site.xml: configuration property namedfs.data.dir/name value/tmp/hadoop-mark/dfs/data/value /property property namedfs.name.dir/name value/tmp/hadoop-mark/dfs/name/value /property property namehadoop.tmp.dir/name value/cs/student/mark/tmp/hodhod/value /property /configuration The reason why I want to change hadoop.tmp.dir is because the student quota under /tmp is small so I wanted to mount on /cs/student instead for hadoop.tmp.dir. Thanks, Mark On Tue, May 24, 2011 at 7:25 PM, Joey Echeverria j...@cloudera.com wrote: Try moving the the configuration to hdfs-site.xml. One word of warning, if you use /tmp to store your HDFS data, you risk data loss. On many operating systems, files and directories in /tmp are automatically deleted. -Joey On Tue, May 24, 2011 at 10:22 PM, Mark question markq2...@gmail.com wrote: Hi guys, I'm using an NFS cluster consisting of 30 machines, but only specified 3 of the nodes to be my hadoop cluster. So my problem is this. Datanode won't start in one of the nodes because of the following error: org.apache.hadoop.hdfs.server. common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked I think it's because of the NFS property which allows one node to lock it then the second node can't lock it. So I had to change the following configuration: dfs.data.dir to be /tmp/hadoop-user/dfs/data But this configuration is overwritten by ${hadoop.tmp.dir}/dfs/data where my hadoop.tmp.dir = /cs/student/mark/tmp as you might guess from above. Where is this configuration over-written ? I thought my core-site.xml has the final configuration values. Thanks, Mark -- Joseph Echeverria Cloudera, Inc. 443.305.9434
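On the Incompatible namespaceIDs part: that usually means the datanode's storage directory still carries the namespaceID of a previously formatted namenode. Since this is a single-node setup, one common way out, assuming the HDFS data is disposable (otherwise edit the namespaceID in dfs/data/current/VERSION to match the namenode instead), is roughly:

stop-dfs.sh
rm -rf /tmp/hadoop-mark/dfs/data     # wipe the datanode dir carrying the stale namespaceID
start-dfs.sh                         # the datanode re-registers and recreates the directory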
I didn't see my email sent yesterday ... So here is the question again ..
Hi, I'm running a job with maps only, and at the end of each map (i.e. in its close() function) I want to open the file that the current map has written through its output collector. I know job.getWorkingDirectory() would give me the parent path of the file written, but how do I get the full path or the name of the file this mapper has been assigned (i.e. part-0 or part-1)? Thanks, Mark
Re: How hadoop parse input files into (Key,Value) pairs ??
The case you're talking about is when you use FileInputFormat ... So usually the InputFormat interface is the one responsible for that. TextInputFormat uses a LineRecordReader, which takes your text file and assigns the key to be the byte offset within the file and the value to be the line (up to the '\n'). If you want to use other InputFormats, check their APIs and pick what is suitable for you. In my case, I'm hooked on SequenceFileInputFormat, where my input files are key,value records written by a regular Java program (or parser). Then my Hadoop job will see the keys and values that I wrote. I hope this helps a little, Mark On Thu, May 5, 2011 at 4:31 AM, praveenesh kumar praveen...@gmail.com wrote: Hi, As we know, a hadoop mapper takes input as (Key,Value) pairs and generates intermediate (Key,Value) pairs, and usually we give the input to our Mapper as a text file. How does hadoop understand this and parse our input text file into (Key,Value) pairs? Usually our mapper looks like -- public void map(LongWritable key, Text value, OutputCollector<Text, Text> outputCollector, Reporter reporter) throws IOException { String word = value.toString(); //Some lines of code } So if I pass any text file as input, it is taking every line as the VALUE to the Mapper, on which I will do some processing and put it to the OutputCollector. But how did hadoop parse my text file into (Key,Value) pairs, and how can we tell hadoop what (key,value) it should give to the mapper? Thanks.
get name of file in mapper output directory
Hi, I'm running a job with maps only, and at the end of each map (i.e. in the close() function) I want to open the file that the current map has written through its output collector. I know job.getWorkingDirectory() would give me the parent path of the file written, but how do I get the full path or the name (i.e. part-0 or part-1)? Thanks, Mark
Sorting ...
I'm trying to sort Sequence files using the Hadoop-Example TeraSort. But after taking a couple of minutes .. output is empty. HDFS has the following Sequence files: -rw-r--r-- 1 Hadoop supergroup 196113760 2011-05-21 12:16 /user/Hadoop/out/part-0 -rw-r--r-- 1 Hadoop supergroup 250935096 2011-05-21 12:16 /user/Hadoop/out/part-1 -rw-r--r-- 1 Hadoop supergroup 262943648 2011-05-21 12:17 /user/Hadoop/out/part-2 -rw-r--r-- 1 Hadoop supergroup 114888492 2011-05-21 12:17 /user/Hadoop/out/part-3 After running: hadoop jar hadoop-mapred-examples-0.21.0.jar terasort out sorted Error is: 11/05/21 18:13:12 INFO mapreduce.Job: map 74% reduce 20% 11/05/21 18:13:14 INFO mapreduce.Job: Task Id : attempt_201105202144_0039_m_09_0, Status : FAILED java.io.EOFException: read past eof I'm trying to find what the input format for the TeraSort is, but it is not specified. Thanks for any thought, Mark
Re: current line number as key?
What if you run a MapReduce program to generate a Sequence File from your text file where key is the line number and value is the whole line, then for the second job, the splits are done record wise hence, each mapper will be getting a split/block of records [lineNumberline] ~Cheers, Mark On Wed, May 18, 2011 at 12:18 PM, Robert Evans ev...@yahoo-inc.com wrote: You are correct, that there is no easy and efficient way to do this. You could create a new InputFormat that derives from FileInputFormat that makes it so the files do not split, and then have a RecordReader that keeps track of line numbers. But then each file is read by only one mapper. Alternatively you could assume that the split is going to be done deterministically and do two passes one, where you count the number of lines in each partition, and a second that then assigns the lines based off of the output from the first. But that requires two map passes. --Bobby Evans On 5/18/11 1:53 PM, Alexandra Anghelescu axanghele...@gmail.com wrote: Hi, It is hard to pick up certain lines of a text file - globally I mean. Remember that the file is split according to its size (byte boundries) not lines.,, so, it is possible to keep track of the lines inside a split, but globally for the whole file, assuming it is split among map tasks... i don't think it is possible.. I am new to hadoop, but that is my take on it. Alexandra On Wed, May 18, 2011 at 2:41 PM, bnonymous libei.t...@gmail.com wrote: Hello, I'm trying to pick up certain lines of a text file. (say 1st, 110th line of a file with 10^10 lines). I need a InputFormat which gives the Mapper line number as the key. I tried to implement RecordReader, but I can't get line information from InputSplit. Any solution to this??? Thanks in advance!!! -- View this message in context: http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
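A rough sketch of the first job in the idea above, for the old API: read the text with TextInputFormat, funnel everything through a single reducer so the byte offsets arrive globally sorted, and have the reducer renumber lines into a SequenceFile. This assumes a single input file (with several input files the per-file offsets interleave); the class and path names are placeholders:

JobConf conf = new JobConf(LineNumberJob.class);      // placeholder driver class
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
conf.setMapperClass(IdentityMapper.class);            // emits (byte offset, line) unchanged
conf.setReducerClass(RenumberReducer.class);
conf.setNumReduceTasks(1);                            // one reducer => one global order
conf.setOutputKeyClass(LongWritable.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf, new Path("/input/file.txt"));
FileOutputFormat.setOutputPath(conf, new Path("/numbered"));

public static class RenumberReducer extends MapReduceBase
    implements Reducer<LongWritable, Text, LongWritable, Text> {
  private long lineNo = 0;
  public void reduce(LongWritable offset, Iterator<Text> lines,
      OutputCollector<LongWritable, Text> out, Reporter reporter) throws IOException {
    while (lines.hasNext()) {
      out.collect(new LongWritable(++lineNo), lines.next());   // key = global line number
    }
  }
}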
Re: outputCollector vs. Localfile
I thought it was, because of FileBytesWritten counter. Thanks for the clarification. Mark On Fri, May 20, 2011 at 4:23 AM, Harsh J ha...@cloudera.com wrote: Mark, On Fri, May 20, 2011 at 10:17 AM, Mark question markq2...@gmail.com wrote: This is puzzling me ... With a mapper producing output of size ~ 400 MB ... which one is supposed to be faster? 1) output collector: which will write to local file then copy to HDFS since I don't have reducers. A regular map-only job does not write to the local FS, it writes to the HDFS directly (i.e., a local DN if one is found). -- Harsh J
outputCollector vs. Localfile
This is puzzling me ... With a mapper producing output of size ~ 400 MB ... which one is supposed to be faster? 1) output collector: which will write to local file then copy to HDFS since I don't have reducers. 2) Open a unique local file inside mapred.local.dir for each mapper. I thought of (2), but (1) was actually faster ... can someone explains ? Thanks, Mark
Hadoop tool-kit for monitoring
Hi, I need to use hadoop-toolkit for monitoring. So I followed http://code.google.com/p/hadoop-toolkit/source/checkout and applied the patch in my hadoop-0.20.2 directory as: patch -p0 < patch.20.2 and set the property "mapred.performance.diagnose" to true in mapred-site.xml, but I don't see the memory information that is supposed to be shown as in http://code.google.com/p/hadoop-toolkit/wiki/HadoopPerformanceMonitoring I then installed hadoop-0.21.0 and set only the same property as above, but I still don't see the requested monitoring info ... What am I doing wrong? I appreciate any thoughts, Mark
Again ... Hadoop tool-kit for monitoring
Sorry for the spam, but I didn't see my previous email yet. I need to use hadoop-toolkit for monitoring. So I followed http://code.google.com/p/hadoop-toolkit/source/checkout and applied the patch in my hadoop-0.20.2 directory as: patch -p0 < patch.20.2 and set the property "mapred.performance.diagnose" to true in mapred-site.xml, but I don't see the memory information that is supposed to be shown as in http://code.google.com/p/hadoop-toolkit/wiki/HadoopPerformanceMonitoring I then installed hadoop-0.21.0 and set only the same property as above, but I still don't see the requested monitoring info ... What am I doing wrong? I appreciate any thoughts, Mark
Re: Hadoop tool-kit for monitoring
So what other memory consumption tools do you suggest? I don't want to do it manually and dump statistics into file because IO will affect performance too. Thanks, Mark On Tue, May 17, 2011 at 2:58 PM, Allen Wittenauer a...@apache.org wrote: On May 17, 2011, at 1:01 PM, Mark question wrote: Hi I need to use hadoop-tool-kit for monitoring. So I followed http://code.google.com/p/hadoop-toolkit/source/checkout and applied the patch in my hadoop.20.2 directory as: patch -p0 patch.20.2 Looking at the code, be aware this is going to give incorrect results/suggestions for certain stats it generates when multiple jobs are running. It also seems to lack the algorithm should be rewritten and the data was loaded incorrectly suggestions, which is usually the proper answer for perf problems 80% of the time.
Re: Hadoop tool-kit for monitoring
Thanks for the inputs, but I'm running on a university cluster, not my own and hence are the assumptions such as each task(mapper/reduer) will take 1 GB valid ? So I guess to tune performance I should try running the job multiple times and rely on execution time as an indicator of success. Thanks again, Mark On Tue, May 17, 2011 at 3:16 PM, Konstantin Boudnik c...@apache.org wrote: Also, it seems like Ganglia would be very well complemented by Nagios to allow you to monitor an overall health of your cluster. -- Take care, Konstantin (Cos) Boudnik 2CAC 8312 4870 D885 8616 6115 220F 6980 1F27 E622 Disclaimer: Opinions expressed in this email are those of the author, and do not necessarily represent the views of any company the author might be affiliated with at the moment of writing. On Tue, May 17, 2011 at 15:15, Allen Wittenauer a...@apache.org wrote: On May 17, 2011, at 3:11 PM, Mark question wrote: So what other memory consumption tools do you suggest? I don't want to do it manually and dump statistics into file because IO will affect performance too. We watch memory with Ganglia. We also tune our systems such that a task will only take X amount. In other words, given an 8gb RAM: 1gb for the OS 1gb for the TT and DN 6gb for all tasks if we assume each task will take max 1gb, then we end up with 3 maps and 3 reducers. Keep in mind that the mem consumed is more than just JVM heap size.
Re: How do you run HPROF locally?
I usually do this setting inside my java program (in the run function) as follows: JobConf conf = new JobConf(this.getConf(), My.class); conf.set("mapred.task.profile", "true"); then I'll see some output files in that same working directory. Hope that helps, Mark On Tue, May 17, 2011 at 4:07 PM, W.P. McNeill bill...@gmail.com wrote: I am running a Hadoop Java program in local single-JVM mode via an IDE (IntelliJ). I want to do performance profiling of it. Following the instructions in chapter 5 of Hadoop: The Definitive Guide, I added the following properties to my job configuration file. <property> <name>mapred.task.profile</name> <value>true</value> </property> <property> <name>mapred.task.profile.params</name> <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value> </property> <property> <name>mapred.task.profile.maps</name> <value>0-</value> </property> <property> <name>mapred.task.profile.reduces</name> <value>0-</value> </property> With these properties, the job runs as before, but I don't see any profiler output. I also tried simply setting <property> <name>mapred.child.java.opts</name> <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value> </property> Again, no profiler output. I know I have HPROF installed because running java -agentlib:hprof=help at the command prompt produces a result. Is it possible to run HPROF on a local Hadoop job? Am I doing something wrong?
Re: How do you run HPROF locally?
or conf.setBoolean(mapred.task.profile, true); Mark On Tue, May 17, 2011 at 4:49 PM, Mark question markq2...@gmail.com wrote: I usually do this setting inside my java program (in run function) as follows: JobConf conf = new JobConf(this.getConf(),My.class); conf.set(*mapred*.task.*profile*, true); then I'll see some output files in that same working directory. Hope that helps, Mark On Tue, May 17, 2011 at 4:07 PM, W.P. McNeill bill...@gmail.com wrote: I am running a Hadoop Java program in local single-JVM mode via an IDE (IntelliJ). I want to do performance profiling of it. Following the instructions in chapter 5 of *Hadoop: the Definitive Guide*, I added the following properties to my job configuration file. property namemapred.task.profile/name valuetrue/value /property property namemapred.task.profile.params/name value-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s/value /property property namemapred.task.profile.maps/name value0-/value /property property namemapred.task.profile.reduces/name value0-/value /property With these properties, the job runs as before, but I don't see any profiler output. I also tried simply setting property namemapred.child.java.opts/name value-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s/value /property Again, no profiler output. I know I have HPROF installed because running java -agentlib:hprof=help at the command prompt produces a result. Is is possible to run HPROF on a local Hadoop job? Am I doing something wrong?
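Put together, the programmatic equivalent of the book's recipe would look roughly like this (the HPROF options string is the one quoted above); whether the local single-JVM runner honors mapred.task.profile I can't say for certain, since there is no separate child JVM in that mode:

conf.setBoolean("mapred.task.profile", true);
conf.set("mapred.task.profile.params",
    "-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");
conf.set("mapred.task.profile.maps", "0-");
conf.set("mapred.task.profile.reduces", "0-");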
Can Mapper get paths of inputSplits ?
Hi I'm using FileInputFormat which will split files logically according to their sizes into splits. Can the mapper get a pointer to these splits? and know which split it is assigned ? I tried looking up the Reporter class and see how is it printing the logical splits on the UI for each mapper .. but it's an interface. Eg. Mapper1: is assigned the logical split hdfs://localhost:9000/user/Hadoop/input:23+24 Mapper2: is assigned the logical split hdfs://localhost:9000/user/Hadoop/input:0+23 Then inside map, I want to ask what are the logical splits and get the upper two strings and know which one my current mapper is assigned. Thanks, Mark
I can't see my messages immediately, and sometimes they don't even arrive. Why?
Re: Can Mapper get paths of inputSplits ?
Thanks for the reply Owen, I only knew about map.input.file. So there is no way I can see the other possible splits (start+length)? like some function that returns strings of map.input.file and map.input.offset of the other mappers ? Thanks, Mark On Thu, May 12, 2011 at 9:08 PM, Owen O'Malley omal...@apache.org wrote: On Thu, May 12, 2011 at 8:59 PM, Mark question markq2...@gmail.com wrote: Hi I'm using FileInputFormat which will split files logically according to their sizes into splits. Can the mapper get a pointer to these splits? and know which split it is assigned ? Look at http://hadoop.apache.org/common/docs/r0.20.203.0/mapred_tutorial.html#Task+JVM+Reuse In particular, map.input.file and map.input.offset are the configuration parameters that you want. -- Owen
Re: how to get user-specified Job name from hadoop for running jobs?
you mean by user-specified is when you write your job name via JobConf.setJobName(myTask) ? Then using the same object you can recall your name as follows: JobConf conf ; conf.getJobName() ; ~Cheers Mark On Tue, May 10, 2011 at 10:16 AM, Mark Zand mz...@basistech.com wrote: While I can get JobStatus with this: JobClient client = new JobClient(new JobConf(conf)); JobStatus[] jobStatuses = client.getAllJobs(); I don't see any way to get user-specified Job name. Please help. Thanks.
Re: Can Mapper get paths of inputSplits ?
Thanks again Owen, hopefully one last question: which class fills in map.input.file and map.input.offset (i.e. which class sets them), so that I can extend it with a function that returns these strings? Thanks, Mark On Thu, May 12, 2011 at 10:07 PM, Owen O'Malley omal...@apache.org wrote: On Thu, May 12, 2011 at 9:23 PM, Mark question markq2...@gmail.com wrote: So there is no way I can see the other possible splits (start+length)? like some function that returns strings of map.input.file and map.input.offset of the other mappers ? No, there isn't any way to do it using the public API. The only way would be to look under the covers and read the split file (job.split). -- Owen
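For completeness, reading those per-task values inside the mapper looks roughly like this; note that the property names I believe are actually set are map.input.file, map.input.start and map.input.length (map.input.offset above may be a naming slip or a version difference), and they are only populated for file-based splits:

@Override
public void configure(JobConf job) {
  String file = job.get("map.input.file");            // path of this task's split
  long start  = job.getLong("map.input.start", -1);   // byte offset where the split begins
  long length = job.getLong("map.input.length", -1);  // split length in bytes
}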
Space needed to user SequenceFile.Sorter
I don't know why I can't see my emails immediately sent to the group ... anyways, I'm sorting a sequenceFile using it's sorter on my local filesystem. The inputFile size is 1937690478 bytes. but after 14 minutes of sorting.. I get : TEST SORTING .. java.io.FileNotFoundException: File does not exist: /usr/mark/tmp/mapred/local/SortedOutput.0 at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457) at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:676) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417) at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1353) at org.apache.hadoop.io.SequenceFile$Sorter.cloneFileAttributes(SequenceFile.java:2663) at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:2712) at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2285) at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2324) at CrossPartitionSimilarity.TestSorter(CrossPartitionSimilarity.java:164) at CrossPartitionSimilarity.main(CrossPartitionSimilarity.java:47) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Yet, the file is still there: wc -c SortedOutput.0 --- 1918661230 ../tmp/mapred/local/SortedOutput.0 and if it is because of space, I checked and it can hold up to 209 GB. So, my question are there restrictions on some JVM configurations that I should take care of ? Thank you, Maha
Reading from File
Hi, My mapper opens a file and read records using next() . However, I want to stop reading if there is no memory available. What confuses me here is that even though I'm reading record by record with next(), hadoop actually reads them in dfs.block.size. So, I have two questions: 1. Is it true that even if I set dfs.block.size to 512 MB, then at least one block is loaded in memory for mapper to process (part of inputSplit)? 2. How can I read multiple records from a sequenceFile at once and will it make a difference ? Thanks, Mark
Re: Sequence.Sorter Performance
Thanks Owen ! Mark On Mon, Apr 25, 2011 at 11:43 AM, Owen O'Malley omal...@apache.org wrote: The SequenceFile sorter is ok. It used to be the sort used in the shuffle. *grin* Make sure to set io.sort.factor and io.sort.mb to appropriate values for your hardware. I'd usually use io.sort.factor as 25 * drives and io.sort.mb is the amount of memory you can allocate to the sorting. -- Owen
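For the archives, a sketch of the sorter with Owen's two knobs applied up front; the key/value classes, paths and numbers are assumptions, not taken from the thread:

conf.setInt("io.sort.factor", 100);   // rule of thumb above: ~25 * number of drives
conf.setInt("io.sort.mb", 512);       // memory (in MB) the sorter may use

FileSystem fs = FileSystem.get(conf);
SequenceFile.Sorter sorter =
    new SequenceFile.Sorter(fs, Text.class, Text.class, conf);
sorter.sort(new Path[] { new Path("/user/mark/seqfile") },
    new Path("/user/mark/seqfile.sorted"), false);   // false = keep the input file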
SequenceFile.Sorter performance
Hi guys, I'm trying to sort a 2.5 GB sequence file in one mapper using its implemented Sort function, but it's taking long that the map is killed for not reporting . I would increase the default time to get reports from the mapper, but I'll do this only if sorting using SequenceFile.sorter is known to be optimal ... Any one knows ? Or other suggested options? Thanks, Mark