Re: Creating Sequence File in C++
On Fri, Nov 27, 2009 at 7:07 PM, Saptarshi Guha wrote:

> Let my Key-Value be something like BinaryWritables (my own class, but
> something like this). Is there a way to create the Sequence File
> composed of several such key - values, without using Java?

There is not a C++ implementation of SequenceFiles. (If you write one, please consider contributing it back.)

A different approach would be to write a map-only Pipes (C++) MapReduce program that reads the data and uses SequenceFileOutputFormat for its output. The map can emit key/value pairs as std::strings containing the bytes you want to write.

-- Owen
Creating Sequence File in C++
Hello,

Let my Key-Value be something like BinaryWritables (my own class, but something like this). Is there a way to create a Sequence File composed of several such key-value pairs, without using Java?

Background: I create objects using protocol buffers; my keys and values are serialized versions of these protocol buffer messages. The Hadoop k-v pairs that are exchanged in the mapreduce job (and stored in both input and output) are these serialized messages. I would like to create sequence files directly from C++ and was curious whether there is a way to do this outside Java (and without having to use JNI). Currently, my best option is to use a mapreduce job to convert my text files to sequence files.

Thank you
Saptarshi
Re: Processing 10MB files in Hadoop
By default you get at least one task per file; if any file is bigger than a block, then that file is broken up into N tasks where each is one block long. Not sure what you mean by "properly calculate" -- as long as you have more tasks than you have cores, then you'll definitely have work for every core to do; having more, finer-grained tasks will also let nodes that get "small" tasks complete many of them while other cores are stuck with the "heavier" tasks.

If you call setNumMapTasks() with a higher number of tasks than the InputFormat creates (via the algorithm above), then it should create additional tasks by dividing files up into smaller chunks (which may be sub-block-sized).

As for where you should run your computation: I don't know that the "map" and "reduce" phases are really "optimized" for computation in any particular way. It's just a data motion thing. (At the end of the day, it's your code doing the processing on either side of the fence, which should dominate the execution time.)

If you use an identity mapper with a pseudo-random key to spray the data into a bunch of reduce partitions, then you'll get a bunch of reducers each working on a hopefully-evenly-sized slice of the data. So the map tasks will quickly read from the original source data and forward the workload along to the reducers, which do the actual heavy lifting. The cost of this approach is that you have to pay for the time taken to transfer the data from the mapper nodes to the reducer nodes and sort by key when it gets there. If you're only working with 600 MB of data, this is probably negligible.

The advantages of doing your computation in the reducers are:

1) You can directly control the number of reducer tasks and set this equal to the number of cores in your cluster.
2) You can tune your partitioning algorithm such that all reducers get roughly equal workload assignments, if there appears to be some sort of skew in the dataset.
The tradeoff is that you have to ship all the data to the reducers before computation starts, which sacrifices data locality and involves an "intermediate" data set of the same size as the input data set. If this is in the range of hundreds of GB or north, then this can be very time-consuming -- so it doesn't scale terribly well. Of course, by the time you've got several hundred GB of data to work with, your current workload imbalance issues should be moot anyway.

- Aaron

On Fri, Nov 27, 2009 at 4:33 PM, CubicDesign wrote:

> Aaron Kimball wrote:
>
> > (Note: this is a tasktracker setting, not a job setting. you'll need to
> > set this on every node, then restart the mapreduce cluster to take effect.)
>
> Ok. And here is my mistake. I set this to 16 only on the main node not also
> on data nodes. Thanks a lot!!
>
> > Of course, you need to have enough RAM to make sure that all these tasks
> > can run concurrently without swapping.
>
> No problem!
>
> > If your individual records require around a minute each to process as you
> > claimed earlier, you're nowhere near in danger of hitting that particular
> > performance bottleneck.
>
> I was thinking that if I am under the recommended value of 64MB, Hadoop
> cannot properly calculate the number of tasks.
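The "one task per file, or one per block for larger files" rule of thumb from this reply can be sketched in plain Java. This is only a rough model under stated assumptions, not Hadoop's actual split computation: the class and method names are made up for illustration, the 64 MB block size is the value mentioned in the thread, and the count of 100 files is an assumption chosen to match the 100 map tasks reported earlier.

```java
import java.util.Collections;
import java.util.List;

/** Rough model of the rule of thumb above; NOT Hadoop's real split code. */
final class MapTaskEstimate {

    /** One task per file, or one task per block when a file spans several blocks. */
    static long estimateTasks(List<Long> fileSizesBytes, long blockSizeBytes) {
        long tasks = 0;
        for (long size : fileSizesBytes) {
            long blocks = (size + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
            tasks += Math.max(1, blocks); // assumption: an empty file still counts as one task
        }
        return tasks;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical input: 100 files of 10 MB each, with 64 MB blocks.
        List<Long> sizes = Collections.nCopies(100, 10 * mb);
        System.out.println(estimateTasks(sizes, 64 * mb)); // prints 100
    }
}
```

Under this model, files smaller than a block never reduce the task count below one per file, which is why being under 64 MB does not confuse the calculation.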
Re: Processing 10MB files in Hadoop
Aaron Kimball wrote:

> (Note: this is a tasktracker setting, not a job setting. you'll need to set
> this on every node, then restart the mapreduce cluster to take effect.)

Ok. And here is my mistake. I set this to 16 only on the main node not also on data nodes. Thanks a lot!!

> Of course, you need to have enough RAM to make sure that all these tasks can
> run concurrently without swapping.

No problem!

> If your individual records require around a minute each to process as you
> claimed earlier, you're nowhere near in danger of hitting that particular
> performance bottleneck.

I was thinking that if I am under the recommended value of 64MB, Hadoop cannot properly calculate the number of tasks.
Re: Processing 10MB files in Hadoop
30000 records in 10MB files. Files can vary and the number of records also can vary.

> If the data is 10MB and you have 30k records, and it takes ~2 mins to
> process each record, I'd suggest using map to distribute the data across
> several reducers then do the actual processing on reduce.

Hmmm... Good idea. Thanks. But is 'Reduce' optimized to do the heavy part of the computation?
Re: Processing 10MB files in Hadoop
What does the data look like? You mention 30k records; is that for 10MB or for 600MB, or do you have a constant 30k records with vastly varying file sizes?

If the data is 10MB and you have 30k records, and it takes ~2 mins to process each record, I'd suggest using map to distribute the data across several reducers, then doing the actual processing on reduce.

On Fri, Nov 27, 2009 at 7:07 PM, CubicDesign wrote:

> Ok. I have set the number of maps to about 1760 (11 nodes * 16 cores/node *
> 10 as recommended by Hadoop documentation) and my job still takes several
> hours to run instead of one.
>
> Can the overhead added by Hadoop be that big? I mean I have over 30000
> small tasks (about one minute), each one starting its own JVM.
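To get a feel for how evenly "distribute the data across several reducers" works out, here is a small plain-Java simulation. It is a sketch under assumptions: the class and method names are invented, the random int stands in for whatever pseudo-random key the mapper would emit, and the partition arithmetic mirrors what Hadoop's default HashPartitioner does with a key's hash code, as far as I know. The 30000 records and 176 reducers (11 nodes * 16 cores) come from the numbers in this thread.

```java
import java.util.Random;

/** Simulates spraying records over reducers by a pseudo-random key. */
final class SprayDemo {

    /** Non-negative hash modulo the reducer count (assumed to match the default partitioner). */
    static int partitionFor(int keyHash, int numReducers) {
        return (keyHash & Integer.MAX_VALUE) % numReducers;
    }

    /** Counts how many of the given records land in each reduce partition. */
    static int[] spray(int records, int numReducers, long seed) {
        Random rng = new Random(seed);
        int[] counts = new int[numReducers];
        for (int i = 0; i < records; i++) {
            int pseudoRandomKey = rng.nextInt(); // hypothetical key emitted by the mapper
            counts[partitionFor(pseudoRandomKey, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = spray(30000, 176, 42L);
        int min = Integer.MAX_VALUE;
        int max = 0;
        for (int c : counts) {
            min = Math.min(min, c);
            max = Math.max(max, c);
        }
        System.out.println("records per reducer: min=" + min + " max=" + max);
    }
}
```

If the per-record cost varies a lot, an even record count per reducer does not guarantee an even workload, so this only addresses skew in the number of records.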
Re: Processing 10MB files in Hadoop
Ok. I have set the number of maps to about 1760 (11 nodes * 16 cores/node * 10 as recommended by Hadoop documentation) and my job still takes several hours to run instead of one.

Can the overhead added by Hadoop be that big? I mean I have over 30000 small tasks (about one minute), each one starting its own JVM.
Re: part-00000.deflate as output
Thank you, guys, for your very useful answers

Mark

On Fri, Nov 27, 2009 at 12:44 PM, Aaron Kimball wrote:

> You are always free to run with compression disabled. But in many production
> situations, space or performance concerns dictate that all data sets are
> stored compressed, so I think Tim was assuming that you might be operating
> in such an environment -- in which case, you'd only need things to appear in
> plaintext if a human operator is inspecting the output for debugging.
>
> - Aaron
>
> On Thu, Nov 26, 2009 at 4:59 PM, Mark Kerzner wrote:
>
> > It worked!
> >
> > But why is it "for testing?" I only have one job, so I need by related as
> > text, can I use this fix all the time?
> >
> > Thank you,
> > Mark
> >
> > On Thu, Nov 26, 2009 at 1:10 AM, Tim Kiefer wrote:
> >
> > > For testing purposes you can also try to disable the compression:
> > >
> > > conf.setBoolean("mapred.output.compress", false);
> > >
> > > Then you can look at the output.
> > >
> > > - tim
> > >
> > > Amogh Vasekar wrote:
> > >
> > >> Hi,
> > >> ".deflate" is the default compression codec used when parameter to
> > >> generate compressed output is true ( mapred.output.compress ).
> > >> You may set the codec to be used via mapred.output.compression.codec,
> > >> some commonly used are available in hadoop.io.compress package...
> > >>
> > >> Amogh
> > >>
> > >> On 11/26/09 11:03 AM, "Mark Kerzner" wrote:
> > >>
> > >> Hi,
> > >>
> > >> I get this part-00000.deflate instead of part-00000.
> > >>
> > >> How do I get rid of the deflate option?
> > >>
> > >> Thank you,
> > >> Mark
Re: part-00000.deflate as output
You can always do

hadoop fs -text

This will 'cat' the file for you, and decompress it if necessary.

On Thu, Nov 26, 2009 at 7:59 PM, Mark Kerzner wrote:

> It worked!
>
> But why is it "for testing?" I only have one job, so I need by related as
> text, can I use this fix all the time?
>
> Thank you,
> Mark
>
> On Thu, Nov 26, 2009 at 1:10 AM, Tim Kiefer wrote:
>
> > For testing purposes you can also try to disable the compression:
> >
> > conf.setBoolean("mapred.output.compress", false);
> >
> > Then you can look at the output.
> >
> > - tim
> >
> > Amogh Vasekar wrote:
> >
> >> Hi,
> >> ".deflate" is the default compression codec used when parameter to
> >> generate compressed output is true ( mapred.output.compress ).
> >> You may set the codec to be used via mapred.output.compression.codec,
> >> some commonly used are available in hadoop.io.compress package...
> >>
> >> Amogh
> >>
> >> On 11/26/09 11:03 AM, "Mark Kerzner" wrote:
> >>
> >> Hi,
> >>
> >> I get this part-00000.deflate instead of part-00000.
> >>
> >> How do I get rid of the deflate option?
> >>
> >> Thank you,
> >> Mark
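For anyone curious what "decompress it if necessary" amounts to for a .deflate file, here is a plain-Java round trip using only java.util.zip. It is a sketch, not how hadoop fs -text is implemented: the class and method names are invented, and it assumes that the default codec writes a zlib-format stream that InflaterInputStream can read, which is my understanding but is not stated anywhere in this thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Compresses and decompresses text with the JDK's zlib/deflate streams. */
final class DeflateRoundTrip {

    /** Produces zlib-format bytes, standing in for the contents of a .deflate file. */
    static byte[] compress(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Inflates the bytes back into text. */
    static String decompress(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] packed = compress("key\tvalue\n");
        System.out.print(decompress(packed));
    }
}
```

In practice the hadoop fs -text command is the simpler route, since it picks the codec for you.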
Re: Processing 10MB files in Hadoop
More importantly: have you told Hadoop to use all your cores?

What is mapred.tasktracker.map.tasks.maximum set to? This defaults to 2. If you've got 16 cores/node, you should set this to at least 15--16 so that all your cores are being used. You may need to set this higher, like 20, to ensure that cores aren't being starved. Measure with ganglia or top to make sure your CPU utilization is up to where you're satisfied. (Note: this is a tasktracker setting, not a job setting. You'll need to set this on every node, then restart the mapreduce cluster for it to take effect.)

Of course, you need to have enough RAM to make sure that all these tasks can run concurrently without swapping. Swapping will destroy your performance. Then again, if you bought 16-way machines, presumably you didn't cheap out in that department :)

100 tasks is not an absurd number. For large data sets (e.g., TB scale), I have seen several tens of thousands of tasks.

In general, yes, running many tasks over small files is not a good fit for Hadoop, but 100 is not "many small files" -- you might see some sort of speedup by coalescing multiple files into a single task, but when you hear about problems with processing many small files, folks are frequently referring to something like 10,000 files where each file is only a few MB, and the actual processing per record is extremely cheap. In cases like this, task startup times severely dominate actual computation time. If your individual records require around a minute each to process as you claimed earlier, you're nowhere near in danger of hitting that particular performance bottleneck.

- Aaron

On Thu, Nov 26, 2009 at 12:23 PM, CubicDesign wrote:

> > Are the record processing steps bound by a local machine resource - cpu,
> > disk io or other?
>
> Some disk I/O. Not so much compared with the CPU. Basically it is CPU
> bound. This is why each machine has 16 cores.
>
> > What I often do when I have lots of small files to handle is use the
> > NLineInputFormat,
>
> Each file contains a complete/independent set of records. I cannot mix the
> data resulting from processing two different files.
>
> -
> Ok. I think I need to re-explain my problem :)
> While running jobs on these small files, the computation time was almost 5
> times longer than expected. It looks like the job was affected by the number
> of map tasks that I have (100). I don't know which are the best parameters in
> my case (10MB files).
>
> I have zero reduce tasks.
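A back-of-the-envelope calculation shows why the per-node slot limit matters so much here. The sketch below is plain Java with invented names; it assumes every task takes about the same time and that the cluster runs tasks in full "waves", which is a simplification. The 11 nodes, 16 cores per node and 1760 map tasks are the figures quoted elsewhere in this thread, and 2 is the default for mapred.tasktracker.map.tasks.maximum mentioned above.

```java
/** Estimates how many rounds of tasks a cluster needs for a given slot setting. */
final class SlotEstimate {

    /** Number of waves needed when each node runs at most slotsPerNode tasks at once. */
    static long waves(long tasks, int nodes, int slotsPerNode) {
        long slots = (long) nodes * slotsPerNode;
        return (tasks + slots - 1) / slots; // ceiling division
    }

    public static void main(String[] args) {
        long tasks = 1760;
        int nodes = 11;
        // With the default of 2 map slots per node: 22 concurrent tasks.
        System.out.println(waves(tasks, nodes, 2));  // prints 80
        // With 16 map slots per node: 176 concurrent tasks.
        System.out.println(waves(tasks, nodes, 16)); // prints 10
    }
}
```

With one-minute tasks, that is roughly the difference between 80 minutes and 10 minutes of map time under this simple model, before any startup overhead is counted.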
Re: Good idea to run NameNode and JobTracker on same machine?
The real kicker is going to be memory consumption of one or both of these services. The NN in particular uses a large amount of RAM to store the filesystem image. I think that those who are suggesting a breakeven point of <= 10 nodes are lowballing. In practice, unless your machines are really thin on RAM (e.g., 2--4 GB), I haven't seen any cases where these services need to be separated below the 20-node mark; I've also seen several clusters of 40 nodes running fine with these services colocated. It depends on how many files are in HDFS and how frequently you're submitting a lot of concurrent jobs to MapReduce.

If you're setting up a production environment that you plan to expand, however, as a best practice you should configure the master node to have two hostnames (e.g., "nn" and "jt") so that you can have separate hostnames in fs.default.name and mapred.job.tracker; when the day comes that these services are placed on different nodes, you'll then be able to just move one of the hostnames over and not need to reconfigure all 20--40 other nodes.

- Aaron

On Thu, Nov 26, 2009 at 8:27 PM, Srigurunath Chakravarthi <srig...@yahoo-inc.com> wrote:

> Raymond,
> Load wise, it should be very safe to run both JT and NN on a single node
> for small clusters (< 40 Task Trackers and/or Data Nodes). They don't use
> much CPU as such.
>
> This may even work for larger clusters depending on the type of hardware
> you have and the Hadoop job mix. We usually observe < 5% CPU load with ~80
> DNs/TTs on an 8-core Intel processor based box with 16GB RAM.
>
> It is best that you observe CPU & mem load on the JT+NN node to take a
> call on whether to separate them. iostat, top or sar should tell you.
>
> Regards,
> Sriguru
>
> >-----Original Message-----
> >From: John Martyniak [mailto:j...@beforedawnsolutions.com]
> >Sent: Friday, November 27, 2009 3:06 AM
> >To: common-user@hadoop.apache.org
> >Cc:
> >Subject: Re: Good idea to run NameNode and JobTracker on same machine?
> >
> >I have a cluster of 4 machines plus one machine to run nn & jt. I
> >have heard that 5 or 6 is the magic #. I will see when I add the next
> >batch of machines.
> >
> >And it seems to be running fine.
> >
> >-John
> >
> >On Nov 26, 2009, at 11:38 AM, Yongqiang He wrote:
> >
> >> I think it is definitely not a good idea to combine these two in
> >> production environment.
> >>
> >> Thanks
> >> Yongqiang
> >> On 11/26/09 6:26 AM, "Raymond Jennings III" wrote:
> >>
> >>> Do people normally combine these two processes onto one machine?
> >>> Currently I have them on separate machines but I am wondering whether
> >>> they use that much CPU processing time and maybe I should combine them
> >>> and create another DataNode.
Re: part-00000.deflate as output
You are always free to run with compression disabled. But in many production situations, space or performance concerns dictate that all data sets are stored compressed, so I think Tim was assuming that you might be operating in such an environment -- in which case, you'd only need things to appear in plaintext if a human operator is inspecting the output for debugging.

- Aaron

On Thu, Nov 26, 2009 at 4:59 PM, Mark Kerzner wrote:

> It worked!
>
> But why is it "for testing?" I only have one job, so I need by related as
> text, can I use this fix all the time?
>
> Thank you,
> Mark
>
> On Thu, Nov 26, 2009 at 1:10 AM, Tim Kiefer wrote:
>
> > For testing purposes you can also try to disable the compression:
> >
> > conf.setBoolean("mapred.output.compress", false);
> >
> > Then you can look at the output.
> >
> > - tim
> >
> > Amogh Vasekar wrote:
> >
> >> Hi,
> >> ".deflate" is the default compression codec used when parameter to
> >> generate compressed output is true ( mapred.output.compress ).
> >> You may set the codec to be used via mapred.output.compression.codec,
> >> some commonly used are available in hadoop.io.compress package...
> >>
> >> Amogh
> >>
> >> On 11/26/09 11:03 AM, "Mark Kerzner" wrote:
> >>
> >> Hi,
> >>
> >> I get this part-00000.deflate instead of part-00000.
> >>
> >> How do I get rid of the deflate option?
> >>
> >> Thank you,
> >> Mark
Re: Re: Doubt in Hadoop
When you set up the Job object, do you call job.setJarByClass(Map.class)? That will tell Hadoop which jar file to ship with the job and to use for classloading in your code.

- Aaron

On Thu, Nov 26, 2009 at 11:56 PM, wrote:

> Hi,
> I am running the job from command line. The job runs fine in the local mode
> but something happens when I try to run the job in the distributed mode.
>
> Abhishek Agrawal
>
> SUNY- Buffalo
> (716-435-7122)
>
> On Fri 11/27/09 2:31 AM , Jeff Zhang zjf...@gmail.com sent:
> > Do you run the map reduce job in command line or IDE? in map reduce
> > mode, you should put the jar containing the map and reduce class in
> > your classpath
> > Jeff Zhang
> > On Fri, Nov 27, 2009 at 2:19 PM, wrote:
> > Hello Everybody,
> > I have a doubt in Hadoop and was wondering if anybody has faced a
> > similar problem. I have a package called test. Inside that I have
> > classes called A.java, Map.java, Reduce.java. In A.java I have the main
> > method where I am trying to initialize the jobConf object. I have written
> > jobConf.setMapperClass(Map.class) and similarly for the reduce class
> > as well. The code works correctly when I run the code locally via
> > jobConf.set("mapred.job.tracker","local") but I get an exception
> > when I try to run this code on my cluster. The stack trace of the
> > exception is as under. I cannot understand the problem. Any help would
> > be appreciated.
> > java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: test.Map
> >   at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
> >   at org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:690)
> >   at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> >   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> >   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
> >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
> >   at org.apache.hadoop.mapred.Child.main(Child.java:158)
> > Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Markowitz.covarMatrixMap
> >   at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:720)
> >   at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:744)
> >   ... 6 more
> > Caused by: java.lang.ClassNotFoundException: test.Map
> >   at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
> >   at java.security.AccessController.doPrivileged(Native Method)
> >   at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
> >   at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> >   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
> >   at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
> >   at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
> >   at java.lang.Class.forName0(Native Method)
> >   at java.lang.Class.forName(Class.java:247)
> >   at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:673)
> >   at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:718)
> >   ... 7 more
> > Thank You
> > Abhishek Agrawal
> > SUNY- Buffalo
> > (716-435-7122)
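When chasing a ClassNotFoundException like the one above, it can help to check on the client where a class is actually being loaded from: a jar that can be shipped with the job, or a loose classes directory (typical when running from an IDE). The following plain-Java sketch does only that check. It is not how setJarByClass is implemented, and the class and method names are invented for illustration.

```java
import java.security.CodeSource;

/** Reports the jar or directory a class was loaded from. */
final class WhereIsMyClass {

    /** Returns the code source location as a string, or null if the JVM does not expose one. */
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return null; // e.g. classes provided by the JDK itself
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) {
        // In a real job you would pass your mapper class here instead.
        System.out.println(locationOf(WhereIsMyClass.class));
    }
}
```

If the printed location is a directory rather than a .jar file, there may be no jar for Hadoop to send to the task nodes, which would be consistent with the job working in local mode only.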
Re: RE: please help in setting hadoop
You've set hadoop.tmp.dir to /home/hadoop/hadoop-${user.name}.

This means that on every node, you're going to need a directory named (e.g.) /home/hadoop/hadoop-root/, since it seems as though you're running things as root (in general, not a good policy; but ok if you're on EC2 or something like that).

mapred.local.dir defaults to ${hadoop.tmp.dir}/mapred/local. You've confirmed that this exists on the machine named 'master' -- what about on slave?

Then, what are the permissions of /home/hadoop/ on the slave node? Whichever user started the Hadoop daemons (probably either 'root' or 'hadoop') will need the ability to mkdir /home/hadoop/hadoop-root underneath /home/hadoop. If that directory doesn't exist, or is chown'd to someone else, this will probably be the result.

- Aaron

On Thu, Nov 26, 2009 at 10:22 PM, wrote:

> Hi,
> There should be a folder called logs in $HADOOP_HOME. Also try going
> through
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
> This is a pretty good tutorial
>
> Abhishek Agrawal
>
> SUNY- Buffalo
> (716-435-7122)
>
> On Fri 11/27/09 1:18 AM , "Krishna Kumar" krishna.ku...@nechclst.in sent:
> > I have tried, but didn't get any success. In bwt can you please tell exact
> > path of log file which I have to refer.
> >
> > Thanks and Best Regards,
> > Krishna Kumar
> > Senior Storage Engineer
> >
> > Why do we have to die? If we had to die, and everything is gone after that,
> > then nothing else matters on this earth - everything is temporary, at least
> > relative to me.
> >
> > -----Original Message-----
> > From: aa...@buffalo.edu [aa...@buffalo.edu]
> > Sent: Friday, November 27, 2009 10:56 AM
> > To: common-user@hadoop.apache.org
> > Subject: Re: please help in setting hadoop
> >
> > Hi,
> > Just a thought, but you do not need to setup the temp directory in
> > conf/hadoop-site.xml especially if you are running basic examples. Give
> > that a shot, maybe it will work out. Otherwise see if you can find
> > additional info in the LOGS
> >
> > Thank You
> >
> > Abhishek Agrawal
> >
> > SUNY- Buffalo
> > (716-435-7122)
> >
> > On Fri 11/27/09 12:20 AM , "Krishna Kumar" krishna.ku...@nechclst.in sent:
> > > Dear All,
> > >
> > > Can anybody please help me in getting out from these error messages:
> > >
> > > [ hadoop]# hadoop jar /usr/lib/hadoop/hadoop-0.18.3-14.cloudera.CH0_3-examples.jar wordcount test test-op
> > >
> > > 09/11/26 17:15:45 INFO mapred.FileInputFormat: Total input paths to process : 4
> > > 09/11/26 17:15:45 INFO mapred.FileInputFormat: Total input paths to process : 4
> > > org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid local directories in property: mapred.local.dir
> > >   at org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:730)
> > >   at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:222)
> > >   at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:194)
> > >   at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1557)
> > >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >   at java.lang.reflect.Method.invoke(Method.java:585)
> > >   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
> > >   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
> > >
> > > I am running the hadoop cluster as root user on two server nodes: master
> > > and slave. My hadoop-site.xml file format is as follows :
> > >
> > > fs.default.name    hdfs://master:54310
> > > dfs.permissions    false
> > > dfs.name.dir       /home/hadoop/dfs/name
> > >
> > > Further the o/p of ls command is as follows:
> > >
> > > [ hadoop]# ls -l /home/hadoop/hadoop-root/
> > > total 8
> > > drwxr-xr-x 4 root root 4096 Nov 26 16:48 dfs
> > > drwxr-xr-x 3 root root 4096 Nov 26 16:49 mapred
> > > [ hadoop]#
> > > [ hadoop]#
> > > [ hadoop]# ls -l /home/hadoop/hadoop-root/mapred/
> > > total 4
> > > drwxr-xr-x 2 root root 4096 Nov 26 16:49 local
> > > [ hadoop]#
> > > [ hadoop]# ls -l /home/hadoop/hadoop-root/mapred/local/
> > > total 0
> > >
> > > Thanks and Best Regards,
> > >
> > > Krishna Kumar
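The checks described at the top of this message (does the directory exist on each node, and can the daemon user create and write to it) can be scripted. Below is a small plain-Java sketch with invented names; it is not part of Hadoop and does not reproduce Hadoop's own directory validation. Note that it creates the directory if it is missing, so run it as the same user that starts the Hadoop daemons and only against a path you intend to use.

```java
import java.io.File;

/** Checks whether a directory can serve as a local scratch directory. */
final class LocalDirCheck {

    /** Returns "ok", or a short description of the first problem found. */
    static String check(File dir) {
        if (!dir.exists() && !dir.mkdirs()) {
            return "cannot create " + dir; // side effect: mkdirs() is attempted
        }
        if (!dir.isDirectory()) {
            return "not a directory: " + dir;
        }
        if (!dir.canWrite()) {
            return "not writable: " + dir;
        }
        return "ok";
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java LocalDirCheck <directory>");
            return;
        }
        System.out.println(check(new File(args[0])));
    }
}
```

For the setup in this thread, the directory to pass on each node would be /home/hadoop/hadoop-root/mapred/local, going by the defaults Aaron describes.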
Re: Hadoop 0.20 map/reduce Failing for old API
On Fri, Nov 27, 2009 at 10:46 AM, Arv Mistry wrote:

> Thanks Rekha, I was missing the new library
> (hadoop-0.20.1-hdfs-core.jar) in my client.
>
> It seems to run a little further but I'm now getting a
> ClassCastException returned by the mapper. Note, this worked with the
> 0.19 load, so I'm assuming there's something additional in the
> configuration that I'm missing. Can anyone help?
>
> java.lang.ClassCastException: org.apache.hadoop.mapred.MultiFileSplit cannot be cast to org.apache.hadoop.mapred.FileSplit
>   at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:54)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>   at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Cheers Arv
>
> -----Original Message-----
> From: Rekha Joshi [mailto:rekha...@yahoo-inc.com]
> Sent: November 26, 2009 11:45 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Hadoop 0.20 map/reduce Failing for old API
>
> The exit status of 1 usually indicates configuration issues, incorrect
> command invocation in hadoop 0.20 (incorrect params), if not JVM crash.
> In your logs there is no indication of crash, but some paths/command can
> be the cause. Can you check if your lib paths/data paths are correct?
>
> If it is a memory intensive task, you may also try values on
> mapred.child.java.opts / mapred.job.map.memory.mb. Thanks!
>
> On 11/27/09 1:28 AM, "Arv Mistry" wrote:
>
> Hi,
>
> We've recently upgraded to hadoop 0.20. Writing to HDFS seems to be
> working fine, but the map/reduce jobs are failing with the following
> exception. Note, we have not moved to the new map/reduce API yet. In the
> client that launches the job, the only change I have made is to now load
> the three files; core-site.xml, hdfs-site.xml and mapred-site.xml rather
> than the hadoop-site.xml. Any ideas?
RE: Hadoop 0.20 map/reduce Failing for old API
Thanks Rekha, I was missing the new library (hadoop-0.20.1-hdfs-core.jar) in my client. It seems to run a little further but I'm now getting a ClassCastException returned by the mapper. Note, this worked with the 0.19 load, so I'm assuming there's something additional in the configuration that I'm missing. Can anyone help? java.lang.ClassCastException: org.apache.hadoop.mapred.MultiFileSplit cannot be cast to org.apache.hadoop.mapred.FileSplit at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat .java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) Cheers Arv -Original Message- From: Rekha Joshi [mailto:rekha...@yahoo-inc.com] Sent: November 26, 2009 11:45 PM To: common-user@hadoop.apache.org Subject: Re: Hadoop 0.20 map/reduce Failing for old API The exit status of 1 usually indicates configuration issues, incorrect command invocation in hadoop 0.20 (incorrect params), if not JVM crash. In your logs there is no indication of crash, but some paths/command can be the cause. Can you check if your lib paths/data paths are correct? If it is a memory intensive task, you may also try values on mapred.child.java.opts /mapred.job.map.memory.mb.Thanks! On 11/27/09 1:28 AM, "Arv Mistry" wrote: Hi, We've recently upgraded to hadoop 0.20. Writing to HDFS seems to be working fine, but the map/reduce jobs are failing with the following exception. Note, we have not moved to the new map/reduce API yet. In the client that launches the job, the only change I have made is to now load the three files; core-site.xml, hdfs-site.xml and mapred-site.xml rather than the hadoop-site.xml. Any ideas? 
INFO | jvm 1| 2009/11/26 13:47:26 | 2009-11-26 13:47:26,328 INFO [FileInputFormat] Total input paths to process : 711
INFO | jvm 1| 2009/11/26 13:47:28 | 2009-11-26 13:47:28,033 INFO [JobClient] Running job: job_200911241319_0003
INFO | jvm 1| 2009/11/26 13:47:29 | 2009-11-26 13:47:29,036 INFO [JobClient] map 0% reduce 0%
INFO | jvm 1| 2009/11/26 13:47:36 | 2009-11-26 13:47:36,068 INFO [JobClient] Task Id : attempt_200911241319_0003_m_03_0, Status : FAILED
INFO | jvm 1| 2009/11/26 13:47:36 | java.io.IOException: Task process exit with nonzero status of 1.
INFO | jvm 1| 2009/11/26 13:47:36 | at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO | jvm 1| 2009/11/26 13:47:36 |
INFO | jvm 1| 2009/11/26 13:47:36 | 2009-11-26 13:47:36,094 WARN [JobClient] Error reading task outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taskid=attempt_200911241319_0003_m_03_0&filter=stdout
INFO | jvm 1| 2009/11/26 13:47:36 | 2009-11-26 13:47:36,096 WARN [JobClient] Error reading task outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taskid=attempt_200911241319_0003_m_03_0&filter=stderr
INFO | jvm 1| 2009/11/26 13:47:51 | 2009-11-26 13:47:51,162 INFO [JobClient] Task Id : attempt_200911241319_0003_m_00_0, Status : FAILED
INFO | jvm 1| 2009/11/26 13:47:51 | java.io.IOException: Task process exit with nonzero status of 1.
INFO | jvm 1| 2009/11/26 13:47:51 | at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO | jvm 1| 2009/11/26 13:47:51 |
INFO | jvm 1| 2009/11/26 13:47:51 | 2009-11-26 13:47:51,166 WARN [JobClient] Error reading task outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taskid=attempt_200911241319_0003_m_00_0&filter=stdout
INFO | jvm 1| 2009/11/26 13:47:51 | 2009-11-26 13:47:51,167 WARN [JobClient] Error reading task outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taskid=attempt_200911241319_0003_m_00_0&filter=stderr
INFO | jvm 1| 2009/11/26 13:47:52 | 2009-11-26 13:47:52,173 INFO [JobClient] map 50% reduce 0%
INFO | jvm 1| 2009/11/26 13:48:03 | 2009-11-26 13:48:03,219 INFO [JobClient] Task Id : attempt_200911241319_0003_m_01_0, Status : FAILED
INFO | jvm 1| 2009/11/26 13:48:03 | Map output lost, rescheduling: getMapOutput(attempt_200911241319_0003_m_01_0,0) failed :
INFO | jvm 1| 2009/11/26 13:48:03 | org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200911241319_0003/attempt_200911241319_0003_m_01_0/output/file.out.index in any of the configured local directories
INFO | jvm 1| 2009/11/26 13:48:03 | at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
INFO | jvm 1| 2009/11/26 13:48:03 | at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
INFO | jvm 1| 2009/11/26 13:48:03 | at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2886)
INFO | jvm 1| 2009/11/26 13:48:03 |
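A note on the ClassCastException reported at the top of this thread: the trace shows TextInputFormat.getRecordReader casting the split it receives to FileSplit, while the split it was actually handed is a MultiFileSplit. The following is only a minimal sketch of that failure mode. The classes are stand-ins written for illustration (they are not the Hadoop classes), and openUnchecked/openChecked are hypothetical helper names. It shows why two split types that share an interface but not a superclass cannot be cast to each other, and how an explicit type check gives a clearer message about a mismatch between the input format that created the splits and the one that reads them.

```java
final class SplitCastDemo {
    // Stand-ins for illustration only; these are NOT the Hadoop classes.
    interface InputSplit {}

    static final class FileSplit implements InputSplit {
        final String path;
        FileSplit(String path) { this.path = path; }
    }

    static final class MultiFileSplit implements InputSplit {
        final String[] paths;
        MultiFileSplit(String... paths) { this.paths = paths; }
    }

    // Mimics a reader that assumes every split it receives is a FileSplit.
    // Throws ClassCastException when given any other InputSplit implementation.
    static String openUnchecked(InputSplit split) {
        return ((FileSplit) split).path;
    }

    // Defensive variant (hypothetical): checks the type first and reports
    // which split type actually arrived.
    static String openChecked(InputSplit split) {
        if (!(split instanceof FileSplit)) {
            throw new IllegalArgumentException(
                "expected FileSplit but got " + split.getClass().getSimpleName()
                + "; check which input format produced the splits");
        }
        return ((FileSplit) split).path;
    }

    public static void main(String[] args) {
        InputSplit multi = new MultiFileSplit("part-0", "part-1");
        try {
            openUnchecked(multi);
        } catch (ClassCastException e) {
            System.out.println("unchecked cast failed: " + e.getMessage());
        }
        try {
            openChecked(multi);
        } catch (IllegalArgumentException e) {
            System.out.println("checked: " + e.getMessage());
        }
    }
}
```

If the analogy holds for the real job, the thing to check is whether the input format configured on the client that creates the splits matches the one the task uses to read them. That is a guess from the trace, not a confirmed diagnosis.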
Re: AW: KeyValueTextInputFormat and Hadoop 0.20.1
https://issues.apache.org/jira/browse/MAPREDUCE-655
Fixed in version 0.21.0.

On 11/26/09 9:43 PM, "Matthias Scherer" wrote:

Sorry, but I can't find it in the version control system for release 0.20.1:
http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1/src/mapred/org/apache/hadoop/mapreduce/lib/input/

Do you have another distribution?

Regards,
Matthias

> -----Original Message-----
> From: Jeff Zhang [mailto:zjf...@gmail.com]
> Sent: Thursday, November 26, 2009 16:35
> To: common-user@hadoop.apache.org
> Subject: Re: KeyValueTextInputFormat and Hadoop 0.20.1
>
> There's a KeyValueInputFormat under package
> org.apache.hadoop.mapreduce.lib.input
> which is for hadoop new API
>
> Jeff Zhang
>
> On Thu, Nov 26, 2009 at 7:10 AM, Matthias Scherer wrote:
>
> > Hi,
> >
> > I started my first experimental Hadoop project with Hadoop 0.20.1 and
> > ran into the following problem:
> >
> > Job job = new Job(new Configuration(), "Myjob");
> > job.setInputFormatClass(KeyValueTextInputFormat.class);
> >
> > The last line throws the following error: "The method
> > setInputFormatClass(Class) in the type Job is
> > not applicable for the arguments (Class)"
> >
> > Job.setInputFormatClass expects a subclass of the new class
> > org.apache.hadoop.mapreduce.InputFormat. But KeyValueTextInputFormat
> > is only available as a subclass of the deprecated
> > org.apache.hadoop.mapred.FileInputFormat.
> >
> > Is there a way to use KeyValueTextInputFormat with the new classes Job
> > and Configuration?
> >
> > Thanks,
> > Matthias
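For anyone stuck on 0.20.1 who wants the same record semantics with the new API in the meantime, the per-line logic of KeyValueTextInputFormat is small: each line is split at the first separator (a tab by default) into key and value, and a line with no separator becomes the key with an empty value. Below is a minimal, Hadoop-free sketch of just that splitting rule. The class and method names (KeyValueLineDemo, splitLine) are made up for illustration and are not part of any Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

final class KeyValueLineDemo {
    // Splits a line at the FIRST occurrence of the separator.
    // Everything before it is the key, everything after it is the value.
    // If the separator is absent, the whole line is the key and the value is empty.
    static Map.Entry<String, String> splitLine(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        String[] lines = { "user1\tclicked\thome", "no-separator-here", "\tonly-value" };
        for (String line : lines) {
            Map.Entry<String, String> kv = splitLine(line, '\t');
            System.out.println("key=[" + kv.getKey() + "] value=[" + kv.getValue() + "]");
        }
    }
}
```

Under that assumption, one workaround is to keep the new-API TextInputFormat and apply a split like this at the start of the map method. The other is to stay on the old JobConf-based API, where KeyValueTextInputFormat already exists.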