How are jobs copied to other nodes?
I am interested in knowing how Hadoop distributes jobs internally. How is a job copied to the other nodes? Are the class files copied to every node on which the job executes?
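No reply is included in this digest, but for context: in classic MapReduce the client bundles the user classes into a job jar, uploads that jar together with the job configuration to the JobTracker's system directory on HDFS, and each TaskTracker that runs a task downloads and unpacks the jar locally before launching the task JVM, so the class files do reach every node that executes the job. A minimal sketch (the class and jar names are made up) of telling Hadoop which jar to ship, using the old mapred API:

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      // Passing the driver class lets Hadoop locate the jar that contains it;
      // that jar is shipped to the cluster and unpacked on every node that
      // runs one of the job's tasks.
      JobConf conf = new JobConf(WordCountDriver.class);
      // Equivalent, if the jar path is known explicitly:
      // conf.setJar("/path/to/wordcount.jar");
      // ... set input/output paths, mapper, reducer, etc. ...
      JobClient.runJob(conf);
    }
  }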
RE: Map/Reduce Type Mismatch error
The key provided by the default FileInputFormat is not Text, but a LongWritable byte offset into the split (which is not very useful IMHO). Try changing your mapper's input key back to LongWritable. If you are expecting the file name to be the key, you will (I think) need to write your own InputFormat.

Jeff

-----Original Message-----
From: Prasan Ary [mailto:[EMAIL PROTECTED]]
Sent: Friday, March 07, 2008 3:50 PM
To: hadoop
Subject: Map/Reduce Type Mismatch error

Hi All,

I am running a Map/Reduce job on a text file. Map takes a (Text, Text) input pair and outputs a (Text, IntWritable) pair. Reduce takes a (Text, IntWritable) input pair and outputs a (Text, Text) pair. I am getting a type mismatch error. Any suggestion?

JobConf job = new JobConf(...);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
-----
public static class Map extends MapReduceBase implements Mapper {
  ...
  public void map(Text key, Text value, OutputCollector output, Reporter reporter) throws IOException {
    ...
    output.collect(key, new IntWritable(1));
  }
}

public static class Reduce extends MapReduceBase implements Reducer {
  public void reduce(Text key, Iterator values, OutputCollector output, Reporter reporter) throws IOException {
    ...
    output.collect(key, new Text("SomeText"));
  }
}
-----
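For what it's worth, a minimal sketch of the fix Jeff describes, against the old mapred API; the driver settings below are inferred from the code in the question rather than taken from it. With the default input format the map key is the LongWritable byte offset of the line, and because the map output value (IntWritable) differs from the job's final output value (Text), the map output types have to be declared separately:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class TypeFixSketch {

    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        // key = byte offset of the line, value = the line itself
        output.collect(new Text(value.toString()), new IntWritable(1));
      }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, Text> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, Text> output,
                         Reporter reporter) throws IOException {
        output.collect(key, new Text("SomeText"));
      }
    }

    public static void main(String[] args) throws IOException {
      JobConf job = new JobConf(TypeFixSketch.class);
      job.setMapperClass(Map.class);
      job.setReducerClass(Reduce.class);
      // map output types differ from the final output types, so declare both
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(IntWritable.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      // ... set input/output paths for your Hadoop version ...
      JobClient.runJob(job);
    }
  }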
Map/Reduce Type Mismatch error
Hi All,

I am running a Map/Reduce job on a text file. Map takes a (Text, Text) input pair and outputs a (Text, IntWritable) pair. Reduce takes a (Text, IntWritable) input pair and outputs a (Text, Text) pair. I am getting a type mismatch error. Any suggestion?

JobConf job = new JobConf(...);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
-----
public static class Map extends MapReduceBase implements Mapper {
  ...
  public void map(Text key, Text value, OutputCollector output, Reporter reporter) throws IOException {
    ...
    output.collect(key, new IntWritable(1));
  }
}

public static class Reduce extends MapReduceBase implements Reducer {
  public void reduce(Text key, Iterator values, OutputCollector output, Reporter reporter) throws IOException {
    ...
    output.collect(key, new Text("SomeText"));
  }
}
-----
Re: Does Hadoop Honor Reserved Space?
Unfortunately, I had to clean up my HDFS in order to get some work done, but I was running Hadoop 0.16.0 on a Linux box. My configuration is two machines: one runs the JobTracker, the NameNode, and a TaskTracker instance; the other runs just a TaskTracker. Replication was set to 2 for both the default and the max.

-- Jimmy

On Thu, 06 Mar 2008 16:01:16 -0600, Hairong Kuang <[EMAIL PROTECTED]> wrote:
> In addition to the version, could you please send us a copy of the datanode report by running the command bin/hadoop dfsadmin -report? Thanks, Hairong

On 3/6/08 11:56 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:
> But intermediate data is stored in a different directory from dfs/data (something like mapred/local by default, I think). What version are you running?

-----Original Message-----
From: Ashwinder Ahluwalia on behalf of [EMAIL PROTECTED]
Sent: Thu 3/6/2008 10:14 AM
To: core-user@hadoop.apache.org
Subject: RE: Does Hadoop Honor Reserved Space?

I've run into a similar issue in the past. From what I understand, this parameter only controls HDFS space usage. However, the intermediate data from a map/reduce job is stored on the local file system (not HDFS) and is not subject to this configuration.

In the past I have used mapred.local.dir.minspacekill and mapred.local.dir.minspacestart to control the amount of space available to this temporary data. Not sure if that is the best approach though, so I'd love to hear what other people have done.

In your case, you have a map/reduce job that will consume too much space (without a limit set, you didn't have enough disk capacity for the job), so looking at mapred.output.compress and mapred.compress.map.output might be useful to decrease the job's disk requirements.

--Ash

-----Original Message-----
From: Jimmy Wan [mailto:[EMAIL PROTECTED]]
Sent: Thursday, March 06, 2008 9:56 AM
To: core-user@hadoop.apache.org
Subject: Does Hadoop Honor Reserved Space?

I've got 2 datanodes set up with the following configuration parameter:

  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>429496729600</value>
    <description>Reserved space in bytes per volume. Always leave this much space free for non dfs use.</description>
  </property>

Both are housed on 800GB volumes, so I thought this would keep about half of each volume free for non-HDFS usage. After some long-running jobs last night, both disk volumes were completely filled. The bulk of the data was in:

${my.hadoop.tmp.dir}/hadoop-hadoop/dfs/data

This is running as the user hadoop. Am I interpreting these parameters incorrectly? I noticed this issue, but it is marked as closed:
http://issues.apache.org/jira/browse/HADOOP-2549
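For reference, a hedged sketch of the settings Ash mentions, in hadoop-site.xml form; the values are made-up examples, not recommendations from the thread. Roughly, minspacestart stops the TaskTracker from accepting new tasks when free space under mapred.local.dir falls below it, minspacekill starts killing running tasks to reclaim local space, and the two compression flags shrink the final job output and the intermediate map output:

  <!-- illustrative values only -->
  <property>
    <name>mapred.local.dir.minspacestart</name>
    <value>10737418240</value>  <!-- ~10 GB: don't accept new tasks below this free space -->
  </property>
  <property>
    <name>mapred.local.dir.minspacekill</name>
    <value>5368709120</value>   <!-- ~5 GB: start killing tasks to reclaim local space -->
  </property>
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>         <!-- compress final job output -->
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>         <!-- compress intermediate map output -->
  </property>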
Re: using a perl script with argument variables which point to config files on the DFS as a mapper
So I've read up on -cacheFile and -file and I still can't quite get my script to work. I'm running it as follows:

hstream -input basedir/finegrain/validation.txt.head -output basedir/output -mapper "Evaluate_linux.pl segs.xml config.txt" -numReduceTasks 0 -jobconf mapred.job.name="Evaluate" -file Evaluate_linux.pl -cacheFile hdfs://servername:9008/user/tvan/basedir/custom/final_segs.20080305.xml#segs.xml -cacheFile hdfs://servername:9008/user/tvan/basedir/config.txt#config.txt

The job starts, but all map tasks fail with the same error:

java.io.IOException: log:null R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s] minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null HOST=null USER=tvanrooy HADOOP_USER=null last Hadoop input: |null| last tool output: |null| Date: Fri Mar 07 15:47:37 EST 2008 java.io.IOException: Broken pipe

Is this an indication that my script isn't finding the files I pass it?

On Thu, Mar 6, 2008 at 5:17 PM, Lohit <[EMAIL PROTECTED]> wrote:
> You could use the -cacheFile or -file option for this. Check the streaming docs for examples.
>
> On Mar 6, 2008, at 2:32 PM, "Theodore Van Rooy" <[EMAIL PROTECTED]> wrote:
> > I would like to convert a perl script that currently uses argument variables to run with Hadoop Streaming.
> >
> > Normally I would use the script like
> >
> > 'cat datafile.txt | myscript.pl folder/myfile1.txt folder/myfile2.txt'
> >
> > where the two argument variables are actually the names of configuration files for myscript.pl.
> >
> > The question I have is: how do I get the perl script either to look in the local directory for the config files, or to look on the DFS for them? Once the configurations are passed in, there is no problem using STDIN to process the datafile passed to it by Hadoop.

--
Theodore Van Rooy
Green living isn't just for hippies...
http://greentheo.scroggles.com
Re: Equivalent of cmdline head or tail?
I thought so as well until I reflected for a moment. But if you include the top N from every combiner, then you are guaranteed to have the global top N in the output of all of the combiners.

On 3/6/08 11:50 PM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote:
>
> On Mar 6, 2008, at 5:02 PM, Ted Dunning wrote:
>
>> I don't know if the combiner sees things in order. IF it does, then you can prune on both levels to minimize data transfer.
>
> The input to the combiners is sorted. However, when filtering to the top N, you need to be careful to include enough that the partial view doesn't distort the global view.
>
> -- O
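A minimal sketch (my own illustration, not code from this thread) of the pruning idea, against the old mapred API: since combiner input arrives sorted by key, a combiner that passes through only the first N keys it sees keeps its local top N, and the global top N is guaranteed to be contained in the union of those local top Ns. This assumes the job's key comparator puts the "best" records first and that a fresh combiner instance handles each combine pass; the class name, types, and cutoff are made up:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  // Set as the combiner class: emits only the first N keys it sees, which is
  // the local top N when the sort order matches the ranking order. A record in
  // the global top N is outranked by fewer than N records overall, so it is
  // always inside some combiner's local top N and survives the pruning.
  public class TopNPruner extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

    private static final int N = 100;   // assumed cutoff
    private int keysSeen = 0;

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      if (keysSeen++ >= N) {
        return;                          // prune everything beyond the local top N
      }
      while (values.hasNext()) {
        output.collect(key, values.next());
      }
    }
  }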
Re: Copying files from remote machine to dfs
If uploading files from a non-HDFS file system: install the Hadoop distribution, configure it to talk to your namenode, make sure there are no firewall restrictions (TCP ports 8020, 50010, 50070, 50075), and then simply run "hadoop dfs -put <localsrc> <dst>".

On 3/7/08 03:43, "Ved Prakash" <[EMAIL PROTECTED]> wrote:
> Hi Friends,
>
> Can we copy files residing on a remote machine to dfs?
>
> Thanks
>
> Ved

--
Marco Nicosia - Grid Services Ops
Systems, Tools, and Services Group
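For what it's worth, a minimal sketch of what "configure it to talk to your namenode" might look like on the remote client machine; the hostname, port, and paths below are placeholders and should match whatever fs.default.name is set to on the cluster:

  <!-- hadoop-site.xml on the remote client (placeholder host/port) -->
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>namenode.example.com:8020</value>
    </property>
  </configuration>

  # then, from the remote machine:
  bin/hadoop dfs -put /local/path/mydata.txt /user/ved/mydata.txt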
Custom Input Formats
Hello,

First, I am currently subscribed to the digest, so could you please cc me at [EMAIL PROTECTED] with any replies? I really appreciate it.

I have a few questions regarding input formats. Specifically, I want each input record to be one complete text file. I understand that I must implement both FileInputFormat and RecordReader. From there, however, I am not sure what to do. Can I include these classes in my MR project, or do I need to keep them in a separate jar and reference that in HADOOP_CLASSPATH? Also, should HADOOP_CLASSPATH point to a directory of jars, or does it mimic the space-delimited manifest.mf format?

Finally, are there any examples of user-defined input formats available anywhere?

Thanks,
Dan
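No reply is included in this digest, but here is a hedged sketch of the usual approach against the old mapred API (the class names are invented for illustration): subclass FileInputFormat, mark files as non-splittable, and supply a RecordReader that returns each whole file as a single record, using the file name as the key. Such classes can typically live in the job's own jar, in which case HADOOP_CLASSPATH does not need to change.

  import java.io.IOException;

  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileSplit;
  import org.apache.hadoop.mapred.InputSplit;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.RecordReader;
  import org.apache.hadoop.mapred.Reporter;

  // One whole text file per record: key = file name, value = file contents.
  public class WholeFileInputFormat extends FileInputFormat<Text, Text> {

    protected boolean isSplitable(FileSystem fs, Path filename) {
      return false;                       // never split: one file = one record
    }

    public RecordReader<Text, Text> getRecordReader(InputSplit split,
        JobConf job, Reporter reporter) throws IOException {
      return new WholeFileRecordReader((FileSplit) split, job);
    }
  }

  class WholeFileRecordReader implements RecordReader<Text, Text> {
    private final FileSplit split;
    private final JobConf job;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    public boolean next(Text key, Text value) throws IOException {
      if (processed) {
        return false;                     // only one record per split
      }
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(job);
      // assumes each file comfortably fits in memory
      byte[] contents = new byte[(int) split.getLength()];
      FSDataInputStream in = fs.open(file);
      try {
        in.readFully(0, contents);
      } finally {
        in.close();
      }
      key.set(file.toString());           // file name as the key
      value.set(contents, 0, contents.length);
      processed = true;
      return true;
    }

    public Text createKey() { return new Text(); }
    public Text createValue() { return new Text(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() throws IOException { }
  }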
[HOD] Collecting MapReduce logs
Hello everyone,

I wonder what the meaning of hodring.log-destination-uri is, versus hodring.log-dir. I'd like to collect the MapReduce UI logs after a job has run, and the only relevant attribute seems to be hod.hadoop-ui-log-dir, in the hod section. With that attribute specified, the logs are all grabbed into that directory, producing a large number of HTML files. Is there a way to collect them, maybe as a .tar.gz, in a place somehow related to the user?

Additionally, how do administrators specify variables in these values? Which interpreter interprets them? For instance, variables specified in a bash fashion like $USER work well in the hodring or ringmaster sections (I guess they are interpreted by bash itself), but not if specified in the hod section. I tried

[hod]
hadoop-ui-log-dir=/somedir/$USER

but any hod command fails, displaying an error on that line.

Cheers,
Luca
Re: RE: clustering problem
Hi,

I found the solution for the problem I had posted; I'll post the resolution here so that others may benefit from it. The incompatibility showing up on my slave was caused by an incompatible Java installation on that slave. I removed the existing Java installation from the slave, installed the same version I have on my master, and that solved the problem.

Thanks all for your responses.

Ved

2008/3/5 Ved Prakash <[EMAIL PROTECTED]>:
> Hi Miles,
>
> Yes, I have hadoop-0.15.2 installed on both my systems.
>
> Ved

2008/3/5 Miles Osborne <[EMAIL PROTECTED]>:
> Did you use exactly the same version of Hadoop on each and every node?
>
> Miles

On 05/03/2008, Ved Prakash <[EMAIL PROTECTED]> wrote:
> Hi Zhang,
>
> Thanks for your reply. I tried this but it didn't help; it still throws up "Incompatible build versions".
>
> I removed the dfs local directory on the slave and issued start-dfs.sh on the server, and when I checked the logs it showed the same problem.
>
> Do you guys need some more information from my side to get a better understanding of the problem? Please let me know.
>
> Thanks
>
> Ved

2008/3/5 Zhang, Guibin <[EMAIL PROTECTED]>:
> You can delete the DFS local dir on the slave (the local directory should be ${hadoop.tmp.dir}/dfs/) and try again.

-----Original Message-----
From: Ved Prakash [mailto:[EMAIL PROTECTED]]
Sent: March 5, 2008 14:51
To: core-user@hadoop.apache.org
Subject: clustering problem

Hi Guys,

I am having problems creating a cluster on 2 machines.

Machine configuration:

Master: OS: Fedora Core 7, hadoop-0.15.2

hadoop-site.xml listing:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>anaconda:50001</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>anaconda:50002</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.secondary.info.port</name>
    <value>50003</value>
  </property>
  <property>
    <name>dfs.info.port</name>
    <value>50004</value>
  </property>
  <property>
    <name>mapred.job.tracker.info.port</name>
    <value>50005</value>
  </property>
  <property>
    <name>tasktracker.http.port</name>
    <value>50006</value>
  </property>
</configuration>

conf/masters:
localhost

conf/slaves:
anaconda
v-desktop

The datanode, namenode, and secondarynamenode seem to be working fine on the master, but on the slave this is not the case.

Slave: OS: Ubuntu

hadoop-site.xml listing: same as master

In the logs on the slave machine I see this:

2008-03-05 12:15:25,705 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
2008-03-05 12:15:25,920 FATAL org.apache.hadoop.dfs.DataNode: Incompatible build versions: namenode BV = Unknown; datanode BV = 607333
2008-03-05 12:15:25,926 ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible build versions: namenode BV = Unknown; datanode BV = 607333
    at org.apache.hadoop.dfs.DataNode.handshake(DataNode.java:316)
    at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:238)
    at org.apache.hadoop.dfs.DataNode.<init>(DataNode.java:206)
    at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:1575)
    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1519)
    at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:1540)
    at org.apache.hadoop.dfs.DataNode.main(DataNode.java:1711)

Can someone help me with this please?

Thanks

Ved
Copying files from remote machine to dfs
Hi Friends,

Can we copy files residing on a remote machine to dfs?

Thanks

Ved
Re: Pipes task being killed
On Mar 5, 2008, at 9:31 AM, Rahul Sood wrote:

> Hi,
>
> We have a Pipes C++ application where the reduce task does a lot of computation. After some time the task gets killed by the Hadoop framework. The job output shows the following error:
>
> Task task_200803051654_0001_r_00_0 failed to report status for 604 seconds. Killing!
>
> Is there any way to send a heartbeat to the TaskTracker from a Pipes application? I believe this is possible in Java using org.apache.hadoop.util.Progress, and we're looking for something equivalent in the C++ Pipes API.

The context object has a progress method that should be called during long computations...

http://tinyurl.com/yt7hyx

search for progress...

-- Owen