clustering problem
Hi guys,

I am having problems creating a cluster on 2 machines.

Machine configuration:

Master:
OS: Fedora Core 7
hadoop-0.15.2

hadoop-site.xml listing:
  fs.default.name               anaconda:50001
  mapred.job.tracker            anaconda:50002
  dfs.replication               2
  dfs.secondary.info.port       50003
  dfs.info.port                 50004
  mapred.job.tracker.info.port  50005
  tasktracker.http.port         50006

conf/masters:
  localhost

conf/slaves:
  anaconda
  v-desktop

The datanode, namenode, and secondarynamenode seem to be working fine on the master, but on the slave this is not the case.

Slave:
OS: Ubuntu
hadoop-site.xml listing: same as master

In the logs on the slave machine I see this:

2008-03-05 12:15:25,705 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
2008-03-05 12:15:25,920 FATAL org.apache.hadoop.dfs.DataNode: Incompatible build versions: namenode BV = Unknown; datanode BV = 607333
2008-03-05 12:15:25,926 ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible build versions: namenode BV = Unknown; datanode BV = 607333
    at org.apache.hadoop.dfs.DataNode.handshake(DataNode.java:316)
    at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:238)
    at org.apache.hadoop.dfs.DataNode.<init>(DataNode.java:206)
    at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:1575)
    at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1519)
    at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:1540)
    at org.apache.hadoop.dfs.DataNode.main(DataNode.java:1711)

Can someone help me with this, please?

Thanks,
Ved
Using Sorted Files For Filtering Input (File Index)
Let's say I have a simple data file with key/value pairs, and the entire file is in ascending sorted order by 'value'. What I want to be able to do is filter the data so that the map function is only invoked with pairs where 'value' is greater than some input value. Does such a feature already exist, or would I need to implement my own RecordReader to do this filtering? Is this the right place to do it in Hadoop's input pipeline?

What I essentially want is a cheap index: by sorting the values ahead of time, you could just do a binary search on the InputSplit until you found the starting value that satisfies the predicate. The RecordReader would then start at this point in the file, read the lines in, and pass the records to map().

Any thoughts?

--
Andy Pavlo
[EMAIL PROTECTED]
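[Nothing like this appears to be built in, but the binary-search seek described above is easy to sketch. A minimal sketch against a plain local file, assuming text lines of the form key<TAB>value with a numeric value field; a custom RecordReader would do the equivalent against its split's stream, and should still check each record it emits against the predicate:]

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: find the byte offset of the first line whose value field
// satisfies "value >= threshold" in a file of "key<TAB>value" lines
// sorted ascending by value.
public class SortedSeek {

    static long valueOf(String line) {
        // Assumes a numeric value in the field after the first tab.
        return Long.parseLong(line.substring(line.indexOf('\t') + 1).trim());
    }

    public static long findStart(RandomAccessFile f, long threshold) throws IOException {
        long lo = 0, hi = f.length();
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            f.seek(mid);
            if (mid > 0) f.readLine();   // skip the (possibly partial) current line
            String line = f.readLine();
            if (line == null || valueOf(line) >= threshold) {
                hi = mid;                // first qualifying line starts at or before mid
            } else {
                lo = mid + 1;            // qualifying lines start strictly after mid
            }
        }
        f.seek(lo);
        if (lo > 0) f.readLine();        // advance to the start of the next full line
        return f.getFilePointer();       // offset to begin reading records from
    }
}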
Re: Processing multiple files - need to identify in map
Hi Tarandeep,

The JobConf you get in your configure() method has the info: it is available via the map.input.file parameter (more info here: http://wiki.apache.org/hadoop/TaskExecutionEnvironment).

Yes, you can have multiple input directories. You can use JobConf::addInputPath() to add more input paths before submitting your job; more info here:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#addInputPath(org.apache.hadoop.fs.Path)

Thanks,
Lohit

----- Original Message -----
From: Tarandeep Singh <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, March 4, 2008 5:38:41 PM
Subject: Processing multiple files - need to identify in map

Hi,

I need to identify which file a key came from, in the map phase. Is it possible?

What I have is multiple types of log files in one directory that I need to process for my application. Right now, I am relying on the structure of the log files (e.g. if a line starts with "weblog", the line came from Log File A, or if the number of tab-separated fields in the line is N, then it is Log File B).

Is there a better way to do this?

Is there a way that the Hadoop framework passes me, as a key, the path of the file (right now it is the offset in the file, I guess)?

One more related question - can I set 2 directories as input to my map/reduce program? This is just to avoid copying files from one log directory to another.

thanks,
Taran
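[A minimal mapper sketch of the map.input.file approach, against the 0.15-era mapred API (signatures changed in later releases); the "weblog" branch mirrors Taran's example, everything else is illustrative:]

import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LogTypeMapper extends MapReduceBase implements Mapper {

    private String inputFile;

    public void configure(JobConf conf) {
        // Set by the framework for each map task.
        inputFile = conf.get("map.input.file");
    }

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
        // Branch on the source file instead of sniffing the line layout.
        if (inputFile != null && inputFile.indexOf("weblog") != -1) {
            // ... treat value as a Log File A line ...
        } else {
            // ... treat value as a Log File B line ...
        }
    }
}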
Re: Processing multiple files - need to identify in map
More specifically, call jobConf.get("map.input.file") in the configure(JobConf conf) method of your mapper. There are some cases where this won't work, but in general it works fine.

And yes, you can add many input directories: jobConf.addInputPath(...)

On Mar 4, 2008, at 5:54 PM, Ted Dunning wrote:

> Yes. Use the configure method, which is called each time a new file is used in the map. Save the file name in a field of the mapper.
>
> The other alternative is to derive a new InputFormat that remembers the input file name.
>
> On 3/4/08 5:38 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I need to identify which file a key came from, in the map phase. Is it possible?
>>
>> What I have is multiple types of log files in one directory that I need to process for my application. Right now, I am relying on the structure of the log files (e.g. if a line starts with "weblog", the line came from Log File A, or if the number of tab-separated fields in the line is N, then it is Log File B).
>>
>> Is there a better way to do this?
>>
>> Is there a way that the Hadoop framework passes me, as a key, the path of the file (right now it is the offset in the file, I guess)?
>>
>> One more related question - can I set 2 directories as input to my map/reduce program? This is just to avoid copying files from one log directory to another.
>>
>> thanks,
>> Taran

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
Re: Processing multiple files - need to identify in map
Yes. Use the configure method, which is called each time a new file is used in the map. Save the file name in a field of the mapper.

The other alternative is to derive a new InputFormat that remembers the input file name.

On 3/4/08 5:38 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I need to identify which file a key came from, in the map phase.
> Is it possible?
>
> What I have is multiple types of log files in one directory that I
> need to process for my application. Right now, I am relying on the
> structure of the log files (e.g. if a line starts with "weblog", the
> line came from Log File A, or if the number of tab-separated fields in
> the line is N, then it is Log File B).
>
> Is there a better way to do this?
>
> Is there a way that the Hadoop framework passes me, as a key, the path
> of the file (right now it is the offset in the file, I guess)?
>
> One more related question - can I set 2 directories as input to my map/
> reduce program? This is just to avoid copying files from one log
> directory to another.
>
> thanks,
> Taran
Re: Processing multiple files - need to identify in map
The Reporter object given to the map() method can get you the InputSplit that is being mapped over. If this is a FileSplit, you can grab the path name from there.

- Aaron

Tarandeep Singh wrote:
> Hi,
>
> I need to identify which file a key came from, in the map phase. Is it possible?
>
> What I have is multiple types of log files in one directory that I need to process for my application. Right now, I am relying on the structure of the log files (e.g. if a line starts with "weblog", the line came from Log File A, or if the number of tab-separated fields in the line is N, then it is Log File B).
>
> Is there a better way to do this?
>
> Is there a way that the Hadoop framework passes me, as a key, the path of the file (right now it is the offset in the file, I guess)?
>
> One more related question - can I set 2 directories as input to my map/reduce program? This is just to avoid copying files from one log directory to another.
>
> thanks,
> Taran
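[A sketch of that approach; it assumes the Reporter in this release exposes getInputSplit() as Aaron describes (it may throw UnsupportedOperationException in contexts without a split), and the helper name is illustrative:]

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.Reporter;

// Sketch: call from within map() to recover the current input file.
public class SplitPath {
    static Path currentFile(Reporter reporter) {
        InputSplit split = reporter.getInputSplit();   // the split being mapped over
        if (split instanceof FileSplit) {
            return ((FileSplit) split).getPath();      // branch on getName() from here
        }
        return null;                                   // not a file-based input format
    }
}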
Processing multiple files - need to identify in map
Hi,

I need to identify which file a key came from, in the map phase. Is it possible?

What I have is multiple types of log files in one directory that I need to process for my application. Right now, I am relying on the structure of the log files (e.g. if a line starts with "weblog", the line came from Log File A, or if the number of tab-separated fields in the line is N, then it is Log File B).

Is there a better way to do this?

Is there a way that the Hadoop framework passes me, as a key, the path of the file (right now it is the offset in the file, I guess)?

One more related question - can I set 2 directories as input to my map/reduce program? This is just to avoid copying files from one log directory to another.

thanks,
Taran
Re: What's the best way to get to a single key?
And this, btw, provides a rationale for having a key in the reducer output.

On 3/4/08 12:53 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> So you should be able to just switch from specifying SequenceFileOutputFormat to MapFileOutputFormat in your jobs and everything should work the same, except you'll have index files that permit random access.
Re: What's the best way to get to a single key?
Xavier Stevens wrote:
> Is there a way to do this when your input data is using SequenceFile compression?

Yes. A MapFile is simply a directory containing two SequenceFiles named "data" and "index". MapFileOutputFormat uses the same compression parameters as SequenceFileOutputFormat, and SequenceFileInputFormat recognizes MapFiles and reads the "data" file. So you should be able to just switch from specifying SequenceFileOutputFormat to MapFileOutputFormat in your jobs and everything should work the same, except you'll have index files that permit random access.

Doug
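[As a sketch of the write-side change Doug describes; the compression call is just an illustration of the shared knobs:]

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class MapFileSwitch {
    // Sketch: the only job change is the output format class.
    public static void useMapFileOutput(JobConf conf) {
        // before: conf.setOutputFormat(SequenceFileOutputFormat.class);
        conf.setOutputFormat(MapFileOutputFormat.class);
        // the same compression parameters apply to both formats:
        SequenceFileOutputFormat.setOutputCompressionType(conf,
            SequenceFile.CompressionType.BLOCK);
    }
}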
Re: configuration access
Arun C Murthy wrote:
> Can you re-check if the right paths (for your config files) are on the CLASSPATH?

That was it. Thanks.

--
Steve Sapovits
Invite Media - http://www.invitemedia.com
[EMAIL PROTECTED]
Re: Hadoop / HDFS over WAN
I don't think there are any known deployments of Hadoop over a WAN, and there aren't any WAN-specific tweaks or configuration settings present that I know of. Hadoop apps tend to be data intensive. Any more details on what the configuration is likely to be? Will HDFS itself be across the WAN?

Some example tweaks: if you have high latency and high bandwidth across the WAN, each socket connection might need to use large recv/send buffers for its TCP sockets to mask the latency.

Raghu.

Tom Deckers (tdeckers) wrote:
> How well does HDFS perform over WAN links? Any best practices to take into account?
>
> Thanks!
> Tom.
RE: What's the best way to get to a single key?
Is there a way to do this when your input data is using SequenceFile compression?

Thanks,
-Xavier

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Monday, March 03, 2008 2:52 PM
To: core-user@hadoop.apache.org
Subject: Re: What's the best way to get to a single key?

Use MapFileOutputFormat to write your data, then call:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/MapFileOutputFormat.html#getEntry(org.apache.hadoop.io.MapFile.Reader[],%20org.apache.hadoop.mapred.Partitioner,%20K,%20V)

The documentation is pretty sparse, but the intent is that you open a MapFile.Reader for each mapreduce output, and pass the partitioner used, the key, and the value to be read into. A MapFile maintains an index of keys, so the entire file need not be scanned. If you really only need the value of a single key, then you might avoid opening all of the output files. In that case you might use the Partitioner and the MapFile API directly.

Doug

Xavier Stevens wrote:
> I am curious how others might be solving this problem. I want to
> retrieve a record from HDFS based on its key. Are there any methods
> that can shortcut this type of search to avoid parsing all data until
> you find it? Obviously HBase would do this as well, but I wanted to
> know if there is a way to do it using just Map/Reduce and HDFS.
>
> Thanks,
>
> -Xavier
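[To make the javadoc above concrete, a read-side sketch; the output path and Text key/value types are assumptions, and the partitioner must match the one the job used:]

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

// Sketch: look up one key across a job's MapFile outputs.
public class SingleKeyLookup {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        FileSystem fs = FileSystem.get(conf);
        // One MapFile.Reader per reduce output under the job's output dir.
        MapFile.Reader[] readers =
            MapFileOutputFormat.getReaders(fs, new Path("/my/job/output"), conf);
        // Must be the same partitioner the job used, so the right part file is probed.
        HashPartitioner partitioner = new HashPartitioner();
        Text key = new Text("the-key");
        Text value = new Text();
        // Returns null if the key is absent; otherwise fills in 'value'.
        Writable found = MapFileOutputFormat.getEntry(readers, partitioner, key, value);
        System.out.println(found == null ? "not found" : key + "\t" + value);
        for (int i = 0; i < readers.length; i++) {
            readers[i].close();
        }
    }
}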
Re: map/reduce function on xml string
Here's the code. If folks are interested, I can submit it as a patch as well.

Prasan Ary wrote:
> Colin,
> Is it possible that you share some of the code with us?
>
> thx,
> Prasan
>
> Colin Evans <[EMAIL PROTECTED]> wrote:
>> We ended up subclassing TextInputFormat and adding a custom RecordReader that starts and ends record reads on tags. The StreamXmlRecordReader class is a good reference for this.
>>
>> Prasan Ary wrote:
>>> Hi All,
>>> I am writing a Java implementation of my map/reduce function on Hadoop. Input to this is an XML file, and the map function has to process well-formed XML records. So far I have been unable to split the XML file at XML record boundaries to feed into my map function.
>>> Can anybody point me to resources where forcing a file split at a desired boundary is explained?
>>>
>>> thx,
>>> Pra.

package com.metaweb.hadoop.util;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.*;

import java.io.IOException;

/**
 * Reads records that are delimited by a specific begin/end tag.
 */
public class XmlInputFormat extends TextInputFormat {

    public static final String START_TAG_KEY = "xmlinput.start";
    public static final String END_TAG_KEY = "xmlinput.end";

    public void configure(JobConf jobConf) {
        super.configure(jobConf);
    }

    public RecordReader getRecordReader(InputSplit inputSplit, JobConf jobConf,
                                        Reporter reporter) throws IOException {
        return new XmlRecordReader((FileSplit) inputSplit, jobConf);
    }

    public static class XmlRecordReader implements RecordReader {
        private byte[] startTag;
        private byte[] endTag;
        private long start;
        private long end;
        private FSDataInputStream fsin;
        private DataOutputBuffer buffer = new DataOutputBuffer();

        public XmlRecordReader(FileSplit split, JobConf jobConf) throws IOException {
            startTag = jobConf.get("xmlinput.start").getBytes("utf-8");
            endTag = jobConf.get("xmlinput.end").getBytes("utf-8");

            // open the file and seek to the start of the split
            start = split.getStart();
            end = start + split.getLength();
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(jobConf);
            fsin = fs.open(split.getPath());
            fsin.seek(start);
        }

        public boolean next(WritableComparable key, Writable value) throws IOException {
            if (fsin.getPos() < end) {
                if (readUntilMatch(startTag, false)) {
                    try {
                        buffer.write(startTag);
                        if (readUntilMatch(endTag, true)) {
                            ((Text) key).set(Long.toString(fsin.getPos()));
                            ((Text) value).set(buffer.getData(), 0, buffer.getLength());
                            return true;
                        }
                    } finally {
                        buffer.reset();
                    }
                }
            }
            return false;
        }

        public WritableComparable createKey() {
            return new Text();
        }

        public Writable createValue() {
            return new Text();
        }

        public long getPos() throws IOException {
            return fsin.getPos();
        }

        public void close() throws IOException {
            fsin.close();
        }

        public float getProgress() throws IOException {
            return ((float) (fsin.getPos() - start)) / ((float) (end - start));
        }

        private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
            int i = 0;
            while (true) {
                int b = fsin.read();
                // end of file:
                if (b == -1) return false;
                // save to buffer:
                if (withinBlock) buffer.write(b);
                // check if we're matching:
                if (b == match[i]) {
                    i++;
                    if (i >= match.length) return true;
                } else i = 0;
                // see if we've passed the stop point:
                if (!withinBlock && i == 0 && fsin.getPos() >= end) return false;
            }
        }
    }
}
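[For anyone wiring the posted class into a job, the setup might look roughly like this — a sketch: the <record> tags and input path are placeholders:]

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class XmlJobSetup {
    // Sketch: point a job at the posted XmlInputFormat.
    public static JobConf newConf() {
        JobConf conf = new JobConf(XmlInputFormat.class);
        conf.setInputFormat(XmlInputFormat.class);
        conf.set(XmlInputFormat.START_TAG_KEY, "<record>");   // placeholder start tag
        conf.set(XmlInputFormat.END_TAG_KEY, "</record>");    // placeholder end tag
        conf.addInputPath(new Path("/data/xml"));             // placeholder input dir
        // map() then sees Text/Text pairs: key = byte position after the record,
        // value = one complete <record>...</record> block.
        return conf;
    }
}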
Re: configuration access
On Mar 4, 2008, at 8:34 AM, Steve Sapovits wrote:
> Can someone point me to working examples of config access? I'm trying to use the FileSystem and Configuration classes to get the 'fs.default.name' value. I see that the Configuration object thinks it's loaded the hadoop-default.xml and hadoop-site.xml files, but no matter what I ask it for, I get a default back and not what's configured.

Can you re-check if the right paths (for your config files) are on the CLASSPATH? You should be able to get the configured value via:

String fsName = conf.get("fs.default.name");

Arun

> A working example of this would probably set me straight. I tried explicitly adding resources, hard-coding the full path names of the config files ... same thing.
>
> --
> Steve Sapovits
> Invite Media - http://www.invitemedia.com
> [EMAIL PROTECTED]
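[A minimal standalone check along these lines — a sketch, assuming the directory containing your hadoop-site.xml is on the classpath; the class name is illustrative:]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConfCheck {
    public static void main(String[] args) throws Exception {
        // Loads hadoop-default.xml and hadoop-site.xml from the classpath.
        Configuration conf = new Configuration();
        // If this prints the default rather than your configured value,
        // hadoop-site.xml was not found on the classpath.
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        FileSystem fs = FileSystem.get(conf);   // uses the value above
        System.out.println("got filesystem: " + fs);
    }
}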
Re: map/reduce function on xml string
Colin,

Is it possible that you share some of the code with us?

thx,
Prasan

Colin Evans <[EMAIL PROTECTED]> wrote:
> We ended up subclassing TextInputFormat and adding a custom RecordReader that starts and ends record reads on tags. The StreamXmlRecordReader class is a good reference for this.
>
> Prasan Ary wrote:
>> Hi All,
>> I am writing a Java implementation of my map/reduce function on Hadoop. Input to this is an XML file, and the map function has to process well-formed XML records. So far I have been unable to split the XML file at XML record boundaries to feed into my map function.
>> Can anybody point me to resources where forcing a file split at a desired boundary is explained?
>>
>> thx,
>> Pra.
configuration access
Can someone point me to working examples of config access? I'm trying to use the FileSystem and Configuration classes to get the 'fs.default.name' value. I see that the Configuration object thinks it's loaded the hadoop-default.xml and hadoop-site.xml files, but no matter what I ask it for, I get a default back and not what's configured.

A working example of this would probably set me straight. I tried explicitly adding resources, hard-coding the full path names of the config files ... same thing.

--
Steve Sapovits
Invite Media - http://www.invitemedia.com
[EMAIL PROTECTED]
Re: Could we call hadoop a distributed OS?
It could be called middleware in a broad sense, but I don't think that is a good definition of Hadoop either. It is a collection of distributed tools for large-scale data processing.

On Tue, Mar 4, 2008 at 12:34 PM, wang daming <[EMAIL PROTECTED]> wrote:
> how about middleware?
>
> 2008/3/3, Amar Kamat <[EMAIL PROTECTED]>:
>> Hadoop is not a distributed OS. It requires some OS on which it can be run. It is also not an application: it is a platform for running applications on the grid. There are certain classes of applications (like the ones to do with the web) that can make use of this platform (service) to run data-intensive applications that have inherent parallelism, on the grid. In my view, the appropriate classification would be distributed computing software.
>> Amar
>>
>> On Mon, 3 Mar 2008, Steve Han wrote:
>>> Or is it just a distributed application? How can we learn more about the design ideas of Hadoop? Thanks
Re: Could we call hadoop a distributed OS?
how about middleware?

2008/3/3, Amar Kamat <[EMAIL PROTECTED]>:
> Hadoop is not a distributed OS. It requires some OS on which it can be run. It is also not an application: it is a platform for running applications on the grid. There are certain classes of applications (like the ones to do with the web) that can make use of this platform (service) to run data-intensive applications that have inherent parallelism, on the grid. In my view, the appropriate classification would be distributed computing software.
> Amar
>
> On Mon, 3 Mar 2008, Steve Han wrote:
>> Or is it just a distributed application? How can we learn more about the design ideas of Hadoop? Thanks
Hadoop / HDFS over WAN
How well does HDFS perform over WAN links? Any best practices to take into account? Thanks! Tom.
Re: org.apache.hadoop.dfs.NameNode: java.lang.NullPointerException
Hi Raghu, thx for filing :-) Btw. I sent Dhruba the requested namenode log yesterday. Hopefully it helps.

Cu on the 'net,
Bye - bye,
< André èrbnA >

Raghu Angadi wrote:
> filed https://issues.apache.org/jira/browse/HADOOP-2934
>
> Raghu.

André Martin wrote:
> Hi everyone, the namenode doesn't re-start properly:
>
> 2008-03-02 01:25:25,120 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = se09/141.76.xxx.xxx
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 2008-02-28_11-01-44
> STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/core/trunk -r 631915; compiled by 'hudson' on Thu Feb 28 11:11:52 UTC 2008
> ************************************************************/
> 2008-03-02 01:25:25,247 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing RPC Metrics with serverName=NameNode, port=8000
> 2008-03-02 01:25:25,254 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: se09.inf.tu-dresden.de/141.76.44.xxx:xxx
> 2008-03-02 01:25:25,257 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
> 2008-03-02 01:25:25,260 INFO org.apache.hadoop.dfs.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
> 2008-03-02 01:25:25,358 INFO org.apache.hadoop.fs.FSNamesystem: fsOwner=amartin,students
> 2008-03-02 01:25:25,359 INFO org.apache.hadoop.fs.FSNamesystem: supergroup=supergroup
> 2008-03-02 01:25:25,359 INFO org.apache.hadoop.fs.FSNamesystem: isPermissionEnabled=true
> 2008-03-02 01:25:29,887 ERROR org.apache.hadoop.dfs.NameNode: java.lang.NullPointerException
>     at org.apache.hadoop.dfs.FSImage.readINodeUnderConstruction(FSImage.java:950)
>     at org.apache.hadoop.dfs.FSImage.loadFilesUnderConstruction(FSImage.java:919)
>     at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:749)
>     at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:634)
>     at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223)
>     at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
>     at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:261)
>     at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:242)
>     at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
>     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
>     at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
>     at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:851)
>     at org.apache.hadoop.dfs.NameNode.main(NameNode.java:860)
> 2008-03-02 01:25:29,888 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NameNode at se09/141.76.xxx.xxx
> ************************************************************/
>
> Any ideas? Looks like a bug...
>
> Cu on the 'net,
> Bye - bye,
> < André èrbnA >
Re: Namenode fails to re-start after cluster shutdown
OK, that makes sense - thx!

Cu on the 'net,
Bye - bye,
< André èrbnA >

Konstantin Shvachko wrote:
>> Also, the namenode still says: "Upgrade for version -13 has been completed. Upgrade is not finalized." even 15 hours after launching it :-/
>
> You can -finalizeUpgrade if you don't need the previous version anymore.
> http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html#Upgrade+and+Rollback
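[For reference, finalizing is a one-liner from the command line, per the HDFS user guide linked above; run it only once you are sure you won't need to roll back:]

bin/hadoop dfsadmin -finalizeUpgrade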
quietmode in configuration.java
Hi, I am trying to set up Hadoop for some quick testing, and I've run into a small problem. My current configuration doesn't seem to get fully read, but no errors are thrown. A quick read of the source code turned up the quietmode variable, which seems to be the culprit. I'd like to set quietmode to false, but short of changing the source I can't find a way. Can anyone help? Till
Re: Pipes example wordcount-nopipe.cc failed when reading from input splits
Hi,

Here is some discussion on how to run wordcount-nopipe:
http://www.nabble.com/pipe-application-error-td13840804.html
It probably makes sense for your question.

Thanks,
Amareshwari

11 Nov. wrote:
> I traced into the C++ recordreader code:
>
>     WordCountReader(HadoopPipes::MapContext& context) {
>       std::string filename;
>       HadoopUtils::StringInStream stream(context.getInputSplit());
>       HadoopUtils::deserializeString(filename, stream);
>       struct stat statResult;
>       stat(filename.c_str(), &statResult);
>       bytesTotal = statResult.st_size;
>       bytesRead = 0;
>       cout << filename
>       ...
>
>> hi colleagues,
>> I have set up the single-node cluster to test the pipes examples. wordcount-simple and wordcount-part work just fine, but wordcount-nopipe can't run.
>> Here is my command line:
>>
>> bin/hadoop pipes -conf src/examples/pipes/conf/word-nopipe.xml -input input/ -output out-dir-nopipe1
>>
>> and here is the error message printed on my console:
>>
>> 08/03/03 23:23:06 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
>> 08/03/03 23:23:06 INFO mapred.FileInputFormat: Total input paths to process : 1
>> 08/03/03 23:23:07 INFO mapred.JobClient: Running job: job_200803032218_0004
>> 08/03/03 23:23:08 INFO mapred.JobClient: map 0% reduce 0%
>> 08/03/03 23:23:11 INFO mapred.JobClient: Task Id : task_200803032218_0004_m_00_0, Status : FAILED
>> java.io.IOException: pipe child exception
>>     at org.apache.hadoop.mapred.pipes.Application.abort(Application.java:138)
>>     at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:83)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)
>> Caused by: java.io.EOFException
>>     at java.io.DataInputStream.readByte(DataInputStream.java:250)
>>     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:313)
>>     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:335)
>>     at org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:112)
>> task_200803032218_0004_m_00_0: Hadoop Pipes Exception: failed to open at /home/hadoop/hadoop-0.15.2-single-cluster/src/examples/pipes/impl/wordcount-nopipe.cc:67 in WordCountReader::WordCountReader(HadoopPipes::MapContext&)
>>
>> Could anybody tell me how to fix this? That will be appreciated. Thanks a lot!