RE: OutOfMemoryError with map jobs
Hello Chris,

From the stack trace you provided, your OOM is probably due to HADOOP-3931, which is fixed in 0.17.2. It occurs when the deserialized key of an emitted record exactly fills the serialization buffer that collects map outputs, causing an allocation as large as that buffer. It causes an extra spill, throws an OOM exception if the task JVM's max heap size is too small to mask the bug, and will skip the combiner if you've defined one, but it won't drop records.

Ok, thanks for that information. I guess that means I will have to upgrade. :-) However, I was wondering: are these hard architectural limits? Say that I wanted to emit 25,000 maps for a single input record; would that mean that I will require huge amounts of (virtual) memory? In other words, what exactly is the reason that increasing the number of emitted maps per input record causes an OutOfMemoryError?

Do you mean the number of output records per input record in the map? The memory allocated for collecting records out of the map is (mostly) fixed at the size defined in io.sort.mb. The ratio of input records to output records does not affect the collection and sort. The number of output records can sometimes influence the memory requirements, but not significantly. -C

Ok, so I should not have to worry about this too much! Thanks for the reply and the information!

Regards,

Leon Mergen
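For readers tuning the same buffer: the collection buffer Chris refers to is sized by io.sort.mb. A minimal sketch of raising it for one job — the driver class and the values here are illustrative, not from the thread:

    JobConf conf = new JobConf(MyJob.class); // MyJob is a hypothetical driver class
    // io.sort.mb sizes the in-memory buffer that collects map output before
    // it spills to disk; larger values mean fewer spills but a bigger heap.
    conf.setInt("io.sort.mb", 200);
    // Leave the task JVM enough headroom for that buffer:
    conf.set("mapred.child.java.opts", "-Xmx512m");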
Re: no output from job run on cluster
Are you sure there isn't any error or exception in the logs?

2008/9/5, Shirley Cohen [EMAIL PROTECTED]:

Hi Dmitry,

Thanks for your suggestion. I checked, and the other systems on the cluster do seem to have Java installed. I was also able to run the job in standalone mode on the cluster. However, as soon as I add the other 15 nodes to the slaves file and re-run the job, the problem appears (i.e. there is zero output). I guess I was going to wait to see if anyone else has seen this problem before submitting a bug report.

Shirley

On Sep 4, 2008, at 1:37 PM, Dmitry Pushkarev wrote:

Hi, I'd check the Java version installed; that was the problem in my case, and surprisingly there was no output from Hadoop. If it helps, can you submit a bug report? :)

-Original Message-
From: Shirley Cohen [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 04, 2008 10:07 AM
To: core-user@hadoop.apache.org
Subject: no output from job run on cluster

Hi,

I'm running on hadoop-0.18.0. I have an m-r job that executes correctly in standalone mode. However, when run on a cluster, the same job produces zero output. It is very bizarre. I looked in the logs and couldn't find anything unusual. All I see are the usual deprecated filesystem name warnings. Has this ever happened to anyone? Do you have any suggestions on how I might go about diagnosing the problem?

Thanks,

Shirley

-- Sorry for my English!! 明
Re: Hadoop custom readers and writers
Seems the stuff in Nutch trunk is older; I have an updated version, which I have sent to you directly.

Amit K Singh wrote:

Thanks Dennis, I get it: you mean that the big ARC file was not split, and there was one map per ARC file. In the new code a single file can be split into multiple maps. Also, I noted that in ArcInputFormat getSplits is not overridden, so how do you make sure that the ARC file is not split? (The number-of-maps property in the config?)

It really shouldn't matter, unless you need all the maps from a given input file in the same part-x output file and a partitioner wouldn't work. You actually want it to be able to be broken up so it can scale properly.

Also, any pointers on the other two questions?

1) getSplits for TextInputFormat splits at arbitrary bytes. That might leave a truncated line shared between two mappers. How and where in the source code is that dealt with? Any pointers would be of great help.

The new code finds gzip boundaries and splits at those. It will actually start scanning forward from a split offset to find the next record. Anything before that point is handled by a different map task, which scans a little past its own split's end.

2) What is the class Record used for?

Record or RecordReader?

Dennis

Dennis Kubes-2 wrote:

We did something similar with the ARC format, where each record (webpage) is gzipped and then appended. It is not exactly the same, but it may help. Take a look at the following classes; they are in the Nutch trunk:

org.apache.nutch.tools.arc.ArcInputFormat
org.apache.nutch.tools.arc.ArcRecordReader

The way we did it was to create an InputFormat and RecordReader that extend FileInputFormat and read and uncompress the records on the fly. Unless your files are small, I would recommend going that route.

Dennis

Amit Simgh wrote:

Hi,

I have thousands of webpages, each represented as a serialized tree object, compressed (ZLIB) together (file sizes varying from 2.5 GB to 4.5 GB). I have to do some heavy text processing on these pages. What is the best way to read/access these pages?

Method 1
***
1) Write a custom splitter that
   1. uncompresses the file (2.5 GB to 4 GB) and then parses it (time: around 10 minutes), and
   2. splits the binary data into 10-20 parts.
2) Implement specific readers to read a page and present it to the mapper.

OR

Method 2
***
Read the entire file without splitting: one map task per file. Implement specific readers to read a page and present it to the mapper.

Slight detour: I was browsing through the code in FileInputFormat and TextInputFormat. In the getSplits method the file is broken at arbitrary byte boundaries. So in the case of TextInputFormat, what happens if the last line of a mapper's split is truncated (an incomplete byte sequence)? Is the truncated data lost or recovered? Can someone explain and give pointers to where and how this recovery happens in the code?

I also saw classes like Record. What are these used for?

Regds,
Amit S
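To make the boundary-finding idea concrete, here is a minimal sketch — not the actual ArcRecordReader — of scanning forward from an arbitrary split offset to the next gzip member, keyed on the GZIP magic bytes 0x1F 0x8B followed by the deflate method byte 0x08. Real readers validate further, since those bytes can also occur by chance inside compressed payload.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;

    // Sketch: find the first gzip member header at or after `start`.
    // Records before that offset belong to the previous split's reader,
    // which reads a little past its own end to finish its last record.
    public class GzipBoundaryScanner {
      public static long findNextGzipHeader(FSDataInputStream in, long start, long end)
          throws IOException {
        in.seek(start);
        int prev2 = -1, prev = -1;
        for (long pos = start; pos < end; pos++) {
          int b = in.read();
          if (b == -1) return -1;                    // hit EOF before a header
          if (prev2 == 0x1F && prev == 0x8B && b == 0x08) {
            return pos - 2;                          // offset of the 0x1F byte
          }
          prev2 = prev;
          prev = b;
        }
        return -1;                                   // no boundary in this range
      }
    }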
specifying number of nodes for job
Hi,

This may be a silly question, but I'm strangely having trouble finding an answer for it (perhaps I'm looking in the wrong places?). Suppose I have a cluster with n nodes, each with m processors. I wish to test the performance of, say, the wordcount program on k processors, where k is varied from k = 1 ... nm. How would I do this? I'm having trouble finding the proper command line option in the commands manual (http://hadoop.apache.org/core/docs/current/commands_manual.html).

Thank you very much for your time.

-SM
Basic code organization questions + scheduling
Hi,

I'm planning to use Hadoop for a set of typical crawler/indexer tasks. The basic flow is:

input: array of URLs
actions:
  1. get pages
  2. extract new URLs from pages -> start new job
     extract text -> index / filter (as new jobs)

What I'm considering is how I should build this application to fit into the map/reduce context. I'm thinking that steps 1 and 2 should be separate map/reduce tasks that then pipe things on to the next step. This is where I am a bit at a loss to see how it is smart to organize the code in logical units, and also how to spawn new tasks when an old one is over.

Is the usual way to control the flow of a set of tasks to have an external application running that listens for jobs ending via the endNotificationUri and then spawns new tasks, or should the job itself contain code to create new jobs? Would it be a good idea to use Cascading here?

I'm also considering how I should do job scheduling (I have a lot of recurring tasks). Has anyone found a good framework for job control of recurring tasks, or should I plan to build my own using Quartz?

Any tips/best practices with regard to the issues described above are most welcome. Feel free to ask further questions if you find my descriptions of the issues lacking.

Kind regards,

Tarjei
Failing MR jobs!
Hi!

I'm trying to run an MR job, but it keeps on failing and I can't understand why. Sometimes it shows output at 66% and sometimes at 98% or so. I had a couple of exceptions earlier that I didn't catch, which made the job fail. The log file from the task can be found at: http://pastebin.com/m4414d369 and the code looks like:

    //Java
    import java.io.*;
    import java.util.*;
    import java.net.*;
    //Hadoop
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;
    //HBase
    import org.apache.hadoop.hbase.*;
    import org.apache.hadoop.hbase.mapred.*;
    import org.apache.hadoop.hbase.io.*;
    import org.apache.hadoop.hbase.client.*; // org.apache.hadoop.hbase.client.HTable
    //Extra
    import org.apache.commons.cli.ParseException;
    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
    import org.apache.commons.httpclient.*;
    import org.apache.commons.httpclient.methods.*;
    import org.apache.commons.httpclient.params.HttpMethodParams;

    public class SerpentMR1 extends TableMap implements Mapper, Tool {

      //Setting DebugLevel
      private static final int DL = 0;

      //Setting up the variables for the MR job
      private static final String NAME = "SerpentMR1";
      private static final String INPUTTABLE = "sources";
      private final String[] COLS = {"content:feedurl", "content:ttl", "content:updated"};
      private Configuration conf;

      public JobConf createSubmittableJob(String[] args) throws IOException {
        JobConf c = new JobConf(getConf(), SerpentMR1.class);
        String jar = "/home/hbase/SerpentMR/" + NAME + ".jar";
        c.setJar(jar);
        c.setJobName(NAME);
        int mapTasks = 4;
        int reduceTasks = 20;
        c.setNumMapTasks(mapTasks);
        c.setNumReduceTasks(reduceTasks);
        String inputCols = "";
        for (int i = 0; i < COLS.length; i++) { inputCols += COLS[i] + " "; }
        TableMap.initJob(INPUTTABLE, inputCols, this.getClass(), Text.class, BytesWritable.class, c);
        //Classes between:
        c.setOutputFormat(TextOutputFormat.class);
        Path path = new Path("users"); //inserting into a temp table
        FileOutputFormat.setOutputPath(c, path);
        c.setReducerClass(MyReducer.class);
        return c;
      }

      public void map(ImmutableBytesWritable key, RowResult res,
          OutputCollector output, Reporter reporter) throws IOException {
        Cell cellLast = res.get(COLS[2].getBytes()); //lastupdate
        long oldTime = cellLast.getTimestamp();
        Cell cell_ttl = res.get(COLS[1].getBytes()); //ttl
        long ttl = StreamyUtil.BytesToLong(cell_ttl.getValue());
        byte[] url = null;
        long currTime = time.GetTimeInMillis();
        if (currTime - oldTime > ttl) { //only emit rows whose TTL has expired
          url = res.get(COLS[0].getBytes()).getValue(); //url
          output.collect(new Text(Base64.encode_strip(res.getRow())), new BytesWritable(url));
        }
      }

      public static class MyReducer implements Reducer { //org.apache.hadoop.mapred.Reducer
        private int timeout = 1000; //Sets the connection timeout in ms

        public void reduce(Object key, Iterator values,
            OutputCollector output, Reporter rep) throws IOException {
          HttpClient client = new HttpClient(); //new MultiThreadedHttpConnectionManager());
          client.getHttpConnectionManager().getParams().setConnectionTimeout(timeout);
          GetMethod method = null;
          int stat = 0;
          String content = "";
          byte[] colFam = "select".getBytes();
          byte[] column = "lastupdate".getBytes();
          byte[] currTime = null;
          HBaseRef hbref = new HBaseRef();
          JerlType sendjerl = null; //new JerlType();
          ArrayList jd = new ArrayList();
          InputStream is = null;
          while (values.hasNext()) {
            BytesWritable bw = (BytesWritable) values.next();
            String address = new String(bw.get());
            try {
              System.out.println(address);
              method = new GetMethod(address);
              method.setFollowRedirects(true);
            } catch (Exception e) {
              System.err.println("Invalid Address");
              e.printStackTrace();
            }
            if (method != null) {
              try {
                // Execute the method.
                stat = client.executeMethod(method);
                if (stat == 200) {
                  content = "";
                  is = (InputStream) (method.getResponseBodyAsStream());
                  //Write to HBase new stamp select:lastupdate
                  currTime = StreamyUtil.LongToBytes(time.GetTimeInMillis());
                  jd.add(new
task assignment management.
Dear Hadoop users,

Is it possible, without using Java, to manage task assignment so as to implement some simple rules? For example: do not launch more than one instance of a crawling task on a machine, do not run data-intensive tasks on remote machines, and do not run computationally intensive tasks on single-core machines, etc. Right now this is done by failing tasks that decided to run on the wrong machine, but I hope to find some solution on the jobtracker side.

---
Dmitry
Getting started questions
I've been reading up on Hadoop for a while now, and I'm excited that I'm finally getting my feet wet with the examples + my own variations. If anyone could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections, with the number of documents ranging from 10,000 to 10,000,000. What is the best way to store this data for effective processing?

- The bodies of the documents usually range from 1 KB to 100 KB in size, but some outliers can be as big as 4-5 GB.
- I will also need to store some metadata for each document, which I figure could be stored as JSON or XML.
- I'll typically filter on the metadata and then do standard operations on the bodies, like word frequency and searching.

Is there a canned FileInputFormat that makes sense? Should I roll my own? How can I access the bodies as streams so I don't have to read them into RAM all at once? Am I right in thinking that I should treat each document as a record and map across them, or do I need to be more creative in what I'm mapping across?

2. Some of the tasks I want to run are pure map operations (no reduction), where I'm calculating new metadata fields on each document. To end up with a good result set, I'll need to copy the entire input record + new fields into another set of output files. Is there a better way?

I haven't wanted to go down the HBase road because it can't handle very large values (for the bodies), and it seems to make the most sense to keep the document bodies together with the metadata, to allow for the greatest locality of reference on the datanodes.

3. I'm sure this is not a new idea, but I haven't seen anything regarding it... I'll need to run several MR jobs as a pipeline... is there any way for the map tasks in a subsequent stage to begin processing data from a previous stage's reduce task before that reducer has fully finished?

Whatever insight folks could lend me would be a big help in crossing the chasm from Word Count and the associated examples to something more real.

A whole heap of thanks in advance,

John
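One approach that comes up in the reply later in this digest is packing the documents into a SequenceFile so that maps iterate over records rather than over millions of small files. A minimal sketch, assuming Text keys and values; the ids, path, and metadata layout are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Sketch: write (document id, metadata + body) records into one
    // SequenceFile on DFS; a map task then receives one record per document.
    public class DocPacker {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = new SequenceFile.Writer(
            fs, conf, new Path("docs.seq"), Text.class, Text.class);
        try {
          writer.append(new Text("doc-00001"), new Text("{\"title\":\"a\"}\tbody text ..."));
          writer.append(new Text("doc-00002"), new Text("{\"title\":\"b\"}\tmore body text ..."));
        } finally {
          writer.close();
        }
      }
    }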
Re: specifying number of nodes for job
On Mon, Sep 8, 2008 at 2:25 AM, Sandy [EMAIL PROTECTED] wrote:

Hi,

This may be a silly question, but I'm strangely having trouble finding an answer for it (perhaps I'm looking in the wrong places?). Suppose I have a cluster with n nodes, each with m processors. I wish to test the performance of, say, the wordcount program on k processors, where k is varied from k = 1 ... nm.

You can specify the number of tasks for each node in your hadoop-site.xml file. So you can vary k as k = n, 2*n, ..., m*n rather than k = 1 ... nm.

How would I do this? I'm having trouble finding the proper command line option in the commands manual (http://hadoop.apache.org/core/docs/current/commands_manual.html).

Thank you very much for your time.

-SM

--
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
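A sketch of the hadoop-site.xml entries involved — property names as in the 0.18 line, values illustrative. Setting both maximums to t on every node yields roughly k = t*n concurrently running tasks of each kind:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value> <!-- at most 2 concurrent map tasks on this node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value> <!-- at most 2 concurrent reduce tasks on this node -->
    </property>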
Re: How to debug org.apache.hadoop.streaming.TestMultipleCachefiles.testMultipleCachefiles
You can put src on your classpath, because HttpServer scans the classpath for the webapps folder in order to start the Jetty web server.

Josh Ma
http://www.hadoop.org.cn

Abdul Qadeer wrote:

I am running a single-node Hadoop. If I try to debug org.apache.hadoop.streaming.TestMultipleCachefiles.testMultipleCachefiles, the following exception tells me that I haven't put webapps on the classpath. I have in fact put src/webapps on the classpath, so I was wondering what is wrong.

2008-09-06 20:03:28,183 INFO fs.FSNamesystem (FSNamesystem.java:setConfigurationParameters(409)) - fsOwner=personalpc\admin,None,root,Administrators,Users
2008-09-06 20:03:28,191 INFO fs.FSNamesystem (FSNamesystem.java:setConfigurationParameters(413)) - supergroup=supergroup
2008-09-06 20:03:28,191 INFO fs.FSNamesystem (FSNamesystem.java:setConfigurationParameters(414)) - isPermissionEnabled=true
2008-09-06 20:03:28,440 INFO common.Storage (FSImage.java:saveFSImage(897)) - Image file of size 90 saved in 0 seconds.
2008-09-06 20:03:28,510 INFO common.Storage (FSImage.java:format(952)) - Storage directory dfs\name1 has been successfully formatted.
2008-09-06 20:03:28,538 INFO common.Storage (FSImage.java:saveFSImage(897)) - Image file of size 90 saved in 0 seconds.
2008-09-06 20:03:28,590 INFO common.Storage (FSImage.java:format(952)) - Storage directory dfs\name2 has been successfully formatted.
2008-09-06 20:03:28,904 INFO metrics.RpcMetrics (RpcMetrics.java:init(56)) - Initializing RPC Metrics with hostName=NameNode, port=57467
2008-09-06 20:03:29,246 INFO namenode.NameNode (NameNode.java:initialize(156)) - Namenode up at: 127.0.0.1/127.0.0.1:57467
2008-09-06 20:03:29,290 INFO jvm.JvmMetrics (JvmMetrics.java:init(67)) - Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-09-06 20:03:29,352 INFO metrics.NameNodeMetrics (NameNodeMetrics.java:init(85)) - Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2008-09-06 20:03:29,417 INFO fs.FSNamesystem (FSNamesystem.java:setConfigurationParameters(409)) - fsOwner=personalpc\admin,None,root,Administrators,Users
2008-09-06 20:03:29,418 INFO fs.FSNamesystem (FSNamesystem.java:setConfigurationParameters(413)) - supergroup=supergroup
2008-09-06 20:03:29,418 INFO fs.FSNamesystem (FSNamesystem.java:setConfigurationParameters(414)) - isPermissionEnabled=true
2008-09-06 20:03:29,442 INFO metrics.FSNamesystemMetrics (FSNamesystemMetrics.java:init(60)) - Initializing FSNamesystemMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2008-09-06 20:03:29,444 INFO fs.FSNamesystem (FSNamesystem.java:registerMBean(4274)) - Registered FSNamesystemStatusMBean
2008-09-06 20:03:29,519 INFO common.Storage (FSImage.java:loadFSImage(748)) - Number of files = 0
2008-09-06 20:03:29,519 INFO common.Storage (FSImage.java:loadFilesUnderConstruction(1052)) - Number of files under construction = 0
2008-09-06 20:03:29,520 INFO common.Storage (FSImage.java:loadFSImage(683)) - Image file of size 90 loaded in 0 seconds.
2008-09-06 20:03:29,521 INFO common.Storage (FSEditLog.java:loadFSEdits(667)) - Edits file edits of size 4 edits # 0 loaded in 0 seconds.
2008-09-06 20:03:29,541 INFO common.Storage (FSImage.java:saveFSImage(897)) - Image file of size 90 saved in 0 seconds.
2008-09-06 20:03:29,581 INFO common.Storage (FSImage.java:saveFSImage(897)) - Image file of size 90 saved in 0 seconds.
2008-09-06 20:03:29,727 INFO fs.FSNamesystem (FSNamesystem.java:initialize(307)) - Finished loading FSImage in 373 msecs
2008-09-06 20:03:29,761 INFO hdfs.StateChange (FSNamesystem.java:leave(3795)) - STATE* Leaving safe mode after 0 secs.
2008-09-06 20:03:29,762 INFO hdfs.StateChange (FSNamesystem.java:leave(3804)) - STATE* Network topology has 0 racks and 0 datanodes
2008-09-06 20:03:29,762 INFO hdfs.StateChange (FSNamesystem.java:leave(3807)) - STATE* UnderReplicatedBlocks has 0 blocks
2008-09-06 20:03:30,173 ERROR fs.FSNamesystem (FSNamesystem.java:<init>(286)) - FSNamesystem initialization failed.
java.io.IOException: webapps not found in CLASSPATH
        at org.apache.hadoop.http.HttpServer.getWebAppsPath(HttpServer.java:157)
        at org.apache.hadoop.http.HttpServer.<init>(HttpServer.java:74)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:347)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:284)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:160)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:205)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:191)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:849)
        at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:270)
        at
Re: Basic code organization questions + scheduling
Hi Tarjei,

You should take a look at Nutch. It's a search engine built on Lucene, though it can be set up on top of Hadoop. Take a look:

http://lucene.apache.org/nutch/ -and- http://wiki.apache.org/nutch/NutchHadoopTutorial

Hope this helps!

Alex

On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse [EMAIL PROTECTED] wrote:

Hi,

I'm planning to use Hadoop for a set of typical crawler/indexer tasks. The basic flow is:

input: array of URLs
actions:
  1. get pages
  2. extract new URLs from pages -> start new job
     extract text -> index / filter (as new jobs)

What I'm considering is how I should build this application to fit into the map/reduce context. I'm thinking that steps 1 and 2 should be separate map/reduce tasks that then pipe things on to the next step. This is where I am a bit at a loss to see how it is smart to organize the code in logical units, and also how to spawn new tasks when an old one is over.

Is the usual way to control the flow of a set of tasks to have an external application running that listens for jobs ending via the endNotificationUri and then spawns new tasks, or should the job itself contain code to create new jobs? Would it be a good idea to use Cascading here?

I'm also considering how I should do job scheduling (I have a lot of recurring tasks). Has anyone found a good framework for job control of recurring tasks, or should I plan to build my own using Quartz?

Any tips/best practices with regard to the issues described above are most welcome. Feel free to ask further questions if you find my descriptions of the issues lacking.

Kind regards,

Tarjei
distcp failing
I'm attempting to load data into Hadoop (version 0.17.1) from a non-datanode machine in the cluster. I can run jobs, and copyFromLocal works fine, but when I try to use distcp I get the output below. I don't understand the error; can anyone help? Thanks.

blue:hadoop-0.17.1 mdidomenico$ time bin/hadoop distcp -overwrite file:///Users/mdidomenico/hadoop/1gTestfile /user/mdidomenico/1gTestfile
08/09/07 23:56:06 INFO util.CopyFiles: srcPaths=[file:/Users/mdidomenico/hadoop/1gTestfile]
08/09/07 23:56:06 INFO util.CopyFiles: destPath=/user/mdidomenico/1gTestfile1
08/09/07 23:56:07 INFO util.CopyFiles: srcCount=1
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.ipc.RemoteException: java.io.IOException: /tmp/hadoop-hadoop/mapred/system/job_200809072254_0005/job.xml: No such file or directory
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136)
        at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:175)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
        at org.apache.hadoop.ipc.Client.call(Client.java:557)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
        at $Proxy1.submitJob(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy1.submitJob(Unknown Source)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:758)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
        at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:604)
        at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763)
Re: Getting started questions
John Howland wrote:

I've been reading up on Hadoop for a while now, and I'm excited that I'm finally getting my feet wet with the examples + my own variations. If anyone could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections, with the number of documents ranging from 10,000 to 10,000,000. What is the best way to store this data for effective processing?

AFAIK Hadoop doesn't do well with (although it can handle) a large number of small files. So it would be better to read in the documents and store them in SequenceFile or MapFile format. This would be similar to the way the Fetcher works in Nutch. 10M documents in a sequence/map file on DFS is comparatively small and can be handled efficiently.

- The bodies of the documents usually range from 1 KB to 100 KB in size, but some outliers can be as big as 4-5 GB.

I would say store your documents as Text objects. I'm not sure if Text has a max size; I think it does, but I'm not sure what it is. If it does, you can always store them as BytesWritable, which is just an array of bytes. But you are going to have memory issues reading in and writing out records that large.

- I will also need to store some metadata for each document, which I figure could be stored as JSON or XML.
- I'll typically filter on the metadata and then do standard operations on the bodies, like word frequency and searching.

It is possible to create an OutputFormat that writes out multiple files. You could also use a MapWritable as the value to store the document and its associated metadata.

Is there a canned FileInputFormat that makes sense? Should I roll my own? How can I access the bodies as streams so I don't have to read them into RAM all at once?

A Writable is read into RAM, so even treating it like a stream doesn't get around that. One thing you might want to consider is to tar up, say, X documents at a time and store that as a file in DFS; you would have many of these files. Then have an index that holds the offsets of the files and their keys (document ids). That index can be passed as input to an MR job, which can then go to DFS and stream out each file as you need it. The job will be slower doing it this way, but it is a solution to handling such large documents as streams.

Am I right in thinking that I should treat each document as a record and map across them, or do I need to be more creative in what I'm mapping across?

2. Some of the tasks I want to run are pure map operations (no reduction), where I'm calculating new metadata fields on each document. To end up with a good result set, I'll need to copy the entire input record + new fields into another set of output files. Is there a better way? I haven't wanted to go down the HBase road because it can't handle very large values (for the bodies), and it seems to make the most sense to keep the document bodies together with the metadata, to allow for the greatest locality of reference on the datanodes.

If you don't specify a reducer, the IdentityReducer is run, which simply passes the output through.

3. I'm sure this is not a new idea, but I haven't seen anything regarding it... I'll need to run several MR jobs as a pipeline... is there any way for the map tasks in a subsequent stage to begin processing data from a previous stage's reduce task before that reducer has fully finished?

Yup, just use FileOutputFormat.getOutputPath(previousJobConf);

Dennis

Whatever insight folks could lend me would be a big help in crossing the chasm from Word Count and the associated examples to something more real.

A whole heap of thanks in advance,

John
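A minimal sketch of that hand-off with the mapred API of this era; the driver class, mappers, and paths are illustrative. Note the stages still run strictly one after the other — JobClient.runJob blocks until a job completes, so stage-two maps cannot overlap stage-one reduces:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch: chain two jobs so that stage two reads whatever stage one wrote.
    public class Pipeline {
      public static void main(String[] args) throws Exception {
        JobConf stage1 = new JobConf(Pipeline.class); // set mapper/reducer here
        FileInputFormat.setInputPaths(stage1, new Path("input"));       // illustrative
        FileOutputFormat.setOutputPath(stage1, new Path("stage1-out")); // illustrative
        JobClient.runJob(stage1); // blocks until stage one finishes

        JobConf stage2 = new JobConf(Pipeline.class);
        // Dennis's tip: reuse the previous job's output path as the next input.
        FileInputFormat.setInputPaths(stage2, FileOutputFormat.getOutputPath(stage1));
        FileOutputFormat.setOutputPath(stage2, new Path("stage2-out"));
        JobClient.runJob(stage2);
      }
    }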
Parhely (ORM for HBase) released!
Hi guys,

I've finally released the first draft of an ORM for HBase, named Parhely. Check it out at http://dev.tailsweep.com/

Kindly
//Marcus
Re: task assignment management.
No, that is not possible today. However, you might want to look at the TaskScheduler to see whether you can implement a scheduler that provides this kind of task assignment.

One point regarding computationally intensive tasks in current Hadoop: if a machine is not able to keep up with the rest of the machines (and the task on that machine is running slower than the others), speculative execution, if enabled, can help a lot. Also, implicitly, faster/better machines get more work than the slower machines.

On 9/8/08 3:27 AM, Dmitry Pushkarev [EMAIL PROTECTED] wrote:

Dear Hadoop users,

Is it possible, without using Java, to manage task assignment so as to implement some simple rules? For example: do not launch more than one instance of a crawling task on a machine, do not run data-intensive tasks on remote machines, and do not run computationally intensive tasks on single-core machines, etc. Right now this is done by failing tasks that decided to run on the wrong machine, but I hope to find some solution on the jobtracker side.

---
Dmitry
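For the record, a skeleton of what such a scheduler might look like. This is a sketch only: the pluggable TaskScheduler API (HADOOP-3412) lives in trunk at this point, not in 0.18, so class and method signatures may differ in your release, and the host rule shown is hypothetical.

    package org.apache.hadoop.mapred; // scheduler collaborators are package-private

    import java.io.IOException;
    import java.util.Collections;
    import java.util.List;

    // Sketch: refuse to hand work to trackers that fail a placement rule,
    // delegating everything else to the default FIFO queue scheduler.
    public class HostAwareTaskScheduler extends JobQueueTaskScheduler {
      @Override
      public synchronized List<Task> assignTasks(TaskTrackerStatus tracker)
          throws IOException {
        // Hypothetical rule: only hosts matching this suffix get tasks,
        // instead of letting tasks run there and fail.
        if (!tracker.getHost().endsWith(".local.example.com")) {
          return Collections.emptyList(); // no tasks for this heartbeat
        }
        return super.assignTasks(tracker);
      }
    }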