Re: Bean Scripting Framework?
Why don't you use Hadoop Streaming? - Original Message From: Lincoln Ritter <[EMAIL PROTECTED]> To: core-user Sent: Friday, July 25, 2008 1:10:20 AM Subject: Bean Scripting Framework? Hello all. Has anybody ever tried/considered using the Bean Scripting Framework within Hadoop? BSF seems nice since it allows "two-way" communication between ruby and java. I'd love to hear your thoughts as I've been trying to make this work to allow using ruby in the m/r pipeline. For now, I don't need a fully general solution, I'd just like to call some ruby in my map or reduce tasks. Thanks! -lincoln -- lincolnritter.com
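For reference, a minimal Hadoop Streaming invocation with Ruby scripts might look like the sketch below; the script names and HDFS paths are hypothetical, and the streaming jar's exact filename varies by release:

  bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
      -input /user/lincoln/input \
      -output /user/lincoln/output \
      -mapper map.rb \
      -reducer reduce.rb \
      -file map.rb \
      -file reduce.rb

Streaming feeds input records to the scripts on stdin and reads tab-separated key/value lines back on stdout, so the Ruby side needs no Java integration at all, at the cost of the "two-way" object access that BSF or embedded JRuby would give.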
Re: Bean Scripting Framework?
On Thu, Jul 24, 2008 at 3:51 PM, Lincoln Ritter <[EMAIL PROTECTED]> wrote: > Well that sounds awesome! It would be simply splendid to see what > you've got if you're willing to share. I'll be happy to share, but it's pretty much in pieces, not ready for release. I'll put it out with whatever license Hadoop itself uses (presumably Apache). > > Are you going the 'direct' embedding route or using a scripting framework > (BSF or javax.script)? JSR 223 is the way to go according to the JRuby guys at RailsConf last month. It's pretty straightforward - see http://wiki.jruby.org/wiki/Java_Integration#Java_6_.28using_JSR_223:_Scripting.29 -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
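For anyone trying the JSR 223 route, a minimal sketch of calling Ruby from Java looks like this; it assumes Java 6 with the JRuby engine jar on the classpath, and the variable names are illustrative:

  import javax.script.ScriptEngine;
  import javax.script.ScriptEngineManager;
  import javax.script.ScriptException;

  public class JRubyFromJava {
      public static void main(String[] args) throws ScriptException {
          ScriptEngine ruby = new ScriptEngineManager().getEngineByName("jruby");
          // Objects put into the engine show up in Ruby as global variables.
          ruby.put("record", "some,map,input");
          Object result = ruby.eval("$record.split(',').map { |f| f.upcase }.join('|')");
          System.out.println(result); // SOME|MAP|INPUT
      }
  }

Reusing one engine instance across map() calls matters in practice, since engine startup is expensive relative to a single record.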
Re: Name node heap space problem
Check how much memory is allocated for the JVM running the namenode. In the file HADOOP_INSTALL/conf/hadoop-env.sh you should change the line that starts with "export HADOOP_HEAPSIZE=1000". The value is in megabytes, so it's set to 1 GB by default. On Fri, Jul 25, 2008 at 2:51 AM, Gert Pfeifer <[EMAIL PROTECTED]> wrote: > Update on this one... > > I put some more memory in the machine running the name node. Now fsck is > running. Unfortunately ls fails with a time-out. > > I identified one directory that causes the trouble. I can run fsck on it > but not ls. > > What could be the problem? > > Gert > > Gert Pfeifer wrote: > > Hi, >> I am running a Hadoop DFS on a cluster of 5 data nodes with a name node >> and one secondary name node. >> >> I have 1788874 files and directories, 1465394 blocks = 3254268 total. >> Heap Size max is 3.47 GB. >> >> My problem is that I produce many small files. Therefore I have a cron >> job which just runs daily across the new files and copies them into >> bigger files and deletes the small files. >> >> Apart from this program, even a fsck kills the cluster. >> >> The problem is that, as soon as I start this program, the heap space of >> the name node reaches 100 %. >> >> What could be the problem? There are not many small files right now and >> still it doesn't work. I guess we have this problem since the upgrade to >> 0.17. >> >> Here is some additional data about the DFS: >> Capacity : 2 TB >> DFS Remaining : 1.19 TB >> DFS Used: 719.35 GB >> DFS Used% : 35.16 % >> >> Thanks for hints, >> Gert >> > >
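For example, giving the namenode JVM a 3 GB heap means editing that line as below; the value is illustrative and should be sized to the number of files and blocks in the filesystem:

  # conf/hadoop-env.sh
  # The maximum amount of heap to use, in MB. Default is 1000.
  export HADOOP_HEAPSIZE=3000

Note the setting applies to every Hadoop daemon started on that machine, not just the namenode.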
Need help to setup Hadoop on Fedora Core 6
Hello Folks. If somebody has successfully installed Hadoop on FC 6, please help!!! Just bootstrapping into the Hadoop madness and was attempting to install hadoop on Fedora Core 6. Tried all sorts of things but couldn't get past this error, which is not starting the reduce tasks:

2008-07-24 13:04:06,642 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200807241301_0001_r_00_0: java.lang.NullPointerException
  at java.util.Hashtable.get(Hashtable.java:334)
  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.fetchOutputs(ReduceTask.java:1103)
  at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:328)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

Before you ask, here are the details:
1. Running hadoop as a single node cluster
2. Disabled IPv6
3. Using Hadoop version hadoop-0.17.1
4. Enabled ssh to access the local machine
5. Master and Slaves are set to localhost
6. Created a simple sample file and loaded it into DFS
7. Encountered the error when running the sample with the wordcount example provided with the package
8. Here is my hadoop-site.xml:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
    <description>define mapred.map tasks to be number of slave hosts</description>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
    <description>define mapred.reduce tasks to be number of slave hosts</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1800m</value>
    <description>Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@[EMAIL PROTECTED]</description>
  </property>
</configuration>
Re: Bean Scripting Framework?
Well that sounds awesome! It would be simply splendid to see what you've got if you're willing to share. Are you going the 'direct' embedding route or using a scripting framework (BSF or javax.script)? -lincoln -- lincolnritter.com On Thu, Jul 24, 2008 at 3:42 PM, James Moore <[EMAIL PROTECTED]> wrote: > Funny you should mention it - I'm working on a framework to do JRuby > Hadoop this week. Something like: > > class MyHadoopJob < Radoop > input_format :text_input_format > output_format :text_output_format > map_output_key_class :text > map_output_value_class :text > > def mapper(k, v, output, reporter) ># ... > end > > def reducer(k, vs, output, reporter) > end > end > > Plus a java glue file to call the Ruby stuff. > > And then it jars up the ruby files, the gem directory, and goes from there. > > -- > James Moore | [EMAIL PROTECTED] > Ruby and Ruby on Rails consulting > blog.restphone.com >
Re: Bean Scripting Framework?
Funny you should mention it - I'm working on a framework to do JRuby Hadoop this week. Something like: class MyHadoopJob < Radoop input_format :text_input_format output_format :text_output_format map_output_key_class :text map_output_value_class :text def mapper(k, v, output, reporter) # ... end def reducer(k, vs, output, reporter) end end Plus a java glue file to call the Ruby stuff. And then it jars up the ruby files, the gem directory, and goes from there. -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
Re: Bean Scripting Framework?
Andreas, If you wouldn't mind posting some snippets that would be great! There seems to be a general lack of examples out there so pretty much anything would help. -lincoln -- lincolnritter.com On Thu, Jul 24, 2008 at 3:06 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > On Thursday 24 July 2008 23:24:19 Lincoln Ritter wrote: >> > Why not use jruby? >> >> Indeed! I'm basically working from the JRuby wiki page on Java >> integration (http://wiki.jruby.org/wiki/Java_Integration). I'm taking >> this one step at a time and, while I would love tighter integration, >> the recommended way is through the scripting frameworks. >> >> Right now, I'm most interested in taking some baby steps before going >> more general. I welcome any and all feedback/suggestions. Especially >> if you have tried this. I will post any results if there is interest, >> but mostly I am trying to accomplish a pretty small task and am not >> yet thinking about a more general solution. > > Guess I won't be a big resource for you then, the only thing that I did was > implementing a tar program with Jython that creates/extracts from/to HDFS. > > It was painful, but not too painful, and it's not Jython's fault, it's just that > using these clunky interfaces/classes is painful to a Python developer. Guess > the same feeling will come from Ruby developers. > > (and that's not a problem of Hadoop, I think that most Java APIs feel clunky > to people used to more powerful languages. :-P) > > Andreas >
Re: Trying to write to HDFS from mapreduce.
I think your conf is incorrectly set and your job was run locally. Also, have you done jobconf.setNumReduceTasks(0)? Try running some example jobs to test your setting. Nicholas Sze - Original Message > From: Erik Holstad <[EMAIL PROTECTED]> > To: core-user@hadoop.apache.org > Sent: Thursday, July 24, 2008 3:17:40 PM > Subject: Trying to write to HDFS from mapreduce. > > Hi! > I'm writing a mapreduce job where I want the output from the mapper to go > straight to the HDFS without passing the reduce method. Have been told that I can do: > c.setOutputFormat(TextOutputFormat.class); also added > Path path = new Path("user"); > FileOutputFormat.setOutputPath(c, path); > > But I still ended up with the result in the local filesystem instead. > > Regards Erik
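To make that concrete, a map-only job that writes straight to HDFS needs roughly the following configuration; this is a sketch, and the class name and output URI are hypothetical:

  JobConf c = new JobConf(MyJob.class);
  c.setNumReduceTasks(0);                       // map-only: mapper output is final, no sort/shuffle
  c.setOutputFormat(TextOutputFormat.class);
  // A relative path like "user" resolves against fs.default.name; with the
  // default local configuration that means the local filesystem.
  FileOutputFormat.setOutputPath(c, new Path("hdfs://localhost:54310/user/erik/out"));

Using an explicit hdfs:// URI (or making sure fs.default.name and mapred.job.tracker point at the cluster) is what keeps the output from silently landing on local disk.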
Re: can hadoop read files backwards
On Fri, Jul 18, 2008 at 2:06 PM, Miles Osborne <[EMAIL PROTECTED]> wrote: > unless you have a gigantic number of items with the same id, this is > straightforward. have a mapper emit items of the form: > > key=id, value = type,timestamp Or if you do have a large (by hadoop standards) number of items with the same id, use the timestamp + id for the key, emit one row for timestamp through timestamp + 5, and put a unique identifier in the row. I think you can get a guaranteed-unique id from mapred.task.id (but check me on that), and just add a counter to that:

ID   type   Timestamp
A1   X      1215647404
A1   Y      1215647408

becomes

1215647404/a1, x, uniqueidX
1215647405/a1, x, uniqueidX
1215647406/a1, x, uniqueidX
1215647407/a1, x, uniqueidX
1215647408/a1, x, uniqueidX
1215647408/a1, y, uniqueidY
1215647409/a1, y, uniqueidY
1215647410/a1, y, uniqueidY
etc.

If a key has a uniqueX, then write all the uniqueYs. Then the problem just becomes WordCount as a second pass. (Someone more clever than myself can probably do this in one pass...) Your mapper ends up spitting out 5x more rows, but your reducer has many fewer rows to keep in memory. At Hadoop scales, that might matter. -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
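A sketch of that window-expansion mapper, under the assumptions that input lines are tab-separated id/type/timestamp fields and that X events should stay visible for the following 5 seconds (the field layout and class name are illustrative):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class WindowMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String taskId;
    private long counter = 0;

    public void configure(JobConf conf) {
      taskId = conf.get("mapred.task.id"); // unique per task attempt
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] f = line.toString().split("\t"); // id, type, timestamp
      long ts = Long.parseLong(f[2]);
      String uid = taskId + "-" + (counter++);  // task id + counter => unique per record
      // Replicate X rows across the 5-second window; emit Y rows once.
      long span = "X".equals(f[1]) ? 5 : 0;
      for (long t = ts; t <= ts + span; t++) {
        out.collect(new Text(t + "/" + f[0]), new Text(f[1] + "," + uid));
      }
    }
  }

Each reducer key then groups one second of one id, holding at most a handful of values, which is the point of the scheme.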
Trying to write to HDFS from mapreduce.
Hi! I'm writing a mapreduce job where I want the output from the mapper to go straight to the HDFS without passing the reduce method. Have been told that I can do: c.setOutputFormat(TextOutputFormat.class); also added Path path = new Path("user"); FileOutputFormat.setOutputPath(c, path); But I still ended up with the result in the local filesystem instead. Regards Erik
Re: Bean Scripting Framework?
On Thursday 24 July 2008 23:24:19 Lincoln Ritter wrote: > > Why not use jruby? > > Indeed! I'm basically working from the JRuby wiki page on Java > integration (http://wiki.jruby.org/wiki/Java_Integration). I'm taking > this one step at a time and, while I would love tighter integration, > the recommended way is through the scripting frameworks. > > Right now, I'm most interested in taking some baby steps before going > more general. I welcome any and all feedback/suggestions. Especially > if you have tried this. I will post any results if there is interest, > but mostly I am trying to accomplish a pretty small task and am not > yet thinking about a more general solution. Guess I won't be a big resource for you then, the only thing that I did was implementing a tar program with Jython that creates/extracts from/to HDFS. It was painful, but not too painful, and it's not Jython's fault, it's just that using these clunky interfaces/classes is painful to a Python developer. Guess the same feeling will come from Ruby developers. (and that's not a problem of Hadoop, I think that most Java APIs feel clunky to people used to more powerful languages. :-P) Andreas
Re: Bean Scripting Framework?
> Why not use jruby? Indeed! I'm basically working from the JRuby wiki page on Java integration (http://wiki.jruby.org/wiki/Java_Integration). I'm taking this one step at a time and, while I would love tighter integration, the recommended way is through the scripting frameworks. Right now, I'm most interested in taking some baby steps before going more general. I welcome any and all feedback/suggestions. Especially if you have tried this. I will post any results if there is interest, but mostly I am trying to accomplish a pretty small task and am not yet thinking about a more general solution. -lincoln -- lincolnritter.com On Thu, Jul 24, 2008 at 1:58 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > On Thursday 24 July 2008 21:40:20 Lincoln Ritter wrote: >> Hello all. >> >> Has anybody ever tried/considered using the Bean Scripting Framework >> within Hadoop? BSF seems nice since it allows "two-way" communication >> between ruby and java. I'd love to hear your thoughts as I've been >> trying to make this work to allow using ruby in the m/r pipeline. For >> now, I don't need a fully general solution, I'd just like to call some >> ruby in my map or reduce tasks. > > Why not use jruby? AFAIK, there is a complete ruby implementation on top of > Java, and although I have not used it, I'd presume that it allows full usage > of Java classes, as Jython does. > > Andreas >
Re: Bean Scripting Framework?
On Thursday 24 July 2008 21:40:20 Lincoln Ritter wrote: > Hello all. > > Has anybody ever tried/considered using the Bean Scripting Framework > within Hadoop? BSF seems nice since it allows "two-way" communication > between ruby and java. I'd love to hear your thoughts as I've been > trying to make this work to allow using ruby in the m/r pipeline. For > now, I don't need a fully general solution, I'd just like to call some > ruby in my map or reduce tasks. Why not use jruby? AFAIK, there is a complete ruby implementation on top of Java, and although I have not used it, I'd presume that it allows full usage of Java classes, as Jython does. Andreas
Re: can hadoop read files backwards
never mind i got it. Elia Mazzawi wrote: [quoted thread trimmed]
Re: Hadoop and Ganglia Metrics
Ah, yeah, I found that one. :) Patching 'java/org/apache/hadoop/mapred/JobInProgress.java' on 0.17.1. -joe Jason Venner wrote: I have only applied this patch as far forward as 0.16.0 [earlier quoted messages trimmed]
Re: Hadoop and Ganglia Metrics
I have only applied this patch as far forward as 0.16.0 Joe Williams wrote: Sweet, thanks. [earlier quoted messages trimmed]
Hadoop DFS
Hi, I am new to Hadoop. Right now, I am only interested in working with the Hadoop DFS. Can someone guide me where to start? Does anyone have information about applications that have already integrated the Hadoop DFS? Any information regarding material about Hadoop DFS (case studies, articles, books, etc.) would be very nice. Thanks, Wasim
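As a concrete starting point, HDFS is usable directly from Java through the FileSystem API. A minimal sketch (the path is hypothetical, and the cluster's conf directory must be on the classpath so fs.default.name points at the namenode rather than the local filesystem):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsHello {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();  // reads hadoop-site.xml from the classpath
          FileSystem fs = FileSystem.get(conf);      // HDFS if fs.default.name is hdfs://...
          Path p = new Path("/user/wasim/hello.txt");
          FSDataOutputStream out = fs.create(p);
          out.writeUTF("hello, dfs");
          out.close();
          System.out.println("exists: " + fs.exists(p));
      }
  }

The same API backs the bin/hadoop fs -put / -get / -ls shell commands, which are the quickest way to poke at a running DFS.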
about the overhead
Hi all, Does hadoop provide a way to let users know the time spent in computation (the map/reduce functions) and the time spent in different types of overhead (such as startup, sorting, disk I/O, etc.) respectively? Thanks~~ Best regards, -- --- Wei
Re: Hadoop and Ganglia Metrics
Sweet, thanks. Jason Venner wrote: Once the patch is applied you should start seeing the ganglia metrics. We do. [earlier quoted messages trimmed]
Re: hadoop 0.17.1 reducer not fetching map output problem
On Thursday 24 July 2008 21:40:22 Devaraj Das wrote: > On 7/25/08 12:09 AM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote: > > On Thursday 24 July 2008 15:19:22 Devaraj Das wrote: > >> Could you try to kill the tasktracker hosting the task the next time > >> when it happens? I just want to isolate the problem - whether it is a > >> problem in the TT-JT communication or in the Task-TT communication. From > >> your description it looks like the problem is between the JT-TT > >> communication. But pls run the experiment when it happens again and let > >> us know what happens. > > > > Well, I did restart the tasktracker where the reduce job was running, but > > that led only to a situation where the jobtracker did not restart the > > job, showed it as still running, and was not able to kill the reduce task > > via hadoop job -kill-task nor -fail-task. > > The reduce task would eventually be reexecuted (after some timeout, > defaulting to 10 minutes, the tasktracker would be assumed as lost and all > reducers that were running on that node would be reexecuted). > > > I hope to avoid a repeat, I'll be relapsing our cluster to 0.15 today. A > > peer at another startup confirmed the whole batch of problems I've been > > experiencing, and for him 0.15 works for production. > > > > > > No question, 0.17 is way better than 0.16, on the other hand I wonder how > > 0.16 could get released? (I'm using streaming.jar, and with 0.16.x I've > > introduced reducing to our workloads, and before 0.16 failed >80% of the > > jobs with reducers not being able to get their output. 0.17.0 improved > > that to a point where one can, with some pain, e.g. restarting the > > cluster daily, not storing anything important on HDFS, only temporary > > data, ..., use it somehow for production, at least for small jobs.) So > > one wonders how 0.16 got released? Or was it meant only as developer-only > > bug fixing series? > > > > Pls raise jiras for the specific problems. I know, that's why I bracketed it as rantmode. OTOH, many of these issues had either this creepy feeling where you wondered if you did something wrong or were issues where I had to react relatively quickly, which usually destroys the faulty state. (I know, as a developer having reproduced a bug is golden. As an admin asked about processing lag, it's rather the opposite.) Plus fixing the issue in the next release or even via a patch means that I have a non-working cluster till then. Now that means I would need to start debugging the cluster utility software instead of our apps. ;( Andreas
Re: Hadoop and Ganglia Metrics
Once the patch is applied you should start seeing the ganglia metrics. We do. Joe Williams wrote: Once I have the patch applied and have it running, should I see the metrics? Or do I need to do additional work? Thanks. -Joe [earlier quoted messages trimmed]
Re: hadoop 0.17.1 reducer not fetching map output problem
On 7/25/08 12:09 AM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote: > On Thursday 24 July 2008 15:19:22 Devaraj Das wrote: >> Could you try to kill the tasktracker hosting the task the next time when >> it happens? I just want to isolate the problem - whether it is a problem in >> the TT-JT communication or in the Task-TT communication. From your >> description it looks like the problem is between the JT-TT communication. >> But pls run the experiment when it happens again and let us know what >> happens. > > Well, I did restart the tasktracker where the reduce job was running, but that > led only to a situation where the jobtracker did not restart the job, showed > it as still running, and was not able to kill the reduce task via hadoop > job -kill-task nor -fail-task. The reduce task would eventually be reexecuted (after some timeout, defaulting to 10 minutes, the tasktracker would be assumed as lost and all reducers that were running on that node would be reexecuted). > > I hope to avoid a repeat, I'll be relapsing our cluster to 0.15 today. A peer > at another startup confirmed the whole batch of problems I've been > experiencing, and for him 0.15 works for production. > > > No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16 > could get released? (I'm using streaming.jar, and with 0.16.x I've introduced > reducing to our workloads, and before 0.16 failed >80% of the jobs with > reducers not being able to get their output. 0.17.0 improved that to a point > where one can, with some pain, e.g. restarting the cluster daily, not storing > anything important on HDFS, only temporary data, ..., use it somehow for > production, at least for small jobs.) So one wonders how 0.16 got released? > Or was it meant only as developer-only bug fixing series? > > Pls raise jiras for the specific problems. > Sorry, this has been driving me up the walls into an asylum till I compared > notes with a colleague, and decided that I'm not crazy ;) > > Andreas > >> >> Thanks, >> Devaraj >> >> On 7/24/08 1:42 PM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote: >>> Hi! >>> >>> I'm experiencing hung reducers, with the following symptoms: [quoted task logs trimmed]
Bean Scripting Framework?
Hello all. Has anybody ever tried/considered using the Bean Scripting Framework within Hadoop? BSF seems nice since it allows "two-way" communication between ruby and java. I'd love to hear your thoughts as I've been trying to make this work to allow using ruby in the m/r pipeline. For now, I don't need a fully general solution, I'd just like to call some ruby in my map or reduce tasks. Thanks! -lincoln -- lincolnritter.com
Re: Hadoop and Ganglia Metrics
Once I have the patch applied and have it running, should I see the metrics? Or do I need to do additional work? Thanks. -Joe Jason Venner wrote: I applied the patch in the jira to my distro [earlier quoted messages trimmed]
Re: hadoop 0.17.1 reducer not fetching map output problem
On Thursday 24 July 2008 15:19:22 Devaraj Das wrote: > Could you try to kill the tasktracker hosting the task the next time when > it happens? I just want to isolate the problem - whether it is a problem in > the TT-JT communication or in the Task-TT communication. From your > description it looks like the problem is between the JT-TT communication. > But pls run the experiment when it happens again and let us know what > happens. Well, I did restart the tasktracker where the reduce job was running, but that led only to a situation where the jobtracker did not restart the job, showed it as still running, and was not able to kill the reduce task via hadoop job -kill-task nor -fail-task. I hope to avoid a repeat, I'll be relapsing our cluster to 0.15 today. A peer at another startup confirmed the whole batch of problems I've been experiencing, and for him 0.15 works for production. No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16 could get released? (I'm using streaming.jar, and with 0.16.x I've introduced reducing to our workloads, and before 0.16 failed >80% of the jobs with reducers not being able to get their output. 0.17.0 improved that to a point where one can, with some pain, e.g. restarting the cluster daily, not storing anything important on HDFS, only temporary data, ..., use it somehow for production, at least for small jobs.) So one wonders how 0.16 got released? Or was it meant only as developer-only bug fixing series? Sorry, this has been driving me up the walls into an asylum till I compared notes with a colleague, and decided that I'm not crazy ;) Andreas > > Thanks, > Devaraj > > On 7/24/08 1:42 PM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote: > > Hi! > > > > I'm experiencing hung reducers, with the following symptoms: [quoted task logs trimmed]
Re: can hadoop read files backwards
I need some help with the implementation, to have the mapper produce key=id, value = type,timestamp, which is essentially string, string. What do i give output.collect for the Value? i want to store type, timestamp, but it only takes a single Text and i want to store a pair of them, or what can i store in there. here is my mapper, which doesn't work because output.collect doesn't want the pair:

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private Text Key = new Text();
    private Text Value = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        // line is parsed and now i have 2 strings
        // String S1; // contains the key
        // String S2; // contains the value
        Key.set(S1);
        Value.set(S2);
        output.collect(Key, Value);
    }
}

Miles Osborne wrote: unless you have a gigantic number of items with the same id, this is straightforward. have a mapper emit items of the form: key=id, value = type,timestamp and your reducer will then see all ids that have the same value together. it is then a simple matter to process all items with the same id. for example, you could simply read them into a list and work on them in any manner you see fit. (note that hadoop is perfectly fine at dealing with multi-line items. all you need do is make sure that the items you want to process together all share the same key) Miles 2008/7/18 Elia Mazzawi <[EMAIL PROTECTED]>: well here is the problem I'm trying to solve, I have a data set that looks like this:

ID   type   Timestamp
A1   X      1215647404
A2   X      1215647405
A3   X      1215647406
A1   Y      1215647409

I want to count how many A1 Y show up within 5 seconds of an A1 X. I was planning to have the data sorted by ID then timestamp, then read it backwards, (or have it sorted by reverse timestamp) go through it caching all Y's for the same ID for 5 seconds to either find a matching X or not. the results don't need to be 100% accurate. so if hadoop gives the same file with the same lines in order then this will work. seems hadoop is really good at solving problems that depend on 1 line at a time? but not multi lines? hadoop has to get data in order, and be able to work on multi lines, otherwise how can it be setting records in data sorts. I'd appreciate other suggestions to go about doing this. Jim R. Wilson wrote: does wordcount get the lines in order? or are they random? can i have hadoop return them in reverse order? You can't really depend on the order that the lines are given - it's best to think of them as random. The purpose of MapReduce/Hadoop is to distribute a problem among a number of cooperating nodes. The idea is that any given line can be interpreted separately, completely independent of any other line. So in wordcount, this makes sense. For example, say you and I are nodes. Each of us gets half the lines in a file and we can count the words we see and report on them - it doesn't matter what order we're given the lines, or which lines we're given, or even whether we get the same number of lines (if you're faster at it, or maybe you get shorter lines, you may get more lines to process in the interest of saving time). So if the project you're working on requires getting the lines in a particular order, then you probably need to rethink your approach. It may be that hadoop isn't right for your problem, or maybe that the problem just needs to be attacked in a different way. Without knowing more about what you're trying to achieve, I can't offer any specifics. Good luck!
-- Jim On Thu, Jul 17, 2008 at 4:41 PM, Elia Mazzawi <[EMAIL PROTECTED]> wrote: I have a program based on wordcount.java and I have files that are smaller than 64mb files (so i believe each file is one task ) does wordcount get the lines in order? or are they random? can i have hadoop return them in reverse order? Jim R. Wilson wrote: It sounds to me like you're talking about hadoop streaming (correct me if I'm wrong there). In that case, there's really no "order" to the lines being doled out as I understand it. Any given line could be handed to any given mapper task running on any given node. I may be wrong, of course, someone closer to the project could give you the right answer in that case. -- Jim R. Wilson (jimbojw) On Thu, Jul 17, 2008 at 4:06 PM, Elia Mazzawi <[EMAIL PROTECTED]> wrote: is there a way to have hadoop hand over the lines of a file backwards to my mapper ? as in give the last line first.
Anybody used AppNexus for hosting Hadoop app?
I discovered AppNexus yesterday. They offer hosting similar to Amazon EC2, with apparently more dedicated hardware and a better notion of where things are in the datacenter. Their web site says they are optimized for Hadoop applications. Anybody tried and could give some feedback? J.
Re: Name node heap space problem
Update on this one... I put some more memory in the machine running the name node. Now fsck is running. Unfortunately ls fails with a time-out. I identified one directory that causes the trouble. I can run fsck on it but not ls. What could be the problem? Gert Gert Pfeifer wrote: Hi, I am running a Hadoop DFS on a cluster of 5 data nodes with a name node and one secondary name node. I have 1788874 files and directories, 1465394 blocks = 3254268 total. Heap Size max is 3.47 GB. My problem is that I produce many small files. Therefore I have a cron job which just runs daily across the new files and copies them into bigger files and deletes the small files. Apart from this program, even a fsck kills the cluster. The problem is that, as soon as I start this program, the heap space of the name node reaches 100 %. What could be the problem? There are not many small files right now and still it doesn't work. I guess we have this problem since the upgrade to 0.17. Here is some additional data about the DFS: Capacity : 2 TB DFS Remaining : 1.19 TB DFS Used: 719.35 GB DFS Used% : 35.16 % Thanks for hints, Gert
Any way to order all the output folders?
Hi All, There are 30 output folders from Hadoop. Each folder is in ascending order, but the order is not ascending among folders; e.g. the values are 1, 5, 10 in folder A and 6, 8, 9 in folder B. My question is how to enforce the order among all the folders as well, e.g. output values 1, 5, 6 in folder A and 8, 9, 10 in folder B. I just started to learn Hadoop and hope you can help me. :) Thanks Shane
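One common approach to a total order across reduce outputs (assuming the 30 folders correspond to the job's reduce partitions) is a custom Partitioner that assigns contiguous key ranges to reducers, so partition i holds only keys smaller than those in partition i+1. A minimal sketch, assuming IntWritable keys with a known upper bound; in practice the range boundaries would come from sampling the data:

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class RangePartitioner implements Partitioner<IntWritable, Text> {
      private static final int MAX_KEY = 1000; // hypothetical upper bound on key values

      public void configure(JobConf job) { }

      // Split [0, MAX_KEY) into numPartitions contiguous ranges, smallest keys first.
      public int getPartition(IntWritable key, Text value, int numPartitions) {
          int p = (int) ((long) key.get() * numPartitions / MAX_KEY);
          return Math.min(Math.max(p, 0), numPartitions - 1);
      }
  }

Registered with conf.setPartitionerClass(RangePartitioner.class), each part file is then internally sorted by the framework and disjoint from the next, so their concatenation is globally ordered.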
Re: Hadoop and Ganglia Metrics
I applied the patch in the jira to my distro Joe Williams wrote: Thanks Jason, until this is implemented, how are you pulling stats from Hadoop? -joe [earlier quoted messages trimmed]
Re: Hadoop and Ganglia Metrics
Thanks Jason, until this is implemented, how are you pulling stats from Hadoop? -joe Jason Venner wrote: Check out https://issues.apache.org/jira/browse/HADOOP-3422 Joe Williams wrote: I have been attempting to get Hadoop metrics into Ganglia and have been unsuccessful thus far. I have seen this thread (http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) but it didn't help much. I have set up my properties file like so:

[EMAIL PROTECTED] current]# cat conf/hadoop-metrics.properties
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649

And if I 'telnet 127.0.0.1 8649' I receive the Ganglia XML metrics output without any hadoop specific metrics:

[EMAIL PROTECTED] current]# telnet 127.0.0.1 8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
<!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
<!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
<!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--

Is there more I need to do to get the metrics to show up in this output, am I doing something incorrectly? Do I need to have a gmetric script run in a cron to update the stats? If so, does anyone have a hadoop specific example of this? Any info would be helpful. Thanks. -Joe -- Name: Joseph A. Williams Email: [EMAIL PROTECTED]
Re: How to write one file per key as mapreduce output
On Tue, Jul 22, 2008 at 8:04 PM, Lincoln Ritter <[EMAIL PROTECTED]> wrote: > I have what I think is a pretty straight-forward, noobie question. I > would like to write one file per key in the reduce (or map) phase of a > mapreduce job. I have looked at the documentation for > FileOutputFormat and MultipleTextOutputFormat but am a bit unclear on > how to use it/them. Can anybody give me a quick pointer? Hi Lincoln, I do something like this to dump my records out, one per file, for debugging. This may not be "correct" because it writes the files as side-effects of the job, but hey, it works. It looks something like this:

public static class MyMap extends MapReduceBase
    implements Mapper<VIntWritable, Text, Text, Text> { // output types unused here; Text chosen for concreteness
  private JobConf conf;

  public void configure(JobConf conf) {
    this.conf = conf;
  }

  public void map(VIntWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Write each record as a side-effect file named after its key, under the
    // task's work output path (promoted to the job output directory on success).
    FileSystem fs = FileSystem.get(conf);
    Path workPath = FileOutputFormat.getWorkOutputPath(conf);
    Path filePath = new Path(workPath, key.toString());
    OutputStream out = fs.create(filePath);
    /* ... write value to out ... */
    out.close();
  }
}
Re: hadoop 0.17.1 reducer not fetching map output problem
Could you try to kill the tasktracker hosting the task the next time it happens? I just want to isolate the problem - whether it is a problem in the TT-JT communication or in the Task-TT communication. From your description it looks like the problem is in the JT-TT communication. But please run the experiment when it happens again and let us know what happens. Thanks, Devaraj

On 7/24/08 1:42 PM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote:
> Hi!
>
> I'm experiencing hung reducers, with the following symptoms:
>
>> Task Logs: 'task_200807230647_0008_r_09_1'
>>
>> stdout logs
>>
>> stderr logs
>>
>> syslog logs
>>
>> red.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:11,064 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>
> Notice how it needs 6 map outputs, all map tasks have finished, and it still just hangs there.
Re: Using MapReduce to do table comparing.
Yes, I think this is the simplest method, but there are problems too:

1. The reduce stage wouldn't begin until the map stage ends, by which time we have already scanned both tables, and the comparing will take almost the same time, because about 90% of the intermediate keys will have two values. If I could specify a number n such that the reduce tasks start once there are n intermediate pairs with the same key, that would be better; in my case I would set the magic number to 2.

2. I am not sure how Hadoop stores intermediate pairs; we could not afford to keep them in memory as the data volume increases.

--
From: "James Moore" <[EMAIL PROTECTED]>
Sent: Thursday, July 24, 2008 1:12 AM
To: 
Subject: Re: Using MapReduce to do table comparing.

> On Wed, Jul 23, 2008 at 7:33 AM, Amber <[EMAIL PROTECTED]> wrote:
>> We have a 10 million row table exported from an AS400 mainframe every day. The table is exported as a csv text file, which is about 30GB in size; then the csv file is imported into a RDBMS table which is dropped and recreated every day. Now we want to find how many rows are updated during each export-import interval. The table has a primary key, so deletes and inserts can be found using RDBMS joins quickly, but we must do a column-to-column comparison in order to find the differences between rows (about 90%) with the same primary keys. Our goal is to find a comparing process which takes no more than 10 minutes with a 4-node cluster, each server in which has 4 4-core 3.0 GHz CPUs, 8GB memory and a 300G local RAID5 array.
>>
>> Below is our current solution:
>> The old data is kept in the RDBMS with an index created on the primary key; the new data is imported into HDFS as the input file of our Map-Reduce job. Every map task connects to the RDBMS database and selects old data from it for every row; map tasks will generate outputs if differences are found, and there are no reduce tasks.
>>
>> As you can see, with the number of concurrent map tasks increasing, the RDBMS database will become the bottleneck, so we want to kick out the RDBMS, but we have no idea how to retrieve the old row with a given key quickly from HDFS files. Any suggestion is welcome.
>
> Think of map/reduce as giving you a kind of key/value lookup for free - it just falls out of how the system works.
>
> You don't care about the RDBMS. It's a distraction - you're given a set of csv files with unique keys and dates, and you need to find the differences between them.
>
> Say the data looks like this:
>
> File for jul 10:
> 0x1,stuff
> 0x2,more stuff
>
> File for jul 11:
> 0x1,stuff
> 0x2,apples
> 0x3,parrot
>
> Preprocess the csv files to add dates to the values:
>
> File for jul 10:
> 0x1,20080710,stuff
> 0x2,20080710,more stuff
>
> File for jul 11:
> 0x1,20080711,stuff
> 0x2,20080711,apples
> 0x3,20080711,parrot
>
> Feed two days worth of these files into a hadoop job.
>
> The mapper splits these into k=0x1, v=20080710,stuff etc.
>
> The reducer gets one or two v's per key, and each v has the date embedded in it - that's essentially your lookup step.
>
> You'll end up with a system that can do compares for any two dates, and could easily be expanded to do all sorts of deltas across these files.
>
> The preprocess-the-files-to-add-a-date can probably be included as part of your mapper and isn't really a separate step - it just depends on how easy it is to use one of the off-the-shelf mappers with your data. If it turns out to be its own step, it can become a very simple hadoop job.
>
> --
> James Moore | [EMAIL PROTECTED]
> Ruby and Ruby on Rails consulting
> blog.restphone.com
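A minimal sketch of the job James describes (assuming the 0.17-era mapred API and well-formed "key,date,value" input lines; the class names are illustrative, not code from this thread):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Mapper: split "key,date,value" lines into (key, "date,value") pairs.
public class DiffMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String s = line.toString();
        int comma = s.indexOf(',');
        out.collect(new Text(s.substring(0, comma)),
                    new Text(s.substring(comma + 1)));
    }
}

// Reducer: each key arrives with one or two "date,value" strings; strip
// the date prefix and compare the payloads.
public class DiffReduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        String first = values.next().toString();
        if (!values.hasNext()) {
            // Present on only one date: an insert or a delete, depending
            // on which date the surviving record carries.
            out.collect(key, new Text("ONLY:" + first));
        } else {
            String second = values.next().toString();
            String v1 = first.substring(first.indexOf(',') + 1);
            String v2 = second.substring(second.indexOf(',') + 1);
            if (!v1.equals(v2)) {
                out.collect(key, new Text("UPDATED"));
            }
        }
    }
}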
Re: Hadoop and Ganglia Metrics
Check out https://issues.apache.org/jira/browse/HADOOP-3422

Joe Williams wrote: I have been attempting to get Hadoop metrics in Ganglia and have been unsuccessful thus far. I have seen this thread (http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200712.mbox/raw/[EMAIL PROTECTED]/) but it didn't help much. I have set up my properties file like so:

[EMAIL PROTECTED] current]# cat conf/hadoop-metrics.properties
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=127.0.0.1:8649
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=127.0.0.1:8649

And if I 'telnet 127.0.0.1 8649' I receive the Ganglia XML metrics output without any Hadoop-specific metrics:

[EMAIL PROTECTED] current]# telnet 127.0.0.1 8649
Trying 127.0.0.1...
Connected to localhost (127.0.0.1).
Escape character is '^]'.
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<!DOCTYPE GANGLIA_XML [
<!ELEMENT GANGLIA_XML (GRID|CLUSTER|HOST)*>
<!ATTLIST GANGLIA_XML VERSION CDATA #REQUIRED>
<!ATTLIST GANGLIA_XML SOURCE CDATA #REQUIRED>
--SNIP--

Is there more I need to do to get the metrics to show up in this output, or am I doing something incorrectly? Do I need to have a gmetric script run in a cron job to update the stats? If so, does anyone have a Hadoop-specific example of this? Any info would be helpful. Thanks. -Joe
Re: Using MapReduce to do table comparing.
I agree that this is an acceptable method if the time spent exporting data from the RDBMS, importing the file into HDFS, and then importing the data into the RDBMS again is considered as well, but it is a single-process/single-thread method. BTW, can you tell me how long it takes your method to process those 130 million rows, how big the data volume is, and how powerful your physical machines are? Thanks a lot!

--
From: "Michael Lee" <[EMAIL PROTECTED]>
Sent: Thursday, July 24, 2008 11:51 AM
To: 
Subject: Re: Using MapReduce to do table comparing.

> Amber wrote:
>> We have a 10 million row table exported from an AS400 mainframe every day. The table is exported as a csv text file, which is about 30GB in size; then the csv file is imported into a RDBMS table which is dropped and recreated every day. Now we want to find how many rows are updated during each export-import interval. The table has a primary key, so deletes and inserts can be found using RDBMS joins quickly, but we must do a column-to-column comparison in order to find the differences between rows (about 90%) with the same primary keys. Our goal is to find a comparing process which takes no more than 10 minutes with a 4-node cluster, each server in which has 4 4-core 3.0 GHz CPUs, 8GB memory and a 300G local RAID5 array.
>>
>> Below is our current solution:
>> The old data is kept in the RDBMS with an index created on the primary key; the new data is imported into HDFS as the input file of our Map-Reduce job. Every map task connects to the RDBMS database and selects old data from it for every row; map tasks will generate outputs if differences are found, and there are no reduce tasks.
>>
>> As you can see, with the number of concurrent map tasks increasing, the RDBMS database will become the bottleneck, so we want to kick out the RDBMS, but we have no idea how to retrieve the old row with a given key quickly from HDFS files. Any suggestion is welcome.
>
> 10 million is not bad. I do this all the time in UDB 8.1 - multiple key columns and multiple value columns - and calculate deltas: insert, delete and update.
>
> What others have suggested works (I tried a very crude version of what James Moore suggested in Hadoop with 70+ million records) but you have to remember there are other costs (dumping out files, putting them into HDFS, etc.). It might be better to process straight in the database or do straight file processing. Also, the key is avoiding transactions.
>
> If you are doing it outside of the database...
>
> You have 'old.csv' and 'new.csv', both sorted by primary key (when you extract, make sure you do ORDER BY). In your application, you open two file handles and read one line at a time. Create the keys. If the keys are the same, you compare the two lines to see if they are the same. If the keys are not the same, you have to find out the natural order - it can be an insert or a delete. Once you decide, you read another line (for an insert/delete you only read one line, from one of the files).
>
> Here is the pseudo code:
>
> oldFile = File.new(oldFilename, "r")
> newFile = File.new(newFilename, "r")
> outFile = File.new(outFilename, "w")
>
> oldLine = oldFile.gets
> newLine = newFile.gets
>
> while ( true )
> {
>    oldKey = convertToKey(oldLine)
>    newKey = convertToKey(newLine)
>
>    if ( oldKey < newKey )
>    {
>      ## it is a deletion
>      outFile.puts oldLine + "," + "DELETE"
>      oldLine = oldFile.gets
>    }
>    elsif ( oldKey > newKey )
>    {
>      ## it is an insert
>      outFile.puts newLine + "," + "INSERT"
>      newLine = newFile.gets
>    }
>    else
>    {
>      ## compare
>      outFile.puts newLine + "," + "UPDATE" if ( oldLine != newLine )
>
>      oldLine = oldFile.gets
>      newLine = newFile.gets
>    }
> }
>
> Okay - I skipped the part where eof is reached for each file, but you get the point.
>
> If both old and new are in a database, you can open two database connections and just do the processing without dumping files.
>
> I journal about 130 million rows every day for a quant financial database...
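For completeness, a standalone sketch of the same sorted-merge with the skipped EOF handling filled in (hypothetical Java, assuming both extracts are sorted by primary key and the key is everything before the first comma):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Sorted-merge diff of two CSV extracts: old file, new file, output file.
public class CsvDiff {
    static String key(String line) {
        return line.substring(0, line.indexOf(','));
    }

    public static void main(String[] args) throws IOException {
        BufferedReader oldFile = new BufferedReader(new FileReader(args[0]));
        BufferedReader newFile = new BufferedReader(new FileReader(args[1]));
        PrintWriter out = new PrintWriter(args[2]);

        String oldLine = oldFile.readLine();
        String newLine = newFile.readLine();
        while (oldLine != null || newLine != null) {
            if (newLine == null) {         // old file has extra rows: deletes
                out.println(oldLine + ",DELETE");
                oldLine = oldFile.readLine();
            } else if (oldLine == null) {  // new file has extra rows: inserts
                out.println(newLine + ",INSERT");
                newLine = newFile.readLine();
            } else {
                int cmp = key(oldLine).compareTo(key(newLine));
                if (cmp < 0) {             // key vanished: a delete
                    out.println(oldLine + ",DELETE");
                    oldLine = oldFile.readLine();
                } else if (cmp > 0) {      // key appeared: an insert
                    out.println(newLine + ",INSERT");
                    newLine = newFile.readLine();
                } else {                   // same key: compare payloads
                    if (!oldLine.equals(newLine)) {
                        out.println(newLine + ",UPDATE");
                    }
                    oldLine = oldFile.readLine();
                    newLine = newFile.readLine();
                }
            }
        }
        oldFile.close();
        newFile.close();
        out.close();
    }
}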
RE: distcp skipping the file
Hi,

> The -update behavior is by design.

If I am right, -update is to overwrite the file at the destination if it is already there. But in this case it is overwriting the folder as a file at the destination, which seems to be a bug.

> Could you provide the command line, and the directory structure before
> and after issuing the copy? -C

The command is: hadoop distcp -update 'hftp://:50070/user//distcpsrc' distcp_dest

hadoop dfs -lsr distcpsrc
/user//distcpsrc/1        2008-07-24 05:53
/user//distcpsrc/1/t    4 2008-07-22 06:12

hadoop dfs -lsr distcp_dest
/user//distcp_dest/1    4 2008-07-24 06:03  << expected /user//distcp_dest/1/t; the file is copied as '1' instead of '1/t'

If I run without '-update', the destination dir is:

hadoop dfs -lsr distcp_dest_noupdate
/user//distcp_dest_noupdate/1    2008-07-24 06:08  << file 't' is not copied and '1' is a directory

Thanks, Murali

> On Jul 22, 2008, at 9:46 PM, Murali Krishna wrote:
> > Hi,
> > I am using 0.15.3 and the destination is empty. One more behavior that I am seeing is that if I pass the '-update' option, it writes the content of file '2' into folder 1 (it makes the folder '1' a file in the destination). So it looks like it is treating the destination for file distcpsrc/1/2 as distcpdest/1.
> >
> > Thanks,
> > Murali
> >
> >> -Original Message-
> >> From: Chris Douglas [mailto:[EMAIL PROTECTED]]
> >> Sent: Wednesday, July 23, 2008 1:13 AM
> >> To: core-user@hadoop.apache.org
> >> Subject: Re: distcp skipping the file
> >>
> >> There were many fixes and improvements to distcp in 0.16, but most of the critical fixes made it into 0.15.2 and 0.15.3. Is the destination empty? Anything already existing at the destination is skipped. -C
> >>
> >> On Jul 22, 2008, at 4:39 AM, Murali Krishna wrote:
> >>
> >>> Hi,
> >>>
> >>> My source folder has a single folder and a single file inside that.
> >>>
> >>> /user//distcpsrc/1/2    4 2008-07-22 04:22
> >>>
> >>> In the destination, it is creating the folder '1' but not the file '2'.
> >>>
> >>> The counters show 1 file has been skipped.
> >>>
> >>> 08/07/22 04:22:36 INFO mapred.JobClient: Files skipped=1
> >>>
> >>> If I create one more file in any directory under the distcpsrc folder, it copies both the files properly. Is this a bug?
> >>>
> >>> [I am using 15.3]
> >>>
> >>> Thanks,
> >>> Murali
hadoop 0.17.1 reducer not fetching map output problem
Hi!

I'm experiencing hung reducers, with the following symptoms:

> Task Logs: 'task_200807230647_0008_r_09_1'
>
> stdout logs
>
> stderr logs
>
> syslog logs
>
> red.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:11,064 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 2008-07-24 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Need 6 map output(s)
> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Got 0 known map output location(s); scheduling...
> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_09_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)

Notice how it needs 6 map outputs, all map tasks have finished, and it still just hangs there. The second speculative copy of that reducer task needs 14 map outputs, with the same messages :(

Other observations: killing the reduce tasks via job -killtask ends up with the task being restarted on the same node, and curiously the new task gets jammed at the same position (6/14 maps needed). The only remedy to this problem seems to be a complete restart of the cluster and reprocessing. That gets really boring with jobs that took a day to process in the first place :(

Andreas